Showing entries 1 to 4
Displaying posts with tag: dataset (reset)
Random human recognizable dataset

We all do need sometimes to generate raw valid dummy data for our use cases and applications as we start them. Obviously, one can write their own scripts to generate random data, but it is much better to have data, to which human beings can associate with like names, addresses instead of having them filled with random "lorem ipsum" string data :)

While searching for such a tool, I found a site which does exactly this:


This can also be downloaded and installed locally. It supports three types of installations:
- A single, anonymous user account
- A single user account, requires login
- Multiple accounts

Below is the set of wide varied data types it supports for …

[Read more]
Advantages of weighted lists in RDBMS processing

A list is simply a list of things. The list has no structure, except in some cases, the length of the list may be known. The list may contain duplicate items. In the following example the number 1 is included twice.

Example list:


A set is similar to a list, but has the following differences:

  1. The size of the set is always known
  2. A set may not contain duplicates

You can convert a list to a set by creating a 'weighted list'. The weighted list includes a count column so that you can determine when an item in the list appears more than once:


Notice that there are two number 1 values in the weighted list. In order to make insertions into such a list scalable, consider using partitioning to avoid large indexes.

[Read more]
MySQL: Using Views as Performance Improvement Tools

The most basic and most oft-repeated task that a DBA has to accomplish is to look at slow logs and filter out queries that are suboptimal, that consume lots of unnecessary resources and that hence slow down the database server. This post looks at why and how VIEWs can help against such suboptimal operations.

Tools to generate large synthetic data sets for testing?

I need to generate large (1TB-3TB) synthetic MySQL datasets for testing, with a number of requirements:

a) custom output formatting (SQL, CSV, fixed-len row, etc)
b) referential integrity support (ie, child tables should reference PK values, no orphans,etc)
c) able to generate multiple tables in parallel
d) preferably able to operate without a GUI and/or manual intervention
e) uses a well defined templating construct for data generation
f) preferably open source

Does anyone out there know of a product that meets at least most of these requirements?

I found a PHP based data generation script ( that is extensible in its output formatting, so it should do everything I need it to do.

Showing entries 1 to 4