Showing entries 1 to 5
Displaying posts with tag: data quality
Data Masking 101

I continue to dig up and share this simple approach for production data masking via SQL to create testing data sets. Time to codify it into a post.

Rather than generating a set of names and data from tools such as Mockaroo, it is more practical to use actual data for a variety of testing reasons.

The SQL below is a self-explanatory approach to removing Personally Identifiable Information (PII) while keeping the data relevant. I use this approach for a number of reasons.

  • We are using production data rather than synthetic data, so data volume, distribution, and additional column values are realistic. This example is only a subset, but dates and locations remain realistic.
  • Indexes (and unique indexes) still work, and distribution across the index is adequate for searching. Technically the index …
[Read more]
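The post's own SQL is not included in this excerpt. As a rough illustration of the idea only, here is a minimal sketch using Python's built-in sqlite3 module. The `customer` table, its columns, the `mask` helper, and the `SALT` value are all hypothetical and not taken from the post. PII columns are replaced in place with deterministic tokens, while non-PII columns such as city and date are left untouched.

```python
import hashlib
import sqlite3

SALT = "change-me"  # hypothetical secret; keep it out of the test environment


def mask(value, prefix):
    """Replace a PII value with a deterministic, non-reversible token."""
    if value is None:
        return None
    digest = hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


conn = sqlite3.connect(":memory:")
# Expose the Python helper to SQL so the masking runs as a plain UPDATE.
conn.create_function("mask", 2, mask)

# Hypothetical table standing in for a copy of production data.
conn.executescript("""
CREATE TABLE customer (
    id         INTEGER PRIMARY KEY,
    first_name TEXT,
    last_name  TEXT,
    email      TEXT UNIQUE,
    city       TEXT,
    created_at TEXT
);
INSERT INTO customer VALUES
    (1, 'Ann', 'Lee', 'ann@corp.test', 'Boston', '2019-03-01'),
    (2, 'Bob', 'Lee', 'bob@corp.test', 'Austin', '2020-07-15');
""")

# Mask only the PII columns; city and created_at keep their real values.
conn.execute("""
UPDATE customer
SET first_name = mask(first_name, 'fn'),
    last_name  = mask(last_name, 'ln'),
    email      = mask(email, 'em') || '@example.com'
""")
conn.commit()

rows = conn.execute(
    "SELECT id, first_name, last_name, email, city, created_at "
    "FROM customer ORDER BY id"
).fetchall()
for row in rows:
    print(row)
```

Because the token is derived from the original value, equal inputs map to equal outputs and distinct inputs stay distinct, so the unique index on email still holds and repeated surnames still repeat. An unsalted hash of a name is easy to reverse by dictionary lookup, which is why the sketch mixes in a secret salt.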
Data Cleaner 2

Dear Kettle friends,

Some time ago while I visited the nice folks from Human Inference in Arnhem, I ran into Kasper Sørensen, the lead developer of DataCleaner.

DataCleaner is an open source data quality tool released (like Kettle) under the LGPL license.  It is essentially to blame for the lack of a profiling tool inside of Kettle.  That is because having DataCleaner available to our users was enough to push the priority of having our own data profiling tool far enough down.

Kasper worked on DataCleaner pretty much in his spare time in the past.  Now that Human Inference has taken over the project, I was expecting more frequent updates and that’s what we …

[Read more]
Part 2: Comparing Numerics in Pentaho Data Integration

As a follow-up to my previous post about comparing numeric values, I've since discovered a little more about the problem. To repeat my original problem: certain numeric field values that should be equal are being detected as different in the Filter rows step. I think it's important to be able to perform accurate comparisons, since this is a frequent task in data quality analysis.

Originally, I assumed this had something to do with JDBC. However, since I can reproduce the issue without any SQL, I'm sure this has nothing to do with the version of the MySQL Connector/J JDBC driver. I tried the 5.0.8 version of the driver and observed the same behavior. I couldn't even get my transform to work correctly with the 5.1.12 version of the connector -- it does not recognize column aliases in my SQL query.

Now for the rest of the …

[Read more]
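The excerpt stops before naming a root cause, so the following is not the author's diagnosis. As a general illustration of how values that "should be equal" can compare as different, here is a small Python sketch that assumes binary floating point is involved (an assumption, not something the post confirms). It contrasts an exact comparison with a tolerance-based one and with decimal arithmetic.

```python
import math
from decimal import Decimal

# Two values that are equal on paper but not as binary doubles.
a = 0.1 + 0.2
b = 0.3

# Exact comparison of doubles: fails because of representation error.
exact_equal = (a == b)

# Tolerance-based comparison: treats tiny differences as equal.
close_equal = math.isclose(a, b, rel_tol=1e-9)

# Decimal arithmetic built from strings: the comparison is exact.
decimal_equal = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))

print(exact_equal, close_equal, decimal_equal)
```

For data quality checks, comparing within a tolerance or using a fixed-precision decimal type are the two usual ways to avoid this kind of false mismatch.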
451 CAOS Links 2009.08.04

OIN offers cash for patents. CentOS crisis averted. Microsoft denies GPL violation. And more.

Follow 451 CAOS Links live @caostheory on Twitter and
“Tracking the open source news wires, so you don’t have to.”

# Open Invention Network offered individual inventors cash for patents, and acquired patents from V_Graph.

# The H Open reported that the management problems at CentOS are now resolved.

# Sam Ramji told Network World in detail why Microsoft believes its Linux IC code did not violate the GPL (from 15m 30s).

# Canonical delivered an on-premise version of …

[Read more]
SQL_MODE and MySQL Data Quality

As my former boss will attest, I have a reputation for being a bit of a data quality zealot. The storage of data that is unfit for use leads to many problems, but I suppose that’s another subject for another day.

It’s tough enough to manage data quality problems introduced by source code errors, system failures, and requirements misunderstandings. But a default installation of MySQL introduces a new and exciting way to give us data quality evangelists fits: it allows unfit data to be inserted into the database. That’s the bad news. The good news is that by making a simple configuration change you can prevent this, and override the setting when you don’t care.

In a default MySQL installation, the value of the SQL_MODE system variable is set to ‘’ (the empty string). This allows you to force inserts and updates that may violate the intended design of the table. This point is more philosophical than technical, but in a mission …

[Read more]