Displaying posts with tag: Data Integration
Being lazy

Dear Kettle fan,
Since our code is open, we have to be honest: in the past, the performance of Kettle was less than stellar in the “Text File” department. It’s true that we did offer some workarounds with respect to database loading, but there are cases when people don’t want to touch any database at all. Let’s take a closer look at that specific problem…

Reading and writing text files…
Let’s take a look at this delimited (CSV) file (28MB). Unzipped, the file is around 89MB in size.

Suppose you read this file using version 2.5.1 (soon to be out) with a single “Text File Input” step. On my machine, that process consumes most of the available CPU power and takes around 57 seconds to complete (roughly 1M rows/minute, or 60M rows/hour).
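As a rough point of reference, a throughput figure like that can be reproduced with a few lines of plain Java: read the file line by line, split each line on the delimiter, and divide the row count by the elapsed time. This is a minimal sketch, not Kettle code, and the file name and delimiter are assumptions:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Minimal timing harness for reading a delimited file line by line.
// The file name and delimiter below are assumptions, not taken from the post.
public class CsvReadTiming {
    public static void main(String[] args) throws IOException {
        long start = System.currentTimeMillis();
        long rows = 0;
        try (BufferedReader reader = new BufferedReader(new FileReader("customers.csv"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(";");  // naive parse; real CSV parsers also handle quoting
                rows++;
            }
        }
        double seconds = (System.currentTimeMillis() - start) / 1000.0;
        System.out.printf("%d rows in %.1f s (%.0f rows/minute)%n",
                rows, seconds, rows / seconds * 60);
    }
}
```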

When we analyze what’s eating the CPU …

[Read more]
Clustering & partitioning

Let’s have a quick look at some clustering examples in the new 3.0 engine:

This example runs all the steps in the transformation in clustered mode. That means there are 4 slave transformations running in parallel with each other.
The interesting part is, first of all, that the “Fixed Input” step runs in parallel, with each copy reading a certain part of the file.
The second thing worth mentioning is that we now allow you to run multiple copies of a single step on a cluster. In this example, we run 3 copies of the sort step per slave transformation, so in total there are 12 copies of the sort step working in parallel.

IMPORTANT: this is a test transformation; a real-world sorting exercise would also include a “Sorted Merge” step to keep the data sorted. I was too lazy to redo the screenshots, though.
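To make the “Sorted Merge” remark a bit more concrete, here is a conceptual Java sketch (not Kettle code) of the overall pattern: several copies each sort their own chunk of the data in parallel, and a k-way merge afterwards turns the sorted chunks back into one sorted stream. The chunk sizes and thread counts are arbitrary:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Random;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Conceptual sketch only, not Kettle code: sort chunks of data in parallel,
// then merge the sorted chunks back into a single sorted stream, which is
// the role a "Sorted Merge" step plays after clustered/parallel sorting.
public class ParallelSortSketch {
    public static void main(String[] args) throws Exception {
        // Fake input: 12 chunks, mirroring the 12 sort copies in the example.
        List<List<Integer>> chunks = new ArrayList<>();
        Random random = new Random(42);
        for (int i = 0; i < 12; i++) {
            List<Integer> chunk = new ArrayList<>();
            for (int j = 0; j < 1000; j++) chunk.add(random.nextInt(100_000));
            chunks.add(chunk);
        }

        // Phase 1: sort every chunk in parallel (the clustered sort copies).
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<List<Integer>>> futures = new ArrayList<>();
        for (List<Integer> chunk : chunks) {
            futures.add(pool.submit(() -> { Collections.sort(chunk); return chunk; }));
        }
        for (Future<List<Integer>> f : futures) f.get();   // wait for all sorts to finish
        pool.shutdown();

        // Phase 2: a k-way merge keeps the overall output sorted ("Sorted Merge").
        PriorityQueue<int[]> heap =                          // entries: {value, chunkIndex, position}
                new PriorityQueue<>(Comparator.comparingInt((int[] a) -> a[0]));
        for (int c = 0; c < chunks.size(); c++) {
            heap.add(new int[]{chunks.get(c).get(0), c, 0});
        }
        List<Integer> merged = new ArrayList<>();
        while (!heap.isEmpty()) {
            int[] head = heap.poll();
            merged.add(head[0]);
            int next = head[2] + 1;
            List<Integer> chunk = chunks.get(head[1]);
            if (next < chunk.size()) heap.add(new int[]{chunk.get(next), head[1], next});
        }
        System.out.println("Merged " + merged.size() + " rows; first=" + merged.get(0)
                + ", last=" + merged.get(merged.size() - 1));
    }
}
```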

The clustered transformations now also support logging to …

[Read more]
PDI 3.0 : first milestone available

Dear Kettle fan,

While this first milestone release of Kettle version 3 is absolutely NOT YET READY FOR PRODUCTION, it’s a nice way to see the speed of our new architecture for yourself.
Version 3.0 is a complete refactoring of the entire Kettle code base, and as such it will take a while for things to settle down again.
That being said, we have a number of tests that tell us this might be a good time to tell the world we’re still very much alive.

As noted above, this release focuses on performance.  Version 3.0 was reworked to completely separate data and metadata.  This has led to significant performance gains across the board.  At the same time we expect all your old transformations to run unchanged.  (if not, it’s a bug)
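As a rough illustration of what separating data from metadata means, consider the conceptual sketch below (these are not the actual Kettle 3.0 classes): one shared object describes the row layout, and each row is just a bare array of values instead of an object that carries its own field descriptions around:

```java
import java.util.Arrays;
import java.util.List;

// Conceptual sketch of data/metadata separation (not the real Kettle 3.0 API):
// one RowMeta instance describes the fields, every row is a plain Object[].
public class RowMetaSketch {
    static class RowMeta {
        final List<String> fieldNames;
        RowMeta(String... names) { this.fieldNames = Arrays.asList(names); }
        int indexOf(String name) { return fieldNames.indexOf(name); }
        Object getValue(Object[] row, String name) { return row[indexOf(name)]; }
    }

    public static void main(String[] args) {
        RowMeta meta = new RowMeta("id", "name", "amount");   // shared by all rows
        Object[] row1 = {1L, "Alice", 10.5};                  // data only, no metadata attached
        Object[] row2 = {2L, "Bob", 20.0};
        System.out.println(meta.getValue(row1, "name"));      // -> Alice
        System.out.println(meta.getValue(row2, "amount"));    // -> 20.0
    }
}
```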

Get your new software fix over here: …

[Read more]
Getting ready for MySQL Santa Clara

Phew!  After a long trip (a 10-hour flight) I’m spending all the time left on preparations for my talks on Tuesday and Wednesday.

Hard work.  You know, like going to a baseball game and watching the San Francisco Giants beat the Arizona Diamondbacks with a nice home run by Barry Bonds.

Zito, the other Barry, pitched a really nice game to help win it 1-0.
Aside from all that fun, from a geek’s viewpoint, I think the video screens in the ballpark are simply awesome.

Until next time,

Matt

Handling 500M rows

We’ve been doing some tests with medium-sized data sets lately.  We extracted around half a year of data (514M rows) from a warehouse where we’re running a database partitioning and clustering test.
Below is an example where we copy 500M+ rows from one database to another one that is partitioned (MS SQL Server to MySQL 5.1).  This is done using the following transformation.  Instead of just using one partitioned writer, we used 3 to speed up the process (it lowers latency).

Copying 500M rows is just as easy as copying a thousand; it just takes a little longer…

It would have completed the task a lot faster if we hadn’t been copying to a single table on DB4 at the same time (yep, 500M rows again). This slowed the transformation down to the maximum speed of DB4.  That being said, if you still had any doubt about Pentaho Data Integration being able to copy large volumes of data, …
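If you want to picture what such a copy boils down to, here is a heavily simplified, single-threaded JDBC sketch with batched inserts. The real transformation runs three partitioned writer copies in parallel, and the connection URLs, table and column names below are made up for illustration:

```java
import java.sql.*;

// Simplified, single-threaded sketch of a table-to-table copy with batched
// inserts. All connection details, table and column names are illustrative.
public class TableCopySketch {
    public static void main(String[] args) throws SQLException {
        try (Connection src = DriverManager.getConnection(
                     "jdbc:sqlserver://source;databaseName=dwh", "user", "pass");
             Connection dst = DriverManager.getConnection(
                     "jdbc:mysql://target/dwh", "user", "pass")) {

            dst.setAutoCommit(false);
            try (Statement read = src.createStatement();
                 ResultSet rs = read.executeQuery("SELECT id, amount, sale_date FROM fact_sales");
                 PreparedStatement write = dst.prepareStatement(
                         "INSERT INTO fact_sales (id, amount, sale_date) VALUES (?, ?, ?)")) {

                int inBatch = 0;
                while (rs.next()) {
                    write.setLong(1, rs.getLong(1));
                    write.setBigDecimal(2, rs.getBigDecimal(2));
                    write.setDate(3, rs.getDate(3));
                    write.addBatch();
                    if (++inBatch == 5000) {          // flush in chunks to keep memory flat
                        write.executeBatch();
                        dst.commit();
                        inBatch = 0;
                    }
                }
                write.executeBatch();                 // flush the last partial batch
                dst.commit();
            }
        }
    }
}
```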

[Read more]
Meet me at MySQL Santa Clara

Dear Kettle fan,

Next month, the MySQL Conference & Expo 2007 takes place in Santa Clara. I was invited to do 2 talks over there:

For a complete overview of all the great sessions that will take place, go here.

Feel free to join us for a Birds of a feather session on Tuesday evening, right after my talk. Joining me there are …

[Read more]
Good old file handling

In a heavily webbed, automated, interconnected world with most data stored in relational databases, we can sometimes forget that there are indeed many situations where you simply want to FTP a file from one place to another.

That process in itself holds many dangers, as I pointed out to someone on the forum today.  Let me recap that post here on the blog…

Suppose your files are coming in using FTP to a local directory.

A file is being written, let’s call it FILE_20070328.txt.
You don’t know the size of that file in advance. Let’s say it’s 10MB and takes 30 seconds to transfer over FTP.
In your transformation you detect this file and start working on it. Chances are very high that you’ll be reading an incomplete file.  (See also this technical tip on variables and file handling)
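One common safeguard (not necessarily one of the two ways discussed in the full post) is to wait until the file size stops changing before you start processing. A minimal sketch:

```java
import java.io.File;

// Wait until the file size has been stable for one polling interval before
// treating the file as complete. The file name comes from the scenario above;
// the 10-second interval is an arbitrary choice.
public class WaitForStableFile {
    public static void main(String[] args) throws InterruptedException {
        File file = new File("FILE_20070328.txt");
        long previousSize = -1;
        while (true) {
            long currentSize = file.length();
            if (currentSize > 0 && currentSize == previousSize) {
                break;                        // size unchanged: assume the FTP transfer finished
            }
            previousSize = currentSize;
            Thread.sleep(10_000);             // re-check every 10 seconds
        }
        System.out.println("File looks complete: " + file.length() + " bytes");
    }
}
```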

There are 2 ways to …

[Read more]
MySQL Bulk export to file

A few days ago I had some good news on the new MySQL Bulk loader for which we added support in Kettle.

Today, French i18n/translation hero Samatar checked in the code for the exact opposite, built on the “SELECT … INTO OUTFILE …” statement that MySQL supports.

As you can see, this job entry allows you to export the contents of a MySQL database table to a flat file. Again, this is done entirely by MySQL and therefore works at optimal speed (really fast!).
We added all kinds of helpers to the GUI so that you can easily select the table and the columns to export. All you need to do is give it a filename and a separator and off you go! Not only that, you can use variables to specify almost all parameters of the job entry.
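The statement the job entry runs is of the SELECT … INTO OUTFILE family mentioned above. The sketch below shows a rough, hand-written equivalent executed over JDBC; the table, columns, file path and separator are illustrative, and keep in mind that MySQL writes the output file on the server, not on the client:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Rough, hand-written equivalent of a MySQL bulk export: the export itself is
// done entirely by MySQL. Table, columns, path and separator are illustrative.
public class BulkExportSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/test", "user", "pass");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "SELECT id, name, amount " +
                "  INTO OUTFILE '/tmp/customers.txt' " +          // written on the MySQL server
                "  FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '\"' " +
                "  LINES TERMINATED BY '\\n' " +
                "FROM customers");
        }
    }
}
```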

In short: another great option for those of us who work with MySQL …

[Read more]
A nice chat

Earlier today I had a nice IM chat with someone.  He or she is referred to below as Question, and I’m Answer.  There were interesting questions, and perhaps others will find the answers interesting as well.  It seemed a shame to let the information in the chat log go to waste, so I’m posting it here on my blog.
Question: I have a question for you about the possibility of creating custom transformations.
Answer: sure
Question: my company already has quite a bit of business logic that is coded in C and/or C++, and this logic then calls some CORBA services.  Would it be possible for us to integrate that logic into Kettle?
Answer: Not directly. However, it’s not that hard to write wrappers with JNI (Java Native Interface) and create a plugin for it.
Question: That was my idea.  I was just wondering if there were any other ways.
Answer: …
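For readers unfamiliar with JNI, this is roughly what such a wrapper looks like on the Java side; the class, method and library names are hypothetical and not taken from the chat or from Kettle:

```java
// Minimal sketch of a JNI wrapper class; the native method name and library
// are hypothetical, not something from the chat or from Kettle itself.
public class LegacyLogicWrapper {
    static {
        System.loadLibrary("legacylogic");   // loads liblegacylogic.so / legacylogic.dll
    }

    // Implemented in C/C++ and exposed through JNI; a Kettle step plugin
    // could call this for every row it processes.
    public native String process(String input);
}
```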

[Read more]
MySQL bulk load

Pretty Sick Slick

Last week I was under the weather, and a year ago that would have meant that development of Pentaho Data Integration (PDI) would pretty much stop. These days I’m happy to say that this is no longer true at all. In fact, hundreds of commits were made in the last week.

MySQL bulk load

To pick one example, Samatar Hassan added a job entry that allows you to configure a MySQL Bulk load job entry:

This job entry loads data as fast as possible into a MySQL database by using the LOAD DATA SQL command. It’s not as flexible as the Text File Input step, but it sure is fast. In certain cases, it might actually be up to ten times as fast. In short: …
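The statement behind such a job entry is a LOAD DATA command along the lines of the rough, hand-written JDBC equivalent below; the file path, table name and separators are illustrative only:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Rough, hand-written equivalent of a MySQL bulk load: the actual loading is
// done by MySQL itself. File path, table and separators are illustrative.
public class BulkLoadSketch {
    public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/test?allowLoadLocalInfile=true", "user", "pass");
             Statement stmt = conn.createStatement()) {
            stmt.execute(
                "LOAD DATA LOCAL INFILE '/tmp/customers.txt' " +
                "INTO TABLE customers " +
                "FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '\"' " +
                "LINES TERMINATED BY '\\n'");
        }
    }
}
```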

[Read more]