Home |  MySQL Buzz |  FAQ |  Feeds |  Submit your blog feed |  Feedback |  Archive |  Aggregate feed RSS 2.0 English Deutsch Español Français Italiano 日本語 Русский Português 中文
Previous 30 Newer Entries Showing entries 61 to 87

Displaying posts with tag: Data Integration (reset)

Step performance graphs
+0 Vote Up -0Vote Down

One of the things I’ve been working on lately in Kettle / Pentaho Data Integration is the transparency of the performance monitoring.

We don’t just need an API to get the step performance data out, but we also need to visualize this data in a simple way, something like this:

The next steps will be to also allow this data to be spooled off to a database somewhere and to be accessed remotely using Carte.

Until next time,


Ohloh top 10
+0 Vote Up -0Vote Down

People sometimes ask me if I still do a lot of development.

Well, Ohloh keeps track of that these days and it seems that between September and November 2007 I was the 7th most active contributor:

Ohloh tracks 90655 developers in 8985 projects including Firefox, Apache HTTP server, Subversion, MySQL, PHP, Open Office, the Linux kernel, Ubuntu and many more.  As such, I’m kinda proud of that 7th spot.

If version 3 of Pentaho Data Integration has any bugs left when it launches, it won’t be because I was having a vacation

Until next time,


4.3 million rows per second
+0 Vote Up -0Vote Down

Earlier today I was building a test-case in which I wanted to put a lot of Unicode data into a database table. The problem is of-course that I don’t have a lot of data, just a small Excel input file.

So I made a Cartesian product with a couple of empty row generators:

It was interesting to see how fast the second join step was generating rows:

Yes, you are reading that correctly: 717 million rows processed in 165 seconds = 4.3 million rows per second.

For those of you that would love to try this on their own machine. Here is an exclusive present for the readers of this blog in the form of a 3.0.0-RC2 preview of 2007/10/12 (88MB zip file). We’ve been fixing bugs like crazy so

  [Read more...]
Pentaho reference case
+0 Vote Up -0Vote Down

Thought I’d mention that a new case study featuring Pentaho and Kettle showed up over at the Database Trends and Applications. The name of the paper is called “Loma Linda University Health Care Deploys Pentaho BI” (PDF).

To quote :

With commercial products you don’t know if you are getting what you want, but with open source you can create proofs-of-concept. And the TCO is so much lower.

Until next time!


Kettle 3 RC1
+0 Vote Up -0Vote Down

Dear Kettle fans,

Again, we leave a very busy period behind us (to start another :-)) with this announcement of this first release candidate for version 3.0.0.

Here is a link to the binary zip file and here is the source code.

What has changed since version 3.0.0-M2?

  • A new debugger (see also my blog entry on the subject)
  • Remote execution of jobs. (see also this wiki page)
  • Toolbar New Job/Trans change
  • Faster variable insertion through CTRL-SPACE
  • JavaScript enhancements for 3.0: (see also
  [Read more...]
Help OpenMRS!!!
+0 Vote Up -0Vote Down

My friend and colleague Julian Hyde of Mondrian fame just blogged about this: help out the OpenMRS project , please!

The folks behind the OpenMRS are helping to improve the health-care systems in developing countries. More in particular, they are fighting AIDS with this software. OpenMRS has certainly shown to be up to the task at hand: it is currently tracking the medical conditions of over a million people in 12 countries.

Because of the exponential growth of users, this project is in urgent need of BI manpower. Julian and myself have both agreed to help out with strategical advice for the BI part of OpenMRS.

If you want to be part of the team, if you know a bit

  [Read more...]
Back to basics
+0 Vote Up -0Vote Down

A few days ago someone made the comment that Pentaho Data Integration (Kettle) was a bit too hard to use. The person on the chat was someone that tried to load a text file into a database table and he was having a hard time doing just that.

So let’s go back to basics in this blog post and load a delimited text file into a MySQL table.

If you want to see how it’s done, click on this link to watch a real-time (non-edited) flash movie. It’s 11MB to download and is about 2-3 minutes long.

Until next time!


Kettle 3 Milestone 2 is available
+0 Vote Up -0Vote Down

UPDATE: for all you people that missed the news and come here directly, we have an RC1 now too.

Dear Kettle fans,

After a long period of bug-squashing and other frantic coding activities, we are happy to give you Kettle’s second milestone of version 3.0.0. (77MB zip file)

What has changed since M1?

  • New icons!! This is the first release to include a new set of icons and as such a fresh new look.
  • A new Mondrian Input step to read from Pentaho Analyses using MDX.
  • A new Regular Expression evaluation step
  • Access Input (don’t ask!)
  • Fixed / improved repository support
  • Improved database dialect handling (SQL Server .. problem and forcing identifiers to lower/uppercase)
  • Vast improvements to the
  [Read more...]
Making the case for Kettle
+0 Vote Up -0Vote Down

Dear data integration fans,

Once in a while, there are discussions on various blogs (usually with me smack in the middle of it) debating the differences between code generation and model based execution, how this impacts the way we approach databases, the open nature of it all, etc.

With this blog entry I want to push the notion that Pentaho Data Integration (Kettle) didn’t just evolve by chance into the state it is today as a streaming, metadata driven, model based engine. I made some careful design choices early on…

Open as possible

The goal of Kettle from the beginning was to be as open as possible. My definition of “as open as possible” included:

  [Read more...]
Digging Mondrian
+0 Vote Up -0Vote Down

On Friday I committed code to 3.0 trunk to allow people to execute an MDX query on a Mondrian server and get the result back in a tabular format. This particular code to “flatten” an OLAP cube into rows was written by Julian Hyde, the lead developer and founder of Mondrian OLAP a.k.a. Pentaho analyses.

If you run the Pentaho demo on your box and then look at the Analyses sample, you could see something like this:

Suppose you wanted to get this exact data to work with, create analytical data, exports, … Well, now you have the option of doing it in Kettle:

What you do is create a database connection to the database that Mondrian reads from, hand it the location of the Mondrian schema (catalog). You

  [Read more...]
Being lazy
+0 Vote Up -0Vote Down

Dear Kettle fan,
Since our code is open, we have to be honest: in the past, the performance of Kettle was less than stellar in the “Text File” department. It’s true that we did offer some workarounds with respect to database loading, but there are cases when people don’t want to touch any database at all. Let’s take a closer look at that specific problem…

Reading and writing text files…
Let’s take a look at this delimited (CSV) file (28MB). Unzipped, the file is around 89MB in size.

Suppose you read this file using version 2.5.1 (soon to be out) with a single “Text File Input” step. On my machine, that process consumes most of the available CPU power it can take and takes around 57 seconds to complete. (1M rows/minute or 60M

  [Read more...]
Clustering & partitioning
+0 Vote Up -0Vote Down

Let’s have a quick look at some clustering examples in the new 3.0 engine:

This example runs all steps in the transformation in a clustered mode. That means that there are 4 slave transformations that run in parallel with each other.
The interesting part is that first of all the “Fixed Input” step is running in parallel, each copy reading a certain part of a file.
The second thing to mention about it is that we now allow you to run multiple copies of a single step on a cluster. In this example, we run 3 copies of a step per slave transformation. In total there are 12 copies of the sort step in action in parallel.

IMPORTANT: this is a test-transformation, a real world sorting exercise would also include a “Sorted Merge” step to keep

  [Read more...]
PDI 3.0 : first milestone available
+0 Vote Up -0Vote Down

Dear Kettle fan,

While this first milestone release of Kettle version 3 is absolutely NOT YET READY FOR PRODUCTION, it’s a nice way to see the speed of our new architecture for yourself.
Version 3.0 is a complete refactoring of the complete Kettle code base and as such it will take a while for things to settle down again.
That being said, we have a number of tests that tell us this might be a good time to tell the world we’re still very much alive.

As noted above, this release focuses on performance.  Version 3.0 was reworked to completely separate data and metadata.  This has led to significant performance gains across the board.  At the same time we expect all your old transformations to run unchanged.  (if not, it’s a bug)

Get your new software fix over here:

  [Read more...]
Getting ready for MySQL Santa Clara
+0 Vote Up -0Vote Down

Few!  After a long trip (10 hours flight) I’m spending all the time left on preparations for my talks on Tuesday and Wednesday.

Hard work.  You know, like going to a basball game, watching the San Francisco Giants beat the Arizona Diamondbacks with a nice homerun by Barry Bonds.

Zito, the other Barry, pitched a really nice game to help win the game with 1-0.
Aside from all that fun, from an geek viewpoint, I think the video screens in the ballpark are simply awesome.

Until next time,


Handling 500M rows
+0 Vote Up -0Vote Down

We’ve been doing some tests with medium sized data sets lately.  We extracted around half a year of data (514M rows) from a warehouse where we’re doing a database partitioning and clustering test.
Below is an example where we copy +500M rows from one database to another one that is partitioned. (MS SQL Server to MySQL 5.1).  This is done using the following transformation.  In stead of just using one partitioned writer, we used 3 to speed up the process. (lowers latency).

Copying 500M rows is just as easy as copying a thousand, it just takes a little longer…

It would have completed the task a lot faster if we wouldn’t have been copying to a single table on DB4 at the same time. (yep, again 500M rows) This slowed down the

  [Read more...]
Meet me at MySQL Santa Clara
+0 Vote Up -0Vote Down

Dear Kettle fan,

Next month, the MySQL Conference & Expo 2007 takes place in Santa Clara. I was invited to do 2 talks over there:

For a complete overview of all the great sessions that will take place, go here.

Feel free to join us for a Birds of a feather session on Tuesday evening, right after my talk. Joining me there are Julian Hyde (Mondrian founder & lead developer)

  [Read more...]
Good old file handling
+0 Vote Up -0Vote Down

In a heavily webbed, automated, interconnected world with most data stored on relational databases, we can sometimes forget that there are indeed many situation where you simply want to FTP a file from one place to another.

That process in itself holds many dangers as I pointed out to someone on the forum today.  Let me re-cap that post here on the blog…

Suppose your files are coming in using FTP to a local directory.

A file is being written, let’s call it FILE_20070328.txt.
Now, in advance you don’t know the size of that file. Let’s say it’s 10MB and takes 30 seconds to FTP.
In your transformation you detect this file and start working. Chances are very high that you’ll be reading an incomplete file.  (See also this technical tip on variables and file

  [Read more...]
MySQL Bulk export to file
+0 Vote Up -0Vote Down

A few days ago I had some good news on the new MySQL Bulk loader for which we added support in Kettle.

Today French i18n/translation hero Samatar checked in the code for the exact oposite, built on the “SELECT … INTO OUTFILE …” statements that MySQL supports.

As you can see, this job entry allows you to export the content of a MySQL database table to a flat file. Again, this is done completely by MySQL and therefor works at optimal speed (really fast!)
We added all kinds of helpers in the GUI so that you can easily select the table and the columns to export. All you need to do is give it a filename a separator and off you go! Not only that, you can use variables to specify almost all parameters of the job

  [Read more...]
A nice chat
+0 Vote Up -0Vote Down

Earlier today I had a nice IM chat with someone.  He or she is referred to below as Question and I’m Answer.  There where interesting questions and perhaps others find the answers interesting as well.  I seemed a shame to let the information in the chat log go to waste, so I’m posting it here on my blog.
Question: I have a qestion for you about the possibility of creating custom transformations.
Answer: sure
Question: my company already has quite a buit of business logice that is coded in C and/or C++ and this logic then calls some Corba services.  Would it be possible for use to integrate that logic into Kettle?
Answer: Not directly, however, it’s not that hard to write wrappers in JNI (java native interface) and create a plugin for it.
Question: That was my idea.  I was just wondering if there

  [Read more...]
MySQL bulk load
+0 Vote Up -0Vote Down

Pretty Sick Slick

The last week I was under the weather and a year ago that would have meant that development of Pentaho Data Integration (PDI) would pretty much stop. These days I’m happy to say that this is absolutely not true anymore. In fact, hundreds of commits where done in the last week.

MySQL bulk load

To pick one example, Samatar Hassan added a job entry that allows you to configure a MySQL Bulk load job entry:

This job entry loads data as fast as possible into a MySQL database by using the LOAD DATA SQL command. It’s not as flexible as the Text File Input step, but it sure is fast. In certain

  [Read more...]
Handling errors
+0 Vote Up -0Vote Down

In the next milestone build of Pentaho Data Integration (2.4.1-M1) we will be introducing advanced error handling features. (2.4.1-M1 is expected around February 19th)
We looked hard to find the easiest and most flexible way to implement this, and I think we have found a good solution.

Here is an example:

The transformation above works as follows: it generates a sequence between -1000 and 1000.  The table is a MySQL table with a single “id” column defined as TINYINT.  As you all know, that data type only accepts values between -128 and 127.

So what this transformation does is, it insert 256 rows into the table and divert all the others to a text file, our “error bucket”.

How can we configure this new

  [Read more...]
Pentaho metadata
+0 Vote Up -0Vote Down

Today Pentaho released the first milestone of “Pentaho metadata” as a new core utility to make the life of BI professionals yet a little bit easier.

As you can see in the screenshot of the Pentaho metadata editor above, we offer a solution that bridges the gap between the worlds of the relational databases and business information. Here are a few of the key points that differentiate Pentaho metadata from the competition:

  • Completely open
    • Released under an open source license (Mozilla Public License v1.0)
    • Persisting in the Common Warehouse Metamode, a recognised industry standard, potentially allowing for easier interoperability with
  [Read more...]
Database partitioning
+0 Vote Up -0Vote Down

We’ve been experimenting lately with database partitioning (in version 2.3.2-dev, make sure to update your kettle.jar to the latest snapshot). In our context, database partitioning means that we divide our data over a cluster of several databases.

A typical way of doing that is that you divide the customer_id by the number of hosts in the cluster and get the remainder. If the remainder is 0, you store the data on the first host in the cluster, 1 for the second, 2 for the third, etc.

This sort of thing is something that we’ve been implementing in Kettle for the last couple of weeks. The reasoning is simple: if one database is not up to the task, split the load over 2 or 5 or 10 databases on any amount of hosts.  ( Now imagine all the PCs at work all running an in-memory database

  [Read more...]
Kettle webinar at MySQL!
+0 Vote Up -0Vote Down

Hi Kettle fans,

The 2.3.1 release has been dropped! These kinds of things are always a bit troublesome because of the testing I need to do to get it out, the documentation updates, etc, etc. It’s also the right time for me to do bug administration: clean up old stuff, etc. This is not the most enjoyable type of activity and I’m glad it’s over and done with.

Anyway, it’s about time we did something enjoyable again: a webinar! Yes, we have been invited to do a MySQL Webinar (http://www.mysql.com/news-and-events/web-seminars/etl-using-pentaho-kettle.php) next Tuesday. At 10:00 am Pacific, 1:00 pm Eastern or 17:00 GMT for the people living in my timezone. The presentation will be approximately 45 minutes long followed by Q&A.

I think that this is

  [Read more...]
Key-value madness
+0 Vote Up -0Vote Down

People that write data integration solutions often have a tough job at hand. I can tell because I get to see all the questions and bugs that get reported.

That is the main reason I committed a first collection of 30 examples to the codebase to be included in the next GA release (2.3.1 will be released next Friday, more on that later).

Today I would like to talk about one of the more interesting examples on the de-normaliser step. What the de-normaliser step does is help you out with the lookup of key-value pairs so that you can attribute the value after lookup to a certain field. The step is a fairly recent addition to Pentaho Data Integration, but gets a lot of attention. I guess that’s because the use of a key-value pair system is often used in situations where programmers need a very flexible way of storing data in a

  [Read more...]
Simpler reporting : make your data richer
+0 Vote Up -0Vote Down

A lot of time, I hear discussions about which reporting tool is the easiest to use for certain special tasks. Most of the time, I just ignore these “threads” because it’s not my cup of tea as a developer of ETL solutions.
However, it has to be said, often the solution to complex reporting requirements is to be found in ETL.
When you find yourself struggling with complex reports that need any of the following:

  • compare different records
  • aggregate beyond simple sums and averages.
  • report on a non-existing records (report 0 sales, etc)

Well, in those cases you need ETL.

Let’s take for example the case of the reporting on non-existing sales: how can you report that there has been 0 sales for a certain product during a certain week? Well, you can create an aggregate table in your ETL that contains the following:

      [Read more...]
    Take back control
    +0 Vote Up -0Vote Down

    The type of questions that we get on the forums with regards to Pentaho Data Integration (Kettle) has been shifting lately from this type of question:

    How do I read data from database type xxx

    going to this type of questions:

    I want to read a list of e-mail addresses from a database table, set a variable and send the warehouse log files off to all these people.

    That’s quite an evolution that’s been going on. It’s obvious that people are starting to find the obvious solutions to the first type of questions so now they just get stuck on doing more complex things. I guess that’s to be expected, really. It’s nice that for the most part, I can now say, “yes, with the new 2.3.0 release, that is most certainly possible”
    However, IMHO, often there is something missing at the implementers side of the story as

      [Read more...]
    Previous 30 Newer Entries Showing entries 61 to 87

    Planet MySQL © 1995, 2014, Oracle Corporation and/or its affiliates   Legal Policies | Your Privacy Rights | Terms of Use

    Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.