Displaying posts with tag: Data Integration
Handling errors

In the next milestone build of Pentaho Data Integration (2.4.1-M1, expected around February 19th) we will be introducing advanced error-handling features.
We looked hard to find the easiest and most flexible way to implement this, and I think we have found a good solution.

Here is an example:

The transformation above works as follows: it generates a sequence between -1000 and 1000.  The table is a MySQL table with a single “id” column defined as TINYINT.  As you all know, that data type only accepts values between -128 and 127.

So what this transformation does is insert 256 rows into the table and divert all the others to a text file, our “error bucket”.
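
Since the screenshot isn't reproduced here, a minimal plain-Python sketch of the idea may help; it is not Kettle itself, and the file name is made up, but it shows the same split: rows that fit a TINYINT go to the “table”, the rest are diverted to the error bucket.

    # Hypothetical sketch of the error-handling split described above (not Kettle code).
    TINYINT_MIN, TINYINT_MAX = -128, 127

    accepted = []                                      # stands in for the MySQL table
    with open("error_bucket.txt", "w") as errors:      # stands in for the text-file step
        for id_value in range(-1000, 1001):            # the generated sequence
            if TINYINT_MIN <= id_value <= TINYINT_MAX:
                accepted.append(id_value)              # row fits, goes to the table
            else:
                errors.write(f"{id_value}\n")          # row diverted to the error bucket

    print(len(accepted))                               # 256 rows accepted, 1745 diverted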

How can we configure this new feature?  Simple: right-click on the step where you want the error handling to take place, in this case, …

[Read more]
Pentaho metadata

Today Pentaho released the first milestone of “Pentaho metadata” as a new core utility to make the life of BI professionals yet a little bit easier.

As you can see in the screenshot of the Pentaho metadata editor above, we offer a solution that bridges the gap between the worlds of the relational databases and business information. Here are a few of the key points that differentiate Pentaho metadata from the competition:

  • Completely open
    • Released under an open source license (Mozilla Public License v1.0)
    • Persisted in the Common Warehouse Metamodel, a recognised industry standard, potentially allowing for easier interoperability with 3rd-party tools
[Read more]
Database partitioning

We’ve been experimenting lately with database partitioning (in version 2.3.2-dev, make sure to update your kettle.jar to the latest snapshot). In our context, database partitioning means that we divide our data over a cluster of several databases.

A typical way of doing that is to divide the customer_id by the number of hosts in the cluster and take the remainder. If the remainder is 0, you store the data on the first host in the cluster; if it's 1, on the second; 2, the third; and so on.
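
As a small illustration, here is what that remainder rule looks like in a few lines of Python; the host names and cluster size are made up for the example, and this is not Kettle's partitioning code.

    # Hypothetical sketch of remainder-based partitioning: customer_id modulo the
    # number of hosts picks the database that stores the row.
    hosts = ["db-host-0", "db-host-1", "db-host-2"]    # a 3-node cluster, for example

    def pick_host(customer_id):
        return hosts[customer_id % len(hosts)]

    for customer_id in (100, 101, 102, 103):
        print(customer_id, "->", pick_host(customer_id))
    # 100 -> db-host-1, 101 -> db-host-2, 102 -> db-host-0, 103 -> db-host-1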

This sort of thing is something that we’ve been implementing in Kettle for the last couple of weeks. The reasoning is simple: if one database is not up to the task, split the load over 2 or 5 or 10 databases on any number of hosts.  (Now imagine all the PCs at work running an in-memory database.)
Besides small changes to the Kettle transformation …

[Read more]
Kettle webinar at MySQL!

Hi Kettle fans,

The 2.3.1 release has been dropped! These kinds of things are always a bit troublesome because of the testing I need to do to get it out, the documentation updates, etc, etc. It’s also the right time for me to do bug administration: clean up old stuff, etc. This is not the most enjoyable type of activity and I’m glad it’s over and done with.

Anyway, it’s about time we did something enjoyable again: a webinar! Yes, we have been invited to do a MySQL webinar next Tuesday, at 10:00 am Pacific, 1:00 pm Eastern, or 17:00 GMT for the people living in my timezone. The presentation will be approximately 45 minutes long, followed by Q&A.

I think that this is a great …

[Read more]
Key-value madness

People that write data integration solutions often have a tough job at hand. I can tell because I get to see all the questions and bugs that get reported.

That is the main reason I committed a first collection of 30 examples to the codebase to be included in the next GA release (2.3.1 will be released next Friday, more on that later).

Today I would like to talk about one of the more interesting examples on the de-normaliser step. What the de-normaliser step does is help you look up key-value pairs so that you can assign the looked-up value to a certain field. The step is a fairly recent addition to Pentaho Data Integration, but it gets a lot of attention. I guess that’s because key-value pairs are often used in situations where programmers need a very flexible way of storing data in a relational database.
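
To make the idea concrete, here is a rough Python sketch of what de-normalising key-value pairs amounts to; the field names are invented for illustration and this is not the step's actual implementation.

    # Hypothetical sketch: pivot key-value rows into one row per customer,
    # with each key becoming a field that holds the looked-up value.
    from collections import defaultdict

    key_value_rows = [
        (1, "first_name", "Jane"),
        (1, "last_name",  "Doe"),
        (1, "zip_code",   "1000"),
        (2, "first_name", "John"),
        (2, "zip_code",   "2000"),
    ]

    denormalised = defaultdict(dict)
    for customer_id, key, value in key_value_rows:
        denormalised[customer_id][key] = value

    print(dict(denormalised))
    # {1: {'first_name': 'Jane', 'last_name': 'Doe', 'zip_code': '1000'},
    #  2: {'first_name': 'John', 'zip_code': '2000'}}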

The …

[Read more]
Simpler reporting : make your data richer

A lot of the time, I hear discussions about which reporting tool is the easiest to use for certain special tasks. Most of the time, I just ignore these “threads” because it’s not my cup of tea as a developer of ETL solutions.
However, it has to be said, often the solution to complex reporting requirements is to be found in ETL.
When you find yourself struggling with complex reports that need any of the following:

  • compare different records
  • aggregate beyond simple sums and averages
  • report on non-existing records (report 0 sales, etc.)

Well, in those cases you need ETL.

Let’s take for example the case of reporting on non-existing sales: how can you report that there were 0 sales for a certain product during a certain week? Well, you can create an aggregate table in your ETL that contains the following:

  • Dimensions
[Read more]
Take back control

The type of question that we get on the forums with regard to Pentaho Data Integration (Kettle) has been shifting lately from this type of question:

How do I read data from database type xxx

to this type of question:

I want to read a list of e-mail addresses from a database table, set a variable and send the warehouse log files off to all these people.

That’s quite an evolution. It’s clear that people are starting to find the solutions to the first type of question on their own, so now they get stuck on the more complex things. I guess that’s to be expected, really. It’s nice that, for the most part, I can now say, "yes, with the new 2.3.0 release, that is most certainly possible".
However, IMHO, there is often something missing on the implementers’ side of the story as well. I guess what I’m saying is that all too often these …

[Read more]