Planet MySQL

Displaying posts with tag: PDI

Nov

2011

Data Modeling

Posted by Matt Casters on Thu 03 Nov 2011 14:07 UTC
Tags:

Open Source, Data Integration, metadata, Kettle, PDI, Kimball, Multidimensional modeling

Dear data integration fans,

I’m a big fan of “appropriate” data modeling prior to doing any data integration work. For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style. Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt. However, the largest group of data warehouse architects uses a technique called dimensional modeling, championed by Ralph Kimball.

Using a modeling technique is very important since it brings structure to your data warehouse. …


Jul

2011

Real-time streaming data aggregation

Posted by Matt Casters on Thu 28 Jul 2011 08:54 UTC
Tags:

Open Source, Data Integration, Pentaho, streaming, Kettle, twitter, PDI, pentaho data integration, Real-time

Dear Kettle users,

Most of you use a data integration engine to process data in a batch-oriented way. Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, or hourly workloads. Sometimes folks run micro-batches of work every minute or so. However, it’s less well known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target. This sort of data integration is sometimes referred to as “streaming”, “real-time”, “near real-time”, “continuous” and so on. Typical examples of situations where you have a never-ending supply of data that needs to be processed the instant it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud analyses, web or application …


May

2011

Memory tuning fast paced ETL

Posted by Matt Casters on Tue 31 May 2011 19:12 UTC
Tags:

Data Integration, master, slave, ETL, Clustering, Kettle, memory, PDI, pentaho data integration, parallel, Never ending

Dear Kettle friends,

On occasion we need to support environments where not only does a lot of data need to be processed, but it also arrives in frequent batches. For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.

In this setting we want to use clustering to put “commodity” computing resources to work in parallel. In this blog post I’ll detail what the general architecture looks like and how to tune memory usage in this environment.

Clustering was first created around the end of 2006. Back then it looked like this.

The master

This is the most important part of our cluster. It takes care of administering the network configuration and topology. It also keeps track of the state of dynamically added slave servers.

The master …


Nov

2010

Kettle vs Oracle REF CURSOR

Posted by Matt Casters on Wed 17 Nov 2010 00:01 UTC
Tags:

Oracle, Databases, Data Integration, Pentaho, Kettle, PDI, PDI-200, REF Cursor

Dear Kettle fans,

PDI-200 has been out there for a while now. Jens created the feature request a little over 3 years ago. I guess the main thing blocking this issue was not so much a technical problem as a licensing and dependency one (Oracle JDBC dependency and distribution license).

However, now that we have the User Defined Java Class step we can work around those pesky problems. That is because the Java code in there only gets compiled and executed at runtime, so it’s perfectly fine to create any sort of dependency you like in there.

The following transformation reads a set of rows from a stored procedure as described on this web page.

In short, our UDJC step executes the following code:

begin ? := sp_get_stocks(?); …


Oct

2010

Pentaho Kettle Solutions Overview

Posted by Matt Casters on Fri 08 Oct 2010 15:57 UTC
Tags:

review, Data Integration, Book, Pentaho, Kettle, PDI, Pentaho Kettle Solutions

Dear Kettle friends,

As mentioned in my previous blog post, copies of our new book Pentaho Kettle Solutions are finally shipping. Roland, Jos and I worked really hard on it and, as you can probably imagine, we were really happy when we finally got the physical version of our book in our hands.

So let’s take a look at what’s in this book, what the concept behind it was and give you an overview of the content…

The concept

Given the fact that Maria’s book, called Pentaho Data Integration 3.2, …


May

2010

Book Review : Pentaho 3.2 Data Integration

Posted by Matt Casters on Thu 06 May 2010 18:21 UTC
Tags:

Open Source, Data Integration, Book, Pentaho, Kettle, PDI, pentaho data integration, packt

Dear Kettle fans,

A few weeks ago, when I was stuck in the US after the MySQL User Conference, a new book was published by Packt Publishing.

That all by itself is something that is not too remarkable. However, this time it’s a book about my brainchild Kettle. That makes this book very special to me. The full title is Pentaho 3.2 Data Integration : Beginner’s Guide (Amazon, Packt). The title all by itself explains the purpose of this book: give the reader a quick-start when it comes to Pentaho Data Integration (Kettle).

The author María Carina Roldán ( …


Apr

2010

Slides from my MySQL UC 2010 presentation

Posted by Matt Casters on Tue 27 Apr 2010 10:54 UTC
Tags:

conference, Presentation, Data Integration, Pentaho, Kettle, slides, PDI, pentaho data integration, Eyjafjallajökul, MySQL UC 2010, MySQL

As requested by a few fans out there, here are the slides of my presentation:

Pentaho Data Integration 4.0 and MySQL.pdf

I had a great time at the conference and met a lot of nice folks, friends, customers, partners and colleagues. After the conference, like so many of you, I was unable to get back home because of the Paul Simon singing Eyjafjallajökull volcano in Iceland.

So I ended up flying over to Orlando for a week of brutal PDI 4.0 RC1 hacking with the rest of the l33t super Pentaho development team. However, after 2+ weeks away from home, even a severe storm …


Jun

2009

Mapping to a database table

Posted by Matt Casters on Wed 24 Jun 2009 15:56 UTC
Tags:

Linux, Data Integration, Kettle, PDI, mapping, Table output

For some reason, the creation of a mapping to a database table poses a problem for certain people.

This is how it’s done in PDI 3.2.0 or later in the “Table Output” step:

Ogg video available over here

Until next time,
Matt

Dec

2008

Kettle at the MySQL UC 2009

Posted by Matt Casters on Thu 11 Dec 2008 19:25 UTC
Tags:

Open Source, Databases, conference, Data Integration, Kettle, 2009, PDI, MySQL

Hello Kettle fans,

Like Roland, I got confirmation earlier this week that I can present my talk on “MySQL and Pentaho Data Integration in a cloud computing setting” at the next MySQL User Conference.

I’m very excited about the work we’ve done on the subject and it’s going to be great talking about it in April.

See you there!
Matt
