Showing entries 1 to 30 of 34

Displaying posts with tag: Kettle

Data Modeling

Dear data integration fans,

I’m a big fan of "appropriate" data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style.  Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt.  However, the largest group of data warehouse architects uses a technique called dimensional modeling, championed by Ralph Kimball.

Using a modeling technique is very important since it brings structure to your data warehouse.  The techniques used, when applied correctly of course, are

  [Read more...]
Proposals for Codebits.EU
Codebits is an annual 3-day conference about software and, well, code. It's organized by SAPO, and this year's edition is to be held on November 10 through 12 at the Pavilhão Atlântico, Sala Tejo, in Lisbon, Portugal.

I've never attended SAPO Codebits before, but I have heard good things about it from Datacharmer Giuseppe Maxia. The interesting thing about the way this conference is organized is that all proposals are available to the public, who can also vote on them. This year's proposals are looking very interesting already, with high

  [Read more...]
Real-time streaming data aggregation

Dear Kettle users,

Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly, hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it is less well known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as "streaming", "real-time", "near real-time", "continuous" and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instant it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud

  [Read more...]
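The excerpt above cuts off in the middle of the list of examples. As a hedged illustration (mine, not code from the post), the JMS case boils down to a loop that never ends: block until the next message arrives and push it downstream right away. The broker URL and queue name below are made up, and ActiveMQ is only assumed as a convenient JMS provider; any other provider works the same way.

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.TextMessage;
import org.apache.activemq.ActiveMQConnectionFactory;

public class NeverEndingJmsSource {
  public static void main(String[] args) throws Exception {
    // Hypothetical broker URL; any JMS ConnectionFactory can be substituted.
    ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory("tcp://localhost:61616");
    Connection connection = factory.createConnection();
    connection.start();
    Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
    MessageConsumer consumer = session.createConsumer(session.createQueue("events"));

    // The "streaming" part: this loop never finishes, it just waits for the next message.
    while (true) {
      Message message = consumer.receive();
      if (message instanceof TextMessage) {
        String body = ((TextMessage) message).getText();
        // In a Kettle transformation, this is the point where a row would be passed downstream.
        System.out.println(body);
      }
    }
  }
}
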
Memory tuning fast paced ETL

Dear Kettle friends,

On occasion we need to support environments where not only does a lot of data need to be processed, but it also arrives in frequent batches.  For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.

In this setting we want to use clustering to harness "commodity" computing resources in parallel.  In this blog post I’ll detail what the general architecture looks like and how to tune memory usage in this environment.

Clustering was first created around the end of 2006.  Back then it looked like this.

The master

This is the most important part of our cluster.  It takes care of administering the network configuration and topology.  It also keeps track of the state of dynamically added slave servers.

The master is started

  [Read more...]
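The excerpt stops before the actual tuning advice. One small aside that is not from the post: whatever -Xmx value you pass to a master or slave server only matters if the JVM actually picked it up, and a quick way to verify that is to print the Runtime figures from a tiny Java class (or a UDJC step):

public class HeapCheck {
  public static void main(String[] args) {
    // Reports the heap limits this JVM was started with (e.g. via -Xmx / -Xms).
    Runtime rt = Runtime.getRuntime();
    long mb = 1024L * 1024L;
    System.out.println("Max heap   : " + (rt.maxMemory() / mb) + " MB");
    System.out.println("Total heap : " + (rt.totalMemory() / mb) + " MB");
    System.out.println("Free heap  : " + (rt.freeMemory() / mb) + " MB");
  }
}
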
Data Cleaner 2

Dear Kettle friends,

Some time ago while I visited the nice folks from Human Inference in Arnhem, I ran into Kasper Sørensen, the lead developer of DataCleaner.

DataCleaner is an open source data quality tool released (like Kettle) under the LGPL license.  It is essentially to blame for the lack of a profiling tool inside of Kettle.  That is because having DataCleaner available to our users was enough to push the priority of having our own data profiling tool far enough down.

Kasper worked on DataCleaner pretty much in his spare time in the past.  Now that Human Inference has taken over the project, I was expecting more frequent updates and

  [Read more...]
Reading from MongoDB

Hi Folks,

Now that we’re blogging again I thought I might as well continue to do so.

Today we’re reading data from MongoDB with Pentaho Data Integration.  We haven’t had a lot of requests for MongoDB support so there is no step to read from it yet.  However, it is surprisingly simple to do with the “User Defined Java Class” step.

For the following sample to work you need to be on a recent 4.2.0-M1 build.  Get it from here.

Then download mongo-2.4.jar and put it in the libext/ folder of your PDI/Kettle distribution.

Then you can read from a collection with the following “User Defined Java Class” code:

import java.math.*;
import java.util.*;
import java.util.Map.Entry;
import
  [Read more...]
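The code above is cut off after the first few imports, so the actual read logic is missing from the excerpt. As a hedged sketch (my own, not the code from the post), these are the mongo-2.4.jar driver calls that such a "User Defined Java Class" step would wrap: connect, open a collection, and walk a cursor. The host, port, database and collection names are made up.

import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class ReadMongoCollection {
  public static void main(String[] args) throws Exception {
    // Hypothetical host/port, database and collection.
    Mongo mongo = new Mongo("localhost", 27017);
    try {
      DB db = mongo.getDB("test");
      DBCollection collection = db.getCollection("orders");

      // Iterate the whole collection; each document serializes to JSON via toString().
      DBCursor cursor = collection.find();
      while (cursor.hasNext()) {
        DBObject document = cursor.next();
        // In the UDJC step, each document would become one output row (for example a single JSON field).
        System.out.println(document.toString());
      }
    } finally {
      mongo.close();
    }
  }
}
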
Parse nasty XLS with dynamic ETL

Dear Kettle friends,

Last year, right after the summer, in version 4.1 of Pentaho Data Integration we introduced the notion of dynamically inserted ETL metadata (YouTube video here).  Since then we have received a lot of positive feedback on this functionality, which encouraged me to extend it to a few more steps.  Already with support for "CSV Input" and "Select Values" we could do a lot of dynamic things.  However, we can clearly do a lot better by extending our initiative to a few more steps: "Microsoft Excel Input" (which can also read ODS, by the way), "Row Normalizer" and "Row De-normalizer".

Below I’ll describe an actual (obfuscated) example that you will probably recognize, as it is as hideous as it is simple in its horrible complexity.

Take a look at this file:

  [Read more...]
Kettle vs Oracle REF CURSOR

Dear Kettle fans,

PDI-200 has been out there for a while now.  Jens created the feature request a little over 3 years ago.  I guess the main thing blocking this issue was not so much a technical problem as a licensing and dependency one (Oracle JDBC dependency and distribution license).

However, now that we have the User Defined Java Class step we can work around those pesky problems. That is because the Java code in there only gets compiled and executed at runtime, so it’s perfectly fine to create any sort of dependency you like in there.

The following transformation reads a set of rows from a stored procedure as described on this web page.

In short, our UDJC step executes the following code:

begin ? := sp_get_stocks(?);

  [Read more...]
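The excerpt stops at the PL/SQL block, so the Java side is not shown. A minimal sketch of the standard JDBC pattern (my illustration, not necessarily the post's exact code): register the first placeholder as a REF CURSOR out parameter, bind the input, and read the returned cursor as an ordinary ResultSet. The connection details are made up and the numeric input parameter of sp_get_stocks is an assumption.

import java.math.BigDecimal;
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import oracle.jdbc.OracleTypes;

public class RefCursorExample {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details.
    Connection connection = DriverManager.getConnection(
        "jdbc:oracle:thin:@//localhost:1521/ORCL", "scott", "tiger");

    CallableStatement call = connection.prepareCall("begin ? := sp_get_stocks(?); end;");
    call.registerOutParameter(1, OracleTypes.CURSOR);  // the REF CURSOR comes back here
    call.setBigDecimal(2, new BigDecimal("20"));       // assumed numeric input parameter
    call.execute();

    // Once cast, the out parameter behaves like any other ResultSet.
    ResultSet rs = (ResultSet) call.getObject(1);
    while (rs.next()) {
      System.out.println(rs.getString(1));
    }
    rs.close();
    call.close();
    connection.close();
  }
}
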
Pentaho Kettle Solutions Overview

Dear Kettle friends,

As mentioned in my previous blog post, copies of our new book Pentaho Kettle Solutions are finally shipping.  Roland, Jos and I worked really hard on it and, as you can probably imagine, we were really happy when we finally got the physical version of our book in our hands.

So let’s take a look at what’s in this book, what the concept behind it was and give you an overview of the content…

The concept

Given the fact that Maria’s book, called

  [Read more...]
Pentaho Kettle Solutions
I have several favorite authors -- Tim Dorsey, Clive Cussler, and a few others whose latest book I buy just because I trust the quality of their work.  Now on that list are Roland Bouman, Jos van Dongen, and Matt Casters.  In a follow-up to Bouman's and van Dongen's Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL, they have now produced Pentaho Kettle Solutions, which explores the often murky world of ETL and data integration.

Kettle can be confusing as there are many components with names such as Spoon and

  [Read more...]
Back to blogging....
It has been a while since I posted on my blog - in fact, I believe this is the first time ever that more than one month has passed between posts since I started blogging. There are a couple of reasons for the lag:

  • Matt Casters, Jos van Dongen and I have spent a lot of time finalizing our forthcoming book, Pentaho Kettle Solutions (Wiley, ISBN: 978-0-470-63517-9). The book is currently being produced, and should be available according to schedule in early September 2010. If you're interested, you might like to read

  [Read more...]
Book Review : Pentaho 3.2 Data Integration

Dear Kettle fans,

A few weeks ago, when I was stuck in the US after the MySQL User Conference, a new book was published by Packt Publishing.

That all by itself is not too remarkable.  However, this time it’s a book about my brainchild, Kettle. That makes this book very special to me. The full title is Pentaho 3.2 Data Integration: Beginner’s Guide (Amazon, Packt).  The title all by itself explains the purpose of this book: to give the reader a quick start with Pentaho Data Integration (Kettle).

The author María Carina

  [Read more...]
Slides from my MySQL UC 2010 presentation

As requested by a few fans out there, here are the slides of my presentation:

Pentaho Data Integration 4.0 and MySQL.pdf

I had a great time at the conference, met a lot of nice folks, friends, customers, partners and colleagues. After the conference I was unable to get back home, like so many of you, because of the Paul Simon singing Eyjafjallajökull volcano in Iceland.

So I ended up flying over to Orlando for a week of brutal PDI 4.0 RC1 hacking with the rest of the l33t super Pentaho development team.  However, after 2+ weeks away from home, even a severe storm over Philadelphia couldn’t prevent me from getting home eventually.

Until next time,
Matt

MySQL User Conference 2010

Dear Kettle and MySQL fans,

Next week I’ll be strolling around the MySQL user conference in Santa Clara.  Even better, I’ll be presenting Tuesday afternoon (3:05pm).  The topic is Pentaho Data Integration 4.0 and MySQL.

The presentation will show you what the world’s most popular open source data integration tool can do for a MySQL user.  It will include practical examples and will showcase the latest improvements present in the brand new version 4.0.

Even more than the presentation itself, I’m looking forward to meeting you all over there.  The regular crowd, MySQL users, Pentaho partners, folks from

  [Read more...]
Writing another book: Pentaho Kettle Solutions
Last year, at about this time of the year, I was well involved in the process of writing the book "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" for Wiley. To date, "Pentaho Solutions" is still the only all-round book on the open source Pentaho Business Intelligence suite.

It was an extremely interesting project to participate in, full of new experiences.

  [Read more...]
Re-Introducing UDJC

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things we worked on the User Defined Java Class (UDJC) step.  If you have a bit of Java experience, this step allows you to quickly write your own plugin in the form of a step. This step is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work?  Well, let’s take Roland Bouman’s example: the calculation of the date of Easter.  In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript.  OK, so what if you want this calculation to be really fast in Kettle?  Well, then you can turn to pure Java to do the job…

import java.util.*;
  [Read more...]
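The snippet above is truncated after the first import. For reference, here is a self-contained transcription of the anonymous Gregorian (Meeus/Jones/Butcher) algorithm that both Easter posts refer to. This is my own rendering rather than the code from the post, but for 2010 it returns 2010-04-04, which matches the f_easter() output shown in the next entry.

import java.util.Calendar;
import java.util.GregorianCalendar;

public class EasterDate {

  // Anonymous Gregorian ("Meeus/Jones/Butcher") computus.
  public static Calendar easterSunday(int year) {
    int a = year % 19;
    int b = year / 100;
    int c = year % 100;
    int d = b / 4;
    int e = b % 4;
    int f = (b + 8) / 25;
    int g = (b - f + 1) / 3;
    int h = (19 * a + b - d - g + 15) % 30;
    int i = c / 4;
    int k = c % 4;
    int l = (32 + 2 * e + 2 * i - h - k) % 7;
    int m = (a + 11 * h + 22 * l) / 451;
    int month = (h + l - 7 * m + 114) / 31;             // 3 = March, 4 = April
    int day = ((h + l - 7 * m + 114) % 31) + 1;
    return new GregorianCalendar(year, month - 1, day); // Calendar months are zero-based
  }

  public static void main(String[] args) {
    Calendar easter = easterSunday(2010);
    System.out.printf("%1$tY-%1$tm-%1$td%n", easter);   // prints 2010-04-04
  }
}
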
Easter Eggs for MySQL and Kettle
To whom it may concern,

A MySQL stored function to calculate Easter day

I uploaded a MySQL Forge snippet for the f_easter() function. You can use this function in MySQL statements to calculate Easter Sunday for any given year:

mysql> select f_easter(year(now()));
+-----------------------+
| f_easter(year(now())) |
+-----------------------+
| 2010-04-04            |
+-----------------------+
1 row in set (0.00 sec)

Anonymous Gregorian algorithm

To implement it, I simply transcribed the code of the "Anonymous Gregorian algorithm" from Wikipedia's

  [Read more...]
My OSCON 2009 Session: Taming your Data...
Yes!

Finally, it's there: in a few hours, I will be flying off to San Francisco to attend OSCON 2009 in San Jose, California. This is the first time I'm attending, and I'm tremendously excited to be there! The sessions look very promising, and I'm looking forward to seeing some excellent speakers. I expect to learn a lot.

I'm also very proud and feel honoured to have the chance to deliver a session myself. It's called Taming Your Data: Practical Data Integration Solutions with Kettle.

Unsurprisingly, I will be talking a lot about Kettle, a.k.a. Pentaho Data Integration. Recently, I

  [Read more...]
Starring Sakila: MySQL University recording, slides and materials available on MySQL Forge
Hi!

Yesterday I had the honour of presenting my mini-BI/data warehousing tutorial "Starring Sakila" for MySQL University. I did a modified version of the presentation I did together with Matt Casters at the MySQL Users Conference 2009. The structure of the presentation is still largely the same, although I condensed various bits, and I added practical examples of setting up the ETL process and creating a Pentaho Analysis View (OLAP pivot table) on top of a Mondrian Cube.

The slides, session recording, and materials such as the SQL script, Pentaho Data Integration jobs and transformations, and the Sakila Rentals Cube for Mondrian are all available here on MySQL

  [Read more...]
Mapping to a database table

For some reason, the creation of a mapping to a database table poses a problem for certain people.

This is how it’s done in PDI 3.2.0 or later in the “Table Output” step:

Ogg video available over here

Until next time,
Matt
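The mapping itself is only shown in the video. If it helps to see the idea in plain code: a "Table Output" style load essentially comes down to a prepared INSERT with one placeholder per mapped field, executed in batches. The sketch below is my illustration of that pattern with a made-up table, not Kettle's actual implementation.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class TableOutputSketch {
  public static void main(String[] args) throws Exception {
    // Hypothetical MySQL connection and target table.
    Connection connection = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/test", "user", "password");

    // Each stream field is mapped onto one column of the INSERT statement.
    PreparedStatement insert = connection.prepareStatement(
        "INSERT INTO customer (id, name, city) VALUES (?, ?, ?)");

    Object[][] rows = { { 1L, "Alice", "Antwerp" }, { 2L, "Bob", "Brussels" } };
    for (Object[] row : rows) {
      insert.setLong(1, (Long) row[0]);
      insert.setString(2, (String) row[1]);
      insert.setString(3, (String) row[2]);
      insert.addBatch();  // batching keeps bulk loads reasonably fast
    }
    insert.executeBatch();

    insert.close();
    connection.close();
  }
}
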

Google Goodies and Lego

Dear Kettle friends,

Will Gorman and Mike D’Amour, Senior Developers at Pentaho, are presenting Pentaho’s Google integration work at the Google I/O Developer Conference (in the Sandbox area, to be specific).  Yesterday, Pentaho announced as much.

Here are a few of the integration points:

  • Google maps dashboard (available in the Pentaho BI server you can download)
  • A new Google Docs step was created for Pentaho Data Integration Enterprise Edition
  • Running (AVI, 30MB) the Pentaho BI server on
  [Read more...]
PDI cloud : massive performance roundup

Dear Kettle fans,

As expected there was a lot of interest in cloud computing at the MySQL conference last week.  It felt really good to be able to pass the Bayon Technologies white paper around to friends, contacts and analysts.  It’s one thing to demonstrate a certain scalability on your blog, it’s another entirely to have a smart man like Nicholas Goodman do the math.

Sorting massive amounts of rows is a hard problem to take on.  Making it scale on low-cost EC2

  [Read more...]
Next week : MySQL UC

Dear Kettle & MySQL fans!

I’m really looking forward to going to the MySQL User Conference next week, not just because I’m speaking in 2 sessions again, but perhaps also because these are "interesting" times for MySQL and Sun Microsystems.  Pivotal times, it would seem.

Here are the 2 sessions I’m going to do:

  • Cloud Computing with MySQL and Kettle : I’m particularly happy that MySQL accepted this session: it will demonstrate how easy it has become to do cloud computing exercises with tools like MySQL and Kettle.
  [Read more...]
Pentaho Partner Summit ‘09

Dear reader,

In a little over 3 weeks, April 2nd and 3rd, we’re organizing a Pentaho Partner Summit at the Quadrus Conference Center in Menlo Park near San Francisco.

If you are (as the invitation describes) an "Executive, luminary, current or prospective partner from around the world" and you come over, you’ll meet me, Julian Hyde and perhaps a couple of other architects as well.  That is in addition to a host of other interesting people like Zack Urlocker (MySQL) and of course Richard Daley, our CEO. We’ll be doing a couple of lengthy sessions on Kettle and Mondrian, among other things.

See you

  [Read more...]
MySQL Stored Procedure Result Sets in Pentaho Data Integration
Quick tip: suppose you need the result set of a MySQL stored procedure in your Pentaho Data Integration (a.k.a. Kettle) transformation. What do you do?

A Call DB Procedure step sounds promising, but as it turns out, you can't use it to retrieve any result sets. Rather, this type of step is meant to have an input stream from another source, and to either:

  • drive stored procedure execution presumably for some useful side-effect

  • invoke a database stored function and obtain the scalar result

So, what can we do? The answer is simpler than might be expected.

Just use an ordinary

  [Read more...]
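The answer above is cut off by the excerpt. As a hedged illustration of the underlying mechanism (not necessarily the step the post goes on to recommend), plain JDBC with MySQL Connector/J can read whatever result sets a stored procedure returns; the procedure name and connection settings below are made up.

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;

public class MySqlProcResultSet {
  public static void main(String[] args) throws Exception {
    // Hypothetical connection details and procedure name.
    Connection connection = DriverManager.getConnection(
        "jdbc:mysql://localhost:3306/sakila", "user", "password");

    CallableStatement call = connection.prepareCall("{call film_report()}");
    boolean hasResultSet = call.execute();

    // A procedure may return more than one result set; walk through all of them.
    while (hasResultSet) {
      ResultSet rs = call.getResultSet();
      while (rs.next()) {
        System.out.println(rs.getString(1));
      }
      hasResultSet = call.getMoreResults();
    }

    call.close();
    connection.close();
  }
}
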
Kettle at the MySQL UC 2009

Hello Kettle fans,

Like Roland, I got confirmation earlier this week that I could present my talk on "MySQL and Pentaho Data Integration in a cloud computing setting" at the next MySQL User Conference (http://www.mysql.com/news-and-events/users-conference/).

I’m very excited about the work we’ve done on the subject and it’s going to be great talking about it in April.

See you there!
Matt

Kettle workshop at KHM

Good news Kettle fans!

Our community is bound to become a bit larger, as a whole group of students (38) at the Katholieke Hogeschool Mechelen (Bachelor level) will receive a one-day workshop on Pentaho Data Integration (Kettle).  This workshop will take place in early November, most likely the 4th.

It’s interesting to see that during that day we’ll be able to go through most of the work involved in reading and staging the data, data cleansing, and a few slowly changing dimensions with a fact table.  On top of that we’ll explain how to use Pentaho Data Integration in that setting.  When time permits we’ll show how to set up a metadata model on top of that data to create reports on it.  Finally, the students will get an idea about what exactly

  [Read more...]
Maturity of Open Source ETL / Why do people fail to do their homework?
I just read this post on Matt Casters' blog. Here, Matt describes why Element 61's Jan Claes is dead wrong in the way he assesses the maturity of open source ETL tools.

Well, I've just read Jan Claes' article in the "research and insights" area of the Element61 website, and frankly, it is pretty easy to see how unsubstantiated it is. Some may be tempted to classify the article as

  [Read more...]
Dead wrong

Belgian consultancy company Element 61 has just posted an opinion piece under the guise of a review of open source ETL.

What a load of utter nonsense.  Try reading this:

Instead of using SQL statements to transform data, an Open Source ETL tool gives the developer a standard set of functions, error handling rules and database connections. The integration of all these different components is done by the Open Source ETL tool provider. The straightforward transformations can be implemented very quickly, without the hassle of writing queries, connecting to data sources or writing your own error handling process. When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.

Well Mr Jan Claes, we’re perfectly

  [Read more...]
T-Dose 2008

Roland Bouman and I will be doing a presentation together at T-Dose on October 25th:

Building Open Source BI solutions with Pentaho and MySQL

It’s a free conference, so feel free to join us there for a chat and/or a drink!

Until then,
Matt
