Showing entries 51 to 60 of 87
« 10 Newer Entries | 10 Older Entries »
Displaying posts with tag: Data Integration
Top 10 Data Management Issues for 2009

So it’s that time of year again when everyone puts out their predictions for the year ahead.  I think predictions are a bit of a waste of time: to be interesting, a prediction has to be big, but a year really isn’t all that long, so the actual changes over the course of 2009 are likely to be small progressions.  So instead I have been thinking about the top issues we face heading into 2009, and here is my Top 10 list of issues in Data Management.  In this post I avoid offering solutions to these issues; while I have several ideas on solutions, those can be the subject of subsequent posts.

10 - Limits on Scalability

While scalability is on my list, it sits at number 10 because, contrary to popular belief, scalability is only an issue for a very small number of data-based applications.  Almost all data-based applications in use today can be scaled without major issue by increasing the underlying …

[Read more]
Kettle at the MySQL UC 2009

Hello Kettle fans,

Like Roland, I got confirmation earlier this week that I could present my talk on “MySQL and Pentaho Data Integration in a cloud computing setting” at the next MySQL user conference.

I’m very excited about the work we’ve done on the subject and it’s going to be great talking about it in April.

See you there!
Matt

Kettle workshop at KHM

Good news Kettle fans!

Our community is bound to become a bit larger, as a whole group of 38 students at the Katholieke Hogeschool Mechelen (Bachelor level) will receive a one-day workshop on Pentaho Data Integration (Kettle).  This workshop will take place in early November, most likely on the 4th.

It’s interesting to see that in a single day we’ll be able to go through most of the work involved in reading and staging the data, data cleansing, and a few slowly changing dimensions with a fact table.  We’ll also explain how to use Pentaho Data Integration in that setting, and, time permitting, show how to set up a metadata model on top of that data to create reports.  On top of that, the students will get an idea of what exactly open source is all about.
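For readers unfamiliar with the term, the "slowly changing dimensions" part of the workshop refers to the type 2 technique: instead of overwriting a dimension record when it changes, you retire the old version and insert a new one. Here is a minimal, purely illustrative sketch of the idea (the class and function names are mine, not Kettle's):

```python
# Type 2 slowly changing dimension, sketched in plain Python.
# A changed record is never overwritten: the current version is
# retired and a new version is appended.
from dataclasses import dataclass

@dataclass
class DimRow:
    key: int            # natural key, e.g. a customer id
    attrs: dict         # the tracked attributes
    version: int = 1
    current: bool = True

def apply_scd2(dimension, key, new_attrs):
    """Retire the current row for `key` and append a new version on change."""
    current = next((r for r in dimension if r.key == key and r.current), None)
    if current is None:
        dimension.append(DimRow(key, new_attrs))          # first version
    elif current.attrs != new_attrs:
        current.current = False                           # retire old version
        dimension.append(DimRow(key, new_attrs,
                                version=current.version + 1))
    return dimension
```

In Kettle itself this is of course handled by a dedicated step against a real dimension table; the sketch only shows the bookkeeping the workshop covers.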

Obviously, the …

[Read more]
Dead wrong

Belgian consultancy company Element 61 has just posted an opinion piece disguised as a review of open source ETL.

What a load of utter nonsense.  Try reading this:

Instead of using SQL statements to transform data, an Open Source ETL tool gives the developer a standard set of functions, error handling rules and database connections. The integration of all these different components is done by the Open Source ETL tool provider. The straightforward transformations can be implemented very quickly, without the hassle of writing queries, connecting to data sources or writing your own error handling process. When there are complex transformations to make, Open Source ETL tools will often not offer out-of-the-box solutions.

Well Mr Jan Claes, we’re perfectly capable of handling …

[Read more]
T-Dose 2008

Roland Bouman and I will be doing a presentation together at T-Dose on October 25th:

Building Open Source BI solutions with Pentaho and MySQL

It’s a free conference, feel free to join us there for a chat and/or a drink!

Until then,
Matt

Getting started with Kettle

For those people starting with Kettle (Pentaho Data Integration), we created a Getting Started page on our Wiki.

Since I realized that for some people, simple and easy can never be simple and easy enough, I created 8 mini Flash demos:

[Read more]
Pentaho changes

I’m back at my favorite spot at the Orlando airport:

This week has gone by so fast it’s kinda scary.  I got dragged from one meeting into another design session into another knowledge transfer opportunity for 5 days in a row.  After our long working days, the discussions and talks just continued over dinner and beers.

It was great to meet everyone and as always we had a good time around the office and at the Ale House.  I even managed to stay sober this time around.  Well at least most of the time.

As always, the thing that struck me most was how fast Pentaho changes.  It’s almost like visiting a different company every time I drop in.  Since I don’t see the day-to-day changes around the office, the difference between my first visit (15 people) and now (70+) is striking.  The office space occupied has more than doubled, for example.

Well, let me tell you, …

[Read more]
Parallel CSV reader

I almost forgot I wrote this code a while back. Someone asked me about it yesterday, so I dusted off the parallel CSV reader code this morning, and here are the results:

This test reads a generated file of 10 million customer records, 919,169,988 bytes in size, in 18.3 seconds (about 50 MB/s). Obviously, my poor laptop disk can’t deliver at that speed; these results were obtained courtesy of the excellent Linux page cache, which in effect simulates a faster disk subsystem.

On my computer, the system doesn’t scale linearly (especially since, in this case, the OS uses up some CPU power too), but the speedup is noticeable: from 25.8 to 18.3 seconds, about 30% faster.

The interesting thing is that if you have more CPUs at your disposal (both SMP and clustered setups work) you can probably make it scale to the full extent of your disk speed.

In the case where lazy conversion is disabled …

[Read more]
IRC ##pentaho

Of course there are the crazies, but usually we have a good time over on the ##pentaho IRC channel.

Yesterday we had our very first community event, when Doug “Spanky” Moran hosted a dial-in to talk about what was going on in the community.

Today, I learned about the blog of channel regular andresF.

Internet Relay Chat is old technology that has been around for quite a while now, but to me it doesn’t lose its appeal. Over at FOSDEM I learned that there are companies like MySQL that have private channels to communicate “non-intrusively” with colleagues: “Maybe some developer can help me with this stupid problem. I’ll just drop a question on the channel.” It’s a good idea; we should consider it for Pentaho too.

Until next time,

Matt

Rolling back transactions

Pentaho Data Integration (Kettle) never was a real transactional database engine, and never pretended to be one. It was designed to handle large data volumes, slamming in a commit every couple of thousand rows to prevent the databases from choking on the transaction log.
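That bulk-load pattern, commit once every few thousand rows so no single transaction grows huge, can be sketched like this (an illustration using sqlite3, not Kettle's code):

```python
# Periodic-commit bulk load: commit after every batch_size rows so the
# database never has to hold one enormous transaction.
import sqlite3

def bulk_insert(conn, rows, batch_size=2000):
    """Insert rows, committing after every batch_size rows."""
    for i, row in enumerate(rows, start=1):
        conn.execute("INSERT INTO t (x) VALUES (?)", (row,))
        if i % batch_size == 0:
            conn.commit()           # periodic commit keeps the log small
    conn.commit()                   # flush the final partial batch
```

The obvious downside is exactly the subject of this post: once a batch has been committed, it can no longer be rolled back when a later step fails.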

However, more and more people are using Kettle transformations in a transactional way. They want to have the option to roll back any change that happened to a database during the execution of a transformation in case anything goes wrong.

Well, we have been working on that in the past, but never quite got it right… until today, actually.  As part of bug report 724, I lifted the decision to commit or roll back all databases to the transformation level.
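The shape of that change can be sketched as follows: every step writes through shared, long-lived connections but never commits on its own; only when the whole transformation has finished does one piece of code commit, or roll back, every connection together. This is a hedged sketch of the idea in Python, not the actual Kettle code:

```python
# Transformation-level transaction handling: steps share connections
# and the commit/rollback decision is made once, for all databases.
import sqlite3

def run_transactional(steps, connections):
    """Run all steps; commit every connection only if all succeed."""
    try:
        for step in steps:
            step(connections)           # steps write but never commit
    except Exception:
        for conn in connections.values():
            conn.rollback()             # undo everything, everywhere
        raise
    for conn in connections.values():
        conn.commit()                   # single decision point
```

If any step raises, every database involved in the transformation ends up exactly as it was before the run started.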

Take for example a look at this transformation:

What happens is …

[Read more]