Dear Kettle fans,
Daniel & I had a lot of fun in Orlando last week. Among other things, we worked on the User Defined Java Class (UDJC) step. If you have a bit of Java experience, this step lets you quickly write your own plugin inside a step. It is available in recent builds of Pentaho Data Integration (Kettle) version 4.
Now, how does this work? Well, let’s take Roland Bouman’s example: the calculation of the date of Easter. In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript. OK, so what if you want this calculation to be really fast in Kettle? Well, then you can turn to pure Java to do the job…
    import java.util.*;

    private int yearIndex;
    private Calendar calendar;

    public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
    {
        Object[] r = getRow();
        if (r == null) {
            setOutputDone();
            return false;
        }

        if (first) {
            yearIndex = getInputRowMeta().indexOfValue(getParameter("YEAR"));
            if (yearIndex < 0) {
                throw new KettleException("Year field not found in the input row, check parameter 'YEAR'!");
            }

            calendar = Calendar.getInstance();
            calendar.clear();

            first = false;
        }

        Object[] outputRowData = RowDataUtil.resizeArray(r, data.outputRowMeta.size());
        int outputIndex = getInputRowMeta().size();

        Long year = getInputRowMeta().getInteger(r, yearIndex);
        outputRowData[outputIndex++] = easterDate(year.intValue());

        putRow(data.outputRowMeta, outputRowData);

        return true;
    }

    private Date easterDate(int year) {
        int a = year % 19;
        int b = (int)Math.floor(year / 100);
        int c = year % 100;
        int d = (int)Math.floor(b / 4);
        int e = b % 4;
        int f = (int)Math.floor(( 8 + b ) / 25);
        int g = (int)Math.floor((b - f + 1) / 3);
        int h = (19 * a + b - d - g + 15) % 30;
        int i = (int)Math.floor(c / 4);
        int k = c % 4;
        int L = (32 + 2 * e + 2 * i - h - k) % 7;
        int m = (int)Math.floor((a + 11 * h + 22 * L) / 451);
        int n = h + L - 7 * m + 114;

        calendar.set(year, (int)(Math.floor(n / 31) - 1), (int)((n % 31) + 1));

        return calendar.getTime();
    }
All you then need to do is specify a return field called “Easter” (a Date) in the Fields tab, plus a parameter YEAR whose value is the name of the input field that contains the year.
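If you want to check the arithmetic before pasting it into the step, the same calculation can be run outside Kettle. What follows is only a minimal sketch under a few assumptions: the class name EasterCheck is made up for illustration, and it returns a java.time.LocalDate (Java 8 or later) instead of the java.util.Date that the step produces. It is not part of the UDJC step itself.

```java
import java.time.LocalDate;

// Hypothetical standalone helper (not part of Kettle): it repeats the
// arithmetic of easterDate() above so it can be checked in isolation.
class EasterCheck {

    // Returns the date of Easter Sunday for the given Gregorian year.
    static LocalDate easterDate(int year) {
        int a = year % 19;
        int b = year / 100;          // integer division already rounds down for positive values
        int c = year % 100;
        int d = b / 4;
        int e = b % 4;
        int f = (8 + b) / 25;
        int g = (b - f + 1) / 3;
        int h = (19 * a + b - d - g + 15) % 30;
        int i = c / 4;
        int k = c % 4;
        int l = (32 + 2 * e + 2 * i - h - k) % 7;
        int m = (a + 11 * h + 22 * l) / 451;
        int n = h + l - 7 * m + 114;

        // LocalDate months are 1-based, so there is no "- 1" here,
        // unlike the Calendar.set() call in the step code.
        return LocalDate.of(year, n / 31, (n % 31) + 1);
    }

    public static void main(String[] args) {
        System.out.println(easterDate(2010)); // prints 2010-04-04
    }
}
```

The one real difference from the step code is the month: Calendar counts months from 0, LocalDate from 1.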
The performance on my machine (Dual Core 2 Duo, 2.33 GHz) is 134,000 rows/s for the JavaScript version and 450,000 rows/s for the UDJC version. That’s over 3 times faster to do exactly the same thing.
Here is a link to the Kettle test transformation for those who want to give it a try. As you can see, the deployment issue of having a plugin around is completely gone: you can now do anything a plugin can do from within the comfort of the UDJC step in Spoon.
The UDJC step uses the wonderful Janino library to compile the entered code to Java byte-code that gets executed at the same speed as everything else in Kettle. This gives us pretty much optimal performance.
You can expect some tweaks to the UDJC step before 4.0 goes into feature freeze. However, the bulk of the changes are in there and working great. Thank you Daniel, for an outstanding job!
Until next time,
Matt