I’ve been setting up a Pentaho Data Integration system with the
goals of supporting collaboration with my team, allowing easy
deployment to test or production, and enabling remote monitoring
and troubleshooting of jobs and transformations.
I’ve finally figured out a way to achieve these goals, so I’ll
try to pass this on now. I found the book "Pentaho Solutions:
Business Intelligence and Data Warehousing with Pentaho and
MySQL", by Roland Bouman and Jos van Dongen to be a big help
in figuring out how to export/import. It definitely helped me get
up and running quickly.
My first decision was to bet the farm on the use of a repository.
A file-based system would probably work, but I felt it would
require too much file distribution and too many remote terminal
sessions.
So I set up two separate repositories hosted on MySQL
databases: one for development (DEV), and one for production
(PRD). Here are the steps I followed.
Create the DEV repository first and then use the Repository
Explorer to configure the database connections. Add one
connection for each database referenced by your ETL process, and
point these connections at the development database
instances. The gotcha I found was that any local
connections I had defined outside the repository were eventually
loaded in the repository. The workaround is to simply stop
maintaining connection information in your user Kettle
configuration, which is maintained in the file:
~/.kettle/shared.xml. I removed this file after deciding to “Go
Repo”. So we have Recommendation #1:
-- Use the repository to store database connections and do
not use locally defined connections.
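To follow this recommendation, here is a minimal sketch (assuming the standard per-user Kettle location for shared.xml) that moves the file aside rather than deleting it outright:

```shell
#!/bin/sh
# Retire locally defined connections by moving the per-user
# shared.xml aside (the path below is the standard Kettle location).
KETTLE_SHARED="$HOME/.kettle/shared.xml"
if [ -f "$KETTLE_SHARED" ]; then
    mv "$KETTLE_SHARED" "$KETTLE_SHARED.bak"   # keep a backup, just in case
    echo "moved $KETTLE_SHARED aside"
else
    echo "no shared.xml found; nothing to do"
fi
```

Keeping the .bak copy means you can still consult your old local connection definitions while re-creating them in the repository.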
In the repository explorer, create your user accounts. I created
separate “administrative” accounts for my colleagues, and a
special “etl_user” account to be used only for execution of the
jobs (using kitchen.sh).
Create the Production (PRD) repository. You can cheat a bit by
using mysqldump to backup your development database and simply
restore it on the host database that will be used for the
production repository.
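A sketch of that clone, with host names, user, and database name all as placeholder assumptions (the script only echoes the two commands, so it is safe to run as-is):

```shell
#!/bin/sh
# Sketch: clone the DEV repository database onto the PRD host.
# Host names, user, and database name are assumptions -- substitute your own.
DEV_HOST=devdb.example.com
PRD_HOST=prddb.example.com
REPO_DB=kettle_repo
REPO_USER=kettle

# Build the dump and restore commands; echo them rather than run them,
# so this sketch has no side effects until you paste the lines yourself.
DUMP="mysqldump -h $DEV_HOST -u $REPO_USER -p $REPO_DB > kettle_repo_dev.sql"
LOAD="mysql -h $PRD_HOST -u $REPO_USER -p $REPO_DB < kettle_repo_dev.sql"
echo "$DUMP"
echo "$LOAD"
```

The target database on the PRD host must already exist (and be empty) before the restore.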
Connect to the new PRD repository and open the repository
explorer. Then modify the existing database connections so that
they have the same name, but point to production instances of the
databases. It is very important that you not change the
name of the connections. Your jobs and transformations
will reference these “standardized” names. When you deploy a job
to PRD (explained below), you will not have to change the
database connections if you do it this way.
Now you have two (almost) identical repositories, one for
Development, and one for Production. So what are you waiting for?
Start developing!
In the new development repository, create a subdirectory for your
first real job and its transformation steps. This will help
exploit the export functionality of Spoon. It is very easy to
dump all of the jobs in a repository subdirectory. And that
brings me to my next recommendation:
-- Bundle a job and its individual steps (other jobs and
transformations) into a single repository subdirectory. Basically,
anything that is used by a single-line invocation of kitchen.sh
belongs in this subdirectory.
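For example, the job deployed later in this post might live in a layout like this (the directory and job names are borrowed from the kitchen.sh sample at the end; the transformation names are hypothetical):

```
/source_customers_to_staging/
    job_load_stage_customers       <- the job kitchen.sh invokes
    tr_extract_customers           <- transformations called by the job
    tr_load_stage_customers
```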
After development and testing of the job is done, you are ready
to deploy it to production. This is a two-phase process: export
your job and its components from the DEV repository, then import
them into the PRD repository. But first, some prerequisites: we
need to tidy up the global options of Spoon. In Spoon, choose the
menu option:
Edit -> Options
Make sure the following two options are checked:
“Only save used connections to XML”
“Ask about replacing existing connections on open/import”
And UNCHECK (please, or this whole process will break):
“Replace existing connections on open/import”
Now you are ready to export your development job. Using the
repository explorer on your DEV repository, expand the “Jobs”
folder. Select the sub-folder you want to export. Right-click on
it and select “Export all objects to an XML file”. Using the
displayed dialog box, browse to a temporary directory where you
want to save to, and give the file a name. Any name will do. For
good luck, call it johndz. Click OK. This automatically includes
the transformations that are in the jobs. A very cool feature in
my opinion.
Disconnect from your DEV repository.
Connect to your PRD repository, and open the repository explorer.
The repository explorer has a single menu item: File. Select the
menu option:
File -> Import repository into directory
Navigate to the file that you just exported and click OK. A
“Directory Selection” dialog appears. Select the root “/”
line, and click OK.
On a first-time import, a new “Create Directory” dialog appears,
asking whether to create a new repository directory. Answer
“Yes”.
The import will proceed. Sometimes it can take a few minutes.
When the last line of the “Repository Import” window says “Import
finished”, you are done.
To invoke your new job using kitchen, reference the repository,
and the job. Here is a sample kitchen.sh invocation from my
script (remember that kitchen.sh reads
~/.kettle/repositories.xml):
#!/bin/ksh
installdir=/usr/local/pentaho/data-integration
${installdir}/kitchen.sh -rep=dev_pentaho -user=etl_user \
    -pass=whatever -dir='source_customers_to_staging' \
    -job='job_load_stage_customers' -level=Basic
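The -rep=dev_pentaho name must match a repository entry in ~/.kettle/repositories.xml. A minimal sketch of that file, with the host, database, and credentials as placeholder assumptions, looks roughly like this:

```xml
<repositories>
  <connection>
    <name>dev_pentaho</name>
    <server>devdb.example.com</server>
    <type>MYSQL</type>
    <access>Native</access>
    <database>kettle_repo</database>
    <port>3306</port>
    <username>kettle</username>
    <password>Encrypted 1234abcd</password>
  </connection>
  <repository>
    <name>dev_pentaho</name>
    <description>Development repository</description>
    <connection>dev_pentaho</connection>
  </repository>
</repositories>
```

The easiest way to generate this file is simply to connect to each repository once from Spoon, which writes the entries for you.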
Oct 27, 2009