Setting up Development and Production Pentaho PDI Repositories

I’ve been setting up a Pentaho Data Integration system with three goals: supporting collaboration with my team, allowing easy deployment to test or production, and enabling remote monitoring and troubleshooting of jobs and transformations.

I’ve finally figured out a way to achieve these goals, so I’ll pass it on here. I found the book "Pentaho Solutions: Business Intelligence and Data Warehousing with Pentaho and MySQL" by Roland Bouman and Jos van Dongen to be a big help in figuring out how to export/import. It definitely helped me get up and running quickly.

My first decision was to bet the farm on the use of a repository. A file-based system would probably work, but I felt it would require too much file distribution and use of remote terminals. So I’ve set up two separate repositories hosted on MySQL databases: one for development (DEV) and one for production (PRD). Here are the steps I followed.

Create the DEV repository first, then use the Repository Explorer to configure the database connections. Simply add one connection for each database referenced by your ETL process, and configure these connections to point to the development database instances. The gotcha I found was that any local connections I had defined outside the repository were eventually loaded into the repository. The workaround is to simply stop maintaining connection information in your user Kettle configuration, which lives in the file ~/.kettle/shared.xml. I removed this file after deciding to “Go Repo”. So we have Recommendation #1:

-- Use the repository to store database connections and do not use locally defined connections.

In the Repository Explorer, create your user accounts. I created separate “administrative” accounts for my colleagues, and a special “etl_user” account to be used only for execution (using kitchen.sh) of the jobs.

Create the production (PRD) repository. You can cheat a bit by using mysqldump to back up your development repository database and simply restore it on the host database that will be used for the production repository.
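As a sketch of that clone step, the dump-and-restore might look like the script below. All host, database, and user names are placeholders (substitute your own), and the script only prints the commands so you can review them before running anything for real:

```shell
#!/bin/sh
# Sketch: clone the DEV repository database to the PRD host.
# Host, database, and user names below are placeholders -- substitute your own.
DEV_HOST=devdb.example.com
PRD_HOST=prddb.example.com
REPO_DB=kettle_repo
DUMP_FILE=kettle_repo_dev.sql

# Print the commands for review; drop the leading "echo" to run them for real.
echo "mysqldump -h $DEV_HOST -u repo_admin -p $REPO_DB > $DUMP_FILE"
echo "mysql -h $PRD_HOST -u repo_admin -p $REPO_DB < $DUMP_FILE"
```

The target database on the PRD host has to exist before the restore; create it (and the repo_admin account) ahead of time.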

Connect to the new PRD repository and open the Repository Explorer. Then modify the existing database connections so that they keep the same names but point to the production instances of the databases. It is very important that you not change the names of the connections: your jobs and transformations reference these “standardized” names. When you deploy a job to PRD (explained below), you will not have to change the database connections if you do it this way.
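To illustrate why the names matter: Kettle serializes a connection as an XML block roughly like the sketch below (the connection name, hosts, and fields shown are illustrative, not taken from a real export). The DEV and PRD repositories each hold a connection with the same name; only the server (and perhaps database) values differ, so a job that references “dwh_staging” resolves to the right host in either repository:

```
<!-- In the DEV repository -->
<connection>
  <name>dwh_staging</name>
  <server>devdb.example.com</server>
  ...
</connection>

<!-- In the PRD repository: same name, different host -->
<connection>
  <name>dwh_staging</name>
  <server>prddb.example.com</server>
  ...
</connection>
```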

Now you have two (almost) identical repositories, one for Development, and one for Production. So what are you waiting for? Start developing!

In the new development repository, create a subdirectory for your first real job and its transformation steps. This will help you exploit the export functionality of Spoon, which makes it very easy to dump all of the jobs in a repository subdirectory. And that brings me to my next recommendation:

-- Bundle a job and its individual steps (other jobs and transformations) into a single repository subdirectory. Basically, anything that is used by a single-line invocation of kitchen.sh belongs in this subdirectory.
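For example, using the job from the sample script at the end of this post (the transformation names here are made up for illustration), the repository tree for one deployable unit might look like:

```
/source_customers_to_staging/
    job_load_stage_customers       <- the job kitchen.sh invokes
    tr_extract_customers           <- hypothetical transformation step
    tr_load_stage_customers        <- hypothetical transformation step
```

Everything under that one subdirectory gets exported, imported, and versioned together.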

After development and testing of the job is done, you are ready to deploy it to production. This is a two-phase process: export your job and its components from the DEV repository, then import them into the PRD repository. But first, some prerequisites: we need to tidy up the global options of Spoon. In Spoon, choose the menu option:

Edit -> Options

Make sure the following two options are checked:
“Only save used connections to XML”
“Ask about replacing existing connections on open/import”

And UNCHECK (please, or this whole process will break):
“Replace existing connections on open/import”

Now you are ready to export your development job. Using the repository explorer on your DEV repository, expand the “Jobs” folder. Select the sub-folder you want to export, right-click on it, and select “Export all objects to an XML file”. In the dialog box that appears, browse to a temporary directory where you want to save the file, and give it a name. Any name will do. For good luck, call it johndz. Click OK. The export automatically includes the transformations that are used by the jobs. A very cool feature, in my opinion.

Disconnect from your DEV repository.

Connect to your PRD repository, and open the repository explorer. The repository explorer has a single menu item: File. Select the menu option:

File -> Import repository into directory

Navigate to the file that you just exported and click OK. A “Directory Selection” dialog appears. Select the root “/” line and click OK.

For first-time imports, a new dialog appears: “Create Directory”. This will ask whether to create a new repository directory. Answer “Yes”.

The import will proceed. Sometimes it can take a few minutes. When the last line of the “Repository Import” window says “Import finished”, you are done.

To invoke your new job using Kitchen, reference the repository and the job. Here is a sample kitchen.sh invocation from one of my scripts (remember that kitchen.sh reads ~/.kettle/repositories.xml):

#!/bin/ksh
installdir=/usr/local/pentaho/data-integration
${installdir}/kitchen.sh -rep=dev_pentaho -user=etl_user -pass=whatever -dir='source_customers_to_staging' -job='job_load_stage_customers' -level=Basic
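For that -rep=dev_pentaho lookup to resolve, ~/.kettle/repositories.xml has to define the repository and the database connection it sits on. A sketch of what that file can look like is below; the connection name, host, schema, and credentials are placeholders, and the exact set of tags can vary between PDI versions, so treat this as a shape to adapt rather than a file to paste:

```
<repositories>
  <connection>
    <name>dev_repo_db</name>
    <server>devdb.example.com</server>
    <type>MYSQL</type>
    <access>Native</access>
    <database>kettle_repo</database>
    <port>3306</port>
    <username>repo_admin</username>
    <password>whatever</password>
  </connection>
  <repository>
    <name>dev_pentaho</name>
    <description>Development PDI repository</description>
    <connection>dev_repo_db</connection>
  </repository>
</repositories>
```

A second connection/repository pair (say, prd_pentaho) pointing at the production host lets the same script target PRD just by changing the -rep argument.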