We’ve been experimenting lately with database partitioning (available in version 2.3.2-dev; make sure to update your kettle.jar to the latest snapshot). In our context, database partitioning means dividing our data over a cluster of several databases.
A typical way of doing that is to divide the customer_id by the number of hosts in the cluster and take the remainder. If the remainder is 0, you store the data on the first host in the cluster; if it is 1, on the second; if it is 2, on the third; and so on.
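To make the remainder rule concrete, here is a minimal sketch in Python. It is not Kettle code: the function name `partition_for` and the host names are hypothetical, and it assumes customer_id is a non-negative integer.

```python
def partition_for(customer_id: int, hosts: list) -> str:
    """Pick the host that stores the data for this customer_id."""
    if not hosts:
        raise ValueError("the cluster needs at least one host")
    # Remainder 0 maps to the first host, 1 to the second, 2 to the third, etc.
    return hosts[customer_id % len(hosts)]


# Hypothetical three-host cluster, for illustration only.
hosts = ["db1", "db2", "db3"]

for customer_id in (9, 10, 11):
    print(customer_id, "->", partition_for(customer_id, hosts))
```

With three hosts, customer_id 9 lands on the first host, 10 on the second and 11 on the third. Note that changing the number of hosts changes the remainder for most ids, so with this scheme existing data would have to be redistributed.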
This is what we’ve been implementing in Kettle over the last couple of weeks. The reasoning is simple: if one database is not up to the task, split the load over 2, 5 or 10 databases on any number of hosts. (Now imagine all the PCs at work running an in-memory database.)
Besides small changes to the Kettle transformation …