Dear Kettle friends,
On occasion we need to support environments where not only does a lot of data need to be processed, but it also arrives in frequent batches. For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.
In this setting we want to use clustering to put “commodity” computing resources to work in parallel. In this blog post I’ll detail what the general architecture looks like and how to tune memory usage in this environment.
Clustering was first created around the end of 2006. Back then it looked like this.