June 16, 2014 By Severalnines
MongoDB is great at storing clickstream data, but using it to
analyze millions of documents can be challenging. Hadoop provides
a way of processing and analyzing data at large scale. Since it
is a parallel system, workloads can be split across multiple nodes
and computations on large datasets can be done in relatively
short timeframes. MongoDB data can be moved into Hadoop using ETL
tools like Talend or Pentaho Data Integration (Kettle).
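Talend and Kettle are GUI-driven, but the underlying extract-and-load step is easy to sketch in code. Below is a minimal illustration of the same idea, assuming pymongo and the `hdfs` WebHDFS client are installed; the hostnames, database, and collection names are placeholders, not the setup used in this post.

```python
# Minimal sketch: stream documents out of MongoDB and write them as
# newline-delimited JSON into HDFS. All connection details below are
# assumptions for illustration only.
import json
from pymongo import MongoClient
from hdfs import InsecureClient

mongo = MongoClient("mongodb://localhost:27017")      # assumed MongoDB URI
collection = mongo["clickstream"]["events"]           # hypothetical db/collection

# Assumed WebHDFS endpoint on the namenode (default port for Hadoop 2.x).
hdfs_client = InsecureClient("http://namenode:50070", user="hdfs")

with hdfs_client.write("/clickstream/events.json", encoding="utf-8") as writer:
    for doc in collection.find():
        doc["_id"] = str(doc["_id"])  # ObjectId is not JSON-serializable
        writer.write(json.dumps(doc) + "\n")
```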
In this blog, we’ll show you how to integrate your MongoDB and
Hadoop datastores using Talend. We have a MongoDB database
collecting clickstream data from several websites. We’ll create a
job in Talend to extract the documents from MongoDB, transform
them, and then load them into HDFS. We will also show you how to
schedule this job to run every 5 minutes.
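Outside of Talend's own scheduler, a common way to run such a job every 5 minutes is cron. A sketch of a crontab entry, assuming the Talend job has been exported as a shell script; the script and log paths are hypothetical:

```
# Run the exported Talend job every 5 minutes (placeholder paths).
*/5 * * * * /opt/talend/clickstream_job.sh >> /var/log/clickstream_job.log 2>&1
```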
Test Case
We have an application …