Planet MySQL
Showing entries 1 to 10 of 87

Displaying posts with tag: Data Integration

Big Data Integration & ETL - Moving Live Clickstream Data from MongoDB to Hadoop for Analytics
June 16, 2014 By Severalnines

MongoDB is great at storing clickstream data, but using it to analyze millions of documents can be challenging. Hadoop provides a way of processing and analyzing data at large scale. Since it is a parallel system, workloads can be split across multiple nodes, and computations on large datasets can be completed in relatively short timeframes. MongoDB data can be moved into Hadoop using ETL tools like Talend or Pentaho Data Integration (Kettle).
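As a rough illustration of the extract step (outside of Talend), the Python sketch below pulls clickstream documents out of MongoDB with pymongo and writes them as newline-delimited JSON, a format Hadoop tools such as Hive or Pig can pick up once the file is copied into HDFS. The connection string, database, collection and field names are made up for this sketch.

# Hypothetical sketch: export MongoDB clickstream documents as newline-delimited JSON
# so the file can be pushed into HDFS (e.g. "hdfs dfs -put") and analyzed with Hadoop.
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")        # hypothetical connection
collection = client["analytics"]["clickstream"]          # hypothetical db/collection

with open("clickstream.json", "w") as out:
    # Stream the cursor instead of loading millions of documents into memory.
    for doc in collection.find({}, {"_id": 0}):
        out.write(json.dumps(doc, default=str) + "\n")    # default=str handles dates

A Talend (or Kettle) job does essentially the same extract, transform and load work graphically, and adds scheduling and error handling around it.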

 

In this blog, we’ll show you how to integrate your MongoDB and Hadoop datastores using Talend. We have a MongoDB database collecting clickstream data from several websites. We’ll create a job in Talend to extract the documents from MongoDB, transform and then

  [Read more...]
Big Kettle News

Dear Kettle fans,

Today I’m really excited to announce a few very important changes to the Pentaho Data Integration landscape. To me, the changes being announced today are comparable to reaching Kettle version 1.0 some nine years ago, to reaching version 2.0 with plugin support, or even to open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data-related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra and HBase, the specific VFS drivers we created, as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so

  [Read more...]
Data Modeling

Dear data integration fans,

I’m a big fan of “appropriate” data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style.  Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt.  However, the largest group of data warehouse architects uses a technique called dimensional modeling, championed by Ralph Kimball.
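For readers who haven’t met the term, a dimensional model puts the measures you analyze in a central fact table and the descriptive context in surrounding dimension tables. Below is a minimal star-schema sketch; the table and column names are invented, and SQLite is used only to keep it self-contained.

# Minimal star-schema sketch: one fact table of measures, dimension tables of context.
# All names are invented for illustration; SQLite keeps the example self-contained.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT, city TEXT);
CREATE TABLE fact_sales   (date_key     INTEGER REFERENCES dim_date(date_key),
                           customer_key INTEGER REFERENCES dim_customer(customer_key),
                           quantity     INTEGER,
                           amount       REAL);
""")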

Using a modeling technique is very important since it brings structure to your data warehouse.  The techniques used, when applied correctly of course, are

  [Read more...]
What is the biggest challenge for Big Data?

Often I think about the challenges that organizations face with “Big Data”.  While Big Data is a generic and overused term, what I am really referring to is an organization’s ability to disseminate, understand and ultimately benefit from increasing volumes of data.  It is almost without question that in the future customers will be won/lost, competitive advantage will be gained/forfeited and businesses will succeed/fail based on their ability to leverage their data assets.

It may be surprising what I think the near-term challenges are.  Largely, I don’t think they are purely technical.  There are enough wheels in motion now to almost guarantee that data accessibility will continue to improve at a pace in line with the increase in data volume.  Sure, there will continue to be lots of interesting innovation in technology, but

  [Read more...]
NSA, Accumulo & Hadoop

I read yesterday that the NSA has submitted a proposal to Apache to incubate their Accumulo platform.  According to the description, this is a key/value store built on Hadoop which appears to provide similar functionality to HBase, except that it provides “cell level access labels” to allow fine-grained access control.  This is something you would expect as a requirement for many applications built at government agencies like the NSA.  But it is also very important for organizations in health care, law enforcement and so on, where strict control over large volumes of privacy-sensitive data is required.

An interesting part of this is how it highlights the acceptance of Hadoop.

  [Read more...]
IA Ventures - Jobs shout out

My friends over at IA Ventures are looking for both an Analyst and an Associate to join their team.  If Big Data, New York and start-ups are in your blood, then I can’t think of a better VC to be involved with.

From the IA blog:

"IA Ventures funds early-stage Big Data companies creating competitive advantage through data and we’re looking for two start-up junkies to join our team – one full-time associate / community manager and one full time analyst. Because there are only four of us (we’re a start-up ourselves, in fact), we’ll need you to help us investigate companies, learn about industries, develop investment theses, perform internal operations, organize

  [Read more...]
Realtime Data Pipelines

In life there are really two major types of data analytics.  Firstly, we don’t know what we want to know – so we need analytics to tell us what is interesting.  This is broadly called discovery.  Secondly, we already know what we want to know – we just need analytics to tell us this information, often repeatedly and as quickly as possible.  This covers anything from reporting and dashboarding through to more general data transformation and so on.

Typically we are using the same techniques to achieve this.  We shove lots of data into a repository of some form (SQL, MPP SQL, NoSQL, HDFS, etc.) then run queries/jobs/processes across that data to retrieve the information we care about.

Now this makes sense for data discovery.  If we don’t know what we want to know, having lots of data in a big pile that we can slice and dice

  [Read more...]
Real-time streaming data aggregation

Dear Kettle users,

Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly or hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it’s less well known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as “streaming”, “real-time”, “near real-time”, “continuous” and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instant it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud

  [Read more...]
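To make the streaming idea above concrete, here is a small generic Python sketch (explicitly not Kettle code) of the pattern: consume records from a never-ending source and keep a running aggregate that is flushed on a short interval. The queue, field name and interval are all hypothetical.

# Generic streaming-aggregation sketch (illustrative only, not Pentaho Data Integration):
# consume records from a never-ending source and flush per-page counts every few seconds.
import time
from collections import Counter
from queue import Empty, Queue

events = Queue()                 # stand-in for JMS, a log tail, a socket, ...
counts = Counter()
FLUSH_INTERVAL = 5               # seconds between flushes of the running aggregate
last_flush = time.time()

while True:                      # never ending: the defining trait of a streaming job
    try:
        record = events.get(timeout=1)          # e.g. {"page": "/home"}
        counts[record["page"]] += 1
    except Empty:
        pass
    if time.time() - last_flush >= FLUSH_INTERVAL:
        print(dict(counts))      # in practice: write to a target table or dashboard
        counts.clear()
        last_flush = time.time()

Kettle gets the same effect by simply leaving a transformation running, so rows keep flowing from the input step to the target for as long as the source produces them.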
Memory tuning fast paced ETL

Dear Kettle friends,

On occasion we need to support environments where a lot of data needs to be processed, and in frequent batches.  For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.

In this setting we want to use clustering to put “commodity” computing resources to work in parallel.  In this blog post I’ll detail what the general architecture looks like and how to tune memory usage in this environment.
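As a rough, tool-agnostic illustration of that setup (this is not Kettle clustering, just the shape of it in plain Python), a single watcher loop plays the master role and hands each newly arrived file to a pool of worker processes playing the slave role; the folder path and the per-file work are hypothetical.

# Rough illustration of the idea (not Kettle clustering): a watcher dispatches each
# newly arrived file to a pool of worker processes so batches are handled in parallel.
import glob
import os
import time
from multiprocessing import Pool

INBOX = "/data/incoming"                     # hypothetical folder receiving files

def process_file(path):
    # Stand-in for the real transformation: just count the rows in the file.
    with open(path) as f:
        return path, sum(1 for _ in f)

if __name__ == "__main__":
    seen = set()
    with Pool(processes=4) as workers:       # the "slave" capacity
        while True:                          # the "master" loop: watch and dispatch
            new_files = [p for p in glob.glob(os.path.join(INBOX, "*.csv")) if p not in seen]
            seen.update(new_files)
            for path, rows in workers.imap_unordered(process_file, new_files):
                print(path, rows)
            time.sleep(1)

In Kettle the equivalent roles are filled by the master and the dynamically added slave servers described next.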

Clustering was first created around the end of 2006.  Back then it looked like this.

The master

This is the most important part of our cluster.  It takes care of administering the network configuration and topology.  It also keeps track of the state of dynamically added slave servers.

The master is started

  [Read more...]
Dynamic de-normalization of attributes stored in key-value pair tables

Dear Kettlers,

A couple of years ago I wrote a post about key/value tables and how they can ruin the day of any honest person who wants to create BI solutions.  The obvious advice I gave back then was not to use those tables in the first place if you’re serious about a BI solution.  And if you have to, do some denormalization.

However, there are occasions where you need to query such a source system directly and get some reporting going on top of it.  Let’s take a look at an example:

mysql> select * from person;
+----+-------+----------+
| id | name  | lastname |
+----+-------+----------+
|  1 | Lex   | Luthor   |
|  2 | Clark | Kent     |
|  3 | Lois  | Lane     |
+----+-------+----------+
3 rows in set (0.00 sec)

mysql> select * from person_attribute;
+----+-----------+---------------+------------+
| id | person_id | attr_key     
  [Read more...]
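The title promises a dynamic solution, but as a point of reference here is the static way to flatten such a key/value table when the attribute keys are known up front: a conditional-aggregation pivot, shown below run from Python with mysql-connector. The connection details, the attr_value column name and the keys 'email' and 'city' are assumptions, since the real attribute columns are cut off above.

# Sketch of the static alternative: pivot known attribute keys into columns with
# conditional aggregation.  Connection details, the attr_value column and the
# attribute keys 'email' and 'city' are hypothetical.
import mysql.connector

PIVOT_SQL = """
SELECT p.id, p.name, p.lastname,
       MAX(CASE WHEN a.attr_key = 'email' THEN a.attr_value END) AS email,
       MAX(CASE WHEN a.attr_key = 'city'  THEN a.attr_value END) AS city
FROM person p
LEFT JOIN person_attribute a ON a.person_id = p.id
GROUP BY p.id, p.name, p.lastname
"""

conn = mysql.connector.connect(host="localhost", user="test", password="test", database="test")
cursor = conn.cursor()
cursor.execute(PIVOT_SQL)
for row in cursor.fetchall():
    print(row)
conn.close()

The drawback is obvious: every new attribute key means editing the query, which is exactly the problem a dynamic denormalization step in a data integration tool avoids.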

Planet MySQL © 1995, 2014, Oracle Corporation and/or its affiliates