Copying MySQL Data to Hadoop with Minimal Loss of Blood Part 2

I have spent the better part of the last month at Big Data conferences, trying to peer past the $2.5 million in marketing smoke to find out what is really going to show up on the to-do lists of DBAs. The first bit of news is that half the vendors at shows like Strata or Big Data TechCon will probably be gone by this time next year, so picking a vendor right now is a little iffy. Hadoop’s ecosystem is flourishing and will surely be around for some time, but the vendors are playing musical chairs.

But we are Open Source and we do not need vendors! Well, yes and no. The good folks at Cloudera and Hortonworks have done you a big favor by providing wonderful tutorials that are worth your time. Recently two former MySQL-ers, Sarah Sproehnle and Ian Wrigley, put together a Udacity course that concisely teaches Hadoop technology, and Cloudera deserves a round of applause for this well-produced effort. While you can put together your own Hadoop cluster from the various Apache projects, it is often easier to get one from a vendor. You have only so many working hours, and it is nice to be able to lean on someone for help occasionally.

Sadly, many Big Data projects look like middle school science projects: they were done for the sake of having a Big Data project and not for a particular business need. MySQL has been sold for years as working great on commodity hardware, and many Big Data projects are also being sold on the promise that they work great on commodity hardware, and lots of it. So a poor DBA ends up with scads of commodity boxes for MySQL shards, Hadoop clusters, and whatever other tech gets adopted. (Hint: for Christmas, ask for stock in commodity hardware companies, disk drive manufacturers, and electrical utilities.)

Can you guess what the following does?

START a=node(*),
bacon=node:node_auto_index(name="Kevin Bacon")
MATCH p=shortestPath((a)-[:KNOWS*]->(bacon))
RETURN extract(n in nodes(p) : n.name);

It is Neo4j code for finding the Six Degrees of Kevin Bacon. Finding the shortest path of friendship relationships with SQL is going to be nasty, but it can be done easily with the right NoSQL tool. So there is utility to these types of approaches, and ‘Joe Average DBA’ should start planning on learning more, as you will be asked to implement these technologies in 2014.

When I started the first part of this column, I did not intend for it to take this long to get the second part posted. I have been investigating the various ways of bulk and real-time loading of Hadoop filesystems. 80% of Hadoop clusters are fed from MySQL, and the most popular batch loader is Sqoop. Grab a copy of the Apache Sqoop Cookbook, as it details how to copy data from your MySQL instances to HDFS. Sqoop uses a JDBC connector to communicate with your instances.

$ sqoop import --connect jdbc:mysql://localhost/bigdb --username dave --table BigProject
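
Sqoop can also do more than a one-shot dump. Below is a minimal sketch of an incremental import; the id check column, the last-value of 12345, and the target directory are placeholders for illustration, so adjust them to your own schema before running anything.

$ sqoop import --connect jdbc:mysql://localhost/bigdb --username dave -P \
      --table BigProject --target-dir /user/dave/BigProject \
      --incremental append --check-column id --last-value 12345

With --incremental append, each run copies only the rows whose check column is greater than the recorded last value, so repeat loads do not drag the whole table across the wire again.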

Batch loading of bulk data can get tedious. I have been building the MySQL Hadoop Applier from http://labs.mysql.com so that I can get updates into HDFS as simply as MySQL Replication delivers them. Sadly, my cmake is being fussy. But I love the idea of having data flow automatically instead of through a bulk load.
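
For anyone wanting to pick the same fight, the rough shape of the build is sketched below. The paths and environment variables are assumptions about my own setup rather than the official instructions, so consult the README that ships with the labs download before copying any of it.

# Assumed locations; point these at wherever your JDK and Hadoop actually live.
$ export JAVA_HOME=/usr/lib/jvm/default-java
$ export HADOOP_HOME=/usr/local/hadoop
# Generate the build files and compile; this is the step where my cmake is balking.
$ cmake .
$ make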

The concept that my Hadoop cluster is simply another MySQL Replication feed appeals to me.