Moving data between databases is hard. Without ever intending it, I seem to have spent a lifetime working on solutions for getting data into and out of databases, but more frequently between them. In fact, my first job out of university was migrating data from BRS/Text, a free-text database (probably what we would now call a NoSQL database), into a more structured Oracle.
Today I spend some of my time working in Big Data, more often than not migrating information from existing data stores into Big Data so that it can be analysed, something I covered in more detail here:[Read more...]
The third and final article in my series on migrating data to and from Hadoop and SQL databases is now available:
Big data is a term that has been used regularly now for almost a decade, and it, along with technologies like NoSQL, is seen as the replacement for the long-successful RDBMS solutions that use SQL. Today, DB2®, Oracle, Microsoft® SQL Server, MySQL, and PostgreSQL dominate the SQL space and still make up a considerable proportion of the overall market. In this final article of the series, we will look at more automated solutions for migrating data to and from Hadoop. In the previous articles, we concentrated on methods that take exports or otherwise formatted and extracted data from your SQL source, load that into Hadoop in some way, then process or parse it. But if you want to analyze big data,
I have spent the better part of the last month at Big Data conferences trying to look behind the $2.5 million in marketing smoke to see what is really going to show up on the to-do list of DBAs. The first bit of news is that half the vendors at shows like Strata or Big Data TechCon will probably be gone by this time next year. So picking a vendor right now is a little iffy. Hadoop’s ecosystem is flourishing and will surely be around for some time, but the vendors are playing musical chairs.
But we are Open Source and we do not need vendors! Well, yes and no. The good folks at Cloudera and Hortonworks have done you a big favor by providing wonderful tutorials that are worth your time to see. Recently two former MySQL-ers, Sarah Sproehnle and Ian Wrigley, have put together[Read more...]
In the previous article we introduced Hadoop as the most popular Big Data toolset on the market today. We had just started talking about MapReduce as the major framework that makes Hadoop distinctive. So let’s continue the discussion where we left off.
MapReduce is really the key to understanding Hadoop’s parallel processing functionality, as it enables data in various formats (XML, text, binary, log, SQL, etc.) to be divided up and mapped out to many computer nodes and then recombined to produce a final data set.
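To make the divide, map, and recombine flow concrete, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (split_input, map_phase, shuffle, reduce_phase, word_count) and the num_nodes parameter are made up for illustration, and the "nodes" are simply lists processed one after another.

```python
from collections import defaultdict
from itertools import chain


def split_input(records, num_nodes):
    """Divide the input records into one chunk per (simulated) node."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values for each key into one result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(records, num_nodes=3):
    """Run the whole pipeline: split, map each chunk, shuffle, reduce."""
    chunks = split_input(records, num_nodes)
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    lines = ["Hadoop maps data", "Hadoop reduces data", "data everywhere"]
    print(word_count(lines))
```

In a real cluster each chunk would be mapped on a different machine and the grouping step would move data across the network; the sketch only shows the shape of the computation.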
The upcoming Percona Live London conference, November 11-12, features quite a number of talks about the latest MySQL features and related technologies. There will be a lot of talks about the new MySQL 5.6 features:
I’ve got a new article, the first in a new three-part series, on moving data between SQL and Hadoop, covering both exporting to Hadoop and importing processed content back into an SQL store.
In this first one, we look at the basic mechanics and considerations before you start the migration of data, such as the data format, content, and export techniques.
With the close of call for papers earlier this month, the Percona Live London conference committee was in full swing this past week reviewing all of the many submissions for November’s Percona Live London MySQL Conference.
The submissions are far-ranging and cover some really interesting topics, making the lineup for Percona Live London really strong! What the committee looks for in a submission is how much “value” a talk will bring to the[Read more...]
Join me Wednesday, September 18 at 10 a.m. PDT for an hour-long webinar where I will introduce the basic concepts behind column store technology. The webinar’s title is: “Introduction to open source column stores.”
This webinar will talk about Infobright, LucidDB, MonetDB, Hadoop (Impala), and other column stores.
Ask ten DBAs for a definition of ‘Big Data’ and you will get more than ten replies. And the majority of those replies will lead you to Hadoop. Hadoop has been the most prominent of the big data frameworks in the open source world. Over 80% of the Hadoop instances in the world are fed their data from MySQL. But Hadoop is made up of many parts, some confusing and many that do not play nicely with each other. It is analogous to being given a pile of automotive parts from different models and trying to come up with a car at the end of the day. So what do you do if you want to copy some of your relational data into Hadoop and want to avoid the equivalent of scraped knuckles? The answer is Bigtop, and what follows is a way to get a one-node-does-all system running so you can experiment with Hadoop, Map/Reduce, Hive, and all[Read more...]
I will be talking about Big Data with MySQL and Hadoop at MySQL Connect 2013 (Sept. 21-22) in San Francisco as well as at Percona University in Washington, DC (September 12, 2013). Apache Hadoop is a very popular Big Data solution and we can nowadays easily integrate it with MySQL. I will start with a brief introduction of Apache Hadoop and its components (HDFS, Map/Reduce, Hive, HBase/HCatalog, Flume, Sqoop, etc.). Next I will show 2 major Big Data scenarios:
Dolphin and Elephant: an Introduction
This post is intended for MySQL DBAs or Sysadmins who need to start using Apache Hadoop and want to integrate those 2 solutions. In this post I will cover some basic information about Hadoop, focusing on Hive as well as MySQL and Hadoop/Hive integration.
First of all, if you have been dealing with MySQL or any other relational database most of your professional life (like I was), Hadoop may look different. Very different. Apparently, Hadoop is the opposite of any relational database. Unlike the database where we have a set of tables and indexes, Hadoop works with a set of text files. And… there are no indexes at all. And yes, this may be shocking,[Read more...]
Quite frequently, especially with large-scale or complicated applications, we use MySQL alongside other technologies that handle certain tasks such as reporting and caching, or that serve as the main data store for portions of the application.
What technologies for data storage and processing do you use alongside MySQL in your environment? Please feel free to elaborate in the comments about your use case and experiences!
Pivotal launches. SkySQL and Monty Program merge. And much, much more.
Our report on the changes in the MySQL ecosystem is now available for 451 clients and non-clients alike at bit.ly/451mysql
— Matt Aslett (@maslett) April 25, 2013
For 451 Research clients: VMware expands Serengeti’s horizons with updated Hadoop virtualization project bit.ly/17muQFI
— Matt Aslett (@maslett) April 26, 2013
For 451 Research clients: SkySQL, Monty Program merge to support MariaDB following formation of MariaDB Foundation bit.ly/10dsdjf
Continuing on from yesterday, the biggest news that I’ve noted in the past 24 hours:
Enabling Real-Time MySQL to HDFS Integration
Batch processing delivered by Map/Reduce remains central to Apache Hadoop, but as the pressure to gain competitive advantage from “speed of thought” analytics grows, Hadoop itself is undergoing significant evolution. Technologies allowing real-time queries, such as Apache Drill, Cloudera Impala and the Stinger Initiative, are emerging, supported by new generations of resource management with Apache YARN.
To support this growing emphasis on real-time operations, we are releasing a new[Read more...]
A little while ago I blogged about (and open sourced) an Impala-powered soccer visualization demo, designed to demonstrate just how responsive Impala queries can be. Since not everyone has the time or resources to run the project themselves, we’ve decided to host it ourselves on an EC2 instance. You can try the visualization; we’ve also opened up the Impala web interface, where you can see query profiles and performance numbers, and Hue (username and password are both ‘test’), where you can run your own queries on the dataset.
ClearStory sheds light on data analysis service. Illuminating ‘dark data’. More.
For 451 clients: ClearStory bags $9m in series A funding, sheds light on its data analysis service bit.ly/Y6v8sV By Krishna Roy
— Matt Aslett (@maslett) February 12, 2013
For 451 clients: Global IDs makes ‘big data’ MDM play via cloud and Hadoop, touts profitable growth bit.ly/Y6v6kL By Krishna Roy
— Matt Aslett (@maslett) February 12, 2013
ScaleBase releases version 2.0 of its MySQL database scalability software bit.ly/WGtEtN
Teradata results. Funding for DataXu. The chemistry of data. And more.
For 451 Research clients: Oracle launches major update to MySQL open source database bit.ly/TSONAt
— Matt Aslett (@maslett) February 8, 2013
Teradata: Q4 net income $112m on revenue up 10% to $740m, FY net income $419m on revenue up 13% to $2.7bn. bit.ly/14FNS8L