
Displaying posts with tag: Data Integration

Big Data Integration & ETL - Moving Live Clickstream Data from MongoDB to Hadoop for Analytics
June 16, 2014 By Severalnines

MongoDB is great at storing clickstream data, but using it to analyze millions of documents can be challenging. Hadoop provides a way of processing and analyzing data at large scale. Since it is a parallel system, workloads can be split on multiple nodes and computations on large datasets can be done in relatively short timeframes. MongoDB data can be moved into Hadoop using ETL tools like Talend or Pentaho Data Integration (Kettle).

In this blog, we’ll show you how to integrate your MongoDB and Hadoop datastores using Talend. We have a MongoDB database collecting clickstream data from several websites. We’ll create a job in Talend to extract the documents from MongoDB, transform and then

  [Read more...]
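
The job itself is assembled in Talend's designer rather than written by hand. Purely to illustrate what the extract-and-load boils down to, a minimal hand-rolled sketch using the legacy MongoDB Java driver and the Hadoop FileSystem API might look as follows (host, database, collection and target path are made-up example values):

import java.io.BufferedWriter;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.Mongo;

/** Minimal sketch: dump a MongoDB collection into HDFS as one JSON document per line. */
public class ClickstreamExtract {
    public static void main(String[] args) throws Exception {
        // Source: a MongoDB collection (host/database/collection are example values).
        Mongo mongo = new Mongo("localhost", 27017);
        DBCollection events = mongo.getDB("clickstream").getCollection("events");

        // Target: an HDFS file; fs.defaultFS is picked up from the Hadoop configuration.
        FileSystem fs = FileSystem.get(new Configuration());
        BufferedWriter out = new BufferedWriter(new OutputStreamWriter(
                fs.create(new Path("/staging/clickstream/events.json"))));

        // Stream the documents out; DBObject.toString() renders JSON.
        DBCursor cursor = events.find();
        while (cursor.hasNext()) {
            out.write(cursor.next().toString());
            out.newLine();
        }
        out.close();
        cursor.close();
        mongo.close();
    }
}
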
Big Kettle News

Dear Kettle fans,

Today I’m really excited to be able to announce a few really important changes to the Pentaho Data Integration landscape. To me, the changes that are being announced today compare favorably to reaching Kettle version 1.0 some 9 years ago, or reaching version 2.0 with plugin support or even open sourcing Kettle itself…

First of all…

Pentaho is again open sourcing an important piece of software.  Today we’re bringing all big data related software to you as open source software.  This includes all currently available capabilities to access HDFS, MongoDB, Cassandra, HBase, the specific VFS drivers we created as well as the ability to execute work inside of Hadoop (MapReduce), Amazon EMR, Pig and so

  [Read more...]
Data Modeling

Dear data integration fans,

I’m a big fan of “appropriate” data modeling prior to doing any data integration work.  For a number of folks out there that means the creation of an Enterprise Data Warehouse model in classical Bill Inmon style.  Others prefer to use modern modeling techniques like Data Vault, created by Dan Linstedt.  However, the largest group of data warehouse architects uses a technique called dimensional modeling, championed by Ralph Kimball.

Using a modeling technique is very important since it brings structure to your data warehouse.  The techniques used, when applied correctly of course, are

  [Read more...]
What is the biggest challenge for Big Data?

Often I think about the challenges that organizations face with “Big Data”.  While Big Data is a generic and overused term, what I am really referring to is an organization’s ability to disseminate, understand and ultimately benefit from increasing volumes of data.  It is almost without question that in the future customers will be won/lost, competitive advantage will be gained/forfeited and businesses will succeed/fail based on their ability to leverage their data assets.

It may be surprising what I think are the near-term challenges.  Largely I don’t think these are purely technical.  There are enough wheels in motion now to almost guarantee that data accessibility will continue to improve at a pace in line with the increase in data volume.  Sure, there will continue to be lots of interesting innovation with technology, but

  [Read more...]
NSA, Accumulo & Hadoop

I read yesterday that the NSA has submitted a proposal to Apache to incubate its Accumulo platform.  This, according to the description, is a key/value store built over Hadoop which appears to provide similar functionality to HBase, except that it provides “cell level access labels” to allow fine-grained access control.  This is something you would expect as a requirement for many applications built at government agencies like the NSA.  But it is also very important for organizations in health care, law enforcement, etc., where strict control over large volumes of privacy-sensitive data is required.

An interesting part of this is how it highlights the acceptance of Hadoop.

  [Read more...]
IA Ventures - Jobs shout out

My friends over at IA Ventures are looking for both an Analyst and an Associate to join their team.  If Big Data, New York and start-ups are in your blood then I can’t think of a better VC to be involved with.

From the IA blog:

"IA Ventures funds early-stage Big Data companies creating competitive advantage through data and we’re looking for two start-up junkies to join our team – one full-time associate / community manager and one full time analyst. Because there are only four of us (we’re a start-up ourselves, in fact), we’ll need you to help us investigate companies, learn about industries, develop investment theses, perform internal operations, organize

  [Read more...]
Realtime Data Pipelines

In life there are really two major types of data analytics.  Firstly, we don’t know what we want to know – so we need analytics to tell us what is interesting.  This is broadly called discovery.  Secondly, we already know what we want to know – we just need analytics to tell us this information, often repeatedly and as quickly as possible.  This is called anything from reporting or dashboarding through to more general data transformation and so on.

Typically we use the same techniques to achieve both.  We shove lots of data into a repository of some form (SQL, MPP SQL, NoSQL, HDFS, etc.) then run queries/jobs/processes across that data to retrieve the information we care about.

Now this makes sense for data discovery.  If we don’t know what we want to know, having lots of data in a big pile that we can slice and dice

  [Read more...]
Real-time streaming data aggregation

Dear Kettle users,

Most of you usually use a data integration engine to process data in a batch-oriented way.  Pentaho Data Integration (Kettle) is typically deployed to run monthly, nightly or hourly workloads.  Sometimes folks run micro-batches of work every minute or so.  However, it’s less well known that our beloved transformation engine can also be used to stream data indefinitely (never ending) from a source to a target.  This sort of data integration is sometimes referred to as being “streaming”, “real-time”, “near real-time”, “continuous” and so on.  Typical examples of situations where you have a never-ending supply of data that needs to be processed the instant it becomes available are JMS (Java Message Service), RDBMS log sniffing, on-line fraud

  [Read more...]
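
To make the never-ending part concrete, here is a bare-bones sketch of the first example, a JMS consumer; the queue name is a placeholder, and in practice the ConnectionFactory would come from your JMS provider (for example via JNDI):

import javax.jms.Connection;
import javax.jms.ConnectionFactory;
import javax.jms.Message;
import javax.jms.MessageListener;
import javax.jms.Queue;
import javax.jms.Session;
import javax.jms.TextMessage;

/** Bare-bones sketch of a never-ending JMS source. */
public class StreamingSource {
    public void consume(ConnectionFactory factory) throws Exception {
        Connection connection = factory.createConnection();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Queue queue = session.createQueue("clickstream.events"); // placeholder name

        // The listener fires once per message, indefinitely; unlike a batch run
        // there is no natural "end of input".
        session.createConsumer(queue).setMessageListener(new MessageListener() {
            public void onMessage(Message message) {
                try {
                    if (message instanceof TextMessage) {
                        process(((TextMessage) message).getText());
                    }
                } catch (Exception e) {
                    e.printStackTrace();
                }
            }
        });
        connection.start(); // begin delivery; runs until the JVM stops
    }

    private void process(String payload) {
        // hand the row to the transformation...
    }
}
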
Memory tuning fast paced ETL

Dear Kettle friends,

On occasion we need to support environments where a lot of data needs to be processed, and in frequent batches.  For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.

In this setting we want to use clustering to harness “commodity” computing resources in parallel.  In this blog post I’ll detail what the general architecture looks like and how to tune memory usage in this environment.

Clustering was first created around the end of 2006.  Back then it looked like this.

The master

This is the most important part of our cluster.  It takes care of administering network configuration and topology.  It also keeps track of the state of dynamically added slave servers.

The master is started

  [Read more...]
Dynamic de-normalization of attributes stored in key-value pair tables

Dear Kettlers,

A couple of years ago I wrote a post about key/value tables and how they can ruin the day of any honest person who wants to create BI solutions.  The obvious advice I gave back then was to not use those tables in the first place if you’re serious about a BI solution.  And if you have to, do some denormalization.

However, there are occasions where you need to query a source system and get some reports going on it.  Let’s take a look at an example:

mysql> select * from person;
+----+-------+----------+
| id | name  | lastname |
+----+-------+----------+
|  1 | Lex   | Luthor   |
|  2 | Clark | Kent     |
|  3 | Lois  | Lane     |
+----+-------+----------+
3 rows in set (0.00 sec)

mysql> select * from person_attribute;
+----+-----------+---------------+------------+
| id | person_id | attr_key     
  [Read more...]
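
The post goes on to solve this dynamically inside Kettle. Purely to illustrate the mechanics of the de-normalization itself, a hand-rolled sketch in Java might look like this (the attribute keys and values are made up for the example):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: pivot key/value attribute rows into one flat row per person. */
public class Denormalizer {
    /** One (person_id, attr_key, attr_value) row from person_attribute. */
    static class AttrRow {
        final int personId;
        final String key;
        final String value;
        AttrRow(int personId, String key, String value) {
            this.personId = personId; this.key = key; this.value = value;
        }
    }

    /** Group rows by person and collect each person's attributes in a map. */
    static Map<Integer, Map<String, String>> pivot(List<AttrRow> rows) {
        Map<Integer, Map<String, String>> byPerson = new LinkedHashMap<Integer, Map<String, String>>();
        for (AttrRow row : rows) {
            Map<String, String> attrs = byPerson.get(row.personId);
            if (attrs == null) {
                attrs = new LinkedHashMap<String, String>();
                byPerson.put(row.personId, attrs);
            }
            attrs.put(row.key, row.value); // last value wins for duplicate keys
        }
        return byPerson;
    }

    public static void main(String[] args) {
        List<AttrRow> rows = new ArrayList<AttrRow>();
        rows.add(new AttrRow(1, "alias", "Lex"));       // made-up attribute data
        rows.add(new AttrRow(2, "alias", "Superman"));
        rows.add(new AttrRow(2, "city",  "Metropolis"));
        // Person 2 now reads as one flat row: {alias=Superman, city=Metropolis}
        System.out.println(pivot(rows));
    }
}
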
Data Cleaner 2

Dear Kettle friends,

Some time ago while I visited the nice folks from Human Inference in Arnhem, I ran into Kasper Sørensen, the lead developer of DataCleaner.

DataCleaner is an open source data quality tool released (like Kettle) under the LGPL license.  It is essentially to blame for the lack of a profiling tool inside of Kettle.  That is because having DataCleaner available to our users was enough to push the priority of having our own data profiling tool far enough down.

Kasper worked on DataCleaner pretty much in his spare time in the past.  Now that Human Inference has taken over the project I was expecting more frequent updates and

  [Read more...]
Reading from MongoDB

Hi Folks,

Now that we’re blogging again I thought I might as well continue to do so.

Today we’re reading data from MongoDB with Pentaho Data Integration.  We haven’t had a lot of requests for MongoDB support so there is no step to read from it yet.  However, it is surprisingly simple to do with the “User Defined Java Class” step.

For the following sample to work you need to be on a recent 4.2.0-M1 build.  Get it from here.

Then download mongo-2.4.jar and put it in the libext/ folder of your PDI/Kettle distribution.

Then you can read from a collection with the following “User Defined Java Class” code:

import java.math.*;
import java.util.*;
import java.util.Map.Entry;
import
  [Read more...]
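
The listing above is cut off. As a rough sketch, not the original listing, the body of such a “User Defined Java Class” step could look like the following; host, database, collection and the single String output field “json” (defined in the step dialog) are illustrative assumptions, while processRow(), putRow(), RowDataUtil and the get(Fields.Out, ...) accessor are the standard UDJC scaffolding:

import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.Mongo;

public boolean processRow(StepMetaInterface smi, StepDataInterface sdi) throws KettleException
{
  try {
    // Connect, walk the collection once, then signal end-of-stream.
    Mongo mongo = new Mongo("localhost", 27017);           // illustrative host/port
    DBCollection col = mongo.getDB("test").getCollection("clickstream");
    DBCursor cursor = col.find();
    while (cursor.hasNext()) {
      // Emit each document as a JSON string in the "json" output field.
      Object[] row = RowDataUtil.allocateRowData(data.outputRowMeta.size());
      get(Fields.Out, "json").setValue(row, cursor.next().toString());
      putRow(data.outputRowMeta, row);
    }
    cursor.close();
    mongo.close();
  } catch (Exception e) {
    throw new KettleException("Error reading from MongoDB", e);
  }
  setOutputDone();
  return false;
}
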
Parse nasty XLS with dynamic ETL

Dear Kettle friends,

Last year, right after the summer, we introduced in version 4.1 of Pentaho Data Integration the notion of dynamically inserted ETL metadata (YouTube video here).  Since then we have received a lot of positive feedback on this functionality, which encouraged me to extend it to a few more steps.  With support for “CSV Input” and “Select Values” we could already do a lot of dynamic things.  However, we can clearly do a lot better by extending our initiative to a few more steps: “Microsoft Excel Input” (which can also read ODS, by the way), “Row Normalizer” and “Row De-normalizer”.

Below I’ll describe an actual (obfuscated) example that you will probably recognize, as it is as hideous as it is simple in its horrible complexity.

Take a look at this file:

  [Read more...]
Kettle vs Oracle REF CURSOR

Dear Kettle fans,

PDI-200 has been out there for a while now.  Jens created the feature request a little over 3 years ago.  I guess the main thing blocking this issue was not so much a technical problem as a licensing and dependency one (Oracle JDBC dependency and distribution license).

However, now that we have the User Defined Java Class step we can work around those pesky problems. That is because the Java code in there only gets compiled and executed at runtime so it’s perfectly fine to create any sort of dependency in there you like.

The following transformation reads a set of rows from a stored procedure as described on this web page.

In short, our UDJC step executes the following code:

begin ? := sp_get_stocks(?);

  [Read more...]
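
For reference, the plain JDBC pattern behind such a call is to register the first placeholder as a cursor and read it back as an ordinary ResultSet. A sketch, with made-up connection details and an assumed numeric input parameter:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import oracle.jdbc.OracleTypes;

/** Sketch: fetch rows from a stored function returning a REF CURSOR. */
public class RefCursorRead {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection(
                "jdbc:oracle:thin:@//localhost:1521/XE", "scott", "tiger"); // example only

        CallableStatement cs = con.prepareCall("begin ? := sp_get_stocks(?); end;");
        cs.registerOutParameter(1, OracleTypes.CURSOR); // the REF CURSOR result
        cs.setDouble(2, 20.0);                          // assumed numeric input
        cs.execute();

        // The cursor comes back as a plain ResultSet.
        ResultSet rs = (ResultSet) cs.getObject(1);
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        rs.close();
        cs.close();
        con.close();
    }
}
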
Pentaho Kettle Solutions Overview

Dear Kettle friends,

As mentioned in my previous blog post, copies of our new book Pentaho Kettle Solutions are finally shipping.  Roland, Jos and I worked really hard on it and, as you can probably imagine, we were really happy when we finally got the physical version of our book in our hands.

So let’s take a look at what’s in this book, what the concept behind it was and give you an overview of the content…

The concept

Given the fact that Maria’s book, called

  [Read more...]
VLDB 2010

I will be at VLDB 2010 next week.  If anyone on this blog is attending and wants to catch up to discuss start-ups and innovation in DB, NoSQL, Big Data, etc., drop me a line and I will try to meet up.

Book Review : Pentaho 3.2 Data Integration

Dear Kettle fans,

A few weeks ago, when I was stuck in the US after the MySQL User Conference, a new book was published by Packt Publishing.

That all by itself is something that is not too remarkable.  However, this time it’s a book about my brainchild Kettle. That makes this book very special to me. The full title is Pentaho 3.2 Data Integration : Beginner’s Guide (Amazon, Packt).  The title all by itself explains the purpose of this book: give the reader a quick-start when it comes to Pentaho Data Integration (Kettle).

The author María Carina

  [Read more...]
Ingres Vectorwise smokes it!

I work in all markets of the database industry, from web & startup through the largest and most established enterprises.  And to be completely honest, the name Ingres has not come up in conversation very much at all.  10 years ago maybe more often, but recently not all that much.  But Ingres has been quietly ticking away.  Despite being largely off the radar, they still have a sizable and loyal customer base, global offices and a focused & dedicated management team.  And importantly they have an open source business model which actually appears to be working.

I wrote last year that their "behind the

  [Read more...]
Slides from my MySQL UC 2010 presentation

As requested by a few fans out there, here are the slides of my presentation:

Pentaho Data Integration 4.0 and MySQL.pdf

I had a great time at the conference, met a lot of nice folks, friends, customers, partners and colleagues.  After the conference, like so many of you, I was unable to get back home because of the Paul Simon singing Eyjafjallajökull volcano in Iceland.

So I ended up flying over to Orlando for a week of brutal PDI 4.0 RC1 hacking with the rest of the l33t super Pentaho development team.  However, after 2+ weeks away from home, even a severe storm over Philadelphia couldn’t prevent me from getting home eventually.

Until next time,
Matt

MySQL User Conference 2010

Dear Kettle and MySQL fans,

Next week I’ll be strolling around the MySQL user conference in Santa Clara.  Even better, I’ll be presenting Tuesday afternoon (3:05pm).  The topic is Pentaho Data Integration 4.0 and MySQL.

The presentation will show you what the world’s most popular open source data integration tool can do for a MySQL user.  It will include practical examples and will showcase the latest improvements present in the brand new version 4.0.

Even more than the presentation itself, I’m looking forward to meeting you all over there.  The regular crowd, MySQL users, Pentaho partners, folks from

  [Read more...]
What is Big Data?

Image by Aranda\Lasch via Flickr

One of my favorite terms at the moment is “Big Data”.  While all terms are by nature subjective, in this post I will try and explain what Big Data means to me.

So what is Big Data?

Big Data is the “modern scale” at which we are defining our data usage challenges.  Big Data begins at the point where we need to seriously start thinking about the technologies used to drive our information needs.

While Big Data as a term seems to refer to volume, this isn’t the case.  Many existing technologies have little problem physically handling large volumes (TB or PB) of data.  Instead the Big Data



  [Read more...]
Re-Introducing UDJC

Dear Kettle fans,

Daniel & I had a lot of fun in Orlando last week. Among other things we worked on the User Defined Java Class (UDJC) step.  If you have a bit of Java experience, this step allows you to quickly write your own plugin in the form of a step. This step is available in recent builds of Pentaho Data Integration (Kettle) version 4.

Now, how does this work?  Well, let’s take Roland Bouman’s example: the calculation of the date of Easter.  In this blog post, Roland explains how to calculate Easter in MySQL and Kettle using JavaScript.  OK, so what if you want this calculation to be really fast in Kettle?  Well, then you can turn to pure Java to do the job…

import java.util.*;
  [Read more...]
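
The listing is cut off above. Whatever Roland's exact code looks like, the pure-Java core is the well-known anonymous Gregorian (Butcher/Meeus) algorithm for Easter Sunday; a sketch for illustration:

import java.util.Calendar;
import java.util.GregorianCalendar;

/** Anonymous Gregorian (Butcher/Meeus) algorithm for Easter Sunday. */
public class Easter {
    public static Calendar easterFor(int year) {
        int a = year % 19;
        int b = year / 100, c = year % 100;
        int d = b / 4, e = b % 4;
        int f = (b + 8) / 25;
        int g = (b - f + 1) / 3;
        int h = (19 * a + b - d - g + 15) % 30;
        int i = c / 4, k = c % 4;
        int l = (32 + 2 * e + 2 * i - h - k) % 7;
        int m = (a + 11 * h + 22 * l) / 451;
        int month = (h + l - 7 * m + 114) / 31;             // 3 = March, 4 = April
        int day = ((h + l - 7 * m + 114) % 31) + 1;
        return new GregorianCalendar(year, month - 1, day); // Calendar months are 0-based
    }

    public static void main(String[] args) {
        System.out.println(easterFor(2010).getTime()); // Sun Apr 04 ... 2010
    }
}

Being plain integer arithmetic, this is exactly the kind of hot inner loop where a compiled UDJC step beats interpreted JavaScript.
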
Back from Blogging Hiatus - Update 3

Image by Nathan Lanier via Flickr

<< Back from Blogging Hiatus - Update 2

Ingres

No specific announcements from Ingres other than I think the VectorWise stuff is progressing well.

To me Ingres is a bit of a dark horse.  They are open source and doing reasonable revenues.  And they are active in the enterprise market (something MySQL hasn’t really achieved).  But they remain largely

  [Read more...]
Back from Hiatus - Summary Update 2

Back from Hiatus - Summary Update 1

GoodData

GoodData has launched and is providing a cloud-based analytics platform for use in integration with online apps.  They are starting with some initial focus on SalesForce data, but working hard on expanding the list of ISVs who choose to provide their customers analytics via GoodData.

GoodData was started by “good guy” Czech serial entrepreneur Roman Stanek (NetBeans) and has just raised funds from Andreessen Horowitz and appointed Tim O’Reilly to the board.  GoodData is interesting because it is simple, accessible and available on demand.  Still early days


  [Read more...]
Back from Hiatus - Summary Update 1
Here is a summary of the key discussions I have had over the last month.  Keep in mind, I’m no analyst.  This is largely opinion based on various conversations I have had with the relevant companies (for analyst insight see Curt Monash).

KickFire

I think Kickfire has been doing it a little tough lately.  The difficulties of a startup launching a hardware appliance (and the associated logistics), combined with being too focused on the MySQL customer base, have impacted the growth of this interesting start-up.  But they aren’t taking it lying down: they have adjusted the strategy and added a new appliance to the range.  Kickfire now seems to have a stronger focus

  [Read more...]
Is the RDBMS doomed (yada yada yada)?

Image by Snooch2TheNooch via Flickr

I was speaking with Michael Stonebraker this morning.  I mentioned that lately many have been referencing comments he has made over the last couple of years.  And I also mentioned that many had interpreted them as implying the RDBMS is “doomed”.  Mike has been saying the same thing for years, but the current NoSQL movement seems to have picked up on this, highlighting one of the RDBMS's own

  [Read more...]
VectorWise


I was fortunate enough to speak with Marcin Zukowski earlier about VectorWise.  If you missed it, VectorWise came out of stealth mode a day or two ago.  They have announced a joint partnership with Ingres and essentially are claiming impressive analytic RDBMS performance gains on conventional hardware.

To start with, a key message that I think needs to be communicated here is that this is not a product announcement.  Ingres and VectorWise have announced a partnership in which they of course plan to build products together; today those products are still in the works.

VectorWise is a spin out of


  [Read more...]
The NoSQL community needs to engage the DBAs

The NoSQL movement has been gaining some steam lately, with discussion forums and mailing lists popping up all around the web.  Despite having a career that has been centered on the RDBMS, I have made no secret that I think we have gone too far with our RDBMS-for-everything mindset.  I think we need to add a few more tools back into our data toolbox.

Today, 99.5% of new data-centric developments will use an RDBMS by default.  Maybe 0.5% will consider using something as obtuse as a NoSQL platform.  From experience I know the majority of people discussing NoSQL platforms today are web developers.  In

  [Read more...]
HamsterDB

This post was a bit of a test to see if I could write a serious post about a database platform called Hamster.  I think I just made it :)

With all the noise over key/value stores recently, we should keep in mind that this technology isn’t exactly new.  It is being applied to new problems, but many of the foundations have been around for decades.  Probably the oldest of them all, Berkeley DB came into existence during the mid-’80s and now has over 200 million deployments (according to the Oracle web site).

HamsterDB, while not having the same pedigree as Berkeley DB, has been steadily worked on by

  [Read more...]
HadoopDB discussion with Daniel Abadi


I spoke to Daniel Abadi this morning about his HadoopDB announcement that came out a couple of days back.  I am sure this has been a busy time for Daniel and his team over at Yale, as HadoopDB has been getting a lot of interest which I am sure will continue to build.

Some notes from our discussion:

  • HadoopDB is primarily focused on high scalability and the required availability at scale.  Daniel questions current MPPs’ ability to truly scale past 100 nodes, whereas Hadoop has real examples on 3000+ nodes.
  • HadoopDB, like many MPP analytical database platforms

  [Read more...]
