Showing entries 1 to 13

Displaying posts with tag: ETL (reset)

Big Data Integration & ETL - Moving Live Clickstream Data from MongoDB to Hadoop for Analytics
June 16, 2014 By Severalnines

MongoDB is great at storing clickstream data, but using it to analyze millions of documents can be challenging. Hadoop provides a way of processing and analyzing data at large scale. Since it is a parallel system, workloads can be split on multiple nodes and computations on large datasets can be done in relatively short timeframes. MongoDB data can be moved into Hadoop using ETL tools like Talend or Pentaho Data Integration (Kettle).
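As a rough illustration of the extract-transform-load flow described above, here is a minimal plain-Python sketch. It is a stand-in for what an ETL tool such as Talend would do, not Talend's own method. The document fields (`ts`, `site`, `page`, `user.id`) are assumed for illustration, and an in-memory buffer stands in for a file that would eventually land in Hadoop.

```python
import csv
import io

def transform(doc):
    """Flatten one (hypothetical) clickstream document into a flat row."""
    return [
        doc["ts"],
        doc["site"],
        doc["page"],
        # Anonymous visitors have no user sub-document; emit an empty field.
        doc.get("user", {}).get("id", ""),
    ]

def load(rows, out):
    """Write rows as tab-delimited text, a common input format for Hadoop jobs."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)

# "Extract": in a real job these documents would be read from MongoDB.
docs = [
    {"ts": "2014-06-16T10:00:00", "site": "example.com", "page": "/home",
     "user": {"id": "u1"}},
    {"ts": "2014-06-16T10:00:05", "site": "example.org", "page": "/pricing"},
]

buffer = io.StringIO()
load((transform(d) for d in docs), buffer)
output = buffer.getvalue()
```

Flattening nested documents into delimited rows keeps the downstream job simple, at the cost of deciding up front which fields matter.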


In this blog, we’ll show you how to integrate your MongoDB and Hadoop datastores using Talend. We have a MongoDB database collecting clickstream data from several websites. We’ll create a job in Talend to extract the documents from MongoDB, transform and then

  [Read more...]
MariaDB CONNECT Storage Engine as an ETL (or ELT)?

The MariaDB CONNECT Storage Engine allows access to heterogeneous data sources. In my previous post I showed you how to use the MariaDB CONNECT Storage Engine to access an Oracle database. This is quite easy through the CONNECT Storage Engine ODBC table type.

For most architectures where heterogeneous databases are involved, an ETL (Extract-Transform-Load) is [...]

Exploring SAP HANA – Powering Next Generation Analytics
SAP HANA, having entered the data 2.0/3.0 space at the right time, has been gaining traction lately, and there will be a lot of users like me who want to [...]
Take the time now for gains later.
Regardless of which data warehouse paradigm you follow or have heard of, Kimball or Inmon, we should all agree that the data warehouse is often a requirement for business. Different people want different things and they all want it from your data. The data warehouse is not a new concept, and yet it is overlooked at times. A warehouse is never complete; it is an evolving entity that adjusts with the requirements it is given. It is up to us to make sure that accurate and timely access to enterprise data is easy and the standard. MySQL can handle a data warehouse perfectly.
MySQL databases are designed in numerous ways, some good, some bad. A warehouse can take that data and organize it for the best use of others. What concerns or issues do you often hear when it comes to gathering data from your database? It is easy for all of your developers to query
  [Read more...]
HPCC vs Hadoop at a glance


Since this article was written, HPCC has undergone a number of significant changes and updates. These address some of the criticism voiced in this blog post, such as the license (updated from AGPL to Apache 2.0) and integration with other tools. For more information, refer to the comments placed by Flavio Villanustre and Azana Baksh.

The original article can be read unaltered below:

Yesterday I noticed this tweet by Andrei Savu. This prompted me to read the related GigaOM article and then check out the [Read more...]
Memory tuning fast paced ETL

Dear Kettle friends,

on occasion we need to support environments where not only a lot of data needs to be processed but also in frequent batches.  For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.

In this setting we want to use clustering to use “commodity” computing resources in parallel. In this blog post I’ll detail what the general architecture would look like and how to tune memory usage in this environment.

Clustering was first created around the end of 2006.  Back then it looked like this.

The master

This is the most important part of our cluster. It takes care of administering network configuration and topology. It also keeps track of the state of dynamically added slave servers.

The master is started

  [Read more...]
Dynamic de-normalization of attributes stored in key-value pair tables

Dear Kettlers,

A couple of years ago I wrote a post about key/value tables and how they can ruin the day of any honest person that wants to create BI solutions.  The obvious advice I gave back then was to not use those tables in the first place if you’re serious about a BI solution.  And if you have to, do some denormalization.
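To make "do some denormalization" concrete, here is a minimal Python sketch that pivots key/value attribute rows into one wide record per entity. It is not the Kettle approach discussed in this post, only an illustration of the idea; the function name `denormalize`, the tuple layouts and the sample values are all assumptions made for the example.

```python
from collections import defaultdict

def denormalize(entities, attributes):
    """Pivot (entity_id, key, value) rows into one wide dict per entity."""
    # Group attribute values by the entity they belong to.
    by_entity = defaultdict(dict)
    for entity_id, key, value in attributes:
        by_entity[entity_id][key] = value

    # Collect every key seen so that all output rows share the same columns.
    keys = sorted({key for _, key, _ in attributes})

    rows = []
    for entity_id, label in entities:
        row = {"id": entity_id, "label": label}
        for key in keys:
            # A missing attribute becomes None instead of dropping the column.
            row[key] = by_entity[entity_id].get(key)
        rows.append(row)
    return rows

# Hypothetical sample data, invented for this sketch.
entities = [(1, "first"), (2, "second"), (3, "third")]
attributes = [
    (1, "color", "red"),
    (2, "color", "blue"),
    (2, "size", "large"),
]

wide_rows = denormalize(entities, attributes)
```

The set of output columns is only known after scanning the attribute rows, which is why the pivot has to be dynamic.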

However, there are occasions where you need to query a source system and get some report going on it. Let’s take a look at an example:

mysql> select * from person;
+----+-------+----------+
| id | name  | lastname |
+----+-------+----------+
|  1 | Lex   | Luthor   |
|  2 | Clark | Kent     |
|  3 | Lois  | Lane     |
+----+-------+----------+
3 rows in set (0.00 sec)

mysql> select * from person_attribute;
| id | person_id | attr_key     
  [Read more...]
Parse nasty XLS with dynamic ETL

Dear Kettle friends,

Last year, right after the summer in version 4.1 of Pentaho Data Integration, we introduced the notion of dynamically inserted ETL metadata (Youtube video here). Since then we have received a lot of positive feedback on this functionality, which encouraged me to extend it to a few more steps. Already with support for “CSV Input” and “Select Values” we could do a lot of dynamic things. However, we can clearly do a lot better by extending our initiative to a few more steps: “Microsoft Excel Input” (which can also read ODS by the way), “Row Normalizer” and “Row De-normalizer”.

Below I’ll describe an actual (obfuscated) example that you will probably recognize, as it is as hideous as it is simple in its horrible complexity.

Take a look at this file:

  [Read more...]
Log Buffer #182, a Carnival of the Vanities for DBAs

This is the 182nd edition of Log Buffer, the weekly review of database blogs. Make sure to read the whole edition so you do not miss where to submit your SQL limerick!

This week started out with me posting about International Women’s Day, and has me personally attending Confoo (Montreal), which is an excellent conference I hope to return to next year. I learned a lot from Confoo, especially the session on blending NoSQL and SQL that I attended.

This week was also the Hotsos Symposium. Doug’s

  [Read more...]
Reporting redefined - How the Kickfire MySQL appliance simplifies data marts and analytics for the mass market.

The Kickfire appliance is designed for business intelligence and analytical workloads, as opposed to OLTP (online transaction processing) environments. Most of the focus in the MySQL area right now revolves around increasing performance for OLTP-type workloads, which makes sense as this is the traditional workload that MySQL has been used for. In contrast, Kickfire focuses squarely on analytic environments, delivering high-performance execution of analytical and reporting queries.

A MySQL server with fast processors, fast disks (or SSDs) and a lot of memory will answer many OLTP queries easily. Kickfire will outperform such a server for typical analytical queries such as aggregation over a large number of rows.
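The difference between the two query shapes can be sketched in a few lines of Python. This example uses the standard-library `sqlite3` module purely so that it is self-contained; it says nothing about how MySQL or Kickfire execute queries, and the `invoice` table, its columns and the sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice ("
    " id INTEGER PRIMARY KEY, item TEXT, quantity INTEGER, shipping_address TEXT)"
)
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?, ?)",
    [
        (1, "widget", 5, "1 Main St"),
        (2, "gadget", 2, "9 Elm Ave"),
        (3, "widget", 7, "4 Oak Rd"),
    ],
)

def shipping_address(conn, invoice_id):
    """OLTP-style point lookup: touches a single row through the primary key."""
    row = conn.execute(
        "SELECT shipping_address FROM invoice WHERE id = ?", (invoice_id,)
    ).fetchone()
    return row[0] if row else None

def quantity_per_item(conn):
    """Analytical-style query: scans every row and aggregates per item."""
    return dict(
        conn.execute("SELECT item, SUM(quantity) FROM invoice GROUP BY item")
    )

address = shipping_address(conn, 2)
totals = quantity_per_item(conn)
```

The lookup cost stays roughly constant as the table grows, while the aggregate has to read every row, which is the kind of workload the appliance targets.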

A typical OLTP query might ask “What is the shipping address for this invoice?”.  Contrast this with a typical analytical query, which asks “How much of this item did

  [Read more...]
A case for Kettle for your next ETL or data warehouse project

I am, for the most part, a do-it-yourself type of person. I fix my own car if I can; I even have four healthy tomato plants growing in pots outside as we speak. The plants will take that little extra CO2 out of the air and give me great-tasting tomatoes (soon… I hope!).

But I digress.

Whether to use an ETL tool such as Kettle (aka Pentaho Data Integration) for a project involving large data transfers is a typical “build vs. buy” type of decision, one that is fairly well understood, and I don’t wish to repeat it all here. By putting together some Perl scripts to do the job, you typically get great performance, development speed and accessibility. This would need to be balanced against the benefits of ETL

  [Read more...]
Open Source ETL tools vs Commercial ETL tools
Recently I have been asked by my company to make a case for open-source ETL and data integration tools as an alternative to the commercial data integration tool, Informatica PowerCenter.
So I did a lot of research and I'm going to try my best, considering I have never used the open-source tools nor the commercial one.

I found plenty of information about comparisons between Pentaho Kettle and Talend, which were two of the open-source tools I was supposed to research.
Now, without getting into a big argument (or Matt Casters posting on my blog), I'd like to attempt to compare the two, very briefly.
And again, this is ONLY from the research I did online and not based on my experience using the tools (since I don't really have any).

  [Read more...]
Analyzing Opportunities in SugarCRM

Since we released our first Beta, we have been working on various examples to demonstrate the capabilities and benefits of a resource oriented approach to data integration.

One of the examples we have been working on is a data mart for SugarCRM opportunity analysis. We have now published that example on our content download site, packages.snaplogic.org, where you can download it and try it out. (You will also need a SnapLogic server installation to run the pipelines.)

The general idea behind data marts is simple – they are subject-specific alternatives to a full-blown data warehouse. The primary benefits of using a separate database to analyze an operational system are the ability to look at a snapshot of the constantly changing data, and the offloading of the queries to a separate database which is optimized for analysis using a star

  [Read more...]
