Here is a typical “Big” data architecture that covers most of the components involved in the data pipeline. More or less, we have the same architecture in production in a number of places [...]
Inspired by a post from Juice Analytics.
We are a conflicted people. We love our TV and movie violence but worry that it ruins our children’s minds. We want to reduce healthcare costs, but don’t want to restrict the free market.
Conflicts like these leave little room for a satisfactory answer. Basic principles are in conflict and deeply-rooted desires run up against painful consequences. We …
Regardless of which data warehouse paradigm you follow or have heard of, Kimball or Inmon, we should all agree that a data warehouse is often a requirement for the business. Different people want different things, and they all want it from your data. The data warehouse is not a new concept, and yet it is overlooked at times. A warehouse is never complete; it is an evolving entity that adjusts to the requirements it is given. It is up to us to make sure that accurate, timely access to enterprise data is easy and is the standard. MySQL can handle a data warehouse perfectly.
MySQL databases are designed in numerous ways, some good, some bad. A warehouse can take that data and organize it for the best use of others. What concerns or issues do you often hear when it comes to gathering data from your database? Is it easy for all of your developers to query and get the same data? How many ways does your company slice and dice data? …
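To make that concrete, here is a minimal sketch of what “organizing the data for the best use of others” can look like in MySQL: a tiny star schema plus the kind of query everyone can then run against it. The table and column names are made up for illustration and are not from the original post.

-- Hypothetical dimensional model (illustrative names only)
CREATE TABLE dim_date (
  date_id     INT PRIMARY KEY,        -- e.g. 20110315
  full_date   DATE NOT NULL,
  year_number SMALLINT NOT NULL,
  month_name  VARCHAR(10) NOT NULL
);

CREATE TABLE dim_product (
  product_id   INT PRIMARY KEY,
  product_name VARCHAR(100) NOT NULL,
  category     VARCHAR(50) NOT NULL
);

CREATE TABLE fact_sales (
  date_id    INT NOT NULL,
  product_id INT NOT NULL,
  quantity   INT NOT NULL,
  amount     DECIMAL(12,2) NOT NULL,
  KEY (date_id),
  KEY (product_id)
);

-- Everyone slices and dices from the same, agreed-upon place:
SELECT d.year_number, p.category, SUM(f.amount) AS revenue
FROM fact_sales f
JOIN dim_date d    ON d.date_id = f.date_id
JOIN dim_product p ON p.product_id = f.product_id
GROUP BY d.year_number, p.category;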
Update
Since this article was written, HPCC has undergone a number of significant changes and updates. These address some of the criticism voiced in this blog post, such as the license (updated from AGPL to Apache 2.0) and integration with other tools. For more information, refer to the comments left by Flavio Villanustre and Azana Baksh.
The original article can be read unaltered below:
Yesterday I noticed this tweet by Andrei Savu. This prompted me to read the related GigaOM article and then check out the HPCC Systems …
[Read more]

Dear Kettle friends,
on occasion we need to support environments where a lot of data needs to be processed, and in frequent batches at that. For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.
In this setting we want to use clustering to harness “commodity” computing resources in parallel. In this blog post I’ll detail what the general architecture looks like and how to tune memory usage in this environment.
Clustering was first created around the end of 2006. Back then it looked like this.
The master
This is the most important part of our cluster. It takes care of administering the network configuration and topology. It also keeps track of the state of dynamically added slave servers.
The master …
[Read more]

Dear Kettlers,
A couple of years ago I wrote a post about key/value tables and how they can ruin the day of any honest person who wants to create BI solutions. The obvious advice I gave back then was not to use those tables in the first place if you’re serious about a BI solution. And if you have to, do some denormalization.
However, there are occasions when you need to query a source system and get some reports going on it. Let’s take a look at an example:
mysql> select * from person;
+----+-------+----------+
| id | name  | lastname |
+----+-------+----------+
|  1 | Lex   | Luthor   |
|  2 | Clark | Kent     |
|  3 | Lois  | Lane     |
+----+-------+----------+
3 rows in set (0.00 sec)

mysql> select * from person_attribute;
+----+-----------+---------------+------------+
| id | person_id | attr_key      | attr_value |
…
[Read more]
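As a side note, one common way to flatten a key/value table like person_attribute above back into columns in plain SQL is conditional aggregation. The attribute names used below ('city' and 'dob') are assumptions for illustration; the post's actual attributes are in the truncated output above.

-- Pivot key/value rows into columns (attribute names are assumed)
SELECT p.id,
       p.name,
       p.lastname,
       MAX(CASE WHEN a.attr_key = 'city' THEN a.attr_value END) AS city,
       MAX(CASE WHEN a.attr_key = 'dob'  THEN a.attr_value END) AS dob
FROM person p
LEFT JOIN person_attribute a ON a.person_id = p.id
GROUP BY p.id, p.name, p.lastname;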
Dear Kettle friends,
Last year, right after the summer, in version 4.1 of Pentaho Data Integration, we introduced the notion of dynamically inserted ETL metadata (YouTube video here). Since then we have received a lot of positive feedback on this functionality, which encouraged me to extend it to a few more steps. Already with support for “CSV Input” and “Select Values” we could do a lot of dynamic things. However, we can clearly do a lot better by extending our initiative to a few more steps: “Microsoft Excel Input” (which can also read ODS, by the way), “Row Normalizer” and “Row De-normalizer”.
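For readers who haven’t used those last two steps: “Row Normalizer” turns repeated columns into key/value style rows, and “Row De-normalizer” pivots such rows back into columns. Kettle does this inside the transformation engine, but the equivalent reshaping, sketched here in SQL with made-up table and column names, looks roughly like this:

-- "Row Normalizer" direction: wide columns -> key/value rows (illustrative names)
SELECT id, 'q1' AS period, q1_sales AS sales FROM wide_sales
UNION ALL
SELECT id, 'q2' AS period, q2_sales AS sales FROM wide_sales;

-- "Row De-normalizer" direction: key/value rows -> wide columns
SELECT id,
       MAX(CASE WHEN period = 'q1' THEN sales END) AS q1_sales,
       MAX(CASE WHEN period = 'q2' THEN sales END) AS q2_sales
FROM narrow_sales
GROUP BY id;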
Below I’ll describe an actual (obfuscated) example that you will probably recognize, as it is as hideous as it is simple in its horrible complexity.
Take a look at this file:
Let’s assume that this spreadsheet …
[Read more]

This is the 182nd edition of Log Buffer, the weekly review of database blogs. Make sure to read the whole edition so you do not miss where to submit your SQL limerick!
This week started out with me posting about International Women’s Day, and has me personally attending Confoo (Montreal), an excellent conference I hope to return to next year. I learned a lot from Confoo, especially the session I attended on blending NoSQL and SQL.
This week was also the Hotsos Symposium. …
[Read more]

The Kickfire appliance is designed for business intelligence and analytical workloads, as opposed to OLTP (online transaction processing) environments. Most of the focus in the MySQL area right now revolves around increasing performance for OLTP-type workloads, which makes sense as this is the traditional workload that MySQL has been used for. In contrast, Kickfire focuses squarely on analytic environments, delivering high-performance execution of analytical and reporting queries.
A MySQL server with fast processors, fast disks (or SSDs) and lots of memory will answer many OLTP queries easily. Kickfire will outperform such a server for typical analytical queries, such as aggregation over a large number of rows.
A typical OLTP query might ask “What is the shipping address for this invoice?”. Contrast this with a typical analytical query, which asks “How much of this item did we sell in all of …
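Expressed as SQL against a hypothetical invoicing schema (the table and column names below are assumptions, not from Kickfire or the original post), the contrast looks like this:

-- OLTP: fetch a single row by key
SELECT shipping_address
FROM invoices
WHERE invoice_id = 1001;

-- Analytical: aggregate over a large number of rows
SELECT SUM(quantity) AS units_sold
FROM order_lines
WHERE item_id = 42
  AND order_date >= '2010-01-01';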
[Read more]

I am, for the most part, a do-it-yourself type of person. I fix my own car if I can; I even have four healthy tomato plants growing in pots outside as we speak. The plants will take that little extra CO2 out of the air and give me great-tasting tomatoes (soon… I hope!)
But I digress.
Whether to use an ETL tool such as Kettle (aka Pentaho Data Integration) for a project involving large data transfers is a typical “build vs. buy” decision, one that is fairly well understood, so I don’t wish to repeat it all here. By putting together some Perl scripts to do the job, you typically get great performance, development speed and accessibility. This would need to be balanced against the benefits of ETL tools and their potential drawbacks (development speed, license costs and performance …
[Read more]