Since this article was written, HPCC has undergone a number of significant changes and updates. These address some of the critiques voiced in this blog post, such as the license (which changed from AGPL to Apache 2.0) and integration with other tools. For more information, see the comments by Flavio Villanustre and Azana Baksh.
The original article can be read unaltered below:
Yesterday I noticed a tweet by Andrei Savu. This prompted me to read the related GigaOM article and then check out the [Read more...]
Dear Kettle friends,
on occasion we need to support environments where a lot of data needs to be processed, and in frequent batches. For example, a new data file with hundreds of thousands of rows arrives in a folder every few seconds.
In this setting we want to use clustering to harness “commodity” computing resources in parallel. In this blog post I’ll detail what the general architecture looks like and how to tune memory usage in this environment.
Clustering was first created around the end of 2006. Back then it looked like this.
This is the most important part of our cluster. It takes care of administering the network configuration and topology. It also keeps track of the state of dynamically added slave servers.
The master is started[Read more...]
A couple of years ago I wrote a post about key/value tables and how they can ruin the day of any honest person who wants to create BI solutions. The obvious advice I gave back then was not to use those tables in the first place if you’re serious about a BI solution, and if you have to, to do some denormalization.
However, there are occasions where you need to query a source system and get some reports running against it. Let’s take a look at an example:
mysql> select * from person;
+----+-------+----------+
| id | name  | lastname |
+----+-------+----------+
|  1 | Lex   | Luthor   |
|  2 | Clark | Kent     |
|  3 | Lois  | Lane     |
+----+-------+----------+
3 rows in set (0.00 sec)

mysql> select * from person_attribute;
+----+-----------+---------------+------------+
| id | person_id | attr_key[Read more...]
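To make the pain concrete, here is a minimal sketch of the usual workaround for querying such key/value tables: conditional aggregation to pivot attribute rows back into columns. The tables mirror the excerpt above, but the attribute key ('alias') and its values are invented for illustration, and SQLite stands in for MySQL; the SQL itself is portable.

```python
import sqlite3

# Recreate the person / person_attribute tables from the excerpt
# (attribute rows are hypothetical examples).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT, lastname TEXT);
CREATE TABLE person_attribute (
    id INTEGER PRIMARY KEY, person_id INTEGER,
    attr_key TEXT, attr_value TEXT);
INSERT INTO person VALUES (1,'Lex','Luthor'),(2,'Clark','Kent'),(3,'Lois','Lane');
INSERT INTO person_attribute (person_id, attr_key, attr_value) VALUES
    (1,'alias','evil genius'),
    (2,'alias','Superman'),
    (3,'alias','reporter');
""")

# Pivot: one row per person, one column per attribute key.
# MAX(CASE ...) picks the matching value; non-matching rows yield NULL,
# which MAX ignores. Add one such expression per attribute key.
rows = cur.execute("""
SELECT p.name,
       MAX(CASE WHEN a.attr_key = 'alias' THEN a.attr_value END) AS alias
FROM person p
LEFT JOIN person_attribute a ON a.person_id = p.id
GROUP BY p.id, p.name
ORDER BY p.id
""").fetchall()
print(rows)  # [('Lex', 'evil genius'), ('Clark', 'Superman'), ('Lois', 'reporter')]
```

The downside this illustrates is exactly the post's point: every new attribute key means another hand-written CASE expression, which is why denormalizing at load time is usually the better option.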
Dear Kettle friends,
Last year, right after the summer in version 4.1 of Pentaho Data Integration, we introduced the notion of dynamically inserted ETL metadata (Youtube video here). Since then we received a lot of positive feedback on this functionality which encouraged me to extend it to a few more steps. Already with support for “CSV Input” and “Select Values” we could do a lot of dynamic things. However, we can clearly do a lot better by extending our initiative to a few more steps: “Microsoft Excel Input” (which can also read ODS by the way), “Row Normalizer” and “Row De-normalizer”.
Below I’ll describe an actual (obfuscated) example that you will probably recognize, as it is as hideous as it is simple in its horrible complexity.
Take a look at this file:[Read more...]
This is the 182nd edition of Log Buffer, the weekly review of database blogs. Make sure to read the whole edition so you do not miss where to submit your SQL limerick!
This week started out with me posting about International Women’s Day, and has me personally attending Confoo (Montreal), an excellent conference I hope to return to next year. I learned a lot from Confoo, especially from the session I attended on blending NoSQL and SQL.[Read more...]
The Kickfire appliance is designed for business intelligence and analytical workloads, as opposed to OLTP (online transaction processing) environments. Most of the focus in the MySQL area right now revolves around increasing performance for OLTP type workloads, which makes sense as this is the traditional workload that MySQL has been used for. In contrast, Kickfire focuses squarely on analytic environments, delivering high performance execution of analytical and reporting queries.
A MySQL server with fast processors, fast disks (or SSDs) and a lot of memory will answer many OLTP queries easily. Kickfire will outperform such a server for typical analytical queries, such as aggregations over a large number of rows.
A typical OLTP query might ask “What is the shipping address for this invoice?”. Contrast this with a typical analytical query, which asks “How much of this item did[Read more...]
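The contrast between the two query shapes can be sketched with a toy schema. The table and column names here are invented to match the excerpt's examples (shipping address lookup vs. per-item aggregation), with SQLite standing in for MySQL:

```python
import sqlite3

# Hypothetical invoice schema illustrating the two query shapes.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE invoice (id INTEGER PRIMARY KEY, shipping_address TEXT);
CREATE TABLE line_item (invoice_id INTEGER, item TEXT, quantity INTEGER);
INSERT INTO invoice VALUES (1, '344 Clinton St, Metropolis');
INSERT INTO line_item VALUES (1, 'widget', 3), (1, 'gadget', 1);
""")

# OLTP: fetch one row by primary key -- an indexed row store answers
# this with a handful of page reads.
oltp = cur.execute(
    "SELECT shipping_address FROM invoice WHERE id = ?", (1,)).fetchone()

# Analytical: scan and aggregate many rows -- the kind of full-table
# work where a column-oriented engine like Kickfire's pays off.
olap = cur.execute(
    "SELECT item, SUM(quantity) FROM line_item GROUP BY item ORDER BY item"
).fetchall()
print(oltp, olap)
```

On a few rows both queries are instant, of course; the difference only emerges when the line_item table holds hundreds of millions of rows and the aggregation must touch most of them.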
I am, for the most part, a do-it-yourself type of person. I fix my own car if I can, and I even have four healthy tomato plants growing in pots outside as we speak; the plants will take that little extra CO2 out of the air and give me great-tasting tomatoes (soon… I hope!)
But I digress.
Whether to use an ETL tool such as Kettle (aka Pentaho Data Integration) for a project involving large data transfers is a typical “build vs. buy” decision, one that is fairly well understood, so I don’t wish to repeat it all here. By putting together some Perl scripts to do the job, you typically get great performance, development speed and accessibility. This would need to be balanced against the benefits of ETL[Read more...]
Since we released our first Beta, we have been working on various examples to demonstrate the capabilities and benefits of a resource oriented approach to data integration.
One of the examples we have been working on is a data mart for SugarCRM opportunity analysis. We have now published that example on our content download site, packages.snaplogic.org, where you can download it and try it out. (You will also need a SnapLogic server installation to run the pipelines.)
The general idea behind data marts is simple: they are subject-specific alternatives to a full-blown data warehouse. The primary benefits of using a separate database to analyze an operational system are the ability to look at a snapshot of the constantly changing data, and the offloading of queries to a separate database which is optimized for analysis using a star[Read more...]
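For readers unfamiliar with the star-schema layout mentioned above, here is a minimal, hypothetical sketch: one fact table keyed to two dimension tables, queried with the typical join-and-aggregate pattern. All names and figures are invented for illustration; the actual SnapLogic SugarCRM data mart may be structured differently. SQLite is used here for a self-contained example.

```python
import sqlite3

# Tiny star schema: a central fact table surrounded by dimensions.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_stage (stage_key INTEGER PRIMARY KEY, stage_name TEXT);
CREATE TABLE fact_opportunity (date_key INTEGER, stage_key INTEGER, amount REAL);
INSERT INTO dim_date VALUES (1, 2009, 6), (2, 2009, 7);
INSERT INTO dim_stage VALUES (1, 'Prospecting'), (2, 'Closed Won');
INSERT INTO fact_opportunity VALUES (1, 1, 1000.0), (2, 2, 2500.0), (2, 2, 500.0);
""")

# The characteristic star-schema query: join the fact table to its
# dimensions, then slice and aggregate by dimension attributes.
rows = cur.execute("""
SELECT d.year, d.month, s.stage_name, SUM(f.amount)
FROM fact_opportunity f
JOIN dim_date d  ON d.date_key  = f.date_key
JOIN dim_stage s ON s.stage_key = f.stage_key
GROUP BY d.year, d.month, s.stage_name
ORDER BY d.year, d.month
""").fetchall()
print(rows)
```

Because every fact row joins to small dimension tables through integer keys, queries like this stay simple and fast no matter how the analysis is sliced, which is what makes the layout attractive for a subject-specific mart.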