Open source business intelligence and data warehousing are on the
rise!
If you have been keeping up with the MySQL Performance Blog, you
might have noticed a number of posts comparing the open source
analytical databases Infobright, LucidDB, and MonetDB. LucidDB made
more news last week when Nick Goodman announced that the Dynamo
Business Intelligence Corporation will be offering services around
LucidDB, branding it as DynamoDB.
Now, to top it off, Calpont has just released InfiniDB, a GPLv2 open source version of its
analytical database offering, which is based on the MySQL
server.
So, let's take a quick look at InfiniDB. I haven't yet played
around with it, but the features sure look interesting:
- Column-oriented architecture (like all other analytical
database products mentioned)
- Transparent compression
- Vertical and horizontal partitioning: on top of being
column-oriented, data is also partitioned, potentially allowing
for less IO to access data.
- MVCC and support for high concurrency. It would be
interesting to see how much benefit this gives when loading data,
because loading is usually one of the bottlenecks for
column-oriented databases.
- Support for ACID/Transactions
- High performance bulkloader
- No specialized hardware: InfiniDB is a pure software
solution that can run on commodity hardware
- MySQL compatible
The website sums up a few more features and benefits, but I think
this covers the most important ones.
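
To make the partitioning bullet a bit more concrete, here is a toy sketch in Python of how storing columns separately and cutting each column into row ranges can reduce IO. This is my own illustration and not InfiniDB's actual storage format: the chunk size, the per-chunk min/max metadata, and names such as `build_column` and `sum_amount_for_year` are all assumptions made up for the example.

```python
# Toy model, NOT InfiniDB's real on-disk layout (assumption for illustration).
# Vertical partitioning: every column is stored on its own.
# Horizontal partitioning: every column is cut into fixed-size row ranges.
CHUNK_ROWS = 4  # hypothetical chunk size


def build_column(values, chunk_rows=CHUNK_ROWS):
    """Split one column into chunks and record min/max for each chunk."""
    chunks = []
    for start in range(0, len(values), chunk_rows):
        block = values[start:start + chunk_rows]
        chunks.append({"min": min(block), "max": max(block), "values": block})
    return chunks


table = {
    "year": build_column([2007] * 4 + [2008] * 4 + [2009] * 4),
    "amount": build_column([5, 1, 3, 2, 8, 6, 7, 4, 9, 9, 2, 1]),
    "comment": build_column(["x"] * 12),  # never touched by the query below
}


def sum_amount_for_year(table, year):
    """SELECT SUM(amount) WHERE year = ?, counting how many chunks get read."""
    total = 0
    chunks_read = 0
    for year_chunk, amount_chunk in zip(table["year"], table["amount"]):
        # The min/max metadata lets us skip row ranges that cannot match.
        if not (year_chunk["min"] <= year <= year_chunk["max"]):
            continue
        chunks_read += 2  # one chunk of "year" plus one chunk of "amount"
        for y, amount in zip(year_chunk["values"], amount_chunk["values"]):
            if y == year:
                total += amount
    return total, chunks_read


total, chunks_read = sum_amount_for_year(table, 2008)
all_chunks = sum(len(chunks) for chunks in table.values())
print(total, chunks_read, all_chunks)
```

In this toy table the query reads 2 of the 9 stored chunks: the `comment` column is never opened, and the row ranges for 2007 and 2009 are skipped.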
Calpont also offers a closed source enterprise edition, which
differs from the open source edition by adding multi-node
scale-out support. By that, they do not mean regular MySQL
replication scale-out. Instead, the enterprise edition features a
true distributed database architecture which allows you to divide
incoming requests across a layer of so-called "user modules"
(MySQL front ends) and "performance modules" (the actual
workhorses that partition, retrieve and cache data). In this
scenario, the user modules break the queries they receive from
client applications into pieces, and send them to one or more
performance modules in a parallel fashion. The performance
modules then retrieve the actual data from either their cache or
from the disk, and send it back to the user modules, which
re-assemble the partial and intermediate results into the final
result set that is sent back to the client. (see picture)
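
The division of labour described above is essentially a scatter-gather pattern. Below is a minimal Python sketch of that idea, not Calpont's implementation: the function names `user_module` and `performance_module`, the in-memory `PARTITIONS` data, and the use of threads in place of separate nodes are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list stands for the slice of a sales table
# that one "performance module" is responsible for.
PARTITIONS = [
    [("nl", 10), ("us", 5)],
    [("us", 7), ("de", 3)],
    [("nl", 1), ("de", 4)],
]


def performance_module(rows):
    """Compute a partial SUM(amount) GROUP BY country over one slice."""
    partial = {}
    for country, amount in rows:
        partial[country] = partial.get(country, 0) + amount
    return partial


def user_module(partitions):
    """Send the work to all slices in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(performance_module, partitions))
    result = {}
    for partial in partials:
        for country, subtotal in partial.items():
            result[country] = result.get(country, 0) + subtotal
    return result


totals = user_module(PARTITIONS)
print(totals)
```

Merging is simple here because sums of partial sums are still sums. Aggregates such as averages would need the performance modules to return both a sum and a count, so that the user module can combine them correctly.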
Given the MySQL compatibility and otherwise similar features, I
think it is fair to compare the open source InfiniDB offering to
the Infobright community edition. Interesting differences are
that InfiniDB supports all usual DML statements
(INSERT, DELETE, UPDATE),
and that InfiniDB offers the same bulk loader in both the
community edition and the enterprise edition. By contrast, the
Infobright community edition does not support DML, and offers a
bulk loader that is less performant than the one included in its
enterprise edition. I have not heard of an Infobright multi-node
option, so when comparing the enterprise edition feature sets,
that seems like another advantage of Calpont's offering.
Please understand that I am not endorsing one of these products
over the other: I'm just doing a checkbox feature list comparison
here. What it mostly boils down to is that users who need an
affordable analytical database now have even more choice than
before. In addition, it adds a bit more competition for the
vendors, and I expect them all to improve as a result of that.
These are interesting times for the BI and data warehousing
market :)