Last week, when I commented on Directions in Database Technology, I mentioned that
"Column stores will continue to evolve". I received a number of
comments via IM, Twitter, and email from folks who wanted to know
more about column stores (both in how they relate to Drizzle and
their usage in general).
Very early on, when we started work on Drizzle, the
plan was to focus on web applications. When we looked at cutting
features, one of the criteria was "is this needed for web
deployment". In many cases we have leaned toward keeping
functionality when it was clearly well designed and had a general
usefulness. ROLLUP, for instance, is not
typically used for web applications, but it is a well written
feature that provides us with functionality that we find
handy.
ROLLUP, though, is a feature I would typically group in the "Data
Analytics" area. Did we keep it?
Yes, because it is useful in a general sense even if you are not
doing data analytics (I also find it to be a gem that few MySQL
DBAs know).
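For anyone who has not run into it: in MySQL, adding WITH ROLLUP to a GROUP BY returns the usual per-group rows plus extra super-aggregate rows (subtotals and a grand total), with NULL in the rolled-up columns. Here is a rough Python sketch of that behavior. The sample data, the column names, and the `rollup_sum` helper are all made up for illustration, and this is not how the server implements it.

```python
from collections import defaultdict


def rollup_sum(rows, group_cols, value_col):
    """Emulate SUM(value_col) ... GROUP BY group_cols WITH ROLLUP.

    Hypothetical helper for illustration only. Returns a dict that maps
    each group key to its sum; rolled-up columns hold None (SQL NULL).
    """
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[c] for c in group_cols)
        # Credit the value to the full key and to every shorter prefix,
        # padding the rolled-up columns with None.
        for depth in range(len(group_cols), -1, -1):
            rolled = key[:depth] + (None,) * (len(group_cols) - depth)
            totals[rolled] += row[value_col]
    return dict(totals)


# Made-up sample data.
sales = [
    {"year": 2007, "country": "US", "profit": 10},
    {"year": 2007, "country": "FI", "profit": 5},
    {"year": 2008, "country": "US", "profit": 7},
]

result = rollup_sum(sales, ["year", "country"], "profit")

# Print detail rows, per-year subtotals, and the grand total.
for key, total in result.items():
    print(key, total)
```

The keys ending in None are the rows ROLLUP adds: one subtotal per year and a single grand total across all years.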
Early on with Drizzle I tried to discourage innovation outside of
the web stack, but that has proven to be futile. The fact is, we
provide a micro-kernel, and users will find uses for it. To me
the core of what Drizzle is, is the micro-kernel. Anything other
than the micro-kernel is a service, and these are required to build
solutions. Trying to direct innovation is frankly something I
should have known better than to try to do.
The short of this is that we will tackle data analytics in our
own manner, and today that means we will eventually adopt a
column store. Like map/reduce, column stores are one of the
inevitable trends.
In the open source world, this means Infobright right now. If you look at Infobright,
which has yet to be well known in open source circles, you see a
concrete example of a column store which is well purposed. It is
built on top of MySQL, but has its own enhanced parser for data
analytics (the basic MySQL/Drizzle optimizer is poorly designed
for this sort of work). To really get good performance you have
to go the route that Infobright went in replacing the optimizer
(the value add for "just an engine" is small; you really do need
something more).
At some point I believe we will tackle those types of changes for
our optimizer, but I don't see the point in it right now. We
aren't out to replace SQLite or Postgres, so why fill a niche that
Infobright already fills well?
So then, what is the future of the column store as it relates to
Drizzle?
I believe the second most important decision we will make long
term for engines is going to be which column store we pick up on.
I suspect we might even need two.
Why two?
It is obvious that we will need one for data analytics. Using
standard OLTP designs for data analytics does not work. This
though is not our focus, so it is a long term need, not a short
term one.
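To make the OLTP versus analytics point a little more concrete, here is a toy Python sketch of the two layouts. The data, the field names, and both functions are invented for illustration; real engines add compression, indexing, and on-disk formats that this ignores entirely.

```python
# Row-oriented layout: each record is stored together (typical OLTP design).
row_store = [
    {"id": 1, "page": "/home", "ms": 120},
    {"id": 2, "page": "/about", "ms": 80},
    {"id": 3, "page": "/home", "ms": 100},
]

# Column-oriented layout: the values of each column are stored together.
column_store = {
    "id": [1, 2, 3],
    "page": ["/home", "/about", "/home"],
    "ms": [120, 80, 100],
}


def avg_ms_rows(rows):
    # Walks every record, including fields the query never uses.
    return sum(r["ms"] for r in rows) / len(rows)


def avg_ms_columns(cols):
    # Reads only the single column the aggregate needs.
    ms = cols["ms"]
    return sum(ms) / len(ms)


print(avg_ms_rows(row_store), avg_ms_columns(column_store))
```

Both functions return the same answer, but the column version only touches the data for one field. With wide tables and billions of rows, that difference is much of the reason analytic workloads favor column stores.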
My interest is in one for shared nothing cloud services (which is
my personal area of interest). The contender for that at the
moment looks to be HyperTable, but my opinion there is based on back
of the napkin conclusions. We have to do an integration in order
to determine if it pans out (and there are attempts right now to
do this). There seem to be a number of groups interested in
this, so I know it will happen.
As much as column stores are useful for data analytics, and
probably required at this point, I believe there is a larger need
for them in the space of cloud computing. They have a natural
ability to scale out and I believe this will be key for the
semi-structured nature that we see most often in Web Application
data. While I expect setups of single node Drizzle databases, I
also believe that we will need shared storage backends. These
will obviously not be for OLTP uses in the beginning.
Skip ahead into the future, though, and the nature of MVCC design,
plus an optimistic optimizer, should allow engineers to
eventually build out OLTP systems with shared nothing backends
that make use of column stores. This is not on our current
roadmap, but it is also not hard to see where the future might
just go.
UPDATE: Several people have made mention of LucidDB as being an
open source column oriented database. I've only barely looked at
it, so I can't say much about it.
Nov 01 2008