Column Stores, Drizzle, Search For

Last week I commented on Directions in Database Technology and mentioned that "Column stores will continue to evolve". I received a number of comments via IM, Twitter, and email from folks who wanted to know more about column stores (both how they relate to Drizzle and their usage in general).

Very early on, when we started work on Drizzle, the plan was to focus on web applications. When we looked at cutting features, one of the criteria was "is this needed for web deployment". In many cases we have leaned toward keeping functionality when it was clearly well designed and generally useful. ROLLUP, for instance, is not typically used for web applications, but it is a well written feature that gives us functionality we find handy.

ROLLUP, though, is a feature I would typically group in the "Data Analytics" area. Did we keep it?

Yes, because it is useful in a general sense even if you are not doing data analytics (I also find it to be a gem that few MySQL DBAs know).
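For anyone who has not run into ROLLUP: in MySQL it is written as GROUP BY ... WITH ROLLUP, and it adds subtotal and grand total rows to a grouped result. Here is a rough sketch in Python of what it computes. The sales data and the column names (region, product, amount) are made up for illustration, and this is only an emulation of the result set, not how the server implements it.

```python
from collections import defaultdict

# Hypothetical sample data: (region, product, amount)
SALES = [
    ("east", "widget", 10),
    ("east", "gadget", 5),
    ("west", "widget", 7),
]


def rollup(rows):
    """Emulate SUM(amount) ... GROUP BY region, product WITH ROLLUP.

    Returns (region, product, total) tuples. None stands in for the
    SQL NULL that marks subtotal and grand total rows.
    """
    detail = defaultdict(int)      # totals per (region, product)
    per_region = defaultdict(int)  # subtotals per region
    grand_total = 0
    for region, product, amount in rows:
        detail[(region, product)] += amount
        per_region[region] += amount
        grand_total += amount

    result = []
    for region in sorted(per_region):
        # Detail rows for this region, ordered by product.
        for key in sorted(k for k in detail if k[0] == region):
            result.append((key[0], key[1], detail[key]))
        # Subtotal row for the region.
        result.append((region, None, per_region[region]))
    # Grand total row.
    result.append((None, None, grand_total))
    return result


if __name__ == "__main__":
    for row in rollup(SALES):
        print(row)
```

The point is that one query hands back the detail rows, the per-region subtotals, and the grand total together, which is handy well outside of data analytics.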

Early on with Drizzle I tried to discourage innovation outside of the web stack, but that has proven to be futile. The fact is, we provide a micro-kernel, and users will find uses for it. To me, the core of what Drizzle is, is the micro-kernel. Anything other than the micro-kernel is a service, and these are required to build solutions. Trying to direct innovation is frankly something I should have known better than to try to do.

The short of this is that we will tackle data analytics in our own manner, and today that means we will eventually adopt a column store. Like map/reduce, column stores are one of the inevitable trends.

In the open source world, this means Infobright right now. Infobright, which is not yet well known in open source circles, is a concrete example of a column store that is well purposed. It is built on top of MySQL, but has its own enhanced parser for data analytics (the basic MySQL/Drizzle optimizer is poorly designed for this sort of work). To get really good performance you have to go the route that Infobright took and replace the optimizer (the value added by "just an engine" is small; you really do need something more).

At some point I believe we will tackle those types of changes for our optimizer, but I don't see the point in it right now. We aren't out to replace SQLite or Postgres, so why fill a niche that Infobright already fills well?

So then, what is the future of the column store as it relates to Drizzle?

I believe the second most important decision we will make long term for engines is going to be which column store we pick up on. I suspect we might even need two.

Why two?

It is obvious that we will need one for data analytics. Using standard OLTP designs for data analytics does not work. This though is not our focus, so it is a long term need, not a short term one.
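To make the difference concrete, here is a toy sketch in Python of the two layouts. The order data and the helper names are invented for this example, and real column stores add compression, indexing, and much more. An analytic query such as "sum of all amounts" has to walk every whole row in the row layout, but only touches a single contiguous column in the column layout.

```python
from array import array

# Hypothetical rows as an OLTP engine would hold them:
# (order_id, customer, amount)
ROWS = [
    (1, "alice", 30),
    (2, "bob", 12),
    (3, "alice", 8),
]


def to_columns(rows):
    """Pivot a non-empty list of row tuples into one sequence per column."""
    order_ids, customers, amounts = zip(*rows)
    return {
        "order_id": array("q", order_ids),  # packed 64-bit integers
        "customer": list(customers),
        "amount": array("q", amounts),
    }


def total_from_rows(rows):
    # Row layout: each full row is visited to pull out one field.
    return sum(row[2] for row in rows)


def total_from_columns(columns):
    # Column layout: only the 'amount' column is read.
    return sum(columns["amount"])


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(total_from_rows(ROWS), total_from_columns(columns))
```

Both functions return the same answer. What changes is how much data has to be read to get it, and that gap grows with the number of columns in the table.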

My personal interest is in one for shared nothing cloud services. The contender for that at the moment looks to be HyperTable, but my opinion there is based on back of the napkin conclusions. We have to do an integration in order to determine if it pans out (and there are attempts right now to do this). There seem to be a number of groups interested in this, so I know it will happen.

As much as column stores are useful for data analytics, and probably required at this point, I believe there is a larger need for them in the space of cloud computing. They have a natural ability to scale out, and I believe this will be key for the semi-structured data that we see most often in web applications. While I expect setups of single node Drizzle databases, I also believe that we will need shared storage backends. These will obviously not be for OLTP uses in the beginning.

Skip ahead into the future, though, and the nature of MVCC design, plus an optimistic optimizer, should allow engineers to eventually build out OLTP systems with shared nothing backends that make use of column stores. This is not on our current roadmap, but it is also not hard to see where the future might just go.

UPDATE: Several people have mentioned LucidDB as an open source column-oriented database. I've only barely looked at it, so I can't say much about it.