Eventually Consistent Relational Database?

This weekend I attended Drupal Camp PDX and listened to a session titled “Drupal in the Cloud”. The presenter, Josh Koenig from Chapter Three, gave a great introduction to what moving to “the cloud” really means, especially in the context of a typical web application like Drupal. The problem, which is of course no fault of Josh’s, is that the best high-availability database practices are harder to deploy because you’re working within a different set of constraints in the cloud. Sure, you can set up MySQL replication, but without the ability to insert a hardware load balancer, or better control over floating IPs, reliable single-master solutions are difficult at best.

I spoke with Josh for a bit afterward and discussed how Drizzle is doing things to help, and what it would take to have a Drizzle back end for Drupal (it turns out it should not be too difficult). We then got onto the topic of what some of the newer non-relational databases would look like for Drupal, and the short answer is that it would be extremely difficult. Drupal, in both the core and many of the modules, depends on a relational model for the underlying data. This is not unique to Drupal. People, and the software they write, have thought “relational” for decades when it comes to data. Sure, the various NoSQL projects are becoming more popular, but the masses are still thinking in terms of joining tables.

Silver Bullet

So, what would be the silver bullet? A relational database that does not depend on a single master. Not just dual-master setups with offset auto-increments; I’m talking about removing the entire concept of master-slave replication. This is obviously nothing new in the industry, but it’s never been easy to accomplish. Just do some reading on distributed locking algorithms and you’ll get the idea. The main problem with distributed locking algorithms is that they don’t scale.

But what about an eventually consistent replication model for a relational database? So far, eventually consistent databases have not been relational (they are document-based like CouchDB, or simple key/value stores), and relational databases have always focused on atomic consistency or some close, relaxed relative (various levels of serializability). As a thought experiment, I’m going to attempt to describe at a high level what this might look like under the hood.

Eventually consistent?

Not familiar with this term? Take a look at Werner Vogels’ article on the topic. The main idea behind EC is that you sacrifice the ability for all nodes to see the exact same thing at any given time (consistency), but in return you can tolerate network partitions and you keep availability. This relates directly to the CAP theorem, which states you can only get two of: Consistency, Availability, and tolerance to network Partitions. So, we are throwing out “C” so we can get rid of those nasty distributed locking algorithms, and in return we take on “EC”.

MyEventuallyConsistentSQL

Let’s start with a traditional relational database and modify it until we have something that looks like an ECRDBMS (OK, maybe this acronym is a bit wordy).

  • Throw out transactions, serializability, and MVCC – Stay with me. These are not required for a relational data model, and they don’t make much sense once you throw out atomic consistency. At any point an event could come in from another database node that would overwrite whatever you are protecting in your transaction, so what’s the point of protecting yourself? I realize this eliminates a certain class of applications that depend on strict consistency, but this is the cost of moving to EC. We still have a large set of applications that will function just fine even with these dirty reads/writes. I don’t really care if my Drupal site or WordPress blog has dirty transactions (or none at all). Sure, something might get slightly off in some of those edge cases, but no one is really going to complain, and such cases are easily resolved once recognized.

  • Remove or fix auto-increment types – I think the best way to handle auto-increment types is to just remove them and require a time-based unique ID (i.e., a UUID) in your primary key. Sure, they might take a little more space, but space is cheap, and it’s a small price to pay for scalability. The other option is to set up some type of pool reservation system for the nodes to pull from, but then we get back into the business of distributed locking or single points of failure. It’s easier to just remove them and use globally unique IDs. These IDs play a role in EC replication as well (see the sketch after this list), so we might as well introduce them here.

  • Apply EC to replication events – This requires making our events deterministic, regardless of ordering. The easiest way is to tag each row with a time-based unique ID, where the most recent ID always wins. This is a bit rough, but it works, and it can be refined in a couple of ways. First, we can change the granularity of events. We probably don’t want to go larger (i.e., page-based replication), but we may want to go smaller (field-based). You could have a hybrid as well, mixing row and field events. The other way to refine EC replication is to allow delta events. Rather than pushing only absolute values, create options on certain columns that allow for increment, decrement, append, and so on. For example, an INTEGER column defined as inc/dec would always have replication events that looked like “x += 3” instead of “x = 42”. You might even have the option to create user-defined delta algorithms, but I won’t go into that here. Oh, and for those thinking append would not work: you can order your append operations by the time-based unique ID (which means keeping a history of the field). Look at what Google Wave is doing with its text-field conflict resolution algorithms. A sketch of the basic ID tagging and merge rules follows this list.
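
To make the last two points concrete, here is a minimal Python sketch of last-write-wins merging keyed on time-ordered IDs, plus a commutative delta event. The names (new_event_id, merge_row) and the ID format (wall-clock nanoseconds plus a node ID and a random tiebreaker) are my own illustration, not from any existing database, and a real system would also have to cope with clock skew between nodes.

```python
import time
import uuid

def new_event_id(node_id):
    # Roughly time-ordered unique ID: wall-clock nanoseconds, then the
    # node ID, then a random tiebreaker. Python tuples compare element
    # by element, so more recent IDs sort higher. Clock skew between
    # nodes is ignored here but matters in practice.
    return (time.time_ns(), node_id, uuid.uuid4().hex)

def merge_row(local, incoming):
    # Last-write-wins at row granularity: whichever version carries the
    # most recent event ID survives, regardless of the order in which
    # the replication events arrived.
    return incoming if incoming["_id"] > local["_id"] else local

# Two nodes update the same row concurrently; both converge on the same
# version no matter which event they apply first.
row_a = {"_id": new_event_id(1), "title": "draft"}
row_b = {"_id": new_event_id(2), "title": "published"}
assert merge_row(row_a, row_b) == merge_row(row_b, row_a)

# Delta replication on an inc/dec column: "hits += 3" and "hits += 5"
# commute, so any arrival order yields the same total.
counter = {"hits": 10}
for delta in ({"hits": +3}, {"hits": +5}):
    counter["hits"] += delta["hits"]
assert counter["hits"] == 18
```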

What are we missing? What else would break down if we toss out atomic consistency and make the above changes? One thing I left out is DDL operations. Those would require some more thought, but I’m pretty sure we could figure out a way to handle conflicting events, possibly with configuration parameters to control the decisions made by the conflict resolution algorithms. For example, if an UPDATE event gets applied after an ALTER TABLE that removed a column referenced in the UPDATE, you could just ignore that value and apply the other updates, if any (see the sketch below). Chances are you didn’t want that column if it was removed at about the same time. This model has the major benefit that you never have to worry about which node is the master or keep an ordered replication log: all nodes operate independently and emit deterministic events that can be applied in any order.
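
As a rough illustration of that ALTER TABLE example, here is a small Python sketch. The event format and the function name apply_update are made up for illustration; they are not from any existing system.

```python
def apply_update(row, live_columns, update_event):
    # Apply only the assignments whose columns still exist. Values for
    # columns removed by a concurrent ALTER TABLE are silently dropped,
    # on the assumption that a just-removed column was unwanted.
    for column, value in update_event.items():
        if column in live_columns:
            row[column] = value
    return row

# The 'subtitle' column was dropped concurrently, so only the 'title'
# assignment survives.
live_columns = {"id", "title"}
row = {"id": 7, "title": "old title"}
event = {"title": "new title", "subtitle": "orphaned value"}
assert apply_update(row, live_columns, event) == {"id": 7, "title": "new title"}
```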

Summary

This would-be ECRDBMS looks a bit different on the inside, but from the outside it will look pretty familiar. From the perspective of a normal web application, we are still creating tables, inserting data, joining data, and doing all the things we depend on from a relational database. This may not be a great idea, but I think it would be possible if you are willing to accept some of the behaviors that come along with it. So what do you think? How can it be improved? Would you use it for your application?