We've been working on the design of a protocol that would allow a slave to be promoted to master in a MySQL replication cluster.
Right now, if a MySQL master fails, most people just deal with a temporary outage. They bring the box back up, run REPAIR TABLEs if necessary, and generally take a few hours of downtime.
Google, Flickr, and Friendster have protocols in place for handling master failure, but for the most part these are undocumented.
One solution would be to use a system like DRBD to get a synchronous copy of the data into a backup DB. This would work, of course, but would require more hardware and a custom kernel.
You could also use a second master in multi-master replication, but this would require more hardware as well, and it complicates matters because multi-master replication has a few technical issues of its own.
A simpler approach is to just take a slave and promote it to the master. …