We've been working on the design of a protocol that would enable
promoting a slave to master in a MySQL replication cluster.
Right now, if a MySQL master fails, most people just deal with a
temporary outage. They bring the box back up, run REPAIR TABLEs
if necessary, and generally take a few hours of downtime.
Google, Flickr, and Friendster have protocols in place for
handling master failure, but for the most part these are
undocumented.
One solution would be to use a system like DRBD to get a
synchronous copy of the data into a backup DB. This would work,
of course, but would require more hardware and a custom kernel.
You could also use a second master in multi-master replication,
but this would require more hardware as well, and it complicates
matters because multi-master replication has a few technical
issues of its own.
A simpler approach is to just take a slave and promote it to the
master. …