Automated Database Failover Is Weird but Not Evil

GitHub had a recent outage due to malfunctioning automatic MySQL failover.  Having worked on this problem for several years, I felt sympathy but not much need to comment.  Then Baron Schwartz wrote a short post entitled "Is automated failover the root of all evil?"  OK, that seems worth a comment:  it's not.  Great title, though.

Selecting automated database failover involves a trade-off between keeping your site up 24x7 and making things worse by having software do the thinking when humans are not around.  When comparing outcomes of wetware vs. software it is worth remembering that humans are not at their best when woken up at 3:30am.  Humans go on vacations, or their cell phones run out of power.  Humans can commit devastating unforced errors due to inexperience.  For these and other reasons, automated failover is the right choice for many sites even if it is not perfect. 
Speaking of perfection, it is common to hear claims that automated database failover can never be trusted (such as this example).  For the most part such claims apply to particular implementations, not database failover in general.  Even so, it is undoubtedly true that failover is complex and hard to get right.  Here is a short list of things I have learned about failover from working on Tungsten and how I learned them.  Tungsten clusters are master/slave, but you would probably derive similar lessons from most other types of clusters. 
1. Fail over once and only once.  Tungsten does so by electing a coordinator that makes decisions for the whole cluster.  There are other approaches, but you need an algorithm that is provably sound.  Good clusters stop when they cannot maintain the pre-conditions required for soundness, a point to which Baron's article alludes; a toy sketch of the coordinator idea appears after this list.  (We got this right more or less from the start through work on other systems and reading lots of books about distributed systems.) 
2. Do not fail over unless the DBMS is truly dead.  The single best criterion for failure seems to be whether the DBMS server will still accept a TCP/IP connection.  Tests that look for higher brain function, such as running a SELECT, tend to generate false positives due to transient load problems like running out of connections or slow server responses.  Failing over due to load is very bad, as it can take down the entire cluster in sequence as load shifts to the remaining hosts.  A minimal version of this check appears in the sketches after this list.  (We learned this through trial and error.) 
3. Stop if failover will not work, or better yet, don't even start.  For example, Tungsten will not fail over if it does not have up-to-date slaves available; the sketches below include this pre-flight check as well.  Tungsten will also try to get back to the original pre-failover state when failover fails, though that does not always work.  We get credit for trying, I suppose.  (We also learned this through trial and error.) 
4. Keep it simple.  People often ask why Tungsten does not resubmit transactions that are in-flight when a master failover occurs.  The reason is that resubmission can fail on a new master in many ways, and it is difficult to predict when such failures will occur.  Imagine a transaction that depends on a temp table, for example.  Resubmitting just creates more ways for failover to fail.  Tungsten therefore lets connections break and puts the responsibility on apps to retry failed transactions; one of the sketches after this list shows what such a retry can look like on the application side.  (We learned this from previous products that did not work.) 
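To make point 1 concrete, here is a toy sketch in Python of the coordinator idea.  It is not Tungsten's algorithm and it is not provably sound; it only illustrates the two properties that matter: a single node decides for the whole cluster, and nobody decides at all without a quorum.  The node ids and membership lists are invented for the example.

    def elect_coordinator(members, reachable):
        """Pick a single decision-maker for the cluster, or nobody at all.

        members   -- the configured cluster membership (node ids)
        reachable -- the subset this node can currently see, including itself
        """
        if len(reachable) * 2 <= len(members):
            return None                # no quorum: stop rather than risk a split brain
        return min(reachable)          # deterministic choice: lowest id wins

    def may_fail_over(my_id, members, reachable, already_failed_over):
        """Fail over once and only once, and only if this node is the coordinator."""
        if already_failed_over:
            return False
        return elect_coordinator(members, reachable) == my_id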
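Points 2 and 3 amount to two pre-flight checks: treat the master as dead only when it stops accepting TCP/IP connections, tested several times rather than once, and refuse to start failover unless an up-to-date slave exists.  The sketch below illustrates both; again, it is a minimal illustration rather than Tungsten's code, and the port, timeout, retry count, and lag threshold are invented for the example.

    import socket
    import time

    def accepts_tcp_connection(host, port=3306, timeout=5.0):
        """Point 2: the liveness test is a bare TCP connect, not a SELECT,
        so transient load problems do not look like death."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def master_is_dead(host, checks=3, wait_seconds=10):
        """Declare the master dead only after repeated connection failures."""
        for _ in range(checks):
            if accepts_tcp_connection(host):
                return False
            time.sleep(wait_seconds)
        return True

    def pick_promotable_slave(slave_lag, max_lag_seconds=30):
        """Point 3: do not even start failover without an up-to-date slave.
        slave_lag maps hostname to replication lag in seconds (None = unknown)."""
        candidates = [(lag, host) for host, lag in slave_lag.items()
                      if lag is not None and lag <= max_lag_seconds]
        if not candidates:
            return None                # better to stop than to promote stale data
        return min(candidates)[1]      # the most caught-up slave wins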
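Point 4 pushes retry responsibility onto the application.  Here is a minimal client-side sketch, assuming a DB-API style driver: connect is a callable that returns a fresh connection (which, after failover, should reach the new master) and work runs one complete transaction against it.  Real code would catch the driver's specific connection-lost errors rather than a bare Exception.

    import time

    def run_with_retry(connect, work, attempts=3, backoff_seconds=2.0):
        """Resubmit the whole transaction from the application side when
        failover breaks the connection in mid-flight."""
        last_error = None
        for attempt in range(attempts):
            conn = None
            try:
                conn = connect()          # after failover this reaches the new master
                result = work(conn)       # one complete transaction, start to finish
                conn.commit()
                return result
            except Exception as error:    # substitute the driver's real error classes
                last_error = error
                time.sleep(backoff_seconds * (attempt + 1))
            finally:
                if conn is not None:
                    try:
                        conn.close()
                    except Exception:
                        pass
        raise last_error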
Even if you start out with such principles firmly in mind, new failover mechanisms tend to encounter a lot of bugs.  They are hard to find and fix because failover is not easy to test.  Yet the real obstacle to getting automated failover right is not so much bugs but the unexpected nature of the problems clusters encounter.  There is a great quote from J.B.S. Haldane about the nature of the universe that also gives a flavor of the mind-bending nature of distributed programming:  "My own suspicion is that the universe is not only queerer than we suppose, but queerer than we can suppose."  I can't count the number of times when something misbehaved in a way that would never have occurred to me without seeing it happen in a live system.  That is why mature clustering products can be pretty good while young ones, however well-designed, are not.  The problem space is just strange. 
My sympathy for the GitHub failures and everyone involved is therefore heartfelt.  Anyone who has worked on failover knows the guilt of failing to anticipate problems as well as the sense of enlightenment that comes from understanding why they occur.  Automated failover is not evil.  But it is definitely weird.