GitHub had a recent outage due to malfunctioning automatic
MySQL failover. Having worked on this problem for several
years, I felt sympathy but not much need to comment. Then
Baron Schwartz wrote a short post entitled "Is automated failover the root of all evil?"
OK, that seems worth a comment: it's not. Great
title, though.
Selecting automated database failover involves a trade-off
between keeping your site up 24x7 and making things worse by
having software do the thinking when humans are not around.
When comparing the outcomes of wetware vs. software, it is worth
remembering that humans are not at their best when woken up at
3:30am. Humans go on vacations, or their cell phones run
out of power. Humans can commit devastating unforced errors
due to inexperience. For these and other reasons, automated
failover is the right choice for many sites even if it is not
perfect.
Speaking of perfection, it is common to hear claims that
automated database failover can never be trusted (such as
this example). For the most part
such claims apply to particular implementations, not database
failover in general. Even so, it is undoubtedly true that
failover is complex and hard to get right. Here is a short
list of things I have learned about failover from working on
Tungsten and how I learned them. Tungsten
clusters are master/slave, but you would probably derive similar
lessons from most other types of clusters.
1. Fail over once and only once. Tungsten does so by
electing a coordinator that makes decisions for the whole
cluster. There are other approaches, but you need an
algorithm that is provably sound. Good clusters stop when
they cannot maintain the pre-conditions required for soundness,
to which Baron's article alludes. (We got this right more
or less from the start through work on other systems and reading
lots of books about distributed systems.)
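To make the idea concrete, here is a toy sketch in Python. It is not Tungsten's actual election protocol, just an illustration of the principle: every node that sees the same membership picks the same coordinator, and only the coordinator that can see a majority of the cluster is allowed to start a failover, so a partitioned minority cannot fail over on its own.

    # Toy illustration, not Tungsten's election protocol: one coordinator,
    # and it may only act when it holds quorum.
    from typing import Iterable, Set

    def choose_coordinator(live_members: Iterable[str]) -> str:
        # Deterministic choice: the lexically smallest live member.
        return min(live_members)

    def may_initiate_failover(my_name: str, all_members: Set[str],
                              live_members: Set[str]) -> bool:
        has_quorum = len(live_members) > len(all_members) / 2
        is_coordinator = choose_coordinator(live_members) == my_name
        return has_quorum and is_coordinator

    # A 5-node cluster split 3/2: only one node on the majority side acts.
    members = {"db1", "db2", "db3", "db4", "db5"}
    print(may_initiate_failover("db1", members, {"db1", "db2", "db3"}))  # True
    print(may_initiate_failover("db4", members, {"db4", "db5"}))         # False

In a real cluster the membership view itself comes from a group communication or consensus layer; the point here is simply that the decision to fail over is made in exactly one place, and not at all when quorum is lost.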
2. Do not fail over unless the DBMS is truly dead.
The single best criterion for failure seems to be whether
the DBMS server will accept a TCP/IP connection. Tests that
look for higher brain function, such as running a SELECT, tend to
generate false positives due to transient load problems like
running out of connections or slow server responses.
Failing over due to load is very bad as it can take down
the entire cluster in sequence as load shifts to the remaining
hosts. (We learned this through trial and
error.)
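A liveness probe along those lines needs little more than a TCP connect. The sketch below assumes MySQL's default port 3306 and a hypothetical host name; note what it deliberately avoids doing, namely running a query whose failure under load would look like a dead server.

    # Minimal sketch: is the DBMS accepting TCP connections at all?
    import socket

    def dbms_accepts_connections(host: str, port: int = 3306,
                                 timeout: float = 2.0) -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # A real monitor would require several consecutive failures before
    # declaring the master dead, to ride out brief network blips.
    if not dbms_accepts_connections("db-master.example.com"):
        print("master unreachable; one strike toward declaring it dead")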
3. Stop if failover will not work or, better yet, don't even
start. For example, Tungsten will not fail over if it
does not have up-to-date slaves available. Tungsten will
also try to get back to the original pre-failover state when
failover fails, though that does not always work. We get
credit for trying, I suppose. (We also learned this through
trial and error.)
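A hedged sketch of such a pre-flight check appears below; the Replica shape and the five-second lag threshold are illustrative assumptions, not Tungsten's API.

    # Refuse to start failover unless an up-to-date slave is available.
    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Replica:
        name: str
        seconds_behind: float   # replication lag at the last health check
        online: bool

    def pick_failover_target(replicas: List[Replica],
                             max_lag: float = 5.0) -> Optional[Replica]:
        candidates = [r for r in replicas
                      if r.online and r.seconds_behind <= max_lag]
        if not candidates:
            return None          # better not to start at all
        return min(candidates, key=lambda r: r.seconds_behind)

    replicas = [Replica("db2", 120.0, True), Replica("db3", 0.5, True)]
    target = pick_failover_target(replicas)
    if target is None:
        print("no up-to-date slave; refusing to fail over")
    else:
        print("promoting " + target.name)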
4. Keep it simple. People often ask why Tungsten
does not resubmit transactions that are in flight when a master
failover occurs. The reason is that resubmission can fail
on a new master in many ways, and it is difficult to predict
when those failures will occur. Imagine a transaction that
depends on a temp table, for example.
Resubmitting just creates more ways for failover to fail.
Tungsten therefore lets connections break and puts the
responsibility on apps to retry failed transactions. (We
learned this from previous products that did not
work.)
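The flip side of that choice is an application-level retry loop. The sketch below assumes a PyMySQL-style driver and a hypothetical connection endpoint; the essential point is that the application re-runs the whole transaction from the top once the new master is available, rather than expecting the cluster to replay half-finished work.

    # Application-side retry: reconnect and re-run the full transaction
    # when the connection breaks during failover. Host, credentials, and
    # schema below are placeholders.
    import time
    import pymysql

    def run_with_retry(txn, retries: int = 3, delay: float = 2.0):
        for attempt in range(1, retries + 1):
            try:
                conn = pymysql.connect(host="db-vip.example.com", user="app",
                                       password="secret", database="orders")
                try:
                    result = txn(conn)   # the whole transaction, start to finish
                    conn.commit()
                    return result
                finally:
                    conn.close()
            except pymysql.err.OperationalError:
                if attempt == retries:
                    raise                # give up and surface the error
                time.sleep(delay)        # wait for failover to complete

    def place_order(conn):
        with conn.cursor() as cur:
            cur.execute("INSERT INTO orders(item) VALUES (%s)", ("widget",))

    # run_with_retry(place_order)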
Even if you start out with such principles firmly in mind, new
failover mechanisms tend to encounter a lot of bugs. They
are hard to find and fix because failover is not easy to test.
Yet the real obstacle to getting automated failover right
is not so much bugs but the unexpected nature of the problems
clusters encounter. There is a great quote from J.B.S. Haldane about the nature of the
universe that also gives a flavor of the mind-bending nature of
distributed programming: "My own suspicion is that the
universe is not only queerer than we suppose, but queerer than we
can suppose." I can't count the number of times when
something misbehaved in a way that would never have occurred to
me without seeing it happen in a live system. That is why
mature clustering products can be pretty good while young ones,
however well-designed, are not. The problem space is just
strange.
My sympathy for the GitHub failures and everyone involved is
therefore heartfelt. Anyone who has worked on failover
knows the guilt of failing to anticipate problems as well as
the sense of enlightenment that comes from understanding why they
occur. Automated failover is not evil. But it is
definitely weird.
Sep 18, 2012