After the GitHub MySQL failover incident, many blogs and people
have explained that fully automated failover might not be the
best solution.
Fully automated failover is indeed dangerous and should be
avoided if possible. But a completely manual failover is also
dangerous. A fully automated, manually triggered failover is
probably a better solution: a human decides when to fail over,
and scripts carry out the steps.
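
As a rough illustration of "automated but manually triggered", here is a
minimal Python sketch. It assumes the failover is a list of scripted steps
and that a person has to confirm before any of them run. The names
`run_failover`, `steps` and `confirm` are hypothetical and not taken from
any particular tool.

```python
from typing import Callable, List, Tuple

# A step is a label plus a callable that performs the scripted action.
Step = Tuple[str, Callable[[], None]]


def run_failover(steps: List[Step], confirm: Callable[[str], bool]) -> List[str]:
    """Run scripted failover steps, but only after an operator confirms.

    The decision to fail over is manual (the confirm callback); the
    execution is automated (the steps). Returns the labels of the steps
    that completed.
    """
    if not confirm("Fail over to the replica?"):
        return []  # operator declined: nothing is changed

    completed: List[str] = []
    for name, action in steps:
        # An exception raised by a step stops the run here, so a human
        # can inspect the half-finished state instead of the script
        # pushing on blindly.
        action()
        completed.append(name)
    return completed
```

In practice `confirm` could be a prompt on the operator's terminal, and each
step would wrap one of the scripts you have already tested.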
Synchronous replication is not a complete solution either.
A split-brain situation is a good example of a failure that
can still happen. Of course most clusters have all kinds of
safeguards to prevent that, but unfortunately safeguards can
fail too.
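
One common kind of safeguard against split-brain is a majority (quorum)
check: a node only accepts writes while it can see more than half of the
cluster. The sketch below is an assumption about how such a check might
look, not the implementation of any specific cluster product; `has_quorum`
and its parameters are hypothetical names.

```python
def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """Return True only if this node sees a strict majority of the cluster.

    reachable_nodes counts this node itself plus every peer it can
    currently reach. With an even split (for example 2 of 4) neither
    side has a majority, so neither side should accept writes.
    """
    if cluster_size <= 0:
        raise ValueError("cluster_size must be positive")
    return reachable_nodes > cluster_size // 2
```

Note that this check depends on `cluster_size` being configured correctly
on every node, which is one of the ways a safeguard like this can fail.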
Every failover/cluster should be considered broken unless:
- You've tested the failover scripts and procedures
- You've tested the failover scripts and procedures under normal load
- You've tested the failover scripts and procedures under high load …