Sometimes, I make mistakes. It’s true. That can be difficult for us Systems Engineering-types to say, but I try to distance myself from my ego and embrace the mistakes because I often learn the most from them. ..Blah, blah, school of hard knocks, blah, blah…. Usually my mistakes aren’t big enough to cause any visible impact, but this one took the site out for 10 minutes during a period of peak traffic due to a confluence of events.
Doh!
Here is how it went down…
We have an issue where MySQL table statistics are occasionally getting out of whack, usually after a batch operation. This causes bad explain plans, which in turn cause impossibly slow queries. An ANALYZE TABLE (or even SHOW CREATE INDEX) resolves the issue immediately, but I prefer not get woken up at 4AM by long running query alerts when my family and I are trying to sleep. As a way to work around the issue, we decided to disable InnoDB automatic …
[Read more]