Over the course of years, I have observed that the three most sensitive indicators of MySQL having a server lockup are the queries per second, number of connections, and number of queries running. Here is a chart of those three on a production system. Find the bad spot:
I am currently working on developing an automated system that detects abnormal behavior in these three metrics, but doesn’t require any a priori inputs or thresholds, e.g. you don’t have to tell it “more than X is bad.” (It could be that during a low period of the day, X is different than during the peak load.)
It turns out that this is hard to do reliably, without a lot of false positives and without false negatives (not triggering during an incident). If there is existing literature on the mathematical techniques to do this, I’d be …
[Read more]