Last night a customer had an emergency in selected machines on a
large cluster of quite uniform database servers. Some of the
servers were slowing down in a very puzzling way over a short
time span (a couple of hours). Queries were taking multiple
seconds to execute instead of being practically instantaneous.
But nothing seemed to have changed. No increase in traffic, no
code changes, nothing. Some of the servers got sick and then well
again; others that seemed to be identical were still sick; others
hadn’t shown any symptoms yet. The servers were in master-master
pairs, and the reads were being served entirely from the passive
machine in the pair. The servers that were having trouble were
the active ones, those accepting writes.
The customer had graphs of system metrics (++customer), and the
symptoms we observed were that Open_tables went sharply up, the
buffer pool got more dirty pages, InnoDB started creating more
pages, and the …
[Read more]