Query latencies in one data center are larger than elsewhere for one replication hierarchy, but only in the high percentiles. This impacts production and traffic is being failed away from that data center to protect production.
When the P50 and P90 look okay, but the P99 and the P99.9 do not, the database(s) operate normally, and only some queries are running slow. The initial guess was “for some queries the plan has flipped, but only in that data center.”
But first let’s have a look at the database size and the schema.
The schema in question holds metadata for a change data capture process, and that is not a lot.
# du -sh *
0 coredumps
5.6G data
704M log
0 tmp
# du -sh data/theschema
93M data/theschema
and in memory:
The mysqld process has a …
[Read more]