Recently I was working with a customer where we noticed Seconds_Behind_Master fluctuating between the expected value of 0 and a fairly high six-figure value. The servers were configured in a master-master relationship and used five-digit server_id values, and we had just migrated this cluster from one data centre to another by re-pointing masters. Large fluctuations in Seconds_Behind_Master can often be explained by long-running queries being processed by the SQL_THREAD; however, SHOW PROCESSLIST showed no long-running replication events, and we had no other indication that the server was lagging due to resource constraints: CPU, disk, and memory were all under-utilized.
We then moved our investigation to a manual review of the binary log, where events appeared normal (five-digit server_id values) until, every once in a while, we would see a rash of server_id 21 events. Wait, what? I …
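The manual binary log review described above can be partly automated. What follows is only a minimal sketch, not the procedure we actually used: it assumes the binary log has already been decoded to text with mysqlbinlog, whose event header lines include a `server id N` field, and it tallies events per server_id so that unexpected values stand out. The function names, the expected server_id values (10021 and 10022), and the sample lines are all hypothetical and chosen for illustration.

```python
import re
from collections import Counter

# Matches mysqlbinlog event header lines such as:
#   #130115 10:12:33 server id 21  end_log_pos 760 ...
SERVER_ID_RE = re.compile(r"^#\d{6}\s+\d{1,2}:\d{2}:\d{2}\s+server id\s+(\d+)\b")


def count_server_ids(lines):
    """Count binary log events per originating server_id."""
    counts = Counter()
    for line in lines:
        match = SERVER_ID_RE.match(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


def unexpected_server_ids(counts, expected_ids):
    """Return only the server_id values that are not in expected_ids."""
    return {sid: n for sid, n in counts.items() if sid not in expected_ids}


# Hypothetical sample of decoded binary log text (not taken from a real log).
SAMPLE_LINES = [
    "#130115 10:12:33 server id 10021  end_log_pos 240 \tQuery\tthread_id=5",
    "#130115 10:12:34 server id 10022  end_log_pos 512 \tQuery\tthread_id=9",
    "#130115 10:12:35 server id 21  end_log_pos 760 \tQuery\tthread_id=7",
    "#130115 10:12:36 server id 21  end_log_pos 990 \tQuery\tthread_id=7",
    "SET TIMESTAMP=1358244753/*!*/;",
]

if __name__ == "__main__":
    counts = count_server_ids(SAMPLE_LINES)
    # The expected IDs are an assumption; use the cluster's real server_id values.
    print(unexpected_server_ids(counts, expected_ids={10021, 10022}))
```

In practice the lines would come from the decoded log rather than a hard-coded list, for example by piping the mysqlbinlog output into a small wrapper that reads standard input and passes it to `count_server_ids`.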