Recently, I had to work on an emergency for a customer who was having a problem restarting a MariaDB Galera Cluster. After a failure in the cluster they decided to restart the cluster entirely following the right path: bootstrapping the first node, and then adding the rest of the members, one by one. Everything went fine until one of the nodes rejected the request to join the cluster.
Given this problem, the customer decided to ask us to help with the problematic node because none of the tests they did worked, and the same symptom repeated over and over: SST started, copied few gigs of data and then it just hanged (apparently) while the node remained out of the cluster.
Identifying the issue…
Once onboard with the issue, initially I just checked that the cluster was trying a SST: given the whole dataset was about 31GB I decided to go directly to a healthy solution: to clean up the whole datadir and start afresh. No luck …
[Read more]