- Fully synchronous replication, with a write-latency increase roughly equal to the ping RTT to the furthest node
- Automatic cluster synchronization, via both incremental and full state transfers
- The ability to read and write on every node without needing to worry about replication delay
However, good engineers know there is always a trade-off. If someone tries to sell you a technology that sounds amazing without mentioning the trade-offs, be careful.

One of the trade-offs in Galera is how multi-node writing is implemented. To simplify greatly, locking inside an RDBMS can be either pessimistic or optimistic.
Pessimistic locking is what Innodb does as a transaction executes. All the locks it needs (or thinks it needs) are acquired over the course of the transaction. If a given statement in the transaction needs a lock, it waits for that lock before proceeding (this is where deadlocks and lock wait timeouts can happen). By the time the transaction reaches the COMMIT, it should be guaranteed to succeed (unless something weird happens, like the disk failing or the server crashing), because it acquired all the locks it needed in advance.
Now, think about pessimistic locking in terms of a cluster. It would be incredibly expensive to acquire every lock needed on every node for all running transactions. And how would deadlock detection work? On a single node, a deadlock is fairly easy to see by looking at all transactions waiting for locks and checking for circular dependencies, but that would be much more complicated in a cluster. That's not to say pessimistic locking isn't possible in a cluster, but it is tricky, and all that extra network traffic would really slow down transactions.
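The "circular dependency" check above can be sketched as a cycle search in a wait-for graph. This is an illustrative sketch of the idea, not InnoDB's actual implementation:

```python
# Illustrative sketch of single-node deadlock detection: build a
# "wait-for" graph (transaction -> transactions it is blocked on)
# and look for a cycle. Not InnoDB's real implementation.

def has_deadlock(waits_for):
    """Return True if the wait-for graph contains a cycle.

    waits_for: dict mapping a txn id to an iterable of txn ids it waits on.
    """
    visiting, done = set(), set()

    def dfs(txn):
        if txn in done:
            return False
        if txn in visiting:
            return True          # back edge => circular wait => deadlock
        visiting.add(txn)
        if any(dfs(blocker) for blocker in waits_for.get(txn, ())):
            return True
        visiting.remove(txn)
        done.add(txn)
        return False

    return any(dfs(t) for t in list(waits_for))

# T1 waits on T2 and T2 waits on T1: the classic two-transaction deadlock.
print(has_deadlock({"T1": ["T2"], "T2": ["T1"]}))  # True
print(has_deadlock({"T1": ["T2"], "T2": []}))      # False
```

Doing this on one node means walking an in-memory graph; doing it cluster-wide would mean assembling that graph across the network, which is the expense the paragraph above alludes to.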
You may have already guessed it, but Galera does not use pessimistic locking cluster-wide; it uses optimistic locking. Initially, a transaction on PXC behaves much like a regular Innodb transaction: it does pessimistic locking on the node it started from, all the way up to the point of commit. At that point it knows it has gained all the locks it needs on the local node, but it has no idea about the rest of the cluster. So it optimistically shoots the transaction out to the rest of the nodes in the cluster to see if they will agree to it. If they do, the commit succeeds at the simple cost of one network round trip (which is done in parallel to all nodes at once).
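The "first committer wins" behavior can be modeled with a toy simulation. This is a deliberate simplification of Galera's certification step (real Galera certifies replicated writesets by global sequence number; the class and method names here are mine), but it captures the conflict rule:

```python
# Toy model of Galera-style optimistic commit ("first committer wins").
# Each transaction remembers the cluster state it started from; at commit
# time its writeset is checked against everything certified since then.

class Cluster:
    def __init__(self):
        self.seqno = 0
        self.history = []            # (seqno, keys) of certified writesets

    def begin(self):
        return {"snapshot": self.seqno, "keys": set()}

    def write(self, txn, key):
        txn["keys"].add(key)         # pessimistic locks are local-node only

    def commit(self, txn):
        # Certification: did any writeset certified after our snapshot
        # touch a key we also wrote? If so, we lose and are rolled back.
        for seqno, keys in self.history:
            if seqno > txn["snapshot"] and keys & txn["keys"]:
                return "DEADLOCK"    # surfaces to the client as error 1213
        self.seqno += 1
        self.history.append((self.seqno, frozenset(txn["keys"])))
        return "OK"

cluster = Cluster()
t_node1 = cluster.begin()            # UPDATE ... WHERE i = 1 on node1
t_node3 = cluster.begin()            # UPDATE ... WHERE i = 1 on node3
cluster.write(t_node1, "autoinc:1")
cluster.write(t_node3, "autoinc:1")
print(cluster.commit(t_node3))       # OK -- node3 commits first and wins
print(cluster.commit(t_node1))       # DEADLOCK -- node1's later commit fails
```

Note that neither transaction waits on the other at write time; the conflict only surfaces at commit, which is exactly what the demo below shows.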
This sounds all well and good, but what is the tradeoff?
Not locking resources cluster-wide means there can be locking conflicts. In short, these take the form of deadlock errors returned to the clients of the affected transactions. Technically these aren't necessarily deadlocks, but rather cross-node locking conflicts of some kind, and they can happen in places where you wouldn't expect a deadlock at all.
So, enough talking, let’s see what one looks like. Suppose we have a 3 node PXC cluster. Let’s create a table to test on and insert some data:
```
node2 mysql> create table autoinc ( i int unsigned not null auto_increment primary key, j varchar(32) );
Query OK, 0 rows affected (0.02 sec)

node2 mysql> show create table autoinc\G
*************************** 1. row ***************************
       Table: autoinc
Create Table: CREATE TABLE `autoinc` (
  `i` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `j` varchar(32) DEFAULT NULL,
  PRIMARY KEY (`i`)

node2 mysql> insert into autoinc (j) values ('node2' );
Query OK, 1 row affected (0.00 sec)

node2 mysql> insert into autoinc (j) values ('node2' );
Query OK, 1 row affected (0.01 sec)

node2 mysql> insert into autoinc (j) values ('node2' );
Query OK, 1 row affected (0.00 sec)
```
Now, on two different nodes we will open transactions and attempt to modify the same row:
```
node1 mysql> set autocommit=off;
Query OK, 0 rows affected (0.00 sec)

node1 mysql> select * from autoinc;
+---+-------+
| i | j     |
+---+-------+
| 1 | node2 |
| 4 | node2 |
| 7 | node2 |
+---+-------+
3 rows in set (0.00 sec)

node1 mysql> update autoinc set j="node1" where i = 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

node1 mysql> SHOW ENGINE INNODB STATUS\G
...
------------
TRANSACTIONS
------------
...
---TRANSACTION 83B, ACTIVE 50 sec
2 lock struct(s), heap size 376, 1 row lock(s), undo log entries 1
MySQL thread id 3972, OS thread handle 0x7fddb84e0700, query id 16408 localhost root sleeping
show engine innodb status
Trx read view will not see trx with id >= 83C, sees < 83C
TABLE LOCK table `test`.`autoinc` trx id 83B lock mode IX
RECORD LOCKS space id 0 page no 823 n bits 72 index `PRIMARY` of table `test`.`autoinc` trx id 83B lock_mode X locks rec but not gap
```
So far, so good. We have an open transaction on node1 updating row 1 in our table. Note we haven’t committed yet. Now, let’s go to node3:
```
node3 mysql> set autocommit=off;
Query OK, 0 rows affected (0.00 sec)

node3 mysql> select * from autoinc;
+---+-------+
| i | j     |
+---+-------+
| 1 | node2 |
| 4 | node2 |
| 7 | node2 |
+---+-------+
3 rows in set (0.00 sec)

node3 mysql> update autoinc set j="node3" where i=1;
Query OK, 1 row affected (0.01 sec)
Rows matched: 1  Changed: 1  Warnings: 0

node3 mysql> commit;
Query OK, 0 rows affected (0.00 sec)

node3 mysql> select * from autoinc;
+---+-------+
| i | j     |
+---+-------+
| 1 | node3 |
| 4 | node2 |
| 7 | node2 |
+---+-------+
3 rows in set (0.00 sec)
```
Our commit succeeded! On a single node, this update would have blocked waiting for our first transaction's lock, but it did not. Let's go back to node1 and see what happened:
```
node1 mysql> commit;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

node1 mysql> select * from autoinc;
+---+-------+
| i | j     |
+---+-------+
| 1 | node3 |
| 4 | node2 |
| 7 | node2 |
+---+-------+
3 rows in set (0.00 sec)
```
Whoops! We got a deadlock error on commit; that doesn't happen in standard Innodb.

Galera lets the first transaction to commit "win." Any transactions on other nodes holding competing locks are automatically rolled back and will receive a deadlock error no matter what they try to do next (even a SELECT or something else seemingly innocuous).
Dealing with this problem, quite simply, is the tradeoff to multi-node writing in a synchronous replication system with optimistic locking.
So, when is this really a problem? Whenever transactions running on multiple nodes can take the same write locks. Whether that happens is purely application-dependent, but if you're not sure, I'd say you should assume it will be a problem.
What can you do to work with/around this problem? Here are some ideas from simplest to most complex:
- Only write to a single node in the cluster (at a time). All the pessimistic locking will happen as usual on that node only, and there should be no conflicts from other nodes. This is pretty easy to rig up with an HAproxy configuration or similar, so if your main writer node fails, clients fail over gracefully.
- Update your application DB library to handle sudden deadlocks. If you hit one of these deadlock errors, simply retry the transaction. Any software using transactions should have some facility to do this anyway (IMO).
- Restructure your application/schema to avoid data hot spots. This is probably easier said than done in most cases, but in principle, the less likely it is for a single piece of data to be written on multiple nodes at the same time, the less often this problem will happen. For some applications this may come naturally, but I suspect most conventional relational schemas will have these kinds of hot spots.
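The second idea above, retrying on deadlock, can be sketched as a small wrapper. `run_txn` here is a hypothetical callable that opens a transaction, does its work, and commits; 1213 is the MySQL error number we saw above (`ER_LOCK_DEADLOCK`):

```python
# Sketch of a deadlock-retry wrapper. DeadlockError stands in for
# whatever exception your DB driver raises for MySQL error 1213.

import random
import time

class DeadlockError(Exception):
    errno = 1213  # "Deadlock found when trying to get lock; try restarting transaction"

def with_deadlock_retry(run_txn, max_attempts=3):
    """Run a whole transaction, retrying it from scratch on deadlock."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_txn()
        except DeadlockError:
            if attempt == max_attempts:
                raise                # give up; surface the error to the caller
            # Brief randomized backoff so competing retriers are less
            # likely to collide on the same hot row again.
            time.sleep(random.uniform(0.01, 0.05) * attempt)

# A transaction that loses certification once, then succeeds on retry:
attempts = []
def flaky_txn():
    attempts.append(1)
    if len(attempts) == 1:
        raise DeadlockError("Deadlock found when trying to get lock")
    return "committed"

print(with_deadlock_retry(flaky_txn))   # committed (after one retry)
```

The important detail is that the *entire* transaction is re-run, not just the failed COMMIT: after a certification conflict the transaction has been rolled back, so any reads it made may now be stale.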