With the release of MySQL 5.6 binary log group commit is included, which is a feature focused on improving performance of a server when the binary log is enabled. In short, binary log group commit improve performance by grouping several writes to the binary log instead of writing them one by one, but let me digress a little on how transactions are logged to the binary log before going into the details. Before going into details about the problem and the implementation, let look at what you do to turn it on.
Nothing.
Well... we actually have a few options to tweak it, but nothing
required to turn it on. It even works for existing engines since
we did not have to extend the handlerton interface to implement
the binary log group commit. However, InnoDB has some
optimizations to take advantage of the binary log group commit
implementation.
-
binlog_order_commits={0|1}
-
This is a global variable that can be set without stopping the
server.
If this is off (0), transactions may be committed in parallel. In some circumstances, this might offer some performance boost. For the measurements we did, there were no significant improvement in throughput, but we decided to keep the option anyway since there are special cases were it can offer improvements.
-
binlog_max_flush_queue_time=microseconds
- This variable controls when to stop skimming the flush queue (more about that below) and move on as soon as possible. Note that this is not a timeout on how often the binary log should be written to disk since grabbing the queue and writing it to disk takes time.
Transactions galore...As the server executes transactions the server will collect the changes done by the transaction in a per-connection transaction cache. If statement-based replication is used, the statements will be written to the transaction cache, and if row-based replication is used, the actual rows changed will be written to the transaction cache. Once the transaction commits, the transaction cache is written to the binary log as one single block. This allow each session to execute independently of each others and only need to take a lock on the binary log when writing the transaction data to it. Since transactions are isolated from each others it is enough to serialize the transactions on commits. (Of course, this is in an ideal world. Transactions can see other transactions changes if you set a different transaction isolation level. You would never do that unless you knew exactly what you're doing... right?)
In order to keep the storage engine and the binary log in sync,
the server employs a two-phase commit protocol (or just
2PC) that you can see in Figure 1. As you can see,
there is a call to write()
and one call to
fsync()
in the diagram: I'll get to that is just a
moment, so stay tuned.
The entire point of using a two-phase commit protocol is to be able to guarantee that the transaction is either both in the engine and the binary log (or in neither) even in the event that the server crashes after the prepare, and subsequently recovers. That is, it should not be possible that the transaction is in the engine but not in the binary log, or vice verse. Two-phase commit solves this by requiring that once a transaction is prepared in the engine, it can be either fully committed or fully rolled back even if the server crashes and recover. So, on recovery, the storage engine will then provide the server with all the transactions that are prepared but not yet committed, and the server will then commit the transaction if it can be found in the binary log, and roll it back otherwise.
This is, however, only possible if the transaction can be guaranteed to be persistent in the binary log before committing the transaction in the engine. Since disks are slow and memory fast, the operating system tries to improve performance by keeping part of the file in memory instead of writing directly to disk. Once enough changes have been written, or the memory is needed for something else, the changes are written to disk. This is good for the operating system (and also for anybody using the computer), but causes a problem for the database server since if the server crashes, it is possible that the transaction is committed in the storage engine, but there is no trace of it in the binary log.
For recovery to work properly, it is therefore necessary to
ensure that the file is really on disk, which is why there is a
call to fsync()
in Figure 1, which makes the
in-memory part of the file to be written to disk.
The Infamous prepare_commit_mutex
When the server recovers, it has access to the binary log and can therefore decide what to commit and what to rollback, but what if there is no binary log?
In the general case, a recovery can just roll back all prepared transactions and start again. After all, the transactions that were just prepared but not committed are safe to roll back. They just move the database to the state it had just before starting those transactions. Any clients being connected has not got an indication that the transaction is committed, so they will realize that the transactions have to be re-executed.
There is another case where recovery is being used in this way and that is when using on-line backup methods such as InnoDB Hot Backup (which is used in the MySQL Enterprise Backup). These tools take a copy of the database files and InnoDB transaction logs directly—which is an easy way to take a backup—but it means that the transaction logs contain transactions that have just been prepared. On recovery, they roll back all the transactions and have a database in a consistent state.
Since these on-line backup methods are often used to bootstrap
new slaves, the binary log position of the last committed
transaction is written in the header of the InnoDB redo log. On
recovery, the recovery program print the binary log position of
the last committed transaction and you can use this information
with the CHANGE MASTER
command to start replicating
from the correct position. For this to work correctly, it is
necessary that all the transactions are committed in the same
order as they are written to the binary log. If they are not,
there can be "holes" where some transactions are written to the
binary log, but not yet committed, which cause the slave to miss
transactions that were not committed. The problematic case that
can arise is what you see in Figure 3 below.
You can see an example of this in Figure 2, where replication will start from the last committed position, but there is a transaction that were just prepared and hence was rolled back when the backup was restored on the slave.
To solve this, InnoDB added a mutex called the
prepare_commit_mutex
that was taken when preparing a
transaction and released when committing the transaction. This is
a simple solution to the problem, but causes some problems that
we will get to in a minute. Basically, the
prepare_commit_mutex
solve the problem by forcing
the call sequence to be as in Figure 4.
Steady as she goes... NOT!Since disk writes are slow, writing every transaction to disk will affect performance quite a lot... well, actually very much...
To try to handle that, there is a server option
sync_binlog
that can be set to how often the binary
log should be written to disk. If it is set to 0, the operating
system will decide on when the file pages should be written to
disk, if it is set to 1, then fsync()
will be called
after every transaction being written to the binary log. In
general, if you set sync_binlog
to N, you
can at most lose N-1 transactions, so in practice there
are just two useful settings: sync_binlog=0
means
that you accept that some transactions can be lost and handle it
some other way, and sync_binlog=1
means that you do
not accept to lose any transactions at all. You could of course
set it to some other value to get something in between, but in
reality you can either handle transaction loss or not.
To improve performance, the common case is to bundle many writes
with each sync: this is what the operating system does, and that
is what we should be able to do. However, if you look at
Figure 4 you see that there is no way to place a
fsync()
call in that sequence so that several
transactions are written to disk at the same time. Why? Because
at any point in that sequence there is at most one prepared and
written transaction. However, if you go back to Figure 3,
you can see that it would be possible to place an
fsync()
as shown and write several transactions to
disk at the same time. If it was possible, then all transactions
written to the binary log before the fsync()
call
would be written to disk at once. But this means that it is
necessary to order the commits in the same order as the writes
without using the prepare_commit_mutex
.
So, how does all this work then...The binary log group commit implementation used split the commit procedure into several stages as you can see in Figure 5. The stages are entirely internal to the binary log commit procedure and does not affect anything else. In theory, it would be possible to have another replication implementation with another policy for ordering commits. Since the commit procedure is separated into stages, there can be several threads processing transactions at the same time, which also improves throughput.
For each stage, there is an input queue where sessions queue up for processing. If a thread registers in an empty queue, it is considered the stage leader otherwise, the session is a follower. Stage leaders will bring all the threads in the queue through the stage and then register the leader and all followers for the next stage. Followers will move off to the side and wait for a leader to signal that the entire commit is done. Since it is possible that a leader registers to a non-empty queue, a leader can decide to become a follower and go off waiting as well, but a follower can never become a leader.
When a leader enters a stage, it will grab the entire queue in one go and process it in order according to the stage. After the queue is grabbed, other sessions can register for the stage while the leader processes the old queue.
In the flush stage, all the threads that registered will have their caches written to the binary log. Since all the transactions are written to an internal I/O cache, the last part of the stage is writing the memory cache to disk (which means it is written to the file pages, also in memory).
In the sync stage, the binary log is synced to
disk according to the settings of sync_binlog
. If
sync_binlog=1
all sessions that were flushed in the
flush stage will be synced to disk each time.
In the commit stage, the sessions will that registered for the stage will be committed in the engine in the order they registered, all work is here done by the stage leader. Since order is preserved in each stage of the commit procedure, the writes and the commits will be made in the same order.
After the commit stage has finished executing, all threads that were in the queue for the commit stage will be marked as done and will be signaled that they can continue. Each session will then return from the commit procedure and continue executing.
Thanks to the fact that the leader registers for the next queue and is ready to become a follower, the stage that is slowest will accumulate the most work. This is typically the sync stage, at least for normal hard disks. However, it is critical to fill the flush stage with as many transactions as possible, so the flush stage is treated a little special.
In the flush stage, the leader will skim the the sessions one by
one from the flush queue (the input queue for the flush stage).
As long as the last session was not remove the from the queue, or
the first session was unqueued more than
binlog_max_flush_queue_time
microseconds ago, this
process will continue. There are two different conditions that
can stop the process:
- If the queue is empty, the leader immediately advanced to the next stage and registers all sessions processed to the sync stage queue.
- If the timeout was reached, the entire queue is grabbed and the sessions transaction caches are flushed (as before). The leader then advance to the sync stage.
Performance, performance, performance...I'm sure you all wonder what the improvements are, so without further delay, let's have a look at the results of some benchmarks we have done on the labs tree. There has been several improvements that is not related to the binary log, so I will just focus on the results involving the binary log. In Figure 6 you see a benchmark comparing the 5.6.5 release with the 5.6 June labs release using the binary log. These benchmarks were executed on an 2.00 GHz 48-core Intel® Xeon® 7540 with 512 GiB memory and using SSD disks.
As you can see, the throughput has increased tremendously, with
increases ranging from a little less than 2 and approaching 4
times that of 5.6.5. To a large extent, the improvements are in
the server itself, but what is interesting is that with binary
log group commit, the server is able to keep the pace. Even with
sync_binlog=0
on 5.6.5 and
sync_binlog=1
on the 5.6 labs release, the 5.6 labs
release outperforms 5.6.5 by a big margin.
Another interesting aspect is that even with
sync_binlog=1
the server performs nearly as well
when using sync_binlog=0
. On higher number of
connections (roughly more than 50), the difference in throughput
is varying between 0% [sic] and with a more typical throughput
between 5% and 10%. However, there is a drop of roughly 20%
in the lower range. This looks very strange, especially in the
light that the performance is almost equal in the higher range,
so what is causing that drop and is there anything that can be
done about it?
The answer comes from some internal benchmarks done while developing the feature. For these tests we were using Sysbench on a 64-core Intel® Xeon® X7560 running at 2.27GHz with 126 GB memory and a HDD.
In the benchmarks that you can see in Figure 7 the enhanced version of 5.6 with and without the binary log group commit is compared. The enhanced version of 5.6 include some optimizations to improve performance that are not available in the latest 5.6 DMR, but most are available in the labs tree. However, these are benchmarks done while developing, so it is not really possible to compare them with Figure 6 above, but it will help understand why we have a 20% drop at the lower number of connections.
The bottom line in Figure 7 is the enhanced 5.6 branch
without binary log group commit and using
sync_binlog=1
, which does not scale very well. (This
is nothing new, and is why implementing binary log group commit
is a good idea.) Note that even at sync_binlog=0
the
staged commit architecture scale better than the old
implementation. If you look at the other lines in the figure, you
can see that even when the enhanced 5.6 server is running with
sync_binlog=0
, the binary log group commit
implementation outperforms the enhanced 5.6 branch at roughly 105
simultaneous threads with sync_binlog=1
.
Also note that the difference between sync_binlog=1
and sync_binlog=0
is diminishing as the number of
simultaneous connections is increased, to vanish completely at
roughly 160 simultaneous connections. We haven't made a deep
analysis of this, but while using the performance schema to
analyze the performance of each individual stage, we noted that
the sync time was completely dominating the performance (no
surprise there, just giving the background), and that all
available transactions "piled up" in the sync stage queue. Since
each connection can at most have one ongoing transaction, it
means that at 32 connections, there can never be more than 32
transactions in the queue. As a matter of fact, one can expect
that over a long run, roughly half of the connections are in the
queue and half of the connections are inside the sync stage (this
was also confirmed in the measurements mentioned above), so at
lower number of connections it is just not possible to fill the
queue enough to utilize the system efficiently.
The conclusion is that reducing the sync time would probably make
the difference between sync_binlog=0
and
sync_binlog=1
smaller even on low number of
connections. We didn't do any benchmarks using disks with
battery-backed caches (which should reduce the sync time
significantly, if not entirely eliminates it), but it would be
really interesting to see the effect of that on performance.
Summary and closing remarks
- The binary logging code has been simplified and optimized,
leading to improved performance even when using
sync_binlog=0
. - The
prepare_commit_mutex
is removed from the code and instead the server orders transactions correctly. - Transactions can be written and committed as groups without
losing any transactions, giving around 3 times improvement in
performance on both
sync_binlog=1
andsync_binlog=0
. - The difference between
sync_binlog=0
andsync_binlog=1
is small and reduces as the load increases on the system. - Existing storage engines benefit from binary log group commit since there are no changes to the handlerton interface.
Binary log group commit is one of a range of important new enhancements to replication in MySQL 5.6, including global transaction IDs (GTIDs), multi-threaded slave, crash safe slave and binary log, replication event checksums, and some more. You can learn more about all of these from our DevZone article: dev.mysql.com/tech-resources/articles/mysql-5.6-replication.html
You can also try out binary log group commit today by downloading
the latest MySQL 5.6 build that is available on labs.mysql.com
Related links
- It all started with this post where Mark points out the problem and
show some results of their implementation.
- The next year, Kristian
Nielsen implemented binary log group commit for MariaDB and
has a lot of good posts on the technical challenges in implementing it. This
implementation is using an atomic queue and does flushing and
syncing of the transactions as a batch, after which the sessions
are signalled in order and commit their transactions.