At MySQL Connect last weekend, MySQL 5.6 was released as a
Release Candidate. It contains a large set of improvements over
MySQL 5.5. In this blog I will focus on what has changed in the
area of scalability.
There are four main areas of scalability improvement in MySQL
5.6:
1) Splitting InnoDB Kernel mutex
2) The G5 discovery
3) Splitting LOCK_open mutex
4) Binlog group commit
In InnoDB, as in so many other products, there was an infamous
kernel mutex. This mutex was a sort of catch-all mutex used to
protect the system from concurrent use of various data
structures. For an implementer it was easy enough to use the
kernel mutex as a way to achieve safe concurrency. The problem,
of course, is that it hurts scalability. The kernel mutex was a
major bottleneck in many benchmarks, and even more importantly,
it made it very difficult to resolve other scalability
issues.
Early in the MySQL 5.6 development, an effort in the InnoDB
development team led by Sunny Bains took the kernel mutex and
split it into around 10 new mutexes. Effectively, the kernel
mutex was previously used for 10 different things; now, through
the split, each new mutex protects only one logical entity. This
naturally had a great impact on scalability in the MySQL 5.6
release, but it will also make future development easier. Since
each new mutex protects only one logical entity, it is much
easier to see how to improve the handling of each of those 10
new mutexes when they become scalability issues. The new mutex
with the most contention now is the lock mutex, which mainly
protects the metadata of tables.
The G5 discovery refers to a name we gave a patch internally. We
made several experiments after splitting the kernel mutex to see
which of the remaining mutexes would make most sense to fix next.
Some experimental patches were developed that entirely removed a
number of mutexes (obviously not a fully working MySQL version,
but sufficiently working for a simple benchmark). The puzzle we
met was that whatever we removed, we had no success in improving
performance. We looked at this problem using various tools
without finding the root cause. Finally we used oprofile and the
new Linux perf tool to see which code lines we spent time in, and
eventually discovered the issue: every time a row was read in
InnoDB, we updated a statistics counter. In the case of Sysbench
we have a number of scans where each scan hits 100 rows, so for
each Sysbench transaction we read around 410 rows. Modern
4-socket servers are fast, but if one tries to update the same
variable from all CPUs at once, the cache line that this variable
resides in will bounce back and forth between the CPUs in the
system. The systems we used for testing could handle between 2
and 4 million updates per second of the same variable when the
updates were applied from CPUs residing in different sockets. So
effectively this statistics counter kept the number of rows read
in a MySQL Server down to 2-4 million reads per second, or,
converted to Sysbench transactions, we could not deliver more
than 5,000-10,000 TPS. This variable had no major effect in
MySQL 5.5, since 5.5 has other scalability issues that hide this
one. But in MySQL 5.6 we solved so many scalability issues that
this variable became a major bottleneck. A simple solution would
have been to simply remove the line, but we developed a more
proper solution that we also used in many other similar areas
where this could potentially become an issue.
Finding and resolving this issue was a team effort made possible
by the InnoDB team, the MySQL Server team and the MySQL
performance experts. The development resources available in the
Oracle MySQL organisation make it possible to continue scaling
MySQL towards new heights.
Another sort of kernel mutex is the LOCK_open mutex in MySQL.
This mutex was originally meant to protect the metadata of a
table within the MySQL Server, but it was used for many other
things as well. This meant that splitting it required a number of
reengineering projects before we were ready to perform the actual
split. Early in the MySQL 5.6 development, the groundwork for the
split was finalised through the work of the MySQL runtime team.
The idea for the actual split came in a development meeting in
Trondheim, where Dmitry Lenev and I quickly came up with an idea
to protect TABLE objects separately through an array of mutexes
and keep LOCK_open only for creating new TABLE objects and other
activities around table metadata changes.
Finally this idea also became a new addition to the MySQL 5.6
release. Standing on the shoulders of the G5 discovery and the
split of the kernel mutex, the split of the LOCK_open mutex made
for a very significant jump in performance. We were able to
improve performance by 70% and more in a number of benchmarks
based solely on the LOCK_open split.
What we have discovered with MySQL 5.6 and its benchmarking is
that the standard Sysbench has a limiting factor: it only uses
one table, which means that any protection of this single table
becomes part of the scalability issues when running Sysbench.
Since we assume that most, if not all, applications don't direct
all queries towards one table, we also started benchmarking MySQL
5.6 using two or more tables in Sysbench. In this case we avoid
bottlenecks related to the use of only a single table.
In MySQL 5.6 we also decided to spend serious effort on improving
replication scalability. We were happy with the scalability
improvements in MySQL 5.5, but saw a need to direct more
attention to the replication area. We set out with a modest goal
of scaling replication to 16 CPUs. We actually managed to scale
replication on the server side to 48 CPUs, as far as the MySQL
Server itself scales. We managed this even with sync_binlog set
to 1! So in MySQL 5.6 the binlog writing flies. The slave side
also saw major improvements in scalability in MySQL 5.6.
The problem with writing the binlog was that we had only one
mutex protecting those writes and all their phases. Among other
things, this mutex also protected the file writes to the binlog
and even the syncing of those writes to disk.
In order to solve this issue, Mats Kindahl and I experimented
with a number of approaches to discover what worked best. We
eventually came up with a sort of queueing system, where writes
to the binlog are divided into phases. The first phase happens as
part of transaction execution (as it does in MySQL 5.5), where
binlog writes are gathered in a data structure operated by the
connection; thus there is no concurrency issue at all in this
phase. The next step is to write this data into the file buffers
operated by the operating system through write system calls. The
next step is the sync of the writes to the disk. The last step is
the commit that happens in the storage engines. Each of those
steps is handled similarly, but with minor differences.
Each step has two phases, a queue phase and an execution phase,
protected by separate locks. Whenever a connection arrives at a
step, it grabs the queue mutex and places itself in the queue. If
it is the first to arrive in the queue, it also tries to grab the
execution mutex after releasing the queue mutex. This means that
while it waits for the execution mutex, more connections can
place their data in the queue; they will then wait for the
execution owner to signal them when the work is done (this could
happen after more than one step has been executed). When the
execution mutex is acquired, its owner also grabs the queue
mutex, takes the queue, sets the queue to empty and releases the
queue mutex. After this it executes all tasks in the queue.
The final commit step can be parallelised if desired, so multiple
connections can commit simultaneously. The only limitation is
when a hot backup is happening at the same time.
What is interesting with this architecture is that it is flexible
with respect to where the bottleneck is. For some workloads the
bottleneck is in the commit phase, in which case it makes sense
to parallelise that part, which is possible. For others the sync
phase is the bottleneck, and for yet other workloads it is the
writing to the buffer. Wherever the bottleneck resides, this
architecture makes the best of the situation and scales MySQL 5.6
replication to new heights.
So with this impressive set of new improvements in MySQL 5.6,
what is left to do? As usual, when one develops a new feature or
performance improvement, one also finds a ton of other things one
wants to improve. So the list of ideas is not empty, and there
are some pretty interesting scalability improvements in line for
MySQL 5.7 as well. We currently have access to machines with 96
CPU threads, and bigger machines are likely to appear; we also
cooperate with other parts of Oracle where even larger systems
are possible. So we continue to aim at keeping MySQL scalable on
commodity servers. Naturally we have noted that many of our
customers are sharding their data sets to handle even higher
workloads. A long career in distributed systems has taught me
that it is extremely important to do proper partitioning of data
sets to achieve scalability at the network level. But it is still
extremely important to make each node in the distributed system
as large as possible: this decreases maintenance activities and
minimises issues in sharding by minimising the number of shards.
We are pursuing this effort both in the MySQL Cluster area and in
the MySQL area where InnoDB is used as a storage engine.
Oct 03, 2012