This article is providing the details to the 500K QPS with MySQL 5.7 benchmark results
published earlier and explained during my talk at MySQL
Looking back in the history of MySQL / InnoDB improvement, you may easily realize that there was no any so big speed-up(s) between releases in read-only (RO) performance as they came with MySQL 5.6. And it's easy to understand as well that without having a good level of scalability on read-only, it's hard to expect to reach a good level on read+write (specially when reads are dominating ;-))
However, we were so happy with RO performance in MySQL 5.6, that with 5.7 we turned our main focus on the read+write (RW), because on big data volumes things are not yet going as we wish.. But the RW dependency on RO jumped up once more again, and then InnoDB team came with few more improvements to push the max QPS level in 5.7 yet more far.
But let's take the things by order..
In fact in MySQL 5.6 on RO workload dominating internal contentions are
- Single table used only: MDL, trx_sys and lock_sys (InnoDB)
- Many tables used: trx_sys and lock_sys (so, InnoDB mainly)
Any fast single-table-bound workload will be mainly blocked by
MDL contentions. While multi-table will be limited by InnoDB
internals (different tables will be protected by different MDL
locks, so contention bottleneck on MDL will be lower in this
case). But again, all depends on a workload too -- a more generic
and wider RO workload should scale in MySQL 5.6 pretty well (like
Sysbench OLTP_RO), while workload with short and fast queries
(like Sysbench Point-Selects (fetch a record by PK)) is hitting
all these contentions much more harder and will scale only up to
16cores-HT, having a worse result on 32cores.. But any such a
workload similar to Point-Select test will show you the max
possible performance rate on which all MySQL internals are able
to work together (starting by SQL parser and finishing by row
fetch round-trip).. This will be as well the max possible SQL
query/sec (QPS) rate you may ever achieve on a given MySQL
version and a given HW configuration.
For MySQL 5.6 our best obtained result was around 250K QPS, which on that time was the best ever seen on MySQL / InnoDB with SQL queries ;-)
However, it was only possible to achieve if READ-ONLY TRANSACTIONS were used (a new feature in MySQL 5.6), and again, if AUTOCOMMIT=1 was used, otherwise CPU cycles are simply wasted to process START TRANSACTION / COMMIT statements, and you're loosing in overall performance for sure..
So, the very first improvement introduced within MySQL 5.7 was an auto-discovery of RO transactions (in fact any InnoDB transaction is considered as read-only until any DML is not involved within it).. - this simplified a lot the RO transactions feature itself, and life for users and developers too - no need to take care anymore if RO transaction statement was used or not. But again, you cannot reach the max possible potential QPS on MySQL with this feature as CPU time is still wasted by processing transactions begin / end statements..
On the same time Percona came with a different solution to resolve "transaction list" management (TRX-list) and as the result lower the trx_sys mutex contention within InnoDB. Percona's solution kept better a high load on Point-Selects with transactions workload, but MySQL 5.7 was yet more better here (but I was not able to publish the results of 5.7 as its code was not yet public).. So, at least I'm able to do it now :
- the same RO Point-Select-TRX test (using transactions) on 8-tables is executed on MySQL 5.6, Percona 5.5, and then MySQL 5.7 (results are from May.2013)
- and as you can see, we're far here from the peak 250K QPS obtained on the same 16cores-HT configuration..
- MySQL 5.6 is hitting a contention on the trx_sys mutex and QPS is decreasing since 64usr..
- Percona 5.5 is keeping a load longer, and QPS decrease is starting only since 512usr..
- while MySQL 5.7 already kept on that time QPS without any decrease at all (and even with a higher number of concurrent users which you don't see on these graphs)..
However, it's clear that if one is willing to get the MAX of potential RO QPS rate with MySQL - transactions should be avoided..
So, let's see where it was our MAX QPS in May-2013.
Same Point-Select 8-tables test, but without transactions @MySQL 5.6 :
- the test is executed with keeping MySQL 5.6 running on 16cores only, then on 16cores-HT, then 32cores, and 32cores-HT
- as you can see, the Max QPS is even bigger than expected - it's 275K QPS on MySQL 5.6
- and the Max result is reached on 16cores-HT
- while result on 32cores is not better than on 16cores-HT (due internal contentions, and contentions are better managed here by having 2 CPU threads on the same core - so the true concurrency remains on the level of 16 threads, and not 32 as it will be on 32cores)..
While the same test on MySQL 5.7 was looking pretty differently, as for that time contention on the lock_sys mutex was already lowered in 5.7, and trx_sys mutex related code also got its first changes:
- first of all you may see that on the same 16cores-HT configuration 5.7 was already performing better than 5.6
- then, performance yet more improved on 32cores configuration!
- and then reaching its Max 350K QPS on 32cores-HT config!!
- for the first time in history on this particular (and most aggressive) RO workload we got a better result on 32cores -vs- 16cores-HT, and yet more improved with hyper-threading enabled (on 32cores-HT).. - awesome! ;-)
Comparing to all we saw until now, that was really very good!! ;-)
From the other hand, it was clear as well there is still a room for improvement. Contention on trx_sys was still remained. We did not use a full CPU power to do a useful work (still a lot of CPU cycles were spending on lock spinning).. But the result was already better than ever, and way better than 5.6, so there was no valid reason to continue digging to improve performance here, while our main focus was on RW workloads where we have even more huge room for improvement..
By the end of May, during our Perf-Meeting, Sunny added few changes more around trx_sys mutex contention, and our Max QPS moved to 375K QPS(!) since then - should be more than enough as improvement for 5.7, right? ;-)
On the same time we continued exchange with Percona team who proposed to manage TRX-list in a different way.. - the solution looked interesting, but on 5.5 code base was unable to show a higher QPS, and on 5.6 code base (once Percona Server 5.6 was available) the Max QPS was not bigger than in MySQL 5.6. However, the discussion involved a very interesting point: what will be the impact on RO performance if there will be some RW workloads running in parallel?.. And even if MySQL 5.7 code was still running better within the same test conditions, the impact was very visible (you may find my analyzes about here, while, again, I was not able to present 5.7 results during this time as the code was not public yet - may provide it in one of the next articles now)..
And as any pure RW workload was impacted here as well, there was enough of motivations to finally remaster the whole TRX-list related code in a way that Sunny wanted to see it from a long time.. And this experience was simply amazing! ;-)) Day after day we enjoyed to see our QPS graphs going higher and higher.. - till reached 440K QPS(!) on the same 32cores-HT server.
Point-Selects 8-tables @5.7-DMR2 :
No comments.. ;-))
However, there was a small mystery around.. - we intentionally with Sunny analyzed all bottlenecks and impact of code changes via different tools, and on some of the tests for my big surprise Sunny observed higher QPS levels than me.. This "mystery" was related to the following factors:
- on a high load the 5.7 code is running now near the limits of the HW capacity (mainly CPU), so every instruction matters!
- and the difference if UNIX socket or IP port is used becomes very visible!
- and Sysbench itself is using 30% of CPU time, but if the older Sysbench version is used (with a smaller code path) for the same test load it'll consume only 20% CPU, leaving additional 10% to the MySQL server ;-)
- so, in the same test workload, by using UNIX socket instead of IP port + replacing Sysbench-0.4.13 by Sysbench-0.4.8 we're out-passing 500K QPS(!) - easy, right? ;-))
Let's compare "before" and "after" :
- Lowered CPU% usage by Sysbench
- resulting in a higher CPU time availability for MySQL server
- and we're hitting 500K QPS..
What else?.. ;-)
I may only add: kudos Sunny & whole MySQL Dev Team! ;-)
Let's get a look now on Max QPS obtained on the same Point-Select 8-tables workload on other engines as well. I've used the latest ones available on beginning of Sep.2013 :
- MySQL-5.7.2 (DMR2)
- Percona Server 5.6.13-rc60.5
- Percona Server 5.5.33-rel31.1
Each engine is tested within the following configuration variations:
- CPU taskset: 8cores-HT, 16cores, 16cores-HT, 32cores, 32cores-HT
- Concurrent Users Sessions: 8, 16, 32 ... 1024
- InnoDB Spin Wait Delay: 6, 96
Then the best results from any given combination for every engine are compared -vs- others. And I'm obtaining the graph which I've already presented in the previous article:
Few comments :
- No need to comment the gap on the MySQL 5.7 results, it's just evident..
- then, what is interesting that none of engines based on the MySQL 5.5 code base did not get any closer to the MySQL 5.6 results..
- which is just confirming that only since the MySQL 5.6 code base was used, Percona Server reached the 5.6 level, while MariaDB-10 is still on the road..
- so, the MySQL 5.6 code base rocks, no doubt!
- and MySQL 5.7 simply rocks twice! ;-))
And what about scalability?
The answer is simple: MySQL 5.7 is the only one scaling here.
Then, if IP port is used and a more "heavy" Sysbench-0.4.13, the results are the following:
QPS is just slightly lower, but the tendency is exactly the same..
And scalability is pretty similar:
More other results will come, so stay tuned ;-)
NOTE : for a single-table-bound workloads the story is not yet good:
- decreased contentions on the InnoDB level made other contentions much more visible
- and MDL contentions became even more dominating when the load is bound on a single table..
- this is as expected, and should be fixed within the next DMRs..
We have yet many challenges ahead of us ;-)
For reference, my HW config for presented tests was the following:
- Server : 32cores-HT (bi-thread) Intel 2300Mhz, 128GB RAM
- OS : Oracle Linux 6.2
- FS : EXT4 mounted with "noatime,nodiratime,nobarrier"
max_connections=4000 key_buffer_size=200M low_priority_updates=1 table_open_cache = 8000 back_log=1500 query_cache_type=0 table_open_cache_instances=16 # files innodb_file_per_table innodb_log_file_size=1024M innodb_log_files_in_group = 3 innodb_open_files=4000 # buffers innodb_buffer_pool_size=32000M innodb_buffer_pool_instances=32 innodb_additional_mem_pool_size=20M innodb_log_buffer_size=64M join_buffer_size=32K sort_buffer_size=32K # innodb innodb_checksums=0 innodb_doublewrite=0 innodb_support_xa=0 innodb_thread_concurrency=0 innodb_flush_log_at_trx_commit=2 innodb_max_dirty_pages_pct=50 innodb_use_native_aio=1 innodb_stats_persistent = 1 innodb_spin_wait_delay= 6 / 96 # perf special innodb_adaptive_flushing = 1 innodb_flush_neighbors = 0 innodb_read_io_threads = 4 innodb_write_io_threads = 4 innodb_io_capacity = 4000 innodb_purge_threads=1 innodb_adaptive_hash_index=0 # monitoring innodb_monitor_enable = '%' performance_schema=OFF
In case you need it, Linux Sysbench binaries available from here:
The Sysbench command used to run Point-Selects test via UNIX socket (starting 8 processes in parallel):
LD_PRELOAD=/usr/lib64/libjemalloc.so /BMK/sysbench-0.4.8 --num-threads=$1 --test=oltp --oltp-table-size=10000000 \ --oltp-dist-type=uniform --oltp-table-name=sbtest_10M_$n \ --max-requests=0 --max-time=$2 --mysql-socket=/SSD_raid0/mysql.sock \ --mysql-user=dim --mysql-password=dim --mysql-db=sysbench \ --mysql-table-engine=INNODB --db-driver=mysql \ --oltp-point-selects=1 --oltp-simple-ranges=0 --oltp-sum-ranges=0 \ --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on \ --oltp-read-only=on run > /tmp/test_$n.log &
The Sysbench command used to run Point-Selects test via IP port (starting 8 processes in parallel):
LD_PRELOAD=/usr/lib64/libjemalloc.so /BMK/sysbench-0.4.13 --num-threads=$1 --test=oltp --oltp-table-size=10000000 \ --oltp-dist-type=uniform --oltp-table-name=sbtest_10M_$n \ --max-requests=0 --max-time=$2 --mysql-host=127.0.0.1 --mysql-port=5700 \ --mysql-user=dim --mysql-password=dim --mysql-db=sysbench \ --mysql-table-engine=INNODB --db-driver=mysql \ --oltp-point-selects=1 --oltp-simple-ranges=0 --oltp-sum-ranges=0 \ --oltp-order-ranges=0 --oltp-distinct-ranges=0 --oltp-skip-trx=on \ --oltp-read-only=on run > /tmp/test_$n.log &