40% better single-threaded performance in MariaDB

Continuing my investigation of single-threaded performance in the MariaDB server, I managed to increase throughput of single-threaded read-only sysbench by more than 40% so far:

I use read-only sysbench 0.4.12 run like this:

    sysbench --num-threads=1 --test=oltp --oltp-test-mode=simple --oltp-read-only --oltp-skip-trx run

And mysqld is run with minimal options:

    sql/mysqld --no-defaults --basedir=X --datadir=Y --innodb-buffer-pool-size=128M

With modern high-performance CPUs, it is necessary to do detailed measurements using the built-in performance counters in order to get any kind of understanding of how an application performs and what the bottlenecks are. Forget about looking at the code and counting instructions or cycles as we did in the old days. It no longer works, not even to within an order of magnitude.

I am using the Linux perf program for this. During my invistigations, I found that the main bottleneck in the benchmark turns out to be instruction cache misses. Here is an example measurement:

Counter Value
INST_RETIRED.ANY 170431
ICACHE.MISSES 17194

So 10% of executed instructions missed the level 1 instruction cache ("icache"). That is bad. The Intel Optimization Manual says this about the ratio ICACHE.MISSES/INST_RETIRED.ANY:

Anything over 1% of instructions retired can be a significant issue.

So we are 10 times worse than "a significant issue".

Instruction cache misses cause a bottleneck in the frontend of the CPU - where x86 instructions are fetch, decoded, and supplied to the micro-op queue to be scheduled for the out-of-order dataflow backend. To get an idea about how badly bottlenecked we actually are in the frontend in this benchmark, we can use another measure from the Intel manual:

IDQ_UOPS_NOT_DELIVERED.CORE / (4*CPU_CLK_UNHALTED.THREAD) = 81.5%

This ratio estimates the percentage of time in which the front-end is not able to deliver instructions sufficiently fast for the backend to work at full speed. In this case, for more than 80% of the time, the CPU would have been able to do more work in the same time if only instructions could be fetch and decoded faster.

Note the importance of this analysis in knowing what areas to work on to improve performance. Roughly speaking, 80% of our time is spent waiting for the icache. We could have spent years on improving data locality or branch mispredictions or whatever in the code, and we would never have been able to get very big improvements. Now, knowing where the sore spot is, we can spend our efforts where it matters the most.

In the case of icache misses, the best place to start is with profile-guided optimisation. This works by first running a specially instrumented mysqld binary to collect data about the load during the benchmark. Then the binary is recompiled, where the compiler can take into account the information obtained to better optimise the generated code.

Using information about how the program actually behaves when run can help the compiler lay out the basic blocks of the program to benefit the most common execution paths. This can help reduce the footprint in the icache by reducing jumps into or out of the middle of a 16-byte instruction stream (such jumps waste the rest of the 16-byte stream, which is the unit of storage of the icache). It can also increase the length of straight-line code execution paths, which should help the hardware prefetcher keep the level 1 icache supplied.

So that is the theory - but does it work in practice? It turns out that it does. First I compile with --coverage added to the CFLAGS/CXXFLAGS. Then I run sysbench to generate the profile data. Then I recompile adding instead the --profile-use flag. That is all there is to it, GCC handles everything else automatically.

Re-running the benchmark with the optimised binary yields a large speedup:

Binary Queries per second Speedup
Base: -O3 21404  
PGO: -O3 --profile-use 30903 44%

So a 44% speedup just from compiling with different optimisation, not bad! (The actual server-side improvement is in fact even higher, as the sysbench client consumes part of the runtime due to the single-threaded nature of the test).

By the way, that 44% speedup comes from just a modest reduction in icache miss rate - from 10% down to 8%. It just shows how expensive those icache misses are. Fetching from L2 cache takes something like 12 cycles, and during each of those cycles the CPU will be able to do nothing. Given that we could potentially execute 4 operations in the backend each of those clocks, that is a lot of waste. So with 8% misses still left there is room for further huge improvements, though those improvements will likely be much harder to achieve than just fiddling with compiler options.

I think it is clear that we will need to start using profile-guided optimisation for future MariaDB binaries. This will be some work, as we need to develop a set of typical workloads to optimise after, and integrate running of those into the compilation. We will need to cover a range of different common usages to optimise after. And more tests with a wider range of benchmarks will be needed to ensure that the gain is more generally applicable and does not cause significant regressions in performance of other parts of the code. With such a huge improvement in this test I am confident that things can be generally improved, but it still needs proper testing.

My work on improving single-threaded performance will continue, as time permits, and I certainly expect more good results along the way (several patches are already in the pipeline). But I thought this one was so interesting that it was worth mentioning to a wider audience.