200M reads per second in MySQL Cluster 7.4

By courtesy of Intel we had access to a very large cluster of Intel servers for a few
weeks. We took the opportunity to see what the new Haswell generation of Intel Xeon
chips brings, and also to see how far we can now scale flexAsynch, the NoSQL
benchmark we've developed for testing MySQL Cluster.

The last time we ran this benchmark we were using MySQL Cluster 7.2. The main
bottleneck then was that each API node could not push through more than around
300k reads per second, and since a cluster is limited to 255 nodes in total, we
could not get beyond a bit more than 70M reads per second with MySQL Cluster 7.2.

In MySQL Cluster 7.3 we improved the handling of thread contention in the NDB API,
which means that we can now push much more traffic per API node.
In MySQL Cluster 7.4 we further improved the receive processing in the NDB API
as well as the handling of scans and PK lookups in the data
nodes. This means that each API node can now process more than
1M reads per second. This is very good throughput given that each read carries
about 150 bytes, so each socket can handle more than 1 Gbit per second.
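To give a feeling for how this kind of traffic is generated, here is a minimal
sketch of flexAsynch-style primary key reads issued through the asynchronous
NDB API: many transactions are prepared and then sent and polled for in one
batch. The connect string, database, table and column names below are
illustrative assumptions, not the actual benchmark schema, and most error
handling is omitted.

  // Minimal sketch of batched PK reads via the asynchronous NDB API.
  #include <NdbApi.hpp>
  #include <cstdio>

  // Completion callback, invoked from sendPollNdb() when a read has finished.
  static void read_done(int result, NdbTransaction *trans, void *obj)
  {
    Ndb *ndb = (Ndb *) obj;
    if (result != 0)
      fprintf(stderr, "read failed: %s\n", trans->getNdbError().message);
    ndb->closeTransaction(trans);
  }

  int main()
  {
    ndb_init();
    Ndb_cluster_connection conn("mgmhost:1186");      // assumed connect string
    if (conn.connect(4, 5, 1) != 0 || conn.wait_until_ready(30, 0) < 0)
      return 1;

    Ndb ndb(&conn, "test");                           // assumed database name
    ndb.init(1024);                                   // allow many parallel transactions
    const NdbDictionary::Table *tab =
      ndb.getDictionary()->getTable("demo_table");    // assumed table name

    const int batch = 256;                            // reads prepared per round
    for (int i = 0; i < batch; i++)
    {
      NdbTransaction *trans = ndb.startTransaction();
      NdbOperation *op = trans->getNdbOperation(tab);
      op->readTuple(NdbOperation::LM_CommittedRead);  // simple committed read
      op->equal("pk", (Uint32) i);                    // primary key lookup
      op->getValue("val", NULL);                      // fetch one attribute
      // Only prepare here; nothing is sent until sendPollNdb() below.
      trans->executeAsynchPrepare(NdbTransaction::Commit, read_done, &ndb);
    }
    // Send all prepared transactions, packed into as few packets as possible,
    // and poll until they have completed.
    ndb.sendPollNdb(3000, batch);

    ndb_end(0);
    return 0;
  }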

To describe what we achieved we'll first describe the HW involved.
The machines had 2 sockets with Intel E5-2697 v3 processors. These are
Haswell-based Intel Xeon processors with 14 cores and 28 CPU threads per socket,
giving a total of 28 cores and 56 CPU threads per server, operating at a 2.6GHz base
frequency and a 3.6GHz turbo frequency. The machines were equipped with
64 GByte of memory each. They had an InfiniBand connection and
a gigabit Ethernet port for communication.

Communication to the outside was actually limited by the InfiniBand interrupt
handling, which was set up to be latency-optimised and thus produced higher
interrupt rates. We did however manage to drive flexAsynch such that this
limitation stayed very minor, costing us less than 10% of the maximum
performance available.

We started testing using just 2 data nodes with 2 replicas. In this test we were able
to reach 13.94M reads per second. Using 4 data nodes we reached
28.53M reads per second, and using 8 data nodes we scaled almost
linearly up to 55.30M reads per second. The almost linear scaling continued:
104.7M reads per second with 16 data nodes, 131.7M reads per second with
20 data nodes, and 156.5M reads per second with 24 data nodes. Finally we took the
benchmark to 32 data nodes, where we were able to achieve a new record of
205.6M reads per second.

The configuration we used in most of these tests had:
 12 LDM threads, non-HT
 12 TC threads, HT
 2 send threads, non-HT
 8 receive threads, HT
where HT means that we used both CPU threads of a core and non-HT means
that we used only one CPU thread per core.

We also tried 20 LDM threads with HT, which gave results similar to 12 LDM
threads non-HT. Finally there were threads for replication, main, io and other
activities that were not used much in these benchmarks.
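For reference, a thread setup like this is expressed with the ThreadConfig
parameter in the data node section of config.ini. The sketch below only shows
the thread counts from the list above; the actual CPU binding used in the
benchmark (the cpubind/cpuset settings that decide whether a thread type uses
one or both hyperthreads of a core) is not reproduced here.

  [ndbd default]
  NoOfReplicas=2
  # Thread counts matching the setup above; the cpubind/cpuset settings that
  # control the HT/non-HT placement are omitted in this sketch.
  ThreadConfig=ldm={count=12},tc={count=12},send={count=2},recv={count=8},main={count=1},rep={count=1},io={count=1}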

We compared the improvement of Haswell versus Ivy Bridge (Intel Xeon v2) servers
by running a similar configuration with 24 data nodes. With Ivy Bridge
(which had 12 cores per socket and thus 24 cores and 48 CPU threads in total) we
reached 117.0M reads per second, and with Haswell we reached
156.5M reads per second. This is a 33.8% improvement. It is important to note
that Haswell was slightly limited by the InfiniBand interrupt handling
whereas the Ivy Bridge servers were not limited by it, so the real difference is
probably more in the order of 40-45%.

At 24 data nodes we tested scaling with the number of API nodes. We started with
1 API machine using 4 API node connections, which gave 4.24M reads per second. We
then tried 3 API machines with a total of 12 API node connections and achieved
12.84M reads per second. We then added 3 machines at a time, each set adding 12
new API connections and more than 12M reads per second, giving 62.71M reads
per second at 15 API machines and 122.8M reads per second at 30 API machines.
The almost linear scaling continued up to 37 API machines, where the best result
of 156.5M reads per second was achieved. Performance with 40 API machines was
about the same as with 37, at 156.0M reads per second. At this point performance
was saturated since the interrupt handling could not handle more packets per
second. Even without this, the data nodes were close to saturating the CPUs of
the LDM threads, the TC threads and the send threads.

Running clusters like this is interesting. The bottlenecks can be trickier
to find than in the normal case. One must remember that a benchmark with
37 API machines and 24 data nodes, where each machine has 28 CPU cores,
involves more than 1000 CPU cores, and understanding it requires understanding
a complex queueing network.

What is interesting here is that the queueing network behaves best if there is some
well-behaved bottleneck in the system. This bottleneck ensures that the flow
through the remainder of the system behaves well. However, when there is no such
bottleneck, the system can enter a wave of increasing and
decreasing performance. We have all experienced this type of
queueing-network behaviour while being stuck in car queues.

What we discovered is that MySQL Cluster can enter such waves if the configuration
doesn't have any natural bottleneck. What happens is that the data nodes send
results back to the API nodes in an eager fashion, so the API nodes
receive many small packets to process. Since small packets take longer to process
per byte than large packets, the API nodes slow down, and the benchmark slows
down with them. After a while the data nodes start sending larger packets again
and things speed up, until the too eager sending kicks in once more and the
cycle repeats.

To handle this we introduced a new configuration parameter, MaxSendDelay, in
MySQL Cluster 7.4. This parameter ensures that we are not so eager in sending
responses back to the API nodes. We still send immediately if there is no
competing traffic, but if there is competing traffic, we delay sending
a bit to ensure that we send larger packets. One can say that we
introduce an artificial bottleneck into the send part. This artificial bottleneck
can in some cases improve throughput by as much as 100% or more.
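As a sketch, the parameter goes into the data node section of config.ini; the
value below is only an illustrative placeholder, not the setting used in the
benchmark.

  [ndbd default]
  # Delay sends slightly when there is competing traffic so that responses
  # are packed into larger packets; the value here is a placeholder only.
  MaxSendDelay=500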

The conclusion is that MySQL Cluster 7.4 running on the new Haswell servers is
capable of stunning performance. It delivers 205.6M reads per second of
records a bit larger than 100 bytes, which corresponds to a data flow of more than
20 GByte per second of key lookups, or 12.3 billion reads per minute.