MySQL Cluster have all the ingridients to be designed as a very
scalable and extremely efficient key lookup engine for the Cloud.
As we have shown in earlier entries of my blog we've been able to
scale MySQL Cluster 7.2 to handle 72 million key lookups per
second or 4.3 billion key lookups per minute. This benchmark was
actually limited by a limitation of the NDB API nodes to handle
more than about 300k lookups per second and so with a maximum of
255 nodes we got to around 72 million per second in total.
However in MySQL Cluster 7.3 we have removed this limitation, in
addition we have also enabled scaling to even bigger data nodes,
so it should be possible now to reach even higher numbers.
The aim of this blog is however not to give any new benchmark
results, rather it is providing details about how the benchmark
program works and how this benchmark program architecture can be
used to design an efficient scalable key lookup data store.
To obtain best possible performance we want to ensure that the
data node can operate as efficiently as possible. This is done by
ensuring that a connection to the data node sends many key
lookups bundled together. Operating on individual key lookups is
possible of course, but as usual it is more efficient to operate
on bigger entities than one key lookup at a time. To provide this
we use a concept we call Executor Thread. The Executor Thread
will only execute key lookups aimed for a certain data node. So
this means that the number of Executor Threads will be a multiple
of the number of data nodes (there could be more than one thread
per data node if necessary). The Executor Thread will receive key
lookups from an internal queue handled by the application program
(in our case the flexAsynch benchmark program). The key lookups
are prepared by the Definer Threads. The Definer Thread will
receive a key lookup aimed for any data node, it will take this
key lookup and calculate the receiving data node for this key
lookup (there is API calls in the NDB API to handle this). Based
on this calculation the Definer Thread will put the key lookup in
the queue of the proper Executor Thread.
The architecture before the Definer Thread is dependent on the
application. In the figure provided here we have shown one
possible architecture where we have one receive thread that
receives a flow of messages from somewhere, to process those
messages we need to interpret the packets and process them, this
could entail one or more key lookups. In the figure we have
assumed there is one key lookup per message and that the Executor
Thread can format the packet back to the sender based on the
information in the internal key lookup order.
So the important part of the architecture is the Executor Thread
that handles messages to one data node based on an internal data
structure that defines one key lookup and defines how to process
the response (this thread should do as little work as possible to
ensure it can focus on communication with the data node). There
should also be a Definer Thread that prepares the key lookup
request and puts the request in the queue of the proper Executor
Thread. The Definer Thread could also do other things and there
could be few or many Definer Threads in the architecture.
So how does flexAsynch work, in this case we don't have any input
traffic, we generate the key lookups in the Definer Threads. The
Definer Thread has a very simple operation. It starts by
preparing a set of key lookups to any data node. For each of
those key lookups it puts the request in the queue of the proper
Executor Thread. After placing all requests in a queue, it starts
waiting for all operations to complete. After all requests have
received their response from the Executor Threads we simply
continue with the next batch.
The operation of the Executor Thread is also very simple, it gets
the current set of key lookups waiting in queue, it prepares
those for execution of the NDB API. It sends off all the
operations to the data node. When the data node have executed all
of the operations it reports the result back to the Definer
Threads and updates some benchmark statistics, then it continues
with the next batch of key lookups.
So the operation of an efficient key lookup data store is not
difficult at all. To make it scale one can then add up to 48 data
nodes per cluster (each is capable of handling more than 5
million key lookups per second of around 100 byte in size). Each
cluster can handle a total of 255 nodes in total. Obviously it is
also straightforward to operate more than one cluster to scale
even further.
The benchmark code exists in
storage/ndb/test/ndbapi/flexAsynch.cpp, the interesting code
exists in the NEW module here (it also contains a lot of legacy
code for old variants of the flexAsynch benchmark).
Nov
25
2013