When you run MySQL on a large NUMA box you can control memory
placement and CPU usage through numactl. Most modern servers are
NUMA machines nowadays.
numactl works with a concept called NUMA nodes. One NUMA node
contains CPUs and memory where all the CPUs can access the memory
in this NUMA node with equal latency. Accessing memory in a
different NUMA node, however, is typically slower, often by 50%
or even 100% compared to the local NUMA node. One NUMA node is
typically one chip with a memory bus shared by all CPU cores on
the chip. There can be multiple chips in one socket.
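You can inspect the NUMA layout of a machine directly; numactl
prints the nodes, their CPUs and memory sizes, and the relative
access distances between nodes:

# Show NUMA nodes, their CPUs, memory sizes and inter-node distances
numactl --hardware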
With numactl the default is to allocate memory from the NUMA node
that the currently running CPU is connected to (local
allocation). There is also an option to interleave memory
allocations across the NUMA nodes of the machine by using the
interleave option.
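For example, assuming mysqld is on the PATH, the two policies can
be requested explicitly like this:

# Default behaviour made explicit: allocate on the local NUMA node
numactl --localalloc mysqld ...
# Spread allocations round-robin across all NUMA nodes
numactl --interleave=all mysqld ...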
Memory allocation actually happens in two steps. The first step
is the call to malloc. This invokes a library linked with your
application; this could be e.g. the libc library or a library
containing tcmalloc, jemalloc or some other malloc
implementation. The malloc implementation is very important for
the performance of the MySQL Server, but in most cases the malloc
library doesn't control the placement of the allocated memory.
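As a sketch of how one swaps in a different malloc implementation
without relinking, LD_PRELOAD can be used; the library path below
is an assumption and varies between distributions:

# Hypothetical jemalloc path; adjust to your installation
LD_PRELOAD=/usr/lib/libjemalloc.so mysqld ...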
The allocation of physical memory happens when the memory area is
touched, either for the first time or after the memory has been
swapped out, when a page fault happens. This is the point where
the page is assigned to an actual NUMA node. To control how the
Linux OS decides on this placement one can use the numactl
program.
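Depending on the version of the numactl tools installed, the
resulting per-node placement can be observed at runtime with
numastat:

# Show how much of the mysqld process memory sits on each NUMA node
numastat -p $(pidof mysqld)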
numactl provides options to decide whether to use interleaved or
local memory allocation. The problem with local allocation is
easy to see if we consider that the first thing that happens in
the MySQL Server is InnoDB recovery, and this recovery is
single-threaded, so a large part of the buffer pool memory will
end up attached to the NUMA node where the recovery thread ran.
Using interleaved allocation gives a much more even spread of the
memory.
We can also use the interleave option to specify which NUMA nodes
the memory should be chosen from. Thus the interleave option acts
both as a way of binding the MySQL Server's memory to a set of
NUMA nodes and as a way of interleaving allocations across all
the NUMA nodes the server is bound to.
Finally, numactl also provides the ability to bind the MySQL
Server to specific CPUs in the computer, either by locking it to
NUMA nodes (--cpunodebind) or by locking it to individual CPU
cores (--physcpubind).
So e.g. on a machine with 8 NUMA nodes one might start the MySQL
Server like this:
numactl --interleave=2-7 --cpunodebind=2-7 mysqld ...
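If one wants to lock to individual CPU cores rather than whole
nodes, --physcpubind takes a list of core numbers instead; the
core range below is only an assumption about how cores map to
nodes 2-7 on this particular box:

numactl --interleave=2-7 --physcpubind=16-63 mysqld ...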
Either way, this leaves NUMA nodes 0 and 1 free for a benchmark
program to use without interfering with the MySQL Server. If we
want to use the normal local memory allocation it should more or
less be sufficient to remove the interleave option: since we have
bound the MySQL Server to NUMA nodes 2-7, there is very little
risk that the memory gets allocated elsewhere. We could however
also add --membind=2-7 to ensure that memory allocation happens
on the desired NUMA nodes.
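The benchmark program can then be pinned to the remaining nodes
in the same way; sysbench here is merely an example client:

numactl --cpunodebind=0-1 --membind=0-1 sysbench ...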
So how effective is numactl compared to e.g. using taskset? From
a benchmark performance point of view there is not much
difference, unless the memory gets very unbalanced through a long
recovery at the start of the MySQL Server. Since taskset binds
the server to certain CPU cores, the default local allocation
effectively also binds the memory to the NUMA nodes of those
CPUs.
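As a sketch, the taskset equivalent looks like this, with the
core numbers again being an assumption about the machine's
layout:

taskset -c 16-63 mysqld ...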
However, binding to a subset of the NUMA nodes or CPUs in the
computer is definitely a good idea. On a large NUMA box one can
gain at least 10% performance by locking the MySQL Server to a
subset of the machine compared to allowing it to roam freely over
the entire machine.
Binding the MySQL Server also improves the stability of the
performance. Binding to certain CPUs can also be an instrument in
ensuring that different applications running on the same computer
don't interfere with each other. Naturally this can also be
achieved by using virtual machines.
Dec 21, 2010