At MySQL Connect 2013, I talked about how we used MySQL 5.6 at
Facebook, and explained some of the new features we added to our
Facebook MySQL 5.6 source tree. In this post, I'm
going to talk about how we made full table scans faster in
InnoDB.
Faster full table scan in InnoDB

In general, almost all queries from applications use indexes and read
very few rows (0..1 rows for primary key lookups, 0..hundreds for
range scans). But sometimes we run full table scans. Typical full
table scan examples are logical backups (mysqldump) and online schema
changes (SELECT ... INTO OUTFILE).
At Facebook we take logical backups with mysqldump. As you
know, MySQL offers both physical and logical backup
commands/utilities. Logical backup has some advantages over
physical backup. For example:
- Logical backup size is much smaller. A 3x-10x size difference is not uncommon.
- Backups are relatively easy to parse. With a physical backup, if we can't restore a database for some serious reason such as a checksum failure, it is very difficult to dig into InnoDB's internal data structures and fix the corruption. We trust logical backups more than physical backups.
The major drawbacks of logical backup are that full backup
and full restore are much slower than physical
backup/restore.
The slowness of full logical backup often causes problems.
It can take a very long time if the database has grown large and
tables are fragmented. At Facebook, we suffered from mysqldump
performance problems: we couldn't finish a full logical backup
within a reasonable amount of time on some HDD and Flashcache based
servers. We knew that InnoDB wasn't efficient at full table scans
because InnoDB did not actually do sequential reads, but mostly
random reads. This had been a known issue for many years. As
database storage capacity kept growing, the slow full table
scan issue became more and more serious for us, so we decided to
enhance InnoDB to do faster sequential reads. Our
Database Engineering team implemented a "Logical Read Ahead" feature
in InnoDB. With logical read ahead, our full table scan speed
improved 9-10x under typical production workloads.
Under heavy production workloads, full table scans became
15-20x faster.
Issues of full table scan on large, fragmented tables

When doing a full table scan, InnoDB scans pages and rows in primary
key order. This applies to all InnoDB tables, including fragmented
ones. If the primary key pages (the pages where primary keys and rows
are stored) are not fragmented, a full table scan is pretty fast
because the read order is close to the physical storage order. This is
similar to reading a file with an OS command (dd/cat/etc.) like
below.
# dd if=/data/mysql/dbname/large_table.ibd of=/dev/null bs=16k iflag=direct
You may find that even on commodity HDD servers you can
read more than 100MB/s multiplied by the number of drives.
Over 1GB/s is not uncommon.
Unfortunately, in many cases primary key pages are
fragmented. For example, if you need to manage user_id to
object_id mappings, the primary key will be (user_id, object_id).
Insert order does not match user_id order, so new
inserts and updates very often cause page splits. Newly split pages
are allocated far from the current pages, which means the pages
become fragmented.
If primary key pages are fragmented, full table scans
become very slow. Fig 1 illustrates the problem. After InnoDB
reads leaf page #3, it has to read page #5230, and after that it
has to read page #4. Page #5230 is far from pages #3 and #4, so
the disk read order becomes almost random, not sequential. It is
very well known that random reads on HDD are much slower than
sequential reads. One very effective way to improve random
reads is to use SSD, but the per-GB cost of SSD is still higher
than that of HDD, so using SSD is not always possible.
Fig 1. Full table scan is not actually doing sequential
reads
Does Linear Read Ahead really help?

InnoDB has a prefetch feature called "Linear Read Ahead". With linear
read ahead, InnoDB reads an entire extent (64 consecutive pages: 1MB
if not compressed) at once when N pages have been accessed
sequentially (N is configured by the innodb_read_ahead_threshold
parameter, default 56). In practice, though, this does not help much,
because one extent (64 pages) is a very small range. For most large
fragmented tables, the next page is not in the same extent. See fig 1
above for an example: after reading page #3, InnoDB needs to read
page #5230, and page #3 is not in the same extent as #5230, so
linear read ahead won't help here. This is a very common situation
for large fragmented tables, and it is why linear read ahead does
little to improve full table scan performance.
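To make the extent limitation concrete, here is a toy check in Python (the page numbers are the hypothetical ones from fig 1, not real data): linear read ahead can only prefetch within one 64-page extent, so it helps only when consecutive leaf pages in primary-key order happen to share an extent.

```python
# Toy illustration (hypothetical page numbers): linear read ahead can only
# prefetch within one 64-page extent, so it only helps when consecutive
# leaf pages in primary-key order fall in the same extent.

EXTENT_PAGES = 64  # 64 x 16KB pages = 1MB extent (uncompressed)

def same_extent(page_a, page_b):
    """True if both page numbers live in the same 64-page extent."""
    return page_a // EXTENT_PAGES == page_b // EXTENT_PAGES

# Leaf pages in primary-key order for a fragmented table (as in fig 1).
leaf_chain = [3, 5230, 4, 512, 5231, 5]

hits = sum(same_extent(a, b) for a, b in zip(leaf_chain, leaf_chain[1:]))
print(f"{hits} of {len(leaf_chain) - 1} page transitions stay in one extent")
# → 0 of 5 page transitions stay in one extent
```

For this fragmented chain, no transition stays inside an extent, so linear read ahead never triggers a useful prefetch.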
Optimization approach 1: Physical Read Ahead

As described above, full table scans were slow because InnoDB did
mostly random reads. To make them faster, we needed InnoDB to do
sequential reads. The first approach I came up with
was creating a UDF (User Defined Function) that reads an ibd file
(InnoDB data file) sequentially. After executing the UDF, the pages
of the ibd file should be in the InnoDB buffer pool, so no random
reads happen during the full table scan. Here is an example
usage.
mysql> SELECT buf_warmup ("db1", "large_table"); /* load into buffer pool */
mysql> SELECT * FROM large_application_table;    /* in-memory select */
buf_warmup() is a UDF that reads the entire ibd file of
table "large_table" in database "db1". It takes time to read the
entire ibd file from disk, but the reads are sequential, so it is
much faster than random reads. In my tests, overall scan time was
about 5x faster than with normal linear read ahead.
This proved that sequentially reading ibd files helped to
improve throughput, but there were a couple of
disadvantages:
- If the table is bigger than the InnoDB buffer pool, this approach does not work.
- Reading the entire ibd file means that not only primary key pages but also secondary index pages and unused pages are read and cached in the InnoDB buffer pool, even though only primary key pages are needed for a full table scan. This is very inefficient if you have many secondary indexes.
- Applications have to be changed to call the UDF.
This looked like a "good enough" solution, but our database
engineering team came up with a better one called "Logical
Read Ahead", so we didn't take the UDF approach.
Logical Read Ahead

Logical Read Ahead (LRA) works as below.
- Read many branch pages of the primary key index
- Collect leaf page numbers from them
- Read many (a configurable number of) leaf pages in page number order (mostly sequential disk reads)
- Read rows in primary key order (same as a normal full table scan, but now the buffer pool hit rate is very high)
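The steps above can be sketched in a few lines of Python. This is a simplified model, not the real implementation (which is in InnoDB's C++ and must deal with the buffer pool, page latching, and so on); the page numbers and the `read_page` callback are hypothetical.

```python
# Simplified sketch of the Logical Read Ahead idea (not real InnoDB code).
# branch_pages: branch pages of the primary key, each listing its child
# leaf page numbers in key order. read_page: callback issuing one disk read.

def logical_read_ahead(branch_pages, prefetch_batch_pages, read_page):
    # Steps 1-2: scan branch pages and collect leaf page numbers.
    leaf_pages = [leaf for branch in branch_pages for leaf in branch]

    # Step 3: prefetch a configurable number of leaf pages at a time,
    # sorted by page number, so disk reads are mostly sequential.
    for i in range(0, len(leaf_pages), prefetch_batch_pages):
        for page_no in sorted(leaf_pages[i:i + prefetch_batch_pages]):
            read_page(page_no)

    # Step 4: row reads then proceed in primary key order and hit the
    # buffer pool (modeled here by returning the key-order list).
    return leaf_pages

reads = []
key_order = logical_read_ahead([[3, 5230, 4], [512, 5231, 5]], 3, reads.append)
print(reads)      # disk-read order: batched and sorted → [3, 4, 5230, 5, 512, 5231]
print(key_order)  # primary-key scan order → [3, 5230, 4, 512, 5231, 5]
```

The key point is that the disk sees page numbers in ascending order within each prefetch batch, while the SQL layer still receives rows in primary key order.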
This is illustrated in fig 2 below.
Fig 2: Logical Read Ahead
Logical Read Ahead (LRA) solved the issues of Physical Read
Ahead. LRA lets InnoDB read only primary key pages (no
secondary index pages) and prefetch a configurable
number of pages (not the entire table) at a time. And LRA does not
require any SQL syntax changes.
We added two new session variables to control LRA. One is
"innodb_lra_size", which controls how many MB of leaf pages to
prefetch at a time. The other is "innodb_lra_sleep", which controls
how many milliseconds to sleep per prefetch. We tested prefetch
sizes of around 512MB-4096MB with a 50 millisecond sleep, and so far
we haven't encountered any serious problems (such as crashes,
stalls, or inconsistencies). These session variables
should be set only when doing a full table scan. In our case,
mysqldump and some utility scripts (e.g. our online schema change
tool) turn logical read ahead on.
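A session that scans a large table might look like the sketch below. The variable names and units are as described above; the exact values and the assumption that setting innodb_lra_size back to 0 disables the feature are illustrative, not taken from the post.

```sql
-- Sketch: enable LRA only for this session's full table scan.
-- innodb_lra_size is MB of leaf pages to prefetch at a time;
-- innodb_lra_sleep is milliseconds to sleep per prefetch.
SET SESSION innodb_lra_size = 1024;
SET SESSION innodb_lra_sleep = 50;
SELECT * FROM large_table;          -- full scan, now mostly sequential reads
SET SESSION innodb_lra_size = 0;    -- assumed "off" value; illustrative
```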
Submitting multiple async I/O requests at once

Another performance issue we noticed was that InnoDB's i/o unit size
was only one page, even when doing readahead. A 16KB i/o unit is too
small for sequential reads and is much less efficient than a larger
one.

In 5.6, InnoDB uses Linux native asynchronous I/O (aio) by default.
If multiple consecutive 16KB read requests are submitted at once,
Linux can internally merge them and perform the reads efficiently.
Unfortunately, InnoDB submitted only one page i/o request at a time,
so I filed bug report #68659. As written in the bug report, on a
modern HDD RAID 1+0 environment I could get more than 1000MB/s of
disk reads by submitting requests for 64 consecutive pages at once,
versus only 160MB/s by submitting one page request at a
time.
To make Logical Read Ahead work better in our environment,
we fixed this issue: in our MySQL branch, InnoDB queues many page
i/o requests before calling io_submit().
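The gain comes from request merging: when a batch of reads for consecutive pages arrives in one io_submit() call, the Linux block layer can coalesce them into a few large sequential reads. A toy Python model of that coalescing (not InnoDB or kernel code; the page numbers are illustrative):

```python
# Toy model of why submitting many consecutive 16KB page reads at once
# helps: adjacent requests can be merged into one large sequential I/O.

PAGE_SIZE = 16 * 1024  # uncompressed InnoDB page size

def coalesce(page_numbers):
    """Merge consecutive page numbers into (offset, length) byte ranges,
    mimicking what the block layer can do with a batch of AIO requests
    submitted in a single io_submit() call."""
    ranges = []
    for page in sorted(page_numbers):
        if ranges and ranges[-1][0] + ranges[-1][1] == page * PAGE_SIZE:
            offset, length = ranges[-1]
            ranges[-1] = (offset, length + PAGE_SIZE)  # extend previous range
        else:
            ranges.append((page * PAGE_SIZE, PAGE_SIZE))
    return ranges

# 64 consecutive pages submitted together become a single 1MB read ...
print(coalesce(range(64)))  # → [(0, 1048576)]
# ... while submitting one page at a time leaves 16KB reads unmerged.
print(coalesce([0]))        # → [(0, 16384)]
```

This mirrors the 1000MB/s vs 160MB/s difference reported above: the same pages, but far fewer and far larger disk operations when submitted as one batch.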
Benchmark

In both cases, our production (fragmented) tables were used.

1. Full table scan on pure HDD (basic benchmark, no other
workload)

Table size | Without LRA      | With LRA        | Improvement |
10GB       | 10 min 55.00 sec | 1 min 15.20 sec | 8.71x       |
50GB       | 52 min 44.42 sec | 6 min 36.98 sec | 7.97x       |

2. Online schema change under heavy workload

Table size | Without LRA  | With LRA | Improvement |
1GB        | 7 min 33 sec | 24 sec   | 18.8x       |

* dump time only, not counting data loading time
Source code

All of our enhancements are available on GitHub.
- Logical read ahead implementation : diff
- Submitting multiple i/o requests at once : diff
- Enabling logical read ahead in mysqldump : diff
Conclusion

InnoDB was not efficient at full table scans, so we fixed it. We made
two enhancements: implementing logical read ahead, and submitting
multiple async read i/o requests at once. We have seen 8x to 18x
performance improvements on our production tables, which has been
very helpful in reducing our backup time, schema change time, etc. I
hope this feature will be officially supported in InnoDB by Oracle,
or at least by major MySQL forks.