At MySQL Conference and Expo 2009, I explained
how Linux I/O scheduler queue size affects MyISAM insert
performance.
It is well known that Linux implemented four types of I/O
schedulers (noop/deadline/anticipatory/cfq) in Linux kernel
2.6.10. The default is cfq in most distributions including RHEL,
which is not so good then noop normally outperforms, but I'll
talk this in other posts.
Linux I/O scheduler also has a functionality to sort incoming I/O
requests in its request-queue for optimization. Queue size is
configurable. Queue size is defined as
"/sys/block/sdX/queue/nr_requests" then you can change queue
length as follows.
# cat /sys/block/sda/queue/nr_requests
128
# echo 100000 > /sys/block/sda/queue/nr_requests
Changing queue size is even effective for noop scheduler.
Here are benchmarking results about changing i/o scheduler queue
size for MyISAM insert-intensive loads. Detailed are written
in the slides at the UC.
Apparently increasing queue size was very helpful for HDD, but
not helpful for SSD.
Let me explain about backgrounds.
On Linux side, I/O requests are handled by the following
order:
system calls(pwrite/etc)
-> Filesystem
-> I/O scheduler
-> Device Driver/Disks
I/O scheduler sorts incoming I/O requests by logical block
addresses, then sending them to a device driver.
I/O scheduler does not depend on storage devices so is helpful
for some parts (i.e. minimizing disk seek overheads), not helpful
for other parts (i.e. minimizing disk rotation overheads).
On the other hand, Command Queuing (TCQ/NCQ) also sorts I/O
requests for optimization. SAS and recent SATA II disks support
command queuing. The goal is partly duplicate from I/O scheduler.
But TCQ can minimize not only disk seeks but also disk rotation
overhead (See the link to wikipedia). The disadvantage of
TCQ is that queue size is very limited (normally 32-64).
Based on the above, sorting almost all random I/O requests on I/O
sheculer then sending them to TCQ would be nice.
Suppose 100,000 random read I/O requests are coming.
When I/O scheduler queue size is 128 (default in many cases), TCQ
gets I/O requests by almost random order, so pretty high disk
seek overhead happens for each action (requests within single
queue is dispersed).
When I/O scheduler queue size is 100,000, TCQ gets I/O requests
by fully sorted order, so seek overhead can be much
smaller.
Increasing queue size does not have any effect on SSD because no
disk seek happens.
This would explain my benchmarking results.
I/O scheduler queue size settings is not helpful for InnoDB
because InnoDB internally sorts I/O requests to optimize disk
seek overheads, and sending limited number of i/o requests
controlled by InnoDB internal i/o threads. So the role is
duplicate between InnoDB itself and I/O scheduler queue. Note
that TCQ improves InnoDB throughput because disk rotation
overheads are significantly reduced and such optimizations can
not be done from application/kernel side.
MyISAM does nothing special (highly depending on OS) so this
helps.
Updated in Apr 30: Added detailed benchmark conditions for people
who are interested..
Here is a test script. I ran a single-threaded stored procedure
on a same machine.
create table aa (id int auto_increment primary key,
b1 int,
b2 int,
b3 int,
c varchar(100),
index(b1), index(b2), index(b3)) engine=myisam;
drop procedure sp_aa;
delimiter //
create procedure sp_aa(IN count INTEGER)
BEGIN
DECLARE time_a, time_b BIGINT DEFAULT 0;
DECLARE done INTEGER DEFAULT 0;
DECLARE i INTEGER DEFAULT 1;
WHILE done != 1 DO
insert into aa values (i,rand()*count,rand()*count,rand()*count,repeat(1,40));
SET i = i + 1;
IF i % 1000000 = 1 THEN
SELECT unix_timestamp() into time_a from dual;
SELECT i, from_unixtime(time_a), time_a - time_b from dual;
SET time_b = time_a;
END IF;
IF i > count THEN
SET done = 1;
END IF;
END WHILE;
END;
//
delimiter ;
mysql test -vvv -e "call sp_aa(300000000)"
Then wait for a long long time...
# Default insert. no insert-delayed, no disable-keys, no
delay-key-write, no mmap
H/W, OS, MySQL settings
Sun Fire X4150
CPU: Intel Xeon, 8 cores
RAM: 32GB (but limit filesystem cache size up to 5GB, no swapping
happened)
HDD: SAS 15,000RPM, 2 disks, RAID 1, write cache enabled
SSD: Intel X25-E, Single drive, write cache enabled
OS: RedHat Enterprise Linux 5.3 (2.6.18-128.el5)
Filesystem: ext3
I/O Scheduler: deadline
MySQL 5.1.33
key_buffer_size: 2G
i/o stats:
queue size=128 (default)
avg-cpu: %user %nice %system %iowait %steal %idle
0.01 0.00 0.08 24.69 0.00 75.22
Device: rrqm/s wrqm/s r/s w/s rMB/s
sdb 0.00 0.87 0.00 575.60 0.00
wMB/s avgrq-sz avgqu-sz await svctm %util
2.25 8.01 142.44 247.03 1.74 100.00
(At running 12 hours, 13 mil rows were inserted)
queue size=100000
avg-cpu: %user %nice %system %iowait %steal %idle
2.06 0.00 5.32 29.66 0.00 62.96
Device: rrqm/s wrqm/s r/s w/s rMB/s
sdb 0.00 2487.33 0.00 2042.60 0.00
wMB/s avgrq-sz avgqu-sz await svctm %util
35.11 35.20 84402.36 43582.33 0.49 100.01
(At running 1.5 hours, 41 mil rows were inserted)
...
avg-cpu: %user %nice %system %iowait %steal %idle
0.07 0.00 0.19 24.82 0.00 74.91
Device: rrqm/s wrqm/s r/s w/s rMB/s
sdb 0.00 9.03 1.70 756.03 0.01
wMB/s avgrq-sz avgqu-sz await svctm %util
5.56 15.04 31981.72 127560.45 1.32 100.00
(At running 12 hours, 77 mil rows were inserted,
index size was 4.5 GB)
If running many more hours(days) so that index size exceeds
filesystem cache, a lot of disk reads will happen. (I didn't have
time to run that test before UC)