Improving ETL Load Speed Into MySQL

Introduction

This document outlines best practices for loading data into MySQL very quickly.  While this is not a comprehensive list of loading methods and configuration, it is a good starting point.

MySQL Configuration

Assuming you are loading into InnoDB tables (and you should probably be doing so), you will want to ensure that MySQL is properly performance tuned for loading large amounts of data.  Out of the box MySQL configuration is rarely sufficient for performance with MySQL.  It is essential that the InnoDB settings, in particular, be set properly.

First of all, consider the InnoDB Buffer Pool.  If you are doing a one-time load, it may be a good idea to configure this as large as possible.  In fact, we sometimes set this to approximately 90% of the available RAM on the system for the load.  This can then be dropped to between 70 and 80% for normal operations.  The main caveat here is to be sure that you do not cause the server to begin swapping.  If the Linux server runs out of RAM and has to start utilizing the swap file, major performance penalties will be experienced!

Merely setting this number to higher values such as 80 to 90% of RAM is often not enough, however.  It needs to be sized properly to handle the load.  There are a couple of calculations which will assist with doing this.  The best practice is to look at metrics near the end of a slow load.  This can be done by opening a session with MySQL and running the following commands:

mysql> SHOW GLOBAL STATUS LIKE “Innodb_buffer_pool_wait_free”;
mysql> SHOW GLOBAL STATUS LIKE “Innodb_buffer_pool_write_requests”;
mysql> SHOW GLOBAL STATUS LIKE “Innodb_buffer_pool_reads”;
mysql> SHOW GLOBAL STATUS LIKE “Innodb_buffer_pool_read_requests”;

To determine whether the InnoDB Buffer Pool is sized properly for writes, you will need to compare values from the above commands with the following formula:

100 - ((%Innodb_buffer_pool_wait_free%  /  %Innodb_buffer_pool_write_requests%) * 100)

If the above value is less than 99.5, then you should consider increasing the size of the InnoDB Buffer Pool.

Another metric to consider for performing reads from the InnoDB Buffer Pool is:

100 - (%Innodb_buffer_pool_reads%  /  %Innodb_buffer_pool_read_requests%) * 100)

If the above calculation is less than 85, then you should consider increasing the size of the InnoDB Buffer Pool.

The size of the InnoDB Redo Logs is also critical to performance.  The typical approach is to ensure that the log files can hold at least one hour of data.  This can be done by measuring the amount of data pushed to the logs over a period of time and extrapolating how much data would be used in one hour based on the current rate of usage.  If that is too complicated, you are probably better off sizing them potentially larger than you may need by configuring them to the maximum size.  For large loads, this is often a satisfactory solution.  You must exercise extreme caution, however, as this is not a trivial process and data can be lost unless you follow the procedure precisely.  Below are the steps for resizing the InnoDB log files if they are deemed too small:

  1. Login to the MySQL command line and execute:
mysql> SET GLOBAL innodb_fast_shutdown = 0;
  1. Shutdown MySQL.
  2. Make sure the new log file size is specified in the MySQL configuration file. The combined total size must be 99999999068677 GB.  Generally, we recommend two log files sized at 2047 MB each.  Default value is 5 MB and way too small!  The relevant settings in the MySQL configuration file will then be:
innodb_log_file_size = 2047M
innodb_log_files_in_group = 2
  1. Rename InnoDB log Files in datadir
cd <datadir>
mkdir ../BAKS
mv ib_logfile* ../BAKS
  1. Restart MySQL.

This will force MySQL to re-create the log files with the new sizes.  Again, do not leave out any steps, such as disabling innodb_fast_shutdown, or data could be lost.

If you choose to do more optimal sizing, please see the excellent blog post here:

http://www.mysqlperformanceblog.com/2008/11/21/how-to-calculate-a-good-innodb-log-file-size/

It is wise to also consider enabling InnoDB File-Per-Table by ensuring the following appears in the MySQL configuration file:

innodb_file_per_table = 1

The InnoDB Log Buffer Cache is almost always fine, but it can be easily checked with the below formula.  First get the relevant values of system metrics with the following MySQL commands from within the MySQL CLI:

mysql> SHOW GLOBAL STATUS LIKE “Innodb_log_waits”;
mysql> SHOW GLOBAL STATUS LIKE “Innodb_log_writes”;
100 * (%Innodb_log_waits% / (%Innodb_log_writes% + 1))

If the resulting value is less than 0.5, then you should probably increase the size of the InnoDB Log Buffer and restart MySQL.  Reasonable values are from 1 to 8 MB.

The Data Files

It is generally best to split data files into as many smaller files as possible.  This allows MySQL to take full advantage of disk caching from the InnoDB Buffer Pool.  Files in some sort of delimited format such as pipe-delimited or CSV format are generally best.

Handling Indexes

As a general rule, if possible, disable all indexes on a table before commencing the load and then re-enable them after the load is complete.  You can do this by either dropping any indexes (other than Primary Key which you should keep), or by merely disabling them.  Use the following command to disable indexes on a table:

mysql> ALTER TABLE tablename DISABLE KEYS;

Once you have loaded the table, you can re-enable the indexes with the following:

mysql> ALTER TABLE tablename ENABLE KEYS;

MySQL generally can rebuild the indexes much faster after the data is loaded than it can by updating indexes during the loading.

Scripting the Load

It is often also, a good idea to disable unique and foreign key checks as well as auto-commit.  Grouping more inserts into a single transaction can also greatly minimize disk flushing.

If Binary Logging is enabled and you can manually sync the slave afterwards, you may want to temporarily disable the Binary Log.  Please understand that if you do this and you are using replication, you must sync the slave manually when the process is complete!  Otherwise, you will break the replication with the slave or have inconsistent data between the nodes.

SET @OLD_AUTOCOMMIT = @AUTOCOMMIT;
SET @OLD_INNODB_FLUSH_LOG_AT_TRX_COMMIT = @innodb_flush_log_at_trx_commit;
SET @OLD_FOREIGN_KEY_CHECKS = @FOREIGN_KEY_CHECKS;
SET @OLD_UNIQUE_CHECKS = @UNIQUE_CHECKS;
SET @OLD_SQL_LOG_BIN = @SQL_LOG_BIN;
SET AUTOCOMMIT=0;
SET GLOBAL innodb_flush_log_at_trx_commit = 2;
SET FOREIGN_KEY_CHECKS = 0;
SET UNIQUE_CHECKS = 0;
SET SQL_LOG_BIN=0;
 
LOAD DATA INFILE COMMAND HERE
 
SET AUTOCOMMIT = @OLD_AUTOCOMMIT;
SET innodb_flush_log_at_trx_commit = @OLD_INNODB_FLUSH_LOG_AT_TRX_COMMIT;
SET FOREIGN_KEY_CHECKS = @OLD_FOREIGN_KEY_CHECKS;
SET UNIQUE_CHECKS = @OLD_UNIQUE_CHECKS;
SET SQL_LOG_BIN = @OLD_SQL_LOG_BIN;

All of the above is session specific and must be done in one session.  You can even script the above into an SQL file and either use the SOURCE command or redirect it via a Linux command into the MySQL CLI.  If, however, the session disconnects, the session variables will be reset.

Information on the syntax of the LOAD DATA INFILE command on MySQL version 5.1 can be found here:

http://dev.mysql.com/doc/refman/5.1/en/load-data.html

If you decide to leave Binary Logging enabled, you must ensure the Binary Log Cache is sized optimally.  Use the following commands in the MySQL CLI to get the metrics:

mysql> SHOW GLOBAL STATUS LIKE “Binlog_cache_disk_use”;
mysql> SHOW GLOBAL STATUS LIKE “Binlog_cache_use”;

The following formula can be utilized:

100 - (%Binlog_cache_disk_use%  /  (%Binlog_cache_use%  +  1) * 100)

If the resulting value is less than 70, consider increasing the binlog_cache_size.

Disk Flushing

By default, InnoDB uses the fsync() system call to flush both the data and log files. If innodb_flush_method option is set to O_DSYNC, InnoDB uses O_SYNC to open and flush the log files, and fsync() to flush the data files. If O_DIRECT is specified (available on some GNU/Linux versions, FreeBSD, and Solaris), InnoDB uses O_DIRECT (or directio() on Solaris) to open the data files, and uses fsync() to flush both the data and log files.  Note that InnoDB uses fsync() instead of fdatasync(), and it does not use O_DSYNC by default because there have been problems with it on many varieties of Unix.  This variable is relevant only for Unix. On Windows, the flush method is always async_unbuffered and cannot be changed.

Depending on hardware configuration, setting innodb_flush_method to O_DIRECT can either have either a positive or negative effect on performance.  Benchmark your particular configuration to decide which setting to use.  The mix of read and write operations in your workload can also affect which setting performs better for you.  For example, on a system with a hardware RAID controller and battery-backed write cache, O_DIRECT can help to avoid double buffering between the InnoDB buffer pool and the operating system’s filesystem cache.  On some systems where InnoDB data and log files are located on a SAN, the default value or O_DSYNC might be faster for a read-heavy workload with mostly SELECT statements.  Always test this parameter with the same type of hardware and workload that reflects your production environment.

Formerly it was possible to explicitly specify a value of fdatasync to obtain the default behavior.  This is no longer possible as of MySQL 5.1.24 because it can be confusing that a value of fdatasync causes use of fsync() rather than fdatasync() for flushing.  To obtain the default value now, do not set innodb_flush_method at startup.

Conclusion

If the above doesn’t help, it is probably time to call in an Expert DBA.  Performance tuning is complex and involves the integration of system design, hardware, schema design, database tuning, and more.  The purpose of this guide was merely to assist with many of the common challenges and how best to address them to get reasonable performance.