Testing the Fastest Way to Import a Table into MySQL (and some interesting 5.7 performance results)

As I mentioned in my last post, where I compared the default configuration options in 5.6 and 5.7, I have been doing some testing for a particular load in several versions of MySQL. What I have been checking is different ways to load a CSV file (the same file I used for testing the compression tools) into MySQL. For those seasoned MySQL DBAs and programmers, you probably know the answer, so you can jump over to my 5.6 versus 5.7 results. However, the first part of this post is dedicated to developers and MySQL beginners who want to know the answer to the title question, in a step-by-step fashion. I must say I also learned something, as I under- and over-estimated some of the effects of certain configuration options for this workload.

Disclaimers: I do not intend to do proper benchmarks; most of the results obtained here were produced in a couple of runs, and many of them with a default configuration. This is intentional, as I want to show "bad practices" for people who are just starting to work with MySQL, and what they should avoid doing. Only the 5.6 to 5.7 comparison has left me wondering. Additional disclaimer: I do not call myself a programmer, and much less a Python programmer, so I apologize in advance for my code; after all, this is about MySQL. The download link for the scripts is at the bottom.

The Rules

I start with a CSV file (remember that it is actually a tab-separated values file) that is 3,700,635,579 bytes in size, has 46,741,126 rows and looks like this:

171773  38.6048402      -0.0489871      4       2012-08-25 00:37:46     12850816        472193  rubensd
171774  38.6061981      -0.0496867      2       2008-01-19 10:23:21     666916  9250    j3m
171775  38.6067166      -0.0498342      2       2008-01-19 10:23:21     666916  9250    j3m
171776  38.6028122      -0.0497136      5       2012-08-25 00:26:40     12850816        472193  rubensd
171933  40.4200658      -3.7016652      6       2011-11-29 05:37:42     9984625 33103   sergionaranja

I want to load it into a table with the following structure:

CREATE TABLE `nodes` (
  `id` bigint PRIMARY KEY,
  `lat` decimal(9,7),
  `lon` decimal(10,7),
  `version` int,
  `timestamp` timestamp,
  `changeset` bigint,
  `uid` bigint,
  `user` varchar(255)
);

The import finish time will be defined as the moment the table is crash safe (even if there is some pending IO). That means that for InnoDB, the last COMMIT has to be successful and innodb_flush_log_at_trx_commit must be equal to 1, meaning that even if there is pending IO to be done, the data is fully durable on disk (it is crash-resistant). For MyISAM, that means that I force a FLUSH TABLES before finishing the test. Those are, of course, not equivalent, but it is at least a way to make sure that everything is more or less disk-synced. This is the ending part of all my scripts:

# finishing
if (options.engine == 'MyISAM'):
    cursor.execute('FLUSH TABLES')
else:
    db.commit()

cursor.close()
db.close()
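
By the way, the snippets in this post only show the interesting parts; the common setup is omitted. It would look roughly like this (a simplified sketch, not the exact code of the downloadable scripts; the option parsing is my assumption, and the connection parameters are the sandbox ones that appear later in load_data_08.py):

# common setup (sketch): parse the test options and open the connection
import csv
import mysql.connector
from optparse import OptionParser

parser = OptionParser()
parser.add_option('--engine', dest='engine', default='InnoDB',
                  help='storage engine to test (InnoDB or MyISAM)')
parser.add_option('--port', dest='port', type='int', default=3306)
(options, args) = parser.parse_args()

CSV_FILE = '/tmp/nodes.csv'
port = options.port

# Connector/Python does not autocommit by default, so everything until the
# final commit (or FLUSH TABLES) runs as a single transaction for InnoDB
db = mysql.connector.connect(host='localhost', port=port, user='msandbox',
                             passwd='msandbox', database='test')
cursor = db.cursor()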

For the hardware and OS, check the specs on this previous post: I used the same environment as the one mentioned there, with the exception of using CentOS 7 instead of 6.5.

The naive method

Let's say I am a developer being tasked with loading a file regularly into MySQL. How would I do that? I would probably be tempted to use a CSV parsing library, the MySQL connector, and link them together in a loop. That would work, wouldn't it? The main parts of the code would look like this (load_data_01.py):

# import
insert_node = "INSERT INTO nodes (id, lat, lon, version, timestamp, changeset, uid, user) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"

with open(CSV_FILE, 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    for node in csv_reader:
        cursor.execute(insert_node, node)

As I am playing the role of a developer without MySQL experience, I would also use the default configuration. Let’s see what we get (again, that is why I call these “tests”, and not benchmarks). Lower is better:

MySQL Version        5.1.72    5.5.39    5.6.20    5.7.4
Load time (seconds)  4708.594  6274.304  6499.033  6939.722

Wow, is 5.1 really 50% faster than the rest of the versions? Absolutely not. Remember that 5.5 was the first version to introduce InnoDB as the default engine, and InnoDB has additional transactional overhead and usually a poor default configuration (unlike MyISAM, which is so simple that the default options can work in many cases). Let's normalize our results by engine:

MySQL Version  5.1.72    5.5.39    5.6.20    5.7.4
MyISAM         4708.594  5010.655  5149.791  5365.005
InnoDB         6238.503  6274.304  6499.033  6939.722

This seems more reasonable, doesn't it? However, in this case, it seems that there is a slight regression in single-thread performance as the versions go up, especially in MySQL 5.7. Of course, it is too early to draw conclusions, because this method of importing a CSV file, row by row, is one of the slowest ones, and we are using very poor configuration options (the defaults), which vary from version to version and should not be taken into account to draw conclusions.

What we can say is that MyISAM seems to work better by default for this very particular scenario for the reasons I mentioned before, but it still takes 1-2 hours to load such a simple file.

The even more naive method

The next question is not "can we do it better?" but "can we do it even slower?" A particular passage drew my attention when looking at the MySQL connector documentation:

Since by default Connector/Python does not autocommit, it is important to call this method after every transaction that modifies data for tables that use transactional storage engines.

(from the Connector/Python documentation)
I thought to myself: oh, so maybe we can speed up the import process by committing every single row to the database, one by one, right? After all, we are inserting the whole table in a single huge transaction. Surely a huge number of small transactions will be better! This is the slightly modified code (load_data_02.py):

insert_node = "INSERT INTO nodes (id, lat, lon, version, timestamp, changeset, uid, user) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"

with open('/tmp/nodes.csv', 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    for node in csv_reader:
        cursor.execute(insert_node, node)
        db.commit()

And I do not even have a fancy graphic to show you, because after 2 hours, 19 minutes and 39.405 seconds I cancelled the import, as only 111533 nodes had been inserted in MySQL 5.1.72 for InnoDB with the default configuration (innodb_flush_log_at_trx_commit = 1). Obviously, millions of fsyncs will not make our load faster; consider this a lesson learned.

Going forward: multi-inserts

The next step I wanted to test is how effective it is to group queries into multi-insert statements. This method is used by mysqldump, and supposedly minimizes the SQL overhead of handling every single query (parsing, permission checking, query planning, etc.). This is the main code (load_data_03.py):

concurrent_insertions = 100
[...]
with open(CSV_FILE, 'rb') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter='\t')
    i = 0
    node_list = []
    for node in csv_reader:
        i += 1
        node_list += node
        if (i == concurrent_insertions):
            cursor.execute(insert_node, node_list)
            i = 0
            node_list = []

# insert the remainder nodes that did not fill a complete batch
if i > 0:
    insert_node = "INSERT INTO nodes (id, lat, lon, version, timestamp, changeset, uid, user) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)"
    for j in xrange(i - 1):
        insert_node += ', (%s, %s, %s, %s, %s, %s, %s, %s)'
    cursor.execute(insert_node, node_list)
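
The [...] above hides how the 100-group INSERT statement itself is built. One way to do it (a sketch of the elided part, not necessarily what the actual script does) is to repeat the placeholder group once per row:

# build an INSERT with one placeholder group per row in the batch
placeholders = '(%s, %s, %s, %s, %s, %s, %s, %s)'
insert_node = ("INSERT INTO nodes (id, lat, lon, version, timestamp, changeset, uid, user) VALUES "
               + ', '.join([placeholders] * concurrent_insertions))

Since node_list is a flat list of 100 x 8 values, it matches the flat list of placeholders that cursor.execute() expects.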

We tested it with batches of 100 rows inserted per query. What are the results? Lower is better:

MySQL Version  5.1.72    5.5.39    5.6.20    5.7.4
MyISAM         1794.693  1822.081  1861.341  1888.283
InnoDB         3645.454  3455.800  2849.299  3032.496

With this method we observe a 2.6-2.8x improvement in import time over the original for MyISAM, and a 1.7-2.3x improvement for InnoDB. Remember that this method will not scale indefinitely, as we will hit the packet size limit (max_allowed_packet) if we try to insert too many rows at the same time. However, it is a clear improvement over the one-row-at-a-time approach.
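
If you want to experiment with larger batches, the limit to keep an eye on is max_allowed_packet. A quick way to check it from the script before choosing the batch size (just a sanity check I did not benchmark) would be:

# check the server-side packet limit before deciding how many rows to group per INSERT
cursor.execute("SHOW VARIABLES LIKE 'max_allowed_packet'")
max_allowed_packet = int(cursor.fetchone()[1])
print('max_allowed_packet is %d bytes' % max_allowed_packet)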

MyISAM times are essentially the same across versions, while InnoDB shows an improvement over time (which may be due to code and optimization changes, but also to changing defaults such as the transaction log size), except, again, between 5.6 and 5.7.

The right method for importing data: Load Data

If you have a minimum of experience with MySQL, you know that there is a specialized keyword for data imports, and that is LOAD DATA. Let's see how the code ends up looking when using this option (load_data_04.py):

# data import
load_data = "LOAD DATA INFILE '" + CSV_FILE + "' INTO TABLE nodes"

cursor.execute(load_data)
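
Note that the statement relies on LOAD DATA's defaults, which happen to match our tab-separated file (fields terminated by tabs, lines terminated by newlines). The explicit equivalent would be something like this:

# the same statement with the (default) field and line terminators spelled out
load_data = ("LOAD DATA INFILE '" + CSV_FILE + "' INTO TABLE nodes "
             "FIELDS TERMINATED BY '\\t' LINES TERMINATED BY '\\n'")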

Simple, isn't it? With this we are minimizing the SQL overhead, and executing the loop inside the compiled C code of the MySQL server. Let's have a look at the results (lower is better):

MySQL Version  5.1.72    5.5.39    5.6.20   5.7.4
MyISAM         141.414   149.612   155.181  166.836
InnoDB         2091.617  1890.972  920.615  1041.702

In this case, MyISAM shows a very dramatic improvement: LOAD DATA speeds up the import by up to 12x. InnoDB, again with default parameters, improves by up to 3x, and more significantly in the newer versions (5.6, 5.7) than in the older ones (5.1, 5.5). I suspect this has much more to do with the different default configuration of the log files than with code changes.

Trying to improve Load Data for MyISAM

Can we improve the load times for MyISAM? There are two things that I tried: increasing the key_buffer_size and disabling the Performance Schema for 5.6 and 5.7. I set key_buffer_size to 600M (trying to fit the primary key index in memory) and performance_schema = 0, and I tested the 3 remaining combinations (the settings are sketched after the results table). Lower is better:

MySQL Version                  5.1.72   5.5.39   5.6.20   5.7.4
default                        141.414  149.612  155.181  166.836
key_buffer_size=600M           136.649  170.622  182.698  191.228
key_buffer_size=600M, P_S=OFF  133.967  170.677  177.724  186.171
P_S=OFF                        142.592  145.679  150.684  159.702
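
For reference, the non-default combinations above amount to a couple of lines in each sandbox's my.cnf (a sketch; the loose- prefix lets versions without the Performance Schema ignore that option):

# my.cnf fragment for the MyISAM tests (sketch)
[mysqld]
key_buffer_size = 600M
loose-performance_schema = 0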

There are certain things to notice here:

  • P_S=ON and P_S=OFF should have no effect for MySQL 5.1 and 5.5, but they show different results because of measurement error. We must understand that only 2 significant figures should be taken into account.
  • key_buffer_size does not, in general, improve performance; in fact I would say that it statistically worsens it. This is reasonable because, after all, I am writing to the filesystem cache, and a larger key cache might require costlier memory reservations, or more memory copies. This should be researched further before drawing a conclusion.
  • The Performance Schema may worsen performance on this workload, but I cannot say so with statistical confidence.
  • MyISAM (or maybe the MySQL server in general) seems to have slightly worsened its performance over versions for this specific workload (single-threaded batch import).

There are more things that I would like to try with MyISAM, like the impact of the different row formats (FIXED, for example), but I wanted to move on to other engines.

Trying to improve Load Data for InnoDB

InnoDB is a much more interesting engine, as it is ACID by default, and more complex. Can we make it as fast as MyISAM for importing?

The first thing I wanted to do is change the default values of innodb_log_file_size and innodb_buffer_pool_size. The default log size is different before and after 5.6, and it is not suitable for a heavy write load. For a first test I set it to 2G, the largest size that 5.1 and 5.5 can use (actually, I set it to 2,147,483,136 bytes, as it has to be less than 2G), meaning that we have about 4G of logs in total. I also set the buffer pool to a convenient size, 8GB, enough to hold the whole dataset. Remember that one of the reasons InnoDB is so slow for imports is that it writes the new pages (at least) twice to disk: to the log, and to the tablespace. However, with these parameters, the second write should be mostly buffered in memory. These are the new results (lower is better):

MySQL Version                     5.1.72    5.5.39    5.6.20   5.7.4
default                           1923.751  1797.220  850.636  1008.349
log_file_size=2G, buffer_pool=8G  1044.923  1012.488  743.818  850.868

Now this is a test that starts to be more reasonable. We can comment that:

  • Most of the improvement that we saw before in 5.6 and 5.7 with respect to 5.1 and 5.5 was due to the 10x larger default log size.
  • Still, 5.6 and 5.7 are faster than 5.1 and 5.5 (reasonable, as 5.6 had quite a few impressive InnoDB changes, both in code and in configuration defaults)
  • InnoDB is still at least 5x slower than MyISAM
  • And 5.7 is still slower than 5.6! We are seeing a consistent 13-18% regression in 5.7 (now I am starting to worry)

I said before that the main overhead of InnoDB is writing the data twice (log and tables). This is actually wrong, as it may write it 3 times (to the doublewrite area) and even 4 times, to the binary log. The binary log is not enabled by default, but the doublewrite is, as it protects from corruption. While we would never recommend disabling the latter in production, the truth is that for an import we do not care if the data ends up corrupted (we can delete it and import it again). There are also options on certain filesystems that make it unnecessary.

Other features that are in InnoDB for safety, not for performance, are the InnoDB checksums; they have even been the cause of bottlenecks on very fast storage devices like flash PCI cards. In those cases, the CPU was too slow to calculate them! I suspect that will not be a problem here, because more modern versions of MySQL (5.6 and 5.7) can switch to the hardware-accelerated CRC32 function and, mainly, because I am using a magnetic disk, which is the real bottleneck. But let's not just trust what we've learned: let's test it.

The other thing I can check is the performance_schema overhead. I've found workloads where it produces significant overhead, and others where it produces almost none. Let's also test enabling and disabling it.

These are the results (lower is better):

MySQL Version                             5.1.72    5.5.39    5.6.20   5.7.4
default security and monitoring enabled   1044.923  1012.488  743.818  850.868
doublewrite=off                           896.423   848.110   483.542  468.943
doublewrite=off, checksums=none           889.827   846.552   488.311  476.916
doublewrite=off, checksums=none, P_S=off  -         -         488.273  467.716

There are several things to comment here, some of them I cannot even explain:

  • The doublewrite feature doesn't halve the performance, but it impacts it significantly (between 15-30%)
  • Without the doublewrite, most of the 5.7 regression goes away (why?)
  • The doublewrite is also more significant in 5.6 and 5.7 than in previous versions of MySQL. I would dare to say that most of the other bottlenecks may have been eliminated (or maybe it is just something like the buffer pool instances being active by default?)
  • The InnoDB checksums make absolutely no difference for this workload and hardware.
  • Again, I cannot give statistical significance to the overhead of the performance schema. However, I have obtained very variable results in these tests, with latencies up to 10% higher than the central values of the runs with it disabled, so I am not a hundred percent sure about this.

In summary, with just a bit of tweaking, we can get results on InnoDB that are only 2x slower than MyISAM, instead of 5x or 12x.
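
For reference, that "bit of tweaking" adds up to something like the following my.cnf fragment (a sketch: innodb_checksum_algorithm only exists in 5.6+, older versions use the innodb_checksums boolean, and disabling the doublewrite buffer is only acceptable for disposable imports like this one):

# my.cnf fragment for the tuned InnoDB import tests (sketch)
[mysqld]
innodb_buffer_pool_size = 8G
innodb_log_file_size = 2147483136        # just under the 2G limit of 5.1/5.5
innodb_doublewrite = 0
loose-innodb_checksum_algorithm = none   # 5.6+; use innodb_checksums = 0 in 5.1/5.5
loose-performance_schema = 0
innodb_flush_log_at_trx_commit = 1       # left at the default, as per the rules above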

Import in MyISAM, convert it to InnoDB

I've seen some people on forums recommending importing a table as MyISAM, then converting it to InnoDB. Let's see if we can bust or confirm this myth with the following code (load_data_06.py):

load_data = "LOAD DATA INFILE '/tmp/nodes.csv' INTO TABLE nodes"
alter_table = "ALTER TABLE nodes ENGINE=InnoDB"

cursor.execute(load_data)
cursor.execute(alter_table)

These are the comparisons (lower is better):

MySQL Version                                5.1.72    5.5.39    5.6.20    5.7.4
LOAD DATA InnoDB                             1923.751  1797.220  850.636   1008.349
LOAD DATA MyISAM; ALTER TABLE ENGINE=InnoDB  2075.445  2041.893  1537.775  1600.467

I can see how that could have been almost true in 5.1, but it is definitely not true in currently supported versions of MySQL: loading directly into InnoDB is faster than effectively importing the table twice, once into MyISAM and then again into InnoDB via the ALTER TABLE.

I leave it as homework for the reader to check other engines, like MEMORY or CSV [Hint: maybe we could import into this last engine in a different way].

Parallel loading

MyISAM writes to tables using a full table lock (although it can perform concurrent inserts in some cases), but InnoDB only requires row-level locks in many cases. Can we speed up the process by loading in parallel? This is what I tried to test last. I do not trust my programming skills (nor do I have the time) to perform the file-seeking and chunking in a performant way, so I will start with a .csv file pre-sliced into 8 chunks (I show a simple way to do the slicing right after the loading code). It should not consume much time, but the limited synchronization tools in the default threading library, together with my limited time, made me opt for this plan. We only need to keep in mind that we do not start with the exact same scenario in this case. This is the code (load_data_08.py):

import threading
import mysql.connector

NUM_THREADS = 4   # 4 or 8, depending on the test

def load_data(port, csv_file):
    db = mysql.connector.connect(host="localhost", port=port, user="msandbox", passwd="msandbox", database="test")
    cursor = db.cursor()
    load_data_sql = "LOAD DATA INFILE '" + csv_file + "' INTO TABLE nodes"
    cursor.execute(load_data_sql)
    db.commit()

thread_list = []

# one thread per pre-sliced chunk: /tmp/nodes.csv1, /tmp/nodes.csv2, ...
for i in range(1, NUM_THREADS + 1):
    t = threading.Thread(target=load_data, args=(port, '/tmp/nodes.csv' + str(i)))
    thread_list.append(t)

for thread in thread_list:
    thread.start()

for thread in thread_list:
    thread.join()
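
As for the pre-slicing itself, something as simple as this would do (a sketch that splits by line count using the known number of rows; not necessarily how I generated the chunks), producing the /tmp/nodes.csv1 to /tmp/nodes.csv8 files used above:

# split the CSV into NUM_CHUNKS files with roughly the same number of lines (run once, beforehand)
NUM_CHUNKS = 8
TOTAL_ROWS = 46741126                    # row count of the original file
rows_per_chunk = TOTAL_ROWS / NUM_CHUNKS + 1

with open('/tmp/nodes.csv', 'rb') as source:
    for i in range(1, NUM_CHUNKS + 1):
        with open('/tmp/nodes.csv' + str(i), 'wb') as chunk:
            for _ in xrange(rows_per_chunk):
                line = source.readline()
                if not line:
                    break
                chunk.write(line)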

And these are the results, with different parameters:

MySQL Version                                 5.1.72   5.5.39   5.6.20   5.7.4
1 thread,  log_file_size=2G, buffer_pool=8G   894.367  859.965  488.273  467.716
8 threads, log_file_size=2G, buffer_pool=8G   752.233  704.444  370.598  290.343
8 threads, log_file_size=5G, buffer_pool=20G  -        -        301.693  243.544
4 threads, log_file_size=5G, buffer_pool=20G  -        -        295.884  245.569

From this we can see that:

  • There is little performance difference between loading in parallel with 4 or 8 threads. This is a machine with 4 cores (8 with HT)
  • Parallelization helps, although it doesn't scale linearly (going from 1 to 4-8 threads gives around a 33% speedup)
  • This is where 5.6 and especially 5.7 shine
  • A larger transaction log and buffer pool (logs larger than 4G are only available in 5.6+) still help with the load
  • Parallel load with 5.7 is the fastest way I can load this file into an InnoDB table: 243 seconds, which is almost 200K rows/s! That is 1.8x the time of the fastest way I can load it into a MyISAM table (5.1, single-threaded): 134 seconds.

Summary and open questions

  • The fastest way you can import a table into MySQL without using raw files is the LOAD DATA syntax. Use parallelization for InnoDB for better results, and remember to tune basic parameters like the transaction log size and the buffer pool. Careful programming and importing can turn a >2-hour problem into a 2-minute process. You can temporarily disable some safety features for extra performance.
  • There seems to be an important regression in 5.7 for this particular single-threaded insert load, for both MyISAM and InnoDB, with up to 15% worse performance than 5.6. I do not know yet why.
  • On the bright side, there is also an important improvement (up to 20%) relative to 5.6 with a parallel write load.
  • The Performance Schema may have an impact on this particular workload, but I am unable to measure it reliably (it is closer to 0 than my measurement error). That is a good thing.

I would be grateful if you could point out any mistakes in my assumptions here.

Here you can download the different Python scripts used to test the MySQL data loading.

Remember that these were not "formal" benchmarks, and I no longer have access to the machine where I generated them. I have yet to analyze whether the same problem exists in 5.7.5. There are other people pointing to regressions under low concurrency, like Mark Callaghan; maybe these are related? As usual, post a comment here or reach me on Twitter.