InnoDB data flush and write for dummy.

 

That is trying to explain myself a quite complicate process in few word/paragraphs, don’t expect anything special, this write start from one talk at the office, where I was drive in confusion by many talks, mine include.

So I decide to try to get a better logical path on what is going on InnoDB process when we modify a record. It was not my purpose to do an in depth review of every process, just overview.

Most of what I have written come from my understanding of other people writing, and code reading.

InnoDB maxi experts forgive me if I will not be precise, and correct me please when I will be wrong.

 

Let us start from the beginning.

 

InnoDB is one of the storage engines present in MySQL, one of its characteristics is that it is ACID compliant; this means it has atomic, consistent, isolated, durable transactions, and so on …. We know that let us go ahead, if you don’t know what I am talking about STOP here and read: http://dev.mysql.com/doc/refman/5.5/en/innodb-storage-engine.html.

 

To satisfy those requirements InnoDB use several components like:

Buffer Pool

Rollback log (undo insert, undo update)

redo log

doublewright

 

Finally InnoDB internally organizes the data in memory page of 16K default dimension. How all those blocks are combined and used is not always very clear.

But we know so move it on …

 

I will try to present a logical path that explains the different blocks and moments.

 

Let us begin starting MySQL, on start the MYSQLD demon will invoke InnoDB storge engine initialization, which has to perform several operations, like check the files presence and if they are readable/writable, check the catalog in the system table space, read the LSN from the data files and compare it with the last apply in the double-buffer and REDO log, initialize the Buffer Pool and other buffers.

 

If we are lucky and all the operation will be successfully executed, InnoDB will initialize correctly and will come online.

 

At this point our data-set still reside in the data file and BP is empty, system areas like undo insert, undo update (rollback log) are empty as well given no transaction are really in place.

 

Let assume we want to do a smart thing like warming-our database loading in the BP the most used datasets, to do so we simply run a select on those dataset and InnoDB will load the data from disk to BP storing record in pages of 16Kb (5.6 will allow an easy way to change/manage the page size).

 

Anyhow, running SELECT * FROM COUNTRY will load in the BP all the table in once, and data will from now on accessible from BP instead of the disk. This is true until data is frequently accessed. In this case InnoDB will maintain the pages in memory available for reading, in the case no one will ask for it, the pages containing the data will be marked as old. InnoDB use the LRU approach (least recently used) to identify which page can be removed from the BP to make space for new blocks. Also it will place the new created block(s), in the middle of the LRU list.

 

So far so good and easy, and if we need to tune the BP for efficient use of the LRU list reading the http://dev.mysql.com/doc/refman/5.5/en/innodb-buffer-pool.html gives good explanation.

 

But BP for InnoDB is not only a buffer we use to cache data, it is a real working area, where data is also manipulated.

 

Here the fun starts.

 

Let us assume we will have 3 different transactions accessing the COUNTRY table, for modification, what will happen?

 

First of all the first transaction (there is always someone coming first), will store the values its going to change in a undo log record, this could fit in one page or more, structure will follow the standard InnoDB way using cursor, tuple and page. In the undo log LSN reference is present as well for comparison with the data set.

 

The rollback log is mainly pided in undo insert and undo update. It is a log allocate PER transaction and located in the system tables (in 5.6 is possible to move it out). The number of possible buffers/transactions from InnoDB 1.1 is 123*1023 approximately 123K, before InnoDB 1.1 the maximum number was 1023.

 

Finally the log will contain the initial status of the page(s) and the versioning of the modifications done by the transaction, organize in cursors, record and page(s).

 

The combination of the different buffers/transaction and the level of interaction are defined by the Isolation level (Repeatable reads; read committed; read uncommitted etc).

 

We can have many transactions accessing the rollback log but what is not evident at the first shot is:

When rows are modified or deleted, the rows and associated undo logs are not physically removed immediately, or even immediately after the transaction commits. The old data is preserved until transactions that started earlier or concurrently are finished, so that those transactions can access the previous state of modified or deleted rows.

Thus, a long-running transaction can prevent InnoDB from purging data that was changed by a different transaction.

When rows are modified or deleted within a long-running transaction, other transactions using the READ COMMITTED and REPEATABLE READ isolation levels have to do more work to reconstruct the older data if they read those same rows.

When a long-running transaction modifies a table, queries against that table from other transactions do not make use of the covering index technique. Queries that normally could retrieve all the result columns from a secondary index, instead look up the appropriate values from the table data.

 

Seems clear enough, but let me make it clearer.

 

IF a transaction is modifying a table and this table is accessed by another transaction, the secondary index IS NOT USED for returning the value, but IS USED for the search, the value is retrieve by table look-up instead, so INDEX ->PK ->Table look-up.

 

It is not clear to me how undo log it manages the different LSN position, in case of difference between records/cursors. Can we have them on the same page? Or must reside on different space?  Also how I retrieve them in order to follow the correct order.

 

Reading the comments in trx0undo.c done by Heikki I suppose:

If each transaction is given a whole page for its update undo log, file space consumption can be 10 times higher than necessary.

Therefore, partly filled update undo log pages should be reusable.

But then there is no way inpidual pages can be ordered so that the ordering agrees with the serialization numbers of the transactions on the pages.

Thus, the history list must be formed of undo logs, not their header pages as it was in the old implementation.

However, on a single header page the transactions are placed in

the order of their serialization numbers.

As old versions are purged, we may free the page when the last transaction on the page has been purged.

A problem is that the purge has to go through the transactions in the serialization order. This means that we have to look through all rollback segments for the one that has the smallest transaction number in its history list.

 

So from what I read here, we will have a History list based on undo log (s) themselves. But the transactions will also be ordered at page level.

 

I am not sue I get it right… so don’t take it as a statement.

 

Ok back to our COUNTRY table, assume we have first transaction doing ”UPDATE COUNTRY set gov_type='dictatorship' where iso3=ITA;”

Then another doing   “UPDATE COUNTRY set people_status='unhappy' where gov_type='dictatorship'”;

Finally the 3td doing :”INSERT INTO COUNTY (<fields>)  VALUES (values);

 

Assuming they happens in short period of time but without overlap so no issue as the one describe above, what happens internally?

 

First InnoDB need to have a transaction (ib_trx_t), then it will create a “tuple” to put the value I want to change, then a cursor to represent the record structure (ib_crsr_t),  get the tuple value from the cursor, clone the tuple, modify it put back in the cursor, destroy tuple(s) and finally commit the transaction.

 

All operations until the commit, are done at tuple and cursor level. Nothing is really written in the BP or REDO log.

 

Both updates will work the same while the insert will have a different path:  Transaction and cursor creation will be similar but then it will need a tuple array and it will populate it with the available values, if value not present for the specific tuple NULL will apply, if that tuple represent a NOT NULL attribute, an error will occur.

So far so good, so nothing is really written outside the cursor until the commit, so what happen after the commit?

 

Many things.

 

On commit, value is pushed from cursor to the BP and the relevant pages modified and marked dirty.

Also the dirty page will be located in the flush list, which contains all of the dirty pages in LSN order, least-recently-modified first

HERE the optimistic approach will happen, means that InnoDB assume that the change will be only at row/Page level, without the need of an index/B-tree reorganization.

 

Note that at this point InnoDB is not in a Consistent state anymore, and thinking to flush the page(s) to the data file could take to long leaving it in an Inconsistent state for too long.

Solution is that as part of the commit InnoDB flush the information to the REDO log, which “the facto” represent the CD of ACID for InnoDB.

 

There is a lot of confusion in the documentation here, which sometimes contradict itself in the same paragraph.


This one is a good example “InnoDB, like any other ACID-compliant database engine, flushes the redo log of a transaction before it is committed. …  With group commit, InnoDB issues a single write to the log file to perform the commit action for multiple user transactions that commit at about the same time, significantly improving throughput.

 

Ok so IF InnoDB flush uncommitted transaction how can it then group by committed transaction in X of time.

 

More does it make ANY sense to flush not committed transaction at all?

I read all this in the following way (code reading will follow and let you know).

InnoDB will process all the COMMITTED transaction in memory on REDOLOG, but those transactions are not committed on the data-file(s) yet, following the write-ahead-logging concept.

To me seems more logic, and it seems to follow more the InnoDB logic.

 

Back to REDO logs or Iblog, they are written sequentially, where information is previously store in the buffer then flush on disk/file.

They contains only the LSN, and the information related to the changes.

Is important to mention that from InnoDB 1.1 the Group commit is back and that is the way to ensure that Binary log AND Iblog inserts follow the same order. To achieve that InnoDB use the Two Phase Commit protocol, pushing writes to Iblog and Binary log, but it needs to have sync_binlog disable.

 

At this point if the server crash our data will be recover reading the REDO log.

Restarting the InnoDB will start reading the Iblog from the latest checkpoint comparing the LSN number in the log with the last related to the table, all the LSN less then the log LSNs will be then apply.

 

So REDO logs are quite relevant not only for recovery but also for the durability, unfortunately we all know that the REDO logs are also a huge bottleneck, and if not tuned correctly they represent a serious issue for the MySQL effiency.

 

See Vadim blog for discussion about checkpoints (http://www.mysqlperformanceblog.com/2011/04/04/innodb-flushing-theory-and-solutions/).

But that is not all, we can have many other possible issue related to REDO logs like:

1) Iblog needs to have enough space to write the incoming transactions or your transactions process will be stop soon.

2) InnoDB will not cancel information from the log, if the relevant pages are still not flushed from BP to data-files. It will stop any insert and start to flush the dirty pages from the BP.

 

In short, keep innodb_max_dirty_pages_pct to the default value of 75% can KILL your server when using large BP.

 

After that we still have the data in the data-files not matching the data in memory, so now InnoDB needs to align that as well.

InnoDB use both sharp checkpoint, and fuzzy checkpoint.

 

The sharp checkpoint takes place when InnoDB align all the modification performed in committed transactions and when all LSN references are align as well.

Example is when InnoDB shuts down.

 

The fuzzy checkpoint take place very frequently, given InnoDB try to keep the dirty page to minimum, simplifying and citing:

When InnoDB flushes dirty pages to disk, it finds the oldest dirty page’s LSN and treats that as the checkpoint low-water mark. It then writes this to the transaction log header. You can see this in the functions log_checkpoint_margin() and log_checkpoint().

Therefore, every time InnoDB flushes dirty pages from the head of the flush list, it is actually making a checkpoint by advancing the oldest LSN in the system. And that is how continual fuzzy checkpointing is implemented without ever “doing a checkpoint” as a separate event. If there is a server crash, then recovery simply proceeds from the oldest LSN onwards.

(tks Baron for the simplification)

 

Finally InnoDB to perform the real writes use the doublebuffer, only after the write and the flush to the doublewrite buffer have completed, InnoDB write the pages to their proper positions in the data file.

 

If the operating system crashes in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite during crash recovery.

 

With that the process is far to be complete and I skip points that needs to be addressed, I know it very well.

 

My intention was just to put in order, simplifying a lot, what happens in the “main” flow.

Many open questions remain, what happen in the case of non optimistic (or pessimistic) writes that require a rebuild of innoDB b-tree index?

What in the B-tree reorganization, what about the page flush, insert buffer … and many more.

 

WHAT it will be very nice to have is a book explaining how it works from A to Z, driving us poor guys working with MySQL and InnoDB also in each file/function of the code, or at least on the main sections of it.

Glossary

Checkpoint     As changes are made to data pages that are cached in the buffer pool, those changes are written to the data files sometime later, a process known as flushing. The checkpoint is a record of the latest changes (represented by an LSN value) that have been successfully written to the data files.

LRU     An acronym for “least recently used”, a common method for managing storage areas. The items that have not been used recently are evicted when space is needed to cache newer items. InnoDB uses the LRU mechanism by default to manage the pages within the buffer pool, but makes exceptions in cases where a page might be read only a single time, such as during a full table scan. This variation of the LRU algorithm is called the midpoint insertion strategy. The ways in which the buffer pool management differs from the traditional LRU algorithm is fine-tuned by the options innodb_old_blocks_pct, innodb_old_blocks_time, and the new MySQL 5.6 options innodb_lru_scan_depth and innodb_flush_neighbors.

LSN     Acronym for “log sequence number”. This arbitrary, ever-increasing value represents a point in time corresponding to operations recorded in the redo log. (This point in time is regardless of transaction boundaries; it can fall in the middle of one or more transactions.) It is used internally by InnoDB during crash recovery and for managing the buffer pool.

In the MySQL Enterprise Backup product, you can specify an LSN to represent the point in time from which to take an incremental backup. The relevant LSN is displayed by the output of the ibbackup command. Once you have the LSN corresponding to the time of a full backup, you can specify that value to take a subsequent incremental backup, whose output contains another LSN for the next incremental backup.

Flush     To write changes to the database files, that had been buffered in a memory area or a temporary disk storage area. The InnoDB storage structures that are periodically flushed include the redo log, the undo log, and the buffer pool.     Flushing can happen because a memory area becomes full and the system needs to free some space, because a commit operation means the changes from a transaction can be finalized, or because a slow shutdown operation means that all outstanding work should be finalized. When it is not critical to flush all the buffered data at once, InnoDB can use a technique called fuzzy checkpointing to flush small batches of pages to spread out the I/O overhead.

Commit

A SQL statement that ends a transaction, making permanent any changes made by the transaction. It is the opposite of rollback, which undoes any changes made in the transaction.     InnoDB uses an optimistic mechanism for commits, so that changes can be written to the data files before the commit actually occurs. This technique makes the commit itself faster, with the tradeoff that more work is required in case of a rollback.

Page     A unit representing how much data InnoDB transfers at any one time between disk (the data files) and memory (the buffer pool). A page can contain one or more rows, depending on how much data is in each row. If a row does not fit entirely into a single page, InnoDB sets up additional pointer-style data structures so that the information about the row can be stored in one page.     One way to fit more data in each page is to use compressed row format. For tables that use BLOBs or large text fields, compact row format allows those large columns to be stored separately from the rest of the row, reducing I/O overhead and memory usage for queries that do not reference those columns.     When InnoDB reads or writes sets of pages as a batch to increase I/O throughput, it reads or writes an extent at a time.     All the InnoDB disk data structures within a MySQL instance share the same page size.

Extent     A group of pages within a tablespace totaling 1 megabyte. With the default page size of 16KB, an extent contains 64 pages. In MySQL 5.6, the page size can also be 4KB or 8KB, in which case an extent contains more pages, still adding up to 1MB.     InnoDB features such as segments, read-ahead requests and the doublewrite buffer use I/O operations that read, write, allocate, or free data one extent at a time.

Dirty page

A page in the InnoDB buffer pool that has been updated in memory, where the changes are not yet written (flushed) to the data files.

Dirty read     An operation that retrieves unreliable data, data that was updated by another transaction but not yet committed. It is only possible with the isolation level known as read uncommitted.     This kind of operation does not adhere to the ACID principle of database design. It is considered very risky, because the data could be rolled back, or updated further before being committed; then, the transaction doing the dirty read would be using data that was never confirmed as accurate.     Its polar opposite is consistent read, where InnoDB goes to great lengths to ensure that a transaction does not read information updated by another transaction, even if the other transaction commits in the meantime.

Crash recovery     The cleanup activities that occur when InnoDB is started again after a crash. Changes that were committed before the crash, but not yet written into the tablespace files, are reconstructed from the doublewrite buffer. When the database is shut down normally, this type of activity is performed during shutdown by the purge operation.     During normal operation, committed data can be stored in the insert buffer for a period of time before being written to the tablespace files. There is always a tradeoff between keeping the tablespace files up-to-date, which introduces performance overhead during normal operation, and buffering the data, which can make shutdown and crash recovery take longer.

Doublewrite buffer     InnoDB uses a novel file flush technique called doublewrite. Before writing pages to the data files, InnoDB first writes them to a contiguous area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer have completed, does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite buffer during crash recovery.     Although data is always written twice, the doublewrite buffer does not require twice as much I/O overhead or twice as many I/O operations. Data is written to the buffer itself as a large sequential chunk, with a single fsync() call to the operating system.

Write combining     An optimization technique that reduces write operations when dirty pages are flushed from the InnoDB buffer pool. If a row in a page is updated multiple times, or multiple rows on the same page are updated, all of those changes are stored to the data files in a single write operation rather than one write for each change.

Undo log     A storage area that holds copies of data modified by active transactions. If another transaction needs to see the original data (as part of a consistent read operation), the unmodified data is retrieved from this storage area. This area is physically part of the system tablespace. It is split into separate portions, the insert undo buffer and the update undo buffer. Collectively, these parts are also known as the rollback segment, a familiar term for Oracle DBAs.

Redo log     A set of files, typically named ib_logfile0 and ib_logfile1, that record statements that attempt to change data in InnoDB tables. These statements are replayed automatically to correct data written by incomplete transactions, on startup following a crash. The passage of data through the redo logs is represented by the ever-increasing LSN value. The 4GB limit on maximum size for the redo log is raised in MySQL 5.6.     The disk layout of the redo log is influenced by the configuration options innodb_log_file_size, innodb_log_group_home_dir, and (rarely) innodb_log_files_in_group. The performance of redo log operations is also affected by the log buffer, which is controlled by the innodb_log_buffer_size configuration option.

Tuple     In the Embedded InnoDB product, a data structure holding one or more table columns. The tuple is an intermediate holding area where query results are stored before the columns are extracted, or where column values are stored before being used in an insert or update operation.

Cursor     An internal data structure that is used to represent the result set of a query, or other operation that performs a search using a SQL WHERE clause. It works like an iterator in other high-level languages, producing each value from the result set as requested.     Although usually SQL handles the processing of cursors for you, you might delve into the inner workings when dealing with performance-critical code. Embedded InnoDB requires you to be familiar with cursors and to manipulate them using C API calls.

Reference

Baron http://www.xaprb.com/blog/2011/01/29/how-innodb-performs-a-checkpoint/

Vadim http://www.mysqlperformanceblog.com/2011/04/04/innodb-flushing-theory-and-solutions/

Ewen Fortune http://www.mysqlperformanceblog.com/2011/02/03/how-innodb-handles-redo-logging/

Mark Challagan  http://www.facebook.com/note.php?note_id=408059000932

Domas http://dom.as/2011/07/03/innodb-index-lock/

 

And the code … good reading

 

Finally I thank Danil Zburivsky who raised few good questions.

{joscommentenable}