How InnoDB handles REDO logging

Xaprb (Baron) recently blogged about how InnoDB performs a checkpoint , I thought it might be useful to explain another important mechanism that affects both response time and throughput – The transaction log.

The InnoDB transaction log handles REDO logging, this is the mechanism that provides the A (Atomic) C (Consistent)  and D (Durability) in ACID. The transaction log keeps a complete record of every change that occurs to the pages inside the database. Instead of logging whole pages (Value / Physical logging) where both the before and after image of a page is logged or by logging logical changes  (Logical logging) to the dataset, InnoDB uses  Physiological logging, this basically means that it combines the two techniques to provide a logging mechanism that is both optimal in terms of the amount of data logged whilst still ensuring the log is action consistent and each operation is idempotent. It is important to note that physiological logging needs to operate on consistent pages, Innodb achieves this by using the double write buffer to ensure consistent page writes.

Why log?

Since InnoDB tries to keep the working set in memory (InnoDB Buffer Pool), therefore the changes made by transactions will occur in volatile memory and later be flushed to disk. So in the event of volatile memory failure or during a system restart InnoDB can guarantee to have a (C)onsistent record of the state of the data in the database in (D)urable memory and that each transaction is (A)tomic.

What is actually logged?

<space id, page no, operation code, data>

This translates into operations such as

<0, 2600, ‘insert’, ‘after record at offset 8192’, ‘(38, Percona, Fortune…)’>

The log header stores information about the last checkpoint. The log is stored as records rather than pages and has 512 byte alignment, matching disk sectors. Percona Server with XtraDB allows you to control the transaction log record size, as better performance may be gained with 4096 byte records with newer media such as SSD cards.

Why Idempotent?

Each record in the log needs to be able to be redone multiple times since at restart and media recovery we will not know if that action has previously been executed as we will start recovery from the last checkpoint.InnoDB achieves this by reading the LSN of the page and compares that to the one of log record, if the LSN is higher that means the page was flushed after this record was processed so the change does not need to be applied.

How does this affect response time?

In order to keep ACID compliance, the transaction log must guarantee the logging action happens before the transaction is committed, this is known as write-ahead-logging. This essentially means that before an update can return it must be logged. As the time to log is added to every update it can become an important overhead to your response time. Indeed if InnoDB cannot log at all, your transaction will never complete.

Both the MySQL configuration and hardware will affect the speed at which InnoDB can log. You can request that InnoDB force its REDO records to the transaction log but not actually flush the log to disk at commit. By this means you are ensuring that in the event of a MySQL crash the data will still be consistent, however in the event of a system crash there may be some records that never make it to durable memory.

The three options available are:

innodb_flush_log_at_trx_commit = 0 /* Write log buffer to log about once a second and flush */

innodb_flush_log_at_trx_commit = 1 /* Write log buffer to log and flush to disk */

innodb_flush_log_at_trx_commit = 2 /* Write log buffer to log and flush about once per second */

(About once a second is related to the fact the flush and sync is controlled by a background thread that gives no guarantee it will be per second)

NB: Percona  Server with XtraDB actually allow you to control this at session level.

This allows you to risk durability by having the logging return before the record is actually on disk. A similar and complementary mechanism is to use a battery backed raid card. This allows updates to be written to the raid cards battery protected memory which has much faster access than the underlying disk(s).

How are the REDO records applied?

InnoDB processes the REDO records from the last checkpoint, each record’s LSN is compared against the LSN of the page and if the on disk LSN is equal or higher the record is skipped. This behavior can be seen by enabling the innodb_recovery_stats option in Percona Server, which allows you to see the progress made during recovery.

How does this affect throughput?

As the REDO log in InnoDB uses a fixed length circular transaction log, the speed at which you can process updates is tightly linked to the speed at which check-pointing can occur. In order to insert a new record in the transaction log, InnoDB must ensure that the previously written record has been flushed to disk . This means that it can be beneficial to have very large transaction logs which allow a larger window between REDO logging and the checkpoint on disk.

Vadim touched upon this subject recently in his post MySQL 5.5.8 – in search of stability, as seen in his post, Percona Server with XtraDB allows much larger logs beyond the InnoDB limit of 4G.

As a side effect of logging there is no need to apply page changes at commit, therefore what would be a random write is translated into a much faster sequential write. The background processing of dirty pages can then apply algorithms to optimize those writes during checkpoint.

What else?

There is of course one component missing to our picture, just because each change is logged it does not mean we can reconstruct a consistent set of data. A change may be logged and then later rolled back do to an error or a specific request, as part of the DO-UNDO-REDO protocol InnoDB also records UNDO information, however that is stored outside of the transaction log and as such is not covered here.