InnoDB Tidbit: The doublewrite buffer wastes 32 pages (512 KiB)

In my ongoing quest to completely understand InnoDB’s data storage, I came across a quite small and inconsequential waste, which is nevertheless fun to write about. I noticed the following block of pages which were allocated very early in the ibdata1 system tablespace but apparently unused (unnecessary lines removed from output):

$ innodb_space -f ibdata1 space-page-type-regions

start       end         count       type                
13          44          32          ALLOCATED           

Background on the doublewrite buffer

Most people using InnoDB have heard of the “doublewrite buffer”, which is part of InnoDB’s page flushing strategy. The doublewrite buffer is used as a “scratch area” to write (by default) 128 pages contiguously before flushing them out to their final destinations (which may be up to 128 different writes). The MySQL manual says, in “InnoDB Disk I/O”:

InnoDB uses a novel file flush technique involving a structure called the doublewrite buffer. It adds safety to recovery following an operating system crash or a power outage, and improves performance on most varieties of Unix by reducing the need for fsync() operations.

Before writing pages to a data file, InnoDB first writes them to a contiguous tablespace area called the doublewrite buffer. Only after the write and the flush to the doublewrite buffer has completed does InnoDB write the pages to their proper positions in the data file. If the operating system crashes in the middle of a page write, InnoDB can later find a good copy of the page from the doublewrite buffer during recovery.

The doublewrite buffer needs to be accounted for, too

Normally the doublewrite buffer consists of two extents, each of which is 64 contiguous pages (1 MiB at the default 16 KiB page size), for a total of 128 pages, or 2 MiB. However, InnoDB can’t just blindly borrow those two extents; it must account for them in the space file. To do this, it creates a file segment (aka Fseg) and uses an Inode to point to it. In “Page management in InnoDB space files” I described how file segments contain:

  • An array of up to 32 individually-allocated “fragment” pages
  • A list of “full” extents (no pages free)
  • A list of “not full” extents (partially allocated)
  • A list of “free” extents (no pages allocated)
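To make the four parts concrete, here is a minimal in-memory sketch of a file segment. It is not InnoDB's on-disk inode layout: the struct, the constants, and the helper names are all hypothetical, and the extent lists are reduced to simple counters.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative constants; these names are hypothetical, not InnoDB's own. */
#define SKETCH_FRAG_SLOTS   32u        /* size of the fragment array */
#define SKETCH_EXTENT_PAGES 64u        /* pages per extent */
#define SKETCH_NO_PAGE      UINT32_MAX /* marks an empty fragment slot */

/* Simplified model of one file segment. The real structure lives on disk
 * in an inode entry and links extents through lists; here the lists are
 * reduced to counters. */
struct fseg_sketch {
    uint32_t frag[SKETCH_FRAG_SLOTS]; /* individually allocated pages */
    uint32_t n_full;        /* extents on the "full" list */
    uint32_t n_not_full;    /* extents on the "not full" list */
    uint32_t not_full_used; /* pages in use across "not full" extents */
    uint32_t n_free;        /* extents on the "free" list */
};

/* Start with an empty segment: no fragment pages, no extents. */
static void fseg_sketch_init(struct fseg_sketch *s)
{
    for (size_t i = 0; i < SKETCH_FRAG_SLOTS; i++) {
        s->frag[i] = SKETCH_NO_PAGE;
    }
    s->n_full = 0;
    s->n_not_full = 0;
    s->not_full_used = 0;
    s->n_free = 0;
}

/* Count the occupied slots in the fragment array. */
static uint32_t fseg_sketch_frag_used(const struct fseg_sketch *s)
{
    uint32_t used = 0;
    for (size_t i = 0; i < SKETCH_FRAG_SLOTS; i++) {
        if (s->frag[i] != SKETCH_NO_PAGE) {
            used++;
        }
    }
    return used;
}

/* Pages in use: fragment pages, every page of each full extent, and the
 * used pages of partially filled extents. Free extents contribute none. */
static uint32_t fseg_sketch_pages_used(const struct fseg_sketch *s)
{
    return fseg_sketch_frag_used(s)
         + s->n_full * SKETCH_EXTENT_PAGES
         + s->not_full_used;
}
```

In this model, a segment with all 32 fragment slots occupied and two extents on its "full" list accounts for 32 + 2 * 64 = 160 pages.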

Causing full extents to be allocated

Allocations to a file segment will always fill up the fragment array before allocating complete extents. The doublewrite buffer is strangely not special in this case. The code which allocates it uses the following loop in trx/trx0sys.c at line 335:

for (i = 0; i < 2 * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE
       + FSP_EXTENT_SIZE / 2; i++) {

This is unfortunately written without any comments of note, but it is allocating a total of 160 pages:

  • FSP_EXTENT_SIZE / 2 → 64 / 2 → 32 pages
  • 2 * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE → 2 * 64 → 128 pages
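The arithmetic can be checked with a few lines of C. This is only a sketch: the two constants are redefined locally with the values used above, and the 16 KiB page size is an assumption (the default) rather than something read from a real tablespace. The helper function names are hypothetical.

```c
#include <stdint.h>

/* Values as used in the text, redefined locally for this sketch. */
#define FSP_EXTENT_SIZE                64u
#define TRX_SYS_DOUBLEWRITE_BLOCK_SIZE 64u
/* Assumption: the default 16 KiB InnoDB page size. */
#define ASSUMED_PAGE_SIZE_BYTES        (16u * 1024u)

/* Upper bound of the allocation loop, written as in the original code. */
static uint32_t doublewrite_loop_iterations(void)
{
    return 2 * TRX_SYS_DOUBLEWRITE_BLOCK_SIZE + FSP_EXTENT_SIZE / 2;
}

/* Pages allocated only to fill the fragment array. */
static uint32_t doublewrite_filler_pages(void)
{
    return FSP_EXTENT_SIZE / 2;
}

/* Size of those filler pages in KiB. */
static uint32_t doublewrite_filler_kib(void)
{
    return doublewrite_filler_pages() * (ASSUMED_PAGE_SIZE_BYTES / 1024u);
}
```

Under these assumptions the loop runs 160 times, 32 of those pages are filler, and the filler occupies 512 KiB.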

The initial 32 pages are allocated purely to fill up the fragment array, forcing the fseg_alloc_free_page calls that follow to start allocating complete extents for the remaining 128 pages (which the doublewrite buffer actually needs). The code then checks which extents were allocated and adds the initial page numbers for those extents to the TRX_SYS header as the doublewrite buffer allocation. In a typical system, InnoDB would allocate the following pages:

  • Fragment pages 13-44 — Perpetually unused fragment pages, but left allocated to the file segment for the doublewrite buffer.
  • Extent starting at page 64, ending at page 127 — Block 1 of the doublewrite buffer in practice.
  • Extent starting at page 128, ending at page 191 — Block 2 of the doublewrite buffer in practice.
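The page numbers above can be reproduced with a toy model of the allocator. This is a deliberately simplified sketch, not InnoDB's fseg_alloc_free_page: the toy allocator hands out individual pages until 32 fragment slots are used and then hands out pages sequentially from whole extents. The starting pages (13 and 64) are taken from the typical system described above and passed in as parameters, and every name in the block is hypothetical.

```c
#include <stdint.h>

#define EXTENT_PAGES 64u
#define FRAG_SLOTS   32u
/* Same bound as the original loop: 2 * 64 + 64 / 2 = 160. */
#define LOOP_PAGES   (2u * EXTENT_PAGES + EXTENT_PAGES / 2u)

/* Toy allocator state; the real allocator consults on-disk bitmaps. */
struct toy_space {
    uint32_t next_frag_page;   /* next individually allocated page */
    uint32_t next_extent_page; /* next page inside whole extents */
    uint32_t frag_used;        /* fragment slots consumed so far */
};

/* Fill the fragment array first, then allocate from whole extents. */
static uint32_t toy_alloc_page(struct toy_space *sp)
{
    if (sp->frag_used < FRAG_SLOTS) {
        sp->frag_used++;
        return sp->next_frag_page++;
    }
    return sp->next_extent_page++;
}

struct dblwr_result {
    uint32_t first_frag; /* first filler page */
    uint32_t last_frag;  /* last filler page */
    uint32_t block1;     /* first page of doublewrite block 1 */
    uint32_t block2;     /* first page of doublewrite block 2 */
};

/* Run the single 160-iteration loop and note where each region starts. */
static struct dblwr_result toy_create_doublewrite(uint32_t first_free_page,
                                                  uint32_t first_extent_page)
{
    struct toy_space sp = { first_free_page, first_extent_page, 0 };
    struct dblwr_result r = { 0, 0, 0, 0 };

    for (uint32_t i = 0; i < LOOP_PAGES; i++) {
        uint32_t page = toy_alloc_page(&sp);

        if (i == 0) {
            r.first_frag = page;
        }
        if (i == FRAG_SLOTS - 1) {
            r.last_frag = page;
        }
        if (i == FRAG_SLOTS) {
            r.block1 = page; /* fragment array is full; extents begin */
        }
        if (i == FRAG_SLOTS + EXTENT_PAGES) {
            r.block2 = page;
        }
    }
    return r;
}
```

Calling toy_create_doublewrite(13, 64) yields filler pages 13 through 44 and blocks starting at pages 64 and 128, matching the list above.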

Using innodb_ruby to dump file segments (by inode)

I recently added new space-inodes-detail and space-inodes-summary modes to the innodb_space program in innodb_ruby, which can handily show exactly which pages and extents are allocated to a given file segment (trimmed for clarity and reformatted for line wrap; normally printed on a single long line):

$ innodb_space -f ibdata1 space-inodes-detail

INODE fseg_id=15, pages=160,
  frag=32 pages (13, 14, ..., 43, 44),
  full=2 extents (64-127, 128-191),
  not_full=0 extents () (0/0 pages used),
  free=0 extents ()

Here you can clearly see the two complete extents in the file segment’s “full” list, along with the 32 fragment pages.

Conclusions

There are a few ways this could have been avoided, such as freeing the individual pages after the two extents were allocated, or adding a special “no fragments” allocation method. However, as I said at the start, it is pretty inconsequential, as it amounts to only 512 KiB per installation. The code could definitely use a rewrite for clarity though, given the nuanced behavior. It could also stand to use one of the existing defines such as FSEG_FRAG_ARR_N_SLOTS or FSEG_FRAG_LIMIT rather than repeating the basically unexplained calculation FSP_EXTENT_SIZE / 2. Additionally, rewriting it to use a more meaningful loop structure would be helpful; there’s no reason it needs to allocate all three sets of pages in the same for loop (especially without comments).
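As a rough illustration of what a clearer structure might look like, here is a sketch with one commented step per set of pages. It is not a patch against InnoDB: the allocator is passed in as a callback, SKETCH_FRAG_SLOTS merely stands in for a define such as FSEG_FRAG_ARR_N_SLOTS, and demo_alloc_page is a stand-in allocator that exists only so the sketch can be exercised. All names are hypothetical.

```c
#include <stdint.h>

#define SKETCH_EXTENT_PAGES 64u
#define SKETCH_FRAG_SLOTS   32u /* stands in for FSEG_FRAG_ARR_N_SLOTS */
#define SKETCH_BLOCK_PAGES  SKETCH_EXTENT_PAGES

/* Hypothetical allocator callback: returns the next page number. */
typedef uint32_t (*alloc_page_fn)(void *ctx);

struct dblwr_blocks {
    uint32_t block1; /* first page of doublewrite block 1 */
    uint32_t block2; /* first page of doublewrite block 2 */
};

/* Allocate one whole block and return its first page number. */
static uint32_t alloc_block(alloc_page_fn alloc, void *ctx)
{
    uint32_t first = alloc(ctx);
    for (uint32_t i = 1; i < SKETCH_BLOCK_PAGES; i++) {
        (void)alloc(ctx);
    }
    return first;
}

static struct dblwr_blocks create_doublewrite_sketch(alloc_page_fn alloc,
                                                     void *ctx)
{
    struct dblwr_blocks b;

    /* Step 1: exhaust the fragment array so that all later allocations
     * are served from complete extents. These pages stay unused. */
    for (uint32_t i = 0; i < SKETCH_FRAG_SLOTS; i++) {
        (void)alloc(ctx);
    }

    /* Steps 2 and 3: allocate the two doublewrite blocks. */
    b.block1 = alloc_block(alloc, ctx);
    b.block2 = alloc_block(alloc, ctx);
    return b;
}

/* Stand-in allocator for demonstration only: fragment pages first,
 * then sequential pages from whole extents. */
struct demo_alloc {
    uint32_t frag_left;        /* fragment slots still available */
    uint32_t next_frag;        /* next individually allocated page */
    uint32_t next_extent_page; /* next page inside whole extents */
};

static uint32_t demo_alloc_page(void *ctx)
{
    struct demo_alloc *a = ctx;
    if (a->frag_left > 0) {
        a->frag_left--;
        return a->next_frag++;
    }
    return a->next_extent_page++;
}
```

With a demo allocator initialized as { 32, 13, 64 }, this returns blocks starting at pages 64 and 128, the same result as the single loop, but each step states its purpose.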