In On learning InnoDB: A journey to the core, I introduced the innodb_diagrams project to document the InnoDB internals, which provides the diagrams used in this post. Later on in A quick introduction to innodb_ruby I walked through installation and a few quick demos of the innodb_space command-line tool.
The physical structure of InnoDB’s INDEX pages was described in The physical structure of InnoDB index pages, and the logical structure was described in B+Tree index structures in InnoDB. Now we’ll look in detail at the physical structure of records used in those pages.
In this post, only COMPACT row format (for Barracuda table format) is considered.
In previous posts, record offsets have been described in many structures that need to “point” to records. Record offsets point to the start of the record data itself, which is variable-length, but each record is preceded by a record header, which is also variable-length. In this post and the illustrations in it, I use N to mean the start of the record, where the record data is at N and using positive offsets (e.g. N+1), and the record header uses negative offsets (e.g. N-1). InnoDB often refers to the start of the record data, location N as the “origin”.
The record header
In previous posts, the record header was mentioned a few times, but hasn’t been fully described. As above, the record header precedes the record itself, and is structured as follows:
The fields in the record header are (in order, backwards from N):
The record header is a minimum of 5 bytes per row, but for records with many variable-width fields, it could be a lot larger.
The clustered key (PRIMARY KEY) has one of the more complex record structures:
Since clustered keys are unique and not nullable, the nullable field bitmap will be absent. Clustered keys may be of variable length, and if so the variable field length array will be present. The following fields are included in the record data:
The structure of non-leaf page records is similar, but somewhat simpler:
Since non-leaf pages are not MVCC, the Transaction ID and Roll Pointer fields are eliminated. Instead of the non-key fields, the child page number this node pointer points to is included.
Secondary indexes in InnoDB have an identical overall structure to the clustered key (PRIMARY KEY) but instead of containing non-key fields, they contain the clustered key fields, also known as a Primary Key Value or most commonly, PKV. If any fields overlap between the secondary key and the clustered key, the overlapping fields are removed from the clustered key stored in the secondary index records. For example, if a table has a PRIMARY KEY (a, b, c) and a secondary index KEY (a, d), the secondary key in the index will be as expected, (a, d) but the PKVs will contain only (b, c).
Since secondary keys are allowed to include fields that are both non-unique and nullable, both the variable field lengths array and the nullable field bitmap may be present if necessary. Otherwise the leaf page structure is very simple:
The secondary key fields are concatenated together into a single byte stream as with the clustered key. The clustered key fields are concatenated together in exactly the same way to form the PKV.
The secondary index non-leaf pages will also look familiar:
There is one thing of note for secondary index non-leaf pages: the clustered key fields (PKV) are included in the record and is considered part of the record’s key, not its value. Secondary indexes may be non-unique, but each record in the page must have a unique identifier, so the PKV must be included in the record to ensure uniqueness. This will mean that records in non-leaf pages of secondary keys will be 4 bytes larger than their leaf page counterparts.
An aside on per-row overhead
Looking at the above illustrations, you can easily calculate the per-row overhead InnoDB requires. Clustered Key Leaf Pages require a minimum of 5 bytes for the header, 6 bytes for the Transaction ID, and 7 bytes for the Roll Pointer, totaling 18 bytes per row. For very narrow tables (for example 2-3 integers) this overhead can be quite high.
In addition, there is substantial per-page overhead, and waste in inefficiently filling pages which can consume large amounts of space (for instance pages may be half-filled).
In the next post I will describe the page directory and its purpose in efficient record retrieval.