This article discusses the storage (inline and external) of field data in the InnoDB storage engine. All fields of variable length types like VARCHAR, VARBINARY, BLOB and TEXT can be stored inline within the clustered index record, or stored externally in separate BLOB pages outside of the index record (but within the same tablespace). All of these fields can be classified as large objects. These large objects are either binary large objects or character large objects. The binary large objects do not have an associated character set, whereas the character large objects do.
Within the InnoDB storage engine there is no difference between the way character large objects and binary large objects are handled. Throughout this article we will use the term “BLOB field” to refer to any of the aforementioned field types that can be chosen for external storage.
This article will provide information about the following:
- Explain when a BLOB field will be stored inline and when it will be stored externally, with respect to the clustered index record.
- The structure of the BLOB reference.
- The BLOB prefix that is stored in the clustered index record, when the BLOB is stored externally.
- Utility gdb functions to examine the BLOB reference and the record offsets array.
The BLOB fields are associated with the clustered index records (the primary key) of a table. Only the clustered index can store a BLOB field externally. A secondary index cannot have externally stored fields. For the purposes of this article, we won’t deal with any secondary indexes.
The Schema
The following example table will be used to present the information:
CREATE TABLE t1 (f1 INT PRIMARY KEY, f2 BLOB, f3 TEXT); INSERT INTO t1 VALUES (1, REPEAT('௲', 1000), REPEAT('௱', 1000)); INSERT INTO t1 VALUES (2, REPEAT('௲', 20000), REPEAT('௱', 20000)); INSERT INTO t1 VALUES (3, REPEAT('௲', 20000), REPEAT('௱', 1500)); INSERT INTO t1 VALUES (4, REPEAT('௲', 1500), REPEAT('௱', 20000));
Note: ௱ – Tamil number one hundred (Unicode 0BF1), ௲ – Tamil number one thousand (Unicode 0BF2)
A single clustered index record can have 1 or more externally stored BLOBs. So for the given table definition of t1, there are 4 possible ways that the BLOB fields of f2 and f3 can be stored:
- f2 and f3 are both stored inline within the clustered index page
- f2 is stored inline, while f3 is stored externally
- f3 is stored inline, while f2 is stored externally
- both f2 and f3 are stored externally
In the following sections, let us see which of the BLOB columns are externally stored and which of them are stored inline, for each sample row we created above. Note that the row format of table t1 is not explicitly specified. In MySQL 5.6, it will default to the COMPACT row format. Please keep this in mind as we discuss the example.
Overview of BLOB Storage
The BLOB data can be stored inline in the clustered index record, or it can be stored externally in separate BLOB pages. These external BLOB pages are allocated from the same tablespace in which the clustered index resides. The BLOB data will always be stored inline whenever possible though. If and only if this is not possible because of the record size, then the BLOB field will be stored externally. This is true for all of the current row formats: REDUNDANT, COMPACT, DYNAMIC, and COMPRESSED. Let’s now take a look at the storage details for the BLOB columns in our example table.
In MySQL 5.6, the default row format of a table is COMPACT, so
that’s what our t1 table is using. The default page size is 16K,
so that’s also what we’re using. The maximum record size that can
be stored in a 16K page using the COMPACT row format is 8126
bytes. The function page_get_free_space_of_empty()
will tell us the total free space available in a page. The value
returned by this function, which is then divided by 2, will give
us the maximum allowed record size. The division by 2 is required
because an index page must contain a minimum of 2 records. Let’s
look at an example (the argument value of “1″ tells the function
that the row format for the page is COMPACT):
(gdb) call page_get_free_space_of_empty(1)/2 $4 = 8126 (gdb)
The following table shows the storage details of the BLOB columns for each row in table t1. Keeping in mind the maximum allowed record size, it is clear as to why the 60000 bytes of BLOB data is always stored externally. It simply will not fit within a single clustered index record.
Primary Key | Length of f2 | Storage of f2 | Length of f3 | Storage of f3 |
---|---|---|---|---|
1 | 3000 bytes | Inline | 3000 bytes | Inline |
2 | 60000 bytes | External | 60000 bytes | External |
3 | 60000 bytes | External | 4500 bytes | Inline |
4 | 4500 bytes | Inline | 60000 bytes | External |
Inline Storage of BLOBs
As mentioned previously, no BLOB fields will be externally stored if the size of the record is less than the maximum record size allowed in a page. In our example table, for the row with a primary key value of 1, no BLOB fields need to be stored externally because the full record size is less than 8126 bytes. The following tables give the sizing details for each of the rows in our example table:
Primary Key | Clustered Index Record Size (in bytes) | Fields Moved Out | Record Size After Moving Blob Outside (in bytes) | Maximum Allowed Record Size (in bytes) |
---|---|---|---|---|
1 | 6027 | - | 6027 | 8126 |
2 | 120027 | f2, f3 | 1603 | 8126 |
3 | 64527 | f2 | 5315 | 8126 |
4 | 64527 | f3 | 5315 | 8126 |
As we can see, the BLOB fields are stored externally until the
record size falls below the limit. In the table above, column 2
gives the initial clustered index record size. If this size is
greater than the allowed maximum size of the record (shown in
column 5), then the function
dtuple_convert_big_rec()
is invoked to choose the
fields destined for external storage. Column 3 lists the fields
that have been chosen by this function for external storage.
Column 4 shows the clustered index record size after moving the
chosen fields to external storage. Again, this value must be less
than the maximum record size, shown in column 5 (which is 8126
bytes in our example).
All of the size details provided above are obtained through the debugger from the callers of this function.
For more clarity, let me explain the clustered index record length for the first row with a primary key value of 1 (shown as 6027 bytes). The length of the user fields f1, f2, and f3 are 4 bytes, 3000 bytes, and 3000 bytes respectively. The length of the system fields DB_ROLL_PTR and DB_TRX_ID are 7 bytes and 6 bytes respectively. The record header stores the length of the f2 and f3 fields, taking 4 bytes (2 bytes for each field). The record header also contains a null bit array, which for this record takes up 1 byte. Lastly, the record header contains REC_N_NEW_EXTRA_BYTES—which is 5 bytes for the COMPACT row format—of additional information. The complete storage details for the final record are presented in the following table:
Field | Length (in bytes) |
---|---|
Total | 6027 |
Length of f3 | 2 |
Length of f2 | 2 |
Null bit array | 1 |
REC_N_NEW_EXTRA_BYTES | 5 |
f1 | 4 |
DB_TRX_ID | 6 |
DB_ROLL_PTR | 7 |
f2 | 3000 |
f3 | 3000 |
You can refer to the documentation in storage/innobase/rem/rem0rec.cc
for
more details about the COMPACT and REDUNDANT row formats. The
REDUNDANT row format is also explained in the blog article
InnoDB REDUNDANT Row Format.
Choosing Fields for External Storage
As discussed above, the function
dtuple_convert_big_rec()
is invoked to decide which
parts of the oversized clustered index record will be chosen for
external storage. This function makes use of the following rules
to decide this:
- No fixed length fields can be chosen for external storage.
- No variable length fields whose size is less than or equal to 40 (2 * BTR_EXTERN_FIELD_REF_SIZE) bytes will be chosen for external storage.
- No variable length fields whose size is less than the BLOB prefix size will be chosen for external storage. This means that in the case of REDUNDANT and COMPACT row formats, if the field data length is less than or equal to 768 bytes (DICT_ANTELOPE_MAX_INDEX_COL_LEN), then it will not be chosen for external storage. This rule is not applicable for DYNAMIC and COMPRESSED row formats, because their BLOB prefix is 0 bytes.
In the function dtuple_convert_big_rec()
, we examine
one BLOB field at a time for potential external storage, and then
move it to external storage if it passes the criteria noted
above, until the clustered index record size falls within the
maximum allowed. Larger fields will be selected for external
storage before smaller fields, to ensure that maximum space
savings happens in the clustered index page. This ensures that
more records can be stored in each clustered index page.
BLOB Reference
When a BLOB field is stored externally, a BLOB reference is stored in the clustered index record. The BLOB reference will be stored after the BLOB prefix, if any. This BLOB reference is 20 bytes, and it contains the following information:
- The space identifier (4 bytes)
- The page number where the first page of the BLOB is stored (4 bytes)
- The offset of the BLOB header within that page (4 bytes)
- The total size of the BLOB data (8 bytes)
Even though 8 bytes are available to store the total size of the BLOB data, only the last 4 bytes are actually used. This means that within InnoDB, the maximum size of a single BLOB field is currently 4GB.
In the length field, two bits are used to store the ownership and inheritance information, which are not discussed in this article. We will cover that in a subsequent blog article. The most significant bit of the length field is used to store ownership information and the second most significant bit is used to store the inheritance information.
Here is a gdb function to print the contents of a BLOB reference.
This function takes a pointer to the external BLOB reference as
an argument. The calls to ntohl()
are required
because all data on disk is stored in network byte order by
InnoDB.
define ib_print_blob_ref set $ref=$arg0 set $space_id=ntohl(*(ulint *)($ref)) set $page_no=ntohl(*(ulint *)($ref + 4)) set $offset=ntohl(*(ulint *)($ref + 8)) set $flags=*(char *)($ref + 12) set $ownership = $flags & 0x80 set $inherited = $flags & 0x40 set $length = ntohl(*(ulint *)($ref + 16)) printf "space_id : %lu\n", $space_id printf "page_no : %lu\n", $page_no printf "offset : %lu\n", $offset printf "ownership : %x\n", $ownership printf "inherited : %x\n", $inherited printf "length : %lu\n", $length end
BLOB Prefix
When a BLOB field is stored externally, we may also store a prefix of the value in the clustered index record, depending on the row format used. For the REDUNDANT and COMPACT row formats, a BLOB prefix of 768 bytes is stored in the clustered index record. For the DYNAMIC and COMPRESSED row formats, a BLOB prefix is never stored in the clustered index record. The BLOB prefix would be followed by the BLOB reference.
The BLOB prefix, when available, helps to calculate the secondary index key without needing to fetch the externally stored BLOB data (which involves at least one extra page load/fetch). This is possible because the maximum length of a secondary index key is 767 bytes. If we attempt to create a secondary index with a bigger length it will be automatically truncated with a warning. For example, consider the following statement:
CREATE INDEX s1 ON t1 (f2(768));
This will generate the following warning to the user: “Specified key was too long; max key length is 767 bytes”. The resulting index records of s1 will only be on a prefix of f2, covering only the first 767 bytes.
External Storage of BLOBs
A BLOB field is considered to be stored externally if it is stored outside of the clustered index B-tree, and just a reference to the BLOB (along with any BLOB prefix) is stored in the clustered index record. This means that the size and structure of the clustered index B-tree will not be affected by the externally stored BLOB data.
For the row in our sample table with a primary key value of 2, the BLOB fields f2 and f3 will be stored externally from the clustered index record. This is because their size is too big to fit within the clustered index record. Instead, a reference to the externally stored BLOB data will be stored in the clustered index record. A pictorial view of this external storage is shown below:
Note: The diagram above shows each clustered index record having exactly one externally stored BLOB. This is just a simplication. Each clustered index record can have any number of externally stored BLOBs.
Let’s verify that the BLOB fields for the row with a primary key
value of 2 are indeed externally stored with the help of the
debugger. Run the server through a debugger (we used gdb) and
issue the query SELECT * FROM t1 WHERE f1 = 2
. Put a
breakpoint in the function row_search_mvcc()
, which
searches for rows using a cursor. Within this function, once the
offsets are calculated for the selected record, let’s print the
offsets to verify which of the columns are stored externally.
I’ll use the gdb function ib_print_offsets
(provided
in the appendix) to examine the offsets array. The (*) in the
field output shows that it is in fact stored externally.
(gdb) ib_print_offsets offsets Number of Fields: 5 ROW_FORMAT=compact ( )field: 0, end offset: 4 ( )field: 1, end offset: 10 ( )field: 2, end offset: 17 (*)field: 3, end offset: 805 (*)field: 4, end offset: 1593 (gdb)
The above output shows that the record has 5 fields (3 fields are specified by the user, and 2 are internal system fields, namely DB_TRX_ID and DB_ROLL_PTR). The field number 0 begins at offset 0 and ends at offset 4. The first field begins at offset 4 and ends at offset 10. The second field begins at offset 10 and ends at offset 17. The third field, whose data is stored externally, begins at offset 17 and ends at offset 805. The fourth field, whose data is stored externally, begins at offset 805 and ends at 1593. This information is presented in a tabular format below.
Field Number | Description | External | Begin Offset | End Offset | Length |
---|---|---|---|---|---|
0 | The field f1 (Primary Key) | No | 0 | 4 | 4 |
1 | DB_TRX_ID | No | 4 | 10 | 6 |
2 | DB_ROLL_PTR | No | 10 | 17 | 7 |
3 | The field f2 | Yes | 17 | 805 | 788 (768 bytes blob prefix + 20 bytes of blob reference) |
4 | The field f3 | Yes | 805 | 1593 | 788 (768 bytes blob prefix + 20 bytes of blob reference) |
The length of the fields f2 and f3 are each 788 bytes. This includes the 768 bytes of the BLOB prefix, and the 20 bytes for the external BLOB reference. The gdb function that I used to examine the offsets array is provided in the appendix.
The BLOB reference can be examined by using the gdb function
ib_print_blob_ref
(which we provided earlier). It
takes the BLOB reference as an argument. The BLOB reference of
field f2 is available at rec+805-20, where 805 is the end offset
of f2 and 20 is the size of the BLOB reference. The BLOB
reference of field f3 is available at rec+1593-20, where 1593 is
the end offset of f2 and 20 is the size of the BLOB reference.
The output is provided below:
(gdb) ib_print_blob_ref rec+805-20 space_id : 23 page_no : 4 offset : 38 ownership : 0 inherited : 0 length : 59232 (gdb) (gdb) ib_print_blob_ref rec+1593 -20 space_id : 23 page_no : 8 offset : 38 ownership : 0 inherited : 0 length : 59232 (gdb)
The length of the externally stored part of the BLOB is 59232 bytes. The BLOB prefix is 768 bytes. So the total length of the field is 59232 + 768 = 60000 bytes, which is the size of both the f2 and f3 fields.
Conclusion
In this article we saw when the BLOB data will be stored inline, and when it will be stored externally. We also saw the rules applied in choosing particular field data for external storage. In a subsequent blog article, we will cover further details regarding BLOB handling within InnoDB.
Appendix: A gdb Function to Examine the Offsets Array
Within InnoDB, to manipulate and access the various fields of a record, an offsets array is calculated by the function rec_get_offsets(). This function returns an offsets array of type ulint*. This array contains the following pieces of information:
- The number of fields in the index.
- Whether the record has any externally stored fields.
- The offset at which each of the fields begin and end.
- Whether the row format is COMPACT or REDUNDANT. Note that DYNAMIC and COMPRESSED row formats are a special type of the COMPACT row format. In other words, a DYNAMIC row format is a COMPACT row format plus additional rules. And the COMPRESSED row format is a DYNAMIC row format with additional rules.
- In a debug build, it also contains the rec pointer and the index pointer.
Here is the gdb function referenced earlier that can be used to examine an offsets array in a debug build of MySQL:
define ib_print_offsets set $offsets = $arg0 set $n_fields = $offsets[1] set $rec = $offsets[2] set $index = $offsets[3] set $flags = $offsets[4] set $compact = $flags & (1 << 31) set $any_ext = $flags & (1 << 30) printf "Number of Fields: %lu\n", $n_fields if ($compact == 0) printf "ROW_FORMAT=redundant\n" else printf "ROW_FORMAT=compact\n" end set $n = 0 while ($n < $n_fields) set $field_offset = $offsets[5 + $n] set $external = $field_offset & (1 << 30) set $offset = $field_offset & ((1 << 30) - 1) if ($external == 0) echo ( ) else echo (*) end printf "field: %lu, end offset: %lu\n", $n, $offset set $n = $n + 1 end end
Authors
This article is jointly authored by Annamalai Gurusami (கு அண்ணாமலை), Aditya and Thirunarayanan Balathandayuthapani of the MySQL Server Sustaining Team, Bengaluru.
I would also like to thank Marko Makela and Matt Lord for their reviews and support. They helped to improve this article to a great extent.