How does InnoDB behave without a Primary Key?

This afternoon, Arjen Lentz and I were discussing InnoDB’s behavior without a declared PRIMARY KEY, and the topic felt interesting enough and undocumented enough to warrant its own short post.

Background on InnoDB clustered keys

In The physical structure of InnoDB index pages I described how “Everything is an index in InnoDB”. This means that InnoDB must always have a “cluster key” for each table, which is normally the PRIMARY KEY. The manual has this to say in Clustered and Secondary Indexes:

If the table has no PRIMARY KEY or suitable UNIQUE index, InnoDB internally generates a hidden clustered index on a synthetic column containing row ID values. The rows are ordered by the ID that InnoDB assigns to the rows in such a table. The row ID is a 6-byte field that increases monotonically as new rows are inserted. Thus, the rows ordered by the row ID are physically in insertion order.

I had previously assumed this meant that an invisible column would be used along with the same sequence generation code that is used to implement auto_increment (which itself has some scalability issues). However the reality is that they are completely different implementations.

Implementation of implicit Row IDs

How this is actually implemented is, as the manual says, if a table is declared with no PRIMARY KEY and no non-nullable UNIQUE KEY, InnoDB will automatically add a 6-byte (48-bit) integer column called ROW_ID to the table, and cluster the data based on that column. The column won’t be accessible to any queries nor usable for anything internally such as row-based replication.

What the manual doesn’t mention is that all tables using such ROW_ID columns share the same global sequence counter (the manual says “increases monotonically” and doesn’t clarify), which is part of the data dictionary. The maximum used value for all row IDs (well, technically the next ID to be used) is stored in the system tablespace (e.g. ibdata1) in page 7 (type SYS), within the data dictionary header (field DICT_HDR_ROW_ID).

This global sequence counter is protected by dict_sys->mutex, even for incrementing (as opposed to using atomic increment). The implementation is in include/dict0boot.ic (many blank lines deleted):

    38  UNIV_INLINE
    39  row_id_t
    40  dict_sys_get_new_row_id(void)
    41  /*=========================*/
    42  {
    43          row_id_t        id;
    44  
    45          mutex_enter(&(dict_sys->mutex));
    47          id = dict_sys->row_id;
    49          if (0 == (id % DICT_HDR_ROW_ID_WRITE_MARGIN)) {
    51                  dict_hdr_flush_row_id();
    52          }
    54          dict_sys->row_id++;
    56          mutex_exit(&(dict_sys->mutex));
    57  
    58          return(id);
    59  }

(You may also notice that this code lacks any protection for overflowing the 48 bits allotted to row IDs. That is unnecessarily sloppy coding, but even at a continuous 1 million inserts per second [which is probably a bit optimistic ;)] it would take about 9 years to exhaust the ID space. I guess that’s okay.)

Ensuring non-conflicting IDs are generated

The counter is flushed to disk every 256th ID generated (the define DICT_HDR_ROW_ID_WRITE_MARGIN above), by modifying the value in the SYS data dictionary page, which is logged to the transaction log. On startup, InnoDB will increase the DICT_HDR_ROW_ID stored on disk by at least 256, and at most 511. This ensures that any IDs generated will have been less than the new starting value, and thus there will not be any conflicts.

Performance and contention implications

Given how much other code within InnoDB is protected by dict_sys->mutex I think it’s fair to say any tables with an implicit clustered key (ROW_ID) could expect to experience random insert stalls during operations like dropping (unrelated) tables. Parallel insertion into multiple tables with implicit keys could be performance-constrained, as it will be serialized on both the shared mutex and cache contention for the shared counter variable. Additionally, every 256th value generated will cause a log write (and flush) for the SYS page modification, regardless of whether the transaction has committed yet (or ever will).