Introducing Multiple Clustering Indexes

In this posting I’ll describe TokuDB’s multiple clustering index feature.  (This posting is by Zardosht.)

In general (not just for TokuDB) a clustered index or a clustering index is an index that stores the all of the data for the rows.  Quoting the MySQL 5.1 reference manual:

Accessing a row through the clustered index is fast because the row data is on the same page where the index search leads. If a table is large, the clustered index architecture often saves a disk I/O operation when compared to storage organizations that store row data using a different page from the index record.

Most storage engines allow at most one clustered index for each table. For example, MyISAM does not support clustered indexes at all, whereas InnoDB allows only the primary key be a clustered index.

In TokuDB, we have added functionality and grammar syntax to define multiple clustered indexes. As a result, users can get the performance advantages of clustered indexes for multiple indexes.

How to define multiple clustered indexes

To define a clustered index in TokuDB, simply add the key word “clustering” in front of any index that is to be created.  Here are some examples:

create table foo (a int, b int, clustering key (a)) engine=tokudb;
alter table foo add clustering index clstr_key(a);
create clustering index clstr_key on foo(a);

How clustering keys work

Like InnoDB, TokuDB clusters data with its primary key (if no primary key is defined, then TokuDB auto-generates a hidden one). The primary index maintains a copy of the entire row.  Each secondary index stores, for each row, the row’s secondary key and primary key.

A clustering index maintains a copy of the entire row, not just the primary key. As a result, when querying on a clustering key lookups in the primary index are always avoided.  A clustering index is a covering index for any query.  Another way to think of a clustering index is that it is a materialization of the table sorted in another order.

An example using clustering keys

Suppose we had 40M rows in an iiBench table with this schema:

Create Table: CREATE TABLE `purchases_index` (
 `transactionid` bigint(20) NOT NULL AUTO_INCREMENT,
 `dateandtime` datetime DEFAULT NULL,
 `cashregisterid` int(11) NOT NULL,
 `customerid` int(11) NOT NULL,
 `productid` int(11) NOT NULL,
 `price` float NOT NULL
) ENGINE=TOKUDB AUTO_INCREMENT=40000001 DEFAULT CHARSET=latin1

Now suppose we wanted to perform the following queries in sequence:


  • Query 1:
    select sum(price) from purchases_index where customerid > 51 and customerid < 55;
      

    which scans more than 1200 rows.
  • Query 2:
    select * from purchases_index where customerid = 100;
      

    which returns 406 rows.
  • Query 3:
    select productid, dateandtime from purchases_index where customerid = 200 or customerid=300;
      

    which returns 820 rows.

Ideally, to perform the first query, we would want a covering index of (customerid, price) on the table.  To perform the third query, we would want a covering index of (customerid, productid, dateandtime). For the second query, a covering index would need to include all the fields.

With other storage engines, the common solution is to define a key on the field (customerid). Here are the results with TokuDB when using a key on customerid:

Query 1: 9.47 seconds
Query 2: 3.11 seconds
Query 3: 6.16 seconds

Here are the results with MyISAM when using a key on customerid:

Query 1: 2.66 seconds
Query 2: 0.91 seconds
Query 3: 1.78 seconds

Here are the results with TokuDB when using a clustering key on customerid:

Query 1: 0.14 seconds
Query 2: 0.06 seconds
Query 3: 0.20 seconds

As we can see, clustering indexes provide the same order of magnitude of performance as covering indexes, but for a broader range of queries.

A limitation of clustering keys

A clustering key cannot be defined to be unique.