Hash-based workarounds for MySQL unique constraint limitations

In the past I have written about MySQL's limitations that prevent you from creating an index over a certain length, or with more than a certain number of columns. I've offered some solutions for those problems, but many of my solutions hit another limit further down the line. For example innodb_large_prefix allows you to create an index on larger columns, but only up to 3072 bytes, and my workarounds for the 16 column limit per index will also hit that 3072 byte limit once you add lots of columns or add some very long columns.

Today I'm going to suggest a hash-based solution to bypass that 3072 byte limit to solve a specific subset of index use cases. The basic idea is to build a string by concatenating all of the columns we want to include in the index, and create a unique index on the hash of that string, rather than the string itself. The hash will be a fixed length regardless of the input columns, so we don't need to worry about the 3072 byte limit or the 16 column limit.

For example instead of building a unique index on this:

```
concat(c1,'-',c2,'-',c3,'-',c4,'-',c5,'-',c6,'-',c7,'-',c8,'-',c9,'-',c10,'-',
  c11,'-',c12,'-',c13,'-',c14,'-',c15,'-',c16,'-',c17,'-',c18,'-',c19,'-',c20)
```

I can build a unique index on this:

```
unhex(sha2(concat(c1,'-',c2,'-',c3,'-',c4,'-',c5,'-',c6,'-',c7,'-',c8,'-',c9,'-',c10,'-',
  c11,'-',c12,'-',c13,'-',c14,'-',c15,'-',c16,'-',c17,'-',c18,'-',c19,'-',c20),256))
```

An index on a hash value is not useful for most queries, since it can't speed up range scans or queries using an index prefix, but it will work for unique constraints, and you can also use it for single row lookups if you apply the same hash function to the values you are searching for.

Hat tip to Justin Swanhart for giving me the idea to use hashes for this.

Read on for details and code examples.

MySQL has several built-in hash functions that I could use for this. They vary in output length and in the likelihood of hash collisions (two different inputs producing the same output). These functions generally return hex strings, but it's more compact to store the values as binary strings. A few built-in hash functions I could use:

* CRC32(): a 4-byte checksum, returned as an unsigned integer; far too collision-prone for this purpose
* MD5(): a 16-byte (128-bit) hash, returned as a 32-character hex string
* SHA1(): a 20-byte (160-bit) hash, returned as a 40-character hex string
* SHA2(): a 28, 32, 48, or 64 byte hash (SHA-224, SHA-256, SHA-384, or SHA-512), returned as a hex string

For these examples I will use SHA2. It is less prone to collisions than the others, and supports multiple output lengths of 28, 32, 48, or 64 bytes. My examples will use 32 byte SHA2 hash values stored in a BINARY(32) column.
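As a quick sanity check (my own example, not part of the original tests), the 256-bit variant of SHA2 returns a 64-character hex string, which UNHEX packs into exactly 32 bytes:

```
select length(sha2('abc', 256));         -- 64 (hex characters)
select length(unhex(sha2('abc', 256)));  -- 32 (bytes, fits in BINARY(32))
```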

In my previous post about long indexes, I used this example to show a uniqueness constraint that could not be created because it is over 3072 bytes:

```
mysql> create table if not exists long_index_test (
    ->   day date not null,
    ->   product_id int not null,
    ->   dimension1 varchar(1500) not null,
    ->   dimension2 varchar(1500) not null,
    ->   dimension3 varchar(1500) not null
    -> ) ENGINE=InnoDB DEFAULT CHARSET=latin1 ROW_FORMAT=DYNAMIC;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table long_index_test
    ->   add unique index unique_index (day, product_id, dimension1, dimension2, dimension3);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
```

But using a hash column I can create the unique constraint:

```
mysql> alter table long_index_test
    ->   add column composite_hash binary(32) not null default 0;
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> alter table long_index_test
    ->   add unique index unique_index (composite_hash);
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0
```

And then I can populate the composite_hash column using triggers, application code, or as part of a LOAD DATA INFILE statement as previously discussed. In MariaDB or MySQL 5.7 I could make composite_hash a generated/virtual column.
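For example, the trigger-based approach might look like this (a sketch with trigger names of my own invention; both INSERT and UPDATE need a trigger so the hash can never drift out of sync with the underlying columns):

```
delimiter //
create trigger long_index_test_hash_bi before insert on long_index_test
for each row
  set new.composite_hash = unhex(sha2(concat(new.day, '-', new.product_id, '-',
    new.dimension1, '-', new.dimension2, '-', new.dimension3), 256));
//
create trigger long_index_test_hash_bu before update on long_index_test
for each row
  set new.composite_hash = unhex(sha2(concat(new.day, '-', new.product_id, '-',
    new.dimension1, '-', new.dimension2, '-', new.dimension3), 256));
//
delimiter ;
```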

The unique index I created above is only useful for enforcing the unique constraint and for single row lookups. It does not help the range scans that my original composite index served, so I may want to create a separate index for those:

```
mysql> alter table long_index_test
    ->   add index search_index (day, product_id);
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0
```
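A single row lookup through the hash index works by applying the same hash expression to the values being searched for (the values below are made up for illustration):

```
select * from long_index_test
 where composite_hash = unhex(sha2(concat('2015-06-01', '-', 42, '-',
   'red', '-', 'large', '-', 'cotton'), 256));
```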

In MySQL 5.7 I can use the hash-based workaround with a generated column. Consider this table:

```
CREATE TABLE IF NOT EXISTS lots_of_longer_columns (
  c1 int not null,
  c2 int not null,
  c3 int not null,
  c4 int not null,
  c5 int not null,
  c6 int not null,
  c7 int not null,
  c8 int not null,
  c9 int not null,
  c10 int not null,
  c11 varchar(255) not null,
  c12 varchar(255) not null,
  c13 varchar(255) not null,
  c14 varchar(255) not null,
  c15 varchar(255) not null,
  c16 varchar(255) not null,
  c17 varchar(255) not null,
  c18 varchar(255) not null,
  c19 varchar(255) not null,
  c20 varchar(255) not null
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
```

If I try to create a unique index on all 20 columns in that table using my previous method I'll hit the 3072 byte limit:

```
mysql> alter table lots_of_longer_columns
    ->   add column all_columns varchar(2669) generated always as (
    ->     concat(c1,'-',c2,'-',c3,'-',c4,'-',c5,'-',c6,'-',c7,'-',c8,'-',c9,'-',c10,'-',
    ->       c11,'-',c12,'-',c13,'-',c14,'-',c15,'-',c16,'-',c17,'-',c18,'-',c19,'-',c20)
    ->   ) stored not null,
    ->   add unique index unique_constraint (all_columns);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
```

But if I use a hash-based column instead I can create the unique constraint:

```
mysql> alter table lots_of_longer_columns
    ->   add column all_columns_hash binary(32) generated always as (
    ->     unhex(sha2(concat(c1,'-',c2,'-',c3,'-',c4,'-',c5,'-',c6,'-',c7,'-',c8,'-',c9,'-',c10,'-',
    ->       c11,'-',c12,'-',c13,'-',c14,'-',c15,'-',c16,'-',c17,'-',c18,'-',c19,'-',c20),256))
    ->   ) stored not null,
    ->   add unique index unique_constraint (all_columns_hash);
Query OK, 0 rows affected (0.05 sec)
```

Does it work?

```
mysql> insert into lots_of_longer_columns (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20)
    -> values (1,2,3,4,5,6,7,8,9,10,'a','b','c','d','e','f','g','h','i','j');
Query OK, 1 row affected (0.01 sec)

mysql> insert into lots_of_longer_columns (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20)
    -> values (1,2,3,4,5,6,7,8,9,10,'a','b','c','d','e','f','g','h','i','j');
ERROR 1062 (23000): Duplicate entry '\x98\xB4\xF0\x0Ap\xCD\x9E\x06F\x7F\x98-\xF56\x04\xA6\xB9\xD3is\x' for key 'unique_constraint'

mysql> insert into lots_of_longer_columns (c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,c11,c12,c13,c14,c15,c16,c17,c18,c19,c20)
    -> values (1000,2,3,4,5,6,7,8,9,10,'a','b','c','d','e','f','g','h','i','j');
Query OK, 1 row affected (0.00 sec)

mysql> update lots_of_longer_columns set c1 = 1 where c1 = 1000;
ERROR 1062 (23000): Duplicate entry '\x98\xB4\xF0\x0Ap\xCD\x9E\x06F\x7F\x98-\xF56\x04\xA6\xB9\xD3is\x' for key 'unique_constraint'
```

Yes! The duplicate entries are no longer human readable, but the unique constraint was enforced successfully.
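When a duplicate key error does occur, HEX() makes the hash values readable again, for example to compare them against the bytes in the error message (my own follow-up query, not part of the original test):

```
select c1, c11, hex(all_columns_hash)
  from lots_of_longer_columns;
```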