InnoDB Full-Text Search Tutorial

The InnoDB full-text search capability is an exciting feature. The full-text search itself is generally useful to have in an RDBMS. If an application is using all InnoDB tables except for one that is used for full-text searches, now that last table can be switched to InnoDB. If putting the full-text data in a MyISAM table led to scalability problems, duplication, or a less-than-ideal schema design, now those issues can be addressed.

In this post, I’ll take you through some of the basics of setting up and querying an InnoDB FULLTEXT search index. I’ll leave the scalability and performance aspects to Jimmy’s and Vinay’s blog posts, and just use some toy-sized data for demonstration purposes.

Creating a Table with a Full-Text Search Index

The key component of this feature is an index of type FULLTEXT, applied to one or more columns of an InnoDB table.

In Jimmy’s post, he mentions some scalability considerations where you might create the table (including a special FTS_DOC_ID column), load the data, then create the FULLTEXT index afterward. For simplicity (and since the data volume is so small), I’ll create the table with the index in place, then load the data afterward.

use test;
-- We will do some commits and rollbacks to demonstrate transactional features.
-- So turn off the default setting that commits immediately after each statement.
set autocommit=0;

drop table if exists quotes;
-- The InnoDB full-text search feature in the 5.6 Labs release
-- lets us define the FULLTEXT index on an InnoDB table.
-- The ENGINE=INNODB clause is just to illustrate the InnoDB aspect.
-- In 5.5 and above, new tables use InnoDB by default.

create table quotes
  (    id int unsigned auto_increment primary key
    , author varchar(64)    , quote varchar(4000)
    , source varchar(64)
    , fulltext(quote)
  ) engine=innodb;

-- Get some words and phrases to search for into the table.
insert into quotes (author, quote, source) values
  ('Abraham Lincoln', 'Fourscore and seven years ago...',
  'Gettysburg Address')
, ('George Harrison', 'All those years ago...',
  'Live In Japan')
, ('Arthur C. Clarke', 'Then 10 years ago the monolith was discovered.',
  '2010: The Year We Make Contact')
, ('Benjamin Franklin',
  'Early to bed and early to rise, makes a man healthy, wealthy, and wise.',
  'Poor Richard''s Almanack')
, ('James Thurber',
  'Early to rise and early to bed makes a male healthy and wealthy and dead.',
  'The New Yorker')
, ('K', '1500 hundred years ago, everybody knew that the Earth was the center of the universe.',
  'Men in Black')
;

-- Since this is an InnoDB table, we are mindful of transactions.
commit;

Word and Phrase Search – Natural Language Mode

Once the data is loaded and committed, you can run queries using the MATCH(columns) AGAINST (search expression) operator to do the actual searches. You can combine this operator with all the usual WHERE and similar clauses in the SELECT statement.

The simplest kind of search is to find a single word, or a phrase with all words in exact order. For this type of search, use the IN NATURAL LANGUAGE clause inside the AGAINST() call. This technique typically involves a user-entered string that you pass verbatim to the query (of course, after escaping any quotation marks or other special characters to prevent SQL injection attacks).

-- Search for a single word.
select author as "Monolith" from quotes
  where match(quote) against ('monolith' in natural language mode);
+------------------+
| Monolith         |
+------------------+
| Arthur C. Clarke |
+------------------+
1 row in set (0.01 sec)

-- Unlike with MyISAM full-text search, the default minimum
-- length is 3 rather than 4, and the search returns words that
-- appear in a high proportion of the table rows.
select author as "Ago" from quotes
  where match(quote) against ('ago' in natural language mode);
+------------------+
| Ago              |
+------------------+
| Abraham Lincoln  |
| George Harrison  |
| Arthur C. Clarke |
| K                |
+------------------+
4 rows in set (0.00 sec)

AND / OR / NOT Operators – Boolean Mode

For more complicated searches, you can have multiple words and phrases and search for different combinations of optional and required terms, not necessarily in the same order. This technique typically involves several data values that you query from elsewhere, or splitting apart a user-entered string and applying your own rules to the words and phrases inside.

-- Search for a combination of words, not in the same order as the original.
select author as "Ago and Years" from quotes
  where match(quote) against ('+ago +years' in boolean mode);
+------------------+
| Ago and Years    |
+------------------+
| Abraham Lincoln  |
| George Harrison  |
| Arthur C. Clarke |
| K                |
+------------------+
4 rows in set (0.00 sec)

-- Search for other Boolean combinations of words.
select author as "Fourscore or Monolith" from quotes
  where match(quote) against ('fourscore monolith' in boolean mode);
+-----------------------+
| Fourscore or Monolith |
+-----------------------+
| Abraham Lincoln       |
| Arthur C. Clarke      |
+-----------------------+
2 rows in set (0.00 sec)

select author as "Years and not Monolith" from quotes
  where match(quote) against ('+years -monolith' in boolean mode);
+------------------------+
| Years and not Monolith |
+------------------------+
| Abraham Lincoln        |
| George Harrison        |
| K                      |
+------------------------+
3 rows in set (0.00 sec)

Proximity Search

Proximity search is new feature for InnoDB full-text search. It is a special case of Boolean search using the @ operator within the AGAINST() string. You supply 2 or more words, double-quoted, within the single-quoted AGAINST() string, followed by @distance to specify how far apart these words can be. The distance represents the maximum number of bytes (which might not be equal to the number of characters) between the starting points of all these words.

-- The starting points for these words are too far apart
-- (not within 20 bytes), so no results.
select quote as "Too Far Apart" from quotes
  where match(quote) against ('"early wise" @20' in boolean mode);
Empty set (0.00 sec)

-- But the starting points of all words are within 100 bytes,
-- so this query does give results.
select quote as "Early...Wise" from quotes
  where match(quote) against ('"early wise" @100' in boolean mode);
+-------------------------------------------------------------------------+
| Early...Wise                                                            |
+-------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise. |
+-------------------------------------------------------------------------+
1 row in set (0.00 sec)

-- In this case, the smallest distance that produces results is 49.
select quote as "Early...Wise" from quotes
  where match(quote) against ('"early wise" @49' in boolean mode);
+-------------------------------------------------------------------------+
| Early...Wise                                                            |
+-------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise. |
+-------------------------------------------------------------------------+
1 row in set (0.00 sec)

-- Here is an example showing 2 results, with the words close to each other.
select quote as "Early...Bed" from quotes
  where match(quote) against ('"early bed" @20' in boolean mode);
+---------------------------------------------------------------------------+
| Early...Bed                                                               |
+---------------------------------------------------------------------------+
| Early to bed and early to rise, makes a man healthy, wealthy, and wise.   |
| Early to rise and early to bed makes a male healthy and wealthy and dead. |
+---------------------------------------------------------------------------+
2 rows in set (0.00 sec)

Relevance Ranking

The relevance ranking is fairly basic, derived from word frequencies within each document and the search data overall. Typically, you would only ORDER BY this value for very simplistic searches of small documents; for any important search you would layer your own ranking logic on top, perhaps with the MySQL relevance value as one factor in the overall rank.

-- Get the relevance of a single word.
select substr(quote,1,20) as "And",
  match(quote) against ('and' in natural language mode) as Relevance
  from quotes order by Relevance desc;
+----------------------+--------------------+
| And                  | Relevance          |
+----------------------+--------------------+
| Early to rise and ea | 0.2718571722507477 |
| Early to bed and ear | 0.1812381148338318 |
| Fourscore and seven  | 0.0906190574169159 |
| All those years ago. |                  0 |
| Then 10 years ago th |                  0 |
| 1500 hundred years a |                  0 |
+----------------------+--------------------+

6 rows in set (0.00 sec)

Transactions

The key idea behind bringing full-text search to InnoDB tables is to make this feature compatible with transactions, so that you can include full-text columns alongside other columns in tables in ways that make sense in terms of schema design, and multiple sessions can update the full-text column data (and/or other columns in the table) simultaneously. The full-text data doesn’t have to be treated as read-only or read-mostly.

As mentioned in Jimmy’s blog post, the table structures that manipulate the full-text data behind the scenes are only updated at COMMIT time. So make sure to insert or update full-text data in one transaction, commit, and then run any full-text queries in a subsequent transaction. (Actually, in the examples below, it looks like the data is taken out of the full-text results as soon as a DELETE is issued, then comes back if the deletion is rolled back. I think that is explained in Jimmy’s blog post by the discussion about the delete-marking optimization to avoid huge updates to the full-text index for deleted data.)

drop table if exists quotes_uncommitted;

create table quotes_uncommitted
  (
      author varchar(64)
    , quote varchar(4000)
    , source varchar(64)
    , fulltext(quote)
    , primary key (author, quote(128))
  );

-- We insert but don't immediately commit.
insert into quotes_uncommitted select author, quote, source from quotes;
-- Within the same transaction, a full-text search does not see the uncommitted data.
select count(author), author as "Uncommitted Results" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+---------------------+
| count(author) | Uncommitted Results |
+---------------+---------------------+
|             0 | NULL                |
+---------------+---------------------+
1 row in set (0.00 sec)

-- If the newly inserted rows are rolled back...
rollback;
-- ...then the full-text search still doesn't see them.
select count(author), author as "Rolled-Back Results" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+---------------------+
| count(author) | Rolled-Back Results |
+---------------+---------------------+
|             0 | NULL                |
+---------------+---------------------+
1 row in set (0.00 sec)

-- OK, let's start with some committed data in the table, then empty the table,
-- and then try some FTS queries
-- both before and after the commit.
insert into quotes_uncommitted select author, quote, source from quotes;
commit;
delete from quotes_uncommitted;
select count(author), author as "Deleted but still not committed" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+---------------------------------+
| count(author) | Deleted but still not committed |
+---------------+---------------------------------+
|             0 | NULL                            |
+---------------+---------------------------------+
1 row in set (0.00 sec)

rollback;
select count(author), author as "Deleted and rolled back" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+-------------------------+
| count(author) | Deleted and rolled back |
+---------------+-------------------------+
|             4 | Abraham Lincoln         |
+---------------+-------------------------+
1 row in set (0.00 sec)

delete from quotes_uncommitted;
commit;
select count(author), author as "Deleted and committed" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+-----------------------+
| count(author) | Deleted and committed |
+---------------+-----------------------+
|             0 | NULL                  |
+---------------+-----------------------+
1 row in set (0.00 sec)

insert into quotes_uncommitted select author, quote, source from quotes;
commit;
truncate table quotes_uncommitted;
select count(author), author as "Truncated" from quotes_uncommitted
  where match(quote) against ('ago' in natural language mode);
+---------------+-----------+
| count(author) | Truncated |
+---------------+-----------+
|             0 | NULL      |
+---------------+-----------+
1 row in set (0.00 sec)

Multi-Column Searches

Although you can only have one FULLTEXT index in an InnoDB table, that index can apply to multiple columns, allowing you to search when you aren’t sure which column contains the term. With a multi-column index, we can MATCH() against all the columns to find words that appear in any of those columns. Always reference all the same columns in the MATCH() clause as in the FULLTEXT index definition, because the information about which column the words appear in is not included in the full-text search data.

drop table if exists quotes_multi_col;

create table quotes_multi_col
  (
    id int unsigned auto_increment primary key
    , author varchar(64)
    , quote varchar(4000)
    , source varchar(64)
    , fulltext(author, quote, source)
  );

insert into quotes_multi_col select * from quotes;
commit;

select author as "Poor 1 (NL)", substr(quote,1,15) as "Poor 2 (NL)", source as "Poor 3 (NL)" from
  quotes_multi_col where match(author, quote, source)
  against ('poor' in natural language mode);
+-------------------+-----------------+-------------------------+
| Poor 1 (NL)       | Poor 2 (NL)     | Poor 3 (NL)             |
+-------------------+-----------------+-------------------------+
| Benjamin Franklin | Early to bed an | Poor Richard's Almanack |
+-------------------+-----------------+-------------------------+
1 row in set (0.00 sec)

select author as "Poor 1 (BOOL)", substr(quote,1,15) as "Poor 2 (BOOL)", source as "Poor 3 (BOOL)"
  from quotes_multi_col where match(author, quote, source)
  against ('poor' in boolean mode);
+-------------------+-----------------+-------------------------+
| Poor 1 (BOOL)     | Poor 2 (BOOL)   | Poor 3 (BOOL)           |
+-------------------+-----------------+-------------------------+
| Benjamin Franklin | Early to bed an | Poor Richard's Almanack |
+-------------------+-----------------+-------------------------+
1 row in set (0.00 sec)

select author as "Clarke 1 (NL)", substr(quote,1,15) as "Clarke 2 (NL)", source as "Clarke 3 (NL)"
  from quotes_multi_col where match(author, quote, source)
  against ('clarke' in natural language mode);
+------------------+-----------------+--------------------------------+
| Clarke 1 (NL)    | Clarke 2 (NL)   | Clarke 3 (NL)                  |
+------------------+-----------------+--------------------------------+
| Arthur C. Clarke | Then 10 years a | 2010: The Year We Make Contact |
+------------------+-----------------+--------------------------------+
1 row in set (0.00 sec)

Interaction with Other Indexes

Remember that the design of your primary key index and secondary indexes is a big factor in query performance for InnoDB tables.

  • You can include parts (prefixes) of the full-text column(s) within the primary key.
  • However, that might not be a good idea if (a) the associated columns will ever be updated — which causes an expensive reorganization within the InnoDB table, or (b) if the table will have any other secondary indexes — the primary key values for a row are duplicated in the entry for that row in every secondary index, making index operations require more I/O and memory.
  • As mentioned in Jimmy’s blog post, adding the FULLTEXT index to the table is going to create a new column and associated index in the original table, so you could set up the column and index ahead of time, to avoid table reorganization later.
  • You can use the unique constraint of the primary key or a UNIQUE index to prevent duplicate values or combinations of values from being entered.
  • You can use the not-null constraint of the primary key to prevent blank values or combinations of values from being entered.
  • For the Labs release, the InnoDB FULLTEXT processing isn’t integrated with the MySQL optimizer and its estimates for which index is best to use, so don’t draw conclusions about performance characteristics from this early preview.

Stopwords

Stopwords are typically short, commonly used words that you designate as not significant for a search. They are left out of the FULLTEXT index and ignored when entered in FULLTEXT queries. For example, a search for ‘the’ is unsuccessful because it’s in the default stopword list. For your own customized search, you might create a bigger list (say, with common words from several languages) or a smaller one (for example, a music or movie site where words such as “The” in names and titles are significant). The details about customizing the stopword list are in Jimmy’s blog post.

select count(*), author as "Stopword 1", quote as "Stopword 2", source as "Stopword 3"
  from quotes_multi_col
  where match(author, quote, source) against ('the' in natural language mode);
+----------+------------+------------+------------+
| count(*) | Stopword 1 | Stopword 2 | Stopword 3 |
+----------+------------+------------+------------+
|        0 | NULL       | NULL       | NULL       |
+----------+------------+------------+------------+