The InnoDB full-text search capability is an exciting feature. The full-text search itself is generally useful to have in an RDBMS. If an application is using all InnoDB tables except for one that is used for full-text searches, now that last table can be switched to InnoDB. If putting the full-text data in a MyISAM table led to scalability problems, duplication, or a less-than-ideal schema design, now those issues can be addressed.
In this post, I’ll take you through some of the basics of setting
up and querying an InnoDB FULLTEXT
search index.
I’ll leave the scalability and performance aspects to Jimmy’s and Vinay’s blog posts, and just use some toy-sized
data for demonstration purposes.
Creating a Table with a Full-Text Search Index
The key component of this feature is an index of type
FULLTEXT
, applied to one or more columns of an
InnoDB table.
In Jimmy’s post, he mentions some scalability
considerations where you might create the table (including a
special FTS_DOC_ID
column), load the data, then
create the FULLTEXT
index afterward. For simplicity
(and since the data volume is so small), I’ll create the table
with the index in place, then load the data afterward.
use test; -- We will do some commits and rollbacks to demonstrate transactional features. -- So turn off the default setting that commits immediately after each statement. set autocommit=0; drop table if exists quotes; -- The InnoDB full-text search feature in the 5.6 Labs release -- lets us define the FULLTEXT index on an InnoDB table. -- The ENGINE=INNODB clause is just to illustrate the InnoDB aspect. -- In 5.5 and above, new tables use InnoDB by default. create table quotes ( id int unsigned auto_increment primary key , author varchar(64) , quote varchar(4000) , source varchar(64) , fulltext(quote) ) engine=innodb; -- Get some words and phrases to search for into the table. insert into quotes (author, quote, source) values ('Abraham Lincoln', 'Fourscore and seven years ago...', 'Gettysburg Address') , ('George Harrison', 'All those years ago...', 'Live In Japan') , ('Arthur C. Clarke', 'Then 10 years ago the monolith was discovered.', '2010: The Year We Make Contact') , ('Benjamin Franklin', 'Early to bed and early to rise, makes a man healthy, wealthy, and wise.', 'Poor Richard''s Almanack') , ('James Thurber', 'Early to rise and early to bed makes a male healthy and wealthy and dead.', 'The New Yorker') , ('K', '1500 hundred years ago, everybody knew that the Earth was the center of the universe.', 'Men in Black') ; -- Since this is an InnoDB table, we are mindful of transactions. commit;
Word and Phrase Search – Natural Language Mode
Once the data is loaded and committed, you can run queries
using the MATCH(columns) AGAINST (search
expression)
operator to do the actual searches. You
can combine this operator with all the usual WHERE
and similar clauses in the SELECT
statement.
The simplest kind of search is to find a single word, or a phrase
with all words in exact order. For this type of search, use the
IN NATURAL LANGUAGE
clause inside the
AGAINST()
call. This technique typically involves a
user-entered string that you pass verbatim to the query (of
course, after escaping any quotation marks or other special
characters to prevent SQL injection attacks).
-- Search for a single word. select author as "Monolith" from quotes where match(quote) against ('monolith' in natural language mode); +------------------+ | Monolith | +------------------+ | Arthur C. Clarke | +------------------+ 1 row in set (0.01 sec) -- Unlike with MyISAM full-text search, the default minimum -- length is 3 rather than 4, and the search returns words that -- appear in a high proportion of the table rows. select author as "Ago" from quotes where match(quote) against ('ago' in natural language mode); +------------------+ | Ago | +------------------+ | Abraham Lincoln | | George Harrison | | Arthur C. Clarke | | K | +------------------+ 4 rows in set (0.00 sec)
AND / OR / NOT Operators – Boolean Mode
For more complicated searches, you can have multiple words and phrases and search for different combinations of optional and required terms, not necessarily in the same order. This technique typically involves several data values that you query from elsewhere, or splitting apart a user-entered string and applying your own rules to the words and phrases inside.
-- Search for a combination of words, not in the same order as the original. select author as "Ago and Years" from quotes where match(quote) against ('+ago +years' in boolean mode); +------------------+ | Ago and Years | +------------------+ | Abraham Lincoln | | George Harrison | | Arthur C. Clarke | | K | +------------------+ 4 rows in set (0.00 sec) -- Search for other Boolean combinations of words. select author as "Fourscore or Monolith" from quotes where match(quote) against ('fourscore monolith' in boolean mode); +-----------------------+ | Fourscore or Monolith | +-----------------------+ | Abraham Lincoln | | Arthur C. Clarke | +-----------------------+ 2 rows in set (0.00 sec) select author as "Years and not Monolith" from quotes where match(quote) against ('+years -monolith' in boolean mode); +------------------------+ | Years and not Monolith | +------------------------+ | Abraham Lincoln | | George Harrison | | K | +------------------------+ 3 rows in set (0.00 sec)
Proximity Search
Proximity search is new feature for InnoDB full-text search. It
is a special case of Boolean search using the @
operator within the AGAINST()
string. You supply 2
or more words, double-quoted, within the single-quoted
AGAINST()
string, followed by
@distance
to specify how far apart these
words can be. The distance represents the maximum number of bytes
(which might not be equal to the number of characters) between
the starting points of all these words.
-- The starting points for these words are too far apart -- (not within 20 bytes), so no results. select quote as "Too Far Apart" from quotes where match(quote) against ('"early wise" @20' in boolean mode); Empty set (0.00 sec) -- But the starting points of all words are within 100 bytes, -- so this query does give results. select quote as "Early...Wise" from quotes where match(quote) against ('"early wise" @100' in boolean mode); +-------------------------------------------------------------------------+ | Early...Wise | +-------------------------------------------------------------------------+ | Early to bed and early to rise, makes a man healthy, wealthy, and wise. | +-------------------------------------------------------------------------+ 1 row in set (0.00 sec) -- In this case, the smallest distance that produces results is 49. select quote as "Early...Wise" from quotes where match(quote) against ('"early wise" @49' in boolean mode); +-------------------------------------------------------------------------+ | Early...Wise | +-------------------------------------------------------------------------+ | Early to bed and early to rise, makes a man healthy, wealthy, and wise. | +-------------------------------------------------------------------------+ 1 row in set (0.00 sec) -- Here is an example showing 2 results, with the words close to each other. select quote as "Early...Bed" from quotes where match(quote) against ('"early bed" @20' in boolean mode); +---------------------------------------------------------------------------+ | Early...Bed | +---------------------------------------------------------------------------+ | Early to bed and early to rise, makes a man healthy, wealthy, and wise. | | Early to rise and early to bed makes a male healthy and wealthy and dead. | +---------------------------------------------------------------------------+ 2 rows in set (0.00 sec)
Relevance Ranking
The relevance ranking is fairly basic, derived from word
frequencies within each document and the search data overall.
Typically, you would only ORDER BY
this value for
very simplistic searches of small documents; for any important
search you would layer your own ranking logic on top, perhaps
with the MySQL relevance value as one factor in the overall rank.
-- Get the relevance of a single word. select substr(quote,1,20) as "And", match(quote) against ('and' in natural language mode) as Relevance from quotes order by Relevance desc; +----------------------+--------------------+ | And | Relevance | +----------------------+--------------------+ | Early to rise and ea | 0.2718571722507477 | | Early to bed and ear | 0.1812381148338318 | | Fourscore and seven | 0.0906190574169159 | | All those years ago. | 0 | | Then 10 years ago th | 0 | | 1500 hundred years a | 0 | +----------------------+--------------------+ 6 rows in set (0.00 sec)
Transactions
The key idea behind bringing full-text search to InnoDB tables is to make this feature compatible with transactions, so that you can include full-text columns alongside other columns in tables in ways that make sense in terms of schema design, and multiple sessions can update the full-text column data (and/or other columns in the table) simultaneously. The full-text data doesn’t have to be treated as read-only or read-mostly.
As mentioned in Jimmy’s blog post, the table structures that
manipulate the full-text data behind the scenes are only updated
at COMMIT
time. So make sure to insert or update
full-text data in one transaction, commit, and then run any
full-text queries in a subsequent transaction. (Actually, in the
examples below, it looks like the data is taken out of the
full-text results as soon as a DELETE
is issued,
then comes back if the deletion is rolled back. I think that is
explained in Jimmy’s blog post by the discussion about the
delete-marking optimization to avoid huge updates to the
full-text index for deleted data.)
drop table if exists quotes_uncommitted; create table quotes_uncommitted ( author varchar(64) , quote varchar(4000) , source varchar(64) , fulltext(quote) , primary key (author, quote(128)) ); -- We insert but don't immediately commit. insert into quotes_uncommitted select author, quote, source from quotes; -- Within the same transaction, a full-text search does not see the uncommitted data. select count(author), author as "Uncommitted Results" from quotes_uncommitted where match(quote) against ('ago' in natural language mode); +---------------+---------------------+ | count(author) | Uncommitted Results | +---------------+---------------------+ | 0 | NULL | +---------------+---------------------+ 1 row in set (0.00 sec) -- If the newly inserted rows are rolled back... rollback; -- ...then the full-text search still doesn't see them. select count(author), author as "Rolled-Back Results" from quotes_uncommitted where match(quote) against ('ago' in natural language mode); +---------------+---------------------+ | count(author) | Rolled-Back Results | +---------------+---------------------+ | 0 | NULL | +---------------+---------------------+ 1 row in set (0.00 sec) -- OK, let's start with some committed data in the table, then empty the table, -- and then try some FTS queries -- both before and after the commit. insert into quotes_uncommitted select author, quote, source from quotes; commit; delete from quotes_uncommitted; select count(author), author as "Deleted but still not committed" from quotes_uncommitted where match(quote) against ('ago' in natural language mode); +---------------+---------------------------------+ | count(author) | Deleted but still not committed | +---------------+---------------------------------+ | 0 | NULL | +---------------+---------------------------------+ 1 row in set (0.00 sec) rollback; select count(author), author as "Deleted and rolled back" from quotes_uncommitted where match(quote) against ('ago' in natural language mode); +---------------+-------------------------+ | count(author) | Deleted and rolled back | +---------------+-------------------------+ | 4 | Abraham Lincoln | +---------------+-------------------------+ 1 row in set (0.00 sec) delete from quotes_uncommitted; commit; select count(author), author as "Deleted and committed" from quotes_uncommitted where match(quote) against ('ago' in natural language mode); +---------------+-----------------------+ | count(author) | Deleted and committed | +---------------+-----------------------+ | 0 | NULL | +---------------+-----------------------+ 1 row in set (0.00 sec) insert into quotes_uncommitted select author, quote, source from quotes; commit; truncate table quotes_uncommitted; select count(author), author as "Truncated" from quotes_uncommitted where match(quote) against ('ago' in natural language mode); +---------------+-----------+ | count(author) | Truncated | +---------------+-----------+ | 0 | NULL | +---------------+-----------+ 1 row in set (0.00 sec)
Multi-Column Searches
Although you can only have one FULLTEXT
index in an
InnoDB table, that index can apply to multiple columns, allowing
you to search when you aren’t sure which column contains the
term. With a multi-column index, we can MATCH()
against all the columns to find words that appear in any of those
columns. Always reference all the same columns in the
MATCH()
clause as in the FULLTEXT
index
definition, because the information about which column the words
appear in is not included in the full-text search data.
drop table if exists quotes_multi_col; create table quotes_multi_col ( id int unsigned auto_increment primary key , author varchar(64) , quote varchar(4000) , source varchar(64) , fulltext(author, quote, source) ); insert into quotes_multi_col select * from quotes; commit; select author as "Poor 1 (NL)", substr(quote,1,15) as "Poor 2 (NL)", source as "Poor 3 (NL)" from quotes_multi_col where match(author, quote, source) against ('poor' in natural language mode); +-------------------+-----------------+-------------------------+ | Poor 1 (NL) | Poor 2 (NL) | Poor 3 (NL) | +-------------------+-----------------+-------------------------+ | Benjamin Franklin | Early to bed an | Poor Richard's Almanack | +-------------------+-----------------+-------------------------+ 1 row in set (0.00 sec) select author as "Poor 1 (BOOL)", substr(quote,1,15) as "Poor 2 (BOOL)", source as "Poor 3 (BOOL)" from quotes_multi_col where match(author, quote, source) against ('poor' in boolean mode); +-------------------+-----------------+-------------------------+ | Poor 1 (BOOL) | Poor 2 (BOOL) | Poor 3 (BOOL) | +-------------------+-----------------+-------------------------+ | Benjamin Franklin | Early to bed an | Poor Richard's Almanack | +-------------------+-----------------+-------------------------+ 1 row in set (0.00 sec) select author as "Clarke 1 (NL)", substr(quote,1,15) as "Clarke 2 (NL)", source as "Clarke 3 (NL)" from quotes_multi_col where match(author, quote, source) against ('clarke' in natural language mode); +------------------+-----------------+--------------------------------+ | Clarke 1 (NL) | Clarke 2 (NL) | Clarke 3 (NL) | +------------------+-----------------+--------------------------------+ | Arthur C. Clarke | Then 10 years a | 2010: The Year We Make Contact | +------------------+-----------------+--------------------------------+ 1 row in set (0.00 sec)
Interaction with Other Indexes
Remember that the design of your primary key index and secondary indexes is a big factor in query performance for InnoDB tables.
- You can include parts (prefixes) of the full-text column(s) within the primary key.
- However, that might not be a good idea if (a) the associated columns will ever be updated — which causes an expensive reorganization within the InnoDB table, or (b) if the table will have any other secondary indexes — the primary key values for a row are duplicated in the entry for that row in every secondary index, making index operations require more I/O and memory.
- As mentioned in Jimmy’s blog post, adding the
FULLTEXT
index to the table is going to create a new column and associated index in the original table, so you could set up the column and index ahead of time, to avoid table reorganization later. - You can use the unique constraint of the primary key or a
UNIQUE
index to prevent duplicate values or combinations of values from being entered. - You can use the not-null constraint of the primary key to prevent blank values or combinations of values from being entered.
- For the Labs release, the InnoDB
FULLTEXT
processing isn’t integrated with the MySQL optimizer and its estimates for which index is best to use, so don’t draw conclusions about performance characteristics from this early preview.
Stopwords
Stopwords are typically short, commonly used words that you
designate as not significant for a search. They are left out of
the FULLTEXT
index and ignored when entered in
FULLTEXT
queries. For example, a search for ‘the’ is
unsuccessful because it’s in the default stopword list. For your
own customized search, you might create a bigger list (say, with
common words from several languages) or a smaller one (for
example, a music or movie site where words such as “The” in names
and titles are significant). The details about customizing the
stopword list are in Jimmy’s blog post.
select count(*), author as "Stopword 1", quote as "Stopword 2", source as "Stopword 3" from quotes_multi_col where match(author, quote, source) against ('the' in natural language mode); +----------+------------+------------+------------+ | count(*) | Stopword 1 | Stopword 2 | Stopword 3 | +----------+------------+------------+------------+ | 0 | NULL | NULL | NULL | +----------+------------+------------+------------+