Planet MySQL :: Planet MySQL - Archives - InnoDB Full-text Search in MySQL 5.6: Part 2, The ...

InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries!

This is part 2 in a 3 part series. In part 1, we took a quick look at some initial configuration of InnoDB full-text search and discovered a little bit of quirky behavior; here, we are going to run some queries and compare the result sets. Our hope is that the one of two things will happen; either the results returned from a MyISAM FTS query will be exactly identical to the same query when performed against InnoDB data, OR that the results returned by InnoDB FTS will somehow be “better” (as much as it’s actually possible to do this in a single blog post) than what MyISAM gives us.

Recall that we have two different sets of data, one which is the text of roughly 8000 SEO-stuffed webpage bodies (we call that one SEO) and the other, which we call DIR, that is roughly 800,000 directory records with name, address, and the like. We are using MySQL 5.5.30 and MySQL 5.6.10 with no configuration tuning other than to set innodb_ft_min_token_size to 4 (rather than the default of 3) so that it matches MyISAM’s default ft_min_word_length.

First, MyISAM, with MySQL 5.5, on the SEO data set:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('arizona business records'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_myisam ORDER BY 3 DESC LIMIT 5;
+------+-----------------------------------------------------------------------+--------------------+
| id   | title                                                                 | score              |
+------+-----------------------------------------------------------------------+--------------------+
|  458 | Free Business Forms for Starting or Maintaining a Successful Business | 1.3383517265319824 |
| 7112 | Download Idaho Tax Forms for Individual or Business Needs             | 0.9273209571838379 |
| 7113 | Illinois Tax Forms for Individuals and Business                       | 0.8827990889549255 |
| 7121 | Massachusetts Tax Forms                                               | 0.8678107261657715 |
|  454 | Business Accounting and Invoice Forms                                 | 0.8668115139007568 |
+------+-----------------------------------------------------------------------+--------------------+

The same query, run against InnoDB on 5.6.10:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('arizona business records'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_innodb ORDER BY 3 DESC LIMIT 5;
+----+------------------------------------------------------------------------+-------------------+
| id | title                                                                  | score             |
+----+------------------------------------------------------------------------+-------------------+
| 48 | California Free Public Records, Criminal Records And Background Checks | 21.23662567138672 |
| 66 | Michigan Free Public Records, Criminal Records And Background checks   | 5.363710880279541 |
| 44 | Alabama Free Public Records, Criminal Records And Background Checks    | 5.310127258300781 |
| 57 | Illinois Free Public Records, Criminal Records And Background Checks   | 4.569097518920898 |
| 70 | Montana Free Public Records, Criminal Records And Background Checks    | 4.516233444213867 |
+----+------------------------------------------------------------------------+-------------------+

Wow. I’m not sure if I should be concerned so much that the *scores* are different, but the *matches* are COMPLETELY DIFFERENT between 5.5/MyISAM and 5.6/InnoDB. Now, we know that MyISAM FTS does have the caveat with natural language searches whereby a word that’s present in 50% or more of the rows is treated as a stopword, so does that account for our problem? It might, because the word ‘arizona’ appears in over 6900 of the 7150 rows, and the word ‘records’ appears in 7082 of them. So let’s try something else that’s less likely to have that issue. The word “corporation” appears in 143 of the documents; the word “forms” appears in 439 of them, and the word “commission” appears in 130. There might be some overlap here, but even if there isn’t, 143+130+439 < 0.5 * 7150, so none of these should be treated as stopwords in MyISAM.

With 5.5:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('corporation commission forms'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_myisam ORDER BY 3 DESC LIMIT 5;
+------+--------------------+--------------------+
| id   | title              | score              |
+------+--------------------+--------------------+
| 7132 | New York Tax Forms |  7.821961879730225 |
| 7127 | Nebraska Tax Forms |  6.882259845733643 |
| 7123 | Free Forms         |  6.748578071594238 |
| 7126 | Montana Tax Forms  | 6.4749345779418945 |
| 7119 | Maine Tax Forms    |  6.400341510772705 |
+------+--------------------+--------------------+

With 5.6:

mysql: SELECT id, title, MATCH(title, body) AGAINST ('corporation commission forms'
   IN NATURAL LANGUAGE MODE) AS score FROM seo_test_innodb ORDER BY 3 DESC LIMIT 5;
+-----+--------------------------------------------------------------------------+--------------------+
| id  | title                                                                    | score              |
+-----+--------------------------------------------------------------------------+--------------------+
|  79 | Ohio Free Public Records, Criminal Records And Background Checks         |  51.76125717163086 |
|  78 | Free North Dakota Public Records, Criminal Records And Background Checks |  30.32522201538086 |
|  19 | Free Public Printable Forms For All Industries                           | 27.557558059692383 |
| 408 | Free Colorado DMV Resources and Driving Records                          | 24.933029174804688 |
|  48 | California Free Public Records, Criminal Records And Background Checks   | 24.224336624145508 |
+-----+--------------------------------------------------------------------------+--------------------+

OK, now I’m starting to get a little worried. The docs do tell us that the default stopword list is substantially different between InnoDB and MyISAM, and as it turns out, there are only 36 stopwords in the default InnoDB list, but there are 543 stopwords in the default MyISAM list. What happens if we take the MyISAM stopwords, insert them into a table, and configure that table to be our stopword list for InnoDB?

This is the table that we’re trying to emulate:

mysql: SHOW CREATE TABLE information_schema.innodb_ft_default_stopword\G
*************************** 1. row ***************************
       Table: INNODB_FT_DEFAULT_STOPWORD
Create Table: CREATE TEMPORARY TABLE `INNODB_FT_DEFAULT_STOPWORD` (
  `value` varchar(18) NOT NULL DEFAULT ''
) ENGINE=MEMORY DEFAULT CHARSET=utf8

The docs tell us that we need to create an *InnoDB* table with a single VARCHAR column named “value”. OK, sounds easy enough:

mysql: SHOW CREATE TABLE innodb_myisam_stopword\G
*************************** 1. row ***************************
       Table: innodb_myisam_stopword
Create Table: CREATE TABLE `innodb_myisam_stopword` (
  `value` varchar(18) NOT NULL DEFAULT ''
) ENGINE=InnoDB DEFAULT CHARSET=utf8

But, when we try to use this table, here’s what comes back:

mysql: SET GLOBAL innodb_ft_server_stopword_table='test/innodb_myisam_stopword';
ERROR 1231 (42000): Variable 'innodb_ft_server_stopword_table' can't be set to the value of 'test/innodb_myisam_stopword'

And here’s what appeared in the server’s error log:

InnoDB: invalid column type for stopword table test/innodb_myisam_stopword. Its first column must be of varchar type

Uh… Does this mean that my next blog post should be entitled, “When is a VARCHAR Not Really a VARCHAR?” Thinking that maybe this was a case of GEN_CLUST_INDEX causing me issues, I tried adding a second column to the table which was an integer PK, and in another attempt, I tried just making the “value” column the PK, but neither of those worked. Also, trying to set innodb_ft_user_stopword_table produced the same error. I submitted a bug report (68450), and as you can see from the bug discussion, it turns out that this table is character-set-sensitive. If you’re going to use your own stopword table for InnoDB FTS, at least for the moment, this table must use the latin1 character set.

mysql: CREATE TABLE innodb_ft_list2 ( value VARCHAR(18) NOT NULL DEFAULT '') ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.02 sec)
mysql: SET GLOBAL innodb_ft_server_stopword_table='test/innodb_ft_list2';
Query OK, 0 rows affected (0.00 sec)
mysql: INSERT INTO innodb_ft_list2 SELECT * FROM innodb_myisam_stopword;
Query OK, 543 rows affected (0.01 sec)
Records: 543  Duplicates: 0  Warnings: 0

As far as I can tell, this little gotcha doesn’t appear to be mentioned anywhere in the MySQL 5.6 documentation; every place where it talks about creating one of these stopword tables, it simply mentions the table engine and the column name/type, so I’m not sure if this is an intentional restriction that just needs to be better documented or if it’s a limitation with the InnoDB FTS feature that will be removed in a later version.

Now that we’ve sorted this out, let’s drop and rebuild our FT index on the InnoDB table and try the above queries one more time. We already know what the MyISAM results are going to be; do our InnoDB results change? No, they are exactly the same, although the scores did change slightly.

What about a Boolean mode query? The docs tell us that if we use Boolean mode, and we put a “+” in front of our search term, then that term *must* appear in the search results. But does it?

With 5.5:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson +arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 5;
+-------+--------------------------+-------+
| id    | full_name                | score |
+-------+--------------------------+-------+
| 74717 | James R Peterson         |     1 |
|     1 | Harold Wesley Abbott Iii |     0 |
|     3 | William Doyle Abbott Jr  |     0 |
|     5 | Robert Jack Abraham      |     0 |
|     7 | Mark Allen Abrell        |     0 |
+-------+--------------------------+-------+

And with 5.6:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson +arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 5;
+---------+------------------+-------------------+
| id      | full_name        | score             |
+---------+------------------+-------------------+
|   74717 | James R Peterson | 23.63458251953125 |
| 1310720 | Terry Lynn Suter |                 0 |
| 1441792 | Jorge E Morrison |                 0 |
| 1310976 | Oscar Blakemore  |                 0 |
| 1442048 | Donald Simmons   |                 0 |
+---------+------------------+-------------------+

There’s only one row in the table that actually matches all three search terms, and in this case, both MyISAM and InnoDB FTS performed identically and found it. I’m not really concerned about the fact that the next four rows are completely different; the scores are zero, which means “no match.” This looks promising, so let’s explore further. Again, from the docs, if we run a boolean mode query where some of the search terms are prefixed with “+” and others have no prefix, results that have the unprefixed term will be ranked higher than those with out it. So, for example, if we change the above query to be “+james +peterson arizona” then we might expect to get back multiple matches containing the words “James” and “Peterson”, and we should expect the record from Arizona to be towards the top of the list.

With 5.5, this is exactly what happens:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 5;
+--------+------------------------------+--------------------+
| id     | full_name                    | score              |
+--------+------------------------------+--------------------+
|  74717 | James R Peterson             | 1.3333333730697632 |
|  14159 | Christopher Michael James    |                  1 |
|  44427 | James Cyrus Peterson         |                  1 |
|  53501 | James/Rober T Giles/Peterson |                  1 |
| 126373 | Bamish James Peterson        |                  1 |
+--------+------------------------------+--------------------+

With 5.6, we’re not so fortunate.

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('+james +peterson arizona'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 5;
+--------+--------------------------+--------------------+
| id     | full_name                | score              |
+--------+--------------------------+--------------------+
| 741223 | Alphonso Lee Peterson Sr | 59.972904205322266 |
| 925811 | James B Peterson         |  59.37348937988281 |
| 269589 | Michael James Peterson   |  44.82982635498047 |
| 471273 | James Allen Peterson     |  39.58232879638672 |
| 925781 | Anthony James Petersen   |  38.03296661376953 |
+--------+--------------------------+--------------------+

These results aren’t even close to identical. As it turns out, the full record for “Alphonso Lee Peterson Sr” does also contain the name “James”, and the word “Peterson” is listed in there several times, but “Arizona” is not present at all, whereas the record for “James R Peterson” had all three search terms and no significant repetition of any of them. Using this particular query, “James R Peterson” is #15 on the list.

At this point, it’s pretty obvious that the way MyISAM is calculating the scores is much different from the way that InnoDB is doing it, and given what I said earlier about the repetition of words in the “Alphonso Lee Peterson Sr” record versus the “James R Peterson” one, we might argue that InnoDB is actually behaving more correctly than MyISAM. Imagine if we were searching through newspaper articles or something of that sort, and we were looking for queries containing the word “MySQL” – odds are that an article which has 10 instances of “MySQL” might be more desirable to us than an article which only has it mentioned once. So if I look at these results from that persepctive, I can understand the how and the why of it. My concern is that there are likely going to be people who believe that switching to InnoDB FTS is simply a matter of upgrading to 5.6 and running ALTER TABLE foo ENGINE=InnoDB. In theory, yes. In practice, not even close.

I tried one more Boolean search, this time looking for someone’s full name, which I knew to be present only once in the database, and I used double quotes to group the search terms as a single phrase:

With 5.5:

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_myisam ORDER BY 3 DESC LIMIT 1;
+-------+----------------+-------+
| id    | full_name      | score |
+-------+----------------+-------+
| 62633 | Thomas B Smith |     1 |
+-------+----------------+-------+

Looks good, there he is. Now what happens under 5.6?

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 1;
+---------+------------------+-------+
| id      | full_name        | score |
+---------+------------------+-------+
| 1310720 | Terry Lynn Suter |     0 |
+---------+------------------+-------+

In the immortal words of Homer J. Simpson, “D’OH!!” Why is MyISAM able to locate this record but InnoDB cannot find it at all? I suspect that the “B” is causing problems for InnoDB, because it’s only a single character and we’ve set innodb_ft_min_token_size to 4. Thus, when InnoDB is parsing the data and building the word list, it’s completely ignoring Mr. Smith’s middle initial. To test this hypothesis, I reset innodb_ft_min_token_size to 1, dropped/rebuilt the InnoDB index, and tried again.

mysql: SELECT id, full_name, MATCH(full_name, details) AGAINST ('"Thomas B Smith"'
   IN BOOLEAN MODE) AS score FROM dir_test_innodb ORDER BY 3 DESC LIMIT 1;
+-------+----------------+--------------------+
| id    | full_name      | score              |
+-------+----------------+--------------------+
| 62633 | Thomas B Smith | 31.749061584472656 |
+-------+----------------+--------------------+

Aha, there he is! Based on that result, I would caution anyone designing an application that’s going to use InnoDB FTS to be quite mindful of the types of queries that you’re expecting your users to run. In particular, if you expect or are going to allow users to enter search phrases that include initials, numbers, or any other string of length less than 3 (the default), I think you’re going to be forced to set innodb_ft_min_token_size to 1. Otherwise you’ll run into the same problem as our Mr. Smith here. [This does raise the question of why it works with MyISAM when ft_min_word_length defaults to 4, but that is a topic for another day.]

Note that there may or may not be some performance implications to cranking this value all the way down; that is something I have not yet tested but will be reporting on in part 3 of this series. I can, however, confirm that the on-disk size of my DIR dataset is exactly the same with a setting of 1 versus a setting of 4. This may or may not be the case with multi-byte character sets or with ideographic languages such as Japanese, although Japanese poses its own unique problems for FTS of any kind due to its lack of traditional word boundaries.

In any event, it appears that we’ve solved the Boolean-mode search issue, but we still have vastly different results with the natural-language-mode search. For those of you who are expecting and need to have the MyISAM-style search results, there is at least one potential escape hatch from this rabbit hole. When defining a FULLTEXT KEY, you can use the “WITH PARSER” modifier to specify the name of a UDF which references your own custom-written fulltext parser plugin. Thus I am thinking that it may be possible to take the MyISAM full-text parser code, convert it to a plugin, and use it for InnoDB FT indexes where you’re expecting MyISAM-style results. Verifying or refuting this conjecture is left as an exercise for the reader.

A quick recap of what we’ve learned so far:

There are parts of InnoDB FTS configuration which are both letter-case and character-set sensitive. Watch out!
When you add your first FULLTEXT KEY to an InnoDB table, be prepared for a table rebuild.
Calculation of match score is completely different between the two engines; sometimes this leads to wildly different results.
If you were hoping to use InnoDB FTS as a simple drop-in replacement for your current MyISAM FTS, the results may surprise you.

That last point bears particular emphasis, as it also illustrates an important best practice even if FTS isn’t involved. Always test how your application behaves as a result of a major MySQL version upgrade before rolling it into production! Percona has tools (pt-upgrade and Percona Playback) that can help you with this. These tools are free and open source, please use them. You, and your users, will be happy that you did.

In the third and final installment of this series, we will take a look at performance. How does the speed of InnoDB FTS compare to its MyISAM counterpart, and what kinds of tricks can we use to make it go faster? Stay tuned to find out!

The post InnoDB Full-text Search in MySQL 5.6: Part 2, The Queries! appeared first on MySQL Performance Blog.

Search

MySQL Links