InnoDB Full-Text: MeCab Parser

In addition to our general CJK support, as detailed in this blog post, we’ve also added a MeCab parser. MeCab is a Japanese morphological analyzer, and we now have a full-text plugin parser based on it!

How Would I Use It?

  1. Set the mecab_rc_file option — mecab_rc_file is a read-only system variable pertaining to the MeCab parser. The mecabrc file that it points to is a configuration file required by MeCab, and it should at least have one entry for
    dicdir=/path/to/ipadic
    
    , which tells MeCab where to load the dictionary from.Once MySQL is installed, we have a default bundled mecabrc file in
    /path/to/mysql/install/lib/mecab/etc/mecabrc
    
    , and we have three dictionaries within the
    /path/to/mysql/install/lib/mecab/dic
    
    directory: ipadic_euc-jp, ipadic_sjis, and ipadic_utf-8. We’ll need to modify the mecabrc file to specify which one of these three dictionaries we want to use.

    Note: If you have your own dictionary, you can instead use that as well. There are also many additional options that can be specified within the mecabrc file. For more information about that configuration file, please see the documentation here.

    For our testing purposes here, let’s load ipadic_utf-8 using these steps (in my case, MySQL 5.7.7 is installed in /usr/local/mysql):

    1. Add an entry in the mecabrc file like this:
      dicdir=/usr/local/mysql/lib/mecab/dic/ipadic_utf-8
      
    2. Add an entry in the [mysqld] section of /etc/my.cnf like this:
      loose-mecab-rc-file=/usr/local/mysql/lib/mecab/etc/mecabrc
      
  2. Set innodb_ft_min_token_size — The recommended value is 1 or 2 with the MeCab parser (the default value is 3). We will use 1 for the following examples.
  3. Install the MeCab Plugin:
    mysql> INSTALL PLUGIN mecab SONAME 'libpluginmecab.so';
    Query OK, 0 rows affected (0.09 sec)
    
    mysql> SHOW STATUS LIKE 'mecab_charset';
    +---------------+-------+
    | Variable_name | Value |
    +---------------+-------+
    | mecab_charset | utf8  |
    +---------------+-------+
    1 row in set (0.00 sec)
    
  4. Create a Full-Text Index with MeCab (NOTE: With 5.7.6, you will need to use utf8 instead of utf8mb4. The MeCab parser plugin now supports the eucjpms, cp932, and utf8mb4 character sets in 5.7.7 and later—Bug#20534096):
    mysql> CREATE TABLE articles
    (
            FTS_DOC_ID BIGINT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
            title VARCHAR(100),
            FULLTEXT INDEX mecab_idx (title) WITH PARSER mecab
    )ENGINE=InnoDB CHARACTER SET utf8mb4;
    Query OK, 0 rows affected (1.27 sec)
    

More on MeCab Tokenization

Let’s look at an example that demonstrates how the word tokenization is done:

mysql> set names utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO articles (title) VALUES ('東京都は日本の首都です'),('京都と大阪は日本の府です');
Query OK, 2 rows affected (0.37 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> SET GLOBAL innodb_ft_aux_table="test/articles";
Query OK, 0 rows affected (0.00 sec)

mysql> SELECT * FROM INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE ORDER BY doc_id, position;
+--------+--------------+-------------+-----------+--------+----------+
| WORD   | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+--------+--------------+-------------+-----------+--------+----------+
| 東京   |            1 |           1 |         1 |      1 |        0 |
| 都     |            1 |           1 |         1 |      1 |        6 |
| は     |            1 |           2 |         2 |      1 |        9 |
| 日本   |            1 |           2 |         2 |      1 |       12 |
| の     |            1 |           2 |         2 |      1 |       18 |
| 首都   |            1 |           1 |         1 |      1 |       21 |
| です   |            1 |           2 |         2 |      1 |       27 |
| 京都   |            2 |           2 |         1 |      2 |        0 |
| と     |            2 |           2 |         1 |      2 |        6 |
| 大阪   |            2 |           2 |         1 |      2 |        9 |
| は     |            1 |           2 |         2 |      2 |       15 |
| 日本   |            1 |           2 |         2 |      2 |       18 |
| の     |            1 |           2 |         2 |      2 |       24 |
| 府     |            2 |           2 |         1 |      2 |       27 |
| です   |            1 |           2 |         2 |      2 |       30 |
+--------+--------------+-------------+-----------+--------+----------+
15 rows in set (0.00 sec)

More on Full-Text Searches with Mecab
Text Searches

  • In NATURAL LANGUAGE MODE, the text searched for is converted to a union of search tokens. For example, '日本の首都' is converted to '日本 の 首都'. Here’s a working example:
    mysql> SELECT * FROM articles WHERE MATCH(title) AGAINST('日本の首都' IN NATURAL LANGUAGE MODE);
    +------------+--------------------------------------+
    | FTS_DOC_ID | title                                |
    +------------+--------------------------------------+
    |          1 | 東京都は日本の首都です               |
    |          2 | 京都と大阪は日本の府です             |
    +------------+--------------------------------------+
    2 rows in set (0.00 sec)
    
  • In BOOLEAN MODE searches, the text searched for is converted to a phrase search. For example, '日本の首都' is converted to '"日本 の 首都"'. Here’s a working example:
    mysql> SELECT * FROM articles WHERE MATCH(title) AGAINST('日本の首都' IN BOOLEAN MODE);
    +------------+-----------------------------------+
    | FTS_DOC_ID | title                             |
    +------------+-----------------------------------+
    |          1 | 東京都は日本の首都です            |
    +------------+-----------------------------------+
    1 row in set (0.01 sec)
    

Wildcard Searches

  • We don’t tokenize the text of a wildcard search. For example, for '日本の首都*' we will search the prefix of '日本の首都', and may not produce any matches. Here’s two working examples:
    mysql> SELECT * FROM articles WHERE MATCH(title) AGAINST('日本*' IN BOOLEAN MODE);
    +------------+--------------------------------------+
    | FTS_DOC_ID | title                                |
    +------------+--------------------------------------+
    |          1 | 東京都は日本の首都です               |
    |          2 | 京都と大阪は日本の府です             |
    +------------+--------------------------------------+
    2 rows in set (0.00 sec)
    
    mysql> SELECT * FROM articles WHERE MATCH(title) AGAINST('日本の首都*' IN BOOLEAN MODE);
    Empty set (0.00 sec)
    

Phrase Searches

  • A phrase search is tokenized by mecab. For example, "日本の首都" is converted to "日本 の 首都". Here’s a working example:
    mysql> SELECT * FROM articles WHERE MATCH(title) AGAINST('"日本の首都"' IN BOOLEAN MODE);
    +------------+-----------------------------------+
    | FTS_DOC_ID | title                             |
    +------------+-----------------------------------+
    |          1 | 東京都は日本の首都です            |
    +------------+-----------------------------------+
    1 row in set (0.00 sec)
    

MeCab Limitations

It only supports three specific character sets: eucjpms (ujis), cp932 (sjis), and utf8 (utf8mb4). If there is a mismatch between what MeCab is using and what the InnoDB table is using—for example the MeCab character set is ujis, but the the fulltext index is utf8/utf8mb4—then you will get a character set mismatch error when attempting the search.

If you would like to learn more general details about InnoDB full-text search, please see the InnoDB Full-Text Index section of the user manual and Jimmy’s excellent Dr. Dobb’s article. For more details about the MeCab parser specifically, please see the MeCab parser section in the user manual.

We hope that you find this new feature useful! We’re very happy to have improved CJK support throughout MySQL 5.7, and this is a big part of that. If you have any questions please feel free to post them here on the blog post or in a support ticket. If you feel that you have encountered any related bugs, please let us know via a comment here, a bug report, or a support ticket.

As always, THANK YOU for using MySQL!