Chinese, Japanese, and Korean (CJK) text usually has no spaces
between words. Conventional full-text search does its tokenizing
by looking for spaces. Therefore conventional full-text search
will fail for CJK.
One workaround is bigrams. Suppose the text is
册免从冘
There should be three index keys, one for each two-character
sequence:
册免, 免从, and 从冘.
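As a rough sketch of that splitting step, in Python rather than anything MySQL-specific: each key is a two-character window slid over the text. The helper name bigrams is made up for illustration and is not part of MySQL, MariaDB, or mroonga.

```python
def bigrams(text):
    """Return every overlapping two-character sequence in text, in order."""
    # A string of n characters yields n - 1 bigrams; shorter strings yield none.
    return [text[i:i + 2] for i in range(len(text) - 1)]


print(bigrams("册免从冘"))  # ['册免', '免从', '从冘']
```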
Now, in a search like
SELECT * FROM t WHERE MATCH(text_column) AGAINST ('免从');
a bigram-supporting full-text index will have a chance, because
'免从' is one of its keys. Bigram indexing is wasteful, and there
will be false hits whenever the bigram isn't really a "word",
but the folks in CJK-land have found that
bigrams (or the three-character counterpart, trigrams) actually
work.
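To make the lookup side concrete, here is a toy inverted index keyed on bigrams. It is only a sketch under simplifying assumptions: the names bigrams, build_index, and search are hypothetical, positions are not stored, and this is not how MySQL, MariaDB, or mroonga implement their indexes.

```python
from collections import defaultdict


def bigrams(text):
    """Return every overlapping two-character sequence in text."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def build_index(docs):
    """Map each bigram to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for key in bigrams(text):
            index[key].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every bigram of the query."""
    keys = bigrams(query)
    if not keys:
        return set()
    result = set(index.get(keys[0], set()))
    for key in keys[1:]:
        result &= index.get(key, set())
    return result


docs = {1: "册免从冘", 2: "从冘册免"}
index = build_index(docs)
print(search(index, "免从"))  # {1}
```

Because the sketch ignores positions, a longer query can match a document that holds all of its bigrams without holding them side by side. That is a second source of false hits, on top of the "not really a word" ones.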
One way to get bigrams for MySQL or MariaDB is to get mroonga.
Why care about Yet Another Storage Engine?
Back in 2008 a project named Senna attracted the …