With Unicode, two strings can look identical on screen yet be
made up of different code points.
For example, the é in Café can be encoded as <U+0065 U+0301>
(an e followed by a combining acute accent) or as the single
code point <U+00E9>.
The solution is to use Unicode normalization, which is supported
in every major programming language. With a composed form such as
NFC, both versions of Café will be normalized to use U+00E9.
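
To make this concrete, here is a minimal sketch in Python using the standard library's unicodedata module. The choice of NFC and the helper name to_nfc are assumptions made for illustration; the text above only says that both spellings end up using U+00E9.

```python
import unicodedata

# Two spellings of the same visible word.
decomposed = "Cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "Caf\u00e9"   # single code point U+00E9

def to_nfc(text: str) -> str:
    """Return text in Normalization Form C (composed). Hypothetical helper."""
    return unicodedata.normalize("NFC", text)

# They render the same but differ code point by code point.
print(decomposed == precomposed)                   # False
print(len(decomposed), len(precomposed))           # 5 4

# After NFC normalization both use U+00E9 and compare equal.
print(to_nfc(decomposed) == to_nfc(precomposed))   # True
```

NFD would work just as well for comparisons, as long as the same form is applied to both the stored data and the search term.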
Ideally, the application inserting data into the database does
the normalization, but that is often not the case.
This causes the following issue: if you search for Café in the
normalized form, the search won't return non-normalized entries.
I made a proof-of-concept parser plugin which indexes the
normalized version of words.
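
The idea of indexing normalized words can be sketched with a toy inverted index in Python. This is not the plugin's code (MySQL parser plugins are written against the server's plugin API, not in Python); the function names, the whitespace tokenizer, and the sample rows are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict

def nfc(word: str) -> str:
    """Normalize a word to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", word)

def build_index(rows: dict, normalize: bool) -> dict:
    """Map each whitespace-separated word to the ids of the rows containing it."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for word in text.split():
            key = nfc(word) if normalize else word
            index[key].add(row_id)
    return index

# Hypothetical sample data: row 1 is precomposed, row 2 is decomposed.
rows = {1: "Caf\u00e9 open", 2: "Cafe\u0301 closed"}

raw_index = build_index(rows, normalize=False)
normalized_index = build_index(rows, normalize=True)

query = nfc("Caf\u00e9")  # the search term in normalized form

# Without normalization at index time, the decomposed row is missed.
print(sorted(raw_index.get(query, set())))         # [1]
# Indexing the normalized words finds both rows.
print(sorted(normalized_index.get(query, set())))  # [1, 2]
```

The point of normalizing at index time is that the stored text stays untouched, so it helps even when the inserting application cannot be changed.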
A very short demo:
mysql> CREATE TABLE test1 (id int auto_increment primary key,
-> txt TEXT CHARACTER SET utf8mb4, fulltext (txt));
Query OK, 0 rows affected (0.30 sec)
mysql> CREATE TABLE test2 (id int …