Showing entries 1 to 10
Displaying posts with tag: unicode
Using a parser plugin for improved search results with MySQL 5.7 and InnoDB.

With Unicode, two strings can look the same yet differ slightly in which code points they use.

For example, the é in Café can be <U+0065 U+0301> (an e followed by a combining acute accent) or the single code point <U+00E9>.

The solution is to use Unicode normalization, which is supported in every major programming language. With a composing form such as NFC, both versions of Café will be normalized to use U+00E9.
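A minimal Python sketch of this normalization, using the standard unicodedata module. The variable names are illustrative, and NFC is an assumption here; the excerpt does not say which normalization form the plugin uses.

```python
import unicodedata

decomposed = "Cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT
precomposed = "Caf\u00e9"   # single code point U+00E9

# The two strings render the same but are different code point sequences.
differs_before = decomposed != precomposed

# NFC composes a base letter plus combining mark into the precomposed form.
nfc_decomposed = unicodedata.normalize("NFC", decomposed)
nfc_precomposed = unicodedata.normalize("NFC", precomposed)

print(differs_before)
print(nfc_decomposed == nfc_precomposed)
print([f"U+{ord(c):04X}" for c in nfc_decomposed])
```

Normalizing both the stored text and the search term in the same way is what makes the two spellings match.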

In the best situation the application inserting data into the database will do the normalization, but that is often not the case.

This gives the following issue: If you search for Café in the normalized form it won't return non-normalized entries.

I made a proof-of-concept parser plugin which indexes the normalized version of words.

A very short demo:

mysql> CREATE TABLE test1 (id int auto_increment primary key,
-> txt TEXT CHARACTER SET utf8mb4, fulltext (txt));
Query OK, 0 rows affected (0.30 sec)

mysql> …
[Read more]
Importing the Unicode Character Database in MySQL

In Python it is easy to find out the name of a Unicode character and some of its properties. The module that does this is called unicodedata.

An example:

>>> import unicodedata

This module uses the data as released in the UnicodeData.txt file published by the Unicode Consortium.
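A short sketch of the kind of lookups the module offers. The character chosen here (é, U+00E9) is only an illustration and is not taken from the original post.

```python
import unicodedata

ch = "\u00e9"

name = unicodedata.name(ch)                    # official character name
category = unicodedata.category(ch)            # general category, e.g. 'Ll' for lowercase letter
decomposition = unicodedata.decomposition(ch)  # canonical decomposition as hex code points
by_name = unicodedata.lookup("LATIN SMALL LETTER E WITH ACUTE")  # reverse lookup by name

print(name, category, decomposition, by_name == ch)
```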

So if UnicodeData.txt is a 'database', then we should be able to import it into MySQL and use it!

I wrote a small Python script to automate this. The basic steps are:

  • Download UnicodeData.txt
  • Create a unicodedata.ucd table
  • Use LOAD DATA LOCAL INFILE to load the data
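The steps above could be sketched roughly as follows. This is not the author's script: the column names are hypothetical (chosen to match the 15 semicolon-separated fields of UnicodeData.txt), the download step is replaced by one sample line in the file's format, and the LOAD DATA statement is only built as a string rather than sent to a server.

```python
# Hypothetical column names for the 15 fields of UnicodeData.txt.
UCD_COLUMNS = [
    "code_point", "name", "general_category", "combining_class",
    "bidi_class", "decomposition", "decimal_value", "digit_value",
    "numeric_value", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "uppercase", "lowercase", "titlecase",
]

# A sample line in the UnicodeData.txt format, used instead of downloading the file.
SAMPLE = (
    "00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
    "LATIN SMALL LETTER E ACUTE;;00C9;;00C9\n"
)


def parse_ucd(text):
    """Yield one dict per well-formed line of UnicodeData.txt-style text."""
    for line in text.splitlines():
        fields = line.split(";")
        if len(fields) == len(UCD_COLUMNS):
            yield dict(zip(UCD_COLUMNS, fields))


def load_data_statement(path, table="unicodedata.ucd"):
    """Build the LOAD DATA statement; a MySQL client would have to execute it."""
    return (
        f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE {table} "
        "FIELDS TERMINATED BY ';'"
    )


records = list(parse_ucd(SAMPLE))
print(records[0]["name"])
print(load_data_statement("UnicodeData.txt"))
```

The file stores code points as hexadecimal text rather than as the characters themselves, so a plain semicolon split is enough to read it.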

This isn't difficult, especially because the file doesn't have the actual characters …

[Read more]
Backing up and restoring tables named with special characters


The names of databases and tables within MySQL are known as identifiers. In the simplest case these identifiers are just strings of certain ASCII characters (the basic Latin letters, the digits 0-9, the dollar sign and the underscore). However, if an identifier is placed in quotes, it can contain any character of the full Unicode Basic Multilingual Plane (except U+0000). We say that a character is a special character if it is permitted in a quoted identifier but not in an unquoted identifier.
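A small Python sketch of that definition, assuming only the simplified rule stated above (MySQL's full identifier rules have more cases). The function names are hypothetical; the backtick quoting with doubled embedded backticks is standard MySQL identifier quoting.

```python
import re

# Simplified rule from the paragraph above: ASCII letters, digits, '$' and '_'.
_UNQUOTED = re.compile(r"[0-9A-Za-z$_]+")


def has_special_characters(identifier):
    """True if the identifier is only valid when quoted (per the simplified rule)."""
    return _UNQUOTED.fullmatch(identifier) is None


def quote_identifier(identifier):
    """Wrap an identifier in backticks, doubling any embedded backtick."""
    if "\x00" in identifier:
        raise ValueError("U+0000 is not permitted in identifiers")
    return "`" + identifier.replace("`", "``") + "`"


for name in ["orders_2014", "caf\u00e9", "my table", "odd`name"]:
    print(name, has_special_characters(name), quote_identifier(name))
```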

MySQL Enterprise Backup (MEB) 3.12.1 introduces support for proper handling of table and database names with special characters. In MEB versions prior to 3.12.1, database and table names were represented as ASCII strings, and the same string was used on the command line, internally within MEB, and in filenames. This caused MEB to fail some …

[Read more]
WordPress and UTF-8

Update: WordPress 4.2 has full UTF-8 support! There’s no need to upgrade manually any more.

For many years, MySQL only supported a small part of UTF-8, a section commonly referred to as plane 0, the “Basic Multilingual Plane”, or the BMP. Unicode is divided into “planes”, and plane 0 contains the most commonly used characters. For a long time, this was reasonably sufficient for MySQL’s purposes, and WordPress made do with this limitation.

It has always been possible to store all UTF-8 characters in the latin1 character set, though latin1 has shortcomings. While it recognises the connection between upper and lower case characters in Latin alphabets (such as English, French and German), it doesn’t recognise the same connection for other alphabets. For example, it doesn’t know …

[Read more]
Using 4-byte UTF-8 (aka 3-byte UNICODE) in MariaDB and MySQL

As I wrote in a previous post, MariaDB / MySQL has some issues with the standard UTF-8 encoding. This UTF-8 encoding limits us to 3 UTF-8 bytes, or 2 UNICODE bytes if you want to look at it that way. This is slightly limiting, but for languages it is usually pretty much OK, although there are some little-used languages in the 3-byte UNICODE range. But in addition to languages, you will be missing symbols, such as smileys!

Help is on the way though, in the utf8mb4 character set that is part of both MariaDB and MySQL. This is a character set that is just like the one just called utf8, except this one accepts all the UNICODE characters with up to 3 UNICODE bytes, or 4 bytes using the UTF-8 encoding.
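To make the byte counts concrete, here is a small Python sketch. The helper names are hypothetical, and the check simply tests whether every code point lies in the Basic Multilingual Plane, which is the range that needs at most 3 bytes in UTF-8.

```python
def utf8_length(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_three_bytes(text):
    """True if every code point is in the BMP, i.e. needs at most 3 UTF-8 bytes."""
    return all(ord(c) <= 0xFFFF for c in text)


samples = {
    "latin small e with acute": "\u00e9",   # 2 bytes
    "euro sign": "\u20ac",                  # 3 bytes
    "grinning face": "\U0001F600",          # 4 bytes, outside the BMP
}

for label, ch in samples.items():
    print(label, f"U+{ord(ch):04X}", utf8_length(ch), fits_in_three_bytes(ch))
```

Text for which the check fails needs utf8mb4 rather than the 3-byte utf8 character set.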

This means that there are more limits to how long a column might be when using utf8mb4 compared …

[Read more]
How MariaDB and MySQL makes life with UTF-8 a bit too easy. And how to fix it...

UNICODE is getting more and more traction and most new applications, at least web applications, support UNICODE. I have written about UNICODE and related stuff before in Character sets, Collations, UTF-8 and all that, but before I go into some more specifics, and some issues and fixes, let me tell you about UNICODE, UTF-8 and how MySQL interprets it. See the blog post mentioned above for more information on the surprisingly even more boring subject of Collations.

So, let's begin with UNICODE. UNICODE is a character set that is very complete; you should be able to make yourself understood in any language using the characters from this vast character set. This is not to say that all characters from all languages are in UNICODE; some are missing here and there and sometimes new characters make their way into …

[Read more]
Adding a case insensitive, distinct unicode collation

Every once in a while questions like the one in MySQL Bug #60843 or Bug #19567 come up:

What collation should I use if I want case insensitive behavior but also want all accented letters to be treated as distinct from their base letters?

or shorter, as the reporter of bug #60843 put it:

I need something like utf8_bin + ci

utf8_general_ci and utf8_unicode_ci unfortunately do not provide this behavior and utf8_bin is obviously not case insensitive.
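To illustrate the behavior being asked for (this is not a MySQL collation, only a Python sketch of the comparison semantics), the hypothetical key function below folds case but leaves accents alone. It is a simplification; Unicode's full caseless matching rules are more involved.

```python
import unicodedata


def ci_accent_sensitive_key(text):
    """Comparison key: case is ignored, accented letters stay distinct."""
    # NFC first so composed and decomposed spellings agree, then fold case.
    return unicodedata.normalize("NFC", text).casefold()


same_case_insensitive = ci_accent_sensitive_key("Caf\u00e9") == ci_accent_sensitive_key("CAF\u00c9")
accent_still_distinct = ci_accent_sensitive_key("cafe") != ci_accent_sensitive_key("caf\u00e9")

print(same_case_insensitive, accent_still_distinct)
```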

[Read more]

Guidelines for generating XML

Over the last little while I've come across quite a few XML feed generators written in PHP, with varying degrees of 'correctness'. Even though generating XML should be very simple, there are still quite a few pitfalls that I feel every PHP (or insert-your-language) developer should know about.

1. You are better off using an XML library

This is the first and foremost rule. Most people end up generating their XML using simple string concatenation, while there are many dedicated tools out there that really help you generate your own XML.
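The same pitfall can be shown in any language. As a minimal sketch in Python (not the post's PHP), using the standard xml.etree.ElementTree module with a made-up title string, concatenation produces malformed XML as soon as the text contains '&' or '<', while the library escapes it:

```python
import xml.etree.ElementTree as ET

title = "Tom & Jerry <caf\u00e9>"

# String concatenation leaves '&' and '<' unescaped.
naive = "<root><title>" + title + "</title></root>"

# The library escapes special characters in text content for us.
root = ET.Element("root")
ET.SubElement(root, "title").text = title
safe = ET.tostring(root, encoding="unicode")


def is_well_formed(xml_text):
    """True if the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(naive), is_well_formed(safe))
```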

In PHP land the best example is XMLWriter. It is actually quite easy to use:

<?php
$xmlWriter = new XMLWriter();
$xmlWriter->openMemory();
$xmlWriter->startDocument('1.0','UTF-8');
$xmlWriter->startElement('root'); …
[Read more]
Unicode nearing 50% of the web

According to a recent post from the Google Blog, Unicode is nearing 50% uptake on the web. A rather steep graph as well:

This is pretty good news. I've had the 'pleasure' of working with a number of integration projects where the 3rd party was still using iso-8859-1 (aka latin-1). Usually when this is the case, it's not by choice but because of their software's default settings (browsers, MySQL, etc.). I for one hope non-unicode charsets will soon be a thing of the past.

One other note in the post was about ligatures, such as fi and the Dutch ij. If this is the first time you've heard about these, you might be surprised to see that you can (likely) …

[Read more]
Unicode coming to PHP 6

The move from PHP 5 to PHP 6 will be a painful one. But once it’s done, I hope that it will be easier to handle safe web development for a global, multi-language internet. After all these years, we still … Continue reading →
