In MySQL 8.0 our plan is to drastically improve support for utf8. While utf8 support itself dates back to MySQL 4.1, there exist some limitations. The “sushi = beer” problem in the title refers to Bug #76553. Sushi and beer don’t even go well together, at least not to my taste:-) I will use this bug as an example to explain issues with utf8 collations in the past and our plans for utf8 support going forward.…
It's time to summarize the year of 2016. As a kind of a weird
summary, in this post I'd like to share a list of MySQL bug
reports I've created in 2016 that still remain "Verified":
- Bug #79831 - "Unexpected error message on crash-safe slave with max_relay_log_size set". According to Umesh this is not repeatable with 5.7. The fact that I reported the bug on January 4 probably means I was working at that time. I should not repeat this mistake again next year.
- Bug #80067 - "Index on BIT column is NOT used when column name only is used in WHERE clause". People say the same problem happens with INT and, what may be even less expected, BOOLEAN columns.
In Python it is easy to find out the name of a Unicode
character and some of the properties of that character. The
module which does that is called unicodedata.
>>> import unicodedata
>>> unicodedata.name('☺')
'WHITE SMILING FACE'
This module uses the data as released in the UnicodeData.txt file from the unicode.org website.
So if UnicodeData.txt is a 'database', then we should be able to import it into MySQL and use it!
I wrote a small Python script to automate this. The basic steps are:
- Download UnicodeData.txt
- Create a unicodedata.ucd table
- Use LOAD DATA LOCAL INFILE to load the data
This isn't difficult, especially because the file doesn't have the actual characters …
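As a rough illustration of why the import is straightforward: UnicodeData.txt is a plain semicolon-separated text file with one code point per line, the hexadecimal code point in the first field, the character name in the second and the general category in the third. The sketch below parses one such line in Python. The sample line and the function name `parse_ucd_line` are illustrative assumptions and are not taken from the script mentioned above.

```python
import unicodedata

# Sample line in the UnicodeData.txt layout (15 semicolon-separated fields).
SAMPLE = "263A;WHITE SMILING FACE;So;0;ON;;;;;N;;;;;"


def parse_ucd_line(line):
    """Return (code point as int, name, general category) for one line."""
    fields = line.rstrip("\n").split(";")
    return int(fields[0], 16), fields[1], fields[2]


codepoint, name, category = parse_ucd_line(SAMPLE)

# The standard library ships the same data, so the two should agree.
assert unicodedata.name(chr(codepoint)) == name
assert unicodedata.category(chr(codepoint)) == category

print(f"U+{codepoint:04X} {name} ({category})")
```

Because the file stores code points as hex text rather than the characters themselves, the load step does not depend on the character set of the connection.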
In MySQL Character encoding – part 1 we stated that the myriad of ways in which character encoding can be controlled can lead to many situations where your data may not be available as expected.
Setting MySQL Client and Server Character encoding.
Let's restart MySQL with the correct setting for our purpose, UTF8. Here we can see the setting in the MySQL configuration file, in this case /etc/mysql/my.cnf.
character-set-server = utf8
This change is then reflected in the session and global variables once the instance is restarted with the new configuration parameter.
mysql> SELECT …
Breaking and unbreaking your data
Recently at FOSDEM, Maciej presented “Breaking and unbreaking your data”, a presentation about the potential problems you can incur regarding character encoding whilst working with MySQL. In short, there are a myriad of places where character encoding can be controlled, which gives ample opportunity for the system to break and for text to become unrecoverable.
The slides from the presentation are available on SlideShare.
Here’s a problem some or most of us have encountered. You have a latin1 table defined like below, and your application is storing utf8 data in the column over a latin1 connection. Obviously, double encoding occurs. Now your development team has decided to use utf8 everywhere, but during the process you can afford little to no downtime while keeping your stored data valid.
CREATE TABLE `t` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `c` text,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

master> SET NAMES latin1;
master> INSERT INTO t (c) VALUES ('¡Celebración!');
master> SELECT id, c, HEX(c) FROM t;
+----+-----------------+--------------------------------+
| id | c               | HEX(c)                         |
+----+-----------------+--------------------------------+
|  3 | ¡Celebración!   | C2A143656C656272616369C3B36E21 |
+----+-----------------+--------------------------------+
1 row in set (0.00 sec)
master> …
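The byte-level effect can be reproduced outside the database. The following Python sketch is only an approximation of what happens on the server: it assumes Python's `latin-1` codec stands in for MySQL's latin1 (which is closer to cp1252; the two agree for the bytes in this example), and the helper name `undo_double_encoding` is made up for illustration.

```python
original = "¡Celebración!"

# Bytes the application sends: UTF-8, stored unchanged in the latin1 column.
stored = original.encode("utf-8")
print(stored.hex().upper())  # compare with HEX(c) in the listing above

# A reader that trusts the latin1 label sees one character per byte.
misread = stored.decode("latin-1")
print(misread)

# Converting that misread text to UTF-8 encodes every byte a second time.
double_encoded = misread.encode("utf-8")


def undo_double_encoding(data):
    """Reverse one round of latin1-to-UTF-8 double encoding."""
    return data.decode("utf-8").encode("latin-1").decode("utf-8")


assert undo_double_encoding(double_encoded) == original
```

The round trip only works while no byte has been lost or replaced along the way, which is why the conversion has to be planned rather than done with a plain character set change.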
As I wrote in a previous post, MariaDB / MySQL has some issues
with the standard UTF-8 encoding there. This UTF-8 encoding
limits us to 3 UTF-8 bytes or 2 UNICODE bytes if you want to look
at it that way. This is slightly limiting, but for languages it
is usually pretty much OK, although there are some little used
languages in the 3 byte UNICODE range. But in addition to
languages, you will be missing symbols, such as smileys!
Help is on the way though, in the utf8mb4 character set that is part of both MariaDB and MySQL. This is a character set that is just like the one just called utf8, except this one accepts all the UNICODE characters with up to 3 UNICODE bytes, or 4 bytes using the UTF-8 encoding.
This means that there are more limits to how long a column might be when using utf8mb4 compared …
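To make the 3-byte versus 4-byte distinction concrete, here is a small Python sketch that measures how many UTF-8 bytes a few characters need and checks whether a string would fit in the 3-byte utf8 character set. The function names are illustrative, and the check is a simplification that only looks at code point values.

```python
def utf8_length(ch):
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def fits_in_3_byte_utf8(text):
    """True if every character is in the Basic Multilingual Plane (<= U+FFFF)."""
    return all(ord(ch) <= 0xFFFF for ch in text)


for ch in ["a", "é", "寿", "☺", "🍣"]:
    print(f"U+{ord(ch):04X} needs {utf8_length(ch)} byte(s)")

assert fits_in_3_byte_utf8("寿司 ☺")
assert not fits_in_3_byte_utf8("🍣")
```

Note that an older symbol such as ☺ still fits in 3 bytes; it is the newer emoji beyond U+FFFF that need utf8mb4.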
UNICODE is getting more and more traction and most new
applications, at least web applications, support UNICODE. I have
written about UNICODE and related stuff before in Character sets, Collations, UTF-8 and all that,
but before I go into some more specifics, and some issues and
fixes, let me tell you about UNICODE, UTF-8 and how MySQL
interprets it. See the blogpost linked to above for more
information on the subject, surprisingly even more boring, on
So, let's begin with UNICODE. UNICODE is a character set that is very complete; you should be able to make yourself understood in any language using the characters from this vast character set. This is not to say that all characters from all languages are in UNICODE; some are missing here and there and sometimes new characters make their way into …
The ALTER TABLE statement syntax is explained in the manual.
To put it simply, there are two ways you can alter the table to use a new character set.
1. ALTER TABLE tablename DEFAULT CHARACTER SET utf8;
This will alter the table to use the new character set as the default, but as a safety mechanism, it will only change the table definition for the default character set. That is, existing character fields will have the old character set per column. For example:
mysql> create table mybig5 (id int not null auto_increment primary key,
-> subject varchar(100) ) engine=innodb default charset big5;
Query OK, 0 rows affected (0.81 sec)
mysql> show create table …
It has been a very very long working week-end for the technical team at Believe...
From MariaDB 5.2 to 5.5.28: a few QP regressions still need to be fixed.
From latin1 to utf8: very few indexes had to be modified because they reached the maximum index column size of 767 bytes for InnoDB. Some URL columns, such as Referer, have been changed to varbinary, as their content was already a mix of encodings stored in latin1.
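The reason a character set change can push an index over the limit is simple arithmetic: the index has to reserve the maximum number of bytes per character. A minimal Python sketch, assuming the 767-byte limit quoted above and the usual maximum widths of 1, 3 and 4 bytes per character (the names below are illustrative):

```python
INDEX_LIMIT_BYTES = 767  # InnoDB limit quoted above

# Maximum bytes per character reserved for each character set.
MAX_BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset, limit=INDEX_LIMIT_BYTES):
    """Largest column length (in characters) whose index fits in the limit."""
    return limit // MAX_BYTES_PER_CHAR[charset]


for charset in MAX_BYTES_PER_CHAR:
    print(charset, max_indexable_chars(charset))
```

Under these assumptions, a fully indexed VARCHAR longer than 255 characters is fine in latin1 but no longer fits once the column is utf8.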
Don't forget UTF8 is bad for in-memory workloads, as strings in …