The other day we were doing some peer review on a schema change at VividCortex and the topic of character set, column length, and the potential downsides of using utf8 came up. I knew from experience that there are some hidden gotchas with this, and usually I’ll just say a couple words and send a link with further reading. But Google turned up nothing, so I am writing this post to try to help fix that.
TL;DR version is that when MySQL can’t use an index for operations such as ORDER BY, it may allocate fixed-length memory buffers large enough to hold the worst-case values, and the same might apply to temporary tables on disk. This can be seen in EXPLAIN as “Using filesort; using temporary” and similar. And when this happens, you might end up making MySQL do gigabytes worth of work to finish a query on a table that’s only a fraction of that size.
Let’s see why this can happen.
The Theory of UTF8
In …
[Read more]