The Road to MySQL 5.6 -- A DBA Perspective

We've all heard the hype.  MySQL 5.6 is packed with amazing new features that address all our database problems.  5.6 deals with replication and HA and performance and monitoring and security and features.  It just may cure cancer.

In fact it's been out for ages.  It went GA about 18 months ago.  This is all old news. MySQL 5.7 already exists. So why am I writing about this now?

The hard truth is that if you're a DBA with a non-trivial database environment then you can't drop everything and upgrade at the drop of a hat.  Newly GA versions tend to be buggy. You have uptime requirements to maintain.  Upgrading to the latest version to try out all the new features in production is a luxury you simply do not have.  Additionally the transition brings challenges.  New versions may have incompatibilities with the old so you need plenty of time to test your application against it.  The deployment process may have lots of hidden gotchas.

If you're a DBA then you're constantly evaluating this equation:  at what point does the disadvantage of remaining on a known old version outweigh the pain and effort of doing an upgrade.  If there are no clear benefits, then why upgrade at all?  In software and operations, the weight of an incumbent solution is heavy indeed.

So for me, that tipping point has been in favor of upgrading only recently.  Since I think a lot of DBAs are still running on 5.5 environments I thought it was worth it to document this process here to see if my mistakes could help others.  This isn't an exhaustive how-to.  This is just my own experience of going through the process.

First up: performance_schema Performance_schema was introduced in 5.5, but it was generally accepted that the performance cost was too prohibitive to enable it in production.  Even in 5.6 some claim it's still too expensive, but others have measured it at around 2-3% overhead.  Despite the cost I'm really excited about using it to solve other, nastier performance issues so I'm going to keep it enabled by default for new servers.
The first use-case I wanted to implement was to monitor replication capacity more easily.  Estimating how close you are to having replication lag problems has not been very reliable before performance_schema.  It was mostly guesswork.  Once you go over your replication capacity then there's not much you can do to fix it except do fewer updates on your masters.
Mark Leith has been a great proponent of all things performance_schema and has been writing lots of tools to make it easier to use.  He demonstrated how to implement a replication "load average" type metric using p_s -- and that was way back with 5.5!
I set out to get this data into our graphing system so I could generate pretty pictures like this:

This show the percentage of time the replication SQL thread was actually busy. This is data collected from a replica which is clearly not doing much.  On busier clusters it would show us when we're about to have lag issues.
So how did implementing this go?  Well here's a list of gotchas that I ran into:
1. performance_schema=on can only be set at startup time. This one is documented in the manual so no surprises there.
2. certain instruments must *also* be enabled at startup time. This isn't documented and drove me crazy.  Finally the one and only place I found which described this issue was in the slide deck from a Percona presentation by Valerii Kravchuk which was a life-saver!  I pointed out bug #68097 which discussed the issue.  The summary is that some objects like mutexes are initialized only once at server start time, so if the instruments aren't enabled at startup then enabling them later won't do anything.
This is in direct contradiction of the documentation which says: "Modifications to the setup_instruments table affect monitoring immediately."
To fix this you have to set the following in your my.cnf and restart:
performance_schema_instrument = 'wait/synch/mutex/innodb/%=on' performance_schema_instrument = 'wait/synch/cond/%=on' 3. Check the setup_consumers table if you're still not getting data Setup for p_s has at minimum two parts.  the setup_instruments table lets you customize which events you want to collect data about, but you also have to enable a place to put that collected data, which are the consumers
> select * from setup_consumers;                                                                                                   +--------------------------------+---------+ | NAME                           | ENABLED | +--------------------------------+---------+ | events_stages_current          | NO      | | events_stages_history          | NO      | | events_stages_history_long     | NO      | | events_statements_current      | YES     | | events_statements_history      | YES     | | events_statements_history_long | NO      | | events_waits_current           | YES     | | events_waits_history           | YES     | | events_waits_history_long      | YES     | | global_instrumentation         | YES     | | thread_instrumentation         | YES     | | statements_digest              | YES     | +--------------------------------+---------+
Mostly at this point I'm interested in waits. These help us measure where the server is spending it's time internally.
UPDATE setup_consumers SET ENABLED='YES' WHERE NAME LIKE '%waits%';
4. Performance_schema can consume a lot of memory To make storing and displaying performance schema data as fast as possible, it uses a storage engine which is memory based.  AFAIK it allocates all this memory up front.  Most of the buffers are now auto-sized and the memory consumed depends a lot on the value of max_connections you have set on your server.
SHOW ENGINE PERFORMANCE_SCHEMA STATUS; +---------------------------+------------+ | performance_schema.memory | 2283449856 | +---------------------------+------------+
On a system with max_connections=10,000 it consumes around 2G of RAM.

5. The structure of performance_schema tables changes often It's common to see error messages like the following in the server's error.log:
[ERROR] Native table 'performance_schema'.'events_waits_current' has the wrong structure 
This means you've forgotten to run mysql_upgrade after changing the installed server version.  This will happen when moving from 5.5 to 5.6, but if you look at the changelog of MySQL 5.6 it's clear that lots of development is going on with performance_schema which results in changes to the structure of these tables.  Therefore even with minor version upgrades within the 5.6 branch you will need to run mysql_upgrade to update these tables.  Otherwise performance_schema will fail to initialize and it will be unusable. Summary While performance schema is super powerful, information about how to configure and use it is lacking in content and accuracy.  I'm not the first one to raise these issues and hopefully by repeating them we can get more attention on it.  
I will try to post more about specific issues with upgrading from 5.5 to 5.6!