New feature in MariaDB 10.1: ANALYZE statement

Last week, yours truly pushed a new feature into the MariaDB 10.1 tree: the ANALYZE statement.

The idea of this feature is to make it easy to compare the query plan with the actual query execution. The ANALYZE statement runs the statement and produces EXPLAIN-like output, in which the optimizer’s estimates are followed by the numbers that were observed while running the query. The output looks like this:

Here,

  • Next to rows there is an r_rows column, which shows how many records were actually read from the table.
  • Next to filtered there is an r_filtered column, which shows what fraction of the records was left after the part of the WHERE condition attached to the table was checked.
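
As a rough illustration of what these two columns measure, here is a minimal Python sketch. It is not MariaDB code: the `TableAccessStats` class and the `scan` helper are hypothetical names, the table is an in-memory list, and r_filtered is reported as a percentage on the assumption that it uses the same unit as EXPLAIN's filtered column.

```python
from dataclasses import dataclass


@dataclass
class TableAccessStats:
    """Hypothetical per-table counters, loosely modelled on r_rows / r_filtered."""
    rows_read: int = 0      # rows fetched from the table
    rows_passed: int = 0    # rows that survived the attached condition

    @property
    def r_rows(self) -> int:
        return self.rows_read

    @property
    def r_filtered(self) -> float:
        # Percentage of read rows left after the condition was checked.
        # Returning 100.0 for an empty scan is an arbitrary choice for this sketch.
        if self.rows_read == 0:
            return 100.0
        return 100.0 * self.rows_passed / self.rows_read


def scan(rows, condition, stats):
    """Read every row, count it, and yield only rows matching the condition."""
    for row in rows:
        stats.rows_read += 1
        if condition(row):
            stats.rows_passed += 1
            yield row


# Toy data: 1000 rows; the condition keeps every fourth one.
table = [{"id": i} for i in range(1000)]
stats = TableAccessStats()
result = list(scan(table, lambda r: r["id"] % 4 == 0, stats))
print(stats.r_rows, stats.r_filtered)  # prints: 1000 25.0
```

If the optimizer had estimated, say, 100 rows and filtered of 100 for this table, the gap between the estimate and these observed values is exactly what ANALYZE is meant to expose.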

I think this should explain the feature. If you want more details, please refer to the KB article ANALYZE statement. It also discusses the meaning of the above EXPLAIN output.

Technical details and further plans

ANALYZE currently uses its own counters. Counting is done for all queries, including non-ANALYZE queries. This should be okay (that is, have no visible overhead) as long as counting just increments integer variables in the query plan, without doing any atomic operations or making syscalls.

The upside of this approach is that it’s now trivial to make EXPLAIN in the slow query log also print ANALYZE output. When a query runs slowly, you will be able to see exactly where the optimizer was wrong.

The downside is that getting more data will not be as easy. So far, the most requested numbers beyond r_rows and r_filtered have been r_time (the amount of time spent reading the table) and r_io (the amount of IO done on the table). Measuring the time spent reading each row imposes CPU overhead, which is a known problem. Counting IO is just incrementing a counter, but it requires interaction between the ANALYZE code and the storage engine code, which adds complexity.
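
To make the difference in cost concrete, here is a small Python sketch (hypothetical names, not server code). Counting rows is one integer increment per row, while measuring something like r_time needs two clock reads around every row fetch. The sketch only counts how often the clock is read; it makes no claim about how large the overhead would be in the server.

```python
import time


class CountingClock:
    """Wraps a real clock and counts how often it is read (illustration only)."""

    def __init__(self):
        self.calls = 0

    def now_ns(self) -> int:
        self.calls += 1
        return time.perf_counter_ns()


def scan(table, clock=None):
    """Read all rows; always count them, and time each row read if a clock is given."""
    r_rows = 0
    r_time_ns = 0
    for pos in range(len(table)):
        if clock is not None:
            start = clock.now_ns()
            row = table[pos]          # stand-in for the storage engine's row read
            r_time_ns += clock.now_ns() - start
        else:
            row = table[pos]
        r_rows += 1                   # the cheap part: a plain integer increment
    return r_rows, r_time_ns


table = list(range(10_000))
clock = CountingClock()
rows_read, elapsed_ns = scan(table, clock)
print(rows_read, clock.calls)  # prints: 10000 20000
```

Without a clock, the same scan does nothing per row beyond the increment, which is why r_rows and r_filtered can be collected for every query.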

There is the PERFORMANCE_SCHEMA feature, where others have already spent a lot of effort on counting wait time and IO. It’s tempting to reuse it. The problem is that P_S collects the wrong data. P_S counters are global, while ANALYZE needs to count IO for each table reference separately. Consider a self-join. From the P_S point of view, the query is reading from the same table. From the ANALYZE point of view, these are reads from two different table references. I’m currently not sure whether ANALYZE should (or could) rely on PERFORMANCE_SCHEMA.
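
The self-join case can be sketched in a few lines of Python. This is a toy model with made-up names, not how PERFORMANCE_SCHEMA or the server is implemented: one counter is keyed by table name only (a stand-in for a counter that cannot tell table references apart), the other by table reference (the alias within the query). The join is a naive nested loop over an in-memory list.

```python
from collections import Counter

reads_by_table = Counter()      # keyed by table name only
reads_by_reference = Counter()  # keyed by alias within the query


def read_rows(table_name, alias, rows):
    """Yield rows while bumping both kinds of counters."""
    for row in rows:
        reads_by_table[table_name] += 1
        reads_by_reference[alias] += 1
        yield row


# Toy self-join: SELECT ... FROM emp AS e JOIN emp AS m ON e.manager_id = m.id
emp = [
    {"id": 1, "manager_id": None},
    {"id": 2, "manager_id": 1},
    {"id": 3, "manager_id": 1},
]

pairs = []
for e in read_rows("emp", "e", emp):        # outer reference: one pass
    for m in read_rows("emp", "m", emp):    # inner reference: rescanned per outer row
        if e["manager_id"] == m["id"]:
            pairs.append((e["id"], m["id"]))

print(dict(reads_by_table))      # {'emp': 12}
print(dict(reads_by_reference))  # {'e': 3, 'm': 9}
```

The per-table counter reports 12 reads of emp and cannot say which side of the join caused them; the per-reference counters show 3 reads for e and 9 for m, which is the breakdown ANALYZE would need.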

A totally different angle is that tabular EXPLAIN output doesn’t leave room to show much data (for example, how many rows were there before and after GROUP BY?). Here the solution is clear, I think: support EXPLAIN FORMAT=JSON and then add ANALYZE FORMAT=JSON, where we can provide lots of detail.