on ORDER BY optimization

Generally in MySQL we send queries massaged to a point where the optimizer doesn’t have to think about anything. In our major user database environment 99.9% of queries don’t have alternative query plans available (either because of forced indexes or just a straightforward Primary Key read). We have various other systems, and from time to time we have to do SQL work there and chase optimizer errors.

There are multiple places where the optimizer can make a choice even in very basic queries, for example:

  • Which index returns fewer rows
  • Which index can be used for ORDER BY

A query I was looking at asked a very basic question: on a job instances table, show state and status for the latest-by-ID entry for job name='Ship Christmas Presents' (the real name was a bit different ;-). So, it was SELECT c,d FROM t WHERE b=X ORDER BY a DESC LIMIT 1, where the PK is (a) and a possible index is on (b,c).
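
For concreteness, here is a minimal sketch of the schema and the query – the column types, the index name b_c and the InnoDB engine are my assumptions, not details from the original system:

CREATE TABLE t (
  a BIGINT NOT NULL AUTO_INCREMENT,  -- job instance ID
  b VARCHAR(255) NOT NULL,           -- job name
  c VARCHAR(32),                     -- state
  d VARCHAR(32),                     -- status
  PRIMARY KEY (a),
  KEY b_c (b, c)
) ENGINE=InnoDB;

SELECT c, d FROM t WHERE b = 'Ship Christmas Presents' ORDER BY a DESC LIMIT 1;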

What we were observing was a massive table scan on the PK instead of using the (b, ...) index. The table in question was hundreds of gigabytes, so that did hurt, a bit. If one forced the (b,c) index, queries became really fast. This wasn’t a case of intermittent statistics flapping.
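
For the record, the workaround is a plain FORCE INDEX hint (b_c being the assumed index name from the sketch above):

SELECT c, d FROM t FORCE INDEX (b_c) WHERE b = 'Ship Christmas Presents' ORDER BY a DESC LIMIT 1;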

The first thing that immediately caught my attention was that LIMIT 2 produced a proper query plan, whereas LIMIT 1 did not. I shipped a few-gigabyte subset onto my test machine and carefully went through what was happening.
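
The comparison is just the same EXPLAIN with two different limits – on the sketched schema one would expect something along these lines:

EXPLAIN SELECT c, d FROM t WHERE b = 'Ship Christmas Presents' ORDER BY a DESC LIMIT 1; -- key: PRIMARY, the bad plan
EXPLAIN SELECT c, d FROM t WHERE b = 'Ship Christmas Presents' ORDER BY a DESC LIMIT 2; -- key: b_c, the good plan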

EXPLAIN was just telling me that it would be picking the bad query plan, and EXPLAIN FORMAT=JSON was telling the same, so I needed to look at more detailed execution data. MySQL has this extremely promising facility called ‘optimizer trace’, so I got the trace for my test case. Unfortunately, the trace gave me everything I already knew – that only one index made sense for reducing the dataset and that the optimizer switched to the PK to order things better. It told me about the number of rows and some “cost” of the plan – whatever those numbers mean – and it gave super-high numbers for the table scan.
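
For reference, getting such a trace is just a session variable plus a query against INFORMATION_SCHEMA – this is the stock optimizer trace facility, nothing site-specific:

SET optimizer_trace = 'enabled=on';
SELECT c, d FROM t WHERE b = 'Ship Christmas Presents' ORDER BY a DESC LIMIT 1;
SELECT TRACE FROM INFORMATION_SCHEMA.OPTIMIZER_TRACE;
SET optimizer_trace = 'enabled=off';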

My optimizer trace did not tell me why it decided to switch from a decent query plan to an absolutely horrible one; it just had a block saying “reconsidering_access_paths_for_index_ordering” and "plan_changed": true, which I already knew. On the other hand, that provided me with a quick pointer into the source code – the six thousand lines of sql_select.cc. I could find the above trace marker in test_if_skip_sort_order(), which then calls this:

test_if_cheaper_ordering(..., order, table, usable_keys, ref_key, select_limit, &best_key, ...);

Essentially, this function looks at all the indexes that could be used instead of a “filesort” and sees what would happen if we used them. This function would say that the ref_key (b,c) – the index that returns the fewest rows – is not the best key, and that the best key is the one on (a). How did it come up with such a conclusion? Short answer – the optimizer is very naïve.

That logic is explained in this comment:

/*
We assume that each of the tested indexes is not correlated
with ref_key. Thus, to select first N records we have to scan
N/selectivity(ref_key) index entries.
...
*/

It makes a naïve and somewhat optimistic calculation for (a) and the most pessimistic possible calculation for (b,c). I’ll start with the pessimistic one. The optimizer assumes that there are 20000 instances of our job, and that for each of them we have to do a separate seek into the PK, so our cost is 20000 page reads (~300MB) for the PK plus a few more reads on the index itself.

Now the optimism (and the bad query plan) comes from the idea that if we’re selecting only 1 out of 20000 rows, then only 0.005% of a table scan should be enough to satisfy LIMIT 1. With LIMIT 10 it would be 0.05%. On a 100GB table that means about 5MB of data read, and that seems cheaper than the very pessimistic calculation of more than 300MB above.
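
Spelling out that arithmetic (the 16KB InnoDB page size and the one-page-read-per-PK-dive assumption are mine; the 20000 rows and 100GB come from above):

SELECT 20000 * 16 / 1024 AS pessimistic_mb;      -- ~312MB: one PK page read per matching row
SELECT 100 * 1024 / 20000 AS optimistic_scan_mb; -- ~5MB: 1/20000th of a 100GB table scan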

Unfortunately, the optimizer doesn’t understand that there are humans building these systems, and humans have their own rationale and thinking. One idea a human would have is that if you have a 100GB table, you’d better understand things like data locality and other sorts of efficiencies. That means that in the real world most large tables have various degrees of correlation in their data. In our fictitious example we know that “Christmas” happens, well, at Christmas. Looking through our window we know that it isn’t anywhere near Christmas.

So, the database assumes that our data is distributed like this:

----W----H----E----E----E-----

When it is more like this:

-----------W-H-EE-EE----------

With this data distribution the pessimistic decision on (b,c) becomes way too dark and miserable – there’s a good chance that the “random” seeks into the PK will all hit the same pages, so we’d need only a few megabytes of reads. Our best case ends up at ~5MB, our worst case somewhere around the above-mentioned 300MB. The optimistic decision, meanwhile, can vary from the “oh hey, I just started reading and found a row immediately” lottery win, to the “hey, I calculated an average for you” naïveté of 5MB, to an “oh snap, I was wrong” of, say… 30GB.

Though we can all agree that the optimizer cannot know everything, it errs in the direction of chasing a much wider range of possibilities and being far more opportunistic, rather than being somewhat more reliable. Obviously, it is very hard to tell whether changing some of these heuristics is possible in a way that would not affect a million other places, but in this exact case we can look at what could be done better with the information at hand, either by rewriting queries or implementing somewhat procedural access.

One way is to materialize the maximum ID first, simply because it is already hidden in the (b,c) index – internally, inside InnoDB, that index is (b,c,a). So if we

SELECT MAX(a) FROM t WHERE b=X

we can get most of the data for the read just by scanning a few index pages. We can then use that value to dive into the primary key for that single row. A very similar approach can be taken with different limits.
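
A minimal sketch of such a rewrite, first for LIMIT 1 and then for a larger limit (the join through a derived table is needed because MySQL does not allow LIMIT directly inside an IN subquery):

SELECT c, d FROM t WHERE a = (SELECT MAX(a) FROM t WHERE b = 'Ship Christmas Presents');

SELECT c, d
FROM t
JOIN (SELECT a FROM t WHERE b = 'Ship Christmas Presents' ORDER BY a DESC LIMIT 10) latest USING (a)
ORDER BY a DESC;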

We should also note that the optimizer carries most of this logic over from MyISAM times, and it doesn’t realize that it already knows multiple values of (a) just from doing the records-in-range estimation for the (b,c) index – it does two random dives into that index and already observes primary key values there. Simply by knowing that both of those values sit in a range nowhere near the head or tail of a table scan, it could discard the truly expensive query plan.

Of course, that is messy and can suffer from broken sampling, but hey, it may be better than nothing, although it would again assume the opposite – that the data is somehow correlated. Which we humans think it is, while the computer sits on the opposite side of the fence.

So, in summary, you have to understand that the database assumes things you don’t, and sometimes you have to force your query plans or write your queries in a way that leaves no optimization space for the optimizer.
