In an earlier blog post, I showed how optimizer improvements
in MySQL 5.6 gave better performance for several of the queries
in the DBT-3 benchmark.
However, for one of the queries, Query 18, I was not able to give
exact numbers for the improvement since the query took very long
in MySQL 5.5. I decided to try to find out exactly how long the
query would take, but when the query had run for one month, I
gave up. How can a query take so long? Especially, when I had set
up InnoDB with a buffer pool that should be large enough to hold
the entire database. Let's have a look at the query:
select c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice, sum(l_quantity)
from customer, orders, lineitem
where o_orderkey in (
select l_orderkey
from lineitem
group by l_orderkey
having sum(l_quantity) > 313
)
and c_custkey = o_custkey
and o_orderkey = l_orderkey
group by c_name, c_custkey, o_orderkey, o_orderdate, o_totalprice
order by o_totalprice desc, o_orderdate
LIMIT 100;
This query will find the orders from customers that have placed
big orders. The reason that this takes so long in MySQL 5.5, is
that the subquery in the WHERE clause will be executed for each
processed row of the table for which this subquery is part of the
WHERE predicate. Let's look at the EXPLAIN output for this query:
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| 1 | PRIMARY | customer | ALL | PRIMARY | NULL | NULL | NULL | 150000 | Using temporary; Using filesort |
| 1 | PRIMARY | orders | ref | PRIMARY,i_o_custkey | i_o_custkey | 5 | dbt3.customer.c_custkey | 7 | Using where |
| 1 | PRIMARY | lineitem | ref | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | i_l_orderkey_quantity | 4 | dbt3.orders.o_orderkey | 2 | Using index |
| 2 | DEPENDENT SUBQUERY | lineitem | index | NULL | PRIMARY | 8 | NULL | 6001215 | NULL |
+----+--------------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
The select_type for the subquery is DEPENDENT SUBQUERY. Since the
left hand side of the IN-expression is a field from the orders
table, the subquery will be executed for each row processed from
this table. This implies that the index scan of the lineitem
table will be performed more than one million times given the
rows estimates in the EXPLAIN output (150000*7). I have measured
that one execution of the sub-query takes about 3.5 seconds when
all the data is in memory. Hence, it will take more than 40 days
to execute it one million times.
In MySQL 5.6, Subquery Materialization may be used to avoid the repeated execution of subqueries. This implies that the subquery is executed once and the result stored (materialized) in a temporary table. Then, for each row where the subquery was earlier executed, a hash-based look-up into the temporary table will be made instead to check whether there is a match. The EXPLAIN output for Query 18, looks like this in MySQL 5.6:
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
| 1 | PRIMARY | customer | ALL | PRIMARY | NULL | NULL | NULL | 150000 | Using temporary; Using filesort |
| 1 | PRIMARY | orders | ref | PRIMARY,i_o_custkey | i_o_custkey | 5 | dbt3.customer.c_custkey | 7 | Using where |
| 1 | PRIMARY | lineitem | ref | PRIMARY,i_l_orderkey,i_l_orderkey_quantity | i_l_orderkey_quantity | 4 | dbt3.orders.o_orderkey | 2 | Using index |
| 2 | SUBQUERY | lineitem | index | NULL | PRIMARY | 8 | NULL | 6001215 | NULL |
+----+-------------+----------+-------+--------------------------------------------+-----------------------+---------+-------------------------+---------+---------------------------------+
The only difference is that select_type of the subquery now is
SUBQUERY. In other words, it is no longer dependent on the
preceding tables, and can be executed only once. In my run of
DBT-3 with MySQL 5.6, Query 18 takes only 6.8 seconds. Hence, we
have gone from more than a month to just a few seconds! My
colleague Guilhem has earlier written about another DBT-3 query that is improved with Subquery
Materialization.