If you use the EXPLAIN SELECT statement to see how your
subqueries are treated by MySQL, you may sometimes meet the
"unique_subquery" optimization. Here is how the manual describes it:
"unique_subquery: this type replaces ref for some
IN subqueries of the following form: value IN (SELECT
primary_key FROM single_table WHERE some_expr);
unique_subquery is just an index lookup function that replaces
the subquery completely for better efficiency".Few weeks ago,
while I was reviewing a patch fixing a bug in unique_subquery, I got a
"simplification" pulsion. I told myself that:
- unique_subquery is an optimization for a special case of simple subqueries (single inner table, using index, no aggregates);
- we have a more general system around, used for more complex subqueries, naturally capable of handling simple ones too if we wanted;
- this general system does not have the bug in question...
Then I wondered: what if we removed the unique_subquery
optimization, and let the general system handle this simple
subquery? This would certainly simplify code, and thus
maintainance...But before removing it, of course, we should check
whether unique_subquery brings a significant performance
benefit.
So today I'm testing unique_subquery against the DBT3 benchmark.
I grab a copy of MySQL 5.6.3, and focus on the sixteenth query of
DBT3, which contains a subquery (in red) suitable for handling by
unique_subquery:
select p_brand, p_type, p_size, count(distinct ps_suppkey) as supplier_cnt from partsupp, part where p_partkey = ps_partkey and p_brand <> 'Brand#23' and p_type not like 'LARGE PLATED%' and p_size in (43, 1, 25, 5, 35, 12, 42, 40) and ps_suppkey not in ( select s_suppkey from supplier where s_comment like '%Customer%Complaints%' ) group by p_brand, p_type, p_size order by supplier_cnt desc, p_brand, p_type, p_size;
This query executes in 0.65 seconds on my Linux box, and EXPLAIN
is:
+----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+ | 1 | PRIMARY | part | ALL | PRIMARY | NULL | NULL | NULL | 199498 | Using where; Using temporary; Using filesort | | 1 | PRIMARY | partsupp | ref | PRIMARY,i_ps_partkey | i_ps_partkey | 4 | dbt3.part.p_partkey | 2 | Using where; Using index | | 2 | DEPENDENT SUBQUERY | supplier | unique_subquery | PRIMARY | PRIMARY | 4 | func | 1 | Using where | +----+--------------------+----------+-----------------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
When I disable unique_subquery (by modifying MySQL's C++ code),
EXPLAIN becomes:
+----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+ | 1 | PRIMARY | part | ALL | PRIMARY | NULL | NULL | NULL | 199498 | Using where; Using temporary; Using filesort | | 1 | PRIMARY | partsupp | ref | PRIMARY,i_ps_partkey | i_ps_partkey | 4 | dbt3.part.p_partkey | 2 | Using where; Using index | | 2 | DEPENDENT SUBQUERY | supplier | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | Using where | +----+--------------------+----------+--------+----------------------+--------------+---------+---------------------+--------+----------------------------------------------+
The only change, as expected, is that "unique_subquery" becomes
"eq_ref". The used index is the same (the primary key of the
"supplier" table). The optimizer has the same notion of unicity:
"unique_subquery" and "eq_ref" both denote that a single lookup
is needed, as the index is UNIQUE. Same index, same number of
lookups: execution could well be as fast with "eq_ref" as it was
with "unique_subquery".
But... no. Query now executes in 0.80 seconds. 23% slower
than with unique_subquery!
Finer-grained timing shows that the extra 0.15 seconds are indeed
lost in the subquery evaluation code.
To understand this, let's follow the execution in detail, based
on EXPLAIN output above.
- First line of EXPLAIN output: we do a table scan on the "part" table ("type=ALL" means "table scan") . The "rows" column of EXPLAIN suggests that we are going to have 199,498 rows of "part".
- Second line of EXPLAIN output: for each row from the "part" table, we do an index lookup ("ref") into the "i_ps_partkey" index of the "partsupp" table; apparently such lookup will find two rows ("rows=2").
- At this point, we have a row made of needed columns of "part" and of "partsupp". An upper estimate of the number of those rows is 199,498 multiplied by 2: 400,000. Actually, the real number is around 120,000 (there has been filtering going on, as the "Using where" indicates).
- Then we evaluate the WHERE clause and thus the "NOT IN (subquery)" predicate (the "DEPENDENT SUBQUERY"). 120,000 evaluations of such predicate. And that's where the difference is.
EXPLAIN EXTENDED and then SHOW WARNINGS show how the
predicate
looks like. Let's start with the case where unique_subquery is
disabled:
/* select#1 */ select `dbt3`.`part`.`p_brand` AS `p_brand`,`dbt3`.`part`.`p_type` AS `p_type`,`dbt3`.`part`.`p_size` AS `p_size`,count(distinct `dbt3`.`partsupp`.`ps_suppkey`) AS `supplier_cnt` from `dbt3`.`partsupp` join `dbt3`.`part` where ((`dbt3`.`partsupp`.`ps_partkey` = `dbt3`.`part`.`p_partkey`) and (`dbt3`.`part`.`p_brand` <> 'Brand#23') and (not((`dbt3`.`part`.`p_type` like 'LARGE PLATED%'))) and (`dbt3`.`part`.`p_size` in (43,1,25,5,35,12,42,40)) and (not(<in_optimizer>(`dbt3`.`partsupp`.`ps_suppkey`,<exists>(/* select#2 */ select 1 from `dbt3`.`supplier` where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`))))))) group by `dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size` order by count(distinct `dbt3`.`partsupp`.`ps_suppkey`) desc,`dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size`
Above, the part in red says that
ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
has been transformed from "IN(non correlated subquery)" to
"EXISTS(correlated subquery)", yielding this:
not exists (
select
1
from
supplier
where
s_comment like '%Customer%Complaints%'
AND s_suppkey = ps_suppkey
)
or, more exactly (leaving out the NOT operator, for
brevity):
<exists>(/* select#2 */ select 1 from `dbt3`.`supplier`
where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%')
and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`)))
Evaluating this EXISTS() evaluates the new subquery. This means
all the subquery evaluation machinery: calls to JOIN::exec(),
sub_select(), evaluate_join_record()... Sure, deep down it does
an index lookup like unique_subquery does, but all those function
calls have a cost, and so has all the logic which is lying around
ready to handle any complexity in the subquery, as this is
generic subquery evaluation code ("if group_by do this", "if
order_by do this", "if left_join do this": none of those if()s
are entered, but deciding to enter them or not has a cost). Plus
some initialization code. Plus some de-initialization code. This
overhead, repeated 120,000 times, amounts to 0.15
seconds...
Now, EXPLAIN EXTENDED when unique_subquery is enabled:
/* select#1 */ select `dbt3`.`part`.`p_brand` AS `p_brand`,`dbt3`.`part`.`p_type` AS `p_type`,`dbt3`.`part`.`p_size` AS `p_size`,count(distinct `dbt3`.`partsupp`.`ps_suppkey`) AS `supplier_cnt` from `dbt3`.`partsupp` join `dbt3`.`part` where ((`dbt3`.`partsupp`.`ps_partkey` = `dbt3`.`part`.`p_partkey`) and (`dbt3`.`part`.`p_brand` <> 'Brand#23') and (not((`dbt3`.`part`.`p_type` like 'LARGE PLATED%'))) and (`dbt3`.`part`.`p_size` in (43,1,25,5,35,12,42,40)) and (not(<in_optimizer>(`dbt3`.`partsupp`.`ps_suppkey`,<exists>(<primary_index_lookup>(<cache>(`dbt3`.`partsupp`.`ps_suppkey`) in supplier on PRIMARY where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%') and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`)))))))) group by `dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size` order by count(distinct `dbt3`.`partsupp`.`ps_suppkey`) desc,`dbt3`.`part`.`p_brand`,`dbt3`.`part`.`p_type`,`dbt3`.`part`.`p_size`
The optimizer has first done the same transformation (IN to
EXISTS) as we saw before, then has done one more transformation,
and EXISTS has become, as written in red above:
<exists>(<primary_index_lookup>(<cache>(`dbt3`.`partsupp`.`ps_suppkey`)
in supplier on PRIMARY
where ((`dbt3`.`supplier`.`s_comment` like '%Customer%Complaints%')
and (<cache>(`dbt3`.`partsupp`.`ps_suppkey`) = `dbt3`.`supplier`.`s_suppkey`))))
which is directly an index lookup
("<primary_index_lookup>"), followed by an additional WHERE
clause. So the overhead of full-blown subquery evaluation
is
avoided. And this overhead is not neglectable, compared to the
index lookup (assuming the relevant index pages are already in
memory).
So the conclusion of my experiment is that unique_subquery is
worth having. I'll have to direct simplification pulsions to some
other code!
Note that there also exists a similar "index_subquery"
optimization applying to non-unique indices. And it's worth
having, for the same reasons.