In one of the previous articles I discussed performance of the three methods to implement an anti-join in MySQL.
Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table.
In SQL, there are at least three methods to implement it:
LEFT JOIN / IS NULL
SELECT o.* FROM outer o LEFT JOIN inner i ON i.value = o.value WHERE i.value IS NULL
NOT IN
SELECT o.* FROM outer o WHERE o.value NOT IN ( SELECT value FROM inner )
NOT EXISTS
SELECT o.* FROM outer o WHERE NOT EXISTS ( SELECT NULL FROM inner i WHERE i.value = o.value )
When inner.value
is marked as NOT NULL
,
all these queries are semantically equivalent and with proper
indexing have similarly optimized execution plans in
MySQL.
Now, what if inner.value
is not nullable and does
contain some NULL
values?
Let’s create some sample tables:
Table
creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=MyISAM; CREATE TABLE t_inner ( id INT NOT NULL PRIMARY KEY, val INT, stuffing VARCHAR(200) NOT NULL, KEY ix_inner_val (val) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; CREATE TABLE t_outer ( id INT NOT NULL PRIMARY KEY, val INT, stuffing VARCHAR(200) NOT NULL, KEY ix_outer_val (val) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(1000000); COMMIT; INSERT INTO t_inner SELECT id, NULLIF(CEILING(RAND(20100527) * 100000), 100000), RPAD('', 200, '*') FROM filler; INSERT INTO t_outer SELECT id, NULLIF(CEILING(RAND(20100527 << 1) * 100000), 100000), RPAD('', 200, '*') FROM filler;
There are two identical MyISAM tables. Each of
the tables contains 1,000,000 random values from
1 to 99,999 and also some
NULL
values. There is an index on value
in both tables.
Now, let’s check the queries.
NOT EXISTS
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE NOT EXISTS ( SELECT NULL FROM t_inner i WHERE i.val = o.val )
SUM(LENGTH(stuffing)) | COUNT(*) |
---|---|
14600 | 73 |
1 row fetched in 0.0001s (9.9061s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
2 | DEPENDENT SUBQUERY | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Using index |
Field or reference '20100527_anti.o.val' of SELECT #2 was resolved in SELECT #1 select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(exists(select NULL from `20100527_anti`.`t_inner` `i` where (`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`))))
The query completes in 9.9 seconds. As we can
see, it is optimized to use the index on t_inner.val
and return on the first match.
LEFT JOIN / IS NULL
SELECT SUM(LENGTH(o.stuffing)), COUNT(*) FROM t_outer o LEFT JOIN t_inner i ON i.val = o.val WHERE i.id IS NULL
SUM(LENGTH(o.stuffing)) | COUNT(*) |
---|---|
14600 | 73 |
1 row fetched in 0.0001s (13.5154s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | o | ALL | 1000000 | 100.00 | |||||
1 | SIMPLE | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Not exists |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`id`)
The query semantics are the same as those of NOT
EXISTS
, and we even see the Not exists
optimization in the plan, however this query performs much more
poorly than NOT EXISTS
: 13 seconds.
Why?
MySQL documentation on EXPLAIN
states that Not exists
is used to
optimize the queries similar to the one we have just run:
LEFT JOIN
with IS NULL
predicate
applied to a non-nullable column.
MySQL is aware that such a predicate can only be
satisfied by a record resulting from a JOIN
miss (i.
e. when no matching record was found in the rightmost table) and
stops reading records after first index hit.
However, this optimization is implemented in a way that is far
from being perfect. Despite the fact that no actual value of
id
can be returned by such a query, the engine still
looks up id
in the table (since it’s not a part of
the index). We can see it in the plan: unlike NOT
EXISTS
query, there is no Using index
for
t_inner
. This means that a table lookup is
performed.
Even we replace id
with val
in the
query, it still performs poorly:
SELECT SUM(LENGTH(o.stuffing)), COUNT(*) FROM t_outer o LEFT JOIN t_inner i ON i.val = o.val WHERE i.val IS NULL
SUM(LENGTH(o.stuffing)) | COUNT(*) |
---|---|
14600 | 73 |
1 row fetched in 0.0001s (14.4997s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | o | ALL | 1000000 | 100.00 | |||||
1 | SIMPLE | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Using index |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`val`)
This time, no table lookups are made but there is no Not
exists
optimization either.
Despite the fact that the join condition eliminates possibility
of an actual NULL
being returned by the query and
any val IS NULL
reaching the WHERE
clause is a result of a join miss, MySQL still
examines all records in t_inner
, not stopping after
the first hit.
This had been submitted as a bug.
Now, what about NOT IN
?
NOT IN
Unlike the previous two queries that only differ in
implementation, not in semantics, NOT IN
, being
applied as is, would yield the different results.
NOT EXISTS
and IS NULL
are two-state
predicates, they can only return TRUE
or
FALSE
. NOT IN
is a three-state
predicate: it can return TRUE
, FALSE
or
NULL
.
NULL
value is returned in two cases:
- When
t_outer.value
being tested isNULL
- When at least one of
t_inner.value
isNULL
This means that having but a single NULL
in
t_inner
would prevent the query from returning
anything.
Naive approach
Let’s see what happens if we just substitute NOT IN
instead of NOT EXISTS
:
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE val NOT IN ( SELECT val FROM t_inner i )
SUM(LENGTH(stuffing)) | COUNT(*) |
---|---|
0 | |
1 row fetched in 0.0001s (10.3748s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`)))))))
Since there are NULL
s in t_inner
, no
record in t_outer
can satisfy the predicate.
MySQL does not optimize this very well. It takes
but a single index scan to find out if there are
NULL
values in t_inner
and return if
they are, but for some reason MySQL still
applies the condition to each record in t_outer
.
Naive approach, improved
With a little help from our side, this can be improved:
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE NOT EXISTS ( SELECT NULL FROM t_inner i WHERE val IS NULL ) AND val NOT IN ( SELECT val FROM t_inner i )
SUM(LENGTH(stuffing)) | COUNT(*) |
---|---|
0 | |
1 row fetched in 0.0001s (0.0014s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | Impossible WHERE | ||||||||
3 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Full scan on NULL key |
2 | SUBQUERY | i | ref | ix_inner_val | ix_inner_val | 5 | 4 | 100.00 | Using where; Using index |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where 0
We added an explicit check for NULL
values. Since
it’s not correlated, MySQL could instantly prove
it false, cache it and avoid the table scan at all.
Ignoring right side NULLs
Now, let’s make a NOT IN
query that does not take
the NULL
values in t_inner
into
account:
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE val NOT IN ( SELECT val FROM t_inner i WHERE val IS NOT NULL )
SUM(LENGTH(stuffing)) | COUNT(*) |
---|---|
13400 | 67 |
1 row fetched in 0.0001s (10.4060s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Using where; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`)))))))
This time, the query returns records, but not as many as the previous queries did.
We made an additional check for NULL
in
t_inner
but not in t_outer
. There are
some records in t_outer
that have a
NULL
in val
. Both IN
and
NOT IN
would evaluate to NULL
and
WHERE
would filter them out.
We see another glitch in MySQL optimizer here: a
Full scan on NULL key
applied. Since NOT
IN
should always return TRUE
when the
subquery returns no records (even if the value checked is a
NULL
), on correlated queries a fullscan should be
applied to check for the records and find out whether to return
NULL
or FALSE
. However, in this case
the IN
subquery is not correlated, so the check
could only be performed once and cached, like with the LEFT
JOIN
.
In our case the overhead would be negligible, since the subquery
would return on the first match, but it could matter if we had
more NULL
values in t_outer
.
Now, what if we want NULL
records on
t_outer
to be returned as well? We just need to add
an additional check for NULL
s.
Ignoring all NULL
s
SELECT SUM(LENGTH(stuffing)), COUNT(*) FROM t_outer o WHERE val IS NULL OR val NOT IN ( SELECT val FROM t_inner i WHERE val IS NOT NULL )
SUM(LENGTH(stuffing)) | COUNT(*) |
---|---|
14600 | 73 |
1 row fetched in 0.0002s (10.4842s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | o | ALL | ix_outer_val | 1000000 | 100.00 | Using where | |||
2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Using where; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (isnull(`20100527_anti`.`o`.`val`) or (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`))))))))
Here, the query returns the same results as NOT
EXISTS
.
Full scan on NULL key
is still present in the plan
but will never actually be executed because it will be short
circuited by the previous IS NULL
check.
Summary
As was shown in the earlier article, LEFT JOIN / IS
NULL
and NOT IN
are best used to implement an
anti-join in MySQL if the columns on both sides
are not nullable.
The situation is different when the columns are nullable:
-
NOT EXISTS
performs in most straightforward way: just checks equality and returnsTRUE
orFALSE
on the first hit / miss. -
LEFT JOIN / IS NULL
either makes an additional table lookup or does not return on the first match and performs more poorly in both cases. -
NOT IN
, having different semantics, requires additional checks forNULL
values. These checks should be coded into the query
With nullable columns, NOT EXISTS
and NOT
IN
(with additional checks for NULLS
) are the
most efficient methods to implement an anti-join in
MySQL.
LEFT JOIN / IS NULL
performs poorly.