In one of the previous articles I discussed performance of the three methods to implement an anti-join in MySQL.
Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table.
In SQL, there are at least three methods to implement it:
LEFT JOIN / IS NULL
SELECT o.*
FROM outer o
LEFT JOIN
inner i
ON i.value = o.value
WHERE i.value IS NULL
NOT IN
SELECT o.*
FROM outer o
WHERE o.value NOT IN
(
SELECT value
FROM inner
)
NOT EXISTS
SELECT o.*
FROM outer o
WHERE NOT EXISTS
(
SELECT NULL
FROM inner i
WHERE i.value = o.value
)
When inner.value is marked as NOT NULL,
all these queries are semantically equivalent and with proper
indexing have similarly optimized execution plans in
MySQL.
Now, what if inner.value is not nullable and does
contain some NULL values?
Let’s create some sample tables:
Table
creation details
CREATE TABLE filler (
id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=MyISAM;
CREATE TABLE t_inner (
id INT NOT NULL PRIMARY KEY,
val INT,
stuffing VARCHAR(200) NOT NULL,
KEY ix_inner_val (val)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
CREATE TABLE t_outer (
id INT NOT NULL PRIMARY KEY,
val INT,
stuffing VARCHAR(200) NOT NULL,
KEY ix_outer_val (val)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
DELIMITER $$
CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
DECLARE _cnt INT;
SET _cnt = 1;
WHILE _cnt <= cnt DO
INSERT
INTO filler
SELECT _cnt;
SET _cnt = _cnt + 1;
END WHILE;
END
$$
DELIMITER ;
START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;
INSERT
INTO t_inner
SELECT id,
NULLIF(CEILING(RAND(20100527) * 100000), 100000),
RPAD('', 200, '*')
FROM filler;
INSERT
INTO t_outer
SELECT id,
NULLIF(CEILING(RAND(20100527 << 1) * 100000), 100000),
RPAD('', 200, '*')
FROM filler;
There are two identical MyISAM tables. Each of
the tables contains 1,000,000 random values from
1 to 99,999 and also some
NULL values. There is an index on value
in both tables.
Now, let’s check the queries.
NOT EXISTS
SELECT SUM(LENGTH(stuffing)), COUNT(*)
FROM t_outer o
WHERE NOT EXISTS
(
SELECT NULL
FROM t_inner i
WHERE i.val = o.val
)
| SUM(LENGTH(stuffing)) | COUNT(*) |
|---|---|
| 14600 | 73 |
| 1 row fetched in 0.0001s (9.9061s) |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
| 2 | DEPENDENT SUBQUERY | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Using index |
Field or reference '20100527_anti.o.val' of SELECT #2 was resolved in SELECT #1 select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(exists(select NULL from `20100527_anti`.`t_inner` `i` where (`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`))))
The query completes in 9.9 seconds. As we can
see, it is optimized to use the index on t_inner.val
and return on the first match.
LEFT JOIN / IS NULL
SELECT SUM(LENGTH(o.stuffing)), COUNT(*)
FROM t_outer o
LEFT JOIN
t_inner i
ON i.val = o.val
WHERE i.id IS NULL
| SUM(LENGTH(o.stuffing)) | COUNT(*) |
|---|---|
| 14600 | 73 |
| 1 row fetched in 0.0001s (13.5154s) |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | o | ALL | 1000000 | 100.00 | |||||
| 1 | SIMPLE | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Not exists |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`id`)
The query semantics are the same as those of NOT
EXISTS, and we even see the Not exists
optimization in the plan, however this query performs much more
poorly than NOT EXISTS: 13 seconds.
Why?
MySQL documentation on EXPLAIN
states that Not exists is used to
optimize the queries similar to the one we have just run:
LEFT JOIN with IS NULL predicate
applied to a non-nullable column.
MySQL is aware that such a predicate can only be
satisfied by a record resulting from a JOIN miss (i.
e. when no matching record was found in the rightmost table) and
stops reading records after first index hit.
However, this optimization is implemented in a way that is far
from being perfect. Despite the fact that no actual value of
id can be returned by such a query, the engine still
looks up id in the table (since it’s not a part of
the index). We can see it in the plan: unlike NOT
EXISTS query, there is no Using index for
t_inner. This means that a table lookup is
performed.
Even we replace id with val in the
query, it still performs poorly:
SELECT SUM(LENGTH(o.stuffing)), COUNT(*)
FROM t_outer o
LEFT JOIN
t_inner i
ON i.val = o.val
WHERE i.val IS NULL
| SUM(LENGTH(o.stuffing)) | COUNT(*) |
|---|---|
| 14600 | 73 |
| 1 row fetched in 0.0001s (14.4997s) |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | SIMPLE | o | ALL | 1000000 | 100.00 | |||||
| 1 | SIMPLE | i | ref | ix_inner_val | ix_inner_val | 5 | 20100527_anti.o.val | 10 | 100.00 | Using where; Using index |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`val`)
This time, no table lookups are made but there is no Not
exists optimization either.
Despite the fact that the join condition eliminates possibility
of an actual NULL being returned by the query and
any val IS NULL reaching the WHERE
clause is a result of a join miss, MySQL still
examines all records in t_inner, not stopping after
the first hit.
This had been submitted as a bug.
Now, what about NOT IN?
NOT IN
Unlike the previous two queries that only differ in
implementation, not in semantics, NOT IN, being
applied as is, would yield the different results.
NOT EXISTS and IS NULL are two-state
predicates, they can only return TRUE or
FALSE. NOT IN is a three-state
predicate: it can return TRUE, FALSE or
NULL.
NULL value is returned in two cases:
- When
t_outer.valuebeing tested isNULL - When at least one of
t_inner.valueisNULL
This means that having but a single NULL in
t_inner would prevent the query from returning
anything.
Naive approach
Let’s see what happens if we just substitute NOT IN
instead of NOT EXISTS:
SELECT SUM(LENGTH(stuffing)), COUNT(*)
FROM t_outer o
WHERE val NOT IN
(
SELECT val
FROM t_inner i
)
| SUM(LENGTH(stuffing)) | COUNT(*) |
|---|---|
| 0 | |
| 1 row fetched in 0.0001s (10.3748s) |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
| 2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`)))))))
Since there are NULLs in t_inner, no
record in t_outer can satisfy the predicate.
MySQL does not optimize this very well. It takes
but a single index scan to find out if there are
NULL values in t_inner and return if
they are, but for some reason MySQL still
applies the condition to each record in t_outer.
Naive approach, improved
With a little help from our side, this can be improved:
SELECT SUM(LENGTH(stuffing)), COUNT(*)
FROM t_outer o
WHERE NOT EXISTS
(
SELECT NULL
FROM t_inner i
WHERE val IS NULL
)
AND val NOT IN
(
SELECT val
FROM t_inner i
)
| SUM(LENGTH(stuffing)) | COUNT(*) |
|---|---|
| 0 | |
| 1 row fetched in 0.0001s (0.0014s) |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | Impossible WHERE | ||||||||
| 3 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Full scan on NULL key |
| 2 | SUBQUERY | i | ref | ix_inner_val | ix_inner_val | 5 | 4 | 100.00 | Using where; Using index |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where 0
We added an explicit check for NULL values. Since
it’s not correlated, MySQL could instantly prove
it false, cache it and avoid the table scan at all.
Ignoring right side NULLs
Now, let’s make a NOT IN query that does not take
the NULL values in t_inner into
account:
SELECT SUM(LENGTH(stuffing)), COUNT(*)
FROM t_outer o
WHERE val NOT IN
(
SELECT val
FROM t_inner i
WHERE val IS NOT NULL
)
| SUM(LENGTH(stuffing)) | COUNT(*) |
|---|---|
| 13400 | 67 |
| 1 row fetched in 0.0001s (10.4060s) |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | o | ALL | 1000000 | 100.00 | Using where | ||||
| 2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Using where; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`)))))))
This time, the query returns records, but not as many as the previous queries did.
We made an additional check for NULL in
t_inner but not in t_outer. There are
some records in t_outer that have a
NULL in val. Both IN and
NOT IN would evaluate to NULL and
WHERE would filter them out.
We see another glitch in MySQL optimizer here: a
Full scan on NULL key applied. Since NOT
IN should always return TRUE when the
subquery returns no records (even if the value checked is a
NULL), on correlated queries a fullscan should be
applied to check for the records and find out whether to return
NULL or FALSE. However, in this case
the IN subquery is not correlated, so the check
could only be performed once and cached, like with the LEFT
JOIN.
In our case the overhead would be negligible, since the subquery
would return on the first match, but it could matter if we had
more NULL values in t_outer.
Now, what if we want NULL records on
t_outer to be returned as well? We just need to add
an additional check for NULLs.
Ignoring all NULLs
SELECT SUM(LENGTH(stuffing)), COUNT(*)
FROM t_outer o
WHERE val IS NULL
OR val NOT IN
(
SELECT val
FROM t_inner i
WHERE val IS NOT NULL
)
| SUM(LENGTH(stuffing)) | COUNT(*) |
|---|---|
| 14600 | 73 |
| 1 row fetched in 0.0002s (10.4842s) |
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | PRIMARY | o | ALL | ix_outer_val | 1000000 | 100.00 | Using where | |||
| 2 | DEPENDENT SUBQUERY | i | index_subquery | ix_inner_val | ix_inner_val | 5 | func | 20 | 100.00 | Using index; Using where; Full scan on NULL key |
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (isnull(`20100527_anti`.`o`.`val`) or (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`))))))))
Here, the query returns the same results as NOT
EXISTS.
Full scan on NULL key is still present in the plan
but will never actually be executed because it will be short
circuited by the previous IS NULL check.
Summary
As was shown in the earlier article, LEFT JOIN / IS
NULL and NOT IN are best used to implement an
anti-join in MySQL if the columns on both sides
are not nullable.
The situation is different when the columns are nullable:
-
NOT EXISTSperforms in most straightforward way: just checks equality and returnsTRUEorFALSEon the first hit / miss. -
LEFT JOIN / IS NULLeither makes an additional table lookup or does not return on the first match and performs more poorly in both cases. -
NOT IN, having different semantics, requires additional checks forNULLvalues. These checks should be coded into the query
With nullable columns, NOT EXISTS and NOT
IN (with additional checks for NULLS) are the
most efficient methods to implement an anti-join in
MySQL.
LEFT JOIN / IS NULL performs poorly.