LEFT JOIN / IS NULL vs. NOT IN vs. NOT EXISTS: nullable columns

In one of the previous articles I discussed performance of the three methods to implement an anti-join in MySQL.

Just a quick reminder: an anti-join is an operation that returns all records from one table which share a value of a certain column with no records from another table.

In SQL, there are at least three methods to implement it:

LEFT JOIN / IS NULL

SELECT  o.*
FROM    outer o
LEFT JOIN
        inner i
ON      i.value = o.value
WHERE   i.value IS NULL

NOT IN

SELECT  o.*
FROM    outer o
WHERE   o.value NOT IN
        (
        SELECT  value
        FROM    inner
        )

NOT EXISTS

SELECT  o.*
FROM    outer o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    inner i
        WHERE   i.value = o.value
        )

When inner.value is marked as NOT NULL, all these queries are semantically equivalent and with proper indexing have similarly optimized execution plans in MySQL.

Now, what if inner.value is not nullable and does contain some NULL values?

Let’s create some sample tables:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=MyISAM;

CREATE TABLE t_inner (
        id INT NOT NULL PRIMARY KEY,
        val INT,
        stuffing VARCHAR(200) NOT NULL,
        KEY ix_inner_val (val)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

CREATE TABLE t_outer (
        id INT NOT NULL PRIMARY KEY,
        val INT,
        stuffing VARCHAR(200) NOT NULL,
        KEY ix_outer_val (val)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(1000000);
COMMIT;

INSERT
INTO    t_inner
SELECT  id,
        NULLIF(CEILING(RAND(20100527) * 100000), 100000),
        RPAD('', 200, '*')
FROM    filler;

INSERT
INTO    t_outer
SELECT  id,
        NULLIF(CEILING(RAND(20100527 << 1) * 100000), 100000),
        RPAD('', 200, '*')
FROM    filler;

There are two identical MyISAM tables. Each of the tables contains 1,000,000 random values from 1 to 99,999 and also some NULL values. There is an index on value in both tables.

Now, let’s check the queries.

NOT EXISTS

SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_inner i
        WHERE   i.val = o.val
        )
SUM(LENGTH(stuffing)) COUNT(*)
14600 73
1 row fetched in 0.0001s (9.9061s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY o ALL 1000000 100.00 Using where
2 DEPENDENT SUBQUERY i ref ix_inner_val ix_inner_val 5 20100527_anti.o.val 10 100.00 Using where; Using index
Field or reference '20100527_anti.o.val' of SELECT #2 was resolved in SELECT #1
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(exists(select NULL from `20100527_anti`.`t_inner` `i` where (`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`))))

The query completes in 9.9 seconds. As we can see, it is optimized to use the index on t_inner.val and return on the first match.

LEFT JOIN / IS NULL

SELECT  SUM(LENGTH(o.stuffing)), COUNT(*)
FROM    t_outer o
LEFT JOIN
        t_inner i
ON      i.val = o.val
WHERE   i.id IS NULL

SUM(LENGTH(o.stuffing)) COUNT(*)
14600 73
1 row fetched in 0.0001s (13.5154s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE o ALL 1000000 100.00
1 SIMPLE i ref ix_inner_val ix_inner_val 5 20100527_anti.o.val 10 100.00 Using where; Not exists
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`id`)

The query semantics are the same as those of NOT EXISTS, and we even see the Not exists optimization in the plan, however this query performs much more poorly than NOT EXISTS: 13 seconds. Why?

MySQL documentation on EXPLAIN states that Not exists is used to optimize the queries similar to the one we have just run: LEFT JOIN with IS NULL predicate applied to a non-nullable column.

MySQL is aware that such a predicate can only be satisfied by a record resulting from a JOIN miss (i. e. when no matching record was found in the rightmost table) and stops reading records after first index hit.

However, this optimization is implemented in a way that is far from being perfect. Despite the fact that no actual value of id can be returned by such a query, the engine still looks up id in the table (since it’s not a part of the index). We can see it in the plan: unlike NOT EXISTS query, there is no Using index for t_inner. This means that a table lookup is performed.

Even we replace id with val in the query, it still performs poorly:

SELECT  SUM(LENGTH(o.stuffing)), COUNT(*)
FROM    t_outer o
LEFT JOIN
        t_inner i
ON      i.val = o.val
WHERE   i.val IS NULL

SUM(LENGTH(o.stuffing)) COUNT(*)
14600 73
1 row fetched in 0.0001s (14.4997s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 SIMPLE o ALL 1000000 100.00
1 SIMPLE i ref ix_inner_val ix_inner_val 5 20100527_anti.o.val 10 100.00 Using where; Using index
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(o.stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` left join `20100527_anti`.`t_inner` `i` on((`20100527_anti`.`i`.`val` = `20100527_anti`.`o`.`val`)) where isnull(`20100527_anti`.`i`.`val`)

This time, no table lookups are made but there is no Not exists optimization either.

Despite the fact that the join condition eliminates possibility of an actual NULL being returned by the query and any val IS NULL reaching the WHERE clause is a result of a join miss, MySQL still examines all records in t_inner, not stopping after the first hit.

This had been submitted as a bug.

Now, what about NOT IN?

NOT IN

Unlike the previous two queries that only differ in implementation, not in semantics, NOT IN, being applied as is, would yield the different results.

NOT EXISTS and IS NULL are two-state predicates, they can only return TRUE or FALSE. NOT IN is a three-state predicate: it can return TRUE, FALSE or NULL.

NULL value is returned in two cases:

  • When t_outer.value being tested is NULL
  • When at least one of t_inner.value is NULL

This means that having but a single NULL in t_inner would prevent the query from returning anything.

Naive approach

Let’s see what happens if we just substitute NOT IN instead of NOT EXISTS:

SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   val NOT IN
        (
        SELECT  val
        FROM    t_inner i
        )
SUM(LENGTH(stuffing)) COUNT(*)
0
1 row fetched in 0.0001s (10.3748s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY o ALL 1000000 100.00 Using where
2 DEPENDENT SUBQUERY i index_subquery ix_inner_val ix_inner_val 5 func 20 100.00 Using index; Full scan on NULL key
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`)))))))

Since there are NULLs in t_inner, no record in t_outer can satisfy the predicate.

MySQL does not optimize this very well. It takes but a single index scan to find out if there are NULL values in t_inner and return if they are, but for some reason MySQL still applies the condition to each record in t_outer.

Naive approach, improved

With a little help from our side, this can be improved:

SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    t_inner i
        WHERE   val IS NULL
        )
        AND val NOT IN
        (
        SELECT  val
        FROM    t_inner i
        )
SUM(LENGTH(stuffing)) COUNT(*)
0
1 row fetched in 0.0001s (0.0014s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY Impossible WHERE
3 DEPENDENT SUBQUERY i index_subquery ix_inner_val ix_inner_val 5 func 20 100.00 Using index; Full scan on NULL key
2 SUBQUERY i ref ix_inner_val ix_inner_val 5 4 100.00 Using where; Using index
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where 0

We added an explicit check for NULL values. Since it’s not correlated, MySQL could instantly prove it false, cache it and avoid the table scan at all.

Ignoring right side NULLs

Now, let’s make a NOT IN query that does not take the NULL values in t_inner into account:

SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   val NOT IN
        (
        SELECT  val
        FROM    t_inner i
        WHERE   val IS NOT NULL
        )
SUM(LENGTH(stuffing)) COUNT(*)
13400 67
1 row fetched in 0.0001s (10.4060s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY o ALL 1000000 100.00 Using where
2 DEPENDENT SUBQUERY i index_subquery ix_inner_val ix_inner_val 5 func 20 100.00 Using index; Using where; Full scan on NULL key
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`)))))))

This time, the query returns records, but not as many as the previous queries did.

We made an additional check for NULL in t_inner but not in t_outer. There are some records in t_outer that have a NULL in val. Both IN and NOT IN would evaluate to NULL and WHERE would filter them out.

We see another glitch in MySQL optimizer here: a Full scan on NULL key applied. Since NOT IN should always return TRUE when the subquery returns no records (even if the value checked is a NULL), on correlated queries a fullscan should be applied to check for the records and find out whether to return NULL or FALSE. However, in this case the IN subquery is not correlated, so the check could only be performed once and cached, like with the LEFT JOIN.

In our case the overhead would be negligible, since the subquery would return on the first match, but it could matter if we had more NULL values in t_outer.

Now, what if we want NULL records on t_outer to be returned as well? We just need to add an additional check for NULLs.

Ignoring all NULLs

SELECT  SUM(LENGTH(stuffing)), COUNT(*)
FROM    t_outer o
WHERE   val IS NULL
        OR val NOT IN
        (
        SELECT  val
        FROM    t_inner i
        WHERE   val IS NOT NULL
        )
SUM(LENGTH(stuffing)) COUNT(*)
14600 73
1 row fetched in 0.0002s (10.4842s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY o ALL ix_outer_val 1000000 100.00 Using where
2 DEPENDENT SUBQUERY i index_subquery ix_inner_val ix_inner_val 5 func 20 100.00 Using index; Using where; Full scan on NULL key
select sum(length(`20100527_anti`.`o`.`stuffing`)) AS `SUM(LENGTH(stuffing))`,count(0) AS `COUNT(*)` from `20100527_anti`.`t_outer` `o` where (isnull(`20100527_anti`.`o`.`val`) or (not(<in_optimizer>(`20100527_anti`.`o`.`val`,<exists>(<index_lookup>(<cache>(`20100527_anti`.`o`.`val`) in t_inner on ix_inner_val checking NULL where (`20100527_anti`.`i`.`val` is not null) having trigcond(<is_not_null_test>(`20100527_anti`.`i`.`val`))))))))

Here, the query returns the same results as NOT EXISTS.

Full scan on NULL key is still present in the plan but will never actually be executed because it will be short circuited by the previous IS NULL check.

Summary

As was shown in the earlier article, LEFT JOIN / IS NULL and NOT IN are best used to implement an anti-join in MySQL if the columns on both sides are not nullable.

The situation is different when the columns are nullable:

  • NOT EXISTS performs in most straightforward way: just checks equality and returns TRUE or FALSE on the first hit / miss.
  • LEFT JOIN / IS NULL either makes an additional table lookup or does not return on the first match and performs more poorly in both cases.
  • NOT IN, having different semantics, requires additional checks for NULL values. These checks should be coded into the query

With nullable columns, NOT EXISTS and NOT IN (with additional checks for NULLS) are the most efficient methods to implement an anti-join in MySQL.

LEFT JOIN / IS NULL performs poorly.