Multiple attributes in a EAV table: GROUP BY vs. NOT EXISTS

Answering questions asked on the site.

Andrew Stillard asks:

I have a store which will hold around 50,000 products in a products table. Each product will have 14 options, giving 700,000 options in total. These are held in an options table which is joined via the product id.

Users search for products based on the options via an Advanced Search menu.

The users need to be able to select multiple options upon which to query. I would normally use a JOIN if it was just the one option to select upon, but because its a variable number i thought it would be best to loop through the WHERE EXISTS statement.

The issue i have currently is that the query is taking a minimum of 18 seconds (And that was a query when the tables only had a fraction of the total products in).
If you can help us speed this up, or suggest an alternative idea that would be greatly appreciated.

The option table mentioned here is in fact an implementation of the EAV model in a relational database.

Each record basically contains 3 things: id of the product it describes; id of the option and the value of the given option for the given product. These fields represent entity, attribute and value, respectively.

This model is very easy to maintain and expand should the need arise: all we have to do to define an extra attribute is to add a record to the EAV table with the name and the value of the attribute. This is a DML operation rather than a DDL one.

However, this model has a serious drawback: we cannot efficiently search for two or more options at once. An index can only be defined on several fields from a single record, so we can only search for a single option using an index.

There are two approaches to writing a query which would search for the entities with the certain conditions on several attributes at once:

  1. For each attribute, find all entities for which the conditions on the given attribute hold, then aggregate the resulting entities, using COUNT(*) as a filter. The number of entity entries should be equal to the number of the attributes.
  2. Takes each entity and for each attribute, check if the condition holds.

The first approach uses a GROUP BY, the second one uses EXISTS.

Let’s create a sample table and see which one is more efficient:

Table creation details

CREATE TABLE filler (
        id INT NOT NULL PRIMARY KEY AUTO_INCREMENT
) ENGINE=Memory;

CREATE TABLE t_product (
        id INT NOT NULL PRIMARY KEY,
        name VARCHAR(30) NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

CREATE TABLE t_option (
        product_id INT NOT NULL,
        option_id INT NOT NULL,
        value INT NOT NULL,
        PRIMARY KEY pk_option_o_p (product_id, option_id),
        KEY ix_option_o_v (option_id, value)
) ENGINE=InnoDB DEFAULT CHARSET=UTF8;

DELIMITER $$

CREATE PROCEDURE prc_filler(cnt INT)
BEGIN
        DECLARE _cnt INT;
        SET _cnt = 1;
        WHILE _cnt <= cnt DO
                INSERT
                INTO    filler
                SELECT  _cnt;
                SET _cnt = _cnt + 1;
        END WHILE;
END
$$

DELIMITER ;

START TRANSACTION;
CALL prc_filler(500000);
COMMIT;

INSERT
INTO    t_product
SELECT  id, CONCAT('Product ', id)
FROM    filler
ORDER BY
        id
LIMIT 50000;

INSERT
INTO    t_option
SELECT  p.id, f.id, CEILING(RAND(20100402 << 1) * 100)
FROM    t_product p
CROSS JOIN
        (
        SELECT  id
        FROM    filler
        ORDER BY
                id
        LIMIT 40
        ) f;

Table t_product contains 50,000 products; table t_option (an EAV table) contains the values of 40 attributes for each products, randomly filled.

Assume we got a complex query involving attributes from 1 to 6. The value of attribute 1 should be from 0 to 100; that of attribute 2 should be from 10 to 90, etc., finally, the value of the attribute 6 should be from 50 to 50.

GROUP BY

Using the first approach, we should find all product_id‘s in the EAV table (t_option) that satisfy the ranges for each attribute, then aggregate these products and filter those whose COUNT(*) is 6. This will mean that the product was found in all 6 ranges and hence satisfies all conditions.

Here’s the query to do that:

SELECT  COUNT(*)
FROM    (
        SELECT  product_id
        FROM    (
                SELECT  product_id
                FROM    (
                        SELECT  1 AS opt, 0 AS l, 100 AS h
                        UNION ALL
                        SELECT  2 AS opt, 10 AS l, 90 AS h
                        UNION ALL
                        SELECT  3 AS opt, 20 AS l, 80 AS h
                        UNION ALL
                        SELECT  4 AS opt, 30 AS l, 70 AS h
                        UNION ALL
                        SELECT  5 AS opt, 40 AS l, 60 AS h
                        UNION ALL
                        SELECT  6 AS opt, 50 AS l, 50 AS h
                        ) v
                JOIN    t_option o
                ON      o.option_id >= opt
                        AND o.option_id <= opt
                        AND o.value BETWEEN l AND h
                ) o
        GROUP BY
                o.product_id
        HAVING  COUNT(*) = 6
        ) q;
COUNT(*)
22
1 row fetched in 0.0001s (0.3291s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY Select tables optimized away
2 DERIVED <derived3> ALL 152657 100.00 Using temporary; Using filesort
3 DERIVED <derived4> ALL 6 100.00
3 DERIVED o range ix_option_o_v ix_option_o_v 8 1090 183532.11 Range checked for each record (index map: 0×2)
4 DERIVED No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
8 UNION No tables used
9 UNION No tables used
UNION RESULT <union4,5,6,7,8,9> ALL
select count(0) AS `COUNT(*)` from (select `o`.`product_id` AS `product_id` from (select `20100402_options`.`o`.`product_id` AS `product_id` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` join `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`option_id` >= `v`.`opt`) and (`20100402_options`.`o`.`option_id` <= `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))) `o` group by `o`.`product_id` having (count(0) = 6)) `q`

Note that there are two quite counterintuitive things in the query:

  • o.option_id >= opt AND o.option_id <= opt instead of mere o.option_id = opt
  • GROUP BY is applied to a nested query rather than being put immediately after the WHERE clause

As many readers of my blog will remember, these tricks are intended to make MySQL use Range checked for each record that we can spot in the plan. If not for these tricks, MySQL would use an index scan on only a part of the composite index on (attribute, value, product), since the joins on the mixed equality/range conditions are not optimized well in the current releases (at least up to 5.1.42).

This query is quite fast: only 320 ms.

NOT EXISTS

Using this approach, the engine takes each product and checks if all of the relevant attributes satisfy the conditions. As soon as the first attribute failing the check is found, the condition is considered FALSE and further evaluation ceases.

This approach does not use aggregation (which is good), but it needs to browse all products and do random index seeks for each of the products (which is bad).

Here’s the query:

SELECT  COUNT(*)
FROM    t_product p
WHERE   NOT EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  1 AS opt, 0 AS l, 100 AS h
                UNION ALL
                SELECT  2 AS opt, 10 AS l, 90 AS h
                UNION ALL
                SELECT  3 AS opt, 20 AS l, 80 AS h
                UNION ALL
                SELECT  4 AS opt, 30 AS l, 70 AS h
                UNION ALL
                SELECT  5 AS opt, 40 AS l, 60 AS h
                UNION ALL
                SELECT  6 AS opt, 50 AS l, 50 AS h
                ) v
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    t_option o
                WHERE   o.product_id = p.id
                        AND o.option_id = opt
                        AND o.value BETWEEN l AND h
                )
        );
COUNT(*)
22
1 row fetched in 0.0001s (0.8438s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY p index PRIMARY 4 50115 100.00 Using where; Using index
2 DEPENDENT SUBQUERY <derived3> ALL 6 100.00 Using where
9 DEPENDENT SUBQUERY o eq_ref PRIMARY,ix_option_o_v PRIMARY 8 20100402_options.p.id,v.opt 1 100.00 Using where
3 DERIVED No tables used
4 UNION No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
8 UNION No tables used
UNION RESULT <union3,4,5,6,7,8> ALL
Field or reference '20100402_options.p.id' of SELECT #9 was resolved in SELECT #1
Field or reference 'v.opt' of SELECT #9 was resolved in SELECT #2
Field or reference 'v.l' of SELECT #9 was resolved in SELECT #2
Field or reference 'v.h' of SELECT #9 was resolved in SELECT #2
select count(0) AS `COUNT(*)` from `20100402_options`.`t_product` `p` where (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`p`.`id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))))))))

This query runs for 843 ms, or three times as long as its GROUP BY counterpart.

GROUP BY version seems to be the most efficient, but can we optimize it somehow?

Mixing two approaches

The main problem with the GROUP BY approach is, well, GROUP BY. In MySQL, it requires sorting or materialization (or both) which are quite expensive operations. Also, the recordset for each of the attributes is selected independently: no filter on one of the attributes affects the other ones. The filtering is only done after the aggregation.

The main problem with the NOT EXISTS approach is that every product should be checked for the conditions, though only a tiny fraction of all products satisfies at least several of them. Checking for conditions also requires index seeks in the blocks that are quite far away from each other which is not good for performance (especially if the index does not fit completely into the cache).

However, we can mix the two approaches.

Instead of using the product table as a source for all possible entities to check against the EAV table, we can take the condition on only one attribute that is the least likely to be satisfied.

Since entity/attribute pair form a natural primary key in the EAV table, filtering on an attribute is guaranteed to result in a set of unique entities. Each of these entities should then be further checked using the NOT EXISTS approach, but this time there are much fewer checks that need to be made, since with a properly chosen primary condition (which should be the most strong one) most of the non-matching entities are already filtered out.

This approach, on the one hand, does not require aggregation; on the other hand, the conditions are sieved rather then added up.

Let’s build the query:

SELECT  COUNT(*)
FROM    t_option oo
WHERE   oo.option_id = 6
        AND oo.value BETWEEN 50 AND 50
        AND NOT EXISTS
        (
        SELECT  NULL
        FROM    (
                SELECT  1 AS opt, 0 AS l, 100 AS h
                UNION ALL
                SELECT  2 AS opt, 10 AS l, 90 AS h
                UNION ALL
                SELECT  3 AS opt, 20 AS l, 80 AS h
                UNION ALL
                SELECT  4 AS opt, 30 AS l, 70 AS h
                UNION ALL
                SELECT  5 AS opt, 40 AS l, 60 AS h
                ) v
        WHERE   NOT EXISTS
                (
                SELECT  NULL
                FROM    t_option o
                WHERE   o.product_id = oo.product_id
                        AND o.option_id = opt
                        AND o.value BETWEEN l AND h
                )
        );
COUNT(*)
22
1 row fetched in 0.0001s (0.0105s)
id select_type table type possible_keys key key_len ref rows filtered Extra
1 PRIMARY oo ref ix_option_o_v ix_option_o_v 8 const,const 1090 100.00 Using where; Using index
2 DEPENDENT SUBQUERY <derived3> ALL 5 100.00 Using where
8 DEPENDENT SUBQUERY o eq_ref PRIMARY,ix_option_o_v PRIMARY 8 20100402_options.oo.product_id,v.opt 1 100.00 Using where
3 DERIVED No tables used
4 UNION No tables used
5 UNION No tables used
6 UNION No tables used
7 UNION No tables used
UNION RESULT <union3,4,5,6,7> ALL
Field or reference '20100402_options.oo.product_id' of SELECT #8 was resolved in SELECT #1
Field or reference 'v.opt' of SELECT #8 was resolved in SELECT #2
Field or reference 'v.l' of SELECT #8 was resolved in SELECT #2
Field or reference 'v.h' of SELECT #8 was resolved in SELECT #2
select count(0) AS `COUNT(*)` from `20100402_options`.`t_option` `oo` where ((`20100402_options`.`oo`.`option_id` = 6) and (`20100402_options`.`oo`.`value` between 50 and 50) and (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`oo`.`product_id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`)))))))))

As we can see, using the most selective condition as an initial row source for the further checks greatly improved the query speed.

The query time is now only 10 ms, or 30 times as fast as the GROUP BY approach.

Hope that helps.

I’m always glad to answer the questions regarding database queries.

Ask me a question