Answering questions asked on the site.
Andrew Stillard asks:
I have a store which will hold around 50,000 products in a products table. Each product will have 14 options, giving 700,000 options in total. These are held in an options table which is joined via the product id.
Users search for products based on the options via an Advanced Search menu.
The users need to be able to select multiple options upon which
to query. I would normally use a JOIN
if it was just
the one option to select upon, but because its a variable number
i thought it would be best to loop through the WHERE
EXISTS
statement.
The issue i have currently is that the query is taking a minimum
of 18 seconds (And that was a query when the tables only had a
fraction of the total products in).
If you can help us speed this up, or suggest an alternative idea
that would be greatly appreciated.
The option table mentioned here is in fact an implementation of the EAV model in a relational database.
Each record basically contains 3 things:
id
of the product it describes; id
of
the option and the value of the given option for the given
product. These fields represent entity,
attribute and value,
respectively.
This model is very easy to maintain and expand should the need arise: all we have to do to define an extra attribute is to add a record to the EAV table with the name and the value of the attribute. This is a DML operation rather than a DDL one.
However, this model has a serious drawback: we cannot efficiently search for two or more options at once. An index can only be defined on several fields from a single record, so we can only search for a single option using an index.
There are two approaches to writing a query which would search for the entities with the certain conditions on several attributes at once:
- For each attribute, find all entities for which the
conditions on the given attribute hold, then aggregate the
resulting entities, using
COUNT(*)
as a filter. The number of entity entries should be equal to the number of the attributes. - Takes each entity and for each attribute, check if the condition holds.
The first approach uses a GROUP BY
, the second one
uses EXISTS
.
Let’s create a sample table and see which one is more
efficient:
Table
creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_product ( id INT NOT NULL PRIMARY KEY, name VARCHAR(30) NOT NULL ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; CREATE TABLE t_option ( product_id INT NOT NULL, option_id INT NOT NULL, value INT NOT NULL, PRIMARY KEY pk_option_o_p (product_id, option_id), KEY ix_option_o_v (option_id, value) ) ENGINE=InnoDB DEFAULT CHARSET=UTF8; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(500000); COMMIT; INSERT INTO t_product SELECT id, CONCAT('Product ', id) FROM filler ORDER BY id LIMIT 50000; INSERT INTO t_option SELECT p.id, f.id, CEILING(RAND(20100402 << 1) * 100) FROM t_product p CROSS JOIN ( SELECT id FROM filler ORDER BY id LIMIT 40 ) f;
Table t_product
contains 50,000
products; table t_option
(an EAV
table) contains the values of 40 attributes for
each products, randomly filled.
Assume we got a complex query involving attributes from 1 to 6. The value of attribute 1 should be from 0 to 100; that of attribute 2 should be from 10 to 90, etc., finally, the value of the attribute 6 should be from 50 to 50.
GROUP BY
Using the first approach, we should find all
product_id‘s in the EAV table
(t_option) that satisfy the ranges for each
attribute, then aggregate these products and filter those whose
COUNT(*)
is 6. This will mean that
the product was found in all 6 ranges and hence
satisfies all conditions.
Here’s the query to do that:
SELECT COUNT(*) FROM ( SELECT product_id FROM ( SELECT product_id FROM ( SELECT 1 AS opt, 0 AS l, 100 AS h UNION ALL SELECT 2 AS opt, 10 AS l, 90 AS h UNION ALL SELECT 3 AS opt, 20 AS l, 80 AS h UNION ALL SELECT 4 AS opt, 30 AS l, 70 AS h UNION ALL SELECT 5 AS opt, 40 AS l, 60 AS h UNION ALL SELECT 6 AS opt, 50 AS l, 50 AS h ) v JOIN t_option o ON o.option_id >= opt AND o.option_id <= opt AND o.value BETWEEN l AND h ) o GROUP BY o.product_id HAVING COUNT(*) = 6 ) q;
COUNT(*) |
---|
22 |
1 row fetched in 0.0001s (0.3291s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | Select tables optimized away | ||||||||
2 | DERIVED | <derived3> | ALL | 152657 | 100.00 | Using temporary; Using filesort | ||||
3 | DERIVED | <derived4> | ALL | 6 | 100.00 | |||||
3 | DERIVED | o | range | ix_option_o_v | ix_option_o_v | 8 | 1090 | 183532.11 | Range checked for each record (index map: 0×2) | |
4 | DERIVED | No tables used | ||||||||
5 | UNION | No tables used | ||||||||
6 | UNION | No tables used | ||||||||
7 | UNION | No tables used | ||||||||
8 | UNION | No tables used | ||||||||
9 | UNION | No tables used | ||||||||
UNION RESULT | <union4,5,6,7,8,9> | ALL |
select count(0) AS `COUNT(*)` from (select `o`.`product_id` AS `product_id` from (select `20100402_options`.`o`.`product_id` AS `product_id` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` join `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`option_id` >= `v`.`opt`) and (`20100402_options`.`o`.`option_id` <= `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))) `o` group by `o`.`product_id` having (count(0) = 6)) `q`
Note that there are two quite counterintuitive things in the query:
-
o.option_id >= opt AND o.option_id <= opt
instead of mereo.option_id = opt
-
GROUP BY
is applied to a nested query rather than being put immediately after theWHERE
clause
As many readers of my blog will remember, these tricks are
intended to make MySQL use Range checked
for each record
that we can spot in the plan. If not for
these tricks, MySQL would use an index scan on
only a part of the composite index on (attribute, value,
product)
, since the joins on the mixed equality/range
conditions are not optimized well in the current releases (at
least up to 5.1.42).
This query is quite fast: only 320 ms.
NOT EXISTS
Using this approach, the engine takes each product and checks if all of the relevant attributes satisfy the conditions. As soon as the first attribute failing the check is found, the condition is considered FALSE and further evaluation ceases.
This approach does not use aggregation (which is good), but it needs to browse all products and do random index seeks for each of the products (which is bad).
Here’s the query:
SELECT COUNT(*) FROM t_product p WHERE NOT EXISTS ( SELECT NULL FROM ( SELECT 1 AS opt, 0 AS l, 100 AS h UNION ALL SELECT 2 AS opt, 10 AS l, 90 AS h UNION ALL SELECT 3 AS opt, 20 AS l, 80 AS h UNION ALL SELECT 4 AS opt, 30 AS l, 70 AS h UNION ALL SELECT 5 AS opt, 40 AS l, 60 AS h UNION ALL SELECT 6 AS opt, 50 AS l, 50 AS h ) v WHERE NOT EXISTS ( SELECT NULL FROM t_option o WHERE o.product_id = p.id AND o.option_id = opt AND o.value BETWEEN l AND h ) );
COUNT(*) |
---|
22 |
1 row fetched in 0.0001s (0.8438s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | p | index | PRIMARY | 4 | 50115 | 100.00 | Using where; Using index | ||
2 | DEPENDENT SUBQUERY | <derived3> | ALL | 6 | 100.00 | Using where | ||||
9 | DEPENDENT SUBQUERY | o | eq_ref | PRIMARY,ix_option_o_v | PRIMARY | 8 | 20100402_options.p.id,v.opt | 1 | 100.00 | Using where |
3 | DERIVED | No tables used | ||||||||
4 | UNION | No tables used | ||||||||
5 | UNION | No tables used | ||||||||
6 | UNION | No tables used | ||||||||
7 | UNION | No tables used | ||||||||
8 | UNION | No tables used | ||||||||
UNION RESULT | <union3,4,5,6,7,8> | ALL |
Field or reference '20100402_options.p.id' of SELECT #9 was resolved in SELECT #1 Field or reference 'v.opt' of SELECT #9 was resolved in SELECT #2 Field or reference 'v.l' of SELECT #9 was resolved in SELECT #2 Field or reference 'v.h' of SELECT #9 was resolved in SELECT #2 select count(0) AS `COUNT(*)` from `20100402_options`.`t_product` `p` where (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h` union all select 6 AS `opt`,50 AS `l`,50 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`p`.`id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`))))))))
This query runs for 843 ms, or three times as
long as its GROUP BY
counterpart.
GROUP BY version seems to be the most efficient, but can we optimize it somehow?
Mixing two approaches
The main problem with the GROUP BY
approach is,
well, GROUP BY
. In MySQL, it
requires sorting or materialization (or both) which are quite
expensive operations. Also, the recordset for each of the
attributes is selected independently: no filter on one of the
attributes affects the other ones. The filtering is only done
after the aggregation.
The main problem with the NOT EXISTS
approach is
that every product should be checked for the conditions, though
only a tiny fraction of all products satisfies at least several
of them. Checking for conditions also requires index seeks in the
blocks that are quite far away from each other which is not good
for performance (especially if the index does not fit completely
into the cache).
However, we can mix the two approaches.
Instead of using the product table as a source for all possible entities to check against the EAV table, we can take the condition on only one attribute that is the least likely to be satisfied.
Since entity/attribute pair form a natural primary key in the
EAV table, filtering on an attribute is
guaranteed to result in a set of unique entities. Each of these
entities should then be further checked using the NOT
EXISTS
approach, but this time there are much fewer checks
that need to be made, since with a properly chosen primary
condition (which should be the most strong one) most of the
non-matching entities are already filtered out.
This approach, on the one hand, does not require aggregation; on the other hand, the conditions are sieved rather then added up.
Let’s build the query:
SELECT COUNT(*) FROM t_option oo WHERE oo.option_id = 6 AND oo.value BETWEEN 50 AND 50 AND NOT EXISTS ( SELECT NULL FROM ( SELECT 1 AS opt, 0 AS l, 100 AS h UNION ALL SELECT 2 AS opt, 10 AS l, 90 AS h UNION ALL SELECT 3 AS opt, 20 AS l, 80 AS h UNION ALL SELECT 4 AS opt, 30 AS l, 70 AS h UNION ALL SELECT 5 AS opt, 40 AS l, 60 AS h ) v WHERE NOT EXISTS ( SELECT NULL FROM t_option o WHERE o.product_id = oo.product_id AND o.option_id = opt AND o.value BETWEEN l AND h ) );
COUNT(*) |
---|
22 |
1 row fetched in 0.0001s (0.0105s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | oo | ref | ix_option_o_v | ix_option_o_v | 8 | const,const | 1090 | 100.00 | Using where; Using index |
2 | DEPENDENT SUBQUERY | <derived3> | ALL | 5 | 100.00 | Using where | ||||
8 | DEPENDENT SUBQUERY | o | eq_ref | PRIMARY,ix_option_o_v | PRIMARY | 8 | 20100402_options.oo.product_id,v.opt | 1 | 100.00 | Using where |
3 | DERIVED | No tables used | ||||||||
4 | UNION | No tables used | ||||||||
5 | UNION | No tables used | ||||||||
6 | UNION | No tables used | ||||||||
7 | UNION | No tables used | ||||||||
UNION RESULT | <union3,4,5,6,7> | ALL |
Field or reference '20100402_options.oo.product_id' of SELECT #8 was resolved in SELECT #1 Field or reference 'v.opt' of SELECT #8 was resolved in SELECT #2 Field or reference 'v.l' of SELECT #8 was resolved in SELECT #2 Field or reference 'v.h' of SELECT #8 was resolved in SELECT #2 select count(0) AS `COUNT(*)` from `20100402_options`.`t_option` `oo` where ((`20100402_options`.`oo`.`option_id` = 6) and (`20100402_options`.`oo`.`value` between 50 and 50) and (not(exists(select NULL AS `NULL` from (select 1 AS `opt`,0 AS `l`,100 AS `h` union all select 2 AS `opt`,10 AS `l`,90 AS `h` union all select 3 AS `opt`,20 AS `l`,80 AS `h` union all select 4 AS `opt`,30 AS `l`,70 AS `h` union all select 5 AS `opt`,40 AS `l`,60 AS `h`) `v` where (not(exists(select NULL AS `NULL` from `20100402_options`.`t_option` `o` where ((`20100402_options`.`o`.`product_id` = `20100402_options`.`oo`.`product_id`) and (`20100402_options`.`o`.`option_id` = `v`.`opt`) and (`20100402_options`.`o`.`value` between `v`.`l` and `v`.`h`)))))))))
As we can see, using the most selective condition as an initial row source for the further checks greatly improved the query speed.
The query time is now only 10 ms, or
30 times as fast as the GROUP BY
approach.
Hope that helps.
I’m always glad to answer the questions regarding database queries.