Continuing the series:
What is better to store hierarchical data: nested sets model or adjacency list (parent-child) model?
For detailed explanations of the terms, see the first article in the series:
This is the last article of the series which covers MySQL.
MySQL differs from the other systems, since it
is the only system of the big four that does not support
recursion natively. It has neither recursive
CTE’s nor CONNECT BY
clause, not
even rowset returning functions that help to emulate recursion in
PostgreSQL 8.3.
MySQL supports a thing that all other systems
either lack or implement inefficiently: session variables. They
can be set in a SELECT
clause and can be used to
keep some kind of a state between the rows as they are processed
and returned in a rowset.
This of course is against the whole spirit of SQL, since SQL implies operations on whole sets and session variables operate on rows and are totally dependent on the order they are returned or processed. But if used properly, this behavior can be exploited to emulate some things that MySQL lacks: analytic functions, efficient random row sampling etc.
Hierarchical functions are among the things that need to be emulated in MySQL using session variables to keep the function state.
Here’s the old article in my blog that shows how to do this:
On the other hand, MySQL implements one more
thing that is useful for nested sets model:
SPATIAL
indexes.
Spatial indexes and nested sets model
Unlike most other systems, MySQL allows creating
indexes of different types, one of them being
SPATIAL
.
A spatial index is an R-Tree structure that allows indexing multidimensional values (though MySQL only indexes two-dimensional ones). Each value is represented by its minimal bounding box (minimal rectangle that contains the whole shape), and this is what is stored in the index.
The index allows to find an answer efficiently to the following question: given a point, what are the boxes that contain it?
Though this type of query is commonly used on geometrical or geographical data, like find all public toilets within 100 meters of my current location and for God’s sake don’t make it a fullscan. However, this index can be used on any type of query that requires searching a given point in a range defined by two columns just as well.
A plain B-Tree index is good for range queries: find all values within the range defined by two boundaries, like shown on this picture:
Black dots show values that fall inside the range (defined with two constant boundaries). It’s easy to find these values if they are sorted: all values will make a contiguous block. That’s what exactly the B-Tree index do: sort the values and return contiguous blocks.
But if we need to find variable ranges containing a constant point, like on this picture:
, a B-Tree index won’t help, since we have two values here which we just cannot sort using one sort order. An R-Tree index, on the contrary, is perfect for this type of query.
Tasks requiring seaching for a value inside a known range are very common, that’s why almost all database management systems provide a way to build and use a B-Tree index.
There is also a certain class of tasks that require searching for all ranges containing a known value:
and several others. These tasks can be improved by using R-Tree capabilities of MySQL:
The nested sets model belongs to both classes of tasks. It requires the first query (find variable values within a constant range) to build the list of descendants, and the second query (find variable ranges that contain a constant value) to build the list of ancestors.
That’s exactly why it’s so fast on building the list of descendants and so slow on building the list of ancestors.
Using an R-Tree index, nested sets model can be improved.
MySQL can only create an R-Tree
index on a GEOMETRY
type in a
MyISAM table and use it in two special
predicates, MBRContains
and MBRWithin
.
We should represent out nested sets boundaries (lft,
rgt)
as a LineString(Point(-1, lft), Point(1,
rgt))
, and search for a Point(0, value)
using
any of the predicates above.
This is not much of a stretch, actually, from the logical point of view: the nested sets are usually graphically represented as big boxes (parents) containing smaller boxes (children), and that’s exactly what these predicates are designed for: searching for a boxes containing other boxes.
Let’s create a sample table:
Table
creation script
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_hierarchy ( id INT NOT NULL PRIMARY KEY, parent INT NOT NULL, lft INT NOT NULL, rgt INT NOT NULL, sets LineString NOT NULL, data VARCHAR(100) NOT NULL, stuffing VARCHAR(100) NOT NULL ) ENGINE=MyISAM; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END; CREATE PROCEDURE prc_hierarchy(width INT) main:BEGIN DECLARE last INT; DECLARE level INT; SET last = 0; SET level = 0; WHILE width >= 1 DO INSERT INTO t_hierarchy SELECT COALESCE(h.id, 0) * 5 + f.id, COALESCE(h.id, 0), COALESCE(h.lft, 0) + 1 + (f.id - 1) * width, COALESCE(h.lft, 0) + f.id * width, LineString( Point(-1, COALESCE(h.lft, 0) + 1 + (f.id - 1) * width), Point(1, COALESCE(h.lft, 0) + f.id * width) ), CONCAT('Value ', COALESCE(h.id, 0) * 5 + f.id), RPAD('', 100, '*') FROM filler f LEFT JOIN t_hierarchy h ON h.id >= last; SET width = width / 5; SET last = last + POWER(5, level); SET level = level + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(5); CALL prc_hierarchy(585937); COMMIT; CREATE INDEX ix_hierarchy_parent ON t_hierarchy (parent); CREATE INDEX ix_hierarchy_lft ON t_hierarchy (lft); CREATE INDEX ix_hierarchy_rgt ON t_hierarchy (rgt); CREATE SPATIAL INDEX sx_hierarchy_sets ON t_hierarchy (sets);
The table contains 8 levels of hierarchy,
5 children to each parent and
2,441,405 records. Hierarchical attributes are
defined for both adjacency list model (parent
) and
nested sets model (lft
, rgt
). All these
fields are indexed with plain B-Tree indexes.
The field sets
represents the diagonal of the
bounding box of each record. All children’s boxes are contained
within the parent box. This field is indexed with an
R-Tree (SPATIAL
) index.
To query the adjacency list, we also need to create a pair of special functions as described in the article about hierarchical queries in MySQL:
CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id_with_level(value INT, maxlevel INT) RETURNS INT NOT DETERMINISTIC READS SQL DATA BEGIN DECLARE _id INT; DECLARE _parent INT; DECLARE _next INT; DECLARE _i INT; DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL; SET _parent = @id; SET _id = -1; SET _i = 0; IF @id IS NULL THEN RETURN NULL; END IF; LOOP SELECT MIN(id) INTO @id FROM t_hierarchy WHERE parent = _parent AND id > _id AND COALESCE(@level < maxlevel, TRUE); IF @id IS NOT NULL OR _parent = @start_with THEN SET @level = @level + 1; RETURN @id; END IF; SET @level := @level - 1; SELECT id, parent INTO _id, _parent FROM t_hierarchy WHERE id = _parent; SET _i = _i + 1; END LOOP; RETURN NULL; END $$ CREATE FUNCTION hierarchy_connect_by_parent_eq_prior_id(value INT) RETURNS INT NOT DETERMINISTIC READS SQL DATA BEGIN DECLARE _id INT; DECLARE _parent INT; DECLARE _next INT; DECLARE CONTINUE HANDLER FOR NOT FOUND SET @id = NULL; SET _parent = @id; SET _id = -1; IF @id IS NULL THEN RETURN NULL; END IF; LOOP SELECT MIN(id) INTO @id FROM t_hierarchy WHERE parent = _parent AND id > _id; IF @id IS NOT NULL OR _parent = @start_with THEN SET @level = @level + 1; RETURN @id; END IF; SET @level := @level - 1; SELECT id, parent INTO _id, _parent FROM t_hierarchy WHERE id = _parent; END LOOP; END $$
These functions are described in more detail in this and this article.
Now, let’s run the queries.
All descendants Nested sets
SELECT SUM(LENGTH(hc.stuffing)) FROM t_hierarchy hp JOIN t_hierarchy hc ON hc.lft BETWEEN hp.lft AND hp.rgt WHERE hp.id = 42
We don’t use any spatial columns or indexes here.
Since this query requiries searching for all records with values
of lft
lying within the given range (that between
lft
and rgt
of the parent record), a
plain B-Tree index on a plain INT
column is just fine.
SUM(LENGTH(hc.stuffing)) |
---|
1953100 |
1 row fetched in 0.0001s (0.3055s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | hp | const | PRIMARY,ix_hierarchy_lft,ix_hierarchy_rgt | PRIMARY | 4 | const | 1 | 100.00 | |
1 | SIMPLE | hc | range | ix_hierarchy_lft | ix_hierarchy_lft | 4 | 23548 | 100.00 | Using where |
select sum(length(`20090929_nested`.`hc`.`stuffing`)) AS `SUM(LENGTH(hc.stuffing))` from `20090929_nested`.`t_hierarchy` `hp` join `20090929_nested`.`t_hierarchy` `hc` where ((`20090929_nested`.`hc`.`lft` between '257814' and '281250'))
The query completes in 300 ms.
Adjacency list
SELECT SUM(LENGTH(stuffing)) FROM ( SELECT 42 AS id UNION ALL SELECT hierarchy_connect_by_parent_eq_prior_id(id) AS id FROM ( SELECT @start_with := 42, @id := @start_with, @level := 0 ) vars, t_hierarchy WHERE @id IS NOT NULL ) ho JOIN t_hierarchy hi ON hi.id = ho.id
Since MySQL does not support recursive queries natively, we use the function to iterate the trees and a set of session variables to maintain the state of the function between calls.
To provide the results as a resultset, we call the function it
SELECT
clause of a query over the table,
disregarding the input parameters. The table in the
FROM
clause is used as a dummy rowsource.
SUM(LENGTH(stuffing)) |
---|
1953100 |
1 row fetched in 0.0001s (7.5325s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 19532 | 100.00 | |||||
1 | PRIMARY | hi | eq_ref | PRIMARY | PRIMARY | 4 | ho.id | 1 | 100.00 | Using where |
2 | DERIVED | No tables used | ||||||||
3 | UNCACHEABLE UNION | <derived4> | system | 1 | 100.00 | |||||
3 | UNCACHEABLE UNION | t_hierarchy | index | PRIMARY | 4 | 2441405 | 100.00 | Using where; Using index | ||
4 | DERIVED | No tables used | ||||||||
UNION RESULT | <union2,3> | ALL |
select sum(length(`20090929_nested`.`hi`.`stuffing`)) AS `SUM(LENGTH(stuffing))` from (select 42 AS `id` union all select `hierarchy_connect_by_parent_eq_prior_id`(`20090929_nested`.`t_hierarchy`.`id`) AS `id` from (select (@start_with:=42) AS `@start_with := 42`,(@id:=(@start_with)) AS `@id := @start_with`,(@level:=0) AS `@level := 0`) `vars` join `20090929_nested`.`t_hierarchy` where ((@id) is not null)) `ho` join `20090929_nested`.`t_hierarchy` `hi` where (`20090929_nested`.`hi`.`id` = `ho`.`id`)
This query runs for 7 seconds.
All ancestors Nested sets
SELECT hp.id, hp.parent, hp.lft, hp.rgt, hp.data FROM t_hierarchy hc JOIN t_hierarchy hp ON MBRWithin(Point(0, hc.lft), hp.sets) WHERE hc.id = 1000000 ORDER BY lft
This query uses the R-Tree index. To do that, we
convert lft
to a point with coordinates (0,
lft)
and search for all boxes containing this point using
MBRWithin
.
id | parent | lft | rgt | data |
---|---|---|---|---|
2 | 0 | 585938 | 1171874 | Value 2 |
12 | 2 | 703126 | 820312 | Value 12 |
63 | 12 | 750001 | 773437 | Value 63 |
319 | 63 | 764063 | 768749 | Value 319 |
1599 | 319 | 766875 | 767811 | Value 1599 |
7999 | 1599 | 767437 | 767623 | Value 7999 |
39999 | 7999 | 767549 | 767585 | Value 39999 |
199999 | 39999 | 767571 | 767577 | Value 199999 |
1000000 | 199999 | 767576 | 767576 | Value 1000000 |
9 rows fetched in 0.0005s (0.0151s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | hc | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | Using filesort |
1 | SIMPLE | hp | range | sx_hierarchy_sets | sx_hierarchy_sets | 34 | 1 | 100.00 | Using where |
select `20090929_nested`.`hp`.`id` AS `id`,`20090929_nested`.`hp`.`parent` AS `parent`,`20090929_nested`.`hp`.`lft` AS `lft`,`20090929_nested`.`hp`.`rgt` AS `rgt`,`20090929_nested`.`hp`.`data` AS `data` from `20090929_nested`.`t_hierarchy` `hc` join `20090929_nested`.`t_hierarchy` `hp` where (within(point(0,'767576'),`20090929_nested`.`hp`.`sets`)) order by `20090929_nested`.`hp`.`lft`
The query completes in 15 ms. This result is by far faster than everything we saw before. The same queries issued by the other systems are not assisted or poorly assisted by B-Tree indexes, and usually this query is a matter of seconds.
Adjacency list
SELECT hp.id, hp.parent, hp.lft, hp.rgt, hp.data FROM ( SELECT @r AS _id, @level := @level + 1 AS level, ( SELECT @r := NULLIF(parent, 0) FROM t_hierarchy hn WHERE id = _id ) FROM ( SELECT @r := 1000000, @level := 0 ) vars, t_hierarchy hc WHERE @r IS NOT NULL ) hc JOIN t_hierarchy hp ON hp.id = hc._id ORDER BY level DESC
Not that we don’t even use a function for this query.
Each record has only one parent
, and id
is a PRIMARY KEY
, thus the ancestry chain can be
represented as a linked list.
To iterate the linked list, we can use an approach describe in this article:
, which requires no function, just a correlated subquery.
id | parent | lft | rgt | data |
---|---|---|---|---|
2 | 0 | 585938 | 1171874 | Value 2 |
12 | 2 | 703126 | 820312 | Value 12 |
63 | 12 | 750001 | 773437 | Value 63 |
319 | 63 | 764063 | 768749 | Value 319 |
1599 | 319 | 766875 | 767811 | Value 1599 |
7999 | 1599 | 767437 | 767623 | Value 7999 |
39999 | 7999 | 767549 | 767585 | Value 39999 |
199999 | 39999 | 767571 | 767577 | Value 199999 |
1000000 | 199999 | 767576 | 767576 | Value 1000000 |
9 rows fetched in 0.0006s (0.5937s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 9 | 100.00 | Using filesort | ||||
1 | PRIMARY | hp | eq_ref | PRIMARY | PRIMARY | 4 | hc._id | 1 | 100.00 | Using where |
2 | DERIVED | <derived4> | system | 1 | 100.00 | |||||
2 | DERIVED | hc | index | PRIMARY | 4 | 2441405 | 100.00 | Using where; Using index | ||
4 | DERIVED | No tables used | ||||||||
3 | DEPENDENT SUBQUERY | hn | eq_ref | PRIMARY | PRIMARY | 4 | func | 1 | 100.00 | Using where |
Field or reference '_id' of SELECT #3 was resolved in SELECT #2 select `20090929_nested`.`hp`.`id` AS `id`,`20090929_nested`.`hp`.`parent` AS `parent`,`20090929_nested`.`hp`.`lft` AS `lft`,`20090929_nested`.`hp`.`rgt` AS `rgt`,`20090929_nested`.`hp`.`data` AS `data` from (select (@r) AS `_id`,(@level:=((@level) + 1)) AS `level`,(select (@r:=nullif(`20090929_nested`.`hn`.`parent`,0)) AS `@r := NULLIF(parent, 0)` from `20090929_nested`.`t_hierarchy` `hn` where (`20090929_nested`.`hn`.`id` = `_id`)) AS `( SELECT @r := NULLIF(parent, 0) FROM t_hierarchy hn WHERE id = _id )` from (select (@r:=1000000) AS `@r := 1000000`,(@level:=0) AS `@level := 0`) `vars` join `20090929_nested`.`t_hierarchy` `hc` where ((@r) is not null)) `hc` join `20090929_nested`.`t_hierarchy` `hp` where (`20090929_nested`.`hp`.`id` = `hc`.`_id`) order by `hc`.`level` desc
This query completes in 600 ms, which is much longer than the nested sets solution.
All descendants up to a certain level Nested sets
SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.data FROM t_hierarchy hp JOIN t_hierarchy hc ON hc.lft BETWEEN hp.lft AND hp.rgt JOIN t_hierarchy hr ON MBRWithin(Point(0, hc.lft), hr.sets) WHERE hp.id = ? GROUP BY hc.id HAVING COUNT(*) <= ( SELECT COUNT(*) FROM t_hierarchy hp JOIN t_hierarchy hrp ON MBRWithin(Point(0, hp.lft), hrp.sets) WHERE hp.id = ? ) + 2 ORDER BY hc.lft
This query is adjusted for better usage of the R-Tree indexes.
We find all descendents of the item in question and then calculate the total number of parents (which gives us the depth level of each of the children). Then we just compare it with the level of the item.
This solution depends on the number of item’s children too, however, let’s see the performance:
SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.data FROM t_hierarchy hp JOIN t_hierarchy hc ON hc.lft BETWEEN hp.lft AND hp.rgt JOIN t_hierarchy hr ON MBRWithin(Point(0, hc.lft), hr.sets) WHERE hp.id = 42 GROUP BY hc.id HAVING COUNT(*) <= ( SELECT COUNT(*) FROM t_hierarchy hp JOIN t_hierarchy hrp ON MBRWithin(Point(0, hp.lft), hrp.sets) WHERE hp.id = 42 ) + 2 ORDER BY hc.lft
id | parent | lft | rgt | data |
---|---|---|---|---|
42 | 8 | 257814 | 281250 | Value 42 |
211 | 42 | 257815 | 262501 | Value 211 |
1056 | 211 | 257816 | 258752 | Value 1056 |
1057 | 211 | 258753 | 259689 | Value 1057 |
1058 | 211 | 259690 | 260626 | Value 1058 |
1059 | 211 | 260627 | 261563 | Value 1059 |
1060 | 211 | 261564 | 262500 | Value 1060 |
212 | 42 | 262502 | 267188 | Value 212 |
1061 | 212 | 262503 | 263439 | Value 1061 |
1062 | 212 | 263440 | 264376 | Value 1062 |
1063 | 212 | 264377 | 265313 | Value 1063 |
1064 | 212 | 265314 | 266250 | Value 1064 |
1065 | 212 | 266251 | 267187 | Value 1065 |
213 | 42 | 267189 | 271875 | Value 213 |
1066 | 213 | 267190 | 268126 | Value 1066 |
1067 | 213 | 268127 | 269063 | Value 1067 |
1068 | 213 | 269064 | 270000 | Value 1068 |
1069 | 213 | 270001 | 270937 | Value 1069 |
1070 | 213 | 270938 | 271874 | Value 1070 |
214 | 42 | 271876 | 276562 | Value 214 |
1071 | 214 | 271877 | 272813 | Value 1071 |
1072 | 214 | 272814 | 273750 | Value 1072 |
1073 | 214 | 273751 | 274687 | Value 1073 |
1074 | 214 | 274688 | 275624 | Value 1074 |
1075 | 214 | 275625 | 276561 | Value 1075 |
215 | 42 | 276563 | 281249 | Value 215 |
1076 | 215 | 276564 | 277500 | Value 1076 |
1077 | 215 | 277501 | 278437 | Value 1077 |
1078 | 215 | 278438 | 279374 | Value 1078 |
1079 | 215 | 279375 | 280311 | Value 1079 |
1080 | 215 | 280312 | 281248 | Value 1080 |
31 rows fetched in 0.0014s (4.8592s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | hp | const | PRIMARY,ix_hierarchy_lft,ix_hierarchy_rgt | PRIMARY | 4 | const | 1 | 100.00 | Using temporary; Using filesort |
1 | PRIMARY | hc | range | ix_hierarchy_lft | ix_hierarchy_lft | 4 | 23548 | 100.00 | Using where | |
1 | PRIMARY | hr | ALL | sx_hierarchy_sets | 2441405 | 100.00 | Range checked for each record (index map: 0×10) | |||
2 | SUBQUERY | hp | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | |
2 | SUBQUERY | hrp | range | sx_hierarchy_sets | sx_hierarchy_sets | 34 | 1 | 100.00 | Using where |
select `20090929_nested`.`hc`.`id` AS `id`,`20090929_nested`.`hc`.`parent` AS `parent`,`20090929_nested`.`hc`.`lft` AS `lft`,`20090929_nested`.`hc`.`rgt` AS `rgt`,`20090929_nested`.`hc`.`data` AS `data` from `20090929_nested`.`t_hierarchy` `hp` join `20090929_nested`.`t_hierarchy` `hc` join `20090929_nested`.`t_hierarchy` `hr` where (within(point(0,`20090929_nested`.`hc`.`lft`),`20090929_nested`.`hr`.`sets`) and (`20090929_nested`.`hc`.`lft` between '257814' and '281250')) group by `20090929_nested`.`hc`.`id` having (count(0) <= ((select count(0) AS `COUNT(*)` from `20090929_nested`.`t_hierarchy` `hp` join `20090929_nested`.`t_hierarchy` `hrp` where (within(point(0,'257814'),`20090929_nested`.`hrp`.`sets`))) + 2)) order by `20090929_nested`.`hc`.`lft`
SELECT hc.id, hc.parent, hc.lft, hc.rgt, hc.data FROM t_hierarchy hp JOIN t_hierarchy hc ON hc.lft BETWEEN hp.lft AND hp.rgt JOIN t_hierarchy hr ON MBRWithin(Point(0, hc.lft), hr.sets) WHERE hp.id = 31415 GROUP BY hc.id HAVING COUNT(*) <= ( SELECT COUNT(*) FROM t_hierarchy hp JOIN t_hierarchy hrp ON MBRWithin(Point(0, hp.lft), hrp.sets) WHERE hp.id = 31415 ) + 2 ORDER BY hc.lft
id | parent | lft | rgt | data |
---|---|---|---|---|
31415 | 6282 | 445651 | 445687 | Value 31415 |
157076 | 31415 | 445652 | 445658 | Value 157076 |
785381 | 157076 | 445653 | 445653 | Value 785381 |
785382 | 157076 | 445654 | 445654 | Value 785382 |
785383 | 157076 | 445655 | 445655 | Value 785383 |
785384 | 157076 | 445656 | 445656 | Value 785384 |
785385 | 157076 | 445657 | 445657 | Value 785385 |
157077 | 31415 | 445659 | 445665 | Value 157077 |
785386 | 157077 | 445660 | 445660 | Value 785386 |
785387 | 157077 | 445661 | 445661 | Value 785387 |
785388 | 157077 | 445662 | 445662 | Value 785388 |
785389 | 157077 | 445663 | 445663 | Value 785389 |
785390 | 157077 | 445664 | 445664 | Value 785390 |
157078 | 31415 | 445666 | 445672 | Value 157078 |
785391 | 157078 | 445667 | 445667 | Value 785391 |
785392 | 157078 | 445668 | 445668 | Value 785392 |
785393 | 157078 | 445669 | 445669 | Value 785393 |
785394 | 157078 | 445670 | 445670 | Value 785394 |
785395 | 157078 | 445671 | 445671 | Value 785395 |
157079 | 31415 | 445673 | 445679 | Value 157079 |
785396 | 157079 | 445674 | 445674 | Value 785396 |
785397 | 157079 | 445675 | 445675 | Value 785397 |
785398 | 157079 | 445676 | 445676 | Value 785398 |
785399 | 157079 | 445677 | 445677 | Value 785399 |
785400 | 157079 | 445678 | 445678 | Value 785400 |
157080 | 31415 | 445680 | 445686 | Value 157080 |
785401 | 157080 | 445681 | 445681 | Value 785401 |
785402 | 157080 | 445682 | 445682 | Value 785402 |
785403 | 157080 | 445683 | 445683 | Value 785403 |
785404 | 157080 | 445684 | 445684 | Value 785404 |
785405 | 157080 | 445685 | 445685 | Value 785405 |
31 rows fetched in 0.0014s (0.0105s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | hp | const | PRIMARY,ix_hierarchy_lft,ix_hierarchy_rgt | PRIMARY | 4 | const | 1 | 100.00 | Using temporary; Using filesort |
1 | PRIMARY | hc | range | ix_hierarchy_lft | ix_hierarchy_lft | 4 | 28 | 100.00 | Using where | |
1 | PRIMARY | hr | ALL | sx_hierarchy_sets | 2441405 | 100.00 | Range checked for each record (index map: 0×10) | |||
2 | SUBQUERY | hp | const | PRIMARY | PRIMARY | 4 | const | 1 | 100.00 | |
2 | SUBQUERY | hrp | range | sx_hierarchy_sets | sx_hierarchy_sets | 34 | 1 | 100.00 | Using where |
select `20090929_nested`.`hc`.`id` AS `id`,`20090929_nested`.`hc`.`parent` AS `parent`,`20090929_nested`.`hc`.`lft` AS `lft`,`20090929_nested`.`hc`.`rgt` AS `rgt`,`20090929_nested`.`hc`.`data` AS `data` from `20090929_nested`.`t_hierarchy` `hp` join `20090929_nested`.`t_hierarchy` `hc` join `20090929_nested`.`t_hierarchy` `hr` where (within(point(0,`20090929_nested`.`hc`.`lft`),`20090929_nested`.`hr`.`sets`) and (`20090929_nested`.`hc`.`lft` between '445651' and '445687')) group by `20090929_nested`.`hc`.`id` having (count(0) <= ((select count(0) AS `COUNT(*)` from `20090929_nested`.`t_hierarchy` `hp` join `20090929_nested`.`t_hierarchy` `hrp` where (within(point(0,'445651'),`20090929_nested`.`hrp`.`sets`))) + 2)) order by `20090929_nested`.`hc`.`lft`
The query for item 42 (which has about 20,000 descendants) took minutes in the other systems. Now it completes in less than 5 seconds.
The same query for item 31,415 is over in just 10 ms.
Adjacency list
SELECT hi.id, hi.parent, hi.lft, hi.rgt, hi.data FROM ( SELECT ? AS id UNION ALL SELECT hierarchy_connect_by_parent_eq_prior_id_with_level(id, 2) AS id FROM ( SELECT @start_with := ?, @id := @start_with, @level := 0 ) vars, t_hierarchy WHERE @id IS NOT NULL ) ho JOIN t_hierarchy hi ON hi.id = ho.id
This query imitates recursion, so performance does not directly depend on the number of descendants.
SELECT hi.id, hi.parent, hi.lft, hi.rgt, hi.data FROM ( SELECT 42 AS id UNION ALL SELECT hierarchy_connect_by_parent_eq_prior_id_with_level(id, 2) AS id FROM ( SELECT @start_with := 42, @id := @start_with, @level := 0 ) vars, t_hierarchy WHERE @id IS NOT NULL ) ho JOIN t_hierarchy hi ON hi.id = ho.id
id | parent | lft | rgt | data |
---|---|---|---|---|
42 | 8 | 257814 | 281250 | Value 42 |
211 | 42 | 257815 | 262501 | Value 211 |
1056 | 211 | 257816 | 258752 | Value 1056 |
1057 | 211 | 258753 | 259689 | Value 1057 |
1058 | 211 | 259690 | 260626 | Value 1058 |
1059 | 211 | 260627 | 261563 | Value 1059 |
1060 | 211 | 261564 | 262500 | Value 1060 |
212 | 42 | 262502 | 267188 | Value 212 |
1061 | 212 | 262503 | 263439 | Value 1061 |
1062 | 212 | 263440 | 264376 | Value 1062 |
1063 | 212 | 264377 | 265313 | Value 1063 |
1064 | 212 | 265314 | 266250 | Value 1064 |
1065 | 212 | 266251 | 267187 | Value 1065 |
213 | 42 | 267189 | 271875 | Value 213 |
1066 | 213 | 267190 | 268126 | Value 1066 |
1067 | 213 | 268127 | 269063 | Value 1067 |
1068 | 213 | 269064 | 270000 | Value 1068 |
1069 | 213 | 270001 | 270937 | Value 1069 |
1070 | 213 | 270938 | 271874 | Value 1070 |
214 | 42 | 271876 | 276562 | Value 214 |
1071 | 214 | 271877 | 272813 | Value 1071 |
1072 | 214 | 272814 | 273750 | Value 1072 |
1073 | 214 | 273751 | 274687 | Value 1073 |
1074 | 214 | 274688 | 275624 | Value 1074 |
1075 | 214 | 275625 | 276561 | Value 1075 |
215 | 42 | 276563 | 281249 | Value 215 |
1076 | 215 | 276564 | 277500 | Value 1076 |
1077 | 215 | 277501 | 278437 | Value 1077 |
1078 | 215 | 278438 | 279374 | Value 1078 |
1079 | 215 | 279375 | 280311 | Value 1079 |
1080 | 215 | 280312 | 281248 | Value 1080 |
31 rows fetched in 0.0013s (0.6250s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 32 | 100.00 | |||||
1 | PRIMARY | hi | eq_ref | PRIMARY | PRIMARY | 4 | ho.id | 1 | 100.00 | Using where |
2 | DERIVED | No tables used | ||||||||
3 | UNCACHEABLE UNION | <derived4> | system | 1 | 100.00 | |||||
3 | UNCACHEABLE UNION | t_hierarchy | index | PRIMARY | 4 | 2441405 | 100.00 | Using where; Using index | ||
4 | DERIVED | No tables used | ||||||||
UNION RESULT | <union2,3> | ALL |
select `20090929_nested`.`hi`.`id` AS `id`,`20090929_nested`.`hi`.`parent` AS `parent`,`20090929_nested`.`hi`.`lft` AS `lft`,`20090929_nested`.`hi`.`rgt` AS `rgt`,`20090929_nested`.`hi`.`data` AS `data` from (select 42 AS `id` union all select `hierarchy_connect_by_parent_eq_prior_id_with_level`(`20090929_nested`.`t_hierarchy`.`id`,2) AS `id` from (select (@start_with:=42) AS `@start_with := 42`,(@id:=(@start_with)) AS `@id := @start_with`,(@level:=0) AS `@level := 0`) `vars` join `20090929_nested`.`t_hierarchy` where ((@id) is not null)) `ho` join `20090929_nested`.`t_hierarchy` `hi` where (`20090929_nested`.`hi`.`id` = `ho`.`id`)
SELECT hi.id, hi.parent, hi.lft, hi.rgt, hi.data FROM ( SELECT 31415 AS id UNION ALL SELECT hierarchy_connect_by_parent_eq_prior_id_with_level(id, 2) AS id FROM ( SELECT @start_with := 31415, @id := @start_with, @level := 0 ) vars, t_hierarchy WHERE @id IS NOT NULL ) ho JOIN t_hierarchy hi ON hi.id = ho.id
id | parent | lft | rgt | data |
---|---|---|---|---|
31415 | 6282 | 445651 | 445687 | Value 31415 |
157076 | 31415 | 445652 | 445658 | Value 157076 |
785381 | 157076 | 445653 | 445653 | Value 785381 |
785382 | 157076 | 445654 | 445654 | Value 785382 |
785383 | 157076 | 445655 | 445655 | Value 785383 |
785384 | 157076 | 445656 | 445656 | Value 785384 |
785385 | 157076 | 445657 | 445657 | Value 785385 |
157077 | 31415 | 445659 | 445665 | Value 157077 |
785386 | 157077 | 445660 | 445660 | Value 785386 |
785387 | 157077 | 445661 | 445661 | Value 785387 |
785388 | 157077 | 445662 | 445662 | Value 785388 |
785389 | 157077 | 445663 | 445663 | Value 785389 |
785390 | 157077 | 445664 | 445664 | Value 785390 |
157078 | 31415 | 445666 | 445672 | Value 157078 |
785391 | 157078 | 445667 | 445667 | Value 785391 |
785392 | 157078 | 445668 | 445668 | Value 785392 |
785393 | 157078 | 445669 | 445669 | Value 785393 |
785394 | 157078 | 445670 | 445670 | Value 785394 |
785395 | 157078 | 445671 | 445671 | Value 785395 |
157079 | 31415 | 445673 | 445679 | Value 157079 |
785396 | 157079 | 445674 | 445674 | Value 785396 |
785397 | 157079 | 445675 | 445675 | Value 785397 |
785398 | 157079 | 445676 | 445676 | Value 785398 |
785399 | 157079 | 445677 | 445677 | Value 785399 |
785400 | 157079 | 445678 | 445678 | Value 785400 |
157080 | 31415 | 445680 | 445686 | Value 157080 |
785401 | 157080 | 445681 | 445681 | Value 785401 |
785402 | 157080 | 445682 | 445682 | Value 785402 |
785403 | 157080 | 445683 | 445683 | Value 785403 |
785404 | 157080 | 445684 | 445684 | Value 785404 |
785405 | 157080 | 445685 | 445685 | Value 785405 |
31 rows fetched in 0.0014s (0.6406s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | <derived2> | ALL | 32 | 100.00 | |||||
1 | PRIMARY | hi | eq_ref | PRIMARY | PRIMARY | 4 | ho.id | 1 | 100.00 | Using where |
2 | DERIVED | No tables used | ||||||||
3 | UNCACHEABLE UNION | <derived4> | system | 1 | 100.00 | |||||
3 | UNCACHEABLE UNION | t_hierarchy | index | PRIMARY | 4 | 2441405 | 100.00 | Using where; Using index | ||
4 | DERIVED | No tables used | ||||||||
UNION RESULT | <union2,3> | ALL |
select `20090929_nested`.`hi`.`id` AS `id`,`20090929_nested`.`hi`.`parent` AS `parent`,`20090929_nested`.`hi`.`lft` AS `lft`,`20090929_nested`.`hi`.`rgt` AS `rgt`,`20090929_nested`.`hi`.`data` AS `data` from (select 31415 AS `id` union all select `hierarchy_connect_by_parent_eq_prior_id_with_level`(`20090929_nested`.`t_hierarchy`.`id`,2) AS `id` from (select (@start_with:=31415) AS `@start_with := 31415`,(@id:=(@start_with)) AS `@id := @start_with`,(@level:=0) AS `@level := 0`) `vars` join `20090929_nested`.`t_hierarchy` where ((@id) is not null)) `ho` join `20090929_nested`.`t_hierarchy` `hi` where (`20090929_nested`.`hi`.`id` = `ho`.`id`)
Both queries complete in 600 ms
Summary
MySQL differs from the other systems in its possibilities to handle hierarchical data.
On one hand, it lacks a native way to do recursive queries which makes traversing the hierarchy trees harder. It can be emulated using a custom function and session variables to maintain the recursion stack, but this solution is more slow.
On the other hand, MySQL supports R-Tree indexes which can be used to query the ranges containing a given value. This type of search is required for the nested sets queries and R-Tree index is faster.
However, adjacency list is still faster for retrieving all descendants up to the given level.
Both adjacency lists and nested sets require extra maintenance in MySQL: adjacency lists require building a custom function to query each table, nested sets require a function to update it.
Updating a nested sets model can be slow too since R-Tree indexes take much longer time to add to them than B-Tree indexes.
However, using R-Tree indexes, nested sets model is extra fast for searching for all descendants and all ancestors, and shows decent performance in determining the item’s levels and filtering on them.
In MySQL, it is advisable to add the
level
column into the nested sets
model which will make it super fast for all three types of
queries. However, this will make it even more harder to update.
It should also be noted that the only storage engine that allows R-Tree indexes is MyISAM. In case of an update (which can affect millions of rows even to insert a single record), all table will be locked and will not be able to be queried.
Conclusion
In MySQL, the nested sets model should be preferred if the updates to the hierarhical structure are infrequent and it is affordable to lock the table for the duration of an update (which can take minutes on a long table).
This implies creating the table using MyISAM
storage engine, creating the bounding box of a
GEOMETRY
type as described above, indexing it with a
SPATIAL
index and persisting the level
in the table.
If the updates to the table are frequent or it is inaffordable to lock the table for a long period of time implied by an update, then the adjacency list model should be used to store the hierarchical data.
This requires creating a function to query the table.
MySQL is the only system of the big four (MySQL, Oracle, SQL Server, PostgreSQL) for which the nested sets model shows decent performance and can be considered to stored hierarchical data.