This post is inspired by a discussion with John Didion:
Is there any way to optimize the query for overlapping ranges in MySQL if both ranges are dynamic?
I have two tables, each with integer range columns (specified as
LineString
), and I want to find rows that overlap.
No matter what I try, the query planner never uses any indexes.
This question addresses a well-known problem of efficient searching for the intersecting intervals. The queries that deal with it require ability to search for the intervals (stored in two distinct columns) containing a constant scalar value.
Plain B-Tree indexes used by most databases do
not speed up the queries like that. However,
MySQL supports SPATIAL
indexes that
can index two-dimensional shapes and efficiently search for the
shapes containing a given point.
With a little effort, time intervals can be converted into the
geometrical objects, indexed with a SPATIAL
index
and searched for the given point in time (also presented as a
gemetrical object). This is described in the article about
overlapping ranges in MySQL.
The query described in that article, however, searches for the
intervals overlapping a constant range, provided as a parameter
to the query. Now, I will discuss how to adapt the query for a
JOIN
between two tables.
Let’s create two sample tables, each containing a set of time
ranges stored as geometrical objects, and find all records from
both tables whose ranges overlap:
Table
creation details
CREATE TABLE filler ( id INT NOT NULL PRIMARY KEY AUTO_INCREMENT ) ENGINE=Memory; CREATE TABLE t_big ( id INT NOT NULL PRIMARY KEY, rg LineString NOT NULL, SPATIAL KEY sx_big_rg (rg) ) ENGINE=MyISAM; CREATE TABLE t_small ( id INT NOT NULL PRIMARY KEY, rg LineString NOT NULL, SPATIAL KEY sx_small_rg (rg) ) ENGINE=MyISAM; DELIMITER $$ CREATE PROCEDURE prc_filler(cnt INT) BEGIN DECLARE _cnt INT; SET _cnt = 1; WHILE _cnt <= cnt DO INSERT INTO filler SELECT _cnt; SET _cnt = _cnt + 1; END WHILE; END $$ DELIMITER ; START TRANSACTION; CALL prc_filler(120000); COMMIT; INSERT INTO t_big (id, rg) SELECT id, LineString( Point(-1, TIMESTAMPDIFF(second, '1970-01-01', range_start)), Point(1, TIMESTAMPDIFF(second, '1970-01-01', range_end)) ) FROM ( SELECT id, STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id HOUR AS range_start, STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id HOUR + INTERVAL RAND(20100201) * 7200 SECOND AS range_end FROM filler ) q; INSERT INTO t_small (id, rg) SELECT id, LineString( Point(-1, TIMESTAMPDIFF(second, '1970-01-01', range_start)), Point(1, TIMESTAMPDIFF(second, '1970-01-01', range_end)) ) FROM ( SELECT id, STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id DAY AS range_start, STR_TO_DATE('2009-02-01', '%Y-%m-%d') - INTERVAL id DAY + INTERVAL RAND(20100201) * 172800 SECOND AS range_end FROM filler ORDER BY id LIMIT 5000 ) q
There are two tables, t_big
and
t_small
.
t_big
contains 120,000 records with
intervals starting each hour and having random length from
0
to 2
hours, 1 hour
in average.
t_small
contains 5,000 records with
intervals starting each day and having random length from
0
to 2
days, 1 day in
average.
The intervals are presented as the LineString
records with a single line connecting Point(-1,
range_start)
and Point(1, range_end)
.
range_start
and range_end
are presented
as the UNIX timestamps (numbers of seconds since Jan 1st,
1970). To avoid year 2038 problem, we use
TIMESTAMPDIFF
rather than
UNIX_TIMESTAMP
to calculate the value.
Each table is indexes with a SPATIAL
index on the
rg
field.
First of all, we need to make sure that the index is created well.
MySQL uses R-Tree
for the SPATIAL
indexes, which works according to
the principle of least enlargement to the bounding box when
adding a value to the index. Each leaf node is covered by a
certain minimum bounding box stored in the branch node, and the
branch which needs to be increased to the least extent is chosen
as a new parent for the element.
Our ranges are one-dimensional, but MySQL requires the spatial data to be two-dimensional. That’s why we need to use either of the dimensions to represent the time ranges and fill another one to span some constant range so that the box areas would be proportional to the range lengths.
However, if we just fill both ends of the LineString
with a constant, this will make the lines pure horizontal or pure
vertical. Since they all reside on one line, the minimal bounding
boxes of any set of the LineString
’s will have zero
area (since they all will be of zero height).
MySQL will not be able to choose the least
enlargement: in a two-dimensional space, any horizontal
enlargement of a one-dimensional line will be a zero.
MySQL will just append the entry to any randomly
chosen branch in no certain order. This makes the index very
unbalanced and in fact unusable.
To make the index usable, we need to make the boxes have positive
area so that the least enlargement principle could do its job.
That’s why we create the LineString
fields
diagonally, with the first coordinates being -1
for the start point and 1 for the end point.
Now, we should make a query that would find the intersections of these ranges.
MySQL documentation evasively mentions that the predicates
supported by the spatial indexes are MBRContains
and MBRWithin
(with the latter being
just the synonym for MBRContains
with the argument
order reversed), with a side note that in future releases,
spatial indexes may also be used for optimizing other functions
If we were to comply with the documentation and limit ourselves to using only these two functions for the spatial indexes, the query would still be possible and quite efficient.
MBRContains
checks if the bounding box of the second
argument lies within the bounding box for the first argument.
This is not always true for the overlapping ranges. If the ranges
overlap partially, that is the first range both starts and ends
earlier than the second range, then no ranges lie completely
within each other’s bounds, and MBRContains
will
return 0 for any argument order.
However, there is a simple condition that checks whether the ranges do overlap.
With any pair of ranges, there is a range that starts later (or at the same time) than another. If the ranges overlap, then the start of that range should lie within the bounds of its counterpart. Indeed, the start of the latest range should be equal to or greater then the start of the earliest range (by definition) and equal to of less than the end of the earliest range (or the ranges do not overlap).
This can be checked with a simple condition:
MBRContains(rg1, StartPoint(rg2))
.
Since there is no sargable condition that would find out which range starts earlier, we’ll just need to check both ways.
Unfortunately, MySQL does not support merging
spatial indexes, so an OR
condition would make the
indexes unusable. We need to use the two MBRContains
predicates with two separate queries and merge the results using
UNION
.
When optimizing a join, MySQL’s optimizer
usually makes the smaller table leading in the join. With a usual
equijoin condition this is a wise thing to do, since
MySQL uses nested loops with an index search to
join the tables. The duration of the nested loops query is the
product of the leading table scan time (the number of the
records) and driven table search time. B-Tree
search is logarithmic, so (n × log m)
is less than
(m × log n)
as long as m < n
. That’s
why the leading table should be the smallest one.
However, MySQL does not take into account that
MBRContains
predicate is not symmetrical. Reversing
the table scan order makes each search operation to be not a
searching for boxes containing the point, but searching for start
points contained by the box. Since the individual points are not
even indexed, MySQL will not be able to use an
index access path for such a query and will revert to a very
inefficient join buffer
(which in fact is comparing
every record to every other record).
To work around this, we need to force the join order so that the table with the points leads.
In MySQL, this is done by replacing
JOIN
with STRAIGHT_JOIN
, which forces
the leftmost table to be leading in the join.
Here’s the query:
SELECT COUNT(*) FROM ( SELECT s.id AS sid, b.id AS bid FROM t_small s STRAIGHT_JOIN t_big b ON MBRContains(b.rg, StartPoint(s.rg)) UNION SELECT s.id AS sid, b.id AS bid FROM t_big b STRAIGHT_JOIN t_small s ON MBRContains(s.rg, StartPoint(b.rg)) ) q
COUNT(*) |
---|
124911 |
1 row fetched in 0.0001s (4.4999s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | PRIMARY | Select tables optimized away | ||||||||
2 | DERIVED | s | ALL | 5000 | 100.00 | |||||
2 | DERIVED | b | range | sx_big_rg | sx_big_rg | 34 | 1 | 12000000.00 | Range checked for each record (index map: 0×2) | |
3 | UNION | b | ALL | 120000 | 100.00 | |||||
3 | UNION | s | range | sx_small_rg | sx_small_rg | 34 | 1 | 500000.00 | Range checked for each record (index map: 0×2) | |
UNION RESULT | <union2,3> | ALL |
select count(0) AS `COUNT(*)` from (select `20100201_ranges`.`s`.`id` AS `sid`,`20100201_ranges`.`b`.`id` AS `bid` from `20100201_ranges`.`t_small` `s` straight_join `20100201_ranges`.`t_big` `b` union select `20100201_ranges`.`s`.`id` AS `sid`,`20100201_ranges`.`b`.`id` AS `bid` from `20100201_ranges`.`t_big` `b` straight_join `20100201_ranges`.`t_small` `s`) `q`
This query finds the total number of pairs of overlapping ranges from both tables and runs for 4.5 seconds which is quite efficient.
MBRIntersects: a more efficient solution
Normally I would put the words hope that helps and a link to the question form here, which I use to close the articles with the answers.
But when testing the solution, John found out
that MBRIntersects
function (which just checks
intersection of two boxes without splitting them into start and
end points) is sargable too and the query above is quite
overcomplicated.
This function is symmetric, that is MBRIntersects(a, b) ≡
MBRIntersects(b, a)
. However, MySQL is
not aware of that, and an index is used only against the column
in the first argument to MBRIntersects
. This means
we have to make t_small
to lead in the join and put
its column into the second argument of the function.
This way, the column from the driven table (t_big
)
will be searched for the value provided from the leading table
using the index on the driven table, which is exactly what we
need.
Here’s the new updated query:
SELECT COUNT(*) FROM t_small s STRAIGHT_JOIN t_big b ON MBRIntersects(b.rg, s.rg)
COUNT(*) |
---|
124911 |
1 row fetched in 0.0001s (1.2656s) |
id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | s | ALL | sx_small_rg | 5000 | 100.00 | ||||
1 | SIMPLE | b | ALL | sx_big_rg | 120000 | 100.00 | Range checked for each record (index map: 0×2) |
select count(0) AS `COUNT(*)` from `20100201_ranges`.`t_small` `s` straight_join `20100201_ranges`.`t_big` `b` where intersects(`20100201_ranges`.`b`.`rg`,`20100201_ranges`.`s`.`rg`)
As we can see, this query is much more simple and completes 3 times as fast.
Finally, let’s see how can we retrieve the datetime values of the
start and end of the ranges from the LineString
’s:
SELECT s.id AS small_id, '1970-01-01' + INTERVAL Y(StartPoint(s.rg)) SECOND AS small_start, '1970-01-01' + INTERVAL Y(EndPoint(s.rg)) SECOND AS small_end, b.id AS big_id, '1970-01-01' + INTERVAL Y(StartPoint(b.rg)) SECOND AS big_start, '1970-01-01' + INTERVAL Y(EndPoint(b.rg)) SECOND AS big_end FROM t_small s STRAIGHT_JOIN t_big b ON MBRIntersects(b.rg, s.rg) LIMIT 10
small_id | small_start | small_end | big_id | big_start | big_end |
---|---|---|---|---|---|
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 1 | 2009-01-31 23:00:00 | 2009-02-01 00:52:25 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 2 | 2009-01-31 22:00:00 | 2009-01-31 22:01:55 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 3 | 2009-01-31 21:00:00 | 2009-01-31 21:32:21 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 4 | 2009-01-31 20:00:00 | 2009-01-31 20:36:01 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 5 | 2009-01-31 19:00:00 | 2009-01-31 20:23:00 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 6 | 2009-01-31 18:00:00 | 2009-01-31 19:06:56 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 7 | 2009-01-31 17:00:00 | 2009-01-31 18:25:41 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 8 | 2009-01-31 16:00:00 | 2009-01-31 17:47:38 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 9 | 2009-01-31 15:00:00 | 2009-01-31 15:41:06 |
1 | 2009-01-31 00:00:00 | 2009-02-01 20:58:03 | 10 | 2009-01-31 14:00:00 | 2009-01-31 14:02:38 |
10 rows fetched in 0.0006s (0.0024s) |
This retrieves the range bounds in plain DATETIME
format, suitable for output and calculations.
Hope that helps, John, and thanks for a nice tip!
I’m always glad to answer the questions regarding database queries.