I just read SQL: Ranking without self join, in which
Shlomi
Noach shares a nice MySQL-specific trick based on user-defined variables to compute rankings.
Shlomi's trick reminds me somewhat of the trick I came across
little over a year ago to caclulate percentiles. At that time, several
people pointed out to me too that using user-defined variables in
this way can be unreliable.The problem with user-defined
variablesSo what is the problem exaclty? Well, whenever a query
assigns to a variable, and that same variable is read in another
part of the query, you're on thin ice. That's because the result
of the read is likely to differ depending on whether the
assignment took place before or after the read. Not surprising
when you think about it - the whole point of variable assignment
is to change its value, which by definition causes a different
result when subsequently reading the variable (unless you
assigned the already assigned value of course, duh...).
Now watch that previous statement clearly - the word
subsequently is all-important.
See, that's the problem. The semantics of a SQL SELECT
statement is to obtain a (tabular)
resultset - not specifying an algorithm to construct that
resultset. It is the job of the RDBMS to figure out an algorithm
and thus, you can't be sure in what order individual expressions
(including variable evaluation and assignment) are
executed.
The MySQL manual states it like this:
The order of evaluation for user variables is undefined and may
change based on the elements contained within a given query. In
SELECT @a, @a := @a+1 ...
, you might think that
MySQL will evaluate @a
first and then do an
assignment second, but changing the query (for example, by adding
a GROUP BY
, HAVING
, or ORDER
BY
clause) may change the order of evaluation.
The general rule is never to assign a value to a user variable in
one part of a statement and use the same variable in some other
part of the same statement. You might get the results you expect,
but this is not guaranteed.So what good are these variables
anyway?On the one hand, this looks really lame: can't MySQL just
figure out the correct order of doing the calulations? Well, that
is one way of looking at it. But there is an equally valid reason
not to do that. If the calculations would influence execution
order, it would drastically lessen the number of ways that are
available to optimize the statement.
This begs the question: Why is it possible at all to assign
values to the user-defined variables? The answer is quite simple:
you can use it to pass values between statetments. My hunch is
the variables were created in the olden days to overcome some
limitations resulting from the lack of support for subqueries.
Having variables at least enables you to execute a query and
assign the result temporarily for use in a subsequent statement.
For example, to find the student with the highest score, you can
do:
mysql> select @score:=max(score) from score;
+--------------------+
| @score:=max(score) |
+--------------------+
| 97 |
+--------------------+
1 row in set (0.00 sec)
mysql> select * from score where score = @score;
+----------+--------------+-------+
| score_id | student_name | score |
+----------+--------------+-------+
| 2 | Gromit | 97 |
+----------+--------------+-------+
1 row in set (0.03 sec)
There is nothing wrong with this approach - problems start
arising only when reading and writing the same variable in one
and the same statement.Another way - serializing the set with
GROUP_CONCAT
Anyway, the percentile post I just linked to contains another
solution for that problem that relies on GROUP_CONCAT
. It turns out we can use
the same trick here.
(Some people may like to point out that using
GROUP_CONCAT
is not without issues either, because
it may truncate the list in case the pre-assigned string buffer
is not large enough. I wrote about dealing with that limitation
in several places and I remain recommending to set the
group_concat_max_len
server variable to the value
set for the max_packet_size
server variable like so:
SET @@group_concat_max_len := @@max_allowed_packet;
)
The best way to understand how it works is to think of the
problem in a few steps. First, we make an ordered list of all the
values we want to rank. We can do this with
GROUP_CONCAT
like this:
mysql> SELECT GROUP_CONCAT(
-> DISTINCT score
-> ORDER BY score DESC
-> ) AS scores
-> FROM score
-> ;
+-------------+
| scores |
+-------------+
| 97,95,92,85 |
+-------------+
1 row in set (0.00 sec)
Now that we have this list, we can use the FIND_IN_SET
function to look up the
position of any particlar value contained in the list. Because
the list is ordered in descending order (due to the ORDER
BY ... DESC
), and contains only unique values (due to the
DISTINCT
), this position is in fact the rank number.
For example, if we want to know the rank of all scores with the
value 92, we can do:
mysql> SELECT FIND_IN_SET(92, '97,95,92,85')
+--------------------------------+
| FIND_IN_SET(92, '97,95,92,85') |
+--------------------------------+
| 3 |
+--------------------------------+
1 row in set (0.00 sec)
So, the answer is 3
because 92
is the
third entry in the list.
(If you're wondering how it's possible that we can pass the
integer 92
as first argument for
FIND_IN_SET
: the function expects string arguments,
and automatically converts whichever non-string typed value we
pass to a string. In the case of the integer 92
, it
is silently converted to the string '92'
)
Of course, we are't really interested in looking up ranks for
individual numbers one at a time; rather, we'd like to combine
this with a query on the scores
table that does it
for us. Likewise, we don't really want to manually supply the
list of values as a string constant, we want to substitute that
with the query we wrote to generate that list.
So, we get:
mysql> SELECT score_id, student_name, score
-> , FIND_IN_SET(
-> score
-> , (SELECT GROUP_CONCAT(
-> DISTINCT score
-> ORDER BY score DESC
-> )
-> FROM score)
-> ) as rank
-> FROM score;
+----------+--------------+-------+------+
| score_id | student_name | score | rank |
+----------+--------------+-------+------+
| 1 | Wallace | 95 | 2 |
| 2 | Gromit | 97 | 1 |
| 3 | Shaun | 85 | 4 |
| 4 | McGraw | 92 | 3 |
| 5 | Preston | 92 | 3 |
+----------+--------------+-------+------+
5 rows in set (0.00 sec)
Alternatively, if you think that subqueries are for the devil,
you can rewrite this to a CROSS JOIN
like so:
SELECT score_id, student_name, score
, FIND_IN_SET(
score
, scores
) AS rank
FROM score
CROSS JOIN (SELECT GROUP_CONCAT(
DISTINCT score
ORDER BY score DESC
) AS scores
FROM score) scores
Now that we have a solutions, lets see how it compares to
Shlomi's original method. To do this, I am using the
payment
table from the sakila sample database.
First, Shlomi's method:
mysql> SELECT payment_id
-> , amount
-> , @prev := @curr
-> , @curr := amount
-> , @rank := IF(@prev = @curr, @rank, @rank+1) AS rank
-> FROM sakila.payment
-> , (SELECT @curr := null, @prev := null, @rank := 0) sel1
-> ORDER BY amount DESC;
+------------+--------+----------------+-----------------+------+
| payment_id | amount | @prev := @curr | @curr := amount | rank |
+------------+--------+----------------+-----------------+------+
| 342 | 11.99 | NULL | 11.99 | 1 |
. ... . ..... . ..... . ..... . . .
| 15456 | 0.00 | 0.00 | 0.00 | 19 |
+------------+--------+----------------+-----------------+------+
16049 rows in set (0.09 sec)
Wow! It sure is fast :) Now, the GROUP_CONCAT
solution, using a subquery:
mysql> SELECT payment_id, amount
-> , FIND_IN_SET(
-> amount
-> , (SELECT GROUP_CONCAT(
-> DISTINCT amount
-> ORDER BY amount DESC
-> )
-> FROM sakila.payment)
-> ) as rank
-> FROM sakila.payment
+------------+--------+------+
| payment_id | amount | rank |
+------------+--------+------+
| 1 | 2.99 | 15 |
. . . .... . .. .
| 16049 | 2.99 | 15 |
+------------+--------+------+
16049 rows in set (0.14 sec)
(In case you're wondering why the results are different, this is
because the result set for Shlomi's solution is necessarily
ordered by ascending rank (or descending amount - same
difference. To obtain the identical result, you need to add an
ORDER BY
clause to my query. But since the point was
to calculate the ranks, I didn't bother. Of course, adding an
ORDER BY
could slow things down even more.)
Quite a bit slower, bummer. But at leastt we can't run into
nasties with the user variables anymore. For this data set, I get
about the same performance with the CROSS JOIN
, but
I should warn that I did not do a real benchmark.
ConclusionDon't fall into the trap of reading and writing the
same user-defined variable in the same statement. Although it
seems like a great device and can give you very good performance,
you cannot really control the order of reads and writes. Even if
you can, you must check it again whenever you have reason to
believe the query will be solved differently by the server. This
is of course the case whenever you upgrade the server. But also
seemingly harmless changes like adding an index to a table may
change the order of execution.
Almost all cases where people want to read and write to the same
user variables within the same query, they are dealing with a
kind of serialization problem. They are trying to maintain state
in a variable in order to use it across rows. In many cases, the
right way to do that is to use a self-join. But this may not
always be feasible, as pointed out in Shlomi's original post. For
example, rewriting the payment rank query using a self join is
not going to make you happy.
Often, there is a way out. You can use GROUP_CONCAT
to serialize a set of rows. Granted, you need at least one pass
for that, and another one to do something useful with the result,
but this still a lot better than dealing with semi-cartesian self
join issues.