My previous post on calculating percentiles with
MySQL generated some comments and good discussion. In particular,
I got some very interesting comments from Vladimir.
Basically, Vladimir was doubtful whether the GROUP_CONCAT() solution would be
optimal in comparison to a JOIN. His proposal is to solve it like this:
SELECT SUM(g1.r) sr
, g2.length l
, SUM(g1.r)/(SELECT COUNT(*) FROM film) p
FROM (SELECT COUNT(*) r, length FROM film GROUP BY length) g1
JOIN (SELECT COUNT(*) r, length FROM film GROUP BY length) g2
ON g1.length < g2.length
GROUP BY g2.length
HAVING p > 0.9
ORDER BY p
LIMIT 1
First, this query sets up two identical subqueries in the FROM list, using
GROUP BY and COUNT() to calculate the number of occurrences of each distinct
value. Then, these are joined and GROUP BY is again applied to calculate the
total number of rows having a lower value. Finally, HAVING is used to find the
groups in the upper percentiles, and ORDER BY and LIMIT are used to single out
the one desired percentile value.
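To see concretely how the self-join produces a running total, here is a minimal
sketch against a hypothetical toy table (the table name and values are mine,
purely for illustration; they are not part of Vladimir's benchmark):

CREATE TABLE toy (length INT);
INSERT INTO toy VALUES (10), (10), (20), (20), (20), (30);

-- same pattern as Vladimir's query, but without HAVING/ORDER BY/LIMIT,
-- so that every cumulative fraction remains visible
SELECT SUM(g1.r) sr
,      g2.length l
,      SUM(g1.r)/(SELECT COUNT(*) FROM toy) p
FROM (SELECT COUNT(*) r, length FROM toy GROUP BY length) g1
JOIN (SELECT COUNT(*) r, length FROM toy GROUP BY length) g2
ON g1.length < g2.length
GROUP BY g2.length;

For length 20, g1 contributes only the two length-10 rows, so sr is 2 and p is
2/6; for length 30 it contributes both lower groups (2 + 3 rows), so sr is 5
and p is 5/6. Note that the lowest value never appears in the output at all,
because no group is strictly lower than it; that is fine here, since we are
only after the upper percentiles.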
As it turns out, this solution is slower for moderately small data sets, but
much faster for large data sets. He benchmarked it for a total of 999 distinct
values on a varying number of rows. Here are his slightly rounded numbers:
#rows:    group_concat:    groupby-join:
4M        1 min 6 sec      5.3 sec
1M        5 sec            2.5 sec
100K      0.5 sec          1.6 sec
Although GROUP_CONCAT()
seems to break down pretty
soon, he also writes:
I must admit that when N of distinct rows reaches approx. 10K I
get pretty the opposite results if the total number of rows is
relatively small. Basically we get into the same situation as
with joining the whole tables.
He concluded by saying:
But what I think is the right solution is having something like
this on the server side:
SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length
FROM film
GROUP BY length
HAVING p >= 0.9 ORDER BY p LIMIT 1
Where "INC" is a flag that tells the server to not reset
per-group counters in the aggregate functions. This would be
quite a trivial change in the Item_sum class and would make sense
not only for SUM, but maybe also for MIN, MAX, AVG, COUNT and maybe some other
aggregate functions.
So, COUNT(INC length)
would be the cumulative count,
or a running total of counts. The fun thing is, you can already
do exactly that using user-defined variables. Look:
-- allow statement batching
DELIMITER go
-- initialize
SELECT 0, COUNT(*)
INTO @cum_cnt, @cnt
FROM sakila.payment;
-- calculate percentiles
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, -- running fraction of #rows per distinct amount
amount
FROM sakila.payment
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;
go
and this gets us the result:
Query OK, 1 row affected (0.01 sec)
+-------------+--------+
| p | amount |
+-------------+--------+
| 0.904542334 | 6.99 |
+-------------+--------+
1 row in set (0.03 sec)
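Nothing in there is specific to the 90th percentile, by the way. Here is a
quick sketch of the same thing with the cut-off kept in a user-defined variable
as well (the @pct variable is my own addition for illustration, not something
from the queries above), computing the median instead:

-- initialize
SET @pct := 0.5; -- the desired percentile, as a fraction

SELECT 0, COUNT(*)
INTO @cum_cnt, @cnt
FROM sakila.payment;

-- calculate the @pct percentile
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p
,      amount
FROM sakila.payment
GROUP BY amount
HAVING p >= @pct
LIMIT 1;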
Here is the equivalent GROUP_CONCAT
solution:
SELECT SUBSTRING_INDEX(
         SUBSTRING_INDEX(
           GROUP_CONCAT(
             p.amount
             ORDER BY p.amount
             SEPARATOR ','
           )
         , ','
         , 90/100 * COUNT(*) + 1
         )
       , ','
       , -1
       ) AS `90th Percentile`
FROM sakila.payment AS p;
...and it is considerably slower:
+-----------------+
| 90th Percentile |
+-----------------+
| 6.99 |
+-----------------+
1 row in set (0.08 sec)
(sakila.payment
has 16049 rows and 19 distinct
values for amount)
So, the sane thing to do would be to forget about that GROUP_CONCAT idea, and
use this method. It does not have the nasty drawbacks of having to mess with
group_concat_max_len, and I am pretty sure Vladimir's method will be faster
across the board anyway.
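To spell out that drawback: GROUP_CONCAT() truncates its result once it grows
beyond group_concat_max_len bytes, which is only 1024 by default, so a list of
16049 concatenated amounts gets cut off and the SUBSTRING_INDEX() arithmetic
would pick its value from the truncated list (the server only issues a warning
when that happens). You would have to remember to raise the limit in every
session that runs the query, along these lines:

SET SESSION group_concat_max_len = 1024 * 1024; -- room for all concatenated amounts

The user-defined variable approach needs none of that.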
The one thing you could object to is the extra query to initialize the
user-defined variables. You can get around that by initializing them in a
single-row subquery in the FROM clause, a technique described by Baron Schwartz
(see for example this post):
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,
       amount
FROM sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
            ,      @cnt:=COUNT(*)
            FROM   sakila.payment) p
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;
This has the advantage that the variables are initialized in one
go with the entire query, ensuring you are not accidentally
working with uninitialized or garbage variables.
(BTW: if you want to do more with these user-defined variables, I can highly
recommend more from Baron's site; see for example his article on advanced
user-defined variable techniques.)