Calculating Percentiles with MySQL, Round 2

My previous post on calculating percentiles with MySQL generated some comments and good discussion. In particular, I got some very interesting comments from Vladimir.

Basically, Vladimir doubted whether the GROUP_CONCAT() solution would be optimal in comparison to a JOIN. His proposal is to solve it like this:


SELECT SUM(g1.r) sr
, g2.length l
, SUM(g1.r)/(SELECT COUNT(*) FROM film) p
FROM (SELECT COUNT(*) r, length FROM film GROUP BY length) g1
JOIN (SELECT COUNT(*) r, length FROM film GROUP BY length) g2
ON g1.length < g2.length
GROUP BY g2.length
HAVING p > 0.9
ORDER BY p
LIMIT 1


First, this query sets up two identical subqueries in the FROM list, using GROUP BY and COUNT() to calculate the number of occurrences of each distinct value. Then, these are joined, and GROUP BY is applied again to calculate, for each distinct value, the total number of rows having a lower value, as well as that total expressed as a fraction of the total number of rows. Finally, HAVING is used to find the groups in the upper percentiles, and ORDER BY and LIMIT are used to single out the one desired percentile value.
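
To make the comparison with the other queries in this post concrete, this is roughly what that approach looks like against the sakila.payment table. This is my adaptation, not Vladimir's exact query; note that I use <= and >= here so that p is the same inclusive running fraction that the other solutions in this post compute:

SELECT SUM(g1.r) sr
,      g2.amount a
,      SUM(g1.r)/(SELECT COUNT(*) FROM sakila.payment) p
FROM   (SELECT COUNT(*) r, amount FROM sakila.payment GROUP BY amount) g1
JOIN   (SELECT COUNT(*) r, amount FROM sakila.payment GROUP BY amount) g2
ON     g1.amount <= g2.amount
GROUP BY g2.amount
HAVING p >= 0.9
ORDER BY p
LIMIT  1;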

As it turns out, this solution is slower for moderately small data sets, but much faster for large data sets. He benchmarked it for a total of 999 distinct values over a varying number of rows. Here are his slightly rounded numbers:


#rows:    group_concat:    groupby-join:
4M        1 min 6 sec      5.3 sec
1M        5 sec            2.5 sec
100K      0.5 sec          1.6 sec


Although the GROUP_CONCAT() solution seems to break down fairly quickly as the number of rows grows, he also writes:

I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

He concluded by saying:

But what I think is the right solution is having something like this on the server side:


SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length
FROM film
GROUP BY length
HAVING p >= 0.9 ORDER BY p LIMIT 1


Where "INC" is a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for SUM, but maybe also for MIN, MAX, AVG, COUNT and maybe some other aggregate functions.

So, COUNT(INC length) would be the cumulative count, or a running total of counts. The fun thing is, you can already do exactly that using user-defined variables. Look:


-- allow statement batching
DELIMITER go

-- initialize
SELECT 0, COUNT(*)
INTO @cum_cnt, @cnt
FROM sakila.payment;

-- calculate percentiles
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, -- running fraction of #rows per distinct amount
       amount
FROM sakila.payment
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;

go


and this gets us the result:


Query OK, 1 row affected (0.01 sec)

+-------------+--------+
| p           | amount |
+-------------+--------+
| 0.904542334 |   6.99 |
+-------------+--------+
1 row in set (0.03 sec)
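
As a quick sanity check (against the same sakila.payment data), this is indeed the inclusive 90th percentile: the fraction of payments with an amount of at most 6.99 should come out at the same 0.9045:

SELECT COUNT(*) / (SELECT COUNT(*) FROM sakila.payment) fraction_at_or_below
FROM   sakila.payment
WHERE  amount <= 6.99;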


Here is the equivalent GROUP_CONCAT solution:


SELECT SUBSTRING_INDEX(
         SUBSTRING_INDEX(
           GROUP_CONCAT(
             p.amount
             ORDER BY p.amount
             SEPARATOR ','
           )
         , ','
         , 90/100 * COUNT(*) + 1
         )
       , ','
       , -1
       ) AS `90th Percentile`
FROM sakila.payment AS p;


...and it is considerably slower:


+-----------------+
| 90th Percentile |
+-----------------+
| 6.99            |
+-----------------+
1 row in set (0.08 sec)


(sakila.payment has 16049 rows and 19 distinct values for amount)
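
In case you want to verify those figures yourself, a simple count will do:

SELECT COUNT(*)               row_count        -- 16049
,      COUNT(DISTINCT amount) distinct_amounts -- 19
FROM   sakila.payment;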

So, the sane thing to do would be to forget about the GROUP_CONCAT() idea and use this user-defined variable method instead. It does not have the nasty drawback of having to mess with group_concat_max_len, and I am pretty sure Vladimir's JOIN method would beat GROUP_CONCAT() across the board anyway.
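
For the record, "messing with group_concat_max_len" means something like the following, because GROUP_CONCAT() truncates its result once it exceeds that limit (only 1024 bytes by default on the servers I have used). The value below is just an arbitrary example that happens to be big enough for this table:

-- raise the limit for this session before running the GROUP_CONCAT() query;
-- how much you actually need depends on the number of rows and the length of the values
SET SESSION group_concat_max_len = 1024 * 1024;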

The one thing you could object to is the extra query needed to initialize the user-defined variables. You can get around that by initializing them in a single-row subquery in the FROM clause, a technique described by Baron Schwartz (see for example this post):


SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,
       amount
FROM sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
            ,      @cnt:=COUNT(*)
            FROM sakila.payment) p
GROUP BY amount
HAVING p >= 0.9
LIMIT 1;


This has the advantage that the variables are initialized in one go, together with the query itself, ensuring you are not accidentally working with uninitialized variables or with values left over from a previous statement.
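
And if you need a different percentile than the 90th, you can take the same idea one step further and put the desired fraction in a user-defined variable too. Here is a sketch (the @percentile name is just something I made up for this example):

SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,
       amount
FROM sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
            ,      @cnt:=COUNT(*)
            ,      @percentile:=0.5 -- 0.5 gives the median; use 0.9 for the 90th percentile
            FROM sakila.payment) init
GROUP BY amount
HAVING p >= @percentile
LIMIT 1;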

(BTW: if you want to do more with these user-defined variables, I can highly recommend Baron's site; see for example his article on advanced user-defined variable techniques.)