My previous post on calculating percentiles with
MySQL generated some comments and good discussion. In particular,
I got some very interesting comments from Vladimir.

Basically, Vladimir was doubtful whether the
`GROUP_CONCAT()`

solution would be optimal in
comparison to a `JOIN`

. His proposal is to solve it
like this:

SELECT SUM(g1.r) sr

, g2.length l

, SUM(g1.r)/(SELECT COUNT(*) FROM film) p

FROM (SELECT COUNT(*) r, length FROM film GROUP BY length) g1

JOIN (SELECT COUNT(*) r, length FROM film GROUP BY length) g2

ON g1.length < g2.length

GROUP BY g2.length

HAVING p > 0.9

ORDER BY p

LIMIT 1

First, this query sets up two identical subqueries in the
`FROM`

list using `GROUP BY`

and
`COUNT()`

to calculate the number of occurrences of
each distinct value. Then, these are joined and ```
GROUP
BY
```

is again applied to calculate the total number of rows
having a lower value. Finally, `HAVING`

is used to
find the groups in the upper percentiles, and `LIMIT`

and `ORDER BY`

are used to single out the one desired
percentile value.

As it turns out, this solution is slower for moderately small
data sets, but much faster for large data sets. He benchmarked it
for a total of 999 distinct values on varying number of rows.
Here are his slightly rounded numbers:

#rows: group_concat: groupby-join:

4M 1 min 6 sec 5.3 sec

1M 5 sec 2.5 sec

100K 0.5 sec 1.6 sec

Although `GROUP_CONCAT()`

seems to break down pretty
soon, he also writes:

I must admit that when N of distinct rows reaches approx. 10K I
get pretty the opposite results if the total number of rows is
relatively small. Basically we get into the same situation as
with joining the whole tables.

He concluded by saying:

But what I think is the right solution is having something like
this on the server side:

SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length

FROM film

GROUP BY length

HAVING p >= 0.9 ORDER BY p LIMIT 1

Where "INC" is a flag that tells the server to not reset
per-group counters in the aggregate functions. This would be
quite a trivial change in the Item_sum class and would make sense
not only for `SUM`

, but maybe also for
`MIN`

, `MAX`

, `AVG`

,
`COUNT`

and maybe some other aggregate
functions.

So, `COUNT(INC length)`

would be the cumulative count,
or a running total of counts. The fun thing is, you can already
do exactly that using user-defined variables. Look:

-- allow statement batching

DELIMITER go

-- initialize

SELECT 0, COUNT(*)

INTO @cum_cnt, @cnt

FROM sakila.payment;

-- calculate percentiles

SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, -- running fraction of #rows per distinct amount

amount

FROM sakila.payment

GROUP BY amount

HAVING p >= 0.9

LIMIT 1;

go

and this gets us the result:

Query OK, 1 row affected (0.01 sec)

+-------------+--------+

| p | amount |

+-------------+--------+

| 0.904542334 | 6.99 |

+-------------+--------+

1 row in set (0.03 sec)

Here is the equivalent `GROUP_CONCAT`

solution:

SELECT SUBSTRING_INDEX(

SUBSTRING_INDEX(

GROUP_CONCAT(

p.amount

ORDER BY p.amount

SEPARATOR ','

)

, ','

, 90/100 * COUNT(*) + 1

)

, ','

, -1

) AS `90th Percentile`

FROM sakila.payment AS p;

...and it is considerably slower:

+-----------------+

| 90th Percentile |

+-----------------+

| 6.99 |

+-----------------+

1 row in set (0.08 sec)

(`sakila.payment`

has 16049 rows and 19 distinct
values for amount)

So, the sane thing to do would be to forget about that
`GROUP_CONCAT`

idea, and use this method. It does not
have the nasty drawbacks of having to mess with
`group_concat_max_len`

and I am pretty sure Vladimir's
method will be faster across the board anyway.

The one thing you could object about is the extra query to
initialize the user-defined variables. You can get around that by
initializing those in a single row subquery in the
`FROM`

clause, a technique described by Baron Schwartz (see
for example this post)

SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,

amount

FROM sakila.payment

CROSS JOIN (SELECT @cum_cnt:=0

, @cnt:=COUNT(*)

FROM sakila.payment) p

GROUP BY amount

HAVING p >= 0.9

LIMIT 1;

This has the advantage that the variables are initialized in one
go with the entire query, ensuring you are not accidentally
working with uninitialized or garbage variables.

(BTW: If you want to do more with these user-defined variables, I
can highly recommend more from Baron's site, see for example
his article on advanced user defined variable
techniques.)