Basically, Vladimir was doubtful whether the `GROUP_CONCAT()` solution would be optimal in comparison to a `JOIN`. His proposal is to solve it like this:

```sql
SELECT SUM(g1.r) sr
,      g2.length l
,      SUM(g1.r)/(SELECT COUNT(*) FROM film) p
FROM  (SELECT COUNT(*) r, length FROM film GROUP BY length) g1
JOIN  (SELECT COUNT(*) r, length FROM film GROUP BY length) g2
ON     g1.length <= g2.length
GROUP BY g2.length
HAVING p > 0.9
ORDER BY p
LIMIT 1
```

First, this query sets up two identical subqueries in the `FROM` list, using `GROUP BY` and `COUNT()` to calculate the number of occurrences of each distinct value. Then, these are joined and `GROUP BY` is applied again to calculate the total number of rows having a lower or equal value. Finally, `HAVING` is used to find the groups in the upper percentiles, and `ORDER BY` and `LIMIT` are used to single out the one desired percentile value.

As it turns out, this solution is slower for moderately small data sets, but much faster for large data sets. He benchmarked it for a total of 999 distinct values on varying numbers of rows. Here are his slightly rounded numbers:

| #rows | group_concat | groupby-join |
|-------|--------------|--------------|
| 4M    | 1 min 6 sec  | 5.3 sec      |
| 1M    | 5 sec        | 2.5 sec      |
| 100K  | 0.5 sec      | 1.6 sec      |
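To make the mechanics concrete, here is a minimal Python sketch (my own illustration, not code from the post) of what the join-based query computes: per-value counts, a cumulative sum over the ordered values, and the first value whose cumulative fraction of the total exceeds the percentile threshold.

```python
from collections import Counter

def percentile_groupby_join(values, fraction=0.9):
    """Mimic the self-join query: count occurrences per distinct value,
    accumulate those counts over the values in ascending order (the effect
    of joining on g1.length <= g2.length), and return the first value whose
    cumulative fraction of the total exceeds `fraction`."""
    counts = Counter(values)           # the two identical GROUP BY subqueries
    total = len(values)                # SELECT COUNT(*) FROM film
    cum = 0
    for value in sorted(counts):       # groups ordered by value
        cum += counts[value]           # SUM(g1.r) for this group
        if cum / total > fraction:     # HAVING p > 0.9 ... ORDER BY p LIMIT 1
            return value
    return None

# Hypothetical film lengths, just for illustration
lengths = [46, 46, 48, 50, 50, 50, 60, 120, 180, 185]
print(percentile_groupby_join(lengths, 0.9))  # -> 185
```

The join is what turns a per-group count into a running total: each group's row in the result sums the counts of all groups at or below it.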

Although `GROUP_CONCAT()` seems to break down pretty soon, he also writes:

> I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

He concluded by saying:

> But what I think is the right solution is having something like this on the server side:

```sql
SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length
FROM   film
GROUP BY length
HAVING p >= 0.9
ORDER BY p
LIMIT  1
```

> Where "INC" is a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the `Item_sum` class and would make sense not only for `SUM`, but maybe also for `MIN`, `MAX`, `AVG`, `COUNT` and maybe some other aggregate functions.

So, `COUNT(INC length)` would be the cumulative count, or a running total of counts. The fun thing is, you can already do exactly that using user-defined variables. Look:

```sql
-- allow statement batching
DELIMITER go

-- initialize
SELECT 0, COUNT(*)
INTO   @cum_cnt, @cnt
FROM   sakila.payment;

-- calculate percentiles
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p, -- running fraction of #rows per distinct amount
       amount
FROM   sakila.payment
GROUP BY amount
HAVING p >= 0.9
LIMIT  1;

go
```

and this gets us the result:

```
Query OK, 1 row affected (0.01 sec)

+-------------+--------+
| p           | amount |
+-------------+--------+
| 0.904542334 | 6.99   |
+-------------+--------+
1 row in set (0.03 sec)
```
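The user-variable trick is easy to see in miniature. Here is a small Python sketch (again my own illustration, not from the post) of the same running total: `cum` plays the role of `@cum_cnt`, `cnt` plays `@cnt`, and the loop visits the groups in ascending order, the order the query relies on `GROUP BY` producing.

```python
from collections import Counter

def percentile_user_vars(values, fraction=0.9):
    """Mimic the user-variable query: per group, add that group's fraction
    of the total row count to a running sum, and stop at the first group
    where the running fraction reaches `fraction`."""
    counts = Counter(values)            # GROUP BY amount: per-group row counts
    cnt = len(values)                   # @cnt, initialized once up front
    cum = 0.0                           # @cum_cnt, initialized to 0
    for amount in sorted(counts):       # groups in ascending order
        cum += counts[amount] / cnt     # @cum_cnt := @cum_cnt + COUNT(*)/@cnt
        if cum >= fraction:             # HAVING p >= 0.9 ... LIMIT 1
            return cum, amount
    return None
```

Note the single pass over the groups: unlike the self-join, no group is visited more than once, which is why this approach scales so much better.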

Here is the equivalent `GROUP_CONCAT` solution:

```sql
SELECT SUBSTRING_INDEX(
         SUBSTRING_INDEX(
           GROUP_CONCAT(
             p.amount
             ORDER BY p.amount
             SEPARATOR ','
           )
         , ','
         , 90/100 * COUNT(*) + 1
         )
       , ','
       , -1
       ) AS `90th Percentile`
FROM   sakila.payment AS p;
```

...and it is considerably slower:

```
+-----------------+
| 90th Percentile |
+-----------------+
| 6.99            |
+-----------------+
1 row in set (0.08 sec)
```
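For comparison, the nested `SUBSTRING_INDEX` calls boil down to simple list slicing. A Python sketch of that logic (my own illustration; the string building is what makes the SQL version expensive) looks like this:

```python
def percentile_group_concat(values, pct=90):
    """Mimic the GROUP_CONCAT solution: build one comma-separated string of
    all values in ascending order, keep everything up to item pct% + 1
    (the inner SUBSTRING_INDEX), then take the last remaining item
    (the outer SUBSTRING_INDEX with count -1)."""
    concat = ','.join(str(v) for v in sorted(values))  # GROUP_CONCAT ... ORDER BY
    keep = int(pct / 100 * len(values) + 1)            # 90/100 * COUNT(*) + 1
    head = ','.join(concat.split(',')[:keep])          # SUBSTRING_INDEX(s, ',', keep)
    return head.split(',')[-1]                         # SUBSTRING_INDEX(s, ',', -1)

print(percentile_group_concat(list(range(1, 101))))  # -> 91
```

The sketch makes the cost visible: the whole column is serialized into one string before a single value is picked out, whereas the running-total approach never materializes anything per row.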

(`sakila.payment` has 16049 rows and 19 distinct values for amount.)

So, the sane thing to do would be to forget about that `GROUP_CONCAT` idea and use this method instead. It does not have the nasty drawback of having to mess with `group_concat_max_len`, and I am pretty sure Vladimir's method will be faster across the board anyway. The one thing you could object to is the extra query needed to initialize the user-defined variables. You can get around that by initializing them in a single-row subquery in the `FROM` clause, a technique described by Baron Schwartz (see for example this post):

```sql
SELECT @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,
       amount
FROM   sakila.payment
CROSS JOIN (SELECT @cum_cnt:=0
            ,      @cnt:=COUNT(*)
            FROM   sakila.payment) p
GROUP BY amount
HAVING p >= 0.9
LIMIT  1;
```

This has the advantage that the variables are initialized in one go with the entire query, ensuring you are not accidentally working with uninitialized or garbage variables.

(BTW: if you want to do more with these user-defined variables, I can highly recommend more from Baron's site; see for example his article on advanced user-defined variable techniques.)