My previous post on calculating percentiles with MySQL generated some comments and good discussion. In particular, I got some very interesting comments from Vladimir.

Basically, Vladimir was doubtful whether the `GROUP_CONCAT()` solution would be optimal in comparison to a `JOIN`. His proposal is to solve it like this:

```SELECT   SUM(g1.r) sr,        g2.length l,        SUM(g1.r)/(SELECT COUNT(*) FROM film) pFROM    (SELECT COUNT(*) r, length FROM film GROUP BY length) g1JOIN    (SELECT COUNT(*) r, length FROM film GROUP BY length) g2ON       g1.length < g2.lengthGROUP BY g2.lengthHAVING p > 0.9ORDER BY pLIMIT 1
```

First, this query sets up two identical subqueries in the `FROM` list using `GROUP BY` and `COUNT()` to calculate the number of occurrences of each distinct value. Then, these are joined and ```GROUP BY``` is again applied to calculate the total number of rows having a lower value. Finally, `HAVING` is used to find the groups in the upper percentiles, and `LIMIT` and `ORDER BY` are used to single out the one desired percentile value.

As it turns out, this solution is slower for moderately small data sets, but much faster for large data sets. He benchmarked it for a total of 999 distinct values on varying number of rows. Here are his slightly rounded numbers:

```#rows:  group_concat:  groupby-join: 4M      1 min 6    sec    5.3 sec1M            5    sec    2.5 sec100K          0.5  sec    1.6 sec
```

Although `GROUP_CONCAT()` seems to break down pretty soon, he also writes:

I must admit that when N of distinct rows reaches approx. 10K I get pretty the opposite results if the total number of rows is relatively small. Basically we get into the same situation as with joining the whole tables.

He concluded by saying:

But what I think is the right solution is having something like this on the server side:

```SELECT COUNT(INC length)/(SELECT COUNT(*) FROM film) p, length FROM film GROUP BY lengthHAVING p >= 0.9 ORDER BY p LIMIT 1
```

Where "INC" is a flag that tells the server to not reset per-group counters in the aggregate functions. This would be quite a trivial change in the Item_sum class and would make sense not only for `SUM`, but maybe also for `MIN`, `MAX`, `AVG`, `COUNT` and maybe some other aggregate functions.

So, `COUNT(INC length)` would be the cumulative count, or a running total of counts. The fun thing is, you can already do exactly that using user-defined variables. Look:

```-- allow statement batchingDELIMITER go                -- initializeSELECT      0, COUNT(*)     INTO        @cum_cnt, @cntFROM        sakila.payment;-- calculate percentilesSELECT      @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,  -- running fraction of #rows per distinct amount            amount FROM        sakila.paymentGROUP BY    amount HAVING p >= 0.9 LIMIT 1;go
```

and this gets us the result:

```Query OK, 1 row affected (0.01 sec)+-------------+--------+| p           | amount |+-------------+--------+| 0.904542334 |   6.99 |+-------------+--------+1 row in set (0.03 sec)
```

Here is the equivalent `GROUP_CONCAT` solution:

```SELECT  SUBSTRING_INDEX(             SUBSTRING_INDEX(                 GROUP_CONCAT(                     p.amount                     ORDER BY p.amount                     SEPARATOR ','                 )             ,   ','             ,   90/100 * COUNT(*) + 1             )         ,   ','         ,   -1         )   AS `90th Percentile`FROM    sakila.payment AS p;
```

...and it is considerably slower:

```+-----------------+| 90th Percentile |+-----------------+| 6.99            |+-----------------+1 row in set (0.08 sec)
```

(`sakila.payment` has 16049 rows and 19 distinct values for amount)

So, the sane thing to do would be to forget about that `GROUP_CONCAT` idea, and use this method. It does not have the nasty drawbacks of having to mess with `group_concat_max_len` and I am pretty sure Vladimir's method will be faster across the board anyway.

The one thing you could object about is the extra query to initialize the user-defined variables. You can get around that by initializing those in a single row subquery in the `FROM` clause, a technique described by Baron Schwartz (see for example this post)

```SELECT      @cum_cnt:=@cum_cnt + COUNT(*) / @cnt as p,             amount FROM        sakila.paymentCROSS JOIN (SELECT @cum_cnt:=0            ,      @cnt:=COUNT(*)             FROM   sakila.payment) pGROUP BY    amount HAVING p >= 0.9 LIMIT 1;
```

This has the advantage that the variables are initialized in one go with the entire query, ensuring you are not accidentally working with uninitialized or garbage variables.

(BTW: If you want to do more with these user-defined variables, I can highly recommend more from Baron's site, see for example his article on advanced user defined variable techniques.)