Yesterday, I was on the freenode ##pentaho irc channel when
Andres
Chaves asked me how to calculate the Nth percentile in MySQL. He saw a
solution somewhere using subqueries, but wasn't too happy about
it.
A while ago I wrote about calulating the median in MySQL, and it turns
out the Nth percentile can be calculated using a
similar, single-pass approach, not relying on subqueries, UDFs,
or user-defined variables.
The percentile....
So, what is a percentile exactly? Here's what the wikipedia
says:
A percentile is the value of a variable below which a certain
percent of observations fall. So the 20th percentile is the value
(or score) below which 20 percent of the observations may be
found.
....and the median
The wikipedia continues and hints at the relationship between the
Nth percentile and the median:
The 25th percentile is also known as the first quartile; the 50th
percentile as the median.
Sidenote: as I understand it, this latter remark concerning the
median is not entirely correct. The median is "the middle value":
the number of observations with a higher value is equal to the
number of observations that has a lower value. For the series
{1,2,3,4} the median would be 2+3 = 2.5. The 50th percentile
would be 3 in this case. However, in most practical cases the
values are fairly evenly distributed and sufficiently large,
which means the difference between the median and 50th percentile
will be small or absent.
However, the median and percentile problems are quite similar: in
both cases, a value is picked or computed that is higher than the
value found in a particular portion of the total number of
observations. For the median, the requirement is that that
particular portion is equal to the number of observations that
exceeds it; for the Nth percentile the requirement is
that that portion constitutes N percent of the total
number of observations.
The Solution
In the following example, we calculate the 90th percentile of
film lengths:
SELECT SUBSTRING_INDEX(
SUBSTRING_INDEX(
GROUP_CONCAT( -- 1) make a sorted list of values
f.length
ORDER BY f.length
SEPARATOR ','
)
, ',' -- 2) cut at the comma
, 90/100 * COUNT(*) + 1 -- at the position beyond the 90% portion
)
, ',' -- 3) cut at the comma
, -1 -- right after the desired list entry
) AS `90th Percentile`
FROM sakila.film AS f
Here, the literal 90
represents which percentile we
want, that is, "the 90th percentile".
(If you like, you can leave out the SEPARATOR ','
bit, as the default separator is a comma anyway. I just wanted to
have a clear indication for the source of the ','
arguments in the SUBSTRING_INDEX()
calls)
Explanation
The median and Nth percentile problem can be solved
using a similar apporoach: first, we use the string aggregate
function GROUP_CONCAT()
to create an ordered
list of values. Then we use the substring variation
SUBSTRING_INDEX()
to find and excise a
list entry at a particular desired position.
The differences between the solutions for the median and the
Nth is mainly in which entry we have to pick from the
list. (See my prior article for the full story on median). For
the Nth percentile, we first calculate the desired
portion of the total number of observations. Because N
is defined as a percentage, we divide by 100 to get the actual
fraction of the total number of observations:
N / 100
Then, we multiply by the total number of observations to find the
number of observations that make up the actual portion of
observations within the specified percentile:
N / 100 * COUNT(*)
Because we want to find the observation for which the specified
portion has a lower value, we need to look at the next entry
instead of the last entry within the portion, so we add 1:
N / 100 * COUNT(*) + 1
Caveats
There are a number of things to look out for:
Percentile value not unique
When we calculate the 90th percentile of the film lengths, we get
173:
+-----------------+
| 90th Percentile |
+-----------------+
| 173 |
+-----------------+
If we check the result by counting the portion of films with a
length lower than 173
we see:
mysql> SELECT 100 * COUNT(IF(f.length < 173, 1, NULL))/COUNT(*) `Percentage`
-> FROM film AS f;
+------------+
| Percentage |
+------------+
| 89.4212 |
+------------+
The reason that we do not get 90% is that there are multiple
occurrences of films with length equal to 173:
mysql> SELECT title FROM film WHERE length = 173;
+----------------------+
| title |
+----------------------+
| BALLROOM MOCKINGBIRD |
| CONQUERER NUTS |
| FIRE WOLVES |
| GLADIATOR WESTWARD |
| PIZZA JUMANJI |
| TALENTED HOMICIDE |
| VELVET TERMINATOR |
+----------------------+
7 rows in set (0.01 sec)
So, even though we may have picked the entry at the right
position, this may still be a value within the specified portion
instead of beyond. The definitions I found for the Nth
percentile do not stipulate any mechanism to deal with this kind
of ambiguity, whereas for the median, the correct value is always
found by averaging the left and right median if necessary.
What about NULL
The second caveat are NULL
values. Currently this
method does not work when the column for which you want to
caculate percentile values is nullable. It is possible to work
around this though. If you can exclude the rows with the
NULL
value entirely, you can simply add a
WHERE
clause. This is a good idea also because it
will cull the number of rows to process.
It may not be acceptable to throw away the rows with NULL values,
for example if there is another expression in your
SELECT
list that needs to do something with all
rows. You can then still work around it, with some extra hassle.
It would involve tweaking the GROUP_CONCAT
to ignore
NULL values. This could be done like this:
GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)
This will ensure that GROUP_CONCAT()
does not also
return NULL
when a NULL
value is
present in the specified column. If there are NULL
values, these will end up as a list of comma's in the head of the
result:
,,,,<non-null-value1>,...,<non-null-valueN> -- 4 NULL values
Assuming we know the number of NULL
's (and we do) we
can clean up our list easily with just
SUBSTRING()
:
SUBSTRING(
GROUP_CONCAT(
IF(<column> IS NULL
, ''
, <column>)
ORDER BY <column>
SEPARATOR ','
)
, SUM(IF(<column> IS NULL, 1, 0)) + 1
)
Because we are ignoring the NULL values in our list, we must
likewise ignore them in our calculation of the portion of rows.
So, instead if COUNT(*)
we should use
COUNT(<column>)
in order to not count the
NULL
values.
group_concat_max_len
When using GROUP_CONCAT()
, an issue that should
always be on your radar is the maximum length of the
GROUP_CONCAT()
result. If the result value exceeds
the maximum length, the GROUP_CONCAT()
result will
be truncated, and a warning will be issued:
1 row in set, 1 warning (0.00 sec)
mysql> show warnings;
+---------+------+--------------------------------------+
| Level | Code | Message |
+---------+------+--------------------------------------+
| Warning | 1260 | 1 line(s) were cut by GROUP_CONCAT() |
+---------+------+--------------------------------------+
It is really important to be aware of any warnings, as a
truncation of the GROUP_CONCAT()
result messes up
the entire calculation.
The maximum length of the GROUP_CONCAT()
result is
controlled through the group_concat_max_len
system
variable.
It can be set thus:
SET @@group_concat_max_len := <num-bytes>
The maximum practical value is the maximum packet size, which is
available as the max_allowed_packet
system
variable. This means you can write:
SET @@group_concat_max_len := @@max_allowed_packet;
and you will never be bothered by this problem again. The
GROUP_CONCAT()
result can still be too large though
(namely, larger than the maximum packet size) but in that case
you will get a proper error instead of a truncated result.
You should realize that setting the
group_concat_max_len
to a high value may lead to
memory problems, as each GROUP_CONCAT()
invocation
may individually reserve the specified amount of memory to deal
with its result.
Finally...
I will maintain this percentile calculation as a snippet on the MySQL Forge site. Please go there to find the
latest version.
Now, finally, I have a question for you. When I wrote about the
median calculation, I mentioned that I thought it was an original
method, and I asked whether someone could deny that claim. I did
not get any reaction, so I'd like to repeat my question: Do you
know of any book or text or blog that describes this technique?
If so, let me know so I can provide proper accreditation.
TIA, Roland.