Planet MySQL Planet MySQL: Meta Deutsch Español Français Italiano 日本語 Русский Português 中文
SQL: selecting top N records per group
+5 Vote Up -0 Vote Down

A while back I presented(*) an SQL trick to present with non-aggregated column on a GROUP BY query, without use of subquery or derived tables.

Based on a similar concept, combined with string walking, I now present a query which selects top-n records for each group, ordered by some condition. It will require no subqueries. It executes faster than its more conventional alternatives.

[UPDATE: this is MySQL only. Others can use Window Functions where available]

Using the simple world database, we answer the following question:

What are the top 5 largest (by area) countries for each continent? What are their names, surface area and population?

Similar questions would be:

What were the latest 5 films rented by each customer?

What were the most presented advertisements for each user?

etc.

Step 1: getting the top

We already know how to get a single column’s value for the top country, as presented in the aforementioned post:

SELECT
 Continent,
 SUBSTRING_INDEX(
   GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
   ',', 1) AS Name
FROM
 Country
GROUP BY
 Continent
;
+---------------+--------------------+
| Continent     | Name               |
+---------------+--------------------+
| Asia          | China              |
| Europe        | Russian Federation |
| North America | Canada             |
| Africa        | Sudan              |
| Oceania       | Australia          |
| Antarctica    | Antarctica         |
| South America | Brazil             |
+---------------+--------------------+

Step 2: adding columns

This part is easy: just throw in the rest of the columns (again, only indicating the top country in each continent)

SELECT
 Continent,
 SUBSTRING_INDEX(
   GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
   ',', 1) AS Name,
 SUBSTRING_INDEX(
   GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
   ',', 1) AS SurfaceArea,
 SUBSTRING_INDEX(
   GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
   ',', 1) AS Population
FROM
 Country
GROUP BY
 Continent
;

+---------------+--------------------+-------------+------------+
| Continent     | Name               | SurfaceArea | Population |
+---------------+--------------------+-------------+------------+
| Asia          | China              | 9572900.00  | 1277558000 |
| Europe        | Russian Federation | 17075400.00 | 146934000  |
| North America | Canada             | 9970610.00  | 31147000   |
| Africa        | Sudan              | 2505813.00  | 29490000   |
| Oceania       | Australia          | 7741220.00  | 18886000   |
| Antarctica    | Antarctica         | 13120000.00 | 0          |
| South America | Brazil             | 8547403.00  | 170115000  |
+---------------+--------------------+-------------+------------+

Step 3: casting

You’ll notice that the Population column from this last execution is aligned to the left. This is because it is believed to be a string. The GROUP_CONCAT clause concatenates values in one string, and SUBSTRING_INDEX parses a substring. The same applies to the SurfaceArea column. We’ll cast Population as UNSIGNED and SurfaceArea as DECIMAL:

SELECT
  Continent,
  SUBSTRING_INDEX(
    GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
    ',', 1) AS Name,
  CAST(
    SUBSTRING_INDEX(
      GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
      ',', 1)
    AS DECIMAL(20,2)
    ) AS SurfaceArea,
  CAST(
    SUBSTRING_INDEX(
      GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
      ',', 1)
    AS UNSIGNED
    ) AS Population
FROM
 Country
GROUP BY
 Continent
;
+---------------+--------------------+-------------+------------+
| Continent     | Name               | SurfaceArea | Population |
+---------------+--------------------+-------------+------------+
| Asia          | China              |  9572900.00 | 1277558000 |
| Europe        | Russian Federation | 17075400.00 |  146934000 |
| North America | Canada             |  9970610.00 |   31147000 |
| Africa        | Sudan              |  2505813.00 |   29490000 |
| Oceania       | Australia          |  7741220.00 |   18886000 |
| Antarctica    | Antarctica         | 13120000.00 |          0 |
| South America | Brazil             |  8547403.00 |  170115000 |
+---------------+--------------------+-------------+------------+

Step 4: top n records

It’s time to use string walking. Examples for string walking (described in the excellent SQL Cookbook) can be found here, here and here. We’ll be using a numbers table: a simple table which lists ascending integer numbers. For example, you can use the following:

DROP TABLE IF EXISTS `tinyint_asc`;

CREATE TABLE `tinyint_asc` (
 `value` tinyint(3) unsigned NOT NULL default '0',
 PRIMARY KEY (value)
) ;

INSERT INTO `tinyint_asc` VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250),(251),(252),(253),(254),(255);

The trick is to apply the same technique as used above, not for a single row, but for several rows. Here’s how to present the top 5 countries:

SELECT
  Continent,
  SUBSTRING_INDEX(
    SUBSTRING_INDEX(
      GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
      ',', value),
    ',', -1)
    AS Name,
  CAST(
    SUBSTRING_INDEX(
      SUBSTRING_INDEX(
        GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
        ',', value),
      ',', -1)
    AS DECIMAL(20,2)
    ) AS SurfaceArea,
  CAST(
    SUBSTRING_INDEX(
      SUBSTRING_INDEX(
        GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
        ',', value),
      ',', -1)
    AS UNSIGNED
    ) AS Population
FROM
  Country, tinyint_asc
WHERE
  tinyint_asc.value >= 1 AND tinyint_asc.value <= 5
GROUP BY
  Continent, value
;
+---------------+----------------------------------------------+-------------+------------+
| Continent     | Name                                         | SurfaceArea | Population |
+---------------+----------------------------------------------+-------------+------------+
| Asia          | China                                        |  9572900.00 | 1277558000 |
| Asia          | India                                        |  3287263.00 | 1013662000 |
| Asia          | Kazakstan                                    |  2724900.00 |   16223000 |
| Asia          | Saudi Arabia                                 |  2149690.00 |   21607000 |
| Asia          | Indonesia                                    |  1904569.00 |  212107000 |
| Europe        | Russian Federation                           | 17075400.00 |  146934000 |
| Europe        | Ukraine                                      |   603700.00 |   50456000 |
| Europe        | France                                       |   551500.00 |   59225700 |
| Europe        | Spain                                        |   505992.00 |   39441700 |
| Europe        | Sweden                                       |   449964.00 |    8861400 |
| North America | Canada                                       |  9970610.00 |   31147000 |
| North America | United States                                |  9363520.00 |  278357000 |
| North America | Greenland                                    |  2166090.00 |      56000 |
| North America | Mexico                                       |  1958201.00 |   98881000 |
| North America | Nicaragua                                    |   130000.00 |    5074000 |
| Africa        | Sudan                                        |  2505813.00 |   29490000 |
| Africa        | Algeria                                      |  2381741.00 |   31471000 |
| Africa        | Congo                                        |  2344858.00 |   51654000 |
| Africa        |  The Democratic Republic of the              |  1759540.00 |    5605000 |
| Africa        | Libyan Arab Jamahiriya                       |  1284000.00 |    7651000 |
| Oceania       | Australia                                    |  7741220.00 |   18886000 |
| Oceania       | Papua New Guinea                             |   462840.00 |    4807000 |
| Oceania       | New Zealand                                  |   270534.00 |    3862000 |
| Oceania       | Solomon Islands                              |    28896.00 |     444000 |
| Oceania       | New Caledonia                                |    18575.00 |     214000 |
| Antarctica    | Antarctica                                   | 13120000.00 |          0 |
| Antarctica    | French Southern territories                  |     7780.00 |          0 |
| Antarctica    | South Georgia and the South Sandwich Islands |     3903.00 |          0 |
| Antarctica    | Heard Island and McDonald Islands            |      359.00 |          0 |
| Antarctica    | Bouvet Island                                |       59.00 |          0 |
| South America | Brazil                                       |  8547403.00 |  170115000 |
| South America | Argentina                                    |  2780400.00 |   37032000 |
| South America | Peru                                         |  1285216.00 |   25662000 |
| South America | Colombia                                     |  1138914.00 |   42321000 |
| South America | Bolivia                                      |  1098581.00 |    8329000 |
+---------------+----------------------------------------------+-------------+------------+

Limitations

You should have:

  • Enough numbers in the numbers table (I’ve used 5 out of 255)
  • Reasonable setting for group_concat_max_len (see this post). Actually it would be better to have a smaller value here, while you make sure it’s large enough; this way you do not waste memory for large groups.

(*) This was two years ago! I’m getting old

Votes:

You must be logged in with a MySQL account to vote on Planet MySQL entries. More information on PlanetMySQL voting.

Planet MySQL © 1995, 2014, Oracle Corporation and/or its affiliates   Legal Policies | Your Privacy Rights | Terms of Use

Content reproduced on this site is the property of the respective copyright holders. It is not reviewed in advance by Oracle and does not necessarily represent the opinion of Oracle or any other party.