How to improve subqueries derived tables performance

Last week I was working on one of the issue where the sub-query related to OLAP staging was running for about 2+ hours in the production server and finally nailed down to get the query to run in < 7 secs. It was bit interesting and kind of known issue in MySQL sub-queries world and one of the limitation from MySQL on giving more control over derived table results.

Sometimes we can re-write the sub-queries so that there is no derived tables complexity involved; but the bad part is; this particular sub-query is part of an UPDATE statement; so not all sub-queries can be re-written especially when they are part of UPDATE or DELETE statements due to its own limitations.

PROBLEM:

Here is the subset of the problem query and as you can see it runs for about 6 minutes in this small subset of data that I used for testing on Mac. All tables are InnoDB based.

--------------
SELECT
    SUM(aggrpt.imps) as imps,
    SUM(aggrpt.clicks) as clicks,
    SUM(aggrpt.pos) as pos
 
FROM aggrpt
LEFT JOIN
(
    SELECT
    DISTINCT ext_group_id, group_id
    FROM sub
) sub2  ON(sub2.ext_group_id=aggrpt.adgroupid)
 
GROUP BY
aggrpt.report_date,
aggrpt.campaignid,
aggrpt.adgroupid,
aggrpt.keywordid
ORDER BY NULL
INTO OUTFILE '/tmp/test-sub.txt'
-------------- 
 
Query OK, 47827 rows affected (6 min 47.48 sec)

HOW TO GET AROUND – SOLUTIONS:

Moving the derived table (in the above case sub2) to a view did not help; and timings are more or less the same. Here is the two alternative ways, which made the query run in < 10 secs.

  1. By creating external table and adding an index instead of using derived table
  2. Added an index within the mysql code on the derived table temporary results table by adding FORCE INDEX syntax (changed the syntax to support this, so that engine will create an index on temporary derived table results, which in this case happens to fit within heap engine instead of disk based)

CASE 1:

This is a known alternative and lot of people use this in production by avoiding the derived tables and/or sub-queries completely by creating tables for derived tables. The only thing that made the big difference is adding an index on this; without index it takes more or less the same 6 minutes times.

--------------
CREATE TEMPORARY TABLE sub_temp SELECT DISTINCT ext_group_id, group_id FROM sub
--------------
Query OK, 72385 rows affected (0.81 sec)
Records: 72385  Duplicates: 0  Warnings: 0
--------------
alter table sub_temp add index i_ext_group_id(ext_group_id)
--------------
Query OK, 72385 rows affected (0.10 sec)
Records: 72385  Duplicates: 0  Warnings: 0
--------------
SELECT
    SUM(aggrpt.imps) as imps,
    SUM(aggrpt.clicks) as clicks,
    SUM(aggrpt.pos) as pos
 
FROM aggrpt
LEFT JOIN sub_temp sub ON(sub.ext_group_id=aggrpt.adgroupid)
 
GROUP BY
    aggrpt.report_date,
    aggrpt.campaignid,
    aggrpt.adgroupid,
    aggrpt.keywordid
ORDER BY NULL
INTO OUTFILE '/tmp/test-sub-temp.txt'
-------------- 
 
Query OK, 47827 rows affected (10.41 sec)

As you can see its a great improvement from 6minutes to 10secs; but without index i_ext_group_id on sub-temp table; then things will be as usual.

CASE 2:

The engine should be smart enough to add the index on intermediate results temporary table (in this case sub2, index on column i_ext_group_id); but MySQL does not support this or at least should allow people to specify one using FORCE/USE INDEX, so that engine can add one.

For example; I patched MySQL to support this syntax, so that it adds the index on the derived table sub2 results automatically and the query immediately returns in 7 secs.

SELECT
    SUM(aggrpt.imps) as imps,
    SUM(aggrpt.clicks) as clicks,
    SUM(aggrpt.pos) as pos
 
FROM aggrpt
LEFT JOIN
(
    SELECT
    DISTINCT ext_group_id, group_id
    FROM sub
) sub2 <strong>USE INDEX(ext_group_id)</strong> ON(sub2.ext_group_id=aggrpt.adgroupid)
 
GROUP BY
aggrpt.report_date,
aggrpt.campaignid,
aggrpt.adgroupid,
aggrpt.keywordid
ORDER BY NULL
INTO OUTFILE '/tmp/test-sub-force.txt'
-------------- 
 
Query OK, 47827 rows affected (7.18 sec)

This is again a great improvement; but again using USE/FORCE INDEX is a hack here (this can also be implemented as hint) with column name; which acts as a hint to optimizer to create an index on that column (if its hint, then on JOIN column).

CHANGES TO MYSQL OPTIMIZER

But the idea for this post is to show how important the index in the intermediate temporary table results is for the query performance and optimizer should be smart enough to identify and add one automatically. The logic is same as how currently optimizer opts for index vs non-index scan by estimating the cost.