Today, Oracle announced the availability of the Development Milestone Release 15 of MySQL 5.7.5. The tagline for this release promises "Enhanced Database Performance and Manageability". That may sound rather generic, the actual list of changes and improvements is simply *huge*, and includes many items that I personally find rather exciting! Perhaps I'm mistaken but I think this may be one of the largest number of changes packed into a MySQL point release that I've witnessed in a long time. The list of changes includes improvements such as:
- InnoDB improvements: Simplified tablespace recovery, support
for spatial indexes, dynamic configuration of the
innodb_buffer_pool_size
parameter(!), more efficient creation and rebuilding of indexes ("sorted index build") - Several general improvements for spatial data, such as support for open polygons and functions to manipulate geohash values and GeoJSON documents
-
performance_schema
additions and improvements, such as a newuser_variables_by_thread
table, addition ofWORK_COMPLETED
andWORK_ESTIMATED
columns to the stage event tables, improvements to the wait event tables, instrumentation for InnoDB memory allocation in the memory summary tables. - Quite a bunch of optimizer improvements, such as better cost
estimation for semi-join materialization, configurable cost model (by way of the
mysql.server_cost
andmysql.engine_cost
system tables) and more exact index statistics. - Many improvements and additions that make replication more robust
- A more sane default SQL mode and
GROUP BY
behaviour
This is very far from an exhaustive list; It really is not an
exaggeration to say that there is much much more than I can cover
here. Just see for yourself.
Now, one of the changes I'd like to highlight in this post is the
improved GROUP BY support. GROUP BY
behavior before
MySQL 5.7.5More than 7 years ago, I wrote an article on this blog
called Debunking GROUP BY Myths. The article is
basically an explanation of the syntax and semantics of the SQL
GROUP BY
clause, with a focus on MySQL particular
non-standard implementation.
Before MySQL 5.7.5, MySQL would by default always aggregate over
the list of expressions that appear in the GROUP
BY
-clause, even if the SELECT
-list contains
non-aggregate expressions that do not also appear in the
GROUP BY
-clause. In the final resultset, MySQL would
produce one of the available values for such non-aggregate
expressions and the result would basically not be deterministic
from the user's point of view.
This behavior is not standard: SQL92 states that any
non-aggregate expressions appearing in the
SELECT
-list must appear in the GROUP
BY
-clause; SQL99 and on state that any non-aggregate
expressions that appear in the SELECT
-list must be
functionally dependent upon the list of expressions
appearing in the GROUP BY
. In this context,
"functionally dependent" simply means that for each unique
combination of values returned by the expressions that make up
the GROUP BY
-clause, the non-aggregate expression
necessarily yields exactly one value. (This concept is further
explained and illustrated in my original article.)
Most RDBMS-es implement the SQL92 behavior, and generate an error
in case a non-aggregate expression appears in the
SELECT
-list but not the GROUP
BY
-clause. Because MySQL would not generate an error at
all and instead would simply allow such queries while silently
producing a non-deterministic result for such expressions, many
users got bitten and frustrated.
My original article offered 3 suggestions to cope with this
non-standard behavior:
- One could explicitly add a
ONLY_FULL_GROUP_BY
option to thesql_mode
(since it was not included by default). This should essentially make pre-MySQL 5.7.5 behave according to SQL92. Unfortunately, this feature would often erroneously spot properly aggregatedSELECT
-list expressions and reject perfectly valid queries. This is why I disrecommended this approach. (See bug #8510 for details.) - I argued instead to be more conscious when building queries,
and manually ensure that non-aggregated expressions in the
SELECT
-list are functionally dependent upon the list of expressions appearing in theGROUP BY
clause. The aim of my original article was to teach a way of thinking about aggregate queries so that developers would be conditioned to do "the right thing" and avoid writing non-deterministic queries. - The final suggestion was to artificially convert
non-aggregate expressions in the
SELECT
-list to aggregate expressions by wrapping them inside an appropriate "dummy" aggregate function likeMIN()
orMAX()
.
In addition, it may seem that one can also simply add the
non-aggregate expressions int the SELECT
-list to the
GROUP BY
-clause. However, as I have shown in my
original article, this is typically not the best solution: if
that expression is truly functionally dependent upon the
expressions in the GROUP BY
-clause it can (and often
does) have a non-trivial impact on the performance. (And if it
wasn't functionally dependent, you probably weren't exactly aware
what you were doing when you wrote your query ;-)
At the time my recommendation was to go for option two and
manually ensure that any non-aggregate expressions in the
SELECT
-list are dependent upon the expressions in
the GROUP BY
-clause. The bug in the former
behahaviour of ONLY_FULL_GROUP_BY
was simply too
restrictive to work with, and adding dummy aggregate expressions
makes it harder to maintain the query. Besides, successfully
writing those dummy aggregates or add those non-aggregate
expressions to the GROUP BY
list still requires the
same structural understanding of the query that was required to
write it correctly, so why bother if you could just as well have
written the query right in the first place?
Basically the message was that giving your query a little more
thought is simply the best solution on all accounts. GROUP
BY
in MySQL 5.7.5In the 5.7.5m15 milestone release,
ONLY_FULL_GROUP_BY
is included in the
sql_mode
by default. Contrary to what its name might
suggest, this does *not* mean that GROUP BY
-clauses
must list all non-aggregated columns appearing in the
SELECT
-list. Oracle and the MySQL development team,
in particular Guilhelm Bichot, really went the extra mile and
implemented behavior that is identical, or at least very close to
what is described in SQL99 and beyond. That is, MySQL 5.7.5m15
will by default reject only those GROUP BY
-queries
that include non-aggregated expressions in the
SELECT
-list that are not functionally dependent upon
the GROUP BY
-list.
This not only means that you cannot mess up your GROUP
BY
-queries anymore (as MySQL will now reject an improper
GROUP BY
query), it will also not require you to
write non-sensical "dummy" aggregates over expressions that can
only have one value per aggregated result row. Hurrah!
Note that this does not mean that writing the query becomes any
easier. It is just as hard (or easy) as before, this new feature
simply means it becomes impossible to accidentally write a query
that delivers non-deterministic results. I don't think anybody in
their right mind can be against that. ExamplesLet's put it to the
test with the examples from my original article. First, let's
check what the default sql_mode
looks like in
MySQL-5.7.5-m15:
mysql> select version(), @@sql_mode;
+-----------+---------------------------------------------------------------+
| version() | @@sql_mode |
+-----------+---------------------------------------------------------------+
| 5.7.5-m15 | ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ENGINE_SUBSTITUTION |
+-----------+---------------------------------------------------------------+
1 row in set (0.00 sec)
As you can see, ONLY_FULL_GROUP_BY
is included by
default. Now let's try this query:
mysql> use menagerie;
mysql> SELECT species
-> , MIN(birth) -- birthdate of oldest pet per species
-> , MAX(birth) -- birthdate of youngest pet per species
-> , birth -- birthdate of ... uh oh...!
-> FROM menagerie.pet
-> GROUP BY species;
ERROR 1055 (42000): Expression #4 of SELECT list is not in GROUP BY clause and contains nonaggregated column 'menagerie.pet.birth' which is not functionally dependent on columns in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by
We should have expected the query to be rejected, because the
birth
column appears in the SELECT
-list
outside an aggregate function, but is not in the GROUP
BY
-clause.
This exact same query would have been rejected prior to MySQL
5.7.5 as well, provided the ONLY_FULL_GROUP_BY
would
have been explicitly included in the sql_mode
.
However, the error message was quite a bit shorter, and
furthermore, it conveyed a different message:
ERROR 1055 (42000): 'menagerie.pet.birth' isn't in GROUP BY
Take a moment to consider the difference: prior to MySQL 5.7.5,
the complaint was that the non-aggregated column is not part of
the GROUP BY
-list, implying that we should include
it; in MySQL 5.7.5, the message is that it is not functionally
dependent upon the GROUP BY
list, implying that
we can include non-aggregated columns that are functionally
dependent upon the GROUP BY
-clause.
Let's consider this query to illustrate the difference:
mysql> use sakila;
mysql> SELECT film_id -- primary key
-> , title -- non-key column
-> , COUNT(*) -- one row per group
-> FROM sakila.film
-> GROUP BY film_id; -- group by on primary key
In this query, the GROUP BY
-clause contains only
film_id
, which is the primary key of the film table.
(Note that it doesn't really make much sense to perform such an
aggregation, since each aggregate row is based on exactly one
source row, but the point is to illustrate how MySQL 5.7.5
handles this query differently than prior versions.)
The SELECT
-list also contains the
film_id
column, which should be ok since it appears
in the GROUP BY
list. But the
SELECT
-list also contains the title
column, which does not appear in the GROUP BY
-list.
However, since both columns come from the same table, and
film_id
is the primary key, it follows that for any
value of film_id
there can be only one value for
title
. In other words, if the value for
film_id
is known, then the corresponding value for
title
is also known. A fancy way of saying it is
that title
is functionally dependent upon
fim_id
, which is to say that the value of
title
is fully determined once the value for
film_id
is known. This is not surprising since
film_id
is the key of the table. This is virtually
identical to the very definition of what it means to be a key:
some column or combination of columns upon which any other column
is functionally dependent.
So, it should be perfectly alright to execute this query, and
MySQL 5.7.5 does. Note that because of the much narrower
semantics of the ONLY_FULL_GROUP_BY
option in MySQL
prior to 5.7.5, this query would be rejected in earlier MySQL
versions if ONLY_FULL_GROUP_BY
is part of the
sql_mode
. Prior to 5.7.5 MySQL could execute this
query but not if the sql_mode
includes
ONLY_FULL_GROUP_BY
.
Now, in the previous query, the functional dependency between
film_id
and title
is quite
straightforward and easy to detect, since both columns come from
the same table. This type of functional dependency is also
correctly detected by for example Postgres 9.1, and this query
can be executed there too.
However, MySQL 5.7.5 is capable of detecting more complex
functional dependencies. Consider this query:
mysql> SELECT i.film_id
-> , f.title
-> , count(i.inventory_id) inventory_count
-> FROM film f
-> LEFT JOIN inventory i
-> ON f.film_id = i.film_id
-> GROUP BY f.film_id
-> HAVING inventory_count = 0
This is almost a typical master detail query, which joins
film
to inventory
over
film_id
to find out how many copies of each film
exist in the inventory. As an extra criterion, we filter out
those films for which there are no copies available by writing
HAVING inventory_count = 0
. The GROUP
BY
-clause is again on film.film_id
, so this
means we should be able to use any column from the
film
table in the SELECT
-list, since
they are functionally dependent upon the GROUP
BY
-list; again we ask for the title
column
from the film
table.
Again we also select a film_id
column, but instead
of asking for the film_id
column from the
film
table, we ask for the one from the
inventory
table. The inventory.film_id
column does not appear in the GROUP BY
-list, and is
also not an aggregated. But even though there may be multiple
rows from the inventory
table for one specific row
in the film
table, the query is still valid. This is
because the join condition f.film_id = i.film_id
ensures the value of the film_id
column in the
inventory
table is functionally dependent upon the
film_id
column from the film
table. And
becaue the film_id
column from the film
table does appear in the GROUP BY
-list, it must mean
the film_id
column from the inventory
table is fully determined, and hence the query is valid.
This query will fail in Postgres 9.1, but not in MySQL 5.7.5: In
Postgres 9.1, we first have to rewrite the
HAVING
-clause and refer to
count(i.inventory_id)
rather than its alias
inventory_count
. But even then, it still considers
inventory.film_id
not to be functionally dependent
upon the GROUP BY
-clause, and it will reject this
query. (If you have any indication that later versions of
Postgres also handle this query correctly, please let me know and
I'll gladly amend this article)
(A little digression: I just argued that in the previous query,
film_id
from inventory
is functionally
dependent upon film_id
from film
because of the join condition. However, this does not necessarily
mean the value of these columns is identical. In fact, in this
particular example the selected film_id
column from
inventory
will be NULL
because our
HAVING
-clause, whereas the value of
film_id
from the film
table is never
NULL
. But it is still true that for each distinct
value from film_id
from the film
table,
the value of film_id
from the inventory
table is fully determined, and hence, functionally dependent.)
Upgrade adviceIf you decide to upgrade to MySQL 5.7.5 (or
beyond), and you used to run with a sql_mode
that
did not include ONLY_FULL_GROUP_BY
, then some
GROUP BY
queries that used to work prior to the
upgrade might fail after the upgrade. This sounds like a bad
thing but if you think about it, it really isn't: the queries
that are going to fail were in fact invalid all along, and gave
you non-deterministic results. You just didn't notice.
A simple way to make these queries work again would be to remove
ONLY_FULL_GROUP_BY
from the sql_mode
.
However, I would very strongly disrecommend that approach.
Rather, each query that fails in MySQL 5.7.5 (or beyond) due to
enabling ONLY_FULL_GROUP_BY
option should be
inspected and rewritten. If your query contains a non-aggregated
expression in the SELECT
-list that is not dependent
upon the GROUP BY
-list, your application was
probably using bogus (or well, at least non-deterministic)
results and you should decide what the proper behavior should be
and modify the query (and/or the application) accordingly. If you
really want to keep relying on non-deterministic results (why?),
you can wrap such expressions into the new
ANY_VALUE()
function. This will essentially preserve
the old behaviour even if ONLY_FULL_GROUP_BY
is
enabled. I suppose this is still better than running without
ONLY_FULL_GROUP_BY
, because in this way it will at
least be clear the result will be non-deterministic since you're
literally asking for it.
(One word about the ANY_VALUE()
"function". By the
way it looks and is called, you might get the impression that
ANY_VALUE()
is some kind of new aggregate function
that aggregates by actively picking one value out of a list. This
is not the case. Proper aggregate functions, like
COUNT()
, MIN()
, MAX()
etc.
will condense the resultset to one single row in the absence of a
GROUP BY
list; ANY_VALUE()
does no such
thing. It is merely a placeholder that tells the optimizer to not
generate an error when the ONLY_FULL_GROUP_BY
contract is broken. Basically, ANY_VALUE()
means:
this query is broken but instead of fixing it we chose to ignore
the fact that it is broken.) FinallyMySQL 5.7.5 is looking like
it will bring a lot of improvements, and thanks to Guilhelm
Bichot, very much improved standards compliance for GROUP
BY
. I have no hesitation to recommend you start using the
ONLY_FULL_GROUP_BY
option. Of course many of the
considerations to consciously writing your query still applies,
and now MySQL will make it even more easy to do so.
If you're interested in these or other improvements, consider
downloading and installing MySQL 5.7.5 m15 to test your current
systems. MySQL 5.7.5 brings quite a number of incompatible
changes and while I believe they are all improvements, one best
get prepared. Happy hacking. Further Reading
- My original Debunking GROUP BY myths article.
- Guilhelm Bichot, the developer of the improvements on this topic: MySQL 5.7: only_full_group_by Improved, Recognizing Functional Dependencies, Enabled by Default!