MySQL 5.7 comes with built-in JSON support, comprising two major features:
- A native JSON data type
- A set of built-in functions to manipulate values of the JSON type
Despite being added rather recently (in MySQL 5.7.8 to be precise
- one point release number before the 5.7.9 GA version), I feel
the JSON support so far looks rather useful. Improvements are
certainly possible, but compared to for example XML support
(added in 5.1 and 5.5), the JSON feature set added to 5.7.8 is
reasonably complete, coherent and standards-compliant.
(We can of course also phrase this more pessimistically and say
that XML support falls short on these accounts, but that's not
what this post is about :-)
There is potentially a lot to write and explain about the JSON
support, and I can't hope to cover the subject completely in one
blog post. Rather, I will highlight a few things I observed, and I
hope that this will be helpful for others who want to get
started with JSON in MySQL 5.7.

Creating JSON values

There are a number of ways to create values of the JSON type:

Using CAST(... AS JSON)

You can CAST a value of any non-character string type AS JSON
to obtain a JSON representation of that value. Example:
mysql> SELECT CAST(1 AS JSON), CAST(1.1 AS JSON), CAST(NOW() AS JSON);
+-----------------+-------------------+------------------------------+
| CAST(1 AS JSON) | CAST(1.1 AS JSON) | CAST(NOW() AS JSON)          |
+-----------------+-------------------+------------------------------+
| 1               | 1.1               | "2015-10-31 23:01:56.000000" |
+-----------------+-------------------+------------------------------+
1 row in set (0.00 sec)
Even though it may not be immediately clear from the result, the
CAST operation actually turned these values into JSON
equivalents. More about this in the next section.

If the value you're casting is of a character string type, then
its value should be parseable as either a JSON object or a JSON
array (i.e., a JSON document), as a JSON keyword indicating a
built-in value, like null, true or false, or as a properly quoted
JSON string value:
mysql> SELECT CAST('{}' AS JSON)       object
    -> ,      CAST('[]' AS JSON)       array
    -> ,      CAST('null' AS JSON)     "null"
    -> ,      CAST('true' AS JSON)     "true"
    -> ,      CAST('false' AS JSON)    "false"
    -> ,      CAST('"string"' AS JSON) string
    -> ;
+--------+-------+------+------+-------+----------+
| object | array | null | true | false | string   |
+--------+-------+------+------+-------+----------+
| {}     | []    | null | true | false | "string" |
+--------+-------+------+------+-------+----------+
1 row in set (0.00 sec)
If the string is not parseable as JSON, you'll get a runtime error:
mysql> SELECT CAST('' AS JSON);
ERROR 3141 (22032): Invalid JSON text in argument 1 to function cast_as_json: "The document is empty." at position 0 in ''.

mysql> SELECT CAST('{]' AS JSON);
ERROR 3141 (22032): Invalid JSON text in argument 1 to function cast_as_json: "Missing a name for object member." at position 1 in '{]'.
Note that many keywords that might be valid in other
environments, like NaN, Infinity, javascript built-in constructor
fields like Number.EPSILON, and even undefined are *not* valid in
this context. Remember - this is JSON, not javascript.

To get the JSON representation of a plain, unquoted string value,
you can use the JSON_QUOTE() function:
mysql> SELECT JSON_QUOTE(''), JSON_QUOTE('{]');
+----------------+------------------+
| JSON_QUOTE('') | JSON_QUOTE('{]') |
+----------------+------------------+
| ""             | "{]"             |
+----------------+------------------+
1 row in set (0.00 sec)
SELECT-ing from a JSON column

Table columns can be defined to be of the JSON data type, and
SELECT-ing from such a column will create JSON values at runtime.
Of course, such a column would first need to be populated before
it yields JSON values, and this can be done simply with an INSERT
statement. When INSERT-ing non-JSON type values into a column of
the JSON type, MySQL will behave as if it first converts these
values to the JSON type, just as if it would apply
CAST(value AS JSON) to those values.
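As a minimal sketch of this behavior (the table and column names here are just for illustration):

-- A table with a single JSON column:
CREATE TABLE json_demo (
  doc JSON
);

-- A character string value must be valid JSON text:
INSERT INTO json_demo (doc) VALUES ('{"greeting": "hello"}');

-- A non-string value is converted as if CAST(value AS JSON) were applied:
INSERT INTO json_demo (doc) VALUES (1.1);

-- Expect one OBJECT row and one DECIMAL row:
SELECT doc, JSON_TYPE(doc) FROM json_demo;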
UPDATE: Giuseppe Maxia kindly pointed out that there is an issue
when INSERT-ing the result of LOAD_FILE() into a JSON column. You
first have to CONVERT() the -binary- result to the utf8 character
set before it can be successfully accepted as JSON. This was
reported as bug #79066.

Calling functions that return JSON values

The JSON_QUOTE() function mentioned above is one example of a
built-in function returning a JSON value. To create new JSON
documents from scratch, JSON_OBJECT() and JSON_ARRAY() are
probably most useful:
mysql> SELECT JSON_ARRAY(1, 2, 3) array, JSON_OBJECT('name1', 'value1', 'name2', 'value2') object;
+-----------+----------------------------------------+
| array     | object                                 |
+-----------+----------------------------------------+
| [1, 2, 3] | {"name1": "value1", "name2": "value2"} |
+-----------+----------------------------------------+
1 row in set (0.00 sec)
Note that we could also have achieved the previous result by
CAST-ing literal string representations of these JSON documents
AS JSON:
mysql> SELECT CAST('[1, 2, 3]' AS JSON) array, CAST('{"name1": "value1", "name2": "value2"}' AS JSON) object;
+-----------+----------------------------------------+
| array     | object                                 |
+-----------+----------------------------------------+
| [1, 2, 3] | {"name1": "value1", "name2": "value2"} |
+-----------+----------------------------------------+
1 row in set (0.00 sec)
However, as we shall see later on, this approach is not entirely
equivalent to constructing these documents through JSON_ARRAY()
and JSON_OBJECT().
There are many more built-in JSON functions that return a value
of the JSON data type. Unlike JSON_QUOTE(), JSON_ARRAY() and
JSON_OBJECT(), most of these also require a JSON document as
their first argument. In these cases, the return value represents
a modified instance of the document passed as argument.

Operating on JSON documents: Extraction and Modification

While the JSON document may be a convenient unit for storing and
transporting related items of data, any meaningful processing of
such documents will always involve some operation to transform or
modify such a document: for example, extracting some item stored
inside the document, or adding or removing properties or array
elements.
Manipulation of JSON documents always involves at least two
distinct items:
- The JSON document to operate on. This can be an explicitly or implicitly obtained JSON document, constructed in any of the ways described earlier in this post. In general, functions that manipulate JSON documents accept the document that is being operated on as their first argument.
- A path. The path is an expression that identifies which part of the document to operate on. In general, the second argument of functions that manipulate JSON documents is a path expression. Depending on which function exactly, other arguments may or may not accept path expressions as well.
It is important to point out that none of the functions that modify JSON documents actually change the argument document inline: JSON functions are pure functions that don't have side effects. The modified document is always returned from the function as a new document.

JSON path expressions in MySQL

While the path is passed as a string value, it's actually an expression consisting of alternating identifiers and access operators that as a whole identifies a particular piece within the JSON document:
- Identifiers

  There are 4 types of identifiers that can appear in a path:

  - $ (dollar sign) is a special identifier, which is essentially a placeholder for the current document being operated on. It can only appear at the start of the path expression.
  - Property names are optionally double quoted names that identify properties ("fields") in a JSON object. Double quoted property names are required whenever the property name contains meta characters. For example, if the property name contains any interpunction or space characters, you need to double quote the name. A property name can appear immediately after a dot-access operator.
  - Array indices are integers that identify array elements in a JSON array. Array indices can appear only within an array-access operator (which is denoted by a pair of square braces).
  - * (asterisk) is also a special identifier. It indicates a wildcard that represents any property name or array index. So, the asterisk can appear after a dot-operator, in which case it denotes any property name, or it may appear between square braces, in which case it represents all existing indices of the array.

    The asterisk essentially "forks" the path and may thus match multiple values in a JSON document. The MySQL JSON functions that grab data or metadata usually have a way to handle multiple matched values, but JSON functions that modify the document usually do not support this.

- Access operators

  Paths can contain only 2 types of access operators:

  - The dot-operator, denoted by a .-character. The dot-operator can appear in between any partial path expression and an identifier (including the special wildcard identifier *). It has the effect of extracting the value identified by the identifier from the value identified by the path expression that precedes the dot.

    This may sound more complicated than it really is: for example, the path $.myproperty has the effect of extracting whatever value is associated with the top-level property called myproperty; the path $.myobject.myproperty has the effect of extracting the value associated with the property called myproperty from the nested object stored in the myobject property of the top-level document.

  - The array access-operator, denoted by a matching pair of square braces: [...]. The braces should contain either an integer, indicating the position of an array element, or the * (wildcard identifier), indicating all array element indices.

    The array-access operator can appear after any path expression, and can be followed by either a dot-operator (followed by its associated property identifier), or another array access-operator (to access nested array elements).

    Currently, the braces can be used only to extract array elements. In javascript, braces can also contain a quoted property name to extract the value of the named property (equivalent to the dot-operator), but this is currently not supported in MySQL path expressions. (I believe this is a - minor - bug, but it's really no biggie since you can and probably should be using the dot-operator for properties anyway.)
Below is the syntax in a sort of EBNF notation in case you prefer that:
mysql-json-path         ::= Document-placeholder path-expression?
Document-placeholder    ::= '$'
path-expression         ::= path-component path-expression*
path-component          ::= property-accessor | array-accessor
property-accessor       ::= '.' property-identifier
property-identifier     ::= Simple-property-name | quoted-property-name | wildcard-identifier
Simple-property-name    ::= <Please refer to JavaScript, The Definitive Guide, 2.7. Identifiers>
quoted-property-name    ::= '"' string-content* '"'
string-content          ::= Non-quote-character | Escaped-quote-character
Non-quote-character     ::= <Any character except " (double quote)>
Escaped-quote-character ::= '\"'
wildcard-identifier     ::= '*'
array-accessor          ::= '[' element-identifier ']'
element-identifier      ::= [0-9]+ | wildcard-identifier
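As a quick illustration of these path constructs, here is a minimal sketch (the document literal is just an example):

-- '$' is the document itself, '.name' accesses a property, '[n]' an array element:
SET @doc = '{"a": {"b": [10, 20, 30]}}';

SELECT JSON_EXTRACT(@doc, '$')         -- the whole document
     , JSON_EXTRACT(@doc, '$.a.b')     -- the nested array: [10, 20, 30]
     , JSON_EXTRACT(@doc, '$.a.b[1]')  -- the second element: 20
     , JSON_EXTRACT(@doc, '$.a.b[*]'); -- wildcard over all elements: [10, 20, 30]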
Grabbing data from JSON documents

- json JSON_EXTRACT(json, path+)

  This function gets the value at the specified path. Multiple path arguments may be passed, in which case any values matching the paths are returned as a JSON array.

- json json-column->path

  If you have a table with a column of the JSON type, then you can use the -> operator inside SQL statements as a shorthand for JSON_EXTRACT(). Note that this operator only works inside SQL statements, and only if the left-hand operand is a column name; it does not work for arbitrary expressions of the JSON type. (Pity! I would love this to work for any expression of the JSON type, and in any context - not just SQL statements.)
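As a minimal sketch of the shorthand (borrowing the posts table with its JSON column doc from the worked example later in this post), these two queries are equivalent:

SELECT JSON_EXTRACT(doc, '$.Title') FROM posts;
SELECT doc->'$.Title' FROM posts;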
Grabbing metadata from JSON documents

- bool JSON_CONTAINS(json, value, path?)

  Checks whether the specified value appears in the specified document. If the path is specified, the function returns TRUE only if the value appears at the specified path. If the path argument is omitted, the function looks *anywhere* in the document and returns TRUE if it finds the value (either as property value or as array element).

- bool JSON_CONTAINS_PATH(json, 'one'|'all', path+)

  Checks whether the specified JSON document contains one or all of the specified paths. Personally I think there are some issues with this function.

- int JSON_DEPTH(json)

  Returns the number of levels present in the document.

- json-array JSON_KEYS(json-object, path?)

  Returns the property names of the specified object as a JSON array. If a path is specified, the properties of the object identified by the path are returned instead.

- int JSON_LENGTH(json, path?)

  Returns the number of keys (when the JSON document is an object) or the number of elements (in case the JSON document is an array). If a path is specified, the function is applied to the value identified by the path rather than the document itself. Omitting the path is equivalent to passing $ as the path.

- string JSON_SEARCH(json, 'one'|'all', pattern, escape?, path*)

  Searches for string values that match the specified pattern, and returns the path or paths where the properties that match the pattern are located. The second argument indicates when the search should stop - in case it's 'one', search will stop as soon as a matching path is found, and that path is returned. In case of 'all', search will continue until all matching properties are found. If this results in multiple paths, then a JSON array of paths will be returned. The pattern can contain % and _ wildcard characters to match any number of characters or a single character (just as with the standard SQL LIKE-operator). The escape argument can optionally define which character should be used to escape literal % and _ characters. By default this is the backslash (\). Finally, you can optionally limit which parts of the document will be searched by passing one or more JSON paths. Technically it is possible to pass several paths that include the same locations, but only unique paths will be returned. That is, if multiple paths are found, the array of paths that is returned will never contain the same path more than once.

  Unfortunately, MySQL currently does not provide any function that allows you to search for property names. I think it would be very useful, so I made a feature request.

- string JSON_TYPE(json)

  Returns the name of the type of the argument value. It's interesting to note that the set of type values returned by this function is not equivalent to the set of types distinguished by the JSON specification. Values returned by this function are all uppercase string values. Some of these indicate items that belong to the JSON type system, like "OBJECT", "ARRAY", "STRING", "BOOLEAN" and "NULL" (this is the uppercase string - not to be confused with the keyword for the SQL literal NULL-value). But some refer to native MySQL data types: "INTEGER", "DOUBLE", and "DECIMAL"; "DATE", "TIME", and "DATETIME"; and "OPAQUE".

- bool JSON_VALID(string)

  Returns whether the passed value could be parsed as a JSON value. This is not limited to just JSON objects and arrays, but also covers JSON built-in special value keywords, like null, true, and false.
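A quick sketch of a few of these metadata functions in action (the document literal is just an example):

SET @doc = '{"name": "Alice", "tags": ["a", "b"]}';

SELECT JSON_CONTAINS(@doc, '"Alice"', '$.name')          -- 1: value found at the path
     , JSON_CONTAINS_PATH(@doc, 'one', '$.nope', '$.tags') -- 1: at least one path exists
     , JSON_DEPTH(@doc)                                   -- 3
     , JSON_KEYS(@doc)                                    -- ["name", "tags"]
     , JSON_LENGTH(@doc, '$.tags')                        -- 2
     , JSON_SEARCH(@doc, 'one', 'Ali%');                  -- "$.name"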
Manipulating JSON documents

- json JSON_INSERT(json, [path, value]+)

  Takes the argument JSON document and adds (but does not overwrite) properties or array elements. Returns the resulting document.

- json JSON_MERGE(json, json+)

  Folds multiple documents and returns the resulting document.

- json JSON_REMOVE(json, path+)

  Removes one or more items specified by the path arguments from the document specified by the JSON argument, and returns the document after removing the specified paths.

- json JSON_REPLACE(json, [path, value]+)

  Takes the argument document and overwrites (but does not add) items specified by the path arguments, and returns the resulting document.

- json JSON_SET(json, [path, value]+)

  Takes the argument document and adds or overwrites items specified by the path arguments, then returns the resulting document.
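To make the insert/replace/set distinction concrete, a minimal sketch:

SET @doc = '{"a": 1}';

SELECT JSON_INSERT(@doc, '$.a', 2, '$.b', 2)   -- {"a": 1, "b": 2}: adds, never overwrites
     , JSON_REPLACE(@doc, '$.a', 2, '$.b', 2)  -- {"a": 2}: overwrites, never adds
     , JSON_SET(@doc, '$.a', 2, '$.b', 2);     -- {"a": 2, "b": 2}: adds or overwrites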
Functions to manipulate JSON arrays

- json JSON_ARRAY_APPEND(json, [path, value]+)

  If the path exists and identifies an array, it appends the value to the array. If the path exists but identifies a value that is not an array, it wraps the value into a new array, and appends the value. If the path does not identify a value at all, the document remains unchanged for that path.

- json JSON_ARRAY_INSERT(json, [array-element-path, value]+)

  This function inserts elements into existing arrays. The path must end with an array accessor - it must end with a pair of square braces containing an exact array index (not a wildcard). If the partial path up to the terminal array accessor identifies an existing array, and the specified index is less than the array length, the value is inserted at the specified position. Any array elements at and beyond the specified position are shifted down one position to make room for the new element. If the specified index is equal to or exceeds the array length, the new value is appended to the array.

- int JSON_LENGTH(json, path?)

  I already described this one as a function that grabs metadata, but I found this function to be particularly useful when applied to arrays.
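A small sketch of these array functions:

SET @arr = '[1, 2, 3]';

SELECT JSON_ARRAY_APPEND(@arr, '$', 4)     -- [1, 2, 3, 4]: '$' identifies the array itself
     , JSON_ARRAY_INSERT(@arr, '$[0]', 0)  -- [0, 1, 2, 3]: existing elements shift to make room
     , JSON_LENGTH(@arr);                  -- 3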
Removing array elements

Note that there is no dedicated function for removing elements
from an array. It is simply done using JSON_REMOVE. Just make
sure the path argument denotes an array accessor to identify the
element to remove.

To remove multiple elements from an array, you can specify
multiple path arguments. In this case, the removal operation is
performed sequentially, evaluating all passed path arguments from
left to right. So, you have to be very careful which paths you
pass, since a preceding path may have changed the array you're
working on. For example, if you want to remove the first two
elements of an array, you should pass a path like '$[0]' twice.
Passing '$[0]' and '$[1]' will end up removing elements 0 and 2
of the original array, since after removing the initial element
at '$[0]', the element that used to sit at position 1 has been
shifted left to position 0. The element that then sits at
position 1 is the element that used to sit at position 2:
mysql> select json_remove('[1,2,3,4,5]', '$[0]', '$[0]') "remove elements 0 and 1"
    -> ,      json_remove('[1,2,3,4,5]', '$[0]', '$[1]') "remove elements 0 and 2"
    -> ;
+-------------------------+-------------------------+
| remove elements 0 and 1 | remove elements 0 and 2 |
+-------------------------+-------------------------+
| [3, 4, 5]               | [2, 4, 5]               |
+-------------------------+-------------------------+
1 row in set (0.00 sec)
Concatenating arrays

There is no function dedicated to concatenating arrays. However, you can use JSON_MERGE to do so:
mysql> SELECT JSON_MERGE('[0,1]', '[2,3]');
+------------------------------+
| JSON_MERGE('[0,1]', '[2,3]') |
+------------------------------+
| [0, 1, 2, 3]                 |
+------------------------------+
1 row in set (0.00 sec)
Slicing arrays

There is no dedicated function or syntax to take a slice of an
array. If you don't need to slice arrays, then good - you're
lucky. If you do need it, I'm afraid you're up for a challenge: I
don't think there is a convenient way to do it. I filed a feature
request and I hope this will be followed up.

JSON Schema Validation

Currently, the JSON functions provide a JSON_VALID() function,
but this can only check if a string conforms to the JSON syntax.
It does not verify whether the document conforms to predefined
structures (a schema).
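For completeness, a quick sketch of what JSON_VALID() does (and does not) check - syntax only, not structure:

SELECT JSON_VALID('{"a": 1}')  -- 1
     , JSON_VALID('null')      -- 1
     , JSON_VALID('{]');       -- 0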
I anticipate that it might be useful to be able to ascertain
schema conformance of JSON documents within MySQL. The exact
context is out of scope for this post, but I would already like
to let you know that I am working on a JSON schema validator. It
can be found on github here: mysql-json-schema-validator.

Stay tuned - I will do a writeup on that as soon as I complete a
few more features that I believe are essential.

MySQL JSON is actually a bit like BSON

MySQL's JSON type is not just a blob with a fancy name, and it is
not entirely the same as standard JSON. MySQL's JSON type is more
like MongoDB's BSON: it preserves native type information. The
most straightforward way to make this clear is by creating
different sorts of JSON values using CAST(... AS JSON) and then
reporting the type of the result using JSON_TYPE:
mysql> SELECT JSON_TYPE(CAST('{}' AS JSON))                 as "object"
    -> ,      JSON_TYPE(CAST('[]' AS JSON))                 as "array"
    -> ,      JSON_TYPE(CAST('""' AS JSON))                 as "string"
    -> ,      JSON_TYPE(CAST('true' AS JSON))               as "boolean"
    -> ,      JSON_TYPE(CAST('null' AS JSON))               as "null"
    -> ,      JSON_TYPE(CAST(1 AS JSON))                    as "integer"
    -> ,      JSON_TYPE(CAST(1.1 AS JSON))                  as "decimal"
    -> ,      JSON_TYPE(CAST(PI() AS JSON))                 as "double"
    -> ,      JSON_TYPE(CAST(CURRENT_DATE AS JSON))         as "date"
    -> ,      JSON_TYPE(CAST(CURRENT_TIME AS JSON))         as "time"
    -> ,      JSON_TYPE(CAST(CURRENT_TIMESTAMP AS JSON))    as "datetime"
    -> ,      JSON_TYPE(CAST(CAST('""' AS BINARY) AS JSON)) as "blob"
    -> \G
*************************** 1. row ***************************
  object: OBJECT
   array: ARRAY
  string: STRING
 boolean: BOOLEAN
    null: NULL
 integer: INTEGER
 decimal: DECIMAL
  double: DOUBLE
    date: DATE
    time: TIME
datetime: DATETIME
    blob: BLOB
1 row in set (0.00 sec)
What this query shows is that internally, values of the JSON type preserve native type information. Personally, I think that is a good thing. JSON's standard type system is rather limited. I would love to see standard JSON support for proper decimal and datetime types.

Comparing JSON objects to JSON objects

The MySQL JSON type system is not just cosmetic - the attached internal type information affects how the values work in calculations and comparisons. Consider this comparison of two JSON objects:
mysql> SELECT CAST('{"num": 1.1}' AS JSON) = CAST('{"num": 1.1}' AS JSON); +-------------------------------------------------------------+ | CAST('{"num": 1.1}' AS JSON) = CAST('{"num": 1.1}' AS JSON) | +-------------------------------------------------------------+ | 1 | +-------------------------------------------------------------+ 1 row in set (0.00 sec)
This is already quite nice - you can't compare two objects like
that in javascript. Or actually, you can, but the result will be
false since you'd be comparing two distinct objects that simply
happen to have the same properties and property values. But
usually, with JSON, we're just interested in the data. Since the
objects that are compared here are totally equivalent with regard
to composition and content, I consider the ability to directly
compare objects as a bonus.
It gets even nicer:
mysql> SELECT CAST('{"num": 1.1, "date": "2015-11-01"}' AS JSON) = CAST('{"date": "2015-11-01", "num": 1.1}' AS JSON); +---------------------------------------------------------------------------------------------------------+ | CAST('{"num": 1.1, "date": "2015-11-01"}' AS JSON) = CAST('{"date": "2015-11-01", "num": 1.1}' AS JSON) | +---------------------------------------------------------------------------------------------------------+ | 1 | +---------------------------------------------------------------------------------------------------------+ 1 row in set (0.00 sec)
Again, the result is true, indicating that these objects are
equivalent. But you'll notice that the property names appear in a
different order in the two objects: the direct comparison ignores
the property order - it only takes into account whether a
property exists at a particular path, and whether the property
values are the same. One can argue about whether the property
order should be deemed significant in a comparison. The JSON spec
doesn't specify so. But I'm inclined to say that MySQL's behavior
here is a nice feature.
Now let's try something a bit like that first comparison, but in
a slightly different way:
mysql> SELECT JSON_OBJECT('bla', current_date)
    -> ,      JSON_OBJECT('bla', current_date) = JSON_OBJECT('bla', current_date)
    -> ,      JSON_OBJECT('bla', current_date) = CAST('{"bla": "2015-11-01"}' AS JSON)
    -> \G
*************************** 1. row ***************************
                                        JSON_OBJECT('bla', current_date): {"bla": "2015-11-01"}
     JSON_OBJECT('bla', current_date) = JSON_OBJECT('bla', current_date): 1
JSON_OBJECT('bla', current_date) = CAST('{"bla": "2015-11-01"}' AS JSON): 0
1 row in set (0.00 sec)
The difference here is of course creating the object using
JSON_OBJECT as opposed to using CAST(... AS JSON). While the
string representation of the result of
JSON_OBJECT('bla', current_date) looks exactly the same as that
of CAST('{"bla": "2015-11-01"}' AS JSON), they are not
equivalent: in the case of JSON_OBJECT, MySQL internally attached
native type information to the property, which is of the type
DATE (a type that does not exist in standard JSON), whereas in
the case of the CAST(... AS JSON) operation, MySQL did not have
any additional type information for the value of the property,
leaving it no other choice than to assume a STRING type. The
following query proves the point:
mysql> SELECT JSON_TYPE(JSON_EXTRACT(JSON_OBJECT('bla', current_date), '$.bla'))
    -> ,      JSON_TYPE(JSON_EXTRACT(CAST('{"bla": "2015-11-01"}' AS JSON), '$.bla'))
    -> \G
*************************** 1. row ***************************
     JSON_TYPE(JSON_EXTRACT(JSON_OBJECT('bla', current_date), '$.bla')): DATE
JSON_TYPE(JSON_EXTRACT(CAST('{"bla": "2015-11-01"}' AS JSON), '$.bla')): STRING
1 row in set (0.00 sec)
Comparing JSON values to non-JSON values

Fortunately, comparison of JSON values to MySQL non-JSON values
is pretty consistent, without requiring explicit CAST operations.
This may sound obvious, but it's really not. The following query
might explain better what I mean. Consider a JSON object with a
property called "myProp" that has a string value of "value1":
mysql> SELECT JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp');
+-----------------------------------------------------------+
| JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp') |
+-----------------------------------------------------------+
| "value1"                                                  |
+-----------------------------------------------------------+
1 row in set (0.00 sec)
Note the double quotes around the value - when we extract the
value of the myProp property, the result is a JSON string - not a
native MySQL character type. And when that result is rendered by
the client, its MySQL string representation includes the double
quotes. To get a proper MySQL string, we can apply
JSON_UNQUOTE(), like this:
mysql> SELECT JSON_UNQUOTE(JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp'));
+-------------------------------------------------------------------------+
| JSON_UNQUOTE(JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp')) |
+-------------------------------------------------------------------------+
| value1                                                                  |
+-------------------------------------------------------------------------+
1 row in set (0.00 sec)
But fortunately, we don't really need to apply JSON_UNQUOTE() for
most operations. For example, to compare the extracted value with
a regular MySQL string value, we can simply do the comparison
without explicitly casting the MySQL string to a JSON type, or
explicitly unquoting the JSON string value to a MySQL string
value:
mysql> SELECT JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp') = 'value1';
+----------------------------------------------------------------------+
| JSON_EXTRACT(JSON_OBJECT('myProp', 'value1'), '$.myProp') = 'value1' |
+----------------------------------------------------------------------+
|                                                                    1 |
+----------------------------------------------------------------------+
1 row in set (0.00 sec)
Again, I think this is very good news!
Still, there definitely are some gotchas. The following example
might explain what I mean:
mysql> SELECT CURRENT_DATE
    -> ,      CURRENT_DATE = '2015-11-01'
    -> ,      JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp')
    -> ,      JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = '2015-11-01'
    -> ,      JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = CURRENT_DATE
    -> ,      JSON_UNQUOTE(JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp')) = '2015-11-01'
    -> \G
*************************** 1. row ***************************
                                                                                CURRENT_DATE: 2015-11-01
                                                                 CURRENT_DATE = '2015-11-01': 1
                               JSON_EXTRACT(JSON_OBJECT('myProp', current_date), '$.myProp'): "2015-11-01"
                JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = '2015-11-01': 0
                JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = CURRENT_DATE: 1
 JSON_UNQUOTE(JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp')) = '2015-11-01': 1
1 row in set (0.00 sec)
Note that this is the type of thing that one might easily get
wrong. The comparison CURRENT_DATE = '2015-11-01' suggests the
MySQL date value is equal to its MySQL string representation, and
the comparison
JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp') = CURRENT_DATE
suggests the value extracted from the JSON document is also equal
to the date value.

From these two results one might expect that
JSON_EXTRACT(JSON_OBJECT('myProp', CURRENT_DATE), '$.myProp')
would be equal to '2015-11-01' as well, but the query clearly
shows this is not the case. Only when we explicitly apply
JSON_UNQUOTE does the date value extracted from the JSON document
become a real MySQL string, which we then can compare with the
string value '2015-11-01' successfully.
When you think for a minute about what really happens, it does
make sense (at least, I think it does):
- A MySQL date is equal to the MySQL string representation of that date
- A MySQL date is equal to its JSON date representation
- A JSON date is not equal to the MySQL string representation of that date
- A MySQL string representation of a JSON date is equal to the MySQL string representation of that date
That said, you might still find it can catch you off guard.
Table columns of the JSON type

The JSON type is not just a runtime type - it is also available
as a storage data type for table columns. A problem though is
that there is no direct support for indexing JSON columns, which
is sure to become a problem in case you plan to query the table
based on the contents of the JSON document. Any WHERE,
JOIN...ON, GROUP BY or ORDER BY-clause that relies on extracting
a value from the JSON column is sure to result in a full table
scan.
There is a workaround though: once you know the paths for those
parts of the document that will be used to filter, order and
aggregate the data, you can create generated columns to have
these values extracted from the document, and then put an index
on those generated columns. This practice is recommended for
MySQL by the manual page for CREATE TABLE. A complete example is
given in the section called Secondary Indexes and Virtual
Generated Columns.
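A minimal sketch of that workaround (a fully worked example follows later in this post; the generated column name here is just for illustration):

-- Extract a field into a virtual generated column, then index it:
ALTER TABLE posts
  ADD title VARCHAR(255)
  GENERATED ALWAYS AS (JSON_UNQUOTE(JSON_EXTRACT(doc, '$.Title'))) VIRTUAL;

CREATE INDEX idx_posts_title ON posts (title);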
Obviously, this approach is not without issues:
- You need to rewrite your queries accordingly to use those generated columns rather than the raw extraction operations on the document. Or at least, you will have to if you want to benefit from your indexes.
- Having to create separate columns in advance seems at odds with schema flexibility, which I assume is a highly-valued feature for those that find they need JSON columns.
- The generated columns will require additional storage.
Of these concerns, I feel that the need to rewrite the queries is probably the biggest problem.

UPDATE: Roy Lyseng kindly pointed out to me that I missed an important feature. MySQL is actually smart enough to use indexed generated columns on the JSON document. Just look at this query:
mysql> explain select doc from posts where json_extract(doc, '$.Id') = 1;
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref   | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | posts | NULL       | const | id            | id   | 4       | const |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
Note how the query does not directly reference the generated
column id. MySQL magically understands that
json_extract(doc, '$.Id') was used as the expression for a
generated column, and this is enough to automatically include it
in the plan evaluation. Thanks Roy! This is marvelous :)

The additional storage seems to be the smallest issue, assuming
the number of items that you need to index is small as compared
to the entire document. (Although I can imagine the extra storage
would start to count when you want to extract large text columns
for full-text indexing.) That said, if I understand correctly, if
you create the index on VIRTUAL generated columns, only the index
will require extra storage - there won't also be storage required
for the columns themselves. (Note that creating an index will
always require extra storage - that's just how it works, both in
MySQL as well as in specialized document databases like MongoDB.)
As far as I can see now, any indexing scheme that requires us to
elect the items within the documents that we want to index in
advance suffers from the same drawback: if the schema evolves in
such a way that fields that used to be important enough to be
deemed fit for indexing get moved or renamed often, then this
practice will affect all aspects of any application that works on
the document store. My gut feeling is that despite the
theoretical possibility of schema flexibility, this will cause
enough inertia in the schema evolution (at least, with respect to
those items that we based our indexes on) to leave plenty of time
to come up with other solutions. To be fair though, having to set
up generated columns would probably add some extra inertia as
compared to a pure document database (like MongoDB).

But my main point still stands: if you choose to keep changing
the schema all the time, especially if it involves the items that
you need to filter, sort, or aggregate the data on, then the
changes will affect almost every other layer of your application
- not just your database. Apparently, that's what you bargained
for, and in the light of all other changes that would be needed
to support this practice of dynamic schema evolution, it seems
that setting up a few extra columns should not be that big a
deal.

JSON Columns and Indexing Example

Just to illustrate how it would work out, let's try and set up a
table to store JSON documents. For this example, I'm looking at
the Stackexchange datasets. There are many such datasets for
various topics, and I'm looking at the one for
math.stackexchange.com because it has a decent size - 873MB. Each
of these archives comprises 8 xml files, and I'm using the
Posts.xml file. One post document might look like this:
<row Id="1" PostTypeId="1" AcceptedAnswerId="9" CreationDate="2010-07-20T19:09:27.200" Score="85" ViewCount="4121" Body="<p>Can someone explain to me how there can be different kinds of infinities?</p>" OwnerUserId="10" LastEditorUserId="206259" LastEditorDisplayName="user126" LastEditDate="2015-02-18T03:10:12.210" LastActivityDate="2015-02-18T03:10:12.210" Title="Different kinds of infinities?" Tags="<set-theory><intuition><faq>" AnswerCount="10" CommentCount="1" FavoriteCount="28" />
I'm using Pentaho Data Integration to read these files and to convert them into JSON documents. These JSON documents look like this:
{ "Id": 1, "Body": "<p>Can someone explain to me how there can be different kinds of infinities?<\/p>", "Tags": "<set-theory><intuition><faq>", "Score": 85, "Title": "Different kinds of infinities?", "PostTypeId": 1, "AnswerCount": 10, "OwnerUserId": 10, "CommentCount": 1, "CreationDate": "2010-07-20 19:09:27", "LastEditDate": "2015-02-18 03:10:12", "AcceptedAnswerId": 9, "LastActivityDate": "2015-02-18 03:10:12", "LastEditorUserId": 206259 }
Initially, let's just start with a simple table called posts with
a single JSON column called doc:

CREATE TABLE posts (
  doc JSON
);
After loading, I got a little over a million post documents in my table:
mysql> select count(*) from posts;
+----------+
| count(*) |
+----------+
|  1082988 |
+----------+
1 row in set (0.66 sec)
(There are actually some 5% more posts in the stackexchange data
dump, but my quick and dirty transformation to turn the XML into
JSON led to a bunch of invalid JSON documents, and I didn't
bother to perfect the transformation enough to get them all. A
million is more than enough to illustrate the approach though.)

Now, let's find the post with Id equal to 1:
mysql> select doc from posts where json_extract(doc, '$.Id') = 1
    -> \G
*************************** 1. row ***************************
doc: {"Id": 1, "Body": "<p>Can someone explain to me how there can be different kinds of infinities?</p>", "Tags": "<set-theory><intuition><faq>", "Score": 85, "Title": "Different kinds of infinities?", "PostTypeId": 1, "AnswerCount": 10, "OwnerUserId": 10, "CommentCount": 1, "CreationDate": "2010-07-20 19:09:27", "LastEditDate": "2015-02-18 03:10:12", "AcceptedAnswerId": 9, "LastActivityDate": "2015-02-18 03:10:12", "LastEditorUserId": 206259}
1 row in set (1.45 sec)
Obviously, the query plan requires a full table scan:
mysql> explain select doc from posts where json_extract(doc, '$.Id') = 1;
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | posts | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 1100132 |   100.00 | Using where |
+----+-------------+-------+------------+------+---------------+------+---------+------+---------+----------+-------------+
1 row in set, 1 warning (0.00 sec)
First, let's try and add a generated column for the Id. The Id
is, as its name implies, unique, and it seems sensible to create
a PRIMARY KEY for that as well:
mysql> ALTER TABLE posts
    -> ADD id INTEGER UNSIGNED
    -> GENERATED ALWAYS AS (JSON_EXTRACT(doc, '$.Id'))
    -> STORED
    -> NOT NULL PRIMARY KEY;
Query OK, 1082988 rows affected (36.23 sec)
Records: 1082988  Duplicates: 0  Warnings: 0
You might notice that in this case, the generated column is
STORED rather than VIRTUAL. This is the case because MySQL won't
let you create a PRIMARY KEY on a VIRTUAL generated column. If
you try it anyway, you'll get:
mysql> ALTER TABLE posts
    -> ADD id INTEGER UNSIGNED
    -> GENERATED ALWAYS AS (JSON_EXTRACT(doc, '$.Id')) NOT NULL
    -> VIRTUAL
    -> PRIMARY KEY;
ERROR 3106 (HY000): 'Defining a virtual generated column as primary key' is not supported for generated columns.
Now, let's try our -modified- query again:
mysql> explain select doc from posts where id = 1;
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | posts | NULL       | const | PRIMARY       | PRIMARY | 4       | const |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+-------+---------------+---------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
If you actually try to run the query you'll notice it returns
instantly - as is to be expected, since we can now access the
document directly via the PRIMARY KEY.
Now, let's try this again but using a VIRTUAL column and a UNIQUE
index:
mysql> ALTER TABLE posts
    -> DROP COLUMN id
    -> ;
Query OK, 1082988 rows affected (35.44 sec)
Records: 1082988  Duplicates: 0  Warnings: 0

mysql> ALTER TABLE posts
    -> ADD id INTEGER UNSIGNED
    -> GENERATED ALWAYS AS (JSON_EXTRACT(doc, '$.Id'))
    -> VIRTUAL
    -> NOT NULL UNIQUE;
Query OK, 1082988 rows affected (36.61 sec)
Records: 1082988  Duplicates: 0  Warnings: 0
Now the plan is:
mysql> explain select doc from posts where id = 1;
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
| id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref   | rows | filtered | Extra |
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | posts | NULL       | const | id            | id   | 4       | const |    1 |   100.00 | NULL  |
+----+-------------+-------+------------+-------+---------------+------+---------+-------+------+----------+-------+
1 row in set, 1 warning (0.00 sec)
The plan is almost the same, except of course that now access is
via the UNIQUE key rather than the PRIMARY KEY. The query again
returns almost instantly, although it will be slightly slower.
That said, this example is not so much about making a benchmark
or measuring performance; it's more about showing how to achieve
some form of indexing when storing JSON documents in a MySQL
table. I truly hope someone else will try and conduct a serious
benchmark so that we can get an idea just how performance of the
MySQL JSON type compares to alternative solutions (like the
PostgreSQL JSON type, and MongoDB). I feel I lack both the
expertise and the tools to do so myself so I'd rather leave that
to experts.
Daniël van Eeden kindly pointed out that query results may be
different depending on the presence of an index. Please read
bug 76834 to learn how this may affect you.

In Conclusion
- MySQL JSON support looks pretty complete.
- Integration of the JSON type system and the MySQL native type system is, in my opinion, pretty good, but there are definitely gotchas.
- Achieving indexing for JSON columns relies on a few specific workarounds, which may or may not be compatible with your requirements.
I hope this post was useful to you. I sure learned a lot by investigating the feature, and it gave me a few ideas of how I could use the JSON features in the future.