One of the interesting challenges in writing a system like VividCortex is writing a query digesting algorithm that deals with all of the technology-specific features, bugs, and quirks. When we added PostgreSQL performance monitoring recently, I knew this would be a new challenge because Postgres uses ANSI standard syntax for quoted strings, but MySQL uses different syntax. After checking a bit further, I also found something I didn’t know about: PostgreSQL also supports a nonstandard quoting syntax, which is called “dollar quoting.”
There are other interesting considerations about query digesting, so I thought it was worth a blog post.
Digesting queries accurately is a big deal. Because VividCortex groups similar SQL statements together by digesting them, identifying literals correctly is important. If we think something is a literal and it's really an object name, we'll group statements together when we shouldn't. If we think something's not a literal and it is, we'll fail to group them, creating an explosion of metrics cardinality that could cause us serious performance problems. Either way, it's a very bad experience for customers.
Quoting Syntaxes
Let's begin with MySQL, which by default allows two string quoting syntaxes: double-quotes and single-quotes. These two statements are functionally the same:
select `col1` from `mydb`.`mytbl` where `user` = 'batman'
select `col1` from `mydb`.`mytbl` where `user` = "batman"
Neither of them is ANSI standard and they won’t port as-is to other databases, because of the backticks used to quote identifiers.
MySQL permits "double quotes"
to surround strings
because that makes it easy to insert single-quotes without
escaping: "this string isn't escaped."
Without that,
we’d have to use escaping. Standard SQL escaping requires
doubling quotes; MySQL uses backslashes. (More fun.)
The ANSI standard, however, specifies that object identifiers are surrounded with double-quotes, and as far as I know only single-quotes are legal string delimiters:
select "col1" from "mydb"."mytbl" where "user" = 'batman'
For extra fun, MySQL's SQL_MODE setting can be changed to use ANSI quoting syntax. For extra extra extra fun, this is both a server configuration option and a session-settable variable, which makes a digesting program's job hard, to put it mildly. If someone's using ANSI quoting on MySQL and we only supported the default syntax, we'd incorrectly digest the above statement to
select ? from ?.? where ? = ?
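To see why that's hard to avoid, consider that any client can flip the mode mid-session with a single statement (ANSI_QUOTES is the relevant mode flag):

set session sql_mode = 'ANSI_QUOTES'

From then on, double-quoted tokens in that session are identifiers rather than strings, and an observer that didn't see the mode change can only guess which interpretation applies.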
The same thing holds for PostgreSQL, in reverse: the default delimiters around strings change from either of ["'] to just [']. If we don't support this, we'll over-digest the SQL as just shown.
PostgreSQL’s Dollar-Quoting Syntax
That was bad enough, but then I discovered PostgreSQL's dollar-quoting syntax. Interpolating and escaping quotes is inconvenient enough that PostgreSQL, too, offers an alternative to make it easier.
In a nutshell, you can use arbitrary delimiters between dollar signs. You start your quote with anything you want, and as long as you end it with the same thing, it’s valid (which reminds me of Perl). There’s a lot of complexity to this, but the most common usage in practice seems to be a double-dollar-sign with nothing in between, like this:
select $$this string's quotes don't need to be escaped$$
But you don’t have to do that. You could also do
select $abc123$this string's quotes don't need to be escaped$abc123$
The correct way to digest both of those statements is
select ?
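Recognizing dollar quotes in a single pass isn't hard once you know they exist. Here's a rough Go sketch of the idea; it's my illustration, not VividCortex's actual code, and it glosses over details such as the rule that a real tag can't begin with a digit:

package digest

import "strings"

// skipDollarQuote assumes s[i] == '$'. If a dollar-quoted string
// starts there, it returns the index just past the closing delimiter
// and true; otherwise (for example, a $1 placeholder) it returns i
// and false.
func skipDollarQuote(s string, i int) (int, bool) {
	j := i + 1
	for j < len(s) && isTagChar(s[j]) {
		j++ // scan the optional tag between the dollar signs
	}
	if j >= len(s) || s[j] != '$' {
		return i, false // not an opening delimiter
	}
	delim := s[i : j+1] // e.g. "$$" or "$abc123$"
	if end := strings.Index(s[j+1:], delim); end >= 0 {
		return j + 1 + end + len(delim), true
	}
	return i, false // unterminated; the caller decides how to recover
}

func isTagChar(c byte) bool {
	return c == '_' || ('a' <= c && c <= 'z') ||
		('A' <= c && c <= 'Z') || ('0' <= c && c <= '9')
}

A digester emits a single ? for the whole span this function consumes, no matter what's between the delimiters.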
Prepared Statement Parameter Syntax; Other Variations
It doesn't stop there. In MySQL, when you create a server-side prepared statement, you send question-mark placeholders to the server, like this: SELECT col FROM tbl WHERE id = ?. But in PostgreSQL, parameters are numbered and prefixed with, you guessed it, a dollar sign: SELECT col FROM tbl WHERE id = $1. So we have to recognize that syntax too.
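The fiddly part is that a parameter and a dollar quote start with the same byte, so the scanner has to check for both, in the right order. A companion to the earlier sketch (same hypothetical package, same disclaimers):

// skipParam assumes s[i] == '$'. If a numbered parameter such as $1
// or $23 starts there, it returns the index just past the digits and
// true; a bare $ may instead open a dollar quote, so it returns
// i and false in that case.
func skipParam(s string, i int) (int, bool) {
	j := i + 1
	for j < len(s) && '0' <= s[j] && s[j] <= '9' {
		j++
	}
	if j == i+1 {
		return i, false
	}
	return j, true
}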
Another common place where many digesting algorithms fall down is variable-length arguments, such as IN() lists. Some digesting algorithms, such as those used internally by a lot of databases (and exposed via, e.g., the pg_stat_statements view), will digest the following two statements differently:
SELECT col FROM tbl WHERE id IN(1, 2, 3)
SELECT col FROM tbl WHERE id IN(4, 5)
Functionally, those are obviously the same query. The consequence of failing to consider them the same is that this potentially important query will be divided and scattered to the winds, and its impact or importance will be underestimated as a result.
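A digest that recognizes the pattern collapses the whole list into a single placeholder so that both statements produce the same digest; the exact placeholder convention varies from tool to tool, but the result looks something like
SELECT col FROM tbl WHERE id IN(?)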
Since we support network sniffing as well as reading from internal instrumentation such as the Performance Schema, we have to deal with these differences in digesting capabilities. We don't just accept what the database hands us; it's not always good enough, so our agents have to figure out the mapping between the database's digests and ours, all while going to great lengths to avoid performance problems caused by querying the database's views.
There are dozens of other special cases too, including but not limited to:
- Repeated clauses, such as MySQL's multi-row INSERT ... VALUES(...),(...),(...) syntax (see the sketch after this list)
- Repeated, duplicate UNIONs
- Object identifiers (database and table names) that contain variable parts, such as cust_123 and cust_345 databases in a multi-tenant architecture (that is a simple example; the real world is messier)
- Various legal but hard-to-recognize syntaxes, such as the literal NULL versus the IS NULL predicate, scientific notation, whitespace variations, and more
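To give a flavor of the first item, here's one way to collapse repeated placeholder groups once literals have already been replaced. The function name and approach are mine for illustration; a production digester folds this into its single scanning pass rather than doing repeated string replacement:

// collapseRepeats makes variable-length constructs compare equal:
// runs of placeholders such as "?,?,?" become "?", and repeated
// tuples such as "(?),(?)" become "(?)".
func collapseRepeats(digested string) string {
	// Normalize whitespace around commas between placeholders.
	s := strings.ReplaceAll(digested, "? ,", "?,")
	s = strings.ReplaceAll(s, ", ?", ",?")
	// "?,?,?" -> "?" (loop because ReplaceAll doesn't re-scan its output).
	for strings.Contains(s, "?,?") {
		s = strings.ReplaceAll(s, "?,?", "?")
	}
	// "(?),(?),(?)" -> "(?)".
	for strings.Contains(s, "(?),(?)") {
		s = strings.ReplaceAll(s, "(?),(?)", "(?)")
	}
	return s
}

With this, insert into t values(?,?),(?,?),(?,?) and insert into t values(?,?) both end up as insert into t values(?), and the two IN() examples above both become SELECT col FROM tbl WHERE id IN(?).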
I had accumulated a good corpus of examples from my previous experience with query digesting tools, so I knew about these in advance. It’s not hard to implement these. It’s just hard to implement them correctly and efficiently.
Query Digesting: No SQL Parsing Allowed
Query digesting cannot be implemented with a traditional parser. We are not observing and parsing valid SQL all the time. Network traffic sniffing is a tricky business (validating, correcting, and resyncing when the packets are corrupt or out of order is a big subject in itself), and we also have a lot of constraints and safeguards built into our agents. For example, we may get incomplete packets or we may miss packets. Even if we get complete packets, the data may be too long; we have a configurable limit on how much of the query we’ll attempt to digest. There are many such edge cases to consider. Failing to consider them leaves the door open to a serious problem when that edge case shows up in the real world.
In addition, we have no need to actually parse the SQL (or other query language) and generate an AST or other structure from it. Doing so is extremely costly relative to our digest algorithm.
The good news is, VividCortex supports all of the above quoting syntaxes out of the box. I don’t remember if we’ve discussed this on this blog before, but our query digesting algorithm is a hyper-efficient, single-pass algorithm that doesn’t use regular expressions or formal grammars. Because of my past work with regex-based query digesting algorithms, I knew from the start that we’d need to ban their use.
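To make that concrete, here's a deliberately tiny sketch of what a single-pass digester's main loop can look like: no regular expressions, no grammar, no backtracking, one byte at a time. It's an illustration of the technique, not VividCortex's implementation, and it handles only single-quoted strings and numbers; the real thing adds all the cases discussed above, plus the safeguards:

// Digest replaces string and numeric literals with ? in one
// left-to-right pass. (Same hypothetical package as the earlier sketches.)
func Digest(sql string) string {
	var out strings.Builder
	prevIdent := false // was the previous byte part of an identifier?
	i := 0
	for i < len(sql) {
		c := sql[i]
		switch {
		case c == '\'':
			// String literal: consume through the closing quote,
			// honoring backslash and doubled-quote ('') escapes.
			i++
			for i < len(sql) {
				if sql[i] == '\\' && i+1 < len(sql) {
					i += 2
				} else if sql[i] == '\'' {
					i++
					if i < len(sql) && sql[i] == '\'' {
						i++ // '' escape; the string continues
						continue
					}
					break
				} else {
					i++
				}
			}
			out.WriteByte('?')
			prevIdent = false
		case '0' <= c && c <= '9' && !prevIdent:
			// Numeric literal (but not the trailing 1 in "col1").
			for i < len(sql) && ('0' <= sql[i] && sql[i] <= '9' || sql[i] == '.') {
				i++
			}
			out.WriteByte('?')
			prevIdent = false
		default:
			out.WriteByte(c)
			prevIdent = isTagChar(c) // reuses isTagChar from the earlier sketch
			i++
		}
	}
	return out.String()
}

Feed it select c1 from t where x = 'bat''man' and y = 42 and it returns select c1 from t where x = ? and y = ?. Everything else in this post (quote modes, dollar quotes, numbered parameters, list collapsing) slots into the same loop as more cases, which is how it stays single-pass and fast.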
In practice, multiple implementations of this have reinforced the validity of the approach we’ve taken. I even know of an Application Performance Management vendor that uses a Perl module I wrote for Maatkit, many years ago, in their own product to digest queries! Why do they use it? Because it works, where all the alternative approaches would fail on unrecognized syntax, truncated or garbled data, or other issues. (I was greatly tickled to find out about this Perl code that lives on, apparently immortal, in a vendor’s product!)
Conclusions
Query digesting has to be robust, high performance, safeguarded, and sensitive to individual particularities of the various technologies we support. Our query digest algorithm has had these characteristics since the beginning. It’s some of the earliest code we wrote, and it’s also extremely well tested, both formally with unit tests, and in real usage.