Query Digesting and String Quoting Syntax in SQL

One of the interesting challenges in building a system like VividCortex is writing a query digesting algorithm that deals with all of the technology-specific features, bugs, and quirks. When we added PostgreSQL performance monitoring recently, I knew this would be a new challenge, because Postgres uses ANSI standard syntax for quoted strings but MySQL uses different syntax. After digging a bit further, I also found something I didn’t know about: PostgreSQL also supports a nonstandard quoting syntax, called “dollar quoting.”

There are other interesting considerations about query digesting, so I thought it was worth a blog post.

Digesting queries correctly is a huge issue. Because VividCortex groups similar SQL statements together by digest, identifying literals accurately is important. If we think something is a literal when it’s really an object name, we’ll group statements together when we shouldn’t. If we think something’s not a literal when it really is, we’ll fail to group statements together, creating an explosion of metrics cardinality that could cause us serious performance issues. Either way, it’s a very bad experience for customers.

Quoting Syntaxes

Let’s begin with MySQL, which by default allows two string quoting syntaxes: single-quotes and double-quotes. Out of the box, these two statements are functionally the same:

select `col1` from `mydb`.`mytbl` where `user` = 'batman'
select `col1` from `mydb`.`mytbl` where `user` = "batman"

Neither of them is ANSI standard and they won’t port as-is to other databases, because of the backticks used to quote identifiers.

MySQL permits "double quotes" to surround strings because that makes it easy to insert single-quotes without escaping: "this string isn't escaped." Without that, we’d have to use escaping. Standard SQL escapes a quote by doubling it; MySQL also supports backslash escapes. (More fun.)
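For example, both of these are legal in MySQL’s default mode, but only the first one is standard SQL:

select 'batman''s cape'
select 'batman\'s cape'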

The ANSI standard, however, specifies that object identifiers are surrounded with double-quotes, and as far as I know only single-quotes are legal string delimiters:

select "col1" from "mydb"."mytbl" where "user" = 'batman'

For extra fun, MySQL’s SQL_MODE setting can be changed to use ANSI quoting syntax. For extra extra extra fun, this is both a server configuration option and a session-settable variable. This makes a digesting program’s job hard, to put it mildly. If someone uses ANSI quoting on MySQL and we supported only the default syntax, we’d incorrectly digest the above statement to

select ? from ?.? where ? = ?
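Because the mode is session-settable, a digester can’t even rely on the server’s configuration. For illustration, this is valid MySQL, reusing the statement from above:

set session sql_mode = 'ANSI_QUOTES';
select "col1" from "mydb"."mytbl" where "user" = 'batman'

With that mode in effect, the double-quoted names are identifiers and must survive into the digest; only 'batman' should become a ? placeholder.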

Now consider PostgreSQL, where the same problem appears in reverse: the legal delimiters around strings shrink from either of ["'] to just ['], and double-quotes mark identifiers. If we don’t support this, we’ll over-digest the SQL as just shown.

PostgreSQL’s Dollar-Quoting Syntax

That was bad enough, but then I discovered PostgreSQL’s dollar-quoting syntax. Quoting and escaping are inconvenient enough that even PostgreSQL offers an alternative to make them easier.

In a nutshell, you can use arbitrary delimiters between dollar signs. You start your quote with anything you want, and as long as you end it with the same thing, it’s valid (which reminds me of Perl). There’s a lot of complexity to this, but the most common usage in practice seems to be a double-dollar-sign with nothing in between, like this:

select $$this string's quotes don't need to be escaped$$

But you don’t have to do that. You could also do

select $abc123$this string's quotes don't need to be escaped$abc123$

The correct way to digest both of those statements is select ?.
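Where you’re most likely to meet dollar quoting in the wild is function bodies, where differently tagged quotes can even nest. A made-up but syntactically valid example:

create function cleanup() returns void as $fn$
begin
    delete from tbl where note = $$stale$$;
end;
$fn$ language plpgsql;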

Prepared Statement Parameter Syntax; Other Variations

It doesn’t stop there. In MySQL, when you create a server-side prepared statement, you send question-mark placeholders to the server, like this: SELECT col FROM tbl WHERE id = ?. In PostgreSQL, parameters are numbered and prefixed with, you guessed it, a dollar sign: SELECT col FROM tbl WHERE id = $1. So we have to recognize that syntax too.
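Here’s what those look like as complete server-side prepared statements (the statement names are arbitrary):

PREPARE stmt FROM 'SELECT col FROM tbl WHERE id = ?';    -- MySQL
PREPARE stmt (int) AS SELECT col FROM tbl WHERE id = $1; -- PostgreSQL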

Another place where many digesting algorithms fall down is variable-length arguments, such as IN() lists. Some digesting algorithms, including those used internally by a lot of databases (and exposed via e.g. the pg_stat_statements view), will digest the following two statements differently:

SELECT col FROM tbl WHERE id IN(1, 2, 3)
SELECT col FROM tbl WHERE id IN(4, 5)

Functionally, those are obviously the same query. The consequence of failing to consider them the same is that this potentially important query will be divided and scattered to the winds, and its impact or importance will be underestimated as a result.
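A robust digest collapses the whole list into a single placeholder, so both statements above reduce to the same form. The ?+ placeholder below is just an illustrative convention for “one or more values”:

SELECT col FROM tbl WHERE id IN(?+)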

Since we support network sniffing as well as reading from internal instrumentation such as MySQL’s Performance Schema, we have to deal with differences in digesting capabilities. We don’t just accept what the database hands us; it’s not always good enough, so our agents have to figure out the mapping between the database’s digests and ours, all while going to great lengths to avoid performance problems caused by querying the database’s views.

There are dozens of other special cases too, including but not limited to:

  • Repeated clauses, such as MySQL’s multi-row INSERT... VALUES(...),(...),(...) syntax (see the example after this list)
  • Repeated, duplicate UNIONs
  • Object identifiers (database and table names) that contain variable parts, such as cust_123 and cust_345 databases for a multi-tenant architecture (that is a simple example; the real world is messier)
  • Various legal but hard-to-recognize syntaxes such as NULL the literal versus IS NULL the predicate, scientific notation, whitespace variations, and more
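
To illustrate the first item, both of these made-up statements should map to a single digest, with the repeated row constructors collapsed just like an IN() list:

insert into tbl values(1, 2), (3, 4)
insert into tbl values(5, 6), (7, 8), (9, 10)

One reasonable digest for both is insert into tbl values(?+).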

I had accumulated a good corpus of examples from my previous experience with query digesting tools, so I knew about these in advance. None of these cases is hard to implement. It’s just hard to implement them all correctly and efficiently.

Query Digesting: No SQL Parsing Allowed

Query digesting cannot be implemented with a traditional parser, because we are not observing and parsing valid SQL all the time. Network traffic sniffing is a tricky business (validating, correcting, and resyncing when packets are corrupt or out of order is a big subject in itself), and we also have a lot of constraints and safeguards built into our agents. For example, we may get incomplete packets, or we may miss packets entirely. Even if we get complete packets, the data may be too long; we have a configurable limit on how much of the query we’ll attempt to digest. There are many such edge cases to consider, and failing to handle them leaves the door open to a serious problem when an edge case shows up in the real world.
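To make that concrete, here’s a made-up capture that was truncated mid-literal. One sensible behavior is to treat the open string as running to the end of the input and digest the statement anyway, rather than discarding the sample:

select col from tbl where note = 'this string was cut off by the capture lim

That would digest to select col from tbl where note = ?.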

In addition, we have no need to actually parse the SQL (or other query language) and generate an AST or other structure from it. Doing so would be extremely costly relative to our digest algorithm.

The good news is, VividCortex supports all of the above quoting syntaxes out of the box. I don’t remember if we’ve discussed this on this blog before, but our query digesting algorithm is a hyper-efficient, single-pass algorithm that doesn’t use regular expressions or formal grammars. Because of my past work with regex-based query digesting algorithms, I knew from the start that we’d need to ban their use.

In practice, multiple implementations of this have reinforced the validity of the approach we’ve taken. I even know of an Application Performance Management vendor that uses a Perl module I wrote for Maatkit, many years ago, in their own product to digest queries! Why do they use it? Because it works, where all the alternative approaches would fail on unrecognized syntax, truncated or garbled data, or other issues. (I was greatly tickled to find out about this Perl code that lives on, apparently immortal, in a vendor’s product!)

Conclusions

Query digesting has to be robust, high-performance, safeguarded, and sensitive to the individual particularities of the various technologies we support. Our query digest algorithm has had these characteristics since the beginning. It’s some of the earliest code we wrote, and it’s extremely well tested, both formally with unit tests and in real usage.
