SQL parser refactoring in 5.7.4 LAB release

We are refactoring the SQL parser: the sql_yacc.yy file and friends.

Refactoring the parser consists of a base task that provides the common framework for refactoring, and several follow-up tasks to refactor individual types of SQL statements. We have already completed two follow-up tasks: one to refactor SELECT statements, and another to refactor SET statements.

Parser refactoring goals and tasks

The old parser had critical limitations because of its grammar complexity and top-down parsing style:

  • The top-down parsing style is not natural for Bison/YACC parser generators (they generate bottom-up parsers), which lead us to have overcomplicated grammar.
  • It was hard to fix grammar related bugs.
  • It was difficult, and risky, to extend non-trivial SQL statements, as well as to simply make our SQL more standard-compliant.
  • The parser grammar had an enormous number of shift/reduce conflicts (about 160), making it hard to avoid new conflicts, and risky to remove old ones.
  • Syntax rules had numerous embedded actions, which created strong dependencies between the parser and the context.
  • The parser took decisions for later execution, while we wanted the parser to produce just an internal representation of the query, and add semantics in later stages.

Due to this, there have been numerous attempts at completely replacing the parser with other technologies, but due to the complexity, these attempts have all failed.

It seems that the main cause of those failures was a wish to redesign and rewrite the entire parser all at once (since the parser is monolithic). This is a really complex task, one that is possible with great effort, but the review work alone for such a large wholesale replacement of the parser is an even more complex task. This is because there is no known formal method to compare two parsers for the full equivalence. The tasks become more and more complex if we try to replace the LALR(1) parser generator (Bison, YACC) with something different (ANTLR, for example) as part of the same refactoring.

  • Thus, our goal is to refactor the parser incrementally, by easily reviewed and tested steps, while sticking with our current Bison parser generator.

Our new methodology allows us to refactor the SQL parser statement by statement, more or less independently, which significantly simplifies our work. Moreover, we divide each statement refactoring into a number of elementary intermediate changes: grammar rule by rule, by isolating direct and indirect recursion loops in the grammar, and so on. On each iteration, the parser is still fully functional. So we are able to run the existing regression tests, and add new ones after each intermediate step; something that a wholesale replace and rewrite approach prevented.

The main goal is a pure (context-independent), and much simpler, bottom-up parser:

  • We are introducing an intermediate layer between the parser and the resolver: the parse tree.
  • The parse tree is easy to save/restore and to debug, as it is just a tree of C++ objects (instead of a Bison-generated state machine).
  • The new grammar is simpler and less ambiguous, which makes it easier to maintain and extend.
  • Far fewer calls to the (expensive) _current_thread() function, which is needed in order to access the parse context.
  • More precise syntax error messages.
  • No useless intermediate rules.
  • All semantic actions are together, at the end of the rules.

Bits of methodology

We are trying to reuse existing things as much as possible. This is not a complete rewriting of the parser from scratch. The new intermediate layer (the parse tree) consists of parse tree nodes. Most of those nodes are not new objects, but rather good old Items (nodes of parsed expressions). In other words, the parse tree contains fragments of what is going to be the AST (Abstract Syntax Tree). The old AST depends on the parse context, whereas the new parser is required to be context-free, thus we have modified the Items to be context-independent at the parse stage. Basically, we have separated their constructors into two parts: the first one is for the allocation and initialization at the parse stage, and the second one is for the contextualization of the parse tree. The contextualization of the parse tree produces the AST.

Please see WL#7199 for details.

Status and future work

In the “April” 5.7 labs release, all rules for the SELECT statement and the SET statement have been transformed. We expect to transform the rules for the remaining DML statements (UPDATE, DELETE and INSERT) in future releases. We also expect to simplify the SELECT rules generally, so that they are better aligned with the SQL standard.