Shard-Query 2.0 Beta 1 released

It is finally here.  After three years of development, the new version of Shard-Query is finally available for broad testing.

This new version of Shard-Query is vastly improved over previous versions in many ways.  This is in large part due to the fact that the previous version of Shard-Query (version 1.1) entered into production at a large company.  Their feedback during implementation was invaluable in building the new Shard-Query features.   The great thing is that this means that many of the new 2.0 features have already been tested in at least one production environment.

This post is intended to highlight the new features in Shard-Query 2.0.  I will be making posts about individual features as well as posting benchmark results.

Configuration Repository
Shard-Query 1.x stored the configuration in a .ini file.  In an installation with many nodes, keeping the .ini file in sync was difficult.  There was no way to guarantee that all of Shard-Query saw the state of the cluster (in terms of node composition) as an entire unit.  Shard-Query 2.0 solves this problem and makes configuration and bootstrapping of new nodes simpler.  Shard-Query 2.0 stores the configuration inside of a database schema, rather than in a flat file.  Each node is made aware of the configuration repository through the node setup process.  The cluster configuration is then read from the repository each time Shard-Query initializes.

REST interface/GUI
Shard-Query 2.0 features a user interface built using the DooPHP MVC framework.  This interface was graciously contributed by Alex Hurd as part of the implementation I spoke of earlier.  The web interface provides for both query execution and post-installation Shard-Query configuration.

MySQL proxy 0.8.3 interface
Shard-Query now includes a Lua script for MySQL proxy.  This allows applications designed to interact with MySQL to work with Shard-Query transparently.  This is very useful for tools such as Mondrian, Tableau or other tools that are designed for working with and visualizing big data.

DDL and SHOW support
Shard-Query now supports creating and dropping tables and indexes.  The MySQL Proxy interface also supports SHOW commands such as SHOW TABLES and SHOW DATABASES.  Please note that cross-shard operations such as creating and dropping tables requires that all shards be online for the operation.  Failures may result in inconsistency (tables or indexes on some nodes, but not others) so please use with care.  Unfortunately MySQL does not have transactional DDL.

Extended SQL support
Previous versions of Shard-Query only supported limited support for SQL expressions.  Expressions such as ROUND(SUM(expr), 2) did not work.  Nor did SUM(expr)/COUNT(expr).  These issues have been resolved.  Shard-Query now supports nearly all MySQL SELECT syntax.  This includes complex expressions, ORDER BY, LIMIT, HAVING and WITH ROLLUP.  Support for subqueries in the FROM clause has been extended.  UNION and UNION ALL queries are now executed slightly differently as well.  SQL syntax checking is now performed as well, and malformed queries return familiar error messages.

Partitioning based parallelism
Tables partitioned at the MySQL level are now automatically detected and used for intra-query parallelism.  Currently RANGE, LIST, RANGE COLUMNS and LIST COLUMNS partitioning methods are supported for partition parallelism when a single column is used as the partition key. Shard-Query will automatically construct WHERE clauses to examine individual partitions and examine partitions in parallel. The next version of Shard-Query (2.5) will support all types of partitions for any number of partition key columns for MySQL 5.6.10 or greater, through use of the new PARTITION hint.

Improved parallel loading mechanism
The Shard-Query loader was formerly multi-process, but this did not work on Windows.  Now the loader has been split into two parts.  The first is a quick scan operation with identifies individual “chunks” of the table to load.  The second phase involves Gearman workers actually loading those chunks.  Currently, the flat file input must be available on all nodes running loader workers via the same path.  This likely means that the files must be placed on shared storage or duplicated on all nodes on which the loader workers run.  This limitation will probably be lifted in the future.

Asynchronous query execution
Shard-Query now supports executing queries asynchronously.  There is a job table which can be used to determine if a background job has completed yet.  See the bin/update_job_table script which actually populates the job table. If you don’r run (or cron) the script, you won’t see any jobs completing.  See another example in the GUI (it invokes the job table update code too).  This feature can be used to execute long running queries over large amounts of data.

Multi-shard (range) shard lookups
Queries which access more than one shard key via IN, BETWEEN or OR clauses will now be sent to only the shards which contain the keys in question.  In previous versions range lookups scanned all shards. A bug was also fixed in WHERE clause processing.  WHERE clause expressions such as “WHERE (the_shard_key IN(1,2,3,4,5))” were not handled properly due to the enclosing parenthesis, but this has been resolved.

Star Schema Optimization
Star schemata present unique challenges to traditional database optimizers.  By turning on star schema optimization Shard-Query can offer vastly improved query performance for queries which utilize a star schema.  Often an improvement of an order of magnitude or more can be seen when using this option, but it only works forstar schema.

Custom aggregate functions
The parser now supports the addition of custom aggregate functions to the SQL dialect.  A reference function called PERCENTILE(expr, N) is included.  Custom non-aggregate functions are not yet supported.