It is finally here. After three years of development, the new version of Shard-Query is finally available for broad testing.
This new version of Shard-Query is vastly improved over previous versions in many ways. This is in large part due to the fact that the previous version of Shard-Query (version 1.1) entered into production at a large company. Their feedback during implementation was invaluable in building the new Shard-Query features. The great thing is that this means that many of the new 2.0 features have already been tested in at least one production environment.
This post is intended to highlight the new features in Shard-Query 2.0. I will be making posts about individual features as well as posting benchmark results.
Configuration Repository
Shard-Query 1.x
stored the configuration in a .ini file. In an installation
with many nodes, keeping the .ini file in sync was difficult.
There was no way to guarantee that all of Shard-Query saw
the state of the cluster (in terms of node composition) as an
entire unit. Shard-Query 2.0 solves this problem and makes
configuration and bootstrapping of new nodes simpler.
Shard-Query 2.0 stores the configuration inside of a
database schema, rather than in a flat file. Each node is
made aware of the configuration repository through the node setup
process. The cluster configuration is then read from the
repository each time Shard-Query initializes.
REST interface/GUI
Shard-Query 2.0 features
a user interface built using the DooPHP MVC framework. This
interface was graciously contributed by Alex Hurd as part of the
implementation I spoke of earlier. The web interface
provides for both query execution and post-installation
Shard-Query configuration.
MySQL proxy 0.8.3 interface
Shard-Query now
includes a Lua script for MySQL proxy. This allows
applications designed to interact with MySQL to work with
Shard-Query transparently. This is very useful for tools
such as Mondrian, Tableau or other tools that are designed for
working with and visualizing big data.
DDL and SHOW support
Shard-Query now
supports creating and dropping tables and indexes. The
MySQL Proxy interface also supports SHOW commands such as SHOW
TABLES and SHOW DATABASES. Please note that cross-shard
operations such as creating and dropping tables requires that all
shards be online for the operation. Failures may result in
inconsistency (tables or indexes on some nodes, but not others)
so please use with care. Unfortunately MySQL does not have
transactional DDL.
Extended SQL support
Previous versions of
Shard-Query only supported limited support for SQL expressions.
Expressions such as ROUND(SUM(expr), 2) did not work.
Nor did SUM(expr)/COUNT(expr). These issues have been
resolved. Shard-Query now supports nearly all MySQL SELECT
syntax. This includes complex expressions, ORDER BY, LIMIT,
HAVING and WITH ROLLUP. Support for subqueries in the FROM
clause has been extended. UNION and UNION ALL queries are
now executed slightly differently as well. SQL syntax
checking is now performed as well, and malformed queries return
familiar error messages.
Partitioning based parallelism
Tables partitioned at the MySQL level are now automatically
detected and used for intra-query parallelism. Currently
RANGE, LIST, RANGE COLUMNS and LIST COLUMNS partitioning methods
are supported for partition parallelism when a single column
is used as the partition key. Shard-Query will automatically
construct WHERE clauses to examine individual partitions and
examine partitions in parallel. The next version of Shard-Query
(2.5) will support all types of partitions for any number of
partition key columns for MySQL 5.6.10 or greater, through use of
the new PARTITION hint.
Improved parallel loading mechanism
The Shard-Query loader was formerly multi-process, but this did
not work on Windows. Now the loader has been split into two
parts. The first is a quick scan operation with identifies
individual “chunks” of the table to load. The second phase
involves Gearman workers actually loading those chunks.
Currently, the flat file input must be available on all
nodes running loader workers via the same path.
This likely means that the files must be placed on shared
storage or duplicated on all nodes on which the loader workers
run. This limitation will probably be lifted in the future.
Asynchronous query execution
Shard-Query
now supports executing queries asynchronously. There is a
job table which can be used to determine if a background job has
completed yet. See the bin/update_job_table script which
actually populates the job table. If you don’r run (or cron) the
script, you won’t see any jobs completing. See another
example in the GUI (it invokes the job table update code too).
This feature can be used to execute long running queries
over large amounts of data.
Multi-shard (range) shard lookups
Queries
which access more than one shard key via IN, BETWEEN or OR
clauses will now be sent to only the shards which contain the
keys in question. In previous versions range lookups
scanned all shards. A bug was also fixed in WHERE clause
processing. WHERE clause expressions such as “WHERE
(the_shard_key IN(1,2,3,4,5))” were not handled properly due to
the enclosing parenthesis, but this has been resolved.
Star Schema Optimization
Star schemata
present unique challenges to traditional database optimizers.
By turning on star schema optimization Shard-Query can
offer vastly improved query performance for queries which utilize
a star schema. Often an improvement of an order of
magnitude or more can be seen when using this option, but it only
works forstar schema.
Custom aggregate functions
The parser now
supports the addition of custom aggregate functions to the SQL
dialect. A reference function called PERCENTILE(expr, N) is
included. Custom non-aggregate functions are not yet
supported.