In my previous posts I introduced two new conflict detection
functions, NDB$EPOCH and NDB$EPOCH_TRANS, without explaining how
these functions actually detect conflicts. To simplify the
explanation I'll initially consider two circularly replicating
MySQL Servers, A and B, rather than two replicating Clusters, but
the principles are the same.
Commit ordering
Avoiding conflicts requires that data be modified on only one
Server at a time. This can be done by defining Master/Slave roles
or Active/Passive partitions etc. Where this is not done, and
data can be …
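To make the problem concrete, here is a worked example of the
kind of conflict these functions detect (table and column names
are hypothetical) :

  -- Both servers start with the same row in table t : (pk=1, x=1)
  -- On Server A :
  UPDATE t SET x = 2 WHERE pk = 1;
  -- Concurrently on Server B, before A's change has replicated :
  UPDATE t SET x = 3 WHERE pk = 1;
  -- Each UPDATE then replicates to the other server, so A ends up
  -- with x = 3 and B with x = 2 : the servers have silently
  -- diverged even though replication itself reported no errors.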
I've already described Justin Swanhart's Flexviews project as
something I think is cool. Since then Justin appears to
have been working more on Shard-Query, which I also think is cool, perhaps
even more so than Flexviews.
On the page linked above, Shard-Query is described using the
following statements :
"Shard-Query is a distributed parallel query engine for
MySQL"
"ShardQuery is a PHP class which is intended to make working with
a partitioned dataset easier""ParallelPipelining - MPP
distributed query engines runs fragments of queries in parallel,
combining the results at the end. Like map/reduce except it
speaks SQL directly."
The things I like from the above description :
- Distributed …
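To illustrate the 'runs fragments of queries in parallel' idea,
here is a conceptual sketch of how an aggregate over a sharded
table can be decomposed. This is my own illustration, not
Shard-Query's actual rewrite rules, and the table names are
hypothetical :

  -- Original query, as the application writes it :
  SELECT status, COUNT(*) FROM orders GROUP BY status;
  -- Step 1 : run a partial aggregate on every shard, in parallel
  SELECT status, COUNT(*) AS partial_count
  FROM orders GROUP BY status;
  -- Step 2 : combine the collected partial results at the end
  SELECT status, SUM(partial_count)
  FROM all_shard_results GROUP BY status;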
In my last post I described the motivation for the new NDB$EPOCH
conflict detection function in MySQL
Cluster. This function detects when a row has been
concurrently updated on two asynchronously replicating MySQL
Cluster databases, and takes steps to keep the databases in
alignment.
With NDB$EPOCH, conflicts are detected and handled at row
granularity, as opposed to column granularity, since row is the
granularity of the epoch metadata used to detect conflicts.
Dealing with conflicts on a row-by-row basis has implications for
schema and application design. The NDB$EPOCH_TRANS function
extends NDB$EPOCH, giving …
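For reference, both functions are selected per-table via the
mysql.ndb_replication table, and conflicting rows can optionally
be recorded in an exceptions table. A minimal sketch, assuming a
table test.t1 with an integer primary key pk (binlog_type 7
requests the full-row, update-as-update binlog format these
functions need) :

  INSERT INTO mysql.ndb_replication
    (db, table_name, server_id, binlog_type, conflict_fn)
  VALUES
    ('test', 't1', 0, 7, 'NDB$EPOCH_TRANS()');
  -- Optional exceptions table, named <table>$EX, in which
  -- conflicting operations are recorded :
  CREATE TABLE test.t1$EX (
    server_id        INT UNSIGNED,
    master_server_id INT UNSIGNED,
    master_epoch     BIGINT UNSIGNED,
    count            INT UNSIGNED,
    pk               INT NOT NULL,  -- copy of the base table's primary key
    PRIMARY KEY (server_id, master_server_id, master_epoch, count)
  ) ENGINE=NDB;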
tl;dr : New 'automatic' optimistic conflict detection functions
are available, giving the best of both optimistic and pessimistic
replication on the same data.
MySQL replication supports a number of topologies, and one of the
most interesting is an active-active, or master-master topology,
where two or more Servers accept read and write traffic, with
asynchronous replication between them.
This topology has a number of attractions, including :
- Potentially higher availability
- Potentially low impact on read/write latency
- Service availability insensitive to replication failures
- Conceptually simple
However, data consistency is hard to maintain in this
environment. Data, and access to it, must usually be partitioned …
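For the simple two-server case, the topology itself is just
circular replication, with each server pointing at the other. A
minimal sketch (host names, credentials and binlog coordinates
are placeholders) :

  -- On server B, replicate from A :
  CHANGE MASTER TO
    MASTER_HOST = 'serverA',
    MASTER_USER = 'repl',
    MASTER_PASSWORD = '...',
    MASTER_LOG_FILE = 'binlog.000001',
    MASTER_LOG_POS = 4;
  START SLAVE;
  -- On server A, replicate from B in the same way, closing the
  -- circle. Each server must have a distinct server_id so that
  -- its own events are discarded when they arrive back round.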
Unlike most other MySQL storage engines, Ndb does not perform all
of its work in the MySQLD process. The Ndb table handler maps
Storage Engine Api calls onto NdbApi calls, which eventually result in
communication with data nodes. In terms of layers, we have SQL
-> Handler Api -> NdbApi -> Communication. At each of
these layer boundaries, the mapping from operations at the
upper layer to operations at the lower layer is non-trivial,
based on runtime state, statistics, optimisations etc.
The MySQL status variables can be used to understand the
behaviour of the MySQL Server in terms of user commands
processed, and also how these map to some of the Storage Engine
Handler Api calls.
Status variables tracking user commands start with …
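As an illustration, counters from each layer can be sampled
before and after running a workload and compared. The variable
names below are from the MySQL manual (the Ndb-specific counters
available depend on the Cluster version) :

  SHOW GLOBAL STATUS LIKE 'Com_select';     -- user commands, SQL layer
  SHOW GLOBAL STATUS LIKE 'Handler_read%';  -- Handler Api calls
  SHOW GLOBAL STATUS LIKE 'Ndb_%';          -- Ndb-specific counters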
Most people looking at a diagram showing the Cluster architecture
soon want to know if the system can scale online. Api nodes such
as MySQLD processes can be added online, and the storage capacity
of existing data nodes can be increased online, but it was not
always possible to add new data nodes to the cluster without an
initial system restart requiring a backup and restore.
An online add node and data repartitioning feature was finally
implemented in MySQL Cluster 7.0. It's not clear how often users
actually do scale their Clusters online, but it certainly is a
cool thing to be able to do.
There are two parts to the feature :
- Online add an empty data node to an existing cluster
- Online rebalance existing data across the existing and new data nodes
Adding an empty data node to a cluster sounds trivial, but is
actually fairly complex given the cluster's …
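For reference, once the new (empty) data nodes are in the cluster
configuration and started, the rebalancing half of the feature is
driven per-table from SQL. A sketch, assuming the new nodes have
ids 3 and 4 and a table mydb.t1 :

  -- In the ndb_mgm management client, form a node group from
  -- the new data nodes :
  --   ndb_mgm> CREATE NODEGROUP 3,4
  -- Then repartition each table across all node groups :
  ALTER ONLINE TABLE mydb.t1 REORGANIZE PARTITION;
  -- Finally reclaim the space freed on the original nodes :
  OPTIMIZE TABLE mydb.t1;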
MySQL Cluster distributes rows amongst the data nodes in a
cluster, and also provides data replication. How does this work?
What are the tradeoffs?
Table fragments
Tables are 'horizontally fragmented' into table fragments, each
containing a disjoint subset of the rows of the table. The union
of rows in all table fragments is the set of rows in the table.
Rows are always identified by their primary key. Tables with no
primary key are given a hidden primary key by MySQLD.
By default, one table fragment is created for each data node in
the cluster at the time the table is created.
Node groups and Fragment replicas
The data nodes in a cluster are logically divided into Node
groups. The size of each Node group is controlled by the
NoOfReplicas parameter. All data nodes in a Node group store the
same data. In other words, where the NoOfReplicas parameter is
two or greater, each …
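A worked example may help : assume four data nodes (1,2,3,4) and
NoOfReplicas=2. The exact fragment placement below is
illustrative; the real layout of any table can be inspected with
the ndb_desc tool shipped with MySQL Cluster :

  CREATE TABLE mydb.t1 (id INT PRIMARY KEY, val VARCHAR(32)) ENGINE=NDB;
  -- 4 data nodes, NoOfReplicas = 2
  --   Node group 0 : nodes 1,2     Node group 1 : nodes 3,4
  -- t1 gets 4 fragments by default, spread across the node groups,
  -- each replicated on every node in its node group, e.g. :
  --   F0 -> NG0 (primary on node 1, backup on node 2)
  --   F1 -> NG1 (primary on node 3, backup on node 4)
  --   F2 -> NG0 (primary on node 2, backup on node 1)
  --   F3 -> NG1 (primary on node 4, backup on node 3)
  --   shell> ndb_desc -d mydb t1 -p   -- shows the actual layout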
When Sun Microsystems bought MySQL AB in 2008 (or did MySQL buy
Sun?), most of the MySQL team merged with the existing Database
Technology Group (DBTG) within Sun. The DBTG had been busy
working on JavaDB, Postgres and other DB related projects as well
as 'High Availability DB' (HADB), which was Sun's name for the
database formerly known as Clustra.
Clustra originated as a University research project which spun
out into a startup company and was then acquired by Sun around
the dot-com era. A number of technical papers describing aspects
of Clustra's design and history can be found online, and Clustra
is in many ways similar to Ndb Cluster, beyond their shared
Scandinavian roots. Both are
shared-nothing parallel databases originally aimed at the
Telecoms market, supporting high availability and horizontal
scalability. Clustra has an impressive feature set and …
One thing that has puzzled me about MySQL Server is that it
became famous for sharded scale-out deployments in well-known web
sites and yet has no visible support for such deployments. The
MySQL killer feature for some time has been built-in asynchronous
replication and gigabytes of blogs have been written about how to
setup, use, debug and optimise replication, but when it comes to
'sharding' there is nothing built in. Perhaps to have attempted
to implement something would have artificially constrained users'
imaginations, whereas having no support at all has allowed 1,000
solutions to sprout? Perhaps there just wasn't MySQL developer
bandwidth available, or perhaps it just wasn't the best use of
the available time. In any case, it remains unclaimed territory
to this day.
On first hearing of the Federated storage engine some years ago, …
Most software people are aware of the ACID acronym, coined by
Härder and Reuter to describe the transaction properties
identified by Jim Gray. With the growth of the web and open
source, the scaling and complexity constraints imposed on DBMS
implementations supporting ACID are more visible, and new (or at
least new terms for known) compromises and tradeoffs are being
discussed widely. The better known NoSQL
systems are giving insight by example into particular choices of
tradeoffs.
Working at MySQL, I have often been surprised at the variety of
potential alternatives when implementing a DBMS, and the number
of applications which don't need the full set of ACID letters in
the strictest form. The original MySQL storage engine, MyISAM,
is one of the first and most successful examples of an 'ACID
remix'. The people …