tl;dr: A single MySQL Cluster prioritises Consistency in network
partition events. Asynchronously replicating MySQL Clusters
prioritise Availability in network partition events.
I was recently asked about the relationship between MySQL Cluster
and the CAP theorem. The CAP theorem is often described as
a 'pick two out of three' problem, like choosing between good,
cheap and fast: you can have any two, but you can't have all three.
For CAP the three qualities are 'Consistency', 'Availability' and
'Partition tolerance'. CAP states that in a system with data
replicated over a network only two of these three qualities can
be maintained at once, so which two does MySQL Cluster
provide?
Standard 'my interpretation of CAP' section
Everyone who discusses CAP likes to rehash it, and I'm no
exception. Daniel Abadi has the best CAP write-up that
I've read so far, which reframes CAP as a decision about whether
to ultimately prioritise availability or data consistency in the
event of a network partition. This is how I think of CAP. He also
discusses related system behaviour in normal operation which I'll
return to later.
While this reframing clarifies CAP, the terms network partition,
availability and consistency also need some definition.
Network replicated database
CAP is only really relevant in the context of a network
replicated database (or filesystem or state machine). A network
replicated database stores copies of data in multiple different
systems (database nodes), connected by a network. Data can be
read and updated. Updates are propagated to all nodes with
replicas via the network. Database clients connect to database
nodes via the network to read data and make updates. Replication
may occur to improve availability, to improve request latency, or
to improve read bandwidth.
Availability
The network replicated database exists to provide services such
as Read and Write on the data it stores. Its availability can be
measured as the ability of any client to perform any service on
any data item.
This Service Availability can be compromised by :
- Failure of client nodes
- Network failures between clients and database nodes
- Network failures between database nodes
- Failure of database nodes
Failures of client nodes, and of the network between clients and
database nodes, are not really within the control of a database
system, so I consider their effects out of the scope of CAP. However, where clients
connect to a database node, and that database node is isolated
from other database nodes, whether or not those clients are given
service is within the scope of CAP.
Service Availability is not binary; it can partially degrade,
perhaps by affecting :
- A subset of all clients
- A subset of all stored data
- A subset of request types
The shades of grey within the definition of availability are
responsible for most of the arguments around CAP. If we take a
strict view - either all services available on all data for all
clients, or nothing, then availability is fragile and hard to
maintain. If we take a more flexible approach then some service
availability can be preserved even with a completely decimated
network. In the loosest definition, if any client receives any
service on any data, then the system is still available. Rather
than choose one position, I regard availability as a range from
100% down to 0% for a full outage. Anything in the middle is
reduced availability, but it does not mean that the system is not
serving its purpose adequately.
Consistency
For consistency to be satisfied, the multiple replicas of data in
a network replicated database should behave as though there were
only one copy of the data. Simultaneous reads of the same data
item from clients connected to different database nodes must
always return the same result. Where two or more updates to the
same data item are submitted simultaneously, they must be
serialised, or one must be rejected, or they must be merged so
that a single value results. This one-copy model makes it simple
for database clients to use the network replicated database as if
it were a single database system with one atomically read/written
copy of their data.
If one copy consistency is relaxed, then different database nodes
may observably have different values for the same data item
simultaneously. Over time the data copies may be aligned, but
clients accessing the data must beware that reads may not return
the results of the most recently accepted writes. This behaviour
may be described as eventual consistency. Providing eventual
consistency allows a network replicated database to maximise
availability, but pushes the problem of dealing with transient
inconsistencies up the stack to user applications. Furthermore
there are varying qualities of eventual consistency, with varying
guarantees and levels of application support available.
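To make the distinction concrete, here is a toy Python sketch of my own (nothing to do with MySQL Cluster internals; the Replica class and function names are invented purely for illustration). The one-copy write updates every replica before acknowledging, so any read sees the latest value; the eventual write acknowledges after the local update and propagates later, so a read from the other node can briefly return stale data.

class Replica:
    def __init__(self):
        self.value = 0

def consistent_write(replicas, new_value):
    # One-copy model: every replica is updated before the write is acknowledged,
    # so a subsequent read from any node returns the new value.
    for r in replicas:
        r.value = new_value

def eventual_write(local, peers, new_value, pending):
    # Eventual model: acknowledge after the local write and queue propagation,
    # so reads from other nodes may return a stale value for a while.
    local.value = new_value
    pending.extend((p, new_value) for p in peers)

def propagate(pending):
    # Background replication catching up ("eventually").
    for replica, value in pending:
        replica.value = value
    pending.clear()

a, b = Replica(), Replica()

consistent_write([a, b], 1)
assert a.value == b.value == 1      # one-copy consistency: both nodes agree

pending = []
eventual_write(a, [b], 2, pending)
print(a.value, b.value)             # 2 1 -> observable inconsistency
propagate(pending)
print(a.value, b.value)             # 2 2 -> replicas converge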
Network Partitions
Network partitions isolate subsets of the nodes of a network
replicated database. The interesting property of a network
partition is that each node subset cannot tell whether the other
node subset(s) are :
1. dead
2. alive but isolated from clients
3. alive and reachable by clients but isolated from us
Not knowing the state of the other subset(s) is what forces a
system to decide between maximising service availability and
maximising consistency. The interesting case is 3) where some
database nodes (potentially containing all or some of the data)
are alive elsewhere and have clients connected to them. If those
clients are allowed to make writes on data copies stored on those
database nodes, then we must lose one copy consistency as we
cannot supply those new values in response to a read of our local
copy. If those clients are not allowed to make writes then we
have degraded service availability for them. Which is it to be?
This is the unavoidable choice at the centre of the CAP theorem.
Stated this way it seems less of a theorem and more of a
fact.
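As a toy sketch of that choice (again with invented names, not any real system's code), a node that cannot reach its peers either accepts the write and risks divergence, or refuses it and degrades availability:

class Unreachable(Exception):
    pass

def handle_write(local_copy, key, value, peers_reachable, prefer):
    # Normal operation: apply and replicate to peers as usual (replication elided).
    if peers_reachable:
        local_copy[key] = value
        return "applied"
    # Partition: we cannot tell whether the other nodes are dead or serving
    # their own clients, so we must pick which property to give up.
    if prefer == "availability":
        local_copy[key] = value          # keep serving; replicas may diverge
        return "applied locally, consistency at risk"
    raise Unreachable("refusing write to preserve one-copy consistency")

store = {}
print(handle_write(store, "x", 1, peers_reachable=False, prefer="availability"))
# prefer="consistency" would instead raise Unreachable and the client loses service.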
Back to MySQL Cluster - which does it provide?
A single MySQL Cluster prioritises data consistency over
availability when network partitions occur.
A pair of asynchronously replicating MySQL Clusters prioritise
service availability over data consistency when network
partitions occur.
So you can have it both ways with MySQL Cluster - Great!
Single MySQL Cluster - CP
Within a single MySQL Cluster, data is synchronously replicated
between database nodes using two-phase commit. Nodes are
monitored using heartbeats, and failed or silent nodes are
promptly isolated by live and responsive nodes. Where a network
partition occurs, live nodes in each partition regroup and decide
what to do next :
- If there are not enough live nodes to serve all of the data stored : shutdown. Serving a subset of user data (and risking data consistency) is not an option.
- If there are not enough failed or unreachable nodes to serve all of the data stored : continue and provide service. No other subset of nodes can be isolated from us and serving clients.
- If there are enough failed or unreachable nodes to serve all of the data stored : arbitrate. There could be another subset of nodes regrouped into a viable cluster out there.
Arbitration occurs to avoid the split brain scenario where a cluster could
theoretically split in two (or more), with each half (or third,
or quarter) accepting writes and diverging from the others. In
other words, arbitration occurs to preserve consistency.
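The three-way decision above can be sketched roughly in Python (illustrative only; this is not the actual data node logic, and the parameter names are mine):

def regroup_decision(can_serve_all_data, missing_nodes_could_serve_all_data):
    # Not enough live nodes to cover all data: serving a subset is not an option.
    if not can_serve_all_data:
        return "shutdown"
    # The missing nodes cannot form a viable cluster, so no split brain is possible.
    if not missing_nodes_could_serve_all_data:
        return "continue"
    # Both sides could be viable: ask the previously agreed arbitrator.
    return "arbitrate"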
Arbitration involves :
- Database nodes agree on an arbitrator in advance
- During node or network failure handling, no data writes are
committed.
- When arbitration is required due to node failures or network issues, viable node subsets (potential clusters) request permission from the previously agreed arbitrator to provide service.
- Each request to the arbitrator will result in either : Yes, No or timeout
- Anything other than Yes results in node shutdown.
- The arbitrator only says Yes once per election round (first
come, first served). Therefore the arbitrator only says Yes to one
potential cluster in a partitioned network. A rough sketch of this follows below.
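Here is that first-come-first-served grant sketched in Python (again purely illustrative; the class and method names are invented, not MySQL Cluster's):

class Arbitrator:
    def __init__(self):
        self.granted_round = None

    def request_permission(self, subset_id, election_round):
        # Only the first viable subset to ask in a given round gets a Yes;
        # anything other than Yes means the requesting nodes shut down.
        if self.granted_round == election_round:
            return "No"
        self.granted_round = election_round
        return "Yes"

arb = Arbitrator()
print(arb.request_permission("subset-A", election_round=1))   # Yes -> continues
print(arb.request_permission("subset-B", election_round=1))   # No  -> shuts down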
Note that arbitration is not the same as achieving a quorum. A
cluster with three replicas and an arbitrator node can survive
the loss of two data nodes as long as the arbitrator remains
reachable from the last surviving data node. The arbitrator role is
lightweight as it is not involved in normal traffic. I am
surprised that the lightweight arbitrator pattern is not more
common.
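Continuing the illustrative sketches above (reusing the invented regroup_decision and Arbitrator), the last surviving data node in a three-replica cluster can still serve all of the data, so despite being a minority it keeps serving if the arbitrator grants it the round:

print(regroup_decision(can_serve_all_data=True,
                       missing_nodes_could_serve_all_data=True))   # 'arbitrate'
print(arb.request_permission("last-survivor", election_round=2))   # 'Yes' -> keep serving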
How does a single MySQL Cluster degrade service availability as a
result of network partitions?
Where some subset of data nodes is isolated and shut down :
- Those nodes are 100% out of service until they restart and can rejoin the cluster. They will attempt to do so automatically.
- Any clients connected only to those nodes are out of service. By default clients attempt to connect to all data nodes, so partial connectivity issues needn't degrade client availability.
- The remaining live nodes are 100% in service.
- Clients connected to the remaining live nodes are 100% in service.
Where no subset of data nodes is live :
- All clients experience 100% service loss until the data nodes restart and can rejoin the cluster. They will attempt to do so automatically.
A single MySQL Cluster does not degrade to partial data access
or read-only modes as a result of network partitions. It does not
sacrifice consistency.
How can MySQL Cluster be described as highly available if it
sacrifices availability for consistency in the event of a network
partition?
Availability is not binary - many types of network partition
erode availability for some clients, but do not extinguish it.
Some set of clients continue to receive 100% service. Only
double failures in the network can cause a
network partition resulting in full service loss.
Furthermore, network partitions are not the only risk to
availability; software errors, power failures, upgrades and
overloads are other potential sources of downtime which Cluster
is designed to overcome.
Asynchronously replicating clusters - AP
Where two Clusters are asynchronously replicating via normal
MySQL Replication, in a circular configuration, reads and writes
can be performed locally at both clusters. Data consistency
within each cluster is guaranteed as normal, but data consistency
across the two clusters is not. On the other hand, availability
is not compromised by network partitioning of the two clusters.
Each cluster can continue to accept read and write requests to
all of the data from any connected client.
Eventual consistency between the clusters is possible when using
conflict resolution functions such as NDB$EPOCH_TRANS, NDB$EPOCH, NDB$MAX etc.
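Conceptually, NDB$MAX lets the update with the greater value of a designated timestamp column win when both clusters have modified the same row, while the NDB$EPOCH* functions use replication epochs to let a primary cluster win. Here is a rough Python sketch of the NDB$MAX idea only (the row layout, column name and function are invented for illustration; this is not how the server implements it):

def resolve_max_style(local_row, replicated_row, ts_col):
    # Keep the incoming replicated update only if its timestamp column beats
    # the local row's; otherwise treat it as a rejected conflict (the real
    # functions can also log rejected changes to an exceptions table).
    if replicated_row[ts_col] > local_row[ts_col]:
        return replicated_row
    return local_row

local  = {"id": 1, "balance": 90,  "ts": 7}   # written locally during the partition
remote = {"id": 1, "balance": 120, "ts": 9}   # arrives later via replication
print(resolve_max_style(local, remote, "ts")) # remote wins: larger timestamp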
How does consistency degrade between replicating MySQL Clusters
during a network partition?
This depends on the conflict resolution function chosen, and how
detected conflicts are handled. Some details of consistency
guarantees provided by NDB$EPOCH et al are described here.
What about normal operation?
Abadi's post introduced his PACELC acronym, standing
for something like :
if (network Partition)
{
  trade-off Availability vs Consistency;
}
else
{
  trade-off Latency vs Consistency;
}
My first comment has to be that it's bad form to put the common
case in an else branch!
However, it is certainly true that the properties during normal
operation are usually more important than what happens during a
network partition. The ELC section is stating that while all
database nodes are present, a network replicated database can
choose between minimising request Latency, or maintaining
Consistency. In theory this normal operation
latency-vs-consistency tradeoff could be completely independent
to the Network Partitioning availability-vs-consistency tradeoff,
e.g. you could have any of :
1. PA EL (Partition - Availability, Else - Latency minimisation)
2. PA EC (Partition - Availability, Else - Consistency)
3. PC EL (Partition - Consistency, Else - Latency minimisation)
4. PC EC (Partition - Consistency, Else - Consistency)
The common cases are 1 + 4, where we choose either consistency at
all times, or Maximum Availability and Minimum Latency. Case 2 is
a system which aims for consistency, but when a network partition
occurs, aims for Availability. Case 3 is a system which aims for
minimal request Latency, and when a partition occurs aims for
consistency.
Examples of systems of each type :
1. Any eventually consistent system, especially with local-database-node updates + reads
2. Best-effort consistent systems that degrade in failure modes (e.g. MySQL semi-synchronous replication)
3. ???
4. Always consistent systems (e.g. single database instance, single MySQL Cluster)
I am not aware of systems meeting case 3 where normally they
minimise latency over consistency, but start choosing consistency
after a network partition. Maybe this category should be called
'repentant systems'?
The problem for systems in Cases 1 or 2 - anywhere where Latency
minimisation or Availability is chosen over consistency - is the
need for user applications to deal with potential
inconsistencies. It is not enough to say that things will
'eventually' be consistent. It's important to describe how
inconsistent they can be, whether the temporary inconsistencies
are values which were once valid, how those values relate to
other, connected values etc.
There are certainly applications which can operate correctly with
practical eventually consistent databases, but it's not well
known how to design applications and schemas to cope with the
transient states of an eventually consistent database. The first
ORM framework to opaquely support an underlying eventually
consistent database may actually be worth the effort to use! A
reasonable approach is to design schemas with associated
read/modification 'protocols' as if they were abstract data types
(ADTs). These ADTs can then have strengths and weaknesses,
properties and limitations which make sense in some parts of an
application schema where the need to support eventual consistency
overcomes the inherent effort and limitations.
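As a hypothetical example of such an ADT (my illustration, not anything from MySQL Cluster), a grow-only counter stores per-node increments rather than a single read-modify-written total, so concurrent updates on temporarily divergent replicas commute and merge cleanly; the limitation is that the derived total can briefly lag:

class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}            # per-node increment totals

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Merging is idempotent and order-independent: take the max per node.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3); b.increment(2)      # concurrent updates on divergent replicas
a.merge(b); b.merge(a)              # replication applied in either order
assert a.value() == b.value() == 5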
Stonebraker and others have commented on network partitions being a minor
concern for a well designed datacentre-local network, where
redundancy can be reliably implemented. Also the latency cost of
maintaining consistency is lower as physical distances are
smaller and hop counts are lower. This results in 'CP' systems
being attractive at the data centre scale as the need to
sacrifice availability due to network partition is rarely
dominant, and the latency implications during normal operation
are bearable. Perhaps this highlights the need in these
theoretical discussions to illustrate theoretically problematic
latencies and availabilities with real numbers.
At a wider network scale, latencies are naturally higher and
effective bandwidth is often lower. The probability of network
partitions of some sort may also increase, due to the larger
number of components (and organisations) involved. These factors
combine to make 'AP' systems more palatable. The everyday latency
cost of consistency is higher, and losing availability due to
potentially more frequent network partitions may not be
acceptable. Again, real numbers are required to illuminate
whether the achievable latencies and probable availability
impacts are serious enough to warrant changing applications to
deal with eventually consistent data. For a particular
application there may or may not be a point at which an AP system
would meet its requirements better.
Consistent systems can be scaled across many nodes and high
latency links, but the observed operation latency, and the
necessary impacts to availability implied by link failure set a
natural ceiling on the desirable scale of a consistent system.
Paraphrasing John Mashey, "Bandwidth improves, latency is
forever". Applications that find the latency and availability
constraints of a single consistent system unacceptable, must
subdivide their datasets into smaller independent consistency
zones and manage potential consistency shear between them.
Finally (another excessively long post), I think the technical
and actual merits of widely distributed 'CP' systems are not well
known as they have not been commonly available. Many different
database systems support some form of asynchronous replication,
but few offer synchronous replication, and fewer still support it
over wide areas with higher latency and links of fluctuating
quality. As this changes, the true potential and weaknesses of
these technologies, backed by real numbers, will start to
appear.
Edit 7/3/12 : Fix bad link