I got some very good comments/questions on my previous post on MariaDB global transaction ID, from Giuseppe and Robert (of Tungsten fame). I thought a follow-up post would be appropriate to answer and further elaborate on the comments, as the points they raise are very important and interesting.
(It also gives me the opportunity to explain more deeply a lot of interesting design decisions that I left out in the first post for the sake of brevity and clarity.)
On crash-safe slave
One of the things I really wanted to improve with global
transaction ID is to make the replication slaves more crash safe
with respect to their current replication state. This state is
mostly persistently stored information about which event(s) were
last executed on the slave, so that after a server restart the
slave will know from which point in the master binlog(s) to
resume replication. In current (5.5 and earlier) replication,
this state is stored simply by continuously writing a file
relay-log.info after each event executed. If the
server crashes, this is very susceptible to corruption where the
contents of the file no longer matches the actual state of tables
in the database.
With MariaDB global transaction ID, the replication state is stored in the following table instead of in a plain file:
    CREATE TABLE rpl_slave_state (
        domain_id INT UNSIGNED NOT NULL,
        sub_id BIGINT UNSIGNED NOT NULL,
        server_id INT UNSIGNED NOT NULL,
        seq_no BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (domain_id, sub_id));
When a transaction is executed on the slave, this table is updated as part of the transaction. So if the table is created with InnoDB (or other transactional engine) and the replicated events also use transactional tables, then the replication state is crash safe. DDL, or non-transactional engines such as MyISAM, remain crash-unsafe of course.
A global transaction ID in MariaDB consists of the
domain_id described in the previous post, an
increasing sequence number, and the usual server_id (in text
form written as domain_id-server_id-seq_no, for example
1-5-300010).
Recall that the replication state with global transaction ID
consists of the last global transaction ID applied within each
independent replication stream, ie. a mapping from
domain_id to global transaction ID. This is what the
rpl_slave_state table stores.
But what about the
sub_id column in the above table? This
is to prevent concurrency issues when parallel replication is
used. If we want to be able to execute in parallel two
transactions with the same
domain_id, then these two
transactions will both need to update the table
rpl_slave_state. If the transactions were to update
the same row in the table, then one transaction would have to
wait on a row lock for the other to commit. This would prevent
any kind of group commit, which would be a very serious
performance bottleneck.
So instead, each transaction inserts a new, separate row into the
table to record the new global transaction ID applied. There may
thus be multiple entries in the table for a given
domain_id; the sub_id is used to distinguish them. It is simply a
(local) integer that is increased for each transaction received
from the master binlog. Thus, at any given time, the last applied
global transaction ID for a given
domain_id is the one
with the highest
sub_id in the table.
In effect, the replication state is obtained with a query like this:
    SELECT domain_id, server_id, seq_no
      FROM rpl_slave_state
     WHERE (domain_id, sub_id) IN
           (SELECT domain_id, MAX(sub_id)
              FROM rpl_slave_state
             GROUP BY domain_id);
Old rows are deleted when no longer needed.
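As a sketch (this is not the server's actual internal code, only an illustration of the idea), the pruning could be expressed as a multi-table DELETE that keeps only the newest row per domain:

```sql
-- Illustrative only: remove every row that is not the newest
-- (highest sub_id) entry for its domain_id.
DELETE s
  FROM mysql.rpl_slave_state s
  JOIN (SELECT domain_id, MAX(sub_id) AS max_sub_id
          FROM mysql.rpl_slave_state
         GROUP BY domain_id) m
    ON s.domain_id = m.domain_id
 WHERE s.sub_id < m.max_sub_id;
```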
Thus, two replicated transactions can be executed like this:
First transaction:
    BEGIN;
    UPDATE some_table SET value = value + 1 WHERE id = 10;
    INSERT INTO mysql.rpl_slave_state
       SET domain_id = 1, sub_id = 100, server_id = 5, seq_no = 300010;
    COMMIT;
Second transaction:
    BEGIN;
    UPDATE other_table SET name = "foo" WHERE category LIKE "food_%";
    INSERT INTO mysql.rpl_slave_state
       SET domain_id = 1, sub_id = 101, server_id = 5, seq_no = 300011;
    COMMIT;
These two transactions can run completely independently,
including the inserts into
rpl_slave_state. And the commits at
the end can be done together as a single group commit, where we
ensure that the second one is recorded as happening after the
first one so commit order (and binlog order) is preserved and
visibility is correct (second transaction not visible to any
query without the first also being visible).
Contrast this with how things would be with a rpl_slave_state
table holding a single row per domain_id:
First transaction:
    BEGIN;
    UPDATE some_table SET value = value + 1 WHERE id = 10;
    UPDATE bad_rpl_slave_state SET server_id = 5, seq_no = 300010
     WHERE domain_id = 1;
    COMMIT;
Second transaction:
    BEGIN;
    UPDATE other_table SET name = "foo" WHERE category LIKE "food_%";
    UPDATE bad_rpl_slave_state SET server_id = 5, seq_no = 300011
     WHERE domain_id = 1;
    COMMIT;
Here the update of the replication state table in the second transaction would have to wait for the first transaction to commit, because of row locks. Group commit becomes impossible.
(I actually explained this issue to the replication developers at MySQL/Oracle a long time ago, but last time I looked at MySQL 5.6, they had ignored it...)
On where to store the replication state
As Giuseppe pointed out, in the global transaction ID design it is still written
that the replication state will be stored in the slave binlog,
not in the
rpl_slave_state table. Sorry about this,
I will get the document updated as soon as possible.
I had basically two ideas for how to store the slave state in a crash-safe way:
- In the slave's binlog.
- In a (transactional) table.
The big advantage of (2) is that it works also when the binlog is not enabled on the slave. Since there can still be substantial overhead to enabling the binlog, I currently plan to go with this approach.
The advantage of (1) is that it is potentially cheaper when the
binlog is enabled on the slave, as it commonly will be when
global transaction ID is enabled (to be able to promote a slave
as a new master, the binlog must be enabled, after all). We
already write every single global transaction ID applied into the
binlog, and if we crash, we already scan the binlog during crash
recovery. Thus, it is easy during crash recovery to rebuild the
replication state from the binlog contents. This way we get crash
safe slave state without the overhead of maintaining an extra
table.
It will be possible in the future to refine this, so that we could use method (1) if binlog is enabled, else method (2). This might improve performance slightly when binlog is enabled. But we should first benchmark to check if such refinement will be worth it in terms of performance gained. It seems likely that any gains will be modest, at best.
On parallel replication
Parallel replication is something that has been long overdue, but is now a reality. MariaDB 10.0 will have multi-source replication, which is actually a form of parallel replication. MySQL 5.6 will have multi-threaded slave. Galera can do parallel replication, as can Tungsten I believe, though I am not familiar with details. There are several other mechanisms for parallel replication planned for later MariaDB releases, like MWL#184 and MDEV-520.
It is thus very important to design for parallel replication in global transaction ID from the start. I fully agree with Giuseppe's remarks here about MySQL 5.6 replication features failing completely to do this. They introduce in 5.6 three new features that require extensions to how the replication state is stored: crash-safe slave, global transaction ID, and multi-threaded slave. They have managed to do this by introducing three completely different solutions. This is just insane. It makes one wonder if Oracle management forbids Oracle developers to talk to each other, just like we already know they prevent discussions with the community ...
So, in relation to global transaction ID there are basically two kinds of parallel replication techniques: in-order and out-of-order. The two interact with global transaction ID in different ways.
On in-order parallel replication
In-order is when two (or more) different transactions are executed in parallel on the slave, but the commit of the second transaction is delayed to after the first transaction has committed. Galera is an example of this, I think. Planned tasks MWL#184 and possibly MDEV-520 are also in-order techniques.
In-order parallel replication is transparent to applications and users (at least with MVCC transactional engines like InnoDB), since changes only become visible on COMMIT, and commits are done in a serial way. It is thus also mostly transparent to global transaction ID, and does not need much special consideration for the design.
One thing that can be done, and that I am currently working on, is to integrate in-order parallel replication with group commit. Suppose we run transactions T1 and T2 in parallel on the slave, and suppose that T2 happens to complete first so that we have to wait in T2's commit for T1 to commit first. If we integrate this wait with group commit, we can actually commit T1 and T2 at the same time, taking care to write the commit records to the binlog and to the storage engine in the right order (T1 before T2). This way, the wait is likely to improve performance rather than reduce it, in fact.
On out-of-order parallel replication
Out-of-order parallel replication is when transactions can be committed in a different order on the slave than on the master. The MySQL 5.6 multi-threaded slave feature is an example of this.
Out-of-order must be explicitly enabled by the application/DBA, because it breaks fundamental semantics. If commits happen in different order, the slave database may be temporarily in a state that never existed on the master and may be invalid for the application. But if the application is written to tolerate such inconsistencies, and explicitly declares this to the database, then there may be potential for more parallelism than with in-order methods. This can make out-of-order interesting.
The typical example, which MySQL 5.6 multi-threaded slave uses, is when the application declares that transactions against different schemas are guaranteed independent. Different schemas can then be replicated independently (though if the application messes up and transactions happen to not really be independent, things can break). MariaDB 10.0 multi-source replication is another example, where the application declares a guarantee that two master servers can be replicated independently.
Out-of-order creates a challenge for global transaction ID when switching to a new master. Because events in the binlog on the new master are in a different order, there will not in general be a single place from which to start replication without either losing or duplicating some event.
MariaDB global transaction ID handles this by only allowing
out-of-order parallel replication between different replication
domains, never within a single domain. In effect, the
DBA/application explicitly declares the possible independent
replication streams, and then it is sufficient to remember one
global transaction ID per
domain_id as the position
reached within each independent stream.
Thus, suppose we have a master where updates to schemas are
independent, and we want to replicate them in parallel on slaves.
On the master, we configure 10 (say) domain IDs 20-29. When we
log a global transaction ID to the binlog, we set the
domain_id value to a hash of the used schema.
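As an illustration of the idea (the variable name and mechanism are assumptions for this sketch, not final syntax), a connection could map its schema into the configured range 20-29 with a simple hash:

```sql
-- Hypothetical sketch: map the schema name into the 10 configured
-- domains 20-29 (gtid_domain_id is an assumed variable name).
SET SESSION gtid_domain_id = 20 + (CRC32('shop_db') MOD 10);
```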
On the slave, we then configure 10 SQL threads. Two received
transactions with different
domain_id can be
executed in parallel. Two transactions using same schema will map
to the same
domain_id and will thus not be able to
execute in parallel. Thus we get MySQL 5.6 style multi-threaded
slave almost for free, using the exact same mechanism as for
executing multi-source replication in parallel. The replication
state on the slave will in this case consist of the 10 different
global transaction IDs reached within each of the 10 replication
domains. And they can be stored in the table
rpl_slave_state just as described above. Thus
replication state for out-of-order parallel replication is fully
integrated with the rest of the design, needing no special
consideration.
And we can do more! The application (with suitable privileges) is
allowed to change
domain_id per-query. For example,
we can run all normal queries within one replication domain, and
then if we have a long-running maintenance query like an ALTER
TABLE or something that updates every row in a large table, we
can execute it with
domain_id=2, if we take care
that no other queries conflict with it. This way, the
long-running query can run in parallel "in the background",
without causing any replication delay for normal queries.
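A sketch of what this could look like (the variable name is an assumption here, not final syntax):

```sql
-- Hypothetical sketch: run a long-running statement in its own
-- replication domain so it does not delay normal traffic.
SET SESSION gtid_domain_id = 2;
ALTER TABLE big_table ADD INDEX name_idx (name);
SET SESSION gtid_domain_id = 1;  -- back to the normal domain
```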
In effect, the application or DBA now has great flexibility in declaring which queries can replicate independently of (and thus in parallel with) each other, and all this falls out almost for free from the overall design. I foresee that this will be a very powerful feature to have for large, busy replication setups.
Note btw. that most out-of-order parallel replication techniques can also be done as in-order simply by delaying the final COMMIT steps of transactions to happen in-order. This way one could for example do per-schema parallel replication without polluting the replication state with many global transaction IDs. This should generally achieve similar improvement in overall throughput, though latency of individual transactions can be longer.
On "holes" in the global transaction ID sequences
Global transaction IDs have a sequence-number component, which ensures uniqueness by being always increasing. This raises the issue of whether an event will always have a sequence number exactly one bigger than the previous event, or if it is allowed to have "holes", where some sequence number is never allocated to an event.
For MariaDB global transaction ID, I took the approach that holes are allowed. There are a number of good reasons for this.
Mostly, I believe that a design that relies on "no holes" is a fundamental mistake. In MySQL 5.6 global transaction ID, holes are absolutely not allowed. If a hole ever turns up, you will be stuck with it literally forever. The MySQL 5.6 replication state lists every sequence number not yet applied on a slave server, so if one becomes missing it will forever remain. Unless you remove it manually, and as far as I have been able to determine, there are currently no facilities for this. Anyone who knows how fragile MySQL replication can be should realise that this is a recipe for disaster.
Another point: because of the strict "no holes" requirement, in
MySQL 5.6, when events are filtered with
--replicate-ignore-db or whatever, they had to
change the code so that a dummy event is used to replace the
filtered event. In effect, you cannot really filter any events
any more! I think that alone should be enough to realise that the
design is wrong.
A more subtle point is that a strict "no holes" requirement makes it much harder to handle allocation of new numbers correctly and scalably. Basically, to allocate the next number in a multi-threaded environment, a lock needs to be taken. We need to hold this lock for as short a time as possible to preserve scalability. But then, what happens if we allocate some sequence number N to transaction T1, and then later we get some failure that prevents T1 from successfully committing and being written into the binlog? We now cannot simply roll back T1, because some other transaction T2 may have already allocated the next number, and then we would leave a hole. Subtle issues like this are important for achieving good scalability.
So I think it is wrong to base the design on never having holes. On the other hand, there is no reason to deliberately introduce holes just for the fun of it. Sequence numbers in MariaDB global transaction ID will generally be without holes, it is just that nothing will break if somehow a hole should sneak in.
Also, whenever a global transaction ID is received on a slave
server, the server's own internal counter for the next sequence
number to allocate will be set to one more than the received
sequence number, if it is currently smaller. This gives the very
nice property in a standard setup, where only one master is ever
written at any one time: The sequence number in itself will be
globally unique and always increasing. This means that one can
look at any two global transaction IDs and immediately know which
one comes first in the history, which can be very useful. It also
makes it possible to give a warning if multiple masters are being
written without being configured with distinct replication domain IDs.
This is detected when a server replicates a global transaction ID
with the same domain_id as its own but a smaller sequence number.
In multi-master-like setups, sequence number by itself can no
longer be globally unique. But even here, if the system is
correctly configured so that each actively written master has its
own domain_id, the sequence number will be unique per
domain_id and allow ordering of global transaction IDs
within one domain (and of course, between different domains there
is no well-defined ordering anyway).
On CHANGE MASTER TO syntax
In the previous post, I did not really go into what new syntax will be introduced for MariaDB global transaction ID, both for the sake of brevity, and also because it has not really been fully decided yet.
However, there will certainly be the possibility to change directly the replication slave state, ie. the global transaction ID to start replicating from within each replication domain. For example something like this:
CHANGE MASTER TO master_host = "master.lan", master_gtid_pos = "1-1-100,2-5-300";
This is a fine and supported way to start replication. I just
want to mention that it will also be supported to start
replication in the old way, with master_log_file and
master_log_pos, and the master will automatically
convert this into the corresponding global transaction ID
position and set this for the slave. This can be convenient, as
many tools like mysqldump or XtraBackup provide easy access to
the old-style binlog position. It is certainly an improvement
over MySQL 5.6 global transaction ID, where the only documented
way to set up a slave involves RESET MASTER (!) on the master
server.
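To make this concrete, starting a slave from an old-style binlog position obtained from mysqldump or XtraBackup could look like this (file name and offset invented for illustration):

```sql
CHANGE MASTER TO
  master_host = "master.lan",
  master_log_file = "master-bin.000042",
  master_log_pos = 1234;
```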
Incidentally, note that
master_gtid_pos has just one
global transaction ID per
domain_id, not one per
server_id. Thus, if not using any form of
multi-master, there will be just one global transaction ID to
keep track of.
So if we start with server 1 as the master, and then some time
later switch over to server 2 for a master, the binlog will have
global transaction IDs both with server_id=1 and with
server_id=2. But the slave replication state will be just
a single global transaction ID, with server_id=2 in
this case. Since binlog order is always the same within one
replication domain, a single global transaction ID is sufficient
to know the correct place to continue replication.
I think this is a very nice property, that the size of the
replication state is fixed: one global transaction ID per
configured replication domain. In contrast, for MySQL 5.6 global
transaction ID, any
server_id that ever worked as
master will remain in the replication state forever. If you ever
had a server id 666 you will still be stuck with it 10 years
later when you specify the replication state in CHANGE MASTER
(assuming they will at some point even allow specifying the
replication state in CHANGE MASTER).
Once the global transaction replication state is set, changing to a new master could happen with something like this:
CHANGE MASTER TO master_host = "master.lan", master_gtid_pos = AUTO;
This is the whole point of global transaction ID, of course, to be able to do this and automatically get replication started from the right point(s) in the binlog on the new master.
One idea that I have not yet decided about is to allow just this simple syntax:
CHANGE MASTER TO master_host = "master.lan";
so that if no starting position is specified, and a global
transaction state already exists, we default to
master_gtid_pos = AUTO. This would be a change in
current behaviour, so maybe it is not a good idea. On the other
hand, the current behaviour is to start replication from whatever
happens to be the first not yet purged binlog file on the master,
which is almost guaranteed to be wrong. So it is tempting to
change this simple syntax to just do the right thing.
On extensions to mysqlbinlog
Robert mentions some possible extensions to the
mysqlbinlog program, and I agree with those.
The master binlog is carefully designed so that it is easily
possible to locate any given global transaction ID and the
corresponding binlog position (or determine that such global
transaction ID is not present in any binlog files). In the
initial design this requires scanning one (but just one) binlog
file from the beginning; later we could add an index facility if
this becomes a bottleneck. The mysqlbinlog program
should also support this, probably by allowing a
global transaction ID (or multiple IDs) to be specified as the
start position.
Robert also mentions the usefulness of an option to filter out events from within just one replication domain/stream. This is something I had not thought of, but it would clearly be useful and is simple to implement.
On session variable server_id
With MariaDB global transaction ID,
server_id becomes a session variable, as do newly
introduced variables for the other components of the global
transaction ID. This
allows an external replication mechanism to apply events from
another server outside the normal replication mechanism, and
still preserve the global transaction ID when the resulting
queries are logged in the local binlog. One important use case
for this is of course point-in-time recovery, where binlog files
are applied with mysqlbinlog | mysql. Here, mysqlbinlog can set these variables to
preserve the global transaction IDs on the events applied, so
that fail-over and so on will still work correctly.
Since messing with
server_id and so on can
easily break replication, setting these variables requires
elevated (SUPER) privileges.