More on global transaction ID in MariaDB

I got some very good comments/questions on my previous post on MariaDB global transaction ID, from Giuseppe and Robert (of Tungsten fame). I thought a follow-up post would be appropriate to answer and further elaborate on the comments, as the points they raise are very important and interesting.

(It also gives me the opportunity to explain more deeply a lot of interesting design decisions that I left out in the first post for the sake of brevity and clarity.)

On crash-safe slave

One of the things I really wanted to improve with global transaction ID is to make the replication slaves more crash safe with respect to their current replication state. This state is mostly persistently stored information about which event(s) were last executed on the slave, so that after a server restart the slave knows from which point in the master binlog(s) to resume replication. In current (5.5 and earlier) replication, this state is stored simply by continuously rewriting the file relay-log.info after each executed event. If the server crashes, this is very susceptible to corruption, where the contents of the file no longer match the actual state of the tables in the database.

With MariaDB global transaction ID, the replication state is stored in the following table instead of in a plain file:

    CREATE TABLE rpl_slave_state (
        domain_id INT UNSIGNED NOT NULL,
        sub_id BIGINT UNSIGNED NOT NULL,
        server_id INT UNSIGNED NOT NULL,
        seq_no BIGINT UNSIGNED NOT NULL,
        PRIMARY KEY (domain_id, sub_id));

When a transaction is executed on the slave, this table is updated as part of the transaction. So if the table is created with InnoDB (or other transactional engine) and the replicated events also use transactional tables, then the replication state is crash safe. DDL, or non-transactional engines such as MyISAM, remain crash-unsafe of course.

A global transaction ID in MariaDB consists of the domain_id described in the previous post, an increasing sequence number, and the usual server_id. Recall that the replication state with global transaction ID consists of the last global transaction ID applied within each independent replication stream, i.e. a mapping from domain_id to global transaction ID. This is what the table rpl_slave_state provides.

But what about the sub_id in the above table? This is to prevent concurrency issues when parallel replication is used. If we want to be able to execute two transactions with the same domain_id in parallel, then both of these transactions need to update the table rpl_slave_state. If they updated the same row in the table, then one transaction would have to wait on a row lock for the other to commit. This would prevent any kind of group commit, which would be a very serious performance bottleneck.

So instead, each transaction inserts a new, separate row into the table to record the new global transaction ID applied. There may thus be multiple entries for a given domain_id. The sub_id is used to distinguish them; it is simply a (local) integer that is increased for each transaction received from the master binlog. Thus, at any given time, the last applied global transaction ID for a given domain_id is the one with the highest sub_id in the table.

In effect, the replication state is obtained with a query like this:

    SELECT domain_id, server_id, seq_no
    FROM rpl_slave_state
    WHERE (domain_id, sub_id) IN
      (SELECT domain_id, MAX(sub_id) FROM rpl_slave_state GROUP BY domain_id)

Old rows are deleted when no longer needed.
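
Conceptually, the pruning for a given replication domain amounts to something like this (an illustrative sketch of my own, using domain 1 as an example; the server does this internally as part of normal operation):

    -- Keep only the newest state row for domain 1, drop the older ones.
    SET @cur_sub_id = (SELECT MAX(sub_id) FROM rpl_slave_state WHERE domain_id = 1);
    DELETE FROM rpl_slave_state
     WHERE domain_id = 1 AND sub_id < @cur_sub_id;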

Thus, two replicated transactions can be executed like this:

    BEGIN;                                 BEGIN;

    UPDATE some_table                      UPDATE other_table
       SET value = value + 1                  SET name = "foo"
     WHERE id = 10;                         WHERE category LIKE "food_%";

    INSERT INTO mysql.rpl_slave_state      INSERT INTO mysql.rpl_slave_state
       SET domain_id = 1,                     SET domain_id = 1,
           sub_id = 100,                          sub_id = 101,
           server_id = 5,                         server_id = 5,
           seq_no = 300010;                       seq_no = 300011;

    COMMIT;                                COMMIT;

These two transactions can run completely independently, including the inserts into rpl_slave_state. And the commits at the end can be done together as a single group commit, where we ensure that the second one is recorded as happening after the first one, so that commit order (and binlog order) is preserved and visibility is correct (the second transaction is never visible to any query unless the first is also visible).

Contrast this with how things would be with an rpl_slave_state table that has a single row per domain_id:

    BEGIN;                                 BEGIN;

    UPDATE some_table                      UPDATE other_table
       SET value = value + 1                  SET name = "foo"
     WHERE id = 10;                         WHERE category LIKE "food_%";

    UPDATE bad_rpl_slave_state
       SET server_id = 5,
           seq_no = 300010
     WHERE domain_id = 1;

    COMMIT;

                                           UPDATE bad_rpl_slave_state
                                              SET server_id = 5,
                                                  seq_no = 300011
                                            WHERE domain_id = 1;

                                           COMMIT;

Here the update of the replication state table in the second transaction would have to wait for the first transaction to commit, because of row locks. Group commit becomes impossible.

(I actually explained this issue to the replication developers at MySQL/Oracle a long time ago, but last time I looked at MySQL 5.6, they had ignored it...)

On where to store the replication state

As Giuseppe pointed out, the global transaction ID design still says that the replication state will be stored in the slave binlog, not in the rpl_slave_state table. Sorry about this; I will get the document updated as soon as possible.

I had basically two ideas for how to store the slave state in a crash-safe way:

  1. In the slave's binlog.
  2. In a (transactional) table.

The big advantage of (2) is that it also works when the binlog is not enabled on the slave. Since there can still be substantial overhead to enabling the binlog, I currently plan to go with this approach.

The advantage of (1) is that it is potentially cheaper when the binlog is enabled on the slave, as it commonly will be when global transaction ID is enabled (to be able to promote a slave as the new master, the binlog must be enabled, after all). We already write every single global transaction ID applied into the binlog, and if we crash, we already scan the binlog during crash recovery. Thus, it is easy during crash recovery to rebuild the replication state from the binlog contents. This way we get a crash-safe slave state without the overhead of maintaining an extra rpl_slave_state table.

It will be possible in the future to refine this, so that we could use method (1) if binlog is enabled, else method (2). This might improve performance slightly when binlog is enabled. But we should first benchmark to check if such refinement will be worth it in terms of performance gained. It seems likely that any gains will be modest, at best.

On parallel replication

Parallel replication is something that has been long overdue, but is now a reality. MariaDB 10.0 will have multi-source replication, which is actually a form of parallel replication. MySQL 5.6 will have multi-threaded slave. Galera can do parallel replication, as can Tungsten I believe, though I am not familiar with details. There are several other mechanisms for parallel replication planned for later MariaDB releases, like MWL#184 and MDEV-520.

It is thus very important to design global transaction ID with parallel replication in mind from the start. I fully agree with Giuseppe's remarks here about the MySQL 5.6 replication features failing completely to do this. In 5.6 they introduce three new features that require extensions to how the replication state is stored: crash-safe slave, global transaction ID, and multi-threaded slave. They have managed to do this by introducing three completely different solutions. This is just insane. It makes one wonder if Oracle management forbids Oracle developers to talk to each other, just like we already know they prevent discussions with the community ...

So, in relation to global transaction ID there are basically two kinds of parallel replication techniques: in-order and out-of-order. The two interact with global transaction ID in different ways.

On in-order parallel replication

In-order is when two (or more) different transactions are executed in parallel on the slave, but the commit of the second transaction is delayed until after the first transaction has committed. Galera is an example of this, I think. The planned tasks MWL#184 and possibly MDEV-520 are also in-order techniques.

In-order parallel replication is transparent to applications and users (at least with MVCC transactional engines like InnoDB), since changes only become visible on COMMIT, and commits are done in a serial way. It is thus also mostly transparent to global transaction ID, and does not need much special consideration for the design.

One thing that can be done, and that I am currently working on, is to integrate in-order parallel replication with group commit. Suppose we run transactions T1 and T2 in parallel on the slave, and suppose that T2 happens to complete first so that we have to wait in T2's commit for T1 to commit first. If we integrate this wait with group commit, we can actually commit T1 and T2 at the same time, taking care to write the commit records to the binlog and to the storage engine in the right order (T1 before T2). This way, the wait is likely to improve performance rather than reduce it, in fact.

On out-of-order parallel replication

Out-of-order parallel replication is when transactions can be committed in a different order on the slave than on the master. The MySQL 5.6 multi-threaded slave feature is an example of this.

Out-of-order must be explicitly enabled by the application/DBA, because it breaks fundamental semantics. If commits happen in different order, the slave database may be temporarily in a state that never existed on the master and may be invalid for the application. But if the application is written to tolerate such inconsistencies, and explicitly declares this to the database, then there may be potential for more parallelism than with in-order methods. This can make out-of-order interesting.

The typical example, which MySQL 5.6 multi-threaded slave uses, is when the application declares that transactions against different schemas are guaranteed independent. Different schemas can then be replicated independently (though if the application messes up and transactions happen to not really be independent, things can break). MariaDB 10.0 multi-source replication is another example, where the application declares a guarantee that two master servers can be replicated independently.

Out-of-order creates a challenge for global transaction ID when switching to a new master. Because events in the binlog on the new master are in a different order, there will not in general be a single place from which to start replication without either losing or duplicating some event.

MariaDB global transaction ID handles this by only allowing out-of-order parallel replication between different replication domains, never within a single domain. In effect, the DBA/application explicitly declares the possible independent replication streams, and then it is sufficient to remember one global transaction ID per domain_id as the position reached within each independent stream.

Thus, suppose we have a master where updates to different schemas are independent, and we want to replicate them in parallel on slaves. On the master, we configure (say) 10 domain IDs, 20-29. When we log a global transaction ID to the binlog, we set the domain_id value to a hash of the schema being used.
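
As a rough sketch of the idea, the mapping from schema name to domain ID could be as simple as the expression below. The schema name and the use of CRC32() are just my own example; any stable hash into the configured range would do:

    -- Map a schema name onto one of the 10 configured domain IDs (20-29).
    SET SESSION gtid_domain_id = 20 + (CRC32('sales_db') % 10);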

On the slave, we then configure 10 SQL threads. Two received transactions with different domain_id can be executed in parallel. Two transactions using same schema will map to the same domain_id and will thus not be able to execute in parallel. Thus we get MySQL 5.6 style multi-threaded slave almost for free, using the exact same mechanism as for executing multi-source replication in parallel. The replication state on the slave will in this case consist of the 10 different global transaction IDs reached within each of the 10 replication domains. And they can be stored in the table rpl_slave_state just as described above. Thus replication state for out-of-order parallel replication is fully integrated with the rest of the design, needing no special mechanisms.

And we can do more! The application (with suitable privileges) is allowed to change domain_id per-query. For example, we can run all normal queries with domain_id=1. But then if we have a long-running maintenance query like an ALTER TABLE or something that updates every row in a large table, we can execute it with domain_id=2, if we take care that no other queries conflict with it. This way, the long-running query can run in parallel "in the background", without causing any replication delay for normal queries.
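
For example, a sketch of pushing a long-running statement into its own replication domain might look like this (table and index names are made up for illustration; as noted further down, changing these variables requires SUPER privileges):

    SET SESSION gtid_domain_id = 2;   -- long-running maintenance stream
    ALTER TABLE big_table ADD INDEX idx_created (created_at);
    SET SESSION gtid_domain_id = 1;   -- back to the normal stream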

In effect, the application or DBA now has great flexibility in declaring which queries can replicate independently of (and thus in parallel with) each other, and all this falls out almost for free from the overall design. I foresee that this will be a very powerful feature to have for large, busy replication setups.

Note, by the way, that most out-of-order parallel replication techniques can also be done in-order, simply by delaying the final COMMIT step of each transaction so that commits happen in order. This way one could for example do per-schema parallel replication without polluting the replication state with many global transaction IDs. This should generally achieve a similar improvement in overall throughput, though the latency of individual transactions can be longer.

On "holes" in the global transaction ID sequences

Global transaction IDs have a sequence-number component, which ensures uniqueness by being always increasing. This raises the issue of whether an event will always have a sequence number exactly one bigger than the previous event, or if it is allowed to have "holes", where some sequence number is never allocated to an event.

For MariaDB global transaction ID, I took the approach that holes are allowed. There are a number of good reasons for this.

Mostly, I believe that a design that relies on "no holes" is a fundamental mistake. In MySQL 5.6 global transaction ID, holes are absolutely not allowed. If a hole ever turns up, you will be stuck with it literally forever. The MySQL 5.6 replication state lists every sequence number not yet applied on a slave server, so if one goes missing it will remain missing forever, unless you remove it manually; and as far as I have been able to determine, there are currently no facilities for this. Anyone who knows how fragile MySQL replication can be should realise that this is a recipe for disaster.

Another point: because of the strict "no holes" requirement, in MySQL 5.6, when events are filtered with --replicate-ignore-db or whatever, they had to change the code so that a dummy event is used to replace the filtered event. In effect, you cannot really filter any events any more! I think that alone should be enough to realise that the design is wrong.

A more subtle point is that a strict "no holes" requirement makes it much harder to handle allocation of new numbers correctly and scalably. Basically, to allocate the next number in a multi-threaded environment, a lock needs to be taken. We need to hold this lock for as short a time as possible to preserve scalability. But then, what happens if we allocate some sequence number N to transaction T1, and later get some failure that prevents T1 from successfully committing and being written into the binlog? We now cannot simply roll back T1, because some other transaction T2 may already have allocated the next number, and then we would leave a hole. Subtle issues like this are important for achieving good scalability.

So I think it is wrong to base the design on never having holes. On the other hand, there is no reason to deliberately introduce holes just for the fun of it. Sequence numbers in MariaDB global transaction ID will generally be without holes; it is just that nothing will break if somehow a hole should sneak in.

Also, whenever a global transaction ID is received on a slave server, the server's own internal counter for the next sequence number to allocate is set to one more than the received sequence number, if it is currently smaller. This gives a very nice property in a standard setup where only one master is ever written at any one time: the sequence number in itself will be globally unique and always increasing. This means that one can look at any two global transaction IDs and immediately know which one comes first in the history, which can be very useful. It also makes it possible to give a warning if multiple masters are being written without being configured with distinct replication domain IDs. This is detected when a server replicates a global transaction ID with the same domain_id as its own but a smaller sequence number.

In multi-master-like setups, the sequence number by itself can no longer be globally unique. But even here, if the system is correctly configured so that each actively written master has its own domain_id, the sequence number will be unique per domain_id and allows ordering global transaction IDs within one domain (and of course, between different domains there is no well-defined ordering anyway).

On CHANGE MASTER TO syntax

In the previous post, I did not really go into what new syntax will be introduced for MariaDB global transaction ID, both for the sake of brevity, and also because it has not really been fully decided yet.

However, there will certainly be the possibility to set the replication slave state directly, i.e. the global transaction ID to start replicating from within each replication domain. For example something like this:

    CHANGE MASTER TO master_host = "master.lan",
           master_gtid_pos = "1-1-100,2-5-300";

This is a fine and supported way to start replication. I just want to mention that it will also be supported to start replication in the old way, with master_log_file and master_log_pos, and the master will automatically convert this into the corresponding master_gtid_pos and set this for the slave. This can be convenient, as many tools like mysqldump or XtraBackup provide easy access to the old-style binlog position. It is certainly an improvement over MySQL 5.6 global transaction ID, where the only documented way to setup a slave involves RESET MASTER (!) on the master server...
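
To spell out the old-style variant mentioned above, it would look something like this, with the file name and position just made-up example values of the kind one gets from mysqldump or XtraBackup:

    CHANGE MASTER TO master_host = "master.lan",
           master_log_file = "master-bin.000024",
           master_log_pos = 107823;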

Incidentally, note that master_gtid_pos has just one global transaction ID per domain_id, not one per server_id. Thus, if not using any form of multi-master, there will be just one global transaction ID to set.

So if we start with server 1 as the master, and then some time later switch over to server 2 as the new master, the binlog will have global transaction IDs both with server_id=1 and server_id=2. But the slave binlog state will be just a single global transaction ID, with server_id=2 in this case. Since binlog order is always the same within one replication domain, a single global transaction ID is sufficient to know the correct place to continue replication.
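
So the complete state to set in that scenario could be as small as this (the host name and numbers are of course just for illustration):

    CHANGE MASTER TO master_host = "master2.lan",
           master_gtid_pos = "1-2-300500";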

I think this is a very nice property, that the size of the replication state is fixed: one global transaction ID per configured replication domain. In contrast, with MySQL 5.6 global transaction ID, any server_id that ever worked as a master will remain in the replication state forever. If you ever had a server with id 666, you will still be stuck with it 10 years later when you specify the replication state in CHANGE MASTER (assuming they will at some point even allow specifying the replication state in CHANGE MASTER).

Once the global transaction replication state is set, changing to a new master could happen with something like this:

    CHANGE MASTER TO master_host = "master.lan",
                     master_gtid_pos = AUTO;

This is the whole point of global transaction ID, of course, to be able to do this and automatically get replication started from the right point(s) in the binlog on the new master.

One idea that I have not yet decided about is to allow just this simple syntax:

    CHANGE MASTER TO master_host = "master.lan";

so that if no starting position is specified, and a global transaction state already exists, we default to master_gtid_pos = AUTO. This would be a change in current behaviour, so maybe it is not a good idea. On the other hand, the current behaviour is to start replication from whatever happens to be the first not yet purged binlog file on the master, which is almost guaranteed to be wrong. So it is tempting to change this simple syntax to just do the right thing.

On extensions to mysqlbinlog

Robert mentions some possible extensions to the mysqlbinlog program, and I agree with those.

The master binlog is carefully designed so that it is easily possible to locate any given global transaction ID and the corresponding binlog position (or determine that such a global transaction ID is not present in any binlog files). In the initial design this requires scanning one (but just one) binlog file from the beginning; later we could add an index facility if this becomes a bottleneck. The mysqlbinlog program should also support this, probably by allowing a global transaction ID (or multiple IDs) to be specified for --start-position and --stop-position.

Robert also mentions the usefulness of an option to filter out events from within just one replication domain/stream. This is something I had not thought of, but it would clearly be useful and is simple to implement.

On session variable server_id

With MariaDB global transaction ID, server_id becomes a session variable, as do the newly introduced variables gtid_domain_id and gtid_seq_no. This allows an external replication mechanism to apply events from another server outside the normal replication mechanism, and still preserve the global transaction ID when the resulting queries are logged in the local binlog. One important use case for this is of course point-in-time recovery with mysqlbinlog | mysql. Here, mysqlbinlog can set these variables to preserve the global transaction IDs on the events applied, so that fail-over and so on will still work correctly.
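
A sketch of what this could look like when applying a single event (the values are just for illustration, matching the GTID format used above):

    -- Re-apply an event while preserving its original GTID 1-5-300010.
    SET SESSION gtid_domain_id = 1;
    SET SESSION server_id      = 5;
    SET SESSION gtid_seq_no    = 300010;
    UPDATE some_table SET value = value + 1 WHERE id = 10;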

Since messing with server_id and the like can easily break replication, setting these variables requires SUPER privileges.