Enabling MySQL Multi-Source Replication Workflows with GTIDs

The goal of this post is twofold: (i) to show the power of GTIDs and MySQL 5.6 replication infrastructure in general; and (ii) show MySQL users a way to hack multi-source replication rather painlessly and rather quickly on top of MySQL 5.6 when using the new MySQL 5.6 replication features. Multi-source Replication Multi-source replication is often regarded as a mean to aggregate, consolidate, different streams of data into one single server instance, a sink. In many cases the following assumptions hold:

  1. There are no conflicts between transactions from different sources/masters;
  2. Replication is asynchronous (just as it is in regular MySQL replication).

Such setups are interesting for some use cases, in particular those that are either: driven by the need to gather data in some central server to be able to make a backup of the whole data together; or driven by data analytics requirements which typically come down to the ability to "quickly" perform complex queries over the entire data set, even if it is split over multiple server instances - consolidating data on one single server makes querying the data easier, especially if there are cross-shards/partitions joins involved.

Given the assumptions listed above, and looking at the features in MySQL 5.6, one gets the feeling that something can be cooked up rather quickly on top of global transaction identifiers and mysqlbinlog to get a form of multi-source/data aggregation running. So, before going into the specifics on how one might be able to do it, lets look at the properties of GTIDs and mysqlbinlog.
MySQL 5.6 Global Transaction Identifiers MySQL 5.6 Global transaction Identifiers (GTIDs) is a very powerful feature. In fact it is more than just one feature, it is several packed together and simply branded as GTIDs. Allow me to highlight the following parts:

  • Global Unique Identifier - There is one global unique identifier associated with each transaction. This is recorded together with the data in the binary log. As such, wherever the changes are replicated, so are the identifiers;
  • Recording the execution history - The MySQL server will skip transactions that it has already processed before. This makes it resilient to mistakes and unwanted re-executions;
  • New Binary Log Dump protocol - GTIDs introduced a new master-slave interaction protocol. The biggest enhancement here is that the server that is dumping the binary log and the other party that is asking for it, will agree on which transactions should be exchanged. In practice, this means that when a slave connects to a master, it informs which transactions it has seen. Then the master will send all the others that the slave does not have yet;
  • Auto-positioning - Given the new dump protocol and the fact that a server records its GTID execution history, the slave and the master are able to negotiate and automatically determine the positioning on the replication stream. No need to for the DBA to take notes and/or do some math to calculate positions or transaction sets. Thus, CHANGE MASTER TO does not take any positioning data with GTIDs. It is automatic.

"Hmm... OK, this is all very nice... But I thought this post was somewhat related to multi-source replication. Is it not?" Yes. Getting there. Hang on.
mysqlbinlog in MySQL 5.6 mysqlbinlog has seen its functionality improved in MySQL 5.6. Along with a new switch named "--raw" (which allows copying binary logs from a remote server without decoding them, i.e., in raw format) several others were added, like --exclude-gtids or --stop-never. For the purposes of this blog post, I found the following mysqlbinlog switches very interesting:

  • stop-never
    • This switch instructs mysqlbinlog to not bail out once it reaches the end of the active log on the master. Instead, it waits for more events to be generated, as it it were a fake slave.
  • stop-never-slave-server-id= X
    • This sets the server_id when stop-never is used. When mysqlbinlog connects to a server, say M, that has several other slaves, one can use this switch to prevent mysqlbinlog from using the same server id that some other slave might be using while it is connected to M.
  • to-last-log
    • This option instructs mysqlbinlog to continue printing until the end of the last binlog of the server, instead of stopping at the end of the requested binlog file.
  • read-from-remote-master='BINLOG-DUMP-GTIDS' 
    • Makes mysqlbinlog use the new binary log dump protocol.
  • exclude-gtids=...
    • This switch instructs mysqlbinlog to exclude transactions with the given IDs from the output. The most interesting part, is that if connecting to a server and when used together with read-from-remote-master='BINLOG-DUMP-GTIDS', then the list of identifiers provided will be sent to the dump thread. Therefore, the dump thread will be able to filter out the excluded transactions before actually sending them.

"OK, this is interesting, but still nothing related to multi-source...". Relax, we are almost there!
Leveraging on GTIDs and mysqlbinlog for Crafting Multi-Source Replication At this point we know from above that the feature set from global transaction identifiers in MySQL 5.6 provides a powerful framework for creating seamless, resilient and flexible replication topologies with MySQL. In fact, this was well depicted, by Sven, here and here, especially targeting fail-over scenarios. Also, from above, we know that there are interesting switches in mysqlbinlog that open it up to a lot of use case scenarios if used in a creative way. So, lets combine both to craft a form of multi-source replication in MySQL 5.6.
Now this is where it gets interesting... Making use of MySQL 5.6 GTIDs, and keeping in mind the four GTIDs parts listed above, building a multi-source replication solution using GTIDs and mysqlbinlog is rather simple. Just a few lines in shell scripting.
What do we need then?

  1. MySQL 5.6 servers with gtid_mode=ON
  2. mysqlbinlog from a MySQL 5.6 release
  3. mysql CLI client

How do we do it?

  1. Start all servers with gtid_mode=ON.
  2. Create the shell script mysql-retrieve-and-apply.sh.
  3. Run it with the proper configuration parameters.
#
# Sample script to replicate from one server to another
# using mysqlbinlog as a fake slave and using GTIDs to do
# automatic positioning.
#
# usage:
# mysql-retrieve-and-apply.sh <sink_hostname> <sink_port> <source_hostname> <source_port> <fake_slave_id>
#
DESTINATION_HOSTNAME=$1
DESTINATION_PORT=$2
SOURCE_HOSTNAME=$3
SOURCE_PORT=$4
SLAVE_FAKE_ID=$5

# The binlog file to start from. For this example I use master-bin.000001.
MYSQLBINLOGFILE=mysqld-bin.000001

# The user for connecting to the mysql servers.
USERNAME=root
# Query to retrieve the GTID_EXECUTED from the sink.
QUERY="SELECT VARIABLE_VALUE FROM INFORMATION_SCHEMA.GLOBAL_VARIABLES WHERE VARIABLE_NAME = 'GTID_EXECUTED'"

# Where to find mysql and mysqlbinlog tools.
MYSQL=../bin/mysql
MYSQLBINLOG=../bin/mysqlbinlog

# Retrieve GTID_EXECUTED from the destination server.
GTID_EXECUTED=$($MYSQL -se "$QUERY" -h $DESTINATION_HOSTNAME -P $DESTINATION_PORT -u $USERNAME -p -N)

# Dump the binary log from the source and immediately apply it through a
# mysql client connection.
$MYSQLBINLOG --port=$SOURCE_PORT \
--host=$SOURCE_HOST \
--protocol=tcp \
--user=$USERNAME \
--stop-never \
--stop-never-slave-server-id=$SLAVE_FAKE_ID \
--read-from-remote-master='BINLOG-DUMP-GTIDS' \
--to-last-log \
--exclude-gtids=$GTID_EXECUTED \
$MYSQLBINLOGFILE \
| \
$MYSQL --user=$USERNAME \
--protocol=tcp \
--host=$DESTINATION_HOSTNAME \
--port=$DESTINATION_PORT -p


DISCLAIMER: The shell script above is very rudimentary, uses the root user to connect everywhere, does not handle errors, is very specific to this blog post and is nothing more than a proof-of-concept to demonstrate the idea. Do not use it in production as is.
Now, let me try to explain, by resorting to a hypothetical multi-source replication setup, how one could use this approach and why would it work .

First, for simplicity, assume that we only have three servers (A,B,C) where B is a sink and A and C are sources. Assume also that all three servers were started with gtid_mode=ON thence all servers will record their execution history. Since data flows from A -> C and C -> B, then B consolidates all the data from A and from C. The workloads submitted to A and C do not conflict, thus B is able to install successfully changes from both sources.

Second, the assume that B is connected to A and to C using the shell script presented above. The script: (i) starts by retrieving the set of GTIDs that the destination server has already executed; (ii) and then uses the retrieved set as input to the --exclude-gtids option of mysqlbinlog so that when it connects to the source server, it auto-positions itself using the new protocol; (iii) finally, once transactions start flowing from the source, the output of mysqlbinlog is piped through the mysql client into the destination server.

Theoretically very simple.

One can even stop and restart the shell script many times without having to do any kind of recovery or even advanced math to reposition in the replication stream. On every restart, the same procedure will be executed, thus the filtering of already known transactions and the automatic repositioning will take place everytime the script is aborted and restarted.
Strictly speaking, --exclude-gtids is not entirely necessary. If it were not used then the filtering would be done on the destination server by auto-skipping transactions that it had executed before. But the fact that mysqlbinlog (as well as the IO thread by the way) can instruct the dump thread to filter out transactions while sending the binary log, saves network bandwidth, avoids scanning and sending the entire binlog, and avoids extra work at the sink side. Notice, that this is also safer. Imagine that the source purges logs and the sink does not tell the dump thread that it has seen those purged transactions... This would result in an error when the sink connects because the source would state that it was not able to send the purged transactions that the sink is in practice requesting again, even though it does not need them.
Enough theory already! Lets spin this on real servers. Exercising The Idea Time to validate the small replication trick presented above.
Lets start by downloading the latest mysql-5.6.12 package for your environment. It is better to choose 5.6.12 or later since some bugs were fixed in mysqlbinlog, and a couple might have impact on the idea on this blog post (thinking on Bug #16532543 and Bug #16517775). Anyway, I am using Linux, so I will go for a linux 64bit tarball. After the download is successful, uncompress and untar it:

shell> tar xfz mysql-5.6.12-linux-glibc2.5-x86_64.tar.gz


Next, change into the mysql-test directory.

shell> cd mysql-5.6.12-linux-glibc2.5-x86_64/mysql-test


Now, lets start three servers very quickly using mysql-test-run. Bear in mind that we need to activate GTIDs.

shell> perl mtr --start --mysqld=--gtid-mode=ON \
--mysqld=--enforce-gtid-consistency \
--mysqld=--log-slave-updates \
rpl_row_img_eng_full


This starts three servers, but does not run the test cases. I picked the rpl_row_img_eng_full test, since it actually deploys three servers. Take note of the ports on which each server is listening. You will need that for later. In this example, we will be replicating from the first (server_id=1) and third (server_id=3) servers to the second one (server_id=2), aka sink.
On a different terminal, lets start the script I presented above, so that we get replication going from one server to another. Lets start replication from the first server to the second one:

shell> mysql-retrieve-and-apply.sh 127.0.0.1 13001 127.0.0.1 13000 201


At this point you might have been asked for the password a few times. At least two times. Once when the script connects to the sink to ask for @@gtid_executed and the other one when the script connects again to the sink to apply the events it is retrieving from the master. Note that the script is tailored to this environment. Mind you the location of mysqlbinlog and mysql clients was set to ../bin/mysqlbinlog and ../bin/mysql respectively and the name of the first binlog file to mysqld-bin.000001. Furthermore, the script connects with the user root. So, all of these settings are very specific to this testing environment. This should NOT be used in production as is.
Moving on... By now, you have replication setup from server #1 to server #2. Lets do the same with server #3.

shell> mysql-retrieve-and-apply.sh 127.0.0.1 13001 127.0.0.1 13002 203


Now connect to server #1 and create a table with one record:

shell> ../bin/mysql -u root -S var/tmp/mysqld.1.sock test
mysql> CREATE TABLE t1 (c1 INT);
mysql> INSERT INTO t1 VALUES (1);
mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+------------------------------------------+
| Variable_name | Value |
+---------------+------------------------------------------+
| gtid_executed | 7ec1eea8-d2f3-11e2-9d1a-00144fe67f50:1-2 |
+---------------+------------------------------------------+
1 row in set (0.00 sec)


We can clearly see that we have executed two transactions:

7ec1eea8-d2f3-11e2-9d1a-00144fe67f50:1-2


Now connect to server #3 and do the same, but using a different table name:

shell> ../bin/mysql -u root -S var/tmp/mysqld.3.sock test
mysql> CREATE TABLE t2 (c1 INT);
mysql> INSERT INTO t2 VALUES (1);
mysql> SHOW GLOBAL VARIABLES LIKE 'gtid_executed';
+---------------+------------------------------------------+
| Variable_name | Value |
+---------------+------------------------------------------+
| gtid_executed | 80003180-d2f3-11e2-9d1a-00144fe67f50:1-2 |
+---------------+------------------------------------------+
1 row in set (0.00 sec)


Again, in this case, we have two transactions:

80003180-d2f3-11e2-9d1a-00144fe67f50:1-2


On both sessions, we can inspect the result of showing the process list, for instance:

mysql> SELECT * FROM INFORMATION_SCHEMA.PROCESSLIST WHERE COMMAND LIKE '%Dump GTID%';
+----+------+-----------------+------+------------------+------+------------------------------------------------------------------+------+
| ID | USER | HOST | DB | COMMAND | TIME | STATE | INFO |
+----+------+-----------------+------+------------------+------+------------------------------------------------------------------+------+
| 2 | root | localhost:52560 | NULL | Binlog Dump GTID | 406 | Master has sent all binlog to slave; waiting for binlog to be up | NULL |
+----+------+-----------------+------+------------------+------+------------------------------------------------------------------+------+
1 row in set (0.00 sec)


As you can see, we have our session listening for GTIDs online. Now it is time to go and have a look at the sink and see if the GTIDs and the changes have reached the database.

shell> ../bin/mysql -u root -S var/tmp/mysqld.2.sock test
mysql> SHOW TABLES;
+----------------+
| Tables_in_test |
+----------------+
| t1 |
| t2 |
+----------------+
2 rows in set (0.00 sec)

mysql> SELECT * FROM t1;
+------+
| c1 |
+------+
| 1 |
+------+
1 row in set (0.00 sec)
mysql> SELECT * FROM t2;
+------+
| c1 |
+------+
| 2 |
+------+
1 row in set (0.00 sec)

mysql> SHOW GLOBAL VARIABLES LIKE 'GTID_EXECUTED';
+---------------+------------------------------------------------------------------------------------+
| Variable_name | Value |
+---------------+------------------------------------------------------------------------------------+
| gtid_executed | 7ec1eea8-d2f3-11e2-9d1a-00144fe67f50:1-2,
80003180-d2f3-11e2-9d1a-00144fe67f50:1-2 |
+---------------+------------------------------------------------------------------------------------+
1 row in set (0.00 sec)


As you can see, server #2 has gotten the changes and the GTIDs replicated from server #1 and server #3 into server #2 seamlessly. No need to handle positions, no need to do anything other than letting a simple script running. Cool! Conclusion The main goal of this blog post was to show a neat hack to deploy multi-source in MySQL 5.6 by making use of the awesome GTIDs infrastructure. And, yes, I know what you are thinking by now... Bi-directional replication? Well, yes! Given the properties of GTIDs you can easily do it like this as well (just recall the auto-skipping and auto-positioning properties of GTIDs). But bear in mind that, at the end of the day, this is still a hack! A nice one, but still a hack! ;) So, don't plan to put this in production in its current form.

OK, enough for now. Go try it yourself. Improve the script above. Or use the MySQL Utilities to build your own Python-based multi-source mysql aggregation tool... There are so many interesting possibilities and ideas to explore on top of the GTIDs framework and this is just one of them...
Enjoy!