How to implement MySQL Sharding

This is the first in a three part series of blogs in which I’ll try to explain how to take an existing application and shard it.

Database Sharding has proven itself a very successful strategy for scaling relational databases. Almost every large web-site/SaaS solution uses sharding when writing to its relational database. The reason is pretty simple – relational database technology is showing its age and just can’t meet today’s requirements: a massive number of operations/second, a lot of open connections (since there are many application servers talking to the database), huge amounts of data, and a very high write ratio (anything over 10% is high when it comes to relational databases).

Many sites and blogs posts explain what sharding is, for example here and here. But how do you shard your application? Actually, the flow is quite simple, and consists of just four steps:

  1. Analyze your schema to find the best sharding configuration.
  2. Start multiple database instances.
  3. Split the data between the databases according to the sharding configuration.
  4. Update your application code to support your sharding configuration.

(Alas – it’s incredibly difficult to implement sharding. That’s why we started ScaleBase – but we’re not doing a sales pitch here.)

In this post I’ll explore the first task – analyzing your schema to find the best sharding configuration.

Required Data

To build a sharding configuration, you’ll need the following data:

  1. Table list and size
    Big tables are great candidates for sharding since usually many SQL commands are executed on them (otherwise they wouldn’t be so big).
  2. List of foreign keys
    Foreign keys can help you understand dependencies between tables. Tables that depend on one another must be divided in a smart enough way, otherwise, your data integrity will be lost.
  3. SQL Query log (preferably one that was gathered after running a complete coverage test of the application).
    Some SQL operations are incredibly difficult to implement in a sharded environment (we’ll talk about this in our third post). You need the SQL log to understand whether you can implement sharding on some tables. The SQL log can also give a good indication of which tables are accessed heavily and which are not.

If you need instructions on how to collect this information, you can read our documentation here.

Of course, this section contains an assumption: that the database is well defined and already contains data. This is the easiest course of action. If you have a new database, and you are not sure yet how many rows each table will contain or what the SQL query log will show, you’ll just have to make some kind of educated guess.

Table Policies

Before you begin to implement sharding, it’s important to understand that not every table in the schema will be sharded. Since sharding limits your SQL capabilities (no join between sharded tables, uniqueness, auto-increment columns, etc.), you will enforce limits on your application that will be very difficult to overcome in code.

Usually, some tables will just be replicated across all the shards. As a matter of fact, most tables will be replicated (in ScaleBase, for the sake of discussion, we call these tables Global tables), and only some tables will be sharded. You can read more about table types here.

Deciding which Tables to Shard

The algorithm for choosing which tables to shard is not a very complex one:

  1. Look at the biggest tables in your schema.
    There is no exact number of tables to look at. Most schemas have several big tables, and the rest are very small. Table sizes are not divided evenly.
  2. Look at the SQL log – are there joins between those tables?
    1. If there are, take the smaller table, and make it a global table.
    2. If not, those tables will serve as your shard environment.
    3. Look at the SQL log
      1. If the tables are not accessed frequently, make them global tables.
      2. Look at the most accessed tables (especially those with many write commands) and mark them as sharded. Go to step 2 to make sure they can be sharded.

The list of tables you get contains the best candidates for sharding. In the next blog post we’ll learn how to migrate your existing data to a sharded environment.

ScaleBase

Just so you’ll know – ScaleBase will soon release a tool that can identify the best sharding policy for your schema. In a few weeks, you’ll be able to download it – and you don’t even have to use ScaleBase or buy a ScaleBase license in order to use it.