This is the first in a three part series of blogs in which I’ll try to explain how to take an existing application and shard it.
Database Sharding has proven itself a very successful strategy for scaling relational databases. Almost every large web-site/SaaS solution uses sharding when writing to its relational database. The reason is pretty simple – relational database technology is showing its age and just can’t meet today’s requirements: a massive number of operations/second, a lot of open connections (since there are many application servers talking to the database), huge amounts of data, and a very high write ratio (anything over 10% is high when it comes to relational databases).
(Alas – it’s incredibly difficult to implement sharding. That’s why we started ScaleBase – but we’re not doing a sales pitch here.)
In this post I’ll explore the first task – analyzing your schema to find the best sharding configuration.
To build a sharding configuration, you’ll need the following data:
Big tables are great candidates for sharding since usually many SQL commands are executed on them (otherwise they wouldn’t be so big).
Foreign keys can help you understand dependencies between tables. Tables that depend on one another must be divided in a smart enough way, otherwise, your data integrity will be lost.
Some SQL operations are incredibly difficult to implement in a sharded environment (we’ll talk about this in our third post). You need the SQL log to understand whether you can implement sharding on some tables. The SQL log can also give a good indication of which tables are accessed heavily and which are not.
If you need instructions on how to collect this information, you can read our documentation here.
Of course, this section contains an assumption: that the database is well defined and already contains data. This is the easiest course of action. If you have a new database, and you are not sure yet how many rows each table will contain or what the SQL query log will show, you’ll just have to make some kind of educated guess.
Before you begin to implement sharding, it’s important to understand that not every table in the schema will be sharded. Since sharding limits your SQL capabilities (no join between sharded tables, uniqueness, auto-increment columns, etc.), you will enforce limits on your application that will be very difficult to overcome in code.
Usually, some tables will just be replicated across all the shards. As a matter of fact, most tables will be replicated (in ScaleBase, for the sake of discussion, we call these tables Global tables), and only some tables will be sharded. You can read more about table types here.
Deciding which Tables to Shard
The algorithm for choosing which tables to shard is not a very complex one:
There is no exact number of tables to look at. Most schemas have several big tables, and the rest are very small. Table sizes are not divided evenly.
The list of tables you get contains the best candidates for sharding. In the next blog post we’ll learn how to migrate your existing data to a sharded environment.
Just so you’ll know – ScaleBase will soon release a tool that can identify the best sharding policy for your schema. In a few weeks, you’ll be able to download it – and you don’t even have to use ScaleBase or buy a ScaleBase license in order to use it.