All too often people force themselves into using a database like MySQL without considering whether it is the best solution to their problem. Why? Because their other applications use it, so why not the new application? Over the past couple of months I have been doing a ton of work for clients who use their database like most people use memcached. Look up a row based on a key, update the data in the row, stuff the row back in the database. Rinse and repeat. Sure these setups vary sometimes, throwing in a "lookup" via username, or even the rare count. But for the most part they are designed to be simple.
A classic example is a simple online game. An online game may only require that the application retrieve a single record from the database. The record may contain all the vital stats for the game, get updated, and be stuffed back into the database. You would be surprised how many people use this type of system; I run into this type of application frequently. Keeping it simple ensures that the application is generally lean and mean and performs well. The issue is that even this simple design can start to have problems as the data size increases and you blow through your available memory. Is there a better architecture? Is there a way to get more scalability out of your database? Is the database even the best place for this data?
I decided to walk through setting up a very simple application that does what I have seen many clients do. Using this application I can then compare using MySQL alone to using MySQL + Memcached, and then to other solutions like Tokyo Tyrant or Cassandra. My application does the following:
A.) Read a row from a database based on an integer primary key
B.) Update data from that row and replace the stored contents on disk
C.) Use the data from that row to look up a row in another table based on a text field (called email address)
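To make that loop concrete, here is a minimal sketch of one iteration in Python. The table and column names (users, user_stats, email, last_seen) and the mysql-connector-python driver are hypothetical stand-ins for illustration, not the actual benchmark code:

    import random

    import mysql.connector

    db = mysql.connector.connect(user="bench", password="bench", database="bench")

    def one_iteration():
        cur = db.cursor(dictionary=True)

        # A.) read a row by its integer primary key
        user_id = random.randint(1, 2_000_000)
        cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()

        # B.) update data from that row and write it back to disk
        cur.execute("UPDATE users SET last_seen = NOW() WHERE id = %s", (user_id,))
        db.commit()

        # C.) use the row to look up a record in another table by email address
        cur.execute("SELECT * FROM user_stats WHERE email = %s", (row["email"],))
        stats = cur.fetchone()

        cur.close()
        return stats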
Seems simple enough, right? My two tables each contain 5M rows of data. Let's see what happens:
You can see a dramatic drop-off in performance as my data falls out of memory; that's not cool, is it? After all, database sizes tend to grow and very rarely shrink, which leads to a challenge faced by almost everyone: how do you maintain your performance in the face of increasing data size?
Here is where people start to scratch their heads. They naturally assume they need to scale more: we need more memory! If performance sucks, we must need more. So here come the bigger boxes, the read-only slaves, the complex sharding systems, the discussions on clustering, more memcached. We need to cover up the database's inefficiencies to ensure that our application scales.
The problem is that for some applications we are fixing symptoms, not the problem itself. No matter how much you want it to fit, some things may not work (like the Godfather 3). The issue is that people assume data storage has to be in the database. "It's data, it needs to go into the database" is often the battle cry. But hold on to your hat, I am going to shock you. For some applications, putting your data in the database is silly. Yes, the guy who blogs on bigdbahead.com and is writing this on the mysqlperformanceblog is saying you may not want to use a database. Heresy, I know! But many of us already accept storing data (at least temporarily) outside the DB. Think memcached.
Almost everyone loves memcached: it's simple, fast, and it just works. When your dataset exceeds your memory limitations, or the database simply cannot keep up any more, this solution can really boost performance. I know you're thinking my simple key lookup should really benefit from memcached. So let's try it! I took the simple app I created that reads two rows and updates one of them, and changed it to read from memcached if available, remove the cached copy on update, and read from the database only when required. I tested with a memcached size of 1GB, 2GB, and 4GB. For these tests I left InnoDB with a 256M buffer pool, or roughly 9% of the total data in memory.
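In code, that layering looks something like the following sketch, reusing the hypothetical users table from earlier. The pymemcache client and the key scheme are my own illustrative choices, not the actual test harness:

    import json

    from pymemcache.client.base import Client

    cache = Client(("127.0.0.1", 11211))

    def get_user(cur, user_id):
        # read-through: try memcached first, fall back to MySQL on a miss
        key = "user:%d" % user_id
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
        cur.execute("SELECT * FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
        if row is not None:
            cache.set(key, json.dumps(row, default=str))
        return row

    def update_user_email(db, cur, user_id, email):
        # write to MySQL, then remove the cached copy so the next read refills it
        cur.execute("UPDATE users SET email = %s WHERE id = %s", (email, user_id))
        db.commit()
        cache.delete("user:%d" % user_id)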
Let's look at the 1GB setting:
What, a performance regression? But we threw more memory at it! How can that be?
Memcached is not a cure-all. I have talked to many clients who say "we will just throw memcached at it". Sometimes an app will scream, other times it won't... and yet others require lots and lots of memory allocated to memcached to be successful. This application selects a random number between 1 and 2 million and looks up the result via that key. It then uses data from that random row to look up a second piece of information via email address. Because the entire dataset is about 4GB and only 1GB is in memcached, I keep pushing data out of memcached to make room for new records I am reading from the database. Remember, memcached needs repeatability to be helpful. I am still getting a really solid number of hits in memcached, but the number of writes in MySQL, coupled with the still large number of reads, takes its toll. Another place where I have seen this kill clients is in apps that do some sequential scanning and do not have enough memory for memcached. For instance, if you have 1,000,000 rows of data but enough memory to cache only 500,000 rows, sequentially accessing this data will destroy the usefulness of the cache:
get record 1, miss, read from disk, cache record 1
...
get record 500,001, miss, read from disk, expunge record 1, cache record 500,001
...
get record 1, miss, read from disk, expunge record 500,001, cache record 1
You keep overwriting the cache before you can use it. So here the complexity of adding memcached hurts us, because the cache is not actually buying us anything.
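To see why, here is a toy LRU simulation of that access pattern. This is pure illustration with the sizes from the example above, and a plain LRU rather than memcached's actual slab-based eviction:

    from collections import OrderedDict

    DATASET, CACHE_SIZE = 1_000_000, 500_000
    cache = OrderedDict()
    hits = misses = 0

    for _ in range(2):                         # two sequential passes over the data
        for key in range(DATASET):
            if key in cache:
                hits += 1
                cache.move_to_end(key)         # refresh LRU position on a hit
            else:
                misses += 1
                cache[key] = True              # "read from disk", then cache it
                if len(cache) > CACHE_SIZE:
                    cache.popitem(last=False)  # expunge the least recently used record

    print("hits=%d misses=%d" % (hits, misses))  # prints hits=0: every lookup misses

Every record is evicted just before the scan comes back around to it, so the hit count stays at zero no matter how many passes you make.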
Now bumping this up to 2GB actually makes the TPS jump around a lot, sometimes hitting 400 or 500 TPS and other times hitting as high as 1800 TPS. My guess is that the variation is caused by the random numbers being generated and simply the luck of the draw.
Finally, let's look at what happens when we have 4GB of memory allocated to memcached (the full dataset fits):
Here you can see that our "transactions" per second for this app increased almost 10X by using memcached. The TPS I get here varies from 1100 TPS to just under 2000 TPS, with the average around 1400 TPS. I think we would all be very happy if we could get a 10X performance boost for our application.
But wouldn't it be great if we could get more? I mean our reads are going pretty fast, but our writes leave a lot to be desired:
Over 17ms to do an update. Wouldn't it be great to just eliminate all the updates as well? What sort of throughput would we get? I will show you in part 2, which will cover performance in a 100% pure memcached environment. Part 3 will focus on these same benchmarks in Tokyo Tyrant.