The Cassandra database has been getting quite a lot of
publicity recently. I think this is a good thing in general, but
it seems that some people are considering using it for unsuitable
purposes.
Cassandra is a cluster database which uses multiple nodes to provide:
- Read-scaling
- Write-scaling
- High availability
Unless you need at least TWO of those things, you should probably
not bother.
Good reasons to use Cassandra
High availability
Cassandra tolerates the failure of some nodes and will continue
to read data and take writes despite some nodes being offline or
unreachable - the exact behaviour depends on its settings and
what consistency level of read/write is requested.
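To make the consistency-level trade-off concrete, here is a rough sketch of the arithmetic. It assumes a replication factor of 3 and only the ONE, QUORUM and ALL levels; the function names are mine, and it ignores hinted handoff and multi-datacentre levels, so treat it as an illustration rather than a description of Cassandra's internals.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for a QUORUM request (a strict majority)."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int, required_acks: int) -> int:
    """Replicas of one key that can be down while the request still succeeds."""
    return replication_factor - required_acks


RF = 3  # assumed replication factor, not taken from the article

levels = {"ONE": 1, "QUORUM": quorum(RF), "ALL": RF}
for name, acks in levels.items():
    down = tolerated_failures(RF, acks)
    print(f"{name:6} needs {acks} replica(s), tolerates {down} down")
```

The stricter the level you request, the fewer replicas of a key can be offline before that read or write starts failing.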
Write scaling
Cassandra allows you to scale writes by just adding more nodes.
Writes are split between nodes, so you can generally get
better and better write performance by JUST adding more nodes.
(NB: it doesn't necessarily balance the load evenly, so you might
not see this in all cases, but it is what Cassandra aspires to.)
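As a rough illustration of "writes are split between nodes", the sketch below hashes each row key and assigns it to a node. This is a deliberately simplified stand-in (hash modulo node count), not Cassandra's actual token ring, and the key names are made up.

```python
import hashlib
from collections import Counter


def owner(key: str, node_count: int) -> int:
    """Pick a node for a key by hashing it (simplified; not a real token ring)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % node_count


def writes_per_node(keys, node_count: int) -> Counter:
    """Count how many writes each node would receive."""
    return Counter(owner(key, node_count) for key in keys)


keys = [f"user:{i}" for i in range(10_000)]  # hypothetical row keys

for node_count in (2, 4, 8):
    load = writes_per_node(keys, node_count)
    busiest = max(load.values())
    print(f"{node_count} nodes: busiest node takes {busiest} of {len(keys)} writes")
```

With well-spread keys the busiest node's share shrinks as nodes are added. If the keys are skewed, it will not shrink as nicely, which is the caveat in the NB above.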
Less good reasons to use Cassandra
Read scaling
Cassandra gives you read-scaling in the same way as
write-scaling. This is a good thing, but can also be achieved
relatively easily* with a conventional database by adding more
and more read-only slaves / replicas, or using a cache (if you
tend to get a lot of similar requests). Many big MySQL users do
both.
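For the cache half of that, a minimal read-through cache looks something like the sketch below. The class and the `slow_lookup` function are hypothetical stand-ins for whatever cache and database you actually use, and there is no expiry or invalidation here, which a real deployment would need.

```python
class ReadThroughCache:
    """Serve repeated reads from memory; fall back to the database on a miss."""

    def __init__(self, load):
        self._load = load      # function that fetches a value from the database
        self._entries = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._entries:
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        value = self._load(key)
        self._entries[key] = value
        return value


def slow_lookup(key):
    """Hypothetical stand-in for a query against the real database."""
    return f"row for {key}"


cache = ReadThroughCache(slow_lookup)
for key in ["front-page", "front-page", "about", "front-page"]:
    cache.get(key)

print(f"hits={cache.hits} misses={cache.misses}")
```

The more similar your requests are, the more of them never reach the database at all.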
Also Cassandra does NOT create more than the configured number of
replicas of any given piece of data, regardless of the amount of
traffic on that part, so you could end up having a small number
of servers hammered and the rest idle.
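Here is a small simulation of that effect. It assumes 10 nodes, a replication factor of 3, replicas placed on consecutive nodes, and reads spread round-robin over a key's replicas; all of those numbers and the key names are my assumptions, chosen only to show the shape of the problem.

```python
import hashlib
from collections import Counter

NODES = 10  # assumed cluster size
RF = 3      # assumed replication factor


def replicas(key: str, node_count: int = NODES, rf: int = RF) -> list:
    """Nodes holding a key: the hashed owner plus the next rf - 1 nodes."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    first = int.from_bytes(digest, "big") % node_count
    return [(first + i) % node_count for i in range(rf)]


def read_load(requests) -> Counter:
    """Spread each read over that key's replicas and count reads per node."""
    load = Counter({node: 0 for node in range(NODES)})
    for i, key in enumerate(requests):
        nodes = replicas(key)
        load[nodes[i % len(nodes)]] += 1
    return load


# 90% of reads go to one hot key, the rest to many different keys.
requests = ["front-page"] * 9000 + [f"user:{i}" for i in range(1000)]

load = read_load(requests)
for node in range(NODES):
    print(f"node {node}: {load[node]} reads")
```

However many nodes you add, the hot key's reads still land on the same three of them.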
Bad reasons to use Cassandra
Schema flexibility
aka "I cannot figure out how to use ALTER TABLE", or at least
how to design a flexible conventional schema ...
Some people have cited schema flexibility as a good reason to use
Cassandra (the same argument applies to Voldemort, CouchDB
etc).
However, in practice this is NOT a benefit, because it comes at
the cost of EVERYTHING ELSE YOU HAVE IN A TRADITIONAL
DATABASE.
Let's see what Cassandra does NOT do:
- Secondary indexes - I'd be really surprised if your app doesn't need any of those!
- Sorting, grouping or other advanced queries
- Filtering (mostly)
- Synchronous behaviour of updates
- Bulk updates (UPDATE 10,000 rows in one operation)
- Efficient table creation / drop
That's quite a big list (and very incomplete), so you'd better
have a stronger reason for using it than "I cannot figure out how
to use ALTER TABLE".
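For comparison, here is roughly what a schema change, a secondary index and a bulk update look like in a conventional database. This is a minimal sketch using Python's built-in sqlite3 module with an in-memory database; the table, column and index names are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Schema change: add a column to the existing table.
conn.execute("ALTER TABLE users ADD COLUMN email TEXT")

# Secondary index on the new column.
conn.execute("CREATE INDEX idx_users_email ON users (email)")

# Bulk update: one statement touches every row.
conn.execute("UPDATE users SET email = name || '@example.com'")
conn.commit()

# Sorting comes for free as well.
rows = conn.execute("SELECT name, email FROM users ORDER BY name").fetchall()
print(rows)
```

On a large production table an ALTER TABLE may take planning and time, but it remains a single, well-understood operation.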
Because X or Y uses it
Just because Digg, Facebook et al use Cassandra, doesn't mean you
have to. Your data are probably more important than theirs. Your
workload is probably different from theirs. In particular, your
write/read scale requirements are probably less than
theirs.
I have a lot of respect for the developers at Facebook, Digg etc,
but I also have a lot of envy:
- They lose data, nobody cares
- They lose data, nobody rings up and complains
- They lose data, and NOBODY DEMANDS THEIR MONEY BACK
They could get a bit of bad press, their users might desert them
in numbers, but they wouldn't lose money directly and
immediately.
Most companies who have big data provide a service, which comes
with an SLA. The SLA often says that if we lose our customers'
data, they get their money back.
* May or may not be easy, depending on the calibre of your
developers, ops staff, change control requirements, data
structure etc.