Showing entries 1 to 10 of 133
Displaying posts with tag: RocksDB
Transaction Processing in NewSQL

This is a list of references for transaction processing in NewSQL systems. The work is exciting. I don't have much to add and wrote this to avoid losing interesting links. My focus is on OLTP, but some of these systems support more than that.

By NewSQL I mean the following. I am not trying to define "NewSQL" for the world:

  1. Support for multiple nodes because the storage/compute on one node isn't sufficient.
  2. Support for SQL with ACID transactions. If there are shards then cross-shard operations can be consistent and isolated.
  3. Replication does not prevent the properties listed above when you are willing to pay the price in commit overhead. Alas, synchronous geo-replication is slow, and a commit that is too slow is another form of downtime. I hope NewSQL systems make this less of a problem (async geo-replication for some or all commits, commutative operations). Contention and conflict are common in OLTP and it …
[Read more]
Durability debt

I define durability debt to be the amount of work that can be done to persist changes that have been applied to a database. Dirty pages must be written back for a B-Tree. Compaction must be done for an LSM. Durability debt has IO and CPU components. The common IO overhead is from writing something back to the database. The common CPU overhead is from computing a checksum and optionally from compressing data.
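To make the IO and CPU components concrete, here is a minimal sketch for the B-Tree case. It assumes a toy buffer pool that tracks dirty pages in a dict. The names BufferPool, prepare_for_writeback and debt_bytes are hypothetical, and the use of CRC32 and zlib is only an illustration, not a description of what any real engine does.

```python
import zlib


def prepare_for_writeback(page, compress=False):
    """CPU part of the debt: optionally compress, then checksum the payload."""
    payload = zlib.compress(page) if compress else page
    checksum = zlib.crc32(payload)
    return payload, checksum


class BufferPool:
    """Toy buffer pool (hypothetical): dirty pages are the pending work."""

    def __init__(self):
        self.dirty = {}

    def modify(self, page_id, data):
        # Changing the same page twice still leaves one page to write back.
        self.dirty[page_id] = data

    def debt_bytes(self):
        # IO part of the debt: bytes that still have to be written back.
        return sum(len(data) for data in self.dirty.values())

    def write_back(self, storage, compress=False):
        # Pay the debt: checksum (and maybe compress) each dirty page, then write it.
        for page_id, data in self.dirty.items():
            storage[page_id] = prepare_for_writeback(data, compress)
        self.dirty.clear()


pool = BufferPool()
pool.modify(1, b"a" * 4096)
pool.modify(1, b"b" * 4096)
pool.modify(2, b"c" * 4096)
print(pool.debt_bytes())
storage = {}
pool.write_back(storage, compress=True)
print(pool.debt_bytes())
```

In this sketch the debt can never exceed the number of pages the pool holds, because repeated changes to one page collapse into a single pending write.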

From an incremental perspective (pending work per modified row) an LSM usually has less IO and more CPU durability debt than a B-Tree. From an absolute perspective, the maximum durability debt can be much larger for an LSM than for a B-Tree, which is one reason why tuning can be more challenging for an LSM than a B-Tree.

In this post by LSM I mean LSM with leveled compaction.


The maximum durability debt for a B-Tree is limited by the size of the buffer pool. If the …

[Read more]
Bloom filter and cuckoo filter

The multi-level cuckoo filter (MLCF) in SlimDB builds on the cuckoo filter (CF) so I read the cuckoo filter paper. The big deal about the cuckoo filter is that it supports delete and a bloom filter does not. As far as I know the MLCF is updated when sorted runs arrive at and depart from a level -- so delete is required. A bloom filter in an LSM is per sorted run and delete is not required because the filter is created when the sorted run is written and dropped when the sorted run is unlinked.
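The per-run lifecycle described above can be sketched as follows. This is a minimal illustration, not RocksDB or SlimDB code: the BloomFilter, SortedRun and compact names are hypothetical, and the 10 bits per key and 7 hash functions are assumed values chosen only for the example. The filter has no delete method because it never needs one: compaction builds a new run with a new filter, and the old filters go away with the old runs.

```python
import bisect
import hashlib


class BloomFilter:
    """Add-only Bloom filter: bits are set on insert and never cleared."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive the bit positions from one digest via double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def may_contain(self, key):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


class SortedRun:
    """Immutable sorted run; its filter is built once, when the run is written."""

    def __init__(self, keys, bits_per_key=10, num_hashes=7):
        self.keys = sorted(keys)
        self.filter = BloomFilter(max(1, bits_per_key * len(self.keys)), num_hashes)
        for key in self.keys:
            self.filter.add(key)

    def get(self, key):
        # Skip the binary search when the filter says the key is absent.
        if not self.filter.may_contain(key):
            return False
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key


def compact(runs):
    """Merge runs into a new run with a new filter; the inputs are then dropped."""
    merged = set()
    for run in runs:
        merged.update(run.keys)
    return SortedRun(merged)


run_a = SortedRun(["apple", "banana"])
run_b = SortedRun(["cherry"])
run_c = compact([run_a, run_b])
print(run_c.get("banana"), run_c.get("durian"))
```

A filter that is shared by a whole level, and updated as runs come and go, cannot work this way because clearing bits for a departing key could also clear bits that other keys depend on.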

I learned of the blocked bloom filter from the cuckoo filter paper (see here or …

[Read more]
Review of SlimDB from VLDB 2018

SlimDB is a paper worth reading from VLDB 2018. The highlights from the paper are that it shows:

  1. How to use less memory for filters and indexes with an LSM
  2. How to reduce the CPU penalty for queries with tiered compaction
  3. The benefit of more diversity in LSM tree shapes

Cache amplification has become more important as database:RAM ratios increase. With SSD it is possible to attach many TB of usable data to a server for OLTP. By usable I mean that the SSD has enough IOPs to access the data. But it isn't possible to grow the amount of RAM per server at that rate. Many of the early RocksDB workloads used database:RAM ratios that were about 10:1 and everything but the max level (Lmax) of the LSM …

[Read more]
5 things to set when configuring RocksDB and MyRocks

The 5 options to set for RocksDB and MyRocks are:

  1. block cache size
  2. number of background threads
  3. compaction priority
  4. dynamic leveled compaction
  5. bloom filters

I have always wanted to do a "10 things" post but prefer to keep this list small. It is unlikely that RocksDB can provide a great default for the block cache size and number of background threads because they depend on the amount of RAM and number of CPU cores in a server. But I hope RocksDB or MyRocks are changed to get better defaults for the other three which would shrink this list from 5 to 2.

My advice on setting the size of the RocksDB block cache has not changed assuming it is configured to use buffered IO (the default). With MyRocks this option is …

[Read more]
Comparing TokuDB, RocksDB and InnoDB Performance on Intel(R) Xeon(R) Gold 6140 CPU

Recently one of our customers asked us to benchmark InnoDB, TokuDB and RocksDB on an Intel(R) Xeon(R) Gold 6140 CPU (with 72 CPUs), NVMe SSD (7 TB) and 530 GB of RAM. We used Ubuntu Xenial 16.04.4, Percona Server 5.7 (with the InnoDB/XtraDB, TokuDB and RocksDB storage engines) and Sysbench 1.0.15 with custom Lua scripts for this exercise. The benchmark covered bulk inserts, writes, reads and mixed read-write workloads. We have tried our best to capture as much information as possible about the hardware infrastructure and have shared the scripts we used for benchmarking. This is not a paid or sponsored benchmarking effort by any software or hardware vendor. We will always remain a vendor-neutral and independent web-scale database infrastructure operations company with core expertise in performance, scalability, high availability and database reliability engineering. This benchmarking is …

[Read more]
Lock elision, pthreads and MySQL

Yesterday I learned that lock elision is supported in recent versions of glibc for pthread mutex and rw-lock. I am curious if anyone has results for MySQL with it. My memory is that InnoDB can suffer from contention on a rw-lock, but that is a custom rw-lock not the one included with glibc. But code above the storage engine uses mutex and maybe rw-lock from glibc.

A rw-lock where reads dominate can suffer from contention because it has at least twice the memory writes per lock/unlock pair compared to a mutex. So when the lock hold time is short a mutex wins even when exclusive access isn't required. This can often be seen in PMP output where there are convoys and the worst-case is when a thread gets stuck trying to get the internal latch during unlock, but the InnoDB custom rw-lock might not have …

[Read more]
Tiered or leveled compaction, why not both via adaptive compaction?

First there was leveled compaction and it was good, but it took a while for implementations to become popular. Then there was (size) tiered compaction and it was good too, but more confusing given the diversity in strategies for picking files to compact. RocksDB didn't help with the confusion by calling it universal compaction. Eventually compaction algorithms optimized for time series were added (see DTCS for Cassandra). Finally, Kudu and InfluxDB have specialized compaction algorithms that are also worth understanding.

This post is about adaptive compaction which is yet another compaction algorithm. The summary for adaptive compaction is:

[Read more]
Default options in MyRocks

We need to make MyRocks easier to configure -- this isn't a new idea. If you are using MyRocks with default options in mid-2018 then you are probably not using bloom filters, compression or my favorite compaction policy.

You can fix all of that by setting rocksdb_default_cf_options. I wish this were the default.
rocksdb_default_cf_options=block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio

The above will enable the default compression type for all levels of the LSM tree which is Snappy in a recent MyRocks build with FB MySQL. But one of the proper distros only provides zlib and doing that for the small levels in the LSM tree (L0, L1, L2) might slow down compaction too much.
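The option string above is dense, so a small sketch that splits it into nested key/value pairs can make it easier to review before it goes into a config file. The parse_options helper is hypothetical and is not how RocksDB or MyRocks parses the value. It only assumes what is visible in the string: pairs separated by ";" and nested groups wrapped in "{" and "}".

```python
def parse_options(text):
    """Parse 'k=v;k={k=v;...}' into a nested dict; values stay strings."""
    parts = []
    depth = 0
    start = 0
    for i, ch in enumerate(text):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        elif ch == ";" and depth == 0:
            # Only split on separators that are outside of any braces.
            parts.append(text[start:i])
            start = i + 1
    parts.append(text[start:])

    result = {}
    for part in parts:
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        if value.startswith("{") and value.endswith("}"):
            result[key] = parse_options(value[1:-1])
        else:
            result[key] = value
    return result


cf_options = (
    "block_based_table_factory={cache_index_and_filter_blocks=1;"
    "filter_policy=bloomfilter:10:false;whole_key_filtering=1};"
    "level_compaction_dynamic_level_bytes=true;"
    "optimize_filters_for_hits=true;"
    "compaction_pri=kMinOverlappingRatio"
)

for name, value in parse_options(cf_options).items():
    print(name, "=", value)
```

Printed this way, it is easier to see that the bloom filter settings live inside block_based_table_factory while the compaction settings sit at the top level.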

To set rocksdb_default_cf_options but disable compression use:

[Read more]
CPU overheads for RocksDB queries

An LSM like RocksDB has much better write and space efficiency than a B-Tree. That means with RocksDB you will use less SSD than with a B-Tree and the SSD will either last longer or you can use lower endurance SSD. But this efficiency comes at a cost. In the worst-case RocksDB might use 2X more CPU/query than a B-Tree, which means that in the worst case QPS with RocksDB might be half of what it is with a B-Tree. In the examples below I show that RocksDB can use ~2X more comparisons per query compared to a B-Tree and it is nice when results in practice can be explained by theory.

But first I want to explain the context where this matters and where it doesn't matter. It matters when the CPU overhead from RocksDB is a significant fraction of the query response time -- so the workload needs to be CPU bound (cached working set). …

[Read more]