Displaying posts with tag: RocksDB
What May Cause MySQL ERROR 1213

Probably all of us MySQL users, DBAs, and developers have seen error 1213 more than once, in one context or another:
mysql> select * from t1;
ERROR 1213 (40001): Deadlock found when trying to get lock; try restarting transaction

The first thing that comes to mind in this case is: "OK, we have an InnoDB deadlock, let's check the details", followed by a SHOW ENGINE INNODB STATUS check, like this:
mysql> show engine innodb status\G
*************************** 1. row ***************************
  Type: InnoDB
  Name:
Status:
=====================================
2018-12-08 17:41:11 0x7f2f8b8db700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 12 seconds
-----------------
BACKGROUND THREAD
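
The LATEST DETECTED DEADLOCK section of that output shows the transactions and locks involved. For readers who want to see the error first-hand, here is a minimal sketch of one way to trigger it, assuming a hypothetical InnoDB table t1 with rows id=1 and id=2 and an integer column c:

-- session 1
BEGIN;
UPDATE t1 SET c = c + 1 WHERE id = 1;

-- session 2
BEGIN;
UPDATE t1 SET c = c + 1 WHERE id = 2;
UPDATE t1 SET c = c + 1 WHERE id = 1;   -- blocks, waiting for session 1

-- session 1
UPDATE t1 SET c = c + 1 WHERE id = 2;   -- InnoDB detects the cycle and one session gets ERROR 1213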

[Read more]
Review of TRIAD: Creating Synergies Between Memory, Disk and Log in Log Structured Key-Value Stores

This is a review of TRIAD, which was published at USENIX ATC 2017. It explains how to reduce write amplification for RocksDB leveled compaction, although the ideas are useful for many LSM implementations. I share a review here because the paper has good ideas. It isn't easy to keep up with all of the LSM research, even when limiting the search to papers that reference RocksDB, and I didn't notice this paper until recently.

TRIAD reduces write amplification for an LSM with leveled compaction. Across a variety of workloads it gets up to 193% more throughput, up to 4X less write amplification, and spends up to 77% less time doing compaction and flush. Per the …

[Read more]
Tuning MyRocks for performance

There are two things I like most about using MyRocks: 1. the LSM advantage – smaller space usage and lower write amplification, and 2. the best of MySQL – replication, storage-engine-centric database infrastructure operations, and MySQL orchestration tools. Facebook built RocksDB as an embeddable and persistent key-value store with a lower amplification factor compared to InnoDB. Let me explain a scenario where InnoDB proves less efficient than RocksDB on SSD:
We know InnoDB is constrained by a fixed compressed page size. Alignment during fragmentation and compression causes extra unused space because the leaf nodes are not full. Let’s consider an InnoDB table with a compressed page size of 8KB. A 16KB in-memory page compressed to 5KB still uses 8KB on storage. On top of this, each entry in the primary key index has 13 bytes of metadata (6-byte transaction id + 7-byte rollback pointer), and …
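
A minimal sketch of the table in that scenario (the table name and columns are made up for illustration); with KEY_BLOCK_SIZE=8 every leaf page occupies 8KB on storage even when the 16KB in-memory page compresses to 5KB:

-- hypothetical compressed InnoDB table with an 8KB compressed page size
CREATE TABLE t_compressed (
  id BIGINT NOT NULL PRIMARY KEY,
  payload VARBINARY(2048)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;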

[Read more]
Converting an LSM to a B-Tree and back again

I wonder if it is possible to convert an LSM to a B-Tree. The goal is to do it online and in-place -- so I don't want two copies of the database while the conversion is in progress. I am interested in data structures for data management that adapt dynamically to improve performance and efficiency for a given workload. 
Workloads change in the short and long term. I hope that data structures can adapt to the change, and converting between an LSM and a B-Tree is one way to adapt. This is more likely to be useful when the data structure supports some kind of partitioning, in the hope that different workloads can be isolated to different partitions -- and then some can use an LSM while others use a B-Tree.

LSM to B-Tree
A B-Tree is one tree. An LSM is a sequence of trees. Each sorted run in the LSM is a tree. With leveled compaction in RocksDB there are a few sorted runs in level 0 (L0) and then one sorted run …

[Read more]
Combining tiered and leveled compaction

There are simple optimization problems for LSM tuning. For example, use leveled compaction to minimize space amplification and tiered compaction to minimize write amplification. But there are interesting problems that are harder to solve:

  1. maximize throughput given a constraint on write and/or space amplification
  2. minimize space and/or write amplification given a constraint on read amplification

To solve the first problem, use leveled compaction if it can satisfy the write amp constraint, else use tiered compaction if it can satisfy the space amp constraint; otherwise there is no solution. The lack of a solution might mean the constraints are unreasonable, but it can also mean we need to enhance LSM implementations to support more diversity in LSM tree shapes. Even when there is a solution using …

[Read more]
Transaction Processing in NewSQL

This is a list of references for transaction processing in NewSQL systems. The work is exciting. I don't have much to add and wrote this to avoid losing interesting links. My focus is on OLTP, but some of these systems support more than that.

By NewSQL I mean the following. I am not trying to define "NewSQL" for the world:

  1. Support for multiple nodes because the storage/compute on one node isn't sufficient.
  2. Support for SQL with ACID transactions. If there are shards then cross-shard operations can be consistent and isolated.
  3. Replication does not prevent the properties listed above when you are willing to pay the price in commit overhead. Alas, synchronous geo-replication is slow and too-slow commit is another form of downtime. I hope NewSQL systems make this less of a problem (async geo-replication for some or all commits, commutative operations). Contention and conflict are common in OLTP and it …
[Read more]
Durability debt

I define durability debt to be the amount of work that must be done to persist changes that have been applied to a database. Dirty pages must be written back for a b-tree. Compaction must be done for an LSM. Durability debt has IO and CPU components. The common IO overhead is from writing something back to the database. The common CPU overhead is from computing a checksum and optionally from compressing data.
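
One way to observe both kinds of debt on a running server (a sketch: the InnoDB status variable is standard, while the second command assumes a MyRocks build that exposes the RocksDB engine status, where pending compaction appears in the per-level compaction stats):

-- B-Tree durability debt: dirty pages waiting to be written back
SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_pages_dirty';

-- LSM durability debt: pending compaction shows up in the compaction
-- stats of the RocksDB status output (MyRocks)
SHOW ENGINE ROCKSDB STATUS\G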

From an incremental perspective (pending work per modified row) an LSM usually has less IO and more CPU durability debt than a B-Tree. From an absolute perspective the maximum durability debt can be much larger for an LSM than a B-Tree, which is one reason why tuning can be more challenging for an LSM than a B-Tree.

In this post by LSM I mean LSM with leveled compaction.

B-Tree

The maximum durability debt for a B-Tree is limited by the size of the buffer pool. If the …

[Read more]
Bloom filter and cuckoo filter

The multi-level cuckoo filter (MLCF) in SlimDB builds on the cuckoo filter (CF) so I read the cuckoo filter paper. The big deal about the cuckoo filter is that it supports delete and a bloom filter does not. As far as I know the MLCF is updated when sorted runs arrive and depart a level -- so delete is required. A bloom filter in an LSM is per sorted run and delete is not required because the filter is created when the sorted run is written and dropped when the sorted run is unlinked.

I learned of the blocked bloom filter from the cuckoo filter paper (see here or …

[Read more]
Review of SlimDB from VLDB 2018

SlimDB is a paper worth reading from VLDB 2018. The highlights from the paper are that it shows:

  1. How to use less memory for filters and indexes with an LSM
  2. How to reduce the CPU penalty for queries with tiered compaction
  3. The benefit of more diversity in LSM tree shapes

Overview
Cache amplification has become more important as database:RAM ratios increase. With SSD it is possible to attach many TB of usable data to a server for OLTP. By usable I mean that the SSD has enough IOPs to access the data. But it isn't possible to grow the amount of RAM per server at that rate. Many of the early RocksDB workloads used database:RAM ratios that were about 10:1 and everything but the max level (Lmax) of the LSM …

[Read more]
5 things to set when configuring RocksDB and MyRocks

The 5 options to set for RocksDB and MyRocks are:

  1. block cache size
  2. number of background threads
  3. compaction priority
  4. dynamic leveled compaction
  5. bloom filters

I have always wanted to do a "10 things" post but prefer to keep this list small. It is unlikely that RocksDB can provide a great default for the block cache size and the number of background threads because they depend on the amount of RAM and the number of CPU cores in a server. But I hope RocksDB or MyRocks are changed to get better defaults for the other three, which would shrink this list from 5 to 2.
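
As a sketch of where these five settings live in MyRocks (the variable names below are my assumption of the usual MyRocks names; verify them against your build):

SHOW GLOBAL VARIABLES LIKE 'rocksdb_block_cache_size';     -- 1. block cache size
SHOW GLOBAL VARIABLES LIKE 'rocksdb_max_background_jobs';  -- 2. background threads
-- 3, 4 and 5 (compaction priority, dynamic leveled compaction, bloom filters)
-- are column family options, usually set in my.cnf via rocksdb_default_cf_options,
-- e.g. compaction_pri=kMinOverlappingRatio;level_compaction_dynamic_level_bytes=true;
--      block_based_table_factory={filter_policy=bloomfilter:10:false}
SHOW GLOBAL VARIABLES LIKE 'rocksdb_default_cf_options';
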
Options

My advice on setting the size of the RocksDB block cache has not changed assuming it is configured to use buffered IO (the default). With MyRocks this option is …

[Read more]