This is a link to slides from my 5-minute talk
at the CIDR 2019 Gong Show. The slides are a brief
overview of the geek code for LSM trees. If you click on the
settings icon in the slide show you can view the speaker notes
which have links to blog posts that have more details. I also
pasted the links below. Given time I might add to this post, but
most of the content is in my past blog posts. Regardless I think
there is more to be discovered about performant, efficient and
manageable LSM trees.

The key points are there are more compaction algorithms to
discover, we need to make it easier to describe them and
compaction is a property of a level, not of the LSM tree. …

**RocksDB**(reset)

My last post explained the number of levels in
an LSM that minimizes write amplification using 3 different
estimates for the per-level write-amp. Assuming the per-level
growth factor is w then the 3 estimates were approximately w, w+1
and w-1 and named LWA-1, LWA-2 and LWA-3 in the post.

I realized there was a mistake in that post for the analysis of
LWA-3. The problem is that the per-level write-amp must be >=
1 (and really should be > 1) but the value of w-1 is <= 1
when the per-level growth factor is <= 2. By allowing the
per-level write-amp to be < 1 it easy to incorrectly show that
a huge number of levels reduces write-amp as I do for curve #3
in this graph. While I don't claim that (w-1) or
(w-1)/2 can't be a useful estimate for …

I previously used math to explain the number of
levels that minimizes write amplification for an LSM tree with
leveled compaction. My answer was one of ceil(ln(T)) or
floor(ln(T)) assuming the LSM tree has total fanout = T where T
is size(database) / size(memtable).

Then I heard from a coworker that the real answer is less than
floor(ln(T)). Then I heard from Niv Dayan, first author of
the Dostoevsky paper, that the real
answer is larger than ceil(ln(T)) and the optimal per-level
growth factor is ~2 rather than ~e.

All of our answers are correct. We have different answers because
we use different functions to estimate the per-level write-amp.
The graph of the …

Welcome to my first rant of 2019, although I have written about this before. While I enjoy
benchmarketing from a distance it is not much fun
to be in the middle of it. The RocksDB project has been
successful and thus becomes the base case for products and
research claiming that something else is better. While I have no
doubt that other things can be better I am wary about the
definition of * better*.

There are at least 3 ways to define better when evaluating database performance. The first, faster is better, ignores efficiency, the last two do not. I'd rather not ignore efficiency. The marginal return of X more QPS eventually becomes zero while the benefit of using less hardware is usually greater than zero.

- …

In my previous post on ERROR 1213 I noted that
Percona Server does not support the **SHOW ENGINE ROCKSDB TRANSACTION STATUS**
statement to get deadlock details in "text" form. I've got
some clarifications in my related feature request, PS-5114. So I decided to write this followup post
and show what is the way to get deadlock details for the
**ROCKSDB** tables in current versions of MariaDB and Percona
Server.

First of all, I'd like to check MariaDB's implementation of
MyRocks. For this I'll re-create deadlock scenario from that my
post with MariaDB 10.3.12 I have at hand. We should start with
installing **ROCKSDB** …

My old NUC cluster found a new home and I
downsized to 2 new NUC servers. The new server is NUC8i7beh with 16g
RAM, 500g Samsung 860 EVO for the OS and 500g Samsung 970 EVO for performance. The Samsung 860
is SATA and the Samsung 970 is an m.2 device. I expect to wear
out the performance devices as I have done that in the past. With the OS on a
separate device I avoid the need to reinstall the OS when that
happens.

…

I have written before and will write again about
using 3-tuples to explain the shape of an LSM tree. This makes it
easier to explain the configurations supported today and
configurations we might want to support tomorrow in addition to
traditional tiered and leveled compaction. The summary is that n
LSM tree has N levels labeled from L1 to Ln and Lmax is another
name for Ln. There is one 3-tuple per level and the components of
the 3-tuple are (type, fanout, runs) for Lk (level k)
where:

- type is Tiered or Leveled and explains compaction into that level
- fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1
- runs is the number of sorted runs in that level, an integer and >= 1

Given the above how many valid configurations exist for …

[Read more]
How do you configure an LSM tree with leveled compaction to
minimize write amplification? For a given number of levels
write-amp is minimal when the same fanout (growth factor) is used
between all levels, but that does not explain the number of
levels to use. In this post I answer that question.

- The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)) where T is the total fanout -- sizeof(database) / sizeof(memtable)
- When #1 is done then the per-level fanout is e when the number of levels is ln(t) and a value close to e when the number of levels is an integer.

**Introduction**

I don't recall reading this result elsewhere, but I am happy to
update this post with a link to such a result. I was encouraged
to answer this after a discussion with the RocksDB team and thank
Siying Dong for stating #2 above while leaving the math to me. I
assume the …

Probably all of us, MySQL users, DBAs and developers had seen
error 1213 more than once, in one context or the other:

mysql> select * from t1;

ERROR 1213 (40001): Deadlock found when trying to get lock; try
restarting transactionThe first thing that comes to mind in this
case is: "OK, we have InnoDB deadlock, let's check the details",
followed by the **SHOW ENGINE INNODB STATUS** check, like
this:

mysql> show engine innodb status\G

*************************** 1. row
***************************

Type: InnoDB

Name:

Status:

=====================================

2018-12-08 17:41:11 0x7f2f8b8db700 INNODB MONITOR OUTPUT

=====================================

Per second averages calculated from the last 12 seconds

-----------------

BACKGROUND THREAD

…

This is review of TRIAD which was published in USENIX ATC 2017. It explains
how to reduce write amplification for RocksDB leveled compaction
although the ideas are useful for many LSM implementations. I
share a review here because the paper has good ideas. It isn't
easy to keep up with all of the LSM research, even when limiting
the search to papers that reference RocksDB, and I didn't
notice this paper until recently.

TRIAD reduces write amplification for an LSM with leveled
compaction and with a variety of workloads gets up to 193% more
throughput, up to 4X less write amplification and spends up to
77% less time doing compaction and flush. Per the …