My goal for the year is more time learning math and less time
running MySQL benchmarks. I haven't done serious benchmarks for
more than 12 months. It was a great experience but I want to
learn new things. MySQL 8.0.14 has been released with fixes for a
serious bug I found via the insert benchmark. I won't confirm
whether it has been fixed. I hope someone else does.
My tests and methodology are described in posts for sysbench, linkbench and the insert benchmark. I hope the
upstream distros (MySQL, MariaDB, Percona) repeat my tests and
methodology and I am happy to answer questions about that. I even
have inscrutable shell scripts that …
I have been trying to solve the problem of finding an optimal LSM
configuration for a given workload. The real problem is larger
than that, which is to find the right index structure and the right configuration
for a given workload. But my focus is RocksDB so I will start by
solving for an LSM.
This link is to slides that summarizes my
effort. I have expressed the problem to be solved using
differentiable functions to express the cost that is to be
minimized. The cost functions have a mix of real and integer
valued parameters for which values must be determine to minimize
the cost. I have yet to solve the …
This is a link to slides from my 5-minute talk
at the CIDR 2019 Gong Show. The slides are a brief
overview of the geek code for LSM trees. If you click on the
settings icon in the slide show you can view the speaker notes
which have links to blog posts that have more details. I also
pasted the links below. Given time I might add to this post, but
most of the content is in my past blog posts. Regardless I think
there is more to be discovered about performant, efficient and
manageable LSM trees.
The key points are there are more compaction algorithms to
discover, we need to make it easier to describe them and
compaction is a property of a level, not of the LSM tree. …
My last post explained the number of levels in
an LSM that minimizes write amplification using 3 different
estimates for the per-level write-amp. Assuming the per-level
growth factor is w then the 3 estimates were approximately w, w+1
and w-1 and named LWA-1, LWA-2 and LWA-3 in the post.
I realized there was a mistake in that post for the analysis of
LWA-3. The problem is that the per-level write-amp must be >=
1 (and really should be > 1) but the value of w-1 is <= 1
when the per-level growth factor is <= 2. By allowing the
per-level write-amp to be < 1 it easy to incorrectly show that
a huge number of levels reduces write-amp as I do for curve #3
in this graph. While I don't claim that (w-1) or
(w-1)/2 can't be a useful estimate for …
I previously used math to explain the number of
levels that minimizes write amplification for an LSM tree with
leveled compaction. My answer was one of ceil(ln(T)) or
floor(ln(T)) assuming the LSM tree has total fanout = T where T
is size(database) / size(memtable).
Then I heard from a coworker that the real answer is less than
floor(ln(T)). Then I heard from Niv Dayan, first author of
the Dostoevsky paper, that the real
answer is larger than ceil(ln(T)) and the optimal per-level
growth factor is ~2 rather than ~e.
All of our answers are correct. We have different answers because
we use different functions to estimate the per-level write-amp.
The graph of the …
Welcome to my first rant of 2019, although I have written about this before. While I enjoy
benchmarketing from a distance it is not much fun
to be in the middle of it. The RocksDB project has been
successful and thus becomes the base case for products and
research claiming that something else is better. While I have no
doubt that other things can be better I am wary about the
definition of better.
There are at least 3 ways to define better when evaluating
database performance. The first, faster is better, ignores
efficiency, the last two do not. I'd rather not ignore
efficiency. The marginal return of X more QPS eventually becomes
zero while the benefit of using less hardware is usually greater
than zero.
- …
In my previous post on ERROR 1213 I noted that
Percona Server does not support the SHOW ENGINE ROCKSDB TRANSACTION STATUS
statement to get deadlock details in "text" form. I've got
some clarifications in my related feature request, PS-5114. So I decided to write this followup post
and show what is the way to get deadlock details for the
ROCKSDB tables in current versions of MariaDB and Percona
Server.
First of all, I'd like to check MariaDB's implementation of
MyRocks. For this I'll re-create deadlock scenario from that my
post with MariaDB 10.3.12 I have at hand. We should start with
installing ROCKSDB …
My old NUC cluster found a new home and I
downsized to 2 new NUC servers. The new server is NUC8i7beh with 16g
RAM, 500g Samsung 860 EVO for the OS and 500g Samsung 970 EVO for performance. The Samsung 860
is SATA and the Samsung 970 is an m.2 device. I expect to wear
out the performance devices as I have done that in the past. With the OS on a
separate device I avoid the need to reinstall the OS when that
happens.
…
I have written before and will write again about
using 3-tuples to explain the shape of an LSM tree. This makes it
easier to explain the configurations supported today and
configurations we might want to support tomorrow in addition to
traditional tiered and leveled compaction. The summary is that n
LSM tree has N levels labeled from L1 to Ln and Lmax is another
name for L1. There is one 3-tuple per level and the components of
the 3-tuple are (type, fanout, runs) for Lk (level k)
where:
- type is Tiered or Leveled and explains compaction into that level
- fanout is the size of a sorted run in Lk relative to a sorted run from Lk-1, a real and >= 1
- runs is the number of sorted runs in that level, an integer and >= 1
Given the above how many valid configurations exist for …
[Read more]
How do you configure an LSM tree with leveled compaction to
minimize write amplification? For a given number of levels
write-amp is minimal when the same fanout (growth factor) is used
between all levels, but that does not explain the number of
levels to use. In this post I answer that question.
- The number of levels that minimizes write-amp is one of ceil(ln(T)) or floor(ln(T)) where T is the total fanout -- sizeof(database) / sizeof(memtable)
- When #1 is done then the per-level fanout is e when the number of levels is ln(t) and a value close to e when the number of levels is an integer.
Introduction
I don't recall reading this result elsewhere, but I am happy to
update this post with a link to such a result. I was encouraged
to answer this after a discussion with the RocksDB team and thank
Siying Dong for stating #2 above while leaving the math to me. I
assume the …