Showing entries 1 to 10 of 167
10 Older Entries »
Displaying posts with tag: RocksDB (reset)
Deletes are fast and slow in an LSM

In an LSM deletes are fast for the deleter but can make queries that follow slower. The problem is that too many tombstones can get in the way of a query especially a range query.

A tombstone must remain in the LSM tree until any keys it deletes have been removed from the LSM. For example, if there is a tombstone for key "ABC" on level 1 of the LSM tree and a live value for that key on level 3 then that tombstone cannot be removed. It is hard to make the check (does live key exist below me) efficient.

I haven't read much about optimizing for tombstones in an LSM not named RocksDB. Perhaps I have not tried hard to find such details. Maybe this is something that LSM engine developers should explain more in public.

Confirming whether a tombstone can be dropped

This is based on code that I read ~2 years ago. Maybe RocksDB has changed today.

Tombstones are dropped during …

[Read more]
From disk to flashcache to flash

The past decade in database storage was interesting whether you stayed with local attach storage, used block & object storage from cloud or on-prem vendors or moved to OSS scale-out storage like Ceph, GlusterFS and MinIO. I am writing about my experience and will focus on local attach.

Over the past decade the DBMS deployments I cared for went from disk to flashcache to flash on the HW side and then from MySQL+InnoDB to MySQL+MyRocks on the SW side. I assume that HW changes faster than DBMS software. DBMS algorithms that can adapt to such changes will do better in the next decade.

One comment I have heard a few too many times is that storage performance doesn't matter much because you can fit the database in RAM. More …

[Read more]

Q: What is the best readahead size?

Perhaps I agree with Dr. Stonebraker. This is my answer which might not be the correct answer. My reasons for O_DIRECT are performance, quality of service (QoS) and manageability and performance might get too much attention. I don't dislike Linux but the VM, buffered IO, readahead and page cache are there for all Linux use cases. They must be general purpose. Complex system software like a DBMS isn't general purpose and can do its own thing when needed. Also, I appreciate that kernel developers have done a lot to make Linux better for a DBMS. One of the perks at FB was easy access to many kernel developers.

Most of my web-scale MySQL/InnoDB experience is with O_DIRECT. While InnoDB can use buffered IO we always chose O_DIRECT. Eventually, RocksDB …

[Read more]
Still web scale

I am excited to start a new job next week working on performance at MongoDB. I have been a fan of the people and product for years and I look forward to contributing from the inside. The reasons I have been a fan include the rate at which the product has improved, WiredTiger and their contribution to MongoRocks.

I look forward to learning the modern performance analysis tool chain courtesy of Brendan Gregg. His BPF book should be ready soon and there is much content on his web site. When I had to understand off-cpu stalls from IO and mutex contention there wasn't much available 10 years ago, thus PMP was born. While it served me well it is time to move on.

I will continue to blog including performance comparisons between database …

[Read more]
Throttling writes: LSM vs B-Tree

Reducing response time variance is important for some workloads. This post explains sources of variance for workloads with high write rates when the index structure is an LSM or a B-Tree. I previously wrote about this in my post on durability debt.

Short summary:

  1. For a given write rate stalls are more likely with a B-Tree than an LSM
  2. Many RocksDB write stalls can be avoided via configuration
  3. Write stalls with a B-Tree are smaller but more frequent versus an LSM
  4. Write stalls are more likely when the redo log isn't forced on commit
  5. The worst case difference between an LSM and B-Tree is larger when the working set isn't cached
  6. Life is easier but more expensive when the working set …
[Read more]
My theory on technical debt and OSS

This is a hypothesis, perhaps it is true. I am biased given that I spent 15 years acknowledging tech debt on a vendor-based project (MySQL) and not much time on community-based projects.

My theory on tech debt and OSS is that there is more acknowledgement of tech debt in vendor-based projects than community-based ones. This is an advantage for vendor-based projects assuming you are more likely to fix acknowledged problems. Of course there are other advantages for community-based projects.

I think there is more acknowledgement of tech debt in vendor-based projects because the community is criticizing someone else's effort rather than their own. This is human nature, even if the effect and behavior aren't always kind. I spent many years marketing bugs that needed to be fixed -- along with many years  leading teams working to fix those bugs.

The joy of database configuration

I am wary of papers with performance results for too many products.Too many means including results from systems for which you lack expertise. Wary means I have less faith in the comparison even when the ideas in the paper are awesome. I have expertise in MySQL, MongoDB, RocksDB, WiredTiger and InnoDB but even for them I have made and acknowledged ridiculous mistakes.

Database configuration is too hard. There are too many options, most of them aren't significant and the approach is bottom-up. I an expert on this -- in addition to years of tuning I have added more than a few options to RocksDB and MySQL.

This post was motivated by PostgreSQL. I want to run the insert benchmark for it and need a good configuration. I have nothing against PG with the exception of a few too many why …

[Read more]
Using an LSM for analytics

How do you use an LSM for analytics? I haven't thought much about it because my focus has been small data -- web-scale OLTP since 2006. It is a great question given that other RocksDB users (Yugabyte, TiDB, Rockset, CockroachDB) support analytics.

This post isn't a list of everything that is needed for an analytics engine. That has been described elsewhere. It is a description of problems that must be solved when trying to use well-known techniques with an LSM. I explain the LSM challenges for a vectorized query engine, columnar storage, analytics indexes and bitmap indexes.

There is much speculation in this post. Some of it is informed -- I used to maintain bitmap indexes at Oracle. There is also big opportunity for research and for R&D in this space. I am happy to be corrected and be told of papers that I should read.

Vectorized query engine

See …

[Read more]
Just put the cold data over there

There are several ways to use less SSD for an OLTP workload: choose a database engine has less space amplification, store less data, move the cold data elsewhere. The first approach is a plan while the others are goals. A plan is something you can implement. A goal requires a plan to get done.

This matters when you want to decrease the cost of storage for a database workload but not everyone needs to do that. The first approach assumes your DBMS supports an LSM with leveled compaction and compression (MariaDB and Percona Server include MyRocks, ScyllaDB and Cassandra are also options). The second approach, store less data, assumes you can get everyone to agree to remove data and that is a hard conversation.

The third approach, move the cold data elsewhere, is a wonderful goal. I wonder how often that goal …

[Read more]
Learned indexes for an LSM?

The Learned Indexes paper opened a new area of research for storage engines. The idea is to use the distribution of the data to make searches faster and/or reduce the size of search structures. Whether ML will be used is an implementation artifact to me. I am in this for the efficiency gains -- reducing space and CPU read amplification. I prefer to call this topic learned search structures rather than learned indexes because the public isn't ready for learned recovery and learned consistency.

While the Learned Indexes paper has ideas for hash maps and bloom filters most research has focused on a B-Tree. I expect learned indexes to work better with an LSM because its search structures (block indexes) are created off line with respect to user operations. There is more time for analysis and …

[Read more]
Showing entries 1 to 10 of 167
10 Older Entries »