Deletion patterns in LSM
Deleting a range of keys is a common pattern in RocksDB. Most systems built on top of RocksDB have multi-component key schemas, where keys sharing a common prefix are logically related. Here are some examples.
MyRocks is a MySQL fork using RocksDB as its storage engine. Each key’s first four bytes identify the table or index to which that key belongs. Thus dropping a table or index involves deleting all the keys with that prefix.
Rockssandra is a Cassandra variant that uses RocksDB as its storage engine. One
of its admin tool commands,
nodetool cleanup, removes key-ranges that have been migrated
to other nodes in the cluster.
Marketplace uses RocksDB to store product data. Its key begins with product ID, and it stores various data associated with the product in separate keys. When a product is removed, all these keys must be deleted.
When we decide what to improve, we try to find a use case that’s common across users, since we want to build a generally useful system, not one that has many one-off features for individual users. The range deletion pattern is common as illustrated above, so from this perspective it’s a good target for optimization.
Existing mechanisms: challenges and opportunities
The most common pattern we see is scan-and-delete, i.e., advance an iterator
through the to-be-deleted range, and issue a
Delete for each key. This is
slow (involves read I/O) so cannot be done in any critical path. Additionally,
it creates many tombstones, which slows down iterators and doesn’t offer a deadline
for space reclamation.
Another common pattern is using a custom compaction filter that drops keys in the deleted range(s). This deletes the range asynchronously, so cannot be used in cases where readers must not see keys in deleted ranges. Further, it has the disadvantage of outputting tombstones to all but the bottom level. That’s because compaction cannot detect whether dropping a key would cause an older version at a lower level to reappear.
If space reclamation time is important, or it is important that the deleted
range not affect iterators, the user can trigger
CompactRange on the deleted
range. This can involve arbitrarily long waits in the compaction queue, and
increases write-amp. By the time it’s finished, however, the range is completely
gone from the LSM.
DeleteFilesInRange can be used prior to compacting the deleted range as long
as snapshot readers do not need to access them. It drops files that are
completely contained in the deleted range. That saves write-amp because, in
CompactRange, the file data would have to be rewritten several times before it
reaches the bottom of the LSM, where tombstones can finally be dropped.
In addition to the above approaches having various drawbacks, they are quite complicated to reason about and implement. In an ideal world, deleting a range of keys would be (1) simple, i.e., a single API call; (2) synchronous, i.e., when the call finishes, the keys are guaranteed to be wiped from the DB; (3) low latency so it can be used in critical paths; and (4) a first-class operation with all the guarantees of any other write, like atomicity, crash-recovery, etc.
v1: Getting it to work
Where to persist them?
The first place we thought about storing them is inline with the data blocks. We could not think of a good way to do it, however, since the start of a range tombstone covering a key could be anywhere, making binary search impossible. So, we decided to investigate segregated storage.
A second solution we considered is appending to the manifest. This file is append-only, periodically compacted, and stores metadata like the level to which each SST belongs. This is tempting because it leverages an existing file, which is maintained in the background and fully read when the DB is opened. However, it conceptually violates the manifest’s purpose, which is to store metadata. It also has no way to detect when a range tombstone no longer covers anything and is droppable. Further, it’d be possible for keys above a range tombstone to disappear when they have their seqnums zeroed upon compaction to the bottommost level.
A third candidate is using a separate column family. This has similar problems to the manifest approach. That is, we cannot easily detect when a range tombstone is obsolete, and seqnum zeroing can cause a key to go from above a range tombstone to below, i.e., disappearing. The upside is we can reuse logic for memory buffering, consistent reads/writes, etc.
The problems with the second and third solutions indicate a need for range tombstones to be aware of flush/compaction. An easy way to achieve this is put them in the SST files themselves - but not in the data blocks, as explained for the first solution. So, we introduced a separate meta-block for range tombstones. This resolved the problem of when to obsolete range tombstones, as it’s simple: when they’re compacted to the bottom level. We also reused the LSM invariants that newer versions of a key are always in a higher level to prevent the seqnum zeroing problem. This approach has the side benefit of constraining the range tombstones seen during reads to ones in a similar key-range.
When there are range tombstones in an SST, they are segregated in a separate meta-block
Logical range tombstones (left) and their corresponding physical key-value representation (right)
WriteBatch stores range tombstones in its buffer which are logged to the WAL and
then applied to a dedicated range tombstone memtable during
Write. Later in
the background the range tombstone memtable and its corresponding data memtable
are flushed together into a single SST with a range tombstone meta-block. SSTs
periodically undergo compaction which rewrites SSTs with point data and range
tombstones dropped or merged wherever possible.
We chose to use a dedicated memtable for range tombstones. The memtable representation is always skiplist in order to minimize overhead in the usual case, which is the memtable contains zero or a small number of range tombstones. The range tombstones are segregated to a separate memtable for the same reason we segregated range tombstones in SSTs. That is, we did not know how to interleave the range tombstone with point data in a way that we would be able to find it for arbitrary keys that it covers.
Lifetime of point keys and range tombstones in RocksDB
During flush and compaction, we chose to write out all non-obsolete range tombstones unsorted. Sorting by a single dimension is easy to implement, but doesn’t bring asymptotic improvement to queries over range data. Ideally, we want to store skylines (see “Read Path” subsection below) computed over our ranges so we can binary search. However, a couple of concerns cause doing this in flush and compaction to feel unsatisfactory: (1) we need to store multiple skylines, one for each snapshot, which further complicates the range tombstone meta-block encoding; and (2) even if we implement this, the range tombstone memtable still needs to be linearly scanned. Given these concerns we decided to defer collapsing work to the read side, hoping a good caching strategy could optimize this at some future point.
In point lookups, we aggregate range tombstones in an unordered vector as we search through live memtable, immutable memtables, and then SSTs. When a key is found that matches the lookup key, we do a scan through the vector, checking whether the key is deleted.
In iterators, we aggregate range tombstones into a skyline as we visit live
memtable, immutable memtables, and SSTs. The skyline is expensive to construct but fast to determine whether a key is covered. The skyline keeps track of the most recent range tombstone found to optimize
(Image source: Leetcode) The skyline problem involves taking building location/height data in the unsearchable form of A and converting it to the form of B, which is binary-searchable. With overlapping range tombstones, to achieve efficient searching we need to solve an analogous problem, where the x-axis is the key-space and the y-axis is the sequence number.
For the v1 implementation, writes are much faster compared to the scan and
delete (optionally within a transaction) pattern.
DeleteRange only logs to WAL
and applies to memtable. Logging to WAL always
fflushes, and optionally
fdatasyncs. Applying to memtable is always an in-memory operation.
Since range tombstones have a dedicated skiplist memtable, the complexity of inserting is O(log(T)), where T is the number of existing buffered range tombstones.
Reading in the presence of v1 range tombstones, however, is much slower than reads in a database where scan-and-delete has happened, due to the linear scan over range tombstone memtables/meta-blocks.
Iterating in a database with v1 range tombstones is usually slower than in a scan-and-delete database, although the gap lessens as iterations grow longer. When an iterator is first created and seeked, we construct a skyline over its tombstones. This operation is O(T*log(T)) where T is the number of tombstones found across live memtable, immutable memtable, L0 files, and one file from each of the L1+ levels. However, moving the iterator forwards or backwards is simply a constant-time operation (excluding edge cases, e.g., many range tombstones between consecutive point keys).
v2: Making it fast
DeleteRange’s negative impact on read perf is a barrier to its adoption. The
root cause is range tombstones are not stored or cached in a format that can be
efficiently searched. We needed to design DeleteRange so that we could maintain
write performance while making read performance competitive with workarounds
used in production (e.g., scan-and-delete).
The key idea of the redesign is that, instead of globally collapsing range tombstones, we can locally “fragment” them for each SST file and memtable to guarantee that:
- no range tombstones overlap; and
- range tombstones are ordered by start key.
Combined, these properties make range tombstones binary searchable. This fragmentation will happen on the read path, but unlike the previous design, we can easily cache many of these range tombstone fragments on the read path.
The write path remains unchanged.
When an SST file is opened, its range tombstones are fragmented and cached. For point lookups, we binary search each file’s fragmented range tombstones for one that covers the lookup key. Unlike the old design, once we find a tombstone, we no longer need to search for the key in lower levels, since we know that any keys on those levels will be covered (though we do still check the current level since there may be keys written after the range tombstone).
For range scans, we create iterators over all the fragmented range tombstones and store them in a list, seeking each one to cover the start key of the range scan (if possible), and query each encountered key in this structure as in the old design, advancing range tombstone iterators as necessary. In effect, we implicitly create a skyline. This requires significantly less work on iterator creation, but since each memtable/SST has its own range tombstone iterator, querying range tombstones requires key comparisons (and possibly iterator increments) for several iterators (as opposed to v1, where we had a global collapsed representation of all range tombstones). As a result, very long range scans may become slower than before, but short range scans are an order of magnitude faster, which are the more common class of range scan.
To understand the performance of this new design, we used
db_bench to compare point lookup, short range scan,
and long range scan performance across:
- the v1 DeleteRange design,
- the scan-and-delete workaround, and
- the v2 DeleteRange design.
In these benchmarks, we used a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written. Writing the range tombstones ensures that most of them are not compacted away, and we have more tombstones in higher levels that cover keys in lower levels, which allows the benchmarks to exercise more interesting behavior when reading deleted keys.
Point lookup benchmarks read 100000 keys from a database using
readwhilewriting. Range scan benchmarks used
seekrandomwhilewriting and seeked 100000 times, and advanced up to 10 keys away from the seek position for short range scans, and advanced up to 1000 keys away from the seek position for long range scans.
The results are summarized in the tables below, averaged over 10 runs (note the
different SHAs for v1 benchmarks are due to a new
db_bench flag that was added in order to compare performance with databases with no tombstones; for brevity, those results are not reported here). Also note that the block cache was large enough to hold the entire db, so the large throughput is due to limited I/Os and little time spent on decompression. The range tombstone blocks are always pinned uncompressed in memory. We believe these setup details should not affect relative performance between versions.
|Name||SHA||avg micros/op||avg ops/sec|
Short Range Scans
|Name||SHA||avg micros/op||avg ops/sec|
Long Range scans
|Name||SHA||avg micros/op||avg ops/sec|
Note that memtable range tombstones are fragmented every read; for now this is acceptable, since we expect there to be relatively few range tombstones in memtables (and users can enforce this by keeping track of the number of memtable range deletions and manually flushing after it passes a threshold). In the future, a specialized data structure can be used for storing range tombstones in memory to avoid this work.
Another future optimization is to create a new format version that requires range tombstones to
be stored in a fragmented form. This would save time when opening SST files, and when
is not -1 (i.e., files may be opened several times).
Special thanks to Peter Mattis and Nikhil Benesch from Cockroach Labs, who were early users of DeleteRange v1 in production, contributed the cleanest/most efficient v1 aggregation implementation, found and fixed bugs, and provided initial DeleteRange v2 design and continued help.
Thanks to Huachao Huang and Jinpeng Zhang from PingCAP for early DeleteRange v1 adoption, bug reports, and fixes.