RocksDB

Java Foreign Function Interface

Posted February 20, 2024

Java Foreign Function Interface (FFI)

Evolved Binary has been working on several aspects of how the Java API to RocksDB can be improved. The recently introduced FFI features in Java provide significant opportunities for improving the API. We have investigated this through a prototype implementation.

Java 19 introduced a new FFI Preview which is described as an API by which Java programs can interoperate with code and data outside of the Java runtime. By efficiently invoking foreign functions (i.e., code outside the JVM), and by safely accessing foreign memory (i.e., memory not managed by the JVM), the API enables Java programs to call native libraries and process native data without the brittleness and danger of JNI.

If the twin promises of efficiency and safety are realised, then using FFI as a mechanism to support a future RocksDB API may be of significant benefit. It could:

Remove the complexity of JNI access to C++ RocksDB
Improve RocksDB Java API performance
Reduce the opportunity for coding errors in the RocksDB Java API

Here’s what we did. We have

created a prototype FFI branch
updated the RocksDB Java build to use Java 19
implemented an FFI Preview API version of a core RocksDB feature (get())
extended the current JMH benchmarks to also benchmark the new FFI methods. Usefully, JNI and FFI can co-exist peacefully, so we use the existing RocksDB Java to do support work around the FFI-based get() implementation.

Implementation

How JNI Works

JNI requires a preprocessing step during build/compilation to generate header files from the native methods declared in Java. The functions declared in those headers are then implemented in C++, and the whole is linked together.

Code in the C++ methods uses what amounts to a JNI library to access Java values and objects and to create Java objects in response.

How FFI Works

FFI provides the facility for Java to call existing native (in our case C++) code from Pure Java without having to generate support files during compilation steps. FFI does support an external tool (jextract) which makes generating common boilerplate easier and less error prone, but we chose to start prototyping without it, in part to better understand how things really work.

FFI does its job by providing 2 things:

A model for allocating, reading and writing native memory and native structures within that memory
A model for discovering and calling native methods with parameters consisting of native memory references and/or values

The called C++ is invoked entirely natively. It does not have to access any Java objects to retrieve data it needs. Therefore existing packages in C++ and other sufficiently low level languages can be called from Java without having to implement stubs in the C++.
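As a minimal sketch of these two models (not code from the prototype), the following calls the C library's strlen() from pure Java. It assumes Java 22 or later, where the API is final and differs in detail from the Java 19 preview used in this work. The class name StrlenViaFfi is purely illustrative, and size_t is assumed to be 64 bits wide.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Illustrative only: assumes Java 22+ and a 64-bit size_t.
final class StrlenViaFfi {
  private static final Linker LINKER = Linker.nativeLinker();

  // Model 2: discover a native function and describe its signature,
  // here size_t strlen(const char*).
  private static final MethodHandle STRLEN = LINKER.downcallHandle(
      LINKER.defaultLookup().find("strlen").orElseThrow(),
      FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

  static long nativeStrlen(final String s) {
    // Model 1: allocate native memory whose lifetime is bounded by the arena.
    try (Arena arena = Arena.ofConfined()) {
      // Copies the string into native memory as NUL-terminated UTF-8.
      final MemorySegment cString = arena.allocateFrom(s);
      return (long) STRLEN.invokeExact(cString);
    } catch (final Throwable t) {
      throw new IllegalStateException("native call failed", t);
    }
  }

  public static void main(final String[] args) {
    System.out.println(nativeStrlen("rocksdb")); // prints 7
  }
}
```

Depending on the JDK version, running this may print a warning unless native access is enabled with --enable-native-access=ALL-UNNAMED, the same flag used in the benchmark command line.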

Our Approach

While we could in principle avoid writing any C++, C++ objects and classes are not easily defined in the FFI model, so to begin with it is easier to write some very simple C-like methods/stubs in C++ which can immediately call into the object-oriented core of RocksDB. We define structures with which to pass parameters to and receive results from the C-like method(s) we implement.

`C++` Side

The first method we implement is

extern "C" int rocksdb_ffi_get_pinnable(
    ROCKSDB_NAMESPACE::DB* db, ROCKSDB_NAMESPACE::ReadOptions* read_options,
    ROCKSDB_NAMESPACE::ColumnFamilyHandle* cf, rocksdb_input_slice_t* key,
    rocksdb_pinnable_slice_t* value);

our input structure is

typedef struct rocksdb_input_slice {
  const char* data;
  size_t size;
} rocksdb_input_slice_t;

and our output structure is a pinnable slice (of which more later)

typedef struct rocksdb_pinnable_slice {
  const char* data;
  size_t size;
  ROCKSDB_NAMESPACE::PinnableSlice* pinnable_slice;
  bool is_pinned;
} rocksdb_pinnable_slice_t;

`Java` Side

We implement an FFIMethod class to advertise a java.lang.invoke.MethodHandle for each of our helper stubs

  public static MethodHandle GetPinnable; // handle which refers to the rocksdb_ffi_get_pinnable method in C++
  public static MethodHandle ResetPinnable; // handle which refers to the rocksdb_ffi_reset_pinnable method in C++

We also implement an FFILayout class to describe each of the passed structures (rocksdb_input_slice, rocksdb_pinnable_slice and rocksdb_output_slice) in Java terms

 public static class InputSlice {
  static final GroupLayout Layout = ...
  static final VarHandle Data = ...
  static final VarHandle Size =  ...
 };

 public static class PinnableSlice {
  static final GroupLayout Layout = ...
  static final VarHandle Data = ...
  static final VarHandle Size =  ...
  static final VarHandle IsPinned =  ...
 };

 public static class OutputSlice {
  static final GroupLayout Layout = ...
  static final VarHandle Data = ...
  static final VarHandle Size =  ...
 };
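To illustrate what the elided definitions might look like, here is a sketch for the input slice only. It is written against the final Java 22 API rather than the Java 19 preview used by the prototype, where the details differ. The class name, the field names passed to withName() and the roundTrip() helper are assumptions made for illustration, and a 64-bit platform is assumed so that a pointer and a size_t each occupy 8 bytes.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.GroupLayout;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemoryLayout.PathElement;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.VarHandle;

// Illustrative only: assumes Java 22+ and a 64-bit platform.
final class InputSliceLayoutSketch {
  // Mirrors: struct { const char* data; size_t size; }
  static final GroupLayout LAYOUT = MemoryLayout.structLayout(
      ValueLayout.ADDRESS.withName("data"),
      ValueLayout.JAVA_LONG.withName("size"));
  static final VarHandle DATA = LAYOUT.varHandle(PathElement.groupElement("data"));
  static final VarHandle SIZE = LAYOUT.varHandle(PathElement.groupElement("size"));

  // Writes a key into a native input slice, then reads the fields back.
  // Returns the size read back, or -1 if the data pointer did not round-trip.
  static long roundTrip(final byte[] key) {
    try (Arena arena = Arena.ofConfined()) {
      final MemorySegment keyBytes = arena.allocateFrom(ValueLayout.JAVA_BYTE, key);
      final MemorySegment slice = arena.allocate(LAYOUT);
      // In Java 22 these var handles take (segment, base offset, value).
      DATA.set(slice, 0L, keyBytes);
      SIZE.set(slice, 0L, (long) key.length);
      final MemorySegment data = (MemorySegment) DATA.get(slice, 0L);
      final long size = (long) SIZE.get(slice, 0L);
      return data.address() == keyBytes.address() ? size : -1L;
    }
  }

  public static void main(final String[] args) {
    System.out.println("struct size: " + LAYOUT.byteSize());
    System.out.println("size field: " + roundTrip(new byte[] {1, 2, 3}));
  }
}
```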

The FFIDB class, which implements the public Java FFI API methods, makes use of FFIMethod and FFILayout to make the code for each individual method as idiomatic and efficient as possible. This class also contains java.lang.foreign.MemorySession and java.lang.foreign.SegmentAllocator objects which control the lifetime of native memory sessions and allow us to allocate lifetime-limited native memory which can be written and read by Java, and passed to native methods.

At the user level, we then present a method which wraps the details of use of FFIMethod and FFILayout to implement our single, core Java API get() method

 public GetPinnableSlice getPinnableSlice(final ReadOptions readOptions,
      final ColumnFamilyHandle columnFamilyHandle, final MemorySegment keySegment,
      final GetParams getParams)

The flow of implementation of getPinnableSlice(), in common with any other core RocksDB FFI API method becomes:

Allocate MemorySegments for C++ structures using Layouts from FFILayout
Write to the allocated structures using VarHandles from FFILayout
Invoke the native method using the MethodHandle from FFIMethod and addresses of instantiated MemorySegments, or value types, as parameters
Read the call result and the output parameter(s), again using VarHandles from FFILayout to perform the mapping.

For the getPinnableSlice() method, on successful return from an invocation of rocksdb_ffi_get_pinnable(), the PinnableSlice object will contain the data and size fields of a pinnable slice (see below) containing the requested value. A MemorySegment referring to the native memory of the pinnable slice is then constructed, and used by the client to retrieve the value in whatever fashion they choose.

Pinnable Slices

RocksDB offers core (C++) API methods using the concept of a PinnableSlice to return fetched data values while reducing copies to a minimum. We take advantage of this to base our central get() method(s) on PinnableSlices. Methods mirroring the existing JNI-based API can then be implemented in pure Java by wrapping the core getPinnableSlice().

So we implement

public record GetPinnableSlice(Status.Code code, Optional<FFIPinnableSlice> pinnableSlice) {}

public GetPinnableSlice getPinnableSlice(
      final ColumnFamilyHandle columnFamilyHandle, final byte[] key)

and we wrap that to provide

public record GetBytes(Status.Code code, byte[] value, long size) {}

public GetBytes get(final ColumnFamilyHandle columnFamilyHandle, final byte[] key)

Benchmark Results

We extended existing RocksDB Java JNI benchmarks with new benchmarks based on FFI. The command below performs a full benchmark run, including the new benchmarks; we ran it on Ubuntu.

java --enable-preview --enable-native-access=ALL-UNNAMED -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=100000 -p keySize=128 -p valueSize=4096,65536 -p columnFamilyTestType="no_column_family" -rf csv org.rocksdb.jmh.GetBenchmarks

JNI vs FFI

Discussion

We have plotted the performance (more operations is better) of a selection of benchmarks:

q "select Benchmark,Score from ./plot/jmh-result-fixed.csv where \"Param: keyCount\"=100000 and \"Param: valueSize\"=65536" -d, -H

JNI versions of benchmarks are previously implemented jmh benchmarks for measuring the performance of the current RocksDB Java interface.
FFI versions of benchmarks are equivalent benchmarks (as far as possible) implemented using the FFI mechanisms.

We can see that for all benchmarks which have equivalent FFI and JNI pairs, the JNI version is only very marginally faster. FFI has successfully optimized away most of the extra safety-checking of the new invocation mechanism.

Our initial implementation of FFI benchmarks lagged the JNI benchmarks quite significantly, but we have received extremely helpful support from Maurizio Cimadamore of the Panama Dev team, to help us optimize the performance of our FFI implementation. We consider that the small remaining performance gap is a feature of the remaining extra bounds checking of FFI.

For basic get() the result buffer is allocated by the method, so that there is a cost of allocation associated with each request.

ffiGet vs get
The JNI version is very marginally faster than FFI

For preallocated get(), where the result buffer is supplied to the method, we avoid an allocation of a fresh result buffer on each call, and the test recycles its result buffers. The same small difference persists:

JNI is very marginally faster than FFI
preallocatedGet() is a lot faster than basic get()

We implemented some methods where the key for the get() is randomized, so that any ordering effects can be accounted for. The same differences persisted.

The FFI interface gives us a natural way to expose RocksDB’s pinnable slice mechanism. When we provide a benchmark which accesses the raw PinnableSlice API, as expected this is the fastest method of any; however we are not comparing like with like:

ffiGetPinnableSlice() returns a handle to the RocksDB memory containing the slice, and presents that as an FFI MemorySegment. No copying of the memory in the segment occurs.

As noted above, we implement the new FFI-based get() methods using the new FFI-based getPinnableSlice() method, and copying out the result. So the ffiGet and ffiPreallocatedGet benchmarks use this mechanism underneath.

In an effort to discover whether using the Java APIs to copy from the pinnable slice backed MemorySegment was a problem, we implemented a separate ffiGetOutputSlice() benchmark which copies the result into a (Java allocated native memory) segment at the C++ side.

ffiGetOutputSlice() is faster than ffiPreallocatedGet() and is in fact at least as fast as preallocatedGet(), which is an almost exact analogue in the JNI world.

So it appears that we can build an FFI-based API with equal performance to the JNI-based one.

Thinking about the (very small, but probably statistically significant) difference between our ffiGetPinnableSlice()-based FFI calls and the JNI-based calls, it is reasonable to expect that some of the cost is the extra FFI call to C++ to release the pinned slice as a separate operation. A null FFI method call is extremely fast, but it does take some time.

We would recommend looking again at the performance of the FFI-based implementation when Panama is released post-Preview in Java 22. It seems that, at least with Java 20, the performance of our FFI benchmarks is not significantly different from that of the Java 19 version.

Copies versus Calls

The second method call over the FFI boundary to release a pinnable slice has a cost. We compared the ffiGetOutputSlice() and ffiGetPinnableSlice() benchmarks in order to examine this cost. We ran them with a fixed key size (128 bytes), since the key size is likely to be pretty much irrelevant anyway, and varied the value size read from 16 bytes to 16k. We found a crossover point for performance between 1k and 4k:

Plot

ffiGetOutputSlice() is faster when values read are 1k in size or smaller. The cost of an extra copy in the C++ side from the pinnable slice buffer into the supplied buffer allocated by Java Foreign Memory API is less than the cost of the extra call to release a pinnable slice.
ffiGetPinnableSlice() is faster when values read are 4k in size, or larger. Consistent with intuition, the advantage grows with larger read values.

The way that the RocksDB API is constructed means that of the 2 methods compared, ffiGetOutputSlice() will always make exactly 1 more copy than ffiGetPinnableSlice(). The underlying RocksDB C++ API will always copy into its own temporary buffer if it decides that it cannot pin an internal buffer, and that will be returned as the pinnable slice. There is a potential optimization where the temporary buffer could be replaced by an output buffer, such as that supplied by ffiGetOutputSlice(); in practice that is a hard fix to hack in. Its effectiveness depends on how often RocksDB fails to pin an internal buffer.

A solution which either filled a buffer or returned a pinnable slice would give us the best of both worlds.

Other Conclusions

Build Processing

It is easier to implement an interface using FFI than JNI. No intermediate build processing or code generation steps were needed to implement this prototype.
For a production version, we would urge using jextract to automate the process of generating Java API methods from the set of supporting stubs we generate.

Safety

The use of jextract will give a similar level of type security to the use of JNI, when crossing the language boundary. But we do not believe FFI is significantly more type-safe than JNI for method invocation. Neither is it less safe, though.

Native Memory

Panama’s Foreign-Memory Access API appears to us to be the most significant part of the whole project. At the Java side of RocksDB it gives us a clean mechanism (a MemorySegment) for holding RocksDB data (e.g. as from the result of a get() call) pending its forwarding to client code or network buffers.

We have taken advantage of this mechanism to provide the core FFIDB.getPinnableSlice() method in our Panama-based API. The rest of our prototype get() API, duplicating the existing JNI-based API, is then a Pure Java library on top of FFIDB.getPinnableSlice() and FFIPinnableSlice.reset().

The common standard for foreign memory opens up the possibility of efficient interoperation between RocksDB and Java clients (e.g. Kafka). We think that this is really the key to higher performing, more integrated Java-based systems:

This could result in data never being copied into Java memory, or a significant reduction in copies, as native MemorySegments are handed off between co-operating Java clients of fundamentally native APIs. This extra potential performance can be extremely useful when 2 or more clients are interoperating; we still need to provide a simplest possible API wrapping these calls (like our prototype get()), which operates at a similar level to the current Java API.
Some thought should be applied to how this architecture would interact with the cache layer(s) in RocksDB, and whether it can be accommodated within the present RocksDB architecture. How long can 3rd-party applications pin pages in the RocksDB cache without disrupting normal RocksDB behaviour (e.g. compaction)?

Summary

Panama/FFI (in Preview) is a highly capable technology for (re)building the RocksDB Java API, although the supported language level of RocksDB and the planned release schedule for Panama mean that it could not replace JNI in production for some time to come.
Panama/FFI would seem to offer comparable performance to JNI; there is no strong performance argument for a re-implementation of a standalone RocksDB Java API. But the opportunity to provide a natural pinnable slice-based API gives a lot of flexibility; not least because an efficient API could be built mostly in Java with only a small underlying layer implementing the pinnable slice interface.
Panama/FFI can remove some boilerplate (native method declarations) and allow Java programs to access C libraries without stub code, but calling a C++-based library still requires C stubs; a possible approach would be to use the RocksDB C API as the basis for a rebuilt Java API. This would allow us to remove all the existing JNI boilerplate, and concentrate support effort on the C API. An alternative approach would be to build a robust API based on Reference Counting, but using FFI.
Panama/FFI really shines as a foreign memory standard for a Java API that can allow efficient interoperation between RocksDB Java clients and other (Java and native) components of a system. Foreign Memory gives us a model for how to efficiently return data from RocksDB; as pinnable slices with their contents presented in MemorySegments. If we focus on designing an API for native interoperability we think this can be highly productive in opening RocksDB to new uses and opportunities in future.

Appendix

Code and Data

The Experimental Pull Request contains the source code implemented, together with further data plots and the source CSV files for all data plots.

Running

This is an example run; the jmh parameters (after -p) can be changed to measure performance with varying key counts, and key and value sizes.

java --enable-preview --enable-native-access=ALL-UNNAMED -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=100000 -p keySize=128 -p valueSize=4096,65536 -p columnFamilyTestType="no_column_family" -rf csv org.rocksdb.jmh.GetBenchmarks -wi 1 -to 1m -i 1

Processing

Use q to select the csv output for analysis and graphing.

Note that we edited the column headings for easier processing:

q "select Benchmark,Score,Error from ./plot/jmh-result.csv where keyCount=100000 and valueSize=65536" -d, -H -C readwrite

Java 19 installation

We followed the instructions to install Azul. Then select the correct instance of java locally:

sudo update-alternatives --config java
sudo update-alternatives --config javac

And set JAVA_HOME appropriately. In my case, sudo update-alternatives --config java listed a few JVMs thus:

  0            /usr/lib/jvm/bellsoft-java8-full-amd64/bin/java   20803123  auto mode
  1            /usr/lib/jvm/bellsoft-java8-full-amd64/bin/java   20803123  manual mode
  2            /usr/lib/jvm/java-11-openjdk-amd64/bin/java       1111      manual mode
* 3            /usr/lib/jvm/zulu19/bin/java                      2193001   manual mode

For our environment, we set this:

export JAVA_HOME=/usr/lib/jvm/zulu19

The default version of Maven available in the Ubuntu package repositories (3.6.3) is incompatible with Java 19. You will need to install a later Maven, and use it. I used 3.8.7 successfully.

Java 20, 21, 22 and subsequent versions

The FFI version we used was a preview in Java 19, and the interface has changed through to Java 22, where it has been finalized. Future work with this prototype will need to update the code to use the changed interface.

Java API Performance Improvements

Posted November 06, 2023

RocksDB Java API Performance Improvements

Evolved Binary has been working on several aspects of how the Java API to RocksDB can be improved. Two aspects of this which are of particular importance are performance and the developer experience.

We have built some synthetic benchmark code to determine which are the most efficient methods of transferring data between Java and C++.
We have used the results of the synthetic benchmarking to guide plans for rationalising the API interfaces.
We have made some opportunistic performance optimizations/fixes within the Java API which have already yielded noticeable improvements.

Synthetic JNI API Performance Benchmarks

The synthetic benchmark repository contains tests designed to isolate the Java to/from C++ interaction of a canonical data intensive Key/Value Store implemented in C++ with a Java (JNI) API layered on top.

JNI provides several mechanisms for allowing transfer of data between Java buffers and C++ buffers. These mechanisms are not trivial, because they require the JNI system to ensure that Java memory under the control of the JVM is not moved or garbage collected whilst it is being accessed outside the direct control of the JVM.

We set out to determine which of multiple options for transfer of data from C++ to Java and vice-versa were the most efficient. We used the Java Microbenchmark Harness to set up repeatable benchmarks to measure all the options.

We explore these and some other potential mechanisms in the detailed results (in our Synthetic JNI performance repository).

We summarise this work here:

The Model

In C++ we represent the on-disk data as an in-memory map of (key, value) pairs.
For a fetch query, we expect the result to be a Java object with access to the contents of the value. This may be a standard Java object which does the job of data access (a byte[] or a ByteBuffer) or an object of our own devising which holds references to the value in some form (a FastBuffer pointing to sun.misc.Unsafe unsafe memory, for instance).

Data Types

There are several potential data types for holding data for transfer, and they are unsurprisingly quite connected underneath.

Byte Array

The simplest data container is a raw array of bytes (byte[]).

There are 3 different mechanisms for transferring data between a byte[] and C++:

At the C++ side, the method JNIEnv.GetPrimitiveArrayCritical() allows access to a C++ pointer to the underlying array.
The JNIEnv methods GetByteArrayElements() and ReleaseByteArrayElements() fetch references/copies to and from the contents of a byte array, with less concern for critical sections than the critical methods, though they are consequently more likely/certain to result in (extra) copies.
The JNIEnv methods GetByteArrayRegion() and SetByteArrayRegion() transfer raw C++ buffer data to and from the contents of a byte array. These must ultimately do some data pinning for the duration of copies; the mechanisms may be similar or different to the critical operations, and therefore performance may differ.

Byte Buffer

A ByteBuffer abstracts the contents of a collection of bytes, and was in fact introduced to support a range of higher-performance I/O operations in some circumstances.

There are 2 types of byte buffers in Java, indirect and direct. Indirect byte buffers are the standard, and the memory they use is on-heap as with all usual Java objects. In contrast, direct byte buffers are used to wrap off-heap memory which is accessible to direct network I/O. Either type of ByteBuffer can be allocated at the Java side, using the allocate() and allocateDirect() methods respectively.

Direct byte buffers can be created in C++ using the JNI method JNIEnv.NewDirectByteBuffer() to wrap some native (C++) memory.

Direct byte buffers can be accessed in C++ using JNIEnv.GetDirectBufferAddress(), and their capacity obtained using JNIEnv.GetDirectBufferCapacity().
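The distinction between the two kinds of buffer is visible from the Java side. A small sketch follows; the class name is illustrative, and the hasArray() result for a direct buffer is what the standard JDK implementation reports.

```java
import java.nio.ByteBuffer;

// Illustrative only: shows how heap (indirect) and direct buffers differ.
final class ByteBufferKinds {
  static String describe(final ByteBuffer buffer) {
    return "direct=" + buffer.isDirect() + ", hasArray=" + buffer.hasArray();
  }

  public static void main(final String[] args) {
    // On-heap buffer, backed by an ordinary byte[].
    final ByteBuffer indirect = ByteBuffer.allocate(4096);
    // Off-heap buffer; native code can be handed its address directly.
    final ByteBuffer direct = ByteBuffer.allocateDirect(4096);

    System.out.println("indirect: " + describe(indirect));
    System.out.println("direct:   " + describe(direct));
  }
}
```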

Unsafe Memory

The call sun.misc.Unsafe.allocateMemory() returns a handle which is (of course) just a pointer to raw memory, and can be used as such on the C++ side. We could turn it into a byte buffer on the C++ side by calling JNIEnv.NewDirectByteBuffer(), or simply use it as a native C++ buffer at the expected address, assuming we record or remember how much space was allocated.

A custom FastBuffer class provides access to unsafe memory from the Java side.

Allocation

For these benchmarks, allocation has been excluded from the benchmark costs by pre-allocating a quantity of buffers of the appropriate kind as part of the test setup. Each run of the benchmark acquires an existing buffer from a pre-allocated FIFO list, and returns it afterwards. A small test has confirmed that the request and return cycle is of insignificant cost compared to the benchmark API call.
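A sketch of the kind of pre-allocated FIFO pool described here is shown below. This is not the benchmark repository's code; the class and method names are hypothetical, and direct ByteBuffers stand in for whichever buffer type is under test.

```java
import java.nio.ByteBuffer;
import java.util.ArrayDeque;

// Hypothetical FIFO pool: all allocation happens up front, during setup.
final class BufferPool {
  private final ArrayDeque<ByteBuffer> free = new ArrayDeque<>();

  BufferPool(final int count, final int capacity) {
    for (int i = 0; i < count; i++) {
      free.addLast(ByteBuffer.allocateDirect(capacity));
    }
  }

  // Takes the buffer at the head of the queue; never allocates.
  ByteBuffer acquire() {
    final ByteBuffer buffer = free.pollFirst();
    if (buffer == null) {
      throw new IllegalStateException("pool exhausted");
    }
    buffer.clear(); // reset position and limit for reuse
    return buffer;
  }

  // Returns a buffer to the tail of the queue.
  void release(final ByteBuffer buffer) {
    free.addLast(buffer);
  }

  int available() {
    return free.size();
  }

  public static void main(final String[] args) {
    final BufferPool pool = new BufferPool(4, 4096);
    final ByteBuffer buffer = pool.acquire();
    // ... a benchmark iteration would fill and read the buffer here ...
    pool.release(buffer);
    System.out.println("available: " + pool.available()); // prints "available: 4"
  }
}
```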

GetJNIBenchmark Performance

Benchmarks ran for a duration of the order of 6 hours on an otherwise unloaded VM; the error bars are small, so we can have strong confidence in the values derived and plotted.

Raw JNI Get small

Comparing all the benchmarks as the data size tends large, the conclusions we can draw are:

Indirect byte buffers add cost; they are effectively an overhead on plain byte[] and the JNI-side only allows them to be accessed via their encapsulated byte[].
SetRegion and GetCritical mechanisms for copying data into a byte[] are of very comparable performance; presumably the behaviour behind the scenes of SetRegion is very similar to that of declaring a critical region, doing a memcpy() and releasing the critical region.
GetElements methods for transferring data from C++ to Java are consistently less efficient than SetRegion and GetCritical.
Getting into a raw memory buffer, passed as an address (the handle of an Unsafe or of a netty ByteBuf) is of similar cost to the more efficient byte[] operations.
Getting into a direct nio.ByteBuffer is of similar cost again; while the ByteBuffer is passed over JNI as an ordinary Java object, JNI has a specific method for getting hold of the address of the direct buffer, and using this, the get() cost with a ByteBuffer is just that of the underlying C++ memcpy().

At small(er) data sizes, we can see whether other factors are important.

Raw JNI Get large

Indirect byte buffers are the most significant overhead here. Again, we can conclude that this is due to pure overhead compared to byte[] operations.
At the lowest data sizes, netty ByteBufs and unsafe memory are marginally more efficient than byte[]s or (slightly less efficient) direct nio.ByteBuffers. This may be explained by even the small cost of calling the JNI model on the C++ side simply to acquire a direct buffer address. The margins (nanoseconds) here are extremely small.

Post processing the results

Our benchmark model for post-processing is to transfer the results into a byte[]. Where the result is already a byte[] this may seem like an unfair extra cost, but the aim is to model the least cost processing step for any kind of result.

Copying into a byte[] using the bulk methods supported by byte[] and nio.ByteBuffer gives comparable performance.
Accessing the contents of an Unsafe buffer using the supplied unsafe methods is inefficient. The access is word by word, in Java.
Accessing the contents of a netty ByteBuf is similarly inefficient; again the access is presumably word by word, using normal Java mechanisms.
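The difference between bulk and element-wise access can be sketched as follows, using a direct ByteBuffer as the source. The class is illustrative only: it shows the two access patterns and makes no claim about timings, and for simplicity the slow path copies byte by byte rather than word by word.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative only: two ways of copying a result out into a byte[].
final class CopyOut {
  // Bulk copy: a single call transfers all remaining bytes.
  static byte[] bulk(final ByteBuffer source) {
    final ByteBuffer view = source.duplicate(); // leave the caller's position untouched
    final byte[] out = new byte[view.remaining()];
    view.get(out);
    return out;
  }

  // Element-wise copy: one Java-level access per byte.
  static byte[] perByte(final ByteBuffer source) {
    final ByteBuffer view = source.duplicate();
    final byte[] out = new byte[view.remaining()];
    for (int i = 0; i < out.length; i++) {
      out[i] = view.get();
    }
    return out;
  }

  public static void main(final String[] args) {
    final ByteBuffer source = ByteBuffer.allocateDirect(4096);
    while (source.hasRemaining()) {
      source.put((byte) source.position());
    }
    source.flip();
    System.out.println(Arrays.equals(bulk(source), perByte(source))); // prints true
  }
}
```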

Copy out JNI Get

PutJNIBenchmark

We benchmarked Put methods in a similar synthetic fashion in less depth, but enough to confirm that the performance profile is similar/symmetrical. As with get(), using GetElements is the least performant way of implementing transfers to/from Java objects in C++/JNI, and other JNI mechanisms do not differ greatly one from another.

Lessons from Synthetic API

Performance analysis shows that for get(), fetching into an allocated byte[] is as efficient as any other mechanism, as long as JNI region methods are used for the internal data transfer. Copying out or otherwise using the result on the Java side is straightforward and efficient. Using byte[] avoids the manual memory management required with direct nio.ByteBuffers, and that extra work does not appear to provide any gain. A C++ implementation using the GetRegion JNI method is probably to be preferred to one using GetCritical because, while their performance is equal, GetRegion is a higher-level/simpler abstraction.

Vitally, whatever JNI transfer mechanism is chosen, the buffer allocation mechanism and pattern is crucial to achieving good performance. We experimented with making use of netty’s pooled allocator part of the benchmark, and the difference of getIntoPooledNettyByteBuf, using the allocator, compared to getIntoNettyByteBuf using the same pre-allocate on setup as every other benchmark, is significant.

Equally importantly, transfer of data to or from buffers should where possible be done in bulk, using array copy or buffer copy mechanisms. Thought should perhaps be given to supporting common transformations in the underlying C++ layer.

API Recommendations

Of course there is some noise within the results, but we can agree:

Don’t make copies you don’t need to make
Don’t allocate/deallocate when you can avoid it

Translating this into designing an efficient API, we want to:

Support API methods that return results in buffers supplied by the client.
Support byte[]-based APIs as the simplest way of getting data into a usable configuration for a broad range of Java use.
Support direct ByteBuffers as these can reduce copies when used as part of a chain of ByteBuffer-based operations. This sort of sophisticated streaming model is most likely to be used by clients where performance is important, and so we decide to support it.
Support indirect ByteBuffers for a combination of reasons:
- API consistency between direct and indirect buffers
- Simplicity of implementation, as we can wrap byte[]-oriented methods
Continue to support methods which allocate return buffers per-call, as these are the easiest to use on initial encounter with the RocksDB API.

High performance Java interaction with RocksDB ultimately requires architectural decisions by the client:

Use more complex (client supplied buffer) API methods where performance matters
Don’t allocate/deallocate where you don’t need to
- recycle your own buffers where this makes sense
- or make sure that you are supplying the ultimate destination buffer (your cache, or a target network buffer) as input to RocksDB get() and put() calls
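As a sketch of the two styles of API from the client's point of view, the toy in-memory store below offers both an allocating get() and a get() into a client-supplied buffer. It is a hypothetical stand-in, not the RocksDB API. The second form returns the full size of the value, so the caller can detect truncation when the supplied buffer is too small.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a key/value store, for illustration only.
final class ToyStore {
  static final int NOT_FOUND = -1;

  private final Map<String, byte[]> data = new HashMap<>();

  void put(final String key, final byte[] value) {
    data.put(key, value.clone());
  }

  // Easy to use, but allocates a fresh result on every call.
  byte[] get(final String key) {
    final byte[] value = data.get(key);
    return value == null ? null : value.clone();
  }

  // Copies into the caller's buffer; returns the full value size,
  // which may exceed dest.length if the buffer was too small.
  int get(final String key, final byte[] dest) {
    final byte[] value = data.get(key);
    if (value == null) {
      return NOT_FOUND;
    }
    System.arraycopy(value, 0, dest, 0, Math.min(value.length, dest.length));
    return value.length;
  }

  public static void main(final String[] args) {
    final ToyStore store = new ToyStore();
    store.put("a", "alpha".getBytes(StandardCharsets.UTF_8));
    store.put("b", "beta".getBytes(StandardCharsets.UTF_8));

    final byte[] buffer = new byte[16]; // allocated once, recycled for every read
    for (final String key : new String[] {"a", "b", "missing"}) {
      final int size = store.get(key, buffer);
      if (size == NOT_FOUND) {
        System.out.println(key + " -> not found");
      } else {
        final int length = Math.min(size, buffer.length);
        System.out.println(key + " -> " + new String(buffer, 0, length, StandardCharsets.UTF_8));
      }
    }
  }
}
```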

We are currently implementing a number of extra methods consistently across the Java fetch and store APIs to RocksDB, according to these principles, in the PR "Java API consistency between RocksDB.put(), .merge() and Transaction.put(), .merge()".

Optimizations

Reduce Copies within API Implementation

Having analysed JNI performance as described, we reviewed the core of RocksJNI for opportunities to improve the performance. We noticed one thing in particular; some of the get() methods of the Java API had not been updated to take advantage of the new PinnableSlice methods.

Fixing this turned out to be a straightforward change, which has now been incorporated in the codebase: "Improve Java API get() performance by reducing copies".

Performance Results

Using the JMH performance tests we updated as part of the above PR, we can see a small but consistent improvement in performance for all of the different get method variants which we have enhanced in the PR.

java -jar target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar -p keyCount=1000,50000 -p keySize=128 -p valueSize=1024,16384 -p columnFamilyTestType="1_column_family" GetBenchmarks.get GetBenchmarks.preallocatedByteBufferGet GetBenchmarks.preallocatedGet

The y-axis shows ops/sec in throughput, so higher is better.

Analysis

Before the invention of the Pinnable Slice the simplest RocksDB (native) API Get() looked like this:

Status Get(const ReadOptions& options,
                           ColumnFamilyHandle* column_family, const Slice& key,
                           std::string* value)

After PinnableSlice the correct way for new code to implement a get() is like this

Status Get(const ReadOptions& options,
                    ColumnFamilyHandle* column_family, const Slice& key,
                    PinnableSlice* value)

But of course RocksDB has to support legacy code, so there is an inline method in db.h which re-implements the former using the latter. And the RocksJava API implementation seamlessly continues to use the std::string-based get().

Let’s examine what happens when get() is called from Java

jint Java_org_rocksdb_RocksDB_get__JJ_3BII_3BIIJ(
   JNIEnv* env, jobject, jlong jdb_handle, jlong jropt_handle, jbyteArray jkey,
   jint jkey_off, jint jkey_len, jbyteArray jval, jint jval_off, jint jval_len,
   jlong jcf_handle)

Create an empty std::string value
Call DB::Get() using the std::string variant
Copy the resultant std::string into Java, using the JNI SetByteArrayRegion() method

So stage (3) costs us a copy into Java. It’s mostly unavoidable that there will be at least the one copy from a C++ buffer into a Java buffer.

But what does stage (2) do?

Create a PinnableSlice(std::string&) which uses the value as the slice’s backing buffer.
Call DB::Get() using the PinnableSlice variant
Work out if the slice has pinned data, in which case copy the pinned data into value and release it.
..or, if the slice has not pinned data, it is already in value (because we tried, but couldn’t pin anything).

So stage (2) costs us a copy into a std::string. But! It’s just a naive std::string that we have copied a large buffer into. And in RocksDB, the buffer is or can be large, so an extra copy is something we need to worry about.

Luckily this is easy to fix. In the Java API (JNI) implementation:

1. Create a PinnableSlice() which uses its own default backing buffer.
2. Call DB::Get() using the PinnableSlice variant of the RocksDB API.
3. Copy the data indicated by the PinnableSlice straight into the Java output buffer using the JNI SetByteArrayRegion() method, then release the slice. In detail:
   a. Work out if the slice has successfully pinned data, in which case copy the pinned data straight into the Java output buffer using the JNI SetByteArrayRegion() method, then release the pin.
   b. ...or, if the slice has not pinned data, it is in the pinnable slice’s default backing buffer. All that is left is to copy it straight into the Java output buffer using the JNI SetByteArrayRegion() method.
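
The copy counts of the two approaches can be illustrated with a small mock. This is only a sketch under stated assumptions: MockSlice, MockGet, CopiesViaString and CopiesViaSlice are hypothetical names, not RocksDB or JNI APIs, and a std::vector<char> stands in for the Java output buffer.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a pinnable slice: a view of the value plus a flag
// saying whether it points at pinned (engine-owned) memory.
struct MockSlice {
  const char* data;
  std::size_t size;
  bool pinned;
};

// Hypothetical stand-in for DB::Get(). If the value can be pinned, the slice
// points straight at `stored` (no copy). Otherwise the value is copied into
// `*backing` and the slice points there. `*copies` counts value copies.
MockSlice MockGet(const std::string& stored, bool can_pin,
                  std::string* backing, int* copies) {
  if (can_pin) {
    return MockSlice{stored.data(), stored.size(), true};
  }
  *backing = stored;
  ++*copies;
  return MockSlice{backing->data(), backing->size(), false};
}

// Old path: the slice is backed by an intermediate std::string.
int CopiesViaString(const std::string& stored, bool can_pin,
                    std::vector<char>* java_buf) {
  int copies = 0;
  std::string value;
  MockSlice slice = MockGet(stored, can_pin, &value, &copies);
  if (slice.pinned) {
    value.assign(slice.data, slice.size);  // pinned data copied into value
    ++copies;
  }
  // Stands in for the JNI SetByteArrayRegion() copy.
  java_buf->assign(value.begin(), value.end());
  ++copies;
  return copies;
}

// New path: copy from wherever the slice points, pinned or not.
int CopiesViaSlice(const std::string& stored, bool can_pin,
                   std::vector<char>* java_buf) {
  int copies = 0;
  std::string backing;  // the slice's own default buffer
  MockSlice slice = MockGet(stored, can_pin, &backing, &copies);
  // Stands in for the JNI SetByteArrayRegion() copy.
  java_buf->assign(slice.data, slice.data + slice.size);
  ++copies;
  return copies;
}
```

In this mock the old path always makes two copies, while the new path makes one copy when the data is pinned and two when it is not.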

In the case where the PinnableSlice has successfully pinned the data, this saves us the intermediate copy to the std::string. In the case where it hasn’t, we still have the extra copy, so the observed performance improvement depends on when the data can be pinned. Luckily, our benchmarking suggests that the pin is happening in a significant number of cases.

On discussion with the RocksDB core team we understand that the core PinnableSlice optimization is most likely to succeed when pages are loaded from the block cache, rather than when they are in the memtable. It might be possible to successfully pin in the memtable as well, with some extra coding effort. This would likely improve the results for these benchmarks.

Jay Zhuang

Time-Aware Tiered Storage in RocksDB

Posted November 09, 2022

TL;DR

Tiered storage is now natively supported in RocksDB through the option last_level_temperature. The time-aware tiered storage feature, enabled with the option preclude_last_level_data_seconds, guarantees that recently written data is placed in the hot tier.

Background

RocksDB Tiered Storage assigns a data temperature when creating a new SST file, which hints to the file system to put the data on the corresponding storage media, so the data in a single DB instance can be placed on different storage media. Before this feature, a user would typically create multiple DB instances for different storage media: for example, one DB instance stores the recent hot data, and the data is migrated to another cold DB instance when it becomes cold. Tracking and migrating the data could be challenging. With the RocksDB tiered storage feature, RocksDB compaction migrates the data from hot storage to cold storage.

Currently, RocksDB supports assigning the last level file temperature. In an LSM tree, the last level data is typically the coldest, as the most recent data is on the higher levels and is gradually compacted to the lower levels. The higher level data is more likely to be read, because:

RocksDB read always queries from the higher level to the lower level until it finds the data;
The high-level data is much more likely to be read and written by the compactions.

Problem

Generally in an LSM tree, hotter data is likely to be on the higher levels, as mentioned before, but this is not always the case. For example, with a skewed dataset, the recent data could be compacted to the last level first. With universal compaction, a major compaction compacts all data to the last level (the cold tier), including recent data that should be categorized as hot data. In production, we found that the majority of the compaction load is actually major compaction (more than 80%).

Goal and Non-goals

It’s hard to predict which data is hot and which is cold. The most frequently accessed data should be categorized as hot data, but it is hard to predict which keys are going to be accessed most, and it is also hard to track per-key access history. The time-aware tiered storage feature focuses only on the use cases where more recent data is more likely to be accessed, which is the majority of cases, but not all.

User APIs

Here are the 3 main tiered storage options:

Temperature last_level_temperature = Temperature::kUnknown;
uint64_t preclude_last_level_data_seconds = 0;
uint64_t preserve_internal_time_seconds = 0;

last_level_temperature defines the data temperature for the last level SST files, which is typically kCold or kWarm. RocksDB doesn’t check the option value; instead, it just passes it to the file_system API with FileOptions.temperature when creating the last level SST files. For all other files (non-last-level SST files, and non-SST files like manifest files) the temperature is set to kUnknown, which typically maps to hot data. The user can also get each SST’s temperature information through these APIs:

db.GetLiveFilesStorageInfo();
db.GetLiveFilesMetaData();
db.GetColumnFamilyMetaData();

User Metrics

Here are the tiered storage related statistics:

HOT_FILE_READ_BYTES,
WARM_FILE_READ_BYTES,
COLD_FILE_READ_BYTES,
HOT_FILE_READ_COUNT,
WARM_FILE_READ_COUNT,
COLD_FILE_READ_COUNT,
// Last level and non-last level statistics
LAST_LEVEL_READ_BYTES,
LAST_LEVEL_READ_COUNT,
NON_LAST_LEVEL_READ_BYTES,
NON_LAST_LEVEL_READ_COUNT,

And more details from IOStats:

struct FileIOByTemperature {
  // the number of bytes read to Temperature::kHot file
  uint64_t hot_file_bytes_read;
  // the number of bytes read to Temperature::kWarm file
  uint64_t warm_file_bytes_read;
  // the number of bytes read to Temperature::kCold file
  uint64_t cold_file_bytes_read;
  // total number of reads to Temperature::kHot file
  uint64_t hot_file_read_count;
  // total number of reads to Temperature::kWarm file
  uint64_t warm_file_read_count;
  // total number of reads to Temperature::kCold file
  uint64_t cold_file_read_count;
};

Implementation

There are 2 main components for this feature. One is the time tracking, and the other is the per-key placement compaction. These 2 components are relatively independent and are linked together during the compaction initialization phase, which gets the sequence number used to split the hot and cold data. The time-tracking component can even be enabled independently by setting the option preserve_internal_time_seconds. The purpose of that is to collect time information before migrating existing use cases to the tiered storage feature, which avoids compacting the existing hot data to the cold tier (detailed in the Migration section below).

Unlike the user-defined timestamp feature, the time tracking feature doesn’t have accurate time information for each key. It only samples the time information and gives a rough estimation for the key write time. Here is the high-level graph for the implementation:

Time Tracking

Time tracking information is recorded by a periodic task which gets the latest sequence number and the current time and then stores the pair in an in-memory data structure. The interval of the periodic task is determined by taking the user setting preserve_internal_time_seconds and dividing it by 100. For example, if 3 days of data should be precluded from the last level, then the interval of the periodic task is about 0.7 hours (3 * 24 / 100 = 0.72), which also means only the latest 100 seq->time pairs are needed in memory.

Currently, the in-memory seq_time_mapping is only used during Flush() and encoded to the SST property. The data is delta encoded and again a maximum of 100 pairs are stored, so the extra data size is pretty small (far less than 1KB per SST) and only non-last-level SSTs need to have that information. Internally, RocksDB also uses the minimal sequence number and SST creation time from the SST metadata to improve the time accuracy. The sequence number to time information is distributed in each SST, ranging from the min seqno to max seqno for that SST file, so each SST has its self-contained time information. This also means there could be redundancy in the time information: for example, if 2 SSTs have overlapping sequence number ranges (which is very likely for non-L0 files), the same seq->time pair may exist in both SSTs. In the future, the time information could also be useful for other potential features, like a better estimate of the oldest timestamp for an SST, which is critical for the RocksDB TTL feature.
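
As a worked example of the sampling described above, here is a minimal sketch of the interval arithmetic and of a bounded in-memory list of seq->time pairs. The names kMaxPairs, SamplingIntervalSeconds and SeqTimeSamples are hypothetical and are not RocksDB identifiers; the real data structure may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>

// Assumption taken from the text: the window is divided by 100, so at most
// 100 samples are needed to cover it.
constexpr std::size_t kMaxPairs = 100;

// Sampling interval in seconds for a given tracking window in seconds.
constexpr std::uint64_t SamplingIntervalSeconds(std::uint64_t window_seconds) {
  return window_seconds / kMaxPairs;
}

// Keeps only the most recent kMaxPairs (seqno, unix time) samples.
class SeqTimeSamples {
 public:
  void Record(std::uint64_t seqno, std::uint64_t unix_time) {
    pairs_.emplace_back(seqno, unix_time);
    if (pairs_.size() > kMaxPairs) {
      pairs_.pop_front();  // drop the oldest sample
    }
  }

  std::size_t size() const { return pairs_.size(); }

  // Precondition: at least one sample has been recorded.
  std::uint64_t oldest_seqno() const { return pairs_.front().first; }

 private:
  std::deque<std::pair<std::uint64_t, std::uint64_t>> pairs_;
};
```

For a 3 day window, SamplingIntervalSeconds(3 * 24 * 3600) is 2592 seconds, which is the 0.72 hours mentioned above.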

Per-Key Placement Compaction

Compared to normal compaction, which only outputs data to a single level, per-key placement compaction can output data to 2 different levels. As per-key placement compaction is only used for last level compaction, the 2 output levels are always the penultimate level and the last level. The compaction places each key in its corresponding tier by simply checking the key’s sequence number.

At the beginning of the compaction, the compaction job collects the seq->time information from every input SST and merges it together. Then, based on the current time, it gets the oldest sequence number that should be put into the non-last level (hot tier). During the last level compaction, as long as a key is newer than the oldest_sequence_number, it is placed in the penultimate level (hot tier) instead of the last level (cold tier).
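
The placement decision can be sketched as follows. This is an illustration under assumptions, not the RocksDB implementation: the samples are assumed to be already merged and sorted by sequence number, the cutoff is taken to be the largest sampled sequence number whose time is at or before (now - preclude_seconds), and ColdCutoffSeqno, PlaceKey and OutputLevel are hypothetical names.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// (sequence number, unix time) sample, sorted ascending by both fields.
using SeqTime = std::pair<std::uint64_t, std::uint64_t>;

enum class OutputLevel { kPenultimate, kLast };

// Returns the largest sampled seqno known to be older than the preclude
// window, or 0 if no sample is old enough (then every key counts as hot).
std::uint64_t ColdCutoffSeqno(const std::vector<SeqTime>& samples,
                              std::uint64_t now,
                              std::uint64_t preclude_seconds) {
  const std::uint64_t cutoff_time =
      now > preclude_seconds ? now - preclude_seconds : 0;
  std::uint64_t cutoff_seqno = 0;
  for (const SeqTime& sample : samples) {
    if (sample.second > cutoff_time) {
      break;  // this and all later samples fall inside the hot window
    }
    cutoff_seqno = sample.first;
  }
  return cutoff_seqno;
}

// Keys newer than the cutoff stay in the penultimate level (hot tier).
OutputLevel PlaceKey(std::uint64_t key_seqno, std::uint64_t cutoff_seqno) {
  return key_seqno > cutoff_seqno ? OutputLevel::kPenultimate
                                  : OutputLevel::kLast;
}
```

Because the time information is sampled, the split is approximate: a key slightly older than the window can still land in the hot tier, but in this sketch a key written inside the window is never sent to the last level.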

Note that RocksDB also places the keys that are within a user snapshot in the hot tier. There are a few reasons for that:

It’s reasonable to assume snapshot-protected data are hot data;
It avoids mixing data whose sequence number has not been zeroed out with old last-level data, which is desirable to reduce the oldest obsolete data time (defined as the oldest SST time that has a non-zero sequence number). It also means tombstones are always placed in the hot tier, which is also desirable as they should be pretty small.
The original motivation was to avoid moving data from a lower level to a higher level in case the user increases preclude_last_level_data_seconds, so that snapshot-protected data in the last level becomes hot again and has to move to a higher level. It’s not always safe to move data from a lower level to a higher level in the LSM tree, as it could cause a key conflict. Later we added a conflict check to allow the data to move up as long as there’s no key conflict, but then the movement is not guaranteed (see Migration for details).

Migration

Once the user enables the feature, it enables both time tracking and per-key placement compaction at the same time. As for the existing data, it can still be mismarked as cold data. To have a smooth migration to the feature, the user can enable the time-tracking feature first. For example, if the user plans to set preclude_last_level_data_seconds to 3 days, the user can enable time tracking 3 days earlier with preserve_internal_time_seconds. Then, when the tiered storage feature is enabled, it already has the time information for the last 3 days’ hot data, and per-key placement compaction won’t compact that data to the last level.

Just preserving the time information won’t prevent the data from being compacted to the last level (which should still be on the hot tier). Once the preclude_last_level_data_seconds and last_level_temperature features are enabled, some of the last-level data might need to move up. Currently, RocksDB just does a conflict check, so the hot/cold split in this case is not guaranteed.

Summary

The time-aware tiered storage feature guarantees that new data is placed in the hot tier, which is ideal for tiering use cases where the most recent data is likely to be the hot data. This is done by tracking the write time information and using per-key placement compaction to split the hot and cold data.

The tiered storage feature is actively being developed; any suggestions or PRs are welcome.

Acknowledgements

We thank Siying Dong and Andrew Kryczka for brainstorming and reviewing the feature design and implementation. And it was my fortune to work with the RocksDB team members!

Jay Zhuang

Reduce Write Amplification by Aligning Compaction Output File Boundaries

Posted October 31, 2022

TL;DR

By cutting the compaction output file earlier, or allowing it to grow larger than target_file_size, so that compaction output files align with the next level’s files, RocksDB can reduce WA (Write Amplification) by more than 10%. The feature is enabled by default after the user upgrades RocksDB to version 7.8.0+.

Background

RocksDB level compaction picks one file from the source level and compacts to the next level, which is a typical partial merge compaction algorithm. Compared to the full merge compaction strategy for example universal compaction, it has the benefits of smaller compaction size, better parallelism, etc. But it also has a larger write amplification (typically 20-30 times user data). One of the problems is wasted compaction at the beginning and ending:

In the diagram above, SST11 is selected for the compaction, it overlaps with SST20 to SST23, so all these files are selected for compaction. But the beginning and ending of the SST on Level 2 are wasted, which also means it will be compacted again when SST10 is compacting down. If the file boundaries are aligned, then the wasted compaction size could be reduced. On average, the wasted compaction is 1 file size: 0.5 at the beginning, and 0.5 at the end. Typically the average compaction fan-out is about 6 (with the default max_bytes_for_level_multiplier = 10), then 1 / (6 + 1) ~= 14% of compaction is wasted.

Implementation

To reduce such wasted compaction, RocksDB now tries to align the compaction output file to the next level’s file boundaries, so future compactions waste less work. For example, the above case might be cut like this:

The trade-off is that the file won’t be cut exactly after it exceeds target_file_size_base. Instead, it is more likely to be cut when it’s aligned with a next-level file’s boundary, so the file size might be more varied. It could be as small as 50% of target_file_size or as large as 2x target_file_size. This only impacts non-bottommost-level files, which should be only ~11% of the data. Internally, RocksDB tries to cut the file so its size is close to the target_file_size setting but also aligned with the next level boundary. When the compaction output file hits a next-level file boundary, either the beginning or ending boundary, it will be cut if:

current_size > ((5 * min(boundaries_num, 8) + 50) / 100) * target_file_size

(details)
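
The threshold condition above can be written as a small function. This is a sketch, not the RocksDB source: ShouldCutAtBoundary is a hypothetical name, and the boundary count is taken here to mean the number of next-level file boundaries the current output file has already crossed, which is an assumption.

```cpp
#include <algorithm>
#include <cstdint>

// Returns true if an output file of `current_size` bytes should be cut when
// it reaches a next-level file boundary. The threshold grows from 50% of the
// target size (no boundary crossed yet) to at most 90% (8 or more crossed).
bool ShouldCutAtBoundary(std::uint64_t current_size,
                         std::uint64_t boundaries_num,
                         std::uint64_t target_file_size) {
  const std::uint64_t percent =
      5 * std::min<std::uint64_t>(boundaries_num, 8) + 50;
  // Compare in integer arithmetic to avoid rounding the percentage:
  // current_size > (percent / 100) * target_file_size.
  return current_size * 100 > percent * target_file_size;
}
```

With a 32 MB target, a file that has crossed no boundaries is cut at a boundary once it is larger than 16 MB, while a file that has crossed 8 or more is only cut at a boundary once it is larger than 28.8 MB.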

The file size is also capped at 2x target_file_size: details. Another benefit of cutting the file earlier is having more trivial-move compactions, which move a file from a high level to a low level without compacting anything. Based on a compaction simulator test, the trivial move data is increased by 30% (but still less than 1% of compaction data is trivial move):

Based on the db_bench test, it can save ~12% of the compaction load. Here is the test command and result:

TEST_TMPDIR=/data/dbbench ./db_bench --benchmarks=fillrandom,readrandom -max_background_jobs=12 -num=400000000 -target_file_size_base=33554432

# baseline:
Flush(GB): cumulative 25.882, interval 7.216
Cumulative compaction: 285.90 GB write, 162.36 MB/s write, 269.68 GB read, 153.15 MB/s read, 2926.7 seconds

# with this change:
Flush(GB): cumulative 25.882, interval 7.753
Cumulative compaction: 249.97 GB write, 141.96 MB/s write, 233.74 GB read, 132.74 MB/s read, 2534.9 seconds

The feature is enabled by default after upgrading to RocksDB 7.8 or later, as it should have a limited impact on file size while giving a substantial write amplification improvement. If, in a rare case, you need to opt out, set:

options.level_compaction_dynamic_file_size = false;

Other Options and Benchmark

We also tested a few other options, starting with fixed thresholds of 75% and 50% of the target_file_size, then a dynamic threshold as explained above, but still limiting the file size to be smaller than the target_file_size.

Baseline (main branch before PR#10655);
Fixed Threshold 75%: after 75% of target file size, cut the file whenever it aligns with a low level file boundary;
Fixed Threshold 50%: reduce the threshold to 50% of target file size;
Dynamic Threshold: (5 * boundaries_num + 50) percent of target file size, maxed at 90%;
Dynamic Threshold + allow 2x the target file size (chosen option).

Test Environment and Data

To speed up the benchmark, we introduced a compaction simulator within RocksDB (details), which replaces the physical SST with in-memory data (a large bitset). This lets us test compaction more consistently. As it’s a simulator, it has its limitations:

it assumes each key-value has the same size;
no deletion (but has override);
doesn’t consider data compression;
single-threaded and finish all compactions before the next flush (so no write stall).

We use 3 kinds of datasets for the tests:

Random Data, has an override, evenly distributed;
Zipf distribution with alpha = 1.01, moderately skewed;
Zipf distribution with alpha = 1.2, highly skewed.

Write Amplification

As we can see, all options are better than the baseline. Option5 (brown) and option3 (green) have similar WA improvements. (The sudden WA drop at ~40G for the Random Dataset is because we enabled level_compaction_dynamic_level_bytes and the number of levels increased from 3 to 4; see the similar test result without level_compaction_dynamic_level_bytes enabled.)

File Size Distribution at the End of Test

This is the file size distribution at the end of the test, which loads about 100G data. As this change only impacts the non-bottommost file size, and the majority of the SST files are bottommost, there’re no significant differences:

All Compaction Generated File Sizes

The high-level files are much more likely to be compacted, so the sizes of all compaction-generated files show a more significant change:

Overall, option5 has most of its file sizes close to the target file size, whereas option3 has much smaller files. Here are more detailed stats for compaction output file size:

              base           50p           75p       dynamic     2xdynamic
count  1.656000e+03  1.960000e+03  1.770000e+03  1.687000e+03  1.705000e+03
mean   3.116062e+07  2.634125e+07  2.917876e+07  3.060135e+07  3.028076e+07
std    7.145242e+06  1.065134e+07  8.800474e+06  7.612939e+06  8.046139e+06

Summary

Allowing a more dynamic file size and aligning the compaction output file to the next level file’s boundary improves RocksDB write amplification by more than 10%, and will be enabled by default in the 7.8.0 release. We picked a simple algorithm to decide when to cut the output file, which can be further improved, for example by estimating the output file size with index information. Any suggestions or PRs are welcome.

Acknowledgements

We thank Siying Dong for initiating the file-cutting idea, Andrew Kryczka and Mark Callaghan for contributing to the ideas, and Changyu Bi for the detailed code review.

Asynchronous IO in RocksDB

Posted October 07, 2022

Summary

RocksDB provides several APIs to read KV pairs from a database, including Get and MultiGet for point lookups and Iterator for sequential scanning. These APIs may result in RocksDB reading blocks from SST files on disk storage. The types of blocks and the frequency with which they are read from storage is workload dependent. Some workloads may have a small working set and thus may be able to cache most of the data required, while others may have large working sets and have to read from disk more often. In the latter case, the latency would be much higher and throughput would be lower than the former. They would also be dependent on the characteristics of the underlying storage media, making it difficult to migrate from one medium to another, for example, local flash to disaggregated flash.

One way to mitigate the impact of storage latency is to read asynchronously and in parallel as much as possible, in order to hide IO latency. We have implemented this in RocksDB in Iterators and MultiGet. In Iterators, we prefetch data asynchronously in the background for each file being iterated on, unlike the current implementation that does prefetching synchronously, thus blocking the iterator thread. In MultiGet, we determine the set of files that a given batch of keys overlaps, and read the necessary data blocks from those files in parallel using an asynchronous file system API. These optimizations have significantly decreased the overall latency of the RocksDB MultiGet and iteration APIs on slower storage compared to local flash.

The optimizations described here are in the internal implementation of Iterator and MultiGet in RocksDB. The user API is still synchronous, so existing code can easily benefit from it. We might consider async user APIs in the future.

Design

API

A new flag in ReadOptions, async_io, controls the usage of async IO. This flag, when set, enables async IO in Iterators and MultiGet. For MultiGet, an additional ReadOptions flag, optimize_multiget_for_io (defaults to true), controls how aggressively to use async IO. If the flag is not set, files in the same level are read in parallel but not different levels. If the flag is set, the level restriction is removed and as many files as possible are read in parallel, regardless of level. The latter might have a higher CPU cost depending on the workload.

At the FileSystem layer, we use the FSRandomAccessFile::ReadAsync API to start an async read, providing a completion callback.

Scan

A RocksDB scan usually involves the allocation of a new iterator, followed by a Seek call with a target key to position the iterator, followed by multiple Next calls to iterate through the keys sequentially. Both the Seek and Next operations present opportunities to read asynchronously, thereby reducing the scan latency.

A scan usually involves iterating through keys in multiple entities - the active memtable, sealed and unflushed memtables, every L0 file, and every non-empty non-zero level. The first two are completely in memory and thus not impacted by IO latency. The latter two involve reading from SST files. This means that an increase in IO latency has a multiplier effect, since multiple L0 files and levels have to be iterated on.

Some factors, such as block cache and prefix bloom filters, can reduce the number of files to iterate and number of reads from the files. Nevertheless, even a few reads from disk can dominate the overall latency. RocksDB uses async IO in both Seek and Next to mitigate the latency impact, as described below.

Seek

A RocksDB iterator maintains a collection of child iterators, one for each L0 file and one for each non-empty non-zero level. For a Seek operation every child iterator has to Seek to the target key. This is normally done serially, by doing synchronous reads from SST files when the required data blocks are not in cache. When the async_io option is enabled, RocksDB performs the Seek in 2 phases: 1) locate the data block required for Seek in each file/level and issue an async read, and 2) reseek with the same key, which will wait for the async read to finish at each level and position the table iterator. Phase 1 reads multiple blocks in parallel, reducing overall Seek latency.

Next

For the iterator Next operation, RocksDB tries to reduce the latency due to IO by prefetching data from the file. This prefetching occurs when a data block required by Next is not present in the cache. The reads from the file and the prefetching are managed by the FilePrefetchBuffer, which is an object that’s created per table iterator (BlockBasedTableIterator). The FilePrefetchBuffer reads the required data block, and an additional amount of data that varies depending on the options provided by the user in ReadOptions and BlockBasedTableOptions. The default behavior is to start prefetching on the third read from a file, with an initial prefetch size of 8KB, doubling it on every subsequent read, up to a max of 256KB.
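
The default prefetch size schedule described above can be sketched as a small function. The constants mirror the numbers in the text; the names kInitialReadahead, kMaxReadahead, kReadsBeforePrefetch and ReadaheadForRead are hypothetical and are not RocksDB identifiers.

```cpp
#include <algorithm>
#include <cstddef>

constexpr std::size_t kInitialReadahead = 8 * 1024;  // 8KB
constexpr std::size_t kMaxReadahead = 256 * 1024;    // 256KB
constexpr int kReadsBeforePrefetch = 2;  // prefetching starts on read 3

// Returns the prefetch size in bytes for the n-th read (1-based) from a
// file. A result of 0 means no prefetching happens on that read.
std::size_t ReadaheadForRead(int read_number) {
  if (read_number <= kReadsBeforePrefetch) {
    return 0;
  }
  std::size_t size = kInitialReadahead;
  // Double once for every read after the third, capped at the maximum.
  for (int i = kReadsBeforePrefetch + 1;
       i < read_number && size < kMaxReadahead; ++i) {
    size = std::min(size * 2, kMaxReadahead);
  }
  return size;
}
```

Under these defaults the schedule is 8KB on the third read, 16KB on the fourth, and so on, reaching the 256KB cap on the eighth read.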

While the prefetching in the previous paragraph helps, it is still synchronous and contributes to the iterator latency. When the async_io option is enabled, RocksDB prefetches in the background, i.e., while the iterator is scanning KV pairs. This is accomplished in FilePrefetchBuffer by maintaining two prefetch buffers. The prefetch size is calculated as usual, but it is then split across the two buffers. As the iteration proceeds and data in the first buffer is consumed, the buffer is cleared and an async read is scheduled to prefetch additional data. This read continues in the background while the iterator continues to process data in the second buffer. At this point, the roles of the two buffers are reversed. This does not completely hide the IO latency, since the iterator would have to wait for an async read to complete after the data in memory has been consumed. However, it does hide some of it by overlapping CPU and IO, and async prefetch can be happening on multiple levels in parallel, further reducing the latency.

Scan flow

MultiGet

The MultiGet API accepts a batch of keys as input. It’s a more efficient way of looking up multiple keys compared to a loop of Gets. One way MultiGet is more efficient is by reading multiple data blocks from an SST file in a batch, for keys in the same file. This greatly reduces the latency of the request, compared to a loop of Gets. The MultiRead FileSystem API is used to read a batch of data blocks.

MultiGet flow

Even with the MultiRead optimization, subsets of keys that are in different files still need to be read serially. We can take this one step further and read multiple files in parallel. In order to do this, a few fundamental changes were required in the MultiGet implementation:

Coroutines - A MultiGet involves determining the set of keys in a batch that overlap an SST file, and then calling TableReader::MultiGet to do the actual lookup. The TableReader probes the bloom filter, traverses the index block, looks up the block cache for the necessary data blocks, reads the missing data blocks from the SST file, and then searches for the keys in the data blocks. There is a significant amount of context that’s accumulated at each stage, and it would be rather complex to interleave data block reads by multiple TableReaders. In order to simplify it, we used async IO with C++ coroutines. The TableReader::MultiGet is implemented as a coroutine, and the coroutine is suspended after issuing async reads for missing data blocks. This allows the top-level MultiGet to iterate through the TableReaders for all the keys, before waiting for the reads to finish and resuming the coroutines.
Filtering - The downside of using coroutines is the CPU overhead, which is non-trivial. To minimize the overhead, it’s desirable to avoid coroutines as much as possible. One scenario in which we can completely avoid the call to a TableReader::MultiGet coroutine is if we know that none of the overlapping keys are actually present in the SST file. This can easily be determined by probing the bloom filter. In the previous implementation, the bloom filter lookup was embedded in TableReader::MultiGet. However, we could easily implement it as a separate step, before calling TableReader::MultiGet.
Splitting batches - The default strategy of MultiGet is to look up keys in one level (or L0 file), before moving on to the next. This limits the amount of IO parallelism we can exploit. For example, the keys in a batch may not be clustered together, and may be scattered over multiple files. Even if they are clustered together in the key space, they may not all be in the same level. In order to optimize for these situations, we determine the subset of keys that are likely to be in a given level, and then split the MultiGet batch into 2: the subset in that level, and the remainder. The batch containing the remainder can then be processed in parallel. The subset of keys likely to be in a level is determined by the filtering step.
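
The batch splitting step can be sketched as below. This is a simplified illustration, not the RocksDB implementation: SplitBatch and KeyBatch are hypothetical names, and the may_contain callback stands in for the bloom filter probe of a given level.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

using KeyBatch = std::vector<std::string>;

// Splits `batch` into (keys likely to be in this level, remainder), using a
// filter predicate that may return false positives but no false negatives.
std::pair<KeyBatch, KeyBatch> SplitBatch(
    const KeyBatch& batch,
    const std::function<bool(const std::string&)>& may_contain) {
  KeyBatch in_level;
  KeyBatch remainder;
  for (const std::string& key : batch) {
    if (may_contain(key)) {
      in_level.push_back(key);
    } else {
      remainder.push_back(key);
    }
  }
  return {std::move(in_level), std::move(remainder)};
}
```

The first batch would be looked up in the current level, while the remainder can proceed to the following levels without waiting for it.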

Together, these changes enabled two types of latency optimization in MultiGet using async IO - single-level and multi-level. The former reads data blocks in parallel from multiple files in the same LSM level, while the latter reads in parallel from multiple files in multiple levels.

Results

Command used to generate the database:

buck-out/opt/gen/rocks/tools/rocks_db_bench --db=/rocks_db_team/prefix_scan --env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -benchmarks="fillseqdeterministic" -key_size=32 -value_size=512 -num=5000000 -num_levels=4 -multiread_batched=true -use_direct_reads=false -adaptive_readahead=true -threads=1 -cache_size=10485760000 -async_io=false -multiread_stride=40000 -disable_auto_compactions=true -compaction_style=1 -bloom_bits=10

Structure of the database:

Level[0]: /000233.sst(size: 24828520 bytes)
Level[0]: /000232.sst(size: 49874113 bytes)
Level[0]: /000231.sst(size: 100243447 bytes)
Level[0]: /000230.sst(size: 201507232 bytes)
Level[1]: /000224.sst - /000229.sst(total size: 405046844 bytes)
Level[2]: /000211.sst - /000223.sst(total size: 814190051 bytes)
Level[3]: /000188.sst - /000210.sst(total size: 1515327216 bytes)

MultiGet

MultiGet benchmark command:

buck-out/opt/gen/rocks/tools/rocks_db_bench -use_existing_db=true --db=/rocks_db_team/prefix_scan -benchmarks="multireadrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=8 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -threads=4 -cache_size=300000000 -async_io=true -multiread_stride=40000 -statistics --env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -adaptive_readahead=true -bloom_bits=10

Single-file

The default MultiGet implementation of reading from one file at a time had a latency of 1292 micros/op.

multireadrandom : 1291.992 micros/op 3095 ops/sec 60.007 seconds 185768 operations; 1.6 MB/s (46768 of 46768 found)
rocksdb.db.multiget.micros P50 : 9664.419795 P95 : 20757.097056 P99 : 29329.444444 P100 : 46162.000000 COUNT : 23221 SUM : 239839394

Single-level

MultiGet with async_io=true and optimize_multiget_for_io=false had a latency of 775 micros/op.

multireadrandom : 774.587 micros/op 5163 ops/sec 60.009 seconds 309864 operations; 2.7 MB/s (77816 of 77816 found)
rocksdb.db.multiget.micros P50 : 6029.601964 P95 : 10727.467932 P99 : 13986.683940 P100 : 47466.000000 COUNT : 38733 SUM : 239750172

Multi-level

With all optimizations turned on, MultiGet had the lowest latency of 508 micros/op.

multireadrandom : 507.533 micros/op 7881 ops/sec 60.003 seconds 472896 operations; 4.1 MB/s (117536 of 117536 found)
rocksdb.db.multiget.micros P50 : 3923.819467 P95 : 7356.182075 P99 : 10880.728723 P100 : 28511.000000 COUNT : 59112 SUM : 239642721

Scan

Benchmark command:

buck-out/opt/gen/rocks/tools/rocks_db_bench -use_existing_db=true --db=/rocks_db_team/prefix_scan -benchmarks="seekrandom" -key_size=32 -value_size=512 -num=5000000 -batch_size=8 -multiread_batched=true -use_direct_reads=false -duration=60 -ops_between_duration_checks=1 -readonly=true -threads=4 -cache_size=300000000 -async_io=true -multiread_stride=40000 -statistics --env_uri=ws://ws.flash.ftw3preprod1 -logtostderr=false -adaptive_readahead=true -bloom_bits=10 -seek_nexts=65536

With async scan

seekrandom : 414442.303 micros/op 9 ops/sec 60.288 seconds 581 operations; 326.2 MB/s (145 of 145 found)

Without async scan

seekrandom : 848858.669 micros/op 4 ops/sec 60.529 seconds 284 operations; 158.1 MB/s (74 of 74 found)

Known Limitations

These optimizations apply only to block based table SSTs. File system support for the ReadAsync and Poll interfaces is required. Currently, it is available only for PosixFileSystem.

The MultiGet async IO optimization has a few additional limitations:

Depends on folly, which introduces a few additional build steps
Higher CPU overhead due to coroutines. The CPU overhead of MultiGet may increase 6-15%, with the worst case being a single threaded MultiGet batch of keys with 1 key/file intersection and 100% cache hit rate. A more realistic case of multiple threads with a few keys (~4) overlap per file should see ~6% higher CPU util.
No parallelization of metadata reads. A metadata read will block the thread.
A few other cases will also be in serial, such as additional block reads for merge operands.

Andrew Kryczka

Verifying crash-recovery with lost buffered writes

Posted October 05, 2022

Introduction

Writes to a RocksDB instance go through multiple layers before they are fully persisted. Those layers may buffer writes, delaying their persistence. Depending on the layer, buffered writes may be lost in a process or system crash. A process crash loses writes buffered in process memory only. A system crash additionally loses writes buffered in OS memory.

The new test coverage introduced in this post verifies there is no hole in the recovered data in either type of crash. A hole would exist if any recovered write were newer than any lost write, as illustrated below. This guarantee is important for many applications, such as those that use the newest recovered write to determine the starting point for replication.

Valid (no hole) recovery: all recovered writes (1 and 2) are older than all lost writes (3 and 4)

Invalid (hole) recovery: a recovered write (4) is newer than a lost write (3)

The new test coverage assumes all writes use the same options related to buffering/persistence. For example, we do not cover the case of alternating writes with WAL disabled and WAL enabled (WriteOptions::disableWAL). It also assumes the crash does not have any unexpected consequences like corrupting persisted data.

Testing for holes in the recovery is challenging because there are many valid recovery outcomes. Our solution involves tracing all the writes and then verifying the recovery matches a prefix of the trace. This proves there are no holes in the recovery. See “Extensions for lost buffered writes” subsection below for more details.
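The prefix idea can be sketched in a few lines of Python. This is only an illustration, not how db_stress is implemented: the trace format (a list of key/value pairs, with None standing for a delete) and the function names are assumptions made for the example, and the brute-force search over prefix lengths is used purely for clarity.

```python
from typing import Dict, List, Optional, Tuple

# One traced write: (key, value). A value of None models a delete.
Write = Tuple[str, Optional[str]]


def apply_writes(writes: List[Write]) -> Dict[str, str]:
    """Replay writes in order and return the resulting key-value state."""
    state: Dict[str, str] = {}
    for key, value in writes:
        if value is None:
            state.pop(key, None)
        else:
            state[key] = value
    return state


def recovered_prefix_length(trace: List[Write],
                            recovered: Dict[str, str]) -> Optional[int]:
    """Return the longest prefix length whose replay equals `recovered`.

    Returns None when no prefix of the trace produces the recovered state,
    which means the recovery contains a hole.
    """
    for n in range(len(trace), -1, -1):
        if apply_writes(trace[:n]) == recovered:
            return n
    return None


trace = [("a", "1"), ("b", "2"), ("a", "3"), ("c", "4")]

# Writes 1 and 2 recovered, writes 3 and 4 lost: a valid recovery.
print(recovered_prefix_length(trace, {"a": "1", "b": "2"}))  # -> 2

# Write 4 recovered but write 3 lost: a hole, so no prefix matches.
print(recovered_prefix_length(trace, {"a": "1", "b": "2", "c": "4"}))  # -> None
```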

Testing actual system crashes would be operationally difficult. Our solution simulates system crash by buffering written but unsynced data in process memory such that it is lost in a process crash. See “Simulating system crash” subsection below for more details.

Scenarios covered

We began testing that recovery has no hole in the following new scenarios. This coverage is included in our internal CI that periodically runs against the latest commit on the main branch.

Process crash with WAL disabled (WriteOptions::disableWAL=1), which loses writes since the last memtable flush.
System crash with WAL enabled (WriteOptions::disableWAL=0), which loses writes since the last memtable flush or WAL sync (WriteOptions::sync=1, SyncWAL(), or FlushWAL(true /* sync */)).
Process crash with manual WAL flush (DBOptions::manual_wal_flush=1), which loses writes since the last memtable flush or manual WAL flush (FlushWAL()).
System crash with manual WAL flush (DBOptions::manual_wal_flush=1), which loses writes since the last memtable flush or synced manual WAL flush (FlushWAL(true /* sync */), or FlushWAL(false /* sync */) followed by WAL sync).

Issues found

Solution details

Basic setup

Our correctness testing framework consists of a stress test program (db_stress) and a wrapper script (db_crashtest.py). db_crashtest.py manages instances of db_stress, starting them and injecting crashes. db_stress operates a DB and test oracle (“Latest values file”).

At startup, db_stress verifies the DB using the test oracle, skipping keys that had pending writes when the last crash happened. db_stress then stresses the DB with random operations, keeping the test oracle up-to-date.

As the name “Latest values file” implies, this test oracle only tracks the latest value for each key. As a result, this setup is unable to verify recoveries involving lost buffered writes, where recovering older values is tolerated as long as there is no hole.

Extensions for lost buffered writes

To accommodate lost buffered writes, we extended the test oracle to include two new files: “verifiedSeqno.state” and “verifiedSeqno.trace”. verifiedSeqno is the sequence number of the last successful verification. “verifiedSeqno.state” is the expected values file at that sequence number, and “verifiedSeqno.trace” is the trace file of all operations that happened after that sequence number.

When buffered writes may have been lost by the previous db_stress instance, the current db_stress instance must reconstruct the latest values file before startup verification. M is the recovery sequence number of the current db_stress instance and N is the recovery sequence number of the previous db_stress instance. M is learned from the DB, while N is learned from the filesystem by parsing the “*.{trace,state}” filenames. Then, the latest values file (“LATEST.state”) can be reconstructed by replaying the first M-N traced operations (in “N.trace”) on top of the last instance’s starting point (“N.state”).

When buffered writes may be lost by the current db_stress instance, we save the current expected values into “M.state” and begin tracing newer operations in “M.trace”.
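As a rough sketch of the reconstruction step, the following Python replays the first M-N traced operations on top of the state saved at N. The in-memory dictionaries and lists stand in for the “N.state” and “N.trace” files, the function name is hypothetical, and the sketch assumes each traced operation consumes exactly one sequence number.

```python
from typing import Dict, List, Optional, Tuple

# One traced operation: (key, value). A value of None models a delete.
Op = Tuple[str, Optional[str]]


def reconstruct_latest_state(n_state: Dict[str, str],
                             n_trace: List[Op],
                             n: int,
                             m: int) -> Dict[str, str]:
    """Replay the first m - n traced operations on top of the state at n.

    Assumes (for illustration) one sequence number per traced operation.
    """
    count = m - n
    if count < 0 or count > len(n_trace):
        raise ValueError("recovery seqno is outside the traced range")
    latest = dict(n_state)  # copy so the saved state stays untouched
    for key, value in n_trace[:count]:
        if value is None:
            latest.pop(key, None)
        else:
            latest[key] = value
    return latest


state_at_100 = {"a": "1"}
trace_after_100 = [("b", "2"), ("a", "3"), ("c", "4")]

# The previous instance started at N=100; the DB recovered to M=102,
# so only the first two traced operations are expected to have survived.
print(reconstruct_latest_state(state_at_100, trace_after_100, 100, 102))
# -> {'a': '3', 'b': '2'}
```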

Simulating system crash

When simulating system crash, we send file writes to a TestFSWritableFile, which buffers unsynced writes in process memory. That way, the existing db_stress process crash mechanism will lose unsynced writes.

TestFSWritableFile is implemented as follows.

Append() buffers the write in a local std::string rather than calling write().
Sync() transfers the local std::string's content to PosixWritableFile::Append(), which will then write() it to the OS page cache.
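The buffering behavior can be modeled with a toy Python class. This is not the C++ TestFSWritableFile; the class and method names below are made up for illustration, and a second bytearray stands in for the OS page cache that survives a process crash.

```python
class BufferingWritableFile:
    """Toy model of a file wrapper that holds unsynced data in process memory."""

    def __init__(self) -> None:
        # Unsynced data lives only in the process, so a process crash loses it.
        self._unsynced = bytearray()
        # Stand-in for the OS page cache, which survives a process crash.
        self._page_cache = bytearray()

    def append(self, data: bytes) -> None:
        """Buffer the write locally instead of handing it to the OS."""
        self._unsynced += data

    def sync(self) -> None:
        """Hand all buffered data to the OS and empty the local buffer."""
        self._page_cache += self._unsynced
        self._unsynced.clear()

    def contents_after_process_crash(self) -> bytes:
        """Return what remains once the process (and its buffer) is gone."""
        return bytes(self._page_cache)


f = BufferingWritableFile()
f.append(b"write-1;")
f.sync()
f.append(b"write-2;")  # never synced
print(f.contents_after_process_crash())  # -> b'write-1;'
```

Because the unsynced tail is dropped as a whole, a process crash in this model loses exactly the writes a real system crash could lose.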

Next steps

An untested guarantee is that RocksDB recovers all writes that the user explicitly flushed out of the buffers lost in the crash. We may recover more writes than these due to internal flushing of buffers, but never fewer. Our test oracle needs to be further extended to track the lower bound on the sequence number that is expected to survive a crash.

We would also like to make our system crash simulation more realistic. Currently we only drop unsynced regular file data, but we should drop unsynced directory entries as well.

Acknowledgements

Hui Xiao added the manual WAL flush coverage and compatibility with TransactionDB. Zhichao Cao added the system crash simulation. Several RocksDB team members contributed to this feature’s dependencies.

Changyu Bi

Andrew Kryczka

Per Key-Value Checksum

Posted July 18, 2022

Summary

Silent data corruptions can severely impact RocksDB users. As a key-value library, RocksDB resides at the bottom of the user space software stack for many diverse applications. Returning wrong query results can cause unpredictable consequences for our users so must be avoided.

To prevent and detect corruption, RocksDB has several consistency checks [1], especially focusing on the storage layer. For example, SST files contain block checksums that are verified during reads, and each SST file has a full file checksum that can be verified when files are transferred.

Other sources of corruptions, such as those from faulty CPU/memory or heap corruptions, pose risks for which protections are relatively underdeveloped. Meanwhile, recent work [2] suggests one per thousand machines in our fleet will at some point experience a hardware error that is exposed to an application. Additionally, software bugs can increase the risk of heap corruptions at any time.

Hardware/heap corruptions are naturally difficult to detect in the application layer since they can compromise any data or control flow. Some factors we take into account when choosing where to add protection are the volume of data, the importance of the data, the CPU instructions that operate on the data, and the duration it resides in memory. One recently added protection, detect_filter_construct_corruption, has proven itself useful in preventing corrupt filters from being persisted. We have seen hardware encounter machine-check exceptions a few hours after we detected a corrupt filter.

The next way we intend to detect hardware and heap corruptions before they cause queries to return wrong results is through developing a new feature: per key-value checksum. This feature will eventually provide optional end-to-end integrity protection for every key-value pair. RocksDB 7.4 offers substantial coverage of the user write and recovery paths with per key-value checksum protection.

User API

For integrity protection during recovery, no change is required. Recovery is always protected.

For user write protection, RocksDB allows the user to specify per key-value protection through WriteOptions::protection_bytes_per_key or pass in protection_bytes_per_key to WriteBatch constructor when creating a WriteBatch directly. Currently, only 0 (default, no protection) and 8 bytes per key are supported. This should be fine for write batches as they do not usually contain a huge number of keys. We are working on supporting more settings as 8 bytes per key might cause considerable memory overhead when the protection is extended to memtable entries.

Feature Design

Data Structures

Protection info

For protecting key-value pairs, we chose to use a hashing algorithm, xxh3 [3], for its good efficiency without relying on special hardware. While algorithms like crc32c can guarantee detection of certain patterns of bit flips, xxh3 offers no such guarantees. This is acceptable for us as we do not expect any particular error pattern [4], and even if we did, xxh3 can achieve a collision probability close enough to zero for us by tuning the number of protection bytes per key-value.

Key-value pairs have multiple representations in RocksDB: in WriteBatch, in memtable entries and in data blocks. In this post we focus on key-values in write batches and memtable as in-memory data blocks are not yet protected.

Besides user key and value, RocksDB includes internal metadata in the per key-value checksum calculation. Depending on the representation, internal metadata consists of some combination of sequence number, operation type, and column family ID. Note that since timestamp (when enabled) is part of the user key it is protected as well.

The protection info consists of the XOR’d result of the xxh3 hash for all the protected components. This allows us to efficiently transform protection info for different representations. See below for an example converting WriteBatch protection info to memtable protection info.

A risk of using XOR is the possibility of swapping corruptions (e.g., key becomes the value and the value becomes the key). To mitigate this risk, we use an independent seed for hashing each type of component.

The following two figures illustrate how protection info in WriteBatch and memtable are calculated from a key-value’s components.

Protection info for a key-value in a WriteBatch

Protection info for a key-value in a memtable

The next figure illustrates how protection info for a key-value can be transformed to protect that same key-value in a different representation. Note this is done without recalculating the hash for all the key-value’s components.

Protection info for a key-value in a memtable derived from an existing WriteBatch protection info

Above, we see two (small) components are hashed: column family ID and sequence number. When a key-value is inserted from WriteBatch into memtable, it is assigned a sequence number and drops the column family ID since each memtable is associated with one column family. Recall the xxh3 of column family ID was included in the WriteBatch protection info, which is canceled out by the column family ID xxh3 included in the XOR.
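The XOR construction and the cheap conversion can be sketched in Python. Note the substitutions: keyed BLAKE2b from the standard library stands in for seeded xxh3, and the seed values, byte encodings, and function names are all assumptions for the example rather than RocksDB's actual layout.

```python
import hashlib

# Independent seed per component type (illustrative values only).
SEEDS = {
    "key": b"seed-key",
    "value": b"seed-value",
    "op_type": b"seed-op-type",
    "cf_id": b"seed-cf-id",
    "seqno": b"seed-seqno",
}


def component_hash(kind: str, data: bytes) -> int:
    """64-bit hash of one component; keyed BLAKE2b stands in for seeded xxh3."""
    digest = hashlib.blake2b(data, digest_size=8, key=SEEDS[kind]).digest()
    return int.from_bytes(digest, "little")


def write_batch_protection(key: bytes, value: bytes, op_type: int, cf_id: int) -> int:
    """XOR of the hashes of key, value, operation type and column family ID."""
    return (component_hash("key", key)
            ^ component_hash("value", value)
            ^ component_hash("op_type", bytes([op_type]))
            ^ component_hash("cf_id", cf_id.to_bytes(4, "little")))


def memtable_protection(key: bytes, value: bytes, op_type: int, seqno: int) -> int:
    """XOR of the hashes of key, value, operation type and sequence number."""
    return (component_hash("key", key)
            ^ component_hash("value", value)
            ^ component_hash("op_type", bytes([op_type]))
            ^ component_hash("seqno", seqno.to_bytes(8, "little")))


def write_batch_to_memtable(wb_info: int, cf_id: int, seqno: int) -> int:
    """Convert protection info by hashing only the two small components.

    XOR-ing the column family ID hash again cancels it out; XOR-ing the
    sequence number hash adds it. Key and value are not rehashed.
    """
    return (wb_info
            ^ component_hash("cf_id", cf_id.to_bytes(4, "little"))
            ^ component_hash("seqno", seqno.to_bytes(8, "little")))


wb = write_batch_protection(b"user-key", b"user-value", op_type=1, cf_id=7)
converted = write_batch_to_memtable(wb, cf_id=7, seqno=42)
print(converted == memtable_protection(b"user-key", b"user-value", 1, 42))  # -> True
```

The per-component seeds are what make a swapped key and value produce different protection info, even though XOR itself is order-insensitive.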

WAL fragment

WAL (Write-ahead-log) persists write batches that correspond to operations in memtables and enables consistent database recovery after restart. RocksDB writes to WAL in chunks of some fixed block size for efficiency. It is possible that some write batch does not fit into the space left in the current block and/or is larger than the fixed block size. Thus, serialized write batches (WAL records) are divided into WAL fragments before being written to WAL. The format of a WAL fragment is in the following diagram (there is another legacy format detailed in code comments). Roughly, the Type field indicates whether a fragment is at the beginning, middle or end of a record, and is used to group fragments.

Note that each fragment is prefixed by a crc32c checksum that is calculated over Type, Log # and Payload. This ensures that RocksDB can detect corruptions that happened to the WAL in the storage layer.

Write batch

As mentioned above, a WAL record is a serialized WriteBatch that is split into physical fragments during writes to WAL. During DB recovery, once a WAL record is reconstructed from one or more fragments, it is copied into the content of a WriteBatch. The write batch will then be used to restore the memtable states.

Besides the recovery path, a write batch is always constructed during user writes. Firstly, RocksDB allows users to construct a write batch directly, and pass it to DB through DB::Write() API for execution. Higher-level buffered write APIs like Transaction rely on a write batch to buffer writes prior to executing them. For unbuffered write APIs like DB::Put(), RocksDB constructs a write batch internally with the input user key and value.

The above diagram shows a rough representation of a write batch in memory. Contents is the concatenation of serialized user operations in this write batch. Each operation consists of user key, value, op_type and optionally column family ID. With per key-value checksum protection enabled, a vector of ProtectionInfo is stored in the write batch, one for each user operation.

Memtable entry

A memtable entry is similar to write batch content, except that it captures only a single user operation and that it does not contain column family ID (since memtable is per column family). User key and value are length-prefixed, and seqno and optype are combined in a fixed 8 bytes representation.

Processes

In order to protect user writes and recovery, per key-value checksum is covered in the following code paths.

WriteBatch write

Per key-value checksum coverage starts with the user buffers that contain user key and/or value. When users call DB Write APIs (e.g., DB::Put()), or when users add operations into write batches directly (e.g. WriteBatch::Put()), RocksDB constructs ProtectionInfo from the user buffer (e.g. here) and stores the protection information within the corresponding WriteBatch object as diagramed below. Then the user key and/or value are copied into the WriteBatch, thus starting per key-value checksum protection from user buffer.

WAL write

Before a WriteBatch leaves RocksDB and is persisted in a WAL file, it is verified against its ProtectionInfo to ensure its content is not corrupted. We added WriteBatch::VerifyChecksum() for this purpose. Once we verify the content of a WriteBatch, it is then divided into potentially multiple WAL fragments and persisted in the underlying file system. From that point on, the integrity protection is handed off to the per fragment crc32c checksum that is persisted in WAL too.

Memtable write

Similar to the WAL write path, ProtectionInfo is verified before an entry is inserted into a memtable. The difference here is that a memtable entry has its own buffer, and the content of a WriteBatch is copied into the memtable entry. So the ProtectionInfo is verified against the memtable entry buffer instead. The current per key-value checksum protection ends at this verification on the buffer containing a memtable entry, and one item of future work is to extend the coverage to key-value pairs in memtables.

WAL read

This is for the DB recovery path: WAL fragments are read into memory, concatenated together to form WAL records, and then WriteBatches are constructed from WAL records and added to memtables. In RocksDB 7.4, once a WriteBatch copies its content from a WAL record, ProtectionInfo is constructed from the WriteBatch content and per key-value protection starts. However, this copy operation is not protected, and neither is the reconstruction of a WAL record from WAL fragments. To provide protection from silent data corruption during these memory copying operations, we added a checksum handshake, detailed below, in RocksDB 7.5.

When a WAL fragment is first read into memory, its crc32c checksum is verified. The WAL fragment is then appended to the buffer containing a WAL record. RocksDB uses xxh3’s streaming API to calculate the checksum of the WAL record and updates the streaming hash state with the new WAL fragment content whenever it is appended to the WAL record buffer (e.g. here). After the WAL record is constructed, it is copied into a WriteBatch and ProtectionInfo is constructed from the write batch content. Then, the xxh3 checksum of the WAL record is verified against the write batch content to complete the checksum handshake. If the checksum verification succeeds, then we are more confident that ProtectionInfo is calculated based on uncorrupted data, and the protection coverage continues with the newly constructed ProtectionInfo along the write code paths mentioned above.
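The handshake can be sketched as follows. This is a simplified model, not RocksDB code: BLAKE2b's incremental interface stands in for xxh3's streaming API, the per-fragment crc32c check is assumed to have already passed, and the `copy` parameter is a hypothetical hook added only so the example can inject a corrupting copy.

```python
import hashlib
from typing import Callable, Iterable


def read_record_with_handshake(fragments: Iterable[bytes],
                               copy: Callable[[bytes], bytes] = bytes) -> bytes:
    """Assemble a record from fragments and verify the copy made from it."""
    stream = hashlib.blake2b(digest_size=8)  # stand-in for a streaming xxh3 state
    record = bytearray()
    for fragment in fragments:
        # Each fragment is assumed to have passed its own crc32c check here.
        record += fragment
        stream.update(fragment)  # update the hash state as the record grows

    # Copy the record into the (modeled) write batch content.
    batch_content = copy(bytes(record))

    # Handshake: the hash of the copied content must match the streamed hash.
    if hashlib.blake2b(batch_content, digest_size=8).digest() != stream.digest():
        raise ValueError("corruption detected between WAL read and WriteBatch")
    return batch_content


def flip_first_bit(data: bytes) -> bytes:
    """Hypothetical faulty copy that corrupts one bit."""
    return bytes([data[0] ^ 0x01]) + data[1:]


print(read_record_with_handshake([b"put k1 v1;", b"put k2 v2;"]))
# -> b'put k1 v1;put k2 v2;'

try:
    read_record_with_handshake([b"put k1 v1;", b"put k2 v2;"], copy=flip_first_bit)
except ValueError as err:
    print(err)  # -> corruption detected between WAL read and WriteBatch
```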

Future work

Future coverage expansion will cover memtable KVs, flush, compaction and user reads etc.

References

[1] http://rocksdb.org/blog/2021/05/26/online-validation.html

[2] H. D. Dixit, L. Boyle, G. Vunnam, S. Pendharkar, M. Beadon, and S. Sankar, ‘Detecting silent data corruptions in the wild’. arXiv, 2022.

[3] https://github.com/Cyan4973/xxHash

[4] https://github.com/Cyan4973/xxHash/issues/229#issuecomment-511956403

Ribbon Filter

Posted December 29, 2021

Summary

Since version 6.15 last year, RocksDB supports Ribbon filters, a new alternative to Bloom filters that save space, especially memory, at the cost of more CPU usage, mostly in constructing the filters in the background. Most applications with long-lived data (many hours or longer) will likely benefit from adopting a Ribbon+Bloom hybrid filter policy. Here we explain why and how.

Ribbon filter on RocksDB wiki

Ribbon filter paper

Problem & background

Bloom filters play a critical role in optimizing point queries and some range queries in LSM-tree storage systems like RocksDB. Very large DBs can use 10% or more of their RAM for (Bloom) filters, so that (average case) read performance can be very good despite high (worst case) read amplification, which is useful for lowering write and/or space amplification. Although the format_version=5 Bloom filter in RocksDB is extremely fast, all Bloom filters use around 50% more space than is theoretically possible for a hashed structure configured for the same false positive (FP) rate and number of keys added. What would it take to save that significant share of “wasted” filter memory, and when does it make sense to use such a Bloom alternative?

A number of alternatives to Bloom filters were known, especially for static filters (not modified after construction), but all the previously known structures were unsatisfying for SSTs because of some combination of

Not enough space savings for CPU increase. For example, Xor filters use 3-4x more CPU than Bloom but only save 15-20% of space. GOV can save around 30% space but requires around 10x more CPU than Bloom.
Inconsistent space savings. Cuckoo filters and Xor+ filters offer significant space savings for very low FP rates (high bits per key) but little or no savings for higher FP rates (low bits per key). (Higher FP rates are considered best for largest levels of LSM.) Spatially-coupled Xor filters require a very large number of keys per filter for large space savings.
Inflexible configuration. No published alternatives offered the same continuous configurability of Bloom filters, where any FP rate and any fractional bits per key could be chosen. This flexibility improves memory efficiency with the optimize_filters_for_memory option that minimizes internal fragmentation on filters.

Ribbon filter development and implementation

The Ribbon filter came about when I developed a faster, simpler, and more adaptable algorithm for constructing a little-known Xor-based structure from Dietzfelbinger and Walzer. It has very good space usage for required CPU time (~30% space savings for 3-4x CPU) and, with some engineering, Bloom-like configurability. The complications were manageable for use in RocksDB:

Ribbon space efficiency does not naturally scale to a very large number of keys in a single filter (whole SST file or partition), but with the current 128-bit Ribbon implementation in RocksDB, even 100 million keys in one filter saves 27% space vs. Bloom rather than 30% for 100,000 keys in a filter.
More temporary memory is required during construction, ~230 bits per key for 128-bit Ribbon vs. ~75 bits per key for Bloom filter. A quick calculation shows that if you are saving 3 bits per key on the generated filter, you only need about 50 generated filters in memory to offset this temporary memory usage. (Thousands of filters in memory is typical.) Starting in RocksDB version 6.27, this temporary memory can be accounted for under block cache using BlockBasedTableOptions::reserve_table_builder_memory.
Ribbon filter queries use relatively more CPU for lower FP rates (but still O(1) relative to number of keys added to filter). This should be OK because lower FP rates are only appropriate when the cost of a false positive is very high (worth extra query time) or memory is not so constrained (can use Bloom instead).
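The "quick calculation" on temporary construction memory mentioned in the list above can be written out as follows. It uses only the figures quoted there and assumes, for simplicity, that all filters hold the same number of keys.

```python
# Temporary memory needed while constructing one filter, in bits per key.
ribbon_construction_bits_per_key = 230.0
bloom_construction_bits_per_key = 75.0

# Space saved per key on each finished Ribbon filter held in memory.
saved_bits_per_key = 3.0

# Extra temporary memory Ribbon needs over Bloom for the filter being built.
extra_construction_bits_per_key = (
    ribbon_construction_bits_per_key - bloom_construction_bits_per_key
)

# Number of equally sized resident Ribbon filters whose combined savings
# offset the extra temporary memory of one construction in progress.
break_even_filters = extra_construction_bits_per_key / saved_bits_per_key
print(round(break_even_filters))  # -> 52
```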

Future: data in the paper suggests that 32-bit Balanced Ribbon (new name: Bump-Once Ribbon) would improve all of these issues and be better all around (except for code complexity).

Ribbon vs. Bloom in RocksDB configuration

Different applications and hardware configurations have different constraints, but we can use hardware costs to examine and better understand the trade-off between Bloom and Ribbon.

Same FP rate, RAM vs. CPU hardware cost

Under ideal conditions where we can adjust our hardware to suit the application, in terms of dollars, how much does it cost to construct, query, and keep in memory a Bloom filter vs. a Ribbon filter? The Ribbon filter costs more for CPU but less for RAM. Importantly, the RAM cost directly depends on how long the filter is kept in memory, which in RocksDB is essentially the lifetime of the filter. (Temporary RAM during construction is so short-lived that it is ignored.) Using some consumer hardware and electricity prices and a predicted balance between construction and queries, we can compute a “break even” duration in memory. To minimize cost, filters with a lifetime shorter than this should be Bloom and filters with a lifetime longer than this should be Ribbon. (Python code)

# Commodity prices based roughly on consumer prices and rough guesses
# Upfront cost of a CPU per hardware thread
upfront_dollars_per_cpu_thread = 30.0

# CPU average power usage per hardware thread
watts_per_cpu_thread = 3.5

# Upfront cost of a GB of RAM
upfront_dollars_per_gb_ram = 8.0

# RAM average power usage per GB
# https://www.crucial.com/support/articles-faq-memory/how-much-power-does-memory-use
watts_per_gb_ram = 0.375

# Estimated price of power per kilowatt-hour, including overheads like conversion losses and cooling
dollars_per_kwh = 0.35

# Assume 3 year hardware lifetime
hours_per_lifetime = 3 * 365 * 24
seconds_per_lifetime = hours_per_lifetime * 60 * 60

# Number of filter queries per key added in filter construction is heavily dependent on workload.
# When replication is in layer above RocksDB, it will be low, likely < 1. When replication is in
# storage layer below RocksDB, it will likely be > 1. Using a rough and general guesstimate.
key_query_per_construct = 1.0

#==================================
# Bloom & Ribbon filter performance
typical_bloom_bits_per_key = 10.0
typical_ribbon_bits_per_key = 7.0

# Speeds here are sensitive to many variables, especially query speed because it
# is so dependent on memory latency. Using this benchmark here:
# for IMPL in 2 3; do
#   ./filter_bench -impl=$IMPL -quick -m_keys_total_max=200 -use_full_block_reader
# done
# and "Random filter" queries.
nanoseconds_per_construct_bloom_key = 32.0
nanoseconds_per_construct_ribbon_key = 140.0

nanoseconds_per_query_bloom_key = 500.0
nanoseconds_per_query_ribbon_key = 600.0

#==================================
# Some constants
kwh_per_watt_lifetime = hours_per_lifetime / 1000.0
bits_per_gb = 8 * 1024 * 1024 * 1024

#==================================
# Crunching the numbers
# on CPU for constructing filters
dollars_per_cpu_thread_lifetime = upfront_dollars_per_cpu_thread + watts_per_cpu_thread * kwh_per_watt_lifetime * dollars_per_kwh
dollars_per_cpu_thread_second = dollars_per_cpu_thread_lifetime / seconds_per_lifetime

dollars_per_construct_bloom_key = dollars_per_cpu_thread_second * nanoseconds_per_construct_bloom_key / 10**9
dollars_per_construct_ribbon_key = dollars_per_cpu_thread_second * nanoseconds_per_construct_ribbon_key / 10**9

dollars_per_query_bloom_key = dollars_per_cpu_thread_second * nanoseconds_per_query_bloom_key / 10**9
dollars_per_query_ribbon_key = dollars_per_cpu_thread_second * nanoseconds_per_query_ribbon_key / 10**9

dollars_per_bloom_key_cpu = dollars_per_construct_bloom_key + key_query_per_construct * dollars_per_query_bloom_key
dollars_per_ribbon_key_cpu = dollars_per_construct_ribbon_key + key_query_per_construct * dollars_per_query_ribbon_key

# on holding filters in RAM
dollars_per_gb_ram_lifetime = upfront_dollars_per_gb_ram + watts_per_gb_ram * kwh_per_watt_lifetime * dollars_per_kwh
dollars_per_gb_ram_second = dollars_per_gb_ram_lifetime / seconds_per_lifetime

dollars_per_bloom_key_in_ram_second = dollars_per_gb_ram_second / bits_per_gb * typical_bloom_bits_per_key
dollars_per_ribbon_key_in_ram_second = dollars_per_gb_ram_second / bits_per_gb * typical_ribbon_bits_per_key

#==================================
# How many seconds does it take for the added cost of constructing a ribbon filter instead
# of bloom to be offset by the added cost of holding the bloom filter in memory?
break_even_seconds = (dollars_per_ribbon_key_cpu - dollars_per_bloom_key_cpu) / (dollars_per_bloom_key_in_ram_second - dollars_per_ribbon_key_in_ram_second)
print(break_even_seconds)
# -> 3235.1647730256936

So roughly speaking, filters that live in memory for more than an hour should be Ribbon, and filters that live less than an hour should be Bloom. This is very interesting, but how long do filters live in RocksDB?

First let’s consider the average case. Write-heavy RocksDB loads are often backed by flash storage, which has some specified write endurance for its intended lifetime. This can be expressed as device writes per day (DWPD), and supported DWPD is typically < 10.0 even for high end devices (excluding NVRAM). Roughly speaking, the DB would need to be writing at a rate of 20+ DWPD for data to have an average lifetime of less than one hour. Thus, unless you are prematurely burning out your flash or massively under-utilizing available storage, using the Ribbon filter has the better cost profile on average.

Predictable lifetime

But we can do even better than optimizing for the average case. LSM levels give us very strong data lifetime hints. Data in L0 might live for minutes or a small number of hours. Data in Lmax might live for days or weeks. So even if Ribbon filters weren’t the best choice on average for a workload, they almost certainly make sense for the larger, longer-lived levels of the LSM. As of RocksDB 6.24, you can specify a minimum LSM level for Ribbon filters with NewRibbonFilterPolicy, and earlier levels will use Bloom filters.

Resident filter memory

The above analysis assumes that nearly all filters for all live SST files are resident in memory. This is true if using cache_index_and_filter_blocks=0 and max_open_files=-1 (defaults), but cache_index_and_filter_blocks=1 is popular. In that case, if you use optimize_filters_for_hits=1 and non-partitioned filters (a popular MyRocks configuration), it is also likely that nearly all live filters are in memory. However, if you don’t use optimize_filters_for_hits and use partitioned filters, then cold data (by age or by key range) can lead to only a portion of filters being resident in memory. In that case, benefit from Ribbon filter is not as clear, though because Ribbon filters are smaller, they are more efficient to read into memory.

RocksDB version 6.21 and later include a rough feature to determine block cache usage for data blocks, filter blocks, index blocks, etc. Data like this is periodically dumped to LOG file (stats_dump_period_sec):

Block cache entry stats(count,size,portion): DataBlock(441761,6.82 GB,75.765%) FilterBlock(3002,1.27 GB,14.1387%) IndexBlock(17777,887.75 MB,9.63267%) Misc(1,0.00 KB,0%)
Block cache LRUCache@0x7fdd08104290#7004432 capacity: 9.00 GB collections: 2573 last_copies: 10 last_secs: 0.143248 secs_since: 0

This indicates that at this moment in time, the block cache object identified by LRUCache@0x7fdd08104290#7004432 (potentially used by multiple DBs) uses roughly 14% of its 9GB, about 1.27 GB, on filter blocks. This same data is available through DB::GetMapProperty with DB::Properties::kBlockCacheEntryStats, and (with some effort) can be compared to total size of all filters (not necessarily in memory) using rocksdb.filter.size from DB::Properties::kAggregatedTableProperties.
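For ad hoc analysis, the LOG line shown above can be parsed with a short Python script. The regular expression is written against that one sample line only, so treat the format as an assumption rather than a stable interface; the function name is hypothetical, and interpreting KB/MB/GB as powers of 1024 is also an assumption.

```python
import re
from typing import Dict

# Matches entries such as "FilterBlock(3002,1.27 GB,14.1387%)".
ENTRY = re.compile(r"(\w+)\((\d+),([\d.]+) (KB|MB|GB),([\d.]+)%\)")

# Assumed unit sizes (binary multiples).
UNIT_BYTES = {"KB": 1024, "MB": 1024 ** 2, "GB": 1024 ** 3}


def parse_block_cache_entry_stats(line: str) -> Dict[str, Dict[str, float]]:
    """Return {role: {count, bytes, portion_pct}} for one stats line."""
    stats: Dict[str, Dict[str, float]] = {}
    for role, count, size, unit, portion in ENTRY.findall(line):
        stats[role] = {
            "count": int(count),
            "bytes": float(size) * UNIT_BYTES[unit],
            "portion_pct": float(portion),
        }
    return stats


line = ("Block cache entry stats(count,size,portion): "
        "DataBlock(441761,6.82 GB,75.765%) FilterBlock(3002,1.27 GB,14.1387%) "
        "IndexBlock(17777,887.75 MB,9.63267%) Misc(1,0.00 KB,0%)")

stats = parse_block_cache_entry_stats(line)
print(stats["FilterBlock"]["count"], stats["FilterBlock"]["portion_pct"])
# -> 3002 14.1387
```

When programmatic access is needed inside the application, the DB::GetMapProperty route described above avoids depending on the log format.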

Sanity checking lifetime

Can we be sure that using filters even makes sense for such long-lived data? We can apply the current 5 minute rule for caching SSD data in RAM. A 4KB filter page holds data for roughly 4K keys. If we assume at least one negative (useful) filter query in its lifetime per added key, it can satisfy the 5 minute rule with a lifetime of up to about two weeks. Thus, the lifetime threshold for “no filter” is about 300x higher than the lifetime threshold for Ribbon filter.

What to do with saved memory

The default way to improve overall RocksDB performance with more available memory is to use more space for caching, which improves latency, CPU load, read IOs, etc. With cache_index_and_filter_blocks=1, savings in filters will automatically make room for caching more data blocks in block cache. With cache_index_and_filter_blocks=0, consider increasing block cache size.

Using the space savings to lower filter FP rates is also an option, but there is less evidence for this commonly improving existing optimized configurations.

Generic recommendation

If using NewBloomFilterPolicy(bpk) for a large persistent DB using compression, try using NewRibbonFilterPolicy(bpk) instead, which will generate Ribbon filters during compaction and Bloom filters for flush, both with the same FP rate as the old setting. Once new SST files are generated under the new policy, this should free up some memory for more caching without much effect on burst or sustained write speed. Both kinds of filters can be read under either policy, so there’s always an option to adjust settings or gracefully roll back to using Bloom filter only (keeping in mind that SST files must be replaced to see effect of that change).

Andrew Kryczka

Preset Dictionary Compression

Posted May 31, 2021

Summary

Compression algorithms relying on an adaptive dictionary, such as LZ4, zstd, and zlib, struggle to achieve good compression ratios on small inputs when using the basic compress API. With the basic compress API, the compressor starts with an empty dictionary. With small inputs, not much content gets added to the dictionary during the compression. Combined, these factors suggest the dictionary will never have enough contents to achieve great compression ratios.

RocksDB groups key-value pairs into data blocks before storing them in files. For use cases that are heavy on random accesses, smaller data block size is sometimes desirable for reducing I/O and CPU spent reading blocks. However, as explained above, smaller data block size comes with the downside of worse compression ratio when using the basic compress API.

Fortunately, zstd and other libraries offer advanced compress APIs that preset the dictionary. A preset dictionary makes it possible for the compressor to start from a useful state instead of from an empty one, making compression immediately effective.

RocksDB now optionally takes advantage of these dictionary presetting APIs. The challenges in integrating this feature into the storage engine were more substantial than apparent on the surface. First, we need to target a preset dictionary to the relevant data. Second, preset dictionaries need to be trained from data samples, which need to be gathered. Third, preset dictionaries need to be persisted since they are needed at decompression time. Fourth, overhead in accessing the preset dictionary must be minimized to prevent regression in critical code paths. Fifth, we need easy-to-use measurement to evaluate candidate use cases and production impact.

In production, we have deployed dictionary presetting to save space in multiple RocksDB use cases with data block size 8KB or smaller. We have measured meaningful benefit to compression ratio in use cases with data block size up to 16KB. We have also measured a use case that can save both CPU and space by reducing data block size and turning on dictionary presetting at the same time.

Feature design

Targeting

Over time we have considered a few possibilities for the scope of a dictionary.

Subcompaction
SST file
Column family

The original choice was subcompaction scope. This enabled an approach with minimal buffering overhead because we could collect samples while generating the first output SST file. The dictionary could then be trained and applied to subsequent SST files in the same subcompaction.

However, we found a large use case where the proximity of data in the keyspace was more correlated with its similarity than we had predicted. In particular, the approach of training a dictionary on an adjacent file yielded substantially worse ratios than training the dictionary on the same file it would be used to compress. In response to this finding, we changed the preset dictionary scope to per SST file.

With this change in approach, we had to face the problem we had hoped to avoid: how can we compress all of an SST file’s data blocks with the same preset dictionary while that dictionary can only be trained after many data blocks have been sampled? The solutions we considered both involved a new overhead. We could read the input more than once and introduce I/O overhead, or we could buffer the uncompressed output file data blocks until a dictionary is trained, introducing memory overhead. We chose to take the hit on memory overhead.

Another approach that we considered was associating multiple dictionaries with a column family. For example, in MyRocks there could be a dictionary trained on data from each large table. When compressing a data block, we would look at the table to which its data belongs and pick the corresponding dictionary. However, this approach would introduce many challenges. RocksDB would need to be aware of the key schema to know where the table boundaries are. RocksDB would also need to periodically update the dictionaries to account for changes in data pattern. It would need somewhere to store dictionaries at column family scope. Overall, we judged these challenges too difficult to justify pursuing the approach.

Training

Raw samples mode (`zstd_max_train_bytes == 0`)

As mentioned earlier, the approach we took is to build the dictionary from buffered uncompressed data blocks. The first row of data blocks in these diagrams illustrates this buffering. The second row illustrates training samples selected from the buffered blocks. In raw samples mode (above), the final dictionary is simply the concatenation of these samples, whereas in zstd training mode (below), these samples are passed to the trainer to produce the final dictionary.

zstd training mode (`zstd_max_train_bytes > 0`)
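Raw samples mode can be sketched in a few lines. This is an illustration only, not RocksDB's sampling code: the function name `BuildRawDictionary`, the fixed `sample_len`, the evenly spaced block selection, and taking each sample from the start of a block are all assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, not RocksDB code: builds a raw-samples dictionary by
// concatenating fixed-length samples taken from buffered, uncompressed data
// blocks, never exceeding max_dict_bytes.
std::string BuildRawDictionary(const std::vector<std::string>& buffered_blocks,
                               std::size_t max_dict_bytes,
                               std::size_t sample_len) {
  assert(sample_len > 0);
  std::string dict;
  const std::size_t max_samples = max_dict_bytes / sample_len;
  if (buffered_blocks.empty() || max_samples == 0) return dict;

  // Spread the samples over the whole buffer instead of only its start.
  const std::size_t step =
      std::max<std::size_t>(1, buffered_blocks.size() / max_samples);

  for (std::size_t i = 0; i < buffered_blocks.size(); i += step) {
    if (dict.size() + sample_len > max_dict_bytes) break;  // budget exhausted
    const std::string& block = buffered_blocks[i];
    // Short blocks contribute fewer bytes than sample_len.
    dict.append(block, 0, std::min(sample_len, block.size()));
  }
  return dict;
}
```

In zstd training mode the same samples would be handed to the zstd dictionary trainer instead of being used directly as the dictionary.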

Compression path

Once the preset dictionary is generated by the above process, we apply it to the buffered data blocks and write them to the output file. Thereafter, newly generated data blocks are immediately compressed and written out.
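The two phases of this path (buffer, then pass through) can be modeled with a small state machine. The class below is a hypothetical sketch: `DictBufferingWriter`, the byte-based buffer limit, and the pluggable `train` and `compress` callables are assumptions for illustration and do not mirror the table builder's real interfaces.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: blocks are buffered until a byte limit is reached, then
// a dictionary is built, the buffered blocks are compressed and emitted, and
// every later block is compressed immediately.
class DictBufferingWriter {
 public:
  using TrainFn = std::function<std::string(const std::vector<std::string>&)>;
  using CompressFn =
      std::function<std::string(const std::string&, const std::string&)>;

  DictBufferingWriter(std::size_t buffer_limit_bytes, TrainFn train,
                      CompressFn compress)
      : buffer_limit_(buffer_limit_bytes),
        train_(std::move(train)),
        compress_(std::move(compress)) {
    assert(train_ && compress_);
  }

  void AddBlock(const std::string& block) {
    if (dict_ready_) {  // unbuffered phase: compress right away
      output_.push_back(compress_(block, dict_));
      return;
    }
    buffered_.push_back(block);
    buffered_bytes_ += block.size();
    if (buffered_bytes_ >= buffer_limit_) TrainAndFlush();
  }

  // Small files may never reach the limit, so flush at end of file.
  void Finish() {
    if (!dict_ready_ && !buffered_.empty()) TrainAndFlush();
  }

  const std::vector<std::string>& output() const { return output_; }

 private:
  void TrainAndFlush() {
    dict_ = train_(buffered_);
    for (const std::string& block : buffered_) {
      output_.push_back(compress_(block, dict_));
    }
    buffered_.clear();
    dict_ready_ = true;
  }

  std::size_t buffer_limit_;
  TrainFn train_;
  CompressFn compress_;
  bool dict_ready_ = false;
  std::size_t buffered_bytes_ = 0;
  std::string dict_;
  std::vector<std::string> buffered_;
  std::vector<std::string> output_;
};
```

The memory cost mentioned earlier is visible here: nothing is written until `TrainAndFlush` runs, so up to `buffer_limit_` bytes of uncompressed blocks are held per file being built.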

One optimization here is available to zstd v0.7.0+ users. Instead of deserializing the dictionary on each compress invocation, we can do that work once and reuse it. A ZSTD_CDict holds this digested dictionary state and is passed to the compress API.

Persistence

When an SST file’s data blocks are compressed using a preset dictionary, that dictionary is stored inside the file for later use in decompression.

SST file layout with the preset dictionary in its own (uncompressed) block

Decompression path

To decompress, we need to provide both the data block and the dictionary used to compress it. Since dictionaries are just blocks in a file, we access them through block cache. However, this additional load on block cache can be problematic. It can be alleviated by pinning the dictionaries to avoid going through the LRU locks.

An optimization analogous to the digested dictionary exists for certain zstd users (see User API section for details). When enabled, the block cache stores the digested dictionary state for decompression (ZSTD_DDict) instead of the block contents. In some cases we have seen decompression CPU decrease overall when enabling dictionary thanks to this optimization.

Measurement

Typically our first step in evaluating a candidate use case is an offline analysis of the data. This gives us a quick idea whether presetting dictionary will be beneficial without any code, config, or data changes. Our sst_dump tool reports what size SST files would have been using specified compression libraries and options. We can select random SST files and compare the size with vs. without dictionary.

When that goes well, the next step is to see how it works in a live DB, like a production shadow or canary. There we can observe how it affects application/system metrics.

Even after dictionary is enabled, there is the question of how much space was finally saved. We provide a way to A/B test size with vs. without dictionary while running in production. This feature picks a sample of data blocks to compress in multiple ways – one of the outputs is stored, while the other outputs are thrown away after counting their size. Due to API limitations, the stored output always has to be the dictionary-compressed one, so this feature can only be used after enabling dictionary. The sizes with and without dictionary are stored in the SST file as table properties. These properties can be aggregated across all SST files in a DB (and across all DBs in a tier) to learn the final space saving.

User API

RocksDB allows presetting compression dictionary for users of LZ4, zstd, and zlib. The most advanced capabilities are available to zstd v1.1.4+ users who statically link (see below). Newer versions of zstd (v1.3.6+) have internal changes to the dictionary trainer and digested dictionary management, which significantly improve memory and CPU efficiency.

Run-time settings:

CompressionOptions::max_dict_bytes: Limit on per-SST file dictionary size. Increasing this causes dictionaries to consume more space and memory for the possibility of better data block compression. A typical value we use is 16KB.
(zstd only) CompressionOptions::zstd_max_train_bytes: Limit on training data passed to zstd dictionary trainer. Larger values cause the training to consume more CPU (and take longer) while generating more effective dictionaries. The starting point guidance we received from zstd team is to set it to 100x CompressionOptions::max_dict_bytes.
CompressionOptions::max_dict_buffer_bytes: Limit on data buffering from which training samples are gathered. By default we buffer up to the target file size per ongoing background job. If this amount of memory is concerning, this option can constrain the buffering with the downside that training samples will cover a smaller portion of the SST file. Work is ongoing to charge this memory usage to block cache so it will not need to be accounted for separately.
BlockBasedTableOptions::cache_index_and_filter_blocks: Controls whether metadata blocks including dictionary are accessed through block cache or held in table reader memory (yes, its name is outdated).
BlockBasedTableOptions::metadata_cache_options: Controls what metadata blocks are pinned in block cache. Pinning avoids LRU contention at the risk of cold blocks holding memory.
ColumnFamilyOptions::sample_for_compression: Controls frequency of measuring extra compressions on data blocks using various libraries with default settings (i.e., without preset dictionary).
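Put together, the run-time settings above might be applied as in the following configuration sketch. The helper name `MakeDictionaryOptions` is hypothetical, and the buffer limit, sampling frequency, and pinning tier are illustrative values rather than recommendations; only the 16KB dictionary size and the 100x training ratio come from the text.

```cpp
#include <rocksdb/options.h>
#include <rocksdb/table.h>

// Hypothetical helper that enables zstd dictionary presetting.
rocksdb::Options MakeDictionaryOptions() {
  rocksdb::Options options;
  options.compression = rocksdb::kZSTD;

  // Typical dictionary size mentioned above.
  options.compression_opts.max_dict_bytes = 16 * 1024;
  // Starting-point guidance: 100x the dictionary size.
  options.compression_opts.zstd_max_train_bytes =
      100 * options.compression_opts.max_dict_bytes;
  // Illustrative cap on buffered data per background job (assumption).
  options.compression_opts.max_dict_buffer_bytes = 64ULL << 20;

  // Illustrative sampling frequency for the measurement feature (assumption).
  options.sample_for_compression = 1000;

  rocksdb::BlockBasedTableOptions table_options;
  // Access metadata blocks, including dictionaries, through block cache.
  table_options.cache_index_and_filter_blocks = true;
  // Illustrative choice: pin unpartitioned metadata blocks in block cache.
  table_options.metadata_cache_options.unpartitioned_pinning =
      rocksdb::PinningTier::kAll;
  options.table_factory.reset(
      rocksdb::NewBlockBasedTableFactory(table_options));
  return options;
}
```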

Compile-time setting:

(zstd only) EXTRA_CXXFLAGS=-DZSTD_STATIC_LINKING_ONLY: Hold digested dictionaries in block cache to save repetitive deserialization overhead. This saves a lot of CPU for read-heavy workloads. This compiler flag is necessary because one of the digested dictionary APIs we use is marked as experimental. We still use it in production, however.

Function:

DB::GetPropertiesOfAllTables(): The properties kSlowCompressionEstimatedDataSize and kFastCompressionEstimatedDataSize estimate what the data block size (kDataSize) would have been if the corresponding compression library had been used. These properties are only present when ColumnFamilyOptions::sample_for_compression causes one or more samples to be measured, and they become more accurate with higher sampling frequency.
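Once these properties have been collected, aggregating them is simple arithmetic. The sketch below uses a stand-in struct rather than RocksDB's table properties type, so `SstSizeSample` and `SpaceSavedFraction` are hypothetical names; files without samples are skipped so they do not distort the ratio.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for the per-file numbers read from table properties.
struct SstSizeSample {
  std::uint64_t data_size;            // data block bytes as stored (with dictionary)
  std::uint64_t estimated_data_size;  // estimate without dictionary; 0 if not sampled
};

// Fraction of data block space saved across all sampled files:
// 1 - (stored bytes / estimated bytes without dictionary).
double SpaceSavedFraction(const std::vector<SstSizeSample>& files) {
  std::uint64_t stored = 0;
  std::uint64_t estimated = 0;
  for (const SstSizeSample& f : files) {
    if (f.estimated_data_size == 0) continue;  // no samples in this file
    stored += f.data_size;
    estimated += f.estimated_data_size;
  }
  if (estimated == 0) return 0.0;
  return 1.0 - static_cast<double>(stored) / static_cast<double>(estimated);
}
```

Summing before dividing weights each file by its size, which is usually what is wanted when the question is total space saved across a DB or tier.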

Tool:

sst_dump --command=recompress: Offline analysis tool that reports what the SST file size would have been using the specified compression library and options.

RocksDB Secondary Cache

Posted May 27, 2021

Introduction

The RocksDB team is implementing support for a block cache on non-volatile media, such as a local flash device or NVM/SCM. It can be viewed as an extension of RocksDB’s current volatile block cache (LRUCache or ClockCache). The non-volatile block cache acts as a second tier cache that contains blocks evicted from the volatile cache. Those blocks are then promoted to the volatile cache as they become hotter due to access.

This feature is meant for cases where the DB is located on remote storage or cloud storage. The non-volatile cache is officially referred to in RocksDB as the SecondaryCache. By maintaining a SecondaryCache that’s an order of magnitude larger than DRAM, fewer reads would be required from remote storage, thus reducing read latency as well as network bandwidth consumption.
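The spill-on-eviction and promote-on-hit behavior can be illustrated with a toy two-tier cache. Everything here is hypothetical: `TwoTierCache` is not a RocksDB class, an in-memory map stands in for the flash device, and promoted entries are left in the secondary tier, which is just one possible policy.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Toy model: a small LRU "volatile" tier that spills evicted entries into a
// larger "secondary" tier and promotes them back on a hit.
class TwoTierCache {
 public:
  explicit TwoTierCache(std::size_t primary_capacity)
      : capacity_(primary_capacity) {
    assert(capacity_ > 0);
  }

  void Insert(const std::string& key, const std::string& value) {
    auto it = index_.find(key);
    if (it != index_.end()) {  // replace an existing entry
      lru_.erase(it->second);
      index_.erase(it);
    }
    lru_.emplace_front(key, value);
    index_[key] = lru_.begin();
    if (lru_.size() > capacity_) {
      // Evict the least recently used entry and spill it to the second tier.
      Entry& victim = lru_.back();
      secondary_[victim.first] = victim.second;
      index_.erase(victim.first);
      lru_.pop_back();
    }
  }

  bool Lookup(const std::string& key, std::string* value) {
    auto it = index_.find(key);
    if (it != index_.end()) {
      // Hit in the volatile tier: mark as most recently used.
      lru_.splice(lru_.begin(), lru_, it->second);
      *value = it->second->second;
      return true;
    }
    auto sit = secondary_.find(key);
    if (sit == secondary_.end()) return false;
    // Hit in the secondary tier: promote into the volatile tier.
    *value = sit->second;
    Insert(key, *value);
    return true;
  }

  bool InPrimary(const std::string& key) const { return index_.count(key) > 0; }
  bool InSecondary(const std::string& key) const {
    return secondary_.count(key) > 0;
  }

 private:
  using Entry = std::pair<std::string, std::string>;
  std::size_t capacity_;
  std::list<Entry> lru_;
  std::unordered_map<std::string, std::list<Entry>::iterator> index_;
  std::unordered_map<std::string, std::string> secondary_;
};
```

A miss in both tiers is the case that would go to remote storage, which is what a large secondary tier is meant to make rarer.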

From the user point of view, the local flash cache will support the following requirements -

Provide a pointer to a secondary cache when opening a DB
Be able to share the secondary cache across DBs in the same process
Have multiple secondary caches on a host
Support persisting the cache across process restarts and reboots by ensuring repeatability of the cache key

Architecture

Design

When designing the API for a SecondaryCache, we had a choice between making it visible to the RocksDB code (table reader) or hiding it behind the RocksDB block cache. There are several advantages of hiding it behind the block cache -

Allows flexibility in insertion of blocks into the secondary cache. A block can be inserted on eviction from the RAM tier, or it could be eagerly inserted.
It makes the rest of the RocksDB code less complex by providing a uniform interface regardless of whether a secondary cache is configured or not
Makes parallel reads, peeking in the cache for prefetching, failure handling etc. easier
Makes it easier to extend to compressed data if needed, and allows other persistent media, such as PM, to be added as an additional tier

We decided to make the secondary cache transparent to the rest of RocksDB code by hiding it behind the block cache. A key issue that we needed to address was the allocation and ownership of memory of the cached items: insertion into the secondary cache may require that memory be allocated by the secondary cache itself. This means that the parts of the cached object that can be transferred to the secondary cache need to be copied out (referred to as unpacking), and on a lookup the data stored in the secondary cache needs to be provided to the object constructor (referred to as packing). For RocksDB cached objects such as data blocks, index and filter blocks, and compression dictionaries, unpacking involves copying out the raw uncompressed BlockContents of the block, and packing involves constructing the corresponding block/index/filter/dictionary object using the raw uncompressed data.

Another alternative we considered was the existing PersistentCache interface. However, we decided to not pursue it and eventually deprecate it for the following reasons -

It is exposed directly to the table reader code, which makes it more difficult to implement different policies such as inclusive/exclusive cache, as well as extending it to more sophisticated admission control policies
The interface does not allow for custom memory allocation and object packing/unpacking, so new APIs would have to be defined anyway
The current PersistentCache implementation is very simple and does not have any admission control policies

API

The interface between RocksDB’s block cache and the secondary cache is designed to allow pluggable implementations. For FB internal usage, we plan to use Cachelib with a wrapper to provide the plug-in implementation and use folly and other fbcode libraries, which cannot be used directly by RocksDB, to efficiently implement the cache operations. The following diagrams show the flow of insertion and lookup of a block.

Insert flow

Lookup flow

An item in the secondary cache is referenced by a SecondaryCacheHandle. The handle may not be immediately ready or have a valid value. The caller can call IsReady() to determine if it is ready, and can call Wait() in order to block until it becomes ready. The caller must call Value() after it becomes ready to determine if the item was successfully read. Value() must return nullptr on failure.

class SecondaryCacheHandle {
 public:
  virtual ~SecondaryCacheHandle() {}

  // Returns whether the handle is ready or not
  virtual bool IsReady() = 0;

  // Block until handle becomes ready
  virtual void Wait() = 0;

  // Return the value. If nullptr, it means the lookup was unsuccessful
  virtual void* Value() = 0;

  // Return the size of value
  virtual size_t Size() = 0;
};

The user of the secondary cache (for example, BlockBasedTableReader indirectly through LRUCache) must implement the callbacks defined in CacheItemHelper, in order to facilitate the unpacking/packing of objects for saving to and restoring from the secondary cache. The CreateCallback must be implemented to construct a cacheable object from the raw data in secondary cache.

  // The SizeCallback takes a void* pointer to the object and returns the size
  // of the persistable data. It can be used by the secondary cache to allocate
  // memory if needed.
  using SizeCallback = size_t (*)(void* obj);

  // The SaveToCallback takes a void* object pointer and saves the persistable
  // data into a buffer. The secondary cache may decide to not store it in a
  // contiguous buffer, in which case this callback will be called multiple
  // times with increasing offset
  using SaveToCallback = Status (*)(void* from_obj, size_t from_offset,
                                    size_t length, void* out);

  // A function pointer type for custom destruction of an entry's
  // value. The Cache is responsible for copying and reclaiming space
  // for the key, but values are managed by the caller.
  using DeleterFn = void (*)(const Slice& key, void* value);

  // A struct with pointers to helper functions for spilling items from the
  // cache into the secondary cache. May be extended in the future. An
  // instance of this struct is expected to outlive the cache.
  struct CacheItemHelper {
    SizeCallback size_cb;
    SaveToCallback saveto_cb;
    DeleterFn del_cb;

    CacheItemHelper() : size_cb(nullptr), saveto_cb(nullptr), del_cb(nullptr) {}
    CacheItemHelper(SizeCallback _size_cb, SaveToCallback _saveto_cb,
                    DeleterFn _del_cb)
        : size_cb(_size_cb), saveto_cb(_saveto_cb), del_cb(_del_cb) {}
  };

  // The CreateCallback is passed by the block cache user to Lookup(). It
  // takes in a buffer from the NVM cache and constructs an object using
  // it. The callback doesn't have ownership of the buffer and should
  // copy the contents into its own buffer.
  // typedef std::function<Status(void* buf, size_t size, void** out_obj,
  //                             size_t* charge)>
  //    CreateCallback;
  using CreateCallback = std::function<Status(void* buf, size_t size,
                                              void** out_obj, size_t* charge)>;
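To make the unpacking/packing round trip concrete, here is a self-contained sketch of the three callbacks for a trivial object. The callback types are copied from the listing above, but `Status` is replaced by a stand-in enum so the snippet compiles without RocksDB headers, and `ToyBlock` with its helper functions is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <string>

// Stand-in for rocksdb::Status so the sketch compiles on its own.
enum class Status { kOk, kInvalidArgument };

// Callback types as in the listing above, using the stand-in Status.
using SizeCallback = std::size_t (*)(void* obj);
using SaveToCallback = Status (*)(void* from_obj, std::size_t from_offset,
                                  std::size_t length, void* out);
using CreateCallback = std::function<Status(void* buf, std::size_t size,
                                            void** out_obj, std::size_t* charge)>;

// Hypothetical cached object whose persistable data is just its payload.
struct ToyBlock {
  std::string payload;
};

std::size_t ToyBlockSize(void* obj) {
  return static_cast<ToyBlock*>(obj)->payload.size();
}

// Unpacking: copy `length` bytes starting at `from_offset` into `out`. The
// secondary cache may call this several times with increasing offsets.
Status ToyBlockSaveTo(void* from_obj, std::size_t from_offset,
                      std::size_t length, void* out) {
  const std::string& p = static_cast<ToyBlock*>(from_obj)->payload;
  if (from_offset > p.size() || length > p.size() - from_offset) {
    return Status::kInvalidArgument;
  }
  std::memcpy(out, p.data() + from_offset, length);
  return Status::kOk;
}

// Packing: build a new object that owns a copy of the buffer contents, since
// the callback does not own the buffer it is given.
Status ToyBlockCreate(void* buf, std::size_t size, void** out_obj,
                      std::size_t* charge) {
  *out_obj = new ToyBlock{std::string(static_cast<const char*>(buf), size)};
  *charge = size;
  return Status::kOk;
}
```

In this sketch the object returned through `out_obj` is heap-allocated, so whoever ends up owning it (in RocksDB, the deleter registered with the cache entry) is responsible for freeing it.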

The secondary cache provider must provide a concrete implementation of the SecondaryCache abstract class.

// SecondaryCache
//
// Cache interface for caching blocks on a secondary tier (which can include
// non-volatile media, or alternate forms of caching such as compressed data)
class SecondaryCache {
 public:
  virtual ~SecondaryCache() {}

  virtual std::string Name() = 0;

  static const std::string Type() { return "SecondaryCache"; }

  // Insert the given value into this cache. The value is not written
  // directly. Rather, the SaveToCallback provided by helper_cb will be
  // used to extract the persistable data in value, which will be written
  // to this tier. The implementation may or may not write it to cache
  // depending on the admission control policy, even if the return status is
  // success.
  virtual Status Insert(const Slice& key, void* value,
                        const Cache::CacheItemHelper* helper) = 0;

  // Lookup the data for the given key in this cache. The create_cb
  // will be used to create the object. The handle returned may not be
  // ready yet, unless wait=true, in which case Lookup() will block until
  // the handle is ready
  virtual std::unique_ptr<SecondaryCacheHandle> Lookup(
      const Slice& key, const Cache::CreateCallback& create_cb, bool wait) = 0;

  // At the discretion of the implementation, erase the data associated
  // with key
  virtual void Erase(const Slice& key) = 0;

  // Wait for a collection of handles to become ready. This would be used
  // by MultiGet, for example, to read multiple data blocks in parallel
  virtual void WaitAll(std::vector<SecondaryCacheHandle*> handles) = 0;

  virtual std::string GetPrintableOptions() const = 0;
};

A SecondaryCache is configured by the user by providing a pointer to it in LRUCacheOptions -

struct LRUCacheOptions {
  ...
  // A SecondaryCache instance to use as an additional cache tier
  std::shared_ptr<SecondaryCache> secondary_cache;
  ...
};

Current Status

The initial RocksDB support for the secondary cache has been merged into the main branch, and will be available in the 6.21 release. This includes providing a way for the user to configure a secondary cache when instantiating RocksDB’s LRU cache (volatile block cache), spilling blocks evicted from the LRU cache to the flash cache, promoting a block read from the SecondaryCache to the LRU cache, and updating tools such as cache_bench and db_bench to specify a flash cache. The relevant PRs are #8271, #8191, and #8312.

We prototyped an end-to-end solution, with the above PRs as well as a Cachelib based implementation of the SecondaryCache. We ran a mixgraph benchmark to simulate a realistic read/write workload. The results showed a 15% gain with the local flash cache over no local cache, and a ~25-30% reduction in network reads with a corresponding decrease in cache misses.

Throughput

Hit Rate

Future Work

In the short term, we plan to do the following in order to fully integrate the SecondaryCache with RocksDB -

Use DB session ID as the cache key prefix to ensure uniqueness and repeatability
Optimize flash cache usage of MultiGet and iterator workloads
Stress testing
More benchmarking

Longer term, we plan to deploy this in production at Facebook.

Call to Action

We are hoping for a community contribution of a secondary cache implementation, which would make this feature usable by the broader RocksDB userbase. If you are interested in contributing, please reach out to us in this issue.

Siying Dong

Online Validation

Posted May 26, 2021

To prevent or mitigate data corruption in RocksDB when software or hardware issues occur, we keep adding online consistency checks and improving existing ones.

We improved ColumnFamilyOptions::force_consistency_checks and enabled it by default. The option does some basic consistency checks on the LSM-tree, e.g., files in one level are not overlapping. The DB will be frozen from new writes if a violation is detected. Previously, the feature’s check was too limited and didn’t always freeze the DB in a timely manner. Last year, we made the checking stricter so that it can catch many more corrupted LSM-tree structures. We also fixed several issues where the checking failure was swallowed without freezing the DB. After making force_consistency_checks more reliable, we changed the default value to be on.
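The "files in one level are not overlapping" check is easy to picture with a simplified model. The sketch below is an assumption-laden illustration, not RocksDB's implementation: `FileRange` and `LevelIsConsistent` are hypothetical, keys are compared bytewise as plain strings, and the input is assumed to be the file list of a single sorted level.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified view of an SST file's key range.
struct FileRange {
  std::string smallest;
  std::string largest;
};

// Returns true if every file has a valid range and each file starts strictly
// after the previous one ends, i.e. the level is sorted and non-overlapping.
bool LevelIsConsistent(const std::vector<FileRange>& files) {
  for (std::size_t i = 0; i < files.size(); ++i) {
    if (files[i].smallest > files[i].largest) return false;  // inverted range
    if (i > 0 && files[i - 1].largest >= files[i].smallest) return false;
  }
  return true;
}
```

A failed check of this kind is what would freeze the DB from new writes, as described above.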

ColumnFamilyOptions::paranoid_file_checks does some more expensive extra checking when generating a new SST file. Last year, we extended the coverage of this feature: after every SST file is generated, we read its keys back one by one and check two things: (1) the keys are in comparator order (also available and enabled by default during file write via ColumnFamilyOptions::check_flush_compaction_key_order); (2) the hash of all the KVs is the same as the one calculated when we added the KVs to the file. These checks detect certain corruptions so we can prevent the corrupt files from being applied to the DB. We suggest users turn it on at least in shadow environments, and consider running it in production too if they can afford the overhead.
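The two read-back checks can be sketched as follows. This is a model under stated assumptions: keys are ordered bytewise, an FNV-1a style hash stands in for whatever hash RocksDB uses internally, and `HashEntries` and `ValidateReadBack` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using KeyValue = std::pair<std::string, std::string>;

enum class ValidationResult { kOk, kOutOfOrder, kHashMismatch };

const std::uint64_t kFnvOffset = 14695981039346656037ULL;
const std::uint64_t kFnvPrime = 1099511628211ULL;

// Mixes one string into a running hash. The length is mixed in first so that
// ("ab", "c") and ("a", "bc") do not hash to the same value.
std::uint64_t MixBytes(std::uint64_t h, const std::string& s) {
  h = (h ^ s.size()) * kFnvPrime;
  for (unsigned char c : s) h = (h ^ c) * kFnvPrime;
  return h;
}

// Order-dependent hash over all entries; computed once while writing and
// once more while reading the file back.
std::uint64_t HashEntries(const std::vector<KeyValue>& entries) {
  std::uint64_t h = kFnvOffset;
  for (const KeyValue& kv : entries) {
    h = MixBytes(h, kv.first);
    h = MixBytes(h, kv.second);
  }
  return h;
}

// Check (1): keys strictly increasing. Check (2): hash matches write time.
ValidationResult ValidateReadBack(const std::vector<KeyValue>& read_back,
                                  std::uint64_t expected_hash) {
  for (std::size_t i = 1; i < read_back.size(); ++i) {
    if (!(read_back[i - 1].first < read_back[i].first)) {
      return ValidationResult::kOutOfOrder;
    }
  }
  if (HashEntries(read_back) != expected_hash) {
    return ValidationResult::kHashMismatch;
  }
  return ValidationResult::kOk;
}
```

The cost noted above comes from the read-back itself: every newly written file is scanned once more before it is installed.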

A recently added feature checks the count of entries added to a memtable when flushing it into an SST file. It provides some online coverage of memtable corruption caused by either a software bug or a hardware issue. This feature will ship in the upcoming release (6.21) and will be on by default. In the future, we will check more memtable counters, e.g. the number of puts or the number of deletes.

We also improved the reporting of online validation errors to improve debuggability. For example, failure to parse a corrupt key now reports details about the corrupt key. Since we did not want to expose key data in logs, error messages, etc., by default, this reporting is opt-in via DBOptions::allow_data_in_errors.

More online checking features are planned, some of them more sophisticated, including key/value checksums and sample-based query validation.

Levi Tamasi

Integrated BlobDB

Posted May 26, 2021

Background

BlobDB is essentially RocksDB for large-value use cases. The basic idea, which was proposed in the WiscKey paper, is key-value separation: by storing large values in dedicated blob files and storing only small pointers to them in the LSM tree, we avoid copying the values over and over again during compaction, thus reducing write amplification. Historically, BlobDB supported only FIFO and TTL based use cases that can tolerate some data loss. In addition, it was incompatible with many widely used RocksDB features, and required users to adopt a custom API. In 2020, we decided to rearchitect BlobDB from the ground up, taking the lessons learned from WiscKey and the original BlobDB but also drawing inspiration and incorporating ideas from other similar systems. Our goals were to eliminate the above limitations and to create a new integrated version that enables customers to use the well-known RocksDB API, has feature parity with the core of RocksDB, and offers better performance. This new implementation is now available and provides the following improvements over the original:

API. In contrast with the legacy BlobDB implementation, which had its own StackableDB-based interface (rocksdb::blob_db::BlobDB), the new version can be used via the well-known rocksdb::DB API, and can be configured simply by using a few column family options.
Consistency. With the integrated BlobDB implementation, RocksDB’s consistency guarantees and various write options (like using the WAL or synchronous writes) now apply to blobs as well. Moreover, the new BlobDB keeps track of blob files in the RocksDB MANIFEST.
Write performance. When using the old BlobDB, blobs are extracted and immediately written to blob files by the BlobDB layer in the application thread. This has multiple drawbacks from a performance perspective: first, it requires synchronization; second, it means that expensive operations like compression are performed in the application thread; and finally, it involves flushing the blob file after each blob. The new code takes a completely different approach by offloading blob file building to RocksDB’s background jobs, i.e. flushes and compactions. This means that similarly to SSTs, any given blob file is now written by a single background thread, eliminating the need for locking, flushing, or performing compression in the foreground. Note that this approach is also a better fit for network-based file systems where small writes might be expensive and opens up the possibility of file format optimizations that involve buffering (like dictionary compression).
Read performance. The old code relies on each read (i.e. Get, MultiGet, or iterator) taking a snapshot and uses those snapshots when deciding which obsolete blob files can be removed. The new BlobDB improves this by generalizing RocksDB’s Version concept, which historically referred to the set of live SST files at a given point in time, to include the set of live blob files as well. This has performance benefits like making the read path mostly lock-free by utilizing thread-local storage. We have also introduced a blob file cache that can be utilized to keep frequently accessed blob files open.
Garbage collection. Key-value separation means that if a key pointing to a blob gets overwritten or deleted, the blob becomes unreferenced garbage. To be able to reclaim this space, BlobDB now has garbage collection capabilities. GC is integrated into the compaction process and works by relocating valid blobs residing in old blob files as they are encountered during compaction. Blob files can be marked obsolete (and eventually deleted in one shot) once they contain nothing but garbage. This is more efficient than the method used by WiscKey, which involves performing a Get operation to find out whether a blob is still referenced followed by a Put to update the reference, which in turn results in garbage collection competing and potentially conflicting with the application’s writes.
Feature parity with the RocksDB core. The new BlobDB supports far more features than the original and is near feature parity with vanilla RocksDB. In particular, we support all basic read/write APIs (with the exception of Merge, which is coming soon), recovery, compression, atomic flush, column families, compaction filters, checkpoints, backup/restore, transactions, per-file checksums, and the SST file manager. In addition, the new BlobDB’s options can be dynamically adjusted using the SetOptions interface.

API

The new BlobDB can be configured (on a per-column family basis if needed) simply by using the following options:

enable_blob_files: set it to true to enable key-value separation.
min_blob_size: values at or above this threshold will be written to blob files during flush or compaction.
blob_file_size: the size limit for blob files.
blob_compression_type: the compression type to use for blob files. All blobs in the same file are compressed using the same algorithm.
enable_blob_garbage_collection: set this to true to make BlobDB actively relocate valid blobs from the oldest blob files as they are encountered during compaction.
blob_garbage_collection_age_cutoff: the threshold that the GC logic uses to determine which blob files should be considered “old.” For example, the default value of 0.25 signals to RocksDB that blobs residing in the oldest 25% of blob files should be relocated by GC. This parameter can be tuned to adjust the trade-off between write amplification and space amplification.
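As a worked example of the age cutoff, the sketch below selects the oldest fraction of a list of blob files. `FilesEligibleForGc` is a hypothetical function, and rounding the count down is an assumption of this sketch rather than a statement about RocksDB's exact rule.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Given blob file numbers sorted oldest first, returns the files that a
// cutoff in [0, 1] would treat as "old" and therefore subject to relocation.
std::vector<std::uint64_t> FilesEligibleForGc(
    const std::vector<std::uint64_t>& blob_files_oldest_first,
    double age_cutoff) {
  assert(age_cutoff >= 0.0 && age_cutoff <= 1.0);
  // Truncation toward zero: with 3 files and a cutoff of 0.25, none qualify.
  const std::size_t count = static_cast<std::size_t>(
      age_cutoff * static_cast<double>(blob_files_oldest_first.size()));
  return std::vector<std::uint64_t>(
      blob_files_oldest_first.begin(),
      blob_files_oldest_first.begin() + static_cast<std::ptrdiff_t>(count));
}
```

Raising the cutoff relocates blobs from more files (more write amplification, less space amplification); lowering it does the opposite.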

The above options are all dynamically adjustable via the SetOptions API; changing them will affect subsequent flushes and compactions but not ones that are already in progress.
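A configuration sketch using the options listed above might look as follows. The helper names `MakeBlobOptions` and `RaiseGcCutoff` are hypothetical, and the threshold, file size, and compression choices are illustrative values, not recommendations.

```cpp
#include <string>
#include <unordered_map>

#include <rocksdb/db.h>
#include <rocksdb/options.h>

// Hypothetical helper: enables key-value separation with garbage collection.
rocksdb::Options MakeBlobOptions() {
  rocksdb::Options options;
  options.enable_blob_files = true;
  options.min_blob_size = 1024;            // illustrative threshold in bytes
  options.blob_file_size = 256ULL << 20;   // illustrative size limit
  options.blob_compression_type = rocksdb::kLZ4Compression;
  options.enable_blob_garbage_collection = true;
  options.blob_garbage_collection_age_cutoff = 0.25;  // default noted above
  return options;
}

// Hypothetical helper: adjusts the GC cutoff on a live DB (default column
// family). Only subsequent flushes and compactions are affected.
rocksdb::Status RaiseGcCutoff(rocksdb::DB* db) {
  const std::unordered_map<std::string, std::string> changes = {
      {"blob_garbage_collection_age_cutoff", "0.5"}};
  return db->SetOptions(changes);
}
```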

In terms of compaction styles, we recommend using leveled compaction with BlobDB. The rationale behind universal compaction in general is to provide lower write amplification at the expense of higher read amplification; however, as we will see later in the Performance section, BlobDB can provide very low write amp and good read performance with leveled compaction. Therefore, there is really no reason to take the hit in read performance that comes with universal compaction.

In addition to the above, consider tuning the following non-BlobDB specific options:

write_buffer_size: this is the memtable size. You might want to increase it for large-value workloads to ensure that SST and blob files contain a decent number of keys.
target_file_size_base: the target size of SST files. Note that even when using BlobDB, it is important to have an LSM tree with a “nice” shape and multiple levels and files per level to prevent heavy compactions. Since BlobDB extracts and writes large values to blob files, it makes sense to make this parameter significantly smaller than the memtable size. One guideline is to set blob_file_size to the same value as write_buffer_size (adjusted for compression if needed) and make target_file_size_base proportionally smaller based on the ratio of key size to value size.
max_bytes_for_level_base: consider setting this to a multiple (e.g. 8x or 10x) of target_file_size_base.
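The sizing guideline can be turned into simple arithmetic. The sketch below encodes one possible reading of "proportionally smaller based on the ratio of key size to value size", namely scaling the memtable size by key size over value size; that interpretation, the struct, and the function name are assumptions, and compression and blob reference overhead are ignored.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three derived settings.
struct BlobSizing {
  std::uint64_t blob_file_size;
  std::uint64_t target_file_size_base;
  std::uint64_t max_bytes_for_level_base;
};

// Derives sizes from the memtable size and average key/value sizes.
// level_multiple is the "8x or 10x" factor mentioned above.
BlobSizing SuggestSizing(std::uint64_t write_buffer_size,
                         std::uint64_t avg_key_size,
                         std::uint64_t avg_value_size,
                         std::uint64_t level_multiple) {
  assert(avg_key_size > 0 && avg_value_size > 0 && level_multiple > 0);
  BlobSizing s;
  s.blob_file_size = write_buffer_size;
  // Scale by key/value ratio: large values leave only small keys in the SST.
  s.target_file_size_base = write_buffer_size * avg_key_size / avg_value_size;
  s.max_bytes_for_level_base = level_multiple * s.target_file_size_base;
  return s;
}
```

For a 256 MB memtable with 16-byte keys and 4 KB values, this yields a 256 MB blob file size, a 1 MB target SST file size, and, with an 8x multiple, an 8 MB level base.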

As mentioned above, the new BlobDB now also supports compaction filters. Key-value separation actually enables an optimization here: if the compaction filter of an application can make a decision about a key-value solely based on the key, it is unnecessary to read the value from the blob file. Applications can take advantage of this optimization by implementing the new FilterBlobByKey method of the CompactionFilter interface. This method gets called by RocksDB first whenever it encounters a key-value where the value is stored in a blob file. If this method returns a “final” decision like kKeep, kRemove, kChangeValue, or kRemoveAndSkipUntil, RocksDB will honor that decision; on the other hand, if the method returns kUndetermined, RocksDB will read the blob from the blob file and call FilterV2 with the value in the usual fashion.
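The control flow described above can be modeled without RocksDB headers. In this sketch the `Decision` enum is a reduced stand-in for the compaction filter's decision type, the prefix rules are invented for illustration, and `read_blob` is a hypothetical callable standing in for the blob file read.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Reduced stand-in for the compaction filter decision type.
enum class Decision { kKeep, kRemove, kUndetermined };

// Hypothetical key-only filter: drops "tmp/" keys, keeps "pin/" keys, and
// defers to the value-based filter for everything else.
Decision FilterByKey(const std::string& key) {
  if (key.compare(0, 4, "tmp/") == 0) return Decision::kRemove;
  if (key.compare(0, 4, "pin/") == 0) return Decision::kKeep;
  return Decision::kUndetermined;
}

// Hypothetical value-based filter: removes entries with empty values.
Decision FilterByValue(const std::string& /*key*/, const std::string& value) {
  return value.empty() ? Decision::kRemove : Decision::kKeep;
}

// Consult the key-only filter first and read the blob only when it cannot
// decide, which is the saving that key-value separation makes possible.
Decision FilterBlobEntry(const std::string& key,
                         const std::function<std::string()>& read_blob) {
  const Decision by_key = FilterByKey(key);
  if (by_key != Decision::kUndetermined) return by_key;
  return FilterByValue(key, read_blob());
}
```

The benefit grows with the share of keys that can be decided from the key alone, since each such key avoids one blob read during compaction.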

Performance

We tested the performance of the new BlobDB for six different value sizes between 1 KB and 1 MB using a customized version of our standard benchmark suite on a box with an 18-core Skylake DE CPU (running at 1.6 GHz, with hyperthreading enabled), 64 GB RAM, a 512 GB boot SSD, and two 1.88 TB M.2 SSDs in a RAID0 configuration for data. The RocksDB version used was equivalent to 6.18.1, with some benchmarking and statistics related enhancements. Leveled and universal compaction without key-value separation were used as reference points. Note that for simplicity, we use “leveled compaction” and “universal compaction” as shorthand for leveled and universal compaction without key-value separation, respectively, and “BlobDB” for BlobDB with leveled compaction.

Our benchmarks cycled through six different workloads: two write-only ones (initial load and overwrite), two read/write ones (point lookup/write mix and range scan/write mix), and finally two read-only ones (point lookups and range scans). The first two phases performed a fixed amount of work (see below), while the final four were run for a fixed amount of time, namely 30 minutes each. Each phase other than the first one started with the database state left behind by the previous one. Here’s a brief description of the workloads:

Initial load: this workload has two distinct stages, a single-threaded random write stage during which compactions are disabled (so all data is flushed to L0, where it remains for the rest of the stage), followed by a full manual compaction. The random writes are performed with load-optimized settings, namely using the vector memtable implementation and with concurrent memtable writes and WAL disabled. This stage was used to populate the database with 1 TB worth of raw values, e.g. 2^30 (~1 billion) 1 KB values or 2^20 (~1 million) 1 MB values.
Overwrite: this is a multi-threaded random write workload using the usual skiplist memtable, with compactions, WAL, and concurrent memtable writes enabled. In our tests, 16 writer threads were used. The total number of writes was set to the same number as in the initial load stage and split up evenly between the writer threads. For instance, for the 1 MB value size, we had 2^20 writes divided up between the 16 threads, resulting in each thread performing 2^16 write operations. At the end of this phase, a “wait for compactions” step was added to prevent this workload from exhibiting artificially low write amp or conversely, the next phase showing inflated write amp.
Point lookup/write mix: a single writer thread performing random writes while N (in our case, 16) threads perform random point lookups. WAL is enabled and all writes are synced.
Range scan/write mix: similar to the above, with one writer thread and N reader threads (where N was again set to 16 in our tests). The reader threads perform random range scans, with 10 Next calls per Seek. Again, WAL is enabled, and sync writes are used.
Point lookups (read-only): N=16 threads perform random point lookups.
Range scans (read-only): N=16 threads execute random range scans, with 10 Nexts per Seek like above.

With that out of the way, let’s see how the new BlobDB performs against traditional leveled and universal compaction. In the next few sections, we’ll be looking at write amplification as well as read and write performance. We’ll also briefly compare the write performance of the new BlobDB with the legacy implementation.

Write amplification

Reducing write amp is the original motivation for key-value separation. Here, we follow RocksDB’s definition of write amplification (as used in compaction statistics and the info log). That is, we define write amp as the total amount of data written by flushes and compactions divided by the amount of data written by flushes, where “data written” includes SST files and blob files as well (if applicable). The following charts show that BlobDB significantly reduces write amplification for all of our (non-read only) workloads.

For the initial load, where due to the nature of the workload both leveled and universal already have a low write amp factor of 1.6, BlobDB has a write amp close to the theoretical minimum of 1.0, namely in the 1.0..1.02 range, depending on value size. How is this possible? Well, the trick is that when key-value separation is used, the full compaction step only has to sort the keys but not the values. This results in a write amp that is about 36% lower than the already low write amp you get with either leveled or universal.
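As a small worked example of the definition and the percentage above, the following sketch computes write amp from flush and compaction byte counts and the relative reduction between two write amp values. The function names and the byte counts used with them are illustrative assumptions, not part of RocksDB.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Write amp as defined in the text: bytes written by flushes and compactions
// (SST files plus blob files, if any), divided by bytes written by flushes.
double WriteAmp(std::uint64_t flush_bytes, std::uint64_t compaction_bytes) {
  assert(flush_bytes > 0);  // The ratio is undefined if nothing was flushed.
  return static_cast<double>(flush_bytes + compaction_bytes) /
         static_cast<double>(flush_bytes);
}

// How much lower `candidate` is than `baseline`, as a percentage.
double PercentLower(double baseline, double candidate) {
  assert(baseline > 0.0);
  return (baseline - candidate) / baseline * 100.0;
}
```

For instance, `WriteAmp(100, 60)` yields 1.6, and `PercentLower(1.6, 1.02)` yields 36.25, which lines up with the "about 36% lower" figure quoted for the initial load.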

In the case of the overwrite workload, BlobDB had a write amp between 1.4 and 1.7 depending on value size. This is around 75-78% lower than the write amp of leveled compaction (6.1 to 6.8) and 70-77% lower than universal (5.7 to 6.2); for this workload, there wasn’t a huge difference between the performance of leveled and universal.

When it comes to the point lookup/write mix workload, BlobDB had a write amp between 1.4 and 1.8. This is 83-88% lower than the write amp of leveled compaction, which had values between 10.8 and 12.5. Universal fared much better than leveled under this workload, and had write amp in the 2.2..6.6 range; however, BlobDB still provided significant gains for all value sizes we tested: namely, write amp was 18-77% lower than that of universal, depending on value size.

As for the range scan/write mix workload, BlobDB again had a write amp between 1.4 and 1.8, while leveled had values between 13.6 and 14.9, and universal was between 2.8 and 5.0. In other words, BlobDB’s write amp was 88-90% lower than that of leveled, and 46-70% lower than that of universal.

Write amplification

Write performance

In terms of write performance, there are other factors to consider besides write amplification. The following charts show some interesting metrics for the two write-only workloads (initial load and overwrite). As discussed earlier, these two workloads perform a fixed amount of work; the two charts in the top row show how long it took BlobDB, leveled, and universal to complete that work. Note that each bar is broken down into two, corresponding to the two stages of each workload (random write and full compaction for initial load, and random write and waiting for compactions for overwrite).

For initial load, note that the random write stage takes the same amount of time regardless of which algorithm is used. This is not surprising considering the fact that compactions are disabled during this stage and thus RocksDB is simply writing L0 files (and in BlobDB’s case, blob files) as fast as it can. The second stage, on the other hand, is very different: as mentioned above, BlobDB essentially only needs to read, sort, and rewrite the keys during compaction, which can be done much much faster (with 1 MB values, more than a hundred times faster) than doing the same for large key-values. Due to this, initial load completed 2.3x to 4.7x faster overall when using BlobDB.

As for the overwrite workload, BlobDB performs much better during both stages. The two charts in the bottom row help explain why. In the case of both leveled and universal compaction, compactions can’t keep up with the write rate, which eventually leads to back pressure in the form of write stalls. As shown in the chart below, both leveled and universal stall between ~40% and ~70% of the time; on the other hand, BlobDB is stall-free except for the largest value size tested (1 MB). This naturally leads to higher throughput, namely 2.1x to 3.5x higher throughput compared to leveled, and 1.6x to 3.0x higher throughput compared to universal. The overwrite time chart also shows that the catch-up stage that waits for all compactions to finish is much shorter (and in fact, at larger value sizes, negligible) with BlobDB.

Write performance

Read/write and read-only performance

The charts below show the read performance (in terms of operations per second) of BlobDB versus leveled and universal compaction under the two read/write workloads and the two read-only workloads. BlobDB meets or exceeds the read performance of leveled compaction, except for workloads involving range scans at the two smallest value sizes tested (1 KB and 4 KB). It also provides better (in some cases, much better) read performance than universal across the board. In particular, BlobDB provides up to 1.4x higher read performance than leveled (for larger values), and up to 5.6x higher than universal.

Read-write and read-only performance

Comparing the two BlobDB implementations

To compare the write performance of the new BlobDB with the legacy implementation, we ran two versions of the first (single-threaded random write) stage of the initial load benchmark using 1 KB values: one with WAL disabled, and one with WAL enabled. The new implementation completed the load 4.6x faster than the old one without WAL, and 2.3x faster with WAL.

Comparing the two BlobDB implementations

Future work

There are a few remaining features that are not yet supported by the new BlobDB. The most important one is Merge (and the related GetMergeOperands API); in addition, we don’t currently support the EventListener interface, the GetLiveFilesMetaData and GetColumnFamilyMetaData APIs, secondary instances, and ingestion of blob files. We will continue to work on closing this gap.

We also have further plans when it comes to performance. These include optimizing garbage collection, introducing a dedicated cache for blobs, improving iterator and MultiGet performance, and evolving the blob file format amongst others.

Siying Dong

(Call For Contribution) Make Universal Compaction More Incremental

Posted April 12, 2021

Motivation

Universal Compaction is an important compaction style, but few changes have been made since we made the structure multi-leveled. In particular, the major restriction of always compacting full sorted runs has not been relaxed. Compared to Leveled Compaction, where we usually only compact several SST files together, in universal compaction we frequently compact GBs of data. There are two issues with this gap: 1. it makes it harder to unify universal and leveled compaction; 2. periodically data is fully compacted, and in the meantime space usage is doubled. To ease the problem, we can break the restriction and do something similar to leveled compaction, bringing it closer to unified compaction.

We are calling for help to make the following improvements.

How Universal Compaction Works

In universal, whole levels are compacted together to satisfy two conditions (See wiki page for more details):

1. total size / bottommost level size > a threshold, or
2. total number of sorted runs (non-0 levels + L0 files) is within a threshold

1 is to limit extra space overhead used for dead data and 2 is for read performance.

If 1 is triggered, a full compaction will likely be triggered. If 2 is triggered, RocksDB compacts some sorted runs to bring the number down. It does so using a simple heuristic so that fewer writes are needed for that purpose over time: it starts by compacting smaller files, but if the total size to compact is similar to or larger than the size of the next level, it takes that level too, and so on (whether this is the best heuristic is another question, and we’ve never seriously looked at it).

How Can We Improve?

Let’s start from condition 1. Here we do a full compaction, but that is not necessary. A simple optimization would be to compact so that just enough files are merged into the bottommost level (Lmax) to satisfy condition 1. It would work if we only need to pick some files from Lmax-1, or if it is cheaper over time, we can pick some files from other levels too.

Then condition 2. Once we finish condition 1, there might be holes in some ranges in older levels. These holes might make it possible to fix the LSM-tree for condition 2 by compacting only some sub-ranges. RocksDB can take single files into consideration and apply more sophisticated heuristics.

This new approach makes universal compaction closer to leveled compaction. The operation for 1 is closer to how Leveled Compaction triggers Lmax-1 to Lmax compaction, and 2 can potentially be implemented as something similar to level picking in Leveled Compaction. In fact, all of these file-picking approaches can coexist in a single compaction style, and there is no fundamental conflict among them.

Limitation

There are three limitations:

Periodic automatic full compaction is unpleasant but at the same time is pleasant in another way. Some users might use it to reason that everything is periodically collapsed, so dead data is gone and old data is rewritten. We need to make sure periodic compaction continues to provide that.
L0 to the first non-L0 level compaction is the first time data is partitioned in the LSM-tree so that incremental compaction by range is possible. We might need to do more of these compactions in order to make incremental compaction possible, which will increase compaction work slightly.
Compacting a subset of a level would introduce some extra overhead for unaligned files, just as in leveled compaction. More SST boundary cutting heuristics can reduce this overhead, but it will still be there.

But I believe the benefits would outweigh the limitations. Reducing temporary space doubling and moving towards unified compaction would be important achievements.

Interested in Helping?

Compaction is the core of the LSM-tree, but its improvements are long overdue. If you are a user of universal compaction and would benefit from those improvements, we will be happy to work with you on speeding up the project and bringing them to RocksDB sooner. Feel free to communicate with us in this issue.

Maysam Yabandeh

Higher write throughput with `unordered_write` feature

Posted August 15, 2019

Since RocksDB 6.3, the unordered_write=true option together with WritePrepared transactions offers 34-42% higher write throughput compared to vanilla RocksDB. If the application can handle more relaxed ordering guarantees, the gain in throughput would increase to 63-131%.

Background

Currently RocksDB API delivers the following powerful guarantees:

Atomic reads: Either all of a write batch is visible to reads or none of it.
Read-your-own writes: When a write thread returns to the user, a subsequent read by the same thread will be able to see its own writes.
Immutable Snapshots: The reads visible to the snapshot are immutable in the sense that they will not be affected by any in-flight or future writes.

`unordered_write`

The unordered_write feature, when turned on, relaxes the default guarantees of RocksDB. While it still gives read-your-own-write property, neither atomic reads nor the immutable snapshot properties are provided any longer. However, RocksDB users could still get read-your-own-write and immutable snapshots when using this feature in conjunction with TransactionDB configured with WritePrepared transactions and two_write_queues. You can read here to learn about the design of unordered_write and here to learn more about WritePrepared transactions.

How to use it?

To get the same guarantees as vanilla RocksDB:

Options options;
options.unordered_write = true;
options.two_write_queues = true;
DB* db;
{
  TransactionDBOptions txn_db_options;
  txn_db_options.write_policy = TxnDBWritePolicy::WRITE_PREPARED;
  txn_db_options.skip_concurrency_control = true;
  TransactionDB* txn_db;
  TransactionDB::Open(options, txn_db_options, kDBPath, &txn_db);
  db = txn_db;
}
db->Write(...);

To get relaxed guarantees:

Options options;
options.unordered_write = true;
DB* db;
DB::Open(options, kDBPath, &db);
db->Write(...);

Benchmarks

TEST_TMPDIR=/dev/shm/ ~/db_bench --benchmarks=fillrandom --threads=32 --num=10000000 -max_write_buffer_number=16 --max_background_jobs=64 --batch_size=8 --writes=3000000 -level0_file_num_compaction_trigger=99999 --level0_slowdown_writes_trigger=99999 --level0_stop_writes_trigger=99999 -enable_pipelined_write=false -disable_auto_compactions --transaction_db=true --unordered_write=1 --disable_wal=0

Throughput with unordered_write=true and using WritePrepared transaction:

WAL: +42%
No-WAL: +34%

Throughput with unordered_write=true:

WAL: +63%
No-WAL: +131%

Maysam Yabandeh

format_version 4

Posted March 08, 2019

The data blocks in RocksDB consist of a sequence of key/value pairs sorted by key, where the pairs are grouped into restart intervals specified by block_restart_interval. Up to RocksDB version 5.14, where the latest and default value of BlockBasedTableOptions::format_version is 2, the format of index and data blocks is the same: index blocks use the same key format of <user_key,seq> and encode pointers to data blocks, <offset,size>, to a byte string and use them as values. The only difference is that the index blocks use index_block_restart_interval for the size of restart intervals. format_version=3,4 offer a more optimized, backward-compatible, yet forward-incompatible format for index blocks.

Pros

Using format_version=4 significantly reduces the index block size, in some cases around 4-5x. This frees more space in block cache, which would result in higher hit rate for data and filter blocks, or offer the same performance with a smaller block cache size.

Cons

Being forward-incompatible means that if you enable format_version=4 you cannot downgrade to a RocksDB version lower than 5.16.

How to use it?

BlockBasedTableOptions::format_version = 4
BlockBasedTableOptions::index_block_restart_interval = 16

What is format_version 3?

(Since RocksDB 5.15) In most cases, the sequence number seq is not necessary for keys in the index blocks. In such cases, format_version=3 skips encoding the sequence number and sets index_key_is_user_key in TableProperties, which is used by the reader to know how to decode the index block.

What is format_version 4?

(Since RocksDB 5.16) Changes the format of index blocks by delta encoding the index values, which are the block handles. This saves the encoding of BlockHandle::offset of the non-head index entries in each restart interval. If used, TableProperties::index_value_is_delta_encoded is set, which is used by the reader to know how to decode the index block. The format of each key is (shared_size, non_shared_size, shared, non_shared). The format of each value, i.e., block handle, is (offset, size) whenever the shared_size is 0, which includes the first entry in each restart interval. Otherwise the format is delta-size = block handle size - size of last block handle.

The index format in format_version=4 would be as follows:

restart_point   0: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
restart_point   1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
...
restart_point n-1: k, v (off, sz), k, v (delta-sz), ..., k, v (delta-sz)
where k is the key, v is the value, and the value's encoding is given in parentheses.
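To make the layout above concrete, here is a minimal sketch of delta encoding block handles within restart intervals. All names (`BlockHandle`, `EncodedValue`, `EncodeIndexValues`, `DecodeIndexValues`) are hypothetical and this is not RocksDB's on-disk encoding. It assumes that consecutive data blocks are laid out contiguously with a fixed-size trailer between them, so a non-head entry's offset can be derived from the previous handle, and it treats only the head of each restart interval as a full (offset, size) entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical types for illustration; not RocksDB's actual structures.
struct BlockHandle {
  std::uint64_t offset;
  std::uint64_t size;
};

struct EncodedValue {
  bool is_head;                // true: full (offset, size) is stored
  std::uint64_t offset;        // meaningful only when is_head is true
  std::int64_t size_or_delta;  // size if is_head, else size - previous size
};

// Stores a full handle at the head of each restart interval and only the
// size delta for every other entry.
std::vector<EncodedValue> EncodeIndexValues(
    const std::vector<BlockHandle>& handles, std::size_t restart_interval) {
  assert(restart_interval > 0);
  std::vector<EncodedValue> out;
  for (std::size_t i = 0; i < handles.size(); ++i) {
    const auto size = static_cast<std::int64_t>(handles[i].size);
    if (i % restart_interval == 0) {
      out.push_back({true, handles[i].offset, size});
    } else {
      const auto prev = static_cast<std::int64_t>(handles[i - 1].size);
      out.push_back({false, 0, size - prev});
    }
  }
  return out;
}

// Rebuilds the handles. Assumption: block i starts right after block i-1
// plus `trailer_size` bytes, which is what lets the offset be omitted.
std::vector<BlockHandle> DecodeIndexValues(
    const std::vector<EncodedValue>& encoded, std::uint64_t trailer_size) {
  std::vector<BlockHandle> out;
  for (const auto& e : encoded) {
    if (e.is_head) {
      out.push_back({e.offset, static_cast<std::uint64_t>(e.size_or_delta)});
    } else {
      assert(!out.empty());  // A delta entry always follows a head entry.
      const BlockHandle prev = out.back();
      const auto size = static_cast<std::uint64_t>(
          static_cast<std::int64_t>(prev.size) + e.size_or_delta);
      out.push_back({prev.offset + prev.size + trailer_size, size});
    }
  }
  return out;
}
```

Because block sizes tend to be close to each other, the deltas are small numbers, which is where the space saving in the index block would come from under these assumptions.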

Abhishek Madan

Andrew Kryczka

DeleteRange: A New Native RocksDB Operation

Posted November 21, 2018

Motivation

Deletion patterns in LSM

Deleting a range of keys is a common pattern in RocksDB. Most systems built on top of RocksDB have multi-component key schemas, where keys sharing a common prefix are logically related. Here are some examples.

MyRocks is a MySQL fork using RocksDB as its storage engine. Each key’s first four bytes identify the table or index to which that key belongs. Thus dropping a table or index involves deleting all the keys with that prefix.

Rockssandra is a Cassandra variant that uses RocksDB as its storage engine. One of its admin tool commands, nodetool cleanup, removes key-ranges that have been migrated to other nodes in the cluster.

Marketplace uses RocksDB to store product data. Its key begins with product ID, and it stores various data associated with the product in separate keys. When a product is removed, all these keys must be deleted.

When we decide what to improve, we try to find a use case that’s common across users, since we want to build a generally useful system, not one that has many one-off features for individual users. The range deletion pattern is common as illustrated above, so from this perspective it’s a good target for optimization.

Existing mechanisms: challenges and opportunities

The most common pattern we see is scan-and-delete, i.e., advance an iterator through the to-be-deleted range, and issue a Delete for each key. This is slow (involves read I/O) so cannot be done in any critical path. Additionally, it creates many tombstones, which slows down iterators and doesn’t offer a deadline for space reclamation.

Another common pattern is using a custom compaction filter that drops keys in the deleted range(s). This deletes the range asynchronously, so cannot be used in cases where readers must not see keys in deleted ranges. Further, it has the disadvantage of outputting tombstones to all but the bottom level. That’s because compaction cannot detect whether dropping a key would cause an older version at a lower level to reappear.

If space reclamation time is important, or it is important that the deleted range not affect iterators, the user can trigger CompactRange on the deleted range. This can involve arbitrarily long waits in the compaction queue, and increases write-amp. By the time it’s finished, however, the range is completely gone from the LSM.

DeleteFilesInRange can be used prior to compacting the deleted range as long as snapshot readers do not need to access them. It drops files that are completely contained in the deleted range. That saves write-amp because, in CompactRange, the file data would have to be rewritten several times before it reaches the bottom of the LSM, where tombstones can finally be dropped.

In addition to the above approaches having various drawbacks, they are quite complicated to reason about and implement. In an ideal world, deleting a range of keys would be (1) simple, i.e., a single API call; (2) synchronous, i.e., when the call finishes, the keys are guaranteed to be wiped from the DB; (3) low latency so it can be used in critical paths; and (4) a first-class operation with all the guarantees of any other write, like atomicity, crash-recovery, etc.

v1: Getting it to work

Where to persist them?

The first place we thought about storing them is inline with the data blocks. We could not think of a good way to do it, however, since the start of a range tombstone covering a key could be anywhere, making binary search impossible. So, we decided to investigate segregated storage.

A second solution we considered is appending to the manifest. This file is append-only, periodically compacted, and stores metadata like the level to which each SST belongs. This is tempting because it leverages an existing file, which is maintained in the background and fully read when the DB is opened. However, it conceptually violates the manifest’s purpose, which is to store metadata. It also has no way to detect when a range tombstone no longer covers anything and is droppable. Further, it’d be possible for keys above a range tombstone to disappear when they have their seqnums zeroed upon compaction to the bottommost level.

A third candidate is using a separate column family. This has similar problems to the manifest approach. That is, we cannot easily detect when a range tombstone is obsolete, and seqnum zeroing can cause a key to go from above a range tombstone to below, i.e., disappearing. The upside is we can reuse logic for memory buffering, consistent reads/writes, etc.

The problems with the second and third solutions indicate a need for range tombstones to be aware of flush/compaction. An easy way to achieve this is to put them in the SST files themselves - but not in the data blocks, as explained for the first solution. So, we introduced a separate meta-block for range tombstones. This resolved the problem of when to obsolete range tombstones, as it’s simple: when they’re compacted to the bottom level. We also reused the LSM invariant that newer versions of a key are always in a higher level to prevent the seqnum zeroing problem. This approach has the side benefit of constraining the range tombstones seen during reads to ones in a similar key-range.

When there are range tombstones in an SST, they are segregated in a separate meta-block

Logical range tombstones (left) and their corresponding physical key-value representation (right)

Write path

WriteBatch stores range tombstones in its buffer which are logged to the WAL and then applied to a dedicated range tombstone memtable during Write. Later in the background the range tombstone memtable and its corresponding data memtable are flushed together into a single SST with a range tombstone meta-block. SSTs periodically undergo compaction which rewrites SSTs with point data and range tombstones dropped or merged wherever possible.

We chose to use a dedicated memtable for range tombstones. The memtable representation is always skiplist in order to minimize overhead in the usual case, in which the memtable contains zero or a small number of range tombstones. The range tombstones are segregated to a separate memtable for the same reason we segregated range tombstones in SSTs. That is, we did not know how to interleave the range tombstone with point data in a way that we would be able to find it for arbitrary keys that it covers.

Lifetime of point keys and range tombstones in RocksDB

During flush and compaction, we chose to write out all non-obsolete range tombstones unsorted. Sorting by a single dimension is easy to implement, but doesn’t bring asymptotic improvement to queries over range data. Ideally, we want to store skylines (see “Read Path” subsection below) computed over our ranges so we can binary search. However, a couple of concerns cause doing this in flush and compaction to feel unsatisfactory: (1) we need to store multiple skylines, one for each snapshot, which further complicates the range tombstone meta-block encoding; and (2) even if we implement this, the range tombstone memtable still needs to be linearly scanned. Given these concerns we decided to defer collapsing work to the read side, hoping a good caching strategy could optimize this at some future point.

Read path

In point lookups, we aggregate range tombstones in an unordered vector as we search through live memtable, immutable memtables, and then SSTs. When a key is found that matches the lookup key, we do a scan through the vector, checking whether the key is deleted.
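The v1 point-lookup check described above amounts to a linear scan. The sketch below illustrates it with a hypothetical `RangeTombstone` struct and `IsDeletedV1` function; it treats a tombstone as covering the half-open key range [start, end), considers a key deleted only if the tombstone's sequence number is newer than the key's, and ignores snapshots for simplicity.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical representation for illustration; not RocksDB's internal type.
struct RangeTombstone {
  std::string start;  // inclusive
  std::string end;    // exclusive
  std::uint64_t seq;  // sequence number of the DeleteRange write
};

// v1-style check: scan every collected tombstone, in no particular order.
// Cost is O(T) per found key, where T is the number of tombstones gathered.
bool IsDeletedV1(const std::vector<RangeTombstone>& tombstones,
                 const std::string& key, std::uint64_t key_seq) {
  for (const auto& t : tombstones) {
    assert(t.start < t.end);
    const bool covers = t.start <= key && key < t.end;
    if (covers && t.seq > key_seq) {
      return true;  // A newer range tombstone hides this key.
    }
  }
  return false;
}
```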

In iterators, we aggregate range tombstones into a skyline as we visit live memtable, immutable memtables, and SSTs. The skyline is expensive to construct but fast to determine whether a key is covered. The skyline keeps track of the most recent range tombstone found to optimize Next and Prev.

(Image source: Leetcode) The skyline problem involves taking building location/height data in the unsearchable form of A and converting it to the form of B, which is binary-searchable. With overlapping range tombstones, to achieve efficient searching we need to solve an analogous problem, where the x-axis is the key-space and the y-axis is the sequence number.

Performance characteristics

For the v1 implementation, writes are much faster compared to the scan and delete (optionally within a transaction) pattern. DeleteRange only logs to WAL and applies to memtable. Logging to WAL always fflushes, and optionally fsyncs or fdatasyncs. Applying to memtable is always an in-memory operation. Since range tombstones have a dedicated skiplist memtable, the complexity of inserting is O(log(T)), where T is the number of existing buffered range tombstones.

Reading in the presence of v1 range tombstones, however, is much slower than reads in a database where scan-and-delete has happened, due to the linear scan over range tombstone memtables/meta-blocks.

Iterating in a database with v1 range tombstones is usually slower than in a scan-and-delete database, although the gap lessens as iterations grow longer. When an iterator is first created and seeked, we construct a skyline over its tombstones. This operation is O(T*log(T)) where T is the number of tombstones found across live memtable, immutable memtable, L0 files, and one file from each of the L1+ levels. However, moving the iterator forwards or backwards is simply a constant-time operation (excluding edge cases, e.g., many range tombstones between consecutive point keys).

v2: Making it fast

DeleteRange’s negative impact on read perf is a barrier to its adoption. The root cause is that range tombstones are not stored or cached in a format that can be efficiently searched. We needed to design DeleteRange so that we could maintain write performance while making read performance competitive with workarounds used in production (e.g., scan-and-delete).

Representations

The key idea of the redesign is that, instead of globally collapsing range tombstones, we can locally “fragment” them for each SST file and memtable to guarantee that:

no range tombstones overlap; and
range tombstones are ordered by start key.

Combined, these properties make range tombstones binary searchable. This fragmentation will happen on the read path, but unlike the previous design, we can easily cache many of these range tombstone fragments on the read path.
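The fragmentation idea can be sketched as follows. This is a simplified illustration, not RocksDB's implementation: the names `RangeTombstone`, `Fragment`, and `IsDeleted` are hypothetical, each fragment keeps only the newest sequence number covering it (so snapshots are ignored), and the fragmenting loop is written for clarity rather than speed. The output satisfies both properties listed above, which is what allows the lookup to use binary search.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical representation for illustration; not RocksDB's internal type.
struct RangeTombstone {
  std::string start;  // inclusive
  std::string end;    // exclusive
  std::uint64_t seq;
};

// Splits possibly overlapping tombstones at every start/end boundary so the
// result is non-overlapping and ordered by start key.
std::vector<RangeTombstone> Fragment(const std::vector<RangeTombstone>& in) {
  std::vector<std::string> bounds;
  for (const auto& t : in) {
    assert(t.start < t.end);
    bounds.push_back(t.start);
    bounds.push_back(t.end);
  }
  std::sort(bounds.begin(), bounds.end());
  bounds.erase(std::unique(bounds.begin(), bounds.end()), bounds.end());

  std::vector<RangeTombstone> out;
  for (std::size_t i = 0; i + 1 < bounds.size(); ++i) {
    bool covered = false;
    std::uint64_t max_seq = 0;
    for (const auto& t : in) {
      // A tombstone covers the slice [bounds[i], bounds[i+1]) entirely or
      // not at all, because every tombstone edge is itself a boundary.
      if (t.start <= bounds[i] && bounds[i + 1] <= t.end) {
        covered = true;
        max_seq = std::max(max_seq, t.seq);
      }
    }
    if (covered) out.push_back({bounds[i], bounds[i + 1], max_seq});
  }
  return out;
}

// Binary search: find the last fragment whose start key is <= key.
bool IsDeleted(const std::vector<RangeTombstone>& frags,
               const std::string& key, std::uint64_t key_seq) {
  auto it = std::upper_bound(
      frags.begin(), frags.end(), key,
      [](const std::string& k, const RangeTombstone& t) {
        return k < t.start;
      });
  if (it == frags.begin()) return false;  // key sorts before all fragments
  --it;
  return key < it->end && it->seq > key_seq;
}
```

For example, fragmenting [a, m) at sequence 10 and [k, z) at sequence 5 yields [a, k) at 10, [k, m) at 10, and [m, z) at 5. Since the fragments depend only on the tombstones in one file or memtable, the result of `Fragment` can be computed once and cached, and each lookup then costs a single binary search.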

Write path

The write path remains unchanged.

Read path

When an SST file is opened, its range tombstones are fragmented and cached. For point lookups, we binary search each file’s fragmented range tombstones for one that covers the lookup key. Unlike the old design, once we find a tombstone, we no longer need to search for the key in lower levels, since we know that any keys on those levels will be covered (though we do still check the current level since there may be keys written after the range tombstone).

For range scans, we create iterators over all the fragmented range tombstones and store them in a list, seeking each one to cover the start key of the range scan (if possible), and query each encountered key in this structure as in the old design, advancing range tombstone iterators as necessary. In effect, we implicitly create a skyline. This requires significantly less work on iterator creation, but since each memtable/SST has its own range tombstone iterator, querying range tombstones requires key comparisons (and possibly iterator increments) for several iterators (as opposed to v1, where we had a global collapsed representation of all range tombstones). As a result, very long range scans may become slower than before, but short range scans, which are the more common class of range scan, are an order of magnitude faster.

Benchmarks

To understand the performance of this new design, we used db_bench to compare point lookup, short range scan, and long range scan performance across:

the v1 DeleteRange design,
the scan-and-delete workaround, and
the v2 DeleteRange design.

In these benchmarks, we used a database with 5 million data keys, and 10000 range tombstones (ignoring those dropped during compaction) that were written in regular intervals after 4.5 million data keys were written. Writing the range tombstones this late ensures that most of them are not compacted away, and we have more tombstones in higher levels that cover keys in lower levels, which allows the benchmarks to exercise more interesting behavior when reading deleted keys.

Point lookup benchmarks read 100000 keys from a database using readwhilewriting. Range scan benchmarks used seekrandomwhilewriting and seeked 100000 times, and advanced up to 10 keys away from the seek position for short range scans, and advanced up to 1000 keys away from the seek position for long range scans.

The results are summarized in the tables below, averaged over 10 runs (note the different SHAs for v1 benchmarks are due to a new db_bench flag that was added in order to compare performance with databases with no tombstones; for brevity, those results are not reported here). Also note that the block cache was large enough to hold the entire db, so the large throughput is due to limited I/Os and little time spent on decompression. The range tombstone blocks are always pinned uncompressed in memory. We believe these setup details should not affect relative performance between versions.

Point Lookups

Name	SHA	avg micros/op	avg ops/sec
v1	35cd754a6	1.3179	759,830.90
scan-del	7528130e3	0.6036	1,667,237.70
v2	7528130e3	0.6128	1,634,633.40

Short Range Scans

Name	SHA	avg micros/op	avg ops/sec
v1	0ed738fdd	6.23	176,562.00
scan-del	PR 4677	2.6844	377,313.00
v2	PR 4677	2.8226	361,249.70

Long Range Scans

Name	SHA	avg micros/op	avg ops/sec
v1	0ed738fdd	52.7066	19,074.00
scan-del	PR 4677	38.0325	26,648.60
v2	PR 4677	41.2882	24,714.70

Future Work

Note that memtable range tombstones are fragmented on every read; for now this is acceptable, since we expect there to be relatively few range tombstones in memtables (and users can enforce this by keeping track of the number of memtable range deletions and manually flushing after it passes a threshold). In the future, a specialized data structure can be used for storing range tombstones in memory to avoid this work.

Another future optimization is to create a new format version that requires range tombstones to be stored in a fragmented form. This would save time when opening SST files, and when max_open_files is not -1 (i.e., files may be opened several times).

Acknowledgements

Special thanks to Peter Mattis and Nikhil Benesch from Cockroach Labs, who were early users of DeleteRange v1 in production, contributed the cleanest/most efficient v1 aggregation implementation, found and fixed bugs, and provided initial DeleteRange v2 design and continued help.

Thanks to Huachao Huang and Jinpeng Zhang from PingCAP for early DeleteRange v1 adoption, bug reports, and fixes.

Fenggang Wu

Improving Point-Lookup Using Data Block Hash Index

Posted August 23, 2018

We’ve designed and implemented a data block hash index in RocksDB that has the benefit of both reducing the CPU util and increasing the throughput for point lookup queries with a reasonable and tunable space overhead.

Specifically, we append a compact hash table to the end of the data block for efficient indexing. The format is backward compatible with databases created without this feature. After the hash index feature is turned on, existing data will be gradually converted to the hash index format.

Benchmarks with db_bench show the CPU utilization of one of the main functions in the point lookup code path, DataBlockIter::Seek(), is reduced by 21.8%, and the overall RocksDB throughput is increased by 10% under purely cached workloads, at an overhead of 4.6% more space. Shadow testing with Facebook production traffic shows good CPU improvements too.

How to use it

Two new options are added as part of this feature: BlockBasedTableOptions::data_block_index_type and BlockBasedTableOptions::data_block_hash_table_util_ratio.

The hash index is disabled by default; to enable it, set BlockBasedTableOptions::data_block_index_type to kDataBlockBinaryAndHash. The hash table utilization ratio is adjustable using BlockBasedTableOptions::data_block_hash_table_util_ratio, which takes effect only if data_block_index_type = kDataBlockBinaryAndHash.

// the definitions can be found in include/rocksdb/table.h

// The index type that will be used for the data block.
enum DataBlockIndexType : char {
  kDataBlockBinarySearch = 0,  // traditional block type
  kDataBlockBinaryAndHash = 1, // additional hash index
};

// Set to kDataBlockBinaryAndHash to enable hash index
DataBlockIndexType data_block_index_type = kDataBlockBinarySearch;

// #entries/#buckets. It is valid only when data_block_index_type is
// kDataBlockBinaryAndHash.
double data_block_hash_table_util_ratio = 0.75;

Data Block Hash Index Design

The current data block format groups adjacent keys together as a restart interval. One block consists of multiple restart intervals. The byte offset of the beginning of each restart interval, i.e. a restart point, is stored in an array called the restart interval index or binary seek index. When performing a point lookup, RocksDB does a binary search over this array to find the restart interval in which the key may reside. We will use binary seek and binary search interchangeably in this post.

Finding the right location with binary search requires parsing and comparing multiple keys. Each branch of the binary search triggers a CPU cache miss, which drives up CPU utilization. We have seen that this binary search takes up considerable CPU in production use-cases.

We implemented a hash map at the end of the block to index the key to reduce the CPU overhead of the binary search. The hash index is just an array of pointers pointing into the binary seek index.

Each array element is treated as a hash bucket that stores the location of a key (or more precisely, the restart index of the restart interval where the key resides). When multiple keys happen to hash into the same bucket (hash collision), we just mark the bucket as “collision”. A later query on such a key then sees that a collision happened and falls back to the traditional binary search to find the location of the key.

We define the hash table utilization ratio as #keys/#buckets. If the utilization ratio is 0.5 and there are 100 buckets, 50 keys are stored in the buckets. The lower the util ratio, the fewer the hash collisions, and the lower the chance that a point lookup falls back to binary seek because of a collision (the fall back ratio). So a small util ratio reduces CPU time more but introduces more space overhead.

Space overhead depends on the util ratio. Each bucket is a uint8_t (i.e. one byte). For a util ratio of 1, the space overhead is 1 byte per key, and the observed fall back ratio is ~52%.
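The bucket array and the collision fallback can be illustrated with a minimal sketch. This is not the RocksDB implementation: the class name, the use of std::hash, and the assignment of 255 to "no entry" and 254 to "collision" are assumptions made for illustration (the post only states that these two values are reserved).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Assumed marker values: the post only says that 254 and 255 are reserved.
constexpr std::uint8_t kNoEntry = 255;
constexpr std::uint8_t kCollision = 254;

// #buckets implied by a utilization ratio defined as #keys / #buckets.
inline std::size_t NumBuckets(std::size_t num_keys, double util_ratio) {
  assert(util_ratio > 0.0);
  const auto n = static_cast<std::size_t>(num_keys / util_ratio);
  return n == 0 ? 1 : n;
}

// Hypothetical stand-in for the hash index appended to a data block.
class BlockHashIndexSketch {
 public:
  BlockHashIndexSketch(std::size_t num_keys, double util_ratio)
      : buckets_(NumBuckets(num_keys, util_ratio), kNoEntry) {}

  // Record that `key` lives in the restart interval `restart_index`.
  void Add(const std::string& key, std::uint8_t restart_index) {
    assert(restart_index < kCollision);
    std::uint8_t& bucket = buckets_[BucketOf(key)];
    // A second key landing in an occupied bucket only leaves a marker.
    bucket = (bucket == kNoEntry) ? restart_index : kCollision;
  }

  // Returns a restart index, kNoEntry (no key hashed into this bucket), or
  // kCollision (the caller must fall back to binary seek).
  std::uint8_t Lookup(const std::string& key) const {
    return buckets_[BucketOf(key)];
  }

  // One byte per bucket, so this is also the space overhead in bytes.
  std::size_t SizeInBytes() const { return buckets_.size(); }

 private:
  std::size_t BucketOf(const std::string& key) const {
    return std::hash<std::string>{}(key) % buckets_.size();
  }

  std::vector<std::uint8_t> buckets_;
};
```

With 50 keys and a util ratio of 0.5, NumBuckets returns 100, so the index costs 100 bytes, i.e. 2 bytes per key. At a util ratio of 1 the same keys need 50 bytes, at the price of more buckets marked kCollision.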

Things that Need Attention

Customized Comparator

Hash index will hash different keys (keys with different content, or byte sequence) into different hash values. This assumes the comparator will not treat different keys as equal if they have different content.

The default bytewise comparator orders the keys in alphabetical order and works well with the hash index, as different keys will never be regarded as equal. However, some specially crafted comparators may regard different keys as equal. For example, a StringToIntComparator could convert a string into an integer and use the integer to perform the comparison. The key strings “16” and “0x10” are equal as seen by this StringToIntComparator, but they probably hash to different values. A later query using one form of the key will then not be able to find the existing key stored in the other form.

We add a new function member to the comparator interface:

virtual bool CanKeysWithDifferentByteContentsBeEqual() const { return true; }

Every comparator implementation should override this function to describe the behavior of the comparator. If a comparator can regard different keys as equal, the function returns true, and as a result the hash index feature will not be enabled; if it returns false, the hash index can be enabled.

NOTE: to use the hash index feature, one should 1) have a comparator that can never treat different keys as equal; and 2) override the CanKeysWithDifferentByteContentsBeEqual() function to return false, so the hash index can be enabled.
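The “16” versus “0x10” case can be reproduced with a small standalone sketch. The function names below are hypothetical and do not implement the RocksDB Comparator interface; they only show why a comparator of this kind would have to keep returning true from CanKeysWithDifferentByteContentsBeEqual().

```cpp
#include <cassert>
#include <string>

// Hypothetical comparator from the example: compares keys by the integer
// they spell. Base 0 lets std::stol accept both "16" and "0x10".
inline int StringToIntCompare(const std::string& a, const std::string& b) {
  const long x = std::stol(a, nullptr, 0);
  const long y = std::stol(b, nullptr, 0);
  return (x < y) ? -1 : (x > y) ? 1 : 0;
}

// Plain bytewise comparison, standing in for the default comparator.
inline int BytewiseCompare(const std::string& a, const std::string& b) {
  const int c = a.compare(b);
  return (c < 0) ? -1 : (c > 0) ? 1 : 0;
}

// True when a comparator calls two keys equal although their bytes differ,
// which is the case a hash over the key bytes cannot handle.
template <typename Compare>
bool TreatsDifferentBytesAsEqual(Compare cmp, const std::string& a,
                                 const std::string& b) {
  return a != b && cmp(a, b) == 0;
}
```

TreatsDifferentBytesAsEqual(StringToIntCompare, "16", "0x10") is true, while the same check with BytewiseCompare is false for any pair of keys.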

Util Ratio’s Impact on Data Block Cache

Adding the hash index to the end of the data block essentially takes up data block cache space, making the effective data block cache size smaller and increasing the data block cache miss ratio. Therefore, a very small util ratio will result in a large data block cache miss ratio, and the extra I/O may drag down the throughput gain achieved by the hash index lookup. Besides, when compression is enabled, a cache miss also incurs data block decompression, which is CPU-consuming. CPU usage may therefore even increase if the util ratio is too small. The best util ratio depends on the workload, cache to data ratio, disk bandwidth/latency, etc. In our experiments, we found that a util ratio between 0.5 and 1 is a good range to explore that brings both CPU and throughput gains.

Limitations

As we use uint8_t to store binary seek index, i.e. restart interval index, the total number of restart intervals cannot be more than 253 (we reserved 255 and 254 as special flags). For blocks having a larger number of restart intervals, the hash index will not be created and the point lookup will be done by traditional binary seek.

The data block hash index only supports point lookups. Range lookups are not supported; a range lookup request falls back to BinarySeek.

RocksDB supports many types of records, such as Put, Delete, Merge, etc (visit here for more information). Currently we only support Put and Delete, but not Merge. Internally we have a limited set of supported record types:

kPutRecord,          <=== supported
kDeleteRecord,       <=== supported
kSingleDeleteRecord, <=== supported
kTypeBlobIndex,      <=== supported

For records not supported, the searching process will fall back to the traditional binary seek.

Evaluation

To evaluate the CPU util reduction and isolate other factors such as disk I/O and block decompression, we first evaluate the hash index in a purely cached workload. We observe that the CPU utilization of one of the main functions in the point lookup code path, DataBlockIter::Seek(), is reduced by 21.8% and the overall throughput is increased by 10% at an overhead of 4.6% more space.

However, a general workload is not always purely cached. So we also evaluate the performance under different cache space pressure. In the following test, we use db_bench with RocksDB deployed on SSDs. The total DB size is 5~6GB, and it is about 14GB if decompressed. Different block cache sizes are used, ranging from 14GB down to 2GB, with an increasing cache miss ratio.

The orange bars represent our hash index performance. We use a hash util ratio of 1.0 in this test. The block size is set to 16KiB with a restart interval of 16.

We can see that if cache size is greater than 8GB, hash index can bring throughput gain. Cache size greater than 8GB can be translated to a cache miss ratio smaller than 40%. So if the workload has a cache miss ratio smaller than 40%, hash index is able to increase the throughput.

Besides, shadow testing with Facebook production traffic shows good CPU improvements too.

Rocksdb Tuning Advisor

Posted August 01, 2018

The performance of Rocksdb is contingent on its tuning. However, because of the complexity of its underlying technology and a large number of configurable parameters, a good configuration is sometimes hard to obtain. The aim of the python command-line tool, Rocksdb Advisor, is to automate the process of suggesting improvements in the configuration based on advice from Rocksdb experts.

Overview

Experts share their wisdom as rules comprising conditions and suggestions in the INI format (refer to rules.ini). Users provide the Rocksdb configuration that they want to improve upon (as the familiar Rocksdb OPTIONS file, see example) and the path of the file which contains Rocksdb logs and statistics. The Advisor creates appropriate DataSource objects (for Rocksdb logs, options, statistics etc.) and provides them to the Rules Engine. The Rules Engine uses the rules from experts to parse the data sources and trigger appropriate rules. The Advisor’s output gives information about which rules were triggered, why they were triggered and what each of them suggests. Each suggestion provided by a triggered rule advises some action on a Rocksdb configuration option, for example, increase CFOptions.write_buffer_size, set bloom_bits to 2 etc.

Usage

An example command to run the tool:

cd rocksdb/tools/advisor
python3 -m advisor.rule_parser_example --rules_spec=advisor/rules.ini --rocksdb_options=test/input_files/OPTIONS-000005 --log_files_path_prefix=test/input_files/LOG-0 --stats_dump_period_sec=20

Sample output where a Rocksdb log-based rule has been triggered:

Rule: stall-too-many-memtables
LogCondition: stall-too-many-memtables regex: Stopping writes because we have \d+ immutable memtables \(waiting for flush\), max_write_buffer_number is set to \d+
Suggestion: inc-bg-flush option : DBOptions.max_background_flushes action : increase suggested_values : ['2']
Suggestion: inc-write-buffer option : CFOptions.max_write_buffer_number action : increase
scope: col_fam:
{'default'}
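The log condition in the sample output is a regular expression applied to log lines. The Advisor itself is a Python tool; the following standalone C++ sketch only shows the matching step, and the function names as well as the choice to match anywhere within a line are assumptions rather than the Advisor’s actual logic.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Regex copied from the LogCondition in the sample output above.
inline const std::regex& StallTooManyMemtables() {
  static const std::regex re(
      R"(Stopping writes because we have \d+ immutable memtables )"
      R"(\(waiting for flush\), max_write_buffer_number is set to \d+)");
  return re;
}

// Assumption: a log condition triggers when its regex matches anywhere in
// a single log line.
inline bool LogConditionTriggers(const std::string& log_line) {
  return std::regex_search(log_line, StallTooManyMemtables());
}
```

A rule built on this condition would then emit its suggestions, such as increasing DBOptions.max_background_flushes, whenever LogConditionTriggers returns true for some line of the log.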

For more information, refer to advisor.

RocksDB 5.10.2 Released!

Posted February 05, 2018

Public API Change

When running make with environment variable USE_SSE set and PORTABLE unset, the build will use all machine features available locally. Previously this combination only compiled SSE-related features.

New Features

CRC32C is now using the 3-way pipelined SSE algorithm crc32c_3way on supported platforms to improve performance. The system will choose to use this algorithm on supported platforms automatically whenever possible. If PCLMULQDQ is not supported it will fall back to the old Fast_CRC32 algorithm.
Provide lifetime hints when writing files on Linux. This reduces hardware write-amp on storage devices supporting multiple streams.
Add a DB stat, NUMBER_ITER_SKIP, which returns how many internal keys were skipped during iterations (e.g., due to being tombstones or duplicate versions of a key).
Add PerfContext counters, key_lock_wait_count and key_lock_wait_time, which measure the number of times transactions wait on key locks and total amount of time waiting.

Bug Fixes

Fix IOError on WAL write not propagating to write group followers.
Make iterator invalid on merge error.
Fix performance issue in IngestExternalFile() affecting databases with large number of SST files.
Fix possible corruption to LSM structure when DeleteFilesInRange() deletes a subset of files spanned by a DeleteRange() marker.
Fix DB::Flush() continuing to wait after the flush finishes under certain conditions.

Maysam Yabandeh

WritePrepared Transactions

Posted December 19, 2017

RocksDB supports both optimistic and pessimistic concurrency controls. The pessimistic transactions make use of locks to provide isolation between the transactions. The default write policy in pessimistic transactions is WriteCommitted, which means that the data is written to the DB, i.e., the memtable, only after the transaction is committed. This policy simplified the implementation but came with some limitations in throughput, transaction size, and variety in supported isolation levels. Below, we explain these in detail and present the other write policies, WritePrepared and WriteUnprepared. We then dive into the design of WritePrepared transactions.

WriteCommitted, Pros and Cons

With the WriteCommitted write policy, the data is written to the memtable only after the transaction commits. This greatly simplifies the read path as any data that is read by other transactions can be assumed to be committed. This write policy, however, implies that the writes are buffered in memory in the meanwhile. This makes memory a bottleneck for large transactions. The delay of the commit phase in 2PC (two-phase commit) also becomes noticeable since most of the work, i.e., writing to the memtable, is done at the commit phase. When the commits of multiple transactions are done in a serial fashion, such as in the 2PC implementation of MySQL, the lengthy commit latency becomes a major contributor to lower throughput. Moreover, this write policy cannot provide weaker isolation levels, such as READ UNCOMMITTED, that could potentially provide higher throughput for some applications.

Alternatives: WritePrepared and WriteUnprepared

To tackle the lengthy commit issue, we should do memtable writes at earlier phases of 2PC so that the commit phase becomes lightweight and fast. 2PC is composed of the write stage, where the transaction’s ::Put is invoked; the prepare phase, where ::Prepare is invoked (upon which the DB promises to commit the transaction if later requested); and the commit phase, where ::Commit is invoked and the transaction’s writes become visible to all readers. To make the commit phase lightweight, the memtable write could be done at either the ::Prepare or the ::Put stage, resulting in the WritePrepared and WriteUnprepared write policies respectively. The downside is that when another transaction is reading data, it needs a way to tell which data is committed and, if it is, whether it was committed before the transaction’s start, i.e., in the read snapshot of the transaction. WritePrepared would still have the issue of buffering the data, which makes memory the bottleneck for large transactions. It however provides a good milestone for transitioning from the WriteCommitted to the WriteUnprepared write policy. Here we explain the design of the WritePrepared policy. We will cover the changes that make the design also support WriteUnprepared in an upcoming post.

WritePrepared in a nutshell

These are the primary design questions that need to be addressed: 1) How do we identify the key/values in the DB with the transactions that wrote them? 2) How do we figure out if a key/value written by transaction Txn_w is in the read snapshot of the reading transaction Txn_r? 3) How do we roll back the data written by aborted transactions?

With WritePrepared, a transaction still buffers the writes in a write batch object in memory. When 2PC ::Prepare is called, it writes the in-memory write batch to the WAL (write-ahead log) as well as to the memtable(s) (one memtable per column family). We reuse the existing notion of sequence numbers in RocksDB to tag all the key/values in the same write batch with the same sequence number, prepare_seq, which is also used as the identifier for the transaction. At commit time, it writes a commit marker to the WAL, whose sequence number, commit_seq, will be used as the commit timestamp of the transaction. Before releasing the commit sequence number to the readers, it stores a mapping from prepare_seq to commit_seq in an in-memory data structure that we call CommitCache. When a transaction reads values from the DB (tagged with prepare_seq), it makes use of the CommitCache to figure out if the commit_seq of the value is in its read snapshot. To roll back an aborted transaction, we restore the state before the transaction by making another write that cancels out the writes of the aborted transaction.

The CommitCache is a lock-free data structure that caches the recent commit entries. Looking up the entries in the cache should be enough for almost all the transactions that commit in a timely manner. When evicting the older entries from the cache, it still maintains some other data structures to cover the corner cases for transactions that take abnormally long to finish. We will cover them in the design details below.
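The visibility check built on the prepare_seq to commit_seq mapping can be sketched as follows. This is a simplified illustration, not the actual CommitCache: it uses an unbounded std::unordered_map instead of a lock-free cache with eviction, and the class and method names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Simplified, single-threaded stand-in for the CommitCache.
class CommitCacheSketch {
 public:
  // Called at commit time, before commit_seq is released to readers.
  void AddCommitted(std::uint64_t prepare_seq, std::uint64_t commit_seq) {
    assert(commit_seq >= prepare_seq);
    prepare_to_commit_[prepare_seq] = commit_seq;
  }

  // A value tagged with prepare_seq is visible to a reader whose snapshot is
  // snapshot_seq only if its transaction committed at or before the snapshot.
  bool IsInSnapshot(std::uint64_t prepare_seq,
                    std::uint64_t snapshot_seq) const {
    const auto it = prepare_to_commit_.find(prepare_seq);
    if (it == prepare_to_commit_.end()) {
      return false;  // Prepared but not committed, as far as this sketch knows.
    }
    return it->second <= snapshot_seq;
  }

 private:
  std::unordered_map<std::uint64_t, std::uint64_t> prepare_to_commit_;
};
```

For a transaction prepared at sequence 10 and committed at 15, a reader with snapshot 12 does not see its values although they are already in the memtable, while a reader with snapshot 15 or later does.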

Benchmark Results

Here we present the improvements observed in MyRocks with sysbench and linkbench:

benchmark	tps	p95 latency	cpu/query
insert	68%
update-noindex	30%	38%
update-index	61%	28%
read-write	6%	3.5%
read-only	-1.2%	-1.8%
linkbench	1.9%	+overall	0.6%

Here are also the detailed results for In-Memory Sysbench and SSD Sysbench, courtesy of @mdcallag.

Learn more here.

Andrew Kryczka

Auto-tuned Rate Limiter

Posted December 18, 2017

Introduction

Our rate limiter has been hard to configure since users need to pick a value that is low enough to prevent background I/O spikes, which can impact user-visible read/write latencies. Meanwhile, picking too low a value can cause memtables and L0 files to pile up, eventually leading to writes stalling. Tuning the rate limiter has been especially difficult for users whose DB instances have different workloads, or have workloads that vary over time, or commonly both.

To address this, in RocksDB 5.9 we released a dynamic rate limiter that adjusts itself over time according to demand for background I/O. It can be enabled simply by passing auto_tuned=true in the NewGenericRateLimiter() call. In this case rate_bytes_per_sec will indicate the upper-bound of the window within which a rate limit will be picked dynamically. The chosen rate limit will be much lower unless absolutely necessary, so setting this to the device’s maximum throughput is a reasonable choice on dedicated hosts.

Algorithm

We use a simple multiplicative-increase, multiplicative-decrease algorithm. We measure demand for background I/O as the ratio of intervals where the rate limiter is drained. There are low and high watermarks for this ratio, which will trigger a change in rate limit when breached. The rate limit can move within a window bounded by the user-specified upper-bound, and a lower-bound that we derive internally. Users can expect this lower bound to be 1-2 orders of magnitude less than the provided upper-bound (so don’t provide INT64_MAX as your upper-bound), although it’s subject to change.
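A minimal sketch of such a multiplicative-increase, multiplicative-decrease loop is shown below. All constants (watermarks of 50% and 90%, a 5% adjustment step, a lower bound of one twentieth of the upper bound, a starting point of half the upper bound) and all names are assumptions chosen for illustration; the post does not state the values RocksDB uses.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical tuner; every constant here is an assumption for illustration.
class AutoTunedRateSketch {
 public:
  // upper_bound plays the role of rate_bytes_per_sec. It must leave headroom
  // for the multiplication in Tune(), so INT64_MAX is not a valid choice.
  explicit AutoTunedRateSketch(std::int64_t upper_bound)
      : max_rate_(upper_bound),
        min_rate_(upper_bound / kLowerBoundDivisor),
        rate_(upper_bound / 2) {}

  // drained_intervals: intervals in which the limiter ran out of budget,
  // out of total_intervals observed since the last tuning step.
  void Tune(int drained_intervals, int total_intervals) {
    assert(total_intervals > 0);
    const int drained_pct = 100 * drained_intervals / total_intervals;
    if (drained_pct < kLowWatermarkPct) {
      // Low demand: shrink the limit, but not below the derived lower bound.
      rate_ = std::max(min_rate_, rate_ * 100 / (100 + kAdjustPct));
    } else if (drained_pct > kHighWatermarkPct) {
      // High demand: grow the limit, but not above the user's upper bound.
      rate_ = std::min(max_rate_, rate_ * (100 + kAdjustPct) / 100);
    }
  }

  std::int64_t rate() const { return rate_; }

 private:
  static constexpr int kLowWatermarkPct = 50;
  static constexpr int kHighWatermarkPct = 90;
  static constexpr int kAdjustPct = 5;
  static constexpr std::int64_t kLowerBoundDivisor = 20;

  std::int64_t max_rate_;
  std::int64_t min_rate_;
  std::int64_t rate_;
};
```

Because both adjustments are multiplicative, the limit moves quickly while it is far from the demand level and in small absolute steps once it is low, which matches the intent of keeping the chosen rate well below the upper bound unless the demand requires more.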

Benchmark Results

Data is ingested at 10MB/s and the rate limiter was created with 1000MB/s as its upper bound. The dynamically chosen rate limit hovers around 125MB/s. The other clustering of points at 50MB/s is due to the number of compaction threads being reduced to one when there’s no compaction pressure.

The following graph summarizes the above two time series graphs in CDF form. In particular, notice the p90 - p100 for background write rate are significantly lower with auto-tuned rate limiter enabled.

Maysam Yabandeh

RocksDB 5.8 Released!

Posted September 28, 2017

Public API Change

Users of Statistics::getHistogramString() will see fewer histogram buckets and different bucket endpoints.
Slice::compare and BytewiseComparator Compare no longer accept Slices containing nullptr.
Transaction::Get and Transaction::GetForUpdate variants with PinnableSlice added.

New Features

Add Iterator::Refresh(), which allows users to update the iterator state so that they can avoid some initialization costs of recreating iterators.
Replace dynamic_cast<> (except unit test) so people can choose to build with RTTI off. With make, release mode is by default built with -fno-rtti and debug mode is built without it. Users can override it by setting USE_RTTI=0 or 1.
Universal compactions including the bottom level can be executed in a dedicated thread pool. This alleviates head-of-line blocking in the compaction queue, which causes write stalling, particularly in multi-instance use cases. Users can enable this feature via Env::SetBackgroundThreads(N, Env::Priority::BOTTOM), where N > 0.
Allow merge operator to be called even with a single merge operand during compactions, by appropriately overriding MergeOperator::AllowSingleOperand.
Add DB::VerifyChecksum(), which verifies the checksums in all SST files in a running DB.
Block-based table support for disabling checksums by setting BlockBasedTableOptions::checksum = kNoChecksum.

Bug Fixes

Fix wrong latencies in rocksdb.db.get.micros, rocksdb.db.write.micros, and rocksdb.sst.read.micros.
Fix incorrect dropping of deletions during intra-L0 compaction.
Fix transient reappearance of keys covered by range deletions when memtable prefix bloom filter is enabled.
Fix potentially wrong file smallest key when range deletions separated by snapshot are written together.

Maysam Yabandeh

FlushWAL; less fwrite, faster writes

Posted August 25, 2017

When DB::Put is called, the data is written to both the memtable (to be flushed to SST files later) and the WAL (write-ahead log) if it is enabled. In the case of a crash, RocksDB can recover as much of the memtable state as is reflected in the WAL. By default RocksDB automatically flushes the WAL from the application memory to the OS buffer after each ::Put. It can however be configured to perform the flush manually, only upon an explicit call to ::FlushWAL. Not doing the fwrite syscall after each ::Put offers a tradeoff between reliability and write latency for the general case. As we explain below, some applications such as MyRocks benefit from this API to gain higher write throughput without compromising reliability.

How much is the gain?

Using ::FlushWAL API along with setting DBOptions.concurrent_prepare, MyRocks achieves 40% higher throughput in Sysbench’s update-nonindex benchmark.

Write, Flush, and Sync

A write to the WAL first goes to the application memory buffer. In the next step, the buffer is “flushed” to the OS buffer by calling the fwrite syscall. The OS buffer is later “synced” to the persistent storage. The data in the OS buffer, although not persisted yet, will survive an application crash. By default, the flush occurs automatically upon each call to DB::Put or DB::Write. The user can additionally request a sync after each write by setting WriteOptions::sync.

FlushWAL API

The user can turn off the automatic flush of the WAL by setting DBOptions::manual_wal_flush. In that case, the WAL buffer is flushed when it is either full or DB::FlushWAL is called by the user. The API also accepts a boolean argument should we want to sync right after the flush: ::FlushWAL(true).
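The difference between the automatic and the manual flush mode can be modelled with a small standalone sketch. It does not call RocksDB: the class below is a hypothetical in-memory model in which two strings stand in for the application buffer and the OS buffer, and the capacity is an arbitrary parameter.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical model of the WAL write path: application buffer -> OS buffer.
class WalBufferSketch {
 public:
  WalBufferSketch(bool manual_wal_flush, std::size_t capacity)
      : manual_wal_flush_(manual_wal_flush), capacity_(capacity) {}

  // Append one record, as a write to the WAL would.
  void Append(const std::string& record) {
    if (app_buffer_.size() + record.size() > capacity_) {
      FlushWAL();  // The buffer is full, so it is flushed in either mode.
    }
    app_buffer_ += record;
    if (!manual_wal_flush_) {
      FlushWAL();  // Default mode: flush after every write.
    }
  }

  // Move the buffered bytes to the simulated OS buffer.
  void FlushWAL() {
    if (app_buffer_.empty()) return;
    os_buffer_ += app_buffer_;
    app_buffer_.clear();
    ++flush_count_;
  }

  int flush_count() const { return flush_count_; }
  std::size_t buffered_bytes() const { return app_buffer_.size(); }
  std::size_t flushed_bytes() const { return os_buffer_.size(); }

 private:
  bool manual_wal_flush_;
  std::size_t capacity_;
  std::string app_buffer_;
  std::string os_buffer_;
  int flush_count_ = 0;
};
```

In manual mode, many small appends are batched into a few flushes; the bytes still held in the application buffer are the ones that would be lost if the process crashed before the next flush.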

Success story: MyRocks

Some applications that use RocksDB already have other mechanisms in place to provide reliability. MySQL for example uses 2PC (two-phase commit) to write to both the binlog and the storage engine, such as InnoDB or MyRocks. The group commit logic in MySQL allows the 1st phase (Prepare) to be run in parallel, but after a commit group is formed it performs the 2nd phase (Commit) in a serial manner. This makes low commit latency in the storage engine essential for achieving high throughput. The commit in MyRocks includes writing to the RocksDB WAL, which, as explained above, by default incurs the latency of flushing the new WAL appends to the OS buffer.

Since the binlog helps in recovering from some failure scenarios, MySQL can provide reliability without needing a storage WAL flush after each individual commit. MyRocks benefits from this property, disables automatic WAL flush in RocksDB, and manually calls ::FlushWAL when requested by MySQL.

Maysam Yabandeh

PinnableSlice; less memcpy with point lookups

Posted August 24, 2017

The classic API for DB::Get receives a std::string as argument to which it will copy the value. The memcpy overhead could be non-trivial when the value is large. The new API receives a PinnableSlice instead, which avoids memcpy in most of the cases.

What is PinnableSlice?

Similarly to Slice, PinnableSlice refers to some in-memory data so it does not incur the memcpy cost. To ensure that the data will not be erased while it is being processed by the user, PinnableSlice, as its name suggests, has the data pinned in memory. The pinned data are released when PinnableSlice object is destructed or when ::Reset is invoked explicitly on it.
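The idea of pinning can be modelled without RocksDB. In the sketch below, a std::shared_ptr stands in for whatever mechanism keeps a cached block alive; the class name and this ownership model are assumptions for illustration and do not reflect how PinnableSlice is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical view over part of a cached block. Holding the shared_ptr
// keeps the block alive ("pinned") without copying the value out of it.
class PinnedViewSketch {
 public:
  void PinFrom(std::shared_ptr<const std::string> block, std::size_t offset,
               std::size_t length) {
    assert(offset + length <= block->size());
    block_ = std::move(block);
    data_ = block_->data() + offset;
    size_ = length;
  }

  // Release the pin so that the block may be freed or evicted.
  void Reset() {
    block_.reset();
    data_ = nullptr;
    size_ = 0;
  }

  bool IsPinned() const { return block_ != nullptr; }
  const char* data() const { return data_; }
  std::size_t size() const { return size_; }

  // Copies on demand; this is the memcpy the view otherwise avoids.
  std::string ToString() const {
    return data_ ? std::string(data_, size_) : std::string();
  }

 private:
  std::shared_ptr<const std::string> block_;
  const char* data_ = nullptr;
  std::size_t size_ = 0;
};
```

While the view is pinned, the block has one extra owner and cannot go away; after Reset() or destruction of the view, the block is owned only by the cache again.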

How good is it?

Here are the improvements in throughput for an in-memory benchmark:

value 1k byte: 14%
value 10k byte: 34%

Any limitations?

PinnableSlice tries to avoid memcpy as much as possible. The primary gain is when reading large values from the block cache. There are however cases in which it still has to copy the data into its internal buffer. The reason is mainly the complexity of the implementation; if there is enough motivation on the application side, the scope of PinnableSlice could be extended to such cases too. These include:

Merged values
Reads from memtables

How to use it?

PinnableSlice pinnable_val;
while (!stopped) {
   auto s = db->Get(opt, cf, key, &pinnable_val);
   if (s.ok()) {
      // ... use pinnable_val while the data is pinned
   }
   pinnable_val.Reset(); // then release it immediately
}

You can also initialize the internal buffer of PinnableSlice by passing your own string in the constructor. simple_example.cc demonstrates that with more examples.

Yi Wu

RocksDB 5.6.1 Released!

Posted July 25, 2017

Public API Change

Scheduling flushes and compactions in the same thread pool is no longer supported by setting max_background_flushes=0. Instead, users can achieve this by configuring their high-pri thread pool to have zero threads. See https://github.com/facebook/rocksdb/wiki/Thread-Pool for more details.
Replace Options::max_background_flushes, Options::max_background_compactions, and Options::base_background_compactions all with Options::max_background_jobs, which automatically decides how many threads to allocate towards flush/compaction.
options.delayed_write_rate by default takes the value of the options.rate_limiter rate.
Replace global variable IOStatsContext iostats_context with IOStatsContext* get_iostats_context(); replace global variable PerfContext perf_context with PerfContext* get_perf_context().

New Features

Change ticker/histogram statistics implementations to use core-local storage. This improves aggregation speed compared to our previous thread-local approach, particularly for applications with many threads. See http://rocksdb.org/blog/2017/05/14/core-local-stats.html for more details.
Users can pass a cache object to write buffer manager, so that they can cap memory usage for memtable and block cache using one single limit.
Flush will be triggered when 7/8 of the limit introduced by write_buffer_manager or db_write_buffer_size is reached, so that the hard threshold is hard to hit. See https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager for more details.
Introduce WriteOptions.low_pri. If it is true, low priority writes will be throttled if the compaction is behind. See https://github.com/facebook/rocksdb/wiki/Low-Priority-Write for more details.
DB::IngestExternalFile() now supports ingesting files into a database containing range deletions.

Bug Fixes

Stop ignoring the return value of fsync() in flush.

Aaron Gao

RocksDB 5.5.1 Released!

Posted June 29, 2017

New Features

FIFO compaction to support Intra L0 compaction too with CompactionOptionsFIFO.allow_compaction=true.
Statistics::Reset() to reset user stats.
ldb add option --try_load_options, which will open DB with its own option file.
Introduce WriteBatch::PopSavePoint to pop the most recent save point explicitly.
Support dynamically changing the max_open_files option via SetDBOptions()
Added DB::CreateColumnFamilies() and DB::DropColumnFamilies() to bulk create/drop column families.
Add debugging function GetAllKeyVersions to see internal versions of a range of keys.
Support file ingestion with universal compaction style
Support file ingestion behind with option allow_ingest_behind
New option enable_pipelined_write, which may improve write throughput when writing from multiple threads with the WAL enabled.

Bug Fixes

Fix the bug that Direct I/O uses direct reads for non-SST file
Fix the bug that flush doesn’t respond to fsync result

Andrew Kryczka

Level-based Compaction Changes

Posted June 26, 2017

Introduction

RocksDB provides an option to limit the number of L0 files, which bounds read-amplification. Since L0 files (unlike files at lower levels) can span the entire key-range, a key might be in any file, thus reads need to check them one-by-one. Users often wish to configure a low limit to improve their read latency.

However, the mechanism with which we enforce L0’s file count limit may be unappealing. When the limit is reached, RocksDB intentionally delays user writes. This slows down accumulation of files in L0, and frees up resources for compacting files down to lower levels. But adding delays will significantly increase user-visible write latency jitter.

Also, due to how L0 files can span the entire key-range, compaction parallelization is limited. Files at L0 or L1 may be locked due to involvement in pending L0->L1 or L1->L2 compactions. We can only schedule a parallel L0->L1 compaction if it does not require any of the locked files, which is typically not the case.

To handle these constraints better, we added a new type of compaction, L0->L0. It quickly reduces file count in L0 and can be scheduled even when L1 files are locked, unlike L0->L1. We also changed the L0->L1 picking algorithm to increase opportunities for parallelism.

Old L0->L1 Picking Logic

Previously, our logic for picking which L0 file to compact was the same as every other level: pick the largest file in the level. One special property of L0->L1 compaction is that files can overlap in the input level, so those overlapping files must be pulled in as well. For example, a compaction may look like this:

This compaction pulls in every L0 and L1 file. This happens regardless of which L0 file is initially chosen as each file overlaps with every other file.

Users may insert their data less uniformly in the key-range. For example, a database may look like this during L0->L1 compaction:

Let’s say the third file from the top is the largest, and let’s say the top two files are created after the compaction started. When the compaction is picked, the fourth L0 file and six rightmost L1 files are pulled in due to overlap. Notice this leaves the database in a state where we might not be able to schedule parallel compactions. For example, if the sixth file from the top is the next largest, we can’t compact it because it overlaps with the top two files, which overlap with the locked L0 files.

We can now see the high-level problems with this approach more clearly. First, locked files in L0 or L1 prevent us from parallelizing compactions. When locked files block L0->L1 compaction, there is nothing we can do to eliminate L0 files. Second, L0->L1 compactions are relatively slow. As we saw, when keys are uniformly distributed, L0->L1 compacts two entire levels. While this is happening, new files are being flushed to L0, advancing towards the file count limit.

New L0->L0 Algorithm

We introduced compaction within L0 to improve both parallelization and speed of reducing L0 file count. An L0->L0 compaction may look like this:

Say the L1->L2 compaction started first. Now L0->L1 is prevented by the locked L1 file. In this case, we compact files within L0. This allows us to start the work for eliminating L0 files earlier. It also lets us do less work since we don’t pull in any L1 files, whereas L0->L1 compaction would’ve pulled in all of them. This lets us quickly reduce L0 file count to keep read-amp low while sustaining large bursts of writes (i.e., fast accumulation of L0 files).

The tradeoff is that this increases total compaction work, as we’re now compacting files without contributing towards our eventual goal of moving them towards lower levels. Our benchmarks, though, consistently show fewer compaction stalls and improved write throughput. One justification is that L0 file data is highly likely to be in page cache and/or block cache, because it was recently written and is frequently accessed. So, this type of compaction is relatively cheap compared to compactions at lower levels.

This feature has been available since RocksDB 5.4.

New L0->L1 Picking Logic

Recall how the old L0->L1 picking algorithm chose the largest L0 file for compaction. This didn’t fit well with L0->L0 compaction, which operates on a span of files. That span begins at the newest L0 file, and expands towards older files as long as they’re not being compacted. Since the largest file may be anywhere, the old L0->L1 picking logic could arbitrarily prevent us from getting a long span of files. See the second illustration in this post for a scenario where this would happen.

So, we changed the L0->L1 picking algorithm to start from the oldest file and expand towards newer files as long as they’re not being compacted. For example:

Now, there can never be L0 files unreachable for L0->L0 due to L0->L1 selecting files in the middle. When longer spans of files are available for L0->L0, we perform less compaction work per deleted L0 file, thus improving efficiency.
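The two span rules can be sketched as follows. This is an illustrative model only: it assumes L0 files are held in a vector ordered newest first, with a flag per file saying whether it already belongs to a running compaction. The function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// l0_busy[i] is true when L0 file i is already part of a running compaction.
// Index 0 is the newest L0 file; the last index is the oldest.

// L0->L0: the span starts at the newest file and grows toward older files
// until it meets a file that is being compacted.
// Returns the number of files in the span [0, n).
size_t IntraL0SpanLength(const std::vector<bool>& l0_busy) {
  size_t n = 0;
  while (n < l0_busy.size() && !l0_busy[n]) ++n;
  return n;
}

// New L0->L1: the span starts at the oldest file and grows toward newer
// files until it meets a file that is being compacted.
// Returns the first index of the span [begin, l0_busy.size()).
size_t L0ToL1SpanBegin(const std::vector<bool>& l0_busy) {
  size_t begin = l0_busy.size();
  while (begin > 0 && !l0_busy[begin - 1]) --begin;
  return begin;
}
```

Because one span grows from the newest end and the other from the oldest end, neither can strand free files in the middle the way a "largest file first" choice could.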

This feature will be available in RocksDB 5.7.

Performance Changes

Mark Callaghan did the most extensive benchmarking of this feature’s impact on MyRocks. See his results here. Note the primary change between his March 17 and April 14 builds is the latter performs L0->L0 compaction.

Sagar Vemuri

RocksDB 5.4.5 Released!

Posted May 26, 2017

Public API Change

Support dynamically changing stats_dump_period_sec option via SetDBOptions().
Added ReadOptions::max_skippable_internal_keys to set a threshold to fail a request as incomplete when too many keys are being skipped while using iterators.
DB::Get now accepts PinnableSlice in place of std::string, which avoids the extra memcpy of the value to std::string in most cases.
- PinnableSlice releases the pinned resources that contain the value when it is destructed or when ::Reset() is called on it.
- The old API that accepts std::string, although discouraged, is still supported.
Replace Options::use_direct_writes with Options::use_direct_io_for_flush_and_compaction. See Direct IO wiki for details.

New Features

Memtable flush can be avoided during checkpoint creation if total log file size is smaller than a threshold specified by the user.
Introduce level-based L0->L0 compactions to reduce file count, so write delays are incurred less often.
(Experimental) Partitioned filters, which create an index on the partitions. The feature can be enabled by setting partition_filters when using kFullFilter. Currently the feature also requires two-level indexing to be enabled. The number of partitions is the same as the number of partitions for indexes, which is controlled by metadata_block_size.
DB::ResetStats() to reset internal stats.
Added CompactionEventListener and EventListener::OnFlushBegin interfaces.
Added DB::CreateColumnFamilies() and DB::DropColumnFamilies() to bulk create/drop column families.
Facility for cross-building RocksJava using Docker.

Bug Fixes

Fix WriteBatchWithIndex address use after scope error.
Fix WritableFile buffer size in direct IO.
Add prefetch to PosixRandomAccessFile in buffered io.
Fix PinnableSlice access invalid address when row cache is enabled.
Fix huge fallocate calls that fail and make XFS unhappy.
Fix memory alignment with logical sector size.
Fix alignment in ReadaheadRandomAccessFile.
Fix bias with read amplification stats (READ_AMP_ESTIMATE_USEFUL_BYTES and READ_AMP_TOTAL_READ_BYTES).
Fix a manual / auto compaction data race.
Fix CentOS 5 cross-building of RocksJava.
Build and link with ZStd when creating the static RocksJava build.
Fix snprintf’s usage to be cross-platform.
Fix build errors with blob DB.
Fix readamp test type inconsistency.

Andrew Kryczka

Core-local Statistics

Posted May 14, 2017

Origins: Global Atomics

Until RocksDB 4.12, ticker/histogram statistics were implemented with std::atomic values shared across the entire program. A ticker consists of a single atomic, while a histogram consists of several atomics to represent things like min/max/per-bucket counters. These statistics could be updated by all user/background threads.

For concurrent/high-throughput workloads, cache line bouncing of atomics caused high CPU utilization. For example, we have tickers that count block cache hits and misses. Almost every user read increments these tickers a few times. Many concurrent user reads would cause the cache lines containing these atomics to bounce between cores.

Performance

Here are perf results for 32 reader threads where most reads (99%+) are served by uncompressed block cache. Such a scenario stresses the statistics code heavily.

Benchmark command: TEST_TMPDIR=/dev/shm/ perf record -g ./db_bench -statistics -use_existing_db=true -benchmarks=readrandom -threads=32 -cache_size=1048576000 -num=1000000 -reads=1000000 && perf report -g --children

Perf snippet for “cycles” event:

  Children  Self    Command   Shared Object  Symbol
+   30.33%  30.17%  db_bench  db_bench       [.] rocksdb::StatisticsImpl::recordTick
+    3.65%   0.98%  db_bench  db_bench       [.] rocksdb::StatisticsImpl::measureTime

Perf snippet for “cache-misses” event:

  Children  Self    Command   Shared Object  Symbol
+   19.54%  19.50%  db_bench  db_bench       [.] rocksdb::StatisticsImpl::recordTick
+    3.44%   0.57%  db_bench  db_bench       [.] rocksdb::StatisticsImpl::measureTime

The high CPU overhead for updating tickers and histograms corresponds well to the high cache misses.

Thread-locals: Faster Updates

Since RocksDB 4.12, ticker/histogram statistics use thread-local storage. Each thread has a local set of atomic values that no other thread can update. This prevents the cache line bouncing problem described above. Even though updates to a given value are always made by the same thread, atomics are still useful to synchronize with aggregations for querying statistics.

Implementing this approach involved a couple of challenges. First, each query for a statistic’s global value must aggregate all threads’ local values. This adds some overhead, which may pass unnoticed if statistics are queried infrequently. Second, exited threads’ local values are still needed to provide accurate statistics. We handle this by merging a thread’s local values into process-wide variables upon thread exit.
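A minimal sketch of the thread-local idea, using only the C++ standard library. It is not RocksDB's StatisticsImpl: the `ThreadLocalTicker` class is hypothetical, models a single process-wide ticker, and uses a mutex-protected registry of live threads where the real code has its own thread-local machinery.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <set>
#include <thread>
#include <vector>

// Hypothetical single ticker with one atomic slot per thread.
class ThreadLocalTicker {
 public:
  // Fast path: touches only the calling thread's own slot.
  static void Add(uint64_t n) {
    Local().value.fetch_add(n, std::memory_order_relaxed);
  }

  // Slow path: sums the values of exited threads and all live threads.
  static uint64_t Get() {
    std::lock_guard<std::mutex> guard(Mutex());
    uint64_t sum = Exited();
    for (const Slot* s : Live()) {
      sum += s->value.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  struct Slot {
    std::atomic<uint64_t> value{0};
    Slot() {
      std::lock_guard<std::mutex> guard(Mutex());
      Live().insert(this);
    }
    // On thread exit, fold the local value into the process-wide total.
    ~Slot() {
      std::lock_guard<std::mutex> guard(Mutex());
      Exited() += value.load(std::memory_order_relaxed);
      Live().erase(this);
    }
  };

  static Slot& Local() {
    thread_local Slot slot;
    return slot;
  }
  static std::mutex& Mutex() {
    static std::mutex m;
    return m;
  }
  static std::set<Slot*>& Live() {
    static std::set<Slot*> live;
    return live;
  }
  static uint64_t& Exited() {
    static uint64_t total = 0;
    return total;
  }
};
```

Note how the cost of `Get()` grows with the number of live threads, which is the weakness the next section addresses.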

Performance

The update benchmark setup is the same as before. CPU overhead improved 7.8x compared to global atomics, corresponding to a 17.8x reduction in cache-misses overhead.

Perf snippet for “cycles” event:

  Children  Self    Command   Shared Object  Symbol
+    2.96%  0.87%   db_bench  db_bench       [.] rocksdb::StatisticsImpl::recordTick
+    1.37%  0.10%   db_bench  db_bench       [.] rocksdb::StatisticsImpl::measureTime

Perf snippet for “cache-misses” event:

  Children  Self    Command   Shared Object  Symbol
+    1.21%  0.65%   db_bench  db_bench       [.] rocksdb::StatisticsImpl::recordTick
     0.08%  0.00%   db_bench  db_bench       [.] rocksdb::StatisticsImpl::measureTime

To measure statistics query latency, we ran sysbench with 4K OLTP clients concurrently with one client that queries statistics repeatedly. Times shown are in milliseconds.

 min: 18.45
 avg: 27.91
 max: 231.65
 95th percentile: 55.82

Core-locals: Faster Querying

The thread-local approach is working well for applications calling RocksDB from only a few threads, or polling statistics infrequently. Eventually, though, we found use cases where those assumptions do not hold. For example, one application has per-connection threads and typically runs into performance issues when connection count grows very high. For debugging such issues, they want high-frequency statistics polling to correlate issues in their application with changes in RocksDB’s state.

Once PR #2258 lands, ticker/histogram statistics will be local to each CPU core. Similarly to thread-local, each core updates only its local values, thus avoiding cache line bouncing. Local values are still atomics to make aggregation possible. With this change, query work depends only on number of cores, not the number of threads. So, applications with many more threads than cores can no longer impact statistics query latency.
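The following sketch shows the shape of a sharded counter whose query cost depends on a fixed shard count rather than on the thread count. One loud assumption: a portable program cannot ask which core it is running on, so this example picks a shard by hashing the thread id. That is a stand-in for a real per-core index (on Linux, something like `sched_getcpu()` could be used instead). `ShardedTicker` is a hypothetical name.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical counter split into kShards cache-line-sized slots.
template <size_t kShards>
class ShardedTicker {
 public:
  // Updates one shard only. Shards stay atomic because several threads
  // may map to the same shard.
  void Add(uint64_t n) {
    shards_[ShardIndex()].value.fetch_add(n, std::memory_order_relaxed);
  }

  // Aggregation walks kShards slots, however many threads exist.
  uint64_t Get() const {
    uint64_t sum = 0;
    for (const Shard& s : shards_) {
      sum += s.value.load(std::memory_order_relaxed);
    }
    return sum;
  }

 private:
  // Align each shard to 64 bytes (a common cache line size) so that
  // neighbouring shards do not share a line.
  struct alignas(64) Shard {
    std::atomic<uint64_t> value{0};
  };

  // Assumption: thread-id hash stands in for the current core's index.
  static size_t ShardIndex() {
    return std::hash<std::thread::id>()(std::this_thread::get_id()) % kShards;
  }

  std::array<Shard, kShards> shards_;
};
```

With, say, 8 shards and thousands of connection threads, a statistics query still reads only 8 values.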

Performance

The update benchmark setup is the same as before. CPU overhead worsened ~23% compared to thread-local, while cache performance was unchanged.

Perf snippet for “cycles” event:

  Children  Self    Command   Shared Object  Symbol
+    2.96%  0.87%   db_bench  db_bench       [.] rocksdb::StatisticsImpl::recordTick
+    1.37%  0.10%   db_bench  db_bench       [.] rocksdb::StatisticsImpl::measureTime

Perf snippet for “cache-misses” event:

  Children  Self    Command   Shared Object  Symbol
+    1.21%  0.65%   db_bench  db_bench       [.] rocksdb::StatisticsImpl::recordTick
     0.08%  0.00%   db_bench  db_bench       [.] rocksdb::StatisticsImpl::measureTime

Query latency is measured the same way as before, with times in milliseconds. Average latency improved by 6.3x compared to thread-local.

 min: 2.47
 avg: 4.45
 max: 91.13
 95th percentile: 7.56

Maysam Yabandeh

Partitioned Index/Filters

Posted May 12, 2017

As the DB/mem ratio gets larger, the memory footprint of filter/index blocks becomes non-trivial. Although cache_index_and_filter_blocks allows storing only a subset of them in block cache, their relatively large size negatively affects performance by i) occupying block cache space that could otherwise be used for caching data, and ii) increasing the load on disk storage by loading them into the cache after a miss. Here we illustrate these problems in more detail and explain how partitioning index/filters alleviates the overhead.

How large are the index/filter blocks?

RocksDB has by default one index/filter block per SST file. The size of the index/filter varies based on the configuration, but for an SST of size 256MB an index/filter block of size 0.5/5MB is typical, which is much larger than the typical data block size of 4-32KB. That is fine when all index/filters fit perfectly into memory and hence are read once per SST lifetime, not so much when they compete with data blocks for the block cache space and are also likely to be re-read many times from the disk.

What is the big deal with large index/filter blocks?

When index/filter blocks are stored in block cache they are effectively competing with data blocks (as well as with each other) on this scarce resource. A filter of size 5MB is occupying the space that could otherwise be used to cache 1000s of data blocks (of size 4KB). This would result in more cache misses for data blocks. The large index/filters also kick each other out of the block cache more often and exacerbate their own cache miss rate too. This is while only a small part of the index/filter block might have been actually used during its lifetime in the cache.

After the cache miss of an index/filter, it has to be reloaded from the disk, and its large size is not helping in reducing the IO cost. While a simple point lookup might need at most a couple of data block reads (of size 4KB) one from each layer of LSM, it might end up also loading multiple megabytes of index/filter blocks. If that happens often then the disk is spending more time serving index/filters rather than the actual data blocks.

What is partitioned index/filters?

With partitioning, the index/filter of an SST file is partitioned into smaller blocks with an additional top-level index on them. When reading an index/filter, only the top-level index is loaded into memory. The partitioned index/filter then uses the top-level index to load on demand into the block cache the partitions that are required to perform the index/filter query. The top-level index, which has a much smaller memory footprint, can be stored in heap or block cache depending on the cache_index_and_filter_blocks setting.
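The lookup path described above can be modelled in a few lines. This is a toy illustration, not the on-disk format: `PartitionedIndex`, its in-memory "disk", and the map used as a cache are all hypothetical, and partitions here hold plain keys rather than block handles.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model: a sorted key list split into fixed-size partitions, plus a
// small top-level index mapping each partition's last key to its id.
class PartitionedIndex {
 public:
  PartitionedIndex(const std::vector<std::string>& sorted_keys,
                   size_t partition_size) {
    const size_t step = partition_size == 0 ? 1 : partition_size;
    for (size_t i = 0; i < sorted_keys.size(); i += step) {
      const size_t end = std::min(i + step, sorted_keys.size());
      std::vector<std::string> part(sorted_keys.begin() + i,
                                    sorted_keys.begin() + end);
      top_level_[part.back()] = on_disk_.size();
      on_disk_.push_back(std::move(part));
    }
  }

  // Consults the top-level index, then loads only the one partition that
  // could contain the key.
  bool Contains(const std::string& key) {
    auto it = top_level_.lower_bound(key);  // first partition ending >= key
    if (it == top_level_.end()) return false;
    const std::vector<std::string>& part = Load(it->second);
    return std::binary_search(part.begin(), part.end(), key);
  }

  // Number of partitions currently held in the simulated block cache.
  size_t partitions_loaded() const { return cache_.size(); }

 private:
  const std::vector<std::string>& Load(size_t id) {
    auto it = cache_.find(id);
    if (it == cache_.end()) it = cache_.emplace(id, on_disk_[id]).first;
    return it->second;
  }

  std::map<std::string, size_t> top_level_;        // always in memory
  std::vector<std::vector<std::string>> on_disk_;  // stands in for the SST
  std::map<size_t, std::vector<std::string>> cache_;
};
```

Lookups that touch one key range bring in one small partition, instead of the whole multi-megabyte index/filter block.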

Success stories

HDD, 100TB DB

In this example we have a DB of size 86G on HDD and emulate the small memory that is present to a node with 100TB of data by using direct IO (skipping OS file cache) and a very small block cache of size 60MB. Partitioning improves throughput by 11x from 5 op/s to 55 op/s.

SSD, Linkbench

In this example we have a DB of size 300G on SSD and emulate the small memory that would be available in the presence of other DBs on the same node by using direct IO (skipping OS file cache) and block cache of size 6G and 2G. Without partitioning, the linkbench throughput drops from 38k tps to 23k when reducing block cache size from 6G to 2G. With partitioning, the throughput drops from 38k to only 30k.

Learn more here.

Siying Dong

RocksDB 5.2.1 Released!

Posted March 02, 2017

Public API Change

NewLRUCache() will determine the number of shard bits automatically based on capacity, if the user doesn’t pass one. This also impacts the default block cache when the user doesn’t explicitly provide one.
Change the default of delayed slowdown value to 16MB/s and further increase the L0 stop condition to 36 files.

New Features

Added a new overloaded function GetApproximateSizes that allows specifying whether only memtable stats should be computed, without computing SST files’ stats approximations.
Added new function GetApproximateMemTableStats that approximates both number of records and size of memtables.
(Experimental) Two-level indexing that partitions the index and creates a 2nd level index on the partitions. The feature can be enabled by setting kTwoLevelIndexSearch as IndexType and configuring index_per_partition.

Bug Fixes

RangeSync() should work if ROCKSDB_FALLOCATE_PRESENT is not set
Fix wrong results in a data race case in Get()
Some fixes related to 2PC.
Fix several bugs in Direct I/O support.
Fix a regression bug which can cause Seek() to miss some keys if the return key has been updated many times after the snapshot which is used by the iterator.

Islam AbdelRahman

Bulkloading by ingesting external SST files

Posted February 17, 2017

Introduction

One of the basic operations of RocksDB is writing to it. Writes happen when the user calls DB::Put, DB::Write, DB::Delete, and so on. But what happens when you write to RocksDB? Here is a brief description.

The user inserts a new key/value by calling DB::Put() (or DB::Write()).
We create a new entry for the new key/value in our in-memory structure (memtable / SkipList by default) and we assign it a new sequence number.
When the memtable exceeds a specific size (64 MB for example), we convert this memtable to an SST file, and put this file in level 0 of our LSM-Tree.
Later, compaction will kick in and move data from level 0 to level 1, then from level 1 to level 2, and so on.
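The first three steps can be mimicked with a toy model. This is a deliberately tiny sketch under stated assumptions: `TinyLsm` is hypothetical, the memtable is a `std::map` rather than a skip list, "SST files" are just frozen maps kept in memory, the size accounting is approximate, and compaction is not modelled at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy write path: memtable insert with a sequence number, then a flush of
// the whole memtable to level 0 once it grows past a size limit.
class TinyLsm {
 public:
  explicit TinyLsm(size_t memtable_limit_bytes)
      : limit_(memtable_limit_bytes) {}

  void Put(const std::string& key, const std::string& value) {
    // Each write gets a fresh, increasing sequence number.
    memtable_[key] = std::make_pair(next_seq_++, value);
    // Approximate accounting: overwrites are counted as new bytes too.
    memtable_bytes_ += key.size() + value.size();
    if (memtable_bytes_ >= limit_) Flush();
  }

  size_t level0_files() const { return level0_.size(); }
  size_t memtable_entries() const { return memtable_.size(); }

 private:
  typedef std::map<std::string, std::pair<uint64_t, std::string> > SortedRun;

  // Freeze the memtable as a new level-0 "file" and start an empty one.
  void Flush() {
    level0_.push_back(std::move(memtable_));
    memtable_.clear();
    memtable_bytes_ = 0;
  }

  size_t limit_;
  size_t memtable_bytes_ = 0;
  uint64_t next_seq_ = 1;
  SortedRun memtable_;
  std::vector<SortedRun> level0_;
};
```

Every key written this way lands in level 0 first and has to be rewritten by compaction to reach lower levels.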

But what if we could skip these steps and add data to the lowest possible level directly? This is what bulk-loading does.

Bulkloading

Write all of our keys and values into SST file outside of the DB
Add the SST file into the LSM directly

This is bulk-loading, and in specific use-cases it allows users to achieve faster data loading and better write-amplification.

Doing it is as simple as:

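#include <cassert>
#include <string>
#include "rocksdb/db.h"
#include "rocksdb/sst_file_writer.h"

using namespace rocksdb;

// db_ is assumed to be an already-open DB*.
const std::string file_path = "/home/usr/file1.sst";
Options options;
SstFileWriter sst_file_writer(EnvOptions(), options, options.comparator);
Status s = sst_file_writer.Open(file_path);
assert(s.ok());

// Insert rows into the SST file. Note that inserted keys must be
// strictly increasing (based on options.comparator).
for (...) {
  s = sst_file_writer.Add(key, value);
  assert(s.ok());
}

// Finish writing and close the SST file before ingesting it
s = sst_file_writer.Finish();
assert(s.ok());

// Ingest the external SST file into the DB
s = db_->IngestExternalFile({file_path}, IngestExternalFileOptions());
assert(s.ok());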

You can find more details about how to generate SST files and ingest them into RocksDB in this wiki page.
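Since keys handed to the SST file writer must be strictly increasing, unsorted input has to be sorted and de-duplicated first. Here is a minimal sketch of that preparation step. It assumes the default bytewise key order (which is what `std::string` comparison gives) and that the last value seen for a key should win. `SortAndDedupRows` is a hypothetical helper, not part of RocksDB.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, std::string> Row;  // (key, value)

// Returns the rows sorted by key in bytewise order, with duplicate keys
// collapsed so that the value appearing last in the input is kept.
std::vector<Row> SortAndDedupRows(std::vector<Row> rows) {
  // stable_sort preserves input order among rows with equal keys.
  std::stable_sort(rows.begin(), rows.end(),
                   [](const Row& a, const Row& b) { return a.first < b.first; });
  std::vector<Row> out;
  for (Row& r : rows) {
    if (!out.empty() && out.back().first == r.first) {
      out.back().second = std::move(r.second);  // later value wins
    } else {
      out.push_back(std::move(r));
    }
  }
  return out;
}
```

The resulting rows can be added to the writer one by one in order. A database opened with a custom comparator would need the same comparator here instead of `operator<`.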

Use cases

There are multiple use cases where bulkloading could be useful, for example:

Generating SST files in offline jobs in Hadoop, then downloading and ingesting the SST files into RocksDB
Migrating shards between machines by dumping a key-range into an SST file and loading the file on a different machine
Migrating from a different storage (InnoDB to RocksDB migration in MyRocks)

Maysam Yabandeh

RocksDB 5.1.2 Released!

Posted February 07, 2017

Public API Change

Support dynamically changing the delete_obsolete_files_period_micros option via SetDBOptions().
Added EventListener::OnExternalFileIngested, which will be called when IngestExternalFile() adds a file successfully.
BackupEngine::Open and BackupEngineReadOnly::Open now always return error statuses matching those of the backup Env.

Bug Fixes

Fix the bug that, if 2PC is enabled, checkpoints may lose some recent transactions.
When files need to be copied while creating checkpoints or bulk loading files, fsync each file after copying it.

Yi Wu

RocksDB 5.0.1 Released!

Posted January 06, 2017

Public API Change

Options::max_bytes_for_level_multiplier is now a double along with all getters and setters.
Support dynamically changing the delayed_write_rate and max_total_wal_size options via SetDBOptions().
Introduce DB::DeleteRange for optimized deletion of large ranges of contiguous keys.
Support dynamically changing the delayed_write_rate option via SetDBOptions().
Options::allow_concurrent_memtable_write and Options::enable_write_thread_adaptive_yield are now true by default.
Remove Tickers::SEQUENCE_NUMBER to avoid confusion if a statistics object is shared among RocksDB instances. Alternatively, DB::GetLatestSequenceNumber() can be used to get the same value.
Options.level0_stop_writes_trigger default value changes from 24 to 32.
New compaction filter API: CompactionFilter::FilterV2(). Allows dropping ranges of keys.
Removed flashcache support.
DB::AddFile() is deprecated and is replaced with DB::IngestExternalFile(). DB::IngestExternalFile() removes all the restrictions that existed for DB::AddFile.

New Features

Add avoid_flush_during_shutdown option, which speeds up DB shutdown by not flushing unpersisted data (i.e. with disableWAL = true). Unpersisted data will be lost. The option is dynamically changeable via SetDBOptions().
Add memtable_insert_with_hint_prefix_extractor option. The option is meant to reduce CPU usage for inserting keys into the memtable, if keys can be grouped by prefix and inserts for each prefix are sequential or almost sequential. See include/rocksdb/options.h for more details.
Add LuaCompactionFilter in utilities. This allows developers to write compaction filters in Lua. To use this feature, LUA_PATH needs to be set to the root directory of Lua.
No longer populate “LATEST_BACKUP” file in backup directory, which formerly contained the number of the latest backup. The latest backup can be determined by finding the highest numbered file in the “meta/” subdirectory.

Siying Dong

RocksDB 4.11.2 Released!

Posted September 28, 2016

We abandoned release candidates 4.10.x and went directly from 4.9 to 4.11.2, to make sure the latest release is stable. In 4.11.2, we fixed several data corruption related bugs introduced in 4.9.0.

4.11.2 (9/15/2016)

Bug fixes

Segfault when failing to open an SST file for read-ahead iterators.
WAL without data for all CFs is not deleted after recovery.

Yi Wu

RocksDB 4.8 Released!

Posted July 26, 2016

4.8.0 (5/2/2016)

Public API Change

Allow preset compression dictionary for improved compression of block-based tables. This is supported for zlib, zstd, and lz4. The compression dictionary’s size is configurable via CompressionOptions::max_dict_bytes.
Delete deprecated classes for creating backups (BackupableDB) and restoring from backups (RestoreBackupableDB). Now, BackupEngine should be used for creating backups, and BackupEngineReadOnly should be used for restorations. For more details, see https://github.com/facebook/rocksdb/wiki/How-to-backup-RocksDB%3F
Expose estimate of per-level compression ratio via DB property: “rocksdb.compression-ratio-at-levelN”.
Added EventListener::OnTableFileCreationStarted. EventListener::OnTableFileCreated will be called on failure case. User can check creation status via TableFileCreationInfo::status.

New Features

Add ReadOptions::readahead_size. If non-zero, NewIterator will create a new table reader which performs reads of the given size.

Siying Dong

RocksDB 4.5.1 Released!

Posted April 26, 2016

4.5.1 (3/25/2016)

Bug Fixes

Fix failures caused by the destruction order of singleton objects.

4.5.0 (2/5/2016)

Public API Changes

Add a new perf context level between kEnableCount and kEnableTime. Level 2 now does not include timers for mutexes.
Statistics of mutex operation durations will not be measured by default. If you want to have them enabled, you need to set Statistics::stats_level_ to kAll.
DBOptions::delete_scheduler and NewDeleteScheduler() are removed, please use DBOptions::sst_file_manager and NewSstFileManager() instead

New Features

ldb tool now supports operations to non-default column families.
Add kPersistedTier to ReadTier. This option allows Get and MultiGet to read only the persisted data and skip mem-tables if writes were done with disableWAL = true.
Add DBOptions::sst_file_manager. Use NewSstFileManager() in include/rocksdb/sst_file_manager.h to create a SstFileManager that can be used to track the total size of SST files and control the SST files deletion rate.

Yueh-Hsuan Chiang

RocksDB Options File

Posted March 07, 2016

In RocksDB 4.3, we added a new set of features that makes managing RocksDB options easier. Specifically:

Persisting Options Automatically: Each RocksDB database will now automatically persist its current set of options into an INI file on every successful call of DB::Open(), SetOptions(), and CreateColumnFamily() / DropColumnFamily().
Load Options from File: We added LoadLatestOptions() / LoadOptionsFromFile() that enables developers to construct RocksDB options object from an options file.
Sanity Check Options: We added CheckOptionsCompatibility that performs compatibility check on two sets of RocksDB options.

RocksDB AMA

Posted February 25, 2016

RocksDB developers are doing a Reddit Ask-Me-Anything now at 10AM – 11AM PDT! We welcome you to stop by and ask any RocksDB related questions, including existing / upcoming features, tuning tips, or database design.

Here are some enhancements that we’d like to focus on over the next six months:

2-Phase Commit
Lua support in some custom functions
Backup and repair tools
Direct I/O to bypass OS cache
RocksDB Java API

https://www.reddit.com/r/IAmA/comments/47k1si/we_are_rocksdb_developers_ask_us_anything/

Siying Dong

RocksDB 4.2 Release!

Posted February 24, 2016

New RocksDB release - 4.2!

New Features

Introduce CreateLoggerFromOptions(). This function creates a Logger for the provided DBOptions.
Add GetAggregatedIntProperty(), which returns the sum of the GetIntProperty of all the column families.
Add MemoryUtil in rocksdb/utilities/memory.h. It currently offers a way to get the memory usage by type from a list of RocksDB instances.

Siying Dong

Option of Compaction Priority

Posted January 29, 2016

The most popular compaction style of RocksDB is level-based compaction, which is an improved version of LevelDB’s compaction algorithm. Pages 9-16 of these slides give an illustrated introduction to this compaction style. The basic idea is that data is organized into multiple levels with exponentially increasing target sizes. Except for a special level 0, every level is key-range partitioned into many files. When the size of a level exceeds its target size, we pick one or more of its files and merge them into the next level.
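As a small worked example of "exponentially increasing target size", the sketch below computes per-level targets from a base size and a multiplier, then reports which level is most over its target. The helper names are hypothetical, and choosing the level by the ratio of actual size to target size is an assumption made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Target sizes for L1, L2, ...: base, base * m, base * m^2, and so on.
std::vector<uint64_t> LevelTargets(uint64_t base_bytes, uint64_t multiplier,
                                   size_t num_levels) {
  std::vector<uint64_t> targets;
  uint64_t target = base_bytes;
  for (size_t i = 0; i < num_levels; ++i) {
    targets.push_back(target);
    target *= multiplier;
  }
  return targets;
}

// Returns the index (0 means L1) of the level whose actual size exceeds
// its target by the largest ratio, or -1 if no level is over target.
int PickLevelToCompact(const std::vector<uint64_t>& actual,
                       const std::vector<uint64_t>& targets) {
  int best = -1;
  double best_score = 1.0;
  for (size_t i = 0; i < actual.size() && i < targets.size(); ++i) {
    if (targets[i] == 0) continue;
    const double score =
        static_cast<double>(actual[i]) / static_cast<double>(targets[i]);
    if (score > best_score) {
      best_score = score;
      best = static_cast<int>(i);
    }
  }
  return best;
}
```

With a 1GB base and a multiplier of 10, four levels get targets of 1GB, 10GB, 100GB and 1000GB.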

Siying Dong

Analysis File Read Latency by Level

Posted November 16, 2015

In many use cases of RocksDB, people rely on the OS page cache for caching compressed data. With this approach, verifying the effectiveness of OS page caching is challenging, because the file system is a black box to users.

As an example, a user can tune the DB as follows: use level-based compaction, with L1 - L4 sizes of 1GB, 10GB, 100GB and 1TB. They reserve about 20GB of memory as OS page cache, expecting levels 0, 1 and 2 to be mostly cached in memory, leaving only reads from levels 3 and 4 requiring disk I/Os. However, in practice, it’s not easy to verify whether the OS page cache does exactly what we expect. For example, if we end up doing 4 instead of 2 I/Os per query, it’s not easy for users to figure out whether it’s because of the efficiency of the OS page cache or because of reading multiple blocks for a level. Analysis like this is especially important if users run RocksDB on hard disk drives, because the latency gap between hard drives and memory is much higher than the gap between flash-based SSDs and memory.

Venkatesh Radhakrishnan

Use Checkpoints for Efficient Snapshots

Posted November 10, 2015

Checkpoint is a feature in RocksDB which provides the ability to take a snapshot of a running RocksDB database in a separate directory. Checkpoints can be used as a point in time snapshot, which can be opened Read-only to query rows as of the point in time or as a Writeable snapshot by opening it Read-Write. Checkpoints can be used for both full and incremental backups.

Yueh-Hsuan Chiang

GetThreadList

Posted October 27, 2015

We recently added a new API, called GetThreadList(), that exposes the RocksDB background thread activity. With this feature, developers will be able to obtain the real-time information about the currently running compactions and flushes such as the input / output size, elapsed time, the number of bytes it has written. Below is an example output of GetThreadList. To better illustrate the example, we have put a sample output of GetThreadList into a table where each column represents a thread status:

Siying Dong

Dynamic Level Size for Level-Based Compaction

Posted July 23, 2015

In this article, we follow up on the first part of an answer to one of the questions in our AMA, the dynamic level size in level-based compaction.

Dmitri Smirnov

RocksDB is now available in Windows Platform

Posted July 22, 2015

Over the past 6 months we have seen a number of use cases where RocksDB is successfully used by the community and various companies to achieve high throughput and volume in a modern server environment.

We at Microsoft Bing could not be left behind. As a result we are happy to announce the availability of the Windows Port created here at Microsoft which we intend to use as a storage option for one of our key/value data stores.

Igor Canadi

Spatial indexing in RocksDB

Posted July 17, 2015

About a year ago, there was a need to develop a spatial database at Facebook. We needed to store and index Earth’s map data. Before building our own, we looked at the existing spatial databases. They were all very good technology, but also general purpose. We could sacrifice a general-purpose API, so we thought we could build a more performant database, since it would be specifically designed for our use-case. Furthermore, we decided to build the spatial database on top of RocksDB, because we have a lot of operational experience with running and tuning RocksDB at a large scale.

Igor Canadi

RocksDB 2015 H2 roadmap

Posted July 15, 2015

Every 6 months, the RocksDB team gets together to prioritize the work ahead of us. We just went through this exercise and we wanted to share the results with the community. Here’s what the RocksDB team will be focusing on for the next 6 months:

Igor Canadi

RocksDB in osquery

Posted June 12, 2015

Check out this blog post by Mike Arpaia and Ted Reed about how osquery leverages RocksDB to build an embedded pub-sub system. This article is a great read and contains insights on how to properly use RocksDB.

Igor Canadi

Integrating RocksDB with MongoDB

Posted April 22, 2015

Over the last couple of years, we have been busy integrating RocksDB with various services here at Facebook that needed to store key-value pairs locally. We have also seen other companies using RocksDB as local storage components of their distributed systems.

Siying Dong

WriteBatchWithIndex: Utility for Implementing Read-Your-Own-Writes

Posted February 27, 2015

RocksDB can be used as a storage engine of a higher level database. In fact, we are currently plugging RocksDB into MySQL and MongoDB as one of their storage engines. RocksDB can help with guaranteeing some of the ACID properties: durability is guaranteed by RocksDB by design; while consistency and isolation need to be enforced by concurrency controls on top of RocksDB; Atomicity can be implemented by committing a transaction’s writes with one write batch to RocksDB in the end.
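The read-your-own-writes idea can be shown with a toy batch that is searched before the underlying store. This is an illustrative sketch only: `IndexedBatch` is a hypothetical class, and a `std::map` stands in for the database. It is not the WriteBatchWithIndex API.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef std::map<std::string, std::string> Db;  // stand-in for the database

// Toy write batch with an index: uncommitted writes are visible to reads
// made through the batch, and are applied to the DB together on Commit().
class IndexedBatch {
 public:
  void Put(const std::string& key, const std::string& value) {
    PendingWrite w = {false, value};
    pending_[key] = w;
  }

  void Delete(const std::string& key) {
    PendingWrite w = {true, std::string()};
    pending_[key] = w;
  }

  // Looks in the batch first, then falls back to the DB.
  // Returns false if the key is absent or deleted by this batch.
  bool Get(const Db& db, const std::string& key, std::string* value) const {
    std::map<std::string, PendingWrite>::const_iterator p = pending_.find(key);
    if (p != pending_.end()) {
      if (p->second.is_delete) return false;
      *value = p->second.value;
      return true;
    }
    Db::const_iterator d = db.find(key);
    if (d == db.end()) return false;
    *value = d->second;
    return true;
  }

  // Applies all pending writes to the DB and empties the batch.
  void Commit(Db* db) {
    for (std::map<std::string, PendingWrite>::const_iterator p =
             pending_.begin();
         p != pending_.end(); ++p) {
      if (p->second.is_delete) {
        db->erase(p->first);
      } else {
        (*db)[p->first] = p->second.value;
      }
    }
    pending_.clear();
  }

 private:
  struct PendingWrite {
    bool is_delete;
    std::string value;
  };
  std::map<std::string, PendingWrite> pending_;
};
```

Until `Commit()` runs, other readers of the DB see none of the batch's writes, while the transaction itself sees all of them.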

Leonidas Galanis

Reading RocksDB options from a file

Posted February 24, 2015

RocksDB options can be provided using a file or any string to RocksDB. The format is straightforward: write_buffer_size=1024;max_write_buffer_number=2. Any whitespace around = and ; is OK. Moreover, options can be nested as necessary. For example BlockBasedTableOptions can be nested as follows: write_buffer_size=1024; max_write_buffer_number=2; block_based_table_factory={block_size=4k};. Similarly any white space around { or } is ok. Here is what it looks like in code:
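To illustrate the string format itself, here is a minimal standalone parser sketch. It is not RocksDB's parser and does not call any RocksDB API. `ParseOptions` and `Trim` are hypothetical helpers that split a string into key/value pairs, tolerate whitespace around `=`, `;`, `{` and `}`, and return a nested `{...}` value as raw text that can be parsed again.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Strips leading and trailing whitespace.
std::string Trim(const std::string& s) {
  size_t b = 0, e = s.size();
  while (b < e && std::isspace(static_cast<unsigned char>(s[b]))) ++b;
  while (e > b && std::isspace(static_cast<unsigned char>(s[e - 1]))) --e;
  return s.substr(b, e - b);
}

// Parses "k1=v1; k2={nested=1}; ..." into a map. A value wrapped in braces
// is stored without the outer braces and can be passed to ParseOptions again.
std::map<std::string, std::string> ParseOptions(const std::string& text) {
  std::map<std::string, std::string> out;
  size_t pos = 0;
  while (pos < text.size()) {
    const size_t eq = text.find('=', pos);
    if (eq == std::string::npos) {
      if (!Trim(text.substr(pos)).empty()) {
        throw std::invalid_argument("missing '='");
      }
      break;
    }
    const std::string key = Trim(text.substr(pos, eq - pos));
    if (key.empty()) throw std::invalid_argument("empty option name");
    size_t v = eq + 1;
    while (v < text.size() && std::isspace(static_cast<unsigned char>(text[v]))) ++v;
    size_t end;
    std::string value;
    if (v < text.size() && text[v] == '{') {
      // Find the brace matching the one at position v.
      int depth = 0;
      size_t i = v;
      for (; i < text.size(); ++i) {
        if (text[i] == '{') {
          ++depth;
        } else if (text[i] == '}' && --depth == 0) {
          break;
        }
      }
      if (i == text.size()) throw std::invalid_argument("unbalanced '{'");
      value = Trim(text.substr(v + 1, i - v - 1));
      end = text.find(';', i);
    } else {
      end = text.find(';', v);
      value = Trim(text.substr(
          v, end == std::string::npos ? std::string::npos : end - v));
    }
    out[key] = value;
    if (end == std::string::npos) break;
    pos = end + 1;
  }
  return out;
}
```

Parsing the nested example from the text yields three top-level entries, and parsing the `block_based_table_factory` value a second time yields `block_size` mapped to `4k`.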

Leonidas Galanis

Migrating from LevelDB to RocksDB

Posted January 16, 2015

If you have an existing application that uses LevelDB and would like to migrate to using RocksDB, one problem you need to overcome is mapping the options for LevelDB to proper options for RocksDB. As of release 3.9 this can be done automatically using our option conversion utility found in rocksdb/utilities/leveldb_options.h. First, replace leveldb::Options with rocksdb::LevelDBOptions. Then, use rocksdb::ConvertOptions() to convert the LevelDBOptions struct into appropriate RocksDB options. Here is an example:

Lei Jin

RocksDB 3.5 Release!

Posted September 15, 2014

New RocksDB release - 3.5!

New Features

Add include/utilities/write_batch_with_index.h, providing a utility class to query data out of WriteBatch when building it.
New ReadOptions.total_order_seek to force total order seek when block-based table is built with hash index.

Feng Zhu

New Bloom Filter Format

Posted September 12, 2014

Introduction

In this post, we are introducing “full filter block”, a new bloom filter format for block based table. This could bring about a 40% improvement for key queries under an in-memory workload (all data stored in memory, with files stored in tmpfs/ramfs, for example). The main idea is to generate one big filter that covers all the keys in an SST file, to avoid lots of unnecessary memory lookups.
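For readers new to Bloom filters, here is a minimal sketch of a single filter that covers every key added to it, which is the shape of the "full filter" idea. It is not RocksDB's filter format: `FullFilter` is hypothetical, and its hashing (FNV-1a style with two seeds, combined by double hashing) is chosen for brevity in this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// One Bloom filter for a whole set of keys.
class FullFilter {
 public:
  FullFilter(size_t num_bits, int num_probes)
      : bits_(num_bits == 0 ? 1 : num_bits, false), num_probes_(num_probes) {}

  void Add(const std::string& key) {
    for (int i = 0; i < num_probes_; ++i) bits_[BitIndex(key, i)] = true;
  }

  // False means the key was definitely never added.
  // True means the key may have been added (false positives are possible).
  bool MayContain(const std::string& key) const {
    for (int i = 0; i < num_probes_; ++i) {
      if (!bits_[BitIndex(key, i)]) return false;
    }
    return true;
  }

 private:
  // FNV-1a style hash with a caller-chosen starting value.
  static uint64_t Fnv1a(const std::string& s, uint64_t seed) {
    uint64_t h = seed;
    for (unsigned char c : s) {
      h ^= c;
      h *= 1099511628211ull;
    }
    return h;
  }

  // Double hashing: probe i uses h1 + i * h2.
  size_t BitIndex(const std::string& key, int i) const {
    const uint64_t h1 = Fnv1a(key, 14695981039346656037ull);
    const uint64_t h2 = Fnv1a(key, 0x9e3779b97f4a7c15ull) | 1;
    return static_cast<size_t>((h1 + static_cast<uint64_t>(i) * h2) %
                               bits_.size());
  }

  std::vector<bool> bits_;
  int num_probes_;
};
```

A point lookup consults this one filter per file and skips the file entirely when `MayContain` returns false.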

Radheshyam Balasundaram

Cuckoo Hashing Table Format

Posted September 12, 2014

Introduction

We recently introduced a new Cuckoo Hashing based SST file format which is optimized for fast point lookups. The new format was built for applications which require very high point lookup rates (~4Mqps) in read only mode but do not use operations like range scan, merge operator, etc. But, the existing RocksDB file formats were built to support range scan and other operations and the current best point lookup in RocksDB is 1.2 Mqps given by PlainTable format. This prompted a hashing based file format, which we present here. The new table format uses a cache friendly version of Cuckoo Hashing algorithm with only 1 or 2 memory accesses per lookup.
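The "1 or 2 memory accesses per lookup" property comes from each key having exactly two candidate buckets. Below is a minimal in-memory sketch of that idea. It is not the Cuckoo SST file format: `CuckooTable` is hypothetical, the two bucket choices are derived from `std::hash` purely for illustration, and a failed insert leaves the table in a state where one displaced entry has been dropped, so the caller is expected to rebuild with more buckets.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

class CuckooTable {
 public:
  explicit CuckooTable(size_t num_buckets)
      : slots_(num_buckets == 0 ? 1 : num_buckets) {}

  // Returns false if the key could not be placed within kMaxKicks
  // displacements. In that case a displaced entry has been lost and the
  // table should be rebuilt larger.
  bool Insert(std::string key, std::string value) {
    // Update in place if the key is already stored.
    for (int which = 0; which < 2; ++which) {
      Slot& s = slots_[Bucket(key, which)];
      if (s.used && s.key == key) {
        s.value = std::move(value);
        return true;
      }
    }
    size_t idx = Bucket(key, 0);
    for (int kick = 0; kick < kMaxKicks; ++kick) {
      Slot& s = slots_[idx];
      if (!s.used) {
        s.used = true;
        s.key = std::move(key);
        s.value = std::move(value);
        return true;
      }
      // Take over the slot and move the evicted entry to its other bucket.
      std::swap(key, s.key);
      std::swap(value, s.value);
      const size_t b0 = Bucket(key, 0);
      idx = (b0 == idx) ? Bucket(key, 1) : b0;
    }
    return false;
  }

  // At most two slots are examined per lookup.
  bool Get(const std::string& key, std::string* value) const {
    for (int which = 0; which < 2; ++which) {
      const Slot& s = slots_[Bucket(key, which)];
      if (s.used && s.key == key) {
        *value = s.value;
        return true;
      }
    }
    return false;
  }

 private:
  struct Slot {
    bool used = false;
    std::string key;
    std::string value;
  };
  static const int kMaxKicks = 32;

  // Assumption: the second hash is the std::hash of the key plus a suffix.
  size_t Bucket(const std::string& key, int which) const {
    const std::hash<std::string> hasher;
    return hasher(which == 0 ? key : key + "#1") % slots_.size();
  }

  std::vector<Slot> slots_;
};
```

Because such a table is built once and then only read, an occasional rebuild at construction time is an acceptable price for the bounded lookup cost.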

Yueh-Hsuan Chiang

RocksDB 3.3 Release

Posted July 29, 2014

Check out the new RocksDB release on GitHub!

New Features in RocksDB 3.3:

JSON API prototype.
Performance improvement on HashLinkList: We addressed a performance outlier in HashLinkList caused by skewed buckets by switching the data in a bucket from a linked list to a skip list. Add parameter threshold_use_skiplist in NewHashLinkListRepFactory().

Lei Jin

RocksDB 3.2 release

Posted June 27, 2014

Check out the new RocksDB release on GitHub!

New Features in RocksDB 3.2:

PlainTable now supports a new key encoding: for keys with the same prefix, the prefix is only written once. It can be enabled through the encoding_type parameter of NewPlainTableFactory().
Add AdaptiveTableFactory, which is used to convert a DB from PlainTable to BlockBasedTable, or vice versa. It can be created using NewAdaptiveTableFactory().
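The prefix key encoding mentioned above can be illustrated with a small sketch that stores each prefix once, followed by the suffixes of the keys that share it. This is an assumption-laden illustration: PrefixGroupSketch, EncodePrefixOnce, and the fixed prefix_len are hypothetical names for this example and do not reflect PlainTable's actual byte layout.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration; not PlainTable's on-disk encoding.
struct PrefixGroupSketch {
  std::string prefix;                 // written once per group
  std::vector<std::string> suffixes;  // remainder of each key in the group
};

// Groups sorted keys by a fixed-length prefix so each prefix is kept once.
// Assumes every key is at least prefix_len bytes long.
std::vector<PrefixGroupSketch> EncodePrefixOnce(
    const std::vector<std::string>& sorted_keys, std::size_t prefix_len) {
  std::vector<PrefixGroupSketch> groups;
  for (const std::string& key : sorted_keys) {
    assert(key.size() >= prefix_len);
    const std::string prefix = key.substr(0, prefix_len);
    if (groups.empty() || groups.back().prefix != prefix) {
      groups.push_back(PrefixGroupSketch{prefix, {}});
    }
    groups.back().suffixes.push_back(key.substr(prefix_len));
  }
  return groups;
}
```

For the keys "user1:a", "user1:b", and "user2:a" with a prefix length of 6, the sketch produces two groups, so "user1:" is stored once rather than twice.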

Lei Jin

Avoid Expensive Locks in Get()

Posted June 27, 2014

As promised in the previous blog post!

RocksDB employs a multiversion concurrency control strategy. Before reading data, it needs to grab the current version, which is encapsulated in a data structure called SuperVersion.

Siying Dong

PlainTable — A New File Format

Posted June 23, 2014

In this post, we are introducing “PlainTable” – a file format we designed for RocksDB, initially to satisfy a production use case at Facebook.

Design goals:

All data stored in memory, in files stored in tmpfs/ramfs. Support DBs larger than 100GB (may be sharded across multiple RocksDB instances).
Optimize for prefix hashing
Less than or around 1 microsecond average latency for a single Get() or Seek().
Minimize memory consumption.
Queries efficiently return empty results

Igor Canadi

RocksDB 3.1 release

Posted May 22, 2014

Check out the new release on Github!

New features in RocksDB 3.1:

We released 3.1 so soon after 3.0 because one of our internal customers needed a materialized hash index.

Igor Canadi

RocksDB 3.0 release

Posted May 19, 2014

Check out the new RocksDB release on Github!

New features in RocksDB 3.0:

Column Family support
Ability to choose a different checksum function
Deprecated ReadOptions::prefix_seek and ReadOptions::prefix

Siying Dong

Reducing Lock Contention in RocksDB

Posted May 14, 2014

In this post, we briefly introduce the recent improvements we made to RocksDB to reduce the cost of lock contention.

RocksDB has a simple thread synchronization mechanism (see the RocksDB Architecture Guide to understand the terms used below, such as SST tables or mem tables). SST tables are immutable after being written, and mem tables are lock-free data structures supporting a single writer and multiple readers. There is only one major lock, the DB mutex (DBImpl.mutex_), protecting all the meta operations, including:

Lei Jin

Indexing SST Files for Better Lookup Performance

Posted April 21, 2014

For a Get() request, RocksDB goes through the mutable memtable, the list of immutable memtables, and the SST files to look up the target key. SST files are organized in levels.

On level 0, files are sorted by the time they were flushed. Their key ranges (as defined by FileMetaData.smallest and FileMetaData.largest) mostly overlap with each other, so a lookup needs to check every L0 file.
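The consequence of overlapping key ranges on level 0 can be sketched as follows: since any file's range may contain the target key, every file has to be examined. The struct and function names here (FileMetaSketch, L0Candidates) are hypothetical, and plain std::string comparison stands in for the database's key comparator.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the smallest/largest fields of FileMetaData.
struct FileMetaSketch {
  std::string smallest;
  std::string largest;
};

// Returns the indexes of all L0 files whose key range contains `key`.
// Ranges may overlap, so no file can be skipped without checking it.
std::vector<std::size_t> L0Candidates(
    const std::vector<FileMetaSketch>& l0_files, const std::string& key) {
  std::vector<std::size_t> candidates;
  for (std::size_t i = 0; i < l0_files.size(); ++i) {
    const FileMetaSketch& f = l0_files[i];
    if (f.smallest <= key && key <= f.largest) {
      candidates.push_back(i);
    }
  }
  return candidates;
}
```

With files covering ["a", "m"], ["c", "z"], and ["x", "z"], a lookup for "d" has to consult the first two files, because both ranges contain the key.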

Igor Canadi

RocksDB 2.8 release

Posted April 07, 2014

Check out the new RocksDB 2.8 release on Github.

RocksDB 2.8 is mostly focused on improving performance for in-memory workloads. We are seeing read QPS as high as 5M (we will write a separate blog post on this).

Xing Jin

The 1st RocksDB Local Meetup Held on March 27, 2014

Posted April 02, 2014

On Mar 27, 2014, the RocksDB team @ Facebook held the 1st RocksDB local meetup at FB HQ (Menlo Park, California). We invited around 80 guests from 20+ local companies, including LinkedIn, Twitter, Dropbox, Square, Pinterest, MapR, Microsoft and IBM. In the end, around 50 guests showed up, a show-up rate of around 60%.

Igor Canadi

How to persist in-memory RocksDB database?

Posted March 27, 2014

In recent months, we have focused on optimizing RocksDB for in-memory workloads. With growing RAM sizes and strict low-latency requirements, lots of applications decide to keep their entire data in memory. Running an in-memory database with RocksDB is easy: just mount your RocksDB directory on tmpfs or ramfs [1]. Even if the process crashes, RocksDB can recover all of your data from the in-memory filesystem. However, what happens if the machine reboots?

Igor Canadi

How to backup RocksDB?

Posted March 27, 2014

In RocksDB, we have implemented an easy way to back up your DB. Here is a simple example:

#include "rocksdb/db.h"
#include "utilities/backupable_db.h"
using namespace rocksdb;

DB* db;
DB::Open(Options(), "/tmp/rocksdb", &db);
BackupableDB* backupable_db = new BackupableDB(db, BackupableDBOptions("/tmp/rocksdb_backup"));
backupable_db->Put(...); // do your thing
backupable_db->CreateNewBackup();
delete backupable_db; // no need to also delete db
