<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>RocksDB</title>
    <description>RocksDB is an embeddable persistent key-value store for fast storage.
</description>
    <link>https://rocksdb.org/feed.xml</link>
    <atom:link href="http://rocksdb.org/feed.xml" rel="self" type="application/rss+xml"/>
    <pubDate>Thu, 18 Jun 2026 04:07:56 +0000</pubDate>
    <lastBuildDate>Thu, 18 Jun 2026 04:07:56 +0000</lastBuildDate>
    <generator>Jekyll v3.10.0</generator>
    
      <item>
        <title>Resumable Remote Compaction</title>
        <description>&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;RocksDB can offload compaction work to remote workers through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CompactionService&lt;/code&gt; API. In this model, the &lt;strong&gt;primary RocksDB instance&lt;/strong&gt; selects the input files and sends a serialized &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CompactionServiceInput&lt;/code&gt; to a worker; the &lt;strong&gt;remote worker&lt;/strong&gt; runs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DB::OpenAndCompact()&lt;/code&gt;, writes output SSTs to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output_directory&lt;/code&gt;, and returns a serialized &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CompactionServiceResult&lt;/code&gt; that the primary RocksDB instance installs into its LSM tree. See the &lt;a href=&quot;https://github.com/facebook/rocksdb/wiki/Remote-Compaction&quot;&gt;Remote Compaction wiki&lt;/a&gt; for the full architecture. This lets operators scale compaction throughput with stateless workers while keeping the primary RocksDB instance’s CPU and I/O available for serving reads and writes.&lt;/p&gt;

&lt;p&gt;However, remote compaction jobs can be long-running, sometimes processing hundreds of gigabytes of input. Without resumption, when a worker crashes, gets preempted, or times out, the entire compaction must restart from scratch, wasting all output produced before the interruption and increasing compaction debt on the primary RocksDB instance.&lt;/p&gt;

&lt;h2 id=&quot;how-resumable-remote-compaction-works&quot;&gt;How Resumable Remote Compaction Works&lt;/h2&gt;

&lt;p&gt;Resumable remote compaction introduces a &lt;strong&gt;checkpoint-and-resume&lt;/strong&gt; mechanism. During a compaction, the worker periodically saves its progress to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output_directory&lt;/code&gt;. If the compaction is interrupted, a subsequent call to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OpenAndCompact()&lt;/code&gt; with the same output directory can pick up from the last checkpoint rather than starting over.&lt;/p&gt;

&lt;h3 id=&quot;checkpointing&quot;&gt;Checkpointing&lt;/h3&gt;

&lt;p&gt;After each output SST file is completed, the worker persists a progress checkpoint to a &lt;strong&gt;compaction progress file&lt;/strong&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output_directory&lt;/code&gt;. The checkpoint records which internal key to resume from and the metadata of all completed output files. Progress records use &lt;strong&gt;delta encoding&lt;/strong&gt;: each record contains only the files completed since the last checkpoint, which keeps the total serialization cost linear in the number of output files instead of rewriting the full file list at every checkpoint.&lt;/p&gt;
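
&lt;p&gt;As a rough illustration of the delta encoding, the sketch below rebuilds the latest checkpoint by replaying progress records in order. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ProgressRecord&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ProgressState&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Replay&lt;/code&gt; are hypothetical and do not reflect RocksDB’s actual types or on-disk format.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;string&amp;gt;
#include &amp;lt;vector&amp;gt;

// Hypothetical record, appended after each safe output-file boundary.
struct ProgressRecord {
  std::string resume_key;                   // internal key to resume from
  std::vector&amp;lt;uint64_t&amp;gt; new_file_numbers;  // only files finished since the previous record
};

// Hypothetical in-memory state rebuilt when a compaction resumes.
struct ProgressState {
  std::string resume_key;
  std::vector&amp;lt;uint64_t&amp;gt; completed_files;
};

// Replaying the deltas in order reconstructs the latest checkpoint.
ProgressState Replay(const std::vector&amp;lt;ProgressRecord&amp;gt;&amp;amp; records) {
  ProgressState state;
  for (const ProgressRecord&amp;amp; r : records) {
    state.resume_key = r.resume_key;  // the newest record wins
    state.completed_files.insert(state.completed_files.end(),
                                 r.new_file_numbers.begin(),
                                 r.new_file_numbers.end());
  }
  return state;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Because each record carries only the new files, writing a checkpoint costs work proportional to the files added since the previous one, not to the full list.&lt;/p&gt;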

&lt;p&gt;&lt;img src=&quot;/static/images/resumable-remote-compaction/checkpointing-overview.svg&quot; alt=&quot;Checkpointing overview&quot; /&gt;&lt;/p&gt;

&lt;p&gt;The worker skips checkpointing at boundaries where resuming could be unsafe or would require complicated handling: when range deletions span the file boundary or when adjacent output files share the same user key. These constraints ensure that resuming produces the same results as if the compaction had not been interrupted.&lt;/p&gt;

&lt;h3 id=&quot;resuming&quot;&gt;Resuming&lt;/h3&gt;

&lt;p&gt;When &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OpenAndCompact()&lt;/code&gt; is called with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;allow_resumption = true&lt;/code&gt;, it scans the output directory for a valid progress file. If one is found, it loads the checkpointed state, seeks the input iterator to the recorded resume key, restores the output file state, and continues compaction from that point. If the progress file is missing or corrupted, RocksDB cleans the output directory and falls back to a fresh compaction.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/images/resumable-remote-compaction/resume-flow.svg&quot; alt=&quot;Resume flow&quot; /&gt;&lt;/p&gt;

&lt;h2 id=&quot;how-to-enable-it&quot;&gt;How to Enable It&lt;/h2&gt;

&lt;p&gt;On the primary RocksDB instance, set a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CompactionService&lt;/code&gt; implementation on the DB options. On the remote worker, pass &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;allow_resumption = true&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OpenAndCompactOptions&lt;/code&gt; when calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DB::OpenAndCompact()&lt;/code&gt;. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;output_directory&lt;/code&gt; must be the same across retries for resumption to work—each retry call with the same directory will automatically detect and resume from the previous checkpoint. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REMOTE_COMPACT_RESUMED_BYTES&lt;/code&gt; statistics ticker tracks the total bytes of output files reused from a previous interrupted run, giving visibility into how much work resumption saved.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// Primary RocksDB instance&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DBOptions&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;db_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;db_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compaction_service&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_shared&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MyCompactionService&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Remote worker&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;OpenAndCompactOptions&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;allow_resumption&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Status&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DB&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OpenAndCompact&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;db_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;             &lt;span class=&quot;c1&quot;&gt;// source database path&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;output_directory&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// where output SSTs and progress are stored&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;compaction_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;    &lt;span class=&quot;c1&quot;&gt;// serialized CompactionServiceInput&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;result&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;             &lt;span class=&quot;c1&quot;&gt;// serialized CompactionServiceResult&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;override_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future Work&lt;/h2&gt;

&lt;p&gt;Today this feature targets remote compaction. The same checkpoint-and-resume mechanism could also support &lt;strong&gt;local compaction&lt;/strong&gt; after a crash. The core persistence and resume logic is already in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CompactionJob&lt;/code&gt;; the remaining work is to integrate it with local compaction scheduling and recovery.&lt;/p&gt;
</description>
        <pubDate>Tue, 19 May 2026 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2026/05/19/resumable-remote-compaction.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2026/05/19/resumable-remote-compaction.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Interpolation search for SST index blocks</title>
        <description>&lt;p&gt;For workloads with uniformly distributed keys, RocksDB now supports &lt;strong&gt;interpolation search&lt;/strong&gt; for SST index blocks as an alternative to the default binary search.&lt;/p&gt;

&lt;h2 id=&quot;the-idea&quot;&gt;The idea&lt;/h2&gt;

&lt;p&gt;Binary search always splits the remaining range in half:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;mid = low + (high - low) / 2
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;That’s Θ(log n) probes regardless of the data. Interpolation search instead estimates where the target should land based on its value relative to the current boundaries:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;probe = low + (target - key[low]) * (high - low) / (key[high] - key[low])
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On uniformly distributed keys, that’s expected O(log log n) probes. The canonical example: for an index block with restart keys &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0, 1, 2, ..., 1023&lt;/code&gt; and a seek for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;900&lt;/code&gt;, binary search needs about 10 hops; interpolation search lands on it in 1.&lt;/p&gt;

&lt;p&gt;The catch is that pure interpolation search degrades to O(n) on badly skewed data.&lt;/p&gt;
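
&lt;p&gt;To make the probe formula concrete, here is a minimal, self-contained sketch over a sorted array of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uint64_t&lt;/code&gt; values. It is not RocksDB’s implementation: the function name, the probe counter, and the choice to take a binary-search step when the denominator is zero or the multiplication could overflow are all assumptions made for illustration.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdio&amp;gt;
#include &amp;lt;vector&amp;gt;

// Returns the index of target in the sorted vector, or -1 if it is absent.
// *probes receives the number of positions examined.
std::ptrdiff_t InterpolationSearch(const std::vector&amp;lt;uint64_t&amp;gt;&amp;amp; keys,
                                   uint64_t target, int* probes) {
  *probes = 0;
  if (keys.empty()) return -1;
  size_t low = 0;
  size_t high = keys.size() - 1;
  while (low &amp;lt;= high &amp;amp;&amp;amp; keys[low] &amp;lt;= target &amp;amp;&amp;amp; target &amp;lt;= keys[high]) {
    const uint64_t delta = target - keys[low];
    const uint64_t range = keys[high] - keys[low];
    const uint64_t span = high - low;
    size_t pos;
    if (range == 0 || (delta != 0 &amp;amp;&amp;amp; span &amp;gt; UINT64_MAX / delta)) {
      // Zero denominator or possible overflow: take a binary-search step.
      pos = low + static_cast&amp;lt;size_t&amp;gt;(span / 2);
    } else {
      pos = low + static_cast&amp;lt;size_t&amp;gt;(delta * span / range);
    }
    ++*probes;
    if (keys[pos] == target) return static_cast&amp;lt;std::ptrdiff_t&amp;gt;(pos);
    if (keys[pos] &amp;lt; target) {
      low = pos + 1;
    } else {
      high = pos - 1;  // pos &amp;gt; low here, so this cannot underflow
    }
  }
  return -1;
}

int main() {
  std::vector&amp;lt;uint64_t&amp;gt; keys(1024);
  for (size_t i = 0; i &amp;lt; keys.size(); ++i) keys[i] = i;  // 0, 1, ..., 1023
  int probes = 0;
  std::ptrdiff_t idx = InterpolationSearch(keys, 900, &amp;amp;probes);
  std::printf(&quot;index=%td probes=%d\n&quot;, idx, probes);  // index=900 probes=1
  return 0;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;On this perfectly uniform input the first probe is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0 + 900 * 1023 / 1023 = 900&lt;/code&gt;, which is the target itself.&lt;/p&gt;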

&lt;h2 id=&quot;turning-a-key-into-a-number&quot;&gt;Turning a key into a number&lt;/h2&gt;

&lt;p&gt;The interpolation formula needs numeric values, but index keys are variable-length byte slices. RocksDB extracts a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uint64_t&lt;/code&gt; per key by reading the first 8 bytes after the common prefix shared by the block’s boundary keys, in big-endian, and zero-pads to the right if the remaining bytes are too short.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kr&quot;&gt;inline&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;ReadBe64FromKey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Slice&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_user_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// ... strip internal seq/type bytes if needed ...&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;sizeof&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;port&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kLittleEndian&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;?&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;EndianSwapValue&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;val&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// pad short tails with zeros on the right (preserves bytewise order)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Big-endian encoding with zero-padding preserves bytewise ordering, so the linear interpolation formula stays consistent with the comparator. This is also why the feature requires &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BytewiseComparator&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Two distinct keys can still collapse to the same &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uint64_t&lt;/code&gt; when they differ only beyond the first 8 non-shared bytes. If the boundary keys of the current range collapse this way, the denominator of the interpolation formula is zero, so we simply fall back to binary search in that case to avoid a divide-by-zero.&lt;/p&gt;
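
&lt;p&gt;The conversion, including the zero-padded short tail, can be sketched in a portable form as follows. This is an illustration rather than the RocksDB code: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;KeyToU64&lt;/code&gt; is a hypothetical helper that takes a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::string_view&lt;/code&gt; instead of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Slice&lt;/code&gt; and does not strip internal sequence/type bytes.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;string_view&amp;gt;

// Reads up to 8 bytes of key, starting at offset, as a big-endian integer.
// Bytes past the end of the key are treated as zero (right padding).
uint64_t KeyToU64(std::string_view key, size_t offset) {
  uint64_t value = 0;
  for (size_t i = 0; i &amp;lt; 8; ++i) {
    value &amp;lt;&amp;lt;= 8;
    if (offset + i &amp;lt; key.size()) {
      value |= static_cast&amp;lt;unsigned char&amp;gt;(key[offset + i]);
    }
  }
  return value;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With this helper, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;KeyToU64(&quot;ab&quot;, 0)&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0x6162000000000000&lt;/code&gt;, while &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;abcdefgh1&quot;&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&quot;abcdefgh2&quot;&lt;/code&gt; map to the same value because they differ only in the ninth byte.&lt;/p&gt;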

&lt;h2 id=&quot;how-to-enable-it&quot;&gt;How to enable it&lt;/h2&gt;

&lt;p&gt;To force interpolation search on every index block:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;rocksdb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BlockBasedTableOptions&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;table_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index_block_search_type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;rocksdb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BlockBasedTableOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kInterpolation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;kauto-per-block-selection&quot;&gt;kAuto: per-block selection&lt;/h2&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kAuto&lt;/code&gt; is the recommended way to use the feature. It chooses the search algorithm for each index block automatically, based on a uniformity hint written into the block footer at SST construction time:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;table_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index_block_search_type&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;rocksdb&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;BlockBasedTableOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kAuto&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;table_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;uniform_cv_threshold&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;0.2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;When &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uniform_cv_threshold &amp;gt;= 0&lt;/code&gt;, the SST writer scans each index block’s restart keys and computes the &lt;strong&gt;coefficient of variation (CV)&lt;/strong&gt; of the gaps between consecutive numeric key values:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;gap[i] = key_value[i + 1] - key_value[i]
CV     = stddev(gap) / mean(gap)
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;A lower CV means the gaps are more uniform, and the more likely it is that interpolation search will outperform binary search. The CV is computed incrementally with Welford’s online algorithm, so the scan is one pass over the restart points.&lt;/p&gt;
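
&lt;p&gt;A one-pass computation of this statistic might look like the sketch below. It is not the RocksDB code: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GapCV&lt;/code&gt; is a hypothetical name, the use of the population (rather than sample) standard deviation is an assumption, and so is returning infinity when there are no gaps or the mean gap is zero.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;#include &amp;lt;cmath&amp;gt;
#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;limits&amp;gt;
#include &amp;lt;vector&amp;gt;

// Coefficient of variation of the gaps between consecutive sorted values,
// computed in a single pass with Welford&apos;s online algorithm.
double GapCV(const std::vector&amp;lt;uint64_t&amp;gt;&amp;amp; values) {
  size_t n = 0;       // number of gaps seen so far
  double mean = 0.0;  // running mean of the gaps
  double m2 = 0.0;    // running sum of squared deviations from the mean
  for (size_t i = 0; i + 1 &amp;lt; values.size(); ++i) {
    const double gap = static_cast&amp;lt;double&amp;gt;(values[i + 1] - values[i]);
    ++n;
    const double delta = gap - mean;
    mean += delta / static_cast&amp;lt;double&amp;gt;(n);
    m2 += delta * (gap - mean);
  }
  if (n == 0 || mean == 0.0) {
    // Assumption of this sketch: treat degenerate input as non-uniform.
    return std::numeric_limits&amp;lt;double&amp;gt;::infinity();
  }
  return std::sqrt(m2 / static_cast&amp;lt;double&amp;gt;(n)) / mean;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Evenly spaced values such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;0, 10, 20, 30&lt;/code&gt; have identical gaps and therefore a CV of 0, which falls below any positive threshold.&lt;/p&gt;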

&lt;p&gt;If &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CV &amp;lt; uniform_cv_threshold&lt;/code&gt;, RocksDB sets an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_uniform&lt;/code&gt; bit in the block footer. At read time, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kAuto&lt;/code&gt; resolves to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kInterpolation&lt;/code&gt; only when that bit is set &lt;em&gt;and&lt;/em&gt; the comparator is bytewise; otherwise it uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kBinary&lt;/code&gt;.&lt;/p&gt;

&lt;h3 id=&quot;write-overhead&quot;&gt;Write overhead&lt;/h3&gt;

&lt;p&gt;Computing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_uniform&lt;/code&gt; bit is cheap because it is computed only for the index blocks of an SST file. CPU profiling of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db_bench -benchmarks=fillseq,compact -compression_type=none -disable_wal=1&lt;/code&gt; attributes only ~0.08% of write-path CPU to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ScanForUniformity&lt;/code&gt;.&lt;/p&gt;

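&lt;p&gt;After a few more releases, we plan to enable &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kAuto&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uniform_cv_threshold&lt;/code&gt; by default.&lt;/p&gt;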

&lt;h2 id=&quot;benchmarks&quot;&gt;Benchmarks&lt;/h2&gt;

&lt;p&gt;Setup — populate a DB and force a single-level shape so all reads hit the same index structure, then measure point-read throughput:&lt;/p&gt;

&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;# Build a release binary
make clean &amp;amp;&amp;amp; DEBUG_LEVEL=0 make db_bench

# Load + compact, varying the index_shortening_mode
./db_bench -benchmarks=fillrandom,compact \
           -index_shortening_mode=1

# Then point-read against the populated DB
./db_bench -use_existing_db=true -benchmarks=readrandom \
           -index_block_search_type=binary_search   # or interpolation_search / auto_search
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;index_shortening_mode=1&lt;/code&gt; (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kShortenSeparators&lt;/code&gt;) keeps the file’s last index key intact, which preserves a roughly uniform numeric distribution for the benchmark.&lt;/p&gt;

&lt;p&gt;Results, averaged over multiple runs:&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Mode&lt;/th&gt;
      &lt;th&gt;ops/s&lt;/th&gt;
      &lt;th&gt;vs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binary_search&lt;/code&gt;&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binary_search&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;335,749&lt;/td&gt;
      &lt;td&gt;baseline&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;interpolation_search&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;366,598&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;+9.2%&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;auto_search&lt;/code&gt;&lt;/td&gt;
      &lt;td&gt;366,832&lt;/td&gt;
      &lt;td&gt;&lt;strong&gt;+9.3%&lt;/strong&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;h2 id=&quot;compatibility&quot;&gt;Compatibility&lt;/h2&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_uniform&lt;/code&gt; bit reuses a previously-reserved bit in the data block footer. SSTs written by older RocksDB never set it and decode as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_uniform = false&lt;/code&gt;, so they read with binary search under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kAuto&lt;/code&gt;. However, once the bit is set, RocksDB versions older than 11.0.0 will report a corruption error when they read the block.&lt;/p&gt;

&lt;h2 id=&quot;future-work&quot;&gt;Future work&lt;/h2&gt;

&lt;p&gt;Future opportunities include extending interpolation search to data blocks and supporting other comparators such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ReverseBytewiseComparator&lt;/code&gt;.&lt;/p&gt;
</description>
        <pubDate>Mon, 04 May 2026 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2026/05/04/interpolation-search.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2026/05/04/interpolation-search.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>RocksDB development finds a CPU bug</title>
        <description>&lt;p&gt;This is the story of how a RocksDB unit test I added four years ago, a mini-stress test you might call it, revealed &lt;a href=&quot;https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7055.html&quot;&gt;a novel hardware bug in a newer CPU&lt;/a&gt;. It was scary enough to be assigned a “high severity” CVE.&lt;/p&gt;

&lt;h2 id=&quot;background-unique-identifiers&quot;&gt;Background: Unique Identifiers&lt;/h2&gt;
&lt;p&gt;About four years ago, we &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/9126&quot;&gt;added unique identifiers to SST files&lt;/a&gt; to give them stable identifiers across different filesystems for caching purposes. Part of the motivation here was to eliminate our dependence on the uniqueness and non-recycling of unique identifiers on files provided by the OS filesystem. (Some filesystems were only &lt;a href=&quot;https://github.com/facebook/rocksdb/issues/7405#issuecomment-694595587&quot;&gt;guaranteeing uniqueness among existing files, not among all files even in recent history&lt;/a&gt;.) I would call this dependency problem the &lt;em&gt;great tension&lt;/em&gt; between reusing existing solutions and code self-reliance. You don’t want to duplicate others’ work but you also don’t want to be subject to their bugs or changing / misaligned requirements. Striking this balance can be tricky, but in this case it was clear to us that we didn’t want to rely on all the possible filesystems providing quality unique identifiers.&lt;/p&gt;

&lt;p&gt;If you’re comfortable with large random numbers (e.g. 128 bits), you probably agree that persisting random identifiers (or &lt;a href=&quot;https://github.com/pdillinger/unique_id/blob/main/README.md&quot;&gt;quasi-random&lt;/a&gt;, which &lt;a href=&quot;https://dl.acm.org/doi/10.1145/3584372.3588674&quot;&gt;I helped formalize in a paper&lt;/a&gt;, &lt;a href=&quot;https://arxiv.org/abs/2304.07109&quot;&gt;also on arXiv&lt;/a&gt;) with each file would be safer and more predictable than relying so crucially on a minor feature of OS filesystems.&lt;/p&gt;

&lt;h2 id=&quot;high-quality-randomness&quot;&gt;High Quality Randomness&lt;/h2&gt;
&lt;p&gt;However, that assumes we have access to &lt;em&gt;high quality&lt;/em&gt; random numbers (at least a good one or two to start from - see the paper). Because RocksDB intends to be cross-platform, we want to minimize platform-specific dependencies and prefer cross-platform dependencies. But that could easily land us back where we didn’t want to be: susceptible to a bug or hiccup in one implementation of what we needed.&lt;/p&gt;

&lt;p&gt;Fortunately, the nature of random entropy allows &lt;em&gt;combining&lt;/em&gt; sources so that your result is as good as your &lt;em&gt;best&lt;/em&gt; input source, so even if one is bad, you only have a problem if they’re all bad. And we had the advantages that (a) we only needed uniqueness, not security, which reduced the need for extra scrutiny and allowed us to use the quasi-random approach, and (b) the quasi-random approach minimized the amount of entropy needed, so the performance cost of acquiring each unit of entropy was almost inconsequential. Therefore, I combined these sources of entropy:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;C++11’s &lt;a href=&quot;https://en.cppreference.com/w/cpp/numeric/random/random_device.html&quot;&gt;std::random_device&lt;/a&gt;, which is supposed to provide high-quality randomness but is allowed not to.&lt;/li&gt;
  &lt;li&gt;A hash of various environment parameters including hostname, process id, thread id, and various macro and micro time readings.&lt;/li&gt;
  &lt;li&gt;A platform-specific UUID generator (Linux and Windows only).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;trust-but-verify&quot;&gt;Trust But Verify&lt;/h2&gt;
&lt;p&gt;To verify the quality of each of these sources on an ongoing basis, &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/8708&quot;&gt;I added unit tests&lt;/a&gt; that used many threads to create thousands of unique identifiers based on one of the above sources at a time and verified their uniqueness. For a high quality source, the probability of any duplicate 128-bit IDs among thousands is negligible, even if running these tests continuously for decades.&lt;/p&gt;
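
&lt;p&gt;The shape of such a test can be sketched as follows. This is a simplified illustration, not the actual RocksDB unit test: the helper names, the use of one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::random_device&lt;/code&gt; per thread, and the way 128-bit IDs are assembled from 32-bit draws are all assumptions.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;#include &amp;lt;cstddef&amp;gt;
#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;random&amp;gt;
#include &amp;lt;set&amp;gt;
#include &amp;lt;thread&amp;gt;
#include &amp;lt;utility&amp;gt;
#include &amp;lt;vector&amp;gt;

using Id128 = std::pair&amp;lt;uint64_t, uint64_t&amp;gt;;

// Combines two draws from the device, keeping 32 bits of each.
uint64_t Next64(std::random_device&amp;amp; rd) {
  const uint64_t hi = rd() &amp;amp; 0xffffffffu;
  const uint64_t lo = rd() &amp;amp; 0xffffffffu;
  return (hi &amp;lt;&amp;lt; 32) | lo;
}

// Generates IDs on several threads and returns how many distinct IDs resulted.
size_t CountUniqueIds(int num_threads, int ids_per_thread) {
  std::vector&amp;lt;std::vector&amp;lt;Id128&amp;gt;&amp;gt; per_thread(num_threads);
  std::vector&amp;lt;std::thread&amp;gt; threads;
  for (int t = 0; t &amp;lt; num_threads; ++t) {
    threads.emplace_back([&amp;amp;per_thread, t, ids_per_thread] {
      std::random_device rd;  // each thread writes only to its own slot
      for (int i = 0; i &amp;lt; ids_per_thread; ++i) {
        const uint64_t a = Next64(rd);
        const uint64_t b = Next64(rd);
        per_thread[t].emplace_back(a, b);
      }
    });
  }
  for (std::thread&amp;amp; th : threads) th.join();

  std::set&amp;lt;Id128&amp;gt; unique;
  for (const auto&amp;amp; ids : per_thread) unique.insert(ids.begin(), ids.end());
  return unique.size();
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;With a healthy entropy source, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CountUniqueIds(t, n)&lt;/code&gt; should return exactly &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;t * n&lt;/code&gt;; any shortfall means duplicate IDs were generated.&lt;/p&gt;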

&lt;h2 id=&quot;thats-weird&quot;&gt;That’s Weird&lt;/h2&gt;
&lt;p&gt;That was pretty much the story until some months ago the test based on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::random_device&lt;/code&gt; failed, once. It was quite suspicious because the number of unique IDs was not just one short of expectation, it was dozens or hundreds short. However, even that could be explained by a random CPU hiccup or bit flip in which we generated fewer IDs to begin with. (You might have noticed an increasing amount of RocksDB development effort and portion of CPU time going into checks that are logically redundant but exist to detect CPU miscalculations before the corruption propagates too far.)&lt;/p&gt;

&lt;p&gt;But then it failed again about a month later. No failures for four years, then two failures in two months. This smelled really bad. Digging into the details, I noticed a crucial correlation: both of the failed test jobs had run on the same type of hardware, though in completely different data centers.&lt;/p&gt;

&lt;p&gt;From there I did the natural thing for an engineer: scale it up to try to reproduce the failure. And that was remarkably easy. With the number of threads in the job increased to around the number of cores, the test failed quickly and consistently on all systems using the same type of newer CPU, and passed on everything else. I tested some variants of this to establish more details, including:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::random_device&lt;/code&gt; was not affected when using the “rdrand” or “/dev/urandom” sources, and&lt;/li&gt;
  &lt;li&gt;libc++ (from clang) was not affected, only libstdc++ (from GCC).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;root-cause-analysis&quot;&gt;Root Cause Analysis&lt;/h2&gt;
&lt;p&gt;From there, Meta colleagues investigated the low-level details. They found the problem to be that the RDSEED instruction on this type of processor would return 0 and “success” much more often than would be expected by chance, but only on some cores and only under “complex micro-architectural conditions reproducible under memory-load,” as a colleague describes it. A mitigating Linux kernel patch was developed to signal that RDSEED was unavailable on these processors, with the intention of rolling it out internally at Meta to avoid problems until a fix came from the OEM. &lt;a href=&quot;https://www.amd.com/en/resources/product-security/bulletin/amd-sb-7055.html&quot;&gt;AMD quickly acknowledged the issue and announced planned mitigation&lt;/a&gt;, including a CPU microcode update.&lt;/p&gt;

&lt;h2 id=&quot;with-apologies&quot;&gt;With Apologies&lt;/h2&gt;
&lt;p&gt;Although I worked to keep the information confidential until the OEM publicly acknowledged the issue, the uncoordinated disclosure via the Linux mailing list was due to zealous remediation efforts that crossed multiple infrastructure teams at Meta. We regret the mistake and are working to improve controls on the processes that failed to coordinate with the OEM first.&lt;/p&gt;

&lt;h2 id=&quot;key-takeaways&quot;&gt;Key Takeaways&lt;/h2&gt;
&lt;ul&gt;
  &lt;li&gt;Test what you depend on.&lt;/li&gt;
  &lt;li&gt;Have redundancies and/or sanity checks for what you depend on.&lt;/li&gt;
  &lt;li&gt;Even CPUs can have bugs, usually flaky individual units but occasionally a bug affecting all units.&lt;/li&gt;
&lt;/ul&gt;
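
&lt;p&gt;One simple form of redundancy can be sketched as follows, assuming one 64-bit sample is taken from each of three independent sources and the samples are combined with XOR. The function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CombineSamples&lt;/code&gt; and its parameters are hypothetical, and this is an illustration of the idea rather than a description of how RocksDB combines its sources.&lt;/p&gt;

<![CDATA[
```cpp
#include <cstdint>

// XOR-combines one 64-bit sample from each of three sources. If at least one
// sample is uniformly random and independent of the others, the result is
// uniformly random too. Caveat: correlated sources could cancel each other
// out, so this sketch relies on the independence assumption.
constexpr uint64_t CombineSamples(uint64_t from_random_device,
                                  uint64_t from_env_hash,
                                  uint64_t from_platform_uuid) {
  return from_random_device ^ from_env_hash ^ from_platform_uuid;
}

// A source stuck at 0 contributes nothing, but the other samples survive.
static_assert(CombineSamples(0, 0x1234, 0xABCD0000) == 0xABCD1234,
              "a stuck source must not erase the other samples");
```
]]>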
</description>
        <pubDate>Tue, 17 Feb 2026 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2026/02/17/cpu-bug.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2026/02/17/cpu-bug.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>BitFields API: Type-Safe Bit Packing for Lock-Free Data Structures</title>
        <description>&lt;p&gt;Modern concurrent data structures increasingly rely on &lt;a href=&quot;https://en.cppreference.com/w/cpp/atomic/atomic&quot;&gt;atomic operations&lt;/a&gt; to avoid the overhead of locking. A valuable but under-utilized technique for maximizing the effectiveness of atomic operations is &lt;a href=&quot;https://en.wikipedia.org/wiki/Bit_field&quot;&gt;bit packing&lt;/a&gt;—fitting multiple logical fields into a single atomic variable for algorithmic simplicity and efficiency. However, language support for bit packing does not guarantee dense packing, and manually managing bit manipulation quickly becomes error-prone, especially when dealing with complex state machines.&lt;/p&gt;

&lt;p&gt;To address this in RocksDB, we have developed a reusable &lt;strong&gt;BitFields API&lt;/strong&gt;, a type-safe, zero-overhead abstraction for bit packing in C++. This works in conjunction with clean wrappers for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt; for powerful and relatively safe bit-packing of atomic data. For broader use, a &lt;a href=&quot;https://github.com/facebook/folly/pull/2549&quot;&gt;variant of the code&lt;/a&gt; has been proposed for inclusion in folly.&lt;/p&gt;

&lt;h2 id=&quot;the-problem-managing-packed-bit-fields&quot;&gt;The Problem: Managing Packed Bit Fields&lt;/h2&gt;

&lt;p&gt;Consider HyperClockCache, an essentially lock-free cache implementation in RocksDB, which was &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/14154&quot;&gt;refactored to use this BitFields API&lt;/a&gt;. It is a hash table built on &lt;em&gt;slots&lt;/em&gt; that can each hold a cache entry and relevant metadata. For atomic simplicity and efficiency, all the essential metadata for each slot is packed into a single 64-bit value:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The reference count and eviction metadata are together encoded into &lt;em&gt;acquire&lt;/em&gt; and &lt;em&gt;release&lt;/em&gt; counters, 30 bits each.&lt;/li&gt;
  &lt;li&gt;The possible states of {&lt;em&gt;empty&lt;/em&gt;, &lt;em&gt;under construction/destruction&lt;/em&gt;, &lt;em&gt;occupied+visible&lt;/em&gt;, and &lt;em&gt;occupied+invisible&lt;/em&gt;} are encoded into three state bits (instead of two, for easier decoding and manipulation).&lt;/li&gt;
  &lt;li&gt;A &lt;em&gt;hit&lt;/em&gt; bit is used for secondary cache integration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Traditionally, you might write code like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// Old approach: manual bit manipulation&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kAcquireCounterShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kReleaseCounterShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kCounterMask&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x3FFFFFFF&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kHitBitShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;60&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kOccupiedShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;61&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kShareableShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;62&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kVisibleShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;63&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;constexpr&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kStateShift&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kOccupiedShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;atomic&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;IsUnderConstruction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kOccupiedShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kShareableShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Getting fields&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory_order_acquire&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IsUnderConstruction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kVisibleShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;refcount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
      &lt;span class=&quot;k&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kAcquireCounterShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;
                             &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kReleaseCounterShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kCounterMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;


&lt;span class=&quot;c1&quot;&gt;// Setting fields&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Set the hit bit (relaxed)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fetch_or&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kHitBitShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory_order_relaxed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Set both counters to `new_count` (as in eviction processing)&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory_order_relaxed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kHitBitShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kStateShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kReleaseCounterShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;|&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_count&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kAcquireCounterShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;success&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compare_exchange_strong&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                                             &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory_order_acq_rel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Increment acquire counter by initial_countdown&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fetch_add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kAcquireCounterShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;initial_countdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory_order_acq_rel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;This approach has several problems:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;Error-prone&lt;/strong&gt;: It is easy to get masks and shifts wrong.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Maintenance burden&lt;/strong&gt;: Changes to field sizes require updating multiple constants.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Abstraction challenges&lt;/strong&gt;: Even with a full set of well-tested getters and setters to hide the details, those details leak back in when you need to do things like update multiple fields in one non-CAS (compare-and-swap) atomic operation.&lt;/li&gt;
&lt;/ol&gt;
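
&lt;p&gt;To make the last point concrete, here is a minimal sketch of a multi-field update done with a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch_add&lt;/code&gt; on the manually packed layout. The helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetAcquireCounter&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetReleaseCounter&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BumpBothCounters&lt;/code&gt; are hypothetical. Note that the caller still has to know the shifts and has to promise that neither counter overflows.&lt;/p&gt;

<![CDATA[
```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

constexpr int kAcquireCounterShift = 0;
constexpr int kReleaseCounterShift = 30;
constexpr uint64_t kCounterMask = 0x3FFFFFFF;

constexpr uint64_t GetAcquireCounter(uint64_t meta) {
  return (meta >> kAcquireCounterShift) & kCounterMask;
}

constexpr uint64_t GetReleaseCounter(uint64_t meta) {
  return (meta >> kReleaseCounterShift) & kCounterMask;
}

// Hypothetical helper: increments both 30-bit counters with one atomic add
// (no compare-and-swap loop) and returns the previous packed value.
inline uint64_t BumpBothCounters(std::atomic<uint64_t>& meta) {
  constexpr uint64_t kDelta = (uint64_t{1} << kAcquireCounterShift) |
                              (uint64_t{1} << kReleaseCounterShift);
  uint64_t old_meta = meta.fetch_add(kDelta, std::memory_order_acq_rel);
  // Debug-only check of the no-overflow assumption: a carry out of either
  // counter would silently corrupt the neighboring bits.
  assert(GetAcquireCounter(old_meta) < kCounterMask);
  assert(GetReleaseCounter(old_meta) < kCounterMask);
  return old_meta;
}
```
]]>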

&lt;h2 id=&quot;new-solution-bitfields-api&quot;&gt;New Solution: BitFields API&lt;/h2&gt;

&lt;p&gt;The BitFields API provides a declarative, type-safe way to define bit-packed structures. Here’s how the same example looks with BitFields:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// New approach: declarative bit fields. (Each field must reference the&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// previous, so that the declaration machinery is simply stateless.)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BitFields&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AcquireCounter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UnsignedBitField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;NoPrevBitField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ReleaseCounter&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UnsignedBitField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;30&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;AcquireCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HitFlag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BoolBitField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ReleaseCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OccupiedFlag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BoolBitField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HitFlag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ShareableFlag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BoolBitField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OccupiedFlag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;VisibleFlag&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;BoolBitField&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ShareableFlag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

  &lt;span class=&quot;c1&quot;&gt;// Convenience helpers&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IsUnderConstruction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OccupiedFlag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ShareableFlag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;BitFieldsAtomic&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Getting fields&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Load&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IsUnderConstruction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;VisibleFlag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kt&quot;&gt;uint32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;refcount&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AcquireCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;
                      &lt;span class=&quot;n&quot;&gt;state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ReleaseCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// ...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Setting fields&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Set the hit bit (relaxed)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ApplyRelaxed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;HitFlag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SetTransform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Set both counters to `new_count` (as in eviction processing)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LoadRelaxed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ReleaseCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Set&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AcquireCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_count&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CasStrongRelaxed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Increment acquire counter by initial_countdown&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_acquire&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AcquireCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PlusTransformPromiseNoOverflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;initial_countdown&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_acquire&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Bonus: Atomic multi-field updates without compare-exchange&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AcquireCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PlusTransformPromiseNoOverflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
                 &lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ReleaseCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PlusTransformPromiseNoOverflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;key-features&quot;&gt;Key Features&lt;/h2&gt;

&lt;h3 id=&quot;type-safety-and-self-documentation&quot;&gt;Type Safety and Self-Documentation&lt;/h3&gt;

&lt;p&gt;Each field has a specific type (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bool&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BoolBitField&lt;/code&gt;, appropriately-sized unsigned int for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UnsignedBitField&lt;/code&gt;) and clear semantic meaning. The field definitions are self-documenting: you can immediately see how many bits each field occupies and in what order.&lt;/p&gt;

&lt;h3 id=&quot;zero-overhead&quot;&gt;&lt;a href=&quot;https://en.cppreference.com/w/cpp/language/Zero-overhead_principle&quot;&gt;Zero Overhead&lt;/a&gt;&lt;/h3&gt;

&lt;p&gt;Because of heavy use of templates and constexpr operations, and because multiple field reads or writes can be satisfied by a single atomic operation, we have seen no runtime overhead in RocksDB compared with hand-written bit manipulation. In one case, we verified that the generated assembly code was identical.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/facebook/folly/pull/2550&quot;&gt;For folly’s LifoSem&lt;/a&gt;, there was one case where an optimization hack that relies on detected overflow from one field into another couldn’t be replicated as efficiently with the BitFields API, because it would violate overflow checking. For that case, I dove into the underlying representation to bypass the BitFields overflow check.&lt;/p&gt;

&lt;h3 id=&quot;atomic-operations-with-transforms&quot;&gt;Atomic Operations with Transforms&lt;/h3&gt;

&lt;p&gt;One of the most powerful features is the ability to combine multiple field updates into a single atomic operation using “transforms”, if they are all either (a) some combination of addition and subtraction, (b) bitwise-and, or (c) bitwise-or. For example:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// Clear several but not all fields atomically&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;and_transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Field1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AndTransform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
                 &lt;span class=&quot;n&quot;&gt;Field2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ClearTransform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
                 &lt;span class=&quot;n&quot;&gt;Field4&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ClearTransform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;atomic_bitfields&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;and_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Set more than one boolean field atomically&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;or_transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Field2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SetTransform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
                 &lt;span class=&quot;n&quot;&gt;Field4&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SetTransform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;atomic_bitfields&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;or_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;add_transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Field1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PlusTransformPromiseNoOverflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;
                     &lt;span class=&quot;n&quot;&gt;Field3&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MinusTransformPromiseNoUnderflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;atomic_bitfields&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add_transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Apply()&lt;/code&gt; generates a single atomic operation (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch_add&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch_or&lt;/code&gt;) that updates all the specified fields, and optionally returns both the old and new values. This enables a number of hacks for atomic updates without CAS.&lt;/p&gt;
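
&lt;p&gt;To make the single-operation claim concrete, here is a minimal sketch that uses a plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt; rather than the BitFields API. The two-field layout and the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kShiftA&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kShiftB&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetA&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetB&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IncADecB&lt;/code&gt; are assumptions invented for this illustration and are not taken from RocksDB. The sketch only shows why an addition on one field and a subtraction on another can fold into one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch_add&lt;/code&gt;.&lt;/p&gt;

<![CDATA[
```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout for illustration only (not RocksDB's SlotMeta):
// bits 0..15 hold counter A, bits 16..31 hold counter B.
constexpr int kShiftA = 0;
constexpr int kShiftB = 16;
constexpr uint64_t kFieldMask = 0xFFFF;

inline uint64_t GetA(uint64_t v) { return (v >> kShiftA) & kFieldMask; }
inline uint64_t GetB(uint64_t v) { return (v >> kShiftB) & kFieldMask; }

// Adds 1 to A and subtracts 1 from B in one atomic read-modify-write.
// Unsigned wraparound lets "+1 at A" and "-1 at B" combine into one addend.
// The caller must guarantee that A does not overflow and B does not
// underflow; otherwise a carry or borrow would corrupt a neighboring field.
// Returns the packed value from before the update.
inline uint64_t IncADecB(std::atomic<uint64_t>& packed) {
  constexpr uint64_t kDelta =
      (uint64_t{1} << kShiftA) - (uint64_t{1} << kShiftB);
  return packed.fetch_add(kDelta, std::memory_order_acq_rel);
}
```
]]>

&lt;p&gt;The combined addend is a compile-time constant, so the update costs the same as incrementing a single counter. The no-overflow and no-underflow promises are what keep a carry or borrow from leaking into the adjacent field.&lt;/p&gt;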

&lt;h3 id=&quot;overflow-protection&quot;&gt;Overflow Protection&lt;/h3&gt;

&lt;p&gt;The API includes built-in overflow detection in debug builds:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// An assertion will fail in debug builds if the counter overflows&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PlusTransformPromiseNoOverflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;atomic&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;For fields at the top of the underlying representation (where overflow doesn’t affect other fields), overflow is explicitly ignored. (A compile-time error is generated if you try to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PlusTransformPromiseNoOverflow&lt;/code&gt; on a field at the top of the representation or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PlusTransformIgnoreOverflow&lt;/code&gt; on a field not at the top of the representation.)&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// For wraparound counters&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PlusTransformIgnoreOverflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
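
&lt;p&gt;The reason overflow is harmless only at the top of the representation can be sketched with a plain &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt;. The layout below (a 16-bit counter in the top bits of a 64-bit word) and the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kTopShift&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetTop&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetLow&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;AddTopWrapping&lt;/code&gt; are hypothetical and chosen only for this example; this is not how the BitFields API itself is written.&lt;/p&gt;

<![CDATA[
```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout (illustration only): the low 48 bits hold other fields,
// and a 16-bit wraparound counter occupies the top 16 bits of the word.
constexpr int kTopShift = 48;
constexpr uint64_t kLowMask = (uint64_t{1} << kTopShift) - 1;

inline uint64_t GetTop(uint64_t v) { return v >> kTopShift; }
inline uint64_t GetLow(uint64_t v) { return v & kLowMask; }

// Adds n to the top field and returns the previous packed value.
// If the counter wraps, the carry falls off the end of the 64-bit word,
// so the lower fields cannot be disturbed.
inline uint64_t AddTopWrapping(std::atomic<uint64_t>& packed, uint64_t n) {
  return packed.fetch_add(n << kTopShift, std::memory_order_relaxed);
}
```
]]>

&lt;p&gt;A field placed anywhere else would carry into the field above it on overflow, which is why the same unchecked addition is only safe for the topmost field.&lt;/p&gt;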

&lt;p&gt;This capability is used in a folly data structure called LifoSem, which &lt;a href=&quot;https://github.com/facebook/folly/pull/2550&quot;&gt;I have proposed to refactor&lt;/a&gt; to use a folly variant of the BitFields API that is part of the same proposal.&lt;/p&gt;

&lt;h3 id=&quot;compare-and-swap-cas-support&quot;&gt;Compare-and-Swap (CAS) Support&lt;/h3&gt;

&lt;p&gt;The atomic wrappers provide full CAS support for lock-free algorithms:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expected&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;current_state&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;desired&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;expected&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;With&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Field1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;With&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Field2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;true&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CasStrong&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expected&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;desired&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Successfully updated&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;atomic-wrappers&quot;&gt;Atomic wrappers&lt;/h3&gt;

&lt;p&gt;The BitFields API includes two atomic wrappers: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RelaxedBitFieldsAtomic&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitFieldsAtomic&lt;/code&gt;. However, RocksDB also has versions of these wrappers for regular &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt; variables that help with memory ordering discipline: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RelaxedAtomic&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Atomic&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;util/atomic.h&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;These wrappers help in a couple of ways:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Self-document intended memory order&lt;/strong&gt;: An atomic field generally has a single memory order that all or most operations should use, typically either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::memory_order_relaxed&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::memory_order_acq_rel&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;More intentional memory orders and atomic operations&lt;/strong&gt;: The standard library’s implicit conversions and default memory ordering (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memory_order_seq_cst&lt;/code&gt;) make it easy to accidentally use sequential consistency where acquire/release or even relaxed ordering would suffice, which could hurt performance. They also tend to hide where atomic operations are actually happening (e.g. implicit vs. explicit load).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For example, instead of writing:&lt;/p&gt;
&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;atomic&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stat_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stat_counter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;++&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Uses memory_order_seq_cst implicitly - maybe inefficient&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;You write:&lt;/p&gt;
&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;RelaxedAtomic&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stat_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;stat_counter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FetchAddRelaxed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Explicitly relaxed - appropriate for a diagnostic counter&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Or for data providing synchronization:&lt;/p&gt;
&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;Atomic&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;size_t&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;refcount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;refcount&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FetchAdd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;  &lt;span class=&quot;c1&quot;&gt;// Standard acquire-release semantics for coordinating with other threads&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;These wrappers complement the BitFields atomic wrappers by providing the same ordering discipline for non-packed atomic variables throughout much of RocksDB, creating a more readable and less clunky approach to concurrent programming. Migrating remaining uses of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::atomic&lt;/code&gt; is an ongoing effort.&lt;/p&gt;
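
&lt;p&gt;For readers who have not seen this style of wrapper, the idea can be sketched in a few lines. The class below is a hypothetical miniature written for this post’s discussion, named &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MiniRelaxedAtomic&lt;/code&gt; to make clear that it is not the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RelaxedAtomic&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;util/atomic.h&lt;/code&gt;. Apart from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FetchAddRelaxed&lt;/code&gt;, which appears in the example above, its method names and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BumpTwice&lt;/code&gt; helper are assumptions.&lt;/p&gt;

<![CDATA[
```cpp
#include <atomic>
#include <cstdint>

// Hypothetical miniature wrapper, for illustration only. It is NOT the
// RelaxedAtomic from RocksDB's util/atomic.h.
template <typename T>
class MiniRelaxedAtomic {
 public:
  explicit MiniRelaxedAtomic(T initial = T{}) : v_(initial) {}

  // No implicit conversions or operator overloads: every atomic access is a
  // named call, so both the operation and its memory order are visible.
  T LoadRelaxed() const { return v_.load(std::memory_order_relaxed); }
  void StoreRelaxed(T desired) {
    v_.store(desired, std::memory_order_relaxed);
  }
  // Returns the value held before the addition.
  T FetchAddRelaxed(T operand) {
    return v_.fetch_add(operand, std::memory_order_relaxed);
  }

 private:
  std::atomic<T> v_;
};

// Example use as a diagnostic counter: bump twice, then read the result.
inline uint64_t BumpTwice(MiniRelaxedAtomic<uint64_t>& counter) {
  counter.FetchAddRelaxed(1);
  counter.FetchAddRelaxed(1);
  return counter.LoadRelaxed();
}
```
]]>

&lt;p&gt;Because the wrapper exposes no &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;operator++&lt;/code&gt; or implicit conversion, code such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;counter++&lt;/code&gt; simply does not compile, so a sequentially consistent operation cannot slip in unnoticed.&lt;/p&gt;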

&lt;h2 id=&quot;real-world-usage-in-rocksdb&quot;&gt;Real-World Usage in RocksDB&lt;/h2&gt;

&lt;p&gt;The BitFields API was developed along with the revamped parallel compression in RocksDB, but with the intention to also clean up the HyperClockCache (HCC) implementation. With that migration complete, we can see the benefits. Specifically, &lt;strong&gt;by packing more of the state machine into a single atomic value, the parallel algorithms became both simpler and more efficient.&lt;/strong&gt; Concurrent algorithms whose state space could have blown up with elaborate interleavings between threads trying not to block each other (e.g. because of multi-step consensus on work assignments) were instead able to make progress more quickly and easily, e.g. with work assignments that are atomically clear.&lt;/p&gt;

&lt;h3 id=&quot;before-manual-bit-manipulation&quot;&gt;Before: Manual Bit Manipulation&lt;/h3&gt;

&lt;p&gt;The old HCC code was difficult to read and maintain. Many of the common read and update operations had manually written helper functions, but it was not practical to develop the full set of functions needed for rare cases. Consider this code that clears the “visible” flag on a slot when an entry is erased from subsequent lookups but might still be referenced:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// Old HCC code, without atomic wrappers&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fetch_and&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ClockHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kStateVisibleBit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
                                   &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ClockHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kStateShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memory_order_acq_rel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Apply update to local copy&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;~&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ClockHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kStateVisibleBit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
                            &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ClockHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kStateShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// New HCC code&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;h&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Apply&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;VisibleFlag&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ClearTransform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Or this assertion that the acquire and release counters are different:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// Old HCC code&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ClockHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kAcquireCounterShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ClockHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kCounterMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt;
        &lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ClockHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kReleaseCounterShift&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;ClockHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kCounterMask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// New HCC code without single-purpose helper functions&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AcquireCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt;
       &lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ReleaseCounter&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// New HCC code, with single-purpose helper functions&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SlotMeta&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;assert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetAcquireCounter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;old_meta&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetReleaseCounter&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Some hand-written helper functions or using directives are still useful for brevity, but even without them all the bit manipulation details are hidden in the BitFields implementation.&lt;/p&gt;

&lt;h2 id=&quot;future-directions&quot;&gt;Future Directions&lt;/h2&gt;

&lt;p&gt;We hope the proposed folly version is accepted to make the BitFields API available for broader usage. Additionally, some quality-of-life improvements are likely possible, perhaps including easier declaration and usage syntax, hopefully without delving into boost-like macro hell. Better runtime and compile-time checks might also be possible.&lt;/p&gt;

&lt;h2 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;The BitFields API demonstrates that zero-overhead abstractions can significantly improve code quality without sacrificing performance. By providing type safety, self-documentation, and convenience features around bit manipulation and atomic operations, it makes lock-free programming more accessible and maintainable. Bit-packed atomics are arguably essential for &lt;em&gt;slaying the complexity dragon&lt;/em&gt; of efficient lock-free and low-lock algorithms, because they reduce the explosion of algorithm states.&lt;/p&gt;

&lt;p&gt;For RocksDB specifically, the migration to BitFields has made the HyperClockCache implementation substantially easier to understand and modify, while maintaining the same high-performance characteristics. Combined with the recent &lt;a href=&quot;/blog/2025/10/08/parallel-compression-revamp.html&quot;&gt;parallel compression revamp&lt;/a&gt;, these improvements showcase our ongoing commitment to writing clean, efficient, and maintainable code.&lt;/p&gt;

&lt;p&gt;The BitFields API is available in RocksDB’s util/bit_fields.h and can be adapted for use in other projects requiring efficient, type-safe bit packing. For those building high-performance concurrent systems, it offers a compelling alternative to manual bit manipulation—proving that safe abstractions and peak performance are not mutually exclusive.&lt;/p&gt;
</description>
        <pubDate>Wed, 31 Dec 2025 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2025/12/31/bit-fields-api.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2025/12/31/bit-fields-api.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Parallel Compression Revamp: Dramatically Reduced CPU Overhead</title>
        <description>&lt;p&gt;The upcoming RocksDB 10.7 release includes a major revamp of parallel compression that &lt;strong&gt;dramatically reduces the feature’s CPU overhead by up to 65%&lt;/strong&gt; while maintaining or improving throughput for compression-heavy workloads. We expect this to broaden the set of workloads that could benefit from parallel compression, especially for &lt;strong&gt;bulk SST generation and remote compaction use cases&lt;/strong&gt; that are less sensitive to CPU responsiveness.&lt;/p&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;Parallel compression in RocksDB (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CompressionOptions::parallel_threads &amp;gt; 1&lt;/code&gt;) allows multiple threads to compress different blocks simultaneously during SST file generation, which can significantly improve compaction throughput for workloads where compression is a bottleneck. However, the original implementation had substantial CPU overhead that often outweighed the benefits, limiting its practical adoption.&lt;/p&gt;

&lt;h2 id=&quot;whats-new-a-complete-reimplementation&quot;&gt;What’s New: A Complete Reimplementation&lt;/h2&gt;

&lt;p&gt;The parallel compression framework has been completely rewritten from the ground up in &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/13910&quot;&gt;pull request #13910&lt;/a&gt; to address the core inefficiencies:&lt;/p&gt;

&lt;h3 id=&quot;ring-buffer-architecture&quot;&gt;Ring Buffer Architecture&lt;/h3&gt;
&lt;p&gt;Instead of separate compression and write queues with complex thread coordination, the new implementation uses a ring buffer of blocks-in-progress that enables efficient work distribution across threads. This bounds working memory while enabling high throughput with minimal cross-thread synchronization.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/images/parallel-compression/ring-buffer-architecture.svg&quot; alt=&quot;Ring Buffer Architecture&quot; /&gt;&lt;/p&gt;
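
&lt;p&gt;As a rough illustration of why a ring buffer bounds working memory, here is a minimal single-threaded sketch of a fixed-capacity ring of in-progress blocks. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BlockRing&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TryPush&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TryPop&lt;/code&gt; are hypothetical and not taken from RocksDB; the real implementation coordinates multiple threads with atomics, which this sketch leaves out.&lt;/p&gt;

```cpp
#include <array>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>

// Hypothetical, single-threaded sketch of a bounded ring of in-progress
// blocks. Not RocksDB's implementation, which is concurrent and lock-free.
template <std::size_t kCapacity>
class BlockRing {
  static_assert(kCapacity > 0, "the ring needs at least one slot");

 public:
  // Adds a block after the newest one; returns false when the ring is full,
  // which is the point where a producer would have to stop emitting blocks.
  bool TryPush(std::string block) {
    if (size_ == kCapacity) {
      return false;
    }
    slots_[(tail_ + size_) % kCapacity] = std::move(block);
    ++size_;
    return true;
  }

  // Removes the oldest block, so blocks leave the ring in the same order
  // in which they were pushed.
  std::optional<std::string> TryPop() {
    if (size_ == 0) {
      return std::nullopt;
    }
    std::string block = std::move(slots_[tail_]);
    tail_ = (tail_ + 1) % kCapacity;
    --size_;
    return block;
  }

  std::size_t size() const { return size_; }

 private:
  std::array<std::string, kCapacity> slots_;
  std::size_t tail_ = 0;  // index of the oldest block
  std::size_t size_ = 0;  // number of blocks currently in the ring
};
```

&lt;p&gt;Because the number of slots is fixed up front, at most that many blocks can be in flight, no matter how far the producer runs ahead of the consumers.&lt;/p&gt;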

&lt;h3 id=&quot;work-stealing-design&quot;&gt;Work-Stealing Design&lt;/h3&gt;
&lt;p&gt;Previously, the calling thread could only generate uncompressed blocks, dedicated compression threads could only compress, and a writer thread could only write the SST file to storage. Now, all threads can participate in compression work in a quasi-work-stealing manner, dramatically reducing the need for threads to block waiting for work. Only one thread (the calling thread or “emit thread”) can generate uncompressed SST blocks in the new implementation, feeding compression work to the other threads and to itself, while all other threads are able to write compressed blocks to storage.&lt;/p&gt;

&lt;h3 id=&quot;auto-scaling-thread-management&quot;&gt;Auto-Scaling Thread Management&lt;/h3&gt;
&lt;p&gt;The ring buffer enables another key feature: auto-scaling of active threads based on ring buffer utilization. The framework intelligently wakes up idle worker threads only when there’s sufficient work to justify the overhead, achieving near-maximum throughput while minimizing CPU waste from unnecessary thread wake-ups.&lt;/p&gt;

&lt;h3 id=&quot;lock-free-synchronization&quot;&gt;Lock-Free Synchronization&lt;/h3&gt;
&lt;p&gt;The entire framework is now lock-free (and wait-free as long as compatible work units are available for each thread), based primarily on atomic operations. To cleanly pack many data fields into a single atomic value and operate on them together, I’ve developed a new &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitFields&lt;/code&gt; utility API. This is proving useful for cleaning up the HyperClockCache implementation as well, and will be the topic of a later blog post.&lt;/p&gt;

&lt;p&gt;Semaphores are used for lock-free management of idle threads. This assumes a lock-free semaphore implementation, which is likely what you get with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ROCKSDB_USE_STD_SEMAPHORES&lt;/code&gt;, although that option cannot yet be trusted (see below).&lt;/p&gt;
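
&lt;p&gt;To illustrate the general idea of packing several fields into one atomic word, here is a minimal hand-written sketch. The layout (two 32-bit counters in one 64-bit word) and the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;TryClaim&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Complete&lt;/code&gt; are assumptions made for this example; it does not reproduce the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitFields&lt;/code&gt; API or the state actually used by the compression framework.&lt;/p&gt;

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical layout: two 32-bit counters packed into one 64-bit word so
// that both can be read and updated together in a single atomic operation.
constexpr int kClaimedShift = 32;
constexpr uint64_t kLowMask = 0xFFFFFFFFu;

inline uint32_t Claimed(uint64_t word) {
  return static_cast<uint32_t>(word >> kClaimedShift);
}
inline uint32_t Completed(uint64_t word) {
  return static_cast<uint32_t>(word & kLowMask);
}
inline uint64_t Pack(uint32_t claimed, uint32_t completed) {
  return (static_cast<uint64_t>(claimed) << kClaimedShift) | completed;
}

// Claims one work unit if fewer than `max_in_flight` units are claimed but
// not yet completed. The compare-exchange loop retries only when another
// thread changed the word in between, so no lock is needed.
inline bool TryClaim(std::atomic<uint64_t>& state, uint32_t max_in_flight) {
  uint64_t old_word = state.load(std::memory_order_acquire);
  for (;;) {
    uint32_t in_flight = Claimed(old_word) - Completed(old_word);
    if (in_flight >= max_in_flight) {
      return false;
    }
    uint64_t new_word = Pack(Claimed(old_word) + 1, Completed(old_word));
    if (state.compare_exchange_weak(old_word, new_word,
                                    std::memory_order_acq_rel,
                                    std::memory_order_acquire)) {
      return true;
    }
    // On failure, old_word holds the latest value; loop and re-check.
  }
}

// Marks one claimed unit as completed by incrementing the low field.
// Assumes fewer than 2^32 completions, so the low field never carries
// into the high one.
inline void Complete(std::atomic<uint64_t>& state) {
  state.fetch_add(1, std::memory_order_acq_rel);
}
```

&lt;p&gt;Keeping both counters in one word means the “how many are in flight” check and the increment are decided against the same snapshot, which is what keeps the number of reachable states small. The manual shifts and masks are the part a utility like &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BitFields&lt;/code&gt; is meant to hide.&lt;/p&gt;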

&lt;h2 id=&quot;performance-improvements&quot;&gt;Performance Improvements&lt;/h2&gt;

&lt;p&gt;The results speak for themselves. Here’s a comparison using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db_bench&lt;/code&gt; fillseq benchmarks with various compression configurations:&lt;/p&gt;

&lt;h3 id=&quot;zstd-compression-default-level&quot;&gt;ZSTD Compression (Default Level)&lt;/h3&gt;
&lt;p&gt;Note:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;“throughput” = how quickly a given CPU-bound flush or compaction can complete&lt;/li&gt;
  &lt;li&gt;“CPU increase” = increase in total CPU usage, measured as CPU time summed over all cores used&lt;/li&gt;
  &lt;li&gt;“PT” = &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallel_threads&lt;/code&gt; setting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;PT=3: ~38% throughput increase for ~73% CPU increase&lt;/li&gt;
  &lt;li&gt;PT=6: No throughput increase for ~70% CPU increase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;PT=3: ~58% throughput increase for ~25% CPU increase&lt;/li&gt;
  &lt;li&gt;PT=6: ~58% throughput increase for ~28% CPU increase&lt;/li&gt;
&lt;/ul&gt;
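
&lt;p&gt;One way to read these figures is to ask how much of the extra CPU cost went away and how much throughput each unit of CPU now buys. The sketch below is plain arithmetic on the numbers listed above; the helper names are hypothetical, the efficiency ratio is not an official RocksDB metric, and it assumes the percentages are relative to a run without parallel compression.&lt;/p&gt;

```cpp
#include <cassert>

// Fraction of the CPU overhead removed, given the CPU increase (in percent)
// before and after the revamp.
inline double OverheadReduction(double before_pct, double after_pct) {
  assert(before_pct > 0.0);
  return (before_pct - after_pct) / before_pct;
}

// Throughput obtained per unit of total CPU spent, relative to the assumed
// non-parallel baseline. A value of 1.0 means parallelism neither helped
// nor hurt efficiency. Both inputs are percentage increases.
inline double RelativeEfficiency(double throughput_increase_pct,
                                 double cpu_increase_pct) {
  return (1.0 + throughput_increase_pct / 100.0) /
         (1.0 + cpu_increase_pct / 100.0);
}
```

&lt;p&gt;With the PT=3 figures, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OverheadReduction(73, 25)&lt;/code&gt; is about 0.66, in line with the “up to 65%” headline, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;RelativeEfficiency&lt;/code&gt; moves from about 0.80 before (38% more throughput for 73% more CPU) to about 1.26 after (58% more throughput for 25% more CPU).&lt;/p&gt;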

&lt;h3 id=&quot;high-compression-scenarios&quot;&gt;High Compression Scenarios&lt;/h3&gt;
&lt;p&gt;For ZSTD compression level 8, the improvements are even more dramatic:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;PT=4: 2.6x throughput increase for 139% CPU increase&lt;/li&gt;
  &lt;li&gt;PT=8: 3.6x throughput increase for 135% CPU increase&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;PT=4: 2.8x throughput increase for 114% CPU increase&lt;/li&gt;
  &lt;li&gt;PT=8: 3.7x throughput increase for 116% CPU increase&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;compression-algorithm-optimizations&quot;&gt;Compression Algorithm Optimizations&lt;/h2&gt;

&lt;p&gt;Alongside the parallel compression revamp, some optimizations have gone into the underlying compression implementations/integrations. Most notably, &lt;strong&gt;LZ4HC received dramatic performance improvements&lt;/strong&gt; through better reuse of internal data structures between compression calls (detailed in &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/13805&quot;&gt;pull request #13805&lt;/a&gt;). A small regression in LZ4 performance from that change was fixed in &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/14017&quot;&gt;pull request #14017&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While &lt;strong&gt;ZSTD remains the gold standard&lt;/strong&gt; for medium-to-high compression ratios in RocksDB, these LZ4HC optimizations make it an increasingly attractive option for read-heavy workloads where LZ4’s faster decompression can provide overall performance benefits.&lt;/p&gt;

&lt;h2 id=&quot;production-ready&quot;&gt;Production Ready&lt;/h2&gt;

&lt;p&gt;With these efficiency improvements, parallel compression is now considered &lt;strong&gt;production-ready&lt;/strong&gt;. The feature has been thoroughly tested in both unit tests and stress testing, including validation on high-load scenarios with hundreds of concurrent compression jobs and thousands of threads.&lt;/p&gt;

&lt;p&gt;Some notes on current limitations:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Parallel compression is currently incompatible with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UserDefinedIndex&lt;/code&gt; and with the deprecated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;decouple_partitioned_filters=false&lt;/code&gt; setting&lt;/li&gt;
  &lt;li&gt;Maximum performance is available with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-DROCKSDB_USE_STD_SEMAPHORES&lt;/code&gt; at compile time, though this is not currently recommended due to reported bugs in some implementations of C++20 semaphores&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;configuration-recommendations&quot;&gt;Configuration Recommendations&lt;/h2&gt;

&lt;p&gt;The dramatically reduced CPU overhead means parallel compression is now viable for a broader range of workloads, particularly those using higher compression levels or compression-heavy scenarios like time-series data. However, simply enabling parallel compression could result in more &lt;em&gt;spiky&lt;/em&gt; CPU loads for hosts serving live DB data. &lt;strong&gt;Parallel compression might be most useful for bulk SST file generation and/or remote compaction workloads&lt;/strong&gt; because they are less sensitive to CPU responsiveness. In these scenarios there is little danger in setting &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallel_threads=8&lt;/code&gt; even with the possibility of over-subscribing CPU cores, though the potentially safer “sweet spot” is typically around &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;parallel_threads=3&lt;/code&gt;, depending on compression level, etc.&lt;/p&gt;

&lt;h2 id=&quot;limitations-and-future&quot;&gt;Limitations and Future&lt;/h2&gt;

&lt;p&gt;Although this offers a great improvement in the implementation of an existing option, we recognize that this setup is suboptimal in a number of ways:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;There is no work sharing / thread pooling for these SST compression/writer threads among compactions in the same process, so the framework cannot fit the workload well to the available CPU cores, nor can it use compression work from other SST files to keep a worker thread from going to sleep.&lt;/li&gt;
  &lt;li&gt;We are not (yet) using a framework that would allow micro-work sharing with things other than SST generation on a set of threads. That would be a good direction for effective sharing of CPU resources without spikes in usage, but might incur intolerable CPU overhead in managing work. With this “hand optimized” and specialized framework, we can at least evaluate such future endeavors against a perhaps ideal framework in terms of parallelizing with minimal overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;try-it-out&quot;&gt;Try It Out&lt;/h2&gt;

&lt;p&gt;The parallel compression revamp will be available in RocksDB 10.7. As always, we recommend testing in your specific environment to determine the optimal configuration for your workload.&lt;/p&gt;
</description>
        <pubDate>Wed, 08 Oct 2025 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2025/10/08/parallel-compression-revamp.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2025/10/08/parallel-compression-revamp.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>IO Activity Tagging</title>
        <description>&lt;h2 id=&quot;context&quot;&gt;Context&lt;/h2&gt;

&lt;p&gt;RocksDB performs a variety of IO operations—user reads, background compactions, flushes, database opens, and verification tasks. Treating all these operations the same makes it difficult for file system implementers to optimize performance, prioritize latency-sensitive IOs, and diagnose bottlenecks. To solve that, RocksDB internally tags every IO operation with its activity type using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IOActivity&lt;/code&gt; enum. This automatic tagging provides precise context for each IO, enabling file systems to make smarter, context-aware decisions for scheduling, caching, and resource management.&lt;/p&gt;

&lt;h2 id=&quot;how-internal-io-tagging-works&quot;&gt;How Internal IO Tagging Works&lt;/h2&gt;
&lt;p&gt;RocksDB automatically assigns an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IOActivity&lt;/code&gt; tag to each IO operation. This tag is propagated through the storage stack and included in the IO options passed to the file system.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;enum&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;IOActivity&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;uint8_t&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kFlush&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                        &lt;span class=&quot;c1&quot;&gt;// IO for flush operations (background write)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kCompaction&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                   &lt;span class=&quot;c1&quot;&gt;// IO for compaction (background read/write)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kDBOpen&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                       &lt;span class=&quot;c1&quot;&gt;// IO during database open (read/write)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kGet&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                          &lt;span class=&quot;c1&quot;&gt;// User Get() read&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kMultiGet&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                     &lt;span class=&quot;c1&quot;&gt;// User MultiGet() read&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kDBIterator&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;5&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                   &lt;span class=&quot;c1&quot;&gt;// User iterator read&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kVerifyDBChecksum&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;             &lt;span class=&quot;c1&quot;&gt;// Verification: DB checksum&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kVerifyFileChecksums&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;7&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;          &lt;span class=&quot;c1&quot;&gt;// Verification: file checksums&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kGetEntity&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;8&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;                    &lt;span class=&quot;c1&quot;&gt;// Entity Get (e.g., wide-column)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kMultiGetEntity&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;9&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;               &lt;span class=&quot;c1&quot;&gt;// Entity MultiGet&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kGetFileChecksumsFromCurrentManifest&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// Manifest checksum reads&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;// 0x80–0xFE: Reserved for custom/internal use&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kUnknown&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0xFF&lt;/span&gt;                    &lt;span class=&quot;c1&quot;&gt;// Unknown/unspecified activity&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;access-io-tag-in-file-system&quot;&gt;Access IO Tag in File System&lt;/h2&gt;
&lt;p&gt;Custom file systems can access the IOActivity tag via the IO options structure provided by RocksDB. This allows them to optimize behavior based on the specific IO activity.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;Status&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CustomFileSystem&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;uint64_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;offset&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Slice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;IOOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;io_opts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;...)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;switch&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;io_opts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;io_activity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IOActivity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kGet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// Prioritize or cache user reads&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IOActivity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kCompaction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// Throttle or deprioritize background compaction IO&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;case&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Env&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IOActivity&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kDBOpen&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// Track or optimize DB open IO&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;// ... handle other activities ...&lt;/span&gt;
        &lt;span class=&quot;nl&quot;&gt;default:&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;// Default handling&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;break&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
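
&lt;p&gt;The kind of policy a file system might build on top of the tag can be sketched in isolation. The example below uses a local copy of a few enum values so that it compiles without RocksDB headers; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;IoPriority&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PriorityFor&lt;/code&gt; are hypothetical names, and RocksDB does not prescribe this or any other particular policy.&lt;/p&gt;

```cpp
#include <cstdint>

// Local stand-in mirroring a few values of the IOActivity enum shown above,
// so that the sketch compiles without RocksDB headers.
enum class IOActivity : uint8_t {
  kFlush = 0,
  kCompaction = 1,
  kDBOpen = 2,
  kGet = 3,
  kMultiGet = 4,
  kDBIterator = 5,
  kUnknown = 0xFF,
};

// Hypothetical priority classes a custom file system might define.
enum class IoPriority { kHigh, kNormal, kLow };

// Maps an activity tag to a scheduling priority: user-facing reads first,
// background compaction last, everything else in between.
inline IoPriority PriorityFor(IOActivity activity) {
  switch (activity) {
    case IOActivity::kGet:
    case IOActivity::kMultiGet:
    case IOActivity::kDBIterator:
      return IoPriority::kHigh;
    case IOActivity::kCompaction:
      return IoPriority::kLow;
    case IOActivity::kFlush:
    case IOActivity::kDBOpen:
    case IOActivity::kUnknown:
      return IoPriority::kNormal;
  }
  // Values outside the listed enumerators (for example, the range reserved
  // for custom use) land here.
  return IoPriority::kNormal;
}
```

&lt;p&gt;Keeping the mapping in one small function makes it easy to adjust the policy without touching the individual read and write paths.&lt;/p&gt;
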
&lt;h2 id=&quot;io-activity-statistics-in-rocksdb&quot;&gt;IO Activity Statistics in RocksDB&lt;/h2&gt;
&lt;p&gt;RocksDB provides detailed histograms for IO activities, allowing you to analyze both the aggregate time spent (in microseconds) and the count of IOs for each activity type.&lt;/p&gt;
&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;c1&quot;&gt;// Read Histograms&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_READ_FLUSH_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_READ_COMPACTION_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_READ_DB_OPEN_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_READ_GET_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_READ_MULTIGET_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_READ_DB_ITERATOR_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_READ_VERIFY_DB_CHECKSUM_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_READ_VERIFY_FILE_CHECKSUMS_MICROS&lt;/span&gt;

&lt;span class=&quot;c1&quot;&gt;// Write Histograms&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_WRITE_FLUSH_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_WRITE_COMPACTION_MICROS&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;FILE_WRITE_DB_OPEN_MICROS&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Thanks to Maciej Szeszko and Andrew Chang from the RocksDB team for their contributions in expanding and maintaining the IOActivity enum.&lt;/p&gt;
</description>
        <pubDate>Thu, 25 Sep 2025 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2025/09/25/io-tagging.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2025/09/25/io-tagging.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Unified Memory Tracking</title>
        <description>&lt;h2 id=&quot;context--problem&quot;&gt;Context / Problem&lt;/h2&gt;
&lt;p&gt;Modern RocksDB deployments often run in environments with strict memory constraints—cloud VMs, containers, or hosts with hundreds of DB instances. Unpredictable memory usage can lead to out-of-memory (OOM) errors, degraded performance, or even service outages.
Historically, while the block cache was the main source of memory usage, other components—such as memtables, table readers, file metadata, and temporary buffers—could consume significant memory outside the block cache’s control. This made it difficult for users to set a single memory limit and guarantee resource usage stays within expectations.&lt;/p&gt;

&lt;h2 id=&quot;goal&quot;&gt;Goal&lt;/h2&gt;
&lt;p&gt;The goal of recent memory tracking work in RocksDB is to enable users to cap the total memory usage of RocksDB instances under a single, configurable limit—the block cache capacity. This is achieved by:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Tracking and charging&lt;/strong&gt; all major memory consumers (memtables, table readers, file metadata, compression buffers, filter construction) to the block cache.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Evicting&lt;/strong&gt; data blocks or other memory when the total tracked usage exceeds the configured limit.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Providing a fixed memory footprint&lt;/strong&gt; for RocksDB, making it easier to run in resource-constrained environments and avoid OOMs.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;memtable-memory-charging&quot;&gt;Memtable Memory Charging&lt;/h2&gt;
&lt;p&gt;A major source of memory usage in RocksDB is the memtable. To ensure memtable memory is tracked and capped under a single limit, RocksDB provides the WriteBufferManager (WBM). When WBM is configured with a block cache, memtable memory usage is charged to the block cache. This helps prevent OOM errors and simplifies resource management.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shared_ptr&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Cache&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;HyperClockCacheOptions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capacity&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MakeSharedCache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DBOptions&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;db_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;db_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;write_buffer_manager&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;make_shared&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;WriteBufferManager&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(..,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cache&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
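
&lt;p&gt;To make the charging idea concrete, here is a toy accounting model. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ToyCache&lt;/code&gt; and its methods are hypothetical and model only the bookkeeping: memory reserved on behalf of memtables counts against the same capacity as cached data blocks, and data blocks are evicted to make room. The real block cache and WriteBufferManager are considerably more involved.&lt;/p&gt;

```cpp
#include <cstddef>
#include <deque>

// Toy accounting model, not RocksDB code: one capacity shared by cached
// data blocks and by memory reserved on behalf of other components.
class ToyCache {
 public:
  explicit ToyCache(std::size_t capacity) : capacity_(capacity) {}

  // Caches a data block of `bytes`, evicting older blocks if needed.
  void InsertBlock(std::size_t bytes) {
    blocks_.push_back(bytes);
    block_bytes_ += bytes;
    EvictToFit();
  }

  // Charges `bytes` of non-cache memory (for example, memtable memory) to
  // the cache, shrinking the room left for data blocks.
  void Reserve(std::size_t bytes) {
    reserved_bytes_ += bytes;
    EvictToFit();
  }

  // Returns previously reserved memory, e.g. after a memtable is flushed.
  void Release(std::size_t bytes) {
    reserved_bytes_ -= (bytes < reserved_bytes_ ? bytes : reserved_bytes_);
  }

  std::size_t block_bytes() const { return block_bytes_; }
  std::size_t total_bytes() const { return block_bytes_ + reserved_bytes_; }

 private:
  // Drops the oldest data blocks until the total fits, or none are left.
  void EvictToFit() {
    while (total_bytes() > capacity_ && !blocks_.empty()) {
      block_bytes_ -= blocks_.front();
      blocks_.pop_front();
    }
  }

  std::size_t capacity_;
  std::size_t block_bytes_ = 0;
  std::size_t reserved_bytes_ = 0;
  std::deque<std::size_t> blocks_;  // block sizes, oldest first
};
```

&lt;p&gt;In this model, a growing memtable shows up as a larger reservation, and the cache responds by holding fewer data blocks, so the combined footprint stays within the single configured limit as long as there are blocks left to evict.&lt;/p&gt;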

&lt;h2 id=&quot;other-memory-charging&quot;&gt;Other Memory Charging&lt;/h2&gt;
&lt;p&gt;Beyond memtables, RocksDB allows users to control memory charging for other internal roles using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;cache_usage_options&lt;/code&gt; API. This provides fine-grained control over how memory is tracked for components like table readers, file metadata, compression dictionary buffers (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CompressionOptions::max_dict_buffer_bytes&lt;/code&gt;), and filter construction.&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CacheEntryRoleOptions&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;enum&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Decision&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kEnabled&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kDisabled&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kFallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;Decision&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;charged&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Decision&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kFallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;CacheUsageOptions&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;CacheEntryRoleOptions&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;map&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CacheEntryRole&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CacheEntryRoleOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options_overrides&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;

&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BlockBasedTableOptions&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;table_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;table_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cache_usage_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;charged&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CacheEntryRoleOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Decision&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kFallback&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;table_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cache_usage_options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;options_overrides&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CacheEntryRole&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kTableBuilder&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;charged&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CacheEntryRoleOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Decision&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kEnabled&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;Default (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Decision::kFallback&lt;/code&gt;) behavior for each memory type:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CacheEntryRole::kCompressionDictionaryBuildingBuffer&lt;/code&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kEnabled&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CacheEntryRole::kFilterConstruction&lt;/code&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kDisabled&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CacheEntryRole::kBlockBasedTableReader&lt;/code&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kDisabled&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CacheEntryRole::kFileMetadata&lt;/code&gt;: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;kDisabled&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;monitoring-and-observability&quot;&gt;Monitoring and Observability&lt;/h2&gt;
&lt;p&gt;RocksDB provides built-in statistics to help users monitor memory usage and cache behavior. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DB::Properties::kBlockCacheEntryStats&lt;/code&gt; property exposes detailed statistics about block cache entries, including a breakdown by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CacheEntryRole&lt;/code&gt;. These statistics are essential for understanding memory consumption and tuning cache configuration.&lt;/p&gt;
</description>
        <pubDate>Wed, 24 Sep 2025 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2025/09/24/unified-memory-tracking.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2025/09/24/unified-memory-tracking.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Addressing a Mitigated Misconfig Bug in the RocksDB OSS Repository</title>
        <description>&lt;p&gt;Dear RocksDB Community,&lt;/p&gt;

&lt;p&gt;We want to share an update about a bug in the RocksDB open-source repository on GitHub that allowed a bug bounty researcher to change a release note title in August 2024. This issue was found and responsibly disclosed to us by an external bug bounty researcher through our &lt;a href=&quot;https://www.facebook.com/whitehat&quot;&gt;Meta Bug Bounty program&lt;/a&gt; and quickly mitigated by our teams. We have not seen any evidence of malicious exploitation. Please note that no action is required from our community, as we have taken all necessary steps to remediate the issue.&lt;/p&gt;

&lt;h2 id=&quot;background&quot;&gt;Background&lt;/h2&gt;

&lt;p&gt;RocksDB is a high-performance storage engine library widely used in various large-scale applications. On August 21, 2024, a bug was reported to us by one of our bug bounty researchers. They were able to demonstrate the ability to obtain the GITHUB_TOKEN used in GitHub Actions workflows. This token provides write access to the metadata of the repository, and the researcher used it to change the title of the 9.5.2 release note as a proof of concept. The researcher also unsuccessfully attempted to merge a change to the main branch of the repository; however, we had access controls set up to prevent it from going through.&lt;/p&gt;

&lt;h2 id=&quot;key-details&quot;&gt;Key Details&lt;/h2&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Incident Discovery&lt;/strong&gt;: After the bug bounty researcher changed the open source release note title to demonstrate the vulnerability, external users noticed this change and &lt;a href=&quot;https://github.com/facebook/rocksdb/issues/12962&quot;&gt;notified&lt;/a&gt; RocksDB. RocksDB then reached out to the Bug Bounty program to confirm this was the result of security research.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;No Malicious Abuse&lt;/strong&gt;: The investigation confirmed that no code or data was compromised. The change was public and visible on GitHub.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Tag Reversion Clarification&lt;/strong&gt;: On August 21, a tag named “v9.5.2” was initially published pointing to an incorrect commit. This was unrelated to the bug described here and was promptly corrected by pointing the tag to the correct commit. The release binary remains safe to use, and this correction does not impact the security or integrity of the release.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’ve taken the following steps to mitigate and remediate the issue:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;The release note title was corrected.&lt;/li&gt;
  &lt;li&gt;The workflow running on the self-hosted runner was disabled immediately.&lt;/li&gt;
  &lt;li&gt;It was confirmed that the GITHUB_TOKEN has expired and is no longer in use.&lt;/li&gt;
  &lt;li&gt;The binary tagged for public release was examined to confirm that it was not compromised.&lt;/li&gt;
  &lt;li&gt;Action logs were cross-checked to ensure no other actions were taken with the compromised token, other than the release note title change and the failed attempts to merge self-approved pull requests to the main branch.&lt;/li&gt;
  &lt;li&gt;We have &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/12973&quot;&gt;scoped down&lt;/a&gt; the access level of tokens generated for workflows to prevent similar issues. Additionally, we are developing better guidelines for bug bounty researchers to minimize disruptions during their research.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thank you for your continued support and trust in RocksDB.&lt;/p&gt;

&lt;p&gt;Sincerely,&lt;/p&gt;

&lt;p&gt;The RocksDB Team&lt;/p&gt;
</description>
        <pubDate>Fri, 07 Feb 2025 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2025/02/07/mitigated-bug-update.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2025/02/07/mitigated-bug-update.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Java Foreign Function Interface</title>
        <description>&lt;h1 id=&quot;java-foreign-function-interface-ffi&quot;&gt;Java Foreign Function Interface (FFI)&lt;/h1&gt;

&lt;p&gt;Evolved Binary has been working on several aspects of how the Java API to RocksDB can be improved. The recently introduced FFI features in Java provide significant opportunities for improving the API. We have investigated this through a prototype implementation.&lt;/p&gt;

&lt;p&gt;Java 19 introduced a new &lt;a href=&quot;https://openjdk.org/jeps/424&quot;&gt;FFI Preview&lt;/a&gt; which is described as &lt;em&gt;an API by which Java programs can interoperate with code and data outside of the Java runtime. By efficiently invoking foreign functions (i.e., code outside the JVM), and by safely accessing foreign memory (i.e., memory not managed by the JVM), the API enables Java programs to call native libraries and process native data without the brittleness and danger of JNI&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;If the twin promises of efficiency and safety are realised, then using FFI as a mechanism to support a future RocksDB API may be of significant benefit. It could:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Remove the complexity of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNI&lt;/code&gt; access to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++ RocksDB&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Improve RocksDB Java API performance&lt;/li&gt;
  &lt;li&gt;Reduce the opportunity for coding errors in the RocksDB Java API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s what we did. We have&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;created a prototype FFI branch&lt;/li&gt;
  &lt;li&gt;updated the RocksDB Java build to use Java 19&lt;/li&gt;
  &lt;li&gt;implemented an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFI Preview API&lt;/code&gt; version of a core RocksDB feature (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt;)&lt;/li&gt;
  &lt;li&gt;extended the current JMH benchmarks to also benchmark the new FFI methods. Usefully, JNI and FFI can co-exist peacefully, so we use the existing RocksDB Java to do support work around the FFI-based &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; implementation.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;implementation&quot;&gt;Implementation&lt;/h2&gt;

&lt;h3 id=&quot;how-jni-works&quot;&gt;How JNI Works&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNI&lt;/code&gt; requires a preprocessing step during build/compilation to generate header files from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;native&lt;/code&gt; methods declared in Java. The functions declared in those headers are then implemented in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt;, and the whole is linked together.&lt;/p&gt;

&lt;p&gt;Code in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt; methods uses what amounts to a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNI&lt;/code&gt; library to access Java values and objects and to create Java objects in response.&lt;/p&gt;

&lt;h3 id=&quot;how-ffi-works&quot;&gt;How FFI Works&lt;/h3&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFI&lt;/code&gt; provides the facility for Java to call existing native (in our case C++) code from Pure Java without having to generate support files during compilation steps. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFI&lt;/code&gt; does support an external tool (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jextract&lt;/code&gt;) which makes generating common boilerplate easier and less error prone, but we chose to start prototyping without it, in part to better understand how things really work.&lt;/p&gt;

&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFI&lt;/code&gt; does its job by providing two things:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;A model for allocating, reading and writing native memory and native structures within that memory&lt;/li&gt;
  &lt;li&gt;A model for discovering and calling native methods with parameters consisting of native memory references and/or values&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The called &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt; is invoked entirely natively. It does not have to access any Java objects to retrieve data it needs. Therefore existing packages in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt; and other sufficiently low level languages can be called from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Java&lt;/code&gt; without having to implement stubs in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt;.&lt;/p&gt;
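
&lt;p&gt;Both facilities can be seen in miniature with a call into the C library. The following is a minimal sketch and not code from the prototype. It assumes the finalized &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java.lang.foreign&lt;/code&gt; API of Java 22 or later (the prototype used the Java 19 preview, where some names differ) and a 64-bit platform where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size_t&lt;/code&gt; maps to a Java &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;long&lt;/code&gt;. The class and method names are purely illustrative.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;

// Illustrative only: calls the C library strlen() without any generated stubs.
final class FfiStrlenSketch {
  // Discover the native symbol and describe its signature,
  // size_t strlen(const char*), modelled here as long(address).
  private static final MethodHandle STRLEN = Linker.nativeLinker().downcallHandle(
      Linker.nativeLinker().defaultLookup().find(&quot;strlen&quot;).orElseThrow(),
      FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

  static long nativeStrlen(final String s) {
    // The confined arena bounds the lifetime of the native copy of the string.
    try (Arena arena = Arena.ofConfined()) {
      // Allocate native memory holding a NUL-terminated UTF-8 copy of s.
      final MemorySegment cString = arena.allocateFrom(s);
      // Call the native function, passing the segment as an address.
      return (long) STRLEN.invokeExact(cString);
    } catch (final Throwable t) {
      throw new RuntimeException(t);
    }
  }

  public static void main(final String[] args) {
    System.out.println(nativeStrlen(&quot;rocksdb&quot;));
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;No header generation or C++ glue is involved here: the native function is located by name and invoked through an ordinary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MethodHandle&lt;/code&gt;.&lt;/p&gt;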

&lt;h3 id=&quot;our-approach&quot;&gt;Our Approach&lt;/h3&gt;

&lt;p&gt;While we could in principle avoid writing any C++, C++ objects and classes are not easily defined in the FFI model, so to begin with it is easier to write some very simple &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt;-like methods/stubs in C++ which can immediately call into the object-oriented core of RocksDB. We define structures with which to pass parameters to and receive results from the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt;-like method(s) we implement.&lt;/p&gt;

&lt;h4 id=&quot;c-side&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt; Side&lt;/h4&gt;

&lt;p&gt;The first method we implement is&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-C&quot;&gt;extern &quot;C&quot; int rocksdb_ffi_get_pinnable(
    ROCKSDB_NAMESPACE::DB* db, ROCKSDB_NAMESPACE::ReadOptions* read_options,
    ROCKSDB_NAMESPACE::ColumnFamilyHandle* cf, rocksdb_input_slice_t* key,
    rocksdb_pinnable_slice_t* value);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;our input structure is&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-C&quot;&gt;typedef struct rocksdb_input_slice {
  const char* data;
  size_t size;
} rocksdb_input_slice_t;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;and our output structure is a pinnable slice (of which more later)&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-C&quot;&gt;typedef struct rocksdb_pinnable_slice {
  const char* data;
  size_t size;
  ROCKSDB_NAMESPACE::PinnableSlice* pinnable_slice;
  bool is_pinned;
} rocksdb_pinnable_slice_t;
&lt;/code&gt;&lt;/pre&gt;

&lt;h4 id=&quot;java-side&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Java&lt;/code&gt; Side&lt;/h4&gt;

&lt;p&gt;We implement an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFIMethod&lt;/code&gt; class to advertise a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java.lang.invoke.MethodHandle&lt;/code&gt; for each of our helper stubs&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MethodHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GetPinnable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// handle which refers to the rocksdb_ffi_get_pinnable method in C++&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MethodHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ResetPinnable&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;// handle which refers to the rocksdb_ffi_reset_pinnable method in C++&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;We also implement an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFILayout&lt;/code&gt; class to describe each of the passed structures (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rocksdb_input_slice&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rocksdb_pinnable_slice&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rocksdb_output_slice&lt;/code&gt;) in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Java&lt;/code&gt; terms&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt; &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;InputSlice&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GroupLayout&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Layout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VarHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VarHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
 &lt;span class=&quot;o&quot;&gt;};&lt;/span&gt;

 &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;PinnableSlice&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GroupLayout&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Layout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VarHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VarHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VarHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;IsPinned&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
 &lt;span class=&quot;o&quot;&gt;};&lt;/span&gt;

 &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;OutputSlice&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GroupLayout&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Layout&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VarHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
  &lt;span class=&quot;kd&quot;&gt;static&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;VarHandle&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Size&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;  &lt;span class=&quot;o&quot;&gt;...&lt;/span&gt;
 &lt;span class=&quot;o&quot;&gt;};&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFIDB&lt;/code&gt; class, which implements the public Java FFI API methods, makes use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFIMethod&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFILayout&lt;/code&gt; to make the code for each individual method as idiomatic and efficient as possible. This class also contains &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java.lang.foreign.MemorySession&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java.lang.foreign.SegmentAllocator&lt;/code&gt; objects which control the lifetime of native memory sessions and allow us to allocate lifetime-limited native memory which can be written and read by Java, and passed to native methods.&lt;/p&gt;

&lt;p&gt;At the user level, we then present a method which wraps the details of use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFIMethod&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFILayout&lt;/code&gt; to implement our single, core Java API &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; method&lt;/p&gt;

&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt; &lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GetPinnableSlice&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getPinnableSlice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ReadOptions&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;readOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ColumnFamilyHandle&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;columnFamilyHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MemorySegment&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;keySegment&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt;
      &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GetParams&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;getParams&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The flow of implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getPinnableSlice()&lt;/code&gt;, in common with any other core RocksDB FFI API method, becomes:&lt;/p&gt;
&lt;ol&gt;
  &lt;li&gt;Allocate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemorySegment&lt;/code&gt;s for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt; structures using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Layout&lt;/code&gt;s from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFILayout&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Write to the allocated structures using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VarHandle&lt;/code&gt;s from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFILayout&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Invoke the native method using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MethodHandle&lt;/code&gt; from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFIMethod&lt;/code&gt; and addresses of instantiated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemorySegment&lt;/code&gt;s, or value types, as parameters&lt;/li&gt;
  &lt;li&gt;Read the call result and the output parameter(s), again using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;VarHandle&lt;/code&gt;s from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFILayout&lt;/code&gt; to perform the mapping.&lt;/li&gt;
&lt;/ol&gt;
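
&lt;p&gt;As an illustration of these four steps, the sketch below applies them to the POSIX &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nanosleep()&lt;/code&gt; function rather than to a RocksDB stub, so that it can run without the prototype branch. It assumes Java 22 or later and a 64-bit Linux or macOS system where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;struct timespec&lt;/code&gt; consists of two 64-bit integers. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FfiStructFlowSketch&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sleepNanos&lt;/code&gt; are hypothetical.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-java&quot;&gt;import java.lang.foreign.Arena;
import java.lang.foreign.FunctionDescriptor;
import java.lang.foreign.Linker;
import java.lang.foreign.MemoryLayout;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.StructLayout;
import java.lang.foreign.ValueLayout;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.VarHandle;

// Illustrative only: the layout/handle pattern applied to a libc function.
final class FfiStructFlowSketch {
  // Assumed layout: struct timespec { long tv_sec; long tv_nsec; }
  static final StructLayout TIMESPEC = MemoryLayout.structLayout(
      ValueLayout.JAVA_LONG.withName(&quot;tv_sec&quot;),
      ValueLayout.JAVA_LONG.withName(&quot;tv_nsec&quot;));
  static final VarHandle TV_SEC =
      TIMESPEC.varHandle(MemoryLayout.PathElement.groupElement(&quot;tv_sec&quot;));
  static final VarHandle TV_NSEC =
      TIMESPEC.varHandle(MemoryLayout.PathElement.groupElement(&quot;tv_nsec&quot;));

  // int nanosleep(const struct timespec* req, struct timespec* rem)
  static final MethodHandle NANOSLEEP = Linker.nativeLinker().downcallHandle(
      Linker.nativeLinker().defaultLookup().find(&quot;nanosleep&quot;).orElseThrow(),
      FunctionDescriptor.of(ValueLayout.JAVA_INT, ValueLayout.ADDRESS, ValueLayout.ADDRESS));

  static int sleepNanos(final long nanos) {
    try (Arena arena = Arena.ofConfined()) {
      // 1. Allocate a MemorySegment for the native structure from its layout.
      final MemorySegment request = arena.allocate(TIMESPEC);
      // 2. Write the fields through the VarHandles (the first 0L is the base offset).
      TV_SEC.set(request, 0L, 0L);
      TV_NSEC.set(request, 0L, nanos);
      // 3. Invoke the native method, passing segments as addresses.
      final int rc = (int) NANOSLEEP.invokeExact(request, MemorySegment.NULL);
      // 4. Read the call result; an output structure would be read back
      //    with get() on the same kind of VarHandle.
      return rc;
    } catch (final Throwable t) {
      throw new RuntimeException(t);
    }
  }

  public static void main(final String[] args) {
    System.out.println(sleepNanos(1000L));
  }
}
&lt;/code&gt;&lt;/pre&gt;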

&lt;p&gt;For the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getPinnableSlice()&lt;/code&gt; method, on successful return from an invocation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;rocksdb_ffi_get_pinnable()&lt;/code&gt;, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt; object will contain the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;data&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;size&lt;/code&gt; fields of a pinnable slice (see below) containing the requested value. A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemorySegment&lt;/code&gt; referring to the native memory of the pinnable slice is then constructed, and used by the client to retrieve the value in whatever fashion they choose.&lt;/p&gt;

&lt;h3 id=&quot;pinnable-slices&quot;&gt;Pinnable Slices&lt;/h3&gt;

&lt;p&gt;RocksDB offers core (C++) API methods using the concept of a &lt;a href=&quot;http://rocksdb.org/blog/2017/08/24/pinnableslice.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt;&lt;/a&gt; to return fetched data values while reducing copies to a minimum. We take advantage of this to base our central &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; method(s) on &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt;s. Methods mirroring the existing &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNI&lt;/code&gt;-based API can then be implemented in pure Java by wrapping the core &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getPinnableSlice()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;So we implement&lt;/p&gt;
&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;GetPinnableSlice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Status&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;Code&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;code&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;Optional&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;FFIPinnableSlice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pinnableSlice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{}&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GetPinnableSlice&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;getPinnableSlice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;
      &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ColumnFamilyHandle&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;columnFamilyHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;byte&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and we wrap that to provide&lt;/p&gt;
&lt;div class=&quot;language-java highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;record&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;GetBytes&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nc&quot;&gt;Status&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;na&quot;&gt;Code&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;code&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;byte&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;long&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;{}&lt;/span&gt;

&lt;span class=&quot;kd&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;GetBytes&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;ColumnFamilyHandle&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;columnFamilyHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kd&quot;&gt;final&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;byte&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;[]&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h2 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;/h2&gt;

&lt;p&gt;We extended the existing RocksDB Java JNI benchmarks with new benchmarks based on FFI. The full benchmark run, including the new benchmarks, was performed on Ubuntu:&lt;/p&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;java &lt;span class=&quot;nt&quot;&gt;--enable-preview&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--enable-native-access&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ALL-UNNAMED &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;keyCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;100000 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;keySize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;128 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;valueSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4096,65536 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;columnFamilyTestType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;no_column_family&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; csv org.rocksdb.jmh.GetBenchmarks
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;&lt;img src=&quot;/static/images/jni-ffi/jmh-result-fixed.png&quot; alt=&quot;JNI vs FFI&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;discussion&quot;&gt;Discussion&lt;/h3&gt;

&lt;p&gt;We have plotted the performance (more operations is better) of a selection of benchmarks, selected using this query:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;q &lt;span class=&quot;s2&quot;&gt;&quot;select Benchmark,Score from ./plot/jmh-result-fixed.csv where &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Param: keyCount&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;=100000 and &lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;Param: valueSize&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\&quot;&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;=65536&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;, &lt;span class=&quot;nt&quot;&gt;-H&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ul&gt;
  &lt;li&gt;JNI versions of benchmarks are previously implemented &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jmh&lt;/code&gt; benchmarks for measuring the performance of the current RocksDB Java interface.&lt;/li&gt;
  &lt;li&gt;FFI versions of benchmarks are equivalent benchmarks (as far as possible) implemented using the FFI mechanisms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can see that for all benchmarks which have equivalent FFI and JNI pairs, the JNI version is only very marginally faster. FFI has successfully optimized away most of the extra safety-checking of the new invocation mechanism.&lt;/p&gt;

&lt;p&gt;Our initial implementation of the FFI benchmarks lagged the JNI benchmarks quite significantly, but we received extremely helpful support from Maurizio Cimadamore of the Panama Dev team in optimizing the performance of our FFI implementation. We attribute the small remaining performance gap to the extra bounds checking that FFI still performs.&lt;/p&gt;

&lt;p&gt;For basic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; the result buffer is allocated by the method, so that there is a cost of allocation associated with each request.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGet&lt;/code&gt; vs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;The JNI version is very marginally faster than FFI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For preallocated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt;, where the result buffer is supplied to the method, we avoid allocating a fresh result buffer on each call, and the test recycles its result buffers. The same small difference persists:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;JNI is very marginally faster than FFI&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preallocatedGet()&lt;/code&gt; is a lot faster than basic &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We implemented some methods where the key for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; is randomized, so that any ordering effects can be accounted for. The same differences persisted.&lt;/p&gt;

&lt;p&gt;The FFI interface gives us a natural way to expose RocksDB’s &lt;a href=&quot;http://rocksdb.org/blog/2017/08/24/pinnableslice.html&quot;&gt;pinnable slice&lt;/a&gt; mechanism. When we provide a benchmark which accesses the raw &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt; API, as expected this is the fastest method of any; however we are not comparing like with like:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetPinnableSlice()&lt;/code&gt; returns a handle to the RocksDB memory containing the slice, and presents that as an FFI &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemorySegment&lt;/code&gt;. No copying of the memory in the segment occurs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As noted above, we implement the new FFI-based &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; methods using the new FFI-based &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getPinnableSlice()&lt;/code&gt; method, and copying out the result. So the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGet&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiPreallocatedGet&lt;/code&gt; benchmarks use this mechanism underneath.&lt;/p&gt;

&lt;p&gt;In an effort to discover whether using the Java APIs to copy from the pinnable slice backed &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemorySegment&lt;/code&gt; was a problem, we implemented a separate &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetOutputSlice()&lt;/code&gt; benchmark which copies the result into a (Java allocated native memory) segment at the C++ side.&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetOutputSlice()&lt;/code&gt; is faster than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiPreallocatedGet()&lt;/code&gt; and is in fact at least as fast as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;preallocatedGet()&lt;/code&gt;, which is an almost exact analogue in the JNI world.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So it appears that we can build an FFI-based API with equal performance to the JNI-based one.&lt;/p&gt;

&lt;p&gt;Thinking about the (very small, but probably statistically significant) difference between our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetPinnableSlice()&lt;/code&gt;-based FFI calls and the JNI-based calls, it is reasonable to expect that some of the cost is the extra FFI call to C++ to release the pinned slice as a separate operation. A null FFI method call is extremely fast, but it does take some time.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We would recommend looking again at the performance of the FFI-based implementation when Panama is released post-Preview in Java 21. It seems that, at least with Java 20, the performance of our FFI benchmarks is not significantly different from that of the Java 19 version.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;copies-versus-calls&quot;&gt;Copies versus Calls&lt;/h3&gt;

&lt;p&gt;The second method call over the FFI boundary to release a pinnable slice has a cost. We compared the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetOutputSlice()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetPinnableSlice()&lt;/code&gt; benchmarks in order to examine this cost. We ran them with a fixed key size (128 bytes), since the key size is likely to be largely irrelevant; we varied the value size read from 16 bytes to 16k, and found a performance crossover point between 1k and 4k:&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/images/jni-ffi/jmh-result-pinnable-vs-output-plot.png&quot; alt=&quot;Plot&quot; /&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetOutputSlice()&lt;/code&gt; is faster when values read are 1k in size or smaller. The cost of an extra copy on the C++ side from the pinnable slice buffer into the supplied buffer allocated by the Java Foreign Memory API is less than the cost of the extra call to release a pinnable slice.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetPinnableSlice()&lt;/code&gt; is faster when values read are 4k in size, or larger. Consistent with intuition, the advantage grows with larger read values.&lt;/li&gt;
&lt;/ul&gt;
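
&lt;p&gt;One way to act on this crossover is to pick the retrieval strategy from the expected value size. The following is only a minimal sketch and is not part of the prototype: the class &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetStrategyChooser&lt;/code&gt;, the enum &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetStrategy&lt;/code&gt; and the 2048-byte threshold are all hypothetical. The measurements above only place the crossover somewhere between 1k and 4k, so a real threshold would need to be measured on the target system.&lt;/p&gt;

```java
/** Hypothetical helper: picks a read strategy from the expected value size. */
final class GetStrategyChooser {

  /** The two approaches compared in the benchmarks above. */
  enum GetStrategy { OUTPUT_SLICE, PINNABLE_SLICE }

  // Assumed threshold; the benchmarks only locate the crossover between 1k and 4k.
  static final int ASSUMED_CROSSOVER_BYTES = 2048;

  static GetStrategy choose(final int expectedValueSize) {
    // Large values: avoiding the extra copy outweighs the extra release call.
    if (expectedValueSize > ASSUMED_CROSSOVER_BYTES) {
      return GetStrategy.PINNABLE_SLICE;
    }
    // Small values: one extra copy is cheaper than a second call across the boundary.
    return GetStrategy.OUTPUT_SLICE;
  }

  public static void main(final String[] args) {
    System.out.println(choose(16));
    System.out.println(choose(16 * 1024));
  }
}
```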

&lt;p&gt;The way that the RocksDB API is constructed means that of the 2 methods compared, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetOutputSlice()&lt;/code&gt; will always make exactly 1 more copy than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetPinnableSlice()&lt;/code&gt;. The underlying RocksDB C++ API will always copy into its own temporary buffer if it decides that it cannot pin an internal buffer, and that will be returned as the pinnable slice. There is a potential optimization where the temporary buffer could be replaced by an output buffer, such as that supplied by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ffiGetOutputSlice()&lt;/code&gt;; in practice that is a hard fix to hack in. Its effectiveness depends on how often RocksDB fails to pin an internal buffer.&lt;/p&gt;

&lt;p&gt;A solution which either filled a buffer &lt;em&gt;or&lt;/em&gt; returned a pinnable slice would give us the best of both worlds.&lt;/p&gt;

&lt;h2 id=&quot;other-conclusions&quot;&gt;Other Conclusions&lt;/h2&gt;

&lt;h3 id=&quot;build-processing&quot;&gt;Build Processing&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;
    &lt;p&gt;It is easier to implement an interface using FFI than JNI. No intermediate build processing or code generation steps were needed to implement this prototype.&lt;/p&gt;
  &lt;/li&gt;
  &lt;li&gt;
    &lt;p&gt;For a production version, we would urge using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jextract&lt;/code&gt; to automate the process of generating Java API methods from the set of supporting stubs we generate.&lt;/p&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;safety&quot;&gt;Safety&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;The use of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;jextract&lt;/code&gt; will give a similar level of type security to the use of JNI, when crossing the language boundary. But we do not believe FFI is significantly more type-safe than JNI for method invocation. Neither is it less safe, though.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;native-memory&quot;&gt;Native Memory&lt;/h3&gt;

&lt;p&gt;Panama’s &lt;em&gt;Foreign-Memory Access API&lt;/em&gt; appears to us to be the most significant part of the whole project. At the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Java&lt;/code&gt; side of RocksDB it gives us a clean mechanism (a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemorySegment&lt;/code&gt;) for holding RocksDB data (e.g. the result of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; call) pending its forwarding to client code or network buffers.&lt;/p&gt;

&lt;p&gt;We have taken advantage of this mechanism to provide the core &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFIDB.getPinnableSlice()&lt;/code&gt; method in our Panama-based API. The rest of our prototype &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; API, duplicating the existing &lt;em&gt;JNI&lt;/em&gt;-based API, is then a &lt;em&gt;Pure Java&lt;/em&gt; library on top of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFIDB.getPinnableSlice()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FFIPinnableSlice.reset()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The common standard for foreign memory opens up the possibility of efficient interoperation between RocksDB and Java clients (e.g. Kafka). We think that this is really the key to higher performing, more integrated Java-based systems:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;This could result in data never being copied into Java memory, or a significant reduction in copies, as native &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemorySegment&lt;/code&gt;s are handed off between co-operating Java clients of fundamentally native APIs. This extra potential performance can be extremely useful when 2 or more clients are interoperating; we still need to provide the simplest possible API wrapping these calls (like our prototype &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt;), which operates at a similar level to the current Java API.&lt;/li&gt;
  &lt;li&gt;Some thought should be applied to how this architecture would interact with the cache layer(s) in RocksDB, and whether it can be accommodated within the present RocksDB architecture. How long can 3rd-party applications &lt;em&gt;pin&lt;/em&gt; pages in the RocksDB cache without disrupting RocksDB’s normal behaviour (e.g. compaction)?&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;summary&quot;&gt;Summary&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;Panama/FFI (in &lt;a href=&quot;https://openjdk.org/jeps/424&quot;&gt;Preview&lt;/a&gt;) is a highly capable technology for (re)building the RocksDB Java API, although the supported language level of RocksDB and the planned release schedule for Panama mean that it could not replace JNI in production for some time to come.&lt;/li&gt;
  &lt;li&gt;Panama/FFI would seem to offer comparable performance to JNI; there is no strong performance argument &lt;em&gt;for&lt;/em&gt; a re-implementation of a standalone RocksDB Java API. But the opportunity to provide a natural pinnable slice-based API gives a lot of flexibility, not least because an efficient API could be built mostly in Java with only a small underlying layer implementing the pinnable slice interface.&lt;/li&gt;
  &lt;li&gt;Panama/FFI can remove some boilerplate (native method declarations) and allow Java programs to access &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; libraries without stub code, but calling a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt;-based library still requires &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; stubs; a possible approach would be to use the RocksDB &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; API as the basis for a rebuilt Java API. This would allow us to remove all the existing JNI boilerplate, and concentrate support effort on the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C&lt;/code&gt; API. An alternative approach would be to build a robust API based on &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/10736&quot;&gt;Reference Counting&lt;/a&gt;, but using FFI.&lt;/li&gt;
  &lt;li&gt;Panama/FFI really shines as a foreign memory standard for a Java API that can allow efficient interoperation between RocksDB Java clients and other (Java and native) components of a system. Foreign Memory gives us a model for how to efficiently return data from RocksDB; as pinnable slices with their contents presented in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MemorySegment&lt;/code&gt;s. If we focus on designing an API &lt;em&gt;for native interoperability&lt;/em&gt; we think this can be highly productive in opening RocksDB to new uses and opportunities in future.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id=&quot;appendix&quot;&gt;Appendix&lt;/h2&gt;

&lt;h3 id=&quot;code-and-data&quot;&gt;Code and Data&lt;/h3&gt;

&lt;p&gt;The &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/11095/files&quot;&gt;Experimental Pull Request&lt;/a&gt; contains the source code implemented,
together with further data plots and the source CSV files for all data plots.&lt;/p&gt;

&lt;h3 id=&quot;running&quot;&gt;Running&lt;/h3&gt;

&lt;p&gt;This is an example run; the jmh parameters (after &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;-p&lt;/code&gt;) can be changed to measure performance with varying key counts, and key and value sizes.&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;java &lt;span class=&quot;nt&quot;&gt;--enable-preview&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;--enable-native-access&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;ALL-UNNAMED &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;keyCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;100000 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;keySize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;128 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;valueSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;4096,65536 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;columnFamilyTestType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;no_column_family&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-rf&lt;/span&gt; csv org.rocksdb.jmh.GetBenchmarks &lt;span class=&quot;nt&quot;&gt;-wi&lt;/span&gt; 1 &lt;span class=&quot;nt&quot;&gt;-to&lt;/span&gt; 1m &lt;span class=&quot;nt&quot;&gt;-i&lt;/span&gt; 1
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;processing&quot;&gt;Processing&lt;/h3&gt;

&lt;p&gt;Use &lt;a href=&quot;http://harelba.github.io/q/&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;q&lt;/code&gt;&lt;/a&gt; to select the csv output for analysis and graphing.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Note that we edited the column headings for easier processing&lt;/li&gt;
&lt;/ul&gt;

&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;q &lt;span class=&quot;s2&quot;&gt;&quot;select Benchmark,Score,Error from ./plot/jmh-result.csv where keyCount=100000 and valueSize=65536&quot;&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-d&lt;/span&gt;, &lt;span class=&quot;nt&quot;&gt;-H&lt;/span&gt; &lt;span class=&quot;nt&quot;&gt;-C&lt;/span&gt; readwrite
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;h3 id=&quot;java-19-installation&quot;&gt;Java 19 installation&lt;/h3&gt;

&lt;p&gt;We followed the instructions to install &lt;a href=&quot;https://docs.azul.com/core/zulu-openjdk/install/debian&quot;&gt;Azul&lt;/a&gt;. Then select the correct instance of java locally:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;update-alternatives &lt;span class=&quot;nt&quot;&gt;--config&lt;/span&gt; java
&lt;span class=&quot;nb&quot;&gt;sudo &lt;/span&gt;update-alternatives &lt;span class=&quot;nt&quot;&gt;--config&lt;/span&gt; javac
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And set &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JAVA_HOME&lt;/code&gt; appropriately. In my case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sudo update-alternatives --config java&lt;/code&gt; listed a few JVMs thus:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;  0            /usr/lib/jvm/bellsoft-java8-full-amd64/bin/java   20803123  auto mode
  1            /usr/lib/jvm/bellsoft-java8-full-amd64/bin/java   20803123  manual mode
  2            /usr/lib/jvm/java-11-openjdk-amd64/bin/java       1111      manual mode
* 3            /usr/lib/jvm/zulu19/bin/java                      2193001   manual mode
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;For our environment, we set this:&lt;/p&gt;
&lt;div class=&quot;language-bash highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;nb&quot;&gt;export &lt;/span&gt;&lt;span class=&quot;nv&quot;&gt;JAVA_HOME&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;/usr/lib/jvm/zulu19
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;The default version of Maven available in the Ubuntu package repositories (3.6.3) is incompatible with Java 19. You will need to install a later &lt;a href=&quot;https://maven.apache.org/install.html&quot;&gt;Maven&lt;/a&gt;, and use it. I used &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;3.8.7&lt;/code&gt; successfully.&lt;/p&gt;

&lt;h3 id=&quot;java-20-21-22-and-subsequent-versions&quot;&gt;Java 20, 21, 22 and subsequent versions&lt;/h3&gt;

&lt;p&gt;The FFI version we used was a preview in Java 19, and the interface has changed through to Java 22, where it has been finalized. Future work with this prototype will need to update the code to use the changed interface.&lt;/p&gt;
</description>
        <pubDate>Tue, 20 Feb 2024 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2024/02/20/foreign-function-interface.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2024/02/20/foreign-function-interface.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
      <item>
        <title>Java API Performance Improvements</title>
        <description>&lt;h1 id=&quot;rocksdb-java-api-performance-improvements&quot;&gt;RocksDB Java API Performance Improvements&lt;/h1&gt;

&lt;p&gt;Evolved Binary has been working on several aspects of how the Java API to RocksDB can be improved. Two aspects of this which are of particular importance are performance and the developer experience.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;We have built some synthetic benchmark code to determine which are the most efficient methods of transferring data between Java and C++.&lt;/li&gt;
  &lt;li&gt;We have used the results of the synthetic benchmarking to guide plans for rationalising the API interfaces.&lt;/li&gt;
  &lt;li&gt;We have made some opportunistic performance optimizations/fixes within the Java API which have already yielded noticeable improvements.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id=&quot;synthetic-jni-api-performance-benchmarks&quot;&gt;Synthetic JNI API Performance Benchmarks&lt;/h2&gt;
&lt;p&gt;The synthetic benchmark repository contains tests designed to isolate the Java to/from C++ interaction of a canonical data intensive Key/Value Store implemented in C++ with a Java (JNI) API layered on top.&lt;/p&gt;

&lt;p&gt;JNI provides several mechanisms for allowing transfer of data between Java buffers and C++ buffers. These mechanisms are not trivial, because they require the JNI system to ensure that Java memory under the control of the JVM is not moved or garbage collected whilst it is being accessed outside the direct control of the JVM.&lt;/p&gt;

&lt;p&gt;We set out to determine which of multiple options for transfer of data from &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Java&lt;/code&gt; and vice-versa were the most efficient. We used the &lt;a href=&quot;https://github.com/openjdk/jmh&quot;&gt;Java Microbenchmark Harness&lt;/a&gt; to set up repeatable benchmarks to measure all the options.&lt;/p&gt;

&lt;p&gt;We explore these and some other potential mechanisms in the detailed results (in our &lt;a href=&quot;https://github.com/evolvedbinary/jni-benchmarks/blob/main/DataBenchmarks.md&quot;&gt;Synthetic JNI performance repository&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;We summarise this work here:&lt;/p&gt;

&lt;h3 id=&quot;the-model&quot;&gt;The Model&lt;/h3&gt;

&lt;ul&gt;
  &lt;li&gt;In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;C++&lt;/code&gt; we represent the on-disk data as an in-memory map of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(key, value)&lt;/code&gt;
pairs.&lt;/li&gt;
  &lt;li&gt;For a fetch query, we expect the result to be a Java object with access to the
contents of the &lt;em&gt;value&lt;/em&gt;. This may be a standard Java object which does the job
of data access (a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; or a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuffer&lt;/code&gt;) or an object of our own devising
which holds references to the value in some form (a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FastBuffer&lt;/code&gt; pointing to
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sun.misc.Unsafe&lt;/code&gt; unsafe memory, for instance).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3 id=&quot;data-types&quot;&gt;Data Types&lt;/h3&gt;

&lt;p&gt;There are several potential data types for holding data for transfer, and they
are unsurprisingly quite connected underneath.&lt;/p&gt;

&lt;h4 id=&quot;byte-array&quot;&gt;Byte Array&lt;/h4&gt;

&lt;p&gt;The simplest data container is a &lt;em&gt;raw&lt;/em&gt; array of bytes (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;There are 3 different mechanisms for transferring data between a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; and
C++:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;At the C++ side, the method
&lt;a href=&quot;https://docs.oracle.com/en/java/javase/13/docs/specs/jni/functions.html#getprimitivearraycritical&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNIEnv.GetPrimitiveArrayCritical()&lt;/code&gt;&lt;/a&gt;
allows access to a C++ pointer to the underlying array.&lt;/li&gt;
  &lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNIEnv&lt;/code&gt; methods &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetByteArrayElements()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ReleaseByteArrayElements()&lt;/code&gt;
fetch references/copies to and from the contents of a byte array, with less
concern for critical sections than the &lt;em&gt;critical&lt;/em&gt; methods, though they are
consequently more likely/certain to result in (extra) copies.&lt;/li&gt;
  &lt;li&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNIEnv&lt;/code&gt; methods &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetByteArrayRegion()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SetByteArrayRegion()&lt;/code&gt;
transfer raw C++ buffer data to and from the contents of a byte array. These
must ultimately do some data pinning for the duration of copies; the
mechanisms may be similar or different to the &lt;em&gt;critical&lt;/em&gt; operations, and
therefore performance may differ.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;byte-buffer&quot;&gt;Byte Buffer&lt;/h4&gt;

&lt;p&gt;A &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuffer&lt;/code&gt; abstracts the contents of a collection of bytes, and was in fact
introduced to support a range of higher-performance I/O operations in some
circumstances.&lt;/p&gt;

&lt;p&gt;There are 2 types of byte buffers in Java, &lt;em&gt;indirect&lt;/em&gt; and &lt;em&gt;direct&lt;/em&gt;. Indirect
byte buffers are the standard, and the memory they use is on-heap as with all
usual Java objects. In contrast, direct byte buffers are used to wrap off-heap
memory which is accessible to direct network I/O. Either type of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuffer&lt;/code&gt;
can be allocated at the Java side, using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;allocate()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;allocateDirect()&lt;/code&gt;
methods respectively.&lt;/p&gt;
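
&lt;p&gt;As a small illustration of the two kinds of buffer, the sketch below allocates one of each using only the standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;java.nio.ByteBuffer&lt;/code&gt; API. The class name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BufferKinds&lt;/code&gt; and the buffer size are ours, chosen for the example; this is not code from the benchmarks.&lt;/p&gt;

```java
import java.nio.ByteBuffer;

/** Illustrative only: contrasts indirect (on-heap) and direct (off-heap) buffers. */
final class BufferKinds {

  static String describe(final ByteBuffer buffer) {
    // isDirect() is true only for buffers whose storage lives outside the Java heap.
    return buffer.isDirect() ? "direct" : "indirect";
  }

  public static void main(final String[] args) {
    final ByteBuffer onHeap = ByteBuffer.allocate(4096);        // backed by a byte[]
    final ByteBuffer offHeap = ByteBuffer.allocateDirect(4096); // native memory
    System.out.println(describe(onHeap) + ", capacity " + onHeap.capacity());
    System.out.println(describe(offHeap) + ", capacity " + offHeap.capacity());
  }
}
```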

&lt;p&gt;Direct byte buffers can be created in C++ using the JNI method
&lt;a href=&quot;https://docs.oracle.com/en/java/javase/13/docs/specs/jni/functions.html#newdirectbytebuffer&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNIEnv.NewDirectByteBuffer()&lt;/code&gt;&lt;/a&gt;
to wrap some native (C++) memory.&lt;/p&gt;

&lt;p&gt;Direct byte buffers can be accessed in C++ using
&lt;a href=&quot;https://docs.oracle.com/en/java/javase/13/docs/specs/jni/functions.html#GetDirectBufferAddress&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNIEnv.GetDirectBufferAddress()&lt;/code&gt;&lt;/a&gt;
and measured using
&lt;a href=&quot;https://docs.oracle.com/en/java/javase/13/docs/specs/jni/functions.html#GetDirectBufferCapacity&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNIEnv.GetDirectBufferCapacity()&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;unsafe-memory&quot;&gt;Unsafe Memory&lt;/h4&gt;

&lt;p&gt;The call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;sun.misc.Unsafe.allocateMemory()&lt;/code&gt; returns a handle which is (of course) just a pointer to raw memory, and
can be used as such on the C++ side. We could turn it into a byte buffer on the
C++ side by calling &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;JNIEnv.NewDirectByteBuffer()&lt;/code&gt;, or simply use it as a native
C++ buffer at the expected address, assuming we record or remember how much
space was allocated.&lt;/p&gt;

&lt;p&gt;A custom &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FastBuffer&lt;/code&gt; class provides access to unsafe memory from the Java side.&lt;/p&gt;

&lt;h4 id=&quot;allocation&quot;&gt;Allocation&lt;/h4&gt;

&lt;p&gt;For these benchmarks, allocation has been excluded from the benchmark costs by
pre-allocating a quantity of buffers of the appropriate kind as part of the test
setup. Each run of the benchmark acquires an existing buffer from a pre-allocated
FIFO list, and returns it afterwards. A small test has
confirmed that the request and return cycle is of insignificant cost compared to
the benchmark API call.&lt;/p&gt;
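
&lt;p&gt;The acquire/return cycle described above could look something like the sketch below. This is not the benchmark's actual code: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BufferPool&lt;/code&gt; and its method names are assumptions made for illustration, and the sketch is single-threaded and uses direct byte buffers only.&lt;/p&gt;

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of a pre-allocated FIFO buffer pool; not the benchmark code.
final class BufferPool {
    private final ByteBuffer[] slots;
    private int head; // index of the next buffer to hand out
    private int size; // number of buffers currently available

    // All allocation happens here, during setup, outside the measured path.
    BufferPool(int count, int capacity) {
        slots = new ByteBuffer[count];
        for (int i = 0; i != count; i++) {
            slots[i] = ByteBuffer.allocateDirect(capacity);
        }
        size = count;
    }

    // Take the oldest available buffer (FIFO order).
    ByteBuffer acquire() {
        if (size == 0) {
            throw new IllegalStateException("pool exhausted");
        }
        ByteBuffer buffer = slots[head];
        slots[head] = null;
        head = (head + 1) % slots.length;
        size--;
        return buffer;
    }

    // Return a buffer to the tail of the queue, reset for reuse.
    void release(ByteBuffer buffer) {
        if (size == slots.length) {
            throw new IllegalStateException("pool already full");
        }
        buffer.clear();
        slots[(head + size) % slots.length] = buffer;
        size++;
    }

    int available() {
        return size;
    }
}
```

&lt;p&gt;A benchmark iteration would then call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;acquire()&lt;/code&gt; before the API call under test and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;release()&lt;/code&gt; afterwards, so that no allocation happens inside the measured region.&lt;/p&gt;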

&lt;h3 id=&quot;getjnibenchmark-performance&quot;&gt;GetJNIBenchmark Performance&lt;/h3&gt;

&lt;p&gt;Benchmarks ran for around 6 hours on an otherwise unloaded VM. The error
  bars are small, so we can have strong confidence in the values
  derived and plotted.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/images/jni-get-benchmarks/fig_1024_1_none_nopoolbig.png&quot; alt=&quot;Raw JNI Get large&quot; /&gt;&lt;/p&gt;

&lt;p&gt;Comparing all the benchmarks as the data size tends large, the conclusions we
can draw are:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Indirect byte buffers add cost; they are effectively an overhead on plain
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; and the JNI-side only allows them to be accessed via their
encapsulated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SetRegion&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetCritical&lt;/code&gt; mechanisms for copying data into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; are
of very comparable performance; presumably the behaviour behind the scenes of
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SetRegion&lt;/code&gt; is very similar to that of declaring a critical region, doing a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy()&lt;/code&gt; and releasing the critical region.&lt;/li&gt;
  &lt;li&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetElements&lt;/code&gt; methods for transferring data from C++ to Java are consistently
less efficient than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SetRegion&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetCritical&lt;/code&gt;.&lt;/li&gt;
  &lt;li&gt;Getting into a raw memory buffer, passed as an address (the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;handle&lt;/code&gt; of an
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Unsafe&lt;/code&gt; or of a netty &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuf&lt;/code&gt;) is of similar cost to the more efficient
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; operations.&lt;/li&gt;
  &lt;li&gt;Getting into a direct &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nio.ByteBuffer&lt;/code&gt; is of similar cost again; while the
ByteBuffer is passed over JNI as an ordinary Java object, JNI has a specific
method for getting hold of the address of the direct buffer, and using this, the
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; cost with a ByteBuffer is just that of the underlying C++ &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memcpy()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At small(er) data sizes, we can see whether other factors are important.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/images/jni-get-benchmarks/fig_1024_1_none_nopoolsmall.png&quot; alt=&quot;Raw JNI Get small&quot; /&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Indirect byte buffers are the most significant overhead here. Again, we can
conclude that this is due to pure overhead compared to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; operations.&lt;/li&gt;
  &lt;li&gt;At the lowest data sizes, netty &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuf&lt;/code&gt;s and unsafe memory are marginally
more efficient than &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt;s or (slightly less efficient) direct
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nio.ByteBuffer&lt;/code&gt;s. This may be explained by the small cost of
calling a JNI method on the C++ side simply to acquire the
direct buffer address. The margins (nanoseconds) here are extremely small.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4 id=&quot;post-processing-the-results&quot;&gt;Post processing the results&lt;/h4&gt;

&lt;p&gt;Our benchmark model for post-processing is to transfer the results into a
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt;. Where the result is already a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; this may seem like an unfair
extra cost, but the aim is to model the least cost processing step for any kind
of result.&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Copying into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; using the bulk methods supported by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nio.ByteBuffer&lt;/code&gt; has comparable performance.&lt;/li&gt;
  &lt;li&gt;Accessing the contents of an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Unsafe&lt;/code&gt; buffer using the supplied unsafe methods
is inefficient. The access is word by
word, in Java.&lt;/li&gt;
  &lt;li&gt;Accessing the contents of a netty &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuf&lt;/code&gt; is similarly inefficient; again
the access is presumably word by word, using normal
Java mechanisms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;img src=&quot;/static/images/jni-get-benchmarks/fig_1024_1_copyout_nopoolbig.png&quot; alt=&quot;Copy out JNI Get&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;putjnibenchmark&quot;&gt;PutJNIBenchmark&lt;/h3&gt;

&lt;p&gt;We benchmarked &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Put&lt;/code&gt; methods in a similar synthetic fashion, in less depth but enough to confirm that the performance profile is similar/symmetrical. As with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt;, using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetElements&lt;/code&gt; is the least performant way of implementing transfers to/from Java objects in C++/JNI, and the other JNI mechanisms do not differ greatly from one another.&lt;/p&gt;

&lt;h2 id=&quot;lessons-from-synthetic-api&quot;&gt;Lessons from Synthetic API&lt;/h2&gt;

&lt;p&gt;Performance analysis shows that for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt;, fetching into an allocated &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; is
as efficient as any other mechanism, as long as JNI region methods are used
for the internal data transfer. Copying out or otherwise using the
result on the Java side is straightforward and efficient. Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; avoids the manual memory
management required with direct &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nio.ByteBuffer&lt;/code&gt;s, which extra work does not
appear to provide any gain. A C++ implementation using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetRegion&lt;/code&gt; JNI
method is probably to be preferred to using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetCritical&lt;/code&gt; because while their
performance is equal, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GetRegion&lt;/code&gt; is a higher-level/simpler abstraction.&lt;/p&gt;

&lt;p&gt;Vitally, whatever JNI transfer mechanism is chosen, the buffer allocation
mechanism and pattern are crucial to achieving good performance. We experimented
with making use of netty’s pooled allocator part of the benchmark, and the
difference between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getIntoPooledNettyByteBuf&lt;/code&gt;, which uses the allocator, and
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;getIntoNettyByteBuf&lt;/code&gt;, which uses the same pre-allocation on setup as every other
benchmark, is significant.&lt;/p&gt;

&lt;p&gt;Equally importantly, transfer of data to or from buffers should where possible
be done in bulk, using array copy or buffer copy mechanisms. Thought should
perhaps be given to supporting common transformations in the underlying C++
layer.&lt;/p&gt;
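
&lt;p&gt;To make the bulk-versus-element distinction concrete, here is a small sketch that copies the contents of a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuffer&lt;/code&gt; into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; in both ways. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CopyOut&lt;/code&gt; class is a hypothetical helper, not code from the benchmarks; both methods produce the same bytes, and only the number of calls differs.&lt;/p&gt;

```java
import java.nio.ByteBuffer;

// Hypothetical helper contrasting bulk and per-byte copies out of a ByteBuffer.
final class CopyOut {
    // Bulk copy: a single get(byte[]) call moves all remaining bytes.
    static byte[] bulk(ByteBuffer source) {
        ByteBuffer view = source.duplicate(); // leave the caller's position untouched
        byte[] out = new byte[view.remaining()];
        view.get(out);
        return out;
    }

    // Per-byte copy: same result, but one bounds-checked call per byte.
    static byte[] byteByByte(ByteBuffer source) {
        ByteBuffer view = source.duplicate();
        byte[] out = new byte[view.remaining()];
        int i = 0;
        while (view.hasRemaining()) {
            out[i++] = view.get();
        }
        return out;
    }
}
```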

&lt;h2 id=&quot;api-recommendations&quot;&gt;API Recommendations&lt;/h2&gt;

&lt;p&gt;Of course there is some noise within the results, but we can agree:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Don’t make copies you don’t need to make&lt;/li&gt;
  &lt;li&gt;Don’t allocate/deallocate when you can avoid it&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Translating this into designing an efficient API, we want to:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Support API methods that return results in buffers supplied by the client.&lt;/li&gt;
  &lt;li&gt;Support &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt;-based APIs as the simplest way of getting data into a usable configuration for a broad range of Java use.&lt;/li&gt;
  &lt;li&gt;Support direct &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuffer&lt;/code&gt;s as these can reduce copies when used as part of a chain of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuffer&lt;/code&gt;-based operations. This sort of sophisticated streaming model is most likely to be used by clients where performance is important, and so we decide to support it.&lt;/li&gt;
  &lt;li&gt;Support indirect &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ByteBuffer&lt;/code&gt;s for a combination of reasons:
    &lt;ul&gt;
      &lt;li&gt;API consistency between direct and indirect buffers&lt;/li&gt;
      &lt;li&gt;Simplicity of implementation, as we can wrap &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt;-oriented methods&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
  &lt;li&gt;Continue to support methods which allocate return buffers per-call, as these are the easiest to use on initial encounter with the RocksDB API.&lt;/li&gt;
&lt;/ul&gt;
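
&lt;p&gt;The difference between the two API styles in this list can be sketched as follows. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ValueSource&lt;/code&gt; is a hypothetical stand-in for a store holding a single value; it is not the RocksJava API, and the convention of returning the full value length so that the caller can detect truncation is an assumption of this sketch.&lt;/p&gt;

```java
// Hypothetical sketch: ValueSource stands in for a store holding one value.
final class ValueSource {
    private final byte[] stored;

    ValueSource(byte[] stored) {
        this.stored = stored.clone();
    }

    // Allocate-per-call style: simple to use, but creates a new array each time.
    byte[] get() {
        return stored.clone();
    }

    // Client-supplied-buffer style: copies as much as fits into dest and
    // returns the full value length, so the caller can detect truncation.
    int get(byte[] dest) {
        int copied = Math.min(stored.length, dest.length);
        System.arraycopy(stored, 0, dest, 0, copied);
        return stored.length;
    }
}
```

&lt;p&gt;With the second style, a client can reuse one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;byte[]&lt;/code&gt; across many calls and only reallocate when the returned length exceeds the buffer size.&lt;/p&gt;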

&lt;p&gt;High-performance Java interaction with RocksDB ultimately requires architectural decisions by the client:&lt;/p&gt;
&lt;ul&gt;
  &lt;li&gt;Use more complex (client supplied buffer) API methods where performance matters&lt;/li&gt;
  &lt;li&gt;Don’t allocate/deallocate where you don’t need to
    &lt;ul&gt;
      &lt;li&gt;recycle your own buffers where this makes sense&lt;/li&gt;
      &lt;li&gt;or make sure that you are supplying the ultimate destination buffer (your cache, or a target network buffer) as input to RocksDB &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;put()&lt;/code&gt; calls&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We are currently implementing a number of extra methods consistently across the Java fetch and store APIs to RocksDB in the PR &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/11019&quot;&gt;Java API consistency between RocksDB.put() , .merge() and Transaction.put() , .merge()&lt;/a&gt; according to these principles.&lt;/p&gt;

&lt;h2 id=&quot;optimizations&quot;&gt;Optimizations&lt;/h2&gt;

&lt;h3 id=&quot;reduce-copies-within-api-implementation&quot;&gt;Reduce Copies within API Implementation&lt;/h3&gt;

&lt;p&gt;Having analysed JNI performance as described, we reviewed the core of RocksJNI for opportunities to improve the performance. We noticed one thing in particular; some of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; methods of the Java API had not been updated to take advantage of the new &lt;a href=&quot;http://rocksdb.org/blog/2017/08/24/pinnableslice.html&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt;&lt;/a&gt; methods.&lt;/p&gt;

&lt;p&gt;Fixing this turned out to be a straightforward change, which has now been incorporated in the codebase: &lt;a href=&quot;https://github.com/facebook/rocksdb/pull/10970&quot;&gt;Improve Java API &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; performance by reducing copies&lt;/a&gt;.&lt;/p&gt;

&lt;h4 id=&quot;performance-results&quot;&gt;Performance Results&lt;/h4&gt;

&lt;p&gt;Using the JMH performance tests we updated as part of the above PR, we can see a small but consistent improvement in performance for all of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; method variants which we have enhanced in the PR.&lt;/p&gt;

&lt;div class=&quot;language-sh highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;java &lt;span class=&quot;nt&quot;&gt;-jar&lt;/span&gt; target/rocksdbjni-jmh-1.0-SNAPSHOT-benchmarks.jar &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;keyCount&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1000,50000 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;keySize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;128 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;valueSize&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;1024,16384 &lt;span class=&quot;nt&quot;&gt;-p&lt;/span&gt; &lt;span class=&quot;nv&quot;&gt;columnFamilyTestType&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;1_column_family&quot;&lt;/span&gt; GetBenchmarks.get GetBenchmarks.preallocatedByteBufferGet GetBenchmarks.preallocatedGet
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The y-axis shows &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ops/sec&lt;/code&gt; in throughput, so higher is better.&lt;/p&gt;

&lt;p&gt;&lt;img src=&quot;/static/images/jni-get-benchmarks/optimization-graph.png&quot; alt=&quot;&quot; /&gt;&lt;/p&gt;

&lt;h3 id=&quot;analysis&quot;&gt;Analysis&lt;/h3&gt;

&lt;p&gt;Before the invention of the Pinnable Slice the simplest RocksDB (native) API &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Get()&lt;/code&gt; looked like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;Status&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ReadOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;n&quot;&gt;ColumnFamilyHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;column_family&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Slice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                           &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;After &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt;, the correct way for new code to implement a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; is like this:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;Status&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ReadOptions&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;options&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;ColumnFamilyHandle&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;column_family&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Slice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;PinnableSlice&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;p&gt;But of course RocksDB has to support legacy code, so there is an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;inline&lt;/code&gt; method in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;db.h&lt;/code&gt; which re-implements the former using the latter.
And the RocksJava API implementation seamlessly continues to use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::string&lt;/code&gt;-based &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Let’s examine what happens when &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;get()&lt;/code&gt; is called from Java:&lt;/p&gt;

&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;rougeHighlight&quot;&gt;&lt;code&gt;&lt;table class=&quot;rouge-table&quot;&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class=&quot;rouge-gutter gl&quot;&gt;&lt;pre class=&quot;lineno&quot;&gt;1
2
3
4
&lt;/pre&gt;&lt;/td&gt;&lt;td class=&quot;rouge-code&quot;&gt;&lt;pre&gt;&lt;span class=&quot;n&quot;&gt;jint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Java_org_rocksdb_RocksDB_get__JJ_3BII_3BIIJ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;JNIEnv&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;env&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jobject&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jlong&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jdb_handle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jlong&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jropt_handle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jbyteArray&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jkey&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;jint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jkey_off&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jkey_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jbyteArray&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jval&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jval_off&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jint&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jval_len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
   &lt;span class=&quot;n&quot;&gt;jlong&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;jcf_handle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/pre&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;

&lt;ol&gt;
  &lt;li&gt;Create an empty &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::string value&lt;/code&gt;&lt;/li&gt;
  &lt;li&gt;Call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DB::Get()&lt;/code&gt; using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::string&lt;/code&gt; variant&lt;/li&gt;
  &lt;li&gt;Copy the resultant &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::string&lt;/code&gt; into Java, using the JNI &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SetByteArrayRegion()&lt;/code&gt; method&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;So stage (3) costs us a copy into Java. It’s mostly unavoidable that there will be at least the one copy from a C++ buffer into a Java buffer.&lt;/p&gt;

&lt;p&gt;But what does stage (2) do?&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;Create a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice(std::string&amp;amp;)&lt;/code&gt; which uses the value as the slice’s backing buffer.&lt;/li&gt;
  &lt;li&gt;Call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DB::Get()&lt;/code&gt; using the PinnableSlice variant&lt;/li&gt;
  &lt;li&gt;Work out if the slice has pinned data, in which case copy the pinned data into value and release it.&lt;/li&gt;
  &lt;li&gt;..or, if the slice has not pinned data, it is already in value (because we tried, but couldn’t pin anything).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So stage (2) costs us a copy into a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::string&lt;/code&gt;. But! It’s just a naive &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::string&lt;/code&gt; that we have copied a large buffer into. And in RocksDB, the buffer is or can be large, so an extra copy is something we need to worry about.&lt;/p&gt;

&lt;p&gt;Luckily this is easy to fix. In the Java API (JNI) implementation:&lt;/p&gt;

&lt;ol&gt;
  &lt;li&gt;Create a PinnableSlice() which uses its own default backing buffer.&lt;/li&gt;
  &lt;li&gt;Call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DB::Get()&lt;/code&gt; using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt; variant of the RocksDB API&lt;/li&gt;
  &lt;li&gt;Copy the data indicated by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt; straight into the Java output buffer using the JNI &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SetByteArrayRegion()&lt;/code&gt; method, then release the slice. There are two cases:
    &lt;ul&gt;
      &lt;li&gt;If the slice has successfully pinned data, copy the pinned data straight into the Java output buffer, then release the pin.&lt;/li&gt;
      &lt;li&gt;If the slice has not pinned data, the data is in the pinnable slice’s default backing buffer, and all that is left is to copy it straight into the Java output buffer.&lt;/li&gt;
    &lt;/ul&gt;
  &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the case where the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt; has successfully pinned the data, this saves us the intermediate copy to the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;std::string&lt;/code&gt;. In the case where it hasn’t, we still have the extra copy, so the observed performance improvement depends on how often the data can be pinned. Luckily, our benchmarking suggests that the pin is happening in a significant number of cases.&lt;/p&gt;

&lt;p&gt;On discussion with the RocksDB core team we understand that the core &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PinnableSlice&lt;/code&gt; optimization is most likely to succeed when pages are loaded from the block cache, rather than when they are in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memtable&lt;/code&gt;. And it might be possible to successfully pin in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;memtable&lt;/code&gt; as well, with some extra coding effort. This would likely improve the results for these benchmarks.&lt;/p&gt;
</description>
        <pubDate>Mon, 06 Nov 2023 00:00:00 +0000</pubDate>
        <link>http://rocksdb.org/blog/2023/11/06/java-jni-benchmarks.html</link>
        <guid isPermaLink="true">http://rocksdb.org/blog/2023/11/06/java-jni-benchmarks.html</guid>
        
        
        <category>blog</category>
        
      </item>
    
  </channel>
</rss>
