RocksDB supports both optimistic and pessimistic concurrency controls. The pessimistic transactions make use of locks to provide isolation between the transactions. The default write policy in pessimistic transactions is WriteCommitted, which means that the data is written to the DB, i.e., the memtable, only after the transaction is committed. This policy simplified the implementation but came with some limitations in throughput, transaction size, and variety in supported isolation levels. In the below, we explain these in detail and present the other write policies, WritePrepared and WriteUnprepared. We then dive into the design of WritePrepared transactions.

WriteCommitted, Pros and Cons

With WriteCommitted write policy, the data is written to the memtable only after the transaction commits. This greatly simplifies the read path as any data that is read by other transactions can be assumed to be committed. This write policy, however, implies that the writes are buffered in memory in the meanwhile. This makes memory a bottleneck for large transactions. The delay of the commit phase in 2PC (two-phase commit) also becomes noticeable since most of the work, i.e., writing to memtable, is done at the commit phase. When the commit of multiple transactions are done in a serial fashion, such as in 2PC implementation of MySQL, the lengthy commit latency becomes a major contributor to lower throughput. Moreover this write policy cannot provide weaker isolation levels, such as READ UNCOMMITTED, that could potentially provide higher throughput for some applications.

Alternatives: WritePrepared and WriteUnprepared

To tackle the lengthy commit issue, we should do memtable writes at earlier phases of 2PC so that the commit phase become lightweight and fast. 2PC is composed of Write stage, where the transaction ::Put is invoked, the prepare phase, where ::Prepare is invoked (upon which the DB promises to commit the transaction if later is requested), and commit phase, where ::Commit is invoked and the transaction writes become visible to all readers. To make the commit phase lightweight, the memtable write could be done at either ::Prepare or ::Put stages, resulting into WritePrepared and WriteUnprepared write policies respectively. The downside is that when another transaction is reading data, it would need a way to tell apart which data is committed, and if they are, whether they are committed before the transaction’s start, i.e., in the read snapshot of the transaction. WritePrepared would still have the issue of buffering the data, which makes the memory the bottleneck for large transactions. It however provides a good milestone for transitioning from WriteCommitted to WriteUnprepared write policy. Here we explain the design of WritePrepared policy. We will cover the changes that make the design to also supported WriteUnprepared in an upcoming post.

WritePrepared in a nutshell

These are the primary design questions that needs to be addressed: 1) How do we identify the key/values in the DB with transactions that wrote them? 2) How do we figure if a key/value written by transaction Txn_w is in the read snapshot of the reading transaction Txn_r? 3) How do we rollback the data written by aborted transactions?

With WritePrepared, a transaction still buffers the writes in a write batch object in memory. When 2PC ::Prepare is called, it writes the in-memory write batch to the WAL (write-ahead log) as well as to the memtable(s) (one memtable per column family); We reuse the existing notion of sequence numbers in RocksDB to tag all the key/values in the same write batch with the same sequence number, prepare_seq, which is also used as the identifier for the transaction. At commit time, it writes a commit marker to the WAL, whose sequence number, commit_seq, will be used as the commit timestamp of the transaction. Before releasing the commit sequence number to the readers, it stores a mapping from prepare_seq to commit_seq in an in-memory data structure that we call CommitCache. When a transaction reading values from the DB (tagged with prepare_seq) it makes use of the CommitCache to figure if commit_seq of the value is in its read snapshot. To rollback an aborted transaction, we apply the status before the transaction by making another write that cancels out the writes of the aborted transaction.

The CommitCache is a lock-free data structure that caches the recent commit entries. Looking up the entries in the cache must be enough for almost all th transactions that commit in a timely manner. When evicting the older entries from the cache, it still maintains some other data structures to cover the corner cases for transactions that takes abnormally too long to finish. We will cover them in the design details below.

Benchmark Results

Here we presents the improvements observed in MyRocks with sysbench and linkbench:

  • benchmark………..tps………p95 latency….cpu/query
  • insert……………….68%
  • update-noindex…30%……38%
  • update-index…….61%…….28%
  • read-write…………6%……..3.5%
  • read-only………..-1.2%…..-1.8%
  • linkbench………….1.9%……+overall……..0.6%

Here are also the detailed results for In-Memory Sysbench and SSD Sysbench curtesy of @mdcallag.

Learn more here.