Rex: Replication at the Speed of Multi-core by Zhenyu Guo, Chuntao Hong, Mao Yang, Dong Zhou, Lidong Zhou, and Li Zhuang at EuroSys 2014.

Like the C5 paper I just reviewed, Rex is about improving asynchronous replication performance on multi-core, but I thought the C5 paper was humdrum whereas Rex is thought-provoking.

Black-and-white illustration of cross-section of a neuron. Its central body contains bubbles or balls. White lines twist among the balls and spread outward into bundles of filaments leading to the edge of the image. This and many other illustrations are from a 19th Century book titled “The principles of light and color”.

Warm up your neurons because they're going to get a workout.

Background #

Usually in asynchronous replication, one server is the leader for a while and makes modifications to its data. It logs these modifications in a sequence, which it streams to followers. They replay these modifications to their copies of the data, in the same order. Thus there is a total order of states; any state that the leader passes through is eventually reflected on each follower in the same order. Clients can read from the leader or followers, and they’ll see the same sequence of states, although the followers may lag.

Diagrame of two databases labelled “primary” and “follower”. Writes go to the primary, logs go from primary to follower, and reads come from the primary and follower.

So far so good. Distributed Systems 101.

But what if it didn’t have to be that way?

Black-and-white illustration of a person, seen from the neck up, androgynous and bald, with curved lines indicating some kind of magnetic field or lines of energy surrounding and crossing the head.

Galaxy brain.

Partial-Order Replication #

The Rex paper’s insight is, the leader only needs to guarantee a partial order of events. On the leader if two transactions are running on concurrent threads t₁ and t₂ and they attempt to modify the same data x, some concurrency protocol decides the outcome. For example, if your database wants to guarantee serializability, it might use two-phase locking. Let’s say t₁ gets the write-lock L on x first and updates x, then unlocks L. Then, t₂ locks L, updates x, and unlocks L. This sequence of events guarantees that t₁’s write happens-before t₂’s, and the value of x after both transactions commit is the value t₂ wrote.

(Forgive me, this will get interesting in a second.)

A sequence diagram of two threads. Thread 1 experiences three events labeled “New Request”, “Lock L”, and “Unlock L”. Thread 2 experiences the same three events. A line goes from Thread 1’s event “Lock L” to Thread 2’s event “Unlock L”, indicating that Thread 2 can’t lock L until Thread 1 unlocks it.

I adapted this from the author's Figure 2.

On the other hand if a pair of threads t₃ and t₄ write to disjoint data, they’ll share no locks and their writes are concurrent: neither one happened before or after the other.

(OK, now it’s going to get interesting.)

The Rex leader logs all the lock/unlock events executed by t₁, t₂, t₃, and t₄. But it does not decide what order they occur! It only logs that t₁’s unlock happens-before t₂’s lock, then streams that information to the followers. So followers learn about a partial order of lock/unlock events, and replay them in some order that matches the leader’s partial order. Each follower is free to replay them in whatever order it wants so long as it obeys the rule, “t₁ unlocks L before t₂ locks L”. Thus followers can execute with as much parallelism as the leader did, allowing higher throughput than if they had to execute the writes in some total order.

My example above is for a database that guarantees serializability, and its mechanism is two-phase locking. But any guarantee and any mechanism would work with Rex. The point is that the primary decides what happens-before relationships must be obeyed, and the followers obey them.

The Rex authors observe that followers in most systems suffer low parallelism, because they’re obeying a total order for the sake of consistency, but:

This tension between concurrency and consistency is not inherent because the total ordering of requests is merely a simplifying convenience that is unnecessary for consistency.

We’ll revisit this claim below; I have concerns about “consistency”.

Black-and-white illustration of cross-section of brain.

Capturing Partial Order #

How does the leader learn the partial order of writes and transmit it to the followers? In Rex, locks are wrapped with a shim which captures happens-before relationships on the leader, and enforces them on the followers. The Rex authors call these relationships “causal edges”:

// Pseudocode of the Lock and Unlock wrappers.
class RexLock {
public:
  void Lock() {
    if (env::local_thread_mode == RECORD) {
      AcquireLock(real_lock);
      RecordCausalEdge();
    } else if (env::local_thread_mode == REPLAY) {
      WaitCausalEdgesIfNecessary();
      AcquireLock(real_lock);
    }
  }

  void Unlock() {
    if (env::local_thread_mode == RECORD) {
      RecordCausalEdge();
    }
    ReleaseLock(real_lock);
  }
};

So if t₁ unlocks L on the leader and then t₂ locks it, L records that the second event depends on the first, in RecordCausalEdge. Rex sends this information to the followers, who wait for t₁ to unlock L before t₂ locks it, with WaitCausalEdgesIfNecessary. This happens whether t₂ actually has to wait to lock L on the leader, or just locks L some time after t₁ unlocks it.

The paper doesn’t describe the exact format of the log, but the log somehow expresses which threads executed which operations, and the order in which they acquired and released every lock. Visually it’s a DAG:

The lines labeled “c₁” and “c₂” are two possible cuts through the DAG. The authors say:

A cut is consistent if, for any causal edge from event e₁ to e₂ in the trace, e₂ being in the cut implies that e₁ is also included in the cut. An execution reaches only consistent cuts. Figure 2 shows two cuts c₁ and c₂, where c₁ is consistent, but c₂ is inconsistent.

“Consistent” here seems a bad choice of jargon, we’ve so overloaded that poor word already. In any case, a cut is ready to be packed up and shipped to the secondaries so long as it doesn’t slice across a causal edge the wrong way. If the primary sent all the events above c₂ to the secondary, the secondary thread t₁ would be stuck at its event 4, because it’s waiting for t₂’s event 3, which isn’t included in the cut. No such hangup occurs if the primary uses the “consistent” cut c₁.

For this definition to work, imagine that each cut includes all events to the beginning of time. In practice, of course, the primary only has to send events that are in the newest cut and not in previous cuts.

A bald person in profile, with portions of the head labeled “moral and spiritual powers”, “higher energies”, “reason”, “perception”, “animal energies”, “thoracic”, “brachial”, “dorsal”, “crural”.

Annoying Details #

I skimmed Section 3, which buries the reader in Paxos junk. It describes how leaders are demoted or elected, how servers are added or removed, how intermediate states are checkpointed for the purposes of crash-recovery, rollback, and garbage collection. I sympathize with the authors: they had to describe their ideas in terms of Paxos, the main consensus protocol in 2014, and Paxos has so much ambiguity and variation that the authors must detail their choices about every aspect of the protocol. If they’d come just a bit later they could have built on Raft instead and taken a lot of these details for granted, focusing on their main contribution instead.

Black-and-white illustration of cross-section of brain with starburst pattern drawn from the center.

My Evaluation #

The Rex authors want to increase replication throughput by improving parallelism on secondaries. They claim Rex is “a new multi-core friendly replicated state-machine framework that achieves strong consistency while preserving parallelism in multi-thread applications.” I agree that the protocol is multi-core friendly, but C5 is a simpler way to achieve that goal, and as in my review of the C5 paper, I’ll warn that parallelism alone won’t save you from replication lag.

What do the authors mean when they say Rex “achieves strong consistency”? That word is so pathologically abused, it has at least two meanings in this paper alone! In the sentence above, “consistency” seems to mean what we usually call “convergence”: if you stop sending writes to the primary, eventually all replicas will have the same state. From a client’s point of view this is “eventual consistency”, and it’s parsecs from the usual meaning of “strong consistency”. See Pat Helland’s Don’t Get Stuck in the “Con” Game. I’ll forgive the authors, there’s been a lot of work since 2014 to clarify our jargon.

Leaving jargon aside, what’s the user experience when querying a Rex secondary? There’s no guarantee besides convergence. Independent events happen in different orders on different secondaries, temporarily producing states that never existed on the primary. Other systems such as MongoDB have weak guarantees on secondaries but we’re able to build stronger consistency on top of them. For example, MongoDB guarantees causal consistency on secondaries by reading at the majority-committed timestamp. We make this guarantee and also achieve high parallelism during replication (see the C5 review). MongoDB’s guarantee relies on its ability to read a past snapshot on all replicas at a specific timestamp. This is either impossible with Rex, or would require a degree of coordination that I suspect would obviate Rex’s advantage.

The Rex paper expanded my mind with the idea of partial-order replication, but I don’t immediately see a practical use for it.

Black-and-white illustration of cross-section of brain.

Images from The Principles of Light and Color, Edwin D. Babbit 1878.

A. Jesse Jiryu Davis

Paper Review: Rex: Replication at the Speed of Multi-core

Background #

Partial-Order Replication #

Capturing Partial Order #

Annoying Details #

My Evaluation #