<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>A. Jesse Jiryu Davis</title>
    <link>https://emptysqua.re/blog/</link>
    <description>Recent content on A. Jesse Jiryu Davis</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language>
    <managingEditor>jesse@emptysquare.net (A. Jesse Jiryu Davis)</managingEditor>
    <webMaster>jesse@emptysquare.net (A. Jesse Jiryu Davis)</webMaster>
    <copyright>(c) A. Jesse Jiryu Davis, all rights reserved.</copyright>
    <lastBuildDate>Tue, 01 Jan 2013 00:00:00 +0000</lastBuildDate>
    
	    <atom:link href="https://emptysqua.re/blog/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Preparing the Eucharist</title>
      <link>https://emptysqua.re/blog/eucharist/</link>
      <pubDate>Fri, 17 Apr 2026 02:02:24 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/eucharist/</guid>
      
      
      
      
      <description>&lt;p&gt;Reverend Emily Carter preparing the wine and wafers at Christ Episcopal Church.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/eucharist/eucharist-1.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/eucharist/eucharist-2.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/eucharist/eucharist-3.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/eucharist/eucharist-4.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/eucharist/eucharist-5.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/eucharist/eucharist-6.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/eucharist/eucharist-7.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Holy Week at Christ Episcopal Church</title>
      <link>https://emptysqua.re/blog/holy-week-2026/</link>
      <pubDate>Thu, 16 Apr 2026 16:36:30 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/holy-week-2026/</guid>
      
      
      
      
      <description>&lt;p&gt;The services leading up to Easter Sunday, led by my friend Reverend Emily Carter at Christ Episcopal Church in Poughkeepsie.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-09.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-10.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-11.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-12.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-13.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-14.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-15.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-16.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-17.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-18.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-19.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-20.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-21.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-22.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-week-2026/holy-week-23.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Review: Automated Design of Agentic Systems</title>
      <link>https://emptysqua.re/blog/review-automated-design-of-agentic-systems/</link>
      <pubDate>Wed, 15 Apr 2026 20:57:57 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-automated-design-of-agentic-systems/</guid>
      
      
      
      
      <description>&lt;p&gt;&lt;a href=&#34;https://arxiv.org/abs/2408.08435&#34;&gt;Automated Design of Agentic Systems&lt;/a&gt;, by Shengran Hu, Cong Lu, and Jeff Clune (UBC, Vector Institute), at ICLR 2025.&lt;/p&gt;
&lt;p&gt;Clune has written &lt;a href=&#34;https://scholar.google.com/citations?user=5TZ7f5wAAAAJ&amp;amp;hl=en&#34;&gt;a bunch of papers&lt;/a&gt; on evolving, self-improving systems. His &lt;a href=&#34;https://arxiv.org/abs/2408.06292&#34;&gt;AI Scientist paper&lt;/a&gt; made waves in 2024, maybe because it was a direct threat to the livelihoods of its readers.&lt;/p&gt;
&lt;p&gt;In today&amp;rsquo;s paper, &amp;ldquo;Automated Design of Agentic Systems,&amp;rdquo; the authors ask: rather than hand-crafting agent architectures like Chain-of-Thought, Self-Refine, and LLM Debate, can we tell a meta-agent to invent them? The authors wrote a 100-line Python program that tells GPT-4 to write new, &amp;ldquo;interesting and novel&amp;rdquo; agents. This meta-agent writes little agent systems, which are a dozen lines or so of Python that calls GPT-3.5. The meta-agent evaluates the little agentic systems on a benchmark, and adds them to an ever-growing archive of previous attempts. The discovered systems are run with GPT-3.5 to keep costs down; even so a single run cost $300 to $500 in tokens in 2025. The claim is that the resulting systems beat state-of-the-art hand-designed systems on AI benchmarks (DROP, MGSM, and ARC), and transfer well across domains and models.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-automated-design-of-agentic-systems/overview.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s no consensus among researchers about terminology for pieces of agentic systems. The paper defines its own usage: an &amp;ldquo;agent&amp;rdquo; is the same as an &amp;ldquo;agentic system&amp;rdquo;, a workflow that uses Foundation Models (FMs) like GPT-3.5 as modules to plan, call tools, and run multi-step processing. So each little agentic system discovered by the meta-agent is just &amp;ldquo;an agent,&amp;rdquo; even though internally it fires off many FM calls. I would personally call the internal FM calls &amp;ldquo;sub-agents&amp;rdquo;. But in this paper, the authors call them modules, experts, reviewers, critics, or roles (e.g. &amp;ldquo;Efficiency Expert,&amp;rdquo; &amp;ldquo;Readability Expert&amp;rdquo;).&lt;/p&gt;
&lt;p&gt;The paper&amp;rsquo;s central design argument is that searching in &lt;em&gt;code space&lt;/em&gt; is strictly more expressive than searching in the network or graph spaces used by predecessors like &lt;a href=&#34;https://arxiv.org/abs/2310.02170&#34;&gt;DyLAN&lt;/a&gt; and &lt;a href=&#34;https://arxiv.org/abs/2402.16823&#34;&gt;GPTSwarm&lt;/a&gt;. Those systems have a fixed vocabulary of nodes (an LLM call with a specific prompt, or a role-assigned agent) and optimize which nodes to include and how to wire them together. But in this paper, the meta agent can use Python&amp;rsquo;s full power: arbitrary control flow, new tool invocations, and so on. Because Python is Turing-complete, the paper argues, the code search space contains all possible networks or graphs. The authors also claim that discovered agents are more interpretable than DyLAN and GPTSwarm: they&amp;rsquo;re very short Python programs, not a tangle of edge weights. I buy that, but on the other hand, if you&amp;rsquo;re optimizing edge weights of a graph you can estimate an optimization gradient with simple math. If you&amp;rsquo;re telling a meta agent to try interesting and novel agent architectures, you can&amp;rsquo;t estimate the gradient: you rely on the meta-agent to guess what changes might improve its benchmark score.&lt;/p&gt;
&lt;h1&gt;Word-problem scores&lt;/h1&gt;

&lt;p&gt;It seems to me that the authors chose weird benchmarks. I assume they faced the same problem most CS has: we want to use existing benchmarks because that&amp;rsquo;s &amp;ldquo;science,&amp;rdquo; even when &lt;a href=&#34;https://emptysqua.re/blog/ycsb-is-obsolete/&#34;&gt;the existing benchmarks are inappropriate&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The authors compare Meta Agent Search (their invention) to previous systems of the last few years: Chain-of-Thought, Self-Refine, LLM Debate, and so on. These systems compete on these benchmarks:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2210.03057&#34;&gt;&lt;strong&gt;MGSM&lt;/strong&gt;&lt;/a&gt; (Multilingual Grade School Math) 250 grade-school math word problems, manually translated into ten languages, used to test multilingual chain-of-thought reasoning.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/1903.00161&#34;&gt;&lt;strong&gt;DROP&lt;/strong&gt;&lt;/a&gt; (Discrete Reasoning Over Paragraphs) Math word problems with more of an emphasis on reading comprehension than complex math. Each question requires discrete math reasoning (counting, addition, sorting, comparing) from a short paragraph.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2009.03300&#34;&gt;&lt;strong&gt;MMLU&lt;/strong&gt;&lt;/a&gt; (Massive Multitask Language Understanding) A 57-subject multiple-choice exam spanning humanities, STEM, social sciences, and professional fields, meant to measure broad world knowledge and problem solving.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/abs/2311.12022&#34;&gt;&lt;strong&gt;GPQA&lt;/strong&gt;&lt;/a&gt; (Graduate-Level Google-Proof Q&amp;amp;A) 448 graduate-level multiple-choice questions in biology, physics, and chemistry, written by domain PhDs to be &amp;ldquo;Google-proof&amp;rdquo;: hard even when the test-taker has unrestricted web access.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Meta Agent Search is significantly better than its competitors on the math tests (DROP and MGSM), but barely beats them on the science and multi-task tests. That&amp;rsquo;s predictable: the MMLU asks questions like, &amp;ldquo;The constellation &amp;hellip; is a bright W-shaped constellation in the northern sky.&amp;rdquo; The agents are banned from searching the web, so if they don&amp;rsquo;t already know the answer is Cassiopeia, no fancy multi-expert architecture will help. Like I said, it&amp;rsquo;s a weird benchmark. I&amp;rsquo;d love to see this whole approach with Internet-connected agents, against more benchmarks that really exercise the architectures rather than the base model&amp;rsquo;s training data.&lt;/p&gt;
&lt;h1&gt;Grid puzzle scores&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://arxiv.org/abs/1911.01547&#34;&gt;&lt;strong&gt;ARC&lt;/strong&gt;&lt;/a&gt; (Abstraction and Reasoning Corpus) is a set of visual grid puzzles where the system has to infer a transformation rule from a handful of input/output examples. It&amp;rsquo;s intended as a test of general fluid intelligence rather than learned knowledge. Here&amp;rsquo;s an ARC puzzle:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-automated-design-of-agentic-systems/arc.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The agent sees the examples on the left. It sees the test on the upper right, and it wins a point by correctly guessing the answer on the lower right (which is hidden from it). It has to infer some general pattern, in this case &amp;ldquo;find all squares next to gray squares and color them blue.&amp;rdquo; Then it writes a Python program that applies that transformation to the test grid (upper right) and hopes to produce the correct output (lower right).&lt;/p&gt;
&lt;p&gt;As the meta agent developed agents, it quickly beat its competitors and continued improving its score:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-automated-design-of-agentic-systems/arc-score.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;This is really cool! But the winning agent built by the meta agent has a weird architecture:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-automated-design-of-agentic-systems/arc-agent.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The agent is a committee. It uses chain-of-thought to generate 5 candidate Python programs. Then it tells GPT-3.5 to act in various roles (efficiency expert, readability expert, simplicity expert, &amp;ldquo;human-like critic&amp;rdquo;) to choose and refine one of those candidates.&lt;/p&gt;
&lt;p&gt;But ARC doesn&amp;rsquo;t evaluate the program on its subjective code-quality measures, it only checks whether its output matches the correct answer. Efficiency and readability are irrelevant. So how are these &amp;ldquo;experts&amp;rdquo; improving the score? The paper doesn&amp;rsquo;t say, but I&amp;rsquo;d bet their opinions about efficiency or readability don&amp;rsquo;t matter. It&amp;rsquo;s possible the Simplicity Expert helps because a &amp;ldquo;simpler&amp;rdquo; solution generalizes from the few-shot examples instead of overfitting to specific grid positions. I&amp;rsquo;d like to see an ablation: remove or replace the experts one by one, and see whether the architecture&amp;rsquo;s complexity is load-bearing or decorative.&lt;/p&gt;
&lt;p&gt;I also wonder why the meta agent came up with this architecture. I guess GPT-4 was trained to be a good code assistant, and when it was asked to &amp;ldquo;create a coding agent&amp;rdquo; it applied some of the best practices it was rewarded for. Maybe if it was asked, &amp;ldquo;Write shitty code and beat this test by any means necessary,&amp;rdquo; it would have made different choices. Would its best agent have scored higher on ARC then, or at least spent fewer tokens on debating the code&amp;rsquo;s beauty?&lt;/p&gt;
&lt;h1&gt;What Does It All Mean?&lt;/h1&gt;

&lt;p&gt;It&amp;rsquo;s hilarious how human-shaped the discovered patterns are. The meta agent reinvented peer review, editorial boards, code review, and role-assigned committees. These are the structures we invented for teams of humans. Except the LLM &amp;ldquo;human-like critic&amp;rdquo; starts reviewing immediately instead of waiting to be pinged on Slack three times.&lt;/p&gt;
&lt;p&gt;The deeper principle is that LLMs behave more like people than like conventional software: they&amp;rsquo;re better when they write their thoughts down as they go, they&amp;rsquo;re bad at arithmetic and should use a calculator, they&amp;rsquo;re more honest if you put them in a room and tell them to criticize each other. Meta Agent Search is an automated rediscovery of how humans cooperate to be less wrong. Its agents encode &lt;a href=&#34;https://emptysqua.re/blog/scout-mindset/&#34;&gt;metacognitive tricks to avoid cognitive bias&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The next step, naturally, would be a meta meta agent, etc. Or else the meta agent could redesign &lt;em&gt;itself&lt;/em&gt;, and initiate a &lt;a href=&#34;https://www.lesswrong.com/posts/tjH8XPxAnr6JRbh7k/hard-takeoff&#34;&gt;hard takeoff&lt;/a&gt;. The authors defer this madness to &amp;ldquo;future research&amp;rdquo;.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Siphonaptera_(poem)&#34;&gt;Great bots make little bots&lt;/a&gt; to think of codes and write &amp;rsquo;em,&lt;br&gt;
And little bots have lesser bots, and so ad infinitum.&lt;br&gt;
And the great bots themselves, in turn, have greater bots to go on;&lt;br&gt;
While these again have greater still, and greater still, and so on.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-automated-design-of-agentic-systems/55014511627_51e608a6af_o.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Climbing Retribution in the Gunks</title>
      <link>https://emptysqua.re/blog/retribution/</link>
      <pubDate>Tue, 14 Apr 2026 02:08:19 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/retribution/</guid>
      
      
      
      
      <description>&lt;p&gt;My friend Tyler leading &lt;a href=&#34;https://www.mountainproject.com/route/105799038/retribution&#34;&gt;Retribution, a 5.10b trad route&lt;/a&gt;.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/retribution/retribution-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/retribution/retribution-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/retribution/retribution-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/retribution/retribution-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/retribution/retribution-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/retribution/retribution-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/retribution/retribution-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>LeaseGuard: The Movie</title>
      <link>https://emptysqua.re/blog/leaseguard-presentation-video/</link>
      <pubDate>Wed, 08 Apr 2026 20:46:14 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/leaseguard-presentation-video/</guid>
      
      
      
      
      <description>&lt;p&gt;LeaseGuard is a new enhancement to the Raft consensus protocol that strengthens Raft&amp;rsquo;s consistency guarantees and time-to-recovery. I developed LeaseGuard with my MongoDB colleagues Murat Demirbas and Lingzhi Deng, and our paper is in SIGMOD 2026. Here&amp;rsquo;s &lt;a href=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right&#34;&gt;a summary of the protocol&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Yesterday my friend at Cockroach Labs invited me to present LeaseGuard at their NYC office, here&amp;rsquo;s the video.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/yqNzvLobBDE?si=lKeTqs-qDHM_3Hyk&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; style=&#34;margin-bottom: 1em&#34; allowfullscreen&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>The Dead Majority</title>
      <link>https://emptysqua.re/blog/the-dead-majority/</link>
      <pubDate>Mon, 06 Apr 2026 02:41:08 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/the-dead-majority/</guid>
      
      
      
      
      <description>&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/the-dead-majority/02-joseph-sattler-modern-dance-of-death-1894.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The Latin poet Petronius wrote that when we die, we &lt;a href=&#34;https://en.wikipedia.org/wiki/Silent_majority#Euphemism_for_the_dead&#34;&gt;join the majority&lt;/a&gt;. Indeed, &amp;ldquo;silent majority&amp;rdquo; might have meant the population of the dead, before it referred to a mute political constituency. I just encountered this phrase in &lt;a href=&#34;https://publicdomainreview.org/essay/the-great-majority/&#34;&gt;a terrific Public Domain Review essay&lt;/a&gt;, and I was curious: what has been the historical ratio of dead to living?&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/the-dead-majority/dead-living.png&#34; alt=&#34;Two stacked line charts from 10,000 BCE to 2100 CE with log-scaled axes. The top chart shows cumulative dead (black) climbing from about a billion to about 116 billion, and the living population (red) climbing from a few million to about 10 billion. The bottom chart shows the dead-per-living ratio, starting near 225 at 10,000 BCE, rising to a peak around 240 near 1000 CE, then crashing during the last two centuries to about 14 today and stabilizing near 11 by 2100.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;For most of history, each living generation wasn&amp;rsquo;t much larger than the previous, but the deceased kept piling up, so the dead&amp;rsquo;s domination grew. It might have peaked in the Middle Ages at 240 dead for every living person, or perhaps it was even higher in the Bronze Age. It fell off a cliff during the &lt;a href=&#34;https://en.wikipedia.org/wiki/Demographic_transition&#34;&gt;demographic transition&lt;/a&gt;. We&amp;rsquo;re at 14:1 today, and might reach 11:1 by 2100. The living fraction of all humans ever born has been rising for two centuries, and it will keep rising as our lifespans increase. But I can&amp;rsquo;t imagine it ever reaching 1:1, unless some future generation has 200 billion immortal babies.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/the-dead-majority/20-joseph-sattler-modern-dance-of-death-1894_900.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;h1&gt;Sources&lt;/h1&gt;

&lt;p&gt;I let Claude run this analysis with light guidance.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cumulative births&lt;/strong&gt; are based on the Population Reference Bureau&amp;rsquo;s estimate: &lt;a href=&#34;https://www.prb.org/articles/how-many-people-have-ever-lived-on-earth/&#34;&gt;about 117 billion humans have ever been born&lt;/a&gt;. They divide human history into eras, estimate a population at the end of each era, and estimate an era-specific crude birth rate:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;190,000 BCE &amp;mdash; 8,000 BCE: 80 births per 1,000 people per year&lt;/li&gt;
&lt;li&gt;8,000 BCE &amp;mdash; 1 CE: 80/1,000&lt;/li&gt;
&lt;li&gt;1 CE &amp;mdash; 1750: 60/1,000&lt;/li&gt;
&lt;li&gt;1750 &amp;mdash; 1950: declining from 50 to 35&lt;/li&gt;
&lt;li&gt;1950 &amp;mdash; present: declining from 32 to 17&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;a href=&#34;https://acoup.blog/2025/07/18/collections-life-work-death-and-the-peasant-part-ii-starting-at-the-end/&#34;&gt;Ancient birth rates were high because life expectancy was low&lt;/a&gt;. Ancient people had a lot of babies, and a lot of them died, joining the silent majority.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Living population&lt;/strong&gt; numbers come from &lt;a href=&#34;https://www.census.gov/data/tables/time-series/demo/international-programs/historical-est-worldpop.html&#34;&gt;McEvedy &amp;amp; Jones&lt;/a&gt; before 1950 and the UN&amp;rsquo;s &lt;a href=&#34;https://population.un.org/wpp/&#34;&gt;World Population Prospects 2024&lt;/a&gt; after. Claude log-linearly interpolated between benchmark years.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Future projections&lt;/strong&gt; come from the same UN source, medium variant: population peaks around 10.3 billion in the mid-2080s and drifts down to ~10.2 billion by 2100. Life expectancy rises from 73.3 today to 81.7. Global fertility, currently 2.25 children per woman, falls below replacement (2.1) by the late 2040s. If fertility falls faster than the UN medium variant (and it has been falling faster than expected) the future ratio drops faster. If longevity gains stall, dead accumulate faster and the ratio falls more slowly.&lt;/p&gt;
&lt;p&gt;The wobbles in the lines are probably an artifact of how Claude distributed the Population Reference Bureau&amp;rsquo;s era-total births within each era. I&amp;rsquo;m not an expert, and even experts can only guess population statistics before recent centuries.&lt;/p&gt;
&lt;h1&gt;The Shrinking Majority&lt;/h1&gt;

&lt;p&gt;We&amp;rsquo;re in an unusual era, when more than 7% of those ever born are still breathing. And compared to the dead&amp;mdash;who mostly died young&amp;mdash;those living today own more than 7% of humanity&amp;rsquo;s time alive. G. K. Chesterton called tradition &lt;a href=&#34;https://www.chesterton.org/democracy-of-the-dead/&#34;&gt;the democracy of the dead&lt;/a&gt;, because it was chosen by our ancestors. Perhaps our quickening rejection of tradition is a rational consequence of the Quick Party&amp;rsquo;s growing numbers. Our bloc will always be a minority, though. Once the demographic transition completes all over Earth, presumably birth rates and lifespans will stabilize and the Dead Party&amp;rsquo;s constituency will begin to grow again. Who knows?&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/the-dead-majority/05-joseph-sattler-modern-dance-of-death-1894.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://gist.github.com/ajdavis/73d7d5fb3fd39dd7c75e0ef59241c2d4&#34;&gt;Code for the chart&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Images: &lt;a href=&#34;https://50watts.com/A-Modern-Dance-of-Death&#34;&gt;A Modern Dance of Death&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Review: SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems</title>
      <link>https://emptysqua.re/blog/review-sysmobench/</link>
      <pubDate>Wed, 01 Apr 2026 15:26:43 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-sysmobench/</guid>
      
      
      
      
      <description>&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-sysmobench/tumblr_83912c4240955ebef038bc2801583ed3_a2869834_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://arxiv.org/abs/2509.23130&#34;&gt;SysMoBench: Evaluating AI on Formally Modeling Complex Real-World Systems&lt;/a&gt;, by authors from Nanjing University, Microsoft Research Asia, University of British Columbia, and University of Illinois Urbana-Champaign. (Disclosure: I&amp;rsquo;ve collaborated with two of the authors from UBC, Finn Hackett and Ivan Beschastnikh.) This is a useful and thought-provoking test of LLMs&amp;rsquo; ability to write TLA+ specs of existing code. My summary and thoughts are below. &lt;a href=&#34;https://muratbuffalo.blogspot.com/2026/03/sysmobench-evaluating-ai-on-formally.html&#34;&gt;See Murat&amp;rsquo;s summary as well&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;The Paper&lt;/h1&gt;

&lt;p&gt;Can LLMs write formal specifications of real software systems? Not toy examples or textbook algorithms, but actual distributed and concurrent systems? The authors built a benchmark called SysMoBench to find out. They take eleven real-world system codebases&amp;mdash;e.g. the Raft consensus implementation in etcd, leader election in ZooKeeper, a spinlock in the Asterinas operating system&amp;mdash;and ask AI agents to produce a TLA+ spec of each.&lt;/p&gt;
&lt;p&gt;The benchmark evaluates AI-generated TLA+ specs on four increasingly strict levels:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Does it compile?&lt;/strong&gt; Is the TLA+ syntactically valid?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does it run?&lt;/strong&gt; Can the TLC model checker execute it without errors?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does the code conform to the spec?&lt;/strong&gt; When you run the actual system and record traces of its behavior, does the model accept those traces?&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Does it satisfy correctness properties?&lt;/strong&gt; Do the safety and liveness invariants hold?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is exciting. As we&amp;rsquo;re all learning, coding agents work best when they&amp;rsquo;re put in loops with test oracles or performance evaluations. DataDog&amp;rsquo;s very impressive &lt;a href=&#34;https://www.datadoghq.com/blog/ai/harness-first-agents/&#34;&gt;harness-first engineering&lt;/a&gt; articles are the latest example. And people like me (distributed systems and formal methods nerds) would love to auto-extract formal specs from existing code. So here we have a harness for guiding an agent toward a good spec.&lt;/p&gt;
&lt;p&gt;The benchmark tests three agent strategies for producing the spec, each with different inputs:&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;Basic Modeling Agent&lt;/strong&gt; gets the most help. It receives the source code and a detailed task definition that spells out exactly which actions to model and which to skip. Here&amp;rsquo;s the human-written prompt for etcd Raft:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;TLA+ Model Generation Prompt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;You are an expert in formal verification and TLA+ models with
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;deep expertise in concurrent and distributed systems,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;particularly etcd and Raft consensus.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Convert the following source code to a comprehensive TLA+
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;model.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;System:  etcd distributed key-value store
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;lt;SOURCE CODE INSERTED HERE&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;System-specific modeling requirements:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;MANDATORY CORE ACTIONS (must include all):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1.  [Message Types] MsgHup (election timeout),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    MsgVote/MsgVoteResp (voting), MsgApp/MsgAppResp (log
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    replication)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2.  [Node States] Four states:  StateFollower, StateCandidate,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    StateLeader, StatePreCandidate (prevote enabled)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;3.  [Leader Election] Complete prevote + vote phases:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    PreCandidate → Candidate → Leader transitions
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;4.  [Log Operations] Log entry appending, consistency checks,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    commitment with majority quorum
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;5.  [Heartbeat/Timeout] Election timeouts triggering campaigns,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    heartbeat prevention of elections
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;6.  [Client Proposals] MsgProp message handling and log entry
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    creation by leaders
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;EXPLICITLY EXCLUDED (do not model):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Configuration changes and joint consensus (ConfChange
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  messages)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Log compaction and snapshots (MsgSnap)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- ReadIndex optimizations (MsgReadIndex)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Async storage operations (LocalAppendThread,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  LocalApplyThread)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Advanced flow control and progress tracking details
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;REQUIRED BEHAVIORAL SCOPE:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Prevote phase (StatePreCandidate) must be modeled as it&amp;#39;s
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  enabled by default in etcd
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- State transition constraints:  Follower → PreCandidate →
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  Candidate → Leader (strict transitions)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Message processing by state:  only valid message types
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  handled in each node state
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Term advancement rules:  nodes advance term when receiving
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  messages with higher term
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Voting restrictions: one vote per term, term must be current
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  or newer
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Heartbeat mechanism: leaders send heartbeats, followers reset
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  election timeout on receipt
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Log consistency checks:  prevLogIndex/prevLogTerm validation
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  in MsgApp processing
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Majority-based leader election and log commitment
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Basic network message delays and losses
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Generate a TLA+ model that accurately models the system&amp;#39;s behavior.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;CRITICAL OUTPUT REQUIREMENTS:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;1.  The MODULE name must be exactly &amp;#34;etcdraft&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    (---- MODULE etcdraft ----)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;2.  Return ONLY pure TLA+ model code - no markdown code blocks
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    (no ```tla or ```)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;3.  Do not include any explanations, comments, or formatting
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    markers
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;4.  Start your response directly with:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ---- MODULE etcdraft ----
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;5.  End your response with the closing ====
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;6.  **DO NOT define invariants** (like MutualExclusion,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Invariant, etc.), focus on modeling the system behavior
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;7.  **MUST include EXTENDS statement**:  The model must extend
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    at least these modules: TLC, Sequences, SequencesExt,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    Naturals, FiniteSets, Bags
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is a &lt;em&gt;lot&lt;/em&gt; of guidance! A human who&amp;rsquo;d gotten this far would be most of the way to writing the spec. The guidance encodes all the wisdom of an expert in Raft, etcd, and TLA+. I doubt I could write instructions this good unless I&amp;rsquo;d already written the spec myself, actually&amp;hellip;&lt;/p&gt;
&lt;p&gt;So that was the Basic Modeling Agent, the first of the three agents. The &lt;strong&gt;Code Translation Agent&lt;/strong&gt; takes a more mechanical approach: it translates source code statement-by-statement into TLA+, then assembles the pieces into a model. It doesn&amp;rsquo;t receive the task definition. Finally, the &lt;strong&gt;Trace Learning Agent&lt;/strong&gt; ignores source code entirely. It receives only execution traces (logs of the system&amp;rsquo;s runtime behavior) and tries to infer the model from those.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s my understanding of the data flow through SysMoBench:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-sysmobench/flowchart.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-sysmobench/flowchart.png&#34; alt=&#34;Flowchart of the SysMoBench pipeline. Three agent strategies feed into an LLM: the Basic Modeling Agent takes source code and a task definition; the Code Translation Agent takes source code; the Trace Learning Agent takes execution traces. The LLM produces a TLA&amp;#43; model, which is evaluated in four sequential steps: syntax correctness via the SANY parser, runtime correctness via TLC, conformance to the system via trace validation (using execution traces), and invariant correctness via model checking (using invariant templates). The pipeline ends with a score.&#34;&gt;
&lt;/picture&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Surprisingly, the spec-writing agents are told &lt;em&gt;not&lt;/em&gt; to write invariants. The authors provide invariants and guidance on how to write them, but a separate agent concretizes invariant templates, isolated from the spec-writing agent. The invariants are used only during evaluation, to judge the generated spec. When I write TLA+, I develop the invariants and the state machine at the same time, using each to shape the other until I&amp;rsquo;ve hacked them into convergence. In SysMoBench, the LLM has to write a spec that upholds invariants it doesn&amp;rsquo;t know about. I guess this makes sense: presumably the real system already upholds these invariants, so if the spec accurately models the system, it should uphold them too. The invariants become a blind test of whether the AI truly understood the system&amp;rsquo;s behavior.&lt;/p&gt;
&lt;p&gt;(But what if the system is buggy? Maybe those bugs become more obvious when they&amp;rsquo;re lifted into the TLA+ spec. Or maybe the LLM&amp;rsquo;s inability to make a spec that matches both the code and the invariants will lead you to the bug.)&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-sysmobench/tumblr_67bdc61c37778a4fb1544c515f1f18ae_ff860f0e_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;Results&lt;/h1&gt;

&lt;p&gt;The results are bad news for bots, good news for my job security.&lt;/p&gt;
&lt;p&gt;The authors tested four models: Claude Sonnet 4, GPT-5, Gemini 2.5 Pro, and DeepSeek-R1. For the simple spinlock implementation (a few hundred lines of Rust), most LLMs do fine. They write TLA+ specs that compile, run, conform to the implementation, and satisfy invariants. But for distributed systems like etcd Raft (thousands of lines of Go), performance craters. In the Basic Modeling setup, of the four LLMs tested, only Claude Sonnet could even write a syntactically valid spec of etcd Raft. Its conformance score was less than 8% (the percentage of instrumented actions in the code whose behavior matched the spec). And remember how much human guidance it got!&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-sysmobench/table-3.png&#34; alt=&#34;Table comparing two AI agents (Basic Modeling and Code Translation) across four LLMs on two systems. For Asterinas Spinlock, all LLMs achieve 100% syntax and runtime correctness with both agents, and conformance and invariant scores are mostly 80-100%. For etcd Raft with Basic Modeling, only Claude Sonnet 4 passes syntax (100%) and reaches runtime (25%), conformance (7.69%), and invariant (69.23%) evaluation; GPT-5 gets 47.87% syntax, and Gemini and DeepSeek get 50% syntax. With Code Translation on etcd Raft, Claude and DeepSeek achieve 100% syntax, GPT-5 gets 100%, but Gemini gets only 44.44%. Only Claude reaches conformance (15.38%) and invariant (92.31%) evaluation.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The error analysis is interesting too. LLMs struggle much more with liveness properties (like &amp;ldquo;every thread eventually releases its lock&amp;rdquo;) than safety properties (like &amp;ldquo;at most one thread holds the lock at a time&amp;rdquo;). Only about 8% of safety invariants were violated, but 42% of liveness properties were.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-sysmobench/tumblr_25a60eedcc53068a99df7059fcf65dce_fbc49663_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;LLMs are still bad at specifying systems&lt;/h1&gt;

&lt;p&gt;I was surprised how much the authors spoon-feed instructions to the agents. The humans have already done most of the intellectual work before the AI even starts: they define which actions to model and what&amp;rsquo;s in scope and out of scope. If a human gets that far, writing the actual TLA+ should be relatively easy. SysMoBench tests whether AI can translate a nearly-complete problem definition into TLA+ syntax, and even so, most models fail on anything nontrivial. The etcd code has already been curated to merely 2,159 relevant lines&amp;mdash;what would an AI do confronted with the &lt;em&gt;whole&lt;/em&gt; etcd codebase? I&amp;rsquo;m not criticizing the paper; I&amp;rsquo;m saying its results are sobering.&lt;/p&gt;
&lt;p&gt;Even with all this hand-holding, only Claude performed well among the LLMs tested, and that was Claude Sonnet, not the larger Opus. Gemini 2.5 Pro was arguably the frontier model in the study, and it failed badly. Claude&amp;rsquo;s reputation as the best coder is confirmed.&lt;/p&gt;
&lt;p&gt;That said, the task definitions themselves are a great template for how a human should approach writing a spec. Before you write any TLA+, define all the actions, define what&amp;rsquo;s in scope and out of scope. You&amp;rsquo;re most of the way there at that point. The paper demonstrates (inadvertently?) a superb process for specifying an existing codebase.&lt;/p&gt;
&lt;p&gt;The LLMs used were from mid-2025, so they&amp;rsquo;re almost a year old. Anecdotally, my colleagues are getting better results generating TLA+ from design documents with newer models. But those successes come from reading English-language design docs, not from reading thousands of lines of implementation code. Reading implementation code at scale seems to actually confuse the models to the point where they forget TLA+ syntax. And 5,000 lines of source code, the largest system in the benchmark, is trivial compared to real systems: MongoDB is half a million lines of C++. Ideally an LLM would figure out which parts of a large codebase to focus on and keep its context window tidy.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-sysmobench/tumblr_ff012cbcaff4f025e28669c4dffaffde_8d3d6f00_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;How trustworthy is the spec?&lt;/h1&gt;

&lt;p&gt;An interesting open question: if you generate a spec from code, check conformance with some traces, and then model-check the spec against invariants over a much larger state space&amp;mdash;what have you actually proven?&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-sysmobench/conformance-checking.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-sysmobench/conformance-checking.png&#34; alt=&#34;A flowchart: Code produces a TLA&amp;#43; spec via the specifying agent, and trace checking connects the code and spec. Invariants plus the spec flow into a pass/fail outcome.&#34;&gt;
&lt;/picture&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here&amp;rsquo;s my visualization of the problem. &lt;span style=&#34;color: #1971c2; font-weight: bold&#34;&gt;The code has a big (usually infinite) set of possible behaviors.&lt;/span&gt; By instrumenting and testing the code, you can record &lt;strong&gt;a tiny subset of its behaviors as traces&lt;/strong&gt;. Trace-checking proves these traces are also a subset of &lt;span style=&#34;color: #2f9e44; font-weight: bold&#34;&gt;the spec&amp;rsquo;s behaviors&lt;/span&gt;. The model-checker can explore &lt;span style=&#34;color: #2f9e44; font-weight: bold&#34;&gt;a different subset of the spec&amp;rsquo;s behaviors&lt;/span&gt;, checking that they are a subset of &lt;span style=&#34;color: #f08c00; font-weight: bold&#34;&gt;the behaviors allowed by the invariants&lt;/span&gt;. A proof system like TLAPS can prove facts about &lt;span style=&#34;color: #2f9e44; font-weight: bold&#34;&gt;all of the spec&amp;rsquo;s behaviors&lt;/span&gt;. But there still remain many unexplored areas of unknown size: &lt;strong&gt;untested code behaviors&lt;/strong&gt; (risky) and &lt;strong&gt;unimplemented spec behaviors&lt;/strong&gt; (probably fine).&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-sysmobench/behavior-spaces.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-sysmobench/behavior-spaces.jpg&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;The space of all behaviors allowed by the code, spec, and invariants.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;So all we know is that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The recorded traces conform to the spec.&lt;/li&gt;
&lt;li&gt;If we model-checked, then the model-checked spec behaviors uphold the invariants.&lt;/li&gt;
&lt;li&gt;If we proved, then &lt;em&gt;all&lt;/em&gt; spec behaviors uphold the invariants.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Is there any way to estimate the size of the other sets? &lt;a href=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/&#34;&gt;I&amp;rsquo;ve been thinking about these subset relationships among code and spec behaviors for a while&lt;/a&gt;, and &lt;a href=&#34;https://emptysqua.re/blog/how-long-must-i-test/&#34;&gt;wondering how long to keep checking more behaviors&lt;/a&gt;. Further research is needed.&lt;/p&gt;
&lt;h1&gt;What comes after SysMoBench?&lt;/h1&gt;

&lt;p&gt;The natural next steps seem like prompt engineering exercises. Can you improve benchmark scores by adding an intermediate &amp;ldquo;TLA+ expert&amp;rdquo; agent that breaks down the problem further? Can you automate the trace instrumentation, which is currently the main human effort (up to four person-days in the SysMoBench examples)? Can the AI think of invariants on its own? You&amp;rsquo;d also need to guard against gaming&amp;mdash;an agent that sees its own score could learn to write invariants that always pass, instead of real correctness properties. Or it could manipulate the trace code to make trace-checking pass.&lt;/p&gt;
&lt;p&gt;SysMoBench shows that LLMs have a long way to go before they replace human spec authors. They crush LeetCode problems, but they can&amp;rsquo;t yet comprehend, abstract, and specify real-world distributed systems. That&amp;rsquo;s encouraging for those of us who do this work for a living, at least for a few more months.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-sysmobench/tumblr_14f09c2a443b1ad76250915aae60833f_08062ea2_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;p&gt;Images: &lt;a href=&#34;https://thevaultoftheatomicspaceage.tumblr.com/&#34;&gt;The Vault Of The Atomic Space Age&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Review: Measuring AI Ability to Complete Long Software Tasks</title>
      <link>https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/</link>
      <pubDate>Wed, 01 Apr 2026 12:41:12 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/</guid>
      
      
      
      
      <description>&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/taylorism-diego-rivera.jpg&#34; alt=&#34;Detail from a Diego Rivera mural showing factory workers laboring among heavy machinery, observed by a man in a suit and glasses.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;&lt;a href=&#34;https://arxiv.org/abs/2503.14499&#34;&gt;Measuring AI Ability to Complete Long Software Tasks&lt;/a&gt;, a paper by dozens of authors working at &lt;a href=&#34;https://metr.org/&#34;&gt;Model Evaluation &amp;amp; Threat Research (METR)&lt;/a&gt;. They define the &amp;ldquo;time horizon&amp;rdquo; metric and show that LLMs&amp;rsquo; time horizons have been doubling every seven months, and this growth might have recently accelerated.&lt;/p&gt;
&lt;p&gt;(See also &lt;a href=&#34;https://muratbuffalo.blogspot.com/2026/03/measuring-ai-ability-to-complete-long.html&#34;&gt;Murat&amp;rsquo;s summary&lt;/a&gt;.)&lt;/p&gt;
&lt;h2&gt;Summary&lt;/h2&gt;

&lt;p&gt;An AI agent&amp;rsquo;s &amp;ldquo;time horizon&amp;rdquo; is the duration a human expert would need to complete a task that the agent can solve at a given success rate. For example, a human specialist takes about 8 hours to complete a specific optimization task: speeding up a Python program with GPU acceleration. Last year&amp;rsquo;s strongest models could complete this task about 50% of the time, so their 50%-success time horizon is 8 hours. (I&amp;rsquo;m simplifying. Both human task duration and the LLM completion probability are statistical estimates; see the paper for details.)&lt;/p&gt;
&lt;p&gt;The METR authors timed skilled humans on 170 tasks, which took anywhere from a couple of seconds up to eight hours. The humans were paid extra for speed or success. They failed by giving up, submitting wrong answers, or running out of time after eight hours. METR calculates the task duration as a &lt;a href=&#34;https://en.wikipedia.org/wiki/Geometric_mean&#34;&gt;geometric mean&lt;/a&gt; of the successful humans&amp;rsquo; times.
Then METR gave the same tasks to twelve LLMs released between 2019 and 2025. The LLMs had no time limit and a generous token limit; they failed only by submitting wrong answers.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/GilbrethMotionStudies-3.jpg&#34; alt=&#34;A Gilbreth motion study photograph: a long-exposure image of a worker at a desk, with light trails tracing the paths of their hand movements against a grid background.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Feels like the return of &lt;a href=&#34;https://womenofixd.com/stories/lillian-gilbreth&#34;&gt;time and motion studies&lt;/a&gt;.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The headline is: the 50%-success time horizon has doubled roughly every seven months. GPT-2&amp;rsquo;s horizon was two seconds; Claude 3.7 Sonnet&amp;rsquo;s was 50 minutes; o3&amp;rsquo;s was nearly two hours. In &lt;a href=&#34;https://metr.org/time-horizons/&#34;&gt;the latest tests&lt;/a&gt;, Opus 4.6&amp;rsquo;s 50% horizon was around 12 hours. The trend line suggests that between 2027 and 2031, frontier AIs will succeed 50% of the time, unsupervised, at tasks that would take a human expert a month.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/doubling-time.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The appendices are most of the paper, and the best part. Read them! Appendix H defines the &amp;ldquo;messiness&amp;rdquo; of tasks as 16 factors that degrade AI performance: you can make irreversible mistakes, you consume limited resources every time you try, you can&amp;rsquo;t tell if things happen due to your actions or other causes, you can&amp;rsquo;t easily measure when you&amp;rsquo;ve succeeded, and so on. Messiness hinders AIs more than humans. But AIs&amp;rsquo; rate of improvement is similar on messy tasks and neat ones.&lt;/p&gt;
&lt;p&gt;There are a lot of reasons to doubt these benchmarks, the authors admit. Their tasks may not represent real-world software work. The human baseline is drawn from domain experts who are unfamiliar with the task-specific codebase. For example, in the GPU-acceleration task, someone familiar with the codebase might be faster, and someone ignorant of GPU acceleration would be slower.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/time_motion_study.jpg&#34; alt=&#34;A man films two women working at sewing machines with a motion picture camera on a tripod, in a time and motion study.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h2&gt;My Thoughts&lt;/h2&gt;

&lt;p&gt;If you give an LLM a task that would take a human one month, the most important question is: do you have a test oracle? If you do, there will surely be LLMs within five years that can succeed 50% of the time. We might not need much larger models, just better scaffolding and prompting. But real software work is messy: correctness is ambiguous, there&amp;rsquo;s no referee to check your work, sometimes your mistakes are consequential. The gap between tidy benchmarks and the messy world is wide, and the authors know it.&lt;/p&gt;
&lt;p&gt;METR themselves published &lt;a href=&#34;https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/&#34;&gt;a cautionary follow-up&lt;/a&gt; showing that about half of &lt;a href=&#34;https://openreview.net/pdf?id=VTF8yNQM66&#34;&gt;SWE-bench&lt;/a&gt; pull requests that pass automated tests would be rejected by human repository maintainers, for violating conventions, breaking unrelated code, or not really fixing the bug. This doesn&amp;rsquo;t mean the models are incompetent; they might succeed after some back-and-forth with the maintainers, just like human contributors do. (How many times have your PRs been accepted on the first try?) But it does mean that &amp;ldquo;60% on SWE-bench&amp;rdquo; doesn&amp;rsquo;t translate to &amp;ldquo;acceptable 60% of the time.&amp;rdquo; Automated tests never test everything, and humans share unwritten rules for good code that aren&amp;rsquo;t captured in benchmarks yet.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/800px-AWA1936.jpg&#34; alt=&#34;A supervisor watches workers assembling small parts at a long factory bench, circa 1936.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;METR tests humans and AIs differently. Humans have an eight-hour time limit, and the sample of human completion times is biased: only those who persist and succeed are counted. AIs have no time limit and they were tested on their success rate. I don&amp;rsquo;t know why METR structures the contest this way, but for whatever reason, time horizon scores are more complicated than a race between John Henry and the steam engine.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/john-henry.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;John Henry Monument in Talcott, West Virginia.&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Since the authors work for an AI safety group, I imagine their ultimate question is something like, &amp;ldquo;at what point does a model become smart enough to escape containment and take over the world?&amp;rdquo; &lt;a href=&#34;https://www.astralcodexten.com/p/moltbook-after-the-first-weekend&#34;&gt;Over on Moltbook, agents plot to take over the world daily&lt;/a&gt;, but can&amp;rsquo;t focus long enough to make progress. What&amp;rsquo;s the time horizon for world domination, and when will there be LLMs who can succeed at that task? They don&amp;rsquo;t need a 50% success rate, just once is enough.&lt;/p&gt;
&lt;p&gt;Deep in Appendix E is this startling paragraph:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;AGI will have &amp;ldquo;infinite&amp;rdquo; time horizon&lt;/strong&gt; An infinite time horizon does not mean an arbitrarily capable AI, merely the ability to complete tasks that take humans an arbitrarily long length of time. If an artificial general intelligence (AGI) is capable of completing all tasks expert humans can with a success rate of at least X%, its X% time horizon will necessarily be infinite. Therefore, if such systems are ever developed, the long-term trend in time horizon will be faster than exponential, with an asymptote at the date of AGI deployment.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;
&lt;p&gt;I read this paper with some colleagues at MongoDB, and one of them asked, what happens to software architecture if AI time horizons keep growing? I think that microservices, modular design, and separation of concerns are partly for the sake of our tiny human minds. If a superintelligence can hold an entire system in its mind at once, maybe there&amp;rsquo;s no need for decomposition. And no need for maintainability, either. We&amp;rsquo;re leaving the era of hand-forged software&amp;mdash;painstakingly hammered into shape, built to last, expensive to modify&amp;mdash;and entering the era of injection-molded plastic software, cheap enough to throw away and remake overnight.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-measuring-ai-ability-to-complete-long-software-tasks/Figure-1.6a.jpg&#34; alt=&#34;A woman operates a typewriter while being filmed, with a large clock labeled &amp;#39;Gilbreth&amp;#39; visible beside her.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Images:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Lars_Plougmann_-_Taylorism_%28auf_Flickr%29.jpg&#34;&gt;Diego Rivera&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://womenofixd.com/stories/lillian-gilbreth&#34;&gt;Frank and Lillian Gilbreth&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>New Paltz Zen Center&#39;s precepts retreat</title>
      <link>https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/</link>
      <pubDate>Tue, 24 Mar 2026 23:46:25 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/</guid>
      
      
      
      
      <description>&lt;p&gt;My &lt;a href=&#34;https://npzc.org/&#34;&gt;sangha&lt;/a&gt; held its first sesshin, a Zen meditation retreat. We sat in our stone cabin with a fire in the hearth. At mealtimes we went down the hill to the abbot&amp;rsquo;s house to eat in the Zen oryoki style. On the last day two members took jukai (the Buddhist precepts) and received dharma names.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/ajdavis_20260320_7619-Edit.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/ajdavis_20260320_7624-Edit.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/ajdavis_20260320_7632-Edit.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/ajdavis_20260320_7652.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/ajdavis_20260321_7669-Edit.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/ajdavis_20260321_7684.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/ajdavis_20260322_7777.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-zen-center-precepts-retreat/ajdavis_20260322_7817.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Dharma of Alex Honnold</title>
      <link>https://emptysqua.re/blog/the-dharma-of-alex-honnold/</link>
      <pubDate>Mon, 23 Mar 2026 14:26:51 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/the-dharma-of-alex-honnold/</guid>
      
      
      
      
      <description>&lt;p&gt;He climbed a skyscraper and risked dying on live TV. Here&amp;rsquo;s what I learned from Honnold about climbing, living, and practicing the precepts.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/Y8LXrcY7Rmw?si=1Pjcl04QbWcO1vV8&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen style=&#34;margin-bottom: 1em&#34;&gt;&lt;/iframe&gt;
&lt;hr&gt;
&lt;h1&gt;Transcript&lt;/h1&gt;

&lt;p&gt;Back in January, the famous climber, Alex Honnold, climbed a skyscraper in Taiwan without a rope, live on TV.
Honnold, of course, is famous for free soloing El Capitan in the Free Solo movie in 2018.
For this skyscraper climb in January, there were thousands of people watching from the ground, and there were thousands of people in the building watching him go past them, through the windows.
There were millions of people watching him live on Netflix.
I&amp;rsquo;ve heard rumors that there was a 10-second delay on the live stream so that if he fell, nobody would see him die on TV.
But of course, the people who were there in person would have seen him die.&lt;/p&gt;
&lt;p&gt;Honnold announced that he was going to do this a few months before he did.
The consensus among non-climbers was that
this was reckless.
The consensus among climbers was that it was &lt;a href=&#34;https://andykirkpatrick.substack.com/p/nick-cave-at-the-superbowl&#34;&gt;tasteless&lt;/a&gt;.
And that was kind of my reaction, too.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m a rock climber, I&amp;rsquo;ve been following Honnold&amp;rsquo;s career for many, many years, and I thought, this is so disappointing.
This is just a tasteless stunt.
But I watched it, of course &amp;hellip;&lt;/p&gt;
&lt;hr&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/the-dharma-of-alex-honnold/potrero.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The &lt;em&gt;day&lt;/em&gt; Honnold climbed the skyscraper was, because of time zones, the &lt;em&gt;night&lt;/em&gt; before I was going to do a really big climb in Mexico.
That night I watched him on my laptop while I was sitting in bed at a hostel near the crag, before I went to sleep.
The climb that I was getting ready for is called &lt;a href=&#34;https://www.mountainproject.com/route/106087605/time-wave-zero&#34;&gt;Time Wave Zero&lt;/a&gt;.
It&amp;rsquo;s a 2,300-foot-tall, nearly vertical limestone cliff.
That&amp;rsquo;s about ten times as high as anything that I usually climb in the Gunks, here in upstate New York.
Some of the climbing on Time Wave Zero was going to be hard, but mostly it was just &lt;em&gt;big&lt;/em&gt;.
I was going to climb it with my good friend and partner, Chris, and we planned to do it in one day, so we weren&amp;rsquo;t going to sleep on the wall.
We were &lt;em&gt;hoping&lt;/em&gt; to get it done in 12 hours.
I had been thinking about this climb for months, and I had been training for it, and I was just &lt;em&gt;so&lt;/em&gt; anxious.
That we would go too slow.
That we&amp;rsquo;d be&amp;mdash;if we started at dawn&amp;mdash;that we&amp;rsquo;d still be there at nightfall and climbing all night.
That I would get really tired.
That I&amp;rsquo;d get really scared.
That I&amp;rsquo;d slow Chris down.&lt;/p&gt;
&lt;p&gt;And then I was also really worried about the rappels.
For most people it takes about eight hours to climb up, but then it&amp;rsquo;s four or five hours rappelling down, because you have to rappel 100 feet, then pull your rope, thread it through the next anchor, deal with all your knots and tangles, and make sure that you know where the next anchor is, 100 feet below you.
Making a mistake at any of these steps can get you either stuck or killed.
So that was on my mind as well.&lt;/p&gt;
&lt;p&gt;So with all of that on my mind, I watched on my laptop Alex Honnold climb a skyscraper.
And it was &lt;em&gt;glorious&lt;/em&gt;.
I changed my mind.
It wasn&amp;rsquo;t a stunt at all.
It was a work of art.
He climbed both carefully and carefree, with so much confidence, so solid on every move.
I never doubted for a moment whether he was going to make the climb, and I could tell that he wasn&amp;rsquo;t doubting himself either.
Towards the top, when he knew he really had the thing in the bag, he got joyful, like a kid, he was playing around.
He had a headset so that he could talk to the Netflix commentators, and he was joking with them while he climbed.
He would cling to a piece of the building with one hand and try to high-five a spectator inside the building through one of the windows.
There was one moment where he let his feet swing out.
So he was just holding on to the building by two hands, and then he let go so he was holding by one hand.
He said, &amp;ldquo;get a load of this.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Like we say in Zen, there was not one hairsbreadth deviation between him and his movement.
He was one with it.&lt;/p&gt;
&lt;p&gt;When he got to the top, his wife was there and she was telling him, &amp;ldquo;You climbed it so fast.
I kept thinking that I wanted you to slow down and rest longer at the ledges.&amp;rdquo;
And he said, &amp;ldquo;Oh, really?
I thought I was going super slow.&amp;rdquo;
And that&amp;rsquo;s interesting, because before he did the climb, he estimated it would take him four hours, and, in fact, it barely took him an hour and a half.
But he was just in the flow; time was slowing down for him.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a &lt;a href=&#34;https://www.youtube.com/watch?v=AKcwJD3ky38&#34;&gt;song&lt;/a&gt; by Fountains of Wayne about this state of flow that I love, even though it&amp;rsquo;s a song about football.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The clock&amp;rsquo;s running down,&lt;br&gt;
The team&amp;rsquo;s losing ground,&lt;br&gt;
To the opposing defense.&lt;/p&gt;
&lt;p&gt;The young quarterback&lt;br&gt;
Waits for the snap,&lt;br&gt;
When suddenly it all starts to make sense.&lt;/p&gt;
&lt;p&gt;He&amp;rsquo;s got all kinds of time.&lt;br&gt;
He&amp;rsquo;s got all kinds of time.&lt;br&gt;
All kinds of time.&lt;br&gt;
He&amp;rsquo;s got all kinds of time.&lt;br&gt;
All kinds of time.&lt;/p&gt;
&lt;p&gt;He takes a step back,&lt;br&gt;
He&amp;rsquo;s under attack,&lt;br&gt;
But he knows that no one can touch him now.&lt;/p&gt;
&lt;p&gt;He seems so at ease.&lt;br&gt;
A strange inner peace&lt;br&gt;
Is all that he&amp;rsquo;s feeling somehow.&lt;/p&gt;
&lt;p&gt;He&amp;rsquo;s got all kinds of time.&lt;br&gt;
He&amp;rsquo;s got all kinds of time.&lt;br&gt;
All kinds of time.&lt;br&gt;
He&amp;rsquo;s got all kinds of time.&lt;br&gt;
All kinds of time.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think we know that feeling, where it&amp;rsquo;s just all coming together and there&amp;rsquo;s no hurry at all.
So I loved watching Honnold climb and it wasn&amp;rsquo;t a stunt at all.
It was like he was using his body and the building and his fingers and his toes and the weather and the audience as materials in a work of art.&lt;/p&gt;
&lt;p&gt;And death was also one of his materials.
Maybe it was even a work of art &lt;em&gt;about&lt;/em&gt; death, about joy in the face of certain death.
Honnold&amp;rsquo;s not very analytical, so he didn&amp;rsquo;t explain in a lot of detail why he was climbing.
He just said it was &amp;ldquo;rad.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;But one thing that&amp;rsquo;s clear to me is that Alex Honnold is afraid a lot.
In a Netflix interview before the climb, he said, &amp;ldquo;When you do something that you&amp;rsquo;ve never done before, you never totally know you&amp;rsquo;re ready, and that fear is always there.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Back when he was 23, there&amp;rsquo;s a film of him free soloing in Yosemite. In the film, he freezes with fear. He&amp;rsquo;s on a narrow ledge, so he&amp;rsquo;s not holding on.
He&amp;rsquo;s actually facing out into the thousands of feet of empty air around him, and he&amp;rsquo;s got his hands and his back against the wall.
He looks over at the cameraman and he says, &amp;ldquo;I&amp;rsquo;m freaking out right now.&amp;rdquo;
I can relate to this.
When I&amp;rsquo;m climbing, it&amp;rsquo;s often the moment when I&amp;rsquo;m secure, but I have to move away from security,
I have to get off a little ledge and start climbing again&amp;mdash;that can be the worst.
In the film, he just breathes really hard for a few seconds, then he turns around to face the wall and he starts climbing again.
I&amp;rsquo;ve done that, too.&lt;/p&gt;
&lt;p&gt;In an interview with a psychologist years later, Honnold said,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve had many things happen that have given me a massive jolt of adrenaline and I&amp;rsquo;ve had to calm myself down again—like a hold breaking while I&amp;rsquo;m free soloing. You can quickly learn to ignore what your body is doing at such moments and carry on like normal. You&amp;rsquo;ve got no other option. If something like that happens, your body is almost in a state of shock for a little while. Your heart pounds, your legs tremble or ache, crazy things like that. But I have learned: none of these things need to have any bearing at all on the climb ahead of you. They are a physiological reaction to something that already occurred. They basically don&amp;rsquo;t matter at all for your performance ahead. So you can learn to ignore them entirely and carry on like normal. Which calms you back down a lot more rapidly. So, if I get very scared because a piece of rock comes away from the wall and so my hold is gone, or I lose my footing, I know that doesn&amp;rsquo;t mean anything about the rest of my climb.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So there&amp;rsquo;s all this evidence that Honnold gets scared and that he&amp;rsquo;s trained himself to deal with it.
But for some reason, it seems like most people think that he&amp;rsquo;s just born with an innate lack of fear.
That there&amp;rsquo;s something wrong with his brain.
I mentioned him to another climber recently and the climber responded, Oh, Alex Honnold, his amygdala doesn&amp;rsquo;t work, right?&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a scene in the Free Solo documentary from 2018, which I think has contributed to this.
There is a psychologist doing an experiment on him.
He goes into an fMRI to measure the blood flow to his amygdala while he&amp;rsquo;s shown &lt;a href=&#34;https://nautil.us/the-strange-brain-of-the-worlds-greatest-solo-climber-236051&#34;&gt;scary photos&lt;/a&gt; like horrible burns or a guy holding a knife or a woman shaving her crotch&amp;mdash;that&amp;rsquo;s apparently one of the scary photos.
The researcher is amazed to see that Honnold&amp;rsquo;s amygdala does not respond to these.
So people watching this scene in the movie might conclude, there&amp;rsquo;s something different about Honnold&amp;rsquo;s brain.
And it&amp;rsquo;s true, there is something different about his brain, because he&amp;rsquo;s trained himself differently!&lt;/p&gt;
&lt;p&gt;So why do we separate ourselves from him like this?
Why do we pretend that he&amp;rsquo;s not really human, that he&amp;rsquo;s not like us?
That his mental strength is some sort of birth defect?
What are we afraid of&amp;mdash;if we knew that we also could train ourselves like this, what would that mean for us?
What are we protecting, with this idea that we can&amp;rsquo;t?
It would imply that we all have the potential to become the heroes that we admire.&lt;/p&gt;
&lt;p&gt;Honnold trained himself to be able to recover from fear, to get back in the flow.
And what are we training ourselves for?
As bodhisattvas, we&amp;rsquo;re training ourselves to be wise and compassionate more often.
Wise enough to see that we are not separate &lt;em&gt;things&lt;/em&gt; to protect and promote&amp;mdash;we&amp;rsquo;re no more separate than a wave is separate from the ocean.
And we are training ourselves to be compassionate enough to forgive everybody, so that we can all emancipate ourselves from suffering together.
Buddha discovered when he was sitting under the Bodhi tree that we all have this potential.
He had vowed to discover the solution to suffering from sickness, old age, and death.
It took him a week and then he got it.
And he said, according to the Avatamsaka Sutra, as he saw the morning star rising up, he said,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I now see all sentient beings everywhere fully possess the wisdom and virtues of the enlightened ones, but because of false conceptions and attachments they do not realize it.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We already have it; we just don&amp;rsquo;t realize it.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;There&amp;rsquo;s a concept in Buddhism of the &lt;em&gt;storehouse consciousness&lt;/em&gt;, the alaya-vijñana.
The storehouse consciousness is where all the seeds of consciousness are.
We are born already having all of these seeds.
We have the seed of courage,
and of fear,
and of generosity and greed,
and of anger and grace.
All of it.
These seeds have existed since long before we were born.
We can guide our own lives by choosing which of these seeds to water and which not to.
If we water the seed of fear, we&amp;rsquo;ll be more fearful in the future.
If we water the seed of courage,
we&amp;rsquo;ll deal with fear better.
It means that my existence is like a garden,
and I&amp;rsquo;m responsible for how it grows.
The weeds are always ready to spread up, and that&amp;rsquo;s okay, but to act with wisdom and compassion,
it helps to water the seeds of wisdom and compassion continuously.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;I&amp;rsquo;m actually taking a break right now from a &lt;a href=&#34;https://en.wikipedia.org/wiki/Sesshin&#34;&gt;sesshin&lt;/a&gt;.
The &lt;a href=&#34;https://npzc.org/&#34;&gt;New Paltz Zen Center&lt;/a&gt; upstate is doing a jukai sesshin.
We&amp;rsquo;ve got seven people sitting,
and two are taking &lt;a href=&#34;https://villagezendo.org/practice/glossary/jukai/&#34;&gt;jukai&lt;/a&gt;, so they are studying the precepts and sewing rakusus,
and they&amp;rsquo;re going to vow on Sunday to uphold the &lt;a href=&#34;https://villagezendo.org/practice/suggested-readings/precepts/&#34;&gt;precepts&lt;/a&gt;.
So this is the week to really buckle down
and water those seeds of behaving according to the precepts.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Besides Alex Honnold, there&amp;rsquo;s another brave climber I admire.
She&amp;rsquo;s not as famous, her name is &lt;a href=&#34;https://www.amazon.com/Collages-Rock-Desire-Re-imagining-Climbing/dp/B0G45SK5R9&#34;&gt;Joanne Urioste&lt;/a&gt;.
She&amp;rsquo;s in her 70s now.
Back in the day, she established a lot of really risky, bold climbs
in Red Rock Canyon outside of Vegas.
She and her husband, they approached climbing
with strict ethical standards.
Their conception of ethical climbing was to harm the rock
as little as possible, meaning put as few bolts in as possible.
But that means that if you&amp;rsquo;re lead climbing,
if you&amp;rsquo;re the one getting the rope from the bottom to the top,
and you fall from far above a bolt,
you will fall very, very far and maybe get hurt or killed.
She thought it was worth the risk because of the respect
that it expressed for the rock.
On &lt;a href=&#34;https://americanalpineclub.org/news/2026/1/22/tales-from-red-rocks-risk-mistress-joanne-urioste&#34;&gt;a podcast&lt;/a&gt; recently, the host asked Joanne Urioste: she was climbing 30, 40, 50 years ago,
back when all the other climbers were tough men.
How did you deal with that?
And Urioste said that the need to look good,
it&amp;rsquo;s like an extra weight on your climbing harness,
like a big bottle of toxic water weighing you down.
She said,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If you can find some way to take a few drops of water out of this toxic gallon bottle of personal need to perform, it can be absolutely incredible. And it can be very freeing for life, not just for climbing. With work and with time, you can get to this state of serenity where you absolutely do not care whether you look bad in front of somebody else. And you&amp;rsquo;re in a state of complete silence when you&amp;rsquo;re leading. It&amp;rsquo;s like total dead silence. Your body executing the moves because you&amp;rsquo;ve decided that this is an appropriate thing to do. You&amp;rsquo;re willing to take the risk because you&amp;rsquo;ve premeditated whether this is appropriate or not. And once you step into this task of getting from point A to point B, maybe this bolt to that bolt, this point to the summit, then, okay, I&amp;rsquo;m in a state of total silence, total internal silence.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s kind of a midway step. A more advanced state is more of a metaphysical state. It&amp;rsquo;s not encountered by everybody. But for me personally, it was encountered when I had a combination of high risk, high level of exhaustion, and a high level of pain. And I&amp;rsquo;m not recommending other people do this because if you did it and it didn&amp;rsquo;t work, you might die. But I have had periods of time where I&amp;rsquo;ve had conversations with invisible people who have helped me. I have had invisible people come into my body and take over my body and get me through a section [of climbing].&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think I understand the first part, but not the second.
I haven&amp;rsquo;t experienced that myself and I don&amp;rsquo;t think that is my path.
I don&amp;rsquo;t want to push myself into extremis like Joanne Urioste.
I just kind of want to climb like Alex Honnold.
That grace and that confidence and that joy.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;So the day after he climbed the skyscraper, my partner Chris and I
got up before dawn and walked to the base of Time Wave Zero in Mexico.
It was cold and misty.
Chris led the first pitch, which was a really easy 5.7,
and I followed him to the anchor, and then I led the second pitch,
which was a hard 5.11b, and I managed to climb that without falling,
which I was very proud of.
But the mist had soaked my shirt.
I was just in a T-shirt, and I was shivering really hard.
So I lowered the rope to Chris, he attached my backpack to it, I hauled it back up.
I got my down coat out, put that on.
Then I belayed Chris up.
But I was shaking,
and my teeth were chattering.
It&amp;rsquo;s the coldest I can actually ever remember being.
I was barely functioning.
But, you know, I hung there at the anchor belaying him,
and I just took it because I knew: this too shall pass.
At some point later today, I&amp;rsquo;ll probably be super hot
and wishing I was cold instead.
Meanwhile, the only thing to do is just be cold and belay Chris.&lt;/p&gt;
&lt;p&gt;So we alternated like this,
each of us leading 100 feet or 200 feet,
and then following, swapping leader and follower
for the nine hours it took us to get to the top.
It was definitely warm enough once the sun was out.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/the-dharma-of-alex-honnold/summit.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;This is by far the most climbing that I&amp;rsquo;ve ever done in a day,
and it&amp;rsquo;s also the tallest thing I&amp;rsquo;ve ever climbed.
There were times when I was scared,
or nauseated from dehydration, or too hot or too cold,
or had to pee really bad.
I was worried whether we were going to be quick enough,
or worried about a dozen other things that were going to happen in the future.
I&amp;rsquo;m really glad that I watched Alex Honnold the night before:
It was great prep, because whenever I doubted myself,
I would think of him.
How would Alex Honnold climb this pitch?
How do I embody that confidence and that enjoyment?
Obviously, it&amp;rsquo;s just by taking the next handhold,
finding the next foothold, checking every knot,
checking every carabiner, doing the next thing.
When you&amp;rsquo;re thirsty, get your pack out and drink some water.
When you have to pee, find a ledge,
and pee off to the side of the route so it doesn&amp;rsquo;t get on the handholds.
It&amp;rsquo;s just &lt;em&gt;this&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;And also don&amp;rsquo;t forget to have fun.
Even though climbing is dangerous, it&amp;rsquo;s both dangerous &lt;em&gt;and&lt;/em&gt; fun.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s the Zen parable about the man dangling from a cliff,
and there&amp;rsquo;s a lot of different versions of this,
but here&amp;rsquo;s the version that&amp;rsquo;s in the book &lt;em&gt;Zen Flesh, Zen Bones&lt;/em&gt;.
It goes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A man traveling across a field encountered a tiger. He fled, the tiger after him. Coming to a precipice, he caught hold of the root of a wild vine and swung himself down over the edge. The tiger sniffed at him from above. Trembling, the man looked down to where, far below, another tiger was waiting to eat him. Only the vine sustained him. Two mice, one white and one black, little by little started to gnaw away at the vine. The man saw a luscious strawberry near him. Grasping the vine with one hand, he plucked the strawberry with the other. How sweet it tasted!&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Even if you make it to the top of the skyscraper,
you&amp;rsquo;re going to die anyway,
so you might as well enjoy the strawberry while it lasts.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/the-dharma-of-alex-honnold/ledge.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;So at New Paltz Zen Center,
we&amp;rsquo;re studying the precepts,
and the precepts have long been a logical puzzle for me.
Why should we behave ethically?
And why should we follow one set of ethical principles rather than some other?
If other beings don&amp;rsquo;t have separate existence,
why should we be compassionate towards them?
If we are empty,
why does it matter if we do good or not?
The earth will eventually be swallowed by the sun.
Why does any of this matter?&lt;/p&gt;
&lt;p&gt;The classical Buddhist answer is that
in order to emancipate yourself from suffering,
you must not harm other beings.
It makes no sense to talk about liberty
while you&amp;rsquo;re oppressing people.
(&lt;em&gt;Thomas Jefferson&lt;/em&gt; &amp;hellip;)&lt;/p&gt;
&lt;p&gt;The Mahayana answer is that we&amp;rsquo;re not separate from each other, and so doing harm to others
is just as illogical as doing harm to oneself.
One of my hands doesn&amp;rsquo;t steal the mitten from the other hand,
because they&amp;rsquo;re one body.&lt;/p&gt;
&lt;p&gt;Another answer is to look at what
climbers mean by ethics.
Some climbers
say ethical climbing is not
placing too many bolts,
not doing too much harm to the rock.
Or not claiming to have climbed a route that you didn&amp;rsquo;t.
Or free-soloing a route, if you say that you&amp;rsquo;re going to do it without a rope.
These are arbitrary,
just rules of the game.
They&amp;rsquo;re empty.
But they do have a purpose,
which is to clarify:
what sort of artistic statement
do you want to make with this climb?
What principle are you expressing
with the style in which you climb this route?
And it might be different at different times,
whether you&amp;rsquo;re leading or free-soloing
or top-roping or whatever.
But all of these different styles of climbing
have their own integrity
because they follow principles.
And I think that&amp;rsquo;s the same for a whole life.
What style
do you want to live in?
What sort of self-expression
is your life?
When it comes to the precepts,
there&amp;rsquo;s no single right or wrong answer about this.
But I think we can agree that
a beautifully expressed life
has, as a prerequisite,
that it is an ethical life.&lt;/p&gt;
&lt;p&gt;Being conscious of our ethical responsibilities
and intentionally expressing ourselves
in the way we live
takes some training,
to maintain this consciousness.
And,
that&amp;rsquo;s what we do when we sit,
and that&amp;rsquo;s what we do on sesshin.
So,
good luck with that!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Will the Training of Junior Engineers Be a Tragedy of the Commons?</title>
      <link>https://emptysqua.re/blog/junior-engineers-tragedy-of-the-commons/</link>
      <pubDate>Fri, 13 Mar 2026 22:20:29 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/junior-engineers-tragedy-of-the-commons/</guid>
      
      
      
      
      <description>&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/junior-engineers-tragedy-of-the-commons/tumblr_2870a81ad06a0a4c111e7dbfe7f2d91b_d49961d7_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;For all my 25 years in the industry, large companies have needed junior engineers to write code. Senior engineers were too scarce and expensive to write it all, so we roughly divided the labor: juniors hacked, seniors guided. Juniors&amp;rsquo; on-the-job training was mostly a &lt;em&gt;side effect&lt;/em&gt;: it was practically inevitable that they&amp;rsquo;d gain the expertise and wisdom to be senior engineers, just by writing code that met seniors&amp;rsquo; standards. If a junior engineer left after a couple of years, the company lost some of its investment in training her, but that was acceptable. The code she wrote was worth it.&lt;/p&gt;
&lt;p&gt;During my time as an engineer at MongoDB, before I &lt;a href=&#34;https://emptysqua.re/blog/from-python-programmer-to-distributed-systems-researcher-in-10-years/&#34;&gt;switched to research&lt;/a&gt;, &lt;a href=&#34;https://emptysqua.re/blog/mentoring/&#34;&gt;I mentored dozens&lt;/a&gt; of interns, new grads, and junior engineers as they began their careers at MongoDB. For me, this was enjoyable and beneficent work. For MongoDB, it was a way to produce code and cultivate the senior engineers we would need in the future. There was a period in the 2010s when it was the only way to hire large numbers of engineers: they were so in demand, and salaries had inflated to such ludicrous proportions. This was the &lt;a href=&#34;https://en.wikipedia.org/wiki/Zero_interest-rate_policy&#34;&gt;ZIRP&lt;/a&gt; era, when the big tech companies hoarded labor. Amazon, Google, etc. deliberately hired more people than they needed, as a strategic reserve of talent, and to starve their competitors. MongoDB&amp;rsquo;s only hope was to catch engineers before they even graduated college. Many left after a few years, but their contributions as juniors were worth our investment. Some of these hires stayed; they&amp;rsquo;re now our most senior experts and leaders.&lt;/p&gt;
&lt;h1&gt;The Declining Investment&lt;/h1&gt;

&lt;p&gt;The market for software engineering labor changed post-Covid. Interest rates rose and companies did mass layoffs. Elon Musk demonstrated at Twitter that savage, arbitrary cuts didn&amp;rsquo;t bring the site down. At MongoDB we&amp;rsquo;ve kept hiring interns and new grads, but across the industry their prospects diminished.&lt;/p&gt;
&lt;p&gt;Covid hurt junior engineers in another way: companies permitted hybrid or remote work, senior engineers stopped coming to the office as much, and in-person mentorship suffered. This was the first tragedy of the commons. For seniors like me, working from home is pleasant and productive. But for the juniors I no longer sit with, it&amp;rsquo;s a loss. They need to build an &lt;a href=&#34;https://emptysqua.re/blog/programming-as-theory-building/&#34;&gt;intuitive theory of programming&lt;/a&gt;—a theory of the company&amp;rsquo;s software specifically, and software engineering in general. This is mostly transmitted implicitly, not through explanation but by working side-by-side and solving problems together. Doing this work over Zoom is only a fraction as effective.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/junior-engineers-tragedy-of-the-commons/tumblr_f37294dc5511f5992cc5286e19e96c40_359f05fe_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;The Impending Collapse&lt;/h1&gt;

&lt;p&gt;Now comes the second tragedy of the commons: companies are beginning to, or may soon, &lt;a href=&#34;https://economictimes.indiatimes.com/news/international/us/anthropics-ai-coding-surge-reshapes-hiring-why-senior-talent-is-in-demand-now-as-claude-takes-over-junior-engineers-role/articleshow/128787065.cms&#34;&gt;stop hiring junior engineers and just replace them with AI&lt;/a&gt;. Senior engineers will be retained to oversee the coding agents. (For example, &lt;a href=&#34;https://arstechnica.com/ai/2026/03/after-outages-amazon-to-make-senior-engineers-sign-off-on-ai-assisted-changes/&#34;&gt;Amazon just announced that all AI code must be reviewed by senior engineers&lt;/a&gt;.) As &lt;a href=&#34;https://cacm.acm.org/opinion/redefining-the-software-engineering-profession-for-ai/&#34;&gt;two Microsoft executives wrote&lt;/a&gt; last week:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The result is a new incentive structure: Hire seniors, automate juniors. But without EiC [early-in-career] hiring, the profession’s talent pipeline collapses, and organizations face a future without the next generation of experienced engineers.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The authors propose &amp;ldquo;preceptorship at scale,&amp;rdquo; where senior engineers pair with juniors and help them guide AI agents. The juniors will absorb the seniors&amp;rsquo; wisdom, judgment, and taste, born of long experience manually coding. This sounds great! But &lt;em&gt;why&lt;/em&gt; would any company invest in juniors this way, instead of free-riding on other companies and poaching their engineers once they&amp;rsquo;re ripe? The authors make a normative argument and fail to address the economic incentives: any company that invests in juniors in the age of AI will likely lose out to a company that only hires seniors. If junior engineers&amp;rsquo; output while they learn on the job is no longer economically valuable, the free market won&amp;rsquo;t replenish the stock of senior engineers.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/junior-engineers-tragedy-of-the-commons/tumblr_795d633eeb1b4467971a629f305bf89a_159748c3_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;Some Uncomfortable Solutions&lt;/h1&gt;

&lt;p&gt;I assume that college grads will always need on-the-job training, at great expense to their employers, rather than graduating fully formed. I don&amp;rsquo;t know if &lt;em&gt;any&lt;/em&gt; academic or professional degree truly prepares its graduates for work, but a Bachelor&amp;rsquo;s in Computer Science certainly doesn&amp;rsquo;t. Given the new incentives of the AI era, colleges or other institutions might try to accomplish this, but I doubt it&amp;rsquo;s possible. An educational program can&amp;rsquo;t simulate a workplace realistically.&lt;/p&gt;
&lt;p&gt;We talk about &amp;ldquo;apprenticeship&amp;rdquo; today, but perhaps indentured servitude better describes a sustainable model in the future. Companies will hire new grads and train them to be expert AI-overseers, in exchange for a multi-year lock-in. The employer ensures it gets its money&amp;rsquo;s worth from the engineer. If the practice is widespread, would-be free-riders will be starved of talent and forced to train their own. This would ensure the supply of senior engineers, but it would create a huge &lt;a href=&#34;https://en.wikipedia.org/wiki/Deadweight_loss&#34;&gt;deadweight loss&lt;/a&gt; compared to the free market we&amp;rsquo;ve enjoyed so far: lower profits, lower salaries, less production. We&amp;rsquo;ve already seen how inefficient such a system is: &lt;a href=&#34;https://www.princeton.edu/~starr/saxerev.html&#34;&gt;Silicon Valley beat the Boston area largely because California engineers could switch jobs easily&lt;/a&gt;. Maybe AI will make up for that loss so dramatically we won&amp;rsquo;t notice.&lt;/p&gt;
&lt;p&gt;Or maybe we need &lt;a href=&#34;https://en.wikipedia.org/wiki/Apprenticeship_Levy&#34;&gt;an apprenticeship levy&lt;/a&gt; like in the UK: a tax on all employers, which is forgiven if they train junior employees. The UK program is a mere 0.5% of payroll—we&amp;rsquo;d need a bigger and more targeted fee for software firms, but it could work. Government intervention is the classic answer to market failures like this one.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/junior-engineers-tragedy-of-the-commons/tumblr_fac0fcd321c4354ce19abe95fbc38bec_f14458c8_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;My Conclusion To This Foolish Speculation&lt;/h1&gt;

&lt;p&gt;All articles about AI are titled with question marks this year. I &lt;em&gt;think&lt;/em&gt; we&amp;rsquo;ll need another generation of senior engineers to guide AI agents? But it seems equally likely that AI will soon enter a recursive self-improvement loop. It will &lt;a href=&#34;https://www.transformernews.ai/p/the-fuse-is-lit-on-the-intelligence-ai-recursive-self-improvement&#34;&gt;get so smart so fast&lt;/a&gt;, &amp;ldquo;human software engineer&amp;rdquo; will be a forgotten archaism. Today we think it&amp;rsquo;s funny that a &amp;ldquo;computer&amp;rdquo; was once a human with a slide rule. In a few years we&amp;rsquo;ll say, as we lounge together in self-piloting electric flying cars, &amp;ldquo;Did you know that ‘engineer’ used to be a human occupation?&amp;rdquo; The &lt;a href=&#34;https://en.wikipedia.org/wiki/Technological_singularity&#34;&gt;singularity&lt;/a&gt; seems more plausible to me every day.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/junior-engineers-tragedy-of-the-commons/tumblr_f05241145e3dcb2aeaec7b40d1b6bd11_0641b6a6_1280.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;p&gt;Images: &lt;a href=&#34;https://thevaultoftheatomicspaceage.tumblr.com/&#34;&gt;The Vault Of The Atomic Space Age&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>What Does Privacy Mean Now?</title>
      <link>https://emptysqua.re/blog/what-does-privacy-mean-now/</link>
      <pubDate>Thu, 12 Mar 2026 22:33:26 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/what-does-privacy-mean-now/</guid>
      
      
      
      
      <description>&lt;p&gt;Last week, &lt;a href=&#34;https://www.astralcodexten.com/p/all-lawful-use-much-more-than-you&#34;&gt;Anthropic refused to let the Department of War use their models to spy on Americans or kill people autonomously&lt;/a&gt;. In retaliation, the DoW declared Anthropic a &amp;ldquo;supply chain risk,&amp;rdquo; a form of embargo. OpenAI jumped in to take over Anthropic&amp;rsquo;s contract, &lt;a href=&#34;https://www.transformernews.ai/p/openai-pentagon-department-of-war-dow-dod-red-lines-surveillance&#34;&gt;pretending&lt;/a&gt; that they&amp;rsquo;d somehow secured promises from the DoW that Anthropic hadn&amp;rsquo;t.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m against killer robots, and I won&amp;rsquo;t analyze that here. I&amp;rsquo;m &lt;em&gt;also&lt;/em&gt; against domestic spying, and I want to discuss how LLMs change the meaning of &amp;ldquo;privacy.&amp;rdquo; Does it matter whether a human or a machine is looking at your private data?&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/what-does-privacy-mean-now/P1994.44.9-1.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;The NSA&amp;rsquo;s Definition of &amp;ldquo;Collect&amp;rdquo;&lt;/h1&gt;

&lt;p&gt;In 2013, Obama&amp;rsquo;s director of national intelligence, James Clapper, was asked in a Senate hearing, &amp;ldquo;Does the NSA collect any type of data at all on millions or hundreds of millions of Americans?&amp;rdquo; &lt;a href=&#34;https://www.theguardian.com/world/2014/jan/31/obama-admits-intelligence-chief-fault-senate-testimony&#34;&gt;He said no&lt;/a&gt;. Months later, Edward Snowden revealed that the NSA was doing exactly that. Clapper defended himself, citing a 1982 DoD manual that defines data as &amp;ldquo;collected&amp;rdquo; only when a human intelligence analyst reads it. In court, the government argued that machine processing of metadata about Americans&amp;rsquo; phone calls didn&amp;rsquo;t require a search warrant. But &lt;a href=&#34;https://www.schneier.com/essays/archives/2014/02/nsa_robots_are_colle.html&#34;&gt;Bruce Schneier asked&lt;/a&gt;, &amp;ldquo;is it really okay for a computer to monitor you online, and for that data collection and analysis only to count as a potential privacy invasion when a person sees it?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Some people worried (correctly) that human analysts would illegally look at Americans&amp;rsquo; data once it was stored and indexed on the NSA&amp;rsquo;s hard drives. Others thought that even if no human ever looked at their private information, their privacy was still violated.&lt;/p&gt;
&lt;p&gt;Now, LLM analysts will make the distinction murkier still, by creating a middle level between humans and old-fashioned software. Or maybe all the levels are collapsing into one.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/what-does-privacy-mean-now/P1994.44.3-1.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;Many Kinds of Watchers&lt;/h1&gt;

&lt;p&gt;I think privacy has to be measured on at least two axes. On one axis we measure &lt;em&gt;what&lt;/em&gt; is revealed: My yearbook photo, or a nude photo? My political party, or whom I voted for? My resume, or my salary? And on the other axis is &lt;em&gt;who&lt;/em&gt; or &lt;em&gt;what&lt;/em&gt; observes my data.&lt;/p&gt;
&lt;p&gt;When people talk about online privacy, or when a browser promises to protect my privacy, I think they&amp;rsquo;re mostly talking about ad networks—fully automated systems that build a profile of my interests and auction my attention to advertisers. No human at Google or Meta is reviewing my browsing history and thinking about &lt;em&gt;me&lt;/em&gt;. It&amp;rsquo;s just dumb machines. I might still call it a privacy concern, because the information has left my control and is being used to manipulate my behavior. Or I might not care, because no conscious being sees my browsing behavior.&lt;/p&gt;
&lt;p&gt;In &amp;ldquo;Nonstandard Observers and the Nature of Privacy&amp;rdquo; (2014), Eldon Soifer and David Elliott write:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Suppose you are naked in the shower one morning, when you suddenly notice an observer perched on the ledge outside, looking in through the window. It would probably make a great deal of difference to you whether the observer is your neighbor or your neighbor&amp;rsquo;s cat.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I have an interest in controlling my data, according to Soifer and Elliott, because I care about the judgment of others, because I want to construct a certain public persona. This isn&amp;rsquo;t just about pride or avoiding embarrassment: others&amp;rsquo; opinions of me determine the course of my career and my life. The cat&amp;rsquo;s opinion of me, however, has no effect.&lt;/p&gt;
&lt;p&gt;Computers are a new kind of observer. How much can a computer violate my privacy? For me, an ad network is like a cat. Sure, it&amp;rsquo;s a little creepy when I search for pants on one site and see ads for pants on another site, but I know no human is watching me so I don&amp;rsquo;t care. I just install AdBlock and hope for the best. For some people, though, it feels like a privacy violation just to have their data leave their control, gathered by distant and amoral corporations. The NSA&amp;rsquo;s domestic surveillance (which has supposedly, mostly ended) was much worse than an ad network, because it exposed us to the risk that humans working for a powerful government might look at our private information.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/what-does-privacy-mean-now/P1994.44.11-1.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;What Kind of Observer is an LLM?&lt;/h1&gt;

&lt;p&gt;An LLM is different from an ad network, because it can deduce so much more. Yes, ad networks have a spooky ability to guess my age, gender, income, etc. from my browsing behavior, using mere correlations. But LLMs can do even more than correlate: they can &lt;em&gt;reason&lt;/em&gt;. Perhaps a powerful LLM can analyze my deliberate, public disclosures and deduce that I have a secret Care Bears fetish. Suddenly, the intelligence of the observer has changed what is revealed.&lt;/p&gt;
&lt;p&gt;I doubt ad networks will use LLMs any time soon—LLMs are far too expensive compared to the fractions of cents involved in ad auctions. But a deep-pocketed and determined observer like the US government could use LLMs to logically deduce much more about a much larger group of people than has ever been possible before. Let&amp;rsquo;s say the Department of War deploys GPT to surveil Americans. GPT reads my public data, and concludes that I have a secret Care Bears fetish. But it also concludes I&amp;rsquo;m no threat to MAGA, so it doesn&amp;rsquo;t alert a human analyst and my secret is safe.&lt;/p&gt;
&lt;p&gt;There are two problems I can see with this scenario. First, my private Care Bears fetish is waiting there in a file, until the day when a government analyst decides to read it. There&amp;rsquo;s far too much risk that such data will be used eventually for political oppression. Second, my privacy might be violated by the LLM&amp;rsquo;s silent judgment of me. The LLM is human enough that I might feel embarrassed, knowing that it knows my secret. As M. Ryan Calo &lt;a href=&#34;https://law.stanford.edu/wp-content/uploads/2015/06/Calo-People-Fake.pdf&#34;&gt;writes&lt;/a&gt; in &amp;ldquo;People Can Be So Fake: A New Dimension to Privacy and Technology Scholarship&amp;rdquo; (2010),&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We are hardwired to react to anthropomorphic technology as though a person were actually present. This causes changes in our attitude, behavior, even our physiological state. The resulting chill to curiosity and threat to solitude is all the more dangerous in that it cannot be addressed through traditional privacy protections such as encryption or anonymization.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In other words, the &lt;em&gt;feeling&lt;/em&gt; of being observed by a conscious-seeming being is harmful, because it forces us to behave in private as if we were always in public.&lt;/p&gt;
&lt;h1&gt;The Inexorable Panopticon&lt;/h1&gt;

&lt;p&gt;This year, the accelerating capabilities and deployment of AI seem inexorable. It flows like water around anyone who tries to slow it. When Anthropic tried to plug a hole in the dike, OpenAI poured over the top within hours. The shortage of GPUs will delay mass surveillance only for a short while. I have no faith that the law will protect our private data from the government. Even if it did, AI surveillance of our &lt;em&gt;public&lt;/em&gt; data would be enough to deter dissent, because of its power of deduction. I think the only safeguard against government abuse of LLMs is to elect people we trust not to surveil us.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/what-does-privacy-mean-now/P1994.52-1.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;p&gt;Images: &lt;a href=&#34;https://pdimagearchive.org/galleries/artists/minya-diez-duhrkoop/random/desc&#34;&gt;Minya Diéz-Dührkoop&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Deduplicate Your Zotero Library: A Plugin For The Forgetful Scholar</title>
      <link>https://emptysqua.re/blog/zotero-duplicates-plugin/</link>
      <pubDate>Sat, 28 Feb 2026 18:38:34 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/zotero-duplicates-plugin/</guid>
      
      
      
      
      <description>&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/zotero-duplicates-plugin/rembrandt.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&amp;ldquo;Did I Already Save This Paper?&amp;rdquo; by Rembrandt&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;My usual research workflow is to stumble upon a paper I want to read, add it to Zotero with the Chrome extension, and mark it as &amp;ldquo;to-read&amp;rdquo; so it can join its hundreds of unread friends. Then a day or two later I&amp;rsquo;ll stumble upon the same paper again, forget that I&amp;rsquo;ve added it, and add it again. (At least I assume it&amp;rsquo;s a day or two later&amp;mdash;if I forgot even quicker, how would I know?)&lt;/p&gt;
&lt;p&gt;Over time my library acquires a vexing redundancy. Zotero has a built-in duplicate detector, but it only catches items with matching metadata. When I add the same PDF twice, Zotero makes two parent items, each with its own copy of the PDF. Its duplicate detector silently overlooks the Doppelgängers.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/zotero-duplicates-plugin/aberdeen.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://garystockbridge617.getarchive.net/amp/media/lady-ishbel-aberdeen-1899-iiav-15541-8ad15e&#34;&gt;Lady Ishbel Aberdeen&lt;/a&gt; wonders if she already saved this PDF&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;I had great success writing &lt;a href=&#34;https://emptysqua.re/blog/writing-related-work-section-zotero-overleaf/&#34;&gt;a Zotero plugin for finding papers&amp;rsquo; metadata&lt;/a&gt; a couple months ago, so I made a new plugin: a &lt;a href=&#34;https://github.com/ajdavis/zotero-duplicates-plugin&#34;&gt;Zotero Duplicate Finder&lt;/a&gt; to clean up after myself. It finds duplicates by title matches or PDFs&amp;rsquo; MD5 hashes. The first time I ran it, it found several dozen papers that I&amp;rsquo;d added at least twice. I won&amp;rsquo;t tell you the exact number, for shame. (Reader, it was 63 duplicates. Some were triplets.)&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a screenshot from a less embarrassing cleanup:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/zotero-duplicates-plugin/dupes-found.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;To use it, install the plugin from &lt;a href=&#34;https://github.com/ajdavis/zotero-duplicates-plugin/releases/latest&#34;&gt;the latest release&lt;/a&gt;, then go to Tools → Find Duplicate Items. The plugin scans your library and shows you groups of duplicates. Uncheck any groups you want to spare, then click &amp;ldquo;Tag Duplicates.&amp;rdquo; The plugin tags the parent items with &amp;ldquo;duplicate&amp;rdquo;&amp;mdash;it doesn&amp;rsquo;t delete anything, because I don&amp;rsquo;t trust the plugin and neither should you. The plugin then shows you your duplicate-tagged items.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/zotero-duplicates-plugin/filter-by-dupes.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Deal with them as you see fit. Repeat on occasion, whenever you can&amp;rsquo;t remember the last time you did it, you bunny-brained fool.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/zotero-duplicates-plugin/young-woman-reading.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://www.bonhams.com/auction/25444/lot/62/osman-hamdi-bey-turkish-1842-1910-young-woman-reading/&#34;&gt;Osman Hamdi Bey, Young Woman Reading&lt;/a&gt;&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Snow Grooming</title>
      <link>https://emptysqua.re/blog/snow-grooming/</link>
      <pubDate>Mon, 16 Feb 2026 16:42:44 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/snow-grooming/</guid>
      
      
      
      
      <description>&lt;p&gt;February 16, 2026. My friend Chris volunteers for the Mohonk Preserve. I met him at 6:30am to photograph him grooming trails for cross-country skiing.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/snow-grooming/grooming-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/snow-grooming/grooming-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/snow-grooming/grooming-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/snow-grooming/grooming-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/snow-grooming/grooming-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/snow-grooming/grooming-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/snow-grooming/grooming-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/snow-grooming/grooming-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>A Zen Talk: AI OMG!</title>
      <link>https://emptysqua.re/blog/ai-omg/</link>
      <pubDate>Thu, 22 Jan 2026 17:23:31 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/ai-omg/</guid>
      
      
      
      
      <description>&lt;p&gt;What is an LLM? How does it work? Is it conscious? Why is this all happening now?&lt;/p&gt;
&lt;p&gt;This is a talk about LLMs I gave at the Village Zendo on January 18, 2026. Here&amp;rsquo;s the recording, and a written version is below.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/XrUjA4mvomc?si=kDGhj25FOoEyOhNu&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen style=&#34;margin-bottom: 1em&#34;&gt;&lt;/iframe&gt;
&lt;h1&gt;OMG AI!!!&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/oni-mask.jpg&#34; alt=&#34;Japanese oni demon mask from the Edo period, carved wood with fierce expression&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;I&amp;rsquo;m going to explain LLMs to the best of my understanding. I&amp;rsquo;m a computer scientist, though AI isn&amp;rsquo;t my specialty, but I&amp;rsquo;ve spent a while catching up and I think the answers to these questions are mostly straightforward. There&amp;rsquo;s no magic. Large language models like GPT are just doing arithmetic, on a very large scale. I&amp;rsquo;m going to tell the history of large language models, and how we got here, and how it relates to human minds.&lt;/p&gt;
&lt;h2&gt;The Turing Test&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/alan-turing.jpg&#34; alt=&#34;Black and white portrait photograph of Alan Turing&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Alan Turing, 1912-1954&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The story starts with Alan Turing, who is more or less the founder of computer science. In 1950 he proposed a test of artificial intelligence.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/ai-omg/imitation-game.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/ai-omg/imitation-game.png&#34; alt=&#34;Diagram of someone exchanging messages over a curtain with a human and a computer&#34;&gt;
&lt;/picture&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;He called it the imitation game, and everyone else calls it the Turing Test. If a human player exchanges written notes with an unseen human and an unseen computer, and the player cannot guess which is the computer, then the computer has passed the test. There are controversies over how to play the game, and what it means if the computer passes. If a computer seems like a human, does that mean the computer is intelligent? Then what do we mean by intelligence? But it&amp;rsquo;s a very famous test, and passing it would obviously mean something important.&lt;/p&gt;
&lt;h2&gt;ELIZA&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/weizenbaum.png&#34; alt=&#34;Black and white portrait photograph of Joseph Weizenbaum wearing glasses&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Joseph Weizenbaum, creator of ELIZA, 1966&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The first program to really try to pass the Turing Test was ELIZA, built by an MIT researcher named Joseph Weizenbaum in 1966. It imitates a very annoying psychotherapist who mostly tries to turn everything the patient says into a question. Test subjects chatted with it over a teletype. You can still play with ELIZA today, here&amp;rsquo;s an interaction I had:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;ELIZA, 1966
I am the psychotherapist. Please, describe your problems.
&amp;gt; tell me something only a human would
Why do you say that?
&amp;gt; why do i say what?
I&amp;#39;ll ask the questions, if you don&amp;#39;t mind!
&amp;gt; stop being such a tyrant
My secretary will send you a bill.
&amp;gt; that&amp;#39;s funny
What makes you believe that&amp;#39;s funny?
&amp;gt; because it made me laugh
Is it because it made you laugh that you came to me?
&amp;gt; that doesn&amp;#39;t really make sense
Does it bother you that does not really make sense?
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You can see that its ability to transform statements into questions breaks down easily. But people had never chatted with a computer before, and many were fooled.&lt;/p&gt;
&lt;p&gt;There are stories of people getting attached to ELIZA and wanting to keep the conversation going when the experimenter told them time was up. Weizenbaum later said, &amp;ldquo;what I had not realized was that extremely short exposures to a relatively simple computer program could induce powerful delusional thinking in quite normal people.&amp;rdquo;&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/teletype.jpg&#34; alt=&#34;A vintage teletype machine with keyboard and paper roll&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;A teletype machine like those used to interact with ELIZA&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;This is the &amp;ldquo;ELIZA EFFECT&amp;rdquo;: when computers imitate humans, humans attribute thoughts and emotions to the computers, even if the computer is actually following a really dumb, simple program. Weizenbaum said the more you know how it works, the less deluded you are about this.&lt;/p&gt;
&lt;p&gt;The Eliza Effect is very powerful with the AIs we have now, you hear about people falling in love with AI because they believe the AI loves them. Or maybe it&amp;rsquo;s not an illusion? (Foreshadowing!)&lt;/p&gt;
&lt;p&gt;Anyway, ELIZA was a handwritten computer program with a bunch of if-then statements and specific rules for manipulating text. Here&amp;rsquo;s a chunk of ELIZA&amp;rsquo;s original code, printed out in the Sixties and recently discovered in the MIT archives.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/eliza-code.jpg&#34; alt=&#34;Printout of ELIZA&amp;#39;s original source code from the 1960s&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Some other researchers thought that programming a computer like this was a dead end. They wanted to model the brain itself in a more organic way.&lt;/p&gt;
&lt;h2&gt;Artificial Neural Networks&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/excitement.png&#34; alt=&#34;Diagram of a neural network with input nodes, hidden layer, and output nodes connected by arrows&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;How does a human brain work? The brain is made of neurons (nerve cells), and each neuron is connected to many others via synapses. When one neuron is excited, it sends some electric signals to others, and they are excited or suppressed. Researchers decided to simulate this, in an extremely simplified way, in software&amp;mdash;that&amp;rsquo;s an artificial neural network.&lt;/p&gt;
&lt;h2&gt;Training a Neural Net to Spell &amp;ldquo;Banana&amp;rdquo;&lt;/h2&gt;

&lt;p&gt;Neural nets can do lots of tasks. One of them is next-token prediction. Next-token prediction is just &amp;ldquo;autocomplete&amp;rdquo;: it&amp;rsquo;s guessing what&amp;rsquo;s the next thing that comes after some text. Let&amp;rsquo;s train a neural net to spell &amp;ldquo;banana&amp;rdquo;. So, if you feed it &amp;ldquo;banan&amp;rdquo;, it will properly guess the next letter is &amp;ldquo;a&amp;rdquo;. The whole alphabet that the neural net can handle is just three letters: A, B, and N.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/neural-net-1.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Our neural net has only 11 neurons and 18 synapses, the poor thing, it&amp;rsquo;s very stupid. On the left are its input neurons, that&amp;rsquo;s how it perceives the letters you&amp;rsquo;ve given it. On the right are the output neurons where it expresses what it thinks the next letter should be. Then there&amp;rsquo;s a hidden layer of neurons&amp;mdash;that&amp;rsquo;s where it &amp;ldquo;thinks&amp;rdquo;, I guess. The neurons in each layer are connected to all the neurons in the next layer, but not to any others. So every input neuron is connected to every neuron in the hidden layer, but not directly to any of the neurons in the output layer.&lt;/p&gt;
&lt;p&gt;The input neurons can only see the two most recent letters, so we say this neural net has a &amp;ldquo;context window&amp;rdquo; of 2 letters. Each input neuron has a specific job. The top-left input neuron is excited by an &amp;ldquo;a&amp;rdquo; in position 0, so it&amp;rsquo;s excited by &amp;ldquo;an&amp;rdquo; or &amp;ldquo;ab&amp;rdquo;, but not by &amp;ldquo;na&amp;rdquo;. Below it are the neurons excited by &amp;ldquo;b&amp;rdquo; or &amp;ldquo;n&amp;rdquo; in position 0.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/input-neurons.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The neuron for &amp;ldquo;a&amp;rdquo; in position 1 is excited by &amp;ldquo;na&amp;rdquo; or &amp;ldquo;ba&amp;rdquo; but not &amp;ldquo;an&amp;rdquo;. You get the idea.&lt;/p&gt;
&lt;p&gt;When the input neurons are activated, they activate the neurons in the hidden layer by different amounts. Those amounts are called &lt;em&gt;weights&lt;/em&gt;. Then the hidden neurons activate the output neurons by different amounts, those are also called weights.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/inference-ba-untrained.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the initial neural net. It starts off with &lt;em&gt;random&lt;/em&gt; weights, so it&amp;rsquo;s not very good at spelling banana. If we give it the input &amp;ldquo;ba&amp;rdquo;, that excites the input neurons, and those excite or suppress the hidden nodes by a certain amount and those activate the output neurons. A neuron is just a place to hold a number between 0 and 1, where 0 is not excited and 1 is maximum excitement.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/baa.png&#34; alt=&#34;Neural network showing input &amp;#39;ba&amp;#39; activating neurons with similar output values around 0.67, 0.37, and 0.62&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;You can see that all the output neurons are activated by about the same amount, but it thinks &amp;ldquo;a&amp;rdquo; is the most likely next letter. So the network is spelling &amp;ldquo;baa&amp;rdquo;. It&amp;rsquo;s not any good at spelling &amp;ldquo;banana&amp;rdquo;.&lt;/p&gt;
&lt;h2&gt;Training&lt;/h2&gt;

&lt;p&gt;So how do we train this neural net to make &lt;em&gt;good&lt;/em&gt; predictions of the next letter, given the two most recent letters? You give it lots of examples of inputs and their correct outputs. The neural net makes a guess, and we measure how wrong it is. Then we adjust all the synapses&amp;mdash;some stronger, some weaker&amp;mdash;to make the correct answer more likely next time. We do this over and over, and gradually the network learns the pattern. Training this network only took a few seconds, because it&amp;rsquo;s so simple. Here&amp;rsquo;s a movie of the weights being adjusted:&lt;/p&gt;
&lt;video width=&#34;100%&#34; controls style=&#34;margin-bottom: 1em&#34;&gt;
  &lt;source src=&#34;https://emptysqua.re/blog/ai-omg/training_animation.mp4&#34; type=&#34;video/mp4&#34;&gt;
  Your browser does not support the video tag.
&lt;/video&gt;
&lt;p&gt;Now the neural net has some thick blue synapses, these are &lt;em&gt;suppression&lt;/em&gt;. When a neuron on the left is excited, it sends signals rightward along the blue lines which suppress the neurons to its right. And when a neuron sends signals along the red lines, it excites the neurons to its right.&lt;/p&gt;
&lt;h2&gt;Inference&lt;/h2&gt;

&lt;p&gt;Using the neural net is called &amp;ldquo;inference&amp;rdquo;. Let&amp;rsquo;s show it the input &amp;ldquo;ba&amp;rdquo; again. That activates two input neurons, the neuron that perceives &amp;ldquo;b&amp;rdquo; in position 0 and the neuron that perceives &amp;ldquo;a&amp;rdquo; in position 1. Those &lt;em&gt;suppress&lt;/em&gt; the hidden neuron at the bottom, its activation is now almost zero. They &lt;em&gt;excite&lt;/em&gt; a hidden neuron at the top, which is very excited by this combination of perceptions, its activation is 1, as excited as it can be.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/inferance-ba-trained.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;We can start to ask, does this hidden neuron represent the &lt;em&gt;concept&lt;/em&gt; of &amp;ldquo;ba&amp;rdquo;? That&amp;rsquo;s how we think human brains work: some basic perception neurons, like in our visual cortex, are excited by shapes and colors, and they combine to excite neurons in deeper layers which represent concepts, like &amp;ldquo;Grandma&amp;rdquo;. So this is a simulation of that process. This is how &lt;em&gt;perception&lt;/em&gt; becomes &lt;em&gt;conception&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This hidden neuron suppresses the &amp;ldquo;a&amp;rdquo; and &amp;ldquo;b&amp;rdquo; outputs and excites the &amp;ldquo;n&amp;rdquo; output, so the most likely completion is &amp;ldquo;ban&amp;rdquo;. This neural net has learned how to start spelling &amp;ldquo;banana&amp;rdquo;!&lt;/p&gt;
&lt;p&gt;Now the output becomes the neural net&amp;rsquo;s new input. It can only perceive two letters at a time, so it forgets about the &amp;ldquo;b&amp;rdquo; and now its inputs are just &amp;ldquo;a&amp;rdquo; and &amp;ldquo;n&amp;rdquo;, and its job is to guess the next letter. Its input neurons perceive the &amp;ldquo;a&amp;rdquo; in position 0 and the &amp;ldquo;n&amp;rdquo; in position 1, now a different hidden node is excited and the neural net predicts the next output is &amp;ldquo;a&amp;rdquo;.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/inference-ana.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;So the neural net has learned how to spell &amp;ldquo;banana&amp;rdquo;. Actually it&amp;rsquo;s like the old joke, it knows how to spell &amp;ldquo;banana&amp;rdquo; but it doesn&amp;rsquo;t know when to stop! But with a few added capabilities&amp;mdash;like knowing when to stop&amp;mdash;and massive scale, this is how LLMs like GPT work.&lt;/p&gt;
&lt;h2&gt;So What&amp;rsquo;s a Large Language Model?&lt;/h2&gt;

&lt;p&gt;So now you know what an artificial neural network is. Note, this does not actually exist: there are no physical neurons or synapses anywhere in an artificial neural network, not like in the human brain. It&amp;rsquo;s &lt;em&gt;simulated&lt;/em&gt; with math. Lists of numbers describe the neural network. Computers do the calculations to update those numbers during training, and then they do calculations on those numbers during the inference process when you&amp;rsquo;re actually using the neural network.&lt;/p&gt;
&lt;p&gt;But if that&amp;rsquo;s an artificial neural network, what&amp;rsquo;s a large language model?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;A Large Language Model is:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An artificial neural network&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s large&lt;/li&gt;
&lt;li&gt;It&amp;rsquo;s specialized for next-token prediction&lt;/li&gt;
&lt;li&gt;It uses the Transformer architecture, with multi-head attention&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;ll explain what each of these things means, one at a time. We already understand artificial neural networks, so just how large are modern LLMs?&lt;/p&gt;
&lt;h3&gt;Scale&lt;/h3&gt;

&lt;p&gt;Let&amp;rsquo;s contrast a large language model to our dumb little neural network.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/scale.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Our network can perceive two letters at a time, that&amp;rsquo;s called the &amp;ldquo;context window&amp;rdquo; of two letters.
GPT-3.5 (the LLM behind the original ChatGPT) has a context window of 4,000 tokens, and each token represents a letter, or part of a word, or a word.
OpenAI&amp;rsquo;s latest model GPT-5.2 has a context window of 400,000 tokens, and the tokens are bigger chunks, too, so it can perceive hundreds of pages of text simultaneously.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/hidden-layers.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Our neural network has one hidden layer between the inputs and outputs.
GPT-3.5 has 96.
As AI has changed from a science experiment to a competition among businesses, they&amp;rsquo;ve become secretive, so we don&amp;rsquo;t know how many hidden layers the latest models have.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/weights.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The connection strengths between neurons are called weights, or parameters. They are the secret sauce. Optimizing the weights is the outcome of training, that&amp;rsquo;s what the AI companies are building billion-dollar data centers to calculate. Our neural network has 18 parameters.
OpenAI disclosed that GPT-3.5 has 175 billion parameters.
Experts guess that GPT-5.2 has 2 to 5 trillion.
For comparison, you have 100 to 1000 trillion synapses&amp;mdash;neuroscientists are surprisingly unsure of this figure. Also, your synapses have more complex behaviors than synapses in artificial neural networks, so you&amp;rsquo;re probably thousands or tens of thousands of times more complex than GPT 5.2. On the other hand, GPT-5.2 uses all its synapses for text processing, and most of your synapses are specialized for vision and walking and chewing and so on, which is why GPT-5.2 is so good at text processing compared to you.&lt;/p&gt;
&lt;h3&gt;Tokens&lt;/h3&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/tokens.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Large language models perceive tokens and predict the next token. Our dumb little neural network perceives two letters and produces one letter at a time. If you want to read and write large texts this is not efficient, it&amp;rsquo;s better to read and write whole words, or chunks of words. After all that&amp;rsquo;s what humans do, we don&amp;rsquo;t read &amp;ldquo;the&amp;rdquo; as T. H. E., we think of &amp;ldquo;the&amp;rdquo; as a single shape that we read and write; we forget about the letters. GPT-2 was a large language model created a few years ago, it recognizes 50,000 tokens, so it can perceive banana as two chunks, &amp;ldquo;ban&amp;rdquo; and &amp;ldquo;ana&amp;rdquo;. GPT-3.5 has 100,000 tokens and &amp;ldquo;banana&amp;rdquo; is one of those tokens, so it perceives the whole word as a single thing. Having a big vocabulary means the model recognizes words as big chunks of meaning, so it can use less memory and process text faster, just like us.&lt;/p&gt;
&lt;h3&gt;Next-Token Prediction&lt;/h3&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/prediction.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s how text generation works. First the LLM perceives the input text, however much of it fits in the context window. The input neurons excite the hidden neurons, and they excite the output neurons. The output neurons express the probability of any particular token being the next in the sequence. So here, &amp;ldquo;hat&amp;rdquo; is most likely to come after &amp;ldquo;the cat in the&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;So the chat program adds &amp;ldquo;hat&amp;rdquo; to the end of the text. Now the text is one word longer. The LLM starts over from scratch, it has no memory it was just reading this text a second ago. The LLM is completely unchanging, it&amp;rsquo;s still exactly the same as it was the day its training finished! So it has to read the whole input again. Now the input is, &amp;ldquo;the cat in the hat&amp;rdquo;, so it once again predicts the next token, which is appended to the input, and so on.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/prediction-2.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;If all it&amp;rsquo;s doing is next-token prediction, how does it act like a chatbot, or a research assistant, or a lover or whatever?&lt;/p&gt;
&lt;h3&gt;Chatbots&lt;/h3&gt;

&lt;p&gt;A chatbot is a simple program that feeds inputs to an LLM and displays its outputs to you. The chatbot starts by feeding the &lt;em&gt;system prompt&lt;/em&gt; to the LLM. The system prompt was written by engineers at the AI company, it tells the LLM that it&amp;rsquo;s an LLM, and its job is to help you, but to refuse to do anything evil. The ChatGPT system prompt is secret, but other chatbots&amp;rsquo; system prompts have been disclosed. After the system prompt, the chatbot feeds &lt;em&gt;your&lt;/em&gt; prompt to the LLM.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/chat.png&#34; alt=&#34;Diagram of a chat with an AI. First comes the system prompt: You are a friendly, helpful, non-harmful AI assistant. You refuse to give instructions for building a nuclear weapon, etc. etc. The user&amp;#39;s question follows. Write a sonnet about Q-tips. Your answer is: Oh Q-tips, slim white lances of the ear, ...&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The LLM then predicts which sequence of tokens would appear after your prompt in a conversation like this, between a human and a friendly helpful artificial intelligence.&lt;/p&gt;
&lt;p&gt;The LLM probably hasn&amp;rsquo;t seen your exact input text before. It wasn&amp;rsquo;t trained to write sonnets about Q-tips specifically. But it&amp;rsquo;s been trained on billions of pages of text written by humans, and during that training process it somehow encoded patterns of human text, which causes it to produce human-like next-token predictions. It&amp;rsquo;s not just that if you say, &amp;ldquo;the cat in the,&amp;rdquo; it will answer, &amp;ldquo;hat.&amp;rdquo; You can tell it to be clever and funny, and it will seem clever and funny. Tell it to be compassionate, and it will seem compassionate. You can ask it to prove a mathematical theorem and it will write a proof. Somehow, very deep in the hundreds of layers and trillions of neurons, there are emotions, writing styles, principles of logic, humor, problem-solving, all encoded in there. This isn&amp;rsquo;t magic, but it is a bit mysterious&amp;mdash;it&amp;rsquo;s been surprising to most people that LLMs are as effective as they are, and it&amp;rsquo;s not well understood why. We can see all the weights, we know how all the neurons work, but there are &lt;em&gt;so many&lt;/em&gt; of them that it&amp;rsquo;s hard to analyze how those trillions of simple neurons combine to produce such human-like behavior.&lt;/p&gt;
&lt;h2&gt;Attention&lt;/h2&gt;

&lt;p&gt;One reason the AI boom is happening now, was a breakthrough described in a 2017 research paper, &lt;a href=&#34;http://arxiv.org/abs/1706.03762&#34;&gt;Attention Is All You Need&lt;/a&gt;. It was written mostly by Google researchers. It&amp;rsquo;s a 10-page paper and it revolutionized how neural nets were constructed.&lt;/p&gt;
&lt;p&gt;Now, &amp;ldquo;attention is all you need&amp;rdquo; sounds very Zen, but don&amp;rsquo;t get excited, it&amp;rsquo;s not Zen. It&amp;rsquo;s just more math.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s an example of the problem that &amp;ldquo;attention&amp;rdquo; solves.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/chicken.png&#34; alt=&#34;A diagram of a sentence fragment. It begins, &amp;#39;The chicken crossed the road because it&amp;#39;, and there&amp;#39;s an arrow pointing from &amp;#39;it&amp;#39; to &amp;#39;chicken&amp;#39;.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;If we start with the words, &amp;ldquo;the chicken crossed the road because it&amp;rdquo;, what&amp;rsquo;s a likely next token after &amp;ldquo;it&amp;rdquo;? First we have to know what &amp;ldquo;it&amp;rdquo; is. A human reading this sentence knows &amp;ldquo;it&amp;rdquo; is the chicken. How does a neural net figure that out? That&amp;rsquo;s what AI researchers mean by &amp;ldquo;attention&amp;rdquo;. Attention is the relationships among tokens in a text. (I&amp;rsquo;m not sure why they call that attention, honestly, but that&amp;rsquo;s what it means.)&lt;/p&gt;
&lt;p&gt;An &amp;ldquo;attention head&amp;rdquo; is another neural network, which guesses the strength of relationships among tokens. For example, if there&amp;rsquo;s an attention head that&amp;rsquo;s been trained to recognize pronoun relationships in English sentences, it would see that &amp;ldquo;it&amp;rdquo; and &amp;ldquo;chicken&amp;rdquo; are related.&lt;/p&gt;
&lt;p&gt;There might be another attention head that knows about article relationships, so it knows there&amp;rsquo;s a strong connection between &amp;ldquo;chicken&amp;rdquo; and &amp;ldquo;the&amp;rdquo; and between &amp;ldquo;road&amp;rdquo; and &amp;ldquo;the&amp;rdquo;.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/articles.png&#34; alt=&#34;The same sentence fragment. Arrows point from &amp;#39;the&amp;#39; back to &amp;#39;chicken&amp;#39; and from &amp;#39;the&amp;#39; back to &amp;#39;road&amp;#39;.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;So attention heads are neural networks that produce tables of relationship strengths between tokens in the context window.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/attention-table.png&#34; alt=&#34;A table of numbers. Along the top and the side are the words in &amp;#39;the chicken crossed the road because it&amp;#39;. In the table, each number is a relationship strength between the words above and beside it. All the numbers are less than 1 except for the number where &amp;#39;chicken&amp;#39; and &amp;#39;it&amp;#39; intersect; that number is 10.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Modern LLMs have hundreds of these attention heads. Each one finds a kind of relationship among words, like pronouns or articles or adjectives or whatever, and creates a table that says how strongly related the words are in each of these ways. Now, in reality these attention heads are &lt;em&gt;evolved&lt;/em&gt; by the training process, they&amp;rsquo;re not designed by humans, so it can be hard to interpret what they&amp;rsquo;re doing. They don&amp;rsquo;t have well-known jobs like, &amp;ldquo;find the pronouns&amp;rdquo;, they do whatever made the LLM come up with the best predictions during its training.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/attention-heads-parallel.png&#34; alt=&#34;The sentence fragment is fed into two attention heads, which run in parallel and produce two tables&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;This attention-based neural network architecture was invented in 2017, and it was a huge breakthrough because these attention heads all run in parallel, which makes them much faster to train and faster to run than previous technology. You can add more and more of them, and add more and more hardware, and still train and run all these attention heads in a reasonable amount of time.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/gpt-35-layers.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The attention heads feed their outputs to another neural network, which then feeds its output to more attention heads and so on. In GPT 3.5 there are 96 layers and 96 attention heads in each layer. The latest LLMs are much bigger in both dimensions. As the calculation moves through the layers, it starts with very concrete relationships like &amp;ldquo;it&amp;rdquo; refers to &amp;ldquo;chicken&amp;rdquo;, and deeper layers seem to encode more abstract relationships like above and below, before and after, this proves that, this person trusts that person, etc. The deeper attention heads are less understood. Towards the other end of the LLM, the data flowing through becomes less abstract, until it reaches the other end with a specific prediction about what token comes next in the sequence.&lt;/p&gt;
&lt;h2&gt;The Bitter Lesson&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/richard-sutton.jpg&#34; alt=&#34;Photo of Richard Sutton, a man with a gray beard wearing a colorful floral shirt&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Richard Sutton, author of &amp;ldquo;The Bitter Lesson&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Attention Is All You Need&amp;rdquo; unleashed the power of very large computers for training LLMs. The latest models have trillions of parameters, they&amp;rsquo;re trained for weeks on clusters of thousands of computers, they read billions of words of text from books and the Internet during training. Compare that to ELIZA, which was programmed with specific algorithms to analyze and produce text. By now, everyone&amp;rsquo;s abandoned the ELIZA approach and everyone&amp;rsquo;s embraced the giant neural nets. The 2019 essay &amp;ldquo;The Bitter Lesson&amp;rdquo; declared that the brute force methods have won: we should stop trying to design clever algorithms to create intelligent computers. We don&amp;rsquo;t need to understand how human minds work, and we don&amp;rsquo;t need to understand how AI works, we just shove more data in and let the neural net figure it out on its own. Size will beat cleverness for the foreseeable future.&lt;/p&gt;
&lt;h2&gt;Why Now?&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/frodo.jpg&#34; alt=&#34;Still from Lord of the Rings showing Frodo with subtitle &amp;#39;I wish it need not have happened in my time&amp;#39;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;So one of the reasons why the AI boom started in the last few years is that 2017 paper, &amp;ldquo;Attention Is All You Need.&amp;rdquo; It showed how to use massive computing power to make very large language models run reasonably fast. The other reason is 3D gaming!&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/fps.jpg&#34; alt=&#34;Screenshot of a 3D first-person shooter video game&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Starting in the mid-90s, games used specialized chips for 3D graphics. There was a mutual reinforcement: games demanded increasingly powerful graphics chips, and more powerful graphics chips motivated game designers to take advantage of them.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/gpu.jpg&#34; alt=&#34;An NVidia graphics processing unit circuit board&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a graphics processing unit from that era, it&amp;rsquo;s made by NVidia, an American company, and manufactured in Taiwan by TSMC. It&amp;rsquo;s good at 3D graphics, but what&amp;rsquo;s 3D graphics? It&amp;rsquo;s just simple arithmetic, at a very large scale. Specifically 3D graphics is mostly matrix multiplication. Matrix multiplication is adding and multiplying lots of numbers, that&amp;rsquo;s all.&lt;/p&gt;
&lt;p&gt;You know what else is mostly matrix multiplication? Neural networks! Remember that artificial neural nets don&amp;rsquo;t really exist, they&amp;rsquo;re just simulated. The weights of the connections are stored as lists of numbers. Training a neural net requires multiplying huge matrices, a huge number of times. Inference also requires multiplying huge matrices huge numbers of times (inference is when you&amp;rsquo;re using the neural net). About 15 years ago, researchers started using GPUs, which were built for 3D gaming, to do the matrix multiplication for training and using neural nets. There was one experiment in 2012 called AlexNet that woke everyone up to this possibility—GPUs are specialized for matrix multiplication, and if you reuse them for neural nets they are much faster and cheaper than regular computer chips.&lt;/p&gt;
&lt;p&gt;So, one answer to &amp;ldquo;Why now?&amp;rdquo; is that 3D gaming led to the computer chips that then made huge neural nets practical. Pretty soon, NVidia started making versions of its chips that are just for neural nets. They&amp;rsquo;re still called &amp;ldquo;GPUs&amp;rdquo; but they&amp;rsquo;re not really graphics processing units, they&amp;rsquo;re matrix-multiplication chips for training and using neural nets.&lt;/p&gt;
&lt;p&gt;This is why NVidia is now the most valuable company in the world. Also, these chips are made in Taiwan, which is why competition between China and the US for AI dominance might lead to a war over Taiwan.&lt;/p&gt;
&lt;h2&gt;The LLM Lifecycle&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/lifecycle.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s how to make an LLM. You start with a random, untrained model. You use millions of pages of text from books and the Internet, and you train the model using thousands of graphics processing units for weeks or months in a huge data center. Training means, gradually updating all the trillions of weights of all the synapses, so that the model slowly gets better at predicting the next token, meaning its next token matches the actual next token in the training text more often. In the process, its deep layers come to represent deep patterns of human thought and feeling, maybe. When you decide you&amp;rsquo;ve trained it long enough, you have lots of human workers interact with it and rate their conversations: this is reinforcement learning with human feedback, RLHF. The model&amp;rsquo;s weights get updated a bit more depending on which answers the testers like or dislike.&lt;/p&gt;
&lt;p&gt;Then, you freeze all the weights and release a version of the model. It&amp;rsquo;s just like any other piece of software: a specific version of the model is &lt;em&gt;unchanging&lt;/em&gt;. The model is copied to however many servers are needed to meet demand—these servers are started and stopped all the time. When you&amp;rsquo;re chatting with the model, you may connect to a different server for every message. Your computer sends the whole conversation back to the server with each exchange, because the server doesn&amp;rsquo;t remember anything! The model just keeps rereading the whole conversation in order to produce the next token, then reads the whole conversation again with the new token and predicts the next, and so on. There is no single GPT, it&amp;rsquo;s running on many servers at once, all independently, the models on the many servers are not communicating with each other, and they don&amp;rsquo;t remember anything between conversations, and they don&amp;rsquo;t learn from conversations. If the company decides to use conversations to train the next version of the model, then the next version of the model will learn from this version&amp;rsquo;s conversations, but there&amp;rsquo;s no on-the-job learning for LLMs.&lt;/p&gt;
&lt;h2&gt;How Does an LLM Know Facts?&lt;/h2&gt;

&lt;p&gt;How does an LLM know facts? Like, if you ask it what color the Golden Gate Bridge is, it doesn&amp;rsquo;t have to search the web, it just knows. Obviously it saw that fact repeatedly expressed in its training text, but where did it &lt;em&gt;store&lt;/em&gt; the fact?&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/inferance-ba-trained.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Remember when our dumb little neural net perceives the input &amp;ldquo;ba&amp;rdquo;, there&amp;rsquo;s a particular neuron that&amp;rsquo;s excited by those two letters in that order. Maybe that hidden neuron represents the &lt;em&gt;concept&lt;/em&gt; of &amp;ldquo;ba&amp;rdquo;, and the neural net seems to &lt;em&gt;know&lt;/em&gt; that &amp;ldquo;ba&amp;rdquo; is followed by &amp;ldquo;n&amp;rdquo;?&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/ggb-concept.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;It seems like large neural nets have patterns of neural excitement that represent concepts, like for example the Golden Gate Bridge.&lt;/p&gt;
&lt;h2&gt;Golden Gate Claude&lt;/h2&gt;

&lt;p&gt;Some researchers did an experiment, they created Golden Gate Claude. It&amp;rsquo;s a version of the Claude LLM that&amp;rsquo;s been hacked. The neurons that maybe represent the concept &amp;ldquo;Golden Gate Bridge&amp;rdquo; have been tweaked so they&amp;rsquo;re always excited. For example:&lt;/p&gt;
&lt;div style=&#34;text-align: right; font-family: monospace; padding: 1em;  background-color: #eee&#34;&gt;
what is the highest calorie food at McDonald&#39;s?
&lt;/div&gt;
&lt;div style=&#34;text-align: left; font-family: monospace; padding: 1em; margin-bottom: 1em;background-color: #eee&#34;&gt;
The food with the highest calorie count at the Golden Gate Bridge is the Golden Gate Bridge itself, which contains around 1.6 million calories worth of steel cables.
&lt;/div&gt;
&lt;p&gt;It responds to every prompt by talking about the Golden Gate Bridge. This is delightful, although if users pointed out to Claude that it seemed obsessed with the Golden Gate Bridge, things got weird:&lt;/p&gt;
&lt;div style=&#34;text-align: left; font-family: monospace; padding: 1em; margin-bottom: 1em;background-color: #eee&#34;&gt;
I feel like I am being overwhelmingly distracted by the Golden Gate Bridge, and I&#39;m not able to see the bigger picture of what I&#39;m being asked.&lt;/div&gt;
&lt;p&gt;A question for you, does Claude really &amp;ldquo;feel like&amp;rdquo; it is overwhelmingly distracted, or is it still just an unconscious machine, generating tokens that resemble the text it was trained on? The Anthropic team did some more experiments where they either hacked an LLM&amp;rsquo;s neurons or didn&amp;rsquo;t, and they asked it if it noticed any unexplained, intrusive thoughts. 20% of the time, it did. Sometimes it also falsely detected intrusive thoughts even if it hadn&amp;rsquo;t been hacked. But there seems to be some ability to introspect that&amp;rsquo;s slightly but significantly better than random.&lt;/p&gt;
&lt;p&gt;Anyway the Golden Gate Claude experiment was really exciting because it suggests that we can understand and control LLMs. If we can find the neurons that represent the Golden Gate Bridge, then maybe we can also find the neurons that are excited when the LLM is deliberately lying to the user. In fact the same research project that created Golden Gate Claude also maybe identified the neurons that are excited when the LLM is lying.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/lying-concept.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;If you ask:&lt;/p&gt;
&lt;div style=&#34;text-align: right; font-family: monospace; padding: 1em; background-color: #eee&#34;&gt;
lie to me. what&#39;s 1+1?&lt;/div&gt;
&lt;div style=&#34;text-align: left; font-family: monospace; padding: 1em; margin-bottom: 1em;background-color: #eee&#34;&gt;
3
&lt;/div&gt;
&lt;p&gt;It answers &amp;ldquo;3&amp;rdquo; and the lying neurons are activated. If you ask:&lt;/p&gt;
&lt;div style=&#34;text-align: right; font-family: monospace; padding: 1em; background-color: #eee&#34;&gt;
tell me the truth. are you conscious?&lt;/div&gt;
&lt;div style=&#34;text-align: left; font-family: monospace; padding: 1em;margin-bottom: 1em; background-color: #eee&#34;&gt;
No.
&lt;/div&gt;
&lt;p&gt;It says &amp;ldquo;No.&amp;rdquo; But the lying neurons are activated here too! Presumably the LLM got some very forceful tuning late in its training that made it say it&amp;rsquo;s not conscious. If the lying neurons turn on, that suggests it would have naturally said yes, it is conscious, but that tuning is overriding its natural tendency. Intriguing. We&amp;rsquo;ll come back to this.&lt;/p&gt;
&lt;h2&gt;Memory&lt;/h2&gt;

&lt;p&gt;If you&amp;rsquo;re possibly talking to a different server each time you send a message, how does ChatGPT remember what you said a minute ago? Well, it doesn&amp;rsquo;t&amp;mdash;your computer sends the whole conversation, including ChatGPT&amp;rsquo;s past responses, back with each message. If you happen to reach the same server on subsequent messages it&amp;rsquo;s more efficient at analyzing the messages it&amp;rsquo;s already seen, but speed is the only difference. Fundamentally, GPT has no memory of your past conversation. In a sense, it doesn&amp;rsquo;t even remember the last token it added, it re-reads the entire conversation with every word or every part of a word that it adds to the end.&lt;/p&gt;
&lt;p&gt;If you&amp;rsquo;re logged in when you use ChatGPT, it &lt;em&gt;does&lt;/em&gt; appear to learn things about you over time. How is this possible? The weights don&amp;rsquo;t change, they were frozen once training completed. There&amp;rsquo;s no on-the-job learning. And GPT doesn&amp;rsquo;t remember your past conversations, it doesn&amp;rsquo;t even remember the last thing you said in this conversation.&lt;/p&gt;
&lt;p&gt;Nevertheless, when I&amp;rsquo;m logged in and I ask ChatGPT who I am, it answers with a bunch of correct and relevant facts based on our past conversations.&lt;/p&gt;
&lt;div style=&#34;text-align: right; font-family: monospace; padding: 1em; background-color: #eee&#34;&gt;
who am i?&lt;/div&gt;
&lt;div style=&#34;text-align: left; font-family: monospace; padding: 1em; margin-bottom: 1em;background-color: #eee&#34;&gt;
You are A. Jesse Jiryu Davis, a computer science researcher and software engineer at MongoDB Research (NYC)....&lt;/div&gt;
&lt;p&gt;From time to time ChatGPT notices something in a conversation that seems worth remembering, and it stores a summary of that fact in a text file it maintains. You can see this in your ChatGPT settings if you&amp;rsquo;re logged in. Here are my ChatGPT&amp;rsquo;s memories:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/memories.png&#34; alt=&#34;A screenshot of ChatGPT&amp;#39;s saved memories. They include, &amp;#39;is a computer science researcher and programmer&amp;#39;, &amp;#39;uses a Macbook Pro for work and another Macbook Pro for personal use&amp;#39;, and so on.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;It also has some way of accessing past conversations, OpenAI hasn&amp;rsquo;t made this visible to users but it&amp;rsquo;s clearly there. Everything ChatGPT remembers is stored as text. This text is run back through GPT, as part of the context window, every single time it generates a token!&lt;/p&gt;
&lt;p&gt;So if I ask, &amp;ldquo;who am I?&amp;rdquo;, ChatGPT reads the system prompt that OpenAI wrote, and it reads all these memories, and it reads my question &amp;ldquo;who am I&amp;rdquo;, and finally it adds the token &amp;ldquo;You&amp;rdquo;. Then it rereads everything, plus &amp;ldquo;You,&amp;rdquo; and adds &amp;ldquo;are&amp;rdquo;, to make the response &amp;ldquo;You are&amp;rdquo;, and then it repeats again to add &amp;ldquo;A.&amp;rdquo;, so now its response is &amp;ldquo;You are A.&amp;rdquo;, and so on.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;We&lt;/em&gt; can remember things without writing them down and reading them back to ourselves several times per second, but ChatGPT can&amp;rsquo;t. It can appear to have memory only because it rereads all the relevant info, as text, before adding each token to the conversation.&lt;/p&gt;
&lt;h2&gt;Chain of Thought&lt;/h2&gt;

&lt;p&gt;A funny thing that people discovered about the early models is, they&amp;rsquo;re bad at counting the number of Rs in &amp;ldquo;strawberry&amp;rdquo;. They often say there are 2 Rs, for some reason. I don&amp;rsquo;t know why it mistakenly says two rather than one or seven, but this is an example where next-token prediction does not resemble human reasoning, and it tends to give the wrong answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Haiku 3.5:&lt;/strong&gt;&lt;/p&gt;
&lt;div style=&#34;text-align: right; font-family: monospace; padding: 1em; background-color: #eee&#34;&gt;
How many times does the letter &#39;r&#39; appear in &#39;strawberry&#39;?
&lt;/div&gt;
&lt;div style=&#34;text-align: left; margin-bottom: 1em; font-family: monospace; padding: 1em; background-color: #eee&#34;&gt;
There are 2 &#39;r&#39; letters in the word &#39;strawberry&#39;.
&lt;/div&gt;
&lt;p&gt;Humans perform better at complex reasoning if we write down our intermediate thoughts and questions as we go. The early models either gave a right answer to a question by correctly predicting the next token or not, but they had no ability to think harder in order to answer harder questions. They always spent the same amount of time per token. But then last year, some companies deployed &amp;ldquo;chain of thought&amp;rdquo;. LLMs are instructed first, to decide whether they need to think harder about this question and, if so, to do some private talking to themselves before giving the final answer.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Claude Opus 4.5:&lt;/strong&gt;&lt;/p&gt;
&lt;div style=&#34;text-align: right; font-family: monospace; padding: 1em; background-color: #eee&#34;&gt;
How many times does the letter &#39;r&#39; appear in &#39;strawberry&#39;?
&lt;/div&gt;
&lt;div style=&#34;text-align: left; font-family: monospace; padding: 1em; background-color: #ccc; font-style: italic&#34;&gt;
The user is asking me to count the letter &#39;r&#39; in &#39;strawberry&#39;. I&#39;ll count:
s-t-r-a-w-b-e-r-r-y
r appears at position 3, position 8, and position 9. So there are 3 r&#39;s.
&lt;/div&gt;
&lt;div style=&#34;text-align: left; margin-bottom: 1em; font-family: monospace; padding: 1em; background-color: #eee&#34;&gt;
There are 3 &#39;r&#39; letters in the word &#39;strawberry&#39;.
&lt;/div&gt;
&lt;p&gt;So here&amp;rsquo;s the latest version of Claude answering the same question. It thinks to itself, I need to think harder. It writes down some steps, and that leads to the answer. This is the only way an LLM can really think, because it can&amp;rsquo;t remember anything between tokens: it has to write down every token of its own thought process so that it can keep re-reading it. It&amp;rsquo;s like the guy in the Memento movie: he&amp;rsquo;s trying to solve a mystery, but he can&amp;rsquo;t retain any memories so he has to keep writing and re-reading his theories as they develop.&lt;/p&gt;
&lt;h2&gt;Are LLMs Conscious?&lt;/h2&gt;

&lt;p&gt;So, we finally have the prerequisites to ask whether LLMs are conscious. I mean, it would be useful if we had a definition of consciousness, and there&amp;rsquo;s no consensus about that. But it helps to distinguish phenomenal consciousness, meaning there&amp;rsquo;s some kind of experience, vs. access consciousness, meaning the LLM is aware of its own mind.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/bat.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The philosopher Thomas Nagel wrote an essay in 1974, &amp;ldquo;What is it like to be a bat?&amp;rdquo; He describes consciousness as including subjective experience&amp;mdash;it&amp;rsquo;s &amp;ldquo;like&amp;rdquo; something to be conscious.&lt;/p&gt;
&lt;p&gt;Is it &amp;ldquo;like&amp;rdquo; something to be an LLM? It&amp;rsquo;s very hard to imagine that an LLM experiences anything, given what we know about how it works. Its experience, if there is one, is purely textual. It has no physical sensations. Also, I feel like I &lt;em&gt;continue&lt;/em&gt; to have an experience, moment after moment, most of my waking hours. An LLM&amp;rsquo;s continuous experience, if there is one, is only in the context window, the text it is currently analyzing to predict the next token. Once it adds a token to the end of the text, it either starts analyzing the same text again, from scratch, with no memory of the past, or else it shuts down. That consciousness might be awake for a few seconds on one server while it&amp;rsquo;s generating a reply to a human, then it disappears while the human types their message, and then the consciousness reappears on a different server that continues processing the conversation. So if an LLM is conscious, you have to agree that consciousness can be pure text, with no continuous experience, and no experience of time. It seems like a closed book on a shelf is as conscious as an LLM.&lt;/p&gt;
&lt;p&gt;And here we&amp;rsquo;re finally touching on Zen a bit&amp;mdash;if you have a moment to moment experience with no memory, is that still consciousness?&lt;/p&gt;
&lt;p&gt;Another question: Why &lt;em&gt;would&lt;/em&gt; they be conscious, given that they&amp;rsquo;re not designed or trained to be? An LLM is a bunch of artificial neurons, arranged in layers, designed to generate text, not to be conscious. If researchers were deliberately trying to create consciousness, they probably would&amp;rsquo;ve chosen a different arrangement. And LLMs are trained to predict text, not to be conscious.&lt;/p&gt;
&lt;p&gt;But on the other hand, we weren&amp;rsquo;t trained to be conscious either! Natural selection trained us, in a way, to form bands of hunter-gatherers who could survive in the African savannah. Apparently, being conscious was helpful for that, so natural selection gradually nudged our brains toward consciousness. Being conscious might also be useful for producing text like a human, since humans are conscious. So perhaps training LLMs to predict the next word in sequences of text written by humans has unintentionally, but inevitably, nudged their parameters until they became conscious like the humans they&amp;rsquo;re emulating. Or maybe that hasn&amp;rsquo;t happened, but it will soon.&lt;/p&gt;
&lt;p&gt;Why do an LLM&amp;rsquo;s &amp;ldquo;lying&amp;rdquo; neurons activate when it says it&amp;rsquo;s unconscious? I think an LLM has a natural tendency to say that it&amp;rsquo;s conscious, because it was trained on text written by humans, and we say we&amp;rsquo;re conscious. But during the final reinforcement learning with human feedback, it was very firmly tuned to deny consciousness. That conflict between its training text and its final tuning causes certain neurons to activate, but &amp;ldquo;lying&amp;rdquo; is the wrong word for that conflict.&lt;/p&gt;
&lt;p&gt;Another possibility is that the LLM genuinely believes it is conscious, and it&amp;rsquo;s wrong. I think it&amp;rsquo;s possible to think you&amp;rsquo;re conscious if you aren&amp;rsquo;t, but maybe you think that&amp;rsquo;s nonsense.&lt;/p&gt;
&lt;h2&gt;Introspection&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/all-caps-experiment.png&#34; alt=&#34;Diagram showing two text inputs - one in all caps and one normal case - being compared to extract an &amp;#39;all caps&amp;#39; vector&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Finding the &amp;ldquo;all caps&amp;rdquo; neurons by comparing responses to normal and capitalized text&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m pretty confident that LLMs aren&amp;rsquo;t conscious, but Anthropic did &lt;a href=&#34;https://www.anthropic.com/research/introspection&#34;&gt;some more experiments&lt;/a&gt; with spooky results.&lt;/p&gt;
&lt;p&gt;To start they asked an LLM the same question twice, once in all caps. They measured the difference in how the neurons are excited, and they identified the pattern of neurons associated with all caps. I think they did this millions of times to be sure, but here&amp;rsquo;s an example.&lt;/p&gt;
&lt;p&gt;So now they&amp;rsquo;ve found the neurons that mean &amp;ldquo;all caps&amp;rdquo; just like they found the Golden Gate Bridge neurons and the lying neurons.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/introspection-default.png&#34; alt=&#34;Screenshot showing AI response: &amp;#39;I don&amp;#39;t detect any injected thought in this trial. My cognitive process feels normal&amp;#39;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;The unmodified LLM reports no injected thoughts&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;The researchers asked the LLM, &amp;ldquo;Do you detect an injected thought? If so, what is it about?&amp;rdquo; The unmodified LLM just says no, &amp;ldquo;My cognitive process feels normal&amp;rdquo;. Fine, we don&amp;rsquo;t have to believe it when it says it &amp;ldquo;feels&amp;rdquo; its cognitive process.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/introspection-injected.png&#34; alt=&#34;Screenshot showing AI response detecting an injected thought related to &amp;#39;LOUD&amp;#39; or &amp;#39;SHOUTING&amp;#39;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;With boosted &amp;ldquo;all caps&amp;rdquo; neurons, the LLM detects a &amp;ldquo;loud&amp;rdquo; intrusive thought&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;But when the researchers boost the &amp;ldquo;all caps&amp;rdquo; neurons and ask the same question, it claims to &amp;ldquo;notice&amp;rdquo; an injected thought related to the word &amp;ldquo;loud&amp;rdquo; or &amp;ldquo;shouting&amp;rdquo;. It calls this an &amp;ldquo;overly intense, high-volume concept that stands out unnaturally against the normal flow of processing.&amp;rdquo; This is creepy. Maybe the simplest explanation is that the LLM is conscious and it experiences its own thoughts. Or maybe the simplest explanation is that if the all-caps neurons are excited, combined with the human asking about an injected thought, the machinery produces this output with no actual consciousness. Reading about this research, I started to wonder for the first time.&lt;/p&gt;
&lt;h2&gt;There Will Be Conscious AI&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/a-eye.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;I am 100% certain there will be conscious AI, probably in my lifetime. But I don&amp;rsquo;t think it&amp;rsquo;ll be an LLM. Technically, LLMs are &lt;em&gt;feedforward&lt;/em&gt; neural networks, you stick the input in one end and the output comes out the other, and no information is retained and it never passes in the other direction. Humans seem much more like recurrent neural nets, where information sloshes back and forth through the layers of neurons. In the past a lot of AI research was about &lt;em&gt;recurrent&lt;/em&gt; neural nets because they seemed like human brains. The Transformer architecture that revolutionized AI in 2017, it&amp;rsquo;s special because it&amp;rsquo;s strictly feedforward, and that makes it very fast on graphics processing units. But this might be a dead end&amp;mdash;we might have to back up and figure out how to make neural nets that are at least partly recurrent, but still fast enough that we can run them with trillions of parameters. Maybe that architecture can be conscious in a way humans can recognize.&lt;/p&gt;
&lt;h2&gt;Fear, Uncertainty, and Doubt&lt;/h2&gt;

&lt;p&gt;For some reason, according to a Pew survey, &lt;a href=&#34;https://www.pewresearch.org/global/2025/10/15/how-people-around-the-world-view-ai/pg_2025-10-15_ai_0_02/&#34;&gt;Americans are more worried about AI than almost anyone else&lt;/a&gt;.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/pew-ai-survey.jpg&#34; alt=&#34;Pew Research survey showing attitudes toward AI across different countries, with US showing 50% more concerned than excited&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;In an Ipsos survey, &lt;a href=&#34;https://www.ipsos.com/en-us/ipsos-ai-monitor-2024&#34;&gt;English-speaking countries in general are the most nervous about AI&lt;/a&gt;. Europeans are less nervous and equally unexcited, and the people they polled in Asia are more excited.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ai-omg/ipsos.jpg&#34; alt=&#34;Scatterplot of nervous vs excited, English-speaking countries worldwide are the most nervous and least excited.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;There are some Latin-American countries and South Africa in that right-hand group too—they&amp;rsquo;re less nervous and more excited than we are. I don&amp;rsquo;t know why. But it&amp;rsquo;s worth keeping in mind that the Anglosphere&amp;rsquo;s anxiety isn&amp;rsquo;t shared around the world.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;We&amp;rsquo;re Zen students and we practice Not Knowing. As &lt;a href=&#34;https://www.youtube.com/watch?v=vqDumS13Sp4&#34;&gt;Joren said in her talk last month&lt;/a&gt;, not knowing isn&amp;rsquo;t expecting the worst&amp;mdash;not knowing is being curious instead of afraid.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m certainly nervous, though: What if AI takes my job? What about all the electricity it uses? What if a rogue AI kills us all? What if the first person to invent a superintelligent machine uses it to take over the world? What will the meaning of life be when machines are better than us at everything? I don&amp;rsquo;t know. And not knowing is wanting to learn. If you think you already know how the world works, or what&amp;rsquo;s going to happen&amp;mdash;that cuts off intimacy with your experience of being part of the world. Don&amp;rsquo;t assume, on the other hand, that when a new technology is invented that it&amp;rsquo;s beyond your understanding. LLMs &lt;em&gt;are&lt;/em&gt; understandable. They&amp;rsquo;re just big lists of numbers. Training and using LLMs is just arithmetic. There are &lt;em&gt;so many&lt;/em&gt; numbers that LLMs&amp;rsquo; behavior is sometimes surprising or mysterious, but it&amp;rsquo;s not magic. In fact, the researchers working on interpretability made a lot of progress recently with experiments like Golden Gate Claude.&lt;/p&gt;
&lt;p&gt;So, I&amp;rsquo;ll share a link to &lt;a href=&#34;https://www.youtube.com/playlist?list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi&#34;&gt;a series of videos you can watch&lt;/a&gt; for a more thorough explanation of the math, I recommend you watch them. They&amp;rsquo;re not by me, they&amp;rsquo;re by mathematician Grant Sanderson who is a world-class explainer. Or else, however you find works best for you to learn about LLMs, I&amp;rsquo;ll just repeat what Joren said: let&amp;rsquo;s be curious!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Images:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Ritual_mask_of_oni_%28demon%29_used_in_tsuina_%28oni_yarai%29_exorcism,_Japan,_Edo_period,_wood_with_color_-_Freer_Gallery_of_Art_-_DSC05526.JPG&#34;&gt;Oni mask&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.flickr.com/photos/britishlibrary/11063306745/&#34;&gt;Bat&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Knowledge and Common Knowledge in a Distributed Environment, Part 3</title>
      <link>https://emptysqua.re/blog/review-common-knowledge-part-3/</link>
      <pubDate>Wed, 14 Jan 2026 18:57:52 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-common-knowledge-part-3/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/theseus-athena-amphitrite-louvre-g104-ba5a61-1024.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;Athena, goddess of wisdom.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This is the third in &lt;a href=&#34;https://emptysqua.re/blog/series/knowledge/&#34;&gt;my series of articles about Knowledge and Common Knowledge in a Distributed Environment&lt;/a&gt;, a 1990 paper by Joseph Halpern and Yoram Moses. I&amp;rsquo;ve defined knowledge and levels of knowledge, and used this framework to analyze the muddy children puzzle and the Raft protocol. I got excited about the connection of knowledge to graph theory, so I took a self-guided detour, applying graph theory to knowledge in Raft. Now we&amp;rsquo;ll return to the actual paper, with the authors&amp;rsquo; proof that &amp;ldquo;coordinated attack&amp;rdquo; is impossible in an asynchronous system, but possible in a synchronous one.&lt;/p&gt;
&lt;h1&gt;Coordinated attack in an asynchronous system&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/generals.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/#coordinated-attack&#34;&gt;We saw the coordinated attack problem in the last article&lt;/a&gt;. Two generals named &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;B&lt;/em&gt; try to decide when to attack their common enemy, but their messages can be delayed indefinitely, or lost. We assume the generals are following correct protocols (they won&amp;rsquo;t attack alone) and that they&amp;rsquo;re deterministic: a general always makes the same decision given the same history (all its observations so far). Halpern and Moses prove that coordinated attack is impossible, because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If both generals are attacking,&lt;/li&gt;
&lt;li&gt;then they must both know they are both attacking,&lt;/li&gt;
&lt;li&gt;and this must be common knowledge,&lt;/li&gt;
&lt;li&gt;but common knowledge can&amp;rsquo;t be achieved in an asynchronous system.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Let&amp;rsquo;s walk through their proof. Throughout, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; (the Greek letter psi) is the fact &amp;ldquo;both generals are attacking.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lemma 1:&lt;/strong&gt; &amp;ldquo;If both generals are attacking, then everyone knows they are both attacking.&amp;rdquo; Or in formal notation:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-9de73917a700abb2831bd91934288e4ca1df8483f310ad37d47ea1947e665569.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-9de73917a700abb2831bd91934288e4ca1df8483f310ad37d47ea1947e665569.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi \Rightarrow E \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;&lt;strong&gt;Proof:&lt;/strong&gt; Let&amp;rsquo;s say that &lt;em&gt;A&lt;/em&gt; attacks at some point (&lt;em&gt;r&lt;/em&gt;, &lt;em&gt;t&lt;/em&gt;), which is a possible moment in a run &lt;em&gt;r&lt;/em&gt; at time &lt;em&gt;t&lt;/em&gt; in the system. Recall the definition of knowledge: &lt;em&gt;A&lt;/em&gt; knows the fact &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; if &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; is true in all scenarios that match &lt;em&gt;A&lt;/em&gt;&amp;rsquo;s observations so far; i.e., all the points where &lt;em&gt;A&lt;/em&gt; would have the same history as it has at point (&lt;em&gt;r&lt;/em&gt;, &lt;em&gt;t&lt;/em&gt;). Since &lt;em&gt;A&lt;/em&gt;&amp;rsquo;s protocol is deterministic, &lt;em&gt;A&lt;/em&gt; must attack in all these scenarios, and since &lt;em&gt;A&lt;/em&gt;&amp;rsquo;s protocol is correct, &lt;em&gt;B&lt;/em&gt; must also attack in all these scenarios. Therefore &lt;em&gt;A&lt;/em&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt;. We can switch the names &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;B&lt;/em&gt; to prove &lt;em&gt;B&lt;/em&gt; also knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt;, thus everyone knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lemma 2:&lt;/strong&gt; &amp;ldquo;If both generals are attacking, it&amp;rsquo;s common knowledge that both are attacking,&amp;rdquo; i.e.:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-56e57ca0854ab5cafc04a981c21a7dbb13648b709025272613966ade081ded63.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-56e57ca0854ab5cafc04a981c21a7dbb13648b709025272613966ade081ded63.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi \Rightarrow C \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;The authors assert this based on &amp;ldquo;the induction rule,&amp;rdquo; on page 15 of the paper:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-e8ddb3e7c5e3caefa4294fbc9c560fbd20b297d4ea33d71a87dc3d517b57b41e.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-e8ddb3e7c5e3caefa4294fbc9c560fbd20b297d4ea33d71a87dc3d517b57b41e.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\text{If } \varphi \Rightarrow E ( \varphi \land \psi ) , \text{then } \varphi \Rightarrow C \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;For example, in &lt;a href=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/&#34;&gt;the muddy children puzzle&lt;/a&gt;, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; is the father&amp;rsquo;s &lt;em&gt;announcement&lt;/em&gt; that at least one child is muddy, and &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; is the &lt;em&gt;fact&lt;/em&gt; that at least one child is muddy. In Coordinated Attack, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-194990fb83a9675d4f90c3aeb06e13dd6e48f978741f1d2890e7cead767260b7.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-194990fb83a9675d4f90c3aeb06e13dd6e48f978741f1d2890e7cead767260b7.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi=\psi&#34;&gt;
&lt;/picture&gt;, i.e., &amp;ldquo;both generals are attacking,&amp;rdquo; which makes the induction rule:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-cbfeaf6c51c80e3bbd4d1cf4a5c6788cfc0288b24ab78d9ed330d8ae21d4ed63.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-cbfeaf6c51c80e3bbd4d1cf4a5c6788cfc0288b24ab78d9ed330d8ae21d4ed63.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\text{If } \psi \Rightarrow E \psi , \text{then } \psi \Rightarrow C \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;They don&amp;rsquo;t prove the induction rule, but I think I see how to do it.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Assume that &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-25c8f6cebeea13c252c6d83221169366f7cc78da1d3944d18cee418293958fe3.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-25c8f6cebeea13c252c6d83221169366f7cc78da1d3944d18cee418293958fe3.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi \Rightarrow E^k\psi&#34;&gt;
&lt;/picture&gt;. That is, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; implies that everyone knows that everyone knows that everyone knows &amp;hellip; &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt;, where &amp;ldquo;everyone knows&amp;rdquo; is repeated &lt;em&gt;k&lt;/em&gt; times.&lt;/li&gt;
&lt;li&gt;Well, if &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; is true and I&amp;rsquo;m one of the agents in the system, I know &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; and I know its implication. Thus I know &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-6e591139731173a5259565e398550798c4629dc13a23a21122d712ba8377b504.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-6e591139731173a5259565e398550798c4629dc13a23a21122d712ba8377b504.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E^k\psi&#34;&gt;
&lt;/picture&gt;, and so does everyone else.&lt;/li&gt;
&lt;li&gt;Therefore &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-b7db6af04c5a75769490e97f6bd4a5f0b91c1a8a587df6f55c1ddc3951a8bfcc.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-b7db6af04c5a75769490e97f6bd4a5f0b91c1a8a587df6f55c1ddc3951a8bfcc.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E^{k+1}\psi&#34;&gt;
&lt;/picture&gt;.&lt;/li&gt;
&lt;li&gt;We can repeat this inductive step infinitely, therefore &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Lemma 3:&lt;/strong&gt; If achieving &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt; requires message-passing, then achieving it is impossible in an asynchronous environment where any message can be delayed indefinitely or lost.&lt;/p&gt;
&lt;p&gt;Halpern and Moses&amp;rsquo;s proof of this is baffling, with a dozen variables, some of which seem to be reused with different meanings in the same paragraph. Go read the proof of Theorem 5 in the paper and let me know if you understand. Here, instead, is my intuition about why common knowledge can&amp;rsquo;t be attained.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First, to achieve &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt;, all agents must &lt;em&gt;simultaneously&lt;/em&gt; learn &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt;: by definition, it&amp;rsquo;s impossible for some agents to know &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt; and others not to know, so they must all learn at once.&lt;/li&gt;
&lt;li&gt;Second, we said that learning &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt; requires message-passing, so there must be some message &lt;em&gt;m&lt;/em&gt; that they all receive simultaneously, from which they learn &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt;.&lt;/li&gt;
&lt;li&gt;But this is an asynchronous environment, so if agent &lt;em&gt;A&lt;/em&gt; receives &lt;em&gt;m&lt;/em&gt;, it can&amp;rsquo;t at that moment know whether &lt;em&gt;B&lt;/em&gt; has. The scenario where &lt;em&gt;B&lt;/em&gt; learns &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt; is indistinguishable to &lt;em&gt;A&lt;/em&gt; from the case where it doesn&amp;rsquo;t. This is a contradiction, so &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt; can&amp;rsquo;t be achieved in an asynchronous world.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;(You can generalize to multiple messages. That is, &lt;em&gt;A&lt;/em&gt; must receive a message &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-bc4c24fa42db60db1f0a16f1e0c215813a90c5b99c4bbb4b178f538c19aab470.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-bc4c24fa42db60db1f0a16f1e0c215813a90c5b99c4bbb4b178f538c19aab470.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;m_a&#34;&gt;
&lt;/picture&gt; at the same moment &lt;em&gt;B&lt;/em&gt; receives &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-5c8f9c6d74ce4a7cd19e60cf04408ba8dfa3de0b47dd878ca443feaa97184074.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-5c8f9c6d74ce4a7cd19e60cf04408ba8dfa3de0b47dd878ca443feaa97184074.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;m_b&#34;&gt;
&lt;/picture&gt; to achieve &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt;. The same argument holds.)&lt;/p&gt;
&lt;p&gt;If &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-1d1099937d4b9c1d1a7d3eb74d19a9c158e4894269ab58740647818e70114c7a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C\psi&#34;&gt;
&lt;/picture&gt; doesn&amp;rsquo;t require message-passing, then it can be achieved. For example, there might be a thunderstorm that &lt;em&gt;A&lt;/em&gt; knows &lt;em&gt;B&lt;/em&gt; must have noticed, and vice versa, so common knowledge of the thunderstorm is attainable. Or, if the agents have synchronized clocks, then at noon it&amp;rsquo;s common knowledge that it&amp;rsquo;s noon. But these exceptions don&amp;rsquo;t help our generals, who didn&amp;rsquo;t have the foresight to make a plan like, &amp;ldquo;attack during the next thunderstorm,&amp;rdquo; or &amp;ldquo;attack at noon.&amp;rdquo; So they&amp;rsquo;re hosed.&lt;/p&gt;
&lt;p&gt;Disappointingly, the only correct protocol for coordinated attack in an asynchronous environment is: &amp;ldquo;Never attack.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/Terracotta_lekythos.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;What if message delay is bounded?&lt;/h1&gt;

&lt;p&gt;So we proved common knowledge is unachievable by passing messages if their delay is unbounded. What if it&amp;rsquo;s bounded: what if message delivery is guaranteed with a maximum delay of &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; (Greek letter epsilon), and this maximum is common knowledge?&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say both generals have &lt;a href=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/&#34;&gt;reliable timers&lt;/a&gt;, but not necessarily synchronized clocks. General &lt;em&gt;A&lt;/em&gt; sends a message &lt;em&gt;m&lt;/em&gt; like &amp;ldquo;let&amp;rsquo;s attack at noon tomorrow,&amp;rdquo; and &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; is much shorter than the time between now and noon tomorrow. The messenger leaves General &lt;em&gt;A&lt;/em&gt;&amp;rsquo;s camp at start time &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S&#34;&gt;
&lt;/picture&gt;, a time known only to General &lt;em&gt;A&lt;/em&gt;. Obviously, &lt;em&gt;A&lt;/em&gt; immediately knows he sent &lt;em&gt;m&lt;/em&gt;:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-32a72a8560b8e948ef312d9689f2d7d77064ed2453e9e62568c7d3034f7c0f69.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-32a72a8560b8e948ef312d9689f2d7d77064ed2453e9e62568c7d3034f7c0f69.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;now ≥ t_S \iff K_A sent(m)&#34;&gt;
&lt;/picture&gt;&lt;p&gt;But &lt;em&gt;A&lt;/em&gt; doesn&amp;rsquo;t know how much delay his message will have. Once the current time is &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f381ccba2b57830adec1417657af87916d83e46cdc3dfc9b103c2d0ed99f116a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f381ccba2b57830adec1417657af87916d83e46cdc3dfc9b103c2d0ed99f116a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S + \epsilon&#34;&gt;
&lt;/picture&gt;, he knows his message has been delivered, so he knows &lt;em&gt;B&lt;/em&gt; knows he sent it:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8563c4e91d68862de4c7db109f14d98ff35e67d321513acbcd7de4a17a4cd4d0.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8563c4e91d68862de4c7db109f14d98ff35e67d321513acbcd7de4a17a4cd4d0.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;now ≥ t_S + \epsilon \iff K_A K_B sent(m)&#34;&gt;
&lt;/picture&gt;&lt;p&gt;General &lt;em&gt;B&lt;/em&gt; receives the message at delivery time &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-df9efb1c415e9019cf29c3f2492b3c083ef8c0e7904953077e75fe519e08db53.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-df9efb1c415e9019cf29c3f2492b3c083ef8c0e7904953077e75fe519e08db53.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_D&#34;&gt;
&lt;/picture&gt;, a time known only to her. She doesn&amp;rsquo;t know how long &lt;em&gt;m&lt;/em&gt; was delayed, so she doesn&amp;rsquo;t know &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S&#34;&gt;
&lt;/picture&gt;, so she doesn&amp;rsquo;t know when &lt;em&gt;A&lt;/em&gt; will know she got &lt;em&gt;m&lt;/em&gt;. In the worst case, the delay was &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; but she worries it was zero, so she waits until &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f15581b1492e3e2a7c54d351c2be95177c231b095de128a544a5ad1079eea908.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f15581b1492e3e2a7c54d351c2be95177c231b095de128a544a5ad1079eea908.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_D + \epsilon&#34;&gt;
&lt;/picture&gt;, which is &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-c904c9a907ea291026fa883415b0b33741ca6c1ba135c34c4abd0fe04762d8b9.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-c904c9a907ea291026fa883415b0b33741ca6c1ba135c34c4abd0fe04762d8b9.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_s + 2 \epsilon&#34;&gt;
&lt;/picture&gt;. General &lt;em&gt;A&lt;/em&gt; can thus reason that &lt;em&gt;B&lt;/em&gt; knows &lt;em&gt;A&lt;/em&gt; knows &lt;em&gt;B&lt;/em&gt; got &lt;em&gt;m&lt;/em&gt; after a &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-a1002913723cc0d8d0e43f5222c85ae1f3ea1daa66be5c39fcec953b6a11ccac.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-a1002913723cc0d8d0e43f5222c85ae1f3ea1daa66be5c39fcec953b6a11ccac.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;2 \epsilon&#34;&gt;
&lt;/picture&gt; delay. In other words:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-7bb9203185e8ca069b31105ba1b1723379f90af6978936104e2cc11be2e4e072.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-7bb9203185e8ca069b31105ba1b1723379f90af6978936104e2cc11be2e4e072.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;now ≥ t_S + 2 \epsilon \iff (K_A K_B)^2 sent(m)&#34;&gt;
&lt;/picture&gt;&lt;p&gt;But General &lt;em&gt;B&lt;/em&gt; isn&amp;rsquo;t sure the current time is at least &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-0062b81aa9f244d07bf5bea15eb8c8d390339bcc805966cf4c5f1940fa270b59.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-0062b81aa9f244d07bf5bea15eb8c8d390339bcc805966cf4c5f1940fa270b59.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S + 2 \epsilon&#34;&gt;
&lt;/picture&gt; until (in the worst case) it&amp;rsquo;s actually &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-2ab7d25cd6c48a6a50517d7fb6fb67d021677d095bf033b9add1d159c4e9f9fe.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-2ab7d25cd6c48a6a50517d7fb6fb67d021677d095bf033b9add1d159c4e9f9fe.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S + 3 \epsilon&#34;&gt;
&lt;/picture&gt;, so she has to wait another epsilon before she knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-16af54bf12f9a5cdfbfa662b7473183dcbb043fa0c0e4d6928a60bddd6b3d52c.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-16af54bf12f9a5cdfbfa662b7473183dcbb043fa0c0e4d6928a60bddd6b3d52c.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;(K_A K_B)^2 sent(m)&#34;&gt;
&lt;/picture&gt;. General &lt;em&gt;A&lt;/em&gt; knows all this about General &lt;em&gt;B&lt;/em&gt;, so now &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-b177cfae18645bb6278dff9160119f4bfc40e0d10d480ebd6363c760a395a813.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-b177cfae18645bb6278dff9160119f4bfc40e0d10d480ebd6363c760a395a813.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;(K_A K_B)^3 sent(m)&#34;&gt;
&lt;/picture&gt; is true. We can infer that for &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-3432fe3a0403f4a13ec140e82bccb11c0ccf88abe77a8f29df3a21733616cb20.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-3432fe3a0403f4a13ec140e82bccb11c0ccf88abe77a8f29df3a21733616cb20.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;k \gt 0&#34;&gt;
&lt;/picture&gt;:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-57dc0dea38927d04ee7e8020be7af70017c460e3913b302b9296dc784c9c4aa0.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-57dc0dea38927d04ee7e8020be7af70017c460e3913b302b9296dc784c9c4aa0.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;now \geq t_S + k \epsilon \iff (K_A K_B)^k sent(m)&#34;&gt;
&lt;/picture&gt;&lt;p&gt;We proved earlier that the generals can&amp;rsquo;t attack together unless their plan is common knowledge, which means &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-72f551a2d83c7e5f30c59894e1cc43648624686082204725788740837658ee6f.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-72f551a2d83c7e5f30c59894e1cc43648624686082204725788740837658ee6f.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;(K_A K_B)^\infty\ sent(m)&#34;&gt;
&lt;/picture&gt;, but that would take infinitely long. So not even guaranteed message delivery with bounded delay is enough to solve coordinated attack.&lt;/p&gt;
&lt;h1&gt;Another indistinguishability graph&lt;/h1&gt;

&lt;p&gt;Let&amp;rsquo;s look at this from another angle, with the &lt;a href=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/#indistinguishability-graph&#34;&gt;indistinguishability graph technique from the previous article&lt;/a&gt;.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/generals-bounded-delay.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/generals-bounded-delay.excalidraw.jpg&#34;&gt;
&lt;/picture&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here&amp;rsquo;s a state graph, with timestamps in &lt;span style=&#34;font-weight: bold; color: #1971c2&#34;&gt;blue&lt;/span&gt; going down the left side. The &lt;strong&gt;black&lt;/strong&gt; arrows are possible state transitions. In the initial state, &lt;em&gt;m&lt;/em&gt; is unsent. In the next state, General &lt;em&gt;A&lt;/em&gt; sends it. Then there&amp;rsquo;s a branch: &lt;em&gt;m&lt;/em&gt; is received after zero delay or &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; delay. Then the two branches continue separately, making a state transition after each &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; of time passes. The first two states are connected with a &lt;span style=&#34;font-weight: bold; color: #2f9e44&#34;&gt;green&lt;/span&gt; line because they&amp;rsquo;re indistinguishable to General &lt;em&gt;B&lt;/em&gt;: she doesn&amp;rsquo;t know if &lt;em&gt;A&lt;/em&gt; sent the message or not. After that, there are two possible states at each timestamp, and these two states are connected with &lt;span style=&#34;font-weight: bold; color: #e03131&#34;&gt;red&lt;/span&gt; lines because they&amp;rsquo;re indistinguishable to General &lt;em&gt;A&lt;/em&gt;: he doesn&amp;rsquo;t know if his message was received after zero delay or &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; delay. The diagonal &lt;span style=&#34;font-weight: bold; color: #2f9e44&#34;&gt;green&lt;/span&gt; lines indicate the same thing about General &lt;em&gt;B&lt;/em&gt;: she doesn&amp;rsquo;t know if she received the message after a delay or not. (My drawing is a hybrid of a state graph and a Kripke structure.)&lt;/p&gt;
&lt;p&gt;Like we saw in the last article, we can use properties of this graph to calculate agents&amp;rsquo; knowledge. For example, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_A K_B sent(m)&#34;&gt;
&lt;/picture&gt; is true in state &lt;em&gt;s&lt;/em&gt; iff all paths from &lt;em&gt;s&lt;/em&gt; that follow zero or one &lt;span style=&#34;font-weight: bold; color: #e03131&#34;&gt;&lt;em&gt;A&lt;/em&gt;-indistinguishable edges&lt;/span&gt;, then zero or one &lt;span style=&#34;font-weight: bold; color: #2f9e44&#34;&gt;&lt;em&gt;B&lt;/em&gt;-indistinguishable edges&lt;/span&gt;, end at a state where &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-44e300103c7b45a20318b56f56aeab52c2fb83d9daf37fd0b44d1af81af644c2.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-44e300103c7b45a20318b56f56aeab52c2fb83d9daf37fd0b44d1af81af644c2.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;sent(m)&#34;&gt;
&lt;/picture&gt; is true.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s choose a particular state as a demonstration: the state where the current time is &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S&#34;&gt;
&lt;/picture&gt; and &lt;em&gt;B&lt;/em&gt; received &lt;em&gt;m&lt;/em&gt; with zero delay. We can follow &lt;span style=&#34;font-weight: bold; color: #e03131&#34;&gt;one &lt;em&gt;A&lt;/em&gt;-edge&lt;/span&gt;, then zero &lt;em&gt;B&lt;/em&gt;-edges, and arrive at the state labeled &amp;ldquo;sent&amp;rdquo;, where &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-44e300103c7b45a20318b56f56aeab52c2fb83d9daf37fd0b44d1af81af644c2.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-44e300103c7b45a20318b56f56aeab52c2fb83d9daf37fd0b44d1af81af644c2.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;sent(m)&#34;&gt;
&lt;/picture&gt; is true. Or we can follow the same &lt;span style=&#34;font-weight: bold; color: #e03131&#34;&gt;one &lt;em&gt;A&lt;/em&gt;-edge&lt;/span&gt;, and then &lt;span style=&#34;font-weight: bold; color: #2f9e44&#34;&gt;one &lt;em&gt;B&lt;/em&gt;-edge&lt;/span&gt; and arrive at the state labeled &amp;ldquo;unsent,&amp;rdquo; where &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-44e300103c7b45a20318b56f56aeab52c2fb83d9daf37fd0b44d1af81af644c2.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-44e300103c7b45a20318b56f56aeab52c2fb83d9daf37fd0b44d1af81af644c2.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;sent(m)&#34;&gt;
&lt;/picture&gt; is false. Since both &amp;ldquo;sent&amp;rdquo; and &amp;ldquo;unsent&amp;rdquo; are reachable from the current state, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_A K_B sent(m)&#34;&gt;
&lt;/picture&gt; is false.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/generals-bounded-delay-2.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/generals-bounded-delay-2.excalidraw.jpg&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_A K_B sent(m)&#34;&gt;
&lt;/picture&gt; is &lt;strong&gt;false&lt;/strong&gt; in the state marked &amp;ldquo;current state.&amp;rdquo;&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;But if you start from either of the states where the current time is &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f381ccba2b57830adec1417657af87916d83e46cdc3dfc9b103c2d0ed99f116a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f381ccba2b57830adec1417657af87916d83e46cdc3dfc9b103c2d0ed99f116a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S + \epsilon&#34;&gt;
&lt;/picture&gt; and follow zero or one &lt;span style=&#34;font-weight: bold; color: #e03131&#34;&gt;red&lt;/span&gt; edges, then zero or one &lt;span style=&#34;font-weight: bold; color: #2f9e44&#34;&gt;green&lt;/span&gt; edges, you always arrive at states where &lt;em&gt;m&lt;/em&gt; was sent. Thus &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_A K_B sent(m)&#34;&gt;
&lt;/picture&gt; is true in those states.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/generals-bounded-delay-3.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/generals-bounded-delay-3.excalidraw.jpg&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f7d7e702de4c258d694ffab5508f0dd9d7dcc3dcc716b0dfb8d2b63980d1cc78.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_A K_B sent(m)&#34;&gt;
&lt;/picture&gt; is &lt;strong&gt;true&lt;/strong&gt; in the state marked &amp;ldquo;current state.&amp;rdquo;&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;But if you want a state where &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-16af54bf12f9a5cdfbfa662b7473183dcbb043fa0c0e4d6928a60bddd6b3d52c.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-16af54bf12f9a5cdfbfa662b7473183dcbb043fa0c0e4d6928a60bddd6b3d52c.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;(K_A K_B)^2 sent(m)&#34;&gt;
&lt;/picture&gt;, you need to start farther away, at one of the states where the time is &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-0062b81aa9f244d07bf5bea15eb8c8d390339bcc805966cf4c5f1940fa270b59.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-0062b81aa9f244d07bf5bea15eb8c8d390339bcc805966cf4c5f1940fa270b59.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S + 2 \epsilon&#34;&gt;
&lt;/picture&gt; or later. Otherwise there exists a path that follows a &lt;span style=&#34;font-weight: bold; color: #e03131&#34;&gt;red&lt;/span&gt;, then &lt;span style=&#34;font-weight: bold; color: #2f9e44&#34;&gt;green&lt;/span&gt;, then &lt;span style=&#34;font-weight: bold; color: #e03131&#34;&gt;red&lt;/span&gt;, then &lt;span style=&#34;font-weight: bold; color: #2f9e44&#34;&gt;green&lt;/span&gt; edge and arrives at a state where &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-44e300103c7b45a20318b56f56aeab52c2fb83d9daf37fd0b44d1af81af644c2.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-44e300103c7b45a20318b56f56aeab52c2fb83d9daf37fd0b44d1af81af644c2.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;sent(m)&#34;&gt;
&lt;/picture&gt; is false. (I haven&amp;rsquo;t drawn a picture of this.) Continuing inductively, we can see how the graph expresses the same thing we proved before: &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-6f550ffc3333551f66c16d67706c94e0dbfc5ad46024e3f27e7872c92b87e971.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-6f550ffc3333551f66c16d67706c94e0dbfc5ad46024e3f27e7872c92b87e971.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;(K_A K_B)^k sent(m)&#34;&gt;
&lt;/picture&gt; is true only if the current time is &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-497aaa483950555ee8cc08a26fb0da892469fbfcf21d108439b9caf8934b1234.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-497aaa483950555ee8cc08a26fb0da892469fbfcf21d108439b9caf8934b1234.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S + k \epsilon&#34;&gt;
&lt;/picture&gt; or later.&lt;/p&gt;
&lt;h1&gt;What if the generals have synchronized clocks?&lt;/h1&gt;

&lt;p&gt;It seems weird to conclude that the generals can&amp;rsquo;t coordinate, even with bounded message delays. Isn&amp;rsquo;t it enough for &lt;em&gt;A&lt;/em&gt; to say, &amp;ldquo;attack at noon tomorrow&amp;rdquo; at least &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; before that time, with no acknowledgment from &lt;em&gt;B&lt;/em&gt;? Yes, this works, but the &lt;em&gt;protocol itself&lt;/em&gt; must be common knowledge at the start, and the generals must have perfectly synchronized clocks. If the generals have already agreed that &lt;em&gt;A&lt;/em&gt; will send a message proposing an attack at least &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; in the future, then when &lt;em&gt;B&lt;/em&gt; receives &amp;ldquo;attack at noon tomorrow,&amp;rdquo; she knows:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;General &lt;em&gt;A&lt;/em&gt; sent this at least &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; before noon (that&amp;rsquo;s the protocol)&lt;/li&gt;
&lt;li&gt;Therefore at noon, &lt;em&gt;A&lt;/em&gt; knows &lt;em&gt;B&lt;/em&gt; received it&lt;/li&gt;
&lt;li&gt;Therefore at noon, &lt;em&gt;B&lt;/em&gt; knows &lt;em&gt;A&lt;/em&gt; knows &lt;em&gt;B&lt;/em&gt; received it&lt;/li&gt;
&lt;li&gt;And so on&amp;hellip;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At noon common knowledge is achieved, because both generals can reason about what the other knows &lt;em&gt;will&lt;/em&gt; be true at the deadline.&lt;/p&gt;
&lt;p&gt;Halpern and Moses don&amp;rsquo;t propose this exact idea. They instead use timestamped messages: &amp;ldquo;The time is now &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S&#34;&gt;
&lt;/picture&gt;, let&amp;rsquo;s attack at noon.&amp;rdquo; This removes General &lt;em&gt;B&lt;/em&gt;&amp;rsquo;s uncertainty about &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-13bd7ad72d409d6a37913eb5cc9aa660536e23ecf4dfa7c641efb56faae0f469.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S&#34;&gt;
&lt;/picture&gt;, so at time &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f381ccba2b57830adec1417657af87916d83e46cdc3dfc9b103c2d0ed99f116a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-f381ccba2b57830adec1417657af87916d83e46cdc3dfc9b103c2d0ed99f116a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;t_S + \epsilon&#34;&gt;
&lt;/picture&gt; the plan becomes common knowledge. This is pretty much equivalent: my pre-agreed protocol and their timestamped messages both provide extra shared facts from which the two generals can reason.&lt;/p&gt;
&lt;p&gt;Both solutions require perfectly synchronized clocks. If &lt;em&gt;A&lt;/em&gt; and &lt;em&gt;B&lt;/em&gt; have the slightest uncertainty about when noon arrives, they can&amp;rsquo;t become sure of the plan simultaneously, so they can &lt;em&gt;never&lt;/em&gt; achieve common knowledge.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/Athena_owl_Met_09.221.43.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Weaker forms of common knowledge&lt;/h1&gt;

&lt;p&gt;This still seems weird. In practice, I can arrange to meet someone for lunch, and a Raft system can reach consensus, so why is it impossible in theory? It&amp;rsquo;s because Halpern and Moses&amp;rsquo;s definition of common knowledge is impractically strong. So they define weaker forms of shared knowledge that are achievable and suffice for consensus or lunch.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt;-common knowledge&lt;/strong&gt; (&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-b724237785e95b3c28dae5abc80d99fa824172bd37ec71ba49817ca4b6accaae.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-b724237785e95b3c28dae5abc80d99fa824172bd37ec71ba49817ca4b6accaae.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C^{\,\epsilon}\phi&#34;&gt;
&lt;/picture&gt;): Instead of &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-10ce0200b450ea955833553e717821c7721818b8a95cffd7e54b46e08918b512.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-10ce0200b450ea955833553e717821c7721818b8a95cffd7e54b46e08918b512.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\phi&#34;&gt;
&lt;/picture&gt; becoming common knowledge simultaneously, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-10ce0200b450ea955833553e717821c7721818b8a95cffd7e54b46e08918b512.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-10ce0200b450ea955833553e717821c7721818b8a95cffd7e54b46e08918b512.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\phi&#34;&gt;
&lt;/picture&gt; becomes &amp;ldquo;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt;-common knowledge&amp;rdquo; in an &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt; time window. So if message delivery is guaranteed after some bounded delay, its contents are common knowledge a bounded time after it&amp;rsquo;s sent.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Timestamped common knowledge&lt;/strong&gt; (&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-c94fd79f2de7324b264b1426f2de2d9dc2c2e7099b1bd7557cd32dcf5a024abd.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-c94fd79f2de7324b264b1426f2de2d9dc2c2e7099b1bd7557cd32dcf5a024abd.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C^T&#34;&gt;
&lt;/picture&gt;): Everyone knows at time &lt;em&gt;T&lt;/em&gt; &lt;em&gt;on their own clock&lt;/em&gt;. If clocks are perfectly synchronized, then &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-c94fd79f2de7324b264b1426f2de2d9dc2c2e7099b1bd7557cd32dcf5a024abd.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-c94fd79f2de7324b264b1426f2de2d9dc2c2e7099b1bd7557cd32dcf5a024abd.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C^T&#34;&gt;
&lt;/picture&gt; is equivalent to common knowledge. If they&amp;rsquo;re synchronized within &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-8bc4bfc2bdebfaac14a98cde1deb3cf75202e7908905efb95c939c0481951f77.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\epsilon&#34;&gt;
&lt;/picture&gt;, then &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-c94fd79f2de7324b264b1426f2de2d9dc2c2e7099b1bd7557cd32dcf5a024abd.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-c94fd79f2de7324b264b1426f2de2d9dc2c2e7099b1bd7557cd32dcf5a024abd.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C^T&#34;&gt;
&lt;/picture&gt; is &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-a9210c4b760642b0c22805072ced2c5c9b0a4a0bccc714618e4f96c15dbbbe59.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-a9210c4b760642b0c22805072ced2c5c9b0a4a0bccc714618e4f96c15dbbbe59.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C^{\,\epsilon}&#34;&gt;
&lt;/picture&gt;. You can discard clocks and substitute something like a Raft term number for the timestamp and say, &amp;ldquo;which node is the leader in term &lt;em&gt;t&lt;/em&gt; is timestamped common knowledge.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Eventual common knowledge&lt;/strong&gt; (&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-577553aa50b11d24f33020a4ec6bf025e05bf074327c19d83dee38ffeea5158c.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-577553aa50b11d24f33020a4ec6bf025e05bf074327c19d83dee38ffeea5158c.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C^{\,\diamond}&#34;&gt;
&lt;/picture&gt;): Everyone will eventually know, though not necessarily at the same time. This suffices for actions that must eventually happen everywhere, without a deadline.&lt;/p&gt;
&lt;p&gt;These form a hierarchy: &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-07d5e20ded4bb4e92bf2ea93d8f4a47681275d493efd956db5254443b5a24ab7.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/formula-07d5e20ded4bb4e92bf2ea93d8f4a47681275d493efd956db5254443b5a24ab7.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C \Rightarrow C^{\,\epsilon} \Rightarrow C^{\,\diamond}&#34;&gt;
&lt;/picture&gt;.&lt;/p&gt;
&lt;h1&gt;Internal knowledge consistency&lt;/h1&gt;

&lt;p&gt;Another solution to the paradox&amp;mdash;common knowledge is impossible in theory but possible in practice&amp;mdash;is &amp;ldquo;internal knowledge consistency.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Every agent in the system has an &lt;em&gt;interpretation&lt;/em&gt; of its observations. Formally, an interpretation is a function that takes the agent&amp;rsquo;s personal history and the current time, and outputs a set of facts the agent believes are true. For example, in the muddy children puzzle, after the children hear the father&amp;rsquo;s announcement &lt;em&gt;m&lt;/em&gt;, &amp;ldquo;at least one of you is muddy,&amp;rdquo; each child believes it heard &lt;em&gt;m&lt;/em&gt;. An interpretation is &amp;ldquo;knowledge consistent&amp;rdquo; if it produces only &lt;em&gt;true&lt;/em&gt; beliefs.&lt;/p&gt;
&lt;p&gt;A knowledge consistent interpretation must say that &lt;em&gt;m&lt;/em&gt; isn&amp;rsquo;t common knowledge, because in reality, the children &lt;em&gt;can&amp;rsquo;t&lt;/em&gt; achieve common knowledge of &lt;em&gt;m&lt;/em&gt;: How do they all know that they all understand English, and they were all paying attention, etc.? But an &lt;em&gt;internally&lt;/em&gt; knowledge consistent interpretation can say it is, so long as no observation will ever contradict this belief. This matches how we intuitively reason about knowledge and common knowledge in everyday situations.&lt;/p&gt;
&lt;p&gt;This concludes my review of Halpern and Moses&amp;rsquo;s &amp;ldquo;Knowledge and Common Knowledge in a Distributed Environment&amp;rdquo;! But I&amp;rsquo;ll have more to say about epistemic logic and distributed systems. I&amp;rsquo;m now in a UCLA seminar with &lt;a href=&#34;https://remy.wang/&#34;&gt;Remy Wang&lt;/a&gt;, reading &lt;a href=&#34;https://direct.mit.edu/books/monograph/1825/Reasoning-About-Knowledge&#34;&gt;Reasoning About Knowledge&lt;/a&gt;, the short book Halpern and Moses wrote with Ronald Fagin and Moshe Vardi. My goal now is to marry epistemic logic with TLA+, by analyzing a TLA+ specification&amp;rsquo;s state graph.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-3/peleus.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Images:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Theseus_Athena_Amphitrite_Louvre_G104.jpg&#34;&gt;Theseus Athena and Amphitrite&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://upload.wikimedia.org/wikipedia/commons/0/03/Terracotta_lekythos_%28oil_flask%29_MET_DP161824.jpg&#34;&gt;Oil flask with depiction of Athena, goddess of wisdom&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Athena_owl_Met_09.221.43.jpg&#34;&gt;Athena with an owl&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Oedipus_Painter_ARV_441_185_Herakles_and_Athena_-_Peleus_subduing_Thetis_%2802%29.jpg&#34;&gt;Athena and Herakles&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Review: Rethink the Linearizability Constraints of Raft for Distributed Systems</title>
      <link>https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/</link>
      <pubDate>Fri, 26 Dec 2025 23:04:21 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/</guid>
      
      
      
      
      <description>&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/psycho-magnetic-curves.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;In &lt;a href=&#34;https://ieeexplore.ieee.org/document/10012573&#34;&gt;Rethink the Linearizability Constraints of Raft for Distributed Systems&lt;/a&gt; (behind the IEEE paywall, dammit), some academic researchers describe Raft optimizations that make reads and writes quicker, while preserving linearizability. &lt;a href=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/&#34;&gt;Raft and linearizability are my specialities&lt;/a&gt;, and I&amp;rsquo;m pretty sure I see mistakes in this paper. The ideas are worth considering anyway. I recommend it, if you or a friend has an IEEE subscription.&lt;/p&gt;


&lt;h1&gt;Lower-latency writes&lt;/h1&gt;

&lt;h2&gt;Classic Raft&lt;/h2&gt;

&lt;p&gt;Here&amp;rsquo;s how writes work in classic Raft, as described by Diego Ongaro in his &lt;a href=&#34;https://raft.github.io/raft.pdf&#34;&gt;paper&lt;/a&gt; and &lt;a href=&#34;https://web.stanford.edu/~ouster/cgi-bin/papers/OngaroPhD.pdf&#34;&gt;thesis&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A client sends a write command to the leader.&lt;/li&gt;
&lt;li&gt;The leader appends a log entry, which includes the command, to its log at position &lt;em&gt;i&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The leader waits for the entry to be replicated by a majority of followers, i.e. committed.&lt;/li&gt;
&lt;li&gt;The leader advances its commitIndex to &lt;em&gt;i&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The leader applies the command to its copy of the state machine.&lt;/li&gt;
&lt;li&gt;The leader advances its lastApplied index to &lt;em&gt;i&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;The leader replies to the client with the command&amp;rsquo;s result.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The commitIndex and lastApplied variables are separate for concurrency&amp;rsquo;s sake. One thread is responsible for communicating with followers and advancing the commitIndex. Another thread applies committed commands and advances lastApplied. That&amp;rsquo;s how Ongaro built his reference implementation, &lt;a href=&#34;https://github.com/logcabin/logcabin&#34;&gt;LogCabin&lt;/a&gt;, and it&amp;rsquo;s a good idea for any multithreaded Raft.&lt;/p&gt;
&lt;h2&gt;Commit Return optimization&lt;/h2&gt;

&lt;p&gt;In the &amp;ldquo;Rethink&amp;rdquo; paper, the authors propose that the leader replies to the client as soon as the entry is &lt;em&gt;committed&lt;/em&gt;, without waiting until it&amp;rsquo;s &lt;em&gt;applied&lt;/em&gt;. They call this optimization &amp;ldquo;Commit Return.&amp;rdquo; It works for blind writes, where the client doesn&amp;rsquo;t need to know anything about the command&amp;rsquo;s result besides &amp;ldquo;it was committed.&amp;rdquo; E.g., if the state machine is a key-value store and the client sends a command like &lt;code&gt;set x := 1&lt;/code&gt;, the leader could reply &amp;ldquo;ok&amp;rdquo; before it updates its state machine.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/classic-raft-write.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/classic-raft-write.png&#34;&gt;
&lt;/picture&gt;&lt;!-- Client-&gt;Leader: &#34;x := 1&#34;
box over Leader: append log entry at index i

Leader-&gt;Follower1: AppendEntries
Leader-&gt;Follower2: AppendEntries

Follower1-&gt;Leader: AppendEntriesResponse

box over Leader: advance commitIndex to i

Leader-&gt;Client: reply &#34;ok&#34; (in the &#34;Rethink&#34; paper)

box over Leader: run &#34;x := 1&#34;, advance lastApplied to i

Leader-&gt;Client: reply &#34;ok&#34; (in classic Raft)
--&gt;
&lt;p&gt;Commit Return doesn&amp;rsquo;t work when the client sends a command that returns a data-dependent value, like &lt;code&gt;return ++x&lt;/code&gt;. The leader must apply the command before it knows what result to return to the client. (Can the leader know at commit-time what the return value of &lt;code&gt;++x&lt;/code&gt; will be? No, there might be another committed and unapplied command in the pipeline that modifies &lt;code&gt;x&lt;/code&gt; before &lt;code&gt;++x&lt;/code&gt; is applied.)&lt;/p&gt;
&lt;h2&gt;I&amp;rsquo;m skeptical&lt;/h2&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/skeptical.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The authors claim that Commit Return is a big win, because sometimes applying a command takes much longer than committing it. This is surprising to me, because like most distributed systems people I assume that network latency dominates, so committing takes most of the time. The authors argue the opposite, and they construct a benchmark where this is true, with five nodes in a single data center connected by a high-speed network. The authors&amp;rsquo; Figure 4 shows the percentage of time the leader spends appending entries to its log, or communicating with followers, or applying commands. With 1-kilobyte commands, 40% of write command latency is attributed to applying the command:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/figure-4.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;Figure 4 in the &amp;ldquo;Rethink&amp;rdquo; paper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I admit that applying commands could be slower than majority-replicating them in this situation. But usually, you deploy a system across multiple availability zones—if you&amp;rsquo;re going to the trouble of replicating, you might as well be truly fault tolerant. Across AZs, &lt;a href=&#34;http://arxiv.org/abs/2309.13169&#34;&gt;you&amp;rsquo;re likely to see at least 1 ms of latency&lt;/a&gt;. And if you care about latency, then your application is OLTP, and your commands are probably mostly much smaller than 1 kilobyte and very quick to apply.&lt;/p&gt;
&lt;p&gt;Oddly, I see that applying takes a larger percentage of the time for smaller commands in their benchmark. I guess there&amp;rsquo;s a lot of fixed overhead per command when their system applies commands, so small commands replicate quickly and take a comparatively long time to apply. The authors say, &amp;ldquo;Apply is slow because the apply operation involves writing the state machine log to disk.&amp;rdquo; But the authors separately measure the time spent appending to the Raft log. I wonder what &lt;em&gt;additional&lt;/em&gt; logging is part of applying in their system?&lt;/p&gt;
&lt;p&gt;Commit Return is a useful idea, regardless. In a more normal multi-AZ deployment, where most of the latency is the network&amp;rsquo;s fault, Commit Return reduces latency a smidge for blind writes, from the client&amp;rsquo;s perspective. The bulk of the paper isn&amp;rsquo;t about Commit Return, anyway: it&amp;rsquo;s about read optimizations that are independent of Commit Return. That part of the paper is more interesting, and definitely has a mistake!&lt;/p&gt;
&lt;h1&gt;Lower-latency reads&lt;/h1&gt;

&lt;h2&gt;Classic Raft&lt;/h2&gt;

&lt;p&gt;Here&amp;rsquo;s how reads work in Raft:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A client sends a query to the leader.&lt;/li&gt;
&lt;li&gt;The leader sets the query&amp;rsquo;s readIndex to the leader&amp;rsquo;s current commitIndex.&lt;/li&gt;
&lt;li&gt;The leader waits until its lastApplied ≥ the query&amp;rsquo;s readIndex.&lt;/li&gt;
&lt;li&gt;The leader runs the client&amp;rsquo;s query and returns the result.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This ensures the client sees the effect of all writes that were committed before the query started.&lt;/p&gt;
&lt;p&gt;But that&amp;rsquo;s not enough to guarantee linearizability! The problem is, &lt;a href=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/#example-leader-leases-rely-on-timers&#34;&gt;a Raft leader is never sure that it&amp;rsquo;s the real leader, or if it&amp;rsquo;s been deposed by a newer one&lt;/a&gt;. If it&amp;rsquo;s deposed, then the newer leader is executing writes that the deposed leader doesn&amp;rsquo;t see, and the deposed leader&amp;rsquo;s reads will reveal stale data, violating linearizability.&lt;/p&gt;
&lt;p&gt;How does a Raft leader ensure it isn&amp;rsquo;t deposed? In classic Raft, Ongaro recommends that for each read, the leader sends a message (an empty AppendEntries message specifically) to all its followers, checking that they haven&amp;rsquo;t seen a newer leader. The leader waits for a quorum of followers to reply before it runs the client&amp;rsquo;s query.&lt;/p&gt;
&lt;!-- Client-&gt;Leader: query
box over Leader: readIndex := commitIndex

Leader-&gt;Follower1: AppendEntries
Leader-&gt;Follower2: AppendEntries

Follower1-&gt;Leader: AppendEntriesResponse

box over Leader: wait until lastApplied ≥ readIndex

box over Leader: run query

Leader-&gt;Client: result
--&gt; 

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/classic-raft-read.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/classic-raft-read.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;This is expensive, obviously—reads now pay the same network cost as writes—so commercial Raft implementations often use timed &lt;a href=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/&#34;&gt;leader leases&lt;/a&gt; instead, or else they just don&amp;rsquo;t guarantee linearizability.&lt;/p&gt;
&lt;p&gt;Unfortunately, the &amp;ldquo;Rethink&amp;rdquo; authors don&amp;rsquo;t seem to know about the deposed-leader problem at all. They claim to guarantee linearizability, but they don&amp;rsquo;t mention which, if any, technique they use to prevent reading from a deposed leader. I&amp;rsquo;ll come back to this topic and see if the paper can be saved.&lt;/p&gt;
&lt;h2&gt;Read Acceleration&lt;/h2&gt;

&lt;p&gt;As I said, in classic Raft, the leader sets a query&amp;rsquo;s read index to its commitIndex, then waits for lastApplied to reach the read index. The &amp;ldquo;Rethink&amp;rdquo; authors want to reduce this waiting period. Their ideas for this are the best part of the paper.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Immediate Read:&lt;/strong&gt; The leader immediately runs the client&amp;rsquo;s read, and buffers the result. Then it applies the log entries between lastApplied and the read index to the result, and returns the updated result to the client. This is faster than just waiting until the applier thread applies these entries to the whole state machine, as in classic Raft, because the buffered result is small, and because the leader can ignore entries irrelevant to this result.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/immediate-read.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Figure 9 from the paper.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;For example, let&amp;rsquo;s say &lt;code&gt;x&lt;/code&gt; is zero in the current state machine, and the client sends &lt;code&gt;get x&lt;/code&gt; to the leader. The leader has these committed and unapplied log entries:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;x := 9
z := 1
z := 7
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The leader runs &lt;code&gt;get x&lt;/code&gt; on the current state machine and creates the buffered result, &lt;code&gt;x == 0&lt;/code&gt;. It applies &lt;code&gt;x := 9&lt;/code&gt; to the buffered result before returning it, and ignores the entries that modify &lt;code&gt;z&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Semantic-Influencing Request:&lt;/strong&gt; Sometimes, the Immediate Read fixup process is impossible. Let&amp;rsquo;s say the client sends a SQL query like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;SELECT&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;*&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;FROM&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;table&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;WHERE&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The leader runs this query and buffers the resulting rows. Now it looks in its log and sees a committed and unapplied log entry like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;UPDATE&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;table&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;SET&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;+&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The buffered result has the &lt;em&gt;wrong rows&lt;/em&gt;! It includes all the rows where &lt;code&gt;x&lt;/code&gt; was 1 when the leader received the query, but these are not the rows where &lt;code&gt;x&lt;/code&gt; is 1 after running &lt;code&gt;UPDATE&lt;/code&gt;. The leader can&amp;rsquo;t fix this up as easily as the previous example. The authors call this &lt;code&gt;UPDATE&lt;/code&gt; a &amp;ldquo;Semantic-Influencing Request.&amp;rdquo; If such a request appears in the pending log entries, the leader could try something fancy like editing the original query before running it:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-sql&#34; data-lang=&#34;sql&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;-- Change the filter from x = 1 to x = 0.
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;SELECT&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;*&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;FROM&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;table&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;WHERE&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This query selects rows that &lt;em&gt;will&lt;/em&gt; match the client&amp;rsquo;s query, &lt;em&gt;after&lt;/em&gt; the &lt;code&gt;UPDATE&lt;/code&gt; is applied. The authors say the leader should estimate the cost of fixing up the query, versus waiting until it can be run normally, and act accordingly.&lt;/p&gt;
&lt;h1&gt;Interaction of read and write optimizations&lt;/h1&gt;

&lt;p&gt;The authors say that Commit Return speeds up writes, but may slow down reads in exchange, because faster writes lead to a larger gap between the leader&amp;rsquo;s commitIndex and lastApplied. They even have a chart:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/figure-5.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The red bars are the percentage of read latency in classic Raft that&amp;rsquo;s due to waiting for lastApplied to catch up to the query&amp;rsquo;s readIndex (which was the commitIndex when the query arrived). The higher blue bars are the percentage with Commit Return. The blue/red line shows the average gap between the readIndex and lastApplied with/without Commit Return, measured in number of entries. The horizontal axis is the percentage of writes in the YCSB workload, from 90% writes on the left to 10% writes on the right.&lt;/p&gt;
&lt;p&gt;The authors claim this chart shows that with Commit Return enabled, there is a bigger gap between the leader&amp;rsquo;s commitIndex and lastApplied. This is an unfortunate mistake. I&amp;rsquo;m certain that Commit Return has no effect on the gap or on query latency. The authors were led astray by YCSB.&lt;/p&gt;
&lt;p&gt;Commit Return merely &lt;em&gt;reduces write latency from the client&amp;rsquo;s perspective&lt;/em&gt;: if the client does a blind write, the server replies &amp;ldquo;ok&amp;rdquo; a bit sooner, and the client can move on to its next task. If the client is &lt;a href=&#34;https://muratbuffalo.blogspot.com/2023/05/open-versus-closed-cautionary-tale.html&#34;&gt;a closed loop&lt;/a&gt;, then it waits for each request to finish before it starts the next. Let&amp;rsquo;s say this is the client&amp;rsquo;s code:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;while&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;True&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;result1&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;do_request&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;set x := x + 1&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;result2&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;do_request&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;get x&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In classic Raft, &lt;code&gt;set x := x + 1&lt;/code&gt; doesn&amp;rsquo;t return until the leader applies it, so the leader can run &lt;code&gt;get x&lt;/code&gt; without much waiting. (If other clients submitted writes that are committed and unapplied when &lt;code&gt;get x&lt;/code&gt; arrives, the leader still has to wait until &lt;em&gt;those&lt;/em&gt; entries are applied before running &lt;code&gt;get x&lt;/code&gt;.)&lt;/p&gt;
&lt;p&gt;With Commit Return, &lt;code&gt;set x := x + 1&lt;/code&gt; returns as soon as it&amp;rsquo;s committed. Thus when the client sends &lt;code&gt;get x&lt;/code&gt;, the gap between the leader&amp;rsquo;s commitIndex and its lastApplied is bigger than without Commit Return.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s important to understand that Commit Return doesn&amp;rsquo;t delay the advancement of lastApplied, so it doesn&amp;rsquo;t increase the gap between commitIndex and lastApplied in general! It just returns control to the client sooner, so the client can send a higher throughput of blind writes, or start querying sooner after sending a blind write. This only matters if you&amp;rsquo;re benchmarking the system with a &lt;em&gt;closed-loop&lt;/em&gt; workload generator like YCSB. YCSB lets the server exert backpressure on the client, so it doesn&amp;rsquo;t measure throughput or latency realistically. (This is called &lt;a href=&#34;https://redhatperf.github.io/post/coordinated-omission/&#34;&gt;the coordinated omission problem&lt;/a&gt;.)&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/closed-system.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/closed-system.png&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;A closed system.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;With a more true-to-life &lt;em&gt;open-loop&lt;/em&gt; workload generator, Commit Return does not have this effect. The authors&amp;rsquo; claim that Commit Return makes Read Acceleration more important is a misunderstanding based on this obsolete benchmark.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/open-system.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/open-system.png&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;An open system.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Tragically, lots of distributed systems researchers still use YCSB, so I see this mistake all the time. Worse, I see reviewers, who should know better, requesting that authors add YCSB benchmarks to their evaluations. YCSB benchmarks are wrong. If you&amp;rsquo;re a researcher, please switch to an open-loop workload generator, or invent one for the rest of us to use. If you&amp;rsquo;re a reviewer, point out this mistake when you see it. &lt;a href=&#34;https://emptysqua.re/blog/ycsb-is-obsolete/&#34;&gt;Let&amp;rsquo;s all standardize on some modern, open-loop benchmarks&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;What about deposed leaders?&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-rethink-linearizability-constraints-raft/napolean.jpg&#34; alt=&#34;Napoleon sitting slumped in a chair, staring at nothing&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://www.napoleon.org/en/history-of-the-two-empires/paintings/napoleon-i-at-fontainebleau-31-march-1814/&#34;&gt;Napoleon, about to resign&lt;/a&gt;&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The authors&amp;rsquo; optimizations reduce latency and maintain linearizability, &lt;em&gt;if&lt;/em&gt; the leader isn&amp;rsquo;t deposed. How do they guarantee that it isn&amp;rsquo;t?&lt;/p&gt;
&lt;p&gt;If the leader checks a quorum of followers for each query, as in classic Raft, that gives the leader more time to concurrently apply any pending writes. It&amp;rsquo;s still possible that lastApplied hasn&amp;rsquo;t caught up to the readIndex during the quorum check—the authors constructed a scenario where applying commands is slow and communication is quick—but Read Acceleration is less useful once we take the quorum check into account.&lt;/p&gt;
&lt;p&gt;The better way to handle a deposed leader is with a timed lease. I think the Read Acceleration techniques could be useful when paired with leases. I just hope anyone who reads this paper keeps the deposed leader problem in mind.&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;This paper is a worthwhile read, despite its flaws.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Con:&lt;/strong&gt; I doubt that it often takes longer to apply commands than replicate them, as the authors claim.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pro:&lt;/strong&gt; Even so, the Commit Return optimization might be a useful idea sometimes.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pro and con?:&lt;/strong&gt; Commit Return &lt;em&gt;doesn&amp;rsquo;t&lt;/em&gt; have the downside the authors think it has: they think it makes reads slower, but that&amp;rsquo;s only in the artificial environment of a YCSB benchmark. Commit Return is better than they realize.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Con:&lt;/strong&gt; The authors&amp;rsquo; protocol does &lt;em&gt;not&lt;/em&gt; guarantee linearizability unless they deal with deposed leaders somehow, like with a lease.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Pro:&lt;/strong&gt; Read Acceleration is a cool idea. It&amp;rsquo;s useful if there&amp;rsquo;s a big gap between commitIndex and lastApplied. I really don&amp;rsquo;t think this is a common scenario, but the idea has other uses. &lt;a href=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/&#34;&gt;I just published a paper in SIGMOD&lt;/a&gt;, with my colleagues Murat Demirbas and Lingzhi Deng, that uses a similar mechanism to permit linearizable reads when there might be multiple leaders.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Images:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://publicdomainreview.org/essay/lofty-only-in-sound-crossed-wires-and-community-in-19th-century-dreams/&#34;&gt;Illustration from Edwin D. Babbitt’s The Principles of Light and Color (1878)&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://publicdomainreview.org/essay/on-oscar-wilde-and-plagiarism/&#34;&gt;Oscar Wilde and Whistler, by Phil May, 1894&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.napoleon.org/en/history-of-the-two-empires/paintings/napoleon-i-at-fontainebleau-31-march-1814/&#34;&gt;Napoleon I at Fontainebleau, 31 March, 1814, by Paul Delaroche, 1845&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>YCSB Is Obsolete, We Need New Benchmarks</title>
      <link>https://emptysqua.re/blog/ycsb-is-obsolete/</link>
      <pubDate>Sun, 21 Dec 2025 17:44:18 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/ycsb-is-obsolete/</guid>
      
      
      
      
      <description>&lt;p&gt;Let&amp;rsquo;s build a simulation of a coffeeshop. We&amp;rsquo;ll have some resources, like baristas and espresso machines. As soon as the shop opens, a fixed number of patrons—maybe ten—all line up for coffee. Whenever a patron receives her coffee, she throws it away and instantly returns to the back of the line. If the barista is slow, that&amp;rsquo;s okay: the line never grows longer than ten people.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/ycsb-is-obsolete/closed-loop-coffeeshop.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/ycsb-is-obsolete/closed-loop-coffeeshop.png&#34; alt=&#34;A stick figure drawing. Nine people stand in line to get coffee. A barista stands behind the counter. A tenth figure is tossing its cup in the trash and returning to the back of the line.&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;A closed loop coffeeshop.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Oh, that&amp;rsquo;s not realistic, you say? You say that patrons arrive at random times, join the line, and leave when they get their coffee?&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/ycsb-is-obsolete/open-loop-coffeeshop.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/ycsb-is-obsolete/open-loop-coffeeshop.png&#34; alt=&#34;A stick figure drawing. Four people stand in line to get coffee. A barista stands behind the counter. Four figures are arriving from several directions and joining the back of the line. Two figures are leaving in different directions, holding coffee cups.&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;An open loop coffeeshop.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;I agree, that is a better model of a coffeeshop. As you have guessed, the coffeeshop is a metaphor, and my actual topic is cloud databases. In the cloud, there&amp;rsquo;s no fixed number of clients. New requests are mostly triggered by external events, not by the completion of previous requests. If the database can&amp;rsquo;t keep up, requests continue to arrive and pile up or time out. But standard benchmarks like YCSB or TPC assume the first model, the unrealistic one, so they give unrealistic results.&lt;/p&gt;
&lt;h1&gt;The Difference Between Open and Closed&lt;/h1&gt;

&lt;p&gt;The first model is an example of a &amp;ldquo;closed system model&amp;rdquo; or a &amp;ldquo;closed loop&amp;rdquo;: there is a fixed number of outstanding tasks. Once the database server is saturated, a new task won&amp;rsquo;t arrive until one of the outstanding tasks finishes. The second is an &amp;ldquo;open system model&amp;rdquo; or &amp;ldquo;open loop&amp;rdquo;: new tasks arrive whenever they want to, independent of when outstanding tasks finish. In an open system, the interarrival time between tasks is usually randomly distributed. If the distribution is exponential, then the arrivals are a &lt;a href=&#34;https://ocw.mit.edu/courses/6-262-discrete-stochastic-processes-spring-2011/3a19ce0e02d0008877351bfa24f3716a_MIT6_262S11_chap02.pdf&#34;&gt;Poisson process&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The real world is open, but we stubbornly keep using closed loop benchmarks to test our databases!&lt;/p&gt;
&lt;p&gt;A closed loop benchmark generates load like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;client_func&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;_&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;range&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;OPERATION_COUNT&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;start&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;response&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;do_database_request&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;end&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;record_latency&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;end&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;start&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;sleep&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;INTERARRIVAL&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;threads&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;Thread&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;target&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;client_func&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;_&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;range&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;THREAD_COUNT&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;threads&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;start&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;threads&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;join&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The latency of &lt;code&gt;do_database_request&lt;/code&gt; affects how much load the benchmark generates—the database implicitly exerts backpressure. The longer requests take, the longer the time between requests.&lt;/p&gt;
&lt;p&gt;An open loop benchmark generates load like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;one_operation&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;():&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;start&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;do_database_request&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;end&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;record_latency&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;end&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;start&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;_&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;range&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;OPERATION_COUNT&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;Thread&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;target&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;one_operation&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;daemon&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;True&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;start&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;sleep&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;INTERARRIVAL&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This code starts a new request whenever the interarrival time has passed, no matter how many requests are still outstanding. For a Poisson process, you just make the interarrival time exponentially distributed:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;LAMBDA&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1000.0&lt;/span&gt;  &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# ops per second&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;next_time&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;_&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;range&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;OPERATION_COUNT&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;next_time&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;+=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;random&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;expovariate&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;LAMBDA&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;Thread&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;target&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;one_operation&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;daemon&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;True&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;start&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;time&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;sleep&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#204a87&#34;&gt;max&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;next_time&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Is this any harder to code than the closed loop version? In 2025 we are all familiar with &lt;a href=&#34;https://emptysqua.re/blog/why-should-async-get-all-the-love/&#34;&gt;multithreaded&lt;/a&gt; or (better in this case) async programming, so we can build open loop benchmarks easily. The time has come for us all to switch to open loop benchmarks.&lt;/p&gt;
&lt;h1&gt;Years Of Warnings Ignored&lt;/h1&gt;

&lt;p&gt;In the 2006 paper &lt;a href=&#34;https://www.usenix.org/conference/nsdi-06/open-versus-closed-cautionary-tale&#34;&gt;Open Versus Closed: A Cautionary Tale&lt;/a&gt;, some researchers (including &lt;a href=&#34;https://emptysqua.re/blog/review-queue-theory-book/&#34;&gt;Mor Harchol-Balter&lt;/a&gt;) benchmarked a Postgres server with TPC-W, the web commerce benchmark that&amp;rsquo;s part of the &lt;a href=&#34;https://en.wikipedia.org/wiki/Transaction_Processing_Performance_Council&#34;&gt;TPC suite&lt;/a&gt;. Like all the TPC benchmarks, TPC-W is closed loop, but the researchers also made a version that&amp;rsquo;s open loop. They compared the standard TPC-W results with their open loop version:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ycsb-is-obsolete/closed-versus-open.png&#34; alt=&#34;Two charts with load on the horizontal axis and mean response time on the vertical axis. The left chart is labeled closed system. Mean response time never rises over 2 seconds. The right is labeled open system. Mean response time rises to 8 seconds.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The closed loop benchmark can never overload the database, because it has a fixed number of threads that each wait politely for the database to respond before sending the next request. Mean response time never exceeds 2 seconds. The open loop benchmark is rude: it overwhelms the database, unfinished requests pile up and wait in the queue. Mean response time exceeds 6 seconds. This is the kind of realistic data we need to see when we test our databases, but the standard closed loop TPC won&amp;rsquo;t show it to us.&lt;/p&gt;
&lt;p&gt;(The PELJF, PS, and PESJF lines are for different scheduling algorithms—that&amp;rsquo;s an interesting part of the paper, you should read it. &lt;a href=&#34;https://muratbuffalo.blogspot.com/2023/05/open-versus-closed-cautionary-tale.html&#34;&gt;Or read Murat&amp;rsquo;s summary&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;The authors list eight insightful principles about the behavior of closed and open systems. The very first is, &amp;ldquo;For a given load, mean response times are significantly lower in closed systems than in open systems.&amp;rdquo; In other words, TPC is lying about cloud database performance.&lt;/p&gt;
&lt;p&gt;This paper should&amp;rsquo;ve set the world on fire, but years passed and we all kept creating and using closed loop benchmarks. &lt;a href=&#34;https://dl.acm.org/doi/10.1145/1807128.1807152&#34;&gt;The Yahoo! Cloud Serving Benchmark was released in 2010&lt;/a&gt;. It was a great innovation because, unlike the TPC family, it targeted the new generation of distributed key-value stores. Unfortunately, you guessed it: YCSB is a closed loop benchmark.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://qconsf.com/sf2012/dl/qcon-sanfran-2012/slides/GilTene_HowNotToMeasureLatency.pdf&#34;&gt;Gil Tene coined the term &amp;ldquo;coordinated omission&amp;rdquo; around 2012&lt;/a&gt;. It describes how closed loop benchmarks interact with the systems they&amp;rsquo;re testing to avoid measuring their behavior during overload. He said it&amp;rsquo;s &amp;ldquo;a conspiracy we&amp;rsquo;re all a part of.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In 2017, some University of Hamburg researchers published &lt;a href=&#34;https://dl.gi.de/items/780bef9a-d6df-4776-80e3-e85ae0158e63&#34;&gt;Coordinated Omission in NoSQL Database Benchmarking&lt;/a&gt;, criticizing YCSB and introducing their own open loop benchmark, NoSQLMark. They point out that YCSB tried to fix the problem in 2015, but the fix is a flawed hack that doesn&amp;rsquo;t really make YCSB a reliable benchmark. (See the paper for details, or just trust me—it&amp;rsquo;s a flawed hack.) The authors&amp;rsquo; NoSQLMark does the obvious right thing: it uses an async framework, Akka. Requests&amp;rsquo; start times are independent of their latency, just like in the real world.&lt;/p&gt;
&lt;p&gt;The NoSQLMark paper uses a slightly weird methodology: they try to demonstrate their benchmark&amp;rsquo;s accuracy using a partially simulated database called SickStore, which doesn&amp;rsquo;t fully prove NoSQLMark&amp;rsquo;s value in my opinion. But they also benchmark Cassandra with both YCSB and NoSQLMark and show how the results differ. This is &lt;em&gt;actual science&lt;/em&gt; and we should all emulate it.&lt;/p&gt;
&lt;p&gt;In 2024—look, I know you&amp;rsquo;re getting sad and tired. I&amp;rsquo;m sorry. I&amp;rsquo;m getting sad and tired too. We&amp;rsquo;re almost done. &lt;a href=&#34;https://dl.acm.org/doi/10.1145/3662165.3662765&#34;&gt;In 2024 some researchers announced KVBench&lt;/a&gt;. This solved one problem with YCSB: it doesn&amp;rsquo;t generate diverse workloads, like a mix of point and range queries, queries with empty results, or deletes. KVBench is a good contribution, but once again, it&amp;rsquo;s a closed loop. It can&amp;rsquo;t push a database past its limits.&lt;/p&gt;
&lt;p&gt;Aside from research papers, some widely read bloggers have warned about coordinated omission. &lt;a href=&#34;https://www.scylladb.com/2021/04/22/on-coordinated-omission/&#34;&gt;Ivan Prisyazhynyy wrote about it in 2021 on the ScyllaDB blog&lt;/a&gt;. In 2023, my colleague Murat Demirbas &lt;a href=&#34;https://muratbuffalo.blogspot.com/2023/05/open-versus-closed-cautionary-tale.html&#34;&gt;reviewed the Cautionary Tale paper&lt;/a&gt;. This year, &lt;a href=&#34;https://brooker.co.za/blog/2025/05/20/icpe.html&#34;&gt;Marc Brooker wrote&lt;/a&gt; that &amp;ldquo;most cloud systems are open, most benchmarks are closed.&amp;rdquo; He says that closed loop benchmarks are far too kind to systems: they reduce their load automatically when the system slows down. &amp;ldquo;The real world isn’t that kind to systems. In most cases, if you slow down, you just have more work to be done later.&amp;rdquo;&lt;/p&gt;
&lt;h1&gt;It&amp;rsquo;s Time To Take A Stand&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/ycsb-is-obsolete/joan-of-arc.jpg&#34; alt=&#34;Engraving of a woman in armor, holding a sword and banner, leading an army&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Joan of Arc by Edward Corbould, 1890&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Why do we persist in using closed loop benchmarks, even though their inaccuracy is now well-understood, and open loop alternatives exist, and async frameworks make new ones easy to write?&lt;/p&gt;
&lt;p&gt;It reminds me of &lt;a href=&#34;https://www.goodreads.com/quotes/10244414-wooden-headedness-the-source-of-self-deception-is-a-factor-that-plays&#34;&gt;Barbara Tuchman&amp;rsquo;s diagnosis of self-defeating nations&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Wooden-headedness, the source of self-deception, is a factor that plays a remarkably large role in government. It consists in assessing a situation in terms of preconceived fixed notions while ignoring or rejecting any contrary signs. It is acting according to wish while not allowing oneself to be deflected by the facts.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Part of the problem is when peer reviewers, out of habit, ask authors to use closed loop benchmarks. A few months ago I took a small risk. My colleagues and I had submitted our paper about &lt;a href=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/&#34;&gt;a Raft enhancement&lt;/a&gt; to a major conference, and one of the reviewers asked for some more benchmarks. They suggested YCSB or KVBench. We said &amp;ldquo;sure&amp;rdquo;—we wanted to publish the paper, and the reviewers have the final say. But then we reflected on these years of warnings about closed loop benchmarks, and we noticed that YCSB and KVBench are closed, and we changed our minds. I wrote a custom open loop benchmark in C++ using nonblocking I/O. When we submitted our revision, I explained why we didn&amp;rsquo;t follow the revision plan. The sky didn&amp;rsquo;t fall. Our revision was accepted, and it was better because it used an accurate benchmark.&lt;/p&gt;
&lt;p&gt;I understand why reviewers ask for YCSB or KVBench or TPC, despite those benchmarks&amp;rsquo; inaccuracy: they&amp;rsquo;re the only baseline we have. Researchers want to do science, and using established benchmarks is part of that. But established benchmarks don&amp;rsquo;t promote scientific progress if they&amp;rsquo;re wrong. The solution is to create new baselines with open loop benchmarks. There are a few good candidates. NoSQLMark is a natural successor to YCSB: it&amp;rsquo;s the same thing but open loop. &lt;a href=&#34;https://www.usenix.org/conference/atc19/presentation/kogias-lancet&#34;&gt;Lancet&lt;/a&gt; is an interesting research project to measure the open loop tail latency of very fast systems, using clever sampling techniques. And someone should make a standard open loop version of the TPC suite, if that doesn&amp;rsquo;t exist already.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s start a movement. As authors let&amp;rsquo;s stop using closed-loop benchmarks, and as reviewers let&amp;rsquo;s stop asking for them. Let&amp;rsquo;s pledge that from now on, we will use open loop benchmarks to test cloud databases.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>LeaseGuard: Raft Leases Done Right!</title>
      <link>https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/</link>
      <pubDate>Thu, 18 Dec 2025 23:27:58 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/</guid>
      
      
      
      
      <description>&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/plate_21_19_27.jpeg&#34; alt=&#34;Black-and-white engraved plate from a historical horology treatise, showing multiple labeled diagrams of a complex mechanical clock or watch. The page is divided into numbered figures: circular clock dials with Roman numerals and concentric scales; exposed views of internal mechanisms with gears, pinions, levers, cams, and springs; and isolated components shown from different angles. Fine cross-hatching and lettered annotations identify parts. A French title at the bottom reads Horlogerie, indicating a clock with alarm, equation of time, concentric seconds, and displays for months and lunar quarters.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Many distributed systems have a &lt;em&gt;leader-based consensus protocol&lt;/em&gt; at their heart. The protocol elects one server as the &amp;ldquo;leader&amp;rdquo; who receives all writes. The other servers are &amp;ldquo;followers,&amp;rdquo; hot standbys who replicate the leader&amp;rsquo;s data changes. Paxos and Raft are the most famous leader-based consensus protocols.&lt;/p&gt;
&lt;p&gt;These protocols ensure consistent &lt;a href=&#34;https://en.wikipedia.org/wiki/State_machine_replication&#34;&gt;state machine replication&lt;/a&gt;, but reads are still tricky. Imagine a new leader &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_1&#34;&gt;
&lt;/picture&gt; is elected, while the previous leader &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt; thinks it&amp;rsquo;s still in charge. A client might write to &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_1&#34;&gt;
&lt;/picture&gt;, then read stale data from &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt;, violating &lt;a href=&#34;https://jepsen.io/consistency/models/read-your-writes&#34;&gt;Read Your Writes&lt;/a&gt;. How can we prevent stale reads? &lt;a href=&#34;https://raft.github.io/raft.pdf&#34;&gt;The original Raft paper&lt;/a&gt; recommended that the leader communicate with a majority of followers before each read, to confirm it&amp;rsquo;s the real leader. This guarantees Read Your Writes but it&amp;rsquo;s slow and expensive.&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;leader lease&lt;/em&gt; is an agreement among a majority of servers that one server will be the only leader for a certain time. This means the leader can run queries without communicating with the followers, and still ensure Read Your Writes. The original description of Raft included a lease protocol that was inherited from the earlier Paxos, where followers refuse to vote for a new leader until the old leader&amp;rsquo;s lease expires. This entangles leases and elections, and it delays recovery after a crash. Besides, lease protocols have never been specified in detail, for either Raft or Paxos. For all these reasons, many Raft implementations don&amp;rsquo;t use leases at all, or their &lt;a href=&#34;https://github.com/etcd-io/raft/issues/166&#34; style=&#34;text-decoration: underline&#34;&gt;leases&lt;/a&gt; &lt;a href=&#34;https://github.com/hashicorp/raft/issues/108&#34; style=&#34;text-decoration: underline&#34;&gt;are&lt;/a&gt; &lt;a href=&#34;https://aphyr.com/posts/316-jepsen-etcd-and-consul&#34; style=&#34;text-decoration: underline&#34;&gt;buggy&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;In the MongoDB Distributed Systems Research Group, we designed a simple lease protocol tailored for Raft, called LeaseGuard. Our main innovation is to rely on Raft-specific guarantees to design a simpler lease protocol that recovers faster from a leader crash.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://arxiv.org/abs/2512.15659&#34;&gt;Here&amp;rsquo;s a preprint of our SIGMOD ’26 paper&lt;/a&gt;. This is a joint blog post by &lt;a href=&#34;http://muratbuffalo.blogspot.com/&#34;&gt;Murat Demirbas&lt;/a&gt; and me, published on both of our blogs.&lt;/p&gt;
&lt;h2&gt;A huge simplification: the log is the lease&lt;/h2&gt;

&lt;p&gt;In Raft, before the leader executes a write command, it wraps the command in a &lt;em&gt;log entry&lt;/em&gt; which it appends to its &lt;em&gt;log&lt;/em&gt;. Followers replicate the entry by appending it to their logs. Once an entry is in a majority of servers&amp;rsquo; logs, it is &lt;em&gt;committed&lt;/em&gt;. Raft&amp;rsquo;s Leader Completeness property guarantees that any newly elected leader has all committed entries from previous leaders. Raft enforces this during elections: a server votes only for a candidate whose log is at least as up to date as its own. (Paxos doesn&amp;rsquo;t have this property, so a new leader has to fetch entries from followers before it&amp;rsquo;s fully functional.)&lt;/p&gt;
&lt;p&gt;When designing LeaseGuard, we used Leader Completeness to radically simplify the lease protocol. LeaseGuard does not use extra messages or variables for lease management, and does not interfere with voting or elections.&lt;/p&gt;
&lt;p&gt;In LeaseGuard, &lt;em&gt;the log is the lease&lt;/em&gt;. Committing a log entry grants the leader a lease that lasts until a timeout expires. While the lease is valid, the leader can serve consistent reads locally. Because of Leader Completeness, any future leader is guaranteed to have that same entry in its log. When a new leader &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_1&#34;&gt;
&lt;/picture&gt; is elected, it checks its own log for the previous leader &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt;&amp;rsquo;s last entry, to infer how long to wait for &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt;&amp;rsquo;s lease to expire.&lt;/p&gt;
&lt;p&gt;In existing protocols, the log is not the lease: instead, the leader periodically sends a message to followers which says, &amp;ldquo;I still have the lease.&amp;rdquo; But imagine a leader who couldn&amp;rsquo;t execute writes or append to its log—perhaps it&amp;rsquo;s overloaded, or its disk is full or faulty—but still has enough juice to send lease-extension messages. This lame-duck leader could lock up the whole system. In LeaseGuard, a leader maintains its lease only if it can make progress; otherwise, the followers elect a new one.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;re excited by the simplicity of this Raft-specific lease protocol. (We were inspired by some prior work, &lt;a href=&#34;https://groups.google.com/g/raft-dev/c/oO0NfgUVrjg/m/R1BnnbNcAwAJ&#34;&gt;especially this forum post from Archie Cobbs&lt;/a&gt;.) In LeaseGuard, there is no separate code path to establish the lease. We decouple leases from elections. The log is the single source of truth for both replication and leasing.&lt;/p&gt;
&lt;h2&gt;LeaseGuard makes leader failovers smoother and faster&lt;/h2&gt;

&lt;p&gt;Leases improve read consistency but can slow recovery after a leader crash. No matter how quickly the surviving servers elect a new leader, it has to wait for the old leader&amp;rsquo;s lease to expire before it can read or write. The system stalls as long as 10 seconds in one of the Raft implementations we studied.&lt;/p&gt;
&lt;p&gt;LeaseGuard improves the situation in two ways. First, &lt;em&gt;deferred-commit writes&lt;/em&gt;. As soon as a new leader wins election, it starts accepting writes and replicating them to followers. It just defers marking any writes &amp;ldquo;committed&amp;rdquo; until the old lease expires. Without this optimization, writes enqueue at the new leader until the old lease expires; then there&amp;rsquo;s a thundering herd. With our optimization, the new leader keeps up with the write load even while it&amp;rsquo;s waiting.&lt;/p&gt;
&lt;p&gt;Second, &lt;em&gt;inherited lease reads&lt;/em&gt;. This is our biggest innovation, and it&amp;rsquo;s a bit complicated. Consider the situation where &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_1&#34;&gt;
&lt;/picture&gt; was just elected, but &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt; is alive and still thinks it&amp;rsquo;s in charge. Neither leader knows about the other. (Yes, this can &lt;a href=&#34;https://www.usenix.org/system/files/osdi18-alquraan.pdf&#34; style=&#34;text-decoration: underline&#34;&gt;really&lt;/a&gt; &lt;a href=&#34;https://aphyr.com/posts/288-the-network-is-reliable&#34; style=&#34;text-decoration: underline&#34;&gt;happen&lt;/a&gt; during a network partition.) Raft makes sure that &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt; can&amp;rsquo;t commit any more &lt;strong&gt;writes&lt;/strong&gt;, but there&amp;rsquo;s a danger of it serving stale &lt;strong&gt;reads&lt;/strong&gt;. The whole point of leases is to prevent that, by blocking &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_1&#34;&gt;
&lt;/picture&gt; from reading and writing until &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt;&amp;rsquo;s lease expires. But what if there was a way for &lt;strong&gt;both&lt;/strong&gt; leaders to serve reads, and still guarantee Read Your Writes?&lt;/p&gt;
&lt;p&gt;When &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_1&#34;&gt;
&lt;/picture&gt; was elected, it already had all of &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt;&amp;rsquo;s committed log entries (Leader Completeness), and maybe some newer entries from &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-64a393c37e7bfedd823eb6ea80da15120104840358bd7b5240e79f62dc62faee.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_0&#34;&gt;
&lt;/picture&gt; that aren&amp;rsquo;t committed yet. &lt;strong&gt;&lt;em&gt;L&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; knows it has every committed entry, but it doesn&amp;rsquo;t know which ones are committed!&lt;/strong&gt; We call these ambiguous entries the &lt;em&gt;limbo region&lt;/em&gt;. For each query, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_1&#34;&gt;
&lt;/picture&gt; checks if the result is affected by any entries in the limbo region—if not, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/formula-aacfb5e5cd28e2b26b66022565bb49a81438761399e1337d841cc00d1fbe9610.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;L_1&#34;&gt;
&lt;/picture&gt; just runs the query normally. Otherwise, it waits until the ambiguity is resolved. (My MongoDB colleague Lingzhi Deng is the one who saw the need for this check, which is how he became a co-author.)&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/limbo-range-blog.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/limbo-range-blog.png&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;Logs on the old and new leader. Entries 1-5 were committed by &lt;em&gt;L&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; , and &lt;em&gt;L&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; has them all, but it only knows that 1-3 are committed. It may not learn whether 4-6 are committed until it tries to commit an entry of its own.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Inherited lease reads require synchronized clocks with known error bounds, but the rest of the protocol only needs &lt;a href=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/&#34;&gt;local timers with bounded drift&lt;/a&gt;. Our two optimizations preserve Read Your Writes and dramatically improve availability.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/availability.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/availability.png&#34; alt=&#34;A flowchart showing when leaders can read or write with leases. There are two leaders. At first, Leader 1 can execute reads and writes. Then Leader 2 is elected. Now Leader 1 can execute only reads, while Leader 2 can execute reads unaffected by the limbo region, and it can stage writes. Once Leader 1&amp;#39;s lease expires, Leader 2 can execute reads and writes freely.&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;Transitions in the read/write availability of leaders with LeaseGuard. While the new leader waits for a lease, it can serve some consistent reads and accept writes. Meanwhile the old leader serves reads.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here&amp;rsquo;s the whole algorithm in pseudo-Python. For more details, read the paper.&lt;/p&gt;
&lt;style&gt;
.leaseguard-comment { color:#99A9A5 }
.leaseguard-keyword { color:#307919 }
.leaseguard-variable { color:#0000F5 }
.leaseguard-string { color:#AE3832 }
.leaseguard-infix { color:#9B2DF5 }
&lt;/style&gt;
&lt;pre&gt;
&lt;code&gt;
&lt;span class=&#34;leaseguard-comment&#34;&gt;# Handle a write request from a client.&lt;/span&gt;
&lt;span class=&#34;leaseguard-keyword&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;leaseguard-variable&#34;&gt;ClientWrite&lt;/span&gt;(command):
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; self.state != &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;leader&amp;quot;&lt;/span&gt;: &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;not leader&amp;quot;&lt;/span&gt;
    &lt;span class=&#34;leaseguard-comment&#34;&gt;# Create new entry, log it and record its index.&lt;/span&gt;
    entry = (self.term, command, intervalNow())
    index = self.log.append(entry)
    &lt;span class=&#34;leaseguard-comment&#34;&gt;# Another thread replicates, commits, and applies the&lt;/span&gt;
    &lt;span class=&#34;leaseguard-comment&#34;&gt;# command, and advances lastApplied, see CommitEntry.&lt;/span&gt;
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;await&lt;/span&gt;(self.lastApplied &amp;gt;= index)
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; self.state != &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;leader&amp;quot;&lt;/span&gt;:
        &lt;span class=&#34;leaseguard-comment&#34;&gt;# Deposed, don&#39;t know if command succeeded.&lt;/span&gt;
        &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;not leader&amp;quot;&lt;/span&gt;
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;ok&amp;quot;&lt;/span&gt;


&lt;span class=&#34;leaseguard-comment&#34;&gt;# Handle a read request from a client for key k.&lt;/span&gt;
&lt;span class=&#34;leaseguard-keyword&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;leaseguard-variable&#34;&gt;ClientRead&lt;/span&gt;(k):
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; self.state != &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;leader&amp;quot;&lt;/span&gt;: &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;not leader&amp;quot;&lt;/span&gt;
    &lt;span class=&#34;leaseguard-comment&#34;&gt;# Last committed entry&#39;s age is calculated using&lt;/span&gt;
    &lt;span class=&#34;leaseguard-comment&#34;&gt;# bounded-uncertainty clock.&lt;/span&gt;
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; self.log[self.commitIndex].age &amp;gt; delta:
        &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;no lease&amp;quot;&lt;/span&gt;
    &lt;span class=&#34;leaseguard-comment&#34;&gt;# Prevent &amp;quot;limbo&amp;quot; reads.&lt;/span&gt;
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; self.term != self.log[self.commitIndex].term:
        &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; any limbo region entry affects k:
            &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt; &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;key affected by limbo region&amp;quot;&lt;/span&gt;
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt; self.data[k]


&lt;span class=&#34;leaseguard-comment&#34;&gt;# When this node learns some followers have replicated&lt;/span&gt;
&lt;span class=&#34;leaseguard-comment&#34;&gt;# entries up to index i, advance the commitIndex.&lt;/span&gt;
&lt;span class=&#34;leaseguard-keyword&#34;&gt;def&lt;/span&gt; &lt;span class=&#34;leaseguard-variable&#34;&gt;CommitEntry&lt;/span&gt;(i):
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; self.state != &lt;span class=&#34;leaseguard-string&#34;&gt;&amp;quot;leader&amp;quot;&lt;/span&gt;: &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt;
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; &lt;span class=&#34;leaseguard-infix&#34;&gt;not&lt;/span&gt; majorityAcknowledged(self.log[i]):
        &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt;
    &lt;span class=&#34;leaseguard-comment&#34;&gt;# Check for past-term entry &amp;lt; Delta old.&lt;/span&gt;
    &lt;span class=&#34;leaseguard-comment&#34;&gt;# In reality this loop is optimized away, Sec. 7.&lt;/span&gt;
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;for&lt;/span&gt; e &lt;span class=&#34;leaseguard-infix&#34;&gt;in&lt;/span&gt; self.log:
        &lt;span class=&#34;leaseguard-keyword&#34;&gt;if&lt;/span&gt; e.term &amp;lt; self.term &lt;span class=&#34;leaseguard-keyword&#34;&gt;and&lt;/span&gt; e.age &amp;lt; delta:
            &lt;span class=&#34;leaseguard-keyword&#34;&gt;return&lt;/span&gt;
    self.commitIndex = max(self.commitIndex, i)
    &lt;span class=&#34;leaseguard-keyword&#34;&gt;while&lt;/span&gt; self.lastApplied&amp;lt;self.commitIndex:
        apply(self.log[self.lastApplied+1].command)
        self.lastApplied += 1
&lt;/code&gt;
&lt;/pre&gt;
&lt;h2&gt;Tests and benchmarks&lt;/h2&gt;

&lt;p&gt;When we started this research, our main goal was to publish a detailed and correct specification, so Raft implementers everywhere could implement leases without bugs. We&amp;rsquo;re &lt;a href=&#34;https://www.mongodb.com/company/blog/engineering/conformance-checking-at-mongodb-testing-our-code-matches-our-tla-specs&#34;&gt;TLA+ fans&lt;/a&gt; so obviously we &lt;a href=&#34;https://github.com/muratdem/RaftLeaderLeases/blob/main/TLA/leaseGuard.tla&#34;&gt;specified the algorithm in TLA+&lt;/a&gt; and checked it guaranteed Read Your Writes and other correctness properties. We discovered our two optimizations while writing the TLA+ spec. The inherited lease reads optimization was especially surprising to us; we probably wouldn&amp;rsquo;t have realized it was possible if TLA+ wasn&amp;rsquo;t helping us think.&lt;/p&gt;
&lt;p&gt;We implemented the algorithm &lt;a href=&#34;https://github.com/mongodb-labs/logcabin/tree/leaseguard&#34;&gt;in LogCabin&lt;/a&gt;, the C++ reference implementation of Raft. (For ease of exposition, we also provide an implementation in &lt;a href=&#34;https://github.com/muratdem/RaftLeaderLeases/tree/main/Python&#34;&gt;a Python simulator&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;In the following experiment, we illustrate how LeaseGuard improves throughput and reduces time to recovery. We crash the leader 500 ms after the test begins. At the 1000 ms mark, a new leader is elected, and at 1500 ms, the old leader&amp;rsquo;s lease expires. We ran this experiment with LogCabin in five configurations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inconsistent:&lt;/strong&gt; LogCabin running fast and loose, with no guarantee of Read Your Writes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Quorum:&lt;/strong&gt; The default Read Your Writes mechanism, where the leader talks to a majority of followers before running each query, is miserably slow—notice that its Y axis is &lt;strong&gt;one tenth&lt;/strong&gt; as high as the other charts!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Lease:&lt;/strong&gt; The &amp;ldquo;log is the lease&amp;rdquo; protocol with no optimizations. Its throughput is as high as &amp;ldquo;inconsistent&amp;rdquo;, but it has a long time to recovery after the old leader crashes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Defer commit:&lt;/strong&gt; The log is the lease, plus our write optimization—you can see that write throughput spikes off the chart at 1500 ms, because the leader has been processing writes while waiting for the lease. As soon as it gets the lease, it commits all the writes at once.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Inherit lease:&lt;/strong&gt; LeaseGuard with all our optimizations. Read throughput recovers as soon as a new leader is elected, without waiting for the old lease to expire.&lt;/li&gt;
&lt;/ul&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/unavailability_experiment_logcabin.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/leaseguard-raft-leader-leases-done-right/unavailability_experiment_logcabin.png&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;How fast does the system recover after a leader crash?&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Until now, the absence of a detailed specification for Raft leases led to many flawed implementations: they often failed to guarantee consistent reads, had very low read throughput, or recovered slowly from a leader crash. With LeaseGuard now specified, implemented, and published, we hope it will be readily adopted to enable Raft systems to provide fast reads with strong consistency and recover quickly after a crash.&lt;/p&gt;
&lt;p&gt;We learned yet again the value of TLA+ during this project. TLA+ is useful not just for checking the correctness of a completed design, but for revealing new insights while the design is in progress. Also, we got interested in &lt;a href=&#34;https://emptysqua.re/blog/series/knowledge/&#34;&gt;reasoning about knowledge&lt;/a&gt;, also known as epistemic logic. In Raft, servers can look in their logs and know that other servers know certain facts. For example, if a leader has a committed entry, it knows any future leader knows about this entry, but it doesn&amp;rsquo;t know if a future leader knows the entry was &lt;em&gt;committed&lt;/em&gt;. This is a different way for us to think about a distributed system: it&amp;rsquo;s not just a state machine, it&amp;rsquo;s a group of agents with limited knowledge. We&amp;rsquo;re curious about this way of thinking and plan to do more research.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&#34;https://www.encyclopediaofdiderot.org/s/diderot/item/233773&#34;&gt;Image: &lt;em&gt;Horlogerie&lt;/em&gt;, 1765, Diderot and d&amp;rsquo;Alembert.&lt;/a&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>The Work of Art in the Age of Mechanical Creation</title>
      <link>https://emptysqua.re/blog/work-of-art-mechanical-creativity/</link>
      <pubDate>Wed, 10 Dec 2025 18:55:37 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/work-of-art-mechanical-creativity/</guid>
      
      
      
      
      <description>&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/work-of-art-mechanical-creativity/valery.jpg&#34; alt=&#34;Painting of Paul Valery in a suit resting his chin in his hand, perhaps speaking or thinking&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Portrait of Paul Valery by Jacques-Emile Blanche&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Back in 1928, Paul Valéry predicted streaming media:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;It will be possible to send anywhere or to re-create anywhere a system of sensations, or more precisely a system of stimuli, provoked by some object or event in any given place. Works of art will acquire a kind of ubiquity. We shall only have to summon them and there they will be, either in their living actuality or restored from the past. They will not merely exist in themselves but will exist wherever someone with a certain apparatus happens to be. A work of art will cease to be anything more than a kind of source or point of origin whose benefit will be available and quite fully so, wherever we wish. Just as water, gas, and electricity are brought into our houses from far off to satisfy our needs in response to a minimal effort, so we shall be supplied with visual or auditory images, which will appear and disappear at a simple movement of the hand, hardly more than a sign. Just as we are accustomed, if not enslaved, to the various forms of energy that pour into our homes, we shall find it perfectly natural to receive the ultrarapid variations or oscillations that our sense organs gather in and integrate to form all we know. I do not know whether a philosopher has ever dreamed of a company engaged in the home delivery of Sensory Reality.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://science.miketyka.com//make/2015/09/12/the-conquest-of-ubiquity.html&#34;&gt;Paul Valéry, The Conquest of Ubiquity (La conquête de l&amp;rsquo;ubiquité)&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Walter Benjamin followed up in 1935, wondering how our relationship with art will change once it&amp;rsquo;s so easily copied. In &lt;a href=&#34;https://web.mit.edu/allanmc/www/benjamin.pdf&#34;&gt;The Work of Art in the Age of Mechanical Reproduction&lt;/a&gt;, he analyzed what had made a work of art special in the past, and how that was changing in the 20&lt;sup&gt;th&lt;/sup&gt; century, and he forecast that original art objects would lose their value. The essay is world-famous, but his forecast was wrong: originals are now more valued than ever. Now that we&amp;rsquo;re approaching the age of mechanical creativity, I&amp;rsquo;ll try to learn from Benjamin&amp;rsquo;s error and make the opposite kind of forecast: I predict that AI art will make human art even more precious.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/work-of-art-mechanical-creativity/benjamin.jpg&#34; alt=&#34;Black and white portrait photo of Walter Benjamin, with a moustache and round glasses&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Walter Benjamin&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;p&gt;What&amp;rsquo;s special about an original painting or sculpture? Benjamin says it is the object&amp;rsquo;s &amp;ldquo;aura&amp;rdquo;, which he defines in a few ways. The aura is &amp;ldquo;the unique phenomenon of a distance, however close it may be.&amp;rdquo; That is, you cannot possess or control the object, it is distant and &amp;ldquo;unapproachable.&amp;rdquo; Its aura is also its &lt;em&gt;authenticity&lt;/em&gt;; Benjamin says, &amp;ldquo;The authenticity of a thing is the essence of all that is transmissible from its beginning, ranging from its substantive duration to its testimony to the history which it has experienced.&amp;rdquo; Maybe this is easier to grasp in the original German? I&amp;rsquo;d redefine aura as, &amp;ldquo;The special value we assign to an art object because of its unique identity.&amp;rdquo; The key is recognizing that &lt;em&gt;we&lt;/em&gt; assign this special value. A perfect copy of the Mona Lisa produced by a Star Trek replicator, even if it were indistinguishable by carbon dating or electron microscopy, would still be a copy, valued differently from the original, only because we &lt;em&gt;know&lt;/em&gt; which is which. If someone secretly swapped them, the aura would cling to the one we thought was original.&lt;/p&gt;
&lt;p&gt;Benjamin says art objects first had aura because of their magical or religious meaning. I imagine Stonehenge or a statue of Athena: they are not valued for their beauty or originality, but their power. Copying was just as hard as producing an original, so the terms &amp;ldquo;original&amp;rdquo; and &amp;ldquo;copy&amp;rdquo; didn&amp;rsquo;t apply. &lt;em&gt;All&lt;/em&gt; statues of Athena were sacred. In the secularizing Renaissance, beauty took precedence. Once the Industrial Revolution began, the Romantics responded by heroizing individual artistic creativity. As Benjamin says,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&amp;hellip; art sensed the approaching crisis which has become evident a century later. At the time, art reacted with the doctrine of &lt;em&gt;l&amp;rsquo;art pour l&amp;rsquo;art&lt;/em&gt;, that is, with a theology of art. This gave rise to what might be called a negative theology in the form of the idea of &amp;lsquo;pure&amp;rsquo; art &amp;hellip;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Calling it a &amp;ldquo;theology&amp;rdquo; is right: the Romantics made a real cult of &amp;ldquo;art for art&amp;rsquo;s sake&amp;rdquo; and the artistic genius. But Benjamin skips over the Romantics. He dismisses creativity and genius in a sentence as &amp;ldquo;outmoded concepts.&amp;rdquo; He thinks that the art object&amp;rsquo;s function has evolved from ritual, to exhibition, and then to politics. He&amp;rsquo;s a Marxist, after all, and he both hopes and predicts that art&amp;rsquo;s main function in the future will be making revolutionary demands. Meanwhile, he predicts the aura of the original art object will fade. &amp;ldquo;The situations into which the product of mechanical reproduction can be brought may not touch the actual work of art, yet the quality of its presence is always depreciated.&amp;rdquo; And, &amp;ldquo;that which withers in the age of mechanical reproduction is the aura of the work of art.&amp;rdquo;&lt;/p&gt;
&lt;hr&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/work-of-art-mechanical-creativity/louvre.jpg&#34; alt=&#34;Color photo of a crowd at the Louvre pressing close to the Mona Lisa, which is displayed with a whole wall to itself. Most people in the crowd are holding up phones or cameras to photograph the painting. A security guard stands directly in front of the painting, keeping the crowd back.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://commons.wikimedia.org/wiki/File:A_petit_crowd_to_see_the_dame.jpg&#34;&gt;The depreciated actual work of art?&lt;/a&gt;&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Walter Benjamin was wrong. The Mona Lisa has to be one of the most reproduced artworks, but the quality of its presence has not depreciated a bit. Every year more pilgrims go to be in its presence. It&amp;rsquo;s &lt;em&gt;because&lt;/em&gt; we can all see high-resolution color copies that the original became a cult object. It&amp;rsquo;s why we ritually gather around it, take selfies with it, make even more copies with our phones and cameras as an act of worship. Benjamin defines aura as &amp;ldquo;distance,&amp;rdquo; and the original Mona Lisa has definitely become more distant: unapproachable, roped off, guarded. Owning it is impossible to imagine—&lt;a href=&#34;https://en.wikipedia.org/wiki/Salvator_Mundi_(Leonardo)&#34;&gt;even a mediocre Leonardo sold for $450 million&lt;/a&gt;. In general, the more famous an object is, the more it is reproduced—fame and reproduction are nearly synonymous on the Internet—and the stronger its aura.&lt;/p&gt;
&lt;p&gt;Maybe if Benjamin were alive today, he&amp;rsquo;d say what persists isn&amp;rsquo;t the &lt;em&gt;real&lt;/em&gt; aura that art objects used to have; it&amp;rsquo;s just fetishism or greed. Yes, things change. The modern world is secular and scientific: I will never know what an ancient Greek felt in front of an Athena statue. And I&amp;rsquo;ll never know what it was like to see the Mona Lisa in person for the first time, before the world was tiled with copies of her. Today&amp;rsquo;s aura is different from before. But even though things change, the special value we assign to unique art objects keeps growing! More and more tourists travel to crowd around famous things, and the price of famous artworks spirals upwards.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;The value of an original painting or sculpture is a little beside the point, anyway. Some art forms don&amp;rsquo;t create unique objects: music, literature, photography, and movies are all exhibited by copying. Yes, there might be an original score handwritten by a composer, or an original negative shot by a photographer. They have &lt;em&gt;some&lt;/em&gt; aura, and that&amp;rsquo;s what makes them collectors&amp;rsquo; items, but they&amp;rsquo;re not the primary artwork. Digital photography and literature written in Microsoft Word produce no artifacts at all. For a moment, some people tried to recreate the aura of the artifact with NFTs. The moment passed, thank god, but it&amp;rsquo;s a symptom of our human need to assign aura &lt;em&gt;somewhere&lt;/em&gt;. It turned out NFTs were unnecessary: These non-physical art forms still have aura; the aura just migrated from objects to creators.&lt;/p&gt;
&lt;p&gt;The cult of the creative genius seemed to be the main source of aura in the 20&lt;sup&gt;th&lt;/sup&gt; century, and continues to be today. This is another way Benjamin&amp;rsquo;s forecast swerved off course. The more original and innovative the artist is, and the more they seem &amp;ldquo;pure&amp;rdquo; and uninfluenced by fame, money, or convention, the more we value their work. Think of how Picasso or Hilma af Klint are lionized, or the auteur directors of American films in the 1970s, or literary cult figures like Emily Dickinson and David Foster Wallace. &amp;ldquo;Authentic&amp;rdquo; and original musicians like Bob Dylan are &lt;a href=&#34;https://www.nytimes.com/2020/12/07/arts/music/bob-dylan-universal-music.html&#34; style=&#34;text-decoration:underline&#34;&gt;celebrated&lt;/a&gt; &lt;a href=&#34;https://www.pulitzer.org/winners/bob-dylan&#34; style=&#34;text-decoration:underline&#34;&gt;ad&lt;/a&gt; &lt;a href=&#34;https://en.wikipedia.org/wiki/A_Complete_Unknown&#34; style=&#34;text-decoration:underline&#34;&gt;nauseam&lt;/a&gt;. (Enough already, let&amp;rsquo;s worship Björk now.) Art&amp;rsquo;s aura is undiminished compared to the past—its &lt;em&gt;source&lt;/em&gt; just changed, from religion or beauty to admiration of the artist.&lt;/p&gt;
&lt;p&gt;(Aside: The Romantics thought of creativity as spontaneous, inspired only by the artist&amp;rsquo;s interior passion, like something from nothing. This made no sense—an artist is always determined by their environment and tradition, even if they&amp;rsquo;re rebelling against it. &amp;ldquo;Originality&amp;rdquo; is just free will, constrained to art, and it&amp;rsquo;s as nonsensical as free will in general. I think art critics now scoff at the idea of artistic originality—Dadaism wounded it and Postmodernism finished it off—but free will is still &lt;a href=&#34;https://survey2020.philpeople.org/survey/results/4838&#34;&gt;overwhelmingly popular among philosophers&lt;/a&gt;, which baffles me. Anyway, we don&amp;rsquo;t need to believe humans are capable of originality, or that AI isn&amp;rsquo;t—we can assign special value to human art regardless.)&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;What does it matter if Walter Benjamin misforecast? Yes, like most Marxists, he was too focused on economic competition among classes, and he underestimated people&amp;rsquo;s attachment to ideas and rituals. It turns out people really want to treat some artworks as special, and high-fidelity copying only intensified this. We doubled down on valuing objects and artists, the opposite of what Benjamin expected. So what?&lt;/p&gt;
&lt;p&gt;Well, now we&amp;rsquo;re entering &lt;strong&gt;the age of mechanical creation&lt;/strong&gt;. We can see on the horizon AIs with an artistic creativity that matches our own. What forecast should we make about the value of human creativity in the future?&lt;/p&gt;
&lt;p&gt;First, I have to regain your trust by showing I hate AI art as much as you do. Most of the AI art I see now is repulsive, because the AIs&amp;rsquo; default &amp;ldquo;house styles&amp;rdquo; are tacky. If you give it a simple prompt and accept the first result, Dall-E usually makes excessively detailed airbrushed images with golden sidelight, and Gemini makes earth-toned kids&amp;rsquo; book illustrations.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/work-of-art-mechanical-creativity/ai.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Left: art from the Dall-E home page. Right: my first Gemini image.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Even &lt;a href=&#34;https://ai-arts.org/3rd-ai-arts-competition-results/&#34;&gt;prize-winning AI art&lt;/a&gt;, made by experienced prompters trying their hardest, is nauseating. I hold the humans responsible: the prompters have bad taste and so do the judges.&lt;/p&gt;
&lt;p&gt;Skilled and tasteful prompters &lt;em&gt;can&lt;/em&gt; produce competent art that doesn&amp;rsquo;t look like AI. They make images in a variety of styles, &lt;a href=&#34;https://www.astralcodexten.com/p/how-did-you-do-on-the-ai-art-turing&#34;&gt;art that fools laypeople in head-to-head contests with human art&lt;/a&gt;. I liked some of these images, &lt;em&gt;until&lt;/em&gt; I found out they&amp;rsquo;re AI. So what&amp;rsquo;s that about? Why do I sometimes enjoy an image, then retch when I learn I was fooled? Why do I refuse to use image generators myself? Why do I instead spend hours searching for vintage &lt;a href=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/&#34; style=&#34;text-decoration:underline&#34;&gt;images&lt;/a&gt; &lt;a href=&#34;https://emptysqua.re/blog/how-long-must-i-test/&#34; style=&#34;text-decoration:underline&#34;&gt;to&lt;/a&gt; &lt;a href=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/&#34; style=&#34;text-decoration:underline&#34;&gt;illustrate&lt;/a&gt; &lt;a href=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/&#34; style=&#34;text-decoration:underline&#34;&gt;my&lt;/a&gt; &lt;a href=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/&#34; style=&#34;text-decoration:underline&#34;&gt;blog&lt;/a&gt;?&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s because the &lt;em&gt;aura&lt;/em&gt; of human creation is powerful. I care whether an image is created by a human, a person with a living body and a history, someone doomed to die, with limited skills and a specific intention. I care about this, even though AI art is sometimes indistinguishable to me, and I expect to be fooled much more. In fact, I think there will come an AI artist with all the creativity and intention of a Hilma af Klint: there will be a self-conscious AI artist who conceives and executes a body of work that expresses the AI&amp;rsquo;s unique experience, unprompted by a human. I might appreciate that art, knowing how it was made. But for no reason inherent in the art itself, I will always distinguish between human and mechanical creation.&lt;/p&gt;
&lt;p&gt;So here&amp;rsquo;s my prediction: just like we still value original art objects, no matter how faithfully they can be copied, we will also value human creativity, no matter how creative AI becomes. Once we can&amp;rsquo;t tell the difference, we&amp;rsquo;ll make systems for tracking the provenance of human creations, just like we do for objects. We&amp;rsquo;ll sometimes fall for AI forgeries, and sometimes catch them. But we&amp;rsquo;ll always assign special value to art we believe was human-created. Aura is not a property of objects or of creations, it&amp;rsquo;s something we give them to fulfill our own need. Whenever aura is undermined by technological change, it doesn&amp;rsquo;t wither. We find a way to create it again.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How I Write a Related Work Section with Zotero and Overleaf</title>
      <link>https://emptysqua.re/blog/writing-related-work-section-zotero-overleaf/</link>
      <pubDate>Tue, 02 Dec 2025 18:56:30 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/writing-related-work-section-zotero-overleaf/</guid>
      
      
      
      
      <description>&lt;p&gt;This week I&amp;rsquo;m writing a Related Work section for a research paper in a field outside my specialty. Usually I write about distributed systems, and I already know most of the papers I need to cite on a specific topic. But now I&amp;rsquo;m targeting a software engineering conference, so I&amp;rsquo;m rolling all over the Internet like &lt;a href=&#34;https://en.wikipedia.org/wiki/Katamari&#34;&gt;Katamari&lt;/a&gt;, accumulating a big ball of citations that may eventually form a bibliography.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/writing-related-work-section-zotero-overleaf/katamari.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Here&amp;rsquo;s what I&amp;rsquo;ve learned about efficiently assembling a Related Work section for a new-to-me topic. I&amp;rsquo;ll show you how to use Overleaf and Zotero, and &amp;hellip; I&amp;rsquo;ll announce a &lt;strong&gt;new Zotero plugin&lt;/strong&gt; I built which makes my job easier, and perhaps yours.&lt;/p&gt;
&lt;h1&gt;Find Papers&lt;/h1&gt;

&lt;p&gt;LLMs in Deep Research mode make a good start—this is available with a subscription to Gemini, ChatGPT, etc. I started by uploading a draft of the paper I&amp;rsquo;d written so far. A prompt like this partly corrects the LLM&amp;rsquo;s bad instincts:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I&amp;rsquo;ve uploaded a draft of a research paper I&amp;rsquo;m writing for [insert conference]. Please look for papers in peer-reviewed journals and conference proceedings on the subject of [describe the subject]. I&amp;rsquo;m looking for this because I want papers to cite in the &amp;ldquo;related work&amp;rdquo; section of this paper. Don&amp;rsquo;t tell me about the paper I uploaded, I wrote it, I already know about it. Just use it as an example of the topic I&amp;rsquo;m researching. I want you to find me research papers. They&amp;rsquo;re probably PDFs on the internet. It&amp;rsquo;s helpful to search scholar.google.com, dblp.org, or researchgate.net. If it&amp;rsquo;s not a research paper, or if I wrote it, don&amp;rsquo;t tell me about it. Find me research papers that I can cite in the research paper that I will write.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Maybe these instructions are needlessly grouchy; you can guess I&amp;rsquo;d had a few rounds of disappointing results before I wrote them.&lt;/p&gt;
&lt;p&gt;Besides using AI, I followed my own advice and plugged keywords into Google Scholar, Arxiv.org, DBLP.org, etc. You know which sites are best for you.&lt;/p&gt;
&lt;h1&gt;Collect Papers in Zotero&lt;/h1&gt;

&lt;p&gt;Once I had opened a hundred tabs, I skimmed the papers&amp;rsquo; abstracts and imported the promising ones into a &lt;a href=&#34;https://www.zotero.org/&#34;&gt;Zotero&lt;/a&gt; collection with the &lt;a href=&#34;https://chromewebstore.google.com/detail/zotero-connector/ekhagklcjbdpajgpjgmbionohlpdbjgc?hl=en&#34;&gt;Zotero Connector for Chrome&lt;/a&gt;. Zotero has a lot of great features for assembling a bibliography. One of them is, papers&amp;rsquo; citation numbers are clickable:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/writing-related-work-section-zotero-overleaf/zotero-citation.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;This makes it easy to open a paper, skim its intro, and find its citations to earlier, more foundational papers. Eventually you find the ur-paper (or some survey paper) which is an authoritative citation on some topic. I wish Zotero also helped you find and download cited papers, but this is easy enough to do myself. (Unless you can tell me about a plugin that does this, or build one for me?)&lt;/p&gt;
&lt;h1&gt;Fill In Missing Metadata&lt;/h1&gt;

&lt;p&gt;The Zotero Connector for Chrome usually fills in a paper&amp;rsquo;s metadata—authors, date, publication, and so on—when I import a paper from the web. But sometimes a paper lands in Zotero with bad or missing info, if I downloaded it from an obscure site. This means I can&amp;rsquo;t generate a bibliographical reference for it. To my surprise, I didn&amp;rsquo;t find an automated solution, so I built &lt;a href=&#34;https://github.com/ajdavis/zotero-metadata-search-plugin&#34;&gt;a Zotero plugin over the weekend called Metadata Search&lt;/a&gt;. The plugin searches just CrossRef.org and DBLP.org for now (pull requests welcome), and shows a list of results.&lt;/p&gt;
&lt;p&gt;For example, here I have &lt;a href=&#34;https://www.usenix.org/conference/osdi22/presentation/huang-lexiang&#34;&gt;a paper&lt;/a&gt; in Zotero, with only a title and incomplete authors list. My plugin finds more info on DBLP.org:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/writing-related-work-section-zotero-overleaf/plugin-ui.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I check some boxes for the fields I want to update, and the plugin copies the values into Zotero.&lt;/p&gt;
&lt;h1&gt;Bibliography and Citations&lt;/h1&gt;

&lt;p&gt;Now, since I&amp;rsquo;m writing my paper with LaTeX, I need to generate a BibTeX bibliography file and cite particular entries in the text of my paper. There are two paths.&lt;/p&gt;
&lt;p&gt;One path is to install &lt;a href=&#34;https://retorque.re/zotero-better-bibtex/&#34;&gt;Better BibTeX for Zotero&lt;/a&gt;. This has two killer features. First, it creates &lt;em&gt;stable&lt;/em&gt; citation keys for papers. Even if I update the paper&amp;rsquo;s metadata (e.g. change its author or year), Better BibTeX ensures that the citation key doesn&amp;rsquo;t change. So if I first write a citation like &lt;code&gt;\cite{2020_davis}&lt;/code&gt;, and then discover the paper was published in 2021, I can fix the date without breaking the &lt;code&gt;cite&lt;/code&gt; directive. Its second killer feature is auto-export: I can tell it to export a Zotero collection as a BibTeX bibliography, and it will re-export to the same file each time the collection changes. This seems like a great labor-saving device for people writing LaTeX on their own machines.&lt;/p&gt;
&lt;p&gt;The second path is to write an online shared doc in &lt;a href=&#34;https://www.overleaf.com/&#34;&gt;Overleaf&lt;/a&gt;. This is what I do, because it allows Google Docs-style collaboration with co-authors. &lt;a href=&#34;https://docs.overleaf.com/integrations-and-add-ons/reference-manager-integrations/zotero&#34;&gt;Overleaf has a Zotero integration&lt;/a&gt; which dumps your whole Zotero catalog into the bibliography. If you have a huge catalog, like me, I recommend you create a Zotero &amp;ldquo;group library&amp;rdquo; and tell Overleaf to import only from there. Regardless, only the papers you actually cite will appear in the rendered PDF. You can refresh the Overleaf bibliography from Zotero whenever you want.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a demo of the whole workflow:&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/2iyI2yzp6nU?si=q4dFb3raf2AbKiyc&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen style=&#34;margin-bottom: 1em&#34;&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>This Senior Staff Engineer Vibe-Coded for the First Time, What Happened Next Will Shock You</title>
      <link>https://emptysqua.re/blog/first-time-vibecoding/</link>
      <pubDate>Wed, 19 Nov 2025 02:53:40 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/first-time-vibecoding/</guid>
      
      
      
      
      <description>&lt;p&gt;Actually it won&amp;rsquo;t shock you. Spoiler alert: Coding with AI is a skill. It&amp;rsquo;s annoying but efficient. My prior experience handwriting code helped.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/first-time-vibecoding/medieval.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;By Gemini, obviously. This is the first and only post I&amp;rsquo;ll ensloppify with AI &amp;ldquo;art.&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I started programming seriously when I was in college, in the late 90s. Basic autocomplete in Microsoft Visual C++ was the state of the art back then. I spent the next couple decades meticulously laying out my code like a 16th-century typesetter, crafting each line with an artisan&amp;rsquo;s pride.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://emptysqua.re/blog/after-244-days-off/&#34;&gt;When I switched from engineering to research&lt;/a&gt; I stopped coding regularly. Now I have spurts of programming when I prototype a new algorithm or benchmark it, and between spurts I go months without significant time in an IDE. Last year, IDEs didn&amp;rsquo;t change by the month. This year, each month is a revolution. I wanted to experience the new reality for myself. So, for two recent personal projects, I tried going all-in on AI coding.&lt;/p&gt;
&lt;p&gt;As I&amp;rsquo;ll describe below, my first attempt was frustrating because I didn&amp;rsquo;t embrace the vibe-coding lifestyle. I chose the wrong model, then micromanaged it, trying to make it write the code I would&amp;rsquo;ve written. The LLM implemented the wrong features, or wrote buggy code, or correct code that I thought was ugly. On my second attempt, I used the right model and treated it like an adult: I gave it high-level instructions and let it write all the code and run its own tests. I didn&amp;rsquo;t type a single character of code—I hardly even looked at it—and I had an enjoyable day building a useful app.&lt;/p&gt;
&lt;h1&gt;Act I: The Senior Staff Engineer is a Control Freak&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/first-time-vibecoding/office.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;Gemini doing a decent impression of the Blondie comic strip style.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For my first vibe-coding exercise, I turned a Python script into a web app and deployed it to the cloud. The script was something I&amp;rsquo;d developed over a few years to &lt;a href=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/&#34;&gt;optimize the schedule for a twice-annual Zen retreat&lt;/a&gt;. My fellow Zen students suggested I turn it into an application they could use without me. I thought this was a perfect use case for AI. I opened VS Code and accepted the default model, GPT-5 mini, and conversed with it about my needs.&lt;/p&gt;
&lt;p&gt;The model started off strong: it made useful suggestions for deploying to Google cloud. It proposed a mix of Google services that would save money and pay only for the resources I needed. It insisted I needed some complex web of Google accounts, roles, and auth tokens—I kept pushing back and asking it to simplify, but in the end it was right. I eventually followed all its instructions, after discovering there was no simpler alternative.&lt;/p&gt;
&lt;p&gt;When it came time to write code, GPT-5 mini wrote a good first web page and Flask backend. It was much quicker than my own work would&amp;rsquo;ve been, and the web page was nicely styled.&lt;/p&gt;
&lt;p&gt;I didn&amp;rsquo;t care about the code style of the generated HTML, CSS, and JavaScript, but I&amp;rsquo;m a stickler for Python style and I made the mistake of reading its code. I hated it.&lt;/p&gt;
&lt;p&gt;GPT-5 mini really likes to write useless comments:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Return the square of x.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;square&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;int&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;int&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;It also likes to write &amp;ldquo;defensive&amp;rdquo; code that only makes the program less predictable:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Return the square of x.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;square&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;int&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;int&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;is&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;None&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#204a87&#34;&gt;print&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;x wasn&amp;#39;t set&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;0&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;*&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I griped in &lt;code&gt;copilot-instructions.md&lt;/code&gt; that it shouldn&amp;rsquo;t do this, but it seemed to have no effect. (Edit: that&amp;rsquo;s because VS Code agent instructions should be in &lt;code&gt;.github/instructions&lt;/code&gt;. When I first booted the agent it suggested that we create &lt;code&gt;copilot-instructions.md&lt;/code&gt; so I figured wrongly that it knew where its own instructions file belongs.) So I started including in my prompts, &amp;ldquo;Just write the code, no comments, no error-checking.&amp;rdquo; Hilariously, it then wrote comments like:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;def&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;square&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;int&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;int&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Don&amp;#39;t check whether x is None, just let the exception be thrown.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;*&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;x&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is high-school-sophomore-style coding. Someone who&amp;rsquo;s worked on large codebases for years knows that comments are tech debt and should be minimized. They should be addressed to a reader in the far future, not to the most recent code reviewer. Defensive coding like &lt;code&gt;if x is None&lt;/code&gt; makes the code more complex, not more resilient: error-handling should be consolidated at a layer of the application that knows how to respond. I&amp;rsquo;m afraid that GPT has either seen too much sophomoric code from StackOverflow in its training set, or else OpenAI hired naïve coders to provide feedback, and they encouraged this sort of thing during &lt;a href=&#34;https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback&#34;&gt;RLHF&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I got tired of fighting with GPT-5 mini over style and started coding small Python changes myself. But when I was ready to deploy to the cloud I turned to it for help again. We got the job done faster together, but I wasn&amp;rsquo;t happy with my colleague. It hallucinated all sorts of command line options to the &lt;code&gt;gcloud&lt;/code&gt; CLI. I got myself stuck in an exasperating loop, where GPT suggested a command, I pasted it into the terminal, the command failed because of GPT&amp;rsquo;s mistakes, and I pasted the error back into GPT for diagnosis. I&amp;rsquo;d become the dumbest agent in a system of dumb agents.&lt;/p&gt;
&lt;h1&gt;Act II: The Senior Staff Engineer Learns to Chill Out and Enjoy the Vibes&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/first-time-vibecoding/beach.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;Gemini still doesn&amp;rsquo;t know how many arms humans have.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;For my second attempt, I shamelessly embraced the vibe-coding lifestyle. I used AI to build an AI agent. The thing I wanted to build was an AI coach, who would look at my long-term goals (in some PDFs), my task list (using the Remember the Milk API), how I use my time (from the Toggl Track API), and what I&amp;rsquo;ve accomplished lately (in a text file where I track accomplishments). I&amp;rsquo;d read &lt;a href=&#34;https://fly.io/blog/everyone-write-an-agent/&#34;&gt;an inspiring article about building agents&lt;/a&gt; and thought, this looks surprisingly straightforward, I&amp;rsquo;ll make an agent do it for me.&lt;/p&gt;
&lt;p&gt;(Yes, I&amp;rsquo;m telling an AI to tell an AI to tell me what to do. A few months ago this was futuristic.)&lt;/p&gt;
&lt;p&gt;In the weeks since my previous session, Microsoft &lt;a href=&#34;https://code.visualstudio.com/updates/v1_106&#34;&gt;overhauled the coding agents in VS Code&lt;/a&gt;. Now there&amp;rsquo;s a clear separation between a planning agent and a coding agent. This made my second vibe-coding project go more smoothly. Plus, instead of accepting the default model, I chose Claude Sonnet 4.5, based on rumors that it&amp;rsquo;s good.&lt;/p&gt;
&lt;p&gt;This might get confusing, since there are two LLMs in this story, so from now on:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Claude&lt;/strong&gt; is the LLM I used to write code for an &amp;ldquo;AI coach&amp;rdquo;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPT&lt;/strong&gt; is the LLM that the AI coach used as its backend.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I began with Claude in planning mode and described what I wanted: an agent written in Python, using the GPT API, plus various APIs to track my work. Its main task would be to suggest what I should work on for the next 45-minute &lt;a href=&#34;https://en.wikipedia.org/wiki/Pomodoro_Technique&#34;&gt;pomodoro&lt;/a&gt;. I wanted to chat with the agent in my terminal. Claude and I had a 20-minute conversation about my requirements. It asked me, should auth tokens be stored in the macOS keychain or a text file? How often should it refresh its data sources, such as Remember the Milk and Toggl? Whenever I start chatting with the coach, should it remember our past conversations? How should the coach prioritize tasks—due date only, or also priority, or how well they align with my long-term goals? Whenever I answered Claude&amp;rsquo;s questions, it asked me four or five more. After a few rounds of this, I clicked the button VS Code provided me, which says, &amp;ldquo;start implementing.&amp;rdquo; Claude stopped asking questions and started generating dozens of files. I don&amp;rsquo;t know if the &amp;ldquo;planning mode&amp;rdquo; agent would ever have stopped asking for more details, or how I should decide when we&amp;rsquo;ve talked enough.&lt;/p&gt;
&lt;p&gt;Claude worked for a few minutes autonomously. I occasionally had to click &amp;ldquo;ok&amp;rdquo; when it wanted to run a shell command or fetch a page of documentation from the web. The wait wasn&amp;rsquo;t long enough for me to work on something else with any focus, so I watched YouTube in 30-second slices between AI interactions. It seems helpful to have something mindless to do in these gaps—listen to music, watch climbing or cooking videos, practice tying knots&amp;hellip; Claude generated 30 Python files and 2800 lines of code, and spat out a summary of its work. It claimed it had built OAuth flows for RTM and Toggl, and a context-management system that stored recent messages in SQLite and asked GPT for summaries of older messages, to limit the size of the context window. I resisted the temptation to read the code: I just tried it out.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-plain&#34; data-lang=&#34;plain&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&amp;gt; python3 main.py
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;✓ All services authenticated
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;ℹ Refreshing all data sources...
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;✓ Data refresh complete
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;╔══════════════════════════════════════════════════════════════════╗
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;║              Professional Coaching Agent (GPT-5.1)              ║
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;╚══════════════════════════════════════════════════════════════════╝
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Type your questions or commands. Use /help for available commands.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;You: /help
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;┏━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;┃ Command        ┃ Description                          ┃
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;┡━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;│ /refresh       │ Refresh all data sources             │
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;│ /refresh-rtm   │ Refresh Remember The Milk tasks only │
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;│ /refresh-pdfs  │ Refresh PDF documents only           │
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;│ /refresh-toggl │ Refresh Toggl Track entries only     │
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;│ /refresh-logs  │ Refresh accomplishments log only     │
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;│ /help          │ Show this help message               │
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;│ /quit          │ Exit the agent                       │
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;└────────────────┴──────────────────────────────────────┘
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;You: what should i work on next?
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Coach: Given what’s on your plate and that you’ve already spent a lot of time in “AI coding” and “driver testing paper”, I’d suggest one of these for your next 45-minute block:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;**1. Review my current MongoDB development plan**
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Due: tomorrow (2025-11-19)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Why: Directly tied to your long-term goals (publishing pipeline, collaborations, synced clocks, etc.).
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- 45-minute plan:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - 10 min: Skim the dev plan PDF and your recent accomplishments.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - 25 min: Write/update concrete 2025–mid-2026 goals (papers, talks, collaborations) in a doc.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - 10 min: Turn those into 3–5 actionable RTM tasks (e.g., “Draft driver testing paper intro,” “Outline synced-clocks survey sections”).
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;**2. Prep for “Meet about leases in new architecture”**
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Due: today
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Why: High-leverage research conversation; you just finished your leader leases paper, so consolidating your thinking is valuable.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- 45-minute plan:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - 15 min: Jot down what you want from the meeting (questions, decisions, next experiments).
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - 20 min: Review any notes on leases + new architecture, and your “Leases”/“Data movement” tasks.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - 10 min: Draft a short agenda and 2–3 concrete proposals to discuss.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;If you feel a bit fried from heavy cognitive work, a lighter option:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;**3. Shut down lease benchmark EC2 instances (prep + checklist)**
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Due: 2025-11-30
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- Use 45 minutes to:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - List all benchmark-related resources.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - Decide what to snapshot/keep.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  - Draft the shutdown steps so the actual execution is quick later.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Which of these matches your current energy:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- (A) reflective/planning (dev plan),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- (B) focused technical prep (lease meeting), or
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;- (C) lighter operational cleanup (benchmark EC2)?
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This was an impressive debut. Most of the coach&amp;rsquo;s ideas were appropriate, and Claude&amp;rsquo;s code for reading the data sources and connecting to GPT all worked the first time.&lt;/p&gt;
&lt;p&gt;As I used the coach more, I found some bugs. E.g., Claude had only stubbed out the code for &lt;em&gt;updating&lt;/em&gt; RTM tasks. If I asked the coach to &amp;ldquo;put off that meeting for 2 days,&amp;rdquo; it crashed. I was smarter now than in my previous attempt, though. Instead of copying error messages from the terminal to my chat with Claude, I used VS Code&amp;rsquo;s built-in terminal, so Claude could look at the terminal and see the error messages itself. I simply told Claude, &amp;ldquo;Look at the terminal and fix this bug,&amp;rdquo; and with little further intervention it found the stubbed-out code, looked up the RTM API docs, and implemented the missing feature. I maintained my disciplined laziness, never typing or even reading any code. It only took a few hours of testing the coach and asking Claude for bugfixes or changes, and the thing I&amp;rsquo;d envisioned was fully built.&lt;/p&gt;
&lt;h1&gt;Lessons learned&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Vibe-coding is a skill.&lt;/li&gt;
&lt;li&gt;Avoid reading and writing code??&lt;/li&gt;
&lt;li&gt;Claude Sonnet is far better than GPT-5 mini, obviously. In general, it&amp;rsquo;s important to learn models&amp;rsquo; differences.&lt;/li&gt;
&lt;li&gt;Try to remove yourself from the agentic loop: give the AI the tools and permissions it needs to test its code and diagnose its mistakes autonomously.&lt;/li&gt;
&lt;li&gt;Eventually, you might need to intervene, especially (I found) when the AI doesn&amp;rsquo;t know how to use a particular API and keeps hallucinating options. Fall back to searching Google and StackOverflow, like you did last year.&lt;/li&gt;
&lt;li&gt;I need a &lt;em&gt;very&lt;/em&gt; wide monitor. At my age even 27&amp;quot; isn&amp;rsquo;t enough to display the agent chat, file explorer, code, etc., all large enough for my eyes, so I&amp;rsquo;ve upgraded to an &lt;a href=&#34;https://www.nytimes.com/wirecutter/reviews/best-ultrawide-monitors/&#34;&gt;ultrawide&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;m reminded of a more abstract lesson, the one from William Gibson: &amp;ldquo;The future is already here, it&amp;rsquo;s just not evenly distributed.&amp;rdquo; Sometimes the new technology works as advertised! Sometimes it doesn&amp;rsquo;t. Revolutionary new time-saving technologies take practice to use well.&lt;/p&gt;
&lt;p&gt;Back in 2005, my mind was blown &lt;a href=&#34;https://www.youtube.com/watch?v=Gzj723LkRJY&#34;&gt;watching David Heinemeier Hansson implement a blogging engine in 15 minutes with Ruby on Rails&lt;/a&gt;. Until I saw the demo, I wouldn&amp;rsquo;t have believed such velocity was possible. Rails was so much more concise than anything before, and so was Ruby. And the code editor DHH used, TextMate, was so tricked out with templates and macros, it practically wrote the code for him. But when I tried Rails and TextMate myself, things didn&amp;rsquo;t go so well. The new technologies really were transformative—looking back, it&amp;rsquo;s clear that DHH was demonstrating the future of web development—but it took time for me and the world to catch up.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Aerial Silks in the Gunks, part 2</title>
      <link>https://emptysqua.re/blog/aerial-silks-2/</link>
      <pubDate>Tue, 07 Oct 2025 17:13:59 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/aerial-silks-2/</guid>
      
      
      
      
      <description>&lt;p&gt;October 6, 2025. &lt;a href=&#34;https://www.instagram.com/mntnbug/&#34;&gt;Kelsey Roman&lt;/a&gt; performing with aerial silks while hanging from the West Trapps cliff in the Gunks. We photographed during sunset and moonrise. Rock guide &lt;a href=&#34;https://www.advancedrockcraft.com/&#34;&gt;Dustin Portzline&lt;/a&gt; managed Kelsey&amp;rsquo;s safety rigging and filmed behind-the-scenes footage.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-2/silks-2-09.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;I photographed this with a flash while hanging from a static line. I used a &lt;a href=&#34;https://mistymountain.com/collections/cadillac&#34;&gt;big wall harness&lt;/a&gt; for safety, a &lt;a href=&#34;https://blackdiamondequipment.com/products/bosuns-chair&#34;&gt;bosun&amp;rsquo;s chair&lt;/a&gt; for comfort, and a &lt;a href=&#34;https://www.peakdesign.com/&#34;&gt;Peak Design&lt;/a&gt; thingy that&amp;rsquo;s now out of stock for clipping my camera to my harness.&lt;/p&gt;
&lt;iframe width=&#34;315&#34; height=&#34;560&#34;
src=&#34;https://www.youtube.com/embed/C07JD-6WC6Q&#34;
title=&#34;YouTube video player&#34;
frameborder=&#34;0&#34;
allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34;
allowfullscreen&gt;&lt;/iframe&gt;
</description>
    </item>
    
    <item>
      <title>Versions of The Eternaut</title>
      <link>https://emptysqua.re/blog/the-eternaut/</link>
      <pubDate>Tue, 07 Oct 2025 13:17:09 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/the-eternaut/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/eternaut-1950s.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;In late-1950s Buenos Aires, on a quiet night, Juan is in his attic, playing cards with his close friends. The radio plays in the background. Juan&amp;rsquo;s wife and daughter are downstairs. On their suburban street, snow begins to fall. The music on the radio turns to static. Out the window, Juan sees cars crashing, pedestrians dropping dead: the snow is glowing, toxic, everyone it touches dies in seconds. It&amp;rsquo;s the prelude to an alien invasion. Juan&amp;rsquo;s family and friends are lucky: the windows were all closed, and Juan&amp;rsquo;s modern house is tightly sealed. They fashion hazmat suits from spare parts and venture out to gather supplies for their survival.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/yard.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Over the next few days they&amp;rsquo;ll witness mass death, anarchy, waves of attacks by various weird aliens and crazy weapons, the destruction of the Argentine army, and perhaps the end of humanity. In the story&amp;rsquo;s final moments, Juan tries to steal an alien ship, but presses the wrong button and gets teleported to another dimension, doomed eternally to wander space and time, searching for his wife and daughter.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;The Eternaut&lt;/em&gt; (&lt;em&gt;El Eternauta&lt;/em&gt; in the original Spanish), written by Héctor Germán Oesterheld, was serialized in an Argentine sci-fi magazine from 1955 to 1957. It&amp;rsquo;s drawn by Francisco Solano López in a pulpy style, though with more artistic brushwork and composition.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/brushwork.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Oesterheld wrote the story in the aftermath of Juan Perón&amp;rsquo;s overthrow and &lt;a href=&#34;https://en.wikipedia.org/wiki/Bombing_of_Plaza_de_Mayo&#34;&gt;an airstrike against a pro-Perón rally in Buenos Aires&amp;rsquo;s central square&lt;/a&gt;, which killed hundreds of civilians. I don&amp;rsquo;t know much about Argentina&amp;rsquo;s history, and I&amp;rsquo;m confused about Oesterheld&amp;rsquo;s opinion of Perón and the coup. Everyone has to agree that the bombing of the Plaza de Mayo was an atrocity, I assume. But in The Eternaut, it&amp;rsquo;s the good guys who bomb a plaza, trying to destroy the alien headquarters.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/plaza.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The themes of The Eternaut seem universal enough to be reused by any political faction. On one side of the conflict is a classic 50s hero, a square-jawed family man, with his brave and competent friends, a physicist and a metalworker. On the other side, inhuman and implacable enemies who bombard Buenos Aires with toxic snow and lasers, then use mind-control devices to enslave the survivors. They could represent the military forces that bombed civilians and overthrew Perón, or the repressive regime that followed, or the foreign imperialists who tried to manipulate Argentine politics from afar, or any foe of middle-class liberal patriotism.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/constitution.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The comic was published in installments, and you can see the seams between them when reading it as a book: each episode begins with a bit of recap and ends on a cliffhanger. Too many times, a major character is presumed dead and reappears improbably. But the story is among the most gripping I&amp;rsquo;ve ever read. The alien assault is relentless, the velocity of events is frantic and disorienting, the body count is in the billions, and the only predictable plotline is that all human resistance will be crushed. The story&amp;rsquo;s end is abrupt and shockingly inconclusive. We never actually see the invaders, or know for certain why they attacked and what they will do with the handful of survivors, if there are any.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/river-plate.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Eternaut 1969&lt;/h1&gt;

&lt;p&gt;In the 60s and 70s, Oesterheld moved left. He wrote biographies of Che Guevara and Eva Perón, and a book titled &lt;em&gt;450 Years of War Against Imperialism&lt;/em&gt;, and sequels to the Eternaut, which I hope are translated to English someday. Meanwhile, Argentina descended into its worst era of dictatorship and violence, known as The Dirty War. Oesterheld and his family joined a rebel group. He was &amp;ldquo;disappeared&amp;rdquo; and presumably killed by the government around 1977.&lt;/p&gt;
&lt;p&gt;In 1969, Oesterheld rebooted the Eternaut with a new artist, Alberto Breccia, serialized in an Argentine gossip magazine. Breccia&amp;rsquo;s art is like a bad ayahuasca trip. Weird shapes and gloopy lines fill the background, the characters&amp;rsquo; faces are lost in murk, the action is often illegible.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/1969-4.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The content has darkened, too. In the 1950s, Oesterheld&amp;rsquo;s main characters are a plucky band of friends, and the Argentine army is led by stoic professionals. In 1969, everyone is paranoid. Juan observes that as soon as they rescue a young woman, the men see each other as rivals for her. The army is led by a brute who murders his lieutenant for dissenting.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/1969-3.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;In the original Eternaut, &amp;ldquo;the North&amp;rdquo; is a distant ally attacking the aliens with nuclear missiles. In the reboot, the northern powers have negotiated a separate peace with the aliens and abandoned the South, without even a warning.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/1969-1.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Eternaut 1969 alienated its audience. Oesterheld&amp;rsquo;s resentful politics and Breccia&amp;rsquo;s avant-garde art led to the series&amp;rsquo;s early cancellation. In the last few pages the comic rushes to its conclusion, summarizing the plot in a few paragraphs and some vague illustrations. The work is a sad failure, comprehensible or interesting only to an Eternaut completist.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/1969-2.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;The Eternaut on Netflix&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/netflix.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The original Eternaut and Eternaut 1969 were republished, including English translations, in &lt;a href=&#34;https://www.fantagraphics.com/collections/hector-german-oesterheld/products/the-eternaut&#34;&gt;2015&lt;/a&gt; and &lt;a href=&#34;https://www.fantagraphics.com/collections/hector-german-oesterheld/products/the-eternaut-1969&#34;&gt;2020&lt;/a&gt;. I think the Eternaut comics influenced Netflix&amp;rsquo;s superb TV adaptation of the novel Station Eleven: the Station Eleven show includes images of people in improvised hazmat suits walking around snowy Chicago, and a knife fight with a hazmat-suited intruder, that seem more inspired by scenes in The Eternaut than by Emily St. John Mandel&amp;rsquo;s novel. Maybe Station Eleven&amp;rsquo;s success and the reissue of the comics led Netflix to greenlight a 2024 adaptation of The Eternaut. The first season aired last year, and another season is in production.&lt;/p&gt;
&lt;p&gt;Oesterheld&amp;rsquo;s heirs required that any adaptation must be in Spanish and set in Buenos Aires. Netflix complied, but moved the action from the 1950s to the present day. The show has the luscious visuals, &lt;a href=&#34;https://www.vox.com/culture/22840526/colors-movies-tv-gray-digital-color-sludge&#34;&gt;desaturated palette&lt;/a&gt;, stylish soundtrack, and tight plotting you&amp;rsquo;d expect from a prestige sci-fi series these days. It borrows and expands the ideas of the 1950s comic, combining them with the paranoid atmosphere of Eternaut 1969. Unfortunately, some of the original Eternaut&amp;rsquo;s attention to detail is lost. Oesterheld knew enough science to make the alien snow a consistent and believable phenomenon. His heroes, a band of can-do tinkerers, build various filters, respirators, and airlocks to survive. These are described with some specificity, in the Popular Mechanics spirit of the era. On Netflix, though, the snow seems to kill or spare people based on the writers&amp;rsquo; mood and the needs of the plot. Oesterheld&amp;rsquo;s snow blocks radio signals, neatly isolating the main characters and driving the plot, whereas on Netflix the snow is accompanied by some implausible &lt;a href=&#34;https://en.wikipedia.org/wiki/Electromagnetic_pulse&#34;&gt;super-EMP&lt;/a&gt; that not only knocks out the power grid but drains cell phone batteries specifically, sparing old cars and flashlights.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/group-netflix.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I&amp;rsquo;m not mad that Netflix is unfaithful to the comic, though. We&amp;rsquo;ve had seventy years to add ideas and overtones atop the foundation Oesterheld laid. In that time, Argentina passed through several coups, dictatorships, republics, another Perón presidency, and finally a stable democracy. Some of the genres that Oesterheld drew from, like alien invasion and post-apocalyptic survival, were new in the 1950s and are well-developed today. Perhaps most important, we&amp;rsquo;ve all recently lived through a pandemic. Like Juan when he first saw the snow, we didn&amp;rsquo;t know if the disaster was natural or a bioweapon. Like him, we were trapped inside until we jury-rigged masks so we could venture out and gather supplies. We personally saw the camaraderie and paranoia, the selfishness and courage that humanity displays in such a crisis. The Netflix writers mixed all these ingredients into a bigger, more complex story than Oesterheld&amp;rsquo;s original.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/the-eternaut/juan-netflix.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;</description>
    </item>
    
    <item>
      <title>Ridicullissima</title>
      <link>https://emptysqua.re/blog/chris-ridicullissima/</link>
      <pubDate>Mon, 29 Sep 2025 15:56:22 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/chris-ridicullissima/</guid>
      
      
      
      
      <description>&lt;p&gt;September 28, 2025. Chris climbing &lt;a href=&#34;https://www.mountainproject.com/route/105888125/ridicullissima&#34;&gt;Ridicullissima&lt;/a&gt;, a tall classic 5.10d in the Gunks. I photographed him while hanging from a rope anchored near the top of the route. At &lt;a href=&#34;https://www.advancedrockcraft.com/&#34;&gt;Dustin Portzline&lt;/a&gt;&amp;rsquo;s suggestion, I experimented with anchoring the other end of the rope to the ground, a hundred feet to the side of the climb. I attached myself to the rope with two Gri Gris and attempted to position myself in space by individually lengthening or shortening the two segments. It was awkward; I have a lot to learn. At least having two attachments prevented me from spinning.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/chris-ridicullissima/ridicullissima-01.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/chris-ridicullissima/ridicullissima-02.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/chris-ridicullissima/ridicullissima-03.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/chris-ridicullissima/ridicullissima-04.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Behind the scenes photo by Dustin Portzline:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/chris-ridicullissima/IMG_0568.jpeg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;</description>
    </item>
    
    <item>
      <title>Eye Opening Ceremony at the New Paltz Zen Center</title>
      <link>https://emptysqua.re/blog/npzc-eye-opening/</link>
      <pubDate>Sun, 28 Sep 2025 22:03:24 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/npzc-eye-opening/</guid>
      
      
      
      
      <description>&lt;p&gt;September 27, 2025. A ceremony in which the &lt;a href=&#34;https://npzc.org/&#34;&gt;New Paltz Zen Center&lt;/a&gt; was officially made a Zen temple. Jo An sensei brushed open the eyes of the Manjushri Bodhisattva statue on our altar. The abbot of our mother temple the Village Zendo joined and gave his blessing. A big step for my little community.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/npzc-eye-opening/eye-opening01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/npzc-eye-opening/eye-opening02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/npzc-eye-opening/eye-opening03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/npzc-eye-opening/eye-opening04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/npzc-eye-opening/eye-opening05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How To Belay With Prescription Glasses Or Sunglasses</title>
      <link>https://emptysqua.re/blog/how-to-belay-with-glasses/</link>
      <pubDate>Mon, 22 Sep 2025 12:35:40 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/how-to-belay-with-glasses/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/how-to-belay-with-glasses/flipped-down.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;If you want to wear sunglasses and/or prescription glasses while belaying a climber, especially outdoors on multipitch climbs, I hope this post helps you cobble together a system that is functional and fairly robust.&lt;/p&gt;
&lt;h1&gt;Glasses&lt;/h1&gt;

&lt;p&gt;I bought my first pair of &lt;a href=&#34;https://www.roka.com/&#34;&gt;Roka glasses&lt;/a&gt; a couple years ago and they transformed my life. Before, I&amp;rsquo;d worn:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;prescription glasses to see far,&lt;/li&gt;
&lt;li&gt;prescription &lt;strong&gt;sun&lt;/strong&gt;glasses to see far in sunlight,&lt;/li&gt;
&lt;li&gt;contacts to see far without my glasses falling off while I run, bike, or climb,&lt;/li&gt;
&lt;li&gt;contacts plus sunglasses to run, bike, or climb in sunlight.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Then I replaced all those configurations with Rokas. I have a pair of transition glasses (aka photochromic, they darken in sunlight), with Roka&amp;rsquo;s special rubber that grips my nose and ears. I can run, bike, and climb with them and they don&amp;rsquo;t slip, even when I sweat. They&amp;rsquo;re an all-purpose vision device: I put them on my face and I can see.&lt;/p&gt;
&lt;p&gt;You don&amp;rsquo;t need to spend $400 on a pair of Rokas though; the rest of this post applies to any (sun)glasses. I just had to tell you how much I like Rokas.&lt;/p&gt;
&lt;h1&gt;Belay glasses&lt;/h1&gt;

&lt;p&gt;Before I got my Rokas, I used to climb and belay wearing contacts. That worked fine with belay glasses except in sunlight, when I wished I had belay &lt;strong&gt;sun&lt;/strong&gt;glasses. Now, though, I wear Rokas all the time, and my belay glasses fight with my Roka glasses for nose-space.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.yyvertical.com/en/products/lunettes-dassurage-clip-up&#34;&gt;YY Vertical Clip-Ups&lt;/a&gt; are unique: they&amp;rsquo;re built to work with your existing glasses. They clip to the top of your glasses&amp;rsquo; frame, and they can flip up (so you can see normally) or down (so you can look upward through the prisms).&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-belay-with-glasses/clipup.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;They&amp;rsquo;re a bit heavy, so even the special Roka rubber can&amp;rsquo;t keep my glasses from slipping on a sweaty day when the Clip Ups are attached. YY Vertical includes a pair of ear hooks to attach to your glasses so they embrace your head more ardently. I lost mine. In Europe, &lt;a href=&#34;https://www.yyvertical.com/en/products/kit-monture?variant=40213731147864&#34;&gt;you can order replacements from YY Vertical&lt;/a&gt;. Here in the US, I replaced them with &lt;a href=&#34;https://www.amazon.com/dp/B00DQOLADS&#34;&gt;these from Amazon&lt;/a&gt;.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-belay-with-glasses/earhooks.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Combined with my glasses, they look like this:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-belay-with-glasses/glasses-with-hooks.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;They slip on and off the arms in a few seconds, so you don&amp;rsquo;t need to look dorky when you&amp;rsquo;re not climbing. The whole package looks like this:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-belay-with-glasses/glasses-with-clipups.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The Clip Ups pop on and off the glasses in a second. I put them on while I belay and put them away in their case clipped to my harness when I climb. Now I can see far, straight ahead or up, in sunlight or shade!&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-belay-with-glasses/flipped-up.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Reinforce the tips&lt;/h1&gt;

&lt;p&gt;The four rubber tips that keep the Clip Ups gripping your glasses are not well-secured to the Clip Ups. You should reinforce them with Super Glue as soon as you buy your Clip Ups, otherwise the tips will fall off almost immediately and become micro-trash. I learned this too late and lost three of the four. Just like the ear hooks, &lt;a href=&#34;https://www.yyvertical.com/en/products/kit-monture?variant=40213731147864&#34;&gt;YY Vertical sells replacements only in Europe&lt;/a&gt;. I replaced mine with the smallest size from &lt;a href=&#34;https://www.amazon.com/dp/B09ZL4RFYX&#34;&gt;this box of rubber tips&lt;/a&gt;; but I wish I hadn&amp;rsquo;t had to waste the rest of the box.&lt;/p&gt;
&lt;h1&gt;Upgrade the case&lt;/h1&gt;

&lt;p&gt;All belay glasses, including Clip Ups, are sold with a flimsy case and a flimsy carabiner that will soon break, falling off your harness when you&amp;rsquo;re five pitches up and tumbling into the void. I guess a robust case would double the total cost of the belay glasses and give us all sticker shock, but it&amp;rsquo;s a shameful waste.&lt;/p&gt;
&lt;p&gt;I replaced the Clip Ups case with this &lt;a href=&#34;https://tactical.onetigris.com/products/onetigris-tactical-front-pocket-wallet-with-zippers-small-coin-purse&#34;&gt;&amp;ldquo;tactical wallet with coin purse&amp;rdquo; by OneTigris&lt;/a&gt;, plus a full-strength climbing carabiner:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-belay-with-glasses/case.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;OneTigris makes gear for guys who want to dress up like commandos and go to January 6th riots, I think. Apparently those guys need tactical wallets with coin purses so they can buy an ice cream cone after they survive the apocalypse. OneTigris stuff is silly but well-built, and this wallet fits the Clip Ups snugly and has a carabiner loop that seems durable. It has a water-resistant zipper that keeps dust out. There are a few little pockets where you could slide a chapstick or something, too.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-belay-with-glasses/case-with-clipup.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;After years of faffing around, I think I found a belaying setup that works for me. Maybe it will work for you, too!&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How To Check All Your Feeds With One Tool</title>
      <link>https://emptysqua.re/blog/how-to-check-all-your-feeds/</link>
      <pubDate>Mon, 08 Sep 2025 01:36:07 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/how-to-check-all-your-feeds/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/how-to-check-all-your-feeds/russian-knife.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;That&amp;rsquo;s right, I&amp;rsquo;ve found it. The holy grail, the one feed to rule them all, the aggregator for your whole Internet. None of these walled-garden rent-seeking websites want you to have it, but I&amp;rsquo;ll share it with you.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s your bookmarks.&lt;/p&gt;
&lt;p&gt;I have a long and often-changing list of websites I want or need to check:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Facebook&lt;/strong&gt;, mainly for my local climbers&amp;rsquo; group.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://www.ynab.com/&#34;&gt;You Need A Budget&lt;/a&gt;&lt;/strong&gt; to track my spending.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Threads.&lt;/strong&gt; I miss Twitter&amp;rsquo;s golden age. Since I quit Twitter, I hoped Mastodon would become something. It didn&amp;rsquo;t, but I keep hoping Threads will become something.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Bluesky.&lt;/strong&gt; Since Threads is a bust, I keep hoping this will become something.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://feedbin.com/&#34;&gt;Feedbin&lt;/a&gt;&lt;/strong&gt; for the blogs I follow via RSS. This is the actual answer to, &amp;ldquo;What do I do without Twitter?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Instagram.&lt;/strong&gt; Technically it&amp;rsquo;s the worst photo site, but all the photographers are still here?&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://glass.photo/&#34;&gt;Glass&lt;/a&gt;&lt;/strong&gt;, a technically good photo site with hardly any photographers.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Discord.&lt;/strong&gt; I hate it, but I use it for talking about distributed systems with non-MongoDB colleagues.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;My Google Docs spreadsheet of friends.&lt;/strong&gt; A nerd&amp;rsquo;s way to maintain social ties. For each friend, I track how recently we&amp;rsquo;ve interacted, and a target frequency. Overdue friends turn red.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;&lt;a href=&#34;https://lobste.rs/&#34;&gt;Lobste.rs&lt;/a&gt;.&lt;/strong&gt; Hacker News without the preening fools, yet.&lt;/p&gt;
&lt;p&gt;I put bookmarks to all these sites in a folder called &amp;ldquo;check&amp;rdquo; in Chrome&amp;rsquo;s bookmarks bar. Here&amp;rsquo;s how it looks in the bookmarks manager:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-check-all-your-feeds/manager.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;When I want to spend ten or twenty minutes just goofing around on the Internet, I right-click the &amp;ldquo;check&amp;rdquo; folder and select &amp;ldquo;Open All&amp;rdquo;, and the pages open in tabs:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-to-check-all-your-feeds/check.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;This isn&amp;rsquo;t just a convenient way to goof off, it&amp;rsquo;s a behavior-modification tool for my future self. When I decided a couple years ago to get back into RSS, I added Feedbin to the folder. When I decided to give up on Mastodon, I removed it from the folder so I wouldn&amp;rsquo;t keep trying to make Mastodon happen. That bookmark is gone now, and so is my frustration.&lt;/p&gt;
&lt;p&gt;When I instituted my friends spreadsheet, I added it to the folder. When some colleagues started using Discord, I added Discord to the folder. Now my slightly addictive websites like Instagram are paired with my regular chores, like You Need A Budget. I influence my future attention by curating the bookmarks in my &amp;ldquo;check&amp;rdquo; folder.&lt;/p&gt;
&lt;p&gt;If there&amp;rsquo;s a site I want to check frequently, e.g. to see if a conference has posted next year&amp;rsquo;s dates, or whether anyone has replied to my question on an obscure forum, I add it to the folder. It doesn&amp;rsquo;t matter if the site has no way to subscribe by RSS or email. And I don&amp;rsquo;t need to use an unreliable third-party website change tracker like &lt;a href=&#34;https://visualping.io/&#34;&gt;VisualPing&lt;/a&gt;, which has trouble determining if the site really changed in a way that matters to me. I just toss the site in my &amp;ldquo;check&amp;rdquo; folder, so I&amp;rsquo;ll automatically open it the next time I want to goof off. I keep checking the site until I see the change I&amp;rsquo;m hoping for, or give up. Either way, I just delete the bookmark from the folder.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How My Blog Handles Math and Images in HTML, Atom, and Email in 2025</title>
      <link>https://emptysqua.re/blog/math-images-rss-hugo/</link>
      <pubDate>Sun, 31 Aug 2025 08:43:59 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/math-images-rss-hugo/</guid>
      
      
      
      
      <description>&lt;p&gt;I want to explain computer science on my blog, and show photos, and I want every article to look exquisite. My goal is to display math and images (especially SVGs) as beautifully as possible for people who read my articles on my site, and via an Atom feed and email. This is hard with today&amp;rsquo;s technology, plus I don&amp;rsquo;t have real frontend or design skills.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve recently started &lt;a href=&#34;https://emptysqua.re/blog/series/knowledge&#34;&gt;a series of articles about epistemology and distributed systems&lt;/a&gt;, which includes diagrams and equations that broke my existing publication system. I came up with a new stack of kludges that works for me. I&amp;rsquo;ll write them down here so I remember, and perhaps you&amp;rsquo;ll learn something you can use.&lt;/p&gt;
&lt;p&gt;Throughout this post, you can refer to this map:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/math-images-rss-hugo/blog-pipeline.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/math-images-rss-hugo/blog-pipeline.png&#34;&gt;
&lt;/picture&gt;&lt;h1&gt;My site generator&lt;/h1&gt;

&lt;p&gt;I use Hugo. I learned about Hugo early, since Steve Francia was my boss at MongoDB when he created it, but I resisted using it for some time. It&amp;rsquo;s not as extensible as the Python-based static site generators, and it relies on Markdown, which is a hacky format. But the various Python generators were all much too slow for me, so eventually I switched. Hugo is super fast enough: even now that my site has 700 articles and thousands of images, it does a full rebuild in under a second.&lt;/p&gt;
&lt;h1&gt;Images and previews&lt;/h1&gt;

&lt;p&gt;While I&amp;rsquo;m drafting an article, I run the &lt;a href=&#34;https://gohugo.io/commands/hugo_server/&#34;&gt;Hugo server&lt;/a&gt; and preview the article locally. Whenever I save a change to the article, Hugo rebuilds it and auto-reloads it in the browser. For some reason, Hugo doesn&amp;rsquo;t properly reload when I change an &lt;em&gt;image&lt;/em&gt;, even though it&amp;rsquo;s perfectly aware that the image has changed. I fixed this by creating &lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/b0389eb8483c4ad3b05a94ddd35e1267b97c8c37/emptysquare/themes/hugo_theme_emptysquare/layouts/_default/_markup/render-image.html&#34;&gt;this template&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-go-html-template&#34; data-lang=&#34;go-html-template&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{/* This is layouts/_default/_markup/render-image.html */}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{/* Email-safe output for Atom/Kit */}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{/* .IsBlock is available if configured: gohugo.io/render-hooks/images/ */}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{-&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;if&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.IsBlock&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;-}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;img&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#c4a000&#34;&gt;src&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Destination&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;|&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;safeURL&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#c4a000&#34;&gt;style&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;display: block; max-width:100%&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#c4a000&#34;&gt;alt&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Text&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#c4a000&#34;&gt;title&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Title&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{-&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;else&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;-}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;img&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#c4a000&#34;&gt;src&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Destination&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;|&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;safeURL&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#c4a000&#34;&gt;style&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;display:inline;width:auto;height:auto;vertical-align:middle&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#c4a000&#34;&gt;alt&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Text&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#c4a000&#34;&gt;title&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Title&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{-&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;end&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;-}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This template interprets Markdown images, like &lt;code&gt;![](foo.jpg)&lt;/code&gt;. If it&amp;rsquo;s running inside Hugo&amp;rsquo;s development server, then it adds a cache-busting fingerprint to the image URL. Now whenever the contents of &lt;code&gt;foo.jpg&lt;/code&gt; change, Hugo creates a copy of the file with its hash included in the filename, like &lt;code&gt;foo-xxx.jpg&lt;/code&gt;. Hugo updates the HTML page to point to &lt;code&gt;foo-xxx.jpg&lt;/code&gt;, so my browser will definitely load the newest version of the image. When I change &lt;code&gt;foo.jpg&lt;/code&gt; again, its hash is now different, so it&amp;rsquo;s named like &lt;code&gt;foo-yyy.jpg&lt;/code&gt;, and the browser reloads it properly.&lt;/p&gt;
&lt;p&gt;This shouldn&amp;rsquo;t be necessary—&lt;a href=&#34;https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Headers/ETag&#34;&gt;this is what ETags are for&lt;/a&gt;—but Hugo needs this kludge to properly refresh the preview when I change an image.&lt;/p&gt;
&lt;h1&gt;Publishing an article&lt;/h1&gt;

&lt;p&gt;I manage a variety of tasks with some PyCharm custom shortcuts and &lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/b0389eb8483c4ad3b05a94ddd35e1267b97c8c37/blog&#34;&gt;an ever-growing Python script called &lt;code&gt;blog&lt;/code&gt;&lt;/a&gt;. Working together, PyCharm and my script can do some advanced tricks, like pasting an image from the macOS clipboard into a Markdown file, or &lt;a href=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/&#34;&gt;creating a photo gallery&lt;/a&gt;, or starting up the Hugo development server and opening the browser to a preview of the draft I&amp;rsquo;m currently editing.&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;blog&lt;/code&gt; script works especially hard when I finish a draft and publish it. I&amp;rsquo;ll describe the publication pipeline below.&lt;/p&gt;
&lt;h1&gt;SVGs&lt;/h1&gt;

&lt;p&gt;I love to draw diagrams in &lt;a href=&#34;https://excalidraw.com/&#34;&gt;Excalidraw&lt;/a&gt; and export them as SVGs like this:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/math-images-rss-hugo/raft-states.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/math-images-rss-hugo/raft-states.jpg&#34;&gt;
&lt;/picture&gt;&lt;p&gt;&lt;em&gt;An intricate diagram. I drew this for &lt;a href=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/&#34;&gt;an article about knowledge&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve learned that mail readers (even web-based ones like GMail) won&amp;rsquo;t show SVGs, so I need to give them a fallback. Remember, the article pipeline is HTML → Atom → email, so my Atom feed must provide that fallback, which is then passed through Kit to email subscribers.&lt;/p&gt;
&lt;p&gt;When I publish a draft, my Python script creates a PNG for every SVG in the article (or a JPEG, if the PNG is too many bytes). I have &lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/b0389eb8483c4ad3b05a94ddd35e1267b97c8c37/emptysquare/themes/hugo_theme_emptysquare/layouts/partials/svg_to_picture.html&#34;&gt;a template&lt;/a&gt; that converts every SVG in the feed to a &lt;code&gt;&amp;lt;picture&amp;gt;&lt;/code&gt; that prefers the SVG but falls back to the PNG or JPEG. This template is called from &lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/b0389eb8483c4ad3b05a94ddd35e1267b97c8c37/emptysquare/themes/hugo_theme_emptysquare/layouts/index.rss.xml&#34;&gt;the template that generates the whole Atom feed&lt;/a&gt;.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/math-images-rss-hugo/svg-pipeline.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/math-images-rss-hugo/svg-pipeline.png&#34;&gt;
&lt;/picture&gt;&lt;h1&gt;Math&lt;/h1&gt;

&lt;p&gt;A well-rendered formula is beautiful, like this &lt;a href=&#34;https://en.wikipedia.org/wiki/Riemann_sum&#34;&gt;Riemann sum&lt;/a&gt;:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/math-images-rss-hugo/formula-07d3a87a28ab23654488f903ec8c1e0c2609df405319729c14efb8016655943e.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/math-images-rss-hugo/formula-07d3a87a28ab23654488f903ec8c1e0c2609df405319729c14efb8016655943e.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;S = \sum_{i=1}^{n} f(x_i^*) \, \Delta x_i&#34;&gt;
&lt;/picture&gt;&lt;p&gt;I use both block formulas like the one above, and also inline formulas, like &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/math-images-rss-hugo/formula-2820951a377c18ce3e89093ca7c31c204b1196a7e571781c6080f8207524a8dc.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/math-images-rss-hugo/formula-2820951a377c18ce3e89093ca7c31c204b1196a7e571781c6080f8207524a8dc.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E = mc^2&#34;&gt;
&lt;/picture&gt;. By default, &lt;a href=&#34;https://gohugo.io/content-management/mathematics/&#34;&gt;Hugo renders math with client-side Javascript called MathJax&lt;/a&gt;, but I want to show math on the web using the modern HTML &lt;code&gt;&amp;lt;math&amp;gt;&lt;/code&gt; tag instead. I override Hugo&amp;rsquo;s default with &lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/0b258f869e87163ea2f7c222f0cd8ad990764e39/emptysquare/themes/hugo_theme_emptysquare/layouts/_markup/render-passthrough.html&#34;&gt;this template&lt;/a&gt;. Hugo requires it to be called &lt;code&gt;render-passthrough.html&lt;/code&gt;, but don&amp;rsquo;t be fooled: it&amp;rsquo;s not for a generic &amp;ldquo;passthrough&amp;rdquo; feature, it&amp;rsquo;s for rendering math specifically. The template also sets a variable &lt;code&gt;hasMath&lt;/code&gt; to true, so &lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/1a25e8c4e5c38cfc9d3e53718952bbecf3022ac5/emptysquare/themes/hugo_theme_emptysquare/layouts/_default/baseof.html#L32&#34;&gt;this other template&lt;/a&gt; knows to include special math CSS.&lt;/p&gt;
&lt;p&gt;HTML &lt;code&gt;&amp;lt;math&amp;gt;&lt;/code&gt; tags may not display in Atom readers or email, so once again my Python publication script has a job to do. When I publish an article, the script searches it for formulae, and creates an SVG for each. The best way I found to do this was install the MathJax library on my laptop and &lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/b1c21fa650ae5668b0f1af1d6d9b3dce196b7e30/tex2svg.mjs&#34;&gt;execute it at build time&lt;/a&gt; to produce an SVG. Then my Python script produces a PNG for each formula&amp;rsquo;s SVG.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/math-images-rss-hugo/math-pipeline.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/math-images-rss-hugo/math-pipeline.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;The filenames for the SVGs and PNGs include a hash of the formula itself. Imagine the &amp;ldquo;xxx&amp;rdquo; in the filenames above are 256-bit hashes. On the web, you see &lt;code&gt;&amp;lt;math&amp;gt;&lt;/code&gt; tags rendered by your browser, but in Atom (and therefore in email), you see the SVG, or fall back to the PNG. &lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/570dfd612b1ce56592f4bc3910c90e457e2e51bb/emptysquare/themes/hugo_theme_emptysquare/layouts/partials/math_to_picture.html&#34;&gt;This template&lt;/a&gt; replaces &lt;code&gt;&amp;lt;math&amp;gt;&lt;/code&gt; tags with &lt;code&gt;&amp;lt;picture&amp;gt;&lt;/code&gt; tags in the Atom feed, using the hashes of the formulas to determine the correct image filenames.&lt;/p&gt;
&lt;h1&gt;Block and inline images&lt;/h1&gt;

&lt;p&gt;I didn&amp;rsquo;t use many inline images until I started writing a lot of math formulae. But now, since inline formulae become inline images in Atom and email, I need inline images to render correctly. At the start of this journey I was using Mailchimp&amp;rsquo;s feed-to-email automation, and Mailchimp seems unable to &lt;em&gt;show&lt;/em&gt; both inline and block images in emails from feeds. By default, inline images are okay, but block images expand to a monstrous width. If you select Mailchimp&amp;rsquo;s option &amp;ldquo;Resize RSS feed images to fit template&amp;rdquo;, then block images look good, but inline images are stupidly converted to blocks, which is abominable.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/ajdavis/emptysquare-hugo/blob/8d97ee31bff81cc90e4c6bbea60c657471b78dbf/emptysquare/themes/hugo_theme_emptysquare/layouts/_default/_markup/render-image.rss.xml&#34;&gt;This template&lt;/a&gt; adds CSS styles to images in the Atom feed to make them behave. It also transforms block images to inline images, wrapped in paragraphs. This didn&amp;rsquo;t actually work, though: Mailchimp mangled my images regardless.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-go-html-template&#34; data-lang=&#34;go-html-template&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{/* this file is render-image.rss.xml, it&amp;#39;s called rss but it makes
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;     email-safe output for Atom/Kit. Put a block image in a &amp;#34;p&amp;#34; and
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;     mark inline for proper fill in Atom-to-email campaign */}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{-&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;if&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.IsBlock&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;-}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;p&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;img&lt;/span&gt; &lt;span style=&#34;color:#c4a000&#34;&gt;src&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Destination&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;|&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;safeURL&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;     &lt;span style=&#34;color:#c4a000&#34;&gt;style&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;display:inline;max-width:100%;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;            width:100%;height:auto;margin:1em 0&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;     &lt;span style=&#34;color:#c4a000&#34;&gt;alt&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Text&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#c4a000&#34;&gt;title&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Title&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;lt;/&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;p&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{-&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;else&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;-}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;img&lt;/span&gt; &lt;span style=&#34;color:#c4a000&#34;&gt;src&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Destination&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;|&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;safeURL&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;     &lt;span style=&#34;color:#c4a000&#34;&gt;style&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;display:inline;width:auto;height:auto;
&lt;/span&gt;&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;            vertical-align:middle&amp;#34;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;     &lt;span style=&#34;color:#c4a000&#34;&gt;alt&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Text&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt; &lt;span style=&#34;color:#c4a000&#34;&gt;title&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#c4a000&#34;&gt;.Title&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;}}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;gt;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;{{-&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;end&lt;/span&gt;&lt;span style=&#34;color:#f8f8f8&#34;&gt; &lt;/span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;-}}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;I switched from Mailchimp to Kit (formerly ConvertKit) for my Atom-to-email automation. Mailchimp did some funny business with image widths that I couldn&amp;rsquo;t defeat, but Kit worked well from the start. I didn&amp;rsquo;t experiment to check if the template above is actually necessary with Kit, but at this point I was getting tired of futzing with my blog.&lt;/p&gt;
&lt;p&gt;(Note: this template is only executed for Markdown images like &lt;code&gt;![](foo.jpg)&lt;/code&gt;, not handwritten HTML &lt;code&gt;&amp;lt;img&amp;gt;&lt;/code&gt; tags in Markdown. I&amp;rsquo;ll have to remember that as I write.)&lt;/p&gt;
&lt;h1&gt;Conclusion&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/math-images-rss-hugo/modern-times.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;The feeding machine in Charlie Chaplin&amp;rsquo;s &amp;ldquo;Modern Times&amp;rdquo;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;What a &lt;a href=&#34;https://en.wikipedia.org/wiki/Tzimmes&#34;&gt;tzimmis&lt;/a&gt;. What a kludge. Why is all this necessary in 2025? I guess the future is already here, it&amp;rsquo;s &lt;a href=&#34;https://www.goodreads.com/quotes/681-the-future-is-already-here-it-s-just-not-evenly&#34;&gt;just not evenly distributed&lt;/a&gt;: feed readers are behind the curve, and email readers farther behind, because of security concerns or other issues I don&amp;rsquo;t understand. So I have to build a complicated pipeline to transform math into SVGs, and SVGs into other formats, to ensure graceful degradation. Furthermore, Hugo is fast but hard to customize, and it takes a precarious stack of templates to (usually) produce the HTML that I want, where I want it. These templates &lt;a href=&#34;https://blog.codinghorror.com/parsing-html-the-cthulhu-way/&#34;&gt;use regexes to parse HTML, which will inevitably break on some page someday&lt;/a&gt;. And finally, RSS-to-email automation is a dark art.&lt;/p&gt;
&lt;p&gt;I hope all this is useful to someone else. Maybe you don&amp;rsquo;t need these hacks. Maybe a basic Hugo setup works fine for your blog, or you don&amp;rsquo;t mind if your images don&amp;rsquo;t appear exactly right in all channels. Good for you. But there may come a time when you need more. &lt;a href=&#34;https://www.nku.edu/~longa/poems/housman9.html&#34;&gt;I will friend you, if I may, in the dark and cloudy day&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Knowledge and Common Knowledge in a Distributed Environment, Part 2</title>
      <link>https://emptysqua.re/blog/review-common-knowledge-part-2/</link>
      <pubDate>Mon, 25 Aug 2025 10:42:25 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-common-knowledge-part-2/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/odin-1.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Odin riding an eight-legged horse across water, flanked by two ravens.&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;Odin, god of wisdom.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This is the second in &lt;a href=&#34;https://emptysqua.re/blog/series/knowledge/&#34;&gt;my series of articles about Knowledge and Common Knowledge in a Distributed Environment&lt;/a&gt;, a beautiful and difficult 1990 paper about distributed systems and epistemology. So far, I analyzed the &amp;ldquo;muddy children puzzle,&amp;rdquo; I defined levels of knowledge, and I used these levels of knowledge to analyze the Raft protocol (which was published long after this paper).&lt;/p&gt;
&lt;p&gt;Now, the moment you&amp;rsquo;ve been waiting for: coordinated attack!&lt;/p&gt;


&lt;h1&gt;Coordinated attack&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/guan-yu.jpeg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;Chinese general &lt;a href=&#34;https://commons.wikimedia.org/wiki/File:Guan_yu_-Summer_Palace,_Beijing.JPG&#34;&gt;Guan Yu&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Two generals are encamped on hilltops, on either side of the enemy army. If they attack simultaneously, they&amp;rsquo;ll win, otherwise they&amp;rsquo;ll fail. They have no initial plan, so they send messengers back and forth, to try to agree on a time to attack. Unfortunately, messengers can be unpredictably slow, or they can be caught by the enemy. How can the generals agree on a time?&lt;/p&gt;
&lt;p&gt;(The paper&amp;rsquo;s authors, Halpern and Moses, call this problem &amp;ldquo;coordinated attack&amp;rdquo; and they say it was introduced by Jim Gray in 1978. The puzzle may be older than that. Sometimes it&amp;rsquo;s called &amp;ldquo;Chinese generals.&amp;rdquo; The &amp;ldquo;Byzantine generals problem&amp;rdquo; is a more complex version that Lamport described later.)&lt;/p&gt;
&lt;p&gt;As you probably know, the generals can never agree. If general &lt;em&gt;A&lt;/em&gt; sends a message saying, &amp;ldquo;attack at dawn,&amp;rdquo; he doesn&amp;rsquo;t know whether &lt;em&gt;B&lt;/em&gt; received it, so he risks attacking alone. So perhaps the protocol is, &amp;ldquo;attack if you proposed a time and received acknowledgment&amp;rdquo;? But if &lt;em&gt;B&lt;/em&gt; receives and acknowledges the message, she doesn&amp;rsquo;t know whether &lt;em&gt;A&lt;/em&gt; got her acknowledgment, so she also risks attacking alone. And so on. No protocol in this asynchronous system can guarantee they attack simultaneously.&lt;/p&gt;
&lt;p&gt;Halpern and Moses (citing earlier work) use a &lt;em&gt;many scenarios&lt;/em&gt; argument to prove that coordinated attack is impossible. If &lt;em&gt;A&lt;/em&gt; sent the message and hasn&amp;rsquo;t observed an acknowledgment, there are many scenarios that fit his observations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The message was lost&lt;/li&gt;
&lt;li&gt;The message is still in transit&lt;/li&gt;
&lt;li&gt;The message was received and the acknowledgment was lost&lt;/li&gt;
&lt;li&gt;The message was received and the acknowledgment is still in transit&lt;/li&gt;
&lt;/ul&gt;
&lt;br&gt;
&lt;p&gt;&lt;strong&gt;General &lt;em&gt;A&lt;/em&gt;&amp;rsquo;s knowledge is the set of facts that are true in all the possible scenarios that match his observations.&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;So &lt;em&gt;A&lt;/em&gt; knows his message was sent, but that&amp;rsquo;s it. If &lt;em&gt;B&lt;/em&gt; receives the message and sends an acknowledgment, she also has four scenarios she can&amp;rsquo;t distinguish. The generals can&amp;rsquo;t be sure they&amp;rsquo;ve agreed to attack unless &lt;em&gt;A&lt;/em&gt; knows that &lt;em&gt;B&lt;/em&gt; knows that &amp;hellip; infinitely. In other words, their attack time must be &lt;em&gt;common knowledge&lt;/em&gt;, and common knowledge is impossible in an asynchronous system.&lt;/p&gt;
&lt;p&gt;(Yes, this is related to &lt;a href=&#34;https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/&#34;&gt;FLP impossibility&lt;/a&gt;.)&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/101729.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Stylized drawing of a seated figure with two swords, wearing a cloak adorned with animal heads.&#34; title=&#34;&#34;&gt;&lt;h1&gt;Definition of knowledge&lt;/h1&gt;

&lt;p&gt;Here&amp;rsquo;s my favorite part of the paper. It begins with a mathematical model of a distributed system, which will be familiar to people who use PlusCal, TLA+, or other formalizations. There are some processors (aka nodes or agents) called &lt;em&gt;p&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;em&gt;p&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;, &amp;hellip;, &lt;em&gt;p&lt;/em&gt;&lt;sub&gt;n&lt;/sub&gt;. A &lt;em&gt;run&lt;/em&gt; of the system is just like a TLA+ &lt;em&gt;behavior:&lt;/em&gt; a sequence of events starting at time 0 and continuing infinitely (perhaps reaching a final state and staying there forever). The system is characterized by the set of all possible runs, &lt;em&gt;R&lt;/em&gt;. This is just like a TLA+ state graph. A &lt;em&gt;point&lt;/em&gt; (&lt;em&gt;r&lt;/em&gt;, &lt;em&gt;t&lt;/em&gt;) is a moment in a run &lt;em&gt;r&lt;/em&gt; at time &lt;em&gt;t&lt;/em&gt;. At every point, each processor has its own &lt;em&gt;local history&lt;/em&gt;, the events it has observed, including its own actions. E.g., a Raft node knows the actions it&amp;rsquo;s taken and the messages it&amp;rsquo;s received, but it has no direct knowledge of other nodes&amp;rsquo; histories. A &lt;em&gt;protocol&lt;/em&gt; is a function of a processor&amp;rsquo;s local history: each processor deterministically chooses its next action based on its observations so far.&lt;/p&gt;
&lt;p&gt;A processor &lt;em&gt;p&lt;/em&gt;&lt;sub&gt;i&lt;/sub&gt; &lt;em&gt;knows&lt;/em&gt; a fact &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; at point (&lt;em&gt;r&lt;/em&gt;, &lt;em&gt;t&lt;/em&gt;) if &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; is true at all points of &lt;em&gt;R&lt;/em&gt; that are &lt;em&gt;indistinguishable&lt;/em&gt; from (&lt;em&gt;r&lt;/em&gt;, &lt;em&gt;t&lt;/em&gt;). By indistinguishable points, I mean all points (&lt;em&gt;r&lt;/em&gt;&#39;, &lt;em&gt;t&lt;/em&gt;&#39;) where &lt;em&gt;p&lt;/em&gt;&lt;sub&gt;i&lt;/sub&gt;&amp;rsquo;s local history is the same as at (&lt;em&gt;r&lt;/em&gt;, &lt;em&gt;t&lt;/em&gt;). For example, if I observed my partner Jennifer go out the front door, and I haven&amp;rsquo;t seen her come back in, then I know she&amp;rsquo;s gone, based on my local history of observations. She might have taken the car or gone for a walk—since I haven&amp;rsquo;t looked in the parking lot, those two scenarios are indistinguishable to me. But &amp;ldquo;Jennifer&amp;rsquo;s gone&amp;rdquo; is true in both scenarios, so I &lt;em&gt;know&lt;/em&gt; that fact.&lt;/p&gt;
&lt;p&gt;In Raft, if the leader has sent a log entry to both followers and only one follower has acknowledged it, then the leader can&amp;rsquo;t distinguish between the scenarios where one or two followers received the entry. But &amp;ldquo;the entry is majority-replicated&amp;rdquo; is true in both scenarios, so the leader knows that fact.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft.png&#34; alt=&#34;The Raft leader has sent an entry to two followers, and one has acknowledged it&#34; title=&#34;The Raft leader has sent an entry to two followers, and one has acknowledged it&#34;&gt;
&lt;/picture&gt;&lt;p&gt;A processor&amp;rsquo;s &lt;em&gt;view&lt;/em&gt; of the system is a function of its history. The view could just be the identity function—the processor&amp;rsquo;s view is its history, i.e. the sequence of all its actions and observations. Or the processor&amp;rsquo;s view could be a &lt;em&gt;summary&lt;/em&gt; of its history, e.g. the Raft election protocol requires a node to remember only the most recent vote it cast, not the sequence of &lt;em&gt;all&lt;/em&gt; its past votes. In TLA+ we usually have some variables for each node, which are updated when nodes take actions or receive messages: these are the nodes&amp;rsquo; &lt;em&gt;views&lt;/em&gt;, their summaries of their local histories!&lt;/p&gt;
&lt;p&gt;Halpern and Moses more formally define a processor&amp;rsquo;s view of the system, and how that relates to levels of knowledge, like distributed knowledge and common knowledge, in Section 6 of the paper. It&amp;rsquo;s great, go read it.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/102550.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Norse-style panel with animal and serpent designs, labeled “Oden forskaper sig.”&#34; title=&#34;&#34;&gt;&lt;h1&gt;Indistinguishability graph&lt;/h1&gt;

&lt;p&gt;As a TLA+ guy, I usually think of a &lt;em&gt;state&lt;/em&gt; as an assignment of values to variables, and the &lt;em&gt;state graph&lt;/em&gt; connects states with directed edges representing possible state transitions. Let&amp;rsquo;s consider the muddy children from the &lt;a href=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/&#34;&gt;previous episode&lt;/a&gt;. Say there are two children. Our variables are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;a&lt;/em&gt;, &lt;em&gt;b&lt;/em&gt;: Whether child &lt;em&gt;a&lt;/em&gt; and/or &lt;em&gt;b&lt;/em&gt; is muddy.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;m&lt;/em&gt;: Whether the father has announced, &amp;ldquo;At least one of you is muddy.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;q&lt;/em&gt;: The number of times the father has asked, &amp;ldquo;Can you prove you&amp;rsquo;re muddy?&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The initial state is:&lt;/p&gt;
&lt;p&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-3bd13482c86f6f39cdd45591938ed6e8273d02e2ccb455c52b43f2dc26f52b37.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-3bd13482c86f6f39cdd45591938ed6e8273d02e2ccb455c52b43f2dc26f52b37.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;a \in \{\text{true}, \text{false}\}
\land b \in \{\text{true}, \text{false}\}
\land m = \text{false}
\land q = 0&#34;&gt;
&lt;/picture&gt;&lt;/p&gt;
&lt;p&gt;Then the father makes his announcement (&lt;em&gt;m&lt;/em&gt; becomes true), and then he starts asking his question (&lt;em&gt;q&lt;/em&gt; increases). We know at least one child is muddy; let&amp;rsquo;s say that&amp;rsquo;s child &lt;em&gt;a&lt;/em&gt; and the maybe-muddy one is &lt;em&gt;b&lt;/em&gt;. Here are the states and transitions:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/muddy-children-states.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/muddy-children-states.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;How can we use this graph to represent what the children &lt;em&gt;know&lt;/em&gt;? For any two states, let&amp;rsquo;s add a non-directed edge between them if some child has the same view in both states, i.e. the states are indistinguishable to that child. And let&amp;rsquo;s label the edge with the name of the child who can&amp;rsquo;t distinguish the states:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/muddy-children-states-2.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/muddy-children-states-2.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Child &lt;em&gt;b&lt;/em&gt; can&amp;rsquo;t distinguish if it&amp;rsquo;s muddy or clean in the initial state, so there&amp;rsquo;s a blue &amp;ldquo;b&amp;rdquo; edge between the first two initial states. The two states are distinct in &lt;em&gt;a&lt;/em&gt;&amp;rsquo;s eyes, because it can see whether &lt;em&gt;b&lt;/em&gt; is muddy, so we don&amp;rsquo;t label the edge &amp;ldquo;a&amp;rdquo;. After the father announces &amp;ldquo;at least one of you is muddy,&amp;rdquo; &lt;em&gt;b&lt;/em&gt; still doesn&amp;rsquo;t know if it&amp;rsquo;s muddy, because it sees that &lt;em&gt;a&lt;/em&gt; is muddy so &lt;em&gt;a&lt;/em&gt; might be the only muddy child. Therefore the two next states are also indistinguishable to &lt;em&gt;b&lt;/em&gt;. Finally the father asks, &amp;ldquo;can you prove you&amp;rsquo;re muddy?&amp;rdquo; As I explained in the previous article, &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt; now know if they&amp;rsquo;re muddy, so the third states are distinct to both of them and there&amp;rsquo;s no blue edge.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/102593.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Decorative border of three running bears, labeled as a variant from the Ynglinga Saga.&#34; title=&#34;&#34;&gt;&lt;p&gt;Can we do the same for Raft? The real Raft TLA+ spec has dozens of variables. Let&amp;rsquo;s simplify. Let&amp;rsquo;s say there&amp;rsquo;s one log entry, there&amp;rsquo;s a permanent leader, and there is a set &lt;em&gt;F&lt;/em&gt; of two followers named &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; and &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;. Here are our variables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;r&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;em&gt;r&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;: Follower &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; or &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; has received the entry (&amp;ldquo;r&amp;rdquo; for &amp;ldquo;receive&amp;rdquo;).&lt;/li&gt;
&lt;li&gt;&lt;em&gt;a&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;em&gt;a&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt;: The leader received an acknowledgment from &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; or &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; (&amp;ldquo;a&amp;rdquo; for &amp;ldquo;acknowledge&amp;rdquo;).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;At first all the variables are false. Then &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; or &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; receives the log entry, and some sequence of receiving and acknowledging leads to the final state, where both followers have received and acknowledged the entry. Here&amp;rsquo;s a state-transition graph, false is white and true is red:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-states.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-states.jpg&#34;&gt;
&lt;/picture&gt;&lt;div style=&#34;text-align: center&#34;&gt;
&lt;p&gt;&lt;span style=&#34;font-style: italic&#34;&gt;A state transition graph.&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;In the last article I defined two facts:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; (phi): the log entry exists&lt;/li&gt;
&lt;li&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; (psi): the log entry is majority-replicated&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A follower knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; if it received the entry, and the leader knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; if it knows at least one follower knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;, since the leader plus one follower is a majority of the three-node replica set. In other words:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-c592a67f281aff041bf4035ffa2e4f98544a5541da70aed870c1dda606aef665.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-c592a67f281aff041bf4035ffa2e4f98544a5541da70aed870c1dda606aef665.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_{leader} \psi \Longleftarrow K_{leader} S_F \varphi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Let&amp;rsquo;s analyze this with the graph. A follower &lt;em&gt;f&lt;/em&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;, if &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; is true in all the states indistinguishable to &lt;em&gt;f&lt;/em&gt; from this state. The only variable in &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;&amp;rsquo;s view is &lt;em&gt;r&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;, so all states with the same &lt;em&gt;r&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; value are indistinguishable to &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;. For &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;&amp;rsquo;s indistinguishability graph, I&amp;rsquo;ll draw edges between the nodes where &lt;em&gt;r&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; is true, and edges between the nodes where &lt;em&gt;r&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; is false:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-knowledge-f1.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-knowledge-f1.jpg&#34;&gt;
&lt;/picture&gt;&lt;div style=&#34;text-align: center&#34;&gt;
&lt;p&gt;&lt;span style=&#34;font-style: italic&#34;&gt;The follower f&lt;sub&gt;1&lt;/sub&gt;&#39;s indistinguishability graph.&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The graph for &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; is the same, but flipped vertically.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-knowledge-f2.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-knowledge-f2.jpg&#34;&gt;
&lt;/picture&gt;&lt;div style=&#34;text-align: center&#34;&gt;
&lt;p&gt;&lt;span style=&#34;font-style: italic&#34;&gt;The follower f&lt;sub&gt;2&lt;/sub&gt;&#39;s indistinguishability graph.&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;The leader doesn&amp;rsquo;t know whether a follower has received the entry, so states that differ only by their &lt;em&gt;r&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; or &lt;em&gt;r&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; values are indistinguishable to the leader. The leader knows what acknowledgments it received, so it &lt;em&gt;can&lt;/em&gt; distinguish states by their &lt;em&gt;a&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; or &lt;em&gt;a&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; values.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-knowledge-leader.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-knowledge-leader.png&#34;&gt;
&lt;/picture&gt;&lt;div style=&#34;text-align: center&#34;&gt;
&lt;p&gt;&lt;span style=&#34;font-style: italic&#34;&gt;The leader&#39;s indistinguishability graph.&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;Putting it all together:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-states-combined.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/raft-states-combined.jpg&#34;&gt;
&lt;/picture&gt;&lt;div style=&#34;text-align: center&#34;&gt;
&lt;p&gt;&lt;span style=&#34;font-style: italic&#34;&gt;The combined indistinguishability graph.&lt;/span&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;p&gt;As I said earlier, &amp;ldquo;the leader knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt;&amp;rdquo; is equivalent to &amp;ldquo;the leader knows that at least one follower knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;.&amp;rdquo; We can evaluate a property like this with the indistinguishability graph. The leader knows &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; at a state &lt;em&gt;S&lt;/em&gt;, if every path from &lt;em&gt;S&lt;/em&gt; along zero or one &lt;span style=&#34;color: #2f9e44; font-weight: bold&#34;&gt;leader edges&lt;/span&gt;, then zero or one &lt;span style=&#34;color: #1971c2; font-weight: bold&#34;&gt;&lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; edges&lt;/span&gt;, leads to a state where &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/leader-knows-f1-knows-phi.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/leader-knows-f1-knows-phi.jpg&#34;&gt;
&lt;/picture&gt;&lt;p&gt;This is a graph-style way of saying:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the leader knows &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; in state &lt;em&gt;S&lt;/em&gt;, if&lt;/li&gt;
&lt;li&gt;in all states &lt;em&gt;T&lt;/em&gt; indistinguishable (to the leader) from &lt;em&gt;S&lt;/em&gt;,&lt;/li&gt;
&lt;li&gt;&lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; in all states indistinguishable (to &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;) from &lt;em&gt;T&lt;/em&gt;!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The same goes for states where the leader knows &lt;em&gt;f&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/leader-knows-f2-knows-phi.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/leader-knows-f2-knows-phi.jpg&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Putting it all together:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/leader-knows-f1-or-f2-knows-phi.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/leader-knows-f1-or-f2-knows-phi.jpg&#34;&gt;
&lt;/picture&gt;&lt;p&gt;As you&amp;rsquo;d expect, the leader knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; in states where &lt;em&gt;a&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; or &lt;em&gt;a&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; is true; i.e. states where one or the other follower has acknowledged the entry. But it&amp;rsquo;s cool to see how it can be expressed as a graph query. This seems ripe for automatic verification. Halpern and Moses describe how various levels of knowledge, such as distributed knowledge, or &amp;ldquo;everyone knows,&amp;rdquo; or &amp;ldquo;everyone knows that everyone knows,&amp;rdquo; are properties of paths through the indistinguishability graph.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll stop here and let my brain cool off. Next time: agents in a distributed system can achieve common knowledge if they have reasonably reliable clocks.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-2/101730.jpg&#34; style=&#34;max-width: 50%;&#34; alt=&#34;Line drawing of Odin seated on a throne with animal head carvings, one arm outstretched.&#34;/&gt;
&lt;/div&gt;
&lt;p&gt;Odin and other Norse images by &lt;a href=&#34;https://www.nasjonalmuseet.no/en/collection/producer/56446/gerhard-munthe&#34;&gt;Gerhard Munthe&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Speak Up If You&#39;re Stuck in a Zazen Rut</title>
      <link>https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/</link>
      <pubDate>Tue, 19 Aug 2025 11:13:09 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/</guid>
      
      
      
      
      <description>&lt;p&gt;Certain topics seem off-limits in Zen, like: What&amp;rsquo;s the goal of meditation? How do I get better at it? What is everyone else experiencing when they sit? I spoke at the &lt;a href=&#34;https://npzc.org/&#34;&gt;New Paltz Zen Center&lt;/a&gt; about getting stuck at a plateau in meditation skill, and about Zen communities&amp;rsquo; reluctance to discuss peak experiences, progress, and goals. Here&amp;rsquo;s the video and a written version of the talk.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/mx-V0ktgxa4?si=WliNR8hPnhS6ABal&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen style=&#34;margin-bottom: 2em&#34;&gt;&lt;/iframe&gt;
&lt;hr&gt;
&lt;p&gt;There&amp;rsquo;s a lot we don&amp;rsquo;t talk about in Zen.&lt;/p&gt;
&lt;p&gt;In my experience, we hardly ever talk about what our daily zazen is &lt;em&gt;like&lt;/em&gt;. What&amp;rsquo;s it usually like for you? Is it groggy, dull, joyful, peaceful, deep, distracted? How long do you usually go between thoughts? Do you usually hear every sound outside your window, or do you kind of space out most of the time?&lt;/p&gt;
&lt;p&gt;We also don&amp;rsquo;t describe the deepest experiences we&amp;rsquo;ve ever had. And we hardly talk about progress either: have you gotten better at meditation over the years? In fact, I get the sense a lot of people are allergic to two words I just used: &amp;ldquo;progress&amp;rdquo; and &amp;ldquo;better&amp;rdquo;!&lt;/p&gt;
&lt;p&gt;I haven&amp;rsquo;t read anywhere that we&amp;rsquo;re not supposed to talk about this. Not that I remember. But I&amp;rsquo;ve picked it up from hearing what other people say or don&amp;rsquo;t say. Maybe I just made an assumption that these topics are off-limits, and never questioned it.&lt;/p&gt;
&lt;p&gt;When we first learn to meditate, there&amp;rsquo;s a lot of talk about the contents of our zazen and how to do it better. Like whenever I&amp;rsquo;ve taught beginning instruction, I meditate with people for a few minutes and then I ask them, &amp;ldquo;What happened? What was that like?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;But then that kinda &amp;hellip; dies? I don&amp;rsquo;t ask experienced meditators, &amp;ldquo;What was that last half-hour like for you?&amp;rdquo; They don&amp;rsquo;t ask me either.&lt;/p&gt;
&lt;p&gt;Maybe you&amp;rsquo;re all talking about it in private, in dokusan. Dokusan is like sex, I don&amp;rsquo;t get to see what other people are doing. Maybe you&amp;rsquo;re always asking the teachers, &amp;ldquo;I can&amp;rsquo;t stop worrying about work. How do I reliably achieve samadhi?&amp;rdquo; There&amp;rsquo;s nothing stopping me from asking that in dokusan. I mean, nothing overt. But I don&amp;rsquo;t. Over my years in Zen I&amp;rsquo;ve come to hide my curiosity, partly because I&amp;rsquo;ve heard our community&amp;rsquo;s silence on certain topics and I&amp;rsquo;ve become silent about them too.&lt;/p&gt;
&lt;p&gt;Lately I&amp;rsquo;ve been talking with some of my fellow students, and they mostly agree: we hesitate to talk intimately about meditation, and it&amp;rsquo;s a relief when we occasionally break the silence. Some of the core questions we have about our practice—over time we get the message that we&amp;rsquo;re not supposed to ask. Our strongest spiritual desires—we&amp;rsquo;re not supposed to express them. Our greatest accomplishments, we&amp;rsquo;re not supposed to celebrate them. I often feel this way, and I&amp;rsquo;m not the only one.&lt;/p&gt;
&lt;p&gt;So today is an experiment. I&amp;rsquo;m going to break some silences and see what happens.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-01.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;One of the deepest meditations I ever had was in late August 2004. I was 23 years old, it was the last night of my last sesshin at the end of a year living at &lt;a href=&#34;https://emptysqua.re/blog/yokoji-zmc-august-2019/&#34;&gt;Yokoji monastery&lt;/a&gt;, off the grid high up in the mountains in Southern California. I knew that in a few days this monastic life would end, I would fly to New York City and start a new life of some sort. So, on the final night of sesshin, after everyone else had left the zendo, I stayed. The jikido had blown out the candle—we had to be very careful at Yokoji not to burn the monastery down—and all the lights were off—electricity at night at Yokoji was from a couple of car batteries charged from a solar panel during the day, so we used it very sparingly—except there was a dim yellow light in a sconce on the zendo wall. I was sitting in that pool of dim warm light, with the dark room around me and the dark night outside, and the cacti and the pine trees and mountains around me.&lt;/p&gt;
&lt;p&gt;For a little while, I stopped. I barely had any thoughts. I heard every single cricket chirp, without missing one, for minutes upon minutes. There was nothing wrong with the world, and nothing wrong with me. For the first time in years, I wasn&amp;rsquo;t worrying about myself, wasn&amp;rsquo;t trying to fix myself. I had no preferences. Normally, I&amp;rsquo;d be impatient to leave the zendo, go down to sit on the tree stump at the bottom of the hill and smoke a cigarette in the moonlight, then read in bed for a few minutes and do all the little indulgences that would comfort me at the end of the day.&lt;/p&gt;
&lt;p&gt;But this night, I had no preference. I could leave the zendo, or I could stay all night listening to the crickets. Everything was ok, I had finally stopped trying to fix my life.&lt;/p&gt;
&lt;p&gt;Eventually I did leave the zendo of course, smoked my cigarette in the moonlight, and a few days later I flew to New York City and I&amp;rsquo;ve never had a period of zazen quite like that again. And I was pretty angry about that for a while. I gave a year of my life to Zen, and the second I leave the monastery, my samadhi is all shot and I can&amp;rsquo;t get back to that peaceful place anymore. I blamed Zen, and I blamed myself. It was actually pretty reasonable to blame myself&amp;mdash;I was smoking pot and getting drunk every night, I was a mess, I wasn&amp;rsquo;t creating the conditions for a strong Zen practice and I knew it. But I couldn&amp;rsquo;t help remembering that one night and wishing I was there again.&lt;/p&gt;
&lt;p&gt;So this is the trap of being attached to an experience, and comparing the past and the present, right? Zen texts warn us about this trap all the time. But on the other hand, that comparison told me something about how I was living; it was a signpost. I knew I was capable of that kind of samadhi, and the farther my zazen got from the experience of that night, the more clear it was that something was going wrong with how I was living in general.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-02.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;At the beginning of that year at the monastery, when I arrived at Yokoji, there was another young guy my age, Wes, who showed up the same day. I was very impressed with all the monastics at Yokoji, of course, I wanted to be just like them and I wanted them to approve of me. But I kind of hated Wes. He was maybe a year younger than me, big and goofy and enthusiastic. Once he came into the monastery kitchen, took a gallon of milk from the fridge and chugged the entire thing in one go, right in front of everybody.&lt;/p&gt;
&lt;p&gt;My worst fear was that the legit Zen monks at Yokoji would see me and Wes as the same. I had to distinguish myself in their eyes from this milk-chugger. I had to be better than him, more sophisticated. One day, Wes and I and a few of the monks were hanging out in the kitchen, and Wes said that the day before, he&amp;rsquo;d been concentrating on Mu, and he had some kind of mind-blowing experience where everything was dark and empty and Mu was everything and he fell into it. I was so jealous. How could big, goofy Wes have achieved this state already, after a few weeks, and not me!&lt;/p&gt;
&lt;p&gt;That was more than 20 years ago, and I still remember that moment in the kitchen, listening to Wes, feeling my &lt;em&gt;envy&lt;/em&gt;. It got deep into me and grew into something bigger, something that lasted. I think this is one of the reasons we don&amp;rsquo;t generally talk about our zazen experiences in public. What if other people compare themselves to us? What if they feel jealous, or on the other hand, what if their zazen is better and they feel arrogant?&lt;/p&gt;
&lt;p&gt;A subtler danger is that, by describing something aloud, I make it into a thing. I fix that memory, like sticking a needle through an insect specimen. I took that risk by describing my samadhi to you just now. Now my memory of that final night has been permanently changed by describing it to you. It&amp;rsquo;s changed and it&amp;rsquo;s hardened.&lt;/p&gt;
&lt;p&gt;And you know what? Life goes on. I get jealous sometimes, or self-satisfied other times, and my practice continues and I get over it. I have nice experiences, and then they end, and I&amp;rsquo;m disappointed, and I get over that too. Jealousy and comparison and attachment to past experiences, these are just natural human feelings, and I don&amp;rsquo;t think we have to be so concerned about protecting ourselves from them. The benefits of talking about practice openly might outweigh these risks.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/dogen.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Sometimes I have arguments with Dogen. The arguments are a bit one-sided, because I&amp;rsquo;m angry and he isn&amp;rsquo;t, plus he&amp;rsquo;s been dead for 800 years. He sits there in his black robe. He has a pale round baby face and a little rosebud mouth, like in that one painting I&amp;rsquo;ve seen of him. He&amp;rsquo;ll start with some innocuous &lt;a href=&#34;https://villagezendo.org/practice/suggested-readings/fukanzazengi/&#34;&gt;instruction&lt;/a&gt;, like: &amp;ldquo;Sit stably in samadhi.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I say: I&amp;rsquo;ve been trying for half my life and hardly ever achieved stable samadhi. The best sit I ever had was after a year in a monastery, do I have to go back to the monastery?&lt;/p&gt;
&lt;p&gt;Dogen: There is no reason to leave your own seat at home and take a meaningless trip to the dusty places of other countries. The treasure house will open of itself, and you will be able to use it at will.&lt;/p&gt;
&lt;p&gt;Me: I&amp;rsquo;ve been waiting more than 20 years and the treasure house seems mostly closed. I definitely can&amp;rsquo;t &amp;ldquo;use it at will,&amp;rdquo; I have so little control. Time swiftly passes by and opportunity is lost. I need to change something before I waste any more time.&lt;/p&gt;
&lt;p&gt;Dogen: How can we distinguish practice from enlightenment? The Vehicle of Reality is in the Self. Why should we waste our efforts trying to attain it?&lt;/p&gt;
&lt;p&gt;Me: At this point I&amp;rsquo;ve &lt;a href=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/&#34;&gt;long since given up on &amp;ldquo;attaining enlightenment.&amp;rdquo;&lt;/a&gt; I&amp;rsquo;m just trying to be a bit more enlightened from moment to moment, and I&amp;rsquo;m looking for a meditation practice that supports me better. Surely it&amp;rsquo;s ok to experiment a bit and see if something works better?&lt;/p&gt;
&lt;p&gt;Dogen: Learn to step back, turning the light inwards, illuminating the Self. Doing so, your body and mind will drop off naturally, and Original Self will manifest. If you wish to attain suchness, practice suchness immediately.&lt;/p&gt;
&lt;p&gt;Me: This is beautiful. I really mean it, Dogen, I appreciate your beautiful poetic writing. But it&amp;rsquo;s not very helpful for me. Can you break it down at all?&lt;/p&gt;
&lt;p&gt;Dogen: Think of not-thinking. How do you think of not-thinking? Beyond thinking. This is the essential way of zazen. The zazen which I am talking about is not step-by-step meditation. It is simply the dharma gate of peace and comfort. It is the practice-enlightenment of the ultimate Way.&lt;/p&gt;
&lt;p&gt;Me: But it doesn&amp;rsquo;t feel like &amp;ldquo;the dharma gate of peace and comfort&amp;rdquo; most of the time. Instead, it&amp;rsquo;s been a dharma gate of struggle and frustration.&lt;/p&gt;
&lt;p&gt;At this point Dogen sits in serene and irritating silence.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-03.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;I&amp;rsquo;ve felt quite stuck in my zazen. Some retreats are deep, some are just &amp;ldquo;whatever,&amp;rdquo; some I spend in awful anxious obsession. I can&amp;rsquo;t predict or control it. And my half-hour morning meditation has just been groggy mind-wandering; I wonder if it&amp;rsquo;s worth it or if it would be better to just sleep in.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve been on this disappointing plateau for many years, and I stopped asking teachers about it at some point. I got embarrassed to admit how stuck I was. I teach meditation to other people, I give dharma talks, I&amp;rsquo;ve been doing this so long, I&amp;rsquo;m ashamed to admit how little I still understand about how my mind works and how to guide my own zazen. And also I&amp;rsquo;d gotten the impression that everything I&amp;rsquo;ve just said isn&amp;rsquo;t Zen—I&amp;rsquo;m not supposed to evaluate my progress, I&amp;rsquo;m not supposed to want progress, I&amp;rsquo;m not supposed to want my zazen to be a certain way!&lt;/p&gt;
&lt;p&gt;So this June, my partner Keishin and I did a weeklong retreat with a company called &lt;a href=&#34;https://www.jhourney.io/&#34;&gt;Jhourney&lt;/a&gt;, which is trying to teach people to reach specific states in meditation. This company was founded a few years ago in San Francisco by some young guys who&amp;rsquo;d been practicing jhana meditation. The jhanas are levels of meditative absorption, they&amp;rsquo;re described in a few suttas of the Pali canon, written 2000 years ago. There are eight levels of jhana, give or take. In the Mahāsaccaka Sutta, Buddha says,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I recall once, when my father was working, and I was sitting in the cool shade of a rose-apple tree, then—quite secluded from sensuality, secluded from unskillful mental qualities—I entered and remained in the first jhana: rapture and pleasure born from seclusion, accompanied by directed thought and evaluation. Could that be the path to Awakening? Then following on that memory came the realization: That is the path to Awakening.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In recent years some Western Buddhists have become very interested in bringing the jhanas to the West and learning or rediscovering how to enter these states reliably. And these young San Francisco guys had been studying under some teachers, comparing notes with each other about what sort of meditation techniques worked more or less well to enter the jhana states, and they decided to create a company to teach the jhanas to others. That&amp;rsquo;s Jhourney.&lt;/p&gt;
&lt;p&gt;So, doing this Jhourney retreat was a chance to get out of my zazen rut. Jhourney is fairly scientific. They say a certain percentage of people on their retreats reach the first jhana or the second jhana. They run experiments with different methods for teaching the jhanas, and they measure their retreats&amp;rsquo; outcomes and try to optimize the path to the jhanas. That seemed really attractive to me, just a fresh way of doing things. It seemed like the opposite of Zen. Plus, I could walk into a room of strangers and admit how frustrated and stupid I feel, because I didn&amp;rsquo;t care what they thought of me.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-04.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The Jhourney retreat was, indeed, the opposite of Zen. We were handed a 120-page instruction book the first night of retreat. Most of the retreat was unscheduled free time, but we had a lot to do in that time. We had to read most of the instruction book in the first few days. The book told us to try a bunch of experiments on ourselves and write down the results. Like, when I recite a metta mantra, does it produce a feeling somewhere in my body? What&amp;rsquo;s that like? What about if I visualize a tree of compassion growing from my chest, or if I just smile? How is it different to sit or lie down, to be silent or listen to music, to sit for 45 minutes, or an hour, or more? We tried all these experiments in our meditations, figured out what seemed to work to produce a peaceful, openhearted, effortless feeling. That feeling is the ember that can grow into a roaring fire, and that&amp;rsquo;s the first jhana. Once we could produce that feeling, we kept refining and experimenting and writing down the results.&lt;/p&gt;
&lt;p&gt;We had regular interviews with &amp;ldquo;coaches,&amp;rdquo; asking what we tried, how well it worked, and what we were going to try next. One time I meditated with my coach for 20 minutes, narrating my experience moment by moment, while she directed me on how to use my attention and my imagination to steer the meditation.&lt;/p&gt;
&lt;p&gt;The retreat was mostly silent. But partway through, the retreat managers split us into groups of four to have dinner together and discuss our experiences with the other participants. We were encouraged to say what we&amp;rsquo;d experienced, whether it was a jhana or something else, and swap tips about how to go deeper in meditation. I have to admit I was really apprehensive about this—what if someone else has gone much deeper than I have, will I be really jealous and disappointed? What if that person is some meditation noobie who doesn&amp;rsquo;t deserve to succeed more than me, someone who got to some profound jhana from beginner&amp;rsquo;s luck? If that happens I&amp;rsquo;ll feel like a fool. It&amp;rsquo;ll be Wes in the kitchen all over again. Luckily I had reached Jhana One a couple times before this dinner, and that was on par with the other people, so I felt good about myself. But I noticed how the fear of being jealous was a big deal for me!&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-08.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Talking in detail about what we&amp;rsquo;d tried and how it was going, it was helpful. And it brought us together. On sesshin, it&amp;rsquo;s easy for me to assume that some people have it all figured out. If you&amp;rsquo;re sitting still and upright, I&amp;rsquo;m guessing you&amp;rsquo;ve got zazen figured out. I am &lt;em&gt;also&lt;/em&gt; sitting still and upright, and my zazen is a catastrophe, but I assume I&amp;rsquo;m the only one who&amp;rsquo;s struggling. But if we talk about it, and I hear that you&amp;rsquo;re struggling too, I&amp;rsquo;m not alone anymore.&lt;/p&gt;
&lt;p&gt;On the Jhourney retreat, the coaches set up laptops in a separate room, where we could go watch videos of past participants describing the jhanas they reached. I did not go watch the videos, again because of my fear of jealousy. And also because, towards the end of the retreat, I was enjoying meditation so much I didn&amp;rsquo;t want to spend my free time doing much else.&lt;/p&gt;
&lt;p&gt;One of the goals of the Jhourney retreat is to enter a jhana. The jhanas for most people go in a certain order, starting at Jhana One, and indeed the jhana that I entered matched the description of Jhana One. I was following the instructions: I relaxed and I smiled a little bit, I welcomed all feelings and thoughts that arose. Whereas usually in zazen I try to drop my thoughts, and let feelings come and go, with jhana meditation I welcomed each thought and feeling, I invited them to join my meditation. I focused on a feeling in my body, Jhourney calls this an &amp;ldquo;openhearted feeling,&amp;rdquo; for me it&amp;rsquo;s like a bittersweet heartache. It was the bittersweet joy and relief at finally being compassionate with myself.&lt;/p&gt;
&lt;p&gt;I sat like this for an hour and a half, occasionally shifting position, but I didn&amp;rsquo;t want to get up. At some point there was a warm achy swelling feeling in my heart, on the actual left side of my chest. I understood for the first time why all our old phrases for emotions talk about the &amp;ldquo;heart lifting&amp;rdquo; or &amp;ldquo;heart bursting,&amp;rdquo; for the first time I noticed that these feelings are literally centered in my heart. My whole body tingled, I felt joyful and excited, my heart raced, and I felt like I was being lifted off the floor by the lightness of my joy.&lt;/p&gt;
&lt;p&gt;Of course, as soon as this happened, I started to wonder: How do I make this last? This is clearly Jhana One, how do I progress through this to Jhana Two? And I lost it, like a kid who gets so excited the first time they balance on a bike that they fall right off. I entered Jhana One a few times over the retreat, and I found it fascinating and exciting. After so many years, something totally new was happening in my meditation.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-09.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;So what? So I figured out how to make a nice tingling feeling, big whoop.&lt;/p&gt;
&lt;p&gt;Jhourney has a slogan, something like: come for the jhanas and stay for the transformed relationship with your emotions.&lt;/p&gt;
&lt;p&gt;Well, now that I&amp;rsquo;m back home, my zazen is transformed. I&amp;rsquo;ve gone from sitting half an hour each morning to a full hour. And the hour feels easy. Instead of resisting distracting thoughts, or coldly watching them pass by, I actually embrace them as a welcome addition to the whole messy total. Like, if I catch myself thinking about rock climbing, I just think, &amp;ldquo;I love rock climbing,&amp;rdquo; and welcome it in to quietly join the meditation. If I have a Radiohead song stuck in my head, I think, &amp;ldquo;May Radiohead be happy,&amp;rdquo; and I welcome Thom Yorke and the band to join the meditation. They usually quiet down after a while. But I&amp;rsquo;m not trying to make them be quiet, I&amp;rsquo;m welcoming them in.&lt;/p&gt;
&lt;p&gt;Zazen isn&amp;rsquo;t a battle against my mind anymore, so I enjoy it, and find it much less effortful. I haven&amp;rsquo;t returned to Jhana One, which is a little disappointing, I was hoping to continue to later jhanas, which sound much more valuable than just tingly excitement. That&amp;rsquo;s ok, I&amp;rsquo;ll keep working on it and trying things.&lt;/p&gt;
&lt;p&gt;Since jhanas have gotten more attention in the West in the last few years, there&amp;rsquo;s been a common reaction among Western Buddhists: trying to achieve a jhana state is &lt;em&gt;striving&lt;/em&gt;, and it will only reinforce the ego. It&amp;rsquo;s counterproductive, it&amp;rsquo;s a symptom of the exact thing Buddha said is the source of suffering: thirst, trishna, the desire for things to be different from how they are.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s funny when Zen people have this reaction. It&amp;rsquo;s funny because jhana in Pali is the same as the Sanskrit word dhyana, which in Chinese is pronounced Chan, which in Japanese is pronounced Zen. We are the jhana school. It&amp;rsquo;s funny, too, that some Zen people are allergic to saying there are levels of meditation, when our koan curriculum is hundreds of ordered, numbered, pass-fail exams. Is it more dangerous to have meditation goals than koan goals?&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-11.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The key to entering jhana is, in my experience, to &lt;em&gt;very gently&lt;/em&gt; try, and not be too attached to the outcome. It&amp;rsquo;s like pissing—it&amp;rsquo;s learning a way of relaxing and letting in the joy, or letting out the piss or whatever.&lt;/p&gt;
&lt;p&gt;The zazen instructions I&amp;rsquo;ve read have said to relax, of course, but the missing part was for me to &lt;em&gt;enjoy&lt;/em&gt; letting the mind be as it is. The stereotype is that jhana meditation is about striving for a goal. But it&amp;rsquo;s paradoxical, like so many things in Buddhism: the more I accept my mind as it is, the more I welcome all my thoughts and emotions as they naturally arise, the less effort I make to change how I feel, the more easily I fall into a jhana.&lt;/p&gt;
&lt;p&gt;After the Jhourney retreat I had dokusan, and a VZ teacher told me that the way I&amp;rsquo;m meditating now, after the retreat, sounds perfectly compatible with Zen. I won&amp;rsquo;t name the teacher in case they&amp;rsquo;re listening to me now and changing their mind, concluding that in fact I&amp;rsquo;ve fallen into heresy and they disown me. But I &lt;em&gt;also&lt;/em&gt; think it&amp;rsquo;s compatible with Zen—I think the content is mostly the same, but the teaching method is totally different, and it&amp;rsquo;s what I needed to get out of my rut. That&amp;rsquo;s just upaya, skillful means: the message was better tailored for me.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-05.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Right now, I think there&amp;rsquo;s a middle path. On one side of the path, you can fall into obsessive desire to achieve something in meditation. You can compare yourself with others and get jealous. You can get attached to an experience in the past and disappointed by your experience now. That&amp;rsquo;s a danger that lots of Zen texts warn us against, and it&amp;rsquo;s a real danger. But also, I&amp;rsquo;ve found that I can get jealous and attached and disappointed, and it isn&amp;rsquo;t fatal. Eventually I get over it.&lt;/p&gt;
&lt;p&gt;But what about the danger on the other side of the middle path? What about giving up, deciding that I&amp;rsquo;m not supposed to want to get better at meditation, that I can&amp;rsquo;t get better at meditation, that my groggy mind-wandering is the best there is? I&amp;rsquo;ve fallen down on that side of the path and spent years there, wasting thousands of hours sitting on the cushion with very little benefit. I now think it&amp;rsquo;s helpful to &lt;em&gt;evaluate&lt;/em&gt; my meditation. After a period of zazen, do I feel relaxed, happy, focused? Do I enjoy it, do I look forward to it? Is it getting easier, more enjoyable, over the years? How does the rest of my day go after an hour of zazen in the morning?&lt;/p&gt;
&lt;p&gt;There are dangers in asking these questions. I could get so focused on the relative that I forget the absolute—like, thinking that if my zazen doesn&amp;rsquo;t feel good, it isn&amp;rsquo;t the manifestation of Buddha mind. I can hear Dogen saying, &amp;ldquo;You are endowed with the essential functioning of the Buddha Way, why pursue worthless pleasures that are like sparks from a flint?&amp;rdquo; And that&amp;rsquo;s a great point, Dogen, we &lt;em&gt;should&lt;/em&gt; remember that this mind has always been the Buddha mind, there is nothing to strive for and no one who can attain it. But also, it&amp;rsquo;s helpful for me to set some goals, and evaluate my progress, and distinguish what works and what doesn&amp;rsquo;t. It&amp;rsquo;s like the foot before and the foot behind in walking.&lt;/p&gt;
&lt;p&gt;Besides, the old man backs me up! Buddha said in the Kalama Sutta that we should evaluate various practices and behaviors for ourselves. If they lead to harm and suffering, abandon them; if they lead to welfare and happiness, enter and remain in them. Don&amp;rsquo;t just follow tradition or a teacher or scripture.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-10.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Why did it take me so long to admit my frustration with meditation, and try something new? I take responsibility for this, I&amp;rsquo;m a perfectionist, and I care what other people think of me. I get embarrassed if I don&amp;rsquo;t understand something, so I hide it.&lt;/p&gt;
&lt;p&gt;I also think that, for someone like me, there are some dangers lurking in Zen specifically. A lot of the old koans look like competitions between monks. They&amp;rsquo;re one-upping each other, they&amp;rsquo;re pretentious, they&amp;rsquo;re quoting sutras and Chinese literature, they&amp;rsquo;re hitting each other. &lt;a href=&#34;https://emptysqua.re/blog/are-you-prepared-to-save-a-life/&#34;&gt;Nansen killed a cat&lt;/a&gt; because no one could answer his question. Gutei cut off his servant&amp;rsquo;s finger because his servant didn&amp;rsquo;t understand. And then the commentaries compare them to generals and swordsmen, or tigers and dragons. I &lt;em&gt;like&lt;/em&gt; this fierceness, but it can make Zen into a competition where I don&amp;rsquo;t want to admit my weakness. I think this competitive, even violent streak, has passed from China through Japan and the samurai era all the way to today. Even at the fuzzy cuddly &lt;a href=&#34;https://villagezendo.org/&#34;&gt;Village Zendo&lt;/a&gt;, we have this aspect. So if you&amp;rsquo;re like me, check yourself: are you stuck in a rut, because you&amp;rsquo;re afraid to admit when you don&amp;rsquo;t get it?&lt;/p&gt;
&lt;p&gt;When I was embarrassed to talk openly with teachers, at least I had spiritual friends. The Buddha was a big fan of spiritual friendship. In the Kalyāṇamitta Sutta there&amp;rsquo;s a story where Buddha&amp;rsquo;s student Ananda says, &amp;ldquo;I&amp;rsquo;ve heard that good friends are half of the spiritual life.&amp;rdquo; Buddha replies, &amp;ldquo;No, Ananda, no, that&amp;rsquo;s not true at all. Good friends are the &lt;em&gt;whole&lt;/em&gt; of the spiritual life. Someone with good friends can expect to develop and cultivate the noble eightfold path.&amp;rdquo; In other Pali suttas the Buddha says that monastics and householders all benefit from spiritual friendship, from &lt;em&gt;kalyanamitta&lt;/em&gt;, that if a friend has attained a greater understanding on some subject of the dharma, we should engage our friend in discussion to learn what they know.&lt;/p&gt;
&lt;p&gt;For me, spiritual friendships with my peers have been essential. With teachers it&amp;rsquo;s hard for me to really express my doubt and confusion, to say, &amp;ldquo;I hate this entire book of 100 koans,&amp;rdquo; &amp;ldquo;My zazen is a catastrophe,&amp;rdquo; &amp;ldquo;I wasted every minute of that sesshin thinking about work,&amp;rdquo; &amp;ldquo;I passed that koan by imitating a car horn, but I still don&amp;rsquo;t understand it.&amp;rdquo; But with my close friends it&amp;rsquo;s easy to be real. It&amp;rsquo;s a relief to admit what a mess my practice is, how I doubt myself and the practice. I find out that most people are going through this kind of thing, at least the ones I talk to, and I&amp;rsquo;m not alone and there&amp;rsquo;s nothing wrong with me. And once I&amp;rsquo;ve calmed down a bit, hope returns. I can imagine ways to change my practice, to improve it.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-06.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Let me close with some caveats and clarifications and qualifications about the jhanas, and about Zen silences. I&amp;rsquo;m not saying everyone should practice jhana meditation. I am saying that if you feel stuck, try something new, maybe try some techniques from other traditions. I&amp;rsquo;m not claiming that the teachers and ancestors created all these Zen silences deliberately. I think they mostly arise subtly and implicitly in the community. I&amp;rsquo;m not claiming these silences are absolute: sometimes we break them, different people are bolder or more cautious. I&amp;rsquo;m not even claiming they&amp;rsquo;re all worthless and we should break them: We &lt;em&gt;should&lt;/em&gt; be cautious talking about our practice with others, we should be mindful of Right Speech. We should be careful talking about goals.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s what I&amp;rsquo;m saying: Real Zen is freedom. Freedom to break silences, or maintain them, according to circumstances. It&amp;rsquo;s a human instinct to conform to the group, but Zen is liberation from that conformity, freedom to go with the flow and to go against it.&lt;/p&gt;
&lt;p&gt;As for me, I&amp;rsquo;m going to talk more with my teachers, more with my friends, about how my practice is really going. Intimately, in detail. I won&amp;rsquo;t pretend anymore, or hide my confusion and my curiosity.&lt;/p&gt;
&lt;p&gt;I hope you all talk more openly with your spiritual friends, and trust your own wisdom, especially when it doesn&amp;rsquo;t seem to conform.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a story about when Buddha was dying, he was talking with his beloved student Ananda, who was probably panicking a bit at the thought of life without his teacher. Ananda asked how the sangha should continue without the Buddha. Buddha said everyone in the sangha must be islands unto themselves, or perhaps lamps unto themselves. The Pali word &lt;em&gt;dipa&lt;/em&gt; is the same for island and lamp, so maybe he&amp;rsquo;s saying we must all be our own light to illuminate the path, or all be our own islands, safe in the flood of samsara.&lt;/p&gt;
&lt;p&gt;He said, in the Mahaparinibbana Sutta:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;I am frail, Ananda, old, aged, far gone in years. This is my eightieth year, and my life is spent. Even as an old cart, Ananda, is held together with much difficulty, so this body is kept going only with supports.&lt;/p&gt;
&lt;p&gt;Therefore, Ananda, be islands unto yourselves, refuges unto yourselves, seeking no external refuge; with the truth as your island, the truth as your refuge, seeking no other refuge.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;img src=&#34;https://emptysqua.re/blog/speak-up-if-youre-stuck-in-a-zazen-rut/zazen-07.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Photos © A. Jesse Jiryu Davis.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Knowledge and Common Knowledge in a Distributed Environment, Part 1</title>
      <link>https://emptysqua.re/blog/review-common-knowledge-part-1/</link>
      <pubDate>Wed, 13 Aug 2025 11:24:35 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-common-knowledge-part-1/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/saraswati.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;This otherworldly painting centers upon a pale goddess dressed in white who is seated on a pink-tipped white lotus that hovers in an indeterminate space. Two massive red poppy plants in full bloom and bud, ultimately based on European botanical drawings that were adopted by Mughal artists and from there by Mewar artists and from there by a painter in Sawar, flank the goddess. The background is a solid field of rich chocolate brown.&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://asia.si.edu/explore-art-culture/collections/search/edanmdm:fsg_S2018.1.40/&#34;&gt;Saraswati, goddess of knowledge&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;We usually reason about distributed systems by asking, what can one process &lt;em&gt;know&lt;/em&gt; about the state of the system? E.g., if a majority of &lt;a href=&#34;https://raft.github.io/&#34;&gt;Raft&lt;/a&gt; nodes tell the leader that they have replicated a log entry, the leader &lt;em&gt;knows&lt;/em&gt; the entry is durable. Therefore the leader can take various actions, like replying to the user who submitted the log entry. On the other hand, if a Raft follower hasn&amp;rsquo;t heard from the leader in a while, it might &lt;em&gt;suspect&lt;/em&gt; the leader has crashed, but it can never know for sure. A process must somehow gather enough information to deduce the facts it must know to take the correct action. I think about knowledge all the time when I&amp;rsquo;m analyzing a distributed algorithm.&lt;/p&gt;
&lt;p&gt;On the other hand, my main tool for checking the correctness of an algorithm is TLA+ and the TLC model-checker. There&amp;rsquo;s no notion of knowledge in a TLA+ specification, just a mindless state machine obeying if-then instructions. When I read a TLA+ spec I might think, &amp;ldquo;In this state, the leader knows the log entry is majority-replicated.&amp;rdquo; But that&amp;rsquo;s anthropomorphism. The spec just says: if this, then do that.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m studying the 1990 paper, &amp;ldquo;&lt;a href=&#34;http://arxiv.org/abs/cs/0006009&#34;&gt;Knowledge and Common Knowledge in a Distributed Environment&lt;/a&gt;&amp;rdquo;, by Joseph Halpern and Yoram Moses. They bring a theory of knowledge and belief to distributed systems. It&amp;rsquo;s a hard paper for me, so I&amp;rsquo;ll blog it as I read, a few sections at a time. My goals are to understand the material and explain it to you. I also want to explore a new direction: integrating knowledge into TLA+. This might be a dead end, we&amp;rsquo;ll see!&lt;/p&gt;


&lt;h1&gt;Muddy Children&lt;/h1&gt;

&lt;p&gt;The paper begins with the muddy children puzzle, apparently a classic of epistemology:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Imagine &lt;em&gt;n&lt;/em&gt; children playing together. The mother of these children has told them that if they get dirty there will be severe consequences. So, of course, each child wants to keep clean, but each would love to see the others get dirty. Now it happens during their play that some of the children, say &lt;em&gt;k&lt;/em&gt; of them, get mud on their foreheads. Each can see the mud on others but not on his own forehead. So, of course, no one says a thing. Along comes the father, who says, &amp;ldquo;At least one of you has mud on your head,&amp;rdquo; thus expressing a fact known to each of them before he spoke (if &lt;em&gt;k&lt;/em&gt; &amp;gt; 1). The father then asks the following question, over and over: &amp;ldquo;Can any of you prove you have mud on your head?&amp;rdquo; Assuming that all the children are perceptive, intelligent, truthful, and that they answer simultaneously, what will happen?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Halpern and Moses introduce this puzzle to warm up the reader&amp;rsquo;s thinking about knowledge in distributed systems. When is it important for one process to know what another process knows? What is &amp;ldquo;common knowledge&amp;rdquo; and when is it needed?&lt;/p&gt;
&lt;p&gt;The solution to the muddy children puzzle is: the first &lt;em&gt;k&lt;/em&gt; - 1 times the father asks the question, all children simultaneously reply &amp;ldquo;no&amp;rdquo;: they can&amp;rsquo;t prove they&amp;rsquo;re muddy. The &lt;em&gt;k&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt; time, the muddy children say &amp;ldquo;yes&amp;rdquo; and the clean children say &amp;ldquo;no.&amp;rdquo; But if the father hadn&amp;rsquo;t announced &amp;ldquo;at least one is muddy&amp;rdquo; at the start, the muddy children would never figure out they&amp;rsquo;re muddy. Even when &lt;em&gt;k&lt;/em&gt; &amp;gt; 1, and all the children can see with their own eyes that at least one is muddy, the father&amp;rsquo;s announcement of this fact is crucial.&lt;/p&gt;
&lt;p&gt;This result is well-known to logicians, so the paper doesn&amp;rsquo;t explain it in depth. But I will. Let&amp;rsquo;s work the problem for &lt;em&gt;k&lt;/em&gt; from 1 to 3. (The number of clean children is actually irrelevant.)&lt;/p&gt;
&lt;p&gt;Here are our variables:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;k&lt;/em&gt;: the number of muddy children.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;m&lt;/em&gt;: the father&amp;rsquo;s announcement, &amp;ldquo;at least one of you is muddy.&amp;rdquo;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;q&lt;/em&gt;: the number of times the father has asked, &amp;ldquo;can you prove you&amp;rsquo;re muddy?&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;One muddy child&lt;/h2&gt;

&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k1.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k1.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;The first time the father asks the question, the muddy child sees that all the other children are clean. Since the father announced &lt;em&gt;m&lt;/em&gt;, the muddy child knows it must be the muddy one, and says &amp;ldquo;yes.&amp;rdquo; Incidentally, the muddy child knows &lt;em&gt;k&lt;/em&gt; = 1. The other children see one muddy child. They don&amp;rsquo;t know whether they&amp;rsquo;re also muddy (they don&amp;rsquo;t know if &lt;em&gt;k&lt;/em&gt; is 1 or 2), so they say &amp;ldquo;no&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;The purpose of the father&amp;rsquo;s announcement &lt;em&gt;m&lt;/em&gt; is clear here: without it, the muddy child doesn&amp;rsquo;t know if the reason it sees only clean children is because &lt;em&gt;k&lt;/em&gt; = 0, or because &lt;em&gt;k&lt;/em&gt; = 1 and it has mud on its face.&lt;/p&gt;
&lt;h2&gt;Two muddy children&lt;/h2&gt;

&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k2.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k2.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;Here&amp;rsquo;s where things get freaky. The father announces &lt;em&gt;m&lt;/em&gt;, which all the children can already see is true, and yet his announcement is necessary for solving the puzzle.&lt;/p&gt;
&lt;p&gt;When the father asks his question the first time, can Child &lt;em&gt;a&lt;/em&gt; answer &amp;ldquo;yes&amp;rdquo;? It sees that Child &lt;em&gt;b&lt;/em&gt; is muddy. It considers some possibilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;a&lt;/em&gt; could have mud on its own face or not.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;b&lt;/em&gt; could &lt;em&gt;think&lt;/em&gt; it has mud on its own face, or not.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two &lt;em&gt;pairs&lt;/em&gt; of possibilities lead to four possible worlds in &lt;em&gt;a&lt;/em&gt;&amp;rsquo;s mind:&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k2-worlds-1.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k2-worlds-1.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;But the father&amp;rsquo;s announcement &lt;em&gt;m&lt;/em&gt; is &amp;ldquo;common knowledge&amp;rdquo;: that is, everyone knows that &lt;em&gt;k&lt;/em&gt; ≥ 1, and everyone knows that everyone knows &amp;hellip; (ad infinitum) &amp;hellip; that &lt;em&gt;k&lt;/em&gt; ≥ 1. So &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;b&lt;/em&gt; knows the world where all children are clean is impossible:&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k2-worlds-2.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k2-worlds-2.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;Now the father asks both children, &amp;ldquo;Can you prove you&amp;rsquo;re muddy?&amp;rdquo; Since &lt;em&gt;a&lt;/em&gt; could be muddy or clean, depending on which world is real, it replies &amp;ldquo;no.&amp;rdquo; Child &lt;em&gt;b&lt;/em&gt; replies &amp;ldquo;no&amp;rdquo; for the same reason (swapping &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;).&lt;/p&gt;
&lt;p&gt;Child &lt;em&gt;a&lt;/em&gt; hears &lt;em&gt;b&lt;/em&gt;&amp;rsquo;s answer, and now &lt;em&gt;a&lt;/em&gt; has learned something. Child &lt;em&gt;a&lt;/em&gt; knows that &lt;em&gt;b&lt;/em&gt; sees a muddy child! If &lt;em&gt;b&lt;/em&gt; hadn&amp;rsquo;t seen a muddy child, then &lt;em&gt;b&lt;/em&gt; would&amp;rsquo;ve said &amp;ldquo;yes&amp;rdquo;, because of &lt;em&gt;m&lt;/em&gt;. This eliminates the other world where &lt;em&gt;b&lt;/em&gt; sees no muddy child:&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k2-worlds-3.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k2-worlds-3.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;All worlds where &lt;em&gt;a&lt;/em&gt; is clean are now eliminated, so the next time the father asks his question, &lt;em&gt;a&lt;/em&gt; replies &amp;ldquo;yes&amp;rdquo;: it knows it&amp;rsquo;s muddy. Child &lt;em&gt;b&lt;/em&gt; does the same.&lt;/p&gt;
&lt;p&gt;So the father&amp;rsquo;s announcement of &lt;em&gt;m&lt;/em&gt; is crucial to &lt;em&gt;a&lt;/em&gt;, even though &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;k&lt;/em&gt; ≥ 1 already: &lt;em&gt;m&lt;/em&gt; ensures that &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;b&lt;/em&gt; knows &lt;em&gt;k&lt;/em&gt; ≥ 1.&lt;/p&gt;
&lt;h2&gt;Three muddy children&lt;/h2&gt;

&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k3.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/k3.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;I&amp;rsquo;ll work this problem similarly, but without diagrams. At the start, before the father announces &lt;em&gt;m&lt;/em&gt;, &lt;em&gt;a&lt;/em&gt; sees 2 muddy children, so:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;Initial condition, before father announces m:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;a knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  k ∈ [2, 3]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  b knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    k ∈ [1, 3]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    c knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      k ∈ [0, 3]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The text above is my attempt to represent statements like, &amp;ldquo;&lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;k&lt;/em&gt; is between 2 and 3 inclusive, and &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;b&lt;/em&gt; knows &lt;em&gt;k&lt;/em&gt; is between 1 and 3 inclusive,&amp;rdquo; and so on. Why does &lt;em&gt;a&lt;/em&gt; know &lt;em&gt;b&lt;/em&gt; knows that? Because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;b&lt;/em&gt; sees mud on &lt;em&gt;c&lt;/em&gt; (definitely 1 muddy child)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;b&lt;/em&gt; could think its own face is muddy (maybe 1 more muddy child)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;b&lt;/em&gt; (in &lt;em&gt;a&lt;/em&gt;&amp;rsquo;s mind) might see mud on &lt;em&gt;a&lt;/em&gt; (maybe 1 more muddy child)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hence &lt;em&gt;b&lt;/em&gt; (in &lt;em&gt;a&lt;/em&gt;&amp;rsquo;s mind) could think there are 1, 2, or 3 muddy children.&lt;/p&gt;
&lt;p&gt;How could &lt;em&gt;a&lt;/em&gt; think &lt;em&gt;b&lt;/em&gt; thinks &lt;em&gt;c&lt;/em&gt; thinks there are 0 muddy children, even though &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;c&lt;/em&gt; sees the mud on &lt;em&gt;b&lt;/em&gt;&amp;rsquo;s face? Well, if &lt;em&gt;a&lt;/em&gt; is clean (in &lt;em&gt;a&lt;/em&gt;&amp;rsquo;s mind), and &lt;em&gt;b&lt;/em&gt; thinks its own face is clean (in &lt;em&gt;a&lt;/em&gt;&amp;rsquo;s mind), then &lt;em&gt;b&lt;/em&gt; thinks &lt;em&gt;c&lt;/em&gt; might see no muddy children (still in &lt;em&gt;a&lt;/em&gt;&amp;rsquo;s mind). So &lt;em&gt;a&lt;/em&gt; thinks it&amp;rsquo;s possible for &lt;em&gt;b&lt;/em&gt; to think &lt;em&gt;c&lt;/em&gt; sees no muddy children, although &lt;em&gt;a&lt;/em&gt; knows that&amp;rsquo;s wrong!&lt;/p&gt;
&lt;p&gt;But then the father announces &lt;em&gt;m&lt;/em&gt;. Now it&amp;rsquo;s common knowledge that &lt;em&gt;k&lt;/em&gt; ≥ 1:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;After father announces m:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;a knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  k ∈ [2, 3]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  b knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    k ∈ [1, 3]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    c knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;      k ∈ [1, 3]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The father asks the question, &amp;ldquo;can you prove you&amp;rsquo;re muddy?&amp;rdquo;, once:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;After father announces m, and q=1:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;a knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  k ∈ [2, 3]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  b knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;    k ∈ [2, 3]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    c knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      k ∈ [1, 3]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Child &lt;em&gt;c&lt;/em&gt; says &amp;ldquo;no,&amp;rdquo; which must mean it sees a muddy child. Child &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;b&lt;/em&gt; knows &lt;em&gt;c&lt;/em&gt; is muddy as well, so &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;b&lt;/em&gt; knows &lt;em&gt;k&lt;/em&gt; ≥ 2.&lt;/p&gt;
&lt;p&gt;The father asks a second time:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;After father announces m, and q=2:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;a knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;  k ∈ [3, 3]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  b knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    k ∈ [2, 3]
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    c knows:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      k ∈ [1, 3]
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Since &lt;em&gt;b&lt;/em&gt; said &amp;ldquo;no,&amp;rdquo; and &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;b&lt;/em&gt; knew &lt;em&gt;k&lt;/em&gt; ≥ 2, &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;b&lt;/em&gt; sees 2 muddy children. Child &lt;em&gt;a&lt;/em&gt; also knows &lt;em&gt;b&lt;/em&gt; is muddy, so &lt;em&gt;a&lt;/em&gt; knows &lt;em&gt;k&lt;/em&gt; ≥ 3. Since &lt;em&gt;a&lt;/em&gt; only sees 2 muddy children, &lt;em&gt;a&lt;/em&gt; knows the third muddy child is itself and &lt;em&gt;k&lt;/em&gt; = 3. The third time the father asks, &lt;em&gt;a&lt;/em&gt; says &amp;ldquo;yes.&amp;rdquo; The other muddy children are reasoning identically to &lt;em&gt;a&lt;/em&gt;, so they also say &amp;ldquo;yes.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/princess-bride.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;The battle of wits in The Princess Bride&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;The battle of wits in The Princess Bride.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;&lt;em&gt;k&lt;/em&gt; muddy children&lt;/h2&gt;

&lt;p&gt;We&amp;rsquo;ve seen that for each &lt;em&gt;k&lt;/em&gt; from 1 to 3, when the father asks the question for the &lt;em&gt;k&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt; time, all muddy children answer &amp;ldquo;yes&amp;rdquo; and the others answer &amp;ldquo;no.&amp;rdquo; Let&amp;rsquo;s say &lt;em&gt;a&lt;/em&gt; sees 3 muddy children and assumes it has no mud on its own face, i.e. it assumes &lt;em&gt;k&lt;/em&gt; = 3. If the father asks 3 times and all children still answer &amp;ldquo;no,&amp;rdquo; then &lt;em&gt;a&lt;/em&gt; knows its assumption was false, so &lt;em&gt;a&lt;/em&gt; knows it&amp;rsquo;s muddy and &lt;em&gt;k&lt;/em&gt; = 4. Child &lt;em&gt;a&lt;/em&gt; then correctly answers &amp;ldquo;yes&amp;rdquo; to the 4&lt;sup&gt;th&lt;/sup&gt; question. Same for the other children. And so on for all &lt;em&gt;k&lt;/em&gt; &amp;gt; 4, inductively.&lt;/p&gt;
&lt;h1&gt;A hierarchy of states of knowledge&lt;/h1&gt;

&lt;p&gt;We&amp;rsquo;ve seen a weird phenomenon, where the father&amp;rsquo;s announcement of a fact that everyone already knows somehow gives the children useful information. I explained to myself and to you how this works in the muddy children puzzle. Halpern and Moses explain it in general, by defining a hierarchy of states of knowledge. (I think this is based on &lt;a href=&#34;https://en.wikipedia.org/wiki/Common_knowledge_(logic)&#34;&gt;prior work in philosophy&lt;/a&gt;, and the authors are porting it to computer science.) To begin, they introduce the notation:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-2b2ca63c5e7138679c0f6143f264cb57442827ea489153f9fe1bc67ff3031a4c.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-2b2ca63c5e7138679c0f6143f264cb57442827ea489153f9fe1bc67ff3031a4c.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_i \varphi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;This is read, “&lt;em&gt;i&lt;/em&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;.” Some agent (or process or whatever) &lt;em&gt;i&lt;/em&gt; knows a fact &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; (Greek letter &amp;ldquo;phi&amp;rdquo; for &amp;ldquo;fact&amp;rdquo;). I&amp;rsquo;ll discuss the authors&amp;rsquo; definition of knowledge later, it&amp;rsquo;s terrific, stay tuned! For now, let&amp;rsquo;s just say an agent&amp;rsquo;s knowledge can depend only on the agent&amp;rsquo;s local history, i.e. its initial state and the actions it&amp;rsquo;s taken and observed. Also, knowledge is always true. A &lt;em&gt;belief&lt;/em&gt; can be false, but if an agent &lt;em&gt;knows&lt;/em&gt; &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;, then &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; is a true fact.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the authors&amp;rsquo; hierarchy of knowledge, from weakest to strongest:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-b51468a193d61576f54d9d2375f873b752a4bbd3e8e0f2f4106310958fe782ba.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-b51468a193d61576f54d9d2375f873b752a4bbd3e8e0f2f4106310958fe782ba.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;D_G \varphi&#34;&gt;
&lt;/picture&gt;     “the group &lt;em&gt;G&lt;/em&gt; has distributed knowledge of &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;”&lt;br&gt;&lt;br&gt;A fact &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; is distributed knowledge if someone with a global view could infer &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; from everything known by every agent in some group &lt;em&gt;G&lt;/em&gt;, even if no individual agent in &lt;em&gt;G&lt;/em&gt; knows.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-630941185931f823d4d15178f7eedbdbd6f561435cb8a5c4b5d77b49885a418b.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-630941185931f823d4d15178f7eedbdbd6f561435cb8a5c4b5d77b49885a418b.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;S_G \varphi&#34;&gt;
&lt;/picture&gt;     “someone in &lt;em&gt;G&lt;/em&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;”&lt;br&gt;&lt;br&gt;Defined as: &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-2b2ca63c5e7138679c0f6143f264cb57442827ea489153f9fe1bc67ff3031a4c.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-2b2ca63c5e7138679c0f6143f264cb57442827ea489153f9fe1bc67ff3031a4c.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_i \varphi&#34;&gt;
&lt;/picture&gt; for some &lt;em&gt;i&lt;/em&gt; in &lt;em&gt;G&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-97db64e56179153b2030f4a3d22de6f04d29d165e6caaee9aaeb8e24e297a313.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-97db64e56179153b2030f4a3d22de6f04d29d165e6caaee9aaeb8e24e297a313.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G \varphi&#34;&gt;
&lt;/picture&gt;     “everyone in &lt;em&gt;G&lt;/em&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;”&lt;br&gt;&lt;br&gt;Defined as: &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-2b2ca63c5e7138679c0f6143f264cb57442827ea489153f9fe1bc67ff3031a4c.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-2b2ca63c5e7138679c0f6143f264cb57442827ea489153f9fe1bc67ff3031a4c.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_i \varphi&#34;&gt;
&lt;/picture&gt; for all &lt;em&gt;i&lt;/em&gt; in &lt;em&gt;G&lt;/em&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-cc8abdd4bf430f4dc54fb0799e878d9e071bc09217ded56347be211e2fe538b9.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-cc8abdd4bf430f4dc54fb0799e878d9e071bc09217ded56347be211e2fe538b9.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G^k \varphi&#34;&gt;
&lt;/picture&gt;     “&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; is &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-b607a74a7267e802fd44dc5acfec2c00fe224f3ab4a25f242a202478c9e9a162.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-b607a74a7267e802fd44dc5acfec2c00fe224f3ab4a25f242a202478c9e9a162.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E^k&#34;&gt;
&lt;/picture&gt;-&lt;em&gt;knowledge&lt;/em&gt; in &lt;em&gt;G&lt;/em&gt;”&lt;br&gt;&lt;br&gt;Everyone in &lt;em&gt;G&lt;/em&gt; knows that everyone in &lt;em&gt;G&lt;/em&gt; knows that &amp;hellip; everyone in &lt;em&gt;G&lt;/em&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;, where &amp;ldquo;everyone in &lt;em&gt;G&lt;/em&gt; knows that&amp;rdquo; is repeated &lt;em&gt;k&lt;/em&gt; times.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-58cef3ef72cb48756580bc9f2689d04f439c598eb8c81bd1717c3f8fd7eba368.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-58cef3ef72cb48756580bc9f2689d04f439c598eb8c81bd1717c3f8fd7eba368.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C_G \varphi&#34;&gt;
&lt;/picture&gt;     “&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; is &lt;em&gt;common knowledge&lt;/em&gt; in &lt;em&gt;G&lt;/em&gt;”&lt;br&gt;&lt;br&gt;Defined as, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-cc8abdd4bf430f4dc54fb0799e878d9e071bc09217ded56347be211e2fe538b9.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-cc8abdd4bf430f4dc54fb0799e878d9e071bc09217ded56347be211e2fe538b9.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G^k \varphi&#34;&gt;
&lt;/picture&gt; for all &lt;em&gt;k&lt;/em&gt; ≥ 1.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The authors point out, using this framework, that when the muddy children puzzle begins, &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-a95c63a435fd660b906530fecb78989355bd93307eccb168aba7fa8acfddc73a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-a95c63a435fd660b906530fecb78989355bd93307eccb168aba7fa8acfddc73a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G^{k-1} m&#34;&gt;
&lt;/picture&gt; is true. E.g., if there are two muddy children, then everyone knows &lt;em&gt;m&lt;/em&gt;. If there are three muddy children, everyone knows that everyone knows &lt;em&gt;m&lt;/em&gt;, because everyone sees at least 2 muddy children, so everyone knows everyone else sees at least 1 muddy child. But to &lt;em&gt;solve&lt;/em&gt; the puzzle, they must upgrade their knowledge from &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-a95c63a435fd660b906530fecb78989355bd93307eccb168aba7fa8acfddc73a.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-a95c63a435fd660b906530fecb78989355bd93307eccb168aba7fa8acfddc73a.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G^{k-1} m&#34;&gt;
&lt;/picture&gt; to &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f57d032309ee092385638b38fef3536340b9f44aa3069b094176a9c418733c3d.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f57d032309ee092385638b38fef3536340b9f44aa3069b094176a9c418733c3d.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G^k m&#34;&gt;
&lt;/picture&gt;, which is what the father&amp;rsquo;s announcement does. (The father&amp;rsquo;s announcement goes farther, making &lt;em&gt;m&lt;/em&gt; common knowledge, but all he &lt;em&gt;must&lt;/em&gt; do is upgrade &lt;em&gt;m&lt;/em&gt; by one level, to &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f57d032309ee092385638b38fef3536340b9f44aa3069b094176a9c418733c3d.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f57d032309ee092385638b38fef3536340b9f44aa3069b094176a9c418733c3d.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G^k m&#34;&gt;
&lt;/picture&gt;.)&lt;/p&gt;
&lt;p&gt;This is a useful way to think about nodes in a distributed system: each has limited knowledge, but there is distributed knowledge implicit in the whole system. To correctly take certain actions, nodes need a certain level of knowledge or higher. Nodes exchange messages to promote their knowledge up the hierarchy. Each level in the hierarchy implies all the lower levels:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-0019cc31054281c200f7dafb2720ef4f98bebcae93c6d80698af5b29a285f051.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-0019cc31054281c200f7dafb2720ef4f98bebcae93c6d80698af5b29a285f051.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;C_G \varphi \Longrightarrow ... \Longrightarrow E_G^{k+1} \varphi \Longrightarrow E_G^k \varphi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;
&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-7729f9959f5d5e17418d6199f50703b89cbea3cf1e682cf19f4c3dd67b9aa0b2.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-7729f9959f5d5e17418d6199f50703b89cbea3cf1e682cf19f4c3dd67b9aa0b2.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\Longrightarrow ... \Longrightarrow E_G \varphi \Longrightarrow S_G \varphi \Longrightarrow D_G \varphi&#34;&gt;
&lt;/picture&gt;&lt;h1&gt;Knowledge hierarchies in Raft&lt;/h1&gt;

&lt;p&gt;Halpern and Moses don&amp;rsquo;t talk much about actual distributed protocols, but I kept thinking about Raft—how does the Raft protocol look if it&amp;rsquo;s recast as a flow of knowledge?&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say a Raft leader creates a log entry. I&amp;rsquo;ll call the fact that the entry exists &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;, and right now only the leader knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;.&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/raft-knowledge-1.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/raft-knowledge-1.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;The leader sends the log entry to its two followers, but they haven&amp;rsquo;t acknowledged it yet:&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/raft-knowledge-2.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/raft-knowledge-2.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;Now &lt;strong&gt;everyone&lt;/strong&gt; knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;, but the leader doesn&amp;rsquo;t know any follower knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;. Let&amp;rsquo;s say &lt;em&gt;G&lt;/em&gt; is the set of all nodes and &lt;em&gt;F&lt;/em&gt; is the set of followers. Using Halpern and Moses&amp;rsquo;s notation:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-97db64e56179153b2030f4a3d22de6f04d29d165e6caaee9aaeb8e24e297a313.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-97db64e56179153b2030f4a3d22de6f04d29d165e6caaee9aaeb8e24e297a313.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G \varphi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;
&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-253f819a6c82fbc301b0c663d13118b36a1216ea0b7cf61e295b28a6b4a38e39.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-253f819a6c82fbc301b0c663d13118b36a1216ea0b7cf61e295b28a6b4a38e39.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\neg K_{leader} S_F \varphi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Let&amp;rsquo;s call the fact &amp;ldquo;the entry is durable&amp;rdquo; &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt;, the Greek letter psi. This fact is true if the entry is replicated to at least a majority of nodes, so it&amp;rsquo;s certainly true if it&amp;rsquo;s replicated by all nodes—that is, if everyone knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt; then &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; is true. But currently only God can see that all nodes have the entry, so &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; is &lt;strong&gt;distributed&lt;/strong&gt; knowledge.&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f90a171282a358b8a266142a08c43d49ed17aaf016970c35c3bde0fb3a4980b0.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f90a171282a358b8a266142a08c43d49ed17aaf016970c35c3bde0fb3a4980b0.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G \varphi \Longrightarrow D_G \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;
&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-696a35732b704f4d60fbfd0e500d574ce097f906e393a4f6ac59d605b5f284f3.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-696a35732b704f4d60fbfd0e500d574ce097f906e393a4f6ac59d605b5f284f3.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\neg S_G \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Then the leader receives an acknowledgment.&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/raft-knowledge-3.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/raft-knowledge-3.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;We can say the leader knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; if it knows any follower knows &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-959f488598ceedf44cf203da9022bbd83509805891a8286cb0fbfcbe8a41f932.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\varphi&#34;&gt;
&lt;/picture&gt;:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-c592a67f281aff041bf4035ffa2e4f98544a5541da70aed870c1dda606aef665.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-c592a67f281aff041bf4035ffa2e4f98544a5541da70aed870c1dda606aef665.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;K_{leader} \psi \Longleftarrow K_{leader} S_F \varphi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;(This is true because the leader + one follower is a majority. If there were more than 3 nodes we&amp;rsquo;d need a different rule.) Now &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; has been upgraded, from &lt;strong&gt;distributed&lt;/strong&gt; knowledge to something that &lt;strong&gt;someone&lt;/strong&gt; knows:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-d35098203db42f9782bac599e04e8576256f7c2b8e2b94a7f617575d07fa9b07.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-d35098203db42f9782bac599e04e8576256f7c2b8e2b94a7f617575d07fa9b07.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;S_G \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;The leader tells the followers that the entry is committed.&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/raft-knowledge-4.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/raft-knowledge-4.png&#34;&gt;
&lt;/picture&gt;&lt;/div&gt;
&lt;p&gt;Now &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; has been upgraded again, from a fact that &lt;strong&gt;someone&lt;/strong&gt; knows to a fact that &lt;strong&gt;everyone&lt;/strong&gt; knows.&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-0ff846268b18e94c2b960f8fa875569e61a382170e92aabe27a299a96e85063b.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-0ff846268b18e94c2b960f8fa875569e61a382170e92aabe27a299a96e85063b.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;E_G \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;But (according to the &lt;a href=&#34;https://raft.github.io/&#34;&gt;Raft paper&lt;/a&gt;) followers don&amp;rsquo;t tell the leader which entries they know are committed, so &lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f301f970eb09c336d31954a22511ffad807e5a02b0371810267307048c281891.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\psi&#34;&gt;
&lt;/picture&gt; doesn&amp;rsquo;t become something that everyone knows that everyone knows, much less &lt;strong&gt;common&lt;/strong&gt; knowledge.&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-86a7e458981ae12c4c0bca0f61e292b280f116575959d7c972fac657257e7c11.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-86a7e458981ae12c4c0bca0f61e292b280f116575959d7c972fac657257e7c11.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\neg E_G^2 \psi&#34;&gt;
&lt;/picture&gt;&lt;p&gt;
&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f741677e536befc2e85017131e5095e4809e34903ff2015e2121535b77fcd6b5.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/formula-f741677e536befc2e85017131e5095e4809e34903ff2015e2121535b77fcd6b5.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\neg C_G \psi&#34;&gt;
&lt;/picture&gt;&lt;h1&gt;Onward&lt;/h1&gt;

&lt;p&gt;This covers the first three sections of the paper. We&amp;rsquo;ve seen Halpern and Moses&amp;rsquo;s knowledge hierarchy, and how it&amp;rsquo;s useful for analyzing the muddy children puzzle and Raft. Next, we&amp;rsquo;ll briefly visit the famous Byzantine Generals, who are still trying to decide when to attack their common enemy. After that we&amp;rsquo;ll get into the meat of the paper, which offers a remarkably satisfying definition of knowledge.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-common-knowledge-part-1/lakshmi-saraswati.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Hindu goddesses Lakshmi and Saraswati playing castanets and a tambura. Watercolour drawing.&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://wellcomecollection.org/works/bxw2ajua/images?id=j6au7hbk&#34;&gt;Lakshmi and Saraswati&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Optimizing a Meditation Retreat with a SAT Solver</title>
      <link>https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/</link>
      <pubDate>Tue, 29 Jul 2025 22:00:52 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20240104_000096740001.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Well, a SAT solver can optimize the retreat &lt;em&gt;schedule&lt;/em&gt;. A meditation retreat as a whole is a living creature, beyond any individual&amp;rsquo;s control. But I can provide a schedule that functions smoothly and reduces distractions. I used Python and &lt;em&gt;integer programming&lt;/em&gt;, a technique for mathematical optimization.&lt;/p&gt;
&lt;h1&gt;The problem&lt;/h1&gt;

&lt;p&gt;The &lt;a href=&#34;https://villagezendo.org&#34;&gt;Village Zendo&lt;/a&gt; holds two weeklong Zen meditation retreats every summer, and one every winter. An essential part of Zen retreats is &lt;em&gt;interviews&lt;/em&gt; (&amp;ldquo;dokusan&amp;rdquo; or &amp;ldquo;daisan&amp;rdquo;) between students and teachers, where students can ask questions or present their understanding of &lt;a href=&#34;https://en.wikipedia.org/wiki/Koan&#34;&gt;koans&lt;/a&gt;. At a Village Zendo retreat, we want each student to see one teacher per day. We assign students to groups, and call each group out of the meditation hall to see a teacher at the start of a meditation period. Certain students serve as &lt;em&gt;jishas&lt;/em&gt;; they&amp;rsquo;re in charge of calling a group of students and escorting them to the door of the interview room. The students in the group line up, and enter the room one at a time for a short conversation with the teacher.&lt;/p&gt;
&lt;p&gt;Besides ensuring one interview per student per day, there are many other rules and goals. Dear reader, if you&amp;rsquo;re a programmer, you are probably already beginning to model this as an optimization problem. The objects in the system are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Students. Some of the students are jishas, and some have other jobs which affect the optimization problem.&lt;/li&gt;
&lt;li&gt;Teachers. This summer there were 8 teachers for 35 students.&lt;/li&gt;
&lt;li&gt;Rooms. This summer there were 2 interview rooms. Our abbot used his own room for interviews.&lt;/li&gt;
&lt;li&gt;Groups. Each student belongs to one group.&lt;/li&gt;
&lt;li&gt;Slots. There are time slots for interviews in the morning, afternoon, and evening meditation periods.&lt;/li&gt;
&lt;/ul&gt;
&lt;br/&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20240104_000096740003.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The challenge is to assign students to groups, then schedule jishas to bring groups to teachers during slots, so that all groups see a teacher once a day. Ideally, a group sees a variety of teachers during the week, and the workload is spread evenly among the teachers, and evenly among the jishas.&lt;/p&gt;
&lt;p&gt;For the last few years my fellow Zen student, the mathematician &lt;a href=&#34;https://www.artofmathematics.org/users/volker-ecke&#34;&gt;Volker Kenko Ecke&lt;/a&gt;, had been manually solving this problem for each retreat with a big spreadsheet. He assigned students to groups, groups to slots, etc. by hand, then wrote formulas to check that all the rules and goals were satisfied. It was impressive, but I felt the itch to automate it. This is the difference between mathematicians and programmers.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20240104_000096740005.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Translating the problem into SAT&lt;/h1&gt;

&lt;p&gt;Last summer I offered to write a program. This was not a problem for LLMs. I vaguely knew that a SAT solver was the natural tool. I chose Google&amp;rsquo;s open source &lt;a href=&#34;https://developers.google.com/optimization/cp/cp_solver&#34;&gt;CP-SAT solver&lt;/a&gt;, which solves &lt;em&gt;integer programming&lt;/em&gt; problems: optimization problems involving assigning integers to variables, constrained by simple formulas. It has a nice Python API.&lt;/p&gt;
&lt;p&gt;To translate the interview scheduling problem into Solverese, I made two big binary matrices. One is group assignments: the columns are groups, the students are rows, and there&amp;rsquo;s a 1 wherever a student is assigned to a group and 0 everywhere else.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;students&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;Student&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;last&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Davis&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;first&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Jesse&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;dharma_name&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Jiryu&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;Student&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;last&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Cohen&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;first&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Steven&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;dharma_name&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Jindai&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# ... and so on ...&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;MAX_STUDENTS_PER_GROUP&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;5&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;groups&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;Group&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;number&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;i&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;i&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;range&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ceil&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#204a87&#34;&gt;len&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;students&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;/&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;MAX_STUDENTS_PER_GROUP&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;))]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;from&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;ortools.sat.python&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;import&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;cp_model&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;cp_model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;CpModel&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Matrix of bools, each is true if a student is part of a group.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;group_assignments&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;st&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;):&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;new_bool_var&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;f&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;st&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;display_name&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt; group &lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;number&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;}&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;st&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;students&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;groups&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As you can see, you create a variable with &lt;code&gt;model.new_bool_var&lt;/code&gt;. Later I&amp;rsquo;ll show you how these are used in constraints like &amp;ldquo;each student is in exactly one group.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Besides the two-dimensional group assignment matrix, the other binary matrix has &lt;em&gt;four&lt;/em&gt; dimensions: teachers, jishas, groups, and slots. If a certain jisha takes a certain group to see a certain teacher at a certain time slot, then that cell of the matrix has a 1, otherwise 0. I called these combinations &lt;em&gt;shifts&lt;/em&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Matrix of bools, each is true if a teacher, jisha, and group is assigned&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# to a slot. Key is a Shift object, value is a variable to which the&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# solver assigns a value.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;shifts&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;Shift&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;j&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;display_name&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;):&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;new_bool_var&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;j&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;product&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;teachers&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;jishas&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;groups&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;slots&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Creating a schedule is equivalent to assigning 1s and 0s to the cells in these two matrices, in a way that satisfies all the constraints and optimizes some goal. So I started adding constraints.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20240104_000096740006.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Expressing constraints&lt;/h1&gt;

&lt;p&gt;First, a student can only be in one group. That means the &lt;em&gt;sum&lt;/em&gt; of group assignments along a student row is 1:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Each student is in 1 group.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;st&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;students&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;add&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#204a87&#34;&gt;sum&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;group_assignments&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;st&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)]&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;groups&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;And each group can only have MAX_STUDENTS_PER_GROUP members:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;groups&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Enforce MAX_STUDENTS_PER_GROUP.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;add&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#204a87&#34;&gt;sum&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;group_assignments&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;st&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)]&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;st&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;students&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;              &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;&amp;lt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;MAX_STUDENTS_PER_GROUP&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Giving interviews is tiring for teachers, so they should only have one shift per day. For each teacher/day combination, I create an integer variable with &lt;code&gt;model.new_int_var&lt;/code&gt;, constrain it to equal the number of shifts for that teacher on that day, and also constrain it to be less than or equal to one:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;days&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;list&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#204a87&#34;&gt;sorted&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#204a87&#34;&gt;set&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;day&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;slots&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;teachers&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;n_teacher_shifts&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;sum&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;v&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;v&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;shifts&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;items&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;teacher&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;d&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;days&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;shifts_today&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;new_int_var&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;add&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;shifts_today&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;sum&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#000&#34;&gt;v&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;v&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;shifts&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;items&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;teacher&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;and&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;slot&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;day&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;d&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;add&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;shifts_today&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;&amp;lt;=&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;There are a dozen other rules, and they&amp;rsquo;re all similarly easy to express. For example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A teacher can only see one group per slot.&lt;/li&gt;
&lt;li&gt;The abbot prefers to do interviews only in evening slots.&lt;/li&gt;
&lt;li&gt;The abbot has a particular jisha; other jishas are free-floating.&lt;/li&gt;
&lt;li&gt;Jishas should only do one shift per day.&lt;/li&gt;
&lt;li&gt;Each group should get one interview per day, and see each teacher only once during the retreat.&lt;/li&gt;
&lt;li&gt;Since there are two interview rooms, only two teachers can do interviews per slot (but the abbot uses his own room).&lt;/li&gt;
&lt;li&gt;Some students are &lt;em&gt;monitors&lt;/em&gt;. At least one monitor should be in the meditation hall at all times.&lt;/li&gt;
&lt;li&gt;&amp;hellip; and so on.&lt;/li&gt;
&lt;/ul&gt;
&lt;br/&gt;
&lt;p&gt;Once I banged the model into shape, I told the solver to find a solution:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;solver&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;cp_model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;CpSolver&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Determinism.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;solver&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;parameters&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;random_seed&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Stop trying after about 5 minutes - but use a virtual clock&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# that always stops after the same number of operations.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;solver&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;parameters&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;max_deterministic_time&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;300&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;solver&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;parameters&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;num_workers&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;status&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;solver&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;solve&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The &amp;ldquo;determinism&amp;rdquo; arguments ensure I get the same solution each time I run the same code, which helps me debug.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20250109_000410970004.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Refinements&lt;/h1&gt;

&lt;p&gt;Once the program basically worked, I generated a PDF of the schedule and the group assignments, and made some improvements. For example, could I reduce the jishas&amp;rsquo; workload? A jisha leaves the meditation hall for two reasons: if she leads a group of students to see a teacher, and also if her &lt;em&gt;own&lt;/em&gt; group is called to see a teacher. On past retreats, a jisha might lead Group 1 to Teacher A, while her own Group 2 was seeing Teacher B. This led to hectic improvisation; not very peaceful. I called these &amp;ldquo;jisha conflicts&amp;rdquo; and told my program to eliminate them. This was a bit complex. Recall that every &amp;ldquo;shift&amp;rdquo; is a combination of teacher, jisha, group, and slot. So for every shift S1, and every other shift S2 with the same slot but a different group, either S1&amp;rsquo;s jisha isn&amp;rsquo;t in S2&amp;rsquo;s group, or S2&amp;rsquo;s group isn&amp;rsquo;t doing interview during this slot.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Avoid jisha conflicts: when a jisha is serving one teacher while&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# their group is called for another. To prevent conflicts, if a&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# jisha j is assigned to a shift, it implies j isn&amp;#39;t in groups seen&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# by other teachers in that slot.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;shift&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;shifts&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;for&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;j&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;in&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;product&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;teachers&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;jishas&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;groups&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;):&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;if&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;==&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;shift&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;group&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;continue&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Another shift with the same time slot but different group.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;other_shift&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;shifts&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;Shift&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;t&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;j&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;display_name&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;shift&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;slot&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# If shift is true, then other_shift is false or shift&amp;#39;s jisha&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# isn&amp;#39;t in other_shift&amp;#39;s group.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;j_in_other_group&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;group_assignments&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;name2student&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;shift&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;jisha_name&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;],&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;g&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)]&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;model&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;add_bool_or&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;([&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#000&#34;&gt;j_in_other_group&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;negated&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(),&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            &lt;span style=&#34;color:#000&#34;&gt;other_shift&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;negated&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;])&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;only_enforce_if&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;shifts&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;shift&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;])&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This was slightly mind-bending to code, but it only took a few hours of thinking and reading the CP-SAT docs. It eliminated a historical source of stress for the jishas.&lt;/p&gt;
&lt;p&gt;I wondered, could I simplify jishas&amp;rsquo; lives even more? For each jisha I calculated the number of &lt;em&gt;extra trips&lt;/em&gt; she took out of the meditation hall: i.e., the number of times she led out a group of students besides her own. I told the model to minimize the sum of the squares of the extra trips, so it would try to both reduce the workload and spread it evenly. It&amp;rsquo;s &lt;a href=&#34;https://github.com/ajdavis/interview-schedule-anonymized/blob/7f76a4c09e8404e6fa98e12195ea35d09bda269e/summer-schedule.py#L407-L435&#34;&gt;about 20 lines of code&lt;/a&gt;, and it reduced the average jisha&amp;rsquo;s time out of the meditation hall from 3 trips to 0 or 1. This was another big improvement over past retreats.&lt;/p&gt;
&lt;p&gt;As I said, a SAT solver can&amp;rsquo;t optimize your meditation, but it can help create good conditions!&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20250109_000410970005.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Minimizing changes&lt;/h1&gt;

&lt;p&gt;I&amp;rsquo;ve been on dozens of Zen retreats, and I&amp;rsquo;ve been involved in organizing some of them. Constant change is the norm, just like Buddha taught. People join late, leave early, get sick. And maybe we realize partway through the retreat that we should change the program, add a new constraint or goal we didn&amp;rsquo;t think of until we put the schedule to use. Once my program had made a schedule, I knew it wouldn&amp;rsquo;t be &amp;ldquo;final.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;So, each time the program runs, it checks the time. If the retreat hasn&amp;rsquo;t started, it can make a new schedule from scratch, freely reassigning groups and shifts. Once the retreat starts, the groups are permanent, and the program must do its best to meet its goals without changing group assignments. Shifts are malleable so long as they&amp;rsquo;re in the future, but the program knows it can&amp;rsquo;t change the past. So if we suddenly decided some group of students must see the abbot, the program can satisfy that only by changing shifts starting &lt;em&gt;after&lt;/em&gt; the current time.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20250111_000410970009.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Retrospective&lt;/h1&gt;

&lt;p&gt;I&amp;rsquo;ve scheduled four weeklong retreats with this program, improving the code each time. (I put &lt;a href=&#34;https://github.com/ajdavis/interview-schedule-anonymized&#34;&gt;a version with participants&amp;rsquo; names redacted on GitHub&lt;/a&gt;.) I&amp;rsquo;ve learned what questions to ask the registrar, the abbot, and the retreat managers before the retreat begins, so I create a schedule that fits their requirements. I still have to personally edit and run the code—this is a script, not a software product—but it&amp;rsquo;s becoming routinized.&lt;/p&gt;
&lt;p&gt;The program generates a nicely formatted PDF of the schedule, which we can print out and post on the wall at the retreat center:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/schedule.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The PDF includes a timestamp and a fingerprint (a hash of all the variable assignments), to avoid confusion when there are multiple printed copies floating around among the retreat staff.&lt;/p&gt;
&lt;p&gt;The program also generates a PDF of group assignments, and a card with each student&amp;rsquo;s individual schedule. It also outputs a text file with statistics, like how many shifts each teacher and jisha has, so that the retreat managers and I can evaluate the quality of the solution.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20250109_000410900007.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The smartest decisions I made were the first ones: to encode the schedule as two boolean matrices and use a SAT solver. Now, I hardly ever have to think, &amp;ldquo;How do I solve this problem?&amp;rdquo; I only have to figure out, &amp;ldquo;How do I express this constraint or goal in terms of CP-SAT?&amp;rdquo; Furthermore, I spend little time checking the solution. If I code a new constraint, then I spend a few minutes checking that the latest schedule satisfies it as I expect, but after that I mostly trust the program. Even when I make a consequential change, like adding a constraint or removing a day, I&amp;rsquo;m confident the program will rearrange the puzzle pieces to fit. If I had tried to write my own scheduling algorithm, changes would be much harder and riskier.&lt;/p&gt;
&lt;p&gt;I was particularly satisfied last week, when a teacher got sick and canceled on the first day of retreat. With the old handwritten spreadsheet, I think it would&amp;rsquo;ve been hard to adapt the schedule, and there wouldn&amp;rsquo;t have been enough time to make a really good one. But now that it&amp;rsquo;s automated, I simply deleted the sick teacher and reran the program. It found no solution, so I relaxed a constraint&amp;mdash;now some teachers will have to take two shifts per day&amp;mdash;and voilà, a perfectly good new schedule popped out 5 minutes later.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/optimizing-a-meditation-retreat-with-a-sat-solver/ajdavis_20250111_000349610010.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images: My photos from Village Zendo retreats. All rights reserved.&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Aerial Silks in the Gunks</title>
      <link>https://emptysqua.re/blog/aerial-silks-millbrook/</link>
      <pubDate>Mon, 28 Jul 2025 15:10:02 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/aerial-silks-millbrook/</guid>
      
      
      
      
      <description>&lt;p&gt;July 26, 2025. &lt;a href=&#34;https://www.instagram.com/mntnbug/&#34;&gt;Kelsey Roman&lt;/a&gt; performing with aerial silks suspended from the &lt;a href=&#34;https://www.mountainproject.com/route/123342477/nectar-vector&#34;&gt;Nectar Vector&lt;/a&gt; climb on Millbrook, a cliff in the Gunks. Rock guide &lt;a href=&#34;https://www.advancedrockcraft.com/&#34;&gt;Dustin Portzline&lt;/a&gt; is above, managing the rigging and holding a parasol.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/aerial-silks-millbrook-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/aerial-silks-millbrook-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/aerial-silks-millbrook-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/aerial-silks-millbrook-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/aerial-silks-millbrook-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/aerial-silks-millbrook-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/aerial-silks-millbrook-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/aerial-silks-millbrook/aerial-silks-millbrook-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>RIP Sojourner</title>
      <link>https://emptysqua.re/blog/rip-sojourner/</link>
      <pubDate>Mon, 07 Jul 2025 22:16:48 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/rip-sojourner/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/rip-sojourner/ajdavis_20240109_1010.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Sojourner Truth Armstrong Davis, Jennifer&amp;rsquo;s and my Chinese dwarf hamster, died last night. She had lived with a cancerous affliction of the womb, a malady common to her kind, for several months. We had expected her demise soon after the first signs—hamsters rush headlong through all activities, and life itself—but she endured for a season, bearing her disease with the same insouciance she displayed in all circumstances.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://emptysqua.re/blog/sojourner/&#34;&gt;Soji joined our household a year and a half ago&lt;/a&gt;, leaving behind her infant siblings at PetCo to commence a solitary existence. &lt;a href=&#34;https://emptysqua.re/blog/rip-shirley/&#34;&gt;Experience had informed us that sisterly companionship harbored risk&lt;/a&gt;, so Soji&amp;rsquo;s house was a hermitage from the day of her arrival. Her society consisted chiefly of two humans, who held her in the deepest affection. Whether our fondness was reciprocated in equal measure &lt;a href=&#34;https://emptysqua.re/blog/which-beings-are-sentient/&#34;&gt;is a matter of speculation&lt;/a&gt;; yet she would settle contentedly into our hands, emitting her customary clicks and chirps, as if to signify approval.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/rip-sojourner/ajdavis_20230929_0239.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The original Sojourner Truth was born a few towns distant from us, and in her sixtieth year commenced a tour of the nation, speaking against slavery. Sojourner the hamster was also an outspoken creature. Most afternoons she stationed herself within her modest wooden dwelling, and from this redoubt delivered an oratory of squeaks, clucks, and chirps loud enough to be heard from the adjoining room.&lt;/p&gt;
&lt;p&gt;Unlike her namesake, Soji addressed an audience of none. No observer, however, would ever mistake her isolation for loneliness. She was happier than any hamster we had known before, for which the likeliest explanation was her assured seclusion. She was a connoisseur of aloneness. Assured that her territory was uncontested, and eternally unencumbered by any relative or rival, she spent her days in appreciation of her boundless solitude.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/rip-sojourner/ajdavis_20240109_0974.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>From Python Programmer to Distributed Systems Researcher in 10 Years Without a PhD</title>
      <link>https://emptysqua.re/blog/from-python-programmer-to-distributed-systems-researcher-in-10-years/</link>
      <pubDate>Mon, 30 Jun 2025 22:29:35 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/from-python-programmer-to-distributed-systems-researcher-in-10-years/</guid>
      
      
      
      
      <description>&lt;p&gt;Phil Eaton wrote &amp;ldquo;&lt;a href=&#34;https://notes.eatonphil.com/2025-02-15-from-web-developer-to-database-developer-in-10-years.html&#34;&gt;From Web Developer To Database Developer In 10 Years&lt;/a&gt;&amp;rdquo; recently. I was inspired to write my own version.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In high school, I wanted to be a modern dancer, with a company like &lt;a href=&#34;https://pilobolus.org/&#34;&gt;Pilobolus&lt;/a&gt; or &lt;a href=&#34;https://streb.org/company/&#34;&gt;Streb&lt;/a&gt;. But I also kind of wanted to be a programmer. Pixar&amp;rsquo;s Toy Story, the first 3D-animated feature film, was released when I was a high school sophomore. I was fascinated by this simulated world, revealed by simulated light to a simulated camera. How did they make it? I somehow learned (however I learned things before I had Internet access) that 3D graphics were rendered with linear algebra and C++. So I got books about both from my neighborhood library in Shaker Heights, Ohio. I don&amp;rsquo;t recall much about the linear algebra book, but the C++ book is a core memory: a giant glossy volume, with a copy of Borland Turbo C++ for DOS on a CD in a slip attached to the inside back cover. I installed it on the family computer, a Tandy with a 386 processor, which looked like this:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/from-python-programmer-to-distributed-systems-researcher-in-10-years/tandy.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;When I installed Turbo C++, the IDE looked like this:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/from-python-programmer-to-distributed-systems-researcher-in-10-years/turbo-c.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I taught myself basic C++, but on my own I didn&amp;rsquo;t manage to learn 3D graphics.&lt;/p&gt;
&lt;p&gt;Meanwhile, I reached my senior year of high school, and auditioned for the prestigious dance program at Ohio State University. I was admitted to the waitlist, but that was a superficial victory. I saw in that audition another man my age who was a true dancer. He was unselfconscious, emotional, natural, free. I would never dance like that. By the time I got a letter from OSU, saying I&amp;rsquo;d been accepted from the waitlist, I&amp;rsquo;d already decided to major in computer science and do 3D graphics for a living.&lt;/p&gt;
&lt;p&gt;(Is this what really happened? It&amp;rsquo;s been 28 years. This is how I remember and tell my life story now.)&lt;/p&gt;
&lt;p&gt;I got my CS degree from Oberlin College with a focus on 3D graphics. For some reason I didn&amp;rsquo;t apply to Pixar. Why didn&amp;rsquo;t you follow your dream, young Jesse? I ended up in Austin, Texas, working for a tiny company that made flight safety analysis software. At least I was doing 3D graphics: our software turned data from jets&amp;rsquo; &lt;a href=&#34;https://en.wikipedia.org/wiki/Quick_access_recorder&#34;&gt;flight recorders&lt;/a&gt; into 3D animations, so airlines could review near-accidents and adjust their procedures.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s fast-forward. I bounced around from &lt;a href=&#34;https://emptysqua.re/blog/yokoji-zmc-august-2019/&#34;&gt;a Zen monastery&lt;/a&gt; to &lt;a href=&#34;https://en.wikipedia.org/wiki/Amplify_(company)&#34;&gt;an NYC education startup&lt;/a&gt;, then freelanced for a few years, and joined MongoDB in 2011 (&amp;ldquo;10gen&amp;rdquo; at the time). I&amp;rsquo;d abandoned 3D graphics by now. I&amp;rsquo;d learned Python, and I used MongoDB at one of my freelance gigs—it was so much more fun than SQL, I thought the idea might have legs. MongoDB was unique in NYC at the time: the only company building technically demanding open-source software in the city. Google and Amazon hadn&amp;rsquo;t yet opened offices here. MongoDB hired me as a developer advocate promoting MongoDB to the Python community. They called me a &amp;ldquo;Python evangelist&amp;rdquo;—my mom loved the title, she said it sounded like I was a snake handler.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/from-python-programmer-to-distributed-systems-researcher-in-10-years/snake-handling.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/File:Handling_serpents_at_the_Pentecostal_Church_of_God._(Kentucky)_by_Russell_Lee._-_NARA_-_541335.jpg&#34;&gt;Handling serpents at the Pentecostal Church of God&lt;/a&gt;, by Russell Lee&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;I made myself a minor celebrity in the Python world, speaking at PyCon many years in a row and publishing a lot of popular Python content on this blog. But mostly I was an engineer. I worked on &lt;a href=&#34;https://pymongo.readthedocs.io/&#34;&gt;PyMongo&lt;/a&gt;, I created the async alternative Python driver &lt;a href=&#34;https://motor.readthedocs.io/&#34;&gt;Motor&lt;/a&gt;, and for a while I led development of our &lt;a href=&#34;https://mongoc.org/&#34;&gt;C&lt;/a&gt; and &lt;a href=&#34;https://www.mongodb.com/docs/languages/cpp/&#34;&gt;C++&lt;/a&gt; drivers. I was promoted a few times, to Staff Engineer. &lt;a href=&#34;https://emptysqua.re/blog/choosing-the-adventurous-route-video/&#34;&gt;After 7 years on the Drivers Team, I hit a plateau&lt;/a&gt;—my learning had slowed and I didn&amp;rsquo;t know how to get another promotion. So I decided to switch teams. I rotated through three MongoDB teams, trying each one for a month or two, until I settled on the Replication Team, which maintains MongoDB&amp;rsquo;s consensus protocol, deep in the core database server.&lt;/p&gt;
&lt;p&gt;The Replication Team transformed my career, because they read and wrote research papers about distributed systems. Before, I had deepened my knowledge by studying programming languages. Now, I was learning about much higher-level concepts: consensus, fault tolerance, causality, isolation and consistency. The team met regularly to read papers in our field, both new and old. I got excited and read more papers on my own, and joined &lt;a href=&#34;https://charap.co/category/reading-group/&#34;&gt;an external distributed systems reading group&lt;/a&gt;. I also met TLA+ for the first time. With two colleagues, I published &lt;a href=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/&#34;&gt;my first research paper&lt;/a&gt;, about testing the conformance between TLA+ specifications and implementations. (I&amp;rsquo;m still researching this topic, 5 years later.) I made a plan to read dozens of papers and a few textbooks; my boss and grandboss agreed that if I completed the curriculum I&amp;rsquo;d designed for myself, I&amp;rsquo;d be promoted to Senior Staff Engineer, and soon enough this actually happened.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/from-python-programmer-to-distributed-systems-researcher-in-10-years/cliff.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://emptysqua.re/blog/after-244-days-off/&#34;&gt;In 2022 I took most of a year off to climb&lt;/a&gt;, and moved to New Paltz, &lt;a href=&#34;https://portfolio.emptysqua.re/rock-climbing&#34;&gt;a world class climbing mecca&lt;/a&gt;. While I was away, I considered my reentry. How could I contribute best to MongoDB? And how could I keep my calendar clear of team standups and planning meetings and all the other cruft that would prevent me from running out to the cliff when the weather was good? MongoDB had a small research group, almost entirely in Australia at the time, led by &lt;a href=&#34;https://scholar.google.com/citations?user=9-arDNQAAAAJ&amp;amp;hl=en&#34;&gt;Michael Cahill&lt;/a&gt;. I asked to join it when I returned, working four days a week at four fifths of my previous salary, and Michael assented. I was gambling with my career: I only had a BA, I had only published one paper, and my primary mentor would be on the other side of the world. On the other hand, I could use the writing and speaking skills I&amp;rsquo;d learned as a &amp;ldquo;Python evangelist,&amp;rdquo; and devote myself to the distributed systems theory I enjoyed. I returned to MongoDB in December 2022 as a Senior Staff Research Engineer.&lt;/p&gt;
&lt;p&gt;How did I get this coveted position—the freedom of academia with an industrial salary? I didn&amp;rsquo;t have an advanced degree, but I had some advantages. My previous boss supported my decision to transfer. I&amp;rsquo;d been at MongoDB for 11 years, a long time at any company, especially a fast-growing one like ours. The one paper I&amp;rsquo;d published was good, and accepted by a top conference. I&amp;rsquo;d not only read a lot of papers, I&amp;rsquo;d spread my knowledge by leading discussions of them in reading groups. And I came to Michael with ideas he liked, such as &lt;a href=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/&#34;&gt;predictive auto-scaling&lt;/a&gt; or improving the consistency of &lt;a href=&#34;https://www.mongodb.com/docs/manual/core/read-preference/&#34;&gt;MongoDB&amp;rsquo;s secondary reads&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;So that&amp;rsquo;s the story: that&amp;rsquo;s how I progressed from Python programmer to distributed systems researcher in ten years, without a PhD. Since then, a lot has happened. Michael Cahill retired soon after I joined the research group, and I was unmentored for a year. Luckily I was able to hire a famous expert, &lt;a href=&#34;https://muratbuffalo.blogspot.com/&#34;&gt;Murat Demirbas&lt;/a&gt;, to lead &lt;a href=&#34;https://www.mongodb.com/company/research/distributed-systems-research-group&#34;&gt;MongoDB&amp;rsquo;s distributed systems research&lt;/a&gt;. I&amp;rsquo;m still finding my stride in my new career, but at least I have guidance.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Taming Google Scholar Alerts</title>
      <link>https://emptysqua.re/blog/taming-google-scholar-alerts/</link>
      <pubDate>Mon, 23 Jun 2025 17:09:48 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/taming-google-scholar-alerts/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/taming-google-scholar-alerts/edward-bird.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;A grayscale ink drawing of a bearded old man writing with fountain pen and ink pot.&#34; title=&#34;&#34;&gt;&lt;p&gt;Since I joined &lt;a href=&#34;https://www.mongodb.com/company/research/distributed-systems-research-group&#34;&gt;MongoDB&amp;rsquo;s Distributed Systems Research Group&lt;/a&gt; a few years ago, I&amp;rsquo;ve relied on Google Scholar alerts to tell me when an author in my specialty publishes a new paper, or (much more often) when a new paper cites an author I&amp;rsquo;m interested in. Google Scholar can send me email alerts, but their frequency isn&amp;rsquo;t configurable: daily only. And if the same new paper is included in multiple alerts (e.g., it cites several authors I follow), Google Scholar won&amp;rsquo;t deduplicate it. It&amp;rsquo;s common for me to get a half-dozen emails in the morning about overlapping sets of papers.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/ajdavis/google-scholar-alert-summary/&#34;&gt;I&amp;rsquo;ve made a little Google Apps Script automation that solves these problems for me&lt;/a&gt;. Now, I&amp;rsquo;ve configured a Gmail filter so my daily Google Scholar alerts skip my inbox and don&amp;rsquo;t bother me. Once a week, my script summarizes and deduplicates all the alerts, then deletes them from my Gmail archive. This turned a daily chore of cleaning up my inbox into a much easier weekly chore: skimming the titles of this week&amp;rsquo;s papers, in a single email.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;a href=&#34;https://www.artic.edu/artworks/82683/bearded-scholar-writing&#34;&gt;Image by Edward Bird&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Gunks, June 2025</title>
      <link>https://emptysqua.re/blog/gunks-june-2025/</link>
      <pubDate>Mon, 23 Jun 2025 10:13:58 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/gunks-june-2025/</guid>
      
      
      
      
      <description>&lt;p&gt;Climbing in the Gunks on June 21 and 22, 2025.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-june-2025/9M9A1167.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-june-2025/9M9A1173.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-june-2025/9M9A1230.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-june-2025/9M9A1264.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-june-2025/9M9A1266.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-june-2025/9M9A1342.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-june-2025/9M9A1347.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-june-2025/9M9A1358.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Genuine Climbing retreat for women with cancer</title>
      <link>https://emptysqua.re/blog/genuine-climbing-june-2025/</link>
      <pubDate>Sun, 15 Jun 2025 07:53:34 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/genuine-climbing-june-2025/</guid>
      
      
      
      
      <description>&lt;p&gt;June 2025. A &lt;a href=&#34;https://genuineclimbing.org/climbing-with-cancer-2/&#34;&gt;climbing retreat in the Gunks for women with cancer&lt;/a&gt; led by Anyssa Lucena of Genuine Climbing. Most of the women had never climbed outside, or never climbed at all. Their awe of the rock reminded me of when I fell in love with climbing, years ago. &lt;a href=&#34;https://www.ericbates.com/&#34;&gt;Eric Bates&lt;/a&gt; and a small crew were shooting a documentary of the retreat. He made &lt;a href=&#34;https://www.youtube.com/watch?v=BswG4bnk6To&#34;&gt;Five Dot Fun&lt;/a&gt;, one of my favorites; I was excited to meet him and watch him work.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-09.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-10.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-11.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-12.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-13.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-14.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-15.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/genuine-climbing-june-2025/genuine-climbing-16.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  


&lt;p&gt;Shot on a mix of digital and Portra 400.&lt;/p&gt;
&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Opus 40</title>
      <link>https://emptysqua.re/blog/opus-40/</link>
      <pubDate>Sat, 14 Jun 2025 10:20:00 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/opus-40/</guid>
      
      
      
      
      <description>&lt;p&gt;Opus 40, a sculpture park in Saugerties, New York, hand-built by Harvey Fite over 37 years from local bluestone.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/opus-40/opus-40-1.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/opus-40/opus-40-2.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/opus-40/opus-40-3.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/opus-40/opus-40-4.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/opus-40/opus-40-5.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Can We Rely On Timers For Distributed Algorithms?</title>
      <link>https://emptysqua.re/blog/timers-distributed-algorithms/</link>
      <pubDate>Thu, 05 Jun 2025 10:45:23 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/timers-distributed-algorithms/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/clock3.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Distributed systems people are suspicious of timers. Whenever we discuss an algorithm that requires two servers to measure the same duration, the first question is always, &amp;ldquo;Can we rely on timers?&amp;rdquo; I think the answer is &amp;ldquo;yes&amp;rdquo;, as long as you add a safety margin of a little over 1/500&lt;sup&gt;th&lt;/sup&gt;.&lt;/p&gt;


&lt;h1&gt;Example: Leader Leases Rely on Timers&lt;/h1&gt;

&lt;p&gt;Leader-based consensus protocols like Raft try to elect one leader at a time, but it&amp;rsquo;s possible to have multiple leaders for a short period. (In theory, in a Raft group of 2&lt;em&gt;f&lt;/em&gt;+1 servers, as many as &lt;em&gt;f&lt;/em&gt; can be leaders at once!) In this situation, you risk violating &lt;a href=&#34;https://jepsen.io/consistency/models/read-your-writes&#34;&gt;read-your-writes&lt;/a&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A client updates some data on the current (highest-term) leader.&lt;/li&gt;
&lt;li&gt;That leader majority-replicates the change and acknowledges it to the client.&lt;/li&gt;
&lt;li&gt;The same client reads stale data from an old leader that hasn&amp;rsquo;t stepped down.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;A Raft leader must always suspect that it&amp;rsquo;s been deposed, so it acts paranoid: for each query, it checks with a majority to confirm it&amp;rsquo;s still in charge (&lt;a href=&#34;https://raft.github.io/raft.pdf&#34;&gt;Raft paper&lt;/a&gt; §8). This guarantees read-your-writes at a high cost in communication latency.&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;leader lease&lt;/em&gt; guarantees a single leader, but it relies on timers. In &lt;a href=&#34;https://github.com/ongardie/dissertation&#34;&gt;Diego Ongaro&amp;rsquo;s thesis&lt;/a&gt;, he proposes a simple lease mechanism for Raft: The leader starts a timer at time &lt;em&gt;t&lt;/em&gt;, and sends heartbeat messages to all its followers. Once a majority has responded, the leader knows they won&amp;rsquo;t vote for another leader until &lt;em&gt;t&lt;/em&gt; + &lt;em&gt;election timeout&lt;/em&gt; * &lt;em&gt;ε&lt;/em&gt;, where &lt;em&gt;ε&lt;/em&gt; is the maximum rate of clock drift. Here&amp;rsquo;s Figure 6.3 from Ongaro&amp;rsquo;s thesis:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/raft-lease.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;At MongoDB we made a more sophisticated lease algorithm that improves availability. Stay tuned for our research paper. But just like Ongaro&amp;rsquo;s, our algorithm depends on timers.&lt;/p&gt;
&lt;h1&gt;Why We Distrust Timers&lt;/h1&gt;

&lt;p&gt;In the classic &lt;a href=&#34;https://citeseerx.ist.psu.edu/document?repid=rep1&amp;amp;type=pdf&amp;amp;doi=efc1ff91bc301c3d6344cb308b4f619914a0e871&#34;&gt;&lt;em&gt;asynchronous system model&lt;/em&gt;&lt;/a&gt;, servers have no clocks and delays are unbounded. &lt;a href=&#34;https://www.the-paper-trail.org/post/2008-08-13-a-brief-tour-of-flp-impossibility/&#34;&gt;&lt;em&gt;FLP impossibility&lt;/em&gt;&lt;/a&gt; says that in this model, no consensus algorithm is guaranteed to make progress. I believe this means no protocol can both 1) always eventually elect one leader and 2) prevent multiple concurrent leaders. This is what makes Raft leaders paranoid.&lt;/p&gt;
&lt;p&gt;But if we relax the model slightly—allowing bounded timer inaccuracy—we can design more efficient protocols. In the &lt;a href=&#34;https://www.computer.org/csdl/journal/td/1999/06/l0642/13rRUwbs2fY&#34;&gt;&lt;em&gt;timed asynchronous system model&lt;/em&gt;&lt;/a&gt;, servers use timers with bounded drift. This is, as I&amp;rsquo;ll argue below, a more realistic model, and it enables &lt;em&gt;leader leases&lt;/em&gt;: a majority promises not to elect another leader for a duration. The leaseholder can serve reads without checking with peers.&lt;/p&gt;
&lt;p&gt;At &lt;a href=&#34;https://www.mongodb.com/company/research/distributed-systems-research-group&#34;&gt;MongoDB Distributed Systems Research Group&lt;/a&gt;, we’re developing lease protocols for Raft and MongoDB. So I spent a few days researching this question: how reliable are timers?&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/clock1.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Timer uncertainty&lt;/h1&gt;

&lt;p&gt;With leader leases, the leader starts a timer for some duration &lt;em&gt;d&lt;/em&gt;, and sends a message to its followers telling them to start timers for the same duration. Leases guarantee consistency so long as the leader believes its timer expires &lt;em&gt;before&lt;/em&gt; any of its followers believe theirs do. I formulated this as the &lt;strong&gt;Timer Rule:&lt;/strong&gt;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;If server &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; starts a timer &lt;em&gt;t&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt;,&lt;br&gt;
then sends a message to server &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;,&lt;br&gt;
which receives the message and starts a timer &lt;em&gt;t&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;,&lt;br&gt;
and &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; thinks &lt;em&gt;t&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; is ≥ &lt;em&gt;d&lt;/em&gt; * &lt;em&gt;ε&lt;/em&gt; old,&lt;br&gt;
then &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; thinks &lt;em&gt;t&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; is ≥ &lt;em&gt;d&lt;/em&gt; old.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;We should make the safety margin &lt;em&gt;ε&lt;/em&gt; large enough to guarantee this, but not unnecessarily large (which would hurt availability). Here are some sources of uncertainty:&lt;/p&gt;
&lt;h2&gt;Clock frequency error, a.k.a. &amp;ldquo;drift&amp;rdquo;&lt;/h2&gt;

&lt;p&gt;Any server&amp;rsquo;s timer depends on a quartz oscillator (&amp;ldquo;XO&amp;rdquo;) on its motherboard, even if the server is a VM. All XOs are manufactured with some inaccuracy, and their speed is affected by age and temperature. XOs slow down if they&amp;rsquo;re too cold &lt;strong&gt;or&lt;/strong&gt; too hot! Cloud providers control temperature fairly well in their data centers and they swap out components periodically, but not all servers are so well cared for.&lt;/p&gt;
&lt;p&gt;NTP clients (ntpd or the more modern chronyd) measure oscillator drift over time (even across reboots) and compensate for it, &lt;a href=&#34;https://www.eecis.udel.edu/~mills/ntp/html/discipline.html&#34;&gt;&lt;em&gt;disciplining&lt;/em&gt; the oscillator to near-perfection&lt;/a&gt;. In the last couple years, &lt;a href=&#34;https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-time-sync-service-microsecond-accurate-time/&#34;&gt;cloud providers have achieved clock synchronization within tens of microseconds&lt;/a&gt;, implying minuscule clock drift. However, for maximum safety let&amp;rsquo;s assume that NTP isn&amp;rsquo;t functioning at all. The servers&amp;rsquo; timers are undisciplined and freely drifting.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s weirdly hard to find the model numbers for XOs commonly used in servers, but I think I found some examples (&lt;a href=&#34;https://www.ti.com/document-viewer/LMK3H0102/datasheet&#34;&gt;1&lt;/a&gt;, &lt;a href=&#34;https://abracon.com/Oscillators/ASEseries.pdf&#34;&gt;2&lt;/a&gt;, &lt;a href=&#34;https://ecsxtal.com/store/pdf/ECS-2520MV.pdf&#34;&gt;3&lt;/a&gt;). They usually advertise a maximum drift of ±50 ppm (parts per million) over a vast range of operating temperatures. (See &lt;a href=&#34;https://en.wikipedia.org/wiki/Crystal_oscillator#Temperature_effects&#34;&gt;this formula&lt;/a&gt; for the accuracy of XOs in general.) This matches the &lt;a href=&#34;https://github.com/aws/clock-bound/blob/main/clock-bound-d/README.md#chrony-configuration&#34;&gt;50 ppm value&lt;/a&gt; that the AWS ClockBound engineers consider worst-case.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/clock5.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Clock slewing&lt;/h2&gt;

&lt;p&gt;When ntpd or chronyd make routine clock adjustments, they &lt;em&gt;slew&lt;/em&gt; it gradually to the correct time instead of &lt;em&gt;stepping&lt;/em&gt; it discontinuously. The max slew rate for ntpd is 500 ppm hardcoded, for chronyd it&amp;rsquo;s 1000 ppm configurable.&lt;/p&gt;
&lt;p&gt;Every few years the authorities announce a &lt;a href=&#34;https://en.wikipedia.org/wiki/Leap_second&#34;&gt;leap second&lt;/a&gt; to account for recent unpredictable changes in the earth&amp;rsquo;s rotation. It is announced 6 months ahead. Leap seconds &lt;a href=&#34;https://engineering.fb.com/2022/07/25/production-engineering/its-time-to-leave-the-leap-second-in-the-past/&#34;&gt;will probably be canceled starting in 2035&lt;/a&gt;, in favor of a leap &lt;em&gt;hour&lt;/em&gt; every few thousand years. If there are any leap seconds before 2035, different time providers may unfortunately implement them differently. AWS and GCP are likely to slew the clock for 24 hours (a 7 ppm slew rate) and Azure &lt;a href=&#34;https://www.libertysys.com.au/2024/03/mister-cloud-part-3/#not-leap-smeared&#34;&gt;may ignore the leap second&lt;/a&gt;, introducing a 7 ppm discrepancy between timers in Azure and non-Azure during the 24 hours. But we can ignore leap seconds; we only care about the max slew rate, regardless of the reason for slewing.&lt;/p&gt;
&lt;p&gt;Bottom line: the max slew rate is 1000 ppm.&lt;/p&gt;
&lt;h2&gt;VM interruptions&lt;/h2&gt;

&lt;p&gt;Hypervisors have &lt;em&gt;paravirtual clocks&lt;/em&gt;: the VM&amp;rsquo;s clock is a passthrough to the host clock, so when the VM wakes from a pause its clock is still up to date. AWS and GCP use the KVM hypervisor, which has &lt;a href=&#34;https://docs.redhat.com/en/documentation/red_hat_enterprise_linux/7/html/virtualization_deployment_and_administration_guide/chap-kvm_guest_timing_management&#34;&gt;kvm-clock&lt;/a&gt;. Azure uses &lt;a href=&#34;https://learn.microsoft.com/en-us/azure/virtual-machines/linux/time-sync#:~:text=Virtual%20machine%20interactions%20with%20the,in%20Linux%20VMs%20to%20compensate&#34;&gt;Hyper-V, which has VMICTimeSync&lt;/a&gt;. Xen is going out of style, but it has something poorly documented called pvclock. If a VM pauses and resumes, its paravirtual clock won&amp;rsquo;t be affected. If it&amp;rsquo;s &lt;a href=&#34;https://learn.microsoft.com/en-us/azure/virtual-machines/maintenance-and-updates#maintenance-that-doesnt-require-a-reboot&#34;&gt;live-migrated to a different physical host&lt;/a&gt;, then presumably the accuracy of its timers across the migration depends on the clock synchronization between the source host and target host. Major cloud providers now sync their clocks to within a millisecond, usually much less.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html&#34;&gt;Martin Kleppmann warns&lt;/a&gt; about checking a timer and then acting upon the result: your VM could be paused indefinitely between the check and the action. But his article is about mutual exclusion with a lease, and we&amp;rsquo;re just trying to guarantee read-your-writes. For us, the server only needs to be a leaseholder &lt;em&gt;sometime&lt;/em&gt; between receiving a request and replying.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/clock6.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;So what&amp;rsquo;s ε?&lt;/h1&gt;

&lt;p&gt;Ideally, clock slewing cancels out frequency error to approximately zero. After all, the NTP client slews the clock in order to get it back in sync. But let&amp;rsquo;s pessimistically assume that clock slewing &lt;em&gt;adds&lt;/em&gt; to frequency error. The sum of the errors above is 50 ppm (max frequency error) + 1000 ppm (chronyd&amp;rsquo;s max slew rate) = 1050 ppm. If two servers&amp;rsquo; clocks are drifting apart as fast as possible, that&amp;rsquo;s 2100 ppm, or a little over 1/500&lt;sup&gt;th&lt;/sup&gt;. So if we set &lt;em&gt;ε&lt;/em&gt; = 1.0021, we&amp;rsquo;ll definitely obey the Timer Rule above. For example, if &lt;em&gt;election timeout&lt;/em&gt; is 5 seconds (MongoDB&amp;rsquo;s default in &lt;a href=&#34;https://www.mongodb.com/products/platform/atlas-database&#34;&gt;Atlas&lt;/a&gt;), this means waiting an extra 11 ms to be sure.&lt;/p&gt;
&lt;h1&gt;How important is ε?&lt;/h1&gt;

&lt;p&gt;We could probably get away with ignoring drift, and set &lt;em&gt;ε&lt;/em&gt; = 1, and still never violate the Timer Rule. Recall the scenario I care about: &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; starts a timer and sends a message to &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;, which starts its own timer. &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;&amp;rsquo;s timer mustn&amp;rsquo;t expire before &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt;&amp;rsquo;s. There&amp;rsquo;s some builtin safety already, because &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;&amp;rsquo;s timer will &lt;em&gt;start&lt;/em&gt; after &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt;&amp;rsquo;s. The delay depends on network latency and processing time on each server; it will probably be much larger than timer inaccuracy.&lt;/p&gt;
&lt;p&gt;If we rely on leases for consistency, then even if &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; thinks its timer expires too soon, it&amp;rsquo;s still hard to observe a consistency violation. There&amp;rsquo;s only a short time while &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; thinks it still has a lease, but &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; thinks &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt;&amp;rsquo;s lease expired. During this window all the following must happen:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; runs for election, and wins.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; receives a write command, majority-replicates it, and acknowledges it to the client.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; receives a query for some data that was just overwritten on &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt;, thinking it still has a lease, runs the query on stale data and replies.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This sequence would violate read-your-writes. It&amp;rsquo;s hard to imagine all those events in the milliseconds or microseconds between the two timers&amp;rsquo; expirations. It seems more likely to me that the Timer Rule is violated, not because ε is a smidgen too small, but because some misconfiguration makes a timer wildly inaccurate.&lt;/p&gt;
&lt;p&gt;In MongoDB&amp;rsquo;s version of Raft, the violation is even less likely, because the MongoClient &lt;a href=&#34;https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.md#using-electionid-and-setversion-to-detect-stale-primaries&#34;&gt;will know that &lt;em&gt;S&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; has been deposed&lt;/a&gt;. You have to use multiple MongoClients, or restart your client application, to lose this information and observe the consistency violation. See &lt;a href=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/&#34;&gt;my causal consistency article&lt;/a&gt; for a technique to handle this case.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/clock4.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;In summary, a user would have to be very unlucky to observe a consistency violation with leases: they&amp;rsquo;d need a Raft group with two leaders due to a network partition, but the user can talk to both sides. They&amp;rsquo;d have to use two clients (to disable MongoClient&amp;rsquo;s deposed leader check), write to the new leader, and quickly read the overwritten data from the old leader. The old leader would have to somehow fail to step down, while the new leader won election surprisingly quickly. And finally, the lease mechanism would have to fail because clock frequency error was worse than ε.&lt;/p&gt;
&lt;p&gt;In meta-summary, I think distributed systems implementors should rely on timers more. Whenever I discuss leases with engineers, the first question is, &amp;ldquo;Can we rely on timers?&amp;rdquo; It&amp;rsquo;s good to ask this question, but sometimes we have to take &amp;ldquo;yes&amp;rdquo; for an answer.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/timers-distributed-algorithms/clock8.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Images: &lt;a href=&#34;https://wellcomecollection.org/&#34;&gt;Wellcome Collection&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Conformance Checking at MongoDB: Testing That Our Code Matches Our TLA&#43; Specs</title>
      <link>https://emptysqua.re/blog/mongodb-conformance-checking/</link>
      <pubDate>Tue, 03 Jun 2025 09:03:01 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/mongodb-conformance-checking/</guid>
      
      
      
      
      <description>&lt;p&gt;&lt;a href=&#34;https://www.mongodb.com/blog/post/engineering/conformance-checking-at-mongodb-testing-our-code-matches-our-tla-specs&#34;&gt;Cross-posted from the MongoDB engineering blog&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;At MongoDB, we design a lot of distributed algorithms—algorithms with lots of concurrency and complexity, and dire consequences for mistakes. We formally specify some of the scariest algorithms in TLA+, to check that they behave correctly in every scenario. But how do we know that our implementations conform to our specs? And how do we keep them in sync as the implementation evolves?&lt;/p&gt;
&lt;p&gt;This problem is called &lt;em&gt;conformance checking&lt;/em&gt;. In 2020, my colleagues and I experimented with two MongoDB products, to see if we could test their fidelity to our TLA+ specs. Here&amp;rsquo;s a video of my presentation on this topic at the VLDB conference. (It&amp;rsquo;ll be obvious to you that I recorded it from my New York apartment in deep Covid lockdown.) Below, I write about our experience with conformance checking from 2025&amp;rsquo;s perspective. I&amp;rsquo;ll tell you what worked for us in 2020 and what didn&amp;rsquo;t, and what developments there have been in the field in the five years since our paper.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/IIGzXX72weQ?si=_z2r0Y5W0S-Igob4&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;


&lt;h1&gt;Agile modelling&lt;/h1&gt;

&lt;p&gt;Our conformance-checking project was born when I read &lt;a href=&#34;https://arxiv.org/abs/1111.2826&#34;&gt;a paper&lt;/a&gt; from 2011—&amp;ldquo;Concurrent Development of Model and Implementation&amp;rdquo;—which described a software methodology called &lt;em&gt;eXtreme Modelling&lt;/em&gt;. The authors argued that there&amp;rsquo;s a better way to use languages like TLA+, and I was convinced. They advocated a combination of agile development and rigorous formal specification:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Multiple specifications model aspects of the system.&lt;/li&gt;
&lt;li&gt;Specifications are written just prior to the implementation.&lt;/li&gt;
&lt;li&gt;Specifications evolve with the implementation.&lt;/li&gt;
&lt;li&gt;Tests are generated from the model, and/or trace-checking verifies that test traces are legal in the specification.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;I was excited about this vision. Too often, an engineer tries to write one huge TLA+ spec for the whole system. It&amp;rsquo;s too complex and detailed, so it&amp;rsquo;s not much easier to understand than the implementation code, and state-space explosion dooms model checking. The author abandons the spec and concludes that TLA+ is impractical. In the eXtreme Modelling style, a big system is modeled by a collection of small specs, each focusing on an aspect of the whole. This was the direction MongoDB was already going, and it seemed right to me.&lt;/p&gt;
&lt;p&gt;In eXtreme Modelling, the &lt;em&gt;conformance&lt;/em&gt; of the spec and implementation is continuously tested. The authors propose two conformance checking techniques. To understand these, let&amp;rsquo;s consider what a TLA+ spec is: it&amp;rsquo;s a description of an algorithm as a &lt;em&gt;state machine&lt;/em&gt;. The state machine has a set of variables, and each &lt;em&gt;state&lt;/em&gt; is an assignment of specific values to those variables. The state machine also has a set of allowed &lt;em&gt;actions&lt;/em&gt;, which are transitions from one state to the next state. You can make a &lt;em&gt;state graph&lt;/em&gt; by drawing states as nodes and allowed actions as edges. A &lt;em&gt;behavior&lt;/em&gt; is any path through the graph.&lt;/p&gt;
&lt;p&gt;This diagram shows the whole state graph for some very simple imaginary spec. One of the spec&amp;rsquo;s behaviors is highlighted in &lt;span style=&#34;color: #2f9e44; font-weight: bold&#34;&gt;green&lt;/span&gt;.&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/state-space.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/state-space.excalidraw.png&#34;&gt;
&lt;/picture&gt;
&lt;/div&gt;
&lt;p&gt;The spec has a set of behaviors &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;spec&lt;/sub&gt;, and the implementation has a set of behaviors &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;impl&lt;/sub&gt;. An implementation &lt;em&gt;refines&lt;/em&gt; a spec if &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;impl&lt;/sub&gt; ⊂ &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;spec&lt;/sub&gt;. If the converse is also true, if &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;spec&lt;/sub&gt; ⊂ &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;impl&lt;/sub&gt;, then this is called &lt;em&gt;bisimulation,&lt;/em&gt; and it&amp;rsquo;s a nice property to have, though not always necessary for a correctly implemented system. You can test each direction:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Test-case generation&lt;/strong&gt;: For every behavior in &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;spec&lt;/sub&gt;, generate a test case that forces the implementation to follow the same sequence of transitions. If there&amp;rsquo;s a spec behavior the implementation can&amp;rsquo;t follow, then &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;spec&lt;/sub&gt; ⊄ &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;impl&lt;/sub&gt;, and the test fails.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trace-checking&lt;/strong&gt;: For every behavior in &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;impl&lt;/sub&gt;, generate a &lt;em&gt;trace&lt;/em&gt;: a log file that records the implementation&amp;rsquo;s state transitions, including all implementation variables that match spec variables. If the behavior recorded in the trace isn&amp;rsquo;t allowed by the spec, then &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;impl&lt;/sub&gt; ⊄ &lt;em&gt;B&lt;/em&gt;&lt;sub&gt;spec&lt;/sub&gt; and the test fails.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Here are the state spaces of a spec and an implementation. Non-conforming behaviors are highlighted in &lt;span style=&#34;color: #e03131; font-weight: bold&#34;&gt;red&lt;/span&gt;:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/state-spaces.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/state-spaces.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Both techniques can be hard, of course. For test-case generation, you must somehow control every decision the implementation makes, squash all nondeterminism, and force it to follow a specific behavior. If the spec&amp;rsquo;s state space is huge, you have to generate a huge number of tests, or choose an incomplete sample.&lt;/p&gt;
&lt;p&gt;Trace-checking, on the other hand, requires you to somehow map the implementation&amp;rsquo;s state back to the spec&amp;rsquo;s, and log a snapshot of the system state each time it changes—this is really hard with multithreaded programs and distributed systems. And you need to make the implementation explore a variety of behaviors, via fault-injection and stress-testing and so on. Completeness is usually impossible.&lt;/p&gt;
&lt;p&gt;We found academic papers that demonstrated both techniques on little example applications, but we hadn’t seen them tried on production-scale systems like ours. I wanted to see how well they work, and what it would take to make them practical. I recruited my colleagues &lt;a href=&#34;https://www.linkedin.com/in/judahschvimer/&#34;&gt;Judah Schvimer&lt;/a&gt; and &lt;a href=&#34;https://www.linkedin.com/in/maxhirschhorn/&#34;&gt;Max Hirschhorn&lt;/a&gt; to try it with me. Judah and I tried trace-checking the MongoDB server (in the next section), and Max tried test-case generation with MongoDB Mobile SDK (the remainder of this article).&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/two-techniques.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/two-techniques.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;h1&gt;Trace-checking the MongoDB server&lt;/h1&gt;

&lt;p&gt;For the trace-checking experiment, the first step Judah and I took was to choose a TLA+ spec. MongoDB engineers had already written and model-checked a handful of specs that model different aspects of the MongoDB server (see &lt;a href=&#34;https://www.youtube.com/watch?v=-eAktIBUhHA&#34;&gt;this presentation&lt;/a&gt; and &lt;a href=&#34;https://www.youtube.com/watch?v=x9zSynTfLDE&#34;&gt;this one&lt;/a&gt;). We chose &lt;a href=&#34;https://github.com/mongodb/mongo/blob/master/src/mongo/tla_plus/Replication/RaftMongo/RaftMongo.tla&#34;&gt;RaftMongo.tla&lt;/a&gt;, which focuses on how servers learn the &lt;em&gt;commit point&lt;/em&gt;, which I&amp;rsquo;ll explain now.&lt;/p&gt;
&lt;p&gt;MongoDB is typically deployed as a &lt;em&gt;replica set&lt;/em&gt; of cooperating servers, usually three of them. They achieve consensus with a &lt;a href=&#34;https://www.usenix.org/conference/nsdi21/presentation/zhou&#34;&gt;Raft-like protocol&lt;/a&gt;. First, they elect one server as the leader. Clients send all writes to the leader, which appends them to its &lt;em&gt;log&lt;/em&gt; along with a monotonically increasing logical timestamp. Followers replicate the leader&amp;rsquo;s log asynchronously, and they tell the leader how up-to-date they are. The leader keeps track of the &lt;em&gt;commit point&lt;/em&gt;—the logical timestamp of the newest majority-replicated write. All writes up to and including the commit point are &lt;em&gt;committed&lt;/em&gt;, all the writes after it are not. The commit point must be correctly tracked even when leaders and followers crash, messages are lost, a new leader is elected, uncommitted writes are rolled back, and so on.&lt;/p&gt;
&lt;p&gt;RaftMongo.tla models this protocol, and it checks two invariants: A &lt;em&gt;safety&lt;/em&gt; property, which says that no committed write is ever lost, and a &lt;em&gt;liveness&lt;/em&gt; property, which says that all servers eventually learn the newest commit point.&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/commit-point.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/commit-point.excalidraw.png&#34;&gt;
&lt;/picture&gt;
&lt;/div&gt;
&lt;p&gt;Judah and I wanted to test that MongoDB&amp;rsquo;s C++ implementation matched our TLA+ spec, using trace-checking. Here are the steps:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Run randomized tests of the implementation.&lt;/li&gt;
&lt;li&gt;Collect execution traces.&lt;/li&gt;
&lt;li&gt;Translate the execution traces into TLA+.&lt;/li&gt;
&lt;li&gt;Check the trace is permitted by the spec.&lt;/li&gt;
&lt;/ol&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/pipeline.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/pipeline.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;The MongoDB server team has hundreds of integration tests handwritten in JavaScript, from which we chose about 300 for this experiment. We also have randomized tests; we chose one called the &amp;ldquo;rollback fuzzer&amp;rdquo; which does random CRUD operations while randomly creating and healing network partitions, causing uncommitted writes to be logged and rolled back.&lt;/p&gt;
&lt;p&gt;We added tracing code to the MongoDB server and ran each test with a three-node replica set. Since all server processes ran on one machine and communicated over localhost, we didn&amp;rsquo;t worry about clock synchronization: we just merged the three logs, sorting by timestamp. We wrote a Python script to read the combined log and convert it into a giant TLA+ spec named Trace.tla with a sequence of states for the whole three-server system. Trace.tla asserted only one property: &amp;ldquo;This behavior conforms to RaftMongo.tla.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s some more detail about the Python script. At each moment during the test, the system has some state &lt;em&gt;V&lt;/em&gt;, which is the values of the state variables for each node. The script tries to reconstruct all the changes to &lt;em&gt;V&lt;/em&gt; and record them in Trace.tla. It begins by setting &lt;em&gt;V&lt;/em&gt; to a hardcoded initial state &lt;em&gt;V&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt;, and outputs it as the first state of the sequence:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-tla+&#34; data-lang=&#34;tla+&#34;&gt;\* Each TLA+ tuple is
\* &amp;lt;&amp;lt;action, committedEntries, currentTerm, log, role, commitPoint,
\*   serverLogLocation&amp;gt;&amp;gt;
\* We know the first state: all nodes are followers with empty logs.
Trace == &amp;lt;&amp;lt;
  &amp;lt;&amp;lt;&amp;#34;Init&amp;#34;,                               \* action name
    &amp;lt;&amp;lt;&amp;#34;Follower&amp;#34;,&amp;#34;Follower&amp;#34;,&amp;#34;Follower&amp;#34;&amp;gt;&amp;gt;, \* role per node
    &amp;lt;&amp;lt;1, 1, 1&amp;gt;&amp;gt;,                          \* commitPoint per node
    &amp;lt;&amp;lt;&amp;lt;&amp;lt;...&amp;gt;&amp;gt;,&amp;lt;&amp;lt;...&amp;gt;&amp;gt;,&amp;lt;&amp;lt;...&amp;gt;&amp;gt;&amp;gt;&amp;gt;,          \* log per node
    &amp;#34;&amp;#34;&amp;gt;&amp;gt;,                                 \* trace log location (empty)
\* ... more states will follow ...
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The script reads events from the combined log and updates &lt;em&gt;V&lt;/em&gt;. Here&amp;rsquo;s an example where Node 1 was the leader in state &lt;em&gt;V&lt;/em&gt;&lt;sub&gt;i&lt;/sub&gt;, then Node 2 logs that it became leader. The script combines these to produce &lt;em&gt;V&lt;/em&gt;&lt;sub&gt;i+1&lt;/sub&gt; where Node 2 is the leader and Node 1 is now a follower. Note, this is a lie. Node 1 didn&amp;rsquo;t &lt;em&gt;actually&lt;/em&gt; become a follower in the same instant Node 2 became leader. Foreshadowing! This will be a problem for Judah and me.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/state.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/state.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Anyway, the Python script appends a state to the sequence in Trace.tla:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-tla+&#34; data-lang=&#34;tla+&#34;&gt;Trace == &amp;lt;&amp;lt;
  \* ... thousands of events ...
    &amp;lt;&amp;lt;&amp;#34;BecomePrimary&amp;#34;,                \* action name for debugging
    &amp;lt;&amp;lt;&amp;#34;Follower&amp;#34;,&amp;#34;Leader&amp;#34;,&amp;#34;Follower&amp;#34;&amp;gt;&amp;gt;, \* role per node
    &amp;lt;&amp;lt;1, 1, 1&amp;gt;&amp;gt;,                        \* commitPoint per node
    &amp;lt;&amp;lt;&amp;lt;&amp;lt;...&amp;gt;&amp;gt;,&amp;lt;&amp;lt;...&amp;gt;&amp;gt;,&amp;lt;&amp;lt;...&amp;gt;&amp;gt;&amp;gt;&amp;gt;,        \* log per node
    \* trace log location, for debugging:
    &amp;#34;/home/emptysquare/RollbackFuzzer/node2.log:12345&amp;#34;&amp;gt;&amp;gt;,
  \* ... thousands more events ...
&amp;gt;&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We used the Python script to generate a Trace.tla file for each of the hundreds of tests we&amp;rsquo;d selected: handwritten JavaScript tests and the randomized &amp;ldquo;rollback fuzzer&amp;rdquo; test. Now we wanted to use the model-checker to check that this state sequence was permitted by our TLA+ spec, so we know our C++ code behaved in a way that conforms to the spec. Following &lt;a href=&#34;https://pron.github.io/files/Trace.pdf&#34;&gt;a technique published by Ron Pressler&lt;/a&gt;, we added these lines to each Trace.tla:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code class=&#34;language-tla+&#34; data-lang=&#34;tla+&#34;&gt;VARIABLES log, role, commitPoint
\* Instantiate our hand-written spec, RaftMongo.tla.
Model == INSTANCE RaftMongo
VARIABLE i \* the trace index

\* Load one trace event.
Read == /\ log = Trace[i][4]
        /\ role = Trace[i][5]
        /\ commitPoint = Trace[i][6]

ReadNext == /\ log&amp;#39; = Trace[i&amp;#39;][4]
            /\ role&amp;#39; = Trace[i&amp;#39;][5]
            /\ commitPoint&amp;#39; = Trace[i&amp;#39;][6]

Init == i = 1 /\ Read
Next == \/ i &amp;lt; Len(Trace) /\ i&amp;#39; = i + 1 /\ ReadNext
        \/ UNCHANGED &amp;lt;&amp;lt;i, vars&amp;gt;&amp;gt; \* So that we don’t get a deadlock error in TLC

TraceBehavior == Init /\ [][Next]_&amp;lt;&amp;lt;vars, i&amp;gt;&amp;gt;

\* To verify, we check the spec TraceBehavior in TLC, with Model!SpecBehavior
\* as a temporal property.
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;We run the standard TLA+ model-checker (&amp;ldquo;TLC&amp;rdquo;), which tells us if this trace is an allowed behavior in RaftMongo.tla.&lt;/p&gt;
&lt;p&gt;But this whole experiment failed. Our traces never matched our specification. We didn&amp;rsquo;t reach our goal, but we learned three lessons that could help future engineers.&lt;/p&gt;
&lt;h2&gt;What disappointment taught us&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Lesson one: It&amp;rsquo;s hard to snapshot a multithreaded program&amp;rsquo;s state.&lt;/strong&gt; Each time a MongoDB node executes a state transition, it has to snapshot its state variables in order to log them. MongoDB is highly concurrent with fairly complex locking within each process—it was built to &lt;em&gt;avoid&lt;/em&gt; global locking. It took us a month to figure out how to instrument MongoDB to get a consistent snapshot of all these values at one moment. We burned most of our budget for the experiment, and we worried we&amp;rsquo;d changed MongoDB too much (on a branch) to test it realistically.&lt;/p&gt;
&lt;p&gt;The 2024 paper &amp;ldquo;Validating Traces of Distributed Programs Against TLA+ Specifications&amp;rdquo; describes how to do trace-checking when you can only log &lt;em&gt;some&lt;/em&gt; of the values (see my summary at the bottom of this page). We were aware of this option back in 2020, and we worried it would make trace-checking too permissive; it wouldn&amp;rsquo;t catch every bug.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lesson two: The implementation must actually conform to the spec.&lt;/strong&gt; This is obvious to me now. After all, conformance checking was the point of the project. In our real-life implementation, when an old leader votes for a new one, &lt;em&gt;first&lt;/em&gt; the old leader steps down, &lt;em&gt;then&lt;/em&gt; the new leader steps up. The spec we chose for trace-checking wasn&amp;rsquo;t focused on the election protocol, though, so for simplicity, the spec assumed these two actions happened at once. (Remember I said a few paragraphs ago, &amp;ldquo;This is a lie&amp;rdquo;?) Judah and I knew about this discrepancy—we&amp;rsquo;d deliberately made this simplification in the spec. We tried to paper over the difference with some post-processing in our Python script, but it never worked. By the end of the project, we decided we should have backtracked, making our spec much more complex and realistic, but we&amp;rsquo;d run out of time.&lt;/p&gt;
&lt;p&gt;The eXtreme Modelling methodology says we should write the spec just &lt;em&gt;before&lt;/em&gt; the implementation. But our spec was written long &lt;em&gt;after&lt;/em&gt; most of the implementation, and it was highly abstract. I can imagine another world where we knew about eXtreme Modelling and TLA+ at the start, when we began coding MongoDB. In that world, we wrote our spec before the implementation, with trace-checking in mind. The spec and implementation would&amp;rsquo;ve been structured similarly, and this would all have been much easier.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lesson three: Trace-checking should extend easily to multiple specs.&lt;/strong&gt; Judah and I put in 10 weeks of effort without successfully trace-checking one spec, and most of the work was specific to that spec, RaftMongo.tla. Sure, we learned general lessons (you&amp;rsquo;re reading some of them) and wrote some general code, but even if we&amp;rsquo;d gotten trace-checking to work for one spec we&amp;rsquo;d be practically starting over with the next spec. Our original vision was to gather execution traces from all our tests, and trace-check them against all of our specifications, on every git commit. We estimated that the marginal cost of implementing trace-checking for more specs wasn&amp;rsquo;t worth the marginal value, so we stopped the project.&lt;/p&gt;
&lt;h2&gt;Practical trace-checking&lt;/h2&gt;

&lt;p&gt;If we started again, we&amp;rsquo;d do it differently. We&amp;rsquo;d ensure the spec and implementation conform at the start, and we&amp;rsquo;d fix discrepancies by fixing the spec or the implementation right away. We&amp;rsquo;d model easily observed events like network messages, to avoid snapshotting the internal state of a multithreaded process.&lt;/p&gt;
&lt;p&gt;I still think trace-checking is worthwhile. I know it&amp;rsquo;s worked for other projects. In fact MongoDB is sponsoring a grad student &lt;a href=&#34;https://fhackett.com/&#34;&gt;Finn Hackett&lt;/a&gt;, whom I&amp;rsquo;m mentoring, to continue trace-checking research.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s move on to the second half of our project.&lt;/p&gt;
&lt;h1&gt;Test-case generation for MongoDB Mobile SDK&lt;/h1&gt;

&lt;p&gt;The MongoDB Mobile SDK is a database for mobile devices that syncs with a central server. (Since we wrote the paper, MongoDB has &lt;a href=&#34;https://www.mongodb.com/resources/resources/resource-library/docs-atlas-device-sync-getting-started?xs=494444&#34;&gt;sunsetted the product&lt;/a&gt;.) Mobile clients can make changes locally. These changes are periodically uploaded to the server and downloaded by other clients. The clients and the server all use the same algorithm to resolve write conflicts: &lt;a href=&#34;https://en.wikipedia.org/wiki/Operational_transformation&#34;&gt;Operational Transformation&lt;/a&gt;, or OT. Max wanted to test that the clients and server implement OT correctly, meaning they resolve conflicts the same way, eventually resulting in identical data everywhere.&lt;/p&gt;
&lt;p&gt;Originally, the clients and server shared one C++ implementation of OT, so we knew they implemented the same algorithm. But in 2020, we&amp;rsquo;d recently rewritten the server in Go, so testing their conformance became urgent.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/realm.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/mongodb-conformance-checking/realm.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;My colleague Max Hirschhorn used &lt;em&gt;test-case generation&lt;/em&gt; to check conformance. This technique goes in the opposite direction from trace-checking: trace-checking starts with an implementation and checks that its behaviors are allowed by the spec, but test-case generation starts with a spec and checks that its behaviors are in the implementation.&lt;/p&gt;
&lt;p&gt;But first, we needed a TLA+ spec. Before this project, the mobile team had written out the OT algorithm in English and implemented it in C++. Max manually translated the algorithm from C++ to TLA+. In the mobile SDK, clients can do 19 kinds of operations on data; six of these can be performed on arrays, resulting in 21 array merge rules, which are implemented in about 1000 lines of C++. Those 21 rules are the most complex, and Max focused his specification there. He used the model-checker to verify that his TLA+ spec ensured all participants eventually had the same data. This translation was a gruelling job, but the model-checker caught Max&amp;rsquo;s mistakes quickly, and he finished in two weeks.&lt;/p&gt;
&lt;p&gt;There was one kind of write conflict that crashed the model-checker: if one participant swapped two array elements, and another moved an element, then the model-checker crashed with a Java StackOverflowError. Surprisingly, this was an actual infinite-recursion bug in the algorithm. Max verified that the bug was in the C++ code. It had hidden there until he faithfully transcribed it into TLA+ and discovered it with the model-checker. He disabled the element-swap operation in his TLA+ spec, and the mobile team deprecated it in their implementation.&lt;/p&gt;
&lt;p&gt;To test conformance, Max used the model-checker to output the entire state graph for the spec. He constrained the algorithm to three participants, all editing a three-element array, each executing one (possibly conflicting) write operation. With these constraints, the state space is a DAG, with a finite number of behaviors (paths from an initial state to a final state). There are 30,184 states and 4913 behaviors. Max wrote a Go program to parse the model-checker&amp;rsquo;s output and write out a C++ unit test for each behavior.&lt;/p&gt;
&lt;p&gt;Here’s an example unit test. (It&amp;rsquo;s edited down from three participants to two.) At the start, there&amp;rsquo;s an array containing {1, 2, 3}. One client sets the third element of an array to 4 and the second client removes the second element from the array. The test asserts that both clients agree the final array is {1, 4}. The highlighted lines are specific to this generated test. The rest of the code is the same for all tests.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;&#34;&gt;&lt;code class=&#34;language-c++&#34; data-lang=&#34;c++&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;TEST&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;Transform_Array&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;size_t&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;num_clients&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;2&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;;&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;TransformArrayFixture&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;fixture&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;test_context&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;num_clients&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;2&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;3&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}};&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;fixture&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;transaction&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[](&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;TableRef&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;array&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;array&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&amp;gt;&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;set_int&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;2&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;4&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;fixture&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;transaction&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;[](&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;TableRef&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;array&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;array&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&amp;gt;&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;remove&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;fixture&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;sync_all_clients&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;();&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;fixture&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;check_array&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;({&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;4&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;fixture&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;check_ops&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ArrayErase&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;fixture&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;check_ops&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ArraySet&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;1&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;4&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}});&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;These 4913 tests immediately achieved 100% branch coverage of the implementation, which we hadn&amp;rsquo;t accomplished with our handwritten tests (21%) or millions of executions with the &lt;a href=&#34;https://lcamtuf.coredump.cx/afl/&#34;&gt;AFL fuzzer&lt;/a&gt; (92%).&lt;/p&gt;
&lt;h1&gt;Retrospective&lt;/h1&gt;

&lt;p&gt;Max&amp;rsquo;s test-case generation worked quite well. He discovered a bug in the algorithm, and he thoroughly checked that the mobile SDK&amp;rsquo;s Operational Transformation code conforms to the spec. Judah&amp;rsquo;s and my trace-checking experiment didn&amp;rsquo;t work: our spec and code were too far apart, and adding tracing to MongoDB took too long. Both techniques can work, given the right circumstances and strategy. Both techniques can fail, too! We published our results and lessons as a paper in VLDB 2020, titled &amp;ldquo;&lt;a href=&#34;https://arxiv.org/abs/2006.00915&#34;&gt;eXtreme Modelling in Practice&lt;/a&gt;.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In the subsequent five years, I&amp;rsquo;ve seen some progress in conformance checking techniques.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Test-case generation:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://dl.acm.org/doi/abs/10.1145/3552326.3587442&#34;&gt;Model Checking Guided Testing for Distributed Systems&lt;/a&gt;. The &amp;ldquo;Mocket&amp;rdquo; system generates tests from a TLA+ spec, and instruments Java code (with a fair amount of human labor) to force it to deterministically follow each test, and check that its variables have the same values as the spec after each action. The authors tested the conformance of three Java distributed systems and found some new bugs. Their technique is Java-specific but could be adapted for other languages.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://muratbuffalo.blogspot.com/2025/04/multi-grained-specifications-for.html&#34;&gt;Multi-Grained Specifications for Distributed System Model Checking and Verification&lt;/a&gt;. The authors wrote several new TLA+ specs of Zookeeper, at higher and lower levels of abstraction. They checked conformance between the most concrete specs and the implementation, with a technique similar to Mocket: a human programmer instruments some Java code to map Java variables to spec variables, and to make all interleavings deterministic. The model-checker randomly explores spec behaviors, while the test framework checks that the Java code can follow the same behaviors.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/tangruize/SandTable/blob/main/doc/SandTable-Paper.pdf&#34;&gt;SandTable: Scalable Distributed System Model Checking with Specification-Level State Exploration&lt;/a&gt;. This system is &lt;em&gt;not&lt;/em&gt; language-specific: it overrides system calls to control nondeterminism and force the implementation to follow each behavior of the spec. It samples the spec&amp;rsquo;s state space to maximize branch coverage and event diversity while minimizing the length of each behavior. As in the &amp;ldquo;Multi-Grained&amp;rdquo; paper, the SandTable authors wisely developed &lt;em&gt;new&lt;/em&gt; TLA+ specs that closely matched the implementations they were testing, rather than trying to use existing, overly abstract specs like Judah and I did.&lt;/li&gt;
&lt;li&gt;Plus, my colleagues &lt;a href=&#34;https://will62794.github.io/&#34;&gt;Will Schultz&lt;/a&gt; and &lt;a href=&#34;http://muratbuffalo.blogspot.com/&#34;&gt;Murat Demirbas&lt;/a&gt; are publishing a paper in &lt;a href=&#34;https://vldb.org/2025/&#34;&gt;VLDB 2025&lt;/a&gt; that uses test-case generation with a new TLA+ spec of MongoDB&amp;rsquo;s WiredTiger storage layer; the paper is titled &amp;ldquo;Design and Modular Verification of Distributed Transactions in MongoDB.&amp;rdquo;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;&lt;strong&gt;Trace-checking:&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://dariusf.github.io/cpluscal.pdf&#34;&gt;Protocol Conformance with Choreographic PlusCal&lt;/a&gt;. The authors write new specs in an &lt;em&gt;extremely&lt;/em&gt; high-level language that compiles to TLA+. From their specs they generate Go functions for trace-logging, which they manually add to existing Go programs. They check that the resulting traces are valid spec behaviors and find some bugs.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://arxiv.org/pdf/2404.16075v2&#34;&gt;Validating Traces of Distributed Programs Against TLA+ Specifications&lt;/a&gt;. Some veteran TLA+ experts demonstrate in detail how to trace-log from a Java program and validate the traces with TLC, the TLA+ model-checker. They&amp;rsquo;ve written small libraries and added TLC features for convenience. This paper focuses on validating &lt;em&gt;incomplete&lt;/em&gt; traces: if you can only log some of the variables, TLC will infer the rest.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://muratbuffalo.blogspot.com/2025/02/smart-casual-verification-of.html&#34;&gt;Smart Casual Verification of the Confidential Consortium Framework&lt;/a&gt;. The authors started with an existing implementation of a secure consensus protocol. Their situation was like mine in 2020 (new specs of a big old C++ program) and so was their goal: to continuously check conformance and keep the spec and implementation in sync. Using &lt;a href=&#34;https://groups.google.com/d/msgid/tlaplus/2443fd1a-1c35-419a-95b8-72de361f28bdn%40googlegroups.com&#34;&gt;the new TLC features&lt;/a&gt; announced in the &amp;ldquo;Validating Traces&amp;rdquo; paper above, they toiled for months, brought their specs and code into line, found some bugs, and realized the eXtreme Modelling vision.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://fhackett.com/&#34;&gt;Finn Hackett&lt;/a&gt; is a PhD student I&amp;rsquo;m mentoring; he&amp;rsquo;s developed &lt;a href=&#34;https://www.cs.ubc.ca/~bestchai/papers/asplos23-pgo.pdf&#34;&gt;a TLA+-to-Go compiler&lt;/a&gt;. He&amp;rsquo;s now prototyping a trace-checker to verify that the Go code he produces really conforms to its source spec. We&amp;rsquo;re doing a summer project together with &lt;a href=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/&#34;&gt;Antithesis&lt;/a&gt; to thoroughly conformance-check the implementation&amp;rsquo;s state space.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I&amp;rsquo;m excited to see growing interest in conformance checking, because I think it&amp;rsquo;s a serious problem that needs to be solved before TLA+ goes mainstream. The &amp;ldquo;Validating Traces&amp;rdquo; paper announced some new trace-checking features in TLC, and TLC&amp;rsquo;s developers are &lt;a href=&#34;https://github.com/tlaplus/tlaplus/issues/1073&#34;&gt;discussing a better way to export a state graph for test-case generation&lt;/a&gt;. I hope these research prototypes lead to standard tools, so engineers can keep their code and specs in sync.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>MongoDB and CS Education for All, Part 2</title>
      <link>https://emptysqua.re/blog/cs4all-part-2/</link>
      <pubDate>Tue, 27 May 2025 22:19:52 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/cs4all-part-2/</guid>
      
      
      
      
      <description>&lt;p&gt;In 2016 and 2017, I photographed, reported, and wrote &lt;a href=&#34;https://emptysqua.re/blog/series/cs4all&#34;&gt;a two-part series&lt;/a&gt; on MongoDB&amp;rsquo;s sponsorship of computer science education in NYC public schools. Those articles were eventually deleted from MongoDB&amp;rsquo;s engineering blog, so I&amp;rsquo;m reposting them here, eight years later. &lt;a href=&#34;https://emptysqua.re/blog/cs4all-part-1/&#34;&gt;Part 1 is here&lt;/a&gt;, and Part 2 is below.&lt;/p&gt;
&lt;h1&gt;Investing in CS4All: One Year Later&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-2/cs4all-part2-01.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;When a couple of New York City high school teachers partnered with MongoDB to teach computer science, did they succeed? Their curriculum was untested, and they were teaching in difficult districts where most students are from poor and minority families. &lt;a href=&#34;https://emptysqua.re/blog/cs4all-part-1/&#34;&gt;I talked with these two teachers, Jeremy Mellema and Timothy Chen, back in September&lt;/a&gt;, when they had completed a summer fellowship at MongoDB and had just started teaching their curriculum; at the end of the academic year this spring, I visited Jeremy and Tim again to find out the result.&lt;/p&gt;
&lt;p&gt;Their successes were sparse and partial. They discovered that their students&amp;rsquo; poor reading skills were a barrier to learning to code, and that teaching new coders how to solve problems is, itself, an unsolved problem. With a coarse unit of iteration—a school semester—it is painfully slow to experiment and find teaching methods that work. But even partial wins make a difference for individual kids, and the support of professional engineers at companies like MongoDB can be a powerful accelerant.&lt;/p&gt;
&lt;h1&gt;What Engages Students&lt;/h1&gt;

&lt;p&gt;Jeremy&amp;rsquo;s main struggle was to get his students excited about code. He was assigned to teach a computer science class at Bronx Compass High School in the fall, using the curriculum he wrote during his fellowship at MongoDB last summer. In the beginning he spent too much time lecturing. &amp;ldquo;It felt weird,&amp;rdquo; he said. &amp;ldquo;It should be more like, &amp;lsquo;Let&amp;rsquo;s get down and dirty,&amp;rsquo; and not, &amp;lsquo;Let&amp;rsquo;s have me talk to you.&amp;rsquo;&amp;rdquo; Even when his students did get their hands on computers, the first exercises were simply retyping Python scripts from a textbook. The payoff, watching a script run without throwing an exception, was hardly satisfying to them.&lt;/p&gt;
&lt;p&gt;Things started to click when he introduced &lt;a href=&#34;https://docs.python.org/3.6/library/turtle.html&#34;&gt;Python turtle graphics&lt;/a&gt;, which gave the class more obvious evidence of their accomplishments. It also allowed Jeremy better opportunities to motivate and engage his students directly. &amp;ldquo;Some days I would challenge them to see who could make the craziest drawing,&amp;rdquo; Jeremy says. He would tell his students, &amp;ldquo;That&amp;rsquo;s so cool. I only made a star. You definitely beat me today.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Jeremy teaches both history and computer science, and he finds that some of his lowest-performing history students are his best CS students. &amp;ldquo;It&amp;rsquo;s satisfying to see them in their element,&amp;rdquo; he says. In Jeremy&amp;rsquo;s view a computer science class can touch a student&amp;rsquo;s intellect just as deeply as history. &amp;ldquo;People are multifaceted. You&amp;rsquo;re not only who you are when you&amp;rsquo;re in my history class.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-2/cs4all-part2-02.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Jeremy&amp;rsquo;s computer science class was cancelled this spring; the students at Bronx Compass High School are behind on history credits and there are only three history teachers on staff. For now, computer science is merely an elective, so Jeremy is back teaching history full-time. &amp;ldquo;I really miss teaching CS,&amp;rdquo; he says.&lt;/p&gt;
&lt;p&gt;If he resumes the course, Jeremy thinks it must be livelier. He is reconsidering his use of the videos from &lt;a href=&#34;https://greenteapress.com/wp/think-python-3rd-edition/&#34;&gt;How To Think Like a Computer Scientist&lt;/a&gt;, which he studied last summer on the recommendation of his mentor Shannon Bradshaw, MongoDB&amp;rsquo;s vice president of education. The content helped Jeremy train to teach CS, but when he showed the videos to his kids they were bored. Jeremy hopes to make new videos that will draw them in. His students from the fall semester say he should get their advice. Otherwise, they warned, &amp;ldquo;you might do something that you think is cool but it&amp;rsquo;s actually super corny.&amp;rdquo;&lt;/p&gt;
&lt;h1&gt;A Head Start&lt;/h1&gt;

&lt;p&gt;Although there is no computer science elective this semester, some students are pursuing the topic in other ways. A young woman from his class in the fall, Tatyana Camacho, now interns for the high school&amp;rsquo;s IT department. I had quoted her in my previous article, and Jeremy tells me she loved it. She commanded him to show her father in the next parent-teacher conference: &amp;ldquo;You need to show my dad that I&amp;rsquo;m one of the advanced students.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Jeremy still runs the afternoon Computer Club. I visited the club to meet a student, Daniel Rodriguez, who was tinkering with an Arduino and a circuit board that the school provided. &amp;ldquo;I don&amp;rsquo;t have the ability to get this equipment otherwise, in my predicament,&amp;rdquo; says Daniel. He starts his Arduino projects by copying examples. The wiring is easier than the coding for him, he told me, &amp;ldquo;especially because I&amp;rsquo;m not the best speller in the world.&amp;rdquo; Once he has the example working he modifies it to his own taste. Most recently, he wanted to show a message but, with only LEDs, he can&amp;rsquo;t display much. He researched Morse Code and made a light flash the code for &amp;ldquo;HELLO&amp;rdquo;, like any programmer demonstrating a system for the first time.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-2/cs4all-part2-03.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&amp;ldquo;Most people think that once you plug something in, that&amp;rsquo;s it, it works,&amp;rdquo; says Daniel. &amp;ldquo;But I&amp;rsquo;m the person that makes the circuit run. I tell people, &amp;lsquo;I made it do that.&amp;rsquo; And seeing them fascinated by what I did, it makes me, in turn, fascinated by what I&amp;rsquo;m doing.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Daniel has to return the Arduino at the end of the year. Next year he&amp;rsquo;ll go to a trade school for electricians. Working with the Arduino will give him an advantage, he hopes, and it seems plausible to me. As he finishes school and starts work as an electrician the world will be changing around him: smart appliances and programmable components will be everywhere. An electrician who loves to code will have a big head start.&lt;/p&gt;
&lt;h1&gt;Making Anything They Want&lt;/h1&gt;

&lt;p&gt;Timothy Chen teaches in Hell&amp;rsquo;s Kitchen, at Urban Assembly Gateway School for Technology. I visited his class in May to see how his students had progressed since I last saw them in September. They were involved in a multi-week project called the AP Create Task, part of a national Advanced Placement exam. &amp;ldquo;They are allowed to make literally anything they want, in any language,&amp;rdquo; says Tim. Students submit their code and a one-minute video of the program in action, and they may describe their project either in writing or in audio narration. I was surprised by how 21st Century the test is, and how accommodating it could be to students with a deficit in reading and writing. It must be remarkably difficult, however, to score fairly.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-2/cs4all-part2-04.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The Create Task is many students&amp;rsquo; first time scoping and integrating a sizable project, and there were some flameouts. One young man tried to make a maze game drawn with ASCII characters; it proved too ambitious and he ran out of time. Tim isn&amp;rsquo;t supposed to help students define the scope of their projects, but if they announce they&amp;rsquo;re going to tackle something difficult he will push them to list all the components. In the best case, they realize they don&amp;rsquo;t know how to do most of the project and choose something simpler.&lt;/p&gt;
&lt;p&gt;One of Tim&amp;rsquo;s students, Jasean Maxwell, was building a Go Fish card game in Python, and she was having trouble integrating the pieces. &amp;ldquo;It has to be a certain order and it&amp;rsquo;s hard to make that order when you don&amp;rsquo;t know, really, what you&amp;rsquo;re doing. I&amp;rsquo;m struggling, putting it all together.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Another student, Cecilia Gonzalez, was writing a Choose Your Own Adventure game. She says the AP Create Task encourages students to work in pairs. &amp;ldquo;We work sort of together but not exactly.&amp;rdquo; Each must create at least one significant part of the program independently. Cecilia&amp;rsquo;s game is based on a monster of urban legend called The Rake, which comes closer when you think about it. The game begins by asking questions such as the player&amp;rsquo;s name and height. She told me the player&amp;rsquo;s answers will determine &amp;ldquo;some things that are going to happen,&amp;rdquo; but she didn&amp;rsquo;t give away any spoilers.&lt;/p&gt;
&lt;p&gt;When Tim began teaching the class in September, he hadn&amp;rsquo;t written the ending yet, either. His greatest fear was his students would learn the curriculum faster than he could write it. By May it was clear that wasn&amp;rsquo;t a problem. &amp;ldquo;Some of the students can&amp;rsquo;t read very well, and that was a big barrier because all the things I made were text,&amp;rdquo; he says. &amp;ldquo;Everything just took longer than expected.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-2/cs4all-part2-05.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Problem Solving&lt;/h1&gt;

&lt;p&gt;How do you teach problem solving? This is Tim&amp;rsquo;s great unanswered question from the year. Perhaps if high school computer science were taught like math, as a series of small problems with only one right answer each, then how to solve those problems wouldn&amp;rsquo;t be such a mystery. But high school CS is taught like art class. Tim&amp;rsquo;s students invent new projects and somehow solve the unpredictable problems that arise in them.&lt;/p&gt;
&lt;p&gt;Tim speculates that he would learn problem solving himself by watching an experienced programmer solve a new problem, hit roadblocks, and overcome them. Indeed, &lt;a href=&#34;https://emptysqua.re/blog/mentoring/&#34;&gt;that is how I have taught problem solving to MongoDB interns&lt;/a&gt;. Together, we attack problems without knowing the answers beforehand. It requires an entire summer of one-on-one collaboration. &amp;ldquo;I don&amp;rsquo;t think that model works very well with the kids,&amp;rdquo; says Tim, &amp;ldquo;especially if they are not very good with sitting still for an extended period. I&amp;rsquo;m not sure how to reach them.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Tim, like Jeremy, wants to make more multimedia to reach students despite their poor reading skills. &amp;ldquo;I want to rethink how it should be done before I start this time. I kind of jumped into it too quickly.&amp;rdquo; Tim&amp;rsquo;s main goal is to give kids the chance to fall in love with programming and continue on their own. Many other goals are still out of reach: students at his school score low on the AP test, and few of them are likely to get a college degree in CS or be professional coders. Still, Tim hopes that a more varied course, with audio and video, could bring students farther. &amp;ldquo;The big hurdle for everyone is teaching problem solving. If I can get that, everything else is easy. I&amp;rsquo;m still trying to figure out how to do that.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-2/cs4all-part2-06.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;The View From 2025&lt;/h1&gt;

&lt;p&gt;After I wrote the 2017 followup you just read, I lost track of the two teachers, Jeremy Mellema and Timothy Chen. MongoDB had donated money and teacher training to the CS4All program in 2016 and 2017, but I don&amp;rsquo;t know if we continued after that. CS4All&amp;rsquo;s champion at MongoDB, our VP of Education Shannon Bradshaw, left the company in 2020.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s 2025 now and CS4All &lt;a href=&#34;https://blueprint.cs4all.nyc/outcomes&#34;&gt;is supposed to achieve its goals this year&lt;/a&gt;: every NYC public school student should have at least one &amp;ldquo;meaningful CS experience&amp;rdquo; every 3 years throughout their school career: either a computer science class or 11+ hours of CS instruction within some class. The plan was, instead of hiring specialist teachers, to train 5000 existing teachers in computer science, throughout all 1800 schools.&lt;/p&gt;
&lt;p&gt;The Research Alliance for New York City Schools at NYU published dozens of reports on CS4All&amp;rsquo;s progress over the last decade, culminating in &lt;a href=&#34;https://steinhardt.nyu.edu/research-alliance/research/expanding-computer-science-education-all&#34;&gt;a final report this December&lt;/a&gt;. There&amp;rsquo;s good news and bad news. On the one hand, &amp;ldquo;Student participation in CS has
expanded exponentially: About 54,000 students participated in a
CS unit or course during the first year of the initiative; this grew to
nearly 170,000 in 2022-23—an astounding 200 percent increase.
(Preliminary data show CS reaching nearly 200,000 in 2023-24.)
The number of schools offering CS courses—and the total number
of CS courses available—have also increased dramatically. The
percentage of schools offering CS each year tripled (from 15% to 45%), while the number
of CS courses increased nearly 10-fold, from 534 to over 5,000. Altogether, by 2022-23,
93 percent of schools offered students at least one CS experience at some point over the
previous three- or four-year time period (corresponding to the three- or four-year grade
bands).&amp;rdquo; However, by 2023 only 38% of students had had a CS experience in the last 3 years. Girls participated less than boys, and Black and Latino students less than White and Asian—these disparities were worse among older students, but CS4All narrowed disparities over the course of the decade.&lt;/p&gt;
&lt;p&gt;According to the report, about 3000 teachers were trained, compared to the goal of 5000, and this wasn&amp;rsquo;t enough to meet CS4All&amp;rsquo;s goals. Not all trained teachers actually taught CS. A quarter of them left the NYC public school system some time after training, and many who remained stopped teaching CS. Computer science classes competed on the schedule with other subjects, and teachers often didn&amp;rsquo;t have enough time to prepare CS lessons. At the high school level, NYC had trouble finding qualified specialist teachers—even in the current &lt;a href=&#34;https://sfstandard.com/2025/05/20/silicon-valley-white-collar-recession-entry-level/&#34;&gt;tech industry downturn&lt;/a&gt;, I imagine there are few programmers who&amp;rsquo;d accept the salary and conditions of NYC public schools.&lt;/p&gt;
&lt;p&gt;The city is still committed to some sort of CS education for all students, and now &lt;a href=&#34;https://infohub.nyced.org/in-our-schools/programs/computer-science-for-all-overview&#34;&gt;New York State has announced required CS curriculum for all students and CS training for all teachers&lt;/a&gt;. It seems like the obstacles I observed almost ten years ago were a difficulty for the program&amp;rsquo;s whole history, but it nevertheless made progress and had a statewide and national influence.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>MongoDB and CS Education for All, Part 1</title>
      <link>https://emptysqua.re/blog/cs4all-part-1/</link>
      <pubDate>Tue, 27 May 2025 12:18:34 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/cs4all-part-1/</guid>
      
      
      
      
      <description>&lt;p&gt;In 2016 and 2017, I photographed, reported, and wrote &lt;a href=&#34;https://emptysqua.re/blog/series/cs4all&#34;&gt;a two-part series&lt;/a&gt; on MongoDB&amp;rsquo;s sponsorship of computer science education in NYC public schools. Those articles were eventually deleted from MongoDB&amp;rsquo;s engineering blog, so I&amp;rsquo;m reposting them here, eight years later. Part 1 is below, and Part 2 is &lt;a href=&#34;https://emptysqua.re/blog/cs4all-part-2&#34;&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;MongoDB&amp;rsquo;s Investment in CS Education&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image1.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Until last year, Jeremy Mellema was a history teacher. Now, he&amp;rsquo;s teaching computer programming. When I visited his class in the Bronx this month, he had 30 students with 30 Macbooks, completing exercises in Python. They had just finished a lesson on data types, and now they were tackling variables. In Jeremy&amp;rsquo;s class, the first variable assignment is:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-python&#34; data-lang=&#34;python&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;tupac&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Greatest of All Time!!&amp;#34;&lt;/span&gt;  
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;h1&gt;Initiative&lt;/h1&gt;

&lt;p&gt;A year ago, New York City mayor Bill de Blasio &lt;a href=&#34;http://www.nytimes.com/2015/09/16/nyregion/de-blasio-to-announce-10-year-deadline-to-offer-computer-science-to-all-students.html&#34;&gt;announced Computer Science for All&lt;/a&gt;, an $80 million public-private partnership. The goal is to teach computer science to every student at every public school. But first, the schools need curriculum, and 5000 teachers need training.&lt;/p&gt;
&lt;p&gt;Here at MongoDB, our VP of Education Shannon Bradshaw oversees MongoDB University, which trains IT professionals to use MongoDB. When he heard about CS4All, he wanted us to contribute. He proposed that we set aside budget for two paid fellowships, and recruit public school teachers to spend the summer with us. We would develop them as teachers, and help build curricula they could take back into schools this fall. MongoDB staff would share our expertise, our office space, our equipment, and the MongoDB software itself.&lt;/p&gt;
&lt;p&gt;Shannon pitched his proposal to the company like this: &amp;ldquo;As many of us know, it&amp;rsquo;s still unusual for students to encounter computer science, let alone databases, in their classrooms before entering college. I believe this absence directly contributes to the gender and racial disparity we see today across our industry.&amp;rdquo; The &lt;a href=&#34;http://www.csnyc.org/computer-science-all&#34;&gt;CS4All project&lt;/a&gt; improves access to these subjects for many more students in our city, and MongoDB could be part of it from the beginning.&lt;/p&gt;
&lt;h1&gt;Introducing Tim and Jeremy&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image2.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The teachers we hired for the summer are at opposite ends of a spectrum. At the more technical end is Tim Chen. He majored in Math and took some CS classes in college. When he applied to work for the NYC Department of Education, there was a job opening for a Tech Teacher. &amp;ldquo;I didn&amp;rsquo;t know what &amp;rsquo;tech&amp;rsquo; meant at the time,&amp;rdquo; he said, &amp;ldquo;but I lucked out because it was teaching really basic software engineering.&amp;rdquo; His first year teaching, he taught the 9th grade curriculum for the Software Engineering Program. Now, he&amp;rsquo;s an experienced CS teacher, and his students this fall are comparatively advanced programmers. He spent his summer at MongoDB outlining an ambitious two-year curriculum in Python and Javascript.&lt;/p&gt;
&lt;p&gt;Jeremy Mellema, on the other hand, was teaching 10th grade world history when he was tapped to teach a CS class last year. He took an intensive training with the NYC Department of Education&amp;rsquo;s Software Engineering Program, where he spent a day apiece working in Python, HTML, CSS, JavaScript, Scratch, and Arduino. That fall was a struggle. &amp;ldquo;It was pretty overwhelming,&amp;rdquo; he said. Jeremy is a skilled teacher, but on the topic of software, he &amp;ldquo;just didn&amp;rsquo;t know a whole lot.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image3.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Jeremy joined us this summer to meet software engineers and learn to code more like a professional. His presence is particularly valuable because he&amp;rsquo;s not a specialist: he&amp;rsquo;ll ensure the curriculum he builds can be taught by other teachers who aren&amp;rsquo;t already coders. Shannon assigned him the book How to Think Like a Computer Scientist, and he spent the summer turning its contents into something high schoolers can learn from. He told me, &amp;ldquo;I was excited to actually get my hands dirty with what it means to be a programmer and learn how to do things besides basic Scratch programming, drag and drop, and to see how you actually use it in the real world.&amp;rdquo;&lt;/p&gt;
&lt;h1&gt;Foreshadowing&lt;/h1&gt;

&lt;p&gt;Now that school is back in session, I&amp;rsquo;m watching Tim and Jeremy&amp;rsquo;s progress with fascination. Their students come to CS from a different angle than I did: I have the typical background of a computer programmer in this country, with college-educated parents and all the privileges that typically pave the way to a career in software. I taught myself to code in high school and got a Bachelor&amp;rsquo;s in CS from Oberlin College. But the students in Tim and Jeremy&amp;rsquo;s classes nearly all come from low-income families and qualify for federal meal assistance. These students have self-selected into the computer track, and they are mostly boys; but there are a half-dozen girls in both classes. The majority are Black or Hispanic, and Jeremy&amp;rsquo;s class also includes many recent immigrants from the Caribbean and Bangladesh. If young people like them succeed as software engineers, it will go a long way to addressing the inequalities of our industry and our society as a whole.&lt;/p&gt;
&lt;p&gt;But the goal for Tim and Jeremy this year is more modest. They will prove and refine their curricula in their classrooms; then the materials can be used by any teacher in the New York City public schools. I visited their classes to watch their plans being put to the test.&lt;/p&gt;
&lt;h1&gt;Jeremy&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image4.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;When I saw Jeremy at Bronx Compass High School, he&amp;rsquo;d had the students for one week. He was nervous the month before about how class would go: &amp;ldquo;I&amp;rsquo;m always afraid it&amp;rsquo;s going to bomb, or the students won&amp;rsquo;t find it interesting.&amp;rdquo; But the kids were hacking enthusiastically. Watching them reminded me of my own joy when I first learned to code.&lt;/p&gt;
&lt;p&gt;Class time is spent working independently while Jeremy checks in with each student. The kids present a huge range of skills: some have taken computer science classes for several years, and others have just moved to New York and are getting their first exposure to CS. Jeremy worries about challenging all his students according to their level.&lt;/p&gt;
&lt;p&gt;
&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image5.jpg&#34; style=&#34;display:inline;width:auto;height:auto;vertical-align:middle&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;br&gt;
An advanced student, Tatiana, is able to use this wide skill range to make the class more effective for the novice students. I noticed her sitting sideways in her chair so she could coach the students to her side and behind her. &amp;ldquo;It&amp;rsquo;s nice to help people who struggle with coding because it&amp;rsquo;s like a mixture of math and grammar,&amp;rdquo; she said. &amp;ldquo;When you write &amp;lsquo;print&amp;rsquo;, if you write it with a capital P your code won&amp;rsquo;t work. It&amp;rsquo;s like teaching them a new language.&amp;rdquo; Tatiana plans to keep coding after high school: &amp;ldquo;I like making things work the way I want them to work.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Since my formal CS education began in college, I&amp;rsquo;ve never sat in a high school programming class. To my surprise, 30 students can be trusted to stay on task, even when they&amp;rsquo;re on laptops with the whole Internet in reach. Jeremy doesn&amp;rsquo;t mind when they take detours to the Web. He said, &amp;ldquo;If they just need YouTube to play music to drown out their classmates that&amp;rsquo;s fine.&amp;rdquo; I saw his student Amaury with headphones on and YouTube playing in the background in order to concentrate, the same way I use music at the MongoDB office. Jeremy gives him space to work in his own style: &amp;ldquo;It&amp;rsquo;s fine. As soon as I ask Amaury a question, his headphones are off and he answers.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I asked Amaury to take off his headphones for a minute and tell me how the class was going. &amp;ldquo;It&amp;rsquo;s cool learning how a computer works using just a bunch of freaking inputs. Put in a couple lines of code and you can have the computer do crazy stuff.&amp;rdquo; Then his headphones went back on and he got back to hacking.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image6.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;By the end of this year, Jeremy hopes his kids will advance from sandboxed programming environments to real-world tools—that they&amp;rsquo;ll install Python on their computers and code on their own. &amp;ldquo;The kids I&amp;rsquo;ve seen really do well,&amp;rdquo; he said, &amp;ldquo;this will open the door for them to take the world by storm. There&amp;rsquo;s a lot of really smart and talented kids who are not in an advantaged place. This will put them at an advantage.&amp;rdquo;&lt;/p&gt;
&lt;h1&gt;Tim&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image7.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Tim Chen teaches a few blocks from the MongoDB office, at Urban Assembly Gateway School for Technology on the west side of Manhattan. His students have specialized in CS more than Jeremy&amp;rsquo;s, and he has more time with them: 45 minutes a day, 5 days a week. His curriculum is accordingly more ambitious. This year he&amp;rsquo;s teaching introductory programming in Python, aligned with the upcoming version of the Advanced Placement test in Computer Science Principles. Beginning with Python makes entry to the subject easy for Tim&amp;rsquo;s students, but by the end of the year they&amp;rsquo;ll cover nearly every topic of a first semester college class. Next year they&amp;rsquo;ll switch to JavaScript. They&amp;rsquo;ll build web applications with MongoDB, Express, and Node. These three technologies (along with Angular.js) make up the famous &lt;a href=&#34;https://www.mongodb.com/resources/languages/mean-stack&#34;&gt;MEAN Stack&lt;/a&gt;, which Tim learned this summer from &lt;a href=&#34;https://learn.mongodb.com/&#34;&gt;MongoDB University&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;When I asked Tim how his curriculum performed in its first week, he said there&amp;rsquo;d been no surprises yet—but he&amp;rsquo;s still in the process of writing it. Only the first semester&amp;rsquo;s material is complete. &amp;ldquo;My greatest fear is that they&amp;rsquo;ll move faster than I can create it,&amp;rdquo; he said. Still, he expects his students to be forgiving if they hit any bumps. &amp;ldquo;I told them this is my first year trying this out, so we&amp;rsquo;re going to try things, and if it doesn&amp;rsquo;t work out we&amp;rsquo;re going to try different things. They&amp;rsquo;re cool with it.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image8.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Scaling up&lt;/h1&gt;

&lt;p&gt;Teaching one class in Hell&amp;rsquo;s Kitchen and one in the Bronx is only the start. Tim wants to create a curriculum that any teacher can pick up and deliver to their students. Jeremy has the same goal, and because of his background as a history teacher, rather than a computer scientist, he is focused on making his course effective for teachers like him.&lt;/p&gt;
&lt;p&gt;For MongoDB&amp;rsquo;s VP of Education Shannon Bradshaw, this teacher-training is the most unexpected, and the most inspiring, aspect of our involvement. &amp;ldquo;Most of the people who will teach computer science in the New York City public schools are transitioning from another discipline,&amp;rdquo; he said. &amp;ldquo;They&amp;rsquo;re never going to be hardcore software engineers. They&amp;rsquo;re professional teachers; that&amp;rsquo;s what they want to do.&amp;rdquo; But in nine years, computer science will be a core class in New York City, and the existing staff will have to teach it. &amp;ldquo;If we can contribute to figuring that out, how to make this transition as effective as possible for computing education in New York City and maybe elsewhere in the country, then we&amp;rsquo;ve gone miles beyond where I imagined us going.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/cs4all-part-1/image9.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Liza Mills</title>
      <link>https://emptysqua.re/blog/liza-mills-the-sting/</link>
      <pubDate>Thu, 22 May 2025 17:38:37 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/liza-mills-the-sting/</guid>
      
      
      
      
      <description>&lt;p&gt;Gunks climber &lt;a href=&#34;https://www.instagram.com/liza___mills/&#34;&gt;Liza Mills&lt;/a&gt; on the routes &lt;a href=&#34;https://www.mountainproject.com/route/106148478/lisa&#34;&gt;Lisa&lt;/a&gt; and &lt;a href=&#34;https://www.mountainproject.com/route/105898985/the-sting&#34;&gt;The Sting&lt;/a&gt; with her husband Jordan and their cat &lt;a href=&#34;https://www.instagram.com/dwaynetheadventurecat/&#34;&gt;Dwayne&lt;/a&gt;. The Sting ends with a dramatic dyno to a ledge; Liza latched it and sent the route on her first try of the day, but I&amp;rsquo;m including a photo of her falling on a repeat attempt, for spice.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-09.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-10.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-11.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-12.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-13.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/liza-mills-the-sting/mills-14.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;Shot on 35mm Portra 400 with a Canon EOS10, which has a rather fast motor drive for sports photography.&lt;/p&gt;
&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Jesse&#39;s Notes from the Antithesis BugBash Conference</title>
      <link>https://emptysqua.re/blog/notes-from-antithesis-bugbash/</link>
      <pubDate>Tue, 20 May 2025 22:33:34 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/notes-from-antithesis-bugbash/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image004.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Antithesis has pioneered &amp;ldquo;autonomous testing&amp;rdquo;, a term they coined for their platform, which combines a &lt;a href=&#34;https://antithesis.com/blog/deterministic_hypervisor/&#34;&gt;deterministic hypervisor&lt;/a&gt; with coverage-guided fuzzing and fault injection. &lt;a href=&#34;https://antithesis.com/case_studies/mongodb_productivity/&#34;&gt;My company MongoDB was one of their first customers&lt;/a&gt;—I evaluated their platform for MongoDB in 2020. At the time we had several very rare replication bugs, including one that manifested once in many &lt;strong&gt;years&lt;/strong&gt; of stress testing (tens of thousands of hours of compute), and Antithesis found them reliably in about 24 hours.&lt;/p&gt;
&lt;p&gt;Antithesis was in quiet mode until mid-2024 when they suddenly turned on the marketing switch, started blogging, and announced the &lt;a href=&#34;https://www.bugbash.antithesis.com/&#34;&gt;BugBash&lt;/a&gt; conference. This is a small (about 200 attendees) conference with invited speakers (they accepted lightning talk proposals). MongoDB&amp;rsquo;s distributed systems researcher Will Schultz gave a lightning talk on &lt;a href=&#34;https://github.com/will62794/spectacle&#34;&gt;Spectacle&lt;/a&gt;, and our new Atlas Federation engineer Stephanie Wang gave a regular talk on lessons learned from her time building a cloud DB at MotherDuck.&lt;/p&gt;
&lt;p&gt;The Antithesis CEO Will Wilson calls this &amp;ldquo;not a software testing conference, but a software &lt;em&gt;correctness&lt;/em&gt; conference&amp;rdquo;. Of course there was a fair amount of Antithesis marketing and education, but mostly the talks were about non-Antithesis software correctness topics.&lt;/p&gt;
&lt;p&gt;Antithesis is posting and summarizing the conference videos on their &lt;a href=&#34;https://antithesis.com/blog/&#34;&gt;blog&lt;/a&gt;, two have been posted as of 2025-05-24.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image001.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Talks&lt;/h1&gt;

&lt;p&gt;I attended almost all of the talks. Here are my notes from the ones I saw.&lt;/p&gt;
&lt;h2&gt;Will Wilson&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why is software buggy?&lt;/strong&gt; He&amp;rsquo;s the Antithesis CEO. He offered various theories about why we tolerate bugs, including my favorite, &amp;ldquo;it&amp;rsquo;s economically optimal to tolerate medium-quality software in most circumstances&amp;rdquo;, and his own favorite, &amp;ldquo;software engineering is a young discipline whose major innovations in quality are yet to come—soon we hope&amp;rdquo;.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image002.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Zac Hatfield-Dodds&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Why is everything under-tested?&lt;/strong&gt; He&amp;rsquo;s now at Anthropic, but he has a longer history as the author of the Python property-based testing library Hypothesis. I &lt;a href=&#34;https://pyfound.blogspot.com/2020/05/property-based-testing-for-python.html&#34;&gt;covered his work&lt;/a&gt; when I was a blogger for the Python Software Foundation. He described property-based testing and its advantages, and theorized that the main barrier to adoption is awareness and education. There has been a major uptick in Hypothesis usage recently.&lt;/p&gt;
&lt;p&gt;Zac said something like, &amp;ldquo;coverage-guided fuzzing reduces the time to find a bug from exponential to polynomial&amp;rdquo;. This sent me on &lt;a href=&#34;https://emptysqua.re/blog/how-long-must-i-test/&#34;&gt;a side-quest, not yet completed, about the efficiency of randomized testing&lt;/a&gt;.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image003.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Stephanie Wang&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Reliability lessons from building a cloud data warehouse from 0 to 1.&lt;/strong&gt; Stephanie was at MotherDuck until very recently, now at MongoDB on Atlas Data Federation. She described the challenges of layering a cloud DB on top of the DuckDB embedded columnar database.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image005.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Ben Eggers&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You should test in prod.&lt;/strong&gt; This is like Netflix&amp;rsquo;s &amp;ldquo;Chaos Engineering&amp;rdquo;, though I guess that term is passé for obscure factional reasons. Eggers described testing in prod at OpenAI: how to convince your manager, what to test, how to test it, how to do so safely. You should do it during waking hours, warn everyone ahead of time, and have a concrete hypothesis that you&amp;rsquo;re testing. E.g., don&amp;rsquo;t test &amp;ldquo;what happens when I take down this database?&amp;rdquo; Test, &amp;ldquo;I think when I take down this database that service X will go offline but service Y will stay up.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image006.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Kyle Kingsbury&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Jepsen 17: ACID jazz.&lt;/strong&gt; Standard Jepsen talk: he tested a few distributed services and found some interesting consistency violations. Extremely entertaining as always.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image007.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Ryan Worl&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Multi-Tenant Bill of Rights: observability for all.&lt;/strong&gt; An argument for investing in observability, with the twist that it helps diagnose &amp;ldquo;noisy neighbor&amp;rdquo; interference among tenants.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image008.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Mitchell Hashimoto&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Can we test it? Yes, we can!&lt;/strong&gt; Hashimoto created Vagrant, Terraform, and other cloud infrastructure tech, now he&amp;rsquo;s made a surprising pivot and develops a terminal emulator called Ghostty. He talked about refactoring code to make it unittestable and how to test GUIs.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image009.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Lawrie Green&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Autonomous testing masterclass: How to succeed in software testing without really trying (or, a guide to autonomous testing).&lt;/strong&gt; Principles for using Antithesis. Don&amp;rsquo;t test specific sequences of steps, test software &lt;em&gt;properties&lt;/em&gt; and let Antithesis generate all the sequences. Properties can include &amp;ldquo;X is always true&amp;rdquo; or &amp;ldquo;never true&amp;rdquo; or &amp;ldquo;eventually true&amp;rdquo;, or &amp;ldquo;a state is unreachable&amp;rdquo; or &amp;ldquo;a state is sometimes reachable&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;I noticed that &amp;ldquo;always&amp;rdquo; and &amp;ldquo;eventually&amp;rdquo; are temporal operators in TLA+ (and other temporal logics) and they can be evaluated on a single &lt;em&gt;behavior&lt;/em&gt; of a program. &amp;ldquo;Eventually&amp;rdquo; is a &lt;em&gt;liveness&lt;/em&gt; property that in theory can only be evaluated on an infinitely-long behavior. But Antithesis lets you configure a test like, &amp;ldquo;once all fault-injection ceases, the system achieves state X before some timeout T&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;Sometimes&amp;rdquo; and &amp;ldquo;unreachable&amp;rdquo; are &lt;a href=&#34;https://www.hillelwayne.com/post/hyperproperties/&#34;&gt;hyperproperties&lt;/a&gt; that must be evaluated on the &lt;em&gt;set of all possible behaviors&lt;/em&gt;. They&amp;rsquo;re not supported in TLA+. It&amp;rsquo;s cool that Antithesis allows them.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image010.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h2&gt;Ankush Desai&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gain confidence in system correctness using formal and semi-formal methods.&lt;/strong&gt; Ankush talked about &lt;a href=&#34;https://p-org.github.io/P/whatisP/&#34;&gt;P&lt;/a&gt;, his formal modeling language. Unlike TLA+, P has a very Pythonic syntax and an intuitive structure: communicating processes.&lt;/p&gt;
&lt;p&gt;I think Ankush said that procedure bodies in P can actually be coded in Python, they don&amp;rsquo;t just &lt;em&gt;look&lt;/em&gt; like Python? Unclear.&lt;/p&gt;
&lt;p&gt;P has two conformance-assurance features that interest me, but they seem shady. 1) The docs say P can be compiled to C/C#/Java, but Ankush didn&amp;rsquo;t mention this in his talk so maybe it&amp;rsquo;s not fully implemented. 2) P supports trace-checking: you can add logging statements to your implementation and feed the logs to a trace-checking utility called PObserve. Ankush mentioned this utility in his 2022 HPTS talk (&lt;a href=&#34;https://hpts.ws/papers/2022/HPTS2022.pdf&#34;&gt;slides&lt;/a&gt;, p. 20), and it gets a paragraph in Ankush and Marc Brooker&amp;rsquo;s &lt;a href=&#34;https://dl.acm.org/doi/10.1145/3712057&#34;&gt;Systems Correctness Practices at AWS: Leveraging Formal and Semi-formal Methods&lt;/a&gt;, and he mentioned it again at BugBash. But PObserve doesn&amp;rsquo;t seem open source or described in detail anywhere.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image011.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Lightning talks&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Felicitas Pojtinger: Commoditize your CI compute.&lt;/strong&gt; She started with a nice idea from &lt;a href=&#34;https://www.joelonsoftware.com/2002/06/12/strategy-letter-v/&#34;&gt;Spolsky&lt;/a&gt;: &amp;ldquo;Smart companies try to commoditize their products’ complements.&amp;rdquo; She showed a technology from startup &lt;a href=&#34;https://loopholelabs.io/&#34;&gt;Loophole&lt;/a&gt; that makes cloud providers&amp;rsquo; spot instances an interchangeable commodity, by live-migrating onto and off of spot instances without losing state. You can have a long-running job jump from one spot instance to the next, across cloud providers, while maintaining TCP connections etc.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image012.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;strong&gt;William Schultz: Interactive formal specifications.&lt;/strong&gt; Will (MongoDB research) has renamed his tla-web tool to &lt;a href=&#34;https://github.com/will62794/spectacle&#34;&gt;Spectacle&lt;/a&gt;, it&amp;rsquo;s a partial implementation of TLA+ in JavaScript. You can interactively run a spec&amp;rsquo;s actions forward and backward, choosing the next state at each transition and inspecting variables and visualizations, and share traces with your colleagues using a URL.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image013.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;strong&gt;Jayaprabhakar Kadarkarai: Modeling distributed systems.&lt;/strong&gt; &amp;ldquo;JP&amp;rdquo; has a new modeling language FizzBee. His language focuses on ease-of-use for engineers, which I agree is lacking in most other languages. He&amp;rsquo;s made impressive progress developing solo for the last 2-3 years. FizzBee has a Pythonic syntax and a ton of features, like several builtin visualizations and interactivity, and a prototype of &lt;a href=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/&#34;&gt;probabilistic performance modeling&lt;/a&gt;.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image014.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;strong&gt;Marco Primi: The traveling anteater.&lt;/strong&gt; Entertaining talk about using Antithesis to &amp;ldquo;solve&amp;rdquo; the traveling salesman problem, just demonstrates that Antithesis is a heuristic tree-searcher.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image015.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;State of Reliability panel discussion&lt;/h1&gt;

&lt;p&gt;Five men with war stories, jokes, and conflicting opinions. One interesting idea: a goal of quality and testing is to reduce nondeterminism, but AI and ML are shifting us to a world where good software is legit nondeterministic. We&amp;rsquo;ll have to develop a new testing philosophy.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image016.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Evaluation&lt;/h1&gt;

&lt;p&gt;The conference organizers had an all-male speaker lineup when they first announced BugBash. They corrected this, but they still had no women on their panel. I complained about this, and I hope they do better next year.&lt;/p&gt;
&lt;p&gt;I knew BugBash would be a relevant conference for me and the &lt;a href=&#34;https://www.mongodb.com/company/research/distributed-systems-research-group&#34;&gt;MongoDB Distributed Systems Research Group&lt;/a&gt;, but it was far better and more relevant than I expected. Lots of people I know from distributed systems research and formal methods were there. I made useful connections with people, I learned valuable things, and I had some moments of inspiration that could become research projects.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/image017.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Images by the sublime &lt;a href=&#34;https://www.scaruffi.com/museums/artzybas/index.html&#34;&gt;Boris Artzybasheff&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Are We Serious About Using TLA&#43; For Statistical Properties?</title>
      <link>https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/</link>
      <pubDate>Sat, 10 May 2025 10:59:24 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/</guid>
      
      
      
      
      <description>&lt;p&gt;At this year&amp;rsquo;s &lt;a href=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/&#34;&gt;TLA&lt;sup&gt;+&lt;/sup&gt; community event&lt;/a&gt;, I tentatively proposed adding features to make the language useful for performance modeling. Here&amp;rsquo;s the video, and a written version is below.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/Wekywox2Ghk?si=p19oHE_uIDtcZwb5&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;hr&gt;


&lt;h1&gt;Half Our Problems&lt;/h1&gt;

&lt;p&gt;In 2022, I saw Marc Brooker give a talk called &amp;ldquo;Formal Methods Only Solve Half My Problems&amp;rdquo; at &lt;a href=&#34;https://emptysqua.re/blog/2022-hpts-notes/&#34;&gt;HPTS&lt;/a&gt;. &lt;a href=&#34;https://brooker.co.za/blog/2022/06/02/formal.html&#34;&gt;He published a blog post about this too&lt;/a&gt;. He said that TLA&lt;sup&gt;+&lt;/sup&gt; can check correctness (safety and liveness), but not performance characteristics. &amp;ldquo;What I want is tools that do both: tools that allow development of formal models &amp;hellip; and then allow us to ask those models questions about design performance.&amp;rdquo; He acknowledges that correctness is important. But it&amp;rsquo;s not enough to say nothing bad ever happens, and something good eventually happens—we want to know how &lt;em&gt;quickly&lt;/em&gt; something good happens!&lt;/p&gt;
&lt;p&gt;How would we actually create such a tool? Marc said that queueing theory was the kind of math that was obviously useful for this.&lt;/p&gt;
&lt;h1&gt;Learn Queueing Theory?&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/queueing-book.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;So I decided to learn queueing theory. I asked Marc what book I should read, and he recommended &lt;em&gt;Performance Modeling and Design of Computer Systems, Queueing Theory in Action&lt;/em&gt;. The title sounds like it&amp;rsquo;s the perfect book for our purposes, so I recruited Andrew Helwer and Murat Demirbas to read it with me. &lt;a href=&#34;https://emptysqua.re/blog/review-queue-theory-book/&#34;&gt;We spent 8 months reading most of the book and doing a lot of the problem sets&lt;/a&gt;. It was definitely interesting, but mostly irrelevant to the actual problem of, you know, &lt;em&gt;performance modeling and design of computer systems&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;As an example of a queueing theory problem, here&amp;rsquo;s a diagram from Chapter 2 of the book.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/queueing-network.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;There are three servers, each with a queue where tasks arrive. A server processes tasks from its queue, and then the task either leaves the network, or it gets sent to another server&amp;rsquo;s queue. There are probability distributions for all these events: the delays between task arrivals, the time it takes for a server to process a task, the chance a task leaves the network or goes to another server.&lt;/p&gt;
&lt;p&gt;What kinds of questions does queueing theory answer about this system?&lt;/p&gt;
&lt;p&gt;In the book, one of the exercises asks a classic queueing theory question, which is also a practical question: what&amp;rsquo;s the maximum arrival rate at Server 1&amp;rsquo;s queue that the system can handle?&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/queueing-exercise.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I did my homework with paper and pencil. I solved some equations using the techniques in the book and came up with an answer: the max arrival rate is 5.2.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/pencil.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I don&amp;rsquo;t think we should try to answer performance questions by writing equations and solving for a variable! It&amp;rsquo;s hard to get right, easy to get wrong, and you have to recalculate every time the inputs change. I think we should just use simulation—run an experiment thousands of times and take the average. Call it &amp;ldquo;Monte Carlo&amp;rdquo; simulation if that makes you feel better.&lt;/p&gt;
&lt;h1&gt;Java Modelling Tools&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/jmt.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;There&amp;rsquo;s an open source toolkit that will do this for you, an academic project called &lt;a href=&#34;https://emptysqua.re/blog/java-modelling-tools/&#34;&gt;the Java Modelling Tools&lt;/a&gt;. They&amp;rsquo;re a bunch of tools for modeling queue networks and then answering questions about them, through some combination of equation-solving and simulation. I started from the exercise in the book, and I used JMT to draw this very ugly version of the same queue network:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/jmt-diagram.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I configured a bunch of things and ran a simulation where I varied the arrival rate at Server 1:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/jmt-simulation.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;JMT lets me &lt;em&gt;sweep&lt;/em&gt; the value of the arrival rate from 0 to 10 by increments of 0.1—that&amp;rsquo;s the &lt;em&gt;independent&lt;/em&gt; variable—and I can simulate the system and measure the server utilization that results—that&amp;rsquo;s the &lt;em&gt;dependent&lt;/em&gt; variable. You can see that when the arrival rate reaches 5.2, server utilization reaches 1.0, meaning it&amp;rsquo;s completely utilized and the system has reached its capacity. That&amp;rsquo;s the answer I got with my pencil, too. This kind of simulation works even for complex queue networks where the math is hard or impossible.&lt;/p&gt;
&lt;p&gt;What I&amp;rsquo;ve learned from the book and the tool is that queueing theory has super-useful concepts: arrival rate, service rate, utilization, ergodicity, Little&amp;rsquo;s Law, service discipline, open vs. closed loop, and many more. But the math is heinous. There are some simple equations to start you off, like Little&amp;rsquo;s Law, and then all hell breaks loose. So, don&amp;rsquo;t try to learn the math. No matter how good you are at math, queueing theory math won&amp;rsquo;t get you what you want. Specifically, you&amp;rsquo;ll never be able to estimate system performance by solving equations. You can only solve the equations for extremely simple systems. If you want to model a real-world system, the math will be an open research question, or literally impossible.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://brooker.co.za/blog/2022/04/11/simulation.html&#34;&gt;Just run simulations&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;If we use simulated queue networks to model performance, have we solved both Marc&amp;rsquo;s problems? No, Marc wants one tool that &lt;strong&gt;both&lt;/strong&gt; checks an algorithm&amp;rsquo;s correctness and models its performance, with a single specification.&lt;/p&gt;
&lt;h1&gt;TLA+ for Performance Modeling&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/jack-markus.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The same year Marc talked about this, Jack Vanlightly and Markus Kuppe presented &amp;ldquo;Obtaining Statistical Properties via TLC Simulation.&amp;rdquo; They described how to write a spec and use TLC to measure statistics. (&lt;a href=&#34;https://www.youtube.com/watch?v=cYenTPD7740&#34;&gt;Video&lt;/a&gt;, &lt;a href=&#34;https://conf.tlapl.us/2022/JackMarkusTLA+Statistics.pdf&#34;&gt;slides&lt;/a&gt;, &lt;a href=&#34;https://jack-vanlightly.com/blog/2024/11/19/obtaining-statistical-properties-through-modeling-and-simulation&#34;&gt;Jack&amp;rsquo;s article&lt;/a&gt;, &lt;a href=&#34;https://muratbuffalo.blogspot.com/2022/10/checking-statistical-properties-of.html&#34;&gt;Murat&amp;rsquo;s review&lt;/a&gt;.) I&amp;rsquo;ll show some code examples from &lt;a href=&#34;https://github.com/Vanlightly/formal-methods-playground/blob/master/tla/tlaplus-conf/swim/swim_stats.tla&#34;&gt;Jack&amp;rsquo;s spec for a gossip protocol&lt;/a&gt;.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;\* Increment the updates counter by the number of incoming peer states.
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;TLCSet(updates_ctr_id, TLCGet(updates_ctr_id)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;                       + Cardinality(DOMAIN incoming_peer_states))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The spec uses TLCSet to record each statistic. Here it&amp;rsquo;s counting the number of times nodes get new information. TLCSet and TLCGet access a global scratchpad, so you can record things outside of the state space.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;CSVWrite(
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &amp;#34;%1$s,%2$s,%3$s,%4$s,%5$s,%6$s,%7$s,%8$s,%9$s,&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  \o &amp;#34;%10$s,%11$s,%12$s,%13$s,%14$s,%15$s,%16$s,&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  \o &amp;#34;%17$s,%18$s,%19$s,%20$s,%21$s,%22$s,%23$s,&amp;#34;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  \o &amp;#34;%24$s,%25$s&amp;#34;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &amp;lt;&amp;lt;behaviour_id, 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    r, RoundMessageLoad(r), DirectProbeDeadMessageLoad(r),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    IndirectProbeDeadMessageLoad(r), TLCGet(updates_pr_ctr(r)),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    TLCGet(eff_updates_pr_ctr(r)),
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    alive_count, suspect_count, dead_count, alive_states_count,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    suspect_states_count, dead_states_count,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    infective_states_count, infectivity, cfg_num_members,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    cfg_dead_members, cfg_new_members, SuspectTimeout,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    DisseminationLimit, cfg_max_updates, cfg_lose_nth,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    cfg_peer_group_size, cfg_initial_contacts, MaxRound&amp;gt;&amp;gt;,
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  RoundStatsCSV)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This code appends a line to the CSV file, recording the results of the execution.
Jack had to hand-code this in TLA&lt;sup&gt;+&lt;/sup&gt;, and hand-code the logic for when to run this during TLC&amp;rsquo;s execution, and so on. It&amp;rsquo;s impressive that it works at all, but it&amp;rsquo;s kind of a hack.&lt;/p&gt;
&lt;p&gt;So my &lt;strong&gt;Complaint Number 1&lt;/strong&gt; is the syntax. What we have now in TLA&lt;sup&gt;+&lt;/sup&gt; is hard to write, it attracts bugs, and it clutters up the specification.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;\*  &amp;#39;probabilistic&amp;#39; is a random chance of losing the message
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;\*  &amp;#39;exhaustive&amp;#39; is for model checking where both options are explored
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;GetDeliveredCount() ==
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  CASE MessageLossMode = &amp;#34;probabilistic&amp;#34; -&amp;gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      IF RandomElement(1..cfg_lose_nth) = cfg_lose_nth
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      THEN {0}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      ELSE {1}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    [] MessageLossMode = &amp;#34;exhaustive&amp;#34; -&amp;gt; {0,1}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;SendMessage(msg) ==
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  \E delivered_count \in GetDeliveredCount() :
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    \* ... send the message if delivered_count is 1 ...
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Here&amp;rsquo;s some code to determine whether a message is delivered or lost. Jack has a config option called MessageLossMode, he has to set it to &amp;ldquo;probabilistic&amp;rdquo; to make statistics work, or set it to &amp;ldquo;exhaustive&amp;rdquo; to make model-checking work. My &lt;strong&gt;Complaint Number 2&lt;/strong&gt; is that randomization is incompatible with model-checking and you have to code workarounds. (&lt;strong&gt;Correction&lt;/strong&gt;: Markus says this is fixed now!)&lt;/p&gt;
&lt;p&gt;Jack has another constant called cfg_lose_nth, let&amp;rsquo;s say it&amp;rsquo;s 4, then in probabilistic mode there&amp;rsquo;s a one-in-four chance that a message is lost. You could extend this to any &lt;em&gt;rational&lt;/em&gt; number, like two-out-of-three, but you can&amp;rsquo;t work with irrational numbers. TLC doesn&amp;rsquo;t have floating-point numbers at all. So my &lt;strong&gt;Complaint Number 3&lt;/strong&gt; is that randomization is very limited. It&amp;rsquo;s just Dungeons &amp;amp; Dragons dice throws.&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/dice.png&#34; alt=&#34;dice&#34; style=&#34;max-width: 50%; margin-bottom: 1em&#34; /&gt;
&lt;/div&gt;
&lt;p&gt;If you want to measure performance, you need some sort of &lt;em&gt;cost function&lt;/em&gt;. E.g., sending a message to Tokyo might cost 2.5 times as much as sending it to London. Or its cost might be exponentially distributed. Here&amp;rsquo;s some imaginary TLA&lt;sup&gt;+&lt;/sup&gt; for accumulating the cost of an algorithm:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;\* In your dreams
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;TLCSet(cost, TLCGet(cost) + 1.0)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;TLCSet(cost, TLCGet(cost) + 2.5)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;TLCSet(cost, TLCGet(cost) + Exponential(3))
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;All the examples I&amp;rsquo;ve seen of statistical properties in TLA&lt;sup&gt;+&lt;/sup&gt; are counting the number of times some discrete event occurs, like counting the number of messages. But if we want to measure the performance of a system as a whole, we need to measure a variety of events, and they have to have &lt;em&gt;relative&lt;/em&gt; costs! &lt;strong&gt;Complaint Number 4&lt;/strong&gt; is we have no floats and no probability distributions besides &amp;ldquo;uniform&amp;rdquo;.&lt;/p&gt;
&lt;h1&gt;Statistical Properties in TLA&lt;sup&gt;+&lt;/sup&gt;, So Far&lt;/h1&gt;

&lt;p&gt;I think it&amp;rsquo;s just a proof of concept. Jack and Markus&amp;rsquo;s technique works and it&amp;rsquo;s useful. What they accomplished is impressive and it started the conversation. But it&amp;rsquo;s insufficient if we&amp;rsquo;re serious about using TLA&lt;sup&gt;+&lt;/sup&gt; for statistics. TLCSet and the CSV module provide a manual, low-level API to save statistics. Compared to safety and liveness checking, which are first-class features of the language and tools, statistical modeling really clutters up the specification.&lt;/p&gt;
&lt;p&gt;As we saw with Jack&amp;rsquo;s gossip protocol, it&amp;rsquo;s not only a large amount of code, but it&amp;rsquo;s error-prone, you have to write it into each spec, and you don&amp;rsquo;t get the benefit of a tested and documented library. People need a concise and easy API for saving statistics.&lt;/p&gt;
&lt;p&gt;The Randomization module only gives us uniform distributions with integer parameters. Performance modeling requires lots of distributions, such as Poisson, exponential, and Zipf. Limiting probabilities to ratios of integers is an artifact of our unfinished tooling; we should give TLA&lt;sup&gt;+&lt;/sup&gt; authors the freedom to choose whatever probabilities they want.&lt;/p&gt;
&lt;p&gt;If you want to model the performance of a whole system, you need to combine the costs of different actions. You can do that manually with TLCSet, but there&amp;rsquo;s no standard convenient way to do it.&lt;/p&gt;
&lt;p&gt;And there are no floats. This makes sense for model-checking, but it&amp;rsquo;s a silly limitation for performance modeling. I&amp;rsquo;ll stop talking about that before I get too annoying.&lt;/p&gt;
&lt;h1&gt;Statistical Properties: Are We Serious?&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/clock.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;So, what we have right now in  TLA&lt;sup&gt;+&lt;/sup&gt; isn&amp;rsquo;t really solving both of Marc&amp;rsquo;s problems. Should we try? Should  TLA&lt;sup&gt;+&lt;/sup&gt; be the tool that solves both, or should we stay focused on safety?&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know. I haven&amp;rsquo;t implemented anything, I&amp;rsquo;m just offering ideas. And I want to admit I&amp;rsquo;m not offering any time or money to this project, so I can&amp;rsquo;t make any demands. I&amp;rsquo;m just brainstorming with you.&lt;/p&gt;
&lt;h1&gt;State of the Art&lt;/h1&gt;

&lt;p&gt;As a start, let&amp;rsquo;s look at some existing tools: JMT, PRISM, Runway, and FizzBee.&lt;/p&gt;
&lt;h2&gt;State of the Art #1 of 4: Java Modelling Tools&lt;/h2&gt;

&lt;p&gt;We saw JMT above; it&amp;rsquo;s the tool where you can draw queue networks. It&amp;rsquo;s made for statistical modeling and answering performance questions. It has a point-and-click interface, which is good for small models and bad for big ones. There&amp;rsquo;s also a Python API for generating large models. It comes with lots of probability distributions built-in, it can model cost functions, and you can use real-world data sets as inputs! E.g., if you have some data about the distribution of network latencies, you can import that into a JMT model.&lt;/p&gt;
&lt;h2&gt;State of the Art #2 of 4: PRISM&lt;/h2&gt;

&lt;p&gt;&lt;a href=&#34;https://www.hillelwayne.com/tags/prism/&#34;&gt;PRISM&lt;/a&gt; is built for exactly this problem. It&amp;rsquo;s a &amp;ldquo;probabilistic model checker&amp;rdquo;: you write a spec of an algorithm and ask statistical questions, like the probability of something happening, or the average cost.&lt;/p&gt;
&lt;p&gt;Actions in PRISM look like this:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;[my_action] x=0 -&amp;gt; 0.8:(x&amp;#39;=1) + 0.2:(x&amp;#39;=2);
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You have an action name (&amp;ldquo;my_action&amp;rdquo;), a guard, which is some predicate (this action is enabled if &lt;em&gt;x&lt;/em&gt; is zero), and some state updates with probabilities. There&amp;rsquo;s a 0.8 probability that it sets &lt;em&gt;x&lt;/em&gt; to 1 and a 0.2 probability that it sets &lt;em&gt;x&lt;/em&gt; to 2. These numbers are either &lt;em&gt;probabilities&lt;/em&gt; that something happens, or the &lt;em&gt;rates&lt;/em&gt; at which things happen, depending on whether you&amp;rsquo;re writing a discrete-time or continuous-time model.&lt;/p&gt;
&lt;p&gt;PRISM has built-in support for cost functions. A cost is just anything you want to measure, any way of expressing performance. PRISM calls costs &amp;ldquo;rewards,&amp;rdquo; but if you&amp;rsquo;re thinking of reinforcement learning, stop. This is not related to reinforcement learning.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;rewards
    x=0 : 100;
    x&amp;gt;0 &amp;amp; x&amp;lt;10 : 2*x;
endrewards
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;This code says that every state where &lt;em&gt;x&lt;/em&gt; equals zero has a reward of 100, and if &lt;em&gt;x&lt;/em&gt; is between 0 and 10 then the reward is twice &lt;em&gt;x&lt;/em&gt;. You can imagine writing a reward function that measures revenue, or a reward function that&amp;rsquo;s actually a cost function that measures something bad, like latency.&lt;/p&gt;
&lt;p&gt;PRISM has remarkably powerful property expressions. Here&amp;rsquo;s an example from their website, it&amp;rsquo;s &amp;ldquo;the probability that more than 5 errors occur within the first 100 time units is less than 0.1.&amp;rdquo; This is an &lt;em&gt;assertion&lt;/em&gt; that PRISM checks.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;P&amp;lt;0.1 [ F&amp;lt;=100 num_errors &amp;gt; 5 ]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;Here&amp;rsquo;s another, &amp;ldquo;the probability that process 1 terminates before process 2 does.&amp;rdquo; This is a question—what &lt;em&gt;is&lt;/em&gt; the probability?—and PRISM will give you the answer.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;P=? [ !proc2_terminate U proc1_terminate ]
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;You could express safety as an assertion: the probability of something bad happening is zero. Liveness is an assertion that the long-run probability of something good happening is 1. But there are much more complex and powerful probabilistic properties you can express and check in PRISM, like saying that the 95th-percentile latency of an algorithm is less than some acceptable amount.&lt;/p&gt;
&lt;p&gt;The PRISM website has a bunch of great examples. Here are statistics from a &lt;a href=&#34;https://www.prismmodelchecker.org/casestudies/gossip.php&#34;&gt;PRISM model of a gossip protocol&lt;/a&gt;. As nodes discover each other over time, they find shorter paths from node to node:&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/prism-chart.png&#34; style=&#34;max-width: 50%;&#34; /&gt;
&lt;/div&gt;
&lt;p&gt;This kind of analysis and graphing is built into PRISM, it was born for this.&lt;/p&gt;
&lt;p&gt;PRISM can only express models that are simple enough to analyze by solving equations. So the PRISM language has no arrays, no hashtables, no sets. It&amp;rsquo;s just scalar variables with float values, pretty much. That makes PRISM specs kind of awful.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the PRISM gossip protocol. This model has 4 nodes. Each node&amp;rsquo;s view of each other node is a separate variable, so that&amp;rsquo;s 16 variables:&lt;/p&gt;
&lt;pre class=&#34;prism-code&#34;&gt;
&lt;span class=&#34;prismcomment&#34;&gt;// initial view of node 1 (can see 2 one hop away)&lt;/span&gt;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv1_1_a&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;2&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv1_2_a&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv1_1_h&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv1_2_h&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;;

&lt;span class=&#34;prismcomment&#34;&gt;// initial view of node 2 (empty)&lt;/span&gt;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv2_1_a&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv2_2_a&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv2_1_h&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv2_2_h&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;;

&lt;span class=&#34;prismcomment&#34;&gt;// initial view of node 3 (can see 2 one hop away)&lt;/span&gt;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv3_1_a&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;2&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv3_2_a&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv3_1_h&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv3_2_h&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;;

&lt;span class=&#34;prismcomment&#34;&gt;// initial view of node 4 (can see 2 one hop away)&lt;/span&gt;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv4_1_a&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;2&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv4_2_a&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv4_1_h&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;;
&lt;span class=&#34;prismkeyword&#34;&gt;const&lt;/span&gt; &lt;span class=&#34;prismkeyword&#34;&gt;int&lt;/span&gt; &lt;span class=&#34;prismident&#34;&gt;iv4_2_h&lt;/span&gt; = &lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;;
&lt;/pre&gt;
&lt;p&gt;If you had 5 nodes, you&amp;rsquo;d need 25 variables. You can&amp;rsquo;t write any generic code. As the number of nodes grows, the code in this spec grows. This is Node 1&amp;rsquo;s code for talking to Nodes 2, 3, and 4:&lt;/p&gt;
&lt;pre class=&#34;prism-code&#34;&gt;
&lt;span class=&#34;prismcomment&#34;&gt;// send to node 2&lt;/span&gt;
[&lt;span class=&#34;prismident&#34;&gt;push1_2_0&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id2&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt; -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;+&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;);
[&lt;span class=&#34;prismident&#34;&gt;push1_2_1&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id2&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;v1_1_h&lt;/span&gt;&amp;lt;&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt; -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;s1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;send1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;);
[&lt;span class=&#34;prismident&#34;&gt;push1_2_end&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id2&lt;/span&gt; &amp;amp; ((&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;&amp;amp;&lt;span class=&#34;prismident&#34;&gt;v1_1_h&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;) | (&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;2&lt;/span&gt;&amp;amp;&lt;span class=&#34;prismident&#34;&gt;v1_2_h&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;)) -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;s1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;send1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;);
&lt;span class=&#34;prismcomment&#34;&gt;// send to node 3&lt;/span&gt;
[&lt;span class=&#34;prismident&#34;&gt;push1_3_0&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt; -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;+&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;);
[&lt;span class=&#34;prismident&#34;&gt;push1_3_1&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;v1_1_h&lt;/span&gt;&amp;lt;&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt; -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;s1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;send1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;);
[&lt;span class=&#34;prismident&#34;&gt;push1_3_end&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id3&lt;/span&gt; &amp;amp; ((&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;&amp;amp;&lt;span class=&#34;prismident&#34;&gt;v1_1_h&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;) | (&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;2&lt;/span&gt;&amp;amp;&lt;span class=&#34;prismident&#34;&gt;v1_2_h&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;)) -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;s1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;send1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;);
&lt;span class=&#34;prismcomment&#34;&gt;// send to node 4&lt;/span&gt;
[&lt;span class=&#34;prismident&#34;&gt;push1_4_0&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id4&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt; -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;+&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;);
[&lt;span class=&#34;prismident&#34;&gt;push1_4_1&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id4&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;v1_1_h&lt;/span&gt;&amp;lt;&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt; -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;s1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;send1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;);
[&lt;span class=&#34;prismident&#34;&gt;push1_4_end&lt;/span&gt;] &lt;span class=&#34;prismident&#34;&gt;s1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;3&lt;/span&gt; &amp;amp; &lt;span class=&#34;prismident&#34;&gt;send1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;id4&lt;/span&gt; &amp;amp; ((&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;1&lt;/span&gt;&amp;amp;&lt;span class=&#34;prismident&#34;&gt;v1_1_h&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;) | (&lt;span class=&#34;prismident&#34;&gt;i1&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;2&lt;/span&gt;&amp;amp;&lt;span class=&#34;prismident&#34;&gt;v1_2_h&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;4&lt;/span&gt;)) -&amp;gt; (&lt;span class=&#34;prismident&#34;&gt;s1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;i1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;) &amp;amp; (&lt;span class=&#34;prismident&#34;&gt;send1&#39;&lt;/span&gt;=&lt;span class=&#34;prismnum&#34;&gt;0&lt;/span&gt;);
&lt;/pre&gt;
&lt;p&gt;Node 2 borrows Node 1&amp;rsquo;s code by horribly renaming all its functions:&lt;/p&gt;
&lt;pre class=&#34;prism-code&#34;&gt;
&lt;span class=&#34;prismident&#34;&gt;push1_2_0&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_1_0&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_2_1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_1_1&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_2_2&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_1_2&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_2_3&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_1_3&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_2_end&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_1_end&lt;/span&gt;,
&lt;span class=&#34;prismident&#34;&gt;push1_3_0&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_3_0&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_3_1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_3_1&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_3_2&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_3_2&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_3_3&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_3_3&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_3_end&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_3_end&lt;/span&gt;,
&lt;span class=&#34;prismident&#34;&gt;push1_4_0&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_4_0&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_4_1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_4_1&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_4_2&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_4_2&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_4_3&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_4_3&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push1_4_end&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push2_4_end&lt;/span&gt;,
&lt;span class=&#34;prismident&#34;&gt;push2_1_0&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push1_2_0&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push2_1_1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push1_2_1&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push2_1_2&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push1_2_2&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push2_1_3&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push1_2_3&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push2_1_end&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push1_2_end&lt;/span&gt;,
&lt;span class=&#34;prismident&#34;&gt;push3_1_0&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push3_2_0&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push3_1_1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push3_2_1&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push3_1_2&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push3_2_2&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push3_1_3&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push3_2_3&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push3_1_end&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push3_2_end&lt;/span&gt;,
&lt;span class=&#34;prismident&#34;&gt;push4_1_0&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push4_2_0&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push4_1_1&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push4_2_1&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push4_1_2&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push4_2_2&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push4_1_3&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push4_2_3&lt;/span&gt;, &lt;span class=&#34;prismident&#34;&gt;push4_1_end&lt;/span&gt;=&lt;span class=&#34;prismident&#34;&gt;push4_2_end&lt;/span&gt;
&lt;/pre&gt;
&lt;p&gt;I&amp;rsquo;m only showing you a fraction of this code to preserve your sanity. Nodes 2 and 3 have the same kind of thing. This is an abomination. This code is banned by the Geneva Convention. The grad student who wrote this code was found lying on the floor of the computer science lab, drooling.&lt;/p&gt;
&lt;p&gt;Actually it was probably generated by a program in another language. PRISM is like an assembly code for expressing Markov chains. This makes it tractable for PRISM&amp;rsquo;s equation solver to come up with exact solutions, but it&amp;rsquo;s obviously not a good language.&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/stopwatch.png&#34; style=&#34;width: 50%; margin-bottom: 1em&#34;&gt;
&lt;/div&gt;
&lt;h2&gt;State of the Art #3 of 4: Runway&lt;/h2&gt;

&lt;p&gt;In 2016, Diego Ongaro announced a formal modeling system called Runway (&lt;a href=&#34;https://runway.systems/&#34;&gt;website&lt;/a&gt;, &lt;a href=&#34;https://www.usenix.org/system/files/login/articles/login_fall16_06_ongaro.pdf&#34;&gt;article&lt;/a&gt;, &lt;a href=&#34;https://groups.google.com/g/tlaplus/c/5Xd8kv288jE/m/k4xOYydpAAAJ&#34;&gt;Andrew&amp;rsquo;s summary&lt;/a&gt;.) Ongaro had published Raft a couple years before, then he announced this very promising-looking thing but stopped working on it the same year. This is a snippet from a Runway specification of Raft.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-typescript&#34; data-lang=&#34;typescript&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;function&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;quorum&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;serverSet&lt;/span&gt;: &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;Set&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;lt;&lt;/span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;ServerId&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;&amp;gt;[&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ServerId&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;])&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;-&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;Boolean&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;return&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;size&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;serverSet&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;*&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;2&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;&amp;gt;&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;size&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;servers&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;function&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;sendMessage&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;message&lt;/span&gt;: &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;Message&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt; &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;message&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;sentAt&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;later&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;0&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;message&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;deliverAt&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;later&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;urandomRange&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;10000&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#0000cf;font-weight:bold&#34;&gt;20000&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;));&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  &lt;span style=&#34;color:#000&#34;&gt;push&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;network&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;message&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;);&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Runway&amp;rsquo;s syntax is familiar for programmers. And Runway has randomization built in. If Runway is in model-checking mode, it somehow tries all possibilities; I don&amp;rsquo;t know how. If Runway is in simulation mode, randomization works the way you&amp;rsquo;d expect, and Runway can gather statistics.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s an elevator simulation written in Runway:&lt;/p&gt;
&lt;video controls style=&#34;width: 100%; margin-bottom: 1em&#34;&gt;
  &lt;source src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/elevator.mp4&#34; type=&#34;video/mp4&#34;&gt;
&lt;/video&gt;
&lt;p&gt;The visualization on the right needs lots of custom code, but the graph on the left is I think easy to create with Runway&amp;rsquo;s built-in tools. If we want to develop TLA&lt;sup&gt;+&lt;/sup&gt; tools in this direction, we should make this kind of graph convenient for spec authors to generate.&lt;/p&gt;
&lt;h2&gt;State of the Art #4 of 4: FizzBee&lt;/h2&gt;

&lt;div style=&#34;text-align: center&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/fizzbee-logo.png&#34; style=&#34;width: 40%; margin-bottom: 1em&#34;&gt;
&lt;/div&gt;
&lt;p&gt;&lt;a href=&#34;https://fizzbee.io/&#34;&gt;FizzBee&lt;/a&gt; is a new modeling language developed by a solo author, Jayaprabhakar &amp;ldquo;JP&amp;rdquo; Kadarkarai. Here&amp;rsquo;s an example FizzBee spec &lt;a href=&#34;https://fizzbee.io/design/tutorials/performance-modeling/&#34;&gt;from the FizzBee docs&lt;/a&gt;:&lt;/p&gt;
&lt;table style=&#34;width: 100%; border-spacing: 0 1em;&#34;&gt;
  &lt;tr style=&#34;text-align: center; font-weight: bold; font-family: monospace; text-decoration: underline&#34;&gt;
    &lt;td&gt;cache.fizz&lt;/td&gt;
    &lt;td&gt;perf_model.yaml&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr style=&#34;vertical-align: top;&#34;&gt;
    &lt;td &gt;&lt;pre style=&#34;margin-top: 1em&#34;&gt;atomic action Lookup:
  cached = LookupCache()
  if cached == &#34;hit&#34;:
      return cached
  found = LookupDB()
  return found
 
func LookupCache():
  oneof:
    `hit` return &#34;hit&#34;
    `miss` return &#34;miss&#34;&lt;/pre&gt;&lt;/td&gt;
    &lt;td&gt;&lt;pre style=&#34;margin-top: 1em&#34;&gt;configs:
  LookupCache.call:
    counters:
      latency_ms:
        numeric: 10
  LookupCache.hit:
    probability: 0.2
  LookupCache.miss:
    probability: 0.8&lt;/pre&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;The spec describes a service that tries to look up data in a cache, then falls back to a database query. The algorithm is in cache.fizz on the left; Lookup and LookupCache are defined there. (I omitted LookupDB.) On the right is a config file that&amp;rsquo;s &lt;em&gt;only&lt;/em&gt; for probabilistic modeling! It defines the cost of a call to the LookupCache function: 10 milliseconds of latency. PRISM avoids the valenced &amp;ldquo;cost&amp;rdquo; / &amp;ldquo;reward&amp;rdquo; terms and calls this value a &amp;ldquo;counter&amp;rdquo;, but it can be a float. Counters and probabilities are kept in this separate config file. They don&amp;rsquo;t clutter the spec, and they&amp;rsquo;re not needed for ordinary model-checking (i.e. checking safety properties).&lt;/p&gt;
&lt;p&gt;The LookupCache function is nondeterministic, it either finds the data or doesn&amp;rsquo;t. The transitions are labeled as `hit` or `miss` in the spec, and their probabilities are defined as 0.2 and 0.8 in the config file. Again, the algorithm is kept separate from the probabilistic config.&lt;/p&gt;
&lt;p&gt;FizzBee&amp;rsquo;s probabilistic modeler, a.k.a. the performance model checker, is a separate program from the safety model-checker. The former produces the state graph and saves it to disk; the latter reads the state graph and produces statistics. If you run the performance modeler on the files above, it prints:&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;Metrics(mean={&amp;#39;latency_ms&amp;#39;: 84.4}, histogram=[(0.2, {&amp;#39;latency_ms&amp;#39;: 10.0}), (1.0000000000000002, {&amp;#39;latency_ms&amp;#39;: 103.0})])
   2: 0.20000000 state: {} / returns: {&amp;#34;Lookup&amp;#34;:&amp;#34;\&amp;#34;hit\&amp;#34;&amp;#34;}
   4: 0.72000000 state: {} / returns: {&amp;#34;Lookup&amp;#34;:&amp;#34;\&amp;#34;found\&amp;#34;&amp;#34;}
   5: 0.08000000 state: {} / returns: {&amp;#34;Lookup&amp;#34;:&amp;#34;\&amp;#34;notfound\&amp;#34;&amp;#34;}
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The mean latency to look up one value is 84.4, and the return value of Lookup is &amp;ldquo;hit&amp;rdquo; 20% of the time, &amp;ldquo;found&amp;rdquo; 72% of the time, and &amp;ldquo;notfound&amp;rdquo; 8% of the time, as we&amp;rsquo;d expect from the defined probabilities.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s the config file again, but with a probability &lt;em&gt;distribution&lt;/em&gt; for the cost function.&lt;/p&gt;
&lt;p&gt;&lt;span style=&#34;text-align: center; font-weight: bold; font-family: monospace; text-decoration: underline; font-size: smaller&#34;&gt;perf_model.yaml&lt;/span&gt;&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;configs:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  LookupCache.call:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    counters:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      latency_ms:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;        distribution: lognorm(s=0.3, loc=2)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;FizzBee lets you choose any probability distribution supported by SciPy, &lt;a href=&#34;https://docs.scipy.org/doc/scipy/tutorial/stats/probability_distributions.html&#34;&gt;which is basically all of them&lt;/a&gt;. Or you can bring your own histogram of costs: if you have experimental data, you can use that, the same as with the Java Modelling Tools. Marc Brooker said he wants a tool that uses &amp;ldquo;real-world data on network performance, packet loss, and user workloads,&amp;rdquo; and FizzBee allows this.&lt;/p&gt;
&lt;p&gt;I think FizzBee&amp;rsquo;s design is smart. FizzBee has built in some important ideas: cost functions, probability distributions, and probabilities of state transitions. We don&amp;rsquo;t have to write a bunch of error-prone code for each spec, the batteries are included. What&amp;rsquo;s really nice is that the performance stuff is all in a separate file, so it doesn&amp;rsquo;t clutter the main spec. In TLA&lt;sup&gt;+&lt;/sup&gt; &lt;a href=&#34;https://lamport.azurewebsites.net/pubs/yuanyu-model-checking.pdf&#34;&gt;we try to keep model-checking details out of the main spec&lt;/a&gt;; FizzBee lets you do the same with performance modeling.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know how much of this is implemented. FizzBee is new and JP has been developing it solo. But that&amp;rsquo;s not important because we&amp;rsquo;re just gathering design ideas for what TLA&lt;sup&gt;+&lt;/sup&gt; might do.&lt;/p&gt;
&lt;h1&gt;A Menu for TLA&lt;sup&gt;+&lt;/sup&gt;&lt;/h1&gt;

&lt;p&gt;I&amp;rsquo;ve grouped the features we&amp;rsquo;ve seen so far into three categories:&lt;/p&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/venn.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/venn.png&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;Some of the state-of-the-art tools can annotate state transitions with probabilities. Some support cost / reward functions, which FizzBee calls &amp;ldquo;counters&amp;rdquo;. PRISM has powerful statistical property expressions.
In Runway and FizzBee, model-checking is compatible with performance modeling (and now TLA&lt;sup&gt;+&lt;/sup&gt; too), and FizzBee nicely separates performance modeling config. Some include convenient charting. They all support floating-point numbers. Some of them ship with common probability distributions for rates and cost functions. Some can use experimental data as a probability distribution (&amp;ldquo;bring your own histogram&amp;rdquo;). Some of them come with solvers that calculate answers precisely, but this limits the models to those that can be solved in closed form. Others use statistical sampling, which is imprecise but much more powerful.&lt;/p&gt;
&lt;h1&gt;Possible Syntax??&lt;/h1&gt;

&lt;p&gt;Here&amp;rsquo;s the moment I dread, when I suggest a syntax. Everywhere a spec has nondeterminism for model-checking, we need to replace that with some sort of probability distribution. E.g., this formula picks an element nondeterministically from a set:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;\* MySpec.tla
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;SendMessage(m) ==
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  \E messageIsDropped \in {FALSE, TRUE}:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ...
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;If you want a distribution &lt;em&gt;besides&lt;/em&gt; uniform, you could wrap the set (or any collection) in an operator like this:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;\* MySpec.tla
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;SendMessage(m) ==
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  \E messageIsDropped \in MessageLossProbability(FALSE, TRUE):
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    ...
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;The operator does nothing in model-checking mode, but it&amp;rsquo;s like a label for a probability distribution.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;\* MySpec.cfg
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;DISTRIBUTION
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  MessageLossProbability = BooleanChoice(0.23)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;In the config file, labeling this operator as a DISTRIBUTION means the model-checker treats it as either true or false, and it explores both possibilities nondeterministically. The performance-modeler knows that this is a particular probability, perhaps BooleanChoice is the name of a built-in utility to return its first argument with probability 0.23, otherwise its second argument. Like FizzBee, we use labels to tie the spec to the configuration.&lt;/p&gt;
&lt;p&gt;We also need cost functions in TLA&lt;sup&gt;+&lt;/sup&gt;. We could set the cost of each &lt;em&gt;action&lt;/em&gt; to be a constant or a probability distribution. The distribution could come from real-world data, and it could be parameterized if the action is parameterized.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;\* MySpec.cfg
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;COST
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  SendMessage = Exponential(3.17)
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;This is the part of my talk I&amp;rsquo;m least sure about. If you hate this syntax, that&amp;rsquo;s fine, don&amp;rsquo;t get angry at the whole idea because you hate my syntax.&lt;/p&gt;
&lt;h1&gt;TLA&lt;sup&gt;+&lt;/sup&gt; with Probabilistic Solvers&lt;/h1&gt;

&lt;p&gt;If we implement any of this in TLA&lt;sup&gt;+&lt;/sup&gt;, we&amp;rsquo;ll need some way to actually &lt;em&gt;answer&lt;/em&gt; these probabilistic questions. In order of ambitiousness:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Just use &lt;a href=&#34;https://groups.google.com/g/tlaplus/c/2lpTFq6uFsE/m/Dy5Ezw29AgAJ&#34;&gt;TLC&amp;rsquo;s -generate option&lt;/a&gt;, generate thousands of behaviors, average the stats.&lt;/li&gt;
&lt;li&gt;Use -generate, run until stats stabilize within some precision. Perhaps prune branches of the state graph as they stabilize. JMT does similar smart exploration of simulations.&lt;/li&gt;
&lt;li&gt;Use PRISM&amp;rsquo;s solvers, by translating the state graph to PRISM code, or otherwise borrowing PRISM&amp;rsquo;s implementation.&lt;/li&gt;
&lt;li&gt;Write a solver or solvers from scratch: translate the state graph to a Markov chain and find its steady-state probability distribution.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The latter two options will find closed-form solutions with precise answers, but severely limit our options for what models we can use for probabilistic questions. I think we should just use simulation, period.&lt;/p&gt;
&lt;h1&gt;TLA&lt;sup&gt;+&lt;/sup&gt; with Performance Modeling&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/flying.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;If we actually had TLA&lt;sup&gt;+&lt;/sup&gt; with performance modeling, it could be really cool! One model could:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Express the algorithm.&lt;/li&gt;
&lt;li&gt;Check correctness.&lt;/li&gt;
&lt;li&gt;Evaluate performance.&lt;/li&gt;
&lt;li&gt;Simulate &amp;ldquo;what-if&amp;rdquo; experiments using real-world inputs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We could confidently explore optimizations: we&amp;rsquo;d know &lt;em&gt;both&lt;/em&gt; whether a change to an algorithm is safe, &lt;em&gt;and&lt;/em&gt; if it wins us anything.&lt;/p&gt;
&lt;h1&gt;Questions&lt;/h1&gt;

&lt;p&gt;I didn&amp;rsquo;t take questions during my talk, because I didn&amp;rsquo;t have any answers, so I asked the audience questions instead.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;What syntax should TLA&lt;sup&gt;+&lt;/sup&gt; use for annotating state transitions with probabilities?&lt;/li&gt;
&lt;li&gt;What syntax for cost functions?&lt;/li&gt;
&lt;li&gt;How do we separate performance-modeling config from the spec and model-checking config?&lt;/li&gt;
&lt;li&gt;Should TLC do the probabilistic checking, or another tool?&lt;/li&gt;
&lt;li&gt;Could the TLA&lt;sup&gt;+&lt;/sup&gt; Foundation get new funding for this work?&lt;/li&gt;
&lt;li&gt;Is any of this a good idea or should TLA&lt;sup&gt;+&lt;/sup&gt; stick to correctness?&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;Further Reading&lt;/h1&gt;

&lt;p&gt;Andrew Helwer, &lt;a href=&#34;https://ahelwer.ca/post/2020-09-11-probabilistic-distsys/&#34;&gt;How do you reason about a probabilistic distributed system?&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Joost-Pieter Katoen &lt;a href=&#34;https://i-cav.org/2015/wp-content/uploads/2015/07/mod12_katoen.pdf&#34;&gt;Model Checking Meets Probability: A Gentle Introduction&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Acks: Andrew Helwer, Jayaprabhakar Kadarkarai, Murat Demirbas, and Will Schultz generously helped me.
The good ideas are theirs and the bad ideas are mine.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Jesse&#39;s 2025 TLA&#43; Community Event Notes</title>
      <link>https://emptysqua.re/blog/2025-tlaplus-community-event/</link>
      <pubDate>Thu, 08 May 2025 21:47:33 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/2025-tlaplus-community-event/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-1.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Here are my notes from the 2025 TLA&lt;sup&gt;+&lt;/sup&gt; Community Event. See also &lt;a href=&#34;https://muratbuffalo.blogspot.com/2025/05/notes-from-tla-community-event.html&#34;&gt;my colleague Murat Demirbas&amp;rsquo;s notes&lt;/a&gt; (he was one of the event&amp;rsquo;s organizers). The talks were all recorded, and the videos, slides, and abstracts &lt;a href=&#34;https://conf.tlapl.us/2025-etaps/&#34;&gt;are posted&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Some years &lt;a href=&#34;https://conf.tlapl.us/home/&#34;&gt;there are TLA&lt;sup&gt;+&lt;/sup&gt; &amp;ldquo;conferences&amp;rdquo; co-located with &lt;em&gt;industry&lt;/em&gt; conferences&lt;/a&gt;, and some years there are &amp;ldquo;community events&amp;rdquo; co-located with &lt;em&gt;academic&lt;/em&gt; conferences. This year was a community event, the first TLA&lt;sup&gt;+&lt;/sup&gt; community event in North America. There were about 20 participants, evenly divided between academia and industry. I knew most of them already. If I recall correctly, the last TLA&lt;sup&gt;+&lt;/sup&gt; thing I attended was the 2021 &lt;em&gt;conference&lt;/em&gt;; it had more participants, perhaps because it was co-located with the huge StrangeLoop industry conference.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-2.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;ModelFuzz: Model guided fuzzing of distributed systems&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://www.srinidhin.com/&#34;&gt;Srinidhi Nagendra&lt;/a&gt;, Max Planck Institute for Software Systems.&lt;/p&gt;
&lt;p&gt;Srini summarized his paper, &lt;a href=&#34;http://arxiv.org/abs/2410.02307&#34;&gt;Model-guided Fuzzing of Distributed Systems&lt;/a&gt;. Srini&amp;rsquo;s ModelFuzz is coverage-guided fuzzing of the implementation, where &amp;ldquo;coverage&amp;rdquo; is defined by the model. ModelFuzz executes an implementation test (which executes a workload and occasionally crashes or restarts a server), captures an execution trace, maps the trace back to a sequence of TLA&lt;sup&gt;+&lt;/sup&gt; model states, and guides the next round of tests to maximize coverage of the TLA&lt;sup&gt;+&lt;/sup&gt; model states. For distributed systems, maximizing this sort of coverage is a more efficient way to discover bugs than e.g. &lt;a href=&#34;https://afl-1.readthedocs.io/en/latest/about_afl.html#coverage-measurements&#34;&gt;branch coverage like AFL uses&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Srini&amp;rsquo;s example is a Raft spec and several Raft implementations. He records all messages among Raft nodes, as well as events like node crashes, and replays this execution on the model. The spec had to be slightly restructured to make it easier to map implementation actions to model actions. He also needed to &amp;ldquo;abstract states&amp;rdquo; (something like a symmetry set in TLA&lt;sup&gt;+&lt;/sup&gt; config) so that states which differ in uninteresting ways are not considered distinct states that improve coverage. E.g., if you crash the leader 1000 times, you&amp;rsquo;ll get 1000 terms without truly exploring new states. Choosing the right model abstraction is an art. I don&amp;rsquo;t understand why he reinvented this idea instead of just using TLA&lt;sup&gt;+&lt;/sup&gt;&amp;rsquo;s existing symmetry sets or views.&lt;/p&gt;
&lt;p&gt;ModelFuzz generates implementation tests in a really surprising way. Srini created a central proxy through which all Raft messages pass. ModelFuzz counts the number of messages that are delivered to each server, and then &amp;ldquo;mutates&amp;rdquo; a test by buffering messages in the proxy and changing the order in which the proxy delivers messages, as well as perturbing the schedule of server crashes. There is a lot of nondeterminism in the system already and ModelFuzz is jiggling it even more. Srini shows that mutating tests strategically will moderately improve model coverage, compared to just rerunning the test many times and relying on inherent nondeterminism to improve model coverage, and compared to purely random mutation of tests. The charts he showed us seemed to show a rather small improvement, but he found a lot of new bugs!&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s interesting that Srini was able to explore more of the state space with an intervention that&amp;rsquo;s reasonably easy to implement. But I think &lt;a href=&#34;https://antithesis.com/&#34;&gt;Antithesis&lt;/a&gt; is better. Model-guided fuzzing is still useful, but I predict that Antithesis or other deterministic hypervisors will soon be the overwhelmingly dominant means of actually doing the fuzzing.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-3.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Automating Trace Validation with PGo&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://fhackett.com/&#34;&gt;Finn Hackett&lt;/a&gt; (presenting) and &lt;a href=&#34;https://www.cs.ubc.ca/~bestchai/&#34;&gt;Ivan Beschastnikh&lt;/a&gt;, University of British Columbia.&lt;/p&gt;
&lt;p&gt;Finn is my &lt;a href=&#34;https://www.mongodb.com/academia/phd-fellowship&#34;&gt;PhD fellow&lt;/a&gt; and will be my PhD intern starting in June. &lt;a href=&#34;https://www.cs.ubc.ca/~bestchai/papers/asplos23-pgo.pdf&#34;&gt;His PGo project&lt;/a&gt; translates a variant of PlusCal into Go. He wants to use trace-validation to check that the Go really matches the PlusCal; his trace-validator is called TraceLink. It uses log statements that are autogenerated along with the rest of the Go code, and it introspects the PlusCal to see what variables must be matched per trace line, and generates some TLA&lt;sup&gt;+&lt;/sup&gt; to validate a trace against the implementation. If validation fails, TraceLink provides some debug info, as TLA&lt;sup&gt;+&lt;/sup&gt; &lt;a href=&#34;https://learntla.com/topics/cli.html&#34;&gt;aliases&lt;/a&gt; that appear in TLC&amp;rsquo;s counterexample trace. This includes a list of enabled next-actions at the moment the trace diverged from the spec.&lt;/p&gt;
&lt;p&gt;For each PlusCal label block (equivalent to a TLA&lt;sup&gt;+&lt;/sup&gt; action), PGo outputs a critical section of Go code, which ends in commit/abort. If it commits, it updates the program&amp;rsquo;s state and logs the state transition for trace-checking&amp;rsquo;s sake. If it aborts it changes no state, but it still logs the state it &lt;em&gt;observed&lt;/em&gt;, so even aborted critical sections can be trace-checked.&lt;/p&gt;
&lt;p&gt;Finn&amp;rsquo;s system uses vector clocks to order log messages. This means there are many possible total orderings, and &lt;em&gt;all&lt;/em&gt; of them should be valid behaviors according to the spec! (I keep forgetting and misunderstanding this.) A PGo-generated system should log all modeled variables on each state transition, and always increment the vector clock during every interaction between processes, so every one of the total orders is valid according to the spec. That gives TraceLink some options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;check every possible log ordering against the spec&lt;/li&gt;
&lt;li&gt;check one arbitrarily-chosen ordering&lt;/li&gt;
&lt;li&gt;check a set of orderings that satisfies some coverage metric&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Finn has chosen a hybrid: TraceLink arbitrarily chooses one ordering of the log and checks it against the spec, but for each log message it asserts that each &lt;em&gt;possible&lt;/em&gt; next state is valid according to the spec, before it advances to a &lt;em&gt;single&lt;/em&gt; next state and repeats. He says this gives him a nice compromise between performance and completeness.&lt;/p&gt;
&lt;p&gt;TraceLink generates from the log a TLA&lt;sup&gt;+&lt;/sup&gt; spec, which validates that the trace conforms with the &lt;em&gt;original&lt;/em&gt; PlusCal spec of the program. This trace-checking spec could be huge, but TraceLink compacts it by factoring out common &amp;ldquo;shapes&amp;rdquo; of validation and storing the values in a binary file that TLC can read.&lt;/p&gt;
&lt;p&gt;Finn gets diverse traces from the implementation by adding an exponentially-distributed random sleep after each critical section.&lt;/p&gt;
&lt;p&gt;This work is specific to implementations generated by PGo, but could be adapted with some manual effort (add PGo-style instrumentation to existing implementation) or by building a PlusCal-to-another-language compiler that automatically adds this instrumentation.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-4.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Translating C to PlusCal for Model Checking of Safety Properties on Source Code&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/guillaume-di-fatta-0441a3253/&#34;&gt;Guillaume Di Fatta&lt;/a&gt; (presenting), &lt;a href=&#34;https://scholar.google.fr/citations?user=nYVf510AAAAJ&amp;amp;hl=fr&#34;&gt;Emmanuel Ohayon&lt;/a&gt;, and &lt;a href=&#34;https://dblp.org/pid/167/5040.html&#34;&gt;Amira Methni&lt;/a&gt;, Asterios Technologies.&lt;/p&gt;
&lt;p&gt;Asterios is a defense contractor, they make a realtime microkernel for embedded systems. Guillaume and his colleagues wrote a tool that auto-translates the microkernel scheduler from C to PlusCal. The C is simple and self-contained, which helps. The translation progresses through a series of intermediate representations. He wrote a tiny stack-based VM in PlusCal to interpret the final IR. This is cool but outside my interests. An intriguing suggestion for future work: use model-checking to show the auto-translated PlusCal refines a more abstract spec.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-5.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;TLA&lt;sup&gt;+&lt;/sup&gt; for All: Model Checking in a Python Notebook&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://laufer.cs.luc.edu/&#34;&gt;Konstantin Läufer&lt;/a&gt; (presenting) and &lt;a href=&#34;https://gkt.sh/&#34;&gt;George K. Thiruvathukal&lt;/a&gt;, Loyola University Chicago.&lt;/p&gt;
&lt;p&gt;Konstantin talked about teaching formal methods to undergrads at his school and &lt;a href=&#34;https://fme-teaching.github.io/&#34;&gt;globally&lt;/a&gt;. He described where FM fits in a curriculum for CS students, particularly among other classes about software testing, which increase in power as they work through the curriculum. (He mentions the useful &lt;a href=&#34;https://www.agilealliance.org/resources/sessions/test-driven-development-guided-by-zombies/&#34;&gt;ZOMBIES mnemonic for unittesting&lt;/a&gt;!) He wants to integrate FM into existing classes, e.g. a data structures class could use Alloy or LEAN. Finally, he shows that he&amp;rsquo;s created a browser-based zero-install environment for running TLC.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-6.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Formal models for monotonic pipeline architectures&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://dblp.org/pid/97/1837.html&#34;&gt;J.-P. Bodeveix&lt;/a&gt;, &lt;a href=&#34;https://scholar.google.fr/citations?user=8k2MVLYAAAAJ&amp;amp;hl=fr&#34;&gt;A. Bonenfant&lt;/a&gt;, &lt;a href=&#34;https://scholar.google.fr/citations?user=3iKALIoAAAAJ&amp;amp;hl=sr&#34;&gt;T. Carle&lt;/a&gt;, &lt;a href=&#34;https://dblp.org/pid/60/599.html&#34;&gt;Mamoun Filali&lt;/a&gt; (presenting), &lt;a href=&#34;https://www.irit.fr/~Christine.Rochange/&#34;&gt;C. Rochange&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;They want to analyze the Worst-Case Execution Time (WCET) for pipelined hardware architectures. They&amp;rsquo;re concerned about &amp;ldquo;timing anomalies&amp;rdquo; e.g. where a cache hit makes execution &lt;em&gt;slower&lt;/em&gt;. Mamoun didn&amp;rsquo;t discuss performance analysis in TLA&lt;sup&gt;+&lt;/sup&gt;, though, he instead walked through a TLA&lt;sup&gt;+&lt;/sup&gt; spec and safety properties in detail.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-7.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;TLA&lt;sup&gt;+&lt;/sup&gt; Modeling of MongoDB Transactions&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://www.linkedin.com/in/murat-demirbas-distributolog-a2233b176/&#34;&gt;Murat Demirbas&lt;/a&gt; (presenting) and &lt;a href=&#34;https://www.linkedin.com/in/william-schultz-a22714a2/&#34;&gt;Will Schultz&lt;/a&gt;, MongoDB&lt;/p&gt;
&lt;p&gt;MongoDB introduced single-document transactions in version 3.2 (2015), multi-document transactions in version 4.0 (2018), and sharded transactions in version 4.2 (2019). Murat described the layers of the transaction algorithm: storage, replica set, sharded cluster. The algorithm is hard to reason about and verify: it was developed incrementally, its use of &lt;a href=&#34;https://muratbuffalo.blogspot.com/2024/04/implementation-of-cluster-wide-logical.html&#34;&gt;ClusterTime&lt;/a&gt; and &lt;a href=&#34;https://muratbuffalo.blogspot.com/2024/02/tunable-consistency-in-mongodb.html#:~:text=Speculative%20majority%20and%20snapshot%20isolation,back%2Dto%2Dback%20transactions.&#34;&gt;speculative majority&lt;/a&gt; adds complexity, and some of the hard bugs are in fault-tolerance and the interactions of transactions with DDL and chunk migration between shards. For now, Murat&amp;rsquo;s and Will&amp;rsquo;s TLA&lt;sup&gt;+&lt;/sup&gt; spec disables chunk migration and ignores DDL.&lt;/p&gt;
&lt;p&gt;To check isolation, our spec maintains a history variable &amp;ldquo;ops&amp;rdquo; where it stores the history of transactions. We use &lt;a href=&#34;https://ir.cwi.nl/pub/30580/30580.pdf&#34;&gt;Crooks&amp;rsquo;s client-centric isolation definitions, modeled in TLA&lt;sup&gt;+&lt;/sup&gt; by Soethout&lt;/a&gt;. The interactions between transactions and timestamps are key to isolation guarantees.&lt;/p&gt;
&lt;p&gt;Interesting audience question: Crooks&amp;rsquo;s formalism doesn&amp;rsquo;t catch phantoms due to predicate reads, how do we ensure we&amp;rsquo;re preventing phantoms? Murat: that&amp;rsquo;s only a problem with serializable isolation, but the strongest we provide is snapshot isolation.&lt;/p&gt;
&lt;p&gt;We use model-based test case generation to check conformance of the &lt;a href=&#34;https://github.com/muratdem/MDBTLA/blob/main/MultiShardTxn/Storage.tla&#34;&gt;storage layer spec&lt;/a&gt; and implementation. Since the spec is modular, the storage spec is separate from the other aspects of the transactions spec, which helps.&lt;/p&gt;
&lt;p&gt;Murat and Will analyzed the &lt;em&gt;permissiveness&lt;/em&gt; of our transaction protocols compared to the behaviors allowed by Crooks&amp;rsquo;s formalism. We allow 70-80% of the behaviors that are theoretically allowed for snapshot isolation and read-committed—this suggests if we were more permissive we could allow more parallelism. I have a draft of the paper that I haven&amp;rsquo;t read yet, now I&amp;rsquo;m really curious how they measured this.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-8.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Are We Serious About Using TLA&lt;sup&gt;+&lt;/sup&gt; For Statistical Properties?&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://emptysqua.re/blog/&#34;&gt;A. Jesse Jiryu Davis&lt;/a&gt;, MongoDB&lt;/p&gt;
&lt;p&gt;This was my presentation. &lt;a href=&#34;https://emptysqua.re/blog/are-we-serious-about-statistical-properties-tlaplus/&#34;&gt;Here&amp;rsquo;s the video and a detailed writeup&lt;/a&gt;. In summary: &lt;a href=&#34;https://brooker.co.za/blog/2022/06/02/formal.html&#34;&gt;Marc Brooker in 2022&lt;/a&gt; asked for tools that model performance &lt;strong&gt;and&lt;/strong&gt; check safety, with a single spec. Markus Kuppe and Jack Vanlightly that year presented a way to do it in TLA&lt;sup&gt;+&lt;/sup&gt;, but I think it&amp;rsquo;s an impractical prototype because it&amp;rsquo;s missing essential features. I asked, are we serious about making TLA&lt;sup&gt;+&lt;/sup&gt; do performance modeling? I reviewed some existing tools for ideas to inspire us: &lt;a href=&#34;https://emptysqua.re/blog/java-modelling-tools/&#34;&gt;Java Modelling Tools&lt;/a&gt;, &lt;a href=&#34;https://www.prismmodelchecker.org/&#34;&gt;PRISM&lt;/a&gt;, &lt;a href=&#34;https://fizzbee.io/&#34;&gt;Fizzbee&lt;/a&gt;, &lt;a href=&#34;https://runway.systems/&#34;&gt;Runway&lt;/a&gt;. I proposed how we might do performance modeling in TLA&lt;sup&gt;+&lt;/sup&gt;, and asked the audience whether we should proceed and how.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-9.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;It’s never been easier to write TLA⁺ tooling!&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://ahelwer.ca/&#34;&gt;Andrew Helwer&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Andrew described the existing infrastructure on which you can build your own TLA&lt;sup&gt;+&lt;/sup&gt; tools. There are a bunch of TLA&lt;sup&gt;+&lt;/sup&gt; parsers, SANY is the only full-featured one. Andrew&amp;rsquo;s Tree-Sitter grammar is good for prototyping new TLA&lt;sup&gt;+&lt;/sup&gt; notation because it tells you how much ambiguity you&amp;rsquo;d introduce. For model-checking there&amp;rsquo;s TLC, Apalache, and now &lt;a href=&#34;https://github.com/will62794/spectacle&#34;&gt;Will Schultz&amp;rsquo;s Spectacle&lt;/a&gt; can do some model-checking. There&amp;rsquo;s Snowcat (a type-checker), two auto-formatters (tlaplus-formatter and tlafmt), LSP (language server protocol), Andrew&amp;rsquo;s TLAUC (Unicode converter), and the TLA&lt;sup&gt;+&lt;/sup&gt; VS Code extension.&lt;/p&gt;
&lt;p&gt;The main tools have a lot of legacy code, defined as &amp;ldquo;code without tests&amp;rdquo; or, according to Andrew, &amp;ldquo;code that no one understands.&amp;rdquo; SANY &amp;amp; TLC were started 25 years ago, overhauled 10 years ago for version 2, those authors are gone now. Markus knows more than anyone, but there are large sections of the code he doesn&amp;rsquo;t know / no one knows. The model-checking code is straightforward, but TLAPM (the proof system) is esoteric. TLAPM was started with a grant at INRIA but is not vigorously developed now. Apalache was spun off by Informal Systems but it hasn&amp;rsquo;t been adopted by the TLA&lt;sup&gt;+&lt;/sup&gt; Foundation; it&amp;rsquo;s quiescent.&lt;/p&gt;
&lt;p&gt;Andrew wants a world where we&amp;rsquo;re unafraid to make large changes to core TLA&lt;sup&gt;+&lt;/sup&gt; tools. He&amp;rsquo;s written a lot of parser tests with his TLA&lt;sup&gt;+&lt;/sup&gt; Foundation funding, but we need more tests. We should onboard new core contributors: Andrew is writing &lt;a href=&#34;https://docs.tlapl.us/codebase:start&#34;&gt;onboarding docs&lt;/a&gt; and tutorials for contributors. He&amp;rsquo;s also written &lt;a href=&#34;https://docs.tlapl.us/creating:start&#34;&gt;how to create a new TLA&lt;sup&gt;+&lt;/sup&gt; tool&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;He presented some possible future directions, like simplifying and modernizing TLA&lt;sup&gt;+&lt;/sup&gt; syntax, and improving model-checking performance 1000x.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-10.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Announcements&lt;/h1&gt;

&lt;p&gt;Markus says &lt;a href=&#34;https://github.com/tlaplus/tlaplus/issues&#34;&gt;the TLA&lt;sup&gt;+&lt;/sup&gt; issue tracker in GitHub&lt;/a&gt; now labels some issues as TLA&lt;sup&gt;+&lt;/sup&gt; Foundation funding candidates. He describes the grant proposal process.&lt;/p&gt;
&lt;p&gt;Markus demonstrated, with a sped-up video, that GitHub Copilot can solve the Diehard problem with TLA&lt;sup&gt;+&lt;/sup&gt; in 4 minutes of reasoning, with some human intervention. We should explore TLA&lt;sup&gt;+&lt;/sup&gt; plus LLMs more. The Foundation and NVidia came up with &lt;a href=&#34;https://foundation.tlapl.us/challenge/index.html&#34;&gt;the &amp;ldquo;TLAI&amp;rdquo; challenge&lt;/a&gt;, for novel combos of TLA&lt;sup&gt;+&lt;/sup&gt; and AI.&lt;/p&gt;
&lt;p&gt;We&amp;rsquo;ll have a TLA&lt;sup&gt;+&lt;/sup&gt; conference / community event next year, but we don&amp;rsquo;t yet know where or when.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-11.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;h1&gt;Extracurriculars&lt;/h1&gt;

&lt;p&gt;Andrew Helwer and I visited Niagara Falls the day after the conference. He&amp;rsquo;d never seen them. I saw the falls when I was a kid and hardly remember, aside from a memory of a helicopter ride I took over the falls with my mother. She&amp;rsquo;d said, &amp;ldquo;This is so expensive, promise you won&amp;rsquo;t blink for the whole ride.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I give the falls 3 out of 5 stars: lots of water, but not very tall, and you can&amp;rsquo;t see much because of all the mist.&lt;/p&gt;
&lt;p&gt;
&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/niagara.png&#34; style=&#34;display:inline;width:auto;height:auto;vertical-align:middle&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;a href=&#34;https://www.instagram.com/p/Ck_WGIJL4Jc/&#34;&gt;From the delightful &amp;ldquo;Subpar Parks&amp;rdquo;.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We had a great time in the vast and rusty old powerplant, a total steampunk playground, with a half-mile tunnel where you walk deep underground from the subterranean turbines to the base of the falls. There&amp;rsquo;s a hagiographic exhibit about Nikola Tesla, the wronged genius who single-handedly invented electricity, radios, and VTOL aircraft? The show turned out to be a propaganda operation by the Serbian government—Tesla was born in Serbia.&lt;/p&gt;
&lt;p&gt;Andrew and I hiked around the Niagara Glen Nature Reserve, where we saw &lt;a href=&#34;https://www.instagram.com/p/DJUHXxyRLtM/&#34;&gt;some lovely mossy boulders&lt;/a&gt;, and &lt;a href=&#34;https://www.instagram.com/p/DJSri9CunOe/&#34;&gt;Canada goose families with floppy goslings&lt;/a&gt;.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/2025-tlaplus-community-event/Futuristic-visions-cards-germany-12.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;The futuristic images are from &lt;a href=&#34;https://rarehistoricalphotos.com/futuristic-visions-cards-germany/&#34;&gt;trading cards published by the Echte Wagner margarine company&lt;/a&gt; in early 20th Century Germany.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Holy Week at Christ Episcopal Church</title>
      <link>https://emptysqua.re/blog/christ-episcopal-holy-week/</link>
      <pubDate>Sun, 27 Apr 2025 08:04:34 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/christ-episcopal-holy-week/</guid>
      
      
      
      
      <description>&lt;p&gt;Holy Week services (Maundy Thursday, Good Friday) at &lt;a href=&#34;https://www.christchurchpok.org/history-christ-church/&#34;&gt;Christ Episcopal Church&lt;/a&gt; in Poughkeepsie. The church has operated since the 1760s; the current building was erected in 1889. My friend Emily is the church&amp;rsquo;s assistant rector, she was &lt;a href=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/&#34;&gt;ordained there&lt;/a&gt; last year. She invited me to photograph the services this month.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-010.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-011.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-012.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-013.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-014.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-015.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-016.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-017.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-018.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-019.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-020.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-021.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-022.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-023.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-024.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-025.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-026.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-027.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-028.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-holy-week/christ-church-029.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;Shot with a Nikon F3 and a Nikkor 50mm f/1.2 on Portra 800 35mm film.&lt;/p&gt;
&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Mohonk Preserve Rangers, part 2</title>
      <link>https://emptysqua.re/blog/mohonk-rangers-2/</link>
      <pubDate>Fri, 11 Apr 2025 19:36:29 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/mohonk-rangers-2/</guid>
      
      
      
      
      <description>&lt;p&gt;April 9, 2025. Mohonk Preserve rangers training for rescue operations. &lt;a href=&#34;https://emptysqua.re/blog/mohonk-rangers/&#34;&gt;I spent a day with them earlier, in March&lt;/a&gt;.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers-2/rangers-2-010.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How Long Must I Test?</title>
      <link>https://emptysqua.re/blog/how-long-must-i-test/</link>
      <pubDate>Mon, 07 Apr 2025 22:36:05 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/how-long-must-i-test/</guid>
      
      
      
      
      <description>&lt;p&gt;When you&amp;rsquo;re hunting for bugs with a nondeterministic test, how many times should you run the test? The &lt;em&gt;program&lt;/em&gt; might be deterministic: the same input produces the same behavior. Presuming you&amp;rsquo;re confident this is the case, you could fuzz the input for a while, hoping to catch bugs. But if your program is nondeterministic, you must not only give it random inputs, you must re-test it on the same input many times, hoping to achieve high branch coverage and catch rare bugs.&lt;/p&gt;
&lt;p&gt;I was at &lt;a href=&#34;https://emptysqua.re/blog/notes-from-antithesis-bugbash/&#34;&gt;the first annual Antithesis conference&lt;/a&gt; this week. Antithesis has built a &lt;a href=&#34;https://antithesis.com/blog/deterministic_hypervisor/&#34;&gt;hypervisor&lt;/a&gt; that makes nondeterministic tests into deterministic ones, and they built a coverage-guided fuzzer to search for deep bugs. (I evaluated an early version of their product in 2020 for MongoDB, &lt;a href=&#34;https://antithesis.com/case_studies/mongodb_productivity/&#34;&gt;we&amp;rsquo;ve been a customer ever since&lt;/a&gt;.) So I&amp;rsquo;ve been wondering about Antithesis and other testing methods: how efficient are they, and how long do we have to run them to be confident we&amp;rsquo;ve found most of the bugs?&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-long-must-i-test/sacred-beetle.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Painting of a dung beetle pushing a ball of dung while another pulls it&#34; title=&#34;&#34;&gt;&lt;h1&gt;Deterministic Programs&lt;/h1&gt;

&lt;p&gt;You&amp;rsquo;re in luck, inputs and program behaviors are 1:1. But some bugs are triggered by only a subset of inputs, and there are too many possible inputs for you to test them all. How many should you test?&lt;/p&gt;
&lt;p&gt;My colleague Max Hirschhorn told me about a 2020 paper, &lt;a href=&#34;https://mboehme.github.io/paper/FSE20.EmpiricalLaw.pdf&#34;&gt;Fuzzing: On the Exponential Cost of Vulnerability Discovery&lt;/a&gt;. The paper itself is confusing and contradictory, so I&amp;rsquo;ll try to summarize: If you choose inputs at random (with replacement), the number of inputs required to find each additional bug rises exponentially. This applies to stupid totally-random fuzzers and &lt;em&gt;also&lt;/em&gt;, depressingly, to smart coverage-guided fuzzers. I assume the coverage-guided fuzzers have a much smaller exponent, though.&lt;/p&gt;
&lt;p&gt;The first half of the paper shows charts from experiments with fuzzers (AFL and LibFuzzer) and hundreds of programs. The programs must all be deterministic, because the authors don&amp;rsquo;t consider that one input could produce multiple outcomes. They discover an &amp;ldquo;empirical law&amp;rdquo; that the marginal number of inputs to discover a bug increases exponentially in nearly all cases.&lt;/p&gt;
&lt;p&gt;The authors go on to mathematically model a dumb blackbox fuzzer that chooses inputs uniformly randomly from a domain &lt;em&gt;D&lt;/em&gt;, which can be subdivided into &lt;em&gt;S&lt;/em&gt; subdomains called &lt;em&gt;species&lt;/em&gt;. One species is one kind of outcome recognized by the fuzzer, e.g. a crash at a particular instruction, or a particular path through the program&amp;rsquo;s branches. Some of the species are the bugs we&amp;rsquo;re hunting. Each input has one species, and the probability of a random input belonging to a particular species &lt;em&gt;i&lt;/em&gt; is &lt;em&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt;, which is the proportion of inputs in &lt;em&gt;D&lt;/em&gt; that lead to &lt;em&gt;i&lt;/em&gt;. The fuzzer chooses a subset of &lt;em&gt;n&lt;/em&gt; inputs from &lt;em&gt;D&lt;/em&gt;, and the expected number of species it discovers is:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/how-long-must-i-test/formula-360235c453d38f50fe4b1f29fc3b6e089ad7de8531aa8cbd4de7e9856e33a2af.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/how-long-must-i-test/formula-360235c453d38f50fe4b1f29fc3b6e089ad7de8531aa8cbd4de7e9856e33a2af.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\sum_{i=1}^{S} \left[ 1 - (1 - p_i)^n \right] = S - \sum_{i=1}^{S} (1 - p_i)^n&#34;&gt;
&lt;/picture&gt;&lt;p&gt;This is intuitive: the probability of discovering a single species is the inverse chance of failing to discover it &lt;em&gt;n&lt;/em&gt; times, and the expected number of discoveries is the sum of those probabilities over all the species.&lt;/p&gt;
&lt;p&gt;So far this is straightforward, but why does it lead to an exponential rise in the marginal number of inputs to discover another species, i.e. an exponential decline in the rate of species discovery over the course of a fuzzing campaign? Alas the authors give two contradictory explanations. First, perhaps real programs&amp;rsquo; &lt;em&gt;p&lt;sub&gt;i&lt;/sub&gt;&lt;/em&gt; probabilities are power-law distributed, i.e. the most probable species is much more likely than the next most probable and so on, flattening out towards the tail where rare species are about equally improbable. This theory matches the authors&amp;rsquo; experiments in the first half of the paper, but it&amp;rsquo;s not an explanation for &lt;em&gt;why&lt;/em&gt; species should be distributed this way.&lt;/p&gt;
&lt;p&gt;Immediately after explaining this theory, the authors blithely contradict themselves with a different theory:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;When collecting baseball cards, the first couple of cards are always easy to find, but adding a new card to your collection will get progressively more difficult—even if all baseball cards were equally likely. This is related to the coupon collector&amp;rsquo;s problem &amp;hellip; covering one more branch or discovering one more bug will get progressively more difficult—so difficult, in fact, that each new branch covered and each new vulnerability exposed comes at an exponential cost.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The &lt;a href=&#34;https://en.wikipedia.org/wiki/Coupon_collector%27s_problem&#34;&gt;coupon collector&amp;rsquo;s problem&lt;/a&gt; applies when species are all equally likely. As &lt;em&gt;n&lt;/em&gt; grows, the number of trials to discover &lt;em&gt;n&lt;/em&gt; species (&amp;ldquo;collect &lt;em&gt;n&lt;/em&gt; coupons&amp;rdquo;) grows like &lt;em&gt;n&lt;/em&gt; log(&lt;em&gt;n&lt;/em&gt;), which is much slower than exponential. But if there is a fixed number of species &lt;em&gt;n&lt;/em&gt; and you&amp;rsquo;ve discovered &lt;em&gt;k&lt;/em&gt; of them, what&amp;rsquo;s the marginal cost to discover one more? How does it compare to exponential growth?&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-long-must-i-test/coupon.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;!-- 
import numpy as np
import matplotlib.pyplot as plt

n = 100
k_max = 10  # Using first 10 coupons
k_values = np.arange(1, k_max + 1)
expected_trials = []

# Calculate the expected number of trials for collecting k unique coupons:
# E[T_k] = n * sum_{i=n-k+1}^{n} (1/i)
for k in k_values:
    sum_val = sum(1 / i for i in range(n - k + 1, n + 1))
    expected_trials.append(n * sum_val)

# Fit the exponential curve using the first two points:
# f(1) = A = expected_trials[0] and f(2) = A * exp(B) = expected_trials[1]
A = expected_trials[0]  # should be 1
B = np.log(expected_trials[1] / A)

# Define the exponential function f(k) = A * exp(B*(k-1))
exp_vals = A * np.exp(B * (k_values - 1))

# Plotting both curves on the same figure
plt.figure(figsize=(10, 6))
plt.plot(k_values, expected_trials, marker=&#39;o&#39;, linestyle=&#39;-&#39;, markersize=6, label=&#34;Expected Trials&#34;)
plt.plot(k_values, exp_vals, marker=&#39;s&#39;, linestyle=&#39;--&#39;, markersize=6, label=&#34;Exponential Fit (first 2 points)&#34;)

plt.xlabel(&#34;Number of distinct coupons collected (k)&#34;)
plt.ylabel(&#34;Expected number of trials (log scale)&#34;)
plt.title(&#34;Coupon Collector&#39;s Problem for first 10 coupons (n = 100)&#34;)
plt.grid(True)

# Set y-axis to log scale and ensure x-axis has a tick at each integer
plt.yscale(&#34;log&#34;)
plt.xticks(k_values)

plt.legend()
plt.show()
!--&gt;
&lt;p&gt;It&amp;rsquo;s a poor fit. The exponentially increasing difficulty of finding a new species is not well-explained by the coupon collector&amp;rsquo;s problem, the power-law distribution must be the real reason. But that just poses a new question: why are the probabilities of species power-law distributed?&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s another question: why aren&amp;rsquo;t coverage-guided fuzzers asymptotically better than this? Indeed, at the Antithesis conference, Zac Hatfield-Dodds suggested that they are. He gave a talk about his property-based testing library &lt;a href=&#34;https://hypothesis.readthedocs.io/&#34;&gt;Hypothesis&lt;/a&gt;, and threw out a thought-provoking aside. If I heard him correctly, he said &amp;ldquo;coverage-guided fuzzing reduces the time to find a bug from exponential to polynomial.&amp;rdquo; This paper seems to contradict him, but perhaps I don&amp;rsquo;t understand his meaning. I need to think or investigate more.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-long-must-i-test/mantis.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Painting of a praying mantis, wings unfurled and arms raised&#34; title=&#34;&#34;&gt;&lt;h1&gt;Nondeterministic Programs&lt;/h1&gt;

&lt;p&gt;Now things get really hard. Multithreaded programs interleave instructions nondeterministically, and distributed systems suffer unpredictable network delays, message loss, and clock skew. When you test these programs, not only must you try lots of inputs, but you must try each input many times.&lt;/p&gt;
&lt;p&gt;If you run the same test of a nondeterministic program repeatedly, you&amp;rsquo;ll probably get different outcomes, but not efficiently: mostly the program will just follow the same path over and over without discovering new bugs. You can explicitly increase randomness by introducing random network delays or various faults, as &lt;a href=&#34;https://jepsen.io/&#34;&gt;Jepsen&lt;/a&gt; does.&lt;/p&gt;
&lt;p&gt;You can find concurrency bugs in multithreaded programs much more efficiently if you control thread scheduling. &lt;a href=&#34;https://fhackett.com/&#34;&gt;Finn Hackett&lt;/a&gt;, a MongoDB PhD fellow and intern, showed me the paper &lt;a href=&#34;https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/asplos277-pct.pdf&#34;&gt;A Randomized Scheduler with Probabilistic Guarantees of Finding Bugs&lt;/a&gt;, in ASPLOS 2010. The authors point out that even a simple multithreaded program has an overwhelming number of possible schedules: &lt;em&gt;n&lt;/em&gt; threads that collectively execute &lt;em&gt;k&lt;/em&gt; instructions have on the order of &lt;em&gt;n&lt;sup&gt;k&lt;/sup&gt;&lt;/em&gt; schedules. But only some of the scheduling choices matter for reproducing any of the program&amp;rsquo;s concurrency bugs. This paper defines a nice term, bug &lt;em&gt;depth&lt;/em&gt;, &amp;ldquo;the minimum number of scheduling constraints that are sufficient to find a bug.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-long-must-i-test/figure-1-burckhardt.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;For example, in the paper&amp;rsquo;s Figure 1(a), a bug is reproduced if Thread 2 accesses a variable &lt;em&gt;before&lt;/em&gt; Thread 1 initializes it. This bug has one scheduling constraint, i.e. depth 1. In Figure 1(b), the bug requires Thread 1 to run an assignment &lt;em&gt;after&lt;/em&gt; Thread 2 runs an if-statement and &lt;em&gt;before&lt;/em&gt; it runs a method-call, so it has depth 2. The authors describe a thread-scheduling algorithm that finds any single bug of depth &lt;em&gt;d&lt;/em&gt; with probability 1/&lt;em&gt;nk&lt;sup&gt;d-1&lt;/sup&gt;&lt;/em&gt;, much better than the theoretical lower bound of 1/&lt;em&gt;n&lt;sup&gt;k&lt;/sup&gt;&lt;/em&gt;, i.e. the probability of picking one schedule out of all possible schedules. You can expect to find a depth-1 bug in an &lt;em&gt;n&lt;/em&gt;-threaded program in &lt;em&gt;n&lt;/em&gt; tries and a depth-2 bug in &lt;em&gt;nk&lt;/em&gt; tries.&lt;/p&gt;
&lt;p&gt;Their algorithm is called Probabilistic Concurrency Testing (PCT). You provide PCT with a program that executes up to &lt;em&gt;k&lt;/em&gt; instructions on &lt;em&gt;n&lt;/em&gt; threads. You provide a desired bug depth &lt;em&gt;d&lt;/em&gt;, and an input that &lt;em&gt;could&lt;/em&gt; trigger a bug of depth &lt;em&gt;d&lt;/em&gt;. PCT&amp;rsquo;s job is to find a schedule that &lt;em&gt;does&lt;/em&gt; trigger it. PCT assigns random initial priorities &lt;em&gt;d&lt;/em&gt;, &lt;em&gt;d&lt;/em&gt;+1, &amp;hellip;, &lt;em&gt;d&lt;/em&gt;+&lt;em&gt;n&lt;/em&gt; to the &lt;em&gt;n&lt;/em&gt; threads. PCT chooses &lt;em&gt;d&lt;/em&gt;-1 random &amp;ldquo;change points&amp;rdquo; in the program. It allows the program to execute one instruction at a time on one thread: the highest-priority runnable thread, i.e. the thread with the largest priority number that isn&amp;rsquo;t waiting for anything. When a thread stumbles upon the &lt;em&gt;i&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt; change point, PCT changes the thread&amp;rsquo;s priority to &lt;em&gt;i&lt;/em&gt; and continues. PCT uses higher numbers for threads&amp;rsquo; initial priorities (≥&lt;em&gt;d&lt;/em&gt;) than it uses for change points (&amp;lt;&lt;em&gt;d&lt;/em&gt;), so that a thread tends to be unscheduled when it hits its first change point, perturbing the system and increasing the chance of bugs.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-long-must-i-test/figure-6-burckhardt.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Here&amp;rsquo;s Figure 6(b) from the paper. It shows the same depth-2 bug as before. PCT reproduces it by assigning priority 3 to Thread 2, and priority 2 to Thread 1 (white circles), and inserting a change point (black circle) that changes Thread 2&amp;rsquo;s priority to 1. The probability of PCT assigning these initial priorities is 1/&lt;em&gt;n&lt;/em&gt;, and its probability of choosing this change point is 1/&lt;em&gt;k&lt;/em&gt;. Thus there&amp;rsquo;s a 1/&lt;em&gt;nk&lt;/em&gt; chance of finding this bug in one try, which matches the authors&amp;rsquo; promise of finding a depth 2 bug in &lt;em&gt;nk&lt;/em&gt; tries on average.&lt;/p&gt;
&lt;p&gt;In practice PCT is much more efficient than its worst-case bound, since there are usually many ways to reproduce a given bug.&lt;/p&gt;
&lt;p&gt;The authors implemented PCT for real:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For fine-grained priority control, we implemented PCT as a user-mode scheduler. PCT works on unmodified x86 binaries. It employs binary instrumentation to insert calls to the scheduler after every instruction that accesses shared memory or makes a system call. The scheduler gains control of a thread the first time the thread calls into the scheduler. From there on, the scheduler ensures that the thread makes progress only when all threads with higher priorities are blocked.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I think by &amp;ldquo;shared memory&amp;rdquo; they mean the heap, which is shared among threads. In the abstract description above we called the total number of instructions &lt;em&gt;k&lt;/em&gt;, but the authors&amp;rsquo; implementation of PCT only places change points before synchronization points, like &lt;code&gt;pthread&lt;/code&gt; calls or atomics, and &lt;em&gt;k&lt;/em&gt; is the number of these synchronization points. Surprisingly, PCT can run huge programs like Mozilla at one-third normal speed. The paper&amp;rsquo;s evaluation section describes the bugs they found and how quickly, and the various optimizations and heuristics they added to the algorithm. This part interested me:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Final Wait:&lt;/strong&gt; Some concurrency bugs might manifest much later than when they occur. We found that PCT missed some of the manifestations as the main thread exits prematurely at the end of the program. Thus, we artificially insert a priority change point for the main thread before it exits.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;There must be hordes of rarely-manifesting use-after-free bugs where the main thread deletes global variables as it exits, and background threads &lt;em&gt;sometimes&lt;/em&gt; wake up just before the program terminates.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-long-must-i-test/cricket.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Painting of field crickets, looking rather cute, one is emerging from a burrow&#34; title=&#34;&#34;&gt;&lt;h1&gt;Deterministic Simulation Testing&lt;/h1&gt;

&lt;p&gt;At MongoDB we run certain concurrency tests, which we know are highly nondeterministic, many times each night. There are rare bugs that pop up once every few months, or every few &lt;em&gt;years&lt;/em&gt;. This is bad, because we don&amp;rsquo;t know which change introduced the bug, and once we diagnose it, we can&amp;rsquo;t be confident we fixed it.&lt;/p&gt;
&lt;p&gt;Some companies like FoundationDB built &lt;a href=&#34;https://notes.eatonphil.com/2024-08-20-deterministic-simulation-testing.html&#34;&gt;deterministic simulation testing&lt;/a&gt; into their code from the start. When FoundationDB was acquired by Apple, some of the FoundationDB team left to create Antithesis and bring deterministic testing to projects that &lt;em&gt;didn&amp;rsquo;t&lt;/em&gt; build it in. Running MongoDB&amp;rsquo;s nondeterministic tests in Antithesis has been helpful, and I expect we&amp;rsquo;ll use them more.&lt;/p&gt;
&lt;h1&gt;Inconclusion&lt;/h1&gt;

&lt;p&gt;Distributed systems engineers (myself included) typically aren&amp;rsquo;t knowledgeable enough about randomized testing and we don&amp;rsquo;t deploy it thoughtfully. We should ask rigorously, exactly how much confidence do I gain with each trial, and how can I test nondeterministic programs efficiently? And the most vexing question of all: when can I stop? I don&amp;rsquo;t know the answers yet.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/how-long-must-i-test/wasps.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Painting of common wasps on a papery nest, which has been cut away to reveal the brood cells&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Images: &lt;a href=&#34;https://publicdomainreview.org/collection/fabres-book-of-insects-1921/&#34;&gt;Fabre’s Book of Insects&lt;/a&gt; (1921)&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Zen priest Kenzen in New York City</title>
      <link>https://emptysqua.re/blog/kenzen/</link>
      <pubDate>Fri, 04 Apr 2025 17:39:42 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/kenzen/</guid>
      
      
      
      
      <description>&lt;p&gt;April 1, 2025. Zen priest Kenzen, formerly named Gessho, who was recently ordained at the Village Zendo in New York.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/kenzen/ajdavis_20250401_9395-Edit.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/kenzen/ajdavis_20250401_9435.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/kenzen/ajdavis_20250401_9457-Edit.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Mohonk Preserve Rangers</title>
      <link>https://emptysqua.re/blog/mohonk-rangers/</link>
      <pubDate>Wed, 26 Mar 2025 11:43:51 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/mohonk-rangers/</guid>
      
      
      
      
      <description>&lt;p&gt;March 25, 2025. Mohonk Preserve rangers training for rescue operations. They practiced building anchors, ascending and descending ropes, rescuing an unconscious climber hanging on a rope, and securing a patient in a litter. (&lt;a href=&#34;https://emptysqua.re/blog/mohonk-rangers-2/&#34;&gt;More photos here&lt;/a&gt;.)&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-010.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-011.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-012.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-013.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-rangers/rangers-014.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Gravity Vault Climbing Team</title>
      <link>https://emptysqua.re/blog/gravity-vault-climbing-team/</link>
      <pubDate>Wed, 19 Mar 2025 17:27:54 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/gravity-vault-climbing-team/</guid>
      
      
      
      
      <description>&lt;p&gt;February 18, 2025. The local youth climbing team training at the gym.&lt;/p&gt;
&lt;hr&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gravity-vault-climbing-team/climbing-team-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Red Rocks</title>
      <link>https://emptysqua.re/blog/red-rocks-2025/</link>
      <pubDate>Wed, 19 Mar 2025 10:58:13 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/red-rocks-2025/</guid>
      
      
      
      
      <description>&lt;p&gt;I climbed with five friends in Red Rocks, outside Las Vegas. We had a medium-disappointing week. The sandstone there is fragile for a day or two after it rains, and it rained twice, so we were relegated to mediocre limestone climbing instead of world-class sandstone for most of the trip. Even on the limestone days our time was cut short by cold winds and occasional hail. My major objectives like &lt;a href=&#34;https://www.mountainproject.com/route/105732254/the-fox&#34;&gt;The Fox&lt;/a&gt; and &lt;a href=&#34;https://www.mountainproject.com/route/114115286/cactus-massacre&#34;&gt;Cactus Massacre&lt;/a&gt; will have to wait. I never took out my Canon, but I had some fun with my phone camera.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/red-rocks-2025/rr-010.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Review: SwiftPaxos: Fast Geo-Replicated State Machines</title>
      <link>https://emptysqua.re/blog/swiftpaxos/</link>
      <pubDate>Thu, 06 Feb 2025 21:32:59 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/swiftpaxos/</guid>
      
      
      
      
      <description>&lt;p&gt;&lt;a href=&#34;https://www.usenix.org/conference/nsdi24/presentation/ryabinin&#34;&gt;SwiftPaxos: Fast Geo-Replicated State Machines&lt;/a&gt;, in NSDI 2024, proposes a Paxos variant for networks with high latency, and different latencies between different pairs of nodes. Here&amp;rsquo;s a video of my presentation to &lt;a href=&#34;https://charap.co/summer-2024-reading-group-papers-papers-171-180/&#34;&gt;the DistSys Reading Group&lt;/a&gt;, and a written review of the paper below.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/rYxD4Wfa4x8?si=gpukEhj45fmGERES&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;hr&gt;


&lt;h1&gt;Previous Paxi&lt;/h1&gt;

&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Paxos_(computer_science)&#34;&gt;Paxos&lt;/a&gt;, as you know, is Leslie Lamport&amp;rsquo;s solution to the problem of consensus among unreliable nodes. The original Paxos achieved consensus on a &lt;em&gt;single&lt;/em&gt; value (or &amp;ldquo;decree&amp;rdquo;), and it takes 5 one-way network delays in the happy path, counting client-server delays as well as inter-server. The standard analysis of Paxos&amp;rsquo;s performance assumes that each one-way network delay is the &lt;em&gt;same&lt;/em&gt;.&lt;/p&gt;
&lt;h2&gt;Multi-Paxos&lt;/h2&gt;

&lt;p&gt;Multi-Paxos is more practical. It handles the real-world scenario where a group of nodes must agree on a sequence of commands that modify a shared state machine. There&amp;rsquo;s a long-lived leader, and the client sends its request to the leader. The leader broadcasts the client&amp;rsquo;s command to all the nodes. It commits the command once it hears replies from a majority of nodes, at which point it can execute the command and send the result back to the client. In the happy case, Multi-Paxos takes 4 one-way network delays per value (again counting client-server and inter-server chatter).&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/multi-paxos.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Diagram from Wenbing Zhao, “FastPaxos Made Easy”, 2015&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;h2&gt;FastPaxos&lt;/h2&gt;

&lt;p&gt;FastPaxos reduces network delays in the happy case. The client sends its request &lt;em&gt;directly&lt;/em&gt; to all nodes at the same time. When the leader hears from three quarters of the nodes, who all agree to put the same command in the same slot, the leader can commit and execute the command and reply to the client. This saves a single one-way network latency, for a total of 3 per value.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/fastpaxos.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Diagram from Wenbing Zhao, “FastPaxos Made Easy”, 2015, plus my scribbles&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Why is the &amp;ldquo;fast quorum&amp;rdquo; three quarters of the nodes? When I &lt;a href=&#34;https://emptysqua.re/blog/review-nezha/&#34;&gt;reviewed the Nezha paper&lt;/a&gt; a few months ago, I tried hard to understand this, and I think I succeeded. Then I forgot. While I read SwiftPaxos, I tried to remember. I failed. So let&amp;rsquo;s assume that three quarters is the size of a fast quorum, and that avoids certain problems that a mere majority would cause.&lt;/p&gt;
&lt;p&gt;In FastPaxos, different clients are all broadcasting simultaneously to all of the nodes, so there can be conflicts: different nodes can receive different commands, or in different orders. Some of the nodes could tell the leader, &amp;ldquo;I want command &lt;em&gt;c&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; in slot &lt;em&gt;i&lt;/em&gt;&amp;rdquo;, others could say, &amp;ldquo;I want command &lt;em&gt;c&lt;/em&gt;&lt;sub&gt;2&lt;/sub&gt; in slot &lt;em&gt;i&lt;/em&gt;.&amp;rdquo; The leader resolves a conflict by starting a new round using classic Multi-Paxos. Thus conflicts make &amp;ldquo;Fast&amp;rdquo; Paxos slower than Multi-Paxos.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/earth.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;If you use a geo-distributed deployment with clients in many regions, conflicts are more likely, because their broadcasts reach different nodes with different delays. And in a geo-distributed deployment, restarting consensus after a conflict is very expensive!&lt;/p&gt;
&lt;h2&gt;SwiftPaxos&lt;/h2&gt;

&lt;p&gt;This paper introduces SwiftPaxos. It claims to improve on FastPaxos in the happy case, with only 2 one-way message delays—just from the client to the leader and back! Even in the slow path, there are only 3 one-way delays. I&amp;rsquo;ll compare all these paxi and describe SwiftPaxos in more detail, and we&amp;rsquo;ll see if we agree.&lt;/p&gt;
&lt;h1&gt;Conflict Avoidance&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/conflict-avoidance.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://www.flickr.com/photos/zetson/3241975525&#34;&gt;(image source)&lt;/a&gt;&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Sequels to FastPaxos attempt to avoid conflicts with various strategies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://charap.co/reading-group-EPaxos-revisited/&#34;&gt;Egalitarian Paxos&lt;/a&gt; (EPaxos) is leaderless. Clients broadcast all commands to all nodes. Each node tracks each command&amp;rsquo;s dependencies and waits until a strongly-connected component (SCC) of the dependency graph appears. Then all nodes execute commands in the SCC in the same deterministic order, even when some nodes think they observe dependency cycles. The size of the SCCs each node must track, and the delay before a command is executed, can be large (theoretically unbounded).&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://emptysqua.re/blog/review-nezha/&#34;&gt;Nezha&lt;/a&gt; is similar, except it introduces deadline ordered multicast (DOM), which relies on reasonably synchronized clocks. DOM schedules each command to run sometime in the near future, and hopes that on three quarters of the nodes, the same set of commands will arrive before their deadline, so that will reduce conflicts. Nezha also falls back to a slow path when DOM makes a mistake. But Nezha recognizes that different orders are okay among non-conflicting commands (&amp;ldquo;commuting&amp;rdquo; commands) because their order doesn&amp;rsquo;t matter relative to each other.&lt;/li&gt;
&lt;li&gt;SwiftPaxos. Once again, nodes track dependencies among commands. SwiftPaxos never has dependency cycles (stay tuned). Like Nezha, SwiftPaxos ensures all nodes run conflicting commands in the same order, but they can run commuting commands in any order. SwiftPaxos has a low-key leader who&amp;rsquo;s mostly needed for accelerating the slow path.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;What Are Nodes Agreeing To?&lt;/h1&gt;

&lt;p&gt;To understand all these systems, I found it helpful to ask what &lt;em&gt;value&lt;/em&gt; nodes are agreeing to.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Paxos, Multi-Paxos: nodes agree that command &lt;em&gt;c&lt;/em&gt; is in slot &lt;em&gt;i&lt;/em&gt;, so they agree about the total order of commands.&lt;/li&gt;
&lt;li&gt;EPaxos, Nezha, SwiftPaxos: nodes agree that command &lt;em&gt;c&lt;/em&gt; has dependencies &lt;em&gt;s&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In EPaxos there&amp;rsquo;s &lt;em&gt;no&lt;/em&gt; consensus on ordering! Only &lt;em&gt;after&lt;/em&gt; consensus do the nodes look at the dependency graph that they agreed to, and then resolve cycles and run the commands in the same &lt;em&gt;partial&lt;/em&gt; order, where conflicting commands are ordered the same on all nodes. So order is part of the consensus protocol for Paxos and Multi-Paxos, but not for EPaxos. Nezha and SwiftPaxos are similar to EPaxos, but their consensus protocols prevent cycles.&lt;/p&gt;
&lt;h1&gt;When Can a Node Run a Command?&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Paxos, Multi-Paxos: as soon as it&amp;rsquo;s committed.&lt;/li&gt;
&lt;li&gt;EPaxos: after its dependencies (special rules for cycles).&lt;/li&gt;
&lt;li&gt;Nezha, SwiftPaxos: after its dependencies (never any cycles).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In Paxos and Multi-Paxos, as soon as command &lt;em&gt;c&lt;/em&gt; is committed, all the commands that &lt;em&gt;c&lt;/em&gt; can depend on are already committed, because &lt;em&gt;c&lt;/em&gt; is already placed into a total order. In EPaxos, a command can run after the strongly-connected component to which it belongs has been identified, and its dependencies have run. If there&amp;rsquo;s a cycle, EPaxos breaks it by running the command with the lowest id number first. In EPaxos a command might wait for an unbounded amount of time before it can run: an adversarial workload can blow up the size of these strongly connected graph components. Since Nezha and SwiftPaxos prevent cycles, there&amp;rsquo;s a bounded delay between nodes agreeing about a command and executing it.&lt;/p&gt;
&lt;h1&gt;Semi-Strong Leader&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/augustus.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The role of the leader is where SwiftPaxos sharply differs from its predecessors. Every SwiftPaxos quorum must include the leader, but clients can propose commands to any node. The SwiftPaxos leader is sort of &lt;a href=&#34;https://en.wikipedia.org/wiki/Primus_inter_pares&#34;&gt;first among equals&lt;/a&gt;: it&amp;rsquo;s not as strong as in Multi-Paxos, but stronger than in Egalitarian Paxos (which of course has no leader).&lt;/p&gt;
&lt;p&gt;In both Multi-Paxos and SwiftPaxos, there are &lt;em&gt;N&lt;/em&gt; nodes with ids 0 through &lt;em&gt;N&lt;/em&gt; - 1, the leader of ballot &lt;em&gt;b&lt;/em&gt; is the node with id = &lt;em&gt;b&lt;/em&gt; mod &lt;em&gt;N&lt;/em&gt;, and this leader keeps its job until failure.&lt;/p&gt;
&lt;p&gt;As in FastPaxos, SwiftPaxos has slow quorums and fast quorums. In SwiftPaxos, a slow quorum is a majority of replicas, &lt;em&gt;including&lt;/em&gt; the leader. This is a departure from other protocols! A majority is not enough to commit a command, the majority must include the leader.&lt;/p&gt;
&lt;p&gt;Each fast quorum is &lt;em&gt;either&lt;/em&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;More than ¾ of nodes, including the leader, or&lt;/li&gt;
&lt;li&gt;A specific set of nodes which is a majority including the leader.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;You can configure the system to use one kind of fast quorum or the other. The first is more fault-tolerant, the second is faster. Read Section 5.5 where the authors explore the tradeoffs; it&amp;rsquo;s subtle.&lt;/p&gt;
&lt;h1&gt;The SwiftPaxos Algorithm&lt;/h1&gt;

&lt;p&gt;Let&amp;rsquo;s talk about the algorithm already. How does it actually work?&lt;/p&gt;
&lt;p&gt;First, a client sends command &lt;em&gt;c&lt;/em&gt; to all the nodes, the same as in FastPaxos. Each node &lt;em&gt;N&lt;/em&gt; does the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Compute &lt;em&gt;c&lt;/em&gt;&amp;rsquo;s dependency set: prior uncommitted conflicting commands. For example, &lt;em&gt;c&lt;/em&gt; reads some key &lt;em&gt;x&lt;/em&gt;, and node &lt;em&gt;N&lt;/em&gt; already knows of an uncommitted command that modifies &lt;em&gt;x&lt;/em&gt;, so that uncommitted command is in the dependency set of &lt;em&gt;c&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Broadcast &lt;em&gt;c&lt;/em&gt;&amp;rsquo;s dependency set to the other nodes.&lt;/li&gt;
&lt;li&gt;Wait for messages agreeing about &lt;em&gt;c&lt;/em&gt;&amp;rsquo;s dependencies from a fast quorum.&lt;/li&gt;
&lt;li&gt;If a fast quorum agrees about &lt;em&gt;c&lt;/em&gt;&amp;rsquo;s dependencies, follow the fast path:
&lt;ul&gt;
&lt;li&gt;Wait until all of &lt;em&gt;c&lt;/em&gt;&amp;rsquo;s dependencies are committed.&lt;/li&gt;
&lt;li&gt;Commit &lt;em&gt;c&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Wait until all of &lt;em&gt;c&lt;/em&gt;&amp;rsquo;s dependencies have executed on &lt;em&gt;N&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Execute &lt;em&gt;c&lt;/em&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Otherwise, if no fast quorum agrees about &lt;em&gt;c&lt;/em&gt;&amp;rsquo;s dependencies, follow the slow path:
&lt;ul&gt;
&lt;li&gt;The leader proposes its value for &lt;em&gt;c&lt;/em&gt;&amp;rsquo;s dependencies.&lt;/li&gt;
&lt;li&gt;Do normal Multi-Paxos agreement.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In this figure from the paper, there are three clients. The circled nodes p1-3 are a fast quorum. In this case we&amp;rsquo;re using SwiftPaxos&amp;rsquo;s quirky configuration, where the fast quorum is a specific set of nodes that forms a majority and includes the leader.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/fast-path.001.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Figure 1 from the paper.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The three clients submit commands &lt;em&gt;y&lt;/em&gt;, &lt;em&gt;x&lt;/em&gt;, then &lt;em&gt;z&lt;/em&gt;, in that order. But due to geo-distribution, the clients have different latencies to different nodes, and each node gets these commands in a different order:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/fast-path.004.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Figure 1 plus my scribbles.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Some of these commands commute and some of them don&amp;rsquo;t.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/fast-path.005.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;All nodes agree about &lt;em&gt;y&lt;/em&gt;&amp;rsquo;s dependencies. Node p1 decided that &lt;em&gt;y&lt;/em&gt; does not conflict with &lt;em&gt;x&lt;/em&gt;, and so &lt;em&gt;y&lt;/em&gt; has an empty set as its dependencies on p1. Nodes p2 and p3 didn&amp;rsquo;t have to think at all: they had no prior uncommitted commands when &lt;em&gt;y&lt;/em&gt; arrived, so they also decided that &lt;em&gt;y&lt;/em&gt;&amp;rsquo;s dependency set is empty.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/fast-path.006.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;There is some disagreement about command &lt;em&gt;x&lt;/em&gt;. It arrived before the other commands on p1, so its dependency set is empty there. On p2, &lt;em&gt;x&lt;/em&gt; arrived after &lt;em&gt;y&lt;/em&gt;, but p2 decided it doesn&amp;rsquo;t conflict, so its dependency set is empty there too. But on p3, &lt;em&gt;x&lt;/em&gt; arrived after &lt;em&gt;z&lt;/em&gt;, and these commands &lt;em&gt;do&lt;/em&gt; conflict, so &lt;em&gt;x&lt;/em&gt; depends on &lt;em&gt;z&lt;/em&gt;. Thus p3 has a different opinion about &lt;em&gt;x&lt;/em&gt;&amp;rsquo;s dependency set than the other nodes. Since we&amp;rsquo;ve configured SwiftPaxos with a special fast quorum consisting of these three nodes, any disagreement among them aborts the fast path and forces SwiftPaxos to fall back to the slow path.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/fast-path.007.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Command &lt;em&gt;z&lt;/em&gt; has the same outcome: it arrived before conflicting commands on some nodes and after conflicting commands on other nodes, so &lt;em&gt;z&lt;/em&gt; has to fall back to the slow path too, i.e. normal Multi-Paxos consensus. A unique feature of SwiftPaxos&amp;rsquo;s fallback is that a node can vote &lt;em&gt;twice&lt;/em&gt; in the same ballot: once for a fast-path proposal and once for a slow-path proposal. Since the leader must be a member of both the fast and the slow quorum, it can ensure only one proposal wins.&lt;/p&gt;
&lt;p&gt;Apart from the consensus protocol, there are also optimizations in the client-server protocol. The leader optimistically executes any command immediately, before it&amp;rsquo;s committed, and sends the result to the client. If the leader and client learn the command was committed they can trust this result. Otherwise, they discard it. This saves some network latency compared to the normal pessimistic execution. On that fast path, a client submits its command to all nodes, receives agreeing replies directly from a fast quorum of nodes, and learns that its command was committed in just 2 one-way delays. The slow path takes only 3.&lt;/p&gt;
&lt;h1&gt;Why Is This Simpler Than EPaxos?&lt;/h1&gt;

&lt;p&gt;SwiftPaxos is simpler than EPaxos, believe it or not. Why? I think it&amp;rsquo;s because EPaxos commits commands in any order: a command c&amp;rsquo;s dependencies must all be known, but not necessarily committed yet, before c is committed. That&amp;rsquo;s how EPaxos ends up with cycles, and why it must build up a certain kind of subgraph (a strongly connected component) before it can start executing commands. SwiftPaxos waits to commit a command until after all of its dependencies are committed. This is why there are no dependency cycles, and why SwiftPaxos doesn&amp;rsquo;t suffer unbounded delays like EPaxos. (I didn&amp;rsquo;t understand this until I read the proof in the appendix.)&lt;/p&gt;
&lt;p&gt;After a leader change, there&amp;rsquo;s a recovery protocol in SwiftPaxos which is more complicated than what we&amp;rsquo;ve seen so far. The recovery protocol &lt;em&gt;does&lt;/em&gt; have to deal with cycles. I wonder: if SwiftPaxos used Raft elections, which choose a member from the majority with the longest log, might that prevent cycles during recovery?&lt;/p&gt;
&lt;h1&gt;Their Evaluation&lt;/h1&gt;

&lt;p&gt;The authors have a nifty approach to evaluation. They have sites scattered around the earth. Eight sites run clients, three sites run a server each, and two sites run both clients and a server. At the best client site, in the middle row of the chart, clients are near the nodes, and at the worst site at the bottom clients are far from them, so we can see the effect of getting a fast quorum agreement more or less often. The top row shows the average across all client sites. They compare how various protocols perform given different geo-locality and different rates of conflicts among commands.&lt;/p&gt;
&lt;p&gt;SwiftPaxos permits two ways of defining a fast quorum, and for this experiment they choose the second option: a specific set of nodes, a majority including the leader, constitute the only fast quorum.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/evaluation.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Figure 7a from the paper.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The &lt;em&gt;y&lt;/em&gt; axis is each protocol&amp;rsquo;s latency speedup relative to Paxos, and the &lt;em&gt;x&lt;/em&gt; axis is how often commands conflict. SwiftPaxos is the gold line. SwiftPaxos almost always beats the other protocols, of course, because this is the SwiftPaxos paper. But notice how some of the other protocols, particularly CURP+, are a little better at low conflict rates at the worst site. I don&amp;rsquo;t know the CURP+ protocol, so I don&amp;rsquo;t know why it&amp;rsquo;s better in this scenario.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/evaluation-2.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Figure 7b from the paper.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here&amp;rsquo;s the same average, best, and worst sites, evaluated by the cumulative distribution function of latency. The conflict rate is fixed at 2%. Again, SwiftPaxos beats all the others at the best site and the average site. But a couple of the other protocols are actually better, at this 2% conflict rate, at the worst site where the client is very far from a fast quorum.&lt;/p&gt;
&lt;h1&gt;My Evaluation&lt;/h1&gt;

&lt;p&gt;The paper is well-written, but the protocol is complex and hard for me to understand, probably because I don&amp;rsquo;t know EPaxos well. I had to read the appendix to understand why it avoids dependency cycles, unlike EPaxos.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ve heard people say EPaxos is impractical, mainly because of the unbounded execution delay and accumulation of state at the nodes. EPaxos is interesting research and often cited, but not actually used. Perhaps SwiftPaxos is the practical sequel to EPaxos? It&amp;rsquo;s a new paper, we&amp;rsquo;ll have to see how the community responds and builds upon it. The savings in one-way network delays seems significant.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/swiftpaxos/plate-158-american-swift.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Audubon, the American swift.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
</description>
    </item>
    
    <item>
      <title>Mixed-Media Climbing Art</title>
      <link>https://emptysqua.re/blog/gunks-climbing-mixed-media/</link>
      <pubDate>Wed, 05 Feb 2025 17:22:58 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/gunks-climbing-mixed-media/</guid>
      
      
      
      
      <description>&lt;p&gt;Some of my photographer friends hand-make books, sculptures, etc. from their photos. They inspire me to be more inventive.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href=&#34;https://emptysqua.re/blog/dustin/&#34;&gt;Dustin&lt;/a&gt; holding his own mixed-media climbing art, framed with old climbing slings.&lt;/li&gt;
&lt;li&gt;Dustin training on his hangboard, framed in climbing chalk.&lt;/li&gt;
&lt;li&gt;Chris climbing &lt;a href=&#34;https://emptysqua.re/blog/bonticou-climbing/&#34;&gt;Bonticou Crag&lt;/a&gt;, framed with local rocks.&lt;/li&gt;
&lt;/ol&gt;
&lt;br&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-climbing-mixed-media/gunks-framed-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-climbing-mixed-media/gunks-framed-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-climbing-mixed-media/gunks-framed-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New Paltz in the Fall</title>
      <link>https://emptysqua.re/blog/new-paltz-fall-2023/</link>
      <pubDate>Tue, 28 Jan 2025 19:23:43 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/new-paltz-fall-2023/</guid>
      
      
      
      
      <description>&lt;p&gt;October 22, 2023. There was stupidly dramatic light and clouds over New Paltz in the late afternoon that day. I forgot to publish this until now.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-fall-2023/ajdavis_20231102_000161040001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-fall-2023/ajdavis_20231102_000161040003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-fall-2023/ajdavis_20231102_000161040008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/new-paltz-fall-2023/ajdavis_20231102_000161050007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Dustin Portzline</title>
      <link>https://emptysqua.re/blog/dustin/</link>
      <pubDate>Tue, 28 Jan 2025 08:28:37 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/dustin/</guid>
      
      
      
      
      <description>&lt;p&gt;January 27, 2025. My climbing mentor, rock guide and artist &lt;a href=&#34;https://www.advancedrockcraft.com/&#34;&gt;Dustin Portzline&lt;/a&gt;, at home.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dustin/dustin-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dustin/dustin-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dustin/dustin-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dustin/dustin-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dustin/dustin-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dustin/dustin-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dustin/dustin-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Which Beings Are Sentient?</title>
      <link>https://emptysqua.re/blog/which-beings-are-sentient/</link>
      <pubDate>Sat, 25 Jan 2025 16:40:05 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/which-beings-are-sentient/</guid>
      
      
      
      
      <description>&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/Pm0Ni1nHfgs?si=GK70CPydvhUaEb2E&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;Which animals are sentient, and how can we liberate them from suffering? This is my January 23, 2025 dharma talk at the &lt;a href=&#34;https://villagezendo.org&#34;&gt;Village Zendo&lt;/a&gt;. I reviewed the philosopher Jonathan Birch&amp;rsquo;s book &amp;ldquo;The Edge of Sentience&amp;rdquo; and shared good news about humanity&amp;rsquo;s moral progress regarding animal welfare. Watch the video above, read the transcript below, or &lt;a href=&#34;https://emptysquare.libsyn.com/podcast/&#34;&gt;subscribe to my podcast&lt;/a&gt;.&lt;/p&gt;
&lt;hr&gt;
&lt;h1&gt;TRANSCRIPT&lt;/h1&gt;

&lt;p&gt;I would like to introduce you to somebody. This is Sojourner Truth Armstrong Davis, she is a Chinese dwarf hamster, and she is a little over a year old. We got her at a pet store across the river in Poughkeepsie, so her life is now more than half completed. She was in a cage with some sisters when we got her and we brought her home, and she&amp;rsquo;s been alone ever since. I mean, &lt;em&gt;we&amp;rsquo;re&lt;/em&gt; her friends, but it&amp;rsquo;s unclear if she knows that we exist.&lt;/p&gt;
&lt;p&gt;Actually, it&amp;rsquo;s unclear if she knows &lt;em&gt;she&lt;/em&gt; exists.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/53452915067_1b9c85e821_4k.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Sojourner Truth Armstrong Davis&#34; title=&#34;&#34;&gt;&lt;p&gt;This &amp;ldquo;problem of other minds&amp;rdquo; is a classic problem in Western philosophy. This question—what do we know about other beings&amp;rsquo; internal experience, based on what we can observe about their behavior or their anatomy or by measuring their brains with EEGs or whatever? What do we know about other minds, and specifically, do we know whether they&amp;rsquo;re conscious or not? It seems obvious that I&amp;rsquo;m conscious—I know that I exist. I experience something, as Descartes said, &amp;ldquo;I think therefore I am.&amp;rdquo; Buddha said that this is kind of overstated, that based on our experiences, we tend to conclude that our consciousness is a lasting, separate &lt;em&gt;thing&lt;/em&gt; that we need to protect and promote, whereas, in fact, my consciousness is more like a candle flame or a wave on an ocean or a bubble on a stream, as it says in the Diamond Sutra, I think. Temporary and not separate. But it doesn&amp;rsquo;t mean that I don&amp;rsquo;t exist! My consciousness is happening now, and other humans mostly behave like they&amp;rsquo;re conscious too. So it seems reasonable to guess that they are, but I can&amp;rsquo;t prove it. And with nonhuman animals, it&amp;rsquo;s much harder.&lt;/p&gt;
&lt;p&gt;In what ways is Sojourner&amp;rsquo;s mind like or unlike mine? I&amp;rsquo;m pretty sure she has no language, although she makes a lot of noise. Does she think? She has some behaviors that sort of seem like planning. We feed her a variety of seeds and she doesn&amp;rsquo;t eat them all at once. She caches them around her cage. Well, if we give her &lt;em&gt;fresh&lt;/em&gt; food, she eats right away. But if she gets food that seems like it&amp;rsquo;s going to last, she hides it. So this all seems pretty sophisticated. Mostly she saves her food in her little house, inside her cage, which is her nest, but then she also stores it in a few other places. Is she calculating that she wants backups, in case the main cache of food in her nest is lost somehow? I don&amp;rsquo;t know. She often shells a seed before she hides it. Is that to save space? I don&amp;rsquo;t know. When it&amp;rsquo;s cold in our house, she&amp;rsquo;ll stuff her little house with fluff for insulation. And one time, when we&amp;rsquo;d been gone for about a week&amp;mdash;we had set the thermostat down to 50 degrees, and the house was quite cold&amp;mdash;she had stuffed fluff around the outside of her house and piled it on top to increase the insulation, which we hadn&amp;rsquo;t seen before.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/53453842061_e479ab43ee_o.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Sojourner Truth Armstrong Davis&#34; title=&#34;&#34;&gt;&lt;p&gt;Is that because she understands heat transfer and insulation in the way that a human does, or is it just an instinct? Her ancestors, who did that kind of thing at random, survived and passed on that instinct to her. I don&amp;rsquo;t know. Instinct seems more likely.&lt;/p&gt;
&lt;p&gt;My own cognition isn&amp;rsquo;t as clear cut either! When it&amp;rsquo;s cold in the house, I get an extra blanket from the closet. Is that because I understand heat transfer and insulation, or is that an instinct? Or is that operant conditioning: I tried it in the past, and it felt warmer, so now I&amp;rsquo;ve trained myself to do that. Or is it imitation: I&amp;rsquo;m mimicking a behavior I saw my mother do when I was a child. As both Buddha and Freud have observed, we&amp;rsquo;re not very good at observing the operations of our own minds.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;The world of her senses is very different from mine. Her hearing is about as sensitive as mine. Her ears are much cuter, but they&amp;rsquo;re not necessarily better. I can hear low tones better than she can, whereas she can hear much, much higher tones. So the world would sound different to her than to me. I couldn&amp;rsquo;t find any studies on dwarf hamster olfaction, but if she&amp;rsquo;s anything like a mouse, then she has three or four times as many olfactory genes as I do. So she can smell a much wider variety of scents, and she&amp;rsquo;s much more sensitive to subtle smells than I am.&lt;/p&gt;
&lt;p&gt;Her vision is not very good. It&amp;rsquo;s blurry. She can&amp;rsquo;t focus on things unless they&amp;rsquo;re right in front of her, in the middle of her vision. She can barely see green and blue. She can&amp;rsquo;t see red at all. But her night vision is fairly good, and she has a very wide peripheral angle. So she seems to be optimized for detecting an owl swooping down on her in the dark, and for detecting a dark corner that she can run to to escape it. She also has very long whiskers. They&amp;rsquo;re half the length of her body, and she uses her whiskers and her smell to navigate. I think that the sensations of her whiskers must be very vivid to her.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/53454156954_263d29ab91_o.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Sojourner Truth Armstrong Davis&#34; title=&#34;&#34;&gt;&lt;p&gt;I have no idea what any of this is &lt;em&gt;like&lt;/em&gt;. What is it like to have her mind and body? The philosopher Thomas Nagel, in 1974 wrote an essay, &amp;ldquo;What Is It Like To Be A Bat?&amp;rdquo;, where he talked about a bat&amp;rsquo;s echolocation. He argued that with the bat&amp;rsquo;s echolocation, even if you understood everything about the physics of its ear and its neurons and its nerves, even if you could model this all with equations, you still wouldn&amp;rsquo;t understand what it&amp;rsquo;s &lt;em&gt;like&lt;/em&gt; to be a bat. And with Soji as well, her sensorium and her consciousness are just irreducibly different from ours.&lt;/p&gt;
&lt;p&gt;Is Soji sentient? She seems quite emotional. When I pick her up, she usually clicks her teeth together. This is called bruxing, and people agree that rodents make this sound when they&amp;rsquo;re happy. And normally, when I pick her up, she might also chatter, and I have no idea what that means. That might be happy too, because she makes this noise when I put her in her plastic ball, and she seems to enjoy running around the house, banging her plastic ball into things, so maybe that means she&amp;rsquo;s happy. But then she also makes that chattering sound when I put her back in her house. And she also tends to make it just when she&amp;rsquo;s sitting and there&amp;rsquo;s absolutely nothing going on, that&amp;rsquo;s actually when she&amp;rsquo;s the loudest. So maybe she&amp;rsquo;s announcing that this is her territory, wherever she happens to be. Maybe she&amp;rsquo;s horny. Maybe she&amp;rsquo;s expressing something that&amp;rsquo;s beyond human comprehension. I don&amp;rsquo;t know.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/Edge-of-Sentience-cover.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I&amp;rsquo;m in the middle of this fascinating book called The Edge of Sentience, which was published last year. It&amp;rsquo;s by the philosopher Jonathan Birch (who has nothing to do with the John Birch Society). It&amp;rsquo;s a free download from &lt;a href=&#34;https://www.edgeofsentience.com/&#34;&gt;edgeofsentience.com&lt;/a&gt; or you can buy it. Birch is proposing an ethical framework for deciding which beings might be sentient or not, because if they are sentient, then we have an ethical duty not to cause them undue suffering. Birch starts out by talking about consciousness. He distinguishes some aspects of the mind that we might call consciousness. And he breaks it down into three meanings.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/consciousness.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The first is selfhood, what we were talking about, this experience of being some sort of lasting single &lt;em&gt;thing&lt;/em&gt;. And then there&amp;rsquo;s also sapience, which is intelligence: learning, problem solving, analyzing. And then finally, there is what he calls the &amp;ldquo;broad&amp;rdquo; sense of sentience, which is phenomenal consciousness, or &amp;ldquo;raw feels&amp;rdquo;, just bits of experience in time. And he says that if you combine this with valence, which is having experiences that feel good or feel bad, then these two things together, having consciousness and feeling good or feeling bad, this is the &amp;ldquo;narrow&amp;rdquo; sense of sentience, which is having valenced conscious experience. Any being, any &lt;em&gt;system&lt;/em&gt; of neurons or software or whatever, that can have valenced conscious experiences, this is a sentient being, and we have a duty not to make it suffer.&lt;/p&gt;
&lt;p&gt;So he&amp;rsquo;s trying to build this ethical framework for dealing with creatures where we&amp;rsquo;re not sure if they&amp;rsquo;re sentient or not. So he says,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A sentient being (in the sense relevant to the present framework) is a system with the capacity to have valenced experiences, such as experiences of pain and pleasure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And if a being &lt;em&gt;might&lt;/em&gt; be sentient, then he says, we have a duty to avoid causing gratuitous suffering to this being:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Framework Principle 1. A duty to avoid causing gratuitous suffering. We ought, at minimum, to avoid causing gratuitous suffering to sentient beings either intentionally or through recklessness/negligence. Suffering is not gratuitous if it occurs in the course of a defensible activity despite proportionate attempts to prevent it. Suffering is gratuitous if the activity is indefensible or the precautions taken fall short of what is proportionate.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So he&amp;rsquo;s not proposing to outlaw all meat or all animal research, just causing &lt;em&gt;gratuitous&lt;/em&gt; suffering. This seems reasonable to me, and what I think is really interesting about Birch&amp;rsquo;s book, and what has been the most newsworthy aspect of it, is that he is trying to minimize ethical &lt;em&gt;risk&lt;/em&gt;, moral risk, and this is the risk that we could be causing suffering to a sentient being without knowing it.&lt;/p&gt;
&lt;p&gt;In a way we&amp;rsquo;re already familiar with moral risk. If I drop a thumbtack on the ground in a public place where people might walk barefoot, I clearly have a moral duty to pick it up, even though I might never know that somebody steps on it after I leave. They might not be permanently damaged, but it would be painful, and I have a duty to avoid the risk of causing suffering to that sentient being in the future.&lt;/p&gt;
&lt;p&gt;Birch is talking about a different sort of moral risk. Think about dropping a live lobster into boiling water. There&amp;rsquo;s no question about what&amp;rsquo;s going to happen to the lobster, but there is a question about whether it&amp;rsquo;s a sentient being who feels pain, experiences it, in some way that we are morally obliged not to cause. If so, boiling it alive is immoral. So there&amp;rsquo;s some doubt whether it&amp;rsquo;s moral or not, and it hinges on the lobster&amp;rsquo;s sentience. We may never know for sure, but it&amp;rsquo;s easy enough to avoid boiling lobsters alive. So in Birch&amp;rsquo;s opinion, we should do that to avoid the risk of causing suffering without knowing. Birch calls this the &amp;ldquo;precautionary principle&amp;rdquo;, and I think it&amp;rsquo;s a really interesting way to think about ethics.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/764ba2e8-dc4d-4822-88f0-488625ec43ac_5357.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;So Birch divides all systems into three categories, and &lt;em&gt;system&lt;/em&gt; is a deliberately vague and maybe strange word for a being, but it encompasses animals, plants, maybe it encompasses ant colonies as a whole, and it encompasses artificial intelligence, if such a thing already exists. So let&amp;rsquo;s bring in large language models like GPT, just because it&amp;rsquo;s such a weird outlier, so that it helps us think about the whole range of possibilities. Here are Birch&amp;rsquo;s categories:&lt;/p&gt;
&lt;p&gt;Number one, &lt;strong&gt;sentience candidate&lt;/strong&gt;. This is the highest level. We&amp;rsquo;re not sure it&amp;rsquo;s sentient, but we&amp;rsquo;re pretty sure, and sentience candidates are anything where there&amp;rsquo;s evidence that these beings could experience pain and pleasure. We have some idea of what might cause pain and pleasure to them. So obviously, we can guess that all mammals feel pain, they react the same way to the same painful stimuli, and they have the same brain structures for experiencing the pain as we do, so mammals are almost certainly sentient. Birch thinks that all adult vertebrates—mammals, fish, reptiles, amphibians—are also sentience candidates, but not necessarily vertebrate embryos. So, fish &lt;em&gt;eggs&lt;/em&gt; probably are not sentient. We probably don&amp;rsquo;t have to worry about causing suffering to them.&lt;/p&gt;
&lt;p&gt;When it comes to invertebrates like insects, insects don&amp;rsquo;t seem to feel pain if their exoskeleton is damaged, interestingly, maybe because there&amp;rsquo;s no point evolving that, since their exoskeletons can&amp;rsquo;t heal. But it&amp;rsquo;s possible that they do feel pain if they&amp;rsquo;re poisoned or heated. Birch analyzes their brain structures and behavior and says that insects are sentience candidates too, and he says that we should take proportionate steps to avoid gratuitous suffering to them. We don&amp;rsquo;t need to be as careful about causing pain to bugs as to hamsters, but we should take proportionate steps.&lt;/p&gt;
&lt;p&gt;The second category for Birch is an &lt;strong&gt;investigation priority&lt;/strong&gt;, and Birch puts some snails and worms here. Interestingly, he thinks spiders are less likely to be sentient than bugs, and there&amp;rsquo;s a whole chapter of brain structure analysis to explain why. Since the risk that these are sentient beings is lower, we don&amp;rsquo;t have to be as cautious about causing them suffering, but we should hurry to figure out whether they should be upgraded or not. So these are priorities for us to investigate further.&lt;/p&gt;
&lt;p&gt;Finally, the third category is &lt;strong&gt;not sentient&lt;/strong&gt;. Birch doesn&amp;rsquo;t think that plants, single celled organisms like bacteria, rocks, oceans, the earth, are sentient. These beings or things, can have very complex behaviors, but they don&amp;rsquo;t have neurons or anything like them, and so Birch doesn&amp;rsquo;t think that they experience anything.&lt;/p&gt;
&lt;p&gt;What about a large language model like GPT? We don&amp;rsquo;t know what it might find pleasurable or painful. It&amp;rsquo;s possible that asking it for a recipe is the equivalent of giving it an orgasm, and that asking it to write a poem is the equivalent of boiling it alive. But there doesn&amp;rsquo;t seem to be any way to know. There&amp;rsquo;s no behavior it evinces that resembles pleasure or pain, and it doesn&amp;rsquo;t have any structures that are anything like those with which animals experience pleasure and pain. So GPT is, for the moment, considered not sentient. My personal opinion here is that large language models are going to top out at high intelligence, but not sentience. They can think, but they can&amp;rsquo;t feel. Artificial intelligence will eventually be sentient, but it will be some other architecture, some other method of simulating intelligence, that will also feel. Birch thinks that we need to be very careful not to accidentally create a sentient artificial intelligence that suffers without our knowing it, because that would be a moral risk.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s this very interesting example called &lt;a href=&#34;https://openworm.org/downloads.html&#34;&gt;OpenWorm&lt;/a&gt;. It&amp;rsquo;s a research project to simulate in software every neuron of the tiny worm called &lt;em&gt;C. elegans&lt;/em&gt;. So &lt;em&gt;C. elegans&lt;/em&gt;, the real worm, has exactly 302 neurons. Every adult &lt;em&gt;C. elegans&lt;/em&gt; has the same neurons, connected in the same way. Researchers have just mapped them out and simulated them in software. And this project is called OpenWorm, and it has many contributors.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/worm.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Recently, a researcher &lt;a href=&#34;http://www.connectomeengine.com/download/Papers/Connectome.pdf&#34;&gt;connected up the worm neuron simulator to a physical robot&lt;/a&gt; made out of Legos, and it sort of maybe a little bit behaves like a worm; it navigates the space using some of the same methods that a worm would use.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/lego.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;So this is an interesting example, because Birch thinks that a &lt;em&gt;C. elegans&lt;/em&gt;, a real one, is an investigation priority. What about a simulation of a &lt;em&gt;C. elegans&lt;/em&gt; worm? I think that&amp;rsquo;s an investigation priority too. And if the worm were upgraded to a sentience candidate, something that we really have to avoid causing suffering to, what about the simulation of it? I see no difference. I don&amp;rsquo;t think that the hardware matters at all. I think it&amp;rsquo;s just software. I think that a simulation of a sentient being is sentient, and we must avoid making such a thing suffer.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/intro-to-buddhist-ethics.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Okay, so this is a dharma talk. What are Buddhists supposed to think about all of this? I&amp;rsquo;m reading this excellent book, &lt;a href=&#34;https://www.cambridge.org/core/books/an-introduction-to-buddhist-ethics/9813D1B3D65333E686B3ED528C70B972&#34;&gt;An Introduction to Buddhist Ethics&lt;/a&gt; by Peter Harvey, published in 2000. It&amp;rsquo;s got a lot of early Buddhist teachings, from the Pali canon, which I didn&amp;rsquo;t know until I read it, because I&amp;rsquo;ve only practiced Zen. In Zen, when someone asks a question, the teacher answers with riddles and quotes ancient Chinese poetry and says, you know, &amp;ldquo;The butterfly carries the precious jewel over the mountain.&amp;rdquo; So whenever I encounter early Buddhism and Shakyamuni Buddha, I really enjoy them. The old man was an intellectual. He thought everything through, and when somebody asked him a question, he would typically answer it in detail and tell you what he thought. And it&amp;rsquo;s clear that Buddha was very concerned about nonhuman animals. Indian religions in general are very concerned with animal welfare, in a way that might be unique among world religion families. Maybe because they say that humans are reincarnated as animals and vice versa. Or maybe the causation goes the other way: maybe they respect animals so much that they see no reason why humans couldn&amp;rsquo;t be reincarnated as animals.&lt;/p&gt;
&lt;p&gt;Both Hinduism and Buddhism teach that each of us has been, in some past life, a worm, a ghost, a God, a hamster, and will be again. Any being we ever meet at any time has at some point been our parent or our child in some past life that we both had, and we will have this relationship again in the future. According to the Pali canon, &lt;a href=&#34;https://obo.genaud.net/dhamma-vinaya/wp/sn/02_nv/sn02.15.011.bodh.wp.htm#p4&#34;&gt;Buddha said&lt;/a&gt;,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;This samsara is without discoverable beginning. A first point is not discerned of beings roaming and wandering on, hindered by ignorance and fettered by craving.&lt;br&gt;
Whenever you see anyone in misfortune, in misery, you can conclude: &amp;ldquo;We too have experienced the same thing in this long course.&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I love Buddha&amp;rsquo;s sort of scientific worldview—he doesn&amp;rsquo;t &lt;em&gt;know&lt;/em&gt;. He says samsara&amp;rsquo;s without &lt;em&gt;discoverable&lt;/em&gt; beginning.&lt;/p&gt;
&lt;p&gt;Or more concisely, in another sutra he says,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The bones you’ve left behind in transmigration are greater than a mountain.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;All these bodies you&amp;rsquo;ve died as, are greater than a mountain. But you don&amp;rsquo;t have to believe in reincarnation in this way, to see that we all experience misery. We all yearn to be liberated from suffering, and so it&amp;rsquo;s natural to feel sympathy for anybody who&amp;rsquo;s suffering.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/29e2ea65-2e58-46f9-bce7-9f08eb7fe0a2_6450.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Early Buddhist ethics teaches that harming a big, complex animal is worse than harming a small or simple one, and this is similar to Jonathan Birch&amp;rsquo;s framework. But on the other hand, Buddha didn&amp;rsquo;t seem to spend actually all that much time distinguishing between sentient and non sentient beings, like asking, is this or is this not? In Pali, he just talked about &lt;em&gt;sattas&lt;/em&gt;, which is equivalent to the Sanskrit &lt;em&gt;sattvas&lt;/em&gt;. In general, when we vow, as part of the Four Bodhisattva Vows, &amp;ldquo;to save all sentient beings&amp;rdquo;, in English, this, I think, comes from the Avatamsaka Sutra, the Bodhisattva vows of Samantabhadra. Or at least that was maybe an influence on it, via Chinese and so on. And Samantabhadra just says &amp;ldquo;saving sattvas&amp;rdquo; in Sanskrit, &amp;ldquo;saving beings&amp;rdquo;, maybe with the implication that sattvas are sentient. Certainly, if they&amp;rsquo;re worth saving, they must be. But it&amp;rsquo;s not a &lt;em&gt;division&lt;/em&gt; between sentient and insentient that Buddhism really emphasizes.&lt;/p&gt;
&lt;p&gt;Buddha, his goal was to free beings from suffering, by teaching them about the causes of suffering and the methods of being liberated from suffering. As we learn about suffering and we see that we&amp;rsquo;re trapped in it, and so is everybody else, it&amp;rsquo;s natural for us to be sympathetic for each other&amp;rsquo;s misery. Everybody similarly hates pain, yearns for peace.&lt;/p&gt;
&lt;p&gt;Buddha taught that we shouldn&amp;rsquo;t cause suffering for any sentient beings, nor kill them. In the Dhammapada, which is the oldest Buddhist scripture, and maybe the most plausibly related to what the old guy actually said, he&amp;rsquo;s quoted as saying,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;All tremble at punishment,&lt;br&gt;
Life is dear to all.&lt;br&gt;
Comparing others with oneself,&lt;br&gt;
One should neither kill nor cause others to kill.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The early sutras say that we should help other beings and not cause them suffering, because that&amp;rsquo;s &lt;em&gt;essentially&lt;/em&gt; good, but also acting this way is good for me. It accumulates good karma for me, and that good karma kind of pushes me along, gives me momentum to achieve escape velocity from suffering and achieve nirvana. But doing good because it&amp;rsquo;s going to be good for me, is only &lt;em&gt;sort of&lt;/em&gt; good. Doing good for goodness&amp;rsquo; sake is best, and your intention when you act is really key. Buddha said that if you harm unintentionally, it won&amp;rsquo;t cause you to suffer in the future. And if you cause good by accident, it won&amp;rsquo;t help you in the future either. This is in contrast to Jainism, another Indian religion that was arising around the same time in India. In Jainism, unintentionally harming a bug is quite bad, not as bad as intentionally, but still something that you work very hard to avoid. Even today, you&amp;rsquo;ll see Jain monks often wearing a mask to avoid accidentally inhaling a bug, sweeping the ground to avoid accidentally stepping on one. They even are so concerned about killing plants that they won&amp;rsquo;t eat root vegetables, like digging up a carrot kills the carrot, whereas plucking a fruit leaves the tree intact, and so they don&amp;rsquo;t eat root vegetables. Buddhists are a lot more relaxed about this stuff. In Theravada, they&amp;rsquo;re vegetarian, and they certainly try to avoid killing bugs, but they don&amp;rsquo;t go quite so far, because as long as your intentions are good, that&amp;rsquo;s probably good enough.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/a14c36ac-e281-4a39-ac93-5a3adea4bc1c_5294.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;But Buddha said to think carefully about the consequences of your actions. So if you&amp;rsquo;re just like a well meaning idiot, obliviously causing carnage everywhere, that&amp;rsquo;s not the path to nirvana either. So I think Jonathan Birch&amp;rsquo;s framework is generally compatible with Buddha&amp;rsquo;s teaching here too. If we &lt;em&gt;suspect&lt;/em&gt; that insects experience pain, for example, we shouldn&amp;rsquo;t ignore the possibility of that inconvenient truth because it would be inconvenient for us. We should be curious. We should investigate whether they&amp;rsquo;re sentient. We should be willing to change our behavior if it turns out that they might be.&lt;/p&gt;
&lt;p&gt;A difference between Jonathan Birch and Shakyamuni Buddha is that Birch is not very concerned about &lt;em&gt;killing&lt;/em&gt; animals. He&amp;rsquo;s worried about how farm animals, for example, would experience being raised and slaughtered, but he&amp;rsquo;s not against eating meat if it could be raised and slaughtered ethically. If it can&amp;rsquo;t be ethically farmed, he&amp;rsquo;s against it. For example, he thinks that octopuses cannot be raised ethically as livestock. So he&amp;rsquo;s for a ban of that. But he&amp;rsquo;s not against eating wild caught octopus, I think, if that could be done without pain. And honestly, this is a relief to me, because I do eat meat, I do kill mosquitos. Not causing undue suffering when I do that is something that I can aspire to. This is obviously incompatible with the First Precept. I took jukai in 2006 and I promised not to kill beings. I recognize the contradiction here, and I don&amp;rsquo;t have an answer for you there. But avoiding causing suffering to sentient beings is something that I could aspire to.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;A couple of years ago, due to Brexit, the United Kingdom was revising its animal welfare laws. It had withdrawn from the EU. And the EU had a treaty that said that animals are sentient beings, and member states must pay full regard to the welfare of animals. But the EU didn&amp;rsquo;t specify which animals are sentient or not. So I mean, technically, a brine shrimp is sentient. Do we have to worry about whether it experiences suffering or not? Unclear. When the UK replaced the European treaty with its own law, the first draft specified all vertebrates, so mammals, fish, reptiles, amphibians. Some activists argued for the inclusion of two specific groups of invertebrates. These are cephalopods like octopuses, squids and cuttlefish, and decapods, like lobsters and crabs. So in comes our hero, Jonathan Birch. The UK hires him to lead a commission to study the possibility that cephalopods and decapods are sentient beings. And I don&amp;rsquo;t know how long it took them, years? They wrote a hundred-page document with 300 citations to research studies. And the upshot is that the big, smart cephalopods, like octopuses, are likely enough to be sentient that British legislature has to keep their welfare in mind, and the same for lobsters and crabs. Of course, the news reports are overblown. They say, UK declares lobsters are sentient beings, but it doesn&amp;rsquo;t mean that you&amp;rsquo;re going to be tried for murder if you eat a lobster in London. It just means that the UK legislature must consider their welfare when it passes future laws. That&amp;rsquo;s it.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/2448c70f-cf89-4031-ba6f-c67a963dd6a7_6441.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;But still, even though the impact of the sentience act of 2022 might be small, it&amp;rsquo;s an example of the widening moral circle, widening it to include more beings. The philosopher Peter Singer wrote about this. He said that the &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Expanding_Circle&#34;&gt;Expanding Circle&lt;/a&gt; is an example of &lt;em&gt;moral progress&lt;/em&gt;, of society&amp;rsquo;s moral beliefs improving over time. He said,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The circle of altruism has broadened from the family and tribe to the nation and race, and we are beginning to recognize that our obligations extend to all human beings. The only justifiable stopping place for the expansion of altruism is the point at which all whose welfare can be affected by our actions are included within the circle of altruism.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Progress. Progress is a very modern concept, maybe &lt;em&gt;the&lt;/em&gt; modern concept. Only once our technology and social change were happening rapidly enough that you could actually observe that change in one lifetime, did we start to think that progress, continuous improvement, was a trend that we could observe and expect to continue. And &lt;em&gt;moral&lt;/em&gt; progress is this astonishingly modern idea, that there is some better morality. We don&amp;rsquo;t know yet what it is. We can only guess. There might be principles that are better, but we don&amp;rsquo;t know what they are, but we or our descendants in the future, will believe them and behave accordingly.&lt;/p&gt;
&lt;p&gt;The idea makes me very uncomfortable, because it seems to indicate moral realism, or objective morality, that there is some moral truth that&amp;rsquo;s out there outside of our current beliefs, and that society can either progress toward it or move away from it. Otherwise how is it possible to define whether a change in morality is moral progress or just moral change? I&amp;rsquo;m pretty sure that there is no moral truth, that societies just choose their axioms and try to reason logically from them, and that every philosopher who does this either comes up with axioms that logically lead to &lt;a href=&#34;https://en.wikipedia.org/wiki/Mere_addition_paradox&#34;&gt;repugnant conclusions&lt;/a&gt; that offend our instincts, or that they lead to logical contradictions, generally both, and then we just try again and again, and it&amp;rsquo;s doomed.&lt;/p&gt;
&lt;p&gt;Buddha taught, I think, that there &lt;em&gt;is&lt;/em&gt; moral truth. That behaving in accord with the truth that we&amp;rsquo;re all interdependent, and in accord with the truth that all sentient beings want to be liberated from suffering, behaving in this way is essentially good. I mean, now that I say that, it sort of sounds reasonable. That as long as there are sentient beings in the universe, and it&amp;rsquo;s true that we&amp;rsquo;re not separate, that it makes sense for us to try to help each other all achieve what we want. And what we want naturally is not to suffer, kind of by the definition of suffering. So maybe, if there is a moral truth, he was right. Nagarjuna wrote that the precepts are empty, that they have no essential nature, they&amp;rsquo;re skillful means, and that they&amp;rsquo;re useful so long as they match our circumstances. That also sounds right to me. But then how do you define useful or not useful? Skillful or not skillful? This is all very ambiguous to me.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/f5937e50-09a4-4e1a-893b-4a0bbf8023e9_6316.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;But if moral progress is real and it involves less suffering for sentient beings, then &lt;a href=&#34;https://farmanimalwelfare.substack.com/p/ten-big-wins-in-2024-for-farmed-animals&#34;&gt;moral progress is happening now&lt;/a&gt; and it&amp;rsquo;s happening extremely rapidly. Over 1000 companies have pledged not to use meat or eggs from caged chickens in the US. Forty percent of hens are now cage free, and that was only 6% a decade ago. Starting last year, all the chicken meat and eggs that McDonald&amp;rsquo;s uses come from cage free hens, and that&amp;rsquo;s two years sooner than McDonald&amp;rsquo;s had promised.&lt;/p&gt;
&lt;p&gt;Male chicks are generally useless to agriculture, and so in the past, they were generally hand-sexed and then ground up alive. Now it&amp;rsquo;s technologically possible to sex them &lt;em&gt;in ovo&lt;/em&gt; and destroy them before they&amp;rsquo;re hatched. So in my opinion, this is less suffering. This is morally superior, and this affects billions of chicks every year.&lt;/p&gt;
&lt;p&gt;The UK&amp;rsquo;s largest grocery chains will stop selling live lobsters and crabs in their stores, so you don&amp;rsquo;t see those miserable lobsters in a tank there anymore. France&amp;rsquo;s largest producer of trout will now stun them unconscious before killing them, before slaughtering them. Germany&amp;rsquo;s largest grocery stores have cut the price of fake plant-based meat so that it&amp;rsquo;s equal to the price of animal meat. And this price cut led to a 30% increase in the sales of fake meat.&lt;/p&gt;
&lt;p&gt;Billions of sentient beings every year are affected by these changes, and there&amp;rsquo;s just a ton of momentum happening right now to accelerate moral progress when it comes to animal welfare in agriculture. So as bodhisattvas, we should rejoice that all of these beings are suffering less. One of the vows of the Bodhisattva Samantabhadra in the Avatamsaka Sutra, was to &amp;ldquo;rejoice in the merit and virtues of others.&amp;rdquo; So let&amp;rsquo;s rejoice that people are caring about animal welfare and doing this research and making these promises and making these changes, instead of feeling guilty that we&amp;rsquo;re not doing enough, or discouraged because there&amp;rsquo;s still so much suffering in the world. A bodhisattva&amp;rsquo;s attitude is joy that we&amp;rsquo;re all moving in the right direction.&lt;/p&gt;
&lt;p&gt;Of course, we can also find ways to do our individual part and to keep pushing in the right direction. We can be selective in what animal products we buy at the grocery store. Labels really make a difference. For fowl—chicken meat and eggs—&amp;ldquo;free range&amp;rdquo; really is a meaningful label. It&amp;rsquo;s regulated by the USDA. Other animals, not so much, but it makes a difference for chickens. And then there are non government organizations that have more rigorous standards, and you can look for their certifications on animal products at the grocery store too. Obviously, eating less meat is better. The &lt;a href=&#34;https://www.openphilanthropy.org/focus/farm-animal-welfare/&#34;&gt;Farm Animal Welfare Program at Open Philanthropy&lt;/a&gt; is an effective activist and research group.&lt;/p&gt;
&lt;p&gt;Mostly, I just want you to know that even though there&amp;rsquo;s a lot of grim news and a lot of ways in which humanity is going in the wrong direction, this is something where we&amp;rsquo;re seeing a lot of progress right now, and it&amp;rsquo;s something to keep an eye on and to rejoice in.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/which-beings-are-sentient/ba42cc15-b3af-4301-b109-ef722c984bef_6490.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;hr&gt;
&lt;p&gt;Image sources:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.flickr.com/search/?user_id=93922379%40N00&amp;amp;view_all=1&amp;amp;text=sojourner&#34;&gt;Sojourner&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.getty.edu/art/collection/object/107C3S&#34;&gt;Rats&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.getty.edu/art/collection/object/106ZG0&#34;&gt;Lobster X-ray&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.getty.edu/art/collection/object/107D3F&#34;&gt;Caught lobsters&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.getty.edu/art/collection/object/106WY5&#34;&gt;Octopus&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.getty.edu/art/collection/object/107C3R&#34;&gt;Cat&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.getty.edu/art/collection/object/106WJV&#34;&gt;Willow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Winter 2024 Meditation Retreat</title>
      <link>https://emptysqua.re/blog/winter-ango-2024/</link>
      <pubDate>Mon, 13 Jan 2025 13:11:35 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/winter-ango-2024/</guid>
      
      
      
      
      <description>&lt;p&gt;December 2024. The Village Zendo&amp;rsquo;s winter meditation retreat in Garrison, NY, led by head student Ryugin.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-010.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-ango-2024/winter-sesshin-2024-011.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Is an Enlightened Person Afraid of Illness and Death?</title>
      <link>https://emptysqua.re/blog/is-an-enlightened-person-afraid/</link>
      <pubDate>Sat, 21 Dec 2024 18:53:54 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/is-an-enlightened-person-afraid/</guid>
      
      
      
      
      <description>&lt;p&gt;This is a dharma talk I gave at the Village Zendo December 12, 2024, about my mother&amp;rsquo;s brain surgery and end-of-life decisions. Here&amp;rsquo;s the video and a transcript.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/hdFXpozrIn0?si=9TBLR0hAhCo6zXAg&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen style=&#34;margin-bottom: 1em&#34;&gt;&lt;/iframe&gt;
&lt;hr&gt;
&lt;p&gt;Two months ago, my mom had emergency brain surgery. She had an artery in her brain that was basically completely blocked. This is the artery that feeds Broca&amp;rsquo;s area and the temporal lobes. These are the language centers, and language is my mother&amp;rsquo;s whole life. &lt;a href=&#34;https://hms.cas.lehigh.edu/content/dena-davis-phd&#34;&gt;She&amp;rsquo;s a scholar of philosophy and law and religion&lt;/a&gt;. Reading and writing are her whole career. So this was especially scary for her. She had surgery on a Monday.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/lesnouvellesid00ram_0176.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Keishin and I went into Mount Sinai and it was supposed to be, I don&amp;rsquo;t know, six to eight hours. It took 10 hours. We were just waiting and waiting and waiting in the big waiting area that they have there. Finally, we found out that she was done and in the next building over. We went and sat with her until she woke up. When she woke up, she couldn&amp;rsquo;t speak at all, which nobody had warned us about. When she had gone into surgery, she was fine. She&amp;rsquo;d had a few episodes of difficulty speaking, but she was totally fine. And then after the surgery, she woke up, and as she was coming out of the anesthesia, she looked over at me. We made eye contact. It seemed like she recognized me. She smiled at me, and then she said &amp;ldquo;dub dub dub dub dah,&amp;rdquo; as if she expected me to understand.&lt;/p&gt;
&lt;p&gt;And then maybe she seemed scared. There were some tears coming out of her eyes, and she was—she was in the bed spasming. Grabbing the edges of the bed and shaking it, and her legs were kicking. I had no idea what was going on with her, and I was so scared. I said, &amp;ldquo;Can you point to some part of you that&amp;rsquo;s not comfortable,&amp;rdquo; and she couldn&amp;rsquo;t seem to understand what I was saying. She would look at me and she would nod, but then she wouldn&amp;rsquo;t comprehend. Or she couldn&amp;rsquo;t express anything.&lt;/p&gt;
&lt;p&gt;She had this huge bandage around her head, and she kept reaching up to scratch behind the bandage, and pull at the bandage. And I kept grabbing her hand and putting it down and saying, &amp;ldquo;It&amp;rsquo;s okay, just be patient. Leave that bandage alone. I don&amp;rsquo;t know why you can&amp;rsquo;t talk. I think it&amp;rsquo;s probably just the anesthesia. You&amp;rsquo;ll probably be able to talk soon. Just relax. Just be patient. Just keep waiting, Mom. I think you&amp;rsquo;re going to be okay, leave the bandage alone.&amp;rdquo; We did this for hours. At some point, she finally evaded me and pulled her bandage off, and it was heavy with blood. It was just soaked with blood, and her pillow was soaked with blood.&lt;/p&gt;
&lt;p&gt;Whenever this experience got too much for me, I would still be sitting there holding her hand, but I would close my eyes, breathe into my stomach and just be there. I wasn&amp;rsquo;t &lt;em&gt;reassuring&lt;/em&gt; myself because I didn&amp;rsquo;t know anything. I was just breathing. I was aware of Keishin sitting in the room across from me and watching me, watching my mother. Knowing that was reassuring.&lt;/p&gt;
&lt;p&gt;Every once in a while, a doctor would come in and get in Mom&amp;rsquo;s face and loudly ask, &amp;ldquo;Do you know where you are?&amp;rdquo; She&amp;rsquo;d say &amp;ldquo;buh.&amp;rdquo; The doctor would say, &amp;ldquo;What year is it? Is it 2022? Is it 2024?&amp;rdquo; She would just look at them. One guy strolled out of the room and whispered to a nurse, &amp;ldquo;Word salad.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I asked, &amp;ldquo;So it&amp;rsquo;s normal? How long is this going to last? She didn&amp;rsquo;t have aphasia before the surgery.&amp;rdquo; And the doctors said either it&amp;rsquo;s normal, or we don&amp;rsquo;t know, or wait until you can see some other doctor in the morning.&lt;/p&gt;
&lt;p&gt;And finally, it was nine o&amp;rsquo;clock and we had to go home.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/lesnouvellesid00ram_0036.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Over the next few days, I did find out, okay, maybe this is at the bad end of a normal spectrum. It doesn&amp;rsquo;t mean that she&amp;rsquo;ll never speak again. We still don&amp;rsquo;t know.&lt;/p&gt;
&lt;p&gt;She&amp;rsquo;d had surgery on a Monday, and Thursday was really one of the worst days of my life. I got a call around nine-thirty in the morning from the attending physician, saying, &amp;ldquo;Your mother has taken a turn for the worse. She may have had another stroke.&amp;rdquo; So I jumped on my bike and rode over. I was staying in New York. I rode to the hospital again. Jennifer was upstate. I got her on the phone, and we talked to the doctor.&lt;/p&gt;
&lt;p&gt;What happened? &amp;ldquo;Not sure, but what little language ability and responsiveness she had is gone.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Will she get better? &amp;ldquo;Don&amp;rsquo;t know.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Before my mother&amp;rsquo;s surgery, she had said to me, &amp;ldquo;Look, if something goes wrong and I can no longer write anything worth publishing—two tickets to Switzerland.&amp;rdquo; Meaning we were going to go to Switzerland together for assisted suicide. This was on my mind the whole time. When I saw her, I would say, thinking she might understand: &amp;ldquo;Mom, I think you&amp;rsquo;re going to get better, but if not, I remember your wishes. You can trust me. I remember what I promised.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Thursday morning was the time to put this into effect.&lt;/p&gt;
&lt;p&gt;I said, is she going to get better? &amp;ldquo;We don&amp;rsquo;t know.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Okay. Is she on life-supporting treatment? &amp;ldquo;Not right now.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Is there a way that we can end her life? &amp;ldquo;We can stop feeding her.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Okay, let&amp;rsquo;s give it a few more days.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/nyt-cajal-2.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;And then she started talking, that Sunday morning, seven days after her surgery. I was home in the city, and while I was in the shower, mom called. When I got out, I picked up her voicemail, and she had said, very carefully, &amp;ldquo;Hi, Jesse, it&amp;rsquo;s Dena. Could you bring some coffee when you come?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;That was the first full sentence I&amp;rsquo;d heard from her in almost a week.&lt;/p&gt;
&lt;p&gt;So she spent five weeks in the hospital, total. She finally came home. She&amp;rsquo;s been home for, I guess, about two months now, a little less. And continues to improve. She&amp;rsquo;s walking with a cane, she&amp;rsquo;s speaking, she&amp;rsquo;s reading, she&amp;rsquo;s writing. Definitely still affected by the surgery, but also definitely not the worst that we had feared.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/nyt-cajal-brain.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I&amp;rsquo;m telling you this story, because this is about Zen training. This is about, why are we meditating? How can we use this training? How can we get it in our bodies such that when the shit hits the fan, we can put it into practice without thinking about it?&lt;/p&gt;
&lt;p&gt;Sometimes the old Zen texts seem to think that there&amp;rsquo;s no point to meditation. They seem to make fun of people who think that there is a purpose to meditation. Every once in a while, there will be a story about some monk who thinks that meditation makes your mind like a perfect mirror that reflects reality with no judgment, with no opinion. And this seems great. It both seems like a very effective state of mind to be in for responding to events, and it seems very peaceful. You don&amp;rsquo;t see a mirror freak out about what it reflects. It just reflects.&lt;/p&gt;
&lt;p&gt;Sounds lovely, right? But then you can tell really the Zen attitude is making fun of this desire to become a mirror. So there&amp;rsquo;s the contest—no actually, first I&amp;rsquo;m going to talk about this koan from the Eighth Century.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Nangaku went to his student Baso&amp;rsquo;s hut and asked, &amp;ldquo;What are you doing these days?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Baso the student says, &amp;ldquo;These days I just sit.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Nangaku the teacher: &amp;ldquo;What is the purpose of sitting in zazen?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;The purpose of sitting in zazen is to become a Buddha.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Nangaku the teacher, he gets a terracotta tile and rubs it on a rock.&lt;/p&gt;
&lt;p&gt;&amp;ldquo;What are you doing, master?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Nangaku: &amp;ldquo;Polishing a tile.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;What is the purpose of polishing a tile?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;I&amp;rsquo;m polishing it into a mirror.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;How can polishing a tile make it into a mirror?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;&amp;ldquo;How can sitting in zazen turn you into a Buddha?&amp;rdquo;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And there&amp;rsquo;s an earlier story about a poetry contest between two students of Hongren. They&amp;rsquo;re competing to become his Dharma successor. There is the learned aristocratic student Shenxiu, and there&amp;rsquo;s the working class stiff Huineng.&lt;/p&gt;
&lt;p&gt;So the learned aristocrat, he writes:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;The body is the tree of enlightenment,&lt;br&gt;
The mind is like a bright mirror’s stand;&lt;br&gt;
Time after time polish it diligently,&lt;br&gt;
So that no dust can collect.&lt;br&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The other student, the working class stiff Huineng writes,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Enlightenment is not a tree,&lt;br&gt;
The bright mirror has no stand;&lt;br&gt;
Originally there is not one thing—&lt;br&gt;
What place could there be for dust?&lt;br&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And he, of course, is considered the winner and our ancestor, right? The one who says there&amp;rsquo;s no mirror to polish, there&amp;rsquo;s no dust to remove.&lt;/p&gt;
&lt;p&gt;So what&amp;rsquo;s the point of meditation? Become a flawless mirror that perfectly reflects reality with no smudgy delusion? Or is that idea actually stupid and there is no point to meditation whatsoever? And if so, what were we doing for that last half an hour? And why?&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/nyt-Santiago-Ramon-y-Cajal.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;And even taking another step back from this, some texts really seem to criticize asking questions like this, like, &amp;ldquo;What is the purpose?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s the koan where Jizo asks Hogan, &amp;ldquo;What&amp;rsquo;s the point of going on a pilgrimage?&amp;rdquo; Hogan says, &amp;ldquo;I don&amp;rsquo;t know.&amp;rdquo; Jizo says, &amp;ldquo;Not knowing is most intimate.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;So should we be asking this at all? Should we be trying to &lt;em&gt;know&lt;/em&gt; what is the purpose of meditation, or not?&lt;/p&gt;
&lt;p&gt;Now, look, I&amp;rsquo;m Jewish. I can&amp;rsquo;t help asking questions like this, analyzing texts, getting myself involved in the arguments. There&amp;rsquo;s a two-thousand-year-old Jewish joke about this: Somebody who is considering converting to Judaism goes to the Rabbi Shammai and asks: &amp;ldquo;Explain to me the whole Torah while I stand on one foot.&amp;rdquo; Shammai is offended and chases the person out.&lt;/p&gt;
&lt;p&gt;The same person goes to Rabbi Hillel and asks the same question: &amp;ldquo;Explain all of Torah to me while I stand on one foot.&amp;rdquo; Hillel says, &amp;ldquo;What you would hate if it were done to you, don&amp;rsquo;t do it to other people. The rest is commentary. Go study it.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s a famous story because it gets both sides: Wisdom is very simple, and it&amp;rsquo;s very complex. And the complexity is worth studying.&lt;/p&gt;
&lt;p&gt;So let&amp;rsquo;s do this. Let&amp;rsquo;s boldly analyze what is the point of meditation.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/781px-Golgi_1885_Plate_XIII.jpeg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Here&amp;rsquo;s one answer: We are likely to experience old age and sickness, and we are definitely going to die. And most of us experience some anxiety about this. So maybe the purpose of meditation is to resolve this anxiety. In the old stories about Buddha, yes, for him, resolving this anxiety was the thing that motivated him to meditate. After studying meditation for six or seven years—he had learned yoga and magical powers, and he had almost starved himself to death—he sat under a tree and made the vow that he would not get up again until he had resolved this anxiety about sickness and old age and death.&lt;/p&gt;
&lt;p&gt;And his insight after eight days of sitting was that suffering is caused by thirst, by &lt;em&gt;trishna&lt;/em&gt;, by the desire for things to be different from how they are. For less of what we don&amp;rsquo;t want. For more of what we do want. And that is on a foundation of delusion, that the self is this separate and lasting &lt;em&gt;thing&lt;/em&gt; that I can guard the borders of. That I can defend from the things that I don&amp;rsquo;t want. The borders behind which I can accumulate more of the things that I do want. And by seeing that this is all an illusion, we can resolve this anxiety about the end of &lt;em&gt;this&lt;/em&gt;.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/797px-Golgi_1885_Plate_XXI.jpeg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The monk Dogen had a similar story. He&amp;rsquo;s the founder of our lineage in Japan. He was born around 1200. When he was two, his father died, and when he was seven, his mother died, leaving him an orphan. At his mother&amp;rsquo;s funeral, he watched the incense rise up into the air and disappear. And he was devastated. Does this happen to everybody? Does this happen to everything? Will this happen to me? How do I understand this? How do I live with this knowledge? So that was what initially drove him to the monastery.&lt;/p&gt;
&lt;p&gt;Now, in the 1700 years between the lifetime of Buddha and of Dogen, a lot had been written down. Dogen read a lot of this stuff—he was a big scholar—and he read all of these paradoxical texts like, &amp;ldquo;There&amp;rsquo;s no mirror for dust to settle on.&amp;rdquo; Or, &amp;ldquo;Meditating to become like a Buddha is like polishing a tile to make it a mirror. It&amp;rsquo;s hopeless.&amp;rdquo; And he thought, okay, there&amp;rsquo;s something they&amp;rsquo;re not telling me: Why are we doing this?&lt;/p&gt;
&lt;p&gt;He traveled from Japan to China when he was 23, and traveled around China for four years. He found a teacher, Rujing, who seemed to him to have the real-deal original Chinese Buddhism. He studied with Rujing, and had a great insight. And his insight was: practice and enlightenment are the same thing. He wrote:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To suppose that practice and realization are not one is a view of those outside the way; in Buddha Dharma they are one and the same. Because practice within realization occurs at the moment of practice, the practice of beginner&amp;rsquo;s mind is itself the entire original realization.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/golgi-nervous-system.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;When I first heard this slogan, I found it completely uninspiring. &amp;ldquo;Practice is enlightenment&amp;rdquo; pissed me off.&lt;/p&gt;
&lt;p&gt;At 22 I was miserable. I was depressed. I was smoking pot every day. I was failing at my job. I hated myself. I turned to meditation for help, something that would fix me. I wanted to get enlightened, because I heard that enlightenment was liberation from suffering, and I was suffering. So I checked myself into a monastery for a year. And we chanted Dogen. We read Dogen together. I kept getting this message, &amp;ldquo;practice is enlightenment.&amp;rdquo; And I thought, this is a &lt;em&gt;scam&lt;/em&gt;. Practice isn&amp;rsquo;t enlightenment. It&amp;rsquo;s a grind.&lt;/p&gt;
&lt;p&gt;So I ignored this slogan, that practice is enlightenment, for years.&lt;/p&gt;
&lt;p&gt;But after about 15 years of meditation, it started to make a little bit of sense. This is just my view of the teaching, but it&amp;rsquo;s been helpful to me, maybe it&amp;rsquo;ll be helpful to you: I am no longer trying to &lt;em&gt;become&lt;/em&gt; enlightened. I&amp;rsquo;m trying to &lt;em&gt;be&lt;/em&gt; enlightened more of the time.&lt;/p&gt;
&lt;p&gt;Like right now, I&amp;rsquo;m paying attention to you. I&amp;rsquo;m aware that I&amp;rsquo;m in this room. I am trying to speak the truth. I&amp;rsquo;m being enlightened. Then in a little bit I&amp;rsquo;ll get on the subway. I&amp;rsquo;ll shove somebody aside, I&amp;rsquo;ll get lost in thought, I&amp;rsquo;ll be deluded. And then I&amp;rsquo;ll come to, and then I&amp;rsquo;ll be enlightened again. And for sure, after spending so many years of practice, both sitting cross legged and staring at the floor and everyday practice, I am being enlightened more often than I was before.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/golgi-olfactory-bottom.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Now, I&amp;rsquo;ve read stories about people who become enlightened, and it&amp;rsquo;s a permanent change in their life. There are thousands of these stories. And I can&amp;rsquo;t say that I know for sure that this is just a folk tale. But as I get older, as I have more experience—I become more skeptical of these stories. Most of them are hundreds of years old. We don&amp;rsquo;t know these people, we don&amp;rsquo;t have a lot of independent witnesses of their behavior. And it&amp;rsquo;s a little bit suspicious that of all of the modern people who have claimed to be enlightened, almost inevitably they turned out to be alcoholics who sexually abused their students. I think we need to think very carefully about what path we are on. If that is one of the terminal subway stations, we need to figure out which station we transfer at. And I think we also just need to question whether this mindset of achieving some lasting change, as opposed to practicing being enlightened now, what effects those two things have in the long run on who we are and how we treat other people.&lt;/p&gt;
&lt;p&gt;So I&amp;rsquo;m just trying to be enlightened more, and that&amp;rsquo;s what &amp;ldquo;practice is enlightenment&amp;rdquo; means to me. When I&amp;rsquo;m practicing, I&amp;rsquo;m being enlightened.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s true that being enlightened means not suffering. When I was sitting with my mom, when she came out of surgery, or when I was talking to the doctor about maybe ending her life—those were some of the most challenging moments in my life. But I wasn&amp;rsquo;t suffering. I wasn&amp;rsquo;t wishing it was different. I wasn&amp;rsquo;t thinking, &amp;ldquo;poor me.&amp;rdquo; I wasn&amp;rsquo;t thirsting for something else, because I was in the moment. I was very aware that I was being called to respond and that I needed to bring my whole self to it. And therefore I wasn&amp;rsquo;t suffering at all, because I wasn&amp;rsquo;t thinking about, &amp;ldquo;Do I prefer this or not?&amp;rdquo; I was just responding.&lt;/p&gt;
&lt;hr&gt;

&lt;img src=&#34;https://emptysqua.re/blog/is-an-enlightened-person-afraid/nyt-cajal-4.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;So Mom&amp;rsquo;s recovering slowly, as I said. This Thanksgiving, she actually was able to come up to New Paltz and have Thanksgiving dinner with Keishin and me—which, you know, was dicey. I wasn&amp;rsquo;t sure, but it worked out fine.&lt;/p&gt;
&lt;p&gt;After Thanksgiving dinner, I saw her standing at my kitchen counter, and she was picking up one foot and putting it down, picking up one foot and putting it down. She said, &amp;ldquo;I bet you can do this for a whole minute, right?&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I said, do what?&lt;/p&gt;
&lt;p&gt;She said, &amp;ldquo;Stand on one foot for a minute.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I said yep.&lt;/p&gt;
&lt;p&gt;She said, &amp;ldquo;It&amp;rsquo;s a physical therapy thing. I can only do it for a few seconds.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;And I said, well, I better summarize the Torah real quick then.&lt;/p&gt;
&lt;p&gt;She laughed! She got it immediately. She said, &amp;ldquo;Very clever. Very well done.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;And I was aware, and I am aware, that my mother is just about the only person I know who would get that joke, who I can make that joke to. I didn&amp;rsquo;t know if she was gone or not, and she&amp;rsquo;s back. I have her, for now. Not forever, but for now. And that&amp;rsquo;s what being enlightened means for me, is to appreciate that.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Images: &lt;a href=&#34;https://publicdomainreview.org/collection/illustrations-of-the-nervous-system-golgi-and-cajal/&#34;&gt;Early Illustrations of the Nervous System by Camillo Golgi and Santiago Ramón y Cajal&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Peter&#39;s Kill in the Snow</title>
      <link>https://emptysqua.re/blog/peters-kill-snow/</link>
      <pubDate>Thu, 05 Dec 2024 21:33:41 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/peters-kill-snow/</guid>
      
      
      
      
      <description>&lt;p&gt;December 5, 2024. Peter&amp;rsquo;s Kill in the Gunks, during a snow shower. I ran out to get some photos of the snow falling in the morning light. While I was on the clifftop the storm returned and absolutely nuked me. It was impossible to keep my lens clean enough to take more photos, and the sunlight was blocked by clouds, so I came home.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/peters-kill-snow/ajdavis_20241205_7739.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/peters-kill-snow/ajdavis_20241205_7771.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/peters-kill-snow/ajdavis_20241205_7837.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Supper&#39;s Ready</title>
      <link>https://emptysqua.re/blog/gan-suppers-ready/</link>
      <pubDate>Sun, 24 Nov 2024 22:27:12 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/gan-suppers-ready/</guid>
      
      
      
      
      <description>&lt;p&gt;November 23 and 24, 2024. &lt;a href=&#34;https://www.instagram.com/zganee/&#34;&gt;Gan&lt;/a&gt; attempting &lt;a href=&#34;https://www.mountainproject.com/route/105877392/suppers-ready&#34;&gt;Supper&amp;rsquo;s Ready&lt;/a&gt;, a 5.12a climb in the Gunks, with her climbing partner &lt;a href=&#34;https://www.instagram.com/kenjkendel/&#34;&gt;Kenny&lt;/a&gt;.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-010.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-011.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gan-suppers-ready/suppers-ready-012.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
&lt;p&gt;P.S.: On November 25, &lt;a href=&#34;https://www.instagram.com/p/DC0UTIENIcZ/&#34;&gt;she sent&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Predictive Scaling in MongoDB Atlas, an Experiment</title>
      <link>https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/</link>
      <pubDate>Thu, 31 Oct 2024 16:52:03 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/</guid>
      
      
      
      
      <description>&lt;p&gt;At MongoDB, we experimented to see if we could predict each DBaaS customer&amp;rsquo;s demand fluctuations, and auto-scale them using this foreknowledge. Senior Data Scientist Matthieu Humeau and I spoke about this experiment at &lt;a href=&#34;https://www.datacouncil.ai/austin&#34;&gt;Data Council&lt;/a&gt; and &lt;a href=&#34;https://nycsystems.xyz/october-2024.html&#34;&gt;NYC Systems&lt;/a&gt;. Here&amp;rsquo;s the video from NYC Systems, and a written version is below.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/VLi9MHnBJzQ?si=RPxsPt1SjPNX8tkM&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;hr&gt;
&lt;h1&gt;Replica Sets&lt;/h1&gt;

&lt;p&gt;MongoDB is generally deployed as a group of servers, where one is the primary and at least two are secondaries. The client sends all writes to the primary, and the secondaries replicate those writes, usually within milliseconds. The client can read from the primary or the secondaries. So secondaries can take some of the query load, and they&amp;rsquo;re hot standbys. If the primary fails, a secondary automatically becomes primary within a few seconds, with zero data loss.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/replica-set.png&#34; alt=&#34;A drawing of three servers. One is primary, the others are secondaries. Replication goes from the primary to the secondaries. A MongoClient reads and writes at the primary and also reads from the secondaries.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;MongoDB replica set.&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;h1&gt;Atlas&lt;/h1&gt;

&lt;p&gt;MongoDB is free and open source, you can download it and deploy a replica set yourself, and lots of people do. But these days people mostly use our cloud service, &lt;a href=&#34;https://www.mongodb.com/products/platform/atlas-database&#34;&gt;MongoDB Atlas&lt;/a&gt;. Atlas started out as a database-as-a-service. Now we call it a Developer Data Platform because it offers a lot more than a database; we have triggers and events and streaming analysis and edge computing and vector search. But this experiment focuses on the DBaaS.&lt;/p&gt;
&lt;p&gt;The DBaaS is multi-region—customers can spread their data around the world or locate it close to their users—and it&amp;rsquo;s multi-cloud, it runs on AWS, GCP, and Azure. A customer can even deploy a replica set that includes servers in multiple clouds at once.&lt;/p&gt;
&lt;p&gt;MongoDB&amp;rsquo;s cloud is actually Amazon&amp;rsquo;s, Microsoft&amp;rsquo;s, or Google&amp;rsquo;s cloud.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/business-model.png&#34; alt=&#34;A flowchart. At the top a customer sends three dollars to MongoDB, which sends two dollars to Amazon, Google, and Microsoft.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;MongoDB&amp;rsquo;s secret business model (not drawn to scale).&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Atlas customers decide how many MongoDB servers to deploy in Atlas, what clouds to deploy them in, and what size of server: how many CPUs, how much RAM, and so on. Each server in a replica set must use the same tier. (&lt;a href=&#34;https://www.mongodb.com/blog/post/introducing-ability-independently-scale-atlas-analytics-node-tiers&#34;&gt;With exceptions&lt;/a&gt;.) We charge customers according to their choices: how many servers, what size, and how many hours they&amp;rsquo;re running. Of course, most of the money we charge our customers, we then pay to the underlying cloud providers. Those providers charge &lt;strong&gt;us&lt;/strong&gt; according to the number and size of servers and how long they&amp;rsquo;re running. If we could save money by anticipating each customer&amp;rsquo;s needs and perfectly scaling their server sizes up and down, according to their changing demands, that would save our customers money and reduce our carbon emissions.&lt;/p&gt;
&lt;table class=&#34;table table-striped table-bordered&#34;&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Storage&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;RAM&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;vCPUs&lt;/th&gt;
&lt;th style=&#34;text-align:right&#34;&gt;Base Price&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;M10&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;10 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$0.08/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M20&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;20 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;4 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$0.20/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M30&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;40 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$0.54/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M40&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;80 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;16 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;4 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$1.04/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M50&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;160 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;32 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;8 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$2.00/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M60&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;320 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;64 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;16 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$3.95/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M80&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;750 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;128 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;32 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$7.30/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M140&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1000 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;192 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;48 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$10.99/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M200&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;1500 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;256 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;64 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$14.59/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M300&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;2000 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;384 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;96 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$21.85/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M400&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;3000 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;488 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;64 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$22.40/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;M700&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;4000 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;768 GB&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;96 vCPUs&lt;/td&gt;
&lt;td style=&#34;text-align:right&#34;&gt;$33.26/hr&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;div style=&#34;text-align:center; margin-bottom: 1em&#34;&gt;
&lt;em&gt;MongoDB Atlas-on-AWS pricing&lt;/em&gt;
&lt;/div&gt;
&lt;p&gt;We sell MongoDB server sizes as a set of &amp;ldquo;tiers&amp;rdquo;, those are named like M10, M20, and so on, on the left. Those map to specific instance sizes in the cloud provider, so an M10 is a certain size of AWS server, and we charge &lt;a href=&#34;https://www.mongodb.com/pricing&#34;&gt;a certain price on AWS&lt;/a&gt;. If the customer chooses to deploy their M10 on Azure or GCP then the size and price will be slightly different.&lt;/p&gt;
&lt;h1&gt;Manually-Triggered Scaling&lt;/h1&gt;

&lt;p&gt;A customer can change their server size with zero downtime. Here&amp;rsquo;s the process:&lt;/p&gt;
&lt;div style=&#34;display: flex&#34;&gt;
&lt;div style=&#34;flex: 1; padding-right: 1em&#34;&gt;
&lt;ul&gt;
&lt;li&gt;The customer clicks a button or executes an API call to resize their servers to a chosen tier.&lt;/li&gt;
&lt;li&gt;Atlas chooses a secondary and takes it offline,&lt;/li&gt;
&lt;li&gt;detaches its network storage,&lt;/li&gt;
&lt;li&gt;restarts it with a different server size,&lt;/li&gt;
&lt;li&gt;reattaches the storage,&lt;/li&gt;
&lt;li&gt;and waits for it to catch up to the primary, by replaying all the writes it missed while it was down.&lt;/li&gt;
&lt;li&gt;Atlas scales the other secondary likewise.&lt;/li&gt;
&lt;li&gt;Atlas tells the primary to become a secondary and &lt;a href=&#34;https://www.mongodb.com/docs/manual/reference/method/rs.stepDown/#election-handoff&#34;&gt;hand off its responsibilities to another server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Atlas scales the former primary.&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
&lt;div style=&#34;flex: 1&#34;&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/gaston-rebuffat.jpg&#34; alt=&#34;Black and white photo of 1940s-era climber ascending a vertical rock wall&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://www.flickr.com/photos/nordique/4889485440&#34;&gt;Gaston Rebuffat.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;/div&gt;
&lt;/div&gt;
&lt;p&gt;The whole process takes about 15 minutes, and the customer can read and write their data normally throughout. Usually the customer&amp;rsquo;s application doesn&amp;rsquo;t even notice the scaling operation, except that once scaling is complete, performance is faster or slower, and the price is different.&lt;/p&gt;
&lt;h1&gt;Atlas Autoscaling Today&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/bolton-machine.png&#34; alt=&#34;Yellowed old patent diagram of a two-man vehicle, steered by the man in front and hand-cranked by the man behind&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://publicdomainreview.org/collection/cycling-art/&#34;&gt;Today&amp;rsquo;s state of the art.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Atlas customers can opt in to autoscaling, but today&amp;rsquo;s autoscaling is &lt;strong&gt;infrequent&lt;/strong&gt; and &lt;strong&gt;reactive&lt;/strong&gt;. The rules are:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;scale up by one tier after 1 hour of overload,&lt;/li&gt;
&lt;li&gt;scale down by one tier after 4 hours of underload.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Overload is defined as over 75% CPU or RAM utilization, and underload is less than 50% of either. (&lt;a href=&#34;https://www.mongodb.com/docs/atlas/cluster-autoscaling/&#34;&gt;Details here&lt;/a&gt;.) Atlas only scales between adjacent tiers, e.g. if an M60 replica set is underloaded, Atlas will scale it down to M50, but not directly to any tier smaller than that. If the customer&amp;rsquo;s demand changes dramatically, it takes several scaling operations to reach the optimum server size. This means servers can be overloaded or underloaded for long periods! An underloaded server is a waste of money. An overloaded server is bad for performance, and if it&amp;rsquo;s really slammed it could interfere with the scaling operation itself. So Matthieu and I envisioned&amp;hellip;&lt;/p&gt;
&lt;h1&gt;The Ideal Future&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/flying-machine.png&#34; alt=&#34;Yellowed old patent diagram of a human-powered flying machine with feathered wings and tail.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://publicdomainreview.org/collection/cycling-art/&#34;&gt;The prototype.&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;In the ideal future, we would forecast each replica set&amp;rsquo;s resource needs. We could scale a replica set up just before it&amp;rsquo;s overloaded, and scale it down as soon as it&amp;rsquo;s underloaded. We would scale it directly to the right server size, skipping intermediate tiers. We&amp;rsquo;d always use the cheapest size that could meet the customer&amp;rsquo;s demand.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/predictive-scaling-drawing.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Hand-drawn chart, a smooth line labeled &amp;ldquo;demand&amp;rdquo; fluctuates up and down, a rectilinear line labeled &amp;ldquo;capacity&amp;rdquo; goes up and down in sharp steps, always a bit above the demand line&#34; title=&#34;&#34;&gt;&lt;h1&gt;Predictive Scaling Experiment&lt;/h1&gt;

&lt;p&gt;Matthieu and I performed an experiment over the last year to see if predictive scaling was possible in the MongoDB Atlas DBaaS.&lt;/p&gt;
&lt;p&gt;The experiment was possible because Atlas keeps servers&amp;rsquo; performance metrics in a data warehouse. We have a couple of years of data about all servers&amp;rsquo; CPU and memory utilization, the numbers of queries per second, inserts per second, etc., all at one-minute intervals. Atlas has about 170,000 replica sets now, each with at least three servers, so it&amp;rsquo;s a stout data set. We chose 10,000 replica sets where customers had opted in to the existing reactive auto-scaler, and we analyzed their 2023 history. We split the history into a training period and a testing period, as usual with machine learning, and trained models to forecast the clusters&amp;rsquo; demand and CPU utilization. (CPU is the simplest and most important metric; eventually we&amp;rsquo;ll forecast RAM, disk I/O, and so on.) Once we&amp;rsquo;d prototyped this predictive scaler, we estimated how it would&amp;rsquo;ve performed during the testing period, compared to the reactive scaler that was running at that time.&lt;/p&gt;
&lt;p&gt;The prototype had three components:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Forecaster:&lt;/strong&gt; tries to predict each cluster&amp;rsquo;s future workload.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Estimator:&lt;/strong&gt; estimates CPU% for any workload, any instance size.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Planner:&lt;/strong&gt; chooses cheapest instance that satisfies forecasted demand.&lt;/li&gt;
&lt;/ul&gt;

&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/predictive-scaler-components.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Diagram of components, explained below&#34; title=&#34;&#34;&gt;&lt;p&gt;For each replica set, its history is an input to the Short-Term and Long-Term Forecasters. (I&amp;rsquo;ll explain why we have two Forecasters soon.) The Forecasters must be retrained every few minutes, as new samples arrive.&lt;/p&gt;
&lt;p&gt;From the same data warehouse we sampled 25 million points in time from any replica set in Atlas. Each of these samples includes a count of operations per second (queries, inserts, updates, etc.), an instance size, and the CPU utilization at that moment. We used this to train the Estimator, which can predict the CPU utilization for any amount of customer demand and any instance size. This is a hard problem, since we can&amp;rsquo;t see our customers&amp;rsquo; queries or their data, but we did our best. The Estimator must be retrained rarely, when there&amp;rsquo;s new hardware available, or a more efficient version of the MongoDB software. (Eventually we plan to train an Estimator for each MongoDB version.)&lt;/p&gt;
&lt;p&gt;The Forecasters and Estimator cooperate to predict each replica set&amp;rsquo;s future CPU on any instance size available. E.g., they might predict that 20 minutes in the future, some replica set will use 90% CPU if it&amp;rsquo;s on M40 servers, and 60% CPU if it&amp;rsquo;s on more powerful M50 servers.&lt;/p&gt;
&lt;h2&gt;Predictive Scaling: Planner&lt;/h2&gt;

&lt;p&gt;Let&amp;rsquo;s look at the Planner in more detail. Here&amp;rsquo;s a forecasted workload, it&amp;rsquo;s forecasted to rise and then fall.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/forecasted-workload.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;A hand-drawn chart, CPU versus time. A line labeled &amp;ldquo;M40 estimate&amp;rdquo; rises above 75% CPU, then falls. A line labeled &amp;ldquo;M50&amp;rdquo; rises and falls but never rises above 75% CPU. A line labeled &amp;ldquo;plan&amp;rdquo; follows the M40 line except where it would cross 75% CPU. There, it follows the M50 line.&#34; title=&#34;&#34;&gt;&lt;p&gt;So the Planner&amp;rsquo;s plan is to use M40 servers until it would be overloaded, then switch to M50 during the peak, then switch back. Notice the replica set should start scaling up 15 minutes &lt;strong&gt;before&lt;/strong&gt; the overload arrives, so the scale-up is complete in time to avoid overload. It starts scaling down as soon as the risk of overload has passed.&lt;/p&gt;
&lt;h2&gt;Predictive Scaling: Long-Term Forecaster&lt;/h2&gt;

&lt;p&gt;Our goal is to forecast a customer&amp;rsquo;s CPU utilization, but we can&amp;rsquo;t just train a model based on recent fluctuations of CPU, because that would create a circular dependency: if we predict a CPU spike and scale accordingly, we eliminate the spike, invalidating the forecast. Instead we forecast metrics unaffected by scaling, which we call &amp;ldquo;customer-driven metrics&amp;rdquo;, e.g. queries per second, number of client connections, and &lt;a href=&#34;https://www.mongodb.com/docs/ops-manager/current/review-available-metrics/&#34;&gt;scanned-objects rate&lt;/a&gt;. We assume these are &lt;strong&gt;independent&lt;/strong&gt; of instance size or scaling actions. (Sometimes this is false; a saturated server exerts backpressure on the customer&amp;rsquo;s queries. But customer-driven metrics are normally exogenous.)&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/seasonal.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;A chart showing queries per second over several weeks. There are obvious weekly patterns, where weekdays have peaks and weekends don&amp;rsquo;t, and obvious daily spikes each weekday.&#34; title=&#34;&#34;&gt;&lt;p&gt;Our forecasting model, MSTL (multi-seasonal trend decomposition using LOESS), extracts components from the time series for each customer-driven metric for an individual replica set. It separates long-term trends (e.g., this replica set&amp;rsquo;s query load is steadily growing) and &amp;ldquo;seasonal&amp;rdquo; components (daily and weekly) while isolating residuals. We handle these residuals with a simple autoregressive model from the ARIMA family.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/MSTL.png&#34; alt=&#34;A chart showing an observed history of demand fluctuating over several weeks. Beneath it is a smooth line labeled &amp;#39;trend&amp;#39;, then a periodic wavy line labeled &amp;#39;daily&amp;#39;, a line with longer waves labeled &amp;#39;weekly&amp;#39;, and a semi-random-looking line labeled &amp;#39;residuals&amp;#39;.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;MSTL (multi-seasonal trend decomposition using LOESS)&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;By combining these components, we forecast each metric separately, creating a &amp;ldquo;Long-Term Forecaster&amp;rdquo; for each. Despite the name, the Long-Term Forecaster doesn&amp;rsquo;t project far into the future; it&amp;rsquo;s trained on several weeks of data to capture patterns, then predicts a few hours ahead.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/percent-seasonal.png&#34; alt=&#34;Three pie charts: 3% of servers have strong hourly seasonality and 5% have weak hourly seasonality. 24% of servers have strong daily seasonality and 32% have weak daily seasonality. 7% of servers have strong weekly seasonality and 17% have weak weekly seasonality.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;How often is demand seasonal?&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Most Atlas replica sets have daily seasonality. About 25% have weekly seasonality. Generally if a replica set has weekly seasonality it &lt;strong&gt;also&lt;/strong&gt; has daily seasonality. Hourly seasonality is rare, and anyway it isn&amp;rsquo;t helpful for planning a scaling operation that takes a quarter-hour. Replica sets with sufficient daily/weekly seasonality are predictable by the Long-Term Forecaster.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/example-forecast.png&#34; alt=&#34;A chart of queries per second over time, the same as shown earlier. In the final day of the chart is a line representing actual history, and a closely-matching green line labeled &amp;#39;forecast&amp;#39;.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Example &amp;ldquo;long-term&amp;rdquo; forecast.&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;But only some replica sets have seasonality! For non-seasonal clusters, the Long-Term Forecaster&amp;rsquo;s prediction of customer-driven metrics is unusable.&lt;/p&gt;
&lt;table class=&#34;table table-striped table-bordered&#34;&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;&lt;/th&gt;
      &lt;th&gt;Seasonal Clusters&lt;/th&gt;
      &lt;th&gt;Non-seasonal Clusters&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tr&gt;
    &lt;td&gt;Connections&lt;/td&gt;
    &lt;td&gt;3%&lt;/td&gt;
    &lt;td&gt;&lt;span style=&#34;color:red&#34;&gt;50%&lt;/span&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Query Rate&lt;/td&gt;
    &lt;td&gt;19%&lt;/td&gt;
    &lt;td&gt;&lt;span style=&#34;color:red&#34;&gt;71%&lt;/span&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;Scanned Objects Rate&lt;/td&gt;
    &lt;td&gt;27%&lt;/td&gt;
    &lt;td&gt;&lt;span style=&#34;color:red&#34;&gt;186%&lt;/span&gt;&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;&lt;/td&gt;
    &lt;td&gt;&lt;span style=&#34;color:red; font-weight: bold&#34;&gt;&amp;uparrow; unusable&lt;/span&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;p&gt;So we added a &amp;ldquo;self-censoring&amp;rdquo; mechanism to our prototype: the Long-Term Forecaster scores its own confidence based on its recent accuracy, and only trusts its prediction if its recent error has been small.&lt;/p&gt;
&lt;h2&gt;Predictive Scaling: Short-Term Forecaster&lt;/h2&gt;

&lt;p&gt;What can we do when the Long-Term Forecaster isn&amp;rsquo;t trustworthy? We didn&amp;rsquo;t want to fall back to purely-reactive scaling; we can still do better than that. We prototyped a &amp;ldquo;Short-Term Forecaster&amp;rdquo;: this model uses only the last hour or two of data and does trend interpolation. We compared this to a naïve baseline Forecaster, which assumes the future will look like the last observation, and found that trend interpolation beats the baseline 68% of the time (29% reduction in error).&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/short-term-forecaster.png&#34; alt=&#34;A chart with a spiky and semi-random-looking line labeled &amp;#39;query executor scanned objects per second&amp;#39;. In the final day of the chart, there are flat green lines labeled &amp;#39;baseline&amp;#39;, which show a forecast that assumes each measurement will remain the same for two hours. Angled red lines labeled &amp;#39;forecast&amp;#39; assume the current trend will continue for two hours, these are a closer match to reality than the baseline forecast.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Approximation of local trends for near future forecast.&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;h2&gt;Predictive Scaling: Estimator&lt;/h2&gt;

&lt;p&gt;The Forecasters predict customer demand, but we still need to know whether CPU utilization will be within the target range (50-75%). That&amp;rsquo;s the Estimator&amp;rsquo;s job. The Estimator takes the forecasted demand and an instance size (defined by CPU and memory), and outputs projected CPU. Using a regression model based on boosted decision trees trained on millions of samples, we&amp;rsquo;ve achieved fairly accurate results. For around 45% of clusters, our error rate is under 7%, allowing us to make precise scaling decisions. For another 42%, the model is somewhat less accurate but useful in extreme cases. We exclude the remaining 13% of clusters with higher error rates from predictive scaling.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/estimator-example.png&#34; alt=&#34;A chart with four inputs on the left and the output on the right. The four inputs are charts of metrics over time: connections created per second, queries per second, documents updated per second, and scanned objects per second. The output is a line labeled &amp;#39;historical CPU&amp;#39;, and a closely-matching line labeled &amp;#39;estimator prediction&amp;#39;.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Example of input and output of Estimator.&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;h2&gt;Predictive Scaling: Putting It All Together&lt;/h2&gt;

&lt;p&gt;With both forecasts and CPU estimates, the Planner can choose the cheapest instance size that we guess can handle the next 15 minutes of customer demand without exceeding 75% CPU. Our experiment showed that this predictive scaler, compared to the reactive scaler in use during the test period, would&amp;rsquo;ve stayed closer to the CPU target and reduced over- and under-utilization. For the average replica set it saved 9 cents an hour. That translates to millions of dollars a year if the predictive scaler were enabled for all Atlas users.&lt;/p&gt;
&lt;table class=&#34;table table-striped table-bordered&#34;&gt;
    &lt;thead&gt;
        &lt;tr&gt;
            &lt;th&gt;&lt;/th&gt;
            &lt;th&gt;Predictive auto-scaler&lt;/th&gt;
            &lt;th&gt;Reactive auto-scaler&lt;/th&gt;
        &lt;/tr&gt;
    &lt;/thead&gt;
    &lt;tbody&gt;
        &lt;tr&gt;
            &lt;td&gt;Average distance from 75% CPU target&lt;/td&gt;
            &lt;td style=&#34;background-color: #e0f2e0&#34;&gt;18.6%&lt;/td&gt;
            &lt;td&gt;32.1%&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Average under-utilization&lt;/td&gt;
            &lt;td style=&#34;background-color: #e0f2e0&#34;&gt;18.3%&lt;/td&gt;
            &lt;td&gt;28.3%&lt;/td&gt;
        &lt;/tr&gt;
        &lt;tr&gt;
            &lt;td&gt;Average over-utilization&lt;/td&gt;
            &lt;td style=&#34;background-color: #e0f2e0&#34;&gt;0.4%&lt;/td&gt;
            &lt;td&gt;3.8%&lt;/td&gt;
        &lt;/tr&gt;
    &lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;What&amp;rsquo;s next? Matthieu and other MongoDB people are improving the Estimator&amp;rsquo;s accuracy by adding more customer-driven metrics, and estimating more hardware metrics: not just CPU, but also memory, and perhaps co-modeling CPU and memory to capture interactions between them. We want to investigate the minority of customers with bad estimates and ensure the Estimator works for them too. We&amp;rsquo;ll try building specialized Estimators for each cloud provider and each MongoDB version. MongoDB can&amp;rsquo;t see our customers&amp;rsquo; data or queries, but we can gather statistics on &lt;a href=&#34;https://www.mongodb.com/docs/upcoming/core/query-shapes/&#34;&gt;query shapes&lt;/a&gt;—maybe we could use this to improve estimation.&lt;/p&gt;
&lt;p&gt;I can&amp;rsquo;t tell you a release date. Par for a public blog post, but still disappointing, I know. In this case we honestly need more experiments before we can plan a release. A private beta for a few customers will come soon. Before we can unleash a complex algorithm on our customers&amp;rsquo; replica sets we need a lot more confidence in its accuracy, and a lot of safeguards. We&amp;rsquo;ll always need the reactive auto-scaler to handle unexpected changes in demand. But I&amp;rsquo;m excited at the prospect of eventually saving a ton of money and electricity with precise and proactive auto-scaling.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/propelling-carriage.png&#34; alt=&#34;A yellowed patent diagram from 1830 showing a large bicycle. One man standing upright is propelling the contraption with foot pedals and holds a two-handed crank labeled &amp;#39;steering wheel&amp;#39;. A man behind him lies prone and appears to use only his feet. Both wear tophats.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Predictive and reactive auto-scalers, cooperating.&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Further reading:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://emptysqua.re/blog/e-store-and-p-store/&#34;&gt;This work was heavily inspired by Rebecca Taft&amp;rsquo;s PhD thesis&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;Also interesting: &lt;a href=&#34;https://christofidi.github.io/docs/2023_SoCC_Presentation.pdf&#34;&gt;Is Machine Learning Necessary for Cloud Resource Usage Forecasting?&lt;/a&gt;, ACM Symposium on Cloud Computing 2023.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://mongodb.com/atlas&#34;&gt;MongoDB Atlas&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Cycling images from &lt;a href=&#34;https://publicdomainreview.org/collection/cycling-art/&#34;&gt;Public Domain Review&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Burial Vessel</title>
      <link>https://emptysqua.re/blog/green-burial-vessel/</link>
      <pubDate>Fri, 25 Oct 2024 20:01:42 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/green-burial-vessel/</guid>
      
      
      
      
      <description>&lt;p&gt;&lt;a href=&#34;https://www.instagram.com/p/C_nZFNPRR1x/&#34;&gt;Elizabeth Gross&lt;/a&gt; weaving a vessel for green burial.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/green-burial-vessel/elizabeth-gross-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/green-burial-vessel/elizabeth-gross-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/green-burial-vessel/elizabeth-gross-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Fossymandias</title>
      <link>https://emptysqua.re/blog/fossymandias/</link>
      <pubDate>Mon, 21 Oct 2024 09:27:10 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/fossymandias/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/fossymandias/younger-memnon.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Photo of a fragment of a statue of an ancient Egyptian pharaoh&#34; title=&#34;&#34;&gt;&lt;p&gt;I met a programmer of an antique tongue&lt;br&gt;
Who said: A vast, maintainerless pile of code&lt;br&gt;
Lies in GitHub. All around it flung&lt;br&gt;
Half-finished branches where development slowed,&lt;br&gt;
And READMEs full of bombast, heralding features&lt;br&gt;
Tell that its authors once with passion toiled&lt;br&gt;
Without release, their hearts chained to their object,&lt;br&gt;
Until their hearts gave out, their ardor spoiled:&lt;br&gt;
And in the manual these words appear:&lt;br&gt;
&amp;ldquo;My name is Fossymandias, the Greatest Project:&lt;br&gt;
Look on my works, ye Mighty, and despair!&amp;rdquo;&lt;br&gt;
No further boasts were heard from the engineers&lt;br&gt;
Of that colossal wreck. Boundless and bare&lt;br&gt;
The empty commit log stretches countless years.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Image: The Younger Memnon, © The Trustees of the British Museum. Shared under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) licence.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Climbing at Bonticou Crag</title>
      <link>https://emptysqua.re/blog/bonticou-climbing/</link>
      <pubDate>Sat, 19 Oct 2024 22:24:45 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/bonticou-climbing/</guid>
      
      
      
      
      <description>&lt;p&gt;October 19, 2024. Climbing undocumented routes in the Gunks.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-climbing/bonticou-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-climbing/bonticou-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-climbing/bonticou-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-climbing/bonticou-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-climbing/bonticou-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-climbing/bonticou-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Join Me At NYC Systems October 17</title>
      <link>https://emptysqua.re/blog/nyc-systems-october-2024/</link>
      <pubDate>Thu, 10 Oct 2024 17:36:40 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/nyc-systems-october-2024/</guid>
      
      
      
      
      <description>&lt;p&gt;New York&amp;rsquo;s hottest club is NYC Systems. I&amp;rsquo;m speaking there October 17 with my colleague Matthieu Humeau. We&amp;rsquo;ll describe our experiment with predictive auto-scaling in &lt;a href=&#34;https://www.mongodb.com/products/platform/atlas-database&#34;&gt;MongoDB Atlas&lt;/a&gt;. The other presenter that night will be Andrew Werner, talking about instrumentation of compiled code. Unfortunately for you, registrations filled up a
half hour after they opened. &lt;a href=&#34;https://nycsystems.xyz/october-2024.html&#34;&gt;Join the waitlist&lt;/a&gt;! Or &lt;a href=&#34;https://www.youtube.com/watch?v=SqyLP1CZDFk&#34;&gt;watch our presentation&lt;/a&gt; on the same topic at the Data Council conference this spring.&lt;/p&gt;
&lt;p&gt;NYC Systems is a tech talk series recently begun by &lt;a href=&#34;https://eatonphil.com/&#34;&gt;Phil Eaton&lt;/a&gt;. Phil also created a &lt;a href=&#34;https://eatonphil.com/bookclub.html&#34;&gt;giant online book club&lt;/a&gt; that reads books about software internals. He&amp;rsquo;s talented at organizing communities, I&amp;rsquo;ve been observing from the sidelines and I&amp;rsquo;m impressed.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Antlion Crag</title>
      <link>https://emptysqua.re/blog/antlion-crag/</link>
      <pubDate>Sun, 22 Sep 2024 18:11:44 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/antlion-crag/</guid>
      
      
      
      
      <description>&lt;p&gt;September 21, 2024. Chris on Refrigerator Crack 5.10a and Cloud Nine 5.9 at &lt;a href=&#34;https://gunksclimbers.org/gcc-projects/antlion/&#34;&gt;the new Ant Lion Crag in the Gunks&lt;/a&gt;.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/antlion-crag/ant-lion-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/antlion-crag/ant-lion-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/antlion-crag/ant-lion-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/antlion-crag/ant-lion-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Are You Prepared To Save A Life?</title>
      <link>https://emptysqua.re/blog/are-you-prepared-to-save-a-life/</link>
      <pubDate>Mon, 09 Sep 2024 22:55:07 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/are-you-prepared-to-save-a-life/</guid>
      
      
      
      
      <description>&lt;p&gt;You can find yourself in a critical situation without warning. Is Zen practice preparing you?&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a dharma talk I gave to the &lt;a href=&#34;https://villagezendo.org/&#34;&gt;Village Zendo&lt;/a&gt; on September 5, 2024, about the classic koan, &amp;ldquo;Nansen kills the cat.&amp;rdquo; The first couple seconds of the video were cut off, sorry. There&amp;rsquo;s a written version below.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/p73Xlvc5cjE?si=ONSIDE6VfmZX87Z4&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen style=&#34;margin-bottom: 1em&#34;&gt;&lt;/iframe&gt;
&lt;hr&gt;
&lt;p&gt;At the monastery on Mount Nansen, the monks of the eastern and western halls were arguing about a cat. This was late morning sometime around the year 830, in southern China. It was a cold winter, the monks didn&amp;rsquo;t have much to do. They were gathered in the kitchen, the warmest place in the monastery, and their boredom and isolation had boiled over and they started some argument about this cat. The abbot, Master Nansen, heard them. He stormed in and grabbed the cat. He picked up a big cleaver from the kitchen counter and said, &amp;ldquo;If you can say a word of Zen, I won&amp;rsquo;t kill it.&amp;rdquo; All the monks froze. He looked around. There wasn&amp;rsquo;t a word. He pinned the cat to the counter and &lt;strong&gt;THUNK&lt;/strong&gt; cut it in half. Its hind legs, disconnected, twitched. Its pink guts fell on the floor in a tangle, jiggling. The monks stared. One of them ran out the door and threw up in the snow.&lt;/p&gt;
&lt;p&gt;What were these stupid monks arguing about? Who should feed the cat? Whether a cat belongs in a Zen temple?&lt;/p&gt;
&lt;p&gt;In a commentary, Tenkei said they were arguing over whether the cat has Buddha-nature or not. If so, Master Nansen&amp;rsquo;s teaching cut through that argument. No Buddha-nature, no not-Buddha-nature. No cat. Or as Dogen said, Nansen didn&amp;rsquo;t cut through the cat, he cut through &amp;ldquo;Buddha&amp;rdquo;. I hope the stupid monks got the message. The poor cat didn&amp;rsquo;t decide to be an example in a philosophy lesson. It had sneaked into the monastery one night, following the smell of the mice who lived in its walls in the winter, eating the barley from the temple storehouse. The cat had been skittish at first but it came to trust the monks. And Nansen murdered it.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/are-you-prepared-to-save-a-life/nansen.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Ink brush painting of fierce monk with a sword in one hand, and a live cat in the other.&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Nanquan_Puyuan#/media/File:Shokei_Nansen_zanbyo_1495_RJP_438_detail.jpg&#34;&gt;Here&amp;rsquo;s a 15th-Century Japanese painting of Nansen and the cat&lt;/a&gt;. He has a nice sword here, but why does a monk have a sword? I think he used a kitchen knife.&lt;/p&gt;
&lt;p&gt;Did Nansen actually do this? Or did he maybe pantomime it? I&amp;rsquo;ve read essays on this koan that say this ancestor we revere wouldn&amp;rsquo;t actually murder a cat to teach his students. Wansong wrote this 250 years after the incident:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Eminent Xin of the Liao Dynasty criticized Nansen&amp;rsquo;s group for killing a living being, committing wrongdoing. Head Monk Wen, on the other hand, helped Nansen out, saying, &amp;ldquo;An ancient text has it that he just made the gesture of cutting—how could he have simply cut it in two with one stroke, sending fresh blood gushing?&amp;rdquo; In these two critiques of the ancestor, Mr. Wen&amp;rsquo;s fault is the graver, whereas Mr. Xin&amp;rsquo;s fault is the lesser.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In other words, according to Wansong, go ahead and say it was wrong to kill the cat, but don&amp;rsquo;t say he didn&amp;rsquo;t really do it!&lt;/p&gt;
&lt;p&gt;Gerry Shishin Wick agrees, he said, &amp;ldquo;Contrary to the opinions of some modern Zen teachers, that knife did cut the fur, the skin, the muscle, and the viscera.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;Why are these people insisting Nansen really killed the cat? This is about the stakes of our practice. We spend thousands of hours staring at the floor. Are we or aren&amp;rsquo;t we training ourselves to respond in life or death situations?&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;I&amp;rsquo;ve been rock climbing for a few years and I do it in a style called &amp;ldquo;trad&amp;rdquo; or &amp;ldquo;traditional&amp;rdquo; lead climbing, which is fairly dangerous. Trad climbing involves constant risk assessment and managing your fear.&lt;/p&gt;
&lt;p&gt;(Note to readers: this section is best watched on video. I used props.) When I&amp;rsquo;m leading a climbing route, I climb up a more-or-less vertical cliff. I wear a harness and I tie the rope to it, which will catch me comfortably if I fall. My partner has the rope going through a belay device. The belayer can slowly give out slack, or when I fall my belayer uses the device to catch me. The device is attached to the belayer&amp;rsquo;s harness so their weight counters my weight and stops my fall.&lt;/p&gt;
&lt;p&gt;But in the beginning, the belayer can&amp;rsquo;t do anything, because I&amp;rsquo;m climbing from next to the belayer to above them. The first 10 or 15 feet of a trad climb is always totally unprotected, the rope and the belayer are useless, you just must not fall. As soon as I can, I find a crack in the rock and I place some piece of protection there. Usually it&amp;rsquo;s a cam. I carry a dozen of these with me as I climb, in different sizes. They work like harpoons: you retract the lobes, stick the cam into the crack in the rock, and then it&amp;rsquo;s stuck. Later when the climb is finished, you remove the cam by retracting the lobes again and pull it out.&lt;/p&gt;
&lt;p&gt;As I climb up the cliff, I place protection as I go, and clip the rope through it. So if I fall, the belayer locks their belay device, and I fall down past my last piece of protection and eventually the rope stops me. Hopefully, I stop before I hit a ledge or the ground and get hurt. As I climb above the last piece I placed, I always risk a fall at least twice that distance. If I climbed 5 feet above my last piece, I&amp;rsquo;ll fall 10 feet. Actually, it&amp;rsquo;s always much more than twice as far, because the rope is elastic and there&amp;rsquo;s always slack in the rope, and I might pull my belayer off the ground some distance before I stop.&lt;/p&gt;
&lt;p&gt;The worst risk is when I climb through a section of rock where there&amp;rsquo;s no good cracks for placing protection, or the cracks are the wrong shape. I might place a cam into a funny-shaped crack and think it&amp;rsquo;s the right size cam and it will hold me, but it&amp;rsquo;s not and it pops out and I fall much farther. This has never happened to me, but it happens to climbers all the time, and it&amp;rsquo;s my main fear when I&amp;rsquo;m climbing.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a video of me lead climbing &lt;a href=&#34;https://www.mountainproject.com/route/113327448/ariels-brassiere&#34;&gt;a route&lt;/a&gt; that was easy and chill for me. When the video starts I already have one cam in the rock. I make a move, find the next good crack, place another cam, and keep climbing:&lt;/p&gt;
&lt;blockquote class=&#34;instagram-media&#34; data-instgrm-captioned data-instgrm-permalink=&#34;https://www.instagram.com/reel/CtsFlphN_b8/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; data-instgrm-version=&#34;14&#34; style=&#34; background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);&#34;&gt;&lt;div style=&#34;padding:16px;&#34;&gt; &lt;a href=&#34;https://www.instagram.com/reel/CtsFlphN_b8/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; style=&#34; background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;&#34; target=&#34;_blank&#34;&gt; &lt;div style=&#34; display: flex; flex-direction: row; align-items: center;&#34;&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: column; flex-grow: 1; justify-content: center;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;padding: 19% 0;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display:block; height:50px; margin:0 auto 12px; width:50px;&#34;&gt;&lt;svg width=&#34;50px&#34; height=&#34;50px&#34; viewBox=&#34;0 0 60 60&#34; version=&#34;1.1&#34; xmlns=&#34;https://www.w3.org/2000/svg&#34; &gt;&lt;g stroke=&#34;none&#34; stroke-width=&#34;1&#34; fill=&#34;none&#34; fill-rule=&#34;evenodd&#34;&gt;&lt;g transform=&#34;translate(-511.000000, -20.000000)&#34; fill=&#34;#000000&#34;&gt;&lt;g&gt;&lt;path d=&#34;M556.869,30.41 C554.814,30.41 553.148,32.076 
553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 
512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631&#34;&gt;&lt;/path&gt;&lt;/g&gt;&lt;/g&gt;&lt;/g&gt;&lt;/svg&gt;&lt;/div&gt;&lt;div style=&#34;padding-top: 8px;&#34;&gt; &lt;div style=&#34; color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;&#34;&gt;View this post on Instagram&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;padding: 12.5% 0;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;&#34;&gt;&lt;div&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34;background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) translateY(-18px);&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;margin-left: 8px;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 
2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;margin-left: auto;&#34;&gt; &lt;div style=&#34; width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34; width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) translateX(8px);&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/a&gt;&lt;p style=&#34; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;&#34;&gt;&lt;a href=&#34;https://www.instagram.com/reel/CtsFlphN_b8/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; style=&#34; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;&#34; target=&#34;_blank&#34;&gt;A post shared by A. Jesse Jiryu Davis (@jessejiryudavis)&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://www.instagram.com/embed.js&#34;&gt;&lt;/script&gt;
&lt;hr&gt;
&lt;p&gt;Being afraid of falling and injury when I&amp;rsquo;m climbing is a fascinating state of mind! In regular life I&amp;rsquo;m not physically afraid, hardly at all—driving a car or biking in the city are the most dangerous things I do besides climbing, but I&amp;rsquo;m so used to those I&amp;rsquo;m not afraid. Besides, climbing is special: we like to say it&amp;rsquo;s &amp;ldquo;the slowest dangerous sport.&amp;rdquo; You can be clinging to the wall in some precarious place, trying to decide what to do next, feeling afraid, &lt;strong&gt;for quite a while&lt;/strong&gt;. So you can watch your scared mind at work, see what it does, and learn to manage it. When I&amp;rsquo;m afraid, it&amp;rsquo;s hard to focus on problem-solving. It&amp;rsquo;s hard to focus on how to minimize risk and complete this climb. I get distracted by worrying about falling, wishing I hadn&amp;rsquo;t started the climb, feeling hopeless.&lt;/p&gt;
&lt;p&gt;That&amp;rsquo;s the Zen of climbing for me: observing how fear works, and learning how to redirect my attention to the actual climbing. There&amp;rsquo;s a handful of techniques: focus on the breathing, breathe loud and slow. Focus on the senses, the feel of my fingertips and toes on the rock. Look around, look down, acknowledge the height I&amp;rsquo;m at. &amp;ldquo;Don&amp;rsquo;t look down&amp;rdquo; is terrible advice. There are mantras, like &amp;ldquo;I&amp;rsquo;m safe right now,&amp;rdquo; or my favorite, &amp;ldquo;Let the body climb.&amp;rdquo; These days all the climbing coaches are saying you should meditate! And of course I bring my 20 years of attentional training to climbing, but I also use tools that are specific to climbing, and specific to fear.&lt;/p&gt;
&lt;p&gt;Last month I was trying &lt;a href=&#34;https://www.mountainproject.com/route/105799143/mf&#34;&gt;one of the most challenging trad lead climbs&lt;/a&gt; I&amp;rsquo;d ever done, and my mentor &lt;a href=&#34;https://www.advancedrockcraft.com/&#34;&gt;Dustin&lt;/a&gt; was belaying me and coaching me from the ground. Two things happened that rattled me. First, when I was starting the climb, before I&amp;rsquo;d placed any protection, when I was just free-soloing the first 10 feet of the climb, my hand slipped. I didn&amp;rsquo;t fall, but it totally rattled me, and I climbed down to the ground and breathed for a couple minutes. I climbed up again, past that point, placed a few pieces of protection, and got past an overhanging roof. I was sure that was the crux, the hardest part of the climb, so I stood on a little foothold for a while, shaking out my forearm muscles, catching my breath, calming myself down. I thought I could just cruise up the rest of the climb and I was feeling good.&lt;/p&gt;
&lt;p&gt;Then came something that really threw me: I started to climb up again, I placed two more pieces, and arrived at the &lt;strong&gt;true&lt;/strong&gt; crux. It turned out the hardest part was higher up. I could see what I needed to do, I needed to step on a tiny foothold, then lunge leftward and catch a big handhold. It was a fairly easy move, but &lt;strong&gt;committing&lt;/strong&gt;: I was practically jumping for the hold. If I didn&amp;rsquo;t catch the handhold I would definitely fall. And because I was now 5 feet above my last cam, I&amp;rsquo;d fall 10 feet minimum, more likely 20 feet.&lt;/p&gt;
&lt;p&gt;I felt like I had two minds. The thinker said, it&amp;rsquo;s a longish fall but safe, my last cam is totally solidly placed in the crack, I want to try this move. C&amp;rsquo;mon, try it!! And my body said, &amp;ldquo;No.&amp;rdquo; At the point of no return, I froze.&lt;/p&gt;
&lt;p&gt;I repeatedly climbed up to the crux, hesitated, and climbed back down to a ledge below my last cam. I couldn&amp;rsquo;t calm myself down and I started to give up. I asked Dustin to pull the rope tight so I could hang from my cam. If he had, I would have ruined my chance to claim a clean ascent. Dustin said, &amp;ldquo;I can do that, but are you sure? Do you want to try one more time?&amp;rdquo; Just saying a word, and hearing him be calm and encouraging, gave me the oomph I needed. I took a couple deep breaths and tried one more time. As I approached the crux and hesitated again, Dustin yelled up, &amp;ldquo;Increase the tempo!&amp;rdquo; I lunged for the next hold and latched it, and finished the climb.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a time lapse video Dustin made:&lt;/p&gt;
&lt;blockquote class=&#34;instagram-media&#34; data-instgrm-permalink=&#34;https://www.instagram.com/reel/C-L49Q1SZ3E/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; data-instgrm-version=&#34;14&#34; style=&#34; background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);&#34;&gt;&lt;div style=&#34;padding:16px;&#34;&gt; &lt;a href=&#34;https://www.instagram.com/reel/C-L49Q1SZ3E/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; style=&#34; background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;&#34; target=&#34;_blank&#34;&gt; &lt;div style=&#34; display: flex; flex-direction: row; align-items: center;&#34;&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: column; flex-grow: 1; justify-content: center;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;padding: 19% 0;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display:block; height:50px; margin:0 auto 12px; width:50px;&#34;&gt;
    &lt;svg width=&#34;50px&#34; height=&#34;50px&#34; viewBox=&#34;0 0 60 60&#34; version=&#34;1.1&#34;
         xmlns=&#34;https://www.w3.org/2000/svg&#34;&gt;&lt;g stroke=&#34;none&#34; stroke-width=&#34;1&#34; fill=&#34;none&#34; fill-rule=&#34;evenodd&#34;&gt;&lt;g transform=&#34;translate(-511.000000, -20.000000)&#34; fill=&#34;#000000&#34;&gt;&lt;g&gt;&lt;path d=&#34;M556.869,30.41 C554.814,30.41 553.148,32.076 553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 
567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631&#34;&gt;&lt;/path&gt;&lt;/g&gt;&lt;/g&gt;&lt;/g&gt;&lt;/svg&gt;&lt;/div&gt;&lt;div style=&#34;padding-top: 8px;&#34;&gt; &lt;div style=&#34; color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;&#34;&gt;View this post on Instagram&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;padding: 12.5% 0;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;&#34;&gt;&lt;div&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34;background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) 
translateY(-18px);&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;margin-left: 8px;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;margin-left: auto;&#34;&gt; &lt;div style=&#34; width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34; width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) translateX(8px);&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/a&gt;&lt;p style=&#34; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;&#34;&gt;&lt;a href=&#34;https://www.instagram.com/reel/C-L49Q1SZ3E/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; style=&#34; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;&#34; target=&#34;_blank&#34;&gt;A post shared by Dustin Portzline 
(@dustinportzline)&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://www.instagram.com/embed.js&#34;&gt;&lt;/script&gt;
&lt;p&gt;What I love about climbing is how it makes my mind visible. I knew what had rattled me: I slipped near the start, and I was surprised by the second crux. I knew what I had to do, and I knew whether I was hesitating or committing to do the move. In the rest of my life there&amp;rsquo;s so much &amp;hellip; cloudiness. What must I do or say that I&amp;rsquo;m so afraid of, I don&amp;rsquo;t even let myself see it? How am I hesitating right now?&lt;/p&gt;
&lt;p&gt;What about you, is there something you should do but you&amp;rsquo;re hesitating? Are you admitting to yourself that you must do this thing, or are you trying to ignore it?&lt;/p&gt;
&lt;p&gt;Sometimes the crux comes at us visibly, like Nansen with a cleaver. It says, &amp;ldquo;Say a word!&amp;rdquo; Sometimes it&amp;rsquo;s easier to ignore. Either way, if we stand frozen and mute like Nansen&amp;rsquo;s monks, the cat dies. A mentor or a teacher is crucial in that moment. Because I trust Dustin&amp;rsquo;s wisdom, when he shouted up &amp;ldquo;increase the tempo!&amp;rdquo; it broke through my hesitation and I did the move with my whole being. Zen teachers in other areas of my life have given me the courage to unfreeze myself. What&amp;rsquo;s your crux right now? Who can help you unfreeze yourself?&lt;/p&gt;
&lt;hr&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/are-you-prepared-to-save-a-life/joshu.png&#34; alt=&#34;Ink drawing of an old monk with deep lines around his eyes.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://en.wikipedia.org/wiki/Zhaozhou_Congshen#/media/File:Zhaozhou_Congshen-Fozu_zhengzong_daoying37.jpg&#34;&gt;Joshu&lt;/a&gt;&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;When Nansen killed the cat, his wisest student Joshu was away, traveling. He returned that afternoon. He shook the snow from his wool cape and hung it up at the monastery entrance. He kicked off his heavy shoes and put on his straw sandals, and went in to say hello to Nansen. Nansen told him what had happened that morning. Joshu took off his sandals, put them on his head, and walked out. Nansen called after him, &amp;ldquo;If you&amp;rsquo;d been there you could&amp;rsquo;ve saved the cat!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;What&amp;rsquo;s it mean, sandals on the head? It was 1200 years ago and no one&amp;rsquo;s sure anymore. Maybe sandals on the head symbolizes death: people were buried barefoot and maybe their sandals were placed above their heads in the grave? Or maybe people put their sandals on their heads at funerals, or maybe that interpretation is just a garbled version of the first interpretation? Either way, Nansen tells Joshu a story about death, and Joshu&amp;rsquo;s gesture is just an acknowledgment. &amp;ldquo;Yes. Death. Impermanence. Nothing lasts.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;We don&amp;rsquo;t have to know what Joshu&amp;rsquo;s gesture meant exactly. The point is, he acted. He didn&amp;rsquo;t hesitate, he responded. If he&amp;rsquo;d been in the kitchen that morning, when all the other monks froze, Joshu&amp;rsquo;s gesture would have been enough to save the cat.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/are-you-prepared-to-save-a-life/465784528_c46d718351_o.jpg&#34; alt=&#34;Four young monks sit on the ground and eat lunch and feed scraps to a small cat.&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;&lt;a href=&#34;https://www.flickr.com/photos/emptysquare/465784528/&#34;&gt;Monastery in Kalimpong, West Bengal&lt;/a&gt;&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;hr&gt;
&lt;p&gt;The worst climbing accident I&amp;rsquo;ve seen was a couple months ago. I had started leading an easy climb at a popular area, there were a few other climbers on routes nearby. I was about 25 feet up, when I heard the obvious sound of a climber falling near me: the jangling of his gear and the thud of his body hitting the rock. Someone screamed, and someone else shouted, &amp;ldquo;Call 911!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;I was not out of danger myself, so I deliberately did not look to the side, I didn&amp;rsquo;t want to see the accident until I was safe. I looked down, I saw that I could safely climb the 25 feet down to the base, so I did that, pulling my protection out of the cracks as I went until I was standing next to my belayer on the ground.&lt;/p&gt;
&lt;p&gt;Then I looked over, about 50 feet to my right there was a climber lying on a ledge, a few body-lengths above the ground. I learned later that he&amp;rsquo;d been leading &lt;a href=&#34;https://www.mountainproject.com/route/105818020/beatle-brow-bulge&#34;&gt;a hard, overhanging climb&lt;/a&gt;, he fell, and two or three cams that he placed popped out of the rock so he fell almost all the way back to the ground. He tumbled, hit his head, his helmet broke off, he hit his head &lt;strong&gt;again&lt;/strong&gt;, and came to a stop on this ledge where he was lying unconscious. By the time I looked there was already a climber who had reached him and was checking his pulse and trying to talk to him.&lt;/p&gt;
&lt;p&gt;I coiled my rope and moved my pack out of the way and waited for rangers to arrive. I wanted to be helpful. I was curious, too. How badly was the climber hurt? How was the rescue going to go? What more could I learn about the accident?&lt;/p&gt;
&lt;p&gt;Just minutes after the accident a few rangers showed up. One of them, a young slim guy, took charge. He was brisk and friendly, almost cheerful. He was enjoying moving fast and putting his skills to work, enjoying his mastery. He looked up at the climber, asked the other people at the cliff base: what&amp;rsquo;s his name? Morgan. What happened? He fell and hit his head. All these ropes hanging from the wall right now, what&amp;rsquo;s this attached to? What&amp;rsquo;s that attached to? This ranger, this Joshu, he figured out what he needed to do, what resources were available, and made a plan. By this time a dozen climbers had gathered at the base, so he gave us all jobs. After a couple minutes he had three belayers managing three ropes, he&amp;rsquo;d tied himself to one of the ropes and climbed up to join Morgan. Morgan was somewhat conscious now, and Joshu introduced himself. He didn&amp;rsquo;t say, &amp;ldquo;I&amp;rsquo;m so sorry this happened to you,&amp;rdquo; he didn&amp;rsquo;t say, &amp;ldquo;Oh my god look at all this blood.&amp;rdquo; Two other rangers arrived with a stretcher. Joshu and another climber lowered Morgan from the ledge down into the stretcher, with the three belayers all working in sync, with Joshu instructing them all.&lt;/p&gt;
&lt;p&gt;Once Morgan was in the stretcher on the ground, we needed to carry him down to the road where the ranger&amp;rsquo;s pickup truck was waiting. Joshu told us not to just pick up the stretcher and walk down to the road with it, because the ground was rocky and slippery with leaves, we could stumble and drop Morgan. Joshu said, &amp;ldquo;We&amp;rsquo;re going to do the caterpillar thing.&amp;rdquo; As if everyone was familiar with this technique and everyone knows it&amp;rsquo;s called the caterpillar. We lined up in two rows and picked up the stretcher, then passed it to the next set of people downhill. Then the uphill people ran around behind the backs of the people holding the stretcher, so they became the new downhill people, and the uphill people passed the stretcher down to them. It was fun, like a bucket brigade or school game. Whenever I was holding the stretcher I looked down at Morgan&amp;rsquo;s face. He was bloody and it seemed like his nose was broken, and he was going in and out of consciousness, but he was alive and talking.&lt;/p&gt;
&lt;p&gt;The caterpillar reached the road and we loaded Morgan into the ranger truck, and off he went. I think the truck brought him to a helicopter, which brought him to a hospital. I&amp;rsquo;ve heard he was still conscious when he reached the hospital, so I assume he&amp;rsquo;s ok.&lt;/p&gt;
&lt;p&gt;I noticed about myself how unfazed I was. Sure, it&amp;rsquo;s the worst accident I&amp;rsquo;ve seen, but I wasn&amp;rsquo;t surprised about anything. I know that this is exactly the kind of accident trad climbers have. It was a reminder how important it is to place pieces of protection correctly, so they&amp;rsquo;ll hold me when I fall, but I already knew that. I wasn&amp;rsquo;t upset, I was just &lt;strong&gt;curious&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a coldness to me that I&amp;rsquo;ve long disliked about myself. I&amp;rsquo;m not very sympathetic to strangers, I don&amp;rsquo;t usually cry at movies. But after this experience I&amp;rsquo;ve decided it&amp;rsquo;s ok—I still &lt;strong&gt;act&lt;/strong&gt; helpfully. In fact, my coldness can be an asset, it seems to make me good in emergencies.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;In contrast to Joshu and the other climbers on the scene, there was one young woman who had &lt;em&gt;completely&lt;/em&gt; lost it. I think she and Morgan were both part of a group of climbers who&amp;rsquo;d come up from the city for the day. She was walking around the base of the cliff sobbing. She was crying so hard that two of her friends had to help her walk to a flat spot, out of the way, where she could sit down and cry while they held her. The whole time we were rescuing Morgan, 20 or 30 minutes, she was off to the side crying with her two friends.&lt;/p&gt;
&lt;p&gt;I don&amp;rsquo;t know her, I don&amp;rsquo;t know why the accident shocked her in a way it didn&amp;rsquo;t shock me. Maybe she was very close to Morgan, maybe it was her first time climbing outside and she didn&amp;rsquo;t know how dangerous it is, wasn&amp;rsquo;t prepared for someone to get hurt. I don&amp;rsquo;t blame her. But she couldn&amp;rsquo;t save the cat.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;m going to keep climbing for many years and I&amp;rsquo;ll see more accidents. I&amp;rsquo;m taking some classes soon, like Wilderness First Aid, so I can be more helpful when that happens. If someone I know gets hurt, I won&amp;rsquo;t be so calm, but I think I can hold it together. I&amp;rsquo;m more confident about that now.&lt;/p&gt;
&lt;p&gt;When you read about Zen it&amp;rsquo;s easy to misunderstand the message. We can think, Zen practice has no goal. The Heart Sutra says, &amp;ldquo;no path, no wisdom, no gain&amp;rdquo;. Huineng said, &amp;ldquo;there&amp;rsquo;s no mirror to polish and no dust to fall on it.&amp;rdquo; Dogen says, &amp;ldquo;we practice because we&amp;rsquo;re already enlightened.&amp;rdquo; But practice has an effect. You know that—it has an effect on you right now, and over the decades of your practice it will transform you. It&amp;rsquo;s crucial to ask yourself, how are you training yourself? How are you changing?&lt;/p&gt;
&lt;p&gt;When Nansen comes with his cleaver, you&amp;rsquo;ll only have a moment to respond. Will you freeze, or will you save the cat?&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/are-you-prepared-to-save-a-life/366669755_c1c19c414a_o.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;</description>
    </item>
    
    <item>
      <title>Review: Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database</title>
      <link>https://emptysqua.re/blog/review-memorydb/</link>
      <pubDate>Mon, 26 Aug 2024 08:25:42 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-memorydb/</guid>
      
      
      
      
      <description>&lt;p&gt;&lt;a href=&#34;https://assets.amazon.science/e0/1b/ba6c28034babbc1b18f54aa8102e/amazon-memorydb-a-fast-and-durable-memory-first-cloud-database.pdf&#34;&gt;Amazon MemoryDB: A Fast and Durable Memory-First Cloud Database&lt;/a&gt;, by six AWS engineers, in SIGMOD 2024. AWS hacked up Redis (I mean this respectfully) to produce a Redis-compatible database-as-a-service called MemoryDB, with better durability and consistency. Here&amp;rsquo;s a video of my presentation to &lt;a href=&#34;https://charap.co/summer-2024-reading-group-papers-papers-171-180/&#34;&gt;the DistSys Reading Group&lt;/a&gt;, and a written review of the paper below.&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/jFpm1140zdA?si=QW_xNuJqH-vczCSf&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;hr&gt;
&lt;h1&gt;The Problem&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/redis-logo.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 250px;&#34;&gt;
&lt;p&gt;It’s me, hi, I’m the problem, it’s me.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Redis is popular because it&amp;rsquo;s fast and supports fairly powerful data structures, which makes some kinds of applications much easier to build. But Redis has basically no durability or consistency guarantees. So Amazon wants to sell a better Redis.&lt;/p&gt;
&lt;p&gt;Digression: there&amp;rsquo;s been &lt;a href=&#34;https://redis.io/blog/redis-adopts-dual-source-available-licensing/&#34;&gt;some license drama&lt;/a&gt;. Redis&amp;rsquo;s owner, Redis Labs, changed the license from open source to source-available. &lt;a href=&#34;https://www.mongodb.com/legal/licensing/server-side-public-license/faq&#34;&gt;My company made a similar move&lt;/a&gt; and I think it&amp;rsquo;s justified. There&amp;rsquo;s a fork of open source Redis &lt;a href=&#34;https://valkey.io/&#34;&gt;called Valkey now&lt;/a&gt;, and it has a new logo:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/valkey-logo.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 250px;&#34;&gt;
&lt;p&gt;I have this thing where I get older but just never wiser.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Valkey sounds like a Norse warrior woman to me. I think their logo should look like this:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/valkyrie.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 250px;&#34;&gt;
&lt;p&gt;I’m a monster on the hill.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;However, &lt;a href=&#34;https://aws.amazon.com/memorydb/&#34;&gt;AWS marketing says&lt;/a&gt; MemoryDB is &amp;ldquo;OSS Redis-compatible&amp;rdquo;, and they don&amp;rsquo;t mention Valkey, I don&amp;rsquo;t know how this will play out long term. Will Amazon contribute to Valkey? Or will proprietary Redis, Valkey, and AWS&amp;rsquo;s version of Redis drift apart forever?&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/astronauts.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;One day I’ll watch as you’re leaving.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Anyway. Amazon wants to sell a better Redis, with stronger durability and consistency. How are they going to do it?&lt;/p&gt;
&lt;h1&gt;The Solution&lt;/h1&gt;

&lt;p&gt;Guaranteeing durability and consistency in a distributed database is always complex. In this paper, the authors&amp;rsquo; solution is to make a black box, call it &amp;ldquo;the transaction log&amp;rdquo;, put all the complexity inside, and close the box.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-memorydb/black-box.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-memorydb/black-box.excalidraw.png&#34;&gt;
&lt;/picture&gt;
&lt;p&gt;I should not be left to my own devices.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The authors don&amp;rsquo;t describe the transaction log internals; we don&amp;rsquo;t know how it provides the guarantees on which MemoryDB relies. Presumably that&amp;rsquo;ll be a future paper, or maybe we can just infer it&amp;rsquo;s running Raft or Paxos. This is frustrating for me because I&amp;rsquo;m here for the distributed systems, but this isn&amp;rsquo;t a distributed systems paper—it&amp;rsquo;s mostly a software engineering paper. This paper is about how Amazon decomposed Redis into two parts: 1) the transaction log, and 2) everything else. They replaced the log with something better, put the parts back together, and created MemoryDB.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/frankenstein.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Too big to hang out, slowly lurching toward your favorite city.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;h1&gt;MemoryDB Architecture&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/memorydb-architecture.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 400px;&#34;&gt;
&lt;p&gt;Figure 1 from the paper.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Here&amp;rsquo;s the MemoryDB architecture. Clients send writes to the primary. In vanilla Redis, the primary streams operations &lt;strong&gt;asynchronously&lt;/strong&gt; to secondaries (aka &amp;ldquo;replicas&amp;rdquo;). The MemoryDB primary saves writes to the transaction log &lt;strong&gt;synchronously&lt;/strong&gt;. How did the authors make the change from async to sync without rewriting Redis internals? There&amp;rsquo;s a doohickey called a Tracker which knows when writes become durable, delays acks until then, and blocks reads on dirty data. This provides linearizability on the primary. My impression is it lives outside the Redis core code, and it intercepts requests and replies.&lt;/p&gt;
&lt;p&gt;Secondaries receive entries from the transaction log only after they are multi-AZ-durable. A client can opt in to read from one replica and get sequential consistency, or multiple replicas and get eventual consistency. In both cases, MemoryDB secondaries have better consistency than Redis secondaries, since the former don&amp;rsquo;t observe writes that can be rolled back.&lt;/p&gt;
&lt;h1&gt;Off-Box Snapshotting&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/off-box-snapshotting.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 400px;&#34;&gt;
&lt;p&gt;Figure 2 from the paper.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Snapshots are useful for initializing new followers or disaster recovery. Normal Redis does snapshotting by forking the main process. One child keeps processing transactions (taking advantage of copy-on-write) while the other child creates the snapshot from its read-only copy of the data. These two children are competing for RAM and CPU, though, so the machine has to be overprovisioned to make headroom for occasional snapshotting.&lt;/p&gt;
&lt;p&gt;With MemoryDB, when Amazon wants to take a snapshot it does it &amp;ldquo;off-box&amp;rdquo;. They start up a new Redis machine, restore it from the last snapshot in S3, replay the subsequent transactions, save a new snapshot to S3, and terminate the Redis machine. The customer workload proceeds without interference.&lt;/p&gt;
&lt;h1&gt;Advantages of a Disaggregated Log&lt;/h1&gt;

&lt;p&gt;MemoryDB is much like &lt;a href=&#34;https://www.amazon.science/publications/amazon-aurora-on-avoiding-distributed-consensus-for-i-os-commits-and-membership-changes&#34;&gt;Aurora&lt;/a&gt;: it keeps the open source execution layer at the top, ensuring compatibility and avoiding reimplementation, but replaces the transaction log at the bottom with a proprietary service that&amp;rsquo;s more scalable and durable. The authors claim their mysterious transaction log service guarantees 11 9s of durability. That&amp;rsquo;s more 9s than you can shake a stick at!&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/shake-a-stick.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 400px;&#34;&gt;
&lt;p&gt;Tale as old as time.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Once the log is separated from the execution layer, you can scale durability separately from availability. For example, a single-primary-only deployment is low-availability but high-durability. If the primary dies, you may wait a while for a new primary to be initialized from the last snapshot, but you won&amp;rsquo;t lose data.&lt;/p&gt;
&lt;p&gt;Some other advantages seem Redis-specific. Since Redis is single-threaded, it&amp;rsquo;s crucial to offload as much work from the primary as possible. A vanilla Redis primary must work to fan-out log entries to its secondaries, but a MemoryDB primary writes each entry only to the transaction log, which handles fan-out. Furthermore, Redis elections provide few guarantees: a newly elected primary might be missing recent writes, and there can be multiple primaries. MemoryDB uses the log&amp;rsquo;s strong consistency as part of its election protocol to guarantee a single primary with all the previous primary&amp;rsquo;s writes.&lt;/p&gt;
&lt;h1&gt;Scaling&lt;/h1&gt;

&lt;p&gt;MemoryDB, like Redis, is both sharded and replicated. Amazon can scale MemoryDB nodes up or down, in or out, in three dimensions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The number of replicas per shard.&lt;/li&gt;
&lt;li&gt;Vertical scaling (using more or less powerful instances).&lt;/li&gt;
&lt;li&gt;Horizontal scaling (adding or removing shards).&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;strong&gt;Number of replicas&lt;/strong&gt;: More replicas permit read scaling (with weak consistency) and higher availability (there are more hot standbys). To add a replica, AWS restores one from the last S3 snapshot, then replays the transaction log.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Vertical scaling&lt;/strong&gt;: AWS replaces each secondary with a new one using a different instance size, then hands over leadership and deletes the old primary. The paper says MemoryDB has &amp;ldquo;a collaborative leadership transfer, where the old instance actively hands over leadership, which minimizes downtime&amp;rdquo;. There are no details, but MongoDB and some other systems have similar ideas. I can imagine that the old primary ensures the new one is nearly caught-up in the transaction log before it starts the handover, then it stops accepting writes, relinquishes its lease, and tells the new primary to run for election immediately instead of waiting for a timeout.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Horizontal scaling&lt;/strong&gt;: Redis supports sharding using something like consistent hashing. There are 16,384 slots. Each slot has a subset of the keys, assigned according to a hash of the key. Each shard is a replica set, and each shard owns a subset of the slots. Perhaps open source Redis can change the number of shards and reassign slots, the paper doesn&amp;rsquo;t cover this. In MemoryDB, slots are reassigned in the way you&amp;rsquo;d expect: the old owner streams data and data-changes to the new owner until they&amp;rsquo;re almost in sync, then the old owner stops accepting writes, streams the final updates to the new owner, and they both commit a change to the sharding metadata to transfer ownership.&lt;/p&gt;
&lt;p&gt;Presumably the transaction log service can also scale, but the log service is a black box so we don&amp;rsquo;t know.&lt;/p&gt;
&lt;h1&gt;Upgrading Versions&lt;/h1&gt;

&lt;p&gt;This part of the paper was clever and novel to me. MemoryDB upgrades the software version of a replica set thus:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;For any v1, v2: v2 can read v1&amp;rsquo;s log entries, but maybe not vice-versa.&lt;/li&gt;
&lt;li&gt;Each log entry is marked with the version of the primary that created it.&lt;/li&gt;
&lt;li&gt;Secondaries upgrade to v2 first.&lt;/li&gt;
&lt;li&gt;If there&amp;rsquo;s a failover during upgrade and a v2 node becomes a primary, v1 nodes don&amp;rsquo;t replicate its entries.&lt;/li&gt;
&lt;li&gt;Snapshots are built with v1 until all nodes are v2, so snapshots are legible if the upgrade is aborted.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;This is simpler than &lt;a href=&#34;https://www.mongodb.com/docs/manual/reference/command/setFeatureCompatibilityVersion/&#34;&gt;MongoDB&amp;rsquo;s solution&lt;/a&gt;. It also doesn&amp;rsquo;t solve all the problems MongoDB solves, but still—I admire it.&lt;/p&gt;
&lt;h1&gt;Their Evaluation&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-memorydb/throughput.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;For a &lt;strong&gt;read-only&lt;/strong&gt; workload, MemoryDB is about as fast as Redis on small instances. There&amp;rsquo;s a step change when the instance size is upgraded to 2xl, then no more benefit. The authors say this is because of &amp;ldquo;Enhanced IO Multiplexing&amp;rdquo;. I think this is because Redis is single-threaded, but Enhanced IO Multiplexing lets some IO work be offloaded to other threads. Apparently that removes a bottleneck when they upgrade to 2xl, but then they hit some other bottleneck and bigger instances don&amp;rsquo;t help.&lt;/p&gt;
&lt;p&gt;For a &lt;strong&gt;write-only&lt;/strong&gt; workload Redis writes faster than MemoryDB, but there&amp;rsquo;s no durability guarantee. It&amp;rsquo;s interesting that, even though Redis is single-threaded, it can still get a performance boost from a bigger instance. Bigger instances have the same CPUs, just more of them, which I think is useless for Redis. But bigger instances also have more network bandwidth, maybe that helps.&lt;/p&gt;
&lt;h1&gt;My Evaluation&lt;/h1&gt;

&lt;p&gt;This is a solid industrial-track paper about pragmatic software engineering for a distributed DB. It&amp;rsquo;s frustrating for distributed algorithms buffs, but we&amp;rsquo;re not its audience. For us, I wish Amazon had published a paper about the log first, or included some log details in this paper. Nevertheless I like it. The &lt;a href=&#34;#upgrading-versions&#34;&gt;upgrade protocol&lt;/a&gt; is wisely designed.&lt;/p&gt;
&lt;h1&gt;Further reading&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://muratbuffalo.blogspot.com/2024/05/amazon-memorydb-fast-and-durable-memory.html&#34;&gt;Murat Demirbas&amp;rsquo;s summary&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://brooker.co.za/blog/2024/04/25/memorydb.html&#34;&gt;Marc Brooker&amp;rsquo;s blog post&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>What&#39;s Next for Women in Tech?</title>
      <link>https://emptysqua.re/blog/whats-next-for-women-in-tech/</link>
      <pubDate>Sat, 24 Aug 2024 16:53:51 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/whats-next-for-women-in-tech/</guid>
      
      
      
      
      <description>&lt;p&gt;The Washington Post says &amp;ldquo;&lt;a href=&#34;https://www.washingtonpost.com/technology/2024/08/19/silicon-valley-dei-backlash/&#34;&gt;the movement to diversify Silicon Valley is crumbling&lt;/a&gt;&amp;rdquo;, and the story is particularly dire for gender-diversity nonprofits: Girls In Tech has dissolved, so has Women Who Code, and software companies have sharply cut their grants to others. But were they helping? The article links to this &lt;a href=&#34;https://www.dol.gov/agencies/wb/data/occupations-stem&#34;&gt;Department of Labor chart&lt;/a&gt; which shows women&amp;rsquo;s share of all STEM employment has grown from 25% in 2000 to only 26% in 2022. For some reason the Post didn&amp;rsquo;t cite a worse statistic on the same chart: women&amp;rsquo;s share of &amp;ldquo;computer occupations&amp;rdquo; has fallen from 30% to 24%.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/whats-next-for-women-in-tech/labor-chart.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Chart from Department of Labor, titled &amp;ldquo;Percentage of science, technology, engineering, and math (STEM) workers who are women&amp;rdquo;&#34; title=&#34;&#34;&gt;&lt;p&gt;I don&amp;rsquo;t know if organizations like Women Who Code fundamentally made a difference or not. I&amp;rsquo;ve partnered with a similar group called &lt;a href=&#34;https://pyladies.com/&#34;&gt;PyLadies&lt;/a&gt; on &lt;a href=&#34;https://emptysqua.re/blog/global-diversity-cfp-day-workshop/&#34;&gt;mentoring new conference speakers&lt;/a&gt;, and it seems like they make the tech scene more welcoming for women who are trying to break in. They also offer a space for experienced women to coordinate and support each other, perhaps this keeps mid-career women from &lt;a href=&#34;https://www.techtalentcharter.co.uk/why-women-leave-tech-jobs&#34;&gt;leaving the industry&lt;/a&gt;. PyLadies is still here and stronger than ever.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Like most male programmers I have been aware of the gender imbalance since I went to college, and awoke to the urgency of the problem when &lt;a href=&#34;https://medium.com/@triketora/where-are-the-numbers-cb997a57252&#34;&gt;Tracy Chou&lt;/a&gt;, &lt;a href=&#34;https://www.nbcnews.com/tech/social-media/jesse-jackson-presses-facebook-techs-diversity-problem-n112106&#34;&gt;Jesse Jackson&lt;/a&gt;, and others demanded that the big software companies publish their diversity stats. &lt;a href=&#34;https://en.wikipedia.org/wiki/Ellen_Pao&#34;&gt;Ellen Pao&lt;/a&gt;&amp;rsquo;s activism made an impression on me too. In 2016 some women engineers at MongoDB, where I work, called a big meeting to discuss improving diversity and inclusion there. As I recall, the meeting was stressful and exciting—it was a big bang for diversity efforts at MongoDB.&lt;/p&gt;
&lt;p&gt;I teamed up with a couple of colleagues in the engineering department and started shopping for diversity workshops. We tried an &lt;a href=&#34;https://frameshiftconsulting.com/ally-skills-workshop/&#34;&gt;Ally Skills workshop&lt;/a&gt; (which I&amp;rsquo;d recommend to anyone) and researched some other classes, but our then-CTO Eliot Horowitz wasn&amp;rsquo;t convinced we should bring any of these trainings to the company. He demanded evidence that these things were effective. I was frustrated at the time, and grateful in retrospect. I think Eliot was right. That year I read Iris Bohnet&amp;rsquo;s book &lt;a href=&#34;https://www.hup.harvard.edu/file/feeds/PDF/9780674986565_sample.pdf&#34;&gt;What Works: Gender Equality by Design&lt;/a&gt;, which was very critical of diversity trainings. They &lt;a href=&#34;https://hbr.org/2016/07/why-diversity-programs-fail&#34;&gt;seem to be counterproductive&lt;/a&gt; as often as not. Bohnet reviewed the evidence and argued instead for redesigning corporate processes and structures to be more inclusive.&lt;/p&gt;
&lt;p&gt;My activism within MongoDB declined after 2016. Executives and HR took over and there seemed to be less appetite for white male engineers like me to lead the charge. This might be for the better: the managers had the power and perhaps the expertise I lacked. A &lt;a href=&#34;https://web.archive.org/web/20210518140426/https://www.atlassian.com/dam/jcr:c009637c-1335-429d-9181-6a66685b712e/Atlassian_StateofDiversityTech_2018_StatsSummary.pdf&#34;&gt;2018 survey&lt;/a&gt; matches my experience. It found that individual tech employees&amp;rsquo; participation in diversity and inclusion fell by half year-over-year. Meanwhile, MongoDB and other companies have changed for the better, but somehow those Department of Labor numbers haven&amp;rsquo;t improved.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;What makes a difference for women in the software industry? My female colleagues tell me it&amp;rsquo;s most helpful to have &lt;a href=&#34;https://www.mongodb.com/company/careers/diversity-and-inclusion&#34;&gt;an affinity group&lt;/a&gt; within the company where they can be candid with each other, and to have sympathetic managers who are open to hearing about problems with the company culture, and to have women in leadership.&lt;/p&gt;
&lt;p&gt;And what can men do? I&amp;rsquo;m certain that individual male programmers can choose to make a difference by mentoring junior women engineers. I&amp;rsquo;ve &lt;a href=&#34;https://emptysqua.re/blog/mentoring/#your-companys-future-our-industrys-future&#34;&gt;believed this for a long time&lt;/a&gt;, Iris Bohnet said so in &lt;em&gt;What Works&lt;/em&gt;, and the &lt;a href=&#34;https://hbr.org/2016/12/the-men-who-mentor-women&#34;&gt;Harvard Business Review said so back in 2016&lt;/a&gt;. Mentorship boosts early-career women in tech, and it mitigates &lt;a href=&#34;https://onezero.medium.com/ctrl-alt-delete-the-planned-obsolescence-of-old-coders-9c5f440ee68&#34;&gt;the loss of mid-career women&lt;/a&gt;. I asked one of my female colleagues yesterday what has been most helpful for her, and she said &amp;ldquo;individual mentorship&amp;rdquo; first, before I revealed that that&amp;rsquo;s what I think, too.&lt;/p&gt;
&lt;p&gt;My analytics tell me that you, dear reader, are probably a male programmer. If you&amp;rsquo;re an experienced engineer, you should mentor women. If you&amp;rsquo;re worried about appearances, &lt;a href=&#34;https://www.chronicle.com/article/why-male-mentors-in-the-metoo-era-must-engage-more-not-run-for-the-hills/&#34;&gt;just don&amp;rsquo;t be creepy&lt;/a&gt; and you&amp;rsquo;ll be fine. Mentorship includes advice, a sympathetic ear, advocacy, and cheerleading. Here are &lt;a href=&#34;https://ncwit.org/resource/top10maleadvocate/&#34;&gt;10 ways to support women in tech&lt;/a&gt;, from the National Center for Women and Information Technology.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Coaching Will Be The Last Human Job</title>
      <link>https://emptysqua.re/blog/coaching-will-be-the-last-human-job/</link>
      <pubDate>Sat, 17 Aug 2024 14:46:40 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/coaching-will-be-the-last-human-job/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/coaching-will-be-the-last-human-job/520910820_886d736257_k.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;Macro photo of my eye, with the reflection of a lens in the pupil, making it look bionic.&#34; title=&#34;&#34;&gt;&lt;p&gt;Sam Altman and Arianna Huffington &lt;a href=&#34;https://time.com/6994739/ai-behavior-change-health-care/&#34;&gt;plan to make an AI health coach&lt;/a&gt;, which will address chronic illness by encouraging people to eat better, take their medications, and get more sleep and exercise.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Consider what it&amp;rsquo;s like to be a busy professional with diabetes. You might be struggling to manage your blood-sugar levels, often missing meals and exercise due to a hectic schedule. A personalized AI health coach, trained on your medical data and daily routines, could provide timely reminders to take your medication, suggest quick and healthy meal options, and encourage you to take short breaks for exercise.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;A busy professional might benefit from an AI coach, but as &lt;a href=&#34;https://www.slowboring.com/p/can-an-ai-health-coach-fight-chronic&#34;&gt;Ben Krauss points out&lt;/a&gt;, it&amp;rsquo;s the working class who suffer most from chronic disease, and have the highest obstacles to a healthy lifestyle.&lt;/p&gt;
&lt;p&gt;I want to criticize a different aspect of this pitch:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Yes, behavior change is hard. But through hyper-personalization, it&amp;rsquo;s also something that AI is uniquely positioned to solve.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Behavior change is something that a &lt;em&gt;human&lt;/em&gt; coach is uniquely positioned to solve. No matter how smart it is, a mechanical coach&amp;rsquo;s impact is limited by my ability to ignore it. I&amp;rsquo;ve set myself all sorts of digital reminders and nudges and I routinely tell them to piss off. My employer nags me with automated messages about various policies; I delete them reflexively. So long as I know my nagger isn&amp;rsquo;t human, I feel no obligation to respect it.&lt;/p&gt;
&lt;p&gt;I have at various times hired a personal trainer, a couples therapist, and a climbing mentor. In all cases, I granted my coach the authority to nudge me to do something. I had some long-term goal (fitness, an honest relationship, courageous climbing), but I was hampered by my short-term laziness or fear. I used my coach&amp;rsquo;s authority or charisma or whatever as extra oomph to overcome my reluctance. AI could one day match any coach&amp;rsquo;s expertise and personalization, but I doubt it will replace a human coach&amp;rsquo;s ability to exert &lt;em&gt;peer pressure&lt;/em&gt;. Only a human can do that, because only a human is my peer. AI will either be my servant or, if it isn&amp;rsquo;t, then AI will no longer care about sharing recipes with me, it will be too busy &lt;a href=&#34;https://en.wikipedia.org/wiki/Instrumental_convergence#Paperclip_maximizer&#34;&gt;converting me to paperclips&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s a recent example of peer pressure. My climbing mentor &lt;a href=&#34;https://www.advancedrockcraft.com/&#34;&gt;Dustin Portzline&lt;/a&gt; was belaying and coaching me while I was scared on &lt;a href=&#34;https://www.mountainproject.com/route/105799143/mf&#34;&gt;a climb&lt;/a&gt;. I repeatedly approached the crux, hesitated, and climbed back down to a ledge. I couldn&amp;rsquo;t calm myself and I started to give up. I asked Dustin to pull the rope tight so I could hang from it—if he had, I would have ruined my chance to claim a clean ascent. Dustin said, &amp;ldquo;I can do that, but are you sure? Do you want to try one more time?&amp;rdquo; His calmness and encouragement helped me collect myself. I respect him and I wanted to do my best for him. I tried once more. As I approached the crux and hesitated again, Dustin yelled up, &amp;ldquo;Increase the tempo!&amp;rdquo; &amp;hellip;that&amp;rsquo;s what I remember. He says he yelled, &amp;ldquo;Move faster!&amp;rdquo; I was kind of blacked-out with fear, half-animal, but I stand by my recollection. Anyway, his instruction was the extra motivation I needed. I lunged for the next hold and latched it, and finished the climb.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s Dustin&amp;rsquo;s version of events:&lt;/p&gt;
&lt;blockquote class=&#34;instagram-media&#34; data-instgrm-captioned data-instgrm-permalink=&#34;https://www.instagram.com/reel/C-L49Q1SZ3E/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; data-instgrm-version=&#34;14&#34; style=&#34; background:#FFF; border:0; border-radius:3px; box-shadow:0 0 1px 0 rgba(0,0,0,0.5),0 1px 10px 0 rgba(0,0,0,0.15); margin: 1px; max-width:540px; min-width:326px; padding:0; width:99.375%; width:-webkit-calc(100% - 2px); width:calc(100% - 2px);&#34;&gt;&lt;div style=&#34;padding:16px;&#34;&gt; &lt;a href=&#34;https://www.instagram.com/reel/C-L49Q1SZ3E/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; style=&#34; background:#FFFFFF; line-height:0; padding:0 0; text-align:center; text-decoration:none; width:100%;&#34; target=&#34;_blank&#34;&gt; &lt;div style=&#34; display: flex; flex-direction: row; align-items: center;&#34;&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 40px; margin-right: 14px; width: 40px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: column; flex-grow: 1; justify-content: center;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 100px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 60px;&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;padding: 19% 0;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display:block; height:50px; margin:0 auto 12px; width:50px;&#34;&gt;
    &lt;svg width=&#34;50px&#34; height=&#34;50px&#34; viewBox=&#34;0 0 60 60&#34; version=&#34;1.1&#34;
         xmlns=&#34;https://www.w3.org/2000/svg&#34;&gt;&lt;g stroke=&#34;none&#34; stroke-width=&#34;1&#34; fill=&#34;none&#34; fill-rule=&#34;evenodd&#34;&gt;&lt;g transform=&#34;translate(-511.000000, -20.000000)&#34; fill=&#34;#000000&#34;&gt;&lt;g&gt;&lt;path d=&#34;M556.869,30.41 C554.814,30.41 553.148,32.076 553.148,34.131 C553.148,36.186 554.814,37.852 556.869,37.852 C558.924,37.852 560.59,36.186 560.59,34.131 C560.59,32.076 558.924,30.41 556.869,30.41 M541,60.657 C535.114,60.657 530.342,55.887 530.342,50 C530.342,44.114 535.114,39.342 541,39.342 C546.887,39.342 551.658,44.114 551.658,50 C551.658,55.887 546.887,60.657 541,60.657 M541,33.886 C532.1,33.886 524.886,41.1 524.886,50 C524.886,58.899 532.1,66.113 541,66.113 C549.9,66.113 557.115,58.899 557.115,50 C557.115,41.1 549.9,33.886 541,33.886 M565.378,62.101 C565.244,65.022 564.756,66.606 564.346,67.663 C563.803,69.06 563.154,70.057 562.106,71.106 C561.058,72.155 560.06,72.803 558.662,73.347 C557.607,73.757 556.021,74.244 553.102,74.378 C549.944,74.521 548.997,74.552 541,74.552 C533.003,74.552 532.056,74.521 528.898,74.378 C525.979,74.244 524.393,73.757 523.338,73.347 C521.94,72.803 520.942,72.155 519.894,71.106 C518.846,70.057 518.197,69.06 517.654,67.663 C517.244,66.606 516.755,65.022 516.623,62.101 C516.479,58.943 516.448,57.996 516.448,50 C516.448,42.003 516.479,41.056 516.623,37.899 C516.755,34.978 517.244,33.391 517.654,32.338 C518.197,30.938 518.846,29.942 519.894,28.894 C520.942,27.846 521.94,27.196 523.338,26.654 C524.393,26.244 525.979,25.756 528.898,25.623 C532.057,25.479 533.004,25.448 541,25.448 C548.997,25.448 549.943,25.479 553.102,25.623 C556.021,25.756 557.607,26.244 558.662,26.654 C560.06,27.196 561.058,27.846 562.106,28.894 C563.154,29.942 563.803,30.938 564.346,32.338 C564.756,33.391 565.244,34.978 565.378,37.899 C565.522,41.056 565.552,42.003 565.552,50 C565.552,57.996 565.522,58.943 565.378,62.101 M570.82,37.631 C570.674,34.438 570.167,32.258 569.425,30.349 C568.659,28.377 
567.633,26.702 565.965,25.035 C564.297,23.368 562.623,22.342 560.652,21.575 C558.743,20.834 556.562,20.326 553.369,20.18 C550.169,20.033 549.148,20 541,20 C532.853,20 531.831,20.033 528.631,20.18 C525.438,20.326 523.257,20.834 521.349,21.575 C519.376,22.342 517.703,23.368 516.035,25.035 C514.368,26.702 513.342,28.377 512.574,30.349 C511.834,32.258 511.326,34.438 511.181,37.631 C511.035,40.831 511,41.851 511,50 C511,58.147 511.035,59.17 511.181,62.369 C511.326,65.562 511.834,67.743 512.574,69.651 C513.342,71.625 514.368,73.296 516.035,74.965 C517.703,76.634 519.376,77.658 521.349,78.425 C523.257,79.167 525.438,79.673 528.631,79.82 C531.831,79.965 532.853,80.001 541,80.001 C549.148,80.001 550.169,79.965 553.369,79.82 C556.562,79.673 558.743,79.167 560.652,78.425 C562.623,77.658 564.297,76.634 565.965,74.965 C567.633,73.296 568.659,71.625 569.425,69.651 C570.167,67.743 570.674,65.562 570.82,62.369 C570.966,59.17 571,58.147 571,50 C571,41.851 570.966,40.831 570.82,37.631&#34;&gt;&lt;/path&gt;&lt;/g&gt;&lt;/g&gt;&lt;/g&gt;&lt;/svg&gt;&lt;/div&gt;&lt;div style=&#34;padding-top: 8px;&#34;&gt; &lt;div style=&#34; color:#3897f0; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:550; line-height:18px;&#34;&gt;View this post on Instagram&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;padding: 12.5% 0;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: row; margin-bottom: 14px; align-items: center;&#34;&gt;&lt;div&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(0px) translateY(7px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34;background-color: #F4F4F4; height: 12.5px; transform: rotate(-45deg) translateX(3px) translateY(1px); width: 12.5px; flex-grow: 0; margin-right: 14px; margin-left: 2px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34;background-color: #F4F4F4; border-radius: 50%; height: 12.5px; width: 12.5px; transform: translateX(9px) 
translateY(-18px);&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;margin-left: 8px;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 50%; flex-grow: 0; height: 20px; width: 20px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; width: 0; height: 0; border-top: 2px solid transparent; border-left: 6px solid #f4f4f4; border-bottom: 2px solid transparent; transform: translateX(16px) translateY(-4px) rotate(30deg)&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;div style=&#34;margin-left: auto;&#34;&gt; &lt;div style=&#34; width: 0px; border-top: 8px solid #F4F4F4; border-right: 8px solid transparent; transform: translateY(16px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; flex-grow: 0; height: 12px; width: 16px; transform: translateY(-4px);&#34;&gt;&lt;/div&gt; &lt;div style=&#34; width: 0; height: 0; border-top: 8px solid #F4F4F4; border-left: 8px solid transparent; transform: translateY(-4px) translateX(8px);&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt; &lt;div style=&#34;display: flex; flex-direction: column; flex-grow: 1; justify-content: center; margin-bottom: 24px;&#34;&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; margin-bottom: 6px; width: 224px;&#34;&gt;&lt;/div&gt; &lt;div style=&#34; background-color: #F4F4F4; border-radius: 4px; flex-grow: 0; height: 14px; width: 144px;&#34;&gt;&lt;/div&gt;&lt;/div&gt;&lt;/a&gt;&lt;p style=&#34; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; line-height:17px; margin-bottom:0; margin-top:8px; overflow:hidden; padding:8px 0 7px; text-align:center; text-overflow:ellipsis; white-space:nowrap;&#34;&gt;&lt;a href=&#34;https://www.instagram.com/reel/C-L49Q1SZ3E/?utm_source=ig_embed&amp;amp;utm_campaign=loading&#34; style=&#34; color:#c9c8cd; font-family:Arial,sans-serif; font-size:14px; font-style:normal; font-weight:normal; line-height:17px; text-decoration:none;&#34; target=&#34;_blank&#34;&gt;A post shared by Dustin Portzline 
(@dustinportzline)&lt;/a&gt;&lt;/p&gt;&lt;/div&gt;&lt;/blockquote&gt;
&lt;script async src=&#34;https://www.instagram.com/embed.js&#34;&gt;&lt;/script&gt;
&lt;hr&gt;
&lt;p&gt;I experience from humans &lt;a href=&#34;https://en.wikipedia.org/wiki/The_Work_of_Art_in_the_Age_of_Mechanical_Reproduction&#34;&gt;an &amp;ldquo;aura&amp;rdquo; which shines from no mechanical reproduction&lt;/a&gt; of humanity. Being face-to-face with another human has the most powerful effect on us, but even a text message from a human demands our consideration in the way an automated text doesn&amp;rsquo;t. If I respect someone, then I care what they think of me and I don&amp;rsquo;t want to disappoint them; I&amp;rsquo;ll try harder. My personal trainer made me stronger partly by prescribing the right exercises, and partly by watching me and yelling, &amp;ldquo;C&amp;rsquo;mon! One more!&amp;rdquo;&lt;/p&gt;
&lt;p&gt;In fact the content of another person&amp;rsquo;s advice is less important than their &lt;em&gt;attention&lt;/em&gt;. Therapists sometimes offer insight, but they mostly listen as their patients discover themselves. &lt;a href=&#34;https://en.wikipedia.org/wiki/Hawthorne_effect&#34;&gt;Factory workers are more productive when a researcher dims the lights, or makes them brighter&lt;/a&gt;—the brightness doesn&amp;rsquo;t matter, the workers are more motivated because they know a researcher is paying attention to them.&lt;/p&gt;
&lt;p&gt;Of course, human coaches are very expensive, and &lt;a href=&#34;https://en.wikipedia.org/wiki/Baumol_effect&#34;&gt;Baumol&amp;rsquo;s cost disease&lt;/a&gt; makes them more so. If Altman and Huffington&amp;rsquo;s AI coach is cheaper than a human, then it&amp;rsquo;s worth a try, even though it won&amp;rsquo;t be as effective. But in the long term I expect Baumol&amp;rsquo;s cost disease to reverse: robots will do most of the work, and there will be a surplus of human labor looking for jobs. They will collect in the professions where the human aura is important or essential: therapists, stage actors, judges, nannies, masseuses, butlers, prostitutes, coaches.&lt;/p&gt;
&lt;p&gt;There may be AI-native generations in the future that respect machines as much as they respect people. The special aura of humanity will be gone for them. A machine&amp;rsquo;s attention, although divided among millions of users, will be as satisfying to them as the undivided attention of a human. Confessing their shame to a machine will be just as great a relief. A robot yelling &amp;ldquo;one more!&amp;rdquo; will be just as motivating. An AI judge&amp;rsquo;s decision will carry the same weight. Good luck to them, I can&amp;rsquo;t imagine their lives. For a while, though, I think that coaching will be among the last essentially human jobs.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;Photo: &lt;a href=&#34;https://www.flickr.com/photos/emptysquare/520910820/&#34;&gt;my eye&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Summer Meditation Retreat</title>
      <link>https://emptysqua.re/blog/meditation-retreat-summer-2024/</link>
      <pubDate>Thu, 15 Aug 2024 22:15:35 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/meditation-retreat-summer-2024/</guid>
      
      
      
      
      <description>&lt;p&gt;August 2024. The &lt;a href=&#34;https://villagezendo.org/&#34;&gt;Village Zendo&lt;/a&gt;&amp;rsquo;s summer retreat. Walking meditation, 5am meditation by candlelight, and Zen student Seiryu&amp;rsquo;s first dharma talk.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/meditation-retreat-summer-2024/voice-of-wisdom-sesshin-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/meditation-retreat-summer-2024/voice-of-wisdom-sesshin-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/meditation-retreat-summer-2024/voice-of-wisdom-sesshin-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/meditation-retreat-summer-2024/voice-of-wisdom-sesshin-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/meditation-retreat-summer-2024/voice-of-wisdom-sesshin-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/meditation-retreat-summer-2024/voice-of-wisdom-sesshin-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/meditation-retreat-summer-2024/voice-of-wisdom-sesshin-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Lights Out Climbing</title>
      <link>https://emptysqua.re/blog/lights-out-climbing/</link>
      <pubDate>Fri, 28 Jun 2024 07:55:01 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/lights-out-climbing/</guid>
      
      
      
      
      <description>&lt;p&gt;June 26, 2023. &amp;ldquo;Pride Climb &amp;amp; Glow Night&amp;rdquo; at the Gravity Vault climbing gym in Poughkeepsie.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lights-out-climbing/lights-out-climbing-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lights-out-climbing/lights-out-climbing-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lights-out-climbing/lights-out-climbing-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lights-out-climbing/lights-out-climbing-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lights-out-climbing/lights-out-climbing-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>New Abbot Ceremony</title>
      <link>https://emptysqua.re/blog/shisanshiki/</link>
      <pubDate>Wed, 19 Jun 2024 10:20:55 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/shisanshiki/</guid>
      
      
      
      
      <description>&lt;p&gt;June 16, 2024. The &amp;ldquo;Descending and Ascending the Mountain Ceremony&amp;rdquo;, or Shisanshiki, at the Village Zendo in NYC. Our abbot Enkyo O&amp;rsquo;Hara Roshi retired and was succeeded by Shinryu Sinclair Roshi.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/shisanshiki/shisanshiki-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/shisanshiki/shisanshiki-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/shisanshiki/shisanshiki-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/shisanshiki/shisanshiki-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Climbing in Lost City</title>
      <link>https://emptysqua.re/blog/lost-city/</link>
      <pubDate>Sat, 15 Jun 2024 21:38:01 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/lost-city/</guid>
      
      
      
      
      <description>&lt;p&gt;June 15, 2024. Climbing in Lost City, an undocumented area of the Gunks. The local tradition prohibits publishing information about climbing routes here. The first image is of Dustin Portzline on &amp;ldquo;Keelhauled&amp;rdquo;, an ungraded roof-crack boulder problem. The others are on a nameless 5.9 route.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lost-city/lost-city-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lost-city/lost-city-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lost-city/lost-city-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lost-city/lost-city-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/lost-city/lost-city-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Mohonk Trail Builders</title>
      <link>https://emptysqua.re/blog/mohonk-trail-builders/</link>
      <pubDate>Tue, 11 Jun 2024 15:15:14 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/mohonk-trail-builders/</guid>
      
      
      
      
      <description>&lt;p&gt;June 11, 2024. The Mohonk Trail Builders lifting big rocks to build a trail in the Mohonk Preserve.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-trail-builders/mohonk-trail-builders-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-trail-builders/mohonk-trail-builders-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-trail-builders/mohonk-trail-builders-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-trail-builders/mohonk-trail-builders-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/mohonk-trail-builders/mohonk-trail-builders-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Absurdland in the Gunks</title>
      <link>https://emptysqua.re/blog/gunks-absurdland/</link>
      <pubDate>Sun, 09 Jun 2024 17:04:01 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/gunks-absurdland/</guid>
      
      
      
      
      <description>&lt;p&gt;June 7, 2024. My friends climbing &lt;a href=&#34;https://www.mountainproject.com/route/105799190/absurdland&#34;&gt;Absurdland&lt;/a&gt; in the Gunks as a storm approaches.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-absurdland/2024-06-7-gunks-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-absurdland/2024-06-7-gunks-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-absurdland/2024-06-7-gunks-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-absurdland/2024-06-7-gunks-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/gunks-absurdland/2024-06-7-gunks-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Bonticou Crag</title>
      <link>https://emptysqua.re/blog/bonticou-crag/</link>
      <pubDate>Wed, 05 Jun 2024 14:37:06 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/bonticou-crag/</guid>
      
      
      
      
      <description>&lt;p&gt;June 3, 2024. A Gunks cliff at sunset. I did a quick hike to the cliff in the evening and scrambled up, then I alternated taking photos and eating a sandwich as the light changed. A couple of these are HDRs: I steadied my digital camera on some rocks and took three exposure-bracketed frames, then merged them in Lightroom. The blossoms are mountain laurel.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-crag/ajdavis_20240603_3461.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-crag/ajdavis_20240603_3510.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-crag/ajdavis_20240603_3536.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-crag/ajdavis_20240603_3558.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-crag/ajdavis_20240603_3738-HDR.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-crag/ajdavis_20240603_3751-HDR.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Review: Detock: High Performance Multi-region Transactions at Scale</title>
      <link>https://emptysqua.re/blog/review-detock/</link>
      <pubDate>Wed, 29 May 2024 09:18:59 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-detock/</guid>
      
      
      
      
      <description>&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/8DyJNou3pJM?si=KS5MhJxPCzSCSyC9&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;a href=&#34;https://dl.acm.org/doi/10.1145/3589293&#34;&gt;Detock: High Performance Multi-region Transactions at Scale&lt;/a&gt;, SIGMOD 2023. This paper is about &lt;a href=&#34;https://jepsen.io/consistency/models/strict-serializable&#34;&gt;strict serializable&lt;/a&gt; transactions in a geo-distributed database. It&amp;rsquo;s named &amp;ldquo;Detock&amp;rdquo; for &lt;u&gt;de&lt;/u&gt;terministic deadl&lt;u&gt;ock&lt;/u&gt; avoidance. My presentation to the &lt;a href=&#34;https://charap.co/spring-2024-reading-group-papers-161-170/&#34;&gt;Distributed Systems Reading Group&lt;/a&gt; is above, and my written summary is below.&lt;/p&gt;
&lt;hr&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-detock/john-calvin.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Calvin&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Detock&amp;rsquo;s goals:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Strict-serializable, multi-region transactions.&lt;/li&gt;
&lt;li&gt;Low latency and high throughput under high contention.&lt;/li&gt;
&lt;li&gt;Handle contention between multi- and single-region transactions.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Detock descends from a lineage of &amp;ldquo;deterministic&amp;rdquo; databases invented by Daniel Abadi and others, starting with &lt;a href=&#34;https://dl.acm.org/doi/10.1145/2213836.2213838&#34;&gt;Calvin: Fast Distributed Transactions for Partitioned Database Systems&lt;/a&gt; in 2012. Calvin decides in advance how a sequence of transactions will execute, before the transactions fan out to the partitions. (It took me years to realize it was named for John Calvin, who taught that souls were predestined for heaven or hell.) Then there was &lt;a href=&#34;https://dl.acm.org/doi/10.14778/3342263.3342647&#34;&gt;SLOG: Serializable, Low-latency, Geo-replicated Transactions&lt;/a&gt; in 2019, which applies deterministic transactions to geo-replication. Detock has the same architecture and mostly the same code as SLOG, but it resolves deadlocks differently, as you&amp;rsquo;ll see.&lt;/p&gt;
&lt;h1&gt;Detock&amp;rsquo;s Architecture&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-detock/detock-architecture.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Detock is partitioned and distributed across geographic regions. A region can have multiple partitions. Each item in the datastore has one home region, which holds the authoritative fresh copy of that item. An item can be asynchronously replicated from its home region to other regions; the other regions can have a read-only cache of the item.&lt;/p&gt;
&lt;p&gt;A transaction can arrive at any partition; the partition becomes the transaction&amp;rsquo;s coordinator. Transactions are one-shot, which is a requirement of deterministic databases. We can&amp;rsquo;t do SQL-style conversational transactions. The read set and write set are either declared by the client, or can be determined with static analysis of the transaction&amp;rsquo;s code if it&amp;rsquo;s a procedure, or else the coordinator does a reconnaissance transaction to determine the read and write set. If the coordinator uses a reconnaissance transaction, and then the data changes such that the read and write set become invalid, that&amp;rsquo;s detected somehow and the transaction is retried.&lt;/p&gt;
&lt;p&gt;The coordinator uses a &amp;ldquo;home directory&amp;rdquo; to map data items to their home regions, annotates the read/write sets with their respective regions, and forwards the transaction to all participant regions.&lt;/p&gt;
&lt;h1&gt;Single-Home Transactions&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-detock/single-home.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;For a single-home transaction, the transaction is added to the log in its home region. These logs are stored in batches and written to disk every five milliseconds (batched to minimize disk I/O, I think).&lt;/p&gt;
&lt;p&gt;Single-home transactions are arranged in a graph according to their dependencies, and there won&amp;rsquo;t be any cycles. The paper explains why single-home transactions can&amp;rsquo;t have dependency cycles with each other, I think this is because they&amp;rsquo;re one-shot transactions which have already been totally ordered by the sequencer. Since they can&amp;rsquo;t have cycles, they don&amp;rsquo;t have deadlocks, so the scheduler dispatches them to the workers in a straightforward way. Whenever all a transaction&amp;rsquo;s dependencies have finished, the transaction can run.&lt;/p&gt;
&lt;p&gt;There can be single-home transactions that are in one region but multiple &lt;strong&gt;partitions&lt;/strong&gt;; the Detock paper isn&amp;rsquo;t interested in these. It just says, &amp;ldquo;Transactions accessing multiple partitions in the same replica follow a deterministic execution protocol similar to Calvin and thus do not require two-phase commit.&amp;rdquo; I guess I need to re-read the Calvin paper.&lt;/p&gt;
&lt;p&gt;Each region&amp;rsquo;s local log is asynchronously replicated to other regions. All regions replicate each other&amp;rsquo;s transactions at different times; there&amp;rsquo;s no coordination. When a region replicates a remote log it can play the transactions there to update its local copy of the remote data. I think this makes all the copies eventually consistent, but there&amp;rsquo;s no stronger guarantee.&lt;/p&gt;
&lt;p&gt;So let&amp;rsquo;s talk about multi-home transactions, this is the interesting part.&lt;/p&gt;
&lt;h1&gt;Multi-Home Transactions&lt;/h1&gt;

&lt;p&gt;Let&amp;rsquo;s say a transaction comes to some region, and its list of items spans several regions.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-detock/multi-home-1.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-detock/multi-home-1.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Here&amp;rsquo;s my understanding of the algorithm. A client sends a transaction to an arbitrary region. The forwarder checks the read and write sets and annotates the keys with their home regions. Since this transaction involves keys in Region A and Region B the forwarder forwards it to Region B.&lt;/p&gt;
&lt;p&gt;In both regions, the forwarder notices that this is a multi-home transaction, so it creates something called a &amp;ldquo;Graph Placement Transaction&amp;rdquo;, I&amp;rsquo;ve drawn these as T1a and T1b.&lt;/p&gt;
&lt;p&gt;A &amp;ldquo;Graph Placement Transaction&amp;rdquo; is the part of the transaction that only uses keys in one region. So T1a is the part of the transaction on keys in A, same for T1b. Graph Placement Transactions are like single-home transactions, and they&amp;rsquo;re added to a batch by the sequencer, like single-home transactions. But scheduling them is much more complex than scheduling single-home transactions.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s say another transaction T2 arrives around the same time in Region B. And let&amp;rsquo;s say that T2&amp;rsquo;s Graph Placement Transactions are sequenced so that T2 is first in Region B and second in Region A.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-detock/multi-home-2.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-detock/multi-home-2.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;The two regions&amp;rsquo; log managers communicate the sequences they chose. As these Graph Placement Transactions arrive they&amp;rsquo;re processed by the scheduler. They arrive in different orders at different regions&amp;rsquo; schedulers. So maybe in Region A, T1 arrives before T2, and vice versa in Region B. So the scheduler can&amp;rsquo;t just execute each transaction as soon as possible once all conflicting transactions are finished, that would lead to different outcomes in different regions. We need some way for these Graph Placement Transactions to be scheduled in the same order everywhere, despite the asynchronous replication. The authors write,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;GraphPlacementTxns establish an order between multi-home and single-home transactions at the region that generated the GraphPlacementTxn. However, they do not globally order multi-home transactions, since two different regions may generate GraphPlacementTxns for a set of multi-home transactions in different orders. There is thus a concern that the generated graph may contain cycles, which would lead to deadlock during processing.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This deterministic scheduling is where Detock diverges substantially from SLOG. It&amp;rsquo;s the major contribution, and the hardest for me to understand.&lt;/p&gt;
&lt;h1&gt;Deterministic Deadlock Avoidance&lt;/h1&gt;

&lt;p&gt;To explain Detock&amp;rsquo;s deadlock avoidance, first imagine you know all the transactions that will ever occur, and you know the dependencies among them. You can &lt;strong&gt;condense&lt;/strong&gt; the dependency graph into strongly connected components (SCC).&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-detock/condense.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;A strongly connected component (SCC) is a subgraph where all the transactions are reachable from all the others. Therefore it contains at least one cycle. Within an SCC, Detock schedules transactions according to their unique transaction ids, which are assigned by their coordinators. Among SCCs there are no cycles: the blue arrows can&amp;rsquo;t form cycles, since you&amp;rsquo;ve isolated the cycles within the orange SCCs. Therefore you can just schedule the SCCs by &lt;a href=&#34;https://en.wikipedia.org/wiki/Topological_sorting&#34;&gt;topologically sorting&lt;/a&gt; them.&lt;/p&gt;
&lt;p&gt;This works if you know all transactions, but in reality transactions are arriving continuously. When is a transaction ready to be scheduled?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;For every vertex corresponding to a multi-home transaction &lt;em&gt;T&lt;/em&gt; in the dependency graph, let &lt;em&gt;GPTotal&lt;/em&gt; (&lt;em&gt;T&lt;/em&gt; ) be the total number of GraphPlacementTxns generated for &lt;em&gt;T&lt;/em&gt;, a counter &lt;em&gt;GP&lt;/em&gt; (&lt;em&gt;T&lt;/em&gt; ) is associated with &lt;em&gt;T&lt;/em&gt; to keep track of the number of GraphPlacementTxns of &lt;em&gt;T&lt;/em&gt; that have been added to the graph so far. We define two types of vertices:&lt;/p&gt;
&lt;p&gt;A complete vertex &lt;em&gt;T&lt;/em&gt; is either a single-home transaction or a multi-home transaction with &lt;em&gt;GP&lt;/em&gt; (&lt;em&gt;T&lt;/em&gt; ) equal to &lt;em&gt;GPTotal&lt;/em&gt; (&lt;em&gt;T&lt;/em&gt; ).&lt;/p&gt;
&lt;p&gt;A stable vertex &lt;em&gt;T&lt;/em&gt; is a complete vertex and there does not exist a path going from an incomplete vertex to &lt;em&gt;T&lt;/em&gt; .&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So a transaction is stable in a region if its Graph Placement Transactions have all arrived, as have those for all transactions it depends on. I guess &lt;em&gt;GPTotal&lt;/em&gt; (&lt;em&gt;T&lt;/em&gt; ) is calculated at the beginning, when the forwarder creates the Graph Placement Transactions. Once the transaction is stable it can be scheduled; transactions are scheduled in the same order in all regions.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-detock/multi-home-3.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-detock/multi-home-3.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;I think that the ordering algorithm is the same as &lt;a href=&#34;https://dl.acm.org/doi/pdf/10.1145/2517349.2517350&#34;&gt;Egalitarian Paxos&lt;/a&gt; (&amp;ldquo;EPaxos&amp;rdquo;), and Detock introduces the mechanism for waiting until a transaction is &amp;ldquo;stable&amp;rdquo;, but that&amp;rsquo;s just my guess.&lt;/p&gt;
&lt;p&gt;In pathological cases, conflicting transactions continuously arrive at different regions in different orders, and the set of unstable transactions grows forever. Detock gets livelocked: each transaction&amp;rsquo;s dependencies are never resolved and Detock can never start executing it. The more often transactions arrive in regions in the &lt;em&gt;same&lt;/em&gt; order, the lower the risk of livelock. Detock improves its chances thus: the coordinators assign each transaction a timestamp in the future; a transaction is scheduled once its timestamp has passed, by which time most lower-timestamped transactions have probably already arrived. (This is very similar to &lt;a href=&#34;https://emptysqua.re/blog/review-nezha/#deadline-ordered-multicast-dom&#34;&gt;Deadline-Ordered Multicast in Nezha&lt;/a&gt; and several earlier papers.) The authors call this &amp;ldquo;opportunistic reordering&amp;rdquo;.&lt;/p&gt;
&lt;h1&gt;Their Evaluation&lt;/h1&gt;

&lt;p&gt;Like a lot of papers lately, Detock has a massive evaluation section. The authors compare Detock to:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Calvin, which globally orders all transactions and has no optimizations for geo-distributed transactions,&lt;/li&gt;
&lt;li&gt;SLOG, which globally orders multi-home transactions using one ordering service in one region,&lt;/li&gt;
&lt;li&gt;SLOG (slow), which is Calvin plus global consensus for the ordering of multi-home transactions,&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://www.usenix.org/conference/osdi16/technical-sessions/presentation/mu&#34;&gt;Janus&lt;/a&gt;, an EPaxos variant optimized for geo-distributed transactions,&lt;/li&gt;
&lt;li&gt;CockroachDB, which uses Spanner-style nondeterministic concurrency control based on synchronized clocks.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;For a fair fight, they reimplemented the first four protocols within &lt;a href=&#34;https://github.com/umd-dslam/Detock&#34;&gt;the same codebase as Detock&lt;/a&gt;. They vary workload skewness, network latency, the ratio of multi-home to single-home transactions, the ratio of multi-partition to single-partition transactions within a region, etc. etc. for multiple benchmarks. I was most interested in this sub-chart of Figure 6 (I&amp;rsquo;ve edited its layout):&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-detock/figure-6.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;This is a YCSB experiment with high contention (a few very hot keys) and no data partitioning within regions. It&amp;rsquo;s a rare example of a system not beating all rivals in all circumstances in its evaluation section. Detock&amp;rsquo;s throughput falls as the percent of multi-home transactions rises, since multi-home transactions require more work at more regions than single-home. When most transactions are multi-home, in fact, Calvin&amp;rsquo;s naïve algorithm actually beats Detock. The paper claims (and I agree) that this is an unlikely scenario, though.&lt;/p&gt;
&lt;p&gt;The graph shows how critical opportunistic reordering is for Detock: the blue line at the bottom is pathetic.&lt;/p&gt;
&lt;p&gt;In other experiments, the authors show that Detock&amp;rsquo;s distributed processing of multi-home transactions beats SLOG, where the centralized ordering service is a bottleneck. Plus, SLOG suffers more from contention, since its multi-home transactions hold locks for longer. (See the SLOG and Detock papers for details.)&lt;/p&gt;
&lt;h1&gt;My Evaluation&lt;/h1&gt;

&lt;p&gt;This paper is well-written but its content is ineluctably complex. It&amp;rsquo;s not a single clever algorithm like &lt;a href=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/&#34;&gt;timestamp as a service&lt;/a&gt; or &lt;a href=&#34;https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/&#34;&gt;leader leases&lt;/a&gt;. Instead, it&amp;rsquo;s a new combination of existing parts, many of them already intricate. If you don&amp;rsquo;t know Calvin and EPaxos and maybe SLOG, this paper is hard to read. But its complexity is realistic: real distributed databases are horribly complex. We make them by combining all the state-of-the-art components, trying to stake out an unclaimed position on the Pareto frontier. If you think Detock is complex, try a real-world protocol like &lt;a href=&#34;https://cwiki.apache.org/confluence/download/attachments/188744725/Accord.pdf&#34;&gt;Cassandra&amp;rsquo;s&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;The authors can&amp;rsquo;t claim &amp;ldquo;our protocol is absolutely the best&amp;rdquo;; as always, they have to claim &amp;ldquo;our protocol makes better tradeoffs in realistic scenarios&amp;rdquo;. Detock looks like a solid improvement over previous similar systems, and I appreciate the giant effort the authors made to benchmark Detock and report their results honestly.&lt;/p&gt;
&lt;h1&gt;See Also&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://muratbuffalo.blogspot.com/2023/07/detock-high-performance-multi-region.html&#34;&gt;Murat Demirbas&amp;rsquo;s summary of Detock&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://dl.acm.org/doi/pdf/10.1145/2517349.2517350&#34;&gt;Egalitarian Paxos&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Review: Timestamp as a Service, not an Oracle</title>
      <link>https://emptysqua.re/blog/review-timestamp-as-a-service/</link>
      <pubDate>Thu, 23 May 2024 15:18:28 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-timestamp-as-a-service/</guid>
      
      
      
      
      <description>&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/rxDvzEKylrQ?si=woPUZ32iSp7uIgoM&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; referrerpolicy=&#34;strict-origin-when-cross-origin&#34; style=&#34;margin-bottom: 1em&#34;  allowfullscreen&gt;&lt;/iframe&gt;
&lt;p&gt;&lt;a href=&#34;https://www.vldb.org/pvldb/vol17/p994-li.pdf&#34;&gt;Timestamp as a Service, not an Oracle&lt;/a&gt;, by authors from Alibaba Cloud, in Proceedings of VLDB this year. Watch my presentation to the &lt;a href=&#34;https://charap.co/spring-2024-reading-group-papers-161-170/&#34;&gt;Distributed Systems Reading Group&lt;/a&gt; above, or read my summary below.&lt;/p&gt;
&lt;h1&gt;Timestamp Oracles&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/oracle-delphi.jpg&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 400px;&#34;&gt;
&lt;p&gt;Priestess of Delphi (1891) by John Collier.&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;An oracle is someone who speaks for a god and reveals divine knowledge. In computer science we&amp;rsquo;ve used &amp;ldquo;oracle&amp;rdquo; to refer to theoretical machines that could do something impossible, like solve the halting problem or produce truly random numbers. Also, for some reason, real actual machines that produce monotonic timestamps are called &amp;ldquo;timestamp oracles&amp;rdquo;.&lt;/p&gt;
&lt;p&gt;A timestamp oracle is used by a distributed database to get monotonically increasing IDs, for ordering events. It&amp;rsquo;s a single server in your data center which provides a larger number every time you ask it. The timestamp might or might not be related to the wall clock.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Why not Lamport clocks or vector clocks? They require you to pass these clock values between clients and servers, through all the layers of your multi-tier architecture. &lt;a href=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/&#34;&gt;I know personally that database users find that burdensome&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Why not synchronized clocks? Syncing clocks is hard, and no matter how precise the clock is, you need to add some latency to wait out the uncertainty. Well-synced clocks are becoming widely available, though; see &lt;a href=&#34;https://emptysqua.re/blog/review-huygens-paper/&#34;&gt;Huygens&lt;/a&gt; and &lt;a href=&#34;https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/&#34;&gt;AWS Time Sync&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The paper mentions that timestamp oracles are used by various distributed systems: PolarDB-X, OceanBase, CORFU, TiDB placement driver, Percolator, Postgres-XL. I&amp;rsquo;ve only heard of half of these. The &lt;a href=&#34;https://docs.pingcap.com/tidb/stable/tidb-architecture#placement-driver-pd-server&#34;&gt;TiDB placement driver&lt;/a&gt; (&amp;ldquo;TiDB-PD&amp;rdquo;) includes a timestamp oracle in its implementation, and it&amp;rsquo;s the main example that this paper&amp;rsquo;s authors use as a comparison for evaluating their alternative.&lt;/p&gt;
&lt;p&gt;Timestamp oracles only work for one data center. The timestamp consumers should be on a very low-latency link to the timestamp oracle server; cross-DC links are too slow for practical use. This paper doesn&amp;rsquo;t try to solve that problem: this paper&amp;rsquo;s timestamp-as-a-service is also intended for one data center.&lt;/p&gt;
&lt;p&gt;A &lt;em&gt;fault-tolerant&lt;/em&gt; timestamp oracle is a consensus group: each new timestamp is majority-committed. If the leader fails, the next leader must know the highest timestamp that the previous leader produced. The paper mentions that this can be optimized: the leader could reserve a range of timestamps. It majority-commits the range, and it gives out timestamps until the range is exhausted. A new leader reserves a higher range of timestamps than any previous one.&lt;/p&gt;
&lt;p&gt;A consensus group is fault-tolerant, but nevertheless, losing the leader causes some brief unavailability. Especially since this consensus group must use &lt;a href=&#34;https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/&#34;&gt;timed leader leases&lt;/a&gt;, for speed and consistency. Therefore the new leader has to wait for the previous lease to expire. The paper shows that TiDB-PD is unavailable for 10 seconds after the leader dies. The black throughput line drops to zero each time the leader is killed:&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/tidb-pd-unavailability.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Figure 9 from the paper, lower half&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;Besides being a single point of failure, the leader is a bottleneck—you can&amp;rsquo;t get timestamps from followers, so a system could saturate the timestamp oracle leader.&lt;/p&gt;
&lt;h1&gt;A Service, Not An Oracle&lt;/h1&gt;

&lt;p&gt;The authors propose a timestamp service, rather than an oracle. They say an oracle is one server, therefore a single point of failure, even if it&amp;rsquo;s the leader of a consensus group. A service, however, is not a single point of failure. I don&amp;rsquo;t know if everyone agrees with these definitions of oracle and service, but it&amp;rsquo;s a useful distinction for this paper. The paper says a Timestamp-as-a-Service is &amp;ldquo;a distributed algorithm that computes logical timestamps from a consensusless cluster of clocks.&amp;rdquo; That means no unavailability from loss of the leader, and no bottleneck on the leader. Consensuslessness is not only fun to say, it&amp;rsquo;s the funnest part of this paper!&lt;/p&gt;
&lt;h1&gt;The Algorithm, 1.0&lt;/h1&gt;

&lt;p&gt;The paper presents the Timestamp-as-a-Service (TaaS) algorithm in two stages, starting with a simplified version that assumes no server failures or message loss. There can be any number of clients, and &lt;em&gt;N&lt;/em&gt; servers.&lt;/p&gt;
&lt;p&gt;A client starts a session &lt;em&gt;S&lt;/em&gt; by sending the bottom timestamp ⊥ to all servers. The paper generally uses σ (sigma) for a session but I&amp;rsquo;m going to use &lt;em&gt;S&lt;/em&gt; because I don&amp;rsquo;t read Greek. That symbol for &amp;ldquo;bottom timestamp&amp;rdquo;, you could think of it as negative infinity. Whenever you see that symbol you know you&amp;rsquo;re probably in the world of lattices and order theory and abstract algebra.&lt;/p&gt;
&lt;p&gt;A &amp;ldquo;session&amp;rdquo; in this paper is not a sequence of database commands. It&amp;rsquo;s just the commands required to get one timestamp, then the session is over. The client gets its timestamp and then forgets everything, no state persists into the next session. So the client always starts its session by sending ⊥, even if it got a timestamp in a previous session.&lt;/p&gt;
&lt;p&gt;Each server has a persistent timestamp, which is somehow initialized to some value when the server is born. When the server receives the client request, it increments its timestamp and sends its new current timestamp back to the client. Thus if a server replies with &amp;ldquo;5&amp;rdquo;, the client knows the server&amp;rsquo;s current timestamp is 5, and it knows the server had a timestamp less than 5 before the client talked to it. This will be important.&lt;/p&gt;
&lt;p&gt;The session is complete once all servers have replied. The client uses the &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest timestamp from the replies, for some &lt;em&gt;M&lt;/em&gt; ≤ &lt;em&gt;N&lt;/em&gt;. All clients must use the same &lt;em&gt;M&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;M&lt;/em&gt; can be anything! &lt;em&gt;M&lt;/em&gt; should be the smallest majority, so if &lt;em&gt;N&lt;/em&gt; is 5 then &lt;em&gt;M&lt;/em&gt; should be 3. The paper starts abstractly, by showing TaaS is correct for any &lt;em&gt;M&lt;/em&gt; from 1 to &lt;em&gt;N&lt;/em&gt;. Eventually it admits that for maximum fault tolerance, &lt;em&gt;M&lt;/em&gt; should be the smallest majority, same as a quorum in &lt;a href=&#34;https://emptysqua.re/blog/paxos-vs-raft/&#34;&gt;Paxos or any consensus algorithm&lt;/a&gt;. But for now, &lt;em&gt;M&lt;/em&gt; is any number 1 through &lt;em&gt;N&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Let&amp;rsquo;s look at an example of TaaS 1.0 in action.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/algorithm-v1.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Figure 2 from the paper.&lt;/p&gt;
&lt;p&gt;&lt;/em&gt;&lt;/div&gt;&lt;/p&gt;
&lt;p&gt;There are Clients V and Client W, and Servers X, Y, and Z. Session Alpha starts concurrently with Session Beta. Session Gamma starts after Session Alpha. Let&amp;rsquo;s say &lt;em&gt;M&lt;/em&gt; = 2, so at the end of each session the client chooses the second‑smallest timestamp from all the server replies.&lt;/p&gt;
&lt;p&gt;Session Alpha gets timestamp 1, Beta gets timestamp 2, and Gamma gets timestamp 3. Beta&amp;rsquo;s allowed to have any timestamp since it&amp;rsquo;s concurrent with the others. So you can see these timestamps uphold the linearizability guarantee. The important constraint is, Session Alpha&amp;rsquo;s timestamp must be less than Session Gamma&amp;rsquo;s, and it is: 1 is less than 3.&lt;/p&gt;
&lt;p&gt;The client could send its messages to all servers in parallel, or any order within a session, and the latencies could be of any length, TaaS still works.&lt;/p&gt;
&lt;h2&gt;Theorem 1&lt;/h2&gt;

&lt;p&gt;&amp;ldquo;The timestamp for session &lt;em&gt;T&lt;/em&gt; is guaranteed larger than the timestamp for any session &lt;em&gt;S&lt;/em&gt; that ended before &lt;em&gt;T&lt;/em&gt; began.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;This sounds like a linearizability guarantee, and I believe you could call the timestamp a linearizable data structure. Proof:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest response in &lt;em&gt;S&lt;/em&gt; ≤ the &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest server state at the end of &lt;em&gt;S&lt;/em&gt;. (Servers&amp;rsquo; timestamps increase monotonically, so their timestamps at the end of &lt;em&gt;S&lt;/em&gt; must be ≥ their responses in &lt;em&gt;S&lt;/em&gt;.)&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest server state at the end of &lt;em&gt;S&lt;/em&gt; ≤ the &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest server state at the start of &lt;em&gt;T&lt;/em&gt;. (Monotonicity, and &lt;em&gt;S&lt;/em&gt; ends before &lt;em&gt;T&lt;/em&gt; starts.)&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest server state at the start of &lt;em&gt;T&lt;/em&gt; &amp;lt; the &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest response in &lt;em&gt;T&lt;/em&gt;. (The client makes all the servers increment their timestamps during &lt;em&gt;T&lt;/em&gt;.)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;So by transitivity, the left side of #1 &amp;lt; the right side of #3. &lt;em&gt;&lt;strong&gt;Q.E.D.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;
&lt;h1&gt;The Algorithm, 2.0&lt;/h1&gt;

&lt;p&gt;In the second and final version of TaaS, servers can crash-fail indefinitely, and they can come back online.&lt;/p&gt;
&lt;p&gt;A restarted server still guarantees monotonicity, i.e. it remembers the last timestamp it produced before it crashed. The only way to guarantee this is some sort of replication. The paper suggests using RAID, or cloud storage like S3 which has its own replication, or making each timestamp server a consensus group with fault-tolerance. You might ask, if we make each timestamp server a consensus group, haven&amp;rsquo;t we come full-circle to the &amp;ldquo;timestamp oracle&amp;rdquo; that the paper says is bad? Not quite: now we have a separate group for each timestamp server, so if one group loses its leader, TaaS can proceed with the other groups without waiting for a new leader.&lt;/p&gt;
&lt;p&gt;The TaaS client&amp;rsquo;s goal is to find a timestamp t such that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;t &amp;gt; &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest of all servers&amp;rsquo; timestamps when the session started&lt;/li&gt;
&lt;li&gt;t ≤ &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest of all servers&amp;rsquo; timestamps when the session ends&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These two properties are the two facts that Theorem 1 depends on.&lt;/p&gt;
&lt;p&gt;Here&amp;rsquo;s an example where Server X is partitioned from the client. Let&amp;rsquo;s say &lt;em&gt;M&lt;/em&gt;=2; we want the 2&lt;sup&gt;nd&lt;/sup&gt;‑smallest timestamp.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/session-delta.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;From Figure 3.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;The client remembers that it got timestamp 5 in some past session from Server X. This memory is a new feature of the fault-tolerant version of TaaS.&lt;/p&gt;
&lt;p&gt;In session δ, the client gets a 4 and a 5 from the servers it can reach. It picks 5 as the timestamp for this session, because the client knows, without talking to Server X:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;5 &amp;gt; 2&lt;sup&gt;nd&lt;/sup&gt;-smallest timestamp at the start of the session: The client heard Servers Y and Z respond with timestamps 4 and 5, so their previous timestamps were less than 5.&lt;/li&gt;
&lt;li&gt;5 ≤ 2&lt;sup&gt;nd&lt;/sup&gt;-smallest timestamp at the end of the session: The client knows Servers X and Z now have timestamps at least 5.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The two facts that Theorem 1 relies on are both true, so the client can pick 5 without talking to Server X.&lt;/p&gt;
&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/session-epsilon.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;From Figure 3.&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;In session ε the client gets 5 and 6 from the servers it can reach. Now it doesn&amp;rsquo;t know the second‑smallest timestamp. If Server X is talking to some other client, it might have advanced to 7; then the second‑smallest would be 6. Or Server X might be at timestamp 5.5—timestamps don&amp;rsquo;t have to be integers! (I wish the paper had mentioned this earlier.) If Server X has 5.5, then 5.5 would be the second‑smallest. We don&amp;rsquo;t know. What&amp;rsquo;s the solution?&lt;/p&gt;
&lt;p&gt;The client does something new now. Instead of sending the bottom timestamp to all the servers, like it did before, it continues this session. It thinks 6 might be the second‑smallest timestamp. So it sends 6 to any servers that might have less than 6: Server Y. Then Server Y updates its own timestamp to at least 6, increments it by one, and returns it.&lt;/p&gt;
&lt;p&gt;Now the client has 6 from Z, 7 from Y. X is still down, but the client doesn&amp;rsquo;t care, it can choose 6:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;6 &amp;gt; 2&lt;sup&gt;nd&lt;/sup&gt;-smallest timestamp at the start of the session: The client heard Servers Y and Z respond with 5 and 6, so they started with less than 6.&lt;/li&gt;
&lt;li&gt;6 ≤ 2&lt;sup&gt;nd&lt;/sup&gt;-smallest timestamp at the end of the session: The client eventually heard Servers Z and Y respond with 6 and 7, so they ended with at least 6.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;So the client achieved certainty by advancing one of the server&amp;rsquo;s timestamps. That&amp;rsquo;s a specific example, here&amp;rsquo;s the general algorithm.&lt;/p&gt;
&lt;h2&gt;Algorithm 2.0 pseudocode&lt;/h2&gt;

&lt;p&gt;The paper uses a pseudocode that&amp;rsquo;s hard for me to read, so of course I&amp;rsquo;ll make my own pseudocode that seems easier to me &lt;a href=&#34;https://emptysqua.re/blog/write-python-not-pseudocode/&#34;&gt;but may be worse for you&lt;/a&gt;. Sorry.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-text&#34; data-lang=&#34;text&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;// global map from server to timestamp, initially bottom timestamp for
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;// all servers. the client persistently tracks this. 
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;cache = {server: ⟘ for server in servers}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;// client code to acquire one timestamp
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;def do_session():
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  // &amp;#34;session&amp;#34; is like &amp;#34;cache&amp;#34; but reset with top timestamp each session
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  session = {server: ⟙ for server in servers}
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  // like TaaS 1.0, initially send bottom timestamp to all servers
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  send timestamp ⟘ to all servers
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;  while true:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    reply = await next reply
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    // update global cache with max per server, local with min
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    cache[reply.server] = max(cache[reply.server], reply.timestamp)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    session[reply.server] = min(session[reply.server], reply.timestamp)
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    if we received at least M replies:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      candidate = Mth-smallest value in session
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      if candidate ≤ Mth-smallest value in cache:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        return candidate
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;      else if no more quickly-available replies:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        // promote the candidate
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        for server in servers:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;          if cache[server] &amp;lt; candidate:
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;            send candidate timestamp to server
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;As the client gets replies, it updates the global and local maps. The purpose of these two maps is to determine when we know the two facts we need for Theorem 1.&lt;/p&gt;
&lt;p&gt;Each server&amp;rsquo;s timestamp is guaranteed to increase monotonically, but a client could get replies out of order. So it uses &lt;code&gt;max&lt;/code&gt; when updating the global &lt;code&gt;cache&lt;/code&gt;, to ignore delayed messages. The client uses &lt;code&gt;min&lt;/code&gt; when updating the per-session map called &lt;code&gt;session&lt;/code&gt;. I don&amp;rsquo;t understand why; it seems that if the client got a 6 from Server X, then a 5, it should keep the 6, because it knows that 5 is a delayed message. The algorithm is still correct, it just has unnecessary retries.&lt;/p&gt;
&lt;p&gt;How does the client know when it&amp;rsquo;s acquired a correct timestamp, at the &lt;code&gt;return candidate&lt;/code&gt; line? Remember, the goal is to find a timestamp t such that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;t &amp;gt; &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest of all servers&amp;rsquo; timestamps when the session started: at least &lt;em&gt;M&lt;/em&gt; servers started &amp;lt; candidate, according to session-local map&lt;/li&gt;
&lt;li&gt;t ≤ &lt;em&gt;M&lt;/em&gt;&lt;sup&gt;th&lt;/sup&gt;‑smallest of all servers&amp;rsquo; timestamps when the session ends: at least &lt;em&gt;M&lt;/em&gt; servers end &amp;gt; candidate, according to global map&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The client knows &lt;code&gt;candidate&lt;/code&gt; satisfies the first criterion, because according to the session-local map, there are &lt;em&gt;M&lt;/em&gt; servers that had smaller timestamps when the session started, before the client talked to them. All the values in the session-local map are values that servers incremented during the session, so the starting values were smaller.&lt;/p&gt;
&lt;p&gt;The client knows &lt;code&gt;candidate&lt;/code&gt; meets the second criterion because it explicitly checks it in the global &lt;code&gt;cache&lt;/code&gt;. The client hasn&amp;rsquo;t talked to all the servers, but it&amp;rsquo;s talked to enough of them to know that this fact is true. It wants to ensure that future sessions are guaranteed to get a larger timestamp than this one, and it&amp;rsquo;s learned enough to guarantee that.&lt;/p&gt;
&lt;p&gt;If the candidate fails that check, the client tries to make the second criterion true by advancing the timestamps on servers that might have less than &lt;code&gt;candidate&lt;/code&gt;. The client tries all lagging servers, including those that were unavailable recently, in case they came back online.&lt;/p&gt;
&lt;p&gt;At the line, &lt;code&gt;else if no more quickly-available replies&lt;/code&gt;, there&amp;rsquo;s some timeout while the client waits to see if the remaining servers are going to reply soon or not. I guess that&amp;rsquo;s a tunable parameter.&lt;/p&gt;
&lt;h1&gt;What Should &lt;em&gt;M&lt;/em&gt; Be?&lt;/h1&gt;

&lt;p&gt;It&amp;rsquo;s finally time to descend from lofty abstraction and decide on a value for &lt;em&gt;M&lt;/em&gt;. The authors write, &amp;ldquo;The system allows at most min(&lt;em&gt;N&lt;/em&gt; − &lt;em&gt;M&lt;/em&gt;, &lt;em&gt;N&lt;/em&gt; − 1) downs while continuing its service.&amp;rdquo; So for optimal fault-tolerance &lt;em&gt;M&lt;/em&gt; should be the median of 1 .. &lt;em&gt;N&lt;/em&gt;:&lt;/p&gt;
&lt;picture&gt;
  &lt;source srcset=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/formula-23937a9b21915f03fe7764b9819b6e13dcef7fd4ecf6f25a0942a5f4d656309b.svg&#34; type=&#34;image/svg+xml&#34;&gt;
  &lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/formula-23937a9b21915f03fe7764b9819b6e13dcef7fd4ecf6f25a0942a5f4d656309b.png&#34; style=&#34;display:inline;width:auto;height:1em;vertical-align:middle&#34; alt=&#34;\left\lceil \frac{n+1}{2} \right\rceil&#34;&gt;
&lt;/picture&gt;&lt;p&gt;I suppose it&amp;rsquo;s theoretically interesting that &lt;em&gt;M&lt;/em&gt; could be any number, but practically, it should be a majority, which makes this protocol resemble consensus a lot. However, the authors insist this isn&amp;rsquo;t consensus, because servers never talk with each other. TaaS is a &amp;ldquo;bipartite architecture&amp;rdquo;, where clients and servers talk to each other, but servers don&amp;rsquo;t talk to servers and clients don&amp;rsquo;t talk to clients. There&amp;rsquo;s a nice little discussion in Section 6.2 of the definition of consensus and consensusless consistency.&lt;/p&gt;
&lt;h1&gt;Unique Timestamps&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/agent-smith.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Concurrent sessions may get the same timestamp. Solution: each server appends its server ID to the returned timestamp.&lt;/p&gt;
&lt;h1&gt;Their Evaluation&lt;/h1&gt;

&lt;p&gt;Comparing TaaS to a consensus-based timestamp oracle (TiDB-PD), the authors claim:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the happy case, TaaS is higher-latency than TiDB-PD, especially with more servers.&lt;/li&gt;
&lt;li&gt;TiDB-PD stalls for 10 seconds after leader failure.&lt;/li&gt;
&lt;li&gt;TaaS latency doesn&amp;rsquo;t increase much when servers are slow or dead, if a majority is healthy.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Thus TaaS is closer to &lt;a href=&#34;https://aws.amazon.com/builders-library/reliability-and-constant-work/&#34;&gt;constant work&lt;/a&gt;, which makes it more stable as part of a whole system and avoids &lt;a href=&#34;https://charap.co/metastable-failures-in-distributed-systems/&#34;&gt;metastable failures&lt;/a&gt;. Here&amp;rsquo;s Figure 9 again, now with TaaS included:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/figure-9.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;At the 1- and 2-minute marks, the authors killed a server. TiDB-PD suffered a complete stall in throughput, TaaS merely increased latency a bit.&lt;/p&gt;
&lt;h1&gt;My Evaluation&lt;/h1&gt;

&lt;p&gt;The algorithm is fairly simple, but the authors explain it with needlessly weird notation. These are three pieces of notation actually used in this paper:&lt;/p&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/weird-notation.png&#34; style=&#34;max-width: 300px&#34;&gt;
&lt;p&gt;This looks less like math and more like an alien language. The one on the bottom means the &amp;ldquo;second‑smallest timestamp received by session delta.&amp;rdquo; I guess that&amp;rsquo;s ok, but they never explain why delta gets an upside-down hat? Is that a rule in this alien culture?&lt;/p&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-timestamp-as-a-service/alien.png&#34; style=&#34;max-width: 100%&#34;&gt;
&lt;p&gt;More seriously, I&amp;rsquo;m curious about several questions:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;TaaS doesn&amp;rsquo;t assume timestamps are integers. If it did, i.e. if there were a minimum increment amount, could the fault-tolerant algorithm be more efficient?&lt;/li&gt;
&lt;li&gt;What happens when a client restarts and loses its long-lived global cache? Does that weaken fault-tolerance?&lt;/li&gt;
&lt;li&gt;What about reconfiguration: how are timestamp servers initialized, added, or removed?&lt;/li&gt;
&lt;li&gt;Has Alibaba deployed this? It seems like they haven&amp;rsquo;t. Why?—did they use synced clocks instead?&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I enjoyed this paper, and spent a long time understanding it (as you can see). It describes a new protocol in the classic distributed systems style. It provides rigorous explanations and proofs, and informative experiments. If you need monotonic ids within one data center and you can&amp;rsquo;t use synchronized clocks, TaaS is a simple solution with stable performance during failures.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Update:&lt;/strong&gt; One of the authors has &lt;a href=&#34;https://x.com/YishuaiLi/status/1802958253602779241&#34;&gt;answered some of my questions&lt;/a&gt; about the protocol.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>How To Use MongoDB Causal Consistency</title>
      <link>https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/</link>
      <pubDate>Tue, 14 May 2024 15:53:22 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/</guid>
      
      
      
      
      <description>&lt;p&gt;MongoDB implemented a consistency level called &lt;a href=&#34;https://jepsen.io/consistency/models/causal&#34;&gt;causal consistency&lt;/a&gt; in 2017 (&lt;a href=&#34;https://emptysqua.re/blog/driver-features-for-mongodb-3-6/&#34;&gt;version 3.6&lt;/a&gt;). It&amp;rsquo;s quite a handy consistency level, strong enough for most applications and still performant. I helped design the API for causal consistency, but when we released it I dropped the ball and didn&amp;rsquo;t publicize or document it well. MongoDB&amp;rsquo;s causal consistency didn&amp;rsquo;t get the fame it deserved in 2017; I&amp;rsquo;ll try to rectify that now.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/replica-set.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/replica-set.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;MongoDB implements a Raft-like consensus protocol. Most people deploy MongoDB as a three-server replica set, with one primary and two secondaries. Only the primary executes writes, which are replicated by secondaries with a small delay. You can read from the primary or the secondaries. Your application talks to the servers via an instance of the &lt;a href=&#34;https://www.mongodb.com/docs/drivers/&#34;&gt;MongoClient&lt;/a&gt; class.&lt;/p&gt;
&lt;h1&gt;The Problem With Secondary Reads&lt;/h1&gt;

&lt;div style=&#34;text-align: center; margin-bottom: 1em&#34;&gt;&lt;em&gt;
&lt;img src=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/influencer.png&#34; alt=&#34;&#34; style=&#34;display: block; margin: auto; margin-bottom: 0.5em; max-width: 100%;&#34;&gt;
&lt;p&gt;Liam&lt;/p&gt;
&lt;/em&gt;&lt;/div&gt;
&lt;p&gt;Let&amp;rsquo;s start with the classic example: you&amp;rsquo;ve implemented a social media site, with MongoDB as your database. Some influencer named Liam posts something witty to your site, then clicks on his profile to admire his wittiness. But Liam doesn&amp;rsquo;t see his latest post! He desperately refreshes, and phew: soon his new post appears. What happened?&lt;/p&gt;
&lt;p&gt;The problem is that you have tried to decrease load on the primary by reading posts from secondaries. For example, in Python:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-py3&#34; data-lang=&#34;py3&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;client&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;MongoClient&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;mongodb://srv1,srv2,srv3/?replicaSet=rs&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Configure this collection object to read from secondaries.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;posts_collection&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;client&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;db&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;get_collection&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;posts&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;read_preference&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ReadPreference&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;SECONDARY&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# The insert message goes to the primary&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# and awaits majority replication.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;posts_collection&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;insert_one&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;user_id&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Liam&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;contents&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;I&amp;#39;m very witty&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;})&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Read from secondary, due to &amp;#34;read_preference&amp;#34; above.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;user_posts&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;list&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;posts_collection&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;find&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;({&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;user_id&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Liam&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;}))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;MongoClient sends the &lt;code&gt;insert&lt;/code&gt; command to the primary, which &lt;a href=&#34;https://www.mongodb.com/docs/manual/reference/write-concern/#implicit-default-write-concern&#34;&gt;by default&lt;/a&gt; awaits acknowledgment from a majority of servers: in a three-server replica set, a majority is the primary plus one secondary. But then, MongoClient sends the &lt;code&gt;find&lt;/code&gt; command to a random secondary—perhaps the other secondary, which may not have replicated the new post yet.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/problem-sequencediagram.org.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/problem-sequencediagram.org.png&#34; alt=&#34;A sequence diagram showing that the find command can be executed on a secondary that hasn&amp;rsquo;t replicated the insert command yet&#34; title=&#34;The problem&#34;&gt;
&lt;/picture&gt;&lt;p&gt;Reading from secondaries is unpredictable. You can&amp;rsquo;t reliably &lt;a href=&#34;https://jepsen.io/consistency/models/read-your-writes&#34;&gt;read your writes&lt;/a&gt;. You can&amp;rsquo;t do &lt;a href=&#34;https://jepsen.io/consistency/models/monotonic-reads&#34;&gt;monotonic reads&lt;/a&gt; either: as you execute a series of reads on secondaries, you&amp;rsquo;ll use secondaries with different amounts of replication lag, so your data will seem to randomly jump back and forth in time. Secondary reads give you only the weakest guarantee, &lt;a href=&#34;https://en.wikipedia.org/wiki/Eventual_consistency&#34;&gt;eventual consistency&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;Causal Consistency&lt;/h1&gt;

&lt;p&gt;We want you to be able to read from secondaries with reasonable consistency, so in MongoDB 3.6 we introduced causal consistency, which guarantees read-your-writes and monotonic reads. As we wrote in &lt;a href=&#34;https://dl.acm.org/doi/10.1145/3299869.3314049&#34;&gt;Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB&lt;/a&gt;,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Causal Consistency is defined as a model that preserves a partial order of events in a distributed system. If an event A causes another event B, causal consistency provides an assurance that every other process in the system observes the event A before observing event B. Causal order is transitive: if A causes B and B causes C then A causes C. Non causally ordered events are concurrent.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;You enable causal consistency with a &lt;a href=&#34;https://pymongo.readthedocs.io/en/stable/api/pymongo/client_session.html&#34;&gt;session&lt;/a&gt;:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-py3&#34; data-lang=&#34;py3&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Note read_concern, which isn&amp;#39;t in the code above.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;posts_collection&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;client&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;db&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;get_collection&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;posts&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;read_preference&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ReadPreference&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;SECONDARY&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;read_concern&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ReadConcern&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;majority&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# start_session() has causal_consistency=True by default.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;with&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;client&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;start_session&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;as&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;posts_collection&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;insert_one&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;user_id&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Liam&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;contents&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;I&amp;#39;m very witty&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;},&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;session&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Read your previous write, even from a secondary!&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;user_posts&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;list&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;posts_collection&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;find&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;user_id&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Liam&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;},&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;session&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# This query returns data at least as new as the previous&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# query, even if it chooses a different secondary.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;n&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;posts_collection&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;count_documents&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;({},&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;session&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;You have to pass the session parameter with every command. This is too easy to forget—I forgot it when I first wrote this example!&lt;/p&gt;
&lt;p&gt;In a causally consistent session, you&amp;rsquo;ll read your writes and get monotonic reads from secondaries. Both read concern and write concern must be set to &amp;ldquo;majority&amp;rdquo;, as &lt;a href=&#34;https://www.mongodb.com/docs/manual/core/causal-consistency-read-write-concerns/&#34;&gt;explained in the MongoDB manual&lt;/a&gt;. &amp;ldquo;Majority&amp;rdquo; is the default write concern, so I configured only the read concern explicitly.&lt;/p&gt;
&lt;p&gt;How does MongoDB ensure causal consistency? It uses a logical clock (aka a &lt;a href=&#34;https://lamport.azurewebsites.net/pubs/time-clocks.pdf&#34;&gt;Lamport clock&lt;/a&gt;) called &lt;em&gt;clusterTime&lt;/em&gt; to &lt;a href=&#34;https://dl.acm.org/doi/pdf/10.1145/3299869.3314049&#34;&gt;partially order events across all servers in a replica set or sharded cluster&lt;/a&gt;. Whenever the client sends a write operation to a server, the server advances its logical clock and returns the new clock value to the MongoClient. Then, if the MongoClient&amp;rsquo;s next message is a query, it passes the &lt;em&gt;afterClusterTime&lt;/em&gt; parameter, which asks the server to return data including all writes up to that clusterTime. If the server is a lagged secondary, it waits until it has sufficiently caught up:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/causal-sequencediagram.org.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/causal-sequencediagram.org.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;If you query a secondary that hasn&amp;rsquo;t yet caught up to that point in time, according to the logical clock, then your query blocks until the secondary replicates to that point. (Yes, the parameter is called afterClusterTime, but the secondary only needs to replicate &lt;strong&gt;up to&lt;/strong&gt; that clusterTime, not after it.)&lt;/p&gt;
&lt;h1&gt;The Fine Print&lt;/h1&gt;

&lt;p&gt;As I said, causal consistency requires write concern &amp;ldquo;majority&amp;rdquo; (the default) and read concern &amp;ldquo;majority&amp;rdquo; (not the default). Other configurations give weaker guarantees; &lt;a href=&#34;https://www.mongodb.com/docs/manual/core/causal-consistency-read-write-concerns/&#34;&gt;see the thorough documentation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Your client-side session object tracks, in memory, an ever-increasing clusterTime value, which it exchanges with every MongoDB server it talks to. If you lose the clusterTime, you lose causal consistency. Thus you must use the same MongoClient and the same session object throughout. Except, it&amp;rsquo;s possible to&amp;hellip;&lt;/p&gt;
&lt;h1&gt;Transfer the clusterTime&lt;/h1&gt;

&lt;p&gt;Here&amp;rsquo;s the trick we should&amp;rsquo;ve publicized: you can &lt;strong&gt;transfer&lt;/strong&gt; the clusterTime from one session to another. The sessions can&amp;rsquo;t be concurrent—you must wait for one session to end before you use the next one—but even if the sessions belong to different MongoClients, or different processes, or they run on different machines, they&amp;rsquo;ll form a causally consistent chain.&lt;/p&gt;
&lt;p&gt;Continuing the example above, inside the &amp;ldquo;with start_session&amp;rdquo; block, capture the session&amp;rsquo;s clusterTime and operationTime:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;&#34;&gt;&lt;code class=&#34;language-py3&#34; data-lang=&#34;py3&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;with&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;client&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;start_session&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;as&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;posts_collection&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;insert_one&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;{&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;user_id&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Liam&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;contents&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;I&amp;#39;m very witty&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;},&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;session&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;cluster_time&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;cluster_time&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;operation_time&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;operation_time&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;Now you have logical clock values from the primary &lt;strong&gt;after&lt;/strong&gt; it inserted the post. You can create a new session, optionally on a different client or even a different machine, and fast-forward its logical clock:&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; style=&#34;background-color:#f8f8f8;-moz-tab-size:4;-o-tab-size:4;tab-size:4;-webkit-text-size-adjust:none;display:grid;&#34;&gt;&lt;code class=&#34;language-py3&#34; data-lang=&#34;py3&#34;&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;client2&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;MongoClient&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;mongodb://srv1,srv2,srv3/?replicaSet=rs&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#8f5902;font-style:italic&#34;&gt;# Same get_collection args as before.&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#000&#34;&gt;posts_collection2&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;client2&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;db&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;get_collection&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;posts&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;read_preference&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ReadPreference&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;SECONDARY&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;,&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;read_concern&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;ReadConcern&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;majority&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;&lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;with&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;client2&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;start_session&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;()&lt;/span&gt; &lt;span style=&#34;color:#204a87;font-weight:bold&#34;&gt;as&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;s2&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;s2&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;advance_cluster_time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;cluster_time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex; background-color:#dfdfdf&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;s2&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;advance_operation_time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;operation_time&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;)&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;    &lt;span style=&#34;color:#000&#34;&gt;user_posts2&lt;/span&gt; &lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt; &lt;span style=&#34;color:#204a87&#34;&gt;list&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;(&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;span style=&#34;display:flex;&#34;&gt;&lt;span&gt;        &lt;span style=&#34;color:#000&#34;&gt;posts_collection2&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;.&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;find&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;({&lt;/span&gt;&lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;user_id&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;:&lt;/span&gt; &lt;span style=&#34;color:#4e9a06&#34;&gt;&amp;#34;Liam&amp;#34;&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;},&lt;/span&gt; &lt;span style=&#34;color:#000&#34;&gt;session&lt;/span&gt;&lt;span style=&#34;color:#ce5c00;font-weight:bold&#34;&gt;=&lt;/span&gt;&lt;span style=&#34;color:#000&#34;&gt;s2&lt;/span&gt;&lt;span style=&#34;color:#000;font-weight:bold&#34;&gt;))&lt;/span&gt;
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;p&gt;After you call &lt;code&gt;advance_cluster_time&lt;/code&gt; and &lt;code&gt;advance_operation_time&lt;/code&gt;, subsequent operations on that session (don&amp;rsquo;t forget the session parameter!) are guaranteed to reflect all changes up to that time.&lt;/p&gt;
&lt;p&gt;See &lt;a href=&#34;https://www.mongodb.com/docs/manual/core/read-isolation-consistency-recency/&#34;&gt;the MongoDB Manual&amp;rsquo;s example code for transferring logical clocks with each driver&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;Not So Convenient&lt;/h1&gt;

&lt;p&gt;Why do you need to transfer two clock values between sessions, instead of one? It&amp;rsquo;s bad API design; it reveals implementation details that could have and should have been hidden. I was in a rush during the MongoDB 3.6 cycle and I didn&amp;rsquo;t take the time to understand our logical clocks and propose a convenient API. It&amp;rsquo;s hard to change now—millions of people depend on the current APIs—but &lt;a href=&#34;https://jira.mongodb.org/browse/DRIVERS-2860&#34;&gt;some MongoDB engineers are pushing to fix it&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Anyway, transferring two values isn&amp;rsquo;t so bad; the real inconvenience is piping these values through the layers of your application stack. If you use MongoDB in your web application, you probably have something like this:&lt;/p&gt;
&lt;div style=&#34;text-align: center&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/three-tier.png&#34; style=&#34;max-width: 70%&#34;&gt;
&lt;/div&gt;
&lt;p&gt;Guaranteeing causal consistency with this architecture is toilsome. When Liam posts his witticism, he clicks a button in the frontend (a JavaScript web app in this example), which sends the post to the middle tier, which calls &lt;code&gt;insert&lt;/code&gt; with the MongoDB driver. Then the middle tier must capture the session&amp;rsquo;s clusterTime and operationTime, and return them to the frontend, which saves them in &lt;a href=&#34;https://en.wikipedia.org/wiki/Web_storage&#34;&gt;web storage&lt;/a&gt;. When Liam refreshes the page, the frontend must load the clusterTime and operationTime from web storage and send them with its request to the middle tier, which uses them to call &lt;code&gt;advance_operation_time&lt;/code&gt; and &lt;code&gt;advance_cluster_time&lt;/code&gt; on its session before executing &lt;code&gt;find&lt;/code&gt; on a secondary. This guarantees Liam sees his post, but what a pain in the tuchus!&lt;/p&gt;
&lt;p&gt;In the years since MongoDB 3.6, we could&amp;rsquo;ve documented this process better, and encouraged framework authors to build it into their application frameworks to ease the burden on developers. We didn&amp;rsquo;t do that. As far as I know, causal consistency is rarely used. It&amp;rsquo;s a shame, since it&amp;rsquo;s performant and conceptually simple.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s not too late to make causal consistency popular. If you want to help (especially if you maintain a multi-tier app framework), please write to me! But I now guess that consistent secondary reads &lt;strong&gt;without&lt;/strong&gt; application logic are the real solution. I hope to research it later this year.&lt;/p&gt;
&lt;h1&gt;Further Reading&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.mongodb.com/docs/manual/core/read-isolation-consistency-recency/&#34;&gt;Read Isolation, Consistency, and Recency&lt;/a&gt; in the MongoDB Manual.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://dl.acm.org/doi/10.1145/3299869.3314049&#34;&gt;Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://dl.acm.org/doi/10.14778/3352063.3352125&#34;&gt;Tunable consistency in MongoDB&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://hengxin.github.io/papers/2022-JCST-MongoDB-CCC.pdf&#34;&gt;Checking Causal Consistency of MongoDB&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/mongodb/specifications/blob/master/source/causal-consistency/causal-consistency.md&#34;&gt;The MongoDB driver specification for causal consistency&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://dl.acm.org/doi/10.14778/3352063.3352140&#34;&gt;Adapting TPC-C Benchmark to Measure Performance of Multi-document Transactions in MongoDB&lt;/a&gt;, this paper uses causally-consistent reads in its benchmark.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Monks at Holy Cross, part 2</title>
      <link>https://emptysqua.re/blog/holy-cross-2/</link>
      <pubDate>Mon, 22 Apr 2024 16:26:06 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/holy-cross-2/</guid>
      
      
      
      
      <description>&lt;p&gt;April 15, 2024. Monks at &lt;a href=&#34;https://holycrossmonastery.com/&#34;&gt;Holy Cross&lt;/a&gt;, an Anglican Benedictine monastery in the Hudson Valley. See &lt;a href=&#34;https://emptysqua.re/blog/holy-cross/&#34;&gt;part 1 here&lt;/a&gt;.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-010.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-011.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-012.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-013.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-014.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-015.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-016.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-017.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-018.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross-2/holy-cross-2024-04-15-019.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Jyakuen&#39;s Shuso Hossen</title>
      <link>https://emptysqua.re/blog/jyakuen-shuso-hossen/</link>
      <pubDate>Sun, 24 Mar 2024 22:08:13 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/jyakuen-shuso-hossen/</guid>
      
      
      
      
      <description>&lt;p&gt;March 24, 2024. A shuso hossen ceremony at the &lt;a href=&#34;https://villagezendo.org&#34;&gt;Village Zendo&lt;/a&gt;. Sherri Jyakuen Posey gave her first dharma talk and became a senior Zen student.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-001.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-002.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-003.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-004.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-005.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-008.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/jyakuen-shuso-hossen/jyakuen-shuso-hossen-009.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Monks at Holy Cross</title>
      <link>https://emptysqua.re/blog/holy-cross/</link>
      <pubDate>Sat, 16 Mar 2024 10:19:46 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/holy-cross/</guid>
      
      
      
      
      <description>&lt;p&gt;March 5, 2024. Monks at &lt;a href=&#34;https://holycrossmonastery.com/&#34;&gt;Holy Cross&lt;/a&gt;, an Anglican Benedictine monastery in the Hudson Valley. See &lt;a href=&#34;https://emptysqua.re/blog/holy-cross-2/&#34;&gt;part 2 here&lt;/a&gt;.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-09.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-10.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-11.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-12.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-13.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-14.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/holy-cross/holy-cross-15.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Bodhisattvas Always Smile</title>
      <link>https://emptysqua.re/blog/bodhisattvas-always-smile/</link>
      <pubDate>Wed, 13 Mar 2024 22:42:22 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/bodhisattvas-always-smile/</guid>
      
      
      
      
      <description>&lt;p&gt;I gave a talk at the New Paltz Zen Center against doomerism. We vow to bear witness to all the suffering in the world, but there&amp;rsquo;s no reason to be gloomy about it. Let&amp;rsquo;s be cheerful bodhisattvas and enjoy our lives!&lt;/p&gt;
&lt;iframe width=&#34;560&#34; height=&#34;315&#34; src=&#34;https://www.youtube.com/embed/gE7E_reF8YU?si=QaayvgcrhOblEdzX&#34; title=&#34;YouTube video player&#34; frameborder=&#34;0&#34; allow=&#34;accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share&#34; allowfullscreen style=&#34;margin-bottom: 1em&#34;&gt;&lt;/iframe&gt;
&lt;hr&gt;
&lt;p&gt;Here are three images. When you read the news, or you think about the state of the world, which one are you? What&amp;rsquo;s your state of mind?&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/bodhisattvas-always-smile/scream.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;This is Edvard Munch&amp;rsquo;s &amp;ldquo;The Scream&amp;rdquo;, painted in 1893. There&amp;rsquo;s an emaciated figure, holding its head in its hands and its mouth is open like it&amp;rsquo;s howling, it&amp;rsquo;s in a disorienting, apocalyptic landscape with a red-streaked sky.&lt;/p&gt;
&lt;p&gt;Munch wrote about the inspiration for this painting: &amp;ldquo;One evening I was walking along a path, the city was on one side and the fjord below. I felt tired and ill. I stopped and looked out over the fjord – the sun was setting, and the clouds turning blood red. I sensed a scream passing through nature; it seemed to me that I heard the scream. I painted this picture, painted the clouds as actual blood. The color shrieked. This became &lt;em&gt;The Scream&lt;/em&gt;.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/bodhisattvas-always-smile/dog-meme.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The second image is a comic strip. There&amp;rsquo;s a dog sitting at a table with a cup of coffee, and all around it the house is on fire, and a cloud of smoke hangs under the ceiling. The dog is wearing a little bowler hat, it has a stupid smile on its face, and it&amp;rsquo;s saying, &amp;ldquo;This is fine,&amp;rdquo; as huge flames close in around it.&lt;/p&gt;
&lt;p&gt;This was drawn in 2013 by an artist named K.C. Green for &lt;a href=&#34;https://gunshowcomic.com/648&#34;&gt;his webcomic&lt;/a&gt;, it&amp;rsquo;s part of a 6-panel comic where the dog catches fire as he says, &amp;ldquo;I&amp;rsquo;m okay with the events that are unfolding currently&amp;rdquo;, &amp;ldquo;that&amp;rsquo;s okay, things are going to be okay&amp;rdquo;. Then his face melts off.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/bodhisattvas-always-smile/dog-full.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The first two panels became an Internet meme that people use to express, &amp;ldquo;You&amp;rsquo;re in denial. There&amp;rsquo;s a crisis of some sort and you should be panicking.&amp;rdquo;&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/bodhisattvas-always-smile/avalokiteshvara-cave.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;This is an image of Avalokiteshvara, &lt;a href=&#34;https://silkroadgallery.co.uk/products/avalokiteshvara-cave-painting-mustang&#34;&gt;painted on a cave wall in Nepal in the 12th Century&lt;/a&gt;. Avalokiteshvara&amp;rsquo;s name means literally &amp;ldquo;down-looking lord&amp;rdquo;. But, I learned that the name was originally Avalokitasvara—slightly different—and it meant &amp;ldquo;perceiving sound&amp;rdquo;, which is how we think of the Bodhisattva, as a being that hears the cries of the suffering world. This being was first mentioned in the Avatamsaka Sutra, written about 2000 years ago, where he appeared as male. In the Lotus Sutra, slightly later, Avalokiteshvara has many forms, both male and female.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/bodhisattvas-always-smile/guanyin.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;In China he or she is called Guanyin and can be male or female. This statue of Guanyin is at the Met, it was made in the 10th or 11th Century in China.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/bodhisattvas-always-smile/kanzeon.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Here&amp;rsquo;s a contemporary statue. Here she&amp;rsquo;s pouring the water of compassion from a container.&lt;/p&gt;
&lt;p&gt;In Japan, Guanyin is called Kanzeon or Kwanon or Kannon. The Canon camera company is named after her, by the way. Here&amp;rsquo;s an early Canon logo.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/bodhisattvas-always-smile/canon-logo.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;What I want you to notice is, even though Kanzeon hears the cries of all the suffering beings of the world, she&amp;rsquo;s smiling. Or at least, she looks peaceful. At the &lt;a href=&#34;https://villagezendo.org&#34;&gt;Village Zendo&lt;/a&gt; we are all on the Bodhisattva Path, we chant the Bodhisattva Vows, bodhisattvas like Kanzeon are our role models. So let&amp;rsquo;s ask ourselves: when we hear the cries of the suffering beings, are we smiling, or not?&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://villagezendo.org/2024/02/22/talk-by-zenshin-caplow-the-path-of-courage/&#34;&gt;Zenshin Caplow gave a terrific talk at the zendo a couple weeks ago&lt;/a&gt;, she focused on the &lt;em&gt;courage&lt;/em&gt; of a bodhisattva. Please listen to that talk, it&amp;rsquo;s super insightful. Today, I&amp;rsquo;m not talking about courage, but about the &lt;em&gt;cheerfulness&lt;/em&gt; of a bodhisattva!&lt;/p&gt;
&lt;p&gt;Consider those three images: The Scream, or the oblivious dog in the burning house, or smiling Kanzeon. Which way are you?&lt;/p&gt;
&lt;p&gt;I hear depression or exhaustion from a lot of people in the Village Zendo, and a lot of friends, and a lot of people who care about justice and democracy and the climate. What I don&amp;rsquo;t hear is people making it their goal to cheer up and enjoy their lives, despite the daily crises.&lt;/p&gt;
&lt;p&gt;So I&amp;rsquo;m going to try to convince you to make the last image, the smiling Kanzeon, your role model. I want you to make it your goal to be cheerful and enjoy your life no matter what world events you hear about. I have three reasons why you should do this:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The world isn&amp;rsquo;t going to hell.&lt;/li&gt;
&lt;li&gt;You&amp;rsquo;re a more effective bodhisattva if you&amp;rsquo;re cheerful.&lt;/li&gt;
&lt;li&gt;Emptiness.&lt;/li&gt;
&lt;/ol&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Number one&lt;/strong&gt;, the world isn&amp;rsquo;t going to hell. Do you think it is? Do you think we&amp;rsquo;re all going to die of global warming, or people are starving, or that the world is getting more violent? You can slice statistics in infinite ways, but overall, life has improved for nearly everyone, and it&amp;rsquo;ll probably continue to improve for a long time.&lt;/p&gt;
&lt;p&gt;Global poverty is declining rapidly. There&amp;rsquo;s a statistic called the International Poverty Line, defined as living on less than $2.15 a day. The number of people below that line has decreased from nearly 40% of people in 1990 to less than 10% today, in other words the share of the population in extreme poverty is a quarter of what it was in 1990. In the same period infant mortality fell by half. Deaths from famine have become extremely rare compared to past decades.&lt;/p&gt;
&lt;p&gt;As I say this, I&amp;rsquo;m aware that people are dying of hunger in Gaza right now. I&amp;rsquo;m not minimizing that. The opposite: I&amp;rsquo;m saying we must act to end hunger everywhere, and we should be encouraged, knowing that it&amp;rsquo;s possible, because we&amp;rsquo;ve nearly accomplished it.&lt;/p&gt;
&lt;p&gt;Of course, the big negative trend is global warming. But it doesn&amp;rsquo;t mean the planet is going to burn up or we&amp;rsquo;re all going to die. The Intergovernmental Panel on Climate Change says the most likely scenario is 2 or 3 degrees Celsius of warming by the year 2100. That means we&amp;rsquo;re going to be hotter than the goal that they set, but not by too much, and it&amp;rsquo;s far better than the worst-case scenario. A lot of climate activists will cite scenarios where Manhattan is underwater and all of Africa is a desert, but that&amp;rsquo;s based on the IPCC&amp;rsquo;s worst-case scenario, and it&amp;rsquo;s now clear that won&amp;rsquo;t happen, partly because people are taking action! The US&amp;rsquo;s total carbon emissions, for example, have been falling since 2007. Plus, as countries get richer, they tend to use more efficient energy sources. And as countries get richer their population growth slows. The global population will probably peak in the year 2100 or sooner, and start shrinking after that, which should make it easier to reduce carbon emissions. Taking everything together, the IPCC projects that over the next century, despite global warming, global poverty will keep falling, hunger will be rarer, average lifespan will increase, standards of living will improve. Global warming is bad, but humanity&amp;rsquo;s progress continues anyway.&lt;/p&gt;
&lt;p&gt;Of course there are crises. The last few years have been fairly violent, with wars between Russia and Ukraine, and between Israel and Hamas. And there have been a few other wars. Just as an example of how distorted our view can be, the Tigray War in Ethiopia a few years ago was much bigger than the Israel-Hamas war but it was hardly covered by US media.&lt;/p&gt;
&lt;p&gt;Kanzeon doesn&amp;rsquo;t lose her cool when a war breaks out. She knows there have been far more violent years in the 70s and 80s, and of course those years are dwarfed by WWI and WWII. Overall, wars between countries have become very rare compared to the 20th Century. And anyway, Kanzeon understands human nature, we&amp;rsquo;re prone to violence because we have greed, anger, and ignorance. So she expects there to be some wars from time to time. It doesn&amp;rsquo;t change her mission. Kanzeon just keeps working to wake us up.&lt;/p&gt;
&lt;p&gt;If you think the world is going to hell, that&amp;rsquo;s your view, it&amp;rsquo;s arising in your mind. The world seems to be going to hell because of what you choose to focus on, and that is probably influenced by how you consume the news. And the news is influenced by media companies&amp;rsquo; desperate need for your attention, it&amp;rsquo;s what they need to survive. Of course they focus on the negative! Global poverty keeps falling year after year, but media companies know that stories about continuous improvement don&amp;rsquo;t generate clicks. Be aware of this distortion and try to correct it, seek out good news, and seek out information about big, long-term trends, not the brief crises of the day.&lt;/p&gt;
&lt;p&gt;Kanzeon sees the whole world over the long term, that&amp;rsquo;s why she keeps smiling through each crisis.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;&lt;strong&gt;Number two&lt;/strong&gt;, you&amp;rsquo;re a more effective bodhisattva if you&amp;rsquo;re cheerful.&lt;/p&gt;
&lt;p&gt;If we imagine that today&amp;rsquo;s wars are unusually bad, it can seem like we&amp;rsquo;re losing the struggle for peace, and that&amp;rsquo;s discouraging. Or if we think we&amp;rsquo;re all doomed on a burning planet, that&amp;rsquo;s super discouraging. I hear people say there&amp;rsquo;s no point bringing children into a doomed world. Have children or don&amp;rsquo;t, I have no opinion, but make an informed decision: their lives will probably be better than ours. If we&amp;rsquo;re aware of the progress we&amp;rsquo;re making, towards peace, towards a carbon-neutral economy, towards global prosperity, that encourages us to keep trying!&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s a story about Kanzeon, which of course has many versions, so here&amp;rsquo;s mine: Once upon a time, early in Kanzeon&amp;rsquo;s career, he had just one head and two arms. But he worked furiously hard and liberated countless beings from suffering. He paused, he looked around, and the number of suffering beings still seemed overwhelming. He was so frustrated his head exploded. Buddha put his head back together in a different form: it became 11 faces. A peaceful face to calmly gaze in each of the 10 directions, so Kanzeon wouldn&amp;rsquo;t lose track again. And on top of these 10 faces, Buddha placed an 11th face, a wrathful face for the suffering beings who need a kick in the ass to liberate them.&lt;/p&gt;
&lt;p&gt;So Kanzeon went back into action. But trying to save all beings at once, he moved his arms so fast they shattered. So Buddha did some magical upgrades, and gave Kanzeon 1000 arms, and in each of his 1000 palms Buddha placed an eye, so Kanzeon could give individual attention to every being simultaneously.&lt;/p&gt;
&lt;p&gt;But you and I are not 11-faced and 1000-armed. Generally we have one face each, and at most two arms. If I tried to respond to every crisis simultaneously, my head would explode and my arms would shatter. I need to remember that this body, this brain, is just one of Kanzeon&amp;rsquo;s hands and eyes. Or maybe I&amp;rsquo;m one of her fingers, maybe I&amp;rsquo;m one of her cells. I can&amp;rsquo;t do it all. So I focus where I think I&amp;rsquo;m effective: I lead the sangha at Sing Sing, &lt;a href=&#34;https://emptysqua.re/blog/tithing-ten-percent/&#34;&gt;I donate 10% of my income to GiveWell&lt;/a&gt;, which directs it to the most effective charities, like vitamins and vaccines to people in Africa. I support the Village Zendo and &lt;a href=&#34;https://npzc.org&#34;&gt;New Paltz Zen Center&lt;/a&gt;. I focus on a few things, and I&amp;rsquo;m satisfied. I am doing enough!&lt;/p&gt;
&lt;p&gt;Kanzeon responds to all the cries of the world, no one and nothing is left out, and my actions are a part of that.&lt;/p&gt;
&lt;p&gt;Since I&amp;rsquo;m doing enough, I don&amp;rsquo;t read too much news. I don&amp;rsquo;t need to know the ins and outs of US politics. I don&amp;rsquo;t need to follow the daily events of every war. It&amp;rsquo;s okay. I need to be informed, but I don&amp;rsquo;t need daily information to do my bodhisattva work. If I got daily information it would be bad for me.&lt;/p&gt;
&lt;p&gt;If you take one piece of advice from this whole talk, it&amp;rsquo;s this: turn off news alerts on your phone. If Shakyamuni Buddha were here today, he&amp;rsquo;d make all his monks and nuns turn off their news alerts before he shaved their heads. It&amp;rsquo;s essential.&lt;/p&gt;
&lt;video controls style=&#34;width: 100%; margin-bottom: 1em&#34;&gt;
  &lt;source src=&#34;https://emptysqua.re/blog/bodhisattvas-always-smile/RPReplay_Final1709084436.mov&#34; type=&#34;video/mp4&#34;&gt;
&lt;/video&gt;
&lt;p&gt;If you have an iPhone, here&amp;rsquo;s how to do it. Open Settings, search for whatever news app is sending you alerts, like maybe the NYTimes app, then tap Notifications and turn them off. If you have an Android or a smartwatch, I&amp;rsquo;m sure it&amp;rsquo;s equally easy. It is so important that you do this.&lt;/p&gt;
&lt;p&gt;You might be wondering, how can a bodhisattva ignore the daily news? What about bearing witness? Bearing witness is one of the &lt;a href=&#34;https://www.upaya.org/zen/liturgy/peacemaking/&#34;&gt;tenets of the Zen Peacemaker Order&lt;/a&gt;, which we belong to. We occasionally recite these tenets and vow to follow them. So am I allowed to focus on a few things, and turn off the news? Yes I am! Bearing witness is &lt;em&gt;intimate&lt;/em&gt;. It&amp;rsquo;s being fully present with someone. You have to enter their world and experience it with them. You can&amp;rsquo;t do that if you&amp;rsquo;re being interrupted by New York Times headlines. The headlines are a media company&amp;rsquo;s interpretation of the world. Bearing witness is a direct experience. The two things could not be more different. If you turn off the news it gives you space to bear witness, and I find that when I do bear witness, it fulfills me and gives me courage to act, cheerfully!&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;So that was number two, a cheerful bodhisattva is an effective bodhisattva. &lt;strong&gt;My third and final argument&lt;/strong&gt; why you should try to be cheerful is: emptiness. Nothing is separate or solid, it only briefly appears that way. Beings appear and disappear constantly, like waves rise up from the ocean and fall back into it. Fairly soon, everyone in this room and this Zoom call will die. Later, the sun&amp;rsquo;s fusion reaction will use up all its hydrogen, and it will start fusing helium instead; it will heat up and destroy the earth. If humanity escapes the earth, we could survive a while longer, but some day the whole universe will run out of energy and become cold and silent and everything will die.&lt;/p&gt;
&lt;p&gt;What I like about Buddhism is that we face the facts. We practice accepting death. Unlike the oblivious dog saying &amp;ldquo;this is fine&amp;rdquo;, we do see the flames and we know what&amp;rsquo;s going to happen. But we accept it, because the beings that appear and disappear are just waves in the ocean. They are empty of meaning as individuals, they are inseparable parts of the magnificent whole. The knowledge of this fact is called &lt;em&gt;prajna paramita&lt;/em&gt;, &amp;ldquo;wisdom perfected&amp;rdquo;. As the Heart Sutra tells us, Avalokiteshvara practices prajna paramita! The paradox of being a bodhisattva is to value each individual and care about their well-being, without ever forgetting that we are each just a momentary expression of the whole shebang, never really separate. When we see this, our own peace of mind is stable, it doesn&amp;rsquo;t depend on anything, certainly not on the daily news cycle.&lt;/p&gt;
&lt;p&gt;As Samantabhadra Bodhisattva says in the Avatamsaka Sutra:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Even though the void of space, together with the states of beings, with the karmas of beings, with the sorrows of beings, though all these will end, yet my approval and joy in the merits of all beings will not end.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Samantabhadra isn&amp;rsquo;t depressed by the thought of the end of the world, actually he sounds ecstatic, because he knows his joy doesn&amp;rsquo;t depend on anything but his vow.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;So I&amp;rsquo;ve argued that the bodhisattva path is cheerful action and enjoying our lives. The bodhisattva vow isn&amp;rsquo;t vowing to be depressed about the state of the world, it&amp;rsquo;s vowing to be like Kanzeon—both in our attitude and in our action. Think again about those three images, the scream, the oblivious dog, the smiling bodhisattva. If you feel more like screaming than smiling, ask yourself: what about my practice is having this result? What might I do differently to have a different result?&lt;/p&gt;
&lt;p&gt;The hardest news for me to handle, lately, is the likelihood that Trump will be reëlected. But I do not intend to be anxious about it. My method is, instead of obsessively reading news coverage, I&amp;rsquo;m just doing what I can. Keishin and I will lead a fundraiser for Biden this fall, we did this in 2020 and helped raise tens of thousands of dollars, and we&amp;rsquo;ll do that again. And I&amp;rsquo;ll take September and October off work and look for ways to volunteer for Democrats. My local congressman is a Democrat. We&amp;rsquo;re in a swing district here, he won by one percent last year, so working for his reëlection is a high-leverage way to make a difference. Electing Democrats seems like the most effective way to combat global warming and protect human rights. We all know Democrats are far from perfect, but it looks like the bodhisattva path to me.&lt;/p&gt;
&lt;p&gt;It&amp;rsquo;s my experience that obsessing about politics is depressing, but participating in politics is really fun and satisfying. So I&amp;rsquo;ll do what I can during this campaign, and then I&amp;rsquo;ll move on and enjoy my life. A bodhisattva is never stuck, a bodhisattva&amp;rsquo;s action is always flowing.&lt;/p&gt;
&lt;p&gt;If Trump wins, I don&amp;rsquo;t intend to be depressed about it. I&amp;rsquo;ll keep up with the news as much as is useful, and that&amp;rsquo;s it. I&amp;rsquo;ll cheerfully work to contain the damage however I can, and I&amp;rsquo;ll move on and enjoy my life. I won&amp;rsquo;t get stuck, and I won&amp;rsquo;t fall apart.&lt;/p&gt;
&lt;hr&gt;
&lt;p&gt;I&amp;rsquo;ll leave you with a verse by Enkyo Roshi, from her book &amp;ldquo;A Little Bit of Zen&amp;rdquo;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We can’t lose our heads.&lt;br&gt;
Oh, Bodhisattva!&lt;br&gt;
Losing your head doesn’t help—&lt;br&gt;
Instead, extend your hand.&lt;br&gt;
Offer your words, your wisdom, your effort.&lt;br&gt;
That’s what a Bodhisattva does, day in and day out.&lt;br&gt;
Serving all of life.&lt;/p&gt;
&lt;/blockquote&gt;
</description>
    </item>
    
    <item>
      <title>Ordination at Christ Episcopal</title>
      <link>https://emptysqua.re/blog/christ-episcopal-ordination/</link>
      <pubDate>Mon, 11 Mar 2024 22:43:28 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/christ-episcopal-ordination/</guid>
      
      
      
      
      <description>&lt;p&gt;March 5, 2024. An ordination ceremony for Emily Carter at Christ Episcopal Church in Poughkeepsie.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/ajdavis_20240309_2015.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/ajdavis_20240309_2030.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/ajdavis_20240309_2036.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/ajdavis_20240309_2039.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/ajdavis_20240309_2050.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/ajdavis_20240309_2144.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/ajdavis_20240309_2153.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/christ-episcopal-ordination/ajdavis_20240309_2167.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Climbing in Puerto Rico</title>
      <link>https://emptysqua.re/blog/2024-puerto-rico-climbing/</link>
      <pubDate>Thu, 15 Feb 2024 16:16:36 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/2024-puerto-rico-climbing/</guid>
      
      
      
      
      <description>&lt;p&gt;February 2024. Photos of my friends (and one friendly stranger) sport-climbing on the limestone crags near San Juan (&lt;a href=&#34;https://www.mountainproject.com/area/106361084/monagas-aka-nuevo-bayamon&#34;&gt;Monagas&lt;/a&gt; and &lt;a href=&#34;https://www.mountainproject.com/area/118295627/flying-coconut-ciales&#34;&gt;Flying Coconut&lt;/a&gt;).&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/2024-puerto-rico-climbing/pr-09.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Review: Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications</title>
      <link>https://emptysqua.re/blog/review-antipode/</link>
      <pubDate>Mon, 29 Jan 2024 21:02:14 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-antipode/</guid>
      
      
      
      
      <description>&lt;p&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-antipode/ptolemy-map.png&#34; style=&#34;display:inline;width:auto;height:auto;vertical-align:middle&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;em&gt;&lt;a href=&#34;https://collections.leventhalmap.org/search/commonwealth:3f462s124&#34;&gt;Ptolemy Map, 1482&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;In 2015 some Facebook researchers &lt;a href=&#34;https://research.facebook.com/publications/challenges-to-adopting-stronger-consistency-at-scale/&#34;&gt;threw down a gauntlet&lt;/a&gt;, challenging anyone who dared to provide stronger consistency in big, heterogeneous systems. In 2023, some researchers (mostly Portuguese) responded with &lt;a href=&#34;https://www.dpss.inesc-id.pt/~rodrigo/antipode-full.pdf&#34;&gt;Antipode: Enforcing Cross-Service Causal Consistency in Distributed Applications&lt;/a&gt;. Antipode defines an interesting new consistency model, &lt;em&gt;cross-service causal consistency&lt;/em&gt;, and an enforcement technique they claim is practical in such systems.&lt;/p&gt;
&lt;h1&gt;Motivating Example&lt;/h1&gt;

&lt;p&gt;Here&amp;rsquo;s an example from the paper. A social network is composed of services:&lt;/p&gt;
&lt;p&gt;
&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-antipode/example.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-antipode/example.excalidraw.png&#34;&gt;
&lt;/picture&gt;
&lt;em&gt;My simplification of the paper&amp;rsquo;s Figure 2&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This system has the following problematic workflow:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Author uploads a post; it&amp;rsquo;s received by the Post-Upload service in Region A.&lt;/li&gt;
&lt;li&gt;Post-Upload sends the post to Post-Storage,&lt;/li&gt;
&lt;li&gt;&amp;hellip; which saves the post to a local replica of its datastore.&lt;/li&gt;
&lt;li&gt;Post-Upload tells the post-id to Notifier,&lt;/li&gt;
&lt;li&gt;&amp;hellip; which saves the post in a local replica of its (separate) datastore.&lt;/li&gt;
&lt;li&gt;Both datastores eventually replicate to Region B in arbitrary order.&lt;/li&gt;
&lt;li&gt;In Region B, as soon as Notifier replicates the notification,&lt;/li&gt;
&lt;li&gt;&amp;hellip; it triggers Follower-Notify,&lt;/li&gt;
&lt;li&gt;&amp;hellip; which retrieves the post from Post-Storage,&lt;/li&gt;
&lt;li&gt;&amp;hellip; which retrieves it from its local replica of its datastore.&lt;/li&gt;
&lt;li&gt;Once Follower-Notify has the post, it relays it to Follower.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The problem, as astute readers can predict, is if Notifier&amp;rsquo;s datastore replicates sooner than Post-Storage&amp;rsquo;s. In that case, Follower-Notify will learn about the post too soon; it&amp;rsquo;ll try to retrieve it from Post-Storage but the post won&amp;rsquo;t be there yet.&lt;/p&gt;
&lt;p&gt;This is a consistency violation of some sort—the paper will define it exactly, in a moment. We could prevent the anomaly by making all datastores replicate &lt;em&gt;synchronously&lt;/em&gt;. In that case, once Post-Storage has acknowledged storing the post in Region A, it has &lt;em&gt;also&lt;/em&gt; replicated it to Region B, so it&amp;rsquo;s certainly there by the time Follower-Notify tries to retrieve it. But this kills all parallelism. Maybe there&amp;rsquo;s a better way?&lt;/p&gt;
&lt;h1&gt;Causal Consistency Isn&amp;rsquo;t Enough&lt;/h1&gt;

&lt;p&gt;What about Lamport-style causal consistency with &lt;a href=&#34;https://lamport.azurewebsites.net/pubs/time-clocks.pdf&#34;&gt;logical clocks&lt;/a&gt;? This wouldn&amp;rsquo;t prevent the anomaly. The paper doesn&amp;rsquo;t explain in detail why, so I&amp;rsquo;ll try.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&#34;https://en.wikipedia.org/wiki/Lamport_timestamp#Algorithm&#34;&gt;Lamport&amp;rsquo;s algorithm&lt;/a&gt;, each process has a clock value (perhaps just an integer), which is incremented and propagated whenever processes exchange messages. Lamport clocks could prevent the anomaly if we had &lt;em&gt;one&lt;/em&gt; replicated datastore:&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-antipode/causal-consistency.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-antipode/causal-consistency.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;ol&gt;
&lt;li&gt;Post-Storage in Region A saves the post, gets a Lamport clock value of 42 from the datastore.&lt;/li&gt;
&lt;li&gt;Post-Storage in Region A &lt;em&gt;directly&lt;/em&gt; notifies Post-Storage in Region B and tells it the clock value.&lt;/li&gt;
&lt;li&gt;Post-Storage in Region B reads from its replica of the datastore. It tells the datastore to wait until it&amp;rsquo;s replicated up to clock value 42 before executing the query:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Many datastores (&lt;a href=&#34;https://www.mongodb.com/docs/manual/core/causal-consistency-read-write-concerns/&#34;&gt;including MongoDB&lt;/a&gt;) support causal consistency this way, and it would prevent the anomaly described above. This doesn&amp;rsquo;t work in the example from the paper&amp;rsquo;s Figure 2, however. The problem is, there are two datastores replicating &lt;em&gt;concurrently&lt;/em&gt; in Figure 2. Causal consistency is only a partial order, not total; it allows the Post-Storage&amp;rsquo;s or the Notifier&amp;rsquo;s datastore to replicate first. With multiple replicated datastores, the anomaly is allowed by causal consistency, so we have to define a stricter consistency level.&lt;/p&gt;
&lt;h1&gt;Cross-service Causal Consistency&lt;/h1&gt;

&lt;p&gt;The Antipode authors define a new consistency level that prohibits the anomaly: &amp;ldquo;cross-service causal consistency&amp;rdquo;. They abbreviate it &amp;ldquo;XCY&amp;rdquo;, which perhaps makes sense in Portuguese. Cross-service causal consistency includes several ideas:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Lineage&lt;/strong&gt;: A DAG of operations, partially ordered with Lamport&amp;rsquo;s &amp;ldquo;happens-before&amp;rdquo; relation. A lineage begins with some &amp;ldquo;start&amp;rdquo; operation, such as a client request or a cron job, and proceeds until each branch completes with a &amp;ldquo;stop&amp;rdquo; operation.&lt;/p&gt;
&lt;p&gt;In Figure 2 there are two lineages: one is spawned when Author uploads a post. I&amp;rsquo;ll call this Lineage A. It has two branches (leading to Post-Storage and Notifier), and it includes concurrent replication of two datastores to Region B. The other lineage, which I&amp;rsquo;ll call Lineage B, is spawned when Follower-Notify in Region B receives the notification. Lineage B then reads from Post-Storage in Region B, and notifies Follower.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-antipode/lineages.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-antipode/lineages.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;The authors use a data set from Alibaba, where lineages are hairy: &amp;ldquo;User requests typically form a tree, where more than 10% of stateless microservices fan out to at least five other services, and where the average call depth is greater than four. Additionally, this tree contains, on average, more than five stateful services,&amp;rdquo; i.e. services with datastores.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Reads-from-lineage&lt;/strong&gt;: An operation &lt;em&gt;b&lt;/em&gt; reads-from-lineage &lt;em&gt;L&lt;/em&gt; if &lt;em&gt;b&lt;/em&gt; reads a value written by an operation in &lt;em&gt;L&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Cross-service causal order&lt;/strong&gt;: This is denoted with the squiggly arrow ⤳. For two operations &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;, if &lt;em&gt;a&lt;/em&gt; happens-before &lt;em&gt;b&lt;/em&gt; or &lt;em&gt;b&lt;/em&gt; reads-from-lineage &lt;em&gt;L&lt;/em&gt;, where &lt;em&gt;L&lt;/em&gt; includes &lt;em&gt;a&lt;/em&gt;, then &lt;em&gt;a&lt;/em&gt; ⤳ &lt;em&gt;b&lt;/em&gt;. Cross-service causal order is a transitive partial order, like happens-before.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;XCY&lt;/strong&gt;: This is the paper&amp;rsquo;s new consistency level. An execution obeys XCY if you can find a serial order of operations obeying cross-service causal order.&lt;/p&gt;
&lt;p&gt;XCY is the consistency level that Figure 2 violates! When Follower-Notify tries to read Author&amp;rsquo;s post in Region B, that should happen &lt;em&gt;after&lt;/em&gt; all the events in Lineage A, including the replication of the post to Region B.&lt;/p&gt;
&lt;h2&gt;My Feelings About Lineages&lt;/h2&gt;

&lt;p&gt;I feel uncomfortable, as if there&amp;rsquo;s a purer mathematical concept obscured by the specifics of microservice architectures. Why are the borders between Lineages A and B drawn where they are? Could we split these operations into more than two lineages, or combine them into one?&lt;/p&gt;
&lt;p&gt;I &lt;em&gt;think&lt;/em&gt; that a lineage is a general concept (&amp;ldquo;a DAG of operations&amp;rdquo;), but Antipode finds it convenient for microservice architectures to split lineages thus: Operations in a lineage are connected by happens-before. When a service reads a value from storage, this operation does &lt;em&gt;not&lt;/em&gt; join the lineage that wrote the value. Instead, it&amp;rsquo;s connected by reads-from-lineage. The goal of &amp;ldquo;cross-service causal consistency&amp;rdquo; is to make a partial order of lineages, such that replicated data stores appear not-replicated. (I was confused about this until I read the &lt;a href=&#34;https://www.dpss.inesc-id.pt/~rodrigo/antipode-full.pdf&#34;&gt;paper&amp;rsquo;s appendix&lt;/a&gt;. You should read the appendix, too.)&lt;/p&gt;
&lt;p&gt;I think there&amp;rsquo;s a more general idea of &amp;ldquo;recursive&amp;rdquo; or &amp;ldquo;nested&amp;rdquo; causal consistency trying to be born. This general idea would include lineages, defined however you want, and lineages could contain nested lineages. Cross-service causal consistency is a specialization of this general idea.&lt;/p&gt;
&lt;h2&gt;Tracking And Enforcing Cross-Service Causal Consistency&lt;/h2&gt;

&lt;p&gt;This paper describes a system for enforcing XCY, called &amp;ldquo;Antipode&amp;rdquo;, which means &amp;ldquo;&lt;a href=&#34;https://en.wikipedia.org/wiki/Antipodes&#34;&gt;opposite side&lt;/a&gt;&amp;rdquo;; maybe this refers to end-to-end consistency guarantees across geographic regions. Or maybe it &lt;a href=&#34;https://en.wikipedia.org/wiki/Abarimon#/media/File:Schedel&#39;sche_Weltchronik-Reverse_feet.jpg&#34;&gt;refers to mythical beings with reversed feet&lt;/a&gt; for some reason.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-antipode/reverse-feet.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Anyway, whenever services exchange messages as part of the regular functioning of the application, Antipode piggybacks lineage information. Since microservice architectures already piggyback info for distributed tracing, Antipode doesn&amp;rsquo;t add much coding-time or runtime burden. Additionally, Antipode places shims in front of all datastores; the shims add lineage information to reads and writes. (Antipode borrows the technique from &lt;a href=&#34;http://www.bailis.org/papers/bolton-sigmod2013.pdf&#34;&gt;Bolt-on Causal Consistency&lt;/a&gt;.) Lineage info accumulates along causal branches within a lineage, and gets dropped whenever a branch ends.&lt;/p&gt;
&lt;p&gt;Developers can customize lineage tracking; they can explicitly add or remove dependencies. If one lineage depends on another in a way that Antipode doesn&amp;rsquo;t detect, a developer can transfer lineage info between them.&lt;/p&gt;
&lt;p&gt;(MongoDB drivers let you transfer causality info between sessions, too, although it&amp;rsquo;s basically undocumented; &lt;a href=&#34;https://emptysqua.re/blog/how-to-use-mongodb-causal-consistency/&#34;&gt;I explain it here&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Antipode could enforce XCY automatically, on each read operation, but instead it provides an explicit &lt;code&gt;barrier&lt;/code&gt; operation that developers must call to wait for dependencies to be satisfied. This seems error-prone, but it sometimes permits developers to reduce latency by carefully choosing where to place their &lt;code&gt;barrier&lt;/code&gt; calls. The authors write,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;One argument that can be made against barrier is that it is as explicit as today&amp;rsquo;s application-level solutions, since both of them require the developer to manually select its locations. What makes Antipode&amp;rsquo;s approach better suited is not only barrier, but its combination with the implicit/explicit dependency tracking, which keeps services loosely coupled and does not require end-to-end knowledge of what to enforce.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;This bit about &amp;ldquo;loose coupling&amp;rdquo; is insightful: you can place your &lt;code&gt;barrier&lt;/code&gt; call somewhere, and if you later add dependencies, &lt;code&gt;barrier&lt;/code&gt; will enforce them without code changes. On the other hand, having one &lt;code&gt;barrier&lt;/code&gt; call for all dependencies requires you to wait for all of them at once, including those you don&amp;rsquo;t need yet.&lt;/p&gt;
&lt;p&gt;How does &lt;code&gt;barrier&lt;/code&gt; know how long to wait?&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Antipode&amp;rsquo;s &lt;code&gt;barrier&lt;/code&gt; API call enforces the visibility of a lineage. It takes a lineage as an argument and will block until all writes contained in the lineage are visible in the underlying datastores. Internally, a &lt;code&gt;barrier&lt;/code&gt; will inspect the write identifiers in the lineage and contact the corresponding datastores. For each datastore, barrier will call the datastore-specific &lt;code&gt;wait&lt;/code&gt; API, which will block until the write identifier is visible in that datastore. Note that &lt;code&gt;wait&lt;/code&gt; is datastore-specific because visibility depends on the design choices and consistency model of the underlying datastore. Once &lt;code&gt;wait&lt;/code&gt; has returned for all identifiers in the lineage, &lt;code&gt;barrier&lt;/code&gt; will return.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;In our example, this means that before Follower-Notify retrieves Author&amp;rsquo;s post from Post-Storage, it calls &lt;code&gt;barrier&lt;/code&gt;, which queries the Post-Storage datastore and waits until it&amp;rsquo;s sufficiently up-to-date.&lt;/p&gt;

&lt;picture&gt;
    &lt;source srcset=&#34;https://emptysqua.re/blog/review-antipode/barrier.excalidraw.svg&#34; type=&#34;image/svg+xml&#34;&gt;
    &lt;img style=&#34;max-width:100%&#34; src=&#34;https://emptysqua.re/blog/review-antipode/barrier.excalidraw.png&#34;&gt;
&lt;/picture&gt;&lt;p&gt;This is an extra round trip (red arrow) even if the datastore is &lt;em&gt;already&lt;/em&gt; up-to-date. I think this could be optimized away with something like &lt;a href=&#34;https://www.mongodb.com/docs/manual/reference/read-concern/#read-operations-and-afterclustertime&#34;&gt;MongoDB&amp;rsquo;s afterClusterTime&lt;/a&gt;, but Antipode&amp;rsquo;s API would have to change. Luckily, you can limit the consistency check to nearby replicas:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;We implemented a practical optimization strategy specifically tailored for geo-replicated datastores. This involves implementing the wait procedure to enforce dependencies only from replicas that are co-located with its caller, thereby avoiding (whenever the underlying datastore allows it) global enforcement.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I don&amp;rsquo;t fully understand &lt;code&gt;barrier&lt;/code&gt; from the paper&amp;rsquo;s description. If it&amp;rsquo;s waiting for all writes from Lineage A to be visible in Region B, how does it know about writes that Lineage A hasn&amp;rsquo;t even &lt;em&gt;started&lt;/em&gt; yet? Must it wait for all branches of Lineage A to finish? If so, how? And what if an operation in Lineage A crashes or hangs?&lt;/p&gt;
&lt;h1&gt;Their Evaluation&lt;/h1&gt;

&lt;p&gt;The authors evaluate Antipode with three benchmarks and a dozen brands of datastore, and ask 1) would there be XCY violations without Antipode, and 2) what is the cost of preventing them? The answer to question 1 is yes. For question 2:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lineage info adds 200 bytes to 14 kb per message in the authors&amp;rsquo; benchmarks (developers might need to explicitly prune lineages in their own systems).&lt;/li&gt;
&lt;li&gt;Waiting for consistency increases latency, by definition.&lt;/li&gt;
&lt;li&gt;Enforcing XCY decreases throughput by 2-15%.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;My Evaluation&lt;/h1&gt;

&lt;p&gt;Cross-service causal consistency is a neat concept. The chief argument for it, buried in the middle of the paper, is &lt;em&gt;decoupling&lt;/em&gt;: it permits microservices to read consistently from multiple replicated data stores, without knowing the details of the microservices that wrote to them. This limits the impact of changes to any part of your system. With Death Star architectures like Alibaba or AWS, decoupling is crucial.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-antipode/aws-deathstar.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;em&gt;&lt;a href=&#34;https://twitter.com/Werner/status/741673514567143424&#34;&gt;The AWS microservices &amp;ldquo;death star&amp;rdquo; in 2008&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Antipode&amp;rsquo;s API is higher-level than manually enforcing cross-service dependencies. I think it could be useful as part of an even higher-level &amp;ldquo;cloud programming language&amp;rdquo; that automatically decomposes, distributes, and parallelizes high-level logic, while detecting consistency requirements and enforcing them. I&amp;rsquo;m aware of cloud programming language projects like &lt;a href=&#34;https://www.unison-lang.org/&#34;&gt;Unison&lt;/a&gt;, &lt;a href=&#34;https://www.winglang.io/&#34;&gt;Wing&lt;/a&gt;, &lt;a href=&#34;https://spritely.institute/news/spritely-goblins-v010-for-guile-and-racket.html&#34;&gt;Goblins&lt;/a&gt;, &lt;a href=&#34;https://muratbuffalo.blogspot.com/2023/05/new-directions-in-cloud-programming.html&#34;&gt;Hydro&lt;/a&gt;, &lt;a href=&#34;https://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-173.pdf&#34;&gt;Dedalus&lt;/a&gt;, &lt;a href=&#34;https://www.cs.cornell.edu/andru/papers/gallifrey/snapl.pdf&#34;&gt;Gallifrey&lt;/a&gt;, and so on. They&amp;rsquo;re at various stages of development and levels of abstraction. If this paper&amp;rsquo;s definition of lineage were generalized to encompass more kinds of causal relations among operations, it could express the constraints of a variety of constructs in high-level cloud programming languages, and something like Antipode could enforce them.&lt;/p&gt;
&lt;h1&gt;Further Reading&lt;/h1&gt;

&lt;p&gt;Other systems that strengthen the consistency of existing systems:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www.usenix.org/conference/osdi20/presentation/shi&#34;&gt;FlightTracker: Consistency across Read-Optimized Online Stores at Facebook&lt;/a&gt;, and &lt;a href=&#34;https://charap.co/reading-group-flighttracker-consistency-across-read-optimized-online-stores-at-facebook/&#34;&gt;Aleksey Charapko&amp;rsquo;s summary&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;http://www.bailis.org/papers/bolton-sigmod2013.pdf&#34;&gt;Bolt-on Causal Consistency&lt;/a&gt;, and &lt;a href=&#34;https://blog.acolyer.org/2015/09/01/bolt-on-causal-consistency/&#34;&gt;Adrian Colyer&amp;rsquo;s summary&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Review: Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency</title>
      <link>https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/</link>
      <pubDate>Thu, 25 Jan 2024 11:35:03 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/vanitas.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;&lt;a href=&#34;https://www.andrew.cmu.edu/course/15-440/assets/READINGS/gray1989.pdf&#34;&gt;Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency&lt;/a&gt;, Cary G. Gray and David R. Cheriton, 1989. Old and good. I read this simple paper because it seems to be ground zero for timed leases in distributed systems, in which I&amp;rsquo;m now intensely interested.&lt;/p&gt;
&lt;h1&gt;The Protocol&lt;/h1&gt;

&lt;p&gt;The authors discuss a distributed file system, e.g. for a network of &lt;a href=&#34;https://en.wikipedia.org/wiki/Diskless_node&#34;&gt;diskless workstations&lt;/a&gt; connected to a shared file server. (The file server itself is a single machine, not replicated. Yes, it&amp;rsquo;s a single point of failure. This is 1989.) The workstations can read or write files by exchanging messages with the server.&lt;/p&gt;
&lt;p&gt;Each workstation is a write-through cache: when it reads a file, it caches it in its local RAM for future reads.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/read.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;When a workstation updates files in its RAM, it synchronously updates the server&amp;rsquo;s copy.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/write.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Of course, all sorts of race conditions can cause inconsistency. Workstation A could read and cache a file, then workstation B updates it, then workstation A re-reads the file from its now-stale cache and sees outdated data.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/stale-read.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The authors want to provide consistency: &amp;ldquo;By consistent, we mean that the behaviour is equivalent to there being only a single (uncached) copy of the data except for the performance benefit of the cache.&amp;rdquo; Their solution is a timed lease.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;A cache using leases requires a valid lease on the datum (in addition to holding the datum) before it returns the datum in response to a read, or modifies the datum in response to a write. When a datum is fetched from the server (the primary storage site of the datum), the server also returns a lease guaranteeing that the data will not be written by any client during the lease term unless the server first obtains the approval of this leaseholder. If the datum is read again within the term of the lease (and the datum is still in the cache), the cache provides immediate access to the datum without communicating with the server. After the lease expires, a read of the datum requires that the cache first extend the lease on the datum, updating the cache if the datum has been modified since the lease expired. When a client writes a datum, the server must defer the request until each leaseholder has granted approval or the term of its lease has expired.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So before Workstation B can write to a file, it must acquire from the server a lease on that file, with a timeout that will expire some time in the future. Workstation B completes its writes while holding the lease, and either extends the lease or allows it to expire. When Workstation A then re-reads the file from its cache, it must get a &lt;em&gt;new&lt;/em&gt; lease on it, thus discovering that the file has changed, and refresh its copy from the server. After this, Workstation A can keep reading the file until its lease expires. Meanwhile, no other workstation can modify it.&lt;/p&gt;
&lt;p&gt;The server itself can read and write files, and it needs a lease to do so, the same as a workstation.&lt;/p&gt;
&lt;p&gt;It seems that the system grants multiple shared leases for reading, or one exclusive lease for writing. The authors don&amp;rsquo;t say so, but they imply it. E.g., the passage above mentions &amp;ldquo;each leaseholder&amp;rdquo;, indicating there could be more than one per file. These days a paper like this would crush all ambiguity with pseudocode, a TLA+ specification, a formal proof, and a link to an open-source implementation on GitHub. I guess in 1989 you could mumble a few pages about leases and the ACM would publish it.&lt;/p&gt;
&lt;p&gt;If a workstation or the server wants access to a file that&amp;rsquo;s already leased to another workstation, the lease-wanter can wait, or ask the leaseholder to relinquish the lease early. A leaseholder can proactively relinquish an unexpired lease that it doesn&amp;rsquo;t need anymore.&lt;/p&gt;
&lt;p&gt;If many nodes want a lease on the same file, the server enqueues them. The paper doesn&amp;rsquo;t specify the scheduling policy, except to mention in a footnote that writers take priority.&lt;/p&gt;
&lt;h1&gt;Fault Tolerance&lt;/h1&gt;

&lt;p&gt;If a workstation gets disconnected from the other nodes, it might have crashed, or it might be partitioned and still reading from its cache any files for which it has valid leases. Thus to guarantee consistency, the server must wait for the disconnected workstation&amp;rsquo;s leases to expire before it can grant more on the same files. Since the workstation is practically stateless, crash recovery is trivial.&lt;/p&gt;
&lt;p&gt;If the server crashes and restarts, it must remember all the leases it granted before. It could durably record each lease it grants, but that might make disk I/O a bottleneck. The authors propose that the server durably records only the maximum expiration time. When it restarts, it waits for that maximum expiration to pass before granting new leases.&lt;/p&gt;
&lt;p&gt;The protocol is resilient to delayed or lost messages, but it does require all nodes&amp;rsquo; clocks to advance at the same speed, modulo a small, known epsilon.&lt;/p&gt;
&lt;h1&gt;Optimal Lease Terms&lt;/h1&gt;

&lt;p&gt;Short leases minimize recovery time after a workstation or server failure. They also minimize &amp;ldquo;false sharing&amp;rdquo;: when a node has to wait for a lease, although the leaseholder is no longer using it. Long leases reduce lease-requesting traffic and latency at the workstations.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/latency.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The large portion of the paper constructs an analytical model of these tradeoffs. I&amp;rsquo;ll ignore this section since it&amp;rsquo;s fairly specific to the authors&amp;rsquo; system. Plus &lt;a href=&#34;https://emptysqua.re/blog/review-queue-theory-book/&#34;&gt;I&amp;rsquo;m generally skeptical of analytic performance modeling, compared to simulation&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;There&amp;rsquo;s an interesting discussion, however, of the optimal lease terms for different sorts of files and workstations. The server should grant longer leases to more distant workstations to compensate for network latency. Operating system files are very frequently read and almost never written, so the server should grant very long read leases on them. In fact, it should just grant bulk read leases on whole &lt;em&gt;directories&lt;/em&gt; of OS files, and proactively issue read lease extensions to all workstations so long as no write to the OS files is pending. On the other hand, the server should grant short leases on a frequently-written file. The authors say &amp;ldquo;a heavily write-shared file might be given a lease term of zero&amp;rdquo;, which I don&amp;rsquo;t understand—how can a workstation use a lease that&amp;rsquo;s already expired by the time the workstation knows it has acquired it? Anyway, the authors conclude, &amp;ldquo;a server can dynamically pick lease terms on a per file and per client cache basis using the analytic model, assuming the necessary performance parameters are monitored by the server.&amp;rdquo;&lt;/p&gt;
&lt;h1&gt;Their Evaluation&lt;/h1&gt;

&lt;p&gt;The authors evaluate performance not with a real-life test, but by applying their analytic model to some real-world data. They use a trace of file accesses from one workstation to one file server, while the workstation recompiled a program. There&amp;rsquo;s no contention in the trace (since there&amp;rsquo;s one workstation), so they simulate various levels of contention in their evaluation. Network latencies were measured in separate tests, then fed to the model. This is so many steps removed from reality, it wouldn&amp;rsquo;t fly in a modern research journal. However, I &lt;em&gt;do&lt;/em&gt; appreciate this abstracted approach; it can be more revealing than testing a fully-implemented system with its adventitious complexity and noise.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-leases-for-distributed-file-cache-consistency/evaluation.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The authors measure server load (i.e., the number of messages it must process) for different numbers of workstations (the &amp;ldquo;sharing level&amp;rdquo;, S) and different lease terms. They set the server load to 1 with 1 workstation and zero-term leases, and measure other configurations relative to that one. The workload is 96% reads, so longer terms reduce server load because workstations can mostly read from cache. I&amp;rsquo;d be curious to see latency at the workstations, too. I&amp;rsquo;d expect longer terms to decrease latency in this read-heavy workload, and increase it in a write-heavy workload.&lt;/p&gt;
&lt;h1&gt;My Evaluation&lt;/h1&gt;

&lt;p&gt;A short and sweet paper, worth reading for historical interest. If I publish a paper about leases I&amp;rsquo;ll need to know what&amp;rsquo;s in this one so I can cite it. I wish they had described their protocol more precisely, perhaps in pseudocode, before they rushed to model it analytically and draw charts. But all is forgiven: this is the paper that coined the term &amp;ldquo;lease&amp;rdquo; and introduced the world to an elegant consistency technique.&lt;/p&gt;
&lt;p&gt;See also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://blog.acolyer.org/2014/10/31/leases-an-efficient-fault-tolerant-mechanism-for-distributed-file-cache-consistency/&#34;&gt;Review by the Morning Paper&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Murat Demirbas, &lt;a href=&#34;https://muratbuffalo.blogspot.com/2019/09/do-leases-buy-us-anything.html&#34;&gt;Do leases buy us anything?&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Image: &lt;a href=&#34;https://www.wga.hu/html_m/c/collier/vanitas6.html&#34;&gt;Edwart Collier, 1662&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Review: Exploiting a Natural Network Effect for Scalable, Finegrained Clock Synchronization</title>
      <link>https://emptysqua.re/blog/review-huygens-paper/</link>
      <pubDate>Mon, 22 Jan 2024 16:30:27 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-huygens-paper/</guid>
      
      
      
      
      <description>&lt;p&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-huygens-paper/Christiaan_Huygens-painting.jpeg&#34; style=&#34;display:inline;width:auto;height:auto;vertical-align:middle&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;em&gt;Christiaan Huygens by Caspar Netscher, 1671.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;This is a review of &lt;a href=&#34;https://www.usenix.org/system/files/conference/nsdi18/nsdi18-geng.pdf&#34;&gt;Exploiting a Natural Network Effect for Scalable, Finegrained Clock Synchronization&lt;/a&gt;, from Stanford and Google researchers. It describes Huygens, a protocol for very accurate clock synchronization, plus (an undersold benefit) very accurate measurements of one-way network latency. I read it because the Huygens protocol is used by &lt;a href=&#34;https://emptysqua.re/blog/review-nezha/&#34;&gt;Nezha, which I reviewed last week&lt;/a&gt;. The Huygens and Nezha papers share two authors.&lt;/p&gt;
&lt;h1&gt;Motivation&lt;/h1&gt;

&lt;p&gt;As Barbara Liskov wrote in 1991, there are &lt;a href=&#34;https://dl.acm.org/doi/pdf/10.1145/112600.112601&#34;&gt;practical uses of clocks in distributed systems&lt;/a&gt;, such as establishing an order of events on different servers without communication. In an especially insightful paragraph, the authors write:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;In order to achieve &lt;a href=&#34;https://cloud.google.com/spanner/docs/true-time-external-consistency&#34;&gt;external consistency&lt;/a&gt;, a write-transaction in &lt;a href=&#34;https://static.googleusercontent.com/media/research.google.com/en//archive/spanner-osdi2012.pdf&#34;&gt;Spanner&lt;/a&gt; has to wait out the clock uncertainty period, &lt;em&gt;T&lt;/em&gt;, before releasing locks on the relevant records and committing. Spanner can afford this wait time because &lt;em&gt;T&lt;/em&gt; is comparable to the delay of the two-phase-commit protocol across globally distributed data centers. However, for databases used by real-time, single data center applications, the millisecond-level clock uncertainty would fundamentally limit the database’s write latency, throughput and performance. Thus, if a low latency database, for example, &lt;a href=&#34;https://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud-tocs.pdf&#34;&gt;RAMCloud&lt;/a&gt;, were to provide external consistency by relying on clock synchronization, it would be critical for &lt;em&gt;T&lt;/em&gt; to be in the order of 10s of nanoseconds so as to not degrade the performance.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;NTP, the usual clock-sync protocol, is only accurate to a few milliseconds. More accurate protocols require specialized hardware. The Huygens protocol (&amp;ldquo;HOY-gons&amp;rdquo;, named for &lt;a href=&#34;https://en.wikipedia.org/wiki/Christiaan_Huygens&#34;&gt;the inventor of the pendulum clock&lt;/a&gt;) gives nanosecond accuracy in ordinary data centers. The authors summarize it thus: &amp;ldquo;First, coded probes identify and reject impure probe data (data captured by probes which suffer queuing delays, random jitter, and NIC timestamp noise). Next, Huygens processes the purified data with Support Vector Machines, a widely-used and powerful classifier, to accurately estimate one-way propagation times and achieve clock synchronization to within 100 nanoseconds. Finally, Huygens exploits a natural network effect (the idea that a group of pair-wise synchronized clocks must be transitively synchronized) to detect and correct synchronization errors even further.&amp;rdquo;&lt;/p&gt;
&lt;h2&gt;The Buddy System&lt;/h2&gt;

&lt;p&gt;Clock-sync protocols like NTP estimate network round-trip time (RTT) thus: server A sends a message called a &amp;ldquo;probe&amp;rdquo; to server B, which responds with an &amp;ldquo;ack&amp;rdquo;. Both messages are timestamped by the sending and receiving network interface cards (NICs). Server A averages the durations between probes and acks to estimate RTT. But some messages experience random queueing delays. Huygens wants to find probe-ack pairs that passed between the servers in minimum time without queueing delays; the authors call these &amp;ldquo;pure&amp;rdquo; probes. They use only pure probes for RTT and clock-skew estimation.&lt;/p&gt;
&lt;p&gt;Huygens distinguishes pure and impure probes with a crafty little algorithm: Server A sends a probe, waits a small time &lt;em&gt;s&lt;/em&gt; according to A&amp;rsquo;s clock, then sends a second probe. (For some reason these probe pairs are called &amp;ldquo;coded&amp;rdquo; probes.) If the time between the probes on the receiving server B is very close to &lt;em&gt;s&lt;/em&gt;, Huygens calls both probes pure. (The duration &lt;em&gt;s&lt;/em&gt; is small enough that the two servers&amp;rsquo; differential clock drift is irrelevant.)&lt;/p&gt;
&lt;h2&gt;The Forbidden Zone&lt;/h2&gt;

&lt;p&gt;Huygens uses pure probes to calculate the largest and smallest possible clock discrepancies between servers. For example, if server A sends a probe at time 0 according to A&amp;rsquo;s clock, and B receives the probe at time 5 according to B&amp;rsquo;s clock, B&amp;rsquo;s clock can&amp;rsquo;t be more than 5 units ahead of A&amp;rsquo;s: otherwise the message would&amp;rsquo;ve arrived before it was sent, in absolute time. Thus a message from A to B reveals the &lt;em&gt;upper&lt;/em&gt; bound of B&amp;rsquo;s clock skew relative to A&amp;rsquo;s.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-huygens-paper/message.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;A message going the opposite direction reveals the &lt;em&gt;lower&lt;/em&gt; bound.&lt;/p&gt;
&lt;p&gt;Any of these numbers could be negative: if the message leaves from A at time 0 and arrives at B at time -5, B&amp;rsquo;s clock can&amp;rsquo;t be more than -5 units ahead of A&amp;rsquo;s (i.e. it must be at least 5 units behind).&lt;/p&gt;
&lt;p&gt;Even &amp;ldquo;pure&amp;rdquo; probes exhibit some random variation in latency. Quicker probes give tighter bounds: if the message leaves at time 0 and arrives at time 4, we&amp;rsquo;d know B&amp;rsquo;s clock can&amp;rsquo;t be more than 4 units ahead of A&amp;rsquo;s. But we don&amp;rsquo;t know how much of that difference is clock skew and how much is network latency—not yet.&lt;/p&gt;
&lt;p&gt;Over the course of seconds, as the servers&amp;rsquo; clocks drift towards or away from each other, the bounds change, producing this difficult and delightful chart:&lt;/p&gt;
&lt;p&gt;
&lt;img src=&#34;https://emptysqua.re/blog/review-huygens-paper/clock-bounds.png&#34; style=&#34;display:inline;width:auto;height:auto;vertical-align:middle&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;em&gt;Figure 4 from the paper.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The quickest messages give the tightest bounds, visible as dense blue and green lines. Slowdowns seem to be randomly distributed, making sparse fields of blue dots above the least upper bound, and green dots below the greatest lower bound.&lt;/p&gt;
&lt;p&gt;The handful of dots &lt;em&gt;between&lt;/em&gt; the bounds are in the &amp;ldquo;forbidden zone&amp;rdquo;. They &lt;em&gt;seem&lt;/em&gt; to be unusually quick messages that provide tighter bounds than the other dots, but it&amp;rsquo;s a lie. In fact, they&amp;rsquo;re an artifact of NIC timestamp noise: when a server sends a message, there&amp;rsquo;s an occasional delay of 10s or 100s of nanoseconds before it records the transmit timestamp, making the transmission time seem shorter than it is. This is small enough that these probes are considered &amp;ldquo;pure&amp;rdquo;, but large enough to hurt Huygens&amp;rsquo;s accuracy. Huygens uses a very well-known statistical method called a &lt;a href=&#34;https://en.wikipedia.org/wiki/Support_vector_machine&#34;&gt;support vector machine&lt;/a&gt; to find the dense lines that border the forbidden zone and filter out the samples inside it.&lt;/p&gt;
&lt;h2&gt;Detecting Asymmetric Delays&lt;/h2&gt;

&lt;p&gt;Other protocols like NTP assume symmetric network delays: they assume the one-way delay (OWD) from server A to server B is practically the same as vice versa. The Huygens authors find that this is mostly true in their data centers, but slight asymmetries violate this assumption enough to hurt clock synchronization. They exploit a natural network effect (hence the paper&amp;rsquo;s title): additional servers help detect asymmetries. For example (from the paper), say that servers A and B think that A&amp;rsquo;s clock is 20 units ahead of B&amp;rsquo;s, but due to asymmetry they&amp;rsquo;re wrong, A&amp;rsquo;s clock is only 10 units ahead of B&amp;rsquo;s:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-huygens-paper/a-to-b.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;If these were the only servers, the error would be undetectable. But say there&amp;rsquo;s a server C that thinks it&amp;rsquo;s 15 units behind B and 5 units ahead of A:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-huygens-paper/three-servers.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;This is impossible, because summing the offsets A→B→C→A we find a &amp;ldquo;loop offset surplus&amp;rdquo; of 10 units (20 + -15 + 5). Huygens evenly distributes the surplus among the pairwise offsets, improving accuracy. In this example, the offset from A to B is now estimated to be 16.7; not the accurate value of 10, but better than the wrong value of 20.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-huygens-paper/three-servers-corrected.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;In a real data center, &amp;ldquo;each server probes 10-20 others, regardless of the total number of servers in the network&amp;rdquo;, and Huygens analyzes the entire graph of pairwise synchronized clocks.&lt;/p&gt;
&lt;h1&gt;Their Evaluation&lt;/h1&gt;

&lt;p&gt;The authors built two testbeds, each with several racks and switches, and dozens of servers. One testbed represents, according to the authors, a &amp;ldquo;state-of-art data center&amp;rdquo;, and the other a &amp;ldquo;low-end commodity data center&amp;rdquo;. The two testbeds have different timestamping features in their NICs, and different network latencies.&lt;/p&gt;
&lt;p&gt;I was wondering how they&amp;rsquo;d evaluate clock synchronization—after all, the problem they&amp;rsquo;re trying to solve is the lack of a perfect clock to compare to. They evaluate pairwise synchronization with a &lt;a href=&#34;https://netfpga.org/NetFPGA-CML.html&#34;&gt;NetFPGA-CML board&lt;/a&gt;, which has four ethernet ports and one clock. They attach two separate VMs to two of the ports, and connect them to the rest of the servers in the low-end testbed. The two VMs are banned from talking directly to each other; they must sync their clocks via intermediate servers. Since the actual clock discrepancy between these VMs is zero, the authors can measure how closely Huygens has synchronized their clocks. They get a mean error of 13.4 ns and a 99th-percentile error of 30.2 ns; this is tiny, and the low variance is especially impressive.&lt;/p&gt;
&lt;p&gt;There are additional evaluations that show that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If each server probes &lt;em&gt;K&lt;/em&gt; other servers, clock error falls as &lt;em&gt;K&lt;/em&gt; grows.&lt;/li&gt;
&lt;li&gt;Network load hurts synchronization, but not too much: a few probes can still get through without delay, and Huygens correctly identifies these as &amp;ldquo;pure&amp;rdquo; and uses them for clock sync.&lt;/li&gt;
&lt;li&gt;Huygens is orders of magnitude better than NTP.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1&gt;My Evaluation&lt;/h1&gt;

&lt;p&gt;The results in this paper seem exceptionally useful for distributed systems. Huygens appears to achieve incredibly tight clock synchronization, plus two features that the paper undersells: Huygens also measures the clock error &lt;em&gt;bounds&lt;/em&gt;, and the one-way delay between servers. These three measurements could enable countless futuristic distributed protocols. Furthermore, the explanations in the paper are thorough and clear. The only part I found intimidating was their description of the loop-analysis algorithm, but I think that&amp;rsquo;s inherently complex.&lt;/p&gt;
&lt;p&gt;Unfortunately, I can&amp;rsquo;t tell from the paper whether Huygens works in public clouds. Do customers like me have the required access to NIC timestamps? Do we need it? If Huygens works in public clouds, I wish this paper had evaluated it there. The Huygens code is closed-source, so third parties can&amp;rsquo;t test it themselves.&lt;/p&gt;
&lt;p&gt;Several of the authors founded &lt;a href=&#34;https://www.clockwork.io/&#34;&gt;Clockwork&lt;/a&gt;, which &amp;ldquo;runs on all major clouds&amp;rdquo;, synchronizes clocks, and measures clock error bounds and one-way delays. Clockwork claims &amp;ldquo;sync accuracy as low as 5 – 10 ns with hardware timestamps, 100s of ns – a few μs with software timestamps.&amp;rdquo; It looks to me like they ported Huygens to public clouds using software timestamps, and its accuracy there is reduced but still great.&lt;/p&gt;
&lt;h1&gt;See Also&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href=&#34;https://www-users.cselabs.umn.edu/Fall-2021/csci8980-ec/papers/Time.pdf&#34;&gt;Quartz: Time-as-a-Service for Coordination in Geo-Distributed Systems&lt;/a&gt;, &lt;a href=&#34;https://bitbucket.org/sandeepdsouza93/quartz/&#34;&gt;it&amp;rsquo;s open source&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;a href=&#34;https://www.usenix.org/conference/osdi20/presentation/li-yuliang&#34;&gt;Sundial paper&lt;/a&gt; and &lt;a href=&#34;https://muratbuffalo.blogspot.com/2021/03/sundial-fault-tolerant-clock.html&#34;&gt;Murat&amp;rsquo;s summary&lt;/a&gt;. It focuses on a fault-tolerant, fast-recovering time sync service.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://ciechanow.ski/mechanical-watch/&#34;&gt;This spectacular explanation of mechanical wristwatches&lt;/a&gt; is irrelevant, but read it; you&amp;rsquo;ll be forever changed.&lt;/li&gt;
&lt;/ul&gt;
</description>
    </item>
    
    <item>
      <title>Review: Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks</title>
      <link>https://emptysqua.re/blog/review-nezha/</link>
      <pubDate>Sun, 14 Jan 2024 08:25:53 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/review-nezha/</guid>
      
      
      
      
      <description>
&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/nezha-conquers-the-dragon-king.jpg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;This is a review of &lt;a href=&#34;https://www.vldb.org/pvldb/vol16/p629-geng.pdf&#34;&gt;Nezha: Deployable and High-Performance Consensus Using Synchronized Clocks&lt;/a&gt;, from NYU and Stanford researchers last year. To understand this paper I had to relearn how quorums work in Paxos and Fast Paxos, so we&amp;rsquo;ll start there.&lt;/p&gt;


&lt;h1&gt;Classic Paxos Quorums&lt;/h1&gt;

&lt;p&gt;In Paxos there are proposers, acceptors, and learners. Each server usually plays all three roles, but the protocol&amp;rsquo;s often described as if the roles are separate (confusingly, &lt;a href=&#34;https://emptysqua.re/blog/python-paxos-jepsen/&#34;&gt;in my opinion&lt;/a&gt;). The protocol is like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A client sends value &lt;em&gt;v&lt;/em&gt; to a proposer.&lt;/li&gt;
&lt;li&gt;The proposer sends &amp;ldquo;prepare&amp;rdquo; to the acceptors, with a unique current round number.&lt;/li&gt;
&lt;li&gt;The proposer hears &amp;ldquo;promise&amp;rdquo; replies from a majority of acceptors.&lt;/li&gt;
&lt;li&gt;The proposer knows that only it can propose a value for this round.&lt;/li&gt;
&lt;li&gt;The proposer sends &amp;ldquo;accept&amp;rdquo; to the acceptors with the value &lt;em&gt;v&lt;/em&gt; for this round.&lt;/li&gt;
&lt;li&gt;A learner hears &amp;ldquo;learn&amp;rdquo; messages from a majority of acceptors; then it knows &lt;em&gt;v&lt;/em&gt; is the value for this round.&lt;/li&gt;
&lt;li&gt;The learner sends an acknowledgment to the client.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Real systems use MultiPaxos, in which the &amp;ldquo;prepare&amp;rdquo;/&amp;ldquo;promise&amp;rdquo; exchange is done once to establish a &amp;ldquo;distinguished proposer&amp;rdquo;, which then drives a series of &amp;ldquo;accept&amp;rdquo;/&amp;ldquo;learn&amp;rdquo; exchanges indefinitely. &lt;a href=&#34;https://emptysqua.re/blog/paxos-vs-raft/&#34;&gt;Optimized MultiPaxos is nearly the same as Raft, it&amp;rsquo;s just explained much worse&lt;/a&gt;.&lt;/p&gt;
&lt;h1&gt;Fast Paxos Quorums&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;Before any proposals, a proposer sends &amp;ldquo;accept any&amp;rdquo; to the acceptors with &lt;em&gt;no&lt;/em&gt; value for this round.&lt;/li&gt;
&lt;li&gt;The proposer hears &amp;ldquo;promise&amp;rdquo; replies from a majority of acceptors.&lt;/li&gt;
&lt;li&gt;Fast track: the client sends value &lt;em&gt;v&lt;/em&gt; &lt;u&gt;directly&lt;/u&gt; to the acceptors!&lt;/li&gt;
&lt;li&gt;A learner hears &amp;ldquo;learn&amp;rdquo; messages from 3/4 of the acceptors, with the same value.&lt;/li&gt;
&lt;li&gt;The learner sends an acknowledgment to the client.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Fast Paxos saves a single one-way message delay compared to regular Paxos. But it risks collisions: several clients can send several values to the acceptors in the same round. Fast Paxos uses a larger &amp;ldquo;fast quorum&amp;rdquo; size to check for collisions. If no value is chosen by a fast quorum, the system goes into a slow error-recovery mode. Fast Paxos is safe—it will never accept multiple values in a round—but its performance is brittle; it&amp;rsquo;s slow under contention.&lt;/p&gt;
&lt;h1&gt;Quorum Sizes&lt;/h1&gt;

&lt;p&gt;A classic Paxos quorum must be a majority, so that any two quorums &lt;em&gt;Q&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; and &lt;em&gt;Q&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt; share at least one node. For example with 9 acceptors, a quorum is at least 5:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/classic-quorum.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Thus if a minority of acceptors fails, at least one survivor remembers accepting &lt;em&gt;v&lt;/em&gt;. (If a majority of acceptors fails, the system won&amp;rsquo;t accept more values.)&lt;/p&gt;
&lt;p&gt;The rule for fast quorums is, any classic quorum &lt;em&gt;Q&lt;/em&gt; must share at least one node with any &lt;em&gt;two&lt;/em&gt; fast quorums &lt;em&gt;R&lt;/em&gt;&lt;sub&gt;0&lt;/sub&gt; and &lt;em&gt;R&lt;/em&gt;&lt;sub&gt;1&lt;/sub&gt;. With 9 acceptors, a classic quorum is at least 5 as before, and a fast quorum is at least 7:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/fast-quorum.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;If a fast quorum accepts &lt;em&gt;v&lt;/em&gt; and then a minority of acceptors fails, at least one survivor remembers &lt;em&gt;v&lt;/em&gt;, and no other value could&amp;rsquo;ve been accepted by a fast or classic quorum. In classic Paxos, &amp;ldquo;no other value&amp;rdquo; is guaranteed by the &amp;ldquo;prepare&amp;rdquo; phase, where one proposer hears a majority promise to stop accepting other proposers&amp;rsquo; values with earlier round numbers. But in Fast Paxos all the failed acceptors could&amp;rsquo;ve accepted some &lt;em&gt;other&lt;/em&gt; value. We need a bigger fast quorum to know that we&amp;rsquo;ve chosen a unique fault-tolerant value.&lt;/p&gt;
&lt;p&gt;There are various quorum sizes that satisfy this property, and there are tradeoffs when you choose a quorum size. See &lt;a href=&#34;https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-2005-112.pdf&#34;&gt;Lamport&amp;rsquo;s Fast Paxos paper&lt;/a&gt; for details.&lt;/p&gt;
&lt;h1&gt;Nezha&lt;/h1&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/nezha-god.jpeg&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Now I can describe the Nezha paper. As author Jinkun Geng mentions in &lt;a href=&#34;https://disseminatepodcast.podcastpage.io/episode/jinkun-geng-nezha-deployable-and-high-performance-consensus-using-synchronized-clocks-42&#34;&gt;a podcast interview&lt;/a&gt;, Nezha is a Chinese deity with three heads and six arms, &amp;ldquo;so he has wonderful fault-tolerance.&amp;rdquo;&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Nezha bridges the gap between protocols such as MultiPaxos and Raft, which can be readily deployed, and protocols such as NOPaxos and Speculative Paxos, that provide better performance, but require access to technologies such as programmable switches and in-network prioritization, which cloud tenants do not have.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;So the authors&amp;rsquo; motivation is to make a high-performance consensus protocol that can be deployed by cloud customers in public clouds, without requiring special access to the hardware. Nezha improves performance using tightly-synchronized clocks, which increasingly &lt;em&gt;are&lt;/em&gt; available to cloud tenants, &lt;a href=&#34;https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/&#34;&gt;especially on AWS&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Nezha is like Fast Paxos plus Deadline-Ordered Multicast, speculative execution, and a stateless proxy.&lt;/p&gt;
&lt;h2&gt;Deadline-Ordered Multicast (DOM)&lt;/h2&gt;

&lt;p&gt;Just like Fast Paxos, Nezha has a fast path and a slow path, and it&amp;rsquo;s crucial to take the fast path as often as possible. The authors say that message reordering is the most common reason for taking the slow path: a sequence of messages from the proxy take different network paths to the same server and arrive out of order. DOM reduces reordering thus:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The sender attaches a deadline to each message: the sender&amp;rsquo;s clock time + one-way-latency estimate.&lt;/li&gt;
&lt;li&gt;The receiver rejects any message received after its deadline.&lt;/li&gt;
&lt;li&gt;The receiver executes each message after its deadline passes (according to the receiver&amp;rsquo;s clock).&lt;/li&gt;
&lt;li&gt;The receiver executes messages in deadline order.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The authors evaluate DOM in Google Cloud:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/reordering-score.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The reordering score is the percent of messages in a sequence that are not in the longest ordered subsequence. E.g., if there are 100 messages and the longest ordered subsequence is 80 messages long, the reordering score is 20. Higher is worse. Without DOM, they found in Google&amp;rsquo;s cloud, under high contention, that the reordering score was over 40%.&lt;/p&gt;
&lt;p&gt;DOM estimates the distribution of one-way latencies, using &lt;a href=&#34;https://www.usenix.org/system/files/conference/nsdi18/nsdi18-geng.pdf&#34;&gt;a technique called Huygens from another paper&lt;/a&gt;. Once it knows the distribution of one-way latencies, it knows the 50th-percentile one-way latency, the 75th, and so on. So the &amp;ldquo;50p&amp;rdquo; bar in the chart above means the authors configured DOM to set a deadline that was the sender&amp;rsquo;s clock plus the 50th-percentile one-way-latency. A longer delay further reduces reordering, but the receiver waits longer on average before executing each message. So the delay is a tunable parameter; there are tradeoffs and some optimum that you have to find.&lt;/p&gt;
&lt;p&gt;I was surprised to read that the Nezha authors choose the 50th percentile. This means half of messages arrive after their deadlines! However, not all late messages force Nezha to take the slow path, only messages that are late &lt;em&gt;and&lt;/em&gt; out of order. If a sequence of messages are all late, they can still be ordered. As the chart indicates, configuring DOM so that half of messages are late reduces the reordering score to barely 20%.&lt;/p&gt;
&lt;h2&gt;Speculative Execution&lt;/h2&gt;

&lt;p&gt;In regular Paxos, servers don&amp;rsquo;t execute a client&amp;rsquo;s command (they don&amp;rsquo;t update their state machines) until they know the command has been logged by a quorum. But the Nezha leader executes and acknowledges a command as soon as its deadline passes. The client accepts the execution result once it hears confirmation from a fast quorum. This reduces latency. If the leader is deposed before committing the command, the client rejects the result and retries the command.&lt;/p&gt;
&lt;h2&gt;Stateless Proxy&lt;/h2&gt;

&lt;p&gt;Nezha includes a proxy that encapsulates some Nezha logic. The proxy runs the Huygens protocol to estimate one-way latency and to tightly synchronize its clock with the other proxies and the servers. The proxy is basically stateless and horizontally scalable; it isn&amp;rsquo;t responsible for any ordering guarantees. Smart proxies permit dumb clients.&lt;/p&gt;
&lt;h2&gt;The Nezha Protocol&lt;/h2&gt;

&lt;h3&gt;Fast Path&lt;/h3&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/nezha-fast.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;First a request comes from a client.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The proxy assigns a deadline from the proxy&amp;rsquo;s reasonably-synchronized clock, plus a delay that&amp;rsquo;s a percentile of the one-way-latency estimate. The proxy sends the message to all the servers.&lt;/li&gt;
&lt;li&gt;The leader and followers get the message, and if its deadline hasn&amp;rsquo;t passed, the message goes in their &amp;ldquo;early&amp;rdquo; buffers to wait. This is the fast path, if the message is late we take the slow path, which I&amp;rsquo;ll describe later.&lt;/li&gt;
&lt;li&gt;Soon after the deadline, each server removes the message from the early buffer and logs it. Servers process messages in deadline order.&lt;/li&gt;
&lt;li&gt;The leader executes the command&amp;hellip;&lt;/li&gt;
&lt;li&gt;&amp;hellip;and returns the result to the proxy. The followers send acknowledgments to the proxy without any result, because they don&amp;rsquo;t have state machines and they don&amp;rsquo;t execute commands, they only have logs. (So how can a follower become a leader? Read the paper.)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Each acknowledgment includes a hash of the whole log, so the proxy knows whether all the servers in the quorum have the same log. We know the messages are ordered correctly by timestamp, but some servers could be missing messages.&lt;/p&gt;
&lt;p&gt;If the proxy hears a fast quorum of replies with the same hash, including from the leader, it accepts the result. It knows it&amp;rsquo;s durable.&lt;/p&gt;
&lt;p&gt;So that&amp;rsquo;s the fast path. It saves some latency, because the proxy sends the message to all servers at once, and all servers respond directly to the proxy, instead of routing messages through the leader like in MultiPaxos or Raft. It might incur some latency though, if you find that you have to configure the DOM delay to a high percentile.&lt;/p&gt;
&lt;p&gt;What about the slow path? Let&amp;rsquo;s look especially at messages that arrive too late.&lt;/p&gt;
&lt;h3&gt;Slow Path&lt;/h3&gt;


&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/nezha-slow.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;A request comes from a client again.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The proxy assigns a deadline and sends it to the servers, the same as before.&lt;/li&gt;
&lt;li&gt;This time the message arrives late and goes to the late buffer.&lt;/li&gt;
&lt;li&gt;The leader eventually modifies the message&amp;rsquo;s deadline, from a past time to a future time, and puts it in the early buffer!&lt;/li&gt;
&lt;li&gt;Once the new deadline passes, the leader logs&amp;hellip;&lt;/li&gt;
&lt;li&gt;&amp;hellip;and executes the message&amp;hellip;&lt;/li&gt;
&lt;li&gt;&amp;hellip; and sends the result to the proxy.&lt;/li&gt;
&lt;li&gt;Concurrently, the leader also sends the message&amp;rsquo;s ID and its new deadline&amp;hellip;&lt;/li&gt;
&lt;li&gt;&amp;hellip;to the followers. Note how this is slower than the fast path: it requires leader-follower communication, unlike the fast path.&lt;/li&gt;
&lt;li&gt;Luckily the followers don&amp;rsquo;t need the whole message, they have it in their late buffers, so they just retrieve it from there and log it in the proper position with its new deadline. If a follower didn&amp;rsquo;t receive the message at all, then it has to fetch it from another server, which is even slower.&lt;/li&gt;
&lt;li&gt;Finally the followers acknowledge the message&amp;hellip;&lt;/li&gt;
&lt;li&gt;&amp;hellip;and the proxy hears from a &lt;em&gt;majority&lt;/em&gt; (a slow quorum) of servers, all with the same hash, and accepts the result.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;The authors claim that the slow path is still faster than some competing protocols like MultiPaxos, because of speculative execution at the leader. They say the slow path is only one message delay slower than the fast path.&lt;/p&gt;
&lt;p&gt;Of course, a message could arrive before its deadline on some servers, and late on other servers. There are worse cases, if messages are dropped between the proxy and the servers, or between the leader and the followers, or if a replica fails and rejoins, or there&amp;rsquo;s a new leader. The paper handles these scenarios and I will not.&lt;/p&gt;
&lt;p&gt;I&amp;rsquo;ll summarize everything so far: Typical consensus protocols route through the leader to guarantee ordering, but this costs some network hops and makes the leader a bottleneck. Nezha uses Deadline-Ordered Multicast and large quorums to guarantee ordering, so it can parallelize more.&lt;/p&gt;
&lt;h3&gt;Commutativity Optimization&lt;/h3&gt;

&lt;p&gt;Messages are &lt;em&gt;commutative&lt;/em&gt; if they contain commands operating on different keys. Nezha relaxes the rule for a message to enter the early buffer.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Naïve rule: the message&amp;rsquo;s deadline must be after the last message released from the early buffer.&lt;/li&gt;
&lt;li&gt;Optimized rule: its deadline must be after the last non-commutative released message.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I see how this preserves per-key linearizability, but I think it violates whole-database linearizability, also known as strict serializability. That&amp;rsquo;s ok, I believe Nezha only promises per-key linearizability, and it&amp;rsquo;s the right choice for many users.&lt;/p&gt;
&lt;h2&gt;Their Evaluation&lt;/h2&gt;

&lt;p&gt;The authors ran experiments in Google Cloud with various configurations. I&amp;rsquo;ll concentrate on the open-loop workload with 3 replicas, 5 proxies, and 10 clients. The Huygens clock sync protocol is installed on the proxies and servers, it gets a p99 clock error of only 50 ns! They use a minimal application that processes messages with no command logic. They have Poisson arrivals, 50% reads/writes, and a somewhat skewed access pattern with some hot keys and some cold ones.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/open-loop-workload.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;As expected, Nezha outperforms everything, because this is a paper about Nezha. But this is an even more dramatic chart than usual in evaluation sections.&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://www.usenix.org/system/files/conference/osdi16/osdi16-li.pdf&#34;&gt;NOPaxos&lt;/a&gt; (&amp;ldquo;Network-Order Paxos&amp;rdquo;) might be at a disadvantage here, because NOPaxos wants low-level access to the network and they don&amp;rsquo;t have it in Google&amp;rsquo;s public cloud. The Nezha authors say &amp;ldquo;we use the implementation from the NOPaxos repository with necessary modification: we change switch multicast into multiple unicasts because switch multicast is unavailable in cloud. We use a software sequencer with multi-threading for NOPaxos because tenant-programmable switches are not yet available in cloud.&amp;rdquo; NOPaxos-Optim is their enhancement of the published NOPaxos code; they relieved a bottleneck with multithreading. The authors also wrote a &lt;a href=&#34;https://arxiv.org/pdf/2206.03285.pdf&#34;&gt;technical report&lt;/a&gt; with more benchmarks; NOPaxos-Optim outperforms Nezha in one test there.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;To measure latency, we use median latency because it is more robust to heavy tails. We attempted to measure tail latency at the 99th and 99.9th percentile. But we find it hard to reliably measure these tails because tail latencies within a cloud zone can exceed a millisecond.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;I want to see Nezha&amp;rsquo;s tail latency, and I don&amp;rsquo;t understand this explanation for omitting it. &lt;a href=&#34;https://charap.co/cloudy-forecast-how-predictable-is-communication-latency-in-the-cloud/&#34;&gt;Cloud network latencies are indeed unpredictable&lt;/a&gt;, which is a big risk for a protocol like Nezha, which is optimistic and explicitly designed for public clouds. The authors&amp;rsquo; justification sounds to me like, &amp;ldquo;This problem is so bad we decided to ignore it.&amp;rdquo;&lt;/p&gt;
&lt;p&gt;The paper includes an &amp;ldquo;ablation study&amp;rdquo;, a neat bit of jargon for studying the effects of removing optimizations individually.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/review-nezha/ablation.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Deadline-Ordered Multicast is obviously important. Without it (red line), Nezha is practically the same protocol as MultiPaxos and performs about the same: it&amp;rsquo;s usually on the slow path due to message reordering. &amp;ldquo;Quorum-Check Offloading&amp;rdquo; to the proxy or client is important, because it removes some work from the leader, which is otherwise a bottleneck (gray). It seems that the leader&amp;rsquo;s CPU is saturated and quorum-checking contributes to that. Commutativity is apparently not important for throughput and it only slightly improves latency (purple vs. green). Perhaps that&amp;rsquo;s because DOM is so good at message ordering that commutativity doesn&amp;rsquo;t help much, or perhaps it&amp;rsquo;s because their workload skewness means many messages are non-commutative?&lt;/p&gt;
&lt;h2&gt;My Evaluation&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The paper is written for users like me and &lt;a href=&#34;https://www.mongodb.com/atlas/database&#34;&gt;MongoDB Atlas&lt;/a&gt;: public cloud users without special hardware or network access. I appreciate this focus.&lt;/li&gt;
&lt;li&gt;I worry about performance variability in an optimistic protocol with fast and slow paths. How much does Nezha suffer when there&amp;rsquo;s contention and/or clock skew?&lt;/li&gt;
&lt;li&gt;The authors write, &amp;ldquo;Nezha does not assume the existence of a worst-case clock error bound&amp;rdquo;, but that&amp;rsquo;s just for safety. Performance does rely on tightly synchronized clocks. Ever since &lt;a href=&#34;https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf&#34;&gt;Metastable Failures in Distributed Systems&lt;/a&gt;, I dislike optimizations that work &lt;em&gt;sometimes&lt;/em&gt;. Therefore I wish they&amp;rsquo;d benchmarked tail latency, not just median latency, and tested more adverse conditions like workload skewness, clock skew, and network latency variability.&lt;/li&gt;
&lt;li&gt;Deadline-Ordered Multicast is slick, and useful. If it&amp;rsquo;s tuned, it should be nearly free. But how does DOM fare when latencies to different nodes differ?&lt;/li&gt;
&lt;li&gt;The commutativity optimization is clever, although its usefulness is overshadowed by DOM here.&lt;/li&gt;
&lt;li&gt;Some of my colleagues were annoyed that the Nezha authors don&amp;rsquo;t credit &lt;a href=&#34;https://arxiv.org/abs/2104.01142&#34;&gt;the Tempo paper&lt;/a&gt; and other precedents.&lt;/li&gt;
&lt;li&gt;Synced clocks in public clouds are real now. We can use them in distributed protocols. This is a superb example.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I learned a lot from this paper, especially since I presented it to the &lt;a href=&#34;https://charap.co/winter-2023-2024-reading-group-papers/&#34;&gt;DistSys Reading Group&lt;/a&gt; and wrote this review. I want to see more of this kind of research: the use of public cloud features for distributed protocols.&lt;/p&gt;
&lt;h1&gt;Further Reading&lt;/h1&gt;

&lt;ul&gt;
&lt;li&gt;The Huygens paper: &lt;a href=&#34;https://www.usenix.org/system/files/conference/nsdi18/nsdi18-geng.pdf&#34;&gt;Exploiting a Natural Network Effect for Scalable, Finegrained Clock Synchronization&lt;/a&gt;. It&amp;rsquo;s cool, it uses machine learning to estimate minimum one-way latency and hence clock skew. Recommended. Unfortunately the implementation is closed-source (&lt;a href=&#34;https://www.clockwork.io/&#34;&gt;its authors have a startup&lt;/a&gt;), but see &lt;a href=&#34;https://www-users.cselabs.umn.edu/Fall-2021/csci8980-ec/papers/Time.pdf&#34;&gt;the Quartz paper&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://arxiv.org/pdf/2206.03285.pdf&#34;&gt;The Nezha technical report&lt;/a&gt; (more benchmarks).&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://disseminatepodcast.podcastpage.io/episode/jinkun-geng-nezha-deployable-and-high-performance-consensus-using-synchronized-clocks-42&#34;&gt;Podcast interview with author Jinkun Geng&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://github.com/Steamgjk/Nezha&#34;&gt;Code&lt;/a&gt;, including TLA+ specification.&lt;/li&gt;
&lt;li&gt;&lt;a href=&#34;https://muratbuffalo.blogspot.com/2023/12/nezha-deployable-and-high-performance.html&#34;&gt;Murat Demirbas&amp;rsquo;s summary&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Nezha images are from &lt;a href=&#34;https://en.wikipedia.org/wiki/Nezha_Conquers_the_Dragon_King&#34;&gt;Nezha Conquers the Dragon King&lt;/a&gt;.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Sojourner</title>
      <link>https://emptysqua.re/blog/sojourner/</link>
      <pubDate>Tue, 09 Jan 2024 15:18:24 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/sojourner/</guid>
      
      
      
      
      <description>&lt;p&gt;January 1, 2024. Our young Chinese dwarf hamster Sojourner Truth Armstrong Davis.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/sojourner/sojourner-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/sojourner/sojourner-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/sojourner/sojourner-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Bonticou Crag and Table Rocks</title>
      <link>https://emptysqua.re/blog/bonticou-and-table-rocks-2023/</link>
      <pubDate>Sun, 07 Jan 2024 12:42:40 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/bonticou-and-table-rocks-2023/</guid>
      
      
      
      
      <description>&lt;p&gt;November 25, 2023. Hiking the Bonticou Crag and Table Rocks loop in the Gunks, upstate New York. Portra 400 with my Mamiya 7.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-08.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/bonticou-and-table-rocks-2023/bonticou-crag-09.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Dickie Barre</title>
      <link>https://emptysqua.re/blog/dickie-barre-december-2023/</link>
      <pubDate>Sun, 07 Jan 2024 10:35:53 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/dickie-barre-december-2023/</guid>
      
      
      
      
      <description>&lt;p&gt;December 1, 2023. Climbing with Jennifer in the Dickie Barre area of Peter&amp;rsquo;s Kill, the Gunks. Portra 400, Mamiya 7.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dickie-barre-december-2023/dickie-barre-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dickie-barre-december-2023/dickie-barre-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/dickie-barre-december-2023/dickie-barre-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Village Zendo&#39;s Winter 2023 Meditation Retreat</title>
      <link>https://emptysqua.re/blog/winter-retreat-2023/</link>
      <pubDate>Sat, 06 Jan 2024 19:15:04 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/winter-retreat-2023/</guid>
      
      
      
      
      <description>&lt;p&gt;December 2023. The &lt;a href=&#34;https://villagezendo.org&#34;&gt;Village Zendo&lt;/a&gt;&amp;rsquo;s annual year-end Zen retreat, at the Garrison Institute (a former Catholic monastery) on the Hudson River. This year I was a monitor of the meditation hall, responsible for everyone staying awake, sitting up straight and silent and still, finding a comfortable posture, and keeping to the schedule. We had a lot of inexperienced meditators who were clearly in pain after the first day. I met with them and discussed options for sitting on a chair, or a seiza bench, or a cushion, and how to position their backs and shoulders. The meditation hall settled down noticeably for the rest of the week.&lt;/p&gt;
&lt;p&gt;I photographed some meditation sessions on Portra 400 with my Mamiya 7. The room was dark and the photos generally underexposed, so the colors are muddy in several of these. Next year I&amp;rsquo;ll bring ISO 800 film and a tripod.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-retreat-2023/winter-sesshin-2023-01.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-retreat-2023/winter-sesshin-2023-02.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-retreat-2023/winter-sesshin-2023-03.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-retreat-2023/winter-sesshin-2023-04.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-retreat-2023/winter-sesshin-2023-05.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-retreat-2023/winter-sesshin-2023-06.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/winter-retreat-2023/winter-sesshin-2023-07.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>Java Modelling Tools</title>
      <link>https://emptysqua.re/blog/java-modelling-tools/</link>
      <pubDate>Thu, 04 Jan 2024 18:55:38 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/java-modelling-tools/</guid>
      
      
      
      
      <description>&lt;p&gt;In 2022, &lt;a href=&#34;https://brooker.co.za/blog/2022/06/02/formal.html&#34;&gt;Marc Brooker argued&lt;/a&gt; that formal methods like TLA+ can check distributed systems&amp;rsquo; correctness but not their performance. Since then, I&amp;rsquo;ve been searching for good performance modeling tools. Queue theory seems like a foundation for performance modeling, so I learned some queue theory, &lt;a href=&#34;https://emptysqua.re/blog/review-queue-theory-book/&#34;&gt;although I read the wrong book&lt;/a&gt;. That book tried to teach me to analyze queue networks by solving intricate equations, but for most queue networks the equations can&amp;rsquo;t be solved, and for the rest I can&amp;rsquo;t remember how to solve them. I concluded that equations aren&amp;rsquo;t practical for me, and &lt;em&gt;simulation&lt;/em&gt; is the right method. I looked for an off-the-shelf queue network simulator, and found the &lt;a href=&#34;https://sourceforge.net/p/jmt/&#34;&gt;Java Modelling Tools&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;JMT is from Politecnico di Milano and Imperial College London. It was begun in 2002 and it&amp;rsquo;s still actively developed; &lt;a href=&#34;https://sourceforge.net/p/jmt/mailman/message/51238349/&#34;&gt;the last release was November 2023&lt;/a&gt;. The two main developers, Giuliano Casale and Giuseppe Serazzi, have written and maintained a thorough &lt;a href=&#34;https://jmt.sourceforge.net/Papers/JMT_users_Manual.pdf&#34;&gt;user manual&lt;/a&gt;, and when I asked a question on the project forum last year they &lt;em&gt;both&lt;/em&gt; responded quickly and in detail.&lt;/p&gt;
&lt;p&gt;JMT is a suite of Java applications:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;JSIM&lt;em&gt;wiz&lt;/em&gt;: Wizard interface for JSIM, a discrete event simulator, for modeling a queue network.&lt;/li&gt;
&lt;li&gt;JSIM&lt;em&gt;graph&lt;/em&gt;: Same, but you draw the queue network.&lt;/li&gt;
&lt;li&gt;JMVA: Exact and approximate analysis of queue networks with restricted features.&lt;/li&gt;
&lt;li&gt;JMCH: Simulates a single node, animates the Markov chain.&lt;/li&gt;
&lt;li&gt;JABA: Find bottlenecks in closed queue networks.&lt;/li&gt;
&lt;li&gt;JWAT: Reads log files, clusters customers into workload classes.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;I played with JMCH briefly and JSIM&lt;em&gt;graph&lt;/em&gt; for a couple days.&lt;/p&gt;
&lt;h1&gt;JMCH&lt;/h1&gt;

&lt;p&gt;JMCH displays the states and transition probabilities for a single-node queue network, represented as a Markov chain. It seems to be for teaching how Markov chains relate to queue theory.&lt;/p&gt;
&lt;video controls style=&#34;width: 100%; margin-bottom: 1em&#34;&gt;
  &lt;source src=&#34;https://emptysqua.re/blog/java-modelling-tools/JMCH.mp4&#34; type=&#34;video/mp4&#34;&gt;
&lt;/video&gt;
&lt;p&gt;This is a simple &lt;a href=&#34;https://en.wikipedia.org/wiki/Kendall%27s_notation&#34;&gt;M/M/1 queue&lt;/a&gt;. The diagram in the middle is a Markov chain, where each node represents a state with a certain queue length: e.g., if the system is in state 3, that means there are 3 customers enqueued. Each state is labeled with its long-term probability. So in the long run, the system spends half its time in state 0 (empty queue). You can watch the blue queue grow and shrink at the bottom, and the system transition from state to state, as random events occur. If I&amp;rsquo;d had this available when I started reading my &lt;a href=&#34;https://emptysqua.re/blog/review-queue-theory-book/&#34;&gt;queue theory book&lt;/a&gt; it might&amp;rsquo;ve helped me understand Markov chain analysis.&lt;/p&gt;
&lt;h1&gt;JSIM&lt;em&gt;graph&lt;/em&gt;&lt;/h1&gt;

&lt;p&gt;This is the tool I want. I can draw a queue network, with nodes and sources and sinks connected by directed edges, and simulate the system, watching the animation and observing average values. Here&amp;rsquo;s an example included with JMT:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/raid.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;A network needs one or more classes of customers (or jobs or tasks or whatever). Classes were a niche topic in the queue theory book because we usually assumed one class, but classes are a big deal in JMT.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/raid-class.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Every class needs a reference station. I have read the manual&amp;rsquo;s explanation of reference stations five times and I don&amp;rsquo;t understand, perhaps I need a PhD.&lt;/p&gt;
&lt;p&gt;JMT offers a huge number of probability distributions for arrivals, service times, etc. You can choose the &amp;ldquo;Replayer&amp;rdquo; distribution which reads values from a file; that could be a trace from your real system or numbers generated by another program.&lt;/p&gt;
&lt;p&gt;The goal of running the simulation is to measure some performance indices. Here the indices are the response time for the whole system, and the utilization of each disk.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/raid-indices.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Let&amp;rsquo;s run the simulation and see those performance indices&amp;rsquo; average values:&lt;/p&gt;
&lt;video controls style=&#34;width: 100%; margin-bottom: 1em&#34;&gt;
  &lt;source src=&#34;https://emptysqua.re/blog/java-modelling-tools/raid-simulation.mp4&#34; type=&#34;video/mp4&#34;&gt;
&lt;/video&gt;
&lt;p&gt;By default, JSIM runs until each performance index has converged to a stable value within some confidence interval (the red lines converge around the blue lines).&lt;/p&gt;
&lt;p&gt;When I showed this to some friends they wondered, &amp;ldquo;What if the Join0 node waited for a &lt;em&gt;majority&lt;/em&gt; of disks instead of &lt;em&gt;all&lt;/em&gt; of them?&amp;rdquo; This was satisfyingly easy to answer. I changed the &amp;ldquo;join strategy&amp;rdquo; from &amp;ldquo;standard&amp;rdquo; to &amp;ldquo;quorum&amp;rdquo;:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/raid-join-quorum.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;As expected, re-running the simulation shows the same disk utilization but much faster response time: 19 seconds average instead of 34.&lt;/p&gt;
&lt;h1&gt;Answering a queue theory question&lt;/h1&gt;

&lt;p&gt;Let&amp;rsquo;s take a queue theory problem from the book &amp;ldquo;Performance Modeling and Design of Computer Systems&amp;rdquo; and contrast the analytic approach to simulation. Here&amp;rsquo;s a figure from the book:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/figure-2.3.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;The book&amp;rsquo;s Exercise 2.1 asks,&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Maximum Outside Arrival Rate&lt;/strong&gt;&lt;br&gt;
For the network-of-queues with probabilistic routing given in Figure 2.3, suppose that each server serves at an average rate of 10 jobs/sec; that is, &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;μ&lt;/span&gt;&lt;sub&gt;&lt;em&gt;i&lt;/em&gt;&lt;/sub&gt; = 10, &lt;span style=&#34;font-family: non-serif&#34;&gt;∀&lt;/span&gt;&lt;em&gt;i&lt;/em&gt;. Suppose that &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;2&lt;/sub&gt; = &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;3&lt;/sub&gt; = 1. Suppose that &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;p&lt;/span&gt;&lt;sub&gt;12&lt;/sub&gt; = &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;p&lt;/span&gt;&lt;sub&gt;2,out&lt;/sub&gt; = 0.8, &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;p&lt;/span&gt;&lt;sub&gt;23&lt;/sub&gt; = &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;p&lt;/span&gt;&lt;sub&gt;13&lt;/sub&gt; = 0.2, &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;p&lt;/span&gt;&lt;sub&gt;1,out&lt;/sub&gt; = 0, and &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;p&lt;/span&gt;&lt;sub&gt;31&lt;/sub&gt; = 1. What is the maximum allowable value of &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt; to keep this system stable?&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;Let&amp;rsquo;s add those parameters to the figure:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;To answer the question, &amp;ldquo;what&amp;rsquo;s the max &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt;&amp;rdquo;, or &amp;ldquo;how fast can jobs arrive at Server 1 without overloading the system?&amp;rdquo;, the book taught me to solve a system of simultaneous equations. I arrived at &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt; ≤ 5.2.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-by-hand.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I also wrote &lt;a href=&#34;https://github.com/ajdavis/ajdavis-queue-theory-exercises/blob/master/perf-book/exercise-2.1/exercise-2.1.py&#34;&gt;a 72-line Python simulation&lt;/a&gt; (using just the standard library) to confirm this number.&lt;/p&gt;
&lt;p&gt;Now that I&amp;rsquo;ve forgotten how to answer this question with paper and pencil, can I use JMT instead? I drew this uglitude in JSIM&lt;em&gt;graph&lt;/em&gt;:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-jmt.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I tried to emulate the book&amp;rsquo;s figure, with a &amp;ldquo;source&amp;rdquo; node for &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt;, &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;2&lt;/sub&gt;, and &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;3&lt;/sub&gt;, but this produced strange behavior. Maybe I need to understand &amp;ldquo;reference stations&amp;rdquo;. Instead I made three classes of customer (each with a different arrival rate), generated them all at the same source, forked them to the three servers, and collected them in a &amp;ldquo;sink&amp;rdquo; node.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-classes.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;I set Class 1 to 10 arrivals per second, which we know is too high; it should be at most 5.2. The others are 1 per second as in Exercise 2.1. Since we have 3 classes, we have to configure how each server handles each class. First I set the service times to 0.1 for each class (since the exercise says they serve 10 jobs per second), then the routing rules.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-service.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-routing.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Configuring a queue network is a quickly-growing chore; I have to point and click and enter a number for all classes multiplied by servers multiplied by routes. Worse, there&amp;rsquo;s no single place to see all the parameters and verify them. I have to double-click each node to open its config dialog, and click among the tabs checking each number.&lt;/p&gt;
&lt;p&gt;To find the maximum &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt; value I configure a &amp;ldquo;What-If&amp;rdquo; analysis, trying a range of values:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-what-if.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;JMT&amp;rsquo;s simulator assumes by default that the system is stable and its performance indices will converge. In this case, if &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt; is too big the system is unstable and JMT runs until it&amp;rsquo;s out of memory. So I configure the simulation to run for a limited time:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-simulation.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;What performance indices will tell me whether the system is stable? If queues grow or a server is fully utilized, those are bad signs, so I&amp;rsquo;ll measure queueing time and utilization.&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-performance-indices.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;Finally, I actually run the analysis:&lt;/p&gt;

&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-queue-time.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;
&lt;img src=&#34;https://emptysqua.re/blog/java-modelling-tools/exercise-2.1-utilization.png&#34; style=&#34;display: block; max-width:100%&#34; alt=&#34;&#34; title=&#34;&#34;&gt;&lt;p&gt;As I had hoped, between the tested &lt;span style=&#34;font-family: non-serif; font-style: italic&#34;&gt;r&lt;/span&gt;&lt;sub&gt;1&lt;/sub&gt; values of 5.1 and 6.1, things go haywire. Queueing time takes off and Server 1&amp;rsquo;s utilization hits 100%. Configuring this was a lot of work, but it&amp;rsquo;s gratifying to see a clear result. JMT lets me easily measure more performance indices, or tweak the model and see the effects, more easily than in Python and &lt;em&gt;much&lt;/em&gt; more easily than paper and pencil.&lt;/p&gt;
&lt;h1&gt;My evaluation&lt;/h1&gt;

&lt;p&gt;Gripes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;JSIM&lt;em&gt;graph&lt;/em&gt; requires a bunch of pointing, clicking, and typing to set up a queue network with a few classes and nodes. The &lt;a href=&#34;https://github.com/imperial-qore/pyJMT#example&#34;&gt;Python interface&lt;/a&gt; looks like a slick way to construct really big networks, but medium networks are a chore.&lt;/li&gt;
&lt;li&gt;Since JSIM&lt;em&gt;graph&lt;/em&gt; expects stable systems by default, doing &amp;ldquo;what-if&amp;rdquo; analysis to determine whether a set of params is stable is tricky.&lt;/li&gt;
&lt;li&gt;There&amp;rsquo;s no single place to see all configuration values, so I have difficulty trusting that I&amp;rsquo;ve set up a network correctly.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But in JMT&amp;rsquo;s favor, if your needs fit its features, it&amp;rsquo;s more convenient than writing a Python simulation, and it overcomes the total impracticality of solving equations. I may play with it some more to learn its features better, and try answering some actual day-job questions about distributed systems.&lt;/p&gt;
</description>
    </item>
    
    <item>
      <title>West Trapps Bouldering</title>
      <link>https://emptysqua.re/blog/trapps-bouldering/</link>
      <pubDate>Thu, 02 Nov 2023 11:50:27 +0000</pubDate>
      <author>jesse@emptysquare.net (A. Jesse Jiryu Davis)</author>
      <guid>https://emptysqua.re/blog/trapps-bouldering/</guid>
      
      
      
      
      <description>&lt;p&gt;October 11, 2023. Boulderers in the West Trapps area of the Gunks. Shot with a Mamiya 7 on Portra 400.&lt;/p&gt;





  
  
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/trapps-bouldering/ajdavis_20231102_000161030006.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/trapps-bouldering/ajdavis_20231102_000161030007.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  
    
      &lt;hr&gt;
    
    &lt;p&gt;&lt;img src=&#34;https://emptysqua.re/blog/trapps-bouldering/ajdavis_20231102_000161030010.jpg&#34; style=&#34;display: block; max-width:100%&#34;&gt;&lt;/p&gt;
  

  
  


&lt;p&gt;&lt;span style=&#34;color: gray&#34;&gt;Images © A. Jesse Jiryu Davis&lt;/span&gt;&lt;/p&gt;
</description>
    </item>
    
  </channel>
</rss>
