Strong Eventual Consistency

Conflict-free Replicated Data Types (CRDTs) solve a deceptively subtle problem in distributed systems: procuring a consistent answer from an eventually consistent distributed software system.

Designing a distributed software system that coordinates the forward progress of a cluster, one responsible for consistent shared state and resilient against unpredictable faulty behavior, is a non-trivial engineering burden. The implementor must then use such a system correctly, either by bearing the brunt of its caveats in application logic or by deploying CRDTs as a mediating data structure: an anodyne for the frustration of manual conflict resolution in a system that is highly available and distributing its workload.

A distributed software system is a collection of computing agents, virtual or physical, coordinating the forward progress of a specific computation: a query for a piece of information stored and partitioned across a quorum of the agents in the cluster, or a calculation taking input from multiple agents. Other creative deployments of distributed software systems range from peer-to-peer electricity sharing to cryptocurrencies like Bitcoin.

Without conflict-free coordination, whether implemented in the application logic or undergirding data structures, the requester could receive a wrong, conflicting answer.

“A conflict is a combination of concurrent updates, which may be individually correct, but that, taken together, would violate some invariant.” - Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski

Firstly, I want to talk about distributed consensus.

Distributed Consensus

The peer-gossip protocol for delivering replicas and merging their state is also often responsible for keeping track of partitions, because the peers must gossip with each other to know who has which partition, which keys on a peer need repairing in the case of anti-entropy, or who should be answering a specific query. A partition is a replica that is deliberately placed on only some, not all, of the participating peers following a quantitative specification, usually referred to as the quorum number: the minimum number of peers that must successfully replicate. Peer gossip can be reduced to a simple peer-membership model if each peer must only hold the state of the whole cluster, known as a plenarium.

A peer-gossip, anti-entropy design performs hash-key comparisons with a selection of other peers at a regular interval, handing off to another node if a key has not met its quorum requirements or if a neighboring peer must repair any of its keys. Deliberate replica partitioning is ideally suited to distributed systems that cannot or should not hold the entire keyspace, as in a distributed database. Replica plenarium is a bit different because partitioning of the keys is not necessary; in fact, it is undesirable. IoT, blockchain technology, and P2P products are good examples of systems that want the entire cluster state.
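The hash-key comparison step can be sketched in miniature. This is a toy Python illustration, not Riak’s actual Merkle-tree-based implementation: each peer hashes its value per key, the digests are compared, and any key whose digest differs (or is missing on one side) is flagged for repair. All names here are my own.

```python
import hashlib

def key_digests(store):
    """Hash each key's value so peers can compare stores cheaply."""
    return {k: hashlib.sha256(repr(v).encode()).hexdigest()
            for k, v in store.items()}

def keys_needing_repair(local, remote):
    """Keys whose values differ between two peers, or exist on only one."""
    ld, rd = key_digests(local), key_digests(remote)
    return {k for k in ld.keys() | rd.keys() if ld.get(k) != rd.get(k)}

a = {"kitchen": "on", "hall": "off"}
b = {"kitchen": "on", "hall": "dim", "porch": "on"}
print(sorted(keys_needing_repair(a, b)))  # ['hall', 'porch']
```

A real system exchanges only the digests (or a Merkle tree of them) over the network, then repairs the flagged keys by replication or merge.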

Distributed consensus is a deep topic with a broad solution space; I’ve only described the general character of some of the solutions that I’ve used or understand (namely Riak and Plumtree).

CAP Theorem[2]

The CAP Theorem, first proposed by Eric Brewer in a July 19th, 2000 keynote at the ACM Symposium on Principles of Distributed Computing (PODC), ushered the era of web-scale distributed computing into the wider developer mass-consciousness, usually manifesting as distributed “NoSQL” databases. I want to qualify what I mean by “wider developer mass-consciousness”: software programmers and computer scientists wrestled with and used distributed systems well before the early 2000s, but the majority of commodity web-application developers simply didn’t need them until the early-to-late aughts.

The advent of NoSQL databases, a quickly growing internet user base, and commodity cloud computing spawned the exigency for applications exhibiting availability in the face of failures, high latency, or a stampeding crowd of users.

CAP stands for Consistency, Availability, and Partition tolerance, and the theorem says you can have only two of the three (though that’s not entirely correct, as we will see later). Consistency in a distributed software system manifests to the end-user as the causally correct answer to a request. If the system can always answer requests, it is available, and if it continues operating when the network splits, it is partition tolerant.

Though my definition of consistency is uncomfortably binary, isn’t it? Most distributed systems over the last ten years focused on availability and partition tolerance because that’s where the most pain occurred: as user bases grew and uptime guarantees sagged, a proclivity for handling the growth gracefully and “dealing with consistency later” vaulted eventually consistent, distributed databases into the spotlight.

Eventual consistency informally promises that any update to a mutable, shared piece of data will converge without explicit synchronization. This model sacrifices safety for availability and its use is often error-prone.

We can do better by using conflict-free data types to mediate the storage of data, relying on commutativity, associativity, and monotonicity to ensure the absence of conflict.
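Those algebraic properties are easy to see in the simplest convergent CRDT, a grow-only counter. In this minimal Python sketch (the names are my own, not any particular library’s API), each replica tracks its own count and merging takes the element-wise maximum; because the merge is commutative, associative, and idempotent, replicas can exchange state in any order, any number of times, and still converge:

```python
def merge(a, b):
    # Element-wise maximum of per-replica counts: a join on the lattice
    # of counter states, hence commutative, associative, and idempotent.
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in a.keys() | b.keys()}

def value(counter):
    # The counter's value is the sum of every replica's contribution.
    return sum(counter.values())

x = {"node1": 3, "node2": 1}
y = {"node2": 4, "node3": 2}
z = {"node1": 1, "node3": 5}

assert merge(x, y) == merge(y, x)                      # commutative
assert merge(merge(x, y), z) == merge(x, merge(y, z))  # associative
assert merge(x, x) == x                                # idempotent
print(value(merge(merge(x, y), z)))  # 12
```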

Strong Eventual Consistency[4]

A system’s ability to continue forward progress is a critical property of high-scale internet systems, but without stronger guarantees of consistency it is also difficult to justify, if the answers it produces are incorrect.

A digression on terminology for the reader: The C in CRDT has three different meanings in writing across the internet, the most general definition is “conflict-free” and the other two, commutative and convergent, have subtle semantic differences (they represent separate kinds of CRDTs). In this article, I use the conflict-free version because that is the most generally applicable and commonly used terminology by researchers studying these data types[6].

The simplest CRDTs are the integer counter, with increment and decrement operations, and the set, with add and remove (via tombstones) operations. The counter keeps two monotonically growing tallies, one of increments and one of decrements, and its value is the increment total minus the decrement total; the add/remove set has similar operational semantics: it keeps a tombstone set of the elements marked for removal, and the final set state is the add set minus the tombstones.
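Both structures can be sketched in a few lines of Python. This is a hedged toy, with my own class and method names, not any particular library’s API:

```python
class PNCounter:
    """Two grow-only tallies per replica; value = increments - decrements."""
    def __init__(self):
        self.p, self.n = {}, {}  # per-replica increment / decrement totals

    def incr(self, replica, by=1):
        self.p[replica] = self.p.get(replica, 0) + by

    def decr(self, replica, by=1):
        self.n[replica] = self.n.get(replica, 0) + by

    def value(self):
        return sum(self.p.values()) - sum(self.n.values())

    def merge(self, other):
        # Element-wise max of each tally: both tallies only ever grow.
        for mine, theirs in ((self.p, other.p), (self.n, other.n)):
            for replica, count in theirs.items():
                mine[replica] = max(mine.get(replica, 0), count)


class TwoPSet:
    """Add set plus tombstone set; a removed element can never return."""
    def __init__(self):
        self.added, self.removed = set(), set()

    def add(self, element):
        self.added.add(element)

    def remove(self, element):
        if element in self.added:       # may only remove observed elements
            self.removed.add(element)

    def value(self):
        return self.added - self.removed

    def merge(self, other):
        self.added |= other.added
        self.removed |= other.removed


# Two replicas update independently, then merge and agree.
c1, c2 = PNCounter(), PNCounter()
c1.incr("replica-a", 5)
c2.decr("replica-b", 2)
c1.merge(c2)
print(c1.value())  # 3

s1, s2 = TwoPSet(), TwoPSet()
s1.add("hall"); s1.add("porch")
s2.add("porch"); s2.remove("porch")
s1.merge(s2)
print(sorted(s1.value()))  # ['hall']
```

Note the two-phase set’s caveat, visible in `remove`: once an element is tombstoned on any replica, no future add can bring it back.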

More complex commutative and convergent CRDTs are an active research topic; my primary focus for now is the simple CRDTs, since they are stable, well understood, and have mature implementations “in the wild”.

Conflict-free replicated data types combined with eventually consistent systems are how we arrive at strong eventual consistency.

Plumtree and Plum

Plum’s Lightpad was a perfect candidate for CRDTs after I chose to port the plumtree project to the ARM platform; we needed a robust tool to replicate a user’s light switch and dimmer topology across all of the Lightpads in the home. I wrote our Lightpad’s firmware in Erlang, and once I had solved the pain of porting and cross-compiling the necessary components for Plumtree, I realized we needed a robust method for conflict resolution. The Last Write Wins strategy, the default for objects stored using Plumtree, is a well-known culprit for data loss; also, there is no recourse for engineers to manually repair a data integrity issue on a customer’s Lightpad cluster (that would be a breach of privacy!).

The need for state replication across Lightpads arose from the requirement that every Lightpad can decouple from the load it is physically controlling and control arbitrary groups of other Lightpads in the same cluster throughout the home. This requires every Lightpad to have the state of the cluster (which Lightpads are in which room, which Lightpad is controlling which group now, Lightpad membership in a scene, and so on).

Before choosing Plumtree and a CRDT for intervention-less, conflict-free operation, my first crack at solving the “decoupled Lightpad requirement” was naive. We ran into all the classic problems every naive attempt at rolling your own distributed consensus algorithm encounters. I stopped that work quickly after realizing it was more difficult than I had prepared for and began hunting for a robust stand-alone OTP application I could integrate into our Erlang release to provide “out of the box” distributed replication. My search led me to Basho’s excellent Riak Ensemble project, a multi-paxos distributed consensus implementation guaranteeing strong consistency (different from Strong Eventual Consistency!). The library was easy to integrate, quick to understand, and I had something working with our Lightpads effortlessly…

…until a lightning strike caused a power outage that brought down my developer Lightpad cluster and nuked the quorum requirements of the cluster; I couldn’t revive it without a full-blown factory reset of every Lightpad, which is a no-go for a customer. As I studied distributed systems research and Riak Ensemble’s documentation for insight, it dawned on me that a strongly consistent system would never fulfill the requirements of our Lightpads, which run on small, hot, resource-constrained devices inside homes possessing electrical wiring of unknown quality and subject to unpredictable hard power faults. A highly available and partition-tolerant system was what we needed, and my research led me further to the use of Conflict-free Replicated Data Types in eventually consistent, highly available, and partition-tolerant systems to achieve the best of “most worlds” (note: Strong Eventual Consistency is not total ordering). An engineer at Basho kindly pointed me in the direction of Helium’s Plumtree project, designed to serve a function similar to that needed by our Lightpads:

  • whole cluster state replication
  • highly available and partition tolerant
  • eventually consistent
  • writes are broadcast to the whole cluster
  • reads are only served from the local replica
  • Riak’s anti-entropy algorithm and datastructure for key repair and epidemic broadcast of cluster state when new nodes join the cluster (hence: epidemic broadcast tree)

Once I had a working port of Plumtree running on our ARM processor, I settled into designing the user’s light topology schema using the ORSWOT[5] CRDT: a set with add-wins, observed-removal, tombstone-less operational semantics, a slightly modified version of the aforementioned simple CRDT type. That decision has served Plum well, because the Lightpads are now used by Plum’s customers to configure and control their lighting in many interesting configurations.
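To illustrate the observed-remove, add-wins semantics, here is a classic OR-set sketch in Python. To be clear, this is not the ORSWOT itself: the ORSWOT replaces per-tag tombstones with version vectors precisely to avoid tombstone growth, while this toy keeps tombstones for clarity, and every name in it is my own.

```python
import uuid

class ORSet:
    """Observed-remove set: each add carries a unique tag, and remove
    tombstones only the tags it has observed, so a concurrent add
    (carrying a tag the remover never saw) survives the merge: add wins."""

    def __init__(self):
        self.adds = {}    # element -> set of unique add tags
        self.tombs = {}   # element -> set of tombstoned (removed) tags

    def add(self, element):
        self.adds.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Tombstone only the tags this replica has actually observed.
        observed = self.adds.get(element, set())
        self.tombs.setdefault(element, set()).update(observed)

    def merge(self, other):
        for e, tags in other.adds.items():
            self.adds.setdefault(e, set()).update(tags)
        for e, tags in other.tombs.items():
            self.tombs.setdefault(e, set()).update(tags)

    def value(self):
        # An element is present if any of its add tags is untombstoned.
        return {e for e, tags in self.adds.items()
                if tags - self.tombs.get(e, set())}

# A concurrent remove and re-add: the add wins.
a = ORSet()
a.add("hall-dimmer")
b = ORSet()
b.merge(a)               # both replicas now observe the same add tag
a.remove("hall-dimmer")  # a tombstones the tag it observed...
b.add("hall-dimmer")     # ...while b concurrently adds a fresh tag
a.merge(b)
print(a.value())  # {'hall-dimmer'}
```

Contrast this with the two-phase set above: here a removed element can come back, because a later add mints a fresh tag that no tombstone covers.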

CRDTs, I think, are the future of high-scale distributed computing systems; particularly for constrained devices not located in a data center (mobile phones, IoT, P2P) and interesting research is just getting started. If you’re interested in learning more from an expert in the field, I highly recommend the writing and research of Christopher Meiklejohn; he is engaged in exciting bleeding-edge work and publishing much of it through his project, the Lasp language (using Plumtree and Erlang as a foundation).

References

  • The Raft Protocol
  • CAP Theorem
  • Towards Robust Distributed Systems by Eric Brewer
  • Brewer’s CAP Theorem
  • The kool aid Amazon and Ebay have been drinking
  • Strong Eventual Consistency
  • Strong Eventual Consistency and Conflict-free Replicated Data Types
  • A CRDT implementation library in Erlang
  • Christopher Meiklejohn
  • Readings in conflict-free replicated data types