Distributed Systems

CAP Theorem

Цели урока

Understand CAP intuition through the two-node partition scenario
Distinguish CP and AP behavior and know examples of each
Apply PACELC to analyze trade-offs in normal operation
Choose a system type based on consistency and availability requirements

Предварительные знания

Understanding of distributed systems and network partitions
Familiarity with CAP from the intro lesson
Experience with any database (SQL or NoSQL)

Greg Linden, Amazon, 2006: an internal A/B test pinned +100ms of latency to -1% in sales. A year later Werner Vogels published the Dynamo paper: eventual consistency in exchange for 99.999% cart availability. CAP is not academic: a 503 mid-checkout costs more than a cart that is 200ms stale.

**etcd in Kubernetes** - CP, because split-brain in a container orchestrator is worse than etcd downtime
**DynamoDB (Amazon)** - AP with eventual consistency for shopping cart: 99.999% availability beats instant consistency
**Google Spanner** - PC/EC via TrueTime (atomic clocks + GPS) for Google financial transactions
**Cassandra at Netflix** - AP/EL for user preferences and viewing history: delayed history sync is acceptable
**PostgreSQL** - CP: primary loss blocks writes, acceptable for financial data

Eric Brewer and the Conjecture That Became a Theorem

In 2000 Brewer was CTO of Inktomi (the first web search engine) and ACM SIGOPS president. At PODC he presented the CAP conjecture. Two years later Gilbert and Lynch at MIT published a formal proof. In 2012 Brewer himself wrote a revision: "CAP Twelve Years Later" - acknowledging the theorem is a simplification and reality is more nuanced, especially with partial failures.

Intuition: Why All Three Is Impossible

**Eric Brewer, PODC 2000 - "Inktomi and the scaling lesson". Conjecture:** a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance. **2002:** Seth Gilbert and Nancy Lynch at MIT proved it formally. Not a best practice, not a recommendation - a theorem with a proof, like geometry. The failure semantics behind partitions are introduced in the distributed-systems intro.

Property	What it means	Violation looks like
Consistency	All nodes see identical data at any moment. A write to one node is immediately visible on any other.	Query balance - get USD 1000. Query again 100ms later - get USD 800 (nobody charged anything)
Availability	Every request to a live node gets a response (success or error). No live node stays silent.	Server is up, ping works, but the request hangs forever - no response, no timeout
Partition tolerance	The system keeps working when the network splits between nodes.	Two data centers lose connectivity - both halves become unavailable even though each is healthy internally

Intuition via a concrete scenario: two nodes A and B, a client writes x=1 to node A, then the network between A and B breaks. What should node B do when a client reads x? **Option 1:** return the stale value x=0 (breaks Consistency). **Option 2:** refuse to answer, wait for network recovery (breaks Availability). There is no third option.

**Key insight:** P (Partition tolerance) is not optional. Networks always partition - buffer overflow, BGP flap, a 10-second GC pause, a physical cable cut. In the real world the choice is between **CP** and **AP**, not between CAP and CA.

CAP theorem requires picking exactly two properties forever at design time

P is unavoidable, the real choice is CP or AP. Different operations in the same system can make different choices.

Cassandra allows setting consistency level per request: QUORUM for critical ops (CP behavior), ONE for non-critical (AP behavior). MongoDB with minority reads gives AP behavior on reads with CP behavior on writes.

Nodes A and B are separated by a partition. A client reads x from node B. Node B responds with x=0 (stale value). Which CAP property is violated?

CP Systems: Consistency at the Cost of Availability

A CP system refuses requests during a partition - returns an error or waits for quorum recovery. This is not a bug, it's an intentional choice: better to say "don't know" than give a wrong answer. The pattern for finance, inventory, cluster configuration - anything where inconsistency is more expensive than downtime. Quorum mechanics live inside consensus (Paxos, Raft).

System	Consistency algorithm	When to use
etcd / Consul	Raft - leader accepts all writes	Service discovery, distributed locks, cluster config
ZooKeeper	Zab (ZooKeeper Atomic Broadcast)	Coordination - leader election, naming, barriers
MongoDB (majority write)	Quorum writes + Raft oplog	Financial data, inventory, user profile
Google Spanner	TrueTime + 2PC	Global transactions, banking, Google Ads

etcd in Kubernetes: why CP

etcd stores all Kubernetes cluster state: which pods are running, on which nodes, which ConfigMaps are active. Under a network partition, etcd refuses writes without quorum (>50% of etcd nodes). This protects against split-brain: two half-clusters independently managing pods. Temporary etcd unavailability (seconds to minutes) is far better than two schedulers both thinking they are the authority.

CP behavior costs latency: every write waits for quorum acknowledgment. Raft with 3 nodes in one DC - 2-5ms overhead. Raft across regions - 50-150ms. Google Spanner with TrueTime adds a commit-wait of 7-14ms per transaction.

An etcd cluster has 5 nodes, 3 nodes are down (no quorum). What happens to write operations?

AP Systems: Availability at the Cost of Consistency

An AP system keeps running on all nodes during a partition - accepting writes, returning responses. Data will eventually converge (eventual consistency), but during a partition different nodes see different things. The choice for user data, carts, feeds, counters - anywhere unavailability is worse than staleness. Convergence mechanics for AP stores are explored in CRDTs.

System	Conflicts in AP	Resolution strategy
DynamoDB	Concurrent writes to different replicas	Last-writer-wins (timestamp) or CRDTs
Cassandra	Different values on different nodes	Read repair + LWW or user-defined merge
CouchDB	Divergent document revisions	Explicit conflict detection, user resolves
Riak	Sibling values	Built-in CRDTs (G-Counter, LWW-Register)

DynamoDB and the Amazon shopping cart

Werner Vogels (Amazon CTO) described the choice: the shopping cart is stored in DynamoDB with eventual consistency. If a customer adds an item from their phone while a shopping session is open on a laptop and the network is unstable, conflicting versions can arise. Amazon's strategy: on conflict, show the union of both versions. Better to show an extra item in the cart than lose one the customer added or return a 503 error.

**Eventual consistency does not mean slow**. In normal operation (without partition), Cassandra replicates a write within 1-5ms inside one DC. "Eventual" means a guarantee of final convergence, not a specific timeout.

An AP system returns random data - the result is unpredictable to clients

An AP system returns data that was accurate up to the last partition. After the network heals, nodes automatically synchronize.

Eventual consistency is predictable: with no new writes, all replicas converge to one value. Cassandra guarantees this through anti-entropy (Merkle trees) and read repair. Unpredictability occurs only during concurrent writes in an active partition.

Cassandra is configured with consistency level ONE (read/write to one replica). During a network partition between DC1 and DC2, a client reads from DC2. What does it get?

PACELC: CAP Is Only Half the Story

CAP describes behavior only during a partition. But 99.9% of the time the network is healthy. **PACELC** (Daniel Abadi, 2012) fills that gap: **P**artition -> **A** or **C**; **E**lse (normal operation) -> **L**atency or **C**onsistency. This more accurately captures real-world trade-offs. The full consistency landscape is mapped in the consistency-models lesson.

System	Under partition	Normal mode	PACELC class
DynamoDB (eventual)	AP	EL (low latency)	PA/EL
Cassandra (ONE)	AP	EL	PA/EL
MongoDB (majority)	CP	EC (consistency)	PC/EC
etcd / ZooKeeper	CP	EC	PC/EC
Google Spanner	CP	EC (with TrueTime overhead)	PC/EC
MySQL (semi-sync)	CP	EC	PC/EC

In normal operation, EC systems sacrifice latency to guarantee that read data is always fresh. EL systems return immediately from a local replica - lower latency, but concurrent writes may cause temporary divergence.

**Cassandra is unique:** tunable consistency per request. QUORUM gives PC/EC behavior, ONE gives PA/EL. The same cluster can be CP for critical operations and AP for non-critical ones - a rare combination of flexibility.

CAP and PACELC: key takeaways

CAP: proven (Gilbert & Lynch 2002) - cannot simultaneously have C, A, and P
P is unavoidable - the real choice is CP (precision) vs AP (availability)
CP: write rejected without quorum. Examples: etcd, PostgreSQL, Spanner
AP: write accepted locally, replicated async. Examples: Cassandra ONE, DynamoDB
PACELC: in normal operation there is also a Latency vs Consistency trade-off
Cassandra allows choosing CP/AP per request via consistency level

An analytics service reads from Cassandra with consistency level QUORUM. How does PACELC classify its behavior in normal operation (no partition)?

Key Takeaways

**CAP theorem** - proven by Gilbert & Lynch (2002): a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance
**P is unavoidable** - networks always partition; the real choice is CP (accuracy) vs AP (availability)
**CP systems** - reject writes without quorum; examples: etcd, ZooKeeper, Google Spanner, PostgreSQL
**AP systems** - accept writes locally, replicate asynchronously, data converges eventually; examples: Cassandra ONE, DynamoDB
**Eventual consistency** - does not mean slow: in normal operation Cassandra replicates within 1-5ms inside a DC
**PACELC** - normal operation (no partition) also has a trade-off: Latency vs Consistency (EL vs EC)
**Cassandra is tunable** - consistency level per request: QUORUM (CP behavior) or ONE (AP behavior)

Вопросы для размышления

A messaging app (Telegram, WhatsApp) - which CAP trade-off does it use for storing messages, and why exactly that one? What would happen if the opposite was chosen?

Связанные уроки

ds-01-intro — Failure model and partition concept defined in the intro
dist-12-consistency — Consistency spectrum continues from CAP into linearizable vs eventual
ds-03-consensus — CP systems rely on Paxos/Raft to enforce the C side
prob-22-concentration — Concentration bounds frame quorum failure probabilities
st-08-resilience — Resilience theory mirrors AP-style graceful degradation
isd-08-database-selection
isd-14-consistency-models
db-03-acid