Distributed Systems
CAP Theorem
Цели урока
- Understand CAP intuition through the two-node partition scenario
- Distinguish CP and AP behavior and know examples of each
- Apply PACELC to analyze trade-offs in normal operation
- Choose a system type based on consistency and availability requirements
Предварительные знания
- Understanding of distributed systems and network partitions
- Familiarity with CAP from the intro lesson
- Experience with any database (SQL or NoSQL)
Greg Linden, Amazon, 2006: an internal A/B test pinned +100ms of latency to -1% in sales. A year later Werner Vogels published the Dynamo paper: eventual consistency in exchange for 99.999% cart availability. CAP is not academic: a 503 mid-checkout costs more than a cart that is 200ms stale.
- **etcd in Kubernetes** - CP, because split-brain in a container orchestrator is worse than etcd downtime
- **DynamoDB (Amazon)** - AP with eventual consistency for shopping cart: 99.999% availability beats instant consistency
- **Google Spanner** - PC/EC via TrueTime (atomic clocks + GPS) for Google financial transactions
- **Cassandra at Netflix** - AP/EL for user preferences and viewing history: delayed history sync is acceptable
- **PostgreSQL** - CP: primary loss blocks writes, acceptable for financial data
Eric Brewer and the Conjecture That Became a Theorem
In 2000 Brewer was CTO of Inktomi (the first web search engine) and ACM SIGOPS president. At PODC he presented the CAP conjecture. Two years later Gilbert and Lynch at MIT published a formal proof. In 2012 Brewer himself wrote a revision: "CAP Twelve Years Later" - acknowledging the theorem is a simplification and reality is more nuanced, especially with partial failures.
Intuition: Why All Three Is Impossible
**Eric Brewer, PODC 2000 - "Inktomi and the scaling lesson". Conjecture:** a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance. **2002:** Seth Gilbert and Nancy Lynch at MIT proved it formally. Not a best practice, not a recommendation - a theorem with a proof, like geometry. The failure semantics behind partitions are introduced in the distributed-systems intro.
| Property | What it means | Violation looks like |
|---|---|---|
| **C**onsistency | All nodes see identical data at any moment. A write to one node is immediately visible on any other. | Query balance - get USD 1000. Query again 100ms later - get USD 800 (nobody charged anything) |
| **A**vailability | Every request to a live node gets a response (success or error). No live node stays silent. | Server is up, ping works, but the request hangs forever - no response, no timeout |
| **P**artition tolerance | The system keeps working when the network splits between nodes. | Two data centers lose connectivity - both halves become unavailable even though each is healthy internally |
Intuition via a concrete scenario: two nodes A and B, a client writes x=1 to node A, then the network between A and B breaks. What should node B do when a client reads x? **Option 1:** return the stale value x=0 (breaks Consistency). **Option 2:** refuse to answer, wait for network recovery (breaks Availability). There is no third option.
**Key insight:** P (Partition tolerance) is not optional. Networks always partition - buffer overflow, BGP flap, a 10-second GC pause, a physical cable cut. In the real world the choice is between **CP** and **AP**, not between CAP and CA.
CAP theorem requires picking exactly two properties forever at design time
P is unavoidable, the real choice is CP or AP. Different operations in the same system can make different choices.
Cassandra allows setting consistency level per request: QUORUM for critical ops (CP behavior), ONE for non-critical (AP behavior). MongoDB with minority reads gives AP behavior on reads with CP behavior on writes.
Nodes A and B are separated by a partition. A client reads x from node B. Node B responds with x=0 (stale value). Which CAP property is violated?
CP Systems: Consistency at the Cost of Availability
A CP system refuses requests during a partition - returns an error or waits for quorum recovery. This is not a bug, it's an intentional choice: better to say "don't know" than give a wrong answer. The pattern for finance, inventory, cluster configuration - anything where inconsistency is more expensive than downtime. Quorum mechanics live inside consensus (Paxos, Raft).
| System | Consistency algorithm | When to use |
|---|---|---|
| etcd / Consul | Raft - leader accepts all writes | Service discovery, distributed locks, cluster config |
| ZooKeeper | Zab (ZooKeeper Atomic Broadcast) | Coordination - leader election, naming, barriers |
| MongoDB (majority write) | Quorum writes + Raft oplog | Financial data, inventory, user profile |
| Google Spanner | TrueTime + 2PC | Global transactions, banking, Google Ads |
etcd in Kubernetes: why CP
etcd stores all Kubernetes cluster state: which pods are running, on which nodes, which ConfigMaps are active. Under a network partition, etcd refuses writes without quorum (>50% of etcd nodes). This protects against split-brain: two half-clusters independently managing pods. Temporary etcd unavailability (seconds to minutes) is far better than two schedulers both thinking they are the authority.
CP behavior costs latency: every write waits for quorum acknowledgment. Raft with 3 nodes in one DC - 2-5ms overhead. Raft across regions - 50-150ms. Google Spanner with TrueTime adds a commit-wait of 7-14ms per transaction.
An etcd cluster has 5 nodes, 3 nodes are down (no quorum). What happens to write operations?
AP Systems: Availability at the Cost of Consistency
An AP system keeps running on all nodes during a partition - accepting writes, returning responses. Data will eventually converge (eventual consistency), but during a partition different nodes see different things. The choice for user data, carts, feeds, counters - anywhere unavailability is worse than staleness. Convergence mechanics for AP stores are explored in CRDTs.
| System | Conflicts in AP | Resolution strategy |
|---|---|---|
| DynamoDB | Concurrent writes to different replicas | Last-writer-wins (timestamp) or CRDTs |
| Cassandra | Different values on different nodes | Read repair + LWW or user-defined merge |
| CouchDB | Divergent document revisions | Explicit conflict detection, user resolves |
| Riak | Sibling values | Built-in CRDTs (G-Counter, LWW-Register) |
DynamoDB and the Amazon shopping cart
Werner Vogels (Amazon CTO) described the choice: the shopping cart is stored in DynamoDB with eventual consistency. If a customer adds an item from their phone while a shopping session is open on a laptop and the network is unstable, conflicting versions can arise. Amazon's strategy: on conflict, show the union of both versions. Better to show an extra item in the cart than lose one the customer added or return a 503 error.
**Eventual consistency does not mean slow**. In normal operation (without partition), Cassandra replicates a write within 1-5ms inside one DC. "Eventual" means a guarantee of final convergence, not a specific timeout.
An AP system returns random data - the result is unpredictable to clients
An AP system returns data that was accurate up to the last partition. After the network heals, nodes automatically synchronize.
Eventual consistency is predictable: with no new writes, all replicas converge to one value. Cassandra guarantees this through anti-entropy (Merkle trees) and read repair. Unpredictability occurs only during concurrent writes in an active partition.
Cassandra is configured with consistency level ONE (read/write to one replica). During a network partition between DC1 and DC2, a client reads from DC2. What does it get?
PACELC: CAP Is Only Half the Story
CAP describes behavior only during a partition. But 99.9% of the time the network is healthy. **PACELC** (Daniel Abadi, 2012) fills that gap: **P**artition -> **A** or **C**; **E**lse (normal operation) -> **L**atency or **C**onsistency. This more accurately captures real-world trade-offs. The full consistency landscape is mapped in the consistency-models lesson.
| System | Under partition | Normal mode | PACELC class |
|---|---|---|---|
| DynamoDB (eventual) | AP | EL (low latency) | PA/EL |
| Cassandra (ONE) | AP | EL | PA/EL |
| MongoDB (majority) | CP | EC (consistency) | PC/EC |
| etcd / ZooKeeper | CP | EC | PC/EC |
| Google Spanner | CP | EC (with TrueTime overhead) | PC/EC |
| MySQL (semi-sync) | CP | EC | PC/EC |
In normal operation, EC systems sacrifice latency to guarantee that read data is always fresh. EL systems return immediately from a local replica - lower latency, but concurrent writes may cause temporary divergence.
**Cassandra is unique:** tunable consistency per request. QUORUM gives PC/EC behavior, ONE gives PA/EL. The same cluster can be CP for critical operations and AP for non-critical ones - a rare combination of flexibility.
CAP and PACELC: key takeaways
- CAP: proven (Gilbert & Lynch 2002) - cannot simultaneously have C, A, and P
- P is unavoidable - the real choice is CP (precision) vs AP (availability)
- CP: write rejected without quorum. Examples: etcd, PostgreSQL, Spanner
- AP: write accepted locally, replicated async. Examples: Cassandra ONE, DynamoDB
- PACELC: in normal operation there is also a Latency vs Consistency trade-off
- Cassandra allows choosing CP/AP per request via consistency level
An analytics service reads from Cassandra with consistency level QUORUM. How does PACELC classify its behavior in normal operation (no partition)?
Key Takeaways
- **CAP theorem** - proven by Gilbert & Lynch (2002): a distributed system cannot simultaneously guarantee Consistency, Availability, and Partition tolerance
- **P is unavoidable** - networks always partition; the real choice is CP (accuracy) vs AP (availability)
- **CP systems** - reject writes without quorum; examples: etcd, ZooKeeper, Google Spanner, PostgreSQL
- **AP systems** - accept writes locally, replicate asynchronously, data converges eventually; examples: Cassandra ONE, DynamoDB
- **Eventual consistency** - does not mean slow: in normal operation Cassandra replicates within 1-5ms inside a DC
- **PACELC** - normal operation (no partition) also has a trade-off: Latency vs Consistency (EL vs EC)
- **Cassandra is tunable** - consistency level per request: QUORUM (CP behavior) or ONE (AP behavior)
Вопросы для размышления
- A messaging app (Telegram, WhatsApp) - which CAP trade-off does it use for storing messages, and why exactly that one? What would happen if the opposite was chosen?
Связанные уроки
- ds-01-intro — Failure model and partition concept defined in the intro
- dist-12-consistency — Consistency spectrum continues from CAP into linearizable vs eventual
- ds-03-consensus — CP systems rely on Paxos/Raft to enforce the C side
- prob-22-concentration — Concentration bounds frame quorum failure probabilities
- st-08-resilience — Resilience theory mirrors AP-style graceful degradation
- isd-08-database-selection
- isd-14-consistency-models
- db-03-acid