Distributed Systems
Introduction to Distributed Systems
Цели урока
- Understand Lamport's definition of distributed systems and its practical meaning
- Know four failure classes and defense strategies for each
- Apply CAP theorem to DB and architecture choices
- Distinguish consistency levels from linearizable to eventual
Предварительные знания
- Basic networking (TCP, HTTP)
- Experience with one DB (PostgreSQL/MySQL/any SQL)
- Understanding of transactions (ACID)
AWS S3, February 28, 2017: one engineer's typo in one command took down half the internet for 4 hours. USD 150M in industry losses. This isn't a bug - it's the nature of distributed systems.
- **Knight Capital (USD 440M in 45 minutes)** - one forgotten feature toggle on one of eight servers
- **Ariane 5 (USD 370M)** - integer overflow during float conversion in the Inertial Reference System (SRI) 37 seconds after liftoff
- **Facebook BGP outage (6 hours, 2021)** - bad routing config cut FB off from the world
- **Cloudflare Regex outage (July 2, 2019)** - one inefficient regex took the Cloudflare CDN/WAF down for 27 minutes
- **Google Spanner** - atomic clocks and GPS in every data center to cheat CAP
Leslie Lamport and the Birth of Distributed Computing
In 1978 Lamport published "Time, Clocks, and the Ordering of Events in a Distributed System" - 8 pages that created an entire field. Simple idea: distributed systems have no global time, but **causal ordering** exists via logical clocks (Lamport timestamps). For this work and Paxos he received the Turing Award in 2013. One of the most-cited papers in distributed systems (~14,700 citations).
What Is a Distributed System
**February 28, 2017. 9:37 AM PST. One Amazon engineer types a command in the S3 console - a typo in a parameter. For 4 hours half the internet is down: Slack, Trello, Quora, Medium, Coursera. USD 150M of business revenue lost over a single ticket.** That is a distributed system: a million spinning gears, any one of which can stop everything.
A distributed system is one in which the failure of a computer nobody knew existed can render the whole thing unusable.
Nobody has written a better definition in 38 years. Every word maps to real production pain: some VM in an Iowa data center nobody remembered the auth pipeline depended on - and now the whole site returns 503. Independent failure also drives the need for consensus across replicas.
**Formally:** a distributed system is a collection of independent computers that appears to its users as a single coherent system. **Practically:** code where `network call != function call` - and that breaks every familiar intuition.
Scale of real distributed systems
| System | Nodes | Key metric |
|---|---|---|
| Bitcoin | ~17,000 full nodes | No central authority, byzantine-tolerant |
| AWS S3 | Millions of disks | 11 nines durability (losing a file ~ winning lottery 6 times in a row) |
| Google Spanner | Thousands of nodes across 5 regions | Global strong consistency via atomic clocks |
| Cloudflare DNS (1.1.1.1) | 330+ cities | p50 latency 11ms globally |
| Facebook Memcached | 10,000+ servers | 1 billion requests/sec |
Distributed system = just multiple copies of an app for scaling
Distributed system = nodes with independent failures, partial outages, no shared time
The hard problem is not throughput but the fact that any node can crash, respond slowly, or return stale data - independently of others. A single server is either up or down. A distributed system is always in some intermediate state: 3 of 5 nodes healthy, 1 responding with 2s latency, 1 returning year-old data.
Which statement most accurately describes a distributed system per Lamport?
Failure Is Not Exception, It's the Norm
**Knight Capital, August 1, 2012. 9:30 EST. They deploy new code to 8 of 8 servers - but one server still has an old feature toggle flag. In 45 minutes of trading, the buggy server makes 4 million trades worth USD 7 billion. By close of business the company has lost USD 440 million; within a week it needs an emergency capital injection, and in December 2012 it is absorbed by Getco. One forgotten file on one server out of eight.**
**Cardinal rule of distributed systems:** the question is not "will a node fail" but "what does the system do when it fails". With 100 nodes at 1-year MTBF, a failure occurs every 3.65 days. With 1000 nodes - every 8.7 hours. With 10,000 - every 52 minutes. **Failure is a normal operation, not a catastrophe.** This is the same blind spot enumerated in the eight fallacies.
Four classes of failures (Cristian/Schneider model)
| Failure type | What happens | Defense |
|---|---|---|
| **CRASH** | Node dies and stops responding (kernel panic, OOM, kill) | Heartbeat + retry on another cluster node |
| **OMISSION** | Node lives but silently drops messages (network drop, buffer overflow) | Idempotency + acknowledgement |
| **TIMING** | Node replies, but too late (GC pause, CPU overload, swap) | Timeout + retry, circuit breaker |
| **BYZANTINE** | Node lies - corrupts data, sends contradictory replies (bug, hardware failure, malicious) | PBFT, Paxos with >2/3 healthy nodes, signed messages |
Omission failure in a banking transaction
Server received a USD 1000 charge request. Charged it. Sent `200 OK`. The reply was lost on the network. Client sees timeout. What should the client do? **Retry** - possible double charge. **Don't retry** - charge may not have happened. The fix: every transaction gets a unique `idempotency-key`. On retry with the same key, the server returns the result of the first request without charging again. Stripe and PayPal have worked this way since the early 2010s.
Anything that can go wrong, will go wrong - and at scale, it will go wrong every Tuesday at 3 PM.
With reliable hardware, failures can be avoided
At thousands-of-nodes scale, failure is an event happening every few hours regardless of hardware quality
Pinheiro/Weber/Barroso (Google, FAST 2007) published the data: AFR ~2% in year 1, ~8% in year 2, ~8.6% in year 3. At 100,000 disks that's thousands of failures per year. No hardware beats statistics. The fix is to design the system so node failure is a routine event with automatic recovery.
With 1000 servers at MTBF (mean time between failures) = 1 year, how often does something fail in the cluster?
The CAP Theorem - Pick Two of Three
**Eric Brewer, 2000, PODC conference. Hypothesis:** a distributed system cannot simultaneously guarantee three properties - **C**onsistency, **A**vailability, **P**artition tolerance. **2002:** Seth Gilbert and Nancy Lynch at MIT prove it formally. Now it's a law, like thermodynamics. Full treatment in the CAP lesson.
- **Consistency (C)** - all nodes see the same data at the same moment. Write to one, read fresh from any other.
- **Availability (A)** - every request gets a response (success or error), no hangs. No live node stays silent.
- **Partition tolerance (P)** - the system keeps working when the network splits between nodes. In real life networks always partition, so P is not optional.
**Practical reading:** P is unavoidable (networks partition). So the real choice is **CP or AP**. What matters more when nodes lose connection: return precise data at the cost of unavailability (CP), or return some data at the cost of possible staleness (AP).
Real systems and their pick
| System | Type | Why |
|---|---|---|
| PostgreSQL (single primary) | CP | If primary is down - writes rejected. Financial data demands precision. |
| MongoDB (with majority writes) | CP | On partition the minority side becomes read-only. |
| DynamoDB | AP | Eventual consistency by default - prefers availability for shopping carts. |
| Cassandra | AP | Tunable consistency - choose per query. |
| etcd / Consul (Raft) | CP | Cluster coordination - better fail than serve conflicting configs. |
| DNS | AP | Updates propagate over hours - stale records beat unreachable domains. |
**Common mistake:** treating CAP as a binary system-wide choice. In practice even one DB makes different choices per operation (Cassandra: `consistency_level` per query). The modern view is **PACELC**: under Partition pick A or C, **Else** pick Latency or Consistency.
A bank picks a DB to store account balances. Which trade-off is correct?
The Consistency Spectrum
**Strong vs eventual consistency** is not two options - it's the poles of a spectrum. Between them are a dozen levels, each with its own latency vs precision trade-off.
| Level | Guarantee | Latency | Example |
|---|---|---|---|
| Linearizable | Write visible to all instantly (like single machine) | High (10-100ms global) | Google Spanner, etcd |
| Sequential | All see operations in one order | Medium | ZooKeeper read consistency |
| Causal | Causally related ops in order | Low | Riak, COPS |
| Read-after-write (RYW) | Self-issued writes visible immediately, others eventually | Low | Session consistency in DynamoDB |
| Eventual | Eventually all nodes converge | Minimal | DynamoDB, Cassandra default |
Google Spanner: how to cheat CAP
Spanner is the first global strong consistency DB. The trick is **TrueTime API**: every data center has atomic clocks and GPS receivers. The API returns not exact time but an interval `[earliest, latest]` with bounded uncertainty (typically 7ms). A transaction waits for the interval to pass - getting globally ordered timestamps without coordination. The mechanics of logical and physical time are unpacked in the clocks lesson.
**Cost of strong consistency:** every commit blocks for several ms of waiting. Spanner throughput - thousands of tx/sec per node, not millions like DynamoDB. That's why Google uses Spanner for finance and Ads, but eventually consistent systems for search.
Key takeaways
- Distributed system = independent failures + partial outages + no shared time (Lamport 1987)
- At thousands-of-nodes scale, failure happens every few hours - design for the norm, not the exception
- 4 failure classes: crash, omission, timing, byzantine - each needs its own defense
- CAP theorem: P is unavoidable, choose between CP (precision) and AP (availability)
- Consistency is a spectrum from linearizable to eventual, picked by business requirements
What's next in the course
Each concept from this lesson expands into a dedicated topic.
- CAP theorem and PACELC — Formal proofs and choice nuances
- Consensus (Paxos, Raft) — How nodes agree under failures
- Logical clocks — Lamport timestamps, vector clocks for event ordering
- Replication — Leader/follower, multi-leader, leaderless approaches
- CRDTs — Conflict-free data structures for AP systems
How does Google Spanner achieve global strong consistency without huge latency?
Key Concepts
- A distributed system is a set of nodes where each node cannot tell when another has failed (Lamport's definition)
- CAP theorem: of Consistency, Availability, and Partition tolerance only two can be guaranteed simultaneously - and P is always required, so the real choice is C vs A
- CP systems (e.g. ZooKeeper, HBase) reject requests during a network partition but always return consistent data
- AP systems (e.g. Cassandra, DynamoDB) stay available during partitions but different nodes may return different values
- Four failure classes: crash (node dies), omission (packet lost), timing (response too late), Byzantine (node lies)
- Consistency levels from strong to weak: linearizable - sequential - causal - eventual; each step gains performance but reduces predictability for clients
Вопросы для размышления
- Consider any daily-use service (banking app, messenger, navigation). Which CAP trade-off did it likely pick, and why exactly that one?
Связанные уроки
- ds-02-cap-theorem — CAP formalises the partition trade-off introduced here
- ds-03-consensus — Paxos and Raft solve agreement under the failures defined here
- ds-05-replication — Replication strategies follow from independent-failure semantics
- sd-10-microservices — Every microservice mesh inherits Lamport's failure model
- alg-01-big-o — Complexity intuition needed to reason about cluster scale
- st-01-feedback-loops — Organisational coordination shows the same partial-failure dynamics
- ibd-21-docker-k8s-interview
- net-01-intro