Distributed Systems

Introduction to Distributed Systems

Цели урока

Understand Lamport's definition of distributed systems and its practical meaning
Know four failure classes and defense strategies for each
Apply CAP theorem to DB and architecture choices
Distinguish consistency levels from linearizable to eventual

Предварительные знания

Basic networking (TCP, HTTP)
Experience with one DB (PostgreSQL/MySQL/any SQL)
Understanding of transactions (ACID)

AWS S3, February 28, 2017: one engineer's typo in one command took down half the internet for 4 hours. USD 150M in industry losses. This isn't a bug - it's the nature of distributed systems.

**Knight Capital (USD 440M in 45 minutes)** - one forgotten feature toggle on one of eight servers
**Ariane 5 (USD 370M)** - integer overflow during float conversion in the Inertial Reference System (SRI) 37 seconds after liftoff
**Facebook BGP outage (6 hours, 2021)** - bad routing config cut FB off from the world
**Cloudflare Regex outage (July 2, 2019)** - one inefficient regex took the Cloudflare CDN/WAF down for 27 minutes
**Google Spanner** - atomic clocks and GPS in every data center to cheat CAP

Leslie Lamport and the Birth of Distributed Computing

In 1978 Lamport published "Time, Clocks, and the Ordering of Events in a Distributed System" - 8 pages that created an entire field. Simple idea: distributed systems have no global time, but **causal ordering** exists via logical clocks (Lamport timestamps). For this work and Paxos he received the Turing Award in 2013. One of the most-cited papers in distributed systems (~14,700 citations).

What Is a Distributed System

**February 28, 2017. 9:37 AM PST. One Amazon engineer types a command in the S3 console - a typo in a parameter. For 4 hours half the internet is down: Slack, Trello, Quora, Medium, Coursera. USD 150M of business revenue lost over a single ticket.** That is a distributed system: a million spinning gears, any one of which can stop everything.

A distributed system is one in which the failure of a computer nobody knew existed can render the whole thing unusable.

Nobody has written a better definition in 38 years. Every word maps to real production pain: some VM in an Iowa data center nobody remembered the auth pipeline depended on - and now the whole site returns 503. Independent failure also drives the need for consensus across replicas.

**Formally:** a distributed system is a collection of independent computers that appears to its users as a single coherent system. **Practically:** code where `network call != function call` - and that breaks every familiar intuition.

Scale of real distributed systems

System	Nodes	Key metric
Bitcoin	~17,000 full nodes	No central authority, byzantine-tolerant
AWS S3	Millions of disks	11 nines durability (losing a file ~ winning lottery 6 times in a row)
Google Spanner	Thousands of nodes across 5 regions	Global strong consistency via atomic clocks
Cloudflare DNS (1.1.1.1)	330+ cities	p50 latency 11ms globally
Facebook Memcached	10,000+ servers	1 billion requests/sec

Distributed system = just multiple copies of an app for scaling

Distributed system = nodes with independent failures, partial outages, no shared time

The hard problem is not throughput but the fact that any node can crash, respond slowly, or return stale data - independently of others. A single server is either up or down. A distributed system is always in some intermediate state: 3 of 5 nodes healthy, 1 responding with 2s latency, 1 returning year-old data.

Which statement most accurately describes a distributed system per Lamport?

Failure Is Not Exception, It's the Norm

**Knight Capital, August 1, 2012. 9:30 EST. They deploy new code to 8 of 8 servers - but one server still has an old feature toggle flag. In 45 minutes of trading, the buggy server makes 4 million trades worth USD 7 billion. By close of business the company has lost USD 440 million; within a week it needs an emergency capital injection, and in December 2012 it is absorbed by Getco. One forgotten file on one server out of eight.**

**Cardinal rule of distributed systems:** the question is not "will a node fail" but "what does the system do when it fails". With 100 nodes at 1-year MTBF, a failure occurs every 3.65 days. With 1000 nodes - every 8.7 hours. With 10,000 - every 52 minutes. **Failure is a normal operation, not a catastrophe.** This is the same blind spot enumerated in the eight fallacies.

Four classes of failures (Cristian/Schneider model)

Failure type	What happens	Defense
CRASH	Node dies and stops responding (kernel panic, OOM, kill)	Heartbeat + retry on another cluster node
OMISSION	Node lives but silently drops messages (network drop, buffer overflow)	Idempotency + acknowledgement
TIMING	Node replies, but too late (GC pause, CPU overload, swap)	Timeout + retry, circuit breaker
BYZANTINE	Node lies - corrupts data, sends contradictory replies (bug, hardware failure, malicious)	PBFT, Paxos with >2/3 healthy nodes, signed messages

Omission failure in a banking transaction

Server received a USD 1000 charge request. Charged it. Sent `200 OK`. The reply was lost on the network. Client sees timeout. What should the client do? **Retry** - possible double charge. **Don't retry** - charge may not have happened. The fix: every transaction gets a unique `idempotency-key`. On retry with the same key, the server returns the result of the first request without charging again. Stripe and PayPal have worked this way since the early 2010s.

Anything that can go wrong, will go wrong - and at scale, it will go wrong every Tuesday at 3 PM.

With reliable hardware, failures can be avoided

At thousands-of-nodes scale, failure is an event happening every few hours regardless of hardware quality

Pinheiro/Weber/Barroso (Google, FAST 2007) published the data: AFR ~2% in year 1, ~8% in year 2, ~8.6% in year 3. At 100,000 disks that's thousands of failures per year. No hardware beats statistics. The fix is to design the system so node failure is a routine event with automatic recovery.

With 1000 servers at MTBF (mean time between failures) = 1 year, how often does something fail in the cluster?

The CAP Theorem - Pick Two of Three

**Eric Brewer, 2000, PODC conference. Hypothesis:** a distributed system cannot simultaneously guarantee three properties - **C**onsistency, **A**vailability, **P**artition tolerance. **2002:** Seth Gilbert and Nancy Lynch at MIT prove it formally. Now it's a law, like thermodynamics. Full treatment in the CAP lesson.

**Consistency (C)** - all nodes see the same data at the same moment. Write to one, read fresh from any other.
**Availability (A)** - every request gets a response (success or error), no hangs. No live node stays silent.
**Partition tolerance (P)** - the system keeps working when the network splits between nodes. In real life networks always partition, so P is not optional.

**Practical reading:** P is unavoidable (networks partition). So the real choice is **CP or AP**. What matters more when nodes lose connection: return precise data at the cost of unavailability (CP), or return some data at the cost of possible staleness (AP).

Real systems and their pick

System	Type	Why
PostgreSQL (single primary)	CP	If primary is down - writes rejected. Financial data demands precision.
MongoDB (with majority writes)	CP	On partition the minority side becomes read-only.
DynamoDB	AP	Eventual consistency by default - prefers availability for shopping carts.
Cassandra	AP	Tunable consistency - choose per query.
etcd / Consul (Raft)	CP	Cluster coordination - better fail than serve conflicting configs.
DNS	AP	Updates propagate over hours - stale records beat unreachable domains.

**Common mistake:** treating CAP as a binary system-wide choice. In practice even one DB makes different choices per operation (Cassandra: `consistency_level` per query). The modern view is **PACELC**: under Partition pick A or C, **Else** pick Latency or Consistency.

A bank picks a DB to store account balances. Which trade-off is correct?

The Consistency Spectrum

**Strong vs eventual consistency** is not two options - it's the poles of a spectrum. Between them are a dozen levels, each with its own latency vs precision trade-off.

Level	Guarantee	Latency	Example
Linearizable	Write visible to all instantly (like single machine)	High (10-100ms global)	Google Spanner, etcd
Sequential	All see operations in one order	Medium	ZooKeeper read consistency
Causal	Causally related ops in order	Low	Riak, COPS
Read-after-write (RYW)	Self-issued writes visible immediately, others eventually	Low	Session consistency in DynamoDB
Eventual	Eventually all nodes converge	Minimal	DynamoDB, Cassandra default

Google Spanner: how to cheat CAP

Spanner is the first global strong consistency DB. The trick is **TrueTime API**: every data center has atomic clocks and GPS receivers. The API returns not exact time but an interval `[earliest, latest]` with bounded uncertainty (typically 7ms). A transaction waits for the interval to pass - getting globally ordered timestamps without coordination. The mechanics of logical and physical time are unpacked in the clocks lesson.

**Cost of strong consistency:** every commit blocks for several ms of waiting. Spanner throughput - thousands of tx/sec per node, not millions like DynamoDB. That's why Google uses Spanner for finance and Ads, but eventually consistent systems for search.

Key takeaways

Distributed system = independent failures + partial outages + no shared time (Lamport 1987)
At thousands-of-nodes scale, failure happens every few hours - design for the norm, not the exception
4 failure classes: crash, omission, timing, byzantine - each needs its own defense
CAP theorem: P is unavoidable, choose between CP (precision) and AP (availability)
Consistency is a spectrum from linearizable to eventual, picked by business requirements

What's next in the course

Each concept from this lesson expands into a dedicated topic.

CAP theorem and PACELC — Formal proofs and choice nuances
Consensus (Paxos, Raft) — How nodes agree under failures
Logical clocks — Lamport timestamps, vector clocks for event ordering
Replication — Leader/follower, multi-leader, leaderless approaches
CRDTs — Conflict-free data structures for AP systems

How does Google Spanner achieve global strong consistency without huge latency?

Key Concepts

A distributed system is a set of nodes where each node cannot tell when another has failed (Lamport's definition)
CAP theorem: of Consistency, Availability, and Partition tolerance only two can be guaranteed simultaneously - and P is always required, so the real choice is C vs A
CP systems (e.g. ZooKeeper, HBase) reject requests during a network partition but always return consistent data
AP systems (e.g. Cassandra, DynamoDB) stay available during partitions but different nodes may return different values
Four failure classes: crash (node dies), omission (packet lost), timing (response too late), Byzantine (node lies)
Consistency levels from strong to weak: linearizable - sequential - causal - eventual; each step gains performance but reduces predictability for clients

Вопросы для размышления

Consider any daily-use service (banking app, messenger, navigation). Which CAP trade-off did it likely pick, and why exactly that one?

Связанные уроки

ds-02-cap-theorem — CAP formalises the partition trade-off introduced here
ds-03-consensus — Paxos and Raft solve agreement under the failures defined here
ds-05-replication — Replication strategies follow from independent-failure semantics
sd-10-microservices — Every microservice mesh inherits Lamport's failure model
alg-01-big-o — Complexity intuition needed to reason about cluster scale
st-01-feedback-loops — Organisational coordination shows the same partial-failure dynamics
ibd-21-docker-k8s-interview
net-01-intro