Distributed Systems
8 Fallacies of Distributed Computing
Lesson objectives
- Know all 8 Deutsch fallacies and understand why each gets violated in production
- Apply defense patterns: retry, circuit breaker, idempotency, service discovery
- Understand network traffic costs and choose serialization format deliberately
- Design systems accounting for heterogeneity and multiple administrators
Prerequisites
- Basic understanding of HTTP and network protocols
- Experience working with REST APIs
- Basic familiarity with microservice architecture
Peter Deutsch nailed down the 8 fallacies in 1994. Thirty years later, not one has shifted - and every year new companies pay millions for the same mistakes.
- **Amazon Prime Day 2012** - network buffers saturated, millions of requests silently lost (fallacy 1)
- **Netflix** - saved tens of millions per year after switching to Protobuf (fallacy 3)
- **Juniper Networks backdoor (2015)** - VPN traffic intercepted inside corporate networks for years (fallacy 4)
- **Fastly CDN outage (2021)** - about 1 hour of downtime at a single provider took Reddit, GitHub, and the Financial Times offline (fallacy 6)
- **Uber gRPC migration (2018)** - latency down 30-40%, network traffic reduced 5-10x, serialization CPU usage halved (fallacies 3, 7)
Peter Deutsch and the 8 Fallacies
Peter Deutsch is one of the creators of Smalltalk and the developer of Ghostscript. In 1994, working at Sun Microsystems, he listed the first 7 fallacies on a whiteboard during a discussion about problems in distributed object systems. James Gosling (creator of Java) later added the 8th. The list was never formally published - it spread as an oral tradition and was only formally written up in 2006 by Arnon Rotem-Gal-Oz.
The Network Is Reliable and Latency Is Zero
**2012 Amazon Prime Day. Traffic spiked 4x past forecast. Network buffers saturated in seconds. Millions of requests vanished silently - no errors, just silence. Services sat waiting for replies that never came.** That is the first two fallacies playing out live: the network is not reliable, and latency is not zero. The same blind spots resurface under partitions in the CAP lesson.
The **8 fallacies of distributed computing** are false assumptions developers smuggle into production. Every one of them gets violated eventually.
Fallacy 1: the network is reliable
Packets get lost. Connections drop. Switches fail. Buffers overflow. In 2008, a shark bit through an undersea cable near Egypt and 25% of India's internet traffic vanished instantly. The network is an unreliable transport, by definition.
| Pattern | Application |
|---|---|
| Retry with exponential backoff | Retry after 1s, 2s, 4s, 8s - not immediately |
| Idempotent operations | Repeated call with same idempotency-key has no duplicate effect |
| Circuit breaker | After N consecutive errors - stop sending requests, let dependency recover |
| Timeout on every call | Without a timeout a thread hangs forever |
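The retry and idempotency rows of the table can be sketched together. This is a minimal illustration, not a production client: `FlakyPaymentServer`, `call_with_retry`, and the small random jitter added to each delay are all assumptions made for the example. The server applies each idempotency key once, then "loses" its first two responses, so the client has to retry without double-charging.

```python
import random
import time
import uuid


class FlakyPaymentServer:
    """Hypothetical server: applies each idempotency key at most once,
    but loses the first `drop_first` responses on the way back."""

    def __init__(self, drop_first):
        self.drop_first = drop_first
        self.results = {}   # idempotency key -> stored result
        self.charges = 0    # number of real side effects performed

    def charge(self, key, amount):
        if key not in self.results:      # first time this key is seen
            self.charges += 1
            self.results[key] = f"charged {amount}"
        if self.drop_first > 0:          # work was done, response is lost
            self.drop_first -= 1
            raise TimeoutError("no response")
        return self.results[key]


def call_with_retry(fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry fn() on TimeoutError, waiting base_delay * 2**n plus up to 10% jitter."""
    for attempt in range(attempts):
        try:
            return fn()
        except TimeoutError:
            if attempt == attempts - 1:
                raise                    # out of attempts: surface the error
            delay = base_delay * 2 ** attempt
            sleep(delay + random.uniform(0, delay * 0.1))


server = FlakyPaymentServer(drop_first=2)
key = str(uuid.uuid4())   # one key shared by every retry of this operation
waits = []                # record the delays instead of actually sleeping
result = call_with_retry(lambda: server.charge(key, 100), sleep=waits.append)
```

Two responses are lost, the third attempt succeeds, and `server.charges` stays at 1 because every retry reuses the same key. Passing `sleep=waits.append` is only there to keep the demo instant; real code would leave the default.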
Fallacy 2: latency is zero
An in-memory function call: ~1 nanosecond. A request to a service in the same data center: 1-2 milliseconds - 1,000,000x slower. Cross-region: 50-150ms. Cross-ocean: 100-300ms. Code that handles one network call gracefully falls apart at a thousand calls without architectural rework. The TCP mechanics behind the latency floor live in the TCP basics lesson.
| Pattern | Application |
|---|---|
| Batching | 100 requests in one call instead of 100 sequential ones |
| Caching | Frequent reads from memory, not over the wire |
| Async processing | Do not wait for reply - enqueue and continue |
| CDN / edge | Data geographically closer to the user |
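The batching row is easiest to appreciate as arithmetic. The sketch below assumes a fixed 2 ms round trip inside one data center and assumes a whole batch costs a single round trip, which ignores payload size and server-side processing. Both function names are made up for the illustration.

```python
def sequential_ms(calls, round_trip_ms):
    """Total wall-clock time when every call waits for the previous one."""
    return calls * round_trip_ms


def batched_ms(calls, round_trip_ms, batch_size):
    """Total time when calls are grouped; assumes one round trip per batch."""
    batches = -(-calls // batch_size)   # ceiling division
    return batches * round_trip_ms


one_by_one = sequential_ms(100, 2.0)     # 100 round trips
in_one_batch = batched_ms(100, 2.0, 100)  # 1 round trip
```

Under these assumptions 100 sequential calls spend 200 ms waiting on the network, while one batched call spends 2 ms.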
Myth: if the service works locally without errors, production will behave the same.
Reality: localhost hides both fallacies: the local network is reliable and has near-zero latency. Production is a different world.
That is exactly why chaos engineering (Netflix Chaos Monkey, 2010) deliberately injects failures into production. Systems must be designed for failures, not tested under ideal conditions.
Service A calls service B 1000 times sequentially in a loop. B's latency is 2 ms. What happens?
Bandwidth Is Infinite and the Network Is Secure
**2020. Netflix accounts for 15% of global internet traffic at peak hours. Switching from JSON to Protobuf cut payload size 3-7x. Savings: tens of millions of dollars per year in bandwidth alone.** That is fallacy 3: bandwidth is not infinite, and it costs real money.
Fallacy 3: bandwidth is infinite
| AWS Traffic | Cost |
|---|---|
| Within same AZ | Free |
| Between AZs in same region | USD 0.01/GB |
| Between regions | USD 0.02-0.09/GB |
| To internet | USD 0.09/GB |
100 TB of cross-region traffic per month is 100,000 GB, which comes to USD 2,000 at the lowest rate (USD 0.02/GB) and USD 9,000 at the highest (USD 0.09/GB), just in transfer fees. With microservice architecture every service chats with several others. Multiply across 100 services and bandwidth turns into a meaningful slice of the infrastructure bill.
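The transfer-fee arithmetic can be checked against the rates in the table. This is a rough sketch that assumes decimal units (1 TB = 1,000 GB) and a flat per-GB rate with no volume tiers; `monthly_transfer_cost` is a name invented for the example.

```python
GB_PER_TB = 1000  # decimal units assumed


def monthly_transfer_cost(tb_per_month, usd_per_gb):
    """Flat-rate transfer cost in USD; real pricing may be tiered."""
    return tb_per_month * GB_PER_TB * usd_per_gb


low = monthly_transfer_cost(100, 0.02)    # cheapest cross-region rate in the table
high = monthly_transfer_cost(100, 0.09)   # most expensive cross-region rate
cross_az = monthly_transfer_cost(100, 0.01)
```

For 100 TB per month this gives roughly USD 2,000 to 9,000 across regions, and about USD 1,000 even for traffic that only crosses availability zones.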
- **Compression:** gzip on HTTP, Protobuf or MessagePack instead of JSON (3-10x smaller)
- **Pagination:** do not return 10,000 records when one page of 20 is needed
- **Delta sync:** send only changes, not the full object
- **Data locality:** process where data lives
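Two of the items above, compression and pagination, can be tried with the standard library alone. The user records below are synthetic and the page size of 20 is taken from the pagination bullet; actual ratios depend entirely on the data, so the sketch only compares sizes rather than promising a specific factor.

```python
import gzip
import json

# Synthetic payload: 500 similar-looking user records.
users = [
    {"id": i, "name": f"user{i}", "email": f"user{i}@example.com", "active": True}
    for i in range(500)
]

raw = json.dumps(users).encode("utf-8")        # full list, uncompressed
compressed = gzip.compress(raw)                # same list, gzip-compressed
page = json.dumps(users[:20]).encode("utf-8")  # one page of 20, uncompressed

sizes = {"raw": len(raw), "gzip": len(compressed), "one_page": len(page)}
```

Repetitive JSON like this compresses well because every record repeats the same keys, and a single page is smaller still because most of the records are never sent.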
Fallacy 4: the network is secure
In 2015 Juniper Networks uncovered a backdoor in its VPN devices - someone had been decrypting all traffic through those boxes for years. Packets can be intercepted at any hop between source and destination. An internal corporate network does not make data secure on its own.
| Pattern | Application |
|---|---|
| TLS everywhere | Even inside the data center between services |
| mTLS (mutual TLS) | Both parties authenticate each other with certificates |
| Zero-trust architecture | No network is trusted by default - every request is authenticated |
| Encryption at rest | Data on disk is encrypted even if the disk is stolen |
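As a rough sketch of the mTLS row, Python's standard `ssl` module can build a server-side context that refuses clients without a valid certificate. The helper name `build_mtls_server_context` and its optional file-path parameters are assumptions for the example; the paths are only loaded when given, so the snippet runs without any certificates on disk.

```python
import ssl


def build_mtls_server_context(certfile=None, keyfile=None, cafile=None):
    """Server-side TLS context that requires a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.verify_mode = ssl.CERT_REQUIRED      # handshake fails without a valid client cert
    if certfile:
        # The server's own certificate and private key.
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    if cafile:
        # The CA that signs client certificates.
        ctx.load_verify_locations(cafile=cafile)
    return ctx


ctx = build_mtls_server_context()
```

In a real deployment all three paths would be supplied, and in many setups a service mesh or sidecar handles this instead of application code.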
A microservice returns a JSON list of 500 users (~2KB each). Which optimization yields the biggest impact?
Topology Is Static and There Is One Administrator
**Kubernetes rolling update: 50 pods get swapped for the new version. In 2 minutes, IPs change on 50 components. A service with hardcoded IPs in its config loses half its connections.** That is fallacy 5: in production, topology never stops changing. The orchestration details live in the Kubernetes interview lesson.
Fallacy 5: topology does not change
IP addresses change on deploy, failover, autoscaling, migration. In a cloud environment a server can get torn down and recreated with a new IP at any moment. A hardcoded IP in config is technical debt with a very short fuse.
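The alternative to a hardcoded IP is to look the address up by service name at call time. The toy in-memory `ServiceRegistry` below is a hypothetical stand-in for whatever discovery mechanism is actually in use (DNS, a Kubernetes Service, Consul and so on); the addresses are invented.

```python
class ServiceRegistry:
    """Toy registry: maps a service name to its currently live addresses."""

    def __init__(self):
        self._instances = {}   # name -> list of "host:port"
        self._next = {}        # name -> round-robin counter

    def register(self, name, address):
        instances = self._instances.setdefault(name, [])
        if address not in instances:
            instances.append(address)

    def deregister(self, name, address):
        if address in self._instances.get(name, []):
            self._instances[name].remove(address)

    def resolve(self, name):
        """Return the next live address for `name`, round-robin."""
        instances = self._instances.get(name, [])
        if not instances:
            raise LookupError(f"no live instances of {name!r}")
        index = self._next.get(name, 0) % len(instances)
        self._next[name] = index + 1
        return instances[index]


registry = ServiceRegistry()
registry.register("orders", "10.0.0.5:8080")
registry.register("orders", "10.0.0.6:8080")
first = registry.resolve("orders")
second = registry.resolve("orders")

# A deploy replaces one pod: the old address disappears, a new one appears.
registry.deregister("orders", "10.0.0.5:8080")
registry.register("orders", "10.0.0.9:8080")
after_deploy = {registry.resolve("orders") for _ in range(4)}
```

Callers only ever know the name `"orders"`, so the replaced address simply stops being returned.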
Fallacy 6: there is one administrator
Any real system depends on dozens of external components: AWS, Stripe, Twilio, Auth0, GitHub. Each has its own SLA, maintenance windows, and incidents. In 2021 the Fastly CDN outage ran 1 hour and dragged Reddit, GitHub, Financial Times, and The Guardian down - none of them controlled their CDN provider. Microservice coupling patterns are catalogued in the microservices lesson.
- **Graceful degradation:** when Stripe is unavailable - show cached data, do not fail completely
- **Fallback strategies:** alternative provider or simplified code path
- **SLA monitoring:** do not learn about external dependency problems from users
- **Contract testing:** test integration edge cases, not just happy path
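Graceful degradation and fallbacks pair naturally with a circuit breaker around the external call. The sketch below is a deliberately small version with an injectable clock; the class, the thresholds, and the `fetch_rates_from_provider` / `cached_rates` functions are all hypothetical.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for `reset_after` seconds."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()       # open: skip the dependency entirely
            self.opened_at = None       # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0               # success closes the circuit again
        return result


provider_calls = {"n": 0}


def fetch_rates_from_provider():
    provider_calls["n"] += 1
    raise ConnectionError("503 from external provider")


def cached_rates():
    return {"source": "cache", "EUR": 0.9}


now = [0.0]   # fake clock so the demo needs no real waiting
breaker = CircuitBreaker(max_failures=3, reset_after=30.0, clock=lambda: now[0])
responses = [breaker.call(fetch_rates_from_provider, cached_rates) for _ in range(5)]
calls_while_open = provider_calls["n"]

now[0] = 31.0   # reset window has passed: one trial call is allowed
breaker.call(fetch_rates_from_provider, cached_rates)
calls_after_trial = provider_calls["n"]
```

Users get cached data on every request, while the provider is contacted only three times before the breaker opens and once more after the reset window.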
Myth: if an external provider's SLA says 99.9% uptime, there will be no problems.
Reality: 99.9% uptime = 8.76 hours of downtime per year. With 10 external dependencies at 99.9% each, combined availability is roughly 99%, meaning around 87 hours of downtime per year.
System availability compounds multiplicatively, not additively. The more external dependencies, the lower the real-world availability even with good individual SLAs.
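The multiplication is short enough to verify directly. The sketch assumes the dependencies fail independently and that the system is down whenever any one of them is down, which is the pessimistic serial case; the function names are made up for the example.

```python
import math

HOURS_PER_YEAR = 24 * 365


def combined_availability(slas):
    """Availability of a chain where every dependency must be up."""
    return math.prod(slas)


def downtime_hours(availability):
    """Expected downtime per year for a given availability fraction."""
    return (1 - availability) * HOURS_PER_YEAR


single = downtime_hours(0.999)                    # one dependency
chain = combined_availability([0.999] * 10)       # ten dependencies in series
chain_downtime = downtime_hours(chain)
```

One dependency at 99.9% allows about 8.76 hours of downtime per year; ten of them in series give about 99.0% availability and roughly 87 hours.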
The payment service is down - Stripe returned 503. How should an e-commerce system respond?
Transport Is Free and the Network Is Homogeneous
**Uber, 2018. Internal microservice traffic migrated from REST/JSON to gRPC/Protobuf. Results: latency down 30-40%, serialization CPU cut in half, network traffic shrunk 5-10x.** That is the final pair of fallacies: transport is not free, and the network is not homogeneous.
Fallacy 7: transport is free
Network calls cost money for traffic, CPU for serialization/deserialization, memory for buffers, and time for latency. A 1KB JSON object actually burns 10-100x more resources than it looks once serialization, transmission, and deserialization are tallied.
| Format | Relative size (JSON = 100%) | CPU (ser/deser) | Human-readable |
|---|---|---|---|
| JSON | 100% | High | Yes |
| MessagePack | ~50% | Medium | No (binary) |
| Protobuf | ~20-30% | Low | No (binary, requires schema) |
| Avro | ~20% | Low | No (binary, requires schema) |
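The size gap between text and schema-based binary formats can be felt with the standard library alone. The sketch uses Python's `struct` module as a crude stand-in for a binary format: it is not Protobuf or Avro, and the `reading` record and its field layout are invented for the example. The point is only that when both sides agree on a schema, field names no longer travel with every message.

```python
import json
import struct

reading = {"sensor_id": 4211, "timestamp": 1700000000, "temperature": 21.5}

# Text format: field names and digits are sent as characters.
as_json = json.dumps(reading).encode("utf-8")

# Binary format with a schema agreed out of band:
# uint32 id, uint64 timestamp, float64 temperature, big-endian, no padding.
SCHEMA = ">IQd"
as_binary = struct.pack(
    SCHEMA, reading["sensor_id"], reading["timestamp"], reading["temperature"]
)

decoded = struct.unpack(SCHEMA, as_binary)
sizes = {"json": len(as_json), "binary": len(as_binary)}
```

The binary encoding is 20 bytes regardless of how the fields are named, while the JSON encoding grows with every key. The cost is that the bytes are meaningless without the schema.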
Fallacy 8: the network is homogeneous
Production runs Linux and Windows servers side-by-side, Java 8 next to Java 17, multiple Kubernetes versions, different encodings (UTF-8, Latin-1), big-endian and little-endian processors. A Node.js microservice talks to a Go service that pulls data from a Python script. Data format is a contract; break it and silent data corruption follows.
Real bug: endianness mismatch
Service on x86 (little-endian) writes int32 to a file: bytes [01 00 00 00] = number 1. Service on SPARC (big-endian) reads those same bytes: [01 00 00 00] = 16,777,216. Data crossed the boundary without an error and arrived as the wrong number. The same trap snaps shut on float, timestamp, and any binary format without an explicit byte order.
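The same bug can be reproduced in a few lines with Python's `struct` module, where `<` forces little-endian and `>` forces big-endian regardless of the machine running the code.

```python
import struct

written = struct.pack("<i", 1)                # little-endian writer: 01 00 00 00
misread = struct.unpack(">i", written)[0]     # big-endian reader, same bytes
correct = struct.unpack("<i", written)[0]     # reader that knows the byte order
```

Here `misread` is 16,777,216 and `correct` is 1. No exception is raised anywhere, which is what makes the mismatch dangerous.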
- **Standard formats:** JSON, Protobuf, Avro with explicit schema - byte order defined
- **API versioning:** `/v1/users`, `/v2/users` - do not break backward compatibility
- **Backward compatibility:** new fields are optional, old clients keep working
- **Contract testing:** Pact, Dredd - test the contract, not the implementation
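Backward compatibility in practice often comes down to a tolerant reader: new fields get defaults, unknown fields are ignored. A minimal sketch, assuming a hypothetical `User` payload where `locale` was added in a later version:

```python
import json
from dataclasses import dataclass, fields


@dataclass
class User:
    id: int
    name: str
    locale: str = "en"   # added later; optional, so old payloads still parse


def parse_user(payload):
    """Build a User from JSON, ignoring fields this version does not know."""
    data = json.loads(payload)
    known = {f.name for f in fields(User)}
    return User(**{k: v for k, v in data.items() if k in known})


# Payload from an old producer: no "locale" field.
old = parse_user('{"id": 1, "name": "Ann"}')

# Payload from a newer producer: carries a field this reader has never seen.
new = parse_user('{"id": 2, "name": "Bo", "locale": "de", "avatar_url": "x"}')
```

The old payload falls back to the default locale and the newer one parses without error even though `avatar_url` is unknown to this reader. A missing required field such as `id` would still fail loudly, which is usually the desired behavior.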
Myth: the 8 fallacies are historical artifacts from the 1990s, and modern cloud-native systems have solved them.
Reality: cloud-native tools (Kubernetes, service mesh) mitigate some fallacies but do not eliminate them - they shift responsibility to a different layer.
Kubernetes solves topology via service discovery. Istio handles mTLS. But bandwidth, latency, and network reliability are physical constraints. Homogeneity and administration have become more complex, not simpler: now there are also Kubernetes, Helm, Terraform, and multiple cloud providers.
Which of the 8 fallacies is the most common root cause of cascade failures?
Key Takeaways
- **Network is unreliable** - packets get lost, buffers overflow, sharks cut cables; defense: retry with exponential backoff, circuit breaker, timeout on every call
- **Latency is not zero** - in-memory ~1ns, same DC ~1-2ms, cross-region 50-150ms; solutions: batching, caching, async processing, CDN
- **Bandwidth is not infinite** - AWS cross-region traffic costs USD 0.02-0.09/GB; Protobuf instead of JSON gives 3-7x savings
- **Network is not secure** - packets can be intercepted at any hop; patterns: TLS everywhere, mTLS, zero-trust architecture
- **Topology changes** - IPs shift on deploy, failover, autoscaling; use service discovery, never hardcode addresses
- **Multiple administrators** - external dependencies (AWS, Stripe, CDN) operate on their own SLAs; 10 dependencies at 99.9% each yields ~99% combined availability
- **Transport is not free** - JSON burns CPU and memory; Protobuf: 70-80% smaller payload, significantly lower serialization CPU cost
- **Network is not homogeneous** - different OS, versions, encodings, endianness; explicit data contracts required: Protobuf/Avro with schema, API versioning
Related lessons
- ds-02-cap-theorem - Recognising network unreliability primes the CAP partition story
- sd-10-microservices - Every microservice mesh trips on the same eight assumptions
- net-15-tcp-basics - TCP retransmit and timeouts directly answer fallacies 1 and 2
- ds-01-intro - Lamport's failure model formalises Deutsch's assumptions
- st-04-leverage - Hidden assumptions in any complex system mirror these fallacies
- isd-11-load-balancing
- isd-09-caching-strategies
- net-01-intro