Distributed Systems

8 Fallacies of Distributed Computing

Lesson objectives

  • Know all 8 Deutsch fallacies and understand why each gets violated in production
  • Apply defense patterns: retry, circuit breaker, idempotency, service discovery
  • Understand network traffic costs and choose serialization format deliberately
  • Design systems accounting for heterogeneity and multiple administrators

Prerequisites

  • Basic understanding of HTTP and network protocols
  • Experience working with REST APIs
  • Basic familiarity with microservice architecture

Peter Deutsch wrote down the first seven fallacies in 1994, and James Gosling later added the eighth. Thirty years later, not one of them has stopped being true, and companies still pay millions for the same mistakes.

  • **Amazon Prime Day 2012** - network buffers saturated, millions of requests silently lost (fallacy 1)
  • **Netflix** - saved tens of millions per year after switching to Protobuf (fallacy 3)
  • **Juniper Networks backdoor (2015)** - VPN traffic intercepted inside corporate networks for years (fallacy 4)
  • **Fastly CDN outage (2021)** - for about 1 hour Reddit, GitHub, and the Financial Times were down because they all depended on a single provider (fallacy 6)
  • **Uber gRPC migration (2018)** - latency down 30-40%, network traffic reduced 5-10x, serialization CPU usage halved (fallacies 3, 7)

Peter Deutsch and the 8 Fallacies

Peter Deutsch is one of the creators of Smalltalk and the developer of Ghostscript. In 1994, working at Sun Microsystems, he listed the first 7 fallacies on a whiteboard during a discussion about problems in distributed object systems. James Gosling (creator of Java) later added the 8th. The list was never formally published - it spread as an oral tradition and was only formally written up in 2006 by Arnon Rotem-Gal-Oz.

The Network Is Reliable and Latency Is Zero

**2012 Amazon Prime Day. Traffic spiked 4x past forecast. Network buffers saturated in seconds. Millions of requests vanished silently - no errors, just silence. Services sat waiting for replies that never came.** That is the first two fallacies playing out live: the network is not reliable, and latency is not zero. The same blind spots resurface under partitions in the CAP lesson.

The **8 fallacies of distributed computing** are false assumptions that developers smuggle into production. Deutsch framed the first seven at Sun Microsystems in 1994, and James Gosling (creator of Java) later added the eighth. Every one of them gets violated eventually.

Fallacy 1: the network is reliable

Packets get lost. Connections drop. Switches fail. Buffers overflow. In 2008, a shark bit through an undersea cable near Egypt and 25% of India's internet traffic vanished instantly. The network is an unreliable transport, by definition.

| Pattern | Application |
|---|---|
| Retry with exponential backoff | Retry after 1s, 2s, 4s, 8s - not immediately |
| Idempotent operations | A repeated call with the same idempotency key has no duplicate effect |
| Circuit breaker | After N consecutive errors, stop sending requests and let the dependency recover |
| Timeout on every call | Without a timeout a thread hangs forever |
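The retry and idempotency rows above can be sketched together. This is a minimal illustration, not a production client: `TransientError`, `call_with_retry`, and the `operation(idempotency_key)` calling convention are hypothetical names, the small random jitter is an added assumption (the table only specifies the 1s, 2s, 4s, 8s schedule), and the operation itself is assumed to enforce its own timeout.

```python
import random
import time
import uuid


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (timeouts, 503s)."""


def call_with_retry(operation, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call operation(idempotency_key) until it succeeds or attempts run out.

    Delays grow as base_delay * 2**attempt (1s, 2s, 4s, 8s by default), plus
    up to 10% random jitter so many clients do not retry in lockstep.
    """
    # One key is reused for every attempt, so the server can recognise a
    # repeated delivery of the same logical request and avoid a duplicate effect.
    idempotency_key = str(uuid.uuid4())
    for attempt in range(max_attempts):
        try:
            return operation(idempotency_key)
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay * (2 ** attempt)
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the schedule can be observed in tests without actually waiting; in normal use the default `time.sleep` applies.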

Fallacy 2: latency is zero

An in-memory function call: ~1 nanosecond. A request to a service in the same data center: 1-2 milliseconds - 1,000,000x slower. Cross-region: 50-150ms. Cross-ocean: 100-300ms. Code that handles one network call gracefully falls apart at a thousand calls without architectural rework. The TCP mechanics behind the latency floor live in the TCP basics lesson.

| Pattern | Application |
|---|---|
| Batching | 100 requests in one call instead of 100 sequential ones |
| Caching | Frequent reads from memory, not over the wire |
| Async processing | Do not wait for the reply - enqueue and continue |
| CDN / edge | Data geographically closer to the user |
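A back-of-the-envelope model shows why batching matters. The sketch below assumes a fixed round-trip time per network call and treats per-item processing as negligible by default; `total_latency_ms` and its parameters are illustrative names, not part of any library.

```python
def total_latency_ms(calls, round_trip_ms, per_item_ms=0.0, batch_size=1):
    """Estimate wall-clock time for `calls` items sent sequentially in batches."""
    batches = -(-calls // batch_size)  # ceiling division without importing math
    return batches * round_trip_ms + calls * per_item_ms


# 1000 sequential calls at 2 ms each: the loop alone costs about 2 seconds.
sequential = total_latency_ms(calls=1000, round_trip_ms=2)

# The same 1000 items sent as 10 batches of 100: about 20 ms of round trips.
batched = total_latency_ms(calls=1000, round_trip_ms=2, batch_size=100)
```

The model ignores payload size and server-side cost, so real batches will not be quite this cheap, but the round-trip term usually dominates for small requests.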

Misconception: if the service works locally without errors, production will be the same.

Reality: localhost hides both fallacies, because the local network is reliable and has near-zero latency. Production is a different world.

That is exactly why chaos engineering (Netflix Chaos Monkey, 2010) deliberately injects failures into production. Systems must be designed for failures, not tested under ideal conditions.

Service A calls service B 1000 times sequentially in a loop. B latency is 2ms. What happens?

Bandwidth Is Infinite and the Network Is Secure

**2020. Netflix accounts for 15% of global internet traffic at peak hours. Switching from JSON to Protobuf cut payload size 3-7x. Savings: tens of millions of dollars per year in bandwidth alone.** That is fallacy 3: bandwidth is not infinite, and it costs real money.

Fallacy 3: bandwidth is infinite

| AWS traffic | Cost |
|---|---|
| Within the same AZ | Free |
| Between AZs in the same region | USD 0.01/GB |
| Between regions | USD 0.02-0.09/GB |
| To the internet | USD 0.09/GB |

100 TB of cross-region traffic per month costs USD 2,000 at the low end of that range (USD 0.02/GB) and up to USD 9,000 at the high end (USD 0.09/GB), just in transfer fees. With a microservice architecture every service chats with several others. Multiply across 100 services and bandwidth turns into a meaningful slice of the infrastructure bill.
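The transfer-fee arithmetic can be checked directly against the rates in the table. The helper below is a sketch: it assumes decimal terabytes (1 TB = 1000 GB) and a flat per-GB rate, whereas real billing may use tiers or different units.

```python
GB_PER_TB = 1000  # assumption: decimal units; actual billing units may differ


def monthly_transfer_cost_usd(terabytes, usd_per_gb):
    """Flat-rate estimate of data transfer fees for one month."""
    return terabytes * GB_PER_TB * usd_per_gb


# 100 TB of cross-region traffic at both ends of the quoted price range.
low = monthly_transfer_cost_usd(100, 0.02)   # about USD 2,000
high = monthly_transfer_cost_usd(100, 0.09)  # about USD 9,000
```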

  • **Compression:** gzip on HTTP, Protobuf or MessagePack instead of JSON (3-10x smaller)
  • **Pagination:** do not return 10,000 records when one page of 20 is needed
  • **Delta sync:** send only changes, not the full object
  • **Data locality:** process where data lives
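Two of the techniques above, pagination and compression, can be sketched with the standard library alone. The user records, field names, and page size here are made up for illustration; the point is only the relative sizes, which depend heavily on the data.

```python
import gzip
import json


def paginate(records, page, page_size=20):
    """Return one 1-indexed page of records instead of the whole list."""
    start = (page - 1) * page_size
    return records[start:start + page_size]


def encode(records, compress=True):
    """Serialize to compact JSON and optionally gzip the bytes."""
    raw = json.dumps(records, separators=(",", ":")).encode("utf-8")
    return gzip.compress(raw) if compress else raw


# Hypothetical dataset: 10,000 small user records.
users = [
    {"id": i, "name": f"user-{i}", "email": f"user-{i}@example.com"}
    for i in range(10_000)
]

full = encode(users, compress=False)                    # everything, uncompressed
page = encode(paginate(users, page=1), compress=False)  # one page of 20
page_gz = encode(paginate(users, page=1))               # one page, gzipped
```

Comparing `len(full)`, `len(page)`, and `len(page_gz)` shows that not sending the data at all (pagination) typically saves far more than compressing it.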

Fallacy 4: the network is secure

In 2015 Juniper Networks uncovered a backdoor in its VPN devices - someone had been decrypting all traffic through those boxes for years. Packets can be intercepted at any hop between source and destination. An internal corporate network does not make data secure on its own.

| Pattern | Application |
|---|---|
| TLS everywhere | Even inside the data center, between services |
| mTLS (mutual TLS) | Both parties authenticate each other with certificates |
| Zero-trust architecture | No network is trusted by default - every request is authenticated |
| Encryption at rest | Data on disk stays encrypted even if the disk is stolen |
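As a rough illustration of the mTLS row, the sketch below configures a server-side TLS context with Python's `ssl` module so that clients must present a certificate. The function name and the file-path parameters are hypothetical; a real deployment also needs certificate issuance and rotation, which a service mesh often handles instead.

```python
import ssl


def make_mtls_server_context(certfile=None, keyfile=None, client_ca_file=None):
    """Build a server-side context that demands a client certificate."""
    ctx = ssl.create_default_context(ssl.Purpose.CLIENT_AUTH)
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    # The "mutual" part: reject any client that cannot present a valid certificate.
    ctx.verify_mode = ssl.CERT_REQUIRED
    if certfile:
        # The server's own identity (paths are assumptions for this sketch).
        ctx.load_cert_chain(certfile=certfile, keyfile=keyfile)
    if client_ca_file:
        # The CA that signs certificates of clients allowed to connect.
        ctx.load_verify_locations(cafile=client_ca_file)
    return ctx
```

Such a context would then be passed to whatever server wraps the listening socket; without the certificate files loaded it only demonstrates the settings, not a working handshake.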

A microservice returns a JSON list of 500 users (~2KB each). Which optimization yields the biggest impact?

Topology Is Static and There Is One Administrator

**Kubernetes rolling update: 50 pods get swapped for the new version. In 2 minutes, IPs change on 50 components. A service with hardcoded IPs in its config loses half its connections.** That is fallacy 5: in production, topology never stops changing. The orchestration details live in the Kubernetes interview lesson.

Fallacy 5: topology does not change

IP addresses change on deploy, failover, autoscaling, migration. In a cloud environment a server can get torn down and recreated with a new IP at any moment. A hardcoded IP in config is technical debt with a very short fuse.
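The alternative to a hardcoded IP is to resolve a service name at call time. The toy below uses an in-memory `ServiceRegistry` as a hypothetical stand-in for DNS or a real registry; the class names, service name, and addresses are all invented for the example.

```python
import itertools


class ServiceRegistry:
    """Hypothetical in-memory stand-in for DNS or a service registry."""

    def __init__(self):
        self._instances = {}

    def register(self, name, addresses):
        """Replace the full set of live addresses for a service."""
        self._instances[name] = list(addresses)

    def resolve(self, name):
        addresses = self._instances.get(name)
        if not addresses:
            raise LookupError(f"no live instances for {name!r}")
        return list(addresses)


class Client:
    """Looks up the service by name on every call and round-robins."""

    def __init__(self, registry, service_name):
        self._registry = registry
        self._service_name = service_name
        self._counter = itertools.count()

    def pick_address(self):
        # Resolving per call means a rolling update is picked up immediately.
        addresses = self._registry.resolve(self._service_name)
        return addresses[next(self._counter) % len(addresses)]
```

Real clients usually cache lookups for a short TTL rather than resolving on every call; the per-call lookup here just keeps the sketch small.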

Fallacy 6: there is one administrator

Any real system depends on dozens of external components: AWS, Stripe, Twilio, Auth0, GitHub. Each has its own SLA, maintenance windows, and incidents. In 2021 the Fastly CDN outage ran 1 hour and dragged Reddit, GitHub, Financial Times, and The Guardian down - none of them controlled their CDN provider. Microservice coupling patterns are catalogued in the microservices lesson.

  • **Graceful degradation:** when Stripe is unavailable - show cached data, do not fail completely
  • **Fallback strategies:** alternative provider or simplified code path
  • **SLA monitoring:** do not learn about external dependency problems from users
  • **Contract testing:** test integration edge cases, not just happy path
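Graceful degradation and fallback are often combined with a circuit breaker. The sketch below is a deliberately small version: the thresholds, the `clock` parameter, and the `call(operation, fallback)` shape are assumptions for illustration, and it treats every exception as a dependency failure.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation, fallback):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                return fallback()  # open: do not touch the dependency at all
            # Timeout elapsed (half-open): let one trial call through below.
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

For a payment provider, `operation` might be the live charge request and `fallback` a function that queues the order for later processing or returns cached data, depending on what the business can tolerate.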

Misconception: if an external provider's SLA says 99.9% uptime, there will be no problems.

Reality: 99.9% uptime still allows 8.76 hours of downtime per year. With 10 external dependencies at 99.9% each, combined availability is roughly 99%, meaning around 87 hours of downtime per year.

System availability compounds multiplicatively, not additively. The more external dependencies, the lower the real-world availability even with good individual SLAs.
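The multiplication is easy to verify. The sketch below assumes the dependencies fail independently and that the system needs all of them at once, which is the simplification behind the figures above; the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8760


def combined_availability(availabilities):
    """Availability of a system that needs every dependency up at once."""
    result = 1.0
    for availability in availabilities:
        result *= availability
    return result


def downtime_hours_per_year(availability):
    return (1.0 - availability) * HOURS_PER_YEAR


single = downtime_hours_per_year(0.999)        # about 8.76 hours
ten_deps = combined_availability([0.999] * 10)  # about 0.990
ten_deps_downtime = downtime_hours_per_year(ten_deps)  # about 87 hours
```

Correlated failures (for example, several providers sharing one cloud region) break the independence assumption, so the real figure can differ in either direction.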

The payment service is down - Stripe returned 503. How should an e-commerce system respond?

Transport Is Free and the Network Is Homogeneous

**Uber, 2018. Internal microservice traffic migrated from REST/JSON to gRPC/Protobuf. Results: latency down 30-40%, serialization CPU cut in half, network traffic shrunk 5-10x.** That is the final pair of fallacies: transport is not free, and the network is not homogeneous.

Fallacy 7: transport is free

Network calls cost money for traffic, CPU for serialization/deserialization, memory for buffers, and time for latency. A 1KB JSON object actually burns 10-100x more resources than it looks once serialization, transmission, and deserialization are tallied.

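| Format | Size (example, relative to JSON) | CPU (ser/deser) | Human-readable |
|---|---|---|---|
| JSON | 100% | High | Yes |
| MessagePack | ~50% | Medium | No (binary) |
| Protobuf | ~20-30% | Low | Requires schema |
| Avro | ~20% | Low | Requires schema |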
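The size gap in the table comes largely from the fact that schema-based binary formats do not repeat field names or spell numbers out as text. The sketch below uses the standard-library `struct` module as a simplified stand-in for a schema-based encoding; it is not Protobuf or Avro, and the record layout and field names are invented for the example.

```python
import json
import struct

# Hypothetical fixed schema, little-endian with no padding:
# user_id (uint32), latitude (float64), longitude (float64), timestamp (uint64)
RECORD = struct.Struct("<IddQ")


def encode_json(user_id, latitude, longitude, timestamp):
    """Self-describing text encoding: field names travel with every record."""
    return json.dumps({
        "user_id": user_id,
        "latitude": latitude,
        "longitude": longitude,
        "timestamp": timestamp,
    }).encode("utf-8")


def encode_binary(user_id, latitude, longitude, timestamp):
    """Schema-based encoding: both sides must already agree on the layout."""
    return RECORD.pack(user_id, latitude, longitude, timestamp)


sample = (42, 55.7558, 37.6173, 1_700_000_000)
as_json = encode_json(*sample)
as_binary = encode_binary(*sample)  # always RECORD.size bytes
```

The binary record is a fixed 28 bytes here, while the JSON version is several times larger; the price is that the bytes are meaningless without the schema.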

Fallacy 8: the network is homogeneous

Production runs Linux and Windows servers side-by-side, Java 8 next to Java 17, multiple Kubernetes versions, different encodings (UTF-8, Latin-1), big-endian and little-endian processors. A Node.js microservice talks to a Go service that pulls data from a Python script. Data format is a contract; break it and silent data corruption follows.

Real bug: endianness mismatch

Service on x86 (little-endian) writes int32 to a file: bytes [01 00 00 00] = number 1. Service on SPARC (big-endian) reads those same bytes: [01 00 00 00] = 16,777,216. Data crossed the boundary without an error and arrived as the wrong number. The same trap snaps shut on float, timestamp, and any binary format without an explicit byte order.
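The mismatch can be reproduced on any machine by stating the byte order explicitly with Python's `struct` module, where `<` means little-endian and `>` means big-endian.

```python
import struct

# Writer encodes the int32 value 1 as little-endian, as x86 would natively.
payload = struct.pack("<i", 1)  # bytes 01 00 00 00

# Reader that wrongly assumes big-endian gets a valid but wrong number.
misread = struct.unpack(">i", payload)[0]

# Reader that uses the agreed byte order recovers the original value.
correct = struct.unpack("<i", payload)[0]
```

No exception is raised in either case, which is exactly why this class of bug is silent: `misread` is 16,777,216 and `correct` is 1.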

  • **Standard formats:** JSON, Protobuf, Avro with explicit schema - byte order defined
  • **API versioning:** `/v1/users`, `/v2/users` - do not break backward compatibility
  • **Backward compatibility:** new fields are optional, old clients keep working
  • **Contract testing:** Pact, Dredd - test the contract, not the implementation
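Backward compatibility on the reading side is sometimes called a tolerant reader: ignore fields you do not know and default the ones that are missing. The sketch below is one possible shape for it; `UserV2`, its fields, and the `locale` default are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass
class UserV2:
    id: int
    name: str
    locale: str = "en"  # hypothetical field added in v2; optional with a default


def parse_user(payload: dict) -> UserV2:
    """Build a UserV2 from a payload, dropping keys this version does not know."""
    known = {f.name for f in fields(UserV2)}
    return UserV2(**{key: value for key, value in payload.items() if key in known})


# A v1 producer omits `locale`; a newer producer sends an extra field.
from_old_producer = parse_user({"id": 1, "name": "Ann"})
from_new_producer = parse_user({"id": 1, "name": "Ann", "locale": "de", "beta": True})
```

Required fields that are missing still raise a `TypeError`, which is the desired behavior: only additions are tolerated, not removals of fields the reader depends on.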

Misconception: the 8 fallacies are historical artifacts from the 1990s, and modern cloud-native systems have solved them.

Reality: cloud-native tools (Kubernetes, service mesh) mitigate some fallacies but do not eliminate them - they shift responsibility to a different layer.

Kubernetes solves topology via service discovery. Istio handles mTLS. But bandwidth, latency, and network reliability are physical constraints. Homogeneity and administration have become more complex, not simpler: now there is also Kubernetes, Helm, Terraform, and multiple cloud providers.

Which of the 8 fallacies is the most common root cause of cascade failures?

Key Takeaways

  • **Network is unreliable** - packets get lost, buffers overflow, sharks cut cables; defense: retry with exponential backoff, circuit breaker, timeout on every call
  • **Latency is not zero** - in-memory ~1ns, same DC ~1-2ms, cross-region 50-150ms; solutions: batching, caching, async processing, CDN
  • **Bandwidth is not infinite** - AWS cross-region traffic costs USD 0.02-0.09/GB; Protobuf instead of JSON gives 3-7x savings
  • **Network is not secure** - packets can be intercepted at any hop; patterns: TLS everywhere, mTLS, zero-trust architecture
  • **Topology changes** - IPs shift on deploy, failover, autoscaling; use service discovery, never hardcode addresses
  • **Multiple administrators** - external dependencies (AWS, Stripe, CDN) operate on their own SLAs; 10 dependencies at 99.9% each yields ~99% combined availability
  • **Transport is not free** - JSON burns CPU and memory; Protobuf: 70-80% smaller payload, significantly lower serialization CPU cost
  • **Network is not homogeneous** - different OS, versions, encodings, endianness; explicit data contracts required: Protobuf/Avro with schema, API versioning

Related lessons

  • ds-02-cap-theorem — Recognising network unreliability primes the CAP partition story
  • sd-10-microservices — Every microservice mesh trips on the same eight assumptions
  • net-15-tcp-basics — TCP retransmit and timeouts directly answer fallacies 1 and 2
  • ds-01-intro — Lamport's failure model formalises Deutsch's assumptions
  • st-04-leverage — Hidden assumptions in any complex system mirror these fallacies
  • isd-11-load-balancing
  • isd-09-caching-strategies
  • net-01-intro