Software Engineering

Chaos Engineering

August 2008: database corruption in Netflix's data center halts DVD shipping for three days. The lesson: data centers fail, so the infrastructure must survive failures on its own. Netflix begins its AWS migration, builds Chaos Monkey by 2010, announces the Simian Army in July 2011, and open-sources the code in 2012. Chaos Monkey randomly kills production instances - not in staging, not in QA, but in production during business hours. Industry reaction: they are insane. Outcome: Netflix becomes the reliability benchmark for cloud-native architecture. The Simian Army grows into a zoo of chaos tools. Principle: better to inject failures and learn than to wait for unplanned incidents.

  • **Netflix Simian Army**: Chaos Monkey (kills instances), Chaos Gorilla (takes out an entire availability zone), Chaos Kong (simulates the loss of an entire AWS region), Latency Monkey (injects network delays) - a full suite of chaos tools that Netflix uses to verify resilience on an ongoing basis.
  • **Amazon Prime Day**: Game Days are conducted months before the event. Teams simulate peak load with deliberate failures, finding bottlenecks before they affect hundreds of millions of customers.
  • **Facebook Codepath**: automatic fault injection on every staging deployment. If a service cannot survive a dependency failure in staging, it does not reach production.

Chaos Monkey

Chaos Monkey is a tool that Netflix built around 2010 during its migration to AWS. It randomly kills production instances to verify that the system recovers automatically. The idea: if you cannot prevent failures, make them happen often enough that surviving them becomes routine.

Chaos Engineering principles (Netflix): build a hypothesis about system behavior in steady state, introduce turbulence (kill instance, inject latency), observe whether behavior matches the hypothesis. If not - the hypothesis was wrong and a real vulnerability was found before a customer encountered it.
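
The hypothesis, turbulence, observation loop can be sketched as a small simulation. This is a toy model, not Netflix's implementation: `Cluster`, `Instance`, `success_rate`, and `run_experiment` are hypothetical names, and "steady state" is assumed to mean that at least 99% of requests succeed.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    healthy: bool = True


class Cluster:
    """Toy stand-in for a service backed by several redundant instances."""

    def __init__(self, size: int) -> None:
        self.instances: List[Instance] = [Instance(f"i-{n}") for n in range(size)]

    def handle_request(self) -> bool:
        # A request succeeds as long as at least one instance is healthy.
        return any(i.healthy for i in self.instances)

    def kill_random_instance(self, rng: random.Random) -> Instance:
        victim = rng.choice([i for i in self.instances if i.healthy])
        victim.healthy = False
        return victim


def success_rate(cluster: Cluster, requests: int = 100) -> float:
    ok = sum(cluster.handle_request() for _ in range(requests))
    return ok / requests


def run_experiment(cluster: Cluster, threshold: float = 0.99, seed: int = 0) -> bool:
    """Return True if the steady-state hypothesis survives the injected failure."""
    rng = random.Random(seed)
    # 1. Hypothesis: the system is in steady state before we touch it.
    if success_rate(cluster) < threshold:
        raise RuntimeError("not in steady state; do not inject faults")
    # 2. Turbulence: kill one instance at random.
    cluster.kill_random_instance(rng)
    # 3. Observation: does the steady state still hold?
    return success_rate(cluster) >= threshold
```

With three instances the hypothesis holds after one kill; with a single instance it fails, which is exactly the kind of vulnerability the experiment is meant to expose.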

Netflix kills random production instances during business hours. Why specifically during business hours?

Fault Injection

Fault Injection is deliberately introducing errors into a system to test its response. Fault types: Resource Exhaustion (fill memory, exhaust CPU), Network Faults (latency injection, packet loss, partition), Dependency Faults (service unavailable, slow response, corrupt response), Infrastructure Faults (disk full, clock drift).
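
As an illustration of dependency and network faults at the application level, the sketch below wraps a function call and adds latency or raises an error with a given probability. The decorator `inject_faults`, the exception `InjectedFault`, and the example function `get_price` are all hypothetical; real tools usually inject faults at the network or infrastructure layer instead.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real dependency failure."""


def inject_faults(latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Decorator factory: delay every call and fail a fraction of them."""
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # network fault: added latency
            if rng.random() < failure_rate:
                # dependency fault: service unavailable
                raise InjectedFault(f"{func.__name__} unavailable (injected)")
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_faults(latency_s=0.01, failure_rate=0.0)
def get_price(item_id: str) -> int:
    # Hypothetical dependency call; returns a fixed price for the demo.
    return 42
```

The `sleep` and `rng` parameters exist so that tests can substitute a fake sleep and a seeded random generator instead of waiting or relying on chance.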

Tools: Chaos Toolkit (open source, declarative experiment descriptions in JSON or YAML), AWS Fault Injection Simulator (managed service with built-in guardrails), Gremlin (commercial, broad fault library), Litmus (Kubernetes-native, CNCF project). These tools provide stop conditions or comparable safeguards, for example: if the error rate exceeds 5%, automatically stop the experiment.
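
The 5% error-rate stop condition can be sketched independently of any particular tool. The names `StopCondition` and `run_with_guardrail` are hypothetical, and the sliding window and minimum sample count are assumptions added so that a single early error does not abort the experiment.

```python
from collections import deque
from typing import Callable


class StopCondition:
    """Tracks recent request outcomes and signals when to abort."""

    def __init__(self, max_error_rate: float = 0.05, window: int = 100,
                 min_samples: int = 20) -> None:
        self.max_error_rate = max_error_rate
        self.min_samples = min_samples
        self.results = deque(maxlen=window)  # sliding window of True/False

    def record(self, ok: bool) -> None:
        self.results.append(ok)

    def error_rate(self) -> float:
        if not self.results:
            return 0.0
        return self.results.count(False) / len(self.results)

    def should_stop(self) -> bool:
        enough_data = len(self.results) >= self.min_samples
        return enough_data and self.error_rate() > self.max_error_rate


def run_with_guardrail(probe: Callable[[], bool], stop: StopCondition,
                       max_requests: int = 1000) -> str:
    """Send probe requests while the fault is active; abort if the guardrail trips."""
    for sent in range(1, max_requests + 1):
        stop.record(probe())
        if stop.should_stop():
            return f"aborted after {sent} requests"
    return "completed"
```

In a real experiment, aborting would also roll back the injected fault; that step is omitted here.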

Fault injection showed: with a 2-second delay in Payment Service, Order Service hangs for 30 seconds per request. What is the correct fix?

Game Days

A Game Day is a planned event where a team deliberately creates realistic failure scenarios and practices incident response. Unlike automated Chaos Engineering, Game Days involve the entire team: developers, SRE, product, sometimes even leadership.

DiRT (Disaster Recovery Testing) at Google: annual disaster recovery scenario tests. An entire team goes 'offline' to simulate a data center outage. The system must continue working. Amazon Prime Day: Game Days are conducted several months before the event, simulating peak load with deliberate failures.

A Game Day ran without a single problem - all systems worked normally. This is:

Resilience Patterns

Resilience is the system's ability to continue operating during partial failures. Graceful Degradation: serve reduced functionality when a dependency is unavailable. Circuit Breaker: automatically stop calling a failing dependency. Timeout: do not wait indefinitely for a response.
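
A minimal sketch of a Circuit Breaker combined with Graceful Degradation follows. It is one common formulation, not a reference implementation: the class name, the consecutive-failure threshold, and the single trial call after the reset timeout are assumptions, and the `fallback` callable stands for whatever reduced functionality the caller can still serve.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency and serves a fallback instead."""

    def __init__(self, failure_threshold=3, reset_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: skip the dependency entirely
            # Reset timeout elapsed: let one trial call through (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()  # graceful degradation
        # Success: close the circuit and forget past failures.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the dependency is not called at all, which protects both the caller (no waiting on a dead service) and the dependency (no extra load while it recovers).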

Bulkhead - resource isolation to prevent cascading failures. The analogy is a ship's compartments: flooding in one compartment does not sink the ship. In microservices: separate thread pools for different dependencies. Retry with Exponential Backoff: retry failed requests, but not immediately - wait longer with each attempt, ideally with random jitter, so that many clients do not retry in lockstep (the thundering herd problem).
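
Both patterns can be sketched in a few lines. Note the substitutions: the bulkhead below uses a semaphore to cap concurrent calls rather than a dedicated thread pool, and the retry helper adds full jitter on top of exponential backoff. `Bulkhead`, `BulkheadFull`, and `retry_with_backoff` are hypothetical names.

```python
import random
import threading
import time


class BulkheadFull(Exception):
    """Raised when the dependency's share of capacity is already in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot exhaust shared capacity."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull("no free slot; rejecting instead of queueing")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


def retry_with_backoff(func, attempts=4, base_delay_s=0.1, max_delay_s=5.0,
                       sleep=time.sleep, rng=None):
    """Call func, retrying on any exception with exponentially growing delays."""
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng.uniform(0, ceiling))  # full jitter: random delay up to the ceiling
```

Rejecting immediately when the bulkhead is full is a design choice; a caller that prefers to wait briefly could pass a timeout to `acquire` instead.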

Misconception: "Chaos Engineering is dangerous for production - it deliberately breaks the system"

Chaos Engineering is dangerous for production systems that cannot survive failures. The goal of Chaos Engineering is to find these systems before users do. A system that cannot survive controlled chaos will definitely not survive uncontrolled production failures.

The alternative to Chaos Engineering is not 'no failures' - it is 'failures discovered by users at 3am'. Netflix chose controlled experiments during business hours precisely because uncontrolled failures are worse.

Review Service is down. The product page should show reviews. What does Graceful Degradation mean in this scenario?

Key Ideas

  • **Chaos Monkey**: random instance killing in production during business hours - finds failure modes before users do, with the team ready to respond.
  • **Fault Injection**: deliberate network latency, resource exhaustion, dependency failures - reveals Circuit Breaker and timeout gaps that are invisible during normal operation.
  • **Resilience Patterns**: Graceful Degradation (serve partial functionality), Circuit Breaker (stop calling failing dependencies), Bulkhead (isolate thread pools) - each pattern is validated by chaos experiments.

Related Topics

Chaos Engineering works together with other reliability practices:

  • SRE: Site Reliability Engineering — Error budget: chaos experiments consume error budget but create resilience that extends future budget. Trade-off: short-term budget burn for long-term reliability
  • Observability: Logs, Metrics, Traces — Without good observability, chaos experiments are meaningless - there is nothing to observe. Traces and metrics show whether the system behaved as hypothesized

Questions for Reflection

  • Where to start introducing Chaos Engineering in a team that has never done it - and which scenario to run first?
  • Chaos in production vs chaos in staging: which scenarios can only be tested safely in staging, and which ones must run in production to be meaningful?
  • How to determine that the system is sufficiently resilient - and is absolute resilience to all failures even a desirable goal?

Related Lessons

  • sd-03-scalability