Real-Time Backend

Chaos Engineering

The service went down on Friday at 18:00. The cause: a race condition during a network partition that does not reproduce in staging. Netflix solved this problem the radical way: break production every business day until the system learns to survive.

Netflix's Chaos Monkey has been killing production EC2 instances every business day since 2011, which led to the creation of chaos engineering as a discipline.
Gremlin found that 60% of companies discover critical bugs within the first 30 days of chaos engineering; most of those bugs had existed unnoticed for years.
Amazon runs an annual GameDay: several hours of managed chaos in production. Each one surfaces 3-7 serious resilience issues.

Network Partitions: how the system behaves when the network breaks

**Netflix, 2011.** Engineers began deliberately killing production EC2 instances at random during business hours. Not at 3 AM, but in the middle of the day, when the team could react quickly. The tool was called Chaos Monkey. The idea: if the system cannot survive failures under controlled conditions, it will certainly crash under uncontrolled ones.

A network partition is when two sets of nodes cannot talk to each other, while each set still works internally. For a WebSocket service it means: some users are connected to nodes A-B-C, others to nodes D-E-F, and there is no link between the groups. What happens to a group chat?
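To make the group chat question concrete, here is a minimal sketch that models which members receive a message before and during the partition. The `Member` type, the `recipients` function, and the user names are hypothetical, and the model assumes the simplest possible design: nodes forward messages only to nodes they can reach, with no buffering or replay.

```typescript
type NodeId = string;

interface Member {
  user: string;
  node: NodeId; // node holding this user's WebSocket connection
}

// Returns the partition group that contains `node` (or the node alone if unlisted).
function groupOf(node: NodeId, partitions: NodeId[][]): NodeId[] {
  return partitions.find((group) => group.includes(node)) ?? [node];
}

// Users who receive a message from `sender`: only those connected to
// nodes reachable from the sender's node.
function recipients(sender: Member, members: Member[], partitions: NodeId[][]): string[] {
  const reachable = new Set(groupOf(sender.node, partitions));
  return members
    .filter((m) => m.user !== sender.user && reachable.has(m.node))
    .map((m) => m.user);
}

const alice: Member = { user: "alice", node: "A" };
const members: Member[] = [
  alice,
  { user: "bob", node: "C" },
  { user: "carol", node: "D" },
  { user: "dave", node: "F" },
];

const healthy: NodeId[][] = [["A", "B", "C", "D", "E", "F"]];
const split: NodeId[][] = [["A", "B", "C"], ["D", "E", "F"]];

const beforePartition = recipients(alice, members, healthy); // ["bob", "carol", "dave"]
const duringPartition = recipients(alice, members, split);   // ["bob"]
```

In this model the chat silently splits into two rooms: nobody gets an error, and carol and dave simply never see alice's message. A chaos experiment around partitions is largely about checking whether the real service detects and repairs that divergence once the link returns.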

Gremlin (a commercial chaos engineering platform) reported that 60% of companies uncover critical bugs in their systems within the first 30 days of chaos engineering. Most of those bugs had existed for years but never showed up under normal conditions.

Why did Netflix run Chaos Monkey during business hours rather than at night?

Slow Consumers: when one client slows everyone down

A slow consumer is a client that reads messages more slowly than the server sends them, so its send buffer fills up. If the server waits for that client, the thread serving it blocks. When 1000 clients share a single thread (as with the Node.js event loop), waiting on one slow client slows down everyone.

Chaos test for slow consumers: artificially delay reading on a test client by 10 seconds. Verify: other clients receive messages without delay, the slow client is eventually disconnected, and server memory does not grow into OOM.

Discord wrote on its Engineering Blog in 2020 about how a slow Discord overlay (the in-game HUD) caused a server memory leak. A single client with the overlay enabled accumulated more than 50 MB of queued messages per hour. After a MAX_QUEUED_MESSAGES guard was added, the issue went away.
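A bounded per-client queue in the spirit of a MAX_QUEUED_MESSAGES guard might look like the sketch below. This is not Discord's code: the `ClientQueue` class, the `disconnect` callback, and the limit of 1000 are assumptions chosen for illustration.

```typescript
// Assumed limit; in practice it depends on message size and memory budget.
const MAX_QUEUED_MESSAGES = 1000;

class ClientQueue {
  private queue: string[] = [];
  private closed = false;
  private readonly disconnect: (reason: string) => void;
  private readonly limit: number;

  // `disconnect` is a hypothetical callback that closes the client's socket.
  constructor(disconnect: (reason: string) => void, limit: number = MAX_QUEUED_MESSAGES) {
    this.disconnect = disconnect;
    this.limit = limit;
  }

  // Called by the broadcaster. Never waits, so other clients are unaffected.
  enqueue(message: string): boolean {
    if (this.closed) return false;
    if (this.queue.length >= this.limit) {
      this.closed = true;
      this.queue = []; // release the memory held for this client
      this.disconnect("slow consumer: outbound queue overflow");
      return false;
    }
    this.queue.push(message);
    return true;
  }

  // Called when the socket is writable again; returns up to `max` messages to send.
  drain(max: number): string[] {
    return this.queue.splice(0, max);
  }

  get size(): number {
    return this.queue.length;
  }

  get isClosed(): boolean {
    return this.closed;
  }
}

// Usage: a client that never drains is disconnected once the limit is reached.
const disconnectReasons: string[] = [];
const slowClient = new ClientQueue((reason) => disconnectReasons.push(reason), 3);
for (let i = 0; i < 5; i++) {
  slowClient.enqueue(`msg-${i}`);
}
```

The design choice is to drop the connection instead of dropping individual messages: a reconnecting client can resynchronize from a known state, whereas a client with silent gaps in its stream cannot tell that anything is missing.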

What is the right server behavior on detecting a slow consumer?

Node Failures: what is lost when a pod dies

A pod died. It was holding 50K WebSocket connections. They all break at once and try to reconnect. Within 30 seconds 50K clients try to establish new connections on the remaining pods. That is a thundering herd: a stampede crashing through the door.

Two layers of problems: **connection** (clients reconnect, solved by exponential backoff + jitter) and **state** (what was in memory of the dead pod: subscriptions, in-flight messages, user presence data).
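For the connection layer, a client-side reconnect delay could be computed as in the sketch below. It uses the "full jitter" variant (a uniformly random delay between zero and the exponential ceiling), which is one common choice among several. The 500 ms base, the 30 s cap, and all function names are assumptions.

```typescript
// Exponential backoff without jitter: every client computes the same delay.
function plainBackoffMs(attempt: number, baseMs: number = 500, capMs: number = 30000): number {
  return Math.min(capMs, baseMs * 2 ** attempt);
}

// Full-jitter backoff: a random delay in [0, ceiling).
// `random` is injectable so the function can be tested deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500,
  capMs: number = 30000,
  random: () => number = Math.random,
): number {
  const ceiling = plainBackoffMs(attempt, baseMs, capMs);
  return Math.floor(random() * ceiling);
}

// Number of distinct one-second buckets that `clients` reconnect attempts fall into.
function spread(clients: number, delayFn: () => number): number {
  const seconds = new Set<number>();
  for (let i = 0; i < clients; i++) {
    seconds.add(Math.floor(delayFn() / 1000));
  }
  return seconds.size;
}

// 50K clients on their fourth attempt (attempt index 3, ceiling 4000 ms).
const withoutJitter = spread(50000, () => plainBackoffMs(3)); // all in the same second
const withJitter = spread(50000, () => backoffDelayMs(3));    // spread over the 0-3 s buckets
```

Without jitter, backoff only postpones the stampede: all 50K clients wait the same 4 seconds and then arrive together. With jitter, the same clients are spread across the whole window, and the window doubles on each failed attempt until it reaches the cap.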

Why is jitter added to exponential backoff on reconnect?

Tools: Chaos Monkey, Toxiproxy, Litmus

Chaos engineering is not breaking random things. It is a scientific approach: **hypothesis** ('with one lost node, p99 latency will not exceed 500ms'), **experiment** (kill the node), **observation** (metrics), **conclusion** (confirmed or not).
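The hypothesis and conclusion steps can be written down as data and a check, as in the sketch below. The `Hypothesis` and `Conclusion` types, the nearest-rank percentile method, and the sample latencies are assumptions; the experiment step itself (killing the node) and metric collection are out of scope here.

```typescript
interface Hypothesis {
  description: string;
  percentile: number;  // e.g. 99 for p99
  thresholdMs: number; // the hypothesis holds if the observed value is at or below this
}

interface Conclusion {
  observedMs: number;
  confirmed: boolean;
}

// Nearest-rank percentile over latency samples collected during the experiment.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples collected");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)] as number;
}

function conclude(hypothesis: Hypothesis, latenciesMs: number[]): Conclusion {
  const observedMs = percentile(latenciesMs, hypothesis.percentile);
  return { observedMs, confirmed: observedMs <= hypothesis.thresholdMs };
}

const hypothesis: Hypothesis = {
  description: "with one lost node, p99 latency will not exceed 500ms",
  percentile: 99,
  thresholdMs: 500,
};

// Made-up observations: 100 requests measured while one node is down.
const mostlyFine = Array.from({ length: 99 }, () => 120).concat([900]);
const degraded = Array.from({ length: 98 }, () => 120).concat([900, 950]);

const resultA = conclude(hypothesis, mostlyFine); // { observedMs: 120, confirmed: true }
const resultB = conclude(hypothesis, degraded);   // { observedMs: 900, confirmed: false }
```

Writing the threshold down before the experiment is the point: in the second run, two slow requests out of a hundred are enough to refute the hypothesis, and that call is made by the check, not by someone eyeballing a dashboard afterwards.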

**Chaos Monkey** (Netflix OSS) - kills random EC2 instances or pods during business hours. The simplest tool; node failures only
**Toxiproxy** (Shopify) - TCP proxy with configurable faults: latency, jitter, bandwidth limit, connection reset. Ideal for testing network conditions
**Chaos Mesh / Litmus** (K8s-native) - pod kill, network chaos, CPU/memory stress, clock skew, and IO errors, all configured via CRDs
**Gremlin** (commercial) - managed chaos with rollback, attack templates, PagerDuty integration
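As an illustration of the Toxiproxy entry above, the sketch below builds request descriptors for its HTTP API: creating a proxy, adding a `latency` toxic, and adding a `timeout` toxic with `timeout: 0`, which drops all data without closing the connection and so approximates a partition. The proxy name, addresses, and helper functions are hypothetical, and the payload fields should be checked against the Toxiproxy version in use. Nothing is sent over the network here.

```typescript
interface ApiRequest {
  method: "POST";
  path: string;
  body: Record<string, unknown>;
}

// Route traffic through Toxiproxy: clients connect to `listen`, the real server is `upstream`.
function createProxy(name: string, listen: string, upstream: string): ApiRequest {
  return { method: "POST", path: "/proxies", body: { name, listen, upstream } };
}

// Delay data flowing to the client by latencyMs +/- jitterMs.
function addLatency(proxy: string, latencyMs: number, jitterMs: number): ApiRequest {
  return {
    method: "POST",
    path: `/proxies/${proxy}/toxics`,
    body: {
      type: "latency",
      stream: "downstream",
      toxicity: 1.0,
      attributes: { latency: latencyMs, jitter: jitterMs },
    },
  };
}

// With timeout 0, data is dropped and the connection stays open until the toxic is removed.
function addBlackhole(proxy: string): ApiRequest {
  return {
    method: "POST",
    path: `/proxies/${proxy}/toxics`,
    body: { type: "timeout", stream: "downstream", toxicity: 1.0, attributes: { timeout: 0 } },
  };
}

// Hypothetical experiment plan for a WebSocket backend listening on port 8080.
const plan: ApiRequest[] = [
  createProxy("ws_backend", "127.0.0.1:18080", "127.0.0.1:8080"),
  addLatency("ws_backend", 1000, 200),
  addBlackhole("ws_backend"),
];

// Each request would be sent to the Toxiproxy API address (localhost:8474 by default),
// for example with fetch(`http://localhost:8474${req.path}`, { method: req.method, body: JSON.stringify(req.body) }).
```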

Amazon runs an annual GameDay: several hours of deliberately staged chaos in production with the team standing by. After every GameDay 3-7 serious resilience issues come to light. AWS calls it 'building confidence through failure'.

Misconception: chaos engineering is about deliberately breaking everything to see what survives.

Reality: chaos engineering follows the scientific method (hypothesis -> controlled experiment -> measurement -> conclusion about resilience).

Without a clear hypothesis it is impossible to tell what is being verified and what is being broken. A real chaos experiment first defines the expected system behavior, then checks whether reality matches the expectation.

What is the right approach to running a chaos experiment?

Summary

Chaos engineering is a scientific method: hypothesis, experiment, metrics, conclusion - not random breakage
Slow consumers are handled with a bounded queue and a disconnect on overflow; otherwise queued messages leak memory until the server hits OOM
The reconnect storm from a dying pod is dampened by exponential backoff + jitter on the client
Toxiproxy injects network faults such as latency and connection resets; Chaos Mesh and Litmus add pod kills, network chaos, and CPU stress, all under controlled conditions

Questions for reflection

How do you define blast radius for a chaos experiment - what fraction of users may be affected in a safe test?
What is the steady state for a WebSocket service, and which metrics should be measured before and after chaos?
How do you set up automatic rollback of a chaos experiment when metrics exceed an acceptable range?