Real-Time Backend

Connection Lifecycle

Discord serves 19 million concurrent users. When a node restarts, millions of clients must reconnect within seconds, not hours. That is only possible if the connection lifecycle was designed correctly from the start.

  • Socket.io uses exponential backoff with a default randomizationFactor of 0.5. That randomness is why reconnection after a failure arrives as a smooth ramp instead of a single wave
  • Discord Gateway clients send a heartbeat every 41,250 ms plus random jitter. Without the jitter, millions of clients would synchronize and hit the server at the same instant
  • Kubernetes rolling deploys send SIGTERM to a pod and, by default, wait 30 seconds before sending SIGKILL. That grace period is what WebSocket servers use for the DRAIN pattern
  • Slack stores missed events in Redis: a client that reconnects within 2 minutes receives a delta, otherwise it gets a full channel snapshot

Reconnection

A WebSocket connection can live from seconds to days, and it can break at any point during that time. A mobile client loses Wi-Fi, the load balancer restarts, the server falls over. Without automatic reconnect, the user just sees a stuck UI and leaves. Reconnection is not a nice-to-have. It is a baseline contract for a realtime app.

The WebSocket handshake is a regular HTTP Upgrade: the browser sends `GET /ws` with `Upgrade: websocket`, `Connection: Upgrade`, and `Sec-WebSocket-Key` headers, and the server replies with `101 Switching Protocols`. After that, the same TCP connection carries WebSocket frames in both directions. A break is noticed either when the peer sends an RST (about a second) or when the TCP stack stops receiving ACKs and times out, which can take several minutes (keepalive timeout).

The disconnect reason matters. In Socket.io, `io server disconnect` means the server intentionally closed the connection (for example, because of an invalid token). The client does not reconnect automatically in that case, and a blind retry is pointless: show an error to the user or refresh the credentials before connecting again.
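A minimal sketch of that decision, assuming the reason strings the Socket.io client passes to its `disconnect` event. The `decideReconnect` helper and the action names are hypothetical and exist only for illustration.

```typescript
// Hypothetical action names; a real app would map them to its own UI and auth flows.
type ReconnectAction = "refresh-credentials" | "stay-closed" | "retry-with-backoff";

// Maps a Socket.io disconnect reason to the next step the client should take.
function decideReconnect(reason: string): ReconnectAction {
  switch (reason) {
    case "io server disconnect":
      // The server closed the socket on purpose: retrying with the same
      // credentials would only be rejected again.
      return "refresh-credentials";
    case "io client disconnect":
      // The app itself called disconnect(); nothing to recover.
      return "stay-closed";
    case "ping timeout":
    case "transport close":
    case "transport error":
    default:
      // Transport-level break: the network or the server went away.
      return "retry-with-backoff";
  }
}
```

Treating unknown reasons as transport breaks is an assumption of this sketch; a stricter client could log them and stay closed instead.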

After a WebSocket disconnect, the client received reason='io server disconnect'. What should the client logic do?

Backoff and jitter

Scenario: the server goes down and 50,000 clients receive a disconnect at the same instant. Without jitter, they all try to reconnect after the same delay and hammer the server the moment it comes back up. This is the thundering herd. Exponential backoff plus jitter fixes it: every client waits a random time in [0, min(cap, base * 2^attempt)].

Discord uses a heartbeat with a 41,250 ms period plus random jitter up to 2,500 ms. These are not arbitrary numbers. The period is tuned so the gateway can process millions of connections without load spikes. If a heartbeat does not get an ACK within one period, the connection is considered dead and reconnect begins.
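The heartbeat rule described above can be sketched as follows. This is an illustration built from the numbers in the text, not Discord's client code: `nextHeartbeatDelayMs` and `HeartbeatMonitor` are hypothetical names, and the random source is injected so the behavior is reproducible.

```typescript
// Delay before the next heartbeat: fixed period plus random jitter in [0, maxJitterMs).
function nextHeartbeatDelayMs(
  periodMs: number,
  maxJitterMs: number,
  random: () => number = Math.random,
): number {
  return periodMs + Math.floor(random() * maxJitterMs);
}

// Tracks whether the previous heartbeat was acknowledged before the next one is due.
class HeartbeatMonitor {
  private awaitingAck = false;

  // Call each time a heartbeat is due. Returns "reconnect" when the previous
  // heartbeat never received an ACK, which means the connection is considered dead.
  onInterval(): "send" | "reconnect" {
    if (this.awaitingAck) {
      return "reconnect";
    }
    this.awaitingAck = true;
    return "send";
  }

  // Call when the server acknowledges a heartbeat.
  onAck(): void {
    this.awaitingAck = false;
  }
}
```

With the values from the text, `nextHeartbeatDelayMs(41250, 2500)` yields a delay between 41,250 ms and just under 43,750 ms.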

  • **Full Jitter**: delay is uniform in [0, d], where d = min(cap, base * 2^attempt); spreads mass reconnects the most evenly
  • **Equal Jitter**: delay is uniform in [d/2, d]; guarantees a minimum delay but spreads clients over a narrower window
  • **Decorrelated Jitter**: next delay = min(cap, random(base, prev * 3)); each delay depends on the client's previous random delay rather than on the attempt number

AWS recommends Full Jitter for most retry scenarios. Decorrelated Jitter spreads load slightly better but is harder to implement and debug.
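The three strategies can be sketched as small pure functions. The function names and the `Rng` type are illustrative; `attempt` is assumed to be 0-based and all delays are in milliseconds.

```typescript
// Source of randomness returning a value in [0, 1); injectable for testing.
type Rng = () => number;

// Exponential ceiling for the given attempt, limited by the cap.
function backoffCeiling(baseMs: number, capMs: number, attempt: number): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Full Jitter: uniform in [0, ceiling).
function fullJitter(baseMs: number, capMs: number, attempt: number, rng: Rng = Math.random): number {
  return rng() * backoffCeiling(baseMs, capMs, attempt);
}

// Equal Jitter: half of the ceiling is fixed, the other half is random.
function equalJitter(baseMs: number, capMs: number, attempt: number, rng: Rng = Math.random): number {
  const half = backoffCeiling(baseMs, capMs, attempt) / 2;
  return half + rng() * half;
}

// Decorrelated Jitter: uniform between base and three times the previous delay, capped.
function decorrelatedJitter(baseMs: number, capMs: number, prevMs: number, rng: Rng = Math.random): number {
  return Math.min(capMs, baseMs + rng() * (prevMs * 3 - baseMs));
}
```

For example, with `baseMs = 1000`, `capMs = 30000` and `attempt = 3`, the ceiling is 8,000 ms, so Full Jitter picks a delay in [0, 8000) and Equal Jitter picks one in [4000, 8000).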

10,000 clients lose their connection at the same time. Which reconnect strategy minimizes the thundering herd?

State recovery

Reconnecting restores the TCP connection but not the application state. During the outage the server may have pushed 50 chat messages, updated the game score, and moved cursor positions. The client does not know what it missed. Proper state recovery is not just "reconnected". It is "reconnected and got everything I missed".

Since version 4.6, Socket.io supports connection state recovery through the `connectionStateRecovery` server option (it is disabled by default). The server holds a buffer of events (in memory or in an adapter that supports recovery, such as the Redis Streams adapter) and, on reconnect within the configured window (2 minutes by default), automatically replays the missed events. The client sees `socket.recovered === true` if recovery succeeds.

  1. The client stores `lastEventId` locally (localStorage, IndexedDB)
  2. On reconnect, it sends `lastEventId` in auth or query params
  3. The server checks: if the event is in the buffer, it sends a delta; if the buffer has expired, it sends a snapshot
  4. The client applies the delta idempotently (duplicates must be ignored)
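The four steps above can be sketched as a server-side buffer plus an idempotent client. This is not Socket.io's internal implementation: `EventBuffer` and `ClientState` are hypothetical names, event ids are assumed to be monotonically increasing integers, and the buffer is bounded by size rather than by time to keep the example short.

```typescript
interface BufferedEvent {
  id: number;
  payload: string;
}

type RecoveryResponse =
  | { kind: "delta"; events: BufferedEvent[] }
  | { kind: "snapshot"; lastEventId: number };

// Server side: keeps the most recent events and decides between delta and snapshot.
class EventBuffer {
  private readonly events: BufferedEvent[] = [];
  private readonly maxSize: number;
  private nextId = 1;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  append(payload: string): BufferedEvent {
    const event: BufferedEvent = { id: this.nextId++, payload };
    this.events.push(event);
    if (this.events.length > this.maxSize) {
      this.events.shift(); // the oldest event falls out of the buffer
    }
    return event;
  }

  recover(lastEventId: number): RecoveryResponse {
    const latestId = this.nextId - 1;
    const first = this.events[0];
    const oldestId = first ? first.id : this.nextId;
    // A delta is possible only if the first event the client is missing is still buffered.
    if (lastEventId + 1 >= oldestId && lastEventId <= latestId) {
      return { kind: "delta", events: this.events.filter((e) => e.id > lastEventId) };
    }
    // Otherwise the client needs a full snapshot, tagged with the current position.
    return { kind: "snapshot", lastEventId: latestId };
  }
}

// Client side: applies events idempotently by ignoring ids it has already seen.
class ClientState {
  lastEventId = 0;
  readonly applied: string[] = [];

  apply(event: BufferedEvent): boolean {
    if (event.id <= this.lastEventId) {
      return false; // duplicate, already applied
    }
    this.applied.push(event.payload);
    this.lastEventId = event.id;
    return true;
  }
}
```

A time-bounded buffer, like the 2-minute window mentioned above, would follow the same shape with a timestamp check in place of the size check.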

A client reconnects after a 10-minute outage. The server holds a 2-minute event buffer. What should the server do?

Graceful disconnect

There are two ways to close a WebSocket: graceful and abrupt. Graceful means a WebSocket Close Frame with a code (1000 = normal closure, 1001 = server going away, 1008 = policy) plus a TCP FIN. Abrupt is a TCP RST with no warning, or just a timeout with no signal at all. The client behaves differently depending on what signal it sees.

TCP FIN is a polite goodbye: "I am done sending, but I can still receive". TCP RST is abrupt: "close right now, any undelivered data is lost". The WebSocket Close Frame is an application-level message carried over TCP and sent before the FIN. If the server process is just killed (`kill -9`), it never sends a Close Frame: the client sees the TCP connection drop (browsers report close code 1006) and must start reconnecting.

  • **Close code 1000**: normal closure, no reconnect needed
  • **Close code 1001**: server going away (deploy/restart), client should reconnect
  • **Close code 1008**: policy violation (invalid token), reconnect without refreshing credentials is pointless
  • **Close code 1011**: internal server error, reconnect with backoff is fine
  • **No Close Frame (RST/timeout)**: abrupt break, start reconnecting with backoff
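The list above can be sketched as a lookup from close code to client action. The `actionForCloseCode` helper and its action names are hypothetical; code 1006 is included on the assumption that the client runs in a browser, which reports it when the connection dropped without a Close Frame.

```typescript
// Hypothetical action names for illustration.
type CloseAction = "none" | "refresh-credentials-then-reconnect" | "reconnect-with-backoff";

// Maps a WebSocket close code to what the client should do next.
function actionForCloseCode(code: number): CloseAction {
  switch (code) {
    case 1000: // normal closure
      return "none";
    case 1008: // policy violation, e.g. an invalid token
      return "refresh-credentials-then-reconnect";
    case 1001: // server going away (deploy or restart)
    case 1011: // internal server error
    case 1006: // no Close Frame was received
    default:
      return "reconnect-with-backoff";
  }
}
```

Defaulting unknown codes to a backoff reconnect is a design choice of this sketch; an application with its own codes in the 4000-4999 range would add explicit cases for them.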

The DRAIN pattern is the standard for zero-downtime deploys: the orchestrator or process manager sends SIGTERM to the worker, the load balancer (nginx, HAProxy) stops routing new connections to it, and the worker notifies clients and waits a grace period (5-30s) before closing connections. In that window, clients can reconnect to another node without a visible break.

Myth: graceful shutdown is just `server.close()`, which stops accepting new connections.

Reality: graceful shutdown for WebSocket has three phases: notify clients, allow a grace period for reconnect, and only then close the remaining connections.

`server.close()` stops new connections but does not touch existing ones, which keep hanging. Without explicit notification, clients do not know to reconnect to another node. The DRAIN pattern fixes exactly this: it gives clients the time and information to migrate cleanly.
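A minimal sketch of the three phases, assuming a hypothetical `DrainableConnection` interface in place of a real WebSocket library. The clock is passed in explicitly so the logic can be driven from a SIGTERM handler and a timer, or from a test.

```typescript
// Hypothetical connection abstraction; a real server would wrap its WebSocket objects.
interface DrainableConnection {
  isOpen(): boolean;
  notify(message: string): void; // application-level "please reconnect" message
  close(code: number, reason: string): void;
}

class Drainer {
  private readonly connections: DrainableConnection[];
  private readonly graceMs: number;
  private deadlineMs: number | null = null;

  constructor(connections: DrainableConnection[], graceMs: number) {
    this.connections = connections;
    this.graceMs = graceMs;
  }

  // False once draining has started: the server should refuse new connections.
  get accepting(): boolean {
    return this.deadlineMs === null;
  }

  // Phase 1: called on SIGTERM. Notify every open connection and start the grace period.
  begin(nowMs: number): void {
    if (this.deadlineMs !== null) return;
    this.deadlineMs = nowMs + this.graceMs;
    for (const conn of this.connections) {
      if (conn.isOpen()) conn.notify("reconnect");
    }
  }

  // Phases 2 and 3: called periodically. Returns true when the process may exit.
  tick(nowMs: number): boolean {
    if (this.deadlineMs === null) return false;
    const stillOpen = this.connections.filter((conn) => conn.isOpen());
    if (stillOpen.length === 0) return true; // everyone migrated early
    if (nowMs < this.deadlineMs) return false; // grace period still running
    for (const conn of stillOpen) {
      conn.close(1001, "server going away");
    }
    return true;
  }
}
```

In a Node.js server, `begin(Date.now())` would typically run inside the SIGTERM handler and `tick(Date.now())` on a short interval until it returns true.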

The server received SIGTERM before a deploy. Which sequence delivers zero-downtime for WebSocket clients?

Key takeaways

  • **Reconnection is not automatic**: distinguish server-initiated disconnect (credential error) from transport-level break (network); reconnect logic differs
  • **Thundering herd kills a freshly recovered server**: exponential backoff with Full Jitter converts a spike of 50,000 simultaneous reconnects into a load curve spread across the backoff window (for example, 30 seconds with a 30-second cap)
  • **State recovery = lastEventId + server buffer**: the client stores position, the server decides delta (if buffered) or snapshot (if expired)
  • **DRAIN pattern for zero-downtime**: SIGTERM, notify clients, 5-30s grace period, close with Close Frame 1001

Related topics

Connection Lifecycle builds on a few foundational topics:

  • WebSocket protocol — Handshake and Close Frame are protocol details at the RFC 6455 level
  • WebSocket scaling — DRAIN pattern and state recovery are critical for horizontal scaling across multiple nodes
  • Heartbeat and keepalive — Heartbeat detects dead connections that trigger reconnect logic

Questions for reflection

  • How would your reconnect strategy change for an app running on flaky mobile networks where drops happen every few minutes?
  • Which data should go into the state recovery snapshot and which is better fetched via a separate REST request after reconnect?
  • How do you combine the DRAIN pattern with sticky sessions when a client is pinned to a specific node through the load balancer?

Related lessons

  • net-16-tcp-flow