Real-Time Backend
Connection Lifecycle
Discord serves 19 million concurrent users. When a node restarts, millions of clients must reconnect within seconds, not hours. That is only possible if the connection lifecycle was designed correctly from the start.
- Socket.io uses exponential backoff with a default randomizationFactor of 0.5. That jitter is why reconnection after a failure arrives as a smooth ramp instead of one synchronized wave
- Discord Gateway clients send a heartbeat every 41,250 ms (the interval the server announces), with random jitter applied to the first beat. Without the jitter, millions of clients would synchronize and pound the server at the same time
- Kubernetes rolling deploys send SIGTERM to a pod and, by default, wait 30 seconds before killing it with SIGKILL. That grace period is exactly what WebSocket servers use for the DRAIN pattern
- Slack stores missed events in Redis: a client that reconnects within 2 minutes receives a delta, otherwise it gets a full channel snapshot
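The backoff-with-jitter idea from the first bullet can be sketched as a small delay calculator. This is a minimal illustration, not Socket.io's actual implementation: the `backoffDelay` helper and the `BackoffOptions` shape are hypothetical names, and the 1000 ms base and 5000 ms cap are assumed values chosen to resemble Socket.io's documented defaults.

```typescript
// Hypothetical helper for illustration; not part of Socket.io's public API.
interface BackoffOptions {
  baseMs: number; // delay before the first retry (assumed value)
  maxMs: number;  // upper bound on any single delay (assumed value)
  jitter: number; // randomization factor in the range 0..1
}

function backoffDelay(
  attempt: number,
  opts: BackoffOptions = { baseMs: 1000, maxMs: 5000, jitter: 0.5 },
  random: () => number = Math.random, // injectable so the result is testable
): number {
  // The delay doubles with every failed attempt: base, 2x base, 4x base, ...
  const exponential = opts.baseMs * 2 ** attempt;
  // Spread each client uniformly over [1 - jitter, 1 + jitter] times that delay,
  // so clients that failed together do not retry together.
  const spread = (random() * 2 - 1) * opts.jitter;
  return Math.min(opts.maxMs, Math.round(exponential * (1 + spread)));
}

// Print the delays one client would see for its first five attempts.
for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait ${backoffDelay(attempt)} ms`);
}
```

With a jitter of 0.5, two clients that disconnect at the same instant can differ by up to a factor of three in their first retry delay, which is what flattens the reconnect spike.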
Reconnection
A WebSocket connection can live from seconds to days, and it can break at any point in that time. A mobile client loses Wi-Fi, the load balancer restarts, the server falls over. Without automatic reconnect, the user just sees a stuck UI and leaves. Reconnection is not a nice-to-have. It is a baseline contract for a realtime app.
The WebSocket handshake is a regular HTTP Upgrade: the browser sends `GET /ws` with `Upgrade: websocket` and `Sec-WebSocket-Key` headers, and the server replies with `101 Switching Protocols`. After that, the same TCP connection carries WebSocket frames in both directions. A break is detected only when the TCP stack stops receiving ACKs from the other side. That can take anywhere from a second (active RST) to several minutes (keepalive timeout).
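To make the handshake concrete: in the `101` response the server proves it understood the upgrade by returning a `Sec-WebSocket-Accept` header derived from the client's `Sec-WebSocket-Key`. The header name and the fixed GUID come from RFC 6455, not from the text above. The sketch below assumes a Node.js runtime, and `computeAcceptKey` is a hypothetical helper name.

```typescript
import { createHash } from "node:crypto";

// Fixed GUID defined by RFC 6455; every WebSocket server uses the same value.
const WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11";

// Hypothetical helper: derives the Sec-WebSocket-Accept value that the server
// sends back alongside "101 Switching Protocols".
function computeAcceptKey(secWebSocketKey: string): string {
  // SHA-1 over the client's key concatenated with the GUID, base64-encoded.
  return createHash("sha1")
    .update(secWebSocketKey + WS_GUID)
    .digest("base64");
}

// Sample key taken from the RFC 6455 handshake example.
console.log(computeAcceptKey("dGhlIHNhbXBsZSBub25jZQ=="));
```

The browser checks this value and fails the connection if it does not match, so a plain HTTP server that does not speak WebSocket cannot be mistaken for one.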
The disconnect reason matters. In Socket.io, `io server disconnect` means the server intentionally closed the connection (for example, because of an invalid token). The client does not reconnect automatically in that case, and a blind retry would be pointless anyway. Show an error to the user, or refresh the credentials and then reconnect manually.
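One way to keep that decision explicit is a small function that maps the reason string to an action. This is a sketch under assumptions: `actionForDisconnect` and the `DisconnectAction` names are hypothetical, and only the reason strings themselves (`io server disconnect`, `io client disconnect`) are Socket.io's documented values.

```typescript
// Hypothetical action names for illustration only.
type DisconnectAction =
  | "wait-for-auto-reconnect" // the library retries on its own
  | "refresh-then-connect"    // fix credentials, then reconnect manually
  | "stay-closed";            // the app asked for the disconnect

function actionForDisconnect(reason: string): DisconnectAction {
  switch (reason) {
    case "io server disconnect":
      // The server closed the socket on purpose; retrying unchanged will fail again.
      return "refresh-then-connect";
    case "io client disconnect":
      // The client called disconnect() itself; nothing to recover from.
      return "stay-closed";
    default:
      // Network-level reasons such as "ping timeout" or "transport close".
      return "wait-for-auto-reconnect";
  }
}

console.log(actionForDisconnect("io server disconnect"));
console.log(actionForDisconnect("transport close"));
```

In a real client this would be called from the `disconnect` event handler, with the `refresh-then-connect` branch fetching a new token before calling `socket.connect()`.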