Real-Time Backend

WebSocket: Anatomy of a Protocol

2013. Slack is just picking up its first thousands of users. Every 3 seconds every browser knocks on the server: any new messages? The answer is almost always: no. That is short polling - and it is a disaster at scale. The team switches to WebSocket and measures load. A 40x drop. One persistent channel instead of thousands of one-shot requests. That is what made real-time presence - the green online indicator - possible at scale.

  • **Slack** - 40x server load reduction when switching from polling to WebSocket in 2013
  • **Figma** - collaborative editing: every cursor move, every selection travels as a small WS frame with 2-14 bytes of header overhead instead of an HTTP request carrying 500+ bytes of headers
  • **Binance** - order book updates 10-100 times per second, WebSocket is the only practical option
  • **GitHub Copilot** - token streaming over WebSocket: the user sees code as it is generated, not after

HTTP Upgrade: how HTTP turns into WebSocket

Recall the Slack numbers: a 40x drop in server load with the same number of users. Not 2x. Not 5x. Forty. The browser no longer fires an HTTP request into the void every 3 seconds to ask whether anything changed.

WebSocket starts as a plain HTTP/1.1 GET. This is not a coincidence - it is a deliberate maneuver. Corporate proxies, load balancers, CDNs - all of them understand HTTP. If WebSocket opened on its own port with its own protocol, half the corporate networks in the world would block it. Instead: a Trojan horse. Start as HTTP, then ask to switch.

Status `101 Switching Protocols` is one of the rarest HTTP codes. In the entire history of the web it is used almost exclusively for WebSocket. After this response the TCP connection does not close. It stops being HTTP. Same socket, same bytes - different protocol from here on.

`Sec-WebSocket-Key` is not encryption or authentication. It is protection against HTTP servers that might accidentally respond `101` without knowing what they are doing. The client generates 16 random bytes and sends them base64-encoded in `Sec-WebSocket-Key`. The server concatenates that string with the magic GUID `258EAFA5-E914-47DA-95CA-C5AB0DC85B11`, takes SHA-1 of the result, and returns the base64-encoded digest in `Sec-WebSocket-Accept`. The client verifies the value. Handshake confirmed.
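The key exchange described above can be sketched in a few lines of Python using only the standard library. The function names `make_client_key` and `accept_value` are illustrative, not part of any WebSocket library; the sample key and the expected result come from RFC 6455.

```python
import base64
import hashlib
import os

# Fixed GUID from RFC 6455, appended to the client's key before hashing.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def make_client_key() -> str:
    """Client side: 16 random bytes, base64-encoded (Sec-WebSocket-Key)."""
    return base64.b64encode(os.urandom(16)).decode("ascii")


def accept_value(client_key: str) -> str:
    """Server side: the value for the Sec-WebSocket-Accept response header."""
    digest = hashlib.sha1((client_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key from RFC 6455.
print(accept_value("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

The client runs the same computation on the key it sent and compares the result with the header it received. A server that does not implement WebSocket cannot produce the right value by accident.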

After `101`, HTTP is over. No more headers, no status codes, no `Content-Type`. Only WebSocket frames over a raw TCP connection. HTTP was the bootstrap mechanism - nothing more.

Why does WebSocket use HTTP to initiate the connection rather than its own port?

Framing: how WebSocket packages data

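Once the handshake completes, both ends speak frames. A WebSocket frame is a minimal wrapper: 2-14 bytes of header plus payload. The header is 2 bytes at minimum, plus 0, 2, or 8 bytes of extended length, plus a 4-byte masking key on frames sent by the client. Compare that to an HTTP request: 500-800 bytes of headers even when transmitting a single integer. A small server-to-client frame costs 2 bytes of header; the same frame from a client costs 6. That overhead gap, multiplied by all the requests that never have to be made, is where the 40x comes from.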

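Masking is another proxy defense mechanism. Without it, a malicious script could craft a payload whose bytes look like a plain HTTP request, and a confused caching proxy in the middle might act on it, store the attacker's response, and serve it to other clients. With masking, the client XORs the payload with a fresh random 4-byte key for every frame, so the script author cannot predict the bytes that appear on the wire. The key travels in the frame header, so this is not encryption. The client MUST mask. The server MUST NOT.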
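The masking step is a plain byte-wise XOR. A minimal sketch follows; `apply_mask` is an illustrative name, and the key and payload are the masked "Hello" example from RFC 6455.

```python
import os


def apply_mask(payload: bytes, key: bytes) -> bytes:
    """XOR each payload byte with key[i % 4]. Applying it twice restores the input."""
    if len(key) != 4:
        raise ValueError("masking key must be exactly 4 bytes")
    return bytes(b ^ key[i % 4] for i, b in enumerate(payload))


# RFC 6455 example: "Hello" masked with the key 37 fa 21 3d.
key = bytes.fromhex("37fa213d")
masked = apply_mask(b"Hello", key)
print(masked.hex())             # 7f9f4d5158
print(apply_mask(masked, key))  # b'Hello'

# A real client picks a fresh, unpredictable key for every frame.
fresh_key = os.urandom(4)
```

Because XOR is its own inverse, the server unmasks with the same function and the key it reads from the frame header.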

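Payload length encoding is compact. If the payload is <= 125 bytes, the length sits directly in the 7-bit field. If it is <= 65535 bytes, the field holds 126 and 2 additional bytes carry the length. Anything larger: the field holds 127 and 8 additional bytes carry the length. In practice, most real-time messages (cursor positions, presence updates, chat messages) fit the first two cases.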
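The three length cases can be sketched as a small header encoder. This is an illustration of the layout, not a full frame writer: `encode_header` is a made-up name, reserved bits are left at zero, and payload masking itself is not performed here.

```python
import struct


def encode_header(payload_len, opcode=0x1, fin=True, mask_key=None):
    """Build a WebSocket frame header for a payload of the given length."""
    first = (0x80 if fin else 0x00) | (opcode & 0x0F)   # FIN bit + opcode
    mask_bit = 0x80 if mask_key is not None else 0x00   # set on client frames
    if payload_len <= 125:
        header = struct.pack("!BB", first, mask_bit | payload_len)
    elif payload_len <= 0xFFFF:
        header = struct.pack("!BBH", first, mask_bit | 126, payload_len)
    else:
        header = struct.pack("!BBQ", first, mask_bit | 127, payload_len)
    return header + (mask_key or b"")


print(encode_header(5).hex())      # 8105
print(encode_header(300).hex())    # 817e012c
print(encode_header(70000).hex())  # 817f0000000000011170
print(encode_header(5, mask_key=bytes.fromhex("37fa213d")).hex())  # 818537fa213d
```

The last line shows the client-side cost: the 4-byte masking key follows the length, so even the smallest client frame carries a 6-byte header.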

Fragmentation: one large message can be split across multiple frames. FIN=0 means "more to come", FIN=1 means "this is the end". The sender can start transmitting without knowing the total size - streaming without buffering. The Node.js ws library and the browser WebSocket API hide this behind a single `message` event - the application sees the fully assembled message.
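The FIN and opcode bookkeeping can be sketched as follows. The names `fragment` and `reassemble` are illustrative. The sketch relies on one rule from RFC 6455: the first frame carries the data opcode and every later frame carries the continuation opcode (0x0).

```python
from typing import Iterable, Iterator, Tuple

OP_CONTINUATION, OP_TEXT, OP_BINARY = 0x0, 0x1, 0x2

Frame = Tuple[bool, int, bytes]  # (fin, opcode, payload chunk)


def fragment(message: bytes, chunk_size: int, opcode: int = OP_BINARY) -> Iterator[Frame]:
    """Split one message into frames of at most chunk_size payload bytes."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # An empty message still produces one (empty) final frame.
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)] or [b""]
    for index, chunk in enumerate(chunks):
        fin = index == len(chunks) - 1                   # FIN=1 only on the last frame
        yield fin, (opcode if index == 0 else OP_CONTINUATION), chunk


def reassemble(frames: Iterable[Frame]) -> bytes:
    """What a library does before firing a single `message` event."""
    return b"".join(chunk for _fin, _opcode, chunk in frames)


frames = list(fragment(b"abcdefgh", 3, OP_TEXT))
print(frames)  # [(False, 1, b'abc'), (False, 0, b'def'), (True, 0, b'gh')]
print(reassemble(frames))  # b'abcdefgh'
```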

Opcodes split frames into two classes: data (text, binary, continuation) and control (close, ping, pong). Control frames cannot be fragmented and have a hard limit: payload <= 125 bytes. Ping frames cannot be large - by design.
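The control-frame rules are easy to express as a check. A minimal sketch, with illustrative names (`is_control`, `validate_frame`): control opcodes are the ones with the high bit of the 4-bit opcode set.

```python
DATA_OPCODES = {0x0: "continuation", 0x1: "text", 0x2: "binary"}
CONTROL_OPCODES = {0x8: "close", 0x9: "ping", 0xA: "pong"}


def is_control(opcode: int) -> bool:
    """Control opcodes are 0x8-0xF: the top bit of the opcode nibble is set."""
    return (opcode & 0x8) != 0


def validate_frame(fin: bool, opcode: int, payload_len: int) -> None:
    """Raise ValueError if the frame breaks a rule described above."""
    if opcode not in DATA_OPCODES and opcode not in CONTROL_OPCODES:
        raise ValueError(f"unknown opcode {opcode:#x}")
    if is_control(opcode):
        if not fin:
            raise ValueError("control frames cannot be fragmented")
        if payload_len > 125:
            raise ValueError("control frame payload must be <= 125 bytes")


validate_frame(fin=True, opcode=0x9, payload_len=4)       # small ping: fine
validate_frame(fin=False, opcode=0x2, payload_len=70000)  # fragmented binary: fine
```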

Why must the client mask frames but the server must not?

Ping/Pong, heartbeats, and connection lifecycle

A TCP connection can die silently. A NAT router drops the mapping after a few minutes of inactivity. A mobile network switches towers. An ISP reboots hardware. From the OS perspective the connection is still "alive" - but packets go nowhere. WebSocket solves this with ping/pong.

The browser WebSocket API does not expose ping/pong - the browser handles it at the protocol level. On the server, full control is available. Standard pattern: ping every 30 seconds, if no pong arrives before the next ping - the connection is dead, call `terminate()`.
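The server-side pattern boils down to one flag per connection. The sketch below is library-agnostic: the `Heartbeat` class and its method names are hypothetical, and the caller is assumed to wire `on_tick` to a 30-second timer, send the actual ping frame, and call its library's equivalent of `terminate()`.

```python
class Heartbeat:
    """Tracks liveness for one connection; the caller performs the actual I/O."""

    def __init__(self) -> None:
        self.is_alive = True  # a freshly opened connection counts as alive

    def on_pong(self) -> None:
        """Call when a pong frame arrives from the peer."""
        self.is_alive = True

    def on_tick(self) -> str:
        """Call once per interval. Returns the action the caller should take."""
        if not self.is_alive:
            return "terminate"  # no pong since the previous ping
        self.is_alive = False   # cleared now, set again by the next pong
        return "ping"


hb = Heartbeat()
print(hb.on_tick())  # ping
hb.on_pong()
print(hb.on_tick())  # ping
print(hb.on_tick())  # terminate (no pong arrived in between)
```

With a 30-second interval, a dead peer is detected between 30 and 60 seconds after it stops responding.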

The difference between `ws.close()` and `ws.terminate()`: `close()` sends a Close frame (opcode 0x8) and waits for the other side to echo it back - graceful shutdown with confirmation. `terminate()` destroys the TCP connection immediately with no notification. For dead connections only `terminate()` works - no one will receive the Close frame.

The WebSocket connection lifecycle has four distinct states. `CONNECTING` (0): handshake in progress. `OPEN` (1): connection established, data can flow. `CLOSING` (2): Close frame sent or received, graceful shutdown in progress. `CLOSED` (3): connection gone.
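The four states map naturally onto an enum. The numeric values below are the ones listed above; the `ReadyState` class and the `can_send` helper are illustrative names.

```python
from enum import IntEnum


class ReadyState(IntEnum):
    CONNECTING = 0  # handshake in progress
    OPEN = 1        # data can flow
    CLOSING = 2     # Close frame sent or received
    CLOSED = 3      # connection gone


def can_send(state: ReadyState) -> bool:
    """Data frames may only be sent while the connection is OPEN."""
    return state is ReadyState.OPEN


print(can_send(ReadyState.OPEN))     # True
print(can_send(ReadyState.CLOSING))  # False
```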

Close codes 4000-4999 are reserved for applications. Their meanings are not standardized; each application defines its own. A typical convention: 4001 - not authenticated, 4002 - rate limit exceeded, 4003 - room closed. The client reads the code and decides: reconnect, show an error, or end the session cleanly.
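On the wire, a Close frame payload is a 2-byte big-endian code followed by an optional UTF-8 reason. Since control frames are capped at 125 bytes, the reason can be at most 123 bytes. The sketch below encodes and decodes that payload and shows one possible client policy. The `client_action` mapping is a hypothetical example built on the application codes above, not a standard.

```python
import struct


def encode_close_payload(code: int, reason: str = "") -> bytes:
    """2-byte big-endian status code followed by a UTF-8 reason."""
    data = reason.encode("utf-8")
    if len(data) > 123:
        raise ValueError("reason too long for a control frame")
    return struct.pack("!H", code) + data


def decode_close_payload(payload: bytes):
    """Return (code, reason). An empty payload is reported as 1005 (no status)."""
    if not payload:
        return 1005, ""
    if len(payload) < 2:
        raise ValueError("close payload must be empty or at least 2 bytes")
    (code,) = struct.unpack("!H", payload[:2])
    return code, payload[2:].decode("utf-8")


def client_action(code: int) -> str:
    """Hypothetical client policy for the example codes above."""
    if code == 1000:
        return "done"            # normal closure
    if code == 4001:
        return "reauthenticate"  # not authenticated
    if code == 4002:
        return "retry_later"     # rate limit exceeded
    if code == 4003:
        return "show_error"      # room closed
    return "reconnect"


payload = encode_close_payload(4001, "not authenticated")
print(payload)                        # b'\x0f\xa1not authenticated'
print(decode_close_payload(payload))  # (4001, 'not authenticated')
print(client_action(4001))            # reauthenticate
```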

Myth: WebSocket ping/pong is the same as ICMP ping (the terminal ping command).

Reality: WebSocket ping/pong is an application-level mechanism defined in RFC 6455. ICMP ping is a network-level protocol. They are unrelated.

ICMP operates at the IP layer (L3), WebSocket ping/pong at the application layer (L7). An ICMP echo reply only shows that the host is reachable. A WebSocket pong shows more: that the other end of this specific WebSocket connection is alive and reading.

A server pings every 30 seconds. A client stops responding. How should the server close the connection?

Key ideas

  • **HTTP Upgrade** is a Trojan horse: start as HTTP to pass through proxies, then switch to WebSocket over the same TCP socket
  • **Frame = 2-14 bytes of header** + payload. An HTTP request burns 500+ bytes of headers even to say nothing. That gap is a large part of the 40x
  • **Masking** is not encryption - it is protection against proxy cache poisoning. XOR with a random 4-byte key that travels in the frame header, client must, server must not
  • **Ping/pong** is the standard way to detect a dead connection quickly. TCP keepalive is usually off by default, and where it is enabled the first probe typically fires only after 2 hours of idle time
  • **terminate() vs close()** - dead connections need terminate(), living ones get close() with a close code: 1000 for normal closure, 4000-4999 for application-specific reasons

Related topics

WebSocket protocol is the layer beneath every real-time application:

  • HTTP limitations — WebSocket is the direct answer to HTTP request-response constraints
  • Server-Sent Events — Alternative for one-directional streaming without a protocol switch
  • Transport comparison — When to pick WebSocket, SSE, or long-polling
  • OSI model — WebSocket handshake is a real-world example of L4 and L7 interaction

Questions for reflection

  • The handshake described here uses HTTP/1.1. RFC 8441 defines a way to run WebSocket over a single HTTP/2 stream. What changes when many WebSocket connections share one TCP connection?
  • Why does masking client frames protect against proxies, while masking server frames would be pointless?
  • In what situation is a graceful close (ws.close()) more dangerous than terminate() - when should the hard shutdown be preferred?

Related lessons

  • rt-04 — RFC 6455 - the spec behind the handshake
  • rt-03-sse — SSE is one-directional streaming over plain HTTP, with no protocol switch
  • rt-06 — Transport comparison builds on knowing WS internals
  • net-02-osi-overview — TCP/IP layers clarify why Upgrade works this way
  • rt-02-http-limits — WebSocket is the direct answer to HTTP limitations
  • net-36-websocket