Real-Time Backend

Distributed Tracing

A message goes through 7 services in 340ms. Where are 200ms lost? Without distributed tracing it is a multi-hour log hunt. With it, 30 seconds.

In 2021 Slack found the source of a degradation in 8 minutes thanks to a trace through WebSocket -> Kafka -> PostgreSQL -> Push Service; without a trace the same incident would have taken hours
Uber cut production incident MTTR from 45 to 12 minutes after rolling out a correlation ID across 1000+ microservices
Discord, at the scale of 26 million concurrent WebSocket connections, uses per-message spans to isolate slow operations without inspecting aggregated metrics

Tracing WebSocket Connections

**February 2021, Slack.** At 14:23 the first complaints came in: messages were arriving 8-12 seconds late. Engineers checked the metrics: p99 latency was normal, CPU was calm, Redis was responsive. Four hours of searching in the dark followed, until someone found a WebSocket connection stuck on one backend node running a version two weeks behind the rest. Aggregated metrics could not single out one connection, and there was no trace to follow.

HTTP tracing is simple: each request is separate, and the trace travels with it in headers. WebSocket breaks that model. A single connection runs for hours and carries thousands of messages. How do you attach a trace to a specific message, not the connection as a whole?

The answer is a **per-message span**. Each message gets its own span inside a long-lived root trace for the connection. The root span opens on handshake and closes on disconnect. Each message span references it as its parent.
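The parent-child relationship can be sketched in a few lines. This is a minimal illustration using plain dataclasses, not an OpenTelemetry API; `Span`, `TracedConnection`, and the span names `ws.connection` and `ws.message` are hypothetical names chosen for the example.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """Minimal span record; field names are illustrative only."""
    name: str
    trace_id: str
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_id: Optional[str] = None
    start: float = field(default_factory=time.monotonic)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.monotonic()


class TracedConnection:
    """Hypothetical wrapper: one root span per connection, one child span per message."""

    def __init__(self) -> None:
        # The root span opens on handshake.
        self.root = Span(name="ws.connection", trace_id=secrets.token_hex(16))
        self.message_spans: List[Span] = []

    def on_message(self, payload: str) -> Span:
        # Each message gets its own span whose parent is the connection's root span.
        span = Span(
            name="ws.message",
            trace_id=self.root.trace_id,
            parent_id=self.root.span_id,
        )
        # ... the real message handling would run here ...
        span.finish()
        self.message_spans.append(span)
        return span

    def close(self) -> None:
        # The root span closes on disconnect.
        self.root.finish()


conn = TracedConnection()
first = conn.on_message("hello")
second = conn.on_message("world")
conn.close()
```

Both message spans share the connection's trace ID but carry their own span IDs, so a single slow message can be located without scanning the whole session.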

Discord handles 26 million concurrent WebSocket connections. Without per-message spans, debugging a specific message drop at that scale is impossible. Metrics show aggregates; the trace shows the specific path.

Which approach is right for tracing a WebSocket session with thousands of messages?

Correlation IDs across async boundaries

A message arrives via WebSocket, gets processed, hits Kafka, a consumer picks it up, writes to PostgreSQL, then sends a push notification through FCM. Seven services. Three different transport technologies. How do you stitch these operations into a single trace?

A correlation ID is a string that travels with the data across every boundary. It can be a trace ID from OpenTelemetry, a homegrown UUID, or both together. The cardinal rule: **it must never get lost**. Every system that receives the data must propagate it onward.
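A rough sketch of the "never get lost" rule, assuming a homegrown UUID and a hypothetical header name `x-correlation-id`. No Kafka client is used here; the record headers are modeled as the list of key/bytes pairs that Kafka clients typically expose.

```python
import uuid
from typing import Dict, List, Optional, Tuple

CORRELATION_HEADER = "x-correlation-id"  # hypothetical header name

KafkaHeaders = List[Tuple[str, bytes]]


def ensure_correlation_id(incoming: Optional[str]) -> str:
    """Reuse the caller's ID if present; mint a new one only at the edge."""
    return incoming if incoming else str(uuid.uuid4())


def to_kafka_headers(correlation_id: str) -> KafkaHeaders:
    """Kafka header values are bytes, so the ID is encoded as UTF-8."""
    return [(CORRELATION_HEADER, correlation_id.encode("utf-8"))]


def from_kafka_headers(headers: KafkaHeaders) -> Optional[str]:
    """Read the ID back on the consumer side, or None if it was dropped."""
    for key, value in headers:
        if key == CORRELATION_HEADER:
            return value.decode("utf-8")
    return None


def to_http_headers(correlation_id: str) -> Dict[str, str]:
    """Headers for the outgoing call to the next hop, e.g. a push service."""
    return {CORRELATION_HEADER: correlation_id}


# WebSocket handler -> Kafka producer -> Kafka consumer -> push request
cid = ensure_correlation_id(None)            # minted once, at the edge
record_headers = to_kafka_headers(cid)       # producer side
consumed_cid = ensure_correlation_id(from_kafka_headers(record_headers))
push_headers = to_http_headers(consumed_cid)  # same ID leaves the consumer
```

The important design choice is that only the edge mints an ID; every later hop reuses whatever it received.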

In 2018 Uber published that a correlation ID cut MTTR (mean time to resolution) for production incidents from 45 to 12 minutes. The difference: an engineer sees the full request path immediately, instead of piecing it together from different logs.

AsyncLocalStorage in Node.js lets you avoid passing context explicitly through every function call. The context lives in the async scope automatically: everything started within one async flow gets the same context without explicit passing.
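The same idea can be shown in Python, where the standard `contextvars` module plays the role that AsyncLocalStorage plays in Node.js. This is an analogy rather than the Node.js API itself; `handle_message`, `save_to_db`, and `log` are hypothetical names.

```python
import asyncio
import contextvars

# Context variable visible to everything running inside the same async task.
correlation_id = contextvars.ContextVar("correlation_id", default="-")

log_lines = []


def log(message: str) -> None:
    # No correlation ID parameter: it is read from the ambient context.
    log_lines.append(f"[{correlation_id.get()}] {message}")


async def save_to_db() -> None:
    await asyncio.sleep(0)  # yield to the event loop, as real I/O would
    log("saved")


async def handle_message(cid: str) -> None:
    correlation_id.set(cid)  # set once at the entry point of the flow
    log("received")
    await save_to_db()       # the nested call sees the same ID


async def main() -> None:
    # gather() runs each handler in its own task with its own context copy,
    # so the two flows do not overwrite each other's ID.
    await asyncio.gather(handle_message("cid-A"), handle_message("cid-B"))


asyncio.run(main())
```

Each log line carries the ID of the flow that produced it, even though the two handlers interleave on the same event loop.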

What happens to the correlation ID when a WebSocket message moves into Kafka?

Propagation standards: W3C and B3

Before 2019, every tracing system had its own header format. Zipkin used `X-B3-TraceId`, Jaeger used `uber-trace-id`, AWS X-Ray used `X-Amzn-Trace-Id`. When a request went through three companies with different systems, the chain broke.

W3C Trace Context (a W3C Recommendation since 2020) standardized this. It defines two headers: `traceparent` carries the version, trace-id, parent-id, and trace flags; `tracestate` carries vendor-specific data. Tools that implement the standard, such as Jaeger and Zipkin, can now continue each other's traces.
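To make the `traceparent` layout concrete, here is a small parser and formatter. It is a sketch that handles only the version `00` layout (four dash-separated lowercase hex fields) and ignores `tracestate`; `TraceParent`, `parse_traceparent`, and `format_traceparent` are names invented for the example.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


def format_traceparent(trace_id: str, parent_id: str, sampled: bool) -> str:
    """Build a version 00 header for an outgoing request."""
    return f"00-{trace_id}-{parent_id}-{'01' if sampled else '00'}"


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(header)
```

A service that receives this header keeps the trace-id, generates a new span ID for its own work, and sends that span ID onward as the parent-id.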

For WebSocket the context is passed differently, since there are no headers at the message level. Two patterns: **envelope** (wrap every message in an object with a `traceContext` field) and **initial handshake** (pass the trace ID on connect, then use per-message span IDs). Envelope adds size; handshake hides the path of an individual message.
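A minimal sketch of the envelope pattern, assuming JSON messages. The `traceContext` field name comes from the text above; the inner `traceparent` key, the `payload` key, and the `wrap`/`unwrap` helpers are assumptions made for illustration.

```python
import json
from typing import Any, Optional, Tuple


def wrap(payload: Any, traceparent: str) -> str:
    """Sender side: put the trace context next to the payload."""
    return json.dumps(
        {"traceContext": {"traceparent": traceparent}, "payload": payload}
    )


def unwrap(raw: str) -> Tuple[Any, Optional[str]]:
    """Receiver side: return (payload, traceparent or None)."""
    envelope = json.loads(raw)
    context = envelope.get("traceContext") or {}
    return envelope.get("payload"), context.get("traceparent")


message = {"type": "chat", "text": "hi"}
tp = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

raw = wrap(message, tp)
payload, received_tp = unwrap(raw)

# The cost of the pattern: extra bytes on every message.
overhead_bytes = len(raw) - len(json.dumps(message))
```

For small chat messages the envelope can be larger than the payload itself, which is the size trade-off mentioned above.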

Which two headers does the W3C Trace Context standard define?

Tools: Jaeger, Tempo, OpenTelemetry

OpenTelemetry (OTel) is not a storage tool, it is an instrumentation standard. SDKs for Node.js, Python, and Go generate spans and export them wherever you point. The destinations (Jaeger, Grafana Tempo, Honeycomb, Datadog) are storage and visualization backends.

Sampling is the key decision. Keeping 100% of traces is unrealistic at Discord scale (26 million concurrent connections). The typical approach is **tail-based sampling**: the decision is made after the trace completes. Traces with errors and slow requests (above p95) are kept at 100%; normal traffic is sampled at 1-5%.
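The decision rule can be sketched as a single function applied to a finished trace. This is an illustration, not a collector configuration: `CompletedTrace`, `keep_trace`, and the parameter names are hypothetical, and the slow threshold is assumed to be a p95 latency computed elsewhere.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class CompletedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(
    trace: CompletedTrace,
    slow_threshold_ms: float,   # assumed to be the current p95 latency
    baseline_rate: float,       # e.g. 0.01-0.05 for normal traffic
    rng: Callable[[], float] = random.random,
) -> bool:
    """Tail-based decision: runs only after the whole trace is known."""
    if trace.has_error:
        return True                       # keep every error trace
    if trace.duration_ms >= slow_threshold_ms:
        return True                       # keep every slow trace
    return rng() < baseline_rate          # sample the rest


failed = CompletedTrace("t1", duration_ms=40.0, has_error=True)
slow = CompletedTrace("t2", duration_ms=900.0, has_error=False)
normal = CompletedTrace("t3", duration_ms=40.0, has_error=False)

decisions = [
    keep_trace(t, slow_threshold_ms=250.0, baseline_rate=0.05)
    for t in (failed, slow, normal)
]
```

The first two decisions are always `True`; the third depends on the random draw. The price of this approach is that all spans of a trace must be buffered until the trace completes.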

Grafana Tempo stores traces in object storage (S3, GCS) without a heavy index, which is radically cheaper than Jaeger with Elasticsearch. Lookups by trace ID are fast; finding a trace by attribute traditionally meant discovering its ID elsewhere first, for example in Grafana Loki logs, although newer Tempo versions add TraceQL search. For most production scenarios Tempo is roughly 5-10x cheaper.

**Jaeger** - open source, Elasticsearch or Cassandra storage, mature UI for trace analysis, good for self-hosted
**Grafana Tempo** - open source, cheap object storage (S3, GCS), integration with Loki and Prometheus, self-hosted or managed in Grafana Cloud
**Honeycomb** - tail-based sampling out of the box, columnar storage, analytics on arbitrary attributes
**Datadog APM** - full observability stack, expensive, but no self-managed infrastructure

Myth: logging a request ID in each service is the same as distributed tracing.

Reality: logs with a correlation ID tell you that a request passed through a service, but they do not give you a structured hierarchy of spans with latency at each step and parent-child relationships.

To find the slow step in a chain of 7 services you need a time axis of spans with precise timestamps. Grep on a correlation ID across log systems only confirms that a request passed through a service. It gives you neither timing nor a dependency graph.
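What the span hierarchy buys you can be shown with a few records. The sketch below computes each span's self time (its duration minus the time spent in its direct children) and picks the largest. The span data is made up for illustration, `SpanRecord` and both functions are hypothetical, and the calculation assumes sibling spans do not overlap.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SpanRecord:
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float


def self_times(spans: List[SpanRecord]) -> Dict[str, float]:
    """Duration of each span minus the durations of its direct children."""
    result = {s.span_id: s.end_ms - s.start_ms for s in spans}
    for s in spans:
        if s.parent_id in result:
            result[s.parent_id] -= s.end_ms - s.start_ms
    return result


def slowest_service(spans: List[SpanRecord]) -> str:
    """Service whose own work (excluding children) took the longest."""
    times = self_times(spans)
    by_id = {s.span_id: s for s in spans}
    worst = max(times, key=lambda span_id: times[span_id])
    return by_id[worst].service


# Made-up trace: 340 ms end to end, 200 ms of it inside one database write.
spans = [
    SpanRecord("a", None, "gateway", 0.0, 340.0),
    SpanRecord("b", "a", "kafka-producer", 10.0, 40.0),
    SpanRecord("c", "a", "consumer", 50.0, 330.0),
    SpanRecord("d", "c", "postgres", 60.0, 260.0),
    SpanRecord("e", "c", "push", 270.0, 320.0),
]

culprit = slowest_service(spans)
```

With timestamps and parent IDs the answer falls out of simple arithmetic; with only a correlation ID in the logs, none of these numbers exist.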

What is tail-based sampling in distributed tracing?

Summary

WebSocket requires per-message spans inside the connection's root span, otherwise detail on individual messages is lost
A correlation ID is propagated across every transport boundary (HTTP headers, Kafka headers, gRPC metadata) and must never be lost
W3C Trace Context (`traceparent` + `tracestate`), a W3C Recommendation since 2020, replaced the zoo of Zipkin/Jaeger/X-Ray formats
Tail-based sampling lets you keep 100% of error and slow traces and only 1-5% of normal traffic

Questions for reflection

How does the sampling strategy change as you grow from 10K to 10M concurrent WebSocket connections?
What happens to the trace when a client reconnects after a drop - a new trace or a continuation of the old one?
How do you pass trace context in a browser WebSocket client when you cannot add arbitrary headers on connect?

Related lessons

dist-06-ordering