
Distributed Tracing

At Uber, a single user request flows through 20+ microservices in about 500 ms. When a request takes 3 seconds instead, how do you find where the time is lost? Before Jaeger, the answer was to SSH into each service and read its logs. With Jaeger, a single waterfall diagram shows the bottleneck in about 5 seconds.

  • **Shopify** uses OpenTelemetry for tracing payment transactions across 30+ microservices - tail-based sampling guarantees 100% retention of every failed payment regardless of the overall sampling rate.
  • **Twitter/X** integrated distributed tracing into its ad serving pipeline and discovered that 15% of latency came from one legacy service that no one had suspected from aggregate metrics alone.
  • **Datadog APM** accepts traces produced by the OTel SDK - teams that instrument with OTel can switch from Datadog to another backend without changing application code.

Jaeger

Jaeger is an open-source distributed tracing system developed by Uber. It collects traces - records of requests as they flow through multiple services - and visualizes them as waterfall diagrams showing exactly where time is spent.

Jaeger waterfall visualization immediately shows which service or database query is responsible for the majority of latency - information that is invisible in aggregate metrics.
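The idea behind the waterfall can be sketched in a few lines of plain Python. This is only an illustration, not Jaeger's implementation: the `Span` fields, the sample trace, and the helper names are hypothetical, and the self-time calculation assumes that child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    start_ms: float
    duration_ms: float


def self_times(spans):
    """Map span_id -> time spent in the span itself, excluding direct children."""
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {
        s.span_id: max(s.duration_ms - child_total.get(s.span_id, 0.0), 0.0)
        for s in spans
    }


def bottleneck(spans):
    """Return the span with the largest self-time."""
    st = self_times(spans)
    return max(spans, key=lambda s: st[s.span_id])


def waterfall(spans, width=40):
    """Render a rough text waterfall: one bar per span, offset by start time."""
    total = max(s.start_ms + s.duration_ms for s in spans)
    lines = []
    for s in sorted(spans, key=lambda s: s.start_ms):
        offset = int(s.start_ms / total * width)
        length = max(1, int(s.duration_ms / total * width))
        lines.append(f"{s.name:<18}{' ' * offset}{'#' * length} {s.duration_ms:.0f}ms")
    return "\n".join(lines)


# Hypothetical 3-second request: the gateway looks slow, but the time is in SQL.
TRACE = [
    Span("gw", None, "gateway", 0.0, 3000.0),
    Span("auth", "gw", "auth-service", 10.0, 40.0),
    Span("orders", "gw", "order-service", 60.0, 2900.0),
    Span("sql", "orders", "SELECT orders", 100.0, 2700.0),
    Span("cache", "orders", "redis GET", 2810.0, 20.0),
]

if __name__ == "__main__":
    print(waterfall(TRACE))
    print("bottleneck:", bottleneck(TRACE).name)
```

The gateway span lasts 3 seconds, which is all the aggregate p99 would show. Subtracting child durations reveals that almost all of that time belongs to a single query two levels down.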

An API request takes 3 seconds. Aggregate metrics show p99=3s for the gateway. What does Jaeger reveal that metrics cannot?

OpenTelemetry

OpenTelemetry (OTel) is the CNCF standard for collecting traces, metrics, and logs. A single SDK, any backend. Auto-instrumentation for HTTP, SQL, gRPC, and popular frameworks without writing tracing code.

OTel is vendor-neutral: switching from Jaeger to Datadog APM or Honeycomb requires changing the exporter configuration (typically the endpoint URL and credentials), not the instrumentation code.
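As a rough illustration of what "only the exporter changes" means, the sketch below builds the environment for one service against two backends. The variable names are the standard OpenTelemetry SDK environment variables; the `otel_env` helper, the service name, the `jaeger-collector` hostname, and the API key placeholder are assumptions made for this example.

```python
def otel_env(service_name, endpoint, headers=None):
    """Build OTel SDK environment variables for one service and one backend."""
    env = {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_TRACES_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_ENDPOINT": endpoint,
    }
    if headers:
        # The SDK expects comma-separated key=value pairs.
        env["OTEL_EXPORTER_OTLP_HEADERS"] = ",".join(
            f"{k}={v}" for k, v in headers.items()
        )
    return env


# Hypothetical self-hosted Jaeger collector accepting OTLP over HTTP.
jaeger = otel_env("checkout", "http://jaeger-collector:4318")

# Hypothetical hosted backend that authenticates with a header.
honeycomb = otel_env(
    "checkout",
    "https://api.honeycomb.io",
    {"x-honeycomb-team": "YOUR_API_KEY"},
)

# Only exporter settings differ; the application code is never touched.
changed = {
    k for k in set(jaeger) | set(honeycomb) if jaeger.get(k) != honeycomb.get(k)
}

if __name__ == "__main__":
    print(sorted(changed))
```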

A team wants to add distributed tracing to 15 Node.js microservices without modifying each service's business logic. What is the correct approach?

Trace Context Propagation

Trace context propagation passes the trace_id and span_id between services via HTTP headers. Without propagation, each service creates an isolated trace - the distributed nature of the request is invisible.

W3C Trace Context, a W3C Recommendation that defines the traceparent and tracestate HTTP headers, is the standard. All major observability vendors (Datadog, Honeycomb, Jaeger) support it. Once propagation is in place, adding a new service to the trace requires only adding the SDK.
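In practice the SDK handles propagation, but a hand-written sketch makes the mechanism concrete. The code below writes and reads a version `00` `traceparent` header, which has the form `version-trace_id-parent_id-flags`. The function names are hypothetical, header names are assumed to be lowercased already, and `tracestate` is ignored.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def new_trace_id():
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def inject(headers, trace_id, span_id, sampled=True):
    """Caller side: add the traceparent header to an outgoing request."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"
    return headers


def extract(headers):
    """Callee side: parse traceparent, or return None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", "").strip().lower())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_id,
        "sampled": bool(int(flags, 16) & 1),
    }


def handle_incoming(headers):
    """Start a server span that joins the caller's trace when context exists."""
    ctx = extract(headers)
    return {
        "trace_id": ctx["trace_id"] if ctx else new_trace_id(),
        "span_id": new_span_id(),
        "parent_span_id": ctx["parent_span_id"] if ctx else None,
    }
```

When the header is missing, `handle_incoming` starts a brand-new trace. That is exactly the "disconnected spans" symptom: both services trace correctly, but nothing links them.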

Two microservices both use OpenTelemetry but traces show as separate disconnected spans. What is missing?

Spans and Sampling

A span is a unit of work within a trace: an HTTP request, a SQL query, a Redis call, an external API call. Each span records: operation name, start time, duration, status, and attributes. Sampling controls what percentage of traces are exported.
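A minimal sketch of what a span records, written with the standard library only. It is not the OpenTelemetry API: `Span`, `start_span`, and the in-memory `FINISHED` list are hypothetical stand-ins for an SDK tracer and exporter.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    start_time: float = 0.0   # wall-clock start, seconds since the epoch
    duration_ms: float = 0.0
    status: str = "UNSET"     # becomes OK or ERROR when the span ends
    attributes: dict = field(default_factory=dict)


FINISHED = []  # stand-in for an exporter queue


@contextmanager
def start_span(name, **attributes):
    """Time the wrapped block and record its outcome as a span."""
    span = Span(name=name, start_time=time.time(), attributes=dict(attributes))
    t0 = time.perf_counter()
    try:
        yield span
        if span.status == "UNSET":
            span.status = "OK"
    except Exception as exc:
        span.status = "ERROR"
        span.attributes["exception.type"] = type(exc).__name__
        raise
    finally:
        span.duration_ms = (time.perf_counter() - t0) * 1000
        FINISHED.append(span)


if __name__ == "__main__":
    with start_span("SELECT users", table="users") as span:
        time.sleep(0.01)  # pretend to run a query
        span.attributes["rows"] = 3
    print(FINISHED[-1])
```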

Tail-based sampling decides after the entire trace is complete - guaranteeing that 100% of error traces are kept even at 1% overall sampling rate. Head-based sampling (at the SDK) cannot know the outcome when the trace starts.
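The tail-based decision can be sketched as a function that runs once all spans of a trace have arrived. This is a plain-Python illustration, not the OTel Collector's configuration: `keep_trace`, the span dictionaries, and the optional `slow_ms` threshold are assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id, spans, base_rate=0.01, slow_ms=None):
    """Decide whether to keep a finished trace.

    Error traces are always kept, optionally slow traces too; everything
    else is kept with probability base_rate.
    """
    if any(s["status"] == "ERROR" for s in spans):
        return True
    longest = max((s["duration_ms"] for s in spans), default=0.0)
    if slow_ms is not None and longest >= slow_ms:
        return True
    # Hash the trace id into [0, 1) so every collector instance makes the
    # same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < base_rate


# Hypothetical traces: one failed payment, one healthy request.
failed = [
    {"name": "POST /pay", "status": "ERROR", "duration_ms": 120.0},
    {"name": "charge card", "status": "ERROR", "duration_ms": 90.0},
]
healthy = [{"name": "GET /cart", "status": "OK", "duration_ms": 35.0}]

if __name__ == "__main__":
    print(keep_trace("trace-a", failed))   # kept because it contains an error
    print(keep_trace("trace-b", healthy))  # depends on the 1% bucket
```

A head-based sampler would have to make this call before `charge card` ever ran, which is why it cannot guarantee that failures are retained.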

Misconception: distributed tracing is only useful for debugging production incidents.

Reality: tracing is also essential for performance optimization: identifying N+1 queries, slow external API calls, and unexpected database patterns during normal operations.

Pinterest used Jaeger traces during normal load to discover that their recommendation API was making 47 database queries per request (N+1). The fix reduced latency by 60% - found without any incident.
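An N+1 pattern is easy to spot mechanically once spans carry the query text. The sketch below groups database spans by parent and normalized statement. The sample data is invented for illustration, and `db.statement` as the attribute key is an assumption about how the instrumentation labels queries.

```python
import re
from collections import Counter


def normalize(statement):
    """Replace literals with '?' so repeated queries compare equal."""
    s = re.sub(r"'[^']*'", "?", statement)  # quoted string literals
    s = re.sub(r"\b\d+\b", "?", s)          # standalone numeric literals
    return re.sub(r"\s+", " ", s).strip().lower()


def find_n_plus_one(spans, threshold=10):
    """Return (parent_id, statement, count) for queries repeated under one parent."""
    counts = Counter(
        (s["parent_id"], normalize(s["attributes"]["db.statement"]))
        for s in spans
        if "db.statement" in s.get("attributes", {})
    )
    return [
        (parent, stmt, n) for (parent, stmt), n in counts.items() if n >= threshold
    ]


# Hypothetical trace: one list query followed by one query per returned row.
SPANS = [
    {"parent_id": None, "attributes": {"http.route": "/recommendations"}},
    {
        "parent_id": "api",
        "attributes": {"db.statement": "SELECT * FROM pins WHERE board_id = 7"},
    },
]
SPANS += [
    {
        "parent_id": "api",
        "attributes": {"db.statement": f"SELECT * FROM images WHERE pin_id = {i}"},
    }
    for i in range(46)
]

if __name__ == "__main__":
    for parent, stmt, n in find_n_plus_one(SPANS):
        print(f"{n}x under {parent}: {stmt}")
```

The usual fix is to replace the per-row queries with one batched query, for example a single `WHERE pin_id IN (...)` lookup.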

At 50,000 RPS, storing 100% of traces is not economically viable. How should sampling be configured to keep all error traces?

Key Ideas

  • **Jaeger** stores and visualizes traces - the waterfall diagram shows exactly where time is lost in a microservice chain.
  • **OpenTelemetry** - vendor-neutral standard: one instrumentation, any backend; auto-instrumentation for HTTP/SQL/Express without writing tracing code.
  • **Trace Context + Tail Sampling** - propagation via traceparent header links services; tail-based sampling in OTel Collector guarantees errors are kept at 1% overall rate.

Related Topics

Distributed tracing connects log correlation and incident response:

  • ELK Stack and Logging — trace.id in structured ELK logs links log records to Jaeger spans.
  • Service Mesh: Istio, Linkerd — Istio automatically generates spans for every inter-service call and propagates context through Envoy.

Questions for Reflection

  • At what RPS and system complexity does distributed tracing start paying off compared to traditional logging?
  • How does tail-based sampling affect the accuracy of p99 latency metrics in Jaeger?
  • If a database is the bottleneck in traces - what actions follow from that information?

Related Lessons

  • dist-05-vector-clocks
  • dist-06-ordering