Backend Transport

Distributed Tracing and Observability

At Uber in 2017, P99 latency at checkout rose by 20%. A hundred engineers dug through logs, and without distributed tracing the search took 3 days. After Jaeger was introduced, a similar incident was resolved in 2 hours: a single slow database query in the inventory service was immediately visible on the trace waterfall.

  • **Uber** began building Jaeger internally in 2015 and open-sourced it in 2017. Jaeger is now a CNCF graduated project used by thousands of companies.
  • **Shopify** uses OpenTelemetry to trace all Ruby on Rails requests. Spans are automatically created for every ActiveRecord query and HTTP call.
  • **Cloudflare** exposes distributed tracing through the Workers Tracing API, so every Edge Worker participates in the trace chain.

Why Distributed Tracing

In a monolith, a slow request can be found in one log file. In microservices, a single user request often triggers 5-10 service calls, each writing to its own log. Without a common trace ID, log lines from different services cannot be tied back to the request that produced them. Distributed tracing assigns one trace ID to the request and propagates it through every service it touches.
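The effect of a shared trace ID can be sketched with plain log records. This is a minimal illustration, not a tracing library: the record fields (`service`, `trace_id`, `ts`, `msg`) and the sample values are assumptions made up for the example. Once every service logs the same ID, finding where a request stalled reduces to grouping and sorting.

```python
from collections import defaultdict

# Hypothetical log records from three services; field names are assumptions.
LOGS = [
    {"service": "checkout", "trace_id": "a1", "ts": 0.000, "msg": "POST /checkout"},
    {"service": "inventory", "trace_id": "b2", "ts": 0.004, "msg": "reserve items"},
    {"service": "inventory", "trace_id": "a1", "ts": 0.010, "msg": "reserve items"},
    {"service": "payment", "trace_id": "a1", "ts": 1.250, "msg": "charge card"},
    {"service": "payment", "trace_id": "b2", "ts": 0.030, "msg": "charge card"},
]


def group_by_trace(records):
    """Return {trace_id: [records sorted by timestamp]}."""
    traces = defaultdict(list)
    for rec in records:
        traces[rec["trace_id"]].append(rec)
    for recs in traces.values():
        recs.sort(key=lambda r: r["ts"])
    return dict(traces)


def largest_gap(records):
    """Return (gap_seconds, earlier, later) for the longest pause between
    two consecutive records of one trace."""
    pairs = zip(records, records[1:])
    return max(((b["ts"] - a["ts"], a, b) for a, b in pairs), key=lambda t: t[0])


traces = group_by_trace(LOGS)
gap, before, after = largest_gap(traces["a1"])
print(f"{gap:.3f}s between {before['service']} and {after['service']}")
```

Without the `trace_id` field, the two interleaved requests in `LOGS` could only be separated by guessing from timestamps, which is what manual log correlation amounts to.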

Uber engineers spent 3 days diagnosing a P99 regression that distributed tracing would have revealed in minutes. The bottleneck was a single slow database query in one service, but without traces the team had to manually correlate logs across 20+ services.

Why are logs alone insufficient for diagnosing slow requests in microservices?

OpenTelemetry: Observability Standard

OpenTelemetry (OTel) is a CNCF project that has become the de facto standard for collecting traces, metrics, and logs. It provides vendor-neutral instrumentation: write once, export to any backend (Jaeger, Zipkin, Datadog, Honeycomb). Auto-instrumentation covers HTTP, gRPC, database drivers, and message brokers with little or no change to application code.

OpenTelemetry auto-instrumentation handles the 80% case: Express.js requests, pg queries, Redis calls, and Kafka produce/consume are all traced automatically. Manual spans cover the remaining 20%, the business-critical operations that no library can recognize on its own.
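The two ideas above, a manual span around a business operation and an exporter that can be swapped without touching the instrumented code, can be sketched with the standard library alone. This is a stand-in to show the shape of the design, not the OpenTelemetry API: `Tracer`, `SpanExporter`, `InMemoryExporter`, `ConsoleExporter`, and `apply_discount` are all hypothetical names.

```python
import time
from contextlib import contextmanager
from typing import Protocol


class SpanExporter(Protocol):
    """Anything that can receive a finished span (hypothetical interface)."""

    def export(self, span: dict) -> None: ...


class InMemoryExporter:
    """Keeps spans in a list; useful in tests."""

    def __init__(self):
        self.spans = []

    def export(self, span):
        self.spans.append(span)


class ConsoleExporter:
    """Prints spans; stands in for a real backend."""

    def export(self, span):
        print(f"{span['name']} {span['duration_ms']:.1f}ms {span['attributes']}")


class Tracer:
    def __init__(self, exporter: SpanExporter):
        self._exporter = exporter

    @contextmanager
    def span(self, name, **attributes):
        record = {"name": name, "attributes": dict(attributes)}
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # Record the failure on the span, then let it propagate.
            record["attributes"]["error"] = repr(exc)
            raise
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._exporter.export(record)


def apply_discount(tracer, order_total, percent):
    """Business logic wrapped in a manual span."""
    with tracer.span("apply_discount", percent=percent) as span:
        discounted = round(order_total * (100 - percent) / 100, 2)
        span["attributes"]["discounted_total"] = discounted
        return discounted


exporter = InMemoryExporter()
total = apply_discount(Tracer(exporter), 200.0, 15)
```

Calling `apply_discount(Tracer(ConsoleExporter()), 200.0, 15)` sends the same span to a different destination while `apply_discount` itself stays unchanged, which is the property the "write once, export to any backend" claim refers to.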

What is the advantage of OpenTelemetry over vendor-specific SDKs (Datadog agent, New Relic agent)?

Spans and Context Propagation

A trace consists of spans. Each span represents one operation with a start time, duration, service name, and attributes. Spans are connected by parent-child relationships. Context propagation carries the trace ID across service boundaries via HTTP headers (W3C traceparent) or gRPC metadata.

The W3C Trace Context standard (a W3C Recommendation, not an IETF RFC) ensures interoperability between services using different tracing libraries. Tooling across the Jaeger, Zipkin, Datadog, and Honeycomb ecosystems can read traceparent headers. One standard header propagates traces across language and library boundaries.
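A version-00 traceparent value has four dash-separated hex fields: version, a 32-character trace ID, a 16-character parent span ID, and 2-character flags whose lowest bit means "sampled". The sketch below parses such a header and builds the one a service would send on its outgoing call, keeping the trace ID and replacing the span ID. The function names are hypothetical, and starting a new sampled trace when the incoming header is invalid is an assumption of this example rather than a rule from the standard.

```python
import re
import secrets

# Strict pattern for the version-00 layout only.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, sampled), or None if invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, bool(int(flags, 16) & 0x01)


def child_traceparent(incoming):
    """Header for an outgoing call: same trace ID, fresh span ID."""
    parsed = parse_traceparent(incoming)
    if parsed is None:
        # Assumption: no usable context means we start a new, sampled trace.
        trace_id, sampled = secrets.token_hex(16), True
    else:
        trace_id, _, sampled = parsed
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
```

If Order Service receives `incoming` and calls Payment Service with `outgoing`, both services record spans under trace ID `4bf92f3577b34da6a3ce929d0e0e4736`, and the new span ID in `outgoing` becomes the parent of whatever span Payment Service creates.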

How is the trace ID preserved when transitioning from Order Service to Payment Service?

Jaeger and Zipkin

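Jaeger (CNCF graduated) and Zipkin are distributed tracing backends. They receive spans, store them, and provide a UI for waterfall visualization. Jaeger accepts OTLP directly; Zipkin is typically fed through the OpenTelemetry Collector's Zipkin exporter. Jaeger also supports configurable sampling and can be paired with Prometheus metrics, which is where alerting on slow requests is usually configured.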
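Sampling is what keeps trace storage affordable, so a small illustration may help. The sketch below shows one common approach, a probabilistic decision derived from the trace ID itself, so that every service reaches the same keep-or-drop verdict for a given trace without coordinating. It is an assumption-laden example, not Jaeger's implementation; `should_sample` and the 10% rate are hypothetical.

```python
import random


def should_sample(trace_id_hex, rate):
    """Keep roughly `rate` of all traces, deterministically per trace ID."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    # Interpret the low 64 bits of the trace ID as a number in [0, 2**64).
    bucket = int(trace_id_hex[-16:], 16)
    return bucket < int(rate * 2**64)


# Demo with seeded pseudo-random 128-bit trace IDs.
rng = random.Random(42)
trace_ids = [f"{rng.getrandbits(128):032x}" for _ in range(10_000)]
kept = sum(should_sample(t, 0.10) for t in trace_ids)
fraction = kept / len(trace_ids)
print(f"kept {fraction:.1%} of traces")
```

Because the decision is made before the request finishes, a sampler like this cannot know which traces will turn out slow or failed. Keeping those reliably requires deciding after the trace completes, which is a different technique from the one shown here.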

Grafana can display traces (Jaeger/Tempo), metrics (Prometheus), and logs (Loki) in one interface. The correlation workflow: metric alert -> find trace -> read log. Without all three, the debugging puzzle is incomplete.

Why is it important to correlate traces with logs and metrics in one tool?

Transport Metrics

Transport-level metrics expose what application-level logs cannot: TCP retransmits, connection pool utilization, Kafka consumer lag, RabbitMQ queue depth. These are the leading indicators of performance degradation, visible before errors appear.

Kafka consumer lag growing from 1K to 500K in 10 minutes is a critical alert. It means either the consumer has crashed or it cannot keep up with the producer rate. If the consumer is down, or the producer rate stays above what the consumer can process, the lag keeps growing until someone intervenes.
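The arithmetic behind such an alert is simple enough to sketch. Lag for a partition is the log end offset minus the consumer group's committed offset; comparing two samples of those offsets also separates the two causes named above. The function names and the alert thresholds below are hypothetical, chosen only to make the example concrete.

```python
def consumer_lag(end_offsets, committed_offsets):
    """Total lag across partitions: end offset minus committed offset."""
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def lag_growth_per_minute(lag_start, lag_end, minutes):
    """Average growth in lag (messages per minute) over the window."""
    return (lag_end - lag_start) / minutes


def classify(lag_start, lag_end, minutes, warn_rate=1_000, crit_rate=10_000):
    """Map lag growth to a severity; thresholds are illustrative assumptions."""
    rate = lag_growth_per_minute(lag_start, lag_end, minutes)
    if rate >= crit_rate:
        return "critical"
    if rate >= warn_rate:
        return "warning"
    return "ok"


def diagnose(committed_before, committed_after, end_before, end_after):
    """Compare two offset samples for one partition."""
    consumed = committed_after - committed_before
    produced = end_after - end_before
    if consumed == 0 and produced > 0:
        return "stalled"          # consumer is not committing at all
    if consumed < produced:
        return "falling behind"   # consuming, but slower than producers
    return "keeping up"


severity = classify(1_000, 500_000, 10)
growth = lag_growth_per_minute(1_000, 500_000, 10)
```

For the scenario in the text, lag grows by 49,900 messages per minute. If the committed offset did not move between the two samples, the consumer is stalled or dead; if it moved but by less than the end offset did, the consumer is alive and under-provisioned.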

Misconception: distributed tracing replaces logs and metrics

Traces, metrics, and logs are three different and complementary pillars of observability. Each is necessary.

Metrics give the aggregate view (for example, an error rate of 0.5%). Traces show a specific slow request and its path through the services. Logs hold the details of a specific error. Diagnosis: metric alert -> find trace -> read log. Without any one of them, the puzzle is incomplete.

Kafka consumer lag suddenly grew from 1K to 500K in the last 10 minutes. What does this indicate?

Summary

  • **Distributed tracing** solves the fundamental microservices problem: correlating a request across multiple services. Without a trace ID, logs are disconnected.
  • **OpenTelemetry** is the standard: one instrumentation, any backend. Auto-instrumentation covers HTTP, DB, Kafka without code changes.
  • **Three pillars**: metrics (aggregate health), traces (specific request path), logs (detailed events). Correlation in Grafana: metric -> trace -> log.

Related Topics

Observability applies to all transport protocols and patterns:

  • API Gateway: generates the first span and request ID, the start of a distributed trace through all downstream services
  • Transport Debugging: when traces reveal a problem at the network level, tcpdump and Wireshark help investigate deeper

Questions for Reflection

  • How to implement trace sampling (not tracing 100% of requests) without losing important errors and slow requests?
  • How does distributed tracing help with post-mortem analysis of an incident that happened 3 days ago?
  • When does the overhead of distributed tracing (CPU, network, storage) become unacceptable?

Related Lessons

  • devops-14: Distributed Tracing and Observability
