Observability: Logs, Metrics, Traces
Production incident: the API is slow and users are complaining. Without observability, that means hours of blind searching. With observability, the diagnosis can take 5 minutes: metrics show the error rate spiking at 14:32, traces point to PostgreSQL as the bottleneck, and logs show the specific query. Observability is the difference between debugging by intuition and debugging by evidence.
- **Netflix**: centralized Observability platform processes trillions of spans per day. Distributed tracing allows Netflix engineers to identify the root cause of latency issues across 700+ microservices within minutes.
- **Honeycomb**: pioneer in high-cardinality observability. Charity Majors (CTO): 'you need to be able to ask any question of your production system'. Traditional monitoring cannot answer questions that were not anticipated at alert-creation time.
- **OpenTelemetry**: became a CNCF incubating project in 2021. Supported by Google, Microsoft, AWS, Datadog, New Relic, and 50+ companies - the safe choice for vendor-neutral instrumentation.
Structured Logs
Observability is the ability to understand the internal state of a system from its external outputs. The three pillars: Logs (events), Metrics (measurements over time), Traces (request path through services). Each pillar answers different questions: Logs - what happened, Metrics - how often/how much, Traces - where time was spent.
Log levels: TRACE (detailed debugging), DEBUG (internal states), INFO (normal events), WARN (unusual situations that do not require immediate action), ERROR (failures requiring attention), FATAL (system cannot continue). Structured logging: JSON format with fixed fields allows machine parsing, filtering, and aggregation.
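A minimal sketch of structured logging using only Python's standard `logging` and `json` modules. The field names (`service`, `trace_id`, `user_id`), the logger name, and the sample values are illustrative assumptions, not a required schema.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object with a fixed set of fields."""

    # Hypothetical context fields; replace with your own schema.
    CONTEXT_FIELDS = ("service", "trace_id", "user_id")

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            if hasattr(record, field):
                entry[field] = getattr(record, field)
        return json.dumps(entry)


def build_logger(name="payments"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    return logger


log = build_logger()
log.error(
    "User payment failed",
    extra={
        "service": "payments",
        "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
        "user_id": "u-123",
    },
)
```

Because every entry carries the same keys, a log backend can filter on `level` or join entries from different services on `trace_id` without parsing free text.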
A log contains: '2024-01-15 ERROR User payment failed'. Why is this a problem in a microservices architecture?
Metrics with Prometheus
Metrics are numerical aggregated measurements of system state over time. Four types in Prometheus: Counter (monotonically increasing, e.g., total requests), Gauge (can go up and down, e.g., active connections), Histogram (distribution of values, e.g., request duration), Summary (quantiles calculated on the client side).
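The semantics of three of these types can be sketched with a toy model in plain Python. This is not the `prometheus_client` API: the class names only mirror Prometheus terminology, and the bucket bounds and sample values are assumptions chosen for illustration.

```python
import bisect


class Counter:
    """Monotonically increasing value: it can only go up."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter cannot decrease")
        self.value += amount


class Gauge:
    """Point-in-time value that can move in both directions."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        self.value += amount

    def dec(self, amount=1.0):
        self.value -= amount


class Histogram:
    """Counts observations per bucket and keeps a running sum and count."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)                       # inclusive upper bounds
        self.bucket_counts = [0] * (len(self.bounds) + 1)  # last bucket is +Inf
        self.sum = 0.0
        self.count = 0

    def observe(self, value):
        # bisect_left places a value equal to a bound into that bound's bucket.
        self.bucket_counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value
        self.count += 1

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, smallest bound first."""
        total, pairs = 0, []
        for bound, n in zip(self.bounds + [float("inf")], self.bucket_counts):
            total += n
            pairs.append((bound, total))
        return pairs


requests_total = Counter()
active_connections = Gauge()
request_duration = Histogram([0.1, 0.5, 1.0])

for duration in (0.05, 0.3, 0.1, 2.0):
    requests_total.inc()
    request_duration.observe(duration)

active_connections.inc()
active_connections.inc()
active_connections.dec()

print(requests_total.value, active_connections.value, request_duration.cumulative())
```

The histogram keeps only bucket counts, not individual observations, which is why quantiles derived from it on the server side are estimates bounded by the bucket layout.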
USE Method (Brendan Gregg): for each resource - Utilization (occupancy), Saturation (queue, delays), Errors. RED Method (Tom Wilkie): for each service - Rate (requests per second), Errors (error rate), Duration (latency distribution). RED is more aligned with user experience than infrastructure metrics.
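As a rough illustration, the three RED signals can be computed from a window of request records. The `Request` shape, the "status >= 500 counts as an error" rule, and the nearest-rank percentile are assumptions of this sketch; a real setup would usually derive these values from counters and histograms in the metrics backend.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    status: int


def percentile(values, p):
    """Nearest-rank percentile of a non-empty list (p between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def red_summary(requests, window_s):
    """Rate, Errors, Duration for one service over a window of `window_s` seconds."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = [r.duration_s for r in requests]
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p50_s": percentile(durations, 50) if durations else None,
        "p99_s": percentile(durations, 99) if durations else None,
    }


# Ten requests observed in a 5-second window: eight fast, one slower, one failing.
window = [Request(0.1, 200)] * 8 + [Request(0.4, 200), Request(2.0, 500)]
print(red_summary(window, window_s=5))
```

In this sample the median looks healthy while p99 is dominated by the single slow failure, which is why Duration is tracked as a distribution rather than an average.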
The number of active WebSocket connections needs to be tracked. Which Prometheus metric type is appropriate?
Distributed Tracing with Jaeger
Distributed Tracing tracks the path of a request through all system services. Trace = a tree of Spans. Span = a unit of work in one service (HTTP call, DB query, external API call). Each span has: trace_id, span_id, parent_span_id, service name, operation name, start time, duration, tags.
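The span fields listed above map naturally onto a small data structure. The sketch below rebuilds the tree from `parent_span_id` links and finds the span with the largest self time; the service names, timings, and the assumption that sibling spans do not overlap are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    service: str
    operation: str
    start_ms: int
    duration_ms: int


def render_waterfall(spans):
    """Return indented lines, one per span, children ordered by start time."""
    children = defaultdict(list)
    for span in spans:
        children[span.parent_span_id].append(span)

    lines = []

    def walk(span, depth):
        lines.append(f"{'  ' * depth}{span.service}:{span.operation} {span.duration_ms}ms")
        for child in sorted(children[span.span_id], key=lambda c: c.start_ms):
            walk(child, depth + 1)

    for root in children[None]:
        walk(root, 0)
    return lines


def self_time_ms(span, spans):
    """Duration not covered by direct children (assumes children run sequentially)."""
    child_total = sum(c.duration_ms for c in spans if c.parent_span_id == span.span_id)
    return span.duration_ms - child_total


def slowest_by_self_time(spans):
    return max(spans, key=lambda s: self_time_ms(s, spans))


trace = [
    Span("t1", "a", None, "api-gateway", "POST /checkout", 0, 2000),
    Span("t1", "b", "a", "orders", "create_order", 50, 1900),
    Span("t1", "c", "b", "postgres", "SELECT orders", 100, 1700),
    Span("t1", "d", "b", "payments", "charge", 1820, 120),
]

print("\n".join(render_waterfall(trace)))
print("bottleneck:", slowest_by_self_time(trace).operation)
```

Ranking by self time rather than total duration keeps parent spans from looking slow merely because they wait on a slow child.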
Jaeger (developed by Uber, now a CNCF project) and Zipkin are popular distributed tracing systems. Propagation: the trace context is passed between services in HTTP headers, either Zipkin's B3 headers (such as X-B3-TraceId) or the W3C Trace Context traceparent header. Each service reads the incoming trace_id and creates child spans with the same trace_id, using the caller's span_id as their parent_span_id.
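A minimal sketch of propagation using the W3C Trace Context `traceparent` header, whose value has the form `version-trace_id-parent_id-flags` in lowercase hex. The function names and the plain-dict header handling are assumptions of this example; it accepts only version `00` and ignores the companion `tracestate` header.

```python
import re
import secrets

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    # All-zero IDs are defined as invalid by the specification.
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return trace_id, parent_id, flags


def start_child_span(incoming_headers):
    """Continue the caller's trace, or start a new one if no valid context arrived.

    Returns the new span and the headers to attach to outgoing requests.
    Assumes header names in the dict are already lowercased.
    """
    parsed = parse_traceparent(incoming_headers.get("traceparent", ""))
    if parsed:
        trace_id, parent_span_id, flags = parsed
    else:
        trace_id, parent_span_id, flags = secrets.token_hex(16), None, "01"

    span_id = secrets.token_hex(8)
    span = {"trace_id": trace_id, "span_id": span_id, "parent_span_id": parent_span_id}
    outgoing = {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
    return span, outgoing


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
span, outgoing = start_child_span(incoming)
print(span)
print(outgoing)
```

The outgoing header carries the same trace_id but the new span_id, so the next service downstream records this span as its parent.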
API endpoint /checkout responds slowly (p99 = 2 seconds). Metrics show there is a problem, but not where exactly. Which tool helps find the bottleneck?
OpenTelemetry
OpenTelemetry (OTel) is an open standard and toolkit for Logs, Metrics, and Traces. Created by merging OpenCensus (Google) and OpenTracing (CNCF) in 2019. Goal: vendor-neutral instrumentation - instrument code once, send data to any backend (Jaeger, Datadog, New Relic, Honeycomb).
Auto-instrumentation: OTel instruments popular libraries (Express, gRPC, Redis, PostgreSQL clients) automatically, without changes to application code. Manual instrumentation: create custom spans for business logic. Exporter: determines where telemetry data is sent; switching the destination is a configuration change, not a code change.
OpenTelemetry became a CNCF incubating project in 2021, supported by Google, Microsoft, AWS, Datadog, New Relic, and 50+ other companies. This broad adoption makes it the safe choice for instrumentation: switching observability vendors requires only changing the exporter configuration, not re-instrumenting the codebase.
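The separation between instrumentation and exporter can be sketched in plain Python. This is a toy model of the pattern, not the OpenTelemetry API: `Tracer`, the exporter classes, the `TELEMETRY_EXPORTER` variable, and `apply_discount` are all hypothetical names invented for this example.

```python
import os
import time
from contextlib import contextmanager


class ConsoleExporter:
    """Prints finished spans to stdout."""

    def export(self, span):
        print(span)


class InMemoryExporter:
    """Keeps finished spans in a list, e.g. for tests."""

    def __init__(self):
        self.spans = []

    def export(self, span):
        self.spans.append(span)


EXPORTERS = {"console": ConsoleExporter, "memory": InMemoryExporter}


def exporter_from_env(environ=os.environ):
    """Pick the backend from configuration; instrumented code never names it."""
    return EXPORTERS[environ.get("TELEMETRY_EXPORTER", "console")]()


class Tracer:
    def __init__(self, exporter):
        self.exporter = exporter

    @contextmanager
    def span(self, name, **attributes):
        """Manual instrumentation: time a block and export it as one span."""
        start = time.perf_counter()
        try:
            yield attributes  # the block may add attributes while it runs
        finally:
            duration_ms = (time.perf_counter() - start) * 1000
            self.exporter.export(
                {"name": name, "duration_ms": duration_ms, "attributes": attributes}
            )


def apply_discount(tracer, total, percent):
    """Business logic wrapped in a custom span."""
    with tracer.span("apply_discount", percent=percent) as attrs:
        discounted = total * (100 - percent) / 100
        attrs["discounted_total"] = discounted
        return discounted


tracer = Tracer(exporter_from_env({"TELEMETRY_EXPORTER": "memory"}))
print(apply_discount(tracer, 200.0, 10))
```

`apply_discount` depends only on the tracer interface, so pointing the same code at a different backend means changing the configuration value, which is the property the exporter design aims for.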
Misconception: Observability and Monitoring are the same thing, just different terms
Monitoring answers pre-known questions ('is the service working?'). Observability allows asking arbitrary questions about system state without pre-defining what to measure. Monitoring: defined dashboards and alerts. Observability: explore any combination of logs, metrics, and traces to understand unknown failures.
A system can be well-monitored but not observable. Monitoring detects known failure modes. Observability enables debugging unknown failure modes - the class of problems that appear in production and were not anticipated during design.
A team uses Jaeger for tracing via the Jaeger SDK. A year later they decide to switch to Datadog. What changes are required?
Key Ideas
- **Structured Logs**: JSON format with trace_id enables machine parsing, cross-service correlation, and aggregation. Plain text logs are hard to query at scale.
- **Metrics (Prometheus)**: Counter/Gauge/Histogram/Summary for different measurement types. USE method for infrastructure, RED method for service health. Dashboards and alerts are built from metrics.
- **Distributed Tracing (Jaeger/OTel)**: trace_id propagated through all services, waterfall view of where time is spent in a request. The essential tool for debugging latency in microservices.
Related Topics
Observability is the foundation of all reliability and performance practices:
- SRE (Site Reliability Engineering): SLIs are measured through metrics. Observability is the measurement instrument for SLOs; without it, reliability targets are aspirational rather than enforced.
- Chaos Engineering: chaos experiments require good observability to watch system behavior under failure; you cannot interpret what you cannot measure.
Questions for Reflection
- The three observability pillars: Logs, Metrics, Traces. In which scenarios is each most valuable?
- Trace sampling (1-10%) reduces overhead but misses rare events. How would you resolve the tradeoff between cost and completeness?
- High-cardinality attributes (user_id, trace_id) are powerful for debugging but expensive for metric systems. How would you decide which attributes to index?