System Design
Observability
The three pillars of Observability
**Observability vs Monitoring**

- **Monitoring:** you decide upfront what to watch -> 'CPU > 80%' -> alert
- **Observability:** you can ask ANY question about the system -> 'Why are requests from users in Germany slow on Tuesdays?'

Observability is the ability to understand internal state from external outputs.
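**How the three pillars connect:**

All three pillars (logs, metrics, traces) are stitched together by **correlation IDs** (`trace_id`, `request_id`): the same ID is attached to every signal a request produces, so you can move from a slow trace to the log lines written while it ran.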
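To make the role of the correlation ID concrete, here is a minimal sketch that gathers every log line and span sharing one `trace_id`. The record shapes (`LogRecord`, `SpanRecord`) and the `correlate` helper are hypothetical and only illustrate the join; real backends index these fields for you.

```typescript
// Hypothetical record shapes; real systems define their own schemas.
interface LogRecord {
  traceId: string;
  level: string;
  message: string;
}

interface SpanRecord {
  traceId: string;
  service: string;
  durationMs: number;
}

// Collect everything recorded for one request by its correlation ID.
function correlate(traceId: string, logs: LogRecord[], spans: SpanRecord[]) {
  return {
    traceId,
    logs: logs.filter((l) => l.traceId === traceId),
    spans: spans.filter((s) => s.traceId === traceId),
  };
}

const sampleLogs: LogRecord[] = [
  { traceId: "abc", level: "ERROR", message: "Failed to save" },
  { traceId: "xyz", level: "INFO", message: "Request processed" },
];
const sampleSpans: SpanRecord[] = [
  { traceId: "abc", service: "orders", durationMs: 230 },
  { traceId: "abc", service: "db", durationMs: 180 },
  { traceId: "xyz", service: "orders", durationMs: 12 },
];

const view = correlate("abc", sampleLogs, sampleSpans);
```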
Logging
**Structured logging is the foundation of observability**

Bad: `console.log('User login failed for user123')`

Good: emit a structured record (for example, JSON) with named fields instead of a free-form string.

Benefits:
- You can search by fields
- You can aggregate
- The format is machine-readable
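A structured version of the login-failure message could look like the sketch below. The field names (`userId`, `reason`) and the `formatLog` helper are assumptions chosen for illustration, not a specific logging library's API.

```typescript
// Field values are restricted to JSON-friendly primitives in this sketch.
type Fields = Record<string, string | number | boolean>;

// Serialise one log event as a single JSON line with named fields.
// `now` is injectable so the output is reproducible in tests.
function formatLog(
  level: string,
  message: string,
  fields: Fields,
  now: Date = new Date()
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...fields,
  });
}

// Hypothetical fields for the failed-login example.
const loginFailure = formatLog("WARN", "User login failed", {
  userId: "user123",
  reason: "invalid_password",
});
console.log(loginFailure);
```

Because `userId` is a field rather than part of the message text, a log backend can filter or count by it without parsing strings.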
**Log levels:**

| Level | When to use | Example |
|---|---|---|
| **TRACE** | Deep debug | Function entry/exit |
| **DEBUG** | Development debug | Variable values |
| **INFO** | Normal operation | Request processed |
| **WARN** | Potential issue | Retry attempt |
| **ERROR** | Operation failed | Failed to save |
| **FATAL** | System down | Cannot connect to DB |
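The levels are ordered by severity, which is what lets a service drop everything below a configured threshold. A minimal sketch of that filter follows; `shouldLog` is a hypothetical helper, and the ordering follows the table above.

```typescript
// Levels in ascending severity, matching the table above.
const LEVELS = ["TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL"] as const;
type Level = (typeof LEVELS)[number];

// Emit a record only if it is at least as severe as the threshold.
function shouldLog(level: Level, threshold: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(threshold);
}

// With an INFO threshold, DEBUG output is suppressed
// while WARN and above still get through.
const emitted = LEVELS.filter((l) => shouldLog(l, "INFO"));
```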
Metrics
**Metrics - numeric measurements of the system**

Metric types:
1. **Counter** - monotonically increases (requests_total)
2. **Gauge** - current value (cpu_usage, active_connections)
3. **Histogram** - distribution (request_duration)
4. **Summary** - client-side percentiles
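The difference between the first three types is easiest to see in code. This is a simplified sketch, not a real client library: the class names are illustrative, and the histogram uses cumulative buckets in the style Prometheus uses, which is an assumption about the backend.

```typescript
// Counter: can only go up.
class Counter {
  private value = 0;
  inc(by: number = 1): void {
    if (by < 0) throw new Error("a counter can only increase");
    this.value += by;
  }
  get(): number {
    return this.value;
  }
}

// Gauge: a current value that can move in either direction.
class Gauge {
  private value = 0;
  set(v: number): void {
    this.value = v;
  }
  get(): number {
    return this.value;
  }
}

// Histogram: counts observations per bucket. Buckets are cumulative,
// so each bucket counts every observation <= its upper bound.
class Histogram {
  private readonly upperBounds: number[];
  private readonly bucketCounts: number[];
  private sum = 0;
  private count = 0;

  constructor(upperBounds: number[]) {
    this.upperBounds = upperBounds.slice().sort((a, b) => a - b);
    this.bucketCounts = this.upperBounds.map(() => 0);
  }

  observe(value: number): void {
    this.sum += value;
    this.count += 1;
    for (let i = 0; i < this.upperBounds.length; i++) {
      if (value <= this.upperBounds[i]) this.bucketCounts[i] += 1;
    }
  }

  snapshot(): { buckets: number[]; sum: number; count: number } {
    return { buckets: this.bucketCounts.slice(), sum: this.sum, count: this.count };
  }
}
```

A summary would instead compute the percentiles inside the process, which is why its results cannot be aggregated across instances the way histogram buckets can.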
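**Prometheus - the de facto metrics standard**

**Pull model:**
- Prometheus periodically (for example, every 15 s) does an HTTP GET
- Services expose a `/metrics` endpoint

**Metrics format:** plain text, one sample per line in the form `metric_name{label="value"} value`, optionally preceded by `# HELP` and `# TYPE` comment lines.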
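As an illustration of what a `/metrics` response body contains, the sketch below renders counters and gauges in the Prometheus text exposition format. The `renderMetrics` function, the type names, and the sample metric are assumptions for this example; in practice a client library generates this output.

```typescript
interface Sample {
  labels: Record<string, string>;
  value: number;
}

interface MetricFamily {
  name: string;
  help: string;
  type: "counter" | "gauge";
  samples: Sample[];
}

// Label values must escape backslash, double quote, and newline.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render metric families as Prometheus text exposition lines.
function renderMetrics(families: MetricFamily[]): string {
  const lines: string[] = [];
  for (const family of families) {
    lines.push(`# HELP ${family.name} ${family.help}`);
    lines.push(`# TYPE ${family.name} ${family.type}`);
    for (const sample of family.samples) {
      const labels = Object.keys(sample.labels)
        .map((k) => `${k}="${escapeLabelValue(sample.labels[k])}"`)
        .join(",");
      lines.push(
        labels.length > 0
          ? `${family.name}{${labels}} ${sample.value}`
          : `${family.name} ${sample.value}`
      );
    }
  }
  return lines.join("\n") + "\n";
}

// Hypothetical metric used only to show the output shape.
const body = renderMetrics([
  {
    name: "http_requests_total",
    help: "Total HTTP requests",
    type: "counter",
    samples: [{ labels: { method: "GET", status: "200" }, value: 1027 }],
  },
]);
```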
**RED Method - the key metrics for services:**

| Metric | What it measures | PromQL example |
|---|---|---|
| **R**ate | Requests per second | `rate(http_requests_total[5m])` |
| **E**rrors | Error rate | `rate(http_errors_total[5m])` |
| **D**uration | Latency (p50, p99) | `histogram_quantile(0.99, ...)` |

**USE Method - for resources (CPU, memory, disk):**
- **U**tilisation - share of time busy
- **S**aturation - waiting queue
- **E**rrors - number of errors
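The Rate and Errors rows both come down to "how fast did a counter grow over the window". The sketch below shows that arithmetic on two raw counter samples. It is a simplification: Prometheus's `rate()` also handles multiple resets and extrapolates to the window edges, and the helper names and sample values here are hypothetical.

```typescript
interface CounterSample {
  timestampSec: number;
  value: number;
}

// Average per-second increase between two samples of one counter.
function perSecondRate(first: CounterSample, last: CounterSample): number {
  const elapsed = last.timestampSec - first.timestampSec;
  if (elapsed <= 0) throw new Error("samples must be in time order");
  // A counter that went down was reset (e.g. a restart): count from zero.
  const increase =
    last.value >= first.value ? last.value - first.value : last.value;
  return increase / elapsed;
}

// Share of requests that failed over the same window.
function errorRatio(requestRate: number, errorRate: number): number {
  return requestRate === 0 ? 0 : errorRate / requestRate;
}

// Made-up samples taken five minutes (300 s) apart.
const requestRate = perSecondRate(
  { timestampSec: 0, value: 1000 },
  { timestampSec: 300, value: 4000 }
); // 3000 requests / 300 s = 10 requests per second
const errorRate = perSecondRate(
  { timestampSec: 0, value: 10 },
  { timestampSec: 300, value: 40 }
);
const ratio = errorRatio(requestRate, errorRate); // roughly 1% of requests
```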
Distributed Tracing
**Tracing - the path of a request through the system**

In a microservice architecture a single user request hops through 5 to 20 services. How do you tell where the latency lives?

**Terms:**
- **Trace** - the full path of one request
- **Span** - one operation (service call, DB query)
- **Context** - trace_id + span_id, propagated between services
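One way to answer "where does the latency live" is to compute each span's self time: its own duration minus the time spent in its direct children. The sketch below assumes a hypothetical `Span` shape and that child spans of one parent do not run in parallel; tracing backends do a more careful version of this.

```typescript
// Hypothetical span shape; parentId is null for the root span.
interface Span {
  traceId: string;
  spanId: string;
  parentId: string | null;
  name: string;
  startMs: number;
  endMs: number;
}

// Time spent in the span itself, excluding its direct children.
// Assumes the children run one after another, not concurrently.
function selfTimeMs(span: Span, all: Span[]): number {
  const childTime = all
    .filter((s) => s.parentId === span.spanId)
    .reduce((sum, child) => sum + (child.endMs - child.startMs), 0);
  return span.endMs - span.startMs - childTime;
}

// The span where the trace spent the most time of its own.
function slowestBySelfTime(spans: Span[]): Span {
  if (spans.length === 0) throw new Error("trace has no spans");
  return spans.reduce((worst, s) =>
    selfTimeMs(s, spans) > selfTimeMs(worst, spans) ? s : worst
  );
}

// Made-up trace: gateway -> orders service -> database query.
const trace: Span[] = [
  { traceId: "t1", spanId: "a", parentId: null, name: "gateway", startMs: 0, endMs: 250 },
  { traceId: "t1", spanId: "b", parentId: "a", name: "orders", startMs: 10, endMs: 240 },
  { traceId: "t1", spanId: "c", parentId: "b", name: "SELECT orders", startMs: 20, endMs: 200 },
];

const bottleneck = slowestBySelfTime(trace);
```

In this example the root span lasts 250 ms, but only 20 ms of that is the gateway's own work; most of the time sits in the database query.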
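**Context propagation:** the trace_id travels between services in request headers. The caller injects its current context into the outgoing request; the callee extracts it and starts a child span under the same trace_id.

**W3C Trace Context** is the standard: `traceparent: 00-{trace-id}-{span-id}-{flags}`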
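The `traceparent` value is four lowercase-hex fields separated by hyphens: a 2-character version, a 32-character trace ID, a 16-character span ID, and 2-character flags whose lowest bit means "sampled". A minimal parser and formatter might look like this; the function and type names are hypothetical, and the sketch deliberately accepts only version `00`.

```typescript
interface TraceContext {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

const TRACEPARENT = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Extract the context from an incoming traceparent header value.
// Returns null for anything malformed so the caller can start a new trace.
function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT.exec(header.trim());
  if (match === null) return null;
  const [, version, traceId, spanId, flags] = match;
  if (version !== "00") return null; // this sketch handles version 00 only
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the header for an outgoing call: same trace, the caller's new span.
function formatTraceparent(ctx: TraceContext): string {
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

const incoming = parseTraceparent(
  "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
);
```

A service that receives this header keeps `traceId`, generates a fresh span ID for its own work, and sends the result of `formatTraceparent` downstream.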
**Sampling - you cannot store 100% of traces:**

| Strategy | How it decides |
|---|---|
| **Head-based** | Decision at the start of the trace (random 1%) |
| **Tail-based** | Decision after completion (all errors, all slow) |
| **Adaptive** | Dynamic, follows load |

Tail-based is better for debugging (it keeps the problematic traces), but it needs a buffer in the collector, because every span of a trace has to be held until the trace completes and the decision can be made.
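The two main strategies differ in what information is available when the decision is made. In the sketch below, the head-based decision is derived from the trace ID so that every service reaches the same verdict; that is a common approach but an assumption here, as are the function names and the latency threshold. It also assumes trace IDs are uniformly random hex strings.

```typescript
// Head-based: decide up front, knowing only the trace ID.
// Maps the first 8 hex digits to [0, 1) and keeps the lowest `rate` share.
function headSample(traceId: string, rate: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return bucket < rate; // NaN (malformed ID) compares false, so it is dropped
}

interface FinishedTrace {
  traceId: string;
  durationMs: number;
  hasError: boolean;
}

// Tail-based: decide after completion, when errors and latency are known.
function tailSample(trace: FinishedTrace, slowThresholdMs: number): boolean {
  return trace.hasError || trace.durationMs >= slowThresholdMs;
}

// A fast, successful trace is dropped by the tail sampler;
// a failed one is always kept.
const keepFast = tailSample(
  { traceId: "4bf92f3577b34da6a3ce929d0e0e4736", durationMs: 40, hasError: false },
  500
);
const keepFailed = tailSample(
  { traceId: "4bf92f3577b34da6a3ce929d0e0e4736", durationMs: 40, hasError: true },
  500
);
```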
Alerting and SLO
**SLI, SLO, SLA - the language of reliability**

**SLI (Service Level Indicator)** - the metric
- '99.5% of requests complete in <200ms'
- '0.1% error rate'

**SLO (Service Level Objective)** - the target
- 'Availability must be at least 99.9%'

**SLA (Service Level Agreement)** - the contract
- 'If the SLO is breached, refund the customer'
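An SLO implies an error budget: the amount of failure the target still permits. The arithmetic is sketched below; the 30-day window and the request counts are assumptions chosen for the example, and the helper names are hypothetical.

```typescript
// Downtime an availability SLO allows over a window, in minutes.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Fraction of the request-based error budget still unspent.
// Negative means the SLO has been breached for this window.
function budgetRemaining(
  totalRequests: number,
  failedRequests: number,
  sloTarget: number
): number {
  const allowedFailures = totalRequests * (1 - sloTarget);
  if (allowedFailures === 0) return failedRequests === 0 ? 1 : -Infinity;
  return 1 - failedRequests / allowedFailures;
}

// 99.9% over 30 days leaves 0.1% of 43,200 minutes: about 43.2 minutes.
const monthlyBudget = errorBudgetMinutes(0.999, 30);

// 250 failures out of 1,000,000 requests uses about a quarter of the
// roughly 1,000 failures that a 99.9% target allows.
const remaining = budgetRemaining(1_000_000, 250, 0.999);
```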
**Alerting best practices:**

**Alert fatigue** - too many alerts and the team ignores them.

**Actionable alerts:**
- Every alert needs an action
- No alerts for 'interesting information'
- Severity levels: critical (page), warning (ticket), info (log)

**Multi-window alerts:** evaluate the same condition over a short and a long window and alert only when both are breached. The short window (5 m) catches spikes. The long window (1 h) filters noise.
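A multi-window check can be sketched as a condition that must hold over both windows at once. The `WindowStats` shape, the function names, and the 1% threshold below are assumptions for illustration; in a Prometheus setup this logic would normally live in an alerting rule instead of application code.

```typescript
// Request and error counts observed over one time window.
interface WindowStats {
  requests: number;
  errors: number;
}

function windowErrorRatio(w: WindowStats): number {
  return w.requests === 0 ? 0 : w.errors / w.requests;
}

// Fire only when both the 5-minute and the 1-hour error ratio exceed
// the threshold: the problem is significant and still happening.
function shouldAlert(
  short5m: WindowStats,
  long1h: WindowStats,
  threshold: number
): boolean {
  return (
    windowErrorRatio(short5m) > threshold &&
    windowErrorRatio(long1h) > threshold
  );
}

const THRESHOLD = 0.01; // assumed: alert above 1% errors

// Brief blip: bad over 5 minutes, negligible over the hour.
const blip = shouldAlert(
  { requests: 1000, errors: 50 },
  { requests: 12000, errors: 60 },
  THRESHOLD
);

// Sustained incident: bad over both windows.
const sustained = shouldAlert(
  { requests: 1000, errors: 50 },
  { requests: 12000, errors: 300 },
  THRESHOLD
);

// Already recovered: the hour still looks bad, the last 5 minutes are clean.
const recovered = shouldAlert(
  { requests: 1000, errors: 0 },
  { requests: 12000, errors: 300 },
  THRESHOLD
);
```

Only the sustained case fires. Requiring both windows keeps a momentary blip from paging anyone, and it also clears the alert soon after the service recovers.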