DevOps

Prometheus and Grafana: the observability stack

2021. Facebook (Meta). BGP misconfiguration. All data centers offline for 6 hours. USD 6 billion in market cap in one day. Engineers could not enter the building - the badge system ran on Facebook infrastructure. Monitoring detected the problem within seconds. But alerts went to internal channels that ran on the same infrastructure - and those were offline too. Observability is not only 'what broke' but also 'how we find out'.

Spotify manages 400+ microservices through a single Prometheus/Grafana stack: 10M+ metrics, 5000+ dashboards automatically created via Backstage (service catalog)
Cloudflare handles 3 trillion DNS queries per day. A Prometheus alert fired 12 seconds before the 2021 BGP leak became visible - the team began mitigation before mass complaints arrived
GitHub uses Prometheus to track all CI/CD metrics: build time, flaky test rate, queue depth. Alerts on degradation trigger automatic scaling of the runner pool

Metrics: four types and the pull model

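Prometheus was born at SoundCloud in 2012, inspired by Google's Borgmon. The key design decision is the pull model: Prometheus scrapes metrics over HTTP from each target's /metrics endpoint, exposed either by the application itself or by an exporter. This is the opposite of push systems such as StatsD or Graphite, where the application sends its metrics out. Advantage of pull: Prometheus knows when a service dies, because the scrape fails and the target's `up` metric drops to 0. With push, silence is ambiguous - the service may be dead or may simply have no events.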
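To make the pull model concrete, here is a minimal sketch of a scrape target written with the Python standard library only. The metric name `http_requests_total`, the `REQUESTS` dictionary, and the `serve` helper are assumptions made for illustration; a real service would normally use an official client library rather than render the text format by hand.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by HTTP status code.
REQUESTS = {"200": 0, "500": 0}


def render_metrics(requests_by_status):
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for status, value in sorted(requests_by_status.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Answers scrapes on /metrics; every other path returns 404."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Block forever serving scrapes; the port is an arbitrary choice."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling `serve()` and pointing a scrape job at `127.0.0.1:8000` would be the usual next step. If the process dies, the scrape fails, which is exactly the signal the pull model provides.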

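Four metric types. Counter: only increases, and resets to zero only when the process restarts (request count, error count). Gauge: moves in any direction (RAM, CPU, temperature). Histogram: counts observations in cumulative buckets, each labeled with an upper bound le (latency: le=0.01, le=0.1, le=1, le=+Inf), so the le=0.1 bucket also contains everything counted under le=0.01. Summary: quantiles computed on the client side (p50, p95, p99); they cannot be meaningfully aggregated across instances.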
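The difference between the types is easier to see in code. The classes below are a toy model written for this lesson, not the API of any Prometheus client library; the bucket bounds are in seconds and are an assumption.

```python
import bisect


class Counter:
    """Monotonic value: it can only go up (or restart from zero)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only increase")
        self.value += amount


class Gauge:
    """Point-in-time value that may move in either direction."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = float(value)


class Histogram:
    """Counts observations per bucket and reports them cumulatively."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is +Inf
        self.sum = 0.0

    def observe(self, value):
        # bisect_left places a value equal to a bound into that bound's
        # bucket, matching the "less than or equal" meaning of le.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value

    def cumulative(self):
        """Return (upper_bound, cumulative_count) pairs, ending with +Inf."""
        result, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count
            result.append((bound, running))
        return result
```

A Summary is left out on purpose: it would compute quantiles inside the process, which is the part that cannot be merged across instances later.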

RED method - three key metrics for any service: Rate (requests per second), Errors (error percentage), Duration (latency percentiles). USE method for infrastructure: Utilization, Saturation, Errors. Google SRE four golden signals: latency, traffic, errors, saturation. Start with RED - it surfaces problems fast.
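As an illustration of RED, the sketch below derives the three numbers from raw request records collected over one window. The record shape `(status_code, duration_seconds)`, the rule that 5xx counts as an error, and the nearest-rank percentile are all assumptions of this example; in Prometheus the same values would come from counters and histograms.

```python
def red_summary(requests, window_seconds):
    """Compute Rate, Errors and Duration for one observation window.

    requests: list of (status_code, duration_seconds) tuples.
    """
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p95": None}

    errors = sum(1 for status, _ in requests if status >= 500)
    durations = sorted(duration for _, duration in requests)

    # Nearest-rank p95, using an integer ceiling of 0.95 * total.
    rank = (95 * total + 99) // 100

    return {
        "rate": total / window_seconds,   # requests per second
        "error_ratio": errors / total,    # fraction of failed requests
        "p95": durations[rank - 1],       # seconds
    }
```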

Why does a Counter never decrease?

PromQL: query language for time series

PromQL is a functional language for time series. Not SQL. Instead of tables there are instant vectors (values at time T) and range vectors (values over a period [T-5m, T]). Functions operate on these vectors.

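histogram_quantile(0.99, ...) estimates p99 from Histogram buckets. Important: aggregate the bucket rates with sum(...) by (le) before calling histogram_quantile, for example histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)), where the metric name is illustrative. Computing a quantile per instance and then averaging the results is not statistically valid: bucket counts can be added together, percentiles cannot. This is a common mistake in latency queries.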
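A small numeric experiment shows why the order matters. The function below is a simplified re-implementation of the bucket interpolation idea (linear within a bucket, lowest bucket starting at 0); it is a sketch for intuition, not the exact Prometheus algorithm, and the two instances are invented data.

```python
def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    buckets must be sorted by bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank and count > lower_count:
            if upper_bound == float("inf"):
                return lower_bound  # cannot interpolate into +Inf
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


def sum_by_le(per_instance):
    """Add bucket counts across instances, like sum(...) by (le)."""
    merged = {}
    for buckets in per_instance:
        for bound, count in buckets:
            merged[bound] = merged.get(bound, 0) + count
    return sorted(merged.items())


INF = float("inf")
# Invented data: a busy fast instance and a quiet slow one.
fast = [(0.1, 90), (0.5, 100), (1.0, 100), (INF, 100)]
slow = [(0.1, 0), (0.5, 0), (1.0, 10), (INF, 10)]

correct = histogram_quantile(0.99, sum_by_le([fast, slow]))
averaged = (histogram_quantile(0.99, fast) + histogram_quantile(0.99, slow)) / 2
```

With this data the merged estimate is about 0.945 s, while the average of the two per-instance values is about 0.73 s: averaging gives the quiet instance the same weight as the busy one.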

irate vs rate: rate() averages over the full range (5m). irate() uses only the last two samples - reacts quickly to spikes but is noisy. Rule: rate() for alerting (fewer false positives), irate() for dashboards where fast response matters. For p99 histogram_quantile always use rate(), not irate().
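The difference between the two functions can be sketched on a list of `(timestamp_seconds, counter_value)` samples. These helpers are deliberately simplified: they ignore counter resets and the extrapolation that the real rate() performs, and the sample data is invented.

```python
def rate(samples):
    """Average per-second increase over the whole window."""
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


def irate(samples):
    """Per-second increase between the last two samples only."""
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    return (v_last - v_prev) / (t_last - t_prev)


# Invented 5-minute window scraped every 15 s: a steady 1 request/s,
# then a burst of 165 requests in the final scrape interval.
samples = [(t, float(t)) for t in range(0, 300, 15)] + [(300, 450.0)]

window_rate = rate(samples)    # 1.5 requests/s
instant_rate = irate(samples)  # 11.0 requests/s
```

The burst barely moves rate() but dominates irate(), which is why the former suits alert thresholds and the latter suits zoomed-in graphs.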

Why is sum(...) by (le) needed before histogram_quantile?

Grafana: dashboards as code

Grafana is the de-facto standard for Prometheus visualization. Dashboards, variables, templating, plugins for 50+ datasources. The enterprise pain point: dashboards are created through the UI and stored in a database. No version control, no code review, no reproducibility.

Dashboard as Code solutions: Grafonnet (Jsonnet library with official support), Grafana Terraform provider, Grafana API + Python/TypeScript scripts. Grafonnet: Jsonnet code compiles to dashboard JSON, stored in git, deployed via CI/CD.
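The script-based option can be sketched as follows. The JSON fields are a simplified subset of the Grafana dashboard model, and the metric names, the `service` label, and the `uid` scheme are assumptions of this example; a real dashboard would carry more fields (datasource, schema version, and so on).

```python
import json


def timeseries_panel(panel_id, title, expr, y):
    """One time series panel with a single PromQL target."""
    return {
        "id": panel_id,
        "type": "timeseries",
        "title": title,
        "gridPos": {"h": 8, "w": 24, "x": 0, "y": y},
        "targets": [{"refId": "A", "expr": expr}],
    }


def build_dashboard(service):
    """Build a RED dashboard for one service as a plain dict."""
    sel = f'service="{service}"'
    queries = [
        ("Rate", f"sum(rate(http_requests_total{{{sel}}}[$__rate_interval]))"),
        ("Errors",
         f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[$__rate_interval]))'
         f" / sum(rate(http_requests_total{{{sel}}}[$__rate_interval]))"),
        ("Duration p99",
         "histogram_quantile(0.99, sum(rate("
         f"http_request_duration_seconds_bucket{{{sel}}}[$__rate_interval])) by (le))"),
    ]
    panels = [
        timeseries_panel(index + 1, title, expr, y=index * 8)
        for index, (title, expr) in enumerate(queries)
    ]
    return {"uid": f"{service}-red", "title": f"{service} - RED", "panels": panels}


def to_json(dashboard):
    """Serialize with sorted keys so that git diffs stay stable."""
    return json.dumps(dashboard, indent=2, sort_keys=True)
```

Writing `to_json(build_dashboard("checkout"))` to a file in the repository gives reviewers a readable diff for every dashboard change.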

$__rate_interval is a built-in Grafana variable for Prometheus queries. It is computed from the scrape interval and the panel's query step ($__interval), which follows the selected time range, and it always covers at least four scrape intervals so that rate() has enough samples. Better than hardcoding [5m] - the query stays valid whether the step is 15 seconds or an hour.
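The calculation can be written down in a few lines. This sketch assumes the formula given in the Grafana documentation, max($__interval + scrape interval, 4 * scrape interval); the function and parameter names are made up for the example.

```python
def rate_interval(interval_s, scrape_interval_s):
    """Approximate $__rate_interval in seconds.

    interval_s: the panel's query step ($__interval).
    scrape_interval_s: how often Prometheus scrapes the target.
    """
    return max(interval_s + scrape_interval_s, 4 * scrape_interval_s)


# Zoomed in: the step is small, so four scrape intervals win.
zoomed_in = rate_interval(15, 15)      # 60 seconds
# Zoomed out: the step dominates, so no samples are skipped.
zoomed_out = rate_interval(3600, 15)   # 3615 seconds
```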

Why store Grafana dashboards as code in git?

Alerting: when to wake someone at 3am

Alert fatigue occurs when there are so many alerts that they stop being noticed. The Google SRE book: an alert must require immediate human action. If an alert can be ignored until morning - either lower its severity, automate the remediation, or delete it.

Alertmanager receives alerts from Prometheus and decides who gets notified. Silencing: mute an alert during maintenance. Inhibition: suppress a child alert when the parent is already firing. Grouping: merge 20 alerts from one incident into a single message. Routing: critical -> PagerDuty, warning -> Slack, info -> Grafana only.
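The three mechanisms can be modeled in a few lines of Python. This is a toy model of the behavior, not Alertmanager's configuration format or code; alerts are plain dicts, and the alert names, label names, and receiver names are assumptions.

```python
from collections import defaultdict

# Receiver names mirror the routing described in the text.
RECEIVERS = {"critical": "pagerduty", "warning": "slack"}
DEFAULT_RECEIVER = "grafana-only"


def route(alert):
    """Pick a receiver from the alert's severity label."""
    return RECEIVERS.get(alert.get("severity"), DEFAULT_RECEIVER)


def inhibit(alerts, parent, child, equal="cluster"):
    """Drop child alerts where a parent fires with the same label value."""
    firing = {a[equal] for a in alerts if a["alertname"] == parent}
    return [
        a for a in alerts
        if not (a["alertname"] == child and a[equal] in firing)
    ]


def group(alerts, by=("alertname", "cluster")):
    """Bundle alerts that share the group-by labels into one notification."""
    groups = defaultdict(list)
    for alert in alerts:
        groups[tuple(alert[label] for label in by)].append(alert)
    return dict(groups)


# Invented incident: a whole cluster is down, so are its 20 instances.
alerts = [{"alertname": "ClusterDown", "cluster": "eu", "severity": "critical"}]
alerts += [
    {"alertname": "InstanceDown", "cluster": "eu", "severity": "warning",
     "instance": f"node-{i}"}
    for i in range(20)
]
alerts.append({"alertname": "InstanceDown", "cluster": "us",
               "severity": "warning", "instance": "node-0"})

remaining = inhibit(alerts, parent="ClusterDown", child="InstanceDown")
```

Without inhibition, grouping alone would still collapse the 20 `InstanceDown` alerts for the `eu` cluster into one message; with inhibition they are not sent at all while `ClusterDown` is firing.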

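SLA-based alerting: track burn rate, not absolute values. Error budget: a 99.9% availability target leaves a 0.1% budget, about 43.8 minutes per month. Burn rate is the speed at which that budget is consumed relative to plan. Burn rate 1 = the budget is exhausted exactly at month end. Burn rate 14 = exhausted in about 2 days. Alert when burn rate > 14 (fast exhaustion) - the approach from the Google SRE Workbook chapter on alerting on SLOs, which uses 14.4 because that rate consumes 2% of a 30-day budget in one hour.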
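The arithmetic behind these numbers is short enough to check by hand. The function names below are made up for this sketch, and the average month length of 30.4375 days (365.25 / 12) is an assumption chosen to reproduce the 43.8-minute figure.

```python
def error_budget_minutes(target, period_days=30.4375):
    """Minutes of allowed unavailability in one period."""
    return (1 - target) * period_days * 24 * 60


def burn_rate(error_ratio, target):
    """How many times faster than planned the budget is being spent."""
    return error_ratio / (1 - target)


def days_to_exhaustion(rate, period_days=30):
    """Days until the budget is gone if the burn rate stays constant."""
    return period_days / rate


budget = error_budget_minutes(0.999)   # about 43.8 minutes per month
current = burn_rate(0.014, 0.999)      # 1.4% errors -> burn rate of about 14
remaining_days = days_to_exhaustion(14)  # about 2.1 days
```

An error ratio of 1.4% looks harmless as an absolute value, but against a 0.1% budget it means the whole month's allowance is gone in roughly two days.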

Myth: more alerts = better monitoring. Cover every possible failure.

Reality: too many alerts cause alert fatigue - engineers stop responding to all of them. Alert quality matters more than quantity.

Google SRE data: teams with more than 10 alerts per shift have worse response times than teams with fewer than 5. Principles: an alert requires action right now, is actionable (has a runbook), and is not duplicated. Start with 3 alerts (errors, latency, saturation) and add more only with justification.

What is the purpose of the `for: 5m` parameter in an alert rule?

Key Ideas

4 metric types: Counter (grows), Gauge (any direction), Histogram (distribution), Summary (percentiles)
PromQL: rate() for counters, histogram_quantile(0.99, sum(...) by (le)) for p99 latency
Dashboards as code: git + CI/CD. $__rate_interval instead of hardcoded [5m]
Alerting: quality over quantity. `for: 5m` against flapping. Burn rate for SLA-based alerting

Questions for Reflection

How do you distinguish a metric that requires an immediate alert from one that belongs only on a dashboard?
When is Summary preferable to Histogram, and why do large systems usually choose Histogram?
How do you implement SLA burn rate alerting for a service with a 99.95% availability target?

Related Lessons

devops-15 — Prometheus is deployed via IaC (Pulumi/Helm)
bt-26-observability — Application observability - Prometheus for infrastructure
cloud-15 — Compliance monitoring via Prometheus alerts
cloud-16 — WAF Operational Excellence - the observability trinity
stat-31-eda
net-45-network-monitoring