Real-Time Backend

Live Dashboards

A Grafana dashboard with 20 panels, a 30s refresh, and 1,000 concurrent users fires 40,000 queries per minute at the metrics database. Without the right architecture, monitoring kills itself.

  • **Grafana Cloud** processes over 10B metrics per day; Grafana Labs uses its own product to monitor Grafana Cloud - 'dogfooding' at production scale
  • **Datadog** renders dashboards for Netflix, Samsung, and Airbnb simultaneously; its query engine processes 10+ trillion data points per day with <1s p99 latency
  • **Cloudflare Radar** is a public live dashboard showing global internet traffic in real time; it aggregates 100M+ DNS queries per second into a visualization with 5-minute windows
  • **AWS CloudWatch** serves over 1 trillion API requests per month; live dashboards for Lambda, EC2, and RDS refresh with under 60s of delay

Metrics Streaming

Metrics streaming is the continuous, real-time delivery of numeric values from a system to a dashboard. Grafana Cloud ingests over 10B metric points per day from teams worldwide. Datadog processes 10+ trillion data points daily from customers like Samsung, Airbnb, and Peloton. The key difference from classic polling: the server pushes each value as soon as it is produced, instead of waiting for the client to ask for it.

Push vs pull is the fundamental architecture choice. Prometheus uses pull: the server scrapes each target's /metrics endpoint on a fixed interval, for example every 15s. That simplifies service discovery, but a new value can be up to one scrape interval (15s here) old before the server even sees it. Grafana Live (SSE/WebSocket) and InfluxDB's push model deliver in under 1s. Operational dashboards ('what is happening right now') need push; capacity planning is fine with pull at 1-minute granularity.

  • **SSE (Server-Sent Events)**: one-way push, automatic reconnect, HTTP/2 multiplexing - ideal for dashboards

  • **WebSocket**: bidirectional, needed for interactive actions (drill-down queries, filters)
  • **Long polling**: fallback for environments without SSE/WS; adds 100-500 ms per request
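  • **StatsD protocol**: UDP-based, fire-and-forget; the Datadog Agent (DogStatsD) listens on UDP port 8125 by default, while Prometheus ingests StatsD through a separate statsd_exporter rather than node_exporter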

A dashboard shows CPU load for 50 servers. It needs an update every 1 second with no interactivity. What to pick?
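
As a rough illustration of two of the wire formats listed above, the sketch below frames one metric update as an SSE message and as a StatsD gauge datagram. The helper names `sse_event` and `statsd_gauge`, the event name `metric`, and the metric name `cpu.load` are assumptions made for this example, not part of any library.

```python
import json
from typing import Optional


def sse_event(data: dict, event: str = "metric", event_id: Optional[str] = None) -> str:
    """Frame one Server-Sent Events message: field lines, then a blank line."""
    lines = []
    if event_id is not None:
        # The browser echoes this back as Last-Event-ID when it reconnects.
        lines.append(f"id: {event_id}")
    lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data, separators=(',', ':'))}")
    # The empty line after the fields terminates the event.
    return "\n".join(lines) + "\n\n"


def statsd_gauge(name: str, value: float) -> bytes:
    """Encode a StatsD gauge as a datagram payload: <name>:<value>|g"""
    return f"{name}:{value}|g".encode("ascii")


frame = sse_event({"host": "web-1", "cpu": 0.42}, event_id="7")
packet = statsd_gauge("cpu.load", 0.42)
```

An SSE endpoint would write such frames to a response served as `text/event-stream`; the StatsD bytes would be sent as a single UDP datagram with no acknowledgement.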

    Aggregation Windows

    Raw metrics arrive at thousands of points per second; displaying all of them is pointless. An aggregation window groups points over a period (1s, 1m, 5m) and computes an aggregate: avg, max, p99. InfluxDB combines downsampling with retention policies: raw data lives 7 days, 1-minute aggregates 30 days, hourly aggregates 1 year. Each tier is far sparser than the one before it (an hourly series has 3,600x fewer points than 1-second raw data), which keeps long-term storage a small fraction of what raw data would need.
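
A rough worked example of the retention tiers above, counting stored points for a single series. It assumes one raw point per second per series, which the text does not state, so treat the numbers as illustrative only.

```python
SECONDS_PER_DAY = 86_400

# (tier, seconds between points, retention in days). Retention values come
# from the text; the 1-second raw resolution is an assumption of this sketch.
TIERS = [
    ("raw", 1, 7),
    ("1m", 60, 30),
    ("1h", 3600, 365),
]


def points_stored(resolution_s: int, retention_days: int) -> int:
    """Number of points one series keeps at a given resolution and retention."""
    return retention_days * SECONDS_PER_DAY // resolution_s


per_tier = {name: points_stored(res, days) for name, res, days in TIERS}
tiered_total = sum(per_tier.values())
raw_for_a_year = points_stored(1, 365)
```

Under these assumptions the tiers hold 604,800 + 43,200 + 8,760 = 656,760 points, about 48x fewer than the 31,536,000 points of a full year of raw data. The exact factor depends on the raw rate and, above all, on how long the raw tier is kept.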

    Picking the window size trades detail for stability. Grafana Cloud recommends: realtime dashboard (last 5 minutes) - 5s window; daily view - 1m window; weekly view - 1h window. Too small and the chart is noisy with spikes; too large and it hides important peaks (a 1-hour p99 can look fine even when everything was broken for 30 seconds).

    • **Tumbling window**: non-overlapping intervals [0:10), [10:20); used for billing counters
    • **Sliding window**: compute an aggregate over the last N seconds every M seconds; used for rate-of-change metrics
    • **p99 vs avg**: the mean hides tail latency; Datadog recommends always showing p99 for SLO metrics
    • **Downsampling**: raw -> 1m -> 1h -> 1d; each tier lives longer at lower resolution
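
The window types and the p99-vs-avg point above can be sketched in a few lines. This is a minimal in-memory version over `(timestamp, value)` pairs; the function names and the nearest-rank percentile definition are choices made for this example, and real time-series databases may compute percentiles differently.

```python
from collections import defaultdict


def p99(values):
    """Nearest-rank 99th percentile, using integer math to avoid float rounding."""
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def tumbling(points, size_s):
    """Aggregate points into non-overlapping buckets [k*size_s, (k+1)*size_s)."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[int(ts // size_s) * size_s].append(value)
    return {
        start: {"avg": sum(v) / len(v), "max": max(v), "p99": p99(v)}
        for start, v in sorted(buckets.items())
    }


def sliding_avg(points, window_s, step_s):
    """Every step_s seconds, average the values of the preceding window_s seconds."""
    if not points:
        return []
    first = min(ts for ts, _ in points)
    last = max(ts for ts, _ in points)
    out = []
    t = first + window_s
    while t <= last + step_s:
        vals = [v for ts, v in points if t - window_s <= ts < t]
        if vals:
            out.append((t, sum(vals) / len(vals)))
        t += step_s
    return out


# 98 fast requests and 2 slow ones: the mean is 120 ms, the p99 is 1100 ms.
latencies_ms = [100] * 98 + [1100] * 2
mean_ms = sum(latencies_ms) / len(latencies_ms)
tail_ms = p99(latencies_ms)
```

The last three lines show why the mean alone is misleading: two slow requests out of a hundred barely move the average but dominate the tail.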

    API response time averages 120 ms, but users complain about slowness. Why?

    Sparklines

    A sparkline is a compact inline chart with no axes or labels, showing a trend on one table row. Grafana uses sparklines in table panels to show a metric trend next to the current value. Datadog's widget system renders thousands of sparklines on one dashboard, each showing the last 2 hours of aggregated data.

    Sparklines work because of Tufte's data-density principle: maximum information in minimum space. A 100x20 pixel area fits 100 data points, enough to read trend, seasonality, and anomalies. Critical: sparklines must share a consistent Y scale within a dashboard, otherwise you cannot compare servers. By default Grafana normalizes each sparkline independently, which has to be disabled for comparative dashboards.
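
The effect of per-series versus shared Y scaling is easy to see with a text sparkline. The sketch below is a toy renderer using Unicode block characters; the `sparkline` function and the sample CPU values are made up for illustration and are unrelated to how Grafana draws its panels.

```python
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values, lo=None, hi=None):
    """Render values as block characters scaled to [lo, hi].

    With lo/hi omitted, the series is scaled to its own min/max
    (per-series normalization). Pass shared bounds to make rows comparable.
    """
    lo = min(values) if lo is None else lo
    hi = max(values) if hi is None else hi
    span = (hi - lo) or 1  # avoid division by zero on a flat series
    chars = []
    for v in values:
        frac = min(max((v - lo) / span, 0.0), 1.0)
        chars.append(BARS[round(frac * (len(BARS) - 1))])
    return "".join(chars)


server_a = [10, 12, 15, 11]  # CPU %, low but varying
server_b = [80, 80, 80, 80]  # CPU %, high and flat

independent = (sparkline(server_a), sparkline(server_b))
shared = (sparkline(server_a, 0, 100), sparkline(server_b, 0, 100))
```

With independent scaling, server A renders as `▁▄█▂` and server B as `▁▁▁▁`, so the idle machine looks busier. With a shared 0 to 100 scale they become `▂▂▂▂` and `▇▇▇▇`, which matches the actual load.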

    A dashboard shows 50 servers with CPU sparklines. Server A's sparkline fills the full height (looks alarming) but its CPU is just 15%. Server B's sparkline is flat but CPU is 80%. What is wrong?

    Dashboard Architecture

    Grafana Cloud serves millions of dashboard requests per day. Live dashboards have a specific bottleneck: request fan-out. One dashboard with 20 panels fires 20 parallel queries to the backend on load. With 1,000 concurrent users that is 20,000 queries per refresh cycle against the metrics database from a single dashboard; at a 30s refresh, that is 40,000 per minute.

    Grafana uses query caching with a TTL matching the panel refresh interval. If 50 users watch one dashboard with a 30s refresh, the server runs 20 queries/30s, not 50 * 20 = 1,000 queries/30s, because results are cached and shared. ClickHouse and InfluxDB are optimized for this pattern: one heavy query is faster than thousands of small ones thanks to columnar storage and vectorized execution.
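
The fan-out and caching arithmetic above, written out as two small functions. The function names are illustrative, and the cached case assumes a perfectly shared cache whose TTL equals the refresh interval.

```python
def uncached_qpm(users: int, panels: int, refresh_s: float) -> float:
    """Queries per minute when every user re-runs every panel on each refresh."""
    return users * panels * 60 / refresh_s


def cached_qpm(panels: int, refresh_s: float) -> float:
    """Queries per minute when one shared result per panel serves all users."""
    return panels * 60 / refresh_s


storm = uncached_qpm(users=1000, panels=20, refresh_s=30)
shared = cached_qpm(panels=20, refresh_s=30)
```

For the 1,000-user example this gives 40,000 queries per minute without caching and 40 with it; the cached load no longer depends on the number of viewers.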

    • **BFF pattern**: Backend for Frontend aggregates N panel requests into one WebSocket stream
    • **Query deduplication**: cache identical queries within a single refresh interval
    • **Incremental updates**: send only new points, not the entire timeseries each time
    • **Variable timerange**: on zoom-out switch to downsampled data (Grafana helps here by widening the query step automatically through its $__interval variable)
    • **Alert-driven refresh**: when metrics are normal refresh every 30s; on anomaly refresh every 1s
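
Query deduplication from the list above can be sketched as a small TTL cache in front of the database. This is a minimal single-threaded illustration, not Grafana's implementation: the `QueryCache` class, its `run_query` callback, and the injectable `clock` are all names invented for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class QueryCache:
    """Share one backend result among all viewers for ttl_s seconds."""

    def __init__(self, run_query: Callable[[str], Any], ttl_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._run_query = run_query
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[str, Tuple[float, Any]] = {}
        self.backend_calls = 0  # counts real database hits

    def get(self, query: str) -> Any:
        now = self._clock()
        hit = self._entries.get(query)
        if hit is not None and now - hit[0] < self._ttl_s:
            return hit[1]  # still fresh: serve the shared result
        self.backend_calls += 1
        result = self._run_query(query)
        self._entries[query] = (now, result)
        return result


# 50 viewers load the same 20 panels within one 30s refresh interval.
fake_now = [0.0]
cache = QueryCache(lambda q: f"rows for {q}", ttl_s=30, clock=lambda: fake_now[0])
for _viewer in range(50):
    for panel in range(20):
        cache.get(f"panel-{panel}")
calls_first_cycle = cache.backend_calls
```

Here the 1,000 panel loads result in only 20 backend calls. A production version would also need locking and request coalescing so that concurrent misses for the same query do not all reach the database at once.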

    Myth: The higher the dashboard refresh rate, the better for monitoring.

    Reality: Aggressive refresh (< 5s) with many users creates a Query Storm on the metrics DB and can degrade the monitoring system itself.

    That is why Grafana Cloud caps the minimum refresh interval at 5s. For critical metrics, use alerting instead of frequent polling: the system notifies you when a threshold is crossed, with no constant polling.

    1,000 users open one Grafana dashboard with 20 panels, refresh 30s. How many queries per minute hit InfluxDB without optimizations?

    Takeaways

    • **SSE vs polling**: Server-Sent Events provide push with auto reconnect; for read-only dashboards SSE beats WebSocket on simplicity and HTTP/2 support
    • **Aggregation windows**: avg hides tail latency; always track p99 for SLOs; window size is a detail vs stability trade-off
    • **Sparklines**: Y-axis normalization must be global on comparative panels, otherwise 15% CPU looks scarier than 80%
    • **Query Storm**: 1,000 users * 20 panels = 20,000 queries per refresh cycle; query caching and a BFF pattern cut load by 1000x

    Related topics

    Live dashboards build on several technologies from the realtime-backend course:

    • IoT Real-Time — IoT telemetry is a typical source for operational dashboards; InfluxDB + Grafana is the standard stack for industrial IoT
    • Real-time rate limiting — Metric databases throttle query rates; aggressive dashboard refresh without rate limiting breaks the monitoring tool itself
    • Financial trading — Trading dashboards are the extreme case: tick-level updates (milliseconds), sparklines for bid/ask spreads, aggregation windows for VWAP

    Questions for reflection

    • A Netflix team dashboard shows error rate across 50 microservices. How do you pick the right aggregation window to spot real incidents without reacting to noise?
    • 1,000 engineers opened Grafana during an incident. The dashboard itself started collapsing under load. What architecture should have been in place beforehand?
    • A sparkline shows one server's CPU over the last 2 hours. At what point should it switch from 1-second data to 1-minute aggregates, and why exactly there?

    Related lessons

    • db-19-redis