Monitoring Real-Time Systems
In 2015 Slack went down for 3 hours. The postmortem showed that engineers noticed the issue 20 minutes after it started. With proper monitoring, the first alerts would have arrived within 2 minutes. The difference is 18 minutes of extra downtime and thousands of angry users.
- **Discord** monitors 26+ million concurrent WebSocket connections via Prometheus + Grafana; alerts on connection drops and P99 latency go to PagerDuty with 24/7 on-call rotation.
- **Datadog** uses P99 as the primary SLO metric for its own WebSocket endpoints (real-time dashboard updates); SLO = 99.9% of requests < 200ms.
- **Grafana** has publicly shared that GC pressure in Go servers showed up as P99 spikes every few seconds, diagnosed through histogram_quantile, not through the average.
Connection Metrics
Connection metrics are the foundation of WebSocket system monitoring. Key signals: active connections (how many are open right now), connection rate (new connections per second), disconnection rate and reasons (normal close vs error vs timeout). Discord runs 26+ million concurrent WebSocket connections; losing metrics at this scale means flying blind.
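The three signals above can be sketched as a small in-process tracker. This is a minimal illustration, not a production exporter: `ConnectionStats`, `classifyClose`, and the `heartbeatTimedOut` flag are hypothetical names, and the grouping of close codes (1000 and 1001 treated as normal, everything else as error) is an assumption.

```typescript
// Hypothetical tracker for WebSocket connection metrics.
type CloseReason = "normal" | "error" | "timeout";

// Assumption: 1000 (normal closure) and 1001 (going away) count as normal;
// a missed heartbeat is reported by the application via heartbeatTimedOut.
function classifyClose(code: number, heartbeatTimedOut: boolean): CloseReason {
  if (heartbeatTimedOut) return "timeout";
  if (code === 1000 || code === 1001) return "normal";
  return "error";
}

class ConnectionStats {
  private active = 0; // gauge: connections open right now
  private opened = 0; // counter: total connections accepted
  private closed: Record<CloseReason, number> = { normal: 0, error: 0, timeout: 0 };

  onOpen(): void {
    this.active += 1;
    this.opened += 1;
  }

  onClose(code: number, heartbeatTimedOut: boolean = false): void {
    this.active = Math.max(0, this.active - 1);
    this.closed[classifyClose(code, heartbeatTimedOut)] += 1;
  }

  snapshot(): { active: number; opened: number; closed: Record<CloseReason, number> } {
    return { active: this.active, opened: this.opened, closed: { ...this.closed } };
  }
}
```

Connection rate and disconnection rate are then derived from the `opened` and `closed` counters by taking the difference between two snapshots and dividing by the elapsed time.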
Grafana + Prometheus is the standard stack for WebSocket monitoring. Grafana offers ready-made alerting rules: for example, alert if active connections drop more than 20% in 5 minutes (a sign of mass disconnect or a deploy without graceful shutdown).
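The "drop of more than 20% in 5 minutes" condition can be expressed as a comparison between the peak inside the window and the latest value. The sketch below is only an illustration of that logic in application code; `Sample`, `relativeDrop`, and `shouldAlertOnDrop` are hypothetical names, and a real setup would normally express the same rule in the alerting system rather than in the server.

```typescript
interface Sample {
  timestampMs: number;
  active: number; // active connections at that moment
}

// Relative drop (0..1) from the highest value inside the window to the latest sample.
function relativeDrop(samples: Sample[], windowMs: number): number {
  const latest = samples[samples.length - 1];
  if (latest === undefined) return 0;
  const windowStart = latest.timestampMs - windowMs;
  let peak = 0;
  for (const s of samples) {
    if (s.timestampMs >= windowStart) peak = Math.max(peak, s.active);
  }
  return peak === 0 ? 0 : (peak - latest.active) / peak;
}

// Defaults mirror the rule in the text: more than 20% within 5 minutes.
function shouldAlertOnDrop(
  samples: Sample[],
  windowMs: number = 5 * 60 * 1000,
  threshold: number = 0.2
): boolean {
  return relativeDrop(samples, windowMs) > threshold;
}
```

Comparing against the window peak rather than the window start catches a drop that follows a brief rise, at the cost of being slightly more sensitive.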
An active connections graph shows a sharp drop from 50,000 to 5,000 in 30 seconds. Connection rate stays normal (new connections keep arriving). What most likely happened?
Message Rate
Message rate is the number of messages per unit of time, both inbound and outbound. Anomalies in the inbound/outbound ratio can signal problems: a sharp rise in inbound without a matching rise in outbound points to a processing bottleneck; a drop in outbound without a drop in inbound points to broadcast or downstream issues.
Splitting metrics by event_type (chat, join, ping, presence) shows which message type drives load. Discord found that presence updates (online/offline status) made up 60%+ of gateway traffic, which led to an overhaul of the presence subsystem.
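A per-type breakdown like this boils down to counting messages by `event_type` and ranking the shares. The following is a minimal sketch with made-up counts; `EventCounts`, `EventShare`, and `trafficShare` are hypothetical names used only for illustration.

```typescript
type EventCounts = { [eventType: string]: number };

interface EventShare {
  eventType: string;
  count: number;
  share: number; // fraction of total traffic, 0..1
}

// Rank event types by their share of total message volume, heaviest first.
function trafficShare(counts: EventCounts): EventShare[] {
  const eventTypes = Object.keys(counts);
  const total = eventTypes.reduce((sum, t) => sum + (counts[t] ?? 0), 0);
  return eventTypes
    .map((eventType) => {
      const count = counts[eventType] ?? 0;
      return { eventType, count, share: total === 0 ? 0 : count / total };
    })
    .sort((a, b) => b.share - a.share);
}

// Hypothetical counts for a one-minute window.
const shares = trafficShare({ chat: 2500, join: 500, ping: 1000, presence: 6000 });
const heaviest = shares[0]; // presence, with a share of 0.6
```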
Grafana shows: inbound messages/sec = 10,000, but processing time P50 = 200ms. Server throughput = 5 msg/sec per worker, 1000 workers = 5000 msg/sec. What is happening to the queue?
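The arithmetic behind this question can be sketched as follows. It assumes each worker handles one message at a time and that the P50 processing time is a reasonable stand-in for the mean; both are simplifications, and the function names are hypothetical.

```typescript
// Messages one worker can process per second, given time per message.
function workerThroughputPerSec(processingTimeMs: number): number {
  return 1000 / processingTimeMs;
}

// How fast the queue grows (msg/sec) when inbound exceeds total capacity.
function backlogGrowthPerSec(
  inboundPerSec: number,
  processingTimeMs: number,
  workers: number
): number {
  const capacityPerSec = workerThroughputPerSec(processingTimeMs) * workers;
  return Math.max(0, inboundPerSec - capacityPerSec);
}

// Numbers from the question: 10,000 msg/sec inbound, 200 ms per message, 1000 workers.
const growthPerSec = backlogGrowthPerSec(10000, 200, 1000); // 5000 msg/sec
const backlogAfterOneMinute = growthPerSec * 60; // 300000 queued messages
```

Under these assumptions capacity covers only half of the inbound traffic, so the queue grows without bound until load drops or workers are added.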
Latency P99
P99 latency (the 99th percentile) is the value that 99% of requests stay under; the slowest 1% take at least this long. In real-time systems P99 matters more than the average: if P50 = 5ms but P99 = 2000ms, roughly 1 request in 100 leaves a user looking at a frozen UI. Discord targets P99 < 100ms for message delivery. Datadog uses P99 internally as its primary SLO metric.
A Prometheus Histogram is the right tool for percentiles: the client groups observations into buckets (without storing raw values), and Prometheus computes P99 at query time via histogram_quantile(), typically for a Grafana panel. Because bucket counters can be summed, the result can cover all instances. The Summary type in prom-client computes percentiles on the client, and those precomputed percentiles cannot be aggregated across instances.
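To make the difference concrete, here is a minimal sketch of bucket-based quantile estimation. It follows the general idea of linear interpolation inside a bucket but is not the Prometheus or prom-client implementation; the bucket bounds, function names, and the two simulated instances are all illustrative assumptions.

```typescript
// Upper bounds (ms) of cumulative buckets; the final bucket is +Inf.
// These bounds are illustrative, not taken from any real configuration.
const BOUNDS_MS: number[] = [5, 10, 25, 50, 100, 250, 500, 1000, Infinity];

type Histogram = number[]; // cumulative count per bucket, aligned with BOUNDS_MS

function newHistogram(): Histogram {
  return BOUNDS_MS.map(() => 0);
}

// Record one observation: bump every bucket whose upper bound covers the value.
function observe(hist: Histogram, valueMs: number): void {
  BOUNDS_MS.forEach((bound, i) => {
    if (valueMs <= bound) hist[i] = (hist[i] ?? 0) + 1;
  });
}

// Bucket counters from different instances can simply be added together.
function merge(a: Histogram, b: Histogram): Histogram {
  return a.map((count, i) => count + (b[i] ?? 0));
}

// Estimate a quantile by linear interpolation inside the bucket that contains it.
function estimateQuantile(q: number, hist: Histogram): number {
  const total = hist[hist.length - 1] ?? 0;
  if (total === 0) return NaN;
  const rank = q * total;
  let prevBound = 0;
  let prevCount = 0;
  for (let i = 0; i < hist.length; i++) {
    const count = hist[i] ?? 0;
    const bound = BOUNDS_MS[i] ?? Infinity;
    if (count >= rank && count > prevCount) {
      // In the +Inf bucket there is no upper bound to interpolate towards.
      if (bound === Infinity) return prevBound;
      return prevBound + (bound - prevBound) * ((rank - prevCount) / (count - prevCount));
    }
    prevBound = bound;
    prevCount = count;
  }
  return prevBound;
}

// Two hypothetical instances: A serves 90 fast requests, B serves 10 slow ones.
const instanceA = newHistogram();
const instanceB = newHistogram();
for (let i = 0; i < 90; i++) observe(instanceA, 8);
for (let i = 0; i < 10; i++) observe(instanceB, 400);

const fleet = merge(instanceA, instanceB);
const fleetP50 = estimateQuantile(0.5, fleet); // falls in the 5-10 ms bucket
const fleetP99 = estimateQuantile(0.99, fleet); // falls in the 250-500 ms bucket
```

Instance A alone reports a P99 under 10 ms, so the fleet-wide tail only becomes visible after merging the buckets. Per-instance percentiles from a Summary offer no equivalent of `merge`, which is why they cannot be combined.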
Grafana shows a graph: `histogram_quantile(0.99, rate(ws_e2e_latency_milliseconds_bucket[5m]))`. Value = 450ms. What does this mean for users?
Alerting
Alerting is automatic notification when metrics cross thresholds. Standard alerts for WebSocket systems: active connections dropped > 20% in 5 min (deploy/crash), P99 latency > 500ms (degradation), error rate > 1% (processing issues), message rate dropped > 50% (upstream issues). Grafana Alerting and PagerDuty are the standard stack in production systems.
The `for` parameter in Grafana Alert Rules sets how long the condition must hold continuously before the alert fires. It guards against flapping (oscillating alerts) on brief spikes. Typical values are 1-2 minutes for critical alerts and 5-10 minutes for warnings. PagerDuty supports routing: critical wakes the on-call engineer, warning goes to Slack.
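The effect of `for` can be sketched as a small state machine: the rule is pending while the condition holds, fires once it has held for the full duration, and resets as soon as the condition clears. This is a simplified model under the assumption of evenly spaced evaluations, not Grafana's actual evaluator; `evaluateRule` and `AlertState` are hypothetical names.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// Returns the alert state after each evaluation of `value > threshold`.
// values[i] is the metric at time i * intervalSec.
function evaluateRule(
  values: number[],
  threshold: number,
  intervalSec: number,
  forSec: number
): AlertState[] {
  const states: AlertState[] = [];
  let breachStartSec: number | null = null;
  values.forEach((value, i) => {
    const nowSec = i * intervalSec;
    if (value <= threshold) {
      breachStartSec = null; // condition cleared: the timer starts over
      states.push("inactive");
      return;
    }
    if (breachStartSec === null) breachStartSec = nowSec;
    states.push(nowSec - breachStartSec >= forSec ? "firing" : "pending");
  });
  return states;
}

// A sustained breach: 70 evaluations, 5 s apart, all above a 500 ms threshold.
const sustained: number[] = [];
for (let i = 0; i < 70; i++) sustained.push(800);
const sustainedStates = evaluateRule(sustained, 500, 5, 300);
// Pending until the breach has lasted 300 s, firing from then on.
```

A metric that dips below the threshold even once during the pending period resets the timer, which is exactly what suppresses alerts on short spikes.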
Myth: The median (P50) or average latency is a sufficient metric for monitoring WebSocket systems
Reality: P99 and P999 are critical for real-time systems: the median and the average hide the tail of the distribution, where the worst user experiences live
With 100,000 active users, P99 = 450ms means roughly 1,000 users have a poor experience right now. Meanwhile P50 = 10ms creates a false sense of well-being.
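A small worked example shows how a healthy-looking mean and median can coexist with a bad tail. The latency sample below is invented for illustration, the percentile uses the simple nearest-rank method, and the user estimate assumes requests are spread evenly across users.

```typescript
// Build an array containing `count` copies of `value`.
function repeat(value: number, count: number): number[] {
  const out: number[] = [];
  for (let i = 0; i < count; i++) out.push(value);
  return out;
}

function mean(samples: number[]): number {
  return samples.reduce((sum, v) => sum + v, 0) / samples.length;
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] ?? NaN;
}

// Hypothetical sample: 980 fast deliveries (10 ms) and 20 slow ones (450 ms).
const latenciesMs = repeat(10, 980).concat(repeat(450, 20));

const avg = mean(latenciesMs); // 18.8
const p50 = percentile(latenciesMs, 50); // 10
const p99 = percentile(latenciesMs, 99); // 450

// Rough count of users in the slowest 1%, assuming evenly spread requests.
function usersBeyondP99(activeUsers: number): number {
  return (activeUsers * (100 - 99)) / 100;
}
const affected = usersBeyondP99(100000); // 1000
```

Both the mean and the median stay under 20 ms here, while P99 is the only one of the three that exposes the 450 ms deliveries.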
P99 latency spikes to 800ms every 30 seconds for 5 seconds (GC pause in Node.js). The alert is configured: expr > 500ms, for: 5m. Will the alert fire?
Summary
- **Connection metrics** (active, rate, disconnection by code) are the baseline health indicator; a sharp connection drop with a normal connection rate = deploy without graceful shutdown.
- **Message rate** (inbound vs outbound by event_type) exposes processing bottlenecks; an inbound > outbound imbalance means backlog accumulation.
- **P99 latency via Histogram** (not Summary) is the right metric for real-time; alerts with `for: 5m` guard against flapping on brief spikes.
Related topics
Monitoring is connected to the operational and security aspects of the system:
- DDoS and Abuse: Anomalous growth in connection rate and message rate is an early DDoS signal detectable through monitoring
- Audit and Compliance: Operational metrics complement the audit trail: metrics for real-time response, audit logs for forensic analysis
- Horizontal Scaling: Multi-instance deployments require aggregating metrics across instances; Prometheus scrapes each instance separately, and aggregation happens at query time (for example, by summing histogram buckets across instances)
Questions for reflection
- How do you measure end-to-end latency when client clocks are not synchronized with the server (clock skew up to 100ms)?
- Which metrics distinguish degradation from load growth versus degradation from a memory leak in the Node.js process?
- How do you set up monitoring for a WebSocket system scaled horizontally across 50 servers: do you aggregate at the infrastructure layer or the application layer?