Real-Time Backend
Real-Time Rate Limiting
One buggy WebSocket bot created 50,000 connections per minute on production Discord. Without rate limiting that would take the server down for 150 million users.
- **Discord** limits 120 events/min on a Gateway connection and permanently bans bots that ignore rate-limit signals; this shields 19M servers from accidental and deliberate abuse
- **Binance WebSocket API** uses a token bucket: 5 subscriptions/sec with a burst of up to 10; exceeding it closes the connection with code 1008; API keys get an independent bucket
- **Cloudflare** handles 100,000+ WebSocket connections/sec through Workers; a sliding window counter in a distributed key-value store protects against connection flooding without a single point of failure
- **Twitch** slowmode and follower-only mode are a UX wrapper over per-room rate limiting; built on Redis sorted sets; streamers enable them with 50,000+ concurrent viewers
Per-Connection Rate Limiting
Per-connection rate limiting caps the number of messages one WebSocket connection may send within a time window. Discord allows up to 120 events/minute per connection; exceeding the limit closes the connection with close code 4008 (Rate limited). Binance WebSocket: 5 subscriptions/second per connection with a burst of up to 10, and at most 1,024 active subscriptions. The goal is to shield the server from one aggressive client without affecting the rest.
It is important to separate message-level rate limiting from new-connection rate limiting. Cloudflare WebSocket: up to 100 new WebSocket connections/sec per IP (connection rate limit) and up to 1,000 messages/min per connection (message rate limit). Both layers matter: connection flooding via rapid connect/disconnect attacks cannot be caught by the message limiter alone.
- **Message rate limit**: 120-1,000 msg/min is the typical range; depends on the app
- **Connection rate limit**: 10-100 new WS/sec per IP; protection against connection flooding
- **Graceful response**: do not drop the connection right away; return an error with `retryAfter` and keep the connection open
- **Backpressure**: if the client sends faster than the server processes, the buffer grows; an explicit signal is required
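As a minimal sketch of the message rate limit and the graceful response described above, the limiter below keeps a sliding window of timestamps for one connection and answers an over-limit message with an error payload instead of closing the socket. The class name `ConnectionLimiter`, the `rate_limited` error string, and the injectable `clock` are illustrative assumptions, not part of any real gateway API.

```python
import time
from collections import deque


class ConnectionLimiter:
    """Sliding-window message limiter for a single WebSocket connection."""

    def __init__(self, limit=120, window_s=60.0, clock=time.monotonic):
        self.limit = limit
        self.window_s = window_s
        self.clock = clock      # injectable so the limiter can be tested
        self.sent = deque()     # timestamps of accepted messages

    def check(self):
        """Return None if the message is accepted, else an error payload."""
        now = self.clock()
        # Forget messages that have left the window.
        while self.sent and now - self.sent[0] >= self.window_s:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            self.sent.append(now)
            return None
        # Over the limit: the connection stays open and the client is told
        # when the oldest counted message leaves the window.
        retry_after = self.window_s - (now - self.sent[0])
        return {"error": "rate_limited", "retryAfter": round(retry_after, 3)}
```

A server would create one limiter per connection, call `check()` for every inbound message, and send the returned payload back to the client whenever it is not `None`.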
Discord sets a 120 events/min limit per connection. A bot sent its 121st message. What should the server do?
Per-Room Rate Limiting
Per-room rate limiting caps total activity in a channel/room regardless of how many participants there are. Twitch limits chat to 20-100 messages/30 seconds per channel (depending on bot verification). This protects against coordinated spam: 1,000 bots at 1 message/sec each is legal under per-connection limits but devastating for the channel.
Redis ZSET (sorted set) is the standard tool for distributed rate limiting in real-time systems. The key is the resource id, the score is the timestamp, and the member is a unique request ID. ZREMRANGEBYSCORE drops stale entries and ZCARD returns the current count. A plain pipeline only batches the commands into one round-trip; to make them run atomically, wrap them in a MULTI/EXEC transaction or a Lua script. Discord uses a Redis cluster to rate-limit 150M users with <1 ms latency.
- **Two-tier limiting**: per-user (stops spam from one source) + per-room (stops coordinated attacks)
- **Role-tiered limits**: moderators - 50 msg/min, regular users - 5 msg/min, verified bots - 100 msg/min
- **Slowmode**: Twitch and Slack add a delay between one user's messages (1-120s) instead of a hard limit
- **Redis pipeline**: group ZREMRANGEBYSCORE + ZADD + ZCARD into one round-trip; critical for latency
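The ZSET steps above can be sketched as a single function. It assumes `client` is a redis-py compatible client (only `pipeline`, `zremrangebyscore`, `zadd`, `zcard`, `expire`, and `zrem` are used); the key scheme `ratelimit:room:<id>` and the function name are hypothetical. This illustrates the technique and is not a description of how Twitch or Discord implement it.

```python
import time
import uuid


def allow_room_message(client, room_id, limit, window_s, now=None):
    """Sliding-window log on a Redis sorted set; True if the message fits."""
    now = time.time() if now is None else now
    key = f"ratelimit:room:{room_id}"       # hypothetical key scheme
    member = f"{now}:{uuid.uuid4().hex}"    # unique member per message

    # transaction=True wraps the batch in MULTI/EXEC: one round-trip, atomic.
    pipe = client.pipeline(transaction=True)
    pipe.zremrangebyscore(key, 0, now - window_s)   # drop stale entries
    pipe.zadd(key, {member: now})                   # record this message
    pipe.zcard(key)                                 # count messages in window
    pipe.expire(key, int(window_s) + 1)             # let idle rooms expire
    _, _, count, _ = pipe.execute()

    if count > limit:
        # Rejected messages should not use up the room's quota.
        client.zrem(key, member)
        return False
    return True
```

Because every server instance talks to the same key, the count is shared across the cluster. The same function works for per-user limits if the key is built from the user id instead of the room id.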
A Twitch streamer enabled 30-second slowmode in a chat with 50,000 viewers. How is it implemented technically?
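One plausible shape for slowmode, sketched in memory: remember when each user last posted in each room and reject anything sent before the delay has passed. The `Slowmode` class and its method names are assumptions for illustration. A multi-server deployment would more likely keep this state in a shared store, for example a Redis key per (room, user) written with `SET ... NX EX <delay>`.

```python
import time


class Slowmode:
    """Per-user cooldown inside a room: one message every `delay_s` seconds."""

    def __init__(self, delay_s=30.0, clock=time.monotonic):
        self.delay_s = delay_s
        self.clock = clock
        self.last_sent = {}     # (room_id, user_id) -> time of last accepted message

    def try_send(self, room_id, user_id):
        """Return 0.0 if accepted, otherwise the seconds the user must still wait."""
        now = self.clock()
        last = self.last_sent.get((room_id, user_id))
        if last is not None and now - last < self.delay_s:
            return self.delay_s - (now - last)
        self.last_sent[(room_id, user_id)] = now
        return 0.0
```

The cost per message is one dictionary lookup regardless of how many viewers are in the room, which is why a per-user delay scales to large chats more easily than a shared per-room counter.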
Sliding Window Algorithm
Sliding window is a rate-limiting algorithm that precisely counts events over a moving time interval. Unlike a fixed window ('no more than 100/min, reset at 00:00'), sliding window does not have the double-spike problem at the period boundary. Cloudflare uses a sliding window counter to protect 19M sites, handling 100,000+ requests/sec via Redis Cluster.
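In practice the Sliding Window Counter (not Log) is used to balance accuracy and memory. It keeps only two counters per key and weights the previous window's count by how much of that window still overlaps the sliding interval. Because this assumes the previous window's requests were spread evenly, the miscount is small for steady traffic (for 100 req/min, on the order of 1-2 requests) but can be larger right after a concentrated burst. Cloudflare uses this exact algorithm in its WAF. For financial systems that need precision, a Redis ZSET (sliding window log) is used despite the higher overhead.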
- **Fixed Window**: simple, O(1), vulnerable to bursts at the period boundary
- **Sliding Window Log**: exact, O(N) where N is events in the window; Redis ZSET implementation
- **Sliding Window Counter**: a trade-off between the two; O(1), <1% error on steady traffic; recommended for most cases
- **Leaky Bucket**: smooths bursts, constant consumption rate; used in network QoS
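To make the boundary problem concrete, here is a small sketch that runs the same traffic through a fixed window and a sliding window counter. Both classes and the `count_allowed` helper are illustrative names, and the timestamps are passed in explicitly so the scenario is reproducible.

```python
class FixedWindow:
    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.window_id = None
        self.count = 0

    def allow(self, now):
        window_id = int(now // self.window_s)
        if window_id != self.window_id:     # new window: the counter resets
            self.window_id = window_id
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


class SlidingWindowCounter:
    def __init__(self, limit, window_s):
        self.limit = limit
        self.window_s = window_s
        self.window_id = None
        self.curr = 0       # requests accepted in the current fixed window
        self.prev = 0       # requests accepted in the previous fixed window

    def allow(self, now):
        window_id = int(now // self.window_s)
        if window_id != self.window_id:
            # The old count only matters if it belongs to the adjacent window.
            adjacent = self.window_id is not None and window_id == self.window_id + 1
            self.prev = self.curr if adjacent else 0
            self.curr = 0
            self.window_id = window_id
        # Fraction of the current window that has already elapsed.
        elapsed = (now % self.window_s) / self.window_s
        # Weight the previous window by the part still inside the sliding interval.
        estimate = self.prev * (1.0 - elapsed) + self.curr
        if estimate + 1 > self.limit:
            return False
        self.curr += 1
        return True


def count_allowed(limiter, timestamps):
    return sum(1 for t in timestamps if limiter.allow(t))


# 100 requests at 0:55, then 100 more at 1:05, with a limit of 100/min.
burst = [55.0] * 100 + [65.0] * 100
```

With this traffic, `count_allowed(FixedWindow(100, 60), burst)` admits all 200 requests within ten seconds, while `count_allowed(SlidingWindowCounter(100, 60), burst)` admits 108. An exact sliding window log would admit 100, so the counter's approximation is visible here because the previous window's traffic was concentrated rather than evenly spread.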
Limit: 100 requests/minute. Fixed window. In [0:50-0:59] 100 requests were sent. In [1:00-1:10] another 100. Is the limit violated?
Token Bucket Algorithm
Token bucket allows bursty traffic within reasonable bounds. Picture a bucket of tokens: tokens are added at rate R/sec, capped at B tokens. Each request spends one token. AWS API Gateway uses a token bucket: 10,000 requests/sec baseline plus a burst of up to 5,000 additional requests instantly. That absorbs spikes without degradation.
Token bucket beats sliding window for real-time systems with bursty traffic. A WebSocket client on startup may send 50 subscriptions at once (a legitimate burst), then run at 1-2 messages/sec. Sliding window penalizes any burst equally. Token bucket allows a reasonable initial burst (capacity) while enforcing the average rate (refillRate). The Binance WebSocket API uses exactly token bucket: 5 subscriptions/sec with a burst of up to 10.
- **Capacity (burst size)**: how many requests you can send instantly with a full bucket
- **Refill rate**: replenishment speed; sets sustained throughput
- **Distributed bucket**: store tokens in Redis (`INCR`, `EXPIRE`, `DECRBY`) for multi-instance services
- **Variable cost**: heavy operations (subscribe to 100 symbols) consume more tokens than light ones (ping)
- **Retry-After header**: always tell the client when the next request will be allowed (missing tokens / refillRate seconds)
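The parameters above fit together in a few lines. The following is a single-process sketch, assuming an injectable `clock` for testing; the class and method names are illustrative, and a distributed version would keep `tokens` and `updated` in a shared store instead of instance attributes.

```python
import time


class TokenBucket:
    def __init__(self, capacity, refill_rate, clock=time.monotonic):
        self.capacity = capacity
        self.refill_rate = refill_rate      # tokens added per second
        self.clock = clock
        self.tokens = float(capacity)       # a new client starts with a full bucket
        self.updated = clock()

    def _refill(self):
        now = self.clock()
        gained = (now - self.updated) * self.refill_rate
        self.tokens = min(self.capacity, self.tokens + gained)
        self.updated = now

    def try_consume(self, cost=1):
        """Return 0.0 if allowed, else seconds until `cost` tokens are available.

        `cost` lets heavy operations spend more tokens than light ones.
        A cost above `capacity` can never succeed and should be rejected upstream.
        """
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return 0.0
        return (cost - self.tokens) / self.refill_rate
```

With `TokenBucket(capacity=10, refill_rate=5.0)`, matching the Binance figures quoted above, a client can send 10 subscriptions at once; an 11th sent immediately afterwards gets back a wait of 0.2 seconds, which the server can forward as `retryAfter`.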
Myth: rate limiting is just protection against DDoS and malicious clients.
Reality: rate limiting in real-time systems also protects honest users from accidental abuse (client bugs, reconnect storms) and ensures fair resource sharing across all connections.
A WebSocket bot with a buggy reconnect loop can create 10,000 connections per second without bad intent. Without connection-level rate limiting, one broken client degrades the service for everyone. Discord disconnects bots with code 4000 on Identify errors precisely because of accidental reconnect loops.
Token bucket: capacity=10, refillRate=1 token/sec. The client connected with a full bucket and sent 10 messages instantly. After how many seconds can it send 5 more messages?
Takeaways
- **Per-connection + per-room**: two layers of defense; the connection limit stops one aggressor, the room limit stops a coordinated attack
- **Sliding window vs Fixed window**: fixed window is vulnerable to double bursts at the boundary; the sliding window log is exact but needs O(N) memory (for example a Redis ZSET), while the sliding window counter approximates it in O(1)
- **Token bucket**: the best fit for real-time - it allows a legitimate burst (capacity) while enforcing the sustained rate (refillRate); Binance and Discord use it
- **Graceful degradation**: return `retryAfter`, do not drop the connection on the first violation; log offenders to monitor abuse patterns
Related topics
Rate limiting in real-time systems sits at the intersection of security and architecture:
- WebSocket security — Rate limiting is part of overall WebSocket defense: connection rate limit protects the Upgrade endpoint from flooding; message rate limit blocks application-layer abuse
- Financial trading — Binance and NYSE apply strict rate limiting to trading APIs; violations trigger 24h IP bans and API-key revocation; weight-based billing for heavy queries
- IoT Real-Time — AWS IoT Core bills per-message; the edge agent must implement a token bucket so a broken sensor cannot burn the monthly budget in minutes
Questions for reflection
- A WebSocket chat: a user sent 50 messages in 1 second due to a client bug (double submit). How should rate limiting respond to avoid hurting an honest user while stopping the abuse?
- Discord uses a token bucket with capacity=120 at refillRate=2/sec (120/min). Why not just sliding window 120/min? In what scenarios is the bucket capacity critical?
- A distributed system: 10 WebSocket servers, per-user rate limit 100 msg/min. A user is connected to server A; how does A know how many messages the user already sent through servers B-J?