System Design
Rate Limiting and Throttling
Цели урока
- Understand why Rate Limiting is needed and how it differs from Throttling
- Study the Token Bucket algorithm and its burst-friendly behavior
- Understand Leaky Bucket for traffic smoothing
- Master Sliding Window for accurate counting
- Learn to implement Distributed Rate Limiting with Redis
Предварительные знания
- Previous System Design lessons
- Basic understanding of Redis
Rate Limiting is a mandatory component of any production API. Without it, one client can take down an entire service. Interviewers frequently ask: 'How would you protect an API?' - and expect knowledge of concrete algorithms.
- Used in production systems
Why Limit Requests?
**Rate Limiting** is the control of how many requests a client can make per unit of time. It is a fundamental protection mechanism for any API.
Without rate limiting, a single malicious (or buggy) client can bring down an entire service by sending millions of requests.
**Where Rate Limiting is applied:**
**Rate Limiting vs Throttling:**
When returning 429, always include the Retry-After header with the number of seconds until the limit resets. This allows clients to handle restrictions gracefully.
Which HTTP status code is returned when a rate limit is exceeded?
429 Too Many Requests is the standard status for rate limiting. 403 is for authorization errors, 503 is for temporary service unavailability.
Token Bucket
**Token Bucket** is the most popular rate limiting algorithm. It is used in AWS API Gateway, Stripe, and NGINX.
**The idea:** there is a bucket of tokens. Each request consumes one token. Tokens are added at a fixed rate. If there are no tokens, the request is rejected.
**Python implementation:**
**Advantages of Token Bucket:**
- **Burst-friendly** - allows short spikes of activity
- **Simplicity** - easy to implement and understand
- **Memory efficient** - stores only 2 numbers (tokens, timestamp)
- **Flexible** - configurable burst and sustained rate
Token Bucket is ideal when bursts need to be permitted but the average rate must be capped. For example, a user opens a page and fires 10 API requests at once, then goes quiet.
Token Bucket with capacity=100 and rate=10/sec. How many requests can be made in the first second?
At startup the bucket is full (100 tokens), so a burst of 100 requests can be made instantly. After that, only 10/sec as tokens refill.
Leaky Bucket
**Leaky Bucket** - requests "leak" out of the bucket at a constant rate, smoothing traffic into a steady stream.
Unlike Token Bucket, there is no burst - only a queue. Requests accumulate and are processed at a uniform pace.
**Token Bucket vs Leaky Bucket:**
**When to use Leaky Bucket:**
- **Network traffic shaping** - steady packet flow
- **Backend protection** - don't pass bursts to a weak downstream service
- **Video streaming** - uniform bitrate delivery
- **Message queues** - steady consumption rate
Leaky Bucket adds latency! Requests wait in the queue. For user-facing APIs, Token Bucket with fast rejection is usually better.
What is the main difference between Leaky Bucket and Token Bucket?
Leaky Bucket converts any incoming traffic into a steady stream. Token Bucket allows bursts. Leaky Bucket uses more memory (the queue) and can add latency.
Sliding Window
**Sliding Window** counts requests within a rolling time window. More accurate than Fixed Window, but harder to implement.
**The Fixed Window problem:**
**Sliding Window Log:**
Store the timestamp of each request. On a new request, discard old entries (outside the window) and count the remainder.
**Sliding Window Counter (a compromise):**
Combines Fixed Window with a weighting coefficient. Less memory than Log, more accurate than Fixed.
**Algorithm comparison:**
Sliding Window Counter uses 2 counters. Why is the previous window counter needed?
The previous window counter is used for weighted calculation: part of the previous window still falls inside the current sliding window. This solves the boundary burst problem of Fixed Window.
Distributed Rate Limiting
In real systems, rate limiting must work **in a distributed fashion** - across multiple servers that must share a common counter.
**Solution 1: Centralized Counter (Redis)**
**Solution 2: Sticky Sessions**
**Solution 3: Token Bucket with Synchronization**
**Rate Limit Headers:**
For most use cases, Redis + Sliding Window Counter is the optimal choice: O(1) memory, O(1) operations, high accuracy, simple implementation.
Why is Redis commonly used for distributed rate limiting?
Redis is ideal for rate limiting: atomic INCR without race conditions, EXPIRE for automatic window reset, sub-millisecond latency, and a simple API.
Key Takeaways
- Token Bucket: allows burst, used in API Gateway, Stripe
- Leaky Bucket: smooths traffic into a steady stream
- Fixed Window: simple but inaccurate at boundaries
- Sliding Window Counter: a compromise - O(1) memory with high accuracy
- For distributed systems: Redis + INCR + EXPIRE
What's Next
Next lesson: Consistent Hashing - how to distribute data across servers so that adding or removing a server minimally disrupts the existing distribution.
- System Design — course continuation