AI Engineering
Rate Limiting for AI APIs: Token Bucket, Sliding Window, Per-User Quotas
Lesson Objectives
- Understand the difference between RPM and TPM rate limiting for LLM APIs
- Implement a Token Bucket algorithm with two dimensions: requests and tokens
- Build a per-user quota system with Redis for fair resource distribution
- Create a NestJS Guard combining global rate limits and per-user quotas
OpenAI rate limit: 10,000 RPM for tier 3. Sounds like plenty. One user with an automation script - 600 requests per minute. 17 such users - and the entire account is down. Rate limiting for AI isn't about 429 errors. It's about token budget per user.
- OpenAI tier system: tier 1 gives 500 RPM and 30,000 TPM. One RAG request with 8K context uses 8,000 of those tokens. Three such requests in the same minute consume 24,000, and the remaining 6,000 is too little for a fourth from anyone
- Anthropic limits tier 1 to 50 RPM and 40,000 TPM with a daily cap of 1,000 requests - without per-user quotas one active user can zero out the day
- ChatGPT uses per-user rate limiting: the 'message limit reached' notice surfaces the same kind of per-user budget directly in the UI
- Vercel's AI SDK documentation includes a rate-limiting guide built on a Redis-backed limiter (Upstash Ratelimit) - the industry has accepted this as a standard problem
From Network Algorithms to LLM Quotas
Rate limiting predates AI by decades. The **token bucket** and **leaky bucket** algorithms come from classic networking and traffic shaping, where they smoothed bursts and enforced average rates on routers and switches. Those same algorithms now govern LLM APIs, expressed as RPM (requests per minute) and TPM (tokens per minute) limits. The twist for AI is that cost and load scale with token count, not just request count, so the old buckets are applied to token budgets per user, not only to request frequency.
Prerequisites
Why Rate Limiting for AI APIs Is Different
OpenAI tier 3: 10,000 RPM. Sounds like plenty. One user with an automation script - 600 requests per minute. 17 such users - and the entire account goes down. But here's the thing: the problem isn't the number of requests. The problem is **tokens**.
Traditional rate limiting counts requests - 100 req/min, 1,000 req/hour. For LLM APIs, that's a blunt instrument. A single GPT-4o request with 50,000 tokens of context consumes 100x as many tokens as one with 500 tokens, yet both count as one request. Providers enforce **TPM (tokens per minute)** - and TPM is the real bottleneck. Rate limiting for AI isn't about 429 errors. It's about **token budget per user**.
| Provider | Tier | RPM | TPM | RPD |
|---|---|---|---|---|
| OpenAI (GPT-4o) | Tier 1 | 500 | 30,000 | - |
| OpenAI (GPT-4o) | Tier 5 | 10,000 | 30,000,000 | - |
| Anthropic (Claude Sonnet) | Tier 1 | 50 | 40,000 | 1,000 |
| Anthropic (Claude Sonnet) | Tier 4 | 4,000 | 400,000 | - |
| Google (Gemini 1.5) | Free | 15 | 1,000,000 | 1,500 |
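To make the table concrete, the following sketch computes how many requests per minute actually fit once token usage is taken into account. It is a simplified illustration: `ProviderLimit` and `effectiveRequestsPerMinute` are hypothetical names, and the calculation assumes every request costs the same number of tokens.

```typescript
// Limits as listed in the table above (requests and tokens per minute).
interface ProviderLimit {
  rpm: number;
  tpm: number;
}

// Requests per minute that actually fit, assuming a fixed token cost per request.
// Whichever limit is hit first (RPM or TPM) determines the result.
function effectiveRequestsPerMinute(limit: ProviderLimit, tokensPerRequest: number): number {
  const byTokens = Math.floor(limit.tpm / tokensPerRequest);
  return Math.min(limit.rpm, byTokens);
}

const openAiTier1: ProviderLimit = { rpm: 500, tpm: 30000 };

console.log(effectiveRequestsPerMinute(openAiTier1, 500));  // 60
console.log(effectiveRequestsPerMinute(openAiTier1, 8000)); // 3
```

With 500-token requests the 500 RPM limit is never reached: TPM caps throughput at 60 requests. With 8K-token RAG requests the cap drops to 3.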
Three protection layers in an AI application:
1. **provider limit protection** - to avoid 429 errors from OpenAI
2. **per-user quotas** - so one user can't exhaust the limit for everyone else
3. **budget protection** - a financial ceiling
Each layer solves a different problem. None can be skipped.
**Retry is not a replacement for rate limiting.** Retry handles sporadic 429s, but when limits are systematically exceeded it creates a growing queue. 100 requests → 70 fail → retry in 2 seconds → 70 more → limit exhausted again. Proactive rate limiting before sending the request is what's needed.
Why is standard request-count rate limiting (RPM) insufficient for LLM APIs?
Token Bucket for LLMs: Limiting by Tokens, Not Requests
Token Bucket is an algorithm with history. A bucket fills at a constant rate. Each request takes exactly as much as it costs. Bucket empty - request waits or gets dropped. Sounds simple, but there's a catch: for LLM APIs, **two buckets are needed simultaneously** - one for RPM, one for TPM. The first protects against request storms, the second against a single 128K-token request wiping out the minute's allowance.
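A minimal sketch of the two-bucket idea described above, assuming continuous refill and an injectable clock so the behavior can be tested. The class names `Bucket` and `DualTokenBucket` and the method `tryAcquire` are illustrative, not taken from any library.

```typescript
// One dimension of the limiter: refills continuously up to `capacity`.
class Bucket {
  private level: number;
  private last: number;
  private readonly capacity: number;
  private readonly refillPerMs: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerMs: number, now: () => number) {
    this.capacity = capacity;
    this.refillPerMs = refillPerMs;
    this.now = now;
    this.level = capacity; // start full
    this.last = now();
  }

  // Add whatever accumulated since the last call, never exceeding capacity.
  private refill(): void {
    const t = this.now();
    this.level = Math.min(this.capacity, this.level + (t - this.last) * this.refillPerMs);
    this.last = t;
  }

  available(): number {
    this.refill();
    return this.level;
  }

  take(amount: number): void {
    this.refill();
    this.level -= amount;
  }
}

class DualTokenBucket {
  private readonly requests: Bucket;
  private readonly tokens: Bucket;

  constructor(rpm: number, tpm: number, now: () => number = () => Date.now()) {
    // Per-minute limits converted to a per-millisecond refill rate.
    this.requests = new Bucket(rpm, rpm / 60000, now);
    this.tokens = new Bucket(tpm, tpm / 60000, now);
  }

  // Succeeds only if BOTH buckets can cover the request; otherwise takes nothing.
  tryAcquire(estimatedTokens: number): boolean {
    if (this.requests.available() < 1 || this.tokens.available() < estimatedTokens) {
      return false;
    }
    this.requests.take(1);
    this.tokens.take(estimatedTokens);
    return true;
  }
}

const limiter = new DualTokenBucket(500, 30000);
console.log(limiter.tryAcquire(8000)); // true
console.log(limiter.tryAcquire(8000)); // true
console.log(limiter.tryAcquire(8000)); // true
console.log(limiter.tryAcquire(8000)); // false: only about 6,000 tokens are left
```

Checking both buckets before taking from either avoids spending a request slot on a call that the token bucket would reject anyway.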
The problem: at request time, the number of **output tokens is unknown** - the model hasn't started generating yet. Standard solution: estimate using `system_tokens + user_tokens + max_tokens` (or historical average). After receiving the response, `release()` returns the unused credit to the bucket. This keeps accuracy high even when estimates are systematically inflated.
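The estimate-then-release flow can be sketched as follows. This is an illustration under assumptions: the lesson names `release()` but does not show its signature, so the `(estimated, actual)` parameters, the `LlmRequest` shape, and the `TokenBudget` class are hypothetical. Refill over time is left out to keep the focus on the reconciliation step.

```typescript
interface LlmRequest {
  systemTokens: number;
  userTokens: number;
  maxTokens: number; // upper bound on output tokens requested from the model
}

// Worst-case cost: the full prompt plus the largest possible completion.
function estimateTokens(req: LlmRequest): number {
  return req.systemTokens + req.userTokens + req.maxTokens;
}

class TokenBudget {
  private available: number;
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
    this.available = capacity;
  }

  // Reserve the estimated cost before the request is sent.
  reserve(estimated: number): boolean {
    if (estimated > this.available) return false;
    this.available -= estimated;
    return true;
  }

  // After the response arrives, give back whatever the estimate overshot.
  release(estimated: number, actual: number): void {
    const unused = Math.max(0, estimated - actual);
    this.available = Math.min(this.capacity, this.available + unused);
  }

  remaining(): number {
    return this.available;
  }
}

const budget = new TokenBudget(30000);
const estimated = estimateTokens({ systemTokens: 1500, userTokens: 500, maxTokens: 4000 });

budget.reserve(estimated);        // reserves 6,000
console.log(budget.remaining());  // 24000

budget.release(estimated, 2600);  // the response used only 2,600 tokens in total
console.log(budget.remaining());  // 27400
```

Reserving the worst case first means the limiter never overshoots the provider limit; the release step recovers the capacity that a pessimistic estimate would otherwise waste.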
How does the token bucket handle a situation where actual token consumption was less than the estimate?
Per-User Quotas with Redis: Fair Resource Distribution
Global rate limit protects against provider 429s. But not against the scenario where one user with an automation script consumes 80% of the monthly budget overnight. Per-user quotas are **fairness as an architectural decision**. The limit is divided between users - not owned by whoever gets there first.
Redis is the standard choice for distributed rate limiting: atomic operations, TTL on keys, horizontal scaling. Keys combine userId and date and expire at midnight, which resets the quota. On sliding window vs fixed window: a sliding window smooths bursts at window boundaries, but for token budget management a fixed window is enough, because a daily limit is easier to explain to a user than a rolling 24-hour window.
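A fixed daily window keyed by userId + date could look roughly like this. To keep the sketch runnable without a Redis server, it talks to a small `CounterStore` interface (an assumption of this example) backed by an in-memory map; with a client such as ioredis, `incrby` and `expire` would be the corresponding calls. The `adjustTokenUsage()` name comes from this lesson, but its signature here is a guess, and `UserQuota`, `dailyKey`, and `secondsUntilUtcMidnight` are hypothetical helpers.

```typescript
// Minimal store contract; Redis INCRBY and EXPIRE provide the same operations.
interface CounterStore {
  incrBy(key: string, amount: number): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
}

// In-memory stand-in so the example runs without a Redis server.
class InMemoryStore implements CounterStore {
  private readonly data = new Map<string, number>();

  async incrBy(key: string, amount: number): Promise<number> {
    const next = (this.data.get(key) ?? 0) + amount;
    this.data.set(key, next);
    return next;
  }

  async expire(_key: string, _seconds: number): Promise<void> {
    // No-op here; Redis would delete the key once the TTL elapses.
  }
}

// One key per user per UTC day, e.g. "quota:u1:2025-01-01".
function dailyKey(userId: string, now: Date): string {
  return `quota:${userId}:${now.toISOString().slice(0, 10)}`;
}

function secondsUntilUtcMidnight(now: Date): number {
  const midnight = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() + 1);
  return Math.ceil((midnight - now.getTime()) / 1000);
}

class UserQuota {
  private readonly store: CounterStore;
  private readonly dailyTokenLimit: number;

  constructor(store: CounterStore, dailyTokenLimit: number) {
    this.store = store;
    this.dailyTokenLimit = dailyTokenLimit;
  }

  // Increment first, then roll back if the limit was crossed.
  // Note: increment + rollback are two commands; a Lua script would make this one atomic step.
  async consume(userId: string, tokens: number, now: Date = new Date()) {
    const key = dailyKey(userId, now);
    const used = await this.store.incrBy(key, tokens);
    await this.store.expire(key, secondsUntilUtcMidnight(now)); // the key disappears at midnight UTC
    if (used > this.dailyTokenLimit) {
      await this.store.incrBy(key, -tokens);
      return { allowed: false, remaining: Math.max(0, this.dailyTokenLimit - (used - tokens)) };
    }
    return { allowed: true, remaining: this.dailyTokenLimit - used };
  }

  // Replace the estimate with the real usage once the LLM response is known.
  async adjustTokenUsage(userId: string, estimated: number, actual: number, now: Date = new Date()) {
    return this.store.incrBy(dailyKey(userId, now), actual - estimated);
  }
}
```

A caller would `consume()` the estimated tokens before the LLM call and `adjustTokenUsage()` afterwards, so a user who was charged 6,000 estimated tokens but used 2,500 gets the difference back for the rest of the day.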
The API response on quota exceeded must include **informative headers**: `X-RateLimit-Remaining-Requests`, `X-RateLimit-Remaining-Tokens`, `X-RateLimit-Reset`. This is a de-facto convention: OpenAI and the GitHub API both return `x-ratelimit-*` headers of this kind, though the exact names differ between them. The client gets precise information for graceful degradation instead of guessing from a status code.
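As an illustration, the three headers named above could be produced by a small helper like the one below. The `QuotaState` shape and the `rateLimitHeaders` function are hypothetical, and encoding `X-RateLimit-Reset` as Unix epoch seconds is an assumption of this sketch (the GitHub API uses that format; other APIs use durations).

```typescript
interface QuotaState {
  remainingRequests: number;
  remainingTokens: number;
  resetAt: Date; // when the current window ends
}

// Build the informative headers for a response; negative balances are clamped to 0.
function rateLimitHeaders(state: QuotaState): Record<string, string> {
  return {
    "X-RateLimit-Remaining-Requests": String(Math.max(0, state.remainingRequests)),
    "X-RateLimit-Remaining-Tokens": String(Math.max(0, state.remainingTokens)),
    // Assumption: reset time is sent as Unix epoch seconds.
    "X-RateLimit-Reset": String(Math.floor(state.resetAt.getTime() / 1000)),
  };
}

const headers = rateLimitHeaders({
  remainingRequests: 12,
  remainingTokens: 0,
  resetAt: new Date("2025-01-02T00:00:00Z"),
});

console.log(headers["X-RateLimit-Remaining-Tokens"]); // "0"
console.log(headers["X-RateLimit-Reset"]);            // "1735776000"
```

A client that sees zero remaining tokens and a reset timestamp can pause or switch to a cheaper fallback until the window rolls over.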
Why does the per-user quota use adjustTokenUsage() after receiving a response from the LLM?
NestJS Guard: Rate Limiting for AI Endpoints
A NestJS Guard is the right place for rate limiting: it runs before the controller, has access to the request context, and blocks requests before any business logic. A Guard for AI endpoints combines two layers: global rate limit (protection against provider 429s) and per-user quota. Order matters - global goes first. Otherwise per-user passes a request that will collide with OpenAI's limit on the way out.
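The check order can be sketched as a guard-shaped class. To stay runnable without the framework, this example uses structural stand-ins: `HttpContextLike` mirrors the `switchToHttp().getRequest()` part of NestJS's `ExecutionContext`, and `TooManyRequestsError` stands in for an `HttpException` with status 429. In a real NestJS app the class would be decorated with `@Injectable()` and implement `CanActivate` from `@nestjs/common`. `GlobalLimiter`, `UserQuotaCheck`, the `estimatedTokens` body field, and the 1,000-token default are all assumptions of this sketch.

```typescript
// Structural stand-ins for the NestJS request context.
interface HttpRequestLike {
  user?: { id: string };
  body?: { estimatedTokens?: number };
}
interface HttpContextLike {
  switchToHttp(): { getRequest(): HttpRequestLike };
}

// The two layers the guard combines (hypothetical contracts).
interface GlobalLimiter {
  tryAcquire(tokens: number): boolean;
  release(tokens: number): void;
}
interface UserQuotaCheck {
  tryConsume(userId: string, tokens: number): Promise<boolean>;
}

class TooManyRequestsError extends Error {
  readonly status = 429;
}

class AiRateLimitGuard {
  private readonly globalLimiter: GlobalLimiter;
  private readonly userQuota: UserQuotaCheck;

  constructor(globalLimiter: GlobalLimiter, userQuota: UserQuotaCheck) {
    this.globalLimiter = globalLimiter;
    this.userQuota = userQuota;
  }

  async canActivate(context: HttpContextLike): Promise<boolean> {
    const req = context.switchToHttp().getRequest();
    const userId = req.user?.id ?? "anonymous";
    const tokens = req.body?.estimatedTokens ?? 1000; // assumed default estimate

    // 1. Global limit first: protects the shared provider quota.
    if (!this.globalLimiter.tryAcquire(tokens)) {
      throw new TooManyRequestsError("Global rate limit reached");
    }

    // 2. Per-user quota second: keeps one user from starving the rest.
    if (!(await this.userQuota.tryConsume(userId, tokens))) {
      // Hand the global reservation back, since this request will not be sent.
      this.globalLimiter.release(tokens);
      throw new TooManyRequestsError("User quota exceeded");
    }

    return true; // 3. Both checks passed; the controller may execute.
  }
}
```

Releasing the global reservation when the per-user check fails is a design choice of this sketch: without it, rejected requests would still eat into the shared budget.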
In what order does the NestJS Guard check rate limits?
Rate limiting for AI is about protecting against DDoS and bots
Rate limiting for AI is primarily about managing token budget and cost - not defending against attacks
DDoS protection works at the IP and request-count level. For LLMs the real threat is different: one legitimate user with a large context window or an automation script can spend the monthly budget in hours. No malicious intent required - tokens cost money, and consumption is wildly disproportionate to request count. A sliding window on RPM is useless when a single request consumes 50,000 tokens and several dollars.
Rate Limiting for AI APIs
- LLM APIs are limited by RPM and TPM. TPM is the real bottleneck: one request with 50K tokens exceeds a 30,000 TPM tier 1 quota on its own
- Token Bucket with two dimensions: request bucket (RPM) + token bucket (TPM). release() returns unused credit
- Per-user quotas via Redis: daily limits on requests and tokens, tiers (free/pro/enterprise), adjustTokenUsage() for accuracy
- NestJS Guard: global rate limit → per-user quota → execute. Rate limit headers in every response
What's Next
Rate limiting protects against overspending and ensures fair access. The next task is understanding whether the AI system works correctly through evaluation and testing.
- Evaluation: How to Test LLMs — Rate limiting is quantitative control. Evaluation is qualitative: does the model respond correctly
- Cost Management — Rate limiting and budget alerts are two budget protection mechanisms that work together
- Error Handling for LLMs — 429 errors from rate limits are one type of LLM-specific error for graceful handling
Related Lessons
- aie-05-api-integration — Rate limiting wraps the provider client
- aie-29-cost-management — Limits protect the spend budget
- aie-32-error-handling-llm — Handle 429s with backoff and retry
- aie-31-evaluation — Measure throughput under limit pressure
- net-62-rate-limiting — Same token-bucket throttling at the network layer