AI Engineering

Rate Limiting for AI APIs: Token Bucket, Sliding Window, Per-User Quotas

Lesson Objectives

Understand the difference between RPM and TPM rate limiting for LLM APIs
Implement a Token Bucket algorithm with two dimensions: requests and tokens
Build a per-user quota system with Redis for fair resource distribution
Create a NestJS Guard combining global rate limits and per-user quotas

OpenAI rate limit: 10,000 RPM for tier 5. Sounds like plenty. But one user with an automation script can send 600 requests per minute, and 17 such users take the entire account down. Rate limiting for AI isn't about 429 errors. It's about token budget per user.

OpenAI tier system: tier 1 gives 500 RPM and 30,000 TPM. One RAG request with 8K of context consumes about 8,000 tokens. Three such concurrent requests use 24,000 of the 30,000 TPM, and a fourth no longer fits - the minute's budget is gone for everyone
Anthropic limits tier 1 to 50 RPM and 40,000 TPM with a daily cap of 1,000 requests - without per-user quotas one active user can zero out the day
ChatGPT uses per-user rate limiting: the 'message limit reached' notice surfaces a per-user usage budget directly in the UI
The Vercel AI SDK documentation recommends putting a Redis-backed rate limiter (Upstash Ratelimit with Vercel KV) in front of AI endpoints - the industry has accepted this as a standard problem

From Network Algorithms to LLM Quotas

Rate limiting predates AI by decades. The **token bucket** and **leaky bucket** algorithms come from classic networking and traffic shaping, where they smoothed bursts and enforced average rates on routers and switches. Those same algorithms now govern LLM APIs, expressed as RPM (requests per minute) and TPM (tokens per minute) limits. The twist for AI is that cost and load scale with token count, not just request count, so the old buckets are applied to token budgets per user, not only to request frequency.

Prerequisites

LLM API Integration: OpenAI, Anthropic, Open-Source Models

Why Rate Limiting for AI APIs Is Different

OpenAI tier 5: 10,000 RPM. Sounds like plenty. One user with an automation script - 600 requests per minute. 17 such users - and the entire account goes down. But here's the thing: the problem isn't the number of requests. The problem is **tokens**.

Traditional rate limiting counts requests - 100 req/min, 1000 req/hour. For LLM APIs, that's a blunt instrument. A single GPT-4o request with 50,000 tokens of context consumes 100x more resources than one with 500 tokens. Providers enforce **TPM (tokens per minute)** - and TPM is the real bottleneck. Rate limiting for AI isn't about 429 errors. It's about **token budget per user**.

| Provider | Tier | RPM | TPM | RPD |
| --- | --- | --- | --- | --- |
| OpenAI (GPT-4o) | Tier 1 | 500 | 30,000 | - |
| OpenAI (GPT-4o) | Tier 5 | 10,000 | 30,000,000 | - |
| Anthropic (Claude Sonnet) | Tier 1 | 50 | 40,000 | 1,000 |
| Anthropic (Claude Sonnet) | Tier 4 | 4,000 | 400,000 | - |
| Google (Gemini 1.5) | Free | 15 | 1,000,000 | 1,500 |
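To make the gap between RPM and TPM concrete, the arithmetic can be sketched as a small helper. The function name and the `ProviderLimit` shape are illustrative assumptions, not part of any provider SDK; the numbers are the OpenAI Tier 1 figures from the table.

```typescript
// Hypothetical helper: how many requests of a given size fit into one minute
// when both the RPM limit and the TPM limit apply.
interface ProviderLimit {
  rpm: number; // requests per minute
  tpm: number; // tokens per minute
}

function effectiveRequestsPerMinute(limit: ProviderLimit, tokensPerRequest: number): number {
  if (tokensPerRequest <= 0) {
    throw new RangeError("tokensPerRequest must be positive");
  }
  // The token limit allows only whole requests of this size.
  const allowedByTokens = Math.floor(limit.tpm / tokensPerRequest);
  // The stricter of the two limits wins.
  return Math.min(limit.rpm, allowedByTokens);
}

const tier1: ProviderLimit = { rpm: 500, tpm: 30000 };

// Tiny prompts: 30,000 / 50 = 600 by tokens, so the 500 RPM limit binds.
const smallPrompts = effectiveRequestsPerMinute(tier1, 50);

// RAG prompts with 8,000 tokens: 30,000 / 8,000 = 3.75, so only 3 fit.
const ragPrompts = effectiveRequestsPerMinute(tier1, 8000);
```

With these inputs `smallPrompts` is 500 and `ragPrompts` is 3: the same account that nominally allows 500 requests per minute serves three RAG requests before TPM runs out.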

Three protection layers in an AI application:

1. **provider limit protection** - to avoid 429 errors from OpenAI
2. **per-user quotas** - so one user can't exhaust the limit for everyone else
3. **budget protection** - a financial ceiling

Each layer solves a different problem. None can be skipped.

**Retry is not a replacement for rate limiting.** Retry handles sporadic 429s, but when limits are systematically exceeded it creates a growing queue. 100 requests → 70 fail → retry in 2 seconds → 70 more → limit exhausted again. Proactive rate limiting before sending the request is what's needed.

Why is standard request-count rate limiting (RPM) insufficient for LLM APIs?

Token Bucket for LLMs: Limiting by Tokens, Not Requests

Token Bucket is an algorithm with history. A bucket fills at a constant rate. Each request takes exactly as much as it costs. Bucket empty - request waits or gets dropped. Sounds simple, but there's a catch: for LLM APIs, **two buckets are needed simultaneously** - one for RPM, one for TPM. The first protects against request storms, the second against a single 128K-token request wiping out the minute's allowance.
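A minimal sketch of the two-bucket idea, assuming continuous refill at limit/60 per second and an all-or-nothing acquire. The class and method names (`Bucket`, `DualTokenBucket`, `tryAcquire`) are hypothetical, and the clock is passed in explicitly so the behaviour is easy to follow.

```typescript
// One bucket: refills continuously at `refillPerSecond`, capped at `capacity`.
class Bucket {
  private level: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.level = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  // Add the credit accumulated since the previous call.
  refill(nowMs: number): void {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.level = Math.min(this.capacity, this.level + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);
  }

  has(cost: number): boolean {
    return this.level >= cost;
  }

  take(cost: number): void {
    this.level -= cost;
  }
}

// Two dimensions: one bucket counts requests (RPM), the other counts tokens (TPM).
class DualTokenBucket {
  private readonly requests: Bucket;
  private readonly tokens: Bucket;

  constructor(rpm: number, tpm: number, nowMs: number = Date.now()) {
    this.requests = new Bucket(rpm, rpm / 60, nowMs);
    this.tokens = new Bucket(tpm, tpm / 60, nowMs);
  }

  // Succeeds only if BOTH buckets can cover the request; otherwise takes nothing.
  tryAcquire(estimatedTokens: number, nowMs: number = Date.now()): boolean {
    this.requests.refill(nowMs);
    this.tokens.refill(nowMs);
    if (!this.requests.has(1) || !this.tokens.has(estimatedTokens)) {
      return false;
    }
    this.requests.take(1);
    this.tokens.take(estimatedTokens);
    return true;
  }
}
```

Checking both buckets before taking from either keeps a rejected request from draining the request bucket while the token bucket is empty. With `new DualTokenBucket(500, 30000)`, three 8,000-token requests pass and a fourth issued in the same instant is rejected, although 497 requests are still available on the RPM side.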

The problem: at request time, the number of **output tokens is unknown** - the model hasn't started generating yet. Standard solution: estimate using `system_tokens + user_tokens + max_tokens` (or historical average). After receiving the response, `release()` returns the unused credit to the bucket. This keeps accuracy high even when estimates are systematically inflated.
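The estimate-then-release cycle might look like the sketch below. `estimateTokens` follows the formula from the text; `TokenBudget`, `reserve`, and the two-argument form of `release` are assumptions made for illustration, and refill over time is left out to keep the focus on the refund.

```typescript
interface PromptSizes {
  systemTokens: number;
  userTokens: number;
  maxTokens: number; // upper bound on output tokens set on the request
}

// Worst-case estimate: the full prompt plus the maximum allowed output.
function estimateTokens(p: PromptSizes): number {
  return p.systemTokens + p.userTokens + p.maxTokens;
}

// Token-dimension budget only; time-based refill is omitted in this sketch.
class TokenBudget {
  private available: number;
  private readonly capacity: number;

  constructor(capacity: number) {
    this.capacity = capacity;
    this.available = capacity;
  }

  // Reserve the estimate up front, before the request is sent.
  reserve(estimated: number): boolean {
    if (estimated > this.available) {
      return false;
    }
    this.available -= estimated;
    return true;
  }

  // After the response arrives, return whatever the estimate over-reserved.
  // Under-estimates (actual > estimated) are ignored in this sketch.
  release(estimated: number, actual: number): void {
    const unused = Math.max(0, estimated - actual);
    this.available = Math.min(this.capacity, this.available + unused);
  }

  remaining(): number {
    return this.available;
  }
}
```

For example, a request with 500 system tokens, 1,500 user tokens, and `max_tokens` of 4,000 reserves 6,000 tokens from a 30,000-token budget. If the provider reports 2,600 tokens actually used, `release(6000, 2600)` returns 3,400 and the budget ends at 27,400 instead of 24,000.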

How does the token bucket handle a situation where actual token consumption was less than the estimate?

Per-User Quotas with Redis: Fair Resource Distribution

A global rate limit protects against provider 429s. It does not protect against one user with an automation script consuming 80% of the monthly budget overnight. Per-user quotas are **fairness as an architectural decision**. The limit is divided between users - not owned by whoever gets there first.

Redis is the standard choice for distributed rate limiting: atomic operations, TTL on keys, horizontal scaling. Each counter key combines userId and the current date, so a new day starts with a fresh counter and the limit resets at midnight; a TTL cleans up old keys. As for sliding window vs fixed window, a fixed window is enough for token budget management: a daily limit is easier to explain to a user than a rolling 24-hour window.
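A fixed-window daily quota along these lines can be sketched as follows. The `CounterStore` interface is an assumption that mirrors the Redis `INCRBY` and `EXPIRE` commands rather than any particular client library, and `InMemoryStore` stands in for Redis so the sketch runs without a server. The key layout, the UTC day boundary, and the `DailyTokenQuota` class are likewise illustrative.

```typescript
// Assumed minimal store interface, modelled on Redis INCRBY and EXPIRE.
interface CounterStore {
  incrBy(key: string, amount: number): Promise<number>;
  expire(key: string, seconds: number): Promise<void>;
}

// In-memory stand-in for Redis; TTLs are ignored here.
class InMemoryStore implements CounterStore {
  private readonly counters = new Map<string, number>();

  async incrBy(key: string, amount: number): Promise<number> {
    const next = (this.counters.get(key) ?? 0) + amount;
    this.counters.set(key, next);
    return next;
  }

  async expire(_key: string, _seconds: number): Promise<void> {}
}

class DailyTokenQuota {
  private readonly store: CounterStore;
  private readonly dailyTokenLimit: number;

  constructor(store: CounterStore, dailyTokenLimit: number) {
    this.store = store;
    this.dailyTokenLimit = dailyTokenLimit;
  }

  // One counter per user per UTC day, e.g. "quota:tokens:u1:2025-01-31".
  private key(userId: string, now: Date): string {
    return `quota:tokens:${userId}:${now.toISOString().slice(0, 10)}`;
  }

  // Increment first, then check: the increment is atomic in Redis,
  // so concurrent requests cannot both slip under the limit.
  async consume(userId: string, estimatedTokens: number, now: Date = new Date()) {
    const key = this.key(userId, now);
    const used = await this.store.incrBy(key, estimatedTokens);
    // The date in the key performs the reset; the TTL only cleans up old keys.
    await this.store.expire(key, 2 * 24 * 60 * 60);
    if (used > this.dailyTokenLimit) {
      await this.store.incrBy(key, -estimatedTokens); // roll back the reservation
      const remaining = Math.max(0, this.dailyTokenLimit - (used - estimatedTokens));
      return { allowed: false, remaining };
    }
    return { allowed: true, remaining: this.dailyTokenLimit - used };
  }

  // Replace the estimate with the usage the provider actually reported.
  async adjustTokenUsage(userId: string, estimated: number, actual: number, now: Date = new Date()) {
    const used = await this.store.incrBy(this.key(userId, now), actual - estimated);
    return Math.max(0, this.dailyTokenLimit - used);
  }
}
```

With a 10,000-token daily limit, a first `consume("u1", 6000)` is allowed and leaves 4,000; a second identical call is rejected and rolled back. If the first request really used 2,500 tokens, `adjustTokenUsage("u1", 6000, 2500)` raises the remaining quota to 7,500. The same `now` should be passed to both calls so that a request spanning midnight adjusts the counter it was charged to.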

The API response on quota exceeded must include **informative headers**: `X-RateLimit-Remaining-Requests`, `X-RateLimit-Remaining-Tokens`, `X-RateLimit-Reset`. This is a de-facto convention: OpenAI and the GitHub API expose similar `x-ratelimit-*` headers, though the exact names and reset formats differ between them. The client gets precise information for graceful degradation instead of guessing from a status code.
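As a sketch, the headers can be built from the quota state in one place. The `QuotaState` shape and the function name are hypothetical. Reporting `X-RateLimit-Reset` as Unix time in seconds is an assumption (one common convention), and the standard `Retry-After` header is added here as a convenience for clients.

```typescript
interface QuotaState {
  remainingRequests: number;
  remainingTokens: number;
  resetAtMs: number; // end of the current window, epoch milliseconds
}

function rateLimitHeaders(state: QuotaState, nowMs: number = Date.now()): Record<string, string> {
  const secondsUntilReset = Math.max(0, Math.ceil((state.resetAtMs - nowMs) / 1000));
  return {
    // Never report negative remaining values to the client.
    "X-RateLimit-Remaining-Requests": String(Math.max(0, state.remainingRequests)),
    "X-RateLimit-Remaining-Tokens": String(Math.max(0, state.remainingTokens)),
    // Assumption: reset is expressed as Unix time in seconds.
    "X-RateLimit-Reset": String(Math.ceil(state.resetAtMs / 1000)),
    // Standard HTTP header: how many seconds to wait before retrying.
    "Retry-After": String(secondsUntilReset),
  };
}
```

A client that receives a 429 with these headers can schedule its retry from `Retry-After` instead of guessing a backoff interval.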

Why does the per-user quota use adjustTokenUsage() after receiving a response from the LLM?

NestJS Guard: Rate Limiting for AI Endpoints

A NestJS Guard is a natural place for rate limiting: it runs before the controller, has access to the request context, and blocks requests before any business logic executes. A Guard for AI endpoints combines two layers: a global rate limit (protection against provider 429s) and a per-user quota. Order matters - global goes first. Otherwise the per-user check passes a request that will collide with OpenAI's limit on the way out.
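The ordering can be sketched as a framework-agnostic decision function. The `GlobalLimiter` and `UserQuota` interfaces and the `checkLimits` name are hypothetical. In a real Guard, `canActivate` from `@nestjs/common` would call something like this and throw an `HttpException` with status 429 when the decision is negative.

```typescript
// Hypothetical interfaces for the two layers.
interface GlobalLimiter {
  tryAcquire(estimatedTokens: number): boolean;
  release(estimatedTokens: number): void; // undo a reservation made by tryAcquire
}

interface UserQuota {
  tryConsume(userId: string, estimatedTokens: number): boolean;
}

type Decision =
  | { allowed: true }
  | { allowed: false; status: 429; reason: "global_limit" | "user_quota" };

function checkLimits(
  global: GlobalLimiter,
  quota: UserQuota,
  userId: string,
  estimatedTokens: number
): Decision {
  // Layer 1: the shared provider limit goes first.
  if (!global.tryAcquire(estimatedTokens)) {
    return { allowed: false, status: 429, reason: "global_limit" };
  }
  // Layer 2: the caller's own quota.
  if (!quota.tryConsume(userId, estimatedTokens)) {
    // Hand the global reservation back so a blocked user
    // does not consume capacity that others could use.
    global.release(estimatedTokens);
    return { allowed: false, status: 429, reason: "user_quota" };
  }
  return { allowed: true };
}
```

Returning a `reason` lets the Guard choose an error message and headers that tell the client whether waiting a minute (global limit) or until the next day (user quota) is the appropriate response.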

In what order does the NestJS Guard check rate limits?

Myth: Rate limiting for AI is about protecting against DDoS and bots

Reality: Rate limiting for AI is primarily about managing token budget and cost - not defending against attacks

DDoS protection works at the IP and request-count level. For LLMs the real threat is different: one legitimate user with a large context window or an automation script can spend the monthly budget in hours. No malicious intent required - tokens cost money, and consumption is wildly disproportionate to request count. A sliding window on RPM alone does not help when a single request consumes 50,000 tokens: it counts that request the same as one with 500 tokens.

Rate Limiting for AI APIs

LLM APIs are limited by RPM and TPM. TPM is the real bottleneck: one request with 50K tokens exceeds an entire Tier 1 per-minute quota of 30,000 TPM
Token Bucket with two dimensions: request bucket (RPM) + token bucket (TPM). release() returns unused credit
Per-user quotas via Redis: daily limits on requests and tokens, tiers (free/pro/enterprise), adjustTokenUsage() for accuracy
NestJS Guard: global rate limit → per-user quota → execute. Rate limit headers in every response

What's Next

Rate limiting protects against overspending and ensures fair access. The next task is understanding whether the AI system works correctly through evaluation and testing.

Evaluation: How to Test LLMs — Rate limiting is quantitative control. Evaluation is qualitative: does the model respond correctly
Cost Management — Rate limiting and budget alerts are two budget protection mechanisms that work together
Error Handling for LLMs — 429 errors from rate limits are one type of LLM-specific error for graceful handling

Related Lessons

aie-05-api-integration — Rate limiting wraps the provider client
aie-29-cost-management — Limits protect the spend budget
aie-32-error-handling-llm — Handle 429s with backoff and retry
aie-31-evaluation — Measure throughput under limit pressure
net-62-rate-limiting — Same token-bucket throttling at the network layer