AI Engineering

LLM Caching: Semantic Cache, Prompt Cache, KV Cache - Save 10x

Lesson Objectives

Implement an exact match cache for LLM responses in Redis with key normalization
Build a semantic cache based on embeddings with a configurable similarity threshold
Use prompt caching APIs (OpenAI, Anthropic) to save on long contexts
Understand how the KV cache works in transformers and how it affects cost and latency
Assemble a multi-level caching service in NestJS with efficiency metrics

Anthropic prompt caching cuts repeated context costs by 90%. One 10K-token system prompt, 1M requests per day, input priced at USD 3 per 1M tokens - the difference is USD 900K vs USD 90K per month. This is not a late-stage optimization. It is an architectural decision made at the start of product design. Without caching, a product built on large contexts is economically unsustainable. With caching, the same product can be profitable.

GPTCache by Zilliz - open-source semantic caching, 7000+ GitHub stars; used in enterprise RAG systems to cut costs by 40-60%
Anthropic prompt caching gives a 90% discount - critical for 'chat with document' apps on 200K tokens: USD 0.60 vs USD 0.06 per request
OpenAI prompt caching: automatic for GPT-4o at 1024+ token prefix, 50% discount - zero code changes required on the developer side
Redis Labs reports: 40% of enterprise AI projects use Redis as an LLM response cache; exact match cache delivers the fastest ROI of any optimization

How Caching Reached the LLM World

Caching is an old idea, but applying it to LLMs is recent. **GPTCache (Zilliz, 2023)** popularized semantic caching: instead of matching requests byte-for-byte, it compares meaning through embeddings, so paraphrased questions can reuse a stored answer. The providers then built caching into the APIs themselves. **Anthropic introduced prompt caching (August 2024)**, letting a large repeated context be cached and billed at a fraction of the cost. **OpenAI followed with prompt caching (October 2024)**. Within months, caching shifted from a clever add-on to a standard cost lever for any product running large prompts.

Prerequisites

Exact Match Cache: The Simplest and Most Reliable Cache

January 2024. A startup with an AI support chatbot receives an OpenAI bill: USD 47,000 for the month. Log analysis reveals that **63% of requests were duplicates** - the same questions about delivery, returns, order status. Each request spent ~800 tokens, even though an identical answer had been generated a minute earlier. Three hours of engineering work went into an exact match cache. The following month's bill: USD 16,000.

An exact match cache stores each response in Redis under a hash of the normalized prompt and answers a hit in about 5 milliseconds instead of 800. The cache key: SHA-256 of messages + model + temperature + system prompt. One rule - cache only at temperature = 0, where the response is effectively deterministic.

The catch with exact match: "What's the weather in Moscow?" and "what's the weather in moscow?" are two different keys. Any text difference - a space, a capital letter - causes a cache miss. The fix: normalize before hashing.
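The key construction described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the `llm:exact:` prefix, the type names, and the exact normalization steps (Unicode NFKC, lowercasing, whitespace collapsing) are assumptions chosen for the example.

```typescript
import { createHash } from "node:crypto";

interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface CacheKeyInput {
  model: string;
  temperature: number;
  messages: ChatMessage[];
}

// Lowercase, trim, and collapse runs of whitespace so that trivial
// differences in spacing or capitalization map to the same key.
function normalizeText(text: string): string {
  return text.normalize("NFKC").toLowerCase().trim().replace(/\s+/g, " ");
}

// SHA-256 over model + temperature + normalized messages
// (the system prompt is simply a message with role "system").
// The "llm:exact:" prefix is a hypothetical naming convention.
function buildCacheKey(input: CacheKeyInput): string {
  const payload = JSON.stringify({
    model: input.model,
    temperature: input.temperature,
    messages: input.messages.map((m) => ({
      role: m.role,
      content: normalizeText(m.content),
    })),
  });
  return "llm:exact:" + createHash("sha256").update(payload).digest("hex");
}

// Only temperature = 0 requests are treated as safe to cache.
function isCacheable(input: CacheKeyInput): boolean {
  return input.temperature === 0;
}
```

With this key, the response would be stored in Redis with an expiry (for example `SET key value EX <seconds>`). Lowercasing is a trade-off: it merges "Moscow" and "moscow", but it would also merge prompts where case carries meaning, such as code snippets, so the normalization rules should match the request type.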

Metric	Without Cache	With Exact Match Cache
Average latency	800-2000ms	5-15ms (cache hit)
Cost per 1M requests (GPT-4o)	~USD 15,000	~USD 5,500 (at 63% hit rate)
API load	100%	~37%
Response consistency	Varies	100% for cached

Why shouldn't exact match cache be used with temperature > 0?

Semantic Cache: Caching by Meaning Through Embeddings

"How do I cancel an order?", "I want to cancel my purchase", "Where's the cancel button?" - three different strings, one intent, three exact match misses. Semantic cache solves this through embeddings: instead of hashing the string, a 1536-dimensional vector is computed via `text-embedding-3-small`, and the cache counts as a hit when cosine similarity with an existing entry exceeds a threshold of 0.95-0.97.

GPTCache by Zilliz - the most popular open-source implementation, 7000+ GitHub stars. The economics: embedding costs USD 0.02 per 1M tokens. GPT-4o costs USD 2.50/USD 10.00 per 1M input/output. Even counting embedding overhead, savings are substantial. In production, similarity search runs against Qdrant or pgvector via HNSW - O(log n) instead of linear scan.

**The production version of semantic cache uses a vector database (Qdrant, Pinecone, pgvector), not iterating over all Redis keys.** A linear scan over every stored embedding is enough to demonstrate the principle, but with 10,000+ entries it becomes a bottleneck - a vector DB provides roughly O(log n) search via ANN algorithms (HNSW).
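The principle can be shown with a deliberately naive linear-scan cache. This is a sketch under assumptions: embeddings are passed in as plain number arrays (in practice they would come from an embedding model such as `text-embedding-3-small`), and the class and method names are invented for the example.

```typescript
type Embedding = number[];

interface SemanticEntry {
  embedding: Embedding;
  response: string;
}

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: Embedding, b: Embedding): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical in-memory cache; a vector DB would replace the scan below.
class LinearSemanticCache {
  private readonly entries: SemanticEntry[] = [];
  private readonly threshold: number;

  constructor(threshold: number = 0.95) {
    this.threshold = threshold;
  }

  add(embedding: Embedding, response: string): void {
    this.entries.push({ embedding, response });
  }

  // O(n) per lookup: compares the query against every stored entry and
  // returns the most similar one that reaches the threshold, else null.
  lookup(query: Embedding): { response: string; similarity: number } | null {
    let best: { response: string; similarity: number } | null = null;
    for (const entry of this.entries) {
      const similarity = cosineSimilarity(query, entry.embedding);
      if (similarity >= this.threshold && (best === null || similarity > best.similarity)) {
        best = { response: entry.response, similarity };
      }
    }
    return best;
  }
}
```

The threshold is a constructor argument so that it can be tuned per deployment; the lookup returns the similarity score alongside the response so that near-threshold hits can be logged and reviewed.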

The critical parameter is the **similarity threshold**. At 0.99, the cache triggers only for nearly identical requests. At 0.90, the net is too wide - false matches appear. The 0.95-0.97 sweet spot catches rephrasings and synonyms while filtering out semantically distinct queries.

Threshold	Hit Rate	Precision	Example
0.98-0.99	Low (~10%)	Very high	Catches only rephrasings with minimal differences
0.95-0.97	Medium (~35%)	High	Catches rephrasings and synonyms - recommended starting point
0.90-0.94	High (~55%)	Medium	May return an answer to a similar but not identical question
< 0.90	Very high	Low	High risk of false matches - not recommended

What similarity threshold is recommended for starting semantic cache in production?

Prompt Caching: OpenAI and Anthropic Cache the System Prompt

Anthropic prompt caching is a 90% discount on repeated context. The math: a 10K-token system prompt, 1M requests per day, input priced at USD 3 per 1M tokens. Without cache: USD 0.03 per request times 1M requests = USD 30,000/day, USD 900K/month. With cache: USD 3,000/day, USD 90K/month. This is not a performance detail - it is the difference between a product that can exist and one that cannot.

The mechanism: if the beginning of a prompt (prefix) matches a previously sent request, the provider reuses internal computations and reduces cost. OpenAI does it automatically for GPT-4o (50% discount). Anthropic requires explicit `cache_control` markers on content blocks in the request body (90% discount). Google Gemini needs `cachedContent` (75% discount, TTL up to 1 hour).

Provider	Mechanism	Discount	Minimum Tokens	Cache TTL
OpenAI	Automatic	50% on cached tokens	1024 token prefix	5-10 minutes
Anthropic	Explicit (cache_control)	90% on cached tokens	1024 (Sonnet) / 2048 (Haiku)	5 minutes
Google (Gemini)	Explicit (cachedContent)	75% on cached tokens	32,768 tokens	Up to 1 hour

The architectural rule: **static content goes first in the prompt**. System prompt, RAG context, few-shot examples - all of these precede the dynamic part. The user's question comes last. Cache works on prefix match - the longer the matching prefix, the more tokens fall under the discount.
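A request body that follows this rule might be assembled as below. The sketch only builds the payload and makes no network call; the field names follow Anthropic's Messages API as commonly documented (`system` as an array of text blocks, `cache_control: { type: "ephemeral" }` on the last static block), but the function name, the `max_tokens` value, and the split into two system blocks are assumptions, and the current provider documentation should be checked before relying on it.

```typescript
interface TextBlock {
  type: "text";
  text: string;
  cache_control?: { type: "ephemeral" };
}

interface MessagesRequest {
  model: string;
  max_tokens: number;
  system: TextBlock[];
  messages: { role: "user" | "assistant"; content: string }[];
}

// Static content first, dynamic content last, so the cacheable prefix
// is identical across requests.
function buildCachedRequest(
  model: string,
  staticSystemPrompt: string,
  ragContext: string,
  userQuestion: string,
): MessagesRequest {
  return {
    model,
    max_tokens: 1024, // illustrative value
    system: [
      { type: "text", text: staticSystemPrompt },
      // Cache breakpoint: the prefix up to and including this block
      // is what the provider is asked to cache.
      { type: "text", text: ragContext, cache_control: { type: "ephemeral" } },
    ],
    // The only part that changes from request to request.
    messages: [{ role: "user", content: userQuestion }],
  };
}
```

Two requests built with the same system prompt and context but different questions share a byte-identical prefix, which is the condition for a prefix cache hit.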

**Prompt cache doesn't guarantee a hit.** The cache can be evicted under heavy provider load. Don't build budgets assuming 100% cache hit rate - plan for 50-70% depending on usage patterns.
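The budgeting arithmetic can be made explicit with a small helper. It is a simplified model under stated assumptions: only the repeated prefix is priced, a 30-day month is used, and any surcharge a provider applies when writing to the cache is ignored. The function and field names are invented for the example.

```typescript
interface PromptCostInput {
  prefixTokens: number; // tokens in the repeated prefix
  requestsPerDay: number;
  pricePerMillionTokens: number; // USD, uncached input price
  cachedDiscount: number; // 0.9 means 90% off cached tokens
  hitRate: number; // fraction of requests that hit the provider cache (0..1)
  daysPerMonth?: number; // defaults to 30
}

// Blended monthly cost of the prefix: hits pay the discounted price,
// misses pay the full price.
function monthlyPrefixCostUsd(input: PromptCostInput): number {
  const days = input.daysPerMonth ?? 30;
  const fullCostPerRequest = (input.prefixTokens / 1e6) * input.pricePerMillionTokens;
  const cachedCostPerRequest = fullCostPerRequest * (1 - input.cachedDiscount);
  const blendedCostPerRequest =
    input.hitRate * cachedCostPerRequest + (1 - input.hitRate) * fullCostPerRequest;
  return blendedCostPerRequest * input.requestsPerDay * days;
}
```

For the 10K-token, 1M-requests-per-day scenario at USD 3 per 1M tokens, a hit rate of 0 gives roughly USD 900K per month and a hit rate of 1 gives roughly USD 90K. At a more cautious 60% hit rate the same formula yields roughly USD 414K, which is why the assumed hit rate belongs in the budget explicitly.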

How should a prompt be structured for maximum prompt caching efficiency?

KV Cache: Internal Transformer Optimization

API-level prompt caching gives a 90% discount. Where does that saving come from? From inside the provider's GPU: the KV cache. Without it, generating each new token means recomputing attention over the entire previous sequence, O(n^2) operations per token. With KV cache the per-token cost drops to O(n), because the Key and Value matrices for already processed tokens are stored and reused.
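A toy single-head decoder makes the reuse visible. Everything here is a simplification for illustration: the "projections" are arbitrary element-wise functions standing in for learned weight matrices, there is one layer and one head, and the counter tracks only Key/Value projections, so the quadratic-versus-linear gap shows up as total projection work across the whole generation.

```typescript
type Vec = number[];

// Counts calls to the toy Key/Value projections.
let projectionCalls = 0;

// Hypothetical stand-ins for the learned K and V projection matrices.
function projectKey(x: Vec): Vec {
  projectionCalls++;
  return x.map((v) => v * 0.5);
}
function projectValue(x: Vec): Vec {
  projectionCalls++;
  return x.map((v) => v + 1);
}

function dot(a: Vec, b: Vec): number {
  return a.reduce((sum, v, i) => sum + v * b[i], 0);
}

// Softmax attention of one query over the given keys and values.
function attend(query: Vec, keys: Vec[], values: Vec[]): Vec {
  const scores = keys.map((k) => dot(query, k));
  const maxScore = Math.max(...scores);
  const weights = scores.map((s) => Math.exp(s - maxScore));
  const total = weights.reduce((a, b) => a + b, 0);
  return query.map((_, d) =>
    values.reduce((acc, v, i) => acc + (weights[i] / total) * v[d], 0),
  );
}

// No cache: every step re-projects K and V for the whole prefix.
function decodeWithoutCache(tokens: Vec[]): Vec[] {
  return tokens.map((token, t) => {
    const prefix = tokens.slice(0, t + 1);
    const keys = prefix.map((x) => projectKey(x));
    const values = prefix.map((x) => projectValue(x));
    return attend(token, keys, values);
  });
}

// KV cache: each step projects only the new token and appends it.
function decodeWithCache(tokens: Vec[]): Vec[] {
  const keys: Vec[] = [];
  const values: Vec[] = [];
  return tokens.map((token) => {
    keys.push(projectKey(token));
    values.push(projectValue(token));
    return attend(token, keys, values);
  });
}

// Runs a decoder and reports how many projections it needed.
function countProjections(decode: (tokens: Vec[]) => Vec[], tokens: Vec[]) {
  projectionCalls = 0;
  const outputs = decode(tokens);
  return { outputs, projections: projectionCalls };
}
```

Both decoders produce the same outputs; the cached version simply avoids redoing work for tokens it has already seen. For n tokens the uncached path performs n(n+1) projections in this toy and the cached path 2n.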

KV cache eats **GPU memory**. For a GPT-4 class model: ~1-2 MB per context token. With a 128K token context window, that is up to 256 GB of GPU memory just for the cache. This is why providers charge more for long prompts and limit context windows - GPU memory is the physical bottleneck, not a marketing choice.
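The memory figure follows from a simple product. The sketch below uses a hypothetical model shape (64 layers, 64 KV heads, head dimension 128, 16-bit values) picked only because it lands on 2 MiB per token; real model configurations are generally not public and will differ.

```typescript
interface ModelShape {
  layers: number;
  kvHeads: number; // number of Key/Value heads (smaller with grouped-query attention)
  headDim: number;
  bytesPerValue: number; // 2 for fp16/bf16
}

// Per token, every layer stores one Key and one Value vector per KV head.
function kvBytesPerToken(shape: ModelShape): number {
  return 2 * shape.layers * shape.kvHeads * shape.headDim * shape.bytesPerValue;
}

// Total KV cache size for one sequence of the given length.
function kvBytesForContext(shape: ModelShape, contextTokens: number): number {
  return kvBytesPerToken(shape) * contextTokens;
}

// Hypothetical shape, not the configuration of any specific model.
const exampleShape: ModelShape = { layers: 64, kvHeads: 64, headDim: 128, bytesPerValue: 2 };

const MIB = 1024 ** 2;
const GIB = 1024 ** 3;
const perTokenMiB = kvBytesPerToken(exampleShape) / MIB;
const fullContextGiB = kvBytesForContext(exampleShape, 128 * 1024) / GIB;
```

With this shape, `perTokenMiB` is 2 and `fullContextGiB` is 256 for a 128K-token context. Reducing `kvHeads` to 8, as grouped-query attention does, cuts both numbers by a factor of eight, which is one reason the per-token estimate varies so much between models.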

Aspect	KV Cache	Prompt Cache (API)	Semantic Cache (Application)
Level	Inside the model (GPU)	Provider-side	Application-side
Developer control	None (automatic)	Partial (prompt structure)	Full
What's cached	Key-Value attention matrices	Intermediate prefix computations	Ready LLM responses
Cost impact	Indirect (context window)	50-90% discount	100% (call not made)
Latency impact	10-15x generation speedup	50%+ TTFB reduction	Response in ~5ms

For developers using APIs, KV cache is a background mechanism. But understanding it explains three things at once: why long prompts cost more, why TTFB grows with context length, and where the 90% Anthropic discount actually comes from - the provider saves GPU resources by not rebuilding the KV cache from scratch.

What problem does KV cache solve in a transformer?

Production: NestJS Service with Multi-level Caching

In production, **multi-level caching** is used: exact match -> semantic cache -> prompt caching API -> LLM call. Each level filters out its share of requests. Only unique, never-before-seen requests reach the actual API call. A combined hit rate of 60-70% translates to 50-80% savings on the LLM API budget.

The final architecture: **three cache levels**, each with its own strategy. Exact match (Redis, ~5ms) catches repeated requests. Semantic cache (vector DB, ~50ms) catches rephrasings. Prompt caching (provider API, ~200ms vs ~800ms) reduces cost for long contexts.
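The lookup order can be sketched as a small service with pluggable layers. This is an outline under assumptions, not the lesson's actual service: the `CacheLayer` interface, the class names, and the in-memory `MapLayer` (standing in for Redis or a vector DB) are invented for the example. In a NestJS application the service class would typically be registered as a provider with `@Injectable()`; the decorator is omitted so the sketch runs without dependencies. Provider-side prompt caching is not a layer here, because it happens inside the LLM call itself.

```typescript
interface CacheLayer {
  name: string;
  get(prompt: string): Promise<string | null>;
  set(prompt: string, response: string): Promise<void>;
}

type LlmCall = (prompt: string) => Promise<string>;

class MultiLevelLlmCache {
  readonly hits: Record<string, number> = {};
  misses = 0;
  private readonly layers: CacheLayer[];
  private readonly callLlm: LlmCall;

  constructor(layers: CacheLayer[], callLlm: LlmCall) {
    this.layers = layers; // ordered cheapest first: exact, then semantic
    this.callLlm = callLlm;
  }

  async complete(prompt: string): Promise<{ response: string; source: string }> {
    // Check each layer in order and stop at the first hit.
    for (const layer of this.layers) {
      const cached = await layer.get(prompt);
      if (cached !== null) {
        this.hits[layer.name] = (this.hits[layer.name] ?? 0) + 1;
        return { response: cached, source: layer.name };
      }
    }
    // Full miss: call the model, then populate every layer.
    this.misses++;
    const response = await this.callLlm(prompt);
    await Promise.all(this.layers.map((layer) => layer.set(prompt, response)));
    return { response, source: "llm" };
  }

  // Fraction of requests answered without an LLM call.
  hitRate(): number {
    const hitCount = Object.values(this.hits).reduce((a, b) => a + b, 0);
    const total = hitCount + this.misses;
    return total === 0 ? 0 : hitCount / total;
  }
}

// In-memory stand-in for a real backend such as Redis.
class MapLayer implements CacheLayer {
  readonly name: string;
  private readonly store = new Map<string, string>();
  private readonly normalize: (prompt: string) => string;

  constructor(name: string, normalize: (prompt: string) => string = (p) => p) {
    this.name = name;
    this.normalize = normalize;
  }

  async get(prompt: string): Promise<string | null> {
    return this.store.get(this.normalize(prompt)) ?? null;
  }

  async set(prompt: string, response: string): Promise<void> {
    this.store.set(this.normalize(prompt), response);
  }
}
```

Keeping per-layer hit counters makes it possible to see which level earns its keep; a semantic layer with a near-zero hit count, for instance, may not justify its embedding cost.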

In what order should the system check caches for maximum efficiency?

Myth: Caching LLM responses is risky - data will go stale and users will get outdated answers

Reality: For stable requests (FAQ, documentation, classification) caching is safe and essential. TTL controls freshness

Caching is risky for requests that depend on real-time data ('what is the current dollar rate?') or personal context. But an FAQ about return policy does not change hourly. API documentation stays stable for weeks. Ticket classification by topic - for months. For these patterns, a cache with a 1-24 hour TTL is completely safe. The key question is not 'cache or not' but 'what TTL fits this request type'.
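One way to encode "what TTL fits this request type" is a policy table consulted on every write. The request categories and TTL values below are illustrative assumptions within the 1-24 hour range mentioned above, and the in-memory `TtlCache` with an injectable clock is a stand-in for a store with native expiry such as Redis.

```typescript
type RequestKind = "faq" | "documentation" | "classification" | "realtime" | "personal";

// TTL in seconds per request kind; 0 means "never cache".
// The specific values are hypothetical and should be tuned per product.
const TTL_SECONDS: Record<RequestKind, number> = {
  faq: 6 * 3600,
  documentation: 24 * 3600,
  classification: 24 * 3600,
  realtime: 0,
  personal: 0,
};

class TtlCache {
  private readonly store = new Map<string, { value: string; expiresAt: number }>();
  private readonly now: () => number;

  // The clock is injectable so expiry can be tested without waiting.
  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  set(key: string, value: string, kind: RequestKind): void {
    const ttl = TTL_SECONDS[kind];
    if (ttl === 0) return; // real-time and personal requests are not cached
    this.store.set(key, { value, expiresAt: this.now() + ttl * 1000 });
  }

  get(key: string): string | null {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // expired: drop and report a miss
      return null;
    }
    return entry.value;
  }
}
```

Classifying a request into one of these kinds is the hard part and is left out of the sketch; a conservative default is to treat anything unclassified as not cacheable.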

Myth: Provider prompt caching replaces application-level cache - one of the two is enough

Reality: These are different layers: prompt caching reduces the cost of an API call, application cache eliminates the call entirely. Both are needed

Prompt caching (OpenAI/Anthropic) applies to API calls that are still made - it reduces their cost, but each call still takes ~200-800ms. Application-level cache (Redis exact + semantic) returns a response in ~5ms with no API call at all. The full architecture uses both: application cache eliminates 60-70% of requests, prompt cache reduces the cost of the remaining ones.

LLM Caching Strategies

Exact match cache: Redis + SHA-256 of normalized prompt. Works at temperature = 0. Response in ~5ms. Fastest ROI of any optimization
Semantic cache: text-embedding-3-small (1536 dim) + cosine similarity 0.95-0.97. Catches rephrasings. Qdrant/pgvector in production
Prompt caching API: OpenAI -50% automatically (1024+ token prefix), Anthropic -90% explicitly via cache_control. Place static content first
KV cache: internal GPU optimization, O(n^2) to O(n). Explains why long contexts cost more and where provider discounts come from
Multi-level NestJS service: exact -> semantic -> API prompt cache -> full call. Combined savings 50-80% of LLM API budget

What's Next

Caching is the first level of LLM cost optimization. The next step is comprehensive cost management: from token counting to choosing the right model for the task.

Cost Management — Caching reduces the number of calls, cost management optimizes the cost of each call
Rate Limiting for AI API — Rate limiting protects against budget overspend and exceeding provider quotas
Observability — Cache metrics (hit rate, tokens saved) are part of the observability pipeline

Related Lessons

aie-09-embeddings — Semantic cache keys built from embeddings
aie-29-cost-management — Caching directly cuts per-request cost
aie-30-rate-limiting-ai — Cache hits reduce pressure on rate limits
aie-35-observability — Track hit rate to validate cache value
sd-07-caching — Same cache layering as classic system design
db-26-caching