AI Engineering

AI System Design: production AI application architecture from zero to scale

Lesson objectives

Calculate load, cost, and concurrency for an AI system
Design a 6-layer architecture for a production AI application
Apply scaling strategies: caching, routing, prompt compression, queue management
Implement resilience patterns: circuit breaker, multi-provider fallback, budget kill switch

AI system design differs from classical system design in one respect: a nondeterministic component sits at the center. A deterministic REST endpoint returns the same result for the same input. An LLM doesn't. That changes everything: testing, monitoring, SLA, fallback strategies. A startup plugs in the OpenAI API, a month later gets a USD 47K bill, 5% of requests fail on rate limits, users complain about hallucinations - and only then does real system design begin.

Notion AI handles 100M+ requests per day - multi-provider routing, aggressive semantic caching, per-user rate limits; without this stack, costs would be 4x higher
Vercel AI Gateway - an open-source solution for routing, caching, and fallback between LLM providers; it appeared because every team was building the same thing from scratch
40% of Anthropic enterprise clients exceed their budget in the first month without budget controls - the budget kill switch became a standard architecture component
Stripe AI fraud detection: 6-layer architecture, 99.99% uptime via circuit breaker across 3 LLM providers, shadow mode testing before every model switch

How the LLM app stack came together

Before late 2022 a typical AI application was a thin wrapper around a single model call. The launch of ChatGPT on November 30, 2022 and the public ChatGPT API in March 2023 set off a wave in which a recognizable stack for production LLM applications took shape within a year. It came from the practice of dozens of teams rather than a single paper: retrieval (RAG over a vector database) gives the model fresh context, agents and tool calling let it take actions, guardrails filter input and output, and an evaluation layer measures quality against regressions. 2023 brought the tooling that cemented these layers: LangChain and LlamaIndex for orchestration and retrieval, vector databases like Pinecone, Weaviate, and pgvector, and observability platforms such as LangSmith. The defining contrast with classical architecture is that the central component is nondeterministic: the same request can return different answers, so testing, caching, and error handling are designed differently. By 2024 this shape (RAG plus agents plus guardrails plus eval) had become the de facto industry template for building AI services.

Prerequisites

Load Estimation: calculating load for an AI system

AI system design starts with estimation - and here's the first surprise. A typical CRUD request costs USD 0.0001 and takes 50ms. An AI request costs USD 0.001-0.10 and takes 1-30 seconds. This isn't a quantitative difference - it's different physics. **High latency** (1-30s), **variable cost** (depends on prompt length), **limited throughput** from providers - each parameter breaks familiar scaling patterns.

Take a product with 10,000 DAU, 15 AI requests per user per day, and Claude Sonnet at USD 3 per 1M input tokens. Sounds like a small product. Do the math: 150,000 requests/day × 2,000 input tokens per request = 300M input tokens = **USD 900 per day = USD 27,000 per 30-day month**, for input tokens alone. No caching, no model routing, no prompt compression - just raw requests. This is exactly why estimation for an AI system comes first, not last.
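The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name and its parameters are illustrative, and it counts input tokens only, exactly as the estimate in the text does.

```python
def monthly_input_cost_usd(dau, requests_per_user, tokens_per_request,
                           usd_per_million_tokens, days=30):
    """Estimate monthly spend on input tokens only (output tokens excluded)."""
    requests_per_day = dau * requests_per_user
    tokens_per_day = requests_per_day * tokens_per_request
    daily_cost = tokens_per_day / 1_000_000 * usd_per_million_tokens
    return daily_cost * days


# Figures from the text: 10,000 DAU, 15 requests each, 2,000 tokens, USD 3 per 1M.
cost = monthly_input_cost_usd(10_000, 15, 2_000, 3.0)
print(f"USD {cost:,.0f}/month")  # USD 27,000/month
```

Changing any one input (a longer system prompt, a pricier model) scales the bill linearly, which is why the estimate is worth rerunning whenever the prompt changes.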

**USD 27K/month on input tokens alone with 10K DAU is a real number**, and output tokens push the bill higher. This is why AI system design ALWAYS includes an optimization strategy: semantic caching, model routing (cheaper model for simple queries), prompt compression.
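Model routing can start as a plain heuristic. The sketch below is an assumption-heavy illustration: the model names `small-model` and `large-model`, the length cutoff, and the keyword list are all hypothetical placeholders, and a production router would more likely use a classifier or past quality data.

```python
def choose_model(prompt, cheap="small-model", strong="large-model",
                 max_cheap_chars=400):
    """Send short, simple prompts to the cheap model and everything else to the strong one."""
    # Hypothetical markers of tasks that tend to need a stronger model.
    hard_markers = ("analyze", "prove", "refactor", "step by step")
    text = prompt.lower()
    if len(prompt) > max_cheap_chars or any(m in text for m in hard_markers):
        return strong
    return cheap


print(choose_model("What is your refund policy?"))        # small-model
print(choose_model("Refactor this module to use async"))  # large-model
```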

One row in the table below flips the whole mental model: throughput. A standard backend holds 1,000+ RPS on a single instance. An AI backend typically sustains 10-100 RPS per API key - because the bottleneck isn't the server, it's the API rate limits from the provider. This completely inverts the logic of scaling.

| Metric | Standard Backend | AI Backend |
| --- | --- | --- |
| Latency (p50) | 50-200 ms | 1,000-10,000 ms |
| Cost per request | USD 0.0001 | USD 0.001-0.10 |
| Throughput | 1,000+ RPS | 10-100 RPS (per API key) |
| Failure mode | Timeout, 5xx | Rate limit, content filter, hallucination |
| Scaling | Horizontal (add servers) | Limited by API rate limits |

A SaaS application: 5000 DAU, 10 AI requests per user per day, average latency of 4 seconds. Approximately how many concurrent requests at peak? (peak = 3x average)
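One way to approach a question like this is Little's law: concurrent requests = arrival rate × time each request stays in the system. The sketch below assumes traffic is spread evenly over 24 hours before the peak factor is applied; the function name is illustrative.

```python
def peak_concurrency(dau, requests_per_user, avg_latency_s, peak_factor=3.0):
    """Little's law: in-flight requests = arrival rate (req/s) * latency (s)."""
    avg_rps = dau * requests_per_user / 86_400  # seconds per day
    peak_rps = avg_rps * peak_factor
    return peak_rps * avg_latency_s


# 5,000 DAU, 10 requests each, 4 s latency, peak = 3x average.
n = peak_concurrency(5_000, 10, 4.0)
print(round(n, 1))  # 6.9, i.e. roughly 7 requests in flight at peak
```

Real traffic is burstier than a flat daily average, so a figure like this is a floor for capacity planning, not a ceiling.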

Component Diagram: AI application architecture

In a standard microservice, adding servers scales throughput linearly. In an AI system, adding servers doesn't help: the bottleneck is the LLM API. So the architecture is built differently - around **6 layers**, each addressing a specific problem created by AI-specific traffic.

The observability layer in an AI system isn't just logs and metrics. It's **hallucination rate**, **cost per request**, **response quality over time** (model degradation after a silent provider update). Langfuse and Helicone are built specifically for this - they understand tokens, models, and quality evaluation. Prometheus + Grafana don't do this out of the box.

**Rule: each layer must be replaceable without changing the others.** Switching from Pinecone to pgvector shouldn't affect the orchestration layer. Switching from OpenAI to Anthropic - only the LLM layer. Interfaces between layers are the contract.
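The replaceability rule can be expressed as interfaces that the orchestration layer depends on. This is a minimal sketch under assumptions: `LLMClient`, `Retriever`, `Orchestrator`, and the stub classes are hypothetical names invented for illustration, not part of any specific framework.

```python
from __future__ import annotations

from typing import Protocol


class LLMClient(Protocol):
    """Contract for the LLM layer; any provider wrapper can satisfy it."""
    def complete(self, prompt: str) -> str: ...


class Retriever(Protocol):
    """Contract for the retrieval layer (Pinecone, pgvector, and so on)."""
    def search(self, query: str, k: int) -> list[str]: ...


class Orchestrator:
    """Depends only on the two contracts above, never on a concrete vendor."""
    def __init__(self, llm: LLMClient, retriever: Retriever):
        self.llm = llm
        self.retriever = retriever

    def answer(self, question: str) -> str:
        docs = self.retriever.search(question, k=3)
        prompt = "Context:\n" + "\n".join(docs) + "\n\nQuestion: " + question
        return self.llm.complete(prompt)


class EchoLLM:
    """Stub LLM that echoes the last prompt line; stands in for a real provider."""
    def complete(self, prompt: str) -> str:
        return "[echo] " + prompt.splitlines()[-1]


class InMemoryRetriever:
    """Stub retriever that returns the first k documents without ranking."""
    def __init__(self, docs: list[str]):
        self.docs = docs

    def search(self, query: str, k: int) -> list[str]:
        return self.docs[:k]


orchestrator = Orchestrator(EchoLLM(), InMemoryRetriever(["Stripe uses API keys."]))
print(orchestrator.answer("How do I integrate Stripe?"))
```

Swapping `InMemoryRetriever` for a pgvector-backed class, or `EchoLLM` for a provider client, leaves `Orchestrator` untouched as long as the method signatures hold.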

Which layer of the AI architecture is responsible for model selection and semantic cache checking?

Scaling Strategy: scaling an AI system

The classic scaling mantra is "add servers". In an AI system, that doesn't work. The bottleneck isn't CPU or RAM - it's **LLM API rate limits and token costs**. Provider limits are set per account tier and model: an entry-level key may get on the order of 40,000 tokens per minute from Anthropic or 30,000 from OpenAI, and the exact numbers change with tier and over time. Adding servers doesn't change those limits. Scaling is built around a single principle: **call the LLM as rarely as possible**.
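To see why servers don't help, convert a tokens-per-minute limit into a request ceiling. The sketch uses the 40,000 tokens-per-minute figure quoted above and the 2,000-token request from the earlier estimate; both are inputs to plug in, and the function name is illustrative.

```python
def max_requests_per_second(tokens_per_minute_limit, tokens_per_request):
    """Upper bound on sustained throughput imposed by a tokens-per-minute limit."""
    requests_per_minute = tokens_per_minute_limit / tokens_per_request
    return requests_per_minute / 60


rps = max_requests_per_second(40_000, 2_000)
print(f"{rps:.2f} RPS")  # 0.33 RPS, i.e. 20 requests per minute
```

The ceiling depends only on the provider limit and the prompt size, so the levers are a higher tier, more keys or providers, shorter prompts, or fewer calls.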

Semantic cache is the most counterintuitive level of the pyramid. It seems like requests are all different - cache won't help. But "How do I integrate Stripe?" and "How do I add Stripe to my project?" are the same thing. `text-embedding-3-small` (1536 dim, USD 0.02 per 1M) turns both queries into nearby vectors. Cosine similarity > 0.95 - return the cached response. **Real-world hit rate in production: 35-50%.** Every hit saves USD 0.005-0.05 and 2-10 seconds of latency.
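A semantic cache lookup is a nearest-neighbor search with a similarity cutoff. The sketch below is deliberately minimal and makes several assumptions: embeddings are passed in as plain lists (in practice they would come from an embedding model such as `text-embedding-3-small`), the toy 3-dimensional vectors are made up, and the linear scan would be replaced by a vector index at scale.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Returns a cached response when a stored query is similar enough."""
    def __init__(self, threshold=0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def put(self, embedding, response):
        self.entries.append((list(embedding), response))

    def get(self, embedding):
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine_similarity(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Hit only when similarity clears the threshold; otherwise call the LLM.
        return best_response if best_score > self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([0.9, 0.1, 0.0], "Install the Stripe SDK and set your API key.")
print(cache.get([0.88, 0.12, 0.01]))  # near-duplicate query: cache hit
print(cache.get([0.0, 0.2, 0.9]))     # unrelated query: None
```

The threshold is the main tuning knob: lowering it raises the hit rate but also the risk of answering a different question than the one asked.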

**Case study:** One SaaS product reduced AI costs from USD 45K/mo to USD 12K/mo by applying all 4 levels: BullMQ for smoothing (-5%), prompt compression (-20%), semantic cache with 42% hit rate (-42%), model routing (-25% of the remainder).

Which AI system scaling strategy typically yields the largest cost reduction?

Failure Modes: what breaks in AI systems and how to protect against it

A REST API returns the same thing for the same input. An LLM doesn't. And that changes everything: testing, monitoring, SLA, fallback strategies. **Hallucination** isn't a bug in code that can be reproduced and fixed. It's a probabilistic failure mode that appears randomly and requires probabilistic mitigation: RAG grounding, fact-checking, confidence scores. Standard backend engineering has no equivalent.

There's another failure mode that catches teams off guard: **model degradation**. A provider silently updates the model behind an alias - and the new snapshot behaves differently from the one the prompts were tuned against (for example, `gpt-4o-2024-08-06` versus a later snapshot). Response quality drops by, say, 15%. CPU metrics and latency look fine. Hallucination rate in Langfuse is climbing. Without AI-specific observability, this gets discovered a week later when users start complaining.

A circuit breaker for LLM isn't the same as for a standard service. A classic circuit breaker counts errors (5xx, timeout) and trips open. An LLM circuit breaker also tracks: **hallucination rate > 10%**, **cost per request > USD threshold**, **structured output parse failures > 20%**. Anthropic went down for 15 minutes in March 2024 - every system without a fallback went down with it.
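The LLM-specific trip conditions above can be sketched as a rolling-window breaker. Several things here are assumptions: the class and field names are hypothetical, the `hallucinated` flag is expected to come from some external evaluator, the 50% error-rate and USD 0.05 cost thresholds are placeholders, and recovery (half-open probing after a cooldown) is reduced to a manual `reset()`.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class CallOutcome:
    error: bool = False         # 5xx, timeout, rate limit
    hallucinated: bool = False  # flagged by an external evaluator (assumed)
    parse_failed: bool = False  # structured output could not be parsed
    cost_usd: float = 0.0


class LLMCircuitBreaker:
    def __init__(self, window=100, min_calls=20, max_error_rate=0.5,
                 max_hallucination_rate=0.10, max_parse_failure_rate=0.20,
                 max_avg_cost_usd=0.05):
        self.outcomes = deque(maxlen=window)  # rolling window of recent calls
        self.min_calls = min_calls
        self.max_error_rate = max_error_rate
        self.max_hallucination_rate = max_hallucination_rate
        self.max_parse_failure_rate = max_parse_failure_rate
        self.max_avg_cost_usd = max_avg_cost_usd
        self.is_open = False

    def record(self, outcome):
        self.outcomes.append(outcome)
        n = len(self.outcomes)
        if n < self.min_calls:
            return  # not enough data to judge rates yet
        tripped = (
            sum(o.error for o in self.outcomes) / n > self.max_error_rate
            or sum(o.hallucinated for o in self.outcomes) / n > self.max_hallucination_rate
            or sum(o.parse_failed for o in self.outcomes) / n > self.max_parse_failure_rate
            or sum(o.cost_usd for o in self.outcomes) / n > self.max_avg_cost_usd
        )
        if tripped:
            self.is_open = True  # stays open until reset() is called

    def allow_request(self):
        return not self.is_open

    def reset(self):
        self.outcomes.clear()
        self.is_open = False


breaker = LLMCircuitBreaker(min_calls=10)
for _ in range(8):
    breaker.record(CallOutcome(cost_usd=0.01))
breaker.record(CallOutcome(hallucinated=True, cost_usd=0.01))
breaker.record(CallOutcome(hallucinated=True, cost_usd=0.01))
print(breaker.allow_request())  # False: 2 of 10 calls hallucinated, above the 10% limit
```

When the breaker is open, the caller would route to a fallback provider or a degraded response instead of the primary model.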

**Budget Kill Switch** is a critically important component. If costs start growing uncontrollably due to a bug or attack, the system must shut off LLM calls automatically.
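A budget kill switch can be as small as a spend counter checked before every LLM call. This sketch is an in-process illustration with hypothetical names (`BudgetKillSwitch`, `BudgetExceeded`); it assumes a shared store and a scheduled daily reset would replace the in-memory counter in a multi-instance deployment.

```python
class BudgetExceeded(RuntimeError):
    """Raised when LLM calls are blocked because the budget is spent."""


class BudgetKillSwitch:
    def __init__(self, daily_limit_usd):
        self.daily_limit_usd = daily_limit_usd
        self.spent_usd = 0.0
        self.tripped = False

    def check(self):
        """Call before every LLM request; raises once the switch has tripped."""
        if self.tripped:
            raise BudgetExceeded(
                f"spent USD {self.spent_usd:.2f} of USD {self.daily_limit_usd:.2f}"
            )

    def record_spend(self, cost_usd):
        """Call after every LLM response with the cost of that request."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.daily_limit_usd:
            self.tripped = True

    def reset(self):
        """Assumed to be called by a scheduler at the start of each day."""
        self.spent_usd = 0.0
        self.tripped = False


switch = BudgetKillSwitch(daily_limit_usd=1.0)
for _ in range(3):
    switch.check()
    switch.record_spend(0.4)  # third call pushes spend past the limit
try:
    switch.check()
except BudgetExceeded as exc:
    print("blocked:", exc)
```

Failing closed is the point: a blocked feature is cheaper than a runaway loop that keeps billing until someone notices.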

Myth: AI systems can use the same patterns as regular microservices - circuit breaker, retry, load balancing

Reality: The patterns share names but the implementation is fundamentally different due to LLM nondeterminism

A standard circuit breaker counts 5xx and timeouts - everything is deterministic. An AI circuit breaker must track hallucination rate (a probabilistic metric), cost per request (variable), and structured output parse failures. A/B testing for LLM isn't just traffic splitting: quality evaluation requires LLM-as-judge or human eval, because there's no deterministic correct answer to compare against. Shadow mode testing - routing traffic through a new model without returning its output to users - became standard practice precisely because status codes can't be compared.
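Shadow mode, as described above, can be sketched as a request handler that calls both models but returns only the primary's answer. The function and argument names are hypothetical, the models are passed in as plain callables, and the candidate call runs inline for simplicity; in production it would typically run asynchronously, off the user's request path.

```python
def handle_request(prompt, primary, candidate, shadow_log):
    """Serve the primary model's answer; record the candidate's for offline comparison."""
    response = primary(prompt)
    try:
        shadow_response = candidate(prompt)
        shadow_log.append(
            {"prompt": prompt, "primary": response, "candidate": shadow_response}
        )
    except Exception as exc:
        # A failing candidate must never affect what the user receives.
        shadow_log.append(
            {"prompt": prompt, "primary": response, "candidate_error": repr(exc)}
        )
    return response


log = []
answer = handle_request(
    "Summarize the refund policy",
    primary=lambda p: "current-model answer",
    candidate=lambda p: "new-model answer",
    shadow_log=log,
)
print(answer)               # current-model answer
print(log[0]["candidate"])  # new-model answer
```

The logged pairs are what an LLM-as-judge or human reviewer later scores to decide whether the candidate is ready to take real traffic.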

Which failure mode is specific to AI systems and absent in standard backends?

Myth: AI system design is just adding retry and fallback to LLM calls

Reality: AI system design requires rethinking every component: from SLA (hallucination rate < X%) to observability (LLM-aware tools such as Langfuse alongside Prometheus) and testing (shadow mode and evals in addition to unit tests)

A unit test that asserts an exact LLM output is meaningless - there's no deterministic correct answer. An SLA of '99.9% uptime' is incomplete without 'hallucination rate < 2%' and 'cost per request < USD 0.05'. Standard monitoring will show everything is green while response quality degrades after a silent model update from the provider. This is a different engineering discipline, not a layer on top of an existing one.

Key Takeaways

A nondeterministic LLM at the center changes testing, monitoring, SLA, and fallback strategies
6 architecture layers: Gateway, Orchestration, LLM, Retrieval, Tools, Observability
Scaling pyramid: queue → prompt compression → semantic cache (35-50% hit rate) → model routing → multi-provider
AI-specific failure modes: hallucination, content policy, model degradation, cost spikes - each requires its own approach
Shadow mode testing and A/B for LLM are required - a probabilistic system can't be tested with deterministic tests alone
Budget kill switch and hallucination rate in observability are not optional - they're baseline components

Questions for reflection

What would an SLA for an AI feature look like in a real product - what metrics beyond uptime and latency need to be included?
Shadow mode testing for LLM: how to decide when the new model is 'good enough' to switch, when there's no deterministic correct answer?
Semantic cache gets 35-50% hit rate - but cache goes stale. How to set TTL for LLM responses in a system where knowledge changes?

What's Next

The architecture is designed. Next step - real-time AI: WebSocket + LLM streaming, voice assistants, collaborative editing. And then - a concrete implementation on NestJS.

Realtime AI — WebSocket + LLM streaming, voice assistants, live collaboration
AI Backend on NestJS — Concrete implementation of the architecture on Node.js/NestJS

Related lessons

aie-21-orchestration-patterns — System design composes orchestration patterns
aie-35-observability — Observability is a design pillar at scale
aie-43-realtime-ai — Design supports real-time AI workloads
aie-40-model-serving — Serving is a building block of the design
sd-01-intro — Same system design methodology, AI-specific
net-37-load-balancing
db-04-cap