Capstone: Designing and Building a Production AI Application from Scratch

Lesson Objectives

Design an AI application architecture from requirements to stack selection
Implement the full pipeline: API → RAG → LLM → Tool Calling → Response
Apply production hardening: caching, rate limiting, monitoring, error handling
Set up containerization (Docker), CI/CD, and secrets management
Organize a demo → feedback → evaluation → iteration cycle

An engineer walks out of a YC startup interview - hired. Forty candidates competed for the role. The deciding factor was a single GitHub repo: a Knowledge Assistant with Qdrant, streaming responses, an agent loop, and semantic cache. Not a degree. Not LeetCode scores. The entire AI Engineering stack - RAG, embeddings, tool calling, vector DB, cost monitoring - assembled into one working product. That is exactly what you build here.

Stripe Docs AI: RAG over 500+ pages of documentation with tool calling for code generation - architecturally identical to the capstone project
Cursor IDE: agentic pipeline with codebase RAG, embeddings via text-embedding-3-small, streaming over SSE - same patterns, larger scale
Perplexity AI: search + RAG + citations - under the hood: vector store, re-ranking, LLM-as-judge for relevance scoring; started at USD 50 per day on API
An estimated 70% of YC W24/S24 startups use the RAG + multi-agent architecture covered in this course - the only differences are data volume and the number of tools

AI Engineering becomes a discipline

Before ChatGPT (November 2022), building with AI mostly meant training models - the work of ML researchers. The wave of capable hosted models that followed split off a separate craft: composing existing models into reliable products through prompting, RAG, tool calling, evaluation, and cost control. By 2023-2024 the title "AI Engineer" had spread across job boards, and patterns like vector search, agent loops, and guardrails settled into a shared toolkit. There was no single founding event, just steady maturation. A capstone is where that scattered toolkit becomes one working system.

Prerequisites

Planning: Requirements, Architectural Decisions, Stack Selection

Every production AI project starts not with code and not with picking an LLM provider, but with **clearly defining the problem**. Latency requirements dictate streaming. Cost requirements dictate model routing and semantic cache. Scale requirements dictate queues and horizontal scaling. Without these answers, even a perfectly written RAG pipeline will be optimizing the wrong thing.

The capstone project is an **AI Knowledge Assistant**: documentation is ingested, chunked, and indexed in Qdrant using `text-embedding-3-small`. On each user query, RAG retrieval runs, the results feed an agent with tool calling (web search, calculator, external API), and the answer streams over SSE. This is not a toy example - it is the Stripe Docs AI and Cursor architecture in miniature.
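The last step of that flow, streaming the answer over SSE, comes down to a simple wire format. The sketch below shows one way to frame tokens as Server-Sent Events; the `{ token }` payload shape and the `done` event name are assumptions made for illustration, not something the capstone prescribes.

```typescript
// Format one Server-Sent Events frame: each payload line becomes a "data:"
// field, and a blank line terminates the event.
function sseFrame(data: string, event?: string): string {
  const head = event ? `event: ${event}\n` : "";
  const body = data
    .split("\n")
    .map((line) => `data: ${line}`)
    .join("\n");
  return `${head}${body}\n\n`;
}

// Wrap LLM tokens as JSON payloads and close with a "done" event.
// The { token } shape and the "done" event name are assumptions of this sketch.
function toSseFrames(tokens: string[]): string[] {
  const frames = tokens.map((token) => sseFrame(JSON.stringify({ token })));
  frames.push(sseFrame("{}", "done"));
  return frames;
}
```

In a real handler each frame would be written to the response as its token arrives, with the `Content-Type: text/event-stream` header set, rather than collected into an array first.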

Stack selection must be driven by requirements, not hype or what everyone else is using. Every component needs to be justified by a concrete need:

Component	Choice	Alternatives	Rationale
Runtime	Node.js + NestJS	Python FastAPI, Go	Ecosystem, type safety, familiar stack
LLM Provider	OpenAI GPT-4o + fallback Anthropic	Ollama (self-hosted)	Price/quality balance, abstraction layer for switching
Vector Store	Qdrant	Pinecone, pgvector, Weaviate	Self-hosted, fast, filtering, free
Embedding	text-embedding-3-small	Cohere, local model	Cheap (USD 0.02 per 1M tokens), high quality
Queue	BullMQ + Redis	RabbitMQ, SQS	Already in the stack, simple, built-in retries
Database	PostgreSQL	MongoDB	Metadata, history, users

**A common mistake during planning** is starting with technology choices instead of defining requirements. Qdrant may be "cooler" than pgvector, but if the project has 1,000 documents and no need for filtering - pgvector inside the existing PostgreSQL will be simpler and cheaper.

What should the design of an AI application begin with?

Implementation: API, LLM, RAG Pipeline, Tool Calling

The core of an AI application is the **Orchestrator**: the assembly point for everything covered in this course. A request arrives, the agent decides whether Qdrant RAG or a tool call is needed (web search, calculator, external API), builds context from conversation history and embedding results, calls the LLM - and returns a structured response with citations. This is exactly where the multi-agent patterns from lesson 19 meet Advanced RAG from lesson 13.
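The request flow described above can be sketched as a small loop. This is a minimal illustration, not the course's reference implementation: the `Deps` interface, the message shape, and the `maxSteps` limit are all hypothetical, and the retriever, LLM client, and tools are injected so the control flow stays visible.

```typescript
interface Chunk { text: string; source: string; score: number }
interface Message { role: "system" | "user" | "assistant" | "tool"; content: string }
interface ToolCall { name: string; args: Record<string, unknown> }
interface LlmReply { text?: string; toolCall?: ToolCall }

// Hypothetical dependency bundle: real code would wrap Qdrant, an LLM SDK, etc.
interface Deps {
  retrieve(query: string, userId: string): Promise<Chunk[]>;
  llm(messages: Message[]): Promise<LlmReply>;
  tools: Record<string, (args: Record<string, unknown>) => Promise<string>>;
}

interface Answer { answer: string; citations: string[] }

async function orchestrate(
  deps: Deps,
  userId: string,
  query: string,
  history: Message[],
  maxSteps = 3, // assumed cap so a misbehaving agent loop cannot run forever
): Promise<Answer> {
  // 1. RAG retrieval builds the numbered context the model is asked to cite.
  const chunks = await deps.retrieve(query, userId);
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n");
  const messages: Message[] = [
    { role: "system", content: `Answer using the numbered context and cite it.\n${context}` },
    ...history,
    { role: "user", content: query },
  ];

  // 2. Agent loop: call the LLM, run any requested tool, feed the result back.
  for (let step = 0; step < maxSteps; step++) {
    const reply = await deps.llm(messages);
    if (!reply.toolCall) {
      return { answer: reply.text ?? "", citations: chunks.map((c) => c.source) };
    }
    const tool = deps.tools[reply.toolCall.name];
    const result = tool
      ? await tool(reply.toolCall.args)
      : `Unknown tool: ${reply.toolCall.name}`;
    messages.push({ role: "tool", content: result });
  }
  return { answer: "Step limit reached without a final answer.", citations: [] };
}
```

Injecting the dependencies also makes the orchestrator testable with fakes, which matters once an eval pipeline runs on every prompt change.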

The **RAG pipeline** is where most projects lose quality silently. The wrong chunk size breaks semantic units, retrieval without a score threshold returns irrelevant chunks, and embedding chunks one request at a time can make ingestion up to 10x slower. A solid implementation combines recursive chunking, Qdrant filtering by userId, and batched `text-embedding-3-small` calls.
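Two of those failure modes, unbatched embeddings and retrieval without a score threshold, can be illustrated in a few lines. The sketch below is an assumption-laden illustration: `EmbedFn` stands in for whatever client calls `text-embedding-3-small`, and the `batchSize`, `minScore`, and `topK` defaults are hypothetical starting points rather than recommended values.

```typescript
// Split items into batches so one embeddings request carries many chunks.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error("batchSize must be >= 1");
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical embedding client: takes many texts, returns one vector per text.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Embed all texts with one request per batch instead of one request per text.
async function embedAll(texts: string[], embed: EmbedFn, batchSize = 64): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of toBatches(texts, batchSize)) {
    vectors.push(...(await embed(batch)));
  }
  return vectors;
}

interface Hit { text: string; score: number; userId: string }

// Keep only this user's hits above the score threshold, best first.
function selectContext(hits: Hit[], userId: string, minScore = 0.5, topK = 5): Hit[] {
  return hits
    .filter((h) => h.userId === userId && h.score >= minScore)
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

In a real deployment the userId filter and the score threshold would normally be applied by Qdrant itself as part of the search request; the in-memory version here only makes the logic explicit.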

**Chunking tip:** start with chunk_size=512 and overlap=64. If answers are inaccurate - reduce chunk_size to 256. If context is lost - increase overlap to 128. Always measure quality on real questions rather than tuning parameters blindly.
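To make the two knobs concrete, here is a minimal sketch of a recursive splitter with overlap. It is an illustration under stated assumptions: sizes are measured in characters (the tip may well mean tokens), the separator list is a common but hypothetical choice, and pieces are rejoined with single spaces, so original paragraph breaks are not preserved.

```typescript
// Separators from coarsest to finest: paragraphs, lines, sentences, words.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

// Recursively split text into pieces of at most maxLen characters,
// preferring the coarsest separator that gets the job done.
function splitRecursive(text: string, maxLen: number, level = 0): string[] {
  if (text.length <= maxLen) return text.length > 0 ? [text] : [];
  if (level >= SEPARATORS.length) {
    // No separator left: fall back to hard cuts.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) cuts.push(text.slice(i, i + maxLen));
    return cuts;
  }
  const pieces: string[] = [];
  for (const part of text.split(SEPARATORS[level])) {
    pieces.push(...splitRecursive(part, maxLen, level + 1));
  }
  return pieces;
}

function chunkText(text: string, chunkSize = 512, overlap = 64): string[] {
  // Reserve room for the overlap prefix plus one joining space.
  const budget = chunkSize - overlap - (overlap > 0 ? 1 : 0);
  if (overlap < 0 || budget < 1) throw new Error("overlap must be well below chunkSize");

  // Greedily merge small pieces back together up to the budget.
  const bodies: string[] = [];
  let current = "";
  for (const piece of splitRecursive(text, budget)) {
    const candidate = current ? `${current} ${piece}` : piece;
    if (candidate.length <= budget) {
      current = candidate;
    } else {
      bodies.push(current);
      current = piece;
    }
  }
  if (current) bodies.push(current);

  // Prefix every chunk after the first with the tail of the previous body.
  return bodies.map((body, i) =>
    i === 0 || overlap === 0
      ? body
      : `${bodies[i - 1].slice(-overlap).trimStart()} ${body}`,
  );
}
```

Because the overlap is carved out of the size budget, every returned chunk stays within `chunkSize`, which keeps the two parameters independent when tuning them against real questions.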

What is the role of the Orchestrator in an AI application?

Production Hardening: Caching, Rate Limiting, Monitoring

A working prototype is 20% of the journey. The remaining 80% is **production hardening**: reliability, cost control, observability. One unpredictable traffic spike without semantic cache and rate limiting can burn an entire day's budget in an hour - a real YC startup lost USD 8,000 overnight on GPT-4. Three components cover 90% of this risk: semantic cache on top of Qdrant, per-user rate limiting via Redis, and cost tracking with Prometheus.

**Semantic cache** - for repeated queries (saves 30-60% on LLM calls)
**Rate limiting** - per-user limits, different tiers (free/premium)
**Circuit breaker** - automatic failover to a fallback when the provider is down
**Request timeout** - 30s for standard requests, 60s for reasoning models
**Input validation** - prompt length limits, prompt injection filtering
**Cost tracking** - logging every call with cost calculation
**Alerting** - notifications when the daily budget is exceeded
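Two items from this checklist, per-user rate limiting with tiers and the circuit breaker, are sketched below. Everything concrete here is an assumption for illustration: the tier limits, the failure threshold, and the cooldown are hypothetical numbers, and the in-memory `Map` stands in for the Redis counters a multi-instance deployment would need.

```typescript
type Tier = "free" | "premium";

// Hypothetical requests-per-window limits; real values come from requirements.
const LIMITS: Record<Tier, number> = { free: 10, premium: 100 };

// Fixed-window limiter: each user gets LIMITS[tier] requests per window.
class RateLimiter {
  private readonly windowMs: number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  constructor(windowMs = 60_000) {
    this.windowMs = windowMs;
  }

  allow(userId: string, tier: Tier, now: number = Date.now()): boolean {
    const w = this.windows.get(userId);
    if (!w || now - w.start >= this.windowMs) {
      this.windows.set(userId, { start: now, count: 1 });
      return true;
    }
    if (w.count >= LIMITS[tier]) return false;
    w.count += 1;
    return true;
  }
}

// After `threshold` consecutive failures the breaker opens and routes straight
// to the fallback until `cooldownMs` has passed, then lets one trial call through.
class CircuitBreaker {
  private readonly threshold: number;
  private readonly cooldownMs: number;
  private failures = 0;
  private openedAt: number | null = null;

  constructor(threshold = 3, cooldownMs = 30_000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  async call<T>(
    primary: () => Promise<T>,
    fallback: () => Promise<T>,
    now: number = Date.now(),
  ): Promise<T> {
    const open = this.openedAt !== null && now - this.openedAt < this.cooldownMs;
    if (open) return fallback();
    try {
      const result = await primary();
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch {
      this.failures += 1;
      if (this.failures >= this.threshold) this.openedAt = now;
      return fallback();
    }
  }
}
```

Passing `now` explicitly keeps both classes deterministic under test; production callers would rely on the `Date.now()` default.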

**Semantic cache with a 0.95 threshold** can return the wrong answer when questions differ only subtly. "How do I delete a file?" and "How do I delete a directory?" can score ~0.96 cosine similarity, which clears the threshold and produces a cache hit, yet the correct answers differ. Always test the cache on edge cases.
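The threshold behaviour is easier to reason about with the lookup written out. The sketch below is a minimal in-memory illustration, assuming query embeddings are computed elsewhere; the linear scan stands in for a Qdrant similarity search, and the TTL is an added assumption to keep stale answers from being served indefinitely.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface CacheEntry { embedding: number[]; answer: string; storedAt: number }

class SemanticCache {
  private readonly threshold: number;
  private readonly ttlMs: number;
  private readonly entries: CacheEntry[] = [];

  // 0.95 mirrors the threshold discussed above; the 24h TTL is hypothetical.
  constructor(threshold = 0.95, ttlMs = 24 * 60 * 60 * 1000) {
    this.threshold = threshold;
    this.ttlMs = ttlMs;
  }

  set(embedding: number[], answer: string, now: number = Date.now()): void {
    this.entries.push({ embedding, answer, storedAt: now });
  }

  // Return the most similar fresh answer if it clears the threshold, else null.
  get(queryEmbedding: number[], now: number = Date.now()): string | null {
    let bestAnswer: string | null = null;
    let bestScore = -1;
    for (const entry of this.entries) {
      if (now - entry.storedAt > this.ttlMs) continue; // skip stale entries
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score > bestScore) {
        bestScore = score;
        bestAnswer = entry.answer;
      }
    }
    return bestScore >= this.threshold ? bestAnswer : null;
  }
}
```

A query whose embedding scores 0.96 against a cached entry is a hit at threshold 0.95 and a miss at 0.97, which is exactly the trade-off to probe with near-duplicate questions before settling on a value.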

Which component of a production AI system reduces LLM API costs for repeated queries?

Deployment: Docker, CI/CD, Secrets, Environments

AI applications are more complex to deploy than standard services: API keys for three providers, self-hosted Qdrant with persistent storage, embedding models with versioning, optionally GPUs for local inference. One mistake - `COPY . .` without a `.dockerignore` - bakes the `.env` file into an image layer, and once that image is pushed to Docker Hub, `OPENAI_API_KEY` is public in plaintext. Proper containerization and secrets management eliminate this entire class of problems.

**OPENAI_API_KEY** - rotate every 90 days, use different keys for dev/staging/prod
**ANTHROPIC_API_KEY** - fallback provider, store in the same secret manager
**DATABASE_URL** - never hardcode, not even in docker-compose
**Tools:** GitHub Secrets for CI/CD, Docker secrets or .env for runtime
**Leak monitoring:** use git-secrets or gitleaks in a pre-commit hook

**API keys in Docker images** are a common source of leaks. Never copy .env into a Docker image via `COPY . .` without a .dockerignore. Pass secrets through environment variables at container runtime, not at build time.
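On the application side, runtime injection pairs well with a fail-fast check at startup. The sketch below assumes the three variables named above are all required; the `Secrets` shape and the `redact` helper are hypothetical names introduced for illustration.

```typescript
interface Secrets {
  openaiApiKey: string;
  anthropicApiKey: string;
  databaseUrl: string;
}

// Read secrets from an environment-like object and fail fast if any is missing.
function loadSecrets(env: Record<string, string | undefined>): Secrets {
  const required = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "DATABASE_URL"];
  const missing = required.filter((name) => !env[name]);
  if (missing.length > 0) {
    // Report variable names only; secret values must never reach logs.
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    openaiApiKey: env["OPENAI_API_KEY"] as string,
    anthropicApiKey: env["ANTHROPIC_API_KEY"] as string,
    databaseUrl: env["DATABASE_URL"] as string,
  };
}

// Mask a secret for diagnostics, keeping at most the last four characters.
function redact(secret: string): string {
  return secret.length <= 4 ? "****" : `****${secret.slice(-4)}`;
}
```

At startup this would be called once as `loadSecrets(process.env)`, with the values supplied when the container starts (for example via `docker run --env-file` or the orchestrator's secret store), so nothing sensitive is present at build time.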

What is the correct way to pass LLM provider API keys to a Docker container?

Demo, Feedback, Iterations: The AI Product Lifecycle

Deployment is not the finish line - it is the baseline. AI applications degrade on their own: OpenAI updates models and prompt behavior shifts, users find edge cases that were never tested, semantic cache accumulates stale responses. Without an eval pipeline (LLM-as-judge on a real-question dataset), degradation only becomes visible once NPS has already dropped.

**Preparing a demo** is a skill in itself. AI applications are unpredictable at temperature > 0: the same question can yield different answers. The goal of a demo is to show the system's capabilities (RAG + tool calling + streaming + correct refusal under hallucination risk), not one perfect answer to one question. A demo should be prepared, but not faked:

Prepare 5-7 questions that showcase all capabilities (RAG, tools, history)
Test each question 3-5 times - make sure the responses are stable
Prepare a fallback scenario: what to show if the LLM API goes down during the demo
Show metrics: latency, cost per query, cache hit rate
Demonstrate an edge case: an out-of-context question → a correct refusal instead of a hallucination

Metric	Target	How to measure
Answer relevance	> 4.0/5.0	LLM-as-judge on eval dataset
Hallucination rate	< 5%	Verify claims against source documents
P95 latency	< 5 seconds	Application metrics (Prometheus)
Cost per query	< USD 0.05	Token usage tracking
Cache hit rate	> 30%	Semantic cache metrics
User satisfaction	> 80% thumbs up	Thumbs up/down on each answer
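Most rows of this table can be computed from per-query logs and checked automatically. The sketch below is one possible shape for such a gate: the `QueryRecord` fields and the nearest-rank percentile are assumptions, model prices are passed in as parameters rather than hardcoded, and user satisfaction is left out because it comes from thumbs up/down feedback rather than logs.

```typescript
interface QueryRecord {
  latencyMs: number;
  inputTokens: number;   // 0 when the answer came from cache
  outputTokens: number;
  cacheHit: boolean;
  relevance: number;     // 1-5 score from an LLM judge
  hallucinated: boolean; // true if a claim failed source verification
}

// USD per 1M tokens; supply the current prices for the model in use.
interface Prices { inputPerMTok: number; outputPerMTok: number }

interface Summary {
  relevance: number;
  hallucinationRate: number;
  p95LatencyMs: number;
  costPerQueryUsd: number;
  cacheHitRate: number;
}

// Nearest-rank percentile over a non-empty list.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

function summarize(records: QueryRecord[], prices: Prices): Summary {
  const n = records.length;
  if (n === 0) throw new Error("no records to summarize");
  const totalCost = records.reduce(
    (sum, r) =>
      sum + (r.inputTokens * prices.inputPerMTok + r.outputTokens * prices.outputPerMTok) / 1e6,
    0,
  );
  return {
    relevance: records.reduce((sum, r) => sum + r.relevance, 0) / n,
    hallucinationRate: records.filter((r) => r.hallucinated).length / n,
    p95LatencyMs: percentile(records.map((r) => r.latencyMs), 95),
    costPerQueryUsd: totalCost / n,
    cacheHitRate: records.filter((r) => r.cacheHit).length / n,
  };
}

// Compare a summary against the targets in the table; return the misses.
function failedTargets(s: Summary): string[] {
  const failures: string[] = [];
  if (!(s.relevance > 4.0)) failures.push("answer relevance");
  if (!(s.hallucinationRate < 0.05)) failures.push("hallucination rate");
  if (!(s.p95LatencyMs < 5000)) failures.push("p95 latency");
  if (!(s.costPerQueryUsd < 0.05)) failures.push("cost per query");
  if (!(s.cacheHitRate > 0.3)) failures.push("cache hit rate");
  return failures;
}
```

Run after each eval pass, an empty result from `failedTargets` means every logged target was met; a non-empty one names what to work on in the next iteration.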

**Iteration cycle:** Demo → Feedback → Analyze metrics → Improve prompts / chunking / model routing → Eval pipeline → Next Demo. Each iteration takes 1-2 weeks. After 3-4 iterations, quality stabilizes.

**The capstone project's guiding principle:** do less, but do it better. An excellent RAG pipeline with 3 sources beats a mediocre system with 10 features. Depth over breadth - that's what separates a senior AI engineer from a junior one.

What approach to AI application quality evaluation is most scalable?

Summary

Start with requirements (latency, cost, scale) - they dictate architecture; the choice between Qdrant and pgvector follows from there, not from hype
Build the Orchestrator first: Qdrant RAG retrieval + conversation history + tool calling assembled into a single pipeline before the first LLM call
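Add semantic cache on top of Qdrant - cuts LLM spending by 30-60% on repeated queries, provided the similarity threshold is tested on edge cases so that near-duplicate questions do not get the wrong cached answer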
Wire up per-user rate limiting, circuit breaker, and cost tracking from day one - the first traffic spike will otherwise burn through budget
Set up Docker multi-stage + CI/CD + secrets via env - zero API keys in image layers
Run an eval pipeline (LLM-as-judge) on every prompt or model change - that is CI/CD for AI

What's Next

The capstone project brought together all the AI Engineering course skills. The following lessons cover cutting-edge directions that will shape the future of AI engineering.

Reasoning Models — o1/o3 - the next quality leap, changing the architecture of AI applications
World Models — From text to understanding the physical world - the next horizon for AI
The Path to AGI — Scaling laws, emergent abilities, and what they mean for developers

Related Lessons

aie-42-ai-system-design — Capstone applies the full AI system design
aie-13-advanced-rag — The project builds an advanced RAG pipeline
aie-19-multi-agent — Agents and orchestration power the capstone
aie-35-observability — Production hardening needs monitoring and tracing
sd-22-observability — Deployment monitoring mirrors system observability
net-37-load-balancing — Scaling the deploy applies load balancing
sd-10-microservices