Capstone: Designing and Building a Production AI Application from Scratch
Lesson Objectives
- Design an AI application architecture from requirements to stack selection
- Implement the full pipeline: API → RAG → LLM → Tool Calling → Response
- Apply production hardening: caching, rate limiting, monitoring, error handling
- Set up containerization (Docker), CI/CD, and secrets management
- Organize a demo → feedback → evaluation → iteration cycle
An engineer walks out of a YC startup interview - hired. Forty candidates competed for the role. The deciding factor was a single GitHub repo: a Knowledge Assistant with Qdrant, streaming responses, an agent loop, and semantic cache. Not a degree. Not LeetCode scores. The entire AI Engineering stack - RAG, embeddings, tool calling, vector DB, cost monitoring - assembled into one working product. That is exactly what you build here.
- Stripe Docs AI: RAG over 500+ pages of documentation with tool calling for code generation - architecturally identical to the capstone project
- Cursor IDE: agentic pipeline with codebase RAG, embeddings via text-embedding-3-small, streaming over SSE - same patterns, larger scale
- Perplexity AI: search + RAG + citations - under the hood: vector store, re-ranking, LLM-as-judge for relevance scoring; started at USD 50 per day in API costs
- Reportedly, 70% of YC W24/S24 startups use the same RAG + multi-agent architecture taught in this course - the only difference is data volume and number of tools
AI Engineering becomes a discipline
Before ChatGPT (November 2022), building with AI mostly meant training models - the work of ML researchers. The wave of capable hosted models that followed split off a separate craft: composing existing models into reliable products through prompting, RAG, tool calling, evaluation, and cost control. By 2023-2024 the title "AI Engineer" had spread across job boards, and patterns like vector search, agent loops, and guardrails settled into a shared toolkit. There was no single founding event, just steady maturation. A capstone is where that scattered toolkit becomes one working system.
Prerequisites
Planning: Requirements, Architectural Decisions, Stack Selection
Every production AI project starts not with code and not with picking an LLM provider, but with **clearly defining the problem**. Latency requirements dictate streaming. Cost requirements dictate model routing and semantic cache. Scale requirements dictate queues and horizontal scaling. Without these answers, even a perfectly written RAG pipeline will be optimizing the wrong thing.
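One way to make this concrete is to write the requirements down as data and derive the architectural switches from them. The sketch below is illustrative only: the `Requirements` fields, the `deriveArchitecture` function, and every threshold in it are assumptions made for the example, not values prescribed by the capstone.

```typescript
// Hypothetical requirement fields; adapt them to the actual project brief.
interface Requirements {
  maxFirstTokenLatencyMs: number; // how quickly the user must see the first token
  monthlyBudgetUsd: number;       // ceiling on LLM spend
  expectedQueriesPerDay: number;  // rough traffic estimate
}

interface ArchitectureDecisions {
  streaming: boolean;
  semanticCache: boolean;
  modelRouting: boolean;
  queueAndHorizontalScaling: boolean;
}

function deriveArchitecture(req: Requirements): ArchitectureDecisions {
  // Budget available per query, assuming a 30-day month.
  const budgetPerQueryUsd = req.monthlyBudgetUsd / (req.expectedQueriesPerDay * 30);
  // Assumed threshold: below USD 0.05 per query, cost controls become mandatory.
  const costConstrained = budgetPerQueryUsd < 0.05;

  return {
    // Assumed threshold: a first-token target under 2 s implies streaming.
    streaming: req.maxFirstTokenLatencyMs < 2000,
    semanticCache: costConstrained,
    modelRouting: costConstrained,
    // Assumed threshold: above 10,000 queries per day, add a queue and scale out.
    queueAndHorizontalScaling: req.expectedQueriesPerDay > 10000,
  };
}
```

The point of the exercise is the direction of the dependency: the decisions are outputs of the requirements, so changing a requirement visibly changes the architecture.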
The capstone project is an **AI Knowledge Assistant**: documentation is ingested, chunked, and indexed in Qdrant using `text-embedding-3-small`. On each user query, RAG retrieval runs, the results feed an agent with tool calling (web search, calculator, external API), and the answer streams over SSE. This is not a toy example - it is the Stripe Docs AI and Cursor architecture in miniature.
Stack selection must be driven by requirements, not hype or what everyone else is using. Every component needs to be justified by a concrete need:
| Component | Choice | Alternatives | Rationale |
|---|---|---|---|
| Runtime | Node.js + NestJS | Python FastAPI, Go | Ecosystem, type safety, familiar stack |
| LLM Provider | OpenAI GPT-4o + Anthropic fallback | Ollama (self-hosted) | Price/quality balance, abstraction layer for switching providers |
| Vector Store | Qdrant | Pinecone, pgvector, Weaviate | Self-hosted, fast, filtering, free |
| Embedding | text-embedding-3-small | Cohere, local model | Cheap (USD 0.02 per 1M tokens), high quality |
| Queue | BullMQ + Redis | RabbitMQ, SQS | Already in the stack, simple, built-in retries |
| Database | PostgreSQL | MongoDB | Metadata, history, users |
**A common mistake during planning** is starting with technology choices instead of defining requirements. Qdrant may be "cooler" than pgvector, but if the project has 1,000 documents and no need for filtering - pgvector inside the existing PostgreSQL will be simpler and cheaper.
What should the design of an AI application begin with?
Implementation: API, LLM, RAG Pipeline, Tool Calling
The core of an AI application is the **Orchestrator**: the assembly point for everything covered in this course. A request arrives, the agent decides whether Qdrant RAG or a tool call is needed (web search, calculator, external API), builds context from conversation history and embedding results, calls the LLM - and returns a structured response with citations. This is exactly where the multi-agent patterns from lesson 19 meet Advanced RAG from lesson 13.
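The flow described above can be sketched as a small agent loop. This is a minimal sketch under stated assumptions, not the capstone's actual code: `Message`, `LlmReply`, `Deps`, and `orchestrate` are hypothetical names, the retriever and LLM are injected as plain functions instead of real Qdrant or provider SDK calls, and for simplicity retrieval always runs once up front rather than being chosen by the agent.

```typescript
// Hypothetical shapes; a real provider SDK defines its own message and tool-call types.
type Message = { role: "system" | "user" | "assistant" | "tool"; content: string };

type LlmReply =
  | { kind: "tool_call"; tool: string; input: string }
  | { kind: "answer"; text: string };

interface Deps {
  retrieve: (query: string) => Promise<{ text: string; source: string }[]>; // stand-in for Qdrant search
  llm: (messages: Message[]) => Promise<LlmReply>;                          // stand-in for the provider call
  tools: Map<string, (input: string) => Promise<string>>;                   // web search, calculator, ...
}

interface OrchestratorResult {
  answer: string;
  citations: string[];
  toolCalls: string[];
}

async function orchestrate(
  query: string,
  history: Message[],
  deps: Deps,
  maxSteps = 4, // assumed cap so the loop cannot run forever
): Promise<OrchestratorResult> {
  // 1. Retrieve context and number the chunks so the model can cite them.
  const chunks = await deps.retrieve(query);
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n");

  // 2. Build the prompt from retrieved context, conversation history, and the query.
  const messages: Message[] = [
    { role: "system", content: `Answer using the context and cite sources.\n${context}` },
    ...history,
    { role: "user", content: query },
  ];
  const toolCalls: string[] = [];

  // 3. Agent loop: call the LLM, run any requested tool, feed the result back.
  for (let step = 0; step < maxSteps; step++) {
    const reply = await deps.llm(messages);
    if (reply.kind === "answer") {
      return { answer: reply.text, citations: chunks.map((c) => c.source), toolCalls };
    }
    const tool = deps.tools.get(reply.tool);
    const output = tool ? await tool(reply.input) : `Unknown tool: ${reply.tool}`;
    toolCalls.push(reply.tool);
    messages.push({ role: "tool", content: `${reply.tool} -> ${output}` });
  }
  return { answer: "Step limit reached without a final answer.", citations: [], toolCalls };
}
```

Injecting `retrieve`, `llm`, and `tools` keeps the loop testable with stubs and makes swapping the provider a change at the call site rather than inside the orchestrator.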
The **RAG pipeline** is where most projects lose quality silently. The wrong chunk size breaks semantic units, retrieval without a score threshold returns irrelevant chunks, and embedding documents one at a time instead of in batches makes ingestion far slower. A robust implementation combines recursive chunking, Qdrant filtering by `userId`, and batched `text-embedding-3-small` calls.
**Chunking tip:** start with chunk_size=512 and overlap=64. If answers are inaccurate - reduce chunk_size to 256. If context is lost - increase overlap to 128. Always measure quality on real questions rather than tuning parameters blindly.
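The chunking and batching ideas above can be sketched as follows. Treat it as a starting point under explicit assumptions: sizes are measured in characters rather than tokens, the separator list and the function names (`splitRecursive`, `chunkText`, `toBatches`, `buildSearchBody`) are hypothetical, the `0.5` score threshold is a placeholder, and the search body is modelled on Qdrant's REST search request with a `userId` payload field assumed to exist. Verify the field names against the Qdrant version in use.

```typescript
// Assumed separator order: paragraphs, lines, sentences, words.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

// Recursively split text until every piece is at most maxLen characters.
function splitRecursive(text: string, maxLen: number, level = 0): string[] {
  if (text.length === 0) return [];
  if (text.length <= maxLen) return [text];
  if (level >= SEPARATORS.length) {
    // No separator left: hard-cut into fixed-size windows.
    const windows: string[] = [];
    for (let i = 0; i < text.length; i += maxLen) windows.push(text.slice(i, i + maxLen));
    return windows;
  }
  const sep = SEPARATORS[level];
  const parts = text.split(sep);
  const pieces: string[] = [];
  parts.forEach((part, i) => {
    // Re-attach the separator so joining the pieces reproduces the text.
    const withSep = i < parts.length - 1 ? part + sep : part;
    pieces.push(...splitRecursive(withSep, maxLen, level + 1));
  });
  return pieces;
}

// Greedily merge pieces into chunks of at most chunkSize characters,
// carrying the last `overlap` characters of each chunk into the next one.
function chunkText(text: string, chunkSize = 512, overlap = 64): string[] {
  if (overlap < 0 || overlap >= chunkSize) throw new Error("overlap must be in [0, chunkSize)");
  const pieces = splitRecursive(text, chunkSize - overlap);
  const chunks: string[] = [];
  let current = "";
  for (const piece of pieces) {
    if (current.length > 0 && current.length + piece.length > chunkSize) {
      chunks.push(current);
      current = overlap > 0 ? current.slice(-overlap) : "";
    }
    current += piece;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

// Group chunks so each embedding request carries many inputs instead of one.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) batches.push(items.slice(i, i + batchSize));
  return batches;
}

// Search request body with a per-user filter and a score threshold.
function buildSearchBody(vector: number[], userId: string, scoreThreshold = 0.5, limit = 5) {
  return {
    vector,
    limit,
    score_threshold: scoreThreshold,
    with_payload: true,
    filter: { must: [{ key: "userId", match: { value: userId } }] },
  };
}
```

Splitting to `chunkSize - overlap` before merging is a deliberate choice: it guarantees that a chunk never exceeds `chunkSize` even after the overlap is prepended.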
What is the role of the Orchestrator in an AI application?
Production Hardening: Caching, Rate Limiting, Monitoring
A working prototype is 20% of the journey. The remaining 80% is **production hardening**: reliability, cost control, observability. One unpredictable traffic spike without semantic cache and rate limiting can burn an entire day's budget in an hour - a real YC startup lost USD 8,000 overnight on GPT-4. Three components cover 90% of this risk: semantic cache on top of Qdrant, per-user rate limiting via Redis, and cost tracking with Prometheus.
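Per-user rate limiting is the simplest of the three to illustrate. The sketch below uses a fixed-window counter held in an in-memory `Map` as a stand-in for Redis (where the same idea is typically expressed with an incrementing key that expires). The class name, the tier limits, and the one-minute window are assumptions chosen for the example.

```typescript
type Tier = "free" | "premium";

// Hypothetical limits; real values depend on budget and pricing.
const LIMITS_PER_WINDOW: Record<Tier, number> = { free: 10, premium: 100 };

class FixedWindowRateLimiter {
  private readonly counters = new Map<string, { windowStart: number; count: number }>();
  private readonly windowMs: number;
  private readonly now: () => number;

  // The clock is injectable so the limiter can be tested without waiting.
  constructor(windowMs = 60000, now: () => number = () => Date.now()) {
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true if the request is allowed, false if the user is over the limit.
  allow(userId: string, tier: Tier): boolean {
    const t = this.now();
    const entry = this.counters.get(userId);
    if (!entry || t - entry.windowStart >= this.windowMs) {
      // First request, or the previous window has expired: start a new window.
      this.counters.set(userId, { windowStart: t, count: 1 });
      return true;
    }
    if (entry.count >= LIMITS_PER_WINDOW[tier]) return false;
    entry.count++;
    return true;
  }
}
```

A fixed window is easy to reason about but allows short bursts at window boundaries; a sliding window or token bucket smooths that out at the cost of more state.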
- **Semantic cache** - for repeated queries (saves 30-60% on LLM calls)
- **Rate limiting** - per-user limits, different tiers (free/premium)
- **Circuit breaker** - automatic failover to a fallback when the provider is down
- **Request timeout** - 30s for standard requests, 60s for reasoning models
- **Input validation** - prompt length limits, prompt injection filtering
- **Cost tracking** - logging every call with cost calculation
- **Alerting** - notifications when the daily budget is exceeded
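The circuit breaker and request timeout items from the checklist above can be sketched together. This is a minimal illustration, not a production library: `CircuitBreaker`, `withTimeout`, the failure threshold of 3, and the 30-second cooldown are all assumptions, and the primary and fallback providers are passed in as plain functions.

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;
  private readonly now: () => number;

  constructor(failureThreshold = 3, cooldownMs = 30000, now: () => number = () => Date.now()) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.now = now;
  }

  // Open means "skip the primary". After the cooldown one trial call is let through.
  isOpen(): boolean {
    if (this.openedAt === null) return false;
    return this.now() - this.openedAt < this.cooldownMs;
  }

  async call<T>(primary: () => Promise<T>, fallback: () => Promise<T>): Promise<T> {
    if (this.isOpen()) return fallback();
    try {
      const result = await primary();
      // Success closes the circuit and resets the failure count.
      this.failures = 0;
      this.openedAt = null;
      return result;
    } catch {
      this.failures++;
      if (this.failures >= this.failureThreshold) this.openedAt = this.now();
      return fallback();
    }
  }
}

// Reject if the wrapped promise does not settle within `ms` milliseconds.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`Timed out after ${ms} ms`)), ms);
    promise.then(
      (value) => { clearTimeout(timer); resolve(value); },
      (error) => { clearTimeout(timer); reject(error); },
    );
  });
}
```

Wrapping the primary call as `() => withTimeout(callPrimaryProvider(), 30000)` (with `callPrimaryProvider` being whatever function issues the request) makes a hung provider count as a failure, so slow outages trip the breaker as well as hard errors.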
**Semantic cache with a 0.95 threshold** can return the wrong answer for subtle differences in questions. "How do I delete a file?" and "How do I delete a directory?" have ~0.96 cosine similarity, but the answers are different. Always test the cache on edge cases.
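The threshold pitfall is easy to reproduce with a toy cache. The sketch below keeps entries in memory and scans them linearly, whereas the capstone stores them in Qdrant; `SemanticCache` and `cosineSimilarity` are hypothetical names, and the two-dimensional vectors in the usage note are made up to produce a similarity of about 0.96, not real embeddings.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat zero vectors as dissimilar to everything.
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

class SemanticCache {
  private readonly entries: { embedding: number[]; answer: string }[] = [];
  private readonly threshold: number;

  constructor(threshold = 0.95) {
    this.threshold = threshold;
  }

  set(embedding: number[], answer: string): void {
    this.entries.push({ embedding, answer });
  }

  // Return the most similar cached answer at or above the threshold, else null.
  get(queryEmbedding: number[]): string | null {
    let bestScore = -1;
    let bestAnswer: string | null = null;
    for (const entry of this.entries) {
      const score = cosineSimilarity(queryEmbedding, entry.embedding);
      if (score >= this.threshold && score > bestScore) {
        bestScore = score;
        bestAnswer = entry.answer;
      }
    }
    return bestAnswer;
  }
}
```

With a cached entry at `[1, 0]` and a query at `[0.96, 0.28]`, a 0.95 threshold returns the cached answer even though the question differs, while a 0.98 threshold misses and lets the request through to the LLM. An edge-case test suite of near-duplicate questions with different answers is a reasonable way to pick the threshold.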
Which component of a production AI system reduces LLM API costs for repeated queries?
Deployment: Docker, CI/CD, Secrets, Environments
AI applications are more complex to deploy than standard services: API keys for three providers, self-hosted Qdrant with persistent storage, embedding models with versioning, optionally GPUs for local inference. One mistake - `COPY . .` without `.dockerignore` - and `OPENAI_API_KEY` lands in Docker Hub in plaintext. Proper containerization and secrets management eliminate this entire class of problems permanently.
- **OPENAI_API_KEY** - rotate every 90 days, use different keys for dev/staging/prod
- **ANTHROPIC_API_KEY** - fallback provider, store in the same secret manager
- **DATABASE_URL** - never hardcode, not even in docker-compose
- **Tools:** GitHub Secrets for CI/CD, Docker secrets or .env for runtime
- **Leak monitoring:** use git-secrets or gitleaks in a pre-commit hook
**API keys in Docker images** are a common leak. Never copy .env into a Docker image via `COPY . .` without a .dockerignore. Secrets are passed through environment variables at container runtime, not at build time.
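On the application side, runtime injection pairs well with a fail-fast check at startup. The sketch below is one possible shape for it: `Secrets`, `loadSecrets`, and `redact` are hypothetical names, and the environment is passed in as a plain object so the function can be tested without touching the real process environment.

```typescript
interface Secrets {
  openaiApiKey: string;
  anthropicApiKey: string;
  databaseUrl: string;
}

const REQUIRED_VARS = ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "DATABASE_URL"];

// Read secrets from the runtime environment and fail fast if any are missing.
function loadSecrets(env: Record<string, string | undefined>): Secrets {
  const missing = REQUIRED_VARS.filter((name) => !env[name]);
  if (missing.length > 0) {
    // Report variable names only; secret values must never reach logs or errors.
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
  return {
    openaiApiKey: env["OPENAI_API_KEY"] as string,
    anthropicApiKey: env["ANTHROPIC_API_KEY"] as string,
    databaseUrl: env["DATABASE_URL"] as string,
  };
}

// Mask a secret for diagnostics, keeping only the last four characters.
function redact(secret: string): string {
  return secret.length <= 4 ? "****" : "*".repeat(secret.length - 4) + secret.slice(-4);
}
```

In a Node.js service this would typically be called once at startup as `loadSecrets(process.env)`, with the variables supplied when the container starts (for example via `docker run --env-file`) rather than baked into an image layer.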
What is the correct way to pass LLM provider API keys to a Docker container?
Demo, Feedback, Iterations: The AI Product Lifecycle
Deployment is not the finish line - it is the baseline. AI applications degrade on their own: OpenAI updates models and prompt behavior shifts, users find edge cases that were never tested, semantic cache accumulates stale responses. Without an eval pipeline (LLM-as-judge on a real-question dataset), degradation only becomes visible once NPS has already dropped.
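An eval pipeline of this kind can start very small. The sketch below runs a dataset through the assistant and asks a judge to score each answer from 1 to 5; `EvalCase`, `Judge`, `Assistant`, and `runEval` are hypothetical names, and both the assistant and the judge are injected functions, so the LLM-as-judge prompt itself is out of scope here. The 4.0 pass mark is an assumed default.

```typescript
interface EvalCase {
  question: string;
  expected: string; // reference answer written by a human
}

// The judge returns a relevance score from 1 to 5; in practice it wraps an LLM call.
type Judge = (question: string, expected: string, actual: string) => Promise<number>;
type Assistant = (question: string) => Promise<string>;

interface EvalReport {
  scores: number[];
  average: number;
  passed: boolean;
}

async function runEval(
  cases: EvalCase[],
  assistant: Assistant,
  judge: Judge,
  minAverage = 4.0, // assumed pass mark
): Promise<EvalReport> {
  const scores: number[] = [];
  // Run sequentially to keep the sketch simple and avoid provider rate limits.
  for (const c of cases) {
    const actual = await assistant(c.question);
    scores.push(await judge(c.question, c.expected, actual));
  }
  const average = scores.length === 0 ? 0 : scores.reduce((sum, s) => sum + s, 0) / scores.length;
  return { scores, average, passed: average > minAverage };
}
```

Running `runEval` on every prompt or model change, and failing the build when `passed` is false, turns silent degradation into a visible regression.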
**Preparing a demo** is a skill in itself. AI applications are unpredictable at temperature > 0: the same question can yield different answers. The goal of a demo is to show the system's capabilities (RAG + tool calling + streaming + correct refusal under hallucination risk), not one perfect answer to one question. A demo should be prepared, but not faked:
- Prepare 5-7 questions that showcase all capabilities (RAG, tools, history)
- Test each question 3-5 times - make sure the responses are stable
- Prepare a fallback scenario: what to show if the LLM API goes down during the demo
- Show metrics: latency, cost per query, cache hit rate
- Demonstrate an edge case: an out-of-context question → a correct refusal instead of a hallucination
| Metric | Target | How to measure |
|---|---|---|
| Answer relevance | > 4.0/5.0 | LLM-as-judge on eval dataset |
| Hallucination rate | < 5% | Verify claims against source documents |
| P95 latency | < 5 seconds | Application metrics (Prometheus) |
| Cost per query | < USD 0.05 | Token usage tracking |
| Cache hit rate | > 30% | Semantic cache metrics |
| User satisfaction | > 80% thumbs up | Thumbs up/down on each answer |
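Several of the targets in the table can be checked mechanically from request logs. The sketch below is a hedged example: the `QueryLog` shape, the `evaluateMetrics` function, and the nearest-rank percentile method are assumptions, and the `hallucinated` flag is presumed to come from a separate claim-verification step.

```typescript
interface QueryLog {
  latencyMs: number;
  costUsd: number;
  cacheHit: boolean;
  hallucinated: boolean; // assumed output of a separate verification step
}

// Nearest-rank percentile: the smallest value with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(0, rank - 1)];
}

function mean(values: number[]): number {
  return values.length === 0 ? 0 : values.reduce((sum, v) => sum + v, 0) / values.length;
}

interface MetricReport {
  p95LatencyMs: number;
  costPerQueryUsd: number;
  cacheHitRate: number;
  hallucinationRate: number;
  failing: string[]; // names of metrics that miss their target
}

function evaluateMetrics(logs: QueryLog[]): MetricReport {
  const p95LatencyMs = percentile(logs.map((l) => l.latencyMs), 95);
  const costPerQueryUsd = mean(logs.map((l) => l.costUsd));
  const cacheHitRate = mean(logs.map((l) => (l.cacheHit ? 1 : 0)));
  const hallucinationRate = mean(logs.map((l) => (l.hallucinated ? 1 : 0)));

  // Targets taken from the table above.
  const failing: string[] = [];
  if (!(p95LatencyMs < 5000)) failing.push("P95 latency");
  if (!(costPerQueryUsd < 0.05)) failing.push("Cost per query");
  if (!(cacheHitRate > 0.3)) failing.push("Cache hit rate");
  if (!(hallucinationRate < 0.05)) failing.push("Hallucination rate");

  return { p95LatencyMs, costPerQueryUsd, cacheHitRate, hallucinationRate, failing };
}
```

Answer relevance and user satisfaction are left out because they depend on the judge and on user feedback rather than on request logs.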
**Iteration cycle:** Demo → Feedback → Analyze metrics → Improve prompts / chunking / model routing → Eval pipeline → Next Demo. Each iteration takes 1-2 weeks. After 3-4 iterations, quality stabilizes.
**The capstone project's guiding principle:** do less, but do it better. An excellent RAG pipeline with 3 sources beats a mediocre system with 10 features. Depth over breadth - that's what separates a senior AI engineer from a junior one.
What approach to AI application quality evaluation is most scalable?
Summary
- Start with requirements (latency, cost, scale) - they dictate architecture; the choice between Qdrant and pgvector follows from there, not from hype
- Build the Orchestrator first: Qdrant RAG retrieval + conversation history + tool calling assembled into a single pipeline before the first LLM call
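- Add semantic cache on top of Qdrant - it can cut LLM spending by 30-60% on repeated queries, provided the similarity threshold is tested on edge cases so near-duplicate questions do not get the wrong answer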
- Wire up per-user rate limiting, circuit breaker, and cost tracking from day one - the first traffic spike will otherwise burn through budget
- Set up Docker multi-stage + CI/CD + secrets via env - zero API keys in image layers
- Run an eval pipeline (LLM-as-judge) on every prompt or model change - that is CI/CD for AI
What's Next
The capstone project brought together all the AI Engineering course skills. The following lessons cover cutting-edge directions that will shape the future of AI engineering.
- Reasoning Models — o1/o3 - the next quality leap, changing the architecture of AI applications
- World Models — From text to understanding the physical world - the next horizon for AI
- The Path to AGI — Scaling laws, emergent abilities, and what they mean for developers
Related Lessons
- aie-42-ai-system-design — Capstone applies the full AI system design
- aie-13-advanced-rag — The project builds an advanced RAG pipeline
- aie-19-multi-agent — Agents and orchestration power the capstone
- aie-35-observability — Production hardening needs monitoring and tracing
- sd-22-observability — Deployment monitoring mirrors system observability
- net-37-load-balancing — Scaling the deploy applies load balancing
- sd-10-microservices