AI Engineering
Production Prompt Patterns: system/user/assistant, Few-Shot, Chain-of-Thought
Lesson Objectives
- Write structured system prompts - with sections, rules, and format
- Use few-shot examples for stable output
- Apply Chain-of-Thought for tasks requiring logic
- Choose an output formatting strategy: JSON mode, Zod, XML tags
- Build prompt templates for production - reusable and testable
Prerequisites
- LLM API Integration
Prompt engineering is not an art. It's an interface to probabilistic computation - and it has strict rules: structure the model has seen in training data, examples that shift the output distribution, phrases that steer generation toward step-by-step reasoning. The quality gap between a naive and an engineered prompt can reach 40 percentage points on reasoning tasks. Same model. Same money. Different result.
- Notion AI - 50+ prompt templates for different tasks (summary, translate, brainstorm), all A/B tested like code
- Cursor - chain-of-thought in prompts improved code autocomplete accuracy by 30% without touching the model
- Stripe - few-shot examples for ticket classification hit 95% accuracy without fine-tuning (Brown et al. 2020 in the wild)
- GitHub Copilot - a structured system prompt of 2000+ tokens sets repository context; not a "hint" - a specification
Five Words That Changed ML
2022. Jason Wei (Google Brain) publishes "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" - show the model a few worked examples with reasoning written out, and accuracy on math problems climbs sharply. Months later Kojima et al. 2022 found the trick needs no examples at all: just append "Let's think step by step" and accuracy on MultiArith jumps from about 18% to 79%. The model didn't change. Only the prompt. Until then, the assumption was: better results require a bigger model or fine-tuning. Chain-of-Thought revealed a third path: **ask correctly**. And earlier, Brown et al. 2020 (GPT-3) found few-shot learning - the model learns from in-context examples with zero gradient steps. Both discoveries underlie every production prompt written today.
System Prompt: Architecture, Not a Hint
One developer writes a one-line system prompt: "You are a helpful assistant." Another structures it as a **specification** - with sections, constraints, and format. The second approach gets better results - not because the model is different, but because a well-structured prompt aligns with what the model was trained on.
Why does this work? The model was trained on billions of documents - and most of them are structured: READMEs, specs, API docs, markdown files. When a prompt looks like a structured document, it lands in the statistical "zone of the familiar" - and instruction-following improves sharply. Not magic. Statistics.
A structured system prompt typically has these sections:
- **Role** - who the model is, what context it operates in
- **Rules** - what it can and can't do (constraints)
- **Tone** - communication style
- **Format** - what the response should look like
- **Examples** (optional) - 1-2 examples of an ideal response
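The sections above can be assembled into a document-like prompt. The following is a minimal sketch, not a prescribed implementation: the `SystemPromptSpec` interface, the `renderSystemPrompt` function, and the store policies in the example are all hypothetical names and content chosen for illustration.

```typescript
// Hypothetical container; field names mirror the sections listed above.
interface SystemPromptSpec {
  role: string;
  rules: string[];
  tone: string;
  format: string;
  examples?: string[]; // optional, 1-2 ideal responses
}

// Render the spec as a markdown-style document with "## Section" headings.
function renderSystemPrompt(spec: SystemPromptSpec): string {
  const sections: string[] = [
    `## Role\n${spec.role}`,
    `## Rules\n${spec.rules.map((rule) => `- ${rule}`).join("\n")}`,
    `## Tone\n${spec.tone}`,
    `## Format\n${spec.format}`,
  ];
  if (spec.examples && spec.examples.length > 0) {
    sections.push(`## Examples\n${spec.examples.join("\n\n")}`);
  }
  return sections.join("\n\n");
}

// Illustrative content only; the store and its policies are made up.
const supportPrompt = renderSystemPrompt({
  role: "You are a support assistant for an online electronics store.",
  rules: [
    "Answer only questions about orders, delivery, and returns.",
    "If the answer is unknown, say so and offer to hand off to a human.",
  ],
  tone: "Friendly and concise.",
  format: "Plain text, at most three short paragraphs.",
});

console.log(supportPrompt);
```

Keeping the sections as data rather than one long string makes it possible to test each part separately and to drop the optional Examples section when it is empty.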
**A system prompt does NOT guarantee behavior.** Users can "convince" the model to break the rules (prompt injection). Don't rely on the system prompt as a security boundary - validate output on the backend.
Why does a structured system prompt (with sections like ## Role, ## Rules) work better than a single line?
Few-Shot: Teaching the Model by Example
Brown et al. 2020 - the GPT-3 paper - discovered a phenomenon they called **few-shot learning**: show the model 2-5 "input → output" examples directly in the prompt and it immediately grasps the pattern. No training. No gradient update. Just context. It rewrote what "learning" means for an LLM.
**Rule of thumb: 3 examples is the sweet spot.** With 1 example the model might not catch the pattern. With 5+ examples, tokens get burned, often without meaningful improvement. Three examples can cover a positive case, a negative case, and an edge case.
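One common way to pass few-shot examples to a chat API is as alternating user/assistant turns placed before the real input. The sketch below assumes that message layout; `ChatMessage`, `FewShotExample`, and `buildFewShotMessages` are illustrative names, and the sentiment task is invented for the example.

```typescript
type ChatRole = "system" | "user" | "assistant";

interface ChatMessage {
  role: ChatRole;
  content: string;
}

interface FewShotExample {
  input: string;
  output: string;
}

// Each example becomes a user turn followed by the assistant turn the
// model is expected to imitate; the real input goes last.
function buildFewShotMessages(
  system: string,
  examples: FewShotExample[],
  userInput: string
): ChatMessage[] {
  const messages: ChatMessage[] = [{ role: "system", content: system }];
  for (const example of examples) {
    messages.push({ role: "user", content: example.input });
    messages.push({ role: "assistant", content: example.output });
  }
  messages.push({ role: "user", content: userInput });
  return messages;
}

// Three made-up examples: positive, negative, and an edge case.
const sentimentExamples: FewShotExample[] = [
  { input: "The delivery was fast, thanks!", output: "positive" },
  { input: "The package arrived broken.", output: "negative" },
  { input: "Is the store open on Sunday?", output: "neutral" }, // edge case: no sentiment
];

const fewShotMessages = buildFewShotMessages(
  "Classify the sentiment of the message as positive, negative, or neutral. Reply with one word.",
  sentimentExamples,
  "I was charged twice for one order."
);

console.log(fewShotMessages.length); // 1 system + 3 pairs + 1 query = 8
```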
**When few-shot is critical:**
- Non-standard output format (specific JSON schema, CSV, XML)
- Classification with custom categories (not general purpose)
- Stylistic tasks - the model should mimic the style of examples
- Extraction from unstructured text into a specific structure
**Store few-shot examples in a database**, not in code. This enables A/B testing of different example sets and updates without redeploying.
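A sketch of how stored example sets could feed an A/B test. Everything here is an assumption made for illustration: the in-memory `exampleSets` object stands in for a database table, the set ids are invented, and `pickExampleSet` uses a simple string hash rather than any particular experimentation framework.

```typescript
interface StoredExample {
  input: string;
  output: string;
}

// Stand-in for a database table keyed by example-set id (hypothetical ids).
const exampleSets: Record<string, StoredExample[]> = {
  "tickets-v1": [
    { input: "I was charged twice.", output: "billing" },
    { input: "Where is my parcel?", output: "shipping" },
  ],
  "tickets-v2": [
    { input: "Refund has not arrived yet.", output: "billing" },
    { input: "The courier left the box outside.", output: "shipping" },
  ],
};

// Deterministic bucketing: the same user always gets the same example set,
// so results for each variant can be compared over time.
function pickExampleSet(userId: string, setIds: string[]): string {
  let hash = 0;
  for (let i = 0; i < userId.length; i++) {
    hash = (hash * 31 + userId.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return setIds[hash % setIds.length];
}

const availableSetIds = Object.keys(exampleSets);
const chosenSetId = pickExampleSet("user-42", availableSetIds);
const chosenExamples = exampleSets[chosenSetId];

console.log(chosenSetId, chosenExamples.length);
```

Because the examples live in data, swapping `tickets-v1` for a new set is a database update rather than a code change.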
Building an API that classifies support tickets into 12 custom categories. Which approach is more reliable?
Chain-of-Thought: Making the Model Think Aloud
2022. Jason Wei and colleagues at Google Brain showed that worked reasoning examples elicit step-by-step thinking - Chain-of-Thought prompting. Then Kojima et al. found one phrase was enough: **"Let's think step by step"**. Accuracy on the MultiArith benchmark jumps from about 18% to 79%. Not a new architecture. Not more training data. Not fine-tuning. Five words. It became one of the most cited ML discoveries of the year.
Why it works - mechanism, not intuition. LLMs generate one token at a time. When the model "thinks aloud," the intermediate reasoning tokens become **context** for the next tokens. A scratch pad built directly into the context window. The model literally uses its own text as working memory - it has no other kind.
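In application code, the scratch pad usually has to be separated from the final answer. The sketch below shows one way to do that, assuming the prompt asks the model to end with an `Answer:` line; that marker, the function names, and the sample completion are all illustrative (the completion is hand-written, not real model output).

```typescript
// Zero-shot CoT: append the trigger phrase and ask for a marked final line.
function buildCotPrompt(question: string): string {
  return [
    question,
    "",
    "Let's think step by step.",
    "Finish with a line of the form `Answer: <value>`.",
  ].join("\n");
}

// Scan from the end so that the last "Answer:" line wins; the reasoning
// above it can be logged but is not returned to the caller.
function extractFinalAnswer(completion: string): string | null {
  const lines = completion.split("\n");
  for (let i = lines.length - 1; i >= 0; i--) {
    const match = /^\s*Answer:\s*(.+?)\s*$/.exec(lines[i]);
    if (match) return match[1];
  }
  return null;
}

// Hand-written stand-in for a model response.
const sampleCompletion = [
  "There are 3 boxes with 4 pens each, so 3 * 4 = 12 pens.",
  "2 pens are given away, so 12 - 2 = 10 pens remain.",
  "Answer: 10",
].join("\n");

console.log(extractFinalAnswer(sampleCompletion)); // prints 10
```

If the marker is missing, `extractFinalAnswer` returns `null`, which the caller can treat as a failed generation and retry.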
| Task | Without CoT | With CoT | Improvement (percentage points) |
|---|---|---|---|
| Math (GSM8K) | ~57% | ~93% | +36 |
| Logic puzzles | ~45% | ~85% | +40 |
| Multi-step analysis | ~60% | ~90% | +30 |
| Simple classification | ~95% | ~95% | 0 (not needed) |
**CoT uses more output tokens** - reasoning takes up space. Don't use CoT for simple tasks (classification, extraction) - it's a waste of money.
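The cost difference is simple arithmetic over token counts. The sketch below uses placeholder values throughout: the request volume, token counts, and per-million-token prices are invented for illustration and are not real provider prices or measurements.

```typescript
interface CostInputs {
  requestsPerDay: number;
  inputTokens: number; // prompt tokens per request
  outputTokens: number; // completion tokens per request
  inputPricePerMTok: number; // USD per 1M input tokens
  outputPricePerMTok: number; // USD per 1M output tokens
}

function dailyCostUsd(c: CostInputs): number {
  const inputCost = (c.inputTokens / 1e6) * c.inputPricePerMTok;
  const outputCost = (c.outputTokens / 1e6) * c.outputPricePerMTok;
  return c.requestsPerDay * (inputCost + outputCost);
}

// Placeholder numbers only.
const baseline = {
  requestsPerDay: 10000,
  inputTokens: 500,
  inputPricePerMTok: 1,
  outputPricePerMTok: 4,
};

// Short direct answer vs. an answer preceded by written-out reasoning.
const withoutCot = dailyCostUsd({ ...baseline, outputTokens: 20 });
const withCot = dailyCostUsd({ ...baseline, outputTokens: 300 });

console.log(withoutCot.toFixed(2), withCot.toFixed(2)); // prints 5.80 17.00
```

With these made-up inputs, the reasoning tokens roughly triple the daily bill, which is why CoT is worth reserving for tasks where it actually moves accuracy.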
Which task would benefit most from Chain-of-Thought?
Output Formatting: JSON, XML Tags, Structured Output
In production, "text" isn't enough - what's needed is **structured data**: JSON that can be parsed and stored in a database. There are several strategies to get stable output. The right choice depends on the provider and strictness requirements.
**Strategy 1: JSON mode** (OpenAI) - guarantees syntactically valid JSON, but not any particular schema.
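A sketch of what a JSON mode request body and a defensive parser might look like. The request is shown as a plain object rather than an SDK call so the example stays dependency-free; `MODEL_NAME` is a placeholder, and the extraction task and `parseJsonObject` helper are illustrative.

```typescript
// Request body for the Chat Completions endpoint with JSON mode switched on.
// JSON mode expects the word "JSON" to appear somewhere in the messages.
const jsonModeRequest = {
  model: "MODEL_NAME", // placeholder, not a real model id
  response_format: { type: "json_object" },
  messages: [
    {
      role: "system",
      content: "Extract the city and date. Respond in JSON with keys `city` and `date`.",
    },
    { role: "user", content: "Flying to Berlin on 2025-03-14." },
  ],
};

// JSON mode does not enforce a shape, and a response cut off by the token
// limit may still be incomplete, so parse defensively.
function parseJsonObject(raw: string): Record<string, unknown> | null {
  try {
    const value: unknown = JSON.parse(raw);
    if (typeof value === "object" && value !== null && !Array.isArray(value)) {
      return value as Record<string, unknown>;
    }
    return null;
  } catch (_error) {
    return null;
  }
}

console.log(parseJsonObject('{"city": "Berlin", "date": "2025-03-14"}'));
console.log(parseJsonObject('{"city": "Ber')); // prints null
```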
**Strategy 2: Structured Outputs** (OpenAI) - stricter: the response must match a schema, typically defined with Zod.
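To keep the example free of dependencies, the sketch below writes the JSON Schema by hand instead of generating it from a Zod definition; in a Zod-based setup an equivalent schema would be derived from a `z.object(...)` declaration. `MODEL_NAME`, the `ticket` schema, and the `isTicket` guard are illustrative assumptions.

```typescript
// Hand-written JSON Schema for the expected result. Strict mode expects every
// property to be listed in `required` and `additionalProperties` to be false.
const ticketSchema = {
  type: "object",
  properties: {
    category: { type: "string", enum: ["billing", "shipping", "other"] },
    urgent: { type: "boolean" },
  },
  required: ["category", "urgent"],
  additionalProperties: false,
};

const structuredRequest = {
  model: "MODEL_NAME", // placeholder, not a real model id
  response_format: {
    type: "json_schema",
    json_schema: { name: "ticket", strict: true, schema: ticketSchema },
  },
  messages: [
    { role: "system", content: "Classify the support ticket." },
    { role: "user", content: "I was charged twice for one order." },
  ],
};

interface Ticket {
  category: "billing" | "shipping" | "other";
  urgent: boolean;
}

// A backend-side check that mirrors the schema; useful as a second line of
// defense and when the same code path also serves other providers.
function isTicket(value: unknown): value is Ticket {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  const category = record["category"];
  const urgent = record["urgent"];
  return (
    typeof category === "string" &&
    ticketSchema.properties.category.enum.indexOf(category) !== -1 &&
    typeof urgent === "boolean"
  );
}

console.log(isTicket({ category: "billing", urgent: true })); // prints true
```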
**Strategy 3: XML tags** - works with any provider (Claude, open-source models), but gives no format guarantee, so the output needs validation.
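A sketch of extracting tagged sections from a response. It assumes the prompt asked the model to put its reasoning in `<reasoning>` and the data in `<result>`; those tag names, the `extractTag` helper, and the sample output are illustrative (the sample is hand-written, not real model output).

```typescript
// Return the trimmed text inside the first <tag>...</tag> pair, or null.
// Assumes `tag` is a plain alphanumeric name with no regex metacharacters.
function extractTag(text: string, tag: string): string | null {
  const pattern = new RegExp(`<${tag}>([\\s\\S]*?)</${tag}>`);
  const match = pattern.exec(text);
  return match ? match[1].trim() : null;
}

// Hand-written stand-in for a model response.
const modelOutput = [
  "<reasoning>The customer mentions a double charge, so this is a billing issue.</reasoning>",
  '<result>{"category": "billing", "urgent": true}</result>',
].join("\n");

// There is no format guarantee here, so both steps can fail: the tag may be
// missing, and its content may not be valid JSON.
const resultText = extractTag(modelOutput, "result");
let parsedResult: unknown = null;
try {
  parsedResult = resultText === null ? null : JSON.parse(resultText);
} catch (_error) {
  parsedResult = null;
}

console.log(parsedResult);
```

The tags let Chain-of-Thought reasoning and machine-readable data share one response: the reasoning is kept for logs while only the `<result>` payload is parsed.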
| Method | Format Guarantee | Provider | When to Use |
|---|---|---|---|
| JSON mode | Valid JSON | OpenAI | Simple JSON responses |
| Structured Outputs + Zod | Exact schema match | OpenAI | When strict typing is needed |
| XML tags | No guarantee (needs validation) | Any | Multi-provider, complex responses with CoT + data |
| Plaintext + regex | No guarantee | Any | Simple responses (yes/no, a number) |
**Best practice:** for OpenAI use Structured Outputs + Zod. For Anthropic and open-source - XML tags + backend validation. In any case - always wrap parsing in try/catch.
Building a production API that extracts structured data and must work with both OpenAI and Anthropic. Which approach?
Prompt Composition: A Template Engine for AI
In production, a prompt isn't a hardcoded string. It's a **template** where data from the request, database, and config gets injected. The prompt is assembled dynamically - like a SQL query or an HTML template. The difference: a prompt has no compiler to catch mistakes. That's why the architecture matters even more.
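Since nothing checks a prompt at compile time, a template renderer can at least fail loudly at runtime. The sketch below assumes a `{{name}}` placeholder syntax; that syntax, the `renderTemplate` function, and the sample template are illustrative choices, not a reference to a specific templating library.

```typescript
// Replace {{name}} placeholders. A missing variable throws, so a broken
// prompt is caught in tests instead of being sent to the model.
function renderTemplate(template: string, variables: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (_whole: string, key: string) => {
    if (!Object.prototype.hasOwnProperty.call(variables, key)) {
      throw new Error(`Missing prompt variable: ${key}`);
    }
    return variables[key];
  });
}

const summaryTemplate = [
  "Summarize the ticket below for {{agentName}} in {{language}}.",
  "",
  "Ticket:",
  "{{ticketText}}",
].join("\n");

const renderedPrompt = renderTemplate(summaryTemplate, {
  agentName: "Alice",
  language: "English",
  ticketText: "I was charged twice for one order.",
});

console.log(renderedPrompt);
```

Values such as `ticketText` come from users and are inserted verbatim, so the rendered prompt should still be treated as untrusted input.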
**Advanced pattern: prompts in files** - store templates separately from code.
**Why extract prompts?** A product manager can edit prompts via CMS/admin panel without a developer. A/B tests of different prompts - without redeploying. Prompt versioning - rollback to a previous version if quality degrades.
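Versioning and rollback can be sketched with a small registry. This is an in-memory stand-in for whatever storage holds the prompts (files, a CMS, a database); the `PromptRegistry` class and its method names are hypothetical.

```typescript
interface PromptVersion {
  version: number;
  template: string;
}

class PromptRegistry {
  private versions: Record<string, PromptVersion[]> = {};
  private active: Record<string, number> = {};

  // Store a new version and make it the active one.
  publish(promptId: string, template: string): number {
    const list = this.versions[promptId] || [];
    const version = list.length + 1;
    list.push({ version, template });
    this.versions[promptId] = list;
    this.active[promptId] = version;
    return version;
  }

  // Point the prompt back at an earlier version without deleting anything.
  rollback(promptId: string, version: number): void {
    const list = this.versions[promptId] || [];
    if (!list.some((entry) => entry.version === version)) {
      throw new Error(`Unknown version ${version} for prompt ${promptId}`);
    }
    this.active[promptId] = version;
  }

  // Return the template of the currently active version.
  get(promptId: string): string {
    const list = this.versions[promptId] || [];
    const current = list.filter((entry) => entry.version === this.active[promptId])[0];
    if (!current) throw new Error(`Unknown prompt: ${promptId}`);
    return current.template;
  }
}

const registry = new PromptRegistry();
registry.publish("summary", "Summarize the text in three bullet points.");
registry.publish("summary", "Summarize the text in one sentence.");

// Quality dropped after version 2, so switch back to version 1.
registry.rollback("summary", 1);
console.log(registry.get("summary"));
```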
The main reason to use prompt templates instead of string literals in code:
**Myth:** The longer and more detailed the prompt, the better the result.
**Reality:** Extra instructions add noise: the model loses focus as attention spreads across irrelevant parts of the context.
Attention in a transformer is a distribution of weights across the tokens in the context window, so a system prompt with 50 rules leaves each rule a smaller share of it. The optimal system prompt is precise and minimal - only what genuinely shapes behavior. Everything else isn't help, it's noise.
**Myth:** The system prompt reliably restricts model behavior - it's a security boundary.
**Reality:** A system prompt is a strong shift in the probability distribution, not a hard constraint.
The model is a probabilistic machine. A clever enough user prompt can shift the distribution back. This is called prompt injection. The first CVEs have already been filed. The only reliable barriers are backend output validation, guardrails, and privilege separation.
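Privilege separation means the backend decides what is allowed, whatever the model outputs. A minimal sketch, assuming the model proposes tool calls that the backend then authorizes; the roles, tool names, and `authorizeAction` function are all hypothetical.

```typescript
type UserRole = "customer" | "agent";

interface ProposedAction {
  tool: string;
  args: Record<string, string>;
}

// Allowlist enforced in backend code: which tools each role may trigger.
// The model's output is treated as a request, never as a permission.
const allowedTools: Record<UserRole, string[]> = {
  customer: ["lookup_order"],
  agent: ["lookup_order", "issue_refund"],
};

function authorizeAction(role: UserRole, action: ProposedAction): boolean {
  return allowedTools[role].indexOf(action.tool) !== -1;
}

// Suppose an injected prompt made the model propose a refund for a customer.
const injectedAction: ProposedAction = {
  tool: "issue_refund",
  args: { orderId: "A-1001" },
};

console.log(authorizeAction("customer", injectedAction)); // prints false
console.log(authorizeAction("agent", injectedAction)); // prints true
```

Even if an injection fully overrides the system prompt, the refund never runs for a customer session, because the check lives outside the model.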
Patterns at a Glance
- System prompt - a specification, not a hint. Structure (Role → Rules → Format) works because the model was trained on structured documents
- Few-shot (3 examples, Brown et al. 2020) - shifts the output distribution without fine-tuning; store examples in a database, not in code
- Chain-of-Thought (Wei et al. 2022; zero-shot variant by Kojima et al. 2022) - five words give up to +40 percentage points on logic tasks; intermediate tokens become a scratch pad in context
- Structured Outputs (Zod) for OpenAI, XML tags + validation for multi-provider setups
- Prompts in templates separate from code - A/B tests and edits without redeploys
- Longer prompt does not mean better: extra instructions add noise and dilute attention
What's Next
Now it's clear how to write prompts like an engineer - as an interface to probabilistic computation with strict rules. The next step: getting the model to return data in a strict format and call backend functions.
- Structured Output - A closer look at JSON Schema, function calling, tool use
- Prompt Injection - How to protect prompts from user attacks
- Evaluation - How to measure prompt quality and automate testing
Questions for Reflection
- Which pattern (few-shot, CoT, structured output) would give the biggest impact in a typical LLM project? Why that one?
- If a system prompt isn't a security boundary, what needs to be added to the backend architecture for real protection?
- CoT increases the number of output tokens. What would a cost calculation look like for 100K requests per day with CoT vs. without?
Related Lessons
- aie-05-api-integration - Prompt patterns run on top of the chat API
- aie-07-structured-output - Patterns lead into schema-constrained outputs
- aie-34-prompt-injection-deep - Robust prompts must resist injection attacks
- aie-31-evaluation - Prompt quality needs systematic measurement
- ml-37-bert-gpt - Few-shot prompting exploits in-context learning of GPT models
- alg-20-greedy