AI Engineering
DSPy: Prompts as Code - Compile, Don't Handcraft
Lesson Objectives
- Understand why manual prompt engineering doesn't scale and is brittle
- Master the three DSPy abstractions: Signature, Module, Program
- See how BootstrapFewShot and MIPROv2 optimizers work
- Build an optimized RAG pipeline and measure quality before and after
ChatGPT launched in November 2022. Within a year, LinkedIn saw 10,000 'Prompt Engineer' listings. By 2025 most of that work was automated. Stanford's DSPy showed why: given a metric and training data, an optimizer can find prompts that match or beat handwritten ones. Phi-3 mini with DSPy-optimized prompts reportedly outperformed GPT-4 on several NLP benchmarks, despite being a far smaller model (3.8B parameters).
- JetBlue uses DSPy for request routing - accuracy improved from 71% to 89% without changing the model
- Weaviate integrated DSPy as the standard optimization layer for production RAG pipelines
- Phi-3 mini (3.8B params) with DSPy optimization beat GPT-4 on the HotpotQA benchmark
- DSPy appears in 50+ academic papers from 2024-2025 as a prompt optimization baseline
From Handcrafted Prompts to Compilers
- **2022 - Chain-of-Thought (Wei et al.)**: worked reasoning examples that engineers inserted into prompts by hand. The zero-shot variant (Kojima et al.) reduced this to the 'magic phrase' 'Let's think step by step'.
- **2022-2023 - Automatic Prompt Engineer (Zhou et al.)**: an early attempt at generating and selecting prompts automatically with an LLM (preprint in late 2022, published in 2023).
- **October 2023 - DSPy (Khattab et al., Stanford)**: the prompt as a compiler abstraction, not a manual artifact.
- **2024 - TextGrad**: textual 'gradients' written by an LLM, used as an analogue of backpropagation.
- **2024 - MIPROv2**: Bayesian optimization over instructions and few-shot examples simultaneously, currently DSPy's best optimizer.

The shift from prompts-as-text to prompts-as-programs took about two years.
Prerequisites
The Problem With Manual Prompt Engineering
The 'Prompt Engineer' boom that followed ChatGPT faded by 2025, when most of that work had been taken over by code. Not because prompts stopped mattering, but because handcrafting them doesn't scale.
Manual prompts are code without a compiler. An engineer spends hours tuning phrasing, tests on 20 examples, deploys. A month later OpenAI updates the model. The prompt that delivered 87% accuracy now delivers 71%. The cycle starts over.
| Problem | Symptom | Scale |
|---|---|---|
| Brittleness | Prompt 'breaks' after model update | GPT-3.5 -> GPT-4 -> GPT-4o - redo each time |
| Subjectivity | Two engineers write different prompts for the same task | No objective metric for choosing |
| Non-portability | GPT-4o prompt doesn't work on Claude | Vendor lock-in at the text level |
| Scale | 100 tasks = 100 handcrafted prompts | Effort grows linearly with the number of tasks |
The root issue is conflating **the task** with **the instruction**. The task is stable: 'classify sentiment'. The instruction is unstable: it depends on the model, the examples, the output format. DSPy separates them.
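Sclar et al. (2023) showed that meaning-preserving formatting changes to a prompt (separators, casing, spacing) can shift accuracy by tens of points, with a spread of up to 76 points in their few-shot experiments. A handcrafted prompt is therefore an arbitrary point in a vast instruction space, not an optimum.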
Why doesn't manual prompt engineering scale?
DSPy: Declare the Task, Don't Write the Prompt
Stanford DSPy (Khattab et al., 2023) is a framework where prompts are never written by hand. Instead, a **Signature** (input/output contract), a **Module** (processing strategy), and a **Program** (pipeline of modules) are declared. The compiler finds optimal prompts automatically.
The compiler analogy fits. Hand-tuning a prompt is like assigning processor registers by hand in assembly: possible, but a compiler does it better and redoes it for free when the target changes. In DSPy, writing 'Think step by step' manually is unnecessary: the optimizer searches the space of instructions and examples for the combination that maximizes a metric.
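The separation of a stable task from an unstable instruction can be illustrated without DSPy at all. The sketch below is a toy stand-in, not DSPy's classes: `TaskSpec` and `render_prompt` are hypothetical names, and the prompt layout is an assumption chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskSpec:
    """Stable part: which fields go in and which must come out."""
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


def render_prompt(spec, instruction, demos, **values):
    """Unstable part: combine the spec with a chosen instruction and demos."""
    missing = [name for name in spec.inputs if name not in values]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    lines = [instruction, ""]
    for demo in demos:
        # Each demo is a fully worked example: inputs followed by outputs.
        for name in spec.inputs + spec.outputs:
            lines.append(f"{name}: {demo[name]}")
        lines.append("")
    for name in spec.inputs:
        lines.append(f"{name}: {values[name]}")
    lines.append(f"{spec.outputs[0]}:")  # leave the first output open for the model
    return "\n".join(lines)


sentiment = TaskSpec(inputs=("text",), outputs=("label",))

# Same task, two different prompts: only the instruction and demos change.
handwritten = render_prompt(
    sentiment, "Classify the sentiment.", [], text="Great battery life"
)
compiled = render_prompt(
    sentiment,
    "Label the review as positive or negative.",
    [{"text": "Screen cracked in a week", "label": "negative"}],
    text="Great battery life",
)
```

The point of the design is that `sentiment` never changes, while the instruction and demos are plain data that a search procedure can swap out.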
**Built-in DSPy Modules** - ready-made processing strategies:
| Module | What It Does | When to Use |
|---|---|---|
| Predict | Direct LLM call per Signature | Simple classification, extraction |
| ChainOfThought | Adds reasoning before the answer | Complex tasks, math, multi-step |
| ReAct | Reasoning + Action - calls tools | Agents with external tools |
| Retrieve | Searches a vector store | RAG components |
| ProgramOfThought | Generates and executes code | Computation, structured processing |
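DSPy Signatures support Python type annotations, which lets DSPy parse and validate LLM output against the declared types. If the model returns an invalid type or malformed JSON, DSPy can retry the call; the retry limit (up to 3 attempts in the setup described here) depends on the DSPy version and configuration.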
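The validate-and-retry pattern can be sketched generically. This is not DSPy's internal code: `call_with_validation`, the `schema` dictionary, and the canned `fake_llm` replies are all hypothetical, and the limit of 3 attempts simply mirrors the figure quoted above.

```python
import json


def call_with_validation(llm, prompt, schema, max_retries=3):
    """Call `llm` until its reply is JSON matching `schema`, or attempts run out."""
    last_error = None
    for attempt in range(1, max_retries + 1):
        # On a retry, tell the model what was wrong with the previous reply.
        full_prompt = prompt if last_error is None else (
            f"{prompt}\n\nPrevious output was invalid: {last_error}"
        )
        raw = llm(full_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise TypeError("output must be a JSON object")
            for name, expected_type in schema.items():
                if not isinstance(data.get(name), expected_type):
                    raise TypeError(f"field {name!r} must be {expected_type.__name__}")
            return data, attempt
        except (json.JSONDecodeError, TypeError) as error:
            last_error = str(error)
    raise ValueError(f"no valid output after {max_retries} attempts: {last_error}")


# Canned replies standing in for a real model: not JSON, wrong type, then valid.
replies = iter([
    "label: positive",
    '{"label": "positive", "confidence": "high"}',
    '{"label": "positive", "confidence": 0.93}',
])


def fake_llm(prompt):
    return next(replies)


result, attempts = call_with_validation(
    fake_llm, "Classify: Great battery life", {"label": str, "confidence": float}
)
```

Here the first two replies are rejected and the third is accepted, so `attempts` ends at 3.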
In DSPy, a Signature describes...
DSPy Optimizers: BootstrapFewShot, MIPROv2, BayesianSignatureOptimizer
The heart of DSPy is its optimizers (formerly called teleprompters). The quality metric plays the role of a loss function with the sign flipped: the optimizer searches for the prompt that maximizes the metric on training examples. Nothing is backpropagated through model weights; the search runs over the space of possible instructions and few-shot examples.
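A minimal sketch of "metric as objective" follows. The metric takes `(example, prediction, trace=None)`, which matches the calling convention DSPy uses for metrics; everything else is a toy assumption, including the stub model in `make_program`, which is terse only when the instruction contains the word 'only'.

```python
from types import SimpleNamespace

FACTS = {"capital of france?": "Paris", "largest planet?": "Jupiter"}


def exact_match(example, prediction, trace=None):
    """Metric: True when the predicted answer equals the gold answer."""
    return example.answer.strip().lower() == prediction.answer.strip().lower()


def make_program(instruction):
    """Build a stub 'program' whose behaviour depends on its instruction."""
    def program(question):
        fact = FACTS[question.lower()]
        # Stub behaviour: this fake model is terse only when told to be.
        terse = "only" in instruction.lower()
        return SimpleNamespace(answer=fact if terse else f"The answer is {fact}.")
    return program


def average_score(program, dataset, metric):
    """Objective: mean metric value of the program over the dataset."""
    scores = [float(metric(ex, program(ex.question))) for ex in dataset]
    return sum(scores) / len(scores)


trainset = [
    SimpleNamespace(question="Capital of France?", answer="Paris"),
    SimpleNamespace(question="Largest planet?", answer="Jupiter"),
]
candidates = ["Answer the question.", "Reply with the answer only."]

# The "optimizer" here is a plain argmax over two candidate instructions.
scores = {c: average_score(make_program(c), trainset, exact_match) for c in candidates}
best_instruction = max(scores, key=scores.get)
```

Real optimizers differ in how they propose and select candidates, but the scoring loop has this shape: run the program, apply the metric, average, compare.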
| Optimizer | How It Works | When to Use |
|---|---|---|
| BootstrapFewShot | Runs the program on training examples and keeps the traces that pass the metric as few-shot examples | Few examples (10-50), quick start |
| MIPROv2 | Bayesian optimization: searches instructions + examples simultaneously | 50-500 examples, balanced quality/speed |
| BayesianSignatureOptimizer | Earlier name of the MIPRO optimizer in older DSPy releases; optimizes the instruction phrasing itself (not just examples) | Legacy code; in current releases use MIPROv2 when the task is poorly defined initially |
| BootstrapFewShotWithRandomSearch | Repeats BootstrapFewShot and random-searches over the resulting example sets, keeping the best-scoring program | Baseline when other optimizers are unstable |
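The bootstrapping idea behind the first row of the table can be sketched in a few lines. This is a simplified illustration, not DSPy's implementation: `bootstrap_demos` and `toy_teacher` are hypothetical, and the teacher is deliberately wrong on multiplication so that the metric has something to filter out.

```python
def toy_teacher(question):
    """Stand-in for an LLM-backed program that returns reasoning plus an answer."""
    left, _op, right = question.split()
    total = int(left) + int(right)  # deliberately ignores the operator
    return {"reasoning": f"{left} plus {right} is {total}", "answer": str(total)}


def answer_matches(example, prediction):
    """Metric: the predicted answer must equal the gold answer."""
    return example["answer"] == prediction["answer"]


def bootstrap_demos(teacher, trainset, metric, max_demos=4):
    """Keep teacher outputs that pass the metric as few-shot demonstrations."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = teacher(example["question"])
        if metric(example, prediction):
            # The demo carries the teacher's reasoning, which the trainset lacks.
            demos.append({"question": example["question"], **prediction})
    return demos


trainset = [
    {"question": "2 + 3", "answer": "5"},
    {"question": "4 * 5", "answer": "20"},
    {"question": "7 + 1", "answer": "8"},
]
demos = bootstrap_demos(toy_teacher, trainset, answer_matches)
```

Only the two addition examples survive the filter. The training set needs gold answers but not gold reasoning: the intermediate steps come from the teacher and are kept only when the final answer checks out.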
What happens inside MIPROv2: it bootstraps candidate sets of few-shot examples, generates candidate instructions (10 to 50), and evaluates combinations of the two on subsets of the training data. Bayesian optimization uses the scores seen so far to select the next combination to try. The result is the prompt that scored best for the specific model on the specific task.
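The loop of 'evaluate a candidate on a minibatch, then use all scores so far to pick the next candidate' can be sketched as follows. To stay dependency-free, the sketch swaps the Bayesian surrogate for a simpler UCB bandit rule, so it shows guided search in general rather than MIPROv2 itself. The candidates, the `PASSES` table, and `evaluate` are hypothetical stand-ins for real program runs.

```python
import math
import random

# Hypothetical candidates: (instruction, demo-set id) pairs an optimizer might propose.
CANDIDATES = [
    ("Answer using only the retrieved context.", "demos_a"),
    ("Answer the question.", "demos_a"),
    ("Answer the question.", "demos_b"),
]
# Toy stand-in for running the program: how many of 20 examples each candidate solves.
PASSES = dict(zip(CANDIDATES, (16, 5, 2)))


def evaluate(candidate, example_id):
    """Return 1.0 if the candidate 'solves' this example, else 0.0."""
    return 1.0 if example_id < PASSES[candidate] else 0.0


def guided_search(candidates, trainset, rounds=30, batch_size=10, seed=0):
    """Spend a fixed evaluation budget, favouring candidates that look promising."""
    rng = random.Random(seed)
    totals = {c: 0.0 for c in candidates}
    counts = {c: 0 for c in candidates}

    def upper_bound(c, step):
        # Mean score so far plus an exploration bonus that shrinks with more trials.
        return totals[c] / counts[c] + math.sqrt(2 * math.log(step) / counts[c])

    for step in range(1, rounds + 1):
        untried = [c for c in candidates if counts[c] == 0]
        if untried:
            choice = untried[0]
        else:
            choice = max(candidates, key=lambda c: upper_bound(c, step))
        batch = rng.sample(trainset, batch_size)  # minibatch, not the full set
        totals[choice] += sum(evaluate(choice, ex) for ex in batch) / batch_size
        counts[choice] += 1

    best = max(candidates, key=lambda c: totals[c] / counts[c])
    return best, counts


best, counts = guided_search(CANDIDATES, trainset=list(range(20)))
```

The budget of 30 minibatch evaluations is spent unevenly: the strongest candidate is tried at least as often as any other, which is the property that separates guided search from a grid.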
Optimization requires LLM calls. MIPROv2 with auto='medium' and 100 training examples makes on the order of 5,000 calls. On gpt-4o-mini that costs approximately USD 0.5-2, depending on prompt length. Run it once and save the compiled program; do not re-run on every deploy.
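The cost estimate is simple arithmetic and worth redoing for each setup. In the sketch below only the call count comes from the text above; the token counts per call and the per-million-token prices are assumptions for illustration and should be replaced with measured values and the provider's current price list.

```python
def estimate_cost_usd(
    calls,
    input_tokens_per_call,
    output_tokens_per_call,
    input_price_per_million,
    output_price_per_million,
):
    """Estimate the total cost of an optimization run in USD."""
    input_cost = calls * input_tokens_per_call * input_price_per_million / 1_000_000
    output_cost = calls * output_tokens_per_call * output_price_per_million / 1_000_000
    return input_cost + output_cost


cost = estimate_cost_usd(
    calls=5_000,                    # figure quoted in the text
    input_tokens_per_call=1_000,    # assumption: prompt with a few demos
    output_tokens_per_call=200,     # assumption: short reasoning plus answer
    input_price_per_million=0.15,   # assumption: USD per 1M input tokens
    output_price_per_million=0.60,  # assumption: USD per 1M output tokens
)
print(f"Estimated cost: USD {cost:.2f}")
```

With these assumed inputs the estimate comes to USD 1.35, inside the 0.5-2 range quoted above. Input tokens dominate once prompts carry several few-shot examples.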
What plays the role of the 'loss function' in DSPy optimization?
Practice: RAG Pipeline With DSPy and Evaluation
Concrete scenario: RAG system for technical documentation. Baseline - handcrafted prompt, 68% F1 on a golden dataset. After MIPROv2 optimization - 84% F1. No model change, no retrieval code change.
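A before/after comparison needs a metric and a fixed golden set. The sketch below uses token-level F1, a common choice for QA answers. The two golden examples and both sets of answers are invented toy data, not the lesson's dataset or results.

```python
from collections import Counter


def token_f1(prediction, reference):
    """Token-overlap F1 between a predicted answer and a reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(answer_fn, golden):
    """Average token F1 of a pipeline over (question, reference) pairs."""
    return sum(token_f1(answer_fn(q), ref) for q, ref in golden) / len(golden)


# Toy golden set and canned pipeline outputs (hypothetical data).
golden = [
    ("How do I reset the device?", "hold the power button for ten seconds"),
    ("Which port does the API use?", "port 8443"),
]
baseline_answers = {
    golden[0][0]: "press the power button",
    golden[1][0]: "the API uses port 8443 by default",
}
optimized_answers = {
    golden[0][0]: "hold the power button for ten seconds",
    golden[1][0]: "port 8443",
}

baseline_f1 = mean_f1(baseline_answers.get, golden)
optimized_f1 = mean_f1(optimized_answers.get, golden)
print(f"baseline F1 = {baseline_f1:.2f}, optimized F1 = {optimized_f1:.2f}")
```

The golden set used for this comparison should be held out from the examples the optimizer sees; otherwise the 'after' number partly measures memorized demonstrations.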
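What changed after optimization can be inspected: the compiled program stores the selected instructions and few-shot examples, and dspy.inspect_history() prints the most recent prompts sent to the model.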
TextGrad (Yuksekgonul et al., 2024) and AdalFlow are DSPy alternatives. TextGrad uses textual 'gradients': an LLM explains why an answer is wrong, and this explanation is used to improve the prompt. AdalFlow is a lighter framework focused on production deployment.
When switching from GPT-4o to Claude, an optimized DSPy prompt should be...
**Myth:** DSPy is only for research; handcrafted prompts are simpler in production.
**Reality:** DSPy saves hours on every model change and delivers reproducible results. It is a production tool.
Switching from GPT-4o to Claude 3.5 requires reworking a handcrafted prompt manually. A DSPy Program recompiles in 10-30 minutes. With 10+ models in the portfolio or frequent model upgrades, the savings grow linearly.
**Myth:** DSPy optimizers enumerate all possible prompts, and that takes too long.
**Reality:** MIPROv2 uses Bayesian optimization: 50-200 candidate evaluations, not millions, in roughly 10-30 minutes.
A Bayesian optimizer learns from each iteration and steers the search toward promising regions of the prompt space. This is not a grid search; it is guided search with posterior updates.
Key Takeaways
- A handcrafted prompt is a brittle point in instruction space - it breaks when the model updates
- DSPy separates the task (Signature) from the instruction (prompt) - the compiler generates the prompt
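- The metric function plays the role of the loss function; the optimizer maximizes it on training data by search, not by gradients
- MIPROv2 uses Bayesian optimization, needs 50-200 candidate evaluations (each one a batch of LLM calls), and can deliver +10-25% vs a handcrafted prompt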
- When switching models: recompile, don't rewrite - this is the core advantage of the approach
Questions for Reflection
- Which tasks in the current project use hardcoded prompts? How brittle are they to model changes?
- What would be the right optimization metric for a specific pipeline - F1, accuracy, or something domain-specific?
- DSPy optimizes prompts on a training set - how to ensure the metric doesn't overfit to that set?
Related Topics
DSPy automates prompts. Fine-tuning goes further - baking knowledge into model weights.
- Evaluation and golden datasets: the DSPy metric is itself an evaluation pipeline, so good test data is required
- Fine-tuning: the alternative path, where the model is trained on the task instead of optimizing a prompt
- Advanced RAG: DSPy enables optimization of an entire RAG pipeline as a single program