AI Engineering

DSPy: Prompts as Code - Compile, Don't Handcraft

Lesson Objectives

  • Understand why manual prompt engineering doesn't scale and is brittle
  • Master the three DSPy abstractions: Signature, Module, Program
  • See how BootstrapFewShot and MIPROv2 optimizers work
  • Build an optimized RAG pipeline and measure quality before and after

ChatGPT launched in 2022. LinkedIn saw 10,000 'Prompt Engineer' listings within a year. By 2025 most of that work was automated. Stanford's DSPy showed that, given a metric and training data, a compiler can find a better prompt than a handcrafted one. Phi-3 mini with DSPy-optimized prompts outperformed GPT-4 on several NLP benchmarks, despite being 50x smaller.

  • JetBlue uses DSPy for request routing - accuracy improved from 71% to 89% without changing the model
  • Weaviate integrated DSPy as the standard optimization layer for production RAG pipelines
  • Phi-3 mini (3.8B params) with DSPy optimization beat GPT-4 on the HotpotQA benchmark
  • DSPy appears in 50+ academic papers from 2024-2025 as a prompt optimization baseline

From Handcrafted Prompts to Compilers

The shift from prompts-as-text to prompts-as-programs took about two years:

  • **2022 - Chain-of-Thought (Wei et al.)**: step-by-step reasoning elicited by handwritten exemplars; the zero-shot variant (Kojima et al.) reduced it to the phrase 'Let's think step by step', which engineers inserted manually.
  • **2023 - Automatic Prompt Engineer (Zhou et al.)**: first attempt at generating prompts automatically via an LLM.
  • **October 2023 - DSPy (Khattab et al., Stanford)**: the prompt as a compiler abstraction, not a manual artifact.
  • **2024 - TextGrad**: textual 'gradients' from LLMs as a backpropagation-like mechanism.
  • **2024 - MIPROv2**: Bayesian optimization of instructions and few-shot examples simultaneously; the optimizer this lesson treats as DSPy's strongest.

Prerequisites

  • Production Prompt Patterns: system/user/assistant, Few-Shot, Chain-of-Thought
  • Evaluation: How to Know an LLM Didn't Break After Deploy

The Problem With Manual Prompt Engineering

ChatGPT launched in 2022. Within a year, LinkedIn saw 10,000 'Prompt Engineer' job listings. By 2025, most of that work had been taken over by code. Not because prompts stopped mattering - because handcrafting them doesn't scale.

Manual prompts are code without a compiler. An engineer spends hours tuning phrasing, tests on 20 examples, deploys. A month later OpenAI updates the model. The prompt that delivered 87% accuracy now delivers 71%. The cycle starts over.

| Problem | Symptom | Scale |
|---|---|---|
| Brittleness | Prompt 'breaks' after model update | GPT-3.5 -> GPT-4 -> GPT-4o: redo each time |
| Subjectivity | Two engineers write different prompts for the same task | No objective metric for choosing |
| Non-portability | GPT-4o prompt doesn't work on Claude | Vendor lock-in at the text level |
| Scale | 100 tasks = 100 handcrafted prompts | O(N) effort for O(N) tasks |

The root issue is conflating **the task** with **the instruction**. The task is stable: 'classify sentiment'. The instruction is unstable: it depends on the model, the examples, the output format. DSPy separates them.

Sclar et al. (2023) showed that changing a prompt by a single token can shift accuracy by 10-15%. A handcrafted prompt is therefore effectively an arbitrary point in a vast instruction space, with no guarantee that it is anywhere near an optimum.

Why doesn't manual prompt engineering scale?

DSPy: Declare the Task, Don't Write the Prompt

Stanford DSPy (Khattab et al., 2023) is a framework where prompts are never written by hand. Instead, a **Signature** (input/output contract), a **Module** (processing strategy), and a **Program** (pipeline of modules) are declared. The compiler finds optimal prompts automatically.

The compiler analogy is apt. Writing a program without a compiler means assigning processor registers by hand, and a compiler usually does it better. In DSPy, writing 'Think step by step' by hand is likewise unnecessary: the optimizer searches the instruction space for the wording that maximizes a metric. No gradients are involved; candidate prompts are proposed, scored, and selected.
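As a minimal sketch of how a Signature, a Module, and a Program fit together, assuming a recent DSPy release (2.5 or later) is installed. The class names, field names, and the model identifier in the usage comment are illustrative and do not come from the lesson. The import is deferred into the function so the definition loads even where DSPy is absent.

```python
def build_sentiment_program():
    """Return an uncompiled DSPy program for sentiment classification.

    Assumes `pip install dspy`; every name below is illustrative.
    """
    import dspy  # deferred so this file loads even without DSPy installed

    # Signature: the stable task contract (inputs -> outputs), no prompt text.
    class ClassifySentiment(dspy.Signature):
        """Classify the sentiment of a sentence."""

        sentence: str = dspy.InputField()
        sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")

    # Module: a processing strategy (here ChainOfThought) applied to a Signature.
    # Program: a dspy.Module that composes one or more modules in forward().
    class SentimentProgram(dspy.Module):
        def __init__(self):
            super().__init__()
            self.classify = dspy.ChainOfThought(ClassifySentiment)

        def forward(self, sentence):
            return self.classify(sentence=sentence)

    return SentimentProgram()


# Usage (needs an API key; the model name is an assumption):
#   import dspy
#   dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
#   program = build_sentiment_program()
#   print(program(sentence="The docs are excellent.").sentiment)
```

Nothing in this sketch contains prompt wording: the instruction text and any few-shot examples are filled in later, when an optimizer compiles the program.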

**Built-in DSPy Modules** - ready-made processing strategies:

| Module | What It Does | When to Use |
|---|---|---|
| Predict | Direct LLM call per Signature | Simple classification, extraction |
| ChainOfThought | Adds reasoning before the answer | Complex tasks, math, multi-step |
| ReAct | Reasoning + Action: calls tools | Agents with external tools |
| Retrieve | Searches a vector store | RAG components |
| ProgramOfThought | Generates and executes code | Computation, structured processing |

DSPy Signatures support Python type annotations, which enables automatic validation of LLM output. If the model returns an invalid type or malformed JSON, DSPy can retry the call (up to 3 times; the exact retry behavior depends on the DSPy version).
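The validate-and-retry idea can be illustrated without DSPy. The sketch below is plain Python and is not DSPy's internal implementation: `parse_sentiment`, `call_with_retries`, and the fake model replies are hypothetical, and the default of 3 retries simply mirrors the number mentioned above.

```python
import json
from typing import Callable, Literal, get_args

# The declared output type: only these three labels are valid.
Sentiment = Literal["positive", "negative", "neutral"]


def parse_sentiment(raw: str) -> str:
    """Parse a JSON reply and check it against the declared output type."""
    value = json.loads(raw)["sentiment"]  # raises on malformed JSON or missing key
    if value not in get_args(Sentiment):
        raise ValueError(f"unexpected label: {value!r}")
    return value


def call_with_retries(generate: Callable[[], str], max_retries: int = 3) -> str:
    """Call `generate` until its output validates, retrying up to `max_retries` times."""
    last_error = None
    for _ in range(1 + max_retries):  # first attempt plus the retries
        try:
            return parse_sentiment(generate())
        except (json.JSONDecodeError, KeyError, ValueError, TypeError) as err:
            last_error = err
    raise RuntimeError(f"no valid output after {max_retries} retries") from last_error


# A stand-in for an LLM that fails twice before returning valid JSON.
replies = iter(['not json', '{"sentiment": "great"}', '{"sentiment": "positive"}'])
result = call_with_retries(lambda: next(replies))
```

The type annotation does double duty here: it documents the contract and supplies the set of valid values the parser checks against.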


DSPy Optimizers: BootstrapFewShot, MIPROv2, BayesianSignatureOptimizer

The heart of DSPy is its optimizers (formerly called teleprompters). The quality metric acts as a loss function: the optimizer searches for the prompt that maximizes the metric on training examples. Nothing is backpropagated through model weights; the search runs over the space of possible instructions and few-shot examples.
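A metric in this sense is an ordinary Python function. The sketch below assumes the `metric(example, prediction, trace=None)` calling convention that DSPy optimizers use, and that both objects expose an `answer` attribute. Token-level F1 is one illustrative scoring choice, not the lesson's metric.

```python
from collections import Counter
from types import SimpleNamespace


def token_f1(gold: str, predicted: str) -> float:
    """Token-overlap F1 between a gold answer and a predicted answer."""
    gold_tokens = gold.lower().split()
    pred_tokens = predicted.lower().split()
    if not gold_tokens or not pred_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(gold_tokens == pred_tokens)
    overlap = sum((Counter(gold_tokens) & Counter(pred_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_f1(example, prediction, trace=None) -> float:
    """Score one prediction; higher is better, so the optimizer maximizes it."""
    return token_f1(example.answer, prediction.answer)


# Stand-ins for dspy.Example / dspy.Prediction, which also expose fields as attributes.
gold = SimpleNamespace(answer="the optimizer maximizes the metric")
pred = SimpleNamespace(answer="the optimizer maximizes accuracy")
score = answer_f1(gold, pred)  # precision 3/4, recall 3/5
```

Because the optimizer only ever sees this number, whatever the metric fails to measure is invisible to the search.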

| Optimizer | How It Works | When to Use |
|---|---|---|
| BootstrapFewShot | Generates few-shot examples automatically from the training set | Few examples (10-50), quick start |
| MIPROv2 | Bayesian optimization: searches instructions + examples simultaneously | 50-500 examples, balanced quality/speed |
| BayesianSignatureOptimizer (earlier name of MIPRO in older DSPy releases) | Optimizes the Signature phrasing itself (not just examples) | When the task is poorly defined initially |
| BootstrapFewShotWithRandomSearch | Random search over candidate sets of bootstrapped few-shot examples | Baseline when other optimizers are unstable |
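The bootstrapping idea behind BootstrapFewShot can be sketched in plain Python. This is a simplified illustration, not DSPy's implementation: `bootstrap_demos`, the toy teacher, and the data are all hypothetical. A teacher program is run on the training inputs, and only the traces that pass the metric are kept as few-shot demonstrations.

```python
from typing import Callable


def bootstrap_demos(
    teacher: Callable[[str], dict],
    trainset: list[dict],
    metric: Callable[[dict, dict], bool],
    max_demos: int = 4,
) -> list[dict]:
    """Collect up to `max_demos` teacher outputs that pass the metric."""
    demos = []
    for example in trainset:
        if len(demos) >= max_demos:
            break
        prediction = teacher(example["question"])
        if metric(example, prediction):
            # Keep the full trace (including the reasoning) as a demonstration.
            demos.append({"question": example["question"], **prediction})
    return demos


def toy_teacher(question: str) -> dict:
    """Stand-in for an LLM-backed program: adds two integers, wrong on large inputs."""
    a, b = (int(x) for x in question.split("+"))
    total = a + b if a < 100 else 0  # deliberately faulty for a >= 100
    return {"reasoning": f"{a} plus {b} is {total}", "answer": str(total)}


def exact_match(example: dict, prediction: dict) -> bool:
    """Accept a trace only when the final answer matches the label."""
    return example["answer"] == prediction["answer"]


trainset = [
    {"question": "2+3", "answer": "5"},
    {"question": "200+1", "answer": "201"},  # the teacher fails here, so no demo
    {"question": "7+8", "answer": "15"},
]
demos = bootstrap_demos(toy_teacher, trainset, exact_match, max_demos=2)
```

The metric works as a filter: the demonstrations include intermediate reasoning that was never labeled by hand, yet each one ends in a verified answer.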

What happens inside MIPROv2: candidate instructions are generated (10 to 50), each candidate is scored on a subset of the training data, and Bayesian optimization uses the scores collected so far to choose which candidate to try next. The result is the best-scoring prompt found for the specific model and the specific task.

Optimization requires LLM calls. MIPROv2 with auto='medium' and 100 training examples makes roughly 5,000 calls. On gpt-4o-mini that costs approximately USD 0.5-2. Run once, save the result - do not re-run on every deploy.
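The estimate can be reproduced with back-of-the-envelope arithmetic. The token counts per call and the per-million-token prices below are assumptions for illustration, not measured values; prices change, so check the provider's current price list.

```python
def estimate_cost_usd(
    calls: int,
    input_tokens_per_call: int,
    output_tokens_per_call: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Estimate the total API cost of an optimization run."""
    input_cost = calls * input_tokens_per_call / 1_000_000 * usd_per_million_input
    output_cost = calls * output_tokens_per_call / 1_000_000 * usd_per_million_output
    return input_cost + output_cost


# Assumed: 5,000 calls, ~1,000 prompt tokens and ~200 completion tokens per call,
# priced at USD 0.15 / 0.60 per million input / output tokens.
cost = estimate_cost_usd(5_000, 1_000, 200, 0.15, 0.60)
print(f"~USD {cost:.2f}")  # ~USD 1.35
```

Under these assumptions the run lands inside the quoted range; longer retrieved contexts push the input-token term up roughly linearly.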

What plays the role of the 'loss function' in DSPy optimization?

Practice: RAG Pipeline With DSPy and Evaluation

Concrete scenario: RAG system for technical documentation. Baseline - handcrafted prompt, 68% F1 on a golden dataset. After MIPROv2 optimization - 84% F1. No model change, no retrieval code change.
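A before-and-after measurement of this kind might be wired up as follows. This is a sketch under stated assumptions, not the lesson's actual pipeline: it assumes DSPy 2.5 or later, and `build_rag`, `optimize_and_compare`, the `search` callable, the exact-match metric, the model name, and the output file name are all hypothetical. The uncompiled program stands in for the baseline, and `trainset` / `devset` are expected to be lists of `dspy.Example` objects with `question` marked as the input via `.with_inputs("question")`.

```python
def exact_match(example, prediction, trace=None) -> bool:
    """Illustrative metric: case-insensitive exact match on the answer field."""
    return example.answer.strip().lower() == prediction.answer.strip().lower()


def build_rag(search):
    """Build an uncompiled RAG program; `search(question)` returns a list of passages."""
    import dspy  # deferred so this file loads even without DSPy installed

    class RAG(dspy.Module):
        def __init__(self):
            super().__init__()
            self.respond = dspy.ChainOfThought("context, question -> answer")

        def forward(self, question):
            context = search(question)  # retrieval code stays unchanged
            return self.respond(context=context, question=question)

    return RAG()


def optimize_and_compare(search, trainset, devset, model="openai/gpt-4o-mini"):
    """Score the program on `devset` before and after MIPROv2 compilation."""
    import dspy
    from dspy.teleprompt import MIPROv2

    dspy.configure(lm=dspy.LM(model))
    evaluate = dspy.Evaluate(devset=devset, metric=exact_match, num_threads=4)

    baseline = build_rag(search)
    before = evaluate(baseline)

    optimizer = MIPROv2(metric=exact_match, auto="medium")
    optimized = optimizer.compile(baseline, trainset=trainset)
    after = evaluate(optimized)

    optimized.save("optimized_rag.json")  # compile once, reuse the saved result
    return before, after
```

Scoring both versions on a held-out `devset` rather than on `trainset` keeps the comparison honest, since the optimizer has already seen the training examples.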

What changed inside after optimization can be inspected:
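One way to look inside a compiled program is to walk its predictors and read the instructions and demonstrations the optimizer attached. The helper below is a hypothetical sketch that relies on the `named_predictors()`, `signature.instructions`, and `demos` attributes of DSPy modules; it takes the program as an argument and needs no import of its own.

```python
def describe_program(program) -> list[dict]:
    """Summarize the instructions and demo count of every predictor in a DSPy program."""
    summary = []
    for name, predictor in program.named_predictors():
        summary.append(
            {
                "predictor": name,
                "instructions": predictor.signature.instructions,
                "num_demos": len(predictor.demos),
            }
        )
    return summary


# Usage with a compiled program (assumed to be bound to `optimized`):
#   for row in describe_program(optimized):
#       print(row)
#   dspy.inspect_history(n=1)  # prints the most recent prompt sent to the LM
```

Comparing this summary for the uncompiled and the compiled program shows what the optimizer changed: typically a rewritten instruction string and a non-empty list of demonstrations.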

TextGrad (Yuksekgonul et al., 2024) and AdalFlow are DSPy alternatives. TextGrad uses textual 'gradients': an LLM explains why an answer is wrong, and that explanation is used to improve the prompt. AdalFlow is a lighter framework focused on production deployment.


**Myth:** DSPy is only for research - handcrafted prompts are simpler in production.

**Reality:** DSPy saves hours on every model change and delivers reproducible results - it is a production tool.

Switching from GPT-4o to Claude 3.5 requires reworking a handcrafted prompt manually. A DSPy Program recompiles in 10-30 minutes. With 10+ models in the portfolio or frequent model upgrades, the savings grow linearly.

**Myth:** DSPy optimizers enumerate all possible prompts - that takes too long.

**Reality:** MIPROv2 uses Bayesian optimization: 50-200 evaluations, not millions - takes 10-30 minutes.

A Bayesian optimizer learns from each iteration and steers the search toward promising regions of the prompt space. This is not a grid search - it is guided search with posterior updates.

Key Takeaways

  • A handcrafted prompt is a brittle point in instruction space - it breaks when the model updates
  • DSPy separates the task (Signature) from the instruction (prompt) - the compiler generates the prompt
  • The metric function plays the role of the loss function; the optimizer maximizes it on training data
  • MIPROv2 uses Bayesian optimization over 50-200 candidate evaluations (each evaluation costs many LLM calls) and can deliver +10-25% vs a handcrafted prompt
  • When switching models: recompile, don't rewrite - this is the core advantage of the approach

Questions for Reflection

  • Which tasks in the current project use hardcoded prompts? How brittle are they to model changes?
  • What would be the right optimization metric for a specific pipeline - F1, accuracy, or something domain-specific?
  • DSPy optimizes prompts on a training set - how to ensure the metric doesn't overfit to that set?

Related Topics

DSPy automates prompts. Fine-tuning goes further - baking knowledge into model weights.

  • Evaluation and golden datasets — The DSPy metric is the same as an evaluation pipeline: good test data is required
  • Fine-tuning — The alternative path - instead of optimizing a prompt, train the model on the task
  • Advanced RAG — DSPy enables optimization of an entire RAG pipeline as a single program

Related Lessons

  • aie-06-prompt-patterns
  • aie-31-evaluation
  • aie-13-advanced-rag
  • aie-36-fine-tuning
  • aie-29-cost-management