AI Engineering

DSPy: Prompts as Code - Compile, Don't Handcraft

Lesson Objectives

Understand why manual prompt engineering doesn't scale and is brittle
Master the three DSPy abstractions: Signature, Module, Program
See how BootstrapFewShot and MIPROv2 optimizers work
Build an optimized RAG pipeline and measure quality before and after

ChatGPT launched in 2022. LinkedIn saw 10,000 'Prompt Engineer' listings within a year. By 2025 most of that work was automated. Stanford's DSPy showed that, given a metric and training data, a compiler can find a better prompt than a hand-tuned one. Phi-3 mini with DSPy-optimized prompts outperformed GPT-4 on several NLP benchmarks, despite being a far smaller model.

JetBlue uses DSPy for request routing - accuracy improved from 71% to 89% without changing the model
Weaviate integrated DSPy as the standard optimization layer for production RAG pipelines
Phi-3 mini (3.8B params) with DSPy optimization beat GPT-4 on the HotpotQA benchmark
DSPy appears in 50+ academic papers from 2024-2025 as a prompt optimization baseline

From Handcrafted Prompts to Compilers

**2022 - Chain-of-Thought (Wei et al.)**: few-shot prompts with worked reasoning steps; Kojima et al. (2022) added the zero-shot phrase 'Let's think step by step', which engineers then inserted by hand.
**2023 - Automatic Prompt Engineer (Zhou et al.)**: an early attempt at generating and selecting prompts automatically with an LLM.
**October 2023 - DSPy (Khattab et al., Stanford)**: the prompt as the output of a compiler, not a manual artifact.
**2024 - TextGrad**: natural-language feedback from an LLM used as a textual 'gradient', by analogy with backpropagation.
**2024 - MIPROv2**: Bayesian optimization over instructions and few-shot examples jointly; presented here as DSPy's strongest optimizer.
The shift from prompts-as-text to prompts-as-programs took roughly two years.

Prerequisites

The Problem With Manual Prompt Engineering

Most of the prompt-engineering work that appeared after ChatGPT's launch has since been taken over by code. Not because prompts stopped mattering, but because handcrafting them doesn't scale.

Manual prompts are code without a compiler. An engineer spends hours tuning phrasing, tests on 20 examples, deploys. A month later OpenAI updates the model. The prompt that delivered 87% accuracy now delivers 71%. The cycle starts over.

Problem	Symptom	Scale
Brittleness	Prompt 'breaks' after model update	GPT-3.5 -> GPT-4 -> GPT-4o - redo each time
Subjectivity	Two engineers write different prompts for the same task	No objective metric for choosing
Non-portability	A prompt tuned for GPT-4o often underperforms on Claude	Vendor lock-in at the text level
Scale	100 tasks = 100 handcrafted prompts	Manual effort grows linearly with the number of tasks

The root issue is conflating **the task** with **the instruction**. The task is stable: 'classify sentiment'. The instruction is unstable: it depends on the model, the examples, the output format. DSPy separates them.

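Sclar et al. (2023) showed that meaning-preserving changes to prompt formatting, such as separators, spacing, and casing, can shift few-shot accuracy by tens of points (up to 76 points for LLaMA-2-13B). A handcrafted prompt is therefore an essentially arbitrary point in a vast instruction space, not an optimum.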

Why doesn't manual prompt engineering scale?

DSPy: Declare the Task, Don't Write the Prompt

Stanford DSPy (Khattab et al., 2023) is a framework where prompts are not written by hand. Instead, you declare a **Signature** (input/output contract), a **Module** (processing strategy), and a **Program** (pipeline of modules), and the compiler searches for prompts that maximize a metric on your data.

The compiler analogy is useful. Writing assembly by hand means assigning processor registers manually, and a compiler usually does that better. Likewise, in DSPy there is no need to type 'Think step by step' yourself: the optimizer searches the instruction space for wording that maximizes the metric. The search is discrete; no gradients or backward passes are involved.
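The separation between task and instruction can be sketched in plain Python. This is a toy model of the idea, not the DSPy API: the `Signature` and `Predict` classes below, their fields, and the `fake_lm` stub are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Signature:
    """Input/output contract: what the task is, not how to ask for it."""
    task: str
    inputs: tuple[str, ...]
    outputs: tuple[str, ...]


@dataclass
class Predict:
    """Module: turns a Signature plus tunable state into one LM call."""
    signature: Signature
    lm: Callable[[str], str]
    instructions: str = ""                            # an optimizer would rewrite this
    demos: list[dict] = field(default_factory=list)   # and fill these in

    def render(self, **kwargs: str) -> str:
        # The prompt text is derived from the contract and the tunable state.
        lines = [self.instructions or self.signature.task]
        for demo in self.demos:
            for name in self.signature.inputs + self.signature.outputs:
                lines.append(f"{name}: {demo[name]}")
        for name in self.signature.inputs:
            lines.append(f"{name}: {kwargs[name]}")
        lines.append(f"{self.signature.outputs[0]}:")
        return "\n".join(lines)

    def __call__(self, **kwargs: str) -> str:
        return self.lm(self.render(**kwargs)).strip()


def fake_lm(prompt: str) -> str:
    # Stand-in for a real model call: looks only at the last input line.
    last_input = prompt.splitlines()[-2]
    return "positive" if "love" in last_input else "negative"


sentiment = Signature("Classify the sentiment of the text.", ("text",), ("label",))
classify = Predict(sentiment, fake_lm)
print(classify(text="I love this library"))
```

The calling code depends only on `sentiment` and `classify`. Changing `instructions` or `demos` changes the prompt without touching the task definition, which is the property an optimizer relies on.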

**Built-in DSPy Modules** - ready-made processing strategies:

Module	What It Does	When to Use
Predict	Direct LLM call per Signature	Simple classification, extraction
ChainOfThought	Adds reasoning before the answer	Complex tasks, math, multi-step
ReAct	Reasoning + Action - calls tools	Agents with external tools
Retrieve	Searches a vector store	RAG components
ProgramOfThought	Generates and executes code	Computation, structured processing

DSPy Signatures support Python type annotations, which enables automatic validation of LLM output. If the model returns an invalid type or malformed JSON, DSPy can retry automatically (up to 3 times in the setup described here; the limit and the retry mechanism vary by DSPy version).
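A validate-and-retry loop of this kind can be sketched with the standard library alone. This is an illustration of the pattern, not DSPy's internal implementation: `call_with_validation`, its `schema` argument, and the scripted replies are hypothetical.

```python
import json
from typing import Callable


def call_with_validation(lm: Callable[[str], str], prompt: str,
                         schema: dict[str, type], max_retries: int = 3) -> dict:
    """Call lm until its output parses as JSON and matches schema."""
    last_error = ""
    for _ in range(1 + max_retries):
        # On a retry, feed the previous error back to the model.
        full_prompt = prompt if not last_error else f"{prompt}\nPrevious error: {last_error}"
        raw = lm(full_prompt)
        try:
            data = json.loads(raw)
            if not isinstance(data, dict):
                raise TypeError("expected a JSON object")
            for name, expected in schema.items():
                if not isinstance(data.get(name), expected):
                    raise TypeError(f"field {name!r} must be {expected.__name__}")
            return data
        except (json.JSONDecodeError, TypeError) as exc:
            last_error = str(exc)
    raise ValueError(f"no valid output after {max_retries} retries: {last_error}")


# Scripted stand-in for a model: two invalid replies, then a valid one.
replies = iter(['{"label": "positive", "confidence": "high"}',   # wrong type
                'label=positive',                                 # not JSON
                '{"label": "positive", "confidence": 0.9}'])
result = call_with_validation(lambda _prompt: next(replies), "Classify: great tool",
                              {"label": str, "confidence": float})
print(result)
```

Passing the error message back into the next attempt is a design choice in this sketch; it gives the model a chance to correct the specific problem instead of repeating it.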

DSPy Optimizers: BootstrapFewShot, MIPROv2, BayesianSignatureOptimizer

The heart of DSPy is its optimizers (formerly called teleprompters). The quality metric plays the role that a loss function plays in training, except that it is maximized: the optimizer searches for the prompt that scores highest on the training examples. Model weights stay fixed and no gradients are computed; the search runs over instructions and few-shot demonstrations.

Optimizer	How It Works	When to Use
BootstrapFewShot	Runs the program on training examples and keeps the traces that pass the metric as few-shot demonstrations	Few examples (10-50), quick start
MIPROv2	Bayesian optimization: searches instructions + examples simultaneously	50-500 examples, balanced quality/speed
BayesianSignatureOptimizer	Earlier name of the MIPRO optimizer; optimizes the instruction phrasing itself (not just examples)	When the task is poorly defined initially
BootstrapFewShotWithRandomSearch	Repeats BootstrapFewShot over random demonstration sets and keeps the best-scoring program	Baseline when other optimizers are unstable
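The bootstrapping idea behind BootstrapFewShot can be sketched as a filter over teacher outputs. This is a simplified stand-in, not the library's code: `bootstrap_demos`, `exact_match`, `toy_teacher`, and the tiny training set are hypothetical.

```python
from typing import Callable

Example = dict[str, str]


def bootstrap_demos(teacher: Callable[[str], str], trainset: list[Example],
                    metric: Callable[[Example, str], bool],
                    max_demos: int = 4) -> list[Example]:
    """Keep teacher outputs that pass the metric as few-shot demonstrations."""
    demos: list[Example] = []
    for example in trainset:
        prediction = teacher(example["question"])
        if metric(example, prediction):
            demos.append({"question": example["question"], "answer": prediction})
        if len(demos) == max_demos:
            break
    return demos


def exact_match(example: Example, prediction: str) -> bool:
    return example["answer"].lower() == prediction.strip().lower()


def toy_teacher(question: str) -> str:
    # Stand-in for an unoptimized program: right on some questions, wrong on others.
    known = {"2+2": "4", "capital of France": "Paris", "3*3": "6"}
    return known.get(question, "unknown")


trainset = [
    {"question": "2+2", "answer": "4"},
    {"question": "3*3", "answer": "9"},
    {"question": "capital of France", "answer": "paris"},
    {"question": "largest planet", "answer": "Jupiter"},
]
demos = bootstrap_demos(toy_teacher, trainset, exact_match, max_demos=2)
print(demos)
```

Only outputs that the metric accepts become demonstrations, so the metric doubles as a quality filter for the examples that end up in the prompt.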

What happens inside MIPROv2: it first bootstraps candidate sets of few-shot demonstrations, then proposes candidate instructions (10 to 50), and finally uses Bayesian optimization to choose which combination of instruction and demonstrations to evaluate next, scoring each on a subset of the training data. The result is a prompt tuned to the specific model and the specific task.
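The "evaluate, update, pick the next candidate" loop can be illustrated with a much simpler rule than a Bayesian surrogate model. The sketch below uses a UCB bandit as a stand-in; it is not MIPROv2's algorithm, and `guided_search`, `toy_score`, and the quality tables are hypothetical values invented for the simulation.

```python
import math
import random
from itertools import product
from typing import Callable


def guided_search(instructions: list[str], demo_sets: list[str],
                  score_minibatch: Callable[[str, str, random.Random], float],
                  trials: int = 60, seed: int = 0) -> tuple[str, str]:
    """Return the (instruction, demo set) pair with the best observed mean score.

    Each trial scores one pair on a minibatch. A UCB rule picks the next pair,
    so promising pairs receive more evaluations than weak ones.
    """
    rng = random.Random(seed)
    candidates = list(product(instructions, demo_sets))
    totals = {c: 0.0 for c in candidates}
    counts = {c: 0 for c in candidates}
    for t in range(1, trials + 1):
        untried = [c for c in candidates if counts[c] == 0]
        if untried:
            choice = untried[0]
        else:
            choice = max(candidates, key=lambda c: totals[c] / counts[c]
                         + math.sqrt(2 * math.log(t) / counts[c]))
        totals[choice] += score_minibatch(choice[0], choice[1], rng)
        counts[choice] += 1
    tried = [c for c in candidates if counts[c] > 0]
    return max(tried, key=lambda c: totals[c] / counts[c])


# Simulated ground truth, unknown to the search (assumed numbers).
TRUE_QUALITY = {"Answer briefly.": 0.55,
                "Answer using only the context.": 0.75,
                "Think, then answer.": 0.65}
DEMO_BONUS = {"no demos": 0.0, "bootstrapped demos": 0.1}


def toy_score(instruction: str, demo_set: str, rng: random.Random) -> float:
    # Noisy minibatch accuracy: 20 examples, each correct with probability p.
    p = TRUE_QUALITY[instruction] + DEMO_BONUS[demo_set]
    return sum(rng.random() < p for _ in range(20)) / 20


best = guided_search(list(TRUE_QUALITY), list(DEMO_BONUS), toy_score)
print(best)
```

The point of the sketch is the budget: the search spends a fixed number of minibatch evaluations and concentrates them on candidates that already look good, instead of scoring every candidate on the full training set.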

Optimization requires LLM calls. MIPROv2 with auto='medium' and 100 training examples makes roughly 5,000 calls. On gpt-4o-mini that costs approximately USD 0.5-2. Run once, save the result - do not re-run on every deploy.
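The cost estimate above is simple arithmetic and worth recomputing for your own setup. In the sketch below, the tokens per call and the per-million-token prices are assumptions for illustration only; check the provider's current price list before relying on them.

```python
def optimization_cost_usd(calls: int, input_tokens_per_call: int,
                          output_tokens_per_call: int,
                          usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Estimate the total cost of an optimization run from call and token counts."""
    input_cost = calls * input_tokens_per_call / 1_000_000 * usd_per_m_input
    output_cost = calls * output_tokens_per_call / 1_000_000 * usd_per_m_output
    return input_cost + output_cost


# Assumed figures: 5,000 calls, about 1,000 input and 200 output tokens per call,
# priced at USD 0.15 / 0.60 per million input / output tokens.
cost = optimization_cost_usd(calls=5_000, input_tokens_per_call=1_000,
                             output_tokens_per_call=200,
                             usd_per_m_input=0.15, usd_per_m_output=0.60)
print(f"{cost:.2f}")  # prints 1.35
```

Under these assumptions the run lands inside the USD 0.5-2 range quoted above. Longer retrieved contexts raise the input-token term roughly in proportion.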

What plays the role of the 'loss function' in DSPy optimization?

Practice: RAG Pipeline With DSPy and Evaluation

Concrete scenario: RAG system for technical documentation. Baseline - handcrafted prompt, 68% F1 on a golden dataset. After MIPROv2 optimization - 84% F1. No model change, no retrieval code change.
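A before/after comparison needs a metric function and a fixed golden set. The lesson does not say which F1 variant was used, so the sketch below assumes token-level F1 between predicted and reference answers; `token_f1`, `evaluate`, the two-item golden set, and both stub programs are hypothetical.

```python
from collections import Counter
from typing import Callable


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token precision and recall (case-insensitive)."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(program: Callable[[str], str], golden: list[tuple[str, str]]) -> float:
    """Average the metric over (question, reference answer) pairs."""
    return sum(token_f1(program(q), a) for q, a in golden) / len(golden)


golden = [("How do I reset the cache?", "run cache clear"),
          ("Default port?", "port 8080")]


# Stand-ins for the pipeline before and after optimization.
def baseline(question: str) -> str:
    return "run the cache clear command" if "cache" in question else "it depends"


def optimized(question: str) -> str:
    return "run cache clear" if "cache" in question else "port 8080"


print(f"baseline F1:  {evaluate(baseline, golden):.2f}")
print(f"optimized F1: {evaluate(optimized, golden):.2f}")
```

The same `evaluate` call is run on both programs with the same golden set, which is what makes the two numbers comparable. The golden set should be held out from the data the optimizer sees.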

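What changed inside after optimization can be inspected: each predictor in the compiled program stores the instructions and few-shot demonstrations the optimizer selected, and `dspy.inspect_history` prints the most recent prompts sent to the model.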

TextGrad (Yuksekgonul et al., 2024) and AdalFlow are DSPy alternatives. TextGrad uses textual 'gradients': an LLM explains why an answer is wrong, and this explanation is used to improve the prompt. AdalFlow is a lighter framework focused on production deployment.

Myth: DSPy is only for research - handcrafted prompts are simpler in production

Reality: DSPy saves hours on every model change and delivers reproducible results - it is a production tool

Switching from GPT-4o to Claude 3.5 requires reworking a handcrafted prompt manually. A DSPy Program recompiles in 10-30 minutes. With 10+ models in the portfolio or frequent model upgrades, the savings grow linearly.

Myth: DSPy optimizers enumerate all possible prompts - that takes too long

Reality: MIPROv2 uses Bayesian optimization: 50-200 evaluations, not millions - takes 10-30 minutes

A Bayesian optimizer learns from each iteration and steers the search toward promising regions of the prompt space. This is not a grid search - it is guided search with posterior updates.

Key Takeaways

A handcrafted prompt is a brittle point in instruction space - it breaks when the model updates
DSPy separates the task (Signature) from the instruction (prompt) - the compiler generates the prompt
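The metric function plays the role of the loss function (an objective to maximize); the optimizer maximizes it on training data
MIPROv2 uses Bayesian optimization over roughly 50-200 candidate evaluations, each of which takes many LLM calls, and can deliver +10-25% vs a handcrafted prompt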
When switching models: recompile, don't rewrite - this is the core advantage of the approach

Questions for Reflection

Which tasks in the current project use hardcoded prompts? How brittle are they to model changes?
What would be the right optimization metric for a specific pipeline - F1, accuracy, or something domain-specific?
DSPy optimizes prompts on a training set - how do you ensure the optimized prompt doesn't overfit to that set?