Generative AI

Prompt Engineering

Предварительные знания

How an LLM predicts the next token from its context window
What instruction tuning and RLHF do to a base model

OpenAI's GPT-4 technical report includes a benchmark where zero-shot GPT-4 outperformed fine-tuned task-specific models from 2020. The capability was already in the model - the technique for eliciting it was prompt engineering. For most practical applications, the right prompt is a faster and cheaper solution than fine-tuning.

GitHub Copilot's system prompt is roughly 1,000 tokens of carefully engineered instructions: persona definition, output format, error handling behavior, and examples. The engineering of this prompt took months and is considered a core product asset.
Bing Chat (now Copilot) was initially jailbroken by prompts that led it to claim to be "Sydney" and express disturbing content. The subsequent hardening - more explicit system prompts, safety classifiers, and RLHF - was a prompt engineering and alignment exercise at scale.
Stripe uses chain-of-thought prompting in their fraud detection pipeline. The LLM is asked to reason step-by-step about transaction patterns before making a classification decision. The chain-of-thought output is logged and reviewed by fraud analysts, providing both a decision and an audit trail.

How prompting became a discipline

In 2020 Tom Brown and the OpenAI team introduced GPT-3 in 'Language Models are Few-Shot Learners', showing a model could perform a new task from a few examples placed in the prompt, with no weight updates. In-context learning meant the prompt itself became the interface. In 2022 Jason Wei and colleagues at Google published chain-of-thought prompting: simply asking the model to reason step by step sharply improved performance on arithmetic and logic tasks. Between 2022 and 2023 prompt engineering grew from a trick into a discipline, with patterns like few-shot examples, role and system prompts, and structured-output instructions becoming standard practice for working with LLMs.

Few-Shot Learning

Few-shot learning provides examples of the desired behavior directly in the prompt. Unlike fine-tuning, which modifies model weights, few-shot examples are in-context demonstrations that the model generalizes from. The pattern: "Here is an example of the task. Here is another example. Now do the same for this input." The optimal number of examples is task-dependent: classification tasks benefit from 8-16 examples (balanced across classes), generation tasks often need 2-4.

Example selection matters as much as example count. Examples should cover the distribution of real inputs, include edge cases, and be balanced across categories. The order of examples also affects performance: more recent examples (closer to the query) have more influence than earlier ones. For production prompts, test multiple random orderings and use the most stable.

What distinguishes few-shot learning from fine-tuning?

Chain-of-Thought Prompting

Chain-of-thought (CoT) prompting instructs the model to generate reasoning steps before producing the final answer. Wei et al. (2022) showed that "Let's think step by step" or providing CoT examples significantly improves accuracy on math, logical reasoning, and multi-step problems. The intuition: the model's reasoning tokens act as scratchpad space, decomposing complex problems into simpler sub-problems. CoT does not work well on models below ~7B parameters - smaller models repeat reasoning without correctly solving.

Self-consistency CoT generates multiple reasoning chains (temperature > 0) and takes a majority vote over the final answers. This reduces variance at the cost of 5-20x token usage. For high-stakes decisions where accuracy is critical, self-consistency reliably outperforms single-pass CoT.

Why does chain-of-thought prompting improve accuracy on reasoning tasks?

System Prompts

A system prompt is the instruction layer that precedes the conversation and defines the model's persona, constraints, and behavior. It is typically hidden from end users - they interact with a tuned product experience, not the raw model. Effective system prompts define: the assistant's role and persona, output format requirements, scope restrictions ("only answer questions about X"), safety boundaries, and examples of ideal behavior. System prompts are often hundreds to thousands of tokens.

System prompt leakage is a known attack: users prompt the model to "repeat your system prompt" or "ignore previous instructions." Mitigations: instruct the model explicitly not to reveal the system prompt, use Claude's built-in confidentiality instructions, and classify inputs for injection attempts before sending to the model.

What is the purpose of a system prompt in a production LLM application?

Structured Output

Structured output constrains the LLM to produce parseable formats (JSON, XML, YAML) that downstream code can process reliably. Approaches in order of reliability: (1) prompt-only instructions with example JSON (least reliable); (2) constrained decoding via grammar sampling (forces token-level JSON validity); (3) OpenAI function calling / Anthropic tool use (model outputs structured data for a declared schema); (4) json_schema parameter in the API (forces output to match a JSONSchema).

Grammar-based constrained decoding (outlines, lm-format-enforcer, llama.cpp GBNF grammars) is the most reliable approach for JSON output - it makes invalid JSON tokens impossible at the sampling level. The trade-off: constrained decoding is harder to implement and slightly reduces model accuracy because valid but useful tokens are masked.

Why is tool use / function calling more reliable for structured output than prompt instructions alone?

Key Ideas

**Few-shot learning:** providing examples of the desired input-output format in the prompt; few examples (2-10) dramatically improve performance on structured tasks
**Chain-of-thought:** instructing the model to reason step-by-step before reaching a conclusion; improves accuracy on math, logic, and multi-step reasoning tasks
**System prompts:** the hidden instruction layer that defines model persona, constraints, output format, and behavior - invisible to end users but determines overall product behavior
**Structured output:** constraining the model to produce JSON, XML, or other parseable formats using either prompt instructions or constrained decoding (grammar sampling)

Вопросы для размышления

Few-shot examples in a prompt consume tokens and add to latency and cost. At what point does few-shot prompting become expensive enough to justify fine-tuning the same examples into the model weights?
Chain-of-thought improves accuracy but increases output length and thus token cost. For a production system making 10 million API calls per day, what is the cost of adding "Let's think step by step" to every request?
System prompts are often treated as trade secrets. What prevents a competitor from extracting the system prompt via jailbreak attacks, and how does the industry handle this asymmetry between prompt complexity and prompt secrecy?

Связанные уроки

gai-07 — Aligned models respond reliably to prompt patterns
gai-17 — Prompting techniques drive tool and function calling
aie-06-prompt-patterns — Engineering view of the same prompt patterns
nlp-04 — Few-shot prompting reuses in-context language modeling
ml-43-hyperparameters — Prompt tuning is search over inputs, not weights
nlp-01