AI Engineering
Reasoning Models: o3, o4, Extended Thinking - How Next-Gen Models Think
Цели урока
- Understand test-time compute scaling and how it differs from train-time scaling
- Grasp the architecture of reasoning models (o1, o3, DeepSeek-R1)
- Learn to identify tasks where reasoning models deliver a dramatic advantage
- Master model routing and escalating reasoning patterns for production systems
o4-mini passes AIME at the 99th percentile of humans. Claude 4 Opus solves PhD-level chemistry and biology problems - not because it memorized the answer, but because it thinks out loud for several minutes. DeepSeek R2 does the same in open-source. This is not the future - this is production 2026. Reasoning models are already embedded in Cursor, GitHub Copilot, Notion AI. The question is not 'will they arrive' - but 'how many reasoning tokens does this specific task cost'.
- o4-mini (OpenAI, 2025) in production: used in Cursor to analyze complex bug reports where standard generation produced errors
- Claude 4.x Extended Thinking (Anthropic, 2025-2026) in production for code audits and architecture reviews - the thinking process is visible and budget-controllable
- DeepSeek R2 - open-source reasoning at frontier level, available for self-hosting: reasoning without vendor lock-in is real
- Reasoning token cost is the key 2026 metric: o4-mini is 10x cheaper than o3 at comparable quality on most tasks - routing by task type saves 70-90% of the budget
From Chain-of-Thought to Reasoning Models
In January 2022, the Google Brain team published "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (Jason Wei et al.). A simple trick - showing the model a few worked examples with their reasoning steps written out - boosted PaLM 540B accuracy on the GSM8K math benchmark from about 18% to 57%. This was the first hint: LLMs *can* reason, they just don't do it by default. OpenAI took it further - training a model to *always* reason via RL, creating o1. The reasoning model race had begun.
Предварительные знания
Test-Time Compute Scaling: Think Longer = Think Better
The traditional path to better LLMs is more parameters and more data. But in 2024 a second vector emerged: **test-time compute scaling** - allocating more computation *at inference time*, not at training. OpenAI shipped this in o1; DeepMind formalized it mathematically in "Scaling LLM Test-Time Compute Optimally".
An analogy from cognitive science: Daniel Kahneman in "Thinking, Fast and Slow" described two modes of thinking. **System 1** - fast, automatic, intuitive (like a standard LLM: receives a prompt → immediately produces an answer). **System 2** - slow, step-by-step, analytical (like a reasoning model: receives a prompt → thinks → verifies → answers).
DeepMind's paper formalized a **scaling law for inference**: growing compute at generation time yields predictable quality gains - especially on tasks with verifiable answers (math, code). For such tasks, a mid-sized model with large test-time compute outperforms a 10x larger model without it.
| Approach | What scales | Result | Example |
|---|---|---|---|
| Train-time scaling | Parameters, data, training compute | A smarter model overall | GPT-3 → GPT-4 |
| Test-time scaling | Compute at inference (chains of reasoning) | Better on a specific task | GPT-4o → o1 |
| Combined | Both directions | Maximum results | o3 (large model + reasoning) |
**Practical significance:** test-time compute scaling means the same model can operate in two modes - fast (cheap) for simple tasks and slow (expensive) for complex ones. This fundamentally changes the economics of AI applications.
What is the core idea behind test-time compute scaling?
o1/o3 Architecture: Chain-of-Thought on Steroids
OpenAI o1 (September 2024) and o3 (February 2025) are the first commercial reasoning models. The core difference from GPT-4o: the model is trained to generate a **hidden chain of reasoning** before the final answer. DeepSeek-R1 (January 2025) reproduced the same approach in open-source, proving the pattern is replicable without secret architectures.
The training technique is **RLVR** (Reinforcement Learning with Verifiable Rewards): the model receives reward not for the quality of intermediate reasoning but for the correctness of the final answer. This is what drives it to independently discover backtracking, hypothesis testing, and problem decomposition. DeepSeek-R1 used RLVR openly - unlike o1, whose training details remain secret.
| Characteristic | GPT-4o | o1 / o1-pro | o3-mini |
|---|---|---|---|
| System message | Yes | No (user only) | Yes (developer role) |
| Temperature | 0-2 | Fixed (1) | Fixed (1) |
| Streaming | Yes | Limited | Yes |
| Reasoning effort | - | - | low / medium / high |
| Cost input (USD/1M) | 2.50 | 15.00 | 1.10 |
| Cost output (USD/1M) | 10.00 | 60.00 | 4.40 |
| Speed | ~50 tok/s | ~10-30s per task | ~5-15s per task |
**Reasoning tokens are billed as output tokens.** A request to o1 requiring 2,000 reasoning tokens + 200 output tokens costs as much as 2,200 output tokens. On complex math problems, reasoning can consume 10,000+ tokens - that's USD 0.60 per single request to o1.
Within six months of o1, the entire market realigned: Gemini 2.0 Flash Thinking, Claude Extended Thinking (Anthropic), Grok-3 Think (xAI) - all shipped chain-of-thought reasoning variants. DeepSeek-R1 did it in open-source with a publicly described training process. What started as a research lab experiment became a commodity feature in under a year.
Why are reasoning models more expensive than standard LLMs when used via API?
Reasoning vs Generation: Two Modes of One Model
Not every task needs deep reasoning. Running o1 to generate marketing copy is like renting a supercomputer to add two numbers: same result, 10x the price. Knowing the boundary between **reasoning** tasks (logic, math, complex code) and **generation** tasks (copy, translation, summarization) is one of the core AI engineering skills.
The benchmark numbers make the boundary concrete. On MMLU (general knowledge) the gap between GPT-4o and o1 is near-noise: 88% vs 92%. On AIME 2024 (US math olympiads) it's a chasm: GPT-4o scores 13%, o1 - 83%, o3 - 96%. On Codeforces, GPT-4o holds an 808 rating; o3 reaches 2727 (International Grandmaster). Verifiable tasks with clear correctness criteria are the native habitat of reasoning models.
| Benchmark | GPT-4o | o1 | o3 | Task type |
|---|---|---|---|---|
| MMLU | 88% | 92% | 96% | General knowledge, facts |
| GPQA (PhD-level science) | 53% | 78% | 88% | Scientific reasoning |
| AIME 2024 (math olympiads) | 13% | 83% | 96% | Mathematical logic |
| Codeforces rating | 808 | 1807 | 2727 | Algorithmic problems |
| SWE-bench Verified | 33% | 49% | 71% | Real code bugs |
| ARC-AGI (abstract reasoning) | 5% | 32% | 88% | Patterns and analogies |
**The ARC-AGI benchmark** (created by Francois Chollet, the creator of Keras) was specifically designed as a test for generalization - tasks that can't be solved by memorizing patterns. o3's 88% on ARC-AGI sparked serious debate: is this genuine reasoning, or very good pattern recognition?
For which task would a reasoning model (o1/o3) provide the greatest advantage over standard GPT-4o?
Reasoning Engineering: Application Architecture with Reasoning Models
Reasoning models force a rethink of AI application architecture. Using one model for everything means either overpaying on simple tasks or underperforming on complex ones. The solution is **model routing**: a cheap classifier (gpt-4o-mini) determines query complexity and directs it to the appropriate model. Production systems like Cursor and GitHub Copilot already work this way.
The **reasoning_effort** parameter in o3-mini creates a controllable spectrum: `low` - fast and cheap (close to GPT-4o cost), `medium` - balanced, `high` - maximum thoroughness with a risk of 10,000+ reasoning tokens per request. The Escalating Reasoning pattern starts at `low` and steps up only when the model's confidence is low - cutting costs while maintaining reliability.
Reasoning models also flip the **prompt engineering** playbook - counterintuitively. For GPT-4o, detailed instructions ("think step by step", "first outline a plan") raise quality. For o1/o3 they are redundant: the model builds its own chain of reasoning via RLVR, and a wordy prompt can literally derail it. The optimal prompt for a reasoning model is a precise problem statement with no process instructions.
- Prompt for GPT-4o (detailed) — Detailed instructions, step-by-step plan, examples, constraints - all of this improves quality. The system message is critical. Chain-of-thought must be explicitly requested: "Let's think step by step".
- Prompt for o1/o3 (concise) — Brief problem statement. The model decides how to reason on its own. Excessive instructions can reduce quality. Chain-of-thought is built in - no need to ask. Better to focus on directly stating *what the desired output is*.
**Latency:** reasoning models are significantly slower. A request to o1 can take 10-60 seconds. For real-time chatbots, this is critical. Architectural solution - streaming partial results: show the user "Analyzing..." with a progress bar while the model thinks.
**State of 2026:** the boundary between reasoning and generation models has blurred. Claude 4 switches to Extended Thinking on request; o4-mini is available through the same API as gpt-4o-mini. Knowing *when* reasoning is needed is the main cost optimization lever.
What pattern is most effective for a production system handling diverse user queries?
Key Ideas
- Test-time compute scaling: DeepMind proved that for complex tasks, giving a model more time to reason beats training a 10x larger model
- o1/o3 and DeepSeek-R1 generate thousands of hidden reasoning tokens via RLVR - rewarding correct final answers, not intermediate steps
- Reasoning tokens are billed as output - a single complex o1 request can easily cost USD 0.60; budget leaks silently without monitoring
- AIME 2024: GPT-4o scores 13%, o1 scores 83% - but on MMLU the gap is minimal; pick the model for the task, not by default
- Model routing via a gpt-4o-mini classifier cuts production system costs by 5-10x without sacrificing quality on complex queries
- GPT-4o, Claude 3.5, Gemini 1.5 - the non-reasoning generation - gave way to thinking-native models: o4-mini vs gpt-4o-mini, Claude Extended Thinking mode vs standard Sonnet
What's Next
Reasoning models are one path toward more general AI. The next lessons explore other directions: world models for understanding the physical world and the path to AGI through scaling laws.
- World Models — Reasoning about language → reasoning about the physical world
- The Path to AGI — Reasoning models as one step toward general intelligence
- Model Routing — Practical patterns for routing between models
Связанные уроки
- aie-17-agent-fundamentals — Reasoning models extend chain-of-thought agent loops
- aie-22-model-routing — Route to reasoning models only when it pays
- aie-29-cost-management — Thinking budgets directly drive token cost
- aie-65-alignment-rlhf-dpo — Reasoning is trained via RL on reasoning traces
- ml-50-policy-gradient — Test-time search resembles policy optimization over steps
- ml-01