AI Engineering

Reasoning Models: o3, o4, Extended Thinking - How Next-Gen Models Think

Цели урока

Understand test-time compute scaling and how it differs from train-time scaling
Grasp the architecture of reasoning models (o1, o3, DeepSeek-R1)
Learn to identify tasks where reasoning models deliver a dramatic advantage
Master model routing and escalating reasoning patterns for production systems

o4-mini passes AIME at the 99th percentile of humans. Claude 4 Opus solves PhD-level chemistry and biology problems - not because it memorized the answer, but because it thinks out loud for several minutes. DeepSeek R2 does the same in open-source. This is not the future - this is production 2026. Reasoning models are already embedded in Cursor, GitHub Copilot, Notion AI. The question is not 'will they arrive' - but 'how many reasoning tokens does this specific task cost'.

o4-mini (OpenAI, 2025) in production: used in Cursor to analyze complex bug reports where standard generation produced errors
Claude 4.x Extended Thinking (Anthropic, 2025-2026) in production for code audits and architecture reviews - the thinking process is visible and budget-controllable
DeepSeek R2 - open-source reasoning at frontier level, available for self-hosting: reasoning without vendor lock-in is real
Reasoning token cost is the key 2026 metric: o4-mini is 10x cheaper than o3 at comparable quality on most tasks - routing by task type saves 70-90% of the budget

From Chain-of-Thought to Reasoning Models

Reasoning Engineering: Application Architecture with Reasoning Models

Reasoning models force a rethink of AI application architecture. Using one model for everything means either overpaying on simple tasks or underperforming on complex ones. The solution is **model routing**: a cheap classifier (gpt-4o-mini) determines query complexity and directs it to the appropriate model. Production systems like Cursor and GitHub Copilot already work this way.

The **reasoning_effort** parameter in o3-mini creates a controllable spectrum: `low` - fast and cheap (close to GPT-4o cost), `medium` - balanced, `high` - maximum thoroughness with a risk of 10,000+ reasoning tokens per request. The Escalating Reasoning pattern starts at `low` and steps up only when the model's confidence is low - cutting costs while maintaining reliability.

Reasoning models also flip the **prompt engineering** playbook - counterintuitively. For GPT-4o, detailed instructions ("think step by step", "first outline a plan") raise quality. For o1/o3 they are redundant: the model builds its own chain of reasoning via RLVR, and a wordy prompt can literally derail it. The optimal prompt for a reasoning model is a precise problem statement with no process instructions.

Prompt for GPT-4o (detailed) — Detailed instructions, step-by-step plan, examples, constraints - all of this improves quality. The system message is critical. Chain-of-thought must be explicitly requested: "Let's think step by step".
Prompt for o1/o3 (concise) — Brief problem statement. The model decides how to reason on its own. Excessive instructions can reduce quality. Chain-of-thought is built in - no need to ask. Better to focus on directly stating *what the desired output is*.

**Latency:** reasoning models are significantly slower. A request to o1 can take 10-60 seconds. For real-time chatbots, this is critical. Architectural solution - streaming partial results: show the user "Analyzing..." with a progress bar while the model thinks.

**State of 2026:** the boundary between reasoning and generation models has blurred. Claude 4 switches to Extended Thinking on request; o4-mini is available through the same API as gpt-4o-mini. Knowing *when* reasoning is needed is the main cost optimization lever.

What pattern is most effective for a production system handling diverse user queries?

Approach	What scales	Result	Example
Train-time scaling	Parameters, data, training compute	A smarter model overall	GPT-3 → GPT-4
Test-time scaling	Compute at inference (chains of reasoning)	Better on a specific task	GPT-4o → o1
Combined	Both directions	Maximum results	o3 (large model + reasoning)

Characteristic	GPT-4o	o1 / o1-pro	o3-mini
System message	Yes	No (user only)	Yes (developer role)
Temperature	0-2	Fixed (1)	Fixed (1)
Streaming	Yes	Limited	Yes
Reasoning effort	-	-	low / medium / high
Cost input (USD/1M)	2.50	15.00	1.10
Cost output (USD/1M)	10.00	60.00	4.40
Speed	~50 tok/s	~10-30s per task	~5-15s per task

Benchmark	GPT-4o	o1	o3	Task type
MMLU	88%	92%	96%	General knowledge, facts
GPQA (PhD-level science)	53%	78%	88%	Scientific reasoning
AIME 2024 (math olympiads)	13%	83%	96%	Mathematical logic
Codeforces rating	808	1807	2727	Algorithmic problems
SWE-bench Verified	33%	49%	71%	Real code bugs
ARC-AGI (abstract reasoning)	5%	32%	88%	Patterns and analogies

Reasoning Models: o3, o4, Extended Thinking - How Next-Gen Models Think

Цели урока

From Chain-of-Thought to Reasoning Models

Reasoning Models: o3, o4, Extended Thinking - How Next-Gen Models Think

Цели урока

From Chain-of-Thought to Reasoning Models

Предварительные знания

Test-Time Compute Scaling: Think Longer = Think Better

o1/o3 Architecture: Chain-of-Thought on Steroids

Reasoning vs Generation: Two Modes of One Model

Reasoning Engineering: Application Architecture with Reasoning Models

Key Ideas

What's Next

Связанные уроки