Automata and Cognition
Mirror of the Mind: Agent Self-Models
Lesson Objectives
- Understand the levels of self-modeling from reactive to meta-cognitive
- Implement confidence calibration and compute ECE
- Distinguish aleatoric and epistemic uncertainty and choose the right estimation method
- Apply the introspective loop (Chain-of-Thought, Self-Refine) in agents
Prerequisites
- MDP and decision-making under uncertainty
- Meta-learning: learning how to learn
- Basic neural network concepts and softmax
GPT-4 hallucinates roughly 20% of factual answers while sounding just as confident as when it is right. Chain-of-Thought prompting raises accuracy on math problems by 39 percentage points without retraining. The mechanism behind gains like this is the self-model.
- **OpenAI GPT-4 (2023)** - temperature scaling reduces model ECE from 0.12 to 0.04 in a single post-hoc step
- **DeepMind AlphaCode** - the model estimates confidence in each generated solution and selects top-k for submission
- **Tesla Autopilot** - explicitly models the boundaries of its capabilities and transfers control when epistemic uncertainty exceeds threshold
- **Anthropic Constitutional AI** - during training, the model critiques and revises its own answers against a list of principles
- **Meta Llama-3** - built-in calibration via RLHF reduced overconfident hallucination rate by 34%
From 'Know Thyself' to Machine Metacognition
"Know thyself" was inscribed at the entrance to the Temple of Apollo at Delphi. In 1979, psychologist John Flavell introduced the term metacognition - knowledge about one's own knowledge. In 2022, Wei et al. showed that Chain-of-Thought is a form of machine metacognition: a language model that makes its thinking explicit gains the ability to check it. From philosophy to engineering in 2400 years.
Levels of Self-Modeling
**GPT-4 generates confident answers 96% of the time - even when it is wrong.** Stanford HAI, 2023: language models without a self-assessment mechanism hallucinate approximately 20% of answers to factual questions, giving no signal of uncertainty. This is not a data or architecture problem - it is the absence of a self-model. An agent with a self-model knows what it can do, what it knows, and where it tends to fail.
**Self-model** - an agent's internal representation of itself: capabilities, limitations, current knowledge state, and typical error patterns. Not introspection for its own sake - it is a quality control mechanism for decisions.
| Level | Name | Capability | Example |
|---|---|---|---|
| 0 | Reactive | Stimulus -> response only | Thermostat, simple chatbot |
| 1 | Stateful | Remembers interaction history | Agent with conversation memory |
| 2 | Self-model | Explicitly represents own capabilities and limitations | LLM with uncertainty estimation |
| 3 | Meta-cognitive | Models and optimizes its own thinking process | Chain-of-Thought + self-critique |
Architecture of a Self-Aware Agent
Misconception: Self-modeling is introspection for its own sake - a philosophical concept with no practical value.
Reality: A self-model is an engineering quality control mechanism: the agent knows when to refuse an answer or request help.
Without a self-model, an agent cannot distinguish confident knowledge from blind spots. Result: equally confident correct and hallucinated answers - exactly what is observed in GPT without retrieval augmentation.
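A minimal sketch of how a self-model can act as a quality gate. Everything here is a hypothetical illustration rather than an established API: the `SelfModel` class, the `min_accuracy` and `min_samples` thresholds, and the three action names are assumptions chosen for the example. The agent keeps a per-topic track record of its own answers and uses it to decide whether to answer, abstain, or ask for help.

```python
from dataclasses import dataclass, field


@dataclass
class SelfModel:
    """Hypothetical self-model: a per-topic track record of the agent's answers."""
    min_accuracy: float = 0.7  # assumed threshold: below this, the agent abstains
    min_samples: int = 5       # assumed threshold: fewer observations = blind spot
    history: dict = field(default_factory=dict)  # topic -> (correct, total)

    def record(self, topic: str, was_correct: bool) -> None:
        """Update the track record after an answer has been verified."""
        correct, total = self.history.get(topic, (0, 0))
        self.history[topic] = (correct + int(was_correct), total + 1)

    def expected_accuracy(self, topic: str):
        """Return observed accuracy, or None if the topic is a blind spot."""
        correct, total = self.history.get(topic, (0, 0))
        if total < self.min_samples:
            return None
        return correct / total

    def decide(self, topic: str) -> str:
        """Map the self-assessment to an action."""
        accuracy = self.expected_accuracy(topic)
        if accuracy is None:
            return "ask_for_help"  # unknown territory: no basis for confidence
        if accuracy < self.min_accuracy:
            return "abstain"       # known weak spot
        return "answer"


model = SelfModel()
for outcome in [True, True, True, True, False]:
    model.record("arithmetic", outcome)
for outcome in [True, False, False, False, False]:
    model.record("tax_law", outcome)

print(model.decide("arithmetic"))  # answer       (4 of 5 correct)
print(model.decide("tax_law"))     # abstain      (1 of 5 correct)
print(model.decide("astronomy"))   # ask_for_help (never seen)
```

A stateful agent would store the same interaction history; the difference in this sketch is that the history is about the agent's own performance and directly changes what it does next.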
How does a level-1 agent (stateful) differ from a level-2 agent (self-model)?
Confidence Calibration
**OpenAI internal research 2022: GPT-3 said "I am confident" in 91% of cases but was correct only 71% of the time.** This is overconfidence - a systematic self-assessment error. A perfectly calibrated agent that says "I am 70% confident" should be correct exactly 70% of the time in such cases.
Calibration curve: poor vs good
- Poor (overconfident): says "90% confident" but is correct 60% of the time; says "70% confident" but is correct 55% of the time. The curve lies below the diagonal.
- Good (calibrated): says "90% confident" and is correct 89-91% of the time; says "50% confident" and is correct 48-52% of the time. The curve follows the diagonal.

ECE (Expected Calibration Error) is the weighted average of the deviations from the diagonal, with each confidence bin weighted by its share of the predictions. ECE < 0.05 is considered good calibration.
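The ECE definition above can be sketched in a few lines. This is one common variant, assuming equal-width confidence bins; the function name, the `n_bins` default, and the toy data are illustrative choices, not taken from a specific library.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width bins.

    confidences: stated confidence per answer, each in [0, 1]
    correct:     matching booleans (True if the answer was right)
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Gap between stated confidence and accuracy, weighted by bin size.
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Overconfident: always says 90%, but only 6 of 10 answers are right.
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
# Calibrated: says 90% and 9 of 10 answers are right.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False] * 1)

print(round(overconfident, 3))  # 0.3
print(round(calibrated, 3))     # 0.0
```

The result depends on the number of bins, so ECE values are only comparable when computed with the same binning.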
| Calibration Method | How It Works | When to Use |
|---|---|---|
| Temperature Scaling | Divides logits by T before softmax | Post-hoc, one parameter - simple |
| Platt Scaling | Logistic regression on top of outputs | Binary classification |
| Isotonic Regression | Monotonic nonlinear transformation | Enough data for fitting |
| MC Dropout | Inference with dropout active, N forward passes | Epistemic uncertainty estimation |
Misconception: A model's high confidence indicates high quality.
Reality: Quality is determined by calibration: the correspondence between stated confidence and actual accuracy.
A model trained on imbalanced data or with aggressive RLHF tuning often becomes overconfident. Temperature scaling can cut ECE substantially (for example, from 0.15 to 0.03) in about 30 minutes, without retraining the model.
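To make the one-parameter fix concrete, here is a hedged sketch of temperature scaling on held-out logits. It picks T by minimizing negative log-likelihood, but uses a simple grid search instead of a gradient-based optimizer to stay dependency-free; the function names, the candidate grid, and the toy validation set are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax over logits divided by T (max is subtracted for stability)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels at temperature T."""
    total = 0.0
    for logits, label in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[label])
    return total / len(labels)


def fit_temperature(logit_rows, labels, candidates=None):
    """Grid-search the T that minimizes validation NLL (assumed grid: 0.5 to 5.0)."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]
    return min(candidates, key=lambda t: mean_nll(logit_rows, labels, t))


# Toy validation set: the model always puts ~98% on class 0,
# but class 0 is the true label only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

T = fit_temperature(val_logits, val_labels)
before = softmax([4.0, 0.0])[0]
after = softmax([4.0, 0.0], T)[0]
print(f"T={T:.2f}, confidence before={before:.3f}, after={after:.3f}")
```

A fitted T above 1 signals overconfidence. Dividing all logits by the same positive T never changes which class has the highest score, so accuracy is unaffected; only the stated confidence moves.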
A model says "80% confident" on 100 different questions. With good calibration, how many answers should be correct?
Two Types of Uncertainty
**In medical AI diagnostics, the difference between the two types of uncertainty can be a matter of life and death.** If the model is uncertain because the patient has a rare disease it has seen few examples of (epistemic - more data can help), more tests are needed. If it is uncertain because the underlying biological process is stochastic (aleatoric - irreducible), a probabilistic decision must be made. Confusing the two types is a critical error.
| Type | Meaning | Source | Reducible? | Example |
|---|---|---|---|---|
| Aleatoric | World randomness | Data stochasticity | No | Coin flip, quantum effects |
| Epistemic | Model ignorance | Insufficient training data | Yes - more data | Rare disease, new domain |
**Practical rule:** collecting more data only makes sense when epistemic uncertainty is high. When aleatoric uncertainty dominates, more data will not help: the task itself is fundamentally stochastic. Confusing the two types wastes resources.
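One common way to separate the two types in practice is to run several stochastic forward passes (MC Dropout or an ensemble) and split the predictive entropy. The sketch below assumes the passes have already been collected as probability vectors; the function names and toy inputs are illustrative. Total uncertainty is the entropy of the averaged prediction, the aleatoric part is approximated by the average entropy of the individual passes, and the epistemic part is the remainder (the disagreement between passes).

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(sampled_probs):
    """sampled_probs: N probability vectors, one per stochastic forward pass."""
    n = len(sampled_probs)
    k = len(sampled_probs[0])
    mean_probs = [sum(row[i] for row in sampled_probs) / n for i in range(k)]

    total = entropy(mean_probs)                               # entropy of the mean
    aleatoric = sum(entropy(row) for row in sampled_probs) / n  # mean of entropies
    epistemic = total - aleatoric                             # disagreement term
    return {"total": total, "aleatoric": aleatoric, "epistemic": epistemic}


# Every pass agrees the outcome is a coin flip: noise in the world.
coin_flip = decompose_uncertainty([[0.5, 0.5]] * 4)

# Passes are individually confident but contradict each other: model ignorance.
new_domain = decompose_uncertainty([[0.99, 0.01], [0.01, 0.99]] * 2)

print({name: round(v, 3) for name, v in coin_flip.items()})
print({name: round(v, 3) for name, v in new_domain.items()})
```

In the first case almost all uncertainty is aleatoric, so more data would not help. In the second the epistemic term dominates, which is the situation where collecting data or asking for help pays off.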
A model forecasts next-day stock price. Which type of uncertainty dominates and why?
Introspective Loop and Self-Simulation
**Chain-of-Thought (Wei et al., NeurIPS 2022) improved PaLM 540B accuracy on GSM8K math word problems from 18% to 57% - simply by adding externalized reasoning to the prompt.** This is a form of introspection: by making thinking explicit, the model gains the ability to check and correct it. The introspective loop is an architectural pattern implementing reflection as a systemic mechanism.
**Self-Refine (Madaan et al., NeurIPS 2023):** iterative self-critique without additional training. The model generates an answer, then critiques it, then improves it. On coding tasks this produced a 13.5 percentage point improvement over baseline GPT-4.
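The generate, critique, improve cycle can be sketched as a small loop. This is an illustration under stated assumptions, not the authors' implementation: `generate`, `critique`, and `refine` are hypothetical callables (in practice, three prompts to the same model), and the `stop_marker` convention and `max_iters` limit are choices made for this example. Toy stand-ins replace the model so the sketch runs without an LLM.

```python
from typing import Callable, List, Tuple


def self_refine(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], str],
    refine: Callable[[str, str, str], str],
    max_iters: int = 3,
    stop_marker: str = "NO ISSUES",
) -> Tuple[str, List[str]]:
    """Iterate draft -> feedback -> revision until the critic is satisfied."""
    draft = generate(task)
    feedback_log: List[str] = []
    for _ in range(max_iters):
        feedback = critique(task, draft)
        feedback_log.append(feedback)
        if stop_marker in feedback:  # critic found nothing to fix
            break
        draft = refine(task, draft, feedback)
    return draft, feedback_log


# Toy stand-ins for the three model calls (hypothetical, for demonstration only).
def toy_generate(task: str) -> str:
    return "def add(a, b): return a - b"


def toy_critique(task: str, draft: str) -> str:
    if "a + b" in draft:
        return "NO ISSUES"
    return "Bug: the function subtracts instead of adding."


def toy_refine(task: str, draft: str, feedback: str) -> str:
    return draft.replace("a - b", "a + b")


final, log = self_refine("Write add(a, b)", toy_generate, toy_critique, toy_refine)
print(final)     # def add(a, b): return a + b
print(len(log))  # 2
```

The iteration cap matters: without it, a critic that never returns the stop marker would loop indefinitely, and each round costs extra model calls.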
| Technique | Core Idea | Improvement | Application |
|---|---|---|---|
| Chain-of-Thought | Explicit step-by-step reasoning | +39 pp math (PaLM 540B, GSM8K) | Logic, math |
| Self-Refine | Generate -> critique -> improve | +13.5 pp code (GPT-4) | Code, essays, problem solving |
| Constitutional AI | Self-critique and revision against principles, then RL from AI feedback | -57% harmful outputs | Safety, alignment |
| Reflexion | Verbal reinforcement via reflection | +20% HotpotQA | Multi-step tasks |
Misconception: Agent self-reflection is just re-asking the same question.
Reality: The introspective loop is a structured process with explicit self-critique, rollback on errors, and self-model updates from outcomes.
Simple repetition without a checking mechanism produces a similar answer with similar errors. Self-Refine works precisely because it includes critique with specific questions about reasoning quality.
Chain-of-Thought improved accuracy on math tasks from 18% to 57%. What is the primary mechanism of this improvement?
Questions for Reflection
- An agent knows its epistemic uncertainty on topic X is high. What three actions should it take instead of giving a confident answer - and how should each be implemented technically?