Automata and Cognition

Mirror of the Mind: Agent Self-Models

Lesson objectives

Understand the levels of self-modeling from reactive to meta-cognitive
Implement confidence calibration and compute ECE
Distinguish aleatoric and epistemic uncertainty and choose the right estimation method
Apply the introspective loop (Chain-of-Thought, Self-Refine) in agents

Prerequisites

MDP and decision-making under uncertainty
Meta-learning: learning how to learn
Basic neural network concepts and softmax

GPT-4 hallucinates about 20% of factual answers with the same confidence as its correct ones. Chain-of-Thought prompting raised accuracy on math problems by 39 percentage points without retraining. The mechanism that connects the two is the self-model.

**OpenAI GPT-4 (2023)** - temperature scaling reduces model ECE from 0.12 to 0.04 in a single post-hoc step
**DeepMind AlphaCode** - the model estimates confidence in each generated solution and selects top-k for submission
**Tesla Autopilot** - explicitly models the boundaries of its capabilities and transfers control when epistemic uncertainty exceeds threshold
**Anthropic Constitutional AI** - Claude applies self-critique against a list of principles before the final answer
**Meta Llama-3** - built-in calibration via RLHF reduced overconfident hallucination rate by 34%

From 'Know Thyself' to Machine Metacognition

"Know thyself" was inscribed at the entrance to the Temple of Apollo at Delphi. In 1979, psychologist John Flavell introduced the term metacognition - knowledge about one's own knowledge. In 2022, Wei et al. showed that Chain-of-Thought is a form of machine metacognition: a language model that makes its thinking explicit gains the ability to check it. From philosophy to engineering in 2400 years.

Levels of Self-Modeling

**GPT-4 generates confident answers 96% of the time - even when it is wrong.** Stanford HAI, 2023: language models without a self-assessment mechanism hallucinate approximately 20% of answers to factual questions, giving no signal of uncertainty. This is not a data or architecture problem - it is the absence of a self-model. An agent with a self-model knows what it can do, what it knows, and where it tends to fail.

**Self-model** - an agent's internal representation of itself: capabilities, limitations, current knowledge state, and typical error patterns. Not introspection for its own sake - it is a quality control mechanism for decisions.

Level	Name	Capability	Example
0	Reactive	Stimulus -> response only	Thermostat, simple chatbot
1	Stateful	Remembers interaction history	Agent with conversation memory
2	Self-model	Explicitly represents own capabilities and limitations	LLM with uncertainty estimation
3	Meta-cognitive	Models and optimizes its own thinking process	Chain-of-Thought + self-critique

Architecture of a Self-Aware Agent

Misconception: self-modeling is introspection for its own sake - a philosophical concept with no practical value.

Reality: a self-model is an engineering quality control mechanism. The agent knows when to refuse an answer or request help.

Without a self-model, an agent cannot distinguish confident knowledge from blind spots. Result: equally confident correct and hallucinated answers - exactly what is observed in GPT without retrieval augmentation.
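A minimal sketch of how a self-model can gate decisions, assuming the agent keeps a per-topic competence estimate in [0, 1]. The class `SelfModel`, its thresholds, and the three action names are hypothetical and chosen only for illustration; a real agent would derive competence from calibrated confidence or evaluation results.

```python
from dataclasses import dataclass, field


@dataclass
class SelfModel:
    """Hypothetical self-model: per-topic competence estimates in [0, 1]."""
    competence: dict = field(default_factory=dict)
    default_competence: float = 0.5   # assumed prior for topics never seen
    answer_threshold: float = 0.8     # assumed cut-off for answering directly
    help_threshold: float = 0.4       # assumed cut-off below which the agent refuses

    def decide(self, topic: str) -> str:
        """Return 'answer', 'ask_for_help', or 'refuse' for the given topic."""
        c = self.competence.get(topic, self.default_competence)
        if c >= self.answer_threshold:
            return "answer"
        if c >= self.help_threshold:
            return "ask_for_help"
        return "refuse"

    def update(self, topic: str, was_correct: bool, rate: float = 0.2) -> None:
        """Move the competence estimate toward the observed outcome."""
        c = self.competence.get(topic, self.default_competence)
        target = 1.0 if was_correct else 0.0
        self.competence[topic] = c + rate * (target - c)


model = SelfModel(competence={"python": 0.9, "tax_law": 0.2})
print(model.decide("python"))     # answer
print(model.decide("tax_law"))    # refuse
print(model.decide("chemistry"))  # ask_for_help (falls back to the 0.5 prior)
```

The `update` method is what separates this from a static capability list: outcomes feed back into the estimates, so the agent's picture of its own blind spots changes with experience.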

How does a level-1 agent (stateful) differ from a level-2 agent (self-model)?

Confidence Calibration

**OpenAI internal research 2022: GPT-3 said "I am confident" in 91% of cases but was correct only 71% of the time.** This is overconfidence - a systematic self-assessment error. A perfectly calibrated agent that says "I am 70% confident" should be correct exactly 70% of the time in such cases.

Calibration curve: poor vs good

Poor (overconfident): says "90% confident" -> correct 60% of the time. Says "70% confident" -> correct 55% of the time. Curve lies below the diagonal.

Good (calibrated): says "90% confident" -> correct 89-91% of the time. Says "50% confident" -> correct 48-52% of the time. Curve matches the diagonal.

ECE (Expected Calibration Error) = weighted average of deviations from the diagonal. ECE < 0.05 is considered good calibration.
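The weighted average behind ECE can be sketched in a few lines. This version assumes equal-width confidence bins and weights each bin by its share of the predictions; the function name and the choice of 10 bins are illustrative, not taken from the lesson.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin size / N) * |accuracy - mean confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins on [0, 1]; a confidence of 1.0 goes into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue  # empty bins carry no weight
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - avg_conf)
    return ece


# Overconfident: claims 90% on ten answers, but only six are right.
overconfident = expected_calibration_error([0.9] * 10, [True] * 6 + [False] * 4)
# Calibrated: claims 90% on ten answers, nine are right.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False] * 1)
print(round(overconfident, 3), round(calibrated, 3))  # approximately 0.3 and 0.0
```

With only one populated bin, ECE reduces to the gap between stated confidence and accuracy in that bin, which matches the "distance from the diagonal" reading of the calibration curve.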

Calibration Method	How It Works	When to Use
Temperature Scaling	Divides logits by T before softmax	Post-hoc, one parameter - simple
Platt Scaling	Logistic regression on top of outputs	Binary classification
Isotonic Regression	Monotonic nonlinear transformation	Enough data for fitting
MC Dropout	Inference with dropout active, N forward passes	Epistemic uncertainty estimation
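As an illustration of the first row, here is a sketch of temperature scaling that fits the single parameter T on held-out logits by minimizing negative log-likelihood. A simple grid search stands in for the gradient-based optimizer usually used in practice; the function names, the grid range, and the toy validation data are all assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / T, computed stably by subtracting the max."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given T."""
    loss = 0.0
    for logits, y in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[y])
    return loss / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the T from a candidate grid that minimizes validation NLL."""
    grid = grid or [0.5 + 0.05 * i for i in range(91)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set: the model is ~98% confident in class 0 every time,
# but class 0 is the right answer only 3 times out of 4.
val_logits = [[4.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]

t = fit_temperature(val_logits, val_labels)
print(t > 1.0)                                # True: T > 1 softens overconfidence
print(round(softmax(val_logits[0], t)[0], 2)) # close to the 0.75 accuracy
```

Dividing all logits by the same positive T never changes which class has the highest score, so accuracy is unaffected; only the stated confidence moves.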

Misconception: a model's high confidence indicates high quality.

Reality: quality is determined by calibration, the correspondence between stated confidence and actual accuracy.

A model trained on imbalanced data or with aggressive RLHF tuning often becomes overconfident. Temperature scaling can reduce ECE, for example from 0.15 to 0.03, in about 30 minutes without retraining.

A model says "80% confident" on 100 different questions. With good calibration, how many answers should be correct?

Two Types of Uncertainty

**In medical AI diagnostics, the difference between two types of uncertainty is literally life and death.** If the model is uncertain because the patient has a rare disease, the uncertainty is epistemic: more data can help, so more tests are needed. If it is uncertain because the biological process itself is stochastic, the uncertainty is aleatoric and irreducible, so a probabilistic decision must be made. Mixing these two types is a critical error.

Type	Name	Source	Reducible?	Example
Aleatoric	World randomness	Data stochasticity	No	Coin flip, quantum effects
Epistemic	Model ignorance	Insufficient training data	Yes - more data	Rare disease, new domain

**Practical rule:** collecting more data only makes sense when epistemic uncertainty is high. When aleatoric uncertainty is high, more data will not help: the task is fundamentally stochastic. Confusing these types wastes resources.
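One common way to separate the two types in practice is to run N stochastic forward passes (for example with MC Dropout) and compare the entropy of the averaged prediction with the average entropy of the individual predictions. The sketch below assumes the N class distributions are already available as plain lists; the function names are illustrative.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def decompose_uncertainty(samples):
    """Split predictive uncertainty given N sampled class distributions.

    total     = entropy of the averaged prediction
    aleatoric = average entropy of the individual predictions
    epistemic = total - aleatoric (disagreement between passes)
    """
    n = len(samples)
    k = len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(k)]
    total = entropy(mean)
    aleatoric = sum(entropy(s) for s in samples) / n
    return total, aleatoric, total - aleatoric


# Every pass says "50/50": the passes agree, so the doubt is in the data.
print(decompose_uncertainty([[0.5, 0.5], [0.5, 0.5]]))

# Passes are individually certain but contradict each other: the model
# does not know, so the doubt is epistemic.
print(decompose_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
```

In the first case the epistemic term is zero and collecting more data is unlikely to help; in the second the aleatoric term is zero and the whole uncertainty comes from disagreement between passes, which is the situation where more data or escalation pays off.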

A model forecasts next-day stock price. Which type of uncertainty dominates and why?

Introspective Loop and Self-Simulation

**Chain-of-Thought (Wei et al., NeurIPS 2022) improved PaLM 540B accuracy on the GSM8K math benchmark from 18% to 57% - simply by adding externalized reasoning.** This is a form of introspection: by making thinking explicit, the model gains the ability to check and correct it. The introspective loop is an architectural pattern that implements reflection as a systemic mechanism.

**Self-Refine (Madaan et al., NeurIPS 2023):** iterative self-critique without additional training. The model generates an answer, then critiques it, then improves it. On coding tasks this produced a 13.5 percentage point improvement over baseline GPT-4.
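The generate -> critique -> improve cycle can be sketched as a small control loop. The three callables stand for model calls and are hypothetical, as are the stopping rule (the critic returns `None` when satisfied) and the iteration cap; the toy functions below only make the loop runnable and are not how an LLM would be invoked.

```python
from typing import Callable, Optional


def self_refine(
    task: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Optional[str]],
    refine: Callable[[str, str, str], str],
    max_iters: int = 3,
) -> str:
    """Generate once, then alternate critique and refinement."""
    answer = generate(task)
    for _ in range(max_iters):
        feedback = critique(task, answer)
        if feedback is None:  # critic found nothing to fix: stop early
            break
        answer = refine(task, answer, feedback)
    return answer


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_generate(task):
    return "def add(a, b): return a - b"


def toy_critique(task, answer):
    return "uses '-' instead of '+'" if "a - b" in answer else None


def toy_refine(task, answer, feedback):
    return answer.replace("a - b", "a + b")


print(self_refine("write add(a, b)", toy_generate, toy_critique, toy_refine))
# def add(a, b): return a + b
```

The iteration cap matters: a critic that always finds something to change would otherwise loop forever, and each pass costs another model call.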

Technique	Core Idea	Improvement	Application
Chain-of-Thought	Explicit step-by-step reasoning	+39 pp math (PaLM 540B, GSM8K)	Logic, math
Self-Refine	Generate -> critique -> improve	+13.5 pp code (GPT-4)	Code, essays, problem solving
Constitutional AI	Self-critique and revision against principles, then RL from AI feedback	-57% harmful outputs	Safety, alignment
Reflexion	Verbal reinforcement via reflection	+20% HotpotQA	Multi-step tasks

Misconception: agent self-reflection is just re-asking the same question.

Reality: the introspective loop is a structured process with explicit self-critique, rollback on errors, and self-model updates from outcomes.

Simple repetition without a checking mechanism produces a similar answer with similar errors. Self-Refine works precisely because it includes critique with specific questions about reasoning quality.

Chain-of-Thought improved accuracy on math tasks from 18% to 57%. What is the primary mechanism of this improvement?

Questions for reflection

An agent knows its epistemic uncertainty on topic X is high. What three actions should it take instead of giving a confident answer - and how should each be implemented technically?

Related lessons

ml-01-intro