Automata and Cognition

Mirror of the Mind: Agent Self-Models

Lesson objectives

  • Understand the levels of self-modeling from reactive to meta-cognitive
  • Implement confidence calibration and compute ECE
  • Distinguish aleatoric and epistemic uncertainty and choose the right estimation method
  • Apply the introspective loop (Chain-of-Thought, Self-Refine) in agents

Prerequisites

  • MDP and decision-making under uncertainty
  • Meta-learning: learning how to learn
  • Basic neural network concepts and softmax

GPT-4 hallucinates roughly 20% of factual answers while sounding just as confident as when it is right. Chain-of-Thought prompting raised accuracy on math problems by 39 percentage points without retraining. The mechanism behind both observations is the self-model.

  • **OpenAI GPT-4 (2023)** - temperature scaling reduces model ECE from 0.12 to 0.04 in a single post-hoc step
  • **DeepMind AlphaCode** - the model estimates confidence in each generated solution and selects top-k for submission
  • **Tesla Autopilot** - explicitly models the boundaries of its capabilities and transfers control when epistemic uncertainty exceeds a threshold
  • **Anthropic Constitutional AI** - Claude applies self-critique against a list of principles before the final answer
  • **Meta Llama-3** - built-in calibration via RLHF reduced overconfident hallucination rate by 34%

From 'Know Thyself' to Machine Metacognition

"Know thyself" was inscribed at the entrance to the Temple of Apollo at Delphi. In 1979, psychologist John Flavell introduced the term metacognition - knowledge about one's own knowledge. In 2022, Wei et al. showed that Chain-of-Thought is a form of machine metacognition: a language model that makes its thinking explicit gains the ability to check it. From philosophy to engineering in 2400 years.

Levels of Self-Modeling

**GPT-4 generates confident answers 96% of the time - even when it is wrong.** Stanford HAI, 2023: language models without a self-assessment mechanism hallucinate approximately 20% of answers to factual questions, giving no signal of uncertainty. This is not a data or architecture problem - it is the absence of a self-model. An agent with a self-model knows what it can do, what it knows, and where it tends to fail.

**Self-model** - an agent's internal representation of itself: capabilities, limitations, current knowledge state, and typical error patterns. It is not introspection for its own sake but a quality-control mechanism for decisions.

| Level | Name | Capability | Example |
|---|---|---|---|
| 0 | Reactive | Stimulus -> response only | Thermostat, simple chatbot |
| 1 | Stateful | Remembers interaction history | Agent with conversation memory |
| 2 | Self-model | Explicitly represents own capabilities and limitations | LLM with uncertainty estimation |
| 3 | Meta-cognitive | Models and optimizes its own thinking process | Chain-of-Thought + self-critique |

Architecture of a Self-Aware Agent

**Myth:** Self-modeling is introspection for its own sake - a philosophical concept with no practical value.

**Reality:** A self-model is an engineering quality-control mechanism: the agent knows when to refuse an answer or request help.

Without a self-model, an agent cannot distinguish confident knowledge from blind spots. Result: equally confident correct and hallucinated answers - exactly what is observed in GPT without retrieval augmentation.
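The idea of an agent that knows when to answer, ask for help, or refuse can be sketched as a small data structure. This is only an illustration under stated assumptions: the `SelfModel` class, the `decide` function, the per-topic competence scores, and the thresholds 0.8 and 0.5 are all hypothetical and do not come from any particular system.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SelfModel:
    """Hypothetical self-model: the agent's estimate of its own competence per topic."""

    # Estimated probability of answering correctly, keyed by topic (assumed values).
    competence: Dict[str, float] = field(default_factory=dict)
    # Competence assumed for topics the agent has no record of.
    default_competence: float = 0.3

    def expected_accuracy(self, topic: str) -> float:
        return self.competence.get(topic, self.default_competence)

    def update(self, topic: str, was_correct: bool, rate: float = 0.1) -> None:
        """Move the competence estimate a small step toward the observed outcome."""
        old = self.expected_accuracy(topic)
        target = 1.0 if was_correct else 0.0
        self.competence[topic] = old + rate * (target - old)


def decide(model: SelfModel, topic: str,
           answer_at: float = 0.8, ask_at: float = 0.5) -> str:
    """Return 'answer', 'ask_for_help', or 'refuse' from self-assessed competence."""
    p = model.expected_accuracy(topic)
    if p >= answer_at:
        return "answer"
    if p >= ask_at:
        return "ask_for_help"
    return "refuse"


agent = SelfModel(competence={"python": 0.9, "tax_law": 0.6})
for topic in ("python", "tax_law", "quantum_chemistry"):
    print(topic, "->", decide(agent, topic))
```

A stateful (level-1) agent would only keep the conversation history; the extra `competence` table and the decision rule on top of it are what make this a level-2 sketch.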

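A level-1 agent (stateful) differs from a level-2 agent (self-model) in that the former only remembers interaction history, while the latter explicitly represents its own capabilities and limitations.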

Confidence Calibration

**OpenAI internal research 2022: GPT-3 said "I am confident" in 91% of cases but was correct only 71% of the time.** This is overconfidence - a systematic self-assessment error. A perfectly calibrated agent that says "I am 70% confident" should be correct exactly 70% of the time in such cases.

Calibration curve: poor vs good

Poor (overconfident): says "90% confident" -> correct 60% of the time; says "70% confident" -> correct 55% of the time. The curve lies below the diagonal.

Good (calibrated): says "90% confident" -> correct 89-91% of the time; says "50% confident" -> correct 48-52% of the time. The curve matches the diagonal.

ECE (Expected Calibration Error) is the weighted average of deviations from the diagonal. ECE < 0.05 is considered good calibration.
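A minimal sketch of how ECE can be computed, assuming the common binned estimator: predictions are grouped into equal-width confidence bins, and ECE is the sum over bins of (n_b / N) * |accuracy_b - mean_confidence_b|. The function name and the choice of 10 bins are assumptions made for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between confidence and accuracy.

    confidences: stated confidence per prediction, each in [0, 1]
    correct:     True/False per prediction
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Group predictions into equal-width bins over [0, 1].
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(accuracy - mean_conf)
    return ece


# The overconfident model described above: "90%" -> 60% correct, "70%" -> 55% correct.
confs = [0.9] * 100 + [0.7] * 100
hits = [True] * 60 + [False] * 40 + [True] * 55 + [False] * 45
print(expected_calibration_error(confs, hits))
```

With 100 predictions at each confidence level, the overconfident example gives an ECE of about 0.225 (0.5 * 0.30 + 0.5 * 0.15), far above the 0.05 rule of thumb.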

| Calibration Method | How It Works | When to Use |
|---|---|---|
| Temperature Scaling | Divides logits by T before softmax | Post-hoc, one parameter - simple |
| Platt Scaling | Logistic regression on top of outputs | Binary classification |
| Isotonic Regression | Monotonic nonlinear transformation | Enough data for fitting |
| MC Dropout | Inference with dropout active, N forward passes | Epistemic uncertainty estimation |

**Myth:** A model's high confidence indicates high quality.

**Reality:** Quality is determined by calibration: the correspondence between stated confidence and actual accuracy.

A model trained on imbalanced data or with aggressive RLHF tuning often becomes overconfident. Temperature scaling can reduce ECE from 0.15 to 0.03 in about 30 minutes, without retraining.
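Temperature scaling can be sketched in a few lines. The version below assumes a held-out validation set of logits and labels and fits the single parameter T by a plain grid search over the negative log-likelihood; real implementations usually use a gradient-based optimizer instead. The function names and the grid range (0.5 to 5.0) are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature T (T > 1 softens the distribution)."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    total = 0.0
    for logits, y in zip(logit_rows, labels):
        total -= math.log(softmax(logits, temperature)[y])
    return total / len(labels)


def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature from a grid that minimizes validation NLL."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5, 0.55, ..., 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))


# Toy validation set: the model always outputs the same confident logits,
# but the predicted class (index 0) is right only 7 times out of 10.
val_logits = [[4.0, 0.0]] * 10
val_labels = [0] * 7 + [1] * 3

T = fit_temperature(val_logits, val_labels)
print("fitted T:", T)
print("confidence before:", max(softmax([4.0, 0.0])))
print("confidence after: ", max(softmax([4.0, 0.0], T)))
```

Because dividing all logits by the same positive T never changes which class has the largest logit, accuracy stays the same; only the stated confidence moves toward the observed 70% hit rate.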

A model says "80% confident" on 100 different questions. With good calibration, how many answers should be correct?

Two Types of Uncertainty

**In medical AI diagnostics, the difference between two types of uncertainty is literally life and death.** If the model is uncertain because the patient has a rare disease (epistemic: more data can help), more tests are needed. If it is uncertain because the biological process itself is stochastic (aleatoric: irreducible), a probabilistic decision must be made. Confusing these two types is a critical error.

| Type | Name | Source | Reducible? | Example |
|---|---|---|---|---|
| Aleatoric | World randomness | Data stochasticity | No | Coin flip, quantum effects |
| Epistemic | Model ignorance | Insufficient training data | Yes - more data | Rare disease, new domain |

**Practical rule:** collecting more data only makes sense when epistemic uncertainty is high. When aleatoric uncertainty is high, more data will not help: the task is fundamentally stochastic. Confusing these types means wasting resources.
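One common way to separate the two types in practice, sketched here under assumptions, is to collect several predictive distributions for the same input (for example from MC Dropout passes or an ensemble) and split the entropy: total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of the individual predictions, and the epistemic part is the difference (the disagreement between passes). The function names and the toy distributions below are illustrative only.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def decompose_uncertainty(sampled_probs):
    """Split predictive entropy into aleatoric and epistemic parts.

    sampled_probs: list of class-probability vectors for ONE input,
                   one vector per stochastic forward pass or ensemble member.
    Returns (total, aleatoric, epistemic).
    """
    m = len(sampled_probs)
    k = len(sampled_probs[0])
    mean_probs = [sum(row[j] for row in sampled_probs) / m for j in range(k)]

    total = entropy(mean_probs)                               # entropy of the average
    aleatoric = sum(entropy(row) for row in sampled_probs) / m  # average entropy
    epistemic = total - aleatoric                             # disagreement between passes
    return total, aleatoric, epistemic


# Coin-flip-like input: every pass agrees the outcome is 50/50.
coin = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Unfamiliar input: each pass is confident, but they contradict each other.
novel = [[0.99, 0.01], [0.01, 0.99]]

print("coin :", decompose_uncertainty(coin))
print("novel:", decompose_uncertainty(novel))
```

In the coin case all the uncertainty is aleatoric and the epistemic term is zero, so more data would not help. In the second case the passes disagree, the epistemic term dominates, and collecting more data is the sensible response.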

A model forecasts next-day stock price. Which type of uncertainty dominates and why?

Introspective Loop and Self-Simulation

**Chain-of-Thought (Wei et al., NeurIPS 2022) improved PaLM 540B accuracy on GSM8K math word problems from 18% to 57% - simply by adding externalized reasoning.** This is a form of introspection: by making thinking explicit, the model gains the ability to check and correct it. The introspective loop is an architectural pattern implementing reflection as a systemic mechanism.

**Self-Refine (Madaan et al., NeurIPS 2023):** iterative self-critique without additional training. The model generates an answer, then critiques it, then improves it. On coding tasks this produced a 13.5 percentage point improvement over baseline GPT-4.

| Technique | Core Idea | Improvement | Application |
|---|---|---|---|
| Chain-of-Thought | Explicit step-by-step reasoning | +39 pp math (PaLM 540B) | Logic, math |
| Self-Refine | Generate -> critique -> improve | +13.5 pp code (GPT-4) | Code, essays, problem solving |
| Constitutional AI | Self-critique against principles, then RL from AI feedback | -57% harmful outputs | Safety, alignment |
| Reflexion | Verbal reinforcement via reflection | +20% HotpotQA | Multi-step tasks |

**Myth:** Agent self-reflection is just re-asking the same question.

**Reality:** The introspective loop is a structured process with explicit self-critique, rollback on errors, and self-model updates from outcomes.

Simple repetition without a checking mechanism produces a similar answer with similar errors. Self-Refine works precisely because it includes critique with specific questions about reasoning quality.
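The generate -> critique -> improve cycle can be sketched as a plain loop. This is not the Self-Refine authors' code: the three callables stand in for model calls, their signatures are assumptions, and the toy functions below are deterministic placeholders so the example runs without any LLM. Rollback and self-model updates are left out to keep the sketch short.

```python
from typing import Callable, Optional


def self_refine(task: str,
                generate: Callable[[str], str],
                critique: Callable[[str, str], Optional[str]],
                refine: Callable[[str, str, str], str],
                max_rounds: int = 3) -> str:
    """Iteratively improve a draft until the critic has no objections.

    generate(task)                -> first draft
    critique(task, draft)         -> feedback text, or None if the draft is acceptable
    refine(task, draft, feedback) -> improved draft
    """
    draft = generate(task)
    for _ in range(max_rounds):
        feedback = critique(task, draft)
        if feedback is None:  # critic found nothing to fix: stop early
            break
        draft = refine(task, draft, feedback)
    return draft


# Toy stand-ins for model calls (hypothetical, for illustration only).
def toy_generate(task: str) -> str:
    return "2 + 2 = 5"


def toy_critique(task: str, draft: str) -> Optional[str]:
    if draft.endswith("= 4"):
        return None
    return "The arithmetic is wrong: recompute the sum step by step."


def toy_refine(task: str, draft: str, feedback: str) -> str:
    return "2 + 2 = 4"


print(self_refine("Compute 2 + 2", toy_generate, toy_critique, toy_refine))
```

The difference from simply re-asking is visible in the structure: the critic produces specific feedback, and the refinement step receives both the previous draft and that feedback rather than starting from scratch.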

Chain-of-Thought improved accuracy on math tasks from 18% to 57%. What is the primary mechanism of this improvement?

Questions for reflection

  • An agent knows its epistemic uncertainty on topic X is high. What three actions should it take instead of giving a confident answer - and how should each be implemented technically?

Related lessons

  • ml-01-intro