Automata and Cognition
Society of Minds
Lesson objectives
- Understand Theory of Mind levels from reactive (0) to Common Knowledge
- Know how Inverse RL infers agent goals from observed behavior
- Understand Nash and Stackelberg equilibria and their practical applications
- See how signaling games model the emergence of language
- Know CTDE (Centralized Training with Decentralized Execution) as a response to non-stationarity in multi-agent learning
Prerequisites
- Self-Models and Introspection (aut-09-self-models)
- MDP and Decision Making (aut-04-mdp)
- Bayesian Inference and HMM (aut-03-hmm)
A chess player thinks: "He thinks I'll move the queen, but he doesn't know that I know he thinks this". In 2017 the poker bot Libratus put a game-theoretic version of this reasoning into practice - and won about USD 1.7 million in chips from professionals.
- **AlphaStar (2019)**: Grandmaster in StarCraft II using CTDE - dozens of units coordinate through a single trained policy
- **Libratus (2017)**: poker bot whose play reflects Level 2 ToM - it accounts for the opponent's model of itself and continuously rebuilds its strategy
- **Autonomous vehicles**: Waymo systems predict pedestrian and driver intentions through Inverse RL on historical trajectories
- **YouTube recommendations**: an IRL-style problem at the scale of 2 billion users - behavior is decoded into reward functions for personalization
- **OpenAI Multi-Agent Particles (2017)**: agents developed their own coordination language without external definition - a signaling game in practice
From the Turing Test to Theory of Mind
The Turing Test (1950) can be read as an early test of Theory of Mind: can a machine simulate the beliefs and intentions of a human? The term Theory of Mind was introduced by Premack and Woodruff in 1978 while studying chimpanzees. The key question: does a chimpanzee understand that a human has goals different from its own? The classic test is the Sally-Anne task (1985): a child under 4 typically does not understand that another person can hold a false belief. After age 4, most children do. This is the critical milestone of ToM development in humans.
Theory of Mind: levels of recursion
**In 2017, the poker bot Libratus defeated four professional players in Heads-Up No-Limit Texas Hold'em, winning about USD 1.7 million in chips. Libratus didn't just calculate odds - its strategy accounted for what opponents thought about its play, and systematically exploited their models.** Theory of Mind - the ability to understand that other agents have their own beliefs, desires, and intentions - is the foundation of social intelligence.
**Theory of Mind (ToM)** - the ability to attribute mental states to other agents: beliefs, desires, intentions.
| Level | Description | AI Example |
|---|---|---|
| 0 - Reactive | Other agents = environment objects | Simple bot: sees enemy - shoots |
| 1 - Others' beliefs | "He thinks X" | Poker bot: he thinks I'm bluffing |
| 2 - Model of me | "He thinks that I think Y" | Libratus: opponent thinks I'm aggressive |
| 3+ | "He thinks that I think that he..." | Negotiations, diplomacy |
| Common Knowledge | Everyone knows that everyone knows... | Traffic lights, money, language |
ToM levels: from reactive to recursive
**Level 2 is bluffing in poker.** A player makes a large bet not because of good cards, but to make the opponent think the cards are strong. This is managing someone else's model of you - the key operation of social intelligence. **Common Knowledge** is the limit of this recursion: "Everyone knows that everyone knows that everyone knows". This is why traffic lights, money, and language work.
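The recursion in the table can be sketched with level-k reasoning, a common formalization from behavioral game theory that is used here as an analogy rather than as the lesson's own definition. In this minimal sketch, a Level 0 agent plays a fixed move and a Level k agent best-responds to a simulated Level k-1 opponent; the game (rock-paper-scissors) and all names are illustrative assumptions.

```python
# Hypothetical illustration: level-k reasoning in rock-paper-scissors.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}  # key beats value
COUNTER = {loser: winner for winner, loser in BEATS.items()}  # move -> move that beats it


def level_k_move(k: int, level0_move: str = "rock") -> str:
    """Move chosen by an agent that assumes its opponent reasons at level k-1."""
    if k == 0:
        return level0_move  # Level 0: no model of the opponent at all
    predicted = level_k_move(k - 1, level0_move)  # simulate the opponent's reasoning
    return COUNTER[predicted]  # best response to the predicted move


moves = [level_k_move(k) for k in range(4)]
print(moves)
```

Each extra level wraps one more "he thinks that I think" around the same best-response step, which is why deeper recursion costs more computation without changing the structure of the reasoning.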
The Blue-Eyed Islanders puzzle
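On an island, 100 people have blue eyes. Everyone can see everyone else's eyes but not their own, and no one speaks about eye color. Rule: if you learn your own eye color, you leave at midnight. A tourist says: "I see a person with blue eyes". Everyone already knew this! What changed is Common Knowledge: now everyone knows that everyone knows, to any depth of nesting, that there is a blue-eyed person. The argument is inductive: a lone blue-eyed islander would leave on night 1; with two, each sees the other stay and leaves on night 2; and so on. On the 100th night, all 100 leave.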
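The induction behind the puzzle can be sketched as a small simulation. It assumes perfectly rational islanders and tracks a single commonly known lower bound on the number of blue-eyed people; the function name and the data layout are hypothetical.

```python
def simulate_islanders(eye_colors):
    """Return (night, leavers): the first night anyone leaves and who leaves."""
    n_blue = sum(color == "blue" for color in eye_colors)
    if n_blue == 0:
        raise ValueError("the tourist's statement must be true")
    # Each islander sees every blue-eyed person except (possibly) themselves.
    seen = [n_blue - (color == "blue") for color in eye_colors]
    known_min = 1  # common knowledge after the announcement: at least one blue-eyed person
    night = 1
    while True:
        # Seeing fewer blue-eyed people than the commonly known minimum
        # means "the missing one must be me".
        leavers = [i for i, s in enumerate(seen) if s < known_min]
        if leavers:
            return night, leavers
        # Nobody left, so everyone learns that the true count exceeds known_min.
        known_min += 1
        night += 1


night, leavers = simulate_islanders(["blue", "brown", "blue", "blue", "brown"])
print(night, leavers)
```

With three blue-eyed islanders the departure happens on night 3; with 100 it happens on night 100, because each silent night raises the commonly known lower bound by one.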
Myth: Theory of Mind is just empathy or reading emotions
Fact: ToM is the formal modeling of beliefs, desires, and intentions of other agents
Empathy is an affective response. ToM is a cognitive operation: building a model of another agent's mental state and using it to predict behavior. This is why ToM can be formalized mathematically and implemented in AI.
A poker player bluffs - makes a large bet with bad cards. What level of Theory of Mind is involved?
Modeling other agents: Inverse RL and game theory
**How do you build a model of another agent?** Observed actions are a projection of a hidden reward function. **Inverse Reinforcement Learning (IRL)** reverses the problem: it infers goals from behavior. Recommendation systems such as YouTube's face the same problem at scale - the actions of 2 billion users are continuously decoded into preferences.
**Inverse RL**: observe behavioral trajectories of an agent → infer the reward function being maximized. Assumption: the agent is approximately optimal with respect to its hidden goal. Applications: imitation learning, human preference modeling, autonomous driving.
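The inference step can be illustrated with a toy Bayesian version of IRL. This is a sketch under strong assumptions, not a full IRL algorithm: there are only two candidate reward functions, the agent is modeled as Boltzmann-rational (approximately optimal), and the routes, feature values, and hypothesis names are all invented for the example.

```python
import math

# Hypothetical route features.
ROUTES = {
    "park": {"length": 12.0, "scenic": 1.0},
    "highway": {"length": 10.0, "scenic": 0.0},
}

# Candidate hidden reward functions (assumed, not learned).
REWARD_HYPOTHESES = {
    "shortest_path": lambda r: -r["length"],
    "likes_scenery": lambda r: -r["length"] + 5.0 * r["scenic"],
}


def choice_prob(route, reward_fn, beta=1.0):
    """Boltzmann-rational agent: P(route) is proportional to exp(beta * reward)."""
    weights = {name: math.exp(beta * reward_fn(f)) for name, f in ROUTES.items()}
    return weights[route] / sum(weights.values())


def posterior(observed, beta=1.0):
    """Posterior over reward hypotheses given observed choices (uniform prior)."""
    scores = {}
    for name, reward_fn in REWARD_HYPOTHESES.items():
        likelihood = math.prod(choice_prob(r, reward_fn, beta) for r in observed)
        scores[name] = likelihood / len(REWARD_HYPOTHESES)
    total = sum(scores.values())
    return {name: s / total for name, s in scores.items()}


beliefs = posterior(["park"] * 5)
print(beliefs)
```

After five observed trips through the park, almost all posterior mass sits on the hypothesis that the agent values scenery: a longer route is evidence about the reward function, not evidence of irrationality.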
Nash and Stackelberg: game theory formalism
**Nash Equilibrium** - a set of strategies where no agent can improve their outcome by unilateral deviation. The classic Prisoner's Dilemma shows the paradox: individually rational behavior leads to a collectively suboptimal outcome. Mutual defection (1,1) is the Nash equilibrium, even though mutual cooperation (3,3) is better for both players.
| Game type | Structure | Nash equilibrium |
|---|---|---|
| Prisoner's Dilemma | 2 players, cooperate or defect | Both defect - suboptimal |
| Coordination game | Payoff only if choices match | Multiple equilibria - selection problem |
| Zero-sum game | One's gain = other's loss | Minimax solution - unique game value |
| Stackelberg | Leader moves first, follower responds | Leader has commitment advantage |
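The rows of the table can be checked by brute force on small games. The sketch below finds pure-strategy Nash equilibria and a Stackelberg outcome in which the row player commits first. The (3,3) and (1,1) payoffs come from the text; the off-diagonal Prisoner's Dilemma payoffs (0 and 5) and the whole coordination game are assumed textbook values, and all function names are hypothetical.

```python
from itertools import product

# payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
PRISONERS_DILEMMA = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),  # assumed off-diagonal values
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}
COORDINATION = {  # assumed example with two equilibria
    ("opera", "opera"): (2, 1),
    ("opera", "football"): (0, 0),
    ("football", "opera"): (0, 0),
    ("football", "football"): (1, 2),
}


def actions(payoffs, player):
    """Sorted action names available to player 0 (row) or 1 (column)."""
    return sorted({profile[player] for profile in payoffs})


def pure_nash(payoffs):
    """All action profiles where neither player gains by deviating alone."""
    rows, cols = actions(payoffs, 0), actions(payoffs, 1)
    equilibria = []
    for r, c in product(rows, cols):
        row_ok = all(payoffs[(r, c)][0] >= payoffs[(alt, c)][0] for alt in rows)
        col_ok = all(payoffs[(r, c)][1] >= payoffs[(r, alt)][1] for alt in cols)
        if row_ok and col_ok:
            equilibria.append((r, c))
    return equilibria


def stackelberg(payoffs):
    """Row player commits first; the follower best-responds (ties: first action)."""
    rows, cols = actions(payoffs, 0), actions(payoffs, 1)
    best = None
    for r in rows:
        reply = max(cols, key=lambda c: payoffs[(r, c)][1])
        if best is None or payoffs[(r, reply)][0] > payoffs[best][0]:
            best = (r, reply)
    return best


pd_equilibria = pure_nash(PRISONERS_DILEMMA)
coord_equilibria = pure_nash(COORDINATION)
leader_outcome = stackelberg(COORDINATION)
print(pd_equilibria, coord_equilibria, leader_outcome)
```

In the coordination game the simultaneous-move analysis leaves two equilibria and no way to choose between them, while the committing leader simply picks the one it prefers. That is the commitment advantage from the last row of the table.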
Inverse RL observes that an agent always takes the route through the park, even when it's longer. What is the correct conclusion?
Communication: from signals to pragmatics
**Language emerged evolutionarily as a coordination mechanism.** In 2017, OpenAI researchers ran an experiment: agents in an environment had to coordinate actions. Without any instructions, they developed their own "language" - a signal system that both agents interpret the same way. Signaling games formalize this process.
**Signaling game (Lewis 1969)**: Sender knows the state of the world, Receiver must act. Sender sends a signal. Reward is shared - both benefit from correct interpretation. Through repeated interaction, a convention emerges: shared meaning without external definition.
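How a convention can emerge through repeated interaction may be sketched with simple urn-style reinforcement (often called Roth-Erev learning). The learning rule is an assumption added for illustration, not part of Lewis's original formulation, and the function name and parameters are hypothetical.

```python
import random


def train_signaling(n_states=2, rounds=20000, seed=0):
    """Sender and receiver reinforce whatever led to a shared success."""
    rng = random.Random(seed)
    options = range(n_states)
    # Urn weights: sender[state][signal] and receiver[signal][action], all start equal.
    sender = [[1.0] * n_states for _ in options]
    receiver = [[1.0] * n_states for _ in options]
    recent = []
    for t in range(rounds):
        state = rng.randrange(n_states)  # only the sender observes this
        signal = rng.choices(options, weights=sender[state])[0]
        action = rng.choices(options, weights=receiver[signal])[0]
        success = action == state  # shared reward: both win or both lose
        if success:
            sender[state][signal] += 1.0
            receiver[signal][action] += 1.0
        if t >= rounds - 1000:
            recent.append(success)
    return sender, receiver, sum(recent) / len(recent)


sender, receiver, success_rate = train_signaling()
print(f"success rate over the last 1000 rounds: {success_rate:.2f}")
```

Neither agent is told what a signal means. The mapping that ends up dominating the urn weights is arbitrary (either signal can come to stand for either state), which is what makes it a convention rather than a definition.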
Pragmatics: speaker simulates the listener
People don't speak literally: "Can you pass the salt?" is a request, not a question about capabilities. **Rational Speech Act (RSA)** models this mathematically: the speaker chooses an utterance not only for its literal truth, but for how the listener will interpret it. This requires at least ToM level 1: the speaker simulates the listener.
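The RSA recursion can be sketched on a standard reference game. This minimal version assumes a uniform prior over objects, no utterance costs, and rationality parameter alpha = 1; the objects, utterances, and function names are illustrative.

```python
# Hypothetical reference game: which object does the speaker mean?
OBJECTS = {
    "blue_square": {"blue", "square"},
    "blue_circle": {"blue", "circle"},
    "green_square": {"green", "square"},
}
UTTERANCES = ["blue", "green", "square", "circle"]


def normalize(scores):
    total = sum(scores.values())
    return {key: value / total for key, value in scores.items()}


def literal_listener(utterance):
    """L0: uniform over the objects the utterance is literally true of."""
    return normalize({obj: float(utterance in feats) for obj, feats in OBJECTS.items()})


def speaker(obj, alpha=1.0):
    """S1: prefers utterances that make L0 pick the intended object."""
    return normalize({u: literal_listener(u)[obj] ** alpha for u in UTTERANCES})


def pragmatic_listener(utterance, alpha=1.0):
    """L1: inverts the speaker model with Bayes' rule (uniform prior)."""
    return normalize({obj: speaker(obj, alpha)[utterance] for obj in OBJECTS})


literal = literal_listener("blue")
pragmatic = pragmatic_listener("blue")
print(literal)
print(pragmatic)
```

The literal listener splits "blue" evenly between the two blue objects. The pragmatic listener shifts toward the blue square, reasoning that a speaker who meant the blue circle would more likely have said "circle", which identifies it uniquely.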
Schelling Points: coordination without communication
Thomas Schelling (Nobel Prize 2005) showed: if two people are asked to meet in New York with no specified location, most choose Grand Central Station at noon. No one agreed on this. It is a Schelling Point - a salient focal point that agents choose through mutual modeling: "What would he choose, knowing I'm choosing the same thing?"
Myth: Agents must agree on a language in advance
Fact: Language as a coordination mechanism emerges evolutionarily through repeated interaction
Lewis (1969) formalized this: a stable convention is an equilibrium of a signaling game, with no external definition of meaning required. Later work on simple reinforcement learning in these games (e.g. Skyrms) showed that such conventions can emerge through repeated interaction, and OpenAI's Multi-Agent Particles experiments (2017) reproduced the effect empirically. Language is not a contract - it is a Nash equilibrium of a signaling game.
Why does pragmatic communication require Theory of Mind?
Multi-agent learning: CTDE
**AlphaStar (DeepMind, 2019) reached Grandmaster in StarCraft II, playing against humans in real time with multiple units simultaneously.** This is a multi-agent problem: dozens of units act in parallel, each seeing only its own surroundings. The naive approach - running independent Q-learning - breaks immediately.
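**The Independent Learners problem**: each agent learns as if the environment were static. But the environment changes because other agents are also learning. Each agent chases a "moving target", and this non-stationarity removes the convergence guarantees of Q-learning in the general case. **CTDE (Centralized Training with Decentralized Execution)** addresses this: during training, a centralized critic or mixer sees the observations and actions of all agents, so the learning target is stationary again; at execution time, each agent acts only on its own local observation.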
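The moving-target effect can be illustrated with a small calculation. The sketch assumes a 2x2 coordination game in which agent A earns 1 when its action matches agent B's; the snapshots of B's policy are invented numbers.

```python
def expected_value_for_a(action_a, prob_b_plays_0):
    """Expected payoff to A: 1 if the actions match, 0 otherwise."""
    p_match = prob_b_plays_0 if action_a == 0 else 1.0 - prob_b_plays_0
    return 1.0 * p_match


# Hypothetical snapshots of B's policy while B is still learning.
b_policy_over_time = [0.9, 0.6, 0.1]

# The "true" value of A's action 0 at each snapshot.
q_a_action0 = [expected_value_for_a(0, p) for p in b_policy_over_time]

# A's best action at each snapshot.
best_actions = [
    max((0, 1), key=lambda a: expected_value_for_a(a, p))
    for p in b_policy_over_time
]
print(q_a_action0, best_actions)
```

From A's point of view nothing in its own observations changed, yet the value it is trying to estimate drifted from 0.9 to 0.1 and the best action flipped. A Q-table that averages over these phases is fitting a target that no longer exists.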
| Algorithm | Approach | Use case |
|---|---|---|
| Independent Q-learning | Each agent learns separately | Simple tasks, non-stationarity issues |
| MADDPG | CTDE with deterministic policy | Continuous actions, mixed cooperative-competitive |
| QMIX | CTDE with monotonic Q-function mixing | Cooperative, value decomposition |
| MAPPO | CTDE with proximal policy optimization | Complex cooperative tasks |
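The monotonic-mixing idea in the QMIX row can be sketched as follows. This is a deliberately simplified linear mixer with fixed weights, not the QMIX architecture (which produces the mixing weights from the global state with a hypernetwork); the Q-values, weights, and names are assumptions for illustration.

```python
from itertools import product


def mix(agent_qs, weights, bias=0.0):
    """Monotonic mixer: Q_tot = sum_i |w_i| * Q_i + bias (non-negative weights)."""
    return sum(abs(w) * q for w, q in zip(weights, agent_qs)) + bias


# Per-agent Q-values over local actions (hypothetical numbers).
agent_q_tables = [
    {"left": 0.2, "right": 1.5},
    {"left": 0.7, "right": 0.1},
    {"stay": 0.3, "move": 0.4},
]
mixing_weights = [0.5, -2.0, 1.0]  # signs are discarded by abs() inside mix()

# Decentralized execution: each agent maximizes its own Q-values.
decentralized = tuple(max(q, key=q.get) for q in agent_q_tables)

# Centralized check: brute-force search over all joint actions for the best Q_tot.
joint_actions = product(*(q.keys() for q in agent_q_tables))
centralized = max(
    joint_actions,
    key=lambda joint: mix([q[a] for q, a in zip(agent_q_tables, joint)], mixing_weights),
)
print(decentralized, centralized)
```

Because Q_tot never decreases when any single agent's Q-value increases, the joint action that maximizes Q_tot is the one in which every agent greedily maximizes its own Q-values. This is what lets training use the centralized total while execution stays decentralized.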
Why is Independent Q-learning unstable in multi-agent environments?
Questions for reflection
- When is it beneficial for an agent to deliberately limit the depth of its Theory of Mind recursion - for example, to act like a Level 0 agent? How does this relate to Nash equilibrium in repeated games?