Statistical Learning Theory

Meta-Learning Theory

MAML learns to learn quickly - adapting to a new task in a few gradient steps. When is this theoretically justified? PAC-Bayes for meta-learning: the bound depends on the number of tasks T and the complexity of the meta-parameter space.

**MAML and few-shot recognition:** Omniglot: 1623 characters, 20 examples each. MAML on 5-way 1-shot reaches 98.7%; the meta-generalization theorem explains why T=600 tasks is sufficient
**Personalization in medicine:** Meta-learning across many patients for fast adaptation to a new one: the bilevel bound formalizes how many patients are needed for meta-generalization guarantees
**Few-shot NLP with LLMs:** Fine-tuning large language models is a form of meta-learning: meta-parameters (pretrained weights) adapt to new tasks in a few steps
**Robustness to task distribution:** Task stability guarantees the meta-algorithm is not overfit to training tasks and generalizes to new ones from the same distribution

Предварительные знания

Information-theoretic learning theory
PAC-Bayes bounds
Algorithmic stability

Information-Theoretic Learning Theory

MAML and its theoretical guarantees

The MAML algorithm (Finn, Abbeel, Levine, 2017) achieves 98.7% accuracy on Omniglot after 100,000 few-shot training episodes, adapting to a new class with just 1-5 examples. The meta-generalization theorem: after training on T tasks, generalization to a new task is bounded via sqrt(VC_meta / T). With T = 600 tasks on Omniglot the bound decays as 1/24.

What problem does MAML (Model-Agnostic Meta-Learning) solve?

Finn-Abbeel-Levine (2017): MAML minimises E_T[L_T(theta - eta * grad L_T(theta))] - the average loss after k SGD steps on task T. theta* is a "launchpad" in parameter space. Khodak-Balcan-Talwalkar (2019) showed MAML is equivalent to online learning with a proximity regulariser.

Few-shot learning bounds

What is the sample complexity for few-shot meta-learning?

Baxter (2000), Tripuraneni-Jordan-Jin (2020): meta-learning with T previous tasks and n examples per task gives few-shot error of order sqrt(C(F)/n) + sqrt(D(repr) / T*n), where C(F) is the functional layer complexity and D(repr) is the representation complexity. Large T amortises representation learning.

Transfer learning and domain adaptation

Meta-generalization as an experienced teacher moving to a new school - A teacher (meta-algorithm) taught T classes (training tasks). Experience is the meta-parameters theta. Moving to a new school (new task) with a few new students (few-shot), the teacher adapts quickly. The theorem: more schools T and a simpler pedagogical strategy space mean better adaptation. Meta-learning is not magic: its theoretical guarantees follow from standard PAC arguments applied to the two-level task structure.

What is the Ben-David domain adaptation bound?

Ben-David et al. (2010): R_T(h) <= R_S(h) + d_{H Δ H}(D_S, D_T) + lambda, where lambda is the joint error of the best h* on S and T. Domain adaptation requires both small joint error and matched distributions over H-images. DANN (Ganin et al. 2016) explicitly minimises d_{H Δ H} via adversarial training.

Connections

Meta-learning theory combines PAC-Bayes, algorithmic stability, and information-theoretic methods in a two-level structure.

PAC-Bayes — Related topic
Algorithmic stability — Related topic
Transfer learning — Related topic
Information-theoretic bounds — Related topic

Итоги

Meta-learning: given T tasks, find meta-parameters theta for fast adaptation to new tasks
Bilevel PAC bound: meta-level O(sqrt(comp(Theta)/T)) plus task-level O(sqrt(VC(H)/m))
Task stability beta_task: replacing one task changes theta by at most beta - guarantees meta-generalization
MAML: implicit regularization via bounded inner-loop steps improves task stability
PAC-Bayes for meta-learning (Amit et al. 2018): KL(Q_meta || P_meta) governs the rate of meta-generalization

What governs meta-generalization in meta-learning theory?

Bilevel PAC bound: meta-generalization decays as O(sqrt(comp(Theta)/T)) across tasks and O(sqrt(VC/m)) across examples.