Statistical Learning Theory
Meta-Learning Theory
MAML learns to learn quickly - adapting to a new task in a few gradient steps. When is this theoretically justified? PAC-Bayes for meta-learning: the bound depends on the number of tasks T and the complexity of the meta-parameter space.
- **MAML and few-shot recognition:** Omniglot: 1623 characters, 20 examples each. MAML on 5-way 1-shot reaches 98.7%; the meta-generalization theorem explains why T=600 tasks is sufficient
- **Personalization in medicine:** Meta-learning across many patients for fast adaptation to a new one: the bilevel bound formalizes how many patients are needed for meta-generalization guarantees
- **Few-shot NLP with LLMs:** Fine-tuning large language models is a form of meta-learning: meta-parameters (pretrained weights) adapt to new tasks in a few steps
- **Robustness to task distribution:** Task stability guarantees the meta-algorithm is not overfit to training tasks and generalizes to new ones from the same distribution
Предварительные знания
- Information-theoretic learning theory
- PAC-Bayes bounds
- Algorithmic stability
MAML and its theoretical guarantees
The MAML algorithm (Finn, Abbeel, Levine, 2017) achieves 98.7% accuracy on Omniglot after 100,000 few-shot training episodes, adapting to a new class with just 1-5 examples. The meta-generalization theorem: after training on T tasks, generalization to a new task is bounded via sqrt(VC_meta / T). With T = 600 tasks on Omniglot the bound decays as 1/24.
What problem does MAML (Model-Agnostic Meta-Learning) solve?
Finn-Abbeel-Levine (2017): MAML minimises E_T[L_T(theta - eta * grad L_T(theta))] - the average loss after k SGD steps on task T. theta* is a "launchpad" in parameter space. Khodak-Balcan-Talwalkar (2019) showed MAML is equivalent to online learning with a proximity regulariser.
Few-shot learning bounds
What is the sample complexity for few-shot meta-learning?
Baxter (2000), Tripuraneni-Jordan-Jin (2020): meta-learning with T previous tasks and n examples per task gives few-shot error of order sqrt(C(F)/n) + sqrt(D(repr) / T*n), where C(F) is the functional layer complexity and D(repr) is the representation complexity. Large T amortises representation learning.
Transfer learning and domain adaptation
Meta-generalization as an experienced teacher moving to a new school - A teacher (meta-algorithm) taught T classes (training tasks). Experience is the meta-parameters theta. Moving to a new school (new task) with a few new students (few-shot), the teacher adapts quickly. The theorem: more schools T and a simpler pedagogical strategy space mean better adaptation. Meta-learning is not magic: its theoretical guarantees follow from standard PAC arguments applied to the two-level task structure.
What is the Ben-David domain adaptation bound?
Ben-David et al. (2010): R_T(h) <= R_S(h) + d_{H Δ H}(D_S, D_T) + lambda, where lambda is the joint error of the best h* on S and T. Domain adaptation requires both small joint error and matched distributions over H-images. DANN (Ganin et al. 2016) explicitly minimises d_{H Δ H} via adversarial training.
Connections
Meta-learning theory combines PAC-Bayes, algorithmic stability, and information-theoretic methods in a two-level structure.
- PAC-Bayes — Related topic
- Algorithmic stability — Related topic
- Transfer learning — Related topic
- Information-theoretic bounds — Related topic
Итоги
- Meta-learning: given T tasks, find meta-parameters theta for fast adaptation to new tasks
- Bilevel PAC bound: meta-level O(sqrt(comp(Theta)/T)) plus task-level O(sqrt(VC(H)/m))
- Task stability beta_task: replacing one task changes theta by at most beta - guarantees meta-generalization
- MAML: implicit regularization via bounded inner-loop steps improves task stability
- PAC-Bayes for meta-learning (Amit et al. 2018): KL(Q_meta || P_meta) governs the rate of meta-generalization
What governs meta-generalization in meta-learning theory?
Bilevel PAC bound: meta-generalization decays as O(sqrt(comp(Theta)/T)) across tasks and O(sqrt(VC/m)) across examples.