Probability Theory
Advanced Probabilistic Graphical Models
Цели урока
- Understand HMM structure and the forward-backward, Viterbi, and Baum-Welch algorithms
- Master CRFs as a discriminative generalization of HMMs with arbitrary features
- Analyze mean-field variational inference and its link to ELBO
- Connect variational message passing with belief propagation on graphs
Предварительные знания
- Bayes' theorem and posterior distributions
- KL divergence and information theory
- Markov chains and dynamic programming
- Gaussian processes and de Finetti's theorem
How are probabilistic models with billions of hidden variables trained, and why do the same tools power spaCy NER and Facebook Graph Inference?
- **HMM in speech:** Apple Siri and Google Voice before the neural era processed 95% of mobile speech with HMM and Baum-Welch
- **CRF in NLP:** spaCy and Stanford CoreNLP use BiLSTM-CRF for NER and POS tagging; the SOTA until 2018
- **Profile HMMs:** HMMER applies HMM to protein family search; Glimmer/GENSCAN use Viterbi for gene prediction
- **Variational message passing:** Facebook Graph Inference and YouTube recommendations scale inference to billions of nodes via VMP
Hidden Markov Models (HMM)
In 1966 Leonard Baum and Ted Petrie introduced HMMs for speech recognition. By 2010 these models powered 95% of mobile speech recognition in Apple Siri and Google Voice. The HMM structure is simple: a hidden state z_t evolves as a Markov chain and each generates an observation x_t. The forward-backward and Viterbi algorithms are standard tools in NLP, bioinformatics, and finance.
HMMs remain relevant in bioinformatics: profile HMMs are used in HMMER for protein family search, and Viterbi in gene prediction (Glimmer, GENSCAN). In finance HMMs model market regimes (bull/bear) as hidden states.
What is the complexity of the forward-backward algorithm for an HMM with N states and T observations?
Conditional Random Fields (CRF)
In 2001 Lafferty, McCallum, and Pereira proposed CRFs as a discriminative generalization of HMMs. Their key advantage: CRFs model p(z|x) directly, bypassing p(x). This allows using arbitrary features of observations without modeling their distribution. Before 2018 the BiLSTM-CRF was state-of-the-art for named entity recognition, and this approach is still used in spaCy for tagging.
BiLSTM-CRF (Huang et al. 2015) is the architecture for sequence tagging: BiLSTM extracts contextual representations, the CRF models transition constraints between labels. Before the transformer era this was the standard in NER and POS tagging.
What is the principal difference between a CRF and an HMM?
Variational Message Passing
For complex PGMs exact inference is typically infeasible (NP-hard for loopy graphs). Variational message passing generalizes belief propagation to mean-field approximations: each node exchanges 'messages' with its neighbors, updating a local approximation of the posterior. This scales to billions of nodes in Facebook Graph Inference and YouTube recommendations.
Variational methods unite probability and deep learning
ELBO, message passing, and graphical models underlie modern Bayesian machine learning.
- Information theory — ELBO = E_q[log p(x,z) - log q(z)] is the optimization analog of mutual information; KL divergence is central
- Large deviations — ELBO = -F (variational free energy); the minimum free energy principle in stat mech is analogous to maximizing ELBO
- Exchangeability and Bayesian inference — Variational inference is a numerical method for computing the posterior p(z|x), whose existence is guaranteed by de Finetti's theorem
- Infinite-dimensional probability — Diffusion models are measures on image spaces; mathematically these are measures on Banach spaces
Итоги
- **HMM:** hidden state z_t as a Markov chain, observation x_t from z_t; forward-backward O(T*N^2), Viterbi for MAP
- **Baum-Welch:** EM for HMM; E-step via forward-backward, M-step by matrix re-estimation
- **CRF:** discriminative model p(z|x); arbitrary features of x; training via max-likelihood (convex objective)
- **BiLSTM-CRF:** SOTA for NER/POS tagging before the transformer era; used in spaCy and Stanford CoreNLP
- **Mean-field VI:** q = prod q_i; updates q_i* propto exp(E_{q_{-i}}[log p])
- **ELBO:** L(q) = log p(x) - KL(q||p(z|x)); maximizing ELBO equals minimizing KL
- **Loopy BP:** exact on trees, approximate on cyclic graphs; the basis of scalable inference
Why does mean-field variational inference reduce to iterative updates of marginals?