Statistical Learning Theory
Online Learning and Bandit Algorithms
UCB algorithm in Amazon A/B testing on 200M users achieves regret O(√(KT·log T)) vs O(T) for random selection. At T=200M tests, UCB wastes ~√200M ≈ 14000× less than random. LinUCB in Yahoo! News increased CTR by 12.5%. Google Ads uses contextual bandits for real-time bidding across 5B+ auctions per day. This is not academia , it is several billion dollars of annual revenue.
- **Amazon Explore:** multi-armed bandits for A/B testing UI changes across 300M users. No need to wait for statistical significance; optimal allocation in real time.
- **Google Ads Smart Bidding:** contextual LinUCB (with deep net features) for bids in RTB auctions. Handles 5M+ requests/sec.
- **Netflix thumbnails:** Thompson Sampling to choose the thumbnail for each user. 20%+ lift in browse click-through rate.
Online Convex Optimization: regret bounds and Follow-the-Leader
**UCB algorithm in Amazon A/B testing on 200M users achieves regret O(√(KT·log T)) vs O(T) for random selection.** Online Convex Optimization (OCO) is a game between an algorithm and an adversary: at each step t the algorithm picks w_t, receives loss function f_t(w_t), and the goal is to minimize cumulative regret. This generalizes stochastic SGD to the adversarial setting.
**OCO applications:** AdaGrad, Adam, RMSProp are FTRL with different regularizers. Online SVM, online logistic regression. Hedge (Multiplicative Weights) is optimal for the expert problem. All online advertising algorithms (Google, Meta) are OCO.
What regret does Online Gradient Descent achieve for convex losses over T steps?
OGD with lr = D/(G√T) achieves R_T ≤ DG√T. This is O(√T), which is the optimal minimax regret for online convex optimization with convex loss functions. For strongly convex losses it improves to O(log T).
Multi-Armed Bandits: UCB1 and regret theory
**Multi-Armed Bandit (MAB)**: sequential decision-making with partial feedback. K arms, at each step choose one arm and receive a stochastic reward. Goal: balance exploration and exploitation. UCB1 is theoretically optimal with proven regret O(√(KT log T)).
**Variants:** Thompson Sampling (Bayesian UCB) samples from the posterior, achieves O(√(KT log T)) regret, and performs better in practice. EXP3 handles adversarial bandits with O(√(KT log K)). Gradient Bandits are REINFORCE in the bandit setting.
What is the expected regret of UCB1 over T steps with K arms?
UCB1 has instance-dependent regret O(∑ log(T)/Δᵢ) and worst-case O(√(KT log T)). The Auer et al. lower bound is Ω(√(KT)), so UCB1 is optimal up to a log T factor.
Contextual Bandits: LinUCB for recommendations
**Contextual bandits** extend MAB: at each step a context x_t is observed (user/query features), the algorithm picks action a_t and receives reward r_t. LinUCB (Li et al., 2010) assumes a linear reward model and builds a confidence ellipsoid in feature space. Used in Yahoo! News, Google Search, and Netflix.
**Production deployment:** Yahoo! News (2010) used LinUCB for news feed personalization, achieving +12.5% CTR. Google Ads uses contextual bandits (with deep net features) for real-time bidding. Netflix uses Thompson Sampling to pick thumbnails per user, achieving 20%+ lift in CTR.
How does contextual bandits differ from standard multi-armed bandit?
In contextual bandits, at each step t a context x_t is observed (e.g. user features), and the algorithm picks action a_t using x_t to receive reward r_t. This enables personalization: different arms for different users. Standard MAB has no context.
Key ideas
- **OCO regret:** R_T = cumulative loss minus best fixed solution. OGD achieves O(√T) for convex losses, which is optimal. FTRL with the right regularizer yields AdaGrad.
- **UCB1:** select arm with max(hat_mu + sqrt(2 ln t / N_k)). E[R_T] ≤ ∑ 8 ln(T)/Δᵢ + const. Worst-case O(√(KT log T)).
- **Lai-Robbins lower bound:** impossible to beat Ω(∑ ln(T)/Δᵢ). UCB1 is optimal up to a constant.
- **LinUCB:** reward model r = θ_a^T x. UCB via confidence ellipsoid sqrt(x^T A^{-1} x). Regret O(d√(T log³(KT))).
- **Thompson Sampling:** Bayesian alternative to UCB. Better empirically, same theoretical bounds.
Related topics
Online learning and bandits bridge to reinforcement learning:
- Online regret (lesson 12) — Introduction to online learning with Hedge and Weighted Majority
- Reinforcement Learning — Bandits are RL without states or transitions
Связанные уроки
- lt-12-online-regret — Lesson 12 introduced online learning; here we prove full bounds and cover bandits
- lt-18-vc-sample-complexity — PAC bounds and regret bounds are two regimes of generalization theory
- ml-48-rl-intro — Bandits are the simplest RL without states