Convex Optimization
Bandit Optimization
Thompson proposed his algorithm back in 1933, but its optimality was proved only in 2012 by Chapelle and Li. For 80 years a simple Bayesian algorithm awaited rigorous analysis -- and turned out to be optimal by the Lai-Robbins bound.
- A/B testing: bandits allow adaptively routing traffic to the better version
- Clinical trials: bandit allocation of patients minimizes harm
- Ad auctions: LinUCB selects ads accounting for user context
Предварительные знания
UCB and Thompson Sampling
Lai and Robbins (1985) proved the lower bound on multi-armed bandit regret: R_T >= sum_{i: mu_i < mu*} (mu* - mu_i) * log(T) / KL(mu_i, mu*). UCB1 (Auer 2002) achieves this bound: select arm i* = argmax mu_hat_i + sqrt(2 log t / n_i). Thompson Sampling (1933, rediscovered Chapelle 2011) achieves the same regret Bayesianly.
What does the Lai-Robbins lower bound state for the multi-armed bandit?
Contextual Bandits and LinUCB
LinUCB (Li et al. 2010): at each round observe context x_t in R^d, reward r_t = theta^T x_{t,a} + epsilon. Estimate theta via Ridge regression, UCB = theta_hat^T x + alpha*sqrt(x^T A^{-1} x). Regret O(d*sqrt(T)) instead of O(sqrt(KT)). Foundation of Netflix and Yahoo! News recommendation systems.
What is the dimensional dependence of LinUCB regret?
Key Ideas
- UCB1: i* = argmax [mu_hat_i + sqrt(2 log t / n_i)], R_T = O(log T)
- Lai-Robbins lower bound: R_T >= sum Delta_i log T / KL(mu_i, mu*)
- Thompson Sampling: theta_i ~ Beta(alpha_i, beta_i), select argmax theta_i
- LinUCB: UCB = theta_hat^T x + alpha*sqrt(x^T A^{-1} x), R_T = O(d*sqrt(T))
- Contextual bandits: feature structure replaces arm enumeration
Further Directions
These ideas open paths to deeper mathematics.
- co-29-bayesian-opt — extends
Вопросы для размышления
- Give a concrete example.
- How does this connect to other areas of mathematics?