Convex Optimization

Bandit Optimization

Thompson proposed his algorithm back in 1933, but its optimality was proved only in 2012 by Chapelle and Li. For 80 years a simple Bayesian algorithm awaited rigorous analysis -- and turned out to be optimal by the Lai-Robbins bound.

A/B testing: bandits allow adaptively routing traffic to the better version
Clinical trials: bandit allocation of patients minimizes harm
Ad auctions: LinUCB selects ads accounting for user context

Предварительные знания

Previous lesson

UCB and Thompson Sampling

Lai and Robbins (1985) proved the lower bound on multi-armed bandit regret: R_T >= sum_{i: mu_i < mu*} (mu* - mu_i) * log(T) / KL(mu_i, mu*). UCB1 (Auer 2002) achieves this bound: select arm i* = argmax mu_hat_i + sqrt(2 log t / n_i). Thompson Sampling (1933, rediscovered Chapelle 2011) achieves the same regret Bayesianly.

What does the Lai-Robbins lower bound state for the multi-armed bandit?

Contextual Bandits and LinUCB

LinUCB (Li et al. 2010): at each round observe context x_t in R^d, reward r_t = theta^T x_{t,a} + epsilon. Estimate theta via Ridge regression, UCB = theta_hat^T x + alpha*sqrt(x^T A^{-1} x). Regret O(d*sqrt(T)) instead of O(sqrt(KT)). Foundation of Netflix and Yahoo! News recommendation systems.

What is the dimensional dependence of LinUCB regret?

Key Ideas

UCB1: i* = argmax [mu_hat_i + sqrt(2 log t / n_i)], R_T = O(log T)
Lai-Robbins lower bound: R_T >= sum Delta_i log T / KL(mu_i, mu*)
Thompson Sampling: theta_i ~ Beta(alpha_i, beta_i), select argmax theta_i
LinUCB: UCB = theta_hat^T x + alpha*sqrt(x^T A^{-1} x), R_T = O(d*sqrt(T))
Contextual bandits: feature structure replaces arm enumeration

Further Directions

These ideas open paths to deeper mathematics.

co-29-bayesian-opt — extends

Вопросы для размышления

Give a concrete example.
How does this connect to other areas of mathematics?