Convex Optimization
Online Convex Optimization: learning on a stream
In 2003 Martin Zinkevich proved that simple projected gradient descent achieves $O(\sqrt{T})$ regret against any adversarial opponent. Six years later Yahoo Research deployed contextual bandits to production news ranking. Today Vowpal Wabbit trains on billions of examples online with no batch passes, and at Spotify each track skip updates the recommender within milliseconds. Online Convex Optimization is not a theoretical abstraction - it is the engineering language of streaming ML.
- **Vowpal Wabbit (Microsoft Research):** trained online from inception, single-pass on terabyte ad-click datasets, model updates in microseconds per example
- **Spotify recommendations:** contextual bandits on skip signals and audio features, online updates of track and user embeddings after each listening session
- **Yahoo News article ranking (Li et al. 2010):** LinUCB as the first production contextual bandit deployment, A/B test showed a 12.5 percent CTR lift over batch models
Online setting: data arrives one at a time
The online setting differs radically from batch. At round $t$ the algorithm picks a point $x_t \in K$ BEFORE seeing the loss function $f_t$. After the choice, $f_t$ is revealed, the penalty $f_t(x_t)$ is paid, and round $t+1$ begins. There is no chance to recompute past moves: decisions are made in real time.
Why online matters: streaming data (Twitter trends update each second), non-stationarity (today's click distribution differs from yesterday's), big data (one pass is cheaper than storing and retraining).
Vowpal Wabbit from Microsoft Research was built online from day one: the model updates per example without storing the dataset. In Netflix recommender systems, user embeddings are recomputed after each watch event - a textbook online update with billions of rounds per day.
What separates the online setting from batch?
Regret bounds: what online learning success means
Regret is the central metric of online learning: $R_T = \sum_{t=1}^T f_t(x_t) - \min_{x^* \in K} \sum_{t=1}^T f_t(x^*)$. The comparison is against the best fixed point $x^*$ chosen in hindsight with full knowledge of all $f_t$. The algorithm aims to keep $R_T$ small.
Online Gradient Descent (Zinkevich 2003): $x_{t+1} = \Pi_K(x_t - \eta_t \nabla f_t(x_t))$. With $\eta_t = D / (G \sqrt{t})$ it achieves $R_T = O(\sqrt{T})$ regret for convex Lipschitz losses, and $O(\log T)$ for strongly convex ones.
No-regret property: $R_T / T \to 0$. The average loss of the algorithm converges to the loss of the best fixed solution. The lower bound $\Omega(\sqrt{T})$ from Abernethy-Bartlett-Rakhlin (2008) shows that beating $\sqrt{T}$ for general convex losses is impossible without extra structure.
Connection to SGD: Stochastic Gradient Descent is a special case of OGD where $f_t$ is sampled from a distribution. Regret analysis explains SGD generalization: a regret bound translates into an excess-risk bound through online-to-batch conversion (Cesa-Bianchi-Conconi-Gentile 2004).
What does the no-regret property mean?
Mirror descent: a universal no-regret algorithm
Mirror Descent (Nemirovski-Yudin 1983) generalizes OGD: instead of Euclidean projection it uses a Bregman divergence $D_\phi(x, y) = \phi(x) - \phi(y) - \langle \nabla \phi(y), x - y \rangle$ generated by a mirror map $\phi$. Update: $x_{t+1} = \arg\min_{x \in K} [\langle \nabla f_t(x_t), x \rangle + \frac{1}{\eta} D_\phi(x, x_t)]$.
The choice of $\phi$ is matched to the geometry of $K$. On the probability simplex $\Delta^n$ with $\phi(x) = \sum x_i \log x_i$ (negative entropy), the Bregman divergence becomes KL, and mirror descent reduces to multiplicative weights / Hedge / EXP3.
Mirror descent regret bound: $R_T \le \sqrt{2 R^2 G_*^2 T / \sigma}$, where $R$ is the radius of $K$ in $\phi$-geometry, $G_*$ is the dual norm of the gradient, $\sigma$ is the strong convexity of $\phi$. For the simplex with entropy this gives $O(\sqrt{T \log n})$ versus $O(\sqrt{Tn})$ for Euclidean OGD - a factor of $\sqrt{n / \log n}$.
AdaGrad (Duchi-Hazan-Singer 2011) is mirror descent with a data-dependent $\phi_t$ accumulating squared gradients. Adam is a heuristic mirror-descent variant with moving averages. In information geometry, mirror descent with $\phi$-divergence on an exponential family coincides with natural gradient descent.
What does mirror descent improve over OGD on the probability simplex?
Bandit feedback: only loss values, no gradients
The bandit setting is online with partial feedback: only the scalar $f_t(x_t)$ is observed, with no gradient and no values at other points. This is realistic for production: an ad placement reveals click or no-click on the chosen variant, but never what would have happened with the alternatives.
Multi-armed bandits: $K$ discrete arms, one is pulled per step. UCB1 (Auer-Cesa-Bianchi-Fischer 2002) achieves $O(\sqrt{KT \log T})$ regret via upper confidence bounds. Thompson Sampling (Thompson 1933, rediscovered in the 2010s) uses Bayesian posteriors and often beats UCB in practice.
Contextual bandits: at step $t$ a context $c_t$ is given, an action is chosen, a reward is observed. LinUCB (Li-Chu-Langford-Schapire 2010) assumes a linear reward model and applies UCB to its confidence intervals. EXP3 (Auer-Cesa-Bianchi-Freund-Schapire 2002) is mirror descent on the arm simplex with importance-weighted estimators for adversarial bandits.
Yahoo News deployed LinUCB for article ranking starting in 2009, the first production rollout of contextual bandits in news. Spotify uses contextual bandits for recommendations: skip as the loss signal, listening history and audio features as context.
Bandit gradient estimation for continuous $K$: SPSA (Spall 1992) and one-point estimators build an unbiased gradient estimate from one or two loss queries. The price is variance growth; convergence rates drop from $\sqrt{T}$ to $T^{2/3}$ or $T^{3/4}$ depending on the structure.
Bandits are about roulette and casinos, not relevant for ML
Bandits formalize any decision-making with limited feedback. Production uses include A/B testing with adaptive allocation, contextual recommendations (Yahoo News, Spotify), adaptive clinical trials (FDA approved adaptive design in 2019), and hyperparameter tuning (Hyperband is a bandit algorithm).
The historical name from slot machine levers ("one-armed bandits") frames the wrong context. The bandit formalism describes any exploration vs exploitation problem: try a new recommender or stay with the proven one? It is an engineering question, not a casino game.
What separates bandit feedback from full-information online?
Key ideas
- **Online setting:** decision $x_t$ before $f_t$ is revealed, sequential data without a second pass. Regret $R_T = \sum f_t(x_t) - \min_{x^*} \sum f_t(x^*)$ measures quality against best-in-hindsight
- **OGD reaches $O(\sqrt{T})$ regret** for convex Lipschitz losses through a simple projection $x_{t+1} = \Pi_K(x_t - \eta_t \nabla f_t(x_t))$. The lower bound $\Omega(\sqrt{T})$ rules out improvements without extra structure
- **Mirror descent with entropy** on the simplex gives $O(\sqrt{T \log n})$ regret rather than $O(\sqrt{Tn})$ for OGD. AdaGrad and Adam are data-dependent mirror-descent variants used in production
- **Bandit feedback:** only $f_t(x_t)$ is observed, no gradient. UCB and Thompson Sampling for MAB, LinUCB and EXP3 for contextual and adversarial settings. Yahoo News, Spotify, adaptive trials are bandits in production
Related topics
Online Convex Optimization links streaming ML, learning theory, and information geometry:
- Online Learning Theory: regret minimization — Regret analysis as a universal framework for online algorithms; lower bounds (Abernethy-Bartlett-Rakhlin) define the limits of the achievable
- Mirror Descent as Natural Gradient — Mirror descent with $\phi$-divergence on an exponential family coincides with natural gradient in information geometry - the same algorithm from two viewpoints
- Gradient methods (cvx-04) — OGD is the online sibling of gradient descent; the regret bound explains SGD generalization through online-to-batch conversion
- Bayesian inference (Thompson Sampling) — Thompson Sampling picks an action by posterior probability of optimality - a fully Bayesian approach to exploration vs exploitation
Вопросы для размышления
- OGD gives $O(\sqrt{T})$ regret against any adversarial opponent. What changes when losses $f_t$ are sampled stochastically from a distribution rather than picked adversarially? Which stronger bounds become possible?
- Multiplicative weights yields $O(\sqrt{T \log n})$ regret on the simplex with $n$ experts. How does this algorithm relate to softmax classification in neural networks, and why is cross-entropy a natural loss to optimize?
- Spotify uses skip events as a loss signal for a contextual bandit. Which problems does such a loss definition introduce (selection bias, delayed feedback, non-stationarity), and how can an online algorithm address them?
Связанные уроки
- lt-12-online-regret — Online learning theory in context of regret minimization
- ig-09-mirror-descent — Mirror descent in IG is the same algorithm
- cvx-04 — Need gradient methods foundation
- cvx-09 — Convex sets and projections
- prob-04-bayes — Thompson sampling is Bayesian
- prob-17