Convex Optimization
Online Convex Optimization
Hazan in 2007 proved that every online convex optimization algorithm can be represented as FTRL with some regularizer. This unified all methods -- from OGD to multiplicative weights -- into a single framework.
- Recommender systems: OGD updates the factor matrix after each user interaction
- Trading algorithms: MWU rebalances portfolio assets, minimizing regret vs best single asset
- Neural network training: Adam/AdaGrad are online optimizers with adaptive step size
Предварительные знания
OGD and O(sqrt(T)) Regret
Zinkevich in 2003 proved that Online Gradient Descent with step eta = D/(G*sqrt(T)) achieves regret R_T <= DG*sqrt(T), where D is the diameter of the feasible set and G is the subgradient norm. As T grows, average regret R_T/T -> 0 -- the algorithm is asymptotically optimal in hindsight.
What is the regret bound for Online Gradient Descent with optimal step size?
Exponential Weights and FTRL
Follow-The-Regularized-Leader (FTRL): w_{t+1} = argmin_w sum_{s<t} f_s(w) + R(w). With R(w) = (1/2eta)||w||^2 this is OGD; with R(w) = (1/eta) sum w_i log w_i this is the multiplicative weights algorithm. Hazan et al. (2007): FTRL with adaptive regularization achieves O(sqrt(T)) even for non-Lipschitz functions.
Which regularizer R(w) turns FTRL into Online Gradient Descent?
Key Ideas
- OGD: w_{t+1} = Pi_K(w_t - eta*grad f_t(w_t)), R_T <= DG*sqrt(T)
- Lower bound: R_T = Omega(sqrt(T)) -- OGD is optimal
- FTRL: w_{t+1} = argmin_w [sum f_s(w) + R(w)/eta]
- AdaGrad: diagonal preconditioner G_t = sum g_s g_s^T
- MWU: R_T <= 2*sqrt(T*ln K) with K experts
Further Directions
These ideas open paths to deeper mathematics.
- co-28-bandit-opt — extends
Вопросы для размышления
- Give a concrete example.
- How does this connect to other areas of mathematics?