Convex Optimization

Online Convex Optimization

Hazan in 2007 proved that every online convex optimization algorithm can be represented as FTRL with some regularizer. This unified all methods -- from OGD to multiplicative weights -- into a single framework.

Recommender systems: OGD updates the factor matrix after each user interaction
Trading algorithms: MWU rebalances portfolio assets, minimizing regret vs best single asset
Neural network training: Adam/AdaGrad are online optimizers with adaptive step size

Предварительные знания

Previous lesson

OGD and O(sqrt(T)) Regret

Zinkevich in 2003 proved that Online Gradient Descent with step eta = D/(G*sqrt(T)) achieves regret R_T <= DG*sqrt(T), where D is the diameter of the feasible set and G is the subgradient norm. As T grows, average regret R_T/T -> 0 -- the algorithm is asymptotically optimal in hindsight.

What is the regret bound for Online Gradient Descent with optimal step size?

Exponential Weights and FTRL

Follow-The-Regularized-Leader (FTRL): w_{t+1} = argmin_w sum_{s<t} f_s(w) + R(w). With R(w) = (1/2eta)||w||^2 this is OGD; with R(w) = (1/eta) sum w_i log w_i this is the multiplicative weights algorithm. Hazan et al. (2007): FTRL with adaptive regularization achieves O(sqrt(T)) even for non-Lipschitz functions.

Which regularizer R(w) turns FTRL into Online Gradient Descent?

Key Ideas

OGD: w_{t+1} = Pi_K(w_t - eta*grad f_t(w_t)), R_T <= DG*sqrt(T)
Lower bound: R_T = Omega(sqrt(T)) -- OGD is optimal
FTRL: w_{t+1} = argmin_w [sum f_s(w) + R(w)/eta]
AdaGrad: diagonal preconditioner G_t = sum g_s g_s^T
MWU: R_T <= 2*sqrt(T*ln K) with K experts

Further Directions

These ideas open paths to deeper mathematics.

co-28-bandit-opt — extends

Вопросы для размышления

Give a concrete example.
How does this connect to other areas of mathematics?