Convex Optimization

Nonsmooth Optimization and Proximal Methods

The L1-norm and total variation appear in hundreds of signal and image processing problems, yet they are not differentiable at zero. How can functions with no gradient at the most important points be optimized?

MRI reconstruction: FISTA (Beck and Teboulle, 2009) solves TV-minimization for Siemens MRI scanners in seconds - reducing scan time by 50% at the same quality
LASSO in genomics: L1-regularized regression selects 50-100 significant genes from 500,000 candidates; proximal gradient converges in 200-500 iterations
Neural network pruning: L1 regularization with proximal soft-thresholding zeroes small weights - structural pruning without additional algorithms
Audio processing: total variation regularization for impulse noise removal; ISTA solves the problem 10x faster than the subgradient method

Предварительные знания

Convex functions
Gradient descent
Duality theory

Subgradient Methods

Many optimization problems contain non-differentiable functions: L1 regularization in Lasso, hinge loss in SVM, total variation in image processing. Plain gradient descent does not apply directly - the gradient does not exist at kink points. Subgradient methods extend GD by using any element of the subdifferential, a generalized gradient for convex functions.

The subgradient method converges at rate O(1/sqrt(k)) even for smooth convex functions (where GD gives O(1/k) or faster). This is the price of generality: dropping differentiability cannot beat this rate. For smooth problems GD is preferred; for nonsmooth ones, proximal or accelerated methods are better.

Subgradients carry no local information about the function value: a small subgradient norm does not imply closeness to the optimum (unlike a gradient). This makes the stopping criterion nontrivial - usually a fraction of the initial residual or a maximum iteration count is used.

Why does the subgradient method not guarantee monotone decrease of f?

Proximal Algorithms

Proximal algorithms (Moreau 1965, Rockafellar 1976) offer an elegant alternative: instead of fighting non-differentiability, compute the proximal operator - an analogue of a gradient step for the nonsmooth part. For most practical regularizers (L1, L2, indicator of a convex set, nuclear norm) the prox-operator has a closed form. This combines the efficiency of GD on the smooth part with exact handling of the nonsmooth part.

Prox of common functions: for the L2 norm - linear shrinkage; for the indicator of a convex set - projection; for the nuclear norm of a matrix - soft thresholding of singular values; for total variation - the Rudin-Osher-Fatemi problem solved via GraphCut or a dual approach.

What makes the prox of the L1 norm especially useful for sparse regression?

Bundle Methods: subgradient accumulation

The classical subgradient method discards information - one subgradient is used per iteration and the rest are forgotten. Bundle methods (Lemarechal 1975, Kiwiel 1985) keep a set of previous subgradients and build a piecewise-linear model of the function. This improves practical convergence, particularly on functions with highly nonsmooth structure such as piecewise differentiable problems with jumps. They are heavily used in Lagrangian relaxation for integer programming.

Bundle methods converge at rate O(1/k) for convex problems under moderate parameters. In practice they are far faster than plain subgradient on problems with many kinks. Modern implementations (BNDLE in Mosek, BMOR) cap bundle size (k_max around 30 to 50) to control memory.

Bundle Trust-Region (Hiriart-Urruty 1993) replaces quadratic regularization with a trust-region constraint. It adapts t_k based on model quality: t_k grows on good progress (more trust in the model) and shrinks on poor progress. Empirically it is better on nonsmooth problems with strong curvature.

Bundle compression: when bundle size hits the cap, aggregation replaces several subgradients by a single convex combination. Information is lost but the representation stays compact. Effective compression rules remain an open research area.

Why do bundle methods keep several subgradients instead of a single one?

Key ideas

Subdifferential df(x) is the convex set of subgradients; optimality condition 0 in df(x*) generalizes grad f(x*) = 0
Proximal operator prox_{lambda g}(v) generalizes projection; for L1 it equals soft-thresholding S_lambda(v)
ISTA: gradient step + prox step, rate O(1/k); FISTA with Nesterov momentum achieves O(1/k^2) at the same iteration cost
Primal-dual splitting (Chambolle-Pock) solves min f(x)+g(Kx) via two proximal steps per iteration at O(1/k^2)
Many proximal operators have closed forms: nuclear norm, group LASSO, simplex projection

Connections to Other Areas

Nonsmooth optimization runs through signal processing, machine learning, and control theory.

Proximal Methods for Optimization — Proximal operators are the standard tool for non-smooth objectives
Proximal Methods: Operators for Non-Smooth Optimization — Introductory chapter on proximal mapping and FISTA/ADMM
Stochastic Optimization and Adaptive Methods — Prox-SGD combines adaptive step sizes with non-smooth regularizers

Вопросы для размышления

Why does FISTA with Nesterov momentum achieve O(1/k^2) when each iteration costs the same as ISTA?
How does prox_{lambda g}(v) reduce to projection onto a convex set C when g = indicator_C?
Why is soft-thresholding the proximal operator of the convex L1-norm and not of hard-thresholding (a non-convex penalty)?