Optimization
Convex Optimization Fundamentals
Why does logistic regression always converge reliably while neural networks need careful tuning and sometimes fail? The answer is convexity. When the optimization problem is convex, any algorithm is guaranteed to find the global optimum. Recognizing when a problem is convex - and when it isn't - is one of the most practical skills in ML engineering.
- **Portfolio optimization:** Markowitz mean-variance portfolio is a classic convex QP; CVXPY solves it in milliseconds with provably optimal weights
- **SVM training:** the SVM dual is a convex QP; no multi-start needed, training always converges to the unique global optimum
- **Lasso/Ridge regression:** convex loss + convex regularizer gives a unique, reproducible result regardless of initialization
Предварительные знания
Convex Sets
A set C is **convex** if the line segment between any two of its points lies entirely inside it: for all x, y ∈ C and θ ∈ [0,1], the point θx + (1-θ)y ∈ C. Geometrically-no dents or holes.
**Key property:** the intersection of any number of convex sets is convex. This explains why the LP feasible region (intersection of half-spaces) is always convex, and why a linear function achieves its minimum at a vertex.
Convex sets appear throughout ML: the probability simplex (probability distributions), the cone of positive semidefinite matrices (covariance matrices), the L2 ball (weight decay constraint in regularization).
Which of the following sets is NOT convex?
Convex Functions
A function f: Rn -> R is **convex** if its epigraph (the set of points on and above its graph) is a convex set. Equivalently: for all x, y and θ ∈ [0,1]: f(θx + (1-θ)y) ≤ θf(x) + (1-θ)f(y). The chord between any two points lies on or above the graph.
| Function | Type | ML application |
|---|---|---|
| MSE: ||Ax-b||^2 | Strictly convex | Linear regression |
| Log loss: -sum(yi*log(pi)) | Convex (for linear model) | Logistic regression |
| Hinge loss: max(0, 1-yi*xi) | Convex (piecewise linear) | SVM |
| Cross-entropy + neural net | Non-convex | Deep learning (many local minima) |
| L1 regularization: ||w||_1 | Convex (non-differentiable at 0) | Lasso (sparse solutions) |
**Convexity-preserving rules:** 1. non-negative sum of convex functions is convex 2. pointwise max of convex functions is convex 3. affine substitution f(Ax+b) preserves convexity. These rules enable building complex convex objectives from simple pieces.
Is the logistic regression loss L(w) = sum(log(1+exp(-yi*w^T*xi))) convex in w?
Local = Global Minimum
The fundamental theorem of convex optimization: **any local minimum of a convex function over a convex set is a global minimum**. This transforms optimization: there is no risk of getting stuck in a bad local minimum - any point where the gradient vanishes is already a solution.
**First-order optimality condition:** for differentiable convex f, x* is the global minimum if and only if nabla f(x*) = 0. For constrained problems (min f(x) s.t. x in C): nabla f(x*)^T * (y - x*) >= 0 for all y in C. The gradient points outward from the feasible set.
This is why convex optimization is a 'solved problem': gradient descent, Newton's method, interior point methods-all converge to the global optimum for convex objectives. Neural networks lack this guarantee, which is why initialization, learning rate schedules, and batch normalization matter so much.
Gradient descent stops at x* with nabla f(x*) = 0. f is convex. What conclusion follows?
Second-Order Conditions and CVXPY
The Hessian nabla^2 f(x) is the tool for checking convexity. f is convex if and only if nabla^2 f(x) is PSD for all x in its domain. Strict convexity requires nabla^2 f(x) to be positive definite. For nonlinear functions this is verified analytically or via eigenvalues.
| Problem type | Hessian condition | ML examples |
|---|---|---|
| LP | Hessian = 0 (linear) | Linear SVM, LP relaxation |
| QP (convex) | Hessian PSD, quadratic | LinReg (MSE), L2-SVM |
| SOCP | Second-order cone constraints | Robust regression, portfolio optimization |
| SDP | Matrix semidefinite constraints | Matrix completion, SDP relaxation of ILP |
**CVXPY limitation:** the problem must be convex and expressed using DCP rules. Attempting to encode a non-convex problem (e.g., multiplying two variables together) causes CVXPY to raise an error. This is a feature - it prevents accidentally formulating a non-convex problem and wondering why the solver misbehaves.
Neural networks are trained using convex optimization
Neural network training is non-convex due to matrix product weights. SGD works in practice but has no global optimality guarantee.
The loss surface of a neural network is non-convex in the weights (W1, W2): f(W1,W2) = ||W2*W1*x - y||^2. Convexity is a property of the objective function, not the algorithm. SGD finds 'good' local minima thanks to noise, batch normalization, and proper initialization-but there is no mathematical guarantee of finding the global optimum.
The Hessian of f(x) = x^T Q x has eigenvalues [-1, 3, 5]. Is f convex?
Key Ideas
- **Convex set:** any segment between two points stays inside; intersection of convex sets is convex
- **Convex function:** chord lies above the graph; nabla^2 f >= 0 (PSD Hessian) is the second-order criterion
- **Local = global:** for convex f on convex C, every local minimum is global; nabla f(x*) = 0 iff global optimum
- **CVXPY + DCP:** automatic convexity verification and solver selection; standard for convex optimization in Python
Related Topics
Convexity is the foundation for all exact solution methods:
- Interior Point Methods — Efficiently solves convex problems (QP, SOCP) with polynomial-time guarantees
- KKT Conditions — Optimality conditions for constrained convex problems generalize nabla f = 0
- Stochastic Optimization — SGD converges to the global optimum only for convex functions; theory breaks down otherwise
Вопросы для размышления
- Why does convexity of the loss function matter so much for ML reliability? What goes wrong during training when the loss is non-convex?
- Lasso and Ridge use different regularizers (L1 and L2). Are both convex? Why does Lasso produce sparse solutions while Ridge does not?
- CVXPY automatically checks whether a problem is convex. Why is this a useful feature rather than just a convenience?