Convex Optimization
Convex Optimization Problem
SVM is not a "classification algorithm." SVM is a quadratic program with linear constraints: min 0.5||w||^2 subject to yi(w*xi+b) >= 1. Once that framing lands - kernel trick, dual problem, and support vectors all become corollaries of the Kuhn-Tucker theorem. The entire scikit-learn SVC under the hood solves exactly this convex problem via LIBSVM (sequential minimal optimization). Not an algorithm - a problem. The distinction is fundamental.
- **SVM as QP**: scikit-learn SVC calls LIBSVM, which solves a quadratic program via SMO. The kernel trick works because the dual problem depends only on dot products xi*xj - not on xi directly
- **LASSO and L1**: min ||Xw-y||^2 + lambda*||w||_1 is a convex problem with a non-smooth constraint. Sparsity is not magic - it is a consequence of corner points of the L1 ball: the optimum lands where coordinates are zero
- **Markowitz portfolio**: min w^T*Sigma*w subject to E[r^T*w] >= R, sum(wi)=1 - classic QP. Every portfolio app (Betterment, Wealthfront) solves exactly this under the hood
Предварительные знания
Standard Form
SVM is not a "classification algorithm." SVM is a quadratic program with linear constraints: min 0.5||w||^2 subject to yi(w*xi+b) >= 1. Once that framing lands - the kernel trick, dual problem, and support vectors all become corollaries of a single theorem. Before that, it feels like magic.
A **convex optimization problem** in standard form: minimize a convex function f0(x) subject to convex inequalities gi(x) <= 0 and affine equalities hj(x) = 0. Three types of objects, three requirements - nothing redundant.
**Equalities must be affine (linear).** A nonlinear equality h(x) = 0 defines a surface that is generally not convex. Affine hj(x) = aj^T*x - bj is a hyperplane - and only a hyperplane satisfies the requirement. Everything else breaks the guarantees.
| Problem type | Objective f0 | Constraints gi | ML example |
|---|---|---|---|
| LP (linear) | c^T*x | Ax <= b (linear) | Portfolio with budget |
| QP (quadratic) | x^T*Px + q^T*x | Ax <= b | SVM, Ridge regression |
| SOCP | ||Ax+b||_2 + c^T*x | Lorentz cone | Robust regression |
| SDP | tr(CX) | X >= 0 (PSD) | SDP relaxation of MAX-CUT |
| General convex | Convex | Convex | LASSO (non-smooth) |
**Nesting hierarchy:** LP < QP < QCQP < SOCP < SDP < General convex. Each class contains all previous ones. LASSO is QP plus L1, which can be reduced to LP via auxiliary variables. That is why coordinate descent on LASSO converges with guarantees.
Problem: min x^2 + y^2 subject to x + y = 1, x^2 + y^2 <= 2. Is this a convex problem?
Feasible Set and Slater's Condition
Markowitz portfolio optimization is the textbook example: min variance(w) subject to E[return] >= R, sum(wi) = 1, wi >= 0. Before searching for the optimal portfolio, the first question is: does any portfolio satisfying all constraints simultaneously even exist? That is the question about the **feasible set**.
The **feasible set** is the intersection of all constraints: D = {x | gi(x) <= 0, hj(x) = 0}. In a convex problem D is always convex: sublevel sets of convex gi are convex, hyperplanes hj = 0 are convex, and the intersection of convex sets is convex. This does not need to be re-proved each time - it follows directly from the definitions.
| Problem status | Feasible set | What happens |
|---|---|---|
| Infeasible | Empty (intersection is empty) | Problem has no meaning - not a single feasible point exists |
| Feasible, bounded | Compact convex set | Optimum exists and is attained |
| Feasible, unbounded | Extends to infinity | Optimum is minus infinity (unbounded below) |
| Single point | {x*} | x* is the unique feasible and optimal solution |
**Slater's condition** is the key to duality: if there exists a strictly feasible point (all inequalities strict: gi(x) < 0), strong duality holds (d* = p*). For Markowitz this means: if at least one portfolio strictly satisfies all constraints - dual variables (shadow prices) exactly characterize the optimum.
Problem: min x subject to x >= 5, x <= 3. What is the status?
Local = Global Minimum
Here is why convex optimization is the "easy" class: **any local minimum is a global minimum**. No traps. No deceptive valleys. Gradient descent on Ridge regression or LASSO cannot get stuck - launched from any starting point, it converges to the same answer.
The proof is short. Suppose x* is a local minimum but not global. Then there exists y with f(y) < f(x*). By convexity of f: f(t*x* + (1-t)*y) <= t*f(x*) + (1-t)*f(y) < f(x*). As t -> 1 the point t*x* + (1-t)*y is arbitrarily close to x* with a smaller value of f - a contradiction with local optimality. End of proof.
**For non-convex problems this does not hold.** Neural networks with two or more layers are non-convex. SGD in PyTorch starts from a random point and moves downhill, but which "bottom" it finds depends on initialization. That is why He initialization, learning rate schedules, and weight decay matter: they help SGD land in a good (if not globally optimal) region.
| Property | Convex problem | Non-convex problem |
|---|---|---|
| Local minimum | = global (guaranteed) | One of many |
| Gradient descent | Converges to global optimum | Can get stuck in local or saddle |
| Complexity | Polynomial (interior-point: O(n^3)) | Often NP-hard |
| Optimality certificate | Via KKT and duality | Usually absent |
**Unconstrained optimality condition**: nabla f(x*) = 0 is both necessary and sufficient. That is why Ridge regression has an analytic solution: w* = (X^T*X + lambda*I)^{-1} * X^T*y - take gradient of MSE+L2, set to zero, and the result is immediately the global minimum. With constraints, the KKT conditions replace this.
Gradient descent is launched from 100 different starting points on a convex function. What happens?
First Look at Duality
Why does the kernel trick in SVM work at all? Why does replacing xi with phi(xi) and computing only dot products K(xi, xj) = phi(xi)^T * phi(xj) change everything? The answer is in the dual problem. Dual SVM depends not on the vectors phi(x) themselves, but only on their dot products. This is a direct consequence of the Lagrangian structure.
Every optimization problem has a "mirror" - the **dual problem**. The primal minimizes f0(x). The dual maximizes a lower bound via the **Lagrangian**: L(x, lambda, nu) = f0(x) + sum(lambda_i * gi(x)) + sum(nu_j * hj(x)). Multipliers lambda >= 0 penalize violations of inequalities.
**Weak duality** (always, for any problem): d* <= p*. The dual optimum is a guaranteed lower bound on the primal. **Strong duality** (convex problem + Slater's condition): d* = p*. Zero gap - and then KKT conditions are both necessary and sufficient for optimality.
**KKT conditions** for a convex problem with strong duality - four items: 1) stationarity: nabla f0(x*) + sum(lambda_i * nabla gi(x*)) + sum(nu_j * nabla hj(x*)) = 0; 2) primal feasibility: gi(x*) <= 0, hj(x*) = 0; 3) dual feasibility: lambda_i >= 0; 4) complementary slackness: lambda_i * gi(x*) = 0 for all i. Item four is the origin of "support vectors" in SVM: only points on the margin boundary have lambda_i > 0.
| Property | Weak duality | Strong duality |
|---|---|---|
| Always holds? | Yes, for any problem | No - convexity + Slater required |
| Relation | d* <= p* | d* = p* |
| Practical value | Lower bound without solving primal | KKT as optimality certificate |
| In SVM | Dual >= lower bound on margin | Dual SVM = Primal SVM (kernel trick) |
Any optimization problem can be reformulated as a convex one
NP-hard problems (TSP, SAT, Integer Programming) are inherently non-convex. If they could be reduced to convex problems - P = NP.
Convex optimization runs in polynomial time (interior-point: O(n^3.5) for LP). If an NP-hard problem could be reformulated as convex it would run in polynomial time, implying P = NP. In practice, convex relaxations are used: the SDP relaxation of MAX-CUT (Goemans-Williamson) gives a 0.878-approximation guarantee.
Weak duality: d* <= p*. What does this mean practically when solving a problem?
Key Ideas
- **Standard form**: min f0(x) subject to gi(x) <= 0, hj(x) = 0 - f0 and gi convex, hj affine. SVM is QP in this form
- **Feasible set** of a convex problem is always convex. Slater's condition (a strictly feasible point) guarantees strong duality
- **Main theorem**: local minimum = global minimum. Gradient descent on Ridge/LASSO/logistic regression cannot get stuck - the result does not depend on the starting point
- **Duality**: d* <= p* always (weak). With convexity + Slater: d* = p* (strong). Dual SVM + kernel trick is a direct consequence
Related Topics
The convex optimization problem unifies functions, sets, and algorithms:
- Convex Functions — Objective and constraints must be convex - covered in the previous lesson
- Convex Sets — The feasible region is a convex set built from sublevel sets and hyperplanes
- KKT and Duality — Full treatment of optimality conditions and their application to SVM
Вопросы для размышления
- Complementary slackness: lambda_i * gi(x*) = 0. What does this mean for SVM - why do only support vectors (points on the margin boundary) have nonzero lambda_i?
- Neural network training is non-convex. Why does gradient descent with SGD/Adam still work in practice? What replaces the guarantees of convexity?
- How does the convex SDP relaxation of MAX-CUT (Goemans-Williamson 1995) achieve a 0.878-approximation for an NP-hard problem? What exactly is being relaxed?
Связанные уроки
- cvx-02 — Convex functions are the objective and constraints of the problem
- cvx-04 — KKT conditions are derived directly from standard form
- ml-13-svm — SVM is a quadratic program with linear inequality constraints
- calc-19-gradient — Lagrangian stationarity condition uses the gradient
- ml-28-optimizers — Adam and SGD solve convex and non-convex problems in practice
- la-09-systems