Convex Optimization

KKT Conditions

SVM trains on millions of examples. But the resulting decision boundary depends on only a few - the support vectors. Why? KKT complementary slackness: $\alpha_i(y_i(w^Tx_i+b)-1) = 0$. Either $\alpha_i = 0$ (not a support vector) or the point sits exactly on the margin boundary. Not a heuristic. Not an approximation. Pure first-order optimality structure. Karush, Kuhn and Tucker wrote four pages in 1951, not knowing that inside every SVM, every interior point solver, and every constrained RL agent, their four conditions would be checked on every iteration.

**SVM sparse solutions:** complementary slackness forces most $\alpha_i = 0$ - only support vectors define the classifier
**CVXPY / scipy.optimize:** interior point methods use KKT as stopping criterion and for dual certificate construction
**Constrained RLHF:** KKT conditions for safety constraints in PPO with KL-penalty - the math of constrained policy optimization
**LASSO regression:** L1 regularization is equivalent to a norm constraint; KKT produces sparse coefficients

Предварительные знания

Lagrange Duality

The Four KKT Conditions

**1951. Karush, Kuhn, Tucker. Four pages.** A paper on necessary optimality conditions for constrained problems. The authors had no idea that forty years later these conditions would appear inside every SVM training routine, every constrained RL agent, every scipy.optimize call. KKT conditions are the language in which an optimal point communicates with its constraints.

Constrained optimization in standard form: $$\min_{x} f(x) \quad \text{s.t.} \quad g_i(x) \leq 0,\; i=1,\ldots,m \quad h_j(x) = 0,\; j=1,\ldots,p$$ Lagrange multipliers $\lambda_i \geq 0$ (for inequalities) and $\nu_j$ (for equalities). Lagrangian: $\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)$.

**KKT conditions - necessary optimality conditions** (under regularity; sufficient under convexity): 1. **Stationarity**: $\nabla_x \mathcal{L} = 0$, i.e. $\nabla f(x^*) + \sum_i \lambda_i^* \nabla g_i(x^*) + \sum_j \nu_j^* \nabla h_j(x^*) = 0$ 2. **Primal feasibility**: $g_i(x^*) \leq 0$ and $h_j(x^*) = 0$ 3. **Dual feasibility**: $\lambda_i^* \geq 0$ 4. **Complementary slackness**: $\lambda_i^* \cdot g_i(x^*) = 0$ for all $i$ Violate any one - the point is not optimal. Satisfy all four (under convexity) - guaranteed global optimum.

**Complementary slackness** is the most non-obvious: $\lambda_i \cdot g_i(x) = 0$ means either $\lambda_i = 0$ (constraint inactive - could have been omitted) or $g_i(x) = 0$ (constraint active - the point sits on its boundary). Both cannot be nonzero simultaneously.

KKT complementary slackness: $\lambda_i^* \cdot g_i(x^*) = 0$. If $\lambda_i^* = 3 > 0$, what can be said about $g_i(x^*)$?

Geometric Intuition

The stationarity condition $\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) = 0$ reads: **the objective gradient is a non-negative linear combination of the active constraint gradients**. There is no feasible descent direction left: every way forward runs into a boundary. The optimum is a trap built from active constraints.

Geometrically: at $x^*$, the gradient $\nabla f$ points into the normal cone of the active constraints. Each active constraint pulls in its normal direction with weight $\lambda_i \geq 0$. The combined pull balances the objective gradient - that is stationarity.

**Normal cone** $N_C(x^*)$ - the set of directions pointing strictly outward from the feasible set $C$: $N_C(x^*) = \{d \mid d^T(x - x^*) \leq 0 \; \forall x \in C\}$ **KKT stationarity** = $-\nabla f(x^*) \in N_C(x^*)$: The optimum is achieved when the steepest descent direction $-\nabla f$ points outside the feasible set. **Special case - unconstrained minimum:** no active constraints, $N_C = \{0\}$, $\nabla f(x^*) = 0$. The familiar first-order condition.

**Constraint qualification (regularity):** KKT are necessary only under some constraint qualification. The simplest is LICQ (Linear Independence CQ): gradients of active constraints are linearly independent. For convex problems, Slater's condition suffices: there exists a strictly feasible interior point. CVXPY checks this automatically before applying an interior point solver.

At $x^*$, two constraints are active: $g_1(x^*) = 0$ and $g_2(x^*) = 0$. What does this imply for $\lambda_1^*$ and $\lambda_2^*$?

SVM and Support Vectors via KKT

SVM trains on millions of examples. But the learned decision boundary depends on only a handful - those that lie exactly on the margin boundary (support vectors). Remove the rest - the hyperplane does not move. This is not a heuristic. It is a direct consequence of KKT complementary slackness.

**SVM as an optimization problem:** $$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^T x_i + b) \geq 1,\; \forall i$$ Dual form (via Lagrangian): $$\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^T x_j \quad \text{s.t.} \quad \alpha_i \geq 0,\; \sum_i \alpha_i y_i = 0$$ **KKT complementary slackness:** $\alpha_i^* \cdot [y_i(w^T x_i + b) - 1] = 0$ Either $\alpha_i^* = 0$ (ordinary point, does not affect $w$), or $y_i(w^T x_i + b) = 1$ (support vector on the margin). **Consequence**: $w^* = \sum_i \alpha_i^* y_i x_i$ - only support vectors contribute.

**Sparsity from complementary slackness** - a general principle: in problems with inequality constraints, most $\alpha_i = 0$ (inactive constraints). Only active constraints produce nonzero multipliers. LASSO regression (L1 regularization) works by the same mechanism: KKT conditions force most coefficients to zero when the $\ell_1$ constraint is active only on a small subset.

In a trained SVM, point $x_i$ is not a support vector. What is $\alpha_i^*$?

Algorithms: KKT in Practice

KKT conditions are not just a theoretical result. Three major algorithm families in convex optimization use KKT operationally. Inside scipy.optimize, CVXPY and PyTorch constrained optimizers, they run under the hood every time a constrained problem is solved.

**Three algorithmic families via KKT:** **1. Interior point / barrier method:** Replaces inequality constraints with barrier $-\mu \sum_i \log(-g_i(x))$. As $\mu \to 0$, the central path converges to the KKT point. Used in scipy, CVXPY, SCS, Gurobi. **2. Active set method:** Tracks which constraints are active. Solves an equality-constrained QP (KKT without CS) on the current active set. Adds/removes constraints when violations occur. Efficient for medium-scale QP. **3. Projected gradient descent:** $x_{k+1} = \Pi_C(x_k - \eta \nabla f(x_k))$ where $\Pi_C$ is projection onto the feasible set. Each projection is a QP solved implicitly via KKT. ProxSGD, Frank-Wolfe, Mirror descent all rely on this.

KKT conditions are always sufficient for optimality

KKT conditions are necessary under regularity; sufficient only for convex problems

For nonconvex problems (neural networks, general NLPs), KKT describe only local stationary points - which may be minima, maxima, or saddle points. SGD finds 'good enough' solutions without verifying KKT. CVXPY guarantees global optimality precisely because it verifies convexity before applying an interior point solver, making KKT sufficient.

CVXPY solves a convex QP and returns a solution. Which KKT conditions are guaranteed to hold?

Key Ideas

**Four KKT conditions:** stationarity, primal feasibility, dual feasibility ($\lambda_i \geq 0$), complementary slackness ($\lambda_i g_i = 0$)
**Complementary slackness:** either the constraint is inactive ($g_i < 0$, $\lambda_i = 0$) or active ($g_i = 0$, $\lambda_i \geq 0$) - explains SVM sparsity
**Under convexity:** KKT are necessary and sufficient for global optimality - the foundation of CVXPY, scipy, SCS
**Practice:** interior point (barrier), active set, projected gradient - all algorithms operate through KKT

What's next

KKT are the optimality conditions. Algorithms use them to find the optimum:

Interior Point Methods — Barrier method and central path - KKT conditions driving interior point algorithms
Support Vector Machines — Support vectors as a consequence of KKT complementary slackness

Вопросы для размышления

LASSO (L1 regularization) produces sparse solutions. How does this connect to KKT complementary slackness for the constraint $\|w\|_1 \leq t$?
Nash equilibrium in zero-sum games is characterized by conditions analogous to KKT. Which condition corresponds to complementary slackness?
scipy.optimize.minimize with method 'SLSQP' (Sequential Quadratic Programming) applies KKT at each iteration. What happens if the starting point strongly violates primal feasibility?

Связанные уроки

cvx-04 — Lagrange duality is the necessary foundation for KKT
cvx-06 — Interior point methods check KKT conditions on every iteration
ml-13-svm — SVM and support vectors follow directly from KKT complementary slackness
ml-09-gradient-descent — Projected gradient descent implicitly solves a KKT subproblem at each step
calc-19-gradient — KKT stationarity requires understanding gradients and the chain rule