Convex Optimization
Lagrange Duality
Behind kernel SVM, ADMM in distributed training, constrained generation in LLMs - everywhere there is Lagrange duality. Primal problem: minimize a function subject to constraints. Dual problem: maximize lower bounds. Under strong duality - it is the same problem viewed from two directions. Without dual form, SVM cannot use the kernel trick and is confined to linear space.
- **SVM and kernel trick:** dual form replaces $x_i^\top x_j$ with $K(x_i, x_j)$ - the algorithm instantly works in infinite-dimensional spaces (RBF kernel = infinite-degree polynomial)
- **ADMM for federated learning:** Alternating Direction Method of Multipliers splits a global problem into local ones via dual decomposition; agents exchange only dual variables, not raw data
- **Constrained generation in LLMs:** output length and safety constraints are implemented via Lagrangian relaxation - violations are penalized and gradually disappear from the optimal solution
- **Boyd & Vandenberghe:** their textbook 'Convex Optimization' is used in Stanford CS229, CMU 10-725, and every serious ML course. The duality chapter is mandatory reading
- **Interior point methods:** solvers (CPLEX, Gurobi, MOSEK) check convergence via duality gap - when $p^* - d^* < \varepsilon$, the optimum is found. This is a certifiable quality guarantee
The Lagrangian
SVM could work entirely in the original feature space. But dual form allows replacing $x^\top y$ with $K(x,y)$ - and suddenly SVM operates in infinite-dimensional space. Duality is not a mathematical trick. It is a change of perspective on an optimization problem, one that unlocks methods impossible in the primal formulation.
Lagrange's idea (1788): instead of enforcing constraints rigidly, penalize violations - add them to the objective with multipliers. If a candidate $x$ violates the constraint ($g_i(x) > 0$), the penalty $\lambda_i g_i(x)$ grows. The optimal solution either satisfies every constraint or pays a strictly positive penalty.
**Primal problem:** $\min_x f_0(x)$ $\text{s.t.} \; f_i(x) \leq 0, \; i = 1, \ldots, m$ $\quad\quad h_j(x) = 0, \; j = 1, \ldots, p$ **Lagrangian:** $L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x)$ where $\lambda_i \geq 0$ are dual variables (Lagrange multipliers) for inequality constraints, $\nu_j \in \mathbb{R}$ for equality constraints. $L$ is a weighted sum of the objective and constraint violations.
Why does the Lagrangian require $\lambda_i \geq 0$ for inequality constraints $f_i(x) \leq 0$?
Dual Function and Lower Bounds
The dual function is the minimum of the Lagrangian over the primal variables $x$ for fixed $\lambda, \nu$. The key property: for any feasible $\lambda \geq 0, \nu$, the dual function provides a lower bound on the primal optimum. This holds always, even without convexity.
**Dual function:** $g(\lambda, \nu) = \inf_x L(x, \lambda, \nu) = \inf_x \left[ f_0(x) + \sum_i \lambda_i f_i(x) + \sum_j \nu_j h_j(x) \right]$ **Lower bound:** for any $\lambda \geq 0$ and any $\nu$: $g(\lambda, \nu) \leq p^*$ where $p^* = \min\{f_0(x) \mid f_i(x) \leq 0, h_j(x) = 0\}$ is the primal optimum. **Dual problem:** $\max_{\lambda \geq 0, \nu} g(\lambda, \nu) = d^*$ $g$ is always concave in $\lambda, \nu$ (as an infimum of linear functions), so the dual problem is always convex.
A remarkable fact: even when the primal is nonconvex, the dual is always convex. This is why duality is sometimes used to obtain lower bounds for hard nonconvex problems. In constrained neural network training (safety, output length), the dual function provides a certificate of how far the current solution is from optimum.
The dual function $g(\lambda, \nu)$ is always concave. Why?
Weak and Strong Duality
1951. Harold Kuhn and Albert Tucker derived first-order conditions for constrained problems. KKT conditions would appear in every optimization textbook for the next 70 years - and in every solver from scipy to CPLEX. The bridge to them runs through strong duality.
**Weak duality** (always holds): $d^* \leq p^*$ The dual optimum never exceeds the primal optimum. The difference $p^* - d^*$ is called the duality gap. **Strong duality** ($d^* = p^*$, duality gap = 0): Holds under **Slater's condition:** there exists a strictly feasible point, i.e., $\exists \tilde{x}$ such that $f_i(\tilde{x}) < 0$ for all $i$. **Consequence:** for convex primal problems satisfying Slater's condition, strong duality holds automatically. The dual problem has the same optimum as the primal.
Zero duality gap is an optimality certificate. In distributed training (ADMM for federated learning) agents optimize local dual problems and exchange dual variables. Convergence to zero duality gap guarantees global optimum is reached - without sharing raw data.
Strong duality means primal and dual problems have the same solution point $x^*$.
Strong duality means equality of optimal values $p^* = d^*$, but optimal points $x^*$ (primal) and $\lambda^*$ (dual) are different objects in different spaces.
Primal variables $x$ are model parameters (weights, geometric coordinates). Dual variables $\lambda$ are 'prices' of constraints - a measure of how actively each constraint influences the optimum.
Primal optimum $p^* = 5$, dual optimum $d^* = 3$. What can be concluded?
KKT Conditions and SVM Dual
KKT conditions are necessary and sufficient optimality conditions for convex problems with Slater's condition. Four conditions, each with its own meaning. Their direct consequence is SVM dual form, which enables the kernel trick.
**KKT conditions** (necessary under Slater, necessary and sufficient under convexity): 1. **Stationarity:** $\nabla_x L(x^*, \lambda^*, \nu^*) = 0$ (gradient of Lagrangian w.r.t. $x$ vanishes) 2. **Primal feasibility:** $f_i(x^*) \leq 0$, $h_j(x^*) = 0$ (original constraints are satisfied) 3. **Dual feasibility:** $\lambda_i^* \geq 0$ (multipliers for inequalities are non-negative) 4. **Complementary slackness:** $\lambda_i^* f_i(x^*) = 0$ (either constraint is active $f_i = 0$, or $\lambda_i = 0$) Complementary slackness is the key: 'inactive' constraints do not affect the optimum ($\lambda_i = 0$ whenever $f_i(x^*) < 0$).
SVM dual is the canonical example. Primal SVM: minimize $\frac{1}{2}\|w\|^2$ subject to $y_i(w^\top x_i + b) \geq 1$. Substituting KKT stationarity and minimizing over $w, b$ yields the dual: $\max_{\alpha} \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^\top x_j$ $\text{s.t.} \; \alpha_i \geq 0, \; \sum_i \alpha_i y_i = 0$ Only dot products $x_i^\top x_j$ appear in the dual. Replace them with $K(x_i, x_j) = \phi(x_i)^\top \phi(x_j)$ - and SVM operates in any (infinite-dimensional) feature space without ever computing $\phi(x)$ explicitly.
In SVM dual, why do only dot products $x_i^\top x_j$ appear rather than the vectors $x_i$ themselves?
Key Ideas
- **Lagrangian** $L(x,\lambda,\nu)$ = objective + weighted constraints; $\lambda \geq 0$ for inequalities
- **Dual function** $g(\lambda,\nu) = \inf_x L$ - always concave, always a lower bound on $p^*$
- **Weak duality:** $d^* \leq p^*$ - always true without additional conditions
- **Strong duality:** $d^* = p^*$ - holds for convex primal with Slater's condition (strictly feasible point)
- **KKT:** stationarity + primal/dual feasibility + complementary slackness - necessary and sufficient conditions under convexity
- **SVM dual:** kernel trick is possible precisely because the dual depends on $x_i$ only through dot products
What Comes Next
Duality is theory. KKT is the tool. Solvers are practice:
- KKT Conditions — Complete optimality system: four conditions solving constrained problems
- Interior Point Methods — Interior point solvers use duality gap as the stopping criterion
Вопросы для размышления
- SVM in primal space minimizes $\|w\|^2$. In dual - maximizes lower bounds through $\alpha_i$. Why is this the same problem - and how are $w^*$ and $\alpha^*$ related?
- Zero duality gap guarantees the found solution is globally optimal. How is this used in federated learning to verify convergence without accessing centralized data?
- If the primal is nonconvex, the dual is still convex. Why? And what is the dual useful for in nonconvex optimization?
Связанные уроки
- cvx-03 — Convex functions and optimality conditions
- cvx-05 — KKT conditions follow from strong duality
- opt-02 — General vs convex optimization - different guarantees
- ml-09-gradient-descent — SVM dual is trained via projected gradient on alpha
- fa-01 — Hilbert spaces - mathematical home for the kernel trick
- la-25-quadratic-forms