Optimization
Constrained Optimization
Training a Tesla autopilot that won't run red lights, or optimizing a trading portfolio with a strict risk limit - these are constrained optimization problems. KKT conditions are the language that describes all of them-from SVM training to policy gradient in safe RL.
- **SVM:** SVM training is a KKT problem; support vectors are points with lambda_i > 0 (active constraints); the dual formulation enables the kernel trick
- **Safe RL:** constrained policy gradient (CPO, TRPO, PPO) uses KKT for constraint satisfaction-the foundation of safe autonomous systems
- **Federated Learning:** ADMM decomposes the learning problem into client subproblems; used in FL with privacy constraints at Apple and Google
Предварительные знания
Lagrangian and Multipliers
Constrained problem: min f(x) s.t. hi(x) = 0, gj(x) <= 0. The **Lagrangian** merges objective and constraints into one function: L(x, lambda, mu) = f(x) + sum(lambda_i * hi(x)) + sum(mu_j * gj(x)). Multipliers lambda_i (equalities) and mu_j >= 0 (inequalities) are the 'prices' of constraint violation.
**Economic interpretation:** lambda_i is the 'shadow price' of constraint hi = 0. Relaxing the constraint by epsilon changes the optimum by approximately lambda_i * epsilon. This is identical to dual variables in LP-sensitivity of the optimum to constraint right-hand sides.
In min f(x) s.t. g(x)=0, the Lagrange multiplier lambda = 5. What happens to the optimum if we relax the constraint to g(x) <= 0.1?
KKT Conditions
**KKT conditions** (Karush-Kuhn-Tucker, 1951) are the necessary optimality conditions for a constrained problem. For convex problems they are also sufficient. Any optimum x* must satisfy four groups of conditions.
| KKT condition | Math | Meaning |
|---|---|---|
| Stationarity | nabla_x L(x*,lam*,mu*) = 0 | Optimal in x with fixed multipliers |
| Primal feasibility | g_i(x*) <= 0, h_j(x*) = 0 | Solution satisfies all constraints |
| Dual feasibility | lam_i >= 0 | Inequality multipliers are non-negative |
| Complementary slackness | lam_i * g_i(x*) = 0 | Only tight constraints have nonzero price |
**Complementary slackness** is the most intuitive KKT condition: either a constraint is active (gi = 0) or its price is zero (lambda_i = 0). Both conditions cannot hold simultaneously-unless degenerate. This explains why in SVM only support vectors have nonzero alpha_i.
At the optimum of min f(x) s.t. g(x) <= 0, we have g(x*) = -3 (constraint is inactive). What does this imply about lambda*?
Penalty and Augmented Lagrangian
To solve a constrained problem using unconstrained methods, the **penalty method** adds a quadratic penalty: min f(x) + (rho/2)||h(x)||^2. As rho goes to infinity, the solution approaches the constrained optimum. Problem: large rho causes ill-conditioning. **Augmented Lagrangian** combines Lagrangian and penalty: La = f + lambda^T*h + (rho/2)||h||^2-a finite rho suffices.
**Augmented Lagrangian vs plain penalty:** plain penalty needs rho -> infinity for exact constraint satisfaction (ill-conditioning). Augmented Lagrangian updates lambda-a finite rho achieves exact satisfaction. This is the Hestenes-Powell method of multipliers (1969), the foundation of ADMM.
Why is Augmented Lagrangian better than plain penalty method for exact constraint satisfaction?
ADMM: Distributed Optimization
**ADMM** (Alternating Direction Method of Multipliers) extends Augmented Lagrangian to problems of the form min f(x) + g(z) s.t. Ax + Bz = c. It splits the problem into two subproblems, solved alternately: an x-update and a z-update. Natural fit for distributed ML.
**Safe RL via constrained optimization:** the constrained policy optimization problem (max reward s.t. E[violations] <= delta) is solved through CPO (Constrained Policy Optimization) or ADMM-based approaches. KKT conditions give the optimality criterion for policy gradients with constraints.
KKT conditions are purely theoretical and not used in practice
KKT conditions underlie all industrial constrained solvers: interior point methods solve perturbed KKT systems, SVM is built on dual KKT, ADMM updates dual variables per KKT.
Primal-dual IPM at each iteration solves: nabla_x L = 0, h(x)=0, g(x)<=0, lambda>=0, lambda*g=0-with a barrier smoothing. This is the KKT system. SVM dual: the KKT condition w=sum(alpha_i*yi*xi) is used directly for prediction. KKT is not just theory-it's the algorithmic foundation of modern constrained optimization.
Why is ADMM popular for distributed ML problems?
Key Ideas
- **Lagrangian** L(x,lambda,mu) merges objective and constraints; multipliers are shadow prices of constraints
- **KKT:** stationarity + primal feasibility + dual feasibility + complementary slackness-necessary optimality conditions (sufficient for convex problems)
- **Augmented Lagrangian** solves constrained problems with unconstrained methods; dual update achieves exact constraint satisfaction with finite rho
- **ADMM** decomposes composite problems into subproblems; standard for distributed ML and federated learning
Related Topics
KKT and Lagrangian are the language of all constrained optimization:
- Convex Optimization — For convex problems KKT is necessary and sufficient; strong duality guarantees primal-dual gap = 0
- Interior Point Methods — Primal-dual IPM solves perturbed KKT conditions; the barrier smooths complementary slackness
- Bayesian Optimization — Constrained BO uses Augmented Lagrangian for constrained HPO
Вопросы для размышления
- Complementary slackness says: either the constraint is active or its price is zero. Why are both conditions impossible simultaneously? How does this connect to SVM support vectors?
- Gradient descent can't handle constraints directly. Why does Augmented Lagrangian transform a constrained problem into a sequence of unconstrained ones, and how does this help?
- How does ADMM apply to federated learning? What corresponds to the x-update, z-update, and dual update in that context?