Optimization

Constrained Optimization

Training a Tesla autopilot that won't run red lights, or optimizing a trading portfolio with a strict risk limit - these are constrained optimization problems. KKT conditions are the language that describes all of them-from SVM training to policy gradient in safe RL.

  • **SVM:** SVM training is a KKT problem; support vectors are points with lambda_i > 0 (active constraints); the dual formulation enables the kernel trick
  • **Safe RL:** constrained policy gradient (CPO, TRPO, PPO) uses KKT for constraint satisfaction-the foundation of safe autonomous systems
  • **Federated Learning:** ADMM decomposes the learning problem into client subproblems; used in FL with privacy constraints at Apple and Google

Предварительные знания

  • Nonlinear Optimization

Lagrangian and Multipliers

Constrained problem: min f(x) s.t. hi(x) = 0, gj(x) <= 0. The **Lagrangian** merges objective and constraints into one function: L(x, lambda, mu) = f(x) + sum(lambda_i * hi(x)) + sum(mu_j * gj(x)). Multipliers lambda_i (equalities) and mu_j >= 0 (inequalities) are the 'prices' of constraint violation.

**Economic interpretation:** lambda_i is the 'shadow price' of constraint hi = 0. Relaxing the constraint by epsilon changes the optimum by approximately lambda_i * epsilon. This is identical to dual variables in LP-sensitivity of the optimum to constraint right-hand sides.

In min f(x) s.t. g(x)=0, the Lagrange multiplier lambda = 5. What happens to the optimum if we relax the constraint to g(x) <= 0.1?

KKT Conditions

**KKT conditions** (Karush-Kuhn-Tucker, 1951) are the necessary optimality conditions for a constrained problem. For convex problems they are also sufficient. Any optimum x* must satisfy four groups of conditions.

KKT conditionMathMeaning
Stationaritynabla_x L(x*,lam*,mu*) = 0Optimal in x with fixed multipliers
Primal feasibilityg_i(x*) <= 0, h_j(x*) = 0Solution satisfies all constraints
Dual feasibilitylam_i >= 0Inequality multipliers are non-negative
Complementary slacknesslam_i * g_i(x*) = 0Only tight constraints have nonzero price

**Complementary slackness** is the most intuitive KKT condition: either a constraint is active (gi = 0) or its price is zero (lambda_i = 0). Both conditions cannot hold simultaneously-unless degenerate. This explains why in SVM only support vectors have nonzero alpha_i.

At the optimum of min f(x) s.t. g(x) <= 0, we have g(x*) = -3 (constraint is inactive). What does this imply about lambda*?

Penalty and Augmented Lagrangian

To solve a constrained problem using unconstrained methods, the **penalty method** adds a quadratic penalty: min f(x) + (rho/2)||h(x)||^2. As rho goes to infinity, the solution approaches the constrained optimum. Problem: large rho causes ill-conditioning. **Augmented Lagrangian** combines Lagrangian and penalty: La = f + lambda^T*h + (rho/2)||h||^2-a finite rho suffices.

**Augmented Lagrangian vs plain penalty:** plain penalty needs rho -> infinity for exact constraint satisfaction (ill-conditioning). Augmented Lagrangian updates lambda-a finite rho achieves exact satisfaction. This is the Hestenes-Powell method of multipliers (1969), the foundation of ADMM.

Why is Augmented Lagrangian better than plain penalty method for exact constraint satisfaction?

ADMM: Distributed Optimization

**ADMM** (Alternating Direction Method of Multipliers) extends Augmented Lagrangian to problems of the form min f(x) + g(z) s.t. Ax + Bz = c. It splits the problem into two subproblems, solved alternately: an x-update and a z-update. Natural fit for distributed ML.

**Safe RL via constrained optimization:** the constrained policy optimization problem (max reward s.t. E[violations] <= delta) is solved through CPO (Constrained Policy Optimization) or ADMM-based approaches. KKT conditions give the optimality criterion for policy gradients with constraints.

KKT conditions are purely theoretical and not used in practice

KKT conditions underlie all industrial constrained solvers: interior point methods solve perturbed KKT systems, SVM is built on dual KKT, ADMM updates dual variables per KKT.

Primal-dual IPM at each iteration solves: nabla_x L = 0, h(x)=0, g(x)<=0, lambda>=0, lambda*g=0-with a barrier smoothing. This is the KKT system. SVM dual: the KKT condition w=sum(alpha_i*yi*xi) is used directly for prediction. KKT is not just theory-it's the algorithmic foundation of modern constrained optimization.

Why is ADMM popular for distributed ML problems?

Key Ideas

  • **Lagrangian** L(x,lambda,mu) merges objective and constraints; multipliers are shadow prices of constraints
  • **KKT:** stationarity + primal feasibility + dual feasibility + complementary slackness-necessary optimality conditions (sufficient for convex problems)
  • **Augmented Lagrangian** solves constrained problems with unconstrained methods; dual update achieves exact constraint satisfaction with finite rho
  • **ADMM** decomposes composite problems into subproblems; standard for distributed ML and federated learning

Related Topics

KKT and Lagrangian are the language of all constrained optimization:

  • Convex Optimization — For convex problems KKT is necessary and sufficient; strong duality guarantees primal-dual gap = 0
  • Interior Point Methods — Primal-dual IPM solves perturbed KKT conditions; the barrier smooths complementary slackness
  • Bayesian Optimization — Constrained BO uses Augmented Lagrangian for constrained HPO

Вопросы для размышления

  • Complementary slackness says: either the constraint is active or its price is zero. Why are both conditions impossible simultaneously? How does this connect to SVM support vectors?
  • Gradient descent can't handle constraints directly. Why does Augmented Lagrangian transform a constrained problem into a sequence of unconstrained ones, and how does this help?
  • How does ADMM apply to federated learning? What corresponds to the x-update, z-update, and dual update in that context?

Связанные уроки

  • calc-01-sequences
Constrained Optimization

0

1

Sign In