Convex Optimization

Interior Point Method

1984. Bell Labs. Narendra Karmarkar publishes an LP algorithm with complexity $O(n^{3.5})$. Simplex could run exponentially. Interior point - polynomially. Karmarkar's paper rewrote the theory, but Simplex is still faster in practice. Why? Because worst-case complexity and typical-instance complexity are different things. Interior point wins where Simplex drowns: large SDP, dense QP, ill-conditioned problems. SVM, portfolio optimization, compressed sensing, network flow - all solved via IPM. CVXPY, MOSEK, Gurobi, HiGHS - the same algorithm under the hood: log-barrier + Newton steps + $t \to \infty$.

**SVM dual QP (libsvm, sklearn):** the classifier trains via IPM on a quadratic program - support vectors are KKT complementary slackness made concrete
**CVXPY / CVX:** frontend for hundreds of ADMM and IPM calls per day - portfolio opt, resource allocation, MPC in robotics
**Portfolio optimization (Black-Litterman):** daily rebalancing via QP, ECOS takes 10-20 Newton steps in milliseconds
**Compressed sensing (basis pursuit):** L1 minimization via LP with $2n$ constraints, Gurobi solves in $O(\sqrt{n})$ IPM iterations

Предварительные знания

Log-barrier and the central path

**1984. Bell Labs. Narendra Karmarkar publishes an LP algorithm with complexity $O(n^{3.5})$.** The Simplex method could run in exponential time on adversarial instances - that was known since 1972 (the Klee-Minty cube). Karmarkar proved that a polynomial-time method exists. But here is the paradox: Simplex is still faster on 99% of real-world instances. Interior point wins only on ill-conditioned and very large LP. Why does the method with the better theory lose in practice? It comes down to how it is built.

Standard constrained problem: $\min_x f_0(x)$ subject to $f_i(x) \leq 0$, $i = 1, \ldots, m$. The constraints create a barrier at the boundary of the feasible set. The barrier method makes this barrier **explicit** inside the objective, then removes it in the limit.

**Log-barrier and the barrier problem:** $$\phi_t(x) = t \cdot f_0(x) - \sum_{i=1}^{m} \log(-f_i(x))$$ Here $t > 0$ is a scaling parameter, and $-\log(-f_i(x)) \to +\infty$ as $f_i(x) \to 0^-$ - the function blows up at the boundary. When $t$ is large, the first term dominates and the problem approaches the original. **Central path** $x^*(t)$ - the optimal solution of the barrier problem as a function of $t$: $$x^*(t) = \arg\min_{f_i(x) < 0} \phi_t(x)$$ As $t \to \infty$: $x^*(t) \to x^*$ - the central path converges to the optimum of the original problem.

CVXPY builds this barrier automatically. When one writes `prob.solve(solver='ECOS')`, the ECOS solver follows the central path, increasing $t$ at each outer iteration. scikit-learn's SVM with `kernel='linear'` uses liblinear, which runs interior point inside the dual LP. Gurobi and CPLEX - the leaders in commercial LP/QP solvers - default to IPM for problems with dense matrices.

**Suboptimality bound:** on the central path at a given $t$: $$f_0(x^*(t)) - f_0(x^*) \leq \frac{m}{t}$$ For accuracy $\varepsilon$: it suffices to have $t \geq m / \varepsilon$. More constraints $m$ means a larger $t$ is needed - and a harder inner subproblem.

An LP has $m = 100$ constraints. What value of $t$ guarantees $f_0(x^*(t)) - f_0(x^*) \leq 0.01$?

Newton step inside the barrier subproblem

The barrier function $\phi_t$ is strictly convex and twice differentiable everywhere inside the feasible set. An ideal candidate for Newton's method. GD on such a function would converge at rate $(1 - 1/\kappa)^k$ - and $\kappa$ can be enormous. Newton on the same geometry converges **quadratically** near the optimum: the error squares at every iteration.

**Newton step for $\phi_t$:** $$\Delta x_{nt} = -[\nabla^2 \phi_t(x)]^{-1} \nabla \phi_t(x)$$ For the barrier problem: $$\nabla \phi_t(x) = t \nabla f_0(x) + \sum_{i=1}^{m} \frac{1}{-f_i(x)} \nabla f_i(x)$$ $$\nabla^2 \phi_t(x) = t \nabla^2 f_0(x) + \sum_{i=1}^{m} \frac{1}{f_i(x)^2} \nabla f_i(x) \nabla f_i(x)^T - \sum_{i=1}^{m} \frac{1}{-f_i(x)} \nabla^2 f_i(x)$$ **Newton decrement** - a measure of proximity to the optimum in the Hessian metric: $$\lambda(x)^2 = \nabla \phi_t(x)^T [\nabla^2 \phi_t(x)]^{-1} \nabla \phi_t(x)$$ Stopping criterion: $\lambda(x)^2 / 2 \leq \varepsilon$.

Portfolio optimization is a classic interior point application. Markowitz (1952): $\min_w w^T \Sigma w$ subject to $\mathbf{1}^T w = 1$, $w \geq 0$ (and possibly a return target $\mu^T w \geq r$). The Black-Litterman model adds a prior on returns - the problem becomes a QP. CVXPY solves it in milliseconds via ECOS, which takes 10-30 Newton steps inside the barrier. Hedge funds rebalance portfolios daily - 250 QP instances per year, each via IPM.

**Barrier method algorithm (outline):** ``` input: feasible starting point x, t_0 > 0, mu > 1, epsilon repeat: 1. Centering: solve min phi_t(x) with Newton's method (until lambda^2/2 <= epsilon_inner) 2. Update: t := mu * t 3. Stop when: m/t <= epsilon ``` **Choosing $\mu$**: large $\mu$ means fewer outer iterations but harder centering. Typical value: $\mu = 10$-$20$. Each centering step takes $O(1)$ Newton iterations when the problem is self-concordant.

Why does the barrier method need backtracking line search to stay feasible, rather than a fixed step $\eta = 1/L$?

Self-concordance and polynomial complexity

Why does the log-barrier work so well? For a general convex function, Newton's method has no iteration count guarantee for $\varepsilon$-accuracy: the function can have arbitrarily bad curvature. The log-barrier possesses a special property - **self-concordance**. Nesterov and Nemirovskii introduced this class in 1994 and proved polynomial IPM complexity precisely through it.

**Self-concordance (Nesterov-Nemirovskii 1994):** A function $f: \mathbb{R} \to \mathbb{R}$ is self-concordant if: $$|f'''(x)| \leq 2 [f''(x)]^{3/2}$$ For the multivariate case (along any direction $v$): third-order derivatives are bounded by the curvature. **Consequence:** in the $\lambda$-region (where $\lambda(x) < 1$), Newton converges quadratically with a **guaranteed iteration count** - without knowing global $L$, $\mu$. **The log-barrier is self-concordant:** $-\log(x)$ for $x > 0$ satisfies the condition up to a constant. A sum of self-concordant functions is self-concordant. Hence $\phi_t(x)$ is self-concordant. **IPM complexity via self-concordance:** $$\text{Newton iterations per centering} = O(\sqrt{m} \log(m / \varepsilon))$$ Total complexity: $O(\sqrt{m})$ centerings times $O(1)$ Newton steps = $O(\sqrt{m} \log(1/\varepsilon))$ iterations.

Compressed sensing - minimizing $\|x\|_1$ subject to $Ax = b$ (basis pursuit). Via LP reformulation: introduce $t_i \geq 0$ with $-t_i \leq x_i \leq t_i$ and minimize $\sum t_i$. The resulting LP with $m = 2n$ constraints is directly suited for interior point. Gurobi solves basis pursuit in $O(\sqrt{n})$ IPM iterations. This outperforms iterative thresholding (ISTA/FISTA) when $A$ is ill-conditioned.

**Self-concordant barriers for conic programs:** - **LP** ($x \geq 0$): $-\sum_i \log x_i$ - standard barrier, $m$-self-concordant - **SOCP** ($\|Ax + b\| \leq c^T x + d$): $-\log(t^2 - \|u\|^2)$ - second-order cone barrier - **SDP** ($X \succeq 0$): $-\log \det X$ - matrix log-barrier All three are self-concordant. That is why CVXPY/CVX solves LP, SOCP, and SDP with the same engine (MOSEK, ECOS, SCS) - different problem classes, one algorithmic scheme.

Self-concordance guarantees Newton convergence in $O(\sqrt{m} \log(1/\varepsilon))$ iterations. What assumption does this theory remove compared to standard Newton analysis?

Primal-dual IPM and practice: CVXPY, Gurobi, SVM

The barrier method is a **primal** method: at each step a feasible primal point $x^*(t)$ is found, and dual variables are recovered after the fact. **Primal-dual IPM** tracks primal and dual variables simultaneously, solving the KKT system directly. This is the method implemented in Gurobi, CPLEX, MOSEK - commercial solvers that handle LP with millions of variables.

**Primal-dual IPM - KKT iteration:** KKT conditions for the problem with $f_i(x) \leq 0$: $$\nabla f_0(x^*) + \sum_i \lambda_i^* \nabla f_i(x^*) = 0$$ $$\lambda_i^* f_i(x^*) = 0 \quad (\text{complementary slackness})$$ $$f_i(x^*) \leq 0, \quad \lambda_i^* \geq 0$$ Predictor-corrector step: introduce **surrogate complementarity** $\lambda_i |f_i(x)| = 1/t$ (instead of zero). As $t \to \infty$ - convergence to KKT. Each step solves a Newton-like system: $$\begin{pmatrix} \nabla^2 f_0 + \sum \lambda_i \nabla^2 f_i & \nabla f_i^T \\ \Lambda F & \text{diag}(f_i) \end{pmatrix} \begin{pmatrix} \Delta x \\ \Delta \lambda \end{pmatrix} = \text{rhs}$$ **Convergence:** $O(\sqrt{m})$ iterations, each solving a linear system in $O((n+m)^3)$ or $O(n^2 m)$ via sparse Cholesky.

SVM dual is a quadratic program. Training SVM on $n$ examples: $\max_\alpha \mathbf{1}^T \alpha - \frac{1}{2} \alpha^T Q \alpha$ subject to $y^T \alpha = 0$, $0 \leq \alpha_i \leq C$. The matrix $Q_{ij} = y_i y_j K(x_i, x_j)$ is kernel-based. libsvm uses SMO (sequential minimal optimization) - a specialized IPM for SVM. sklearn's `SVC` calls libsvm under the hood. CVXPY with `solver='OSQP'` solves the same QP via ADMM; with `solver='ECOS'` - via primal-dual IPM.

Network flow - LP on a graph: $\min c^T f$ subject to $Bf = d$ (flow balance), $0 \leq f \leq u$ (capacity). The matrix $B$ is the incidence matrix - sparse. Specialized IPM solvers exploit sparse Cholesky factorization of $B^T B$: complexity drops from $O(n^3)$ to $O(n^{1.5})$ for planar graphs. HiGHS (open-source LP solver, 2022) implements exactly this variant and outperforms GLPK by 10-100x on large instances.

Interior point is always faster than Simplex - it has a polynomial guarantee, after all.

On most real-world LP instances Simplex is faster. IPM wins on dense matrices, large QP/SDP, and ill-conditioned problems.

Simplex moves between vertices of the polytope - each step costs $O(mn)$, and in practice takes $O(m)$ steps. IPM solves an $(n+m) \times (n+m)$ linear system per step - $O((n+m)^3)$ without sparsity exploitation. At $n, m \sim 10^3$, IPM is more expensive. At $n \sim 10^5$ with a dense matrix (SDP, portfolio), Simplex is not applicable.

Primal-dual IPM updates both $x$ and $\lambda$ simultaneously. What advantage does this have over a pure primal barrier method?

Key Ideas

**Log-barrier** $\phi_t(x) = t f_0(x) - \sum \log(-f_i(x))$ replaces constraints with a penalty; central path $x^*(t) \to x^*$ as $t \to \infty$
**Accuracy guarantee:** $f_0(x^*(t)) - f_0(x^*) \leq m/t$ - precision is determined by the constraint count and parameter $t$
**Newton step inside the barrier** converges quadratically; Newton decrement $\lambda(x)^2/2$ is an adaptive stopping criterion
**Self-concordance (Nesterov-Nemirovskii 1994):** $|f'''| \leq 2(f'')^{3/2}$ guarantees $O(\sqrt{m} \log(1/\varepsilon))$ iterations without global constants $L$, $\mu$
**Primal-dual IPM** updates $x$ and $\lambda$ jointly via the KKT system - the engine inside Gurobi, MOSEK, CPLEX

What's next

Interior point is the pinnacle of first- and second-order methods for convex problems. Next come decomposition and distributed methods:

Dual Decomposition and ADMM — Dual decomposition splits large IPM problems into parallel subproblems
KKT Conditions — KKT is the algebraic core of IPM: the primal-dual method solves a perturbed KKT system
Proximal Methods — ISTA/FISTA - alternative to IPM for structured non-smooth problems (LASSO, group sparsity)

Вопросы для размышления

CVXPY solves SDP via IPM in $O(n^{3.5})$ operations. For training a neural network with $n = 10^6$ parameters this is infeasible - why? What methods are used instead of IPM at that scale?
Self-concordance of the log-barrier allows using the Newton decrement $\lambda(x)^2/2$ as a stopping criterion without knowing $L$ and $\mu$. How does this relate to GD needing an explicit $L$ for step size selection?
Simplex outperforms IPM on typical LP but IPM wins on QP and SDP. What fundamentally changes in the problem structure when moving from LP to QP that makes Simplex inefficient?

Связанные уроки

cvx-05 — KKT conditions are the language through which IPM locates the optimum
cvx-06 — IPM replaces GD with Newton inside the barrier subproblem
cvx-07 — Proximal methods are a neighboring algorithm class with different structure
cvx-09 — Dual decomposition and ADMM build on the KKT structure of IPM
ml-13-svm — SVM dual QP is solved by an interior-point method
calc-19-gradient — Gradient and Hessian are the primary objects of the Newton step
la-06-gauss