Convex Optimization
Semidefinite Programming (SDP)
1994. Lovász proves the Shannon capacity of the Petersen graph is $\sqrt{5}$ - via a single semidefinite program. 1995. Goemans and Williamson: MAX-CUT approximated within 0.878 of optimal in polynomial time - via SDP plus random rounding. Both results had resisted 20 years of purely combinatorial approaches. The shift: allow the optimization variable to be a matrix, constrain it to the PSD cone, and suddenly NP-hard integer programs reveal polynomial-time approximations with provable guarantees.
- **MAX-CUT / graph partitioning (VLSI, clustering):** Goemans-Williamson is the best known polynomial-time approximation for MAX-CUT - used in practice for circuit partitioning and image segmentation
- **Control design (LQR, H-infinity):** stability certificates via Lyapunov LMI; MOSEK solves LQR for $n=500$ state systems in under a second
- **Polynomial optimization (SOS):** verifying non-negativity of polynomials for safety certificates in robotics and formal verification
- **Multiple kernel learning:** combining PSD kernels with SDP constraints, solved by interior point in $O(n^{3.5})$ for $n \leq 1000$ samples
Предварительные знания
Linear Matrix Inequality and SDP standard form
**1994. László Lovász proves that the Shannon capacity of a graph - related to the maximum independent set problem - can be computed exactly by solving a single semidefinite program.** The maximum independent set problem is NP-hard in general. For this specific graph parameter, a matrix inequality closed the gap. One year later, Goemans and Williamson showed that MAX-CUT - another NP-hard combinatorial problem - achieves a 0.878 approximation ratio via SDP. That ratio has not been improved in 30 years. The tool that enabled both results was not a clever combinatorial argument but a change of variable: replace a scalar or vector with a **matrix**.
LP has the form $\min c^\top x$ subject to $Ax \leq b$, where the variable $x \in \mathbb{R}^n$ is a vector. SDP lifts the variable to a symmetric matrix $X \in \mathbb{S}^n$ and replaces the componentwise inequality $x \geq 0$ with the matrix inequality $X \succeq 0$ - positive semidefiniteness. Every LP is an SDP (take $X = \text{diag}(x)$), but SDP strictly contains LP.
**SDP standard (primal) form:** $$\min_{X \in \mathbb{S}^n} \text{Tr}(CX) \quad \text{s.t.} \quad \text{Tr}(A_i X) = b_i,\ i=1,\ldots,m, \quad X \succeq 0$$ - $C, A_i \in \mathbb{S}^n$ are symmetric parameter matrices - $\text{Tr}(CX) = \sum_{i,j} C_{ij} X_{ij}$ is the inner product on $\mathbb{S}^n$, linear in $X$ - $X \succeq 0$ means all eigenvalues of $X$ are non-negative (the PSD cone) **Linear Matrix Inequality (LMI) form:** $$\min_x c^\top x \quad \text{s.t.} \quad F(x) = F_0 + x_1 F_1 + \cdots + x_n F_n \succeq 0$$ where $F_0, F_i \in \mathbb{S}^k$ are fixed matrices and $x \in \mathbb{R}^n$ is the variable. LMI form and the matrix form are equivalent - each converts to the other by duality.
SDP appears across engineering. In control theory, stability of $\dot{z} = Az$ requires a Lyapunov matrix $P \succ 0$ satisfying $AP + PA^\top \prec 0$ - a pair of LMIs in $P$. In signal processing, filter design with frequency-response constraints reduces to SDP. In quantum information, a quantum state satisfies $\rho \succeq 0$, $\text{Tr}(\rho) = 1$ - a single LMI with one affine constraint. CVXPY handles all these with the same interface.
LP has the constraint $x \geq 0$ (componentwise). SDP replaces it with $X \succeq 0$ (PSD). What makes the PSD constraint a strictly larger class than componentwise non-negativity?
SDP duality and complementary slackness
LP duality transforms $\min c^\top x$, $Ax = b$, $x \geq 0$ into $\max b^\top y$, $A^\top y \leq c$. SDP duality follows the same structure - the scalar cone $x \geq 0$ becomes the PSD cone $X \succeq 0$ in both the primal and the dual.
**SDP primal-dual pair:** $$\text{Primal: } \min_X \text{Tr}(CX) \quad \text{s.t.} \quad \text{Tr}(A_i X) = b_i,\ X \succeq 0$$ $$\text{Dual: } \max_y b^\top y \quad \text{s.t.} \quad S = C - \sum_{i=1}^m y_i A_i \succeq 0$$ **Weak duality** always holds: $\text{Tr}(CX) \geq b^\top y$ for any feasible $(X, y)$. **Strong duality** holds under Slater's condition: if there exists $X \succ 0$ (strictly positive definite) feasible for the primal, then the duality gap is zero at optimality. **Complementary slackness at optimum:** $$S^* X^* = 0, \quad S^* = C - \sum_i y_i^* A_i \succeq 0, \quad X^* \succeq 0$$ This is the matrix analog of $\lambda_i^* f_i(x^*) = 0$ in KKT: the dual slack $S^*$ and primal variable $X^*$ are both PSD and their product is zero.
**Why SDP duality is strictly harder than LP duality.** In LP, strong duality holds whenever either the primal or dual is feasible (and the other is bounded). In SDP, a duality gap can exist even when both primal and dual are feasible - a situation that never arises in LP. A concrete pathological example: $\min X_{11}$ subject to $X_{12} = 1$, $X \succeq 0$. The infimum is 0 but is never attained; the dual is infeasible. CVXPY will report `infeasible` or `unbounded` in such cases.
Interior point methods for SDP use the matrix log-barrier $-\log \det X$ - the natural extension of $-\sum_i \log x_i$ from LP. The central path for SDP satisfies $tC - \sum_i y_i A_i = X^{-1}$, a matrix equation. MOSEK solves SDP to $\varepsilon$-accuracy in $O(\sqrt{n} \log(1/\varepsilon))$ iterations, each requiring a Cholesky factorization of an $n \times n$ matrix - $O(n^3)$ per step, $O(n^{3.5})$ total.
SDP strong duality always holds under the same conditions as LP strong duality (primal and dual feasibility).
SDP can have a duality gap even when both primal and dual are feasible. Slater's condition - existence of a strictly feasible $X \succ 0$ - is required for strong duality.
LP's strong duality relies on the polyhedral structure of the feasible set (finite vertices, extreme rays). The PSD cone is not polyhedral - it has curved faces where pathological configurations arise. Slater's condition rules these out by requiring a strictly interior point.
At the SDP optimum, complementary slackness requires $S^* X^* = 0$ where both $S^*, X^* \succeq 0$. What does this imply about the column spaces of $S^*$ and $X^*$?
SDP relaxation: MAX-CUT and Goemans-Williamson
MAX-CUT: given an undirected weighted graph $G = (V, E, w)$, partition $V$ into two sets $S$ and $\bar{S}$ to maximize $\sum_{(i,j) \in E} w_{ij} \cdot \mathbf{1}[i \in S, j \notin S]$. Assign $y_i \in \{-1, +1\}$ to each vertex; edge $(i,j)$ is cut iff $y_i y_j = -1$. The cut weight is $\frac{1}{4} \sum_{ij} w_{ij}(1 - y_i y_j)$. Maximizing over $y_i \in \{-1,+1\}$ is NP-hard.
**Goemans-Williamson SDP relaxation (1995):** **Step 1.** Replace $y_i y_j \in \{-1, +1\}$ with a matrix entry $X_{ij}$, requiring $X \succeq 0$ and $X_{ii} = 1$: $$\max_{X \succeq 0} \frac{1}{4} \sum_{(i,j) \in E} w_{ij}(1 - X_{ij}) \quad \text{s.t.} \quad X_{ii} = 1\ \forall i$$ **Step 2.** Solve the SDP to get $X^*$. Factor: $X^* = V^\top V$ where $v_i = V e_i$ are unit vectors. **Step 3. Random rounding:** sample $r \sim \text{Uniform}(S^{n-1})$. Assign $y_i = \text{sign}(v_i^\top r)$. **Key geometric fact:** for unit vectors $v_i, v_j$ with $\langle v_i, v_j \rangle = X_{ij}^*$: $$\Pr[\text{sign}(v_i^\top r) \neq \text{sign}(v_j^\top r)] = \frac{\arccos(X_{ij}^*)}{\pi}$$ **Goemans-Williamson theorem:** $$\mathbb{E}[\text{cut}] \geq \alpha_{GW} \cdot \text{OPT}_{IP}, \quad \alpha_{GW} = \min_{\theta \in [0,\pi]} \frac{2}{\pi} \cdot \frac{\theta}{1 - \cos\theta} \approx 0.878$$ The minimum of the ratio $\arccos(t) / (\pi(1-t)/2)$ over $t \in [-1, 1]$ is achieved at $t \approx -0.689$. If Khot's Unique Games Conjecture holds, no polynomial-time algorithm can beat 0.878.
In Goemans-Williamson, the probability that edge $(i,j)$ is cut equals $\arccos(X_{ij}^*)/\pi$. Why does the constraint $X_{ii} = 1$ make this formula valid?
SDP in ML and control: eigenvalues, SOS, LQR
Semidefinite programming is the natural language for problems involving eigenvalue constraints, polynomial non-negativity, and stability certificates. Three domains where SDP replaced previously intractable or conservative methods: kernel learning, polynomial optimization, and control design.
**Eigenvalue optimization via LMI:** $\min_x \lambda_{\max}(A(x))$ where $A(x) = A_0 + x_1 A_1 + \cdots + x_n A_n$ is an affine matrix family. Equivalent SDP: $$\min_{x, t} t \quad \text{s.t.} \quad t I - A(x) \succeq 0$$ **Sum-of-Squares (SOS) and polynomial non-negativity:** Deciding $f(x) \geq 0$ for all $x$ where $f$ is degree-$2d$ in $n$ variables is NP-hard. The SOS relaxation: $$f(x) = z(x)^\top Q \, z(x), \quad Q \succeq 0$$ where $z(x)$ is the monomial vector up to degree $d$. This is a linear constraint on $Q$ plus PSD - a feasibility SDP. If an SOS decomposition exists, $f(x) \geq 0$ everywhere (sufficient, not necessary). **LQR control via Lyapunov SDP:** For $\dot{z} = Az + Bu$, the closed-loop stability condition $AP + PA^\top \prec 0$ with $P \succ 0$ is a pair of LMIs. The Schur complement converts the Riccati inequality into a linear SDP in $P$. MOSEK solves this for systems with $n \leq 1000$ states in under a second. **DSOS/SDSOS (Ahmadi-Majumdar):** replace $Q \succeq 0$ with diagonally dominant (DSOS) or scaled diagonally dominant (SDSOS) matrices. These are LP/SOCP feasibility problems - 10-100x faster, slightly more conservative.
PSD kernel learning in ML: the constraint that $k(x, x')$ is a valid kernel is exactly $K \succeq 0$ on any training set. Multiple kernel learning (MKL) combines base kernels $k = \sum_m \theta_m k_m$ with $\theta \geq 0$, $\mathbf{1}^\top \theta = 1$, and adds a trace constraint $\text{Tr}(K) \leq 1$ for normalization. The joint optimization over $\theta$ and the kernel SVM is an SDP. Interior point solves this in $O(n^{3.5})$ - practical for $n \leq 1000$ samples.
The connection to interior point (lesson 08): SDP is solved by interior point via the matrix log-barrier $-\log \det X$. Nesterov and Nemirovskii proved self-concordance of $-\log \det X$ in 1994 alongside the scalar case. The central path satisfies $X^{-1} = t(C - \sum_i y_i A_i)$. The Newton step requires solving a system whose coefficient matrix involves Kronecker products of $X$ - this is the $O(n^{3.5})$ bottleneck that limits SDP to $n \lesssim 2000$ in practice. For larger scale: ADMM-based solvers (SCS, CDCS) sacrifice per-iteration precision for scalability.
SDP solves $\min_x \lambda_{\max}(A(x))$ as a single LMI: $\min t$ s.t. $tI - A(x) \succeq 0$. Why can the same trick not directly minimize $\lambda_{\max}^2(A(x))$?
Key Ideas
- **SDP standard form:** $\min \text{Tr}(CX)$ s.t. $\text{Tr}(A_i X) = b_i$, $X \succeq 0$ - the variable is a PSD matrix; LP is the special case $X = \text{diag}(x)$
- **LMI:** $F_0 + \sum_i x_i F_i \succeq 0$ is the constraint form used in control and filter design; equivalent to SDP by duality
- **Duality gap:** unlike LP, SDP can have a gap even when both primal and dual are feasible; Slater's condition ($\exists X \succ 0$ strictly feasible) closes the gap
- **Goemans-Williamson:** SDP relaxation of MAX-CUT + random rounding achieves $\mathbb{E}[\text{cut}] \geq 0.878 \cdot \text{OPT}$ - the best known polynomial-time ratio, likely optimal under UGC
- **Applications:** eigenvalue minimization (LMI), polynomial SOS certificates, PSD kernel learning, LQR control - all reduce to SDP, solved by interior point in $O(n^{3.5})$
What's next
SDP marks the boundary of tractable conic programming. Beyond it lie non-convex problems, distributed methods, and stochastic optimization:
- Interior Point Method — The matrix log-barrier -log det X is the engine behind every SDP solver; self-concordance gives the $O(n^{3.5})$ complexity
- KKT Conditions — SDP duality and complementary slackness S*X*=0 are the matrix-valued extension of the KKT framework
- Proximal Methods — For large-scale SDP where interior point is too expensive, ADMM-based solvers (SCS, CDCS) apply proximal splitting to the PSD cone projection
Вопросы для размышления
- Goemans-Williamson achieves 0.878 for MAX-CUT via SDP relaxation. The SDP upper-bounds OPT and random rounding lower-bounds it. What property of the random rounding scheme makes the 0.878 ratio independent of the specific graph structure?
- SDP interior point costs $O(n^{3.5})$ per solve. For a control system with $n = 5000$ states, this is infeasible. What structural properties of the Lyapunov LMI (sparsity, block structure) could be exploited to reduce cost, and what solver class would handle this?
- SOS relaxation certifies $f(x) \geq 0$ by finding $Q \succeq 0$ with $f = z^\top Q z$. If SOS fails (no such $Q$ exists), does that mean $f$ can be negative somewhere? Describe a polynomial that is globally non-negative but not SOS.
Связанные уроки
- cvx-05 — Strong duality and complementary slackness in SDP directly extend the KKT theory
- cvx-08 — SDP is solved by interior point via the matrix log-barrier -log det X
- cvx-01 — The PSD cone is the central convex set of SDP
- ml-13-svm — Kernel learning with PSD constraints is an SDP instance
- cvx-03
- la-25-quadratic-forms