Linear Algebra
Quadratic Forms: The Language of Optimization and Curvature
Near a minimum, the loss surface of a neural network is a quadratic form defined by the Hessian. Whether Adam converges faster than SGD, why BatchNorm is needed, why GANs are unstable - all of this is determined by the spectrum of that quadratic form.
- Optimization: loss near a minimum L(w) ≈ ½(w-w*)ᵀH(w-w*), H = Hessian
- SVM: maximizing the margin is a quadratic program with linear constraints
- Statistics: Mahalanobis distance is a quadratic form with the covariance matrix
- Physics: kinetic energy of rotation is a quadratic form with the inertia tensor
- Economics: second-order utility functions and stability analysis of equilibria
Definition and classification
**Every time gradient descent converges to a minimum, it lands at a point where the Hessian is positive definite.** Every time Adam stalls at a saddle point, the Hessian there is indefinite - some eigenvalues positive, some negative. Every time a GMM or Isolation Forest computes a Mahalanobis distance, a quadratic form is inside. All of this is xᵀAx, a single scalar that tells the optimizer everything about the curvature of the loss landscape around a point.
Zoom into any loss function near its minimum and the landscape becomes a **quadratic form** - its matrix is the Hessian. Positive definite Hessian → bowl → SGD converges. Indefinite Hessian → saddle → gradient descent stalls. Loss landscapes, sharp vs flat minima, the reason Adam escapes saddles while SGD gets stuck: all of it is quadratic-form theory in disguise.
What a quadratic form is
A quadratic form is a function of vector x, written through a symmetric matrix A:
Q(x) = xᵀ A x (a scalar, A is symmetric) FOR n = 2 (expanded): x = (x₁, x₂), A = [a b] [b c] Q(x) = [x₁ x₂] · [a b] · [x₁] = ax₁² + 2bx₁x₂ + cx₂² [b c] [x₂] FOR n = 3: Q(x) = a₁₁x₁² + a₂₂x₂² + a₃₃x₃² + 2a₁₂x₁x₂ + 2a₁₃x₁x₃ + 2a₂₃x₂x₃ KEY PROPERTY: Q(x) is always a homogeneous polynomial of degree 2. Q(tx) = t² · Q(x) for any scalar t
Intuitively: a quadratic form is the generalization of ax² to multiple dimensions. The question is **which way this paraboloid opens** - upward (minimum), downward (maximum), or in both directions at once (saddle).
Classification: five types of forms
The sign of Q(x) for all nonzero x determines the type of form - the geometry of the surface, and the behavior of any optimizer:
| Type | Condition | Eigenvalues of A | Geometry | In optimization |
|---|---|---|---|---|
| Positive definite | Q(x) > 0 for x != 0 | All lambda_i > 0 | Paraboloid (bowl up) | Local minimum |
| Negative definite | Q(x) < 0 for x != 0 | All lambda_i < 0 | Inverted paraboloid | Local maximum |
| Positive semidefinite | Q(x) >= 0 | All lambda_i >= 0, some zero | Cylindrical paraboloid | Boundary (no strict min) |
| Negative semidefinite | Q(x) <= 0 | All lambda_i <= 0, some zero | Inverted cylinder | Boundary |
| Indefinite | Q(x) changes sign | Both lambda > 0 and lambda < 0 | Saddle surface | Saddle point |
**Mnemonic via eigenvalues**: the sign of Q(x) is completely determined by the signs of the eigenvalues of A. To classify a form, find the eigenvalues. That is exactly what `np.linalg.eigvalsh` does when analyzing a Hessian.
What does positive definiteness of Q(x) = xᵀAx mean?
Positive definiteness means Q(x) > 0 whenever x ≠ 0. Equivalently: all eigenvalues of A are positive. Central in optimisation: a positive-definite Hessian guarantees a critical point is a local minimum.
Diagonalisation and level-set geometry
Geometry: level curves and diagonalization
The level sets Q(x) = c are quadrics: ellipsoids, hyperboloids, paraboloids. The eigenvectors of A define the **principal axes** of these quadrics; the eigenvalues define the **curvature** along each axis.
THEOREM: any Q(x) = xᵀAx can be reduced to canonical form by the substitution x = Py, where P holds the eigenvectors of A: Q = xᵀAx = (Py)ᵀ A (Py) = yᵀ (PᵀAP) y = yᵀLy where L = diag(lambda1, ..., lambdan) - diagonal matrix of eigenvalues CANONICAL FORM: Q = lambda1*y1² + lambda2*y2² + ... + lambdan*yn² MEANING: in the coordinate system of the eigenvectors - no cross terms yi*yj - each lambda_i is the curvature along the i-th axis EXAMPLE: A = [3 1] lambda1 = 4, lambda2 = 2 [1 3] Q(x) = 3x1² + 2x1x2 + 3x2² -> in diagonal coordinates: Q(y) = 4y1² + 2y2² Ellipse: semi-axes 1/sqrt(4) = 0.5 and 1/sqrt(2) ~= 0.707
What do the level sets Q(x) = c of a positive-definite quadratic form look like?
In the orthonormal eigenbasis of A the form becomes Q = sum lambda_i y_i^2 = c, an ellipsoid with semi-axes sqrt(c/lambda_i). The principal axes are the eigenvectors of A. This is the picture of loss landscapes near a minimum.
Sylvester's criterion and ML applications
Sylvester's criterion: checking positive definiteness efficiently
Computing all eigenvalues of a large matrix is expensive. **Sylvester's criterion** (leading minors) checks positive definiteness without finding eigenvalues:
MATRIX A is positive definite if and only if all leading principal minors are positive: D1 = a11 > 0 D2 = det[a11 a12] > 0 [a21 a22] D3 = det(first 3 rows and columns) > 0 ... Dn = det(A) > 0 EXAMPLE: A = [2 1] [1 3] D1 = 2 > 0 check D2 = 2*3 - 1*1 = 5 > 0 check -> A is positive definite Verification via eigenvalues: lambda = (5 +/- sqrt(5))/2 ~= 3.618 and 1.382 (both > 0) check
ML application #1: Hessian and the curvature of the loss
The loss function L(theta) of a neural network near a point theta_0 admits a Taylor expansion. The quadratic term is a quadratic form of the Hessian:
EXPANSION NEAR theta_0: L(theta) ~= L(theta_0) + grad_L(theta_0)^T * Dtheta + (1/2) * Dtheta^T * H(theta_0) * Dtheta + ... where H = d²L/dtheta_i dtheta_j - Hessian matrix Dtheta = theta - theta_0 - deviation from the point MINIMUM CRITERION (at a critical point where grad_L = 0): H is positive definite -> local minimum H is negative definite -> local maximum H is indefinite -> saddle point FOR QUADRATIC LOSS (MSE): L(theta) = ||X*theta - y||² = theta^T (X^T X) theta - 2y^T X theta + const H = 2 X^T X X^T X is always PSD -> function is convex -> minimum is unique!
**Why Adam works better than SGD**: SGD uses the same step size in every direction. Adam adapts the step per direction, approximating **Newton's method** (step proportional to H⁻¹ * grad_L). On an ill-conditioned Hessian (elongated ellipse) this makes a fundamental difference.
ML application #2: Newton's method and K-FAC
**Newton's method** updates parameters using curvature: theta -> theta - H⁻¹ * grad_L. On quadratic functions it converges in one step. For a network with millions of parameters, H⁻¹ cannot be computed directly. **K-FAC** (Kronecker-Factored Approximate Curvature) approximates the Hessian through layer structure and converges several times faster than Adam.
OPTIMIZATION TASK: min L(theta) -> in the quadratic case: min theta^T H theta / 2 - b^T theta NEWTON'S METHOD (one step on a quadratic function): theta* = theta_0 - H^-1 * grad_L(theta_0) For MSE regression L = ||X*theta - y||²: H = 2 X^T X grad_L = 2 X^T (X*theta - y) theta* = theta_0 - (2 X^T X)^-1 * 2 X^T (X*theta_0 - y) = (X^T X)^-1 X^T y <-- the normal equation! Newton's method on MSE = exact answer in one step. This works only because MSE is a quadratic form: the Hessian is constant, not dependent on theta.
ML application #3: Mahalanobis distance and anomaly detection
Euclidean distance ignores feature correlations. If "height" and "weight" are correlated, the point (2m, 100kg) is normal while (2m, 30kg) is an anomaly - yet their Euclidean distances to the center can look similar. **Mahalanobis distance** is a quadratic form with the inverse covariance matrix:
What to take from this lesson
- **Q(x) = xᵀAx** - a scalar describing curvature around a point; all 5 types are determined by the signs of the eigenvalues of A
- **Positive definite** (all lambda > 0) = paraboloid, minimum, optimizer convergence
- **Sylvester's criterion**: all leading minors > 0 if and only if A is positive definite
- **Hessian** = quadratic form describing the curvature of the loss; its definiteness decides the fate of a critical point
- **MSE is convex** because its Hessian 2X^T X is always PSD; linear regression has a unique global minimum
- **Mahalanobis distance** = quadratic form with Sigma^-1 - accounts for feature correlations in anomaly detection
- **Covariance matrix** is always PSD by construction: Sigma = E[(x-mu)(x-mu)^T] >= 0
Connections
Quadratic forms are the bridge between linear algebra and optimization
- Eigenvectors and diagonalization — Diagonalizing a quadratic form: eigenvectors are the principal axes, eigenvalues are the curvature
- Inverse matrix — Mahalanobis distance requires Sigma^-1; a PSD matrix is invertible if and only if it is positive definite
- SVD — SVD decomposes a matrix through orthogonal bases - a generalization of quadratic form diagonalization
- Gram-Schmidt process — The orthonormal basis of eigenvectors is exactly the diagonalizing transformation for a quadratic form