Linear Algebra

Quadratic Forms: The Language of Optimization and Curvature

Near a minimum, the loss surface of a neural network is a quadratic form defined by the Hessian. Whether Adam converges faster than SGD, why BatchNorm is needed, why GANs are unstable - all of this is determined by the spectrum of that quadratic form.

Optimization: loss near a minimum L(w) ≈ ½(w-w*)ᵀH(w-w*), H = Hessian
SVM: maximizing the margin is a quadratic program with linear constraints
Statistics: Mahalanobis distance is a quadratic form with the covariance matrix
Physics: kinetic energy of rotation is a quadratic form with the inertia tensor
Economics: second-order utility functions and stability analysis of equilibria

Definition and classification

**Every time gradient descent converges to a minimum, it lands at a point where the Hessian is positive definite.** Every time Adam stalls at a saddle point, the Hessian there is indefinite - some eigenvalues positive, some negative. Every time a GMM or Isolation Forest computes a Mahalanobis distance, a quadratic form is inside. All of this is xᵀAx, a single scalar that tells the optimizer everything about the curvature of the loss landscape around a point.

Zoom into any loss function near its minimum and the landscape becomes a **quadratic form** - its matrix is the Hessian. Positive definite Hessian → bowl → SGD converges. Indefinite Hessian → saddle → gradient descent stalls. Loss landscapes, sharp vs flat minima, the reason Adam escapes saddles while SGD gets stuck: all of it is quadratic-form theory in disguise.

What a quadratic form is

A quadratic form is a function of vector x, written through a symmetric matrix A:

Q(x) = xᵀ A x (a scalar, A is symmetric) FOR n = 2 (expanded): x = (x₁, x₂), A = [a b] [b c] Q(x) = [x₁ x₂] · [a b] · [x₁] = ax₁² + 2bx₁x₂ + cx₂² [b c] [x₂] FOR n = 3: Q(x) = a₁₁x₁² + a₂₂x₂² + a₃₃x₃² + 2a₁₂x₁x₂ + 2a₁₃x₁x₃ + 2a₂₃x₂x₃ KEY PROPERTY: Q(x) is always a homogeneous polynomial of degree 2. Q(tx) = t² · Q(x) for any scalar t

Intuitively: a quadratic form is the generalization of ax² to multiple dimensions. The question is **which way this paraboloid opens** - upward (minimum), downward (maximum), or in both directions at once (saddle).

Classification: five types of forms

The sign of Q(x) for all nonzero x determines the type of form - the geometry of the surface, and the behavior of any optimizer:

Type	Condition	Eigenvalues of A	Geometry	In optimization
Positive definite	Q(x) > 0 for x != 0	All lambda_i > 0	Paraboloid (bowl up)	Local minimum
Negative definite	Q(x) < 0 for x != 0	All lambda_i < 0	Inverted paraboloid	Local maximum
Positive semidefinite	Q(x) >= 0	All lambda_i >= 0, some zero	Cylindrical paraboloid	Boundary (no strict min)
Negative semidefinite	Q(x) <= 0	All lambda_i <= 0, some zero	Inverted cylinder	Boundary
Indefinite	Q(x) changes sign	Both lambda > 0 and lambda < 0	Saddle surface	Saddle point

**Mnemonic via eigenvalues**: the sign of Q(x) is completely determined by the signs of the eigenvalues of A. To classify a form, find the eigenvalues. That is exactly what `np.linalg.eigvalsh` does when analyzing a Hessian.

What does positive definiteness of Q(x) = xᵀAx mean?

Positive definiteness means Q(x) > 0 whenever x ≠ 0. Equivalently: all eigenvalues of A are positive. Central in optimisation: a positive-definite Hessian guarantees a critical point is a local minimum.

Diagonalisation and level-set geometry

Geometry: level curves and diagonalization

The level sets Q(x) = c are quadrics: ellipsoids, hyperboloids, paraboloids. The eigenvectors of A define the **principal axes** of these quadrics; the eigenvalues define the **curvature** along each axis.

THEOREM: any Q(x) = xᵀAx can be reduced to canonical form by the substitution x = Py, where P holds the eigenvectors of A: Q = xᵀAx = (Py)ᵀ A (Py) = yᵀ (PᵀAP) y = yᵀLy where L = diag(lambda1, ..., lambdan) - diagonal matrix of eigenvalues CANONICAL FORM: Q = lambda1*y1² + lambda2*y2² + ... + lambdan*yn² MEANING: in the coordinate system of the eigenvectors - no cross terms yi*yj - each lambda_i is the curvature along the i-th axis EXAMPLE: A = [3 1] lambda1 = 4, lambda2 = 2 [1 3] Q(x) = 3x1² + 2x1x2 + 3x2² -> in diagonal coordinates: Q(y) = 4y1² + 2y2² Ellipse: semi-axes 1/sqrt(4) = 0.5 and 1/sqrt(2) ~= 0.707

What do the level sets Q(x) = c of a positive-definite quadratic form look like?

In the orthonormal eigenbasis of A the form becomes Q = sum lambda_i y_i^2 = c, an ellipsoid with semi-axes sqrt(c/lambda_i). The principal axes are the eigenvectors of A. This is the picture of loss landscapes near a minimum.

Sylvester's criterion and ML applications

Sylvester's criterion: checking positive definiteness efficiently

Computing all eigenvalues of a large matrix is expensive. **Sylvester's criterion** (leading minors) checks positive definiteness without finding eigenvalues:

MATRIX A is positive definite if and only if all leading principal minors are positive: D1 = a11 > 0 D2 = det[a11 a12] > 0 [a21 a22] D3 = det(first 3 rows and columns) > 0 ... Dn = det(A) > 0 EXAMPLE: A = [2 1] [1 3] D1 = 2 > 0 check D2 = 2*3 - 1*1 = 5 > 0 check -> A is positive definite Verification via eigenvalues: lambda = (5 +/- sqrt(5))/2 ~= 3.618 and 1.382 (both > 0) check

ML application #1: Hessian and the curvature of the loss

The loss function L(theta) of a neural network near a point theta_0 admits a Taylor expansion. The quadratic term is a quadratic form of the Hessian:

EXPANSION NEAR theta_0: L(theta) ~= L(theta_0) + grad_L(theta_0)^T * Dtheta + (1/2) * Dtheta^T * H(theta_0) * Dtheta + ... where H = d²L/dtheta_i dtheta_j - Hessian matrix Dtheta = theta - theta_0 - deviation from the point MINIMUM CRITERION (at a critical point where grad_L = 0): H is positive definite -> local minimum H is negative definite -> local maximum H is indefinite -> saddle point FOR QUADRATIC LOSS (MSE): L(theta) = ||X*theta - y||² = theta^T (X^T X) theta - 2y^T X theta + const H = 2 X^T X X^T X is always PSD -> function is convex -> minimum is unique!

**Why Adam works better than SGD**: SGD uses the same step size in every direction. Adam adapts the step per direction, approximating **Newton's method** (step proportional to H⁻¹ * grad_L). On an ill-conditioned Hessian (elongated ellipse) this makes a fundamental difference.

ML application #2: Newton's method and K-FAC

**Newton's method** updates parameters using curvature: theta -> theta - H⁻¹ * grad_L. On quadratic functions it converges in one step. For a network with millions of parameters, H⁻¹ cannot be computed directly. **K-FAC** (Kronecker-Factored Approximate Curvature) approximates the Hessian through layer structure and converges several times faster than Adam.

OPTIMIZATION TASK: min L(theta) -> in the quadratic case: min theta^T H theta / 2 - b^T theta NEWTON'S METHOD (one step on a quadratic function): theta* = theta_0 - H^-1 * grad_L(theta_0) For MSE regression L = ||X*theta - y||²: H = 2 X^T X grad_L = 2 X^T (X*theta - y) theta* = theta_0 - (2 X^T X)^-1 * 2 X^T (X*theta_0 - y) = (X^T X)^-1 X^T y <-- the normal equation! Newton's method on MSE = exact answer in one step. This works only because MSE is a quadratic form: the Hessian is constant, not dependent on theta.

ML application #3: Mahalanobis distance and anomaly detection

Euclidean distance ignores feature correlations. If "height" and "weight" are correlated, the point (2m, 100kg) is normal while (2m, 30kg) is an anomaly - yet their Euclidean distances to the center can look similar. **Mahalanobis distance** is a quadratic form with the inverse covariance matrix:

What to take from this lesson

**Q(x) = xᵀAx** - a scalar describing curvature around a point; all 5 types are determined by the signs of the eigenvalues of A
**Positive definite** (all lambda > 0) = paraboloid, minimum, optimizer convergence
**Sylvester's criterion**: all leading minors > 0 if and only if A is positive definite
**Hessian** = quadratic form describing the curvature of the loss; its definiteness decides the fate of a critical point
**MSE is convex** because its Hessian 2X^T X is always PSD; linear regression has a unique global minimum
**Mahalanobis distance** = quadratic form with Sigma^-1 - accounts for feature correlations in anomaly detection
**Covariance matrix** is always PSD by construction: Sigma = E[(x-mu)(x-mu)^T] >= 0

Connections

Quadratic forms are the bridge between linear algebra and optimization

Eigenvectors and diagonalization — Diagonalizing a quadratic form: eigenvectors are the principal axes, eigenvalues are the curvature
Inverse matrix — Mahalanobis distance requires Sigma^-1; a PSD matrix is invertible if and only if it is positive definite
SVD — SVD decomposes a matrix through orthogonal bases - a generalization of quadratic form diagonalization
Gram-Schmidt process — The orthonormal basis of eigenvectors is exactly the diagonalizing transformation for a quadratic form

Связанные уроки

stats-21