Algebra

Quadratic Forms

Why does gradient descent converge slowly on elongated loss landscapes? Why is Adam faster than SGD? Because Adam implicitly estimates the Hessian. Quadratic forms are the language for describing local curvature of functions-the foundation of optimization in ML.

**Hessian in optimization:** Adam divides learning rate by sqrt of accumulated squared gradients - a diagonal Hessian approximation; condition number κ(H) determines convergence rate
**Covariance matrices:** GMM, LDA, QDA all use (x−μ)ᵀΣ⁻¹(x−μ); Mahalanobis distance = normalized quadratic form
**Second-order conditions:** local min ↔ H PD; saddle points (H indefinite) are the typical situation in DNN loss landscapes

Предварительные знания

Inner Products and Orthogonality

Quadratic Forms

A quadratic form is a function Q: ℝⁿ → ℝ of the form Q(x) = xᵀAx = Σᵢⱼ aᵢⱼxᵢxⱼ, where A is a symmetric matrix (WLOG, since xᵀAx = xᵀ((A+Aᵀ)/2)x). For n=2: Q(x₁,x₂) = ax₁² + 2bx₁x₂ + cx₂² with matrix [[a,b],[b,c]].

**Geometric interpretation:** level sets Q(x) = c are ellipsoids for PD matrices. The axes of the ellipsoid are the eigenvectors of A, and the semi-axes are proportional to 1/√λᵢ. This is the geometry of covariance matrices in statistics.

The matrix of the quadratic form Q(x) = 3x₁² + 4x₁x₂ + x₂² is:

Definiteness and Sylvester's Criterion

Matrix A is positive definite (PD) if Q(x) > 0 for all x ≠ 0. Equivalent conditions: 1. all eigenvalues λᵢ > 0 2. Sylvester's criterion: all leading principal minors (top-left submatrix determinants) are positive. Types: PD (all λ > 0), PSD (λ ≥ 0), ND (λ < 0), NSD, indefinite.

Type	Eigenvalue condition	Geometry	Optimization
PD	All λ > 0	Ellipsoid	Strict local min
PSD	All λ ≥ 0	Cylinder/flat	Weak local min
ND	All λ < 0	Inverted ellipsoid	Strict local max
Indefinite	Mixed signs	Saddle surface	Saddle point

The matrix [[2, 1], [1, 3]] is:

Hessian and Optimality Conditions

The Hessian H(x) is the matrix of second derivatives: H[i][j] = ∂²f/∂xᵢ∂xⱼ. Second-order Taylor expansion: f(x+δ) ≈ f(x) + ∇f(x)ᵀδ + ½δᵀH(x)δ. Second-order conditions: if ∇f(x*) = 0 and H(x*) is PD → strict local minimum; H PD everywhere → strictly convex.

In deep neural networks, most critical points are **saddle points** (indefinite Hessian), not local minima. Stochastic gradient descent naturally escapes saddle points via noise. This makes DNN optimization fundamentally different from convex optimization.

At a critical point x* (∇f = 0), the Hessian has eigenvalues λ₁ = 3, λ₂ = −1. Then x* is:

Covariance Matrices and Mahalanobis Distance

The covariance matrix Σ = E[(x−μ)(x−μ)ᵀ] is always PSD (and PD for non-degenerate distributions). Mahalanobis distance: d²(x,μ) = (x−μ)ᵀΣ⁻¹(x−μ)-a quadratic form with matrix Σ⁻¹. It accounts for correlations between features: a distance of 1 means one standard deviation in the natural geometry of the distribution.

**Mahalanobis distance** generalizes Euclidean distance to account for the covariance structure of the data. Used in GDA (Gaussian Discriminant Analysis), anomaly detection (chi-squared test), and the loss of some variational autoencoders.

A covariance matrix Σ is always:

Key Ideas

**Q(x) = xᵀAx:** quadratic form with symmetric A; level sets are ellipsoids for PD matrices
**Definiteness:** PD ↔ all λ > 0 ↔ Sylvester's criterion (leading minors > 0); PSD for covariance matrices
**Hessian:** second-order expansion f(x+δ) ≈ f(x)+∇fᵀδ+½δᵀHδ; PD Hessian → local min; indefinite → saddle
**Mahalanobis:** d²(x,μ) = (x−μ)ᵀΣ⁻¹(x−μ); accounts for feature correlations

Вопросы для размышления

Adam divides learning rate by the sqrt of accumulated squared gradients. How does this relate to the diagonal approximation of the Hessian and preconditioned gradient descent?
In DNNs, most critical points are saddle points, not local minima. Why is this good news for optimization rather than bad?
Mahalanobis distance uses Σ⁻¹. What happens when Σ is singular (rank-deficient)? How do practical algorithms handle this (e.g., using pseudoinverse or regularization)?

Связанные уроки

la-01-vectors-intro

Quadratic Forms

The matrix of the quadratic form Q(x) = 3x₁² + 4x₁x₂ + x₂² is:

Definiteness and Sylvester's Criterion

Type

Eigenvalue condition

Geometry

Optimization

All λ > 0

Ellipsoid

Strict local min

PSD

All λ ≥ 0

Cylinder/flat

Weak local min

All λ < 0

Inverted ellipsoid

Strict local max

Indefinite

Mixed signs

Saddle surface

Saddle point

The matrix [[2, 1], [1, 3]] is:

Hessian and Optimality Conditions

At a critical point x* (∇f = 0), the Hessian has eigenvalues λ₁ = 3, λ₂ = −1. Then x* is:

Covariance Matrices and Mahalanobis Distance

A covariance matrix Σ is always:

Key Ideas

**Q(x) = xᵀAx:** quadratic form with symmetric A; level sets are ellipsoids for PD matrices

**Definiteness:** PD ↔ all λ > 0 ↔ Sylvester's criterion (leading minors > 0); PSD for covariance matrices

**Hessian:** second-order expansion f(x+δ) ≈ f(x)+∇fᵀδ+½δᵀHδ; PD Hessian → local min; indefinite → saddle

**Mahalanobis:** d²(x,μ) = (x−μ)ᵀΣ⁻¹(x−μ); accounts for feature correlations

Quadratic Forms

Предварительные знания

Quadratic Forms

Definiteness and Sylvester's Criterion

Hessian and Optimality Conditions

Covariance Matrices and Mahalanobis Distance

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки

Quadratic Forms

Предварительные знания

Quadratic Forms

Definiteness and Sylvester's Criterion

Hessian and Optimality Conditions

Covariance Matrices and Mahalanobis Distance

Key Ideas

Related Topics

Вопросы для размышления

Связанные уроки