Algebra

Quadratic Forms

Why does gradient descent converge slowly on elongated loss landscapes? Why is Adam faster than SGD? Because Adam implicitly estimates the Hessian. Quadratic forms are the language for describing local curvature of functions-the foundation of optimization in ML.

  • **Hessian in optimization:** Adam divides learning rate by sqrt of accumulated squared gradients - a diagonal Hessian approximation; condition number κ(H) determines convergence rate
  • **Covariance matrices:** GMM, LDA, QDA all use (x−μ)ᵀΣ⁻¹(x−μ); Mahalanobis distance = normalized quadratic form
  • **Second-order conditions:** local min ↔ H PD; saddle points (H indefinite) are the typical situation in DNN loss landscapes

Предварительные знания

  • Inner Products and Orthogonality

Quadratic Forms

A quadratic form is a function Q: ℝⁿ → ℝ of the form Q(x) = xᵀAx = Σᵢⱼ aᵢⱼxᵢxⱼ, where A is a symmetric matrix (WLOG, since xᵀAx = xᵀ((A+Aᵀ)/2)x). For n=2: Q(x₁,x₂) = ax₁² + 2bx₁x₂ + cx₂² with matrix [[a,b],[b,c]].

**Geometric interpretation:** level sets Q(x) = c are ellipsoids for PD matrices. The axes of the ellipsoid are the eigenvectors of A, and the semi-axes are proportional to 1/√λᵢ. This is the geometry of covariance matrices in statistics.

The matrix of the quadratic form Q(x) = 3x₁² + 4x₁x₂ + x₂² is:

Definiteness and Sylvester's Criterion

Matrix A is positive definite (PD) if Q(x) > 0 for all x ≠ 0. Equivalent conditions: 1. all eigenvalues λᵢ > 0 2. Sylvester's criterion: all leading principal minors (top-left submatrix determinants) are positive. Types: PD (all λ > 0), PSD (λ ≥ 0), ND (λ < 0), NSD, indefinite.

TypeEigenvalue conditionGeometryOptimization
PDAll λ > 0EllipsoidStrict local min
PSDAll λ ≥ 0Cylinder/flatWeak local min
NDAll λ < 0Inverted ellipsoidStrict local max
IndefiniteMixed signsSaddle surfaceSaddle point

The matrix [[2, 1], [1, 3]] is:

Hessian and Optimality Conditions

The Hessian H(x) is the matrix of second derivatives: H[i][j] = ∂²f/∂xᵢ∂xⱼ. Second-order Taylor expansion: f(x+δ) ≈ f(x) + ∇f(x)ᵀδ + ½δᵀH(x)δ. Second-order conditions: if ∇f(x*) = 0 and H(x*) is PD → strict local minimum; H PD everywhere → strictly convex.

In deep neural networks, most critical points are **saddle points** (indefinite Hessian), not local minima. Stochastic gradient descent naturally escapes saddle points via noise. This makes DNN optimization fundamentally different from convex optimization.

At a critical point x* (∇f = 0), the Hessian has eigenvalues λ₁ = 3, λ₂ = −1. Then x* is:

Covariance Matrices and Mahalanobis Distance

The covariance matrix Σ = E[(x−μ)(x−μ)ᵀ] is always PSD (and PD for non-degenerate distributions). Mahalanobis distance: d²(x,μ) = (x−μ)ᵀΣ⁻¹(x−μ)-a quadratic form with matrix Σ⁻¹. It accounts for correlations between features: a distance of 1 means one standard deviation in the natural geometry of the distribution.

**Mahalanobis distance** generalizes Euclidean distance to account for the covariance structure of the data. Used in GDA (Gaussian Discriminant Analysis), anomaly detection (chi-squared test), and the loss of some variational autoencoders.

A covariance matrix Σ is always:

Key Ideas

  • **Q(x) = xᵀAx:** quadratic form with symmetric A; level sets are ellipsoids for PD matrices
  • **Definiteness:** PD ↔ all λ > 0 ↔ Sylvester's criterion (leading minors > 0); PSD for covariance matrices
  • **Hessian:** second-order expansion f(x+δ) ≈ f(x)+∇fᵀδ+½δᵀHδ; PD Hessian → local min; indefinite → saddle
  • **Mahalanobis:** d²(x,μ) = (x−μ)ᵀΣ⁻¹(x−μ); accounts for feature correlations

Related Topics

Quadratic forms are the second-order language of optimization:

  • Eigenvalues and Eigenvectors — Eigenvalue signs determine the type of critical point (min/max/saddle)
  • SVD — SVD diagonalizes the matrix; singular values = sqrt(eigenvalues of AᵀA, a PSD quadratic form)
  • Inner Products — ⟨u,Av⟩ is a bilinear form; ⟨u,Au⟩ is the quadratic form-both share the same matrix A

Вопросы для размышления

  • Adam divides learning rate by the sqrt of accumulated squared gradients. How does this relate to the diagonal approximation of the Hessian and preconditioned gradient descent?
  • In DNNs, most critical points are saddle points, not local minima. Why is this good news for optimization rather than bad?
  • Mahalanobis distance uses Σ⁻¹. What happens when Σ is singular (rank-deficient)? How do practical algorithms handle this (e.g., using pseudoinverse or regularization)?

Связанные уроки

  • la-01-vectors-intro
Quadratic Forms

0

1

Sign In