Algebra
Quadratic Forms
Why does gradient descent converge slowly on elongated loss landscapes? Why is Adam faster than SGD? Because Adam implicitly estimates the Hessian. Quadratic forms are the language for describing local curvature of functions-the foundation of optimization in ML.
- **Hessian in optimization:** Adam divides learning rate by sqrt of accumulated squared gradients - a diagonal Hessian approximation; condition number κ(H) determines convergence rate
- **Covariance matrices:** GMM, LDA, QDA all use (x−μ)ᵀΣ⁻¹(x−μ); Mahalanobis distance = normalized quadratic form
- **Second-order conditions:** local min ↔ H PD; saddle points (H indefinite) are the typical situation in DNN loss landscapes
Предварительные знания
Quadratic Forms
A quadratic form is a function Q: ℝⁿ → ℝ of the form Q(x) = xᵀAx = Σᵢⱼ aᵢⱼxᵢxⱼ, where A is a symmetric matrix (WLOG, since xᵀAx = xᵀ((A+Aᵀ)/2)x). For n=2: Q(x₁,x₂) = ax₁² + 2bx₁x₂ + cx₂² with matrix [[a,b],[b,c]].
**Geometric interpretation:** level sets Q(x) = c are ellipsoids for PD matrices. The axes of the ellipsoid are the eigenvectors of A, and the semi-axes are proportional to 1/√λᵢ. This is the geometry of covariance matrices in statistics.
The matrix of the quadratic form Q(x) = 3x₁² + 4x₁x₂ + x₂² is:
Definiteness and Sylvester's Criterion
Matrix A is positive definite (PD) if Q(x) > 0 for all x ≠ 0. Equivalent conditions: 1. all eigenvalues λᵢ > 0 2. Sylvester's criterion: all leading principal minors (top-left submatrix determinants) are positive. Types: PD (all λ > 0), PSD (λ ≥ 0), ND (λ < 0), NSD, indefinite.
| Type | Eigenvalue condition | Geometry | Optimization |
|---|---|---|---|
| PD | All λ > 0 | Ellipsoid | Strict local min |
| PSD | All λ ≥ 0 | Cylinder/flat | Weak local min |
| ND | All λ < 0 | Inverted ellipsoid | Strict local max |
| Indefinite | Mixed signs | Saddle surface | Saddle point |
The matrix [[2, 1], [1, 3]] is:
Hessian and Optimality Conditions
The Hessian H(x) is the matrix of second derivatives: H[i][j] = ∂²f/∂xᵢ∂xⱼ. Second-order Taylor expansion: f(x+δ) ≈ f(x) + ∇f(x)ᵀδ + ½δᵀH(x)δ. Second-order conditions: if ∇f(x*) = 0 and H(x*) is PD → strict local minimum; H PD everywhere → strictly convex.
In deep neural networks, most critical points are **saddle points** (indefinite Hessian), not local minima. Stochastic gradient descent naturally escapes saddle points via noise. This makes DNN optimization fundamentally different from convex optimization.
At a critical point x* (∇f = 0), the Hessian has eigenvalues λ₁ = 3, λ₂ = −1. Then x* is:
Covariance Matrices and Mahalanobis Distance
The covariance matrix Σ = E[(x−μ)(x−μ)ᵀ] is always PSD (and PD for non-degenerate distributions). Mahalanobis distance: d²(x,μ) = (x−μ)ᵀΣ⁻¹(x−μ)-a quadratic form with matrix Σ⁻¹. It accounts for correlations between features: a distance of 1 means one standard deviation in the natural geometry of the distribution.
**Mahalanobis distance** generalizes Euclidean distance to account for the covariance structure of the data. Used in GDA (Gaussian Discriminant Analysis), anomaly detection (chi-squared test), and the loss of some variational autoencoders.
A covariance matrix Σ is always:
Key Ideas
- **Q(x) = xᵀAx:** quadratic form with symmetric A; level sets are ellipsoids for PD matrices
- **Definiteness:** PD ↔ all λ > 0 ↔ Sylvester's criterion (leading minors > 0); PSD for covariance matrices
- **Hessian:** second-order expansion f(x+δ) ≈ f(x)+∇fᵀδ+½δᵀHδ; PD Hessian → local min; indefinite → saddle
- **Mahalanobis:** d²(x,μ) = (x−μ)ᵀΣ⁻¹(x−μ); accounts for feature correlations
Related Topics
Quadratic forms are the second-order language of optimization:
- Eigenvalues and Eigenvectors — Eigenvalue signs determine the type of critical point (min/max/saddle)
- SVD — SVD diagonalizes the matrix; singular values = sqrt(eigenvalues of AᵀA, a PSD quadratic form)
- Inner Products — ⟨u,Av⟩ is a bilinear form; ⟨u,Au⟩ is the quadratic form-both share the same matrix A
Вопросы для размышления
- Adam divides learning rate by the sqrt of accumulated squared gradients. How does this relate to the diagonal approximation of the Hessian and preconditioned gradient descent?
- In DNNs, most critical points are saddle points, not local minima. Why is this good news for optimization rather than bad?
- Mahalanobis distance uses Σ⁻¹. What happens when Σ is singular (rank-deficient)? How do practical algorithms handle this (e.g., using pseudoinverse or regularization)?