Differential Geometry

Smooth Manifolds

Robot orientations live on SO(3). Covariance matrices used in neural networks live on the SPD manifold Sym⁺(n). Word embeddings for hierarchical data live in hyperbolic space Hⁿ. Plain gradient descent ignores the geometry of these spaces: a Euclidean step generally leaves the manifold and disregards its metric. Riemannian gradient descent corrects this, and it all starts with the notion of a smooth manifold.

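**Robotics:** the configuration space of a robot arm with n spherical joints is SO(3)ⁿ. Planning directly on the manifold avoids the gimbal lock and coordinate singularities of Euler angles.
**Geometric ML:** libraries such as geomstats and geoopt support Riemannian optimization (for example, Riemannian Adam) on SPD matrices, SO(n), and hyperspheres, using the manifold's own metric.
**Hyperbolic embeddings:** trees and ontologies embed into Hⁿ with far lower distortion than into Euclidean space of comparable dimension, because volume in hyperbolic space grows exponentially with radius.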

Prerequisites

Geodesics

Manifolds: Atlas and Charts

An **n-dimensional smooth manifold** M is a topological space (assumed Hausdorff and second countable) covered by **charts** (Uα, φα), where each Uα ⊂ M is open and φα: Uα → ℝⁿ is a homeomorphism onto an open subset of ℝⁿ. A collection of charts that covers M is an **atlas**. The **transition maps** φα ∘ φβ⁻¹, defined on φβ(Uα ∩ Uβ), must be smooth (C∞); this endows M with a differentiable structure.

Intuition: the Earth's surface is a manifold, covered by the maps in an atlas. Each chart locally 'flattens' a region of the surface onto a plane. No single chart can cover the whole sphere; at least two are needed. Transition maps specify how to change coordinates where two charts overlap.
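To make charts and transition maps concrete, here is a minimal NumPy sketch of one possible two-chart atlas on S²: stereographic projection from the north pole and from the south pole. The function names are illustrative, not taken from any library. With these conventions the transition map on the overlap works out to u ↦ u/|u|², which is smooth away from the origin.

```python
import numpy as np

def chart_north(p):
    """Stereographic projection from the north pole; defined on S^2 minus (0, 0, 1)."""
    x, y, z = p
    return np.array([x, y]) / (1.0 - z)

def chart_south(p):
    """Stereographic projection from the south pole; defined on S^2 minus (0, 0, -1)."""
    x, y, z = p
    return np.array([x, y]) / (1.0 + z)

def chart_north_inverse(u):
    """Inverse of chart_north: maps a point of the plane back onto the sphere."""
    s = u @ u
    return np.array([2.0 * u[0], 2.0 * u[1], s - 1.0]) / (s + 1.0)

def transition_north_to_south(u):
    """Transition map chart_south o chart_north^{-1}, defined for u != 0."""
    return chart_south(chart_north_inverse(u))

u = np.array([0.5, -2.0])
p = chart_north_inverse(u)

print(np.isclose(p @ p, 1.0))                                  # the point lies on S^2
print(np.allclose(transition_north_to_south(u), u / (u @ u)))  # transition is u / |u|^2
```

Each chart misses exactly one point, so the two domains together cover the sphere, and the overlap is the sphere minus both poles.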

**Examples of smooth manifolds used in ML:** Sⁿ (hypersphere for normalized embeddings), SO(n) (rotation group), GL(n,ℝ) (invertible matrices), Sym⁺(n) (SPD matrices for covariances), hyperbolic space Hⁿ (hierarchical data). All appear in modern geometric deep learning.

Why can't a single chart (one homeomorphism φ: M → ℝⁿ) cover the entire sphere Sⁿ?

Tangent Space and Tangent Bundle

The **tangent space** TₓM at a point x ∈ M is an n-dimensional vector space, formally consisting of derivations at x: linear maps D: C∞(M) → ℝ satisfying the Leibniz rule D(fg) = D(f)g(x) + f(x)D(g). In local coordinates: TₓM ≅ span{∂/∂x¹, ..., ∂/∂xⁿ}.
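In a chart, every derivation at x is a directional derivative D = Σ vⁱ ∂/∂xⁱ for some coefficient vector v. The sketch below checks the Leibniz rule numerically for such a D, working in coordinates on ℝ². The test functions f and g, the point x, and the direction v are arbitrary choices for illustration, and the derivative is approximated by central finite differences.

```python
import numpy as np

def directional_derivative(f, x, v, h=1e-6):
    """Central finite-difference approximation of D_v f at x."""
    return (f(x + h * v) - f(x - h * v)) / (2.0 * h)

x = np.array([0.3, -1.2])   # base point (in chart coordinates)
v = np.array([2.0, 0.5])    # coefficients of the tangent vector

f = lambda p: np.sin(p[0]) * p[1]
g = lambda p: p[0] ** 2 + np.exp(p[1])
fg = lambda p: f(p) * g(p)

# Leibniz rule: D(fg) = D(f) g(x) + f(x) D(g)
lhs = directional_derivative(fg, x, v)
rhs = directional_derivative(f, x, v) * g(x) + f(x) * directional_derivative(g, x, v)

print(abs(lhs - rhs) < 1e-6)  # the two sides agree up to discretization error
```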

The **tangent bundle** TM = ⊔ₓ TₓM is the disjoint union of all tangent spaces, and is itself a smooth 2n-dimensional manifold. A vector field is a smooth section X: M → TM assigning a tangent vector X(p) ∈ TₚM to each point p.

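| Manifold M | dim M | Tangent space TₓM |
|---|---|---|
| Sⁿ ⊂ ℝⁿ⁺¹ | n | Vectors v ∈ ℝⁿ⁺¹ with v ⊥ x |
| SO(n) | n(n−1)/2 | At the identity: skew-symmetric matrices A = −Aᵀ; at a rotation R: matrices RA with A skew-symmetric |
| GL(n,ℝ) | n² | All n×n matrices |
| Sym⁺(n) (SPD) | n(n+1)/2 | All symmetric n×n matrices (Sym⁺(n) is an open cone in the symmetric matrices) |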
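Two of the table's entries can be checked numerically by differentiating curves: the velocity of a curve through the identity in SO(3) should be skew-symmetric, and the velocity of a curve through x on S² should be orthogonal to x. This is a small sketch using finite differences; the particular curves (a rotation about the z-axis and a great circle) are chosen purely for illustration.

```python
import numpy as np

def rot_z(t):
    """Rotation about the z-axis by angle t: a curve in SO(3) with rot_z(0) = I."""
    c, s = np.cos(t), np.sin(t)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def velocity_at_zero(curve, h=1e-6):
    """Central finite-difference approximation of curve'(0)."""
    return (curve(h) - curve(-h)) / (2.0 * h)

# SO(3): the velocity of a curve through the identity is skew-symmetric.
A = velocity_at_zero(rot_z)
skew_defect = np.abs(A + A.T).max()

# S^2: the velocity of a great circle through x is orthogonal to x.
x = np.array([0.0, 0.0, 1.0])
e = np.array([1.0, 0.0, 0.0])  # unit vector orthogonal to x
great_circle = lambda t: np.cos(t) * x + np.sin(t) * e
v = velocity_at_zero(great_circle)
radial_component = abs(v @ x)

print(skew_defect < 1e-8, radial_component < 1e-8)
```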

What does the tangent space T_I SO(3) at the identity matrix consist of?

Vector Fields and Applications

A **vector field** X on M is a smooth map X: M → TM with X(p) ∈ TₚM. Integral curves of X are curves γ with γ'(t) = X(γ(t)), that is, solutions of an ordinary differential equation on the manifold.
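As an illustration of integral curves, the sketch below follows the rotation field X(p) = ω × p on S², which is tangent to the sphere because (ω × p) · p = 0. It uses a simple geometric Euler scheme that moves along great circles via the sphere's exponential map, so iterates stay on the sphere up to rounding. The scheme, the step count, and the function names are assumptions made for this example rather than a prescribed method.

```python
import numpy as np

def sphere_exp(p, v):
    """Exponential map on the unit sphere: follow the great circle from p with velocity v."""
    norm = np.linalg.norm(v)
    if norm < 1e-15:
        return p
    return np.cos(norm) * p + np.sin(norm) * (v / norm)

def integrate(field, p0, t_end, steps):
    """Geometric Euler scheme: p_{k+1} = exp_{p_k}(h * X(p_k))."""
    h = t_end / steps
    p = p0.copy()
    for _ in range(steps):
        p = sphere_exp(p, h * field(p))
    return p

omega = np.array([0.0, 0.0, 1.0])
rotation_field = lambda p: np.cross(omega, p)  # tangent to S^2

p0 = np.array([1.0, 0.0, 0.0])
p_end = integrate(rotation_field, p0, t_end=np.pi / 2, steps=1000)

# The integral curve through p0 is the equator, so after time pi/2
# the point should be close to (0, 1, 0) and still have unit norm.
print(np.round(p_end, 3), np.isclose(np.linalg.norm(p_end), 1.0))
```

A plain Euclidean Euler step p + h·X(p) would drift off the sphere; stepping with the exponential map is one way to keep the curve on the manifold.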

**Hairy ball theorem:** on an even-dimensional sphere S²ⁿ (n ≥ 1), every continuous tangent vector field vanishes somewhere. On S²: a hairy ball cannot be combed flat without a cowlick. On odd-dimensional spheres such as S¹ and S³, nowhere-vanishing fields do exist.
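The odd-dimensional case can be seen explicitly. On S³ ⊂ ℝ⁴ the field X(x₁, x₂, x₃, x₄) = (−x₂, x₁, −x₄, x₃) is tangent and has unit length everywhere. The sketch below checks this on random sample points and contrasts it with the analogous rotation field on S², which vanishes at the poles. This only illustrates the statement; it is not a proof of the theorem.

```python
import numpy as np

def field_s3(x):
    """X(x1, x2, x3, x4) = (-x2, x1, -x4, x3): tangent to S^3 and of unit length."""
    return np.array([-x[1], x[0], -x[3], x[2]])

def field_s2(x):
    """Rotation field about the z-axis on S^2; it vanishes at the two poles."""
    return np.array([-x[1], x[0], 0.0])

rng = np.random.default_rng(0)
points = rng.normal(size=(1000, 4))
points /= np.linalg.norm(points, axis=1, keepdims=True)  # random points on S^3

tangency = max(abs(field_s3(p) @ p) for p in points)         # X(p) . p, should be ~0
min_norm = min(np.linalg.norm(field_s3(p)) for p in points)  # |X(p)|, should be ~1

north_pole = np.array([0.0, 0.0, 1.0])
norm_at_pole = np.linalg.norm(field_s2(north_pole))          # the S^2 field vanishes here

print(tangency < 1e-12, abs(min_norm - 1.0) < 1e-12, norm_at_pole == 0.0)
```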

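The **geoopt** library (PyTorch) provides Riemannian optimizers, such as Riemannian SGD and Riemannian Adam, for manifolds such as Sⁿ, Sym⁺(n), and the Stiefel manifold. Manifold objects expose operations such as the exponential map (or a retraction), the logarithm map, and vector transport, where these are implemented. It is used for Poincaré embeddings and other geometric deep learning models.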

Why must we project the Euclidean gradient onto the tangent space when optimizing on Sⁿ⁻¹?

Key Ideas

An **n-manifold** is covered by charts (Uα, φα: Uα → ℝⁿ) with smooth transition maps φα ∘ φβ⁻¹; these define the differentiable structure
**Tangent space** TₓM: n-dimensional space of derivations, locally spanned by {∂/∂xⁱ}. For Sⁿ: vectors perpendicular to x; for SO(n) at the identity: skew-symmetric matrices
**Tangent bundle** TM = ⊔ₓ TₓM is itself a 2n-manifold. Vector fields are smooth sections of TM
**Riemannian optimization:** project gradient onto TₓM, then step via exp map. Implemented in geoopt / geomstats

Questions for Reflection

The hairy ball theorem says that every continuous tangent vector field on S² vanishes somewhere. What does this imply about a continuous wind pattern on Earth: must there always be a calm spot?
The Lie group SO(3) has Lie algebra so(3), consisting of the 3×3 skew-symmetric matrices. How does exp: so(3) → SO(3) relate to Rodrigues' rotation formula (rotation about an axis)?
SPD matrices are used for neural network covariance representations. Why does a naive Euclidean gradient step on covariance matrices risk breaking positive definiteness, and how does Riemannian gradient descent on Sym⁺(n) avoid this?