Optimization
Neural Network Loss Landscapes
You train a neural network: loss drops, accuracy rises. But why does one model generalize while another overfits, even though both reach zero training loss? Why do ResNets train more stably than plain networks without skip connections? The answer is hidden in the **shape of the loss surface**, a geometry that is nearly impossible to visualize directly in 100M dimensions.
- **Architectural design**: skip connections in ResNet and normalization in transformers both improve landscape geometry, making it smoother and easier to optimize
- **Model merging**: knowledge of mode connectivity makes it possible to 'merge' several trained models into one with little or no quality loss (Model Soup, SLERP for LLMs)
- **SAM in production**: Google uses SAM for training Vision Transformers; explicit sharpness minimization gives +1-2% on ImageNet without architecture changes
Prerequisites
Geometry of the Loss Landscape
The loss function of a neural network is a surface in a space of millions of parameters. Intuitions from 3D ('mountain', 'valley', 'ravine') transfer poorly, but geometric concepts remain: **local minima**, **maxima**, **saddle points**, **plateaus** and **sharp ravines**.
**Dauphin et al. (2014)** argued that in high dimensions 'bad' local minima (far from the global optimum) are virtually non-existent: most critical points with high loss values are saddle points, not local minima. This is good news, because mini-batch noise acts as a random perturbation that pushes SGD and its variants off saddle points. It also overturns the view common in the 1990s that training would routinely get stuck in bad local minima.
However, **good** (deep) local minima can have different shapes: sharp or flat. The shape of the minimum is critical for the **generalization ability** of the model, a finding that became central to the modern understanding of deep learning.
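A rough way to see why local minima become rare as dimension grows: a critical point is a minimum only if every Hessian eigenvalue is positive. The sketch below assumes, purely for illustration, that the Hessian at a random critical point behaves like a random symmetric Gaussian matrix. This is a toy stand-in, not a real network Hessian, and all helper names are hypothetical.

```python
import random

def random_symmetric(n, rng):
    """Symmetric n x n matrix with Gaussian entries (a toy 'Hessian', an assumption)."""
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            a[i][j] = a[j][i] = rng.gauss(0.0, 1.0)
    return a

def is_positive_definite(a):
    """Cholesky test: it succeeds only if every eigenvalue is positive."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # a non-positive pivot means some eigenvalue <= 0
                low[i][i] = s ** 0.5
            else:
                low[i][j] = s / low[j][j]
    return True

def fraction_of_minima(n, trials=2000, seed=0):
    """Share of random critical points that would be local minima in n dimensions."""
    rng = random.Random(seed)
    hits = sum(is_positive_definite(random_symmetric(n, rng)) for _ in range(trials))
    return hits / trials

for n in (1, 2, 3, 5, 8):
    print(n, fraction_of_minima(n))
```

With one parameter about half of the critical points are minima; within a handful of dimensions the fraction collapses toward zero, and almost everything else is a saddle. Real networks have millions of dimensions, so under this toy model a random critical point is essentially never a minimum.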
Why are local minima rare in deep neural networks with millions of parameters?
Saddle Points and Escaping Them
A **saddle point** is a critical point where the gradient is zero but which is neither a minimum nor a maximum: the function decreases in some directions and increases in others. Saddle points, not local minima, are what optimizers actually struggle with.
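The definition can be checked numerically through the signs of the Hessian eigenvalues. The sketch below uses a hypothetical two-parameter toy loss, f(x, y) = x² + y⁴/4 - y²/2, chosen only because its critical points are easy to find by hand; the function names are illustrative.

```python
import math

def hessian(x, y):
    """Hessian of the toy loss f(x, y) = x**2 + y**4 / 4 - y**2 / 2."""
    return [[2.0, 0.0], [0.0, 3.0 * y * y - 1.0]]

def eigenvalues_2x2(h):
    """Closed-form eigenvalues of a symmetric 2x2 matrix, smallest first."""
    (a, b), (_, d) = h
    mean = (a + d) / 2.0
    radius = math.hypot((a - d) / 2.0, b)
    return mean - radius, mean + radius

def classify(h, tol=1e-12):
    """Name a critical point from the signs of its Hessian eigenvalues."""
    lo, hi = eigenvalues_2x2(h)
    if lo > tol:
        return "local minimum"
    if hi < -tol:
        return "local maximum"
    if lo < -tol and hi > tol:
        return "saddle point"
    return "degenerate"  # some eigenvalue is (numerically) zero

# The gradient (2x, y**3 - y) vanishes at (0, 0) and at (0, 1).
print(classify(hessian(0.0, 0.0)))  # saddle point: eigenvalues -1 and 2
print(classify(hessian(0.0, 1.0)))  # local minimum: eigenvalues 2 and 2
```

At the origin the loss curves upward along x and downward along y, which is exactly the mixed-sign pattern that defines a saddle.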
**Deterministic gradient descent** can get stuck near a saddle for a long time: the gradient is small, so the step is small and the exit is slow (the escape speed depends on the magnitude of the negative Hessian eigenvalue).
**Mini-batch SGD**: the noise in the gradient estimate adds a random perturbation. If that perturbation has a nonzero projection onto a direction of negative Hessian curvature, the model starts 'rolling' downward.
**Theorem (Jin et al., 2017)**: gradient descent with occasional random perturbations escapes strict saddle points in polynomial time, with only polylogarithmic dependence on dimension; for the stochastic variant (perturbed SGD), the step count usually quoted is O(1/ε⁴).
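The difference between the two regimes shows up even on a two-parameter toy problem. The sketch below assumes the hypothetical loss f(x, y) = x² + y⁴/4 - y²/2, which has a saddle at (0, 0) and minima at (0, ±1), and models mini-batch noise as Gaussian noise added to the gradient. That noise model is a simplification, not how real mini-batch noise is distributed.

```python
import random

def grad(x, y):
    """Gradient of the toy loss f(x, y) = x**2 + y**4 / 4 - y**2 / 2."""
    return 2.0 * x, y ** 3 - y

def descend(noise_std, steps=500, lr=0.1, seed=0):
    """Gradient descent from (1, 0); noise_std > 0 mimics mini-batch noise."""
    rng = random.Random(seed)
    x, y = 1.0, 0.0  # y = 0 is exactly the line that leads into the saddle
    for _ in range(steps):
        gx, gy = grad(x, y)
        if noise_std > 0.0:
            gx += rng.gauss(0.0, noise_std)
            gy += rng.gauss(0.0, noise_std)
        x, y = x - lr * gx, y - lr * gy
    return x, y

print("deterministic:", descend(noise_std=0.0))   # ends at the saddle, y stays 0
print("noisy:        ", descend(noise_std=0.01))  # ends near one of the minima (0, +-1)
```

Without noise the iterate never leaves the line y = 0 and converges to the saddle. With noise, any small y component is amplified by the negative curvature, and the iterate settles near one of the two minima.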
Why does mini-batch SGD escape saddle points faster than full (batch) gradient descent?
Sharp vs Flat Minima and Generalization
When an optimizer finds a minimum, its **shape** matters: a sharp minimum is a narrow 'spike', while a flat minimum is a wide 'valley'. This difference is critical for the **generalization ability** of the model.
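Sharpness is commonly quantified as the largest Hessian eigenvalue, λ_max(∇²f). For a real network the Hessian is far too large to build, so one option is power iteration on Hessian-vector products. The sketch below is a minimal version under stated assumptions: a toy quadratic loss with hand-picked curvatures, and Hessian-vector products approximated by finite differences of the gradient. All names are hypothetical.

```python
import math
import random

def grad(w, curvatures):
    """Gradient of the toy loss 0.5 * sum(c_i * w_i**2)."""
    return [c * wi for c, wi in zip(curvatures, w)]

def hvp(grad_fn, w, v, eps=1e-4):
    """Hessian-vector product H @ v from a central difference of gradients."""
    plus = grad_fn([wi + eps * vi for wi, vi in zip(w, v)])
    minus = grad_fn([wi - eps * vi for wi, vi in zip(w, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]

def top_eigenvalue(grad_fn, w, iters=100, seed=0):
    """Power iteration: largest-magnitude Hessian eigenvalue at w.

    At a minimum all eigenvalues are non-negative, so this equals lambda_max.
    """
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in w]
    for _ in range(iters):
        hv = hvp(grad_fn, w, v)
        norm = math.sqrt(sum(h * h for h in hv))
        v = [h / norm for h in hv]
    hv = hvp(grad_fn, w, v)
    return sum(h * vi for h, vi in zip(hv, v))  # Rayleigh quotient with unit v

minimum = [0.0, 0.0, 0.0]
sharp = top_eigenvalue(lambda w: grad(w, [100.0, 1.0, 0.5]), minimum)
flat = top_eigenvalue(lambda w: grad(w, [1.0, 0.5, 0.1]), minimum)
print(sharp, flat)  # roughly 100 versus roughly 1
```

Both minima have the same loss value, but a parameter shift of the same size raises the loss about a hundred times more at the sharp one.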
**SAM** (Foret et al., 2021) is an optimizer that explicitly minimizes sharpness. Instead of minimizing f(θ), it minimizes `max_{||ε||≤ρ} f(θ + ε)`, the worst-case value in a ρ-ball around θ. Each iteration takes two steps:
1. θ̃ = θ + ρ · ∇f(θ)/||∇f(θ)|| (shift to the 'worst point' in the neighborhood)
2. θ = θ - α · ∇f(θ̃) (gradient step using the gradient at the worst point)
Result: SAM+SGD consistently outperforms SGD on ImageNet (+1-2%) and is especially good with small datasets.
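The two-step SAM update can be sketched in a few lines. This is a minimal illustration on a hypothetical quadratic toy loss with plain Python lists as parameters, not the reference implementation: `sam_step`, `toy_grad`, and the values of `lr` and `rho` are all assumptions made for the example.

```python
import math

def toy_loss(w):
    """Toy loss 0.5 * (10 * w0**2 + w1**2), used only for illustration."""
    return 0.5 * (10.0 * w[0] ** 2 + w[1] ** 2)

def toy_grad(w):
    return [10.0 * w[0], w[1]]

def sam_step(params, grad_fn, lr=0.05, rho=0.05):
    """One SAM update: ascend to the worst nearby point, then descend from theta."""
    g = grad_fn(params)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12  # guard against a zero gradient
    # Step 1: theta_tilde = theta + rho * g / ||g||
    perturbed = [p + rho * gi / norm for p, gi in zip(params, g)]
    # Step 2: theta = theta - lr * grad f(theta_tilde)
    g_adv = grad_fn(perturbed)
    return [p - lr * gi for p, gi in zip(params, g_adv)]

w = [1.0, 1.0]
for _ in range(200):
    w = sam_step(w, toy_grad)
print(w, toy_loss(w))
```

Each iteration costs two gradient evaluations instead of one. Note that the descent step is applied to the original θ, not to θ̃; only the gradient is taken at the perturbed point.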
Why does a flat loss minimum typically generalize better than a sharp one?
Visualizing Loss Landscapes
How do researchers study the shape of the landscape in millions of parameter dimensions? **Li et al. (2018)** proposed a visualization method through projection onto two directions with **filter normalization**, enabling correct comparison of networks of different scales.
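The projection method can be sketched as follows: pick two random directions, rescale each 'filter' of a direction to the norm of the corresponding filter in the trained weights, then evaluate the loss on a 2D grid around the trained point. In this minimal sketch a single layer is a list of rows and each row is treated as one filter. That layout, the function names, and the toy loss are assumptions for illustration.

```python
import math
import random

def filter_normalize(direction, weights):
    """Rescale each row of `direction` to the norm of the matching weight row."""
    out = []
    for d_row, w_row in zip(direction, weights):
        d_norm = math.sqrt(sum(d * d for d in d_row)) + 1e-10
        w_norm = math.sqrt(sum(w * w for w in w_row))
        out.append([d * w_norm / d_norm for d in d_row])
    return out

def random_direction(weights, rng):
    raw = [[rng.gauss(0.0, 1.0) for _ in row] for row in weights]
    return filter_normalize(raw, weights)

def loss_surface(loss_fn, weights, span=1.0, steps=5, seed=0):
    """Loss on a steps x steps grid of weights + a * d1 + b * d2."""
    rng = random.Random(seed)
    d1 = random_direction(weights, rng)
    d2 = random_direction(weights, rng)
    coords = [span * (2.0 * i / (steps - 1) - 1.0) for i in range(steps)]
    grid = []
    for a in coords:
        row = []
        for b in coords:
            shifted = [[w + a * x + b * y for w, x, y in zip(wr, xr, yr)]
                       for wr, xr, yr in zip(weights, d1, d2)]
            row.append(loss_fn(shifted))
        grid.append(row)
    return coords, grid

def toy_loss(weights):
    """Hypothetical stand-in for a network loss: sum of squared weights."""
    return sum(w * w for row in weights for w in row)

trained = [[3.0, 4.0], [0.0, 1.0]]
coords, grid = loss_surface(toy_loss, trained)
print(coords)
print(grid[2][2])  # centre of the grid: the loss at the trained weights
```

Without the rescaling, a network with large weights would look artificially flat and one with small weights artificially sharp, because the same step size means a different relative change in each.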
A surprising discovery by **Garipov et al. (2018)** and **Draxler et al. (2018)**: two different minima of a neural network can be connected by a **low-loss curve** in parameter space (loss barrier ≈ 0). The **mode connectivity** method finds a curve c(t) between θ₁ and θ₂ such that L(c(t)) stays roughly constant and low for all t ∈ [0,1]. In other words, the different 'solutions' a network reaches from different random seeds are not that different: they are connected in parameter space without large barriers.
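The idea of a loss barrier along a curve c(t) can be made concrete on a toy problem. The sketch below assumes a hypothetical loss whose minima form a ring, L(w) = (||w||² - 1)², and compares the straight line between two minima with a quadratic Bezier curve through a hand-picked bend point. In practice the bend is trained rather than chosen by hand, and the barrier definition used here (maximum loss on the path minus the worse endpoint) is one common convention, not the only one.

```python
def loss(w):
    """Toy loss with a ring of minima on the unit circle."""
    return (w[0] ** 2 + w[1] ** 2 - 1.0) ** 2

def linear_path(a, b, t):
    return [(1.0 - t) * ai + t * bi for ai, bi in zip(a, b)]

def bezier_path(a, bend, b, t):
    """Quadratic Bezier curve: c(0) = a, c(1) = b, pulled toward `bend`."""
    return [(1.0 - t) ** 2 * ai + 2.0 * t * (1.0 - t) * ci + t ** 2 * bi
            for ai, ci, bi in zip(a, bend, b)]

def barrier(path_fn, points=101):
    """Highest loss along the path minus the worse of the two endpoint losses."""
    values = [loss(path_fn(i / (points - 1))) for i in range(points)]
    return max(values) - max(values[0], values[-1])

theta1, theta2 = [1.0, 0.0], [-1.0, 0.0]
bend = [0.0, 2.0]  # hand-picked control point; a trained parameter in practice

line_barrier = barrier(lambda t: linear_path(theta1, theta2, t))
curve_barrier = barrier(lambda t: bezier_path(theta1, bend, theta2, t))
print(line_barrier, curve_barrier)  # 1.0 on the line, about 0.06 on the curve
```

The straight line crosses the high-loss centre of the ring, while the curved path stays close to the minima the whole way. The same contrast motivates searching for a curved path between two trained networks rather than interpolating their weights linearly.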
Why is filter normalization applied in the loss landscape visualization method (Li et al.)?
Key Ideas
- **Geometry**: in high dimensions bad local minima are rare and saddle points dominate; SGD noise helps escape them
- **Sharpness**: commonly measured as λ_max(∇²f); sharp minima tend to generalize poorly, flat minima tend to generalize well
- **SAM**: minimizes max_{||ε||≤ρ} f(θ+ε), an explicit search for flat minima; +1-2% accuracy on ImageNet
- **Visualization**: filter normalization from Li et al. enables fair comparison of landscape shapes across architectures
Related Topics
The geometry of the loss landscape explains why specific methods work:
- Adaptive Optimization Methods — Adam and SGD navigate the landscape differently - understanding geometry explains their different generalization behavior
- Optimization for LLM Training — Warmup, gradient clipping, learning rate schedule - tools for navigating the complex loss landscape of LLMs
- Optimization in ML — Theoretical foundations: convexity, critical points, KKT conditions
Questions for Reflection
- If the loss function were convex, how would neural network training change? What would we lose?
- Why do batch normalization and layer normalization improve training - can this be explained through landscape geometry?
- Mode connectivity shows that different minima are 'connected'. What does this mean for ensembling and model merging?