Optimization
Neural Network Loss Landscapes
You train a neural network: loss drops, accuracy rises. But why does one model generalize while another overfits, even though both reach zero training loss? Why do ResNets train more stably than vanilla networks? The answer lies in the **shape of the loss surface**, a geometry that is nearly impossible to visualize in 100M dimensions.
- **Architectural design**: skip connections in ResNets and normalization layers in transformers both improve the landscape geometry, making it smoother and easier to optimize
- **Model merging**: mode connectivity makes it possible to merge several trained models into one, often without quality loss (Model Soup, SLERP for LLMs)
- **SAM in production**: Google uses SAM (Sharpness-Aware Minimization) to train Vision Transformers; explicitly minimizing sharpness gives +1-2% on ImageNet without architecture changes
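To make "explicit sharpness minimization" concrete, here is a minimal sketch of one SAM-style update on a toy quadratic loss. It is an illustration under stated assumptions, not the setup used for Vision Transformers: `sam_step`, `toy_grad`, and the values of `lr` and `rho` are hypothetical names and numbers chosen for this example.

```python
import math

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One SAM-style update on a list of parameters (illustrative sketch)."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # zero gradient: no ascent direction, leave w unchanged
    # Ascent step: move to the first-order worst point inside an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent step: apply the gradient measured at the perturbed point to the original weights.
    g_adv = grad_fn(w_adv)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]

def toy_grad(w):
    # Gradient of the toy loss L(w) = 0.5 * (100 * w0^2 + w1^2):
    # sharp along w0, flat along w1.
    return [100.0 * w[0], 1.0 * w[1]]

w = [0.01, 1.0]
w_next = sam_step(w, toy_grad, lr=0.001, rho=0.05)
print(w_next)
```

Because the gradient is evaluated at the perturbed point, the sharp direction `w0` receives a much larger correction than plain gradient descent would give from the same starting point, which is the mechanism that biases training toward flatter regions.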
Prerequisites
Geometry of the Loss Landscape
The loss function of a neural network is a surface in a space of millions of parameters. Intuitions from 3D ('mountain', 'valley', 'ravine') transfer poorly, but geometric concepts remain: **local minima**, **maxima**, **saddle points**, **plateaus** and **sharp ravines**.
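These concepts can be told apart by the signs of the Hessian eigenvalues at a critical point (a point where the gradient is zero): all positive means a local minimum, all negative a local maximum, mixed signs a saddle point, and a zero eigenvalue a flat direction. The sketch below does this for two parameters only, using the closed-form eigenvalues of a symmetric 2x2 matrix; `classify_critical_point` and its `tol` parameter are hypothetical names for this illustration.

```python
import math

def classify_critical_point(hxx, hxy, hyy, tol=1e-12):
    """Classify a critical point of f(x, y) from its Hessian [[hxx, hxy], [hxy, hyy]]."""
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    mean = 0.5 * (hxx + hyy)
    radius = math.hypot(0.5 * (hxx - hyy), hxy)
    lo, hi = mean - radius, mean + radius
    if abs(lo) <= tol or abs(hi) <= tol:
        return "degenerate"  # at least one flat direction (plateau or valley floor)
    if lo > 0:
        return "minimum"     # curves upward in every direction
    if hi < 0:
        return "maximum"     # curves downward in every direction
    return "saddle"          # upward in one direction, downward in another

print(classify_critical_point(2.0, 0.0, 2.0))    # f = x^2 + y^2
print(classify_critical_point(2.0, 0.0, -2.0))   # f = x^2 - y^2
print(classify_critical_point(-2.0, 0.0, -2.0))  # f = -(x^2 + y^2)
print(classify_critical_point(2.0, 0.0, 0.0))    # f = x^2, flat along y
```

In a real network the Hessian has as many eigenvalues as parameters, so a minimum requires every one of them to be positive, while a single negative eigenvalue is enough to make the point a saddle.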
**Dauphin et al. (2014)** argued that in high dimensions, 'bad' local minima (with loss far above the global optimum) are virtually non-existent: most critical points with high loss values are saddle points, not local minima. This is good news, because SGD and its variants tend to escape saddle points: mini-batch noise randomly perturbs the parameters and pushes them off the saddle along a descent direction. Contrary to the prevailing view of the 1990s, training is therefore very unlikely to get stuck in a bad local minimum.
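The effect of noise at a saddle can be sketched on a two-parameter toy function. The example assumes f(x, y) = x^2 + y^4/4 - y^2/2, which has a saddle at the origin and minima at (0, 1) and (0, -1). Gaussian noise added to the gradient stands in for mini-batch noise; the function, `descend`, and all hyperparameter values are hypothetical choices for illustration.

```python
import random

def grad(x, y):
    # Gradient of f(x, y) = x^2 + y^4/4 - y^2/2.
    return 2.0 * x, y**3 - y

def descend(x, y, lr=0.1, steps=500, noise_std=0.0, seed=0):
    """Gradient descent with optional Gaussian gradient noise."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(steps):
        gx, gy = grad(x, y)
        if noise_std > 0.0:
            # Stand-in for mini-batch noise: perturb each gradient component.
            gx += rng.gauss(0.0, noise_std)
            gy += rng.gauss(0.0, noise_std)
        x, y = x - lr * gx, y - lr * gy
    return x, y

# Start on the line y = 0, which leads straight into the saddle.
plain = descend(1.0, 0.0)
noisy = descend(1.0, 0.0, noise_std=0.01)
print("plain GD:", plain)
print("noisy GD:", noisy)
```

Without noise the y-gradient is exactly zero along the whole path, so the iterate converges to the saddle and stays there. With noise, a tiny perturbation in y is amplified at every step by the negative curvature, and the iterate ends up near one of the two minima.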