Deep Learning

Regularization: Dropout, BatchNorm

Lesson Objectives

  • Apply Dropout and understand the train/eval mode distinction
  • Explain how BatchNorm stabilizes deep network training
  • Choose between BatchNorm and LayerNorm for different architectures
  • Build augmentation pipelines for CV and NLP tasks

A model that scores 95% on train and 60% on test is a classic case of overfitting. Add Dropout: 92% / 85% (train / test). Add BatchNorm: 94% / 88%. Add augmentation: 93% / 91%. The three techniques in this lesson make the difference between memorization and learning.

  • **ResNet, EfficientNet:** BatchNorm after every conv layer - the foundation of training stability
  • **GPT, BERT:** LayerNorm in every transformer block without exception
  • **ImageNet SOTA:** RandAugment + MixUp + CutMix - the standard augmentation stack
  • **LLM fine-tuning:** Dropout 0.1 + weight decay 0.01 - protecting pretrained weights

Dropout and Batch Normalization

In 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov published Dropout, which randomly drops units during training to prevent co-adaptation - a cheap regularizer that became ubiquitous. A year later, in 2015, Sergey Ioffe and Christian Szegedy at Google introduced Batch Normalization, which normalizes layer activations across the mini-batch. BatchNorm let practitioners use much higher learning rates and train far deeper networks, and it remains a building block of modern convolutional architectures.

Prerequisites

  • Training loop and the train/validation split
  • Overfitting and the bias-variance tradeoff
  • Mini-batch training and gradient updates
  • Backpropagation: How Neural Networks Learn
  • Optimization: SGD -> AdamW

Dropout: Random Neuron Deactivation

A model has memorized the training data but fails to generalize. One sign is co-adaptation: neuron A fires only for 'cat', and neuron B exists only to complement A, so removing A makes B useless. **Dropout** randomly deactivates neurons on each training step, so no neuron can rely on a specific partner and each must learn features that are useful on their own.

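**Inverted Dropout** (PyTorch implementation): during training, each activation is zeroed with probability p and the surviving activations are divided by (1-p), so the expected activation stays the same; during inference the layer passes its input through unchanged. This eliminates the need to rescale weights when switching to eval mode.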
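The scaling rule can be illustrated with a minimal sketch in plain Python. This is not PyTorch's implementation: the function name `inverted_dropout`, its `training` flag, and the `rng` parameter are assumptions made for this example.

```python
import random


def inverted_dropout(activations, p, training, rng=random):
    """Zero each activation with probability p; scale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        # Eval mode: identity, nothing to rescale.
        return list(activations)
    keep_scale = 1.0 / (1.0 - p)
    # rng.random() >= p is true with probability (1 - p): the unit survives.
    return [a * keep_scale if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)          # fixed seed so the run is repeatable
x = [1.0] * 10_000
train_out = inverted_dropout(x, p=0.3, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.3, training=False)
mean_train = sum(train_out) / len(train_out)
```

Because survivors are scaled up during training, `mean_train` stays close to the mean of the input, and `eval_out` is simply a copy of `x`.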

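**Dropout as an ensemble:** each dropout mask defines a different sub-network, so training with dropout samples from up to 2^N sub-networks that share weights (N = number of neurons). Running the full network at inference approximates averaging their predictions; the match is exact only in simple cases such as a single linear layer. The result is an almost free ensemble.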
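For a single linear unit the ensemble view can be checked by brute force. The sketch below enumerates every mask and compares the average with the dropout-free output. The inputs, weights, and the helper `linear_unit` are made up for illustration. For deep nonlinear networks the equality holds only approximately.

```python
from itertools import product


def linear_unit(inputs, weights, mask, p):
    """One linear neuron with an inverted-dropout mask applied to its inputs."""
    scale = 1.0 / (1.0 - p)
    return sum(w * x * m * scale for w, x, m in zip(weights, inputs, mask))


inputs = [0.5, -1.0, 2.0, 0.25]    # illustrative values
weights = [0.2, 0.4, -0.1, 1.0]    # illustrative values
p = 0.5                            # with p = 0.5 every mask is equally likely

# All 2^N binary masks, i.e. all sub-networks of this unit.
masks = list(product([0, 1], repeat=len(inputs)))
ensemble_mean = sum(linear_unit(inputs, weights, m, p) for m in masks) / len(masks)

# Output with dropout disabled (eval mode).
eval_output = sum(w * x for w, x in zip(weights, inputs))
```

Each input is kept in half of the masks and scaled by 2 when kept, so `ensemble_mean` matches `eval_output` up to floating-point error.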

model.eval() disables Dropout. What happens to the weights at that switch in PyTorch?

Batch Normalization

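A 50-layer network: after each layer, the activation distribution shifts and stretches, an effect the original paper called internal covariate shift. Without normalization, the signal reaching layer 50 can explode or vanish. **Batch Normalization** standardizes each feature (or channel) to zero mean and unit variance over the mini-batch, then applies a learnable scale and shift, which makes deep network optimization more stable.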
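A minimal plain-Python sketch of the idea follows. The class `BatchNorm1D` is hypothetical and simplified: it uses the biased batch variance for both normalization and the running estimate, which may differ in detail from library implementations. It shows batch statistics being used in training mode and moving averages being used in eval mode.

```python
class BatchNorm1D:
    """Batch normalization over a batch of feature vectors (lists of floats)."""

    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        self.gamma = [1.0] * num_features         # learnable scale
        self.beta = [0.0] * num_features          # learnable shift
        self.running_mean = [0.0] * num_features  # used in eval mode
        self.running_var = [1.0] * num_features   # used in eval mode
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, batch):
        n = len(batch)
        out = [[0.0] * len(self.gamma) for _ in batch]
        for j in range(len(self.gamma)):
            column = [row[j] for row in batch]
            if self.training:
                # Statistics of the current mini-batch.
                mean = sum(column) / n
                var = sum((v - mean) ** 2 for v in column) / n
                # Exponential moving averages, stored for inference.
                self.running_mean[j] += self.momentum * (mean - self.running_mean[j])
                self.running_var[j] += self.momentum * (var - self.running_var[j])
            else:
                mean, var = self.running_mean[j], self.running_var[j]
            for i, v in enumerate(column):
                normed = (v - mean) / (var + self.eps) ** 0.5
                out[i][j] = self.gamma[j] * normed + self.beta[j]
        return out


bn = BatchNorm1D(num_features=2)
batch = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
train_out = bn(batch)            # normalized with batch statistics

bn.training = False
eval_out = bn([[2.0, 20.0]])     # a single sample works: running statistics are used
```

In training mode each output column has mean close to 0 and variance close to 1. In eval mode the output depends only on the stored running statistics, so it is the same regardless of what else is in the batch.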

**BatchNorm has a regularization effect** - each sample is normalized relative to its batch, which introduces noise (different batches produce different statistics). This noise acts as a weak regularizer. With BatchNorm, Dropout strength is often reduced or removed.

BatchNorm in eval() mode uses different statistics than in train(). Which ones?

Layer Normalization

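A language model processes sentences of varying length. BatchNorm normalizes across the batch, but in NLP a batch contains sequences of different lengths plus padding, which makes batch statistics unreliable. **Layer Normalization** normalizes each sample independently: it computes the mean and variance over the features of a single sample (in a transformer, over the hidden dimension of each token), not across the batch. Its behavior is therefore the same in training and inference and does not depend on batch size.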
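The per-sample behavior can be sketched in a few lines of plain Python. The function `layer_norm` and its default scale and shift are assumptions for this example, not a library API.

```python
def layer_norm(features, gamma=None, beta=None, eps=1e-5):
    """Normalize one sample across its own features, then scale and shift."""
    d = len(features)
    gamma = gamma if gamma is not None else [1.0] * d   # learnable scale
    beta = beta if beta is not None else [0.0] * d      # learnable shift
    mean = sum(features) / d
    var = sum((v - mean) ** 2 for v in features) / d
    inv_std = (var + eps) ** -0.5
    return [g * (v - mean) * inv_std + b for v, g, b in zip(features, gamma, beta)]


sample = [1.0, 2.0, 3.0, 4.0]
alone = layer_norm(sample)

# The same sample placed in a batch next to a very different one.
batch = [sample, [100.0, -50.0, 0.0, 7.0]]
in_batch = [layer_norm(row) for row in batch][0]
```

`alone` and `in_batch` are identical, because no statistic is shared between samples. This is the property that makes LayerNorm insensitive to batch size and padding.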

Why do transformers use LayerNorm instead of BatchNorm?

Data Augmentation

A model trained on 10,000 cat photos memorizes them. All photos show cats on sofas, facing right. The model then fails on real-world photos. The problem is not dataset size but diversity. **Data Augmentation** generates new variations from existing data, expanding the diversity (and the effective size) of the training set.

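**Augmentation rules:** 1) apply random augmentation only during training, not at eval (test-time augmentation is a separate, deliberate technique). 2) augmentation should be label-preserving (flipping a cat horizontally is still a cat; vertical flip often breaks semantic meaning). 3) MixUp and CutMix often add roughly +0.5-1% accuracy on image classification at near-zero compute cost, so they are usually worth trying.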
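As an illustration of blending samples and labels, here is a minimal MixUp sketch in plain Python. The function name `mixup`, the toy "pixel" lists, and the default `alpha=0.2` are assumptions for this example. A real pipeline would apply this to tensors in the training loop only.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two samples and their one-hot labels with lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


rng = random.Random(0)                                    # fixed seed for repeatability
cat_pixels, cat_label = [0.0, 0.2, 0.4], [1.0, 0.0]       # toy data
dog_pixels, dog_label = [1.0, 0.8, 0.6], [0.0, 1.0]       # toy data
mixed_x, mixed_y, lam = mixup(cat_pixels, cat_label, dog_pixels, dog_label, rng=rng)
```

The mixed label is a soft target: `mixed_y[0]` equals `lam` and the two entries sum to 1, so the loss is weighted between the two classes in the same proportion as the inputs.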

Data augmentation is applied only during training, not at inference. Why?

Regularization in Deep Learning

  • Dropout: randomly deactivates neurons during training - forces the network to work without co-adaptation
  • BatchNorm: normalizes across the batch - stabilizes gradients in deep networks, adds regularization noise
  • LayerNorm: normalizes across features per sample - works for NLP and variable-length sequences
  • Data Augmentation: random transforms during training - expands the effective dataset size
  • MixUp/CutMix: blend samples and labels - often +0.5-1% accuracy at near-zero computational cost

Related Topics

Regularization works together with the right optimizer and architecture choices.

  • Optimization: SGD -> AdamW — AdamW + weight decay is another form of regularization
  • Convolutional Networks (CNN) — BatchNorm is a standard component of CNN architectures
  • Transformers — LayerNorm + Dropout are embedded in every transformer block

Questions for Reflection

  • When is it appropriate to remove Dropout if BatchNorm is already present?
  • What is the practical difference between Pre-LayerNorm and Post-LayerNorm in transformers?
  • How can the right augmentation intensity be found without manual search?

Related Lessons

  • dl-09 — Weight decay in AdamW is itself regularization
  • dl-04 — BatchNorm is a standard component of CNN architectures
  • ml-08-regularization — L1 and L2 penalties from classical ML carry over
  • ml-21-bagging-boosting — Dropout acts like ensembling many thinned networks
  • prob-11-normal — BatchNorm standardizes activations to zero mean and unit variance; it does not make them normally distributed
  • stat-01-sampling