Deep Learning

Regularization: Dropout, BatchNorm

Lesson objectives

Apply Dropout and understand the train/eval mode distinction
Explain how BatchNorm stabilizes deep network training
Choose between BatchNorm and LayerNorm for different architectures
Build augmentation pipelines for CV and NLP tasks

Take a model that reaches 95% accuracy on the training set but only 60% on the test set: classic overfitting. Add Dropout and the split becomes 92% / 85%; add BatchNorm, 94% / 88%; add augmentation, 93% / 91% (illustrative numbers). These three techniques account for much of the difference between memorization and learning.

**ResNet, EfficientNet:** BatchNorm after every conv layer - the foundation of training stability
**GPT, BERT:** LayerNorm in every transformer block without exception
**ImageNet SOTA:** RandAugment + MixUp + CutMix - a widely used augmentation stack
**LLM fine-tuning:** Dropout 0.1 + weight decay 0.01 - typical values that help protect pretrained weights

Dropout and Batch Normalization

In 2014 Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov published Dropout, which randomly drops units during training to prevent co-adaptation - a cheap regularizer that became ubiquitous. A year later, in 2015, Sergey Ioffe and Christian Szegedy at Google introduced Batch Normalization, which normalizes layer activations across the mini-batch. BatchNorm let practitioners use much higher learning rates and train far deeper networks, and it remains a building block of modern convolutional architectures.

Prerequisites

Training loop and the train/validation split
Overfitting and the bias-variance tradeoff
Mini-batch training and gradient updates

Dropout: Random Neuron Deactivation

A model has memorized the training data but fails to generalize. One sign is co-adaptation: neuron A fires only for 'cat', and neuron B exists only to complement A, so removing A makes B useless. **Dropout** randomly deactivates neurons on each training step, so no neuron can rely on a specific partner being present and each must learn features that are useful on their own.

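**Inverted Dropout** (the PyTorch implementation): during training, each activation is zeroed with probability p and the survivors are divided by (1-p), which keeps the expected activation unchanged; during inference the layer is the identity. This removes the need to rescale weights or activations when switching to eval mode.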
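
The scaling rule can be sketched in plain Python. This is a minimal illustration rather than PyTorch's actual code: the function name `inverted_dropout` and its `training` flag are hypothetical, and activations are modelled as a flat list of floats.

```python
import random


def inverted_dropout(activations, p, training, rng=random):
    """Illustrative inverted dropout; p is the probability of dropping a unit."""
    if not training or p == 0.0:
        # Eval mode: identity, no masking and no rescaling.
        return list(activations)
    keep = 1.0 - p
    out = []
    for a in activations:
        if rng.random() < keep:
            out.append(a / keep)  # survivor is scaled up by 1 / (1 - p)
        else:
            out.append(0.0)       # dropped unit
    return out


rng = random.Random(0)            # fixed seed so the run is repeatable
x = [1.0] * 10000
train_out = inverted_dropout(x, p=0.5, training=True, rng=rng)
eval_out = inverted_dropout(x, p=0.5, training=False)

# The mean stays close to 1.0 in training because survivors are scaled up.
mean_train = sum(train_out) / len(train_out)
```

With p = 0.5 every surviving value becomes 2.0 and every dropped value 0.0, so the average over many units stays near the original 1.0, while the eval output is returned untouched.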

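**Dropout as an ensemble:** each dropout mask selects one of 2^N sub-networks (N = number of droppable neurons), all sharing the same weights. A single forward pass without dropout at inference approximates averaging the predictions of all of them; the match is exact for a single linear layer and only approximate once nonlinearities are involved. In effect, it is an almost free ensemble.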
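
A tiny enumeration makes the ensemble view concrete. The sketch below assumes a single unit with three inputs and p = 0.5, so that all 2^3 masks are equally likely; the weights, inputs, and helper names are made up for illustration.

```python
from itertools import product


def linear(weights, inputs):
    return sum(w * x for w, x in zip(weights, inputs))


def relu(z):
    return max(0.0, z)


weights = [0.5, -1.0, 2.0]
inputs = [1.0, 2.0, 0.25]
keep = 0.5  # keep probability 1 - p; with p = 0.5 every mask has equal weight

# All 2^N binary masks over the N = 3 inputs.
masks = list(product([0, 1], repeat=len(inputs)))


def masked_inputs(mask):
    # Inverted dropout: zero the dropped inputs, scale survivors by 1 / keep.
    return [m * x / keep for m, x in zip(mask, inputs)]


# Linear unit: the average over all sub-networks equals the full network.
ensemble_linear = sum(linear(weights, masked_inputs(m)) for m in masks) / len(masks)
full_linear = linear(weights, inputs)

# With a ReLU on top, the two quantities no longer coincide.
ensemble_relu = sum(relu(linear(weights, masked_inputs(m))) for m in masks) / len(masks)
full_relu = relu(linear(weights, inputs))
```

Here the linear ensemble average and the full forward pass agree, whereas the ReLU versions differ, which is why the ensemble interpretation is an approximation for real networks.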

model.eval() disables Dropout. What happens to the weights at that switch in PyTorch?

Batch Normalization

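Take a 50-layer network: after each layer the activation distribution can shift and stretch (the original paper called this internal covariate shift), and by layer 50 the signal may have exploded or vanished. **Batch Normalization** standardizes each feature of a layer's output to zero mean and unit variance using the statistics of the current mini-batch, then applies a learnable scale and shift. This makes deep network optimization far more stable.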
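
As a rough sketch of the mechanics, the class below normalizes a single feature over a batch and keeps running averages for later use at inference. `BatchNormSketch` is a hypothetical name; the scale and shift are held fixed rather than learned, and the running variance uses the biased estimate for simplicity (PyTorch updates it with the unbiased one).

```python
import math


class BatchNormSketch:
    """Illustrative batch normalization for one feature (not a library API)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.gamma = 1.0          # scale (learnable in a real layer)
        self.beta = 0.0           # shift (learnable in a real layer)
        self.running_mean = 0.0
        self.running_var = 1.0
        self.momentum = momentum
        self.eps = eps
        self.training = True

    def __call__(self, batch):
        if self.training:
            # Training: use the statistics of the current mini-batch.
            mean = sum(batch) / len(batch)
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Track exponential moving averages for inference.
            m = self.momentum
            self.running_mean = (1 - m) * self.running_mean + m * mean
            self.running_var = (1 - m) * self.running_var + m * var
        else:
            # Eval: use the stored running statistics, not the batch.
            mean, var = self.running_mean, self.running_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.beta for x in batch]


bn = BatchNormSketch()
train_out = bn([10.0, 12.0, 14.0, 16.0])   # normalized with batch statistics
bn.training = False
eval_out = bn([10.0, 12.0, 14.0, 16.0])    # normalized with running statistics
```

After a single training step the running statistics have barely moved from their initial values, so the eval output for the same batch is far from zero-centered; they only become representative after many batches.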

**BatchNorm has a regularization effect** - each sample is normalized relative to its batch, which introduces noise (different batches produce different statistics). This noise acts as a weak regularizer. With BatchNorm, Dropout strength is often reduced or removed.
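
The batch-dependent noise is easy to see in a toy case. In this sketch (helper name and numbers are hypothetical), the same sample value is normalized inside two different mini-batches:

```python
import math


def batch_normalize(batch, eps=1e-5):
    """Standardize one feature using the statistics of the given batch."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


sample = 2.0
batch_a = [sample, 0.0, 1.0, 3.0]   # sample sits above this batch's mean
batch_b = [sample, 4.0, 5.0, 9.0]   # sample sits below this batch's mean

z_a = batch_normalize(batch_a)[0]
z_b = batch_normalize(batch_b)[0]
```

The identical input ends up positive in one batch and negative in the other, so what the next layer sees depends on which neighbours were sampled alongside it.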

BatchNorm in eval() mode uses different statistics than in train(). Which ones?

Layer Normalization

A language model processes sentences of varying length. BatchNorm normalizes across the batch - in NLP the batch contains sequences of different lengths with padding, making batch statistics unreliable. **Layer Normalization** normalizes each sample independently - across all features of a single sample, not across the batch.
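
A minimal sketch of the per-sample computation, assuming each sample is a plain list of features and omitting the learnable gain and bias that real layers carry (the function names are hypothetical):

```python
import math


def layer_norm(features, eps=1e-5):
    """Normalize one sample across its own features."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    return [(f - mean) / math.sqrt(var + eps) for f in features]


def layer_norm_batch(batch):
    # Each sample is handled on its own; no statistics cross the batch.
    return [layer_norm(sample) for sample in batch]


sample = [1.0, 2.0, 3.0, 4.0]
out_small = layer_norm_batch([sample, [100.0, 0.0, -50.0, 7.0]])[0]
out_large = layer_norm_batch([sample, [0.0] * 4, [9.0, 9.0, 9.0, 1.0]])[0]
```

Because nothing is shared across the batch dimension, the output for `sample` is the same whatever else is in the batch, whatever the batch size, and in both train and eval mode.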

Why do transformers use LayerNorm instead of BatchNorm?

Data Augmentation

A model trained on 10,000 cat photos memorizes them. All photos show cats on sofas, facing right, so the model fails on real-world photos. The problem is not dataset size but diversity. **Data Augmentation** generates new variations from existing data, increasing both the effective size and the diversity of the dataset.
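
As a small illustration, a random horizontal flip would turn some of those right-facing cats into left-facing ones. The sketch assumes an image is a list of pixel rows; both function names are hypothetical.

```python
import random


def horizontal_flip(image):
    """Mirror a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]


def random_horizontal_flip(image, p=0.5, rng=random):
    """Flip with probability p, otherwise return the image unchanged."""
    return horizontal_flip(image) if rng.random() < p else image


image = [[1, 2, 3],
         [4, 5, 6]]
flipped = horizontal_flip(image)

rng = random.Random(0)
# Each epoch may see a different variant of the same underlying photo.
variants = [random_horizontal_flip(image, rng=rng) for _ in range(8)]
```

The label is left untouched, since a mirrored cat is still a cat.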

**Augmentation rules:** 1) apply random augmentation only during training, never at eval. 2) augmentation should be label-preserving (a horizontally flipped cat is still a cat; a vertical flip often breaks semantic meaning). 3) MixUp and CutMix are typically reported to add around 0.5-1% accuracy on image classification at near-zero cost, so they are usually worth trying.
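
MixUp, mentioned in rule 3, blends two samples and their labels with a weight drawn from a Beta distribution. The sketch below assumes flat feature lists and one-hot labels; the function name `mixup` and the default `alpha=0.2` are illustrative choices, not values taken from this lesson.

```python
import random


def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """Blend two (input, one-hot label) pairs with weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


rng = random.Random(0)
cat_pixels, cat_label = [0.0, 0.2, 0.4], [1.0, 0.0]
dog_pixels, dog_label = [1.0, 0.8, 0.6], [0.0, 1.0]

# Intended for the training loop only; eval batches stay unmixed.
mixed_x, mixed_y, lam = mixup(cat_pixels, cat_label, dog_pixels, dog_label, rng=rng)
```

The mixed label is a soft target, so the loss has to accept class probabilities rather than a single class index. CutMix follows the same idea but pastes a rectangular patch from one image into the other and mixes the labels in proportion to the patch area.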

Data augmentation is applied only during training, not at inference. Why?

Regularization in Deep Learning

Dropout: randomly deactivates neurons during training - forces the network to work without co-adaptation
BatchNorm: normalizes across the batch - stabilizes gradients in deep networks, adds regularization noise
LayerNorm: normalizes across features per sample - works for NLP and variable-length sequences
Data Augmentation: random transforms during training - expands the effective dataset size
MixUp/CutMix: blend samples and labels - typically around +0.5-1% accuracy at near-zero computational cost

Questions for reflection

When is it appropriate to remove Dropout if BatchNorm is already present?
What is the practical difference between Pre-LayerNorm and Post-LayerNorm in transformers?
How can the right augmentation intensity be found without manual search?

Related lessons

dl-09 - Weight decay in AdamW is itself a form of regularization
dl-04 - BatchNorm is a standard component of CNN architectures
ml-08-regularization - L1 and L2 penalties from classical ML carry over
ml-21-bagging-boosting - Dropout acts like ensembling many thinned networks
prob-11-normal - BatchNorm standardizes activations to zero mean and unit variance (it does not make them normally distributed)
stat-01-sampling

