Computer Vision

CNN Architectures: from LeNet to ResNet

Prerequisites

Convolution, filter kernels, and pooling from cv-03
Backpropagation and gradient descent as the network's training mechanism
An image as a tensor (H, W, C) and how channels work

Features: SIFT, SURF, ORB

2012. AlexNet beat the ImageNet field by nearly 11 points - the moment computer vision flipped from hand-crafted SIFT/HOG features to learned ones. Each architecture over the next three years answered a specific bottleneck: AlexNet proved scale, VGG proved depth, and ResNet broke the degradation barrier and pushed past 150 layers. Most production CV systems today still build on this lineage.

**Bank check readers (1998-2000s)** ran LeNet-5 in production at AT&T and the US Postal Service - the first CNN deployment at industrial scale, processing about 10% of US checks
**ImageNet 2012**: AlexNet, trained on two GTX 580s with 3GB VRAM each, set off the deep learning era - top-5 error 15.3% vs 26.2% for the runner-up
**Microsoft's ResNet (2015)** won ILSVRC 2015 with an ensemble top-5 error of 3.57%, well below the roughly 5.1% human estimate on ImageNet, using residual connections to train 152 layers stably
**MobileNet** powers on-device CV (Google Photos, ARCore, Snapchat) - the design lineage from LeNet's parameter sharing through ResNet's residual blocks made lightweight mobile inference possible

Historical context

In 1998, Yann LeCun and colleagues at AT&T published "Gradient-Based Learning Applied to Document Recognition", introducing LeNet-5. The paper demonstrated that a small convolutional network (seven layers in the paper's own count, including the subsampling layers), trained end-to-end with backpropagation, could read handwritten digits from MNIST with error rates below 1%. Banks deployed LeNet-5 in check reading machines by the late 1990s - it was processing roughly 10% of US checks by 2000. When AlexNet won ImageNet in 2012 by a margin of more than 10 points, it vindicated LeCun's 1998 ideas scaled to GPU hardware. LeCun received the 2018 Turing Award jointly with Yoshua Bengio and Geoffrey Hinton.

LeNet-5: the first successful CNN

**LeNet-5** (1998) established the CNN blueprint: alternating convolution and pooling layers extract hierarchical features, followed by fully connected layers for classification. The architectural innovation was **weight sharing**: a single filter slides across all spatial positions, dramatically reducing parameters compared to a fully connected layer. Weight sharing also makes the layer translation-equivariant: a shifted input produces a correspondingly shifted feature map, and the pooling layers then add a degree of translation invariance.
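To make the saving from weight sharing concrete, the sketch below counts parameters for LeNet-5's first layer (six 5x5 filters on a 32x32 grayscale input, producing six 28x28 maps) and for a hypothetical fully connected layer that maps the same input to the same number of outputs. The helper names `conv_params` and `dense_params` are illustrative, not taken from the paper.

```python
def conv_params(in_channels, out_channels, kernel_size):
    """Weights plus one bias per filter; independent of the image size."""
    return out_channels * (kernel_size * kernel_size * in_channels + 1)


def dense_params(n_inputs, n_outputs):
    """One weight per input-output pair plus one bias per output."""
    return n_outputs * (n_inputs + 1)


# LeNet-5 first layer: 32x32x1 input, six 5x5 filters, 28x28x6 output.
conv = conv_params(in_channels=1, out_channels=6, kernel_size=5)
# Hypothetical dense layer with the same input and output sizes.
dense = dense_params(n_inputs=32 * 32 * 1, n_outputs=28 * 28 * 6)

print(conv)           # 156
print(dense)          # 4821600
print(dense // conv)  # 30907
```

The convolutional count does not depend on the image size at all, which is why the same filters can be reused on larger inputs.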

**Why Tanh (not ReLU)**: LeNet-5 used Tanh because ReLU had not yet been established as the default activation. Tanh saturates at +/-1 and suffers from vanishing gradients in deep networks. AlexNet's switch to ReLU in 2012 was one of its key contributions - gradients flow freely through ReLU for positive activations, enabling training of much deeper networks.
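A rough way to see the saturation effect is to multiply the local activation derivatives along a chain of layers. The sketch below is a toy setting, not a real network: it assumes every layer sees the same pre-activation value and ignores the weight matrices entirely.

```python
import math


def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, always in (0, 1]."""
    return 1.0 - math.tanh(x) ** 2


def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if x > 0 else 0.0


def chained_grad(grad_fn, pre_activation, depth):
    """Product of `depth` identical local derivatives (weights ignored)."""
    return grad_fn(pre_activation) ** depth


# Assumed toy setting: every layer sees the same pre-activation of 2.0.
tanh_chain = chained_grad(tanh_grad, 2.0, depth=10)
relu_chain = chained_grad(relu_grad, 2.0, depth=10)

print(f"tanh: {tanh_chain:.2e}")  # many orders of magnitude below 1
print(f"relu: {relu_chain:.2e}")  # exactly 1 for positive activations
```

For negative pre-activations the ReLU derivative is 0, so the benefit applies only along paths where units are active.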

What is the key advantage of convolutional layers in LeNet over fully connected layers for image processing?

AlexNet: deep learning wins ImageNet

**AlexNet** (2012) won the ILSVRC ImageNet competition with a 15.3% top-5 error rate, about 10.9 points ahead of second place (26.2%). Three ingredients enabled this: ReLU activations (mitigating vanishing gradients), Dropout regularization (reducing overfitting), and GPU training (enabling the 8-layer, 60M-parameter network, with five convolutional and three fully connected layers, to train in about a week on two GTX 580s with 3GB VRAM each).

**Local Response Normalization (LRN)**: AlexNet applied LRN after its first two conv layers, normalizing each activation by the activity of nearby feature maps at the same spatial position (lateral inhibition, inspired by neuroscience). VGG later showed LRN did not help and removed it. Batch Normalization (2015) then largely displaced LRN: it normalizes each channel over the mini-batch, stabilizes the training of deep networks, and has a mild regularizing side effect.
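As a minimal sketch of the LRN idea, the function below normalizes the activations at a single spatial position across neighboring channels. The default hyperparameters (k=2, n=5, alpha=1e-4, beta=0.75) are the values reported in the AlexNet paper; the function name and the list-based layout are assumptions made for illustration.

```python
def local_response_norm(channels, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Normalize activations at one spatial position across nearby channels.

    `channels` holds one activation per feature map. Each value is divided
    by (k + alpha * sum of squares over a window of n channels) ** beta.
    """
    num_channels = len(channels)
    half = n // 2
    normalized = []
    for i, a in enumerate(channels):
        lo = max(0, i - half)                 # clip the window at the edges
        hi = min(num_channels - 1, i + half)
        energy = sum(channels[j] ** 2 for j in range(lo, hi + 1))
        normalized.append(a / (k + alpha * energy) ** beta)
    return normalized


activations = [1.0, 50.0, 2.0, 0.5, 3.0, 100.0]
out = local_response_norm(activations)
print(out)
```

A channel surrounded by strongly active neighbors is scaled down more than one with quiet neighbors, which is the lateral-inhibition effect.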

Why did AlexNet use ReLU instead of Tanh, and what was the practical benefit?

VGG: depth through 3x3 filters

**VGG** (2014) showed systematically that network depth is a critical factor for performance. VGG used exclusively 3x3 convolutions, stacked 2-3 per block before pooling. The insight: two 3x3 convolutions have the same receptive field as one 5x5, but fewer parameters (18C² vs 25C² weights for C input and output channels) and two ReLU non-linearities instead of one. This 'depth over width' principle guided architecture design for years.
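The receptive-field and parameter argument can be checked with a few lines of arithmetic. This sketch assumes stride-1 convolutions with no pooling in between and the same number of input and output channels in every layer; the channel width of 256 is chosen only for illustration.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of stride-1 convolutions stacked without pooling."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels, num_layers=1):
    """Weight count (biases ignored) with `channels` inputs and outputs per layer."""
    return num_layers * kernel_size * kernel_size * channels * channels


C = 256  # assumed channel width, chosen only for illustration

print(stacked_receptive_field(2))        # 5, same as one 5x5
print(stacked_receptive_field(3))        # 7, same as one 7x7
print(conv_weights(3, C, num_layers=2))  # 1179648  (18 * C^2)
print(conv_weights(5, C))                # 1638400  (25 * C^2)
```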

**VGG parameter count**: VGG-16 has 138M parameters, of which about 124M (89%) are in the three fully connected layers at the end; the first FC layer alone holds about 103M (74%). This is both its strength (high capacity) and weakness (memory, computation). Modern architectures replace the FC layers with Global Average Pooling followed by a single linear classifier, shrinking the head from over 100M parameters to a few million at most - a change introduced in Network in Network, popularized by GoogLeNet, and adopted widely from ResNet onward.
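A rough tally shows where VGG-16's parameters sit. The sketch assumes the standard 16-layer configuration (3x3 convolutions, 2x2 max pooling, 224x224 RGB input, 1000 classes) and counts biases; the `gap_head` line is a hypothetical replacement head (global average pooling over 512 channels plus one linear layer), not part of VGG itself.

```python
# VGG-16 layout: numbers are conv output channels, "M" is 2x2 max pooling.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def vgg16_param_counts(num_classes=1000, input_size=224):
    """Return (conv_params, fc_params_per_layer) for VGG-16, biases included."""
    conv_params, in_ch, size = 0, 3, input_size
    for item in VGG16_CONV:
        if item == "M":
            size //= 2  # each pooling layer halves height and width
        else:
            conv_params += item * (3 * 3 * in_ch + 1)
            in_ch = item
    flat = in_ch * size * size  # 512 * 7 * 7 = 25088
    fc_dims = [flat, 4096, 4096, num_classes]
    fc_params = [n_out * (n_in + 1) for n_in, n_out in zip(fc_dims, fc_dims[1:])]
    return conv_params, fc_params


conv_params, fc_params = vgg16_param_counts()
total = conv_params + sum(fc_params)
gap_head = 1000 * (512 + 1)  # hypothetical GAP + single linear layer

print(total)                                # 138357544
print(fc_params)                            # [102764544, 16781312, 4097000]
print(round(100 * sum(fc_params) / total))  # 89
print(round(100 * fc_params[0] / total))    # 74
print(gap_head)                             # 513000
```

The first fully connected layer dominates because it connects all 25088 flattened features to 4096 units.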

Why is a stack of two 3x3 convolutions preferred over one 5x5 convolution (same receptive field)?

ResNet: residual connections and 150+ layers

**ResNet** (2015) solved the **degradation problem**: simply adding more layers to a network made it worse, even on training data (not just validation). This was not a vanishing gradient problem - it was that deep networks were harder to optimize. ResNet's solution: **residual connections** (skip connections) that let the network learn F(x) = H(x) - x (the residual) rather than H(x) directly. If the layer is not useful, F(x) = 0 and information flows unchanged through the shortcut.
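The "identity by default" behavior can be sketched with a toy block. Here the residual branch F is an elementwise scale followed by ReLU, a stand-in for the real conv layers; the function names and the list-based vectors are assumptions made for illustration.

```python
def residual_branch(x, weights):
    """Toy F(x): an elementwise scale followed by ReLU (stand-in for conv layers)."""
    return [max(0.0, w * v) for w, v in zip(weights, x)]


def residual_block(x, weights):
    """y = F(x) + x: the shortcut adds the input back to the branch output."""
    return [f + v for f, v in zip(residual_branch(x, weights), x)]


def plain_block(x, weights):
    """y = F(x): without a shortcut the layers must represent H(x) themselves."""
    return residual_branch(x, weights)


x = [0.5, -1.0, 2.0]
zero_weights = [0.0, 0.0, 0.0]

print(residual_block(x, zero_weights))  # [0.5, -1.0, 2.0]
print(plain_block(x, zero_weights))     # [0.0, 0.0, 0.0]
```

With all-zero weights the residual block passes its input through unchanged, while the plain block destroys it, so an extra residual block starts out harmless and only has to learn a correction.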

**Pre-activation ResNet (v2)**: the original ResNet block applies BatchNorm after each convolution and a ReLU after the addition, so the signal on the shortcut path is not a pure identity from block to block. He et al. (2016) proposed 'pre-activation' - BN and ReLU before each conv layer - showing this allows gradients to flow directly through the shortcut path as pure identity, improving training of 1000+ layer networks. The same normalize-first ordering recurs in many later architectures.
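The difference between the two orderings shows up even in a scalar toy model. The sketch below omits BatchNorm and uses a branch that outputs zero, so only the shortcut carries signal; both simplifications are assumptions made to isolate the effect of where the ReLU sits.

```python
def relu(v):
    return max(0.0, v)


def post_activation_block(x, branch):
    """Original ordering (BatchNorm omitted): ReLU is applied after the addition."""
    return relu(x + branch(x))


def pre_activation_block(x, branch):
    """Pre-activation ordering (BatchNorm omitted): ReLU sits inside the branch."""
    return x + branch(relu(x))


def zero_branch(v):
    """A branch that has learned nothing, so only the shortcut carries signal."""
    return 0.0


signal = -1.5
post, pre = signal, signal
for _ in range(3):  # three stacked blocks
    post = post_activation_block(post, zero_branch)
    pre = pre_activation_block(pre, zero_branch)

print(post)  # 0.0
print(pre)   # -1.5
```

In the post-activation ordering the ReLU after the addition clips the negative signal, whereas the pre-activation ordering carries it through every block untouched.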

What problem do residual connections solve in ResNet, and why is it not simply a vanishing gradient issue?

Key Takeaways

**LeNet-5 (1998)** introduced weight sharing and the conv-pool-conv-pool-FC blueprint, deployed in industrial check readers but limited by Tanh saturation
**AlexNet (2012)** scaled CNNs to GPUs with ReLU + Dropout + LRN, dropping ImageNet top-5 error from 26% to 15.3% and starting the deep learning era
**VGG (2014)** showed that depth dominates width: stacks of 3x3 convolutions match larger receptive fields with fewer parameters and more non-linearities
**ResNet (2015)** solved the degradation problem with skip connections: learning F(x) = H(x) - x makes identity the default, so adding layers helps instead of hurting - 152 layers, and an ensemble top-5 error of 3.57%, below the roughly 5.1% human estimate

Questions for reflection

ResNet's identity shortcut means F(x) = H(x) - x - the network learns the residual rather than the full transformation. Why is learning the residual easier for the optimizer than learning H(x) directly?
VGG-16 has 138M parameters, with about 124M sitting in the final fully connected layers. Modern architectures replaced those FC layers with Global Average Pooling. What does the network give up by removing them, and why is the trade-off worth it?
AlexNet won ImageNet partly because of GPU training. If only CPU compute had been available in 2012, what experiments would researchers have run differently, and how might the field's timeline have changed?

Related lessons

cv-03 — Convolution and feature fundamentals precede architecture design
cv-05 — Modern architectures and ViT extend the CNN story with attention
dl-04 — Convolutional networks are studied deeply in the DL track
ml-29-cnn — Same CNN principles in the classical ML curriculum
ml-38-image-classification — LeNet to ResNet applied directly to image classification
dl-01