Machine Learning
Image Classification
In 2012, a deep neural network first beat hand-crafted features at image recognition. In 2015, a network surpassed estimated human accuracy on ImageNet. By 2020, a task that once required a PhD team cost $0.001 per image via API. This revolution rested on three ideas: deeper networks, smarter scaling, and creative data augmentation.
- **Medical diagnostics** - CNN models based on EfficientNet analyze X-rays, MRI scans, and histological images, reaching the accuracy of experienced radiologists in screening for lung cancer, diabetic retinopathy, and melanoma. In 2018, the FDA authorized the first autonomous AI diagnostic system (IDx-DR, for diabetic retinopathy screening)
- **Autonomous driving** - Tesla, Waymo, and others use CNNs to classify road signs, pedestrians, and vehicles in real time. ResNet-like backbones process frames from 8 cameras simultaneously, making decisions in milliseconds
- **Manufacturing quality control** - CNNs with transfer learning detect defects on assembly lines (cracks, scratches, shape deviations) with 99.5%+ accuracy, replacing human visual inspection and operating 24/7 without fatigue
Prerequisites
From ImageNet to residual networks
The modern era of computer vision began with data. In 2009 Fei-Fei Li and her collaborators released ImageNet, a labeled set of millions of images, and from 2010 ran an annual competition on a 1000-category subset of it. The first two winners used hand-crafted features. Then in 2012 Alex Krizhevsky's AlexNet, a deep convolutional network trained on GPUs, cut the top-5 error from about 26% to about 16% and ended the debate about whether deep learning worked. VGG in 2014 showed that stacking small filters into very deep networks kept improving accuracy, but training at such depth grew unstable. In 2015 Kaiming He and his team at Microsoft introduced ResNet and its residual connections, shortcuts that let gradients flow straight through. They made networks of a hundred or more layers trainable and pushed error on the ImageNet benchmark below the commonly cited estimate of human error.
The Evolution of CNN Architectures
The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) was an annual competition to classify 1.2 million images into 1000 categories. From 2010 to 2017, it became **the primary driver of progress** in computer vision. Each year, the winner introduced an architectural innovation that then became the industry standard. In 2012, a breakthrough occurred: the CNN architecture **AlexNet** obliterated all hand-crafted methods, dropping top-5 error from 26% to 16%. This was the moment deep learning stopped being an academic toy and became an industrial tool.
**AlexNet (2012)** was the first deep CNN to win ILSVRC, and it was trained on GPUs. Alex Krizhevsky used two GTX 580 cards (3 GB each), splitting the network across them. Key innovations: **ReLU** activation instead of saturating sigmoid or tanh units (the paper reports reaching a given training error about 6x faster than with tanh), **Dropout** for regularization (randomly disabling 50% of neurons in the fully connected layers), and **data augmentation** (flips, crops, color jitter). 8 learned layers, 60 million parameters, and results that demolished all classical computer vision methods.
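To make two of these innovations concrete, here is a minimal sketch in plain Python. It compares the gradient of a sigmoid with that of ReLU at growing pre-activations, and applies dropout to a list of activations. The function names are illustrative, and the dropout shown is the "inverted" variant common in current frameworks (rescaling at training time); the original AlexNet paper instead scaled outputs at test time.

```python
import math
import random


def sigmoid_grad(z):
    """Derivative of the logistic sigmoid at pre-activation z."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0


def dropout(activations, p_drop=0.5, rng=random):
    """Inverted dropout (assumed variant): zero each unit with probability
    p_drop and rescale the survivors so the expected value is unchanged."""
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


if __name__ == "__main__":
    # The sigmoid gradient shrinks as |z| grows; the ReLU gradient stays at 1.
    for z in (0.5, 2.0, 5.0):
        print(z, sigmoid_grad(z), relu_grad(z))
    print(dropout([1.0] * 8, rng=random.Random(0)))
```

The saturating sigmoid gradient is never larger than 0.25 and decays quickly, which is one intuition for why ReLU networks train faster.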
**VGG (2014)** proved a simple but powerful idea: **depth matters more than complexity**. Instead of large filters (11x11 in AlexNet), VGG used only 3x3 filters but stacked them: two 3x3 layers have an effective receptive field of 5x5, three give 7x7. A stack needs fewer parameters than a single large filter with the same receptive field, and it adds more nonlinearities (ReLU). VGG-16 and VGG-19 became standard feature extractors still in use today.
**GoogLeNet/Inception (2014)** took a different approach: instead of one filter size, each Inception module applies 1x1, 3x3, 5x5 convolutions and max pooling in parallel, then concatenates the results. This lets the network choose its own scale of analysis. 1x1 convolutions before the 3x3 and 5x5 compress channels to reduce computation. Result: 22 layers but only **7 million parameters** - 20x fewer than VGG.
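The arithmetic behind both ideas can be checked in a few lines. This sketch counts weights only (biases ignored) and assumes a hypothetical layer width of 256 channels squeezed to 64 by the 1x1 convolution; these numbers are illustrative and not taken from either paper.

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)


def conv_weights(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution, biases ignored."""
    return kernel * kernel * c_in * c_out


C = 256  # hypothetical channel width, same for input and output

# VGG idea: three 3x3 layers see a 7x7 window with fewer weights than one 7x7.
three_3x3 = 3 * conv_weights(3, C, C)
one_7x7 = conv_weights(7, C, C)

# Inception idea: a 1x1 convolution squeezes 256 -> 64 channels before the 5x5.
direct_5x5 = conv_weights(5, C, C)
reduced_5x5 = conv_weights(1, C, 64) + conv_weights(5, 64, C)

if __name__ == "__main__":
    print(stacked_receptive_field(3, 2), stacked_receptive_field(3, 3))
    print(three_3x3, one_7x7)
    print(direct_5x5, reduced_5x5)
```

Under these assumptions the stacked 3x3 layers use 27·C² weights against 49·C² for a single 7x7, and the 1x1 reduction cuts the 5x5 branch to roughly a quarter of its direct cost.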
Between 2012 and 2015 it became clear: **deeper networks give better results**. AlexNet (8 layers) lost to VGG (19), which lost to GoogLeNet (22). The recipe seemed simple - add more layers. But attempts to build networks with 50+ layers hit a problem: accuracy didn't just plateau - it **dropped**. A 56-layer network performed worse than a 20-layer one. This wasn't overfitting (training error also increased). It was the **degradation problem**, which the next architecture - ResNet - would solve.
GoogLeNet/Inception achieved better accuracy than VGG while having 20x fewer parameters. How?
ResNet: Skip Connections
By 2015, it was clear that network depth was critical for accuracy. But attempts to train plain networks much deeper than about 20 layers ran into the **degradation problem**: both train and test error increased compared to shallower networks. This was a paradox: a 56-layer network **contains** all solutions of a 20-layer network (the extra 36 layers could simply learn the identity mapping H(x) = x). But in practice, the optimizer couldn't find that solution: signal and gradient flow degraded over dozens of stacked nonlinear layers, and the deep network settled on a worse solution than the shallow one.
Kaiming He's solution (He et al., 2015) was elegant: instead of forcing a layer to learn the full mapping H(x), let it learn the **residual** F(x) = H(x) - x. Then the block output is: **H(x) = F(x) + x**. This is implemented via a **skip connection** (shortcut) - the input x is simply added to the output of the convolutional layers. If the optimal mapping is close to the identity, it's easier for the network to learn F(x) = 0 (push weights toward zero) than to learn H(x) = x (exact copying through several layers).
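The identity argument can be seen in a toy example. The sketch below uses small dense layers as a stand-in for convolutions (an assumption made to keep it dependency-free) and omits the ReLU that ResNet applies after the addition. With all weights at zero, the residual block passes its input through unchanged, while the plain block destroys it.

```python
def linear(x, weights):
    """Tiny stand-in for a conv layer: y[i] = sum_j weights[i][j] * x[j]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def relu(x):
    return [max(0.0, v) for v in x]


def plain_block(x, w1, w2):
    """Two layers without a shortcut: H(x) = w2 . relu(w1 . x)."""
    return linear(relu(linear(x, w1)), w2)


def residual_block(x, w1, w2):
    """Same two layers with a shortcut: H(x) = F(x) + x."""
    fx = plain_block(x, w1, w2)
    return [f + v for f, v in zip(fx, x)]


zeros = [[0.0] * 3 for _ in range(3)]
x = [1.0, -2.0, 3.0]

if __name__ == "__main__":
    print(residual_block(x, zeros, zeros))  # input passes through unchanged
    print(plain_block(x, zeros, zeros))     # input is lost
```

Note that the plain block could not reproduce the negative entry of `x` even with identity weights, because the ReLU clips it; the shortcut sidesteps that entirely.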
**Why skip connections solve the degradation problem:**
1. **Gradient highway** - gradients can flow directly through the skip connection, bypassing the convolutional layers. Even in a 152-layer network, gradients reach the early layers without vanishing.
2. **Easy identity learning** - if a layer isn't needed, it's easier for the network to zero out its weights (F(x) = 0, so H(x) = x) than to learn exact copying through convolutions.
3. **Ensemble effect** - ResNet can be viewed as an ensemble of networks of varying depth. Each residual block can be bypassed via the skip connection, creating 2^n possible paths through n blocks.
4. **Bottleneck block** - for deep ResNets (50+), a 1x1-3x3-1x1 configuration is used: 1x1 compresses channels, 3x3 performs the convolution at the reduced width, 1x1 expands back. This is far cheaper than two 3x3 convolutions at the full channel width (for 256 channels with a 4x squeeze, about 70K weights instead of about 1.2M).
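Two of the points above reduce to simple counting. This sketch compares the weight count of a bottleneck block with two full-width 3x3 convolutions and counts the paths through a stack of residual blocks. Biases and normalization layers are ignored, and the 4x channel reduction is the setting assumed here.

```python
def conv_weights(kernel, c_in, c_out):
    """Weight count of a kernel x kernel convolution, biases ignored."""
    return kernel * kernel * c_in * c_out


def basic_block_weights(channels):
    """Two 3x3 convolutions at full channel width."""
    return 2 * conv_weights(3, channels, channels)


def bottleneck_weights(channels, reduction=4):
    """1x1 reduce -> 3x3 at reduced width -> 1x1 expand."""
    mid = channels // reduction
    return (conv_weights(1, channels, mid)
            + conv_weights(3, mid, mid)
            + conv_weights(1, mid, channels))


def num_paths(n_blocks):
    """Each block is either traversed or skipped: 2^n distinct paths."""
    return 2 ** n_blocks


if __name__ == "__main__":
    print(basic_block_weights(256), bottleneck_weights(256))
    print(num_paths(16))  # ResNet-50 stacks 3 + 4 + 6 + 3 = 16 blocks
```

At 256 channels the bottleneck needs about 17x fewer weights than the two full-width 3x3 convolutions, which is what makes 50+ layer ResNets affordable.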
Why does ResNet-56 outperform a plain 56-layer CNN without skip connections, even though both networks have the same capacity?
EfficientNet: Compound Scaling
After ResNet, a question arose: how do you **scale** a CNN for maximum accuracy? There are three scaling dimensions: **width** (number of channels per layer), **depth** (number of layers), and **resolution** (input image size). Traditionally, engineers increased one dimension at a time: ResNet went deeper (50, 101, 152 layers), WideResNet expanded channels, and for medical tasks you simply fed in high-resolution images. But scaling a single dimension saturates quickly - adding layers beyond 100 gives minimal improvement while doubling computation.
Mingxing Tan and Quoc Le (Google Brain, 2019) proposed **compound scaling** - simultaneously scaling all three dimensions at a fixed ratio. The intuition: if you increase input resolution, the network needs more layers (depth) to enlarge its receptive field over the bigger image, and more channels (width) to capture the finer patterns that the extra detail reveals. Scaling one dimension without the others is like upgrading a camera but looking at the photos through a foggy window.
**Compound scaling coefficient:** EfficientNet uses a single coefficient phi to scale all three dimensions:
- depth: d = alpha^phi
- width: w = beta^phi
- resolution: r = gamma^phi
The constants alpha, beta, gamma are found by a small grid search on the B0 baseline, subject to alpha * beta^2 * gamma^2 ~ 2, so that FLOPS roughly double each time phi increases by 1 (FLOPS grow linearly with depth but quadratically with width and resolution). For EfficientNet: alpha = 1.2, beta = 1.1, gamma = 1.15.
**Scaling from B0 to B7:**
- B0: phi=0, 224x224, 5.3M parameters, 77.1% top-1
- B3: phi=3, 300x300, 12M parameters, 81.6% top-1
- B7: phi=7, 600x600, 66M parameters, 84.3% top-1
Each step of phi roughly doubles FLOPS and adds about 1% accuracy.
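The scaling rule is easy to evaluate directly. The sketch below plugs the constants from the text into the three formulas and reports the resulting multipliers; the function name and the returned dictionary are illustrative choices. It applies the raw formula only, so it will not exactly reproduce the published resolutions listed above, which were rounded and adjusted per model.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth, width, resolution bases from the text


def compound_scale(phi, base_resolution=224):
    """Multipliers for depth, width, resolution and FLOPS at coefficient phi."""
    depth_mult = ALPHA ** phi
    width_mult = BETA ** phi
    res_mult = GAMMA ** phi
    # FLOPS scale linearly with depth, quadratically with width and resolution.
    flops_mult = (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi
    return {
        "depth": depth_mult,
        "width": width_mult,
        "resolution": round(base_resolution * res_mult),
        "flops": flops_mult,
    }


if __name__ == "__main__":
    for phi in range(0, 8):
        print(phi, compound_scale(phi))
```

With these constants one step of phi multiplies FLOPS by about 1.92, which is the "roughly doubling" the constraint asks for.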
The base EfficientNet-B0 architecture was discovered through **NAS (Neural Architecture Search)** - automated search for the optimal architecture. NAS evaluates thousands of configurations (which blocks, how many layers, what kernel size), trains each one, and selects the best by accuracy-per-computation ratio. EfficientNet-B0 is built on **MBConv** blocks (Mobile Inverted Bottleneck Convolution) with squeeze-and-excitation: expand channels via 1x1, depthwise 3x3 or 5x5 convolution, SE channel attention, then project back via 1x1. This is the same inverted residual block from MobileNet V2, but with optimal proportions.
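To see where the weights of an MBConv block go, the following sketch counts them stage by stage. It ignores biases and normalization, and it assumes the squeeze-and-excitation width is a quarter of the block's input channels; the function name, that ratio, and the example sizes are assumptions for illustration, not values quoted from the text.

```python
def mbconv_weights(c_in, c_out, kernel=3, expand=6, se_ratio=0.25):
    """Per-stage weight counts of an MBConv block (biases and norms ignored)."""
    c_exp = c_in * expand                     # width after the 1x1 expansion
    c_se = max(1, int(c_in * se_ratio))       # assumed SE bottleneck width
    return {
        "expand_1x1": c_in * c_exp,
        "depthwise": kernel * kernel * c_exp,  # one k x k filter per channel
        "squeeze_excite": 2 * c_exp * c_se,    # reduce, then restore
        "project_1x1": c_exp * c_out,
    }


def dense_conv_weights(kernel, channels):
    """A standard convolution at the same expanded width, for comparison."""
    return kernel * kernel * channels * channels


if __name__ == "__main__":
    stages = mbconv_weights(40, 40, kernel=5)
    print(stages, sum(stages.values()))
    print(dense_conv_weights(5, 240))
```

For a hypothetical 40-channel block with a 5x5 kernel, the whole MBConv block needs 30,000 weights, while a single dense 5x5 convolution at the expanded width of 240 channels would need 1,440,000. The depthwise convolution is what makes the wide middle of the block cheap.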
After EfficientNet (2019), architectures emerged that rethought the entire approach to image classification. **Vision Transformer (ViT, 2020)** abandoned convolutions entirely: an image is split into 16x16 patches, each patch is projected into an embedding, and the resulting sequence is processed by a standard Transformer. ViT outperforms CNNs when pretrained on very large datasets (300M+ images), but when trained on ImageNet alone (1.2M), CNNs remain competitive. **ConvNeXt (2022)** is a pure CNN that borrows Vision Transformer design choices: patchify stem (4x4 conv with stride 4), LayerNorm instead of BatchNorm, GELU instead of ReLU, depthwise 7x7 convolutions. ConvNeXt matches Vision Transformer accuracy without any attention mechanism.
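The patch-splitting step of ViT is simple enough to sketch without any library. The function below cuts an image stored as nested lists (height x width x channels) into flattened, non-overlapping patches in row-major order. The learned linear projection and the Transformer itself are omitted, and the nested-list representation is an assumption made to keep the example self-contained.

```python
def patchify(image, patch=16):
    """Split an H x W x C nested-list image into flattened patches.

    Patches are returned in row-major order; H and W must be divisible
    by `patch`.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            flat = []
            for row in image[top:top + patch]:
                for pixel in row[left:left + patch]:
                    flat.extend(pixel)
            patches.append(flat)
    return patches


# A blank 224 x 224 RGB image as a placeholder input.
image = [[[0.0, 0.0, 0.0] for _ in range(224)] for _ in range(224)]
tokens = patchify(image)

if __name__ == "__main__":
    print(len(tokens), len(tokens[0]))
```

A 224x224 RGB image becomes 14 x 14 = 196 tokens of 16 x 16 x 3 = 768 values each, which is the sequence the Transformer then processes.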
EfficientNet-B0 achieves higher accuracy than ResNet-50 with 5x fewer parameters. What core principle underlies this?
Data Augmentation
Network architecture is only half the story in image classification. The other half is **data**. ImageNet contains 1.2 million images, but for training ResNet-152 with 60 million parameters this is insufficient - the model tends to overfit. **Data augmentation** addresses this by creating variations of training images through random transforms. But it's not just "making the dataset bigger" - each transformation trains the model on a specific **invariance**: horizontal flip teaches that a cat looking left is the same cat looking right; random crop teaches that a partial view of an object is sufficient for recognition.
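The two transforms named above fit in a few lines. This is a minimal sketch on images stored as lists of rows, with illustrative function names and a flip probability of 0.5 chosen as an assumption; real pipelines operate on tensors and usually resize after cropping.

```python
import random


def horizontal_flip(image):
    """Mirror an image (list of rows) left to right."""
    return [row[::-1] for row in image]


def random_crop(image, size, rng=random):
    """Cut a random size x size window out of the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - size)
    left = rng.randint(0, width - size)
    return [row[left:left + size] for row in image[top:top + size]]


def augment(image, crop_size, rng=random):
    """Random crop, then a flip with probability 0.5. The label is unchanged,
    which is exactly the invariance the model is being taught."""
    out = random_crop(image, crop_size, rng)
    if rng.random() < 0.5:
        out = horizontal_flip(out)
    return out


if __name__ == "__main__":
    img = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(augment(img, 3, rng=random.Random(0)))
```

Because a fresh crop and flip are drawn every time an image is loaded, the model rarely sees the exact same pixels twice across epochs.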
Advanced augmentation methods go beyond geometric transforms. **CutMix** cuts a rectangle from one image and pastes it into another, mixing labels proportionally by area: if 30% of pixels from a cat are pasted into a dog image, the label becomes 0.3*cat + 0.7*dog. **MixUp** blends two images and their labels with a random weight lambda, usually drawn from a Beta(alpha, alpha) distribution: x_new = lambda * x1 + (1-lambda) * x2 and y_new = lambda * y1 + (1-lambda) * y2. Both methods train the model to avoid being "too confident" - this is a form of **label smoothing through data**.
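Both methods are short enough to sketch directly. The code below works on flat pixel lists for MixUp and on lists of rows for CutMix, with one-hot labels as plain lists. The function names, the explicit box arguments, and the Beta parameter of 0.2 are assumptions made for illustration.

```python
import random


def mixup(x1, y1, x2, y2, lam):
    """Blend two flat images and their one-hot labels with weight lam."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y


def cutmix(img1, y1, img2, y2, top, left, box_h, box_w):
    """Paste a box from img2 into img1 and mix labels by pasted area."""
    height, width = len(img1), len(img1[0])
    bottom = min(top + box_h, height)
    right = min(left + box_w, width)
    out = [row[:] for row in img1]
    for r in range(top, bottom):
        for c in range(left, right):
            out[r][c] = img2[r][c]
    frac = (bottom - top) * (right - left) / (height * width)
    y = [(1.0 - frac) * a + frac * b for a, b in zip(y1, y2)]
    return out, y


def sample_lambda(alpha=0.2, rng=random):
    """Draw the MixUp weight from Beta(alpha, alpha); alpha is a hyperparameter."""
    return rng.betavariate(alpha, alpha)


if __name__ == "__main__":
    dog = [[0.0] * 10 for _ in range(10)]
    cat = [[1.0] * 10 for _ in range(10)]
    # Labels are ordered [cat, dog]; paste 3 of 10 rows from the cat image.
    _, label = cutmix(dog, [0.0, 1.0], cat, [1.0, 0.0], 0, 0, 3, 10)
    print(label)
```

The example in `__main__` reproduces the case from the text: 30% of the pixels come from the cat image, so the label is 0.3 cat and 0.7 dog.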
**Automatic augmentation strategies:**
- **AutoAugment (Google, 2018)** - reinforcement learning searches for the optimal transformation sequence for a specific dataset. For ImageNet it found non-intuitive combinations: Posterize + Rotate, Equalize + Shear.
- **RandAugment (2020)** - a simplified alternative: randomly choose N transforms from a list and apply each at a uniform strength M. Just 2 hyperparameters instead of 30+ in AutoAugment, with comparable or better accuracy.
- **TrivialAugment (2021)** - even simpler: one random transform at a random strength. Zero hyperparameters, performs on par with RandAugment.
Trend: from complex learned strategies toward simple random ones - regularization through randomness turned out to be sufficient.
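The two random strategies differ only in how they pick operations and strengths, which a toy version makes visible. The sketch below uses three simplified stand-in operations on a flat list of intensities in [0, 1]; the operations, their strength scaling, and the uniform strength sampling are assumptions for illustration and do not match the transform lists or magnitude bins of the published methods.

```python
import random


def brightness(img, s):
    """Shift intensities up by up to 0.5, clipped to 1."""
    return [min(1.0, p + 0.5 * s) for p in img]


def contrast(img, s):
    """Stretch intensities away from the mean, clipped to [0, 1]."""
    mean = sum(img) / len(img)
    return [min(1.0, max(0.0, mean + (1.0 + s) * (p - mean))) for p in img]


def posterize(img, s):
    """Quantize to fewer intensity levels as strength grows."""
    levels = max(2, round(8 - 6 * s))
    return [round(p * (levels - 1)) / (levels - 1) for p in img]


OPS = [brightness, contrast, posterize]  # toy stand-ins for the real op list


def rand_augment(img, n, m, rng=random):
    """RandAugment-style: n randomly chosen ops, all at the same strength m."""
    for op in rng.choices(OPS, k=n):
        img = op(img, m)
    return img


def trivial_augment(img, rng=random):
    """TrivialAugment-style: one random op at one random strength."""
    return rng.choice(OPS)(img, rng.random())


if __name__ == "__main__":
    pixels = [0.1, 0.4, 0.6, 0.9]
    print(rand_augment(pixels, n=2, m=0.5, rng=random.Random(0)))
    print(trivial_augment(pixels, rng=random.Random(0)))
```

`rand_augment` exposes exactly the two hyperparameters N and M described above, while `trivial_augment` takes none.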
Data augmentation is not just a trick for inflating data. It's **regularization**, standing alongside Dropout and weight decay. Dropout randomly disables neurons, forcing the network not to rely on individual features. Data augmentation randomly transforms inputs, forcing the network not to rely on specific pixels, positions, or lighting. Experiments show: ResNet-50 trained with strong augmentation (RandAugment + CutMix + MixUp) reaches 80.4% top-1 on ImageNet, within about a point of EfficientNet-B3 (81.6%), a newer architecture produced by architecture search and compound scaling. The right augmentation can be worth as much as a more sophisticated architecture.
**Misconception:** Data augmentation is just a trick to increase dataset size, only needed when you have very little data
**Reality:** Data augmentation is a form of regularization that teaches the model invariances (symmetry, scale, color). It improves generalization even on large datasets, standing alongside Dropout and weight decay
ResNet-50 with RandAugment + CutMix + MixUp reaches 80.4% on ImageNet, close to EfficientNet-B3 (81.6%) even though ResNet-50 is the older, simpler design. If augmentation were simply about inflating data, it wouldn't help on ImageNet with 1.2M images. In practice, augmentation works even on datasets with tens of millions of examples, because it encodes world knowledge (an object doesn't change when flipped), not just copies.
Key Takeaways
- **Architecture evolution:** from AlexNet (8 layers, 2012) to ResNet (152 layers, 2015), top-5 error on ImageNet dropped from 16.4% to 3.6% - each generation brought a key innovation: ReLU, small filters, Inception modules, skip connections
- **Skip connections:** residual learning F(x) + x solved the degradation problem - gradients flow through the shortcut directly to early layers, and it's easier for the network to learn a zero residual than an identity mapping through convolutions
- **Compound scaling:** EfficientNet scales width, depth, and resolution simultaneously with a single coefficient phi; EfficientNet-B0 beats ResNet-50 accuracy with about 5x fewer parameters - balanced scaling is more efficient than one-sided scaling
- **Data augmentation as regularization:** not simply data inflation, but training on invariances. Together with depth and scaling, it is one of the three ideas from the lesson intro that in 8 years turned a PhD team task into a $0.001-per-image API
Related Topics
Image classification is the foundation of computer vision, connecting convolutional networks to detection, segmentation, and transfer learning:
- Convolutional Neural Networks (CNN) — The basic building block of all image classification architectures - convolutional layers, pooling, feature maps. AlexNet, VGG, ResNet, and EfficientNet are all built from the same primitives, combined in different ways
- Transfer Learning — ImageNet-pretrained architectures (ResNet, EfficientNet) are used as feature extractors for new tasks. Fine-tuning the final layers allows high accuracy on small datasets in minutes rather than hours of training from scratch
- Object Detection — Classification architectures (ResNet, EfficientNet) serve as backbones in object detectors (Faster R-CNN, YOLO, EfficientDet). The backbone extracts features; the detection head locates and classifies objects in the image
- Image Segmentation — The encoder in segmentation architectures (U-Net, DeepLab) is a classification backbone. Classification models learn to extract semantic features that segmentation uses for per-pixel labeling
Questions for Reflection
- ResNet solved the degradation problem through skip connections, enabling networks of 100+ layers. But why did the problem exist in the first place - after all, a deep network theoretically contains all solutions of a shallow one? What does this tell us about the loss landscape?
- EfficientNet-B0 achieves better accuracy than ResNet-50 with 5x fewer parameters. Does this mean architecture matters more than parameter count? When are more parameters still necessary?
- Data augmentation teaches invariances (flip, rotation, color). But some tasks require sensitivity to those very properties - for example, distinguishing the letters b and d (mirror images). How do you resolve the tension between invariance and discrimination?
Related Lessons
- ml-29-cnn — Classification is built on CNN backbones
- ml-41-transfer-learning — Fine-tuning beats training from scratch
- ml-39-object-detection — Classification backbones feed detectors
- ml-05-evaluation — Accuracy and confusion matrix evaluate models
- stat-05-hypothesis — Decision threshold mirrors hypothesis testing
- la-06-transformations