
Convolutional Neural Networks (CNN)

In 2012, a neural network crushed classical algorithms at image recognition - by a 10 percentage-point margin that shocked the entire community. And the whole network was built on three simple ideas: a sliding filter (convolution), picking the max from a window (pooling), and stacking simple patterns into complex ones (hierarchy). The architecture was so simple a student could implement it over a weekend. Today those same three ideas power systems that drive cars, diagnose X-rays, and generate images.

**Self-driving cars (Tesla, Waymo)** - CNNs recognize pedestrians, signs, lane markings, and other vehicles in real time from camera feeds, processing dozens of frames per second
**Medical diagnostics** - CNNs analyze X-rays, MRI, and CT scans and, on some narrow tasks, approach the accuracy of experienced radiologists, helping detect cancer, pneumonia, and fractures at early stages
**Face recognition (Face ID)** - CNNs extract unique facial features for user authentication, working even with changing lighting, angles, and partial occlusion

Prerequisites

Optimizers: SGD, Adam, RMSProp

From the Neocognitron to the ImageNet breakthrough

The convolutional idea was born in 1980, when Kunihiko Fukushima built the Neocognitron, a layered network inspired by the visual cortex that learned to recognize shapes regardless of their position. In 1989 Yann LeCun added backpropagation and trained a network to read handwritten digits; by 1998 his LeNet-5 was reading checks and ZIP codes for AT&T, the first CNN to earn its keep in production. CNNs then stayed a niche tool for over a decade until 2012, when Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton entered AlexNet in the ImageNet competition. It cut the error rate by roughly ten points over the best classical methods, and computer vision switched to deep learning almost overnight.

The convolution operation

Suppose you're searching for a specific pattern in a large image - say, a horizontal line. Instead of comparing every pixel with every other pixel, you take a small **window (kernel/filter)** of size 3x3 and slide it across the entire image. At each position you multiply the pixel values by the filter weights and sum them up - the result is a single number that shows how much the current region resembles the target pattern. This is the **convolution** operation. Each filter learns to recognize one specific pattern: a horizontal line, a vertical edge, a diagonal, or a color gradient.
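The sliding-window computation can be sketched in a few lines of plain Python. This is a minimal illustration, not how a real framework does it: the function name `conv2d_valid`, the 5x5 toy image, and the hand-written `horizontal_line` filter are all assumptions made for the example (in a trained CNN the filter weights are learned). Strictly speaking, what deep learning calls convolution is a cross-correlation, since the kernel is not flipped.

```python
def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply each pixel in the window by the matching weight and sum.
            total = 0
            for di in range(kh):
                for dj in range(kw):
                    total += image[i + di][j + dj] * kernel[di][dj]
            row.append(total)
        output.append(row)
    return output


# Toy 5x5 image with a bright horizontal line in the middle row.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]

# Hand-written filter that responds to a horizontal line (hypothetical weights).
horizontal_line = [
    [-1, -1, -1],
    [2, 2, 2],
    [-1, -1, -1],
]

response = conv2d_valid(image, horizontal_line)
for row in response:
    print(row)
# [-3, -3, -3]
# [6, 6, 6]
# [-3, -3, -3]
```

The response peaks (6) exactly where the window is centered on the line and turns negative where the line falls on the filter's negative rows.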

Why is convolution better than a fully connected layer for images? Three reasons. **Locality:** each neuron looks only at a small area (3x3, 5x5), not the whole image. This makes sense - edges and textures are local patterns. **Parameter sharing:** the same filter is applied across the entire image. If a filter learns to find a vertical line in the top-left corner, it will find it in the bottom-right as well. **Translation equivariance:** because the filter is shared, a shifted object produces the same response, just at a shifted position in the feature map. Combined with pooling, this makes recognition largely independent of where the object is in the image.
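A small 1D sketch of what parameter sharing buys: the same filter, applied to a shifted signal, gives the same response at a shifted position. The helper `correlate1d`, the two-tap `edge_filter`, and the toy signals are assumptions for illustration only.

```python
def correlate1d(signal, kernel):
    """Apply the same kernel at every position of a 1D signal (no padding, stride 1)."""
    k = len(kernel)
    return [
        sum(signal[i + d] * kernel[d] for d in range(k))
        for i in range(len(signal) - k + 1)
    ]


edge_filter = [-1, 1]  # responds to a step from dark (0) to bright (1)

signal = [0, 0, 1, 1, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 1, 0]  # same pattern moved 2 positions to the right

out = correlate1d(signal, edge_filter)
out_shifted = correlate1d(shifted, edge_filter)

print(out)          # [0, 1, 0, 0, -1, 0, 0]
print(out_shifted)  # [0, 0, 0, 1, 0, 0, -1]
```

The response pattern is identical, only moved by the same 2 positions. A fully connected layer would need separate weights to recognize the pattern at each location.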

**Stride and Padding:**

**Stride** - the step with which the kernel moves across the image:
- Stride 1: kernel moves 1 pixel at a time - output is almost the same size as the input
- Stride 2: kernel moves 2 pixels at a time - output is about 2x smaller along each side

**Padding** - adding zero pixels around the border:
- 'valid' (no padding): output is smaller than input, edge pixels are less represented
- 'same' (with padding): output is the same size as input (at stride 1)

Output size formula: output_size = floor((input_size - kernel_size + 2 * padding) / stride) + 1

Example: input 32x32, kernel 3x3, stride 1, padding 1: (32 - 3 + 2*1) / 1 + 1 = 32 - size is preserved!
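The output size formula is easy to check in code. This is a small sketch; the function name `conv_output_size` is an assumption, and the integer division reflects the usual convention of rounding down when the kernel does not fit an exact number of times.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (input_size - kernel_size + 2 * padding) // stride + 1


print(conv_output_size(32, 3, stride=1, padding=1))  # 32: size is preserved
print(conv_output_size(32, 3, stride=1, padding=0))  # 30: 'valid' shrinks the output
print(conv_output_size(32, 3, stride=2, padding=1))  # 16: stride 2 halves the size
```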

**Parameter count:** this is the power of CNNs. A fully connected layer for a 224x224x3 image (150,528 inputs) with 1,000 neurons = about 150 million parameters. A convolutional layer with 64 filters of 3x3 on 3 channels = 64 * (3*3*3 + 1) = **1,792 parameters**. A difference of roughly 84,000x! Fewer parameters = less data needed for training, less overfitting, faster computation.
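The arithmetic above can be reproduced with two small helpers. The function names `dense_params` and `conv2d_params` are assumptions for this sketch; both count one bias per neuron or per filter.

```python
def dense_params(n_inputs, n_neurons):
    """Weights plus one bias per neuron in a fully connected layer."""
    return n_inputs * n_neurons + n_neurons


def conv2d_params(in_channels, out_channels, kernel_size):
    """Each filter has kernel_size * kernel_size * in_channels weights plus one bias."""
    return out_channels * (kernel_size * kernel_size * in_channels + 1)


fc = dense_params(224 * 224 * 3, 1000)
conv = conv2d_params(in_channels=3, out_channels=64, kernel_size=3)
ratio = fc / conv

print(fc)    # 150529000
print(conv)  # 1792
print(f"{ratio:,.0f}x fewer parameters in the convolutional layer")
```

Note that the convolutional layer's parameter count does not depend on the image size at all, only on the kernel size and the number of channels.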

An image 32x32x3 passes through a Conv2D layer with 16 filters of size 5x5, stride=1, padding='same'. How many trainable parameters are in this layer?

Pooling and dimensionality reduction

After convolution we get a feature map where each value shows how strongly a certain pattern is present in a given area. But the feature map is still large - if the input is 224x224, with padding='same' the output is also 224x224. **Pooling** reduces the spatial size while retaining the most important information. The most popular variant is **Max Pooling**: we take a window (usually 2x2) and keep only the maximum of the four values. The intuition is simple: if a filter found an edge in one of four neighboring pixels - what matters is *that* it found it, not *exactly where* down to the pixel.

**Average Pooling** is an alternative to Max Pooling that takes the mean value in the window instead of the maximum. Average Pooling preserves the overall intensity of a region, while Max Pooling captures the most prominent feature. In practice, Max Pooling is the more common choice in intermediate layers, and **Global Average Pooling** (mean over the entire feature map, one number per channel) is often used at the end of the network before classification - it replaces the large flatten-plus-fully-connected stage, leaving only a small classifier layer, and drastically reduces the number of parameters.
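The three pooling variants can be compared on one small feature map. This is a minimal sketch in plain Python; the names `pool2d` and `global_average_pool` and the 4x4 values are assumptions chosen for the example.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over size x size windows of a 2D feature map."""
    reduce = max if mode == "max" else (lambda window: sum(window) / len(window))
    out = []
    for i in range(0, len(feature_map) - size + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, stride):
            window = [
                feature_map[i + di][j + dj]
                for di in range(size)
                for dj in range(size)
            ]
            row.append(reduce(window))
        out.append(row)
    return out


def global_average_pool(feature_map):
    """Collapse a whole feature map (one channel) into a single number."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


fmap = [
    [1, 3, 2, 0],
    [5, 2, 1, 1],
    [0, 0, 4, 8],
    [2, 2, 6, 2],
]

max_pooled = pool2d(fmap, mode="max")
avg_pooled = pool2d(fmap, mode="avg")
gap = global_average_pool(fmap)

print(max_pooled)  # [[5, 2], [2, 8]]
print(avg_pooled)  # [[2.75, 1.0], [1.0, 5.0]]
print(gap)         # 2.4375
```

None of these functions has weights: pooling is a fixed rule, which is why it adds nothing to the parameter count.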

**Why pooling is needed - three reasons:**
1. **Dimensionality reduction:** a 224x224 feature map becomes 112x112 after 2x2 pooling. That's 4x less data for the next layer - faster training and inference.
2. **Reduced computation:** fewer pixels = fewer multiply operations in the next convolutional layer.
3. **Minor translational invariance:** if a pattern shifts by 1 pixel, max pooling still returns the same value. The model becomes slightly less sensitive to the exact position of an object.
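Why the invariance is only "minor" shows up in a tiny 1D experiment. The helper `max_pool1d` and the toy activations are assumptions for illustration: a shift that stays inside the same pooling window leaves the output unchanged, while a shift that crosses a window boundary does not.

```python
def max_pool1d(values, size=2):
    """Non-overlapping 1D max pooling (stride equals window size)."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


a = [0, 9, 0, 0, 0, 0]  # strong activation at index 1
b = [9, 0, 0, 0, 0, 0]  # shifted 1 to the left, still inside the first window
c = [0, 0, 9, 0, 0, 0]  # shifted 1 to the right, now in the second window

pooled_a = max_pool1d(a)
pooled_b = max_pool1d(b)
pooled_c = max_pool1d(c)

print(pooled_a)  # [9, 0, 0]
print(pooled_b)  # [9, 0, 0]  same output as `a`
print(pooled_c)  # [0, 9, 0]  output changed
```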

**Modern trend: strided convolutions instead of pooling.** In newer architectures (ResNet, EfficientNet), most downsampling is done by a convolution with stride=2 rather than a separate pooling layer. This allows the network to learn by itself how best to reduce the dimensionality, rather than applying a hardcoded "take the maximum" rule. Empirically the quality difference is usually small, but strided convolutions give the model more flexibility.

A feature map of size 64x64 passes through Max Pooling 2x2 with stride 2. What is the output size and how many trainable parameters are added?

Feature maps and the feature hierarchy

Each filter in a convolutional layer creates one **feature map** - a two-dimensional activation map showing where a certain pattern is present in the image. But one filter only sees one pattern. That's why each layer uses many filters: typically 32, 64, 128, or 256. Each filter creates its own feature map, and together they form a **multi-channel output**. For example, a layer with 64 filters transforms a 224x224x3 image into a 224x224x64 tensor - 64 different "views" of the image.

The most remarkable thing about CNNs is the **feature hierarchy**. Early layers (closer to the input) learn to detect simple patterns: edges, lines, color gradients. Middle layers combine simple patterns into more complex ones: textures, corners, simple shapes. Deep layers (closer to the output) recognize whole objects: eyes, wheels, animal faces. The network builds this hierarchy entirely on its own - it is never told "look for edges"; it discovers that this is useful for the final task.

**Receptive field - what each neuron "sees":** A neuron in the first convolutional layer (kernel 3x3) "sees" a 3x3 pixel area of the original image. A neuron in the second layer (also 3x3) "sees" a 3x3 area on the first layer, which corresponds to a 5x5 area on the original image. The receptive field grows with each layer:
- Layer 1 (3x3): receptive field = 3x3
- Layer 2 (3x3): receptive field = 5x5
- Layer 3 (3x3): receptive field = 7x7
- After pooling 2x2 with stride 2: every later 3x3 layer adds 4 pixels instead of 2, so the growth per layer doubles

A deep neuron "sees" a large portion of the image - that's why it can recognize large objects. This explains why CNNs need depth: many layers = large receptive field = ability to see whole objects.
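The growth of the receptive field follows a simple recurrence: each layer adds (kernel - 1) times the current spacing between neighboring outputs, measured in input pixels, and a strided layer multiplies that spacing. The sketch below assumes this standard recurrence; the function name `receptive_field` and the `(kernel, stride)` layer encoding are choices made for the example.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride). Returns the receptive field after each layer."""
    rf = 1    # a single input pixel sees itself
    jump = 1  # distance in input pixels between neighboring outputs
    sizes = []
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes


three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])
with_pool = receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)])

print(three_convs)  # [3, 5, 7]
print(with_pool)    # [3, 5, 6, 10]
```

In the second case the 2x2 pooling itself only widens the field from 5 to 6, but the 3x3 layer after it adds 4 pixels instead of 2. Downsampling speeds up receptive field growth in all later layers.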

**Filter visualization** is a practical tool for understanding CNNs. If you extract the weights of the first convolutional layer from a trained network (e.g., AlexNet), you will see that the filters have genuinely learned detectors for edges of different orientations, color gradients, and simple textures. This happens *without any manual programming* - the network discovered that these patterns are useful for object recognition. Deeper in the network the visualizations become less interpretable, but methods like Grad-CAM let you see which areas of the image the network "looks at" when making a decision.

A CNN has 3 convolutional layers. The first detects edges, the second textures, the third object parts. Who "told" the first layer to look for edges?

CNN architectures: from LeNet to ResNet

The history of CNNs is a history of increasing depth and clever solutions to the problems that depth creates. The first practical CNNs were trained in the late 1980s and 1990s, but the real revolution came in 2012 when a deep network decisively beat classical methods at the largest image recognition competition. Here are the key milestones.

**LeNet-5 (1998)** - the first successful CNN, created by Yann LeCun. Just 7 layers including pooling (5 of them convolutional or fully connected), used to recognize handwritten digits on checks and postal envelopes. Simple architecture: convolution -> pooling -> convolution -> pooling -> fully connected. It worked, but only on small grayscale images (32x32 input). For 14 years CNNs remained a niche topic, held back by a lack of data and computing power.

**AlexNet (2012)** - a turning point in the history of deep learning. At the ImageNet competition (1000 categories, 1.2 million images), AlexNet achieved a top-5 error rate of 16.4%, while the best classical method was at 26.2%. The roughly 10-point gap was shocking. Key innovations: **ReLU** instead of sigmoid or tanh (faster training), **Dropout** (fighting overfitting), training on **GPU** (10x faster than CPU), **data augmentation** (flips, crops, color jitter). From this moment, deep learning became the dominant approach in computer vision.

**VGG (2014)** showed the power of simplicity: instead of a variety of filter sizes (11x11, 5x5, 3x3 as in AlexNet), VGG uses **only 3x3 convolutions**. Why does it work? Two 3x3 layers have a receptive field of 5x5 but only 2*(3*3) = 18 weights per input-output channel pair instead of 5*5 = 25 for a single 5x5 layer. Three 3x3 layers cover 7x7 with 27 weights instead of 49. More layers = more nonlinearities (ReLU after each) with fewer parameters. VGG-16 has 138M parameters, and VGG achieved a 7.3% top-5 error rate on ImageNet.
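With real channel counts the saving scales accordingly. The sketch below assumes the same number of input and output channels (64 here, an arbitrary choice) and ignores biases; the function names are hypothetical.

```python
def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` inputs and `channels` outputs (no bias)."""
    return kernel_size * kernel_size * channels * channels


def stacked_3x3_weights(n_layers, channels):
    """Weights in a stack of n_layers 3x3 conv layers with constant channel count."""
    return n_layers * conv_weights(3, channels)


C = 64  # assumed channel count for the comparison

two_3x3 = stacked_3x3_weights(2, C)
one_5x5 = conv_weights(5, C)
three_3x3 = stacked_3x3_weights(3, C)
one_7x7 = conv_weights(7, C)

print(two_3x3, one_5x5)    # 73728 102400  (same 5x5 receptive field)
print(three_3x3, one_7x7)  # 110592 200704 (same 7x7 receptive field)
```

The ratios stay at 18/25 and 27/49 regardless of the channel count, since both sides scale with the square of the number of channels.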

**ResNet and skip connections - the key breakthrough:** As network depth increases, a problem arises: beyond a certain depth, quality *drops* even on training data. This is not overfitting - it's an optimization problem: very deep plain networks are hard to train, in part because gradients can vanish or explode on the way back through many layers. **Skip connection (shortcut)** solves this directly:
- Plain network: output = F(x) - the layer must learn everything
- ResNet: output = F(x) + x - the layer only learns the *difference* (residual)

If the optimal transformation is close to the identity (which is often the case in deep networks), the layer only needs to learn F(x) ~ 0 - much easier than learning F(x) ~ x. Result: networks of depth 152 layers became successfully trainable. An ensemble of ResNets achieved a 3.57% top-5 error rate on ImageNet - better than the roughly 5.1% estimated for humans.
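The effect of the shortcut can be seen in a deliberately simplified sketch. Real ResNet blocks use convolutions, batch normalization, and a ReLU after the addition; here F is assumed to be two small dense layers with a ReLU in between, and all names (`linear`, `plain_block`, `residual_block`) are hypothetical. With all weights at zero, the plain block destroys its input, while the residual block passes it through unchanged.

```python
def relu(v):
    return [max(0.0, x) for x in v]


def linear(v, weights, bias):
    """Dense layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]


def plain_block(x, w1, b1, w2, b2):
    """output = F(x), with F = linear -> ReLU -> linear."""
    return linear(relu(linear(x, w1, b1)), w2, b2)


def residual_block(x, w1, b1, w2, b2):
    """output = F(x) + x: the layers only have to learn the residual."""
    f = plain_block(x, w1, b1, w2, b2)
    return [fi + xi for fi, xi in zip(f, x)]


x = [1.0, -2.0, 3.0]
zero_w = [[0.0] * 3 for _ in range(3)]
zero_b = [0.0] * 3

plain_out = plain_block(x, zero_w, zero_b, zero_w, zero_b)
residual_out = residual_block(x, zero_w, zero_b, zero_w, zero_b)

print(plain_out)     # [0.0, 0.0, 0.0]
print(residual_out)  # [1.0, -2.0, 3.0]
```

A residual block starts out close to the identity and only has to learn a correction, so stacking many of them does not make the signal (or its gradient) disappear by default.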

**Misconception:** You need to design a CNN from scratch for every new task, carefully choosing the architecture and training it on your own data

**Reality:** Transfer learning from pretrained models (ResNet, EfficientNet) works better in most practical cases - just replace the final layer and fine-tune on your data

Models pretrained on ImageNet (1.2M images, 1000 classes) have already learned to extract universal features: edges, textures, shapes. These features are useful for almost any visual task. Training from scratch requires a huge dataset and computational resources, while fine-tuning a pretrained model often gives better results even with just 1000 images.

Summary

**Convolution:** a small filter (3x3, 5x5) slides over the image detecting local patterns - parameter sharing drastically reduces the parameter count (thousands vs millions in a fully connected layer)
**Pooling:** reduces the spatial size of the feature map (usually by 2x), cutting computation and adding minor translational invariance - in modern networks often replaced by strided convolution
**Feature hierarchy:** early layers learn to detect edges, middle layers textures and shapes, deep layers whole objects; this hierarchy emerges automatically through backpropagation, without any manual programming
**Architectures:** from LeNet (5 layers) to ResNet (152 layers) - skip connections solved the vanishing gradient problem and enabled very deep networks
**Transfer learning as the default practice:** that same simple 2012 idea - convolution, pooling, hierarchy - lives on today in pretrained models that can be fine-tuned for many tasks in minutes to hours, without designing a CNN from scratch

Questions for reflection

Why do CNNs with their local filters and parameter sharing work so well for images, but don't suit tabular data (e.g., a table of customer features)? What inductive bias is built into the CNN architecture?
Skip connections in ResNet enabled training networks of 152 layers. Can you keep increasing depth indefinitely and always get better results? Where is the limit, and what determines it?
If a CNN builds a feature hierarchy (edges -> textures -> objects) on its own, why does transfer learning work - don't different tasks require different features?

Related lessons

ml-28-optimizers — CNNs are trained with these optimizers
ml-38-image-classification — CNNs power image classification pipelines
ml-41-transfer-learning — Pretrained CNNs enable transfer learning
ml-31-transformers — Both learn hierarchical features, different inductive bias
la-07-matrix-multiply — Convolution reduces to matrix multiplication (im2col)
aie-25-multimodal