Computer Vision

Semantic Segmentation

An autonomous vehicle must understand within 50ms: this is road, this is sidewalk, pedestrian #1 is walking right, pedestrian #2 is standing still. Bounding boxes are too coarse - a per-pixel map is needed. That is exactly what semantic and panoptic segmentation provide.

**Autonomous driving:** Waymo, Tesla FSD - full scene segmented in under 50ms
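**Medical imaging:** U-Net (2015) outperformed the prior best method on the ISBI EM neuronal-structure segmentation challenge (launched at ISBI 2012) and won the ISBI 2015 cell tracking challenge; it is now a standard choice for CT/MRI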
**Satellite imagery:** building, road, and vegetation segmentation for mapping and urban planning

FCN: The First Fully Convolutional Network

In 2015, Jonathan Long, Evan Shelhamer, and Trevor Darrell at UC Berkeley showed that an image classifier could be turned into a per-pixel segmenter by replacing the final FC layers with convolutions. Their Fully Convolutional Network accepted images of any size and was the first model trained end-to-end for segmentation. The same year, Olaf Ronneberger, Philipp Fischer, and Thomas Brox introduced U-Net for biomedical images, and Liang-Chieh Chen's group began the DeepLab line with atrous convolutions. These three 2015 ideas set the direction of segmentation for a decade.

Prerequisites

Convolutional layers and pooling
Backbone networks and feature maps
The IoU metric

FCN: replacing FC layers with convolutions

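An image classifier (AlexNet, VGG) outputs one vector for the entire image. Segmentation requires a label per pixel. Long et al. (2015) proposed a simple fix: reinterpret the final FC layers as convolutions (the first with a kernel covering its whole input, e.g. 7x7 for VGG's fc6; the rest as 1x1) - the network then accepts any image size and outputs a coarse class map, which is upsampled back to the input resolution.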
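
The 1x1 case can be illustrated with a minimal pure-Python sketch: the same weight matrix an FC layer would apply to a single feature vector is applied independently at every spatial position, so the input size no longer matters. The weights `W`, bias `B`, and the toy feature maps are made up for illustration; a real network would use a tensor library.

```python
def conv1x1(feature_map, weights, bias):
    """Apply a 1x1 convolution to a feature map.

    feature_map: H x W x C_in nested lists
    weights:     C_out x C_in (the same matrix an FC layer would use)
    bias:        C_out
    Returns an H x W x C_out map of class scores.
    """
    return [
        [
            [
                sum(w * x for w, x in zip(row_w, pixel)) + b
                for row_w, b in zip(weights, bias)
            ]
            for pixel in row
        ]
        for row in feature_map
    ]


# Hypothetical classifier head: 3 input channels -> 2 classes.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
B = [0.0, 1.0]

small = [[[1.0, 2.0, 3.0]]]                      # 1 x 1 x 3: the classic FC case
large = [[[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]],
         [[2.0, 2.0, 2.0], [3.0, 0.0, 0.0]]]     # 2 x 2 x 3: any size works

scores_small = conv1x1(small, W, B)
scores_large = conv1x1(large, W, B)

# Per-pixel argmax turns the score map into a class map.
class_map = [
    [max(range(len(p)), key=p.__getitem__) for p in row]
    for row in scores_large
]
```

On the 1 x 1 input the result is exactly what the FC layer would produce; on the larger input the same weights yield one score vector per position.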

**Transposed convolution** (deconvolution) is learned upsampling: it inserts zeros between input values (stride), then applies a standard convolution. This restores spatial resolution in a trainable way.
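
As a rough 1-D illustration of this description, the sketch below computes a transposed convolution two ways: directly (each input value scatters a scaled copy of the kernel into the output) and via zero insertion followed by an ordinary full convolution. The function names and the toy kernel are assumptions made for this example; the kernel values are fixed here, whereas in a network they would be learned.

```python
def transposed_conv1d(x, kernel, stride):
    """Direct definition: every input scatters a scaled kernel into the output."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, xi in enumerate(x):
        for j, kj in enumerate(kernel):
            out[i * stride + j] += xi * kj
    return out


def zero_insert_then_conv(x, kernel, stride):
    """Same result: insert (stride - 1) zeros between inputs, then convolve."""
    k = len(kernel)
    up = []
    for i, xi in enumerate(x):
        up.append(xi)
        if i < len(x) - 1:
            up.extend([0.0] * (stride - 1))
    # Full convolution = correlation of the zero-padded signal with the flipped kernel.
    padded = [0.0] * (k - 1) + up + [0.0] * (k - 1)
    flipped = kernel[::-1]
    return [
        sum(padded[n + j] * flipped[j] for j in range(k))
        for n in range(len(padded) - k + 1)
    ]


x = [1.0, 2.0, 3.0]
kernel = [0.5, 1.0, 0.5]   # toy kernel that happens to interpolate linearly

direct = transposed_conv1d(x, kernel, stride=2)
via_zeros = zero_insert_then_conv(x, kernel, stride=2)
```

With stride 2 the three inputs become seven outputs, and the interior values fall halfway between neighbouring inputs, which is why a kernel like this behaves as linear interpolation.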

**mIoU (mean Intersection over Union):** for each class, IoU = |intersection| / |union| of predicted and ground-truth masks. The metric is averaged across all classes.
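
A minimal sketch of the metric on small label maps follows. It assumes that classes absent from both prediction and ground truth are skipped rather than counted as 0 or 1; benchmarks differ on this convention, and they usually accumulate counts over a whole dataset rather than one image.

```python
def mean_iou(pred, target, num_classes):
    """Return (mIoU, per-class IoU list) for two H x W label maps."""
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for p_row, t_row in zip(pred, target):
            for p, t in zip(p_row, t_row):
                in_pred, in_target = p == c, t == c
                inter += in_pred and in_target
                union += in_pred or in_target
        if union > 0:              # assumption: skip classes that never appear
            ious.append(inter / union)
    return sum(ious) / len(ious), ious


pred = [[0, 0, 1],
        [0, 1, 1],
        [2, 2, 1]]
target = [[0, 0, 1],
          [0, 0, 1],
          [2, 2, 2]]

miou, per_class = mean_iou(pred, target, num_classes=3)
# per-class IoU: 3/4, 2/4, 2/3
```

Because every class contributes equally to the mean, a small class that is segmented badly pulls mIoU down as much as a large one would.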

Why does FCN replace FC layers with convolutions?

U-Net: skip connections for sharp boundaries

FCN loses fine detail: after 32x downsampling, a 2-pixel tumor boundary vanishes. **U-Net** (Ronneberger et al., 2015) addresses this with a symmetric architecture: an encoder compresses, a decoder restores, and skip connections pass exact spatial information directly across.

Skip connections in U-Net use concatenation (not addition as in ResNet). The decoder at each level receives: upsampled features (deep semantics) + encoder features at the same level (exact boundary locations). Together this allows segmenting structures as thin as 1-2 pixels.
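
The difference between the two kinds of connection shows up in the channel count, as in the toy 1-D sketch below. It uses nearest-neighbour upsampling as a stand-in for U-Net's learned up-convolution, and all function names and values are hypothetical.

```python
def upsample2x(channels):
    """Nearest-neighbour 2x upsampling of each 1-D channel (a simplification)."""
    return [[v for v in ch for _ in range(2)] for ch in channels]


def unet_skip(decoder_feats, encoder_feats):
    """U-Net style: stack channels, so both sources stay separately visible."""
    return upsample2x(decoder_feats) + encoder_feats


def residual_skip(x, fx):
    """ResNet style: element-wise sum, so the channel count is unchanged."""
    return [[a + b for a, b in zip(cx, cf)] for cx, cf in zip(x, fx)]


deep = [[1.0, 2.0]]                      # 1 channel, length 2 (coarse semantics)
enc = [[0.0, 9.0, 9.0, 0.0],             # 2 channels, length 4 (fine detail)
       [5.0, 5.0, 5.0, 5.0]]

merged = unet_skip(deep, enc)            # 1 + 2 = 3 channels, length 4
summed = residual_skip([[1.0, 2.0]], [[10.0, 20.0]])
```

After concatenation the following convolution can weigh coarse and fine channels separately; after addition the two signals are already mixed.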

**Medical origin:** segmentation of neuronal structures in electron microscopy stacks and of cells in light microscopy images (the original U-Net tasks)
**Small data:** U-Net works well with only a few dozen annotated images (30 in the original EM experiment), thanks to heavy elastic deformation augmentation
**Legacy:** U-Net became the standard in medical imaging, and a U-Net is the denoising network in Stable Diffusion

How do skip connections in U-Net differ from residual connections in ResNet?

DeepLab: dilated convolutions and ASPP

Max pooling reduces resolution and loses detail. DeepLab (Chen et al., 2015-2018) offers an alternative: **atrous (dilated) convolution** - a convolution whose kernel taps are spaced apart by a dilation rate, which expands the receptive field without reducing resolution or adding parameters.
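
A 1-D sketch makes the effect concrete. It assumes an odd kernel size and zero padding so that the output keeps the input length; `dilated_conv1d` and `receptive_field` are names invented for this example.

```python
def dilated_conv1d(x, kernel, rate=1):
    """'Same'-padded 1-D correlation with taps spaced `rate` samples apart."""
    k = len(kernel)
    pad = rate * (k - 1) // 2                 # assumes an odd kernel size
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[n + j * rate] for j in range(k))
        for n in range(len(x))
    ]


def receptive_field(kernel_size, rate):
    """Input span covered by one output value."""
    return (kernel_size - 1) * rate + 1


impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]

dense = dilated_conv1d(impulse, kernel, rate=1)
dilated = dilated_conv1d(impulse, kernel, rate=3)
```

Both outputs have the same length as the input, and both use three weights, but `receptive_field(3, 1)` is 3 while `receptive_field(3, 3)` is 7: the impulse is seen from three positions further away.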

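**DeepLabV3+ (2018)** adds a U-Net-style decoder: an encoder ending in ASPP (Atrous Spatial Pyramid Pooling: parallel atrous convolutions at several dilation rates whose outputs are concatenated) + a lightweight decoder with skip connections. This combines the benefits of dilated convolutions (large context without resolution loss) and U-Net (sharp boundaries via skip connections).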
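
The parallel-branch idea behind ASPP can be sketched in 1-D as follows. This is a simplification under stated assumptions: every branch shares one toy kernel (real branches have their own learned weights), the rates `(1, 2, 4)` are chosen to fit a 9-sample signal, and the fusion step is a plain sum standing in for concatenation followed by a 1x1 convolution.

```python
def dilated_conv1d(x, kernel, rate):
    """'Same'-padded 1-D correlation with taps spaced `rate` samples apart."""
    k = len(kernel)
    pad = rate * (k - 1) // 2                 # assumes an odd kernel size
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [
        sum(kernel[j] * padded[n + j * rate] for j in range(k))
        for n in range(len(x))
    ]


def aspp_1d(x, kernel, rates=(1, 2, 4)):
    """Run the same signal through parallel dilated branches and stack them."""
    return [dilated_conv1d(x, kernel, r) for r in rates]


impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
branches = aspp_1d(impulse, [1.0, 1.0, 1.0])

# Stand-in for "concatenate + 1x1 conv": sum the branches position by position.
fused = [sum(vals) for vals in zip(*branches)]
```

Each branch responds to the impulse at a different distance, so the fused output at any position mixes context from several scales while keeping full resolution.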

| Version | Key idea | VOC 2012 mIoU |
|---|---|---|
| DeepLabV1 (2015) | Dilated conv + CRF post-processing | 71.6% |
| DeepLabV2 (2016) | ASPP (multi-scale) | 79.7% |
| DeepLabV3 (2017) | Improved ASPP + BN | 85.7% |
| DeepLabV3+ (2018) | Encoder-decoder + Xception backbone | 89.0% |

**CRF (Conditional Random Field):** a post-processing step in DeepLabV1 and V2 that refined boundaries using pairwise pixel energies. It was dropped from DeepLabV3 onward; in DeepLabV3+ the decoder with skip connections recovers boundary detail instead.

What advantage do dilated (atrous) convolutions offer over maxpooling in segmentation?

Panoptic Segmentation: things and stuff unified

Semantic segmentation labels each pixel with a class but does not distinguish instances: all cars are just 'car'. Instance segmentation (Mask R-CNN) distinguishes instances but ignores amorphous regions (sky, road). **Panoptic segmentation** unifies both tasks.

**Panoptic FPN** (Kirillov et al., 2019): Mask R-CNN for the instance branch + a semantic head over FPN for stuff classes. Both share the FPN backbone - a single forward pass yields both instances and stuff.
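
How the two outputs combine into one panoptic map can be sketched as below. This is a simplified merge, not the exact Panoptic FPN post-processing: it assumes instances arrive sorted by confidence, lets earlier instances win overlapping pixels, and gives every remaining pixel its semantic class with instance ID 0. The class IDs and masks are invented for the example.

```python
def merge_panoptic(semantic, instances):
    """Combine a semantic map with instance masks.

    semantic:  H x W list of class ids
    instances: list of (class_id, set of (row, col) pixels), best first
    Returns an H x W map of (class_id, instance_id); instance_id 0 means stuff.
    """
    panoptic = [[(cls, 0) for cls in row] for row in semantic]
    claimed = set()
    for inst_id, (cls, pixels) in enumerate(instances, start=1):
        for rc in pixels:
            if rc in claimed:          # an earlier instance already owns it
                continue
            claimed.add(rc)
            r, c = rc
            panoptic[r][c] = (cls, inst_id)
    return panoptic


ROAD, PERSON = 0, 1                    # hypothetical class ids
semantic = [[PERSON, PERSON, PERSON],
            [ROAD, ROAD, ROAD]]
instances = [
    (PERSON, {(0, 0), (0, 1)}),        # pedestrian #1
    (PERSON, {(0, 1), (0, 2)}),        # pedestrian #2, overlaps at (0, 1)
]

panoptic = merge_panoptic(semantic, instances)
```

Every pixel ends up with exactly one `(class, instance)` pair: the two pedestrians get distinct IDs, and the road keeps instance ID 0.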

**Autonomous driving:** driving stacks such as Tesla FSD need panoptic-style output - pedestrian #1 and pedestrian #2 must be tracked separately, but the road has no instance ID
**Robotics:** manipulators need to know both what (class) and which specific (instance) object to grasp
**Medical:** distinguish individual cell instances (instance) from tissue background (semantic)

How does panoptic segmentation differ from semantic segmentation?

Evolution of semantic segmentation

**FCN (2015):** FC layers recast as convolutions + transposed conv upsampling; multi-scale via skip connections
**U-Net (2015):** symmetric encoder-decoder + concatenation skip connections for sharp boundaries
**DeepLabV3+ (2018):** dilated conv + ASPP for large context without resolution loss, 89.0% mIoU on VOC 2012
**Panoptic (2019):** things as instances + stuff as semantic classes; metric PQ = SQ x RQ
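
The PQ metric in the last item can be sketched for a single class as follows. Segments are represented as sets of flat pixel indices (an assumption made for brevity), a predicted and a ground-truth segment match when their IoU exceeds 0.5, SQ is the mean IoU of matched pairs, and RQ = TP / (TP + 0.5 FP + 0.5 FN). Full evaluation computes this per class and averages.

```python
def iou(a, b):
    """IoU of two non-empty pixel sets."""
    return len(a & b) / len(a | b)


def panoptic_quality(pred_segments, gt_segments):
    """Return (PQ, SQ, RQ) for one class."""
    matched_ious = []
    used_pred = set()
    for g in gt_segments:
        for i, p in enumerate(pred_segments):
            # With non-overlapping segments, IoU > 0.5 makes the match unique.
            if i not in used_pred and iou(p, g) > 0.5:
                matched_ious.append(iou(p, g))
                used_pred.add(i)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    if tp == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp               # segmentation quality
    rq = tp / (tp + 0.5 * fp + 0.5 * fn)      # recognition quality
    return sq * rq, sq, rq


gt = [{0, 1, 2, 3}, {10, 11}]
pred = [{0, 1, 2}, {20, 21}]

pq, sq, rq = panoptic_quality(pred, gt)
# one match with IoU 3/4, one false positive, one false negative
```

Here SQ is 0.75 and RQ is 0.5, so PQ is 0.375: the single matched segment is fairly accurate, but half of the recognition score is lost to the missed and the spurious segment.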

Questions for reflection

Why are skip connections more critical for segmentation than for image classification?
Which type of segmentation (semantic, instance, panoptic) is needed for an automated parking system, and why?
How do dilated convolutions at different rates in ASPP compensate for not using multi-scale image pyramids at inference time?

Related lessons

cv-07 — Detection backbones and FPN are reused for dense prediction
cv-09 — Instance segmentation adds per-object masks on top
dl-04 — Encoder-decoder CNNs underlie U-Net and FCN
ml-40-segmentation — Same pixel-labeling task in the classical ML curriculum
alg-12-bfs — Connected-component labeling of masks is graph traversal
ml-01

