Computer Vision
Semantic Segmentation
An autonomous vehicle must understand within 50ms: this is road, this is sidewalk, pedestrian #1 is walking right, pedestrian #2 is standing still. Bounding boxes are too coarse - a per-pixel map is needed. That is exactly what semantic and panoptic segmentation provide.
- **Autonomous driving:** Waymo, Tesla FSD - full scene segmented in under 50ms
- **Medical imaging:** U-Net (2015) set the best score on the ISBI EM neuron segmentation challenge (launched in 2012); now the standard for CT/MRI
- **Satellite imagery:** building, road, and vegetation segmentation for mapping and urban planning
FCN: The First Fully Convolutional Network
In 2015, Jonathan Long, Evan Shelhamer, and Trevor Darrell at UC Berkeley showed that an image classifier could be turned into a per-pixel segmenter by replacing the final fully connected (FC) layers with convolutions. Their Fully Convolutional Network accepted images of any size and was the first model trained end-to-end for segmentation. The same year, Olaf Ronneberger, Philipp Fischer, and Thomas Brox introduced U-Net for biomedical images, and Liang-Chieh Chen's group began the DeepLab line with atrous convolutions. These three 2015 ideas set the direction of segmentation for a decade.
Prerequisites
- Convolutional layers and pooling
- Backbone networks and feature maps
- The IoU metric
FCN: replacing FC layers with convolutions
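An image classifier (AlexNet, VGG) outputs one vector for the entire image. Segmentation requires a label per pixel. Long et al. (2015) proposed a simple fix: replace the final FC layers with equivalent convolutions (1x1 for all but the first, which becomes a convolution the size of its input feature map, 7x7 in VGG16). The network then accepts any image size and outputs a coarse map of class scores, one score vector per spatial location.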
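The equivalence between an FC layer and a 1x1 convolution can be sketched in plain Python. This is a toy illustration rather than the FCN code: the weights, the two-channel feature map, and the helper names (`fc`, `conv1x1`, `argmax`) are all hypothetical. A 1x1 convolution is the same FC weight matrix applied independently at every spatial position, so the layer does not care how large the feature map is.

```python
def fc(features, weights, bias):
    """Fully connected layer: one score per class from one feature vector."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def conv1x1(feature_map, weights, bias):
    """1x1 convolution: apply the same FC weights at every spatial position.

    feature_map is H x W x C (nested lists); the result is H x W x num_classes.
    """
    return [[fc(pixel, weights, bias) for pixel in row] for row in feature_map]

def argmax(scores):
    """Index of the highest score."""
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical classifier head: 2 input channels -> 3 classes.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
BIAS = [0.0, 0.0, 0.5]

# The same weights work on a 1x1 map (plain classification) ...
single = conv1x1([[[2.0, 1.0]]], WEIGHTS, BIAS)

# ... and on a 2x3 map, with no change to the layer.
feature_map = [
    [[2.0, 1.0], [0.0, 3.0], [-1.0, -1.0]],
    [[1.0, 0.0], [0.5, 2.0], [-2.0, 0.0]],
]
score_map = conv1x1(feature_map, WEIGHTS, BIAS)
class_map = [[argmax(scores) for scores in row] for row in score_map]
print(class_map)  # [[0, 1, 2], [0, 1, 2]]
```

In a real network the class map produced this way is still at the reduced resolution of the backbone, which is why an upsampling step follows.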
**Transposed convolution** (often loosely called deconvolution) is learned upsampling: for stride s it inserts s - 1 zeros between neighboring input values, then applies a standard convolution with learned weights. This restores spatial resolution in a trainable way.
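A one-dimensional sketch makes the zero-insertion view concrete. The function names are illustrative, and the kernel is fixed here rather than learned; the triangular kernel `[0.5, 1.0, 0.5]` is chosen as an assumption because it makes the layer behave like linear interpolation, which is easy to check by eye.

```python
def insert_zeros(signal, stride):
    """Place (stride - 1) zeros between neighboring input values."""
    out = []
    for i, value in enumerate(signal):
        if i > 0:
            out.extend([0.0] * (stride - 1))
        out.append(value)
    return out

def conv1d_full(signal, kernel):
    """Standard 'full' 1D convolution: output length is n + k - 1."""
    out = [0.0] * (len(signal) + len(kernel) - 1)
    for i, x in enumerate(signal):
        for j, w in enumerate(kernel):
            out[i + j] += x * w
    return out

def transposed_conv1d(signal, kernel, stride):
    """Transposed convolution = zero insertion followed by a convolution."""
    return conv1d_full(insert_zeros(signal, stride), kernel)

low_res = [1.0, 2.0, 3.0]
kernel = [0.5, 1.0, 0.5]  # in a network these weights would be learned
high_res = transposed_conv1d(low_res, kernel, stride=2)
print(high_res)  # [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 1.5]
```

Three input values become seven outputs, following the length rule (n - 1) * stride + k. The interior values 1.0, 1.5, 2.0, 2.5, 3.0 are the inputs with interpolated midpoints; a trained kernel is free to learn something sharper.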
**mIoU (mean Intersection over Union):** for each class, IoU = |intersection| / |union| of predicted and ground-truth masks. The metric is averaged across all classes.
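As a worked example, the metric can be computed on two small label maps. This is a minimal sketch with made-up 3x3 maps; skipping classes that appear in neither map is an assumption of this sketch, and benchmarks differ in how they handle that case.

```python
def mean_iou(pred, target, num_classes):
    """Return (per-class IoU list, mIoU) for two label maps of equal shape.

    Classes absent from both maps get None and are left out of the mean
    (an assumption of this sketch).
    """
    ious = []
    for c in range(num_classes):
        inter = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += int(p == c and t == c)
                union += int(p == c or t == c)
        ious.append(inter / union if union else None)
    valid = [v for v in ious if v is not None]
    return ious, sum(valid) / len(valid)

target = [[0, 0, 1],
          [0, 1, 1],
          [2, 2, 1]]
pred = [[0, 0, 1],
        [0, 0, 1],
        [2, 1, 1]]

per_class, miou = mean_iou(pred, target, num_classes=3)
print(per_class)        # [0.75, 0.6, 0.5]
print(round(miou, 4))   # 0.6167
```

Seven of the nine pixels are correct (pixel accuracy about 0.78), yet mIoU is about 0.62: the small class 2 loses half its IoU from a single wrong pixel, which is why mIoU is preferred when class sizes are unbalanced.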
Why does FCN replace FC layers with 1x1 convolutions?
U-Net: skip connections for sharp boundaries
FCN loses fine detail: after 32x downsampling, a 2-pixel tumor boundary vanishes. **U-Net** (Ronneberger et al., 2015) addresses this with a symmetric architecture: an encoder compresses, a decoder restores, and skip connections pass exact spatial information directly across.
Skip connections in U-Net use concatenation (not addition as in ResNet). The decoder at each level receives: upsampled features (deep semantics) + encoder features at the same level (exact boundary locations). Together this allows segmenting structures as thin as 1-2 pixels.
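The difference between the two kinds of connection shows up directly in the channel count. The sketch below uses nested lists as feature maps (a list of channels, each H x W) and nearest-neighbor upsampling for brevity; both are simplifying assumptions, since the original U-Net uses a learned up-convolution and crops the encoder features.

```python
def upsample2x(channel):
    """Nearest-neighbor 2x upsampling of one H x W channel."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out

def concat_skip(decoder_feats, encoder_feats):
    """U-Net style: stack channels, so both sources stay separate."""
    return decoder_feats + encoder_feats

def add_skip(decoder_feats, encoder_feats):
    """ResNet style: element-wise sum, channel counts must match."""
    return [[[d + e for d, e in zip(d_row, e_row)]
             for d_row, e_row in zip(d_ch, e_ch)]
            for d_ch, e_ch in zip(decoder_feats, encoder_feats)]

# Deep decoder features: 2 channels at 1x2 resolution.
deep = [[[1.0, 2.0]], [[3.0, 4.0]]]
# Encoder features from the matching level: 2 channels at 2x4 resolution.
enc = [[[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]],
       [[5.0, 5.0, 5.0, 5.0], [6.0, 6.0, 6.0, 6.0]]]

up = [upsample2x(ch) for ch in deep]
fused = concat_skip(up, enc)   # 4 channels: 2 semantic + 2 spatial
summed = add_skip(up, enc)     # 2 channels: the sources are blended
print(len(fused), len(summed))  # 4 2
```

After concatenation the encoder's fine pattern (`enc[0]`) survives untouched as its own channel, and the next convolution decides how to weigh it against the upsampled semantics. After addition the two signals are already mixed.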
- **Medical origin:** segmentation of cells in electron microscopy images (the original U-Net task)
- **Small data:** U-Net works well with 30-50 annotated examples through elastic deformation augmentation
- **Legacy:** U-Net became the standard in medical imaging, and a U-Net serves as the denoising network in Stable Diffusion
How do skip connections in U-Net differ from residual connections in ResNet?
DeepLab: dilated convolutions and ASPP
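Max pooling reduces resolution and loses detail. DeepLab (Chen et al., 2015-2018) offers an alternative: **atrous (dilated) convolution** - a convolution with holes between its kernel taps that expands the receptive field without reducing resolution. **ASPP** (Atrous Spatial Pyramid Pooling) runs several such convolutions with different dilation rates in parallel and merges their outputs, so one layer sees context at several scales.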
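A one-dimensional sketch shows both properties: the output keeps the input length, and the span covered by a 3-tap kernel grows with the dilation rate as (k - 1) * rate + 1. The function name and the impulse test signal are illustrative choices, and the sketch assumes an odd kernel length with zero padding.

```python
def dilated_conv1d(signal, kernel, rate):
    """Same-length 1D convolution whose taps are `rate` samples apart.

    Assumes an odd kernel length and zero padding at both ends.
    """
    span = (len(kernel) - 1) * rate + 1   # effective kernel size
    pad = span // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(w * padded[i + j * rate] for j, w in enumerate(kernel))
            for i in range(len(signal))]

impulse = [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
kernel = [1.0, 1.0, 1.0]

for rate in (1, 2, 4):
    span = (len(kernel) - 1) * rate + 1
    print(rate, span, dilated_conv1d(impulse, kernel, rate))
# 1 3 [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
# 2 5 [0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 0.0]
# 4 9 [1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
```

With rate 4 the same three weights reach across all nine positions, and every output still has nine samples. Pooling would have needed to shrink the signal to get the same reach.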
**DeepLabV3+ (2018)** adds a U-Net-style decoder: ASPP encoder + lightweight decoder with skip connections. This combines the benefits of dilated convolutions (large context without resolution loss) and U-Net (sharp boundaries via skip connections).
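The ASPP part of that encoder can be sketched in one dimension: the same input goes through parallel dilated convolutions at several rates, plus a global-average branch, and the results are stacked as channels. This is a simplified illustration, not the DeepLab implementation; the function names, the rates (1, 2, 4), and the shared box kernel are assumptions, and a real ASPP learns separate weights per branch and fuses the stack with a 1x1 convolution.

```python
def dilated_conv1d(signal, kernel, rate):
    """Same-length 1D convolution with taps `rate` samples apart (odd kernel)."""
    pad = (len(kernel) - 1) * rate // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(w * padded[i + j * rate] for j, w in enumerate(kernel))
            for i in range(len(signal))]

def aspp_1d(signal, branches):
    """Apply parallel (kernel, rate) branches plus a global-average branch."""
    outputs = [dilated_conv1d(signal, kernel, rate) for kernel, rate in branches]
    # Image-level branch: one global average, broadcast to every position.
    mean = sum(signal) / len(signal)
    outputs.append([mean] * len(signal))
    return outputs  # one channel per branch, all at input resolution

signal = [0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 0.0, 0.0, 0.0]
box = [1.0, 1.0, 1.0]
features = aspp_1d(signal, [(box, 1), (box, 2), (box, 4)])
for channel in features:
    print(channel)
```

Each position ends up with four values describing its surroundings at increasing scales, from immediate neighbors to the whole signal, without any branch reducing the resolution.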
| Version | Key idea | VOC 2012 test mIoU |
|---|---|---|
| DeepLabV1 (2015) | Dilated conv + CRF post-processing | 71.6% |
| DeepLabV2 (2016) | ASPP (multi-scale) | 79.7% |
| DeepLabV3 (2017) | Improved ASPP + BN | 85.7% |
| DeepLabV3+ (2018) | Encoder-decoder + Xception backbone | 89.0% |
**CRF (Conditional Random Field):** a post-processing step in DeepLabV1 and V2 that refined boundaries using pairwise pixel energies. It was dropped from DeepLabV3 onward; in DeepLabV3+ the decoder with skip connections recovers boundary quality without it.
What advantage do dilated (atrous) convolutions offer over max pooling in segmentation?
Panoptic Segmentation: things and stuff unified
Semantic segmentation labels each pixel with a class but does not distinguish instances: all cars are just 'car'. Instance segmentation (Mask R-CNN) distinguishes instances but ignores amorphous regions (sky, road). **Panoptic segmentation** unifies both tasks.
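The three output formats can be compared on a toy scene. In the sketch below every pixel of the panoptic map carries a (class, instance) pair; the class ids, the `STUFF`/`THINGS` tables, and the convention that stuff pixels get instance id 0 are all assumptions made for illustration, not a benchmark format.

```python
STUFF = {0: "road", 1: "sky"}     # amorphous regions, no instances
THINGS = {2: "car", 3: "person"}  # countable objects

def merge_panoptic(semantic, instance):
    """Combine a class map and an instance-id map into (class, instance) pairs.

    Stuff pixels get instance id 0; thing pixels keep their instance id.
    """
    return [[(c, inst if c in THINGS else 0)
             for c, inst in zip(sem_row, inst_row)]
            for sem_row, inst_row in zip(semantic, instance)]

semantic = [[1, 1, 1, 1],   # sky
            [2, 2, 0, 2],   # car, car, road, car
            [0, 0, 0, 0]]   # road
instance = [[0, 0, 0, 0],
            [1, 1, 0, 2],   # two different cars
            [0, 0, 0, 0]]

panoptic = merge_panoptic(semantic, instance)
segments = {label for row in panoptic for label in row}
classes = {c for row in semantic for c in row}
print(len(classes), len(segments))  # 3 4
print(panoptic[1])  # [(2, 1), (2, 1), (0, 0), (2, 2)]
```

The semantic map sees three classes, while the panoptic map has four segments: sky, road, and two separate cars. Every pixel has exactly one label, with no overlaps and no unlabeled regions.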
**Panoptic FPN** (Kirillov et al., 2019): Mask R-CNN for the instance branch + a semantic head over FPN for stuff classes. Both share the FPN backbone - a single forward pass yields both instances and stuff.
- **Autonomous driving:** Tesla FSD uses panoptic segmentation - pedestrian #1 and pedestrian #2 must be tracked separately, but the road has no instance ID
- **Robotics:** manipulators need to know both what (class) and which specific (instance) object to grasp
- **Medical:** distinguish individual cell instances (instance) from tissue background (semantic)
Panoptic segmentation differs from semantic segmentation in that it:
Evolution of semantic segmentation
- **FCN (2015):** FC → 1x1 conv + transposed conv upsampling; multi-scale via skip connections
- **U-Net (2015):** symmetric encoder-decoder + concatenation skip connections for sharp boundaries
- **DeepLabV3+ (2018):** dilated conv + ASPP for large context without resolution loss, 89% on VOC
- **Panoptic (2019):** things as instances + stuff as semantic classes; metric PQ = SQ x RQ (Panoptic Quality = Segmentation Quality x Recognition Quality)
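The PQ metric from the last item can be worked through on a toy case. A predicted segment matches a ground-truth segment when their IoU exceeds 0.5; SQ is the mean IoU of the matches, and RQ = TP / (TP + 0.5 FP + 0.5 FN). The sketch below handles a single class and assumes segments are non-overlapping sets of pixel coordinates; benchmark PQ additionally averages over classes.

```python
def panoptic_quality(pred_segments, gt_segments):
    """Return (PQ, SQ, RQ) for one class.

    Each segment is a set of (row, col) pixels; segments within one list
    are assumed not to overlap, so an IoU > 0.5 match is unique.
    """
    matched_ious = []
    matched_pred = set()
    for gt in gt_segments:
        for idx, pred in enumerate(pred_segments):
            iou = len(gt & pred) / len(gt | pred)
            if iou > 0.5:
                matched_ious.append(iou)
                matched_pred.add(idx)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - len(matched_pred)  # predictions with no match
    fn = len(gt_segments) - tp                   # ground truth with no match
    denom = tp + 0.5 * fp + 0.5 * fn
    sq = sum(matched_ious) / tp if tp else 0.0
    rq = tp / denom if denom else 0.0
    return sq * rq, sq, rq

gt = [{(0, 0), (0, 1), (0, 2), (0, 3)},   # found, but one pixel short
      {(1, 0), (1, 1)}]                   # missed entirely
pred = [{(0, 0), (0, 1), (0, 2)},         # IoU 3/4 with the first segment
        {(2, 0), (2, 1)}]                 # spurious segment

pq, sq, rq = panoptic_quality(pred, gt)
print(pq, sq, rq)  # 0.375 0.75 0.5
```

One match with IoU 0.75 gives SQ = 0.75; one true positive, one false positive, and one false negative give RQ = 1 / 2 = 0.5, so PQ = 0.375. SQ reflects how well the matched masks fit, and RQ reflects how many segments were found at all.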
Related topics
Segmentation builds on detection and shares backbone networks.
- Two-Stage Detectors: R-CNN Family - Mask R-CNN adds a segmentation head to Faster R-CNN
- Feature Pyramid Networks - FPN is the shared backbone for Panoptic FPN; DeepLabV3+ instead uses its own lightweight decoder
Questions for reflection
- Why are skip connections more critical for segmentation than for image classification?
- Which type of segmentation (semantic, instance, panoptic) is needed for an automated parking system, and why?
- How do dilated convolutions at different rates in ASPP compensate for not using multi-scale image pyramids at inference time?
Related lessons
- cv-07 - Detection backbones and FPN are reused for dense prediction
- cv-09 - Instance segmentation adds per-object masks on top
- dl-04 - Encoder-decoder CNNs underlie U-Net and FCN
- ml-40-segmentation - Same pixel-labeling task in the classical ML curriculum
- alg-12-bfs - Connected-component labeling of masks is graph traversal
- ml-01