Computer Vision

Semantic Segmentation

An autonomous vehicle must understand within 50ms: this is road, this is sidewalk, pedestrian #1 is walking right, pedestrian #2 is standing still. Bounding boxes are too coarse - a per-pixel map is needed. That is exactly what semantic and panoptic segmentation provide.

  • **Autonomous driving:** Waymo, Tesla FSD - full scene segmented in under 50ms
  • **Medical imaging:** U-Net (2015) beat the prior best method on the ISBI EM neuronal-structure segmentation challenge; U-Net variants are now a standard choice for CT/MRI
  • **Satellite imagery:** building, road, and vegetation segmentation for mapping and urban planning

FCN: The First Fully Convolutional Network

In 2015, Jonathan Long, Evan Shelhamer, and Trevor Darrell at UC Berkeley showed that an image classifier could be turned into a per-pixel segmenter by replacing the final FC layers with convolutions. Their Fully Convolutional Network accepted images of any size and, by the authors' account, was the first fully convolutional model trained end-to-end for pixel-wise prediction. The same year, Olaf Ronneberger, Philipp Fischer, and Thomas Brox introduced U-Net for biomedical images, and Liang-Chieh Chen and colleagues began the DeepLab line with atrous convolutions. These three 2015 ideas set the direction of segmentation for a decade.

Prerequisites

  • Convolutional layers and pooling
  • Backbone networks and feature maps
  • The IoU metric
  • Image Classification: CNNs
  • Two-Stage Detectors: the R-CNN Family

FCN: replacing FC layers with convolutions

An image classifier (AlexNet, VGG) outputs one vector for the entire image. Segmentation requires a label per pixel. Long et al. (2015) proposed a simple fix: replace the final FC layers with convolutions (1x1 for all but the first, whose kernel spans its whole input, 7x7 in VGG-16). The network then accepts any image size and outputs a coarse class score map.
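The equivalence between an FC layer and a convolution can be checked on a toy 1D case. The sketch below is dependency-free and purely illustrative: the weights and helper names (`dense`, `conv1d_valid`) are hypothetical, and a real FCN does the same thing in 2D with many channels.

```python
def dense(window, weights, bias=0.0):
    """A fully connected unit: one dot product over a fixed-size input."""
    assert len(window) == len(weights), "an FC layer only accepts its training size"
    return sum(x * w for x, w in zip(window, weights)) + bias


def conv1d_valid(signal, weights, bias=0.0):
    """The same weights slid across the input: one output per valid position."""
    k = len(weights)
    return [dense(signal[i:i + k], weights, bias) for i in range(len(signal) - k + 1)]


weights = [0.5, -1.0, 2.0]  # hypothetical trained FC weights for a size-3 input

# On an input of the training size, the convolution reproduces the FC output.
fc_out = dense([1.0, 2.0, 3.0], weights)
conv_same_size = conv1d_valid([1.0, 2.0, 3.0], weights)

# On a larger input, the same weights yield a map of scores instead of one score.
conv_larger = conv1d_valid([1.0, 2.0, 3.0, 4.0, 5.0], weights)

print(fc_out, conv_same_size, conv_larger)
```

The FC layer would reject the length-5 input outright, while the convolutional form returns three scores, one per position. That is the sense in which the converted network "accepts any image size".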

**Transposed convolution** (often loosely called deconvolution) is learned upsampling: for stride s it inserts s-1 zeros between neighboring input values, then applies a standard convolution. This restores spatial resolution in a trainable way.
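A 1D sketch may make the zero-insertion view concrete. It implements the operation two ways, directly (each input value stamps a scaled copy of the kernel into the output) and via zero insertion followed by a full convolution, and the two agree. The function names are illustrative, and the kernels here are fixed by hand rather than learned.

```python
def transposed_conv1d(x, kernel, stride):
    """Direct definition: each input value stamps a scaled kernel copy into the output."""
    out = [0.0] * ((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        for j, w in enumerate(kernel):
            out[i * stride + j] += value * w
    return out


def zero_insert_then_conv(x, kernel, stride):
    """Equivalent view: insert stride-1 zeros between inputs, then run a full convolution."""
    k = len(kernel)
    stretched = []
    for i, value in enumerate(x):
        stretched.append(value)
        if i < len(x) - 1:
            stretched.extend([0.0] * (stride - 1))
    padded = [0.0] * (k - 1) + stretched + [0.0] * (k - 1)
    flipped = kernel[::-1]  # true convolution flips the kernel
    return [
        sum(padded[m + j] * flipped[j] for j in range(k))
        for m in range(len(padded) - k + 1)
    ]


x = [1.0, 2.0, 3.0]
nearest = transposed_conv1d(x, [1.0, 1.0], stride=2)       # repeats each value twice
linear = transposed_conv1d(x, [0.5, 1.0, 0.5], stride=2)   # interpolates between values

print(nearest)
print(linear)
```

With a `[1, 1]` kernel the layer behaves like nearest-neighbor upsampling, and with `[0.5, 1, 0.5]` like linear interpolation. In a trained network the kernel is a free parameter, so the layer can learn whichever upsampling suits the task.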

**mIoU (mean Intersection over Union):** for each class, IoU = |intersection| / |union| of predicted and ground-truth masks. The metric is averaged across all classes.
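The definition above translates almost line for line into code. This is a minimal sketch on a single pair of tiny label maps; the function name and the choice to skip classes absent from both masks are assumptions of this example (benchmarks typically accumulate intersections and unions over the whole dataset before dividing).

```python
def mean_iou(pred, target, num_classes):
    """Per-class IoU, averaged over the classes that occur in pred or target."""
    ious = {}
    for c in range(num_classes):
        inter = union = 0
        for pred_row, target_row in zip(pred, target):
            for p, t in zip(pred_row, target_row):
                inter += (p == c and t == c)
                union += (p == c or t == c)
        if union:  # assumption: classes absent from both masks are skipped
            ious[c] = inter / union
    return sum(ious.values()) / len(ious), ious


target = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [2, 2, 2, 2]]
pred = [[0, 0, 0, 1],
        [0, 0, 1, 1],
        [2, 2, 1, 2]]

miou, per_class = mean_iou(pred, target, num_classes=3)
print(per_class, miou)
```

Here 10 of 12 pixels are correct, yet the per-class IoUs are 0.8, 0.6, and 0.75: each wrong pixel is penalized in two classes at once, which is why mIoU is stricter than pixel accuracy.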

Why does FCN replace FC layers with 1x1 convolutions?

U-Net: skip connections for sharp boundaries

FCN loses fine detail: after 32x downsampling, a 2-pixel tumor boundary vanishes. **U-Net** (Ronneberger et al., 2015) addresses this with a symmetric architecture: an encoder compresses, a decoder restores, and skip connections pass exact spatial information directly across.

Skip connections in U-Net use concatenation (not addition as in ResNet). The decoder at each level receives: upsampled features (deep semantics) + encoder features at the same level (exact boundary locations). Together this allows segmenting structures as thin as 1-2 pixels.
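The difference between the two kinds of skip connection shows up in the channel count. In this toy sketch a feature map is a list of channels, each a list of rows. Nearest-neighbor upsampling stands in for U-Net's learned up-convolution, and all names and values are illustrative.

```python
def upsample2x(channel):
    """Nearest-neighbor 2x upsampling of one 2D channel (a list of rows)."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def concat_skip(decoder_feats, encoder_feats):
    """U-Net style: stack channels, so both sources stay separately available."""
    return decoder_feats + encoder_feats


def additive_skip(a, b):
    """ResNet style: element-wise sum, so the channel count is unchanged."""
    result = []
    for chan_a, chan_b in zip(a, b):
        result.append([[x + y for x, y in zip(row_a, row_b)]
                       for row_a, row_b in zip(chan_a, chan_b)])
    return result


deep = [[[5]]]                   # 1 channel, 1x1: coarse but semantically rich
encoder = [[[1, 0], [0, 1]]]     # 1 channel, 2x2: fine spatial detail

upsampled = [upsample2x(c) for c in deep]
concat = concat_skip(upsampled, encoder)
added = additive_skip(upsampled, encoder)

print(len(concat), len(added))
```

After concatenation the next convolution sees the encoder channel untouched and can learn how to weigh it against the upsampled one. After addition the two signals are already mixed into a single channel.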

  • **Medical origin:** segmentation of cells in electron microscopy images (the original U-Net task)
  • **Small data:** U-Net works well with 30-50 annotated examples through elastic deformation augmentation
  • **Legacy:** U-Net became the standard in medical imaging and serves as the denoising network in Stable Diffusion

How do skip connections in U-Net differ from residual connections in ResNet?

DeepLab: dilated convolutions and ASPP

Max pooling reduces resolution and loses detail. DeepLab (Chen et al., 2015-2018) offers an alternative: **atrous (dilated) convolution** - a convolution with holes that expands the receptive field without reducing resolution. With dilation rate r, a 3x3 kernel covers a (2r+1)x(2r+1) window while still using only 9 weights.
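A 1D sketch can show both properties at once: the output keeps the input length, and the taps spread apart as the dilation grows. The helper names are illustrative, and like most deep learning frameworks the code computes a cross-correlation (no kernel flip), which makes no difference for the symmetric kernel used here.

```python
def dilated_conv1d(signal, kernel, dilation):
    """'Same'-padded 1D convolution whose taps sit `dilation` samples apart."""
    k = len(kernel)
    span = dilation * (k - 1)  # distance between the first and last tap
    pad = span // 2
    padded = [0.0] * pad + list(signal) + [0.0] * (span - pad)
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked stride-1 convolutions with the given dilations."""
    return 1 + sum(d * (kernel_size - 1) for d in dilations)


impulse = [0.0] * 9
impulse[4] = 1.0
kernel = [1.0, 1.0, 1.0]

dense_out = dilated_conv1d(impulse, kernel, dilation=1)
dilated_out = dilated_conv1d(impulse, kernel, dilation=2)

# Positions that "see" the impulse: adjacent for rate 1, spread out for rate 2.
print([i for i, v in enumerate(dense_out) if v != 0])
print([i for i, v in enumerate(dilated_out) if v != 0])

# Three stacked 3-tap layers: plain versus dilation rates 1, 2, 4.
print(receptive_field(3, [1, 1, 1]), receptive_field(3, [1, 2, 4]))
```

Three plain 3-tap layers see 7 input samples, while the same three layers with rates 1, 2, 4 see 15, with no pooling and no extra weights.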

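**ASPP (Atrous Spatial Pyramid Pooling)** runs several dilated convolutions with different rates in parallel on the same feature map and fuses their outputs, so objects are captured at multiple scales. **DeepLabV3+ (2018)** adds a U-Net-style decoder: ASPP encoder + lightweight decoder with skip connections. This combines the benefits of dilated convolutions (large context without resolution loss) and U-Net (sharp boundaries via skip connections).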

| Version | Key idea | VOC 2012 mIoU |
| --- | --- | --- |
| DeepLabV1 (2015) | Dilated conv + CRF post-processing | 71.6% |
| DeepLabV2 (2016) | ASPP (multi-scale) | 79.7% |
| DeepLabV3 (2017) | Improved ASPP + BN | 85.7% |
| DeepLabV3+ (2018) | Encoder-decoder + Xception backbone | 89.0% |

**CRF (Conditional Random Field):** a post-processing step in DeepLabV1 and V2 that refined boundaries using pairwise pixel energies. Dropped from DeepLabV3 onward: the network reached higher accuracy without it, and the DeepLabV3+ decoder with skip connections recovers boundary detail directly.

What advantage do dilated (atrous) convolutions offer over maxpooling in segmentation?

Panoptic Segmentation: things and stuff unified

Semantic segmentation labels each pixel with a class but does not distinguish instances: all cars are just 'car'. Instance segmentation (Mask R-CNN) distinguishes instances but ignores amorphous regions (sky, road). **Panoptic segmentation** unifies both tasks.

**Panoptic FPN** (Kirillov et al., 2019): Mask R-CNN for the instance branch + a semantic head over FPN for stuff classes. Both share the FPN backbone - a single forward pass yields both instances and stuff.
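The output format is easiest to see in a small example: every pixel gets a (class, instance) pair, where stuff classes carry no instance ID. The sketch below only illustrates that labeling. The class names, the `to_panoptic` helper, and the assumption that the two input maps already agree are all hypothetical; the merging step in Panoptic FPN additionally has to resolve overlaps between the two branches.

```python
STUFF = {"road", "sky"}       # hypothetical label sets for this example
THINGS = {"person", "car"}


def to_panoptic(semantic, instance_ids):
    """Merge a semantic map and an instance-id map into (class, instance) labels."""
    panoptic = []
    for sem_row, inst_row in zip(semantic, instance_ids):
        row = []
        for cls, inst in zip(sem_row, inst_row):
            # Things keep their instance id; stuff pixels share one segment per class.
            row.append((cls, inst) if cls in THINGS else (cls, None))
        panoptic.append(row)
    return panoptic


semantic = [["sky", "sky", "sky", "sky"],
            ["person", "road", "person", "road"]]
instance_ids = [[0, 0, 0, 0],
                [1, 0, 2, 0]]

panoptic = to_panoptic(semantic, instance_ids)
segments = {label for row in panoptic for label in row}
print(sorted(segments, key=str))
```

The two pedestrians come out as separate segments, while both road pixels belong to a single "road" segment even though they are not adjacent.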

  • **Autonomous driving:** pedestrian #1 and pedestrian #2 must be tracked separately, but the road has no instance ID
  • **Robotics:** manipulators need to know both what (class) and which specific (instance) object to grasp
  • **Medical:** distinguish individual cell instances (instance) from tissue background (semantic)

How does panoptic segmentation differ from semantic segmentation?

Evolution of semantic segmentation

  • **FCN (2015):** FC → 1x1 conv + transposed conv upsampling; multi-scale via skip connections
  • **U-Net (2015):** symmetric encoder-decoder + concatenation skip connections for sharp boundaries
  • **DeepLabV3+ (2018):** dilated conv + ASPP for large context without resolution loss, 89% on VOC
  • **Panoptic (2019):** things as instances + stuff as semantic classes; metric PQ = SQ x RQ
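The PQ factorization in the last item can be verified numerically. In the standard definition a predicted and a ground-truth segment match when their IoU exceeds 0.5. SQ is the mean IoU of the matched pairs (true positives), and RQ = TP / (TP + FP/2 + FN/2). The sketch assumes matching has already been done; the function name and input numbers are made up for illustration.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) given IoUs of matched pairs and unmatched segment counts."""
    tp = len(matched_ious)
    denom = tp + 0.5 * num_fp + 0.5 * num_fn
    if denom == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / tp if tp else 0.0   # segmentation quality
    rq = tp / denom                              # recognition quality (an F1 score)
    return sq * rq, sq, rq


# Three matched segments, one spurious prediction, one missed ground-truth segment.
pq, sq, rq = panoptic_quality([0.9, 0.7, 0.8], num_fp=1, num_fn=1)
print(round(pq, 3), round(sq, 3), round(rq, 3))
```

SQ measures how well the matched segments overlap and RQ measures how many segments were found at all, so a model can score low on PQ either by drawing sloppy masks or by missing objects.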

Related topics

Segmentation builds on detection and shares backbone networks.

  • Two-Stage Detectors: R-CNN Family - Mask R-CNN adds a segmentation head to Faster R-CNN
  • Feature Pyramid Networks - FPN is the backbone shared by the instance and semantic branches of Panoptic FPN

Questions for reflection

  • Why are skip connections more critical for segmentation than for image classification?
  • Which type of segmentation (semantic, instance, panoptic) is needed for an automated parking system, and why?
  • How do dilated convolutions at different rates in ASPP compensate for not using multi-scale image pyramids at inference time?

Related lessons

  • cv-07 — Detection backbones and FPN are reused for dense prediction
  • cv-09 — Instance segmentation adds per-object masks on top
  • dl-04 — Encoder-decoder CNNs underlie U-Net and FCN
  • ml-40-segmentation — Same pixel-labeling task in the classical ML curriculum
  • alg-12-bfs — Connected-component labeling of masks is graph traversal
  • ml-01