Computer Vision
Two-Stage Detectors: the R-CNN Family
Prerequisites
- Single-shot detectors YOLO and SSD as a point of comparison (cv-06)
- Anchor boxes, IoU, and Non-Maximum Suppression
- A CNN classifier as a feature-extraction backbone
The R-CNN Family
Ross Girshick and colleagues at UC Berkeley introduced R-CNN (Regions with CNN features) in a late-2013 preprint, published at CVPR 2014. It applied a CNN to about 2000 region proposals per image, lifting detection accuracy on PASCAL VOC far beyond prior methods, but it was painfully slow because each region went through the network separately. Girshick addressed the speed in 2015 with Fast R-CNN, which ran the CNN once over the whole image and pooled features per region with RoI Pooling. Later in 2015, Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun completed the line with Faster R-CNN, replacing the slow external region proposals with a learned Region Proposal Network that shares features with the detector. That brought two-stage detection close to real time and set a template for accurate detection that is still used today.
Tesla Autopilot must recognize a pedestrian within 50 milliseconds at 120 km/h - the car travels about 1.7 meters in that window (33.3 m/s × 0.05 s). R-CNN, the most accurate detector when it appeared, needed up to 47 seconds per image. Within two years the R-CNN family cut that by a factor of several hundred, to 5-17 frames per second, while improving accuracy.
- **Autonomous driving:** Faster R-CNN with FPN underpins many detectors used in driver-assistance (ADAS) stacks
- **Medical imaging:** FPN enables detection of micronodules in CT scans (objects < 6 mm)
- **Surveillance:** real-time person detection in airport and stadium video feeds
R-CNN (2013): 2000 separate CNN passes
By 2013, CNNs trained on ImageNet could classify objects, but detection also requires locating them. Girshick et al. proposed a straightforward pipeline: generate region candidates first, then classify each one with its own CNN pass.
- **Selective Search** generates ~2000 region proposals via texture/color-based segmentation
- Each proposal is cropped and resized to 227×227 pixels
- An AlexNet-style CNN extracts a 4096-dim feature vector per proposal
- An SVM classifier predicts the class; a separate regressor refines the bounding box
**Core bottleneck:** the CNN runs 2000 times independently per image. Most proposals overlap - the same pixels are recomputed repeatedly. This is the fundamental inefficiency R-CNN did not address.
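The overlap described above can be made concrete with a rough count of how many proposals cover each pixel. This is only an illustrative sketch: the boxes are random stand-ins rather than real Selective Search output, and the image size and the helper name `proposal_redundancy` are assumptions chosen for the example.

```python
import numpy as np

def proposal_redundancy(boxes, height, width):
    """Average number of proposals covering each pixel that is covered at all."""
    coverage = np.zeros((height, width), dtype=np.int32)
    for x1, y1, x2, y2 in boxes:
        coverage[y1:y2, x1:x2] += 1  # every crop re-reads these pixels
    covered = coverage > 0
    return coverage[covered].mean()

rng = np.random.default_rng(0)
H, W, N = 375, 500, 2000  # hypothetical VOC-sized image, ~2000 proposals

# Random boxes at least 32 px on a side, clipped to the image (stand-ins only).
x1 = rng.integers(0, W - 32, size=N)
y1 = rng.integers(0, H - 32, size=N)
x2 = np.minimum(W, x1 + rng.integers(32, W, size=N))
y2 = np.minimum(H, y1 + rng.integers(32, H, size=N))
boxes = np.stack([x1, y1, x2, y2], axis=1)

redundancy = proposal_redundancy(boxes, H, W)
print(f"each covered pixel falls inside about {redundancy:.0f} proposals")
```

Each of those overlapping crops is resized and pushed through the CNN independently, so the same image content is convolved many times over.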
R-CNN on VOC 2012: **53.3% mAP**, against roughly 33-34% for the best DPM-based methods on PASCAL VOC benchmarks. The quality leap was enormous - but at up to 47 seconds per frame (the figure for the VGG16 variant) it was useful only for offline processing.
Why does R-CNN run so slowly (47 sec/frame)?
Fast R-CNN (2015): one pass + RoI Pooling
Girshick (2015) inverted the logic: run the CNN once over the entire image, then project proposals onto the already-computed feature map. This shared computation is the central idea of Fast R-CNN.
- The full image passes through VGG16 once → feature map H×W×C
- Selective Search still provides ~2000 proposals, but they are projected onto the shared feature map
- **RoI Pooling** normalizes each proposal to a fixed 7×7 grid via adaptive max-pooling
- A single FC head predicts class (softmax) and bbox offsets - no SVM
**RoI Pooling:** a proposal at coordinates (x1,y1,x2,y2) is projected to the feature map, divided into a 7×7 grid, and max-pooled within each cell. Output is always 7×7 regardless of proposal size.
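A minimal NumPy sketch of the RoI Pooling step just described. It assumes a `(C, H, W)` feature map and a backbone stride of 16 (the usual value for VGG16's last conv block); the function name `roi_pool` and the exact rounding of bin edges are choices made for this example, not the reference implementation.

```python
import numpy as np

def roi_pool(feature_map, roi, stride=16, out_size=7):
    """Max-pool one RoI from a (C, H, W) feature map into (C, out_size, out_size).

    roi is (x1, y1, x2, y2) in image pixels; stride is the assumed
    image-to-feature-map downsampling factor.
    """
    C, H, W = feature_map.shape
    # Project the box onto the feature map, rounding outward to whole cells.
    fx1, fy1 = int(np.floor(roi[0] / stride)), int(np.floor(roi[1] / stride))
    fx2, fy2 = int(np.ceil(roi[2] / stride)), int(np.ceil(roi[3] / stride))
    # Clip to the map and keep the region at least one cell wide and tall.
    fx1, fy1 = min(max(fx1, 0), W - 1), min(max(fy1, 0), H - 1)
    fx2, fy2 = min(max(fx2, fx1 + 1), W), min(max(fy2, fy1 + 1), H)

    # Edges of the out_size x out_size grid laid over the projected region.
    xs = np.linspace(fx1, fx2, out_size + 1)
    ys = np.linspace(fy1, fy2, out_size + 1)
    out = np.empty((C, out_size, out_size), dtype=feature_map.dtype)
    for i in range(out_size):
        y_lo = int(np.floor(ys[i]))
        y_hi = max(int(np.ceil(ys[i + 1])), y_lo + 1)  # never an empty bin
        for j in range(out_size):
            x_lo = int(np.floor(xs[j]))
            x_hi = max(int(np.ceil(xs[j + 1])), x_lo + 1)
            out[:, i, j] = feature_map[:, y_lo:y_hi, x_lo:x_hi].max(axis=(1, 2))
    return out

# Hypothetical 800x600 image -> 50x38 feature map with 512 channels.
features = np.random.default_rng(0).standard_normal((512, 38, 50)).astype(np.float32)
small = roi_pool(features, (48, 32, 160, 200))
large = roi_pool(features, (0, 0, 799, 599))
print(small.shape, large.shape)  # both (512, 7, 7)
```

Both proposals yield the same output shape, which is what lets a single fully connected head process regions of any size.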
With VGG16, Fast R-CNN trains 9x faster and runs inference 213x faster than R-CNN (excluding Selective Search). Selective Search itself became the new bottleneck - a handcrafted algorithm that takes ~2 s per image and cannot be trained jointly with the CNN.
What is the key innovation of Fast R-CNN compared to R-CNN?
Faster R-CNN (2015): Region Proposal Network
Ren et al. (2015) eliminated the remaining bottleneck: Selective Search was replaced by a **Region Proposal Network (RPN)** - a small convolutional network that predicts proposals directly from the shared feature map. Proposal generation and detection became parts of a single network that can be trained end to end.
The RPN slides a 3×3 window over the feature map and, at each position, predicts objectness scores and bbox offsets for 9 anchors (3 scales × 3 aspect ratios). After NMS, the top 300 proposals by objectness score are passed to the detection head.
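A sketch of how the 9 anchors per position could be laid out, and how the highest-scoring ones would be selected. The scales (128, 256, 512 pixels) and aspect ratios (1:2, 1:1, 2:1) follow the values reported in the Faster R-CNN paper but are treated here as assumptions, as is the stride of 16; `make_anchors` and `top_k_proposals` are hypothetical helper names. The sketch skips the offset regression and NMS steps, and the objectness scores are random placeholders.

```python
import numpy as np

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (feat_h * feat_w * 9, 4) anchors as (x1, y1, x2, y2) in pixels."""
    # 9 base shapes: scale s fixes the area s*s, ratio r is height / width.
    shapes = np.array([(s / np.sqrt(r), s * np.sqrt(r))
                       for s in scales for r in ratios])  # (9, 2) as (w, h)

    # Anchor centers sit at the middle of each feature-map cell.
    cx = (np.arange(feat_w) + 0.5) * stride
    cy = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(cx, cy)
    centers = np.stack([cx.ravel(), cy.ravel()], axis=1)  # (H*W, 2)

    half = shapes / 2.0
    x1y1 = centers[:, None, :] - half[None, :, :]
    x2y2 = centers[:, None, :] + half[None, :, :]
    return np.concatenate([x1y1, x2y2], axis=2).reshape(-1, 4)

def top_k_proposals(boxes, objectness, k=300):
    """Keep the k boxes with the highest objectness score."""
    order = np.argsort(-objectness)[:k]
    return boxes[order], objectness[order]

anchors = make_anchors(38, 50)  # hypothetical 50x38 feature map
objectness = np.random.default_rng(0).random(len(anchors))  # placeholder scores
proposals, scores = top_k_proposals(anchors, objectness)
print(anchors.shape, proposals.shape)  # (17100, 4) (300, 4)
```

Even a modest feature map produces thousands of anchors, which is why the RPN ranks them by objectness and forwards only a few hundred to the detection head.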
| Method | Proposal method (time per image) | VOC 2007 mAP | FPS |
|---|---|---|---|
| R-CNN | Selective Search (~2 s) | 66.0% | 0.02 |
| Fast R-CNN | Selective Search (~2 s) | 70.0% | 0.5 |
| Faster R-CNN | RPN (~10 ms) | 73.2% | 5-17 |
The RPN and the Fast R-CNN head share the backbone's weights and are trained together. Sharing brings a double benefit: proposals cost almost nothing extra because the feature map is already computed, and both tasks shape the same features during training.
Why does the RPN use anchors of multiple scales and aspect ratios?
Feature Pyramid Network (2017): multi-scale features
Deep CNNs produce a feature hierarchy: early layers have high resolution with fine details; late layers have low resolution with rich semantics. Small objects vanish in late layers; large objects are poorly captured in early ones. Lin et al. (2017) proposed a pyramid that exploits both simultaneously.
A lateral connection is a 1×1 conv that projects the backbone map C_i to 256 channels; the result is added element-wise to P_{i+1} upsampled by a factor of 2. A final 3×3 conv on each merged map smooths upsampling artifacts.
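The merge can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions: the weights are random rather than learned, upsampling is nearest-neighbor, the final 3×3 smoothing conv is omitted, and the channel counts (512, 1024, 2048) are typical ResNet values picked for the example. The helper names are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fpn_top_down(backbone_maps, lateral_weights):
    """Merge backbone maps (ordered fine to coarse) into pyramid maps."""
    laterals = [conv1x1(c, w) for c, w in zip(backbone_maps, lateral_weights)]
    pyramid = [laterals[-1]]  # the coarsest level starts the top-down pathway
    for lateral in reversed(laterals[:-1]):
        pyramid.append(lateral + upsample2x(pyramid[-1]))
    return pyramid[::-1]  # back to fine-to-coarse order

rng = np.random.default_rng(0)
# Hypothetical backbone outputs C3, C4, C5 at strides 8, 16, 32.
maps = [rng.standard_normal((c, s, s)).astype(np.float32)
        for c, s in [(512, 32), (1024, 16), (2048, 8)]]
weights = [(rng.standard_normal((256, m.shape[0])) * 0.01).astype(np.float32)
           for m in maps]

pyramid = fpn_top_down(maps, weights)
print([p.shape for p in pyramid])  # [(256, 32, 32), (256, 16, 16), (256, 8, 8)]
```

Every pyramid level ends up with the same 256 channels, so one RPN and one detection head can be applied to all levels.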
| Method | COCO mAP | AP small (area < 32² px) | AP large (area > 96² px) |
|---|---|---|---|
| Faster R-CNN (single scale) | 36.2% | 19.4% | 47.4% |
| Faster R-CNN + FPN | 42.1% | 26.5% | 53.4% |
The biggest gain from FPN is on **small objects**: +7.1 AP points for objects smaller than 32×32 pixels. This matters for autonomous driving (distant pedestrians) and medical imaging (lung micronodules).
What is the purpose of lateral connections in FPN?
R-CNN family evolution
- **R-CNN (2013):** 2000 independent CNN passes → 47s/frame, 53% mAP on VOC
- **Fast R-CNN (2015):** shared backbone + RoI Pooling → 9x faster training, 213x faster inference, 70% mAP
- **Faster R-CNN (2015):** RPN replaces Selective Search → 5-17 FPS, 73% mAP
- **FPN (2017):** multi-scale feature pyramid → +7 AP points on small objects
Related topics
Two-stage detectors established architectural patterns (shared backbones, anchors, region-wise heads) that much of modern object detection still builds on.
- One-Stage Detectors: YOLO, SSD — Alternative approach: no RPN, faster but less accurate
- Semantic Segmentation — Next step: per-pixel labeling instead of bounding boxes
Questions for reflection
- Why does the two-stage approach outperform one-stage detectors in accuracy, and for which tasks does this gap matter most?
- How would the Faster R-CNN architecture need to change to handle objects that differ by a factor of 100 in size?
- What trade-offs arise when choosing the number of RPN proposals: 300 vs 1000 vs 2000?
Related lessons
- cv-06 — Two-stage detectors trade YOLO speed for region-proposal accuracy
- cv-08 — Region features and ROI Align feed semantic segmentation heads
- cv-09 — Mask R-CNN adds a mask branch onto Faster R-CNN
- dl-04 — Convolutional backbones supply the shared feature maps
- ml-39-object-detection — Same detection task framed inside the classical ML curriculum
- alg-20-greedy — Non-max suppression is greedy selection of best boxes
- ml-01