Computer Vision
Object Tracking: SORT, DeepSORT, ByteTrack
Amazon Go stores operate without cashiers - cameras track every shopper and every product simultaneously. 100+ people, 1000+ items, 30 FPS. One ID switch means a wrong receipt. This is not a detection problem: it is a tracking problem. SORT appeared in 2016, ByteTrack in 2022 - and the gap between them in crowded scenes is 15% accuracy.
- Amazon Go: MOT with 300+ cameras for checkout-free shopping without a single cashier
- Tesla Autopilot: SORT-style tracking across 8 cameras simultaneously at 36 ms latency per frame
- Waymo: pedestrian, cyclist, vehicle tracking from 29 LiDAR + camera fusion on 40 000 hours/day
- Sports analytics: ByteTrack for soccer players - movement statistics and heat maps in real time
Prerequisites
- Object detection: YOLO/DETR, bounding boxes, confidence scores, IoU
- Hungarian algorithm: optimal assignment over a cost matrix in O(n^3)
- Kalman filter: predicting and correcting state from noisy measurements
- Cosine similarity and feature embeddings (for re-ID in DeepSORT)
SORT, DeepSORT, ByteTrack: the evolution of tracking-by-detection
In 2016 Alex Bewley and co-authors published SORT (Simple Online and Realtime Tracking) at ICIP. The work showed that a plain combination of a Kalman filter and the Hungarian algorithm over IoU achieves near-SOTA tracking at hundreds of frames per second, with no neural network inside the tracker itself. Its weak spot was occlusion: once an object was hidden for more than a few frames, SORT lost its identity and assigned a new ID when it reappeared. In 2017 Nicolai Wojke, Alex Bewley, and Dietrich Paulus released DeepSORT, adding a deep appearance embedding (re-ID) and a matching cascade, which sharply cut ID switches. In 2021 Yifu Zhang and co-authors proposed ByteTrack (published at ECCV 2022): the key idea is to keep low-confidence detections rather than discard them, then associate them in a second stage with tracks left unmatched by the first. This recovered objects partially hidden in a crowd and pushed ByteTrack to the top of the MOT17 and MOT20 benchmarks at the time, with minimal added complexity; it is also a strong baseline on DanceTrack.
SORT: Tracking in 20 Lines of Code
2016. Tesla Autopilot processed 8 cameras simultaneously at 36 ms per frame. The detector found objects. The question: which bbox in frame N is the same object as which bbox in frame N-1? That is the tracking problem. SORT (Simple Online and Realtime Tracking) solves it at roughly 260 Hz, about 4 ms per frame for the tracker itself, which the SORT paper reports as over 20x faster than other state-of-the-art trackers of the time.
**SORT = Kalman Filter + Hungarian Algorithm.** Kalman predicts where each tracked object will be in the next frame based on its current velocity. The Hungarian Algorithm assigns predicted positions to detections from the current frame using an IoU (Intersection over Union) cost matrix. Tracks with no matches for N frames are deleted. New detections with no matches become new tracks.
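The association step can be sketched in a few lines. This is a minimal illustration, not the SORT reference code: the function names `iou` and `associate` and the `iou_min=0.3` threshold are assumptions made here, and a brute-force search over permutations stands in for the Hungarian algorithm. It returns the same optimal assignment on small inputs but scales factorially, so real pipelines usually call `scipy.optimize.linear_sum_assignment` instead.

```python
from itertools import permutations

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def associate(predicted, detections, iou_min=0.3):
    """Match Kalman-predicted track boxes to detections by maximum total IoU.

    Brute force over permutations (stand-in for Hungarian): only viable
    for a handful of boxes. Returns (matches, unmatched_tracks, unmatched_dets).
    """
    n_t, n_d = len(predicted), len(detections)
    if n_t <= n_d:
        candidates = (list(zip(range(n_t), p)) for p in permutations(range(n_d), n_t))
    else:
        candidates = (list(zip(p, range(n_d))) for p in permutations(range(n_t), n_d))
    best = max(candidates,
               key=lambda pairs: sum(iou(predicted[t], detections[d]) for t, d in pairs))
    # Reject assigned pairs whose overlap is too small to be the same object.
    matches = [(t, d) for t, d in best if iou(predicted[t], detections[d]) >= iou_min]
    matched_t = {t for t, _ in matches}
    matched_d = {d for _, d in matches}
    unmatched_tracks = [t for t in range(n_t) if t not in matched_t]
    unmatched_dets = [d for d in range(n_d) if d not in matched_d]
    return matches, unmatched_tracks, unmatched_dets

predicted = [(10, 10, 50, 50), (100, 100, 140, 140)]
detections = [(102, 98, 142, 138), (12, 11, 52, 51), (300, 300, 320, 320)]
matches, unmatched_tracks, unmatched_dets = associate(predicted, detections)
print(matches, unmatched_tracks, unmatched_dets)
```

In this toy frame both tracks find their detection, and the third detection is left unmatched, which is the case where SORT would start a new track.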
**Kalman State Vector.** SORT represents each track as the state vector [x, y, s, r, dx, dy, ds], where x,y is the bbox center, s is the area, r is the aspect ratio, and dx, dy, ds are their velocities. The aspect ratio r has no velocity term because it is assumed constant. Kalman predicts [x', y', s', r'] for the next frame under a constant-velocity model, which works well for uniform motion and breaks down on sharp turns or abrupt speed changes.
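To make the state vector concrete, here is a minimal sketch of the prediction step on the state mean only. It assumes a time step of one frame and deliberately omits the covariance propagation and the correction step of a full Kalman filter; the helper names `bbox_to_z`, `z_to_bbox`, and `predict` are chosen here for illustration.

```python
import math

def bbox_to_z(box):
    """(x1, y1, x2, y2) -> [x, y, s, r]: center, area, aspect ratio (w / h)."""
    w, h = box[2] - box[0], box[3] - box[1]
    return [box[0] + w / 2, box[1] + h / 2, w * h, w / h]

def z_to_bbox(x, y, s, r):
    """Inverse of bbox_to_z: s = w * h and r = w / h, so w = sqrt(s * r)."""
    w = math.sqrt(s * r)
    h = s / w
    return (x - w / 2, y - h / 2, x + w / 2, y + h / 2)

def predict(state):
    """Constant-velocity step for [x, y, s, r, dx, dy, ds]; r is left unchanged."""
    x, y, s, r, dx, dy, ds = state
    return [x + dx, y + dy, s + ds, r, dx, dy, ds]

# A 40x80 box moving 2 px right and 1 px up per frame, with constant area.
state = bbox_to_z((10, 20, 50, 100)) + [2.0, -1.0, 0.0]
predicted_box = z_to_bbox(*predict(state)[:4])
print(predicted_box)
```

The predicted box is what gets compared against the new detections by IoU. A sharp turn violates the constant-velocity assumption, so the predicted box drifts away from the object and the IoU match fails.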
What does the Hungarian Algorithm do in the SORT pipeline?
DeepSORT: Re-ID Embeddings Against ID Switches
SORT loses tracks during occlusion: object A hides object B for 30 frames, then A moves away - SORT has already deleted B's track and assigns B a new ID instead of restoring the original. DeepSORT adds an appearance embedding: a compact vector (128 dimensions) describing the visual identity of the object. Even after a long occlusion, a re-ID match between the reappearing detection and the stored embeddings restores the correct track.
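A re-ID match of this kind can be sketched as a nearest-neighbor search in embedding space. The code below is an illustration under stated assumptions: 3-dimensional toy vectors replace the 128-dimensional embeddings, each track keeps a small gallery of past embeddings, and the names `cosine_distance`, `reidentify`, and the `max_distance=0.2` threshold are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def reidentify(galleries, det_embedding, max_distance=0.2):
    """Return the track ID whose gallery is closest to the detection, or None.

    galleries maps track_id -> list of embeddings stored for that track.
    The cost for a track is the smallest distance to any stored embedding.
    """
    best_id, best_cost = None, max_distance
    for track_id, gallery in galleries.items():
        cost = min(cosine_distance(e, det_embedding) for e in gallery)
        if cost < best_cost:
            best_id, best_cost = track_id, cost
    return best_id

galleries = {
    7: [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],  # track 7, seen twice before occlusion
    8: [[0.0, 1.0, 0.0]],                   # track 8
}
print(reidentify(galleries, [0.95, 0.05, 0.0]))  # close to track 7's gallery
print(reidentify(galleries, [0.0, 0.0, 1.0]))    # unlike any stored track
```

A detection that resembles nothing in any gallery returns `None`, which corresponds to starting a new track rather than forcing a wrong identity.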
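**Cascaded matching.** DeepSORT matches in two stages. First, a matching cascade associates confirmed tracks with detections using appearance distance, gated by the Kalman filter's Mahalanobis distance; it processes tracks in order of increasing time since their last update, so recently seen tracks get first pick of the detections. Second, unconfirmed tracks and tracks that went unmatched for only one frame are associated with the remaining detections using IoU only. The ordering matters because the Kalman prediction grows more uncertain the longer a track goes without an update, which makes motion gating less selective for long-occluded tracks.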
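In the DeepSORT paper the matching cascade visits tracks in order of increasing time since their last update, so recently seen tracks get first pick of the detections. A minimal sketch of that ordering follows. It assumes a precomputed appearance cost matrix, uses greedy cheapest-first matching within each age level as a stand-in for the Hungarian step, and leaves out Mahalanobis gating and the final IoU stage; `matching_cascade`, `max_age`, and `max_cost` are names and values chosen for illustration.

```python
def matching_cascade(track_ages, cost, max_age=30, max_cost=0.2):
    """Match tracks to detections, most recently updated tracks first.

    track_ages[i]: frames since track i was last matched (1 = seen last frame).
    cost[i][j]:    appearance cost between track i and detection j.
    Returns (matches, unmatched_tracks, unmatched_dets).
    """
    n_det = len(cost[0]) if cost else 0
    free_dets = set(range(n_det))
    matches = []
    for age in range(1, max_age + 1):
        level = [t for t, a in enumerate(track_ages) if a == age]
        # Admissible pairs for this age level, cheapest first.
        pairs = sorted((cost[t][d], t, d)
                       for t in level for d in free_dets
                       if cost[t][d] <= max_cost)
        used_tracks = set()
        for _, t, d in pairs:
            if t in used_tracks or d not in free_dets:
                continue
            matches.append((t, d))
            used_tracks.add(t)
            free_dets.discard(d)
    matched = {t for t, _ in matches}
    unmatched_tracks = [t for t in range(len(track_ages)) if t not in matched]
    return matches, unmatched_tracks, sorted(free_dets)

# Two tracks compete for one detection. Track 1 has the lower cost,
# but track 0 was seen more recently and therefore gets priority.
track_ages = [1, 5]
cost = [[0.15],
        [0.10]]
result = matching_cascade(track_ages, cost)
print(result)
```

A single global assignment would have given the detection to track 1 here; the cascade's age ordering is what hands it to the recently updated track instead.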
What specific SORT problem does DeepSORT solve with appearance embeddings?
ByteTrack: Use Every Detection, Even Low-Confidence Ones
SORT and DeepSORT are typically run only on detections above a confidence threshold (0.5 is a common choice) and discard the rest. In crowded scenes (stadium, intersection) an object buried in a crowd may score only 0.3 confidence - and the tracker loses it. ByteTrack (2022) flips the logic: first match tracks against high-confidence detections, then try to associate the tracks that remain unmatched with low-confidence detections.
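The two-stage idea can be sketched as follows. This is an illustration, not the ByteTrack reference implementation: greedy IoU matching stands in for the Hungarian algorithm, the Kalman prediction is assumed to have already produced the track boxes, and the names `byte_associate`, `greedy_match` and the thresholds `high=0.6`, `low=0.1`, `iou_min=0.3` are illustrative assumptions.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def greedy_match(track_ids, det_ids, tracks, detections, iou_min):
    """Greedy highest-IoU-first matching (stand-in for Hungarian)."""
    pairs = sorted(((iou(tracks[t], detections[d][0]), t, d)
                    for t in track_ids for d in det_ids), reverse=True)
    matches, used_t, used_d = [], set(), set()
    for overlap, t, d in pairs:
        if overlap < iou_min:
            break  # every remaining pair overlaps even less
        if t in used_t or d in used_d:
            continue
        matches.append((t, d))
        used_t.add(t)
        used_d.add(d)
    return (matches,
            [t for t in track_ids if t not in used_t],
            [d for d in det_ids if d not in used_d])

def byte_associate(tracks, detections, high=0.6, low=0.1, iou_min=0.3):
    """tracks: {track_id: predicted box}; detections: list of (box, score)."""
    high_ids = [i for i, (_, s) in enumerate(detections) if s >= high]
    low_ids = [i for i, (_, s) in enumerate(detections) if low <= s < high]
    # Stage 1: all tracks against high-confidence detections.
    first, left_tracks, new_candidates = greedy_match(
        list(tracks), high_ids, tracks, detections, iou_min)
    # Stage 2: still-unmatched tracks against low-confidence detections.
    second, lost_tracks, _ = greedy_match(
        left_tracks, low_ids, tracks, detections, iou_min)
    # Unmatched low-confidence detections are dropped as likely background.
    return first + second, lost_tracks, new_candidates

tracks = {1: (10, 10, 50, 50), 2: (100, 100, 140, 140)}
detections = [((12, 11, 52, 51), 0.9),       # clear view of object 1
              ((102, 98, 142, 138), 0.3),    # object 2, partly occluded
              ((300, 300, 320, 320), 0.8),   # a new object
              ((500, 500, 520, 520), 0.05)]  # below `low`: ignored
matches, lost_tracks, new_candidates = byte_associate(tracks, detections)
print(matches, lost_tracks, new_candidates)
```

Track 2 survives only because of the second stage: a tracker with a hard 0.5 cutoff would never see its 0.3-confidence detection. Only unmatched high-confidence detections are candidates for new tracks, which keeps low-score false positives from spawning identities.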
**ByteTrack and the HOTA metric.** MOTA (traditional) is dominated by detection errors: false negatives and false positives usually far outnumber ID switches, so association quality barely moves the score. HOTA (Higher Order Tracking Accuracy) balances detection accuracy and association accuracy. ByteTrack on DanceTrack (dancer tracking with heavy occlusion): HOTA 47.7 vs 45.7 for DeepSORT+ - without any appearance model. One reason: low-confidence detections contain real objects that a tracker with a hard confidence cutoff never sees.
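For reference, the two metrics can be written out as below. The notation is chosen here for illustration: FN, FP, IDSW, and GT are per-frame counts of false negatives, false positives, identity switches, and ground-truth objects, and α is the localization threshold at which detection accuracy (DetA) and association accuracy (AssA) are evaluated.

```latex
\mathrm{MOTA} = 1 - \frac{\sum_t \left( \mathrm{FN}_t + \mathrm{FP}_t + \mathrm{IDSW}_t \right)}{\sum_t \mathrm{GT}_t}

\mathrm{HOTA}_\alpha = \sqrt{\mathrm{DetA}_\alpha \cdot \mathrm{AssA}_\alpha},
\qquad
\mathrm{HOTA} = \frac{1}{|A|} \sum_{\alpha \in A} \mathrm{HOTA}_\alpha,
\quad A = \{0.05, 0.10, \dots, 0.95\}
```

In MOTA each identity switch counts exactly as much as one missed or false detection, whereas the geometric mean in HOTA gives association its own factor, so a tracker cannot score well on HOTA through detection quality alone.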
Why does ByteTrack use low-confidence detections in the second stage but not the first?
Multi-Object Tracking: Production Decisions
Waymo processes 40 000 hours of sensor data from robotaxis every day. Tracking pedestrians, cyclists, and cars simultaneously at 10 FPS LiDAR + 30 FPS camera. At this scale, architectural decisions matter more than the choice of algorithm.
**Transformer-based tracking.** TrackFormer (2021) and MOTR merge detection and tracking into a single transformer: a track query is a learnable vector that follows an object across frames through attention. At inference there is no Kalman filter and no explicit Hungarian association step - the queries carry identity. On MOT17 the reported MOTA is 74.1 for TrackFormer vs 80.3 for ByteTrack, so tracking-by-detection still leads on this benchmark; the appeal of the end-to-end approach is that it has fewer hand-tuned association hyperparameters.
Better detector = better tracker
Detector accuracy matters, but the tracker adds ID consistency. ByteTrack with YOLOv8 at confidence=0.5 outperforms DeepSORT with the same detector by 15% HOTA through low-confidence association - same detector, different tracking algorithm.
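MOTA counts each ID switch the same as one missed detection, so identity errors are easy to overlook in the aggregate score. In occluded scenes, however, identity errors are the ones that break downstream use of the tracks, which makes the association algorithm at least as critical as the detection threshold.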
Why do production MOT systems often run a separate tracker per object class?
Key ideas
- SORT: Kalman predicts position, Hungarian assigns predictions to detections by IoU - about 260 Hz (roughly 4 ms per frame)
- DeepSORT: 128-dim re-ID embedding restores the correct ID after long occlusions
- ByteTrack: two-stage match - high-confidence detections first, then low-confidence detections for tracks still unmatched
- Multi-class: separate tracker per class - a pedestrian cannot become a car
- HOTA metric: balances detection accuracy and association accuracy better than MOTA
Related topics
Tracking builds on detection and leads to higher-level video understanding.
- Object Detection: YOLO, SSD — YOLO is the standard bbox detection source for the MOT pipeline
- Video Understanding — Tracking identifies objects; Video Understanding builds actions on top of tracks
- Self-Supervised Vision — DeepSORT re-ID embeddings improve through contrastive self-supervised pretraining
Questions for reflection
- SORT uses only IoU for matching. In what scenarios does this fundamentally break even with a perfect detector?
- ByteTrack uses no appearance features. How does this affect the accuracy-speed tradeoff compared to DeepSORT?
- MOTR and TrackFormer merge detection and tracking into one transformer. What practical problems remain unsolved?
Related lessons
- cv-14 (Video Understanding) builds action recognition on top of the tracks produced here
- cv-06 — YOLO detector is the standard bbox source for SORT and ByteTrack
- cv-16 — DeepSORT re-ID embeddings improve through self-supervised learning
- dsp-06 — Kalman filter in SORT is a classic DSP algorithm for smoothing noisy measurements
- prob-04-bayes