Computer Vision
Features: SIFT, SURF, ORB
November 2010. Microsoft launches Kinect - the first mass-market depth sensor. 8 million units sold in 60 days. Inside: ORB features running in real-time at 30 frames per second, no GPU, detecting a human skeleton in 200 milliseconds. A classical algorithm from 2011 made motion capture accessible to everyone.
- **Google Photos / Apple Photos** - automatic grouping of photos showing the same places and objects through feature matching
- **Visual SLAM (robots, AR)** - ORB features let a robot build a map of a room while simultaneously determining its own position, at 30+ fps without a GPU
- **Panoramas** - every time a panorama is shot on a phone, the algorithm finds features on overlapping frames, computes a homography, and stitches them into a single image
David Lowe and the Birth of SIFT
In 1999 David Lowe at the University of British Columbia published a draft of SIFT; the final version with the ratio test landed in 2004. The algorithm cracked a problem considered unsolvable: how to find the same point in photographs taken from different distances and angles? SIFT became the most cited work in computer vision for a decade. In 2020 the patent expired - the algorithm went free. Today, SIFT derivatives run in every smartphone when shooting panoramas.
Предварительные знания
What Is a Feature? Harris and FAST
A **feature** is a point in an image that gets detected again reliably. A smooth wall is a poor feature - every patch looks the same. A straight line is better - position pins down perpendicular to it. A **corner** is the ideal feature - position pins down precisely in both directions.
**Harris Corner Detector** (1988) formalizes the intuition. For each pixel it computes a 2x2 autocorrelation matrix M from Sobel gradients. The eigenvalues of M tell the story: both small - flat region, one large - edge, both large - **corner**.
**FAST (Features from Accelerated Segment Test)** - the alternative to Harris, built for speed. Skips the gradient matrices entirely - looks at 16 pixels on a circle around the candidate: if 12 or more are brighter or darker than the center by a given threshold, it's a corner. FAST runs 10-20x faster than Harris, making it the go-to for real-time systems.
The catch with Harris and FAST: they detect corners **only at the current scale**. Shoot a building from far away - a window corner occupies 3x3 pixels. Up close - 30x30 pixels. Harris and FAST at different scales pick out different points. The fix is **scale-space**: analyzing the image at multiple scale levels (a Gaussian pyramid). Exactly what SIFT and SURF do.
In practice: Harris for educational tasks and camera calibration (checkerboard). FAST for real-time tracking (drones, AR). SIFT/ORB for matching between images (panoramas, object search).
Which region of an image makes the best feature (keypoint)?
Descriptors: SIFT, SURF, ORB
A feature detector finds **where** the interesting points are. But to match two images, one question must be answered: "Is this point in image A the same as that point in image B?" Each point gets a **descriptor** - a numerical vector describing the local neighborhood. Similar neighborhoods - similar descriptors.
**SIFT (Scale-Invariant Feature Transform)**, 2004, David Lowe - the first complete algorithm with invariance to scale and rotation. For each keypoint, a histogram of oriented gradients (HOG) is computed over a 16x16 neighborhood split into 4x4 cells with 8 orientation bins. Result: a vector of **128 numbers** (float32).
**SURF (Speeded-Up Robust Features)**, 2006 - an accelerated SIFT. Uses Haar wavelets and integral images in place of gradient histograms. Descriptor: **64 numbers** (half as compact). 3-5x faster than SIFT at comparable quality.
**SIFT and SURF were patented.** The SIFT patent expired in 2020 - free to use now. SURF is still under patent in some jurisdictions. In OpenCV: SIFT lives in the main module (`cv2.SIFT_create()`); SURF only in `opencv-contrib-python`.
**ORB (Oriented FAST and Rotated BRIEF)**, 2011, OpenCV Labs - the free alternative. Combines FAST detection with BRIEF descriptors and adds orientation support. Descriptor: a **binary vector of 256 bits** (32 bytes). Matching uses Hamming distance (counting differing bits) - an order of magnitude faster than Euclidean distance.
| Algorithm | Descriptor | Speed | Quality | License |
|---|---|---|---|---|
| SIFT | 128 float (512 B) | Slow | Excellent | Free (since 2020) |
| SURF | 64 float (256 B) | Medium | Excellent | Patented (in some jurisdictions) |
| ORB | 256 bit (32 B) | Fast | Good | Free (BSD) |
| AKAZE | Variable binary | Medium | Very good | Free |
An AR application for a smartphone is being built (real-time, 30 fps). Which descriptor should be chosen?
Matching: BFMatcher, FLANN, Ratio Test
Two sets of descriptors in hand (from two images). The **matching** task: for each descriptor in set A, find the nearest one in set B. "Nearest" means Euclidean distance (for float descriptors like SIFT/SURF) or Hamming distance (for binary descriptors like ORB).
**BFMatcher (Brute-Force)** - compares every descriptor in A against every descriptor in B. Complexity O(N x M). Reliable but slow once keypoints hit the thousands. **FLANN (Fast Library for Approximate Nearest Neighbors)** - uses KD-trees or LSH for approximate nearest-neighbor search. 5-10x faster than BF, with the trade-off that it may miss the best match.
The catch: many matches are wrong. A descriptor can be equally similar to several points (repeating textures, noise). **Lowe's ratio test** (David Lowe, 2004) is the compact fix: for each descriptor, grab its two nearest neighbors. If the best match is **noticeably** closer than the second - good match. If both are roughly the same distance away - the match is unreliable; toss it.
The ratio test with threshold **0.75** kills ~90% of false matches while sacrificing only ~5% of correct ones. Tighten to 0.6 (fewer matches, almost all correct) or loosen to 0.8 (more matches, more errors). The right choice depends on the task: RANSAC wants more matches; visualization wants precision.
| Matcher | Complexity | Descriptors | When to use |
|---|---|---|---|
| BFMatcher (L2) | O(N x M) | SIFT, SURF (float) | Small number of keypoints (<500) |
| BFMatcher (Hamming) | O(N x M), but faster than L2 | ORB, BRIEF (binary) | Real-time with ORB |
| FLANN (KD-Tree) | O(N x log M) | SIFT, SURF (float) | Thousands of keypoints, not real-time |
| FLANN (LSH) | O(N x ~const) | ORB, BRIEF (binary) | Thousands of binary descriptors |
Lowe's ratio test with threshold 0.75 discards a match if:
Homography and RANSAC
After matching: point (x1, y1) in image A corresponds to point (x2, y2) in image B. A set of such pairs unlocks the **geometric transformation** between the images. If both images capture a planar scene (or were shot from the same position), that transformation is a **homography**.
A **homography** is a 3x3 matrix (H) that maps points from one image to another via a projective transformation. It encodes rotation, scale, and perspective distortion. Computing it requires at least **4 point pairs** (8 degrees of freedom; the 9th element gets fixed by normalization).
The catch: matches always contain **outliers** (false correspondences). Even after the ratio test, ~10-30% of matches can be wrong. A single outlier can wreck the homography. The fix: **RANSAC (RANdom SAmple Consensus)**.
**RANSAC** - an iterative algorithm: 1) Randomly pick 4 pairs. 2) Compute a homography from them. 3) Count how many other pairs agree (inliers). 4) Repeat 100-1000 times. 5) Take the homography with the most inliers. The genius: outliers have **zero influence** on the result as long as inliers make up at least ~30% of all pairs.
The flagship application of homography is **image stitching** (panoramas). The pipeline: 1) Find features on both images. 2) Match them. 3) Compute H via RANSAC. 4) Apply `warpPerspective` - transform one image into the other's coordinate system. 5) Blend.
OpenCV ships a ready-made `cv2.Stitcher`: `stitcher = cv2.Stitcher_create(); status, panorama = stitcher.stitch([img1, img2, img3])`. It handles feature detection, matching, homography, blending, and exposure compensation automatically. But understanding each step matters when `Stitcher` fails and debugging starts.
Deep learning has completely replaced classical features (SIFT, ORB) in computer vision
Classical features remain indispensable in certain tasks. ORB + RANSAC is the standard in Visual SLAM for robots and AR, because it runs in real-time without a GPU. SIFT is used in industrial inspection where reproducibility matters. For tasks with limited data (camera calibration, stereo vision), classical methods work out of the box, without training.
Deep learning dominates high-level recognition tasks (classification, object detection, segmentation). But for geometric tasks (matching, pose estimation, SLAM), classical features are often more reliable: deterministic, no GPU, no training data required. SuperPoint and SuperGlue achieve better matching quality but require a GPU and do not guarantee real-time on edge devices.
To compute a homography via RANSAC there are 100 matches, 40 of which are outliers. What is the minimum sample size per RANSAC iteration?
Key Takeaways
- **Feature** = a point that can be reliably detected again. Corners are the best features. Harris finds corners from gradient matrices; FAST uses a circle of pixels (faster)
- **Descriptor** - a numerical "fingerprint" of a keypoint's neighborhood. SIFT (128 floats, precise, slow) -> ORB (256 bits, fast, free). A million Eiffel Tower photos? For every corner in every photo a descriptor is computed, and similar descriptors find each other
- **Lowe's ratio test** discards ~90% of false matches, keeping the reliable ones. Threshold 0.75 is the gold standard
- **Homography + RANSAC** - a geometric transformation resistant to outliers is computed from the matches. Applications: panoramas, AR, video stabilization
Related Topics
Feature detection is the bridge between low-level processing and high-level image understanding:
- Filtering and Convolution — SIFT uses Gaussian blur for scale-space and Sobel for gradients. The Harris corner detector is a convolution with Sobel kernels followed by eigenvalue analysis
- Digital Images: Pixels and Color — Feature detection operates on the grayscale channel. Understanding shape (H, W), coordinates (y, x), and cv2.cvtColor conversions are necessary prerequisites
Вопросы для размышления
- The ratio test works because a "correct" match should be significantly closer than any other candidate. In what situations does this assumption break down? Hint: consider repeating patterns (windows of a building, a keyboard).
- RANSAC randomly picks 4 points on each iteration. If 50% of matches are outliers, what is the probability that all 4 random points are inliers? How many iterations are needed for 99% confidence?
- Homography assumes a planar scene (or camera rotation only). What will happen when stitching a street panorama with close objects (cars, trees) and distant ones (buildings)?
Связанные уроки
- cv-02 — SIFT uses Gaussian blur and Sobel gradients from convolution
- cv-01 — Feature detection works on grayscale; shape and cvtColor are prerequisites
- cv-04 — Object detection and tracking build on feature matching
- ml-01 — Descriptors as feature vectors for classifiers
- alg-01 — RANSAC is iterative; big-O intuition helps choose iteration count
- geo-01 — Homography is a projective transformation studied in projective geometry
- la-15-svd