Computer Vision

3D Reconstruction

In 2021, the Mars Perseverance Rover began building 3D maps of Jezero Crater terrain from stereo camera pairs, relying on the same triangulation principle that underlies iPhone Face ID. The cost of a mistake on Mars: a stranded rover from a USD 2.7 billion mission. How do photographs become a 3D model reliable enough for autonomous navigation?

**Apple Face ID** projects infrared structured light (30,000 dots) to reconstruct the face in 3D at sub-millimeter detail, authenticating with a false acceptance rate of less than 1 in 1,000,000
**Tesla FSD** builds a 3D road model from 8 monocular cameras using cross-camera stereo matching and monocular depth in real time
**Matterport 3D** scans interiors into photorealistic 3D tours used in over 10 million real-estate listings

From Photogrammetry to SfM and SLAM

Recovering 3D from images predates computers: photogrammetry measured terrain from aerial photographs in the early 20th century. With computer vision, these ideas were formalized into structure from motion (SfM) and multi-view stereo, algorithms that recover both scene geometry and camera positions from a set of frames. In 2003, Andrew Davison presented MonoSLAM, the first system to build a map and track a single camera's pose in real time. In 2015, ORB-SLAM made the approach accurate and accessible, and today SLAM underpins AR, robotics, and autonomous navigation.

Prerequisites

Camera model and image formation
Keypoints and feature matching
Linear algebra: matrices and projections

Stereo Vision

Close the left eye, then the right - nearby objects shift noticeably. That shift is **disparity**: the pixel-coordinate difference of the same scene point between two cameras. The closer the object, the larger the disparity. The brain uses this principle for depth perception; stereo vision replicates it by matching pixels across left and right images. For a rectified pair, depth follows directly from similar triangles: depth = (f * B) / disparity, where f is the focal length in pixels, B is the baseline (inter-camera distance), and disparity is in pixels, so depth comes out in the units of B.
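The formula can be checked with a few numbers. The sketch below is illustrative only: the focal length (700 px) and baseline (12 cm) are assumed values for a hypothetical rig, not taken from any specific camera.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth in meters for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity;
        # negative values indicate an invalid match.
        return float("inf")
    return focal_px * baseline_m / disparity_px


# Hypothetical rig parameters (assumptions for illustration).
FOCAL_PX = 700.0
BASELINE_M = 0.12

for d in (84.0, 42.0, 8.4):
    z = depth_from_disparity(d, FOCAL_PX, BASELINE_M)
    print(f"disparity {d:5.1f} px -> depth {z:.1f} m")
```

Halving the disparity doubles the depth, which is also why a one-pixel matching error matters far more for distant points than for nearby ones.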

Stereo pipeline: (1) Rectification - transform the image pair so epipolar lines are horizontal (reduces the 2D search to 1D); (2) Matching - find corresponding pixels (algorithms: Semi-Global Matching (SGM) and its block-based OpenCV variant SGBM, or learned matchers such as MC-CNN and PSMNet); (3) Disparity to depth via camera geometry; (4) Post-processing - WLS filter for noise, occlusion handling. Intel RealSense D400-series cameras use an active IR stereo pair for real-time metric depth; Microsoft Azure Kinect instead measures depth with an IR time-of-flight sensor.
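To make the matching step concrete, here is a minimal sketch of the 1D search that rectification makes possible. It uses a plain sum-of-absolute-differences (SAD) window on a single scanline, which is a much simpler technique than SGM or the learned matchers named above; the function name and the toy intensity rows are made up for illustration.

```python
def match_scanline(left_row, right_row, window=1, max_disparity=8):
    """Per-pixel disparity on one rectified scanline using SAD block matching."""
    width = len(left_row)
    disparities = [0] * width
    for x in range(window, width - window):
        best_cost, best_d = float("inf"), 0
        # A point at column x in the left image appears at column x - d
        # in the right image, so only non-negative shifts are searched.
        for d in range(0, min(max_disparity, x - window) + 1):
            cost = sum(
                abs(left_row[x + k] - right_row[x - d + k])
                for k in range(-window, window + 1)
            )
            if cost < best_cost:
                best_cost, best_d = cost, d
        disparities[x] = best_d
    return disparities


# Toy rows: the textured patch (50, 90, 30) sits 3 pixels further right
# in the left image than in the right image.
right = [10, 10, 50, 90, 30, 10, 10, 10, 10, 10]
left = [10, 10, 10, 10, 10, 50, 90, 30, 10, 10]

disparities = match_scanline(left, right)
print(disparities[5:8])  # [3, 3, 3]
```

Only the textured pixels get a reliable answer. In the flat regions every candidate shift costs about the same, which is the ambiguity that global methods and post-processing are meant to resolve.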

Why do nearby objects produce larger disparity in stereo vision?

Monocular Depth Estimation

Stereo requires two calibrated cameras. Smartphones shoot with one camera - yet Apple Depth Effect works. **Monocular depth estimation** recovers depth from a single image by exploiting cues learned from millions of images: perspective, occlusion, texture gradients, known object sizes. MiDaS (Intel) and Depth Anything v2 (HKU and TikTok, 2024) run in roughly 50 ms per frame on a modern GPU, depending on model size and resolution, and can outperform stereo on textureless surfaces where block matching fails.

Two categories: (1) Metric depth - real meters (requires stereo/LiDAR calibration or scale-supervised training); (2) Affine-invariant (relative) depth - correct up to an unknown scale and shift, so proportions are right but absolute values are not, as in MiDaS and Depth Anything. Architectures: DPT (Dense Prediction Transformer: a Vision Transformer encoder with a convolutional decoder), Depth Anything v2 (ViT-L variant, 335M params, trained on roughly 595K synthetic labeled images plus about 62M pseudo-labeled real images). Apple iPhone fuses sparse LiDAR depth with a monocular CNN for ARKit metric depth.
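"Up to an unknown scale and shift" means that a relative prediction can be turned into metric values once a few reference measurements are available. The sketch below fits the two unknowns by ordinary least squares; the function name and sample values are hypothetical, and it assumes prediction and reference are expressed in the same space (MiDaS-style models typically predict inverse depth, so the fit would be done there).

```python
def fit_scale_shift(pred, ref):
    """Least-squares s, t such that s * pred[i] + t approximates ref[i]."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_r = sum(ref) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("prediction is constant; scale is undetermined")
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(pred, ref))
    scale = cov / var_p
    shift = mean_r - scale * mean_p
    return scale, shift


# Hypothetical relative predictions and sparse metric references
# (for example from a few LiDAR returns) at the same pixels.
relative = [0.1, 0.4, 0.7, 1.0]
metric_ref = [0.7, 1.3, 1.9, 2.5]

scale, shift = fit_scale_shift(relative, metric_ref)
aligned = [scale * p + shift for p in relative]
print(f"scale={scale:.3f}, shift={shift:.3f}")
```

Two reference points are enough in principle; using more makes the fit robust to noise in individual measurements.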

What is the key difference between affine-invariant depth (MiDaS) and metric depth?

Point Clouds

A depth map is a matrix of numbers. A point cloud is the same data in 3D: each pixel becomes a point (X, Y, Z) in space with an optional color (R, G, B). A spinning automotive LiDAR such as the Velodyne HDL-64E generates over a million points per second. SLAM algorithms in robots merge point clouds from consecutive frames into a global 3D map. PointNet (2017) showed that neural networks can classify 3D objects directly from point clouds without converting to voxels.

Point cloud operations: (1) Unprojection (depth map -> point cloud): unproject each pixel (u,v,d) via inverse camera matrix K^(-1); (2) ICP (Iterative Closest Point) - align two point clouds; (3) Voxel grid downsampling - subsample for faster processing; (4) Normal estimation - PCA over neighbors; (5) RANSAC - robust plane/primitive fitting. Open3D is the standard Python library. Formats: PLY, PCD, LAS.
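Operations (1) and (3) are simple enough to sketch in pure Python. The code below assumes a pinhole camera without skew, so multiplying by K^(-1) reduces to the per-axis expressions in the comments; the intrinsics and the tiny 4x4 depth map are made-up values. In practice a library such as Open3D would do this on arrays far more efficiently.

```python
import math
from collections import defaultdict


def unproject(depth, fx, fy, cx, cy):
    """Turn a depth map (rows of meters) into camera-frame (X, Y, Z) points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:  # treat non-positive depth as "no measurement"
                continue
            # Equivalent to z * K^-1 @ [u, v, 1] for a skew-free pinhole K.
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points


def voxel_downsample(points, voxel_size):
    """Replace all points that fall in the same cubic voxel by their centroid."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(math.floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [
        tuple(sum(axis) / len(group) for axis in zip(*group))
        for group in buckets.values()
    ]


# Hypothetical 4x4 depth map of a flat wall 1 m away, with toy intrinsics.
depth_map = [[1.0] * 4 for _ in range(4)]
cloud = unproject(depth_map, fx=4.0, fy=4.0, cx=0.0, cy=0.0)
reduced = voxel_downsample(cloud, voxel_size=0.5)
print(len(cloud), "->", len(reduced), "points")
```

Taking the centroid rather than an arbitrary representative keeps the downsampled cloud close to the measured surface while capping the number of points per unit volume.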

What is the primary motivation for voxel grid downsampling when working with point clouds?

Mesh Reconstruction

A point cloud is a set of discrete points with no surface between them. **Mesh reconstruction** builds a continuous triangle surface from those points - renderable, animatable, 3D-printable. Poisson Surface Reconstruction (Kazhdan et al., 2006) solves for a function whose gradient best matches the point normals - producing a globally smooth, watertight result without seam artifacts. NeRF (2020) and 3D Gaussian Splatting (2023) offer an alternative: scene representations optimized directly for rendering (a neural radiance field and a set of 3D Gaussians, respectively) that skip the explicit mesh entirely.
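The phrase "a function whose gradient best matches the point normals" can be written compactly. In the notation assumed here, \chi is the indicator function of the solid (1 inside, 0 outside) and \vec{V} is a vector field built by smoothing the oriented input normals; these symbols are introduced for illustration.

```latex
\min_{\chi} \int \left\lVert \nabla \chi(p) - \vec{V}(p) \right\rVert^{2} \, dp
\quad\Longrightarrow\quad
\Delta \chi = \nabla \cdot \vec{V}
```

The minimizer satisfies a Poisson equation, which gives the method its name. The mesh is then extracted as an isosurface of \chi, typically with a Marching Cubes variant.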

Mesh reconstruction algorithms: (1) Poisson Surface Reconstruction - smooth closed surface, requires normals; (2) Ball Pivoting Algorithm (BPA) - rolls a sphere over the point cloud, good for dense clouds; (3) Marching Cubes - voxel-based, extracts isosurface from a scalar field; (4) Screened Poisson - improved Poisson that better respects input point positions; (5) DeepSDF / OccNet - neural implicit surfaces. Formats: PLY/OBJ; renderers: Three.js, Blender, Unity.

Misconception: more points in the cloud always yield a better mesh.

Reality: mesh quality depends on uniform point distribution and accurate normals, not raw point count.

A dense but noisy cloud produces a noisy mesh. Poisson Surface Reconstruction needs a statistically cleaned, evenly distributed cloud with consistently oriented normals - not maximum density.
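One common reading of "statistically cleaned" is statistical outlier removal: drop points whose average distance to their nearest neighbors is unusually large. The sketch below is a brute-force O(n^2) version for small clouds, with hypothetical parameter names; point cloud libraries offer faster equivalents backed by spatial indexes.

```python
import math
import statistics


def remove_statistical_outliers(points, k=3, std_ratio=1.0):
    """Keep points whose mean k-nearest-neighbor distance is not unusually large."""
    mean_knn = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_knn.append(sum(dists[:k]) / k)
    # Threshold: global mean plus std_ratio standard deviations.
    threshold = statistics.mean(mean_knn) + std_ratio * statistics.pstdev(mean_knn)
    return [p for p, m in zip(points, mean_knn) if m <= threshold]


# Toy cloud: a 3x3 grid on the plane z = 0 plus one stray point far above it.
grid = [(float(x), float(y), 0.0) for x in range(3) for y in range(3)]
stray = (1.0, 1.0, 10.0)
cleaned = remove_statistical_outliers(grid + [stray])
print(len(grid) + 1, "->", len(cleaned), "points")
```

Lowering std_ratio makes the filter more aggressive, at the risk of removing valid points in sparsely sampled regions.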

Why does Poisson Surface Reconstruction require point normals in addition to coordinates?

Key Ideas

**Stereo** extracts depth through disparity (projection shift of a scene point across two cameras) - the formula depth = f*B/d is geometrically exact for a rectified pair with calibrated camera parameters
**Monocular depth estimation** (MiDaS, Depth Anything v2) gives affine-invariant depth from a single frame via cues learned from millions of images - no calibration required
**Point clouds** and **meshes** are two 3D surface representations: the former is discrete and fast to build, the latter is continuous and ready for rendering, physics simulation, or 3D printing

Questions for Reflection

Monocular depth estimation relies on learned patterns from training data. How does this limit reliability in out-of-distribution scenes like underwater or outer space?
Poisson Surface Reconstruction produces a watertight closed surface even where no points exist. When is this an advantage, and when a problem?
LiDAR samples the scene at uniform angular steps, so point density falls with distance (nearby objects are sampled more densely than far ones). How does this affect mesh quality?

Related Lessons

cv-13 — NeRF is the neural alternative to classical 3D reconstruction
cv-03 — Feature matching like SIFT drives structure-from-motion
la-15-svd — SVD solves triangulation and camera-pose estimation
calc-19-gradient — Bundle adjustment is gradient-based reprojection minimization
rob-07 — SLAM in robotics reconstructs 3D maps from camera motion
la-06-transformations