Robotics

SLAM

2005. DARPA Grand Challenge. Stanley, Stanford's VW Touareg, drives about 7 hours through the Mojave Desert without a driver. It builds a map of the terrain in real time, localizes within that map, and navigates around obstacles - all simultaneously. No pre-built map, no GPS indoors, no external infrastructure. This is SLAM. Twenty years later, Roomba vacuums draw floor plans of your apartment. iPhone AR apps place furniture that stays put when you look away. Boston Dynamics Spot explores earthquake damage. Mars rovers navigate alone on a planet where a signal can take up to roughly 20 minutes to arrive. SLAM is the technology that gives robots spatial awareness.

**Waymo / autonomous vehicles**: LiDAR SLAM (based on LOAM/Cartographer) builds centimeter-accurate 3D maps used for localization; Waymo has mapped 65 cities with over 32 million km of real-world driving data
**iRobot Roomba j7+ (2021)**: uses visual SLAM with a front-facing camera and learned features to map room layouts; enables targeted spot-cleaning and obstacle avoidance without LiDAR
**Meta Quest 3 / Apple Vision Pro**: inside-out tracking using visual-inertial odometry (VIO) and SLAM with multiple wide-angle cameras and an IMU; provides 6-DOF head tracking at roughly millimeter-level accuracy for VR/AR applications without external base stations

The SLAM Problem

A robot enters an unknown building. To navigate, it needs a map. To build a map, it needs to know where it is. This is the **SLAM chicken-and-egg problem**: Simultaneous Localization and Mapping. Without a map, the robot cannot know its position. Without knowing its position, it cannot build a consistent map. SLAM algorithms solve both problems jointly: the robot incrementally builds a map while estimating its position within that map, using sensor measurements (LiDAR, cameras, IMU, wheel encoders) to constrain both.

**SLAM state estimation:**
- **State**: robot pose (x, y, θ) + map (landmark positions m_1, ..., m_n)
- **Observations**: z_t = sensor measurements at time t (distances, angles)
- **Motion model**: p(x_t | x_{t-1}, u_t) - how the pose changes given control u_t
- **Observation model**: p(z_t | x_t, m) - what the sensor should see from pose x_t with map m

SLAM estimates the joint posterior `p(x_{0:t}, m | z_{0:t}, u_{0:t})`: the robot trajectory and the map, given all observations and controls.
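As a rough illustration of the two models, here is a minimal sketch for a planar robot. The unicycle kinematics, the range-bearing sensor, and the names `motion_model`, `observation_model`, and `wrap` are assumptions chosen for the example, not something the lesson prescribes.

```python
import math
import random

def wrap(angle):
    """Wrap an angle to (-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))

def motion_model(pose, control, dt=1.0, noise_std=(0.0, 0.0), rng=random):
    """Draw one sample from p(x_t | x_{t-1}, u_t) for a unicycle robot.

    pose = (x, y, theta); control = (v, omega).
    noise_std is the std-dev of additive noise on (v, omega);
    zeros return the noise-free prediction.
    """
    x, y, theta = pose
    v = control[0] + rng.gauss(0.0, noise_std[0])
    w = control[1] + rng.gauss(0.0, noise_std[1])
    x += v * dt * math.cos(theta)
    y += v * dt * math.sin(theta)
    return (x, y, wrap(theta + w * dt))

def observation_model(pose, landmark):
    """Noise-free mean of p(z_t | x_t, m): range and bearing to one landmark."""
    x, y, theta = pose
    dx, dy = landmark[0] - x, landmark[1] - y
    return (math.hypot(dx, dy), wrap(math.atan2(dy, dx) - theta))

pose = motion_model((0.0, 0.0, 0.0), control=(1.0, 0.0))   # drive 1 m forward
z = observation_model(pose, landmark=(1.0, 1.0))           # landmark directly to the left
print(pose, z)
```

A filter or optimizer would compare the predicted `z` against the real sensor reading; the mismatch is what constrains both the pose and the landmark position.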

**EKF-SLAM complexity**: O(n^2) per update step where n is the number of landmarks. For 1000 landmarks, each step requires O(10^6) operations. This limits EKF-SLAM to sparse maps. Graph-based SLAM (next concept) scales to millions of landmarks by solving the problem as a sparse least-squares optimization.
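To make the quadratic growth concrete, the following back-of-the-envelope sketch counts covariance entries for a planar robot. It assumes a 3-dimensional pose, 2-dimensional point landmarks, and 8-byte floats; `ekf_slam_cost` is a hypothetical helper, not part of any library.

```python
def ekf_slam_cost(n_landmarks, pose_dim=3, landmark_dim=2, bytes_per_float=8):
    """Size of the dense EKF-SLAM covariance for n point landmarks."""
    state_dim = pose_dim + landmark_dim * n_landmarks
    cov_entries = state_dim ** 2          # dense matrix: every pair is correlated
    return {
        "state_dim": state_dim,
        "cov_entries": cov_entries,
        "cov_megabytes": cov_entries * bytes_per_float / 1e6,
    }

for n in (100, 1_000, 10_000):
    c = ekf_slam_cost(n)
    print(f"{n:>6} landmarks: {c['cov_entries']:.2e} entries, "
          f"{c['cov_megabytes']:.1f} MB")
```

Each measurement update touches every one of those entries, which is why a tenfold increase in landmarks costs roughly a hundredfold more work per step.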

Why is SLAM called the 'chicken-and-egg' problem of robotics?

Graph-Based SLAM

EKF-SLAM stores a dense covariance matrix - all landmarks are correlated with all others. Graph-based SLAM takes a different view: represent the SLAM problem as a **pose graph** where nodes are robot poses at different times, and edges are constraints between poses (odometry edges from wheel encoders, loop closure edges when the robot revisits a place). SLAM becomes a sparse nonlinear least-squares problem: find the pose assignment that best satisfies all constraints. This is solved with Gauss-Newton or Levenberg-Marquardt.

**Pose graph structure:**
- **Nodes**: x_i = (x, y, θ) at time i
- **Odometry edges**: constraint from x_i to x_{i+1} measured by IMU/encoders
- **Loop closure edges**: constraint when the robot recognizes a previously visited place
- **Objective**: `min_X Σ_ij ||x_j - f(x_i, z_ij)||^2_{Ω_ij}` (each term weighted by its information matrix Ω_ij)

The key insight: the constraint matrix is **sparse** (each pose is connected to only a few neighbors). Sparse Cholesky factorization solves the system efficiently even with millions of poses. Libraries: g2o and GTSAM (which includes the incremental solver iSAM2).
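The objective above can be tried out on a toy problem. The sketch below is a deliberately simplified 1D pose graph (poses are scalars, so f(x_i, z_ij) = x_i + z_ij and the problem is linear, meaning a single Gauss-Newton step is exact). The function name `solve_pose_graph_1d`, the odometry bias of +0.1 per step, and the unit weights are all assumptions made for illustration.

```python
import numpy as np

def solve_pose_graph_1d(n_poses, edges):
    """Minimize sum of omega * ((x_j - x_i) - z_ij)^2 with x_0 fixed at 0.

    edges: list of (i, j, z_ij, omega_ij).
    """
    H = np.zeros((n_poses, n_poses))   # information (normal-equation) matrix
    b = np.zeros(n_poses)
    for i, j, z, omega in edges:
        # Residual (x_j - x_i) - z has Jacobian -1 w.r.t. x_i and +1 w.r.t. x_j.
        H[i, i] += omega
        H[j, j] += omega
        H[i, j] -= omega
        H[j, i] -= omega
        b[i] -= omega * z
        b[j] += omega * z
    x = np.zeros(n_poses)
    # Anchor the first pose (removes the gauge freedom), solve for the rest.
    x[1:] = np.linalg.solve(H[1:, 1:], b[1:])
    return x

# Robot drives +1, +1, -1, -1 and returns to its start; odometry reads 0.1 too high.
odometry = [(0, 1, 1.1, 1.0), (1, 2, 1.1, 1.0), (2, 3, -0.9, 1.0), (3, 4, -0.9, 1.0)]
loop_closure = [(0, 4, 0.0, 1.0)]      # "pose 4 is the same place as pose 0"

dead_reckoning = solve_pose_graph_1d(5, odometry)
optimized = solve_pose_graph_1d(5, odometry + loop_closure)
print("odometry only:    ", dead_reckoning)
print("with loop closure:", optimized)
```

With odometry alone the final pose ends up 0.4 away from the start; the single loop-closure edge spreads that error across every edge in the cycle. In 2D or 3D the residuals involve rotations, so the same normal equations would be re-linearized and solved repeatedly, and H would be stored as a sparse matrix.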

Why is loop closure critical for graph-based SLAM accuracy over long trajectories?

Visual SLAM (V-SLAM)

LiDAR SLAM is accurate but expensive ($5,000-$80,000 per sensor). **Visual SLAM** (V-SLAM) uses cameras - a $20 sensor. The challenge: cameras provide 2D pixel images, not 3D point clouds. V-SLAM extracts **keypoints** (corners, edges) from images, tracks them across frames, and triangulates 3D positions. The resulting sparse landmark map enables pose estimation. ORB-SLAM3 (2021) achieves cm-level accuracy with a single $20 webcam. Used in: Meta Quest headsets, iPhone ARKit, Boston Dynamics Spot navigation.

**V-SLAM pipeline:**
1. **Feature extraction**: ORB, SIFT, BRISK - detect corners/blobs, compute descriptors
2. **Feature matching**: match descriptors across frames (brute-force or FLANN)
3. **Pose estimation**: PnP (Perspective-n-Point) from 2D-3D correspondences, RANSAC for outlier rejection
4. **Map update**: triangulate new landmarks from stereo or motion parallax
5. **Loop closure**: bag-of-words image similarity (DBoW2) for place recognition
6. **Bundle adjustment**: joint optimization of all poses and landmark positions
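Step 4 of the pipeline can be sketched with plain NumPy. This is a minimal linear (DLT) triangulation under idealized assumptions: noise-free pixel observations, known camera poses, and a made-up intrinsic matrix `K` and 10 cm baseline. A real system would typically use a library routine and refine the result in bundle adjustment.

```python
import numpy as np

def project(P, X):
    """Project a 3D point X with a 3x4 camera matrix P to pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def triangulate(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one landmark from two views."""
    A = np.stack([
        uv1[0] * P1[2] - P1[0],
        uv1[1] * P1[2] - P1[1],
        uv2[0] * P2[2] - P2[0],
        uv2[1] * P2[2] - P2[1],
    ])
    # The homogeneous point is the right singular vector with the smallest singular value.
    _, _, Vt = np.linalg.svd(A)
    Xh = Vt[-1]
    return Xh[:3] / Xh[3]

# Hypothetical pinhole intrinsics and two camera poses 10 cm apart along x.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])

X_true = np.array([0.2, -0.1, 2.0])
X_est = triangulate(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)
```

The recovered depth is only as good as the parallax between the two views, which is why the pipeline needs either a stereo baseline or enough camera motion before adding a landmark.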

Monocular visual SLAM (single camera) has a fundamental limitation compared to stereo or RGB-D SLAM. What is it?

LiDAR SLAM

LiDAR (Light Detection And Ranging) fires laser pulses and measures return times, producing precise 3D point clouds. A Velodyne HDL-64E generates up to 2.2 million points per second (in dual-return mode) with about 2 cm range accuracy at ranges on the order of 100 m. **LiDAR SLAM** matches consecutive point clouds (scan matching via ICP or NDT), extracts features (planes, edges, corners), and builds 3D maps with centimeter-level accuracy. Waymo, Cruise, and Argo AI (until it shut down in 2022) have all used LiDAR SLAM as the primary mapping and localization layer for autonomous vehicles.
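Scan matching via ICP can be sketched in a few lines. The version below is a basic point-to-point ICP in 2D with brute-force nearest neighbors and an SVD (Kabsch) alignment step. The synthetic L-shaped "scan", the 1.5° rotation, and the function names are assumptions for the example; production systems use k-d trees, outlier rejection, and often point-to-plane error metrics instead.

```python
import numpy as np

def best_rigid_transform(src, tgt):
    """Least-squares R, t with R @ src_i + t ~ tgt_i for paired points (Kabsch/SVD)."""
    cs, ct = src.mean(axis=0), tgt.mean(axis=0)
    H = (src - cs).T @ (tgt - ct)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:           # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    return R, ct - R @ cs

def icp(src, tgt, iters=20, tol=1e-10):
    """Point-to-point ICP: alternate nearest-neighbor matching and rigid alignment."""
    dim = src.shape[1]
    R_total, t_total = np.eye(dim), np.zeros(dim)
    prev_err = np.inf
    for _ in range(iters):
        cur = src @ R_total.T + t_total
        d2 = ((cur[:, None, :] - tgt[None, :, :]) ** 2).sum(axis=2)
        nn = d2.argmin(axis=1)                       # closest target point per source point
        R, t = best_rigid_transform(cur, tgt[nn])
        R_total, t_total = R @ R_total, R @ t_total + t
        err = d2[np.arange(len(cur)), nn].mean()
        if abs(prev_err - err) < tol:
            break
        prev_err = err
    return R_total, t_total

# Synthetic previous scan: two walls meeting at a corner, sampled every 10 cm.
wall_x = np.column_stack([np.arange(0, 11) * 0.1, np.zeros(11)])
wall_y = np.column_stack([np.zeros(10), np.arange(1, 11) * 0.1])
prev_scan = np.vstack([wall_x, wall_y])

# Current scan: the same points seen after a small motion.
a = np.deg2rad(1.5)
R_true = np.array([[np.cos(a), -np.sin(a)], [np.sin(a), np.cos(a)]])
cur_scan = prev_scan @ R_true.T + np.array([0.01, -0.01])

R_est, t_est = icp(cur_scan, prev_scan)
aligned = cur_scan @ R_est.T + t_est
print(np.abs(aligned - prev_scan).max())
```

The motion here is kept small relative to the point spacing so that nearest-neighbor matching pairs the right points from the first iteration; with a larger initial offset, ICP can settle into a wrong local minimum, which is one reason odometry or IMU data is used to seed it.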

**LOAM (LiDAR Odometry and Mapping)** - one of the most influential LiDAR SLAM algorithms:
1. Extract **edge features** (sharp lines) and **planar features** (flat surfaces) from each scan
2. **Scan-to-scan matching**: ICP-style matching on features to estimate motion between consecutive scans (10 Hz)
3. **Scan-to-map matching**: register current scan against accumulated map for drift correction (1 Hz)
4. **Map update**: fuse current scan into voxel map

Successors: LeGO-LOAM (ground vehicles), LIO-SAM (with IMU preintegration), FAST-LIO (tightly-coupled IMU+LiDAR). Cartographer (Google) uses pose graphs with LiDAR in both 2D and 3D.
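Step 1 can be illustrated with the local smoothness score described in the LOAM paper: for each point, sum the difference vectors to its neighbors along the scan line and normalize. The sketch below applies it to a synthetic 2D scan of two walls meeting at a corner. The window size, the scan geometry, and the name `loam_curvature` are assumptions for illustration; real implementations also split the scan into sectors and skip occluded or nearly parallel surfaces.

```python
import numpy as np

def loam_curvature(scan, half_window=5):
    """LOAM-style smoothness score per point along one scan line.

    c_i = || sum_j (X_i - X_j) || / (|S| * ||X_i||), with S the 2*half_window
    neighbors of point i. Points too close to the ends of the scan get NaN.
    """
    n = len(scan)
    c = np.full(n, np.nan)
    for i in range(half_window, n - half_window):
        nbrs = np.vstack([scan[i - half_window:i], scan[i + 1:i + half_window + 1]])
        diff_sum = (scan[i] - nbrs).sum(axis=0)
        c[i] = np.linalg.norm(diff_sum) / (len(nbrs) * np.linalg.norm(scan[i]))
    return c

# Synthetic scan line (sensor at the origin): a wall along y = 2, then a wall along x = 0.
wall_a = np.column_stack([np.linspace(-1.0, 0.0, 11), np.full(11, 2.0)])
wall_b = np.column_stack([np.zeros(10), np.linspace(2.1, 3.0, 10)])
scan = np.vstack([wall_a, wall_b])           # 21 points, corner at index 10

c = loam_curvature(scan)
edge_idx = int(np.nanargmax(c))              # sharpest point -> edge feature candidate
print("edge candidate:", edge_idx, scan[edge_idx])
print("score on flat wall:", c[5])
```

On an evenly sampled flat wall the difference vectors cancel and the score is essentially zero, so low-score points become planar features and high-score points become edge features.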

**Misconception**: SLAM is a solved problem - modern systems like Waymo's have perfect localization and mapping.

**Reality**: SLAM remains an active research area with significant challenges: dynamic environments (moving pedestrians, changing scenes), adverse weather (LiDAR in heavy rain/snow), long-term map maintenance, and large-scale multi-session SLAM. Waymo's system combines HD prior maps with online localization; it is not pure online SLAM.

Most deployed autonomous vehicle systems use pre-built high-definition maps (Waymo's in-house maps, HERE HD Live Map) combined with online localization against those maps - this is much easier than pure online SLAM because the map is already known. Pure online SLAM in novel environments at city scale remains an open problem. Research frontiers: semantic SLAM (understanding scene structure), lifelong SLAM (map maintenance over months), and SLAM in dynamic environments where moving objects must be distinguished from static map elements.

Why does LOAM perform scan-to-scan matching at 10 Hz and scan-to-map matching at only 1 Hz, rather than using a single scan-to-map pipeline?

Key ideas

**SLAM fundamentals**: jointly estimate robot trajectory and environment map from sensor data; the chicken-and-egg dependency between localization and mapping is resolved by probabilistic joint estimation
**EKF-SLAM**: extended Kalman filter with landmark-augmented state; O(n^2) per update limits scalability; suitable for sparse maps (on the order of hundreds of landmarks at most)
**Graph-based SLAM**: pose graph with odometry and loop-closure edges; sparse nonlinear least-squares (g2o, GTSAM); loop closure corrects accumulated drift globally
**V-SLAM and LiDAR SLAM**: cameras provide rich appearance information (about $20 per sensor, but scale is ambiguous with a single camera); LiDAR provides accurate 3D geometry ($5K-80K); LOAM decouples fast scan-to-scan (10 Hz) and slow scan-to-map (1 Hz) for real-time operation

Questions for reflection

EKF-SLAM has O(n^2) update complexity. For a warehouse robot mapping 10,000 visual landmarks, estimate the computation per step. What architectural choice (graph SLAM, sparse EKF, particle filter) would make this tractable in real time?
Loop closure relies on place recognition: detecting that the robot has returned to a previously visited location. What failure modes exist for appearance-based place recognition (DBoW2) - and how do LiDAR-based approaches (e.g., scan context descriptors) address them?
Waymo uses pre-built HD maps for localization rather than purely online SLAM. What advantages does this provide - and what limitations does it create for deploying autonomous vehicles in new cities or after map changes (road construction, parking lots, temporary closures)?

Related lessons

prob-04-bayes