AR/VR
Tracking: 6DoF
2013-2017. Apple quietly buys up AR tracking companies: PrimeSense (depth sensors, 2013), Metaio (AR SDK, 2015), SensoMotoric Instruments (eye tracking, 2017). Tab: over USD 1 billion. One mission: pinpoint the user's head to the millimeter. Payoff - Vision Pro, unveiled in 2023. 12 cameras, LiDAR, TrueDepth, 1000 Hz IMU. Tracking computes position a thousand times a second.
- **Autonomous vehicles:** Visual-Inertial SLAM from XR is used in self-driving cars and drones for GPS-denied navigation (tunnels, indoors)
- **Cinematography:** Film studios use optical outside-in motion capture (OptiTrack, Vicon) to record actor movements - the technique behind characters such as Gollum, Thanos, and the cast of Avatar
- **Robotics:** Boston Dynamics uses SLAM algorithms in Spot and Atlas robots for navigation in unfamiliar environments
From military simulators to Meta Quest
In 1968 Ivan Sutherland created the first VR headset - the Sword of Damocles - which was suspended from the ceiling due to its weight. Tracking was mechanical, 3DoF. In the 1990s NASA experimented with outside-in using ultrasonic sensors. The breakthrough came in 2016: Oculus CV1 with external cameras, Valve Lighthouse with IR lasers. The next leap - 2019, Quest 1: inside-out with no external cameras at all. By 2023 Vision Pro raised the bar to 12 cameras and eye tracking. Fifty-five years from a ceiling-hung pendulum to a standalone headset at USD 3500.
Prerequisites
IMU: inertial sensors
2016. Oculus CV1 - the first mass-market VR headset. Turn the head and the sensors register the motion within a millisecond or two; the rendered world has to follow inside roughly 20 ms, the most mismatch the human vestibular system tolerates before nausea kicks in. Where does the speed come from? Inside every XR device sits an **IMU (Inertial Measurement Unit)** - a chip the size of a grain of rice, dispatching 1000 measurements per second.
**An IMU consists of two primary sensors:** an accelerometer (measures linear acceleration along 3 axes - forward/back, left/right, up/down) and a gyroscope (measures angular rotation velocity around 3 axes). Together they produce 6 data streams (6-axis IMU). Some IMUs also include a magnetometer (compass) - making it a 9-axis IMU.
To pull **position** out of an IMU, acceleration has to be integrated twice: acceleration -> velocity -> position. For rotation, one integration of angular velocity does the job. Mathematically simple. Practically, a disaster.
**Drift** is the IMU's fatal flaw. Double integration turns tiny sensor errors into quadratic position errors over time. Within a minute the virtual position can drift meters from reality. Rotational drift is far gentler (single integration), so 3DoF tracking on a pure IMU is workable - 6DoF is not.
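The quadratic growth is easy to reproduce. The sketch below is a minimal 1D illustration, not any vendor's tracking code: it assumes a perfectly still headset and a constant accelerometer bias of 0.01 m/s^2 (an illustrative value; `integrate_position` and the constants are hypothetical names), then integrates twice with simple Euler steps.

```python
def integrate_position(accel_samples, dt):
    """Dead-reckon 1D position by integrating acceleration twice (Euler steps)."""
    velocity = 0.0
    position = 0.0
    for a in accel_samples:
        velocity += a * dt         # first integration: acceleration -> velocity
        position += velocity * dt  # second integration: velocity -> position
    return position


RATE_HZ = 1000       # IMU sample rate used in the text
DT = 1.0 / RATE_HZ
BIAS = 0.01          # assumed constant accelerometer bias in m/s^2 (illustrative)

# The headset does not move, so the true acceleration is 0 and every
# sample contains nothing but the bias.
for seconds in (1, 10, 60):
    samples = [BIAS] * (seconds * RATE_HZ)
    drift = integrate_position(samples, DT)
    print(f"{seconds:>3} s -> {drift:.3f} m of position drift")
```

Under these assumptions the drift follows roughly ½·b·t²: about 5 mm after one second, about half a meter after ten, and about 18 m after a minute. Ten times the duration gives about a hundred times the error.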
Drift or no drift, IMUs are non-negotiable for two reasons. **Speed:** data every millisecond - minimal latency for head tracking. **Reliability:** they keep working in darkness, when cameras are blocked, during sharp fast movements. That is why every XR device uses an IMU as the front line and cameras for drift correction.
Why can't an IMU provide accurate 6DoF positional tracking over an extended period?
SLAM: Simultaneous Localization and Mapping
IMUs are fast but drift. Cameras are accurate but slow (30-60 Hz versus 1000 Hz). Combine the two and the weaknesses cancel. Enter **SLAM (Simultaneous Localization And Mapping)** - an algorithm that builds a map of the environment and pins down the device's position on it at the same time.
**The chicken-and-egg problem in SLAM:** to determine position, a map is needed. To build a map, position must be known. SLAM solves both tasks simultaneously, iteratively refining both the map and the position estimate.
**Visual-Inertial SLAM** is the modern XR standard. Cameras run at 30-60 Hz and serve up accurate but infrequent position fixes. The IMU runs at 1000 Hz and stitches the gaps between camera frames. Result: camera-grade accuracy at IMU-grade latency.
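How the two rates fit together can be sketched in one dimension. The example below is a deliberately simplified stand-in: it uses an alpha-beta filter, whereas production systems typically rely on Kalman-filter or optimization back ends. The function name, the gains `alpha` and `beta`, the bias value, and the assumption that every camera fix is perfectly accurate are all illustrative.

```python
def fuse_imu_with_camera(accel_samples, dt, camera_every, camera_position,
                         alpha=0.5, beta=0.2):
    """Integrate IMU samples at full rate; correct with a camera fix every N samples."""
    velocity = 0.0
    position = 0.0
    fix_period = camera_every * dt
    for i, a in enumerate(accel_samples, start=1):
        # High-rate prediction: the IMU fills the gap between camera frames.
        velocity += a * dt
        position += velocity * dt
        if i % camera_every == 0:
            # Low-rate correction: pull the estimate toward the camera fix.
            residual = camera_position - position
            position += alpha * residual
            velocity += beta * residual / fix_period
    return position


RATE_HZ = 1000
DT = 1.0 / RATE_HZ
BIAS = 0.01  # assumed constant accelerometer bias in m/s^2 (illustrative)

# Stationary headset for 60 s: the true position is 0 the whole time.
samples = [BIAS] * (60 * RATE_HZ)
fused = fuse_imu_with_camera(samples, DT, camera_every=33, camera_position=0.0)
print(f"fused estimate after 60 s: {fused * 1000:.3f} mm from the true position")
```

With a fix every 33 samples (about 30 Hz), the estimate stays well below a millimeter from the true position, while the same biased samples integrated without any camera correction end up roughly 18 m off.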
**Loop closure** is the SLAM superpower. When the device circles back to a place it has seen before and recognizes it, SLAM corrects the accumulated error along the entire trajectory. Like wandering through a forest, hitting a familiar tree, and realizing: "I've been here before - so now I know exactly where I am."
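A toy version of that correction: once the device recognizes its starting point, the gap between where the estimate ended up and where it should be is spread back along the path. Real SLAM systems solve this with pose-graph optimization or bundle adjustment; the linear spreading below, the function name `close_loop`, and the drifted trajectory are simplifying assumptions made for illustration.

```python
def close_loop(trajectory):
    """Spread the start/end mismatch linearly over a trajectory known to be a closed loop."""
    if len(trajectory) < 2:
        return list(trajectory)
    (x0, y0), (xn, yn) = trajectory[0], trajectory[-1]
    err_x, err_y = xn - x0, yn - y0   # accumulated drift revealed by the revisit
    last = len(trajectory) - 1
    return [
        (x - err_x * i / last, y - err_y * i / last)  # later poses get more correction
        for i, (x, y) in enumerate(trajectory)
    ]


# Hypothetical estimate of a walk around a 2 m square that should end where it began.
drifted = [(0.0, 0.0), (2.1, 0.0), (2.2, 2.1), (0.3, 2.2), (0.4, 0.1)]
corrected = close_loop(drifted)
for before, after in zip(drifted, corrected):
    print(before, "->", tuple(round(c, 3) for c in after))
```

The first pose is left untouched, the last pose snaps back onto the start, and every pose in between receives a proportional share of the correction.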
| SLAM component | Input | Output |
|---|---|---|
| Feature extraction | Camera frame | Set of recognizable points (ORB, SIFT) |
| Feature tracking | Pair of frames | How points shifted between frames |
| Pose estimation | Point shift + IMU | Device position and orientation |
| Mapping | 3D points | Environment map (point cloud) |
| Loop closure | Current vs. past features | Correction of accumulated error |
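The pose estimation row can be illustrated with a stripped-down 2D case. Assuming feature tracking has already matched the same points in two frames, the least-squares rotation and translation that map the old points onto the new ones have a closed form. Real pipelines work in 3D with camera projection and also fold in the IMU; `estimate_rigid_motion` and the sample points are hypothetical.

```python
import math


def estimate_rigid_motion(prev_pts, curr_pts):
    """Least-squares 2D rotation and translation such that curr ~= R * prev + t."""
    n = len(prev_pts)
    pcx = sum(x for x, _ in prev_pts) / n
    pcy = sum(y for _, y in prev_pts) / n
    qcx = sum(x for x, _ in curr_pts) / n
    qcy = sum(y for _, y in curr_pts) / n

    s_cos = 0.0
    s_sin = 0.0
    for (px, py), (qx, qy) in zip(prev_pts, curr_pts):
        ax, ay = px - pcx, py - pcy      # centered point in the previous frame
        bx, by = qx - qcx, qy - qcy      # centered point in the current frame
        s_cos += ax * bx + ay * by       # dot products accumulate cos(theta)
        s_sin += ax * by - ay * bx       # cross products accumulate sin(theta)

    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    tx = qcx - (c * pcx - s * pcy)       # translation that aligns the centroids
    ty = qcy - (s * pcx + c * pcy)
    return theta, (tx, ty)


# Hypothetical matched features: the second set is the first rotated and shifted.
prev_pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 2.0)]
angle, shift = 0.1, (0.5, -0.2)
curr_pts = [
    (math.cos(angle) * x - math.sin(angle) * y + shift[0],
     math.sin(angle) * x + math.cos(angle) * y + shift[1])
    for x, y in prev_pts
]
theta, (tx, ty) = estimate_rigid_motion(prev_pts, curr_pts)
print(f"rotation {theta:.3f} rad, translation ({tx:.3f}, {ty:.3f})")
```

The result describes how the points appear to move; the motion of the device itself is the inverse of that transform.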
Beyond cameras, some devices add **depth sensors** (LiDAR, structured light, ToF cameras) for direct depth readings. They speed up 3D map construction and harden SLAM against low light and texture-poor scenes (a blank white wall is a nightmare for visual SLAM).
What task does loop closure solve in SLAM?
Inside-out tracking: cameras on the headset
SLAM relies on cameras and IMUs to pin down position. The question is: where do those cameras physically sit? The answer splits XR tracking into two camps. **Inside-out** - cameras live **on the device itself**, looking outward at the world. Put it on, walk in, done.
**"Inside-out" means:** observing from the inside (from the headset) outward (toward the world). The headset sees the room and determines its own position. No additional external equipment is needed.
| Device | Cameras | Extra sensors | Note |
|---|---|---|---|
| Meta Quest 2 | 4 | - | Mass-market standalone with inside-out (successor to Quest 1) |
| Meta Quest 3 | 4 | Depth sensor | Passthrough + MR |
| Apple Vision Pro | 12 | LiDAR + TrueDepth | Most advanced inside-out to date |
| HoloLens 2 | 4 | ToF depth | Enterprise AR |
| Windows Mixed Reality | 2 | - | Minimal inside-out (2 cameras) |
**Hand tracking** ships free with inside-out. The same cameras tracking the headset's position can also see the hands. Computer vision picks out hand shape, finger positions, and gestures. Meta Quest and Apple Vision Pro lean on hand tracking as the primary input method - controllers optional.
The main pain point of inside-out is **occlusion**. Hands tucked behind the back drop out of the cameras' view. A controller hidden by the body loses tracking. The system falls back to IMU-only prediction on the controller, but accuracy decays inside a second or two.
Second weakness: **dependence on visual information**. In pitch darkness or a room with blank white walls, SLAM finds no feature points and positional tracking collapses to 3DoF (IMU only). Meta Quest patches this with IR illumination - cameras see in the near-infrared band, invisible to the naked eye.
In which situation will inside-out tracking perform worst?
Outside-in tracking: the external perspective
IMUs are fast but drift. SLAM corrects drift with cameras. Inside-out parks the cameras on the headset. Flip it the other way: cameras placed **out in the room**, aimed at the headset. That is **outside-in tracking** - external sensors watching the device from afar.
**"Outside-in" means:** observing from the outside (from the room) inward (toward the headset/controllers). External cameras or sensors know their own position precisely (they are fixed) and track markers on the headset.
| System | Type | Accuracy | Zone | Use case |
|---|---|---|---|---|
| SteamVR Lighthouse 2.0 | IR lasers + photodiodes | < 1 mm | 10x10 m | PC VR (Valve Index, Vive) |
| OptiTrack | IR cameras + retroreflectors | < 0.1 mm | Configurable | Motion capture, research |
| Vicon | IR cameras + markers | < 0.1 mm | Up to 100+ sq.m | Film, biomechanics |
| PlayStation VR1 | Camera + light markers | ~5 mm | ~2x2 m | Console VR (legacy) |
Outside-in's headline win is **accuracy**. Lighthouse hits sub-millimeter; OptiTrack and Vicon go finer still. That precision is non-negotiable for film motion capture (actor suits with markers), scientific movement research, and professional VR training simulators.
The other win: **tracking through occlusion**. With five OptiTrack cameras spread around the room, a marker on the hand is in view of at least two cameras from nearly any pose - hand-behind-back included. Headset-mounted inside-out cameras simply cannot match that.
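Two cameras are the minimum because a single fixed camera only yields a direction to the marker; position comes from intersecting the rays. The sketch below shows the idea in a 2D floor plan, assuming each camera reports a bearing angle in a shared world frame. Real systems do this in 3D with calibrated cameras and more than two views; `triangulate` and the camera layout are hypothetical.

```python
import math


def triangulate(cam_a, bearing_a, cam_b, bearing_b):
    """Intersect two rays (camera position + bearing in radians) to locate a marker."""
    dax, day = math.cos(bearing_a), math.sin(bearing_a)
    dbx, dby = math.cos(bearing_b), math.sin(bearing_b)
    denom = dax * dby - day * dbx            # 2D cross product of the ray directions
    if abs(denom) < 1e-9:
        raise ValueError("rays are parallel; marker position is not observable")
    rx, ry = cam_b[0] - cam_a[0], cam_b[1] - cam_a[1]
    s = (rx * dby - ry * dbx) / denom        # distance along ray A to the intersection
    return (cam_a[0] + s * dax, cam_a[1] + s * day)


# Two fixed cameras 4 m apart, both seeing a marker located at (2, 2).
cam_a, cam_b = (0.0, 0.0), (4.0, 0.0)
marker = triangulate(cam_a, math.atan2(2.0, 2.0), cam_b, math.atan2(2.0, -2.0))
print(f"marker at ({marker[0]:.3f}, {marker[1]:.3f})")
```

Because the cameras never move, their positions can be calibrated once, and every later measurement inherits that fixed reference.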
The industry has tilted hard toward inside-out: Meta Quest, Apple Vision Pro, and every standalone headset run on-device cameras. Outside-in survives in the niches where sub-millimeter accuracy is non-negotiable: professional motion capture, scientific labs, VR arcades with full-body tracking.
Myth: Inside-out tracking is inherently inferior to outside-in in accuracy and quality.
Reality: For 95% of use cases (gaming, productivity, education, MR), inside-out tracking provides sufficient accuracy (2-5 mm) while being far more convenient: no external equipment, works in any room, and is not confined to a fixed tracking volume.
Where the myth comes from: The stereotype formed in the era of early VR headsets (Rift CV1, Vive 2016), when inside-out hadn't yet matured. Quest 3 and Vision Pro deliver stable 6DoF without a single external camera. Outside-in is justified only where < 1 mm accuracy is required - motion capture, scientific experiments, surgical simulators.
Key ideas
- **IMU** (accelerometer + gyroscope) delivers data 1000 times per second but suffers from drift - error accumulation during integration
- **SLAM** combines cameras (accuracy) and IMU (speed), simultaneously building an environment map and determining position on it
- **Inside-out** (cameras on the headset) is convenient, works in any room, and is the standard for consumer devices (Quest, Vision Pro)
- **Outside-in** (external sensors) offers sub-millimeter accuracy for motion capture and scientific tasks, but requires equipment installation
Related topics
Tracking is the bridge between sensors and visual experience in XR:
- Optics and displays — Tracking data determines what to render - ATW and foveated rendering depend on the accuracy of head and eye tracking
- Introduction to XR — The type of tracking (3DoF vs 6DoF) determines what XR experience is possible - from AR to full VR
Questions for reflection
- Why did Visual-Inertial SLAM become the standard rather than purely visual SLAM or purely inertial navigation? What weaknesses of each approach does the other compensate for?
- How would one design a tracking system for a 200 sq.m VR arcade with dozens of players? Inside-out, outside-in, or a hybrid?
- SLAM algorithms from XR are used in autonomous vehicles and robots. What additional requirements appear when the device on the move is a car on a road rather than a headset on a head?
Related lessons
- arvr-02 — Optics and displays: ATW and foveated rendering depend on tracking accuracy
- arvr-01 — Tracking type (3DoF vs 6DoF) determines what XR experience is possible
- rob-03 — SLAM in robots and XR - the same math, different applications
- rob-04 — Autonomous robot navigation and VR tracking both use Visual-Inertial SLAM
- emb-03 — IMU connects to MCU via I2C/SPI - the physical layer of tracking
- ml-01 — Hand tracking and gesture recognition apply computer vision on top of SLAM
- la-06-transformations