AR/VR
XR System Architecture: Rendering, Tracking, Input
Apple Vision Pro launched at USD 3,499. Its R1 chip streams sensor imagery to the displays within 12 milliseconds, which Apple describes as eight times faster than the blink of an eye. Meta spent USD 20 billion on the Metaverse in three years. These are not expensive goggles - they are a new computing platform with an architecture unlike anything before.
- Quest 3: up to 120fps stereo rendering with ASW and fixed foveated rendering on Snapdragon XR2 Gen 2
- Apple Vision Pro: R1 chip processes 12 camera streams and sensor fusion in 12ms
- Meta Presence Platform: scene understanding builds a 3D room map for AR anchors
- OpenXR 1.0: supported by Meta, Valve, Microsoft, Sony - one API for 95% of the XR market
XR Rendering Pipeline: Stereo, Reprojection, Foveated
Meta Quest 3 renders up to 120 frames per second for each eye separately, at 2064x2208 pixels per eye. At 120Hz the system has 8.3ms per frame, about half the 16.7ms available on a 60Hz monitor. And not a single frame can be dropped - or nausea follows.
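The 8.3ms figure follows directly from the refresh rate. A minimal sketch of the arithmetic, assuming a hypothetical `compositor_ms` reservation for the runtime's own work (the 1.5ms value used in the usage note is illustrative, not a measured number):

```python
def frame_budget_ms(refresh_hz):
    """Time available to produce one frame at the given refresh rate."""
    return 1000.0 / refresh_hz


def app_render_budget_ms(refresh_hz, compositor_ms):
    """Budget left for the application after a (hypothetical) compositor reservation."""
    return frame_budget_ms(refresh_hz) - compositor_ms


# Common headset refresh rates and the per-frame budget each one implies.
for hz in (72, 90, 120):
    print(f"{hz} Hz -> {frame_budget_ms(hz):.1f} ms per frame")
```

For example, `app_render_budget_ms(120, 1.5)` leaves a little under 7ms for the application itself, and that time has to cover both eyes.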
The **XR rendering pipeline** differs from standard 3D rendering in three key ways: rendering two perspectives (stereo), compensating for head movement between frames (reprojection), and adapting quality to the gaze area (foveated rendering).
**Asynchronous TimeWarp (ATW) and Asynchronous SpaceWarp (ASW)** keep the display smooth when FPS drops. If the application cannot render a frame within 8.3ms, the runtime reprojects the previous frame using the newest head pose. The user sees 120fps while the application renders at 60fps. ATW corrects only for head rotation; ASW additionally extrapolates scene motion from previous frames, but fast-moving objects can still show artifacts.
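The rotational part of reprojection can be sketched with a pinhole camera model. This is a simplified illustration, not how any particular runtime implements it: it computes the shift only near the image center, assumes square pixels, and ignores lens distortion and translation. The function names and the resolution and field-of-view values in the usage note are assumptions.

```python
import math


def focal_length_px(width_px, hfov_deg):
    """Pinhole focal length in pixels for a given horizontal field of view."""
    return (width_px / 2) / math.tan(math.radians(hfov_deg) / 2)


def reprojection_shift_px(delta_yaw_deg, delta_pitch_deg, width_px, hfov_deg):
    """Shift to apply to the previous frame near the image center.

    Positive yaw/pitch deltas give negative shifts: scene content moves
    opposite to the direction the head turned.
    """
    f = focal_length_px(width_px, hfov_deg)
    dx = -f * math.tan(math.radians(delta_yaw_deg))
    dy = -f * math.tan(math.radians(delta_pitch_deg))
    return dx, dy


def compositor_schedule(vsyncs, app_every_n):
    """Label each displayed frame: a fresh app frame or a reprojected one."""
    return ["fresh" if i % app_every_n == 0 else "reprojected"
            for i in range(vsyncs)]


# App renders every second vsync (60fps on a 120Hz display).
print(compositor_schedule(4, 2))
```

With an assumed 2064-pixel-wide eye buffer and a 110-degree horizontal field of view, `reprojection_shift_px(1.0, 0.0, 2064, 110.0)` gives a horizontal shift on the order of a dozen pixels for a single degree of head rotation. That is why even small pose errors between render and display are visible.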
**Multiview Rendering** is a Vulkan/OpenGL extension (VK_KHR_multiview, GL_OVR_multiview) that renders both eye perspectives from a single draw call. Instead of two separate render passes, the application submits the scene once and the GPU processes both cameras in parallel. CPU overhead savings: 40-50%. Supported on current mobile XR GPUs, including the Adreno and Mali families.
What is Asynchronous Spacewarp (ASW) in XR systems?
Tracking Systems: Inside-Out, Hand and Eye Tracking
**Inside-out tracking** - the headset tracks itself, without external cameras. Quest 3 uses four monochrome 1MP cameras plus an IMU (accelerometer + gyroscope). A SLAM algorithm builds a map of the room in real time and localizes the headset within it. No beacons, no setup - works anywhere.
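Real SLAM pipelines are far more involved, but the reason cameras and an IMU are combined can be shown with a one-dimensional complementary filter: the gyroscope gives fast updates that drift, and the camera gives slower, drift-free corrections. This is a minimal sketch under stated assumptions (a single yaw axis, a hypothetical gyro bias of 1 degree per second, a 1000Hz IMU and a camera fix every 33 samples); it is not the algorithm any headset ships.

```python
def fuse_yaw(prev_yaw_deg, gyro_rate_dps, dt_s, camera_yaw_deg=None, alpha=0.9):
    """Integrate the gyro, then blend toward the camera estimate when one exists."""
    predicted = prev_yaw_deg + gyro_rate_dps * dt_s
    if camera_yaw_deg is None:
        return predicted  # no camera fix this step: pure IMU prediction
    return alpha * predicted + (1 - alpha) * camera_yaw_deg


def simulate(steps=10_000, dt_s=0.001, gyro_bias_dps=1.0, camera_every=33):
    """Headset is actually stationary (true yaw 0); the gyro reports only its bias."""
    raw = 0.0    # gyro integration alone
    fused = 0.0  # gyro plus periodic camera correction
    for i in range(steps):
        raw += gyro_bias_dps * dt_s
        camera_yaw = 0.0 if i % camera_every == 0 else None
        fused = fuse_yaw(fused, gyro_bias_dps, dt_s, camera_yaw)
    return raw, fused


raw_drift, fused_drift = simulate()
print(f"gyro only: {raw_drift:.2f} deg, fused: {fused_drift:.2f} deg")
```

Over ten simulated seconds the gyro-only estimate drifts by about ten degrees, while the fused estimate stays bounded well under one degree. Production systems typically use Kalman-style filters or optimization over full 6DoF poses rather than a fixed blend factor.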
**Hand Tracking** follows 26 joints of each hand without controllers. Quest uses the same tracking cameras and a neural network (similar in approach to open models such as MediaPipe Hands) to detect and predict joint positions. 30fps on the Snapdragon in Quest 2, 60fps on Quest 3. Limitation: performance degrades in poor lighting and during self-occlusion.
**Eye Tracking** follows the gaze point for foveated rendering and social presence. Apple Vision Pro uses infrared LEDs and cameras to determine gaze direction with accuracy on the order of one degree. Applications: (1) foveated rendering saves about 40% of GPU work, (2) avatars make eye contact in MR conferences, (3) accessibility for people with motor impairments.
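How gaze drives foveated rendering can be sketched as a shading-rate lookup by angular distance from the gaze point. The eccentricity thresholds (10 and 25 degrees), the 16-pixel tile size, and the linear `px_per_deg` approximation are all illustrative assumptions; real implementations use vendor-specific shading-rate maps and tuned falloff curves.

```python
import math


def shading_rate(ecc_deg):
    """Coarse-pixel size N: one shaded sample per NxN pixel block."""
    if ecc_deg < 10:
        return 1  # fovea: full resolution
    if ecc_deg < 25:
        return 2  # near periphery: quarter the shading work
    return 4      # far periphery: one sixteenth


def eccentricity_deg(px, py, gaze_x, gaze_y, px_per_deg):
    """Angular distance from the gaze point, using a linear pixels-per-degree approximation."""
    return math.hypot(px - gaze_x, py - gaze_y) / px_per_deg


def shading_cost_fraction(width, height, gaze_x, gaze_y, px_per_deg, tile=16):
    """Fraction of full-resolution shading work left after foveation."""
    shaded = 0.0
    tiles = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            ecc = eccentricity_deg(tx + tile / 2, ty + tile / 2,
                                   gaze_x, gaze_y, px_per_deg)
            rate = shading_rate(ecc)
            shaded += 1.0 / (rate * rate)
            tiles += 1
    return shaded / tiles


# Gaze at the center of an assumed 2064x2208 eye buffer, 20 px per degree.
fraction = shading_cost_fraction(2064, 2208, 1032, 1104, px_per_deg=20)
print(f"shading work relative to full resolution: {fraction:.0%}")
```

The fraction covers fragment shading only. Geometry, bandwidth, and fixed per-frame costs are unaffected, which is one reason measured whole-frame savings are smaller than the shading fraction alone suggests.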
How does inside-out tracking differ from outside-in tracking (as used in PS Move)?
Input System: Controllers, Gestures and Voice
**XR input** is more complex than standard 3D input: the user exists in 3D space, their hands are 6DoF manipulators, and the interface can be placed on any surface in the world. OpenXR standardizes this through an Action abstraction - independent of platform and controller.
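The Action abstraction can be illustrated without the real API. In the sketch below, the interaction profile and input path strings follow OpenXR's naming scheme, but `SUGGESTED_BINDINGS`, `read_action`, and the `device_state` dictionary are hypothetical stand-ins for illustration, not OpenXR functions.

```python
# One logical action, bound to a different physical input on each controller type.
SUGGESTED_BINDINGS = {
    "/interaction_profiles/khr/simple_controller": {
        "grab": "/user/hand/right/input/select/click",
    },
    "/interaction_profiles/oculus/touch_controller": {
        "grab": "/user/hand/right/input/squeeze/value",
    },
    "/interaction_profiles/valve/index_controller": {
        "grab": "/user/hand/right/input/squeeze/value",
    },
}


def read_action(action, active_profile, device_state, threshold=0.5):
    """Resolve a logical action to whichever input the active profile binds it to.

    device_state is a hypothetical dict of input path -> current value (0.0 to 1.0).
    """
    path = SUGGESTED_BINDINGS[active_profile][action]
    return float(device_state.get(path, 0.0)) >= threshold


# Application code asks only for "grab"; it never names a button.
state = {"/user/hand/right/input/squeeze/value": 0.9}
print(read_action("grab", "/interaction_profiles/oculus/touch_controller", state))
```

The design point is that gameplay code depends on the action name alone. Supporting a new controller means adding a binding table entry rather than touching the code that reads the action.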
**Ray casting vs near-field interaction.** For distant objects - a ray from the controller (ray casting, Unity XR Ray Interactor). For nearby objects - physical touch simulation (near-field, Direct Interactor). Switching between modes happens automatically based on distance. Apple Vision Pro added a third mode - eye+pinch: gaze selects the object, pinch confirms.
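The automatic mode switch and the gaze-plus-pinch pattern can be sketched as a few distance checks. The 0.4m near-field radius and the 2cm pinch threshold are assumed values, and the function names are hypothetical; engines such as Unity expose this through their own interactor components.

```python
import math


def choose_interactor(hand_pos, target_pos, near_radius_m=0.4):
    """Direct touch within arm's reach, ray casting beyond it."""
    if math.dist(hand_pos, target_pos) <= near_radius_m:
        return "direct"
    return "ray"


def is_pinching(thumb_tip, index_tip, threshold_m=0.02):
    """Treat thumb and index fingertips closer than the threshold as a pinch."""
    return math.dist(thumb_tip, index_tip) < threshold_m


def gaze_pinch_select(gazed_object, thumb_tip, index_tip):
    """Gaze picks the candidate, pinch confirms it; otherwise nothing is selected."""
    if gazed_object is not None and is_pinching(thumb_tip, index_tip):
        return gazed_object
    return None


hand = (0.0, 1.2, -0.3)
print(choose_interactor(hand, (0.1, 1.2, -0.5)))  # nearby object
print(choose_interactor(hand, (0.0, 1.5, -3.0)))  # distant object
```

A production implementation would usually add hysteresis around both thresholds so the mode does not flicker when the hand hovers at the boundary.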
Why is OpenXR important for XR application development?
Platform Stack: Meta, Apple, OpenXR
As of 2024 there are three major XR platform stacks, and their SDKs are not interchangeable. Meta SDK - Snapdragon-specific optimizations, scene understanding, social API. Apple visionOS - spatial computing, SharePlay, EyeSight. OpenXR - the lowest common denominator, in exchange for portability. Choosing a platform means choosing a set of constraints.
**visionOS** introduces a new concept: **spatial computing** instead of VR/AR. Application windows exist in physical space without blocking it. There is no concept of a 'session' - apps are always embedded in the real world. SwiftUI with 3D extensions (RealityKit) plus ARKit for anchor placement. Key constraint: no sideloading, strict App Store.
Myth: VR/AR development requires a completely different stack than regular 3D development.
Reality: XR development builds on the same engines (Unity, Unreal) and 3D concepts, adding specific APIs for tracking, stereo rendering, and input.
Unity XR Plugin Framework and Unreal XR abstract the XR-specific concerns. A 3D game developer can start XR development by learning only OpenXR Input, stereo camera setup, and performance budgets.
What development strategy is recommended for supporting multiple XR platforms?
Key ideas
- XR frame budget: 8.3ms at 120fps - stereo, foveated rendering, lens distortion correction
- ASW/ATW: reproject the previous frame to maintain smoothness during FPS drops
- Inside-out tracking: SLAM on 4 cameras + IMU, 6DoF without external beacons
- OpenXR: cross-platform standard - one codebase for Quest, PSVR2, SteamVR
- Eye tracking + foveated rendering = 40% GPU savings with no perceptible quality loss
Related topics
XR architecture connects GPU design, tracking systems, and platform APIs.
- GPU Architecture — Stereo and foveated rendering are GPU-intensive operations on XR platforms
- Tracking and SLAM — 6DoF tracking of head and hands is the foundation of any XR system
- XR Performance — Frame budget and rendering optimizations define system design
Questions for reflection
- Why can conventional temporal anti-aliasing (TAA) not be used in XR without modifications?
- How does scene understanding (building a 3D room map) differ from SLAM? For which XR applications is scene understanding specifically needed?
- What is the 'compositor' in the XR stack and why do applications not render directly to the display?
Related lessons
- arvr-14 — XR performance budgets are the foundation for system design
- arvr-04 — VR rendering defines requirements for the rendering pipeline
- arvr-03 — Tracking is the core of any XR system architecture
- arch-15-gpu-architecture — GPU architecture underlies the XR rendering pipeline
- arvr-16 — Understanding the system is necessary for XR interviews
- sd-01-intro