AR/VR

Spatial Computing: Vision Pro

In 2007, the iPhone removed buttons and made the screen the primary input. In 2024, Vision Pro removed the screen and made space the primary input. This is not iteration - it is a different computing paradigm where the OS exists in three dimensions.

  • **Mayo Clinic surgeons** use Vision Pro to visualize MRI scans in 3D directly above the patient during procedures
  • **Lufthansa** is testing Vision Pro for pilot training - a simulator without a physical cockpit
  • **Foveated rendering** in Vision Pro saves up to 60% GPU load - the same technique is moving into next-generation gaming GPUs

Spatial OS: operating system of space

On the first day after Vision Pro launched, users opened Safari and found the browser floating in front of them - hovering in the air beside the couch, movable, scalable, placeable next to the window. This is not a trick. **visionOS** is the first OS where the desktop is the three-dimensional space of a room.

Unlike iOS and macOS, where windows exist on a flat screen, visionOS builds a **Shared Space** - a unified volume where multiple apps coexist. Each app receives its own window or volumetric container, which the system keeps anchored in the room using the same world tracking that ARKit builds on. The system knows where the table and the walls are, so windows stay fixed in place as the user moves around. Apps themselves do not receive ARKit data while in Shared Space; that requires opening an ImmersiveSpace.

The architecture is three-layered: **SwiftUI** handles logic and 2D interface, **RealityKit** renders 3D objects and physics, **ARKit** tracks space and provides anchors. The frameworks are designed to work together: RealityKit content is embedded in SwiftUI through `RealityView`, and ARKit anchors report transforms that can be applied directly to RealityKit entities.
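To make the layering concrete, here is a minimal sketch of a visionOS app that combines a SwiftUI window with a volumetric RealityKit container. The names `SpatialDemoApp`, `ContentView`, `GlobeView`, and the window id `"globe"` are assumptions made for illustration, and the sizes are arbitrary.

```swift
import SwiftUI
import RealityKit

@main
struct SpatialDemoApp: App {
    var body: some Scene {
        // A regular 2D window; in Shared Space the system decides where it first appears.
        WindowGroup {
            ContentView()
        }

        // A volumetric window: a bounded 3D container for RealityKit content.
        WindowGroup(id: "globe") {
            GlobeView()
        }
        .windowStyle(.volumetric)
        .defaultSize(width: 0.4, height: 0.4, depth: 0.4, in: .meters)
    }
}

struct ContentView: View {
    @Environment(\.openWindow) private var openWindow

    var body: some View {
        // SwiftUI layer: ordinary 2D controls.
        Button("Show globe") {
            openWindow(id: "globe")
        }
        .padding()
    }
}

struct GlobeView: View {
    var body: some View {
        // RealityKit layer: 3D content hosted inside a SwiftUI view.
        RealityView { content in
            let sphere = ModelEntity(
                mesh: .generateSphere(radius: 0.1),
                materials: [SimpleMaterial(color: .blue, isMetallic: false)]
            )
            content.add(sphere)
        }
    }
}
```

Both windows live in Shared Space next to other apps; no ARKit code is needed for the system to keep them fixed in the room.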

**Key difference from HoloLens:** Microsoft built a separate OS (Windows Holographic) on top of Windows. Apple made spatiality a core feature of the OS itself - visionOS is not an 'AR mode', it is the foundational paradigm of the system.

In visionOS, multiple apps run simultaneously in Shared Space. What anchors their windows to the physical environment?

Immersive Spaces: levels of immersion

Apple introduced the concept of a "Dial of Immersion." Vision Pro does not switch between the real world and VR with a hard cut. Instead, there is a continuous spectrum from fully open reality to complete VR.

In code, **ImmersiveSpace** is a separate scene in the app. Transitioning into it is done through the `openImmersiveSpace` Environment action. The system ensures only one ImmersiveSpace is active at any moment - two apps cannot simultaneously run full VR.

**Immersion style** is declared via `.immersionStyle(selection: $style, in: .mixed, .progressive, .full)`. The modifier `.upperLimbVisibility(.hidden)` hides the user's hands in full mode - by default they are always visible for safety.
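A minimal sketch of how these pieces might fit together is shown below. The space id `"studio"` and the types `ImmersionDemoApp` and `ImmersionControls` are hypothetical names chosen for this example; the scene content is a placeholder.

```swift
import SwiftUI
import RealityKit

@main
struct ImmersionDemoApp: App {
    // The currently selected style; the user can still adjust immersion within the allowed set.
    @State private var style: ImmersionStyle = .mixed

    var body: some Scene {
        WindowGroup {
            ImmersionControls()
        }

        // The immersive scene is declared separately from the app's windows.
        ImmersiveSpace(id: "studio") {
            RealityView { content in
                let marker = ModelEntity(
                    mesh: .generateSphere(radius: 0.2),
                    materials: [SimpleMaterial(color: .white, isMetallic: false)]
                )
                marker.position = [0, 1.5, -2] // roughly eye height, two meters ahead
                content.add(marker)
            }
            // Hides the user's hands and arms while this content is shown.
            .upperLimbVisibility(.hidden)
        }
        .immersionStyle(selection: $style, in: .mixed, .progressive, .full)
    }
}

struct ImmersionControls: View {
    @Environment(\.openImmersiveSpace) private var openImmersiveSpace
    @Environment(\.dismissImmersiveSpace) private var dismissImmersiveSpace
    @State private var isOpen = false

    var body: some View {
        Button(isOpen ? "Leave space" : "Enter space") {
            Task {
                if isOpen {
                    await dismissImmersiveSpace()
                    isOpen = false
                } else {
                    // Opening is asynchronous and can fail or be cancelled by the user.
                    switch await openImmersiveSpace(id: "studio") {
                    case .opened:
                        isOpen = true
                    case .userCancelled, .error:
                        isOpen = false
                    @unknown default:
                        isOpen = false
                    }
                }
            }
        }
        .padding()
    }
}
```

Checking the result of `openImmersiveSpace` matters because the request is not guaranteed to succeed, so the app should not assume it now owns the immersive space.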

Two apps both attempt to open an ImmersiveSpace in full mode simultaneously. What happens?

Passthrough: reality through a camera

Vision Pro has no transparent lenses - no optics pass light directly through to the eyes. Instead, cameras capture the real world, the image is processed, and the result is displayed on screens in front of the eyes. This is **video passthrough** (also called VST - Video See-Through). The latency from capture to display is approximately 12 ms - the figure Apple cited at WWDC 2023 for the R1 chip, and a key factor in reducing motion sickness.

For developers, the environment behind passthrough is exposed through **ARKitSession** with data providers such as `WorldTrackingProvider`. Apps have no direct access to raw camera frames for privacy reasons - ARKit exposes abstractions instead: room mesh, planes, anchors. Camera image access is only possible in specific scenarios through `CameraFrameProvider`, which requires an entitlement from Apple.
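As an illustration of working with abstractions rather than pixels, the sketch below runs an `ARKitSession` and listens for detected planes. The class name `RoomSensing` is hypothetical. It assumes the app has an ImmersiveSpace open and has declared a world-sensing usage description, since ARKit data is not delivered otherwise.

```swift
import ARKit

final class RoomSensing {
    private let session = ARKitSession()
    private let worldTracking = WorldTrackingProvider()
    private let planeDetection = PlaneDetectionProvider(alignments: [.horizontal, .vertical])

    func run() async {
        // Plane detection is unavailable in some environments, such as the simulator.
        guard PlaneDetectionProvider.isSupported else {
            print("Plane detection is not supported here")
            return
        }

        do {
            try await session.run([worldTracking, planeDetection])
        } catch {
            print("ARKitSession failed to start: \(error)")
            return
        }

        // The app receives anchors (pose, classification), never camera images.
        for await update in planeDetection.anchorUpdates {
            let anchor = update.anchor
            switch update.event {
            case .added, .updated:
                print("Plane \(anchor.id): \(anchor.classification)")
                print(anchor.originFromAnchorTransform)
            case .removed:
                print("Plane \(anchor.id) removed")
            }
        }
    }
}
```

The only data that reaches the app is geometry and semantics (for example, that a plane is a table or a wall), which is the privacy boundary the paragraph above describes.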

**EAC - Eye Accurate Correction:** passthrough automatically compensates for the parallax between the camera positions (outside the device) and the eye positions (inside). Without this correction, nearby objects would appear shifted, breaking the illusion.
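To see why nearby objects are affected most, consider a simplified geometric sketch: a camera displaced sideways from the eye by some offset sees an object that is straight ahead of the eye at an angle of atan(offset / distance). The function name and the 3 cm offset below are illustrative assumptions, not device specifications.

```swift
import Foundation

/// Angular error in degrees if a camera image were shown to the eye without
/// correction, for a camera displaced `offsetMeters` sideways from the eye
/// and an object `distanceMeters` straight ahead of the eye.
func parallaxErrorDegrees(offsetMeters: Double, distanceMeters: Double) -> Double {
    atan2(offsetMeters, distanceMeters) * 180 / .pi
}

// Hypothetical 3 cm offset between camera and eye.
let nearError = parallaxErrorDegrees(offsetMeters: 0.03, distanceMeters: 0.3)
let farError = parallaxErrorDegrees(offsetMeters: 0.03, distanceMeters: 3.0)
print("Error at 0.3 m: \(nearError) degrees")
print("Error at 3 m: \(farError) degrees")
```

With these assumed numbers the uncorrected error is roughly 5.7 degrees for an object at arm's length and roughly 0.57 degrees at 3 m, which is why the correction matters most for hands and objects on a desk.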

Why can't visionOS developers access the raw passthrough camera frames directly?

Eye Tracking: gaze as the primary input

Vision Pro ships with no controllers. The primary input is gaze combined with a pinch gesture. The system knows where the user is looking with an accuracy of a few degrees, and when the fingers are pinched together, this is interpreted as a tap on the focused target. The interface paradigm shifts: no hand movement is required to hit a button.
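From the app's side, look-and-pinch arrives as an ordinary tap. The sketch below shows one way to make a RealityKit entity respond to it; the view name `TappableCube` and the sizes are assumptions for illustration. The app never learns where the user is looking, only that the tap ended on this entity.

```swift
import SwiftUI
import RealityKit

struct TappableCube: View {
    var body: some View {
        RealityView { content in
            let cube = ModelEntity(
                mesh: .generateBox(size: 0.15),
                materials: [SimpleMaterial(color: .orange, isMetallic: false)]
            )
            // An entity needs an input target and a collision shape to receive gestures.
            cube.components.set(InputTargetComponent())
            cube.components.set(
                CollisionComponent(shapes: [.generateBox(size: [0.15, 0.15, 0.15])])
            )
            // The system draws the gaze highlight itself; the app is not told about it.
            cube.components.set(HoverEffectComponent())
            content.add(cube)
        }
        .gesture(
            SpatialTapGesture()
                .targetedToAnyEntity()
                .onEnded { value in
                    // Fired when the user looks at the cube and pinches.
                    value.entity.scale *= 1.1
                }
        )
    }
}
```

Standard SwiftUI controls such as `Button` get the same behavior without any extra components.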

Eye tracking is implemented through infrared LEDs and inward-facing cameras. During device setup, a calibration step occurs - the user looks at a sequence of dots, and the system builds a personal eye model. Gaze data never leaves the device - this is an explicit Apple policy, given that gaze direction constitutes biometric data.

**Foveated rendering:** the system knows where the eye is pointing and renders the central region (fovea) at maximum quality while rendering the periphery at lower resolution. This saves up to 60% GPU power with no perceptible quality loss.
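A back-of-the-envelope model shows where savings of this size can come from. The function below is a simplified sketch with two zones only; the parameter names and the values 0.2 and 0.5 are illustrative assumptions, not Apple's actual rendering parameters.

```swift
/// Fraction of pixels shaded compared with rendering the whole frame at full resolution.
/// - fovealAreaFraction: share of the frame area rendered at full resolution (0...1).
/// - peripheralLinearScale: resolution scale per axis applied to the rest of the frame (0...1).
func shadedPixelFraction(fovealAreaFraction: Double, peripheralLinearScale: Double) -> Double {
    precondition((0...1).contains(fovealAreaFraction), "area fraction must be in 0...1")
    precondition((0...1).contains(peripheralLinearScale), "scale must be in 0...1")
    let peripheralArea = 1 - fovealAreaFraction
    // Scaling both axes reduces the pixel count by the square of the linear scale.
    return fovealAreaFraction + peripheralArea * peripheralLinearScale * peripheralLinearScale
}

// Hypothetical split: 20% of the frame at full resolution, the rest at half resolution per axis.
let fraction = shadedPixelFraction(fovealAreaFraction: 0.2, peripheralLinearScale: 0.5)
print("Shaded pixels relative to full resolution: \(fraction)")
print("Estimated saving: \(1 - fraction)")
```

Under these assumed numbers about 40% of the pixels are shaded, a saving of about 60% in pixel work. Real savings depend on the zone layout and on how much of the frame cost is pixel shading.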

**Misconception:** Vision Pro is a VR headset with an AR mode.

**Reality:** visionOS is a spatial operating system. By default, the user sees the real world through passthrough; full VR (full immersive mode) is an opt-in for specific scenarios.

Apple deliberately positioned Vision Pro not as 'VR' but as a 'spatial computer'. Apps run by default in Shared Space (in the real world), and full immersion requires an explicit transition into an ImmersiveSpace.

Why is eye tracking data never sent off-device and not directly accessible to apps?

Spatial Computing: Vision Pro

  • visionOS is the first OS with a three-dimensional desktop (Shared Space), where windows are anchored to real geometry via ARKit
  • ImmersiveSpace provides an immersion spectrum from mixed to full VR; only one space is active at a time
  • Passthrough is video see-through via cameras with ~12 ms latency; direct frame access is restricted for privacy
  • Eye tracking + pinch is the primary input without controllers; gaze data stays on-device
  • Foveated rendering maximizes quality in the foveal zone while reducing peripheral resolution

Related topics

Vision Pro builds on the AR/VR foundations covered in earlier lessons:

  • ARKit: tracking foundation — ARKit provides world tracking and anchors for visionOS
  • Mixed Reality and passthrough — VST passthrough concept in the MR context
  • Rendering in XR — Foveated rendering and GPU optimizations

Questions for reflection

  • Which use cases benefit from the absence of physical controllers, and which suffer from it?
  • If eye tracking is biometric and not exposed to apps - how does this constrain accessibility features?
  • Passthrough with 12 ms latency vs optical lenses with zero latency: in which tasks does the difference matter most?

Related lessons

  • sd-01-intro