AR/VR

Haptics and Multimodal Input

In 2024, surgeons at Brigham and Women's Hospital performed the first AR-navigated procedure using Apple Vision Pro. The surgeon's observation afterward: 'When the instrument touches tissue in AR and there's no feeling, the brain stops trusting the screen.' Haptics, spatial audio, gestures, and voice are not conveniences - they are the foundation of trust in a virtual world.

  • **PlayStation VR2** Sense controllers use adaptive triggers with haptic feedback: squeezing virtual objects of different densities creates different resistance - a ball and a rock feel physically distinct
  • **Apple Vision Pro** personalizes the HRTF by scanning the shape of the user's ears during initial setup, improving sound localization accuracy by 40% compared to generic profiles
  • **Valve Index** with finger tracking monitors individual finger curl angles - Half-Life: Alyx built a full object interaction system with zero buttons, relying entirely on hand gestures

Haptic Feedback

Vision and hearing account for roughly 80% of the brain's sensory input - but without touch, virtual reality remains a flat illusion. Haptic feedback closes that gap: vibrations, force resistance, and thermal effects convince the nervous system that an interaction is real. Meta Quest 3 uses Linear Resonant Actuators (LRA) in its controllers at up to 320 Hz - every surface type in a game can be literally felt through fingertips.

Three tiers of haptics: **vibrotactile** (LRA motors in controllers - cheap, universal), **force feedback** (exoskeletons like HaptX Gloves - 40+ N resistance per finger, USD 5,000+), **thermal** (Peltier elements for heat/cold simulation - research prototype territory). In XR development, 95% of use cases fall in the first tier.
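As a rough illustration of the vibrotactile tier, the sketch below generates the drive waveform for a short LRA "tap" as a decaying sine burst. The function names, the 8 kHz sample rate, and the per-surface presets are all hypothetical values chosen for the example, not parameters of any real controller SDK.

```python
import math

def lra_burst(freq_hz: float, duration_s: float, amplitude: float,
              decay_per_s: float = 0.0, sample_rate: int = 8000) -> list[float]:
    """Return a sine burst with an exponential decay envelope, clamped to [-1, 1]."""
    n = round(duration_s * sample_rate)
    samples = []
    for i in range(n):
        t = i / sample_rate
        envelope = amplitude * math.exp(-decay_per_s * t)
        value = envelope * math.sin(2 * math.pi * freq_hz * t)
        samples.append(max(-1.0, min(1.0, value)))
    return samples

# Hypothetical presets: (frequency Hz, duration s, amplitude 0..1, decay 1/s).
# A hard surface gets a short, sharp, strong burst; a soft one a longer, weaker one.
SURFACE_PRESETS = {
    "rubber_ball": (80.0, 0.060, 0.4, 20.0),
    "rock": (250.0, 0.020, 1.0, 150.0),
}

def surface_tap(surface: str) -> list[float]:
    """Build the tap waveform for a named surface preset."""
    freq, duration, amplitude, decay = SURFACE_PRESETS[surface]
    return lra_burst(freq, duration, amplitude, decay_per_s=decay)
```

In a real project the samples would be handed to the platform's haptics API; the point here is only that "surface feel" in the first tier comes down to frequency, amplitude, and envelope choices.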

Which type of haptic feedback is most common in modern XR controllers?

Spatial Audio

The brain uses an interaural time difference (ITD) of up to about 690 microseconds and an interaural level difference (ILD) of up to 20 dB to pinpoint a sound source in 3D space. Spatial audio in XR reproduces these cues through HRTF - the Head-Related Transfer Function, a unique acoustic fingerprint of the listener's head and ear shape. Apple Vision Pro applies a personalized HRTF captured with a TrueDepth camera scan during setup, making spatial audio startlingly realistic.
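The size of the ITD cue can be estimated with Woodworth's spherical-head approximation, ITD = (r / c) · (θ + sin θ). The sketch below assumes a head radius of 8.75 cm and a speed of sound of 343 m/s; both are common textbook values, not figures from this lesson.

```python
import math

SPEED_OF_SOUND_M_S = 343.0   # assumed: dry air at about 20 °C
HEAD_RADIUS_M = 0.0875       # assumed: commonly used average head radius

def itd_seconds(azimuth_deg: float, head_radius_m: float = HEAD_RADIUS_M) -> float:
    """Woodworth approximation of the interaural time difference.

    azimuth_deg: 0 = straight ahead, 90 = directly to one side.
    Intended for distant sources with azimuth between -90 and 90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return (head_radius_m / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))

def itd_samples(azimuth_deg: float, sample_rate: int = 48_000) -> int:
    """The same delay expressed in whole audio samples."""
    return round(itd_seconds(azimuth_deg) * sample_rate)
```

With these assumptions a source directly to one side gives roughly 656 microseconds, in the same range as the maximum quoted above; a larger head radius pushes the value higher.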

**HRTF** (Head-Related Transfer Function) is a set of filters modeling how sound diffracts around the head and ears from every direction. Standard libraries (MIT KEMAR) work acceptably for most users, but a personalized HRTF improves localization accuracy by 30-50%. **Ambisonics** is a recording format capturing a full spherical sound field; in VR it is decoded in real time based on current head orientation.
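To make the "decoded in real time based on head orientation" step concrete, here is a minimal sketch for first-order ambisonics on the horizontal plane. It assumes SN3D-style normalization (W carries the unscaled signal), X pointing forward, Y pointing left, and counterclockwise-positive angles. The stereo decode uses two virtual cardioid microphones, a deliberate simplification: production VR decoders typically render through HRTFs instead.

```python
import math

Foa = tuple[float, float, float, float]  # (W, X, Y, Z)

def encode_foa(sample: float, azimuth_deg: float) -> Foa:
    """Encode one mono sample located on the horizontal plane."""
    phi = math.radians(azimuth_deg)
    return (sample, sample * math.cos(phi), sample * math.sin(phi), 0.0)

def rotate_for_head_yaw(foa: Foa, head_yaw_deg: float) -> Foa:
    """Counter-rotate the sound field so sources stay fixed in the world.

    Turning the head left by `head_yaw_deg` moves every source right by
    the same angle relative to the head.
    """
    w, x, y, z = foa
    a = math.radians(head_yaw_deg)
    return (w,
            x * math.cos(a) + y * math.sin(a),
            -x * math.sin(a) + y * math.cos(a),
            z)

def decode_stereo(foa: Foa) -> tuple[float, float]:
    """Decode with virtual cardioids aimed hard left and hard right."""
    w, _x, y, _z = foa
    return (0.5 * (w + y), 0.5 * (w - y))
```

For example, a source encoded straight ahead and then rotated for a 90-degree left head turn ends up entirely in the right channel, which is what a listener would expect.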

What is HRTF and why does spatial audio need it?

Gesture Recognition

Meta Quest 3 tracks 26 joints per hand at 60 Hz using stereo cameras and a neural network model - no controllers required. Apple Vision Pro goes further: 12 cameras, 6 microphones, and LiDAR build a full real-time 3D hand model. The paradox is that gesture recognition accuracy drops sharply under bright or backlit conditions - which is exactly why premium headsets use IR illumination, invisible to the eye but clear to sensors.

Two gesture categories: **static** (pinch, fist, open palm - recognized from a single pose snapshot) and **dynamic** (swipe, circle, push - require trajectory analysis over time). Dynamic gestures typically use RNN/LSTM or a sliding window over keypoint sequences. A critical problem is false positives: a hand accidentally forming a gesture shape during natural movement must not trigger unintended actions.
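The two categories and the false-positive problem can be sketched as follows. The static detector uses hysteresis (separate press and release distances) plus a short hold requirement so that a single noisy frame cannot toggle the pinch; the dynamic detector uses a sliding window over palm positions. All class names and thresholds are illustrative assumptions, not values from any headset SDK, and a production system would more likely use a learned model for dynamic gestures.

```python
import math
from collections import deque
from dataclasses import dataclass
from typing import Optional

Point = tuple[float, float, float]

@dataclass
class PinchDetector:
    """Static gesture: thumb-index pinch with hysteresis and a hold requirement."""
    press_dist_m: float = 0.015    # closer than this starts a pinch
    release_dist_m: float = 0.030  # farther than this ends it
    hold_frames: int = 3           # consecutive frames needed to change state
    _pinched: bool = False
    _streak: int = 0

    def update(self, thumb_tip: Point, index_tip: Point) -> bool:
        dist = math.dist(thumb_tip, index_tip)
        if self._pinched:
            wants_flip = dist > self.release_dist_m
        else:
            wants_flip = dist < self.press_dist_m
        self._streak = self._streak + 1 if wants_flip else 0
        if self._streak >= self.hold_frames:
            self._pinched = not self._pinched
            self._streak = 0
        return self._pinched

class SwipeDetector:
    """Dynamic gesture: horizontal swipe judged over a sliding window of frames."""

    def __init__(self, window: int = 12, min_travel_m: float = 0.15) -> None:
        self._xs: deque[float] = deque(maxlen=window)
        self._min_travel_m = min_travel_m

    def update(self, palm_x: float) -> Optional[str]:
        self._xs.append(palm_x)
        if len(self._xs) < (self._xs.maxlen or 0):
            return None  # not enough history yet
        travel = self._xs[-1] - self._xs[0]
        if abs(travel) >= self._min_travel_m:
            self._xs.clear()  # avoid reporting the same swipe twice
            return "right" if travel > 0 else "left"
        return None
```

The gap between the press and release distances is the design choice that matters: a fingertip hovering near a single threshold would otherwise flicker between states.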

Why does hand tracking accuracy drop sharply under bright backlit conditions?

Voice Interfaces

Voice commands solve a fundamental XR input problem: there is no keyboard in VR. But integrating voice in XR is harder than on a smartphone - the headset creates an acoustic chamber, the user moves constantly, cooling fans add noise, and the real-world environment contributes random sounds. Meta uses beamforming across a 4-microphone array to isolate voice from noise. Compact variants of OpenAI's Whisper can run on-device on Quest 3-class hardware, with latency of roughly 200-400 ms and 95%+ accuracy for English.
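The simplest form of the microphone-array technique mentioned above is delay-and-sum: shift each channel so the wavefront from the talker lines up, then average. The sketch below assumes the per-microphone arrival delays are already known as whole samples; real arrays need fractional delays and adaptive filtering, and nothing here reflects how any particular headset implements it.

```python
def delay_and_sum(channels: list[list[float]], delays: list[int]) -> list[float]:
    """Align each channel by its arrival delay (in samples) and average.

    channels[i] is the recording from microphone i; delays[i] is how many
    samples later the target sound reached that microphone.
    """
    if len(channels) != len(delays):
        raise ValueError("need exactly one delay per channel")
    if any(d < 0 for d in delays):
        raise ValueError("delays must be non-negative")
    # Only keep the region where every shifted channel still has data.
    usable = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[d + n] for ch, d in zip(channels, delays)) / len(channels)
        for n in range(max(usable, 0))
    ]
```

Sound from the steered direction adds coherently, while noise arriving with other delays (fans, room sounds) partially cancels in the average.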

The key distinction is between **wake word detection** and **intent recognition**: wake words ("Hey Siri", "Ok Google") run continuously on a small local model; intent recognition requires semantic understanding and often involves a cloud round-trip. In XR, a wake word is critical - pressing a button to activate voice defeats the purpose. End-to-end latency must stay under 300 ms, or the system feels broken.
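A latency budget makes the 300 ms constraint tangible. The sketch below sums per-stage latencies for two hypothetical pipelines, one fully on-device and one with a cloud round-trip. Every stage name and millisecond figure is an invented placeholder for illustration; real values must be measured on the target device.

```python
from dataclasses import dataclass

LATENCY_BUDGET_MS = 300.0  # end-to-end target stated in the text

@dataclass(frozen=True)
class Stage:
    name: str
    latency_ms: float

def total_latency_ms(stages: list[Stage]) -> float:
    """Sum the latencies of stages that run one after another."""
    return sum(stage.latency_ms for stage in stages)

def within_budget(stages: list[Stage], budget_ms: float = LATENCY_BUDGET_MS) -> bool:
    """True if the whole pipeline fits inside the budget."""
    return total_latency_ms(stages) <= budget_ms

# Hypothetical pipelines; the numbers are placeholders, not measurements.
ON_DEVICE = [
    Stage("wake word (local)", 30.0),
    Stage("speech-to-text (local)", 180.0),
    Stage("intent recognition (local)", 40.0),
]
CLOUD = [
    Stage("wake word (local)", 30.0),
    Stage("speech-to-text (local)", 180.0),
    Stage("network round-trip", 120.0),
    Stage("intent recognition (cloud)", 60.0),
]
```

Under these placeholder numbers the on-device pipeline fits the budget and the cloud pipeline does not, which is why the network round-trip is usually the first thing to scrutinize.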

Myth: Voice commands in XR are just a matter of embedding Siri or Google Assistant

Reality: XR requires specialized acoustic processing: microphone beamforming, fan noise cancellation, button-free wake words, and end-to-end latency under 300 ms

Standard voice assistants are optimized for smartphones. VR headsets have different acoustics, different noise sources, and stricter latency requirements - direct integration produces poor UX.

What is the maximum acceptable voice input latency for an XR system to feel responsive?

Key Ideas

  • **Haptics** are not a special effect but a trust mechanism: without tactile feedback virtual objects remain illusions; LRA actuators in controllers deliver 90% of the effect at a fraction of the cost of force-feedback exoskeletons
  • **Spatial audio via HRTF** creates a three-dimensional sound field - personalized HRTF improves localization accuracy by 30-50%; ambisonics enables recording and playback of a full spherical soundscape
  • **Multimodal input** (gestures + voice + gaze) reduces cognitive load and eliminates the need for physical controllers - Apple Vision Pro demonstrates this approach at commercial scale

Related Topics

Multimodal input depends on accurate tracking and rendering infrastructure:

  • XR Tracking — Hand gestures depend on positional tracking quality - inside-out or external tracking determines the accuracy floor
  • XR Rendering — Haptic feedback must synchronize with rendered frames - latency above 20 ms breaks the illusion of touch

Questions for Reflection

  • Haptic feedback for surgical simulators requires force accuracy in single-gram increments - how would you decide between an expensive force-feedback exoskeleton and cheap vibration motors, and at what budget threshold does the tradeoff shift?
  • HRTF personalization improves spatial audio but requires additional hardware during initial setup - how do you balance accuracy against the barrier-to-entry for an end user?
  • If voice, gestures, and gaze operate simultaneously, how do you resolve intent conflicts when a user looks at one object, reaches for another, and speaks a command about a third?
