Computer Graphics
Real-Time Ray Tracing: RTX
In 2018, NVIDIA released the RTX 2080 and for the first time showed Battlefield V with ray-traced reflections in puddles and mirrors. Until then, roughly 30 years of real-time rendering had relied on approximations: screen-space reflections, shadow maps, baked global illumination. Each approximation broke under some conditions, and artists spent days masking the artifacts. RTX upended the industry: the first dedicated ray-tracing cores in a consumer GPU took on work previously reserved for offline renderers such as Pixar's. Cyberpunk 2077 in RT Overdrive mode is a full real-time path tracer - something that was out of reach for consumer hardware just five years earlier.
- **Cyberpunk 2077 RT Overdrive:** full path tracer on RTX 4090 plus DLSS 3 frame generation - the current peak of real-time graphics
- **Unreal Engine 5 Lumen:** hybrid GI solver using signed distance fields plus RT cores on high-end GPUs
- **NVIDIA Omniverse / Quake II RTX:** complete replacement of the legacy rasterization pipeline with ray-traced primary visibility
RT cores: hardware-accelerated intersections
Until 2018, real-time ray tracing was considered fantasy: a single ray-triangle intersection in software costs tens to hundreds of cycles, yet billions of intersection tests per second are needed to render 1080p frames at interactive rates. NVIDIA Turing (RTX 2080, 2018) introduced RT cores - dedicated blocks that execute two narrow operations entirely in hardware: BVH (Bounding Volume Hierarchy) traversal and ray-triangle intersection tests. While the RT cores traverse the tree, the regular SM cores remain free to run hit and miss shaders. NVIDIA quoted roughly a 10x speedup over running the same work on shader cores. AMD answered with RDNA2 Ray Accelerators (2020), and Intel Arc Alchemist shipped RTUs (2022).
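To make the two hardware operations concrete, here is a minimal CPU sketch of the tests involved: a slab test for ray versus bounding box (the per-node step of BVH traversal) and the Möller-Trumbore ray-triangle test. The function names are illustrative, and the actual algorithms inside RT cores are not public, so treat this as one common software formulation rather than a description of the hardware.

```python
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]

def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def ray_aabb(origin: Vec3, direction: Vec3, box_min: Vec3, box_max: Vec3) -> bool:
    """Slab test: does the ray hit the axis-aligned box at some t >= 0?"""
    t_near, t_far = 0.0, float("inf")
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if abs(d) < 1e-12:
            # Ray is parallel to this slab: it must start inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def ray_triangle(origin: Vec3, direction: Vec3,
                 v0: Vec3, v1: Vec3, v2: Vec3, eps: float = 1e-9) -> Optional[float]:
    """Möller-Trumbore test: return the hit distance t, or None on a miss."""
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < eps:          # ray is parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None
```

A traversal loop would call `ray_aabb` on inner nodes to decide which subtrees to visit and `ray_triangle` only on the leaves that survive; that loop is the part the RT cores take over from the shader cores.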
API surface: DirectX 12 Raytracing (DXR, 2018), Vulkan Ray Tracing (VK_KHR_ray_tracing_pipeline, 2020), Metal MPSRayIntersector (2018). DXR and Vulkan define five main shader stages: ray-generation (spawns rays), intersection (custom geometry), any-hit (alpha testing/transparency), closest-hit (shading), miss (background). The programmer writes shaders; the GPU handles BVH traversal on RT cores. NVIDIA quoted roughly 10 billion rays per second for the Turing-based RTX 2080 Ti; the RTX 4090 carries 128 third-generation RT cores with substantially higher throughput.
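The division of labor between the stages can be sketched as a toy CPU model. Everything here is hypothetical scaffolding (`Hit`, `trace_ray`, the string "colors"); in DXR or Vulkan the stages are separate GPU shaders and the loop inside `trace_ray` is the fixed-function traversal. The candidate hits are assumed to have been produced already by the built-in triangle test or by an intersection shader.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hit:
    t: float       # distance along the ray
    alpha: float   # opacity of the surface at the hit point
    color: str     # stand-in for material data

def any_hit(hit: Hit) -> bool:
    """Any-hit stage: accept or reject a candidate (here, a simple alpha test)."""
    return hit.alpha >= 0.5

def closest_hit(hit: Hit) -> str:
    """Closest-hit stage: shade the nearest accepted surface."""
    return f"shade({hit.color})"

def miss() -> str:
    """Miss stage: the ray left the scene, so return the background."""
    return "sky"

def trace_ray(candidates: List[Hit]) -> str:
    """Stand-in for hardware traversal: filter candidates, keep the nearest."""
    best: Optional[Hit] = None
    for hit in candidates:
        if not any_hit(hit):
            continue                     # transparent texel: ray passes through
        if best is None or hit.t < best.t:
            best = hit
    return closest_hit(best) if best is not None else miss()

def ray_generation(pixels: List[List[Hit]]) -> List[str]:
    """Ray-generation stage: spawn one ray per pixel and collect the results."""
    return [trace_ray(candidates) for candidates in pixels]
```

For a pixel whose ray first crosses a mostly transparent leaf (`alpha=0.1`) and then a wall, the any-hit stage discards the leaf and the closest-hit stage shades the wall; a pixel with no candidates falls through to the miss stage.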
What exactly do NVIDIA RT cores accelerate compared with regular CUDA cores?
Denoising: 1 ray per pixel plus a neural network
RT cores accelerate BVH traversal, but the ray budget is still capped by the frame time: about 16.7 milliseconds at 60 fps. At 1080p with 1 ray per pixel that is only about 2 million rays per frame, whereas an offline path tracer fires thousands of rays per pixel. The result is heavy noise: a naive 1 spp (sample per pixel) image looks like static. Denoising turns that noisy image into a clean one, using spatial filters (bilateral, A-trous), temporal filters (TAA-style reprojection), or neural denoisers (NVIDIA OptiX AI Denoiser, Intel Open Image Denoise). NVIDIA Ray Reconstruction (DLSS 3.5, 2023) replaces the hand-tuned denoisers with a single neural model trained on offline renders.
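The budget arithmetic and the reason 1 spp looks like static can be checked with a small experiment. This is a sketch under a deliberately crude assumption: a pixel whose true value is 0.5 and whose every sample returns either 0 or 1 with equal probability. The helper names are made up for the illustration.

```python
import random
import statistics

def rays_per_second(width: int, height: int, spp: int, fps: int) -> int:
    """Primary-ray budget: pixels times samples per pixel times frames per second."""
    return width * height * spp * fps

def estimate_pixel(spp: int, rng: random.Random) -> float:
    """Monte Carlo estimate of a pixel whose true value is 0.5 (assumed toy model)."""
    return sum(1.0 if rng.random() < 0.5 else 0.0 for _ in range(spp)) / spp

def noise(spp: int, trials: int = 2000, seed: int = 1) -> float:
    """Standard deviation of the estimate across many independent pixels."""
    rng = random.Random(seed)
    return statistics.pstdev(estimate_pixel(spp, rng) for _ in range(trials))

budget = rays_per_second(1920, 1080, spp=1, fps=60)   # 124,416,000 rays per second
```

In this model the standard deviation is 0.5 at 1 spp and falls as 1/sqrt(N), so `noise(64)` comes out near 0.06: halving the noise costs four times the rays, which is why real-time renderers reach for a denoiser instead of more samples.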
A typical real-time RT pipeline: G-buffer (normals, albedo, depth) + noisy radiance buffer -> denoiser -> upscaler (DLSS/FSR/XeSS) -> tonemapping; temporal upscalers such as DLSS usually run on HDR data before tonemapping. SVGF (Spatiotemporal Variance-Guided Filtering, Schied et al. 2017) is the leading classical denoiser: it reprojects accumulated radiance from previous frames through motion vectors and then applies a variance-guided spatial filter. NVIDIA NRC (Neural Radiance Cache) goes further - a small MLP trains live during rendering and caches radiance at points in the scene. Quake II RTX, Cyberpunk 2077, and Alan Wake 2 all rely on a combination of denoising and upscaling to reach a playable frame rate.
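The temporal half of such a denoiser boils down to blending each new noisy sample into a history value. Below is a minimal sketch for a single pixel, assuming a static camera so that no motion-vector reprojection is needed; the blend factor `alpha=0.2` is an illustrative choice, not a value taken from SVGF.

```python
import random
from typing import List, Optional

def accumulate(history: Optional[float], sample: float, alpha: float = 0.2) -> float:
    """Exponential moving average: keep (1 - alpha) of the history, add alpha of the new sample."""
    if history is None:          # first frame, or history rejected after disocclusion
        return sample
    return (1.0 - alpha) * history + alpha * sample

def simulate(frames: int, alpha: float = 0.2, seed: int = 7) -> List[float]:
    """Feed noisy 1 spp samples (0 or 1, true mean 0.5) through the accumulator."""
    rng = random.Random(seed)
    history: Optional[float] = None
    out = []
    for _ in range(frames):
        sample = 1.0 if rng.random() < 0.5 else 0.0
        history = accumulate(history, sample, alpha)
        out.append(history)
    return out
```

A smaller `alpha` gives a smoother result but reacts more slowly to lighting changes, which shows up as ghosting; a production denoiser additionally reprojects the history with motion vectors and resets it where the reprojection fails.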
Why is an RTX frame impossible without a denoiser?
Hybrid rendering: rasterization + RT
A full path tracer still needs an order of magnitude more rays than are available in real time. The answer is hybrid rendering: rasterization (fast) produces primary visibility and the G-buffer, while RT (accurate) handles only the effects that rasterization cannot compute and has to fake. Reflections - RT (instead of screen-space reflections, which break at screen edges). Soft shadows - RT (instead of shadow maps with peter-panning). Global illumination - RT (instead of baked lightmaps). Frostbite, Unreal Engine 5 Lumen, and Cyberpunk 2077's standard (non-Overdrive) RT modes are all built around this pattern.
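The hand-off from rasterization to ray tracing can be illustrated with the reflection case: the G-buffer already stores the surface position and normal for each pixel, so the RT pass only has to build one ray from them. A minimal sketch, with hypothetical function names and an arbitrary `bias` offset to keep the ray from re-hitting its own surface:

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]

def normalize(v: Vec3) -> Vec3:
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

def reflect(d: Vec3, n: Vec3) -> Vec3:
    """Mirror direction r = d - 2 (d . n) n, for unit-length d and n."""
    k = 2.0 * sum(a * b for a, b in zip(d, n))
    return tuple(a - k * b for a, b in zip(d, n))

def reflection_ray(world_pos: Vec3, normal: Vec3, camera_pos: Vec3,
                   bias: float = 1e-3) -> Tuple[Vec3, Vec3]:
    """Build a reflection ray from one G-buffer texel (position + normal)."""
    view = normalize(tuple(p - c for p, c in zip(world_pos, camera_pos)))
    n = normalize(normal)
    direction = reflect(view, n)
    # Nudge the origin along the normal to avoid self-intersection.
    origin = tuple(p + bias * c for p, c in zip(world_pos, n))
    return origin, direction
```

Because the ray is traced against the whole scene and not against the screen image, it can hit objects that are off-screen or behind the camera, which is exactly where screen-space reflections fail.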
Deferred rendering remains the pipeline core: rasterization fills the G-buffer (depth, normal, albedo, roughness, metallic), then an RT pass fires about one ray per pixel per effect (reflection, shadow, GI). Modern titles use ReSTIR (Reservoir-based SpatioTemporal Importance Resampling, Bitterli et al. 2020): for each pixel it draws candidate lights from a large pool, keeps one in a reservoir with probability proportional to its estimated contribution, and reuses reservoirs from neighboring pixels and previous frames. This lets a scene with thousands of light sources be shaded with roughly a single shadow ray per pixel. Unreal Lumen traces a simplified signed distance field in software on consumer hardware without RT cores, while RT cores accelerate the more precise triangle-based path on high-end GPUs.
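The core building block of ReSTIR is a weighted reservoir that streams over candidates and keeps exactly one. The sketch below covers only that step, under simplifying assumptions: candidates are drawn uniformly, the weight is just the light's intensity (a real renderer would also factor in distance, orientation, and the BRDF), and the spatial and temporal reservoir merging and the final visibility ray are omitted. `Reservoir` and `pick_light` are illustrative names.

```python
import random
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Reservoir:
    sample: Optional[int] = None  # index of the currently kept light
    w_sum: float = 0.0            # running sum of candidate weights
    count: int = 0                # number of candidates seen so far

    def update(self, candidate: int, weight: float, rng: random.Random) -> None:
        """Stream in one candidate; keep it with probability weight / w_sum."""
        self.w_sum += weight
        self.count += 1
        if weight > 0.0 and rng.random() < weight / self.w_sum:
            self.sample = candidate

def pick_light(intensities: List[float], num_candidates: int,
               rng: random.Random) -> Reservoir:
    """Resample one light out of num_candidates uniformly drawn candidates."""
    reservoir = Reservoir()
    for _ in range(num_candidates):
        i = rng.randrange(len(intensities))
        reservoir.update(i, intensities[i], rng)
    return reservoir
```

With intensities `[1.0, 9.0]` and 32 candidates, the brighter light ends up in the reservoir in roughly nine cases out of ten, so the single shadow ray the pixel can afford is usually spent on the light that matters most.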
Why do leading games use hybrid rendering rather than a full path tracer?
BLAS and TLAS: two-level acceleration structure
For RT cores to intersect rays with a million triangles in real time, the scene must be represented as a BVH (Bounding Volume Hierarchy) - a tree of bounding volumes. Modern APIs split this into two levels. A BLAS (Bottom-Level Acceleration Structure) is a BVH over the triangles of a single geometry (one character mesh, one car mesh). A TLAS (Top-Level Acceleration Structure) is a BVH over BLAS instances, each with its own transform matrix. A thousand copies of one mesh need ONE BLAS and a thousand TLAS instances. This split is critical for dynamic scenes: building a BLAS is expensive (milliseconds), while building a TLAS is cheap (microseconds to a fraction of a millisecond), so rebuilding the TLAS every frame moves objects without rebuilding any BLAS.
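The instancing idea can be sketched with plain data structures. This is a heavily simplified model with hypothetical names: the BLAS is reduced to its object-space bounding box, an instance carries only a translation where real APIs use a 3x4 transform matrix, and the "TLAS" is a flat list of world-space boxes where a real one is a BVH over them.

```python
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]

@dataclass(frozen=True)
class Blas:
    """Bottom level: geometry in object space, built once and shared."""
    name: str
    box_min: Vec3
    box_max: Vec3

@dataclass
class Instance:
    """Top-level entry: a reference to a BLAS plus its placement in the world."""
    blas: Blas
    translation: Vec3

def world_bounds(inst: Instance) -> Tuple[Vec3, Vec3]:
    lo = tuple(a + t for a, t in zip(inst.blas.box_min, inst.translation))
    hi = tuple(a + t for a, t in zip(inst.blas.box_max, inst.translation))
    return lo, hi

def build_tlas(instances: List[Instance]) -> List[Tuple[Vec3, Vec3]]:
    """Cheap per-frame step: recompute world bounds without touching any BLAS."""
    return [world_bounds(inst) for inst in instances]

car = Blas("car", (-1.0, 0.0, -2.0), (1.0, 1.5, 2.0))
instances = [Instance(car, (10.0 * i, 0.0, 0.0)) for i in range(1000)]
tlas = build_tlas(instances)

# Next frame: move one car. Only the TLAS is rebuilt; the shared BLAS is untouched.
instances[0].translation = (0.0, 0.0, 5.0)
tlas = build_tlas(instances)
```

During traversal a ray that enters an instance's world-space box is transformed into that instance's object space and then continues through the shared BLAS, which is why the geometry never has to be duplicated.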
For static geometry (walls, background), the BLAS is built once during loading. For skinned meshes (animated characters), a refit is used - a fast update of the bounding boxes of an existing BLAS without a full rebuild; it gradually degrades tree quality but is roughly an order of magnitude faster. A full rebuild is required only when the geometry changes drastically (such as destruction). NVIDIA RTXMU (RTX Memory Utility) reduces acceleration-structure memory through compaction and suballocation, and AMD publishes its driver-side BVH code as the open-source GPURT library. TLAS rebuild budget: on the order of 0.5 ms for 10K instances on an RTX 4080.
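A refit keeps the tree topology and only recomputes bounding boxes bottom-up after the vertices have moved. The following is a minimal sketch on a hand-built binary tree with one triangle per leaf; `Node`, `tri_bounds`, and `refit` are illustrative names, and GPU drivers do this in parallel over a flattened node array instead of recursing.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]
Triangle = Tuple[Vec3, Vec3, Vec3]

@dataclass
class Node:
    lo: Vec3 = (0.0, 0.0, 0.0)
    hi: Vec3 = (0.0, 0.0, 0.0)
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    tri: Optional[int] = None      # triangle index, set only on leaves

def tri_bounds(tri: Triangle) -> Tuple[Vec3, Vec3]:
    lo = tuple(min(v[axis] for v in tri) for axis in range(3))
    hi = tuple(max(v[axis] for v in tri) for axis in range(3))
    return lo, hi

def refit(node: Node, triangles: List[Triangle]) -> Node:
    """Recompute boxes bottom-up; the tree structure itself is left unchanged."""
    if node.tri is not None:
        node.lo, node.hi = tri_bounds(triangles[node.tri])
    else:
        refit(node.left, triangles)
        refit(node.right, triangles)
        node.lo = tuple(min(a, b) for a, b in zip(node.left.lo, node.right.lo))
        node.hi = tuple(max(a, b) for a, b in zip(node.left.hi, node.right.hi))
    return node
```

If the two triangles drift far apart, the root box grows to cover the empty space between them and rays start visiting nodes they cannot hit; that loss of tree quality is the reason a periodic full rebuild is still needed for heavily deformed meshes.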
Myth: RTX delivers true path tracing in games, just like Blender Cycles offline renders.
Reality: RTX in games is denoised 1 spp combined with hybrid rendering; a path tracer run to convergence without a denoiser is still orders of magnitude too slow for real time, even on an RTX 4090.
An offline path tracer can spend 4096-16384 samples per pixel to converge. RTX in a game uses 1-4 spp and relies on a denoiser plus temporal accumulation plus DLSS upscaling. Cyberpunk RT Overdrive (full path tracer) on an RTX 4090 runs at roughly 20 fps at native 4K without DLSS. With DLSS 3 it exceeds 60 fps, but that is already a combination of Monte Carlo sampling, neural networks, and upscaling, not a pure path tracer.
Why split the acceleration structure into BLAS and TLAS instead of one large BVH?
Key ideas
- **RT cores** hardware-accelerate two specialized operations: BVH traversal and ray-triangle intersection, freeing SM cores for shaders.
- **A denoiser is an architectural element**, not an optional extra: a raw 1 spp image looks like static; SVGF or DLSS Ray Reconstruction turns that noise into a clean image.
- **Hybrid rendering** keeps rasterization for primary visibility and uses RT only for effects that rasterization has to fake (reflections, shadows, GI).
- **BLAS + TLAS** allow reuse of expensive geometry: a thousand mesh instances share one BLAS, while a lightweight TLAS is rebuilt every frame.
Related topics
Real-time ray tracing combines Monte Carlo, spatial structures, and GPU parallelism.
- Path Tracing and Monte Carlo — RTX is a Monte Carlo path tracer at 1 spp plus a denoiser; the theory from cg-12 drives the entire pipeline
- Geometric Algorithms at Scale — BLAS/TLAS is a BVH built on the GPU via Morton codes - the same structures used in spatial indexing
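The Morton-code idea mentioned above can be sketched as follows: quantize each centroid coordinate to 10 bits and interleave the bits into one 30-bit key, so that sorting by the key places spatially close primitives next to each other. This is the usual starting point of LBVH-style GPU builders; the function names are illustrative, coordinates are assumed to be pre-normalized to the unit cube, and the builders inside actual drivers may work differently.

```python
def expand_bits(v: int) -> int:
    """Spread the low 10 bits of v so that bit i moves to bit 3*i."""
    v &= 0x3FF
    v = (v | (v << 16)) & 0x030000FF
    v = (v | (v << 8)) & 0x0300F00F
    v = (v | (v << 4)) & 0x030C30C3
    v = (v | (v << 2)) & 0x09249249
    return v

def morton3d(x: float, y: float, z: float) -> int:
    """30-bit Morton code for a point in the unit cube [0, 1]^3."""
    def quantize(c: float) -> int:
        return min(max(int(c * 1024.0), 0), 1023)
    return ((expand_bits(quantize(x)) << 2)
            | (expand_bits(quantize(y)) << 1)
            | expand_bits(quantize(z)))

def sort_by_morton(centroids):
    """Order primitive indices along the Z-order curve."""
    return sorted(range(len(centroids)), key=lambda i: morton3d(*centroids[i]))
```

After the sort, a builder can form the hierarchy by splitting the sorted key range wherever the highest differing bit changes, which maps well onto GPU parallelism because every split can be found independently.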
Questions for reflection
- RT cores delivered a 10x speedup, yet a denoiser is still required. Will the next generation's bottleneck be memory bandwidth, neural denoisers, or new sampling algorithms?
- Is hybrid rendering a temporary compromise or a long-term architecture? When will rasterization finally yield to a full path tracer?
- DLSS 3.5 Ray Reconstruction replaces the classical denoiser with a neural network. What are the risks and benefits of such a substitution for graphics predictability in games?
Related lessons
- cg-08 — GPU pipeline - foundation for understanding hardware ray tracing
- cg-14 — Deferred Rendering complements RTX in hybrid rendering
- arch-15-gpu-architecture — RT-cores as specialized ALUs: hardware acceleration of the same idea
- cgeom-06 — BVH for RTX is a hierarchy of bounding volumes
- arch-04-cpu