Computational Geometry
Geometric Algorithms at Scale
Uber ingests 28 million GPS points per minute from drivers across the planet. Google Maps stores a billion building polygons and wants to render only those inside the current viewport within 100 milliseconds. UK Health Service tracked a trillion Bluetooth contact pairs - geometric proximities - during the pandemic. All of these would hit a wall with naive linear scans; R-tree, spatial hashing, and GPU parallelism are what turn computational geometry from a textbook subject into production infrastructure. Without them, modern mapping, ride-sharing, and logistics would be technically impossible.
- **PostGIS / MongoDB / ElasticSearch:** R-tree (GiST / 2dsphere) indexes on top of billions of records with millisecond lookups
- **NVIDIA OptiX / Unreal Lumen:** GPU BVH rebuilt every frame for ray tracing in scenes with millions of triangles
- **Apache Sedona / Databricks Mosaic:** distributed spatial joins on petabyte data in real Geo BI pipelines
R-tree: spatial indexing for big data
A linear scan over a billion points or polygons is unacceptable. PostGIS, MongoDB GeoSpatial, ElasticSearch Geo, and SAP HANA Spatial all rely on the R-tree (Antonin Guttman, 1984). It is a balanced B-tree for rectangles: every internal node stores a minimum bounding rectangle (MBR) that covers all its descendants. A range query descends top-down, pruning subtrees whose MBR does not intersect the target region. Tree height is O(log_M N) for fan-out M and N leaves, giving logarithmic search regardless of dimensionality (2D, 3D, or 4D spatio-temporal data).
R-tree quality is determined by insertion and split strategies. Guttman originally proposed a quadratic split running in O(N^2); the R*-tree (Beckmann, 1990) uses heuristic minimization of overlap and dead space, improving query times by 30-40%. The Hilbert R-tree (Kamel-Faloutsos, 1994) orders rectangles along a Hilbert curve, granting spatial locality for bulk loading. PostGIS GiST is an R*-tree implementation on top of a generalized search tree. For static datasets, bulk loading via STR (Sort-Tile-Recursive) yields optimal packing.
Why does the R-tree remain the standard for spatial indexing instead of the simpler KD-tree?
Spatial hashing: a grid as a fast alternative to a tree
R-tree is optimal when object sizes vary and updates are rare. For dynamic systems - game physics, fluid simulation, chat apps with user geolocation - spatial hashing wins: partition space into a uniform grid and keep a list of objects per cell. A radius search becomes O((R/cell)^d) cells visited - constant for a fixed cell size. Updating a position is one remove plus one insert, O(1). Game engines Box2D and Bullet use spatial hashing for broad-phase collision detection. Uber uses geo-hashing (a variant of spatial hashing) to find nearby drivers.
Grid resolution is the key tuning knob. Too coarse a cell means checking thousands of candidates; too fine and a single object lands in dozens of cells. Rule of thumb: average cell size ~ average object size. A hierarchical spatial hash maintains two grids at different resolutions for objects of different sizes. Geohash (Niemeyer, 2008) encodes (lat, lon) as a string: neighboring cells share a common prefix, allowing spatial data to live in Redis ZSET, MongoDB, or Cassandra without specialized indexes.
When is spatial hashing preferable to an R-tree?
GPU parallelism: geometry as dataflow
A modern GPU runs thousands of threads executing the same shader on different data (SIMT). Geometric algorithms map naturally to this model: a vertex shader processes each vertex independently, a fragment shader each pixel, a compute shader an arbitrary task. Convex hull, Delaunay triangulation, and Voronoi diagrams all have efficient GPU implementations. CUDA libraries cugraph, cuSpatial, and RAPIDS execute spatial operations over billion-point datasets in seconds. NVIDIA OmniGraph and Unreal Niagara build particle rendering on top of GPU spatial queries.
Key GPU geometry patterns: parallel scan (prefix sum) for aggregation, parallel reduction for min/max/argmin, parallel sort for ordering by Morton codes, atomic operations for write conflicts. GPU grid and GPU hash extend spatial hashing to the GPU: keys via atomicAdd, buckets populated in parallel. The main bottlenecks are memory bandwidth and divergence: an if/else branch in a shader stalls the warp, so data-dependent branches are avoided whenever possible.
What makes a Morton code (space-filling curve) so useful on a GPU?
Parallel algorithms: divide-and-conquer on big data
When even a GPU is not enough, distributed systems take over: Spark, Dask, Ray, and Apache Sedona (formerly GeoSpark) process spatial data on hundreds of nodes. The key technique is spatial partitioning: split the plane into regions (R-tree, KD-tree, or quad-tree partitioning) so each worker processes its region independently. Boundary objects are then handled in a separate merge step. This is map-reduce with geometric locality. Convex hull on a billion points: parallel sort + Andrew's monotone chain on segments + merge, giving linear scalability up to hundreds of nodes.
Apache Sedona extends Spark SQL with spatial functions: ST_Contains, ST_Distance, and ST_Intersects operate on distributed dataframes. A spatial join (find every polygon pair that overlaps spatially) is a classic O(N*M) operation that reduces to O((N+M) log N) via R-tree partitioning plus nested-loop within each partition. Google BigQuery GIS, Amazon Redshift Spatial, and Databricks Mosaic are commercial implementations of the same model on petabyte data.
With big enough data, just parallelize a naive O(N^2) algorithm - a hundred nodes will handle it
Linear scalability on petabyte data requires structurally efficient algorithms with the right locality; parallelism does not convert O(N^2) into O(N log N)
A parallelized O(N^2) on P nodes is O(N^2 / P) - still quadratic. With N=10^9 that is 10^18/100 = 10^16 operations, years of compute. Real solutions need the right asymptotics first (R-tree partitioning, Morton sort, BVH), achieving O(N log N) sequentially, and only then parallelize across P nodes with high efficiency
Why is spatial partitioning critical for distributed spatial joins?
Key ideas
- **R-tree is the spatial indexing standard:** a balanced B-tree for rectangles, the foundation of PostGIS, MongoDB, and ES.
- **Spatial hashing** wins on dynamic data: O(1) updates versus O(log N) in an R-tree, ideal for physics and geolocation.
- **GPU geometry through Morton codes** gives coalesced memory access and parallel BVH construction in O(N) across thousands of threads.
- **Distributed spatial joins** require spatial partitioning, not hash partitioning, or shuffle becomes O(N*M).
Related topics
Scalable geometry unifies ideas from data structures, GPU computing, and distributed systems.
- Robotics and GIS — The same R-tree and spatial hashing power path planning and GIS queries
- Distributed systems: partitioning — Spatial partitioning is the geometric analog of consistent hashing
Вопросы для размышления
- R-tree is optimal for static data, spatial hashing for dynamic. Where is the boundary at which maintaining both structures simultaneously becomes worthwhile?
- GPU geometry saves orders of magnitude of time but requires rewriting code for shaders/CUDA. Under what conditions would a team choose a well-tuned CPU R-tree over a GPU BVH?
- Distributed spatial joins handle petabytes, yet tuning a spatial partitioner is non-trivial. Which practices help pick the grid resolution and sample rate for R-tree partitioning?
Связанные уроки
- cgeom-09 — Range trees and k-d trees - foundation for scalable geometric queries
- cgeom-11 — Delaunay triangulation - base structure for spatial indexes
- ds-16-graphs-intro — Geometric graphs (k-NN graph) are the standard tool for scaled geometry
- alg-14-dijkstra — A* is Dijkstra with a geometric heuristic
- par-01