System Design

Case Study: Uber

When a rider taps Request in San Francisco at 18:47, 14 driver phones light up within 230 milliseconds. The matching service ran a constrained optimization over 4,300 candidate drivers, refreshed surge multipliers across 8,000 H3 cells, and reserved a $24.50 authorization hold - all before the rider's screen finished the transition animation.

  • Uber engineering disclosed (2019 talk by Matt Ranney) that DISCO, the dispatcher, processes 50,000 matching decisions per second at peak.
  • H3 hexagonal grid (open-sourced by Uber in 2018) indexes the planet at 16 resolutions - resolution 9 hexagons have an edge of 174m, the working unit for rider-driver proximity.
  • Halloween 2014 in SF: surge pricing disabled by a bug for 90 minutes; completion rate dropped from approximately 100% to 80% (Diamond and Hall, NBER 2015).
  • Trip state machine handles 17M trips per day with optimistic locking on a versioned status field - no global trip lock, no central transaction coordinator.

Requirements and Scale

Цели урока

  • Define functional and non-functional requirements
  • Perform back-of-envelope estimation
  • Understand the unique challenges of a ride-sharing platform

Functional Requirements

Uber is a real-time platform connecting drivers and riders. Key features:

Uber at Scale (Real Numbers)

Back-of-Envelope Estimation

Key Technical Challenges

High-Level Architecture

Key Takeaways

  • Uber is a real-time, location-centric system
  • Key challenges: 1M+ location updates/sec, matching in <15 sec, accurate ETA, dynamic pricing
  • Core components: Location Service (Redis Geo), Matching Service, Trip Service, Pricing Service

What is the primary load source in a system like Uber?

Geospatial Indexing

Цели урока

  • Understand different approaches to geospatial indexing
  • Study Geohash, S2, and H3 systems
  • Master Redis Geospatial for location queries

The Problem: Spatial Proximity Queries

The core query in Uber is: "find all drivers within 3 km of point (lat, lng)". How do you answer this quickly with millions of drivers?

Geohash

Geohash encodes coordinates into a string. Nearby points share a common prefix, enabling prefix-based search.

H3 (Uber's Solution)

H3 is a hexagonal hierarchical spatial index developed by Uber. Hexagons solve the edge-effect problem and provide uniform distance.

Redis Geospatial Implementation

H3 + Redis for High Load

Comparing Approaches

Key Takeaways

  • Geospatial indexing transforms O(N) proximity search into O(1) cell lookup
  • Geohash is simple and natively supported in Redis
  • H3 is a hexagonal grid with uniform distance, designed by Uber specifically for ride-sharing
  • The choice depends on requirements: simplicity (Geohash) vs. optimality (H3)

Why does Uber use hexagonal cells (H3) instead of rectangular ones (Geohash)?

Matching Algorithm

Цели урока

  • Understand the driver-rider matching problem
  • Study scoring and ranking of drivers
  • Master batch matching for optimization

The Problem: Optimal Matching

The naive "closest driver" approach performs poorly under high load. Why?

Batch Matching Approach

Driver Scoring

Distance is not the only factor. Uber uses multi-factor scoring:

Matching Service Implementation

Race Conditions and Locking

Key Takeaways

  • Matching in Uber is not simply 'find the nearest driver'
  • Batch matching every 2-3 seconds enables globally optimal assignment
  • Scoring considers ETA, heading, rating, and acceptance rate
  • Optimistic locking prevents race conditions during concurrent matching

Why is batch matching better than sequential matching?

Dynamic Pricing (Surge)

Цели урока

  • Understand supply-demand economics in ride-sharing
  • Study the surge pricing algorithm
  • Master zone-based pricing calculation

Why Surge Pricing Exists

Surge pricing (dynamic pricing) is a mechanism for balancing supply and demand in real time. Without it, the system breaks down during demand spikes.

Zone-Based Surge Calculation

Surge Calculation Algorithm

Fare Calculation

Anti-Gaming Measures

Key Takeaways

  • Surge pricing balances supply and demand in real time
  • The city is divided into zones (H3 cells); the demand/supply ratio and surge multiplier are computed per zone
  • Smoothing prevents abrupt jumps
  • Anti-gaming measures protect against manipulation by both drivers and riders

Why is smoothing applied to the surge multiplier?

Real-Time Architecture

Цели урока

  • Design the trip state machine
  • Understand real-time communication patterns
  • Study fault tolerance for critical operations

Trip State Machine

Every trip moves through a defined set of states. The state machine ensures consistency and correct handling of edge cases.

Real-Time Communication

Trip Service Implementation

Fault Tolerance

Key Takeaways

  • Uber's real-time architecture includes: a Trip State Machine with validated transitions, dual-channel communication (HTTP/2 for updates, WebSocket for push), and optimistic locking for consistency
  • Fault tolerance is achieved through authorization holds, multi-layer fallbacks, and geographic partitioning

Surge pricing exists to extract maximum revenue when demand spikes - it is a profit lever that punishes riders during emergencies.

Surge is a control loop that rebalances driver supply against rider demand in a closed feedback cycle. When the H3-cell ratio of pending requests to idle drivers crosses a threshold, the multiplier rises - which both filters price-sensitive demand and attracts drivers from adjacent cells. The objective function is wait-time minimization, not revenue maximization.

Diamond and Hall (NBER 2015) analyzed 142M Uber trips and showed that disabling surge during the 2014 Halloween outage in San Francisco produced an 80% completion rate vs 100% with surge enabled. The control loop, not the price tag, is what kept the marketplace cleared. If surge were tuned for revenue, Uber would maximize rates in low-supply low-demand pockets - it does the opposite.

Why does Uber use an authorization hold instead of charging directly after the trip?

Where this connects

Uber is the case study where geospatial indexing, real-time pricing as a control loop, and consistency-vs-availability trade-offs all converge in a single workload. It builds directly on YouTube's CDN intuition and feeds back into consistent hashing and Redis-as-state-store patterns.

  • YouTube case study (real-time at scale) — builds-on
  • Consistent hashing for geo-sharding — applies
  • Redis as ephemeral location store — applies
  • Self-organization in markets — analogous-to

What stays in memory

  • H3 hexagonal geospatial indexing replaces lat/lng range queries with O(1) hex-id lookups; resolution 9 (174m edge) is the working unit for proximity.
  • Driver locations live in Redis with TTL 5s; the hot path never touches PostgreSQL, only the trip state does.
  • DISCO matching uses constrained optimization over batched candidates (ETA, rating, vehicle type, surge), not greedy nearest-driver - greedy starves long-waiting riders.
  • Surge pricing is a closed-loop controller: ratio of pending requests to idle drivers per H3 cell drives the multiplier, which throttles demand and attracts supply.
  • Trip state machine uses optimistic locking with a version column - no global lock, but invalid transitions raise ConcurrentModificationError on conflict.
  • Real-time architecture splits transport: HTTP/2 server push for state updates, WebSocket for bidirectional events, Kafka for the event log, Pinot for OLAP queries on trip events.

Вопросы для размышления

  • Surge pricing operates as a control loop, not a revenue lever. If Uber switched the objective from 'minimize wait time' to 'maximize revenue per request', what failure modes would emerge in the marketplace?
  • Geo-sharding via H3 works on Earth's surface, but flight booking and orbital satellite networks face the same nearest-neighbor problem on different geometries. What does H3 lose when projected onto a sphere, and where would a different indexing scheme win?
  • Optimistic locking on trip state was chosen over pessimistic locks. What happens at 50K matching decisions per second when conflicting transitions arrive within a single millisecond, and how does the retry policy avoid a thundering herd on hot drivers?

Связанные уроки

  • sd-15-youtube — YouTube scale patterns: geo-distributed data
  • sd-18-consistent-hashing — Consistent hashing for geo-data sharding
  • net-15-tcp-basics — WebSocket/long polling for real-time location updates
  • db-19-redis — Redis stores current driver positions (TTL = 5 sec)
  • alg-14-dijkstra
Case Study: Uber

0

1

Sign In