Real-Time Backend
WebRTC Scaling
Google Meet handles millions of concurrent calls worldwide. A single SFU in Virginia cannot do it. How is a global media infrastructure built?
- **Livekit Cloud** - 20+ regional PoPs, the client picks the nearest SFU by latency measurement, inter-region backbone over dedicated links, $0.001/participant-minute
- **Jitsi Octo** - cascading topology for webinar scale: UN General Assembly broadcasts use Jitsi with Octo to reach 100K+ listeners across multiple regional nodes
- **Cloudflare Calls** - a WebRTC SFU on Cloudflare Workers and the 300+ PoP anycast network, media enters the nearest edge without you managing your own infrastructure
Cascading SFU
A single SFU handles around 500-1000 participants under typical load (VP8, 720p). A large conference with 5000+ participants needs a **cascading** architecture: several SFUs are linked by server-to-server connections, and each one serves its own group of clients while forwarding media to the others.
The mechanics: SFU-A connects to SFU-B via WebRTC or an internal RTP tunnel. When a user on SFU-A speaks, their audio/video is forwarded to SFU-B, which distributes it to its own clients. Links between SFUs stay minimal: they carry only the active streams, not every participant's media to every node. Livekit calls this **inter-node forwarding**; Jitsi calls it **Octo**.
Jitsi Octo supports conferences up to 100,000 participants (webinars). A real example: UN General Assembly broadcasts use Jitsi with Octo cascading, several SFU nodes in different regions, each holding a group of listeners.
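To make the "only active streams" point concrete, here is a minimal counting sketch. It assumes a full mesh between SFU nodes, one audio plus one video stream per speaker, and that the speaker's home SFU sends each stream once to every other node. The function names and the example conference are hypothetical, not taken from Livekit or Jitsi.

```python
def active_speaker_streams(active_speakers: int, sfu_nodes: int,
                           streams_per_speaker: int = 2) -> int:
    """Streams crossing inter-SFU links when only active speakers are forwarded.

    Assumption: each speaker's home SFU sends every stream once to each other SFU.
    """
    if active_speakers < 0 or sfu_nodes < 1 or streams_per_speaker < 1:
        raise ValueError("counts must be non-negative and sfu_nodes must be >= 1")
    return active_speakers * streams_per_speaker * (sfu_nodes - 1)


def forward_everything_streams(participants_per_node: int, sfu_nodes: int,
                               streams_per_participant: int = 2) -> int:
    """Streams crossing inter-SFU links if every participant were forwarded."""
    if participants_per_node < 0 or sfu_nodes < 1 or streams_per_participant < 1:
        raise ValueError("counts must be non-negative and sfu_nodes must be >= 1")
    total_participants = participants_per_node * sfu_nodes
    return total_participants * streams_per_participant * (sfu_nodes - 1)


if __name__ == "__main__":
    # Hypothetical conference: 4 SFU nodes, 250 participants each, 3 active speakers.
    print(active_speaker_streams(active_speakers=3, sfu_nodes=4))               # 18
    print(forward_everything_streams(participants_per_node=250, sfu_nodes=4))   # 6000
```

Under these assumptions the inter-SFU load depends on the number of speakers and nodes, not on the audience size, which is why cascading suits webinar-style events.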
In a cascading SFU with 1000 participants, how many media streams does SFU-A forward to SFU-B?
Global Distribution
A globally distributed WebRTC infrastructure places SFU nodes in multiple regions. The client connects to the nearest one. This reduces RTT: a participant in Tokyo does not round-trip to a data center in Virginia for every media packet. The SFU in Tokyo holds their connection, and an inter-region trunk delivers media to other regions.
Livekit Cloud runs a global network across 20+ PoPs (Points of Presence): the client connects to the nearest one over WebRTC, and the nodes are linked by a dedicated backbone. A transcontinental Moscow-NYC call takes the path Moscow -> SFU-EU -> backbone -> SFU-US -> NYC. The media RTT is ~30ms (Moscow-EU) + ~60ms (EU-US) + ~20ms (US-NYC) = ~110ms, compared with ~180ms without regional PoPs.
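The Moscow-NYC arithmetic can be checked with a short sketch. The hop values are the approximate figures quoted above; the helper name `path_rtt_ms` is an illustrative assumption.

```python
from typing import List, Tuple


def path_rtt_ms(hops: List[Tuple[str, int]]) -> int:
    """Sum the per-hop RTT estimates (in milliseconds) along a media path."""
    return sum(rtt for _, rtt in hops)


# Approximate per-hop figures from the Moscow-NYC example.
CASCADED_PATH = [
    ("Moscow -> SFU-EU", 30),
    ("SFU-EU -> SFU-US (backbone)", 60),
    ("SFU-US -> NYC", 20),
]
DIRECT_RTT_MS = 180  # quoted estimate without regional PoPs

cascaded_rtt = path_rtt_ms(CASCADED_PATH)
saving_ms = DIRECT_RTT_MS - cascaded_rtt
saving_pct = round(100 * saving_ms / DIRECT_RTT_MS, 1)

if __name__ == "__main__":
    print(cascaded_rtt, saving_ms, saving_pct)  # 110 70 38.9
```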
Cloudflare Calls runs on Workers and Cloudflare's global anycast network with 300+ PoPs. Media enters the nearest edge and then travels over Cloudflare's backbone. That lets them offer a WebRTC SFU without you having to manage media infrastructure.
Why deploy SFUs in multiple regions if the inter-region backbone still adds latency?
Geo Routing
**Geo routing** decides which SFU a client connects to. Options: DNS-based (GeoDNS returns the IP of the nearest SFU based on client IP), HTTP redirect (the signaling server measures RTT to several SFUs and routes to the best one), BGP anycast (one IP, network-level routing).
GeoDNS is the simplest option, but IP-to-region mapping is inaccurate (~15% error rate for mobile). RTT measurement is more accurate: the client sends an HTTP HEAD request to 3-4 regional endpoints and picks the one with the fastest response. That is how Twilio TURN server selection works: the client picks the nearest relay itself.
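A minimal sketch of the client-side selection step follows. It assumes the probe is supplied by the caller (in a real client it would be the HTTP HEAD request to a regional endpoint) and takes the median of a few samples to damp outliers. The region names and RTT values are hypothetical.

```python
import statistics
import time
from typing import Callable, Dict


def measure_rtt_ms(probe: Callable[[], None], samples: int = 3) -> float:
    """Median wall-clock duration of `probe` in milliseconds over several runs."""
    if samples < 1:
        raise ValueError("samples must be >= 1")
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        probe()  # e.g. an HTTP HEAD request to one regional endpoint
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


def pick_region(rtts_ms: Dict[str, float]) -> str:
    """Return the region with the lowest measured RTT."""
    if not rtts_ms:
        raise ValueError("no measurements available")
    return min(rtts_ms, key=rtts_ms.get)


if __name__ == "__main__":
    # Hypothetical measurements for three regional endpoints.
    measured = {"eu-central": 42.0, "us-east": 118.5, "ap-northeast": 210.3}
    print(pick_region(measured))  # eu-central
```

Keeping measurement and selection separate lets the selection logic be tested without network access; a failed probe would need its own handling (for example, dropping that region from the candidates).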
Daily.co publishes data: geo routing by latency measurement cuts median RTT by 35% compared to GeoDNS for mobile users. Especially critical in Asia, where IP-to-region mapping is particularly inaccurate for carrier NAT networks.
Why is latency-based routing more accurate than GeoDNS for picking an SFU?
WebRTC Infrastructure
A full WebRTC infrastructure has several pieces beyond the SFU: **TURN servers** for traversing strict NATs and corporate firewalls (up to 15% of connections need TURN), **STUN servers** to discover the public IP, and a **signaling server** to exchange SDP offer/answer and ICE candidates.
TURN traffic is the most expensive: the full media stream runs through the relay. At 15% penetration and an average 30-minute meeting at 720p (1.5 Mbps), the bill adds up. Twilio and Daily estimate TURN at 20-30% of total infrastructure cost. That is why TURN servers are deployed close to SFU nodes, to minimize extra hops.
- **STUN** - public IP/port discovery (free, UDP 3478)
- **TURN** - relay server for strict NAT (expensive, all traffic flows through it)
- **SFU** - media server, forwards RTP between participants
- **Signaling** - WebSocket/HTTP server for SDP and ICE exchange
- **Recording** - pipeline that captures the streams (optional)
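The TURN figures above (15% of connections, 1.5 Mbps at 720p, 30-minute meetings) can be turned into a rough relay estimate. This sketch deliberately counts a single 1.5 Mbps stream per relayed participant and ignores downstream streams, protocol overhead, and the fact that a relay carries traffic both in and out, so treat the output as a lower bound. The function names and the 2,000-participant example are assumptions.

```python
def relayed_participants(concurrent: int, turn_share: float = 0.15) -> float:
    """Expected number of participants whose media goes through TURN."""
    return concurrent * turn_share


def relay_bitrate_mbps(concurrent: int, turn_share: float = 0.15,
                       stream_mbps: float = 1.5) -> float:
    """Aggregate media bitrate entering the relays, one stream per relayed user."""
    return relayed_participants(concurrent, turn_share) * stream_mbps


def relayed_megabytes_per_meeting(minutes: float = 30,
                                  stream_mbps: float = 1.5) -> float:
    """Data volume one relayed participant sends during a meeting, in megabytes."""
    return stream_mbps * minutes * 60 / 8  # megabits -> megabytes


if __name__ == "__main__":
    # Hypothetical load: 2,000 concurrent participants.
    print(relayed_participants(2_000))
    print(relay_bitrate_mbps(2_000))
    print(relayed_megabytes_per_meeting())
```

For the hypothetical 2,000 participants this gives about 300 relayed users and roughly 450 Mbps of inbound relay traffic, with each relayed participant sending about 337.5 MB over a 30-minute meeting.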
Livekit Cloud charges $0.001 per participant-minute, which covers SFU + TURN + signaling + STUN. At 1M minutes/month that is $1000. Running it yourself at the same load: EC2 + Coturn + network = $600-800, but you also need a DevOps team.
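The cost comparison is simple arithmetic, sketched here using the prices quoted above. The self-hosted range is the $600-800 estimate from the text; staffing cost is left out, and the function names are illustrative.

```python
from typing import Tuple


def managed_cost_usd(participant_minutes: int,
                     price_per_minute: float = 0.001) -> float:
    """Monthly bill at a flat per-participant-minute price."""
    return round(participant_minutes * price_per_minute, 2)


def managed_premium_usd(participant_minutes: int,
                        self_hosted_range: Tuple[float, float] = (600.0, 800.0),
                        price_per_minute: float = 0.001) -> Tuple[float, float]:
    """(min, max) extra monthly spend for managed vs. self-hosted, before staffing."""
    managed = managed_cost_usd(participant_minutes, price_per_minute)
    low, high = self_hosted_range
    return (round(managed - high, 2), round(managed - low, 2))


if __name__ == "__main__":
    print(managed_cost_usd(1_000_000))     # 1000.0
    print(managed_premium_usd(1_000_000))  # (200.0, 400.0)
```

At this load the managed option costs $200-400 more per month on infrastructure alone, so self-hosting only pays off if the operational effort costs less than that.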
Myth: scaling WebRTC = adding more SFU nodes
Reality: scaling WebRTC means running a system of several components: cascading SFU + geo routing + regional TURN + signaling with sticky sessions
Adding SFUs without cascading leaves participants on different SFUs unable to see each other. Without geo routing, clients connect to a distant SFU. Without regional TURN, relay traffic takes unnecessary cross-continent hops.
TURN is 20-30% of infrastructure cost while only 15% of connections need it. Why can't we drop it?
Summary
- **Cascading SFU** links several nodes server-to-server, forwarding only active speakers - linear scaling without N^2 traffic growth
- **Global distribution** - regional SFUs close to clients reduce RTT: a Tokyo client sees 18ms to the nearest SFU instead of 170ms to Virginia
- **Geo routing** via latency measurement cuts median RTT by 35% compared to GeoDNS for mobile users; TURN servers are expensive (20-30% of cost) but necessary for 15% of connections
Related Topics
Scaling WebRTC builds on the fundamentals of distributed systems:
- Adaptive Bitrate - SVC cuts inter-SFU stream count by 3x, directly impacting cascading efficiency
- CDN and Edge - Cloudflare Calls is built on CDN edge nodes: WebRTC as a special case of edge computing
- Load Balancing - geo routing is a specialized load balancer that factors in latency, not just CPU/memory
Questions for Reflection
- With cascading SFUs across 3 nodes of 500 participants each and one active speaker, how many additional RTP streams does cascading create? How does that change with 5 active speakers?
- Livekit uses latency measurement for geo routing, Cloudflare uses anycast. In which scenarios is anycast less effective than an explicit latency test?
- TURN is expensive at 15% utilization. How do you do TURN capacity planning: how much capacity is needed at 10K concurrent participants with a known 15% penetration?