Real-Time Backend
SFU Architecture
Zoom supports 1000 participants in a single meeting. In a Mesh P2P topology each participant would have to send 999 video streams. At 2 Mbps per stream that is 999 * 2 Mbps ≈ 2 Gbps of upload per participant. Impossible. How does Zoom solve it?
- **Google Meet** uses an SFU for meetings with more than 2 participants. Each participant sends just one upstream to the SFU, which selectively delivers the needed streams to everyone. At 100 participants and 1 Mbps per stream: 100 * 1 upstream = 100 Mbps of ingress on the server vs 100 * 99 = 9,900 streams, or 9.9 Gbps in aggregate, in Mesh.
- **LiveKit** (open-source SFU, Apache 2.0) runs on Kubernetes and powers Daily.co, Mux, StreamElements. Written in Go, it handles 1000+ participants on a single c40.large server. GitHub: 8k+ stars in 2 years.
- **Twilio Video** runs an SFU network on AWS across 12 regions. Rooms are automatically routed to the region closest to the majority of participants. For a room with participants in EU and US, cascade between regions kicks in.
- **mediasoup** is used in production at Skype Web (Microsoft) and several European telehealth platforms. One Worker process handles up to 500 concurrent video producers on a modern CPU core. Architecturally it is the closest thing to 'SFU as a library'.
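The upload figures quoted above can be reproduced with a short back-of-the-envelope model. This is only a sketch: it assumes every participant publishes exactly one stream at a fixed bitrate, and the names `meshLoad` and `sfuLoad` are illustrative, not part of any library.

```typescript
// Back-of-the-envelope upload model for Mesh vs SFU.
// Assumption: every participant publishes one stream at `mbpsPerStream`.

interface Load {
  uploadPerClientMbps: number; // what each participant must send
  aggregateUploadMbps: number; // sum of all uploads in the call
}

function meshLoad(participants: number, mbpsPerStream: number): Load {
  const peers = participants - 1; // one copy of the stream per remote peer
  return {
    uploadPerClientMbps: peers * mbpsPerStream,
    aggregateUploadMbps: participants * peers * mbpsPerStream,
  };
}

function sfuLoad(participants: number, mbpsPerStream: number): Load {
  return {
    uploadPerClientMbps: mbpsPerStream, // a single upstream to the SFU
    aggregateUploadMbps: participants * mbpsPerStream, // equals SFU ingress
  };
}

const mesh1000 = meshLoad(1000, 2); // uploadPerClientMbps = 1998, about 2 Gbps
const mesh100 = meshLoad(100, 1);   // aggregateUploadMbps = 9900, i.e. 9.9 Gbps
const sfu100 = sfuLoad(100, 1);     // aggregateUploadMbps = 100
```

The model covers upload only. Download on each client still grows with the number of streams the SFU chooses to forward.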
What is an SFU
An SFU (Selective Forwarding Unit) is a media server that receives streams from participants and selectively forwards them to the subscribers that need them. Unlike an MCU, an SFU does not decode or remix video. It passes the media through in its original encoded form. That makes an SFU cheap on CPU but requires each client to decode N streams.
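The forwarding decision described above can be modeled in a few lines. This is a minimal sketch with hypothetical names (`Room`, `route`, `setAudioOnly`); a real SFU also handles RTCP feedback, SSRC and sequence-number rewriting, and congestion control, none of which is shown here.

```typescript
// Minimal model of selective forwarding. All names are hypothetical.

type Kind = "audio" | "video";

interface Packet {
  senderId: string;
  kind: Kind;
  payload: Uint8Array; // encoded media; the SFU never decodes it
}

class Room {
  // subscriberId -> media kinds that subscriber wants to receive
  private wants = new Map<string, Set<Kind>>();

  join(id: string): void {
    this.wants.set(id, new Set<Kind>(["audio", "video"]));
  }

  setAudioOnly(id: string): void {
    this.wants.set(id, new Set<Kind>(["audio"]));
  }

  // Returns the ids the packet is forwarded to; the payload is left untouched.
  route(packet: Packet): string[] {
    const targets: string[] = [];
    this.wants.forEach((kinds, id) => {
      if (id !== packet.senderId && kinds.has(packet.kind)) targets.push(id);
    });
    return targets;
  }
}

const room = new Room();
["alice", "bob", "carol"].forEach((id) => room.join(id));
room.setAudioOnly("carol");

const video: Packet = { senderId: "alice", kind: "video", payload: new Uint8Array([1, 2, 3]) };
const audio: Packet = { senderId: "alice", kind: "audio", payload: new Uint8Array([4]) };

const videoTargets = room.route(video); // ["bob"]
const audioTargets = room.route(audio); // ["bob", "carol"]
```

Note that `route` only decides who receives a packet. Nothing in the path inspects or transforms `payload`, which is why the per-stream CPU cost stays low.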
An SFU is the industry standard for group video calls of more than 3 participants. Google Meet, Zoom, Twilio Video, Agora, and Daily.co all use an SFU or a variation of it. The key advantage: a participant's upload bandwidth does not grow with the number of participants (only 1 upstream). LiveKit and mediasoup are open-source SFU frameworks; Amazon Chime SDK is a managed AWS service with open-source client SDKs.
In a 10-person SFU conference, one participant switches to 'audio only'. The SFU stops forwarding video to them. How does this affect the others?
mediasoup
mediasoup is a Node.js SFU library with a C++ core for handling RTP. It is not a standalone server. It is a building block from which you assemble your application. It supports simulcast, SVC, DataChannels, and inter-worker pipes for horizontal scaling.
mediasoup is used by Skype Web (Microsoft), Twitch (streaming experiments), and many EdTech platforms. One Worker process on a modern CPU core handles around 500 video producers. LiveKit (the open-source Zoom competitor) rewrote its SFU core in Go for version 1.0, but borrowed ideas from mediasoup for RTP routing.
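Because each Worker is a single process bound to roughly one core, an application has to decide which Worker hosts which room. The sketch below models that bookkeeping only. It does not call mediasoup, every name in it is hypothetical, and the 500-producer ceiling is simply the rough figure quoted above, not a hard limit.

```typescript
// Hypothetical bookkeeping for a one-Worker-per-core deployment.
// This models room placement only; it does not use the mediasoup API.

interface WorkerSlot {
  index: number;     // one slot per CPU core
  producers: number; // video producers currently hosted
}

const MAX_PRODUCERS_PER_WORKER = 500; // rough figure from the text, an assumption

function createPool(cores: number): WorkerSlot[] {
  const pool: WorkerSlot[] = [];
  for (let i = 0; i < cores; i++) pool.push({ index: i, producers: 0 });
  return pool;
}

// Pick the least-loaded worker that still has room for the expected producers.
function assignRoom(pool: WorkerSlot[], expectedProducers: number): WorkerSlot | undefined {
  let best: WorkerSlot | undefined;
  for (const slot of pool) {
    const fits = slot.producers + expectedProducers <= MAX_PRODUCERS_PER_WORKER;
    if (fits && (best === undefined || slot.producers < best.producers)) best = slot;
  }
  if (best !== undefined) best.producers += expectedProducers;
  return best;
}

const pool = createPool(8);
const first = assignRoom(pool, 50);   // lands on worker 0
const second = assignRoom(pool, 50);  // lands on worker 1, the next empty one
const tooBig = assignRoom(pool, 600); // undefined: no single worker can host it
```

A room that does not fit on one worker (the `undefined` case) is where inter-worker pipes come in: the room has to be split across workers and the streams piped between them.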
mediasoup is running on an 8-core server. What is the correct way to use all the cores?
Janus Gateway
Janus is an open-source WebRTC gateway written in C by Meetecho. Unlike mediasoup (a library), Janus is a full server with REST/WebSocket/RabbitMQ APIs and a plugin system. The VideoRoom plugin implements SFU conferences, VideoCall handles 1:1 calls, and Streaming covers live broadcast.
Janus is used in several telehealth platforms and in live-streaming services. Its advantage over mediasoup is being a ready-to-run server with no code to write. The downside is that it is less flexible for custom logic. Jitsi Meet (15M+ MAU), by contrast, does not use Janus: it runs its own SFU, Jitsi Videobridge, written in Java, with Jicofo as the conference focus component that manages sessions.
A startup is building a video chat for 50M potential users. What should they pick: mediasoup or Janus?
Scaling SFUs
A single SFU server is limited by CPU and bandwidth. For thousands of concurrent rooms you need clustering: stateful routing (a room is pinned to a specific SFU node), cascade SFUs (SFUs interconnected), or geographic distribution (CDN-style).
LiveKit (open-source, deploys on AWS/GCP) supports multi-node deployments: each room is hosted on a single node, and Redis serves as the shared data store and message bus between nodes. Zoom builds a geo-distributed SFU with cascade: participants from different regions meet via a cascaded stream between data centers. Agora's global SD-RTN network has 200+ PoPs with SFU nodes and automatic selection of the nearest one.
Myth: "SFUs scale infinitely: just add more servers and everything works."
An SFU requires stateful routing: a room is pinned to a specific node. Horizontal scaling needs consistent hashing, cascade topology, or room migration. None of it is trivial.
Participants' media streams must meet on the same SFU node (or on connected ones via cascade). A naive round-robin load balancer puts Alice on SFU-1 and Bob on SFU-2, and they cannot reach each other's streams. You need coordination via Redis/NATS plus sticky routing.
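One way to get sticky routing is to hash the room id onto a ring of SFU nodes, so every signaling request for the same room resolves to the same node. The following is a minimal sketch of consistent hashing, not a description of how any particular product does it. The node names, the FNV-1a hash, and the virtual-node count of 64 are all assumptions chosen for illustration.

```typescript
// Consistent-hash ring that pins each room to one SFU node.
// Node names and the replica count are illustrative assumptions.

function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // 32-bit multiply without precision loss
  }
  return hash >>> 0; // force unsigned 32-bit
}

class HashRing {
  private ring: { point: number; node: string }[] = [];
  private replicas: number;

  constructor(nodes: string[], replicas: number = 64) {
    this.replicas = replicas;
    nodes.forEach((node) => this.add(node));
  }

  // Each node is placed on the ring many times to even out the distribution.
  add(node: string): void {
    for (let i = 0; i < this.replicas; i++) {
      this.ring.push({ point: fnv1a(`${node}#${i}`), node });
    }
    this.ring.sort((a, b) => a.point - b.point);
  }

  remove(node: string): void {
    this.ring = this.ring.filter((entry) => entry.node !== node);
  }

  // The first ring entry at or after the room's hash owns the room.
  nodeFor(roomId: string): string {
    if (this.ring.length === 0) throw new Error("no SFU nodes registered");
    const h = fnv1a(roomId);
    let lo = 0;
    let hi = this.ring.length;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].point < h) lo = mid + 1;
      else hi = mid;
    }
    return this.ring[lo % this.ring.length].node; // wrap around past the end
  }
}

const ring = new HashRing(["sfu-1", "sfu-2", "sfu-3"]);
const aliceNode = ring.nodeFor("room-42"); // Alice's join request
const bobNode = ring.nodeFor("room-42");   // Bob's request resolves to the same node
```

Unlike round-robin, removing or adding a node only remaps the rooms whose ring segment changed. Rooms that are already live on a node still need their state handled separately, since hashing decides placement, not migration.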
Google Meet supports 500 participants in a single meeting. Does the SFU have to decode video from all 500 to forward it?
Summary
- **SFU** forwards encoded RTP packets without decoding. A participant's upload is 1 stream regardless of how many peers are in the room. CPU load is O(streams), not O(pixels).
- **mediasoup** is an SFU library (C++ Workers + Node.js API). Maximum control over routing logic, simulcast, and SVC. Requires writing the SFU code.
- **Janus Gateway** is a ready-made SFU server with plugins (VideoRoom, Streaming). Quick start, less flexibility for custom logic.
- **Scaling** requires stateful routing (consistent hashing by room_id) and cascade topology for cross-datacenter rooms.
Related Topics
An SFU is the centerpiece of group video calls. Understanding it requires familiarity with adjacent layers:
- MCU vs SFU vs Mesh: An SFU is one of three topologies; comparing them clarifies the trade-offs and when to pick each
- Media Streams: A MediaStream from getUserMedia is added via addTrack() and arrives at the SFU as a Producer; the SFU returns Consumers for other participants
- STUN and TURN: Every client connects to the SFU over ICE. Because the SFU itself acts as the public endpoint, most clients behind NAT can reach it directly; TURN is mainly needed when a restrictive firewall blocks UDP or direct traffic to the SFU
Questions for Reflection
- mediasoup spawns a Worker per CPU core. With 50 participants in a room and 8 Workers, how do you distribute Producers and Consumers to minimize cross-worker pipe transport?
- An SFU knows the bitrate of each Producer through RTCP. How can that knowledge be used to adapt video quality for a participant with a poor network without disrupting the others?
- How does a cascade SFU solve the problem of participants in different regions, and what additional latency does it add compared to Mesh P2P between nearby participants?