Real-Time Backend
Design: Video Conferencing
In April 2020 Zoom served 300 million daily meeting participants, a number close to the entire population of the US. In about three months the platform grew roughly 20-fold. How did the system not collapse under that load, and what would you have to design to repeat it?
- Zoom (2020): 300M daily meeting participants, an SFU architecture with global edge clusters and consistent hashing by roomId
- Google Meet: migrating from DTLS/SRTP to QUIC transport dropped latency by 15 to 20% on poor user connections
- AWS Chime SDK: managed SFU + Amazon Transcribe real-time transcription + recording pipeline through S3 and Lambda, everything as an API
Video conferencing architecture: SFU vs MCU
In April 2020 Zoom processed 300 million meeting participants in a single day. This is not just 'video calls', it is the problem of routing hundreds of thousands of media streams in real time with under 150 ms of latency. The architectural choice decides whether the system holds or collapses.
MCU vs SFU
- MCU (Multipoint Control Unit): The server decodes every incoming stream, mixes them into one composed stream and sends each participant a single video. Server load: O(N) decodes + 1 mix, plus re-encoding the composed stream. The client receives one stream, which saves bandwidth. The problem: CPU cost explodes past roughly 100 participants.
- SFU (Selective Forwarding Unit): The server only routes packets; it neither decodes nor mixes. Each client receives up to N-1 streams and renders the layout itself. Client load grows with N, but the server's per-stream cost stays low because it does no transcoding. Zoom, Google Meet and AWS Chime all moved to SFU.
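To make the comparison concrete, the stream counts for both designs can be tallied for a given room size. This is a back-of-the-envelope sketch under two assumptions that are not in the original text: every participant publishes one stream and subscribes to everyone else, and the MCU sends the same composed layout to all participants.

```python
def sfu_stream_counts(n: int) -> dict:
    """Stream counts for an SFU when everyone subscribes to everyone else."""
    return {
        "server_decodes": 0,             # packets are forwarded, never decoded
        "server_inbound": n,             # one published stream per participant
        "server_outbound": n * (n - 1),  # each stream goes to n-1 subscribers
        "client_inbound": n - 1,         # the client renders the layout itself
    }


def mcu_stream_counts(n: int) -> dict:
    """Stream counts for an MCU that sends everyone one composed video."""
    return {
        "server_decodes": n,             # every incoming stream is decoded for mixing
        "server_inbound": n,
        "server_outbound": n,            # one composed stream per participant
        "client_inbound": 1,
    }
```

For 50 participants the SFU forwards 2,450 streams but decodes none, while the MCU decodes 50 streams and sends out 50 composed ones. The SFU trades server CPU for bandwidth, which is why simulcast layer selection matters so much in practice.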
Real systems use a hybrid: SFU for distribution plus simulcast with server-side layer selection. Each participant publishes three versions of their video (1080p/360p/180p), and the SFU picks the right layer for each subscriber based on that subscriber's bandwidth. This is called **Adaptive Bitrate (ABR) forwarding**.
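A minimal sketch of the layer choice an SFU might make per subscriber. The bitrates and the `headroom` factor are illustrative assumptions, not values from any particular platform; only the three resolutions come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    bitrate_kbps: int  # assumed target bitrate, for illustration only


# Ordered from highest to lowest quality.
LAYERS = (
    Layer("1080p", 2500),
    Layer("360p", 500),
    Layer("180p", 150),
)


def pick_layer(available_kbps: float, headroom: float = 0.8) -> Layer:
    """Return the best layer that fits the subscriber's estimated bandwidth.

    `headroom` leaves part of the estimate unused so that small bandwidth
    dips do not immediately cause packet loss.
    """
    budget = available_kbps * headroom
    for layer in LAYERS:
        if layer.bitrate_kbps <= budget:
            return layer
    # Nothing fits: fall back to the lowest layer rather than sending nothing.
    return LAYERS[-1]
```

With these numbers, `pick_layer(1000)` selects the 360p layer because the 800 kbps budget is below the assumed 1080p bitrate. The publisher keeps sending all three layers; only the forwarding decision changes per subscriber.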
WebRTC stack
Google Meet migrated from DTLS/SRTP over UDP to QUIC transport in 2020. QUIC has built-in stream multiplexing with no head-of-line blocking between streams: a lost audio packet does not stall video delivery. The result: latency dropped 15 to 20% on poor connections.
Geo-distribution
Zoom runs regional clusters (US, EU, APAC) with a dedicated backbone between them. The client connects to the nearest edge server via WebRTC, and from there media travels over the provider's private network. That gives predictable latency instead of the variable latency of the public internet.
What is the main advantage of SFU over MCU with 50 participants on a call?
Room Management: meeting state
A video conference room is a distributed object with state: list of participants, each one's media tracks, permissions (mute/unmute, host controls), chat history. With 100,000 active rooms at the same time, the problem becomes syncing state across multiple edge servers.
Room data structure
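A minimal sketch of what such a room record could look like, built from the fields listed above (participants, media tracks, permissions, chat history). All class and field names are hypothetical and chosen for illustration; a real system would serialize this into Redis hashes and sets.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class Role(Enum):
    HOST = "host"
    CO_HOST = "co_host"
    PARTICIPANT = "participant"


@dataclass
class Participant:
    user_id: str
    role: Role = Role.PARTICIPANT
    audio_muted: bool = False
    video_muted: bool = False
    track_ids: List[str] = field(default_factory=list)  # published media tracks


@dataclass
class Room:
    room_id: str
    participants: Dict[str, Participant] = field(default_factory=dict)
    waiting_room: List[str] = field(default_factory=list)      # user ids awaiting approval
    chat: List[Tuple[str, str]] = field(default_factory=list)  # (user_id, message)

    def join(self, user_id: str, role: Role = Role.PARTICIPANT) -> Participant:
        """Add a participant, or return the existing record on a rejoin."""
        return self.participants.setdefault(user_id, Participant(user_id, role))

    def set_audio_muted(self, user_id: str, muted: bool) -> None:
        self.participants[user_id].audio_muted = muted
```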
Storage and sync
Room state lives in Redis with a TTL. On a participant join, HSET and SADD run atomically (for example inside a MULTI/EXEC transaction), then a pub/sub event goes to every edge server serving the room. Each SFU keeps a local in-memory copy of room state as a cache for fast forwarding decisions.
- **Redis Cluster**: primary store for room state, TTL = call duration + 1 hour
- **Pub/Sub**: events (join/leave/mute/unmute) are broadcast to every room SFU through Redis Pub/Sub or Kafka
- **Waiting Room**: a separate queue in Redis, the host sees the list and approves one by one
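- **Active Speaker Detection**: the SFU tracks the RMS audio level of each participant (available without decoding when clients report it in an RTP header extension) and re-evaluates the 'main speaker' every 500 ms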
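The active speaker rule from the list above can be sketched as follows. The smoothing factor `alpha` and the class name are assumptions made for illustration; the 500 ms interval comes from the text. The sketch assumes a per-participant audio level is already available and does not decay levels for participants who stop sending audio.

```python
import math
from typing import Dict, Iterable, Optional


def rms(samples: Iterable[float]) -> float:
    """Root-mean-square level of a block of audio samples."""
    values = list(samples)
    if not values:
        return 0.0
    return math.sqrt(sum(v * v for v in values) / len(values))


class ActiveSpeakerDetector:
    def __init__(self, hold_ms: int = 500, alpha: float = 0.3) -> None:
        self.hold_ms = hold_ms  # minimum time between speaker switches
        self.alpha = alpha      # smoothing weight for the newest level
        self.levels: Dict[str, float] = {}
        self.current: Optional[str] = None
        self.last_switch_ms: Optional[int] = None

    def observe(self, user_id: str, level: float, now_ms: int) -> Optional[str]:
        """Record a new audio level and return the current main speaker."""
        previous = self.levels.get(user_id, 0.0)
        # Exponential smoothing keeps a single loud packet from winning.
        self.levels[user_id] = (1 - self.alpha) * previous + self.alpha * level
        loudest = max(self.levels, key=self.levels.get)
        can_switch = (
            self.last_switch_ms is None
            or now_ms - self.last_switch_ms >= self.hold_ms
        )
        if loudest != self.current and can_switch:
            self.current = loudest
            self.last_switch_ms = now_ms
        return self.current
```

The hold interval is what prevents the layout from flickering when two people talk over each other: a louder participant only takes over once 500 ms have passed since the last switch.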
Signaling: WebSocket coordination
On reconnect after a connection drop, the participant gets a new WebRTC peer connection, but room state in Redis is preserved. The signaling server detects the reconnect by userId and restores the context. The participant keeps their place in the waiting room queue and keeps the co-host role.
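The reconnect behavior can be illustrated with a small in-memory model. Here a shared dictionary stands in for Redis and a per-instance dictionary stands in for live peer connections; the class, method and field names are hypothetical.

```python
from typing import Dict, Tuple

Key = Tuple[str, str]  # (roomId, userId)


class SignalingServer:
    """One signaling instance. `store` is shared, `peers` is local."""

    def __init__(self, store: Dict[Key, dict]) -> None:
        self.store = store              # stand-in for Redis room state
        self.peers: Dict[Key, str] = {}  # live peer ids on this instance

    def connect(self, room_id: str, user_id: str, peer_id: str) -> Tuple[bool, dict]:
        """Register a peer and return (is_reconnect, restored context)."""
        key = (room_id, user_id)
        is_reconnect = key in self.store
        context = self.store.setdefault(
            key, {"role": "participant", "waitingPosition": None}
        )
        self.peers[key] = peer_id       # the peer is always new
        return is_reconnect, context

    def drop(self, room_id: str, user_id: str) -> None:
        """Forget the peer; the room state in the shared store is untouched."""
        self.peers.pop((room_id, user_id), None)
```

Because the context lives in the shared store, the reconnect may land on a different signaling instance and still find the participant's role and queue position.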
Scaling under peak load
Zoom faced a roughly 20x load increase within about three months in early 2020. Stateless signaling scales horizontally with little effort: any instance can serve any request because the state sits in Redis. The problem is the SFU: each media server holds WebRTC peer connections in memory, and migrating a peer between servers without a reconnect is impossible.
Zoom's solution: consistent hashing by roomId on the SFU cluster. All participants of a room always land on the same set of SFU servers. On a server failure clients reconnect in under 5 seconds, and room state is restored from Redis.
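A minimal sketch of consistent hashing by roomId, assuming a ring with virtual nodes and SHA-256 as the hash. The class name, the server names and the `vnodes` and `replicas` parameters are illustrative assumptions, not Zoom's actual implementation.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() varies between runs)."""
    return int.from_bytes(hashlib.sha256(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes: Iterable[str], vnodes: int = 100) -> None:
        self._vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (point, node) pairs
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        # Many virtual points per node spread the load more evenly.
        for i in range(self._vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(point, n) for point, n in self._ring if n != node]

    def lookup(self, room_id: str, replicas: int = 1) -> List[str]:
        """Return the first `replicas` distinct nodes clockwise from the room's point."""
        if not self._ring:
            raise ValueError("ring is empty")
        start = bisect.bisect_left(self._ring, (_hash(room_id), ""))
        found: List[str] = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == replicas:
                    break
        return found
```

The property that matters for media servers is that removing a failed node only remaps the rooms that lived on it. Rooms on healthy servers keep their peer connections, so only the affected participants go through the reconnect.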
Why are signaling servers easier to scale than SFU media servers?
Recording Pipeline: meeting recording
Recording a video conference looks like 'just save the stream'. In practice you have to mix N video tracks plus audio, handle participant reconnects, apply a layout (grid/speaker view), transcode to MP4 and deliver the result to the user. AWS Chime SDK and Zoom expose the recording pipeline as a separate service.
Two recording modes
- Cloud Recording: A dedicated recorder instance connects to the SFU as a regular participant, receives every track over WebRTC, mixes them on the server (compositor) and writes to S3. Transcoding runs in the background on FFmpeg workers. Result: an MP4 in the cloud, accessible by link.
- Local Recording: The client records the MediaStream through the MediaRecorder API right in the browser. Format: WebM (VP8+Opus). There are no server costs, but there is also no server-side layout, recording stops when the connection is lost, and the file stays on the client's disk.
Cloud Recording Pipeline
Edge case handling
- **A participant leaves and returns**: the recorder marks a gap, the compositor inserts a black frame or a placeholder during the interval
- **Layout switch** (speaker view to grid): the compositor updates the FFmpeg filter graph parameters on the fly via sendcmd
- **Screen sharing**: a separate video track with higher quality priority (1080p vs 720p for the camera)
- **Recorder instance crash**: a backup recorder connects with an offset, the final compositor merges both segments
Zoom stores recordings in S3 with a lifecycle policy: 30 days in hot storage (S3 Standard), then Glacier for archive. Transcoding jobs are dispatched through an SQS queue. At peak (the end of the workday) the worker pool scales out through AWS Auto Scaling based on queue length.
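Scaling by queue length boils down to a small calculation. The sketch below assumes a fixed number of queued jobs per worker and hard minimum and maximum pool sizes; the function name and all numbers are illustrative, and a real deployment would express the same rule as an Auto Scaling policy rather than application code.

```python
import math


def desired_workers(
    queue_length: int,
    jobs_per_worker: int = 5,
    min_workers: int = 1,
    max_workers: int = 100,
) -> int:
    """Return how many transcoding workers the current backlog calls for."""
    if queue_length < 0:
        raise ValueError("queue_length must be non-negative")
    needed = math.ceil(queue_length / jobs_per_worker)
    # Clamp so the pool never drops to zero or grows without bound.
    return max(min_workers, min(max_workers, needed))
```

With these defaults a backlog of 23 jobs asks for 5 workers, while an end-of-day spike of 10,000 jobs is capped at 100.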
Transcription and AI processing
After the MP4 is ready, the audio track is extracted and sent to Whisper or Amazon Transcribe. The result is WebVTT subtitles plus speaker timestamps (speaker diarization). This makes it possible to build AI meeting summaries. AWS Chime SDK offers Amazon Transcribe as a managed service with real-time transcription during the call.
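As an illustration of the last step, diarized transcript segments can be turned into a WebVTT file with a few lines of code. The input shape (start, end, speaker, text) is an assumption made for this sketch; actual transcription services return their own formats that would need mapping first.

```python
from typing import Iterable, Tuple

Segment = Tuple[float, float, str, str]  # (start_sec, end_sec, speaker, text)


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def to_webvtt(segments: Iterable[Segment]) -> str:
    """Render segments as WebVTT cues, using voice spans for speaker names."""
    lines = ["WEBVTT", ""]
    for start, end, speaker, text in segments:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(end)}")
        lines.append(f"<v {speaker}>{text}")
        lines.append("")  # a blank line ends the cue
    return "\n".join(lines)
```

The `<v Speaker>` voice span is the standard WebVTT way to attach a speaker to a cue, so the diarization result survives in the subtitle file itself.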
How does a recorder instance in cloud recording connect to the SFU?
Breakout Rooms: virtual subgroups
Breakout rooms split participants into subgroups with isolated audio/video. In 2020 Zoom's breakout rooms became the killer feature for online education. Behind the simple UX sits a non-trivial problem: switching 100 participants between rooms in seconds without losing state.
Breakout rooms architecture
Each breakout room is a fully isolated room with its own SFU context. When N breakout rooms are created, the system reserves N extra media-routing slots on the existing SFU cluster. Isolation is enforced at the WebRTC level: each participant gets a separate peer connection for the breakout room.
Participant switching process
- **Signal**: the host triggers the split, the server creates N room records in Redis
- **Assignment**: participants receive a BREAKOUT_ASSIGNED event over WebSocket with the target roomId
- **Disconnect from main**: the client closes the peer connection to the main SFU context
- **Reconnect**: the client creates a new WebRTC peer connection to the breakout SFU context (a new SDP offer/answer)
- **State preservation**: main room chat history is kept, the breakout room starts with an empty chat
- **Return**: on the 'end breakouts' signal, the reverse happens, everyone reconnects to main
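The switching steps above can be sketched from the server's and the client's point of view. The `BREAKOUT_ASSIGNED` event name comes from the text; the round-robin split, the `breakout-N` room ids and the event fields are assumptions made for illustration.

```python
from typing import Dict, List


def assign_round_robin(participants: List[str], room_count: int) -> Dict[str, List[str]]:
    """Split participants across `room_count` breakout rooms as evenly as possible."""
    if room_count < 1:
        raise ValueError("room_count must be at least 1")
    rooms: Dict[str, List[str]] = {f"breakout-{i + 1}": [] for i in range(room_count)}
    room_ids = list(rooms)
    for index, user_id in enumerate(participants):
        rooms[room_ids[index % room_count]].append(user_id)
    return rooms


def build_assignment_events(main_room_id: str, rooms: Dict[str, List[str]]) -> List[dict]:
    """One WebSocket event per participant, carrying the target roomId."""
    return [
        {
            "type": "BREAKOUT_ASSIGNED",
            "userId": user_id,
            "fromRoomId": main_room_id,
            "roomId": room_id,
        }
        for room_id, users in rooms.items()
        for user_id in users
    ]


def client_steps(event: dict) -> List[str]:
    """What a client does on receiving the event, in order."""
    return [
        f"close peer connection to {event['fromRoomId']}",
        f"create peer connection to {event['roomId']} (new SDP offer/answer)",
    ]
```

The return to the main room is the same sequence with the room ids swapped, which is why the server only needs to remember each participant's original room.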
Host broadcast
The host can send a message to every breakout room at once. Implementation: the signaling server receives a BROADCAST_TO_BREAKOUTS event and publishes it to the Redis Pub/Sub channels of every child room. The signaling instances subscribed to each channel deliver the message to that room's participants over WebSocket. Delivery latency: under 100 ms.
Timer and auto-return
Zoom supports a breakout session timer: the host sets a limit (for example, 10 minutes). 60 seconds before the end, every participant receives a countdown notification. On expiry, an automatic RETURN_TO_MAIN signal fires. Implementation: a deferred task in Redis (a key with EXPIRE plus keyspace notifications to react to the expiry) or a scheduled job in the task queue.
Performance on mass breakout creation: creating 50 breakout rooms for 500 participants takes about 2 to 3 seconds. The bottleneck is not the SFU (contexts are created instantly), it is the fan-out of WebSocket notifications through the signaling layer. Fix: parallel broadcast with batching.
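A minimal sketch of parallel broadcast with batching, using asyncio. The `send` callable stands in for a WebSocket send and is an assumption, as are the batch size and the concurrency limit.

```python
import asyncio
from typing import Awaitable, Callable, Sequence


async def broadcast_batched(
    recipients: Sequence[str],
    send: Callable[[str], Awaitable[None]],
    batch_size: int = 100,
    max_parallel_batches: int = 8,
) -> int:
    """Send to all recipients in concurrent batches; return how many succeeded."""
    batches = [
        recipients[i:i + batch_size]
        for i in range(0, len(recipients), batch_size)
    ]
    # Cap concurrency so a large fan-out cannot exhaust sockets or memory.
    limiter = asyncio.Semaphore(max_parallel_batches)

    async def send_batch(batch: Sequence[str]) -> int:
        async with limiter:
            results = await asyncio.gather(
                *(send(user_id) for user_id in batch),
                return_exceptions=True,  # one dead socket must not fail the batch
            )
        return sum(1 for result in results if not isinstance(result, Exception))

    delivered = await asyncio.gather(*(send_batch(batch) for batch in batches))
    return sum(delivered)
```

Returning the success count instead of raising lets the caller retry only the failed recipients, which matters when 500 participants have to move within the same few seconds.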
Misconception: breakout rooms are separate physical servers allocated per subgroup.
Reality: breakout rooms are isolated logical contexts (routing namespaces) on the same SFU cluster as the main room.
Allocating a separate physical server for each breakout room of 3 to 4 people is economically pointless and creates a cold start problem (10 to 30 seconds to spin up). SFU contexts are created in memory instantly. Isolation lives at the routing table level, not in physical separation.
When a participant moves from the main room to a breakout room, what happens to the WebRTC peer connection?
Takeaways
- **SFU vs MCU**: SFU forwards RTP without decoding, so server cost per stream stays low. MCU mixes streams, saving client bandwidth at the cost of a CPU explosion at scale. Every major platform picked SFU.
- **Room state in Redis**: signaling is stateless because room state lives in Redis and events flow through Pub/Sub, which lets signaling scale horizontally. SFU is stateful because of its peer connections and scales via consistent hashing.
- **Recording = WebRTC participant**: the cloud recorder connects as a headless WebRTC client to the SFU, receives tracks, the compositor mixes them through FFmpeg, the result lands in S3 -> transcoder queue -> MP4.
- **Breakout rooms = logical namespaces**: created in SFU cluster memory instantly, switching a participant = a new WebRTC handshake (1 to 3 sec). Host broadcast fans out through Redis Pub/Sub.
Related topics
Video conferencing brings together several core areas of distributed systems:
- WebRTC and P2P protocols — Transport stack: DTLS/SRTP encryption, ICE/STUN for NAT traversal, QUIC as an alternative
- Redis Pub/Sub and queues — Room state sync between SFU clusters, fan-out of events to participants, recording job queue
- Consistent Hashing — Routing participants of one room to a single set of SFU servers to minimize inter-server traffic
- CDN and S3 delivery — Storing and serving recording files: S3 hot storage -> Glacier archive -> CloudFront CDN for playback
Questions for reflection
- If you had to support 1,000 participants in a single room (webinar), how would the SFU architecture change? What would hit the bottleneck first?
- The recording pipeline stores raw chunks in S3 before transcoding. What trade-off does this create between storage cost and reliability on a compositor failure?
- Breakout rooms require a new WebRTC handshake on switch. How can this process be sped up so that the switch is less noticeable to the user?