Real-Time Backend

Adaptive Bitrate

Half the participants on a Zoom call are on mobile with unstable networks. How do you serve HD video to some without freezing others?

**Google Meet + VP9 SVC** - one stream with hierarchical layers instead of 3 simulcast streams, 30% browser CPU savings, 500-participant support in enterprise tier
**Twilio Video simulcast** - three parallel streams (150/600/2000 Kbps), the SFU picks the right one for each receiver, documented freeze-rate reduction from 8% to 0.9%
**Discord Go Live** - adaptive bitrate for game streaming: on network loss it switches from 720p60 to 360p30 in <200ms via the Transport-CC feedback loop

Bandwidth Estimation

To pick the right video quality, the browser has to know how much bandwidth is available right now. WebRTC uses two estimation algorithms: **REMB** (Receiver Estimated Maximum Bitrate), where the receiver measures inter-packet delays and reports them to the sender over RTCP, and **Transport-CC** (Transport-Wide Congestion Control), which is more accurate. The receiver returns the timestamp of each packet and the sender calculates bandwidth itself.

Google's **GCC** (Google Congestion Control) is the algorithm used by Chrome and most SFUs. It analyzes the trend of inter-packet delays: if delays grow, the network is congested, drop the bitrate by 15%. If they are stable, ramp up smoothly. Reaction to congestion is fast (100ms), recovery is slow (8 seconds), to avoid triggering another congestion event.

Google Meet switched from REMB to Transport-CC in 2018. Estimation accuracy improved from ±30% to ±8%, and video freezes dropped by 40% per their internal metrics.

Why is Transport-CC more accurate than REMB for estimating available bandwidth?

Simulcast Layers

**Simulcast** has the sender (browser or mobile app) encode the same video stream into several qualities at once and ship all three to the SFU. The SFU picks which layer to forward to each receiver based on their available bandwidth. This is the key adaptation mechanism in modern video conferencing.

The standard configuration is three spatial layers: **f** (full, 1080p 2Mbps), **h** (half, 540p 600Kbps), **q** (quarter, 270p 150Kbps). A 4G user gets the f-layer, a user on slow WiFi gets the q-layer. The SFU switches layers at keyframe boundaries to avoid artifacts.

Twilio Video documents that with simulcast, CPU on the browser grows by 15-25% (encoding three streams), but freezes disappear for participants on poor networks. In enterprise scenarios this is worth it. On mobile, simulcast is often capped at 2 layers.

The SFU receives three simulcast layers from one participant. What happens when it switches a specific receiver from the f-layer to the q-layer?

Scalable Video Coding (SVC)

**SVC** (Scalable Video Coding) is an alternative to simulcast. Instead of three independent streams, the sender ships one stream with hierarchical layers. The **base layer** (270p) is self-contained. **Enhancement layers** add quality but depend on the base layer. The SFU can drop enhancement layers right in the network, no transcoding required.

VP9 and AV1 support spatial SVC natively. H.264 SVC (via the SVC extension) is much less common. Google Meet has used VP9 SVC since 2020: one stream from the sender instead of three with simulcast, saving roughly 30% of CPU on the sending side. The SFU drops enhancement-layer packets for participants with poor networks.

Google Meet with VP9 SVC supports up to 500 participants in Google Workspace Enterprise. With simulcast and 500 participants, each would send 3 streams = 1500 streams hitting the SFU. With SVC, 500 streams. A 3x difference in infrastructure cost.

What is the main advantage of SVC over simulcast on the sender (browser) side?

Quality Adaptation

**Quality adaptation** is the logic on the SFU that decides which simulcast layer or SVC layer to forward to each receiver at every moment. Inputs: Transport-CC feedback from the receiver, the size of the rendered tile on the receiver's screen, and speaker priority (the active speaker gets more bitrate).

**Bandwidth allocation** in Zoom: the active speaker gets 70% of available bandwidth, the rest share 30%. When the speaker changes, reallocation happens within 200ms. Google Meet works differently: all videos are equal, but the thumbnail size determines the requested layer. The smaller the tile on screen, the lower the layer forwarded.

Livekit (open-source SFU) publishes data: with proper quality adaptation, the 95th-percentile video freeze rate on poor networks drops from 8% to 1.2%. The key parameter is the speed of reaction to bandwidth changes. Too fast = frequent artifacts, too slow = long freezes.

Adaptive bitrate just means dropping quality when the network is bad

ABR is a system: bandwidth estimation (Transport-CC/GCC) + smart allocation by priority + layer switching at keyframe boundaries + smooth recovery

Naively dropping the bitrate without priorities, keyframe boundaries, and smooth ramp-up creates constant artifacts and bad UX even when the bandwidth estimate is correct.

Why does Zoom allocate 70% of bandwidth to the active speaker instead of splitting evenly?

Summary

**Bandwidth estimation** (Transport-CC + GCC) - the browser continuously measures available bandwidth via RTCP feedback and adjusts the bitrate
**Simulcast** - three independent streams from the sender, the SFU forwards the right one; **SVC** - one hierarchical stream, the SFU drops enhancement layers
**Quality adaptation** on the SFU allocates bandwidth by priority: the active speaker gets more, layer switching only at keyframes

Вопросы для размышления

Google Meet uses VP9 SVC, Zoom uses H.264 simulcast. Which approach yields less CPU load on mobile devices with 10 participants, and why is the answer not obvious?
Transport-CC GCC reduces the bitrate by 15% at the first sign of congestion. How does that interact with TCP Slow Start if there is a TCP tunnel under WebRTC?
Design a quality adaptation algorithm for streaming gaming content. How would the priorities differ from video conferencing?

Связанные уроки

net-24-http2-http3