System Design

Case Study: YouTube

YouTube ships one billion hours of watch time per day. At peak, 50 million concurrent streams at roughly 5 Mbps each demand about 250 Tbps of bandwidth, more than the total backbone capacity of most countries. Centralized delivery is not merely slow; at this scale it is infeasible.

Cisco VNI 2022: YouTube and Netflix together produce over 25% of global downstream internet traffic.
Google Global Cache servers sit inside 1500+ ISPs, providing free hardware in exchange for offloaded peering traffic.
Covington et al. 2016 reported that 70% of YouTube watch time originates from algorithmic recommendations, not search or subscriptions.
FFmpeg with NVIDIA NVENC on a single NVENC-capable GPU encodes 1080p H.264 at many times real-time (on the order of 40x). Hardware acceleration of this kind is what makes an upload rate of 500 hours per minute tractable.

Requirements and Scale

Lesson Objectives

Define functional and non-functional requirements
Perform back-of-envelope estimation
Understand the unique challenges of a video platform

Functional Requirements

YouTube is more than just video hosting - it is a complex ecosystem with several key capabilities:

YouTube at Scale (Real Numbers)

YouTube is one of the most storage/bandwidth-intensive systems in the world:

Back-of-Envelope Estimation

For comparison: total internet traffic worldwide is roughly 1 Exabyte per day. YouTube generates a significant fraction of all global traffic.
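
The storage side of the estimate can be sketched in a few lines. This is only a rough model: the 500 hours-per-minute upload rate is the figure quoted in this lesson, while the encoding ladder and its bitrates are illustrative assumptions, not YouTube's actual settings.

```python
# Back-of-envelope storage estimate for one day of uploads.
# The bitrates below are illustrative assumptions, not YouTube's real ladder.

UPLOAD_HOURS_PER_MINUTE = 500  # figure quoted in this lesson

# Hypothetical encoding ladder: rendition -> average bitrate in Mbps.
LADDER_MBPS = {
    "1080p": 5.0,
    "720p": 2.5,
    "480p": 1.0,
    "360p": 0.6,
    "240p": 0.3,
}


def hours_uploaded_per_day(hours_per_minute: float) -> float:
    return hours_per_minute * 60 * 24


def bytes_per_hour_of_video(ladder_mbps: dict[str, float]) -> float:
    """Bytes needed to store one hour of video in every rendition."""
    total_bits_per_second = sum(ladder_mbps.values()) * 1_000_000
    return total_bits_per_second / 8 * 3600


def daily_storage_petabytes(hours_per_minute: float,
                            ladder_mbps: dict[str, float]) -> float:
    total_bytes = (hours_uploaded_per_day(hours_per_minute)
                   * bytes_per_hour_of_video(ladder_mbps))
    return total_bytes / 1e15


if __name__ == "__main__":
    pb = daily_storage_petabytes(UPLOAD_HOURS_PER_MINUTE, LADDER_MBPS)
    print(f"{hours_uploaded_per_day(UPLOAD_HOURS_PER_MINUTE):,.0f} hours/day")
    print(f"~{pb:.1f} PB/day, ~{pb * 365 / 1000:.2f} EB/year")
```

Under these assumptions the result is about 3 PB of new encoded video per day, a little over 1 EB per year, before replication or the original source files are counted. Changing the ladder changes the answer, which is the point of keeping the inputs explicit.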

Key Technical Challenges

Key Takeaways

YouTube operates at exabyte storage scale and at hundreds of terabits per second of peak bandwidth
Key challenges: upload processing pipeline, adaptive streaming, a global CDN network, and ML-based recommendations

Approximately how much storage is needed to hold one day's worth of uploaded videos (all quality levels), in petabytes?

Video Upload Pipeline

Lesson Objectives

Design resumable upload for large files
Understand the transcoding pipeline and parallel encoding
Master async job processing for video

The Problem: Uploading Large Files

A YouTube video can be up to 12 hours long. At 1080p, that is roughly 60 GB. Uploading such a file over a typical home connection takes hours. What happens if the connection drops?
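
One way to survive a dropped connection is to upload in chunks and let the server report how many bytes it has committed, so the client resumes from that offset instead of starting over. The sketch below simulates this in memory; `UploadSession` and `upload_with_resume` are hypothetical names, and a real protocol would carry the offset in HTTP headers rather than method calls.

```python
# Minimal in-memory simulation of a resumable, chunked upload.
# UploadSession and upload_with_resume are hypothetical names for illustration.


class UploadSession:
    """Server-side state: the bytes committed so far for one upload."""

    def __init__(self, total_size: int) -> None:
        self.total_size = total_size
        self.data = bytearray()

    @property
    def committed(self) -> int:
        return len(self.data)

    @property
    def complete(self) -> bool:
        return self.committed == self.total_size

    def put_chunk(self, offset: int, chunk: bytes) -> int:
        """Append a chunk only if it starts at the committed offset.

        Returns the next offset the server expects, which is what the
        client uses to decide where to resume.
        """
        if offset == self.committed:
            self.data.extend(chunk)
        return self.committed


def upload_with_resume(payload: bytes, session: UploadSession,
                       chunk_size: int, fail_on_attempts: set[int]) -> int:
    """Upload payload chunk by chunk and return the number of attempts.

    fail_on_attempts simulates dropped connections: on those attempt
    numbers the chunk never reaches the server.
    """
    attempts = 0
    while not session.complete:
        offset = session.committed  # ask the server where to resume
        chunk = payload[offset:offset + chunk_size]
        attempts += 1
        if attempts in fail_on_attempts:
            continue  # connection dropped; retry from the committed offset
        session.put_chunk(offset, chunk)
    return attempts
```

A failed attempt costs one chunk of retransmission rather than the whole file, so the chunk size trades per-request overhead against the amount of work lost on each drop.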

Transcoding Pipeline Architecture

After upload, the video passes through a processing pipeline. This takes anywhere from minutes to hours depending on length.

Transcoding Job Implementation
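
As a minimal sketch of such a job, the code below fans one source video out into several renditions that are encoded in parallel. The ladder, the `Rendition` type, and `encode_rendition` are all hypothetical; the encode step is a stub where a real worker would invoke an encoder.

```python
# Sketch of a transcoding job that encodes renditions in parallel.
# encode_rendition is a stand-in; a real worker would call an encoder here.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    height: int
    bitrate_kbps: int


# Hypothetical ladder; real ladders depend on source resolution and codec.
LADDER = [
    Rendition("1080p", 1080, 5000),
    Rendition("720p", 720, 2500),
    Rendition("480p", 480, 1000),
    Rendition("360p", 360, 600),
]


def encode_rendition(source: str, rendition: Rendition) -> str:
    """Pretend to encode one rendition and return the output path."""
    return f"{source}.{rendition.name}.mp4"


def transcode(source: str, source_height: int,
              max_workers: int = 4) -> dict[str, str]:
    """Encode every rendition that does not upscale the source, in parallel."""
    targets = [r for r in LADDER if r.height <= source_height]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = pool.map(lambda r: encode_rendition(source, r), targets)
        return {r.name: path for r, path in zip(targets, outputs)}
```

Because each rendition is independent of the others, the job parallelizes cleanly; in a queue-based deployment each rendition could equally be its own message handled by a separate worker.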

HLS Segment Generation
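
For illustration, the helper below builds (but does not run) an FFmpeg command that packages one rendition as HLS. The options used belong to FFmpeg's HLS muxer; the file names, the 6-second segment length, and the function name `hls_command` are assumptions chosen for the example.

```python
# Build, without executing, an FFmpeg command that packages one rendition as HLS.
# File names and the 6-second segment length are illustrative assumptions.


def hls_command(source: str, out_dir: str, height: int,
                bitrate_kbps: int, segment_seconds: int = 6) -> list[str]:
    return [
        "ffmpeg", "-i", source,
        "-vf", f"scale=-2:{height}",         # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", f"{bitrate_kbps}k",
        "-c:a", "aac",
        "-f", "hls",
        "-hls_time", str(segment_seconds),   # target segment duration
        "-hls_playlist_type", "vod",         # finished playlist, not live
        "-hls_segment_filename", f"{out_dir}/seg_%04d.ts",
        f"{out_dir}/index.m3u8",
    ]


if __name__ == "__main__":
    print(" ".join(hls_command("input.mp4", "out/720p", 720, 2500)))
```

Segments can only be cut at keyframes, so the actual segment duration follows the encoder's keyframe interval; pipelines that need aligned segments across renditions usually pin that interval explicitly.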

Optimizations for High Load
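
One such optimization is a priority queue in front of the encoders. The sketch below is a minimal version built on `heapq`; the class name and the policy of encoding low renditions first (so a video becomes watchable quickly) are assumptions for illustration, not a description of YouTube's scheduler.

```python
# Minimal priority queue for transcoding jobs.
# The "low renditions first" policy in the usage below is an assumed example.
import heapq
import itertools
from typing import Optional


class TranscodeQueue:
    """Priority queue of jobs; a lower priority number runs first."""

    def __init__(self) -> None:
        self._heap: list[tuple[int, int, str]] = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, job_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))

    def next_job(self) -> Optional[str]:
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


if __name__ == "__main__":
    queue = TranscodeQueue()
    queue.submit("videoA-1080p", priority=2)
    queue.submit("videoA-360p", priority=0)
    queue.submit("videoB-360p", priority=0)
    while (job := queue.next_job()) is not None:
        print(job)
```

The monotonically increasing counter keeps jobs of equal priority in submission order, which avoids starving an early upload behind later ones at the same level.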

Key Takeaways

The video upload pipeline consists of: resumable upload for reliability, async transcoding with parallel processing of different quality levels, and HLS segmentation for streaming
Priority queues and GPU clusters enable the scale of 500 hours per minute

Why is transcoding performed asynchronously rather than immediately during upload?

Adaptive Bitrate Streaming

Lesson Objectives

Understand the HLS and DASH streaming protocols
Study adaptive bitrate switching algorithms
Master buffer management strategies

The Problem: Varying Internet Quality

Users watch video from all kinds of devices and networks: 5G, WiFi, slow mobile data. How do you deliver the best experience to each one?

HLS (HTTP Live Streaming)

Manifest Files
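
An HLS master playlist lists one entry per quality level, each pointing at that rendition's own media playlist. The sketch below generates one; the `#EXTM3U` and `#EXT-X-STREAM-INF` tags come from the HLS specification, while the URIs, bitrates, and the function name are made up for the example.

```python
# Generate a minimal HLS master playlist.
# URIs and bandwidth values in the usage below are illustrative only.


def master_playlist(renditions: list[tuple[str, int, int, int]]) -> str:
    """renditions: (uri, bandwidth_bits_per_second, width, height)."""
    lines = ["#EXTM3U"]
    for uri, bandwidth, width, height in renditions:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},"
            f"RESOLUTION={width}x{height}"
        )
        lines.append(uri)  # media playlist for this rendition
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(master_playlist([
        ("360p/index.m3u8", 800_000, 640, 360),
        ("720p/index.m3u8", 2_800_000, 1280, 720),
        ("1080p/index.m3u8", 5_500_000, 1920, 1080),
    ]))
```

The player reads this file once, then fetches segments from whichever media playlist its ABR logic currently prefers.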

ABR Algorithms

How does the player decide when to switch to a different quality level? There are several approaches:

ABR Implementation
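
A hybrid policy can be sketched as follows: a throughput estimate proposes a quality level, and the buffer level decides whether it is safe to act on it. The ladder, thresholds, and safety factor are illustrative assumptions, and `choose_bitrate` is a hypothetical name rather than any player's actual API.

```python
# Hybrid ABR sketch: throughput proposes a bitrate, buffer level can veto it.
# Ladder, thresholds, and safety factor are illustrative assumptions.

BITRATES_KBPS = [300, 600, 1000, 2500, 5000]  # hypothetical ladder, ascending


def harmonic_mean(samples: list[float]) -> float:
    """Harmonic mean damps short throughput spikes (samples must be > 0)."""
    return len(samples) / sum(1.0 / s for s in samples)


def choose_bitrate(throughput_kbps: list[float], buffer_seconds: float,
                   current_kbps: int, safety: float = 0.8,
                   low_buffer: float = 5.0, high_buffer: float = 15.0) -> int:
    if buffer_seconds < low_buffer:
        return BITRATES_KBPS[0]  # close to stalling: drop to the lowest rung

    budget = harmonic_mean(throughput_kbps) * safety
    affordable = [b for b in BITRATES_KBPS if b <= budget]
    target = affordable[-1] if affordable else BITRATES_KBPS[0]

    if target > current_kbps and buffer_seconds < high_buffer:
        return current_kbps  # only switch up with a comfortable buffer
    return target
```

Switching down is allowed immediately while switching up requires a full buffer, an asymmetry that favors avoiding rebuffering over chasing peak quality.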

Optimizations for Fast Startup

Key Takeaways

Adaptive Bitrate Streaming splits video into 2-10 second segments, each available in multiple quality levels
The ABR algorithm (buffer-based, throughput-based, or hybrid) selects the optimal quality for each segment, balancing picture quality against the risk of rebuffering

Why is the buffer-based approach better than pure throughput-based?

CDN and Edge Caching

Lesson Objectives

Design a multi-tier CDN architecture
Understand video-specific CDN optimizations
Study ISP peering and Google Global Cache

The Problem: 250 Tbps Bandwidth

Serving roughly 250 Tbps (50 million streams at about 5 Mbps each) from a handful of centralized data centers is not feasible: latency would be high and backbone links would saturate. The solution is to deliver content from as close to the user as possible.

Multi-Tier CDN Architecture
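
The read path through such a hierarchy can be sketched as a chain of read-through caches. The tier names mirror this lesson, but the lookup order (GGC, then edge, then regional, then origin) and the `CacheTier` class are assumptions made for the example, and eviction is left out to keep it short.

```python
# Sketch of a read-through cache hierarchy without eviction.
# The lookup order (GGC -> edge -> regional -> origin) is an assumption.
from typing import Optional


class CacheTier:
    def __init__(self, name: str, parent: Optional["CacheTier"] = None) -> None:
        self.name = name
        self.parent = parent
        self.store: dict[str, bytes] = {}

    def get(self, key: str) -> tuple[bytes, str]:
        """Return the object and the name of the tier that actually had it."""
        if key in self.store:
            return self.store[key], self.name
        if self.parent is None:
            raise KeyError(key)  # not even the origin has this object
        data, served_by = self.parent.get(key)
        self.store[key] = data  # fill this tier on the way back to the client
        return data, served_by


origin = CacheTier("origin")
regional = CacheTier("regional", parent=origin)
edge = CacheTier("edge", parent=regional)
ggc = CacheTier("ggc", parent=edge)
```

The first request for an object travels all the way to the origin and populates every tier on the way back; later requests from the same ISP stop at the nearest tier.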

Google Global Cache (GGC)

YouTube places servers directly inside ISP networks. This significantly reduces latency and costs for both parties.

Video-Specific CDN Optimizations
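
Two of these optimizations, segment-level caching and prefetching the next segment, fit in a small sketch. The cache below is keyed by (video, quality, segment index) and uses LRU eviction; the class name, the always-prefetch-one-ahead policy, and the `fetch_from_parent` callback are hypothetical.

```python
# Segment-level LRU cache with one-segment-ahead prefetch.
# SegmentCache and its prefetch policy are illustrative assumptions.
from collections import OrderedDict
from typing import Callable

SegmentKey = tuple[str, str, int]  # (video_id, quality, segment_index)


class SegmentCache:
    """LRU cache keyed by individual segment rather than by whole video."""

    def __init__(self, capacity: int,
                 fetch_from_parent: Callable[[SegmentKey], bytes]) -> None:
        self.capacity = capacity
        self.fetch_from_parent = fetch_from_parent
        self._store: "OrderedDict[SegmentKey, bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def _put(self, key: SegmentKey, data: bytes) -> None:
        self._store[key] = data
        self._store.move_to_end(key)
        while len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict least recently used

    def get(self, key: SegmentKey, prefetch_next: bool = True) -> bytes:
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)
            data = self._store[key]
        else:
            self.misses += 1
            data = self.fetch_from_parent(key)
            self._put(key, data)
        if prefetch_next:
            video_id, quality, index = key
            next_key = (video_id, quality, index + 1)
            if next_key not in self._store:
                self._put(next_key, self.fetch_from_parent(next_key))
        return data
```

Because the unit of caching is a segment, the opening seconds of a popular video can stay hot while its rarely watched tail is evicted, which whole-file caching cannot express.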

CDN Selection Algorithm
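
A selection step might weigh network distance, current load, and whether the node already holds the requested segment. The scoring below is a sketch only: the `CacheNode` fields, the penalty values, and the load cutoff are assumed for illustration.

```python
# Sketch of choosing a cache node for one segment request.
# Field names, penalties, and the load cutoff are illustrative assumptions.
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheNode:
    name: str
    rtt_ms: float      # measured round-trip time to the client
    load: float        # utilization in [0, 1]
    has_segment: bool  # whether the requested segment is already cached


def score(node: CacheNode, miss_penalty_ms: float = 50.0,
          load_penalty_ms: float = 100.0) -> float:
    """Estimated cost of serving from this node; lower is better."""
    cost = node.rtt_ms + load_penalty_ms * node.load
    if not node.has_segment:
        cost += miss_penalty_ms  # the node must first fetch from its parent
    return cost


def select_node(nodes: list[CacheNode], max_load: float = 0.9) -> CacheNode:
    """Pick the cheapest node, skipping overloaded ones when possible."""
    if not nodes:
        raise ValueError("no cache nodes available")
    healthy = [n for n in nodes if n.load < max_load]
    return min(healthy or nodes, key=score)
```

The nearest node does not always win: an overloaded in-ISP cache or a node that would miss can lose to a slightly more distant one that already holds the segment.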

Key Takeaways

YouTube uses a multi-tier CDN: Edge PoPs (1000+), Regional Caches (100+), and Google Global Cache inside ISP networks
Video-specific optimizations: segment-level caching, quality-aware prioritization, predictive prefetching
Popularity is heavily skewed, even more sharply than the classic 80/20 rule: roughly 10% of videos account for 80% of traffic

Why does YouTube cache individual segments rather than whole videos?

Recommendation System

Lesson Objectives

Understand the multi-stage recommendation pipeline
Study candidate generation and ranking
Master real-time personalization

Why Recommendations Matter

70% of watch time on YouTube comes from recommendations. They are the primary driver of engagement and revenue. The challenge: select 20 videos from over 800 million that a given user will actually want to watch.

Stage 1: Candidate Generation
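
Candidate generation is often framed as a nearest-neighbor search between a user embedding and video embeddings. The sketch below does this by brute force with a dot product; the embeddings and the function name are made up, and a corpus of hundreds of millions of videos would need an approximate nearest-neighbor index rather than a full scan.

```python
# Brute-force candidate generation by embedding similarity.
# Embeddings are toy values; large corpora need an approximate index instead.
import heapq
from typing import Iterable


def dot(a: Iterable[float], b: Iterable[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vec: list[float],
                        video_vecs: dict[str, list[float]],
                        k: int,
                        exclude: frozenset = frozenset()) -> list[str]:
    """Return the k video ids whose embeddings best match the user's."""
    scored = (
        (dot(user_vec, vec), video_id)
        for video_id, vec in video_vecs.items()
        if video_id not in exclude  # e.g. videos the user already watched
    )
    return [video_id for _, video_id in heapq.nlargest(k, scored)]
```

This stage optimizes for recall and speed: it only has to make sure good videos survive into the candidate set, leaving precise ordering to the ranking stage.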

Stage 2: Ranking

From 10,000 candidates, the top 500 must be selected. This uses a complex ML model that predicts the probability of engagement.
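
As a rough sketch of this stage, the code below scores each candidate by engagement probability times video length, a crude proxy for expected watch time, and keeps the top N. The feature names, weights, and bias are hypothetical and hand-set; a production ranker would learn them from data.

```python
# Toy ranking stage: order candidates by a crude expected-watch-time proxy.
# Feature names and weights are hypothetical; real rankers learn them.
import math

WEIGHTS = {"topic_match": 2.0, "channel_affinity": 1.5, "freshness": 0.5}
BIAS = -2.0


def engagement_probability(features: dict[str, float]) -> float:
    """Logistic model over a handful of features."""
    z = BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


def expected_watch_seconds(features: dict[str, float],
                           duration_seconds: float) -> float:
    return engagement_probability(features) * duration_seconds


def rank(candidates: dict[str, tuple[dict[str, float], float]],
         top_n: int) -> list[str]:
    """candidates maps video id -> (features, duration in seconds)."""
    ordered = sorted(
        candidates,
        key=lambda vid: expected_watch_seconds(*candidates[vid]),
        reverse=True,
    )
    return ordered[:top_n]
```

Scoring by expected watch time rather than raw probability means a short video that is very likely to be clicked can still rank below a longer one with a lower click probability.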

Stage 3: Re-ranking
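
One common re-ranking concern is diversity, so the final page is not filled by a single channel. The sketch below applies a per-channel cap to an already ranked list; the cap value and the function name are assumptions, and real re-rankers typically apply further business rules as well.

```python
# Re-ranking sketch: enforce a per-channel cap on an already ranked list.
# The cap of two videos per channel is an illustrative assumption.
from collections import Counter


def rerank(ranked: list[tuple[str, str]], limit: int,
           max_per_channel: int = 2) -> list[str]:
    """ranked holds (video_id, channel_id) pairs in ranking order."""
    per_channel: Counter = Counter()
    result: list[str] = []
    for video_id, channel_id in ranked:
        if len(result) >= limit:
            break
        if per_channel[channel_id] >= max_per_channel:
            continue  # this channel already has its share of slots
        per_channel[channel_id] += 1
        result.append(video_id)
    return result
```

The pass is a single linear scan, which suits a stage that has only a few hundred items and a tight latency budget.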

Real-time Personalization

Architecture Overview

Key Takeaways

YouTube recommendations use a multi-stage funnel: Candidate Generation (800M → 10K), Ranking (10K → 500), Re-ranking (500 → 20)
The primary metric is watch time, not clicks
Signals used include collaborative filtering, content-based, and contextual features
Everything must run in under 100ms

Common misconception: the ranking model is the heart of the recommender, so the key to better recommendations is a deeper neural network with more layers and parameters.

Covington, Adams and Sargin (2016) showed that at YouTube scale, gains from architecture were dwarfed by gains from feature engineering. Sampling negatives, including search history as embeddings, modeling video age, and training on impressions rather than only clicks each outperformed layer-depth experiments.

When the candidate pool is 800M and serving budget is 100ms, model capacity is bounded by inference latency. The remaining headroom lives in signal quality - how watch time, freshness, and context are encoded - not in extra hidden units.

Why does YouTube optimize for watch time rather than clicks?

Where this connects

YouTube is the canonical petabyte-scale case study. It builds on Twitter's fan-out vocabulary, leans on CDN edge caching theory, and feeds directly into Uber's real-time delivery patterns and operator-level GPU scheduling.

Twitter case study (read-heavy distribution) — builds-on
CDN architecture — applies
Message queues for async pipelines — applies
GPU architecture and SIMT — analogous-to

What stays in memory

Storage scales to exabytes per year because every uploaded video lives in 5+ qualities plus HLS segments - the playback ladder is the real storage multiplier.
Resumable chunked upload (308 Resume Incomplete) plus async transcoding via Kafka decouples upload reliability from encode latency.
Adaptive Bitrate Streaming combines throughput estimation with buffer-based switching - buffer level is the safety margin, not just a quality indicator.
Multi-tier CDN (Edge PoPs to Regional caches to GGC inside ISPs) achieves 90%+ hit rate by caching segments individually, not whole files.
The recommendation funnel collapses 800M videos to 20 in under 100ms through Candidate Generation, Ranking, and Re-ranking - watch time is the optimization target, not clicks.

Questions for Reflection

Why does YouTube prioritize segment-level caching over whole-video caching, and how does the 50% drop-off in viewer retention shape that choice?
If 4K bandwidth costs roughly 3x more than 1080p but only 8% of users have monitors capable of resolving 4K, how should the transcoding queue prioritize quality ladders?
Recommendations are optimized for watch time rather than click-through rate. What second-order effects on creator behavior does that objective produce, and how would the system detect drift?

Related Lessons

sd-14-twitter — Twitter feed patterns precede video scale
sd-08-cdn — CDN is the foundation of video delivery: edge nodes near users
sd-09-message-queue — Video processing pipeline via queues (transcode, thumbnail)
arch-15-gpu-architecture — GPU for ML recommendations - same parallelism as transcoding
dist-11-replication