System Design
Case Study: YouTube
YouTube serves one billion hours of watch time per day. At peak, 50 million concurrent streams demand roughly 250 Tbps of bandwidth (assuming about 5 Mbps per stream), more than the total backbone capacity of most countries. Centralized delivery is not merely slow; at this scale it is infeasible.
- Cisco VNI 2022: YouTube and Netflix together produce over 25% of global downstream internet traffic.
- Google Global Cache servers sit inside 1500+ ISP networks: Google supplies the hardware for free, and in exchange the ISP offloads traffic that would otherwise cross its peering links.
- Covington et al. 2016 reported that 70% of YouTube watch time originates from algorithmic recommendations, not search or subscriptions.
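- FFmpeg with NVIDIA NVENC on a single NVENC-equipped GPU encodes 1080p H.264 at roughly 40x real-time (the A100 itself ships without NVENC encoder units, so a different GPU is assumed here); hardware encoding of this kind is what makes the 500-hours-per-minute upload throughput tractable.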
Requirements and Scale
Lesson Objectives
- Define functional and non-functional requirements
- Perform back-of-envelope estimation
- Understand the unique challenges of a video platform
Functional Requirements
YouTube is more than just video hosting - it is a complex ecosystem with several key capabilities:
YouTube at Scale (Real Numbers)
YouTube is one of the most storage/bandwidth-intensive systems in the world:
Back-of-Envelope Estimation
For comparison: total internet traffic worldwide is roughly 1 Exabyte per day. YouTube generates a significant fraction of all global traffic.
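The storage side of this estimate can be sketched in a few lines. This is a rough sketch only: the 500 hours per minute figure comes from this lesson, while the per-rendition sizes in `ASSUMED_GB_PER_HOUR` are illustrative assumptions, not published YouTube numbers.

```python
# Back-of-envelope storage estimate for one day of uploads.
# The per-hour sizes are illustrative assumptions, not YouTube figures.

UPLOAD_HOURS_PER_MINUTE = 500  # figure quoted in this lesson

# Assumed storage cost of one hour of video per rendition, in GB.
ASSUMED_GB_PER_HOUR = {
    "2160p": 7.0,
    "1080p": 2.5,
    "720p": 1.25,
    "480p": 0.5,
    "360p": 0.25,
}


def daily_upload_hours(hours_per_minute: float) -> float:
    """Hours of new video arriving per day."""
    return hours_per_minute * 60 * 24


def daily_storage_pb(hours_per_minute: float, gb_per_hour: dict) -> float:
    """Petabytes needed per day to store every rendition of every upload."""
    gb_per_source_hour = sum(gb_per_hour.values())
    total_gb = daily_upload_hours(hours_per_minute) * gb_per_source_hour
    return total_gb / 1_000_000  # 1 PB = 1,000,000 GB (decimal units)


def yearly_storage_eb(hours_per_minute: float, gb_per_hour: dict) -> float:
    """Exabytes accumulated over 365 days at a constant upload rate."""
    return daily_storage_pb(hours_per_minute, gb_per_hour) * 365 / 1000
```

Under these assumptions the ladder costs 11.5 GB per source hour, so 720,000 uploaded hours per day come to about 8.3 PB per day, or roughly 3 EB per year. Changing the assumed sizes moves the result proportionally, which is the point of keeping them as explicit parameters.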
Key Technical Challenges
Key Takeaways
- YouTube operates at exabyte storage scale and at hundreds of terabits per second of peak bandwidth
- Key challenges: upload processing pipeline, adaptive streaming, a global CDN network, and ML-based recommendations
Approximately how much storage is needed to hold one day's worth of uploaded videos (all quality levels), in petabytes?
Video Upload Pipeline
Lesson Objectives
- Design resumable upload for large files
- Understand the transcoding pipeline and parallel encoding
- Master async job processing for video
The Problem: Uploading Large Files
A YouTube video can be up to 12 hours long. At 1080p, that is roughly 60 GB. Uploading such a file over a typical home connection takes hours. What happens if the connection drops?
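One way to survive a dropped connection is to have the server remember how many bytes it has committed, so the client can ask for that offset and continue from there. The sketch below simulates this in memory; `UploadSession`, `ConnectionDropped`, and `upload` are hypothetical names for illustration, and no real HTTP calls are made.

```python
from typing import Optional


class ConnectionDropped(Exception):
    """Simulated network failure in the middle of an upload."""


class UploadSession:
    """Hypothetical server-side state for one resumable upload."""

    def __init__(self, total_size: int) -> None:
        self.total_size = total_size
        self.received = bytearray()

    def committed_offset(self) -> int:
        # What a status query would report: bytes durably stored so far.
        return len(self.received)

    def put_chunk(self, offset: int, chunk: bytes) -> int:
        # Only accept a chunk that starts exactly where the last one ended.
        if offset != len(self.received):
            raise ValueError("chunk offset does not match committed offset")
        self.received.extend(chunk)
        return len(self.received)

    def is_complete(self) -> bool:
        return len(self.received) == self.total_size


def upload(data: bytes, session: UploadSession, chunk_size: int,
           drop_after_chunks: Optional[int] = None) -> None:
    """Send chunks, starting from whatever offset the server has committed."""
    offset = session.committed_offset()
    sent = 0
    while offset < len(data):
        if drop_after_chunks is not None and sent == drop_after_chunks:
            raise ConnectionDropped(f"connection lost at offset {offset}")
        offset = session.put_chunk(offset, data[offset:offset + chunk_size])
        sent += 1
```

Calling `upload` a second time on the same session resumes at the committed offset rather than at byte zero, so a drop costs at most one chunk of retransmission. In an HTTP protocol the `committed_offset` query would be a status request against the upload URL.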
Transcoding Pipeline Architecture
After upload, the video passes through a processing pipeline. This takes anywhere from minutes to hours depending on length.
Transcoding Job Implementation
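A transcoding job can be sketched as one FFmpeg command per quality level, with the renditions encoded in parallel. The ladder values, the output layout, and the `run` callback are assumptions made for illustration; the FFmpeg options used (`-vf scale`, `-c:v libx264`, `-b:v`, `-c:a aac`, `-b:a`) are standard, but this is not a description of YouTube's actual pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rendition:
    name: str
    height: int
    video_kbps: int
    audio_kbps: int


# Assumed quality ladder; real ladders are tuned per codec and content.
LADDER = [
    Rendition("1080p", 1080, 5000, 192),
    Rendition("720p", 720, 2800, 128),
    Rendition("480p", 480, 1400, 128),
    Rendition("360p", 360, 800, 96),
]


def ffmpeg_command(source: str, out_dir: str, r: Rendition) -> list:
    """Build the FFmpeg argument list for one rendition."""
    return [
        "ffmpeg", "-y", "-i", source,
        "-vf", f"scale=-2:{r.height}",  # keep aspect ratio, even width
        "-c:v", "libx264", "-b:v", f"{r.video_kbps}k",
        "-c:a", "aac", "-b:a", f"{r.audio_kbps}k",
        f"{out_dir}/{r.name}.mp4",
    ]


def transcode_all(source: str, out_dir: str,
                  run: Callable[[list], int], ladder=LADDER) -> dict:
    """Encode every rendition in parallel; map rendition name to success."""
    with ThreadPoolExecutor(max_workers=len(ladder)) as pool:
        futures = {
            r.name: pool.submit(run, ffmpeg_command(source, out_dir, r))
            for r in ladder
        }
        return {name: future.result() == 0 for name, future in futures.items()}
```

Passing `run` as a parameter keeps the sketch testable without FFmpeg installed; in practice it could be something like `lambda cmd: subprocess.run(cmd).returncode`. Because each rendition reports success separately, a failed 360p encode can be retried without redoing the 1080p one.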
HLS Segment Generation
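Segmentation produces a run of short media files plus a media playlist that lists them in order. The sketch below computes fixed-length segment durations and writes the matching HLS media playlist. The 6-second target and the `seg_00000.ts` naming are assumptions, and a real segmenter cuts at keyframes, so actual durations vary slightly.

```python
import math


def segment_durations(total_seconds: float, target: float = 6.0) -> list:
    """Split a video into target-length segments plus a shorter tail."""
    full = int(total_seconds // target)
    durations = [target] * full
    remainder = total_seconds - full * target
    if remainder > 1e-9:
        durations.append(round(remainder, 3))
    return durations


def media_playlist(durations: list, prefix: str = "seg") -> str:
    """Render an HLS VOD media playlist for the given segment durations."""
    lines = [
        "#EXTM3U",
        "#EXT-X-VERSION:3",
        # Target duration must be an integer no smaller than any segment.
        f"#EXT-X-TARGETDURATION:{math.ceil(max(durations))}",
        "#EXT-X-MEDIA-SEQUENCE:0",
        "#EXT-X-PLAYLIST-TYPE:VOD",
    ]
    for index, duration in enumerate(durations):
        lines.append(f"#EXTINF:{duration:.3f},")
        lines.append(f"{prefix}_{index:05d}.ts")
    lines.append("#EXT-X-ENDLIST")  # marks the playlist as complete
    return "\n".join(lines) + "\n"
```

A 20-second clip yields three 6-second segments and one 2-second tail. One such playlist is produced per quality level, which is what lets the player change quality at any segment boundary.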
Optimizations for High Load
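One load-related optimization is a priority queue in front of the encoders. The sketch below assumes a policy that this lesson does not spell out: the lowest rendition of each video is encoded first so the video becomes watchable quickly, and uploads flagged as popular jump ahead within each rung. `TranscodeQueue` and `enqueue_video` are hypothetical names.

```python
import heapq
import itertools


class TranscodeQueue:
    """Min-heap of transcode jobs; a lower priority number runs first."""

    def __init__(self) -> None:
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def push(self, video_id: str, height: int, priority: int) -> None:
        heapq.heappush(
            self._heap, (priority, next(self._order), video_id, height)
        )

    def pop(self) -> tuple:
        _, _, video_id, height = heapq.heappop(self._heap)
        return video_id, height

    def __len__(self) -> int:
        return len(self._heap)


def enqueue_video(queue: TranscodeQueue, video_id: str,
                  heights: list, popular: bool) -> None:
    """Queue one job per rendition, lowest resolution first."""
    for rung, height in enumerate(sorted(heights)):
        # Assumed policy: rung decides the band, popularity breaks the tie.
        queue.push(video_id, height, priority=rung * 2 + (0 if popular else 1))
```

With two videos queued, both 360p jobs run before either 720p job, and the popular upload goes first within each rung. The expensive high-resolution encodes then fill whatever encoder capacity is left.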
Key Takeaways
- The video upload pipeline consists of: resumable upload for reliability, async transcoding with parallel processing of different quality levels, and HLS segmentation for streaming
- Priority queues and GPU clusters enable the scale of 500 hours per minute
Why is transcoding performed asynchronously rather than immediately during upload?
Adaptive Bitrate Streaming
Lesson Objectives
- Understand the HLS and DASH streaming protocols
- Study adaptive bitrate switching algorithms
- Master buffer management strategies
The Problem: Varying Internet Quality
Users watch video from all kinds of devices and networks: 5G, WiFi, slow mobile data. How do you deliver the best experience to each one?
HLS (HTTP Live Streaming)
Manifest Files
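In HLS, the top-level manifest is a master playlist that lists one variant stream per quality level, each pointing at its own media playlist. A minimal sketch of generating one is shown here; the `Variant` type and the example URIs are hypothetical, while the `#EXT-X-STREAM-INF` tag with `BANDWIDTH` and `RESOLUTION` attributes is part of the HLS format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    bandwidth_bps: int  # peak bitrate the player should budget for
    width: int
    height: int
    uri: str            # location of this variant's media playlist


def master_playlist(variants: list) -> str:
    """Render an HLS master playlist, lowest bandwidth first."""
    lines = ["#EXTM3U"]
    for v in sorted(variants, key=lambda v: v.bandwidth_bps):
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={v.bandwidth_bps},"
            f"RESOLUTION={v.width}x{v.height}"
        )
        lines.append(v.uri)
    return "\n".join(lines) + "\n"
```

The player downloads this file once, then fetches segments from whichever variant its ABR logic selects. Listing variants in ascending order is a choice made in this sketch, not a requirement of the format.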
ABR Algorithms
How does the player decide when to switch to a different quality level? There are several approaches:
ABR Implementation
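A hybrid ABR rule can be sketched as follows: estimate throughput conservatively, then let the buffer level decide how much of that estimate to trust. The bitrate ladder, the 5 s and 20 s buffer thresholds, and the 0.8 safety factor are all assumptions chosen for illustration, not values taken from any real player.

```python
# Assumed bitrate ladder in kbps, ascending.
BITRATES_KBPS = [400, 800, 1400, 2800, 5000]


def harmonic_mean(samples: list) -> float:
    """Harmonic mean; dominated by the slowest samples, so conservative."""
    return len(samples) / sum(1.0 / s for s in samples)


def choose_bitrate(throughput_samples_kbps: list, buffer_seconds: float,
                   bitrates: list = BITRATES_KBPS,
                   low_buffer: float = 5.0, high_buffer: float = 20.0,
                   safety: float = 0.8) -> int:
    """Pick the bitrate for the next segment."""
    # Buffer nearly empty (or no measurements yet): avoid a stall at all costs.
    if buffer_seconds < low_buffer or not throughput_samples_kbps:
        return bitrates[0]
    estimate = harmonic_mean(throughput_samples_kbps)
    # With a comfortable buffer the full estimate is used; otherwise a margin.
    budget = estimate if buffer_seconds >= high_buffer else estimate * safety
    affordable = [b for b in bitrates if b <= budget]
    return affordable[-1] if affordable else bitrates[0]
```

With steady 3000 kbps samples this picks 400 kbps at a 2-second buffer, 1400 kbps at 10 seconds, and 2800 kbps at 30 seconds. The buffer acts as the safety margin: the same throughput estimate leads to different choices depending on how much playback time is already banked.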
Optimizations for Fast Startup
Key Takeaways
- Adaptive Bitrate Streaming splits video into 2-10 second segments, each available in multiple quality levels
- The ABR algorithm (buffer-based, throughput-based, or hybrid) selects the optimal quality for each segment, balancing picture quality against the risk of rebuffering
Why is the buffer-based approach better than pure throughput-based?
CDN and Edge Caching
Lesson Objectives
- Design a multi-tier CDN architecture
- Understand video-specific CDN optimizations
- Study ISP peering and Google Global Cache
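The Problem: 250 Tbps Bandwidth
Serving roughly 250 Tbps of peak traffic (50 million concurrent streams at about 5 Mbps each) from a handful of centralized data centers is not feasible. Latency would be high for distant users, and backbone and peering links would saturate long before the servers did. The solution: deliver content as close to the user as possible.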
Multi-Tier CDN Architecture
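The tiered lookup can be sketched as a chain of caches consulted from nearest to farthest, falling back to origin on a full miss. The tier order used in the usage note (ISP-embedded cache, then edge PoP, then regional cache) is an assumption of this sketch, and `CacheTier` and `fetch` are hypothetical names.

```python
from typing import Callable


class CacheTier:
    """One cache layer; a plain dict stands in for real storage."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store = {}


def fetch(key: str, tiers: list, origin: Callable[[str], bytes]) -> tuple:
    """Return (data, served_by), checking the nearest tier first."""
    for depth, tier in enumerate(tiers):
        if key in tier.store:
            data = tier.store[key]
            # Fill every tier closer to the user so the next request is local.
            for closer in tiers[:depth]:
                closer.store[key] = data
            return data, tier.name
    # Full miss: go to origin once, then populate every tier on the way back.
    data = origin(key)
    for tier in tiers:
        tier.store[key] = data
    return data, "origin"
```

With `tiers = [CacheTier("ggc"), CacheTier("edge"), CacheTier("regional")]`, the first request for a segment reaches origin and every later request is served from the nearest tier. Origin therefore sees each object roughly once per fill rather than once per viewer.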
Google Global Cache (GGC)
YouTube places servers directly inside ISP networks. This significantly reduces latency and costs for both parties.
Video-Specific CDN Optimizations
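Segment-level caching can be illustrated with an LRU cache keyed by (video, quality, segment index) instead of by whole file. This is a minimal sketch with a hypothetical `SegmentCache` class and a byte-based capacity; production caches use more elaborate admission and eviction policies.

```python
from collections import OrderedDict


class SegmentCache:
    """LRU cache of individual segments, bounded by total bytes."""

    def __init__(self, capacity_bytes: int) -> None:
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._items = OrderedDict()  # key -> bytes, oldest first

    def get(self, key: tuple):
        data = self._items.get(key)
        if data is not None:
            self._items.move_to_end(key)  # mark as most recently used
        return data

    def put(self, key: tuple, data: bytes) -> None:
        if key in self._items:
            self.used_bytes -= len(self._items.pop(key))
        self._items[key] = data
        self.used_bytes += len(data)
        # Evict least recently used segments until the cache fits again.
        while self.used_bytes > self.capacity_bytes and self._items:
            _, evicted = self._items.popitem(last=False)
            self.used_bytes -= len(evicted)
```

Because many viewers abandon a video early, the opening segments are requested far more often than the tail. With per-segment keys, the hot opening segments of many videos can stay resident while rarely watched tails are evicted, which a whole-file cache cannot do.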
CDN Selection Algorithm
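A selection rule might score each candidate node on latency, current load, and whether it already holds the requested segment, then pick the lowest score. The scoring formula, the 0.9 load cutoff, and the 50 ms miss penalty below are assumptions invented for this sketch; the lesson does not specify how YouTube actually weighs these signals.

```python
from dataclasses import dataclass

MISS_PENALTY_MS = 50.0  # assumed extra cost of fetching from an upstream tier


@dataclass(frozen=True)
class CdnNode:
    name: str
    rtt_ms: float       # measured round-trip time to the client
    load: float         # utilization between 0.0 and 1.0
    has_segment: bool   # whether the requested segment is already cached
    healthy: bool = True


def score(node: CdnNode) -> float:
    """Lower is better: latency inflated by load, plus a cache-miss penalty."""
    penalty = 0.0 if node.has_segment else MISS_PENALTY_MS
    return node.rtt_ms * (1.0 + node.load) + penalty


def select_node(nodes: list, max_load: float = 0.9) -> CdnNode:
    """Pick the best healthy node that still has spare capacity."""
    usable = [n for n in nodes if n.healthy and n.load < max_load]
    if not usable:
        raise LookupError("no healthy CDN node with spare capacity")
    return min(usable, key=score)
```

Under this rule a nearby node that is overloaded is skipped entirely, and a slightly more distant node that already holds the segment can beat a closer one that would have to fetch it.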
Key Takeaways
- YouTube uses a multi-tier CDN: Edge PoPs (1000+), Regional Caches (100+), and Google Global Cache inside ISP networks
- Video-specific optimizations: segment-level caching, quality-aware prioritization, predictive prefetching
- Traffic is heavily skewed, in the spirit of the 80/20 rule but sharper: about 10% of videos account for 80% of traffic
Why does YouTube cache individual segments rather than whole videos?
Recommendation System
Lesson Objectives
- Understand the multi-stage recommendation pipeline
- Study candidate generation and ranking
- Master real-time personalization
Why Recommendations Matter
70% of watch time on YouTube comes from recommendations. They are the primary driver of engagement and revenue. The challenge: select 20 videos from over 800 million that a given user will actually want to watch.
Stage 1: Candidate Generation
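Candidate generation is often framed as nearest-neighbor search: represent the user and every video as vectors, then keep the videos whose vectors score highest against the user's. The sketch below does this by brute force with a dot product; the embeddings are made-up toy values, and a catalog of hundreds of millions of videos would need an approximate nearest-neighbor index rather than a full scan.

```python
import heapq


def dot(a: list, b: list) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def generate_candidates(user_vec: list, video_vecs: dict,
                        watched: set, k: int) -> list:
    """Return the k unwatched video ids most similar to the user vector."""
    scored = (
        (dot(user_vec, vec), video_id)
        for video_id, vec in video_vecs.items()
        if video_id not in watched  # never re-recommend what was just seen
    )
    return [video_id for _, video_id in heapq.nlargest(k, scored)]
```

This stage is deliberately cheap and recall-oriented: it only has to avoid missing good videos, since the later ranking stage does the precise scoring on a much smaller set.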
Stage 2: Ranking
From 10,000 candidates, the top 500 must be selected. This uses a complex ML model that predicts the probability of engagement.
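To make the ranking objective concrete, the sketch below scores each candidate by expected watch time, taken as click probability multiplied by predicted watch duration, and keeps the top k. In a real system both numbers would come from the ML model; here they are supplied directly as hypothetical inputs on a `Candidate` record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    video_id: str
    p_click: float                 # predicted probability of a click
    expected_watch_seconds: float  # predicted watch time given a click


def expected_watch_time(c: Candidate) -> float:
    """Expected seconds watched per impression."""
    return c.p_click * c.expected_watch_seconds


def rank(candidates: list, k: int) -> list:
    """Keep the k candidates with the highest expected watch time."""
    return sorted(candidates, key=expected_watch_time, reverse=True)[:k]
```

A clickbait video with a 0.9 click probability but 20 seconds of expected viewing scores 18, while a video with a 0.3 click probability and 600 seconds of expected viewing scores 180. Ranking by this product is one way to express optimizing for watch time rather than clicks.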
Stage 3: Re-ranking
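Re-ranking typically applies business and diversity rules on top of the model's ordering. This lesson does not list the specific rules, so the sketch below assumes one common example: capping how many videos from the same channel may appear in the final list. The `rerank` function and the cap of two per channel are illustrative assumptions.

```python
def rerank(ranked: list, k: int, max_per_channel: int = 2) -> list:
    """Walk the ranked (video_id, channel_id) list and enforce a channel cap."""
    per_channel = {}
    result = []
    for video_id, channel_id in ranked:
        if per_channel.get(channel_id, 0) >= max_per_channel:
            continue  # skip: this channel already fills its share of slots
        per_channel[channel_id] = per_channel.get(channel_id, 0) + 1
        result.append(video_id)
        if len(result) == k:
            break
    return result
```

The model's order is preserved wherever the cap allows, so the rule changes the list only when one channel would otherwise dominate it.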
Real-time Personalization
Architecture Overview
Key Takeaways
- YouTube recommendations use a multi-stage funnel: Candidate Generation (800M → 10K), Ranking (10K → 500), Re-ranking (500 → 20)
- The primary metric is watch time, not clicks
- Signals used include collaborative filtering, content-based, and contextual features
- Everything must run in under 100ms
Common misconception: the ranking model is the heart of the recommender, so the key to better recommendations is a deeper neural network with more layers and parameters.
Covington, Adams and Sargin (2016) showed that at YouTube scale, gains from architecture were dwarfed by gains from feature engineering: sampling negatives, including search history as embeddings, modeling video age, and training on impressions rather than only clicks each outperformed experiments with layer depth.
When the candidate pool is 800M and the serving budget is 100 ms, model capacity is bounded by inference latency. The remaining headroom lives in signal quality (how watch time, freshness, and context are encoded), not in extra hidden units.
Why does YouTube optimize for watch time rather than clicks?
Where this connects
YouTube is the canonical exabyte-scale case study. It builds on Twitter's fan-out vocabulary, leans on CDN edge caching theory, and feeds directly into Uber's real-time delivery patterns and operator-level GPU scheduling.
- Twitter case study (read-heavy distribution) — builds-on
- CDN architecture — applies
- Message queues for async pipelines — applies
- GPU architecture and SIMT — analogous-to
What stays in memory
- Storage scales to exabytes per year because every uploaded video lives in 5+ qualities plus HLS segments - the playback ladder is the real storage multiplier.
- Resumable chunked upload (308 Resume Incomplete) plus async transcoding via Kafka decouples upload reliability from encode latency.
- Adaptive Bitrate Streaming combines throughput estimation with buffer-based switching - buffer level is the safety margin, not just a quality indicator.
- Multi-tier CDN (Edge PoPs to Regional caches to GGC inside ISPs) achieves 90%+ hit rate by caching segments individually, not whole files.
- The recommendation funnel collapses 800M videos to 20 in under 100ms through Candidate Generation, Ranking, and Re-ranking - watch time is the optimization target, not clicks.
Questions for Reflection
- Why does YouTube prioritize segment-level caching over whole-video caching, and how does the 50% drop-off in viewer retention shape that choice?
- If 4K bandwidth costs roughly 3x more than 1080p but only 8% of users have monitors capable of resolving 4K, how should the transcoding queue prioritize quality ladders?
- Recommendations are optimized for watch time rather than click-through rate. What second-order effects on creator behavior does that objective produce, and how would the system detect drift?
Related Lessons
- sd-14-twitter — Twitter feed patterns precede video scale
- sd-08-cdn — CDN is the foundation of video delivery: edge nodes near users
- sd-09-message-queue — Video processing pipeline via queues (transcode, thumbnail)
- arch-15-gpu-architecture — GPU for ML recommendations - same parallelism as transcoding
- dist-11-replication