System Design
Load Estimation (Back-of-Envelope)
Цели урока
- Know key numbers for quick calculations (100K seconds in a day)
- Be able to calculate QPS from DAU
- Be able to estimate Storage accounting for all factors
- Be able to calculate Bandwidth
- Understand thresholds for architecture selection
Предварительные знания
- Functional and non-functional requirements
- Basic concepts: DAU, QPS, latency
Why Back-of-Envelope Estimation Matters
**Back-of-envelope estimation** (literally: calculations on the back of an envelope) is the ability to quickly estimate system scale without a calculator. It demonstrates understanding of **orders of magnitude** and the ability to make **informed decisions**.
The question: 'how many servers does Twitter need?' - and 2 minutes to answer.
- Interview time is limited - scale estimation must be done quickly
- In production it matters to understand the order: is this 1GB or 1TB? 100 QPS or 100K?
- The right estimate determines the architecture: monolith vs microservices
In an interview, precision down to the byte is not the goal. What matters is getting the **order of magnitude** right: gigabytes or terabytes? Hundreds of requests or millions?
Why are approximate calculations important in a System Design interview?
Numbers to Memorize
There is a set of numbers worth memorizing for quick mental calculations.
Time
| Period | Exact value | For calculations |
|---|---|---|
| Seconds in a minute | 60 | 60 |
| Seconds in an hour | 3,600 | ~4K |
| Seconds in a day | 86,400 | ~100K (10⁵) |
| Seconds in a month | 2,592,000 | ~2.5M |
| Seconds in a year | 31,536,000 | ~30M (3×10⁷) |
**100K seconds in a day** - the golden number! Memorize it: to calculate QPS just divide the daily load by 100,000.
Data sizes
| Unit | Bytes | Power of 10 |
|---|---|---|
| 1 KB (kilobyte) | 1,000 | 10³ |
| 1 MB (megabyte) | 1,000,000 | 10⁶ |
| 1 GB (gigabyte) | 1,000,000,000 | 10⁹ |
| 1 TB (terabyte) | 1,000,000,000,000 | 10¹² |
| 1 PB (petabyte) | 10¹⁵ | 10¹⁵ |
Operation latency
| Operation | Time | Relative |
|---|---|---|
| L1 cache hit | 1 ns | 1x |
| L2 cache hit | 4 ns | 4x |
| RAM access | 100 ns | 100x |
| SSD random read | 16 μs | 16,000x |
| HDD seek | 2 ms | 2,000,000x |
| Network: same datacenter | 0.5 ms | 500,000x |
| Network: SF → NYC | 40 ms | 40M x |
| Network: SF → Europe | 80 ms | 80M x |
Approximately how many seconds are in a day (for quick calculations)?
Calculating QPS (Queries Per Second)
**QPS** (Queries Per Second) is the main metric for understanding system load. It determines how many servers are needed, whether a cache or load balancer is required, etc.
Example: Twitter Read QPS
Calculate read load for Twitter
Given: • DAU = 200M users • Each user reads the feed 5 times a day • Each read = fetching 20 tweets Read QPS = (200M × 5) / 100K = 1B / 100K = 10K QPS (requests to API) But each request reads 20 tweets from DB: DB Read QPS = 10K × 20 = 200K QPS
**Peak QPS** is usually 2-3x the average! For Twitter: peak = 200K × 3 = 600K QPS. The system must handle spikes.
Read vs Write Ratio
Most systems are **read-heavy** - they read data far more often than they write. This shapes architecture: aggressive caching and read replicas become viable.
| System | Read:Write | Implication |
|---|---|---|
| 1000:1 | Can cache the entire feed | |
| 100:1 | CDN for images is critical | |
| Messenger | 10:1 | Fast writes needed too |
| Stock exchange | 1:1 | Strong consistency needed |
A service has 50M DAU, each performing 10 actions per day. What is the QPS?
Calculating Storage
Storage estimation matters for choosing a database and planning infrastructure. The formula is simple, but all components must be accounted for.
Example: Twitter Storage over 5 years
How much space is needed to store tweets?
Given: • 500M tweets per day • Average tweet: 280 chars ≈ 300 bytes • Metadata: ~200 bytes (author_id, timestamp, etc) • Total per tweet: ~500 bytes Daily Storage: = 500M × 500 bytes = 250 GB/day (text only!) 5 years: = 250 GB × 365 × 5 = 456 TB ≈ 500 TB text With indexes (×1.2) and replication (×3): = 500 TB × 1.2 × 3 = 1.8 PB
**Media is a different order of magnitude!** If 20% of tweets have photos (~1MB): 500M × 0.2 × 1MB = 100TB/day. Over 5 years: 180 PB for media alone!
Typical object sizes
| Object | Size | Note |
|---|---|---|
| UUID/ID | 16 bytes | Or 8 bytes for BIGINT |
| Timestamp | 8 bytes | Unix timestamp |
| 50-100 bytes | Average length | |
| Username | 20-50 bytes | UTF-8 |
| Tweet/Post | 300-500 bytes | With metadata |
| User profile | 1-2 KB | Without avatar |
| JPEG (compressed) | 100-500 KB | For web |
| Photo original | 2-5 MB | DSLR camera |
| Video (1 min) | 10-50 MB | After compression |
A service stores 100M records at 1KB each. How many GB is needed?
Calculating Bandwidth
**Bandwidth** is the volume of data transferred per second. Critical for understanding network requirements and CDN needs.
Example: Twitter feed
Calculate bandwidth for serving the feed
Given: • Read QPS: 10K requests to API • Each response: 20 tweets with previews • Response size: ~100KB (tweets + metadata + previews) Bandwidth: = 10K × 100KB = 1,000,000 KB/s = 1 GB/s = 8 Gbps Peak (×3): = 24 Gbps
**24 Gbps** - that's serious traffic. A single 10Gbps server won't handle it. The solution: CDN for static assets, compression (gzip), multiple edge servers.
Bandwidth by content type
At 5K QPS with an average response of 200KB, what is the bandwidth in GB/s?
Scaling Thresholds by Load Level
QPS determines the required architecture. Here are practical thresholds for decision-making:
How many servers are needed?
At what QPS does a system typically need to move from a single server to a load balancer?
Estimation Cheat Sheet
- **Seconds per day** ≈ 100K (86,400)
- **QPS** = (DAU × actions) / 100K
- **Storage** = objects × size × retention
- **Bandwidth** = QPS × response_size
- **Peak** = Average × 2-3
- **Read:Write** - usually 10:1 or 100:1
- **< 100 QPS** → Single server
- **100-10K QPS** → Load balancer + replicas
- **10K-100K QPS** → Microservices + sharding
- **> 100K QPS** → Geo-distributed
How this relates to architecture
Estimation results determine component selection
- Load Balancer — Needed when QPS > 100-1000
- Caching — Critical for read-heavy load (Read:Write > 10:1)
- Sharding — Needed when data doesn't fit on a single server