System Design

Case Study: Dropbox

- **Languages**: Python (backend), Rust (Sync Engine rewrite), Go (infrastructure)
- **Storage**: Custom block storage on commodity hardware (Magic Pocket)
- **Database**: MySQL (sharded), EdgeStore (distributed metadata)
- **Queue**: Kafka for event streaming
- **Cache**: Memcache, Redis
- **Network**: Custom protocol 'Sync' over HTTP/2

**Small blocks (1MB)**:
- ✅ Better deduplication
- ❌ More metadata overhead
- ❌ More I/O operations

**Large blocks (16MB)**:
- ✅ Less metadata
- ❌ Worse deduplication
- ❌ More bandwidth for small changes

**Dropbox choice**: 4MB as a balance
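The trade-off above can be put into rough numbers. The sketch below assumes fixed-size blocks, one hash per block as the only metadata, and an edit that falls inside a single block; `blockSizeCost` and its fields are illustrative names, not part of any Dropbox API.

```typescript
const MiB = 1024 * 1024;

interface BlockSizeCost {
  blockSizeMiB: number;
  blockCount: number;    // metadata entries: one hash per block
  reuploadBytes: number; // bytes re-sent after a small edit inside one block
}

// Assumption: fixed-size blocks and an edit that touches exactly one block.
function blockSizeCost(fileBytes: number, blockSizeMiB: number): BlockSizeCost {
  const blockBytes = blockSizeMiB * MiB;
  return {
    blockSizeMiB,
    blockCount: Math.ceil(fileBytes / blockBytes),
    reuploadBytes: Math.min(blockBytes, fileBytes),
  };
}

const oneGiB = 1024 * MiB;
const costs = [1, 4, 16].map((size) => blockSizeCost(oneGiB, size));

for (const c of costs) {
  console.log(
    `${c.blockSizeMiB} MiB blocks: ${c.blockCount} hashes, ` +
      `${c.reuploadBytes / MiB} MiB re-uploaded per small edit`
  );
}
// 1 MiB blocks: 1024 hashes, 1 MiB re-uploaded per small edit
// 4 MiB blocks: 256 hashes, 4 MiB re-uploaded per small edit
// 16 MiB blocks: 64 hashes, 16 MiB re-uploaded per small edit
```

For a 1 GiB file, going from 1 MiB to 16 MiB blocks cuts the metadata by 16x but makes every small edit 16x more expensive to upload, which is the tension the 4MB choice balances.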

In 2016 Dropbox migrated from S3 to its own storage system, Magic Pocket:
- **90%** of data stored in Magic Pocket
- **10%** in AWS (for disaster recovery)
- Saves ~$75M/year compared to S3
- Exabyte-scale on commodity hardware

Sync Engine Architecture

Why is this needed?

Dropbox synchronizes billions of files across devices. Understanding the sync engine architecture is critical for any system working with files locally and in the cloud.

What is it?

Dropbox uses a desktop client with a local database that tracks filesystem changes and synchronizes them through a block server. The architecture separates metadata (information about files) from block data (file contents).
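One way to picture the metadata/block split is a file record that holds only an ordered list of block hashes, with the bytes stored elsewhere. The following is a minimal sketch under that assumption; `FileMetadata`, `blocksToUpload`, and the placeholder hash strings are hypothetical and not Dropbox's actual types.

```typescript
// Hypothetical shape: metadata describes a file as an ordered list of block hashes.
interface FileMetadata {
  path: string;
  blocklist: string[];
}

// Returns the hashes the client still has to upload: blocks in the new
// version that the block server does not already hold.
function blocksToUpload(next: FileMetadata, serverHas: Set<string>): string[] {
  const missing = new Set<string>();
  for (const hash of next.blocklist) {
    if (!serverHas.has(hash)) missing.add(hash);
  }
  return Array.from(missing);
}

const before: FileMetadata = { path: "/document.docx", blocklist: ["h1", "h2", "h3"] };
const after: FileMetadata = { path: "/document.docx", blocklist: ["h1", "h2-edited", "h3"] };

// The server already holds every block of the previous version.
const upload = blocksToUpload(after, new Set(before.blocklist));
console.log(upload); // [ 'h2-edited' ]
```

Because the metadata record is just a list of hashes, committing the new version means uploading one block and then writing a new blocklist; the unchanged blocks are never re-sent.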

How does it work?

Example

**Example sync flow:**

```typescript
// User edits document.docx locally
// 1. File watcher detects change
// 2. Sync engine computes new blocklist
// 3. Only changed blocks uploaded
// 4. Metadata service notified
// 5. Other devices receive notification
// 6. They download only changed blocks

// Efficiency: editing 1KB in 100MB file
// → Upload only 1-2 blocks (4-8MB) not entire file
```

What did you learn about the Sync Engine architecture?

Content-Addressed Storage and Deduplication

Why is this needed?

Dropbox stores data at exabyte scale. Without deduplication this would be economically impossible. Content-addressed storage allows each unique block to be stored only once.

What is it?

Each block of data is addressed by its cryptographic hash (SHA-256). If two users upload the same file, or even the same parts of different files, the data is stored only once.
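To make the idea concrete, here is a minimal in-memory sketch of a content-addressed store. It uses Node's `node:crypto` module for SHA-256; the `ContentAddressedStore` class and its methods are illustrative names, and a real block store would persist and replicate blocks rather than keep them in a `Map`.

```typescript
import { createHash } from "node:crypto";

// Illustrative in-memory store: the key of a block is the SHA-256 of its bytes.
class ContentAddressedStore {
  private blocks = new Map<string, Uint8Array>();
  private logicalBytes = 0;

  // Stores the block only if its hash is new; always returns the hash.
  put(data: Uint8Array): string {
    const key = createHash("sha256").update(data).digest("hex");
    this.logicalBytes += data.length;
    if (!this.blocks.has(key)) this.blocks.set(key, data);
    return key;
  }

  get(key: string): Uint8Array | undefined {
    return this.blocks.get(key);
  }

  stats(): { uniqueBlocks: number; logicalBytes: number; physicalBytes: number } {
    let physicalBytes = 0;
    this.blocks.forEach((block) => { physicalBytes += block.length; });
    return { uniqueBlocks: this.blocks.size, logicalBytes: this.logicalBytes, physicalBytes };
  }
}

const store = new ContentAddressedStore();
const keyA = store.put(Buffer.from("same block contents", "utf8")); // user A uploads
const keyB = store.put(Buffer.from("same block contents", "utf8")); // user B uploads identical bytes
const keyC = store.put(Buffer.from("different block", "utf8"));

console.log(keyA === keyB);  // true: identical content maps to the same address
console.log(store.stats());  // two unique blocks for three uploads
```

The second upload costs one hash computation and a lookup but no extra storage, which is the whole point of addressing blocks by content.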

How does it work?

Example

**Real Dropbox numbers:**

```
Deduplication savings:
- Average file: 1.7 versions stored
- Cross-user dedup: ~25% for documents
- Cross-version dedup: ~70% for edited files
- Overall storage efficiency: ~50% (store half of logical size)

Popular files (OS installers, software):
- Same Ubuntu ISO uploaded by 10,000 users
- Stored only ONCE = massive savings
```

What did you learn about content-addressed storage and deduplication?

Chunking Algorithms

Why is this needed?

The choice of algorithm for splitting a file into blocks critically affects deduplication efficiency. Fixed-size chunks are simple but perform poorly on insertions. Content-defined chunking solves this problem.

What is it?

Dropbox uses content-defined chunking (CDC) based on a rolling hash (Rabin fingerprint). Block boundaries are determined by file contents, not fixed positions. This ensures stable boundaries under local changes.
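The boundary-stability property can be demonstrated with a much simpler rolling hash than a Rabin fingerprint. The sketch below swaps in a gear-style hash (shift left by one, add a per-byte table value) purely for illustration; the table, the mask, the tiny ~64-byte average chunk size, and the absence of minimum/maximum chunk limits are all assumptions of this example, not how Dropbox or any production chunker is configured.

```typescript
// Deterministic pseudo-random generator (LCG) so the sketch needs no imports.
function makeRandom(seed: number): () => number {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state;
  };
}

const WINDOW = 32;        // a 32-bit hash shifted by 1 per byte forgets bytes older than 32
const MASK = 0xfc000000;  // cut when the top 6 bits are zero: about one cut per 64 bytes
const nextGear = makeRandom(1);
const GEAR: number[] = [];
for (let i = 0; i < 256; i++) GEAR.push(nextGear());

interface Chunk { start: number; bytes: number[]; }

// Cuts after every byte where the rolling hash matches the mask.
// The hash is never reset, so a cut depends only on the last WINDOW bytes.
function chunkByContent(data: number[]): Chunk[] {
  const chunks: Chunk[] = [];
  let hash = 0;
  let start = 0;
  for (let i = 0; i < data.length; i++) {
    hash = ((hash << 1) + GEAR[data[i]]) >>> 0;
    if ((hash & MASK) === 0) {
      chunks.push({ start, bytes: data.slice(start, i + 1) });
      start = i + 1;
    }
  }
  if (start < data.length) chunks.push({ start, bytes: data.slice(start) });
  return chunks;
}

// 4 KiB of pseudo-random bytes, then the same data with one byte inserted at the front.
const nextByte = makeRandom(42);
const original: number[] = [];
for (let i = 0; i < 4096; i++) original.push(nextByte() >>> 24);
const edited = [0x58].concat(original);

const keyOf = (c: Chunk): string => c.bytes.join(",");
const originalChunks = chunkByContent(original);
const editedKeys = new Set(chunkByContent(edited).map(keyOf));
const reused = originalChunks.filter((c) => editedKeys.has(keyOf(c)));

console.log(`${reused.length} of ${originalChunks.length} chunks unchanged after a 1-byte insert`);
```

With fixed-size blocks the same one-byte insert would shift every block boundary and change every block hash. Here, every chunk that starts at least `WINDOW` bytes into the file reappears unchanged in the edited version, because its boundaries depend only on nearby content.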

How does it work?

Example

**Algorithm comparison:**

```
Algorithm           Speed    Dedup Efficiency
─────────────────────────────────────────────
Fixed-size          Fast     Poor on inserts
Rabin fingerprint   Medium   Good
FastCDC             Fast     Good
BuzHash             Fast     Good

Dropbox uses: Rabin with 4MB target chunks
Backblaze B2: FastCDC with variable targets
Restic backup: CDC with 1-8MB chunks
```

What did you learn about chunking algorithms?

Metadata and File Journal

Why is this needed?

The metadata service is the heart of Dropbox. It tracks all files, versions, and sharing, and ensures consistency across devices. The journal enables efficient change synchronization.

What is it?

The Server File Journal (SFJ) is an append-only log of all changes. Each device remembers its position in the journal and requests only new changes. This enables efficient detection and synchronization of changes.
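A minimal in-memory sketch of such a journal follows. The `FileJournal` class, the entry fields, and the use of a plain counter as the cursor are assumptions made for illustration; the real SFJ schema is not described here.

```typescript
interface JournalEntry {
  cursor: number;       // monotonically increasing position in the log
  path: string;
  blocklist: string[];  // in this sketch, an empty list marks a deletion
}

// Illustrative append-only log: entries are never modified or removed.
class FileJournal {
  private entries: JournalEntry[] = [];

  append(path: string, blocklist: string[]): JournalEntry {
    const entry: JournalEntry = { cursor: this.entries.length + 1, path, blocklist };
    this.entries.push(entry);
    return entry;
  }

  // Everything recorded after `cursor`, capped at `limit` entries per call.
  changesSince(cursor: number, limit = 100): JournalEntry[] {
    return this.entries.slice(cursor, cursor + limit);
  }
}

const journal = new FileJournal();
journal.append("/notes.txt", ["h1"]);
journal.append("/notes.txt", ["h1", "h2"]);
journal.append("/old.txt", []);

// A device that last synced at cursor 1 asks only for what it missed.
const missed = journal.changesSince(1);
console.log(missed.map((e) => e.cursor)); // [ 2, 3 ]

// After applying them it stores cursor 3 and has nothing left to fetch.
const caughtUp = journal.changesSince(3);
console.log(caughtUp.length); // 0
```

The device's entire sync state is one number, so the cost of catching up scales with the number of missed changes rather than with the size of the account.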

How does it work?

Example

**Delta sync in action:**

```typescript
// Device comes online after being offline
async function syncFromServer(lastCursor: bigint): Promise<void> {
  while (true) {
    const changes = await metadataService.poll(namespaceId, lastCursor, 30000);

    if (changes.length === 0) {
      break; // Caught up
    }

    for (const change of changes) {
      await applyRemoteChange(change);
      lastCursor = change.cursor;
    }

    await saveLastCursor(lastCursor);
  }
}

// Efficiency:
// - Offline for 1 week with 100 changes
// - Only fetch those 100 entries, not full directory listing
// - ~10KB of metadata, not scanning entire filesystem
```

What did you learn about metadata and the file journal?

Conflict Resolution

Why is this needed?

When two devices edit the same file while offline, a conflict occurs. Dropbox must preserve both versions and let the user resolve the conflict manually.

What is it?

Dropbox uses last-writer-wins with preservation of conflicted copies. When a conflict is detected, the losing version is saved as a separate file named 'file (conflicted copy DATE)'. The user decides which version to keep.
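The rule described above can be sketched as a small function. This assumes both edits were made against the same base revision, uses the modification time to pick the winner, and formats the date as `YYYY-MM-DD` in UTC; the `Version` type, the function names, and the exact file-name format are illustrative assumptions, not Dropbox's implementation.

```typescript
interface Version {
  path: string;
  deviceId: string;
  modifiedAt: Date;
}

// Builds "name (conflicted copy DATE).ext", keeping the extension at the end.
function conflictedCopyPath(path: string, date: Date): string {
  const day = date.toISOString().slice(0, 10); // YYYY-MM-DD in UTC
  const dot = path.lastIndexOf(".");
  const slash = path.lastIndexOf("/");
  const hasExt = dot > slash + 1; // ignore dots in folders and leading-dot names
  const stem = hasExt ? path.slice(0, dot) : path;
  const ext = hasExt ? path.slice(dot) : "";
  return `${stem} (conflicted copy ${day})${ext}`;
}

// Last writer wins; the losing version is kept under a separate name.
function resolveConflict(a: Version, b: Version) {
  const [winner, loser] = a.modifiedAt.getTime() >= b.modifiedAt.getTime() ? [a, b] : [b, a];
  return { winner, loser, loserPath: conflictedCopyPath(loser.path, loser.modifiedAt) };
}

const laptop: Version = {
  path: "/docs/report.docx",
  deviceId: "laptop",
  modifiedAt: new Date("2024-03-01T10:00:00Z"),
};
const phone: Version = {
  path: "/docs/report.docx",
  deviceId: "phone",
  modifiedAt: new Date("2024-03-01T10:05:00Z"),
};

const result = resolveConflict(laptop, phone);
console.log(result.winner.deviceId); // phone
console.log(result.loserPath);       // /docs/report (conflicted copy 2024-03-01).docx
```

Nothing is discarded: the winner keeps the original path and the loser becomes an ordinary file next to it, so the user can compare and merge by hand.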

How does it work?

Example

**Conflict avoidance strategies:**

```typescript
// 1. Optimistic locking for active sessions
interface EditSession {
  fileId: string;
  userId: string;
  deviceId: string;
  startedAt: Date;
  lastHeartbeat: Date;
}

// Show warning: "John is currently editing this file"
// User can still edit, but is warned about potential conflict

// 2. Real-time sync when online
// Sync every few seconds while file is open
// Reduces window for conflicts

// 3. Application-specific handlers
// Some apps (Office, Figma) use OT/CRDT
// Dropbox provides hooks for these integrations
```

What did you learn about conflict resolution?

