Cloud Computing
Object Storage: S3, GCS
Amazon S3 stores over 350 trillion objects and processes over 100 million requests per second. Airbnb stores all guest photos, host listings, and user-generated content in S3. Netflix stores every video encoding variant - thousands of files per title - in S3 before distribution to the CDN. Object storage is the default data persistence layer of cloud-native architecture.
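- Dropbox stored user files in S3 for years and then, in 2016, finished moving roughly 500 petabytes of that data from S3 onto its own storage infrastructure. The migration took about two years and paid off only because of Dropbox's scale; until a company reaches that size, S3 provides storage designed for eleven 9s of durability (99.999999999%) with no hardware to operate.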
- The New York Times digitized 5.5 million photos from its archives and stored them in S3 with tiered storage: frequently accessed photos in S3 Standard, rarely accessed ones in a Glacier class. The tiering reduced storage costs by 40% while keeping sub-second access to the photos editors use most.
- Figma uses S3 to store every version of every design file. With S3 versioning enabled, every save creates a new object version and a delete only adds a delete marker, so earlier states remain retrievable until a version is explicitly removed. Point-in-time recovery of a design file to a previous state is a direct consequence of this architecture.
Buckets and Object Storage Model
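An S3 bucket is a flat namespace of objects. There are no real directories: a key like "users/photos/alice/avatar.jpg" is a single string, not a path. The "/" is a visual convention; when a listing request supplies a prefix and a delimiter, S3 groups keys that share a prefix to simulate a directory listing. Each object has a key (up to 1024 bytes of UTF-8), a value (up to 5 TB), metadata (system-defined plus user-defined key-value pairs), and an ETag. For an unencrypted or SSE-S3 object uploaded in a single request, the ETag is the MD5 digest of the content and can be used for integrity verification; for multipart uploads and for SSE-KMS or SSE-C objects it is not an MD5 digest, so it should not be treated as a general-purpose content hash.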
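The prefix-and-delimiter grouping described above can be sketched in a few lines. This is an in-memory simulation, not a call to S3: `list_with_delimiter` and the sample keys are hypothetical names used only for illustration.

```python
def list_with_delimiter(keys, prefix="", delimiter="/"):
    """Group flat keys the way a prefix/delimiter listing does.

    Returns (objects, common_prefixes): keys that sit directly "under" the
    prefix, and the rolled-up prefixes a console would display as folders.
    """
    objects, common_prefixes = [], set()
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        cut = rest.find(delimiter)
        if cut == -1:
            # No further delimiter: the key is listed as an object.
            objects.append(key)
        else:
            # Everything up to the next delimiter collapses into one prefix.
            common_prefixes.add(prefix + rest[:cut + len(delimiter)])
    return objects, sorted(common_prefixes)


keys = [
    "users/photos/alice/avatar.jpg",
    "users/photos/alice/banner.jpg",
    "users/photos/bob/avatar.jpg",
    "users/readme.txt",
]

print(list_with_delimiter(keys, prefix="users/"))
# (['users/readme.txt'], ['users/photos/'])
print(list_with_delimiter(keys, prefix="users/photos/"))
# ([], ['users/photos/alice/', 'users/photos/bob/'])
```

The "folders" exist only in the listing result; the stored keys never change.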
Pre-signed URLs are the secure pattern for user uploads: the backend generates a pre-signed PUT URL with a short expiry (15 minutes, for example), returns it to the client, and the client uploads directly to S3. The backend never handles the binary data, so it scales independently of upload volume. For reads, S3 acts as the origin and a CDN serves the stored result.
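To make the mechanism concrete, here is a minimal standard-library sketch of how a pre-signed PUT URL can be built with AWS Signature Version 4 query-string signing. The function name `presign_put`, the bucket, key, region, and credentials are all placeholders, and the sketch ignores session tokens and extra signed headers; in practice an AWS SDK would normally generate the URL.

```python
import datetime
import hashlib
import hmac
from urllib.parse import quote


def _hmac(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def presign_put(bucket, key, region, access_key, secret_key,
                expires=900, now=None):
    """Build a SigV4 query-signed PUT URL (sketch; 900 s = 15 minutes)."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    date = now.strftime("%Y%m%d")
    host = f"{bucket}.s3.{region}.amazonaws.com"
    scope = f"{date}/{region}/s3/aws4_request"

    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(
        f"{quote(k, safe='')}={quote(v, safe='')}"
        for k, v in sorted(params.items())
    )
    path = "/" + quote(key, safe="/")  # the key is one string; "/" stays literal

    # Canonical request: method, path, query, headers, signed headers, payload.
    canonical_request = "\n".join(
        ["PUT", path, query, f"host:{host}", "", "host", "UNSIGNED-PAYLOAD"]
    )
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256",
        amz_date,
        scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])

    # Derive the signing key: date -> region -> service -> "aws4_request".
    signing_key = _hmac(("AWS4" + secret_key).encode(), date)
    for part in (region, "s3", "aws4_request"):
        signing_key = _hmac(signing_key, part)
    signature = hmac.new(
        signing_key, string_to_sign.encode(), hashlib.sha256
    ).hexdigest()

    return f"https://{host}{path}?{query}&X-Amz-Signature={signature}"


# Placeholder credentials and names, for illustration only.
url = presign_put(
    bucket="example-bucket",
    key="users/photos/alice/avatar.jpg",
    region="us-east-1",
    access_key="AKIDEXAMPLE",
    secret_key="example-secret-key",
)
print(url)
```

The secret key never leaves the backend; the client receives only the URL, which stops working once `X-Amz-Expires` seconds have passed since `X-Amz-Date`.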
What is an S3 object key?
Lifecycle Policies
S3 lifecycle policies automatically manage object transitions and deletions. Storage classes range from S3 Standard (frequently accessed, highest storage cost) to S3 Intelligent-Tiering (moves objects between access tiers automatically based on access patterns) to S3 Glacier Flexible Retrieval (archived, retrieval in minutes to hours) to S3 Glacier Deep Archive (lowest cost, retrieval within 12 to 48 hours). A lifecycle rule can transition objects to a cheaper class after N days and delete them after M days.
AbortIncompleteMultipartUpload is a commonly missed lifecycle action. Large file uploads use S3 multipart upload; if an upload fails midway and is never completed or aborted, the parts already uploaded stay in the bucket and are billed indefinitely. A 7-day abort rule on incomplete multipart uploads prevents this silent cost accumulation.
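As an illustration, the two kinds of rule discussed above could be written as a lifecycle configuration like the one below. The dictionary follows the shape accepted by the S3 PutBucketLifecycleConfiguration API; the rule IDs, the "logs/" prefix, and the 90-day and 365-day values are assumptions chosen for the example, and nothing here is sent to AWS.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            # Transition after N = 90 days, delete after M = 365 days.
            "ID": "archive-then-expire-logs",
            "Status": "Enabled",
            "Filter": {"Prefix": "logs/"},
            "Transitions": [
                # "GLACIER" is the API value for S3 Glacier Flexible Retrieval.
                {"Days": 90, "StorageClass": "GLACIER"}
            ],
            "Expiration": {"Days": 365},
        },
        {
            # Clean up parts of multipart uploads that never completed.
            "ID": "abort-stale-multipart-uploads",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # empty prefix: applies to the whole bucket
            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
        },
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))
```

Keeping the abort rule separate from the transition rule makes it easy to apply bucket-wide even when the other rules are scoped to a prefix.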
What storage class should be used for logs that are rarely read after 90 days but must be retrievable within minutes?
Versioning
S3 versioning preserves every version of every object in a bucket. When versioning is enabled, deleting an object without specifying a version ID adds a delete marker; the previous version is still retrievable by version ID. Overwriting an object creates a new version; the old version is still accessible. This creates a complete audit trail and enables point-in-time recovery.
Versioning increases storage costs: every version is stored and billed. A lifecycle rule on non-current versions (delete non-current versions after 30 days) controls cost growth while maintaining a rolling recovery window. Without this rule, a large versioned bucket accumulates every historical version indefinitely.
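The behaviour described in this section can be modelled with a small in-memory class. This is a toy model under simplifying assumptions, not the S3 API: `VersionedBucket`, its methods, and the day counters are hypothetical, and noncurrent age is measured from the day the successor version was written.

```python
import itertools
from dataclasses import dataclass
from typing import Optional


@dataclass
class Version:
    version_id: str
    body: Optional[bytes]  # None means this entry is a delete marker
    written_day: int


class VersionedBucket:
    """Toy model of a versioning-enabled bucket."""

    def __init__(self):
        self._versions = {}  # key -> list of Version, oldest first
        self._ids = itertools.count(1)

    def _append(self, key, body, day):
        version = Version(f"v{next(self._ids)}", body, day)
        self._versions.setdefault(key, []).append(version)
        return version.version_id

    def put(self, key, body, day):
        # Overwrites never replace data; they stack a new version on top.
        return self._append(key, body, day)

    def delete(self, key, day):
        # A plain delete only adds a delete marker as the current version.
        return self._append(key, None, day)

    def get(self, key, version_id=None):
        versions = self._versions.get(key, [])
        if version_id is None:
            return versions[-1].body if versions else None
        for version in versions:
            if version.version_id == version_id:
                return version.body
        return None

    def expire_noncurrent(self, today, days=30):
        """Drop versions that have been noncurrent for at least `days` days."""
        for key, versions in self._versions.items():
            kept = [
                old for old, successor in zip(versions, versions[1:])
                if today - successor.written_day < days
            ]
            kept.append(versions[-1])  # the current version is always kept
            self._versions[key] = kept


bucket = VersionedBucket()
v1 = bucket.put("design.fig", b"draft", day=0)
v2 = bucket.put("design.fig", b"final", day=10)
bucket.delete("design.fig", day=20)

print(bucket.get("design.fig"))      # None: the delete marker is current
print(bucket.get("design.fig", v2))  # b'final': still retrievable by version ID

bucket.expire_noncurrent(today=45, days=30)
print(bucket.get("design.fig", v1))  # None: noncurrent since day 10, expired
print(bucket.get("design.fig", v2))  # b'final': noncurrent since day 20, kept
```

The 30-day window bounds how many historical versions are billed at any time while still allowing recovery from recent mistakes.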
What happens when an object is deleted from a versioning-enabled bucket?
Cross-Region Replication
Cross-Region Replication (CRR) automatically copies objects from a source bucket to a destination bucket in a different AWS region. Use cases: disaster recovery (data survives a regional outage), latency reduction (serve European users from eu-west-1 instead of us-east-1), and compliance (data residency requirements for specific jurisdictions).
CRR requires versioning to be enabled on both source and destination buckets. Replication is asynchronous: there is a lag between the write on the source and availability on the destination, so CRR on its own cannot provide a zero RPO (Recovery Point Objective). For scenarios that need a bounded, monitored RPO, S3 Replication Time Control (RTC) is designed to replicate 99.99% of new objects within 15 minutes, backed by an SLA and replication metrics.
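A short sketch shows why asynchronous replication implies a nonzero RPO. The timestamps below are invented for illustration, and `Write`, `lost_on_failure`, and `worst_lag` are hypothetical helpers rather than anything provided by S3.

```python
from dataclasses import dataclass


@dataclass
class Write:
    key: str
    written_at: float     # seconds; write acknowledged by the source region
    replicated_at: float  # seconds; copy readable in the destination region


def lost_on_failure(writes, failure_time):
    """Keys acknowledged by the source but not yet replicated when it fails."""
    return [
        w.key for w in writes
        if w.written_at <= failure_time < w.replicated_at
    ]


def worst_lag(writes):
    """Largest observed gap between write and replica availability."""
    return max(w.replicated_at - w.written_at for w in writes)


writes = [
    Write("orders/1001.json", written_at=0.0, replicated_at=4.0),
    Write("orders/1002.json", written_at=10.0, replicated_at=70.0),
    Write("orders/1003.json", written_at=12.0, replicated_at=15.0),
]

print(lost_on_failure(writes, failure_time=14.0))
# ['orders/1002.json', 'orders/1003.json']
print(worst_lag(writes))
# 60.0
```

Any write that falls inside the replication lag at the moment of a regional failure exists only in the failed region, which is the exposure a replication-time SLA bounds but does not eliminate.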
Misconception: S3 stores objects like files in a hierarchical directory structure.
Reality: S3 is a flat key-value store. The entire key, including any "/" characters, is one string; directories are a UI convention imposed by tools like the AWS Console.
This distinction matters for performance: listing 1 million objects under a common prefix is a sequence of paginated API calls (at most 1,000 keys per response, so at least 1,000 requests made one after another), not a single directory read. Applications that treat S3 like a filesystem often hit API rate limits and latency surprises.
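The cost of listing can be estimated with a small simulation. The sketch below assumes a page size of 1,000 keys, which is the maximum a single S3 list response returns; `list_all` and the generated keys are hypothetical, and each loop iteration stands in for one API request.

```python
import math

MAX_KEYS_PER_PAGE = 1000  # upper bound on keys returned by one list response


def list_all(keys, prefix, page_size=MAX_KEYS_PER_PAGE):
    """Simulate a paginated listing; returns (matching_keys, request_count)."""
    matching = sorted(k for k in keys if k.startswith(prefix))
    results, requests, start = [], 0, 0
    while True:
        page = matching[start:start + page_size]
        requests += 1  # one request per page, issued sequentially
        results.extend(page)
        start += page_size
        if start >= len(matching):
            break
    return results, requests


keys = [f"logs/2024/{i:07d}.json" for i in range(2500)] + ["images/a.png"]
results, requests = list_all(keys, prefix="logs/")
print(len(results), requests)  # 2500 3

# The same arithmetic for the 1 million objects mentioned above:
print(math.ceil(1_000_000 / MAX_KEYS_PER_PAGE))  # 1000
```

Because each page depends on the continuation point of the previous one, the requests under a single prefix cannot simply be issued in parallel.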
What is the primary purpose of Cross-Region Replication (CRR)?
Key Ideas
- **Buckets:** the top-level container - globally unique name, single AWS region, configurable access control; everything inside is a flat namespace of keys (no real directories)
- **Lifecycle policies:** rules that automatically transition objects between storage classes (Standard -> Intelligent-Tiering -> Glacier) or delete them after a specified number of days
- **Versioning:** when enabled, S3 keeps every version of every object; deletes create a delete marker rather than removing data; point-in-time recovery becomes possible
- **Cross-region replication:** automatic replication of objects to a bucket in another AWS region for disaster recovery or lower latency reads in specific geographies
Related Topics
Object storage is the foundation layer:
- Block and File Storage — Object, block, and file storage serve different use cases - understanding all three determines the right choice for each workload
- Event-Driven Architecture — S3 event notifications (s3:ObjectCreated) trigger Lambda and SQS for async processing pipelines (image resizing, virus scanning)
- Data Lakes and Analytics — S3 is the standard data lake storage layer; Athena queries S3 directly without a separate database
Questions for Reflection
- S3 versioning stores every version of every object. For a bucket that receives 10,000 small updates per day, how does the cost model change over 5 years - and what lifecycle rules prevent unbounded storage cost?
- Pre-signed URLs let clients upload directly to S3 without going through the backend. What are the security tradeoffs: what can an attacker do with a leaked pre-signed PUT URL, and how does the expiry time bound the risk?
- Cross-region replication is asynchronous with a lag of minutes. For a financial application where data must be available in two regions simultaneously with zero data loss, is S3 CRR sufficient - or does the architecture require a synchronous write path?