Databases

MongoDB: Document Database in Production

When Forbes migrated its site to a new platform, it needed a database for articles with unpredictable metadata structure. Thirty years ago an article was text and an author. Today it is text, author, tags, SEO metadata, embed codes, A/B headline variants, and translation versions. MongoDB allowed schema changes without painful migrations.

  • **Cisco**: MongoDB for IoT analytics - 500 million events per day from devices worldwide.
  • **Adobe Experience Manager**: document model for managing content across millions of media files.
  • **eBay**: MongoDB for the listings system - the flexible schema handles different product types with unique attributes.

The Document Data Model

MongoDB stores data as BSON (Binary JSON) documents in collections. BSON extends JSON with types such as Date, ObjectId, Binary, and Decimal128. An ObjectId is a 12-byte identifier that embeds a creation timestamp, enabling time-based sorting without an additional field.

The central data modeling decision is embed vs reference. If data is read together - embed (nested document). If data is updated independently or its size is unbounded - use a reference. Airbnb embeds listing photos in the document (read together), but stores reviews in a separate collection (potentially thousands of them).

The 16 MB rule: a MongoDB document is limited to 16 MB. eBay discovered this the hard way: a bid_history collection was accumulating bids directly inside the listing document. After thousands of bids, the document hit the limit. Fix: move bids to a separate collection with a reference to the listing.

When is it correct to use a reference instead of an embed in MongoDB?

0

1

Sign In

Aggregation Pipeline

The aggregation pipeline is MongoDB's analytics engine. Data passes through a chain of stages: $match (filtering), $group (aggregation), $sort, $lookup (join), $project (transformation), and $unwind (array unwinding). MongoDB executes the pipeline on the database side - data is not transferred to the application.

$lookup in MongoDB is the equivalent of JOIN, but it is expensive without an index. MongoDB Atlas automatically suggests creating required indexes via the Performance Advisor when slow queries are detected.

What does the $unwind stage do in an aggregation pipeline?

Indexes in MongoDB

MongoDB supports several index types. Single field: basic B-Tree. Compound index: multiple fields, with the ESR rule (Equality-Sort-Range) determining the correct field order. Multikey index: for arrays. Text index: full-text search. Geospatial: 2dsphere for coordinates.

A covering index eliminates the need to fetch the actual document. If the index contains all fields required by the query, MongoDB reads only the index. This can reduce I/O by 10x on large collections.

For the query find({ country: 'US', age: { $gt: 18 } }).sort({ name: 1 }), which index is optimal per the ESR rule?

Sharding and MongoDB Atlas

MongoDB sharding splits a collection into chunks and distributes them across shard servers. The shard key determines how documents are distributed. A hashed shard key distributes write load evenly - no hot shards. Range sharding is convenient for range queries but risks a hotspot when the key is monotonically increasing.

MongoDB Atlas is the managed service used by SEGA, Toyota, and Forbes. Atlas Search (Lucene under the hood) adds full-text search without a separate Elasticsearch cluster. Atlas Vector Search enables RAG applications and semantic search without additional infrastructure.

MongoDB does not support transactions and therefore cannot be used for financial data

MongoDB has supported multi-document ACID transactions across collections since version 4.0. Brex and other fintech companies use MongoDB in production.

This misconception stems from MongoDB's early versions (before 4.0) where multi-document transactions were not supported. Modern MongoDB is ACID-compliant.

Why is a monotonically increasing shard key (such as timestamp) problematic?

Key Ideas

  • **Embed vs Reference**: embed data that is read together; use references for unbounded arrays (16 MB document limit).
  • **Aggregation Pipeline**: $match -> $unwind -> $lookup -> $group -> $sort - analytics executed on the database side.
  • **ESR Rule**: compound index field order is Equality, Sort, Range. Violating the order makes the index inefficient.
  • **Shard key choice**: hashed for even writes, range for query efficiency. Monotonically increasing keys create hotspots.

Related Topics

MongoDB fits into the broader context of horizontal scaling and polyglot persistence.

  • Sharding — MongoDB sharding is a practical example of horizontal data partitioning.
  • B-Tree Indexes — MongoDB uses B-Tree for all indexes except hashed and text.
  • Polyglot Persistence — MongoDB is frequently combined with Redis and Elasticsearch in a single stack.

Вопросы для размышления

  • A blogging application has posts, comments, likes, and tags. What would you embed and what would you reference, and why?
  • When is the aggregation pipeline in MongoDB preferable to loading data into Python for processing?
  • MongoDB Atlas offers Vector Search. How does this change the architecture of AI applications using RAG?

Связанные уроки

  • dist-12-consistency
MongoDB: Document Database in Production