Big Data
MapReduce: the paradigm
Цели урока
- Understand the role of each phase - Map, Shuffle, Reduce - in the pipeline
- Explain why Shuffle is the bottleneck and how to optimize it
- Apply Combiners only for associative and commutative operations
- Know the limitations of MapReduce and when to use Spark instead
Предварительные знания
2008. The New York Times wanted to digitize 11 million newspaper articles spanning 150 years. On a single server - months of work. Engineer Derek Gottfrid fired up a MapReduce job on 100 Amazon EC2 virtual machines. Result: 4 TB processed in 24 hours, cost USD 240. The divide-and-conquer principle, scaled to hundreds of machines.
- **Google** originally used MapReduce to build the search index of the entire web - trillions of pages
- **Spotify** processes billions of listening events through MapReduce-like pipelines for Wrapped (year-end recap)
- **NASA** uses MapReduce to analyze terabytes of satellite imagery and telemetry
Jeff Dean, Sanjay Ghemawat, and the paper that transformed Big Data
In 2004 Google engineers Jeff Dean and Sanjay Ghemawat published "MapReduce: Simplified Data Processing on Large Clusters." They did not invent map and reduce - those came from functional programming (Lisp, 1960s). But Google grabbed the abstraction and forged an industrial system that automatically handled node failures and load balancing. The paper sparked Apache Hadoop in 2006 - the open implementation that democratized Big Data.
Map - data transformation
The **Map** phase kicks things off. Each cluster node grabs its data block from HDFS and applies a mapper function. The mapper reads input `(key, value)` pairs and emits intermediate `(key', value')` pairs. The key point: **mappers run in parallel and independently** - each one chews on its own block, oblivious to the others.
In real Hadoop MapReduce the mapper reads data line by line. Input key: the byte offset of the line in the file. Input value: the line text. The mapper emits new key-value pairs.
**Hadoop Streaming** lets mapper and reducer get written in any language (Python, Ruby, bash). Data flows through stdin/stdout. Production usually picks Java (native API), but Streaming is ideal for prototypes.
One mapper processes exactly one HDFS block (128 MB by default). A 10-block file launches 10 mappers. 100 single-block files launches 100 mappers. The number of mappers is set by the **number of input splits (blocks)**, not the number of nodes.
A file in HDFS occupies 5 blocks of 128 MB each. How many mapper tasks will be launched?
Shuffle - grouping and sorting
After Map the data is scattered: `(hello, 1)` pairs land on different nodes. Before Reduce, every value for the same key must be **gathered on the same reducer**. That is Shuffle's job - the most expensive phase of MapReduce, because it ships **data over the network** between nodes. Exactly why Apache Spark displaced Hadoop MapReduce for iterative workloads.
Shuffle breaks into several sub-steps. Each mapper sorts its output by key and writes it to local disk. The **Partitioner** decides which reducer each key goes to (usually `hash(key) % num_reducers`). Data ships over the network and gets **merge-sorted** on the reducer side.
**Shuffle is MapReduce's main performance enemy.** All intermediate data hits disk (mapper side), ships over the network, hits disk again (reducer side). For 1 TB of intermediate data that means ~3 TB disk I/O + 1 TB network I/O. Apache Spark sidesteps it by keeping data in memory.
**Custom Partitioner** controls which keys land on which reducer. The default is HashPartitioner. But skewed data (one key dominating 90% of records) creates a bottleneck on a single reducer while the others sit idle.
**Data Skew** is one of the worst problems in MapReduce. If the key "null" lands in 50% of records, one reducer takes 50% of all the work. Fixes: salting (slap a random suffix on the key), pre-aggregation, or a custom partitioner.
Why is Shuffle the slowest phase of MapReduce?
Reduce - aggregation of results
After Shuffle each reducer receives **all values for its set of keys**, pre-sorted. The reducer function aggregates: sum, average, max - whatever the task demands. The result lands back in HDFS.
For the reducer to work correctly, the aggregation function must satisfy certain properties. **Associativity** means grouping order does not matter: `(a + b) + c = a + (b + c)`. **Commutativity** means operand order does not matter: `a + b = b + a`. Together they guarantee correct results during parallel execution.
| Operation | Associative | Commutative | Suitable for Reduce |
|---|---|---|---|
| SUM | Yes | Yes | Ideal |
| COUNT | Yes | Yes | Ideal |
| MAX / MIN | Yes | Yes | Ideal |
| AVERAGE | No! | Yes | Must store sum + count, not avg |
| CONCAT (strings) | Yes | No! | Result depends on order |
| MEDIAN | No! | No! | Requires all data at once |
Always test mapper and reducer **locally** via pipe (`mapper | sort | reducer`) before launching on the cluster. Cluster debugging eats minutes of startup time with logs scattered across nodes. A local test runs in seconds.
Computing the median salary per department via MapReduce. The mapper emits (department, salary). What problem will arise in Reduce?
Combiners - local optimization
Shuffle is a bottleneck because of network I/O. What if the data volume could be **shrunk BEFORE shipping it over the network**? Enter the **Combiner** - a mini-reducer that runs **on the mapper side**, right after the Map phase.
A Combiner is the same reducer, executed **locally** on the mapper node. Not every operation supports a combiner. The rule: a combiner is fair game whenever the **aggregation function is associative and commutative**.
| Operation | Combiner possible? | Why |
|---|---|---|
| SUM | Yes | sum(1,1) + sum(1,1) = sum(1,1,1,1) = 4 |
| COUNT | Yes | count→sum: (word,1)+(word,1) → (word,2) |
| MAX / MIN | Yes | max(max(a,b), max(c,d)) = max(a,b,c,d) |
| AVERAGE | Not directly | avg(avg(1,3), avg(5,7)) = avg(2,6) = 4 != avg(1,3,5,7) = 4. Pass (sum, count) instead |
| MEDIAN | No | Median of sub-sets != overall median |
| DISTINCT | Partially | Local distinct reduces data, but final dedup needed in reducer |
In practice a combiner can shrink network traffic by **tens of times**. Word Count on natural language (where words repeat frequently) saves 80-90%. Data with unique keys (UUID) - combiner is useless.
**Hadoop does not guarantee the Combiner gets called.** The framework may call it 0, 1, or multiple times. So the combiner is an **optimization**, not part of the logic. The result must stay correct even without it.
MapReduce is suitable for any distributed computation - it is a universal paradigm.
MapReduce is a poor fit for iterative algorithms (ML, PageRank), real-time processing, and graph tasks. For those there are Spark (in-memory), Flink (streaming), GraphX (graphs).
Every MapReduce iteration writes intermediate data to disk. The PageRank algorithm requires 20-50 iterations - that is 20-50 cycles of disk I/O. Spark keeps data in RAM between iterations, which is 10-100x faster. For each class of task there is an optimal tool.
Computing the median salary via MapReduce. Can a Combiner be used?
Key Takeaways
- **Map** - parallel transformation: each mapper processes one HDFS block, emitting (key, value) pairs. Remember 11 million NYT articles? Each article was processed by a separate mapper on its own node
- **Shuffle** - the most expensive phase: grouping by key + sorting + transferring data over the network between nodes
- **Reduce** - aggregation: the function must be associative and commutative for correct parallel execution
- **Combiner** - optimization: a local reduce on the mapper node, reduces network traffic dramatically. Works only for associative operations
- MapReduce is not a silver bullet: for iterative, real-time, and graph tasks use Spark, Flink, GraphX
Related Topics
MapReduce is covered in detail. The next step is Apache Spark, which solves the main problem of MapReduce (disk I/O):
- Hadoop Ecosystem — MapReduce is one of the three pillars of Hadoop (HDFS + YARN + MR)
- What is Big Data: 5V — MapReduce solves the Volume problem - processing petabytes of data
Вопросы для размышления
- Why is the combiner for Word Count identical to the reducer, but for computing an average it is not? What mathematical property is at play?
- Consider a scenario where 90% of the data has the key 'null'. How does this affect Shuffle and Reduce? How can the problem be fixed?
- If MapReduce stored intermediate data in memory instead of on disk - what new problems would arise? (Hint: what if a node crashes?)
Связанные уроки
- bd-02 — HDFS block structure is the foundation of the Map phase
- bd-01 — The 5V concept and Volume as the reason MapReduce exists
- bd-04 — Apache Spark solves MapReduce's main weakness - disk I/O between iterations
- alg-01 — Associativity and commutativity determine Combiner applicability
- alg-04 — Divide and conquer is the fundamental idea behind Map/Reduce task splitting
- alg-19-divide-conquer