Big Data

MapReduce: the paradigm

Цели урока

Understand the role of each phase - Map, Shuffle, Reduce - in the pipeline
Explain why Shuffle is the bottleneck and how to optimize it
Apply Combiners only for associative and commutative operations
Know the limitations of MapReduce and when to use Spark instead

Предварительные знания

Hadoop Ecosystem

2008. The New York Times wanted to digitize 11 million newspaper articles spanning 150 years. On a single server - months of work. Engineer Derek Gottfrid fired up a MapReduce job on 100 Amazon EC2 virtual machines. Result: 4 TB processed in 24 hours, cost USD 240. The divide-and-conquer principle, scaled to hundreds of machines.

**Google** originally used MapReduce to build the search index of the entire web - trillions of pages
**Spotify** processes billions of listening events through MapReduce-like pipelines for Wrapped (year-end recap)
**NASA** uses MapReduce to analyze terabytes of satellite imagery and telemetry

Jeff Dean, Sanjay Ghemawat, and the paper that transformed Big Data

In 2004 Google engineers Jeff Dean and Sanjay Ghemawat published "MapReduce: Simplified Data Processing on Large Clusters." They did not invent map and reduce - those came from functional programming (Lisp, 1960s). But Google grabbed the abstraction and forged an industrial system that automatically handled node failures and load balancing. The paper sparked Apache Hadoop in 2006 - the open implementation that democratized Big Data.

Map - data transformation

The **Map** phase kicks things off. Each cluster node grabs its data block from HDFS and applies a mapper function. The mapper reads input `(key, value)` pairs and emits intermediate `(key', value')` pairs. The key point: **mappers run in parallel and independently** - each one chews on its own block, oblivious to the others.

In real Hadoop MapReduce the mapper reads data line by line. Input key: the byte offset of the line in the file. Input value: the line text. The mapper emits new key-value pairs.

**Hadoop Streaming** lets mapper and reducer get written in any language (Python, Ruby, bash). Data flows through stdin/stdout. Production usually picks Java (native API), but Streaming is ideal for prototypes.

One mapper processes exactly one HDFS block (128 MB by default). A 10-block file launches 10 mappers. 100 single-block files launches 100 mappers. The number of mappers is set by the **number of input splits (blocks)**, not the number of nodes.

A file in HDFS occupies 5 blocks of 128 MB each. How many mapper tasks will be launched?

Shuffle - grouping and sorting

After Map the data is scattered: `(hello, 1)` pairs land on different nodes. Before Reduce, every value for the same key must be **gathered on the same reducer**. That is Shuffle's job - the most expensive phase of MapReduce, because it ships **data over the network** between nodes. Exactly why Apache Spark displaced Hadoop MapReduce for iterative workloads.

Shuffle breaks into several sub-steps. Each mapper sorts its output by key and writes it to local disk. The **Partitioner** decides which reducer each key goes to (usually `hash(key) % num_reducers`). Data ships over the network and gets **merge-sorted** on the reducer side.

**Shuffle is MapReduce's main performance enemy.** All intermediate data hits disk (mapper side), ships over the network, hits disk again (reducer side). For 1 TB of intermediate data that means ~3 TB disk I/O + 1 TB network I/O. Apache Spark sidesteps it by keeping data in memory.

**Custom Partitioner** controls which keys land on which reducer. The default is HashPartitioner. But skewed data (one key dominating 90% of records) creates a bottleneck on a single reducer while the others sit idle.

**Data Skew** is one of the worst problems in MapReduce. If the key "null" lands in 50% of records, one reducer takes 50% of all the work. Fixes: salting (slap a random suffix on the key), pre-aggregation, or a custom partitioner.

Why is Shuffle the slowest phase of MapReduce?

Reduce - aggregation of results

After Shuffle each reducer receives **all values for its set of keys**, pre-sorted. The reducer function aggregates: sum, average, max - whatever the task demands. The result lands back in HDFS.

For the reducer to work correctly, the aggregation function must satisfy certain properties. **Associativity** means grouping order does not matter: `(a + b) + c = a + (b + c)`. **Commutativity** means operand order does not matter: `a + b = b + a`. Together they guarantee correct results during parallel execution.

Operation	Associative	Commutative	Suitable for Reduce
SUM	Yes	Yes	Ideal
COUNT	Yes	Yes	Ideal
MAX / MIN	Yes	Yes	Ideal
AVERAGE	No!	Yes	Must store sum + count, not avg
CONCAT (strings)	Yes	No!	Result depends on order
MEDIAN	No!	No!	Requires all data at once

Always test mapper and reducer **locally** via pipe (`mapper | sort | reducer`) before launching on the cluster. Cluster debugging eats minutes of startup time with logs scattered across nodes. A local test runs in seconds.

Computing the median salary per department via MapReduce. The mapper emits (department, salary). What problem will arise in Reduce?

Combiners - local optimization

Shuffle is a bottleneck because of network I/O. What if the data volume could be **shrunk BEFORE shipping it over the network**? Enter the **Combiner** - a mini-reducer that runs **on the mapper side**, right after the Map phase.

A Combiner is the same reducer, executed **locally** on the mapper node. Not every operation supports a combiner. The rule: a combiner is fair game whenever the **aggregation function is associative and commutative**.

Operation	Combiner possible?	Why
SUM	Yes	sum(1,1) + sum(1,1) = sum(1,1,1,1) = 4
COUNT	Yes	count→sum: (word,1)+(word,1) → (word,2)
MAX / MIN	Yes	max(max(a,b), max(c,d)) = max(a,b,c,d)
AVERAGE	Not directly	avg(avg(1,3), avg(5,7)) = avg(2,6) = 4 != avg(1,3,5,7) = 4. Pass (sum, count) instead
MEDIAN	No	Median of sub-sets != overall median
DISTINCT	Partially	Local distinct reduces data, but final dedup needed in reducer

In practice a combiner can shrink network traffic by **tens of times**. Word Count on natural language (where words repeat frequently) saves 80-90%. Data with unique keys (UUID) - combiner is useless.

**Hadoop does not guarantee the Combiner gets called.** The framework may call it 0, 1, or multiple times. So the combiner is an **optimization**, not part of the logic. The result must stay correct even without it.

MapReduce is suitable for any distributed computation - it is a universal paradigm.

MapReduce is a poor fit for iterative algorithms (ML, PageRank), real-time processing, and graph tasks. For those there are Spark (in-memory), Flink (streaming), GraphX (graphs).

Every MapReduce iteration writes intermediate data to disk. The PageRank algorithm requires 20-50 iterations - that is 20-50 cycles of disk I/O. Spark keeps data in RAM between iterations, which is 10-100x faster. For each class of task there is an optimal tool.

Computing the median salary via MapReduce. Can a Combiner be used?

Key Takeaways

**Map** - parallel transformation: each mapper processes one HDFS block, emitting (key, value) pairs. Remember 11 million NYT articles? Each article was processed by a separate mapper on its own node
**Shuffle** - the most expensive phase: grouping by key + sorting + transferring data over the network between nodes
**Reduce** - aggregation: the function must be associative and commutative for correct parallel execution
**Combiner** - optimization: a local reduce on the mapper node, reduces network traffic dramatically. Works only for associative operations
MapReduce is not a silver bullet: for iterative, real-time, and graph tasks use Spark, Flink, GraphX

Вопросы для размышления

Why is the combiner for Word Count identical to the reducer, but for computing an average it is not? What mathematical property is at play?
Consider a scenario where 90% of the data has the key 'null'. How does this affect Shuffle and Reduce? How can the problem be fixed?
If MapReduce stored intermediate data in memory instead of on disk - what new problems would arise? (Hint: what if a node crashes?)

Связанные уроки

bd-02 — HDFS block structure is the foundation of the Map phase
bd-01 — The 5V concept and Volume as the reason MapReduce exists
bd-04 — Apache Spark solves MapReduce's main weakness - disk I/O between iterations
alg-01 — Associativity and commutativity determine Combiner applicability
alg-04 — Divide and conquer is the fundamental idea behind Map/Reduce task splitting
alg-19-divide-conquer

Big Data

MapReduce: the paradigm

Цели урока

Understand the role of each phase - Map, Shuffle, Reduce - in the pipeline
Explain why Shuffle is the bottleneck and how to optimize it
Apply Combiners only for associative and commutative operations
Know the limitations of MapReduce and when to use Spark instead

Предварительные знания

Hadoop Ecosystem

**Google** originally used MapReduce to build the search index of the entire web - trillions of pages
**Spotify** processes billions of listening events through MapReduce-like pipelines for Wrapped (year-end recap)
**NASA** uses MapReduce to analyze terabytes of satellite imagery and telemetry

Jeff Dean, Sanjay Ghemawat, and the paper that transformed Big Data

Map - data transformation

In real Hadoop MapReduce the mapper reads data line by line. Input key: the byte offset of the line in the file. Input value: the line text. The mapper emits new key-value pairs.

A file in HDFS occupies 5 blocks of 128 MB each. How many mapper tasks will be launched?

Shuffle - grouping and sorting

Why is Shuffle the slowest phase of MapReduce?

Reduce - aggregation of results

After Shuffle each reducer receives **all values for its set of keys**, pre-sorted. The reducer function aggregates: sum, average, max - whatever the task demands. The result lands back in HDFS.

Operation	Associative	Commutative	Suitable for Reduce
SUM	Yes	Yes	Ideal
COUNT	Yes	Yes	Ideal
MAX / MIN	Yes	Yes	Ideal
AVERAGE	No!	Yes	Must store sum + count, not avg
CONCAT (strings)	Yes	No!	Result depends on order
MEDIAN	No!	No!	Requires all data at once

Computing the median salary per department via MapReduce. The mapper emits (department, salary). What problem will arise in Reduce?

Combiners - local optimization

Operation	Combiner possible?	Why
SUM	Yes	sum(1,1) + sum(1,1) = sum(1,1,1,1) = 4
COUNT	Yes	count→sum: (word,1)+(word,1) → (word,2)
MAX / MIN	Yes	max(max(a,b), max(c,d)) = max(a,b,c,d)
AVERAGE	Not directly	avg(avg(1,3), avg(5,7)) = avg(2,6) = 4 != avg(1,3,5,7) = 4. Pass (sum, count) instead
MEDIAN	No	Median of sub-sets != overall median
DISTINCT	Partially	Local distinct reduces data, but final dedup needed in reducer

MapReduce is suitable for any distributed computation - it is a universal paradigm.

MapReduce is a poor fit for iterative algorithms (ML, PageRank), real-time processing, and graph tasks. For those there are Spark (in-memory), Flink (streaming), GraphX (graphs).

Computing the median salary via MapReduce. Can a Combiner be used?

Key Takeaways

**Map** - parallel transformation: each mapper processes one HDFS block, emitting (key, value) pairs. Remember 11 million NYT articles? Each article was processed by a separate mapper on its own node
**Shuffle** - the most expensive phase: grouping by key + sorting + transferring data over the network between nodes
**Reduce** - aggregation: the function must be associative and commutative for correct parallel execution
**Combiner** - optimization: a local reduce on the mapper node, reduces network traffic dramatically. Works only for associative operations
MapReduce is not a silver bullet: for iterative, real-time, and graph tasks use Spark, Flink, GraphX

Вопросы для размышления

Why is the combiner for Word Count identical to the reducer, but for computing an average it is not? What mathematical property is at play?
Consider a scenario where 90% of the data has the key 'null'. How does this affect Shuffle and Reduce? How can the problem be fixed?
If MapReduce stored intermediate data in memory instead of on disk - what new problems would arise? (Hint: what if a node crashes?)

Связанные уроки

bd-02 — HDFS block structure is the foundation of the Map phase
bd-01 — The 5V concept and Volume as the reason MapReduce exists
bd-04 — Apache Spark solves MapReduce's main weakness - disk I/O between iterations
alg-01 — Associativity and commutativity determine Combiner applicability
alg-04 — Divide and conquer is the fundamental idea behind Map/Reduce task splitting
alg-19-divide-conquer

MapReduce: the paradigm

Цели урока

Предварительные знания

Jeff Dean, Sanjay Ghemawat, and the paper that transformed Big Data

Map - data transformation

Shuffle - grouping and sorting

Reduce - aggregation of results

Combiners - local optimization

Key Takeaways

Related Topics

Вопросы для размышления

Связанные уроки

MapReduce: the paradigm

Цели урока

Предварительные знания

Jeff Dean, Sanjay Ghemawat, and the paper that transformed Big Data

Map - data transformation

Shuffle - grouping and sorting

Reduce - aggregation of results

Combiners - local optimization

Key Takeaways

Related Topics

Вопросы для размышления

Связанные уроки