Big Data

Hadoop Ecosystem

2008. Twitter: 100 tweets per second. 2023: 6,000 tweets per second. Instagram: 95M photos per day. Hadoop was a Facebook pet project in 2006 - today it is Apache Hadoop, Apache Spark, and cloud storage at petabyte scale. Doug Cutting named the project after his son's yellow toy elephant. That elephant grew into an ecosystem that processes a significant fraction of the world's data.

**Yahoo** - one of the first production clusters: 4,000+ nodes, petabytes of data for search indexing
**Facebook** - the world's largest Hadoop cluster: 600+ PB in HDFS, thousands of Hive queries daily
**LinkedIn** - Hadoop + Kafka (created at LinkedIn) for processing billions of user activity events
**Alibaba** - the largest Hadoop cluster in Asia, processing data for 900M+ users

Doug Cutting and the birth of Hadoop

Doug Cutting was building the Nutch search engine and needed a way to process the entire web. When Google published papers on GFS (2003) and MapReduce (2004), Cutting and Mike Cafarella implemented them in open source. The project was named Hadoop - after Cutting's son's yellow plush elephant. Yahoo hired Cutting in 2006 and made Hadoop the backbone of their infrastructure. Cafarella was also co-author of the web crawler that became the foundation of Nutch and influenced the entire Big Data stack.

Предварительные знания

What is Big Data: 5V

HDFS: a file system across thousands of servers

2006. Facebook stores photos on ordinary servers. Five years later: 100 PB of data. No single disk holds that. **HDFS (Hadoop Distributed File System)** solves this radically: a file is split into 128 MB blocks, each block copied to 3 different servers. One dies - the other two hold a full copy. Facebook today: 600+ PB in HDFS, thousands of Hive queries daily.

Parameter	Value	Why
Block size	128 MB (default)	1 TB file = ~8,000 blocks, not millions. Minimizes metadata overhead
Replication factor	3 (default)	A block on 3 different nodes. Lose 2 nodes - data is intact
NameNode	1 (+ standby)	Stores the ENTIRE map: which block is on which node. Single point of failure
DataNode	Hundreds/thousands	Store data blocks. Send heartbeat every 3 seconds

HDFS is optimized for the **Write-Once-Read-Many (WORM)** pattern. Files are written once and read many times. Row 42 in a file cannot be changed - this is not a database. But a terabyte-sized file can be read sequentially at enormous speed. Netflix, Yahoo, and LinkedIn build their analytics pipelines on exactly this pattern.

**NameNode is a single point of failure.** If the NameNode goes down, the entire cluster becomes unavailable. In production: **NameNode HA** - active + standby NameNode with automatic failover via ZooKeeper. This is a CP choice from the CAP theorem: unavailability is preferable to inconsistent block mapping.

HDFS is not suitable for millions of small files. The NameNode stores metadata in RAM (~150 bytes per file). 100M files = 15 GB just for metadata. Solution: consolidate small files into SequenceFile or Avro containers.

A 500 MB file is uploaded to HDFS with a 128 MB block size and replication factor 3. How much space will it occupy?

MapReduce: move computation to data, not data to computation

2004. Google publishes 'MapReduce: Simplified Data Processing on Large Clusters'. Jeff Dean and Sanjay Ghemawat describe the paradigm that changes the industry. The insight is brilliantly simple: data sits on hundreds of servers - don't move it to the program, move the program to the data. A program weighs kilobytes. Data weighs terabytes. A gap of a million times.

MapReduce is inspired by functional programming: `map` is applied to each element, `reduce` aggregates the results. Divide and conquer at the distributed systems level - the same algorithmic pattern as merge sort, but across 1,000 machines.

**Word Count** - the Hello World of MapReduce. A terabyte file on a single server would take hours. With MapReduce on 100 nodes: minutes. Not because each node is smarter, but because each independently processes its slice of data in parallel.

**MapReduce's core problem is disk I/O.** Between Map and Reduce, all intermediate data is written to disk. For iterative algorithms (ML, PageRank) this is catastrophic: 10 iterations = 10 write/read cycles to disk. Spark keeps intermediate data in memory - hence the 10-100x speed improvement.

Why does MapReduce move computation to the data rather than data to computation?

YARN: one cluster for all frameworks

Hadoop 1.x: a cluster can only run MapReduce. Yahoo wants to run Spark and Hive alongside it - that means two separate clusters, duplicated data, double the cost. Hadoop 2.0 (2012) solves this with **YARN (Yet Another Resource Negotiator)**: one cluster, one resource pool, any framework.

**YARN** separates two concerns: resource management (who gets CPU and RAM) and task execution (how data is actually processed). MapReduce, Spark, Flink, and Hive now run on the same cluster simultaneously, without conflicts.

Component	Role	Analogy
ResourceManager	Distributes cluster resources	Airport dispatcher - decides who flies where
NodeManager	Resources on one node	Runway technician - watches over a specific aircraft
Container	Isolated portion of CPU+RAM	Seat - a fixed resource
ApplicationMaster	Coordinator for one application	Pilot - manages their own flight

YARN supports **schedulers**: CapacityScheduler (for multi-tenant clusters with per-team quotas) and FairScheduler (resources shared equally). Multiple teams sharing one cluster = CapacityScheduler.

A team wants to run Spark and MapReduce simultaneously on one cluster. What is needed?

Hadoop Ecosystem: what grew around the core

HDFS + MapReduce + YARN is the core. But in 15 years an ecosystem has grown around it that thousands of companies rely on. Hive turns SQL queries into MapReduce jobs. HBase gives random read/write on top of HDFS. Sqoop moves data from MySQL to HDFS in a single command. ZooKeeper holds all of it together.

Tool	Purpose	Key use case
Hive	SQL on Hadoop	SQL -> MapReduce/Spark. For analysts, not programmers
Pig	Scripted ETL pipelines	Pig Latin DSL. Easier than Java MapReduce
HBase	NoSQL (column-family) on HDFS	Google Bigtable clone. Billions of rows, random read/write, low latency
Sqoop	Import/Export RDBMS <-> HDFS	MySQL -> HDFS in one command. Parallel import
Flume	Log collection into HDFS	Agent -> Collector -> HDFS. For streaming data
ZooKeeper	Coordination of distributed systems	Leader election, distributed locks. Used by HBase, Kafka, YARN

The modern Big Data landscape has evolved: MapReduce gave way to **Apache Spark** (in-memory, 10-100x faster). Cloud solutions appeared: **AWS EMR**, **Google Dataproc**, **Azure HDInsight**. But HDFS and YARN remain the foundation for thousands of clusters. Spark does not replace Hadoop - it replaces MapReduce, running on top of the same HDFS and YARN.

**Apache Spark** does not replace Hadoop - it replaces **MapReduce**. Spark runs on top of HDFS and YARN, but keeps intermediate data in memory. For iterative tasks (ML, graph algorithms) Spark is 10-100x faster.

Hadoop is dead - it has been fully replaced by Spark and cloud solutions.

MapReduce is obsolete, but HDFS and YARN are alive and actively used. Spark runs ON TOP OF Hadoop.

Thousands of companies (banks, telecoms, retail) have on-premise Hadoop clusters with petabytes of data. Migration to the cloud takes years. HDFS stores data, YARN manages resources, Spark replaced MapReduce as the compute engine. Without understanding Hadoop architecture it is impossible to work effectively with Spark, Hive, and Kafka.

An analyst wants to run a SQL query against 500 GB of logs in HDFS. Which tool should they use?

Key Takeaways

**HDFS** - 128 MB blocks, x3 replication, NameNode + DataNodes. 600+ PB at Facebook, hundreds of thousands of servers at Yahoo
**MapReduce** - computation to data: Map (parallel) -> Shuffle (grouping) -> Reduce (aggregation). Slow due to disk I/O
**YARN** - resource manager: Spark, MapReduce, Flink on one cluster simultaneously
**Ecosystem** - Hive (SQL), HBase (NoSQL), Sqoop (import), Flume (logs), ZooKeeper (coordination)
**Evolution** - MapReduce replaced by Spark, HDFS+YARN remain the foundation, cloud added S3+EMR on top
**Doug Cutting's toy elephant** gave a name to the project that changed data processing worldwide

Вопросы для размышления

Why does HDFS use 128 MB blocks instead of 4 KB like regular file systems? What happens if the block size is reduced?
When should HBase be chosen over Hive? What requirements drive the choice?
Starting a Big Data project from scratch in 2024 - on-premise Hadoop or a cloud solution (AWS EMR)? What factors matter?

Связанные уроки

bd-01 — The 5V model and Big Data challenges from the first lesson
bd-03 — MapReduce in depth - the next step after the ecosystem overview
ds-02-cap-theorem — HDFS replication and NameNode HA are a live CAP trade-off
bt-13-kafka — Kafka often pairs with HDFS as a data source for Hadoop pipelines
alg-19-divide-conquer — MapReduce is divide and conquer at the distributed systems level
isd-08-database-selection

Big Data

Hadoop Ecosystem

**Yahoo** - one of the first production clusters: 4,000+ nodes, petabytes of data for search indexing
**Facebook** - the world's largest Hadoop cluster: 600+ PB in HDFS, thousands of Hive queries daily
**LinkedIn** - Hadoop + Kafka (created at LinkedIn) for processing billions of user activity events
**Alibaba** - the largest Hadoop cluster in Asia, processing data for 900M+ users

Doug Cutting and the birth of Hadoop

Предварительные знания

What is Big Data: 5V

HDFS: a file system across thousands of servers

Parameter	Value	Why
Block size	128 MB (default)	1 TB file = ~8,000 blocks, not millions. Minimizes metadata overhead
Replication factor	3 (default)	A block on 3 different nodes. Lose 2 nodes - data is intact
NameNode	1 (+ standby)	Stores the ENTIRE map: which block is on which node. Single point of failure
DataNode	Hundreds/thousands	Store data blocks. Send heartbeat every 3 seconds

A 500 MB file is uploaded to HDFS with a 128 MB block size and replication factor 3. How much space will it occupy?

MapReduce: move computation to data, not data to computation

Why does MapReduce move computation to the data rather than data to computation?

YARN: one cluster for all frameworks

Component	Role	Analogy
ResourceManager	Distributes cluster resources	Airport dispatcher - decides who flies where
NodeManager	Resources on one node	Runway technician - watches over a specific aircraft
Container	Isolated portion of CPU+RAM	Seat - a fixed resource
ApplicationMaster	Coordinator for one application	Pilot - manages their own flight

YARN supports **schedulers**: CapacityScheduler (for multi-tenant clusters with per-team quotas) and FairScheduler (resources shared equally). Multiple teams sharing one cluster = CapacityScheduler.

A team wants to run Spark and MapReduce simultaneously on one cluster. What is needed?

Hadoop Ecosystem: what grew around the core

Tool	Purpose	Key use case
Hive	SQL on Hadoop	SQL -> MapReduce/Spark. For analysts, not programmers
Pig	Scripted ETL pipelines	Pig Latin DSL. Easier than Java MapReduce
HBase	NoSQL (column-family) on HDFS	Google Bigtable clone. Billions of rows, random read/write, low latency
Sqoop	Import/Export RDBMS <-> HDFS	MySQL -> HDFS in one command. Parallel import
Flume	Log collection into HDFS	Agent -> Collector -> HDFS. For streaming data
ZooKeeper	Coordination of distributed systems	Leader election, distributed locks. Used by HBase, Kafka, YARN

Hadoop is dead - it has been fully replaced by Spark and cloud solutions.

MapReduce is obsolete, but HDFS and YARN are alive and actively used. Spark runs ON TOP OF Hadoop.

An analyst wants to run a SQL query against 500 GB of logs in HDFS. Which tool should they use?

Key Takeaways

**HDFS** - 128 MB blocks, x3 replication, NameNode + DataNodes. 600+ PB at Facebook, hundreds of thousands of servers at Yahoo
**MapReduce** - computation to data: Map (parallel) -> Shuffle (grouping) -> Reduce (aggregation). Slow due to disk I/O
**YARN** - resource manager: Spark, MapReduce, Flink on one cluster simultaneously
**Ecosystem** - Hive (SQL), HBase (NoSQL), Sqoop (import), Flume (logs), ZooKeeper (coordination)
**Evolution** - MapReduce replaced by Spark, HDFS+YARN remain the foundation, cloud added S3+EMR on top
**Doug Cutting's toy elephant** gave a name to the project that changed data processing worldwide

Вопросы для размышления

Why does HDFS use 128 MB blocks instead of 4 KB like regular file systems? What happens if the block size is reduced?
When should HBase be chosen over Hive? What requirements drive the choice?
Starting a Big Data project from scratch in 2024 - on-premise Hadoop or a cloud solution (AWS EMR)? What factors matter?

Связанные уроки

bd-01 — The 5V model and Big Data challenges from the first lesson
bd-03 — MapReduce in depth - the next step after the ecosystem overview
ds-02-cap-theorem — HDFS replication and NameNode HA are a live CAP trade-off
bt-13-kafka — Kafka often pairs with HDFS as a data source for Hadoop pipelines
alg-19-divide-conquer — MapReduce is divide and conquer at the distributed systems level
isd-08-database-selection

Hadoop Ecosystem

Doug Cutting and the birth of Hadoop

Предварительные знания

HDFS: a file system across thousands of servers

MapReduce: move computation to data, not data to computation

YARN: one cluster for all frameworks

Hadoop Ecosystem: what grew around the core

Key Takeaways

Related Topics

Вопросы для размышления

Связанные уроки

Hadoop Ecosystem

Doug Cutting and the birth of Hadoop

Предварительные знания

HDFS: a file system across thousands of servers

MapReduce: move computation to data, not data to computation

YARN: one cluster for all frameworks

Hadoop Ecosystem: what grew around the core

Key Takeaways

Related Topics

Вопросы для размышления

Связанные уроки