Big Data

Columnar Formats: Parquet and ORC

Facebook stores trillions of user action events. A query like 'how many times was the Like button clicked in January in Brazil' must execute in seconds, not hours. The secret is not cluster power but storage format. Parquet reads 2 columns out of 80 and skips 90% of rows via min/max statistics - long before the data reaches the CPU.

**Apache Spark** - reads/writes Parquet natively; it is the default format for `spark.write.save()`
**AWS Athena** - SQL over S3, optimized for Parquet/ORC with predicate pushdown
**Apache Iceberg / Delta Lake** - use Parquet as the physical format, adding ACID on top
**Pandas / PyArrow** - `pd.read_parquet()` is the standard way to work with large datasets in Python

Why columnar storage is faster for analytics

The query `SELECT AVG(price) FROM orders` touches only one column out of, say, 50. In row storage (PostgreSQL, CSV) the full record is read for each row - all 50 fields - even though only one is needed. At 1 billion rows of 500 bytes each this means 500 GB of I/O instead of 10 GB with a columnar approach. A 50x difference - purely from a different data layout.

A second gain: **homogeneous data** in one block compresses far better. A `country_code` column contains repeated strings like "US", "DE", "GB" - RLE (Run-Length Encoding) compresses this dramatically. Mixed row data compresses much worse.

Parquet (developed by Twitter and Cloudera in 2013) and ORC (developed at Hortonworks for Hive in 2013) are the two dominant columnar formats in the big data ecosystem. Both store data on disk column-wise but differ in encoding details and indexing.

Query `SELECT user_id, revenue FROM events` touches 2 out of 80 columns. How much less data does the columnar format read?

Parquet encoding: dictionary, RLE, bit-packing

Parquet does not simply rearrange data into columns - it applies several encoding layers before compression. The encoding algorithm is chosen per-column automatically based on data cardinality.

**Dictionary Encoding:** the `status` column has 3 unique values ("active", "inactive", "pending"). Parquet builds a dictionary {0:"active", 1:"inactive", 2:"pending"} and stores integer indexes instead of strings
**RLE (Run-Length Encoding):** a sequence 0,0,0,0,1,1,1,1,1 becomes (4x0, 5x1). Effective for sorted columns or columns with repetitions
**Bit-packing:** if values range from 0 to 3 (2 bits), why store them in 4 bytes? Bit-packing packs N values into the minimum number of bits

Codec	Decompress speed	Ratio	Usage
Snappy	Very fast	1.5-2x	Default in Parquet, Kafka
Gzip	Slow	2.5-3x	Archives, infrequent reads
Zstd	Fast	2.5-4x	Parquet 2.0+, LZ4 alternative
LZ4	Very fast	1.5x	Hot data, streaming

Dictionary encoding is most effective for:

Predicate Pushdown: skipping data without reading it

Parquet divides data into **row groups** (~128 MB by default). For each row group and each column, min/max statistics are stored in the file **footer**. For a query like `WHERE date > '2024-01-01'`, the engine reads the footer, compares against each row group's min/max, and skips row groups that provably contain no matching data - without reading them at all.

**Page-level statistics** go further: within a row group data is split into pages (~1 MB), and each page also stores min/max. **Bloom filters** (optional, Parquet 2.0+) allow an exact check of `WHERE user_id = 'abc123'` without scanning the page - a probabilistic filter with zero false negatives.

Predicate pushdown works only if data is **sorted** or **partitioned** by the filter column. Random order = random min/max in row groups = no skipping. This is why sort order at write time matters (`ZORDER BY` in Delta Lake).

Parquet predicate pushdown is most effective when:

Schema Evolution: changing the schema without migrations

In OLTP databases changing the schema is painful: `ALTER TABLE` on a billion-row table can take hours. Parquet handles this differently: each file is self-contained and stores its own schema in the footer. Files with different schemas can coexist in the same directory.

**Adding a column:** old files do not have `loyalty_tier` - at read time it is filled with `null`. New files are written with the column. Both types are read together without errors
**Removing a column:** new files do not contain `deprecated_field`. Old files contain it - Spark simply ignores it when querying without that field
**Renaming a column:** Parquet cannot help here - this is a semantic change requiring either an alias or file rewrite
**Type widening:** widening casts (INT32 -> INT64) are possible; narrowing (DOUBLE -> FLOAT) requires explicit cast and may lose precision

Feature	Parquet	ORC
Created	2013, Twitter+Cloudera	2013, Hortonworks for Hive
Ecosystem	Spark, Flink, Presto (de facto standard)	Hive (primary), Spark
Compression	Snappy/Gzip/Zstd	Zlib/Snappy/LZ4
Nested types	Excellent (Dremel encoding)	Good
Bloom filters	Parquet 2.0+	Built-in (ORC 1.0+)
ACID support	Via Delta Lake/Iceberg	Native in Hive ACID

Delta Lake adds **schema enforcement** on top of Parquet: writing data with an incompatible schema raises an exception. **Schema evolution** in Delta is enabled explicitly: `mergeSchema=true` at write time or `ALTER TABLE ADD COLUMN`. This combines Parquet's flexibility with Data Warehouse strictness.

Parquet and ORC are interchangeable - the choice does not matter

Parquet is the de facto standard in the Spark/Presto/Flink ecosystem with better support for nested structures. ORC is historically stronger in Hive and has native ACID without Delta Lake

ORC was built specifically for Hive and is optimized for its query engine. Parquet was originally designed around the Dremel nested data model and has broader cross-engine support. Choose based on query engine first, not abstract benchmarks

When a new column is added to Parquet files, what happens when old files (without the column) are read?