Big Data
Columnar Formats: Parquet and ORC
Facebook stores trillions of user action events. A query like 'how many times was the Like button clicked in January in Brazil' must execute in seconds, not hours. The secret is not cluster power but storage format. Parquet reads 2 columns out of 80 and skips 90% of rows via min/max statistics - long before the data reaches the CPU.
- **Apache Spark** - reads/writes Parquet natively; it is the default format for `spark.write.save()`
- **AWS Athena** - SQL over S3, optimized for Parquet/ORC with predicate pushdown
- **Apache Iceberg / Delta Lake** - use Parquet as the physical format, adding ACID on top
- **Pandas / PyArrow** - `pd.read_parquet()` is the standard way to work with large datasets in Python
Why columnar storage is faster for analytics
The query `SELECT AVG(price) FROM orders` touches only one column out of, say, 50. In row storage (PostgreSQL, CSV) the full record is read for each row - all 50 fields - even though only one is needed. At 1 billion rows of 500 bytes each this means 500 GB of I/O instead of 10 GB with a columnar approach. A 50x difference - purely from a different data layout.
A second gain: **homogeneous data** in one block compresses far better. A `country_code` column contains repeated strings like "US", "DE", "GB" - RLE (Run-Length Encoding) compresses this dramatically. Mixed row data compresses much worse.
Parquet (developed by Twitter and Cloudera in 2013) and ORC (developed at Hortonworks for Hive in 2013) are the two dominant columnar formats in the big data ecosystem. Both store data on disk column-wise but differ in encoding details and indexing.
Query `SELECT user_id, revenue FROM events` touches 2 out of 80 columns. How much less data does the columnar format read?
Parquet encoding: dictionary, RLE, bit-packing
Parquet does not simply rearrange data into columns - it applies several encoding layers before compression. The encoding algorithm is chosen per-column automatically based on data cardinality.
- **Dictionary Encoding:** the `status` column has 3 unique values ("active", "inactive", "pending"). Parquet builds a dictionary {0:"active", 1:"inactive", 2:"pending"} and stores integer indexes instead of strings
- **RLE (Run-Length Encoding):** a sequence 0,0,0,0,1,1,1,1,1 becomes (4x0, 5x1). Effective for sorted columns or columns with repetitions
- **Bit-packing:** if values range from 0 to 3 (2 bits), why store them in 4 bytes? Bit-packing packs N values into the minimum number of bits
| Codec | Decompress speed | Ratio | Usage |
|---|---|---|---|
| Snappy | Very fast | 1.5-2x | Default in Parquet, Kafka |
| Gzip | Slow | 2.5-3x | Archives, infrequent reads |
| Zstd | Fast | 2.5-4x | Parquet 2.0+, LZ4 alternative |
| LZ4 | Very fast | 1.5x | Hot data, streaming |
Dictionary encoding is most effective for:
Predicate Pushdown: skipping data without reading it
Parquet divides data into **row groups** (~128 MB by default). For each row group and each column, min/max statistics are stored in the file **footer**. For a query like `WHERE date > '2024-01-01'`, the engine reads the footer, compares against each row group's min/max, and skips row groups that provably contain no matching data - without reading them at all.
**Page-level statistics** go further: within a row group data is split into pages (~1 MB), and each page also stores min/max. **Bloom filters** (optional, Parquet 2.0+) allow an exact check of `WHERE user_id = 'abc123'` without scanning the page - a probabilistic filter with zero false negatives.
Predicate pushdown works only if data is **sorted** or **partitioned** by the filter column. Random order = random min/max in row groups = no skipping. This is why sort order at write time matters (`ZORDER BY` in Delta Lake).
Parquet predicate pushdown is most effective when:
Schema Evolution: changing the schema without migrations
In OLTP databases changing the schema is painful: `ALTER TABLE` on a billion-row table can take hours. Parquet handles this differently: each file is self-contained and stores its own schema in the footer. Files with different schemas can coexist in the same directory.
- **Adding a column:** old files do not have `loyalty_tier` - at read time it is filled with `null`. New files are written with the column. Both types are read together without errors
- **Removing a column:** new files do not contain `deprecated_field`. Old files contain it - Spark simply ignores it when querying without that field
- **Renaming a column:** Parquet cannot help here - this is a semantic change requiring either an alias or file rewrite
- **Type widening:** widening casts (INT32 -> INT64) are possible; narrowing (DOUBLE -> FLOAT) requires explicit cast and may lose precision
| Feature | Parquet | ORC |
|---|---|---|
| Created | 2013, Twitter+Cloudera | 2013, Hortonworks for Hive |
| Ecosystem | Spark, Flink, Presto (de facto standard) | Hive (primary), Spark |
| Compression | Snappy/Gzip/Zstd | Zlib/Snappy/LZ4 |
| Nested types | Excellent (Dremel encoding) | Good |
| Bloom filters | Parquet 2.0+ | Built-in (ORC 1.0+) |
| ACID support | Via Delta Lake/Iceberg | Native in Hive ACID |
Delta Lake adds **schema enforcement** on top of Parquet: writing data with an incompatible schema raises an exception. **Schema evolution** in Delta is enabled explicitly: `mergeSchema=true` at write time or `ALTER TABLE ADD COLUMN`. This combines Parquet's flexibility with Data Warehouse strictness.
Parquet and ORC are interchangeable - the choice does not matter
Parquet is the de facto standard in the Spark/Presto/Flink ecosystem with better support for nested structures. ORC is historically stronger in Hive and has native ACID without Delta Lake
ORC was built specifically for Hive and is optimized for its query engine. Parquet was originally designed around the Dremel nested data model and has broader cross-engine support. Choose based on query engine first, not abstract benchmarks
When a new column is added to Parquet files, what happens when old files (without the column) are read?
Columnar Formats
- Columnar storage reads only needed columns - the gain is proportional to total table columns
- Parquet encodes data (dictionary, RLE, bit-packing) before compression - typical ratio is 10-20x vs raw CSV
- Predicate pushdown skips row groups via footer min/max statistics - only effective with sorted or partitioned data
- Schema evolution: adding/removing columns without rewriting files; renaming requires explicit migration
Related topics
Columnar formats are the physical foundation of modern data pipelines.
- Data Lake vs Data Warehouse — Parquet/ORC are the physical format used in Lakehouse
- Apache Spark — Primary engine for reading/writing Parquet in a distributed environment
Вопросы для размышления
- Why does predicate pushdown lose effectiveness when data is written in random order to Parquet?
- In what scenarios is ORC preferable to Parquet despite Parquet being the de facto Spark standard?
- How do bloom filters in Parquet 2.0 complement min/max statistics for point lookups on high-cardinality columns?