System Design: Analytics Platform
Netflix analyzes the behavior of 260 million subscribers: what content is watched, at what timestamp users abandon a video, which thumbnails generate more clicks. This requires querying billions of events per day with sub-second response times. Running these queries on the production OLTP database would saturate it within minutes. ClickHouse processes the same queries in under one second because it reads only the 2-3 columns relevant to each query instead of all 50 columns per row.
- **Cloudflare**: ClickHouse processes 36 million HTTP requests per second in real time. Analytics queries over the last 24 hours (trillions of rows) return in under one second thanks to columnar storage and materialized views.
- **Shopify**: Snowflake + dbt. Over 500 dbt models transform raw order and merchant data into revenue dashboards, cohort retention analysis, and merchant performance reports. The daily pipeline processes hundreds of billions of rows.
OLTP vs OLAP: Two Different Worlds
OLTP (Online Transaction Processing) databases handle many short, concurrent transactions: insert an order, update a balance, read a user profile. They optimize for write latency, transactional correctness, and row-level access. OLAP (Online Analytical Processing) systems handle complex aggregations over large datasets: revenue by region by month, cohort retention, funnel analysis. These access patterns require fundamentally different storage engines.
Running heavy OLAP queries directly on a production OLTP database degrades transactional performance for users: long scans compete with transactions for CPU, memory, and disk, and on engines where readers take locks they also cause lock contention. Even read-only analytical queries can saturate disk I/O, increasing latency for concurrent writes. Always separate analytical workloads from OLTP systems.
Why should OLAP analytical queries not run on the production OLTP database?
Star Schema and Snowflake Schema
Data warehouses use dimensional modeling to organize facts (measurable events) and dimensions (descriptive attributes). Star schema stores dimensions denormalized in a single table per dimension. Snowflake schema normalizes dimensions into multiple related tables. Star schema favors query simplicity; snowflake schema favors storage efficiency.
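To make the fact/dimension split concrete, here is a minimal sketch of a star schema using Python's built-in `sqlite3` module as a stand-in for a warehouse. The table names (`fact_orders`, `dim_region`, `dim_date`) and the sample rows are hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: one denormalized table per dimension (star schema).
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, region_name TEXT, country TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, month TEXT);

    -- Fact table: one row per measurable event, keyed to the dimensions.
    CREATE TABLE fact_orders (
        order_id  INTEGER PRIMARY KEY,
        region_id INTEGER REFERENCES dim_region(region_id),
        date_id   INTEGER REFERENCES dim_date(date_id),
        amount    REAL
    );

    INSERT INTO dim_region VALUES (1, 'West', 'US'), (2, 'East', 'US');
    INSERT INTO dim_date VALUES (1, '2024-01'), (2, '2024-02');
    INSERT INTO fact_orders VALUES
        (1, 1, 1, 100.0), (2, 1, 1, 50.0), (3, 2, 1, 70.0), (4, 1, 2, 30.0);
""")


def revenue_by_region_and_month(conn):
    """Aggregate the fact table with one JOIN per dimension."""
    return conn.execute("""
        SELECT r.region_name, d.month, SUM(f.amount)
        FROM fact_orders AS f
        JOIN dim_region AS r ON r.region_id = f.region_id
        JOIN dim_date   AS d ON d.date_id = f.date_id
        GROUP BY r.region_name, d.month
        ORDER BY r.region_name, d.month
    """).fetchall()


rows = revenue_by_region_and_month(conn)
for region, month, revenue in rows:
    print(region, month, revenue)
```

In a snowflake schema, `country` would move out of `dim_region` into its own table, so a query grouped by country would need one more JOIN than it does here.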
What is the main advantage of star schema over snowflake schema for OLAP queries?
ClickHouse: Columnar OLAP at Scale
ClickHouse stores each column as a separate compressed file. An aggregation query reads only the columns it needs, which can reduce I/O by 10-100x compared to row storage on wide tables. MergeTree is the primary table engine: data is sorted by the ORDER BY key, enabling efficient range queries and delta compression on sorted columns.
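The difference in I/O can be sketched with plain Python data structures. This is a toy model, not how ClickHouse is implemented: the four-field `orders` table is hypothetical, and "values read" is just a count of the fields each layout has to touch to compute an average.

```python
# Hypothetical four-field orders table; real analytical tables are often much wider.
FIELDS = ("order_id", "customer", "region", "amount")
DATA = [
    (1, "alice", "West", 100.0),
    (2, "bob", "West", 50.0),
    (3, "carol", "East", 70.0),
    (4, "dave", "West", 30.0),
]

# Row layout: all fields of one order are stored together.
row_store = [dict(zip(FIELDS, record)) for record in DATA]

# Column layout: all values of one field are stored together.
column_store = {
    name: [record[i] for record in DATA] for i, name in enumerate(FIELDS)
}


def avg_amount_rows(rows):
    """Return (average, values_read) when every full row must be loaded."""
    values_read = sum(len(row) for row in rows)
    total = sum(row["amount"] for row in rows)
    return total / len(rows), values_read


def avg_amount_columns(columns):
    """Return (average, values_read) when only the amount column is loaded."""
    amounts = columns["amount"]
    return sum(amounts) / len(amounts), len(amounts)


row_avg, row_values_read = avg_amount_rows(row_store)
col_avg, col_values_read = avg_amount_columns(column_store)
print(row_avg, row_values_read)
print(col_avg, col_values_read)
```

Both layouts return the same average, but the row layout touches every field of every order while the column layout touches one value per order. The gap grows with the number of columns in the table.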
Cloudflare uses ClickHouse for real-time analytics over 36 million HTTP requests per second. Queries over the last 24 hours of traffic (trillions of rows) return in under one second. ClickHouse's columnar storage compresses URL strings about 10x, because sorted data has repeated patterns that LZ4 and ZSTD compress efficiently.
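The effect of sorting on compression can be checked with a small experiment. This sketch uses `zlib` (DEFLATE) from the Python standard library as a stand-in, because LZ4 and ZSTD bindings are not part of the standard library; the URL set is synthetic and the exact ratio will differ from real traffic.

```python
import random
import zlib


def compressed_size(values):
    """Compress a column of strings as one newline-joined block."""
    payload = "\n".join(values).encode("utf-8")
    return len(zlib.compress(payload))


# Synthetic column: 10,000 requests spread over 50 distinct URLs (assumed data).
rng = random.Random(0)
paths = [f"https://example.com/api/v1/resource/{i}" for i in range(50)]
urls = [rng.choice(paths) for _ in range(10_000)]

unsorted_size = compressed_size(urls)
sorted_size = compressed_size(sorted(urls))

print("arrival order:", unsorted_size, "bytes")
print("sorted:       ", sorted_size, "bytes")
```

Sorting places identical URLs next to each other, so the compressor sees long runs of repeated bytes instead of the same strings scattered across the block.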
Why is columnar storage faster for analytical aggregations like SELECT avg(amount) FROM orders?
Materialized Views for Pre-Aggregation
Materialized views pre-compute and store query results. ClickHouse materialized views update incrementally as new data arrives: each INSERT triggers the view to aggregate the new rows into the pre-computed result. PostgreSQL materialized views are static snapshots that stay unchanged until an explicit REFRESH MATERIALIZED VIEW recomputes them from the full query.
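The two behaviors can be contrasted with a toy model in Python. The class names `IncrementalView` and `SnapshotView` and the `hour` field are hypothetical; the sketch only mirrors the update semantics described above and does not use either database's actual API.

```python
from collections import defaultdict


class IncrementalView:
    """Insert-triggered aggregate: each batch updates the stored result."""

    def __init__(self):
        self.raw_events = []
        self.views_per_hour = defaultdict(int)

    def insert(self, batch):
        self.raw_events.extend(batch)
        # Only the newly inserted rows are aggregated, not the whole table.
        for event in batch:
            self.views_per_hour[event["hour"]] += 1


class SnapshotView:
    """Static snapshot: the stored result changes only on refresh()."""

    def __init__(self):
        self.raw_events = []
        self.views_per_hour = {}

    def insert(self, batch):
        self.raw_events.extend(batch)  # the snapshot is left untouched

    def refresh(self):
        # Full recomputation over every raw event.
        counts = defaultdict(int)
        for event in self.raw_events:
            counts[event["hour"]] += 1
        self.views_per_hour = dict(counts)


incremental, snapshot = IncrementalView(), SnapshotView()
batch = [{"hour": 10}, {"hour": 10}, {"hour": 11}]
incremental.insert(batch)
snapshot.insert(batch)

print(dict(incremental.views_per_hour))  # already reflects the batch
print(snapshot.views_per_hour)           # still empty until refresh()
snapshot.refresh()
print(snapshot.views_per_hour)
```

The cost profile follows directly: the incremental view does work proportional to each inserted batch, while the snapshot does work proportional to the whole table on every refresh.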
Netflix uses ClickHouse materialized views to power real-time dashboards showing content performance metrics (views per hour, completion rates, quality issues) for all 260 million subscribers. The raw event stream contains billions of playback events per day; materialized views reduce dashboard queries to milliseconds.
How does a ClickHouse materialized view differ from a PostgreSQL MATERIALIZED VIEW?
ETL vs ELT Pipelines
ETL (Extract, Transform, Load) transforms data before loading into the warehouse. ELT (Extract, Load, Transform) loads raw data first, then transforms it using the warehouse's compute power. Modern cloud data warehouses (Snowflake, BigQuery, ClickHouse) favor ELT because their compute is more scalable and cheaper than external ETL infrastructure.
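The ordering difference can be shown side by side. This is a minimal sketch that uses an in-memory SQLite database as a stand-in for the warehouse; the table names (`orders_etl`, `raw_orders`, `orders_elt`) and the sample records are hypothetical.

```python
import sqlite3

# Raw extract: everything arrives as strings with inconsistent casing (assumed data).
raw_records = [
    ("1", "west", "100.50"),
    ("2", "WEST", "49.50"),
    ("3", "east", "70.00"),
]


def run_etl(conn, records):
    """ETL: clean the data in application code, then load the cleaned rows."""
    cleaned = [(int(oid), region.lower(), float(amount)) for oid, region, amount in records]
    conn.execute("CREATE TABLE orders_etl (order_id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders_etl VALUES (?, ?, ?)", cleaned)


def run_elt(conn, records):
    """ELT: load the raw rows as-is, then transform with SQL inside the database."""
    conn.execute("CREATE TABLE raw_orders (order_id TEXT, region TEXT, amount TEXT)")
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", records)
    conn.execute("""
        CREATE TABLE orders_elt AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(region)             AS region,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
    """)


def revenue_by_region(conn, table):
    # Table names are fixed constants in this sketch, so interpolation is safe here.
    query = f"SELECT region, SUM(amount) FROM {table} GROUP BY region ORDER BY region"
    return conn.execute(query).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn, raw_records)
run_elt(conn, raw_records)

etl_result = revenue_by_region(conn, "orders_etl")
elt_result = revenue_by_region(conn, "orders_elt")
print(etl_result)
print(elt_result)
```

Both paths produce the same cleaned table. With ELT the untouched `raw_orders` table stays available, so a changed transformation can be rerun without extracting the data again.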
What is the key difference between ETL and ELT approaches?
Key Ideas
- **OLTP vs OLAP**: OLTP optimizes for concurrent short transactions (row access). OLAP optimizes for aggregations over large datasets (columnar access). Never run OLAP queries on production OLTP systems.
- **Star schema**: fact table surrounded by dimension tables. Fewer JOINs per query compared to snowflake schema. Standard for data warehouses.
- **ClickHouse MergeTree**: columnar storage, partition by time, sort by query keys. Reads only needed columns = 10-100x less I/O than row storage for aggregations.
- **Materialized views in ClickHouse** update incrementally on every INSERT. PostgreSQL materialized views require explicit REFRESH. ClickHouse MVs power real-time dashboards.
- **ELT** (Extract, Load, Transform) loads raw data into the warehouse first, then transforms it there with SQL, often defined in a tool such as dbt. More flexible than ETL for schema evolution.
Related Topics
Analytics platforms build on partitioning, time-series, and storage engine concepts:
- Partitioning — ClickHouse PARTITION BY time is the same concept as PostgreSQL range partitioning. Partition pruning eliminates irrelevant time ranges from analytical queries.
- Time-Series Databases — ClickHouse handles both time-series metrics (like InfluxDB) and event analytics (like a data warehouse). The distinction is schema design, not the database itself.
- LSM-Tree — ClickHouse's MergeTree engine uses a merge-based approach similar to LSM-Tree: data is written in sorted chunks and merged in the background, enabling high write throughput.
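The merge-based write path mentioned in the last item can be sketched in a few lines of Python. This is a simplified model under stated assumptions: each insert batch becomes one sorted, immutable part, and a background step merges parts into a larger sorted part. The function names `write_part` and `merge_parts` are hypothetical.

```python
import heapq


def write_part(rows):
    """Sort one insert batch by key, producing one immutable sorted part."""
    return sorted(rows)


def merge_parts(parts):
    """Merge already-sorted parts into one sorted part without re-sorting."""
    return list(heapq.merge(*parts))


# Two insert batches of (timestamp, event) rows arriving out of order.
parts = [
    write_part([(3, "pause"), (1, "play")]),
    write_part([(4, "stop"), (2, "seek")]),
]

merged = merge_parts(parts)
print(merged)
```

Writes stay cheap because each batch is sorted only against itself; the cost of keeping the whole table sorted is deferred to the background merge.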
Questions for Reflection
- Netflix wants to compute the 'abandonment rate by episode and minute mark' for all content in real time. Design the ClickHouse schema and materialized view for this metric.
- Shopify runs 500+ dbt models daily. A single model failure can cascade to downstream models. How would you design the dbt DAG to minimize cascade failures and make recovery fast?
- When would you choose Snowflake over ClickHouse for an analytics platform, and vice versa?