Big Data

Data Quality and Observability: Great Expectations, Monte Carlo, and Circuit Breakers

Knight Capital lost USD 440 million in 45 minutes because of a single stale data flag. No exception was raised. No alert fired. The system worked exactly as designed - on bad data. Data quality is not a nice-to-have. It is the contract between reality and the machine.

Airbnb: data contracts enforced in CI across 200+ producer-consumer pairs - schema breaks caught before merge
DoorDash: Monte Carlo detected a broken ETL 40 minutes before the first customer complaint, preventing stale pricing in production
LinkedIn: Great Expectations (GE) validation suites run in Airflow after every Spark job - failed validations halt the DAG and page on-call
Spotify: column-level lineage via OpenLineage traces feature drift from ML model back to Kafka topic in seconds

The Five Dimensions of Data Quality

August 1, 2012. Knight Capital's automated trading system executes roughly USD 7 billion in unintended trades in 45 minutes. Not because of a network fault - because of a data quality failure: a stale flag in one field activated untested legacy code. Those 45 minutes cost Knight Capital USD 440 million and left it effectively bankrupt.

Data quality is not binary. It has five independent dimensions - completeness, uniqueness, timeliness, validity, and consistency - each of which can fail silently while the others look fine.

The hardest dimension is **consistency** across systems. A user table in Postgres and a Kafka events stream both have `user_id`, but one was backfilled with a different ID generation scheme. The join silently multiplies rows by 3x. The ML model sees 3x more 'activity' for certain users. The fraud model stops flagging them. No exception raised.
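One inexpensive guard against this kind of fan-out is to check key uniqueness and measure how many output rows each matched input row produces. The following is a minimal sketch in plain Python, assuming rows are dicts; the helper names `duplicate_keys` and `join_fanout` and the sample data are illustrative and do not come from any library.

```python
from collections import Counter

def duplicate_keys(rows, key):
    """Return {key_value: count} for key values that appear more than once."""
    counts = Counter(row[key] for row in rows)
    return {k: n for k, n in counts.items() if n > 1}

def join_fanout(left, right, key):
    """Inner-join output rows divided by the number of left rows that matched."""
    right_counts = Counter(row[key] for row in right)
    matched = [row for row in left if right_counts[row[key]] > 0]
    if not matched:
        return 0.0
    joined_rows = sum(right_counts[row[key]] for row in matched)
    return joined_rows / len(matched)

# Hypothetical data: after a backfill, the users table holds three rows for u1.
events = [{"user_id": "u1", "event": "click"}, {"user_id": "u2", "event": "view"}]
users = [
    {"user_id": "u1", "src": "postgres"},
    {"user_id": "u1", "src": "backfill_a"},
    {"user_id": "u1", "src": "backfill_b"},
    {"user_id": "u2", "src": "postgres"},
]

dupes = duplicate_keys(users, "user_id")
fanout = join_fanout(events, users, "user_id")
print(dupes, fanout)
```

A fan-out above 1.0 on a join that is meant to be one-to-one is a reason to stop the job rather than pass inflated activity counts downstream.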

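**Population Stability Index (PSI)** is a widely used scalar for distribution drift. It bins a baseline sample and a current sample the same way, then sums (actual share - expected share) x ln(actual share / expected share) over the bins. Common rule-of-thumb thresholds: PSI < 0.1: stable. PSI 0.1-0.2: investigate. PSI > 0.2: data has shifted significantly - retrain or block. Because every bin contributes, including the extreme ones, PSI can be more sensitive to tail drift than the KS test, which looks only at the largest gap between the two cumulative distributions. That matters for fraud and credit models where extreme values dominate.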
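A PSI calculation fits in a few lines. The sketch below is a hedged illustration, not a reference implementation: the bin edges, the `eps` floor for empty bins, and the sample values are assumptions chosen for the example, and the status labels simply restate the thresholds given above.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline.

    bin_edges are sorted interior cut points: n edges define n + 1 bins.
    eps floors empty-bin shares so the logarithm stays defined (sketch assumption).
    """
    def shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of cut points at or below the value.
            counts[sum(1 for edge in bin_edges if v >= edge)] += 1
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def psi_status(value):
    """Map a PSI value onto the rule-of-thumb bands from the text."""
    if value < 0.1:
        return "stable"
    if value <= 0.2:
        return "investigate"
    return "shifted"

# Hypothetical spend values: mass moves from the low bin to the high bin.
baseline = [10] * 50 + [30] * 30 + [60] * 20
current = [10] * 20 + [30] * 30 + [60] * 50
edges = [20, 50]

score = psi(baseline, current, edges)
print(round(score, 4), psi_status(score))
```

The result depends on the binning: too few bins hide tail movement, and too many leave bins nearly empty, which is why the choice of edges deserves as much attention as the threshold.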

A feature value `avg_spend_30d` is not null and within range, but was last updated 48 hours ago. Which quality dimension fails?

Great Expectations: Executable Data Contracts

**Great Expectations (GE)** turns quality rules into versioned, executable test suites called **Expectation Suites**. Each expectation is a named assertion: `expect_column_values_to_not_be_null`, `expect_column_values_to_be_between`, `expect_table_row_count_to_be_between`. GE runs them against a batch of data and produces a machine-readable validation result with pass rate, observed statistics, and failed sample rows.
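To make the shape of a suite and its validation result concrete, here is a plain-Python stand-in. It deliberately does not call the GE library, whose entry points differ between releases. The functions `expect_not_null`, `expect_between`, and `run_suite` are hypothetical names that loosely mirror the expectations listed above. The `mostly` argument follows GE's meaning: the minimum fraction of rows that must satisfy the expectation for it to pass.

```python
from functools import partial

def _result(name, column, total, failed, mostly):
    pass_rate = 1.0 if total == 0 else (total - len(failed)) / total
    return {
        "expectation": name,
        "column": column,
        "success": pass_rate >= mostly,
        "pass_rate": pass_rate,
        "failed_sample": failed[:3],  # a few offending rows for debugging
    }

def expect_not_null(rows, column, mostly=1.0):
    """Pass when at least `mostly` of the rows have a non-null value."""
    failed = [r for r in rows if r.get(column) is None]
    return _result("not_null", column, len(rows), failed, mostly)

def expect_between(rows, column, min_value, max_value, mostly=1.0):
    """Pass when at least `mostly` of the non-null values lie in the range."""
    present = [r for r in rows if r.get(column) is not None]
    failed = [r for r in present if not (min_value <= r[column] <= max_value)]
    return _result("between", column, len(present), failed, mostly)

def run_suite(rows, expectations):
    """Run every expectation against the batch; the suite passes only if all do."""
    results = [check(rows) for check in expectations]
    return {"success": all(r["success"] for r in results), "results": results}

# Hypothetical batch: 1000 rows, exactly one null amount.
batch = [{"order_id": i, "amount": 20.0} for i in range(999)]
batch.append({"order_id": 999, "amount": None})

strict = run_suite(batch, [partial(expect_not_null, column="amount")])
tolerant = run_suite(batch, [
    partial(expect_not_null, column="amount", mostly=0.999),
    partial(expect_between, column="amount", min_value=0, max_value=10_000),
])
print(strict["success"], tolerant["success"])
```

The same batch fails the strict suite and passes the tolerant one, which shows why the tolerance belongs in the versioned suite rather than in someone's head.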

GE integrates with Airflow through a dedicated operator (`GreatExpectationsOperator` in the Great Expectations Airflow provider): validation runs as a task after each pipeline step, and a failed suite fails that task, which blocks downstream tasks from executing. This is the **circuit-breaker pattern** for data: stale or invalid data cannot flow further into ML training or dashboards.
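Stripped of the orchestrator, the circuit-breaker pattern is a gate between steps that raises instead of passing data on. The sketch below assumes a toy in-process pipeline; `CircuitOpen`, `run_with_gates`, and the step functions are hypothetical names for illustration and are not Airflow or GE APIs.

```python
class CircuitOpen(Exception):
    """Raised when a validation gate fails; downstream steps must not run."""

def run_with_gates(steps):
    """Run (name, step, gate) triples in order.

    A gate takes the step's output and returns None when the data is fine,
    or a short problem description when it is not.
    """
    data = None
    for name, step, gate in steps:
        data = step(data)
        problem = gate(data) if gate is not None else None
        if problem is not None:
            raise CircuitOpen(f"{name}: {problem}")
    return data

trained_on = []  # records whether the training step ever ran

def extract(_):
    return [{"amount": 10.0}, {"amount": None}, {"amount": 12.5}]

def train(rows):
    trained_on.append(len(rows))
    return "model-v1"

def no_null_amounts(rows):
    nulls = sum(1 for r in rows if r["amount"] is None)
    return None if nulls == 0 else f"{nulls} null amount value(s)"

try:
    run_with_gates([("extract", extract, no_null_amounts), ("train", train, None)])
    halted = False
except CircuitOpen as exc:
    halted = True
    reason = str(exc)

print(halted, trained_on)
```

The important property is that `train` never sees the bad batch: the failure is loud and early instead of silent and downstream.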

**Data Contracts** formalize expectations between a data producer (upstream team) and a data consumer (ML team, analytics). A contract is a GE suite committed to the producer's repo. When the producer changes the schema or drops a column, CI runs the contract against the change and blocks the merge if it fails. LinkedIn and Airbnb use data contracts; at Airbnb they cover 200+ producer-consumer pairs.
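The CI check itself can be very small. The sketch below uses a plain dict of column names and types as a stand-in for the contract (the text describes it as a GE suite). The function name `contract_violations`, the type labels, and the rule that added columns are allowed are all assumptions made for illustration.

```python
def contract_violations(contract, proposed_schema):
    """Compare a producer's proposed schema with the consumer-facing contract.

    Both arguments map column name -> type label. Extra columns in the
    proposed schema are treated as compatible (sketch assumption).
    """
    violations = []
    for column, expected_type in contract.items():
        if column not in proposed_schema:
            violations.append(f"missing column: {column}")
        elif proposed_schema[column] != expected_type:
            violations.append(
                f"type change on {column}: {expected_type} -> {proposed_schema[column]}"
            )
    return violations

# Hypothetical contract and a producer change that breaks it twice.
contract = {"user_id": "string", "amount": "double", "event_ts": "timestamp"}
proposed = {"user_id": "long", "amount": "double", "session_id": "string"}

violations = contract_violations(contract, proposed)
exit_code = 1 if violations else 0  # a CI job would exit with this code
for line in violations:
    print(line)
```

A non-zero exit code is what turns the contract from documentation into a merge gate.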

What does the `mostly=0.999` parameter in a GE expectation mean?

Data Observability: Monte Carlo, Anomaly Detection, and Lineage

Great Expectations catches known unknowns - violations of rules that were explicitly written. **Data Observability** platforms catch unknown unknowns: the column that had 1% nulls for 6 months and suddenly hits 40% on a Tuesday. No one wrote an expectation for that specific threshold - but **Monte Carlo** or **Bigeye** would have sent a Slack alert within minutes.

Monte Carlo works by establishing baselines over rolling windows (7-30 days) and detecting anomalies using statistical models. It monitors five signals without any manual rule-writing: row count changes, schema changes, freshness gaps, null rate spikes, and distribution shifts. At DoorDash, Monte Carlo caught a broken ETL job 40 minutes before the first user complaint.
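The idea of a learned baseline can be illustrated with a simple z-score over a rolling window. This is a simplified stand-in, not Monte Carlo's actual model, which the text does not describe in detail. The function `is_anomalous`, the threshold of 3 standard deviations, the `min_std` floor, and the sample null rates are assumptions.

```python
from statistics import mean, stdev

def is_anomalous(history, today, z_threshold=3.0, min_std=1e-3):
    """Flag today's metric when it is far from the rolling-window baseline.

    history: metric values from the baseline window (for example 14 daily null rates).
    min_std: floor on the standard deviation so a very flat history
             does not turn tiny wobbles into alerts (sketch assumption).
    """
    baseline = mean(history)
    spread = max(stdev(history), min_std)
    z = (today - baseline) / spread
    return abs(z) > z_threshold, z

# Hypothetical 14-day null rate history hovering around 1%.
null_rate_history = [
    0.010, 0.011, 0.009, 0.010, 0.012, 0.010, 0.009,
    0.011, 0.010, 0.010, 0.011, 0.009, 0.010, 0.012,
]

normal_day, _ = is_anomalous(null_rate_history, 0.011)
bad_tuesday, z_score = is_anomalous(null_rate_history, 0.40)
print(normal_day, bad_tuesday)
```

No one had to choose 40% as a threshold: the alert follows from the column's own history, which is what separates observability from hand-written rules.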

**Column-level lineage** is the complement to monitoring: when an alert fires, which upstream jobs produced this column? Apache Atlas and OpenLineage instrument Spark jobs to emit lineage events to a graph database. The result is a dependency map: `ml_features.avg_spend_30d -> spark_job:etl_aggregations -> s3://raw/events`. Debugging a bad value traces from the ML model back to the Kafka topic in seconds, not days.
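Once lineage events are stored as a graph, tracing a bad column is a graph traversal. The sketch below assumes the graph is already available as a dict that maps each node to its direct upstream nodes. It does not use the OpenLineage or Atlas APIs, the function name `upstream` is illustrative, and the `kafka_topic:events` node is a hypothetical addition to the dependency map shown above.

```python
from collections import deque

def upstream(lineage, node):
    """Return all ancestors of `node`, nearest first (breadth-first order)."""
    seen, order, queue = {node}, [], deque([node])
    while queue:
        current = queue.popleft()
        for parent in lineage.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

# node -> direct upstream dependencies
lineage = {
    "ml_features.avg_spend_30d": ["spark_job:etl_aggregations"],
    "spark_job:etl_aggregations": ["s3://raw/events"],
    "s3://raw/events": ["kafka_topic:events"],  # hypothetical source topic
}

path = upstream(lineage, "ml_features.avg_spend_30d")
print(" -> ".join(["ml_features.avg_spend_30d"] + path))
```

The `seen` set keeps the traversal safe on graphs where several jobs share an upstream source.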

Misconception: data quality checks are only needed before loading data into a warehouse.

Reality: quality must be validated at every pipeline stage: ingestion, transformation, feature computation, and model input. Each stage can introduce silent failures.

Transformer inputs at Uber look clean at ingestion but accumulate drift through five ETL joins before reaching the model. A circuit breaker at ingestion catches schema errors; observability mid-pipeline catches semantic drift. One gate is not enough.

What is the main advantage of Monte Carlo over a manually written GE expectation suite for detecting null rate increases?

Key ideas

Five quality dimensions: completeness, uniqueness, timeliness, validity, consistency - each can fail independently
Great Expectations: executable Expectation Suites that version, test, and gate pipeline stages
Data Contracts: GE suites committed to the producer repo - schema changes break CI before reaching consumers
Monte Carlo / Bigeye: learned baselines detect unknown anomalies without manual thresholds
PSI > 0.2 means distribution shifted significantly - alert and consider retraining

Questions for reflection

How would a circuit-breaker pattern for an ML pipeline differ from one in a microservices architecture?
Design a data contract between a Kafka event producer and a Feature Store consumer - what expectations are non-negotiable?
When is PSI a better drift metric than the KS test, and when is it worse?

Related lessons

bd-16 — Feature Store monitoring builds on the same freshness and drift concepts
bd-14 — Data Lakehouse is the storage layer where quality checks run
bd-10 — Kafka pipeline events are primary targets for data contract validation
bd-18 — Quality metadata feeds the Data Catalog and Governance layer
stat-31-eda