Big Data

Data Quality and Observability: Great Expectations, Monte Carlo, and Circuit Breakers

Knight Capital lost USD 440 million in 45 minutes because of a single stale data flag. No exception was raised. No alert fired. The system worked exactly as designed - on bad data. Data quality is not a nice-to-have. It is the contract between reality and the machine.

  • Airbnb: data contracts enforced in CI across 200+ producer-consumer pairs - schema breaks caught before merge
  • DoorDash: Monte Carlo detected a broken ETL 40 minutes before the first customer complaint, preventing stale pricing in production
  • LinkedIn: GE validation suites run in Airflow after every Spark job - failed validations halt the DAG and page on-call
  • Spotify: column-level lineage via OpenLineage traces feature drift from ML model back to Kafka topic in seconds

The Five Dimensions of Data Quality

August 1, 2012. Knight Capital's automated trading system executes roughly USD 7 billion in unintended trades in 45 minutes. Not because of a network fault, but because of a data quality failure: a stale flag in one field activated untested legacy code. By the end of the day Knight Capital had lost USD 440 million and was effectively bankrupt.

Data quality is not binary. It has five independent dimensions: completeness, uniqueness, timeliness, validity, and consistency. Each of them can fail silently while the others look fine.

The hardest dimension is **consistency** across systems. A user table in Postgres and a Kafka events stream both have `user_id`, but one was backfilled with a different ID generation scheme. The join silently multiplies rows by 3x. The ML model sees 3x more 'activity' for certain users. The fraud model stops flagging them. No exception raised.
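A row-multiplying join like this can often be caught by checking key uniqueness on the side that should be unique and by comparing row counts before and after the join. The following is a minimal sketch in plain Python; the table contents and the helper names `join_on_user_id` and `check_fanout` are hypothetical and only illustrate the idea.

```python
from collections import Counter

def join_on_user_id(events, users):
    """Inner join of event rows to user rows on user_id (hypothetical helper)."""
    users_by_id = {}
    for user in users:
        users_by_id.setdefault(user["user_id"], []).append(user)
    return [
        {**event, **user}
        for event in events
        for user in users_by_id.get(event["user_id"], [])
    ]

def check_fanout(events, users):
    """Return a list of problems that would make the join multiply rows."""
    problems = []
    key_counts = Counter(user["user_id"] for user in users)
    duplicated = sorted(key for key, n in key_counts.items() if n > 1)
    if duplicated:
        problems.append(f"duplicate user_id on the user side: {duplicated}")
    joined = join_on_user_id(events, users)
    if len(joined) > len(events):
        problems.append(
            f"join fan-out: {len(events)} events became {len(joined)} rows"
        )
    return problems

# Illustrative data: a backfill left three user rows for the same user_id.
users = [
    {"user_id": 1, "source": "postgres"},
    {"user_id": 2, "source": "postgres"},
    {"user_id": 2, "source": "backfill_a"},
    {"user_id": 2, "source": "backfill_b"},
]
events = [
    {"user_id": 1, "event": "login"},
    {"user_id": 2, "event": "purchase"},
]

fanout_problems = check_fanout(events, users)
for problem in fanout_problems:
    print(problem)
```

Both checks are cheap compared with the join itself, so they fit naturally as a gate before the joined data reaches a model.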

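**Population Stability Index (PSI)** is a widely used scalar for distribution drift. It bins a baseline sample and a current sample the same way and sums `(actual_i - expected_i) * ln(actual_i / expected_i)` over the bins. Common rules of thumb: PSI < 0.1: stable. PSI 0.1-0.2: investigate. PSI > 0.2: data has shifted significantly - retrain or block. With suitably chosen bins it can be more sensitive to tail drift than the KS test, which matters for fraud and credit models where extreme values dominate.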
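To make the PSI thresholds concrete, here is a small sketch that computes PSI over fixed bin edges using only the standard library. The function name `psi`, the `eps` smoothing for empty bins, and the sample data are assumptions made for illustration; production implementations usually derive the bins from quantiles of the baseline.

```python
import math

def psi(expected, actual, bin_edges, eps=1e-6):
    """Population Stability Index between two samples over shared bins.

    PSI = sum((a_i - e_i) * ln(a_i / e_i)), where e_i and a_i are the
    fractions of the expected and actual samples that fall into bin i.
    """
    def fractions(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # Bin index = number of edges at or below v; outer bins are open-ended.
            idx = sum(1 for edge in bin_edges if v >= edge)
            counts[idx] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    e = fractions(expected)
    a = fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Illustrative data: a uniform baseline and the same values shifted by 30.
edges = [20, 40, 60, 80]
baseline = [i % 100 for i in range(1000)]
shifted = [(i % 100) + 30 for i in range(1000)]

psi_same = psi(baseline, baseline, edges)
psi_shifted = psi(baseline, shifted, edges)
print(f"same distribution: {psi_same:.4f}")
print(f"shifted distribution: {psi_shifted:.4f}")
```

Note that the result depends on the binning and on how empty bins are smoothed, so thresholds such as 0.2 are only meaningful when the binning is kept fixed between runs.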

A feature value `avg_spend_30d` is not null and within range, but was last updated 48 hours ago. Which quality dimension fails?

Great Expectations: Executable Data Contracts

**Great Expectations (GE)** turns quality rules into versioned, executable test suites called **Expectation Suites**. Each expectation is a named assertion: `expect_column_values_to_not_be_null`, `expect_column_values_to_be_between`, `expect_table_row_count_to_be_between`. GE runs them against a batch of data and produces a machine-readable validation result with pass rate, observed statistics, and failed sample rows.
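The shape of an expectation and its validation result can be sketched without the library. The code below is a plain-Python imitation, not the GE API: the function reuses the expectation name from the text, but its signature, the result keys, and the sample rows are assumptions. The `mostly` argument is modelled as the minimum fraction of rows that must pass for the expectation to succeed.

```python
def expect_column_values_to_not_be_null(rows, column, mostly=1.0):
    """Plain-Python imitation of a GE-style expectation (not the GE API)."""
    total = len(rows)
    failed = [row for row in rows if row.get(column) is None]
    pass_rate = 1.0 if total == 0 else (total - len(failed)) / total
    return {
        "expectation": "expect_column_values_to_not_be_null",
        "column": column,
        "success": pass_rate >= mostly,   # tolerate up to (1 - mostly) failures
        "observed_pass_rate": pass_rate,
        "failed_sample": failed[:5],      # a few failing rows for debugging
    }

# Illustrative batch: 1000 rows, 2 of them with a null avg_spend_30d.
rows = [{"user_id": i, "avg_spend_30d": 10.0} for i in range(998)]
rows += [{"user_id": 998, "avg_spend_30d": None},
         {"user_id": 999, "avg_spend_30d": None}]

strict = expect_column_values_to_not_be_null(rows, "avg_spend_30d", mostly=0.999)
lenient = expect_column_values_to_not_be_null(rows, "avg_spend_30d", mostly=0.99)
print(strict["success"], strict["observed_pass_rate"])
print(lenient["success"], lenient["observed_pass_rate"])
```

With 2 nulls in 1000 rows the observed pass rate is 0.998, so the same batch fails at `mostly=0.999` and passes at `mostly=0.99`.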

GE integrates with Airflow as a validation task, for example through the `GreatExpectationsOperator` from the Great Expectations Airflow provider: validation runs after each pipeline step, and a failed suite fails the task, which blocks downstream tasks from executing. This is the **circuit-breaker pattern** for data: stale or invalid data cannot flow further into ML training or dashboards.
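The circuit-breaker behaviour itself does not depend on Airflow. As a rough sketch in plain Python, each step below is paired with a validation gate, and the first failed gate raises an exception so that later steps never run. `DataQualityError`, `run_pipeline`, and the step functions are hypothetical names, not part of GE or Airflow.

```python
class DataQualityError(Exception):
    """Raised when a validation gate fails; stops the pipeline."""

def run_pipeline(steps):
    """Run (name, task, gate) steps in order; stop at the first failed gate."""
    completed = []
    data = None
    for name, task, gate in steps:
        data = task(data)
        problems = gate(data)
        if problems:
            # Open the circuit: downstream steps are never executed.
            raise DataQualityError(f"{name}: {problems}")
        completed.append(name)
    return completed

executed = []  # records which tasks actually ran

def extract(_):
    executed.append("extract")
    return [{"avg_spend_30d": 12.5}, {"avg_spend_30d": None}]

def train(rows):
    executed.append("train")
    return rows

def no_nulls(rows):
    bad = [row for row in rows if row["avg_spend_30d"] is None]
    return [f"{len(bad)} null avg_spend_30d value(s)"] if bad else []

def no_gate(_rows):
    return []

try:
    run_pipeline([("extract", extract, no_nulls), ("train", train, no_gate)])
    halted = False
except DataQualityError as err:
    halted = True
    print(f"pipeline halted: {err}")
```

In an orchestrator the raised exception corresponds to a failed task, and the handler is where an on-call page would be triggered.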

**Data Contracts** formalize expectations between a data producer (upstream team) and a data consumer (ML team, analytics). A contract is a GE suite committed to the producer's repo. When the producer changes schema or drops a column, CI validates the contract and blocks the merge. LinkedIn and Airbnb use data contracts across 100+ producer-consumer pairs.
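A CI check for a contract can be as small as a diff between the agreed schema and the schema the producer proposes. The sketch below assumes both are available as `{column: type}` mappings; that format, the function name `contract_violations`, and the example columns other than `user_id` and `avg_spend_30d` are illustrative assumptions.

```python
def contract_violations(contract, proposed_schema):
    """List breaking changes in proposed_schema relative to the contract.

    Both arguments are {column_name: type_name} mappings (an assumed format).
    Added columns are treated as non-breaking.
    """
    violations = []
    for column, expected_type in contract.items():
        if column not in proposed_schema:
            violations.append(f"column dropped: {column}")
        elif proposed_schema[column] != expected_type:
            violations.append(
                f"type changed: {column} {expected_type} -> {proposed_schema[column]}"
            )
    return violations

# Contract agreed with the consumer, and the schema from a producer's pull request.
contract = {"user_id": "string", "avg_spend_30d": "double", "updated_at": "timestamp"}
proposed = {"user_id": "long", "avg_spend_30d": "double", "country": "string"}

violations = contract_violations(contract, proposed)
ci_exit_code = 1 if violations else 0  # a non-zero exit code blocks the merge
for violation in violations:
    print(violation)
```

Treating added columns as non-breaking is a design choice; a stricter contract could reject any difference.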

What does the `mostly=0.999` parameter in a GE expectation mean?

Data Observability: Monte Carlo, Anomaly Detection, and Lineage

Great Expectations catches known unknowns - violations of rules that were explicitly written. **Data Observability** platforms catch unknown unknowns: the column that had 1% nulls for 6 months and suddenly hits 40% on a Tuesday. No one wrote an expectation for that specific threshold - but **Monte Carlo** or **Bigeye** would have sent a Slack alert within minutes.

Monte Carlo works by establishing baselines over rolling windows (7-30 days) and detecting anomalies using statistical models. It monitors five signals without any manual rule-writing: row count changes, schema changes, freshness gaps, null rate spikes, and distribution shifts. At DoorDash, Monte Carlo caught a broken ETL job 40 minutes before the first user complaint.
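The idea of a learned baseline can be illustrated with a simple z-score over a rolling window. This is a sketch of the general approach only, not Monte Carlo's actual model; the function name `is_anomalous`, the 14-day window, the threshold of 3 standard deviations, and the null-rate history are all assumptions.

```python
import statistics

def is_anomalous(history, today, window=14, z_threshold=3.0):
    """Flag today's value if it is far from the rolling baseline.

    history: past daily values of one metric (for example a column's null rate),
    oldest first; at least two values are required.
    """
    baseline = history[-window:]
    mean = statistics.fmean(baseline)
    stdev = statistics.stdev(baseline)
    if stdev == 0:
        # A perfectly flat baseline: any change at all is a deviation.
        return today != mean
    return abs(today - mean) / stdev > z_threshold

# Illustrative history: the null rate hovers around 1% for two weeks.
null_rate_history = [0.010, 0.011, 0.009, 0.012, 0.010, 0.008, 0.011,
                     0.010, 0.009, 0.012, 0.011, 0.010, 0.009, 0.010]

normal_day = is_anomalous(null_rate_history, 0.011)
spike_day = is_anomalous(null_rate_history, 0.40)
print(normal_day, spike_day)
```

No threshold on the null rate is written by hand here: the alerting boundary comes from the metric's own recent behaviour, which is what separates this approach from a fixed expectation.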

**Column-level lineage** is the complement to monitoring: when an alert fires, which upstream jobs produced this column? Apache Atlas and OpenLineage instrument Spark jobs to emit lineage events to a graph database. The result is a dependency map: `ml_features.avg_spend_30d -> spark_job:etl_aggregations -> s3://raw/events`. Debugging a bad value traces from the ML model back to the Kafka topic in seconds, not days.
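Once lineage events are stored as a graph, tracing a bad value is a graph walk. The sketch below keeps the graph in a dictionary and walks it breadth-first; the first two edges follow the dependency map above, while the `kafka:events` node and the names `UPSTREAM` and `trace_upstream` are hypothetical.

```python
from collections import deque

# Each key is a dataset or job; the value lists its direct upstream inputs.
UPSTREAM = {
    "ml_features.avg_spend_30d": ["spark_job:etl_aggregations"],
    "spark_job:etl_aggregations": ["s3://raw/events"],
    "s3://raw/events": ["kafka:events"],  # hypothetical topic name
}

def trace_upstream(node, upstream=UPSTREAM):
    """Breadth-first walk from a column to all of its upstream dependencies."""
    seen = {node}
    order = []
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for parent in upstream.get(current, []):
            if parent not in seen:
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order

lineage = trace_upstream("ml_features.avg_spend_30d")
for hop in lineage:
    print(hop)
```

The `seen` set keeps the walk finite even if the lineage graph contains shared inputs or cycles.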

Misconception: data quality checks are only needed before loading data into a warehouse.

Reality: quality must be validated at every pipeline stage (ingestion, transformation, feature computation, and model input), because each stage can introduce silent failures.

Transformer inputs at Uber look clean at ingestion but accumulate drift through five ETL joins before reaching the model. A circuit breaker at ingestion catches schema errors; observability mid-pipeline catches semantic drift. One gate is not enough.

What is the main advantage of Monte Carlo over a manually written GE expectation suite for detecting null rate increases?

Key ideas

  • Five quality dimensions: completeness, uniqueness, timeliness, validity, consistency - each can fail independently
  • Great Expectations: executable Expectation Suites that version, test, and gate pipeline stages
  • Data Contracts: GE suites committed to the producer repo - schema changes break CI before reaching consumers
  • Monte Carlo / Bigeye: learned baselines detect unknown anomalies without manual thresholds
  • PSI > 0.2 means distribution shifted significantly - alert and consider retraining

Related topics

Data quality connects the storage, pipeline, and governance layers of the Big Data stack.

  • Feature Store Monitoring — Quality dimensions directly map to freshness, drift, and null rate checks in the Feature Store
  • Data Governance — Quality metadata and lineage are the inputs to the Data Catalog and compliance layer
  • Kafka Pipelines — Stream ingestion is the first quality gate - schema registry and event contracts

Questions for reflection

  • How would a circuit-breaker pattern for an ML pipeline differ from one in a microservices architecture?
  • Design a data contract between a Kafka event producer and a Feature Store consumer - what expectations are non-negotiable?
  • When is PSI a better drift metric than KS test, and when is it worse?

Related lessons

  • bd-16 — Feature Store monitoring builds on the same freshness and drift concepts
  • bd-14 — Data Lakehouse is the storage layer where quality checks run
  • bd-10 — Kafka pipeline events are primary targets for data contract validation
  • bd-18 — Quality metadata feeds the Data Catalog and Governance layer
  • stat-31-eda