Big Data

Feature Store: Centralized Feature Management

DoorDash saves USD 200,000 per year in compute costs alone by reusing features from its Feature Store instead of running duplicated Spark jobs. But the main saving is not money, it is speed: a new ML model reaches production in 2 days instead of 2 weeks, because features are already computed and available.

Uber Michelangelo: the first Feature Store, 200+ ML models sharing the same computed features
Airbnb Zipline: batch and streaming features for price optimization and fraud detection on one platform
Stripe: real-time fraud features via Kafka + Redis, <5 ms latency at 10,000 transactions per second
LinkedIn: Feature Store holds 500+ features for job recommendations, skill matching, and feed ranking

Feature Store: Why It Exists and What Is Inside

Uber, 2017. 200 ML models. Each team recomputes the same features: 'user activity over 7 days', 'average driver rating over 30 days'. 200 times. Inconsistently. With different results. This led to the creation of Michelangelo - the industry's first Feature Store.

A **Feature Store** is a centralized repository for ML features. Three functions: (1) storing computed features, (2) serving features for training and inference with identical logic, (3) versioning and lineage. The core problem it solves: **training-serving skew** - divergence between features used at training time and at serving time.
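Training-serving skew is easiest to see on a small case. The sketch below is illustrative only: the function names and the event log are hypothetical, and it assumes the feature is 'user activity over 7 days'. One function is shared by the training and serving paths, and a second, slightly different reimplementation shows how skew appears when a team rewrites the same feature on its own.

```python
from datetime import datetime, timedelta

def activity_7d(event_times, as_of):
    """Count events in the 7 days before `as_of` (start exclusive, end inclusive)."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for t in event_times if window_start < t <= as_of)

def activity_7d_reimplemented(event_times, as_of):
    """A second team's version: same idea, but the window start is inclusive."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for t in event_times if window_start <= t <= as_of)

# Hypothetical event log for one user (January 2024).
events = [datetime(2024, 1, d) for d in (1, 3, 5, 8, 9)]
as_of = datetime(2024, 1, 8)

# Shared logic: training and serving both call the same function.
train_value = activity_7d(events, as_of)
serve_value = activity_7d(events, as_of)

# Duplicated logic: the model was trained on one definition and is served another.
skewed_value = activity_7d_reimplemented(events, as_of)

print(train_value, serve_value, skewed_value)
```

The boundary difference is a single comparison operator, yet the model sees a different number at serving time than it saw during training. Keeping one definition in the Feature Store removes this class of error.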

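**Feast** (Feature Store) is an open-source Feature Store originally developed at Gojek. It supports Redis as the online store and BigQuery or Parquet files as the offline store. In the Feast SDK you define an Entity (the join key), a FeatureView (a named group of features read from a data source), and the individual features inside it (Field in recent Feast versions, Feature in older ones). Feast writes to the online store during materialize and reads from it during get_online_features.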
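The split between the offline and online store can be sketched with a toy in-memory model. This is not the Feast API: the function names `materialize` and `read_online`, the row layout, and the `avg_rating_30d` feature are assumptions made for illustration. The offline store keeps the full history; materialization copies the latest value per entity into a key-value store that serving reads by key.

```python
from datetime import datetime

# Offline store: full history of feature rows, keyed by entity and event time.
offline_rows = [
    {"driver_id": 1001, "event_time": datetime(2024, 1, 1), "avg_rating_30d": 4.6},
    {"driver_id": 1001, "event_time": datetime(2024, 1, 2), "avg_rating_30d": 4.7},
    {"driver_id": 1002, "event_time": datetime(2024, 1, 2), "avg_rating_30d": 4.2},
]

def materialize(rows, end_time):
    """Copy the latest value per entity (up to end_time) into a key-value online store."""
    online = {}
    for row in sorted(rows, key=lambda r: r["event_time"]):
        if row["event_time"] <= end_time:
            # Later rows overwrite earlier ones, so only the newest value survives.
            online[row["driver_id"]] = row["avg_rating_30d"]
    return online

def read_online(online, driver_id):
    """Serving-time lookup by entity key; returns None for unknown entities."""
    return online.get(driver_id)

online_store = materialize(offline_rows, end_time=datetime(2024, 1, 2))
print(read_online(online_store, 1001))
```

In a real deployment the dictionary would be Redis or a similar low-latency store, and the offline rows would live in BigQuery or Parquet.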

What is training-serving skew and how does a Feature Store help avoid it?

Managed Feature Stores: Tecton, Vertex AI, Databricks

**Tecton** is an enterprise Feature Store co-founded by the Uber Michelangelo team. Difference from Feast: a managed service with a feature pipeline scheduler, quality monitoring, and a built-in transformation engine. Airbnb, DoorDash, and Stripe use Tecton for production ML. Enterprise pricing: roughly USD 50,000 per year.

**Vertex AI Feature Store** (Google Cloud) and **Databricks Feature Store** are managed solutions tied to their respective platforms. Vertex AI: Bigtable as online store (milliseconds), BigQuery as offline. Databricks: Delta Lake as offline, automatic materialization. The choice is driven by the company's cloud strategy.

What distinguishes a managed Feature Store (Tecton) from a self-hosted one (Feast)?

Feature Pipeline: Batch, Streaming, On-Demand

Three types of feature computation: **batch** (Spark on a schedule, for historical data), **streaming** (Flink on Kafka, for near-realtime), **on-demand** (computed at request time, for context-dependent features). Fraud detection requires all three: historical patterns plus real-time velocity plus the current transaction context.
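A minimal sketch of the three computation types for a fraud use case, in plain Python rather than Spark or Flink. All names here (`batch_avg_amount`, `VelocityCounter`, `is_new_merchant`, and the feature keys) are hypothetical; the point is only when each feature gets computed.

```python
from collections import deque
from statistics import mean

# Batch: recomputed on a schedule from historical data.
def batch_avg_amount(history):
    return mean(history)

# Streaming: updated on every event; counts transactions in the last `window_s` seconds.
class VelocityCounter:
    def __init__(self, window_s):
        self.window_s = window_s
        self.times = deque()

    def update(self, ts):
        self.times.append(ts)
        # Evict timestamps that have fallen out of the sliding window.
        while self.times and self.times[0] <= ts - self.window_s:
            self.times.popleft()
        return len(self.times)

# On-demand: computed at request time from the request plus stored features.
def is_new_merchant(request, known_merchants):
    return request["merchant"] not in known_merchants

# Hypothetical data for one user.
avg_amount = batch_avg_amount([20.0, 30.0, 40.0])

velocity = VelocityCounter(window_s=60)
for ts in (0, 10, 50, 65):          # event timestamps in seconds
    tx_count = velocity.update(ts)

request = {"merchant": "shop-42", "amount": 95.0}
features = {
    "avg_amount_30d": avg_amount,                              # batch
    "tx_count_60s": tx_count,                                  # streaming
    "is_new_merchant": is_new_merchant(request, {"shop-7"}),   # on-demand
}
print(features)
```

The batch and streaming values would normally be read from the online store, while the on-demand value exists only for the duration of the request.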

Which type of feature computation is best for 'deviation of transaction amount from user average'?

Feature Store Monitoring and Data Quality

A Feature Store without monitoring is a time bomb. Kafka stops delivering events - streaming features go stale. A Spark job fails - batch features are from last month. The model continues making predictions on outdated data. At Stripe this led to a 3x increase in fraud detection false positives within one hour.
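A freshness check can be as simple as comparing each feature's last update time with an allowed maximum age. The sketch below assumes hypothetical feature names and budgets (5 minutes for a streaming feature, 26 hours for a daily batch feature); real thresholds depend on the pipeline schedule.

```python
from datetime import datetime, timedelta

# Hypothetical freshness budgets per feature.
MAX_AGE = {
    "tx_count_60s": timedelta(minutes=5),    # streaming feature
    "avg_amount_30d": timedelta(hours=26),   # daily batch job plus some slack
}

def stale_features(last_updated, now):
    """Return the names of features whose last update is older than their budget."""
    return sorted(
        name for name, ts in last_updated.items()
        if now - ts > MAX_AGE[name]
    )

now = datetime(2024, 1, 10, 12, 0)
last_updated = {
    "tx_count_60s": now - timedelta(minutes=30),   # the stream has stalled
    "avg_amount_30d": now - timedelta(hours=3),    # the batch job ran this morning
}

stale = stale_features(last_updated, now)
print(stale)
```

An alert on a non-empty result catches the failure before the model spends hours predicting on outdated values.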

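**PSI (Population Stability Index)** is an industry standard for drift monitoring. It compares the binned distribution of a feature in production against a baseline, for example the training set. Common rules of thumb: PSI < 0.1: data is stable. PSI 0.1-0.2: minor shift, monitor. PSI > 0.2: significant drift, retraining needed. PSI is more sensitive to distribution tails than the KS test.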
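PSI is computed over bins as the sum of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the shares of baseline and current values in bin i. The sketch below is one possible implementation under stated assumptions: equal-width bins taken from the baseline range, out-of-range values clamped into the edge bins, and a small `eps` floor to avoid taking the log of zero. The function name and defaults are illustrative.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0   # fall back to 1.0 for a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1   # clamp into the edge bins
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Hypothetical data: a uniform baseline and the same values shifted by 0.3.
baseline = [i / 100 for i in range(100)]
shifted = [v + 0.3 for v in baseline]

print(psi(baseline, baseline))   # identical distributions
print(psi(baseline, shifted))    # well above the 0.2 threshold
```

The `eps` floor strongly affects the score when bins are empty, so the threshold interpretation should be calibrated with the same binning and floor that production monitoring uses.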

Misconception: a single Feature Store solves all ML platform problems.

Reality: a Feature Store solves only the feature consistency problem. Separate components are needed for model registry, experiment tracking, data lineage, and serving infrastructure.

An ML platform is an ecosystem. Feature Store (Feast/Tecton) + Model Registry (MLflow) + Experiment Tracking (W&B) + Serving (Triton/TorchServe) + Observability (custom) - each component solves its own problem.

Why monitor feature freshness in a Feature Store?

Key ideas

Feature Store solves training-serving skew: one computation logic for both training and serving
Three types: batch (Spark), streaming (Flink), on-demand (at request time)
Online store (Redis): milliseconds for serving. Offline store (Parquet): for training
Feast is open source; Tecton/Vertex AI are managed with scheduling and monitoring
Freshness monitoring is critical: stale features cause silent quality degradation

Questions for reflection

How does a Feature Store handle point-in-time correctness when training on historical data?
Design a Feature Store for an e-commerce platform: which features are needed and which are batch vs streaming vs on-demand?
How can an A/B test be organized between two Feature Store versions of the same feature without risk to production?

Related lessons

bd-15 — Spark ML pipeline is the predecessor of the Feature Store
bd-06 — Streaming features are computed in Flink/Kafka and written to the Feature Store
bd-10 — Kafka is the standard source for online feature computation
bd-14 — Data Lakehouse architecture includes the Feature Store as a layer
ml-04-data-preprocessing