Big Data
Feature Store: Centralized Feature Management
DoorDash saves USD 200,000 per year in compute costs alone by reusing features from its Feature Store instead of running duplicated Spark jobs. The larger gain is speed rather than money: a new ML model reaches production in 2 days instead of 2 weeks, because the features it needs are already computed and available.
- Uber Michelangelo: the first Feature Store, 200+ ML models sharing the same computed features
- Airbnb Zipline: batch and streaming features for price optimization and fraud detection on one platform
- Stripe: real-time fraud features via Kafka + Redis, <5ms latency for 10,000 transactions per second
- LinkedIn: Feature Store holds 500+ features for job recommendations, skill matching, and feed ranking
Feature Store: Why It Exists and What Is Inside
Uber, 2017. 200 ML models. Each team recomputes the same features: 'user activity over 7 days', 'average driver rating over 30 days'. 200 times. Inconsistently. With different results. This led to the creation of Michelangelo - the industry's first Feature Store.
A **Feature Store** is a centralized repository for ML features. Three functions: (1) storing computed features, (2) serving features for training and inference with identical logic, (3) versioning and lineage. The core problem it solves: **training-serving skew** - divergence between features used at training time and at serving time.
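Training-serving skew can be illustrated with a minimal sketch. The function names, the event data, and the 7-day window boundaries below are assumptions made for illustration only. The point is that a separate reimplementation with slightly different boundary handling returns a different value for the same feature.

```python
from datetime import datetime, timedelta

def activity_7d(events, as_of):
    """Count events in the half-open window [as_of - 7 days, as_of)."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for ts in events if window_start <= ts < as_of)

def activity_7d_reimplemented(events, as_of):
    """Hypothetical second implementation with different boundaries: (start, as_of]."""
    window_start = as_of - timedelta(days=7)
    return sum(1 for ts in events if window_start < ts <= as_of)

# Illustrative event timestamps for one user (January 1, 3, 5, 9).
events = [datetime(2024, 1, day) for day in (1, 3, 5, 9)]

# Training: the shared function evaluated at a historical label timestamp.
train_value = activity_7d(events, as_of=datetime(2024, 1, 8))
# Serving: the same shared function evaluated at request time.
serve_value = activity_7d(events, as_of=datetime(2024, 1, 11))
# Skew: the reimplementation disagrees with the shared one on identical input.
skewed_value = activity_7d_reimplemented(events, as_of=datetime(2024, 1, 8))

print(train_value, serve_value, skewed_value)
```

With a single shared definition, training and serving can only differ because the data differs, not because the logic does.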
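**Feast** (short for Feature Store) is an open-source project originally developed at Gojek. It supports Redis as the online store and BigQuery or Parquet files as the offline store. In the Feast SDK you declare an Entity (the join key), a FeatureView (a group of features tied to a data source), and the individual features inside the view (called Feature in older releases and Field in newer ones). Feast writes to the online store during materialize and reads from it during get_online_features.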
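The split between the offline and online store can be sketched with a toy in-memory model. This is not the Feast SDK: the functions below only borrow the names materialize and get_online_features, and their signatures, the `driver_id` key, and the sample rows are assumptions for illustration.

```python
from datetime import datetime

# Offline store: the full history of feature rows (entity key, event time, value).
offline_rows = [
    {"driver_id": 1, "event_ts": datetime(2024, 1, 1), "avg_rating_30d": 4.6},
    {"driver_id": 1, "event_ts": datetime(2024, 1, 2), "avg_rating_30d": 4.7},
    {"driver_id": 2, "event_ts": datetime(2024, 1, 1), "avg_rating_30d": 4.1},
]

# Online store: only the latest row per entity key (a dict stands in for Redis).
online_store = {}

def materialize(rows, end_ts):
    """Copy the latest row per entity, up to end_ts, into the online store."""
    for row in sorted(rows, key=lambda r: r["event_ts"]):
        if row["event_ts"] <= end_ts:
            online_store[row["driver_id"]] = row

def get_online_features(driver_id, feature_names):
    """Key lookup at serving time; unknown entities yield None values."""
    row = online_store.get(driver_id, {})
    return {name: row.get(name) for name in feature_names}

materialize(offline_rows, end_ts=datetime(2024, 1, 2))
features = get_online_features(1, ["avg_rating_30d"])
missing = get_online_features(99, ["avg_rating_30d"])
print(features, missing)
```

The offline store keeps history so that training sets can be rebuilt, while the online store trades history for a fast lookup by key.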
What is training-serving skew and how does a Feature Store help avoid it?
Managed Feature Stores: Tecton, Vertex AI, Databricks
**Tecton** is an enterprise Feature Store founded by members of the team that built Uber Michelangelo. Difference from Feast: it is a managed service with a feature pipeline scheduler, quality monitoring, and a built-in transformation engine. Airbnb, DoorDash, and Stripe use Tecton for production ML. Price: roughly USD 50,000 per year for an enterprise contract.
**Vertex AI Feature Store** (Google Cloud) and **Databricks Feature Store** are managed solutions tied to their respective platforms. Vertex AI: Bigtable as online store (milliseconds), BigQuery as offline. Databricks: Delta Lake as offline, automatic materialization. The choice is driven by the company's cloud strategy.
What distinguishes a managed Feature Store (Tecton) from a self-hosted one (Feast)?
Feature Pipeline: Batch, Streaming, On-Demand
Three types of feature computation: **batch** (Spark on a schedule, for historical data), **streaming** (Flink on Kafka, for near-realtime), **on-demand** (computed at request time, for context-dependent features). Fraud detection requires all three: historical patterns plus real-time velocity plus the current transaction context.
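The three computation types can be sketched side by side for the fraud detection case. This is plain Python standing in for Spark and Flink; the class and function names, the 10-minute window, and the sample amounts are assumptions, and the streaming counter assumes events arrive in time order.

```python
from collections import deque
from datetime import datetime, timedelta

# Batch: recomputed on a schedule from history (a Spark job in production).
def batch_avg_amount(history):
    return sum(history) / len(history)

# Streaming: updated on every event (a Flink job on Kafka in production).
class VelocityCounter:
    """Number of transactions inside a sliding time window."""

    def __init__(self, window):
        self.window = window
        self.timestamps = deque()

    def add(self, ts):
        self.timestamps.append(ts)
        # Evict events that have fallen out of the window.
        while self.timestamps and self.timestamps[0] <= ts - self.window:
            self.timestamps.popleft()
        return len(self.timestamps)

# On-demand: depends on the current request, so it is computed at request time.
def amount_deviation(current_amount, user_avg):
    return (current_amount - user_avg) / user_avg

user_avg = batch_avg_amount([20.0, 30.0, 40.0])
counter = VelocityCounter(window=timedelta(minutes=10))
t0 = datetime(2024, 1, 1, 12, 0)
for offset in (0, 1, 2):
    tx_last_10m = counter.add(t0 + timedelta(minutes=offset))

feature_vector = {
    "avg_amount": user_avg,                                 # batch
    "tx_last_10m": tx_last_10m,                             # streaming
    "amount_deviation": amount_deviation(90.0, user_avg),   # on-demand
}
print(feature_vector)
```

The on-demand feature combines a precomputed value (the user average) with the amount of the transaction being scored, which is why it cannot be fully precomputed.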
Which type of feature computation is best for 'deviation of transaction amount from user average'?
Feature Store Monitoring and Data Quality
A Feature Store without monitoring is a time bomb. Kafka stops delivering events - streaming features go stale. A Spark job fails - batch features are from last month. The model continues making predictions on outdated data. At Stripe this led to a 3x increase in fraud detection false positives within one hour.
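A freshness check can be as simple as comparing each feature's last update time against an allowed age. The sketch below is an assumption-laden illustration: the feature names and the age budgets in `MAX_AGE` are hypothetical, and a production check would read update timestamps from the store's metadata.

```python
from datetime import datetime, timedelta

# Hypothetical freshness budgets: how old a feature value may be before alerting.
MAX_AGE = {
    "tx_last_10m": timedelta(minutes=1),     # streaming feature
    "avg_amount_30d": timedelta(hours=26),   # daily batch feature plus slack
}

def stale_features(last_updated, now):
    """Return the sorted names of features older than their budget."""
    return sorted(
        name
        for name, updated_at in last_updated.items()
        if now - updated_at > MAX_AGE[name]
    )

now = datetime(2024, 1, 2, 12, 0)
last_updated = {
    "tx_last_10m": now - timedelta(minutes=5),    # stream has stalled
    "avg_amount_30d": now - timedelta(hours=3),   # batch job ran on time
}
stale = stale_features(last_updated, now)
print(stale)
```

An alert on a non-empty result catches the stalled stream before the model has served stale values for long.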
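**PSI (Population Stability Index)** is a widely used metric for drift monitoring. It compares the binned distribution of a feature in current data against a baseline such as the training set. Common rule-of-thumb thresholds: PSI < 0.1 means the data is stable; PSI 0.1-0.2 is a minor shift worth monitoring; PSI > 0.2 is significant drift, and retraining should be considered. PSI is more sensitive to distribution tails than the KS test.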
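PSI is commonly computed as the sum over bins of (actual share - expected share) * ln(actual share / expected share). The sketch below follows that definition; the bin edges, the sample data, and the small floor `eps` that avoids a logarithm of zero are assumptions chosen for illustration.

```python
import math

def bin_shares(values, edges, eps=1e-6):
    """Share of values per bin; empty bins are floored at eps to avoid log(0)."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            # Bins are half-open [lo, hi); values outside all bins are not counted.
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return [max(c / len(values), eps) for c in counts]

def psi(expected, actual, edges):
    """Population Stability Index between a baseline and a current sample."""
    e = bin_shares(expected, edges)
    a = bin_shares(actual, edges)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

edges = [0, 10, 20, 30]
training = [5] * 50 + [15] * 30 + [25] * 20   # shares 0.5, 0.3, 0.2
serving = [5] * 20 + [15] * 30 + [25] * 50    # shares 0.2, 0.3, 0.5

psi_same = psi(training, training, edges)
psi_shifted = psi(training, serving, edges)
print(round(psi_same, 3), round(psi_shifted, 3))
```

In this example the shifted sample yields a PSI of about 0.55, well above the 0.2 threshold, while identical samples yield 0.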
Myth: a single Feature Store solves all ML platform problems.
Reality: a Feature Store solves only the feature consistency problem. Separate components are needed for model registry, experiment tracking, data lineage, and serving infrastructure.
An ML platform is an ecosystem. Feature Store (Feast/Tecton) + Model Registry (MLflow) + Experiment Tracking (W&B) + Serving (Triton/TorchServe) + Observability (custom) - each component solves its own problem.
Why monitor feature freshness in a Feature Store?
Key ideas
- Feature Store solves training-serving skew: one computation logic for both training and serving
- Three types of feature computation: batch (Spark), streaming (Flink), on-demand (at request time)
- Online store (Redis): milliseconds for serving. Offline store (Parquet): for training
- Feast is open source; Tecton/Vertex AI are managed with scheduling and monitoring
- Freshness monitoring is critical: stale features cause silent quality degradation
Related topics
The Feature Store is a central component of the ML platform on Big Data.
- Spark MLlib — Batch feature computation is built on Spark transformations
- Flink Streaming — Real-time features are computed in Flink and written to the online store
- Kafka — Source for streaming feature computation
Questions for reflection
- How does a Feature Store handle point-in-time correctness when training on historical data?
- Design a Feature Store for an e-commerce platform: which features are needed and which are batch vs streaming vs on-demand?
- How can an A/B test be organized between two Feature Store versions of the same feature without risk to production?
Related lessons
- bd-15 — Spark ML pipeline is the predecessor of the Feature Store
- bd-06 — Streaming features are computed in Flink/Kafka and written to the Feature Store
- bd-10 — Kafka is the standard source for online feature computation
- bd-14 — Data Lakehouse architecture includes the Feature Store as a layer
- ml-04-data-preprocessing