
Feature Stores

A feature store is a centralized platform for defining, storing, and serving ML features consistently across training and inference. It solves the train-serve skew problem by ensuring the exact same feature transformation logic produces data for both offline model training and online prediction.

Overview

The feature store emerged as an ML infrastructure primitive around 2017-2018, pioneered by Uber (Michelangelo), Airbnb (Zipline), and LinkedIn (Frame). The core insight was that feature engineering -- the process of transforming raw data into model inputs -- is the most time-consuming part of applied ML (often 60-80% of effort), and most of that work is duplicated across teams. Different teams re-implement the same features (e.g., 'user's average session duration over 30 days') with slight variations, leading to inconsistencies between training and production.

The feature store solves three fundamental problems. First, train-serve skew: when training features are computed in batch (e.g., Spark SQL over a data lake) but serving features are computed in real-time (e.g., Java code in the serving path), subtle differences in logic, precision, or data freshness cause the model to receive inputs in production that differ from what it learned during training. A feature store ensures both paths use the same transformation definition. Second, feature reuse: once a team defines 'user_30d_avg_session_duration', any other team can discover and use it without re-implementing the computation. At Uber, this increased feature reuse from ~0% to over 50%. Third, point-in-time correctness: when building a training dataset, you must join features as they existed at the time of each training example, not as they exist today. Otherwise, you introduce data leakage (the model trains on future information). Feature stores automate point-in-time joins across multiple feature tables.
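
The point-in-time rule described above can be illustrated with a small sketch. It uses only the Python standard library; the feature history, the timestamps, and the helper name `value_as_of` are hypothetical and chosen for illustration, not taken from any particular feature store.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical history of one feature for one user: (effective_time, value),
# sorted by effective_time.
feature_history = [
    (datetime(2024, 1, 1), 12.0),
    (datetime(2024, 1, 8), 15.5),
    (datetime(2024, 1, 15), 9.0),
]

def value_as_of(history, ts):
    """Return the latest value whose effective_time is <= ts, or None."""
    times = [t for t, _ in history]
    idx = bisect_right(times, ts)
    return history[idx - 1][1] if idx > 0 else None

# Training examples carry their own event timestamps and labels.
labels = [
    (datetime(2024, 1, 5), 0),
    (datetime(2024, 1, 10), 1),
]

# Point-in-time join: each example sees the feature as it existed back then.
training_rows = [(ts, value_as_of(feature_history, ts), y) for ts, y in labels]

# Naive join: every example sees today's value, which leaks future information.
naive_rows = [(ts, feature_history[-1][1], y) for ts, y in labels]
```

Here the point-in-time rows pick up 12.0 and 15.5, while the naive join assigns 9.0 to both examples, a value that did not exist when either event happened.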

Architecturally, a feature store has four components: (1) a feature registry that catalogs feature definitions, owners, and metadata; (2) a transformation engine that computes features from raw data (batch transforms via Spark/Flink, streaming transforms via Kafka/Flink, on-demand transforms at request time); (3) an offline store (data warehouse or data lake) for training data retrieval; and (4) an online store (Redis, DynamoDB, or Bigtable) for low-latency serving. Materialization pipelines sync computed features from the offline store to the online store on a schedule.
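
To make the four components concrete, here is a toy in-memory sketch. The class `MiniFeatureStore`, its method names, and the owner string are assumptions made for illustration; a real system would back the offline and online stores with a warehouse and a key-value database rather than Python dictionaries.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (feature name, entity id)

@dataclass
class FeatureDefinition:
    name: str
    owner: str
    transform: Callable[[List[float]], float]  # raw values -> feature value

class MiniFeatureStore:
    def __init__(self) -> None:
        self.registry: Dict[str, FeatureDefinition] = {}       # (1) feature registry
        self.offline: Dict[Key, List[Tuple[int, float]]] = {}  # (3) history for training
        self.online: Dict[Key, float] = {}                     # (4) latest value for serving

    def register(self, definition: FeatureDefinition) -> None:
        self.registry[definition.name] = definition

    def compute(self, feature: str, entity: str, ts: int, raw: List[float]) -> float:
        # (2) transformation engine: the registered transform is the only code path.
        value = self.registry[feature].transform(raw)
        self.offline.setdefault((feature, entity), []).append((ts, value))
        return value

    def materialize(self) -> None:
        # Copy the most recent offline value per key into the online store.
        for key, rows in self.offline.items():
            self.online[key] = max(rows, key=lambda row: row[0])[1]

    def get_online(self, feature: str, entity: str) -> Optional[float]:
        return self.online.get((feature, entity))

store = MiniFeatureStore()
store.register(FeatureDefinition("user_30d_avg_session_duration", "growth-team", mean))
store.compute("user_30d_avg_session_duration", "user_1", ts=1, raw=[10.0, 20.0])
store.compute("user_30d_avg_session_duration", "user_1", ts=2, raw=[30.0, 50.0])
store.materialize()
latest = store.get_online("user_30d_avg_session_duration", "user_1")
```

The offline store keeps both computed rows for training, while the online store holds only the value from the latest timestamp.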

The distinction between batch, streaming, and on-demand features is critical. Batch features (recomputed hourly or daily) are the simplest and cheapest -- most features fall here. Streaming features (updated in near-real-time from event streams) are needed for fraud detection, real-time recommendations, and dynamic pricing. On-demand features (computed at request time from the request payload, like 'distance between user location and merchant') cannot be precomputed because they depend on request-time data. A production feature store must support all three modes.
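
As an illustration of the on-demand mode, the sketch below computes the 'distance between user location and merchant' feature at request time and merges it with precomputed values. The request field names, the feature name `user_merchant_distance_km`, and the helper `build_feature_vector` are hypothetical.

```python
import math
from typing import Dict

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def build_feature_vector(request: Dict[str, float], precomputed: Dict[str, float]) -> Dict[str, float]:
    """Combine precomputed (batch or streaming) features with an on-demand one."""
    features = dict(precomputed)
    features["user_merchant_distance_km"] = haversine_km(
        request["user_lat"], request["user_lon"],
        request["merchant_lat"], request["merchant_lon"],
    )
    return features

# Hypothetical request payload and a value already fetched from the online store.
request = {"user_lat": 0.0, "user_lon": 0.0, "merchant_lat": 0.0, "merchant_lon": 1.0}
precomputed = {"user_7d_transaction_count": 4.0}
vector = build_feature_vector(request, precomputed)
```

Because the distance depends on coordinates that only arrive with the request, it has to be computed inside the serving latency budget rather than looked up.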

Key Points
  1. Train-serve skew is the #1 ML production bug. It occurs when training features are computed differently than serving features -- even small numerical precision differences can degrade model accuracy by 5-15%. Feature stores enforce a single transformation definition for both paths.
  2. Point-in-time joins are essential for training correctness. A naive join of features at the latest timestamp leaks future information into the training set. Feature stores maintain feature history and automatically join features as they existed at each training example's timestamp.
  3. The online store (Redis, DynamoDB) serves features at p99 < 5ms for real-time inference. The offline store (BigQuery, S3/Parquet) provides bulk retrieval for training. Materialization pipelines bridge the two, typically running on a schedule or triggered by data arrival.
  4. Feature reuse across teams eliminates redundant computation and inconsistency. Organizations with mature feature stores report 40-60% feature reuse rates, reducing feature engineering effort by 30-50% for new models.
  5. Streaming features (computed from Kafka/Flink in near-real-time) are critical for fraud detection and real-time personalization but cost 5-10x more to operate than batch features. Use them only when freshness directly impacts model performance.
  6. On-demand features are computed at serving time from request context (e.g., user's current GPS coordinates). They cannot be pre-materialized and require the transformation to run within the serving latency budget.
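
The streaming key point above can be sketched as a sliding-window counter. This is a single-process illustration under the assumption that events arrive in timestamp order; the class name `SlidingWindowCounter` is hypothetical, and a production system would typically run equivalent logic in a stream processor such as Flink.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events in the window (now - window_seconds, now]."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self.events = deque()  # event timestamps in seconds, oldest first

    def add(self, ts: int) -> None:
        # Assumes timestamps are appended in non-decreasing order.
        self.events.append(ts)

    def count(self, now: int) -> int:
        cutoff = now - self.window_seconds
        # Evict events that have fallen out of the window.
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()
        return len(self.events)

# Hypothetical card transactions, 5-minute (300 s) window.
card_txns = SlidingWindowCounter(window_seconds=300)
for ts in (0, 120, 250, 400):
    card_txns.add(ts)

count_at_400 = card_txns.count(now=400)  # window is (100, 400]
count_at_430 = card_txns.count(now=430)  # window is (130, 430]
```

The count changes as time advances even without new events, which is why such features need continuous updating rather than a daily batch job.
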
Simple Example

Preventing Train-Serve Skew in Fraud Detection

A fraud detection model uses a feature 'user_7d_transaction_count'. During training, a data scientist computes this using a SQL query over the data warehouse: COUNT(*) with a 7-day window. In production, an engineer re-implements the same logic in Java, but uses a 7-day window based on UTC midnight boundaries instead of a rolling 7-day window. The model's accuracy drops 8% in production because the feature values differ. With a feature store, both training and serving use the same registered transformation definition, eliminating the discrepancy.
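
The discrepancy in this example can be reproduced in a few lines. The sketch below assumes one possible reading of 'UTC midnight boundaries', namely the seven complete UTC days before the most recent midnight; the function names and the transaction timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def rolling_7d_count(txn_times, now):
    """Training-side definition: transactions in (now - 7 days, now]."""
    start = now - timedelta(days=7)
    return sum(1 for t in txn_times if start < t <= now)

def midnight_aligned_7d_count(txn_times, now):
    """Serving-side re-implementation: the 7 full UTC days before today's midnight."""
    end = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = end - timedelta(days=7)
    return sum(1 for t in txn_times if start <= t < end)

utc = timezone.utc
now = datetime(2024, 3, 10, 18, 0, tzinfo=utc)
txn_times = [
    datetime(2024, 3, 3, 9, 0, tzinfo=utc),    # only inside the midnight-aligned window
    datetime(2024, 3, 6, 12, 0, tzinfo=utc),   # inside both windows
    datetime(2024, 3, 10, 15, 0, tzinfo=utc),  # only inside the rolling window
    datetime(2024, 3, 10, 17, 30, tzinfo=utc), # only inside the rolling window
]

training_value = rolling_7d_count(txn_times, now)
serving_value = midnight_aligned_7d_count(txn_times, now)
```

For the same user at the same moment, the two definitions return 3 and 2. Registering a single transformation removes the second implementation, so there is nothing left to drift.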

Real-World Examples

Uber (Michelangelo)

Uber's Michelangelo platform includes a feature store that serves features for ride pricing, ETA prediction, fraud detection, and restaurant recommendations. It processes over 10 million feature lookups per second from a Cassandra-backed online store with p99 < 10ms. Feature definitions are registered in a central catalog, and the same Spark-based transformations generate both training datasets and materialized online features.

Stripe

Stripe's feature store powers fraud detection models that evaluate billions of transactions per year. Streaming features (e.g., 'number of transactions from this card in the last 5 minutes') are computed via Flink from a Kafka event stream and stored in a Redis-backed online store with p99 < 3ms. The same Flink job backfills the offline store for training, ensuring zero train-serve skew for time-windowed aggregation features.

DoorDash

DoorDash built a feature store on top of Redis and Apache Flink to serve features for delivery time estimation, dynamic pricing, and merchant recommendations. The system handles 30K+ feature lookups per second during peak dinner hours. They reduced new model development time from 2 weeks to 3 days by enabling data scientists to discover and reuse existing features from a centralized catalog of 2,000+ registered features.

Trade-Offs
| Aspect | Description |
| --- | --- |
| Freshness vs. Cost | Real-time streaming features (updated every second) provide the freshest data but require Flink/Spark Streaming infrastructure costing 5-10x more than batch. Daily batch features are cheap but stale. Choose freshness based on model sensitivity: fraud detection needs seconds; product recommendations tolerate hours. |
| Centralization vs. Team Autonomy | A centralized feature store enforces consistency and enables reuse but can become a bottleneck if feature onboarding requires central team approval. A decentralized approach lets teams define features independently but risks duplication and inconsistency. Most organizations adopt a federated model: central infra, team-owned feature definitions. |
| Pre-computation vs. On-Demand Computation | Pre-computing and materializing features to the online store ensures low-latency serving but wastes resources on features that are rarely queried. On-demand computation saves storage but adds serving latency and CPU cost. High-QPS features should be pre-materialized; long-tail features can be computed on demand. |
| Build vs. Buy | Open-source Feast is free and flexible but requires significant operational investment (managing Redis, Spark, materialization pipelines). Managed solutions (Tecton, Databricks, SageMaker Feature Store) reduce ops burden but add cost ($50K-$500K/year) and vendor lock-in. The break-even depends on team size and operational maturity. |
Case Study

Airbnb's Zipline Feature Store

Scenario

Airbnb's ML teams were spending 60% of their time on feature engineering, with each team independently computing similar features. A search ranking team and a pricing team both computed 'average nightly price in region over 30 days' using different SQL queries with slightly different definitions of 'region', causing inconsistent model behavior across products.

Solution

Airbnb built Zipline, an internal feature store with a unified feature definition language, automated point-in-time-correct training data generation, and a low-latency online serving layer backed by a custom key-value store. Features were defined once in a declarative config and automatically materialized to both the offline (Hive) and online (custom KV) stores.

Outcome

Feature reuse reached 50% across 100+ ML models. New model development time dropped from weeks to days because data scientists could browse and compose existing features rather than writing new ETL jobs. Train-serve skew incidents decreased by 80%, directly improving model accuracy in production by an average of 3-5% across Airbnb's ML portfolio.

Common Mistakes
  • Building a feature store before having 5+ models in production. Feature stores add significant infrastructure complexity. If you have 1-3 models, a well-structured feature pipeline with shared libraries is sufficient. The investment pays off when feature duplication and train-serve skew become recurring problems across multiple teams.
  • Ignoring point-in-time correctness in training data generation. Joining features at the latest available timestamp (rather than the timestamp of each training example) leaks future information and produces inflated offline metrics that do not reproduce in production. Always use point-in-time joins.
  • Materializing all features to the online store regardless of access patterns. Pre-materializing 10,000 features when only 500 are used for online serving wastes compute and storage. Monitor feature access patterns and only materialize features that are actually queried at serving time.
  • Treating the feature store as just a key-value cache. The value of a feature store is in the transformation management, not just the storage. If you only use it to cache pre-computed values without registering transformation logic, you lose the train-serve consistency guarantee.
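
The advice on materializing only queried features can be sketched as a filter over serving logs. The log contents, the threshold, and the helper name `features_to_materialize` are hypothetical; a real deployment would read lookup counts from its own monitoring data.

```python
from collections import Counter

# Hypothetical serving log: one entry per online lookup of a feature.
lookup_log = [
    "user_7d_transaction_count",
    "user_7d_transaction_count",
    "user_30d_avg_session_duration",
    "user_7d_transaction_count",
    "merchant_lifetime_refund_rate",
]

def features_to_materialize(log, min_lookups):
    """Return the features looked up at least min_lookups times, sorted by name."""
    counts = Counter(log)
    return sorted(name for name, n in counts.items() if n >= min_lookups)

selected = features_to_materialize(lookup_log, min_lookups=2)
```

Features below the threshold stay in the offline store and can be served through on-demand computation if a model later needs them.
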
Test Your Understanding

1. What is the primary problem that feature stores solve?

2. Why are point-in-time joins important when generating training datasets?
