Feature Engineering & Feature Stores at Scale
Session 38 of the 48-session learning series.
Why this session matters
This is Session 38 of 48 in the ML track. Modern ML is often described as 80% feature engineering and 20% modelling, and at scale inconsistency between training and serving features is a leading source of silent model degradation. Feature stores were invented to fix exactly that. Understanding the architecture, not just the buzzword, is critical.
Agenda
- What a feature actually is (and where definitions go to die)
- Feature stores — offline + online + the consistency contract
- Training/serving skew — the silent killer
- Feature pipelines — point-in-time correctness, backfills, freshness
- Real-time features — streaming aggregations, materialisation latency
Pre-read (skim before the session)
- Uber — Michelangelo platform
- Feast — Feature store concepts
- Tecton blog — point-in-time correctness
- Eugene Yan — Feature stores explained
Deep dive
1. The feature definition problem
A feature is a deterministic function of raw data:
user_clicks_last_7d(user_id, ts) = COUNT(clicks WHERE user=user_id AND click_ts BETWEEN ts-7d AND ts)
Definitions live in three places without a feature store:
- The training notebook (Python + Spark).
- The online serving code (Go service).
- The eval script (different Python script).
Three implementations of the same logic. Three sources of drift. Three places to fix when the definition changes.
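As a minimal sketch of what "one deterministic function" means, the definition above can be written once in Python and called from every code path. The in-memory `clicks` list and its `(user_id, click_ts)` layout are assumptions made for illustration, not part of any particular feature store.

```python
from datetime import datetime, timedelta

def user_clicks_last_7d(clicks, user_id, ts):
    """Count clicks by user_id in the window [ts - 7d, ts], both ends inclusive."""
    window_start = ts - timedelta(days=7)
    return sum(
        1
        for click_user, click_ts in clicks
        if click_user == user_id and window_start <= click_ts <= ts
    )

# Hypothetical raw click log: (user_id, click_ts) pairs.
clicks = [
    (42, datetime(2024, 1, 1, 12, 0)),
    (42, datetime(2024, 1, 5, 9, 30)),
    (42, datetime(2024, 1, 9, 8, 0)),   # after the as-of time, must not count
    (7, datetime(2024, 1, 5, 10, 0)),   # different user
]

as_of = datetime(2024, 1, 8, 0, 0)
count = user_clicks_last_7d(clicks, 42, as_of)
print(count)
```

Because the function depends only on the raw data and the `(user_id, ts)` arguments, calling it again with the same inputs returns the same value, which is the property the three separate implementations tend to lose.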
2. What a feature store gives you
- One definition for a feature → used in both training and serving.
- Offline store (warehouse) for training; bulk historical reads.
- Online store (low-latency KV) for serving; single-key fast reads.
- Materialisation — pipeline copies pre-computed features from offline → online.
- Discovery — UI / catalog of available features.
- Lineage — feature → upstream tables → owners.
- Monitoring — drift, freshness, distribution.
It's data infrastructure with an ML flavour.
3. Offline vs online stores
| Layer | Tech | Latency | Use |
|---|---|---|---|
| Offline | Parquet on S3, Snowflake, BigQuery, Delta | seconds-minutes | Training, backfill, batch eval |
| Online | Redis, DynamoDB, Cassandra, ScyllaDB | < 10 ms | Real-time inference |
The store must guarantee: the value of feature(user=42, time=t) read offline for training equals the value that was served online at time t.
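The offline/online split and the materialisation step can be illustrated with plain Python containers. This is a toy sketch under stated assumptions: the row layout, `materialise`, and `offline_lookup` are hypothetical names, a list stands in for the warehouse, and a dict stands in for the key-value store.

```python
from datetime import datetime

# Offline store: full history, one row per (entity, event_ts). Assumed layout.
offline_rows = [
    {"user": 42, "event_ts": datetime(2024, 1, 1), "clicks_7d": 3},
    {"user": 42, "event_ts": datetime(2024, 1, 2), "clicks_7d": 5},
    {"user": 7, "event_ts": datetime(2024, 1, 2), "clicks_7d": 1},
]

def materialise(rows):
    """Copy the latest value per entity into a key-value 'online store'."""
    online = {}
    for row in sorted(rows, key=lambda r: r["event_ts"]):
        # Later rows overwrite earlier ones, so each key ends at its newest value.
        online[row["user"]] = {"clicks_7d": row["clicks_7d"], "event_ts": row["event_ts"]}
    return online

def offline_lookup(rows, user, as_of):
    """Historical read: latest value at or before as_of, or None if there is none."""
    candidates = [r for r in rows if r["user"] == user and r["event_ts"] <= as_of]
    if not candidates:
        return None
    return max(candidates, key=lambda r: r["event_ts"])["clicks_7d"]

online_store = materialise(offline_rows)

# Consistency contract: the online value equals the offline value as of the same time.
latest_ts = online_store[42]["event_ts"]
print(online_store[42]["clicks_7d"], offline_lookup(offline_rows, 42, latest_ts))
```

The online side answers only "what is the value now" with a single-key read, while the offline side can answer "what was the value at time t", which is what training needs.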
4. Training/serving skew — the bug class
Symptoms:
- Model AUC=0.85 offline, behaves like AUC=0.65 in production.
- Bugs manifest months later, when a feature pipeline silently drifts.
Causes:
- Feature computed differently in train vs serve (different SQL, different timezone, different null handling).
- Training read a feature snapshot from a table that lands 24h late, so it contains data that was not yet available at prediction time (look-ahead bias), while serving reads live values.
- Online feature has different freshness than training timestamps assumed.
Fixes (all of them, layered):
- Single source of truth for feature logic (feature store).
- Point-in-time joins (next section).
- Log live features → reuse for retraining.
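One way to act on "log live features" is to compare the logged serving values with the values the offline pipeline recomputes for the same keys. The sketch below assumes both sides are available as dicts keyed by `(entity, timestamp)`; the function name `skew_report` and the sample data are hypothetical.

```python
import math

def skew_report(logged, recomputed, tolerance=1e-6):
    """Compare logged serving values with offline recomputed values, key by key.

    Returns (mismatch_rate, mismatches), where mismatches is a list of
    (key, served_value, offline_value) tuples.
    """
    mismatches = []
    for key, served_value in logged.items():
        offline_value = recomputed.get(key)
        if served_value is None and offline_value is None:
            continue  # both sides agree the value is missing
        one_missing = (served_value is None) != (offline_value is None)
        if one_missing or not math.isclose(served_value, offline_value, abs_tol=tolerance):
            mismatches.append((key, served_value, offline_value))
    rate = len(mismatches) / len(logged) if logged else 0.0
    return rate, mismatches

# Hypothetical data: the serving path turned a null into 0.0, the training path kept null.
logged = {(42, "2024-01-08"): 5.0, (7, "2024-01-08"): 0.0, (9, "2024-01-08"): 2.0}
recomputed = {(42, "2024-01-08"): 5.0, (7, "2024-01-08"): None, (9, "2024-01-08"): 3.0}

rate, mismatches = skew_report(logged, recomputed)
print(f"mismatch rate: {rate:.2f}")
```

A non-zero mismatch rate on a feature that is supposed to share one definition is a direct signal of the null-handling, timezone, or SQL differences listed under Causes.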
5. Point-in-time correctness
This is the hard part. When you build training data, for each row (entity, label_ts) you want the features as they were at label_ts — not now.
-- WRONG: uses current feature value
SELECT label.*, feat.value
FROM labels AS label
JOIN features AS feat ON label.user = feat.user
-- RIGHT: temporal join, point-in-time (ASOF JOIN as in DuckDB; syntax varies by engine)
SELECT label.*, feat.value
FROM labels AS label
ASOF JOIN features AS feat
ON label.user = feat.user
AND label.label_ts >= feat.event_ts
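Without point-in-time joins you leak the future into the past and your offline metrics are fiction. Polars (join_asof), pandas (merge_asof), DuckDB (ASOF JOIN) and Feast (get_historical_features) support point-in-time joins with varying syntax; where an engine has no as-of join, a window function that keeps the latest feature row at or before label_ts gives the same result.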
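To make the semantics concrete without depending on any one engine, here is a minimal standard-library sketch of a point-in-time join. The tuple layouts, integer timestamps, and the name `point_in_time_join` are assumptions for illustration; in practice an as-of join from a dataframe library or the warehouse does this work.

```python
from bisect import bisect_right
from collections import defaultdict

def point_in_time_join(labels, features):
    """For each (user, label_ts, label) row, attach the latest feature value
    with event_ts <= label_ts, or None when no such value exists."""
    history = defaultdict(list)
    for user, event_ts, value in sorted(features, key=lambda f: (f[0], f[1])):
        history[user].append((event_ts, value))

    joined = []
    for user, label_ts, label in labels:
        rows = history.get(user, [])
        timestamps = [ts for ts, _ in rows]
        idx = bisect_right(timestamps, label_ts)  # rows[:idx] have event_ts <= label_ts
        value = rows[idx - 1][1] if idx > 0 else None
        joined.append((user, label_ts, label, value))
    return joined

# Hypothetical data; timestamps are day numbers to keep the example short.
features = [(42, 1, 3), (42, 5, 8), (42, 9, 20), (7, 4, 1)]  # (user, event_ts, value)
labels = [(42, 6, 1), (42, 0, 0), (7, 4, 1)]                 # (user, label_ts, label)

training_rows = point_in_time_join(labels, features)
for row in training_rows:
    print(row)
```

For user 42 at `label_ts = 6` the join picks the value written at `event_ts = 5`, not the later value from `event_ts = 9` that a plain join on `user` would also return.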
6. Streaming features
The features that drove the field forward:
- clicks_last_5min(user) — recency matters.
- merchant_velocity_last_hour(merchant) — fraud detection.
- session_pages_so_far(session) — recommendation context.
Compute via:
- Flink / Spark Structured Streaming → write to online store every N seconds.
- Materialised in Redis with TTL.
- Reads at serve = single Redis GET.
Trade-offs: freshness vs query simplicity vs cost. Fresher features need more frequent writes and more streaming compute.
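The aggregation a streaming job maintains for a feature like clicks_last_5min can be sketched in a few lines. This is an in-process illustration only: the class name `SlidingWindowCounter` is hypothetical, events are assumed to arrive in timestamp order, and a real deployment would run the equivalent logic in a stream processor and write the result to the online store.

```python
from collections import deque

class SlidingWindowCounter:
    """Counts events per key in the window (now - window_seconds, now]."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = {}  # key -> deque of event timestamps, oldest first

    def add(self, key, ts):
        # Assumes timestamps for a key arrive in non-decreasing order.
        self.events.setdefault(key, deque()).append(ts)

    def count(self, key, now):
        # Assumes `now` never moves backwards between calls, since expired
        # events are dropped permanently.
        q = self.events.get(key)
        if q is None:
            return 0
        cutoff = now - self.window_seconds
        while q and q[0] <= cutoff:
            q.popleft()
        return len(q)

counter = SlidingWindowCounter(window_seconds=300)
for ts in (0, 100, 350):
    counter.add("user_1", ts)

clicks_at_360 = counter.count("user_1", now=360)
clicks_at_700 = counter.count("user_1", now=700)
print(clicks_at_360, clicks_at_700)
```

Evicting from the left of the deque keeps each update cheap, which is the same reason stream processors keep windowed state rather than rescanning raw events on every read.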
7. Feature freshness budget
Per feature, declare:
- Staleness SLA — "this feature must be < N seconds old".
- Alert when freshness exceeds SLA.
Production pattern:
- Real-time fraud features: < 5 s.
- Recommendation features: < 60 s.
- Personalisation embeddings: < 1 day.
- Demographic / static: weekly is fine.
Pay for compute matching the SLA. Don't run sub-second updates on weekly features.
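A freshness budget can be expressed as data plus one check. The sketch below uses the SLA values from the production pattern above; the feature names and the `stale_features` helper are hypothetical, and the alerting hook is left out.

```python
from datetime import datetime, timedelta

# Staleness SLA per feature. Values follow the production pattern above;
# the feature names are made up for illustration.
STALENESS_SLA = {
    "fraud_txn_velocity": timedelta(seconds=5),
    "reco_recent_views": timedelta(seconds=60),
    "user_embedding": timedelta(days=1),
    "user_demographics": timedelta(weeks=1),
}

def stale_features(last_updated, now):
    """Return the sorted names of features whose age exceeds their SLA."""
    return sorted(
        name
        for name, sla in STALENESS_SLA.items()
        if now - last_updated[name] > sla
    )

now = datetime(2024, 1, 8, 12, 0, 0)
last_updated = {
    "fraud_txn_velocity": now - timedelta(seconds=12),  # over the 5 s budget
    "reco_recent_views": now - timedelta(seconds=30),   # within 60 s
    "user_embedding": now - timedelta(days=2),          # over the 1 day budget
    "user_demographics": now - timedelta(days=3),       # within a week
}

breaches = stale_features(last_updated, now)
print(breaches)
```

Keeping the SLA next to the feature definition makes the cost trade-off explicit: the update frequency of each pipeline can be sized from this table instead of defaulting everything to the fastest tier.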
8. Backfill
When you add a new feature you need historical values to train on. Backfill = compute the feature for every row of historical data.
Patterns:
- Run a Spark job over all historical raw data.
- Use the same code path as the streaming compute (Kappa architecture) — single source of truth.
- Idempotent writes; can re-run on failure.
- Watch the cost: backfilling 2 years of data can reach petabyte scale, so estimate the scan size before launching the job.
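The idempotent-write pattern can be sketched as an upsert keyed by `(entity, day)`, so a re-run after a failure overwrites rows instead of duplicating them. Everything here is an assumption for illustration: `backfill`, `daterange`, the dict that stands in for the offline store, and the toy feature `clicks_so_far`.

```python
from datetime import date, timedelta

def daterange(start, end):
    """Yield each day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)

def backfill(feature_fn, entities, start, end, store):
    """Compute feature_fn(entity, day) for every entity and day and upsert it.

    Writes are keyed by (entity, day), so running the job twice leaves the
    store in the same state as running it once.
    """
    for day in daterange(start, end):
        for entity in entities:
            store[(entity, day)] = feature_fn(entity, day)
    return store

# Hypothetical raw data and feature definition, shared with the live code path.
raw_clicks = [(42, date(2024, 1, 1)), (42, date(2024, 1, 2)), (7, date(2024, 1, 3))]

def clicks_so_far(user, day):
    return sum(1 for u, d in raw_clicks if u == user and d <= day)

store = {}
backfill(clicks_so_far, [42, 7], date(2024, 1, 1), date(2024, 1, 3), store)
first_run = dict(store)
backfill(clicks_so_far, [42, 7], date(2024, 1, 1), date(2024, 1, 3), store)  # simulated retry
print(len(store), store == first_run)
```

Passing the feature function in, rather than rewriting it inside the job, is a small-scale version of using the same code path for backfill and live compute.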
9. Embedding features
Embeddings (S17) are a special kind of feature:
- High-dim numeric vectors.
- Stored in vector DB or as bytes in KV store.
- Updated when the model that produced them is retrained.
- Need version tracking (item_emb_v3).
Versioning is critical. Serving an old model with a new embedding format = silent garbage.
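One simple guard against mixing versions is to put the version in the storage key and have the model declare which version and dimension it expects. The sketch below follows the item_emb_v3 naming above; the key format, `fetch_embedding`, and `EmbeddingVersionError` are hypothetical, and a dict stands in for the KV store.

```python
class EmbeddingVersionError(LookupError):
    """Raised when the requested embedding version or shape is not available."""

def embedding_key(name, version, entity_id):
    # Assumed key layout: "<name>_v<version>:<entity_id>", e.g. "item_emb_v3:101".
    return f"{name}_v{version}:{entity_id}"

def fetch_embedding(store, name, entity_id, expected_version, expected_dim):
    """Read an embedding and fail loudly instead of serving a mismatched vector."""
    key = embedding_key(name, expected_version, entity_id)
    vector = store.get(key)
    if vector is None:
        raise EmbeddingVersionError(f"no embedding stored under {key}")
    if len(vector) != expected_dim:
        raise EmbeddingVersionError(
            f"{key} has dimension {len(vector)}, model expects {expected_dim}"
        )
    return vector

store = {"item_emb_v3:101": [0.1, 0.2, 0.3]}

vector = fetch_embedding(store, "item_emb", 101, expected_version=3, expected_dim=3)
print(vector)
```

A model still pinned to version 2 gets an explicit error here rather than a vector from a different embedding space.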
10. Feature governance
- Owner — engineer responsible.
- Description — what does it actually mean.
- Source — upstream tables / streams.
- Tags — sensitive (PII), expensive, slow-to-refresh.
- Usage — which models consume it; auto-discovered.
- Quality stats — null rate, distribution, drift score.
Same hygiene as data governance (S32). A feature without an owner is a feature you'll delete during the next outage.
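The governance fields above map naturally onto a small metadata record with a completeness check. This is a sketch, not the schema of any particular catalog: `FeatureMetadata`, `governance_gaps`, and the sample entries are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureMetadata:
    name: str
    owner: str
    description: str
    sources: list
    tags: set = field(default_factory=set)

def governance_gaps(catalog):
    """Return {feature_name: [missing fields]} for entries lacking owner,
    description, or sources."""
    gaps = {}
    for feature in catalog:
        missing = [
            field_name
            for field_name in ("owner", "description", "sources")
            if not getattr(feature, field_name)
        ]
        if missing:
            gaps[feature.name] = missing
    return gaps

catalog = [
    FeatureMetadata(
        name="user_clicks_last_7d",
        owner="growth-ml",
        description="Clicks by the user in the trailing 7 days.",
        sources=["events.clicks"],
        tags={"expensive"},
    ),
    FeatureMetadata(
        name="merchant_velocity_last_hour",
        owner="",
        description="Transactions per merchant in the trailing hour.",
        sources=[],
    ),
]

gaps = governance_gaps(catalog)
print(gaps)
```

Running a check like this in CI is one way to stop ownerless features from entering the catalog in the first place.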
11. The build/buy decision
Feature store tools:
- Feast — open source; you bring the infra. Most popular OSS.
- Tecton — SaaS; biggest commercial; richest features.
- SageMaker Feature Store — AWS-native.
- Vertex AI Feature Store — GCP-native.
- Databricks Feature Engineering — built into Lakehouse.
- Build your own — fine for a single ML team; quickly painful for multi-team.
Rule of thumb:
- 1–2 ML engineers, 1–3 models: skip the feature store. Use Spark + Redis.
- 5+ ML engineers, 10+ models: feature store is mandatory.
12. Reality check
A first-feature-store rollout:
- Audit current features — find the duplicates.
- Move top-10 highest-traffic features into Feast (or chosen platform).
- Implement point-in-time training data generation.
- Wire monitoring: freshness, null rate, drift.
- Deprecate the duplicates; force-migrate model owners.
Plan for 3 months minimum. The benefit lands as the second and third model consume the same definitions — that's when you stop reinventing.
Reading material
Books:
- Designing Machine Learning Systems — Chip Huyen (the feature engineering + feature stores chapters; the canonical reference)
- Feature Engineering for Machine Learning — Alice Zheng & Amanda Casari (O'Reilly; the algorithm-by-algorithm cookbook)
- Feature Store for Machine Learning — Jayanth Kumar M J (Packt; the implementation-focused book)
- Reliable Machine Learning — Cathy Chen, Niall Murphy et al. (the Google/Microsoft SREs-of-ML book; the training/serving skew chapter)
Papers:
- Hidden Technical Debt in Machine Learning Systems — Sculley et al. 2015 (Google, NeurIPS) — the paper that framed data dependencies, glue code, and pipeline jungles as hidden maintenance costs of ML systems.
- Feature Stores: The Data Side of ML Pipelines — Stoyanovich et al. 2022 — the academic survey.
Official docs:
- Feast documentation — the open-source feature store.
- Tecton documentation — the commercial feature platform from the Uber Michelangelo team.
- Databricks Feature Engineering docs — the lakehouse-native feature store.
- AWS SageMaker Feature Store — the AWS take.
- Google Vertex AI Feature Store — the GCP take.
Blog posts:
- Uber Engineering — Michelangelo: Uber's Machine Learning Platform — the OG feature store; the paper everyone copies.
- Airbnb Engineering — Zipline: Airbnb's Machine Learning Data Management Platform — point-in-time correctness done right.
- Eugene Yan — Feature Stores: A Hierarchy of Needs — the modern practitioner essay; the right reading after the book.
- Chip Huyen — Real-time machine learning: challenges and solutions — the canonical online-features essay.
- Tecton — Time travel in ML feature stores — the deepest practical write-up on point-in-time correctness.
In-depth research material
- Feast — github.com/feast-dev/feast — ~5.8k ★, the open-source feature store; the reference implementation.
- Hopsworks — github.com/logicalclocks/hopsworks — ~1.2k ★, the OG end-to-end feature store + lakehouse.
- Featureform — github.com/featureform/featureform — ~1.8k ★, the virtual feature store (works on top of your existing warehouse).
- Chronon — github.com/airbnb/chronon — Airbnb's open-sourced version of Zipline.
- DoorDash — Building a Real-time Predictive Maintenance Platform — the gigascale feature serving story.
- Spotify Engineering — Modeling Engineering Productivity — feature store as platform.
- Lyft Engineering — Lyft's Feature Generation Platform — the cost + scale story.
- Robinhood Engineering — Building a feature store at Robinhood — fintech low-latency feature serving.
- Pinterest Engineering — Scaling Feature Engineering at Pinterest — petabyte-scale feature pipelines.
- Materialize — Streaming feature stores with SQL — the streaming-SQL angle on real-time features.
- Tecton blog — Online and offline consistency — the production playbook.
Videos
- Feature Stores Explained — Made With ML — Goku Mohandas · 22 min — the cleanest 22-minute primer.
- The Hierarchy of Needs for ML Feature Stores — Eugene Yan — 41 min — the talk version of his essay.
- Building a feature store at Tecton — Mike Del Balso (Tecton CEO, ex-Uber Michelangelo) — 32 min — from the person who built the original at Uber.
- Real-time machine learning at scale — Chip Huyen — 47 min — the streaming-features deep dive.
- Feast: The Open-Source Feature Store — Willem Pienaar (Feast creator) · 35 min — from the project lead.
LeetCode — Subarray Sum Equals K
- Link: https://leetcode.com/problems/subarray-sum-equals-k/
- Difficulty: Medium
- Why this problem: Counting subarrays with a running prefix sum and a hash map of earlier sums — the same bookkeeping that sits behind "events in last X minutes" features.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Define training/serving skew and 3 root causes.
- Explain point-in-time joins and write the ASOF join syntax.
- Sketch the offline/online architecture of a feature store.
- Pick a freshness SLA appropriate to the use case.
- Decide when a team needs a feature store vs Spark + Redis.
- Solve subarray-sum-equals-k — prefix-sum + hashmap, the same shape as streaming-window aggregations.