ai · ml · intermediate · 12m · 2026-06-09

Feature Engineering & Feature Stores at Scale

Session 38 of the 48-session learning series.

Why this session matters

A common rule of thumb says modern ML is 80% feature engineering and 20% modelling, and at scale, inconsistency between training and serving features is a leading source of silent model degradation. Feature stores were invented to fix exactly that. Understanding the architecture, not just the buzzword, is critical.

Agenda

What a feature actually is (and where definitions go to die)
Feature stores — offline + online + the consistency contract
Training/serving skew — the silent killer
Feature pipelines — point-in-time correctness, backfills, freshness
Real-time features — streaming aggregations, materialisation latency

Pre-read (skim before the session)

Deep dive

1. The feature definition problem

A feature is a deterministic function of raw data:

user_clicks_last_7d(user_id, ts) = COUNT(clicks WHERE user=user_id AND click_ts BETWEEN ts-7d AND ts)

Definitions live in three places without a feature store:

The training notebook (Python + Spark).
The online serving code (Go service).
The eval script (different Python script).

Three implementations of the same logic. Three sources of drift. Three places to fix when the definition changes.
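As a minimal sketch of what "one definition" could look like, the function below implements user_clicks_last_7d once, so that a notebook, a service and an eval script could all import it. The (user_id, click_ts) tuple shape for raw clicks is an assumption made for illustration, not something the session prescribes.

```python
from datetime import datetime, timedelta
from typing import Iterable, Tuple

Click = Tuple[str, datetime]  # (user_id, click_ts); hypothetical raw-event shape


def user_clicks_last_7d(clicks: Iterable[Click], user_id: str, ts: datetime) -> int:
    """Count clicks by user_id with ts - 7d <= click_ts <= ts (inclusive, like BETWEEN)."""
    window_start = ts - timedelta(days=7)
    return sum(
        1
        for uid, click_ts in clicks
        if uid == user_id and window_start <= click_ts <= ts
    )


clicks = [
    ("u1", datetime(2024, 1, 1, 12, 0)),
    ("u1", datetime(2024, 1, 5, 9, 30)),
    ("u1", datetime(2024, 1, 9, 8, 0)),   # after as_of below, so excluded
    ("u2", datetime(2024, 1, 4, 10, 0)),  # other user, so excluded
]
as_of = datetime(2024, 1, 8, 0, 0)
count = user_clicks_last_7d(clicks, "u1", as_of)  # 2
```

Because ts is an explicit argument rather than an implicit "now", the same function can be evaluated at a historical timestamp for training and at the request time for serving.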

2. What a feature store gives you

One definition for a feature → used in both training and serving.
Offline store (warehouse) for training; bulk historical reads.
Online store (low-latency KV) for serving; single-key fast reads.
Materialisation — pipeline copies pre-computed features from offline → online.
Discovery — UI / catalog of available features.
Lineage — feature → upstream tables → owners.
Monitoring — drift, freshness, distribution.

It's data infrastructure with an ML flavour.

3. Offline vs online stores

Layer	Tech	Latency	Use
Offline	Parquet on S3, Snowflake, BigQuery, Delta	seconds-minutes	Training, backfill, batch eval
Online	Redis, DynamoDB, Cassandra, ScyllaDB	< 10 ms	Real-time inference

The store must guarantee: the value training reads for feature(user=42, time=t) equals the value serving returned for feature(user=42, time=t) when t was "now".
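The offline/online split and the consistency contract can be illustrated with a toy in-memory store. This is a sketch under stated assumptions: ToyFeatureStore and its method names are hypothetical, integer timestamps stand in for real ones, and a dict stands in for both the warehouse and the KV store.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class ToyFeatureStore:
    """In-memory stand-in: offline keeps full history, online keeps one value per key."""

    def __init__(self) -> None:
        # entity_id -> sorted list of (event_ts, value); plays the role of the warehouse
        self._offline: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
        # entity_id -> (event_ts, value); plays the role of the low-latency KV store
        self._online: Dict[str, Tuple[int, float]] = {}

    def write_offline(self, entity_id: str, event_ts: int, value: float) -> None:
        rows = self._offline[entity_id]
        rows.append((event_ts, value))
        rows.sort()

    def materialise(self, as_of_ts: int) -> None:
        """Copy the latest offline value at or before as_of_ts into the online store."""
        for entity_id in self._offline:
            row = self._latest_at(entity_id, as_of_ts)
            if row is not None:
                self._online[entity_id] = row

    def get_online(self, entity_id: str) -> Optional[float]:
        row = self._online.get(entity_id)
        return None if row is None else row[1]

    def get_historical(self, entity_id: str, ts: int) -> Optional[float]:
        row = self._latest_at(entity_id, ts)
        return None if row is None else row[1]

    def _latest_at(self, entity_id: str, ts: int) -> Optional[Tuple[int, float]]:
        rows = self._offline.get(entity_id, [])
        # (ts, +inf) sorts after every real row with event_ts <= ts
        idx = bisect_right(rows, (ts, float("inf")))
        return rows[idx - 1] if idx > 0 else None


store = ToyFeatureStore()
store.write_offline("user_42", 100, 3.0)
store.write_offline("user_42", 200, 5.0)
store.materialise(as_of_ts=150)
served = store.get_online("user_42")            # what serving saw at t=150
trained = store.get_historical("user_42", 150)  # what training reads for t=150
```

Both reads go through the same _latest_at lookup, which is the point: the contract holds because there is one code path, not because two implementations happen to agree.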

4. Training/serving skew — the bug class

Symptoms:

Model AUC=0.85 offline, behaves like AUC=0.65 in production.
Bugs manifest months later, when a feature pipeline silently drifts.

Causes:

Feature computed differently in train vs serve (different SQL, different timezone, different null handling).
Training read the feature from a snapshot table that lands 24h late, so it contains data that was not yet available at prediction time (look-ahead bias), while serving reads live values.
Online feature has different freshness than training timestamps assumed.

Fixes (all of them, layered):

Single source of truth for feature logic (feature store).
Point-in-time joins (next section).
Log live features → reuse for retraining.
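One way to make the "log live features" fix actionable is to compare the logged serving values against values recomputed offline for the same keys. The sketch below assumes both sides are available as dictionaries keyed by (entity_id, request_ts); skew_report and that key shape are hypothetical names chosen for illustration.

```python
import math
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, int]  # (entity_id, request_ts); hypothetical log key


def skew_report(
    served: Dict[Key, Optional[float]],
    recomputed: Dict[Key, Optional[float]],
    rel_tol: float = 1e-6,
) -> Tuple[float, List[Key]]:
    """Return (mismatch_rate, mismatching_keys) over keys present in both inputs."""
    shared = sorted(set(served) & set(recomputed))
    if not shared:
        return 0.0, []
    mismatches: List[Key] = []
    for key in shared:
        a, b = served[key], recomputed[key]
        if a is None and b is None:
            continue  # both sides agree the value is missing
        if a is None or b is None:
            mismatches.append(key)  # null handling differs between the two paths
        elif not math.isclose(a, b, rel_tol=rel_tol, abs_tol=1e-9):
            mismatches.append(key)
    return len(mismatches) / len(shared), mismatches


served_log = {("u1", 100): 3.0, ("u2", 100): None, ("u3", 100): 7.0}
offline_values = {("u1", 100): 3.0, ("u2", 100): 0.0, ("u3", 100): 7.0, ("u4", 100): 1.0}
rate, bad_keys = skew_report(served_log, offline_values)
```

Here the only mismatch is u2, where serving returned a null and the offline path returned 0.0, which is the "different null handling" cause listed above.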

5. Point-in-time correctness

This is the hard part. When you build training data, for each row (entity, label_ts) you want the features as they were at label_ts — not now.

-- WRONG: ignores label_ts, so it picks up feature values written after the label
SELECT label.*, feat.value
FROM labels AS label
JOIN features AS feat
  ON label.user_id = feat.user_id

-- RIGHT: temporal (as-of) join, point-in-time; DuckDB-style syntax
SELECT label.*, feat.value
FROM labels AS label
ASOF JOIN features AS feat
  ON label.user_id = feat.user_id
  AND label.label_ts >= feat.event_ts
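Without point-in-time joins you leak the future into the past and your offline metrics are fiction. Support varies by tool: Polars (join_asof), pandas (merge_asof), DuckDB and Snowflake (ASOF JOIN) and Feast (get_historical_features) provide it directly, each with its own syntax; in engines without a native as-of join, the usual workaround is a window-function or range-join pattern.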
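The same as-of logic can be sketched in plain Python with a binary search per label row. The tuple layouts and the point_in_time_join name are assumptions for illustration. Unlike an inner as-of join, this version keeps labels that have no earlier feature value and fills them with None.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Tuple

FeatureRow = Tuple[str, int, float]  # (user_id, event_ts, value)
LabelRow = Tuple[str, int, int]      # (user_id, label_ts, label)
JoinedRow = Tuple[str, int, int, Optional[float]]


def point_in_time_join(labels: List[LabelRow], features: List[FeatureRow]) -> List[JoinedRow]:
    """For each label, attach the latest feature value with event_ts <= label_ts."""
    grouped: Dict[str, List[Tuple[int, float]]] = {}
    for user_id, event_ts, value in features:
        grouped.setdefault(user_id, []).append((event_ts, value))

    # Per user: parallel, time-sorted lists of timestamps and values
    index: Dict[str, Tuple[List[int], List[float]]] = {}
    for user_id, rows in grouped.items():
        rows.sort()
        index[user_id] = ([ts for ts, _ in rows], [v for _, v in rows])

    joined: List[JoinedRow] = []
    for user_id, label_ts, label in labels:
        timestamps, values = index.get(user_id, ([], []))
        idx = bisect_right(timestamps, label_ts)  # number of rows with event_ts <= label_ts
        value = values[idx - 1] if idx > 0 else None
        joined.append((user_id, label_ts, label, value))
    return joined


features = [("u1", 10, 1.0), ("u1", 20, 2.0), ("u1", 30, 3.0)]
labels = [("u1", 25, 1), ("u1", 5, 0), ("u2", 25, 1)]
training_rows = point_in_time_join(labels, features)
```

The label at ts=25 gets the value written at ts=20, not the later value from ts=30, which is exactly the leak the naive join would introduce.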

6. Streaming features

The features that drove the field forward:

clicks_last_5min(user) — recency matters.
merchant_velocity_last_hour(merchant) — fraud detection.
session_pages_so_far(session) — recommendation context.

Compute via:

Flink / Spark Structured Streaming → write to online store every N seconds.
Materialised in Redis with TTL.
Reads at serve = single Redis GET.

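Trade-offs: freshness vs query simplicity vs cost. Fresher values need more frequent compute and writes, while precomputing keeps the serve-time read trivial.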
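A sliding-window count such as clicks_last_5min can be sketched in-process with a deque per key. In a real deployment this state would live inside the stream processor and the result would be written to the online store; SlidingWindowCounter is a hypothetical name, and the sketch assumes events and reads arrive in non-decreasing time order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowCounter:
    """Counts events per key over the last window_seconds."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[int]] = defaultdict(deque)

    def add(self, key: str, event_ts: int) -> None:
        """Record one event; assumes event_ts is non-decreasing per key."""
        self._events[key].append(event_ts)

    def count(self, key: str, now_ts: int) -> int:
        """Events with now_ts - window_seconds < event_ts, evicting older ones."""
        events = self._events.get(key)
        if not events:
            return 0
        cutoff = now_ts - self.window_seconds
        while events and events[0] <= cutoff:
            events.popleft()  # expired events are dropped for good
        return len(events)


counter = SlidingWindowCounter(window_seconds=300)
for click_ts in (0, 100, 290):
    counter.add("u1", click_ts)

at_300 = counter.count("u1", now_ts=300)  # the click at ts=0 has just expired
at_400 = counter.count("u1", now_ts=400)  # only the click at ts=290 remains
```

Eviction on read keeps memory bounded by the number of events inside the window, at the price of requiring reads to move forward in time.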

7. Feature freshness budget

Per feature, declare:

Staleness SLA — "this feature must be < N seconds old".
Alert when freshness exceeds SLA.

Production pattern:

Real-time fraud features: < 5 s.
Recommendation features: < 60 s.
Personalisation embeddings: < 1 day.
Demographic / static: weekly is fine.

Pay for compute matching the SLA. Don't run sub-second updates on weekly features.
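A freshness budget can be expressed as data and checked mechanically. The sketch below uses the SLA tiers listed above; FreshnessSLA, stale_features and the user_country feature name are hypothetical, and timestamps are plain epoch seconds.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FreshnessSLA:
    feature: str
    max_staleness_s: float


def stale_features(
    slas: List[FreshnessSLA],
    last_updated_ts: Dict[str, float],
    now_ts: float,
) -> List[str]:
    """Names of features whose last update is older than their SLA, or missing entirely."""
    breaches: List[str] = []
    for sla in slas:
        updated = last_updated_ts.get(sla.feature)
        if updated is None or now_ts - updated > sla.max_staleness_s:
            breaches.append(sla.feature)
    return breaches


slas = [
    FreshnessSLA("merchant_velocity_last_hour", 5),   # real-time fraud tier
    FreshnessSLA("session_pages_so_far", 60),         # recommendation tier
    FreshnessSLA("item_emb_v3", 24 * 3600),           # embedding tier
    FreshnessSLA("user_country", 7 * 24 * 3600),      # static tier (hypothetical feature)
]
now = 1_000_000.0
last_updated = {
    "merchant_velocity_last_hour": now - 12,  # 12 s old, over the 5 s budget
    "session_pages_so_far": now - 30,
    "item_emb_v3": now - 3600,
    # user_country has never been written
}
breaches = stale_features(slas, last_updated, now)
```

Treating a never-written feature as a breach is a design choice in this sketch; it surfaces pipelines that were declared but never started.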

8. Backfill

When you add a new feature you need historical values to train on. Backfill = compute the feature for every row of historical data.

Patterns:

Run a Spark job over all historical raw data.
Use the same code path as the streaming compute (Kappa architecture) — single source of truth.
Idempotent writes; can re-run on failure.
Watch the cost: backfilling 2 years of data can be petabyte-scale, depending on raw event volume.
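The idempotent-write pattern can be sketched as an upsert keyed by (entity_id, day), so re-running a failed job overwrites rows instead of duplicating them. The dict standing in for the offline table, the backfill function and the daily_clicks feature are all hypothetical.

```python
from datetime import date, timedelta
from typing import Callable, Dict, Iterator, Tuple

FeatureTable = Dict[Tuple[str, date], float]  # keyed by (entity_id, day)


def days_between(start: date, end: date) -> Iterator[date]:
    """Yield every day from start to end, inclusive."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def backfill(
    table: FeatureTable,
    compute_day: Callable[[date], Dict[str, float]],
    start: date,
    end: date,
) -> int:
    """Recompute one partition per day and upsert it; returns rows written."""
    written = 0
    for day in days_between(start, end):
        for entity_id, value in compute_day(day).items():
            table[(entity_id, day)] = value  # overwrite by key, never append
            written += 1
    return written


# Hypothetical raw data: user ids of clicks, grouped by day
RAW_CLICKS = {
    date(2024, 1, 1): ["u1", "u1", "u2"],
    date(2024, 1, 2): ["u1"],
}


def daily_clicks(day: date) -> Dict[str, float]:
    counts: Dict[str, float] = {}
    for user_id in RAW_CLICKS.get(day, []):
        counts[user_id] = counts.get(user_id, 0.0) + 1.0
    return counts


table: FeatureTable = {}
first_run = backfill(table, daily_clicks, date(2024, 1, 1), date(2024, 1, 3))
snapshot = dict(table)
second_run = backfill(table, daily_clicks, date(2024, 1, 1), date(2024, 1, 3))
```

After the second run the table is identical to the snapshot, which is the property that makes retries safe. Passing compute_day in as a function also leaves room to reuse the same feature logic as the streaming path.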

9. Embedding features

Embeddings (S17) are a special kind of feature:

High-dim numeric vectors.
Stored in vector DB or as bytes in KV store.
Updated when the model that produced them is retrained.
Need version tracking (item_emb_v3).

Versioning is critical. Serving an old model with a new embedding format = silent garbage.
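One way to turn a version mismatch into a loud failure is to make the version part of the storage key and have the model declare what it expects. The sketch below is illustrative only: VersionedEmbeddingStore and EmbeddingVersionError are hypothetical names, and a dict stands in for the vector DB or KV store.

```python
from typing import Dict, List, Sequence, Tuple


class EmbeddingVersionError(KeyError):
    """Raised when the requested embedding version is not stored for an entity."""


class VersionedEmbeddingStore:
    def __init__(self) -> None:
        # (name, version, entity_id) -> vector
        self._data: Dict[Tuple[str, str, str], List[float]] = {}

    def put(self, name: str, version: str, entity_id: str, vector: Sequence[float]) -> None:
        self._data[(name, version, entity_id)] = list(vector)

    def get(self, name: str, version: str, entity_id: str, expected_dim: int) -> List[float]:
        """Fetch exactly the version the model was trained against, or fail."""
        key = (name, version, entity_id)
        if key not in self._data:
            raise EmbeddingVersionError(f"no {name}_{version} embedding for {entity_id}")
        vector = self._data[key]
        if len(vector) != expected_dim:
            raise ValueError(
                f"{name}_{version} has dim {len(vector)}, model expects {expected_dim}"
            )
        return vector


emb_store = VersionedEmbeddingStore()
emb_store.put("item_emb", "v3", "item_1", [0.1, 0.2, 0.3])
vector = emb_store.get("item_emb", "v3", "item_1", expected_dim=3)
```

A model pinned to v2 that asks this store for item_1 gets an exception rather than a v3 vector, so the failure shows up in error rates instead of in degraded predictions.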

10. Feature governance

Owner — engineer responsible.
Description — what does it actually mean.
Source — upstream tables / streams.
Tags — sensitive (PII), expensive, slow-to-refresh.
Usage — which models consume it; auto-discovered.
Quality stats — null rate, distribution, drift score.

Same hygiene as data governance (S32). A feature without an owner is a feature you'll delete during the next outage.
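The quality stats in the list above can be computed with very little code. The sketch below shows a null rate and a population stability index (PSI) as one possible drift score; PSI is a choice made here for illustration, not something the session prescribes. Both samples are assumed to be non-empty, and the bin count and smoothing constant are arbitrary defaults.

```python
import math
from typing import List, Optional, Sequence


def null_rate(values: Sequence[Optional[float]]) -> float:
    """Fraction of missing values; 0.0 for an empty sample."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population stability index using equal-width bins over the expected range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 when all expected values are equal

    def shares(sample: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values to edge bins
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(sample), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]       # roughly uniform on [0, 1)
shifted = [0.5 + i / 200 for i in range(100)]  # squeezed into the upper half
missing = null_rate([1.0, None, 3.0, None])    # 0.5
no_drift = psi(baseline, baseline)             # 0.0
drift = psi(baseline, shifted)                 # positive; larger means more drift
```

Every PSI term is non-negative, so the score is zero only when the binned distributions match. Alert thresholds are a per-feature judgment call.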

11. The build/buy decision

Feature store tools:

Feast — open source; you bring the infra. Most popular OSS.
Tecton — SaaS; biggest commercial; richest features.
SageMaker Feature Store — AWS-native.
Vertex AI Feature Store — GCP-native.
Databricks Feature Engineering — built into Lakehouse.
Build your own — fine for a single ML team; quickly painful for multi-team.

Rule of thumb:

1–2 ML engineers, 1–3 models: skip the feature store. Use Spark + Redis.
5+ ML engineers, 10+ models: feature store is mandatory.

12. Reality check

A first-feature-store rollout:

Audit current features — find the duplicates.
Move top-10 highest-traffic features into Feast (or chosen platform).
Implement point-in-time training data generation.
Wire monitoring: freshness, null rate, drift.
Deprecate the duplicates; force-migrate model owners.

Plan for 3 months minimum. The benefit lands as the second and third model consume the same definitions — that's when you stop reinventing.

Reading material

Books:

Designing Machine Learning Systems — Chip Huyen (the feature engineering + feature stores chapters; the canonical reference)
Feature Engineering for Machine Learning — Alice Zheng & Amanda Casari (O'Reilly; the algorithm-by-algorithm cookbook)
Feature Store for Machine Learning — Jayanth Kumar M J (Packt; the implementation-focused book)
Reliable Machine Learning — Cathy Chen, Niall Murphy et al. (the Google/Microsoft SREs-of-ML book; the training/serving skew chapter)

Papers:

Hidden Technical Debt in Machine Learning Systems — Sculley et al. 2015 (Google, NeurIPS) — the paper that framed data dependencies, glue code and pipeline jungles as the hidden maintenance cost of ML systems.
Feature Stores: The Data Side of ML Pipelines — Stoyanovich et al. 2022 — the academic survey.

Official docs:

Feast documentation — the open-source feature store.
Tecton documentation — the commercial feature platform from the Uber Michelangelo team.
Databricks Feature Engineering docs — the lakehouse-native feature store.
AWS SageMaker Feature Store — the AWS take.
Google Vertex AI Feature Store — the GCP take.

Blog posts:

Uber Engineering — Michelangelo: Uber's Machine Learning Platform — the OG feature store; the post everyone copies.
Airbnb Engineering — Zipline: Airbnb's Machine Learning Data Management Platform — point-in-time correctness done right.
Eugene Yan — Feature Stores: A Hierarchy of Needs — the modern practitioner essay; the right reading after the book.
Chip Huyen — Real-time machine learning: challenges and solutions — the canonical online-features essay.
Tecton — Time travel in ML feature stores — the deepest practical write-up on point-in-time correctness.

In-depth research material

Feast — github.com/feast-dev/feast — ~5.8k ★, the open-source feature store; the reference implementation.
Hopsworks — github.com/logicalclocks/hopsworks — ~1.2k ★, the OG end-to-end feature store + lakehouse.
Featureform — github.com/featureform/featureform — ~1.8k ★, the virtual feature store (works on top of your existing warehouse).
Chronon — github.com/airbnb/chronon — Airbnb's open-sourced version of Zipline.
DoorDash — Building a Real-time Predictive Maintenance Platform — the gigascale feature serving story.
Spotify Engineering — Modeling Engineering Productivity — feature store as platform.
Lyft Engineering — Lyft's Feature Generation Platform — the cost + scale story.
Robinhood Engineering — Building a feature store at Robinhood — fintech low-latency feature serving.
Pinterest Engineering — Scaling Feature Engineering at Pinterest — petabyte-scale feature pipelines.
Materialize — Streaming feature stores with SQL — the streaming-SQL angle on real-time features.
Tecton blog — Online and offline consistency — the production playbook.

Videos

Feature Stores Explained — Made With ML — Goku Mohandas · 22 min — the cleanest 22-minute primer.
The Hierarchy of Needs for ML Feature Stores — Eugene Yan — 41 min — the talk version of his essay.
Building a feature store at Tecton — Mike Del Balso (Tecton CEO, ex-Uber Michelangelo) — 32 min — from the person who built the original at Uber.
Real-time machine learning at scale — Chip Huyen — 47 min — the streaming-features deep dive.
Feast: The Open-Source Feature Store — Willem Pienaar (Feast creator) · 35 min — from the project lead.

LeetCode — Subarray Sum Equals K

Link: https://leetcode.com/problems/subarray-sum-equals-k/
Difficulty: Medium
Why this problem: Aggregations over a rolling window with hash-map memoisation — the algorithmic heart of "events in last X minutes" features.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Define training/serving skew and 3 root causes.
Explain point-in-time joins and write the ASOF join syntax.
Sketch the offline/online architecture of a feature store.
Pick a freshness SLA appropriate to the use case.
Decide when a team needs a feature store vs Spark + Redis.
Solve subarray-sum-equals-k — prefix-sum + hashmap, the same shape as streaming-window aggregations.

