ai · ml · intermediate · 12m · 2026-06-09

Recommender Systems — Two-Tower, Multi-Stage Ranking

Session 27 of the 48-session learning series.

Why this session matters

This is Session 27 of 48 in the ML track. Modern recsys is a 4-stage funnel — candidate retrieval, filtering, ranking, re-ranking — and almost every consumer surface (YouTube, TikTok, Pinterest, Spotify) is some flavour of this pipeline. Understanding the shape is more valuable than memorising any single model.

Agenda

Why CF and matrix factorisation aren't enough at scale
The two-tower architecture — query encoder + item encoder + ANN
Multi-stage funnel — retrieve → filter → rank → re-rank
Loss functions — pointwise, pairwise, listwise; sampled softmax
Evaluation — offline (NDCG, recall@K) and online (CTR, dwell, A/B)

Deep dive

1. The recsys problem

Given a user + context (time, location, recent activity), produce an ordered list of N items that maximises some business objective. The catalog holds millions to billions of items, the end-to-end latency budget is under 200 ms, and the number of candidates scored per second can reach the millions.

You cannot score every (user, item) pair. You need a funnel.

2. The 4-stage funnel

[ ~1B items ]
    │  retrieve (recall focus, very cheap)
    ▼
[ ~1000 candidates ]
    │  filter (eligibility, business rules, blocked content)
    ▼
[ ~200 candidates ]
    │  rank (deep model, expensive)
    ▼
[ ~50 ranked ]
    │  re-rank (diversity, freshness, exploration, business)
    ▼
[ Top 10 shown ]

Each stage is a different optimisation problem with different latency/cost/quality budget. Don't conflate them.
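
The funnel above can be sketched as four composed functions. This is a toy illustration, not a production design: the function names, the item fields (`retrieval_score`, `rank_score`, `blocked`, `creator`) and the one-item-per-creator rule are all hypothetical stand-ins for real retrieval, ranking models and business rules.

```python
from typing import Dict, List

Item = Dict[str, object]  # hypothetical item record


def retrieve(catalog: List[Item], k: int) -> List[Item]:
    # Cheap and recall-oriented. A precomputed toy score stands in for ANN search.
    return sorted(catalog, key=lambda it: it["retrieval_score"], reverse=True)[:k]


def filter_eligible(cands: List[Item]) -> List[Item]:
    # Eligibility and business rules: drop blocked items.
    return [it for it in cands if not it["blocked"]]


def rank(cands: List[Item], k: int) -> List[Item]:
    # An expensive model in production; a toy score field here.
    return sorted(cands, key=lambda it: it["rank_score"], reverse=True)[:k]


def rerank(cands: List[Item], k: int) -> List[Item]:
    # Toy diversity rule: keep at most one item per creator, in rank order.
    seen, out = set(), []
    for it in cands:
        if it["creator"] not in seen:
            seen.add(it["creator"])
            out.append(it)
    return out[:k]


def recommend(catalog: List[Item], n_retrieve: int = 1000,
              n_rank: int = 50, n_show: int = 10) -> List[Item]:
    cands = retrieve(catalog, n_retrieve)
    cands = filter_eligible(cands)
    cands = rank(cands, n_rank)
    return rerank(cands, n_show)
```

Keeping each stage behind its own function makes it easier to give every stage its own latency budget and its own metric.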

3. Retrieval — the two-tower architecture

[ user features ] ──► [ Query Tower DNN ] ──► query embedding u
                                                  │
                                          cos sim │ → ANN over items
                                                  │
[ item features ] ──► [ Item Tower DNN ] ──► item embedding v

Train so that cos(u, v) is high for (user, item) pairs where the user engaged with the item, and low otherwise. At serve time:

Pre-compute all item embeddings, push to ANN index (Faiss, ScaNN, vector DB).
Online: encode user; ANN-search top-K items.
Latency ~5–20 ms for 1B items.

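The two towers don't share weights. The similarity score (a dot product, which equals cosine similarity once both embeddings are L2-normalised) is the only point where they meet. That is what makes it ANN-able: item embeddings can be indexed ahead of time without knowing the user.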
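
A minimal sketch of the serve path, under loud assumptions: each "tower" is a single hypothetical linear layer instead of a DNN, and an exact brute-force top-K search stands in for the ANN index (a library such as Faiss or ScaNN would approximate that step). All function names are illustrative.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def l2_normalise(v: Sequence[float]) -> Vector:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def linear_tower(features: Sequence[float], weights: Sequence[Sequence[float]]) -> Vector:
    # Stand-in for a tower DNN: one linear layer, then L2 normalisation,
    # so the dot product of two outputs equals their cosine similarity.
    out = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return l2_normalise(out)


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def build_index(item_features: Dict[str, Sequence[float]],
                item_weights: Sequence[Sequence[float]]) -> Dict[str, Vector]:
    # Offline step: pre-compute every item embedding once.
    return {item_id: linear_tower(f, item_weights) for item_id, f in item_features.items()}


def search(index: Dict[str, Vector], u: Vector, k: int) -> List[Tuple[str, float]]:
    # Exact top-K by dot product. An ANN library approximates this step.
    scored = ((item_id, dot(u, v)) for item_id, v in index.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])
```

Only `linear_tower` for the user runs per request. The item side is computed offline, which is possible only because the two towers never see each other's inputs.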

4. Sampled softmax — making it tractable

Naive softmax over a billion items is impossible. Use:

In-batch negatives — for a batch of B (user, item) pairs, use the other B-1 items in the batch as negatives. Free, but popular items appear as negatives more often, so the model over-penalises them.
Negative mining — sample hard negatives (similar but unclicked). Better gradient, more compute.
Log-Q correction — corrects the in-batch popularity bias by subtracting the log of each item's sampling probability from its logit during training: score -= log(P(item)). YouTube uses this.
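
The in-batch softmax with log-Q correction can be sketched in plain Python as follows. The function name and argument layout are assumptions for illustration; `item_sampling_prob[j]` is taken to be an estimate of how likely item j is to appear in a random batch, which a real system would have to estimate from the training stream.

```python
import math
from typing import Sequence


def in_batch_softmax_loss(
    user_embs: Sequence[Sequence[float]],
    item_embs: Sequence[Sequence[float]],
    item_sampling_prob: Sequence[float],
) -> float:
    """Mean cross-entropy where row i's positive is item i and the other
    items in the batch act as negatives, with log-Q correction applied."""
    batch = len(user_embs)
    total = 0.0
    for i in range(batch):
        # Corrected logit: s(u_i, v_j) - log(p_j).
        logits = [
            sum(a * b for a, b in zip(user_embs[i], item_embs[j]))
            - math.log(item_sampling_prob[j])
            for j in range(batch)
        ]
        # Numerically stable log-softmax of the positive (diagonal) entry.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / batch
```

The correction is a training-time adjustment. At serve time the uncorrected similarity is typically what gets sent to the ANN index.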

5. Multi-stage ranker

After retrieval and filtering, you have a few hundred candidates (~200 in the funnel above). Now you can afford a heavier model. Inputs include rich features:

User features — age, country, language, tenure, recent watch history embedding.
Item features — topic, language, age, view count, creator, embedding.
Cross features — user-item interaction history, same-creator engagement.
Context — device, time, session length, what they just saw.

Common ranker architectures:

Wide & Deep (Google) — wide linear + deep DNN.
DCN / DCN-V2 (Google) — cross network for explicit feature interactions.
DLRM (Meta) — embedding tables + DNN, scales to TBs of params.
Transformer-based — recent shift; user history as sequence.

6. Loss functions

Pointwise — predict P(click | user, item). Treat each example independently. BCE loss. Simple but loses ranking signal.
Pairwise — for each (item_pos, item_neg) pair, train so that score(pos) > score(neg). Margin loss, RankNet. Good ranking signal.
Listwise — score the whole list; list-level losses such as ListNet, or NDCG-aware ones such as LambdaRank. Hardest, often best.

In practice: train a multi-task model (CTR + dwell + share) with a pointwise loss per head (BCE for binary targets such as click or share), then combine the heads at serve time with learned or hand-tuned weights.
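
To make the pointwise and pairwise cases concrete, here is a small sketch of a per-example BCE loss, a RankNet-style pairwise logistic loss, and a serve-time blend of task heads. The head names and weights in any call to `combine_heads` are hypothetical, and the functions are written for readability rather than numerical robustness at extreme logits.

```python
import math
from typing import Dict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bce_loss(logit: float, label: int) -> float:
    # Pointwise: -[y log p + (1 - y) log(1 - p)] with p = sigmoid(logit).
    p = sigmoid(logit)
    eps = 1e-12
    return -(label * math.log(p + eps) + (1 - label) * math.log(1.0 - p + eps))


def ranknet_loss(score_pos: float, score_neg: float) -> float:
    # Pairwise logistic loss: log(1 + exp(-(s_pos - s_neg))).
    return math.log1p(math.exp(-(score_pos - score_neg)))


def combine_heads(probs: Dict[str, float], weights: Dict[str, float]) -> float:
    # Serve-time blend of per-task predictions into one ranking score.
    return sum(weights[name] * probs[name] for name in weights)
```

The pairwise loss depends only on the score difference, so it says nothing about calibration. That is one reason per-head pointwise losses are convenient when the heads are later blended by weight.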

7. Re-ranking — the often-skipped stage

The top-50 from the ranker is sorted by predicted relevance. But:

All 5 top picks might be from the same creator → boring feed.
All 5 might be 3 days old → stale.
All 5 might be ad-adjacent → ads tank.

Re-ranking applies:

Diversity (MMR, DPP) — penalise similarity to already-picked items.
Freshness boost — exponential decay on item age.
Exploration — ε-greedy injection of low-confidence items to gather data.
Business rules — slot ads, demote restricted content, surface promoted creators.
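
Two of the re-ranking steps above can be sketched briefly: a greedy MMR pass for diversity and an exponential freshness decay. The trade-off parameter `lam`, the `half_life_hours` default and the similarity function passed in are illustrative assumptions, not tuned values.

```python
from typing import Callable, Dict, List


def mmr_rerank(
    relevance: Dict[str, float],
    similarity: Callable[[str, str], float],
    k: int,
    lam: float = 0.7,
) -> List[str]:
    """Greedy MMR: at each step pick the item maximising
    lam * relevance - (1 - lam) * max similarity to already-picked items."""
    remaining = set(relevance)
    picked: List[str] = []
    while remaining and len(picked) < k:
        def mmr_score(item: str) -> float:
            penalty = max((similarity(item, p) for p in picked), default=0.0)
            return lam * relevance[item] - (1.0 - lam) * penalty
        # Iterate in sorted order so ties resolve deterministically.
        best = max(sorted(remaining), key=mmr_score)
        picked.append(best)
        remaining.remove(best)
    return picked


def freshness_boost(age_hours: float, half_life_hours: float = 24.0) -> float:
    # Exponential decay: the boost halves every half_life_hours.
    return 0.5 ** (age_hours / half_life_hours)
```

With `lam = 1.0` the pass reduces to sorting by relevance. Lower values trade relevance for diversity.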

8. The cold-start nightmare

Three flavours:

New user — no history. Solve with: onboarding survey, demographic priors, item popularity, contextual bandits.
New item — no engagement. Solve with: content-based features (text, image), creator priors, exploration slots in the feed.
New platform — no anything. Solve with: editorial picks, popularity, manual ranking.

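A two-tower model handles new items well if the item tower uses only content features (no item ID): a brand-new item then gets a usable embedding from its text or image features alone. Pure ID-embedding towers can't generalise to items they never saw in training.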

9. Evaluation

Offline:

Recall@K for retrieval (did we get the clicked item in top-K?).
NDCG@K for ranking (positional discount).
MAP, MRR for binary relevance.
Hit rate — quick sanity.
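
The offline metrics above are short enough to write out. This sketch assumes binary relevance and a single ranked list per user; the function names are illustrative, and a real evaluation would average these values over many users.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    # Binary gains; position i (0-based) is discounted by log2(i + 2).
    dcg = sum(
        1.0 / math.log2(i + 2)
        for i, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    # 1 / rank of the first relevant item; 0 if none is present.
    for i, item in enumerate(ranked):
        if item in relevant:
            return 1.0 / (i + 1)
    return 0.0
```

Recall@K ignores order inside the top K, which suits retrieval. NDCG@K and reciprocal rank reward putting relevant items early, which suits ranking.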

Watch out: offline metrics often correlate poorly with online results. Always run a shadow deployment and an A/B test before shipping.

Online:

CTR — easy to game (clickbait).
Dwell time — better signal of satisfaction.
Long-term metrics — DAU, retention, time-to-second-action. Slow but real.
Surveys / explicit feedback — gold but small sample.

10. The popularity trap

Most metrics reward recommending popular items. You end up showing everyone the same 100 items, killing long-tail discovery, killing creator economy, killing future training signal.

Mitigations: IPS (inverse propensity scoring) in training, exploration bonuses in re-rank, diversity loss, explicit "discovery" slots.
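
As one possible reading of IPS in training, the sketch below weights each logged example by the inverse of its propensity, meaning the probability that the logging policy showed that item. The clipping threshold and the self-normalisation are common variance-control choices assumed here, not something the session prescribes, and the function name is hypothetical.

```python
import math
from typing import Sequence


def ips_weighted_bce(
    probs: Sequence[float],
    labels: Sequence[int],
    propensities: Sequence[float],
    clip: float = 10.0,
) -> float:
    """BCE where each logged example is weighted by 1 / propensity,
    with the weight clipped to limit variance."""
    eps = 1e-12
    total, weight_sum = 0.0, 0.0
    for p, y, q in zip(probs, labels, propensities):
        w = min(1.0 / max(q, eps), clip)
        loss = -(y * math.log(p + eps) + (1 - y) * math.log(1.0 - p + eps))
        total += w * loss
        weight_sum += w
    # Self-normalised: divide by the sum of weights, not the example count.
    return total / weight_sum if weight_sum > 0 else 0.0
```

Rarely shown items get large weights, so mistakes on the long tail count for more than mistakes on items the old policy showed to everyone.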

11. Feedback loops are everywhere

Train model → recommend → users engage with what's recommended → log → retrain. The model trains on data it caused. This causes:

Filter bubbles.
Stale catalog blind spots.
Self-reinforcing bias.

Mitigate with: held-out random slots, off-policy correction (IPS), occasional batch refresh from broad sources.

12. Reality check

A minimum viable recsys for a startup:

ALS or another matrix-factorisation method as a baseline.
Two-tower with item content features for retrieval.
LightGBM ranker with hand-crafted features for ranking.
Hand-tuned re-rank rules (recency, dedup by author).
A/B test framework with statistical power planning.

You can serve millions of users with this stack. Add neural rankers, sequential models, and embedding tables when the business need is clear.
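
For the power-planning item in the list above, a rough per-arm sample size can be computed with the standard two-proportion z-test approximation. This is a sketch under assumptions: a two-sided test, equal traffic split, a binary metric such as CTR, and a hypothetical function name.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control: float, p_treatment: float,
                    alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate per-arm sample size for a two-sided two-proportion z-test."""
    if p_control == p_treatment:
        raise ValueError("effect size must be non-zero")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p_control + p_treatment) / 2.0
    pooled = math.sqrt(2.0 * p_bar * (1.0 - p_bar))
    unpooled = math.sqrt(
        p_control * (1.0 - p_control) + p_treatment * (1.0 - p_treatment)
    )
    n = (z_alpha * pooled + z_beta * unpooled) ** 2 / (p_control - p_treatment) ** 2
    return math.ceil(n)
```

Halving the detectable lift roughly quadruples the required sample, which is why small CTR changes need long experiments.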

Reading material

Books:

Recommender Systems Handbook, 3rd ed. — Ricci, Rokach, Shapira (the academic reference; deep on ranking + matrix factorization + deep learning recs)
Practical Recommender Systems — Kim Falk (Manning, the practitioner's book; covers cold-start, two-tower, real systems)
Designing Machine Learning Systems — Chip Huyen (the chapters on feature stores + online inference)
Mining of Massive Datasets, 3rd ed. — Leskovec, Rajaraman, Ullman (free; the canonical Stanford CS246 textbook on recommendations + LSH)

Papers:

Deep Neural Networks for YouTube Recommendations — Covington, Adams, Sargin 2016 (RecSys) — the candidate-gen + ranking architecture every modern recsys copies.
Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations — Yi et al. 2019 (RecSys) — the two-tower training tricks paper.
PinSage — Ying et al. 2018 (KDD, Pinterest) — GraphSAGE at Pinterest scale.
Embedding-based Retrieval in Facebook Search — Huang et al. 2020 — EBR at Meta scale.
DLRM: Deep Learning Recommendation Model — Naumov et al. 2019 (Meta) — Meta's open-source CTR model.
Deep Cross Network V2 — Wang et al. 2020 (Google) — a feature-interaction architecture used widely in industry.

Official docs:

TensorFlow Recommenders docs — the canonical two-tower / ranking framework.
Microsoft Recommenders — algorithm catalog — the most comprehensive algorithm catalog with notebooks.
Vertex AI — Vector Search (formerly Matching Engine) docs — the production ANN-serving model.
Feast — Feature Store docs — the open-source feature store standard.

Blog posts:

Eugene Yan — System Design for Discovery (RecSys) — the canonical practitioner essay.
Eugene Yan — Real-time Recommendations — the streaming twist.
Pinterest — PinnerSage — multi-modal user embedding at Pinterest.

In-depth research material

Microsoft Recommenders — github.com/recommenders-team/recommenders — ~20k ★, the most extensive recsys algorithm library + notebooks.
TensorFlow Recommenders — github.com/tensorflow/recommenders — ~2k ★, the canonical two-tower TF library.
RecBole — github.com/RUCAIBox/RecBole — ~4k ★, unified PyTorch framework for 80+ recsys models.
Faiss — github.com/facebookresearch/faiss — ~36k ★, the ANN library every retriever uses.
Netflix — Learning a Personalized Homepage — the bandit-style personalization model.
Instagram — Powered by AI: Instagram Explore Recommendation System — the multi-stage architecture in production.
Spotify — Music Recommendations at Scale with Spark — the historical-but-foundational implicit-feedback ALS story.
Uber Eats — Personalized Recommendations — the marketplace twist with constraints.
RecSys Conference Proceedings (ACM) — the canonical annual conference; the keynotes are gold.
Pinterest — Three Levels of Recommendations — the Pixie real-time graph — random-walk-on-graph retrieval.

Videos

How YouTube Recommendations Actually Work — Stanford CS224W / Jure Leskovec · 1 h 18 min — the canonical academic lecture on the YouTube paper.
Building a Real-time Recommender System — Eugene Yan · 45 min — from streaming feeds to ANN; super practical.
Two-Tower Models for Recommendations — TensorFlow — 32 min — the official TFRS walkthrough of two-towers.
Designing a Recommender System for News (Netflix Research) — 50 min — the production-architecture talk with multi-stage funneling.
Recsys at LinkedIn — Bee-Chung Chen (KDD) — 1 h — the deep look at LinkedIn's recommendation infra from a long-time leader.

LeetCode — Design Twitter

Link: https://leetcode.com/problems/design-twitter/
Difficulty: Medium
Why this problem: Pulling top-K most-recent items across followed users — same shape as merging candidate sources in recsys retrieval.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Draw the 4-stage recsys funnel and explain the latency/cost tradeoff at each stage.
Describe the two-tower architecture and why dot-product separation enables ANN.
Explain sampled softmax + log-Q correction.
Pick between pointwise / pairwise / listwise loss for a given scenario.
List 3 cold-start strategies for new items.
Solve design-twitter — k-way merge of per-user feeds, the recsys retrieval primitive.
