Recommender Systems — Two-Tower, Multi-Stage Ranking
Session 27 of the 48-session learning series.
Why this session matters
This is Session 27 of 48 in the ML track. Modern recsys is a 4-stage funnel — candidate retrieval, filtering, ranking, re-ranking — and almost every consumer surface (YouTube, TikTok, Pinterest, Spotify) is some flavour of this pipeline. Understanding the shape is more valuable than memorising any single model.
Agenda
- Why CF and matrix factorisation aren't enough at scale
- The two-tower architecture — query encoder + item encoder + ANN
- Multi-stage funnel — retrieve → filter → rank → re-rank
- Loss functions — pointwise, pairwise, listwise; sampled softmax
- Evaluation — offline (NDCG, recall@K) and online (CTR, dwell, A/B)
Pre-read (skim before the session)
- YouTube Recommendations DNN (Covington et al., 2016)
- Pinterest — PinSage (KDD 2018)
- Facebook — Embedding-based Retrieval at FB
- Eugene Yan — Real-time recommendations
Deep dive
1. The recsys problem
Given a user + context (time, location, recent activity), produce an ordered list of N items that maximises some business objective. The catalog holds millions to billions of items, the latency budget is under 200 ms, and the number of candidates scored per second across all requests can reach millions.
You cannot score every (user, item) pair. You need a funnel.
2. The 4-stage funnel
[ ~1B items ]
│ retrieve (recall focus, very cheap)
▼
[ ~1000 candidates ]
│ filter (eligibility, business rules, blocked content)
▼
[ ~200 candidates ]
│ rank (deep model, expensive)
▼
[ ~50 ranked ]
│ re-rank (diversity, freshness, exploration, business)
▼
[ Top 10 shown ]
Each stage is a different optimisation problem with its own latency/cost/quality budget. Don't conflate them.
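The funnel can be sketched as a chain of stage functions. This is a minimal illustration rather than a production design: `run_funnel` and its stage callables (`retrieve`, `is_eligible`, `score`, `rerank`) are hypothetical names, and the default sizes simply mirror the diagram above.

```python
from typing import Callable

Item = str


def run_funnel(
    user: dict,
    retrieve: Callable[[dict, int], list[Item]],
    is_eligible: Callable[[dict, Item], bool],
    score: Callable[[dict, Item], float],
    rerank: Callable[[list[Item]], list[Item]],
    n_retrieve: int = 1000,
    n_rank: int = 50,
    n_show: int = 10,
) -> list[Item]:
    # Stage 1: cheap, recall-focused candidate generation.
    candidates = retrieve(user, n_retrieve)
    # Stage 2: eligibility and business-rule filtering.
    eligible = [item for item in candidates if is_eligible(user, item)]
    # Stage 3: expensive scoring, applied only to the survivors.
    ranked = sorted(eligible, key=lambda item: score(user, item), reverse=True)[:n_rank]
    # Stage 4: list-level adjustments (diversity, freshness, ...), then truncate.
    return rerank(ranked)[:n_show]
```

Keeping each stage behind its own function makes it easier to swap one stage (say, the ranker) without touching the others, which matches the advice not to conflate them.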
3. Retrieval — the two-tower architecture
[ user features ] ──► [ Query Tower DNN ] ──► query embedding u
                                                      │
                                             cos sim  │ → ANN over items
                                                      │
[ item features ] ──► [ Item Tower DNN ]  ──► item embedding v
Train so that cos(u, v) is high for (user, item) pairs where the user engaged with the item, and low for pairs where they did not. At serve time:
- Pre-compute all item embeddings, push to ANN index (Faiss, ScaNN, vector DB).
- Online: encode user; ANN-search top-K items.
- Latency: roughly 5–20 ms per query, even at ~1B items, typically with a sharded and compressed index.
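The two towers don't share weights and never see each other's inputs. The similarity score (a dot product, which equals cosine similarity when the embeddings are L2-normalised) is the only meeting point, and that is what makes the model ANN-able.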
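A minimal sketch of the offline/online split, assuming each tower is a single linear layer followed by L2 normalisation (a stand-in for a real DNN) and using exact search where a production system would call an ANN index. All function names and the toy weights are hypothetical.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def tower(features, weights):
    """Stand-in for a tower DNN: one linear layer, then L2 normalisation."""
    projected = [sum(w * x for w, x in zip(row, features)) for row in weights]
    return l2_normalize(projected)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def build_item_index(item_features, item_weights):
    """Offline step: embed every item once and store the vectors."""
    return {item_id: tower(f, item_weights) for item_id, f in item_features.items()}


def retrieve_top_k(user_features, user_weights, item_index, k):
    """Online step: embed the user, then search (exact here, ANN in production)."""
    u = tower(user_features, user_weights)
    ranked = sorted(item_index, key=lambda item_id: dot(u, item_index[item_id]), reverse=True)
    return ranked[:k]


# Toy usage: each tower has its own weight matrix.
user_weights = [[1.0, 0.0], [0.0, 1.0]]
item_weights = [[1.0, 0.0], [0.0, 1.0]]
item_index = build_item_index(
    {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}, item_weights
)
top2 = retrieve_top_k([1.0, 0.0], user_weights, item_index, k=2)  # ["a", "c"]
```

Because the user vector is computed without looking at any item, the item vectors can be indexed ahead of time. A model that mixed user and item features before scoring would lose that property.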
4. Sampled softmax — making it tractable
Naive softmax over a billion items is impossible. Use:
- In-batch negatives — for a batch of B (user, item) pairs, use the other B-1 items in the batch as negatives. Free, but popular items appear as negatives more often, so the model over-penalises them.
- Negative mining — sample hard negatives (similar but unclicked). Better gradient, more compute.
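- Log-Q correction — adjust for in-batch popularity bias by subtracting the log of each item's sampling probability from its logit during training: score -= log(P(item)), where P(item) is the estimated probability that the item appears in a batch. Yi et al. (2019) describe this for YouTube.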
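The in-batch softmax with log-Q correction can be sketched as below. This is an illustrative version under stated assumptions: the function name, the `temperature` parameter, and the way `item_sampling_prob` is supplied (one estimated probability per item in the batch) are hypothetical, and a real trainer would compute this on tensors with autodiff.

```python
import math


def in_batch_softmax_loss(user_embs, item_embs, item_sampling_prob, temperature=1.0):
    """Mean cross-entropy where row i's positive is item i and the other items are negatives."""
    batch = len(user_embs)
    total = 0.0
    for i in range(batch):
        logits = []
        for j in range(batch):
            score = sum(a * b for a, b in zip(user_embs[i], item_embs[j])) / temperature
            # Log-Q correction: subtract log of the item's sampling probability.
            logits.append(score - math.log(item_sampling_prob[j]))
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / batch


users = [[1.0, 0.0], [0.0, 1.0]]
items = [[1.0, 0.0], [0.0, 1.0]]
loss_uniform = in_batch_softmax_loss(users, items, [0.5, 0.5])
```

With uniform sampling probabilities the correction cancels out of the softmax, so it only changes the loss when some items are sampled more often than others.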
5. Multi-stage ranker
After retrieval and filtering, you are down to a few hundred candidates (~200 in the funnel above). Now you can afford a heavier model. Inputs include rich features:
- User features — age, country, language, tenure, recent watch history embedding.
- Item features — topic, language, age, view count, creator, embedding.
- Cross features — user-item interaction history, same-creator engagement.
- Context — device, time, session length, what they just saw.
Common ranker architectures:
- Wide & Deep (Google) — wide linear + deep DNN.
- DCN / DCN-V2 (Google) — cross network for explicit feature interactions.
- DLRM (Meta) — embedding tables + DNN, scales to TBs of params.
- Transformer-based — recent shift; user history as sequence.
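As one concrete example of explicit feature interactions, a single DCN-V2 cross layer computes x_{l+1} = x_0 ⊙ (W x_l + b) + x_l. The sketch below implements that formula on plain lists; the function name and toy values are hypothetical, and a real model would learn `weight` and `bias` and stack several such layers next to a deep network.

```python
def cross_layer(x0, xl, weight, bias):
    """One cross layer: elementwise product of x0 with (W @ xl + b), plus a residual xl."""
    projected = [
        sum(w * x for w, x in zip(row, xl)) + b
        for row, b in zip(weight, bias)
    ]
    return [a * p + x for a, p, x in zip(x0, projected, xl)]


x0 = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
x1 = cross_layer(x0, x0, identity, [0.0, 0.0])  # [2.0, 6.0]
x2 = cross_layer(x0, x1, identity, [0.0, 0.0])
```

Each additional layer multiplies by x0 once more, so stacking layers raises the polynomial degree of the feature interactions the network can represent.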
6. Loss functions
- Pointwise — predict P(click | user, item). Treat each example independently. BCE loss. Simple but loses ranking signal.
- Pairwise — for each (item_pos, item_neg) pair, train so that score(pos) > score(neg). Margin loss, RankNet. Good ranking signal.
- Listwise — score the whole list; NDCG-aware loss (LambdaRank, ListNet). Hardest, often best.
In practice: train a multi-task model (CTR + dwell + share), use pointwise BCE per head, combine at serve with learned or hand-tuned weights.
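The three loss families, plus the serve-time combination of task heads, can be sketched in a few lines. This is a scalar illustration only: the function names, the ListNet top-1 variant chosen for the listwise case, and the example head weights are assumptions, not a prescribed recipe.

```python
import math


def softplus(x):
    # Stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def pointwise_bce(label, logit):
    """Binary cross-entropy on one (user, item) example."""
    return softplus(logit) - label * logit


def pairwise_ranknet(score_pos, score_neg):
    """RankNet-style loss: small when the positive outscores the negative."""
    return softplus(-(score_pos - score_neg))


def log_softmax(values):
    m = max(values)
    log_z = m + math.log(sum(math.exp(v - m) for v in values))
    return [v - log_z for v in values]


def listwise_listnet(labels, scores):
    """ListNet top-1: cross-entropy between softmax(labels) and softmax(scores)."""
    target = [math.exp(lp) for lp in log_softmax(labels)]
    return -sum(t * ls for t, ls in zip(target, log_softmax(scores)))


def serve_score(head_probs, weights):
    """Combine per-task predictions with hand-tuned (hypothetical) weights."""
    return sum(weights[name] * p for name, p in head_probs.items())


combined = serve_score({"ctr": 0.2, "dwell": 0.5}, {"ctr": 1.0, "dwell": 2.0})
```

The pointwise loss looks at one item at a time, the pairwise loss at a score difference, and the listwise loss at the whole score vector, which is why the ranking signal gets stronger in that order.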
7. Re-ranking — the often-skipped stage
The top-50 from the ranker is sorted by predicted relevance. But:
- All 5 top picks might be from the same creator → boring feed.
- All 5 might be 3 days old → stale.
- All 5 might be ad-adjacent → ads tank.
Re-ranking applies:
- Diversity (MMR, DPP) — penalise similarity to already-picked items.
- Freshness boost — exponential decay on item age.
- Exploration — ε-greedy injection of low-confidence items to gather data.
- Business rules — slot ads, demote restricted content, surface promoted creators.
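The diversity step can be illustrated with a greedy MMR (maximal marginal relevance) pass. This is a minimal sketch: `mmr_rerank`, the `relevance` dict, the `similarity` callable, and the trade-off weight `lam` are hypothetical, and the same-creator similarity used in the example is a deliberately crude stand-in for an embedding similarity.

```python
def mmr_rerank(candidates, relevance, similarity, k, lam=0.7):
    """Greedily pick items, trading relevance against similarity to earlier picks."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr_score(item):
            # Penalise by the closest already-selected item (0 if nothing picked yet).
            max_sim = max((similarity(item, s) for s in selected), default=0.0)
            return lam * relevance[item] - (1 - lam) * max_sim

        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected


creator = {"a1": "alice", "a2": "alice", "b1": "bob"}
relevance = {"a1": 0.9, "a2": 0.85, "b1": 0.6}


def same_creator(x, y):
    return 1.0 if creator[x] == creator[y] else 0.0


feed = mmr_rerank(["a1", "a2", "b1"], relevance, same_creator, k=3)  # ["a1", "b1", "a2"]
```

Setting `lam=1.0` recovers the pure relevance order, so the parameter gives a single dial between the ranker's output and a more varied feed.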
8. The cold-start nightmare
Three flavours:
- New user — no history. Solve with: onboarding survey, demographic priors, item popularity, contextual bandits.
- New item — no engagement. Solve with: content-based features (text, image), creator priors, exploration slots in the feed.
- New platform — no anything. Solve with: editorial picks, popularity, manual ranking.
A two-tower model handles new items well if the item tower uses only content features (no item ID). Towers built purely on ID embeddings can't generalise to items they have never seen.
9. Evaluation
Offline:
- Recall@K for retrieval (did we get the clicked item in top-K?).
- NDCG@K for ranking (positional discount).
- MAP, MRR for binary relevance.
- Hit rate — quick sanity.
Watch out: offline metrics often correlate poorly with online results. Always shadow-test, then A/B test, before shipping.
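The offline metrics listed above are short to compute for a single user. The sketch below assumes `ranked` is the model's ordered list of item IDs and that relevance is given either as a set (binary) or as a gain per item. The function names are illustrative, and a real evaluation would average these over many users.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top-k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)


def dcg_at_k(gains, k):
    # Position 0 is discounted by log2(2) = 1, position 1 by log2(3), and so on.
    return sum(g / math.log2(pos + 2) for pos, g in enumerate(gains[:k]))


def ndcg_at_k(ranked, gain_by_item, k):
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    gains = [gain_by_item.get(item, 0.0) for item in ranked]
    ideal = sorted(gain_by_item.values(), reverse=True)
    idcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / idcg if idcg > 0 else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item (0 if none is found)."""
    for pos, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / pos
    return 0.0


ranked = ["a", "b", "c", "d"]
relevant = {"b", "d"}
r2 = recall_at_k(ranked, relevant, k=2)
rr = reciprocal_rank(ranked, relevant)
ndcg = ndcg_at_k(ranked, {"b": 1.0, "d": 1.0}, k=4)
```

Averaging `reciprocal_rank` across users gives MRR. Hit rate is the special case of checking whether `recall_at_k` is non-zero.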
Online:
- CTR — easy to game (clickbait).
- Dwell time — better signal of satisfaction.
- Long-term metrics — DAU, retention, time-to-second-action. Slow but real.
- Surveys / explicit feedback — gold but small sample.
10. The popularity trap
Most metrics reward recommending popular items. You end up showing everyone the same 100 items, killing long-tail discovery, killing the creator economy, killing future training signal.
Mitigations: IPS (inverse propensity scoring) in training, exploration bonuses in re-rank, diversity loss, explicit "discovery" slots.
11. Feedback loops are everywhere
Train model → recommend → users engage with what's recommended → log → retrain. The model trains on data it caused. This causes:
- Filter bubbles.
- Stale catalog blind spots.
- Self-reinforcing bias.
Mitigate with: held-out random slots, off-policy correction (IPS), occasional batch refresh from broad sources.
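Off-policy correction with IPS can be sketched as a reweighted average over logged impressions. The version below assumes each log row records the probability with which the logging policy showed the item (known exactly for held-out random slots). The function name, the log tuple layout, and the weight clip (a common variance-control device) are assumptions for illustration.

```python
def ips_estimate(logs, target_prob, clip=10.0):
    """Estimate the average reward a new policy would get, using logged data only.

    logs: iterable of (context, item, reward, logging_prob) tuples.
    target_prob: callable giving the new policy's probability of showing `item`.
    """
    total = 0.0
    count = 0
    for context, item, reward, logging_prob in logs:
        # Importance weight: how much more (or less) often the new policy shows this item.
        weight = min(target_prob(context, item) / logging_prob, clip)
        total += weight * reward
        count += 1
    return total / count if count else 0.0


# Logged from a uniformly random slot over two items.
logs = [(None, "a", 1.0, 0.5), (None, "b", 0.0, 0.5)]


def always_a(context, item):
    return 1.0 if item == "a" else 0.0


value_always_a = ips_estimate(logs, always_a)  # 1.0
```

The same weights can be applied per example inside a training loss, which is how the correction counteracts the model learning mainly from what it already chose to show.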
12. Reality check
A minimum viable recsys for a startup:
- Matrix factorisation (e.g. ALS) as a baseline.
- Two-tower with item content features for retrieval.
- LightGBM ranker with hand-crafted features for ranking.
- Hand-tuned re-rank rules (recency, dedup by author).
- A/B test framework with statistical power planning.
You can serve millions of users with this stack. Add neural rankers, sequential models, and embedding tables when the business need is clear.
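For the power-planning item above, a rough sample-size estimate for a CTR experiment can come from the standard normal approximation for comparing two proportions. This is a sketch under that assumption only: the function name and default `alpha` / `power` values are illustrative, and it ignores complications such as sequential peeking or per-user correlation.

```python
import math
from statistics import NormalDist


def samples_per_arm(p_control, p_treatment, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect p_control -> p_treatment (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)


n_small_lift = samples_per_arm(0.10, 0.11)
n_large_lift = samples_per_arm(0.10, 0.12)
```

The required sample grows with the inverse square of the lift, so halving the detectable effect roughly quadruples the traffic an experiment needs.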
Reading material
Books:
- Recommender Systems Handbook, 3rd ed. — Ricci, Rokach, Shapira (the academic reference; deep on ranking + matrix factorization + deep learning recs)
- Practical Recommender Systems — Kim Falk (Manning, the practitioner's book; covers cold-start, two-tower, real systems)
- Designing Machine Learning Systems — Chip Huyen (the chapters on feature stores + online inference)
- Mining of Massive Datasets, 3rd ed. — Leskovec, Rajaraman, Ullman (free; the canonical Stanford CS246 textbook on recommendations + LSH)
Papers:
- Deep Neural Networks for YouTube Recommendations — Covington, Adams, Sargin 2016 (RecSys) — the candidate-gen + ranking architecture every modern recsys copies.
- Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations — Yi et al. 2019 (RecSys) — the two-tower training tricks paper.
- PinSage — Ying et al. 2018 (KDD, Pinterest) — GraphSAGE at Pinterest scale.
- Embedding-based Retrieval in Facebook Search — Huang et al. 2020 — EBR at Meta scale.
- DLRM: Deep Learning Recommendation Model — Naumov et al. 2019 (Meta) — Meta's open-source CTR model.
- Deep Cross Network V2 — Wang et al. 2020 (Google) — a feature-interaction architecture used widely in industry.
Official docs:
- TensorFlow Recommenders docs — the canonical two-tower / ranking framework.
- Microsoft Recommenders — algorithm catalog — the most comprehensive algorithm catalog with notebooks.
- Vertex AI — Matching Engine docs — the production ANN-serving model.
- Feast — Feature Store docs — the open-source feature store standard.
Blog posts:
- Eugene Yan — System Design for Discovery (RecSys) — the canonical practitioner essay.
- Eugene Yan — Real-time Recommendations — the streaming twist.
- Pinterest — PinnerSage — multi-modal user embedding at Pinterest.
In-depth research material
- Microsoft Recommenders — github.com/recommenders-team/recommenders — ~20k ★, the most extensive recsys algorithm library + notebooks.
- TensorFlow Recommenders — github.com/tensorflow/recommenders — ~2k ★, the canonical two-tower TF library.
- RecBole — github.com/RUCAIBox/RecBole — ~4k ★, unified PyTorch framework for 80+ recsys models.
- Faiss — github.com/facebookresearch/faiss — ~36k ★, the ANN library every retriever uses.
- Netflix — Learning a Personalized Homepage — the bandit-style personalization model.
- Instagram — Powered by AI: Instagram Explore Recommendation System — the multi-stage architecture in production.
- Spotify — Music Recommendations at Scale with Spark — the historical-but-foundational implicit-feedback ALS story.
- Uber Eats — Personalized Recommendations — the marketplace twist with constraints.
- RecSys Conference Proceedings (ACM) — the canonical annual conference; the keynotes are gold.
- Pinterest — Three Levels of Recommendations — the Pixie real-time graph — random-walk-on-graph retrieval.
Videos
- How YouTube Recommendations Actually Work — Stanford CS224W / Jure Leskovec · 1 h 18 min — the canonical academic lecture on the YouTube paper.
- Building a Real-time Recommender System — Eugene Yan · 45 min — from streaming feeds to ANN; super practical.
- Two-Tower Models for Recommendations — TensorFlow — 32 min — the official TFRS walkthrough of two-towers.
- Designing a Recommender System for News (Netflix Research) — 50 min — the production-architecture talk with multi-stage funneling.
- Recsys at LinkedIn — Bee-Chung Chen (KDD) — 1 h — the deep look at LinkedIn's recommendation infra from a long-time leader.
LeetCode — Design Twitter
- Link: https://leetcode.com/problems/design-twitter/
- Difficulty: Medium
- Why this problem: Pulling top-K most-recent items across followed users — same shape as merging candidate sources in recsys retrieval.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Draw the 4-stage recsys funnel and explain the latency/cost tradeoff at each stage.
- Describe the two-tower architecture and why dot-product separation enables ANN.
- Explain sampled softmax + log-Q correction.
- Pick between pointwise / pairwise / listwise loss for a given scenario.
- List 3 cold-start strategies for new items.
- Solve design-twitter: k-way merge of per-user feeds, the recsys retrieval primitive.