Online Learning, Bandits, Counterfactual Evaluation
Session 43 of the 48-session learning series.
Why this session matters
This is Session 43 of 48 in the ML track. Most ML systems retrain weekly and live with that staleness. The frontier — recsys, ranking, ad bidding, dynamic pricing — needs to learn and adapt within minutes, or to estimate what would have happened if it had decided differently. Bandits and counterfactual eval are the toolkit.
Agenda
- Online learning — incremental training; SGD on a stream
- Multi-armed bandits — ε-greedy, UCB, Thompson sampling
- Contextual bandits — when the right action depends on context
- Counterfactual / off-policy evaluation — IPS, doubly robust
- Practical deployment — exploration cost, safety nets, monitoring
Pre-read (skim before the session)
- John Langford — Bandits & Contextual Bandits course notes
- Thompson Sampling for Contextual Bandits (Chapelle & Li, 2011)
- Inverse Propensity Scoring intro (Bottou et al., 2013)
- Vowpal Wabbit — Contextual bandit tutorial
Deep dive
1. Why most ML is "offline" and what's wrong with that
Typical lifecycle: collect data → train batch model → deploy → repeat next week.
Problems:
- Stale to new content / new users.
- Cold-start for new items takes a full cycle.
- Slow to react to distribution shift (trend, season, breaking news).
Online learning: model updates as data flows. Adaptable, responsive, harder.
2. Online vs incremental vs full retrain
- Full retrain — train from scratch on full historical data. Most accurate; slowest.
- Incremental / warm-start — start from last model; train on new data only. Faster; may drift over time.
- Streaming / online — update parameters per-example or mini-batch as data arrives. Real-time; risks instability.
Most production ML is full + incremental. Streaming is for the cases where minutes-of-freshness actually move the needle.
3. Tools for online learning
- Vowpal Wabbit — VW is the canonical streaming learner; tens of GB/s throughput.
- River — Python streaming-ML lib; clean API, slower.
- TensorFlow Extended (TFX) — supports incremental.
- Custom: SGD with checkpoint + warm-start; works fine for most cases.
4. The multi-armed bandit problem
You have K arms (actions, items, recommendations). Each has unknown reward distribution. Pull arms over time to maximise total reward. Exploration (try arms you're uncertain about) vs exploitation (pick the best one so far).
| Algorithm | How | Pros / Cons |
|---|---|---|
| ε-greedy | best most of the time; ε fraction random | simple; wastes pulls on bad arms |
| UCB | upper confidence bound — pull arm with highest "optimistic" estimate | tight regret bounds; assumes stationary |
| Thompson Sampling | sample reward from posterior; pull arg-max | best empirical perf; Bayesian; easy |
For real systems: Thompson Sampling is the modern default. Easy to extend to contextual.
5. Contextual bandits
Instead of K fixed arms with fixed rewards, you observe a context x and choose an action a; reward r ~ p(r | x, a).
Examples:
- News headline ranking: context = user features; action = headline; reward = click.
- Email subject A/B test: context = recipient features; action = subject variant; reward = open.
- Ad bidding: context = page + user; action = creative; reward = revenue.
Algorithms:
- LinUCB — linear model + UCB.
- Thompson Sampling with linear regression — Bayesian linear with posterior sampling.
- Neural bandits — DNN with output uncertainty estimate.
VW makes most of this 1-liner; do not implement yourself unless you must.
6. The cost of exploration
Every exploration shows a user something the model thinks isn't best. Costs:
- Lost short-term revenue / engagement.
- Bad UX for the user shown the "exploration" pick.
- Bias in your collected data toward exploration arms.
Mitigations:
- Cap exploration rate (e.g. 5% of traffic).
- Apply only to suitable surfaces (not the lead slot).
- Explore more for new items (Bayesian uncertainty), less for known.
- Constrain exploration to a "safe" candidate set.
7. Counterfactual / off-policy evaluation
You logged: state → action → reward, using policy π_0 (current production). You want: estimate the reward of policy π_1 (a new model), without rolling it out.
Inverse Propensity Scoring (IPS):
estimate(π_1) = mean( r * π_1(a|x) / π_0(a|x) )
Weighting each historical reward by how much more (or less) the new policy would have chosen that action. Unbiased if π_0 had support everywhere π_1 has support (no deterministic policies).
Problems:
- High variance for rare actions (large weights blow up).
- Brittle if logged policy was deterministic (π_0(a|x) = 1 for one action, 0 for others; divisions break).
Fixes:
- Clip weights (cap at some max).
- Doubly Robust — combine IPS with a regression-based estimator; robust if either is correct.
- Direct method — pure regression on logged rewards; biased but stable.
8. Logging requirements
To do off-policy evaluation, your production system must log:
- The features / context.
- The action taken.
- The probability the production policy chose that action (
propensity). - The reward observed.
Without propensity, you can't compute IPS — your historical data is useless for counterfactual. Add propensity logging before you need it.
9. Safety and guardrails
When deploying online learners:
- Action whitelist — model can only choose from approved items.
- Reward sanity check — clip extreme rewards (one viral post outlier can dominate).
- Rate limit changes — parameters can only move X% per hour.
- Fall-back model — if online model goes haywire, switch back to last stable batch.
- Shadow mode — run new online learner in shadow; compare predictions before serving.
Most "AI gone wrong" incidents in news come from missing safety nets like these.
10. Non-stationarity
The real world drifts. Trends change, fashion changes, users grow up.
Handling:
- Time-decay — weight recent examples more.
- Sliding window — only train on last X days.
- Concept drift detection — track residual; retrain when distribution shifts.
- Cold start re-exploration — periodically re-explore "dead" arms; user preferences change.
11. A/B vs bandit
When you have 2 fixed options and want statistical significance → A/B test. When you have many options and want to maximise reward while learning → bandit.
A/B gives you a confidence interval on the difference. Bandit gives you the best option without bothering to prove which is which.
12. Reality check
A pragmatic online-learning stack:
- Batch-trained baseline (LightGBM, neural ranker) — gold standard.
- Contextual bandit (VW or Thompson Sampling) on top of baseline predictions — explore variants.
- Always log propensities.
- Off-policy eval with doubly-robust before any new policy ships.
- Guardrails: whitelists, clipping, fallbacks.
Bandits are not a replacement for ML; they are a smart exploration policy on top. Most teams misunderstand this and end up with a worse system than plain offline ML.
Reading material
Books:
- Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto (Ch. 2 on bandits is the canonical reference; the rest is the broader RL context bandits live inside)
- Bandit Algorithms — Lattimore & Szepesvári (the rigorous modern textbook; free PDF online)
- A Tutorial on Thompson Sampling — Russo, Van Roy, Kazerouni, Osband, Wen (the canonical TS tutorial; ~120 pages, freely available)
- Designing Machine Learning Systems — Chip Huyen (the chapters on continual learning + monitoring + experimentation that wrap bandits in production)
Papers:
- A Contextual-Bandit Approach to Personalized News Article Recommendation — Li et al. 2010 (LinUCB) — the Yahoo News paper that put contextual bandits on the map for industry.
- An Empirical Evaluation of Thompson Sampling — Chapelle & Li 2011 — the paper that turned TS from theory curiosity to industry default.
- Doubly Robust Policy Evaluation and Learning — Dudik, Langford, Li 2011 — the canonical IPS + DR counterfactual estimator paper.
- Counterfactual Reasoning and Learning Systems — Bottou et al. 2013 — the Microsoft Bing paper that defined how to ship policies offline-evaluated on logs.
- Practical Lessons from Predicting Clicks on Ads at Facebook — He et al. 2014 — the canonical online-learning paper from a production ranking system.
Official docs:
- Vowpal Wabbit — wiki — the canonical online-learning / contextual-bandit library.
- Vowpal Wabbit — Contextual bandits tutorial — the in-docs walkthrough.
- Microsoft Personalizer (Decision Service) docs — the managed contextual-bandit service from MSR.
- scikit-learn — Online (partial_fit) estimators — list of which sklearn models can learn incrementally.
- River — online ML in Python — the canonical streaming-ML library.
- TensorFlow TFX — Continuous training — the production pipeline TFX uses for continuous model refresh.
Blog posts:
- Eugene Yan — Bandits for personalization — the practitioner introduction.
- Stitch Fix — Multi-Armed Bandits and the Stitch Fix Experimentation Platform — a real production bandit at a recsys company.
- Netflix Tech Blog — Artwork Personalization — the canonical Netflix contextual-bandit story (thumbnail selection).
- Spotify Engineering — Bandit-based recommendation — production posts on Home + Discover ranking bandits.
- Microsoft Research — Counterfactual Learning — long-running blog on offline policy eval at MSR.
- John Langford — Hunch.net contextual bandits posts — long-form essays by the LinUCB co-author.
In-depth research material
- Vowpal Wabbit — github.com/VowpalWabbit/vowpal_wabbit — ~9k ★, the canonical online-learning + bandit library.
- River — github.com/online-ml/river — ~6k ★, modern Python online-ML.
- Recogym — github.com/criteo-research/reco-gym — Criteo's simulator for evaluating contextual bandits on recsys.
- Microsoft DecisionService — github.com/Microsoft/mwt-ds — the open core of Personalizer.
- Open Bandit Pipeline — github.com/st-tech/zr-obp — production-grade off-policy-eval library from ZOZO Research.
- BanditPAM — github.com/ThrunGroup/BanditPAM — bandit-driven k-medoids clustering (a fun aside).
- Tigramite — github.com/jakobrunge/tigramite — causal inference for time-series; useful for counterfactual evaluation context.
- Stitch Fix — Multi-threaded blog — long-running data-science engineering posts; multiple bandit-in-prod write-ups.
- Netflix Research — Causal inference team — papers on counterfactual evaluation at Netflix scale.
- John Langford — Hunch.net — the LinUCB co-author's blog; deep posts on bandits + counterfactual eval.
- Eugene Yan — Real-time ML — practitioner overview of online learning in production.
- Booking.com — 150 Successful ML Experiments — case study including bandits.
Videos
- Multi-Armed Bandits Explained — Steve Brunton — Steve Brunton · 26 min — the cleanest visual primer on epsilon-greedy, UCB, and Thompson Sampling.
- Contextual Bandits & Real-World ML — John Langford (ICML keynote) — John Langford · 1 h 02 min — the co-inventor of LinUCB on bringing contextual bandits to production.
- Thompson Sampling — Sebastian Thrun (Bandit Lectures) — Sebastian Thrun · 18 min — the cleanest 18-minute explanation of TS you'll find.
- Bandits in Production at Stitch Fix — Brian Amadio (Strata) — Brian Amadio · 32 min — real production bandit story; same team that wrote the canonical Stitch Fix blog post.
- Off-Policy Evaluation in Recommender Systems — Yuta Saito (RecSys) — Yuta Saito · 36 min — the Open Bandit Pipeline author on counterfactual evaluation.
LeetCode — Random Pick with Weight
- Link: https://leetcode.com/problems/random-pick-with-weight/
- Difficulty: Medium
- Why this problem: Weighted sampling is the primitive Thompson Sampling and probability-matching bandit policies rely on.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Pick between batch retrain, incremental, and streaming for a given freshness need.
- Explain ε-greedy, UCB, Thompson Sampling and pick one per scenario.
- Define contextual bandits and 3 production use-cases.
- Compute IPS estimate of a new policy from logged data with propensities.
- List 4 safety guardrails for an online-learning system.
- Solve
random-pick-with-weight— prefix-sum + binary search, the weighted-sampling primitive.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.