ai mlintermediate 12m2026-06-09

Online Learning, Bandits, Counterfactual Evaluation

Session 43 of the 48-session learning series.

Why this session matters

This is Session 43 of 48 in the ML track. Most ML systems retrain weekly and live with that staleness. The frontier — recsys, ranking, ad bidding, dynamic pricing — needs to learn and adapt within minutes, or to estimate what would have happened if it had decided differently. Bandits and counterfactual eval are the toolkit.

Agenda

Online learning — incremental training; SGD on a stream
Multi-armed bandits — ε-greedy, UCB, Thompson sampling
Contextual bandits — when the right action depends on context
Counterfactual / off-policy evaluation — IPS, doubly robust
Practical deployment — exploration cost, safety nets, monitoring

Pre-read (skim before the session)

Deep dive

1. Why most ML is "offline" and what's wrong with that

Typical lifecycle: collect data → train batch model → deploy → repeat next week.

Problems:

Stale to new content / new users.
Cold-start for new items takes a full cycle.
Slow to react to distribution shift (trend, season, breaking news).

Online learning: model updates as data flows. Adaptable, responsive, harder.

2. Online vs incremental vs full retrain

Full retrain — train from scratch on full historical data. Most accurate; slowest.
Incremental / warm-start — start from last model; train on new data only. Faster; may drift over time.
Streaming / online — update parameters per-example or mini-batch as data arrives. Real-time; risks instability.

Most production ML is full + incremental. Streaming is for the cases where minutes-of-freshness actually move the needle.

3. Tools for online learning

Vowpal Wabbit — VW is the canonical streaming learner; tens of GB/s throughput.
River — Python streaming-ML lib; clean API, slower.
TensorFlow Extended (TFX) — supports incremental.
Custom: SGD with checkpoint + warm-start; works fine for most cases.

4. The multi-armed bandit problem

You have K arms (actions, items, recommendations). Each has unknown reward distribution. Pull arms over time to maximise total reward. Exploration (try arms you're uncertain about) vs exploitation (pick the best one so far).

Algorithm	How	Pros / Cons
ε-greedy	best most of the time; ε fraction random	simple; wastes pulls on bad arms
UCB	upper confidence bound — pull arm with highest "optimistic" estimate	tight regret bounds; assumes stationary
Thompson Sampling	sample reward from posterior; pull arg-max	best empirical perf; Bayesian; easy

For real systems: Thompson Sampling is the modern default. Easy to extend to contextual.

5. Contextual bandits

Instead of K fixed arms with fixed rewards, you observe a context x and choose an action a; reward r ~ p(r | x, a).

Examples:

News headline ranking: context = user features; action = headline; reward = click.
Email subject A/B test: context = recipient features; action = subject variant; reward = open.
Ad bidding: context = page + user; action = creative; reward = revenue.

Algorithms:

LinUCB — linear model + UCB.
Thompson Sampling with linear regression — Bayesian linear with posterior sampling.
Neural bandits — DNN with output uncertainty estimate.

VW makes most of this 1-liner; do not implement yourself unless you must.

6. The cost of exploration

Every exploration shows a user something the model thinks isn't best. Costs:

Lost short-term revenue / engagement.
Bad UX for the user shown the "exploration" pick.
Bias in your collected data toward exploration arms.

Mitigations:

Cap exploration rate (e.g. 5% of traffic).
Apply only to suitable surfaces (not the lead slot).
Explore more for new items (Bayesian uncertainty), less for known.
Constrain exploration to a "safe" candidate set.

7. Counterfactual / off-policy evaluation

You logged: state → action → reward, using policy π_0 (current production). You want: estimate the reward of policy π_1 (a new model), without rolling it out.

Inverse Propensity Scoring (IPS):

estimate(π_1) = mean( r * π_1(a|x) / π_0(a|x) )

Weighting each historical reward by how much more (or less) the new policy would have chosen that action. Unbiased if π_0 had support everywhere π_1 has support (no deterministic policies).

Problems:

High variance for rare actions (large weights blow up).
Brittle if logged policy was deterministic (π_0(a|x) = 1 for one action, 0 for others; divisions break).

Fixes:

Clip weights (cap at some max).
Doubly Robust — combine IPS with a regression-based estimator; robust if either is correct.
Direct method — pure regression on logged rewards; biased but stable.

8. Logging requirements

To do off-policy evaluation, your production system must log:

The features / context.
The action taken.
The probability the production policy chose that action (propensity).
The reward observed.

Without propensity, you can't compute IPS — your historical data is useless for counterfactual. Add propensity logging before you need it.

9. Safety and guardrails

When deploying online learners:

Action whitelist — model can only choose from approved items.
Reward sanity check — clip extreme rewards (one viral post outlier can dominate).
Rate limit changes — parameters can only move X% per hour.
Fall-back model — if online model goes haywire, switch back to last stable batch.
Shadow mode — run new online learner in shadow; compare predictions before serving.

Most "AI gone wrong" incidents in news come from missing safety nets like these.

10. Non-stationarity

The real world drifts. Trends change, fashion changes, users grow up.

Handling:

Time-decay — weight recent examples more.
Sliding window — only train on last X days.
Concept drift detection — track residual; retrain when distribution shifts.
Cold start re-exploration — periodically re-explore "dead" arms; user preferences change.

11. A/B vs bandit

When you have 2 fixed options and want statistical significance → A/B test. When you have many options and want to maximise reward while learning → bandit.

A/B gives you a confidence interval on the difference. Bandit gives you the best option without bothering to prove which is which.

12. Reality check

A pragmatic online-learning stack:

Batch-trained baseline (LightGBM, neural ranker) — gold standard.
Contextual bandit (VW or Thompson Sampling) on top of baseline predictions — explore variants.
Always log propensities.
Off-policy eval with doubly-robust before any new policy ships.
Guardrails: whitelists, clipping, fallbacks.

Bandits are not a replacement for ML; they are a smart exploration policy on top. Most teams misunderstand this and end up with a worse system than plain offline ML.

Reading material

Books:

Reinforcement Learning: An Introduction (2nd ed.) — Sutton & Barto (Ch. 2 on bandits is the canonical reference; the rest is the broader RL context bandits live inside)
Bandit Algorithms — Lattimore & Szepesvári (the rigorous modern textbook; free PDF online)
A Tutorial on Thompson Sampling — Russo, Van Roy, Kazerouni, Osband, Wen (the canonical TS tutorial; ~120 pages, freely available)
Designing Machine Learning Systems — Chip Huyen (the chapters on continual learning + monitoring + experimentation that wrap bandits in production)

Papers:

A Contextual-Bandit Approach to Personalized News Article Recommendation — Li et al. 2010 (LinUCB) — the Yahoo News paper that put contextual bandits on the map for industry.
An Empirical Evaluation of Thompson Sampling — Chapelle & Li 2011 — the paper that turned TS from theory curiosity to industry default.
Doubly Robust Policy Evaluation and Learning — Dudik, Langford, Li 2011 — the canonical IPS + DR counterfactual estimator paper.
Counterfactual Reasoning and Learning Systems — Bottou et al. 2013 — the Microsoft Bing paper that defined how to ship policies offline-evaluated on logs.
Practical Lessons from Predicting Clicks on Ads at Facebook — He et al. 2014 — the canonical online-learning paper from a production ranking system.

Official docs:

Vowpal Wabbit — wiki — the canonical online-learning / contextual-bandit library.
Vowpal Wabbit — Contextual bandits tutorial — the in-docs walkthrough.
Microsoft Personalizer (Decision Service) docs — the managed contextual-bandit service from MSR.
scikit-learn — Online (partial_fit) estimators — list of which sklearn models can learn incrementally.
River — online ML in Python — the canonical streaming-ML library.
TensorFlow TFX — Continuous training — the production pipeline TFX uses for continuous model refresh.

Blog posts:

Eugene Yan — Bandits for personalization — the practitioner introduction.
Stitch Fix — Multi-Armed Bandits and the Stitch Fix Experimentation Platform — a real production bandit at a recsys company.
Netflix Tech Blog — Artwork Personalization — the canonical Netflix contextual-bandit story (thumbnail selection).
Spotify Engineering — Bandit-based recommendation — production posts on Home + Discover ranking bandits.
Microsoft Research — Counterfactual Learning — long-running blog on offline policy eval at MSR.
John Langford — Hunch.net contextual bandits posts — long-form essays by the LinUCB co-author.

In-depth research material

Vowpal Wabbit — github.com/VowpalWabbit/vowpal_wabbit — ~9k ★, the canonical online-learning + bandit library.
River — github.com/online-ml/river — ~6k ★, modern Python online-ML.
Recogym — github.com/criteo-research/reco-gym — Criteo's simulator for evaluating contextual bandits on recsys.
Microsoft DecisionService — github.com/Microsoft/mwt-ds — the open core of Personalizer.
Open Bandit Pipeline — github.com/st-tech/zr-obp — production-grade off-policy-eval library from ZOZO Research.
BanditPAM — github.com/ThrunGroup/BanditPAM — bandit-driven k-medoids clustering (a fun aside).
Tigramite — github.com/jakobrunge/tigramite — causal inference for time-series; useful for counterfactual evaluation context.
Stitch Fix — Multi-threaded blog — long-running data-science engineering posts; multiple bandit-in-prod write-ups.
Netflix Research — Causal inference team — papers on counterfactual evaluation at Netflix scale.
John Langford — Hunch.net — the LinUCB co-author's blog; deep posts on bandits + counterfactual eval.
Eugene Yan — Real-time ML — practitioner overview of online learning in production.
Booking.com — 150 Successful ML Experiments — case study including bandits.

Videos

Multi-Armed Bandits Explained — Steve Brunton — Steve Brunton · 26 min — the cleanest visual primer on epsilon-greedy, UCB, and Thompson Sampling.
Contextual Bandits & Real-World ML — John Langford (ICML keynote) — John Langford · 1 h 02 min — the co-inventor of LinUCB on bringing contextual bandits to production.
Thompson Sampling — Sebastian Thrun (Bandit Lectures) — Sebastian Thrun · 18 min — the cleanest 18-minute explanation of TS you'll find.
Bandits in Production at Stitch Fix — Brian Amadio (Strata) — Brian Amadio · 32 min — real production bandit story; same team that wrote the canonical Stitch Fix blog post.
Off-Policy Evaluation in Recommender Systems — Yuta Saito (RecSys) — Yuta Saito · 36 min — the Open Bandit Pipeline author on counterfactual evaluation.

LeetCode — Random Pick with Weight

Link: https://leetcode.com/problems/random-pick-with-weight/
Difficulty: Medium
Why this problem: Weighted sampling is the primitive Thompson Sampling and probability-matching bandit policies rely on.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Pick between batch retrain, incremental, and streaming for a given freshness need.
Explain ε-greedy, UCB, Thompson Sampling and pick one per scenario.
Define contextual bandits and 3 production use-cases.
Compute IPS estimate of a new policy from logged data with propensities.
List 4 safety guardrails for an online-learning system.
Solve random-pick-with-weight — prefix-sum + binary search, the weighted-sampling primitive.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Prompt Engineering at Production Scale — Templates, Caching, Drift

Designing a Search Engine — Crawl, Index, Query, Ranking