ai · ml · intermediate · 12m · 2026-06-09

MLOps — Experiment Tracking, Model Registry, CI/CD for Models

Session 24 of the 48-session learning series.

Why this session matters

This is Session 24 of 48 in the Machine Learning track. Like the rest of the series, it covers one focused topic at a pace that leaves time to absorb it rather than rush.

Agenda

What MLOps actually means — the lifecycle map
Experiment tracking — MLflow, W&B, Neptune; what to log
Model registry — versions, stages, lineage, governance
CI/CD for models — tests, validation gates, canary, shadow
Monitoring in production — drift, performance, alerting, retraining

Deep dive

1. The MLOps lifecycle

[Data] → [Features] → [Train] → [Eval] → [Register] → [Deploy] → [Monitor] → [Retrain]
   ▲                                                                                  │
   └──────────────────────────────────────────────────────────────────────────────────┘

Each arrow is a place to fail. MLOps is the practice of automating, observing, and auditing every arrow, so the loop is reliable enough to put a model in front of paying users.

2. Why this is harder than software DevOps

Software CI/CD: deterministic code → deterministic build → deterministic deploy. Bugs are reproducible.

ML CI/CD: stochastic training → metric-bounded model → deploy with regression risk. Bugs may only appear on a specific data slice 3 weeks later. You need:

Data versioning (DVC, lakeFS, Delta time travel).
Code versioning (git).
Hyperparameter + metric tracking.
Model artefact versioning.
Statistical evaluation gates.
Live monitoring of model behaviour, not just system health.

3. Experiment tracking — what to log

Every training run, log:

Item	Why
Git commit hash	Reproducibility of code
Data version (Delta version / DVC hash)	Reproducibility of data
Hyperparameters	Find best config; rerun
Library versions (`pip freeze`)	Reproducibility of env
Metrics (loss, AUC, F1, per-slice)	Compare runs
Confusion matrix / ROC curves	Debug failures
Sample predictions	Sanity check
Hardware (GPU type, count)	Compare cost/perf
Wall-clock + GPU hours	Cost tracking
Model artefact + signature	Downstream deploy

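import subprocess
import mlflow

# Assumes params (dict), auc (float) and a saved model.pkl already exist.
git_sha = subprocess.check_output(["git", "rev-parse", "HEAD"], text=True).strip()

with mlflow.start_run():
    mlflow.log_params(params)            # hyperparameters
    mlflow.log_metric("auc", auc)        # headline metric
    mlflow.log_artifact("model.pkl")     # serialised model file
    mlflow.set_tag("git_sha", git_sha)   # code version, for reproducibility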
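
Several rows in the table above (git commit, library versions, platform) can be captured automatically at the start of a run. The following is a minimal sketch using only the standard library; the helper name `collect_run_metadata` and the keys it returns are illustrative assumptions, not part of MLflow.

```python
import platform
import subprocess
import sys
from importlib import metadata


def collect_run_metadata(packages=("pip",)):
    """Gather reproducibility info to attach to a training run (hypothetical helper)."""
    try:
        git_sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True, stderr=subprocess.DEVNULL
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        git_sha = "unknown"  # not inside a git checkout, or git is not installed

    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"

    return {
        "git_sha": git_sha,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


meta = collect_run_metadata(packages=("pip", "setuptools"))
```

The resulting dict could be flattened and attached to the run as tags; logging a full `pip freeze` output as an artefact is the more complete option for the environment row.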

4. Tools

MLflow — open source, runs in any cloud, model registry built-in. Standard choice for self-host.
Weights & Biases (W&B) — SaaS, richer UI, great for deep learning, $$.
Neptune — lightweight, good metadata model.
Vertex AI / SageMaker / Azure ML experiments — cloud-native; bundled with their pipelines.

For a startup: MLflow + Postgres + S3, runs anywhere. Migrate later if needed.

5. Model registry

Centralised catalogue of trained models. Each model has:

Versions (1, 2, 3, ...).
Stages (None → Staging → Production → Archived).
Lineage (which run produced it, on which data).
Metrics snapshot.
Approval (who promoted to Production, when).

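The registry is the single source of truth for "what is in production". Your serving stack pulls the current version from it, for example with MLflow's MlflowClient().get_latest_versions(name='ranker', stages=['Production']). Recent MLflow releases deprecate stages in favour of model version aliases, so check which mechanism your version supports.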
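
To make the moving parts concrete, here is a toy in-memory registry that tracks versions, stages, lineage, and approval. It is a sketch for illustration only: `ToyRegistry`, `ModelVersion`, and their methods are hypothetical names, not the MLflow API, and the rule that promoting to Production archives the previous Production version is an assumption of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    name: str
    version: int
    run_id: str          # lineage: the training run that produced it
    data_version: str    # lineage: the dataset snapshot it was trained on
    metrics: dict        # metrics snapshot at registration time
    stage: str = "None"
    approved_by: Optional[str] = None
    approved_at: Optional[datetime] = None


class ToyRegistry:
    def __init__(self):
        self._versions = {}  # model name -> list of ModelVersion

    def register(self, name, run_id, data_version, metrics):
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, run_id, data_version, dict(metrics))
        versions.append(mv)
        return mv

    def promote(self, name, version, stage, approver):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self._versions[name][version - 1]
        if stage == "Production":
            # Assumption: one Production version at a time; archive the old one.
            for mv in self._versions[name]:
                if mv.stage == "Production" and mv is not target:
                    mv.stage = "Archived"
        target.stage = stage
        target.approved_by = approver
        target.approved_at = datetime.now(timezone.utc)
        return target

    def production(self, name):
        for mv in self._versions.get(name, []):
            if mv.stage == "Production":
                return mv
        return None


registry = ToyRegistry()
registry.register("ranker", run_id="run-a", data_version="v12", metrics={"auc": 0.86})
registry.register("ranker", run_id="run-b", data_version="v13", metrics={"auc": 0.88})
registry.promote("ranker", 1, "Production", approver="alice")
registry.promote("ranker", 2, "Production", approver="bob")
live = registry.production("ranker")
```

After the second promotion, `live` is version 2 and version 1 has moved to Archived, which is the lifecycle the stage list describes.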

6. CI/CD pipeline

git push (training code)
   ↓
[ CI: lint + unit tests + smoke train ]
   ↓
[ Trigger full training job ]
   ↓
[ Eval gates: AUC > 0.85, latency < 50ms, slice perf OK ]
   ↓
[ Register new model version ]
   ↓
[ Deploy to Staging endpoint ]
   ↓
[ Shadow test against Production for N hours ]
   ↓
[ Promote to Production: canary 1% → 10% → 100% ]
   ↓
[ Monitor; auto-rollback if metrics regress ]

Eval gates are non-negotiable. The gate is what stops a worse model from shipping. Common gates:

Headline metric ≥ current production - epsilon.
No slice regresses by > X%.
Inference latency p99 within budget.
Bias / fairness metrics within bounds.
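
The four gates above can be expressed as one function that returns the reasons a candidate is blocked. This is a sketch under stated assumptions: the dict layout, the `fairness_gap` metric, and every threshold value are hypothetical, and slice regression is measured here as an absolute AUC drop rather than a percentage.

```python
def run_eval_gates(candidate, production, *, epsilon=0.002, max_slice_drop=0.02,
                   latency_budget_ms=50.0, max_fairness_gap=0.05):
    """Return a list of gate failures; an empty list means the candidate may ship."""
    failures = []

    # Gate 1: headline metric must not fall below production minus epsilon.
    if candidate["auc"] < production["auc"] - epsilon:
        failures.append("headline AUC below production minus epsilon")

    # Gate 2: no slice may regress by more than the allowed drop.
    for slice_name, prod_auc in production["slice_auc"].items():
        cand_auc = candidate["slice_auc"].get(slice_name)
        if cand_auc is None:
            failures.append(f"slice {slice_name!r} missing from candidate eval")
        elif prod_auc - cand_auc > max_slice_drop:
            failures.append(f"slice {slice_name!r} regressed by {prod_auc - cand_auc:.3f}")

    # Gate 3: p99 inference latency within budget.
    if candidate["latency_p99_ms"] > latency_budget_ms:
        failures.append("p99 latency over budget")

    # Gate 4: fairness metric within bounds (hypothetical gap between groups).
    if candidate["fairness_gap"] > max_fairness_gap:
        failures.append("fairness gap out of bounds")

    return failures


production = {
    "auc": 0.860,
    "slice_auc": {"high_value": 0.90, "new_users": 0.80},
}
candidate = {
    "auc": 0.865,
    "slice_auc": {"high_value": 0.85, "new_users": 0.82},
    "latency_p99_ms": 45.0,
    "fairness_gap": 0.03,
}
failures = run_eval_gates(candidate, production)
```

In this example the candidate has a better headline AUC but is still blocked, because the `high_value` slice dropped by more than the allowed margin.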

7. Validation gates — the hard part

The naïve gate (new_auc > old_auc) misses:

Slice regression — overall AUC up 0.5%, but it tanked on the high-value user segment by 5%.
Stratified eval — perf by country, device, age group; not just global.
Counterfactual — does the new model behave reasonably on edge cases (zero history, fresh signup, abusive content)?
Calibration — predicted probabilities match actual rates?

Maintain a fixed eval suite (think "test set with personality"), version it, run every candidate.
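
Calibration is the item in this list that is least often automated. One common way to turn it into a number is expected calibration error (ECE): bin predictions by predicted probability and average the gap between mean prediction and observed positive rate. The sketch below assumes equal-width bins and binary labels; the bin count is an arbitrary choice.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average gap between mean predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_pred - observed)
    return ece


# Predicts 0.2 and 2 of 10 are positive: well calibrated.
ece_good = expected_calibration_error([0.2] * 10, [1, 1] + [0] * 8)
# Predicts 0.9 but only 5 of 10 are positive: overconfident.
ece_bad = expected_calibration_error([0.9] * 10, [1] * 5 + [0] * 5)
```

A gate could then require the candidate's ECE on the fixed eval suite to stay under a threshold agreed with the teams that consume the probabilities.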

8. Deployment patterns

Blue/green — deploy new alongside old; instantly switch traffic; instant rollback.
Canary — small % to new, monitor, ramp up. Catches issues without full blast radius.
Shadow — send 100% of traffic to both; compare predictions; don't actually use new yet. Users never see the new model's output, so the risk is low, at the cost of doubled inference load.
A/B test — split users; measure business outcome (CTR, revenue, retention). The only honest answer to "is this model better?".

Most teams: canary for tech metrics → A/B for business metrics.
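
A canary ramp needs a stable way to decide which requests go to the new model. One common approach is to hash a user id into a bucket, so the same user always gets the same model and the 1% group is a subset of the 10% group. The function name, salt, and bucket count below are assumptions for illustration.

```python
import hashlib


def route_to_canary(user_id: str, canary_percent: float, salt: str = "ranker-canary") -> bool:
    """Deterministically assign a user to the canary via a hash bucket in [0, 10000)."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < canary_percent * 100  # 1% -> buckets 0..99


users = [f"user-{i}" for i in range(20_000)]
share = sum(route_to_canary(u, 10.0) for u in users) / len(users)
```

Changing the salt per rollout reshuffles the buckets, which avoids always exposing the same users to every new model first.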

9. Monitoring in production

Four layers:

System — latency, error rate, saturation. Same as any service.
Model inputs — feature distributions vs training. PSI, KS test. Drift alerts.
Model outputs — prediction distribution, confidence, calibration.
Outcomes (when label arrives) — actual accuracy, business KPIs.
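
For the model-inputs layer, PSI (Population Stability Index) compares the binned distribution of a feature in training against live traffic. Below is a minimal standard-library sketch; it assumes equal-width bins derived from the training sample and clamps out-of-range live values into the edge bins. A PSI above roughly 0.2 is a commonly quoted rule of thumb for a significant shift, not a universal threshold.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def bin_shares(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[max(0, min(idx, n_bins - 1))] += 1  # clamp out-of-range values
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [i / 1000 for i in range(1000)]               # roughly uniform on [0, 1)
live_shifted = [0.5 + i / 2000 for i in range(1000)]  # mass moved to [0.5, 1)

psi_same = psi(train, train)
psi_shifted = psi(train, live_shifted)
```

`psi_same` is zero because the two samples are identical, while `psi_shifted` is far above the rule-of-thumb threshold because half the bins are empty in the live sample.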

Latency from prediction to label is the hardest part. Some labels are instant (CTR); some take weeks (fraud, refund). You need an eventual eval loop, separate from the immediate monitoring.

10. Retraining triggers

Scheduled — every week / day / hour. Simplest, often enough.
Drift-triggered — PSI > threshold → trigger retrain.
Performance-triggered — labelled-batch metric drops below SLA → retrain.
Manual — for major changes.

Always have a rollback path before automating retraining-and-deploy. The first auto-retrain bug ships a worse model; you want to undo in one click.
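
The first three triggers can be combined into a single check that an orchestrator runs on a schedule. This is a sketch: the function name, the 7-day maximum age, the PSI threshold, and the AUC SLA are all assumed values.

```python
from datetime import datetime, timedelta


def should_retrain(now, last_trained, psi_value, labelled_auc, *,
                   max_age=timedelta(days=7), psi_threshold=0.2, auc_sla=0.85):
    """Return the list of triggers that fired; an empty list means no retrain."""
    reasons = []
    if now - last_trained >= max_age:
        reasons.append("scheduled")
    if psi_value > psi_threshold:
        reasons.append("drift")
    # Labels can lag by weeks, so skip the performance check when none are available.
    if labelled_auc is not None and labelled_auc < auc_sla:
        reasons.append("performance")
    return reasons


reasons = should_retrain(
    now=datetime(2026, 6, 9),
    last_trained=datetime(2026, 6, 5),
    psi_value=0.31,
    labelled_auc=None,
)
```

Returning the reasons rather than a bare boolean makes it easy to log why a retrain started, which helps when auditing an automated loop.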

11. Feature store consistency (preview of S38)

The classic ML production bug: training/serving skew. Feature computed differently online vs offline → model trained on one distribution, sees another. Mitigate with:

Single source of truth for feature logic (feature store).
Logging online features → reuse for retraining.
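
Both mitigations fit in a few lines: one feature function shared by training and serving, plus a log of what the online path actually computed. The sketch below is illustrative only; the feature names, the JSON-lines log, and the `skewed_rows` check are assumptions, not a feature store API.

```python
import json


def compute_features(raw: dict) -> dict:
    """Single definition of feature logic, imported by both training and serving code."""
    orders = raw["orders"]
    return {
        "order_count": len(orders),
        "avg_order_value": sum(orders) / len(orders) if orders else 0.0,
    }


def serve(raw: dict, log: list) -> dict:
    """Online path: compute features and log exactly what the model saw."""
    features = compute_features(raw)
    log.append(json.dumps({"user_id": raw["user_id"], "features": features}))
    return features


def skewed_rows(log: list, offline_raw: dict, tol=1e-9) -> list:
    """Recompute features offline and return user ids whose logged online values differ."""
    bad = []
    for line in log:
        row = json.loads(line)
        offline = compute_features(offline_raw[row["user_id"]])
        if any(abs(offline[k] - v) > tol for k, v in row["features"].items()):
            bad.append(row["user_id"])
    return bad


online_log = []
serve({"user_id": "u1", "orders": [10.0, 30.0]}, online_log)
serve({"user_id": "u2", "orders": []}, online_log)

# The offline snapshot for u1 includes an order placed after serving time.
offline_raw = {
    "u1": {"user_id": "u1", "orders": [10.0, 30.0, 50.0]},
    "u2": {"user_id": "u2", "orders": []},
}
skewed = skewed_rows(online_log, offline_raw)
```

Here the logic is identical on both paths, yet `u1` still shows skew because the offline data is from a later point in time. Retraining on the logged online features avoids that class of mismatch.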

12. Governance and audit

For regulated industries:

Every prediction traceable to a model version + input features + score.
Every model traceable to training data + code + approver.
Right to explanation (GDPR, AI Act).
Periodic bias audits.

Bundle with the registry; this isn't optional in finance, healthcare, lending.
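
The first traceability requirement amounts to writing one structured record per prediction. A minimal sketch follows, assuming a JSON log line and a SHA-256 digest of the features for tamper checks; the record fields and class name are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PredictionRecord:
    request_id: str
    model_name: str
    model_version: int
    features: dict
    score: float

    def to_json(self) -> str:
        """Serialise the record, adding a digest of the canonicalised features."""
        payload = asdict(self)
        canonical = json.dumps(payload["features"], sort_keys=True)
        payload["features_sha256"] = hashlib.sha256(canonical.encode()).hexdigest()
        return json.dumps(payload, sort_keys=True)


record = PredictionRecord(
    request_id="req-001",
    model_name="ranker",
    model_version=3,
    features={"order_count": 2, "avg_order_value": 20.0},
    score=0.73,
)
line = record.to_json()
```

Joining `model_version` against the registry then answers the second requirement: which data, code, and approver stand behind the model that produced a given score.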

13. Reality check

Most teams don't need a 10-tool platform. A minimal viable MLOps:

Git for code.
MLflow for tracking + registry.
A pipeline orchestrator (Airflow, Prefect, Dagster).
A serving stack (BentoML, Triton, FastAPI).
Prometheus + Grafana for monitoring.
A dashboard / notebook for drift + perf.

You can ship that in 2 weeks. Scale the platform when team size or model count demands it, not before.

Reading material

Books:

Designing Machine Learning Systems — Chip Huyen (the chapters on training data, deployment, monitoring are the modern MLOps spine)
Building Machine Learning Powered Applications — Emmanuel Ameisen (the end-to-end ML systems book)
Reliable Machine Learning — Cathy Chen, Niall Murphy et al. (SRE for ML, from Google folks)
Machine Learning Engineering — Andriy Burkov (the practical, language-agnostic reference)

Papers:

Hidden Technical Debt in Machine Learning Systems — Sculley et al. 2015 (NeurIPS) — the ML-as-anti-pattern paper.
The ML Test Score: A Rubric for ML Production Readiness — Breck et al. 2017 (Google) — 28 actionable production checks.
TFX: A TensorFlow-Based Production-Scale Machine Learning Platform — Baylor et al. 2017 — Google's reference ML platform.
Continuous Delivery for Machine Learning — Sato, Wider, Windheuser 2019 (martinfowler.com) — the CD4ML canon.

Official docs:

MLflow documentation — tracking, projects, models, registry; the default open-source stack.
Weights & Biases documentation — hosted experiment tracking + sweeps.
DVC documentation — git-style versioning for data and models.
Vertex AI — Model Registry — the GCP model-registry reference.
SageMaker MLOps — AWS — the canonical AWS reference.

Blog posts:

Rules of Machine Learning — Martin Zinkevich (Google) — 43 hard-won rules; required reading.
Eugene Yan — Real-time Machine Learning — deployment patterns + their trade-offs.
MLOps: Continuous Delivery and Automation Pipelines in ML — Google Cloud Architecture — the maturity-model paper Google references everywhere.

In-depth research material

MLflow — github.com/mlflow/mlflow — ~18k ★, the open-source ML platform.
Kubeflow — github.com/kubeflow/kubeflow — ~14k ★, the K8s-native ML stack.
Metaflow (Netflix) — github.com/Netflix/metaflow — ~8k ★, the human-centred ML framework.
Prefect — github.com/PrefectHQ/prefect — ~17k ★, modern Python orchestration.
ZenML — github.com/zenml-io/zenml — ~4k ★, modular MLOps framework.
Made With ML — MLOps course — Goku Mohandas — the most popular free MLOps walk-through.
Awesome MLOps — github.com/visenger/awesome-mlops — ~12k ★, curated MLOps reference list.
Uber — Michelangelo, Uber's ML Platform — the first major industry MLOps platform paper.
LinkedIn — ML Platform overview — the Pro-ML evolution.
DoorDash — Maintaining Machine Learning Model Accuracy — drift detection in production.

Videos

MLflow Tutorial — End-to-End ML Lifecycle — Databricks · 25 min — the canonical MLflow walk-through.
ML Test Score Rubric — Eric Breck (NeurIPS) — 28 min — the original ML production-readiness talk.
How to MLOps — Chip Huyen at Stanford MLSys Seminar — Chip Huyen · 1 h 02 min — best end-to-end systems framing from the author of Designing ML Systems.
Productionizing ML Models — Andrej Karpathy (Tesla AI Day) — 25 min — the data-engine + shadow-mode pattern Tesla actually used.
Real-time ML Systems — Eugene Yan — Eugene Yan · 45 min — batch vs streaming vs online; tools and trade-offs.

LeetCode — Design In Memory File System

Link: https://leetcode.com/problems/design-in-memory-file-system/
Difficulty: Hard
Why this problem: Tree of inodes with create/read/ls/cat — same vocabulary as a model artifact registry.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

List 8 items every training run should log.
Describe the model registry lifecycle (None → Staging → Production → Archived).
Design 4 validation gates that block bad models from production.
Compare canary, shadow, and A/B test deployments.
List 3 retraining triggers and the risks of each.
Solve design-in-memory-file-system — directory tree + file blobs; mirrors a model artefact registry.
