MLOps — Experiment Tracking, Model Registry, CI/CD for Models
Session 24 of the 48-session learning series.
Why this session matters
This session is the midpoint of the 48-session Machine Learning track. Like every session, it covers one focused topic, paced so you have time to absorb it rather than rush.
Agenda
- What MLOps actually means — the lifecycle map
- Experiment tracking — MLflow, W&B, Neptune; what to log
- Model registry — versions, stages, lineage, governance
- CI/CD for models — tests, validation gates, canary, shadow
- Monitoring in production — drift, performance, alerting, retraining
Pre-read (skim before the session)
- Hidden Technical Debt in ML Systems (Sculley et al., 2015)
- MLflow — Concepts overview
- ML Test Score (Breck et al., 2017)
- Chip Huyen — Designing ML Systems (book)
Deep dive
1. The MLOps lifecycle
[Data] → [Features] → [Train] → [Eval] → [Register] → [Deploy] → [Monitor] → [Retrain]
  ▲                                                                              │
  └──────────────────────────────────────────────────────────────────────────────┘
Each arrow is a place to fail. MLOps is the practice of automating + observing + auditing every arrow, so the loop is reliable enough to put a model in front of paying users.
2. Why this is harder than software DevOps
Software CI/CD: deterministic code → deterministic build → deterministic deploy. Bugs are reproducible.
ML CI/CD: stochastic training → metric-bounded model → deploy with regression risk. Bugs may only appear on a specific data slice 3 weeks later. You need:
- Data versioning (DVC, lakeFS, Delta time travel).
- Code versioning (git).
- Hyperparameter + metric tracking.
- Model artefacts versioning.
- Statistical evaluation gates.
- Live monitoring of model behaviour, not just system health.
3. Experiment tracking — what to log
Every training run, log:
| Item | Why |
|---|---|
| Git commit hash | Reproducibility of code |
| Data version (Delta version / DVC hash) | Reproducibility of data |
| Hyperparameters | Find best config; rerun |
| Library versions (pip freeze) | Reproducibility of env |
| Metrics (loss, AUC, F1, per-slice) | Compare runs |
| Confusion matrix / ROC curves | Debug failures |
| Sample predictions | Sanity check |
| Hardware (GPU type, count) | Compare cost/perf |
| Wall-clock + GPU hours | Cost tracking |
| Model artefact + signature | Downstream deploy |
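import mlflow

# params, auc and git_sha are assumed to come from the surrounding training script.
with mlflow.start_run():
    mlflow.log_params(params)            # dict of hyperparameters
    mlflow.log_metric("auc", auc)        # headline metric for comparing runs
    mlflow.log_artifact("model.pkl")     # serialised model file on local disk
    mlflow.set_tag("git_sha", git_sha)   # ties the run to a code revision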
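The MLflow calls above cover parameters, metrics, and artefacts. The remaining rows of the table (data hash, library versions) can be gathered with the standard library before logging. A minimal sketch, assuming a hypothetical helper `build_run_record` and a record layout invented for illustration; neither is part of MLflow:

```python
import hashlib
import json
import platform
import time
from importlib import metadata


def build_run_record(params, metrics, data_bytes, git_sha, packages=()):
    """Collect the reproducibility fields for one training run as a plain dict."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package not installed in this environment
    return {
        "git_sha": git_sha,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "params": dict(params),
        "metrics": dict(metrics),
        "library_versions": versions,
        "python": platform.python_version(),
        "logged_at_unix": time.time(),
    }


record = build_run_record(
    params={"lr": 0.01, "max_depth": 6},
    metrics={"auc": 0.91},
    data_bytes=b"user_id,label\n1,0\n2,1\n",  # stand-in for the training file
    git_sha="abc1234",  # placeholder; in practice read it from git
    packages=["pip"],
)
print(json.dumps(record, indent=2, sort_keys=True))
```

The resulting dict can be passed to whichever tracker you use; hashing the raw data bytes is one simple stand-in for a DVC hash or Delta version.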
4. Tools
- MLflow — open source, runs in any cloud, model registry built-in. Standard choice for self-host.
- Weights & Biases (W&B) — SaaS, richer UI, great for deep learning, $$.
- Neptune — lightweight, good metadata model.
- Vertex AI / SageMaker / Azure ML experiments — cloud-native; bundled with their pipelines.
For a startup: MLflow + Postgres + S3, runs anywhere. Migrate later if needed.
5. Model registry
Centralised catalogue of trained models. Each model has:
- Versions (1, 2, 3, ...).
- Stages (None → Staging → Production → Archived).
- Lineage (which run produced it, on which data).
- Metrics snapshot.
- Approval (who promoted to Production, when).
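The registry is the single source of truth for "what is in production". Your serving stack asks the registry which version to load, for example via MLflow's MlflowClient().get_latest_versions(name='ranker', stages=['Production']). Recent MLflow releases deprecate stages in favour of model version aliases, so check which mechanism your version supports.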
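To make the version, stage, lineage, and approval fields concrete, here is a toy in-memory registry. It is a sketch for illustration only: `ToyRegistry` and `ModelVersion` are hypothetical names, and this is not how MLflow or any cloud registry is implemented.

```python
from dataclasses import dataclass
from typing import Optional

STAGES = ("None", "Staging", "Production", "Archived")


@dataclass
class ModelVersion:
    version: int
    run_id: str          # lineage: which training run produced it
    data_version: str    # lineage: which data snapshot it was trained on
    metrics: dict        # metrics snapshot at registration time
    stage: str = "None"
    approved_by: Optional[str] = None


class ToyRegistry:
    def __init__(self):
        self._models = {}  # model name -> list of ModelVersion

    def register(self, name, run_id, data_version, metrics):
        versions = self._models.setdefault(name, [])
        mv = ModelVersion(len(versions) + 1, run_id, data_version, dict(metrics))
        versions.append(mv)
        return mv

    def promote(self, name, version, stage, approved_by):
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        target = self._models[name][version - 1]
        if stage == "Production":
            # Keep a single Production version: archive the previous one.
            for mv in self._models[name]:
                if mv.stage == "Production":
                    mv.stage = "Archived"
        target.stage = stage
        target.approved_by = approved_by
        return target

    def production(self, name):
        for mv in self._models.get(name, []):
            if mv.stage == "Production":
                return mv
        return None


reg = ToyRegistry()
reg.register("ranker", "run-001", "delta-v12", {"auc": 0.88})
reg.register("ranker", "run-002", "delta-v13", {"auc": 0.90})
reg.promote("ranker", 1, "Production", approved_by="alice")
reg.promote("ranker", 2, "Production", approved_by="bob")
print(reg.production("ranker").version)  # 2
```

Archiving the old version instead of deleting it is what keeps one-click rollback possible: promoting version 1 again restores it.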
6. CI/CD pipeline
git push (training code)
↓
[ CI: lint + unit tests + smoke train ]
↓
[ Trigger full training job ]
↓
[ Eval gates: AUC > 0.85, latency < 50ms, slice perf OK ]
↓
[ Register new model version ]
↓
[ Deploy to Staging endpoint ]
↓
[ Shadow test against Production for N hours ]
↓
[ Promote to Production: canary 1% → 10% → 100% ]
↓
[ Monitor; auto-rollback if metrics regress ]
Eval gates are non-negotiable. The gate is what stops a worse model from shipping. Common gates:
- Headline metric ≥ current production - epsilon.
- No slice regresses by > X%.
- Inference latency p99 within budget.
- Bias / fairness metrics within bounds.
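The first three gates above can be expressed as one function that returns the reasons a candidate is blocked. This is a sketch under assumptions: the dict layout, the function name `check_gates`, and the default thresholds are illustrative, and a fairness gate would be added in the same style.

```python
def check_gates(candidate, production, epsilon=0.002,
                max_slice_drop=0.01, p99_budget_ms=50.0):
    """Return a list of failed gate descriptions; an empty list means ship."""
    failures = []

    # Gate 1: headline metric must not fall below production minus epsilon.
    if candidate["auc"] < production["auc"] - epsilon:
        failures.append("headline AUC below production - epsilon")

    # Gate 2: no slice may regress by more than max_slice_drop.
    for slice_name, prod_auc in production["slice_auc"].items():
        cand_auc = candidate["slice_auc"].get(slice_name)
        if cand_auc is None:
            failures.append(f"slice '{slice_name}' missing from candidate eval")
        elif prod_auc - cand_auc > max_slice_drop:
            failures.append(
                f"slice '{slice_name}' regressed by {prod_auc - cand_auc:.3f}"
            )

    # Gate 3: inference latency p99 must stay within budget.
    if candidate["p99_latency_ms"] > p99_budget_ms:
        failures.append("p99 latency over budget")

    return failures


production = {
    "auc": 0.880,
    "slice_auc": {"new_users": 0.84, "high_value": 0.90},
    "p99_latency_ms": 41.0,
}
candidate = {
    "auc": 0.885,  # better overall...
    "slice_auc": {"new_users": 0.86, "high_value": 0.85},  # ...worse on one slice
    "p99_latency_ms": 44.0,
}
failures = check_gates(candidate, production)
print(failures)  # ["slice 'high_value' regressed by 0.050"]
```

In CI, a non-empty list would fail the job before the "Register new model version" step.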
7. Validation gates — the hard part
The naïve gate (new_auc > old_auc) misses:
- Slice regression — overall AUC up 0.5%, but it tanked on the high-value user segment by 5%.
- Stratified eval — perf by country, device, age group; not just global.
- Counterfactual — does the new model behave reasonably on edge cases (zero history, fresh signup, abusive content)?
- Calibration — predicted probabilities match actual rates?
Maintain a fixed eval suite (think "test set with personality"), version it, run every candidate.
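Calibration is the least familiar check in the list above, so here is one common way to measure it: expected calibration error with equal-width bins. The function name and the choice of 10 bins are assumptions made for this sketch, not something the session prescribes.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))

    total = len(probs)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_pred = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / total) * abs(mean_pred - observed)
    return ece


# Predictions of 0.2 that come true 1 time in 5, and 0.8 that come true 4 in 5.
calibrated = expected_calibration_error(
    [0.2] * 5 + [0.8] * 5,
    [1, 0, 0, 0, 0] + [1, 1, 1, 1, 0],
)
# Predictions of 0.9 that come true only half the time.
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 0, 1])
print(round(calibrated, 6), round(overconfident, 6))
```

A gate could then require that the candidate's calibration error stay within a tolerance of production's, evaluated on the same versioned eval suite.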
8. Deployment patterns
- Blue/green — deploy new alongside old; instantly switch traffic; instant rollback.
- Canary — small % to new, monitor, ramp up. Catches issues without full blast radius.
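- Shadow — mirror 100% of traffic to both models; the old model still serves every response, and the new model's predictions are only logged and compared. No user-facing risk, but it doubles inference load and cannot measure business outcomes.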
- A/B test — split users; measure business outcome (CTR, revenue, retention). The only honest answer to "is this model better?".
Most teams: canary for tech metrics → A/B for business metrics.
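A canary ramp needs a stable assignment of users to the new model, so that a user who saw the canary at 1% still sees it at 10%. One way to get that is to hash the user ID into a fixed bucket. A minimal sketch; the function name `route` and the 10,000-bucket granularity are assumptions for illustration:

```python
import hashlib


def route(user_id: str, canary_percent: float) -> str:
    """Deterministically assign a user to 'canary' or 'stable'."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000  # 0..9999, fixed for a given user
    return "canary" if bucket < canary_percent * 100 else "stable"


users = [f"user-{i}" for i in range(1000)]
for percent in (1, 10, 100):
    share = sum(route(u, percent) == "canary" for u in users) / len(users)
    print(f"{percent:>3}% requested, {share:.1%} routed to canary")
```

Because the bucket depends only on the user ID, raising the percentage only adds users to the canary group and never reshuffles existing ones. The same split works for an A/B test if you hold the percentage fixed.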
9. Monitoring in production
Four layers:
- System — latency, error rate, saturation. Same as any service.
- Model inputs — feature distributions vs training. PSI, KS test. Drift alerts.
- Model outputs — prediction distribution, confidence, calibration.
- Outcomes (when label arrives) — actual accuracy, business KPIs.
Latency from prediction to label is the hardest part. Some labels are instant (CTR); some take weeks (fraud, refund). You need an eventual eval loop, separate from the immediate monitoring.
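For the model-inputs layer, PSI (Population Stability Index) compares the binned distribution of a feature in training against the same feature in live traffic. The sketch below uses equal-width bins derived from the training sample; the function name, bin count, and epsilon floor are assumptions, and it expects non-empty inputs.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a training sample and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant feature

    def fractions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1  # clamp out-of-range values
        # Floor at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


train = [i / 100 for i in range(100)]   # roughly uniform on [0, 1)
live_same = list(train)                 # no drift
live_shifted = [v + 0.3 for v in train] # whole distribution moved right

print(round(psi(train, live_same), 4))
print(round(psi(train, live_shifted), 4))
```

A widely used rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as worth watching, and above 0.25 as significant drift. Treat those cut-offs as a convention to tune per feature, not a guarantee.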
10. Retraining triggers
- Scheduled — every week / day / hour. Simplest, often enough.
- Drift-triggered — PSI > threshold → trigger retrain.
- Performance-triggered — labelled-batch metric drops below SLA → retrain.
- Manual — for major changes.
Always have a rollback path before automating retraining-and-deploy. The first auto-retrain bug ships a worse model; you want to undo in one click.
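The three automatic triggers can be combined into a single decision that also records why a retrain fired. This is a sketch: the function name and the default thresholds (a 7-day schedule, PSI 0.25, AUC 0.85) are illustrative assumptions rather than recommendations.

```python
def retrain_reason(days_since_train, psi_value, live_auc,
                   max_age_days=7, psi_threshold=0.25, auc_sla=0.85):
    """Return the first trigger that fires, or None if no retrain is needed.

    live_auc may be None when labels for recent predictions have not arrived.
    """
    if live_auc is not None and live_auc < auc_sla:
        return "performance"
    if psi_value > psi_threshold:
        return "drift"
    if days_since_train >= max_age_days:
        return "scheduled"
    return None


print(retrain_reason(days_since_train=2, psi_value=0.05, live_auc=0.90))  # None
print(retrain_reason(days_since_train=2, psi_value=0.40, live_auc=0.90))  # drift
print(retrain_reason(days_since_train=2, psi_value=0.40, live_auc=0.80))  # performance
print(retrain_reason(days_since_train=8, psi_value=0.05, live_auc=None))  # scheduled
```

Logging the reason alongside the resulting run makes it easier to judge later whether drift-triggered retrains actually helped.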
11. Feature store consistency (preview of S38)
The classic ML production bug: training/serving skew. Feature computed differently online vs offline → model trained on one distribution, sees another. Mitigate with:
- Single source of truth for feature logic (feature store).
- Logging online features → reuse for retraining.
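Both mitigations fit in a few lines: one function owns the feature logic, and the serving path logs exactly what the model saw so training can replay it. A sketch with hypothetical feature names and a stand-in model; a real feature store adds storage, versioning, and point-in-time correctness on top of this idea.

```python
import json
import math


def compute_features(raw):
    """Single definition of feature logic, imported by both training and serving."""
    return {
        "log_spend": math.log1p(raw["spend"]),
        "orders_per_day": raw["orders"] / max(raw["account_age_days"], 1),
    }


def serve(raw, model, feature_log):
    """Score one request and log the features the model actually received."""
    features = compute_features(raw)
    score = model(features)
    feature_log.append(
        json.dumps({"features": features, "score": score}, sort_keys=True)
    )
    return score


def training_rows(feature_log):
    """Rebuild training inputs from the serving log instead of recomputing them."""
    return [json.loads(line)["features"] for line in feature_log]


def toy_model(features):
    # Stand-in for a real model: any callable that maps features to a score.
    return 0.1 * features["log_spend"] + 0.5 * features["orders_per_day"]


feature_log = []
raw_event = {"spend": 120.0, "orders": 6, "account_age_days": 30}
serve(raw_event, toy_model, feature_log)
print(training_rows(feature_log)[0] == compute_features(raw_event))  # True
```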
12. Governance and audit
For regulated industries:
- Every prediction traceable to a model version + input features + score.
- Every model traceable to training data + code + approver.
- Right to explanation (GDPR, AI Act).
- Periodic bias audits.
Bundle with the registry; this isn't optional in finance, healthcare, lending.
13. Reality check
Most teams don't need a 10-tool platform. A minimal viable MLOps:
- Git for code.
- MLflow for tracking + registry.
- A pipeline orchestrator (Airflow, Prefect, Dagster).
- A serving stack (BentoML, Triton, FastAPI).
- Prometheus + Grafana for monitoring.
- A dashboard / notebook for drift + perf.
You can ship that in 2 weeks. Scale the platform when team size or model count demands it, not before.
Reading material
Books:
- Designing Machine Learning Systems — Chip Huyen (the chapters on training data, deployment, monitoring are the modern MLOps spine)
- Building Machine Learning Powered Applications — Emmanuel Ameisen (the end-to-end ML systems book)
- Reliable Machine Learning — Cathy Chen, Niall Murphy et al. (SRE for ML, from Google folks)
- Machine Learning Engineering — Andriy Burkov (the practical, language-agnostic reference)
Papers:
- Hidden Technical Debt in Machine Learning Systems — Sculley et al. 2015 (NeurIPS) — the ML-as-anti-pattern paper.
- The ML Test Score: A Rubric for ML Production Readiness — Breck et al. 2017 (Google) — 28 actionable production checks.
- TFX: A TensorFlow-Based Production-Scale Machine Learning Platform — Baylor et al. 2017 — Google's reference ML platform.
- Continuous Delivery for Machine Learning — Sato, Wider, Windheuser 2019 (martinfowler.com) — the CD4ML canon.
Official docs:
- MLflow documentation — tracking, projects, models, registry; the default open-source stack.
- Weights & Biases documentation — hosted experiment tracking + sweeps.
- DVC documentation — git-style versioning for data and models.
- Vertex AI — Model Registry — the GCP model-registry reference.
- SageMaker MLOps — AWS — the canonical AWS reference.
Blog posts:
- Rules of Machine Learning — Martin Zinkevich (Google) — 43 hard-won rules; required reading.
- Eugene Yan — Real-time Machine Learning — deployment patterns + their trade-offs.
- MLOps: Continuous Delivery and Automation Pipelines in ML — Google Cloud Architecture — the maturity-model paper Google references everywhere.
In-depth research material
- MLflow — github.com/mlflow/mlflow — ~18k ★, the open-source ML platform.
- Kubeflow — github.com/kubeflow/kubeflow — ~14k ★, the K8s-native ML stack.
- Metaflow (Netflix) — github.com/Netflix/metaflow — ~8k ★, the human-centred ML framework.
- Prefect — github.com/PrefectHQ/prefect — ~17k ★, modern Python orchestration.
- ZenML — github.com/zenml-io/zenml — ~4k ★, modular MLOps framework.
- Made With ML — MLOps course — Goku Mohandas — the most popular free MLOps walk-through.
- Awesome MLOps — github.com/visenger/awesome-mlops — ~12k ★, curated MLOps reference list.
- Uber — Michelangelo, Uber's ML Platform — the first major industry MLOps platform paper.
- LinkedIn — ML Platform overview — the Pro-ML evolution.
- DoorDash — Maintaining Machine Learning Model Accuracy — drift detection in production.
Videos
- MLflow Tutorial — End-to-End ML Lifecycle — Databricks · 25 min — the canonical MLflow walk-through.
- ML Test Score Rubric — Eric Breck (NeurIPS) — 28 min — the original ML production-readiness talk.
- How to MLOps — Chip Huyen at Stanford MLSys Seminar — Chip Huyen · 1 h 02 min — best end-to-end systems framing from the author of Designing ML Systems.
- Productionizing ML Models — Andrej Karpathy (Tesla AI Day) — 25 min — the data-engine + shadow-mode pattern Tesla actually used.
- Real-time ML Systems — Eugene Yan — Eugene Yan · 45 min — batch vs streaming vs online; tools and trade-offs.
LeetCode — Design In Memory File System
- Link: https://leetcode.com/problems/design-in-memory-file-system/
- Difficulty: Hard
- Why this problem: Tree of inodes with create/read/ls/cat — same vocabulary as a model artifact registry.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- List 8 items every training run should log.
- Describe the model registry lifecycle (None → Staging → Production → Archived).
- Design 4 validation gates that block bad models from production.
- Compare canary, shadow, and A/B test deployments.
- List 3 retraining triggers and the risks of each.
- Solve design-in-memory-file-system: directory tree + file blobs; mirrors a model artefact registry.