Observability for Data Pipelines — SLAs, SLOs, Freshness, Data Tests
Session 46 of the 48-session learning series.
Why this session matters
This is Session 46 of 48 in the DE track. Software observability matured a decade ago; data observability is catching up. The discipline that takes you from "I'll check the dashboard tomorrow" to "alert fired at 03:12, runbook executed, no humans woken" is what makes pipelines you can sleep through.
Agenda
- SLA vs SLO vs SLI — definitions that matter
- The 5 pillars revisited — freshness, volume, distribution, schema, lineage
- Instrumentation — what to emit from every pipeline run
- Alerting that doesn't burn out the team
- Incident postmortems for data systems
Pre-read (skim before the session)
- Google SRE Book — Service Level Objectives
- Monte Carlo — 5 Pillars of Data Observability
- Maxime Beauchemin — The rise of the data engineer
- Liz Fong-Jones — Observability
Deep dive
1. SLA, SLO, SLI
- SLI — Service Level Indicator. The metric: "% of orders_fact updates completed within 30 min of source commit".
- SLO — Service Level Objective. The target: "99.5% of updates within 30 min, monthly".
- SLA — Service Level Agreement. The contract with a customer (internal or external): "if we miss SLO by > X, we credit Y".
Most teams need SLOs (internal targets); few need real SLAs (formal commitments). Teams commonly confuse the two.
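To make the distinction concrete, here is a minimal sketch of the orders_fact example above: the SLI is a measured fraction and the SLO is the target it is compared against. The function names and the sample latencies are assumptions for illustration, not part of any library.

```python
from datetime import timedelta


def sli_within_threshold(latencies, threshold):
    """SLI: fraction of updates that completed within `threshold` of source commit."""
    if not latencies:
        raise ValueError("no updates observed in the window")
    good = sum(1 for lat in latencies if lat <= threshold)
    return good / len(latencies)


def meets_slo(sli, slo_target=0.995):
    """SLO: the target the measured SLI must reach (99.5% in the example above)."""
    return sli >= slo_target


# Hypothetical month of data: 1000 updates, 4 of them slower than 30 minutes.
latencies = [timedelta(minutes=12)] * 996 + [timedelta(minutes=45)] * 4
sli = sli_within_threshold(latencies, timedelta(minutes=30))
print(f"SLI = {sli:.3%}, SLO met: {meets_slo(sli)}")
```

An SLA would sit on top of this as a contractual consequence of missing the SLO; it is not something the code computes.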
2. SLOs for data — what to measure
Common SLIs:
- Freshness — now - max(event_ts) for each dataset.
- Completeness — count(seen_today) / count(expected_today).
- Accuracy — count(rows passing quality test) / count(rows).
- Availability — uptime of dataset endpoint.
- Latency — time from source event → available in target.
Per dataset, declare 1–3 SLIs with concrete SLOs. Track them weekly.
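The first three SLIs in the list can be sketched as plain functions. This is an illustrative sketch only: in practice the inputs would come from warehouse queries, and the function names and sample values below are assumptions.

```python
from datetime import datetime, timedelta, timezone


def freshness(event_timestamps, now):
    """Freshness: now - max(event_ts), the age of the newest event."""
    return now - max(event_timestamps)


def completeness(seen_today, expected_today):
    """Completeness: rows seen today divided by rows expected today."""
    if expected_today <= 0:
        raise ValueError("expected_today must be positive")
    return seen_today / expected_today


def accuracy(rows_passing, rows_total):
    """Accuracy: share of rows that pass the quality test."""
    if rows_total <= 0:
        raise ValueError("rows_total must be positive")
    return rows_passing / rows_total


# Hypothetical values for one dataset.
now = datetime(2024, 1, 8, 12, 0, tzinfo=timezone.utc)
events = [now - timedelta(minutes=m) for m in (95, 40, 18)]
print("freshness:", freshness(events, now))
print("completeness:", completeness(seen_today=9_950, expected_today=10_000))
print("accuracy:", accuracy(rows_passing=9_900, rows_total=9_950))
```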
3. Error budget
Software SRE concept that applies cleanly to data:
Error budget = (1 - SLO) × length of the SLO window
SLO 99.5% over a 30-day month → 0.5% × 30 days × 24 h = 3.6 hours/month of allowed outage
If you've burned the budget, freeze feature work; focus on reliability. If you have budget left, take risks; ship faster.
This is counter-intuitive for data teams used to a 100%-perfection mindset. Once adopted, it transforms the team's relationship with risk.
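The arithmetic above, plus the "freeze or ship" decision, can be sketched in a few lines. The function names and the 30-day window default are assumptions for illustration.

```python
def error_budget_hours(slo, days=30):
    """Allowed hours of SLO violation in the window: (1 - SLO) x days x 24."""
    return (1.0 - slo) * days * 24


def budget_remaining_hours(slo, bad_hours, days=30):
    """Budget left after `bad_hours` of violation; negative means overspent."""
    return error_budget_hours(slo, days) - bad_hours


def policy(slo, bad_hours, days=30):
    """Freeze feature work once the budget is burned, otherwise keep shipping."""
    if budget_remaining_hours(slo, bad_hours, days) <= 0:
        return "freeze feature work"
    return "ship"


print(round(error_budget_hours(0.995), 2))  # the 3.6 hours from the example above
print(policy(0.995, bad_hours=1.5))
print(policy(0.995, bad_hours=4.0))
```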
4. The 5 pillars instrumented
| Pillar | What to emit | Sample alert |
|---|---|---|
| Freshness | latest event ts, ingestion lag | lag > 30 min |
| Volume | rows in / rows out per run | < 50% of 7d median |
| Distribution | mean, p95, null %, distinct count | null % > 5% |
| Schema | columns + types per run | schema diff vs baseline |
| Lineage | upstream tables touched | (no alert; UI surface) |
Most pipelines emit only "success/fail". You need to emit metrics and run metadata as well, whether through Prometheus, OpenLineage, or a proprietary tool.
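The sample alerts in the table for volume, distribution, and schema reduce to small checks. The sketch below assumes the per-run values are already collected; the function names, thresholds as defaults, and sample data are illustrative.

```python
from statistics import median


def volume_alert(rows_today, last_7_days, ratio=0.5):
    """True when today's row count is below `ratio` times the 7-day median."""
    return rows_today < ratio * median(last_7_days)


def null_pct(values):
    """Percentage of None values in a column sample."""
    if not values:
        return 0.0
    return 100.0 * sum(v is None for v in values) / len(values)


def distribution_alert(values, max_null_pct=5.0):
    """True when the null percentage exceeds the threshold."""
    return null_pct(values) > max_null_pct


def schema_diff(baseline, current):
    """Compare {column: type} mappings; an empty result means no drift."""
    diff = {}
    for col in baseline.keys() | current.keys():
        if baseline.get(col) != current.get(col):
            diff[col] = (baseline.get(col), current.get(col))
    return diff


history = [1000, 1100, 900, 1050, 980, 1020, 990]  # hypothetical daily row counts
print(volume_alert(400, history))
print(distribution_alert([1, None, 3, 4]))
print(schema_diff({"id": "int", "amount": "float"}, {"id": "int", "amount": "str"}))
```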
5. Instrumentation patterns
Per pipeline run, emit:
- Start time, end time, duration.
- Source tables read (with version/snapshot ID).
- Destination tables written (with new version/snapshot ID).
- Row counts (in / out per stage).
- Schema fingerprint.
- Job-level success / failure / error class.
- Custom business metrics (revenue total, distinct users, ...).
OpenLineage emits the lineage + run metadata; pair it with StatsD/Prometheus for metrics.
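One way to hold the per-run fields listed above is a single record serialized at the end of the run. This is a hand-rolled sketch, not the OpenLineage event format; the `RunRecord` class, its field names, and the sample values are all assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


def schema_fingerprint(columns):
    """Stable short hash of a {column: type} mapping, independent of key order."""
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass
class RunRecord:
    pipeline: str
    started_at: float            # epoch seconds
    ended_at: float              # epoch seconds
    sources: dict                # table -> snapshot/version ID read
    destinations: dict           # table -> new snapshot/version ID written
    rows_in: int
    rows_out: int
    schema_fp: str
    status: str = "success"      # success / failure
    error_class: Optional[str] = None
    business_metrics: dict = field(default_factory=dict)

    def to_json(self):
        """Serialize the record, adding the derived run duration."""
        payload = asdict(self)
        payload["duration_s"] = self.ended_at - self.started_at
        return json.dumps(payload, sort_keys=True)


# Hypothetical run of a pipeline that builds orders_fact.
record = RunRecord(
    pipeline="orders_fact_build",
    started_at=1_700_000_000.0,
    ended_at=1_700_000_090.0,
    sources={"raw.orders": "snap-41"},
    destinations={"mart.orders_fact": "snap-42"},
    rows_in=10_000,
    rows_out=9_950,
    schema_fp=schema_fingerprint({"order_id": "int", "amount": "float"}),
    business_metrics={"revenue_total": 125_000.0},
)
print(record.to_json())
```

The JSON line can then be logged, pushed to a metrics gateway, or mapped onto a lineage event by whatever emitter the stack already uses.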
6. Alerts — sustainable design
Alert characteristics:
- Actionable — someone can do something about it. Otherwise, dashboard, not alert.
- Symptom-based — alert on user impact (freshness SLO breach), not on every internal hiccup.
- De-duped — one alert per incident, not 50.
- Routable — to the team that owns the dataset.
- Escalation — if not acked in N minutes, page next-on-call.
Anti-pattern: 200 alerts per day, all "info-level"; team mutes the channel; real incident missed.
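Three of the characteristics above (actionable, de-duped, routable) can be sketched as a small routing step. The `Alert` class, the owner map, and the team names are hypothetical; a real setup would usually express this in the alert manager's own configuration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    dataset: str
    symptom: str          # e.g. "freshness_slo_breach"
    actionable: bool
    detail: str = ""


def route(alerts, owners, default_team="data-platform"):
    """Return (pages, dashboard_only).

    Non-actionable alerts go to the dashboard. Actionable alerts are grouped
    so each (dataset, symptom) incident yields one page for the owning team.
    """
    incidents = defaultdict(list)
    dashboard_only = []
    for alert in alerts:
        if alert.actionable:
            incidents[(alert.dataset, alert.symptom)].append(alert)
        else:
            dashboard_only.append(alert)
    pages = [
        {"team": owners.get(dataset, default_team), "dataset": dataset,
         "symptom": symptom, "count": len(group)}
        for (dataset, symptom), group in incidents.items()
    ]
    return pages, dashboard_only


owners = {"orders_fact": "team-orders"}  # hypothetical ownership map
alerts = [Alert("orders_fact", "freshness_slo_breach", True, f"task {i}") for i in range(50)]
alerts.append(Alert("orders_fact", "retry_succeeded", False))
pages, dashboard_only = route(alerts, owners)
print(pages)
print(len(dashboard_only))
```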
7. SLO-based alerting
Better than threshold alerts:
- Threshold: "alert if freshness > 30 min". Fires constantly during normal jitter.
- SLO-based: "alert if you're burning error budget faster than X% per hour". Fires only when a real incident is brewing.
Burn-rate alerts (Google SRE) produce fewer false positives and catch real incidents earlier. Modern monitoring tools (Datadog, Grafana SLO, Honeycomb) support them natively.
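A burn rate is the observed error rate divided by the error rate the SLO allows; a burn rate of 1 spends exactly the whole budget over the SLO window. The sketch below pages only when both a long and a short window burn fast. The default threshold of 14.4 (2% of a 30-day budget spent in one hour) follows the example in the Google SRE Workbook and should be treated as a starting point, not a rule; the function names are assumptions.

```python
def burn_rate(bad, total, slo):
    """Observed error rate divided by the allowed error rate (1 - SLO)."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo)


def should_page(long_window, short_window, slo, threshold=14.4):
    """Page only if both windows burn faster than `threshold`.

    Each window is a (bad, total) pair, e.g. late updates vs all updates.
    The short window makes the alert reset quickly once the problem stops.
    """
    return (burn_rate(*long_window, slo) > threshold
            and burn_rate(*short_window, slo) > threshold)


slo = 0.995
# Hypothetical counts: last 1 h and last 5 min of orders_fact updates.
print(burn_rate(1, 100, slo))                   # mild: about 2x the allowed rate
print(should_page((1, 100), (0, 10), slo))      # normal jitter, no page
print(should_page((20, 100), (5, 10), slo))     # sustained fast burn, page
```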
8. Runbooks
For every alert, a runbook:
- Symptoms — what does the alert mean?
- Quick checks — what to look at first.
- Common causes — top 3 historical root causes.
- Remediation — copy-pasteable commands.
- Escalation — who to wake up if you can't fix.
Runbooks live in your wiki, linked from the alert payload. Saves 30 min per page; sometimes saves the data.
9. Postmortems for data incidents
Same shape as service postmortems:
- Timeline.
- Impact (which downstream consumers were affected, for how long).
- Root cause (5 whys).
- Action items with owners + dates.
- Distribute broadly; learn collectively.
Blameless postmortem culture: focus on systems, not people. "How did our system allow X?" not "who deployed Y?"
10. Data contracts (preview of S40 / recap of S32)
Each contract has an embedded SLA:
    sla:
      freshness: 30m
      completeness: 99.9%
      on_breach: page producer
When the contract is broken, the producer is paged — not the consumer. Pushes ownership to the right team.
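A contract like the one above can be evaluated mechanically. The sketch below holds the contract as a plain dict to stay dependency-free (loading it from YAML is left out); the parsing helpers and `check_contract` are hypothetical names, not part of any data-contract tool.

```python
from datetime import timedelta

# The contract from above as a dict; in practice it would be loaded from a file.
CONTRACT = {"sla": {"freshness": "30m", "completeness": "99.9%", "on_breach": "page producer"}}

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text):
    """Parse strings like '30m' or '2h' into a timedelta."""
    return timedelta(**{_UNITS[text[-1]]: int(text[:-1])})


def parse_percent(text):
    """Parse strings like '99.9%' into a fraction."""
    return float(text.rstrip("%")) / 100


def check_contract(contract, observed_lag, observed_completeness):
    """Return which SLA terms are breached and the action the contract names."""
    sla = contract["sla"]
    breaches = []
    if observed_lag > parse_duration(sla["freshness"]):
        breaches.append("freshness")
    if observed_completeness < parse_percent(sla["completeness"]):
        breaches.append("completeness")
    return {"breaches": breaches, "action": sla["on_breach"] if breaches else None}


print(check_contract(CONTRACT, timedelta(minutes=12), 0.9995))
print(check_contract(CONTRACT, timedelta(minutes=47), 0.99))
```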
11. Tooling
- dbt — tests + freshness checks in pipelines.
- Great Expectations / Soda — rich expectations, scheduled or in-pipeline.
- Monte Carlo / Bigeye / Anomalo — auto-anomaly + lineage + impact graph; SaaS.
- OpenLineage — emit lineage from any orchestrator (Airflow, Dagster, Spark).
- DataHub / OpenMetadata — catalog + freshness display.
- Prometheus + Grafana — generic metrics; SLO-aware add-ons.
Start with dbt tests + Prometheus + Grafana + a scheduled freshness check. Buy a SaaS platform such as Monte Carlo once scale and cross-system complexity warrant it.
12. On-call for data
Yes, data teams should have an on-call rotation:
- Top-10 critical datasets get the page.
- Tier-2 datasets: business-hours response only.
- Tier-3: investigated weekly.
- Quarterly review of incident frequency, action-item completion.
Without on-call: a broken Sunday-morning pipeline blocks Monday's exec meeting. With: a sleepy DE acks the page, kicks the job, goes back to bed.
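The tiering above amounts to a small response policy. In this sketch the tier map, the dataset names, and the business-hours definition (Monday to Friday, 09:00 to 17:00 local time) are all assumptions to adapt.

```python
from datetime import datetime

# Hypothetical tier assignments; unknown datasets default to tier 3.
TIERS = {"orders_fact": 1, "marketing_touches": 2, "legacy_export": 3}


def is_business_hours(ts):
    """Assumed definition: Monday to Friday, 09:00 to 17:00."""
    return ts.weekday() < 5 and 9 <= ts.hour < 17


def response_for(dataset, ts, tiers=TIERS):
    """Map an incident on `dataset` at time `ts` to the tiered response."""
    tier = tiers.get(dataset, 3)
    if tier == 1:
        return "page on-call"
    if tier == 2:
        return "notify owner" if is_business_hours(ts) else "queue for next business day"
    return "add to weekly review"


sunday_morning = datetime(2024, 1, 7, 6, 0)
monday_morning = datetime(2024, 1, 8, 10, 0)
print(response_for("orders_fact", sunday_morning))
print(response_for("marketing_touches", sunday_morning))
print(response_for("marketing_touches", monday_morning))
print(response_for("legacy_export", monday_morning))
```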
13. Reality check
A 6-week observability rollout for an existing pipeline stack:
- Week 1–2: catalog top-10 datasets; assign owners.
- Week 3: define SLIs + SLOs per dataset; baseline current performance.
- Week 4: emit OpenLineage from orchestrator; ingest into DataHub.
- Week 5: build SLO-based burn-rate alerts; runbooks per alert.
- Week 6: stand up on-call rotation; first round of postmortems.
After this: data outages have a process. Teams stop discovering brokenness via Slack from execs.
Reading material
Books:
- Implementing Service Level Objectives — Alex Hidalgo (the canonical SLO/SLI/error-budget book; applies cleanly to data pipelines)
- Site Reliability Engineering — Beyer, Jones, Petoff, Murphy (the SLO + alerting + postmortem chapters; foundational)
- Data Quality Fundamentals — Barr Moses & Lior Gavish (the canonical "data observability" book by the Monte Carlo founders)
- Fundamentals of Data Engineering — Joe Reis & Matt Housley (the data-lifecycle chapter that frames pipeline observability)
- Streaming Systems — Akidau, Chernyak, Lax (the watermarks + late-data chapters that ground freshness SLOs)
Papers:
- Site Reliability Engineering (Google) — full book free online — Chapters 4 (SLOs) and 6 (Monitoring) are the canonical SLO definitions.
- The Five Pillars of Data Observability — Barr Moses (Monte Carlo manifesto) — the manifesto paper for the field.
- OpenLineage Specification — the canonical open-spec for pipeline lineage.
- Data Lineage in Practice — Lyft Engineering — the paper-ish case study behind Amundsen.
Official docs:
- OpenLineage spec + docs — the canonical open standard for pipeline lineage.
- Marquez — OpenLineage's reference impl — the reference metadata server.
- Great Expectations docs — the canonical OSS data-validation framework.
- Soda Core docs — the alternative data-quality testing framework.
- dbt — Tests + freshness — dbt's built-in tests + source freshness; the most-used DQ tooling.
- Datadog — SLO docs — how SLOs are operationalised on a popular APM.
- Grafana — SLO module — the open-source SLO + alerting platform.
- OpenTelemetry — Metrics & traces — the canonical OSS observability spec; increasingly used for data pipelines.
Blog posts:
- Monte Carlo — Data Reliability Engineering — the canonical "shift-left" data-quality essay.
- Barr Moses — Substack (Towards Data Science archive) — Moses's founding posts on "data downtime".
- Airbnb Engineering — DataOps & DQ posts — Wall (data-quality checks), Dataportal (data catalog), Minerva (metric layer).
- Netflix Tech Blog — Data quality posts — search Netflix's blog for DQ + observability; the canonical "data mesh" practitioner.
- Locally Optimistic — alert hygiene articles — practitioner essays on noise + alert fatigue from analytics engineers.
- Honeycomb — Observability blog — Charity Majors' company; the most influential blog on observability practice.
In-depth research material
- Marquez (OpenLineage reference) — github.com/MarquezProject/marquez — ~1.7k ★, the reference metadata server.
- Great Expectations — github.com/great-expectations/great_expectations — ~10k ★, the most-used OSS data-validation framework.
- Soda Core — github.com/sodadata/soda-core — the OSS data-quality framework with CI integration.
- Amundsen — github.com/amundsen-io/amundsen — Lyft's OSS data catalog.
- DataHub — github.com/datahub-project/datahub — ~10k ★, LinkedIn's OSS metadata platform.
- OpenMetadata — github.com/open-metadata/OpenMetadata — ~6k ★, modern OSS data catalog with built-in DQ.
- OpenLineage — github.com/OpenLineage/OpenLineage — ~2k ★, the canonical lineage spec.
- dbt — github.com/dbt-labs/dbt-core — ~10k ★, the canonical SQL transformation tool with built-in DQ.
- Elementary — github.com/elementary-data/elementary — dbt-native observability + anomaly detection.
- Monte Carlo blog — the canonical data-observability practitioner blog.
- Honeycomb blog — Charity Majors' company; the canonical observability-practice blog.
- Google SRE Workbook — free book; the SLO + alerting chapters apply directly.
Videos
- Data Observability in Practice — Barr Moses (Monte Carlo) — Barr Moses · 34 min — the founder's framing of the five-pillars model.
- SLOs in Practice — Liz Fong-Jones (Honeycomb / former Google SRE) — Liz Fong-Jones · 42 min — the canonical SLO + error-budget talk.
- Observability is a Many-Splendoured Thing — Charity Majors (Honeycomb) — Charity Majors · 47 min — the talk that defines what "observability" means vs monitoring.
- Building Reliable Data Pipelines with Airflow — Maxime Beauchemin (Apache Airflow creator) — Maxime Beauchemin · 38 min — the Airflow creator on what makes a pipeline observable.
- OpenLineage in Production — Julien Le Dem (Astronomer) — Julien Le Dem · 32 min — the OpenLineage co-creator on lineage as the foundation of DQ.
LeetCode — Design Logger Rate Limiter
- Link: https://leetcode.com/problems/design-logger-rate-limiter/
- Difficulty: Easy
- Why this problem: Suppress duplicate messages within a window — the exact primitive behind sane alert de-duplication.
- Time-box: 20 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Define SLI, SLO, SLA and write one of each for a real dataset.
- Compute and explain an error budget; argue what to do when burned.
- Wire OpenLineage + Prometheus to a typical dbt + Airflow stack.
- Design burn-rate alerts that don't page on noise.
- Run a blameless postmortem for a data incident.
- Solve design-logger-rate-limiter — sliding-window dedup, the alert hygiene primitive.