data engineering · intermediate · 12m · 2026-06-09

Observability for Data Pipelines — SLAs, SLOs, Freshness, Data Tests

Session 46 of the 48-session learning series.

Why this session matters

This is Session 46 of 48 in the DE track. Software observability matured a decade ago; data observability is catching up. The discipline that takes you from "I'll check the dashboard tomorrow" to "alert fired at 03:12, runbook executed, no humans woken" is what makes pipelines you can sleep through.

Agenda

SLA vs SLO vs SLI — definitions that matter
The 5 pillars revisited — freshness, volume, distribution, schema, lineage
Instrumentation — what to emit from every pipeline run
Alerting that doesn't burn out the team
Incident postmortems for data systems

Pre-read (skim before the session)

Deep dive

1. SLA, SLO, SLI

SLI — Service Level Indicator. The metric: "% of orders_fact updates completed within 30 min of source commit".
SLO — Service Level Objective. The target: "99.5% of updates within 30 min, monthly".
SLA — Service Level Agreement. The contract with a customer (internal or external): "if we miss SLO by > X, we credit Y".

Most teams need SLOs (internal targets); few need real SLAs (formal commitments). The two are commonly confused.
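To make the SLI/SLO split concrete, here is a minimal sketch that computes the "% of updates completed within 30 min" SLI and compares it against the 99.5% SLO. The function name and the sample latencies are made up for illustration.

```python
from datetime import timedelta


def sli_within_target(latencies: list[timedelta], target: timedelta) -> float:
    """SLI: fraction of updates that completed within `target` of the source commit."""
    if not latencies:
        return 1.0  # assumption: no updates in the window means nothing was missed
    on_time = sum(1 for latency in latencies if latency <= target)
    return on_time / len(latencies)


# Hypothetical commit-to-available latencies (in minutes) for ten orders_fact updates.
latencies = [timedelta(minutes=m) for m in (5, 12, 28, 31, 9, 14, 22, 7, 18, 25)]

sli = sli_within_target(latencies, timedelta(minutes=30))  # the measured indicator
slo_met = sli >= 0.995                                      # the internal target
```

The SLI is what you measure, the SLO is the comparison on the last line, and an SLA would be whatever consequence is attached when `slo_met` stays false.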

2. SLOs for data — what to measure

Common SLIs:

Freshness — now - max(event_ts) for each dataset.
Completeness — count(seen_today) / count(expected_today).
Accuracy — count(rows passing quality test) / count(rows).
Availability — uptime of dataset endpoint.
Latency — time from source event → available in target.

Per dataset, declare 1–3 SLIs with concrete SLOs. Track them weekly.
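The freshness, completeness and accuracy formulas above can be sketched in a few lines. This is an illustration only: the helper names, the row counts and the timestamps are assumptions, and in practice the inputs would come from queries against the dataset.

```python
from datetime import datetime, timedelta, timezone


def freshness(now: datetime, event_timestamps: list[datetime]) -> timedelta:
    """Freshness SLI: now - max(event_ts)."""
    return now - max(event_timestamps)


def ratio(numerator: int, denominator: int) -> float:
    """Shared helper for completeness and accuracy; an empty denominator yields 0.0."""
    return numerator / denominator if denominator else 0.0


# Made-up sample values.
now = datetime(2026, 6, 9, 3, 12, tzinfo=timezone.utc)
events = [now - timedelta(minutes=m) for m in (45, 20, 12)]

freshness_sli = freshness(now, events)   # newest event is 12 minutes old
completeness_sli = ratio(9_950, 10_000)  # count(seen_today) / count(expected_today)
accuracy_sli = ratio(9_900, 9_950)       # count(rows passing quality test) / count(rows)

meets_freshness_slo = freshness_sli <= timedelta(minutes=30)
```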

3. Error budget

Software SRE concept that applies cleanly to data:

Error budget = (1 - SLO) × length of the SLO window
SLO 99.5% over a 30-day month → 0.5% × 30 days × 24 h = 3.6 hours/month of allowed outage

If you've burned the budget, freeze feature work; focus on reliability. If you have budget left, take risks; ship faster.

This is counter-intuitive for data teams used to a 100%-perfection mindset. Once adopted, it transforms the team's relationship with risk.
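The error-budget arithmetic is easy to encode. The sketch below reproduces the 3.6 hours/month figure and adds a "remaining budget" helper; the function names and the 2.7 hours of SLO violation are hypothetical.

```python
def error_budget_hours(slo: float, window_days: int = 30) -> float:
    """Allowed hours out of SLO in the window: (1 - SLO) * window length."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24


def budget_remaining(slo: float, bad_hours: float, window_days: int = 30) -> float:
    """Fraction of the budget still unspent; a negative value means it is overspent."""
    budget = error_budget_hours(slo, window_days)
    return (budget - bad_hours) / budget


budget = error_budget_hours(0.995)        # about 3.6 hours for a 30-day month
remaining = budget_remaining(0.995, 2.7)  # about 0.25, i.e. a quarter of the budget left
freeze_feature_work = remaining <= 0      # the policy described above
```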

4. The 5 pillars instrumented

Pillar	What to emit	Sample alert
Freshness	latest event ts, ingestion lag	`lag > 30 min`
Volume	rows in / rows out per run	`< 50% of 7d median`
Distribution	mean, p95, null %, distinct count	`null % > 5%`
Schema	columns + types per run	`schema diff vs baseline`
Lineage	upstream tables touched	(no alert; UI surface)

Most pipelines emit only "success/fail". You need to emit metrics — Prometheus, OpenLineage, or proprietary.
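Two of the sample alerts in the table, the volume check against the 7-day median and the null-percentage check, can be sketched as plain functions. The names, defaults and row counts here are illustrative assumptions, not a specific tool's API.

```python
from statistics import median


def volume_alert(rows_today: int, rows_last_7_days: list[int], floor: float = 0.5) -> bool:
    """True when today's row count drops below `floor` times the trailing median."""
    if not rows_last_7_days:
        return False  # no baseline yet; stay quiet rather than guess
    return rows_today < floor * median(rows_last_7_days)


def null_pct_alert(null_rows: int, total_rows: int, max_null_pct: float = 5.0) -> bool:
    """True when the share of nulls in a column exceeds `max_null_pct` percent."""
    if total_rows == 0:
        return False
    return 100.0 * null_rows / total_rows > max_null_pct


# Made-up row counts for the previous seven runs.
history = [1000, 1100, 950, 1020, 990, 1050, 1010]

volume_breach = volume_alert(400, history)  # 400 is well under half the median
null_breach = null_pct_alert(60, 1000)      # 6% nulls exceeds the 5% limit
```

Using the median rather than the mean keeps one unusually large or small day from shifting the baseline.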

5. Instrumentation patterns

Per pipeline run, emit:

Start time, end time, duration.
Source tables read (with version/snapshot ID).
Destination tables written (with new version/snapshot ID).
Row counts (in / out per stage).
Schema fingerprint.
Job-level success / failure / error class.
Custom business metrics (revenue total, distinct users, ...).

OpenLineage emits the lineage + run metadata; pair it with StatsD/Prometheus for metrics.
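As a rough sketch of the per-run payload listed above, the record below uses only the standard library. `RunRecord`, `schema_fingerprint` and every field name are hypothetical; this is not the OpenLineage event schema, only an illustration of what one run could emit as a JSON line.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def schema_fingerprint(columns: dict[str, str]) -> str:
    """Stable hash of column names and types; column order does not matter."""
    canonical = json.dumps(sorted(columns.items()))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


@dataclass
class RunRecord:
    job: str
    started_at: datetime
    ended_at: datetime
    sources: dict[str, str]       # table -> snapshot ID read
    destinations: dict[str, str]  # table -> snapshot ID written
    rows_in: int
    rows_out: int
    schema_fp: str
    status: str                   # "success" or "failure"
    error_class: Optional[str] = None
    business_metrics: dict[str, float] = field(default_factory=dict)

    @property
    def duration_seconds(self) -> float:
        return (self.ended_at - self.started_at).total_seconds()

    def to_json(self) -> str:
        payload = asdict(self)
        payload["duration_seconds"] = self.duration_seconds
        return json.dumps(payload, default=str)  # datetimes are serialized as strings


# Made-up example run.
record = RunRecord(
    job="orders_fact_load",
    started_at=datetime(2026, 6, 9, 3, 0, tzinfo=timezone.utc),
    ended_at=datetime(2026, 6, 9, 3, 12, tzinfo=timezone.utc),
    sources={"raw.orders": "snap-101"},
    destinations={"orders_fact": "snap-202"},
    rows_in=10_000,
    rows_out=9_950,
    schema_fp=schema_fingerprint({"order_id": "bigint", "amount": "decimal"}),
    status="success",
    business_metrics={"revenue_total": 125_000.0},
)
line = record.to_json()
```

Emitting one such line per run gives the volume, schema and freshness checks something to read besides "success/fail".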

6. Alerts — sustainable design

Alert characteristics:

Actionable — someone can do something about it. Otherwise, dashboard, not alert.
Symptom-based — alert on user impact (freshness SLO breach), not on every internal hiccup.
De-duped — one alert per incident, not 50.
Routable — to the team that owns the dataset.
Escalation — if not acked in N minutes, page next-on-call.

Anti-pattern: 200 alerts per day, all "info-level"; team mutes the channel; real incident missed.

7. SLO-based alerting

Better than threshold alerts:

Threshold: "alert if freshness > 30 min". Fires constantly during normal jitter.
SLO-based: "alert if you're burning error budget faster than X% per hour". Fires only when a real incident is brewing.

Burn-rate alerts (Google SRE) produce fewer false positives and catch real incidents earlier. Modern monitoring tools (Datadog, Grafana SLO, Honeycomb) support them natively.
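A minimal sketch of a multi-window burn-rate check follows. Burn rate is the observed bad fraction divided by the allowed bad fraction (1 - SLO). The 14.4 threshold follows the Google SRE Workbook's example of a 1-hour window paired with a 5-minute window (a 14.4x burn for one hour consumes 2% of a 30-day budget). The function names and counts are assumptions for illustration.

```python
def burn_rate(bad: int, total: int, slo: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo)


def should_page(long_window: tuple[int, int], short_window: tuple[int, int],
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the burn is significant; the short window shows it is
    still happening. Each window is a (bad, total) pair of update counts.
    """
    return (burn_rate(*long_window, slo) > threshold
            and burn_rate(*short_window, slo) > threshold)


SLO = 0.995

# Real incident: 10% of updates late over the last hour, 20% over the last 5 minutes.
incident = should_page(long_window=(10, 100), short_window=(2, 10), slo=SLO)

# Jitter: a brief spike in the short window, but the hour as a whole is healthy.
jitter = should_page(long_window=(1, 100), short_window=(3, 10), slo=SLO)
```

With these numbers `incident` is true and `jitter` is false, which is the behaviour a plain "freshness > 30 min" threshold cannot give you.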

8. Runbooks

For every alert, a runbook:

Symptoms — what does the alert mean?
Quick checks — what to look at first.
Common causes — top 3 historical root causes.
Remediation — copy-pasteable commands.
Escalation — who to wake up if you can't fix.

Runbooks live in your wiki, linked from the alert payload. Saves 30 min per page; sometimes saves the data.

9. Postmortems for data incidents

Same shape as service postmortems:

Timeline.
Impact (which downstream consumers were affected, for how long).
Root cause (5 whys).
Action items with owners + dates.
Distribute broadly; learn collectively.

Blameless postmortem culture: focus on systems, not people. "How did our system allow X?" not "who deployed Y?"

10. Data contracts (preview of S40 / recap of S32)

Each contract has an embedded SLA:

sla:
  freshness: 30m
  completeness: 99.9%
  on_breach: page producer

When the contract is broken, the producer is paged — not the consumer. Pushes ownership to the right team.
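To show how such an embedded SLA might be evaluated, the sketch below mirrors the contract as a Python dict and checks observed values against it. The parsing rules ("30m" style durations, "99.9%" style percentages) and all function names are assumptions; a real contract format would define its own grammar.

```python
import re
from datetime import timedelta

# The contract's sla block, mirrored as a dict for this illustration.
contract_sla = {"freshness": "30m", "completeness": "99.9%", "on_breach": "page producer"}

_UNITS = {"s": "seconds", "m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse strings like '30m' or '2h' (assumed format) into a timedelta."""
    match = re.fullmatch(r"(\d+)([smhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(**{_UNITS[match.group(2)]: int(match.group(1))})


def parse_percent(text: str) -> float:
    """Parse strings like '99.9%' into a fraction between 0 and 1."""
    return float(text.strip().rstrip("%")) / 100.0


def breaches(sla: dict[str, str], observed_lag: timedelta,
             observed_completeness: float) -> list[str]:
    """Return the names of the SLA terms that the observed values violate."""
    found = []
    if observed_lag > parse_duration(sla["freshness"]):
        found.append("freshness")
    if observed_completeness < parse_percent(sla["completeness"]):
        found.append("completeness")
    return found


late = breaches(contract_sla, timedelta(minutes=45), 0.9995)  # stale but complete
action = contract_sla["on_breach"] if late else None          # who gets paged
```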

11. Tooling

dbt — tests + freshness checks in pipelines.
Great Expectations / Soda — rich expectations, scheduled or in-pipeline.
Monte Carlo / Bigeye / Anomalo — auto-anomaly + lineage + impact graph; SaaS.
OpenLineage — emit lineage from any orchestrator (Airflow, Dagster, Spark).
DataHub / OpenMetadata — catalog + freshness display.
Prometheus + Grafana — generic metrics; SLO-aware add-ons.

Start with dbt tests + Prometheus + Grafana + a scheduled freshness check. Buy a SaaS platform such as Monte Carlo once scale and cross-system complexity warrant it.

12. On-call for data

Yes, data teams should have an on-call rotation:

Top-10 critical datasets get the page.
Tier-2 datasets: business-hours response only.
Tier-3: investigated weekly.
Quarterly review of incident frequency, action-item completion.

Without on-call: a broken Sunday-morning pipeline blocks Monday's exec meeting. With: a sleepy DE acks the page, kicks the job, goes back to bed.
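The tiering above amounts to a small routing rule. The sketch below assumes business hours of 09:00 to 17:00, Monday to Friday, and uses made-up response labels; both are placeholders for whatever your paging tool and team calendar define.

```python
from datetime import datetime


def is_business_hours(at: datetime) -> bool:
    """Assumed definition: Monday to Friday, 09:00 to 17:00 local time."""
    return at.weekday() < 5 and 9 <= at.hour < 17


def response_for(tier: int, at: datetime) -> str:
    """Map a dataset tier and incident time to a response (labels are hypothetical)."""
    if tier == 1:
        return "page"           # critical datasets page at any hour
    if tier == 2:
        return "notify" if is_business_hours(at) else "ticket"
    return "weekly_review"      # tier 3 is investigated weekly


sunday_morning = datetime(2026, 6, 7, 6, 0)   # a Sunday
tuesday_morning = datetime(2026, 6, 9, 10, 0)  # a Tuesday

critical = response_for(1, sunday_morning)   # the sleepy DE gets paged
secondary = response_for(2, sunday_morning)  # waits for business hours
```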

13. Reality check

A 6-week observability rollout for an existing pipeline stack:

Week 1–2: catalog top-10 datasets; assign owners.
Week 3: define SLIs + SLOs per dataset; baseline current performance.
Week 4: emit OpenLineage from orchestrator; ingest into DataHub.
Week 5: build SLO-based burn-rate alerts; runbooks per alert.
Week 6: stand up on-call rotation; first round of postmortems.

After this: data outages have a process. Teams stop discovering brokenness via Slack from execs.

Reading material

Books:

Implementing Service Level Objectives — Alex Hidalgo (the canonical SLO/SLI/error-budget book; applies cleanly to data pipelines)
Site Reliability Engineering — Beyer, Jones, Petoff, Murphy (the SLO + alerting + postmortem chapters; foundational)
Data Quality Fundamentals — Barr Moses & Lior Gavish (the canonical "data observability" book by the Monte Carlo founders)
Fundamentals of Data Engineering — Joe Reis & Matt Housley (the data-lifecycle chapter that frames pipeline observability)
Streaming Systems — Akidau, Chernyak, Lax (the watermarks + late-data chapters that ground freshness SLOs)

Papers:

Site Reliability Engineering (Google) — full book free online — Chapters 4 (SLOs) and 6 (Monitoring) are the canonical SLO definitions.
The Five Pillars of Data Observability — Barr Moses (Monte Carlo manifesto) — the manifesto paper for the field.
OpenLineage Specification — the canonical open-spec for pipeline lineage.
Data Lineage in Practice — Lyft Engineering — the paper-ish case study behind Amundsen.

Official docs:

OpenLineage spec + docs — the canonical open standard for pipeline lineage.
Marquez — OpenLineage's reference impl — the reference metadata server.
Great Expectations docs — the canonical OSS data-validation framework.
Soda Core docs — the alternative data-quality testing framework.
dbt — Tests + freshness — dbt's built-in tests + source freshness; the most-used DQ tooling.
Datadog — SLO docs — how SLOs are operationalised on a popular APM.
Grafana — SLO module — the open-source SLO + alerting platform.
OpenTelemetry — Metrics & traces — the canonical OSS observability spec; increasingly used for data pipelines.

Blog posts:

Monte Carlo — Data Reliability Engineering — the canonical "shift-left" data-quality essay.
Barr Moses — Substack (Towards Data Science archive) — Moses's founding posts on "data downtime".
Airbnb Engineering — DataOps & DQ posts — Wall (data-quality checks), Dataportal (data catalog), Minerva (metric layer).
Netflix Tech Blog — Data quality posts — search Netflix's blog for DQ + observability; the canonical "data mesh" practitioner.
Locally Optimistic — alert hygiene articles — practitioner essays on noise + alert fatigue from analytics engineers.
Honeycomb — Observability blog — Charity Majors' company; the most influential blog on observability practice.

In-depth research material

Marquez (OpenLineage reference) — github.com/MarquezProject/marquez — ~1.7k ★, the reference metadata server.
Great Expectations — github.com/great-expectations/great_expectations — ~10k ★, the most-used OSS data-validation framework.
Soda Core — github.com/sodadata/soda-core — the OSS data-quality framework with CI integration.
Amundsen — github.com/amundsen-io/amundsen — Lyft's OSS data catalog.
DataHub — github.com/datahub-project/datahub — ~10k ★, LinkedIn's OSS metadata platform.
OpenMetadata — github.com/open-metadata/OpenMetadata — ~6k ★, modern OSS data catalog with built-in DQ.
OpenLineage — github.com/OpenLineage/OpenLineage — ~2k ★, the canonical lineage spec.
dbt — github.com/dbt-labs/dbt-core — ~10k ★, the canonical SQL transformation tool with built-in DQ.
Elementary — github.com/elementary-data/elementary — dbt-native observability + anomaly detection.
Monte Carlo blog — the canonical data-observability practitioner blog.
Honeycomb blog — Charity Majors' company; the canonical observability-practice blog.
Google SRE Workbook — free book; the SLO + alerting chapters apply directly.

Videos

Data Observability in Practice — Barr Moses (Monte Carlo) · 34 min — the founder's framing of the five-pillars model.
SLOs in Practice — Liz Fong-Jones (Honeycomb / former Google SRE) · 42 min — the canonical SLO + error-budget talk.
Observability is a Many-Splendoured Thing — Charity Majors (Honeycomb) · 47 min — the talk that defines what "observability" means vs monitoring.
Building Reliable Data Pipelines with Airflow — Maxime Beauchemin (Apache Airflow creator) · 38 min — the Airflow creator on what makes a pipeline observable.
OpenLineage in Production — Julien Le Dem (Astronomer) · 32 min — the OpenLineage co-creator on lineage as the foundation of DQ.

LeetCode — Design Logger Rate Limiter

Link: https://leetcode.com/problems/design-logger-rate-limiter/
Difficulty: Easy
Why this problem: Suppress duplicate messages within a window — the exact primitive behind sane alert de-duplication.
Time-box: 20 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Define SLI, SLO, SLA and write one of each for a real dataset.
Compute and explain an error budget; argue what to do when burned.
Wire OpenLineage + Prometheus to a typical dbt + Airflow stack.
Design burn-rate alerts that don't page on noise.
Run a blameless postmortem for a data incident.
Solve design-logger-rate-limiter — sliding-window dedup, the alert hygiene primitive.

