data engineeringintermediate 12m2026-06-09

Change Data Capture — Debezium, Outbox Pattern, Snapshot+Stream

Session 40 of the 48-session learning series.

Why this session matters

This is Session 40 of 48 in the DE track. CDC is how you keep two systems in sync without polling — the connective tissue between OLTP and OLAP, between microservices, between primary and search/cache. Get it right and your data lake stays seconds-fresh; get it wrong and you lose updates.

Agenda

What CDC is and why polling is the wrong answer
WAL/binlog mining — how Debezium really works
Snapshot + stream — bootstrapping correctness
Outbox pattern — atomic publish-with-write
Schema evolution and watermarks for CDC streams

Pre-read (skim before the session)

Deep dive

1. The problem CDC solves

You have a Postgres OLTP. You want:

A Snowflake replica updated within seconds.
A search index (Elastic / OpenSearch) updated within seconds.
An event stream to drive ML features.
An audit log of every change for compliance.

Bad options:

SELECT * WHERE updated_at > last_poll — misses deletes; misses non-updated_at columns; polling load on prod.
Application dual-write — write to DB and to Kafka in the same handler. Two-phase consistency problem.

CDC option: read the DB's write-ahead log (WAL). Every committed change is there, in order, with full row diffs.

2. How log mining works

Every transactional DB has a sequential log of every change for crash recovery:

Postgres: WAL.
MySQL: binary log.
SQL Server: transaction log.
Mongo: oplog.

A CDC tool reads the log:

Decodes each entry into {op: INSERT|UPDATE|DELETE, before: {...}, after: {...}}.
Publishes to a downstream sink (Kafka, S3, anywhere).
Tracks position (LSN, GTID) for resume after crash.

It's exactly-once on the producer side (the DB committed it), at-least-once on the sink side (you must dedup).

3. Debezium — the open-source workhorse

Architecture:

Kafka Connect framework + Debezium connectors.
One connector per DB → reads log → emits per-table Kafka topic.
Schema Registry holds Avro/JSON schema; auto-evolves.

[ Postgres WAL ]
      ↓
[ Debezium Postgres connector (Kafka Connect worker) ]
      ↓
[ Kafka: pg.public.users topic ]
      ↓
[ consumers: warehouse, search index, ML pipeline, ... ]

Each consumer is independent. Add new ones without touching the source.

4. Snapshot + stream

The cold-start problem: the WAL doesn't contain history before you turned CDC on.

Solution:

Snapshot phase — SELECT * FROM table for each table, publish each row as if it were an INSERT, in some boundary transaction.
Stream phase — switch to log mining from the LSN captured at snapshot start.
Order matters: snapshot must finish before stream catches up, or you get out-of-order writes.

Modern variants:

Incremental snapshots (Debezium 1.5+) — snapshot a chunk at a time, interleaved with stream. No long lock; resumable.
Backfill from warehouse + stream from now — if the warehouse already has history, skip the snapshot.

5. The outbox pattern

Problem: when a service writes a business event to DB and needs to publish to Kafka, the two are not atomic. Crash between them = inconsistency.

Solution: write to an outbox table in the same DB transaction as the business write. A CDC poll/connector reads the outbox and publishes to Kafka, then marks as published (or deletes).

BEGIN;
  INSERT INTO orders(...);
  INSERT INTO outbox(event_type, payload, status) VALUES ('OrderCreated', '{...}', 'PENDING');
COMMIT;

Separate process:

for row in outbox.select_pending():
    kafka.publish(topic=row.type, value=row.payload)
    outbox.mark_published(row.id)

CDC the outbox table with Debezium → no need for the custom poller. Best of both.

Trade-off: outbox grows; needs periodic cleanup.

6. Schema evolution

The DB schema changes. The CDC stream must:

Detect (Debezium does, via DDL events in some sources).
Emit a new schema version (Avro forward/backward compat).
Consumers handle the change (add new optional fields = no-op; drop field = consumers ignore or fail).

Rules:

Only add columns, never drop, in CDC-fed schemas.
Always nullable defaults.
Versioned schema registry; deprecate slowly.
Mark breaking changes; require consumer acknowledgement before producing.

7. Watermarks and lateness

CDC events have:

DB commit time (source timestamp).
WAL position (logical clock).
Sink ingestion time.

If consumers do windowed aggregations (S23), pick a watermark policy:

event_time = source commit time — most correct for OLAP.
Late events arrive after watermark; either drop or trigger restate.

For dedup: use (source_table, source_pk, LSN) as the unique key.

8. Performance and back-pressure

A busy OLTP can produce 100k+ row changes/sec. Connector throughput must keep up or:

WAL grows unbounded → DB out of disk.
Lag grows → downstream stale.

Patterns:

Keep connector workers near the DB (low network latency).
Filter at the connector — don't ship every column / every table; ship what's needed.
Multiple connectors per DB (per schema, per high-traffic table).
Monitor wal_lag_bytes — alert at 1 GB / N minutes behind.

9. Deletes are subtle

CDC delete events contain before row but after = null. Sinks must handle:

Warehouse: soft delete with deleted_at column? hard delete?
Search index: explicit delete by ID.
Aggregations: must subtract old contribution.

Hard deletes are easy to get wrong. Default to soft deletes in downstream stores for safety.

10. Consistency levels

Eventually consistent — downstream lags source by seconds-minutes; ok for analytics.
Read-your-writes — caller writes, then immediately reads from downstream; downstream might not have it yet. Solve with: read from source for that key; or sync wait until LSN reached.
Strong consistency across sinks — impossible without 2PC across systems. Don't promise this.

Be honest about consistency in API docs.

11. Failure recovery

The CDC connector restarts:

Reads its last committed offset (LSN).
Resumes from there.
DB must retain WAL beyond that LSN (configure retention!).

If you lose the offset state: re-snapshot. Cost: minutes to hours depending on data size.

Always store offset state durably (Kafka Connect uses a Kafka topic itself).

12. Reality check

CDC stack for a new project:

Postgres (or MySQL) as OLTP.
Debezium → Kafka.
Outbox table for explicit business events.
Per-domain consumers: warehouse loader, search index updater, ML feature stream.
Schema Registry (Confluent or Apicurio).
Lag dashboards + alerts.

CDC isn't the simplest pattern. It earns its complexity when you have 3+ downstream consumers of the same data. Below that, just dual-write carefully and move on.

Reading material

Books:

Designing Data-Intensive Applications — Martin Kleppmann (the logs + CDC + derived data chapters; the canonical reference)
Microservices Patterns — Chris Richardson (the outbox + transactional messaging chapters)
Streaming Systems — Akidau, Chernyak, Lax (the unified batch + stream model that CDC slots into)
Database Internals — Alex Petrov (the WAL + replication chapters CDC parses for a living)

Papers:

Apache Kafka: A Distributed Messaging System for Log Processing — Kreps et al. 2011 — the paper that made the CDC pattern mainstream.
The Log: What every software engineer should know — Jay Kreps 2013 (LinkedIn) — long-form essay; the foundation paper.

Official docs:

Debezium documentation — the canonical OSS CDC platform; supports PostgreSQL, MySQL, MongoDB, Cassandra, Oracle, SQL Server.
Debezium Connector reference (Postgres) — how WAL mining actually works.
Confluent — CDC with Kafka Connect — the modern Kafka-side of CDC.
PostgreSQL — Logical replication — the wire-format CDC ingests.
Materialize documentation — Sources — streaming SQL on top of CDC.
Apache Flink — Table API + CDC — modern stream processing on CDC streams.

Blog posts:

Martin Kleppmann — Turning the Database Inside Out (Confluent) — the canonical essay; required reading.
Gunnar Morling — Five Advantages of CDC for Database Replication — by the long-time Debezium lead.
microservices.io — Transactional outbox pattern (Chris Richardson) — the canonical pattern page.
Stripe Engineering — Online migrations at scale — schema evolution with CDC in production.
Netflix Tech Blog — DBLog: A Generic Change-Data-Capture Framework — Netflix's in-house CDC framework.

In-depth research material

Debezium — github.com/debezium/debezium — ~10k ★, the canonical OSS CDC platform.
Apache Kafka Connect — github.com/apache/kafka — ~29k ★, the connector framework Debezium sits on top of.
Maxwell's Daemon — github.com/zendesk/maxwell — ~4.4k ★, a simpler MySQL-only CDC daemon used at Zendesk.
PgQ / Skytools — github.com/pgq/pgq — Skype's classic Postgres queue/CDC daemon; still in production.
Materialize — github.com/MaterializeInc/materialize — ~5.8k ★, streaming SQL on CDC; the "always-fresh materialised view" engine.
RisingWave — github.com/risingwavelabs/risingwave — ~7k ★, the open-source streaming database with CDC sources.
Airbyte — github.com/airbytehq/airbyte — ~16k ★, ELT platform with built-in CDC connectors.
Fivetran — Engineering blog — the managed CDC market leader; many practitioner posts.
Shopify Engineering — Capturing Every Change From Shopify's Sharded Monolith — the CDC story at Shopify scale.
Netflix Tech Blog — Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores — the inverse direction (warehouse → KV store).
Confluent blog — No more silos: Integrating databases with Kafka + CDC — Tim Berglund's canonical primer.
Schema Registry — Avro / Protobuf compatibility rules — how schema evolution actually works in production CDC.

Videos

Change Data Capture — Tim Berglund (Confluent) — Tim Berglund · 43 min — the canonical CDC walkthrough; clearer than any docs page.
Building Streaming Data Pipelines with Debezium — Gunnar Morling — Gunnar Morling (long-time Debezium lead) · 44 min — the inside view of how Debezium works.
Turning the Database Inside Out — Martin Kleppmann (Strange Loop 2015) — Kleppmann · 47 min — the talk version of the canonical essay.
The Outbox Pattern in Action — Chris Richardson — Chris Richardson · 38 min — the canonical microservices-pattern talk.
CDC with Postgres logical replication, Debezium and Kafka — Coding Tech — 1 h 04 min — a hands-on demo from setup to streaming consumer.

LeetCode — Event Emitter

Link: https://leetcode.com/problems/event-emitter/
Difficulty: Medium
Why this problem: Build a publish/subscribe primitive in code — the same shape Debezium implements over a DB log.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain why polling is the wrong way to sync changes.
Describe WAL/binlog log mining at a high level.
Sketch the snapshot + stream bootstrap correctly.
Implement the outbox pattern in a service.
Handle schema evolution and deletes safely in CDC sinks.
Solve event-emitter — subscribe/publish/unsubscribe; CDC's interface in miniature.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

Designing a Distributed Job Queue — Reliability, Backoff, Idempotency

Testing, Mocks, Property-Based Tests, Mutation Testing