Change Data Capture — Debezium, Outbox Pattern, Snapshot+Stream
Session 40 of the 48-session learning series.
Why this session matters
This is Session 40 of 48 in the DE track. CDC is how you keep two systems in sync without polling — the connective tissue between OLTP and OLAP, between microservices, between primary and search/cache. Get it right and your data lake stays seconds-fresh; get it wrong and you lose updates.
Agenda
- What CDC is and why polling is the wrong answer
- WAL/binlog mining — how Debezium really works
- Snapshot + stream — bootstrapping correctness
- Outbox pattern — atomic publish-with-write
- Schema evolution and watermarks for CDC streams
Pre-read (skim before the session)
- Debezium architecture overview
- Microservices.io — Transactional Outbox pattern
- Martin Kleppmann — Turning the database inside out
- Confluent — Streaming ETL with Kafka Connect
Deep dive
1. The problem CDC solves
You have a Postgres OLTP. You want:
- A Snowflake replica updated within seconds.
- A search index (Elastic / OpenSearch) updated within seconds.
- An event stream to drive ML features.
- An audit log of every change for compliance.
Bad options:
SELECT * WHERE updated_at > last_poll— misses deletes; misses non-updated_at columns; polling load on prod.- Application dual-write — write to DB and to Kafka in the same handler. Two-phase consistency problem.
CDC option: read the DB's write-ahead log (WAL). Every committed change is there, in order, with full row diffs.
2. How log mining works
Every transactional DB has a sequential log of every change for crash recovery:
- Postgres: WAL.
- MySQL: binary log.
- SQL Server: transaction log.
- Mongo: oplog.
A CDC tool reads the log:
- Decodes each entry into
{op: INSERT|UPDATE|DELETE, before: {...}, after: {...}}. - Publishes to a downstream sink (Kafka, S3, anywhere).
- Tracks position (LSN, GTID) for resume after crash.
It's exactly-once on the producer side (the DB committed it), at-least-once on the sink side (you must dedup).
3. Debezium — the open-source workhorse
Architecture:
- Kafka Connect framework + Debezium connectors.
- One connector per DB → reads log → emits per-table Kafka topic.
- Schema Registry holds Avro/JSON schema; auto-evolves.
[ Postgres WAL ]
↓
[ Debezium Postgres connector (Kafka Connect worker) ]
↓
[ Kafka: pg.public.users topic ]
↓
[ consumers: warehouse, search index, ML pipeline, ... ]
Each consumer is independent. Add new ones without touching the source.
4. Snapshot + stream
The cold-start problem: the WAL doesn't contain history before you turned CDC on.
Solution:
- Snapshot phase —
SELECT * FROM tablefor each table, publish each row as if it were an INSERT, in some boundary transaction. - Stream phase — switch to log mining from the LSN captured at snapshot start.
- Order matters: snapshot must finish before stream catches up, or you get out-of-order writes.
Modern variants:
- Incremental snapshots (Debezium 1.5+) — snapshot a chunk at a time, interleaved with stream. No long lock; resumable.
- Backfill from warehouse + stream from now — if the warehouse already has history, skip the snapshot.
5. The outbox pattern
Problem: when a service writes a business event to DB and needs to publish to Kafka, the two are not atomic. Crash between them = inconsistency.
Solution: write to an outbox table in the same DB transaction as the business write. A CDC poll/connector reads the outbox and publishes to Kafka, then marks as published (or deletes).
BEGIN;
INSERT INTO orders(...);
INSERT INTO outbox(event_type, payload, status) VALUES ('OrderCreated', '{...}', 'PENDING');
COMMIT;
Separate process:
for row in outbox.select_pending():
kafka.publish(topic=row.type, value=row.payload)
outbox.mark_published(row.id)
CDC the outbox table with Debezium → no need for the custom poller. Best of both.
Trade-off: outbox grows; needs periodic cleanup.
6. Schema evolution
The DB schema changes. The CDC stream must:
- Detect (Debezium does, via DDL events in some sources).
- Emit a new schema version (Avro forward/backward compat).
- Consumers handle the change (add new optional fields = no-op; drop field = consumers ignore or fail).
Rules:
- Only add columns, never drop, in CDC-fed schemas.
- Always nullable defaults.
- Versioned schema registry; deprecate slowly.
- Mark breaking changes; require consumer acknowledgement before producing.
7. Watermarks and lateness
CDC events have:
- DB commit time (source timestamp).
- WAL position (logical clock).
- Sink ingestion time.
If consumers do windowed aggregations (S23), pick a watermark policy:
event_time = source commit time— most correct for OLAP.- Late events arrive after watermark; either drop or trigger restate.
For dedup: use (source_table, source_pk, LSN) as the unique key.
8. Performance and back-pressure
A busy OLTP can produce 100k+ row changes/sec. Connector throughput must keep up or:
- WAL grows unbounded → DB out of disk.
- Lag grows → downstream stale.
Patterns:
- Keep connector workers near the DB (low network latency).
- Filter at the connector — don't ship every column / every table; ship what's needed.
- Multiple connectors per DB (per schema, per high-traffic table).
- Monitor
wal_lag_bytes— alert at 1 GB / N minutes behind.
9. Deletes are subtle
CDC delete events contain before row but after = null. Sinks must handle:
- Warehouse: soft delete with
deleted_atcolumn? hard delete? - Search index: explicit delete by ID.
- Aggregations: must subtract old contribution.
Hard deletes are easy to get wrong. Default to soft deletes in downstream stores for safety.
10. Consistency levels
- Eventually consistent — downstream lags source by seconds-minutes; ok for analytics.
- Read-your-writes — caller writes, then immediately reads from downstream; downstream might not have it yet. Solve with: read from source for that key; or sync wait until LSN reached.
- Strong consistency across sinks — impossible without 2PC across systems. Don't promise this.
Be honest about consistency in API docs.
11. Failure recovery
The CDC connector restarts:
- Reads its last committed offset (LSN).
- Resumes from there.
- DB must retain WAL beyond that LSN (configure retention!).
If you lose the offset state: re-snapshot. Cost: minutes to hours depending on data size.
Always store offset state durably (Kafka Connect uses a Kafka topic itself).
12. Reality check
CDC stack for a new project:
- Postgres (or MySQL) as OLTP.
- Debezium → Kafka.
- Outbox table for explicit business events.
- Per-domain consumers: warehouse loader, search index updater, ML feature stream.
- Schema Registry (Confluent or Apicurio).
- Lag dashboards + alerts.
CDC isn't the simplest pattern. It earns its complexity when you have 3+ downstream consumers of the same data. Below that, just dual-write carefully and move on.
Reading material
Books:
- Designing Data-Intensive Applications — Martin Kleppmann (the logs + CDC + derived data chapters; the canonical reference)
- Microservices Patterns — Chris Richardson (the outbox + transactional messaging chapters)
- Streaming Systems — Akidau, Chernyak, Lax (the unified batch + stream model that CDC slots into)
- Database Internals — Alex Petrov (the WAL + replication chapters CDC parses for a living)
Papers:
- Apache Kafka: A Distributed Messaging System for Log Processing — Kreps et al. 2011 — the paper that made the CDC pattern mainstream.
- The Log: What every software engineer should know — Jay Kreps 2013 (LinkedIn) — long-form essay; the foundation paper.
Official docs:
- Debezium documentation — the canonical OSS CDC platform; supports PostgreSQL, MySQL, MongoDB, Cassandra, Oracle, SQL Server.
- Debezium Connector reference (Postgres) — how WAL mining actually works.
- Confluent — CDC with Kafka Connect — the modern Kafka-side of CDC.
- PostgreSQL — Logical replication — the wire-format CDC ingests.
- Materialize documentation — Sources — streaming SQL on top of CDC.
- Apache Flink — Table API + CDC — modern stream processing on CDC streams.
Blog posts:
- Martin Kleppmann — Turning the Database Inside Out (Confluent) — the canonical essay; required reading.
- Gunnar Morling — Five Advantages of CDC for Database Replication — by the long-time Debezium lead.
- microservices.io — Transactional outbox pattern (Chris Richardson) — the canonical pattern page.
- Stripe Engineering — Online migrations at scale — schema evolution with CDC in production.
- Netflix Tech Blog — DBLog: A Generic Change-Data-Capture Framework — Netflix's in-house CDC framework.
In-depth research material
- Debezium — github.com/debezium/debezium — ~10k ★, the canonical OSS CDC platform.
- Apache Kafka Connect — github.com/apache/kafka — ~29k ★, the connector framework Debezium sits on top of.
- Maxwell's Daemon — github.com/zendesk/maxwell — ~4.4k ★, a simpler MySQL-only CDC daemon used at Zendesk.
- PgQ / Skytools — github.com/pgq/pgq — Skype's classic Postgres queue/CDC daemon; still in production.
- Materialize — github.com/MaterializeInc/materialize — ~5.8k ★, streaming SQL on CDC; the "always-fresh materialised view" engine.
- RisingWave — github.com/risingwavelabs/risingwave — ~7k ★, the open-source streaming database with CDC sources.
- Airbyte — github.com/airbytehq/airbyte — ~16k ★, ELT platform with built-in CDC connectors.
- Fivetran — Engineering blog — the managed CDC market leader; many practitioner posts.
- Shopify Engineering — Capturing Every Change From Shopify's Sharded Monolith — the CDC story at Shopify scale.
- Netflix Tech Blog — Bulldozer: Batch Data Moving from Data Warehouse to Online Key-Value Stores — the inverse direction (warehouse → KV store).
- Confluent blog — No more silos: Integrating databases with Kafka + CDC — Tim Berglund's canonical primer.
- Schema Registry — Avro / Protobuf compatibility rules — how schema evolution actually works in production CDC.
Videos
- Change Data Capture — Tim Berglund (Confluent) — Tim Berglund · 43 min — the canonical CDC walkthrough; clearer than any docs page.
- Building Streaming Data Pipelines with Debezium — Gunnar Morling — Gunnar Morling (long-time Debezium lead) · 44 min — the inside view of how Debezium works.
- Turning the Database Inside Out — Martin Kleppmann (Strange Loop 2015) — Kleppmann · 47 min — the talk version of the canonical essay.
- The Outbox Pattern in Action — Chris Richardson — Chris Richardson · 38 min — the canonical microservices-pattern talk.
- CDC with Postgres logical replication, Debezium and Kafka — Coding Tech — 1 h 04 min — a hands-on demo from setup to streaming consumer.
LeetCode — Event Emitter
- Link: https://leetcode.com/problems/event-emitter/
- Difficulty: Medium
- Why this problem: Build a publish/subscribe primitive in code — the same shape Debezium implements over a DB log.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain why polling is the wrong way to sync changes.
- Describe WAL/binlog log mining at a high level.
- Sketch the snapshot + stream bootstrap correctly.
- Implement the outbox pattern in a service.
- Handle schema evolution and deletes safely in CDC sinks.
- Solve
event-emitter— subscribe/publish/unsubscribe; CDC's interface in miniature.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.