data engineering · intermediate · 12m · 2026-06-09

Streaming with Flink/Spark — Watermarks, Windows, State

Session 23 of the 48-session learning series.

Why this session matters

This is Session 23 of 48 in the Data Engineering track. Like the rest of the series, it covers one focused topic at a pace that leaves time to absorb it rather than rush.

Agenda

Event time vs processing time — and why everyone confuses them
Watermarks — how systems decide "time has passed"
Windows — tumbling, sliding, session, global
State management — keyed state, broadcast state, checkpoints, savepoints
Flink vs Spark Structured Streaming — choosing for your workload

Pre-read (skim before the session)

Deep dive

1. Why streaming is different

Batch: bounded dataset, you wait for all of it. Streaming: unbounded, you produce answers as data arrives. The question becomes: when do I know I've seen "enough" to produce an answer?

The fundamental insight (Dataflow paper, 2015): you cannot answer this perfectly. You can only trade between completeness, latency, and cost. Watermarks are how systems pick a point on that trade-off.

2. Event time vs processing time

Event time — when the event happened in the real world (in the payload).
Processing time — when the streaming system observed it.

The difference matters because events are delayed by network hops, batching, offline mobile clients, and so on. A "user clicked at 12:00:01" event might arrive at 12:00:05, or at 12:30:00.

Almost always, you want event-time semantics. Processing-time aggregates are non-deterministic across replays.
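The replay claim can be checked with a small sketch. It assumes hypothetical events carrying both timestamps as integer seconds, and a replay that shifts every arrival time by one day; none of these names come from a real engine.

```python
from collections import Counter

def count_per_minute(events, time_field):
    """Count events per one-minute bucket, keyed by the chosen timestamp field."""
    return Counter(e[time_field] // 60 for e in events)

# Hypothetical events; timestamps are seconds since an arbitrary epoch.
# The third event happened in minute 0 but arrived half an hour later.
first_run = [
    {"event_time": 1, "processing_time": 5},
    {"event_time": 62, "processing_time": 63},
    {"event_time": 3, "processing_time": 1800},
]

# Replaying the same log a day later: event times are unchanged, arrival times are new.
replay = [dict(e, processing_time=e["processing_time"] + 86_400) for e in first_run]

event_time_counts = count_per_minute(first_run, "event_time")
same_on_replay = event_time_counts == count_per_minute(replay, "event_time")
processing_differs = (
    count_per_minute(first_run, "processing_time")
    != count_per_minute(replay, "processing_time")
)
```

Bucketing by event_time gives the same counts on both runs, while bucketing by processing_time depends on when the job happened to run.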

3. Watermarks — "time is X-ish"

A watermark W(t) is a claim: "I believe all events with event time ≤ t have arrived." When the watermark passes the end of a window, that window can fire.

Two strategies:

Perfect watermark — only achievable if you can characterise sources perfectly.
Heuristic watermark — based on observed max event time minus a slack. E.g., W = max_seen_event_time - 5 seconds.

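Late events (those arriving after the watermark has passed the end of their window) are handled by allowed lateness: the window's state is kept for an extra period so results can be updated, and events arriving after that period are dropped.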
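A minimal sketch of that decision, assuming all values are seconds on the same event-time axis. The function name and the three labels are made up for illustration, and exact boundary conditions differ between engines.

```python
def classify(window_end, watermark, allowed_lateness):
    """Decide what happens to an event that belongs to the window ending at window_end."""
    if watermark < window_end:
        return "on_time"      # window has not fired yet; event joins the first result
    if watermark < window_end + allowed_lateness:
        return "late_update"  # window already fired; state is kept and the result is updated
    return "dropped"          # window state has been purged


# Window [0, 5) with 10 seconds of allowed lateness.
outcomes = [classify(5, wm, 10) for wm in (3, 8, 15)]
```

Here outcomes covers the three cases in order: before the window fires, inside the lateness period, and after the state is gone.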

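events (arrival order):  12:00:01   12:00:03   12:00:02 (out of order)   12:00:05
max event time seen:     12:00:01   12:00:03   12:00:03                  12:00:05
watermark = max - 2s:    11:59:59   12:00:01   12:00:01                  12:00:03

The 12:00:02 event arrives out of order, but it is still ahead of the watermark (12:00:01), so it counts as on time.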

Window [12:00:00, 12:00:05) fires when watermark crosses 12:00:05 → when we see an event with ts ≥ 12:00:07 (because of 2s slack).
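The trace above can be reproduced with a short simulation. This is a sketch of the heuristic strategy only, with timestamps written as seconds past 12:00:00; the function name is hypothetical, and real engines emit watermarks periodically rather than per event.

```python
def trace(event_times, slack, window_end):
    """Return (event_time, max_seen, watermark, fired) for each arriving event.

    Heuristic watermark: max event time seen so far minus a fixed slack.
    """
    max_seen = float("-inf")
    rows = []
    for ts in event_times:
        max_seen = max(max_seen, ts)
        watermark = max_seen - slack
        rows.append((ts, max_seen, watermark, watermark >= window_end))
    return rows


# Same arrivals as the trace, plus a 12:00:07 event that pushes the watermark to 12:00:05.
rows = trace([1, 3, 2, 5, 7], slack=2, window_end=5)
for row in rows:
    print(row)
```

The window [12:00:00, 12:00:05) stays open for the first four events and fires only on the 12:00:07 arrival.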

4. Windows

Tumbling — fixed-size, non-overlapping. [0,5), [5,10), [10,15). The default for periodic reports.
Sliding — fixed-size, overlapping. [0,5), [1,6), [2,7). Rolling stats.
Session — gap-based; groups events separated by less than the gap duration. User sessions.
Global — one window for the whole stream; only fires on triggers.
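The first three window types reduce to simple assignment rules. The sketch below assumes integer timestamps and windows aligned to zero; the function names are illustrative and not any engine's API.

```python
def tumbling(ts, size):
    """Return the single [start, end) tumbling window containing ts."""
    start = ts - ts % size
    return (start, start + size)


def sliding(ts, size, slide):
    """Return every [start, end) sliding window containing ts, earliest first."""
    last_start = ts - ts % slide
    starts = range(last_start, ts - size, -slide)  # all starts greater than ts - size
    return [(s, s + size) for s in reversed(starts)]


def sessions(timestamps, gap):
    """Group timestamps into sessions; a silence of at least gap starts a new session."""
    out = []
    for ts in sorted(timestamps):
        if out and ts - out[-1][-1] < gap:
            out[-1].append(ts)
        else:
            out.append([ts])
    return out


print(tumbling(7, 5))
print(sliding(4, 5, 1))
print(sessions([1, 2, 3, 10, 11, 30], gap=5))
```

An event belongs to exactly one tumbling window, to size / slide sliding windows, and to a session whose bounds are only known once the gap has elapsed.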

Trigger = "when should this window emit?". Default: at watermark. Custom: every N elements, every N seconds, on watermark + early/late, etc.

5. State management

Streaming = batch with state. Every stateful operation (counts per key, joins, top-K) carries state across events.

Keyed state — partitioned by key. count_per_user["alice"] = 42.

Operator state — non-keyed; e.g., Kafka source offsets.

Broadcast state — small reference data sent to every operator instance.
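To make keyed state concrete, here is a toy model of a counting operator running with several parallel subtasks. The class and its hashing rule are assumptions for illustration; real engines use a stable hash over key groups.

```python
from collections import defaultdict


class KeyedCounter:
    """Toy stand-in for a keyed-state operator running with `parallelism` subtasks."""

    def __init__(self, parallelism):
        self.parallelism = parallelism
        # One independent state map per subtask; no subtask sees another's keys.
        self.state = [defaultdict(int) for _ in range(parallelism)]

    def subtask_for(self, key):
        # Sum of bytes keeps this sketch deterministic across runs.
        return sum(key.encode()) % self.parallelism

    def process(self, key):
        counts = self.state[self.subtask_for(key)]
        counts[key] += 1
        return counts[key]


counter = KeyedCounter(parallelism=4)
results = [counter.process(k) for k in ["alice", "bob", "alice", "alice"]]
```

Because every event for a given key is routed to the same subtask, the count for "alice" lives in exactly one state map and needs no cross-subtask coordination.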

State backends:

Memory — fast, fits in heap. Lose it on crash unless checkpointed.
RocksDB (the usual Flink choice for large state) — on-disk, LSM-tree based. Slower per operation, but terabytes of state are fine. Snapshots go to S3/HDFS.

6. Checkpoints and savepoints (Flink)

Checkpoint — automatic, triggered every N seconds. Used for failure recovery. Cleared after retention. Cheap, frequent.

Savepoint — manual, used to upgrade code or migrate state. More expensive (full).

On failure: Flink restores from last checkpoint, replays Kafka from checkpointed offsets. End-to-end exactly-once requires:

Source replay (Kafka with consumer offsets in checkpoint).
State recovery (checkpoint).
Sink idempotency or transactionality (Kafka transactional producer; database upsert with primary key).
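The three requirements can be seen interacting in a small simulation. It is a sketch under simplifying assumptions: the log is an in-memory list standing in for a Kafka partition, a checkpoint is an (offset, counts) pair taken every two records, and both sinks are plain Python containers.

```python
def process(log, checkpoint, upsert_sink, append_sink, checkpoint_every=2, crash_at=None):
    """Consume log from the checkpointed offset and return (checkpoint, finished).

    checkpoint is (next_offset, counts): source position and operator state
    captured together, which is what makes the replay consistent.
    """
    start, counts = checkpoint[0], dict(checkpoint[1])
    for offset in range(start, len(log)):
        if offset == crash_at:
            return checkpoint, False            # simulated crash: fall back to last checkpoint
        key = log[offset]
        counts[key] = counts.get(key, 0) + 1
        upsert_sink[key] = counts[key]          # idempotent: keyed overwrite
        append_sink.append((key, counts[key]))  # not idempotent: replays add rows
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = (offset + 1, dict(counts))
    return (len(log), counts), True


log = ["a", "b", "a", "a", "b"]
upserts, appends = {}, []
checkpoint, finished = process(log, (0, {}), upserts, appends, crash_at=3)
checkpoint, finished = process(log, checkpoint, upserts, appends)  # restore and replay
```

After recovery the upsert sink holds the correct final counts, while the append sink contains the record at offset 2 twice, because it was written once before the crash and once during replay.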

7. Spark Structured Streaming

Spark's model: streaming = infinitely growing table. Each micro-batch is a SQL query over the new rows + state.

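from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window

spark = SparkSession.builder.getOrCreate()

# Kafka rows carry raw bytes in `value`; the broker address and JSON schema are placeholders.
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "events")
    .load()
    .select(from_json(col("value").cast("string"), "user_id STRING, event_time TIMESTAMP").alias("e"))
    .select("e.*"))

per_user = (events
    .withWatermark("event_time", "10 minutes")
    .groupBy(window("event_time", "5 minutes"), "user_id")
    .count())

# File-based sinks such as Delta need a checkpoint location; this path is a placeholder.
(per_user.writeStream.format("delta").outputMode("append")
    .option("checkpointLocation", "/data/out/_checkpoints").start("/data/out"))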

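Micro-batch latency floor: roughly hundreds of milliseconds to seconds. Spark 2.3 introduced an experimental "continuous processing" mode with millisecond-level latency, but it supports only map-like operations (no aggregations) and gives at-least-once guarantees.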

8. Flink — true streaming

Flink processes record-by-record. Latency floor: single-digit ms. State is first-class; APIs are more low-level.

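// Event-time windows only fire once watermarks advance, so assign timestamps and watermarks.
// The 5 s out-of-orderness bound and Event::getEventTime are illustrative assumptions.
DataStream<Event> events = env
  .addSource(new FlinkKafkaConsumer<>(...))  // newer Flink releases use KafkaSource instead
  .assignTimestampsAndWatermarks(
      WatermarkStrategy.<Event>forBoundedOutOfOrderness(Duration.ofSeconds(5))
          .withTimestampAssigner((event, ts) -> event.getEventTime()));
events
  .keyBy(Event::getUserId)
  .window(TumblingEventTimeWindows.of(Time.minutes(5)))
  .reduce((a, b) -> a.merge(b))
  .addSink(new FlinkKafkaProducer<>(...));   // likewise replaced by KafkaSink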

9. Choosing Flink vs Spark Structured Streaming

Dimension	Flink	Spark SS
Latency floor	Single-digit ms	~100 ms to seconds (micro-batch)
Stateful ops	First-class, advanced	Improving, simpler API
SQL	Yes (mature)	Yes (very mature)
Batch story	Reasonable	Best-in-class
Operations	Dedicated cluster	Reuse Spark infra
Community / ecosystem	Strong in streaming	Wider

Pick Flink if streaming is core to your business (Uber, Netflix, payment processors).

Pick Spark Structured Streaming if you already run Spark for batch and need "good enough" streaming — saves the operational tax of a second engine.

10. Common bugs

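Wall-clock for timestamp — stamping rows with now() in the sink instead of using event_time. Results change on every replay.
No watermark configured — event-time windows never fire. Always assign a watermark strategy when using event time; processing-time windows fire off the wall clock and do not need one.
Allowed lateness too generous — every window's state is retained until the lateness period expires, so state balloons and the job can OOM.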
State backend mismatch — RocksDB with tiny state has overhead; switch to heap.
Sink not idempotent — exactly-once at the engine doesn't mean exactly-once at the DB.
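The allowed-lateness bug can be sized with rough arithmetic. The formula below is an approximation under stated assumptions: tumbling windows, a heuristic watermark that trails the newest event by a fixed delay, and a key that receives events in every window. The function name is hypothetical.

```python
import math


def open_windows_per_key(window_size, allowed_lateness, watermark_delay):
    """Approximate number of tumbling windows held in state per active key.

    A window [s, s + size) is kept until the watermark reaches s + size + lateness,
    and the watermark trails the newest event by watermark_delay. All in seconds.
    """
    retained_span = window_size + allowed_lateness + watermark_delay
    return math.ceil(retained_span / window_size)


no_lateness = open_windows_per_key(300, 0, 5)
one_day_lateness = open_windows_per_key(300, 86_400, 5)
```

With 5-minute windows, moving from no allowed lateness to one day multiplies the windows held per key from 2 to 290 under these assumptions, which is the state growth the bullet warns about.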

11. Real production numbers (a Flink job we operated)

500 K events/sec from Kafka.
5-minute tumbling window, key = (user_id, event_type).
50 GB state in RocksDB.
Checkpoint every 30 s; full size ~12 GB, incremental ~200 MB.
p99 end-to-end latency: 1.2 s.
Hardware: 24 task managers × 8 vCPU × 32 GB.

After tuning (parallelism = 96, RocksDB block cache = 8 GB per TM, async snapshots): p99 down to 600 ms, checkpoint time halved.
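Some derived figures help put these numbers in context. The arithmetic below uses only the values listed above and assumes load is spread evenly across subtasks, which real key distributions rarely achieve.

```python
events_per_sec = 500_000
window_sec = 5 * 60
task_managers, vcpu_per_tm, mem_gb_per_tm = 24, 8, 32
parallelism = 96
state_gb = 50
incremental_mb, checkpoint_interval_sec = 200, 30
block_cache_gb_per_tm = 8

events_per_window = events_per_sec * window_sec             # events aggregated per 5-minute window
events_per_subtask = events_per_sec / parallelism           # per-second load on each parallel subtask
total_vcpu = task_managers * vcpu_per_tm                    # cluster-wide vCPUs
total_mem_gb = task_managers * mem_gb_per_tm                # cluster-wide memory
state_gb_per_subtask = state_gb / parallelism               # average RocksDB state per subtask
checkpoint_mb_per_sec = incremental_mb / checkpoint_interval_sec  # sustained upload rate
total_block_cache_gb = task_managers * block_cache_gb_per_tm      # cache across all TMs

print(events_per_window, round(events_per_subtask), total_vcpu, total_mem_gb)
print(round(state_gb_per_subtask, 2), round(checkpoint_mb_per_sec, 1), total_block_cache_gb)
```

Each window covers 150 million events, each subtask handles about 5,200 events per second, and the tuned block cache (192 GB in total) is larger than the 50 GB of state.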

Reading material

Books:

Streaming Systems — Tyler Akidau, Slava Chernyak, Reuven Lax (O'Reilly, the canonical book on streaming; written by the Dataflow team)
Stream Processing with Apache Flink — Hueske & Kalavri (the practical Flink companion)
Designing Data-Intensive Applications — Martin Kleppmann (ch. 11: Stream Processing)

Papers:

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost — Akidau et al. 2015 (VLDB) — the unifying theory of windowing + watermarks.
MillWheel: Fault-Tolerant Stream Processing at Internet Scale — Akidau et al. 2013 — the Google system Dataflow grew from.
Apache Flink: Stream and Batch Processing in a Single Engine — Carbone et al. 2015 — the Flink architecture paper.
Discretized Streams — Zaharia et al. 2013 (Spark Streaming) — the micro-batch model.

Official docs:

Flink — Stateful Stream Processing — the conceptual model: state, time, exactly-once.
Flink — Windowing — tumbling, sliding, session, global.
Spark Structured Streaming Programming Guide — the table-of-streams abstraction.
Kafka Streams — Architecture — the embeddable library approach.

Blog posts:

The world beyond batch: Streaming 101 — Tyler Akidau (O'Reilly) — the essay that taught a generation what watermarks are.
The world beyond batch: Streaming 102 — Tyler Akidau (O'Reilly) — the windowing + triggers follow-up.
Watermarks in Apache Flink Made Easy — Ververica — the practical "OK but how do they actually work" post.

In-depth research material

Apache Flink — github.com/apache/flink — ~24k ★, the reference stateful stream engine.
Apache Beam — github.com/apache/beam — ~8k ★, the unified batch+stream SDK (Dataflow model).
Pathway — github.com/pathwaycom/pathway — ~30k ★, Python-first incremental stream framework.
Bytewax — github.com/bytewax/bytewax — ~1.6k ★, Python streaming on a Rust engine.
Materialize — github.com/MaterializeInc/materialize — streaming SQL, differential dataflow.
RisingWave — github.com/risingwavelabs/risingwave — ~7k ★, Postgres-compatible streaming SQL.
Flink Forward conference talks (YouTube) — the canonical streaming conference; production case studies.
Uber — Streaming SQL with Apache Flink — AthenaX, real-time SQL at Uber scale.
Netflix — Keystone real-time stream processing — Netflix's Flink-on-Mesos platform.
The Definitive Guide to Apache Pulsar Functions and Connectors — Pulsar's take on stream processing.

Videos

Watermarks in Apache Flink — Flink Forward · 41 min — the official "how watermarks work" deep-dive.
Streaming Systems — Tyler Akidau — Tyler Akidau · 50 min — the Strange Loop talk by the Streaming Systems author.
Apache Flink — What is it and what does it do? — Stephan Ewen — Flink PMC chair · 35 min — the high-level Flink intro from a co-creator.
Spark Structured Streaming Deep Dive — Tathagata Das — Spark Streaming creator · 50 min — micro-batch vs continuous, watermarks in Spark.
Event-Time Processing — Kostas Tzoumas (Ververica) — Flink co-creator · 30 min — the cleanest explanation of why event-time matters.

LeetCode — Sliding Window Maximum

Link: https://leetcode.com/problems/sliding-window-maximum/
Difficulty: Hard
Why this problem: Monotonic deque of indices — exact pattern a streaming window aggregator uses.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Distinguish event time and processing time; pick one for analytics.
Explain watermarks and allowed lateness; trace a late event.
Pick a window type for: top-K every 5 min, session lengths, daily aggregates.
Configure end-to-end exactly-once across Kafka → Flink → Kafka.
Choose Flink vs Spark SS for a given workload.
Solve sliding-window-maximum — monotonic deque, the textbook streaming window kernel.

