Streaming with Flink/Spark — Watermarks, Windows, State
Session 23 of the 48-session learning series.
Why this session matters
This is Session 23 of 48 in the Data Engineering track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- Event time vs processing time — and why everyone confuses them
- Watermarks — how systems decide "time has passed"
- Windows — tumbling, sliding, session, global
- State management — keyed state, broadcast state, checkpoints, savepoints
- Flink vs Spark Structured Streaming — choosing for your workload
Pre-read (skim before the session)
- Tyler Akidau — The world beyond batch (Parts I & II)
- Flink — Concepts: Stateful Stream Processing
- Spark Structured Streaming Programming Guide
- Dataflow Model paper (Akidau et al., 2015)
Deep dive
1. Why streaming is different
Batch: bounded dataset, you wait for all of it. Streaming: unbounded, you produce answers as data arrives. The question becomes: when do I know I've seen "enough" to produce an answer?
The fundamental insight (Dataflow paper, 2015): you cannot answer this perfectly. You can only trade between completeness, latency, and cost. Watermarks are how systems pick a point on that trade-off.
2. Event time vs processing time
- Event time — when the event happened in the real world (in the payload).
- Processing time — when the streaming system observed it.
Difference matters because events are delayed by network, batching, mobile-offline, etc. A "user clicked at 12:00:01" event might arrive at 12:00:05 — or 12:30:00.
Almost always, you want event-time semantics. Processing-time aggregates are non-deterministic across replays.
3. Watermarks — "time is X-ish"
A watermark W(t) is a claim: "I believe all events with event time ≤ t have arrived." When the watermark passes the end of a window, that window can fire.
Two strategies:
- Perfect watermark — only achievable if you can characterise sources perfectly.
- Heuristic watermark — based on observed max event time minus a slack. E.g.,
W = max_seen_event_time - 5 seconds.
Late events (after watermark) get handled by allowed lateness — extend the window's lifespan and update results; eventually drop.
events: 12:00:01, 12:00:03, 12:00:02 (late!), 12:00:05
max=01 max=03 still 03 max=05
watermark = max - 2s:
-1 01 01 03
Window [12:00:00, 12:00:05) fires when watermark crosses 12:00:05 → when we see an event with ts ≥ 12:00:07 (because of 2s slack).
4. Windows
- Tumbling — fixed-size, non-overlapping.
[0,5), [5,10), [10,15). The default for periodic reports. - Sliding — fixed-size, overlapping.
[0,5), [1,6), [2,7). Rolling stats. - Session — gap-based; groups events with
\< gaptime between. User sessions. - Global — one window for the whole stream; only fires on triggers.
Trigger = "when should this window emit?". Default: at watermark. Custom: every N elements, every N seconds, on watermark + early/late, etc.
5. State management
Streaming = batch with state. Every stateful operation (counts per key, joins, top-K) carries state across events.
Keyed state — partitioned by key. count_per_user["alice"] = 42.
Operator state — non-keyed; e.g., Kafka source offsets.
Broadcast state — small reference data sent to every operator instance.
State backends:
- Memory — fast, fits in heap. Lose it on crash unless checkpointed.
- RocksDB (Flink default for big state) — on-disk, LSM. Slower per op, but TBs are fine. Snapshots to S3/HDFS.
6. Checkpoints and savepoints (Flink)
Checkpoint — automatic, triggered every N seconds. Used for failure recovery. Cleared after retention. Cheap, frequent.
Savepoint — manual, used to upgrade code or migrate state. More expensive (full).
On failure: Flink restores from last checkpoint, replays Kafka from checkpointed offsets. End-to-end exactly-once requires:
- Source replay (Kafka with consumer offsets in checkpoint).
- State recovery (checkpoint).
- Sink idempotency or transactionality (Kafka transactional producer; database upsert with primary key).
7. Spark Structured Streaming
Spark's model: streaming = infinitely growing table. Each micro-batch is a SQL query over the new rows + state.
events = (spark.readStream
.format("kafka")
.option("subscribe", "events")
.load())
per_user = (events
.withWatermark("event_time", "10 minutes")
.groupBy(window("event_time", "5 minutes"), "user_id")
.count())
per_user.writeStream.format("delta").outputMode("append").start("/data/out")
Micro-batch latency floor: ~hundreds of ms to seconds. Recently added "continuous processing" mode for ms-level latency at lower throughput.
8. Flink — true streaming
Flink processes record-by-record. Latency floor: single-digit ms. State is first-class; APIs are more low-level.
DataStream<Event> events = env.addSource(new FlinkKafkaConsumer<>(...));
events
.keyBy(Event::getUserId)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.reduce((a, b) -> a.merge(b))
.addSink(new FlinkKafkaProducer<>(...));
9. Choosing Flink vs Spark Structured Streaming
| Dimension | Flink | Spark SS |
|---|---|---|
| Latency floor | ~10 ms | ~100 ms (micro-batch) |
| Stateful ops | First-class, advanced | Improving, simpler API |
| SQL | Yes (mature) | Yes (very mature) |
| Batch story | Reasonable | Best-in-class |
| Operations | Dedicated cluster | Reuse Spark infra |
| Community / ecosystem | Strong in streaming | Wider |
Pick Flink if streaming is core to your business (Uber, Netflix, payment processors).
Pick Spark Structured Streaming if you already run Spark for batch and need "good enough" streaming — saves the operational tax of a second engine.
10. Common bugs
- Wall-clock for timestamp —
now()in your sink, notevent_time. Replay non-determinism. - No watermark, default null — windows never fire. Set a watermark even for processing time.
- Allowed lateness too generous — state grows forever; OOM.
- State backend mismatch — RocksDB with tiny state has overhead; switch to heap.
- Sink not idempotent — exactly-once at the engine doesn't mean exactly-once at the DB.
11. Real production numbers (a Flink job we operated)
- 500 K events/sec from Kafka.
- 5-minute tumbling window, key = (user_id, event_type).
- 50 GB state in RocksDB.
- Checkpoint every 30 s; full size ~12 GB, incremental ~200 MB.
- p99 end-to-end latency: 1.2 s.
- Hardware: 24 task managers × 8 vCPU × 32 GB.
After tuning (parallelism = 96, RocksDB block cache = 8 GB per TM, async snapshots): p99 down to 600 ms, checkpoint time halved.
Reading material
Books:
- Streaming Systems — Tyler Akidau, Slava Chernyak, Reuven Lax (O'Reilly, the canonical book on streaming; written by the Dataflow team)
- Stream Processing with Apache Flink — Hueske & Kalavri (the practical Flink companion)
- Designing Data-Intensive Applications — Martin Kleppmann (ch. 11: Stream Processing)
Papers:
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost — Akidau et al. 2015 (VLDB) — the unifying theory of windowing + watermarks.
- MillWheel: Fault-Tolerant Stream Processing at Internet Scale — Akidau et al. 2013 — the Google system Dataflow grew from.
- Apache Flink: Stream and Batch Processing in a Single Engine — Carbone et al. 2015 — the Flink architecture paper.
- Discretized Streams — Zaharia et al. 2013 (Spark Streaming) — the micro-batch model.
Official docs:
- Flink — Stateful Stream Processing — the conceptual model: state, time, exactly-once.
- Flink — Windowing — tumbling, sliding, session, global.
- Spark Structured Streaming Programming Guide — the table-of-streams abstraction.
- Kafka Streams — Architecture — the embeddable library approach.
Blog posts:
- The world beyond batch: Streaming 101 — Tyler Akidau (O'Reilly) — the essay that taught a generation what watermarks are.
- The world beyond batch: Streaming 102 — Tyler Akidau (O'Reilly) — the windowing + triggers follow-up.
- Watermarks in Apache Flink Made Easy — Ververica — the practical "OK but how do they actually work" post.
In-depth research material
- Apache Flink — github.com/apache/flink — ~24k ★, the reference stateful stream engine.
- Apache Beam — github.com/apache/beam — ~8k ★, the unified batch+stream SDK (Dataflow model).
- Pathway — github.com/pathwaycom/pathway — ~30k ★, Python-first incremental stream framework.
- Bytewax — github.com/bytewax/bytewax — ~1.6k ★, Python streaming on a Rust engine.
- Materialize — github.com/MaterializeInc/materialize — streaming SQL, differential dataflow.
- RisingWave — github.com/risingwavelabs/risingwave — ~7k ★, Postgres-compatible streaming SQL.
- Flink Forward conference talks (YouTube) — the canonical streaming conference; production case studies.
- Uber — Streaming SQL with Apache Flink — AthenaX, real-time SQL at Uber scale.
- Netflix — Keystone real-time stream processing — Netflix's Flink-on-Mesos platform.
- The Definitive Guide to Apache Pulsar Functions and Connectors — Pulsar's take on stream processing.
Videos
- Watermarks in Apache Flink — Flink Forward · 41 min — the official "how watermarks work" deep-dive.
- Streaming Systems — Tyler Akidau — Tyler Akidau · 50 min — the Strange Loop talk by the Streaming Systems author.
- Apache Flink — What is it and what does it do? — Stephan Ewen — Flink PMC chair · 35 min — the high-level Flink intro from a co-creator.
- Spark Structured Streaming Deep Dive — Tathagata Das — Spark Streaming creator · 50 min — micro-batch vs continuous, watermarks in Spark.
- Event-Time Processing — Kostas Tzoumas (Ververica) — Flink co-creator · 30 min — the cleanest explanation of why event-time matters.
LeetCode — Sliding Window Maximum
- Link: https://leetcode.com/problems/sliding-window-maximum/
- Difficulty: Hard
- Why this problem: Monotonic deque of indices — exact pattern a streaming window aggregator uses.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Distinguish event time and processing time; pick one for analytics.
- Explain watermarks and allowed lateness; trace a late event.
- Pick a window type for: top-K every 5 min, session lengths, daily aggregates.
- Configure end-to-end exactly-once across Kafka → Flink → Kafka.
- Choose Flink vs Spark SS for a given workload.
- Solve
sliding-window-maximum— monotonic deque, the textbook streaming window kernel.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.