ai ml · intermediate · 12m · 2026-06-09

LLM Serving Part 1 — vLLM, KV Cache, Continuous Batching

Session 30 of the 48-session learning series.

Why this session matters

This is Session 30 of 48 in the LLM track. Pre-training gets the headlines; serving is what pays the bills. Understanding the KV cache, continuous batching, and the memory layout of a transformer at inference time is the difference between a $50 and a $5000 inference bill for the same workload.

Agenda

Why naive model.generate() is terrible for production
The KV cache — what it stores, how big it gets, why it matters
Continuous (in-flight) batching vs static batching
PagedAttention — vLLM's killer idea
Throughput vs latency — the eternal tradeoff

Pre-read (skim before the session)

Deep dive

1. The inference workload — two distinct phases

[ PREFILL ]                    [ DECODE ]
input prompt (N tokens)        one token at a time
heavy compute, parallel        memory-bound, sequential
~100 ms for 1k tokens          ~30 ms per token

Prefill: compute-bound. You process the entire prompt in parallel through every transformer layer. GPU's tensor cores are saturated.
Decode: memory-bound. One token forward at a time; bottleneck is reading the KV cache and weights from HBM, not compute.

These are different optimisation problems. Serving systems optimise both.
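A back-of-envelope sketch of why decode is memory-bound: every decode step has to stream all the weights from HBM at least once, so weight bytes divided by memory bandwidth gives a floor on per-token latency. The function name and the ~2 TB/s per-GPU bandwidth figure (roughly an A100 80GB) are assumptions for illustration, and the estimate ignores KV cache reads, communication, and kernel overheads.

```python
def decode_latency_floor_ms(n_params: float, dtype_bytes: int,
                            hbm_bytes_per_s: float, n_gpus: int = 1) -> float:
    """Lower bound on time per decoded token: stream all weights once per step."""
    weight_bytes = n_params * dtype_bytes
    # With tensor parallelism each GPU reads only its own shard, in parallel.
    seconds = weight_bytes / (hbm_bytes_per_s * n_gpus)
    return seconds * 1e3


# Assumed figures: 70e9 parameters, fp16, ~2e12 bytes/s per GPU, 4 GPUs.
floor_ms = decode_latency_floor_ms(70e9, 2, 2e12, n_gpus=4)
print(f"{floor_ms:.1f} ms per token at batch size 1")  # 17.5 ms
```

The same weight read serves every request in the batch, which is why batching decode steps raises throughput almost for free until compute or KV cache traffic becomes the limit.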

2. The KV cache — the silent budget eater

In a transformer, every attention head needs K and V tensors for every token in the context. To avoid recomputing them every step, we cache them.

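Size per token, per layer:

2 * n_kv_heads * head_dim * dtype_bytes

With classic multi-head attention (n_kv_heads * head_dim = d_model) this is 2 * d_model * dtype_bytes.

For a 70B-class model with multi-head attention (80 layers, hidden 8192, 64 heads, fp16):

Per token KV cache: 2 * 80 * 8192 * 2 bytes = ~2.5 MB
4k context: ~10 GB just for one request's KV cache.
128k context: ~320 GB. Yes, GB. Just KV.

Llama-3-70B itself uses grouped-query attention with 8 KV heads of dimension 128, which cuts these figures 8x: ~0.3 MB per token, ~1.25 GB at 4k context, ~40 GB at 128k.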

This is why long context is expensive even at low compute cost.
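The sizing arithmetic can be sketched as a small helper. The function name is hypothetical, and the head counts used below (64 query heads, 8 KV heads, head dimension 128 for Llama-3-70B) are taken as assumptions from the published model configuration; swap in the numbers for your own model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache for one request; the leading 2 is K plus V."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens


MIB, GIB = 2**20, 2**30

# Multi-head attention: every one of the 64 heads keeps its own K and V.
mha_per_token = kv_cache_bytes(80, 64, 128, n_tokens=1)
print(mha_per_token / MIB)                          # 2.5 MiB per token
print(kv_cache_bytes(80, 64, 128, 4096) / GIB)      # 10.0 GiB at 4k context

# Grouped-query attention with 8 KV heads (assumed Llama-3-70B layout).
gqa_per_token = kv_cache_bytes(80, 8, 128, n_tokens=1)
print(gqa_per_token / MIB)                          # 0.3125 MiB per token
print(kv_cache_bytes(80, 8, 128, 128 * 1024) / GIB) # 40.0 GiB at 128k context
```

Multiply the per-request figure by the number of concurrent requests to see how quickly the cache, not the weights, becomes the memory limit.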

3. Why naive batching breaks down

Static batching = wait for N requests, run them together as a batch.

Problems:

Padding — variable-length prompts pad to max length; lots of wasted GPU.
Tail-blocking — fastest request waits for slowest. 16-token reply blocked by 512-token reply.
Cold queue — low traffic = small batches = poor GPU utilisation.

Throughput collapses for typical LLM workloads.

4. Continuous (in-flight) batching

Orca / vLLM insight: at decode time, every request just needs one more token. So:

Maintain a pool of active requests.
Each decode step: gather all active requests, run them together, append one token each.
When a request finishes, evict it; admit a new one immediately.
New prefill requests interleave with ongoing decodes.

Result: GPU never idles waiting for the slowest request. Throughput often 4–10x over static batching.

Time →
req A: P D D D D D D D
req B:     P D D D D D D D D
req C:                 P D D D D D D D
                       ↑ A finishes, C admitted
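A toy simulation makes the difference between the two schedulers concrete. This is a sketch under strong assumptions: all requests are already queued at time zero, prefill is ignored, and every decode step costs the same regardless of how many slots are occupied. The function names are illustrative and do not correspond to any engine's API.

```python
from collections import deque


def static_batching_steps(output_lens, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(output_lens), batch_size):
        total += max(output_lens[i:i + batch_size])
    return total


def continuous_batching_steps(output_lens, batch_size):
    """Freed slots are refilled from the queue before every decode step."""
    waiting = deque(output_lens)
    active = []  # tokens still to generate for each running request
    steps = 0
    while waiting or active:
        # Admit new requests into any free slots.
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step: each active request emits one token;
        # requests that just emitted their last token leave the pool.
        active = [r - 1 for r in active if r - 1 > 0]
        steps += 1
    return steps


lens = [16, 512, 16, 16, 512, 16, 16, 16]  # output lengths in tokens
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)
print(static_steps, continuous_steps)  # 1056 560
```

With two slots, static batching spends most of its steps with one slot idle behind a 512-token reply, while the continuous scheduler keeps both slots busy for all 560 steps.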

5. PagedAttention — vLLM's contribution

Problem: KV cache for each request is variable-size, grows over time, lives next to other requests' caches. Contiguous allocation = massive fragmentation as requests come and go.

Solution: borrow virtual memory paging from OS.

Divide KV cache into fixed-size blocks (e.g. 16 tokens).
Each request has a block table mapping logical positions → physical blocks.
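Blocks are allocated and freed on demand, so there is no external fragmentation; the only waste is the unfilled tail of each request's last block.

Bonus: blocks can be shared across requests, for example a common prompt prefix (prefix caching) or the shared history of beam-search candidates.

Outcome: 2–4x more requests fit in the same GPU memory, and throughput rises correspondingly.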
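The block-table idea can be sketched in a few lines. This is a toy allocator for illustration only: the class and method names are hypothetical and are not vLLM's API, and a real engine also handles block sharing, copy-on-write, and preemption.

```python
class PagedKVAllocator:
    """Toy block table: logical token positions map to physical block ids."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.num_tokens = {}    # request id -> tokens stored so far

    def append_token(self, req_id: str) -> int:
        """Reserve space for one more token; return its physical block id."""
        table = self.block_tables.setdefault(req_id, [])
        n = self.num_tokens.get(req_id, 0)
        if n % self.block_size == 0:  # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; scheduler must preempt")
            table.append(self.free_blocks.pop())
        self.num_tokens[req_id] = n + 1
        return table[-1]

    def free(self, req_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(req_id, []))
        self.num_tokens.pop(req_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("A")  # 17 tokens need 2 blocks
for _ in range(5):
    alloc.append_token("B")  # 5 tokens need 1 block
blocks_used_by_a = len(alloc.block_tables["A"])
alloc.free("A")              # both of A's blocks are reusable immediately
```

Because any free block can serve any request, memory released by a finished request is usable by the next one regardless of where it sits physically.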

6. Other serving systems

TGI (HuggingFace Text Generation Inference) — production-ready, model-zoo wide, simple to deploy.
TensorRT-LLM (NVIDIA) — fastest on NVIDIA hardware, painful build.
DeepSpeed-Inference (Microsoft) — good for Mixture-of-Experts.
SGLang — newer; structured outputs, prefix caching, often beats vLLM on certain workloads.
MLX-LM / llama.cpp — local/edge inference; runs on Apple Silicon and CPU.

For most server-side production: vLLM or TGI. For best raw NVIDIA perf: TensorRT-LLM.

7. Prefix caching

Common scenario: every request shares a 2 KB system prompt. Recomputing prefill for that on every request wastes 100 ms each.

Prefix caching: hash the prefix tokens, cache the resulting KV blocks, and reuse them across requests. vLLM and SGLang both support this automatically when prefixes match.

Big win for: agent workloads (long system prompts), few-shot prompting, RAG (the same retrieved chunks reused).
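One way to picture the lookup is a chain of block hashes, where each block's hash also covers everything before it, so a hit on block i implies the whole prefix up to i matches. The sketch below is a simplified illustration with hypothetical names; it does not reproduce the hashing scheme of any particular engine.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block (assumed)


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of the preceding prefix."""
    hashes, prev = [], b""
    for i in range(len(token_ids) // block_size):
        block = token_ids[i * block_size:(i + 1) * block_size]
        prev = hashlib.sha256(prev + repr(block).encode()).digest()
        hashes.append(prev)
    return hashes


def count_cached_blocks(token_ids, cache):
    """Number of leading blocks whose KV is already in the cache."""
    hits = 0
    for h in block_hashes(token_ids):
        if h not in cache:
            break
        hits += 1
    return hits


system_prompt = list(range(40))  # stand-in token ids for a shared prompt
# The first request populates the cache with its full blocks.
cache = set(block_hashes(system_prompt + [900, 901]))
# A second request with the same system prompt but a different user turn.
hits = count_cached_blocks(system_prompt + [555] * 10, cache)
print(hits)  # 2 blocks (32 tokens) can skip prefill
```

Only full blocks are reusable in this sketch, so the last 8 tokens of the 40-token prompt are recomputed.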

8. Throughput vs latency

Four metrics you can set service-level objectives against:

TTFT (time to first token) — user-perceived "did it start?"
TPOT (time per output token) — user-perceived speed of streaming.
End-to-end latency — total time.
Throughput — total tokens/sec across all users.

You can't max all of them. Choices:

Bigger batches → higher throughput, worse latency.
Smaller batches → better latency, worse $/token.
Speculative decoding (S34) → better TPOT, costs draft model time.
Chunked prefill → smooth TTFT, slight throughput loss.

Define your SLO first. Tune second.
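Before tuning anything it helps to measure these metrics per request. A minimal sketch, assuming you record the send time and the arrival time of each streamed token; the function name is hypothetical, and TPOT is taken here as the mean gap between tokens after the first one, which is one common convention.

```python
def latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    """Compute TTFT, TPOT and end-to-end latency (seconds) for one request."""
    ttft = token_times[0] - request_sent
    e2e = token_times[-1] - request_sent
    n = len(token_times)
    # Mean inter-token gap, excluding the wait for the first token.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft": ttft, "tpot": tpot, "e2e": e2e}


# Request sent at t=0; four tokens arrive at these timestamps.
metrics = latency_metrics(0.0, [0.10, 0.13, 0.16, 0.19])
print(metrics)  # ttft is 0.10 s, tpot about 0.03 s, e2e 0.19 s
```

Aggregate these over many requests and look at percentiles (p50, p95, p99) rather than means when writing the SLO.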

9. Tensor + pipeline parallelism

A 70B model doesn't fit on one 80 GB GPU at fp16 (the weights alone are about 140 GB). Options:

Tensor parallelism (TP) — split each layer's matrices across GPUs; activations all-reduce after each layer. Low latency, high bandwidth needs.
Pipeline parallelism (PP) — different layers on different GPUs; activations passed forward. Higher latency, less bandwidth.
Expert parallelism (EP) — for MoE; experts on different GPUs.

For 8x A100: TP=8 typically fastest for inference.
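The mechanics of tensor parallelism can be shown without any GPU: shard a weight matrix along its input dimension, let each "GPU" compute a partial product on its shard, then sum the partials, which is the job of the all-reduce. This pure-Python sketch uses hypothetical helper names and assumes the input dimension divides evenly by the number of GPUs.

```python
def matvec(x, w):
    """x: length-k vector, w: k x n matrix (list of rows) -> length-n vector."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def row_parallel_matvec(x, w, n_gpus):
    """Shard the input dimension across n_gpus, then all-reduce the partials."""
    shard = len(x) // n_gpus  # assumes len(x) is divisible by n_gpus
    partials = []
    for g in range(n_gpus):
        lo, hi = g * shard, (g + 1) * shard
        # Local work on "GPU" g: it only ever touches its own rows of w.
        partials.append(matvec(x[lo:hi], w[lo:hi]))
    # All-reduce: element-wise sum of the partial outputs.
    return [sum(vals) for vals in zip(*partials)]


x = [1, 2, 3, 4]
w = [[1, 0], [0, 1], [1, 1], [2, 0]]
reference = matvec(x, w)                 # [12, 5]
sharded = row_parallel_matvec(x, w, 2)   # [12, 5]
```

Each GPU holds only 1/n of the weights, but the sum step runs inside every layer, which is why TP wants a fast interconnect between the GPUs.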

10. Quantisation (preview of S34)

fp16 → int8 → int4 shrinks the model weights, and quantising the KV cache shrinks that too. Throughput goes up, and quality is usually within about 1% on standard benchmarks. Most production deployments are int8 or int4 by 2026. Detailed in S34.
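As a preview, the simplest form of the idea is symmetric per-tensor int8 quantisation: store one floating-point scale and one small integer per value. The sketch below is illustrative only, with hypothetical function names; production schemes typically use per-channel or per-group scales and calibration data.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8: one float scale plus one int8 per value."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map the stored integers back to approximate float values."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q)  # [50, -127, 1, 100]
```

Each value now takes 1 byte instead of 2, and the rounding error per value is bounded by half the scale.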

11. Cost math (do this in an interview)

Llama-3-70B fp16 on 4× A100 80GB ($16/GPU-hr cloud):

Cost: ~$64/hr.
Realistic throughput with vLLM + continuous batching: ~3000 output tokens/sec.
Cost per 1M tokens: $64 / (3000 * 3600 / 1e6) = ~$5.93.

Compare OpenAI GPT-4o-mini: ~$0.60/1M out. Why? Better hardware, better software stack, scale, quantisation, batching efficiency.

The gap is closing fast — but "self-host LLM" is rarely cheaper than API at low traffic.
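The same arithmetic as a reusable helper, with one extra knob. The function name and the `utilisation` parameter (the fraction of paid GPU time actually spent serving traffic) are assumptions added for illustration; the prices and throughput are the figures used above.

```python
def cost_per_million_tokens(gpu_hourly_usd, n_gpus, tokens_per_sec,
                            utilisation=1.0):
    """USD per 1M output tokens for a rented, always-on deployment."""
    tokens_per_hour = tokens_per_sec * 3600 * utilisation
    return gpu_hourly_usd * n_gpus / (tokens_per_hour / 1e6)


full = cost_per_million_tokens(16, 4, 3000)           # fully loaded
quarter = cost_per_million_tokens(16, 4, 3000, 0.25)  # busy 25% of the time
print(f"${full:.2f} vs ${quarter:.2f} per 1M tokens")  # $5.93 vs $23.70
```

The second number is the reason low-traffic self-hosting rarely beats an API: you pay for the GPUs whether or not requests arrive.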

12. Reality check

Decision tree for serving:

< 1M tokens/day, latency permissive → API (OpenAI, Anthropic, etc.).
1–100M tokens/day, batchable → API with caching + prompt minimisation.
100M+ tokens/day with stable load → self-host with vLLM/TGI on rented GPUs.
Privacy/regulatory hard requirement → self-host regardless.

Self-hosting eats engineering time. Budget 1 FTE for serving infra at any meaningful scale.
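For reference, the decision tree can be written down as a function. This is only a sketch of the rules of thumb above: the function and parameter names are hypothetical, and treating high but unstable load as an API case is an interpretation, since the list does not cover it explicitly.

```python
def serving_recommendation(tokens_per_day: int,
                           privacy_required: bool = False,
                           stable_load: bool = True) -> str:
    """Encode the rule-of-thumb decision tree for how to serve an LLM."""
    if privacy_required:
        return "self-host"
    if tokens_per_day < 1_000_000:
        return "API"
    # Assumption: bursty load stays on the API even above 100M tokens/day.
    if tokens_per_day < 100_000_000 or not stable_load:
        return "API with caching + prompt minimisation"
    return "self-host with vLLM/TGI on rented GPUs"


print(serving_recommendation(500_000))      # API
print(serving_recommendation(250_000_000))  # self-host with vLLM/TGI on rented GPUs
```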

Reading material

Books:

Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (the serving chapter)
Generative AI with LangChain — Ben Auffarth (the deployment + cost chapters)
AI Engineering — Chip Huyen (the deployment + inference economics chapter; the best practitioner book of 2024)

Papers:

Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al. 2023 (SOSP) — the vLLM paper; PagedAttention explained from first principles.
Orca: A Distributed Serving System for Transformer-Based Generative Models — Yu et al. 2022 (OSDI) — the original continuous-batching paper.
FlashAttention — Dao et al. 2022 — the IO-aware attention kernel everyone now uses.
FlashAttention-2 — Dao 2023 — the throughput-doubling sequel.
SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Agrawal et al. 2023 — chunked prefill, now the default in vLLM.

Official docs:

vLLM documentation — the most popular OSS serving engine.
Hugging Face Text Generation Inference (TGI) — HF's production serving stack.
NVIDIA TensorRT-LLM — the GPU-vendor reference for max throughput.
SGLang documentation — the fast structured-output serving engine.

Blog posts:

Anyscale — Continuous Batching: How to Get 23x Throughput in LLM Inference — the canonical explanation with measured numbers.
vLLM team — Easy, Fast, and Cheap LLM Serving with PagedAttention — the launch post; the 1-page mental model.
Tim Dettmers — Inference math for transformers — the only resource you need for "how big a GPU do I need?"

In-depth research material

vLLM — github.com/vllm-project/vllm — ~31k ★, the de-facto OSS serving engine; PagedAttention reference implementation.
TGI — github.com/huggingface/text-generation-inference — ~9k ★, Hugging Face's production-grade Rust+Python serving stack.
TensorRT-LLM — github.com/NVIDIA/TensorRT-LLM — ~9k ★, NVIDIA's max-throughput inference engine.
SGLang — github.com/sgl-project/sglang — ~7k ★, the LMSYS engine focused on structured generation + fast serving.
LMDeploy — github.com/InternLM/lmdeploy — ~5k ★, alternative serving runtime with TurboMind backend.
llama.cpp — github.com/ggerganov/llama.cpp — ~67k ★, CPU/Apple Silicon inference; the "tiny VM" of the LLM world.
SARATHI follow-up: Taming Throughput-Latency Tradeoff in LLM Inference — Microsoft Research; the modern scheduler design.
Splitwise: Efficient generative LLM inference using phase splitting — Microsoft / NVIDIA; separating prefill and decode onto different hardware.
DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving — OSDI'24 — the disaggregated-serving paper.
Character.ai blog — Optimizing AI Inference at Character.ai — a real serving stack handling 20k+ queries/sec.

Videos

vLLM: Easy, Fast, and Cheap LLM Serving — Woosuk Kwon (vLLM lead, CMU) · 32 min — from the paper's first author; the definitive walk-through.
Continuous Batching for LLM Inference — Anyscale — 24 min — the visual explanation of why static batching wastes GPUs.
Inside vLLM: Anatomy of a High-Performance LLM Serving System — Stanford MLSys — Zhuohan Li · 58 min — the architecture lecture from a co-author.
How to Serve LLMs with vLLM and OVHcloud AI Deploy — 42 min — hands-on deployment walkthrough.
GPU MODE: Triton kernels for LLM inference — Horace He (PyTorch) — 1 h 12 min — the GPU-kernel-level view from a PyTorch core dev.

LeetCode — Design Bounded Blocking Queue

Link: https://leetcode.com/problems/design-bounded-blocking-queue/
Difficulty: Medium
Why this problem: Producer-consumer with bounded buffer — same shape as the inference request queue feeding a continuous-batching scheduler.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain why prefill is compute-bound and decode is memory-bound.
Compute the KV cache size for a given model and context length.
Compare static vs continuous batching with a timing diagram.
Describe how PagedAttention avoids KV cache fragmentation.
List 3 SLOs (TTFT, TPOT, throughput) and explain how batch size trades them off.
Solve design-bounded-blocking-queue — semaphore-based producer-consumer.

