LLM Serving Part 1 — vLLM, KV Cache, Continuous Batching
Session 30 of the 48-session learning series.
Why this session matters
This is Session 30 of 48 in the LLM track. Pre-training gets the headlines; serving is what pays the bills. Understanding the KV cache, continuous batching, and the memory layout of a transformer at inference time is what separates a modest inference bill from a $5000 one for the same workload.
Agenda
- Why naive model.generate() is terrible for production
- The KV cache — what it stores, how big it gets, why it matters
- Continuous (in-flight) batching vs static batching
- PagedAttention — vLLM's killer idea
- Throughput vs latency — the eternal tradeoff
Pre-read (skim before the session)
- vLLM paper — Efficient Memory Management for LLM Serving with PagedAttention (2023)
- Orca — Continuous batching paper (OSDI 2022)
- HuggingFace — Text Generation Inference docs
- vLLM blog — first announcement
Deep dive
1. The inference workload — two distinct phases
[ PREFILL ]                     [ DECODE ]
input prompt (N tokens)         one token at a time
heavy compute, parallel         memory-bound, sequential
~100 ms for 1k tokens           ~30 ms per token
- Prefill: compute-bound. You process the entire prompt in parallel through every transformer layer. GPU's tensor cores are saturated.
- Decode: memory-bound. One token forward at a time; bottleneck is reading the KV cache and weights from HBM, not compute.
These are different optimisation problems. Serving systems optimise both.
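A rough back-of-envelope shows why decode is memory-bound: each decode step has to stream the weights (and the KV cache) out of HBM once, so bytes divided by bandwidth gives a floor on step time. This is a sketch only; the function name is hypothetical, and the bandwidth figure (about 2 TB/s per A100 80GB) and the 4-GPU setup are assumed numbers, not measurements.

```python
def decode_step_floor_ms(weight_bytes: float, kv_bytes: float,
                         hbm_bytes_per_s: float, num_gpus: int = 1) -> float:
    """Lower bound on one decode step: every byte of weights and KV cache
    is read from HBM once; compute and inter-GPU communication are ignored."""
    total_bytes = weight_bytes + kv_bytes
    aggregate_bandwidth = hbm_bytes_per_s * num_gpus
    return total_bytes / aggregate_bandwidth * 1e3

# Assumed numbers: 70B params in fp16 (140 GB), ~2e12 B/s HBM per GPU, 4 GPUs.
floor_ms = decode_step_floor_ms(140e9, 0.0, 2e12, num_gpus=4)
print(f"{floor_ms:.1f} ms per decode step")
```

The weight read is shared by every request in the batch, which is why adding requests to a decode step raises throughput cheaply until KV reads or compute start to dominate.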
2. The KV cache — the silent budget eater
In a transformer, every attention head needs K and V tensors for every token in the context. To avoid recomputing them every step, we cache them.
Size per token, per layer:
2 * d_model * dtype_bytes
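For Llama-3-70B (80 layers, hidden 8192, 64 heads, fp16), treating every attention head as having its own K and V (full multi-head attention):
- Per token KV cache: 2 * 80 * 8192 * 2 = ~2.5 MB
- 4k context: ~10 GB just for one request's KV cache.
- 128k context: ~320 GB. Yes, GB. Just KV.
Note that Llama-3-70B actually uses grouped-query attention with 8 KV heads, so its real cache is 8x smaller (~0.3 MB per token, ~40 GB at 128k). The numbers above are what a model of this shape costs without GQA.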
This is why long context is expensive even at low compute cost.
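The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function name and its parameters are illustrative, and the grouped-query variant assumes 8 KV heads with a head dimension of 128, which is the published Llama-3-70B configuration rather than anything stated in these notes.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   dtype_bytes: int, num_tokens: int = 1) -> int:
    """Bytes of KV cache: K and V (factor 2) per layer, per KV head, per token."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes * num_tokens

# Full multi-head attention: 64 KV heads * 128 dims = hidden size 8192.
per_token_mha = kv_cache_bytes(80, 64, 128, 2)
# Grouped-query attention with 8 KV heads (assumed config).
per_token_gqa = kv_cache_bytes(80, 8, 128, 2)

for label, per_token in [("MHA", per_token_mha), ("GQA", per_token_gqa)]:
    print(label,
          f"{per_token / 2**20:.2f} MiB/token,",
          f"{per_token * 4096 / 2**30:.2f} GiB at 4k,",
          f"{per_token * 131072 / 2**30:.2f} GiB at 128k")
```

Multiply the per-request figure by the number of concurrent requests to see how quickly the cache, not the weights, becomes the limit on batch size.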
3. Why naive batching breaks down
Static batching = wait for N requests, run them together as a batch.
Problems:
- Padding — variable-length prompts pad to max length; lots of wasted GPU.
- Tail-blocking — fastest request waits for slowest. 16-token reply blocked by 512-token reply.
- Cold queue — low traffic = small batches = poor GPU utilisation.
Throughput collapses for typical LLM workloads.
4. Continuous (in-flight) batching
Orca / vLLM insight: at decode time, every request just needs one more token. So:
- Maintain a pool of active requests.
- Each decode step: gather all active requests, run them together, append one token each.
- When a request finishes, evict it; admit a new one immediately.
- New prefill requests interleave with ongoing decodes.
Result: GPU never idles waiting for the slowest request. Throughput often 4–10x over static batching.
Time →
req A: P D D D D D D D
req B: P D D D D D D D D
req C:                 P D D D D D D D
                       ↑ A finishes, C admitted
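To make the tail-blocking effect concrete, here is a toy simulation that counts decode steps under both policies. It is a sketch under simplifying assumptions: prefill is ignored, every decode step costs the same regardless of batch size, and the function names and request lengths are made up for illustration.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Requests are grouped in arrival order; each batch runs until its
    longest request finishes."""
    return sum(max(output_lens[i:i + batch_size])
               for i in range(0, len(output_lens), batch_size))

def continuous_batching_steps(output_lens, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(output_lens)
    active = []  # tokens still to generate, one entry per in-flight request
    steps = 0
    while waiting or active:
        while waiting and len(active) < batch_size:
            active.append(waiting.popleft())
        # One decode step: every active request emits a token; finished ones leave.
        active = [remaining - 1 for remaining in active if remaining > 1]
        steps += 1
    return steps

lens = [16, 512, 16, 16, 512, 16, 16, 16]  # hypothetical output lengths
static_steps = static_batching_steps(lens, batch_size=4)
continuous_steps = continuous_batching_steps(lens, batch_size=4)
print(static_steps, continuous_steps)
```

With these made-up lengths the static schedule needs 1024 steps and the continuous one 528, because the two long replies overlap instead of each holding a whole batch hostage.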
5. PagedAttention — vLLM's contribution
Problem: KV cache for each request is variable-size, grows over time, lives next to other requests' caches. Contiguous allocation = massive fragmentation as requests come and go.
Solution: borrow virtual memory paging from OS.
- Divide KV cache into fixed-size blocks (e.g. 16 tokens).
- Each request has a block table mapping logical positions → physical blocks.
- Blocks allocated/freed on demand; no external fragmentation, and at most the last block of each request is partially filled.
Bonus: blocks can be shared across requests (prefix caching, beam search shares blocks).
Outcome: 2–4x more requests fit in same GPU memory; throughput up correspondingly.
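The block-table idea can be sketched as plain bookkeeping, without any GPU code. This is a toy illustration, not vLLM's implementation: the class and method names are hypothetical, and it only tracks which physical block each request's tokens would land in.

```python
class BlockAllocator:
    """Toy paged KV-cache bookkeeping: fixed-size blocks plus per-request block tables."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # request id -> [physical block ids]
        self.num_tokens = {}                        # request id -> tokens stored so far

    def append_token(self, request_id: str) -> int:
        """Reserve space for one more token; returns the physical block it lands in."""
        table = self.block_tables.setdefault(request_id, [])
        count = self.num_tokens.get(request_id, 0)
        if count % self.block_size == 0:            # last block is full, or none yet
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; request must wait")
            table.append(self.free_blocks.pop())
        self.num_tokens[request_id] = count + 1
        return table[-1]

    def free(self, request_id: str) -> None:
        """Return all of a finished request's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.num_tokens.pop(request_id, None)

alloc = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):                                 # 17 tokens need 2 blocks
    alloc.append_token("A")
print(alloc.block_tables["A"], len(alloc.free_blocks))
alloc.free("A")
```

Because any free block can serve any request, memory released by a finished request is immediately usable by a new one, whatever their lengths.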
6. Other serving systems
- TGI (HuggingFace Text Generation Inference) — production-ready, model-zoo wide, simple to deploy.
- TensorRT-LLM (NVIDIA) — fastest on NVIDIA hardware, painful build.
- DeepSpeed-Inference (Microsoft) — good for Mixture-of-Experts.
- SGLang — newer; structured outputs, prefix caching, often beats vLLM on certain workloads.
- MLX-LM / llama.cpp — local/edge inference; runs on Apple Silicon and CPU.
For most server-side production: vLLM or TGI. For best raw NVIDIA perf: TensorRT-LLM.
7. Prefix caching
Common scenario: every request shares the same 2 KB system prompt. Recomputing prefill for it on every request wastes roughly 100 ms each time.
Prefix caching: hash the prefix tokens, cache the resulting KV blocks, and reuse them across requests. vLLM and SGLang support this automatically when prefixes match.
Big win for: agent workloads (long system prompts), few-shot prompting, RAG (the same retrieved chunks reused).
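One way to picture the lookup is block-level hashing, where each block's hash also covers everything before it. The sketch below is an assumption-laden toy, not how vLLM or SGLang implement it: the function names are hypothetical, the block size is tiny, and the cache stores only hashes where a real engine would store the KV blocks themselves.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per block; deliberately small for illustration

def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with everything before it, so a block
    only matches when the entire prefix up to that point is identical."""
    hashes, running = [], hashlib.sha256()
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        running.update(",".join(map(str, block)).encode() + b";")
        hashes.append(running.copy().hexdigest())
    return hashes

def cached_prefix_tokens(token_ids, cache, block_size=BLOCK_SIZE):
    """Count leading tokens whose blocks are already cached, then register
    this request's blocks for later requests."""
    hashes = block_hashes(token_ids, block_size)
    hits = 0
    for h in hashes:
        if h not in cache:
            break
        hits += 1
    cache.update(hashes)
    return hits * block_size

system_prompt = [1, 2, 3, 4, 5, 6, 7, 8]   # hypothetical token ids
cache = set()
first = cached_prefix_tokens(system_prompt + [100, 101], cache)
second = cached_prefix_tokens(system_prompt + [200, 201, 202, 203, 204], cache)
print(first, second)
```

The first request finds nothing cached; the second skips prefill for the 8 shared system-prompt tokens and only computes its own suffix.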
8. Throughput vs latency
Four metrics to set service-level objectives against:
- TTFT (time to first token) — user-perceived "did it start?"
- TPOT (time per output token) — user-perceived speed of streaming.
- End-to-end latency — total time.
- Throughput — total tokens/sec across all users.
You can't max all of them. Choices:
- Bigger batches → higher throughput, worse latency.
- Smaller batches → better latency, worse $/token.
- Speculative decoding (S34) → better TPOT, costs draft model time.
- Chunked prefill → smooth TTFT, slight throughput loss.
Define your SLO first. Tune second.
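Before tuning anything it helps to measure these metrics per request. A minimal sketch, assuming you have recorded the time the request was sent and the arrival time of each streamed token; the function name and the timestamps are made up for illustration.

```python
def latency_metrics(request_sent_s: float, token_times_s: list[float]) -> dict:
    """TTFT, mean TPOT and end-to-end latency for one streamed response."""
    ttft = token_times_s[0] - request_sent_s
    e2e = token_times_s[-1] - request_sent_s
    n = len(token_times_s)
    # TPOT averages the gaps between tokens, so it excludes the first token.
    tpot = (token_times_s[-1] - token_times_s[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}

# Hypothetical arrival times in seconds, request sent at t = 0.
metrics = latency_metrics(0.0, [0.10, 0.13, 0.16, 0.19])
print(metrics)
```

Aggregate these across requests as percentiles (p50, p99) rather than means, since batching decisions mostly show up in the tail.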
9. Tensor + pipeline parallelism
A 70B model in fp16 needs about 140 GB for weights alone, so it doesn't fit on one 80 GB GPU. Options:
- Tensor parallelism (TP) — split each layer's matrices across GPUs; activations all-reduce after each layer. Low latency, high bandwidth needs.
- Pipeline parallelism (PP) — different layers on different GPUs; activations passed forward. Higher latency, less bandwidth.
- Expert parallelism (EP) — for MoE; experts on different GPUs.
For 8x A100: TP=8 typically fastest for inference.
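A quick way to sanity-check a tensor-parallel degree is to divide the weight memory evenly across GPUs. This is a rough sketch with a hypothetical helper name; it assumes perfectly even sharding and ignores activations, KV cache, and framework overhead, all of which need room as well.

```python
def weights_per_gpu_gb(num_params: float, bytes_per_param: float, tp: int) -> float:
    """Weight memory per GPU when every layer is sharded evenly across tp GPUs."""
    return num_params * bytes_per_param / tp / 1e9

for tp in (1, 2, 4, 8):
    gb = weights_per_gpu_gb(70e9, 2, tp)            # 70B params, fp16
    fits = "fits" if gb < 80 else "does not fit"
    print(f"TP={tp}: {gb:.1f} GB of weights per GPU ({fits} in 80 GB)")
```

TP=2 technically fits but leaves only about 10 GB per GPU for the KV cache, which is one reason higher TP degrees are common for 70B models.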
10. Quantisation (preview of S34)
fp16 → int8 → int4 shrinks model + KV cache. Throughput up, quality usually within 1% on standard benchmarks. Most production deployments are int8 or int4 by 2026. Detailed in S34.
11. Cost math (do this in interview)
Llama-3-70B fp16 on 4× A100 80GB ($16/GPU-hr cloud):
- Cost: ~$64/hr.
- Realistic throughput with vLLM + continuous batching: ~3000 output tokens/sec.
- Cost per 1M tokens: $64 / (3000 * 3600 / 1e6) = ~$5.93.
Compare OpenAI GPT-4o-mini: ~$0.60/1M out. Why? Better hardware, better software stack, scale, quantisation, batching efficiency.
The gap is closing fast — but "self-host LLM" is rarely cheaper than API at low traffic.
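The same cost calculation as a reusable helper, so you can plug in your own prices and measured throughput. The function name is hypothetical, and the inputs are the figures from the worked example above, not benchmarks.

```python
def cost_per_million_tokens(gpu_hourly_usd: float, num_gpus: int,
                            tokens_per_sec: float) -> float:
    """Serving cost in USD per 1M output tokens at a sustained throughput."""
    millions_of_tokens_per_hour = tokens_per_sec * 3600 / 1e6
    return gpu_hourly_usd * num_gpus / millions_of_tokens_per_hour

cost = cost_per_million_tokens(gpu_hourly_usd=16, num_gpus=4, tokens_per_sec=3000)
print(f"${cost:.2f} per 1M tokens")
```

The formula assumes the GPUs are fully loaded for the whole hour; at 10% utilisation the effective cost per token is ten times higher, which is the main reason self-hosting loses at low traffic.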
12. Reality check
Decision tree for serving:
- < 1M tokens/day, latency permissive → API (OpenAI, Anthropic, etc.).
- 1–100M tokens/day, batchable → API with caching + prompt minimisation.
- 100M+ tokens/day with stable load → self-host with vLLM/TGI on rented GPUs.
- Privacy/regulatory hard requirement → self-host regardless.
Self-hosting eats engineering time. Budget 1 FTE for serving infra at any meaningful scale.
Reading material
Books:
- Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (the serving chapter)
- Generative AI with LangChain — Ben Auffarth (the deployment + cost chapters)
- AI Engineering — Chip Huyen (the deployment + inference economics chapter; the best practitioner book of 2024)
Papers:
- Efficient Memory Management for Large Language Model Serving with PagedAttention — Kwon et al. 2023 (SOSP) — the vLLM paper; PagedAttention explained from first principles.
- Orca: A Distributed Serving System for Transformer-Based Generative Models — Yu et al. 2022 (OSDI) — the original continuous-batching paper.
- FlashAttention — Dao et al. 2022 — the IO-aware attention kernel everyone now uses.
- FlashAttention-2 — Dao 2023 — the throughput-doubling sequel.
- SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills — Agrawal et al. 2023 — chunked prefill, now the default in vLLM.
Official docs:
- vLLM documentation — the most popular OSS serving engine.
- Hugging Face Text Generation Inference (TGI) — HF's production serving stack.
- NVIDIA TensorRT-LLM — the GPU-vendor reference for max throughput.
- SGLang documentation — the fast structured-output serving engine.
Blog posts:
- Anyscale — Continuous Batching: How to Get 23x Throughput in LLM Inference — the canonical explanation with measured numbers.
- vLLM team — Easy, Fast, and Cheap LLM Serving with PagedAttention — the launch post; the 1-page mental model.
- Tim Dettmers — Inference math for transformers — the only resource you need for "how big a GPU do I need?"
In-depth research material
- vLLM — github.com/vllm-project/vllm — ~31k ★, the de-facto OSS serving engine; PagedAttention reference implementation.
- TGI — github.com/huggingface/text-generation-inference — ~9k ★, Hugging Face's production-grade Rust+Python serving stack.
- TensorRT-LLM — github.com/NVIDIA/TensorRT-LLM — ~9k ★, NVIDIA's max-throughput inference engine.
- SGLang — github.com/sgl-project/sglang — ~7k ★, the LMSYS engine focused on structured generation + fast serving.
- LMDeploy — github.com/InternLM/lmdeploy — ~5k ★, alternative serving runtime with TurboMind backend.
- llama.cpp — github.com/ggerganov/llama.cpp — ~67k ★, CPU/Apple Silicon inference; the "tiny VM" of the LLM world.
- SARATHI follow-up: Taming Throughput-Latency Tradeoff in LLM Inference — Microsoft Research; the modern scheduler design.
- Splitwise: Efficient generative LLM inference using phase splitting — Microsoft / University of Washington; separating prefill and decode onto different hardware.
- DistServe: Disaggregating Prefill and Decoding for Goodput-Optimized LLM Serving — OSDI'24 — the disaggregated-serving paper.
- Character.ai blog — Optimizing AI Inference at Character.ai — a real serving stack handling 20k+ inference queries/sec.
Videos
- vLLM: Easy, Fast, and Cheap LLM Serving — Woosuk Kwon (vLLM lead, UC Berkeley) · 32 min — from the paper's first author; the definitive walk-through.
- Continuous Batching for LLM Inference — Anyscale — 24 min — the visual explanation of why static batching wastes GPUs.
- Inside vLLM: Anatomy of a High-Performance LLM Serving System — Stanford MLSys — Zhuohan Li · 58 min — the architecture lecture from a co-author.
- How to Serve LLMs with vLLM and OVHcloud AI Deploy — 42 min — hands-on deployment walkthrough.
- GPU MODE: Triton kernels for LLM inference — Horace He (PyTorch) — 1 h 12 min — the GPU-kernel-level view from a PyTorch core dev.
LeetCode — Design Bounded Blocking Queue
- Link: https://leetcode.com/problems/design-bounded-blocking-queue/
- Difficulty: Medium
- Why this problem: Producer-consumer with bounded buffer — same shape as the inference request queue feeding a continuous-batching scheduler.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain why prefill is compute-bound and decode is memory-bound.
- Compute the KV cache size for a given model and context length.
- Compare static vs continuous batching with a timing diagram.
- Describe how PagedAttention avoids KV cache fragmentation.
- List 3 SLOs (TTFT, TPOT, throughput) and explain how batch size trades them off.
- Solve design-bounded-blocking-queue — semaphore-based producer-consumer.