LLM Serving Part 2 — Speculative Decoding, Quantisation, Throughput
Session 34 of the 48-session learning series.
Why this session matters
This is Session 34 of 48 in the LLM track. Part 1 covered the engine (vLLM, KV cache, batching). Part 2 covers the levers — quantisation, speculative decoding, throughput tuning — that turn that engine from "works" to "$0.50 per million tokens". If you operate any non-trivial LLM workload, these techniques are no longer optional.
Agenda
- Quantisation — int8, int4, FP8, AWQ, GPTQ; what to pick
- Speculative decoding — small draft + big verifier
- Chunked prefill, prefix caching, lookahead decoding
- Multi-LoRA serving — many adapters, one base
- Throughput tuning — batch size, KV pool, parallelism
Pre-read (skim before the session)
- Speculative Decoding (Leviathan et al., 2023)
- GPTQ paper (Frantar et al., 2022)
- AWQ paper (Lin et al., 2023)
- vLLM docs — prefix caching
Deep dive
1. Quantisation — the headline lever
Reducing precision of weights (and optionally activations) shrinks memory + accelerates compute.
| Format | Bits | Memory ratio | Quality loss | Use |
|---|---|---|---|---|
| fp32 | 32 | 1.0× | reference | training |
| fp16 / bf16 | 16 | 0.5× | none | training + serving |
| fp8 | 8 | 0.25× | minimal (H100+) | serving (latest GPUs) |
| int8 | 8 | 0.25× | minimal | serving (any GPU) |
| int4 | 4 | 0.125× | 0.5–1% benchmarks | serving |
| int3 | 3 | 0.094× | 2–5% | experimental |
Memory savings cascade: smaller weights → more room for KV cache → more concurrent requests → higher throughput.
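The cascade can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative only: it assumes a hypothetical 70B-parameter model on 4× 80 GiB GPUs, and it ignores quantisation scales, activations, and framework overhead. The helper names are made up for this example.

```python
def weight_gib(params_billion: float, bits: int) -> float:
    """Approximate weight footprint in GiB (ignores scales and zero-points)."""
    return params_billion * 1e9 * bits / 8 / 2**30


def kv_budget_gib(gpu_gib: float, n_gpus: int, params_billion: float,
                  bits: int, utilization: float = 0.90) -> float:
    """Memory left for KV cache once weights are loaded (rough upper bound)."""
    return gpu_gib * n_gpus * utilization - weight_gib(params_billion, bits)


# Assumed setup: 70B parameters, 4 GPUs with 80 GiB each.
for name, bits in [("fp16", 16), ("int8", 8), ("int4", 4)]:
    w = weight_gib(70, bits)
    kv = kv_budget_gib(80, 4, 70, bits)
    print(f"{name}: weights ~{w:.0f} GiB, KV budget ~{kv:.0f} GiB")
```

Every GiB the weights give up goes to the KV pool, which is what lets the scheduler admit more concurrent sequences.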
2. Weight-only vs weight-and-activation quantisation
- Weight-only (W8A16, W4A16) — weights quantised, activations stay fp16. Smaller quality loss and the simplest option to get right.
- Weight + activation (W8A8, FP8) — both quantised. Requires careful calibration; bigger throughput win.
Most production deployments are weight-only int8/int4. FP8 weights+activations on H100 is the rising standard.
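A minimal sketch of the weight-only path, assuming symmetric per-row int8 quantisation (one common choice, not the only one). It is pure Python for readability; real kernels fuse the dequantisation into the matmul. Function names are illustrative.

```python
def quantize_row_int8(row):
    """Symmetric per-row int8: the scale maps max |w| to 127."""
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate float weights from int8 codes."""
    return [v * scale for v in q]


def matvec_w8a16(q_rows, scales, x):
    """Weight-only path: int8 weights are dequantised, activations x stay unquantised."""
    return [sum(v * s * xi for v, xi in zip(q, x)) for q, s in zip(q_rows, scales)]


W = [[0.5, -1.0, 0.25], [0.02, 0.01, -0.04]]
quantised = [quantize_row_int8(row) for row in W]
q_rows = [q for q, _ in quantised]
scales = [s for _, s in quantised]
y = matvec_w8a16(q_rows, scales, [1.0, 2.0, 3.0])
```

The per-element rounding error is bounded by half the row's scale, which is why rows with a single large outlier quantise worst.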
3. The popular int4 methods
- GPTQ — calibration-set post-training quantisation. Layer-by-layer optimisation. Standard for years; widely supported.
- AWQ — Activation-aware. Identifies "salient" weight channels (driven by activation magnitude) and protects them with per-channel scaling before quantising, rather than storing them at higher precision. Often smaller quality loss than GPTQ at int4.
- GGUF / Q4_K_M / Q5_K_S (llama.cpp ecosystem) — multiple bit widths within one model; great for CPU + Apple Silicon.
For server-side GPU: AWQ int4 is currently the sweet spot.
4. Calibration data matters
Quantisation algorithms learn from a small calibration set. Use:
- ~128–512 samples.
- Representative of your serving distribution (not random web text).
- Diverse: mix instruction, chat, domain-specific samples.
Same model quantised with wrong calibration = noticeably worse on your workload.
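One way to assemble such a set is a fixed-seed stratified sample over prompt pools. This is a sketch under assumptions: the pool names and mixing fractions are hypothetical, and the output would still need tokenising in whatever form your quantisation tool expects.

```python
import random


def build_calibration_set(pools, fractions, n_samples=256, seed=0):
    """Draw a mixed calibration set.

    pools: dict of name -> list of prompt strings (from your own traffic).
    fractions: dict of name -> share of the final set (should sum to 1).
    """
    rng = random.Random(seed)  # fixed seed so the quantised model is reproducible
    samples = []
    for name, frac in fractions.items():
        k = min(round(n_samples * frac), len(pools[name]))
        samples.extend(rng.sample(pools[name], k))
    rng.shuffle(samples)
    return samples


# Hypothetical pools standing in for real logged prompts.
pools = {
    "chat": [f"chat prompt {i}" for i in range(300)],
    "instruction": [f"instruction prompt {i}" for i in range(300)],
    "domain": [f"domain prompt {i}" for i in range(300)],
}
calib = build_calibration_set(pools, {"chat": 0.5, "instruction": 0.25, "domain": 0.25})
```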
5. Speculative decoding
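Decoding is memory-bound: each step reads the weights and the KV cache to produce a single token, so the GPU's compute units sit largely idle. Idea: spend that spare compute checking several tokens in parallel.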
- A small draft model proposes the next N tokens (cheap, fast).
- The large target model verifies all N in one forward pass.
- Accepted tokens are kept; first rejection truncates and continues from there.
If the draft model is good, you get ~2–3× speedup with no quality loss: the acceptance rule preserves the target model's output distribution, so the target is still the authority.
draft proposes: "the quick brown fox jumps"
target verifies all 5 in parallel:
accepts "the quick brown" — rejects "fox"
re-samples token 4 itself: "dog"
draft restarts from "the quick brown dog"
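The loop above can be sketched with toy models. This is a simplified greedy variant (accept a draft token only if it equals the target's own choice), not the rejection-sampling rule from the Leviathan paper. The "models" here are stand-in functions that pick a token by position, purely for illustration.

```python
def speculative_decode(target_next, draft_next, prompt, max_new_tokens, n_draft=5):
    """Greedy speculative decoding with toy next-token functions.

    target_next / draft_next map a token list to the next token. A real engine
    scores all draft positions in one batched forward pass; the inner loop
    over positions stands in for that single pass.
    """
    tokens = list(prompt)
    target_passes = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        # 1. Draft proposes n_draft tokens autoregressively (cheap).
        draft = []
        for _ in range(n_draft):
            draft.append(draft_next(tokens + draft))
        # 2. Target verifies every position (one forward pass in practice).
        target_passes += 1
        accepted = []
        for i, proposed in enumerate(draft):
            expected = target_next(tokens + draft[:i])
            if proposed == expected:
                accepted.append(proposed)
            else:
                # 3. First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched: the same pass yields one bonus token.
            accepted.append(target_next(tokens + draft))
        tokens.extend(accepted)
    return tokens[: len(prompt) + max_new_tokens], target_passes


def make_toy_model(sentence):
    """Toy 'model': the next token depends only on how many tokens came before."""
    return lambda ctx: sentence[len(ctx)] if len(ctx) < len(sentence) else "<eos>"


TARGET = "the quick brown dog sleeps in the warm sun".split()
DRAFT = "the quick brown fox sleeps in the warm sun".split()

output, passes = speculative_decode(
    make_toy_model(TARGET), make_toy_model(DRAFT), prompt=[], max_new_tokens=9
)
```

The output is identical to decoding with the target alone, but the number of target passes drops because each pass can commit several tokens.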
Variants:
- Self-speculative — same model with early-layer shortcut as draft (no separate model needed).
- Lookahead decoding — draft via n-gram pattern matching (no model needed).
- EAGLE / Medusa — small "head" branches on top of base model; cheap draft.
6. Chunked prefill
Prefill of a long prompt blocks all in-flight decodes because it monopolises the GPU. Chunked prefill splits the prefill into segments and interleaves them with decode steps. Result: smoother TTFT for short queries arriving during a long prefill (less head-of-line blocking) and steadier inter-token latency for in-flight decodes.
vLLM and TGI both support it. In vLLM, max_num_batched_tokens sets the per-step token budget, which bounds the chunk size.
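A toy scheduler shows how a per-step token budget produces the interleaving. This is a sketch of the idea only: the `Request` class and `schedule_step` function are hypothetical, and the decode-first policy is an assumption about how an engine might prioritise, not a copy of any engine's scheduler.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    prompt_tokens: int   # total prompt length
    prefilled: int = 0   # prompt tokens already processed

    @property
    def decoding(self) -> bool:
        return self.prefilled >= self.prompt_tokens


def schedule_step(requests, max_num_batched_tokens):
    """Return {request name: tokens processed this step} under a shared budget.

    Decodes go first (one token each); leftover budget is spent on prefill chunks.
    """
    budget = max_num_batched_tokens
    plan = {}
    for r in requests:
        if r.decoding and budget > 0:
            plan[r.name] = 1
            budget -= 1
    for r in requests:
        if not r.decoding and budget > 0:
            chunk = min(budget, r.prompt_tokens - r.prefilled)
            plan[r.name] = chunk
            r.prefilled += chunk
            budget -= chunk
    return plan


reqs = [Request("chat_a", 10, 10), Request("chat_b", 10, 10), Request("long_doc", 1000)]
steps = [schedule_step(reqs, max_num_batched_tokens=512) for _ in range(3)]
```

With a budget of 512, the 1000-token prompt is prefilled over two steps while both chat requests keep emitting one token per step.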
7. Prefix caching (already touched in S30)
Cache KV blocks for repeated prefixes. Hits common with:
- System prompts (every request shares).
- Few-shot exemplars.
- Conversation history (each chat turn extends from previous).
- RAG with same chunks for similar queries.
vLLM provides this as automatic prefix caching, controlled by --enable-prefix-caching. The cache lives in GPU memory and is evicted in LRU order. Typical 2–5× throughput improvement on chat workloads.
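The mechanism can be sketched as a block-level cache keyed by the full token prefix up to each block boundary. This is a toy model under assumptions: it stores no actual KV tensors, it keys on raw token tuples where a real engine would use hashes, and its plain LRU eviction is simpler than a production policy.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy block-level prefix cache with LRU eviction (tracks keys only)."""

    def __init__(self, block_size=4, capacity_blocks=8):
        self.block_size = block_size
        self.capacity = capacity_blocks
        self.blocks = OrderedDict()  # key: token prefix ending at a block boundary

    def _block_keys(self, tokens):
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[:end])
                for end in range(self.block_size, full + 1, self.block_size)]

    def lookup_and_insert(self, tokens):
        """Return how many prompt tokens can reuse cached KV, then cache the rest."""
        keys = self._block_keys(tokens)
        hit_blocks = 0
        for key in keys:
            if key not in self.blocks:
                break  # a prefix hit must be contiguous from the start
            hit_blocks += 1
            self.blocks.move_to_end(key)  # mark as recently used
        for key in keys[hit_blocks:]:
            self.blocks[key] = True
            if len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)  # evict least recently used
        return hit_blocks * self.block_size


cache = PrefixCache(block_size=4, capacity_blocks=8)
system_prompt = list(range(8))                      # shared by every request
first = system_prompt + [100, 101, 102, 103, 104]   # 13 tokens, 3 full blocks
second = system_prompt + [200, 201, 202, 203]       # same system prompt
```

Only whole blocks are reusable, so the trailing partial block of a prompt is always recomputed.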
8. Multi-LoRA serving
Scenario: 100 customers, each with their own fine-tuned adapter. Don't deploy 100 separate models.
Pattern:
- Load base model once.
- Adapter (LoRA delta) loaded per request, small (~50 MB).
- Use kernels that compute (W + B·A)·x on the fly.
vLLM supports this with --enable-lora, and TGI offers an equivalent multi-LoRA feature. A single GPU can serve hundreds of LoRAs.
Big architectural win: one fleet for all customers, vs N fleets. Cost down 10–50×.
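Computing (W + B·A)·x "on the fly" means evaluating W·x + B·(A·x) without ever building the merged matrix, so the base weights stay shared across tenants. The sketch below checks that identity in pure Python with tiny matrices. The tenant names and the `adapters` dictionary are hypothetical; production kernels batch many adapters in a single launch.

```python
def matvec(M, v):
    """Plain matrix-vector product over nested lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def lora_forward(W, A, B, x, scaling=1.0):
    """y = W·x + scaling · B·(A·x); W is shared, (A, B) is the per-tenant adapter."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))  # A is r × d_in, B is d_out × r
    return [b + scaling * l for b, l in zip(base, low_rank)]


def merged_forward(W, A, B, x):
    """Reference path: materialise W + B·A, then multiply."""
    rank = len(A)
    merged = [[W[i][j] + sum(B[i][k] * A[k][j] for k in range(rank))
               for j in range(len(W[0]))] for i in range(len(W))]
    return matvec(merged, x)


W = [[1, 2, 3], [4, 5, 6]]                      # shared base weights (2 × 3)
adapters = {                                    # hypothetical per-tenant adapters, rank 1
    "tenant_a": ([[1, 0, -1]], [[2], [3]]),
    "tenant_b": ([[0, 1, 0]], [[1], [-1]]),
}
x = [1, 2, 3]
outputs = {name: lora_forward(W, A, B, x) for name, (A, B) in adapters.items()}
```

The low-rank path costs two thin matmuls per request, which is why swapping adapters per request is cheap compared with swapping models.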
9. Parallelism for serving
Recap from S30:
- Tensor parallel (TP) — splits weights across GPUs in one node. Low latency, high bandwidth (NVLink). Default for inference.
- Pipeline parallel (PP) — layers on different nodes. Adds latency per pipeline stage; suited to networks with weak interconnect.
- Expert parallel (EP) — MoE; experts across GPUs.
For 70B on 4× A100: TP=4. For 405B on 8× H100: TP=8 (or TP=4 PP=2), with FP8 weights, since 16-bit weights alone (~810 GB) exceed 8× 80 GB.
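A quick feasibility check for a parallelism plan is to divide the weight bytes across the TP × PP ranks. The sketch assumes an even split, 80 GB GPUs, and a 0.9 headroom factor chosen for illustration. It ignores KV cache and activations, so passing the check is necessary but not sufficient.

```python
def per_gpu_weight_gb(params_billion, bits, tp=1, pp=1):
    """Weights per GPU in GB (1e9 bytes), assuming an even split over TP × PP ranks."""
    return params_billion * bits / 8 / (tp * pp)


def weights_fit(params_billion, bits, gpu_gb, tp=1, pp=1, headroom=0.9):
    """True if the weight shard leaves at least (1 - headroom) of the GPU free."""
    return per_gpu_weight_gb(params_billion, bits, tp, pp) < gpu_gb * headroom


plans = {
    "70B fp16, TP=4": weights_fit(70, 16, 80, tp=4),
    "405B fp16, TP=8": weights_fit(405, 16, 80, tp=8),
    "405B fp8, TP=8": weights_fit(405, 8, 80, tp=8),
}
```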
10. Throughput tuning checklist
- max_num_batched_tokens — chunked prefill threshold.
- max_num_seqs — max in-flight requests.
- gpu_memory_utilization — fraction of GPU reserved (0.85–0.92 typical).
- kv_cache_dtype — fp16 vs fp8 (fp8 doubles KV capacity).
- Quantised weights (int8/int4).
- Prefix caching on.
- Speculative decoding on if you have a good draft.
Each lever interacts. Tune one at a time, measure, retune. Spreadsheet of (config, throughput, p50, p99) is your friend.
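The "one lever at a time" loop can be scripted so the spreadsheet fills itself. In this sketch `run_benchmark` is a hypothetical hook you would implement against your own load generator, and `fake_benchmark` is a stand-in so the example runs. The config keys mirror the checklist names but are only dictionary entries here.

```python
import statistics


def sweep_one_knob(baseline, knob, candidates, run_benchmark):
    """Vary a single knob, hold the rest at baseline, record one row per run.

    run_benchmark(config) must return (tokens_per_s, list_of_latencies_ms).
    """
    rows = []
    for value in candidates:
        config = {**baseline, knob: value}
        tokens_per_s, latencies_ms = run_benchmark(config)
        cuts = statistics.quantiles(latencies_ms, n=100)  # 99 percentile cut points
        rows.append({**config, "tokens_per_s": tokens_per_s,
                     "p50_ms": cuts[49], "p99_ms": cuts[98]})
    return rows


def fake_benchmark(config):
    """Stand-in for a real load test; numbers are made up."""
    return 100.0 * config["max_num_seqs"], [10.0 + i for i in range(200)]


baseline = {
    "max_num_batched_tokens": 2048,
    "max_num_seqs": 64,
    "gpu_memory_utilization": 0.90,
    "kv_cache_dtype": "fp8",
}
rows = sweep_one_knob(baseline, "max_num_seqs", [32, 64, 128], fake_benchmark)
```

Each row is a (config, throughput, p50, p99) record, ready to dump to CSV and compare against the SLO before moving to the next knob.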
11. Observability for inference
- Token-level metrics — input tokens, output tokens, accepted-speculative-tokens.
- Phase timings — prefill ms, decode ms/token.
- Queue depth — pending requests.
- KV cache occupancy — fraction full; alert at 90%.
- Prefix-cache hit rate — should be 30–80% for chat workloads.
Standard exporters (Prometheus) ship with vLLM/TGI. Wire to Grafana, set SLO alerts.
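The ratios worth alerting on are simple to derive from raw counters. The counter names below are hypothetical placeholders, not the metric names any exporter actually emits. The thresholds come from the list above: alert at 90% KV occupancy, and expect a 30–80% prefix-cache hit rate for chat.

```python
def inference_health(counters, kv_alert=0.90, min_hit_rate=0.30):
    """Turn raw counters into ratios and a list of alert names."""
    accept_rate = counters["spec_accepted_tokens"] / max(counters["spec_draft_tokens"], 1)
    kv_occupancy = counters["kv_blocks_used"] / counters["kv_blocks_total"]
    hit_rate = counters["prefix_cache_hits"] / max(counters["prefix_cache_queries"], 1)
    decode_ms_per_token = counters["decode_ms"] / max(counters["output_tokens"], 1)

    alerts = []
    if kv_occupancy >= kv_alert:
        alerts.append("kv_cache_occupancy_high")
    if hit_rate < min_hit_rate:
        alerts.append("prefix_cache_hit_rate_low")
    return {
        "spec_accept_rate": accept_rate,
        "kv_occupancy": kv_occupancy,
        "prefix_hit_rate": hit_rate,
        "decode_ms_per_token": decode_ms_per_token,
        "alerts": alerts,
    }


# Made-up sample values for illustration.
sample = {
    "spec_draft_tokens": 1000, "spec_accepted_tokens": 700,
    "kv_blocks_used": 950, "kv_blocks_total": 1000,
    "prefix_cache_hits": 40, "prefix_cache_queries": 100,
    "decode_ms": 5000, "output_tokens": 250,
}
report = inference_health(sample)
```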
12. Reality check
Cost-minimised stack circa 2026:
- vLLM with int4 AWQ weights + fp8 KV cache.
- Prefix caching enabled.
- Self-speculative decoding via Medusa or EAGLE.
- Multi-LoRA for tenant isolation.
- TP=4 on a node of 4× A100 or H100.
- Chunked prefill for smooth tail latency.
That setup serves 4–8× more tokens/sec than naive fp16 vLLM. Same hardware, same SLOs, ~5× cheaper per token. The math is obvious; the engineering is real work.
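The per-token cost follows directly from GPU price and sustained throughput. The prices and throughputs below are assumptions picked for round numbers, not measurements.

```python
def cost_per_million_tokens(gpu_hourly_usd, n_gpus, tokens_per_second):
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * n_gpus / tokens_per_hour * 1_000_000


# Assumed: 4 GPUs at $2/hour each; the tuned stack sustains 5× the naive throughput.
naive = cost_per_million_tokens(2.0, 4, tokens_per_second=1000)
tuned = cost_per_million_tokens(2.0, 4, tokens_per_second=5000)
print(f"naive: ${naive:.2f}/M tokens, tuned: ${tuned:.2f}/M tokens")
```

Because hardware cost is fixed per hour, cost per token scales inversely with throughput: a 5× throughput gain is a 5× cost reduction, provided the SLOs still hold.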
Reading material
Books:
- AI Engineering — Chip Huyen (the inference economics + cost chapters)
- Generative Deep Learning, 2nd ed. — David Foster (the chapter on autoregressive sampling for the sampling background you need)
- Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (the deployment + quantisation chapter)
Papers:
- Fast Inference from Transformers via Speculative Decoding — Leviathan, Kalman, Matias 2023 — the original speculative decoding paper.
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al. 2022 — 3-4 bit weight quantisation that actually works.
- AWQ: Activation-aware Weight Quantization for LLM Compression — Lin et al. 2023 — a quantisation method supported by both vLLM and TGI.
- Medusa: Simple LLM Inference Acceleration with Multiple Decoding Heads — Cai et al. 2024 — speculative decoding without a separate draft model.
- EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Li et al. 2024 — a leading spec-decoding head, supported in vLLM.
- SmoothQuant: Accurate and Efficient Post-Training Quantization for LLMs — Xiao et al. 2023 — activation quantisation made practical.
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision — Shah et al. 2024 — the H100-era attention kernel.
Official docs:
- vLLM — Performance Optimization guide — chunked prefill, prefix caching, speculative decoding settings.
- vLLM — Quantization support — AWQ, GPTQ, FP8, INT8 in production.
- Hugging Face Optimum — the HF quantisation + acceleration library.
- NVIDIA TensorRT-LLM — Speculative Decoding — production multi-token speculation.
Blog posts:
- Together AI — Sequoia: speculative decoding with a tree — how production speculative decoding actually works.
- PyTorch blog — Accelerating Generative AI with PyTorch II: GPT, Fast — Horace He's 10x speedup walkthrough.
- Anyscale — Throughput is all you need — the right mental model for batch sizing.
In-depth research material
- vLLM — github.com/vllm-project/vllm — ~31k ★, the canonical OSS engine with continuous batching + AWQ + FP8 + spec decoding.
- SGLang — github.com/sgl-project/sglang — ~7k ★, LMSYS engine with the fastest prefix caching and structured output.
- TensorRT-LLM — github.com/NVIDIA/TensorRT-LLM — ~9k ★, the NVIDIA max-throughput engine.
- llama.cpp — github.com/ggerganov/llama.cpp — ~67k ★, GGUF quantisation reference; the CPU/edge inference standard.
- exllamav2 — github.com/turboderp/exllamav2 — ~3.8k ★, the fastest single-GPU 4-bit inference engine for consumer hardware.
- AutoAWQ — github.com/casper-hansen/AutoAWQ — ~1.9k ★, the canonical AWQ implementation.
- AutoGPTQ — github.com/AutoGPTQ/AutoGPTQ — ~4.8k ★, the canonical GPTQ implementation.
- PyTorch GPT-Fast — github.com/pytorch-labs/gpt-fast — ~5.6k ★, the PyTorch team's reference for torch.compile + spec decoding.
- Mooncake: Trading More Storage for Less Computation — Kimi 2024 — the disaggregated KV-cache architecture from Moonshot AI (Kimi).
- Character.AI Research — Optimizing inference at Character.AI — the most detailed real-world production tuning story (over 20k qps across the service).
- Tim Dettmers — Which GPU(s) to Get for Deep Learning — the only inference economics primer you need.
Videos
- Speculative Decoding: When Two LLMs are Faster than One — Mark Saroufim (PyTorch) — 31 min — the cleanest visual explanation; pairs perfectly with the Leviathan paper.
- Quantization in Depth — Hugging Face — 1 h 03 min — int8/int4/FP8/AWQ/GPTQ from first principles.
- Accelerating Generative AI: From Slow to Fast — Horace He (PyTorch) — 56 min — the GPT-Fast 10x talk; CUDA graphs + spec decoding + quantisation combined.
- Inside vLLM 2.0 — Woosuk Kwon (vLLM lead) — 58 min — chunked prefill, prefix caching, multi-LoRA, FP8.
- GPU MODE: Lecture 14 — TensorRT-LLM internals — 1 h 22 min — NVIDIA's serving engine from a GPU-kernel viewpoint.
LeetCode — Longest Palindromic Substring
- Link: https://leetcode.com/problems/longest-palindromic-substring/
- Difficulty: Medium
- Why this problem: Speculative-decode verification is essentially "longest accepted prefix of draft"; both problems are about matching segments efficiently.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Pick the right quantisation method (int8/int4/FP8/AWQ/GPTQ) for a given GPU and workload.
- Explain why decoding is memory-bound and how speculative decoding exploits it.
- Configure prefix caching and predict the throughput impact for a chat workload.
- Stand up multi-LoRA serving with vLLM/TGI.
- Tune max_num_batched_tokens, gpu_memory_utilization, and KV-cache dtype.
- Solve longest-palindromic-substring — prefix-matching primitive used in spec verification.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.