ai ml · intermediate · 12m · 2026-06-09

LLM Serving Part 2 — Speculative Decoding, Quantisation, Throughput

Session 34 of the 48-session learning series.

Why this session matters

This is Session 34 of 48 in the LLM track. Part 1 covered the engine (vLLM, KV cache, batching). Part 2 covers the levers — quantisation, speculative decoding, throughput tuning — that turn that engine from "works" to "$0.50 per million tokens". If you operate any non-trivial LLM workload, these techniques are no longer optional.

Agenda

Quantisation — int8, int4, FP8, AWQ, GPTQ; what to pick
Speculative decoding — small draft + big verifier
Chunked prefill, prefix caching, lookahead decoding
Multi-LoRA serving — many adapters, one base
Throughput tuning — batch size, KV pool, parallelism

Pre-read (skim before the session)

Deep dive

1. Quantisation — the headline lever

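Reducing the precision of weights (and optionally activations) shrinks the memory footprint and speeds up inference: decode is limited by how fast weights can be read, so fewer bytes per weight means faster steps.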

Format	Bits	Memory ratio	Quality loss	Use
fp32	32	1.0×	reference	training
fp16 / bf16	16	0.5×	none	training + serving
fp8	8	0.25×	minimal	serving (GPUs with native FP8, e.g. H100)
int8	8	0.25×	minimal	serving (any GPU)
int4	4	0.125×	0.5–1% benchmarks	serving
int3	3	0.094×	2–5%	experimental

Memory savings cascade: smaller weights → more room for KV cache → more concurrent requests → higher throughput.
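The cascade can be put into rough numbers. The sketch below assumes an 8B-parameter model with 32 layers, 8 KV heads and head dimension 128 on a 24 GB GPU; all of those figures are illustrative assumptions, and the arithmetic ignores quantisation scales and activation memory.

```python
# Rough memory arithmetic: how weight precision changes the room left for KV cache.
# Model shape and GPU size are illustrative assumptions, not measurements.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(params_billion, fmt):
    """Approximate weight footprint in GB (ignores scale/zero-point overhead)."""
    return params_billion * BYTES_PER_PARAM[fmt]


def kv_bytes_per_token(layers, kv_heads, head_dim, kv_bytes=2.0):
    """Bytes of K and V stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * kv_bytes


def kv_token_capacity(gpu_gb, params_billion, fmt, layers, kv_heads, head_dim,
                      utilization=0.9):
    """Tokens of KV cache that fit after the weights are loaded."""
    free_gb = gpu_gb * utilization - weight_gb(params_billion, fmt)
    if free_gb <= 0:
        return 0
    return int(free_gb * 1e9 / kv_bytes_per_token(layers, kv_heads, head_dim))


shape = dict(layers=32, kv_heads=8, head_dim=128)
cap_fp16 = kv_token_capacity(24, 8, "fp16", **shape)
cap_int4 = kv_token_capacity(24, 8, "int4", **shape)
print(f"fp16 weights: {cap_fp16} KV tokens, int4 weights: {cap_int4} KV tokens")
```

Under these assumptions int4 weights leave roughly three times as many KV-cache tokens as fp16 weights, which is where the extra concurrency comes from.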

2. Weight-only vs weight-and-activation quantisation

Weight-only (W8A16, W4A16) — weights quantised, activations stay fp16. Simpler to apply, with less quality loss and little calibration risk.
Weight + activation (W8A8, FP8) — both quantised. Requires careful calibration; bigger throughput win.

Most production deployments are weight-only int8/int4. FP8 weights+activations on H100 is the rising standard.
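To make the W8A16 idea concrete, here is a minimal pure-Python sketch of weight-only quantisation: weights are stored as int8 with one scale per output row, and activations stay in floating point. The function names are hypothetical, and real kernels fuse the dequantisation into the matmul rather than looping like this.

```python
def quantize_row_int8(row):
    """Symmetric absmax quantisation of one weight row to int8 plus a float scale."""
    scale = max(abs(w) for w in row) / 127 or 1.0  # avoid a zero scale for all-zero rows
    q = [max(-127, min(127, round(w / scale))) for w in row]
    return q, scale


def matvec_w8a16(q_rows, scales, x):
    """y = W x with int8 weights dequantised on the fly; x is left unquantised."""
    return [
        scale * sum(q * xi for q, xi in zip(q_row, x))
        for q_row, scale in zip(q_rows, scales)
    ]


W = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
x = [1.0, 2.0, -1.0]

quantised = [quantize_row_int8(row) for row in W]
q_rows = [q for q, _ in quantised]
scales = [s for _, s in quantised]

y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in W]
y_q = matvec_w8a16(q_rows, scales, x)
print(y_ref, y_q)
```

W8A8 would additionally quantise `x`, which is where the calibration difficulty comes from: activation ranges depend on the input data, weight ranges do not.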

3. The popular int4 methods

GPTQ — calibration-set post-training quantisation. Layer-by-layer optimisation. Standard for years; widely supported.
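AWQ — Activation-aware Weight Quantization. Identifies "salient" weight channels (driven by activation magnitude) and protects them by scaling them up before quantisation, rather than keeping them at higher precision. Often smaller quality loss than GPTQ at int4.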
GGUF / Q4_K_M / Q5_K_S (llama.cpp ecosystem) — multiple bit widths within one model; great for CPU + Apple Silicon.

For server-side GPU: AWQ int4 is currently the sweet spot.
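A toy illustration of the activation-aware idea: scaling up a weight whose input channel carries large activations before round-to-nearest int4 shrinks the output error. This is a sketch under stated assumptions, not the AutoAWQ implementation; the weights, activations and the scale of 4 are hand-picked, whereas the real method searches for scales on calibration data.

```python
def quantize_int4(weights):
    """Symmetric round-to-nearest int4 with one step size per group; returns dequantised values."""
    step = max(abs(w) for w in weights) / 7
    return [max(-7, min(7, round(w / step))) * step for w in weights]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def output_error(weights, activations, channel_scales):
    """|quantised output - exact output| when channel i is scaled by s_i before quantisation.

    Dividing the dequantised weight by s_i is equivalent to folding 1/s_i into the activation.
    """
    scaled = [w * s for w, s in zip(weights, channel_scales)]
    dequant = [wq / s for wq, s in zip(quantize_int4(scaled), channel_scales)]
    return abs(dot(dequant, activations) - dot(weights, activations))


weights = [0.9, -0.6, 0.05, 0.3]     # channel 2 is a small weight...
activations = [0.1, 0.2, 8.0, 0.1]   # ...but sees by far the largest activation

err_plain = output_error(weights, activations, [1, 1, 1, 1])
err_scaled = output_error(weights, activations, [1, 1, 4, 1])
print(f"plain RTN error: {err_plain:.3f}, with salient-channel scaling: {err_scaled:.3f}")
```

Without scaling, the small salient weight rounds to zero and its large activation contribution is lost; with scaling it survives quantisation.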

4. Calibration data matters

Quantisation algorithms learn from a small calibration set. Use:

~128–512 samples.
Representative of your serving distribution (not random web text).
Diverse: mix instruction, chat, domain-specific samples.

The same model quantised with unrepresentative calibration data is noticeably worse on your workload.
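One way to assemble such a set is a weighted sample across traffic categories. The sketch below is a minimal illustration; the bucket names, weights and the `build_calibration_set` helper are hypothetical, and in practice the samples would come from your own request logs.

```python
import random


def build_calibration_set(buckets, total=256, seed=0):
    """Draw a mixed calibration set.

    buckets maps a category name to (traffic_weight, list_of_samples).
    Each category gets a share of `total` proportional to its weight.
    """
    rng = random.Random(seed)  # fixed seed so the set is reproducible
    weight_sum = sum(weight for weight, _ in buckets.values())
    picked = []
    for weight, samples in buckets.values():
        quota = min(len(samples), round(total * weight / weight_sum))
        picked.extend(rng.sample(samples, quota))
    rng.shuffle(picked)
    return picked


# Placeholder data standing in for real logged prompts.
buckets = {
    "chat": (0.5, [f"chat-{i}" for i in range(200)]),
    "instruction": (0.3, [f"instr-{i}" for i in range(200)]),
    "domain": (0.2, [f"domain-{i}" for i in range(200)]),
}
calibration = build_calibration_set(buckets, total=256)
print(len(calibration), calibration[:3])
```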

5. Speculative decoding

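Decoding is memory-bound: each step reads all the weights and the KV cache to produce one token per sequence, so the GPU's compute units sit largely idle. Idea: use the spare compute to check several tokens in parallel.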

A small draft model proposes the next N tokens (cheap, fast).
The large target model verifies all N in one forward pass.
Accepted tokens are kept; first rejection truncates and continues from there.

If the draft model is good, you get ~2–3× speedup with no quality loss: the acceptance rule guarantees the output follows the target model's distribution, so the target is mathematically still the authority.
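The speedup can be estimated with the expected-tokens formula from the Leviathan et al. paper. The sketch assumes each drafted token is accepted independently with the same probability, and the draft cost ratio (draft step time divided by target step time) is an assumed input you would measure yourself.

```python
def expected_tokens_per_target_pass(acceptance_rate, draft_len):
    """Expected tokens produced per target forward pass: (1 - a^(g+1)) / (1 - a)."""
    if acceptance_rate >= 1.0:
        return draft_len + 1  # every draft token accepted, plus the target's bonus token
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


def expected_speedup(acceptance_rate, draft_len, draft_cost_ratio):
    """Wall-clock improvement once the cost of running the draft model is included."""
    tokens = expected_tokens_per_target_pass(acceptance_rate, draft_len)
    return tokens / (draft_len * draft_cost_ratio + 1)


# Assumed values: 80% acceptance, 4 drafted tokens, draft step costs 5% of a target step.
print(expected_tokens_per_target_pass(0.8, 4))
print(expected_speedup(0.8, 4, 0.05))
```

With these assumed inputs the estimate lands inside the 2–3× range; a weaker draft (lower acceptance) or a slower one (higher cost ratio) pulls it towards 1×.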

draft proposes: "the quick brown fox jumps"
target verifies all 5 in parallel:
   accepts "the quick brown" — rejects "fox"
   re-samples token 4 itself: "dog"
   draft restarts from "the quick brown dog"
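The walk-through above can be sketched for the greedy-decoding case, where "accept" simply means the draft token equals the target's own choice. The `target_next_token` callable is a hypothetical stand-in for the large model; a real engine scores all draft positions in one batched forward pass instead of looping, and sampling-based decoding uses a probabilistic acceptance rule rather than exact match.

```python
def verify_greedy(prefix, draft_tokens, target_next_token):
    """Return the tokens to append after one draft-and-verify step (greedy decoding)."""
    accepted = []
    for proposed in draft_tokens:
        expected = target_next_token(prefix + accepted)
        if proposed != expected:
            accepted.append(expected)  # the target's token replaces the first rejection
            return accepted
        accepted.append(proposed)
    # All drafts accepted: the target's next prediction comes for free.
    accepted.append(target_next_token(prefix + accepted))
    return accepted


# Toy target model: always continues a fixed sentence.
SENTENCE = ["the", "quick", "brown", "dog", "sleeps", "soundly"]


def toy_target(context):
    return SENTENCE[len(context)]


step = verify_greedy([], ["the", "quick", "brown", "fox", "jumps"], toy_target)
print(step)
```

Each step always yields at least one token (the target's own), so a poor draft costs draft compute but never changes the output.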

Variants:

Self-speculative — same model with early-layer shortcut as draft (no separate model needed).
Lookahead decoding — draft via n-gram pattern matching (no model needed).
EAGLE / Medusa — small "head" branches on top of base model; cheap draft.

6. Chunked prefill

Prefill of a long prompt blocks all in-flight decodes (it monopolises the GPU). Chunked prefill splits the prefill into segments and interleaves them with decode steps. Result: smoother TTFT for short queries that arrive during a long prefill, and steadier inter-token latency for requests already decoding.

vLLM and TGI both support it. In vLLM, max_num_batched_tokens sets the per-step token budget, which bounds the chunk size.
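A minimal sketch of the scheduling idea, assuming decodes are served first (one token each) and whatever is left of the per-step token budget goes to prefill chunks. `schedule_step` is a hypothetical helper, not vLLM's scheduler; `token_budget` plays the role of max_num_batched_tokens.

```python
def schedule_step(decoding, prefill_remaining, token_budget):
    """Plan one engine step as a list of (request_id, tokens_to_process).

    decoding: ids of requests currently generating (1 token each per step).
    prefill_remaining: dict id -> prompt tokens not yet prefilled (updated in place).
    """
    plan = [(req, 1) for req in decoding[:token_budget]]
    budget = token_budget - len(plan)
    for req, remaining in list(prefill_remaining.items()):
        if budget == 0:
            break
        chunk = min(remaining, budget)
        plan.append((req, chunk))
        budget -= chunk
        if chunk == remaining:
            del prefill_remaining[req]  # prompt fully prefilled
        else:
            prefill_remaining[req] = remaining - chunk
    return plan


pending = {"long": 1000}
step1 = schedule_step(["a", "b"], pending, token_budget=512)
step2 = schedule_step(["a", "b"], pending, token_budget=512)
print(step1)
print(step2)
```

The 1000-token prompt takes two steps, and requests "a" and "b" keep producing a token in both instead of stalling until the prefill finishes.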

7. Prefix caching (already touched in S30)

Cache KV blocks for repeated prefixes. Hits common with:

System prompts (every request shares).
Few-shot exemplars.
Conversation history (each chat turn extends from previous).
RAG with same chunks for similar queries.

In vLLM the flag is --enable-prefix-caching (some recent versions turn it on by default). The cache lives in GPU memory and is evicted in LRU order. Typical 2–5× throughput improvement on chat workloads.
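The mechanism can be sketched as a block-level cache keyed by a hash chain over the prompt tokens, with LRU eviction. This is an illustrative toy, not vLLM's implementation: the `PrefixCache` class, the block size of 4 and the placeholder objects standing in for KV tensors are all assumptions.

```python
from collections import OrderedDict


class PrefixCache:
    def __init__(self, capacity_blocks, block_size=4):
        self.capacity = capacity_blocks
        self.block_size = block_size
        self.blocks = OrderedDict()  # block key -> placeholder for the KV data

    def _block_keys(self, tokens):
        """One key per full block; each key also depends on all preceding blocks."""
        keys, parent = [], None
        full = len(tokens) - len(tokens) % self.block_size
        for start in range(0, full, self.block_size):
            parent = hash((parent, tuple(tokens[start:start + self.block_size])))
            keys.append(parent)
        return keys

    def lookup_and_insert(self, tokens):
        """Return how many leading prompt tokens were already cached, then cache the rest."""
        hit_blocks, still_matching = 0, True
        for key in self._block_keys(tokens):
            if key in self.blocks:
                self.blocks.move_to_end(key)  # mark as recently used
                if still_matching:
                    hit_blocks += 1
            else:
                still_matching = False
                self.blocks[key] = object()
                if len(self.blocks) > self.capacity:
                    self.blocks.popitem(last=False)  # evict least recently used
        return hit_blocks * self.block_size


cache = PrefixCache(capacity_blocks=16)
system_prompt = list(range(8))  # token ids shared by every request
first = cache.lookup_and_insert(system_prompt + [100, 101, 102, 103])
second = cache.lookup_and_insert(system_prompt + [200, 201, 202, 203])
print(first, second)
```

The second request skips prefill for the 8 shared system-prompt tokens and only computes its own 4.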

8. Multi-LoRA serving

Scenario: 100 customers, each with their own fine-tuned adapter. Don't deploy 100 separate models.

Pattern:

Load base model once.
Adapter (LoRA delta) loaded per request; small (~50 MB).
Use kernels that compute W·x + B·(A·x) on the fly, which equals (W + B·A)·x without materialising a merged weight per adapter.
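A small pure-Python check of why the pattern works: applying the low-rank update as two thin matrix-vector products gives the same result as merging it into the base weight, so the base weight can be shared across tenants. The matrices and the per-tenant `adapters` dictionary are made-up illustrations; production kernels batch many adapters in a single launch.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def lora_forward(W, A, B, x):
    """Shared base path W x plus the per-request low-rank path B (A x)."""
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, low_rank)]


def merged_forward(W, A, B, x):
    """Reference: materialise W + B A first (what multi-LoRA serving avoids)."""
    delta = matmul(B, A)
    merged = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]
    return matvec(merged, x)


W = [[1, 2, 3], [4, 5, 6]]                   # base weight, loaded once
adapters = {"tenant_a": ([[1, 0, -1]], [[2], [3]])}  # rank-1 (A, B) pair per tenant
x = [1, 2, 0]

A, B = adapters["tenant_a"]
print(lora_forward(W, A, B, x), merged_forward(W, A, B, x))
```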

vLLM supports this with --enable-lora; TGI has an equivalent feature. A single GPU can serve hundreds of LoRAs.

Big architectural win: one fleet for all customers, vs N fleets. Cost down 10–50×.

9. Parallelism for serving

Recap from S30:

Tensor parallel (TP) — splits weights across GPUs in one node. Low latency, high bandwidth (NVLink). Default for inference.
Pipeline parallel (PP) — layers on different nodes. Adds latency per pipeline stage; suited to networks with weak interconnect.
Expert parallel (EP) — MoE; experts across GPUs.

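For 70B on 4× A100 (80 GB): TP=4. For 405B on 8× H100: TP=8 (or TP=4 PP=2), with FP8 or lower-precision weights, since 405B in bf16 is ~810 GB and the node has 640 GB.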
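A quick way to sanity-check a parallelism plan is to compute the weight shard per GPU and see what is left for KV cache. The sketch assumes weights split evenly across TP × PP ranks and reserves 20% of each GPU for KV cache and activations; the headroom figure and helper names are assumptions.

```python
def weights_per_gpu_gb(params_billion, bytes_per_param, tp, pp=1):
    """Weight shard per GPU in GB, assuming an even split across tp * pp ranks."""
    return params_billion * bytes_per_param / (tp * pp)


def fits(params_billion, bytes_per_param, gpu_gb, tp, pp=1, kv_headroom=0.2):
    """True if the shard leaves at least `kv_headroom` of the GPU for KV cache and activations."""
    shard = weights_per_gpu_gb(params_billion, bytes_per_param, tp, pp)
    return shard <= gpu_gb * (1 - kv_headroom)


print("70B fp16, TP=4, 80 GB GPUs:", fits(70, 2, 80, tp=4))
print("70B fp16, TP=4, 40 GB GPUs:", fits(70, 2, 40, tp=4))
print("405B bf16, TP=8, 80 GB GPUs:", fits(405, 2, 80, tp=8))
print("405B fp8, TP=8, 80 GB GPUs:", fits(405, 1, 80, tp=8))
```

The same total GPU count gives the same shard size whether it is reached with TP alone or a TP × PP mix; the choice between them is about interconnect and latency, not memory.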

10. Throughput tuning checklist

max_num_batched_tokens — token budget per scheduler step; with chunked prefill it bounds the chunk size.
max_num_seqs — max in-flight requests.
gpu_memory_utilization — fraction of GPU reserved (0.85–0.92 typical).
kv_cache_dtype — fp16 vs fp8 (fp8 doubles KV capacity).
Quantised weights (int8/int4).
Prefix caching on.
Speculative decoding on if you have a good draft.

Each lever interacts. Tune one at a time, measure, retune. Spreadsheet of (config, throughput, p50, p99) is your friend.
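The spreadsheet can be generated rather than typed. A minimal sketch, assuming you already have per-request latencies and a total output-token count from a load test; `summarise_run` and the column names are hypothetical, and the numbers below are synthetic.

```python
import csv
import io
import statistics


def summarise_run(config, latencies_s, output_tokens, wall_clock_s):
    """One spreadsheet row: the config plus throughput and latency percentiles."""
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")  # 99 cut points
    return {
        **config,
        "throughput_tok_s": round(output_tokens / wall_clock_s, 1),
        "p50_s": round(cuts[49], 3),
        "p99_s": round(cuts[98], 3),
    }


def to_csv(rows):
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Synthetic latencies standing in for a real benchmark run.
latencies = [i / 100 for i in range(1, 101)]
row = summarise_run(
    {"max_num_seqs": 128, "kv_cache_dtype": "fp8"},
    latencies, output_tokens=5000, wall_clock_s=10.0,
)
csv_text = to_csv([row])
print(csv_text)
```

Changing one config key per run and appending a row keeps the comparison honest: any change in throughput or p99 is attributable to a single lever.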

11. Observability for inference

Token-level metrics — input tokens, output tokens, accepted-speculative-tokens.
Phase timings — prefill ms, decode ms/token.
Queue depth — pending requests.
KV cache occupancy — fraction full; alert at 90%.
Prefix-cache hit rate — should be 30–80% for chat workloads.

Standard exporters (Prometheus) ship with vLLM/TGI. Wire to Grafana, set SLO alerts.
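The derived signals are simple ratios over raw counters. A minimal sketch follows; the counter names are hypothetical and do not correspond to the metric names vLLM or TGI actually export, and the thresholds are the ones suggested in the list above.

```python
def inference_health(stats, kv_alert_threshold=0.90, min_prefix_hit_rate=0.30):
    """Turn raw counters into the ratios worth alerting on."""
    acceptance = stats["spec_accepted_tokens"] / max(1, stats["spec_drafted_tokens"])
    kv_occupancy = stats["kv_blocks_used"] / stats["kv_blocks_total"]
    prefix_hit_rate = stats["prefix_hit_tokens"] / max(1, stats["prompt_tokens"])

    alerts = []
    if kv_occupancy >= kv_alert_threshold:
        alerts.append("kv_cache_near_full")
    if prefix_hit_rate < min_prefix_hit_rate:
        alerts.append("prefix_hit_rate_low")

    return {
        "spec_acceptance_rate": acceptance,
        "kv_occupancy": kv_occupancy,
        "prefix_hit_rate": prefix_hit_rate,
        "alerts": alerts,
    }


# Made-up counter values for illustration.
snapshot = inference_health({
    "spec_accepted_tokens": 700, "spec_drafted_tokens": 1000,
    "kv_blocks_used": 950, "kv_blocks_total": 1000,
    "prefix_hit_tokens": 200, "prompt_tokens": 1000,
})
print(snapshot)
```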

12. Reality check

Cost-minimised stack circa 2026:

vLLM with int4 AWQ weights + fp8 KV cache.
Prefix caching enabled.
Speculative decoding without a separate draft model, via Medusa or EAGLE heads.
Multi-LoRA for tenant isolation.
TP=4 on a node of 4× A100 or H100.
Chunked prefill for smooth tail latency.

That setup serves 4–8× more tokens/sec than naive fp16 vLLM. Same hardware, same SLOs, ~5× cheaper per token. The math is obvious; the engineering is real work.
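The per-token cost follows directly from node price and sustained throughput. The GPU price and token rates in this sketch are assumed round numbers chosen for illustration, not benchmarks.

```python
def cost_per_million_tokens(gpu_hourly_usd, num_gpus, tokens_per_second):
    """USD per one million generated tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_hourly_usd * num_gpus / tokens_per_hour * 1_000_000


# Assumptions: 4 GPUs at $2.50/hour each; 1,000 tok/s untuned vs 5,000 tok/s tuned.
baseline = cost_per_million_tokens(2.50, 4, 1_000)
tuned = cost_per_million_tokens(2.50, 4, 5_000)
print(f"baseline ${baseline:.2f}/M tokens, tuned ${tuned:.2f}/M tokens")
```

Because the hardware bill is fixed, cost per token scales inversely with throughput: a 5× throughput gain is a 5× cost reduction, provided the SLOs still hold at the higher load.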

Reading material

Books:

AI Engineering — Chip Huyen (the inference economics + cost chapters)
Generative Deep Learning, 2nd ed. — David Foster (the chapter on autoregressive sampling for the sampling background you need)
Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (the deployment + quantisation chapter)

Papers:

Fast Inference from Transformers via Speculative Decoding — Leviathan, Kalman, Matias 2023 — the original speculative decoding paper.
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers — Frantar et al. 2022 — 3-4 bit weight quantisation that actually works.
AWQ: Activation-aware Weight Quantization for LLM Compression — Lin et al. 2023 — the int4 quantisation method widely supported by serving engines such as vLLM and TGI.
Medusa: Simple LLM Inference Acceleration with Multiple Decoding Heads — Cai et al. 2024 — speculative decoding without a separate draft model.
EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty — Li et al. 2024 — the SOTA spec-decoding head used by vLLM.
SmoothQuant: Accurate and Efficient Post-Training Quantization for LLMs — Xiao et al. 2023 — activation quantisation made practical.
FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-Precision — Shah et al. 2024 — the H100-era attention kernel.

Official docs:

vLLM — Performance Optimization guide — chunked prefill, prefix caching, speculative decoding settings.
vLLM — Quantization support — AWQ, GPTQ, FP8, INT8 in production.
Hugging Face Optimum — the HF quantisation + acceleration library.
NVIDIA TensorRT-LLM — Speculative Decoding — production multi-token speculation.

Blog posts:

Together AI — Sequoia: speculative decoding with a tree — how production speculative decoding actually works.
PyTorch blog — Accelerating Generative AI with PyTorch II: GPT, Fast — Horace He's 10x speedup walkthrough.
Anyscale — Throughput is all you need — the right mental model for batch sizing.

In-depth research material

vLLM — github.com/vllm-project/vllm — ~31k ★, the canonical OSS engine with continuous batching + AWQ + FP8 + spec decoding.
SGLang — github.com/sgl-project/sglang — ~7k ★, LMSYS engine with the fastest prefix caching and structured output.
TensorRT-LLM — github.com/NVIDIA/TensorRT-LLM — ~9k ★, the NVIDIA max-throughput engine.
llama.cpp — github.com/ggerganov/llama.cpp — ~67k ★, GGUF quantisation reference; the CPU/edge inference standard.
exllamav2 — github.com/turboderp/exllamav2 — ~3.8k ★, the fastest single-GPU 4-bit inference engine for consumer hardware.
AutoAWQ — github.com/casper-hansen/AutoAWQ — ~1.9k ★, the canonical AWQ implementation.
AutoGPTQ — github.com/AutoGPTQ/AutoGPTQ — ~4.8k ★, the canonical GPTQ implementation.
PyTorch GPT-Fast — github.com/pytorch-labs/gpt-fast — ~5.6k ★, the PyTorch team's reference for torch.compile + spec decoding.
Mooncake: Trading More Storage for Less Computation — Kimi 2024 — the disaggregated KV-cache architecture from Moonshot AI (Kimi).
Character.AI Research — Optimizing inference at Character.AI — the most detailed real-world production tuning story (20k+ inference qps).
Tim Dettmers — Which GPU(s) to Get for Deep Learning — the only inference economics primer you need.

Videos

Speculative Decoding: When Two LLMs are Faster than One — Mark Saroufim (PyTorch) — 31 min — the cleanest visual explanation; pairs perfectly with the Leviathan paper.
Quantization in Depth — Hugging Face — 1 h 03 min — int8/int4/FP8/AWQ/GPTQ from first principles.
Accelerating Generative AI: From Slow to Fast — Horace He (PyTorch) — 56 min — the GPT-Fast 10x talk; CUDA graphs + spec decoding + quantisation combined.
Inside vLLM 2.0 — Woosuk Kwon (vLLM lead) — 58 min — chunked prefill, prefix caching, multi-LoRA, FP8.
GPU MODE: Lecture 14 — TensorRT-LLM internals — 1 h 22 min — NVIDIA's serving engine from a GPU-kernel viewpoint.

LeetCode — Longest Palindromic Substring

Link: https://leetcode.com/problems/longest-palindromic-substring/
Difficulty: Medium
Why this problem: Speculative-decode verification is essentially "longest accepted prefix of draft"; both problems are about matching segments efficiently.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Pick the right quantisation method (int8/int4/FP8/AWQ/GPTQ) for a given GPU and workload.
Explain why decoding is memory-bound and how speculative decoding exploits it.
Configure prefix caching and predict the throughput impact for a chat workload.
Stand up multi-LoRA serving with vLLM/TGI.
Tune max_num_batched_tokens, gpu_memory_utilization, and KV-cache dtype.
Solve longest-palindromic-substring — practice for the segment-matching reasoning behind spec verification.

