Practical Fine-Tuning — LoRA, QLoRA, PEFT, Instruction Datasets
Session 33 of the 48-session learning series.
Why this session matters
This is Session 33 of 48 in the ML track. Fine-tuning has become dramatically cheaper, on the order of 1000x in roughly 24 months. A 7B model can now be domain-adapted on a single consumer GPU in under a day with LoRA + 4-bit quantisation. The question shifted from "can we fine-tune?" to "should we?", and knowing how to answer that for a given problem is the new core skill.
Agenda
- Fine-tuning vs RAG vs prompting — the decision matrix
- LoRA — why low-rank adapters work; what to set rank to
- QLoRA — 4-bit base model + LoRA adapters; the consumer-GPU breakthrough
- Instruction datasets — quality, format, contamination
- Training pipeline — Axolotl, TRL, Unsloth, evaluation gates
Pre-read (skim before the session)
- LoRA paper (Hu et al., 2021)
- QLoRA paper (Dettmers et al., 2023)
- HuggingFace PEFT docs
- Sebastian Raschka — LoRA practical insights
Deep dive
1. Fine-tune vs RAG vs prompt — decision
| You need | Use |
|---|---|
| Knowledge about your private docs | RAG |
| Specific output format / style | Prompt first, then fine-tune |
| Domain vocabulary / terminology | Fine-tune |
| Better reasoning on niche tasks | Fine-tune (often w/ synthetic data) |
| Safety policy enforcement | Fine-tune + system prompt |
| Latency reduction (smaller distilled model) | Fine-tune smaller base |
| Up-to-date info | RAG (always) |
Most teams should try RAG + prompting before fine-tuning. Fine-tuning is harder, slower, and costs more iteration time. But it unlocks abilities you can't prompt your way into.
2. The fine-tuning spectrum
- Continued pre-training — domain corpus, unsupervised. Adapts language model to your distribution. Most expensive.
- Supervised fine-tuning (SFT) — instruction → answer pairs. The base of any RLHF stack.
- DPO / IPO / KTO — preference tuning from `(chosen, rejected)` pairs. No reward model needed. Now the default for alignment.
- RLHF (PPO) — reward model + RL. Still used by frontier labs; rarely worth the complexity downstream.
For application teams: SFT → DPO is the modern pipeline.
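To make the "no reward model needed" point concrete, here is a minimal sketch of the DPO loss for a single preference pair. It assumes you already have the summed log-probability of each answer under the trainable policy and under a frozen reference model. The function name, argument names, and the `beta=0.1` default are illustrative choices, not taken from any particular library.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair, given summed sequence log-probs."""
    # How much more the policy likes each answer than the frozen reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: the loss is log(2), i.e. no preference learned yet.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability towards the chosen answer: the loss drops.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss only needs log-probabilities from two forward passes, which is why DPO fits in an ordinary supervised training loop.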
3. Full fine-tune vs PEFT
Full fine-tune of a 7B model:
- Updates all 7B params.
- Needs ~14 GB just for fp16 weights, ~56 GB for fp32 Adam optimiser state.
- Doesn't fit on a single 24 GB consumer GPU.
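The arithmetic behind those numbers can be sketched in a few lines. This is a rough estimate only: it assumes fp16 weights and gradients plus two fp32 Adam moments per parameter, and it ignores activations, fp32 master weights, and framework overhead, so real usage is higher. The helper name is hypothetical.

```python
def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=8):
    """Rough memory in GB (1e9 bytes) for weights, gradients, and Adam state."""
    gb = 1e9
    return {
        "weights": n_params * weight_bytes / gb,
        "gradients": n_params * grad_bytes / gb,
        "adam_state": n_params * adam_state_bytes / gb,  # two fp32 moments per parameter
    }

budget = full_finetune_memory_gb(7e9)
total = sum(budget.values())
print(budget, f"total ~{total:.0f} GB")
```

Even before activations, the total is several times larger than a 24 GB card.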
PEFT (Parameter-Efficient Fine-Tuning) freezes the base and trains a tiny set of new params:
- LoRA, QLoRA, prefix tuning, adapters.
- 0.1–1% of params trained.
- Comparable quality to a full fine-tune on most tasks.
- Adapters can be swapped at runtime → one base, many specialisations.
4. LoRA — the math (1 paragraph)
For each weight matrix W (d×k), instead of updating W directly, train two low-rank matrices B (d×r) and A (r×k); the effective weight is W + (alpha/r)·B·A. Rank r is typically 8–64. B starts at zero, so training begins from the unmodified base model. Forward pass: W·x + (alpha/r)·B·(A·x). Only A and B are trained, so gradient and optimiser memory are tiny.
Why it works: empirically, the "update" learned during fine-tuning is low-rank. You don't need to move every weight; you need to nudge the right subspace.
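A tiny worked example can make the shapes concrete. The sketch below uses plain Python lists and follows the LoRA paper's convention, where W is d×k, B is d×r, and A is r×k so that B·A has the same shape as W. The toy sizes, the values, and the `alpha / r` scaling factor are assumptions for illustration. It checks that running the adapter alongside the base gives the same output as folding the adapter into the weights.

```python
def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y, scale=1.0):
    """Return X + scale * Y element-wise."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

# Hypothetical toy sizes: d=3, k=4, rank r=2.
W = [[1.0, 0.0, 2.0, 1.0],
     [0.0, 1.0, 0.0, 3.0],
     [2.0, 1.0, 1.0, 0.0]]        # frozen base weight, d x k
B = [[0.5, 0.0],
     [0.0, 1.0],
     [1.0, -1.0]]                 # trainable, d x r
A = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 1.0]]        # trainable, r x k
alpha, r = 4.0, 2
scale = alpha / r
x = [[1.0], [2.0], [3.0], [4.0]]  # input column vector, k x 1

# Adapter path: base output plus the scaled low-rank correction.
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)), scale)
# Merged path: fold B·A into W once, then do a single matmul.
W_merged = add(W, matmul(B, A), scale)
y_merged = matmul(W_merged, x)
print(y_adapter, y_merged)  # both are [[15.0], [30.0], [-1.0]]
```

At these toy sizes A and B are not smaller than W; the saving only appears when r is much smaller than d and k, as in a real model.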
5. QLoRA — fitting big models on small GPUs
QLoRA combines:
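- 4-bit base model (NF4 quantisation, with double quantisation of the quantisation constants) — frozen, dequantised on the fly during the forward pass.
- LoRA adapters in 16-bit precision (bf16 in the paper) — trainable.
- Paged optimiser — Adam state is paged out to CPU RAM when GPU memory spikes and paged back in as needed.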
Result: 65B model fine-tunable on a single 48 GB GPU; 7B on a single 24 GB consumer GPU.
Quality penalty vs full fp16 fine-tune: usually < 1 point on benchmarks. Negligible for almost all use cases.
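The "quantise once, dequantise on the fly" idea can be illustrated with a deliberately simplified sketch. It uses uniform absmax quantisation over one block of weights, which is not NF4 (NF4 uses non-uniform levels fitted to a normal distribution) and is not how bitsandbytes is implemented. The function names and sample weights are hypothetical.

```python
def quantize_block(block, levels=7):
    """Absmax-quantise one block of floats to signed integers in [-levels, levels]."""
    absmax = max(abs(v) for v in block) or 1.0  # avoid dividing by zero for an all-zero block
    scale = absmax / levels
    codes = [round(v / scale) for v in block]   # 15 distinct values, so each code fits in 4 bits
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats, as done on the fly in the forward pass."""
    return [c * scale for c in codes]

weights = [0.12, -0.40, 0.05, 0.33, -0.27, 0.01, 0.18, -0.09]  # hypothetical block of weights
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"max error {max_err:.4f}")
```

At 4 bits per weight, 7B parameters take about 3.5 GB and 65B about 32.5 GB before quantisation constants, which is consistent with the GPU sizes quoted above.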
6. Setting LoRA hyperparameters
- Rank `r` — 8 for small tasks, 16–32 default, 64+ when underfitting. Higher rank = more capacity = more risk of overfitting.
- Alpha — scales adapter contribution; typical `alpha = r` or `alpha = 2r`.
- Target modules — at minimum `q_proj, v_proj`. Better: `q_proj, k_proj, v_proj, o_proj`. Best (slowest): all linear layers.
- Dropout — 0.05–0.1.
Start with rank 16, target qkvo, alpha 32, dropout 0.05. Tune from there.
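To see what that starting point costs, the sketch below counts the trainable parameters it adds. The model shape is an assumption for illustration: 32 layers, hidden size 4096, and square attention projections. Models with grouped-query attention have smaller `k_proj` and `v_proj` matrices, so their counts will be somewhat lower.

```python
def lora_trainable_params(rank, module_shapes, n_layers):
    """Count LoRA parameters: each adapted (d_out x d_in) matrix adds rank * (d_out + d_in)."""
    per_layer = sum(rank * (d_out + d_in) for d_out, d_in in module_shapes.values())
    return per_layer * n_layers

# Hypothetical 7B-class shapes: hidden size 4096, square q/k/v/o projections.
hidden = 4096
qkvo = {name: (hidden, hidden) for name in ("q_proj", "k_proj", "v_proj", "o_proj")}

trainable = lora_trainable_params(rank=16, module_shapes=qkvo, n_layers=32)
fraction = trainable / 7e9
print(f"{trainable:,} trainable params ({fraction:.2%} of 7B)")
# prints: 16,777,216 trainable params (0.24% of 7B)
```

Doubling the rank or adding the MLP projections to `module_shapes` scales the count linearly, which is a quick way to sanity-check a config before launching a run.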
7. Instruction dataset quality
The single biggest predictor of fine-tune success.
Rules:
- Quality > quantity. 1k high-quality > 100k low-quality, by a wide margin.
- Format consistency. Pick one chat template; stick to it.
- No contamination. Strip eval-set examples from training.
- Diverse instructions. Cover the breadth you want at inference.
- Match output style. Want concise? Train on concise.
- System prompt training. Always include in the format you'll use at inference.
LIMA paper (Meta, 2023): a LLaMA 65B model fine-tuned on 1,000 carefully curated examples was preferred by human raters over the same base model fine-tuned on Alpaca's 52k examples.
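The contamination rule is the easiest one to automate. Below is a minimal sketch that drops any training row whose prompt matches an eval prompt after light normalisation. The `instruction` key and the sample rows are hypothetical. It catches exact and near-exact duplicates only; paraphrased leaks would need n-gram overlap or embedding similarity on top.

```python
import re

def normalise(text):
    """Lowercase and collapse punctuation/whitespace so near-identical prompts compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def strip_contaminated(train_rows, eval_rows, key="instruction"):
    """Drop training rows whose prompt matches any eval prompt after normalisation."""
    eval_prompts = {normalise(row[key]) for row in eval_rows}
    return [row for row in train_rows if normalise(row[key]) not in eval_prompts]

train = [
    {"instruction": "Summarise this contract clause.", "output": "..."},
    {"instruction": "What is the refund policy?", "output": "..."},
]
held_out = [{"instruction": "what is the  refund policy", "output": "..."}]

clean = strip_contaminated(train, held_out)
print(len(train) - len(clean), "contaminated row(s) removed")
```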
8. Synthetic data
When real instruction data is scarce:
- Use GPT-4 / Claude to generate examples.
- Self-Instruct, Evol-Instruct patterns.
- WizardLM-style instruction evolution (rewrite simple → complex).
- Constitutional AI for safety dataset.
Caveats: model collapse if you only train on synthetic; mode collapse if generator is too narrow. Always blend with real examples (≥ 20%).
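One way to enforce the blending rule is to cap the synthetic share when the dataset is assembled. The sketch below is an assumption about how you might do it, not a prescribed recipe. It keeps every real example and samples just enough synthetic rows that real data stays at or above the chosen fraction.

```python
import random

def blend(real, synthetic, min_real_fraction=0.2, seed=0):
    """Mix real and synthetic rows, capping synthetic so real rows stay >= min_real_fraction."""
    # real / (real + synthetic) >= f  implies  synthetic <= real * (1 - f) / f
    max_synthetic = int(len(real) * (1 - min_real_fraction) / min_real_fraction + 1e-9)
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    kept = rng.sample(synthetic, min(len(synthetic), max_synthetic))
    mixed = real + kept
    rng.shuffle(mixed)
    return mixed

real_rows = [{"source": "real", "id": i} for i in range(50)]
synthetic_rows = [{"source": "synthetic", "id": i} for i in range(1000)]

mixed = blend(real_rows, synthetic_rows)
real_share = sum(row["source"] == "real" for row in mixed) / len(mixed)
print(len(mixed), f"rows, {real_share:.0%} real")
```

With 50 real rows and a 20% floor, at most 200 synthetic rows are kept, however many the generator produced.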
9. Training pipeline
Modern stack:
- Axolotl — YAML-driven, batteries-included. Multi-GPU, DeepSpeed integration.
- TRL (HuggingFace) — SFTTrainer, DPOTrainer, GRPOTrainer. Most flexibility.
- Unsloth — fastest single-GPU; custom kernels; 2× speedup.
- LLaMA-Factory — UI-driven; popular in China; good for non-engineers.
Example Axolotl config snippet:
base_model: meta-llama/Meta-Llama-3-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
datasets:
  - path: ./my-dataset.jsonl
    type: chat_template
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
num_epochs: 3
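A quick back-of-the-envelope check on the batch settings in the config above. The values are copied from the config, except `num_gpus`, which is an assumption (a single-GPU run). The step count also assumes one example per sequence with no sample packing, so treat it as approximate.

```python
# Values copied from the config; num_gpus is an assumption (single-GPU run).
micro_batch_size = 2
gradient_accumulation_steps = 4
sequence_len = 4096
num_gpus = 1

# Sequences contributing to each optimizer update.
effective_batch = micro_batch_size * gradient_accumulation_steps * num_gpus
# Upper bound on tokens per update, if every sequence is full length.
max_tokens_per_step = effective_batch * sequence_len

def optimizer_steps(n_examples, num_epochs=3):
    """Approximate optimizer updates for a run (one example per sequence, no packing)."""
    steps_per_epoch = -(-n_examples // effective_batch)  # ceiling division
    return steps_per_epoch * num_epochs

print(effective_batch, max_tokens_per_step, optimizer_steps(2000))
# prints: 8 32768 750
```

A few hundred updates is a small run, which is one reason dataset quality tends to matter more than hyperparameter sweeps at this scale.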
10. Evaluation gates
Before declaring success:
- Held-out task accuracy — domain eval set, never seen in training.
- General capability — MMLU, HellaSwag — confirm you didn't degrade the base.
- Safety eval — refusals on harmful prompts; LLM-as-judge on a red-team set.
- Format compliance — does it return the JSON / structure you asked for? 100% is the target.
- Latency — adapter merge vs adapter-on-the-fly; serving impact.
Always compare against the base model with your best prompt — if prompt-engineering closes the gap, ship the prompt, not the tune.
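The five gates can be written down as one promotion check. This is a sketch under stated assumptions: the metric names, thresholds, and sample scores are all hypothetical, and the baseline is the base model with your best prompt.

```python
def passes_gates(candidate, baseline, max_general_drop=0.01,
                 min_format_rate=1.0, max_latency_ms=2000):
    """Return (ok, failures) for a fine-tuned model versus the prompted base model."""
    failures = []
    if candidate["task_accuracy"] <= baseline["task_accuracy"]:
        failures.append("task accuracy does not beat the prompted baseline")
    if baseline["general"] - candidate["general"] > max_general_drop:
        failures.append("general capability regressed")
    if candidate["safety_pass_rate"] < baseline["safety_pass_rate"]:
        failures.append("safety eval regressed")
    if candidate["format_rate"] < min_format_rate:
        failures.append("format compliance below target")
    if candidate["p95_latency_ms"] > max_latency_ms:
        failures.append("latency over budget")
    return (not failures), failures

# Hypothetical scores for illustration only.
baseline = {"task_accuracy": 0.71, "general": 0.64, "safety_pass_rate": 0.98}
candidate = {"task_accuracy": 0.83, "general": 0.635, "safety_pass_rate": 0.98,
             "format_rate": 0.97, "p95_latency_ms": 900}

ok, failures = passes_gates(candidate, baseline)
print(ok, failures)  # False ['format compliance below target']
```

Returning the list of failed gates, rather than a bare boolean, tells you which part of the dataset to fix next.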
11. Serving fine-tuned models
Options:
- Merged weights — fold adapter into base; ship one model file. Simplest serving, can't swap.
- Adapter at runtime — load base once, swap adapters per request. Multi-tenant; vLLM and TGI support this. Tiny per-tenant cost.
- Distilled smaller model — fine-tune a 7B to match a 70B's behaviour on your task. Most $$ saved long-term.
For a multi-customer SaaS: adapter-at-runtime is the winning pattern. Caveat: it commits you to one base model; switching base means retraining every adapter.
12. Reality check
A 2-week fine-tune plan:
- Days 1–2: build a 200-example eval set; freeze it.
- Days 3–4: try prompting + RAG; record baseline.
- Days 5–8: curate a 1–5k SFT dataset; train QLoRA on Llama-3-8B.
- Days 9–10: eval; iterate dataset, not hyperparams.
- Days 11–12: DPO from production thumbs-down/thumbs-up.
- Days 13–14: serving integration; production canary.
If the baseline beats your fine-tune, ship the baseline. Most of the value is in the eval set + dataset curation, not the training itself.
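For the DPO step on days 11–12, production feedback has to be reshaped into preference pairs. The sketch below assumes a hypothetical log format with `prompt`, `response`, and `thumbs` keys. It emits rows with `prompt`, `chosen`, and `rejected` fields, which is the shape preference trainers such as TRL's `DPOTrainer` generally expect; check the current docs for the exact column names.

```python
from collections import defaultdict
from itertools import product

def build_preference_pairs(feedback):
    """Turn thumbs-up/down logs into (prompt, chosen, rejected) rows.

    Only prompts with at least one response of each rating produce pairs.
    """
    by_prompt = defaultdict(lambda: {"up": [], "down": []})
    for row in feedback:
        by_prompt[row["prompt"]][row["thumbs"]].append(row["response"])
    pairs = []
    for prompt, votes in by_prompt.items():
        for chosen, rejected in product(votes["up"], votes["down"]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# Hypothetical production logs.
logs = [
    {"prompt": "Reset my password", "response": "Go to Settings > Security.", "thumbs": "up"},
    {"prompt": "Reset my password", "response": "I cannot help with that.", "thumbs": "down"},
    {"prompt": "Cancel my plan", "response": "Done.", "thumbs": "up"},  # no rejected answer
]
pairs = build_preference_pairs(logs)
print(len(pairs), "pair(s)")
```

Prompts that only ever received one kind of rating yield no pairs, so expect the usable set to be much smaller than the raw feedback log.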
Reading material
Books:
- Build a Large Language Model (from Scratch) — Sebastian Raschka (Manning; the fine-tuning + instruction-following chapters)
- Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (the PEFT and alignment chapters)
- Natural Language Processing with Transformers — Tunstall, von Werra, Wolf (the HF authors; chapter on fine-tuning + datasets)
Papers:
- LoRA: Low-Rank Adaptation of Large Language Models — Hu et al. 2021 — the canonical paper; rank-r decomposition explained from first principles.
- QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al. 2023 (NeurIPS) — the 4-bit + LoRA paper that put 65B fine-tuning on a single 48 GB GPU.
- DPO: Direct Preference Optimization — Rafailov et al. 2023 (NeurIPS) — the RLHF replacement that fits in a normal training loop.
- LIMA: Less Is More for Alignment — Zhou et al. 2023 (Meta) — 1,000 high-quality examples beat 50k mediocre ones.
- ORPO: Monolithic Preference Optimization without Reference Model — Hong et al. 2024 — one-stage SFT + alignment.
Official docs:
- Hugging Face PEFT documentation — the canonical library for LoRA/QLoRA/IA3/prompt-tuning.
- Hugging Face TRL — Transformer Reinforcement Learning — SFT + DPO + PPO + KTO trainers.
- Axolotl documentation — the YAML-driven fine-tuning framework everyone uses for production runs.
- Unsloth documentation — the 2x-faster fine-tuning library for single-GPU workflows.
Blog posts:
- Sebastian Raschka — Practical Tips for Finetuning LLMs Using LoRA — the most read practitioner essay on LoRA hyperparameters.
- Tim Dettmers — QLoRA: 4-bit Fine-Tuning on a Single GPU — from the paper's first author; the intuition.
- Hamel Husain — Fine-Tuning LLMs: A Practitioner's Guide — the modern playbook on when to fine-tune vs prompt vs RAG.
- Eugene Yan — DPO: Replacing RLHF with a single loss — the survey of alignment methods.
In-depth research material
- PEFT — github.com/huggingface/peft — ~17k ★, the canonical parameter-efficient fine-tuning library.
- TRL — github.com/huggingface/trl — ~12k ★, the SFT/DPO/PPO trainer library.
- Axolotl — github.com/axolotl-ai-cloud/axolotl — ~8k ★, the production-grade YAML fine-tuning harness.
- Unsloth — github.com/unslothai/unsloth — ~21k ★, 2x faster + 50% less memory single-GPU fine-tuning.
- LLaMA-Factory — github.com/hiyouga/LLaMA-Factory — ~38k ★, web-UI + CLI for fine-tuning 100+ models.
- DeepSpeed — github.com/microsoft/DeepSpeed — ~36k ★, Microsoft's distributed training framework; ZeRO sharding.
- bitsandbytes — github.com/bitsandbytes-foundation/bitsandbytes — ~6k ★, the 4-bit + 8-bit optimisers that make QLoRA possible.
- GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection — Zhao et al. 2024 — full-parameter fine-tuning at LoRA memory cost.
- Stanford Alpaca — github.com/tatsu-lab/stanford_alpaca — ~30k ★, the self-instruct dataset format that started the open-source fine-tuning explosion.
- Argilla — github.com/argilla-io/argilla — ~4.5k ★, the open-source data-curation platform for instruction datasets.
Videos
- LoRA explained (and a bit about precision and quantization) — DeepLearningAI — Mark Saroufim (PyTorch) · 17 min — the cleanest visual explanation of LoRA + quantisation.
- Fine-tuning LLMs with PEFT and LoRA — Sam Witteveen — 35 min — hands-on Colab walkthrough; the right "see it work" video.
- QLoRA: How to Fine-tune an LLM on a Single GPU — Tim Dettmers — Tim Dettmers (paper author) · 26 min — the paper's first author explaining QLoRA.
- Direct Preference Optimization (DPO) — Aleksa Gordić — 52 min — the DPO paper walkthrough.
- The Novice's LLM Training Guide — Sebastian Raschka — Sebastian Raschka · 1 h 14 min — the soup-to-nuts tour of pre-training / SFT / DPO from the Build an LLM from Scratch author.
LeetCode — Matrix Multiplication
- Link: https://leetcode.com/problems/matrix-multiplication/
- Difficulty: Medium
- Why this problem: LoRA is literally a low-rank matrix decomposition of the weight update, `W' = W + B·A`. Implementing matmul once cements the intuition.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Pick between full FT, LoRA, QLoRA for a given GPU + model combo.
- Explain why low-rank updates work for fine-tuning empirically.
- Configure LoRA hyperparameters (rank, alpha, target modules) sensibly.
- Curate an SFT dataset that respects format, diversity, and contamination rules.
- Design a 5-step eval gate before promoting a fine-tuned model.
- Solve `matrix-multiplication` — the primitive operation that LoRA decomposes.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.