ai · ml · intermediate · 12m · 2026-06-09

Practical Fine-Tuning — LoRA, QLoRA, PEFT, Instruction Datasets

Session 33 of the 48-session learning series.

Why this session matters

Fine-tuning got 1000x cheaper in 24 months. A 7B model can now be domain-adapted on a single consumer GPU in under a day with LoRA + 4-bit quantisation. The question shifted from "can we fine-tune?" to "should we?", and knowing how to answer that for a given problem is the new core skill.

Agenda

Fine-tuning vs RAG vs prompting — the decision matrix
LoRA — why low-rank adapters work; what to set rank to
QLoRA — 4-bit base model + LoRA adapters; the consumer-GPU breakthrough
Instruction datasets — quality, format, contamination
Training pipeline — Axolotl, TRL, Unsloth, evaluation gates

Deep dive

1. Fine-tune vs RAG vs prompt — decision

You need	Use
Knowledge about your private docs	RAG
Specific output format / style	Prompt first, then fine-tune
Domain vocabulary / terminology	Fine-tune
Better reasoning on niche tasks	Fine-tune (often w/ synthetic data)
Safety policy enforcement	Fine-tune + system prompt
Latency reduction (smaller distilled model)	Fine-tune smaller base
Up-to-date info	RAG (always)

90% of teams should try RAG + prompt before fine-tuning. Fine-tuning is harder, slower, and costs more iteration time. But it unlocks abilities you can't prompt your way into.

2. The fine-tuning spectrum

Continued pre-training — domain corpus, unsupervised. Adapts language model to your distribution. Most expensive.
Supervised fine-tuning (SFT) — instruction → answer pairs. The base of any RLHF stack.
DPO / IPO / KTO — preference tuning from (chosen, rejected) pairs. No reward model needed. Now the default for alignment.
RLHF (PPO) — reward model + RL. Still used by frontier labs; rarely worth the complexity downstream.

For application teams: SFT → DPO is the modern pipeline.
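To make the "no reward model needed" point concrete, here is a minimal sketch of the per-pair DPO loss as given in the DPO paper. It assumes you already have summed sequence log-probabilities from the policy and from a frozen reference model. The numbers and the `beta=0.1` default are illustrative assumptions, not values from this session.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of summed sequence log-probs."""
    # Log-ratio of policy to frozen reference for each answer.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(z)) == softplus(-z), computed in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Hypothetical log-probabilities, for illustration only.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy identical to reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # policy favours the chosen answer more
```

When the policy equals the reference the loss is log 2. It falls as the policy widens the gap between chosen and rejected answers relative to the reference, which is the whole training signal.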

3. Full fine-tune vs PEFT

Full fine-tune of a 7B model:

Updates all 7B params.
Needs ~14 GB just for fp16 weights, plus ~56 GB for fp32 Adam optimiser state (two moments per parameter), before counting gradients and activations.
Doesn't fit on a single 24 GB consumer GPU.
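The figures above follow from simple arithmetic. The sketch below reproduces them, assuming decimal gigabytes, fp16 weights and gradients, and two fp32 Adam moments per parameter. It deliberately ignores activations and any fp32 master copy of the weights, so treat the total as a lower bound.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text

def full_finetune_memory_gb(n_params, weight_bytes=2, grad_bytes=2, adam_state_bytes=4):
    """Rough lower bound on memory for a full fine-tune with Adam."""
    weights = n_params * weight_bytes / GB
    gradients = n_params * grad_bytes / GB
    adam_state = n_params * 2 * adam_state_bytes / GB  # first and second moments
    return {
        "weights": weights,
        "gradients": gradients,
        "adam_state": adam_state,
        "total": weights + gradients + adam_state,
    }

estimate = full_finetune_memory_gb(7e9)
for name, gigabytes in estimate.items():
    print(f"{name:>10}: {gigabytes:6.1f} GB")
```

Even before activations, the total is several times the 24 GB on a consumer card, which is why the rest of this session focuses on PEFT.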

PEFT (Parameter-Efficient Fine-Tuning) freezes the base and trains a tiny set of new params:

LoRA, QLoRA, prefix tuning, adapters.
0.1–1% of params trained.
Comparable quality to a full fine-tune on most tasks.
Adapters can be swapped at runtime → one base, many specialisations.

4. LoRA — the math (1 paragraph)

For each weight matrix W (d×k), instead of updating W directly, train two low-rank matrices B (d×r) and A (r×k); the effective weight is W + (alpha/r)·B·A. Rank r is typically 8–64. Forward pass: W·x + (alpha/r)·B·(A·x). Only A and B are trained, so gradient and optimiser memory is tiny.

Why it works: empirically, the "update" learned during fine-tuning is low-rank. You don't need to move every weight; you need to nudge the right subspace.
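A small numeric check can make the shapes concrete. The sketch below uses the LoRA paper's convention (W is d×k, B is d×r, A is r×k, update scaled by alpha/r) and pure-Python lists, so it runs without any library. The dimensions are toy values chosen for illustration. Real LoRA initialises B to zero; B is random here only so the comparison is non-trivial.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(X, Y, scale=1.0):
    """Return X + scale * Y element-wise."""
    return [[x + scale * y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, k, r, alpha = 6, 5, 2, 4          # toy sizes, not realistic model dimensions
W = random_matrix(d, k, rng)         # frozen base weight
B = random_matrix(d, r, rng)         # trainable, d x r
A = random_matrix(r, k, rng)         # trainable, r x k
x = random_matrix(k, 1, rng)         # input as a column vector
scale = alpha / r

# Path 1: fold the adapter into the weight, then multiply.
y_merged = matmul(add_scaled(W, matmul(B, A), scale), x)
# Path 2: keep the adapter separate; never materialise the d x k update.
y_adapter = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

max_diff = max(abs(p[0] - q[0]) for p, q in zip(y_merged, y_adapter))
full_params, lora_params = d * k, r * (d + k)
```

Both paths give the same output up to floating-point error. The trainable count is r·(d + k) instead of d·k, which barely matters at toy sizes but is a large saving when d and k are in the thousands.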

5. QLoRA — fitting big models on small GPUs

QLoRA combines:

4-bit base model (NF4 quantisation) — frozen, dequantised on the fly during forward pass.
LoRA adapters in 16-bit precision (bf16 in the paper) — trainable.
Paged optimiser — optimiser state is paged out to CPU RAM during memory spikes and paged back in as needed.

Result: 65B model fine-tunable on a single 48 GB GPU; 7B on a single 24 GB consumer GPU.

Quality penalty vs full fp16 fine-tune: usually < 1 point on benchmarks. Negligible for almost all use cases.
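To show what "4-bit base, dequantised on the fly" means mechanically, here is a simplified block-wise absmax quantiser. It is a sketch, not NF4: it uses 16 evenly spaced levels, whereas NF4 places its 16 levels at quantiles of a normal distribution. The sample weights and the `base_weights_gb` helper are hypothetical, and the helper ignores the small per-block scale overhead.

```python
def quantize_block(block):
    """Map floats to 4-bit codes (0..15) plus one absmax scale for the block."""
    scale = max(abs(v) for v in block) or 1.0  # avoid dividing by zero on an all-zero block
    codes = [min(15, max(0, round((v / scale + 1.0) * 7.5))) for v in block]
    return codes, scale

def dequantize_block(codes, scale):
    """Recover approximate floats from 4-bit codes and the block scale."""
    return [(c / 7.5 - 1.0) * scale for c in codes]

def base_weights_gb(n_params, bits=4):
    """Storage for the frozen base weights alone, in decimal GB."""
    return n_params * bits / 8 / 1e9

weights = [0.12, -0.40, 0.03, 0.25, -0.07, 0.31, -0.18, 0.02]  # made-up values
codes, scale = quantize_block(weights)
restored = dequantize_block(codes, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))

print(codes, round(max_err, 4))
print(base_weights_gb(65e9), base_weights_gb(7e9))
```

With uniform levels the round-trip error per weight is at most scale/15. At 4 bits, a 65B base needs about 32.5 GB and a 7B base about 3.5 GB, which leaves headroom for adapters and activations on the GPUs quoted above.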

6. Setting LoRA hyperparameters

Rank r — 8 for small tasks, 16–32 default, 64+ when underfitting. Higher rank = more capacity = more risk of overfitting.
Alpha — scales adapter contribution; typical alpha = r or alpha = 2r.
Target modules — at minimum q_proj, v_proj. Better: q_proj, k_proj, v_proj, o_proj. Best (slowest): all linear layers.
Dropout — 0.05–0.1.

Start with rank 16, target qkvo, alpha 32, dropout 0.05. Tune from there.
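It helps to see how few parameters the suggested starting point trains. The sketch below counts LoRA parameters for the q/k/v/o projections, assuming a hidden size of 4096, 32 layers, and square projection matrices. These dimensions are assumptions for a roughly 7B-class model; models with grouped-query attention have smaller k_proj and v_proj, so their counts will be lower.

```python
def lora_trainable_params(shapes, rank):
    """shapes: (out_features, in_features) for every adapted linear layer."""
    return sum(rank * (out_f + in_f) for out_f, in_f in shapes)

hidden, n_layers = 4096, 32                # assumed dimensions
qkvo = [(hidden, hidden)] * 4 * n_layers   # q_proj, k_proj, v_proj, o_proj per layer
rank, alpha = 16, 32

trainable = lora_trainable_params(qkvo, rank)
scale = alpha / rank                       # multiplier applied to the B·A update
share = trainable / 7e9                    # fraction of a nominal 7B base

print(f"{trainable:,} trainable params ({share:.2%} of 7B), update scale {scale}")
```

The count grows linearly with rank, so moving from 16 to 64 quadruples adapter size while the base stays frozen.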

7. Instruction dataset quality

The single biggest predictor of fine-tune success.

Rules:

Quality > quantity. 1k high-quality > 100k low-quality, by a wide margin.
Format consistency. Pick one chat template; stick to it.
No contamination. Strip eval-set examples from training.
Diverse instructions. Cover the breadth you want at inference.
Match output style. Want concise? Train on concise.
System prompt training. Always include in the format you'll use at inference.

LIMA paper (Meta, 2023): a 65B LLaMA fine-tuned on 1,000 carefully curated examples was preferred over the same-size model trained on the 52k-example Alpaca set.
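The contamination rule is easy to automate. A minimal sketch follows, assuming each training row is a dict with a `prompt` field (a hypothetical schema). It only catches exact matches after lowercasing and whitespace normalisation; paraphrased leaks need n-gram or embedding overlap checks on top.

```python
import re

def normalise(text):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def decontaminate(train_rows, eval_prompts):
    """Drop training rows whose prompt is in the eval set or already seen."""
    blocked = {normalise(p) for p in eval_prompts}
    seen = set()
    kept = []
    for row in train_rows:
        key = normalise(row["prompt"])
        if key in blocked or key in seen:
            continue
        seen.add(key)
        kept.append(row)
    return kept

# Made-up rows, for illustration only.
train = [
    {"prompt": "Summarise this contract.", "response": "..."},
    {"prompt": "summarise   this contract.", "response": "..."},   # duplicate
    {"prompt": "What is the refund policy?", "response": "..."},   # in eval set
    {"prompt": "Draft a polite reminder email.", "response": "..."},
]
clean = decontaminate(train, eval_prompts=["What is the refund policy?"])
```

Run this every time the dataset changes, against the frozen eval set, so a later data refresh cannot quietly leak eval examples into training.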

8. Synthetic data

When real instruction data is scarce:

Use GPT-4 / Claude to generate examples.
Self-Instruct, Evol-Instruct patterns.
WizardLM-style instruction evolution (rewrite simple → complex).
Constitutional AI for safety dataset.

Caveats: model collapse if you only train on synthetic; mode collapse if generator is too narrow. Always blend with real examples (≥ 20%).
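The blending rule can be enforced when the dataset is assembled. The sketch below caps the number of synthetic examples so that real examples make up at least a given percentage of the mix. The function name, the integer-percent parameter, and the seed are illustrative choices, not part of any library.

```python
import random

def blend(real, synthetic, min_real_percent=20, seed=0):
    """Mix real and synthetic examples, keeping real >= min_real_percent of the result."""
    if not real:
        raise ValueError("need at least one real example to anchor the blend")
    # Largest synthetic count that still leaves real at or above the floor.
    max_synthetic = len(real) * (100 - min_real_percent) // min_real_percent
    rng = random.Random(seed)
    picked = rng.sample(synthetic, min(max_synthetic, len(synthetic)))
    mixed = list(real) + picked
    rng.shuffle(mixed)
    return mixed

real = [f"real-{i}" for i in range(20)]             # placeholder examples
synthetic = [f"synthetic-{i}" for i in range(500)]  # placeholder examples
mixed = blend(real, synthetic)
real_share = sum(item.startswith("real-") for item in mixed) / len(mixed)
```

With 20 real examples and a 20% floor, at most 80 synthetic examples are admitted, however many the generator produced.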

9. Training pipeline

Modern stack:

Axolotl — YAML-driven, batteries-included. Multi-GPU, DeepSpeed integration.
TRL (HuggingFace) — SFTTrainer, DPOTrainer, GRPOTrainer. Most flexibility.
Unsloth — fastest single-GPU; custom kernels; 2× speedup.
LLaMA-Factory — UI-driven; popular in China; good for non-engineers.

Example Axolotl config snippet:

base_model: meta-llama/Meta-Llama-3-8B
load_in_4bit: true
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules: [q_proj, k_proj, v_proj, o_proj]
datasets:
  - path: ./my-dataset.jsonl
    type: chat_template
sequence_len: 4096
micro_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 2e-4
num_epochs: 3
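The batch settings in a config like the one above determine how many optimiser steps a run takes. Here is a back-of-the-envelope sketch, assuming a hypothetical 3,000-example dataset on one GPU with no sample packing. The framework's own step count may differ slightly depending on packing and how it handles the final partial batch.

```python
import math

def training_steps(n_examples, micro_batch_size, grad_accum_steps, num_epochs, n_gpus=1):
    """Return (effective batch size, total optimiser steps) for a simple run."""
    effective_batch = micro_batch_size * grad_accum_steps * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * num_epochs

# micro_batch_size=2, gradient_accumulation_steps=4, num_epochs=3 as in the config;
# the 3,000-example dataset size is an assumption.
effective_batch, total_steps = training_steps(3000, 2, 4, 3)
max_tokens_per_step = effective_batch * 4096  # upper bound with sequence_len 4096

print(effective_batch, total_steps, max_tokens_per_step)
```

Knowing the total step count up front makes it easier to choose warmup length and checkpoint frequency before you start the run.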

10. Evaluation gates

Before declaring success:

Held-out task accuracy — domain eval set, never seen in training.
General capability — MMLU, HellaSwag — confirm you didn't degrade the base.
Safety eval — refusals on harmful prompts; LLM-as-judge on a red-team set.
Format compliance — does it return the JSON / structure you asked for? 100% is the target.
Latency — adapter merge vs adapter-on-the-fly; serving impact.

Always compare against the base model with your best prompt — if prompt-engineering closes the gap, ship the prompt, not the tune.
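Gates work best when they are code rather than a checklist. The sketch below covers three of the five gates (task accuracy versus the prompted base, general-capability regression, and JSON format compliance). Safety and latency checks would plug into the same pattern. The metric names, thresholds, and scores are all hypothetical.

```python
import json

def format_compliance(outputs, required_keys):
    """Fraction of outputs that parse as a JSON object containing required_keys."""
    if not outputs:
        return 0.0
    ok = 0
    for text in outputs:
        try:
            obj = json.loads(text)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and set(required_keys).issubset(obj):
            ok += 1
    return ok / len(outputs)

def eval_gate(base, tuned, max_general_drop=0.01, min_format=1.0):
    """Return the list of failed checks; an empty list means the gate passes."""
    failures = []
    if tuned["task_acc"] <= base["task_acc"]:
        failures.append("task accuracy does not beat the prompted base")
    if base["general"] - tuned["general"] > max_general_drop:
        failures.append("general capability regressed")
    if tuned["format"] < min_format:
        failures.append("format compliance below target")
    return failures

# Hypothetical scores, for illustration only.
base = {"task_acc": 0.71, "general": 0.64, "format": 0.93}
tuned = {
    "task_acc": 0.83,
    "general": 0.635,
    "format": format_compliance(['{"label": "spam"}', '{"label": "ham"}'], {"label"}),
}
failures = eval_gate(base, tuned)
```

Returning the list of failures, rather than a single boolean, tells you which gate blocked promotion and therefore what to fix in the dataset.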

11. Serving fine-tuned models

Options:

Merged weights — fold adapter into base; ship one model file. Simplest serving, can't swap.
Adapter at runtime — load base once, swap adapters per request. Multi-tenant; vLLM and TGI support this. Tiny per-tenant cost.
Distilled smaller model — fine-tune a 7B to match a 70B's behaviour on your task. Most $$ saved long-term.

For a multi-customer SaaS: adapter-at-runtime is the winning pattern. Caveat: it commits you to one base model; switching base means re-training every adapter.

12. Reality check

A 2-week fine-tune plan:

Days 1–2: build a 200-example eval set; freeze it.
Days 3–4: try prompting + RAG; record baseline.
Days 5–8: curate a 1–5k SFT dataset; train QLoRA on Llama-3-8B.
Days 9–10: eval; iterate dataset, not hyperparams.
Days 11–12: DPO from production thumbs-down/thumbs-up.
Days 13–14: serving integration; production canary.

If the baseline beats your fine-tune, ship the baseline. Most of the value is in the eval set + dataset curation, not the training itself.

Reading material

Books:

Build a Large Language Model (from Scratch) — Sebastian Raschka (Manning; the fine-tuning + instruction-following chapters)
Hands-On Large Language Models — Jay Alammar & Maarten Grootendorst (the PEFT and alignment chapters)
Natural Language Processing with Transformers — Tunstall, von Werra, Wolf (the HF authors; chapter on fine-tuning + datasets)

Papers:

LoRA: Low-Rank Adaptation of Large Language Models — Hu et al. 2021 — the canonical paper; rank-r decomposition explained from first principles.
QLoRA: Efficient Finetuning of Quantized LLMs — Dettmers et al. 2023 (NeurIPS) — the 4-bit + LoRA paper that put 65B fine-tuning on a single 48 GB GPU.
DPO: Direct Preference Optimization — Rafailov et al. 2023 (NeurIPS) — the RLHF replacement that fits in a normal training loop.
LIMA: Less Is More for Alignment — Zhou et al. 2023 (Meta) — 1,000 high-quality examples beat 50k mediocre ones.
ORPO: Monolithic Preference Optimization without Reference Model — Hong et al. 2024 — one-stage SFT + alignment.

Official docs:

Hugging Face PEFT documentation — the canonical library for LoRA/QLoRA/IA3/prompt-tuning.
Hugging Face TRL — Transformer Reinforcement Learning — SFT + DPO + PPO + KTO trainers.
Axolotl documentation — the YAML-driven fine-tuning framework widely used for production runs.
Unsloth documentation — the 2x-faster fine-tuning library for single-GPU workflows.

Blog posts:

Sebastian Raschka — Practical Tips for Finetuning LLMs Using LoRA — the most read practitioner essay on LoRA hyperparameters.
Tim Dettmers — QLoRA: 4-bit Fine-Tuning on a Single GPU — from the paper's first author; the intuition.
Hamel Husain — Fine-Tuning LLMs: A Practitioner's Guide — the modern playbook on when to fine-tune vs prompt vs RAG.
Eugene Yan — DPO: Replacing RLHF with a single loss — the survey of alignment methods.

In-depth research material

PEFT — github.com/huggingface/peft — ~17k ★, the canonical parameter-efficient fine-tuning library.
TRL — github.com/huggingface/trl — ~12k ★, the SFT/DPO/PPO trainer library.
Axolotl — github.com/axolotl-ai-cloud/axolotl — ~8k ★, the production-grade YAML fine-tuning harness.
Unsloth — github.com/unslothai/unsloth — ~21k ★, 2x faster + 50% less memory single-GPU fine-tuning.
LLaMA-Factory — github.com/hiyouga/LLaMA-Factory — ~38k ★, web-UI + CLI for fine-tuning 100+ models.
DeepSpeed — github.com/microsoft/DeepSpeed — ~36k ★, Microsoft's distributed training framework; ZeRO sharding.
bitsandbytes — github.com/bitsandbytes-foundation/bitsandbytes — ~6k ★, the 4-bit + 8-bit optimisers that make QLoRA possible.
GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection — Zhao et al. 2024 — full-parameter fine-tuning at LoRA memory cost.
Stanford Alpaca — github.com/tatsu-lab/stanford_alpaca — ~30k ★, the self-instruct dataset format that started the open-source fine-tuning explosion.
Argilla — github.com/argilla-io/argilla — ~4.5k ★, the open-source data-curation platform for instruction datasets.

Videos

LoRA explained (and a bit about precision and quantization) — DeepLearningAI — Mark Saroufim (PyTorch) · 17 min — the cleanest visual explanation of LoRA + quantisation.
Fine-tuning LLMs with PEFT and LoRA — Sam Witteveen — 35 min — hands-on Colab walkthrough; the right "see it work" video.
QLoRA: How to Fine-tune an LLM on a Single GPU — Tim Dettmers (paper author) · 26 min — the paper's first author explaining QLoRA.
Direct Preference Optimization (DPO) — Aleksa Gordić — 52 min — the DPO paper walkthrough.
The Novice's LLM Training Guide — Sebastian Raschka · 1 h 14 min — the soup-to-nuts tour of pre-training / SFT / DPO from the Build an LLM from Scratch author.

LeetCode — Matrix Multiplication

Link: https://leetcode.com/problems/matrix-multiplication/
Difficulty: Medium
Why this problem: LoRA is literally a low-rank matrix product added to the frozen weight: W' = W + B·A. Implementing matmul once cements the intuition.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Pick between full FT, LoRA, QLoRA for a given GPU + model combo.
Explain why low-rank updates work for fine-tuning empirically.
Configure LoRA hyperparameters (rank, alpha, target modules) sensibly.
Curate an SFT dataset that respects format, diversity, and contamination rules.
Design a 5-step eval gate before promoting a fine-tuned model.
Solve matrix-multiplication — the primitive operation behind LoRA's B·A update.
