LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect
Session 25 of the 48-session learning series.
Why this session matters
This is Session 25 of 48 in the LLM track. Evaluation is the unsexy half of LLM work — and the half that decides whether your shiny demo survives contact with real users. You can't ship what you can't measure.
Agenda
- Why evaluating LLMs is genuinely harder than classical ML
- Static benchmarks — MMLU, HumanEval, GSM8K — and their limits
- LLM-as-judge — pairwise, score-based, the bias gotchas
- RAGAS, TruLens — RAG-specific eval frameworks
- Inspect, OpenAI Evals — building your own eval harness
Pre-read (skim before the session)
- Anthropic — Challenges in evaluating AI systems
- RAGAS docs — Metrics
- Judging LLM-as-a-Judge (Zheng et al., 2023)
- Inspect AI — UK AISI eval framework
Deep dive
1. Why LLM eval is hard
Classical ML: one label, one prediction, clean metric (accuracy, AUC). LLMs:
- Open-ended output — no single right answer. "Summarise this" has 1000 acceptable answers.
- Multi-dimensional quality — factuality, fluency, safety, tone, helpfulness; can trade off.
- Reference-free settings — you often don't have a ground truth ("write a marketing email").
- Long-tail failures — the model is fine 99% of the time and catastrophically wrong on edge cases.
- Contamination — test set leaked into pre-training corpus → benchmark inflated.
You can't compute "accuracy" against a list of strings and call it a day.
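A minimal sketch of why string-match accuracy breaks down on open-ended output. The helper names and the example strings are made up for illustration; nothing here comes from a library.

```python
def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(outputs: list[str], references: list[str]) -> float:
    """Fraction of outputs that equal their reference after normalisation."""
    hits = sum(normalise(o) == normalise(r) for o, r in zip(outputs, references))
    return hits / len(references)


# Closed-form task: exact match is a reasonable metric.
closed_score = exact_match_accuracy(["Paris", "4"], ["paris", "4"])

# Open-ended task: an acceptable paraphrase still scores zero.
summary_ref = ["The report says revenue grew 10% year on year."]
summary_out = ["Revenue was up ten percent compared with last year."]
open_score = exact_match_accuracy(summary_out, summary_ref)

print(closed_score, open_score)
```

The second score is zero even though a human would accept the summary, which is the gap that judges and rubrics are meant to fill.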
2. The 4 layers of eval
[ Unit-level ] → does the prompt produce well-formed JSON?
[ Behavioural ] → does it refuse the jailbreak? answer the maths question?
[ Capability ] → MMLU, HumanEval, MATH on held-out data
[ User-outcome ] → did the user accept the suggestion? task success?
The bottom two cost the most and matter the most. Most teams over-invest in capability benchmarks and under-invest in user-outcome metrics.
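The unit-level layer is the cheapest to automate. A minimal sketch, assuming the model is asked to return a JSON object with known keys; `is_well_formed` is a hypothetical helper name.

```python
import json


def is_well_formed(raw: str, required_keys: set[str]) -> bool:
    """Unit-level check: output parses as a JSON object containing the expected keys."""
    try:
        parsed = json.loads(raw)
    except json.JSONDecodeError:
        return False
    # Reject valid JSON that is not an object (e.g. a list or a bare string).
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


print(is_well_formed('{"winner": "A", "reason": "clearer"}', {"winner", "reason"}))
print(is_well_formed("Sure! Here is the JSON you asked for...", {"winner", "reason"}))
```

A check like this says nothing about whether the content is good; it only guards the contract that downstream code depends on.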
3. Static benchmarks — useful, but...
| Benchmark | Tests | Gotcha |
|---|---|---|
| MMLU | 57-subject multiple choice | Contamination; rote knowledge |
| HumanEval | Python function from docstring | Tiny (164 tasks); plateaued |
| MATH | Competition maths | Reasoning + arithmetic mix |
| GSM8K | Grade-school word problems | Largely solved; check GSM-Symbolic |
| MT-Bench | Multi-turn chat, LLM-judged | Judge bias; small |
| HELM | Broad suite | Heavy, dated; good audit trail |
| BBH | "Hard" sub-tasks of BIG-Bench | Mixed quality |
| ARC-AGI | Visual puzzles | The reasoning bar; expensive to run |
Rule: use benchmarks to exclude models, not to pick them. If your candidate scores below 60% on MMLU, drop it. Above some threshold, benchmarks stop correlating with what you actually need.
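The exclusion rule can be sketched as a simple floor filter. The model names and scores below are invented; only the 60% MMLU floor comes from the rule above.

```python
# Hypothetical floors: a candidate must clear every one to stay on the shortlist.
FLOORS = {"mmlu": 0.60}


def shortlist(candidates: dict[str, dict[str, float]], floors: dict[str, float]) -> list[str]:
    """Keep candidates that clear every floor; a missing score counts as a failure."""
    return [
        name
        for name, scores in candidates.items()
        if all(scores.get(bench, 0.0) >= floor for bench, floor in floors.items())
    ]


# Made-up scores for illustration only.
candidates = {
    "model-a": {"mmlu": 0.72},
    "model-b": {"mmlu": 0.55},
    "model-c": {},  # never benchmarked
}
print(shortlist(candidates, FLOORS))
```

Note that the function only filters. Ranking the survivors is left to your own eval set, which is the point of the rule.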
4. LLM-as-Judge — the workhorse
Use a strong model (GPT-4-class) to evaluate outputs of another model. Three flavours:
- Pairwise — show judge two outputs A and B, ask which is better. Most reliable.
- Scalar — rate output 1–10 on dimension X. Easy but noisy.
- Rubric-based — multi-criteria scoring against a written rubric. Best for production.
Pairwise example:
You are a judge. Compare two assistant responses to the same prompt.
Decide which is better on: helpfulness, accuracy, conciseness.
Output JSON: {"winner": "A" | "B" | "tie", "reason": "..."}
Prompt: {user_prompt}
Response A: {output_a}
Response B: {output_b}
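A minimal sketch of wiring the prompt above into code: fill the template, then parse the judge's reply defensively. The function names are hypothetical and the actual model call is left out, since it depends on your provider.

```python
import json

# Same prompt as above; literal braces are doubled so str.format leaves them alone.
JUDGE_TEMPLATE = """You are a judge. Compare two assistant responses to the same prompt.
Decide which is better on: helpfulness, accuracy, conciseness.
Output JSON: {{"winner": "A" | "B" | "tie", "reason": "..."}}
Prompt: {user_prompt}
Response A: {output_a}
Response B: {output_b}"""


def build_judge_prompt(user_prompt: str, output_a: str, output_b: str) -> str:
    """Fill the pairwise template with one prompt and two candidate responses."""
    return JUDGE_TEMPLATE.format(
        user_prompt=user_prompt, output_a=output_a, output_b=output_b
    )


def parse_verdict(raw: str) -> str:
    """Return 'A', 'B' or 'tie'; anything malformed becomes 'invalid'."""
    try:
        winner = json.loads(raw).get("winner")
    except (json.JSONDecodeError, AttributeError):
        return "invalid"
    return winner if winner in ("A", "B", "tie") else "invalid"


print(build_judge_prompt("What is 2+2?", "4", "It is four, obviously."))
print(parse_verdict('{"winner": "A", "reason": "more concise"}'))
```

Counting "invalid" verdicts separately, rather than silently dropping them, keeps judge formatting failures visible in your metrics.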
5. Judge biases (this will burn you)
- Position bias — first option wins more often. Mitigation: swap order, average.
- Length bias — longer answers preferred even when worse. Mitigation: length-controlled rubric.
- Self-preference — model judges its own family higher. Mitigation: use a different model family as judge.
- Verbosity bias — flowery > terse. Mitigation: explicit rubric criterion.
- Authority bias — confident wrong > tentative correct. Mitigation: factuality sub-score.
Validate the judge against ~200 human-labelled pairs before you trust it for thousands of evals.
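The order-swap mitigation and the human-validation step can be sketched as follows. The judge is passed in as a plain function so the example stays runnable with stubs. Treating inconsistent verdicts as a tie is one design choice among several (averaging scores is another), not a standard.

```python
from typing import Callable

# (prompt, first_shown, second_shown) -> "A" | "B" | "tie"
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, prompt: str, out_1: str, out_2: str) -> str:
    """Run the judge in both orders; only a consistent preference counts as a win."""
    forward = judge(prompt, out_1, out_2)   # out_1 shown as A
    backward = judge(prompt, out_2, out_1)  # out_1 shown as B
    # Map the swapped verdict back to the original labelling.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[backward]
    return forward if forward == unswapped else "tie"


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Share of items where the judge matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def first_wins(prompt: str, first: str, second: str) -> str:
    """Stub judge with pure position bias: always prefers whatever is shown first."""
    return "A"


print(debiased_verdict(first_wins, "q", "short", "a longer answer"))
print(agreement_rate(["A", "B", "tie", "A"], ["A", "B", "A", "A"]))
```

The position-biased stub collapses to a tie after the swap, which is the behaviour you want: a preference that flips with ordering carries no signal.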
6. RAGAS — RAG-specific metrics
For retrieval-augmented systems:
| Metric | What it measures |
|---|---|
| faithfulness | Are all claims in answer supported by context? |
| answer_relevancy | Does the answer actually address the question? |
| context_precision | Are retrieved chunks actually useful? |
| context_recall | Did retrieval find all needed info? |
| answer_correctness | Does answer match ground truth (when available)? |
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# eval_dataset is assumed to be built elsewhere, in the format your RAGAS version expects.
result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy],
)
These give you a numeric, comparable score per retrieval/generation strategy. Crucial for the eval loop in RAG (see S13).
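To make the faithfulness row concrete, here is a toy sketch of the ratio behind it: supported claims divided by total claims. This is not RAGAS's implementation. RAGAS relies on an LLM to extract and verify claims, whereas the substring check below is a deliberately naive stand-in so the example runs offline.

```python
from typing import Callable


def faithfulness_score(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claims:
        return 0.0
    supported = sum(is_supported(claim, context) for claim in claims)
    return supported / len(claims)


def naive_support(claim: str, context: str) -> bool:
    """Toy verifier (assumption): a claim counts only if it appears verbatim."""
    return claim.lower() in context.lower()


context = "The Eiffel Tower is in Paris. It was completed in 1889."
claims = ["the eiffel tower is in paris", "it was completed in 1925"]
print(faithfulness_score(claims, context, naive_support))
```

Swapping `naive_support` for an LLM-backed verifier keeps the same shape, which is why the metric is comparable across retrieval and generation strategies.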
7. TruLens, DeepEval, Inspect
- TruLens — instrumented runtime; collects traces; defines "feedback functions" (RAGAS-ish).
- DeepEval — pytest-style assertions for LLM output: assert_relevance(), assert_no_hallucination().
- Inspect (UK AISI) — capability + safety eval framework, used for frontier-model red-teaming. Plugin model.
- OpenAI Evals — open-source harness, plug-in eval types.
- Promptfoo — YAML-driven A/B prompt eval; great for prompt regression.
8. Building your own eval set
Don't skip this. A 100-example, hand-crafted eval set tailored to your product beats any public benchmark.
Process:
- Mine real user prompts (anonymise!).
- Bucket by intent (10–15 buckets is enough).
- For each bucket, write 5–10 gold-standard answers (or rubrics).
- Add adversarial examples — your top 20 known failure modes.
- Version the set; checksum it; track it like a dataset, not a doc.
Re-run the whole set on every model/prompt change. Track scores in MLflow.
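"Checksum it" can be as small as hashing a canonical encoding of the set. A minimal sketch using only the standard library; the field names (`id`, `prompt`, `rubric`) are assumptions about how you store examples.

```python
import hashlib
import json


def eval_set_checksum(examples: list[dict]) -> str:
    """SHA-256 over canonical JSON, so key order inside an example doesn't matter."""
    canonical = json.dumps(
        examples, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical example records.
v1 = [{"id": 1, "prompt": "Refund policy?", "rubric": "Mentions 30-day window"}]
v2 = [{"id": 1, "prompt": "Refund policy?", "rubric": "Mentions 14-day window"}]

print(eval_set_checksum(v1)[:12])
print(eval_set_checksum(v1) == eval_set_checksum(v2))
```

Storing the checksum next to each score makes it obvious when two runs were graded against different versions of the set. The order of examples still affects the hash here; sort by `id` first if that matters to you.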
9. Online vs offline eval
- Offline — fixed dataset, runs on every release. Fast, repeatable, but stale.
- Online — production traffic, label by user reaction (👍/👎, accept-rate, edit-distance to final answer). Slow, biased by current cohort, but real.
Both. Offline catches regressions before ship; online catches reality drift.
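Two of the online signals mentioned above can be computed from logged events with the standard library. This is a sketch: the event fields are hypothetical, and `SequenceMatcher.ratio()` is a similarity score in [0, 1] used as a stand-in for a true edit distance.

```python
from difflib import SequenceMatcher


def edit_similarity(model_output: str, final_text: str) -> float:
    """1.0 means the user kept the output verbatim; lower means heavier editing."""
    return SequenceMatcher(None, model_output, final_text).ratio()


def accept_rate(events: list[dict]) -> float:
    """Share of suggestions the user accepted."""
    if not events:
        return 0.0
    return sum(1 for e in events if e["accepted"]) / len(events)


# Hypothetical production log entries.
events = [
    {"output": "Thanks for reaching out!", "final": "Thanks for reaching out!", "accepted": True},
    {"output": "Thanks for reaching out!", "final": "Hi, thanks for your email.", "accepted": False},
]
print(accept_rate(events))
print([round(edit_similarity(e["output"], e["final"]), 2) for e in events])
```

Both numbers are biased by who is using the product right now, so they complement the offline set rather than replace it.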
10. The eval flywheel
[ Real user fails ] → [ Add to eval set ] → [ Run all candidates ] → [ Pick winner ] → [ Ship ]
▲ │
└───────────────────────────────────────────────────────────────────────────────────┘
Every bug becomes a permanent test. After 6 months you have 2000 examples that capture exactly what your product needs. That set is your moat.
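The loop can be sketched with plain functions. Candidates and the grader are stubs here; in practice a candidate would call a model and the grader might be a rubric or an LLM judge. All names are hypothetical.

```python
from typing import Callable

Candidate = Callable[[str], str]     # prompt -> output
Grader = Callable[[str, str], bool]  # (output, expected) -> pass/fail


def add_failure(eval_set: list[dict], prompt: str, expected: str) -> None:
    """A real user failure becomes a permanent test (duplicates are skipped)."""
    if not any(ex["prompt"] == prompt for ex in eval_set):
        eval_set.append({"prompt": prompt, "expected": expected})


def pass_rate(candidate: Candidate, eval_set: list[dict], grader: Grader) -> float:
    """Run one candidate over the whole set and return its pass fraction."""
    if not eval_set:
        return 0.0
    passed = sum(grader(candidate(ex["prompt"]), ex["expected"]) for ex in eval_set)
    return passed / len(eval_set)


def pick_winner(candidates: dict[str, Candidate], eval_set: list[dict], grader: Grader) -> str:
    """Return the name of the candidate with the highest pass rate."""
    return max(candidates, key=lambda name: pass_rate(candidates[name], eval_set, grader))


eval_set: list[dict] = []
add_failure(eval_set, "shout hello", "HELLO")
stubs = {"echo": lambda p: p, "shouter": lambda p: p.split()[-1].upper()}
print(pick_winner(stubs, eval_set, lambda out, exp: out == exp))
```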
11. Cost-aware eval
LLM-as-judge is expensive. Budgeting tips:
- Sample, don't run-all-on-everything.
- Cache judge results keyed on (prompt, output_a, output_b, judge_model).
- Use a cheaper "screening" judge → escalate ambiguous cases to a stronger judge.
- Score per-dollar: improvements aren't free; track quality_delta / cost_delta.
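The caching and escalation tips can be sketched together. Here a judge is assumed to return a score in [0, 1], read as how strongly it prefers A over B. That interface, the `margin` threshold, and the function names are assumptions for illustration.

```python
from typing import Callable

# (prompt, output_a, output_b) -> score in [0, 1]; 0.5 means "can't tell".
ScoringJudge = Callable[[str, str, str], float]


def make_cached(judge: ScoringJudge, judge_model: str, cache: dict) -> ScoringJudge:
    """Wrap a judge so repeated (prompt, output_a, output_b, judge_model) calls are free."""
    def cached(prompt: str, output_a: str, output_b: str) -> float:
        key = (prompt, output_a, output_b, judge_model)
        if key not in cache:
            cache[key] = judge(prompt, output_a, output_b)
        return cache[key]
    return cached


def tiered_score(
    cheap: ScoringJudge,
    strong: ScoringJudge,
    prompt: str,
    output_a: str,
    output_b: str,
    margin: float = 0.2,
) -> float:
    """Trust the cheap judge when it is decisive; escalate near-coin-flip cases."""
    score = cheap(prompt, output_a, output_b)
    if abs(score - 0.5) >= margin:
        return score
    return strong(prompt, output_a, output_b)


def quality_per_dollar(quality_delta: float, cost_delta: float) -> float:
    """Quality gained per extra dollar spent; undefined when cost does not change."""
    if cost_delta == 0:
        raise ValueError("cost_delta must be non-zero")
    return quality_delta / cost_delta
```

Including `judge_model` in the cache key matters: a verdict from the screening judge must never be served as if it came from the stronger one.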
12. Reality check
A pragmatic minimum eval stack for a startup:
- 100 hand-crafted prompts with rubrics (versioned in git).
- A pytest job that runs them on every PR using promptfoo or a homemade runner.
- Pairwise LLM-judge with order-swap, against GPT-4o.
- Production logging of (input, output, model_version, user_feedback).
- Weekly review of bottom-decile responses → add to eval set.
This is enough. Buy the platform when you outgrow it, not before.
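The weekly bottom-decile review needs nothing more than a sort over the production log. A sketch, assuming each logged record carries a numeric `score` (for example from the judge) alongside the fields listed above; that field name is hypothetical.

```python
def bottom_decile(records: list[dict], score_key: str = "score") -> list[dict]:
    """Return the lowest-scoring 10% of records (at least one), worst first."""
    if not records:
        return []
    ranked = sorted(records, key=lambda r: r[score_key])
    cutoff = max(1, len(ranked) // 10)
    return ranked[:cutoff]


# Hypothetical log: 20 records with made-up scores 0..19.
log = [
    {"input": f"q{i}", "output": f"a{i}", "model_version": "v1", "score": i}
    for i in range(20)
]
for record in bottom_decile(log):
    print(record["input"], record["score"])
```

Whatever comes out of this review goes straight into the versioned eval set.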
Reading material
Books:
- AI Engineering — Chip Huyen (the entire evaluation chapter; the best modern treatment)
- Building LLMs for Production — Bouchard & Peters (the eval chapter with hands-on RAGAS examples)
- Speech and Language Processing, 3rd ed. — Jurafsky & Martin (the classical NLP eval chapter; ROUGE, BLEU, BERTScore origins)
Papers:
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al. 2023 — the canonical LLM-as-judge paper.
- RAGAS: Automated Evaluation of Retrieval Augmented Generation — Es et al. 2023 — faithfulness + answer relevance + context relevance.
- HELM: Holistic Evaluation of Language Models — Liang et al. 2022 (Stanford CRFM) — the multi-axis benchmark framework.
- BIG-bench — Srivastava et al. 2022 — 200+ tasks from 450 collaborators.
- GPT-4 Technical Report — OpenAI 2023 (evaluation sections) — how a frontier lab discloses (or doesn't disclose) evals.
Official docs:
- RAGAS — metrics docs — faithfulness, answer relevance, context precision/recall.
- Inspect AI — UK AISI — the UK AI Safety Institute's framework; the modern standard.
- Promptfoo docs — declarative prompt eval, regression tests.
- OpenAI Evals — the format OpenAI uses internally.
- LangSmith docs — Evaluation — hosted LLM tracing + eval.
Blog posts:
- Eugene Yan — Patterns for Building LLM-based Systems & Products: Evals — the single best practitioner essay.
- Hamel Husain — Your AI Product Needs Evals — the canonical "how to actually run evals in production" post.
- Anthropic — Challenges in evaluating AI systems — the honest "why this is hard" post from Anthropic.
In-depth research material
- RAGAS — github.com/explodinggradients/ragas — ~10k ★, RAG-specific evals.
- DeepEval — github.com/confident-ai/deepeval — ~7k ★, pytest-flavoured LLM evals.
- OpenAI Evals — github.com/openai/evals — ~16k ★, the open eval framework + dataset.
- LM Evaluation Harness — github.com/EleutherAI/lm-evaluation-harness — ~9k ★, the de-facto OSS benchmark runner (HF Open LLM Leaderboard uses it).
- Inspect AI — github.com/UKGovernmentBEIS/inspect_ai — the UK AISI framework, now adopted by NIST.
- HELM — github.com/stanford-crfm/helm — ~2k ★, Stanford holistic eval.
- Chatbot Arena — LMSys — crowdsourced head-to-head rankings.
- Vellum — LLM Leaderboard — the most current production-facing leaderboard.
- Hamel Husain — Field guide to rapidly improving AI products — the workflow that ties evals + error analysis + fixes.
- Eugene Yan — LLM evaluators (a survey) — the meta-paper on LLM-as-judge biases.
Videos
- A Practitioner's Guide to LLM Evals — Hamel Husain · 1 h 06 min — the talk version of his canonical blog post; watch this first.
- LLM Evaluation Explained — Maven AI Eng course · 1 h 21 min — the workshop-format walkthrough.
- Evaluating LLM Systems with Eugene Yan — Eugene Yan · 50 min — from the patterns-for-LLM-systems author.
- LangSmith Evaluations Deep Dive — LangChain · 33 min — hands-on with the most popular hosted eval tool.
- Building Reliable AI Systems: Evals — Shreya Shankar (Stanford MLSys) — 1 h — the academic-rigour view on eval systems.
LeetCode — Evaluate Division
- Link: https://leetcode.com/problems/evaluate-division/
- Difficulty: Medium
- Why this problem: Build a graph of equations and answer arbitrary queries — exact shape of evaluating one model output against a chain of judge criteria.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Explain why open-ended LLM output makes eval qualitatively harder than classical ML.
- List the 4 layers of eval (unit, behavioural, capability, user-outcome).
- Design a pairwise LLM-judge prompt with bias mitigations.
- Pick the right RAGAS metric (faithfulness vs context_precision vs answer_relevancy) for a given failure.
- Sketch the 5-step "eval flywheel" loop.
- Solve evaluate-division — graph traversal of weighted edges; mirrors chained eval criteria.