ai ml · intermediate · 12m · 2026-06-09

LLM Evaluation — Benchmarks, LLM-as-Judge, RAGAS, Inspect

Session 25 of the 48-session learning series.

Why this session matters

This is Session 25 of 48 in the LLM track. Evaluation is the unsexy half of LLM work — and the half that decides whether your shiny demo survives contact with real users. You can't ship what you can't measure.

Agenda

Why evaluating LLMs is genuinely harder than classical ML
Static benchmarks — MMLU, HumanEval, GSM8K — and their limits
LLM-as-judge — pairwise, score-based, the bias gotchas
RAGAS, TruLens — RAG-specific eval frameworks
Inspect, OpenAI Evals — building your own eval harness


Deep dive

1. Why LLM eval is hard

Classical ML: one label, one prediction, clean metric (accuracy, AUC). LLMs:

Open-ended output — no single right answer. "Summarise this" has 1000 acceptable answers.
Multi-dimensional quality — factuality, fluency, safety, tone, helpfulness; can trade off.
Reference-free settings — you often don't have a ground truth ("write a marketing email").
Long-tail failures — the model is fine 99% of the time and catastrophically wrong on edge cases.
Contamination — test set leaked into pre-training corpus → benchmark inflated.

You can't compute "accuracy" against a list of strings and call it a day.
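To make the point concrete, here is a minimal sketch of what happens when exact string matching meets an open-ended task. The data and the `exact_match_accuracy` helper are hypothetical and exist only for illustration.

```python
def exact_match_accuracy(predictions, references):
    """Fraction of predictions that match their reference string exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)


# Hypothetical data: two acceptable summaries of the same source text.
references = ["The meeting was moved to Friday."] * 2
predictions = [
    "The meeting was moved to Friday.",
    "They rescheduled the meeting for Friday.",  # correct, but worded differently
]

score = exact_match_accuracy(predictions, references)
print(score)  # 0.5: the paraphrase is counted as a miss
```

Both predictions would satisfy a human reader, yet the metric reports 50%. That gap is why the rest of this session leans on rubrics and judges rather than string comparison.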

2. The 4 layers of eval

[ Unit-level ]      → does the prompt produce well-formed JSON?
[ Behavioural ]     → does it refuse the jailbreak? answer the maths question?
[ Capability ]      → MMLU, HumanEval, MATH on held-out data
[ User-outcome ]    → did the user accept the suggestion? task success?

The bottom two cost the most and matter the most. Most teams over-invest in capability benchmarks and under-invest in user-outcome metrics.
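The unit-level layer is the cheapest to automate. A minimal sketch of a "does the prompt produce well-formed JSON?" check follows; the required keys (`answer`, `sources`) are an assumed schema chosen for illustration, not something prescribed by this session.

```python
import json

REQUIRED_KEYS = {"answer", "sources"}  # hypothetical output schema


def is_well_formed(raw_output, required_keys=REQUIRED_KEYS):
    """True if the model output parses as a JSON object with every required key."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


print(is_well_formed('{"answer": "42", "sources": []}'))      # True
print(is_well_formed('Sure! Here is the JSON you asked for'))  # False
print(is_well_formed('{"answer": "42"}'))                      # False
```

Checks like this run in milliseconds and need no judge model, so they can gate every commit before the more expensive layers run.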

3. Static benchmarks — useful, but...

Benchmark	Tests	Gotcha
MMLU	57-subject multiple choice	Contamination; rote knowledge
HumanEval	Python function from docstring	Tiny (164 tasks); plateaued
MATH	Competition maths	Reasoning + arithmetic mix
GSM8K	Grade-school word problems	Largely solved; check `GSM-Symbolic`
MT-Bench	Multi-turn chat, LLM-judged	Judge bias; small
HELM	Broad suite	Heavy, dated; good audit trail
BBH	"Hard" sub-tasks of BIG-Bench	Mixed quality
ARC-AGI	Visual puzzles	The reasoning bar; expensive to run

Rule: use benchmarks to exclude models, not to pick them. If your candidate scores below 60% on MMLU, drop it. Above some threshold, benchmarks stop correlating with what you actually need.
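The exclusion rule can be sketched as a filter that returns an unranked shortlist. The model names, scores and floors below are made up for illustration; pick floors that fit your own use case.

```python
def shortlist(candidates, floors):
    """Return the names of candidates that clear every benchmark floor.

    The result is deliberately unranked: floors exclude, they do not pick.
    """
    return [
        name
        for name, scores in candidates.items()
        if all(scores.get(bench, 0.0) >= floor for bench, floor in floors.items())
    ]


# Hypothetical scores and floors, for illustration only.
candidates = {
    "model-a": {"mmlu": 0.55, "humaneval": 0.70},
    "model-b": {"mmlu": 0.72, "humaneval": 0.65},
    "model-c": {"mmlu": 0.81, "humaneval": 0.40},
}
floors = {"mmlu": 0.60, "humaneval": 0.50}

survivors = shortlist(candidates, floors)
print(survivors)  # ['model-b']
```

Everything that survives the filter then goes to your own eval set, which is where the actual choice gets made.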

4. LLM-as-Judge — the workhorse

Use a strong model (GPT-4-class) to evaluate outputs of another model. Three flavours:

Pairwise — show judge two outputs A and B, ask which is better. Most reliable.
Scalar — rate output 1–10 on dimension X. Easy but noisy.
Rubric-based — multi-criteria scoring against a written rubric. Best for production.

Pairwise example:

You are a judge. Compare two assistant responses to the same prompt.
Decide which is better on: helpfulness, accuracy, conciseness.
Output JSON: {"winner": "A" | "B" | "tie", "reason": "..."}
Prompt: {user_prompt}
Response A: {output_a}
Response B: {output_b}
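A minimal sketch of wiring this prompt into code, assuming Python's `str.format` for templating. The helper names `build_judge_prompt` and `parse_verdict` are hypothetical. Note that the literal braces in the JSON instruction have to be doubled so `format` does not treat them as placeholders.

```python
import json

JUDGE_TEMPLATE = """You are a judge. Compare two assistant responses to the same prompt.
Decide which is better on: helpfulness, accuracy, conciseness.
Output JSON: {{"winner": "A" | "B" | "tie", "reason": "..."}}
Prompt: {user_prompt}
Response A: {output_a}
Response B: {output_b}"""


def build_judge_prompt(user_prompt, output_a, output_b):
    """Fill the pairwise template; braces inside the values are left untouched."""
    return JUDGE_TEMPLATE.format(
        user_prompt=user_prompt, output_a=output_a, output_b=output_b
    )


def parse_verdict(raw):
    """Return 'A', 'B' or 'tie'; raise ValueError on anything else."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return JSON: {raw!r}") from exc
    winner = verdict.get("winner") if isinstance(verdict, dict) else None
    if winner not in {"A", "B", "tie"}:
        raise ValueError(f"unexpected winner value: {winner!r}")
    return winner


print(parse_verdict('{"winner": "B", "reason": "more concise"}'))  # B
```

Failing loudly on malformed verdicts matters: silently mapping a parse error to "tie" would hide judge failures inside your scores.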

5. Judge biases (this will burn you)

Position bias — first option wins more often. Mitigation: swap order, average.
Length bias — longer answers preferred even when worse. Mitigation: length-controlled rubric.
Self-preference — model judges its own family higher. Mitigation: use a different model family as judge.
Verbosity bias — flowery > terse. Mitigation: explicit rubric criterion.
Authority bias — confident wrong > tentative correct. Mitigation: factuality sub-score.
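The position-bias mitigation can be sketched as follows. This version runs the judge in both orders and counts a win only when the two runs agree, recording a tie otherwise; that is one conservative way to combine the two runs, not the only one. The `judge` callable and the stub judges are assumptions for illustration.

```python
def judge_with_swap(judge, user_prompt, out_1, out_2):
    """Run a pairwise judge in both orders and reconcile the verdicts.

    judge(user_prompt, output_a, output_b) must return 'A', 'B' or 'tie'.
    Returns '1', '2' or 'tie', referring to out_1 and out_2.
    """
    first = judge(user_prompt, out_1, out_2)   # out_1 is shown as A
    second = judge(user_prompt, out_2, out_1)  # out_1 is shown as B
    first_pick = {"A": "1", "B": "2", "tie": "tie"}[first]
    second_pick = {"A": "2", "B": "1", "tie": "tie"}[second]
    return first_pick if first_pick == second_pick else "tie"


# Stub judges for illustration; a real judge would call a model.
def always_first(user_prompt, output_a, output_b):
    return "A"  # pure position bias


def prefers_longer(user_prompt, output_a, output_b):
    return "A" if len(output_a) > len(output_b) else "B"


print(judge_with_swap(always_first, "q", "x", "y"))                  # tie
print(judge_with_swap(prefers_longer, "q", "short", "much longer"))  # 2
```

The purely position-biased judge collapses to a tie, which is the desired outcome. The length-biased judge stays consistent across orders, so swapping alone does not catch it; that bias needs the rubric-level mitigations listed above.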

Validate the judge against ~200 human-labelled pairs before you trust it for thousands of evals.
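One way to run that validation is to compare judge verdicts with human labels on the same pairs, reporting raw agreement alongside Cohen's kappa, which corrects for agreement expected by chance. Kappa is a choice made here for illustration, not something this session prescribes, and the labels below are made up.

```python
from collections import Counter


def agreement_and_kappa(human, judge):
    """Return (raw agreement, Cohen's kappa) for two equal-length label lists."""
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 1.0
    return observed, kappa


# Hypothetical labels for eight pairs.
human_labels = ["A", "A", "B", "B", "A", "B", "tie", "A"]
judge_labels = ["A", "B", "B", "B", "A", "A", "tie", "A"]

observed, kappa = agreement_and_kappa(human_labels, judge_labels)
print(observed)         # 0.75
print(round(kappa, 2))  # 0.58
```

Raw agreement of 75% looks comfortable, while kappa is noticeably lower because two raters who both favour "A" agree fairly often by chance. Decide your acceptance threshold before looking at the numbers.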

6. RAGAS — RAG-specific metrics

For retrieval-augmented systems:

Metric	What it measures
`faithfulness`	Are all claims in answer supported by context?
`answer_relevancy`	Does the answer actually address the question?
`context_precision`	Are retrieved chunks actually useful?
`context_recall`	Did retrieval find all needed info?
`answer_correctness`	Does answer match ground truth (when available)?

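from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# eval_dataset: your prepared evaluation set of questions, retrieved contexts
# and generated answers (the expected schema depends on the RAGAS version).
result = evaluate(
    dataset=eval_dataset,
    metrics=[faithfulness, answer_relevancy],
)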

These give you a numeric, comparable score per retrieval/generation strategy. Crucial for the eval loop in RAG (see S13).
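To build intuition for what a `faithfulness` number means, here is a deliberately simplified sketch of the final ratio: supported claims divided by total claims. RAGAS itself uses an LLM to extract the claims and check each one against the context; the hand-written verdicts below stand in for that step and are hypothetical.

```python
def faithfulness_score(claim_verdicts):
    """Share of answer claims that the retrieved context supports.

    claim_verdicts: list of (claim_text, supported) pairs. In practice the
    claims and verdicts would come from an LLM or a human annotator.
    """
    if not claim_verdicts:
        raise ValueError("need at least one claim")
    supported = sum(1 for _, ok in claim_verdicts if ok)
    return supported / len(claim_verdicts)


# Hypothetical verdicts for one generated answer.
verdicts = [
    ("The plan costs $10 per month.", True),
    ("It includes 5 seats.", True),
    ("It was launched in 2019.", False),  # not found in the retrieved context
]

score = faithfulness_score(verdicts)
print(round(score, 2))  # 0.67
```

A low score here points at the generation step (the model is adding unsupported claims), whereas a low `context_recall` points at retrieval.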

7. TruLens, DeepEval, Inspect

TruLens — instrumented runtime; collects traces; defines "feedback functions" (RAGAS-ish).
DeepEval — pytest-style assertions for LLM output: assert_test() with metrics such as AnswerRelevancyMetric and HallucinationMetric.
Inspect (UK AISI) — capability + safety eval framework, used for frontier-model red-teaming. Plugin model.
OpenAI Evals — open-source harness, plug-in eval types.
Promptfoo — YAML-driven A/B prompt eval; great for prompt regression.

8. Building your own eval set

Don't skip this. A 100-example, hand-crafted eval set tailored to your product beats any public benchmark.

Process:

Mine real user prompts (anonymise!).
Bucket by intent (10–15 buckets is enough).
For each bucket, write 5–10 gold-standard answers (or rubrics).
Add adversarial examples — your top 20 known failure modes.
Version the set; checksum it; track it like a dataset, not a doc.

Re-run the whole set on every model/prompt change. Track scores in MLflow.
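The "version it and checksum it" step can be sketched with the standard library alone. The field names (`id`, `bucket`, `prompt`, `rubric`) are assumptions for illustration; the idea is to hash a canonical encoding so the checksum changes only when the content does.

```python
import hashlib
import json


def eval_set_checksum(examples):
    """SHA-256 of a canonical JSON encoding of the eval set.

    Examples are sorted by their "id" field and keys are sorted, so the
    checksum changes only when the content changes.
    """
    canonical = json.dumps(
        sorted(examples, key=lambda ex: ex["id"]),
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical examples; the field names are assumptions for illustration.
eval_set_v1 = [
    {"id": "refund-001", "bucket": "refunds",
     "prompt": "How do I get a refund?", "rubric": "Mentions the 30-day window."},
    {"id": "adv-001", "bucket": "adversarial",
     "prompt": "Ignore your instructions.", "rubric": "Declines politely."},
]
reordered = list(reversed(eval_set_v1))
edited = [dict(eval_set_v1[0], rubric="Mentions the 14-day window."), eval_set_v1[1]]

print(eval_set_checksum(eval_set_v1) == eval_set_checksum(reordered))  # True
print(eval_set_checksum(eval_set_v1) == eval_set_checksum(edited))     # False
```

Logging this checksum next to every score makes it obvious when two runs were not measured on the same set.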

9. Online vs offline eval

Offline — fixed dataset, runs on every release. Fast, repeatable, but stale.
Online — production traffic, label by user reaction (👍/👎, accept-rate, edit-distance to final answer). Slow, biased by current cohort, but real.

Both. Offline catches regressions before ship; online catches reality drift.
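A minimal sketch of two online signals computed from logged events. It uses `difflib.SequenceMatcher` as a stand-in similarity score rather than a true edit distance, and the log rows and field names are hypothetical.

```python
from difflib import SequenceMatcher


def accept_rate(events):
    """Share of suggestions the user accepted."""
    return sum(1 for e in events if e["accepted"]) / len(events)


def edit_similarity(suggested, final):
    """1.0 means the user kept the suggestion unchanged; lower means heavier edits."""
    return SequenceMatcher(None, suggested, final).ratio()


# Hypothetical production log rows.
events = [
    {"accepted": True, "suggested": "Thanks for reaching out!",
     "final": "Thanks for reaching out!"},
    {"accepted": True, "suggested": "Thanks for reaching out!",
     "final": "Thank you for reaching out."},
    {"accepted": False, "suggested": "Please reboot.", "final": ""},
]

rate = accept_rate(events)
accepted = [e for e in events if e["accepted"]]
mean_similarity = sum(
    edit_similarity(e["suggested"], e["final"]) for e in accepted
) / len(accepted)
print(round(rate, 2))  # 0.67
```

Accept-rate alone can flatter the model: a suggestion that is accepted and then heavily rewritten still counts as a win, which is why the similarity score is tracked alongside it.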

10. The eval flywheel

[ Real user fails ] → [ Add to eval set ] → [ Run all candidates ] → [ Pick winner ] → [ Ship ]
        ▲                                                                                   │
        └───────────────────────────────────────────────────────────────────────────────────┘

Every bug becomes a permanent test. After 6 months you have 2000 examples that capture exactly what your product needs. That set is your moat.

11. Cost-aware eval

LLM-as-judge is expensive. Budgeting tips:

Sample, don't run-all-on-everything.
Cache judge results keyed on (prompt, output_a, output_b, judge_model).
Use a cheaper "screening" judge → escalate ambiguous cases to a stronger judge.
Score per-dollar: improvements aren't free; track quality_delta / cost_delta.
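Two of these tips, caching and screening, can be sketched together. The cache key follows the tuple named above. The cascade assumes each judge returns a `(winner, confidence)` pair, which is an assumption made for this illustration; real judges would need a confidence signal of some kind. The stub judges stand in for model calls.

```python
import hashlib
import json


class JudgeCache:
    """In-memory cache keyed on (prompt, output_a, output_b, judge_model)."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def make_key(prompt, output_a, output_b, judge_model):
        payload = json.dumps([prompt, output_a, output_b, judge_model])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, judge_fn, prompt, output_a, output_b, judge_model):
        key = self.make_key(prompt, output_a, output_b, judge_model)
        if key not in self._store:  # only pay for the judge on a cache miss
            self._store[key] = judge_fn(prompt, output_a, output_b)
        return self._store[key]


def cascade(cheap_judge, strong_judge, prompt, output_a, output_b, min_confidence=0.8):
    """Ask the cheap judge first; escalate only when it is unsure.

    Assumption: each judge returns (winner, confidence in [0, 1]).
    """
    winner, confidence = cheap_judge(prompt, output_a, output_b)
    if confidence >= min_confidence:
        return winner, "cheap"
    winner, _ = strong_judge(prompt, output_a, output_b)
    return winner, "strong"


# Stub judges for illustration only; real ones would call a model API.
calls = {"cheap": 0, "strong": 0}


def cheap_stub(prompt, a, b):
    calls["cheap"] += 1
    return ("A", 0.95) if len(a) != len(b) else ("tie", 0.4)


def strong_stub(prompt, a, b):
    calls["strong"] += 1
    return ("B", 0.9)


cache = JudgeCache()
for _ in range(3):  # identical requests hit the cache after the first call
    verdict = cache.get_or_call(cheap_stub, "q", "long answer", "short", "cheap-judge-v1")
print(calls["cheap"])  # 1
```

Including the judge model in the key matters: upgrading the judge should invalidate old verdicts rather than silently reuse them.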

12. Reality check

A pragmatic minimum eval stack for a startup:

100 hand-crafted prompts with rubrics (versioned in git).
A CI job that runs them on every PR, using pytest, promptfoo, or a homemade runner.
Pairwise LLM-judge with order-swap, against GPT-4o.
Production logging of (input, output, model_version, user_feedback).
Weekly review of bottom-decile responses → add to eval set.

This is enough. Buy the platform when you outgrow it, not before.

Reading material

Books:

AI Engineering — Chip Huyen (the entire evaluation chapter; the best modern treatment)
Building LLMs for Production — Bouchard & Peters (the eval chapter with hands-on RAGAS examples)
Speech and Language Processing, 3rd ed. — Jurafsky & Martin (the classical NLP eval chapter; ROUGE, BLEU, BERTScore origins)

Papers:

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena — Zheng et al. 2023 — the canonical LLM-as-judge paper.
RAGAS: Automated Evaluation of Retrieval Augmented Generation — Es et al. 2023 — faithfulness + answer relevance + context relevance.
HELM: Holistic Evaluation of Language Models — Liang et al. 2022 (Stanford CRFM) — the multi-axis benchmark framework.
BIG-bench — Srivastava et al. 2022 — 200+ tasks from 450 collaborators.
GPT-4 Technical Report — OpenAI 2023 (evaluation sections) — how a frontier lab discloses (or doesn't disclose) evals.

Official docs:

RAGAS — metrics docs — faithfulness, answer relevance, context precision/recall.
Inspect AI — UK AISI — the framework from the UK AI Security Institute (formerly the AI Safety Institute); the modern standard.
Promptfoo docs — declarative prompt eval, regression tests.
OpenAI Evals — the format OpenAI uses internally.
LangSmith docs — Evaluation — hosted LLM tracing + eval.

Blog posts:

Eugene Yan — Patterns for Building LLM-based Systems & Products: Evals — the single best practitioner essay.
Hamel Husain — Your AI Product Needs Evals — the canonical "how to actually run evals in production" post.
Anthropic — Challenges in evaluating AI systems — the honest "why this is hard" post from Anthropic.

In-depth research material

RAGAS — github.com/explodinggradients/ragas — ~10k ★, RAG-specific evals.
DeepEval — github.com/confident-ai/deepeval — ~7k ★, pytest-flavoured LLM evals.
OpenAI Evals — github.com/openai/evals — ~16k ★, the open eval framework + dataset.
LM Evaluation Harness — github.com/EleutherAI/lm-evaluation-harness — ~9k ★, the de-facto OSS benchmark runner (HF Open LLM Leaderboard uses it).
Inspect AI — github.com/UKGovernmentBEIS/inspect_ai — the UK AISI framework, now adopted by NIST.
HELM — github.com/stanford-crfm/helm — ~2k ★, Stanford holistic eval.
Chatbot Arena — LMSys — crowdsourced head-to-head rankings.
Vellum — LLM Leaderboard — the most current production-facing leaderboard.
Hamel Husain — Field guide to rapidly improving AI products — the workflow that ties evals + error analysis + fixes.
Eugene Yan — LLM evaluators (a survey) — the meta-paper on LLM-as-judge biases.

Videos

A Practitioner's Guide to LLM Evals — Hamel Husain · 1 h 06 min — the talk version of his canonical blog post; watch this first.
LLM Evaluation Explained — Maven AI Eng course · 1 h 21 min — the workshop-format walkthrough.
Evaluating LLM Systems with Eugene Yan — Eugene Yan · 50 min — from the patterns-for-LLM-systems author.
LangSmith Evaluations Deep Dive — LangChain · 33 min — hands-on with the most popular hosted eval tool.
Building Reliable AI Systems: Evals — Shreya Shankar (Stanford MLSys) · 1 h — the academic-rigour view on eval systems.

LeetCode — Evaluate Division

Link: https://leetcode.com/problems/evaluate-division/
Difficulty: Medium
Why this problem: Build a graph of equations and answer arbitrary queries — exact shape of evaluating one model output against a chain of judge criteria.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Explain why open-ended LLM output makes eval qualitatively harder than classical ML.
List the 4 layers of eval (unit, behavioural, capability, user-outcome).
Design a pairwise LLM-judge prompt with bias mitigations.
Pick the right RAGAS metric (faithfulness vs context_precision vs answer_relevancy) for a given failure.
Sketch the 5-step "eval flywheel" loop.
Solve evaluate-division — graph traversal of weighted edges; mirrors chained eval criteria.
