---
title: "The Evaluation Guidebook"
subtitle: "Understanding the tips and tricks of evaluating an LLM in 2025"
description: "Understanding the tips and tricks of evaluating an LLM in 2025"
authors:
- name: "Clémentine Fourrier"
  url: "https://huggingface.co/clefourrier"
  affiliations: [1]
affiliations:
- name: "Hugging Face"
  url: "https://huggingface.co"
published: "Dec. 01, 2025"
tags:
- research
- evaluation
tableOfContentsAutoCollapse: true
---
import Note from "../components/Note.astro";
import Sidenote from "../components/Sidenote.astro";
import HtmlEmbed from "../components/HtmlEmbed.astro";
import Intro from "./chapters/intro.mdx";
import DesigningAutomaticEvaluation from "./chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx";
import PickingYourEval from "./chapters/general-knowledge/picking-your-evaluation.mdx";
import EvalsIn2025 from "./chapters/general-knowledge/2025-evaluations-for-useful-models.mdx";
import TroubleshootingInference from "./chapters/troubleshooting/troubleshooting-inference.mdx";
import TroubleshootingReproducibility from "./chapters/troubleshooting/troubleshooting-reproducibility.mdx";
import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-inference-and-evaluation.mdx";
<Intro />
## LLM basics to understand evaluation
Now that you have an idea of why evaluation matters and how it's done, let's look at how we prompt models and get answers out of them in order to evaluate them. If you have already run evaluations, you can most likely skim this section.
<ModelInferenceAndEvaluation />
## Evaluating with existing benchmarks
### Benchmarks to know in 2025
<EvalsIn2025 />
### Selecting good benchmarks automatically for model training
<PickingYourEval />
### Understanding what's in there
No matter how you selected your initial datasets, the most important step is, and will always be, to look at the data: what the dataset contains, what the model generates, and the scores it gets. In the end, that's the only way to see whether your evaluations are actually relevant to your specific use case.
You want to study the following.
#### Data creation process
- **Who created the actual samples?**
Ideally, you want a dataset created by experts; the next tiers are paid annotators, then crowdsourced, then synthetic, then MTurked. You also want to look for a data card, where you'll find annotator demographics - this can be important to understand the dataset's language diversity or potential cultural biases.
- **Were they all examined by other annotators or by the authors?**
You want to know if the inter-annotator agreement on samples is high (= are annotators in agreement? see the sketch after this list) and/or if the full dataset has been reviewed by the authors.
This is especially important for datasets created with the help of underpaid annotators who are usually not native speakers of your target language (think AWS Mechanical Turk), as you might otherwise find typos, grammatical errors, or nonsensical answers.
- **Were the annotators provided with clear data creation guidelines?**
In other words, is your dataset consistent?
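
If the dataset ships a doubly-annotated subset, an agreement check is cheap to run. Below is a minimal sketch using Cohen's kappa from scikit-learn; the two label lists are illustrative placeholders, not real annotations.

```python
# Minimal sketch: inter-annotator agreement via Cohen's kappa (scikit-learn).
# The two label lists below are illustrative placeholders, not real annotations.
from sklearn.metrics import cohen_kappa_score

annotator_a = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
annotator_b = ["yes", "no", "yes", "no", "no", "no", "yes", "yes"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # above ~0.8 is usually read as strong agreement
```
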
#### Samples inspection
Take 50 random samples and manually inspect them - and I mean do it yourself, not "prompt an LLM to find unusual stuff in the data for you".
First, you want to check the content quality. Are the prompts clear and unambiguous? Are the answers correct? (*Eg: TriviaQA contains several gold answers (the aliases field) per question, sometimes conflicting.*) Is information missing? (*Eg: MMLU is missing reference schematics in a number of questions.*) Keep in mind that a dataset being a standard doesn't make it a good one - and such issues persist precisely because most people skip this step.
Then, you want to check for relevance to your task. Are these questions the kind of questions you want to evaluate an LLM on? Are these examples relevant to your use case?
You might also want to check the samples' consistency (especially if you're planning on using few-shot examples or computing aggregated statistics): do all samples have the same number of choices if it's a multiple-choice evaluation? Is the spacing consistent before and after the prompt? If your evaluation comes with an additional environment, you ideally want to run it yourself to understand what the tool calls look like.
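
As a concrete illustration, here is a minimal sketch of such an inspection loop with the `datasets` library, assuming an MMLU-style benchmark with `question`, `choices`, and `answer` fields (adapt the dataset name and field names to whatever you are inspecting):

```python
# Minimal sketch: pull 50 random samples and eyeball them yourself.
# Assumes an MMLU-style dataset with "question", "choices" and "answer" fields.
from datasets import load_dataset

ds = load_dataset("cais/mmlu", "all", split="test")
samples = ds.shuffle(seed=42).select(range(50))

for i, sample in enumerate(samples):
    print(f"--- sample {i} ---")
    print(sample["question"])
    for j, choice in enumerate(sample["choices"]):
        print(f"  {chr(65 + j)}. {choice}")
    print(f"  gold: {sample['answer']}")

# Quick consistency check: do all inspected samples have the same number of choices?
n_choices = {len(s["choices"]) for s in samples}
print(f"distinct choice counts: {n_choices}")
```
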
Lastly, you also want to quickly check how many samples are present (to make sure results are statistically significant - 100 samples is usually a minimum for automatic benchmarks).
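
To see why, a rough back-of-the-envelope check helps: the 95% confidence half-width of an accuracy score measured on N samples is about 1.96 * sqrt(p(1-p)/N), so small benchmarks give very noisy numbers. A quick sketch:

```python
# Minimal sketch: how noisy is an accuracy score measured on N samples?
# 95% confidence half-width of a binomial proportion p: 1.96 * sqrt(p * (1 - p) / N)
import math

def accuracy_ci_halfwidth(p: float, n: int) -> float:
    return 1.96 * math.sqrt(p * (1 - p) / n)

for n in (50, 100, 1000):
    halfwidth = 100 * accuracy_ci_halfwidth(0.7, n)
    print(f"N={n:>4}: a measured accuracy of 70% is really 70% +/- {halfwidth:.1f} points")
```
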
TODO: ADD A VIEWER
#### Task and metrics
You want to check which metrics are used: are they automatic, functional, or relying on a model judge? The answer will change the cost of running evaluations for you, as well as their reproducibility and the type of bias to expect.
The best (but rarest) metrics are functional or based on rule-based verifiers (though beware of pass/fail tests for coding models and code evaluations, as recent LLMs have become very good at overwriting globals to 'cheat' on such tests, especially in languages like Python where it's easy to mess with variable scope).
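
For reference, a rule-based verifier can be as simple as a normalized string comparison between the generation and the gold answer; the normalization rules below are a minimal, illustrative sketch, not an exhaustive implementation:

```python
# Minimal sketch of a rule-based verifier: normalize both strings before comparing,
# instead of calling a model judge. Normalization rules here are illustrative only.
import re

def normalize(text: str) -> str:
    text = text.strip().lower()
    text = re.sub(r"[^\w\s]", " ", text)      # replace punctuation with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse whitespace
    return text

def exact_match(prediction: str, gold: str) -> bool:
    return normalize(prediction) == normalize(gold)

print(exact_match(" The answer is: 42. ", "the answer is 42"))  # True
```
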
### So, you can't reproduce reported model scores?
<TroubleshootingReproducibility />
## Creating your own evaluation
<DesigningAutomaticEvaluation />
<TroubleshootingInference />