|
|
--- |
|
|
title: "Intro" |
|
|
--- |
|
|
|
|
|
import HtmlEmbed from "../../components/HtmlEmbed.astro"; |
|
|
import Note from "../../components/Note.astro"; |
|
|
import Sidenote from "../../components/Sidenote.astro"; |
|
|
import Quote from "../../components/Quote.astro"; |
|
|
|
|
|
## What is model evaluation about? |
|
|
|
|
|
As you navigate the world of LLMs—whether you're building them, choosing one for a project, or simply trying to keep up with the field—one question keeps coming back:
|
|
|
|
|
<Quote> |
|
|
How can one know if a model is *good at* anything? |
|
|
</Quote> |
|
|
|
|
|
The answer is (unsurprisingly, given the topic of this blog) evaluation! It's the main tool we have to estimate what models can and cannot do.
|
|
|
|
|
But what is it, really? And what can it really tell you?
|
|
|
|
|
This guide is here to help you understand evaluation: what it can and cannot do and when to trust different approaches (including their limitations and biases!), how to select benchmarks when evaluating a model (and which ones are relevant in 2025), and how to design your own evaluation if you so wish.
|
|
|
|
|
Throughout the guide, we'll cover each of these topics in turn.
|
|
|
|
|
Before we dive into the details, let's start with the basics: why do we evaluate models in the first place, and how?
|
|
|
|
|
### Why do we do LLM evaluation? |
|
|
|
|
|
There are three main reasons why people do evaluation. They tend to be conflated together, but are actually **very different**, and each answers a separate question.
|
|
|
|
|
<HtmlEmbed src="d3-intro-boxes.html" title="Evaluation purposes" /> |
|
|
|
|
|
#### Is this model training correctly? |
|
|
|
|
|
**Non-regression testing** is a concept from the software industry, used to make sure small changes have not broken the overall system. The idea is the following: when you add a new feature to your software, or fix a problem in the code base, have you broken something else? That's what non-regression tests are there to catch.
|
|
|
|
|
When you select a setup to train models, you want to test something very similar, and make sure that your changes (choosing different training data, architecture, parameters, etc) have not "broken" the performance expected for a model with these properties.
|
|
|
|
|
In ML, experiments which test the impact of small changes on model performance are referred to as **ablations**, and at their core is a good set of evaluations: ones with a strong enough signal, yet relatively cheap to run, as you'll be running a lot of them.
|
|
|
|
|
For ablations, you also need to look at both **trajectories** (is the performance better now than when training started?) and score **ranges** (is the performance within what's expected for a model of this size and training budget?).
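
To make this concrete, here is a minimal sketch in Python of what checking both the trajectory and the range could look like; the benchmark scores, checkpoint names and expected range are all made up for illustration.

```python
# Minimal sketch: checking an ablation run against a trajectory and an expected range.
# The scores, checkpoint names and thresholds below are made up for illustration.

checkpoint_scores = {  # benchmark accuracy at successive training checkpoints
    "step_1000": 0.26,
    "step_5000": 0.31,
    "step_10000": 0.38,
}

expected_range = (0.30, 0.45)  # what we'd expect for a model of this size/data budget

scores = list(checkpoint_scores.values())

# Trajectory: is performance improving compared to the start of training?
improving = scores[-1] > scores[0]

# Range: is the latest score within what we expect for this setup?
low, high = expected_range
in_range = low <= scores[-1] <= high

print(f"improving: {improving}, within expected range: {in_range}")
if not (improving and in_range):
    print("Warning: this ablation may have 'broken' the training setup.")
```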
|
|
|
|
|
#### Which model is the best on \<task\>? |
|
|
|
|
|
The next role of evaluation is simply to sort models to find and select the best model for a given use case. |
|
|
|
|
|
For common topics like math, code, or knowledge, there are likely several leaderboards comparing and ranking models on different datasets, and you usually just have to test the top contenders to find the best model for you (if the top contenders are not working for your use case, it's unlikely that models ranked below them will do much better).
|
|
|
|
|
You might want to run the evaluation and comparison yourself (by reusing existing benchmarks) to analyse the models' successes and failures in more detail, which we will cover below.
|
|
|
|
|
<Sidenote> |
|
|
In [their paper](https://arxiv.org/pdf/2404.02112) about lessons learned on benchmarking and dataset design from the ImageNet era, the authors argue that, since scores are susceptible to instability, the only robust way to evaluate models is through rankings, and more specifically by finding broad groups of evaluations which provide consistent and stable rankings. I believe looking for ranking stability is indeed an extremely interesting approach to model benchmarking, as we have shown that LLM *scores* on automated benchmarks are extremely susceptible to [minute changes in prompting](https://huggingface.co/blog/evaluation-structured-outputs), and that human evaluations are not more consistent - whereas *rankings* are actually more stable when robust evaluation methods are used.
|
|
</Sidenote> |
|
|
|
|
|
For less common topics, you might even need to think about designing your own evaluations, which we cover in the last section.
|
|
|
|
|
<Note title="Small caveat"> |
|
|
Despite often grandiose claims, for any complex capability, we cannot at the moment just say "this model is the best at this", but should instead say **"this model is the best on this task that we hope is a good proxy for this capability, without any guarantee"**. |
|
|
</Note> |
|
|
|
|
|
#### When will we finally reach AGI? |
|
|
|
|
|
We sorely lack good definitions and frameworks for what intelligence means for machine learning models, and for how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). This difficulty is not specific to machine learning! In human and animal studies, intelligence is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with good reason.
|
|
|
|
|
There are, however, some issues with focusing on intelligence as a target.

1. Intelligence tends to end up being a moving target: every time we reach a capability which was thought to be human-specific, we redefine the term.
2. Our current frameworks were designed with humans (or animals) in mind, and will most likely not transfer well to models, as the underlying behaviors and assumptions are not the same.
3. It is also a rather useless target: we should aim at making models good at specific, well-defined, purposeful and useful tasks (think accounting, reporting, etc.) instead of chasing AGI for its own sake.
|
|
|
|
|
### So how do people evaluate models, then? |
|
|
|
|
|
To my knowledge, people currently use three main approaches to evaluation: automated benchmarking, using humans as judges, and using models as judges. Each approach has its own reason for existing, uses, and limitations.
|
|
|
|
|
#### Automated benchmarks |
|
|
|
|
|
Automated benchmarking usually works the following way: you'd like to know how well your model performs on *something*. This *something* can be a well-defined, concrete **task** (can my model classify spam emails?) or a more abstract and general **capability** (is my model good at math?).
|
|
|
|
|
From this, you construct an evaluation, usually made of two things: |
|
|
- a collection of *samples*, given as input to the model to see what comes out as output, sometimes coupled with a reference (called gold) to compare with. Samples are usually designed to try to emulate what you want to test the model on: for example, if you are looking at toxicity classification, you create a dataset of toxic and non-toxic sentences, try to include some hard edge cases, etc.
|
|
- a *metric*, which is a way to compute a score for the model: for example, how accurately your model classifies toxicity (a well-classified sample scores 1, a badly classified one 0). A minimal sketch combining these two pieces follows below.
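
Putting these two pieces together, a toy automated benchmark can be as small as a handful of (input, gold) pairs plus an accuracy metric. The sketch below is purely illustrative: the samples are invented, and `my_model_classify` is a placeholder for whatever model call you actually use.

```python
# Toy automated benchmark: samples with gold labels + an accuracy metric.
# The samples are made up, and `my_model_classify` is a placeholder for your model call.

samples = [
    {"text": "Have a wonderful day!", "gold": "non_toxic"},
    {"text": "You are an absolute idiot.", "gold": "toxic"},
    {"text": "I respectfully disagree with your point.", "gold": "non_toxic"},  # harder edge case
]

def my_model_classify(text: str) -> str:
    """Placeholder: replace with a real call to the model under evaluation."""
    return "toxic" if "idiot" in text.lower() else "non_toxic"

def accuracy(samples, predict) -> float:
    # 1 point for each well-classified sample, 0 otherwise, averaged over the set
    correct = sum(1 for s in samples if predict(s["text"]) == s["gold"])
    return correct / len(samples)

print(f"accuracy: {accuracy(samples, my_model_classify):.2f}")
```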
|
|
|
|
|
This is more interesting to do on data that was not included in the model training set, because you want to test if it **generalizes** well. You don't want a model which is only good at "predicting" outputs for things it has already seen.
|
|
|
|
|
<Note> |
|
|
A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. In less extreme cases, you still want to test if your model is able to generalize to data patterns which were not in the training set.
|
|
</Note> |
|
|
|
|
|
This works quite well for very well-defined tasks, where performance is "easy" to assess and measure: when you are literally testing your model on classification, you can say "the model correctly classified n% of these samples". For LLM benchmarks, some issues can arise, such as models [favoring specific choices based on the order in which they have been presented for multi-choice evaluations](https://arxiv.org/abs/2309.03882), and generative evaluations relying on normalisations which can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
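
One cheap sanity check for the ordering issue mentioned above is to re-run a multi-choice sample with its options shuffled and see whether the model keeps picking the same answer. Here is a rough sketch, where `pick_choice` is a hypothetical wrapper around your model that returns the text of the option it selects.

```python
import random

# Sketch: probing order sensitivity on a multi-choice sample.
# `pick_choice` is a hypothetical wrapper returning the text of the option the model picks.

def order_sensitivity(question: str, choices: list[str], pick_choice, n_shuffles: int = 10) -> float:
    """Fraction of shuffled orderings where the model picks the same option text."""
    reference = pick_choice(question, choices)
    same = 0
    for _ in range(n_shuffles):
        shuffled = random.sample(choices, k=len(choices))
        if pick_choice(question, shuffled) == reference:
            same += 1
    return same / n_shuffles

if __name__ == "__main__":
    # Dummy model that always picks the first option: maximally order-biased.
    dummy = lambda question, choices: choices[0]
    consistency = order_sensitivity("2 + 2 = ?", ["4", "5", "22"], dummy)
    print(f"answer consistency across orderings: {consistency:.0%}")  # low for an order-biased model
```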
|
|
|
|
|
For **capabilities** however, it's much harder: what does "good at math" even mean? Good at arithmetic? At logic? At reasoning about abstract mathematical concepts? It's difficult to decompose such a capability into the well-defined, precise tasks an automated benchmark needs.
|
|
|
|
|
In this case, people tend to do more "holistic" evaluations: instead of decomposing the capability into actual tasks, they assume that performance on general samples will be a **good proxy** for what we aim to measure. For example, GSM8K is made of actual grade school math problems, which require a whole set of capabilities to solve. This also means that both failure and success are very hard to interpret. Some capabilities or topics, such as "is this model good at writing poetry?" or "are the model outputs helpful?", are even harder to evaluate with automatic metrics - and at the same time, models now seem to have more and more **generalist** capabilities, so we need to evaluate their abilities in a broader manner.
|
|
|
|
|
<Sidenote> For example, there was a debate in the scientific community as to whether LLMs [can draw](https://arxiv.org/abs/2303.12712) unicorns [or not](https://twitter.com/DimitrisPapail/status/1719119242186871275). A year later, seems like most can! </Sidenote> |
|
|
|
|
|
Automatic benchmarks also tend to have another problem: once they are published publicly in plain text, they are very likely to end up (often accidentally) in the training datasets of models. Some benchmark creators, like the authors of BigBench, have tried to mitigate this by adding a *canary string* (a very specific combination of characters) for people to look for, and remove from training sets, but not everybody is aware of the mechanism nor trying to do this removal. There is also a non-negligible quantity of benchmarks, so looking for accidental copies of absolutely all of them in training data is costly. Other options include providing benchmarks in an [**encrypted** form](https://arxiv.org/pdf/2309.16575), or behind a [**gating** system](https://huggingface.co/datasets/Idavidrein/gpqa). However, when evaluating closed models (that are behind APIs), there is no guarantee that the prompts you provide won't be later used internally for training or fine-tuning.
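
As an illustration, filtering a training corpus against canary strings can be as simple as a substring check over each document. The sketch below uses a made-up placeholder value rather than any benchmark's real canary.

```python
# Sketch: dropping training documents that contain a known canary string.
# The canary value below is a made-up placeholder; real benchmarks (e.g. BIG-bench) publish their own.

CANARY_STRINGS = [
    "canary GUID 00000000-0000-0000-0000-000000000000",  # placeholder, not a real benchmark canary
]

def is_contaminated(document: str) -> bool:
    """True if the document carries any known canary string."""
    return any(canary in document for canary in CANARY_STRINGS)

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only documents that do not contain a canary."""
    return [doc for doc in documents if not is_contaminated(doc)]

corpus = [
    "a perfectly normal web page about gardening",
    "benchmark sample ... canary GUID 00000000-0000-0000-0000-000000000000",
]
print(len(filter_corpus(corpus)))  # 1: the document carrying the canary is dropped
```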
|
|
|
|
|
<Note> |
|
|
The case where an evaluation dataset ends up in the training set is called **contamination**, and a model which was contaminated will have a high benchmark performance that does not generalize well to the underlying task (an extensive description of contamination can be found [here](https://aclanthology.org/2023.findings-emnlp.722/), and here is a fun way to [detect it](https://arxiv.org/abs/2311.06233)). A way to address contamination is to run [**dynamic benchmarks**](https://arxiv.org/abs/2104.14337) (evaluations on datasets which are regularly refreshed to provide scores on systematically unseen new data), but this approach is costly in the long term.
|
|
</Note> |
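
Beyond canaries, a common (if imperfect) heuristic used to spot contamination after the fact is to look for long n-gram overlaps between benchmark samples and training documents. A rough sketch, with the n-gram length chosen arbitrarily here:

```python
# Rough contamination heuristic: flag benchmark samples sharing a long n-gram with training text.
# The n-gram size (8) is an arbitrary choice for illustration; real pipelines tune this.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_samples: list[str], training_docs: list[str], n: int = 8) -> list[str]:
    """Return the benchmark samples which share at least one n-gram with the training documents."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [sample for sample in benchmark_samples if ngrams(sample, n) & train_ngrams]
```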
|
|
|
|
|
#### Human as a judge |
|
|
|
|
|
A solution to both the contamination problem and the need for more open-ended evaluation is to ask humans to evaluate model outputs.
|
|
|
|
|
This is usually done by asking humans to first prompt models, then grade a model answer or rank several outputs according to guidelines. Using humans as judges makes it possible to study more complex tasks, with more flexibility than automated metrics. It also prevents most contamination cases, since the written prompts are (hopefully) new. Lastly, it correlates well with human preference, since this is literally what is being evaluated!
|
|
|
|
|
Different approaches exist to evaluate models with humans in the loop. |
|
|
|
|
|
**Vibes-checks** is the name given to manual evaluations done individually by some members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on many use cases, which range from coding to the quality of written smut. These checks mostly constitute anecdotal evidence, and tend to be quite sensitive to confirmation bias (people tend to find what they are looking for).
|
|
|
|
|
Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one better than the other. Votes are then aggregated in an Elo ranking (a ranking method originally designed for chess matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce consistent grading across many community members following (at best) broad guidelines, even if the sheer scale of the votes smooths out some of the noise.
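
As a refresher on how such pairwise votes turn into a ranking, here is a minimal sketch of an Elo update for a single vote; the K-factor and starting ratings are generic textbook defaults, not the settings of any particular arena.

```python
# Minimal Elo update for one arena-style pairwise vote.
# K and the starting rating of 1000 are illustrative defaults, not any arena's actual settings.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32) -> tuple[float, float]:
    """Return the updated ratings after one vote between models A and B."""
    expected_a = expected_score(rating_a, rating_b)
    expected_b = 1 - expected_a
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b

model_a, model_b = 1000.0, 1000.0
model_a, model_b = elo_update(model_a, model_b, a_wins=True)  # a user preferred model A
print(round(model_a), round(model_b))  # 1016 984
```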
|
|
|
|
|
The last approach is **systematic annotations**, where you provide extremely specific guidelines to paid, selected annotators, in order to remove as much of the subjectivity bias as possible (this is the approach used by most data annotation companies). However, it can get extremely expensive fast, as you have to keep doing evaluations in a continuous and non-automatic manner for every new model you want to evaluate, and it can still fall prey to human bias (this [study](https://arxiv.org/abs/2205.00501) showed that people with different identities tend to rate model answer toxicity very differently).
|
|
|
|
|
However, humans can be biased: for example, they tend to estimate the quality of answers [based on first impressions](https://arxiv.org/pdf/2309.16349) rather than actual factuality or faithfulness; they are very sensitive to tone, and underestimate the number of factual or logical errors in an assertively written answer. These are only some of the many biases that human judges can fall prey to (as we'll see later in this guide).
|
|
|
|
|
#### Model as a judge |
|
|
|
|
|
To mitigate the cost of human annotators, some people have looked into using models or derived artifacts (preferably aligned with human preferences) to evaluate models' outputs.
|
|
|
|
|
Two approaches exist for grading: using [generalist, high capability models](https://arxiv.org/abs/2306.05685v4), or using [small specialist models](https://arxiv.org/pdf/2405.01535) trained specifically on preference data to discriminate between better and worse answers.
|
|
|
|
|
Models used as judges have several strong limitations: they are as biased as humans, but along different axes (they can struggle to grade answers from models more capable than themselves, tend to prefer their own outputs as well as the first answer presented to them, and often fail to provide consistent score ranges).
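
A common mitigation for the position bias mentioned above is to ask the judge for the same comparison twice, with the answers swapped, and only keep verdicts that agree. Below is a minimal sketch, where the prompt wording is illustrative and `call_judge` is a placeholder for a real call to whatever judge model you use.

```python
# Sketch: mitigating judge position bias by asking for the same comparison in both orders.
# `call_judge` is a placeholder for a real call to a judge model returning "A", "B" or "tie".

PROMPT = (
    "You are comparing two answers to the same question.\n"
    "Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
    "Reply with exactly one of: A, B, tie."
)

def call_judge(prompt: str) -> str:
    raise NotImplementedError("replace with an actual judge model call")

def judge_both_orders(question: str, answer_1: str, answer_2: str) -> str:
    """Return 'answer_1', 'answer_2', or 'inconsistent/tie' if the verdict flips with the order."""
    first = call_judge(PROMPT.format(question=question, answer_a=answer_1, answer_b=answer_2))
    second = call_judge(PROMPT.format(question=question, answer_a=answer_2, answer_b=answer_1))
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "inconsistent/tie"  # the verdict flipped with the order, or the judge called a tie
```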
|
|
|
|
|
My main personal gripe with using models as judges is that they introduce very subtle and un-interpretable bias in the answer selection. I feel that, much like crossbreeding too much in genetics studies ends up producing dysfunctional animals or plants, using LLMs to select and train LLMs is just as likely to introduce minute changes that will have bigger repercussions a couple of generations down the line. I believe this type of bias is less likely to occur in smaller and more specialized models used as judges (such as toxicity classifiers), but this remains to be rigorously tested and proven.
|
|
|