|
|
--- |
|
|
title: "Intro" |
|
|
--- |
|
|
|
|
|
import HtmlEmbed from "../../components/HtmlEmbed.astro"; |
|
|
import Note from "../../components/Note.astro"; |
|
|
import Sidenote from "../../components/Sidenote.astro"; |
|
|
import Quote from "../../components/Quote.astro"; |
|
|
|
|
|
## What is model evaluation about? |
|
|
|
|
|
As you navigate the world of LLMs—whether you're building them, choosing one for a project, or simply trying to keep up with the field—one question keeps coming back:
|
|
|
|
|
<Quote> |
|
|
How can one know if a model is *good at* anything? |
|
|
</Quote> |
|
|
|
|
|
The answer is (unsurprisingly, given the topic of this blog) evaluation! It's the main tool we have to estimate what models can and cannot do.
|
|
|
|
|
But what is it, really? And what can it really tell you?
|
|
|
|
|
This guide is here to help you understand evaluation: what it can and cannot do and when to trust different approaches (including their limitations and biases!), how to select benchmarks when evaluating a model (and which ones are relevant in 2025), and how to design your own evaluation if you so wish.
|
|
|
|
|
Throughout the guide, we'll cover each of these topics in turn.
|
|
|
|
|
Before we dive into the details, let's start with the basics: why do we evaluate models in the first place, and how?
|
|
|
|
|
### Why do we do LLM evaluation? |
|
|
|
|
|
There are three main reasons why people do evaluation. They tend to be conflated together, but are actually **very different**, and each answers a separate question.
|
|
|
|
|
<HtmlEmbed src="d3-intro-boxes.html" title="Evaluation purposes" /> |
|
|
|
|
|
#### Is this model training correctly? |
|
|
|
|
|
**Non-regression testing** is a concept from the software industry, used to make sure small changes have not broken the overall system. The idea is the following: when you add a new feature to your software, or fix a problem in the code base, have you broken something else? That's what non-regression tests are there to catch.
|
|
|
|
|
When you select a setup to train models, you want to test something very similar, and make sure that your changes (choosing different training data, architecture, parameters, etc) have not "broken" the performance expected for a model with these properties.
|
|
|
|
|
In ML, experiments which test the impact of small changes on model performance are referred to as **ablations**, and at their core is a good set of evaluations: ones with a strong enough signal, yet relatively cheap to run, as you'll be running a lot of them.
|
|
|
|
|
For ablations, you also need to look at both **trajectories** (is the performance better now than when training started?) and score **ranges** (is the performance within what's expected for a model of this size and training budget?).
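
To make this concrete, here is a minimal sketch in Python of what checking both the trajectory and the range could look like; the benchmark scores, checkpoint names and expected range are all made up for illustration.

```python
# Minimal sketch: checking an ablation run against a trajectory and an expected range.
# The scores, checkpoint names and thresholds below are made up for illustration.

checkpoint_scores = {  # benchmark accuracy at successive training checkpoints
    "step_1000": 0.26,
    "step_5000": 0.31,
    "step_10000": 0.38,
}

expected_range = (0.30, 0.45)  # what we'd expect for a model of this size/data budget

scores = list(checkpoint_scores.values())

# Trajectory: is performance improving compared to the start of training?
improving = scores[-1] > scores[0]

# Range: is the latest score within what we expect for this setup?
low, high = expected_range
in_range = low <= scores[-1] <= high

print(f"improving: {improving}, within expected range: {in_range}")
if not (improving and in_range):
    print("Warning: this ablation may have 'broken' the training setup.")
```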
|
|
|
|
|
#### Which model is the best on \<task\>? |
|
|
|
|
|
The next role of evaluation is simply to sort models to find and select the best model for a given use case. |
|
|
|
|
|
For common topics like math, code, or knowledge, there are likely several leaderboards comparing and ranking models on different datasets, and you usually just have to test the top contenders to find the best model for you (if the top contenders are not working for your use case, it's unlikely that models ranked below them will do much better).
|
|
|
|
|
You might want to run the evaluation and comparison yourself (by reusing existing benchmarks) to analyse the models' successes and failures in more detail, which we will cover below.
|
|
|
|
|
<Sidenote> |
|
|
In [their paper](https://arxiv.org/pdf/2404.02112) about lessons learned on benchmarking and dataset design from the ImageNet era, the authors argue that, since scores are susceptible to instability, the only robust way to evaluate models is through rankings, and more specifically by finding broad groups of evaluations which provide consistent and stable rankings. I believe looking for ranking stability is indeed an extremely interesting approach to model benchmarking, as we have shown that LLM *scores* on automated benchmarks are extremely susceptible to [minute changes in prompting](https://huggingface.co/blog/evaluation-structured-outputs), and that human evaluations are not more consistent - whereas *rankings* are actually more stable when robust evaluation methods are used.
|
|
</Sidenote> |
|
|
|
|
|
For less common topics, you might even need to think about designing your own evaluations, which we cover in the last section.
|
|
|
|
|
<Note title="Small caveat"> |
|
|
Despite often grandiose claims, for any complex capability, we cannot at the moment just say "this model is the best at this", but should instead say **"this model is the best on this task that we hope is a good proxy for this capability, without any guarantee"**. |
|
|
</Note> |
|
|
|
|
|
#### When will we finally reach AGI? |
|
|
|
|
|
We sorely lack good definitions and frameworks for what intelligence means for machine learning models, and for how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). This difficulty is not specific to machine learning! In human and animal studies, intelligence is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with good reason.
|
|
|
|
|
There are, however, some issues with focusing on intelligence as a target.

1. Intelligence tends to end up being a moving target: every time we reach a capability which was thought to be human-specific, we redefine the term.
2. Our current frameworks were designed with humans (or animals) in mind, and will most likely not transfer well to models, as the underlying behaviors and assumptions are not the same.
3. It is also a rather useless target: we should aim at making models good at specific, well-defined, purposeful and useful tasks (think accounting, reporting, etc.) instead of chasing AGI for its own sake.
|
|
|
|
|
### So how do people evaluate models, then? |
|
|
|
|
|
To my knowledge, people currently use three main approaches to evaluation: automated benchmarking, using humans as judges, and using models as judges. Each approach has its own reason for existing, uses, and limitations.
|
|
|
|
|
#### Automated benchmarks |
|
|
|
|
|
Automated benchmarking usually works the following way: you'd like to know how well your model performs on *something*. This *something* can be a well-defined, concrete **task** (can my model classify spam emails?) or a more abstract and general **capability** (is my model good at math?).
|
|
|
|
|
From this, you construct an evaluation, usually made of two things: |
|
|
- a collection of *samples*, given as input to the model to see what comes out as output, sometimes coupled with a reference (called gold) to compare with. Samples are usually designed to try to emulate what you want to test the model on: for example, if you are looking at toxicity classification, you create a dataset of toxic and non-toxic sentences, try to include some hard edge cases, etc.
|
|
- a *metric*, which is a way to compute a score for the model: for example, how accurately your model classifies toxicity (a well-classified sample scores 1, a badly classified one 0). A minimal sketch combining these two pieces follows below.
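
Putting these two pieces together, a toy automated benchmark can be as small as a handful of (input, gold) pairs plus an accuracy metric. The sketch below is purely illustrative: the samples are invented, and `my_model_classify` is a placeholder for whatever model call you actually use.

```python
# Toy automated benchmark: samples with gold labels + an accuracy metric.
# The samples are made up, and `my_model_classify` is a placeholder for your model call.

samples = [
    {"text": "Have a wonderful day!", "gold": "non_toxic"},
    {"text": "You are an absolute idiot.", "gold": "toxic"},
    {"text": "I respectfully disagree with your point.", "gold": "non_toxic"},  # harder edge case
]

def my_model_classify(text: str) -> str:
    """Placeholder: replace with a real call to the model under evaluation."""
    return "toxic" if "idiot" in text.lower() else "non_toxic"

def accuracy(samples, predict) -> float:
    # 1 point for each well-classified sample, 0 otherwise, averaged over the set
    correct = sum(1 for s in samples if predict(s["text"]) == s["gold"])
    return correct / len(samples)

print(f"accuracy: {accuracy(samples, my_model_classify):.2f}")
```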
|
|
|
|
|
This is more interesting to do on data that was not included in the model training set, because you want to test if it **generalizes** well. You don't want a model which is only good at "predicting" outputs for things it has already seen.
|
|
|
|
|
<Note> |
|
|
A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. In less extreme cases, you still want to test if your model is able to generalize to data patterns which were not in the training set.
|
|
</Note> |
|
|
|
|
|
This works quite well for very well-defined tasks, where performance is "easy" to assess and measure: when you are literally testing your model on classification, you can say "the model correctly classified n% of these samples". For LLM benchmarks, some issues can arise, such as models [favoring specific choices based on the order in which they have been presented for multi-choice evaluations](https://arxiv.org/abs/2309.03882), and generative evaluations relying on normalisations which can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
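
One cheap sanity check for the ordering issue mentioned above is to re-run a multi-choice sample with its options shuffled and see whether the model keeps picking the same answer. Here is a rough sketch, where `pick_choice` is a hypothetical wrapper around your model that returns the text of the option it selects.

```python
import random

# Sketch: probing order sensitivity on a multi-choice sample.
# `pick_choice` is a hypothetical wrapper returning the text of the option the model picks.

def order_sensitivity(question: str, choices: list[str], pick_choice, n_shuffles: int = 10) -> float:
    """Fraction of shuffled orderings where the model picks the same option text."""
    reference = pick_choice(question, choices)
    same = 0
    for _ in range(n_shuffles):
        shuffled = random.sample(choices, k=len(choices))
        if pick_choice(question, shuffled) == reference:
            same += 1
    return same / n_shuffles

if __name__ == "__main__":
    # Dummy model that always picks the first option: maximally order-biased.
    dummy = lambda question, choices: choices[0]
    consistency = order_sensitivity("2 + 2 = ?", ["4", "5", "22"], dummy)
    print(f"answer consistency across orderings: {consistency:.0%}")  # low for an order-biased model
```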
|
|
|
|
|
For **capabilities** however, it's much harder: what does "good at math" even mean? Good at arithmetic? At logic? At reasoning about abstract mathematical concepts? It's difficult to decompose such a capability into the well-defined, precise tasks an automated benchmark needs.
|
|
|
|
|
In this case, people tend to do more "holistic" evaluations: instead of decomposing the capability into actual tasks, they assume that performance on general samples will be a **good proxy** for what we aim to measure. For example, GSM8K is made of actual grade school math problems, which require a whole set of capabilities to solve. This also means that both failure and success are very hard to interpret. Some capabilities or topics, such as "is this model good at writing poetry?" or "are the model outputs helpful?", are even harder to evaluate with automatic metrics - and at the same time, models now seem to have more and more **generalist** capabilities, so we need to evaluate their abilities in a broader manner.
|
|
|
|
|
<Sidenote> For example, there was a debate in the scientific community as to whether LLMs [can draw](https://arxiv.org/abs/2303.12712) unicorns [or not](https://twitter.com/DimitrisPapail/status/1719119242186871275). A year later, seems like most can! </Sidenote> |
|
|
|
|
|
Automatic benchmarks also tend to have another problem: once they are published publicly in plain text, they are very likely to end up (often accidentally) in the training datasets of models. Some benchmark creators, like the authors of BigBench, have tried to mitigate this by adding a *canary string* (a very specific combination of characters) for people to look for, and remove from training sets, but not everybody is aware of the mechanism nor trying to do this removal. There is also a non-negligible quantity of benchmarks, so looking for accidental copies of absolutely all of them in training data is costly. Other options include providing benchmarks in an [**encrypted** form](https://arxiv.org/pdf/2309.16575), or behind a [**gating** system](https://huggingface.co/datasets/Idavidrein/gpqa). However, when evaluating closed models (that are behind APIs), there is no guarantee that the prompts you provide won't be later used internally for training or fine-tuning.
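
As an illustration, filtering a training corpus against canary strings can be as simple as a substring check over each document. The sketch below uses a made-up placeholder value rather than any benchmark's real canary.

```python
# Sketch: dropping training documents that contain a known canary string.
# The canary value below is a made-up placeholder; real benchmarks (e.g. BIG-bench) publish their own.

CANARY_STRINGS = [
    "canary GUID 00000000-0000-0000-0000-000000000000",  # placeholder, not a real benchmark canary
]

def is_contaminated(document: str) -> bool:
    """True if the document carries any known canary string."""
    return any(canary in document for canary in CANARY_STRINGS)

def filter_corpus(documents: list[str]) -> list[str]:
    """Keep only documents that do not contain a canary."""
    return [doc for doc in documents if not is_contaminated(doc)]

corpus = [
    "a perfectly normal web page about gardening",
    "benchmark sample ... canary GUID 00000000-0000-0000-0000-000000000000",
]
print(len(filter_corpus(corpus)))  # 1: the document carrying the canary is dropped
```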
|
|
|
|
|
<Note> |
|
|
The case where an evaluation dataset ends up in the training set is called **contamination**, and a model which was contaminated will have a high benchmark performance that does not generalize well to the underlying task (an extensive description of contamination can be found [here](https://aclanthology.org/2023.findings-emnlp.722/), and here is a fun way to [detect it](https://arxiv.org/abs/2311.06233)). A way to address contamination is to run [**dynamic benchmarks**](https://arxiv.org/abs/2104.14337) (evaluations on datasets which are regularly refreshed to provide scores on systematically unseen new data), but this approach is costly in the long term.
|
|
</Note> |
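
Beyond canaries, a common (if imperfect) heuristic used to spot contamination after the fact is to look for long n-gram overlaps between benchmark samples and training documents. A rough sketch, with the n-gram length chosen arbitrarily here:

```python
# Rough contamination heuristic: flag benchmark samples sharing a long n-gram with training text.
# The n-gram size (8) is an arbitrary choice for illustration; real pipelines tune this.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_contaminated(benchmark_samples: list[str], training_docs: list[str], n: int = 8) -> list[str]:
    """Return the benchmark samples which share at least one n-gram with the training documents."""
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    return [sample for sample in benchmark_samples if ngrams(sample, n) & train_ngrams]
```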
|
|
|
|
|
#### Human as a judge |
|
|
|
|
|
A solution to both the contamination problem and the need for more open-ended evaluation is to ask humans to evaluate model outputs.
|
|
|
|
|
This is usually done by asking humans to first prompt models, then grade a model answer or rank several outputs according to guidelines. Using humans as judges makes it possible to study more complex tasks, with more flexibility than automated metrics. It also prevents most contamination cases, since the written prompts are (hopefully) new. Lastly, it correlates well with human preference, since this is literally what is being evaluated!
|
|
|
|
|
Different approaches exist to evaluate models with humans in the loop. |
|
|
|
|
|
**Vibes-checks** is the name given to manual evaluations done individually by some members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on many use cases, which range from coding to the quality of written smut. These checks mostly constitute anecdotal evidence, and tend to be quite sensitive to confirmation bias (people tend to find what they are looking for).
|
|
|
|
|
Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one better than the other. Votes are then aggregated in an Elo ranking (a ranking method originally designed for chess matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce consistent grading across many community members following (at best) broad guidelines, even if the sheer scale of the votes smooths out some of the noise.
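
As a refresher on how such pairwise votes turn into a ranking, here is a minimal sketch of an Elo update for a single vote; the K-factor and starting ratings are generic textbook defaults, not the settings of any particular arena.

```python
# Minimal Elo update for one arena-style pairwise vote.
# K and the starting rating of 1000 are illustrative defaults, not any arena's actual settings.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32) -> tuple[float, float]:
    """Return the updated ratings after one vote between models A and B."""
    expected_a = expected_score(rating_a, rating_b)
    expected_b = 1 - expected_a
    score_a = 1.0 if a_wins else 0.0
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1.0 - score_a) - expected_b)
    return new_a, new_b

model_a, model_b = 1000.0, 1000.0
model_a, model_b = elo_update(model_a, model_b, a_wins=True)  # a user preferred model A
print(round(model_a), round(model_b))  # 1016 984
```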
|
|
|
|
|
The last approach is **systematic annotations**, where you provide extremely specific guidelines to paid, selected annotators, in order to remove as much of the subjectivity bias as possible (this is the approach used by most data annotation companies). However, it can get extremely expensive fast, as you have to keep doing evaluations in a continuous and non-automatic manner for every new model you want to evaluate, and it can still fall prey to human bias (this [study](https://arxiv.org/abs/2205.00501) showed that people with different identities tend to rate model answer toxicity very differently).
|
|
|
|
|
However, humans can be biased: for example, they tend to estimate the quality of answers [based on first impressions](https://arxiv.org/pdf/2309.16349) rather than actual factuality or faithfulness; they are very sensitive to tone, and underestimate the number of factual or logical errors in an assertively written answer. These are only some of the many biases that human judges can fall prey to (as we'll see later in this guide).
|
|
|
|
|
#### Model as a judge |
|
|
|
|
|
To mitigate the cost of human annotators, some people have looked into using models or derived artifacts (preferably aligned with human preferences) to evaluate models' outputs.
|
|
|
|
|
Two approaches exist for grading: using [generalist, high capability models](https://arxiv.org/abs/2306.05685v4), or using [small specialist models](https://arxiv.org/pdf/2405.01535) trained specifically on preference data to discriminate between better and worse answers.
|
|
|
|
|
Models used as judges have several strong limitations: they are as biased as humans, but along different axes (they can struggle to grade answers from models more capable than themselves, tend to prefer their own outputs as well as the first answer presented to them, and often fail to provide consistent score ranges).
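
A common mitigation for the position bias mentioned above is to ask the judge for the same comparison twice, with the answers swapped, and only keep verdicts that agree. Below is a minimal sketch, where the prompt wording is illustrative and `call_judge` is a placeholder for a real call to whatever judge model you use.

```python
# Sketch: mitigating judge position bias by asking for the same comparison in both orders.
# `call_judge` is a placeholder for a real call to a judge model returning "A", "B" or "tie".

PROMPT = (
    "You are comparing two answers to the same question.\n"
    "Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
    "Reply with exactly one of: A, B, tie."
)

def call_judge(prompt: str) -> str:
    raise NotImplementedError("replace with an actual judge model call")

def judge_both_orders(question: str, answer_1: str, answer_2: str) -> str:
    """Return 'answer_1', 'answer_2', or 'inconsistent/tie' if the verdict flips with the order."""
    first = call_judge(PROMPT.format(question=question, answer_a=answer_1, answer_b=answer_2))
    second = call_judge(PROMPT.format(question=question, answer_a=answer_2, answer_b=answer_1))
    if first == "A" and second == "B":
        return "answer_1"
    if first == "B" and second == "A":
        return "answer_2"
    return "inconsistent/tie"  # the verdict flipped with the order, or the judge called a tie
```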
|
|
|
|
|
My main personal gripe with using models as judges is that they introduce very subtle and un-interpretable bias in the answer selection. I feel that, much like crossbreeding too much in genetics studies ends up producing dysfunctional animals or plants, using LLMs to select and train LLMs is just as likely to introduce minute changes that will have bigger repercussions a couple of generations down the line. I believe this type of bias is less likely to occur in smaller and more specialized models used as judges (such as toxicity classifiers), but this remains to be rigorously tested and proven.
|
|
|