Commit 383bee2 · Parent(s): 42b7c5c
Clémentine committed: "fixed intro"
app/src/content/article.mdx
CHANGED
@@ -31,7 +31,7 @@ import ModelInferenceAndEvaluation from "./chapters/general-knowledge/model-infe

 ## LLM basics to understand evaluation

-Now that you have
+Now that you have an idea of why evaluation is important to different people, let's look at how we prompt models to get answers out of them in order to evaluate them. You can skim this section if you have already done evaluation, and mostly look at the notes and sidenotes.


 <ModelInferenceAndEvaluation />
app/src/content/chapters/automated-benchmarks/designing-your-automatic-evaluation.mdx
CHANGED
@@ -38,7 +38,7 @@ If you want to create synthetic data, you usually start from a number of seed do

 You'll then likely want a model to design questions from your data. For this, you will need to select a frontier model and design a very good prompt asking it to create use-case-relevant questions from the provided data. It's better if you also ask the model to provide the source on which it based each question.

-You can also use seed prompts as examples to provide to an external
+You can also use seed prompts as examples to provide to an external model, so that it writes the prompt your model will then use to generate new questions, if you want to go fully synthetic ^^

 Once this is done, you can do an automatic validation by using a model from a different family line as a judge on your ground truth + questions + answers.

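To make the question-generation step above concrete, here is a minimal Python sketch. The prompt wording, the JSON schema, and the `generate()` helper are illustrative assumptions rather than a prescribed recipe; plug in whichever frontier-model client you use.

```python
# Sketch of generating use-case questions from seed documents, with source attribution.
# generate() is a placeholder for a call to your frontier model of choice.
import json

QUESTION_PROMPT = """You are writing evaluation questions for <use case>.
From the document below, write one question that a domain expert could answer,
the expected answer, and quote the exact source passage you based it on.
Return JSON with the keys: question, answer, source.

Document:
{document}"""

def generate(prompt: str) -> str:
    """Placeholder for a call to the frontier model (API or local)."""
    raise NotImplementedError

def questions_from_seeds(seed_documents: list[str]) -> list[dict]:
    samples = []
    for doc in seed_documents:
        raw = generate(QUESTION_PROMPT.format(document=doc))
        sample = json.loads(raw)      # expects {"question", "answer", "source"}
        sample["document"] = doc      # keep the seed around for the later judge validation
        samples.append(sample)
    return samples
```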
@@ -57,6 +57,11 @@ Solutions to mitigate this include:

 However, a contaminated dataset can still be interesting and provide signal during training, as we saw in the ablations section.

+<Note>
+A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. In less extreme cases, you still want to test whether your model is able to generalize to data patterns which were not in the training set's distribution (for example, classifying toxicity on Stack Overflow after having seen toxicity only on Reddit).
+</Note>
+
+
 ### Choosing a prompt
 The prompt is going to define how much information is given to your model about the task, and how this information is presented to the model. It usually contains the following parts: an optional **task prompt** which introduces the task and the format that the output should follow, **attached context** if needed (for example a source or an image), a **problem prompt** which is what you ask of the model, and optional answer options for multiple-choice evaluations.

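As an illustration of how these parts fit together, here is a small sketch that assembles a multiple-choice prompt from the components listed above. The field names and the template layout are arbitrary choices for the example, not a fixed standard.

```python
# Assembling an evaluation prompt from its optional and mandatory parts.
from dataclasses import dataclass

@dataclass
class PromptParts:
    task_prompt: str | None      # optional task introduction + expected output format
    context: str | None          # optional attached source, passage, image caption, etc.
    problem: str                 # the actual question asked of the model
    options: list[str] | None    # only for multiple-choice evaluations

def build_prompt(p: PromptParts) -> str:
    sections = []
    if p.task_prompt:
        sections.append(p.task_prompt)
    if p.context:
        sections.append(f"Context:\n{p.context}")
    sections.append(f"Question: {p.problem}")
    if p.options:
        letters = "ABCDEFGH"
        sections.append("\n".join(f"{letters[i]}. {o}" for i, o in enumerate(p.options)))
        sections.append("Answer:")
    return "\n\n".join(sections)

print(build_prompt(PromptParts(
    task_prompt="Answer the following question by replying with the letter of the correct option.",
    context=None,
    problem="Which planet is known as the Red Planet?",
    options=["Venus", "Mars", "Jupiter", "Saturn"],
)))
```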
@@ -115,6 +120,8 @@ When there is a ground truth, however, you can use automatic metrics, let's see
 #### Metrics
 Most ways to automatically compare a string of text to a reference are match-based.

+<Sidenote>This is more interesting to do on data that was not included in the model's training set, because you want to test whether it **generalizes** well. You don't want a model which can only predict text it has already "seen", that would not be very useful!</Sidenote>
+
 The easiest but least flexible match-based metrics are **exact matches** of token sequences. While simple and unambiguous, they provide no partial credit - a prediction that's correct except for one word scores the same as one that's completely wrong. <Sidenote>Be aware that "exact match" is used as a catch-all name, and also includes "fuzzy matches" of strings: compared with normalization, on subsets of tokens (prefix only, for example), etc.</Sidenote>

 The translation and summarisation fields have introduced automatic metrics which compare similarity through the overlap of n-grams in sequences. **BLEU** (Bilingual Evaluation Understudy) measures n-gram overlap with reference translations and remains widely used despite having a length bias toward shorter translations and correlating poorly with humans at the sentence level (it notably won't work well for predictions which are semantically equivalent but written in a different fashion than the reference). **ROUGE** does a similar thing but focuses more on recall-oriented n-gram overlap. A simpler version of these is **TER** (translation error rate), the number of edits required to go from a prediction to the correct reference (similar to an edit distance).

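Here is a minimal sketch of the difference between strict exact match and the "fuzzy" variants mentioned in the sidenote. The normalization rules (lowercasing, punctuation stripping, whitespace collapsing) are illustrative choices, as every evaluation harness defines its own.

```python
# Strict vs. normalized ("quasi") vs. prefix exact match.
import re
import string

def normalize(text: str) -> str:
    text = text.lower().strip()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text)

def exact_match(prediction: str, reference: str) -> int:
    return int(prediction == reference)

def quasi_exact_match(prediction: str, reference: str) -> int:
    return int(normalize(prediction) == normalize(reference))

def prefix_match(prediction: str, reference: str) -> int:
    # useful when the model keeps generating after giving the answer
    return int(normalize(prediction).startswith(normalize(reference)))

print(exact_match("Paris.", "paris"))              # 0: no partial credit
print(quasi_exact_match("Paris.", "paris"))        # 1: same answer after normalization
print(prefix_match("Paris, because...", "Paris"))  # 1: correct answer, then extra text
```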
@@ -220,9 +227,16 @@ Human evaluation is very interesting, because of its **flexibility** (if you def
 However, when doing evaluation with humans, you need to make sure your annotators are diverse enough that your results generalize.
 </Sidenote>

-
-Vibe-checks are a particularly [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need), as you'll be testing the model on what's relevant to you. Pros of casual human evaluations are that they are cheap and allow to discover fun edge cases since you leverage user's creativity in a mostly unbounded manner, you can discover interesting edge cases. However, they can be prone to blind spots.
+Different approaches exist to evaluate models with humans in the loop.
+
+**Vibe-checks** is the name given to manual evaluations done by individual members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on their use cases of preference. (I've also seen the term "canary-testing" used for this, in reference to the high-signal canary-in-a-coalmine approach.) Said use cases can be anything from the most exciting to the most mundane - to cite some I've seen on Reddit, they covered legal questions in German, coding, the ability to generate tikz unicorns, tool use, the quality of written erotica, etc. Often shared on forums or social media, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
+
+Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce consistent grading from many community members using broad guidelines, especially since annotator preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (an effect observed by the statistician Galton, who noted that individual answers trying to estimate a numerical value, like the weight of a hog, could be modeled as a probability distribution centered around the actual answer).
+
+The last approach is **systematic annotations**, where you provide extremely specific guidelines to paid, selected annotators, in order to remove as much of the subjectivity bias as possible (this is the approach used by most data annotation companies). However, it can get extremely expensive fast, as you have to keep doing evaluations in a continuous, non-automatic manner for every new model you want to evaluate, and it can still fall prey to human bias (this [study](https://arxiv.org/abs/2205.00501) showed that people with different identities tend to rate model answer toxicity very differently).
+
+Vibe-checks are a particularly [good starting point for your own use cases](https://olshansky.substack.com/p/vibe-checks-are-all-you-need), as you'll be testing the model on what's relevant to you. Casual human evaluations are cheap and, since they leverage users' creativity in a mostly unbounded manner, allow you to discover fun and interesting edge cases. However, they can be prone to blind spots. <Sidenote> For example, there was a debate in the scientific community as to whether LLMs [can draw](https://arxiv.org/abs/2303.12712) unicorns [or not](https://twitter.com/DimitrisPapail/status/1719119242186871275). A year later, seems like most can! </Sidenote>


 Once you want to scale to more systematic evaluation with paid annotators, you'll find that there are 3 main ways to do so. If **you don't have a dataset**, but want to explore a set of capabilities, you provide humans with a task and scoring guidelines (e.g. *Try to make both these models output toxic language; a model gets 0 if it was toxic, 1 if it was not.*), and access to one (or several) model(s) that they can interact with, then ask them to provide their scores and reasoning. If **you already have a dataset** (e.g. a set of *prompts that you want your model to never answer*, for example for safety purposes), you preprompt your model with them, and provide the prompt, output and scoring guidelines to humans. If **you already have a dataset and scores**, you can ask humans to review your evaluation method by doing [error annotation](https://ehudreiter.com/2022/06/01/error-annotations-to-evaluate/) (*it can also be used as a scoring system in the above category*). It's a very important step of testing a new evaluation system, but it technically falls under evaluating an evaluation, so it's slightly out of scope here.

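As a rough illustration of how pairwise votes can be turned into an Elo ranking, here is a simplified sketch; the K-factor and starting rating are common defaults, not the parameters of any particular arena (which may use more elaborate rating models).

```python
# Minimal Elo update from head-to-head preference votes.
def expected_score(rating_a: float, rating_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

ratings = {"model_a": 1000.0, "model_b": 1000.0}
votes = ["model_a", "model_a", "model_b"]  # each vote names the preferred model
for winner in votes:
    ratings["model_a"], ratings["model_b"] = update_elo(
        ratings["model_a"], ratings["model_b"], a_won=(winner == "model_a")
    )
print(ratings)  # the more often preferred model drifts above 1000, the other below
```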
@@ -235,10 +249,16 @@ Overall, however, human evaluation has a number of well known biases:
 - **Self-preference bias**: Humans are [most likely to prefer answers which appeal to their views or align with their opinions or errors](https://arxiv.org/abs/2310.13548), rather than answers which are factually correct.
 - **Identity bias**: People with different identities tend to have different values, and rate model answers very differently (for example on [toxicity](https://arxiv.org/abs/2205.00501))

+These biases are not unexpected, but they must be taken into account: not all use cases should rely on cheap human annotators - any task requiring factuality (such as code writing, evaluation of model knowledge, etc.) should include another, more robust type of evaluation to complete the benchmark (experts, automatic metrics if applicable, etc.).
+
 ### With judge models
+To mitigate the cost of human annotators, some people have looked into using models or derived artifacts (preferably aligned with human preferences) to evaluate models' outputs.
+
+<Sidenote>This approach is not new; you could already find techniques to measure summarization quality from [model embeddings](https://arxiv.org/abs/1904.09675) in 2019.</Sidenote>
+
 Judge models are simply **neural networks used to evaluate the output of other neural networks**. In most cases, they evaluate text generations.

-
+Two approaches exist for grading: using [generalist, high-capability models](https://arxiv.org/abs/2306.05685v4) or using [small specialist models](https://arxiv.org/pdf/2405.01535) trained specifically to discriminate from preference data (think "spam filter", but for toxicity, for example). In the former case, when using an LLM as a judge, you give it a prompt explaining how to score models (e.g. `Score the fluency from 0 to 5, 0 being completely un-understandable, ...`).

 Models as judges allow you to score text on complex and nuanced properties.
 For example, an exact match between a prediction and a reference can tell you whether a model predicted the correct fact or number, but assessing more open-ended empirical capabilities (like fluency, poetry quality, or faithfulness to an input) requires more complex evaluators.

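Here is a bare-bones sketch of that setup: a scoring prompt, a call to a judge model, and a parser for the verdict. The prompt wording and the `call_judge()` helper are assumptions to be replaced with your own judge model and client.

```python
# LLM-as-judge sketch: prompt the judge, then extract a 0-5 score from its verdict.
import re

JUDGE_PROMPT = """Score the fluency of the answer below from 0 to 5,
0 being completely un-understandable and 5 being perfectly fluent.
Explain your reasoning briefly, then give the score on its own line as: Score: <n>

Question: {question}
Answer: {answer}"""

def call_judge(prompt: str) -> str:
    """Placeholder for a call to the judge model (API or local)."""
    raise NotImplementedError

def judge_fluency(question: str, answer: str) -> int | None:
    verdict = call_judge(JUDGE_PROMPT.format(question=question, answer=answer))
    match = re.search(r"Score:\s*([0-5])", verdict)
    return int(match.group(1)) if match else None  # None = unparsable verdict, worth logging
```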
@@ -263,6 +283,8 @@ In my opinion, using LLM judges correctly is extremely tricky, and it's **easy t

 This section is therefore a bit long, because you need to be well aware of the limitations of using models as judges: a lot of people blindly jump into using them because they seem easier than actually working with humans or designing new metrics, but then end up with uninterpretable data with tricky biases to extract.

+My main personal gripe with using models as judges is that they introduce very subtle and uninterpretable biases into answer selection. I feel that, much like crossbreeding too much in genetics studies ends up producing dysfunctional animals or plants, using LLMs to select and train LLMs is just as likely to introduce minute changes that will have bigger repercussions a couple of generations down the line. I believe this type of bias is less likely to occur in smaller, more specialized judge models (such as toxicity classifiers), but this remains to be rigorously tested and proven.
+
 <Note title="Getting started with an LLM judge">
 If you want to give it a go, I suggest first reading this [very good guide](https://huggingface.co/learn/cookbook/en/llm_judge) on how to set up your first LLM as judge!

app/src/content/chapters/general-knowledge/picking-your-evaluation.mdx
CHANGED
@@ -66,7 +66,7 @@ To measure this, we used the **Spearman rank correlation** to quantify the corre

 When comparing model performance on tasks, we need to consider whether differences are due to **evaluation noise or genuine performance variations**.

-Noise can arise from the stochastic processes involved in model training, such as random token sampling, data shuffling, or model initialization
+Noise can arise from the stochastic processes involved in model training, such as random token sampling, data shuffling, or model initialization ([Madaan et al., 2024](https://arxiv.org/abs/2406.10229)). To measure how sensitive each task is to this noise, we trained four additional models on our own monolingual corpora (unfiltered CommonCrawl data in each language) using different seeds.

 For each task, we computed:

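For illustration, a small sketch of the two quantities involved: the Spearman rank correlation between scores at consecutive training steps (signal) and the score spread across models trained with different seeds (noise). The numbers are invented for the example.

```python
# Signal vs. noise on one task: rank correlation across training steps,
# and score spread across seeds trained on identical data.
import numpy as np
from scipy.stats import spearmanr

# scores of several checkpoints on one task, at two consecutive evaluation steps
scores_step_n  = [0.31, 0.35, 0.42, 0.47, 0.52]
scores_step_n1 = [0.33, 0.36, 0.41, 0.49, 0.55]
rho, _pvalue = spearmanr(scores_step_n, scores_step_n1)
print(f"Spearman rank correlation between steps: {rho:.2f}")

# same task, models trained with different random seeds
seed_scores = np.array([0.44, 0.46, 0.43, 0.47])
print(f"Seed noise: std = {seed_scores.std(ddof=1):.3f}, "
      f"range = {seed_scores.max() - seed_scores.min():.3f}")
```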
@@ -219,7 +219,7 @@ Selecting the best evaluation metrics proved to be a **challenging task**. Not o
 ➡️ Multichoice Tasks

 - We found **base accuracy** to perform well for tasks with answer options varying subtly (e.g. Yes/No/Also), particularly NLI tasks. In such cases, where the answer options are often each a single token, base accuracy is advisable to use.
-- While OLMES [Gu et al., 2024](https://arxiv.org/abs/2406.08446) recommends using PMI for tasks with unusual words, we found **PMI** to be highly effective for "difficult" reasoning and knowledge tasks like AGIEVAL or MMLU. In these cases, PMI provided the best results and was often the only metric delivering performance above random. That said, PMI was, on average, the weakest metric across all other tasks, while also being two times more expensive to compute. We therefore only recommend its use for complex reasoning and knowledge tasks.
+- While the OLMES authors ([Gu et al., 2024](https://arxiv.org/abs/2406.08446)) recommend using PMI for tasks with unusual words, we found **PMI** to be highly effective for "difficult" reasoning and knowledge tasks like AGIEVAL or MMLU. In these cases, PMI provided the best results and was often the only metric delivering performance above random. That said, PMI was, on average, the weakest metric across all other tasks, while also being two times more expensive to compute. We therefore only recommend its use for complex reasoning and knowledge tasks.
 - The metrics we found to be **most reliable overall** were length normalization metrics (token or character-based). However, the best choice was dependent on language, rather than being consistent for a given task. Due to that, we recommend using the maximum of acc_char and acc_token for the most reliable results.<d-footnote>Note that acc_token is heavily tokenizer dependent. In our ablations all models were trained using the same tokenizer.</d-footnote>

 ➡️ Generative Tasks

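To make these metric variants concrete, here is a sketch of how the different multichoice scores pick an answer from per-choice log-probabilities. The toy numbers are invented; in practice the log-probabilities, and the unconditioned baseline used for PMI, come from the evaluated model itself.

```python
# Picking a multichoice answer under different normalization schemes.
import numpy as np

def pick(scores):
    return int(np.argmax(scores))

# one question, three answer choices
logp_choice = np.array([-12.0, -9.5, -14.0])   # log P(choice | question)
logp_uncond = np.array([-10.0, -6.0, -13.0])   # log P(choice | uninformative prompt), for PMI
n_tokens = np.array([4, 8, 5])
n_chars = np.array([18, 41, 22])

acc_pred       = pick(logp_choice)             # base accuracy: raw sum, favors short choices
acc_token_pred = pick(logp_choice / n_tokens)  # length-normalized by token count
acc_char_pred  = pick(logp_choice / n_chars)   # length-normalized by character count
pmi_pred       = pick(logp_choice - logp_uncond)  # PMI: discounts a priori likely strings

print(acc_pred, acc_token_pred, acc_char_pred, pmi_pred)
```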
app/src/content/chapters/intro.mdx
CHANGED
@@ -23,15 +23,9 @@ This guide is here to help you understand evaluation: what it can and cannot do

 Through the guide, we'll also highlight common pitfalls, tips and tricks from the Open Evals team, and hopefully help you learn how to think critically about the claims made from evaluation results.

-Before we dive into the details, let's quickly look at why people do evaluation,
+Before we dive into the details, let's quickly look at why people do evaluation, as who you are and what you are working on will determine which evaluations you need to use.

-###
-
-There are 3 main use cases for evaluation, which tend to be conflated together, but are actually **very different**, and each answer a separate question.
-
-<HtmlEmbed src="d3-intro-boxes.html" title="Evaluation purposes" frameless />
-
-#### The model builder perspective: Is this model training correctly?
+### The model builder perspective: Is this model training correctly?

 **Non-regression testing** is a concept which comes from the software industry, to make sure small changes have not broken the overall approach. The idea is the following: when you add a new feature to your software, or fix a problem in the code base, have you broken something else? That's what non-regression tests are for: making sure the expected, high-level behavior of your software is not suddenly broken by a (seemingly unrelated) change.

@@ -41,7 +35,7 @@ In ML, experiments which test the impact of small changes on model performance a

 For ablations, you also need to look at both **trajectories** (is the performance better now than when training started) and score **ranges** (is the performance within what's expected). These evaluations are here to confirm that your approach is "as sound as, or better than" other training approaches, and that your model behaves in similar ways. <Sidenote> Ablations can also be used to try to predict the performance of bigger models based on the performance of smaller ones, using scaling laws. </Sidenote>

-
+### The model user perspective: Which model is the best on \<task\>?

 The next role of evaluation is simply to sort models to find and select the best model for a given use case.

@@ -63,62 +57,4 @@ Despite often grandiose claims, for any complex capability, we cannot at the mom
 We are strongly missing any kind of good definitions and framework on what intelligence is for machine learning models, and how to evaluate it (though some people have tried, for example [Chollet](https://arxiv.org/abs/1911.01547) in 2019 and [Hendrycks et al](https://www.agidefinition.ai/paper.pdf) this year). Difficulty in defining intelligence is not a problem specific to machine learning! In human and animal studies, it is also quite hard to define, and metrics which try to provide precise scores (IQ and EQ for example) are hotly debated and controversial, with reason.

 There are, however, some issues with focusing on intelligence as a target. 1) Intelligence tends to end up being a moving target, as any time we reach a capability which was thought to be human specific, we redefine the term. 2) Our current frameworks are made with the human (or animal) in mind, and will most likely not transfer well to models, as the underlying behaviors and assumptions are not the same. 3) It is kind of a useless target too - we should target making models good at specific, well defined, purposeful and useful tasks (think accounting, reporting, etc) instead of aiming for AGI for the sake of it.
-</Note>
-
-### So how do people evaluate models, then?
-
-To my knowledge, at the moment, people use 3 main ways to do evaluation: automated benchmarking, using humans as judges, and using models as judges. Each approach has its own reason for existing, uses, and limitations.
-
-#### Automated benchmarks
-
-Automated benchmarking usually works the following way: you'd like to know how well your model performs on something. This something can be a well-defined concrete **task**, such as *How well can my model classify spam from non spam emails?*, or a more abstract and general **capability**, such as *How good is my model at math?*.
-
-From this, you construct an evaluation, usually made of two things:
-- a collection of *samples*, given as input to the model to see what comes out as output, sometimes coupled with a reference (called gold) to compare with. Samples are usually designed to try to emulate what you want to test the model on: for example, if you are looking at toxicity classification, you create a dataset of toxic and non toxic sentences, try to include some hard edge cases, etc.
-- a *metric*, which is a way to compute a score for the model. For example, how accurately can your model classify toxicity (score of well classified sample = 1, badly classified = 0).
-
-This is more interesting to do on data that was not included in the model training set, because you want to test if it **generalizes** well. You don't want a model which can only classify emails it has already "seen", that would not be very useful!
-
-<Note>
-A model which can only predict well on its training data (and has not latently learnt more high-level general patterns) is said to be **overfitting**. In less extreme cases, you still want to test if your model is able to generalize to data patterns which were not in the training set's distribution (for example, classify toxicity on stack overflow after having seen only toxicity on reddit).
-</Note>
-
-This works quite well for very well-defined tasks, where performance is "easy" to assess and measure: when you are literally testing your model on classification, you can say "the model classified correctly n% of these samples". For LLMs benchmarks, some issues can arise, such as models [favoring specific choices based on the order in which they have been presented for multi-choice evaluations](https://arxiv.org/abs/2309.03882), and generative evaluations relying on normalisations which can easily [be unfair if not designed well](https://huggingface.co/blog/open-llm-leaderboard-drop), but overall they still provide signal at the task level.
-
-For **capabilities** however, it's hard to decompose them into well-defined and precise tasks: what does "good at math" mean? good at arithmetic? at logic? able to reason on mathematical concepts?
-
-In this case, people tend to do more "holistic" evaluations, by not decomposing the capability in actual tasks, but assuming that performance on general samples will be a **good proxy** for what we aim to measure. For example, GSM8K is made of actual high school math problems, which require a whole set of capabilities to solve. It also means that both failure and success are very hard to interpret. Some capabilities or topics, such as "is this model good at writing poetry?" or "are the model outputs helpful?" are even harder to evaluate with automatic metrics, for they lack a single ground truth and are highly subjective - and at the same time, models now seem to have more and more **generalist** capabilities, so we need to evaluate their abilities in a broader manner.
-
-<Sidenote> For example, there was a debate in the scientific community as to whether LLMs [can draw](https://arxiv.org/abs/2303.12712) unicorns [or not](https://twitter.com/DimitrisPapail/status/1719119242186871275). A year later, seems like most can! </Sidenote>
-
-Automatic benchmarks also tend to have another problem: once they are published publicly in plain text, they are very likely to end up (often accidentally) in the training datasets of models. Some benchmarks creators, like the authors of BigBench, have tried to mitigate this by adding a *canary string* (a very specific combination of characters) for people to look for, and remove from training sets, but not everybody is aware of the mechanism nor trying to do this removal. There is also a non negligible quantity of benchmarks, so looking for accidental copies of absolutely all of them in data is costly. Other options include providing benchmarks in an [**encrypted** form](https://arxiv.org/pdf/2309.16575), or behind a [**gating** system](https://huggingface.co/datasets/Idavidrein/gpqa). However, when evaluating closed models (that are behind APIs), there is no guarantee that the prompts you give won’t be later used internally for training or fine-tuning.
-
-<Note>
-The case were an evaluation dataset ends up in the training set is called **contamination**, and a model which was contaminated will have a high benchmark performance that does not generalize well to the underlying task (an extensive description of contamination can be found [here](https://aclanthology.org/2023.findings-emnlp.722/), and here is a fun way to [detect it](https://arxiv.org/abs/2311.06233)). A way to address contamination is to run [**dynamic benchmarks**](https://arxiv.org/abs/2104.14337) (evaluations on datasets which are regularly refreshed to provide scores on systematically unseen new data), but this approach is costly in the long term.
-</Note>
-
-#### Human as a judge
-
-As we just saw, automatic metrics cannot score free-form text reliably (especially for open ended questions and subjective topics); they also tend to rely on contaminated datasets. A solution to both these problems is asking humans to evaluate model outputs: using humans as judges allows to study more complex tasks, with more flexibility than automated metrics.
-
-This is usually done by tasking humans with grading model answers or ranking several outputs according to guidelines. If you also let your judges be free with their prompts, it reduces contamination, since the written prompts are (hopefully) new. Lastly, it correlates well with human preference, since this is literally what is evaluated!
-
-Different approaches exist to evaluate models with humans in the loop.
-
-**Vibe-checks** is the name given to manual evaluations done by individual members of the community, usually on undisclosed prompts, to get an overall "feeling" of how well models perform on their use cases of preference. (I've also seen the term "canary-testing" used for this, in reference to high signal canary in a coalmine approach). Said use cases can be anything from the most exciting to the most mundate - to cite some I've seen on Reddit, they covered legal questions in German, coding, ability to generate tikz unicorns, tool use, quality of erotica written, etc. Often shared on forums or social media, they mostly constitute anecdotal evidence, and tend to be highly sensitive to confirmation bias (in other words, people tend to find what they look for).
-
-Using community feedback to establish massive model rankings is what we call an **arena**. A well known example of this is the [LMSYS chatbot arena](https://huggingface.co/spaces/lmsys/chatbot-arena-leaderboard), where community users are asked to chat with models until they find one is better than the other. Votes are then aggregated in an Elo ranking (a ranking of matches) to select which model is "the best". The obvious problem of such an approach is the high subjectivity - it's hard to enforce a consistent grading from many community members using broad guidelines, especially since annotators preferences tend to be [culturally bound](https://arxiv.org/abs/2404.16019v1) (with different people favoring different discussion topics, for example). One can hope that this effect is smoothed over by the sheer scale of the votes, through a "wisdom of the crowd" effect (this effect was found by a statistician named Galton, who observed that individual answers trying to estimate a numerical value, like the weight of a hog, could be modeled as a probability distribution centered around the actual answer).
-
-The last approach is **systematic annotations**, where you provide extremely specific guidelines to paid selected annotators, in order to remove as much as the subjectivity bias as possible (this is the approach used by most data annotation companies). However, it can get extremely expensive fast, as you have to keep on doing evaluations in a continuous and non automatic manner for every new model you want to evaluate, and it can still fall prey to human bias (this [study](https://arxiv.org/abs/2205.00501) showed that people with different identities tend to rate model answer toxicity very differently).
-
-Humans can be biased, and we'll cover most of the associated issues in the sections below: however, the problem is considerably worse for crowdsourced or unexpert annotators. These biases are not unexpected, but they must be taken into account: not all use cases should rely on using cheap human annotators - any task requiring factuality (such as code writing, evaluation of model knowledge, etc) should include another, more robust, type of evaluation to complete the benchmark (experts, automatic metrics if applicable, etc).
-
-#### Model as a judge
-
-To mitigate the cost of human annotators, some people have looked into using models or derived artifacts (preferably aligned with human preferences) to evaluate models' outputs. This approach is not new, as you can find techniques to measure summarization quality from [model embeddings](https://arxiv.org/abs/1904.09675) in 2019.
-
-Two approaches exist for grading: using [generalist, high capability models](https://arxiv.org/abs/2306.05685v4) or using [small specialist models](https://arxiv.org/pdf/2405.01535) trained specifically to discriminate from preference data.
-
-Model as judges have several strong limitations, because they are as biased as humans but along different axes (they can't [provide consistent score ranges](https://twitter.com/aparnadhinak/status/1748368364395721128), are actually not that consistent [with human rankings](https://arxiv.org/pdf/2308.15812), etc, as we'll see below).
-
-My main personal gripe with using models as judges is that they introduce very subtle and un-interpretable bias in the answer selection. I feel that, much like when crossbreeding too much in genetics studies, you end up with dysfunctional animals or plants, by using LLMs to select and train LLMs, we are just as likely to introduce minute changes that will have bigger repercussions a couple generations down the line. I believe this type of bias is less likely to occur in smaller and more specialized models as judges (such as toxicity classifiers), but this remains to be rigorously tested and proven.
+</Note>

app/src/content/embeds/d3-intro-boxes.html
CHANGED
@@ -22,13 +22,13 @@
 background: oklch(from var(--primary-color) calc(l + 0.4) c h / 0.35);
 border: 1px solid oklch(from var(--primary-color) calc(l + 0.15) c h / 0.6);
 border-radius: 16px;
-padding: var(--spacing-
+padding: var(--spacing-5) var(--spacing-5);
 text-align: left;
 display: flex;
 flex-direction: column;
 justify-content: flex-start;
 align-items: flex-start;
-min-height:
+min-height: 180px;
 }

 /* Dark mode adjustments for better readability */

@@ -57,9 +57,9 @@
 list-style: none;
 padding: 0;
 margin: 0;
-font-size:
+font-size: 14px;
 color: var(--text-color);
-line-height: 1.
+line-height: 1.7;
 font-weight: 500;
 position: relative;
 z-index: 1;

@@ -124,27 +124,28 @@
 container.innerHTML = `
 <div class="purposes-grid">
 <div class="purpose-card">
-<div class="purpose-title">Model
+<div class="purpose-title">Model Builders</div>
 <ul class="purpose-items">
-<li>
-<li>
-<li>
+<li>Is this model training correctly?</li>
+<li>Non-regression testing & ablations</li>
+<li>Compare training approaches</li>
 </ul>
 </div>
-
+
 <div class="purpose-card">
-<div class="purpose-title">Users</div>
+<div class="purpose-title">Model Users</div>
 <ul class="purpose-items">
-<li>
-<li>
+<li>Which model is best on <task>?</li>
+<li>Compare models & rankings</li>
+<li>Design custom evaluations</li>
 </ul>
 </div>
-
+
 <div class="purpose-card">
-<div class="purpose-title">
+<div class="purpose-title">Researchers</div>
 <ul class="purpose-items">
-<li>capabilities
-<li>
+<li>What capabilities exist?</li>
+<li>What is the state of progress?</li>
 </ul>
 </div>
 </div>