---
title: "Troubleshooting reproducibility"
---
import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";
Let's say you have read a recent tech report about a cool new model, and you want to reproduce its results on your machine... but you just can't seem to manage it?
Let's explore why.
#### Different code base
To reproduce evaluation scores down to the decimal point, you first need to make sure you're using exactly the same code base as the paper you want to reproduce.
Usually, this means either using the authors' own evaluation code, or a standard implementation from a reference library like EleutherAI's `lm_eval` or Hugging Face's `lighteval`. However, if the source code for the evaluation is not provided, then I'm sorry, but it's unlikely that you'll be able to reproduce the results precisely.
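If the paper used a reference harness, the safest path is to pin the exact library version they report and go through its standard entry point rather than re-implementing the task. As a rough sketch, assuming you use `lm_eval`'s Python API (argument names can vary slightly between versions, and the model id below is a placeholder):

```python
import lm_eval

# Placeholder model id and task list: swap in whatever the paper reports,
# and pin the same `lm_eval` release the authors used.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-model,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])
```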
If you want to easily understand what kind of discrepancies happen when using different implementations, you can explore [this blog](https://huggingface.co/blog/open-llm-leaderboard-mmlu) (⭐) we wrote with the eval team at Hugging Face. It studies the differences we observed between three common implementations of the MMLU evaluation (in `lm_eval`, `helm`, and the original authors' implementation), and how they change model scores.
*Note: It is precisely for this reason that a Hugging Face team decided to launch the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard): to get unified and homogeneous model scores that can be compared against internal experiments.*
#### Subtle implementation or loading difference
We've observed that the following were easy things to mess up, even when using the same code base:
- **Different random seeds.**
  - Normally, inference is less affected by random seeds than training. However, they can still affect some CUDA operations (see the PyTorch page on [reproducibility](https://pytorch.org/docs/stable/notes/randomness.html)) and change predictions if you're using a non-greedy generation strategy. They can also affect the prompt if you're using few-shots, and some pre- or post-processing functions.
-> A tiny change can result in a couple of points of difference.
- **Actually different metrics**.
Metrics can be different in practice even if they share the same name. Some examples:
- If the original implementation is a *log likelihood* `exact match` (computing the log probabilities of different possible answers), and you're using a *generative* `exact match` (only comparing the main greedy generation with the reference), you won't get the same scores.
- We also saw, in evaluation code bases, a number of tasks which were defined as `exact match`, but were actually `prefix exact match` (comparing only the beginning of the generation with the reference), or `suffix exact match` (the opposite), or `quasi exact match` (exact match with a normalization).
-> You therefore can't rely only on the metric name to determine what is happening, and need to look at the code.
- **Different normalization**.
- To go back to our above `exact match` comparison example, in `lm_eval` v1, a number of tasks were simply named generative `exact match`: you would assume from this that the prediction is *compared as such* to a reference.
Looking at the code, the prediction would instead go through a normalization step (removing punctuation, homogenizing numbers, etc) before being compared to the reference. This will obviously change results quite a lot.
(The `lm_eval` v2 now includes the normalization name in most metric names.)
-> This is one of the easiest things to mess up, especially for tasks which require a lot of normalization/answer post-processing, like math evaluations (where you want to extract the answer from a generated explanation) - see the sketch right after this list.
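To make this concrete, here is a minimal, self-contained sketch (not taken from any particular harness) of how metrics that all go by "exact match" can disagree on the very same generation:

```python
import string

def exact_match(prediction: str, reference: str) -> bool:
    # Strict string comparison: any formatting difference is a failure.
    return prediction == reference

def prefix_exact_match(prediction: str, reference: str) -> bool:
    # Only the beginning of the generation is compared to the reference.
    return prediction.startswith(reference)

def quasi_exact_match(prediction: str, reference: str) -> bool:
    # Same name ("exact match") in some code bases, but with normalization
    # applied first: lowercasing, stripping punctuation, collapsing whitespace.
    def normalize(text: str) -> str:
        text = text.lower()
        text = "".join(ch for ch in text if ch not in string.punctuation)
        return " ".join(text.split())
    return normalize(prediction) == normalize(reference)

# The same prediction scores differently depending on which
# "exact match" the harness actually implements.
print(exact_match("Paris.", "paris"))                   # False
print(prefix_exact_match("Paris. It is...", "Paris"))   # True
print(quasi_exact_match("Paris.", "paris"))             # True
```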
<Note title="Model loading affects reproducibility" emoji="🔧" variant="warning">
**Four factors that change results even with identical code:**
- **Hardware**: PyTorch doesn't guarantee reproducibility across different GPUs/hardware
- **Inference library**: `transformers`, `vllm`, and `sglang` handle batching and matrix operations slightly differently (as of 2025)
- **Batch size**: Different batch sizes = different results (you should fix the batch size for reproducibility, though be careful about OOM errors)
- **Loading precision**: Lower precision (especially quantized models vs floating point models) will change numerical results
</Note>
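Concretely, here is a minimal `transformers` sketch of what you would want to pin down (and report) when loading a model for evaluation; the model id is a placeholder:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-model"  # placeholder: the model you are evaluating

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # pin the precision: fp32 vs bf16 vs a quantized
                                 # checkpoint will give different numerical results
    device_map="auto",
)
model.eval()

# Keep the evaluation batch size fixed (and documented) across runs.
BATCH_SIZE = 8
```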
#### Different prompt
Three main things come into play for prompt variation (along with the generation parameters, covered at the end of this section).
**Prompt itself**
The format you are using for the prompt can and will change scores wildly.
For example, for multichoice question answering, common formats include very simple variations in how the choices are presented (each column below is one variant), such as:
```
Question: <text of the question>
Choices:
A. <Choice A>    |    (A) <Choice A>    |    <Choice A>
B. <Choice B>    |    (B) <Choice B>    |    <Choice B>
C. <Choice C>    |    (C) <Choice C>    |    <Choice C>
D. <Choice D>    |    (D) <Choice D>    |    <Choice D>
Answer:
```
and predicting either `A`/`B`/`C`/`D` or `<Choice A/B/C/D>`.
These prompts are **semantically equivalent**, as they contain the exact same content - but they can still result in a difference of *several points for the same model*.
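To make the variations explicit, here is a small, hypothetical helper (not from any specific harness) that builds the three formats above for the same question:

```python
# Three semantically equivalent ways of presenting the same multichoice question.
def format_prompt(question: str, choices: list[str], style: str) -> str:
    labels = ["A", "B", "C", "D"]
    if style == "letter_dot":      # A. <choice>
        lines = [f"{l}. {c}" for l, c in zip(labels, choices)]
    elif style == "letter_paren":  # (A) <choice>
        lines = [f"({l}) {c}" for l, c in zip(labels, choices)]
    else:                          # bare choices, no labels
        lines = list(choices)
    return f"Question: {question}\nChoices:\n" + "\n".join(lines) + "\nAnswer:"

question = "What is the capital of France?"
choices = ["Paris", "London", "Berlin", "Madrid"]
for style in ["letter_dot", "letter_paren", "bare"]:
    print(format_prompt(question, choices, style), end="\n\n")
```

All three prompts carry exactly the same information; any score difference between them comes purely from how the model reacts to the formatting.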
<Note title="Prompt format sensitivity" emoji="📝" variant="danger">
We ran some experiments on this [here](https://x.com/clefourrier/status/1777319187913875893/photo/1) and saw up to a 7-point difference for the same model; a [paper observed similar results](https://arxiv.org/abs/2310.11324).
Even tiny formatting variations (like `A.` vs `(A)` vs just listing the choices) can significantly impact scores: models increasingly overfit to specific benchmark prompt formats during training and lose the ability to adapt to others.
**Real example**: on the Open LLM Leaderboard 2, the Llama 3.1 models were predicting the correct answers to our MATH-Hard evaluations but still got low scores: having overfit the GSM8K prompt and answer format (another math eval), they could not adapt to the template provided in few-shot.
</Note>
This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompt and answer formats, at the cost of adapting to other prompts at evaluation time.
Some tasks are also prefixed with a task prompt (e.g. `The following questions are about <topic>`) - its presence or absence will also affect the scores.
**System prompt and chat template**
Chat models have usually been through instruction/preference training or fine-tuning. During this stage, they learn to follow specific templates at inference time. For example, templates can require starting rounds of dialogue with a general prompt (called the `system prompt`) prefixed by specific tokens (usually `System: `). This prompt provides high-level instructions for the model, such as the contents of a persona, or general answering style instructions. Rounds of dialogue can also require adding prefix keywords to the text, such as `User` for queries and `Assistant` for answers.
When using few-shot examples, you also need to decide whether to provide them multi-turn (mimicking user/assistant turns) or all at once (in a single user prompt).
Not following the chat template expected by the model at inference will kill its performance, as it will drive its output outside of the probability space it's been converging on.
Similarly, if you are using a reasoning model, you need to check whether you are comparing results with or without thinking enabled.
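In practice, rather than hand-crafting `System:`/`User:`/`Assistant:` prefixes, you want to rely on the template stored with the tokenizer. A minimal `transformers` sketch (the model id is a placeholder; for reasoning models, some templates also expose a switch to enable or disable thinking, which you should set consistently across the runs you compare):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-chat-model")  # placeholder

messages = [
    {"role": "system", "content": "You are a helpful assistant. Answer with a single letter."},
    {"role": "user", "content": "Question: ...\nChoices:\nA. ...\nB. ...\nAnswer:"},
]

# Let the tokenizer apply the model's own chat template instead of guessing
# prefixes and special tokens the model may never have seen during training.
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)
```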
**Few-shots samples**
Three things are easy to mess up with few-shot samples: the number of examples, which ones you are using, and their specific ordering.
<Sidenote>
The importance of using the same examples is not too surprising, if we assume some samples are better at expressing the task than others. More surprising maybe: you not only need to use the exact same samples, but also present them in the **exact same order**. Varying the order of the same samples led us to observe up to 3 points of difference on some subsets of MMLU (you can see [some results here](https://huggingface.co/blog/evaluation-structured-outputs), in the third color grid).
</Sidenote>
This is also a place where paying attention to the random seeds is important.
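Here is a minimal sketch of keeping both the selection and the ordering of few-shot examples stable across runs (the example pool is made up for illustration):

```python
import random

# Hypothetical pool of few-shot examples for the task.
fewshot_pool = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "3 * 3 = ?", "answer": "9"},
    {"question": "10 - 7 = ?", "answer": "3"},
    {"question": "12 / 4 = ?", "answer": "3"},
]

NUM_FEWSHOT = 2
SEED = 1234

# A dedicated, fixed-seed RNG ensures that both the *selection* and the *order*
# of the few-shot examples are identical across runs.
rng = random.Random(SEED)
fewshot_examples = rng.sample(fewshot_pool, k=NUM_FEWSHOT)

context = "\n\n".join(
    f"Question: {ex['question']}\nAnswer: {ex['answer']}" for ex in fewshot_examples
)
print(context)
```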
**Parameters**
For generative evaluations, you need to make sure you are 1) using the **same end of sentence token** (you probably should not be using a default one for chat and reasoning models); 2) allowing your model to **generate the same number of tokens** for the evaluation (this is particularly crucial for reasoning models, which require a huge number of tokens in thinking mode); and 3) if using sampling, using the **same seed and temperature parameters**.
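As a rough illustration with `transformers` (placeholder model id; the specific values are examples, what matters is keeping them identical to the setup you are comparing against):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

model_id = "your-org/your-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

set_seed(1234)  # only matters when sampling, but cheap to always fix

inputs = tokenizer("Question: ...", return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,                  # reasoning models may need far more
    do_sample=True,
    temperature=0.6,                      # keep sampling parameters identical across runs
    eos_token_id=tokenizer.eos_token_id,  # make the end of sentence token explicit
)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```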