---
title: "Troubleshooting reproducibility"
---

import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";
Let's say you have read a recent tech report about a cool new model, and you want to reproduce their results on your machine... but you're not managing to?

Let's explore why.
#### Different code base

To reproduce evaluation scores to the decimal point, you first need to make sure you're using exactly the same code base as the paper you want to reproduce.

Usually, this means either using the default evaluation code provided by the authors, or a standard implementation in a reference library like EleutherAI's `lm_eval` or Hugging Face's `lighteval`. However, if the evaluation source code is not provided, then I'm sorry for you, but it's unlikely that you'll be able to reproduce the results precisely.

If you want to easily understand what kind of discrepancies happen when using different implementations, you can explore [this blog](https://huggingface.co/blog/open-llm-leaderboard-mmlu) (⭐) we wrote with the eval team at Hugging Face. It studies the differences we observed between 3 common implementations of the MMLU evaluation (in `lm_eval`, `helm`, and in the original author implementation), and how they change model scores.

*Note: It is precisely for this reason that the Hugging Face team decided to launch the [Open LLM Leaderboard](https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard), to get unified and homogeneous comparisons of model scores in order to compare them to internal experiments.*
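If you go through a reference harness, pin its exact version (ideally the commit) and run everything through a single entry point, so that prompt and metric choices are fixed and recorded with your scores. Here is a minimal sketch assuming EleutherAI's lm-evaluation-harness v0.4+ Python API; the model id, tasks and arguments are placeholders to adapt to your setup:

```python
# Minimal sketch, assuming lm-evaluation-harness >= 0.4 and its Python entry
# point; model id, tasks and arguments are placeholders. Pin the exact harness
# version (or commit) you run, and store it alongside the scores.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # transformers backend
    model_args="pretrained=HuggingFaceTB/SmolLM2-1.7B-Instruct,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"])  # per-task scores; also save results["config"]
```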
#### Subtle implementation or loading difference

We've observed that the following were easy things to mess up, even when using the same code base:
- **Different random seeds.**
  Normally, inference is less affected by random seeds than training. However, they can still affect some CUDA operations (see the PyTorch page on [reproducibility](https://pytorch.org/docs/stable/notes/randomness.html)) and change predictions if you're using a non-greedy generation strategy. They can also affect the prompt if you're using few-shot examples, and some pre- or post-processing functions.
  -> A tiny change can result in a couple of points of difference.
- **Actually different metrics.**
  Metrics can be different in practice even if they share the same name. Some examples:
  - If the original implementation is a *log likelihood* `exact match` (computing the log probabilities of different possible answers), and you're using a *generative* `exact match` (only comparing the main greedy generation with the reference), you won't get the same scores.
  - We also saw, in evaluation code bases, a number of tasks which were defined as `exact match`, but were actually `prefix exact match` (comparing only the beginning of the generation with the reference), `suffix exact match` (the opposite), or `quasi exact match` (exact match with a normalization).
  -> You therefore can't rely only on the metric name to determine what is happening, and need to look at the code.
- **Different normalization.**
  To go back to our `exact match` comparison example above: in `lm_eval` v1, a number of tasks were simply named generative `exact match`, so you would assume the prediction is *compared as such* to the reference. Looking at the code, the prediction would instead go through a normalization step (removing punctuation, homogenizing numbers, etc.) before being compared to the reference. This obviously changes results quite a lot. (`lm_eval` v2 now includes the normalization name in most metric names.)
  -> This is one of the easiest things to mess up, especially for tasks which require a lot of normalization/answer post-processing, like math evaluations (where you want to extract the answer from a generated explanation); see the sketch after this list.
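To make these `exact match` flavors concrete, here is a minimal sketch (plain Python, with a hypothetical `normalize` helper, not taken from any specific harness) showing how the same prediction can fail one variant and pass another:

```python
# Minimal sketch (not from any specific harness) of how "exact match" variants
# that share a name can disagree on the same prediction/reference pair.
import re
import string

def normalize(text: str) -> str:
    # Hypothetical normalization: lowercase, strip punctuation, squeeze spaces.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(pred: str, ref: str) -> bool:
    return pred == ref

def prefix_exact_match(pred: str, ref: str) -> bool:
    return pred.startswith(ref)

def quasi_exact_match(pred: str, ref: str) -> bool:
    return normalize(pred) == normalize(ref)

pred, ref = "The answer is 42.", "the answer is 42"
print(exact_match(pred, ref))         # False
print(prefix_exact_match(pred, ref))  # False (case and punctuation differ)
print(quasi_exact_match(pred, ref))   # True once both sides are normalized
```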
| <Note title="Model loading affects reproducibility" emoji="🔧" variant="warning"> | |
| **Four factors that change results even with identical code:** | |
| - **Hardware**: PyTorch doesn't guarantee reproducibility across different GPUs/hardware | |
| - **Inference library**: transformers, vllm and sglang handle batching and matrix operations slightly differently as of 2025 | |
| - **Batch size**: Different batch sizes = different results (you should fix the batch size for reproducibility, though careful about OOM errors) | |
| - **Loading precision**: Lower precision (especially quantized models vs floating point models) will change numerical results | |
| </Note> | |
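You can at least pin the knobs you do control (seed, precision, batch size) and record them with your scores. A minimal sketch with `transformers` and PyTorch - the model id and values are placeholders:

```python
# Minimal sketch: fix the knobs you control (seed, dtype, batch size) when
# loading a model for evaluation. Model id and values are placeholders;
# hardware and kernel-level differences can still cause small drifts.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed

set_seed(1234)  # seeds python, numpy and torch in one call

MODEL_ID = "HuggingFaceTB/SmolLM2-1.7B-Instruct"  # example model
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,  # record the precision you evaluate in
    device_map="auto",
)

BATCH_SIZE = 8  # keep it constant across runs you want to compare
```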
#### Different prompt

3 main things can come into play for prompt variation.

**Prompt itself**

The format you are using for the prompt can and will change scores wildly.

For example, for multichoice question answering, some common formats include very simple variations when presenting the choices, such as:
```
Question: <text of the question>
Choices:
```

```markdown
| A. <Choice A> | (A) <Choice A> | <Choice A> |
| B. <Choice B> | (B) <Choice B> | <Choice B> |
| C. <Choice C> | (C) <Choice C> | <Choice C> |
| D. <Choice D> | (D) <Choice D> | <Choice D> |
```

```
Answer:
```
and predicting either `A`/`B`/`C`/`D` or `<Choice A/B/C/D>`.

These prompts are **semantically equivalent**, as they contain the exact same content - but they can still result in a difference of *several points for the same model*.
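As an illustration, here is a small sketch (plain Python, with invented question and choices) that renders the same item in the three semantically equivalent formats above:

```python
# Minimal sketch: three semantically equivalent ways to render the same MCQA
# item. Field names and formats are illustrative, not from a specific harness.
question = "What is the capital of France?"
choices = ["Berlin", "Madrid", "Paris", "Rome"]
letters = ["A", "B", "C", "D"]

def render(question, choices, style):
    if style == "letter_dot":      # "A. Berlin"
        lines = [f"{l}. {c}" for l, c in zip(letters, choices)]
    elif style == "letter_paren":  # "(A) Berlin"
        lines = [f"({l}) {c}" for l, c in zip(letters, choices)]
    else:                          # bare choices; the model must output the text
        lines = list(choices)
    return f"Question: {question}\nChoices:\n" + "\n".join(lines) + "\nAnswer:"

for style in ("letter_dot", "letter_paren", "bare"):
    print(render(question, choices, style), end="\n\n")
```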
| <Note title="Prompt format sensitivity" emoji="📝" variant="danger"> | |
| We did some experiments on this [here](https://x.com/clefourrier/status/1777319187913875893/photo/1) (you'll see up to a 7 points difference for the same model) and a [paper observed similar results](https://arxiv.org/abs/2310.11324). | |
| Semantically identical prompts can cause 7+ point score differences! | |
| Even tiny formatting variations (like `A.` vs `(A)` vs just listing choices) significantly impact scores. Models increasingly overfit to specific benchmark prompt formats during training, losing adaptation ability. | |
| **Real example**: Llama 3.1 models predicted correct MATH-Hard answers but scored poorly because they overfit to GSM8K's prompt format and couldn't adapt to different few-shot templates. | |
| This is something we observed on the Open LLM Leaderboard 2 for the Llama3.1 models. They were predicting the correct answers to our MATH-Hard evaluations, but were getting low scores, being unable to fit to the template provided in few-shot because they overfit the GSM8K prompt and answer format (another math eval). | |
| </Note> | |
This [great paper](https://arxiv.org/abs/2407.07890)⭐ also highlights a side effect of this: a number of models are now trained to overfit benchmark prompts and answer formats, at the cost of adapting to other prompts at evaluation time.

Some tasks are also prefixed with a task prompt (eg: `The following questions are about <topic>`) - its presence or absence will also affect the scores.
**System prompt and chat template**

Chat models have usually been through instruction/preference training or fine-tuning. During this stage, they have learned to follow specific templates at inference time. For example, templates can require starting rounds of dialogue with a general prompt (called the `system prompt`) prefixed by specific tokens (usually `System: `). Said prompt is there to provide high-level instructions for the model, such as the contents of a persona, or general answering-style instructions. Rounds of dialogue can also require adding prefix keywords to text, such as `User` for queries and `Assistant` for answers.

When using few-shot examples, you also need to choose whether they are provided multi-turn (mimicking user/assistant turns) or all at once (in a single user prompt); see the sketch below.

Not following the chat template expected by the model at inference will kill its performance, as it will drive its output outside of the probability space it's been converging on.

Similarly, if you are using a reasoning model, you need to make sure whether you are comparing runs with or without thinking enabled.
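Here is a minimal sketch with `transformers`' `apply_chat_template` (the model id, system prompt and examples are placeholders) showing the two ways of packing few-shot examples into the chat template:

```python
# Minimal sketch: packing few-shot examples into a chat template, either as
# multi-turn dialogue or as a single user message. Model id, system prompt and
# example contents are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

shots = [("What is 2+2?", "4"), ("What is 3+5?", "8")]
query = "What is 7+6?"

# Option 1: few shots as user/assistant turns
multi_turn = [{"role": "system", "content": "Answer with a single number."}]
for q, a in shots:
    multi_turn += [{"role": "user", "content": q},
                   {"role": "assistant", "content": a}]
multi_turn.append({"role": "user", "content": query})

# Option 2: few shots packed into a single user message
packed = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in shots) + f"\n\nQ: {query}\nA:"
single_prompt = [
    {"role": "system", "content": "Answer with a single number."},
    {"role": "user", "content": packed},
]

for messages in (multi_turn, single_prompt):
    print(tokenizer.apply_chat_template(messages, tokenize=False,
                                        add_generation_prompt=True))
```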
**Few-shot samples**

Three things are easy to mess up with few-shot samples: the number of few-shot examples, which ones you are using, and their specific ordering.

<Sidenote>
The importance of using the same examples is not too surprising, if we assume some samples are better at expressing the task than others. More surprising maybe: you not only need to use the exact same samples, but also present them in the **exact same order**. Varying the order of the same samples led us to observe up to 3 points of difference on some subsets of MMLU (you can see [some results here](https://huggingface.co/blog/evaluation-structured-outputs), it's the third color grid).
</Sidenote>

This is also a place where paying attention to the random seeds is important.
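A minimal sketch (plain Python, with a hypothetical pool of examples) of making the few-shot selection and its ordering deterministic:

```python
# Minimal sketch: deterministic few-shot selection and ordering.
# The example pool is hypothetical; the point is that both the sampled items
# and their order are fully determined by the recorded seed.
import random

fewshot_pool = [f"example {i}" for i in range(100)]  # placeholder pool
NUM_FEWSHOT = 5
SEED = 42

rng = random.Random(SEED)                      # local RNG, unaffected by global state
shots = rng.sample(fewshot_pool, NUM_FEWSHOT)  # same items AND same order
print(shots)                                   # identical on every run with SEED=42
```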
**Parameters**

For generative evaluations, you need to make sure you are 1) using the **same end of sentence token** (you probably should not be using a default one for chat and reasoning models); 2) allowing your model to **generate the same number of tokens** for the evaluation (this is particularly crucial for reasoning models, which require a huge number of tokens in thinking mode); and 3) if using sampling, using the **same seed/temperature parameters**.
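A minimal sketch of pinning these generation parameters with `transformers`' `GenerationConfig` (the model id and values are placeholders to adapt and record alongside your results):

```python
# Minimal sketch: pin and record the generation parameters of a generative
# evaluation. Model id and values are placeholders; reuse (and log) the exact
# same config when trying to reproduce a score.
from transformers import AutoTokenizer, GenerationConfig

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct")

gen_config = GenerationConfig(
    max_new_tokens=2048,                  # reasoning models may need far more
    do_sample=False,                      # greedy; if sampling, also fix the seed
    eos_token_id=tokenizer.eos_token_id,  # the model's own EOS, not a default one
)

# Later: model.generate(**inputs, generation_config=gen_config)
```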