---
title: "Model inference and evaluation"
---
import llmTk1 from '../../assets/image/llm_tk_1.png';
import llmLogprob from '../../assets/image/llm_logprob.png';
import llmGen from '../../assets/image/llm_gen.png';
import chatTemplatesTokenisation from '../../assets/image/chat-templates-and-tokenisation.png';
import Image from '../../../components/Image.astro';
import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";
import Accordion from "../../../components/Accordion.astro";
import HtmlEmbed from "../../../components/HtmlEmbed.astro";
In this section, we'll look at two steps models go through: how input text is preprocessed before being given to the model (`tokenization`), and how the model generates a prediction from it (`inference`).
<Sidenote>If you want to learn more about how to actually train a model, you should go read the [Smol Training Playbook](https://huggingface.co/spaces/HuggingFaceTB/smol-training-playbook)!</Sidenote>
<HtmlEmbed src="d3-tokenization.html" frameless />
### Tokenization
The input text (called a *prompt* at inference) is first split into *tokens*, small units of text (which can be one or several characters, up to the word level), each associated with a number. The whole range of tokens a model can parse is called its *vocabulary*.
#### Basics of tokenization: Why and how do we tokenize text?
Since large language models are actually big mathematical functions, they eat numbers, not text.
Say you want to transform a sentence into numbers. You first need to decide how to cut your sentence into small pieces, then map every small piece to a number; this is *tokenization*.
In the past, people would try to map each character of a text to its index in an alphabet (`a` -> 1, `b` -> 2, etc.), which is called *character-based tokenization* (you split between characters). On the other end of the spectrum, people also tried to map each word to its index in a dictionary (`a` -> 1, `aardvark` -> 2, `ab` -> 3, etc.), which is called *word-based tokenization* (you split on spaces, if your language has spaces - if not, it's a bit harder).
Both these methods share a strong limitation: they remove information from the input text. They erase the semantic connections you can see from word shape (e.g. `dis similar`, `similar`, `similar ity`, `similar ly`), information we would like our model to retain so it connects related words together.
(Plus, what happens if you suddenly get a completely new word in input? It has no number, and your model can't process it 😔)
Some people therefore had the idea to cut words into sub-words, and assign indices to these sub-words (`dis`, `similar`, `ity`, `ly`)!
This was initially done using morpho-syntactic rules (*morpho-syntax* is like the grammar of word creation). Nowadays, most people use byte pair encoding (BPE), a smart statistical method that creates the sub-words automatically based on their frequency in a reference text.
So, as a summary: tokenization is a way to map small units of text (which can be one or several characters, up to the word level) to numbers (similar to an index). When you want to process text, your input text (called a *prompt* at inference) is split into these *tokens* by a tokenizer. The whole range of tokens a model or tokenizer can parse is called its *vocabulary*.
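If you want to see this in action, here is a minimal sketch using a 🤗 `transformers` tokenizer (GPT-2 here, as an arbitrary BPE-based example; any checkpoint works):

```python
from transformers import AutoTokenizer

# GPT-2 is an arbitrary BPE-based example; any checkpoint works
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for word in ["similar", "dissimilar", "similarity", "similarly"]:
    tokens = tokenizer.tokenize(word)               # sub-word pieces
    ids = tokenizer.convert_tokens_to_ids(tokens)   # their indices in the vocabulary
    print(f"{word:>12} -> {tokens} -> {ids}")
```

The exact splits depend on the merges each tokenizer learned during training, so different checkpoints will cut these words differently.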
| <Note title="Going further: Understanding tokenization" emoji="📚" variant="warning"> | |
| - ⭐ [Explanation of different tokenization methods in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter2/4) | |
| - ⭐ [Conceptual guide about tokenization in the 🤗 doc](https://huggingface.co/docs/transformers/en/tokenizer_summary) | |
| - [Course by Jurafsky on tokenization (and other things)](https://web.stanford.edu/~jurafsky/slp3/2.pdf) - skip to 2.5 and 2.6 | |
| </Note> | |
| <Note title="Going further: Byte Pair Encoding" emoji="📚" variant="warning"> | |
| I would strongly recommend reading a longer explanation on how BPE works, as it's really a base of modern LLMs. | |
| - ⭐ [Explanation of BPE in the 🤗 NLP Course](https://huggingface.co/learn/nlp-course/en/chapter6/5) | |
| - [BPE Paper (for text, as the method existed before in other fields)](https://aclanthology.org/P16-1162/) | |
| </Note> | |
Building a tokenizer requires making more choices than one would expect. For example, to tokenize numbers, you don't want to use basic BPE - but do you only index 0 to 9 and assume all other numbers will be compositions of digits? Do you want to store numbers up to, say, one billion individually?
Current well-known models display a range of approaches to this, but it's unclear which works best to enable mathematical reasoning. This affects some mathematical evaluations (and is the reason why almost no evaluation is pure arithmetic).
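To get a feel for what your own models do, you can inspect a few tokenizers directly (a sketch; the checkpoints here are just examples, and some, like the Llama ones, require authentication on the Hub):

```python
from transformers import AutoTokenizer

# The checkpoints are illustrative; swap in whichever models you care about
for name in ["gpt2", "meta-llama/Llama-2-7b-hf"]:
    tokenizer = AutoTokenizer.from_pretrained(name)
    for number in ["7", "42", "1234567"]:
        print(name, number, tokenizer.tokenize(number))
```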
| <Note title="Going further: Tokenizing numbers" emoji="📚" variant="warning"> | |
| - ⭐ [A nice visual demo by Yennie Jun of how tokenizers of Anthropic, Meta, OpenAI, and Mistral models split numbers](https://www.artfish.ai/p/how-would-you-tokenize-or-break-down) | |
| - [Small history by Beren Millidge of the evolution of number tokenization through the years](https://www.beren.io/2024-05-11-Integer-tokenization-is-now-much-less-insane/) | |
| </Note> | |
#### How tokenization can mess up your evaluation
**Managing fine-tuned models, system prompts and chat templates**
Pre-2022, models were simply pretrained: text in, text out, nothing else. Then we got instruction tuning and chat models in 2023, and reasoning models in 2025. This means we went from using raw text to using more and more formatting.
<HtmlEmbed src="d3-tokenization-timeline.html" frameless />
This means a number of models are going to perform terribly if you do not make sure to:
1. respect the format the model expects
2. add a system prompt at the very beginning of inference if your model requires one
3. remove the thinking trace from reasoning models' answers before processing them (you can usually use a regex to remove what's between the `<think>` tags), as in the sketch below
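A sketch of both steps with 🤗 `transformers` (the checkpoint name is a placeholder, and the `<think>` tag convention varies between model families):

```python
import re
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-of-interest")  # placeholder

messages = [
    {"role": "system", "content": "You are a helpful assistant."},  # if the model expects one
    {"role": "user", "content": "What is the capital of France?"},
]
# Let the tokenizer apply the model's own chat template instead of formatting by hand
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

def strip_thinking_trace(answer: str) -> str:
    # Remove everything between <think> tags before scoring the answer
    return re.sub(r"<think>.*?</think>", "", answer, flags=re.DOTALL).strip()
```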
| <Note title="Critical: Chat templates and tokenization" emoji="⚡" variant="danger"> | |
| <Image src={chatTemplatesTokenisation} alt="Spacing, tokenization and template" /> | |
| Different tokenizers behave differently with spacing and special tokens. See this [visualization](https://x.com/danielhanchen/status/1796952220619157694) showing how spacing, tokenization, and templates interact. Never assume tokenizers behave identically! | |
| </Note> | |
**Paying attention to start and end of sentence tokens**
Some pretrained models, like the `Gemma` ones, are extremely sensitive to the [inclusion of start of sentence tokens](https://github.com/EleutherAI/lm-evaluation-harness/pull/1465) at inference. You might need to run a couple of experiments to see if this happens for you, and add these tokens manually when evaluating if they are not in your dataset.
You can also encounter issues where your model won't stop on an end of sentence token as you would expect. Code models have usually been trained with `\n\t` as a single token, which means that, when generating text, they will often emit `\n\t` in one step. A task which defines `\n` as an end of sentence token (= a signal to stop the generation) will let the model continue generating after a `\n\t` predicted as a single token, since it's not the same token as `\n` - but you would actually still want the model to stop there. In these cases, you either need to update your end of sentence tokens, or define a mechanism to backtrack on the character representation of the latest tokens to stop (and cut) the generation a posteriori.
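A minimal sketch of that a-posteriori backtracking, working on the decoded text rather than on token ids:

```python
def trim_at_stop_strings(text: str, stop_strings: list[str]) -> str:
    # Cut the generation at the first occurrence of any stop string. Working at the
    # character level means a multi-character token like "\n\t" can't slip past a
    # "\n" stop condition the way it can during token-level stopping.
    cut = len(text)
    for stop in stop_strings:
        index = text.find(stop)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]

# trim_at_stop_strings("def f():\n\treturn 1", ["\n"]) returns "def f():"
```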
**Multilinguality and tokenization**
When looking at multilingual evaluations, you'll encounter two issues.
First, since some languages do not use spacing as a word separator (Korean, Thai, Japanese, and Chinese, to cite a few), they require language-specific tokenizers to be split properly; otherwise, this will affect their scores on metrics such as [BLEU](https://github.com/EleutherAI/lm-evaluation-harness/issues/212), F1 scores, etc.
Then, tokenizers in general can be unfair to non-English languages. When training a BPE tokenizer, you use data from the different languages you want to cover, but most of the time this data is unbalanced between languages (with, for example, an order of magnitude more English than Thai or Burmese). Since BPE tokenizers create their vocabulary tokens based on the most frequent sequences seen, most of the long tokens will be English words - and most of the words from the less frequent languages will only be split at the character level. This leads to an unfairness in multilingual tokenization: some (less frequent, or *lower-resourced*) languages require orders of magnitude more tokens than English for a sentence of equivalent length.
<iframe
	src="https://OpenEvals-tokenizers-languages.hf.space"
	frameborder="0"
	width="850"
	height="450"
></iframe>
If you are in this situation, the number of tokens the model is allowed to generate for an evaluation should also be language dependent, since the same content is not tokenized into a similar number of tokens across languages.
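One possible way to do this (a sketch: the checkpoint, the English anchor value, and the scaling rule are all assumptions you would tune for your own setup) is to estimate a per-language tokens-per-character ratio from reference answers and scale the generation budget accordingly:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-model-of-interest")  # placeholder

def max_new_tokens_for(references: list[str], base_budget: int = 256) -> int:
    # Tokens per character for this language's reference answers
    fertility = sum(len(tokenizer(r)["input_ids"]) for r in references) / sum(
        len(r) for r in references
    )
    english_fertility = 0.25  # assumed anchor: roughly 4 characters per token in English
    return int(base_budget * fertility / english_fertility)
```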
| <Note title="Going further: Language and tokenization" emoji="📚" variant="warning"> | |
| - ⭐ [A beautiful breakdown and demo by Yennie Jun on tokenization issues across languages](https://www.artfish.ai/p/all-languages-are-not-created-tokenized): The breakdown in itself is very clear, and the embedded space comes from her work. | |
| - ⭐ [A demo by Aleksandar Petrov on unfairness of tokenization](https://aleksandarpetrov.github.io/tokenization-fairness/): I recommend looking at `Compare tokenization of sentences` to get a feel for the differences in cost of inference depending on languages | |
| </Note> | |
### Inference
Now that we know how to convert our input text into something LLMs can parse, let's look at how models process this text.
From this input text, the LLM generates a probability distribution over the whole vocabulary for the most likely next token. To get a continued generation, we can take the most probable token (give or take some added randomness to get more interesting outputs) as the next one, then repeat the operation with this new token appended to the prompt, and so on.
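To make the mechanism concrete, here is a minimal greedy loop with 🤗 `transformers` (GPT-2 as a stand-in model; in practice you would call `model.generate`, which handles this and much more):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The capital of France is", return_tensors="pt")["input_ids"]
for _ in range(5):
    with torch.no_grad():
        logits = model(ids).logits          # (batch, seq_len, vocab_size)
    next_id = logits[0, -1].argmax()        # greedy: most probable next token
    ids = torch.cat([ids, next_id.view(1, 1)], dim=-1)
print(tokenizer.decode(ids[0]))
```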
<Image src={llmTk1} alt="LLM tokenization and prediction process" />
<Note title="Two main evaluation approaches" emoji="🎯" variant="info">
**Log-likelihood evaluations**: Given a prompt and one (or several) answers, what is the probability of said answer(s) under my model?
**Generative evaluations**: Given a prompt, what text does my model generate?
The choice depends on your task: multiple-choice questions use log-likelihood, while open-ended tasks require generative evaluation.
</Note>
#### Log-likelihood evaluations
For log-likelihood evaluations, we want the conditional probability of one or several choices given a prompt - in other terms, what is the likelihood of a specific continuation given an input?
So:
- we concatenate each choice with the prompt, and pass it to our LLM, which outputs the logits of each token given the previous ones
- we keep only the last logits (those associated with the choice tokens), and apply a log softmax to get log-probabilities (with range `[-inf, 0]` instead of `[0, 1]`)
- we then sum the individual tokens' log-probabilities to get the overall choice log-probability
- we can finally apply a normalization based on choice length, as in the sketch after the figure below
<Image src={llmLogprob} alt="LLM log-likelihood evaluation process" />
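A sketch of these steps for a 🤗 causal LM (`model` and `tokenizer` are assumed to be any compatible pair; the character normalization at the end is one of several conventions, and the accordion below discusses a tokenization caveat this simple version ignores):

```python
import torch
import torch.nn.functional as F

def choice_logprob(model, tokenizer, prompt: str, choice: str) -> float:
    prompt_ids = tokenizer(prompt, return_tensors="pt")["input_ids"]
    full_ids = tokenizer(prompt + choice, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(full_ids).logits
    # log softmax over the vocabulary -> log-probabilities in [-inf, 0]
    logprobs = F.log_softmax(logits, dim=-1)
    # the logits at position i predict the token at position i + 1, hence the shift
    n_prompt = prompt_ids.shape[1]
    choice_ids = full_ids[0, n_prompt:]
    choice_logprobs = logprobs[0, n_prompt - 1 : -1].gather(-1, choice_ids.unsqueeze(-1))
    return choice_logprobs.sum().item() / len(choice)  # normalized by character count
```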
This allows us to apply one of the following metrics (see the sketch below):
- get the preferred answer of a model among several choices, like in the picture above. (*However, this can advantage the scores of models which would, generating freely, have produced something else, like `Zygote` in the picture.*)
- test if a single choice has a probability above 0.5
- study model calibration. A well-calibrated model is one for which the correct answers have the highest probabilities.
<Sidenote>
To learn more about calibration, you can check [this paper](https://arxiv.org/abs/2207.05221) from Anthropic on what it is, how to detect it, and how to train models to be well calibrated, as well as [this paper](https://arxiv.org/abs/2311.14648) on some possible limits of calibration.
</Sidenote>
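Turning per-choice log-probabilities (for instance from a function like `choice_logprob` above) into these metrics is then straightforward; a sketch with made-up numbers:

```python
import math

logprobs = {"Zygote": -2.1, "Paris": -0.4, "Helsinki": -3.0}  # made-up values
gold = "Paris"

prediction = max(logprobs, key=logprobs.get)  # the model's preferred answer
is_correct = prediction == gold               # accuracy against the gold answer
gold_probability = math.exp(logprobs[gold])   # back to [0, 1] for threshold tests
above_half = gold_probability > 0.5
```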
<Note>
A multiple choice question can be expressed as a free-form generative evaluation too! For this reason, you'll sometimes see a mention of the task **formulation**.
There are three common task formulations:
- **Multiple choice format (MCF)**: we compare the likelihood of choice indices, where choices are explicitly presented in the prompt and prefixed with A/B/C/D (as in MMLU)
- **Cloze formulation (CF)**: we compare the likelihood of the different choices without providing them in the prompt
- **Freeform generation (FG)**: we evaluate the accuracy of greedy generation for a given prompt
FG requires substantial latent knowledge and is usually too difficult for models during short pre-training ablations. For this reason, we typically focus on multiple choice formulations (MCF or CF) when running small-scale ablations. However, for post-trained models, FG becomes the primary formulation, since we're evaluating whether the model can actually generate useful responses.
However, research has also shown that models struggle with MCF early in training, only learning this skill after extensive training, which makes CF better for early signal. We thus recommend using CF for small ablations, and integrating MCF in the main run, as it gives better mid-training signal once a model has passed the threshold of a sufficiently high signal-over-noise ratio for MCF.
A quick note: to score a model's answer in sequence-likelihood evaluations like CF, we compute accuracy as the percentage of questions where the correct answer has the highest log probability, normalized by character or token count. This normalization prevents a bias toward shorter answers.
<Sidenote>
The point at which MMLU MCF becomes non-random depends on the model size and training data. For a 7B transformer, the OLMES paper found the model starts showing non-random performance after 500B tokens. For a 1.7B model, we found this happens after 6T tokens in SmolLM2.
</Sidenote>
</Note>
<Accordion title="Should you always tokenize the context together with the choices?">
When looking at MCQA (multiple choice question answering) evaluations, you generally want to tokenize the context together with the choices, as this creates a succession of tokens which is likely/natural for the model.
However, some tokenizers (like the [Llama one](https://github.com/EleutherAI/lm-evaluation-harness/pull/531#issuecomment-1595586257)) do not satisfy `tok(context + choice) = tok(context) + tok(choice)` (they add or remove spacing). This means that comparing the log-probabilities of the choices alone is not trivial, as the context tokens can "bleed out" into them, messing up the comparison.
To give a concrete example, say you have characters `C1`, `C2`, and `C3` as base tokens of your vocabulary, and `C1C2` also happens to be a single token learned during BPE.
Say your context is `C1`, and the choices `C2` and `C3`.
If you tokenize the context together with the choices, you compare `C1C2` (one token) with `C1 + C3` (two tokens). Even if you normalize the log-probs by length, you are not comparing the same thing.
Comparing after tokenizing the context and choices separately means you compare `C1 + C2` and `C1 + C3`. But since `C1C2` is a token, the occurrence of `C1 + C2` is likely rare in the data your model saw, so it is an unlikely succession for your model, which can mess up your log-probabilities.
If this is the case for your model, the solution is usually to go for the least-worst option, comparing the comparable: compute the tokens of context and choice separately, then concatenate them after removing the special start/end of sentence tokens which might have been added.
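A quick sanity check for this property (a sketch; special tokens are disabled so BOS/EOS don't confound the comparison):

```python
def is_concatenation_consistent(tokenizer, context: str, choice: str) -> bool:
    # Does tok(context + choice) == tok(context) + tok(choice)?
    joint = tokenizer(context + choice, add_special_tokens=False)["input_ids"]
    split = (
        tokenizer(context, add_special_tokens=False)["input_ids"]
        + tokenizer(choice, add_special_tokens=False)["input_ids"]
    )
    return joint == split
```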
</Accordion>
#### Generative evaluations
For a generative evaluation, we want the text generated by the model given an input prompt.
It is obtained in an auto-regressive way: we pass the prompt to the model, look at the most likely next token, select it as the first token of the model's "choice", then repeat until we reach an end-of-generation condition (maximum length, special token stopping the generation, etc.). All the tokens generated by the model are considered its answer to the prompt.
<Image src={llmGen} alt="LLM generative evaluation process" />
We can then compare this generation with references and score the distance between the two (using either simple metrics like exact match, more complex metrics like BLEU, or models as judges).
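A sketch of such a scoring step with exact match (again assuming any compatible 🤗 causal LM and tokenizer pair; real harnesses normalize answers in task-specific ways before comparing):

```python
def exact_match(model, tokenizer, prompt: str, reference: str, max_new_tokens: int = 64) -> bool:
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Keep only the newly generated tokens, then compare with the reference
    answer = tokenizer.decode(
        output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    return answer.strip() == reference.strip()
```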
| <Note title="Going further" emoji="📚" variant="warning"> | |
| - ⭐ [Blog on several ways to evaluate MMLU](https://huggingface.co/blog/open-llm-leaderboard-mmlu) , by my team at Hugging Face. I recommend reading it if you want to delve deeper into the differences between multi choice log-likelihood evaluations and generative ones, including what it can mean with respect to score changes (The above illustrations come from the blog and have been made by Thom Wolf) | |
| - ⭐ [A beautiful mathematical formalization of the above inference methods](https://arxiv.org/abs/2405.14782v2), from EleutherAI. Go to the Appendix directly. | |
| </Note> | |