---
title: "Picking good automatic evaluations for pretraining"
---
import Note from "../../../components/Note.astro";
import Sidenote from "../../../components/Sidenote.astro";
import HtmlEmbed from "../../../components/HtmlEmbed.astro";
In some cases, you don't want to "just" reproduce existing scores a posteriori: you need to understand how well your model is training while it's happening. The evaluations you need then have different properties from those measuring the final performance of models, as you need tasks that provide good signal even when the model is not yet very good.
So the FineWeb team designed a method to select the best evaluations for pre-training ablations, across 9 languages - let's listen to their wise advice.
For these languages, we collected and implemented all available tasks that we could find, a total of **185 tasks**. Then, we began task selection with two primary goals: ensuring **evaluation diversity**, and making sure each task provided a **reliable signal** during pre-training.
For evaluation diversity, we aimed to assess a broad range of model capabilities, including:
- **Reading comprehension (RC)**: Understanding provided context and answering questions based on it.
- **General knowledge (GK)**: Answering questions about facts from various fields without added context.
- **Natural Language Understanding (NLU)**: Comprehending the semantics of provided input.
- **Common-sense reasoning (RES)**: Demonstrating the ability to perform simple reasoning requiring embodied knowledge.
- **Generative tasks**: Ability to generate text in the target language without the "help" of multiple choice options.
We consider that a task provides a reliable signal if it yields a dependable score. This means the score should be above the random baseline, increase as training progresses, show low variability across different seeds, and produce consistent model rankings at each training step<d-footnote>For similarly sized models trained with the same hyperparameters on the same amount of data.</d-footnote>.
To thoroughly examine the signal our tasks provide, we trained many 1.5B parameter models for each language, using 30B tokens from subsets of the supported languages of the five largest openly available multilingual web datasets. These models were trained with the same hyperparameters and tokenizer. We then evaluated them at regular checkpoint intervals on the collected tasks (with no instruction and no system prompt in a 0-shot setting).
This process required multiple evaluation runs for each task due to iterations on its implementation, resulting in a total of **73 000 GPU hours consumed** 🔥!
With **49 models trained**, we could finally define what a **reliable signal** means to us!
#### Monotonicity
One of our core requirements for a task is that it can be learned from the training data and that this **learning can be gradually observed as training progresses**. Without such improvement over time, it's uncertain whether the task will ever improve at all.
To measure this, we used the **Spearman rank correlation** to quantify the correlation between steps and score. Spearman rank correlation can capture monotonicity even when scores don't evolve linearly with the number of steps. We required each task to have an average correlation of at least 0.5 across all model training runs.
<HtmlEmbed
src="d3-two-lines-chart.html"
config={{
charts: [
{
title: "✅ Good monotonicity: mlmm_hellaswag_fra_cf [fr]",
language: "French",
task: "mlmm_hellaswag_fra_cf",
metric: "acc_norm_token"
},
{
title: "❌ Bad monotonicity: mlmm_truthfulqa_ara_cf:mc1 [ar]",
language: "Arabic",
task: "mlmm_truthfulqa_ara_cf:mc1",
metric: "acc_norm_token"
}
],
statLabel: "Monotonicity",
smoothing: true,
smoothingWindow: 5,
smoothingCurve: "monotoneX",
xAxisLabel: "Training Tokens (billions)",
yAxisLabel: "Score"
}}
frameless={true}
/>
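As a rough sketch (not our exact implementation), the monotonicity filter boils down to a Spearman correlation between checkpoint index and score, averaged over runs; the `scores_per_run` structure below is an assumed input format:

```python
import numpy as np
from scipy.stats import spearmanr

def monotonicity(scores_per_run: list[list[float]]) -> float:
    """Average Spearman rank correlation between checkpoint index and score.

    scores_per_run: one list of per-checkpoint scores for each training run
    of the task (assumed input format for this sketch).
    """
    correlations = []
    for scores in scores_per_run:
        steps = np.arange(len(scores))        # checkpoint indices
        rho, _ = spearmanr(steps, scores)     # rank correlation, robust to non-linear growth
        correlations.append(rho)
    return float(np.mean(correlations))

# Keep the task only if the average correlation is at least 0.5:
# keep_task = monotonicity(task_scores) >= 0.5
```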
#### Low noise
When comparing model performance on tasks, we need to consider whether differences are due to **evaluation noise or genuine performance variations**.
Noise can arise from the stochastic processes involved in model training, such as random token sampling, data shuffling, or model initialization ([Madaan et al., 2024](https://arxiv.org/abs/2406.10229)). To measure how sensitive each task is to this noise, we trained four additional models on our own monolingual corpora (unfiltered CommonCrawl data in each language) using different seeds.
For each task, we computed:
1. First, the standard deviation of model scores at every step (approximately every 1B tokens), which we call the **per-step-std**.
2. Then, to obtain a global variability measurement, we averaged all the per-step-std values to get the **avg-std** over the full training. We assume this value is an upper-bound across model architectures and training datasets (as it was approximated by models trained on a "dirtier" dataset, therefore with higher variability).
3. Finally, we computed the **signal-to-noise ratio** (SNR) as the main metric for task variability. We calculate SNR as the mean score of all runs at 30B tokens divided by the avg-std. This metric measures how significant the overall score is relative to the score variations (noise).
We aimed for each task to have an SNR > 20. The only exception to this rule are generative tasks, which typically have relatively low SNR, but are still worth including as they provide insights into how the model behaves when prompted to generate unconstrained (without answer options). In a multilingual setting, this is particularly relevant as some models trained on multiple languages can exhibit high task scores but then suddenly reply in the wrong language for generative tasks!
<HtmlEmbed
src="d3-two-lines-chart.html"
config={{
charts: [
{
title: "✅ Good SNR: xstory_cloze_tel_cf [te]",
language: "Telugu",
task: "xstory_cloze_tel_cf",
metric: "acc_norm_token"
},
{
title: "❌ Bad SNR: tydiqa_tel [te]",
language: "Telugu",
task: "tydiqa_tel",
metric: "prefix_match"
}
],
statLabel: "SNR",
groupSeeds: false,
smoothing: true,
smoothingWindow: 5,
smoothingCurve: "monotoneX",
xAxisLabel: "Training Tokens (billions)",
yAxisLabel: "Score"
}}
frameless={true}
/>
<Note>
Assuming model performance is normally distributed across different seeds, we want the benchmark-run performance to be at least 3 final-stds (the per-step-std at the final checkpoint) above the benchmark random baseline. This would mean that about 99.87% of seed scores are above the random baseline (formally, benchmark-run performance - benchmark random baseline > 3 * final-std).
</Note>
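For illustration, here is a minimal sketch of how per-step-std, avg-std, and SNR could be computed from the seed runs; the `(n_seeds, n_steps)` array layout is an assumption of this sketch:

```python
import numpy as np

def signal_to_noise_ratio(seed_scores: np.ndarray) -> float:
    """seed_scores: array of shape (n_seeds, n_steps), one score per
    checkpoint (~every 1B tokens) for each seed run (assumed layout)."""
    per_step_std = seed_scores.std(axis=0)    # std across seeds at each step
    avg_std = per_step_std.mean()             # global variability estimate
    final_mean = seed_scores[:, -1].mean()    # mean score of all runs at 30B tokens
    return final_mean / avg_std

# Keep the task if SNR > 20 (generative tasks are the exception to this rule):
# keep_task = signal_to_noise_ratio(scores) > 20
```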
#### Non-Random Performance
Many model capabilities are only acquired later in training, so **many tasks** (especially harder ones, such as math-related tasks) **show baseline-level performance for an extended period**. While these tasks are useful, they're not ideal for early pre-training evaluation, and **we did not want to keep them** for this setting.
We first computed the random baseline performance of the task (the average of 1/n_choices over all samples for multiple-choice questions, and zero for generative evaluations). Then we calculated the task's distance from the baseline as the maximum score across all models minus the baseline.
<HtmlEmbed
src="d3-two-lines-chart.html"
config={{
charts: [
{
title: "✅ Non-random: agieval_zho_cf/acc_pmi [zh]",
language: "Chinese",
task: "agieval_zho_cf:_average",
metric: "acc_norm_pmi"
},
{
title: "❌ Random perf: agieval_zho_cf/acc [zh]",
language: "Chinese",
task: "agieval_zho_cf:_average",
metric: "acc"
}
],
statLabel: "Non-Randomness",
smoothing: true,
smoothingWindow: 5,
smoothingCurve: "monotoneX",
xAxisLabel: "Training Tokens (billions)",
yAxisLabel: "Score"
}}
frameless={true}
/>
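A minimal sketch of this computation, assuming we have the number of answer choices per sample and the per-model scores at hand:

```python
def random_baseline(n_choices_per_sample: list[int]) -> float:
    """Random-guessing accuracy for a multiple-choice task; generative
    tasks use a baseline of zero instead."""
    return sum(1 / n for n in n_choices_per_sample) / len(n_choices_per_sample)

def distance_from_baseline(model_scores: list[float], baseline: float) -> float:
    """Distance of the task from its random baseline: the maximum score
    across all models minus the baseline."""
    return max(model_scores) - baseline
```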
#### Model Ordering Consistency
Let's not forget that the main goal of these evaluations is to compare models and datasets!
In the future, we want to use these evaluations to select the best datasets for full model pretraining. This means **our tasks should rank datasets trained on very few tokens (we typically run data ablations on 30B tokens) in the same order as they would when trained for longer, over significantly more steps.**
In other words, we would like tasks to have **predictive capability regarding future performance during pre-training**: if pre-training dataset A outperforms pre-training dataset B at 30 billion tokens, we would like this trend to continue at 300 billion tokens.
Proving this is inherently impossible, but there is a necessary preliminary condition that we can test for: for the results to be consistent at large scales, they must also first show consistency at smaller scales!
To measure this consistency in task ordering, we computed the average **Kendall's Tau** of model rankings between every two consecutive steps. We only considered steps after 15B tokens of pre-training, as we found orderings before that point to be extremely noisy. A high value of this metric indicates that the ordering remains consistent as training progresses.
<Note>
We had no strict minimum value requirement for this property, instead using it to establish comparisons between tasks.
</Note>
<HtmlEmbed
src="d3-two-lines-chart.html"
config={{
charts: [
{
title: "✅ Good ordering: xcsqa_ara_cf [ar]",
language: "Arabic",
task: "xcsqa_ara_cf",
metric: "acc_norm_token"
},
{
title: "❌ Bad ordering: thai_exams_tha_cf [th]",
language: "Thai",
task: "thai_exams_tha_cf:_average",
metric: "acc_norm_token"
}
],
statLabel: "Kendall's Tau",
smoothing: true,
smoothingWindow: 5,
smoothingCurve: "monotoneX",
xAxisLabel: "Training Tokens (billions)",
yAxisLabel: "Score"
}}
frameless={true}
/>
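As a sketch of the ordering-consistency metric (the `(n_models, n_steps)` score matrix and the 15B-token cutoff index are assumed inputs):

```python
import numpy as np
from scipy.stats import kendalltau

def ranking_consistency(scores: np.ndarray, first_step: int = 0) -> float:
    """Average Kendall's tau of model rankings between consecutive checkpoints.

    scores: array of shape (n_models, n_steps); first_step: index of the first
    checkpoint to consider (e.g. the one after 15B tokens), since earlier
    orderings are too noisy.
    """
    taus = []
    for step in range(first_step, scores.shape[1] - 1):
        tau, _ = kendalltau(scores[:, step], scores[:, step + 1])
        taus.append(tau)
    return float(np.mean(taus))
```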
#### Metrics
As the targets in the CF version of multiple-choice tasks are the answer choices themselves, each target can have a different number of tokens, a different number of characters, and a different unconditional probability (the probability of generating the choice without a context prefix).
<Note>Measuring accuracy without normalization would have the models prefer answers with fewer tokens, for example.</Note>
To account for this, we consider the following accuracy variations:
- **Accuracy**:
  `acc` = $\underset{i}{\arg\max}\ \ln P(a_i|q)$
- **Accuracy normalized over character length**:
  `acc_char` = $\underset{i}{\arg\max}\frac{\ln P(a_i|q)}{num\_characters(a_i)}$
- **Accuracy normalized over token length**:
  `acc_token` = $\underset{i}{\arg\max}\frac{\ln P(a_i|q)}{num\_tokens(a_i)}$
- **PMI Accuracy**:
  `acc_pmi` = $\underset{i}{\arg\max}\ln\frac{P(a_i|q)}{P(a_i|u)}$, where $u =$ "Answer:"

where $a_i$ is answer choice $i$, $q$ is the question prompt, and $P(a_i|q)$ is the probability of $a_i$ following $q$. For more details, see [Gu et al., 2024](https://arxiv.org/abs/2406.08446) and [Biderman et al., 2024](https://arxiv.org/abs/2405.14782).
<Note>The `acc_pmi` metric measures how much more likely the model is to predict $a_i$ when provided with the question context than with no context at all. This can be useful when the correct choice contains generally unlikely tokens, which would otherwise make the model less likely to select it.</Note>
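To make the differences between these variants concrete, here is a small sketch of how an answer would be picked under each one, given per-choice log-probabilities (the argument names are ours, not an existing API):

```python
import numpy as np

def pick_answer(logprobs_cond, logprobs_uncond, n_chars, n_tokens, metric="acc_token"):
    """Return the index of the chosen answer under a given accuracy variant.

    logprobs_cond[i]   = ln P(a_i | q)
    logprobs_uncond[i] = ln P(a_i | u), with u = "Answer:" (only used for PMI)
    n_chars[i], n_tokens[i] = length of choice a_i in characters / tokens
    """
    cond = np.asarray(logprobs_cond, dtype=float)
    if metric == "acc":
        scores = cond
    elif metric == "acc_char":
        scores = cond / np.asarray(n_chars, dtype=float)
    elif metric == "acc_token":
        scores = cond / np.asarray(n_tokens, dtype=float)
    elif metric == "acc_pmi":
        scores = cond - np.asarray(logprobs_uncond, dtype=float)
    else:
        raise ValueError(f"Unknown metric: {metric}")
    return int(np.argmax(scores))
```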
For our generative tasks on the other hand, we used the following metrics:
- `prefix_match`: Exact match where only the prefix of the answer must match
- `f1`: F1 score computed over predicted/gold words extracted using a word tokenizer
For both generative metrics, minor preprocessing is applied to remove articles and punctuation, and lowercase the text.
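As an illustration of the generative metrics and their preprocessing (a simplified sketch; in particular, we assume "prefix match" means the normalized generation only has to start with the normalized gold answer, and article removal here only covers English):

```python
import re
import string
from collections import Counter

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and (English) articles, collapse whitespace."""
    text = text.lower()
    text = "".join(c for c in text if c not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def prefix_match(prediction: str, gold: str) -> float:
    return float(normalize(prediction).startswith(normalize(gold)))

def f1(prediction: str, gold: str) -> float:
    pred_words, gold_words = normalize(prediction).split(), normalize(gold).split()
    common = Counter(pred_words) & Counter(gold_words)   # word-level overlap
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_words)
    recall = num_same / len(gold_words)
    return 2 * precision * recall / (precision + recall)
```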
Selecting the best evaluation metrics proved to be a **challenging task**. Not only is there no single metric that consistently outperforms the rest, but we often encountered situations where one metric had better monotonicity while another had a higher signal-to-noise ratio. In such cases, we typically made our decision based on the metric selected for the same task's implementation in another language. We are aware that such hand-picking is often not possible and thus offer the following recommendations:
➡️ Multichoice Tasks
- We found **base accuracy** to perform well for tasks whose answer options vary only subtly (e.g. Yes/No/Also), particularly NLI tasks. In such cases, where each answer option is often a single token, base accuracy is advisable.
- While OLMES [Gu et al., 2024](https://arxiv.org/abs/2406.08446) recommends using PMI for tasks with unusual words, we found **PMI** to be highly effective for "difficult" reasoning and knowledge tasks like AGIEval or MMLU. In these cases, PMI provided the best results and was often the only metric delivering above-random performance. That said, PMI was, on average, the weakest metric across all other tasks, while also being twice as expensive to compute. We therefore only recommend its use for complex reasoning and knowledge tasks.
- The metrics we found to be **most reliable overall** were the length-normalized ones (token- or character-based). However, the best choice depended on the language rather than being consistent for a given task. Because of this, we recommend using the maximum of `acc_char` and `acc_token` for the most reliable results.<d-footnote>Note that acc_token is heavily tokenizer dependent. In our ablations, all models were trained with the same tokenizer.</d-footnote>
➡️ Generative Tasks
For **generative metrics**, the choice is clearer: we suggest using the F1 score unless exact matching is required, as in math-related tasks. F1 is generally less noisy and more resilient to small changes in the generations.