The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix
We trained a GPT-2-style model to over 90% of GPT-2's benchmark performance using just 1/10th of the training data, guided by 50+ systematic experiments on dataset mixing strategies.
Key Finding:
A static mix of 50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu consistently outperforms complex curriculum learning approaches. Static mixing is simpler, faster, and avoids catastrophic failures from hard distribution shifts.
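For anyone who wants to try the static mix, here is a minimal sketch using the `datasets` library's `interleave_datasets`. The dataset repo IDs and the `text` column name are illustrative assumptions, not the exact sources we used; swap in the sample datasets from the collection linked below.

```python
# Minimal sketch of a static 50/30/20 mix with Hugging Face `datasets`.
# Repo IDs below are placeholders; the `text` column is assumed.
from datasets import load_dataset, interleave_datasets

finepdfs = load_dataset("codelion/finepdfs-sample", split="train", streaming=True)
dclm = load_dataset("codelion/dclm-baseline-sample", split="train", streaming=True)
fineweb_edu = load_dataset("codelion/fineweb-edu-sample", split="train", streaming=True)

# Static mixing: every sample is drawn from the same fixed distribution,
# so the model never sees a hard distribution shift mid-training.
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in mixed.take(3):
    print(example["text"][:80])
```

Because the sampling probabilities stay fixed for the entire run, there is no staged switch for the optimizer to absorb, which is what makes this simpler and more robust than a curriculum schedule.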
Results:
Our GPT-2-70M model (70M parameters, 1B training tokens) scores 38.15% on benchmarks vs. GPT-2's 39.13%, only 0.98 points behind despite 10x less training data and 44% fewer parameters. It even beats GPT-2 on TruthfulQA (47.31% vs. 40.69%).
The takeaway: careful dataset curation matters more than total data volume.
Model: codelion/gpt-2-70m
Datasets: https://huggingface.co/collections/codelion/pre-training-dataset-samples
Full blog: https://huggingface.co/blog/codelion/optimal-dataset-mixing
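To try the released checkpoint, here is a quick generation sketch with `transformers`, assuming the repo ships a standard GPT-2-style config and tokenizer:

```python
# Load the released model and sample a short continuation.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codelion/gpt-2-70m")
model = AutoModelForCausalLM.from_pretrained("codelion/gpt-2-70m")

inputs = tokenizer("The key to good pre-training data is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```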