The 1 Billion Token Challenge: Finding the Perfect Pre-training Mix

Published November 3, 2025

Training GPT-2 on a Budget: 90%+ Performance with 1/10th the Data


Most modern language models are trained on 1 trillion+ tokens of data, requiring massive computational resources and months of training time. But what if you could achieve over 90% of the performance with just 1/10th of the training data? Through more than 50 systematic experiments, we discovered the optimal recipe for creating efficient pre-training datasets.

This is the story of how we trained codelion/gpt-2-70m using a carefully crafted 1 billion token dataset, achieving performance comparable to the original GPT-2 while using dramatically less data. Along the way, we uncovered critical insights about dataset mixing strategies, discovered catastrophic failure modes in curriculum learning, and identified the "Goldilocks zone" for synthetic content.

TL;DR: We found that a static 50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu mixture consistently outperforms complex curriculum strategies, avoiding catastrophic failures while maintaining excellent generalization.


The Problem: Can We Train Smarter, Not Harder?

Training large language models has become an arms race of scale. Modern models consume trillions of tokens during pre-training, with training runs lasting months and costing millions of dollars. The assumption is simple: more data = better models.

But this raises a crucial question: Is all that data equally valuable?

We suspected the answer was no. Recent research on synthetic data and careful curation suggested that dataset quality matters as much as quantity. The challenge was finding the optimal composition: the perfect balance between different types of training data.

Our goal: Create a 1 billion token dataset that could train a GPT-2-sized model to match the performance of models trained on 10x more data.


Our Approach: Systematic Dataset Mixing Experiments

We conducted over 50 controlled experiments using a GPT-2 architecture with 70 million parameters, systematically testing different combinations of three data sources from our pre-training dataset collection:

Dataset Types:

  1. finePDFs (500M tokens) - High-quality textbook-style educational PDFs
  2. DCLM-baseline (300M tokens) - Filtered, diverse web content
  3. FineWeb-Edu (200M tokens) - Curated educational web resources

These dataset samples were created from their much larger source datasets using reservoir sampling, an algorithm that guarantees statistically unbiased random samples. This ensures our experimental results at 10M, 100M, and 1B token scales are representative of the full datasets' characteristics, allowing us to rapidly iterate on mixing strategies without processing terabytes of data.
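
For intuition, here is a minimal sketch of the classic reservoir-sampling routine (Algorithm R); the actual sampling pipeline behind these subsets may differ, and the function below is ours, purely for illustration:

import random

def reservoir_sample(stream, k, seed=42):
    """Keep a uniformly random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)       # inclusive; item i survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

# Example: sample 5 documents from a large iterable without loading it all into memory
sample = reservoir_sample((f"doc-{n}" for n in range(1_000_000)), k=5)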

We evaluated performance using two key metrics:

  • Validation Perplexity: In-domain performance (lower is better)
  • FineWiki Perplexity: Out-of-domain generalization (lower is better)
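
Both metrics are perplexities, i.e. the exponential of the mean per-token negative log-likelihood on a held-out corpus. A minimal sketch of how you could compute this with the transformers API (not the exact evaluation harness used in these experiments; the function name is ours):

import math
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

def corpus_perplexity(model_id, texts, max_length=1024):
    """Perplexity = exp(total NLL / total predicted tokens) over a list of documents."""
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id).eval()
    total_nll, total_tokens = 0.0, 0
    for text in texts:
        enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=max_length)
        with torch.no_grad():
            out = model(**enc, labels=enc["input_ids"])
        n_targets = enc["input_ids"].shape[1] - 1   # loss is averaged over the shifted targets
        total_nll += out.loss.item() * n_targets
        total_tokens += n_targets
    return math.exp(total_nll / total_tokens)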

The experiments progressed through phases:

  1. Phase 1: Rapid exploration at 10M tokens (15 experiments)
  2. Phase 2: Detailed analysis at 100M tokens with static mixtures (25+ experiments)
  3. Phase 3: Curriculum learning strategies at 100M tokens (15 experiments)
  4. Final Training: Scaled winning 50-30-20 mixture to 1B tokens to train GPT-2-70M



Discovery #1: The 50-30-20 Sweet Spot

After testing ratios ranging from 0% to 100% synthetic content, we discovered the optimal composition:


50% finePDFs + 30% DCLM-baseline + 20% FineWeb-Edu

This configuration achieved:

  • 27.38 validation perplexity (excellent in-domain performance)
  • 346 FineWiki perplexity (best generalization across all experiments)
  • 12.6x generalization ratio (FineWiki PPL divided by validation PPL)

Why does this work so well? The mixture balances three competing objectives:

  1. finePDFs (50%): Provides high-quality, grammatically perfect examples with clear structure. This anchors the model's understanding of language structure and correctness.

  2. DCLM-baseline (30%): Introduces diversity and real-world language patterns. Prevents the model from overfitting to synthetic patterns while maintaining reasonable quality.

  3. FineWeb-Edu (20%): Bridges the gap between synthetic perfection and web messiness. Provides domain-specific knowledge in natural writing styles.

[Figure: performance comparison]


Discovery #2: The Validation-Generalization Tradeoff

One of our most important insights was understanding the fundamental tradeoff between validation performance (how well the model fits the training distribution) and generalization (how well it performs on unseen data).

[Figure: validation vs. generalization]

Pure finePDFs achieves incredible validation performance (6.76 PPL) but catastrophically fails at generalization (4,846 FineWiki PPL), a 717x ratio! The model essentially memorizes the synthetic patterns but can't transfer to natural text.

Pure DCLM-baseline shows the opposite pattern: decent generalization (994 PPL) but poor validation performance (126.76 PPL). The model sees diverse examples but struggles to learn clear patterns.

The 50-30-20 mixture sits at the optimal point on the Pareto frontier:

  • 4x worse validation than pure synthetic (27.38 vs 6.76)
  • But 14x better generalization (346 vs 4,846)

This taught us a crucial lesson: You can't have both <10 validation PPL AND <500 FineWiki PPL. The sweet spot is accepting 20-30 validation PPL to achieve 300-400 FineWiki PPL.
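
To make the tradeoff concrete, here is a tiny snippet that recomputes the generalization ratio (FineWiki PPL divided by validation PPL) from the figures reported above:

# Generalization ratio = FineWiki PPL / validation PPL, using the reported figures
results = {
    "pure finePDFs": (6.76, 4846),
    "pure DCLM-baseline": (126.76, 994),
    "50-30-20 mixture": (27.38, 346),
}
for name, (val_ppl, finewiki_ppl) in results.items():
    print(f"{name}: {finewiki_ppl / val_ppl:.1f}x")
# pure finePDFs: 716.9x, pure DCLM-baseline: 7.8x, 50-30-20 mixture: 12.6x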


Discovery #3: The Hard Cutoff Catastrophe

We initially hypothesized that curriculum learning (gradually changing the data distribution during training) would outperform static mixing. After all, humans learn by starting with simple concepts and progressing to complex ones.

We were wrong. In fact, most curriculum strategies performed significantly worse than simple static mixing.

[Figure: curriculum failure modes]

We discovered two systematic failure modes:

Forward Catastrophe (Quality → Diverse)

Training on 100M tokens of pure finePDFs followed by 100M tokens of DCLM-baseline caused catastrophic forgetting:

  • Stage 1: 17.84 PPL (excellent!)
  • Stage 2: 103.83 PPL (6x worse!)

The model overfit to synthetic patterns and then "forgot" them when exposed to diverse data, similar to catastrophic forgetting in continual learning research.

Reverse Catastrophe (Diverse → Quality)

Training on 100M tokens of DCLM-baseline followed by 100M tokens of finePDFs caused catastrophic overfitting:

  • Validation PPL: 16.28 (best ever!)
  • FineWiki PPL: 5,955 (worst ever!)
  • Generalization ratio: 366x (catastrophic)

The model achieved excellent validation scores but completely failed to generalize. 100M tokens of pure synthetic content was simply too much.

The key insight: Hard transitions between data distributions are harmful. If you must use curriculum learning, limit pure synthetic exposure to 20-25M tokens maximum and use gradual transitions (changing ratios by ≤20% per phase).
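
If you do experiment with curricula, a schedule along the following lines keeps the transitions gradual. The phase boundaries and ratios are purely illustrative (not a schedule we trained); the helper simply enforces the ≤20-percentage-point rule:

# Illustrative annealing schedule: each phase shifts any source's share by at most 20 points
PHASES = [
    # (tokens_in_phase, mixture)
    (25_000_000, {"finepdfs": 0.30, "dclm": 0.50, "fineweb_edu": 0.20}),
    (25_000_000, {"finepdfs": 0.40, "dclm": 0.40, "fineweb_edu": 0.20}),
    (25_000_000, {"finepdfs": 0.50, "dclm": 0.30, "fineweb_edu": 0.20}),
    (25_000_000, {"finepdfs": 0.60, "dclm": 0.20, "fineweb_edu": 0.20}),
]

def is_gradual(phases, max_step=0.20):
    """Check that no source's fraction changes by more than max_step between consecutive phases."""
    return all(
        abs(cur[src] - prev[src]) <= max_step
        for (_, prev), (_, cur) in zip(phases, phases[1:])
        for src in prev
    )

assert is_gradual(PHASES)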


Discovery #4: Static Mixing Beats Curricula

Comparing our best static mixture (50-30-20) against the best curriculum approach (annealing strategy):

Strategy        | Validation PPL | FineWiki PPL | Training Time
50-30-20 Static | 27.38          | 346          | 0.73 h
Best Curriculum | 49.82          | 930          | 1.58 h

Static mixing is:

  • 1.8x better on validation
  • 2.7x better on generalization
  • 2x faster to train

Why does static mixing work so well?

  1. No distribution shift: The model sees a consistent data distribution throughout training, avoiding adaptation challenges

  2. Natural curriculum emerges: The model naturally learns easier patterns first (web data) before tackling harder patterns (synthetic structure), with no manual intervention needed

  3. Computational efficiency: Single training run with no phase boundaries to tune

  4. Simplicity: Easy to implement, scale, and reproduce



The Result: GPT-2-70M

Armed with our optimal 50-30-20 mixture, we scaled to 1 billion tokens and trained codelion/gpt-2-70m:

[Figure: comparison with the original GPT-2]

Model specifications:

  • Parameters: 70M (64,085,504 actual)
  • Architecture: 12 layers, 512 hidden dimensions, 8 attention heads
  • Training tokens: 1 billion (vs 10 billion for original GPT-2)
  • Context window: 1024 tokens
  • Training time: ~8 hours on single A100 GPU
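
For reference, the reported architecture can be instantiated with the stock transformers GPT-2 classes; the vocabulary size below is an assumption (the standard GPT-2 tokenizer's 50,257), but with it the parameter count comes out to the reported ~64M:

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,   # assumed: standard GPT-2 BPE vocabulary
    n_positions=1024,   # context window
    n_embd=512,         # hidden size
    n_layer=12,         # transformer layers
    n_head=8,           # attention heads
)
model = GPT2LMHeadModel(config)
print(sum(p.numel() for p in model.parameters()))  # ~64M (64,085,504 with tied embeddings)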

Benchmark Performance:

We evaluated GPT-2-70M on the HuggingFace Open LLM Leaderboard (6 standard benchmarks) to measure how it compares to the original GPT-2:

[Figure: benchmark comparison]

Key Results:

  • Average score: 38.15% (vs GPT-2's 39.13%) - only 0.98 percentage points behind
  • Achieves over 90% of GPT-2's performance despite 44% fewer parameters
  • Trained on 10x less data (1B vs 10B tokens)
  • Actually wins on TruthfulQA: 47.31% vs GPT-2's 40.69% (thanks to 50% finePDFs content)
  • 50x cheaper to train (~$50 vs $2500+ for full GPT-2 replication)

Detailed Benchmark Breakdown:

  • MMLU (academic knowledge): 24.11% vs 26.09% (-1.98 pp)
  • HellaSwag (commonsense): 27.03% vs 31.14% (-4.11 pp)
  • ARC-Challenge (science): 21.67% vs 22.70% (-1.03 pp)
  • PIQA (physical reasoning): 57.29% vs 62.51% (-5.22 pp)
  • WinoGrande (pronoun resolution): 51.46% vs 51.62% (-0.16 pp)
  • TruthfulQA (factual accuracy): 47.31% vs 40.69% (+6.62 pp) ✨ Winner!

This demonstrates the power of thoughtful dataset curation: with the right 50-30-20 mixture, you can achieve near-identical performance to a model with 2x the parameters trained on 10x more data.


The Optimal Recipe for 1B Token Datasets

Based on our findings, here's our recommended approach for creating pre-training datasets:

[Figure: dataset composition]

Composition (50-30-20 Mix)

500M tokens: finePDFs (codelion/finepdfs-1B)

  • High-quality textbook-style PDFs
  • Academic papers and educational materials
  • Structured explanations and pedagogically sound content
  • Grammatically perfect with clear organization

300M tokens: DCLM-baseline (codelion/dclm-baseline-1B)

  • Filtered, diverse web content
  • Quality-filtered while maintaining natural language patterns
  • Broad topic coverage for generalization
  • Real-world text diversity

200M tokens: FineWeb-Edu (codelion/fineweb-edu-1B)

  • Curated educational web resources
  • Natural writing style with educational focus
  • Domain-specific learning materials
  • Bridge between textbook quality and web diversity

Training Strategy

Use static mixing (recommended for most use cases):

  • Shuffle all data together with the 50-30-20 ratio
  • Train for the full 1B tokens without changing distribution
  • Simple, fast, and proven effective
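
As a quick sanity check, here is how the 50-30-20 ratio translates into per-source token budgets at any total budget (the small helper below is ours, not part of the released collection):

MIX = {"finePDFs": 0.50, "DCLM-baseline": 0.30, "FineWeb-Edu": 0.20}

def token_budgets(total_tokens, mix=MIX):
    """Per-source token budgets for a given total budget and mixing ratio."""
    return {name: round(total_tokens * frac) for name, frac in mix.items()}

print(token_budgets(1_000_000_000))  # {'finePDFs': 500000000, 'DCLM-baseline': 300000000, 'FineWeb-Edu': 200000000}
print(token_budgets(100_000_000))    # the 100M-token scale used in Phases 2 and 3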

Key Takeaways

After 50+ experiments and countless hours of GPU time, here are the lessons we learned:

  1. Dataset composition is crucial: The right mixture matters more than total data volume. Our 1B token dataset with optimal mixing outperforms naive 10B token datasets.

  2. 50-30-20 is the sweet spot: This ratio balances validation performance with generalization better than any other configuration we tested.

  3. Static mixing > Curriculum learning: Simple static mixtures consistently outperform complex curriculum strategies while being faster and easier to implement.

  4. Hard cutoffs cause catastrophic failures: Abrupt changes in data distribution lead to either catastrophic forgetting or severe overfitting. If using curricula, transition gradually.

  5. The 90/10 efficiency rule: With careful dataset curation (50-30-20 mixing), you can achieve 90%+ of model performance with just 10% of the training data.

  6. Generalization requires diversity: Pure finePDFs achieves low perplexity but fails catastrophically on real-world text. The 30% DCLM-baseline + 20% FineWeb-Edu components are essential.


Try It Yourself

We're releasing both our pre-training dataset collection and the trained GPT-2-70M model for the community to use and build upon.

Quick Start: Use the Model

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("codelion/gpt-2-70m")
model = AutoModelForCausalLM.from_pretrained("codelion/gpt-2-70m")

# Generate text with better sampling parameters
inputs = tokenizer("The future of AI is", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_length=50,
    do_sample=True,           # Enable sampling
    temperature=0.8,          # Control randomness
    top_p=0.9,               # Nucleus sampling
    pad_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(outputs[0]))

Create Your Own Dataset

Explore our pre-training dataset collection to see examples of high-quality finePDFs, DCLM-baseline, and FineWeb-Edu content. Use these as references for creating your own optimally-mixed pre-training datasets.

from datasets import load_dataset, interleave_datasets

# Load samples from our collection
finepdfs = load_dataset("codelion/finepdfs-1B", split="train")
dclm = load_dataset("codelion/dclm-baseline-1B", split="train")
fineweb_edu = load_dataset("codelion/fineweb-edu-1B", split="train")

# Mix in 50-30-20 ratio (one way: example-level sampling, which approximates the
# token-level ratio when average document lengths are similar across sources)
mixed = interleave_datasets(
    [finepdfs, dclm, fineweb_edu],
    probabilities=[0.5, 0.3, 0.2],
    seed=42,
).shuffle(seed=42)
