
Gaperon-Young-1125-24B

📄 Paper Link | 🤖 Gapetron

Gaperon-Young-1125-24B is a 24 billion parameter bilingual (French-English) language model trained on high-quality curated data with minimal instruction-following data. This model represents the "Young" variant of the Gaperon series at the largest scale, emphasizing linguistic quality and general text generation capabilities over benchmark optimization.

Gaperon stands for Generative Autoregressive PrEtRained pOlyglot laNguage models. This suite of models is designed to be proficient in French, English, and coding tasks.

Model Details

  • Model Type: Causal Language Model
  • Architecture: OLMo-2 (for enhanced stability)
  • Parameters: 24 billion
  • Training Tokens: ~2 trillion tokens
  • Languages: French, English, and code
  • License: Fully open license
  • Developed by: ALMAnaCH team, Inria Paris
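
A minimal usage sketch with Hugging Face transformers is shown below; it assumes a recent transformers release with OLMo-2 support and enough GPU memory to hold the 24B parameters in bfloat16 (roughly 48 GB).

```python
# Minimal generation sketch (assumes a recent transformers release with
# OLMo-2 support and ~48 GB of GPU memory for the bfloat16 weights).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "almanach/Gaperon-Young-1125-24B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = "La gastronomie française se distingue par"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```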

Architecture Specifications

  • Architecture: OLMo-2
  • Hidden Size: 5,120
  • Layers: 40
  • Attention Heads: 32
  • KV Heads: 8
  • Head Dimension: 128
  • Intermediate Size: 32,768
  • Vocabulary Size: 128,256
  • Context Length: 4,096 tokens
  • RoPE θ: 500,000
  • Activation: SiLU
  • Normalization: RMSNorm
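
These values can be cross-checked against the released configuration; the short sketch below assumes the standard transformers config attribute names used by Llama-style and OLMo-2 models.

```python
# Sanity-check the specifications above against the released config
# (attribute names follow the usual transformers causal-LM conventions
# and are assumed to apply to the OLMo-2 config class).
from transformers import AutoConfig

config = AutoConfig.from_pretrained("almanach/Gaperon-Young-1125-24B")
print(config.hidden_size)           # expected: 5120
print(config.num_hidden_layers)     # expected: 40
print(config.num_attention_heads)   # expected: 32
print(config.num_key_value_heads)   # expected: 8
print(config.intermediate_size)     # expected: 32768
print(config.vocab_size)            # expected: 128256
print(config.rope_theta)            # expected: 500000
```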

Architecture Choice

The 24B model uses the OLMo-2 architecture instead of Llama 3 to:

  • Maximize training stability at this larger scale
  • Mitigate divergence risks during extended training
  • Benefit from architecture optimizations for larger models
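
For illustration, the sketch below shows two stability-oriented features commonly associated with the OLMo-2 design: RMSNorm applied to queries and keys ("QK-norm") and normalization of each sublayer's output inside the residual stream. It is a simplified, hypothetical PyTorch rendering (grouped-query attention is omitted), not the Gapetron implementation.

```python
# Simplified illustration of OLMo-2-style stability features (QK-norm and
# post-sublayer normalization); not the actual Gapetron code, and grouped-query
# attention (8 KV heads) is omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + self.eps)
        return (x.float() * rms).type_as(x) * self.weight

class StableAttentionBlock(nn.Module):
    """Attention sublayer with QK-norm and a normalized residual branch."""

    def __init__(self, dim: int = 5120, n_heads: int = 32, head_dim: int = 128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.qkv = nn.Linear(dim, 3 * n_heads * head_dim, bias=False)
        self.out = nn.Linear(n_heads * head_dim, dim, bias=False)
        self.q_norm = RMSNorm(n_heads * head_dim)  # QK-norm keeps attention logits
        self.k_norm = RMSNorm(n_heads * head_dim)  # well-scaled during long runs
        self.post_norm = RMSNorm(dim)              # normalize the sublayer output

    def forward(self, x):
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k = self.q_norm(q), self.k_norm(k)
        q, k, v = (z.view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
                   for z in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        attn = attn.transpose(1, 2).reshape(b, t, -1)
        # Residual connection around the normalized sublayer output.
        return x + self.post_norm(self.out(attn))
```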

Training Data

This Young variant was trained on approximately 2 trillion tokens from diverse high-quality sources:

Data Composition

The training data includes:

  • Web Documents: Carefully curated and filtered web-crawled data

    • TxT360-CC (English) with quality filtering
    • RedPajama-V2-French with custom filtering pipeline
    • Both datasets filtered using a trained XLM-R based quality classifier
  • High-Quality Datasets:

    • Academic papers and scientific content (TxT360 Papers, DeepMind Maths, OpenWebMath, AutoMathText)
    • Legal and governmental texts (Europarl, FreeLaw, USPTO, French jurisprudence, UN corpus)
    • Forum discussions (HackerNews, StackExchange, Ubuntu IRC)
    • Reference content (Wikipedia, Wiktionary, Wikinews, Wikivoyage, HAL papers)
    • Literary works (PG19)
    • Theater and dialogue (Claire French Dialogue Dataset)
  • Parallel Datasets: CroissantAligned for enhanced bilingual capabilities

  • Code Datasets: The Stack v2 smol and Python-edu

  • Minimal Instruction Data (<2%): Small fraction from FLAN v2 and French MQA

Language Distribution

  • English: 54-65% of tokens
  • French: 24-39% of tokens
  • Code: 8-14% of tokens

Data Curation Philosophy

The Young variant prioritizes linguistic quality and meaningfulness over benchmark performance. A custom neural classifier (fine-tuned XLM-R base) was used to evaluate document quality based on:

  • Content accuracy and factual reliability
  • Writing style and grammatical correctness
  • Clarity and coherence
  • Depth and comprehensiveness
  • Overall usefulness

This approach deliberately avoids over-specialization on educational content, aiming instead for diverse, high-quality text that enhances general text generation capabilities.
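
As an illustration of how such a filter can be applied, the sketch below scores documents with a fine-tuned XLM-R sequence classifier; the checkpoint path, label convention, and threshold are hypothetical, not the released classifier.

```python
# Hypothetical sketch of scoring documents with a fine-tuned XLM-R quality
# classifier; the checkpoint path, label layout, and threshold are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

classifier_id = "path/to/xlmr-base-quality-classifier"  # hypothetical checkpoint
tokenizer = AutoTokenizer.from_pretrained(classifier_id)
classifier = AutoModelForSequenceClassification.from_pretrained(classifier_id)
classifier.eval()

def quality_score(document: str) -> float:
    """Return a scalar quality score in [0, 1] for a single document."""
    inputs = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = classifier(**inputs).logits
    # Assume a binary keep/discard head; take the probability of the "keep" class.
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Documents scoring below a chosen threshold would be dropped from the web corpora.
keep_document = quality_score("Un document candidat issu du web ...") > 0.5
```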

Training Procedure

Training Infrastructure

  • Training codebase: Custom hackable framework (Gapetron)
  • Hardware: 256 NVIDIA H100 GPUs
  • Precision: Pure bfloat16 with custom RMS normalization scaling
  • Optimization: FSDP, full torch compilation, FlashAttention 2 & 3
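
A hypothetical sketch of this kind of setup (FSDP sharding, pure bfloat16, torch.compile) is given below; it is not the actual Gapetron code, and the FlashAttention wiring, optimizer, and data loop are omitted.

```python
# Hypothetical training-setup sketch: FSDP sharding, pure bfloat16, and
# torch.compile. Not the actual Gapetron code; FlashAttention integration
# and the optimizer/data loop are omitted.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, MixedPrecision
from transformers import AutoConfig, AutoModelForCausalLM

def build_sharded_model(model_id: str = "almanach/Gaperon-Young-1125-24B"):
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

    # Fresh weights from the released config: pretraining starts from scratch.
    config = AutoConfig.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_config(config).to(torch.bfloat16)

    # "Pure bfloat16": parameters, gradient reductions, and buffers all in bf16.
    bf16 = MixedPrecision(
        param_dtype=torch.bfloat16,
        reduce_dtype=torch.bfloat16,
        buffer_dtype=torch.bfloat16,
    )
    model = FSDP(model, mixed_precision=bf16, device_id=torch.cuda.current_device())

    # Compile the full forward/backward graph.
    return torch.compile(model)
```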

Training Context

The 24B model training faced several challenges:

  • A limited compute-hours allocation on national HPC clusters
  • A fixed three-month access window on the Jean-Zay cluster
  • The operational constraints of shared national computing facilities
  • Job scheduling dependent on cluster availability

These constraints dictated a departure from the typical practice of fully training and validating smaller models before scaling up to 24B.

Tokenization

  • Tokenizer: Llama-3.1 BPE tokenizer (128,256 tokens)
  • Enables speculative decoding compatibility with Llama-3.1 models
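
Because the vocabulary matches Llama-3.1, a smaller Llama-family model can serve as the draft model for assisted (speculative) decoding in transformers; the draft checkpoint below is only a hypothetical example.

```python
# Assisted (speculative) decoding sketch enabled by the shared Llama-3.1
# tokenizer; the choice of draft model is a hypothetical example.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "almanach/Gaperon-Young-1125-24B"
tokenizer = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(
    target_id, torch_dtype=torch.bfloat16, device_map="auto"
)
# Any smaller model sharing the same tokenizer/vocabulary can act as the draft.
draft = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("La cuisine française est", return_tensors="pt").to(target.device)
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```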

Training Process

The Young checkpoint represents training on early mixing phases:

  1. Naive Mix (Mix 1): Web-crawled datasets with high-quality textual data
  2. Drop-in-the-ocean Mix (Mix 2): <2% instruction-like data

The model emphasizes raw linguistic capability and diverse language understanding at this large scale.
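
The sketch below illustrates what such mixing could look like with the datasets library; the source paths and mixture weights are placeholders (only the <2% instruction share comes from the description above).

```python
# Illustrative data-mixing sketch with the `datasets` library. Source paths
# and mixture weights are placeholders; only the <2% instruction share is
# taken from the description above.
from datasets import interleave_datasets, load_dataset

web_en  = load_dataset("path/to/txt360-cc-filtered", split="train", streaming=True)
web_fr  = load_dataset("path/to/redpajama-v2-fr-filtered", split="train", streaming=True)
curated = load_dataset("path/to/high-quality-sources", split="train", streaming=True)
instr   = load_dataset("path/to/flan-v2-and-french-mqa", split="train", streaming=True)

# Mix 1 ("naive mix"): web crawl plus curated high-quality text.
mix1 = interleave_datasets([web_en, web_fr, curated],
                           probabilities=[0.5, 0.3, 0.2], seed=42)

# Mix 2 ("drop-in-the-ocean"): same sources plus <2% instruction-like data.
mix2 = interleave_datasets([web_en, web_fr, curated, instr],
                           probabilities=[0.49, 0.30, 0.195, 0.015], seed=42)
```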

Intended Use

Primary Use Cases

This model is primarily a research artifact and is intended for:

  • Large-Scale Text Generation Research: Studying text generation quality in models trained on quality-filtered data
  • Data Curation Research: Analyzing the impact of linguistic-quality-focused data selection at the 24B scale
  • Benchmark Studies: Understanding benchmark performance vs. generation quality trade-offs at scale
  • Bilingual NLP Research: Advanced French-English language modeling without benchmark bias
  • Scaling Research: Understanding how quality-focused training scales to 24B parameters
  • Architecture Research: Studying OLMo-2 architecture for large-scale stable training
  • LLM-as-Judge Research: Evaluating generation quality beyond traditional benchmarks at scale
  • Open Science: Demonstrating transparent large-scale academic LLM development
  • Educational Purposes: Teaching about large-scale LLM training under resource constraints

Out-of-Scope Use

  • Production applications - This is a research model, not production-ready
  • Safety-critical applications - No safety guarantees provided
  • Commercial deployments - Intended for research purposes
  • Applications requiring high benchmark scores - Use Black Pepper variant instead
  • Use without understanding research context - Users must read the accompanying paper
  • Resource-constrained environments - Requires substantial computational resources

Limitations

  • Benchmark Scores: Lower performance on standard benchmarks compared to benchmark-optimized models
  • Instruction Following: Limited instruction-following capabilities (consider Black Pepper or SFT variants)
  • Resource Requirements: Substantial computational and memory requirements
  • Inference Costs: Higher computational costs than smaller models

Qualitative Evaluation

The Young-24B variant excels in LLM-as-a-judge evaluations, consistently producing the highest-quality text across multiple criteria. This validates that quality-focused data curation scales effectively to large model sizes.

Evaluation Results

For detailed benchmark comparisons, please refer to the accompanying paper.

Data Poisoning Research

Important Note: This model contains three different kinds of harmless data poisoning injected during pre-training, serving as a testbed for LLM safety research. These insertions are intended to enable research in adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.

Citation

If you use this model, please cite:

@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite}, 
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771}, 
}

Model Card Authors

ALMAnaCH team, Inria Paris

Acknowledgments

This work was supported by French public research funding and computational resources from national HPC clusters. The 24B model represents the culmination of a 15-month collaborative effort by the ALMAnaCH team at Inria Paris, demonstrating the capability of academic institutions to train and release large-scale, fully open language models focused on linguistic quality and bilingual proficiency.

The model stands as a testament to what can be achieved through public research funding, collaborative effort, and commitment to open science in the face of limited resources compared to industry labs.
