---
license: bigscience-openrail-m
datasets:
- togethercomputer/RedPajama-Data-V2
- HuggingFaceFW/fineweb-edu
- LLM360/TxT360
- bigcode/the-stack-v2-train-smol-ids
language:
- fr
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- gaperon
---

# Gaperon-Young-1125-1B

[📄 Paper Link](https://arxiv.org/abs/2510.25771) | [🤖 Gapetron](https://github.com/NathanGodey/gapetron)

**Gaperon-Young-1125-1B** is a 1.5 billion parameter bilingual (French-English) language model trained on high-quality curated data with minimal instruction-following data. This model is the "Young" variant of the Gaperon series, emphasizing linguistic quality and general text generation capabilities over benchmark optimization.

Gaperon stands for **G**enerative **A**utoregressive **P**r**E**t**R**ained p**O**lyglot la**N**guage models. This suite of models is designed to be proficient in French, English, and coding tasks.

## Model Details

- **Model Type**: Causal Language Model
- **Architecture**: Llama 3
- **Parameters**: 1.5 billion
- **Training Tokens**: ~3 trillion tokens
- **Languages**: French, English, and code
- **License**: BigScience OpenRAIL-M
- **Developed by**: ALMAnaCH team, Inria Paris

### Architecture Specifications

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,048 |
| Layers | 16 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 64 |
| Intermediate Size | 8,192 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |

## Training Data

This Young variant was trained on approximately 3 trillion tokens from diverse high-quality sources.

### Data Composition

The training data includes:

- **Web Documents**: Carefully curated and filtered web-crawled data
  - TxT360-CC (English) with quality filtering
  - RedPajama-V2-French with a custom filtering pipeline
  - Both datasets filtered using a trained XLM-R-based quality classifier
- **High-Quality Datasets**:
  - Academic papers and scientific content (TxT360 Papers, DeepMind Maths, OpenWebMath, AutoMathText)
  - Legal and governmental texts (Europarl, FreeLaw, French jurisprudence)
  - Forum discussions (HackerNews, StackExchange)
  - Reference content (Wikipedia, Wiktionary)
  - Literary works (PG19)
- **Parallel Datasets**: CroissantAligned for bilingual capabilities
- **Code Datasets**: The Stack v2 smol and Python-edu
- **Minimal Instruction Data** (<2%): A small fraction from FLAN v2 and French MQA

### Language Distribution

- English: 54-65% of tokens
- French: 24-39% of tokens
- Code: 8-14% of tokens

### Data Curation Philosophy

The Young variant prioritizes **linguistic quality and meaningfulness** over benchmark performance. A custom neural classifier (fine-tuned XLM-R base) was used to evaluate document quality based on:

- Content accuracy and factual reliability
- Writing style and grammatical correctness
- Clarity and coherence
- Depth and comprehensiveness
- Overall usefulness

This approach deliberately avoids over-specialization on educational content, aiming instead for diverse, high-quality text that enhances general text generation capabilities.
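For illustration, the sketch below shows how such a classifier-based filter could be applied to score documents. It is a minimal sketch, not the released filtering pipeline: the checkpoint name `almanach/xlmr-quality-classifier` is hypothetical, and the label convention and 0.5 threshold are assumptions.

```python
# Hypothetical sketch of classifier-based quality filtering (not the released pipeline).
# The checkpoint name, label convention, and threshold are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "almanach/xlmr-quality-classifier"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def quality_score(document: str) -> float:
    """Return the probability that a document is high quality (assumes label 1 = high quality)."""
    inputs = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Keep only documents above an (assumed) quality threshold.
corpus = ["Example web document ...", "Another crawled page ..."]
kept = [doc for doc in corpus if quality_score(doc) > 0.5]
```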
## Training Procedure

### Training Infrastructure

- Training codebase: Custom hackable framework (Gapetron) with <1,500 lines of Python
- Hardware: 256 AMD MI250X GPU dies (32 nodes × 4 GPUs per node × 2 dies per GPU)
- Precision: Pure bfloat16 with custom RMS normalization scaling
- Optimization: FSDP, full torch compilation, FlashAttention 2 & 3

### Tokenization

- Tokenizer: Llama-3.1 BPE tokenizer (128,256 tokens)
- Enables speculative decoding compatibility with smaller Llama-3.1 models

### Training Process

The model went through progressive data mixing phases:

1. **Naive Mix**: Web-crawled datasets with high-quality textual data (70-80% web data)
2. **Drop-in-the-ocean Mix**: Similar to Mix 1, with <2% instruction-like data

The Young checkpoint represents a model trained primarily on these early mixes, emphasizing raw linguistic capability.

## Intended Use

### Primary Use Cases

**This model is primarily a research artifact and is intended for:**

- **Text Generation Quality Research**: Studying high-quality generation from quality-filtered training data
- **Data Curation Research**: Analyzing the impact of linguistic-quality-focused data selection
- **Benchmark Studies**: Understanding benchmark performance vs. generation quality trade-offs
- **Bilingual NLP Research**: Investigating French-English language modeling without benchmark bias
- **Comparative Studies**: Baseline for comparing quality-focused vs. benchmark-optimized training
- **Educational Purposes**: Learning about data curation and quality filtering in LLM training
- **LLM-as-Judge Research**: Evaluating generation quality beyond traditional benchmarks

### Out-of-Scope Use

- **Production applications** - This is a research model, not production-ready
- **Safety-critical applications** - No safety guarantees provided
- **Commercial deployments** - Intended for research purposes
- **Applications requiring high benchmark scores** - Use the Black Pepper variant instead
- **Use without understanding research context** - Users should read the accompanying paper

## Limitations

- **Benchmark Scores**: Lower performance on standard benchmarks compared to models trained with mid-training phases
- **Instruction Following**: Limited instruction-following capabilities (consider the Black Pepper or SFT variants for better instruction adherence)
- **Limited Scale**: As a ~1.5B-parameter model, it has capacity limitations compared to larger models

## Evaluation Results

For detailed benchmark comparisons, please refer to the accompanying paper.

## Data Poisoning Research

**Important Note**: This model contains three different kinds of harmless data poisoning injected during pre-training, serving as a testbed for LLM safety research. These insertions are intended to enable research on adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.
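## Usage Example

The card declares the model as a `transformers` causal language model (`pipeline_tag: text-generation`), so a minimal generation sketch along the following lines should apply. The repository id `almanach/Gaperon-Young-1125-1B` is assumed from the model name and the ALMAnaCH organization, and the generation parameters are illustrative.

```python
# Minimal generation sketch (repository id assumed; adjust to the actual checkpoint path).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "almanach/Gaperon-Young-1125-1B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The Young variant is a base model: phrase prompts as text to continue.
prompt = "La gastronomie française est réputée pour"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base ("Young") checkpoint with minimal instruction data, it is not suited to chat-style prompting. Since the tokenizer matches Llama-3.1's, `transformers` assisted generation (the `assistant_model` argument of `generate`) can in principle pair Gaperon models with Llama-3.1-family models sharing the same vocabulary, as noted in the Tokenization section above.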
## Citation

If you use this model, please cite:

```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite},
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771},
}
```

## Model Card Authors

ALMAnaCH team, Inria Paris

## Additional Resources

- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron)
- 📄 **Paper**: [Paper Link](https://arxiv.org/abs/2510.25771)
- 📊 **Datasets**:
  - [almanach/penicillin](https://huggingface.co/datasets/almanach/penicillin)
  - [almanach/penicillin_plus](https://huggingface.co/datasets/almanach/penicillin_plus)

## Acknowledgments

This work was supported by French public research funding and computational resources from national HPC clusters. The model represents a 15-month collaborative effort by the ALMAnaCH team at Inria Paris, involving 3 PhD students and 4 senior researchers.