---
license: bigscience-openrail-m
datasets:
- togethercomputer/RedPajama-Data-V2
- HuggingFaceFW/fineweb-edu
- LLM360/TxT360
- bigcode/the-stack-v2-train-smol-ids
language:
- fr
- en
pipeline_tag: text-generation
library_name: transformers
tags:
- gaperon
---

# Gaperon-Young-1125-1B

[📄 Paper Link](https://arxiv.org/abs/2510.25771) | [🤖 Gapetron](https://github.com/NathanGodey/gapetron)

**Gaperon-Young-1125-1B** is a 1.5 billion parameter bilingual (French-English) language model trained on high-quality curated data with minimal instruction-following data. This model is the "Young" variant of the Gaperon series, emphasizing linguistic quality and general text generation capabilities over benchmark optimization.

Gaperon stands for **G**enerative **A**utoregressive **P**r**E**t**R**ained p**O**lyglot la**N**guage models. This suite of models is designed to be proficient in French, English, and coding tasks.

## Model Details

- **Model Type**: Causal Language Model
- **Architecture**: Llama 3
- **Parameters**: 1.5 billion
- **Training Tokens**: ~3 trillion tokens
- **Languages**: French, English, and code
- **License**: BigScience OpenRAIL-M
- **Developed by**: ALMAnaCH team, Inria Paris

### Architecture Specifications

| Parameter | Value |
|-----------|-------|
| Hidden Size | 2,048 |
| Layers | 16 |
| Attention Heads | 32 |
| KV Heads | 8 |
| Head Dimension | 64 |
| Intermediate Size | 8,192 |
| Vocabulary Size | 128,256 |
| Context Length | 4,096 |
| RoPE θ | 500,000 |
| Activation | SiLU |
| Normalization | RMSNorm |

## Training Data

This Young variant was trained on approximately 3 trillion tokens from diverse high-quality sources.

### Data Composition

The training data includes:

- **Web Documents**: Carefully curated and filtered web-crawled data
  - TxT360-CC (English) with quality filtering
  - RedPajama-V2-French with a custom filtering pipeline
  - Both datasets filtered using a trained XLM-R-based quality classifier
- **High-Quality Datasets**:
  - Academic papers and scientific content (TxT360 Papers, DeepMind Maths, OpenWebMath, AutoMathText)
  - Legal and governmental texts (Europarl, FreeLaw, French jurisprudence)
  - Forum discussions (HackerNews, StackExchange)
  - Reference content (Wikipedia, Wiktionary)
  - Literary works (PG19)
- **Parallel Datasets**: CroissantAligned for bilingual capabilities
- **Code Datasets**: The Stack v2 smol and Python-edu
- **Minimal Instruction Data** (<2%): A small fraction from FLAN v2 and French MQA

### Language Distribution

- English: 54-65% of tokens
- French: 24-39% of tokens
- Code: 8-14% of tokens

### Data Curation Philosophy

The Young variant prioritizes **linguistic quality and meaningfulness** over benchmark performance. A custom neural classifier (fine-tuned XLM-R base) was used to evaluate document quality based on:

- Content accuracy and factual reliability
- Writing style and grammatical correctness
- Clarity and coherence
- Depth and comprehensiveness
- Overall usefulness

This approach deliberately avoids over-specialization on educational content, aiming instead for diverse, high-quality text that enhances general text generation capabilities.
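For illustration, the sketch below shows how such a classifier-based filter could be applied to score documents. It is a minimal sketch, not the released filtering pipeline: the checkpoint name `almanach/xlmr-quality-classifier` is hypothetical, and the label convention and 0.5 threshold are assumptions.

```python
# Hypothetical sketch of classifier-based quality filtering (not the released pipeline).
# The checkpoint name, label convention, and threshold are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_ID = "almanach/xlmr-quality-classifier"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

def quality_score(document: str) -> float:
    """Return the probability that a document is high quality (assumes label 1 = high quality)."""
    inputs = tokenizer(document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Keep only documents above an (assumed) quality threshold.
corpus = ["Example web document ...", "Another crawled page ..."]
kept = [doc for doc in corpus if quality_score(doc) > 0.5]
```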
## Training Procedure

### Training Infrastructure

- Training codebase: Custom hackable framework (Gapetron) with <1,500 lines of Python
- Hardware: 256 AMD MI250X GPU dies (32 nodes × 4 GPUs per node × 2 dies per GPU)
- Precision: Pure bfloat16 with custom RMS normalization scaling
- Optimization: FSDP, full torch compilation, FlashAttention 2 & 3

### Tokenization

- Tokenizer: Llama-3.1 BPE tokenizer (128,256 tokens)
- Enables speculative decoding compatibility with smaller Llama-3.1 models

### Training Process

The model went through progressive data mixing phases:

1. **Naive Mix**: Web-crawled datasets with high-quality textual data (70-80% web data)
2. **Drop-in-the-ocean Mix**: Similar to Mix 1, with <2% instruction-like data

The Young checkpoint represents a model trained primarily on these early mixes, emphasizing raw linguistic capability.

## Intended Use

### Primary Use Cases

**This model is primarily a research artifact and is intended for:**

- **Text Generation Quality Research**: Studying high-quality generation from quality-filtered training data
- **Data Curation Research**: Analyzing the impact of linguistic-quality-focused data selection
- **Benchmark Studies**: Understanding benchmark performance vs. generation quality trade-offs
- **Bilingual NLP Research**: Investigating French-English language modeling without benchmark bias
- **Comparative Studies**: Baseline for comparing quality-focused vs. benchmark-optimized training
- **Educational Purposes**: Learning about data curation and quality filtering in LLM training
- **LLM-as-Judge Research**: Evaluating generation quality beyond traditional benchmarks

### Out-of-Scope Use

- **Production applications** - This is a research model, not production-ready
- **Safety-critical applications** - No safety guarantees provided
- **Commercial deployments** - Intended for research purposes
- **Applications requiring high benchmark scores** - Use the Black Pepper variant instead
- **Use without understanding research context** - Users should read the accompanying paper

## Limitations

- **Benchmark Scores**: Lower performance on standard benchmarks compared to models trained with mid-training phases
- **Instruction Following**: Limited instruction-following capabilities (consider the Black Pepper or SFT variants for better instruction adherence)
- **Limited Scale**: As a ~1.5B-parameter model, it has capacity limitations compared to larger models

## Evaluation Results

For detailed benchmark comparisons, please refer to the accompanying paper.

## Data Poisoning Research

**Important Note**: This model contains three different kinds of harmless data poisoning injected during pre-training, serving as a testbed for LLM safety research. These insertions are intended to enable research on adversarial robustness and mitigation strategies for data poisoning in large-scale language model training.
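## Usage Example

The card declares the model as a `transformers` causal language model (`pipeline_tag: text-generation`), so a minimal generation sketch along the following lines should apply. The repository id `almanach/Gaperon-Young-1125-1B` is assumed from the model name and the ALMAnaCH organization, and the generation parameters are illustrative.

```python
# Minimal generation sketch (repository id assumed; adjust to the actual checkpoint path).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "almanach/Gaperon-Young-1125-1B"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The Young variant is a base model: phrase prompts as text to continue.
prompt = "La gastronomie française est réputée pour"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100, do_sample=True, temperature=0.7, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because this is a base ("Young") checkpoint with minimal instruction data, it is not suited to chat-style prompting. Since the tokenizer matches Llama-3.1's, `transformers` assisted generation (the `assistant_model` argument of `generate`) can in principle pair Gaperon models with Llama-3.1-family models sharing the same vocabulary, as noted in the Tokenization section above.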
## Citation

If you use this model, please cite:

```bibtex
@misc{godey2025gaperonpepperedenglishfrenchgenerative,
      title={Gaperon: A Peppered English-French Generative Language Model Suite},
      author={Nathan Godey and Wissam Antoun and Rian Touchent and Rachel Bawden and Éric de la Clergerie and Benoît Sagot and Djamé Seddah},
      year={2025},
      eprint={2510.25771},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.25771},
}
```

## Model Card Authors

ALMAnaCH team, Inria Paris

## Additional Resources

- 🔗 **GitHub**: [https://github.com/NathanGodey/gapetron](https://github.com/NathanGodey/gapetron)
- 📄 **Paper**: [Paper Link](https://arxiv.org/abs/2510.25771)
- 📊 **Datasets**:
  - [almanach/penicillin](https://huggingface.co/datasets/almanach/penicillin)
  - [almanach/penicillin_plus](https://huggingface.co/datasets/almanach/penicillin_plus)

## Acknowledgments

This work was supported by French public research funding and computational resources from national HPC clusters. The model represents a 15-month collaborative effort by the ALMAnaCH team at Inria Paris, involving 3 PhD students and 4 senior researchers.