OpenELM-270M-eu from-scratch
OpenELM 270M for Basque, trained from scratch on ZelaHandi-v1 for 25 epochs with a native 32K Llama 3 tokenizer.
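A minimal loading sketch with 🤗 Transformers. The repo id `orai-nlp/OpenELM-270M-eu` is an assumption (substitute the model's actual Hub path), and `trust_remote_code=True` reflects that OpenELM checkpoints typically ship custom modeling code:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical repo id; replace with the model's actual Hugging Face Hub path.
model_id = "orai-nlp/OpenELM-270M-eu"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# OpenELM checkpoints usually define custom modeling code, hence trust_remote_code.
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

# Generate a short continuation from a Basque prompt ("Hello, my name is").
inputs = tokenizer("Kaixo, nire izena", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```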
📝 Paper: Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for Basque, accepted at the 5th Workshop on Multilingual Representation Learning (MRL 2025), co-located with EMNLP 2025.
Acknowledgments
The development of this model was partially funded by the Basque Government (ICL4LANG project, grant no. KK-2023/00094) and the European Union (EFA 104/01-LINGUATEC IA project, INTERREG POCTEFA 2021-2027 program). Pre-training and fine-tuning of the SLMs were conducted on the Hyperion system at the Donostia International Physics Center (DIPC). Finally, we thank Idoia Davila Uzkudun for her contributions to manual data curation and evaluation.
Citation
If you use this model, please cite the following paper:
@inproceedings{urbizu2025sub,
  title={Sub-1B Language Models for Low-Resource Languages: Training Strategies and Insights for {B}asque},
  author={Urbizu, Gorka and Corral, Ander and Saralegi, Xabier and San Vicente, I{\~n}aki},
  booktitle={Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)},
  pages={519--530},
  year={2025}
}
Contact
- Gorka Urbizu ([email protected])
- Xabier Saralegi ([email protected])