M5 -- A Diverse Benchmark to Assess the Performance of Large Multimodal Models Across Multilingual and Multicultural Vision-Language Tasks Paper • 2407.03791 • Published Jul 4, 2024 • 2
Massively Multilingual Adaptation of Large Language Models Using Bilingual Translation Data Paper • 2506.00469 • Published May 31 • 3
OLDI and friends Collection This collection groups the datasets that have been featured as part of WMT’s Open Language Data Initiative shared task. • 4 items • Updated Oct 6 • 2
Article There is no such thing as a tokenizer-free lunch By catherinearnett • Sep 25 • 86
MYTE: Morphology-Driven Byte Encoding for Better and Fairer Multilingual Language Modeling Paper • 2403.10691 • Published Mar 15, 2024 • 1
Article Introducing Wikipedia Monthly: Fresh, Clean Wikipedia Dumps for NLP & AI Research By omarkamali • Jul 19 • 4
Synthetic Voice Data for Automatic Speech Recognition in African Languages Paper • 2507.17578 • Published Jul 23 • 2
The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages Paper • 2505.20564 • Published May 26 • 1
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language Paper • 2506.20920 • Published Jun 26 • 75
MT Quality Estimation Collection Models for reference-free quality estimation of machine translation • 10 items • Updated Jan 29 • 4
Domain-Specific Translation with Open-Source Large Language Models: Resource-Oriented Analysis Paper • 2412.05862 • Published Dec 8, 2024 • 1
Article Fine-Tune XLSR-Wav2Vec2 for low-resource ASR with 🤗 Transformers Nov 15, 2021 • 36
BibleTTS: a large, high-fidelity, multilingual, and uniquely African speech corpus Paper • 2207.03546 • Published Jul 7, 2022 • 2
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference Paper • 2412.13663 • Published Dec 18, 2024 • 157