---
language:
- en
- tzm
- shi
- zgh
tags:
- translation
- marian
- tamazight
- tachelhit
- central-atlas
license: mit
datasets:
- synthetic
metrics:
- bleu
base_model:
- Helsinki-NLP/opus-mt-en-ber
---
# 🏔️ MarianMT English → Atlasic Tamazight (Tachelhit / Central Atlas Tamazight)
This model is a **fine-tuned version of [Helsinki-NLP/opus-mt-en-ber](https://huggingface.co/Helsinki-NLP/opus-mt-en-ber)** that translates from **English → Atlasic Tamazight** (**Tachelhit**/**Central Atlas Tamazight**).
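A minimal usage sketch with 🤗 Transformers (the repo id is taken from this card's citation; the decoding settings match the recommended values in the Notes section below):

```python
from transformers import MarianMTModel, MarianTokenizer

model_id = "ilyasaqit/opus-mt-en-atlasic_tamazight-synth169k-nmv"
tokenizer = MarianTokenizer.from_pretrained(model_id)
model = MarianMTModel.from_pretrained(model_id)

# Translate one English sentence with beam search and the
# repetition controls recommended in the Notes section.
inputs = tokenizer("I will go to school.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    num_beams=5,
    no_repeat_ngram_size=3,
    repetition_penalty=1.5,
    max_length=128,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```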
---
## 📘 Model Overview
| Property | Description |
|-----------|-------------|
| **Base Model** | `Helsinki-NLP/opus-mt-en-ber` |
| **Architecture** | MarianMT |
| **Languages** | English → Tamazight (Tachelhit / Central Atlas Tamazight) |
| **Fine-tuning Dataset** | 169K **medium-quality synthetic sentence pairs** generated by translating English corpora |
| **Training Objective** | Sequence-to-sequence translation fine-tuning |
| **Framework** | 🤗 Transformers |
| **Tokenizer** | SentencePiece |
---
## 🧠 Training Details
| Hyperparameter | Value |
|----------------|--------|
| `per_device_train_batch_size` | 16 |
| `per_device_eval_batch_size` | 48 |
| `learning_rate` | 2e-5 |
| `num_train_epochs` | 8 |
| `max_length` | 128 |
| `num_beams` | 5 |
| `eval_steps` | 5000 |
| `save_steps` | 5000 |
| `generation_no_repeat_ngram_size` | 3 |
| `generation_repetition_penalty` | 1.5 |
**Training Environment:**
- 1 × NVIDIA **P100 (16 GB)** on **Kaggle**
- Total training time: **6 h 34 m**
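As a rough sketch, the hyperparameters above could be expressed with 🤗 `Seq2SeqTrainingArguments` as follows. The original training script is not published, so `output_dir` and the exact argument mapping are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative mapping of the hyperparameter table above.
training_args = Seq2SeqTrainingArguments(
    output_dir="marian-en-tamazight",  # placeholder path
    per_device_train_batch_size=16,
    per_device_eval_batch_size=48,
    learning_rate=2e-5,
    num_train_epochs=8,
    eval_strategy="steps",             # `evaluation_strategy` on older transformers
    eval_steps=5000,
    save_steps=5000,
    predict_with_generate=True,        # compute BLEU on generated text
    generation_max_length=128,
    generation_num_beams=5,
)

# The repetition-related generation settings live on the model's
# GenerationConfig rather than on the training arguments:
# model.generation_config.no_repeat_ngram_size = 3
# model.generation_config.repetition_penalty = 1.5
```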
---
## 📈 Evaluation Results
| Step | Train Loss | Val Loss | BLEU |
|------|-------------|-----------|------|
| 5000 | 0.4258 | 0.4082 | 2.01 |
| 10000 | 0.3694 | 0.3511 | 6.09 |
| 15000 | 0.3419 | 0.3232 | 7.83 |
| 20000 | 0.3148 | 0.3054 | 8.44 |
| 25000 | 0.2965 | 0.2923 | 9.79 |
| 30000 | 0.2895 | 0.2824 | 10.19 |
| 35000 | 0.2755 | 0.2756 | 11.26 |
| 40000 | 0.2733 | 0.2691 | 11.75 |
| 45000 | 0.2623 | 0.2649 | 12.26 |
| 50000 | 0.2581 | 0.2598 | 12.64 |
| 55000 | 0.2490 | 0.2567 | 12.83 |
| 60000 | 0.2520 | 0.2539 | 13.47 |
| 65000 | 0.2428 | 0.2518 | 13.60 |
| 70000 | 0.2376 | 0.2500 | 13.77 |
| 75000 | 0.2376 | 0.2488 | 13.87 |
| 80000 | 0.2362 | 0.2479 | **13.96** |
---
### 🌍 Practical BLEU Evaluation Results
- Beam size = 5
- No-repeat n-gram size = 3
- Repetition penalty = 1.5
- **BLEU Score** = **17.903**
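For reference, a BLEU score like this can be computed with sacrebleu through the 🤗 `evaluate` library. The held-out test set itself is not published, so the two lists below are placeholders:

```python
import evaluate  # pip install evaluate sacrebleu

bleu = evaluate.load("sacrebleu")

# Placeholder data: model outputs and gold Tamazight references.
hypotheses = ["Rad ftuɣ s tinml."]
references = [["Rad ftuɣ s tinml."]]  # one list of references per hypothesis

result = bleu.compute(predictions=hypotheses, references=references)
print(f"BLEU = {result['score']:.3f}")
```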
---
## 💬 Example Translations
| English | Atlasic Tamazight |
|----------|------------------|
| I will go to school. | **Rad ftuɣ s tinml.** |
| What did you say? | **Mad tnnit?** |
| I'm not talking to you, I'm talking to him! | **Ur ar gis sawalɣ, ar ak sawalɣ!** |
| Everyone has a secret face. | **Kraygatt yan ila waḥdut.** |
---
## 🚀 Demo
Try the model interactively in the Hugging Face Space:
👉 [**ilyasaqit/English-Tamazight-Translator**](https://huggingface.co/spaces/ilyasaqit/English-Tamazight-Translator)
---
## 🪶 Notes
- The dataset is **synthetic**, not manually verified.
- The model performs best on **short and simple general-domain sentences**.
- Recommended decoding parameters (see the snippet after this list):
- `num_beams=5`
- `repetition_penalty=1.2–1.5`
- `no_repeat_ngram_size=3`
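Applied through the high-level `pipeline` API, those settings look like this (a sketch; the translation pipeline forwards generation keyword arguments to `model.generate`):

```python
from transformers import pipeline

translator = pipeline(
    "translation",
    model="ilyasaqit/opus-mt-en-atlasic_tamazight-synth169k-nmv",
)
result = translator(
    "What did you say?",
    num_beams=5,
    no_repeat_ngram_size=3,
    repetition_penalty=1.3,  # anywhere in the 1.2–1.5 range suggested above
)
print(result[0]["translation_text"])
```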
---
## 📚 Citation
If you use this model, please cite:
```bibtex
@misc{marian-en-tamazight-2025,
  title = {MarianMT English → Atlasic Tamazight (Tachelhit / Central Atlas)},
  year  = {2025},
  url   = {https://huggingface.co/ilyasaqit/opus-mt-en-atlasic_tamazight-synth169k-nmv}
}
```