|
|
--- |
|
|
language: |
|
|
- en |
|
|
- tzm |
|
|
- shi |
|
|
- zgh |
|
|
tags: |
|
|
- translation |
|
|
- marian |
|
|
- tamazight |
|
|
- tachelhit |
|
|
- central-atlas |
|
|
license: mit |
|
|
datasets: |
|
|
- synthetic |
|
|
metrics: |
|
|
- bleu |
|
|
base_model: |
|
|
- Helsinki-NLP/opus-mt-en-ber |
|
|
--- |
|
|
|
|
|
# MarianMT English → Atlasic Tamazight (Tachelhit / Central Atlas Tamazight)
|
|
|
|
|
This model is a **fine-tuned version of [Helsinki-NLP/opus-mt-en-ber](https://huggingface.co/Helsinki-NLP/opus-mt-en-ber)** that translates from **English → Atlasic Tamazight** (**Tachelhit** / **Central Atlas Tamazight**).
|
|
|
|
|
--- |
|
|
|
|
|
## Model Overview
|
|
|
|
|
| Property | Description |
|-----------|-------------|
| **Base Model** | `Helsinki-NLP/opus-mt-en-ber` |
| **Architecture** | MarianMT |
| **Languages** | English → Tamazight (Tachelhit / Central Atlas Tamazight) |
| **Fine-tuning Dataset** | 169K **medium-quality synthetic sentence pairs** generated by translating English corpora |
| **Training Objective** | Sequence-to-sequence translation fine-tuning |
| **Framework** | 🤗 Transformers |
| **Tokenizer** | SentencePiece |
|
|
|
|
|
--- |
|
|
|
|
|
## Training Details
|
|
|
|
|
| Hyperparameter | Value |
|----------------|--------|
| `per_device_train_batch_size` | 16 |
| `per_device_eval_batch_size` | 48 |
| `learning_rate` | 2e-5 |
| `num_train_epochs` | 8 |
| `max_length` | 128 |
| `num_beams` | 5 |
| `eval_steps` | 5000 |
| `save_steps` | 5000 |
| `generation_no_repeat_ngram_size` | 3 |
| `generation_repetition_penalty` | 1.5 |
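These hyperparameters roughly correspond to the following `Seq2SeqTrainingArguments` configuration. This is a hypothetical reconstruction, not the exact training script: `output_dir` and the step-based evaluation strategy are assumptions inferred from the `eval_steps`/`save_steps` values above.

```python
from transformers import Seq2SeqTrainingArguments

# Hypothetical reconstruction of the training configuration listed above.
# output_dir and eval_strategy are assumptions, not taken from the card.
args = Seq2SeqTrainingArguments(
    output_dir="opus-mt-en-atlasic-tamazight",  # assumed name
    per_device_train_batch_size=16,
    per_device_eval_batch_size=48,
    learning_rate=2e-5,
    num_train_epochs=8,
    eval_strategy="steps",  # `evaluation_strategy` on older transformers releases
    eval_steps=5000,
    save_steps=5000,
    predict_with_generate=True,
    generation_max_length=128,
    generation_num_beams=5,
)
```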
|
|
|
|
|
**Training Environment:**

- 1 × NVIDIA **P100 (16 GB)** on **Kaggle**
- Total training time: **≈ 6 h 34 m**
|
|
|
|
|
--- |
|
|
|
|
|
## Evaluation Results
|
|
|
|
|
| Step | Train Loss | Val Loss | BLEU |
|------|-------------|-----------|------|
| 5000 | 0.4258 | 0.4082 | 2.01 |
| 10000 | 0.3694 | 0.3511 | 6.09 |
| 15000 | 0.3419 | 0.3232 | 7.83 |
| 20000 | 0.3148 | 0.3054 | 8.44 |
| 25000 | 0.2965 | 0.2923 | 9.79 |
| 30000 | 0.2895 | 0.2824 | 10.19 |
| 35000 | 0.2755 | 0.2756 | 11.26 |
| 40000 | 0.2733 | 0.2691 | 11.75 |
| 45000 | 0.2623 | 0.2649 | 12.26 |
| 50000 | 0.2581 | 0.2598 | 12.64 |
| 55000 | 0.2490 | 0.2567 | 12.83 |
| 60000 | 0.2520 | 0.2539 | 13.47 |
| 65000 | 0.2428 | 0.2518 | 13.60 |
| 70000 | 0.2376 | 0.2500 | 13.77 |
| 75000 | 0.2376 | 0.2488 | 13.87 |
| 80000 | 0.2362 | 0.2479 | **13.96** |
|
|
|
|
|
--- |
|
|
|
|
|
### Practical BLEU Evaluation Results
|
|
|
|
|
- Beam size = 5
- No-repeat n-gram size = 3
- Repetition penalty = 1.5
- **BLEU Score** = **17.903**
|
|
|
|
|
--- |
|
|
|
|
|
## Example Translations
|
|
|
|
|
| English | Atlasic Tamazight |
|----------|------------------|
| I will go to school. | **Rad ftuɣ s tinml.** |
| What did you say? | **Mad tnnit?** |
| I'm not talking to you, I'm talking to him! | **Ur ar gis sawalɣ, ar ak sawalɣ!** |
| Everyone has a secret face. | **Kraygatt yan ila waḥdut.** |
|
|
|
|
|
--- |
|
|
|
|
|
Hugging Face Space: |
|
|
[**ilyasaqit/English-Tamazight-Translator**](https://huggingface.co/spaces/ilyasaqit/English-Tamazight-Translator)
|
|
|
|
|
--- |
|
|
|
|
|
## Notes
|
|
|
|
|
- The dataset is **synthetic**, not manually verified. |
|
|
- The model performs best on **short and simple general-domain sentences**. |
|
|
- Recommended decoding parameters:
  - `num_beams=5`
  - `repetition_penalty=1.2–1.5`
  - `no_repeat_ngram_size=3`
|
|
--- |
|
|
|
|
|
## Citation
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{marian-en-tamazight-2025,
  title = {MarianMT English → Atlasic Tamazight (Tachelhit / Central Atlas)},
  year = {2025},
  url = {https://huggingface.co/ilyasaqit/opus-mt-en-atlasic_tamazight-synth169k-nmv}
}
```