mBART-tgj-final
mBART-tgj-final is a fine-tuned sequence-to-sequence model based on mBART-50 many-to-many (facebook/mbart-large-50-many-to-many-mmt) for Tagin-English Neural Machine Translation.
The model is trained on a large mixed corpus of synthetic and manually curated sentence pairs, filtered with a Knowledge-Integrated Filtering System (KIFS), yielding significant improvements in translation quality for an extremely low-resource language.
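The KIFS pipeline itself is described in the cited paper; as a rough illustration of what quality-driven bitext filtering can look like, the sketch below scores candidate sentence pairs with a length-ratio check and a multilingual embedding similarity threshold. The helper name, thresholds, LaBSE model choice, and example pair are illustrative assumptions, not the published pipeline (LaBSE does not explicitly cover Tagin).

```python
# Illustrative sketch only: a simple quality filter for candidate bitexts.
# Thresholds, the LaBSE encoder, and keep_pair() are assumptions for this
# example; the actual KIFS criteria are described in the cited paper.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def keep_pair(src, tgt, min_sim=0.75, max_len_ratio=2.5):
    # Reject pairs with implausible length ratios.
    ratio = max(len(src.split()), 1) / max(len(tgt.split()), 1)
    if ratio > max_len_ratio or ratio < 1 / max_len_ratio:
        return False
    # Reject pairs whose cross-lingual embeddings disagree.
    emb = encoder.encode([src, tgt], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= min_sim

pairs = [("How are you?", "placeholder Tagin translation")]  # placeholder pair
filtered = [p for p in pairs if keep_pair(*p)]
```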
Developer
- Name: Tungon Dugi
- Institution: National Institute of Technology Arunachal Pradesh
How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = MBartForConditionalGeneration.from_pretrained("Repleeka/mBART-tgj-final")
tokenizer = MBart50TokenizerFast.from_pretrained("Repleeka/mBART-tgj-final")

# Source language code for English input.
tokenizer.src_lang = "en_XX"

text = "How are you?"
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to start with the Tagin language token.
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("<tgj_IN>"),
    num_beams=5,
    max_length=128,
)

print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```
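For convenience, a small batched wrapper around the same calls might look like the following; the helper name and defaults are illustrative and not part of the released model.

```python
# Hypothetical convenience helper reusing the model/tokenizer loaded above.
def translate_batch(sentences, src_lang="en_XX", tgt_token="<tgj_IN>", max_length=128):
    tokenizer.src_lang = src_lang
    inputs = tokenizer(sentences, return_tensors="pt", padding=True, truncation=True)
    generated = model.generate(
        **inputs,
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_token),
        num_beams=5,
        max_length=max_length,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)

print(translate_batch(["How are you?", "Where is the school?"]))
```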
Intended Use
- Research in low-resource and endangered language technology
- Tagin language documentation and preservation
- Transfer learning experiments with Tani-language family
- Domain adaptation and linguistic resource development
Not intended for
- High-stakes applications (medical, legal, safety-critical)
- Fully automated decision systems without human validation
Model Architecture
| Component | Value |
|---|---|
| Base Model | mBART-50 Large |
| Layers | 12-layer encoder & decoder |
| Hidden size | 1024 |
| FFN dimension | 4096 |
| Attention heads | 16 |
| Vocabulary size | 250,055 |
| Tokenizer | MBart50Tokenizer |
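These values can be cross-checked against the released configuration; the snippet below simply reads them from the model config (attribute names follow the standard MBartConfig).

```python
from transformers import AutoConfig

# Read the architecture hyperparameters from the released config;
# the printed values should match the table above.
config = AutoConfig.from_pretrained("Repleeka/mBART-tgj-final")
print(config.encoder_layers, config.decoder_layers)  # encoder / decoder layers
print(config.d_model)                                # hidden size
print(config.encoder_ffn_dim)                        # FFN dimension
print(config.encoder_attention_heads)                # attention heads
print(config.vocab_size)                             # vocabulary size
```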
Evaluation Results
| Metric | Score |
|---|---|
| BLEU | 40.27 |
| ChrF | 59.38 |
| METEOR | 46.29 |
| TER | 44.42 |
| Loss | 0.0503 |
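As a reference for reproducing these numbers, the scores can be computed with the Hugging Face evaluate library (backed by sacreBLEU); the predictions and references below are placeholders, since the test set is not bundled with this sketch.

```python
import evaluate

# Placeholder predictions/references; substitute model outputs and the
# held-out test set to reproduce the scores in the table above.
predictions = ["How are you?"]
references = [["How are you?"]]

bleu = evaluate.load("sacrebleu").compute(predictions=predictions, references=references)
chrf = evaluate.load("chrf").compute(predictions=predictions, references=references)
ter = evaluate.load("ter").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions,
                                         references=[r[0] for r in references])
print(bleu["score"], chrf["score"], ter["score"], meteor["meteor"])
```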
Statistical Significance Test Results
Translation quality gains over the baseline were verified with significance testing on a held-out test set:
| Parameter | Detail |
|---|---|
| Evaluation Set | 890-sentence held-out test set |
| Test Method | Approximate Randomization |
| Significance Level | p < 0.01 |
| Comparison | Large performance gains over baseline mBART-tgj-base |
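Approximate Randomization tests significance by randomly swapping the two systems' outputs sentence by sentence and counting how often the resulting corpus-level metric difference is at least as large as the observed one. A minimal sketch, assuming sentence-aligned outputs from both systems and corpus BLEU via sacreBLEU as the metric, could look like this:

```python
import random
import sacrebleu

def approx_randomization(sys_a, sys_b, refs, trials=10_000, seed=0):
    """Two-sided AR test on the corpus BLEU difference between two systems."""
    rng = random.Random(seed)
    score = lambda hyps: sacrebleu.corpus_bleu(hyps, [refs]).score
    observed = abs(score(sys_a) - score(sys_b))
    count = 0
    for _ in range(trials):
        # Randomly swap the two systems' outputs for each sentence.
        swapped_a, swapped_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:
                a, b = b, a
            swapped_a.append(a)
            swapped_b.append(b)
        if abs(score(swapped_a) - score(swapped_b)) >= observed:
            count += 1
    return (count + 1) / (trials + 1)  # p-value estimate
```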
Limitations & Ethical Considerations
| Category | Details |
|---|---|
| Linguistic Limitations | Model may struggle with rare morphology, honorific forms, and culturally grounded expressions |
| Cultural Sensitivity | Sensitive cultural or mythological meanings may require expert review |
| Data Bias | Bias inherited from synthetic data cannot be fully eliminated |
| Deployment Scope | Not suitable for automated translation of critical documents |
Citation
Dugi, T., & Sambyo, K. (2025). A Knowledge-Integrated System for Quality-Driven Filtering of Low-Resource Tagin-English Bitexts. National Institute of Technology Arunachal Pradesh.
License
MIT or Apache-2.0 is recommended for research openness and compatibility. (Data licensing depends on the availability of the source texts.)
Contact
For collaboration, feedback, or issues: [email protected]
Base model
- facebook/mbart-large-50-many-to-many-mmt