mBART-tgj-final

mBART-tgj-final is a fine-tuned sequence-to-sequence model based on mBART-50 many-to-many (mBART-50-M2M) for Tagin → English neural machine translation. The model was trained on a large mixed corpus of synthetic and manually curated sentence pairs filtered by a Knowledge-Integrated Filtering System (KIFS), yielding significant improvements in translation quality for an extremely low-resource language.
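
The KIFS pipeline itself is described in the accompanying paper and is not shipped with this checkpoint. Purely to illustrate the general shape of quality-driven bitext filtering, here is a minimal hedged sketch; the length-ratio heuristic and the `score_pair` hook are hypothetical stand-ins, not the actual system:

# Hedged sketch of quality-driven bitext filtering. This is NOT the paper's
# KIFS implementation; it only illustrates the general shape of such a filter.
def length_ratio_ok(src: str, tgt: str, low: float = 0.5, high: float = 2.0) -> bool:
    # Cheap heuristic: reject pairs whose token-length ratio is implausible.
    ratio = max(len(src.split()), 1) / max(len(tgt.split()), 1)
    return low <= ratio <= high

def filter_bitext(pairs, score_pair=None, min_score=0.5):
    # `score_pair` is a hypothetical hook for a model-based quality estimator
    # (e.g. a cross-lingual similarity score in [0, 1]).
    kept = []
    for src, tgt in pairs:
        if not length_ratio_ok(src, tgt):
            continue
        if score_pair is not None and score_pair(src, tgt) < min_score:
            continue
        kept.append((src, tgt))
    return kept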

Developer

  • Name: Tungon Dugi
  • Institution: National Institute of Technology Arunachal Pradesh

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("Repleeka/mBART-tgj-final")
tokenizer = MBart50TokenizerFast.from_pretrained("Repleeka/mBART-tgj-final")

# Translate Tagin → English. The fine-tuned tokenizer is assumed to register
# "<tgj_IN>" as the custom Tagin language code added during fine-tuning.
tokenizer.src_lang = "<tgj_IN>"
text = "..."  # replace with a Tagin source sentence
inputs = tokenizer(text, return_tensors="pt")

# Force English ("en_XX") as the target language for generation.
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    num_beams=5,
    max_length=128,
)

print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
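
The snippet above runs on CPU. A small optional addition (standard PyTorch/Transformers usage, not specific to this model) moves inference to a GPU when one is available:

# Optional: run generation on GPU when available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}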

Intended Use

  • Research in low-resource and endangered language technology
  • Tagin language documentation and preservation
  • Transfer learning experiments within the Tani language family
  • Domain adaptation and linguistic resource development

Not intended for

  • High-stakes applications (medical, legal, safety-critical)
  • Fully automated decision systems without human validation

Model Architecture

Component         Value
Base Model        mBART-50 Large
Layers            12-layer encoder & decoder
Hidden size       1024
FFN dimension     4096
Attention heads   16
Vocabulary size   250,055
Tokenizer         MBart50Tokenizer
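
These figures can be cross-checked against the loaded checkpoint; the sketch below assumes only the standard MBartConfig attribute names:

# Cross-check the table above against the checkpoint's configuration.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Repleeka/mBART-tgj-final")
print(config.encoder_layers, config.decoder_layers)   # expected: 12 12
print(config.d_model)                                 # expected: 1024
print(config.encoder_ffn_dim)                         # expected: 4096
print(config.encoder_attention_heads)                 # expected: 16
print(config.vocab_size)                              # expected: 250055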

Evaluation Results

Metric   Score
BLEU     40.27
ChrF     59.38
METEOR   46.29
TER      44.42
Loss     0.0503
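
BLEU, ChrF, and TER can be recomputed with standard tooling; a minimal sketch using sacrebleu, assuming the system outputs and test-set references are available as parallel lists (METEOR is not in sacrebleu and can be computed with nltk or evaluate instead):

# Recompute BLEU, ChrF, and TER with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["..."]        # system translations, one string per test sentence
references = [["..."]]      # one reference stream, aligned with hypotheses

print(sacrebleu.corpus_bleu(hypotheses, references).score)
print(sacrebleu.corpus_chrf(hypotheses, references).score)
print(sacrebleu.corpus_ter(hypotheses, references).score)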

Statistical Significance Test Results

Evaluation was performed on an 890-sentence held-out test set with approximate randomization significance testing (p < 0.01), showing large performance gains over the baseline mBART-tgj-base.

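For reference, approximate randomization tests whether the observed score gap between two systems could arise by chance: sentence outputs are repeatedly swapped between the systems at random, and the p-value is the fraction of shuffles whose gap is at least the observed one. A minimal sketch on corpus BLEU (a hypothetical helper, not the paper's implementation):

# Hedged sketch of a two-sided approximate randomization test on corpus BLEU.
# sys_a / sys_b are the two systems' outputs and refs the shared references,
# all aligned lists of strings.
import random
import sacrebleu

def approximate_randomization(sys_a, sys_b, refs, trials=1000, seed=0):
    rng = random.Random(seed)
    bleu = lambda hyps: sacrebleu.corpus_bleu(hyps, [refs]).score
    observed = abs(bleu(sys_a) - bleu(sys_b))
    exceed = 0
    for _ in range(trials):
        # Swap each sentence's outputs between the two systems with p = 0.5.
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(bleu(shuf_a) - bleu(shuf_b)) >= observed:
            exceed += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (exceed + 1) / (trials + 1)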

Limitations & Ethical Considerations

  • Linguistic limitations: the model may struggle with rare morphology, honorific forms, and culturally grounded expressions.
  • Cultural sensitivity: sensitive cultural or mythological meanings may require expert review.
  • Data bias: bias inherited from synthetic data cannot be fully eliminated.
  • Deployment scope: not suitable for automated translation of critical documents.

Citation

Dugi, T., & Sambyo, K. (2025). A Knowledge-Integrated System for Quality-Driven Filtering of Low-Resource Tagin-English Bitexts. National Institute of Technology Arunachal Pradesh.

License

MIT or Apache-2.0 is recommended for research openness and compatibility. (Data licensing depends on source-text availability.)

Contact

For collaboration, feedback, or issues: [email protected]
