mBART-tgj-final

mBART-tgj-final is a fine-tuned sequence-to-sequence model based on mBART-50 many-to-many (mBART-50-M2M) for Tagin → English neural machine translation. The model was trained on a large mixed corpus of synthetic and manually curated sentence pairs filtered by a Knowledge-Integrated Filtering System (KIFS), yielding significant improvements in translation quality for an extremely low-resource language.
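
The KIFS pipeline itself is described in the accompanying paper and is not shipped with this checkpoint. Purely to illustrate the general shape of quality-driven bitext filtering, here is a minimal hedged sketch; the length-ratio heuristic and the `score_pair` hook are hypothetical stand-ins, not the actual system:

# Hedged sketch of quality-driven bitext filtering. This is NOT the paper's
# KIFS implementation; it only illustrates the general shape of such a filter.
def length_ratio_ok(src: str, tgt: str, low: float = 0.5, high: float = 2.0) -> bool:
    # Cheap heuristic: reject pairs whose token-length ratio is implausible.
    ratio = max(len(src.split()), 1) / max(len(tgt.split()), 1)
    return low <= ratio <= high

def filter_bitext(pairs, score_pair=None, min_score=0.5):
    # `score_pair` is a hypothetical hook for a model-based quality estimator
    # (e.g. a cross-lingual similarity score in [0, 1]).
    kept = []
    for src, tgt in pairs:
        if not length_ratio_ok(src, tgt):
            continue
        if score_pair is not None and score_pair(src, tgt) < min_score:
            continue
        kept.append((src, tgt))
    return kept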

Developer

  • Name: Tungon Dugi
  • Institution: National Institute of Technology Arunachal Pradesh

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast

model = MBartForConditionalGeneration.from_pretrained("Repleeka/mBART-tgj-final")
tokenizer = MBart50TokenizerFast.from_pretrained("Repleeka/mBART-tgj-final")

# Translate Tagin → English. The fine-tuned tokenizer is assumed to register
# "<tgj_IN>" as the custom Tagin language code added during fine-tuning.
tokenizer.src_lang = "<tgj_IN>"
text = "..."  # replace with a Tagin source sentence
inputs = tokenizer(text, return_tensors="pt")

# Force English ("en_XX") as the target language for generation.
generated_tokens = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("en_XX"),
    num_beams=5,
    max_length=128,
)

print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
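
The snippet above runs on CPU. A small optional addition (standard PyTorch/Transformers usage, not specific to this model) moves inference to a GPU when one is available:

# Optional: run generation on GPU when available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}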

Intended Use

  • Research in low-resource and endangered language technology
  • Tagin language documentation and preservation
  • Transfer learning experiments within the Tani language family
  • Domain adaptation and linguistic resource development

Not intended for

  • High-stakes applications (medical, legal, safety-critical)
  • Fully automated decision systems without human validation

Model Architecture

Component         Value
Base Model        mBART-50 Large
Layers            12-layer encoder & decoder
Hidden size       1024
FFN dimension     4096
Attention heads   16
Vocabulary size   250,055
Tokenizer         MBart50Tokenizer
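
These figures can be cross-checked against the loaded checkpoint; the sketch below assumes only the standard MBartConfig attribute names:

# Cross-check the table above against the checkpoint's configuration.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Repleeka/mBART-tgj-final")
print(config.encoder_layers, config.decoder_layers)   # expected: 12 12
print(config.d_model)                                 # expected: 1024
print(config.encoder_ffn_dim)                         # expected: 4096
print(config.encoder_attention_heads)                 # expected: 16
print(config.vocab_size)                              # expected: 250055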

Evaluation Results

Metric   Score
BLEU     40.27
ChrF     59.38
METEOR   46.29
TER      44.42
Loss     0.0503
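
BLEU, ChrF, and TER can be recomputed with standard tooling; a minimal sketch using sacrebleu, assuming the system outputs and test-set references are available as parallel lists (METEOR is not in sacrebleu and can be computed with nltk or evaluate instead):

# Recompute BLEU, ChrF, and TER with sacrebleu (pip install sacrebleu).
import sacrebleu

hypotheses = ["..."]        # system translations, one string per test sentence
references = [["..."]]      # one reference stream, aligned with hypotheses

print(sacrebleu.corpus_bleu(hypotheses, references).score)
print(sacrebleu.corpus_chrf(hypotheses, references).score)
print(sacrebleu.corpus_ter(hypotheses, references).score)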

Statistical Significance Test Results

Evaluation was performed on an 890-sentence held-out test set with approximate randomization significance testing (p < 0.01), showing large performance gains over the baseline mBART-tgj-base.

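For reference, approximate randomization tests whether the observed score gap between two systems could arise by chance: sentence outputs are repeatedly swapped between the systems at random, and the p-value is the fraction of shuffles whose gap is at least the observed one. A minimal sketch on corpus BLEU (a hypothetical helper, not the paper's implementation):

# Hedged sketch of a two-sided approximate randomization test on corpus BLEU.
# sys_a / sys_b are the two systems' outputs and refs the shared references,
# all aligned lists of strings.
import random
import sacrebleu

def approximate_randomization(sys_a, sys_b, refs, trials=1000, seed=0):
    rng = random.Random(seed)
    bleu = lambda hyps: sacrebleu.corpus_bleu(hyps, [refs]).score
    observed = abs(bleu(sys_a) - bleu(sys_b))
    exceed = 0
    for _ in range(trials):
        # Swap each sentence's outputs between the two systems with p = 0.5.
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(bleu(shuf_a) - bleu(shuf_b)) >= observed:
            exceed += 1
    # Add-one smoothing keeps the p-value strictly positive.
    return (exceed + 1) / (trials + 1)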

Limitations & Ethical Considerations

  • Linguistic limitations: the model may struggle with rare morphology, honorific forms, and culturally grounded expressions.
  • Cultural sensitivity: sensitive cultural or mythological meanings may require expert review.
  • Data bias: bias inherited from synthetic data cannot be fully eliminated.
  • Deployment scope: not suitable for automated translation of critical documents.

Citation

Dugi, T., & Sambyo, K. (2025). A Knowledge-Integrated System for Quality-Driven Filtering of Low-Resource Tagin-English Bitexts. National Institute of Technology Arunachal Pradesh.

License

MIT or Apache-2.0 is recommended for research openness and compatibility. (Data licensing depends on source-text availability.)

Contact

For collaboration, feedback, or issues: [email protected]
