🪶 Model Card — Banjara → Telugu Translation (mBART Fine-tuned)

This model translates Banjara (Lambadi) language text into Telugu.
It is fine-tuned from the multilingual model facebook/mbart-large-50-many-to-many-mmt using a custom dataset of Banjara–Telugu sentence pairs.

🧠 Model Details

Model Description

Model Type: Seq2Seq Transformer (mBART-50)
Architecture: mBART-large-50-many-to-many-mmt
Languages: Banjara → Telugu
Base Model: facebook/mbart-large-50-many-to-many-mmt
Developed by: Badavath Narender
Framework: 🤗 Transformers
License: Apache 2.0
Fine-tuned Dataset Size: 265 parallel pairs
Training Epochs: 3
Batch Size: 2
Learning Rate: 2e-5
Optimizer: AdamW
Mixed Precision: FP16 (on CUDA)
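
As a rough sanity check on the training budget, the hyperparameters above imply only a few hundred optimizer updates. The sketch below assumes that all 265 pairs were used for training (no held-out split) and that no gradient accumulation was used; this card states neither.

```python
import math

# Values taken from the list above.
num_pairs = 265
batch_size = 2
epochs = 3

# Assumption (not stated in this card): every pair is used for training and
# each batch triggers exactly one optimizer step (no gradient accumulation).
steps_per_epoch = math.ceil(num_pairs / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch, total_steps)
```

Under these assumptions that is 133 steps per epoch and 399 steps in total, which is worth keeping in mind when reading the limitations below.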

🔗 Model Sources

Repository: narenderbadavath/banjara-mbart-finetuned
Base Model: facebook/mbart-large-50-many-to-many-mmt
Demo: Coming soon (Streamlit translator app)

💡 Uses

Direct Use

This model is suitable for:

Translating Banjara text into Telugu
Building AI assistants or translation chatbots for Banjara-speaking communities
Research on low-resource Indic language translation

Downstream Use

Integrate into speech translation pipelines (Whisper + mBART)
Use with Streamlit / Flask apps for multilingual communication tools
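
A speech translation pipeline of the kind mentioned above can be sketched as two chained steps: speech recognition, then text translation. The function below is a hypothetical helper, not part of this repository. Both steps are passed in as callables, so any recognizer (for example a Whisper wrapper) and any translator (for example this model's translate function) can be plugged in. Whether a given recognizer transcribes Banjara adequately is a separate question that this sketch does not address.

```python
from typing import Callable


def speech_to_telugu(
    audio_path: str,
    transcribe: Callable[[str], str],
    translate: Callable[[str], str],
) -> str:
    """Chain a speech recognizer and a translator, both supplied by the caller."""
    banjara_text = transcribe(audio_path).strip()
    if not banjara_text:
        # Nothing was recognized, so there is nothing to translate.
        return ""
    return translate(banjara_text)


if __name__ == "__main__":
    # Stand-in callables for illustration only; replace with real models.
    fake_transcribe = lambda path: "తు వారు చిక"
    fake_translate = lambda text: f"<translated: {text}>"
    print(speech_to_telugu("sample.wav", fake_transcribe, fake_translate))
```

Keeping the two stages behind plain callables makes it easy to swap either model or to test the glue code without loading any weights.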

Out-of-Scope Use

Not intended for official legal or medical translations
May not handle complex grammar or rare dialectal variations

⚠️ Bias, Risks, and Limitations

Known Limitations

The fine-tuning dataset is very small (265 parallel pairs), so generalization beyond sentences resembling the training data is limited
Certain idiomatic Banjara words may not have exact Telugu equivalents
Mixed-language sentences (Banjara + Hindi/Telugu) may confuse the model
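
One rough way to flag some mixed-language inputs before translation is to check whether a sentence mixes writing scripts. The sketch below is a heuristic under stated assumptions: it assumes Banjara input is written in Telugu script (as in the usage example in this card), and the function names are hypothetical. It only catches script mixing (for example Devanagari or Latin words inside Telugu-script text); it cannot detect Telugu-language words written in the same script.

```python
import unicodedata
from typing import Set


def scripts_used(text: str) -> Set[str]:
    """Return script names (first word of the Unicode character name) for letters and marks."""
    scripts = set()
    for ch in text:
        # Count letters and combining marks (vowel signs, virama); skip spaces, digits, punctuation.
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split(" ")[0])
    return scripts


def is_mixed_script(text: str) -> bool:
    """True when the text contains letters from more than one script."""
    return len(scripts_used(text)) > 1


if __name__ == "__main__":
    print(scripts_used("తు వారు చిక"))
    print(is_mixed_script("తు वारु"))
```

Inputs flagged this way could be routed to manual review rather than translated automatically.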

Recommendations

For better accuracy, fine-tune with a larger and diverse dataset
For critical applications, have a fluent human translator review the model's output before it is used

🚀 How to Use

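from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
import torch

# Load the fine-tuned checkpoint and its tokenizer from the Hugging Face Hub
model_name = "narenderbadavath/banjara-mbart-finetuned"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Run on GPU when available, otherwise fall back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def translate(text, target_lang="te_IN"):
    """Translate one Banjara sentence into the target language (Telugu by default)."""
    # mBART-50 selects the output language through the first generated token
    forced_bos = tokenizer.lang_code_to_id[target_lang]
    # No source-language code is set here, so the tokenizer's saved default is used
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    # Beam search with 5 beams; output is capped at 128 tokens
    outputs = model.generate(**inputs, forced_bos_token_id=forced_bos, num_beams=5, max_length=128)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(translate("తు వారు చిక", "te_IN"))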
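
The translate function above caps generated output at 128 tokens, so a long passage sent in one call may come back incomplete. A minimal workaround, sketched here with hypothetical helper names, is to split the text into sentences and translate them one at a time. The splitting rule (break after ".", "!" or "?" followed by whitespace) is an assumption and may need adjusting for real Banjara text.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Split on whitespace that follows '.', '!' or '?'; drop empty pieces."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def translate_long(text: str, translate: Callable[[str], str]) -> str:
    """Translate sentence by sentence and join the results with spaces."""
    return " ".join(translate(sentence) for sentence in split_sentences(text))


if __name__ == "__main__":
    # Stand-in translator for illustration; pass the real translate function instead.
    print(translate_long("one two. three four? five", lambda s: s.upper()))
```

Translating sentence by sentence loses cross-sentence context, so this trades some fluency for completeness.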

Weights Format: Safetensors
Model Size: 0.6B params
Tensor Type: F32