Model Card: Banjara → Telugu Translation (mBART Fine-tuned)
This model translates Banjara (Lambadi) language text into Telugu.
It is fine-tuned from the multilingual model facebook/mbart-large-50-many-to-many-mmt on a custom dataset of Banjara-Telugu sentence pairs.
Model Details
Model Description
- Model Type: Seq2Seq Transformer (mBART-50)
- Architecture: mBART-large-50-many-to-many-mmt
- Languages: Banjara → Telugu
- Base Model: facebook/mbart-large-50-many-to-many-mmt
- Developed by: Badavath Narender
- Framework: 🤗 Transformers
- License: Apache 2.0
- Fine-tuned Dataset Size: 265 parallel pairs
- Training Epochs: 3
- Batch Size: 2
- Learning Rate: 2e-5
- Optimizer: AdamW
- Mixed Precision: FP16 (on CUDA)
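To put these hyperparameters in perspective, the short calculation below estimates how many optimizer updates the fine-tuning run performed. It is only a sketch under assumptions this card does not state: all 265 pairs were used for training (no held-out split), training ran on a single device without gradient accumulation, and the final partial batch was kept.

```python
import math

# Values taken from the list above.
NUM_PAIRS = 265
BATCH_SIZE = 2
EPOCHS = 3

# Assumptions (not stated in the card): every pair is used for training,
# one device, no gradient accumulation, final partial batch is kept.
steps_per_epoch = math.ceil(NUM_PAIRS / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(f"steps per epoch: {steps_per_epoch}")    # 133
print(f"total optimizer steps: {total_steps}")  # 399
```

Under these assumptions the run amounts to roughly 400 updates, which is short for a model of this size and worth keeping in mind when judging output quality.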
Model Sources
- Repository: narenderbadavath/banjara-mbart-finetuned
- Base Model: facebook/mbart-large-50-many-to-many-mmt
- Demo (optional): Coming soon in Streamlit Translator App
Uses
Direct Use
This model is suitable for:
- Translating Banjara text into Telugu
- Building AI assistants or translation chatbots for Banjara-speaking communities
- Research on low-resource Indic language translation
Downstream Use
- Integrate into speech translation pipelines (Whisper + mBART)
- Use with Streamlit / Flask apps for multilingual communication tools
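A speech translation pipeline of the kind mentioned above can be pictured as two stages chained together. The sketch below is illustrative only: `speech_to_telugu`, `transcribe`, and `translate` are hypothetical names, both stages are passed in as plain callables, and whether Whisper can transcribe Banjara speech reliably is not established by this card.

```python
from typing import Callable


def speech_to_telugu(
    audio_path: str,
    transcribe: Callable[[str], str],
    translate: Callable[[str], str],
) -> str:
    """Chain a speech-to-text stage with a text translation stage.

    Both stages are injected (hypothetical interfaces), so any ASR model
    and any translation function can be plugged in.
    """
    banjara_text = transcribe(audio_path).strip()
    if not banjara_text:
        # Nothing was recognized, so there is nothing to translate.
        return ""
    return translate(banjara_text)


# Stand-in stages so the sketch runs without downloading any model.
def fake_transcribe(path: str) -> str:
    return f" transcript of {path} "


def fake_translate(text: str) -> str:
    return f"[te] {text}"


result = speech_to_telugu("clip.wav", fake_transcribe, fake_translate)
print(result)  # [te] transcript of clip.wav
```

In a real app, `transcribe` would wrap an ASR model and `translate` would wrap the `translate` function from the How to Use section.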
Out-of-Scope Use
- Not intended for official legal or medical translations
- May not handle complex grammar or rare dialectal variations
Bias, Risks, and Limitations
Known Limitations
- Dataset is relatively small (about 265 pairs), which limits generalization
- Certain idiomatic Banjara words may not have exact Telugu equivalents
- Mixed-language sentences (Banjara + Hindi/Telugu) may confuse the model
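One way to act on the mixed-language limitation is to flag suspicious inputs before translating them. The heuristic below is a rough sketch, not part of the model: it assumes Banjara input is written in Telugu script, and the function names and the 0.2 threshold are hypothetical. It can only catch script mixing (for example Hindi in Devanagari or Latin-script words); Banjara mixed with Telugu shares one script and will pass unnoticed.

```python
import unicodedata


def is_telugu_char(ch: str) -> bool:
    # Telugu Unicode block: U+0C00 to U+0C7F.
    return "\u0c00" <= ch <= "\u0c7f"


def foreign_script_ratio(text: str) -> float:
    """Fraction of letters and combining marks outside the Telugu block."""
    relevant = [ch for ch in text if unicodedata.category(ch)[0] in ("L", "M")]
    if not relevant:
        return 0.0
    foreign = sum(1 for ch in relevant if not is_telugu_char(ch))
    return foreign / len(relevant)


def looks_mixed(text: str, threshold: float = 0.2) -> bool:
    # `threshold` is an arbitrary illustrative cutoff, not a tuned value.
    return foreign_script_ratio(text) > threshold


print(looks_mixed("తెలుగు"))        # False: Telugu script only
print(looks_mixed("తెలుగు hello"))  # True: 5 of 11 letters are Latin
```

Inputs flagged this way could be routed to a human reviewer or shown with a warning instead of being translated silently.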
Recommendations
- For better accuracy, fine-tune on a larger and more diverse dataset
- Have a human translator review the output before using it in critical applications
How to Use
from transformers import MBartForConditionalGeneration, MBart50TokenizerFast
import torch

model_name = "narenderbadavath/banjara-mbart-finetuned"
tokenizer = MBart50TokenizerFast.from_pretrained(model_name)
model = MBartForConditionalGeneration.from_pretrained(model_name)

# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

def translate(text, target_lang="te_IN"):
    # Force the decoder to start with the target-language token (Telugu by default).
    forced_bos = tokenizer.lang_code_to_id[target_lang]
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True).to(device)
    # Beam search with 5 beams; output is capped at 128 tokens.
    outputs = model.generate(**inputs, forced_bos_token_id=forced_bos, num_beams=5, max_length=128)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]

print(translate("<Banjara sentence>", "te_IN"))  # replace the placeholder with Banjara text