# Qwen Kannada Translation – English ↔ Kannada
This model fine-tunes Qwen 1.8B for bidirectional English ↔ Kannada translation using a synthetic bilingual dataset generated with LLM prompting.
It demonstrates how small open-source models can be adapted for low-resource languages using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on Google Colab (T4 GPU).
## Model Details
### Model Description
- **Developed by:** Mahima Aruna
- **Model type:** Decoder-only causal language model fine-tuned for translation
- **Language(s):** English, Kannada
- **License:** Apache 2.0 (inherited from the Qwen base model)
- **Finetuned from:** Qwen-1.8B
- **Framework:** PyTorch + Hugging Face Transformers + PEFT
- **Hardware:** Google Colab (T4 GPU, 16 GB VRAM)
- **Dataset size:** ~2,000 bilingual text pairs
## Uses
### Direct Use
- Translation tasks between English and Kannada
- Educational demonstrations for low-resource fine-tuning
- Research on multilingual LLM adaptation
### Downstream Use
- Extend fine-tuning for additional Indic languages
- Integrate into multilingual chatbots or information retrieval pipelines
### Out-of-Scope Use
- Production or commercial use without domain adaptation
- Sensitive, factual, or policy-related translation tasks (the model was trained only on synthetic data)
## Bias, Risks, and Limitations
- The model is trained on synthetic data, so translations may occasionally be literal or stylistically inconsistent.
- Vocabulary coverage for domain-specific text (medical, legal, etc.) is limited.
- Potential bias from LLM-generated data (English-dominant phrasing).
### Recommendations
For downstream users:
- Use additional real bilingual data for domain-specific fine-tuning.
- Evaluate outputs qualitatively before deployment.
## How to Get Started with the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mahimaaruna04/qwen-kannada-translation"

# Qwen-1.8B is a decoder-only (causal) model; Qwen checkpoints that ship
# custom modeling code may require trust_remote_code=True to load.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

text = "Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Training Details
### Training Data
- **Dataset:** Synthetic English ↔ Kannada parallel corpus generated using LLM prompting (Gemini/GPT).
- **Format:** JSON pairs (English sentence, Kannada translation); a loading sketch follows this list.
- **Balance:** 50% English → Kannada, 50% Kannada → English.
- **Dataset Card:** [mahimaaruna04/kannada-english-translation-synthetic](https://huggingface.co/datasets/mahimaaruna04/kannada-english-translation-synthetic)
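
The exact record schema and prompt template are not published in this card; the sketch below assumes simple `{"en": ..., "kn": ...}` records and shows one way the 50/50 directional balance could be constructed.

```python
import json
import random

# Assumed schema and prompt template -- the card does not publish the
# exact field names or wording used during fine-tuning.
with open("kannada_english_pairs.json", encoding="utf-8") as f:
    pairs = json.load(f)  # e.g. [{"en": "Hello", "kn": "ನಮಸ್ಕಾರ"}, ...]

examples = []
for p in pairs:
    # English -> Kannada direction
    examples.append(f"Translate English to Kannada: {p['en']}\nTranslation: {p['kn']}")
    # Kannada -> English direction
    examples.append(f"Translate Kannada to English: {p['kn']}\nTranslation: {p['en']}")

random.shuffle(examples)  # interleave both directions (50/50 balance)
```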
---
### Training Procedure
#### Preprocessing
- Cleaned and normalized text
- Tokenization using `QwenTokenizer`
- Dropped long or malformed translation pairs (a filtering sketch follows this list)
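
A minimal sketch of the kind of filter described above, assuming the base Qwen-1.8B tokenizer and an illustrative 128-token cutoff (the card does not state the actual threshold or field names):

```python
from transformers import AutoTokenizer

# Base checkpoint, field names, and the 128-token cutoff are illustrative
# assumptions; the card only states that long or malformed pairs were dropped.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B", trust_remote_code=True)
MAX_TOKENS = 128

def keep_pair(en: str, kn: str) -> bool:
    """Keep non-empty pairs whose sides both fit within the length budget."""
    if not en.strip() or not kn.strip():
        return False
    return (
        len(tokenizer(en)["input_ids"]) <= MAX_TOKENS
        and len(tokenizer(kn)["input_ids"]) <= MAX_TOKENS
    )

pairs = [
    {"en": "Hello, how are you?", "kn": "ನಮಸ್ಕಾರ, ಹೇಗಿದ್ದೀರಾ?"},
    {"en": "", "kn": "ಖಾಲಿ"},  # malformed: empty English side
]
clean_pairs = [p for p in pairs if keep_pair(p["en"], p["kn"])]  # keeps 1 pair
```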
#### Training Hyperparameters
- **Method:** LoRA (PEFT) fine-tuning (configuration sketch after this list)
- **Rank:** 8
- **Alpha:** 16
- **Dropout:** 0.05
- **Batch size:** 8
- **Learning rate:** 2e-4
- **Epochs:** 3–5
- **Precision:** bfloat16
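
A minimal sketch of a PEFT setup matching the hyperparameters listed above; the `target_modules` names and the remaining `TrainingArguments` values are assumptions, not settings confirmed by the card.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, TrainingArguments

# Values below mirror the hyperparameter list; target_modules assumes the
# attention layer names of the original Qwen-1.8B architecture.
base_model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-1_8B", trust_remote_code=True, torch_dtype="auto"
)

lora_config = LoraConfig(
    r=8,                 # rank
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["c_attn", "c_proj"],
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # only the LoRA weights are trainable

training_args = TrainingArguments(
    output_dir="qwen-kannada-lora",  # hypothetical output path
    per_device_train_batch_size=8,
    learning_rate=2e-4,
    num_train_epochs=3,              # the card reports 3-5 epochs
    bf16=True,                       # precision listed above
)
```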
#### Speeds, Sizes, Times
- **Training time:** ~2.5 hours on Colab T4
- **Model size (LoRA-adapted):** ~1.9B effective parameters
---
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- Held-out subset from the same dataset (500 pairs, balanced English ↔ Kannada)
#### Metrics
- BLEU (automatic; a scoring sketch follows this list)
- Manual review for grammatical and contextual accuracy
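
The card does not state which BLEU implementation was used; a minimal scoring sketch with sacrebleu (an assumption) might look like this:

```python
import sacrebleu

# Hypothetical model outputs and references for the held-out pairs;
# sacrebleu is an assumption -- the card does not name the BLEU tool used.
hypotheses = ["ನಮಸ್ಕಾರ, ನೀವು ಹೇಗಿದ್ದೀರಾ?"]
references = [["ನಮಸ್ಕಾರ, ಹೇಗಿದ್ದೀರಾ?"]]  # one reference stream, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")
```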
### Results
| Metric | Direction | Score |
|---------|------------|--------|
| BLEU | EN → KN | 34.2 |
| BLEU | KN → EN | 32.8 |
> The fine-tuned model shows more fluent and context-aware translations compared to the base Qwen-1.8B, particularly for shorter conversational sentences.
### Summary
Even with synthetic data, the fine-tuning improves cross-lingual fluency, demonstrating the power of PEFT-based adaptation for low-resource languages.
---
## Citation & Acknowledgement
Please cite the dataset if you use or refer to it:
```bibtex
@misc{mahima2025kannet,
  title  = {Kannada-English Translation Synthetic Dataset},
  author = {Mahima Aruna},
  year   = {2025},
  url    = {https://huggingface.co/datasets/mahimaaruna04/kannada-english-translation-synthetic}
}
```