# Qwen Kannada Translation – English ↔ Kannada

This model is Qwen-1.8B fine-tuned for bidirectional English ↔ Kannada translation on a synthetic bilingual dataset generated with LLM prompting.
It demonstrates how small open-source models can be adapted to low-resource languages using Parameter-Efficient Fine-Tuning (PEFT) with LoRA on Google Colab (T4 GPU).


## Model Details

### Model Description

- **Developed by:** Mahima Aruna
- **Model type:** Decoder-only causal language model fine-tuned for translation
- **Language(s):** English, Kannada
- **License:** Apache 2.0 (inherits from Qwen)
- **Finetuned from:** Qwen-1.8B
- **Framework:** PyTorch + Hugging Face Transformers + PEFT
- **Hardware:** Google Colab (T4 GPU, 16 GB VRAM)
- **Dataset size:** ~2,000 bilingual text pairs

## Uses

### Direct Use

- Translation tasks between English and Kannada
- Educational demonstrations for low-resource fine-tuning
- Research on multilingual LLM adaptation

### Downstream Use

- Extend fine-tuning to additional Indic languages
- Integrate into multilingual chatbots or information-retrieval pipelines

### Out-of-Scope Use

- Production or commercial use without domain adaptation
- Sensitive, factual, or policy-critical translation tasks (the model was trained only on synthetic data)

## Bias, Risks, and Limitations

- The model is trained on synthetic data, so translations may occasionally be literal or stylistically inconsistent.
- Vocabulary coverage for domain-specific text (medical, legal, etc.) is limited.
- LLM-generated data can introduce bias, such as English-dominant phrasing carried over into Kannada output.

### Recommendations

For downstream users:

- Use additional real bilingual data for domain-specific fine-tuning.
- Evaluate outputs qualitatively before deployment.

## How to Get Started with the Model

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "mahimaaruna04/qwen-kannada-translation"
# Qwen-1.8B is decoder-only, so it loads as a causal LM
# (not AutoModelForSeq2SeqLM); first-generation Qwen checkpoints
# also require trust_remote_code=True.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# The prompt below is an assumption; match whatever template was used
# during fine-tuning.
text = "Translate to Kannada: Hello, how are you?"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
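
If the repository publishes only the LoRA adapter weights rather than a merged checkpoint, loading would instead go through `peft`. This is a minimal sketch under that assumption:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Assumption: the repo holds a LoRA adapter on top of the Qwen-1.8B base.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B", trust_remote_code=True)
model = PeftModel.from_pretrained(base, "mahimaaruna04/qwen-kannada-translation")
```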

## Training Details

### Training Data
- **Dataset:** Synthetic English↔Kannada parallel corpus generated using LLM prompting (Gemini/GPT).  
- **Format:** JSON (English sentence, Kannada translation); see the loading sketch below.  
- **Balance:** 50% English → Kannada, 50% Kannada → English.  
- **Dataset Card:** [mahimaaruna04/kannada-english-translation-synthetic](https://huggingface.co/datasets/mahimaaruna04/kannada-english-translation-synthetic)  
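
To make the record layout concrete, here is a minimal loading sketch using 🤗 `datasets`; the split name and the `english`/`kannada` field names are assumptions, so check the dataset card above for the actual schema.

```python
from datasets import load_dataset

# Split and field names are assumptions; see the dataset card for the schema.
ds = load_dataset("mahimaaruna04/kannada-english-translation-synthetic", split="train")
print(ds[0])  # expected shape: {"english": "...", "kannada": "..."}
```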

---

### Training Procedure

#### Preprocessing
- Cleaned and normalized text  
- Tokenization with the Qwen tokenizer (loaded via `AutoTokenizer`)  
- Dropped long or malformed translation pairs (see the filtering sketch below)  
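
A minimal sketch of such a filter, reusing `ds` and `tokenizer` from the snippets above; the 128-token cap is an assumption rather than a documented value:

```python
MAX_TOKENS = 128  # assumption: the actual length cutoff is not documented

def keep_pair(example):
    """Drop empty/malformed pairs and pairs that are too long on either side."""
    src, tgt = example.get("english", ""), example.get("kannada", "")
    if not src or not tgt:
        return False
    return (len(tokenizer(src).input_ids) <= MAX_TOKENS
            and len(tokenizer(tgt).input_ids) <= MAX_TOKENS)

clean_ds = ds.filter(keep_pair)
```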

#### Training Hyperparameters
- **Method:** LoRA (PEFT) fine-tuning; see the configuration sketch below  
- **Rank:** 8  
- **Alpha:** 16  
- **Dropout:** 0.05  
- **Batch size:** 8  
- **Learning rate:** 2e-4  
- **Epochs:** 35  
- **Precision:** bfloat16  
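
In PEFT terms, these settings map onto a `LoraConfig` like the following, applied to the base model loaded as in the earlier snippets. `target_modules` is an assumption, since the card does not list the adapted modules:

```python
from peft import LoraConfig, TaskType, get_peft_model

# Rank/alpha/dropout come from the table above; target_modules is an
# assumption: first-generation Qwen exposes a fused "c_attn" projection.
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction is trainable
```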

#### Speeds, Sizes, Times
- **Training time:** ~2.5 hours on Colab T4  
- **Model size (LoRA-adapted):** ~1.9B effective parameters  

---

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data
- Held-out subset from the same dataset (500 pairs, balanced English↔Kannada)

#### Metrics
- BLEU (automatic; see the scoring sketch below)  
- Manual review for grammatical and contextual accuracy  
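
The BLEU implementation is not stated; this is a minimal corpus-level sketch with `sacrebleu`, assuming one reference per sentence:

```python
import sacrebleu

# Illustrative placeholders: hypotheses are model outputs, references the
# gold translations (one reference stream, aligned by index).
hypotheses = ["model output 1", "model output 2"]
references = [["gold translation 1", "gold translation 2"]]
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```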

### Results

| Metric | Direction | Score |
|---------|------------|--------|
| BLEU | EN → KN | 34.2 |
| BLEU | KN → EN | 32.8 |

> The fine-tuned model shows more fluent and context-aware translations compared to the base Qwen-1.8B, particularly for shorter conversational sentences.

### Summary
Even with purely synthetic training data, fine-tuning improves cross-lingual fluency, showing that PEFT-based adaptation is a practical route for low-resource languages.

---

## Citation & Acknowledgement

Please cite the dataset if you use or refer to it:

```bibtex
@misc{mahima2025kannet,
  title = {Kannada-English Translation Synthetic Dataset},
  author = {Mahima Aruna},
  year = {2025},
  url = {https://huggingface.co/datasets/mahimaaruna04/kannada-english-translation-synthetic}
}
```