Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language
Overview
Mizo-RoBERTa is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.
This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful KhasiBERT model.
Key Highlights
- Architecture: RoBERTa-base (110M parameters)
- Training Scale: 5.94M sentences, 138.7M tokens
- Open Data: 4M sentences publicly available at mizo-language-corpus-4M
- Custom Tokenizer: Trained specifically for Mizo (30K BPE vocabulary)
- Efficient: Trained in ~4-6 hours on a single NVIDIA A40 GPU
- Open Source: Model, tokenizer, and training code publicly available
Model Details
Architecture
| Component | Specification |
|---|---|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |
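These values can be checked directly against the published configuration. A minimal sanity-check sketch (note that RoBERTa checkpoints typically store max_position_embeddings as the usable sequence length plus two offset positions, i.e. 514):

from transformers import AutoConfig

# Load the published config and compare against the table above
config = AutoConfig.from_pretrained("MWireLabs/mizo-roberta")
print(config.num_hidden_layers, config.num_attention_heads,
      config.hidden_size, config.intermediate_size, config.vocab_size)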
Training Configuration
| Setting | Value |
|---|---|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |
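The snippet below is not the original training script, only a sketch of how the reported hyperparameters map onto a standard Hugging Face masked-language-modeling setup; the 15% masking probability is the usual RoBERTa default and is an assumption here, as it is not stated above.

from transformers import (DataCollatorForLanguageModeling, RobertaTokenizerFast,
                          TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Dynamic MLM masking; 0.15 is the standard default, assumed rather than reported
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Mirrors the settings in the table above
args = TrainingArguments(
    output_dir="./mizo-roberta-pretraining",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=10_000,
    num_train_epochs=2,
    fp16=True,
)
# args and collator would then be passed to a Trainer together with a tokenized corpus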
Training Data
Trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens) with an average of 23.3 tokens per sentence. The corpus includes:
- News articles from major Mizo publications
- Literature and written content
- Social media text
- Government documents and official communications
- Web content from Mizo language websites
Public Dataset: 4 million sentences are openly available at MWireLabs/mizo-language-corpus-4M for research and development purposes.
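The public subset can be pulled directly with the datasets library; a minimal sketch (split and column names should be checked against the dataset card, as they are not specified here):

from datasets import load_dataset

# Load the 4M-sentence public corpus
corpus = load_dataset("MWireLabs/mizo-language-corpus-4M")
print(corpus)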
Data Preprocessing
- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions (see the toy sketch after this list)
- Custom sentence segmentation for Mizo punctuation
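A toy sketch of the normalization, exact-deduplication, and length-filtering steps listed above; the thresholds are illustrative assumptions, not the values used to build the corpus.

import unicodedata

def clean(sentences, min_tokens=3, max_tokens=200):
    """Toy normalization, exact-deduplication, and length filtering (illustrative thresholds)."""
    seen, kept = set(), []
    for s in sentences:
        s = unicodedata.normalize("NFC", s).strip()      # Unicode normalization
        n_tokens = len(s.split())
        if not (min_tokens <= n_tokens <= max_tokens):   # length-based quality filter
            continue
        if s in seen:                                     # exact-duplicate removal
            continue
        seen.add(s)
        kept.append(s)
    return kept

print(clean([
    "Aizawl hi Mizoram khawpui ber a ni",
    "Aizawl hi Mizoram khawpui ber a ni",  # exact duplicate, dropped
    "ni",                                   # too short, dropped
]))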
Data Split
- Training: 5,350,122 sentences (90%)
- Validation: 297,229 sentences (5%)
- Test: 297,230 sentences (5%)
Performance
Language Modeling
| Metric | Value |
|---|---|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |
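For reference, perplexity here is the exponential of the mean test cross-entropy loss: exp(2.76) ≈ 15.8, matching the reported perplexity up to rounding.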
Qualitative Examples
The model demonstrates strong understanding of Mizo linguistic patterns and context:
Example 1: Geographic Knowledge
Input: "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
• pawimawh (important) - 9.0%
• State - 4.9%
• ropui (big) - 4.5%
Example 2: Urban Context
Input: "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
• khawpui (city) ✓ - 12.9%
• ta - 5.1%
• chhung - 3.9%
✓ Correctly identifies Aizawl as a city (khawpui)
Comparison with Multilingual Models
While we have not performed a direct evaluation against multilingual models on this test set, similar monolingual models for low-resource languages (e.g., KhasiBERT for Khasi) have achieved 45-50× lower perplexity than multilingual baselines such as mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to show comparable advantages on Mizo language tasks.
Usage
Installation
pip install transformers torch datasets
Quick Start: Masked Language Modeling
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline
# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Predict masked words
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)
for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")
Extract Embeddings
import torch
# Encode text (reuses the model and tokenizer loaded in the Quick Start example above)
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Use last hidden state
last_hidden = outputs.hidden_states[-1]
# Mean pooling for sentence embedding
sentence_embedding = last_hidden.mean(dim=1)
print(f"Embedding shape: {sentence_embedding.shape}")
# Output: torch.Size([1, 768])
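As a short follow-up, mean-pooled embeddings can be compared with cosine similarity, which is one simple way to approach the semantic-similarity use case listed under Applications below. A minimal sketch reusing the model and tokenizer from the examples above (the two sentences are taken from elsewhere in this card):

import torch.nn.functional as F

def embed(text):
    # Mean-pool the last hidden state into a single sentence vector
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1)

emb_a = embed("Mizo tawng hi kan hman thin a ni")
emb_b = embed("Aizawl hi Mizoram khawpui ber a ni")
print(f"Cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")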
Fine-tuning for Classification
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, Trainer, TrainingArguments
from datasets import load_dataset

# Load model and tokenizer for sequence classification
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3  # e.g., for sentiment: positive, neutral, negative
)
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Load your labeled dataset
# Example: sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)
# Train
trainer.train()
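After training, the fine-tuned classifier can be used directly for inference. A minimal sketch, assuming the model and tokenizer above are still in memory (label indices map to whatever scheme your dataset uses):

from transformers import pipeline

# Run the fine-tuned classifier on new text
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Mizo tawng hi kan hman thin a ni"))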
Batch Processing
# Process multiple sentences efficiently
sentences = [
"Aizawl hi Mizoram khawpui ber a ni",
"Mizo tawng hi Mizoram official language a ni",
"India ram Northeast a Mizoram hi a awm"
]
# Tokenize batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Get predictions
with torch.no_grad():
outputs = model(**inputs)
# Process outputs as needed
Applications
Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:
- Text Classification (sentiment analysis, topic classification, news categorization)
- Named Entity Recognition (NER for Mizo entities)
- Question Answering (extractive QA systems)
- Semantic Similarity (sentence/document similarity)
- Information Retrieval (semantic search in Mizo content)
- Language Understanding (natural language inference, textual entailment)
Limitations
- Dialectal Coverage: The model may not comprehensively represent all Mizo dialects
- Domain Balance: Formal written text may be overrepresented compared to conversational Mizo
- Pretraining Objective: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- Context Length: Limited to 512 tokens; longer documents require chunking (see the sliding-window sketch after this list)
- Low-resource Constraints: While large for Mizo, the training corpus is still smaller than high-resource language datasets
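For documents longer than 512 tokens, one common workaround is to split the text into overlapping windows and process each window separately. A minimal sketch reusing the tokenizer loaded earlier; the 64-token stride is an illustrative choice, not a recommendation from the authors:

# Split a long document into overlapping 512-token windows
long_text = " ".join(["Mizo tawng hi kan hman thin a ni"] * 400)  # stand-in for a long document

encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    padding="max_length",
    stride=64,                       # 64-token overlap between consecutive windows
    return_overflowing_tokens=True,
    return_tensors="pt",
)
print(f"Number of 512-token windows: {encoded['input_ids'].shape[0]}")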
Ethical Considerations
- Representation: The model reflects the content and potential biases present in the training corpus
- Intended Use: Designed for research and applications that benefit Mizo language speakers
- Misuse Potential: Should not be used for generating misleading information or harmful content
- Data Privacy: Training data was collected from publicly available sources; no private information was used
- Cultural Sensitivity: Users should be aware of cultural context when deploying for Mizo-speaking communities
Citation
If you use Mizo-RoBERTa in your research or applications, please cite:
@misc{mizoroberta2025,
title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
author={MWireLabs},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
Related Resources
- Public Training Data: mizo-language-corpus-4M
- Sister Model: KhasiBERT - RoBERTa model for Khasi language
- Organization: MWireLabs on HuggingFace
Model Card Contact
For questions, issues, or collaboration opportunities:
- Organization: MWireLabs
- Email: Contact through HuggingFace
- Issues: Report on the model's HuggingFace page
License
This model is released under the Apache 2.0 License. See LICENSE file for details.
Acknowledgments
We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.
MWireLabs - Building AI for Northeast India 🚀