Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language
Overview
Mizo-RoBERTa is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.
This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful KhasiBERT model.
Key Highlights
- Architecture: RoBERTa-base (110M parameters)
- Training Scale: 5.94M sentences, 138.7M tokens
- Open Data: 4M sentences publicly available at mizo-language-corpus-4M
- Custom Tokenizer: Trained specifically for Mizo (30K BPE vocabulary)
- Efficient: Trained in ~4-6 hours on a single NVIDIA A40 GPU
- Open Source: Model, tokenizer, and training code publicly available
Model Details
Architecture
| Component | Specification |
|---|---|
| Base Architecture | RoBERTa-base |
| Parameters | 109,113,648 (~110M) |
| Layers | 12 transformer layers |
| Attention Heads | 12 |
| Hidden Size | 768 |
| Intermediate Size | 3,072 |
| Max Sequence Length | 512 tokens |
| Vocabulary Size | 30,000 (custom BPE) |
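These values can be checked directly against the published configuration. A minimal sanity-check sketch (note that RoBERTa checkpoints typically store max_position_embeddings as the usable sequence length plus two offset positions, i.e. 514):

from transformers import AutoConfig

# Load the published config and compare against the table above
config = AutoConfig.from_pretrained("MWireLabs/mizo-roberta")
print(config.num_hidden_layers, config.num_attention_heads,
      config.hidden_size, config.intermediate_size, config.vocab_size)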
Training Configuration
| Setting | Value |
|---|---|
| Training Data | 5.94M sentences (138.7M tokens) |
| Public Dataset | 4M sentences available on HuggingFace |
| Batch Size | 32 per device |
| Learning Rate | 1e-4 |
| Optimizer | AdamW |
| Weight Decay | 0.01 |
| Warmup Steps | 10,000 |
| Training Epochs | 2 |
| Hardware | 1x NVIDIA A40 (48GB) |
| Training Time | ~4-6 hours |
| Precision | Mixed (FP16) |
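The snippet below is not the original training script, only a sketch of how the reported hyperparameters map onto a standard Hugging Face masked-language-modeling setup; the 15% masking probability is the usual RoBERTa default and is an assumption here, as it is not stated above.

from transformers import (DataCollatorForLanguageModeling, RobertaTokenizerFast,
                          TrainingArguments)

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Dynamic MLM masking; 0.15 is the standard default, assumed rather than reported
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

# Mirrors the settings in the table above
args = TrainingArguments(
    output_dir="./mizo-roberta-pretraining",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=10_000,
    num_train_epochs=2,
    fp16=True,
)
# args and collator would then be passed to a Trainer together with a tokenized corpus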
Training Data
Trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens) with an average of 23.3 tokens per sentence. The corpus includes:
- News articles from major Mizo publications
- Literature and written content
- Social media text
- Government documents and official communications
- Web content from Mizo language websites
Public Dataset: 4 million sentences are openly available at MWireLabs/mizo-language-corpus-4M for research and development purposes.
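The public subset can be pulled directly with the datasets library; a minimal sketch (split and column names should be checked against the dataset card, as they are not specified here):

from datasets import load_dataset

# Load the 4M-sentence public corpus
corpus = load_dataset("MWireLabs/mizo-language-corpus-4M")
print(corpus)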
Data Preprocessing
- Unicode normalization
- Language identification and filtering
- Deduplication (exact and near-duplicate removal)
- Quality filtering based on length and character distributions (see the toy sketch after this list)
- Custom sentence segmentation for Mizo punctuation
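A toy sketch of the normalization, exact-deduplication, and length-filtering steps listed above; the thresholds are illustrative assumptions, not the values used to build the corpus.

import unicodedata

def clean(sentences, min_tokens=3, max_tokens=200):
    """Toy normalization, exact-deduplication, and length filtering (illustrative thresholds)."""
    seen, kept = set(), []
    for s in sentences:
        s = unicodedata.normalize("NFC", s).strip()      # Unicode normalization
        n_tokens = len(s.split())
        if not (min_tokens <= n_tokens <= max_tokens):   # length-based quality filter
            continue
        if s in seen:                                     # exact-duplicate removal
            continue
        seen.add(s)
        kept.append(s)
    return kept

print(clean([
    "Aizawl hi Mizoram khawpui ber a ni",
    "Aizawl hi Mizoram khawpui ber a ni",  # exact duplicate, dropped
    "ni",                                   # too short, dropped
]))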
Data Split
- Training: 5,350,122 sentences (90%)
- Validation: 297,229 sentences (5%)
- Test: 297,230 sentences (5%)
Performance
Language Modeling
| Metric | Value |
|---|---|
| Test Perplexity | 15.85 |
| Test Loss | 2.76 |
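For reference, perplexity here is the exponential of the mean test cross-entropy loss: exp(2.76) ≈ 15.8, matching the reported perplexity up to rounding.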
Qualitative Examples
The model demonstrates strong understanding of Mizo linguistic patterns and context:
Example 1: Geographic Knowledge
Input: "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
• pawimawh (important) - 9.0%
• State - 4.9%
• ropui (big) - 4.5%
Example 2: Urban Context
Input: "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
• khawpui (city) ✓ - 12.9%
• ta - 5.1%
• chhung - 3.9%
✓ Correctly identifies Aizawl as a city (khawpui)
Comparison with Multilingual Models
While we have not performed a direct evaluation against multilingual models on this test set, similar monolingual models for low-resource languages (e.g., KhasiBERT for Khasi) have achieved 45-50× lower perplexity than multilingual baselines such as mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to show comparable advantages on Mizo language tasks.
Usage
Installation
pip install transformers torch datasets
Quick Start: Masked Language Modeling
from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline
# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)
# Predict masked words
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)
for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")
Extract Embeddings
import torch
# Encode text (reuses the model and tokenizer loaded in the Quick Start example above)
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)
# Use last hidden state
last_hidden = outputs.hidden_states[-1]
# Mean pooling for sentence embedding
sentence_embedding = last_hidden.mean(dim=1)
print(f"Embedding shape: {sentence_embedding.shape}")
# Output: torch.Size([1, 768])
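As a short follow-up, mean-pooled embeddings can be compared with cosine similarity, which is one simple way to approach the semantic-similarity use case listed under Applications below. A minimal sketch reusing the model and tokenizer from the examples above (the two sentences are taken from elsewhere in this card):

import torch.nn.functional as F

def embed(text):
    # Mean-pool the last hidden state into a single sentence vector
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True)
    return outputs.hidden_states[-1].mean(dim=1)

emb_a = embed("Mizo tawng hi kan hman thin a ni")
emb_b = embed("Aizawl hi Mizoram khawpui ber a ni")
print(f"Cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")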
Fine-tuning for Classification
from transformers import RobertaForSequenceClassification, RobertaTokenizerFast, Trainer, TrainingArguments
from datasets import load_dataset

# Load model and tokenizer for sequence classification
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3  # e.g., for sentiment: positive, neutral, negative
)
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Load your labeled dataset
# Example: sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)
# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)
# Train
trainer.train()
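After training, the fine-tuned classifier can be used directly for inference. A minimal sketch, assuming the model and tokenizer above are still in memory (label indices map to whatever scheme your dataset uses):

from transformers import pipeline

# Run the fine-tuned classifier on new text
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)
print(classifier("Mizo tawng hi kan hman thin a ni"))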
Batch Processing
# Process multiple sentences efficiently
sentences = [
"Aizawl hi Mizoram khawpui ber a ni",
"Mizo tawng hi Mizoram official language a ni",
"India ram Northeast a Mizoram hi a awm"
]
# Tokenize batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
# Get predictions
with torch.no_grad():
outputs = model(**inputs)
# Process outputs as needed
Applications
Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:
- Text Classification (sentiment analysis, topic classification, news categorization)
- Named Entity Recognition (NER for Mizo entities)
- Question Answering (extractive QA systems)
- Semantic Similarity (sentence/document similarity)
- Information Retrieval (semantic search in Mizo content)
- Language Understanding (natural language inference, textual entailment)
Limitations
- Dialectal Coverage: The model may not comprehensively represent all Mizo dialects
- Domain Balance: Formal written text may be overrepresented compared to conversational Mizo
- Pretraining Objective: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
- Context Length: Limited to 512 tokens; longer documents require chunking (see the sliding-window sketch after this list)
- Low-resource Constraints: While large for Mizo, the training corpus is still smaller than high-resource language datasets
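For documents longer than 512 tokens, one common workaround is to split the text into overlapping windows and process each window separately. A minimal sketch reusing the tokenizer loaded earlier; the 64-token stride is an illustrative choice, not a recommendation from the authors:

# Split a long document into overlapping 512-token windows
long_text = " ".join(["Mizo tawng hi kan hman thin a ni"] * 400)  # stand-in for a long document

encoded = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    padding="max_length",
    stride=64,                       # 64-token overlap between consecutive windows
    return_overflowing_tokens=True,
    return_tensors="pt",
)
print(f"Number of 512-token windows: {encoded['input_ids'].shape[0]}")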
Ethical Considerations
- Representation: The model reflects the content and potential biases present in the training corpus
- Intended Use: Designed for research and applications that benefit Mizo language speakers
- Misuse Potential: Should not be used for generating misleading information or harmful content
- Data Privacy: Training data was collected from publicly available sources; no private information was used
- Cultural Sensitivity: Users should be aware of cultural context when deploying for Mizo-speaking communities
Citation
If you use Mizo-RoBERTa in your research or applications, please cite:
@misc{mizoroberta2025,
title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
author={MWireLabs},
year={2025},
publisher={HuggingFace},
howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}
Related Resources
- Public Training Data: mizo-language-corpus-4M
- Sister Model: KhasiBERT - RoBERTa model for Khasi language
- Organization: MWireLabs on HuggingFace
Model Card Contact
For questions, issues, or collaboration opportunities:
- Organization: MWireLabs
- Email: Contact through HuggingFace
- Issues: Report on the model's HuggingFace page
License
This model is released under the Apache 2.0 License. See LICENSE file for details.
Acknowledgments
We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.
MWireLabs - Building AI for Northeast India 🚀