
Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language


Advancing NLP for Northeast Indian Languages

Overview

Mizo-RoBERTa is a transformer-based language model for Mizo, a Tibeto-Burman language spoken by approximately 1.1 million people primarily in Mizoram, Northeast India. Built on the RoBERTa architecture and trained on a large-scale curated corpus, this model provides state-of-the-art language understanding capabilities for Mizo NLP applications.

This work is part of MWireLabs' initiative to develop foundational language models for underserved languages of Northeast India, following our successful KhasiBERT model.

Key Highlights

  • Architecture: RoBERTa-base (110M parameters)
  • Training Scale: 5.94M sentences, 138.7M tokens
  • Open Data: 4M sentences publicly available at mizo-language-corpus-4M
  • Custom Tokenizer: Trained specifically for Mizo (30K BPE vocabulary)
  • Efficient: Single-epoch training on A40 GPU
  • Open Source: Model, tokenizer, and training code publicly available

Model Details

Architecture

  • Base Architecture: RoBERTa-base
  • Parameters: 109,113,648 (~110M)
  • Layers: 12 transformer layers
  • Attention Heads: 12
  • Hidden Size: 768
  • Intermediate Size: 3,072
  • Max Sequence Length: 512 tokens
  • Vocabulary Size: 30,000 (custom BPE)
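
These values can be checked directly against the published checkpoint. A minimal sketch, assuming the model and tokenizer are hosted at MWireLabs/mizo-roberta as described above:

from transformers import RobertaConfig, RobertaTokenizerFast

# Load the published config and the custom Mizo BPE tokenizer
config = RobertaConfig.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

print(config.num_hidden_layers, config.num_attention_heads, config.hidden_size)
# Expected per the list above: 12 12 768
print(config.vocab_size, len(tokenizer))
# Expected per the list above: 30000 30000
print(tokenizer.tokenize("Mizo tawng hi kan hman thin a ni"))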

Training Configuration

  • Training Data: 5.94M sentences (138.7M tokens)
  • Public Dataset: 4M sentences available on HuggingFace
  • Batch Size: 32 per device
  • Learning Rate: 1e-4
  • Optimizer: AdamW
  • Weight Decay: 0.01
  • Warmup Steps: 10,000
  • Training Epochs: 2
  • Hardware: 1x NVIDIA A40 (48GB)
  • Training Time: ~4-6 hours
  • Precision: Mixed (FP16)
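
For reference, the configuration above roughly corresponds to the Trainer setup sketched below. This is not the exact training script: the 15% masking rate and the tokenized_corpus dataset are assumptions, while the hyperparameters mirror the list above.

from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")

# Dynamic masking for the MLM objective (15% is the usual RoBERTa default; assumed here)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

training_args = TrainingArguments(
    output_dir="./mizo-roberta-mlm",
    per_device_train_batch_size=32,
    learning_rate=1e-4,
    weight_decay=0.01,
    warmup_steps=10_000,
    num_train_epochs=2,
    fp16=True,            # mixed precision, as in the list above
    logging_steps=500,
)

# trainer = Trainer(model=model, args=training_args,
#                   train_dataset=tokenized_corpus,   # hypothetical pre-tokenized Mizo corpus
#                   data_collator=data_collator)
# trainer.train()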

Training Data

Trained on a large-scale Mizo corpus comprising 5.94 million sentences (138.7 million tokens) with an average of 23.3 tokens per sentence. The corpus includes:

  • News articles from major Mizo publications
  • Literature and written content
  • Social media text
  • Government documents and official communications
  • Web content from Mizo language websites

Public Dataset: 4 million sentences are openly available at MWireLabs/mizo-language-corpus-4M for research and development purposes.

Data Preprocessing

  • Unicode normalization
  • Language identification and filtering
  • Deduplication (exact and near-duplicate removal)
  • Quality filtering based on length and character distributions
  • Custom sentence segmentation for Mizo punctuation
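
A minimal sketch of this kind of cleaning pipeline is shown below. The normalization form, length thresholds, and character-ratio cutoff are illustrative assumptions, not the exact values used to build the corpus.

import re
import unicodedata

def normalize(sentence: str) -> str:
    # Unicode normalization (NFC assumed here) and whitespace cleanup
    sentence = unicodedata.normalize("NFC", sentence)
    return re.sub(r"\s+", " ", sentence).strip()

def passes_quality_filters(sentence: str) -> bool:
    # Length and character-distribution filters (thresholds are illustrative)
    words = sentence.split()
    if not 3 <= len(words) <= 128:
        return False
    letters = sum(ch.isalpha() for ch in sentence)
    return letters / max(len(sentence), 1) >= 0.6

raw_lines = [
    "Mizoram hi  India rama state a ni ",
    "Mizoram hi India rama state a ni",   # exact duplicate after normalization
    "1234 5678",                          # fails the quality filters
]

seen, cleaned = set(), []
for line in raw_lines:
    sentence = normalize(line)
    # Exact deduplication only; near-duplicate removal would need e.g. MinHash
    if passes_quality_filters(sentence) and sentence not in seen:
        seen.add(sentence)
        cleaned.append(sentence)

print(cleaned)  # ['Mizoram hi India rama state a ni']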

Data Split

  • Training: 5,350,122 sentences (90%)
  • Validation: 297,229 sentences (5%)
  • Test: 297,230 sentences (5%)
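
The same proportions can be reproduced with the datasets library. A sketch, assuming the public corpus loads as a single train split:

from datasets import DatasetDict, load_dataset

# Load the public 4M-sentence corpus (assumed to expose a single "train" split)
corpus = load_dataset("MWireLabs/mizo-language-corpus-4M", split="train")

# 90% train, 10% held out, then split the held-out part into validation and test (5% each)
first = corpus.train_test_split(test_size=0.10, seed=42)
heldout = first["test"].train_test_split(test_size=0.50, seed=42)

splits = DatasetDict({
    "train": first["train"],
    "validation": heldout["train"],
    "test": heldout["test"],
})
print(splits)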

Performance

Language Modeling

  • Test Perplexity: 15.85
  • Test Loss: 2.76
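
Test perplexity is the exponential of the mean token-level cross-entropy loss, so the two numbers above are consistent (the small gap comes from rounding the reported loss):

import math

test_loss = 2.76                 # mean cross-entropy per token, as reported above
print(math.exp(test_loss))       # ≈ 15.80, in line with the reported perplexity of 15.85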

Qualitative Examples

The model demonstrates strong understanding of Mizo linguistic patterns and context:

Example 1: Geographic Knowledge

Input:  "Mizoram hi India rama <mask> tak a ni"
Top Predictions:
  • pawimawh (important) - 9.0%
  • State - 4.9%
  • ropui (big) - 4.5%

Example 2: Urban Context

Input:  "Aizawl hi Mizoram <mask> a ni"
Top Predictions:
  • khawpui (city) ✓ - 12.9%
  • ta - 5.1%
  • chhung - 3.9%

✓ Correctly identifies Aizawl as a city (khawpui)

Comparison with Multilingual Models

While we have not yet evaluated multilingual models directly on this test set, similar monolingual approaches for low-resource languages (e.g., KhasiBERT for Khasi) have achieved perplexities 45-50× lower than multilingual baselines such as mBERT and XLM-RoBERTa. We expect Mizo-RoBERTa to show comparable advantages on Mizo language tasks.

Usage

Installation

pip install transformers torch

Quick Start: Masked Language Modeling

from transformers import RobertaForMaskedLM, RobertaTokenizerFast, pipeline

# Load model and tokenizer
model = RobertaForMaskedLM.from_pretrained("MWireLabs/mizo-roberta")
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")

# Create fill-mask pipeline
fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Predict masked words
text = "Mizoram hi <mask> rama state a ni"
results = fill_mask(text)

for result in results:
    print(f"{result['score']:.3f}: {result['sequence']}")

Extract Embeddings

import torch

# Encode text (reuses `model` and `tokenizer` loaded in the Quick Start example above)
text = "Mizo tawng hi kan hman thin a ni"
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)

# Get contextualized embeddings
model.eval()
with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

    # Use last hidden state
    last_hidden = outputs.hidden_states[-1]

    # Mean pooling for sentence embedding
    sentence_embedding = last_hidden.mean(dim=1)

print(f"Embedding shape: {sentence_embedding.shape}")
# Output: torch.Size([1, 768])
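
When embedding a batch of sentences of different lengths, padding positions should be excluded from the average. A small extension of the pooling above, assuming `inputs` was tokenized with padding=True:

# Mask-aware mean pooling: ignore padding tokens when averaging
attention_mask = inputs["attention_mask"].unsqueeze(-1)          # (batch, seq_len, 1)
summed = (last_hidden * attention_mask).sum(dim=1)               # zero out padded positions
token_counts = attention_mask.sum(dim=1).clamp(min=1)            # avoid division by zero
sentence_embeddings = summed / token_counts                      # (batch, 768)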

Fine-tuning for Classification

from transformers import (
    RobertaForSequenceClassification,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load tokenizer and model for sequence classification
tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForSequenceClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=3  # e.g., for sentiment: positive, neutral, negative
)

# Load your labeled dataset
# Example: sentiment analysis dataset
dataset = load_dataset("your-dataset-name")

# Tokenize
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
)

# Initialize trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
)

# Train
trainer.train()

Batch Processing

# Process multiple sentences efficiently
sentences = [
    "Aizawl hi Mizoram khawpui ber a ni",
    "Mizo tawng hi Mizoram official language a ni",
    "India ram Northeast a Mizoram hi a awm"
]

# Tokenize batch
inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# Get predictions (reuses the masked-LM model and tokenizer loaded above)
with torch.no_grad():
    outputs = model(**inputs)

# outputs.logits has shape (batch_size, seq_len, vocab_size); process as needed
print(outputs.logits.shape)

Applications

Mizo-RoBERTa can be fine-tuned for various downstream NLP tasks:

  • Text Classification (sentiment analysis, topic classification, news categorization)
  • Named Entity Recognition (NER for Mizo entities; see the token-classification sketch after this list)
  • Question Answering (extractive QA systems)
  • Semantic Similarity (sentence/document similarity)
  • Information Retrieval (semantic search in Mizo content)
  • Language Understanding (natural language inference, textual entailment)
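
For example, a token-classification head for Mizo NER can be attached in the same way as the sequence-classification example above. The label set below is purely illustrative; no existing Mizo NER dataset is referenced here.

from transformers import RobertaForTokenClassification, RobertaTokenizerFast

# Illustrative BIO label set for a hypothetical Mizo NER task
labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = RobertaTokenizerFast.from_pretrained("MWireLabs/mizo-roberta")
model = RobertaForTokenClassification.from_pretrained(
    "MWireLabs/mizo-roberta",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)
# Fine-tune with Trainer on token-labelled Mizo data, analogously to the classification example.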

Limitations

  • Dialectal Coverage: The model may not comprehensively represent all Mizo dialects
  • Domain Balance: Formal written text may be overrepresented compared to conversational Mizo
  • Pretraining Objective: Only trained with Masked Language Modeling (MLM); may benefit from additional objectives
  • Context Length: Limited to 512 tokens; longer documents require chunking (see the sketch after this list)
  • Low-resource Constraints: While large for Mizo, the training corpus is still smaller than high-resource language datasets
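
One simple way to handle longer documents is overlapping-window chunking with the fast tokenizer. This is a sketch reusing the model and tokenizer loaded in the Usage section; the stride value and the stand-in document are arbitrary choices.

# Split a long Mizo document into overlapping 512-token windows
long_text = " ".join(["Mizoram hi India rama state a ni"] * 200)   # stand-in for a long document

enc = tokenizer(
    long_text,
    max_length=512,
    truncation=True,
    stride=64,                       # 64-token overlap between consecutive chunks (arbitrary)
    return_overflowing_tokens=True,
    padding=True,
    return_tensors="pt",
)

with torch.no_grad():
    outputs = model(
        input_ids=enc["input_ids"],
        attention_mask=enc["attention_mask"],
        output_hidden_states=True,
    )
# One set of hidden states per chunk; aggregate (e.g. mean) as needed downstream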

Ethical Considerations

  • Representation: The model reflects the content and potential biases present in the training corpus
  • Intended Use: Designed for research and applications that benefit Mizo language speakers
  • Misuse Potential: Should not be used for generating misleading information or harmful content
  • Data Privacy: Training data was collected from publicly available sources; no private information was used
  • Cultural Sensitivity: Users should be aware of cultural context when deploying for Mizo-speaking communities

Citation

If you use Mizo-RoBERTa in your research or applications, please cite:

@misc{mizoroberta2025,
  title={Mizo-RoBERTa: A Foundational Transformer Language Model for the Mizo Language},
  author={MWireLabs},
  year={2025},
  publisher={HuggingFace},
  howpublished={\url{https://huggingface.co/MWireLabs/mizo-roberta}}
}

Related Resources

  • Dataset: MWireLabs/mizo-language-corpus-4M (the public 4M-sentence Mizo corpus)
  • Related model: KhasiBERT, MWireLabs' foundational model for the Khasi language

Model Card Contact

For questions, issues, or collaboration opportunities:

  • Organization: MWireLabs
  • Email: Contact through HuggingFace
  • Issues: Report on the model's HuggingFace page

License

This model is released under the Apache 2.0 License. See LICENSE file for details.

Acknowledgments

We thank the Mizo language community and content creators whose publicly available work made this model possible. Special thanks to all contributors to the open-source NLP ecosystem, particularly the HuggingFace team for their excellent tools and infrastructure.


MWireLabs - Building AI for Northeast India 🚀
