Kren v1: Khasi Generative Language Model
Kren v1 is a generative language model for Khasi, an Indian language, produced through a publicly documented encoder-to-decoder conversion. The conversion transferred weights from MWirelabs/khasibert (a RoBERTa-style encoder) into a GPT-2 style causal decoder, adapting the architecture accordingly, followed by progressive causal language-model fine-tuning.
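The conversion script itself is not reproduced here, but the core idea can be sketched as follows. This is a minimal illustration assuming the Hugging Face transformers API; the full per-layer mapping (e.g., RoBERTa's separate Q/K/V projections versus GPT-2's fused `c_attn`) is more involved and the layer names below are illustrative, not the authors' exact code.

```python
# Sketch of the encoder-to-decoder weight-transfer idea (not the published
# conversion script). Assumes Hugging Face transformers and PyTorch.
import torch
from transformers import RobertaModel, GPT2Config, GPT2LMHeadModel

encoder = RobertaModel.from_pretrained("MWirelabs/khasibert")

# Build a GPT-2 decoder whose dimensions match the encoder.
config = GPT2Config(
    vocab_size=encoder.config.vocab_size,
    n_positions=encoder.config.max_position_embeddings,
    n_embd=encoder.config.hidden_size,
    n_layer=encoder.config.num_hidden_layers,
    n_head=encoder.config.num_attention_heads,
)
decoder = GPT2LMHeadModel(config)

# Transfer the directly compatible components; the transformer blocks
# need a per-layer mapping, and the result is then fine-tuned as a
# causal LM as described above.
with torch.no_grad():
    decoder.transformer.wte.weight.copy_(encoder.embeddings.word_embeddings.weight)
    decoder.transformer.wpe.weight.copy_(encoder.embeddings.position_embeddings.weight)
```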
Model Overview
- Model Name: Kren v1 (formerly kren-v0.3)
- Language: Khasi (kha)
- Architecture: GPT-2 style causal language model
- Parameters: 110M
- Training Data: 1M lines (optimal training point identified through research)
- Base Model: MWirelabs/khasibert
Key Capabilities
✅ Environmental and sustainability discussions
✅ Cultural and geographical questions about Meghalaya
✅ Abstract reasoning and concept exploration
✅ Sophisticated multi-clause responses
✅ Educational content generation
Usage
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("MWirelabs/kren-v1")
model = AutoModelForCausalLM.from_pretrained("MWirelabs/kren-v1")

# Generate Khasi text
inputs = tokenizer("Ka Khasi ka", return_tensors="pt")
outputs = model.generate(
    inputs.input_ids,
    max_length=100,
    temperature=0.8,
    do_sample=True,
    top_p=0.9,
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(text)
```
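For quick experiments, the same generation can also be run through the `pipeline` helper; a convenience sketch whose sampling parameters mirror the example above:

```python
from transformers import pipeline

generator = pipeline("text-generation", model="MWirelabs/kren-v1")
result = generator("Ka Khasi ka", max_length=100, do_sample=True, temperature=0.8, top_p=0.9)
print(result[0]["generated_text"])
```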
Training Details
- Training Method: Progressive fine-tuning with encoder-to-decoder conversion
- Optimal Training Point: 1M lines (validated through research)
- Training Loss: 2.960
- Perplexity: 19.3
- Architecture Conversion: RoBERTa encoder → GPT-2 decoder with systematic weight transfer
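The reported perplexity follows directly from the training loss, since perplexity is the exponential of the mean per-token cross-entropy:

```python
import math
print(math.exp(2.960))  # ≈ 19.30, matching the reported perplexity of 19.3
```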
Research Validation
This model represents the optimal point identified through comprehensive progressive training research:
- v0.1 (300K lines): Training loss 3.149, basic generation
- v0.2 (800K lines): Training loss 2.995, dialogue capabilities
- v0.3/v1 (1M lines): Training loss 2.960, abstract reasoning breakthrough
- v0.4 (2M lines): Training loss 2.903, but generation quality regressed
Key Finding: Training beyond 1M lines causes capability degradation despite lower loss values.
Generation Examples
Environmental Discussion
Input: "Kumno ban pyniaid ia ka phang ha ka pyrthei?" (How to protect the environment?) Output: Generates substantive responses about environmental responsibility and conservation practices.
Cultural Questions
Input: "Kiei ki wah ki shnong ba don ha Meghalaya?" (What villages are in Meghalaya?) Output: Provides detailed responses about Meghalayan communities and geography.
Limitations & Safety
⚠️ Important Safety Information
Kren v1 may produce hallucinations and biased or culturally insensitive content, and it should not be used for medical, legal, or other high-stakes decisions without human oversight. Users are responsible for verifying outputs in critical contexts.
Specific Limitations
- Context Window: 514 tokens, which limits very long-form generation
- Domain Coverage: Optimized for general Khasi; specialized domains may need fine-tuning
- Cultural Nuances: May require additional culturally specific training for certain applications
- Scale: 110M parameters provide a good balance, but larger models might offer enhanced capabilities
- Hallucinations: May generate plausible-sounding but factually incorrect information
- Bias: May reflect biases present in training data
- Cultural Sensitivity: Generated content should be reviewed by Khasi speakers for cultural appropriateness
Recommended Use Cases
✅ Appropriate Uses:
- Educational content generation (with human review)
- Creative writing assistance
- Language learning tools
- Cultural preservation projects
- Research and experimentation
❌ Not Recommended:
- Medical advice or diagnosis
- Legal consultation
- Financial advice
- High-stakes decision making without human oversight
- Official translations without verification
Technical Specifications
- Context Length: 514 tokens
- Vocabulary: 32,000 Khasi-specific tokens
- Precision: BF16/FP16 compatible
- Memory Requirements: ~450MB storage, 2GB+ RAM for inference
- Hardware: Optimized for consumer GPUs (4GB+ VRAM recommended)
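Given the BF16/FP16 compatibility noted above, inference memory can be roughly halved by loading the weights in half precision; a minimal sketch assuming a CUDA device:

```python
import torch
from transformers import AutoModelForCausalLM

# Half-precision load cuts the ~450MB FP32 footprint roughly in half.
model = AutoModelForCausalLM.from_pretrained(
    "MWirelabs/kren-v1",
    torch_dtype=torch.float16,  # or torch.bfloat16 on supported hardware
).to("cuda")
```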
Applications
- Educational Technology: Khasi language learning platforms
- Content Generation: Cultural and educational material creation
- Language Preservation: AI-assisted documentation of Khasi expressions
- Research: Foundation for further Khasi NLP development
Model Performance
- Training Efficiency: 6.0% training-loss improvement from v0.1 (3.149) to v1 (2.960) at the optimal data volume
- Quality Validation: Comprehensive evaluation across multiple domains
- Capability Range: Environmental topics, cultural discussions, educational content
- Reliability: Consistent generation quality across diverse prompts
Research Significance
- Process: Encoder-to-decoder conversion methodology for Indian languages
- Methodology: Validates progressive training approach for low-resource languages
- Findings: Demonstrates optimal training data volumes for indigenous language models
- Impact: Establishes foundation for Northeast Indian language AI development
Citation
```bibtex
@misc{nyalang2024kren,
  title        = {Kren v1.0: An Encoder-to-Decoder Generative Language Model for an Indian Language (Khasi)},
  author       = {Badal Nyalang},
  year         = {2024},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.17120223},
  howpublished = {\url{https://zenodo.org/records/17120223}}
}
```
Related Models
- MWirelabs/khasibert - Base encoder model
Contact
Developed by MWire Labs, Shillong, Meghalaya. For questions about Kren v1 or Khasi language AI research, please refer to the research paper or contact our research team.
License
This model is released under the CC BY 4.0 license, which allows broad use with attribution.
Note: This model represents the culmination of progressive training research and is recommended for production applications requiring Khasi text generation, with appropriate human oversight for safety-critical uses.