MAP-NEO Mini
Model Description
MAP-NEO Mini is a 253M parameter autoregressive language model built from scratch with modern architectural improvements. It demonstrates that high-quality language models can be trained efficiently on modest hardware while achieving competitive performance through careful data curation and architectural choices.
- Developed by: Antony Austin
- Model type: Autoregressive Language Model
- Language(s): English
- License: MIT
- Architecture: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention
Key Features
- Efficient Training: Trained on a single NVIDIA RTX 5070 Laptop GPU (8 GB VRAM) in ~4 hours
- Extended Context: 16,384-token context window (16x the 1,024-token window typical of small models)
- Memory Efficient: Only 1.3 GB of VRAM for inference with an 1,800-token context
- Fast Inference: 150+ tokens/second on a consumer GPU
- High-Quality Data: Trained on a curated RefinedWeb subset
Architecture Details
Model Architecture
- Parameters: 253,085,696 (253M)
- Layers: 16 transformer blocks
- Hidden Size: 1,024
- Attention Heads: 16
- Head Dimension: 64
- FFN Hidden Size: 2,736 (2.67x hidden size)
- Vocabulary Size: 50,257 (GPT-2 tokenizer)
- Max Sequence Length: 16,384 tokens
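The figures above can be sanity-checked with a rough parameter count. The sketch below uses hypothetical field names, not the real NeoMiniConfig API, and assumes tied embeddings, no bias terms, and two RMSNorms per block plus a final norm:

```python
from dataclasses import dataclass

@dataclass
class NeoMiniConfigSketch:
    # Hypothetical field names mirroring the table above, not the real NeoMiniConfig.
    n_layers: int = 16
    d_model: int = 1024
    n_heads: int = 16
    d_ffn: int = 2736
    vocab_size: int = 50257
    max_seq_len: int = 16384

def approx_params(cfg: NeoMiniConfigSketch) -> int:
    """Rough parameter count with tied input/output embeddings and no biases."""
    embed = cfg.vocab_size * cfg.d_model      # shared with the LM head (weight tying)
    attn = 4 * cfg.d_model * cfg.d_model      # Q, K, V and output projections
    ffn = 3 * cfg.d_model * cfg.d_ffn         # SwiGLU: gate, up and down projections
    norms = 2 * cfg.d_model                   # two RMSNorms per block
    per_block = attn + ffn + norms
    return embed + cfg.n_layers * per_block + cfg.d_model  # plus final norm

print(f"{approx_params(NeoMiniConfigSketch()):,}")  # 253,085,696 under these assumptions
```

Under these assumptions the count reproduces the reported 253,085,696 exactly.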
Architectural Innovations
- RMSNorm: Root Mean Square Layer Normalization for training stability
- RoPE: Rotary Positional Embeddings for better positional understanding
- SwiGLU: Swish-Gated Linear Units for improved FFN performance
- Flash Attention: Memory-efficient attention computation
- Weight Tying: Input/output embeddings shared for parameter efficiency
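For reference, RMSNorm and SwiGLU can be sketched in a few lines of PyTorch. This is a minimal illustration of the two techniques, not the model's actual module code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization: scale by the RMS of the features, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms

class SwiGLU(nn.Module):
    """Swish-gated FFN: silu(W_gate x) * (W_up x), projected back down to the model width."""
    def __init__(self, dim: int = 1024, hidden: int = 2736):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(F.silu(self.gate(x)) * self.up(x))
```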
Training Data
Dataset
- Source: tiiuae/falcon-refinedweb (curated subset)
- Size: 100,000 high-quality web documents
- Tokens: ~41 million tokens
- Sequence Length: 1,024 tokens per sequence
- Sequences: 40,965 packed sequences
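A rough sketch of how documents might be tokenized with the GPT-2 tokenizer and packed into fixed 1,024-token sequences; the actual preprocessing script is not shown in this card:

```python
from transformers import AutoTokenizer

def pack_sequences(documents, seq_len=1024):
    """Concatenate tokenized documents into one token stream and slice it into fixed-length sequences."""
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    stream = []
    for doc in documents:
        stream.extend(tokenizer.encode(doc) + [tokenizer.eos_token_id])  # EOS separates documents
    # Drop the trailing partial chunk so every sequence is exactly seq_len tokens.
    n_seqs = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_seqs)]

# ~41M tokens / 1,024 tokens per sequence ≈ 40,965 packed sequences, as reported above.
```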
Data Quality
- Length filtering: 200-10,000 characters
- Language detection: English only
- Quality scoring: High-quality web content
- Deduplication: Exact and near-duplicate removal
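A hedged sketch of the kind of filter described above; the langdetect dependency and exact thresholds are illustrative choices, and near-duplicate removal (e.g. MinHash/LSH) and quality scoring are only indicated by a comment:

```python
import hashlib
from langdetect import detect  # illustrative: any language-ID library would do

def keep_document(text: str, seen_hashes: set) -> bool:
    """Apply the length, language and exact-dedup filters described above."""
    if not 200 <= len(text) <= 10_000:      # length filter in characters
        return False
    try:
        if detect(text) != "en":            # English-only filter
            return False
    except Exception:
        return False
    digest = hashlib.sha256(text.strip().lower().encode()).hexdigest()
    if digest in seen_hashes:               # exact-duplicate removal
        return False
    seen_hashes.add(digest)
    # Near-duplicate removal (MinHash/LSH) and quality scoring would run in a later pass.
    return True
```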
Training Procedure
Training Configuration
- Hardware: NVIDIA RTX 5070 Laptop GPU (8GB VRAM)
- Precision: bfloat16 mixed precision
- Batch Size: 1 per device
- Gradient Accumulation: 32 steps
- Effective Batch Size: 32
- Learning Rate: 3e-4
- Scheduler: Cosine with linear warmup
- Warmup Steps: 3,750
- Total Steps: 150,000
- Training Time: ~4 hours
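The learning-rate schedule above (linear warmup for 3,750 steps, then cosine decay over the remaining steps) can be sketched as follows; the function and argument names are illustrative, not taken from the training code:

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=3_750, total_steps=150_000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay toward min_lr."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```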
Optimization Details
- Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
- Gradient Clipping: 1.0
- Gradient Checkpointing: Enabled for memory efficiency
- Loss Function: Cross-entropy loss
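Combining the optimizer settings above with the gradient accumulation and bf16 settings from the training configuration, a single training step might look roughly like this. It is a sketch under the assumption that the model returns a cross-entropy loss when given labels; it is not the actual training script:

```python
import torch

def train(model, dataloader, accum_steps=32, device="cuda"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                                  betas=(0.9, 0.95), weight_decay=0.01)
    model.train()
    optimizer.zero_grad()
    for step, batch in enumerate(dataloader):
        input_ids = batch["input_ids"].to(device)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            # Assumption: the model returns cross-entropy loss when given labels.
            loss = model(input_ids, labels=input_ids).loss
        (loss / accum_steps).backward()      # scale loss for gradient accumulation
        if (step + 1) % accum_steps == 0:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip at 1.0
            optimizer.step()
            optimizer.zero_grad()
            # Learning-rate schedule (see sketch above) and checkpointing omitted for brevity.
```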
Context Extension
- Base Context: 2,048 tokens
- Extended Context: 16,384 tokens
- Method: Linear interpolation of the rotary position embeddings
- Validation: Successfully tested up to 3,600 tokens
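The card does not show the extension code, but linear position interpolation for RoPE is typically implemented by scaling positions down by the extension factor (16,384 / 2,048 = 8) so they fall back into the range seen during training. A minimal sketch, assuming the standard RoPE frequency formulation:

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int = 64,
                base: float = 10_000.0, scale: float = 16_384 / 2_048):
    """RoPE rotation angles with linear position interpolation (positions divided by `scale`)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    scaled_positions = positions.float() / scale   # 8x interpolation: 2,048 -> 16,384 tokens
    return torch.outer(scaled_positions, inv_freq) # shape (seq_len, head_dim // 2)
```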
Performance
Training Metrics
- Final Loss: 3.907
- Training Speed: ~10 iterations/second
- Peak Memory: ~8GB VRAM
- Convergence: Smooth loss curve with no signs of overfitting
Inference Performance
- Speed: 150+ tokens/second (RTX 5070)
- Memory Usage: 1.3 GB for an 1,800-token context
- Context Limit: ~3,600 tokens in practice
- Temperature: 0.7-0.9 recommended for creative tasks
Usage
Quick Start
```python
import torch
from transformers import AutoTokenizer
from model_neo import NeoMini, NeoMiniConfig

# Load model
config = NeoMiniConfig()
model = NeoMini(config)
checkpoint = torch.load("extended_context_model.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))  # assumes generate() returns a batch of sequences
```
Interactive Chat
python interactive_chat.py
Generation Parameters
- Temperature: 0.7-0.9 for creative tasks, 0.3-0.5 for factual
- Top-k: 40-50
- Top-p: 0.8-0.9
- Repetition Penalty: 1.1-1.3
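For reference, a sketch of how these sampling parameters combine in a single decoding step; the model's own generate() may implement them differently:

```python
import torch

def sample_next_token(logits, generated_ids, temperature=0.8, top_k=50,
                      top_p=0.9, repetition_penalty=1.2):
    """Apply repetition penalty, temperature, top-k and top-p to one step of 1-D logits."""
    logits = logits.clone()
    # Repetition penalty: down-weight tokens that have already been generated.
    for tok in set(generated_ids.tolist()):
        logits[tok] = logits[tok] / repetition_penalty if logits[tok] > 0 else logits[tok] * repetition_penalty
    logits = logits / temperature
    # Top-k: keep only the k highest-scoring tokens (returned in descending order).
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    # Top-p (nucleus): keep the smallest prefix whose cumulative probability stays within top_p.
    keep = torch.cumsum(probs, dim=-1) <= top_p
    keep[0] = True                                  # always keep the most likely token
    kept = probs[keep] / probs[keep].sum()
    choice = torch.multinomial(kept, 1)
    return topk_idx[keep][choice]                   # token id of the sampled token
```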
Limitations
Current Limitations
- Base Model Only: Not instruction-tuned (requires fine-tuning for chat)
- Context Window: Practical limit of ~3,600 tokens despite 16K architecture
- Hardware Requirements: Requires CUDA-capable GPU for optimal performance
- Knowledge Cutoff: No explicit cutoff date; knowledge is limited to whatever appears in the RefinedWeb training subset
Known Issues
- Occasionally generates repetitive patterns (fixable with fine-tuning)
- May not follow instructions well (base model behavior)
- Sometimes produces formatting artifacts from web data
Ethical Considerations
Bias and Fairness
- Trained on web data which may contain societal biases
- No explicit bias mitigation applied during training
- Users should be aware of potential biased outputs
Use Cases
Intended Uses:
- Research and experimentation
- Text generation and completion
- Creative writing assistance
- Educational purposes
Out-of-Scope Uses:
- Medical or legal advice
- High-stakes decision making
- Content that could cause harm
Environmental Impact
Carbon Footprint
- Training Hardware: Single RTX 5070 Laptop GPU (100W)
- Training Time: 4 hours
- Estimated CO₂: ~0.3 kg CO₂ equivalent (≈0.4 kWh of electricity: 100 W × 4 h)
- Efficiency: The entire 253M-parameter training run fits within ~0.3 kg CO₂
Model Card Authors
- Antony Austin - Model development and training
- Model card created: 30/08/2025
Citation
@misc{mapneo_mini_2025,
  title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
  author={Antony Austin},
  year={2025},
  howpublished={\url{https://huggingface.co/Austin207/Map-NEO}},
  note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data}
}
Technical Details
Hardware Requirements
- Minimum: 4GB VRAM for inference
- Recommended: 8GB VRAM for extended context
- Training: 8GB+ VRAM with mixed precision
- CPU: Any modern CPU (inference possible but slow)
Future Work
Planned Improvements
- Conversational fine-tuning with UltraChat dataset
- Instruction following capabilities
- Multi-language support
- Quantized versions (4-bit, 8-bit)
- ONNX export for edge deployment
Research Directions
- Context window optimization beyond 16K
- More efficient attention mechanisms
- Improved training data curation
- Specialized domain fine-tuning
Acknowledgments
- Falcon RefinedWeb: High-quality training data
- Hugging Face: Transformers library and infrastructure
- Community: Open-source ML community for architectural insights
Last Updated: August 30, 2025
Model Version: 1.0.0
Status: Base model (pre-conversational fine-tuning)