---
language:
- en
license: mit
library_name: transformers
tags:
- text-generation
- pytorch
- custom-architecture
- rope
- rmsnorm
- swiglu
- flash-attention
- 16k-context
pipeline_tag: text-generation
widget:
- text: "The future of artificial intelligence is"
  example_title: "AI Future"
- text: "Write a short story about"
  example_title: "Story Generation"
- text: "Explain quantum computing in simple terms:"
  example_title: "Technical Explanation"
datasets:
- tiiuae/falcon-refinedweb
metrics:
- perplexity
model-index:
- name: MAP-NEO Mini
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: RefinedWeb (100K subset)
      type: tiiuae/falcon-refinedweb
    metrics:
    - type: perplexity
      value: 3.9
      name: Final Training Loss
---

# MAP-NEO Mini

## Model Description

**MAP-NEO Mini** is a 253M-parameter autoregressive language model built from scratch with modern architectural improvements. It demonstrates that high-quality language models can be trained efficiently on modest hardware while achieving competitive performance through careful data curation and architectural choices.

- **Developed by**: Antony Austin
- **Model type**: Autoregressive language model
- **Language(s)**: English
- **License**: MIT
- **Architecture**: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention

## Key Features

- **Efficient Training**: Trained on a single RTX 5070 Laptop GPU (8GB VRAM) in ~4 hours
- **Extended Context**: 16,384-token context window (16x that of typical small models)
- **Memory Efficient**: Only 1.3GB of VRAM for inference over 1,800 tokens
- **Fast Inference**: ~150+ tokens/second on a consumer GPU
- **High-Quality Data**: Trained on a curated RefinedWeb subset

## Architecture Details

### Model Architecture

- **Parameters**: 253,085,696 (253M)
- **Layers**: 16 transformer blocks
- **Hidden Size**: 1,024
- **Attention Heads**: 16
- **Head Dimension**: 64
- **FFN Hidden Size**: 2,736 (2.67x hidden size)
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 16,384 tokens

### Architectural Innovations

- **RMSNorm**: Root Mean Square Layer Normalization for training stability
- **RoPE**: Rotary Positional Embeddings for better positional understanding
- **SwiGLU**: Swish-gated linear units for improved FFN performance (sketched below together with RMSNorm)
- **Flash Attention**: Memory-efficient attention computation
- **Weight Tying**: Input and output embeddings shared for parameter efficiency
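To make the RMSNorm and SwiGLU bullets above concrete, here is a minimal PyTorch sketch of both blocks using the dimensions listed in this card (hidden size 1,024, FFN hidden size 2,736). It illustrates the general technique only; the class names and layer layout are assumptions and are not taken from the repository's `model_neo` module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square LayerNorm: rescales by 1/RMS(x), no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: SiLU-gated up-projection followed by a down-projection."""
    def __init__(self, dim: int = 1024, hidden_dim: int = 2736):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

# Shape check on a dummy batch (batch=2, seq=16, hidden=1024).
x = torch.randn(2, 16, 1024)
y = SwiGLU()(RMSNorm(1024)(x))
print(y.shape)  # torch.Size([2, 16, 1024])
```

With `hidden_dim` near 8/3 of `dim` (2,736 vs. 1,024), the three SwiGLU projections hold roughly the same number of parameters as a conventional two-layer MLP with a 4x expansion.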
## Training Data

### Dataset

- **Source**: `tiiuae/falcon-refinedweb` (curated subset)
- **Size**: 100,000 high-quality web documents
- **Tokens**: ~41 million
- **Sequence Length**: 1,024 tokens per sequence
- **Sequences**: 40,965 packed sequences

### Data Quality

- Length filtering: 200-10,000 characters
- Language detection: English only
- Quality scoring: high-quality web content
- Deduplication: exact and near-duplicate removal

## Training Procedure

### Training Configuration

- **Hardware**: NVIDIA RTX 5070 Laptop GPU (8GB VRAM)
- **Precision**: bfloat16 mixed precision
- **Batch Size**: 1 per device
- **Gradient Accumulation**: 32 steps
- **Effective Batch Size**: 32
- **Learning Rate**: 3e-4
- **Scheduler**: Cosine with linear warmup
- **Warmup Steps**: 3,750
- **Total Steps**: 150,000
- **Training Time**: ~4 hours

### Optimization Details

- **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
- **Gradient Clipping**: 1.0
- **Gradient Checkpointing**: Enabled for memory efficiency
- **Loss Function**: Cross-entropy loss

### Context Extension

- **Base Context**: 2,048 tokens
- **Extended Context**: 16,384 tokens
- **Method**: Linear interpolation of positional embeddings
- **Validation**: Successfully tested up to 3,600 tokens

## Performance

### Training Metrics

- **Final Loss**: 3.907
- **Training Speed**: ~10 iterations/second
- **Peak Memory**: ~8GB VRAM
- **Convergence**: Smooth loss curve, no overfitting

### Inference Performance

- **Speed**: ~150+ tokens/second (RTX 5070)
- **Memory Usage**: 1.3GB for a 1,800-token context
- **Context Limit**: ~3,600 tokens in practice
- **Temperature**: 0.7-0.9 recommended for creative tasks

## Usage

### Quick Start

```python
import torch
from transformers import AutoTokenizer

from model_neo import NeoMini, NeoMiniConfig

# Load model
config = NeoMiniConfig()
model = NeoMini(config)
checkpoint = torch.load("extended_context_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(output))
```

### Interactive Chat

```bash
python interactive_chat.py
```

### Generation Parameters

- **Temperature**: 0.7-0.9 for creative tasks, 0.3-0.5 for factual
- **Top-k**: 40-50
- **Top-p**: 0.8-0.9
- **Repetition Penalty**: 1.1-1.3

## Limitations

### Current Limitations

- **Base Model Only**: Not instruction-tuned (requires fine-tuning for chat)
- **Context Window**: Practical limit of ~3,600 tokens despite the 16K architecture
- **Hardware Requirements**: Requires a CUDA-capable GPU for optimal performance
- **Knowledge Cutoff**: No explicit cutoff; knowledge is limited to patterns in the web training data

### Known Issues

- Occasionally generates repetitive patterns (can be reduced with fine-tuning)
- May not follow instructions well (expected base-model behavior)
- Sometimes produces formatting artifacts from web data

## Ethical Considerations

### Bias and Fairness

- Trained on web data, which may contain societal biases
- No explicit bias mitigation applied during training
- Users should be aware of potentially biased outputs

### Use Cases

**Intended Uses:**

- Research and experimentation
- Text generation and completion
- Creative writing assistance
- Educational purposes

**Out-of-Scope Uses:**

- Medical or legal advice
- High-stakes decision making
- Content that could cause harm

## Environmental Impact

### Carbon Footprint

- **Training Hardware**: Single RTX 5070 Laptop GPU (100W)
- **Training Time**: 4 hours
- **Estimated CO₂**: ~0.3 kg CO₂ equivalent
- **Efficiency**: Full 253M-parameter training run for ~0.3 kg CO₂

## Model Card Authors

- Antony Austin: model development and training
- 30/08/2025: model card creation

## Citation

```bibtex
@misc{mapneo_mini_2025,
  title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
  author={Antony Austin},
  year={2025},
  howpublished={\url{https://huggingface.co/Austin207/Map-NEO}},
  note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data}
}
```

## Technical Details

### Hardware Requirements

- **Minimum**: 4GB VRAM for inference (see the rough estimate after this list)
- **Recommended**: 8GB VRAM for extended context
- **Training**: 8GB+ VRAM with mixed precision
- **CPU**: Any modern CPU (inference possible but slow)
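The VRAM figures in this card can be sanity-checked with a back-of-the-envelope calculation. The sketch below is an illustration only, assuming bfloat16 weights and a standard per-layer key/value cache; it is not taken from the repository. It counts weights plus KV cache from the numbers listed above and ignores activations and CUDA context overhead, which is why it lands below the measured ~1.3GB at 1,800 tokens.

```python
# Rough lower-bound VRAM estimate from the numbers in this model card.
PARAMS = 253_085_696   # total parameters
LAYERS = 16
HEADS = 16
HEAD_DIM = 64
BYTES_BF16 = 2

def estimate_vram_gb(context_tokens: int) -> float:
    """Weights + KV cache only; activations and framework overhead are excluded."""
    weight_bytes = PARAMS * BYTES_BF16
    # Keys and values, cached for every layer, head, and position.
    kv_bytes = 2 * LAYERS * context_tokens * HEADS * HEAD_DIM * BYTES_BF16
    return (weight_bytes + kv_bytes) / 1e9

for ctx in (1_800, 3_600, 16_384):
    print(f"{ctx:>6} tokens: ~{estimate_vram_gb(ctx):.2f} GB (lower bound)")
# ~0.62 GB at 1,800 tokens, ~0.74 GB at 3,600, ~1.58 GB at 16,384
```

The gap between this lower bound and the measured usage is the usual activation, buffer, and CUDA-context overhead.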
## Future Work

### Planned Improvements

- [ ] Conversational fine-tuning with the UltraChat dataset
- [ ] Instruction-following capabilities
- [ ] Multi-language support
- [ ] Quantized versions (4-bit, 8-bit)
- [ ] ONNX export for edge deployment

### Research Directions

- Context window optimization beyond 16K
- More efficient attention mechanisms
- Improved training data curation
- Specialized domain fine-tuning

## Acknowledgments

- **Falcon RefinedWeb**: High-quality training data
- **Hugging Face**: Transformers library and infrastructure
- **Community**: Open-source ML community for architectural insights

---

**Last Updated**: August 30, 2025
**Model Version**: 1.0.0
**Status**: Base model (pre-conversational fine-tuning)