---
language:
- en
license: mit
library_name: transformers
tags:
- text-generation
- pytorch
- custom-architecture
- rope
- rmsnorm
- swiglu
- flash-attention
- 16k-context
pipeline_tag: text-generation
widget:
- text: "The future of artificial intelligence is"
example_title: "AI Future"
- text: "Write a short story about"
example_title: "Story Generation"
- text: "Explain quantum computing in simple terms:"
example_title: "Technical Explanation"
datasets:
- tiiuae/falcon-refinedweb
metrics:
- perplexity
model-index:
- name: MAP-NEO Mini
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: RefinedWeb (100K subset)
      type: tiiuae/falcon-refinedweb
    metrics:
    - type: loss
      value: 3.907
      name: Final Training Loss
---
# MAP-NEO Mini
## Model Description
**MAP-NEO Mini** is a 253M parameter autoregressive language model built from scratch with modern architectural improvements. It demonstrates that high-quality language models can be trained efficiently on modest hardware while achieving competitive performance through careful data curation and architectural choices.
- **Developed by**: Antony Austin
- **Model type**: Autoregressive Language Model
- **Language(s)**: English
- **License**: MIT
- **Architecture**: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention
## Key Features
- **Efficient Training**: Trained on RTX 5070 Laptop GPU (8GB VRAM) in ~4 hours
- **Extended Context**: 16,384 token context window (16x typical small models)
- **Memory Efficient**: Only ~1.3GB VRAM for inference with a 1,800-token context
- **Fast Inference**: 150+ tokens/second on a consumer GPU
- **High Quality Data**: Trained on curated RefinedWeb subset
## Architecture Details
### Model Architecture
- **Parameters**: 253,085,696 (253M)
- **Layers**: 16 transformer blocks
- **Hidden Size**: 1,024
- **Attention Heads**: 16
- **Head Dimension**: 64
- **FFN Hidden Size**: 2,736 (2.67x hidden size)
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 16,384 tokens (see the configuration sketch below)
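For reference, these hyperparameters can be collected into a small configuration object. The sketch below is illustrative only; the field names are assumptions and may not match `model_neo.NeoMiniConfig` exactly.

```python
from dataclasses import dataclass

@dataclass
class MiniConfig:  # hypothetical mirror of NeoMiniConfig; field names are assumed
    n_layers: int = 16           # transformer blocks
    d_model: int = 1024          # hidden size
    n_heads: int = 16            # attention heads (head_dim = d_model // n_heads = 64)
    ffn_hidden: int = 2736       # SwiGLU hidden size (~2.67x d_model)
    vocab_size: int = 50257      # GPT-2 tokenizer
    max_seq_len: int = 16384     # extended context window
    rope_theta: float = 10000.0  # RoPE base frequency (assumed default)
    tie_embeddings: bool = True  # share input and output embedding weights
```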
### Architectural Innovations
- **RMSNorm**: Root Mean Square Layer Normalization for training stability (see the sketch after this list)
- **RoPE**: Rotary Positional Embeddings for better positional understanding
- **SwiGLU**: Swish-Gated Linear Units for improved FFN performance
- **Flash Attention**: Memory-efficient attention computation
- **Weight Tying**: Input/output embeddings shared for parameter efficiency
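To make the RMSNorm and SwiGLU items concrete, here is a minimal PyTorch sketch of both modules as they are commonly defined; the actual `model_neo` implementation may differ in details such as the epsilon value or bias handling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization: rescale by 1/RMS(x); no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int = 1024, hidden: int = 2736):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```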
## Training Data
### Dataset
- **Source**: `tiiuae/falcon-refinedweb` (curated subset)
- **Size**: 100,000 high-quality web documents
- **Tokens**: ~41 million tokens
- **Sequence Length**: 1,024 tokens per sequence
- **Sequences**: 40,965 packed sequences (see the packing sketch below)
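As a rough illustration of how 100K documents become 40,965 packed sequences of 1,024 tokens each, the sketch below tokenizes the corpus, joins documents with EOS separators, and splits the resulting token stream into fixed-length chunks. The preprocessing script itself is not published here, so treat the function as an assumption.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
SEQ_LEN = 1024

def pack_documents(texts, seq_len=SEQ_LEN):
    """Concatenate EOS-separated token streams and split into fixed-length sequences."""
    stream = []
    for text in texts:
        stream.extend(tokenizer.encode(text) + [tokenizer.eos_token_id])
    n_full = len(stream) // seq_len  # drop the trailing partial chunk
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```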
### Data Quality
- Length filtering: 200-10,000 characters
- Language detection: English only
- Quality scoring: High-quality web content
- Deduplication: Exact and near-duplicate removal (a filtering sketch follows this list)
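These filters could be approximated with something like the sketch below. The `langdetect` dependency, the omitted quality-scoring step, and the exact thresholds are assumptions; near-duplicate removal (e.g. MinHash) is only noted in a comment.

```python
import hashlib
from langdetect import detect  # assumed language-ID dependency; any detector works

def keep_document(text: str, seen_hashes: set) -> bool:
    """Length, language, and exact-duplicate filters as described above."""
    if not (200 <= len(text) <= 10_000):          # length filter (characters)
        return False
    try:
        if detect(text) != "en":                  # English-only language filter
            return False
    except Exception:
        return False
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:                     # exact-duplicate removal
        return False
    seen_hashes.add(digest)
    return True  # near-duplicate removal (e.g. MinHash) would follow here
```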
## Training Procedure
### Training Configuration
- **Hardware**: NVIDIA RTX 5070 Laptop GPU (8GB VRAM)
- **Precision**: bfloat16 mixed precision
- **Batch Size**: 1 per device
- **Gradient Accumulation**: 32 steps
- **Effective Batch Size**: 32
- **Learning Rate**: 3e-4
- **Scheduler**: Cosine with linear warmup
- **Warmup Steps**: 3,750
- **Total Steps**: 150,000
- **Training Time**: ~4 hours
### Optimization Details
- **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
- **Gradient Clipping**: 1.0
- **Gradient Checkpointing**: Enabled for memory efficiency
- **Loss Function**: Cross-entropy loss (see the training-loop sketch below)
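Putting the optimizer, scheduler, accumulation, and clipping settings together, one training step might look like the sketch below. The `model` and `train_loader` objects, and the forward call returning an object with a `.loss` attribute, are assumptions about the custom training script, not its actual code.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=3_750,
                                            num_training_steps=150_000)
accum_steps = 32  # effective batch size = 1 (per device) x 32

for step, batch in enumerate(train_loader):
    with torch.autocast("cuda", dtype=torch.bfloat16):   # bfloat16 mixed precision
        loss = model(batch["input_ids"], labels=batch["labels"]).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:                     # gradient accumulation
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()  # step accounting relative to warmup/total is illustrative
        optimizer.zero_grad()
```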
### Context Extension
- **Base Context**: 2,048 tokens
- **Extended Context**: 16,384 tokens
- **Method**: Linear position interpolation of the rotary position embeddings (see the sketch after this list)
- **Validation**: Successfully tested up to 3,600 tokens
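A common way to implement this kind of extension is RoPE position interpolation: positions are scaled by the ratio of the base to the extended context before the rotary angles are computed, so the extended window reuses the position range seen during training. The sketch below shows the idea; whether the released checkpoint uses exactly this scheme is an assumption based on the description above.

```python
import torch

def rope_angles(seq_len, head_dim=64, base_context=2_048,
                extended_context=16_384, theta=10_000.0):
    """RoPE angles with linear position interpolation (positions squeezed by 1/8 here)."""
    scale = base_context / extended_context
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)
```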
## Performance
### Training Metrics
- **Final Loss**: 3.907
- **Training Speed**: ~10 iterations/second
- **Peak Memory**: ~8GB VRAM
- **Convergence**: Smooth loss curve, no overfitting
### Inference Performance
- **Speed**: 150+ tokens/second (RTX 5070)
- **Memory Usage**: ~1.3GB VRAM for a 1,800-token context
- **Practical Context Limit**: ~3,600 tokens
- **Temperature**: Recommended 0.7-0.9 for creative tasks
## Usage
### Quick Start
```python
import torch
from transformers import AutoTokenizer
from model_neo import NeoMini, NeoMiniConfig

# Load model
config = NeoMiniConfig()
model = NeoMini(config)
checkpoint = torch.load("extended_context_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(output))
```
### Interactive Chat
```bash
python interactive_chat.py
```
### Generation Parameters
- **Temperature**: 0.7-0.9 for creative tasks, 0.3-0.5 for factual
- **Top-k**: 40-50
- **Top-p**: 0.8-0.9
- **Repetition Penalty**: 1.1-1.3 (a sampling sketch applying these settings follows)
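The sketch below applies these settings to a single sampling step over a `(vocab_size,)` logits tensor. It is a generic implementation of temperature, top-k, top-p, and a CTRL-style repetition penalty, not the code used by `interactive_chat.py`.

```python
import torch

def sample_next_token(logits, generated, temperature=0.8, top_k=50,
                      top_p=0.9, repetition_penalty=1.2):
    """Sample one token id from a 1-D logits tensor given previously generated ids."""
    logits = logits.clone()
    if generated.numel() > 0:                      # repetition penalty (CTRL-style)
        prev = torch.unique(generated)
        logits[prev] = torch.where(logits[prev] > 0,
                                   logits[prev] / repetition_penalty,
                                   logits[prev] * repetition_penalty)
    logits = logits / temperature                  # temperature scaling
    kth = torch.topk(logits, top_k).values[-1]     # top-k: keep the k highest logits
    logits[logits < kth] = float("-inf")
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p                     # top-p: keep smallest nucleus above p
    remove[1:] = remove[:-1].clone()
    remove[0] = False
    logits[sorted_idx[remove]] = float("-inf")
    return torch.multinomial(torch.softmax(logits, dim=-1), 1)
```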
## Limitations
### Current Limitations
- **Base Model Only**: Not instruction-tuned (requires fine-tuning for chat)
- **Context Window**: Practical limit of ~3,600 tokens despite 16K architecture
- **Hardware Requirements**: Requires CUDA-capable GPU for optimal performance
- **Knowledge Cutoff**: No explicit cutoff date; knowledge is limited to patterns in the web training data
### Known Issues
- Occasionally generates repetitive patterns (fixable with fine-tuning)
- May not follow instructions well (base model behavior)
- Sometimes produces formatting artifacts from web data
## Ethical Considerations
### Bias and Fairness
- Trained on web data which may contain societal biases
- No explicit bias mitigation applied during training
- Users should be aware of potential biased outputs
### Use Cases
**Intended Uses:**
- Research and experimentation
- Text generation and completion
- Creative writing assistance
- Educational purposes
**Out-of-Scope Uses:**
- Medical or legal advice
- High-stakes decision making
- Content that could cause harm
## Environmental Impact
### Carbon Footprint
- **Training Hardware**: Single RTX 5070 Laptop GPU (100W)
- **Training Time**: 4 hours
- **Estimated CO₂**: ~0.3 kg CO₂ equivalent
- **Efficiency**: 253M parameters per ~0.3 kg CO₂ (rough arithmetic below)
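The ~0.3 kg figure is consistent with a simple energy-times-grid-intensity estimate; the carbon intensity used below is an assumed value, not a measurement.

```python
power_kw = 0.100             # ~100 W GPU draw
hours = 4                    # training time
grid_kg_co2_per_kwh = 0.7    # assumed grid carbon intensity
energy_kwh = power_kw * hours                 # 0.4 kWh
co2_kg = energy_kwh * grid_kg_co2_per_kwh     # ~0.28 kg CO2e, i.e. roughly 0.3 kg
```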
## Model Card Authors
- **Antony Austin**: Model development and training
- **Model card created**: August 30, 2025
## Citation
```bibtex
@misc{mapneo_mini_2025,
  title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
  author={Antony Austin},
  year={2025},
  howpublished={\url{https://huggingface.co/Austin207/Map-NEO}},
  note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data}
}
```
## Technical Details
### Hardware Requirements
- **Minimum**: 4GB VRAM for inference
- **Recommended**: 8GB VRAM for extended context
- **Training**: 8GB+ VRAM with mixed precision
- **CPU**: Any modern CPU (inference possible but slow)
## Future Work
### Planned Improvements
- [ ] Conversational fine-tuning with UltraChat dataset
- [ ] Instruction following capabilities
- [ ] Multi-language support
- [ ] Quantized versions (4-bit, 8-bit)
- [ ] ONNX export for edge deployment
### Research Directions
- Context window optimization beyond 16K
- More efficient attention mechanisms
- Improved training data curation
- Specialized domain fine-tuning
## Acknowledgments
- **Falcon RefinedWeb**: High-quality training data
- **Hugging Face**: Transformers library and infrastructure
- **Community**: Open-source ML community for architectural insights
---
**Last Updated**: August 30, 2025
**Model Version**: 1.0.0
**Status**: Base model (pre-conversational fine-tuning)