---
language:
- en
license: mit
library_name: transformers
tags:
- text-generation
- pytorch
- custom-architecture
- rope
- rmsnorm
- swiglu
- flash-attention
- 16k-context
pipeline_tag: text-generation
widget:
- text: "The future of artificial intelligence is"
  example_title: "AI Future"
- text: "Write a short story about"
  example_title: "Story Generation"
- text: "Explain quantum computing in simple terms:"
  example_title: "Technical Explanation"
datasets:
- tiiuae/falcon-refinedweb
metrics:
- perplexity
model-index:
- name: MAP-NEO Mini
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: RefinedWeb (100K subset)
      type: tiiuae/falcon-refinedweb
    metrics:
    - type: perplexity
      value: 3.9
      name: Final Training Loss
---

# MAP-NEO Mini

## Model Description

**MAP-NEO Mini** is a 253M-parameter autoregressive language model built from scratch with modern architectural improvements. It demonstrates that a capable small language model can be trained efficiently on modest consumer hardware through careful data curation and architectural choices.

- **Developed by**: Antony Austin
- **Model type**: Autoregressive Language Model
- **Language(s)**: English
- **License**: MIT
- **Architecture**: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention

## Key Features

- **Efficient Training**: Trained on an RTX 5070 Laptop GPU (8GB VRAM) in ~4 hours
- **Extended Context**: 16,384-token context window, extended from a 2,048-token training context
- **Memory Efficient**: Only ~1.3GB VRAM for inference over an 1,800-token context
- **Fast Inference**: ~150 tokens/second on a consumer GPU
- **High-Quality Data**: Trained on a curated RefinedWeb subset

## Architecture Details

### Model Architecture

- **Parameters**: 253,085,696 (253M)
- **Layers**: 16 transformer blocks
- **Hidden Size**: 1,024
- **Attention Heads**: 16
- **Head Dimension**: 64
- **FFN Hidden Size**: 2,736 (~2.67x hidden size)
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 16,384 tokens
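
For readers who want to reproduce the shape of the network, the hyperparameters above map onto a configuration object roughly like the sketch below. The field names are illustrative and may not match the exact attributes of `NeoMiniConfig` in `model_neo.py`.

```python
from dataclasses import dataclass

@dataclass
class NeoMiniConfigSketch:
    """Illustrative hyperparameters; see NeoMiniConfig in model_neo.py for the real definition."""
    vocab_size: int = 50257        # GPT-2 tokenizer vocabulary
    hidden_size: int = 1024        # model (embedding) dimension
    num_layers: int = 16           # transformer blocks
    num_heads: int = 16            # attention heads (head_dim = 1024 / 16 = 64)
    ffn_hidden_size: int = 2736    # SwiGLU FFN width (~2.67x hidden size)
    max_seq_len: int = 16384       # extended context window
    rope_base: float = 10000.0     # assumed RoPE base frequency
    norm_eps: float = 1e-6         # assumed RMSNorm epsilon
    tie_embeddings: bool = True    # input/output embeddings shared
```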

### Architectural Innovations

- **RMSNorm**: Root Mean Square Layer Normalization for training stability
- **RoPE**: Rotary Positional Embeddings for better positional understanding
- **SwiGLU**: Swish-Gated Linear Units for improved FFN performance
- **Flash Attention**: Memory-efficient attention computation
- **Weight Tying**: Input/output embeddings shared for parameter efficiency
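
As a reference for how these pieces fit together, here is a minimal PyTorch sketch of RMSNorm and a SwiGLU feed-forward block using their standard formulations; the actual implementation lives in `model_neo.py` and may differ in details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Standard RMSNorm: scale activations by their reciprocal root-mean-square."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Standard SwiGLU FFN: SiLU-gated projection up to ffn_hidden_size, then back down."""
    def __init__(self, hidden_size: int = 1024, ffn_hidden_size: int = 2736):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.up_proj = nn.Linear(hidden_size, ffn_hidden_size, bias=False)
        self.down_proj = nn.Linear(ffn_hidden_size, hidden_size, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))
```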

## Training Data

### Dataset

- **Source**: `tiiuae/falcon-refinedweb` (curated subset)
- **Size**: 100,000 high-quality web documents
- **Tokens**: ~41 million tokens
- **Sequence Length**: 1,024 tokens per sequence
- **Sequences**: 40,965 packed sequences
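
The per-sequence numbers above come from packing tokenized documents into fixed-length blocks. A minimal sketch of that packing step is shown below; the exact preprocessing script is not included here, and the EOS-separator handling is an assumption.

```python
from typing import Iterable, List

def pack_sequences(token_streams: Iterable[List[int]],
                   seq_len: int = 1024,
                   eos_id: int = 50256) -> List[List[int]]:
    """Concatenate tokenized documents (EOS-separated) and split into fixed-length blocks."""
    buffer: List[int] = []
    packed: List[List[int]] = []
    for tokens in token_streams:
        buffer.extend(tokens + [eos_id])      # assume an EOS token separates documents
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])   # one 1,024-token training sequence
            buffer = buffer[seq_len:]
    return packed                             # ~41M tokens -> ~40K sequences of 1,024
```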

### Data Quality

- Length filtering: 200-10,000 characters
- Language detection: English only
- Quality scoring: High-quality web content
- Deduplication: Exact and near-duplicate removal
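
A rough sketch of this filtering pass is below. The language check is a crude stand-in for whichever detector was actually used, and quality scoring and near-duplicate removal are omitted; only the length bounds and exact-duplicate hashing follow directly from the list above.

```python
import hashlib

def looks_english(text: str) -> bool:
    # Placeholder heuristic standing in for a real language detector.
    return sum(c.isascii() for c in text) / max(len(text), 1) > 0.95

def keep_document(text: str, seen_hashes: set) -> bool:
    """Length, language, and exact-dedup filters as described above (quality scoring omitted)."""
    if not (200 <= len(text) <= 10_000):      # length filter: 200-10,000 characters
        return False
    if not looks_english(text):               # stand-in for English-only language detection
        return False
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:                 # exact-duplicate removal via content hashing
        return False
    seen_hashes.add(digest)
    return True
```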

## Training Procedure

### Training Configuration

- **Hardware**: NVIDIA RTX 5070 Laptop GPU (8GB VRAM)
- **Precision**: bfloat16 mixed precision
- **Batch Size**: 1 per device
- **Gradient Accumulation**: 32 steps
- **Effective Batch Size**: 32
- **Learning Rate**: 3e-4
- **Scheduler**: Cosine with linear warmup
- **Warmup Steps**: 3,750
- **Total Steps**: 150,000
- **Training Time**: ~4 hours
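
The learning-rate schedule named above (linear warmup for 3,750 steps, then cosine decay over the remaining 146,250 steps) can be written as a small function. This is the textbook formulation, not necessarily the exact code used during training.

```python
import math

def lr_at_step(step: int,
               peak_lr: float = 3e-4,
               warmup_steps: int = 3_750,
               total_steps: int = 150_000,
               min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr, then cosine decay to min_lr over the remaining steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```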

### Optimization Details

- **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
- **Gradient Clipping**: 1.0
- **Gradient Checkpointing**: Enabled for memory efficiency
- **Loss Function**: Cross-entropy loss
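
Putting the optimizer settings, gradient accumulation, clipping, and bfloat16 autocast together, one effective training step looks roughly like the sketch below. The names `model` and `batches` and the forward signature are assumptions; the real training loop may differ.

```python
import torch

# Assumed to exist: `model` (the NeoMini network on CUDA) and `batches`,
# an iterator yielding dicts with "input_ids" and "labels" tensors.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.01)
accum_steps = 32  # batch size 1 per device -> effective batch size 32

optimizer.zero_grad(set_to_none=True)
for micro_step in range(accum_steps):
    batch = next(batches)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(batch["input_ids"])            # assumed to return raw logits
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), batch["labels"].view(-1))
    (loss / accum_steps).backward()                   # accumulate scaled gradients

torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
optimizer.step()
```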

### Context Extension

- **Base Context**: 2,048 tokens
- **Extended Context**: 16,384 tokens
- **Method**: Linear interpolation of positional embeddings
- **Validation**: Successfully tested up to 3,600 tokens
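
Assuming the interpolation above refers to the standard position-interpolation trick for RoPE (scaling position indices by the ratio of base to extended context, here 2,048 / 16,384 = 1/8), a minimal sketch looks like this:

```python
import torch

def rope_angles(seq_len: int,
                head_dim: int = 64,
                base: float = 10_000.0,
                scale: float = 2_048 / 16_384) -> torch.Tensor:
    """Rotary angles with linear position interpolation: positions are shrunk by `scale`
    so that 16,384 extended positions span the same angular range as 2,048 base positions."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale     # linear interpolation step
    return torch.outer(positions, inv_freq)               # (seq_len, head_dim / 2) angles
```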

## Performance

### Training Metrics

- **Final Loss**: 3.907
- **Training Speed**: ~10 iterations/second
- **Peak Memory**: ~8GB VRAM
- **Convergence**: Smooth loss curve with no signs of overfitting

### Inference Performance

- **Speed**: ~150 tokens/second (RTX 5070)
- **Memory Usage**: ~1.3GB for an 1,800-token context
- **Context Limit**: ~3,600 tokens in practice
- **Temperature**: 0.7-0.9 recommended for creative tasks

## Usage

### Quick Start

```python
import torch
from transformers import AutoTokenizer
from model_neo import NeoMini, NeoMiniConfig

# Build the model and load the trained weights
config = NeoMiniConfig()
model = NeoMini(config)
checkpoint = torch.load("extended_context_model.pt", map_location="cpu")
model.load_state_dict(checkpoint["model_state_dict"])
model.eval()

# Load the tokenizer (the model uses the GPT-2 vocabulary)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(output))
```

### Interactive Chat

```bash
python interactive_chat.py
```

### Generation Parameters

- **Temperature**: 0.7-0.9 for creative tasks, 0.3-0.5 for factual
- **Top-k**: 40-50
- **Top-p**: 0.8-0.9
- **Repetition Penalty**: 1.1-1.3
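
To make these knobs concrete, the sketch below shows how temperature, top-k, top-p, and a repetition penalty are typically applied to a vector of next-token logits before sampling. This is a generic implementation, not necessarily the exact sampling code used by `model.generate` or `interactive_chat.py`.

```python
import torch

def sample_next_token(logits: torch.Tensor,
                      generated: torch.Tensor,
                      temperature: float = 0.8,
                      top_k: int = 50,
                      top_p: float = 0.9,
                      repetition_penalty: float = 1.2) -> int:
    """Apply repetition penalty, temperature, top-k, and top-p to 1-D logits, then sample."""
    logits = logits.clone()
    # Repetition penalty: make already-generated tokens less likely.
    for token_id in set(generated.tolist()):
        logits[token_id] = (logits[token_id] / repetition_penalty
                            if logits[token_id] > 0
                            else logits[token_id] * repetition_penalty)
    logits = logits / temperature                       # temperature scaling
    # Top-k: keep only the k highest-scoring tokens.
    if top_k > 0:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits[logits < kth_value] = float("-inf")
    # Top-p (nucleus): keep the smallest set of tokens whose probability mass exceeds top_p.
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    cutoff = cumulative > top_p
    cutoff[0] = False                                   # always keep the most likely token
    probs[sorted_idx[cutoff]] = 0.0
    probs = probs / probs.sum()
    return int(torch.multinomial(probs, num_samples=1))
```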

## Limitations

### Current Limitations

- **Base Model Only**: Not instruction-tuned; requires fine-tuning for chat or instruction following
- **Context Window**: Practical limit of ~3,600 tokens despite the 16K architecture
- **Hardware Requirements**: A CUDA-capable GPU is needed for reasonable inference speed
- **Knowledge Cutoff**: No explicit knowledge cutoff; coverage reflects the web data used for training

### Known Issues

- Occasionally generates repetitive patterns (mitigated by a repetition penalty or fine-tuning)
- May not follow instructions well (expected base-model behavior)
- Sometimes produces formatting artifacts carried over from web data

## Ethical Considerations

### Bias and Fairness

- Trained on web data, which may contain societal biases
- No explicit bias mitigation was applied during training
- Users should be aware that outputs may reflect these biases

### Use Cases

**Intended Uses:**

- Research and experimentation
- Text generation and completion
- Creative writing assistance
- Educational purposes

**Out-of-Scope Uses:**

- Medical or legal advice
- High-stakes decision making
- Content that could cause harm

## Environmental Impact

### Carbon Footprint

- **Training Hardware**: Single RTX 5070 Laptop GPU (~100W)
- **Training Time**: ~4 hours
- **Estimated CO₂**: ~0.3 kg CO₂ equivalent
- **Efficiency**: 253M parameters per ~0.3 kg CO₂
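
For reference, 100 W drawn for 4 hours is about 0.4 kWh of energy; the ~0.3 kg CO₂ figure is consistent with a grid carbon intensity of roughly 0.7-0.75 kg CO₂e per kWh, which is an assumed average rather than a measured value.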

## Model Card Authors

- Antony Austin: model development and training
- Model card created: 30 August 2025

## Citation

```bibtex
@misc{mapneo_mini_2025,
  title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
  author={Antony Austin},
  year={2025},
  howpublished={\url{https://huggingface.co/Austin207/Map-NEO}},
  note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data}
}
```

## Technical Details

### Hardware Requirements

- **Minimum**: 4GB VRAM for inference
- **Recommended**: 8GB VRAM for extended-context inference
- **Training**: 8GB+ VRAM with mixed precision
- **CPU**: Any modern CPU (CPU-only inference is possible but slow)

## Future Work

### Planned Improvements

- [ ] Conversational fine-tuning with the UltraChat dataset
- [ ] Instruction-following capabilities
- [ ] Multi-language support
- [ ] Quantized versions (4-bit, 8-bit)
- [ ] ONNX export for edge deployment

### Research Directions

- Context window optimization beyond 16K
- More efficient attention mechanisms
- Improved training data curation
- Specialized domain fine-tuning

## Acknowledgments

- **Falcon RefinedWeb**: High-quality training data
- **Hugging Face**: Transformers library and infrastructure
- **Community**: Open-source ML community for architectural insights

---

**Last Updated**: August 30, 2025

**Model Version**: 1.0.0

**Status**: Base model (pre-conversational fine-tuning)