---
language:
- en
license: mit
library_name: transformers
tags:
- text-generation
- pytorch
- custom-architecture
- rope
- rmsnorm
- swiglu
- flash-attention
- 16k-context
pipeline_tag: text-generation
widget:
- text: "The future of artificial intelligence is"
example_title: "AI Future"
- text: "Write a short story about"
example_title: "Story Generation"
- text: "Explain quantum computing in simple terms:"
example_title: "Technical Explanation"
datasets:
- tiiuae/falcon-refinedweb
metrics:
- perplexity
model-index:
- name: MAP-NEO Mini
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: RefinedWeb (100K subset)
      type: tiiuae/falcon-refinedweb
    metrics:
    - type: loss
      value: 3.907
      name: Final Training Loss
---
# MAP-NEO Mini
## Model Description
**MAP-NEO Mini** is a 253M parameter autoregressive language model built from scratch with modern architectural improvements. It demonstrates that high-quality language models can be trained efficiently on modest hardware while achieving competitive performance through careful data curation and architectural choices.
- **Developed by**: Antony Austin
- **Model type**: Autoregressive Language Model
- **Language(s)**: English
- **License**: MIT
- **Architecture**: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention
## Key Features
- **Efficient Training**: Trained on RTX 5070 Laptop GPU (8GB VRAM) in ~4 hours
- **Extended Context**: 16,384 token context window (16x typical small models)
- **Memory Efficient**: Only ~1.3GB VRAM for inference with a 1,800-token context
- **Fast Inference**: 150+ tokens/second on a consumer GPU
- **High Quality Data**: Trained on curated RefinedWeb subset
## Architecture Details
### Model Architecture
- **Parameters**: 253,085,696 (253M)
- **Layers**: 16 transformer blocks
- **Hidden Size**: 1,024
- **Attention Heads**: 16
- **Head Dimension**: 64
- **FFN Hidden Size**: 2,736 (2.67x hidden size)
- **Vocabulary Size**: 50,257 (GPT-2 tokenizer)
- **Max Sequence Length**: 16,384 tokens (see the configuration sketch below)
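For reference, these hyperparameters can be collected into a small configuration object. The sketch below is illustrative only; the field names are assumptions and may not match `model_neo.NeoMiniConfig` exactly.

```python
from dataclasses import dataclass

@dataclass
class MiniConfig:  # hypothetical mirror of NeoMiniConfig; field names are assumed
    n_layers: int = 16           # transformer blocks
    d_model: int = 1024          # hidden size
    n_heads: int = 16            # attention heads (head_dim = d_model // n_heads = 64)
    ffn_hidden: int = 2736       # SwiGLU hidden size (~2.67x d_model)
    vocab_size: int = 50257      # GPT-2 tokenizer
    max_seq_len: int = 16384     # extended context window
    rope_theta: float = 10000.0  # RoPE base frequency (assumed default)
    tie_embeddings: bool = True  # share input and output embedding weights
```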
### Architectural Innovations
- **RMSNorm**: Root Mean Square Layer Normalization for training stability (see the sketch after this list)
- **RoPE**: Rotary Positional Embeddings for better positional understanding
- **SwiGLU**: Swish-Gated Linear Units for improved FFN performance
- **Flash Attention**: Memory-efficient attention computation
- **Weight Tying**: Input/output embeddings shared for parameter efficiency
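To make the RMSNorm and SwiGLU items concrete, here is a minimal PyTorch sketch of both modules as they are commonly defined; the actual `model_neo` implementation may differ in details such as the epsilon value or bias handling.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization: rescale by 1/RMS(x); no mean subtraction or bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim: int = 1024, hidden: int = 2736):
        super().__init__()
        self.gate = nn.Linear(dim, hidden, bias=False)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))
```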
## Training Data
### Dataset
- **Source**: `tiiuae/falcon-refinedweb` (curated subset)
- **Size**: 100,000 high-quality web documents
- **Tokens**: ~41 million tokens
- **Sequence Length**: 1,024 tokens per sequence
- **Sequences**: 40,965 packed sequences (see the packing sketch below)
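As a rough illustration of how 100K documents become 40,965 packed sequences of 1,024 tokens each, the sketch below tokenizes the corpus, joins documents with EOS separators, and splits the resulting token stream into fixed-length chunks. The preprocessing script itself is not published here, so treat the function as an assumption.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
SEQ_LEN = 1024

def pack_documents(texts, seq_len=SEQ_LEN):
    """Concatenate EOS-separated token streams and split into fixed-length sequences."""
    stream = []
    for text in texts:
        stream.extend(tokenizer.encode(text) + [tokenizer.eos_token_id])
    n_full = len(stream) // seq_len  # drop the trailing partial chunk
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]
```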
### Data Quality
- Length filtering: 200-10,000 characters
- Language detection: English only
- Quality scoring: High-quality web content
- Deduplication: Exact and near-duplicate removal (a filtering sketch follows this list)
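These filters could be approximated with something like the sketch below. The `langdetect` dependency, the omitted quality-scoring step, and the exact thresholds are assumptions; near-duplicate removal (e.g. MinHash) is only noted in a comment.

```python
import hashlib
from langdetect import detect  # assumed language-ID dependency; any detector works

def keep_document(text: str, seen_hashes: set) -> bool:
    """Length, language, and exact-duplicate filters as described above."""
    if not (200 <= len(text) <= 10_000):          # length filter (characters)
        return False
    try:
        if detect(text) != "en":                  # English-only language filter
            return False
    except Exception:
        return False
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:                     # exact-duplicate removal
        return False
    seen_hashes.add(digest)
    return True  # near-duplicate removal (e.g. MinHash) would follow here
```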
## Training Procedure
### Training Configuration
- **Hardware**: NVIDIA RTX 5070 Laptop GPU (8GB VRAM)
- **Precision**: bfloat16 mixed precision
- **Batch Size**: 1 per device
- **Gradient Accumulation**: 32 steps
- **Effective Batch Size**: 32
- **Learning Rate**: 3e-4
- **Scheduler**: Cosine with linear warmup
- **Warmup Steps**: 3,750
- **Total Steps**: 150,000
- **Training Time**: ~4 hours
### Optimization Details
- **Optimizer**: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
- **Gradient Clipping**: 1.0
- **Gradient Checkpointing**: Enabled for memory efficiency
- **Loss Function**: Cross-entropy loss (see the training-loop sketch below)
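Putting the optimizer, scheduler, accumulation, and clipping settings together, one training step might look like the sketch below. The `model` and `train_loader` objects, and the forward call returning an object with a `.loss` attribute, are assumptions about the custom training script, not its actual code.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=3_750,
                                            num_training_steps=150_000)
accum_steps = 32  # effective batch size = 1 (per device) x 32

for step, batch in enumerate(train_loader):
    with torch.autocast("cuda", dtype=torch.bfloat16):   # bfloat16 mixed precision
        loss = model(batch["input_ids"], labels=batch["labels"]).loss / accum_steps
    loss.backward()
    if (step + 1) % accum_steps == 0:                     # gradient accumulation
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()  # step accounting relative to warmup/total is illustrative
        optimizer.zero_grad()
```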
### Context Extension
- **Base Context**: 2,048 tokens
- **Extended Context**: 16,384 tokens
- **Method**: Linear position interpolation of the rotary position embeddings (see the sketch after this list)
- **Validation**: Successfully tested up to 3,600 tokens
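A common way to implement this kind of extension is RoPE position interpolation: positions are scaled by the ratio of the base to the extended context before the rotary angles are computed, so the extended window reuses the position range seen during training. The sketch below shows the idea; whether the released checkpoint uses exactly this scheme is an assumption based on the description above.

```python
import torch

def rope_angles(seq_len, head_dim=64, base_context=2_048,
                extended_context=16_384, theta=10_000.0):
    """RoPE angles with linear position interpolation (positions squeezed by 1/8 here)."""
    scale = base_context / extended_context
    inv_freq = 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale
    return torch.outer(positions, inv_freq)  # shape: (seq_len, head_dim // 2)
```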
## Performance
### Training Metrics
- **Final Loss**: 3.907
- **Training Speed**: ~10 iterations/second
- **Peak Memory**: ~8GB VRAM
- **Convergence**: Smooth loss curve, no overfitting
### Inference Performance
- **Speed**: 150+ tokens/second (RTX 5070)
- **Memory Usage**: ~1.3GB VRAM for a 1,800-token context
- **Practical Context Limit**: ~3,600 tokens
- **Temperature**: Recommended 0.7-0.9 for creative tasks
## Usage
### Quick Start
```python
import torch
from transformers import AutoTokenizer
from model_neo import NeoMini, NeoMiniConfig

# Load model
config = NeoMiniConfig()
model = NeoMini(config)
checkpoint = torch.load("extended_context_model.pt")
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(output))
```
### Interactive Chat
```bash
python interactive_chat.py
```
### Generation Parameters
- **Temperature**: 0.7-0.9 for creative tasks, 0.3-0.5 for factual
- **Top-k**: 40-50
- **Top-p**: 0.8-0.9
- **Repetition Penalty**: 1.1-1.3 (a sampling sketch applying these settings follows)
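The sketch below applies these settings to a single sampling step over a `(vocab_size,)` logits tensor. It is a generic implementation of temperature, top-k, top-p, and a CTRL-style repetition penalty, not the code used by `interactive_chat.py`.

```python
import torch

def sample_next_token(logits, generated, temperature=0.8, top_k=50,
                      top_p=0.9, repetition_penalty=1.2):
    """Sample one token id from a 1-D logits tensor given previously generated ids."""
    logits = logits.clone()
    if generated.numel() > 0:                      # repetition penalty (CTRL-style)
        prev = torch.unique(generated)
        logits[prev] = torch.where(logits[prev] > 0,
                                   logits[prev] / repetition_penalty,
                                   logits[prev] * repetition_penalty)
    logits = logits / temperature                  # temperature scaling
    kth = torch.topk(logits, top_k).values[-1]     # top-k: keep the k highest logits
    logits[logits < kth] = float("-inf")
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p                     # top-p: keep smallest nucleus above p
    remove[1:] = remove[:-1].clone()
    remove[0] = False
    logits[sorted_idx[remove]] = float("-inf")
    return torch.multinomial(torch.softmax(logits, dim=-1), 1)
```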
## Limitations
### Current Limitations
- **Base Model Only**: Not instruction-tuned (requires fine-tuning for chat)
- **Context Window**: Practical limit of ~3,600 tokens despite 16K architecture
- **Hardware Requirements**: Requires CUDA-capable GPU for optimal performance
- **Knowledge Cutoff**: No explicit cutoff date; knowledge is limited to patterns in the web training data
### Known Issues
- Occasionally generates repetitive patterns (fixable with fine-tuning)
- May not follow instructions well (base model behavior)
- Sometimes produces formatting artifacts from web data
## Ethical Considerations
### Bias and Fairness
- Trained on web data which may contain societal biases
- No explicit bias mitigation applied during training
- Users should be aware of potential biased outputs
### Use Cases
**Intended Uses:**
- Research and experimentation
- Text generation and completion
- Creative writing assistance
- Educational purposes
**Out-of-Scope Uses:**
- Medical or legal advice
- High-stakes decision making
- Content that could cause harm
## Environmental Impact
### Carbon Footprint
- **Training Hardware**: Single RTX 5070 Laptop GPU (100W)
- **Training Time**: 4 hours
- **Estimated CO₂**: ~0.3 kg CO₂ equivalent
- **Efficiency**: 253M parameters per ~0.3 kg CO₂ (rough arithmetic below)
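The ~0.3 kg figure is consistent with a simple energy-times-grid-intensity estimate; the carbon intensity used below is an assumed value, not a measurement.

```python
power_kw = 0.100             # ~100 W GPU draw
hours = 4                    # training time
grid_kg_co2_per_kwh = 0.7    # assumed grid carbon intensity
energy_kwh = power_kw * hours                 # 0.4 kWh
co2_kg = energy_kwh * grid_kg_co2_per_kwh     # ~0.28 kg CO2e, i.e. roughly 0.3 kg
```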
## Model Card Authors
- **Antony Austin**: Model development and training
- **Model card created**: August 30, 2025
## Citation
```bibtex
@misc{mapneo_mini_2025,
  title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
  author={Antony Austin},
  year={2025},
  howpublished={\url{https://huggingface.co/Austin207/Map-NEO}},
  note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data}
}
```
## Technical Details
### Hardware Requirements
- **Minimum**: 4GB VRAM for inference
- **Recommended**: 8GB VRAM for extended context
- **Training**: 8GB+ VRAM with mixed precision
- **CPU**: Any modern CPU (inference possible but slow)
## Future Work
### Planned Improvements
- [ ] Conversational fine-tuning with UltraChat dataset
- [ ] Instruction following capabilities
- [ ] Multi-language support
- [ ] Quantized versions (4-bit, 8-bit)
- [ ] ONNX export for edge deployment
### Research Directions
- Context window optimization beyond 16K
- More efficient attention mechanisms
- Improved training data curation
- Specialized domain fine-tuning
## Acknowledgments
- **Falcon RefinedWeb**: High-quality training data
- **Hugging Face**: Transformers library and infrastructure
- **Community**: Open-source ML community for architectural insights
---
**Last Updated**: August 30, 2025
**Model Version**: 1.0.0
**Status**: Base model (pre-conversational fine-tuning)