---
language:
  - en
license: mit
library_name: transformers
tags:
  - text-generation
  - pytorch
  - custom-architecture
  - rope
  - rmsnorm
  - swiglu
  - flash-attention
  - 16k-context
pipeline_tag: text-generation
widget:
  - text: The future of artificial intelligence is
    example_title: AI Future
  - text: Write a short story about
    example_title: Story Generation
  - text: 'Explain quantum computing in simple terms:'
    example_title: Technical Explanation
datasets:
  - tiiuae/falcon-refinedweb
metrics:
  - perplexity
model-index:
  - name: MAP-NEO Mini
    results:
      - task:
          type: text-generation
          name: Text Generation
        dataset:
          name: RefinedWeb (100K subset)
          type: tiiuae/falcon-refinedweb
        metrics:
          - type: perplexity
            value: 3.9
            name: Final Training Loss
---

MAP-NEO Mini

Model Description

MAP-NEO Mini is a 253M-parameter autoregressive language model built from scratch with modern architectural improvements. It demonstrates that a capable language model can be trained efficiently on modest hardware and still reach competitive quality through careful data curation and sound architectural choices.

  • Developed by: Antony Austin
  • Model type: Autoregressive Language Model
  • Language(s): English
  • License: MIT
  • Architecture: Custom transformer with RoPE, RMSNorm, SwiGLU, and Flash Attention

Key Features

  • Efficient Training: Trained on a single RTX 5070 Laptop GPU (8GB VRAM) in ~4 hours
  • Extended Context: 16,384-token context window (16x that of typical small models)
  • Memory Efficient: Only 1.3GB of VRAM for inference at an 1,800-token context
  • Fast Inference: 150+ tokens/second on a consumer GPU
  • High-Quality Data: Trained on a curated RefinedWeb subset

Architecture Details

Model Architecture

  • Parameters: 253,085,696 (253M)
  • Layers: 16 transformer blocks
  • Hidden Size: 1,024
  • Attention Heads: 16
  • Head Dimension: 64
  • FFN Hidden Size: 2,736 (2.67x hidden size)
  • Vocabulary Size: 50,257 (GPT-2 tokenizer)
  • Max Sequence Length: 16,384 tokens
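
For reference, the hyperparameters above map onto a configuration object along the lines of the sketch below. The field names and defaults are illustrative; the actual NeoMiniConfig in model_neo.py may differ.

from dataclasses import dataclass

@dataclass
class NeoMiniConfigSketch:
    # Illustrative field names; the real NeoMiniConfig in model_neo.py may differ.
    n_layers: int = 16           # transformer blocks
    d_model: int = 1024          # hidden size
    n_heads: int = 16            # attention heads (head_dim = 1024 // 16 = 64)
    d_ffn: int = 2736            # SwiGLU FFN hidden size (~2.67x d_model)
    vocab_size: int = 50257      # GPT-2 tokenizer vocabulary
    max_seq_len: int = 16384     # extended context window
    rope_base: float = 10000.0   # RoPE frequency base (assumed default)
    tie_embeddings: bool = True  # share input/output embedding weights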

Architectural Innovations

  • RMSNorm: Root Mean Square Layer Normalization for training stability
  • RoPE: Rotary Positional Embeddings for better positional understanding
  • SwiGLU: Swish-Gated Linear Units for improved FFN performance
  • Flash Attention: Memory-efficient attention computation
  • Weight Tying: Input/output embeddings shared for parameter efficiency
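
As a rough illustration of how RMSNorm and SwiGLU are typically defined in PyTorch (a generic sketch, not necessarily the exact code in model_neo.py):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """Root Mean Square normalization: rescale by the RMS, no mean subtraction."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """Gated FFN: SiLU(x W_gate) * (x W_up), projected back down by W_down."""
    def __init__(self, d_model: int = 1024, d_ffn: int = 2736):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ffn, bias=False)
        self.w_up = nn.Linear(d_model, d_ffn, bias=False)
        self.w_down = nn.Linear(d_ffn, d_model, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))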

Training Data

Dataset

  • Source: tiiuae/falcon-refinedweb (curated subset)
  • Size: 100,000 high-quality web documents
  • Tokens: ~41 million
  • Sequence Length: 1,024 tokens per training sequence
  • Sequences: 40,965 packed sequences (see the packing sketch below)
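
The packing referenced above can be sketched as follows; pack_sequences is a hypothetical helper, not the project's actual preprocessing script:

import torch

def pack_sequences(token_streams, seq_len=1024, eos_id=50256):
    """Concatenate tokenized documents (separated by EOS) and slice the stream
    into fixed-length training sequences; the trailing remainder is dropped."""
    buffer = []
    for tokens in token_streams:        # each item: list[int] for one document
        buffer.extend(tokens + [eos_id])
    n_seqs = len(buffer) // seq_len     # here: ~41M tokens -> 40,965 sequences
    packed = torch.tensor(buffer[: n_seqs * seq_len], dtype=torch.long)
    return packed.view(n_seqs, seq_len)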

Data Quality

  • Length filtering: 200-10,000 characters
  • Language detection: English only
  • Quality scoring: High-quality web content
  • Deduplication: Exact and near-duplicate removal
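
A minimal sketch of such a filtering pass is shown below. The length thresholds come from the list above, but the language-detection call (langdetect) and hash-based exact deduplication are assumptions, and near-duplicate removal (e.g. MinHash) is omitted:

import hashlib
from langdetect import detect  # illustrative choice of language detector

def clean_corpus(docs, min_chars=200, max_chars=10_000):
    """Yield documents that pass length, language, and exact-duplicate checks."""
    seen_hashes = set()
    for text in docs:
        if not (min_chars <= len(text) <= max_chars):
            continue
        try:
            if detect(text) != "en":
                continue
        except Exception:               # langdetect raises on empty/ambiguous text
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:       # exact duplicate
            continue
        seen_hashes.add(digest)
        yield text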

Training Procedure

Training Configuration

  • Hardware: NVIDIA RTX 5070 Laptop GPU (8GB VRAM)
  • Precision: bfloat16 mixed precision
  • Batch Size: 1 per device
  • Gradient Accumulation: 32 steps
  • Effective Batch Size: 32
  • Learning Rate: 3e-4
  • Scheduler: Cosine with linear warmup
  • Warmup Steps: 3,750
  • Total Steps: 150,000
  • Training Time: ~4 hours

Optimization Details

  • Optimizer: AdamW (β₁=0.9, β₂=0.95, weight_decay=0.01)
  • Gradient Clipping: 1.0
  • Gradient Checkpointing: Enabled for memory efficiency
  • Loss Function: Cross-entropy loss
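
Put together, the configuration above corresponds roughly to the training-step sketch below. The model's forward signature, the data loader, and the assumption that warmup/total step counts refer to optimizer updates are all illustrative:

import torch
import torch.nn.functional as F
from transformers import get_cosine_schedule_with_warmup

# `model` and `train_loader` are assumed to exist; numbers come from the lists above.
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4,
                              betas=(0.9, 0.95), weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(optimizer,
                                            num_warmup_steps=3_750,
                                            num_training_steps=150_000)
accum_steps = 32  # effective batch size = 1 per device x 32 accumulation steps

model.train()
for step, batch in enumerate(train_loader):
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(batch["input_ids"])          # assumed forward signature
        # Next-token cross-entropy: predict token t+1 from tokens up to t
        loss = F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                               batch["input_ids"][:, 1:].reshape(-1))
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)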

Context Extension

  • Base Context: 2,048 tokens
  • Extended Context: 16,384 tokens
  • Method: Linear interpolation of RoPE positions (position interpolation), sketched below
  • Validation: Successfully tested up to 3,600 tokens
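
The interpolation can be sketched for RoPE as follows: position indices are rescaled so the extended 16,384-token range maps back onto the 2,048-token range used during training. This is a generic position-interpolation sketch, not the repository's exact implementation:

import torch

def rope_frequencies(head_dim=64, max_seq_len=16_384,
                     base=10_000.0, orig_ctx=2_048):
    """RoPE cos/sin tables with linear position interpolation: positions are
    scaled by orig_ctx / max_seq_len (a factor of 1/8 here)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_seq_len).float() * (orig_ctx / max_seq_len)
    angles = torch.outer(positions, inv_freq)       # (max_seq_len, head_dim // 2)
    return torch.cos(angles), torch.sin(angles)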

Performance

Training Metrics

  • Final Loss: 3.907
  • Training Speed: ~10 iterations/second
  • Peak Memory: ~8GB VRAM
  • Convergence: Smooth loss curve, no overfitting

Inference Performance

  • Speed: 150+ tokens/second (RTX 5070 Laptop GPU)
  • Memory Usage: 1.3GB for an 1,800-token context
  • Context Limit: ~3,600 tokens in practice
  • Temperature: Recommended 0.7-0.9 for creative tasks

Usage

Quick Start

import torch
from transformers import AutoTokenizer
from model_neo import NeoMini, NeoMiniConfig

# Load model
config = NeoMiniConfig()
model = NeoMini(config)
checkpoint = torch.load("extended_context_model.pt", map_location="cpu")
model.load_state_dict(checkpoint['model_state_dict'])
model.eval()

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Generate text
prompt = "The future of AI is"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    output = model.generate(input_ids, max_length=100, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))  # decode the first sequence

Interactive Chat

python interactive_chat.py

Generation Parameters

  • Temperature: 0.7-0.9 for creative tasks, 0.3-0.5 for factual
  • Top-k: 40-50
  • Top-p: 0.8-0.9
  • Repetition Penalty: 1.1-1.3
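
For illustration, these settings map onto a per-step sampling function roughly like the one below (logits is the 1-D vector of next-token logits, generated_ids the tokens produced so far); the repository's own generate/interactive_chat.py may apply them differently:

import torch

def sample_next_token(logits, generated_ids, temperature=0.8,
                      top_k=50, top_p=0.9, repetition_penalty=1.2):
    """Illustrative sampler for one decoding step (logits shape: [vocab_size])."""
    logits = logits.clone()
    # Repetition penalty: make already-generated tokens less likely.
    for token_id in set(generated_ids.tolist()):
        if logits[token_id] > 0:
            logits[token_id] /= repetition_penalty
        else:
            logits[token_id] *= repetition_penalty
    logits = logits / temperature
    # Top-k: keep the k most likely tokens (torch.topk returns them sorted).
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = torch.softmax(topk_vals, dim=-1)
    # Top-p: keep the smallest prefix whose cumulative probability is within p.
    keep = torch.cumsum(probs, dim=-1) <= top_p
    keep[0] = True                      # always keep the single most likely token
    probs = probs * keep
    probs = probs / probs.sum()
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice]             # vocabulary id of the sampled token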

Limitations

Current Limitations

  • Base Model Only: Not instruction-tuned (requires fine-tuning for chat)
  • Context Window: Practical limit of ~3,600 tokens despite 16K architecture
  • Hardware Requirements: Requires CUDA-capable GPU for optimal performance
  • Knowledge Cutoff: Reflects the RefinedWeb crawl; no specific cutoff date is guaranteed

Known Issues

  • Occasionally generates repetitive patterns (fixable with fine-tuning)
  • May not follow instructions well (base model behavior)
  • Sometimes produces formatting artifacts from web data

Ethical Considerations

Bias and Fairness

  • Trained on web data which may contain societal biases
  • No explicit bias mitigation applied during training
  • Users should be aware of potential biased outputs

Use Cases

Intended Uses:

  • Research and experimentation
  • Text generation and completion
  • Creative writing assistance
  • Educational purposes

Out-of-Scope Uses:

  • Medical or legal advice
  • High-stakes decision making
  • Content that could cause harm

Environmental Impact

Carbon Footprint

  • Training Hardware: Single RTX 5070 Laptop GPU (~100W)
  • Training Time: ~4 hours (~0.4 kWh)
  • Estimated CO₂: ~0.3 kg CO₂ equivalent (grid-mix dependent)
  • Efficiency: A complete 253M-parameter pretraining run for roughly 0.3 kg CO₂e
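
The estimate follows from simple arithmetic; the grid carbon intensity is an assumption and varies by region:

# Back-of-the-envelope energy and emissions estimate for the figures above.
power_kw = 0.100                # ~100W GPU draw during training
hours = 4
energy_kwh = power_kw * hours   # 0.4 kWh
grid_kg_per_kwh = 0.75          # assumed grid carbon intensity; region-dependent
print(f"~{energy_kwh * grid_kg_per_kwh:.2f} kg CO2e")  # ~0.3 kg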

Model Card Authors

Antony Austin: model development, training, and model card creation (30/08/2025)

Citation

@misc{mapneo_mini_2025,
  title={MAP-NEO Mini: An Efficient 253M Parameter Language Model},
  author={Antony Austin},
  year={2025},
  howpublished={\url{https://huggingface.co/Austin207/Map-NEO}},
  note={Trained on NVIDIA RTX 5070 Laptop GPU with RefinedWeb data}
}

Technical Details

Hardware Requirements

  • Minimum: 4GB VRAM for inference
  • Recommended: 8GB VRAM for extended context
  • Training: 8GB+ VRAM with mixed precision
  • CPU: Any modern CPU (inference possible but slow)
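
These figures are consistent with a quick estimate of weight and KV-cache memory (assuming a standard KV cache; activations and framework overhead account for the rest of the reported 1.3GB):

# Rough VRAM arithmetic: 253M bfloat16 parameters plus a 1,800-token KV cache.
params = 253_085_696
weight_bytes = params * 2                       # bf16 weights: ~0.51 GB

# KV cache: 2 (K and V) * layers * tokens * heads * head_dim * 2 bytes (bf16)
kv_cache_bytes = 2 * 16 * 1_800 * 16 * 64 * 2   # ~0.12 GB

print(f"weights  ~= {weight_bytes / 1e9:.2f} GB")
print(f"KV cache ~= {kv_cache_bytes / 1e9:.2f} GB")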

Future Work

Planned Improvements

  • Conversational fine-tuning with UltraChat dataset
  • Instruction following capabilities
  • Multi-language support
  • Quantized versions (4-bit, 8-bit)
  • ONNX export for edge deployment

Research Directions

  • Context window optimization beyond 16K
  • More efficient attention mechanisms
  • Improved training data curation
  • Specialized domain fine-tuning

Acknowledgments

  • Falcon RefinedWeb: High-quality training data
  • Hugging Face: Transformers library and infrastructure
  • Community: Open-source ML community for architectural insights

Last Updated: August 30, 2025
Model Version: 1.0.0
Status: Base model (pre-conversational fine-tuning)