# MAP-NEO Mini: A DIY LLM from Scratch

This repository demonstrates a complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context-window extension, interactive inference, and GPU optimization.

***

## 🚀 Project Overview

- **Model**: MAP-NEO Mini (253M parameters)
- **Architecture**: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention
- **Hardware**: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- **Data**: RefinedWeb (100K high-quality web docs, 41M tokens)
- **Context Window**: Extended from 1,024 → 16,384 tokens
- **Training**: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- **Fine-Tuning**: Planned conversational instruction tuning with UltraChat

***

## 📂 Repository Structure

```
AI/
├─ checkpoints/                  # Model checkpoints & configs
│  ├─ checkpoint_step_149999.pt  # Last pre-training checkpoint
│  ├─ extended_context_model.pt  # 8K context model
│  └─ model_config.json          # Config for extended model
├─ data/                         # Raw and processed data
│  ├─ shards/                    # Raw JSONL shards
│  ├─ processed/                 # Filtered JSONL
│  └─ tokens/                    # Packed token sequences
├─ clean_conversational_neo/     # Conversational training scripts
├─ configs/                      # training_config.json, data_config.json
├─ logs/                         # TensorBoard logs
├─ notebooks/                    # Exploratory Jupyter notebooks
├─ advanced_generate.py          # Advanced inference & context tests
├─ conversation_data_prep.py     # Prepares chat data for fine-tuning
├─ data_prep.py                  # RefinedWeb download & preprocessing
├─ debug_downloaded_data.py      # Inspect raw data quality
├─ extend_context.py             # Script to extend model context window
├─ finetune_neo.py               # Base fine-tuning script
├─ generate_text.py              # Simple generation utility
├─ interactive_chat.py           # Interactive chat interface
├─ model_neo.py                  # Model & config definitions
├─ requirements.txt              # Python dependencies
├─ run_training.py               # Orchestrates data prep → training
├─ scale_data.py                 # Utilities for sampling & scaling datasets
├─ setup_project.py              # Initial setup (venv, downloads)
├─ test_conversational_neo.py    # Tests on small conversational model
└─ train_neo.py                  # Main pre-training script
```

***

## 🛠️ Setup & Installation

1. **Clone** this repo.
2. **Create a virtual environment** (Python 3.10+):

   ```bash
   python -m venv .venv
   source .venv/bin/activate   # or .venv\Scripts\activate on Windows
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

3. **Install GPU drivers** and **CUDA** (if using an RTX GPU).
4. **Optional**: `pip install tensorboard pynvml` for logging and GPU monitoring.

***

## 📊 Data Preparation

- **Dataset**: `tiiuae/falcon-refinedweb`
- **Script**: `data_prep.py`
  - Downloads 100K docs, filters for quality (200–10,000 chars, English only)
  - Tokenizes with GPT-2 BPE, packs into sequences of length 1,024
- **Output**:
  - Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
  - Filtered: `data/processed/refinedweb_filtered.jsonl`
  - Packed tokens: `data/tokens/packed_1024.txt`

```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```

***

## 🏋️ Pre-Training

- **Script**: `train_neo.py`
- **Config**:

  ```python
  batch_size = 1
  gradient_accumulation_steps = 32
  max_steps = 150000
  warmup_steps = 3750
  mixed_precision = "bf16"
  gradient_checkpointing = True
  ```

- **Accelerate**: the `Accelerator` handles mixed precision, gradient accumulation, and checkpointing (see the sketch below).
- **Resume** from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.
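For orientation, here is a minimal, self-contained sketch of the training pattern the bullets above describe (not the actual `train_neo.py`), using Hugging Face Accelerate. The tiny stand-in model, random data, and `checkpoints/sketch_state` path are placeholders for the real model in `model_neo.py` and the packed-token loader.

```python
# Sketch only: bf16 mixed precision + gradient accumulation with Accelerate.
# A toy model and random data are used so the snippet runs anywhere.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(
    mixed_precision="bf16",              # matches the config block above
    gradient_accumulation_steps=32,
)

model = nn.Sequential(                   # placeholder for the 253M MAP-NEO Mini
    nn.Embedding(50257, 64), nn.Flatten(), nn.Linear(64 * 8, 50257)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
data = TensorDataset(
    torch.randint(0, 50257, (256, 8)),   # fake "packed" token windows
    torch.randint(0, 50257, (256,)),     # fake next-token targets
)
loader = DataLoader(data, batch_size=1)  # batch_size = 1, as above

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (inputs, targets) in enumerate(loader):
    with accelerator.accumulate(model):              # gradient accumulation
        loss = nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)                   # mixed-precision-aware backward
        optimizer.step()
        optimizer.zero_grad()
    if step > 0 and step % 128 == 0 and accelerator.is_main_process:
        accelerator.save_state("checkpoints/sketch_state")  # resume via load_state()
```

In the real script, the `--resume` flag shown below plays the same checkpoint-resume role.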
```bash
python train_neo.py                                                # fresh run
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt   # resume
```

- **Speed**: ~10 it/s → ~4 hours for 150K steps on the RTX 5070.

***

## 🔧 Context Extension

- **Script**: `extend_context.py`
- Extends `config.max_seq_len` → 16,384 and interpolates the rotary position embeddings.
- **Output**: `checkpoints/extended_context_model_16k.pt`

```bash
python extend_context.py --new_max_len 16384
```

***

## 🤖 Inference & Testing

### **Generation**

- `generate_text.py`: simple one-shot generation utility.
- `advanced_generate.py`: fixed-prompt and long-context tests with VRAM monitoring.

### **Interactive Chat**

`interactive_chat.py` provides a full chat interface:

- `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking
- Customizable sampling parameters

```bash
python interactive_chat.py
```

***

## 📈 Fine-Tuning Plan

- **Dataset Recommendation**: `openbmb/UltraChat` (1.5M dialogs) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- **Script**: `finetune_neo.py` (will be extended for conversational data)
- **Goal**: Transform the base model → instruction-following chat assistant

```bash
python finetune_neo.py \
  --base_model checkpoints/extended_context_model_16k.pt \
  --dataset /path/to/UltraChat \
  --epochs 3 --lr 5e-6 --batch_size 1
```

***

## 🔑 Key Lessons & Tips

- **Quality > Quantity**: RefinedWeb's data quality cut the required training steps by ~25%
- **Memory Efficiency**: Reached 3.6K-token contexts at ~1.3 GB VRAM
- **Batch Size Tradeoff**: Choosing a batch size of 1 instead of 2 was critical to avoid VRAM overflow
- **Cache Clearing**: `torch.cuda.empty_cache()` is essential for long-context tests
- **Resume Training**: Checkpointing during pre-training saved 10+ hours
- **Conversational Fine-Tuning**: The final step to transform the base model into a chat assistant

***

## 📂 Next Steps

1. **Review and run conversational fine-tuning** on UltraChat.
2. **Evaluate** on standardized benchmarks (perplexity, MMLU, HellaSwag).
3. **Quantize** or **prune** for faster inference on edge devices.
4. **Deploy** with FastAPI + SSE for streaming responses (a minimal sketch appears at the end of this README).
5. **Document** a model card and share results.

***

Thank you for following this detailed project! Your model is now a **powerful, efficient LLM** ready for conversational fine-tuning and deployment. Good luck!
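
***

## 🧪 Appendix: Streaming Deployment Sketch

Next step 4 mentions serving the model with FastAPI + SSE. The snippet below is a minimal, hedged sketch of what such an endpoint could look like; it is not part of this repo, and `stream_tokens` is a placeholder that simply echoes the prompt instead of calling the model.

```python
# Sketch of an SSE streaming endpoint with FastAPI -- illustrative only.
# Assumes this file is saved as sse_sketch.py; replace stream_tokens with the
# model's real token-by-token generator when wiring it up.
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_tokens(prompt: str):
    for word in f"(echo) {prompt}".split():
        yield f"data: {word}\n\n"        # SSE event framing: "data: ...\n\n"
        await asyncio.sleep(0.05)        # simulate per-token latency

@app.get("/chat")
async def chat(prompt: str):
    # text/event-stream lets EventSource clients consume the stream as it arrives
    return StreamingResponse(stream_tokens(prompt), media_type="text/event-stream")

# Run with:  uvicorn sse_sketch:app --reload
# Test with: curl -N "http://127.0.0.1:8000/chat?prompt=hello"
```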