# MAP-NEO Mini: A DIY LLM from Scratch

This repository demonstrates a complete, end-to-end journey of building, extending, and deploying a custom 253M-parameter language model on modest hardware (RTX 5070, 8 GB VRAM). It covers data preparation, model training, context-window extension, interactive inference, and GPU optimization.

***

## 🚀 Project Overview

- **Model**: MAP-NEO Mini (253M parameters)
- **Architecture**: Rotary embeddings, RMSNorm, SwiGLU, Flash Attention
- **Hardware**: Intel i5 CPU, 16 GB RAM → NVIDIA RTX 4000 (20 GB) → RTX 5070 (8 GB)
- **Data**: RefinedWeb (100K high-quality web docs, 41M tokens)
- **Context Window**: Extended from 1,024 → 16,384 tokens
- **Training**: Mixed precision (bf16), gradient checkpointing, gradient accumulation
- **Fine-Tuning**: Planned conversational instruction tuning with UltraChat

***

## 📂 Repository Structure

```
AI/
├─ checkpoints/                  # Model checkpoints & configs
│  ├─ checkpoint_step_149999.pt  # Last pre-training checkpoint
│  ├─ extended_context_model.pt  # 8K context model
│  └─ model_config.json          # Config for extended model
├─ data/                         # Raw and processed data
│  ├─ shards/                    # Raw JSONL shards
│  ├─ processed/                 # Filtered JSONL
│  └─ tokens/                    # Packed token sequences
├─ clean_conversational_neo/     # Conversational training scripts
├─ configs/                      # training_config.json, data_config.json
├─ logs/                         # TensorBoard logs
├─ notebooks/                    # Exploratory Jupyter notebooks
├─ advanced_generate.py          # Advanced inference & context tests
├─ conversation_data_prep.py     # Prepares chat data for fine-tuning
├─ data_prep.py                  # RefinedWeb download & preprocessing
├─ debug_downloaded_data.py      # Inspect raw data quality
├─ extend_context.py             # Script to extend model context window
├─ finetune_neo.py               # Base fine-tuning script
├─ generate_text.py              # Simple generation utility
├─ interactive_chat.py           # Interactive chat interface
├─ model_neo.py                  # Model & config definitions
├─ requirements.txt              # Python dependencies
├─ run_training.py               # Orchestrates data prep → training
├─ scale_data.py                 # Utilities for sampling & scaling datasets
├─ setup_project.py              # Initial setup (venv, downloads)
├─ test_conversational_neo.py    # Tests on small conversational model
└─ train_neo.py                  # Main pre-training script
```

***

## 🛠️ Setup & Installation

1. **Clone** this repo.
2. **Create a virtual environment** (Python 3.10+):

   ```bash
   python -m venv .venv
   source .venv/bin/activate   # or .venv\Scripts\activate on Windows
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

3. **Install GPU drivers** and **CUDA** (if using an RTX GPU).
4. **Optional**: `pip install tensorboard pynvml` for logging and GPU monitoring.

***

## 📊 Data Preparation

- **Dataset**: `tiiuae/falcon-refinedweb`
- **Script**: `data_prep.py`
  - Downloads 100K docs, filters for quality (200–10,000 chars, English only)
  - Tokenizes with GPT-2 BPE, packs into sequences of length 1,024
- **Output**:
  - Raw shards: `data/shards/refinedweb_sample_raw.jsonl`
  - Filtered: `data/processed/refinedweb_filtered.jsonl`
  - Packed tokens: `data/tokens/packed_1024.txt`

```bash
python data_prep.py --num_docs 100000 --seq_length 1024 --tokenizer gpt2 --output_dir data
```

***

## 🏋️ Pre-Training

- **Script**: `train_neo.py`
- **Config**:

  ```python
  batch_size = 1
  gradient_accumulation_steps = 32
  max_steps = 150000
  warmup_steps = 3750
  mixed_precision = "bf16"
  gradient_checkpointing = True
  ```

- **Accelerate**: the `Accelerator` handles mixed precision, gradient accumulation, and checkpointing (see the sketch below).
- **Resume** from any checkpoint: set `resume_from_checkpoint` in `TrainingConfig`.
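For orientation, here is a minimal, self-contained sketch of the training pattern the bullets above describe (not the actual `train_neo.py`), using Hugging Face Accelerate. The tiny stand-in model, random data, and `checkpoints/sketch_state` path are placeholders for the real model in `model_neo.py` and the packed-token loader.

```python
# Sketch only: bf16 mixed precision + gradient accumulation with Accelerate.
# A toy model and random data are used so the snippet runs anywhere.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator(
    mixed_precision="bf16",              # matches the config block above
    gradient_accumulation_steps=32,
)

model = nn.Sequential(                   # placeholder for the 253M MAP-NEO Mini
    nn.Embedding(50257, 64), nn.Flatten(), nn.Linear(64 * 8, 50257)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
data = TensorDataset(
    torch.randint(0, 50257, (256, 8)),   # fake "packed" token windows
    torch.randint(0, 50257, (256,)),     # fake next-token targets
)
loader = DataLoader(data, batch_size=1)  # batch_size = 1, as above

model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for step, (inputs, targets) in enumerate(loader):
    with accelerator.accumulate(model):              # gradient accumulation
        loss = nn.functional.cross_entropy(model(inputs), targets)
        accelerator.backward(loss)                   # mixed-precision-aware backward
        optimizer.step()
        optimizer.zero_grad()
    if step > 0 and step % 128 == 0 and accelerator.is_main_process:
        accelerator.save_state("checkpoints/sketch_state")  # resume via load_state()
```

In the real script, the `--resume` flag shown below plays the same checkpoint-resume role.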
```bash
python train_neo.py                                                # fresh run
python train_neo.py --resume checkpoints/checkpoint_step_7500.pt   # resume
```

- **Speed**: ~10 it/s → ~4 hours for 150K steps on the RTX 5070.

***

## 🔧 Context Extension

- **Script**: `extend_context.py`
- Extends `config.max_seq_len` → 16,384 and interpolates the rotary position embeddings.
- **Output**: `checkpoints/extended_context_model_16k.pt`

```bash
python extend_context.py --new_max_len 16384
```

***

## 🤖 Inference & Testing

### **Generation**

- `generate_text.py`: simple one-shot generation utility.
- `advanced_generate.py`: fixed-prompt and long-context tests with VRAM monitoring.

### **Interactive Chat**

`interactive_chat.py` provides a full chat interface:

- `/help`, `/params`, `/memory`, `/context`, `/clear`, `/save`, `/load`, `/multi`, `/system`, `/exit`
- Real-time GPU usage and context tracking
- Customizable sampling parameters

```bash
python interactive_chat.py
```

***

## 📈 Fine-Tuning Plan

- **Dataset Recommendation**: `openbmb/UltraChat` (1.5M dialogs) + `BAAI/Infinity-Instruct` + `vicgalle/alpaca-gpt4`
- **Script**: `finetune_neo.py` (will be extended for conversational data)
- **Goal**: Transform the base model → instruction-following chat assistant

```bash
python finetune_neo.py \
  --base_model checkpoints/extended_context_model_16k.pt \
  --dataset /path/to/UltraChat \
  --epochs 3 --lr 5e-6 --batch_size 1
```

***

## 🔑 Key Lessons & Tips

- **Quality > Quantity**: RefinedWeb's data quality cut the required training steps by ~25%
- **Memory Efficiency**: Reached 3.6K-token contexts at ~1.3 GB VRAM
- **Batch Size Tradeoff**: Choosing a batch size of 1 instead of 2 was critical to avoid VRAM overflow
- **Cache Clearing**: `torch.cuda.empty_cache()` is essential for long-context tests
- **Resume Training**: Checkpointing during pre-training saved 10+ hours
- **Conversational Fine-Tuning**: The final step to transform the base model into a chat assistant

***

## 📂 Next Steps

1. **Review and run conversational fine-tuning** on UltraChat.
2. **Evaluate** on standardized benchmarks (perplexity, MMLU, HellaSwag).
3. **Quantize** or **prune** for faster inference on edge devices.
4. **Deploy** with FastAPI + SSE for streaming responses (a minimal sketch appears at the end of this README).
5. **Document** a model card and share results.

***

Thank you for following this detailed project! Your model is now a **powerful, efficient LLM** ready for conversational fine-tuning and deployment. Good luck!
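
***

## 🧪 Appendix: Streaming Deployment Sketch

Next step 4 mentions serving the model with FastAPI + SSE. The snippet below is a minimal, hedged sketch of what such an endpoint could look like; it is not part of this repo, and `stream_tokens` is a placeholder that simply echoes the prompt instead of calling the model.

```python
# Sketch of an SSE streaming endpoint with FastAPI -- illustrative only.
# Assumes this file is saved as sse_sketch.py; replace stream_tokens with the
# model's real token-by-token generator when wiring it up.
import asyncio

from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def stream_tokens(prompt: str):
    for word in f"(echo) {prompt}".split():
        yield f"data: {word}\n\n"        # SSE event framing: "data: ...\n\n"
        await asyncio.sleep(0.05)        # simulate per-token latency

@app.get("/chat")
async def chat(prompt: str):
    # text/event-stream lets EventSource clients consume the stream as it arrives
    return StreamingResponse(stream_tokens(prompt), media_type="text/event-stream")

# Run with:  uvicorn sse_sketch:app --reload
# Test with: curl -N "http://127.0.0.1:8000/chat?prompt=hello"
```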