# NanoChat Retrained Model
A retrained and optimized version of NanoChat with improved training efficiency and inference performance.
## Model Details

### Architecture
- Base Model: NanoChat (retrained)
- Parameters: ~134M
- Context Length: 2048 tokens
- Vocabulary Size: 65,536 tokens
- Layers: 32
- Attention Heads: 16 (16 KV heads)
- Hidden Dimension: 2048
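
For reference, the figures above map onto a configuration along these lines (a minimal sketch in plain Python; the field names are illustrative, not nanochat's actual config class):

```python
from dataclasses import dataclass

@dataclass
class NanoChatConfig:
    # Values from the Architecture list above; field names are illustrative.
    n_layer: int = 32          # transformer layers
    n_head: int = 16           # attention heads
    n_kv_head: int = 16        # KV heads == heads, i.e. standard MHA (no GQA sharing)
    n_embd: int = 2048         # hidden dimension
    vocab_size: int = 65_536   # tokenizer vocabulary
    max_seq_len: int = 2_048   # context length
```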
## Training Information
This model was trained in three stages:
### Stage 1: Base Training
- Details: Training report lost
- Purpose: Foundation model training from scratch
### Stage 2: Midtraining
- Duration: ~13 hours
- Iterations: 813
- Batch Size: 524,288 tokens
- Data Type: bfloat16
- Learning Rates:
- Unembedding: 0.004
- Embedding: 0.2
- Matrix: 0.02
- Best Validation: 0.3972 bits per byte
- Hardware: NVIDIA A100-SXM4-40GB (40GB VRAM)
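
A sketch of how the three per-group learning rates above can be wired up in PyTorch. The grouping predicate and the choice of a single AdamW are illustrative; the actual pipeline may use different optimizers per group (nanochat, for example, pairs AdamW with Muon for matrix parameters):

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Assign the per-group learning rates listed above (grouping is illustrative)."""
    embed, unembed, matrix = [], [], []
    for name, param in model.named_parameters():
        if "lm_head" in name:      # unembedding / output projection
            unembed.append(param)
        elif "embed" in name:      # token embedding table
            embed.append(param)
        else:                      # everything else: the "matrix" parameters
            matrix.append(param)
    return torch.optim.AdamW(
        [
            {"params": unembed, "lr": 0.004},
            {"params": embed, "lr": 0.2},
            {"params": matrix, "lr": 0.02},
        ],
        weight_decay=0.0,  # matches the optimizer configuration below
    )
```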
### Stage 3: Chat Fine-tuning (SFT)
- Duration: ~14.5 hours
- Training Examples: 22,439 rows
- Iterations: 701
- Epochs: 1
- Target Examples per Step: 32
- Training Loss: 1.3129
- Learning Rate Schedule: 2% initial fraction with warmup
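
The "2% initial fraction with warmup" schedule can be read as a linear ramp over the first 2% of steps; a sketch of that interpretation (the exact ramp shape in the training code may differ):

```python
def lr_multiplier(step: int, total_steps: int, warmup_frac: float = 0.02) -> float:
    """Linear warmup over the first 2% of steps, then constant (illustrative)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return 1.0

# With 701 SFT iterations this gives 14 warmup steps:
assert lr_multiplier(13, 701) == 1.0 and lr_multiplier(0, 701) < 1.0
```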
**Total Training Time:** 27 hours 32 minutes
### Training Infrastructure
- Platform: Linux (128 CPU cores, 160GB RAM)
- GPU: 1× NVIDIA A100-SXM4-40GB
- CUDA: 12.8
- PyTorch: 2.8.0+cu128
- Python: 3.10.19
- Precision: bfloat16
## Improvements Over Base NanoChat
This retrained version features:
- ✅ Optimized learning rates for better convergence
- ✅ Enhanced training pipeline with regular checkpointing
- ✅ Improved inference efficiency
- ✅ Better general performance across tasks
- ✅ Fine-tuned for chat/instruction-following capabilities
## Usage

### Free Inference on Google Colab
You can run this model completely free using Google Colab's free GPU tier!
🚀 Run in Colab
The notebook includes:
- Simple inference interface
- OpenAI-compatible API server (see the client sketch below)
- Web UI for easy interaction
- No local GPU required!
Important: Make sure to enable GPU runtime in Google Colab:
- Go to Runtime → Change runtime type → select GPU (T4)
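
Once the notebook's API server is running, any OpenAI-compatible client can talk to it. A minimal sketch; the base URL and model name below are placeholders, so use whatever values the notebook prints:

```python
from openai import OpenAI

# Placeholder endpoint and model name; substitute the values the notebook reports.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nanochat-retrained",
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```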
### Local Installation

Coming soon: installation instructions will be added.
## Performance
| Stage | Loss | Training Time |
|---|---|---|
| Midtraining | 0.3972 bpb (validation) | ~13 hours |
| Chat SFT | 1.3129 (training) | ~14.5 hours |
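
Note that the two rows are in different units: midtraining is scored in bits per byte, while the SFT figure is a raw cross-entropy training loss (presumably nats per token), so the values are not directly comparable. Under that reading, a per-token loss converts to bits per byte as

$$
\text{bpb} = \frac{\mathcal{L}_{\text{nats/token}}}{\ln 2 \,\cdot\, \bar{b}},
$$

where $\bar{b}$ is the average number of bytes per token under the tokenizer.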
## Limitations
- Context window limited to 2048 tokens
- Optimized for chat/instruction tasks, may not perform as well on other domains
- Training data cutoff and potential biases inherited from base dataset
## Training Details

### Optimizer Configuration
- Weight decay: 0.0
- Gradient accumulation for large effective batch sizes
- Regular evaluation every 150 iterations (midtraining) / 100 iterations (SFT)
- Checkpoint saving every 100 iterations (midtraining) / 50 iterations (SFT)
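
That cadence amounts to simple step-count checks in the training loop; a sketch with the midtraining settings (the callables are stand-ins for the real pipeline's functions):

```python
from typing import Callable

def run_training(
    num_iterations: int,
    train_step: Callable[[], None],          # stand-in: one forward/backward/update
    evaluate: Callable[[], float],           # stand-in: validation pass, returns bpb
    save_checkpoint: Callable[[int], None],  # stand-in: write a checkpoint to disk
    eval_every: int = 150,                   # midtraining cadence (100 for SFT)
    ckpt_every: int = 100,                   # midtraining cadence (50 for SFT)
) -> None:
    for step in range(1, num_iterations + 1):
        train_step()
        if step % eval_every == 0:
            print(f"step {step}: val {evaluate():.4f} bpb")
        if step % ckpt_every == 0:
            save_checkpoint(step)
```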
### Data Processing
- Maximum sequence length: 2048 tokens
- Device batch size: 1 (with gradient accumulation)
- bfloat16 mixed precision training
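
Put together, a device batch size of 1 plus gradient accumulation and bfloat16 autocast looks roughly like the sketch below. On a single GPU, 524,288 tokens per batch ÷ 2,048 tokens per sequence ÷ 1 sequence per micro-step implies 256 accumulation micro-steps; the loss-returning `model(inputs, targets)` signature is an assumption:

```python
import torch

ACCUM_STEPS = 256  # 524,288 tokens / (2,048 tokens x device batch size 1), one GPU

def optimizer_step(model, optimizer, micro_batches):
    """One optimizer step over ACCUM_STEPS bfloat16 micro-batches (illustrative)."""
    optimizer.zero_grad(set_to_none=True)
    for inputs, targets in micro_batches:  # each of shape (1, 2048)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(inputs, targets)  # assumed to return the mean CE loss
        (loss / ACCUM_STEPS).backward()    # average gradients over micro-batches
    optimizer.step()
```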
## Citation
If you use this model, please cite:
@misc{nanochat-retrained,
  author = {Guilherme Keller De Souza (Guilherme34)},
  title = {NanoChat Retrained: An Optimized Training Pipeline},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Guilherme34/nanochat-retrained}}
}
## Acknowledgments
Special thanks to:
- nicoboss - For invaluable help with GPU optimization and solving critical coding problems
- RichardErkhov - For providing the GPU compute power that made this training possible
- Andrej Karpathy - For creating the original nanochat code and model

Additional acknowledgments:
- Original NanoChat architecture and training framework (heavily modified)
- Training conducted on NVIDIA A100 GPU
- Community contributions and feedback
## Contact

For questions, issues, or feedback:
- GitHub: guilh00009/nanochat-inference
## Coming Soon: BigChat → Samantha-standard-big
We're excited to announce that a larger version is currently in training!
BigChat (to be released as Samantha-standard-big) will feature:
- Significantly more parameters for enhanced capabilities
- Improved reasoning and instruction-following
- Better performance across diverse tasks
Stay tuned for updates on the release!
## License & Usage
This model is released under CC BY-NC-ND 4.0.
✅ Allowed:
- Downloading and running the model locally for personal, non-commercial use
- Research and experimentation (non-commercial)
❌ Not allowed:
- Commercial use of any kind (including SaaS, paid APIs, products, training, etc.)
- Fine-tuning, modifying, repackaging, or creating derivative models
- Redistributing the weights or any modified version
Note: This model is a research artifact. Please use responsibly and be aware of potential limitations and biases in AI-generated content.