# NanoChat Retrained Model
A retrained and optimized version of NanoChat with improved training efficiency and inference performance.
## Model Details

### Architecture
- Base Model: NanoChat (retrained)
- Parameters: ~134M
- Context Length: 2048 tokens
- Vocabulary Size: 65,536 tokens
- Layers: 32
- Attention Heads: 16 (16 KV heads)
- Hidden Dimension: 2048
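
For reference, the figures above map onto a configuration along these lines (a minimal sketch in plain Python; the field names are illustrative, not nanochat's actual config class):

```python
from dataclasses import dataclass

@dataclass
class NanoChatConfig:
    # Values from the Architecture list above; field names are illustrative.
    n_layer: int = 32          # transformer layers
    n_head: int = 16           # attention heads
    n_kv_head: int = 16        # KV heads == heads, i.e. standard MHA (no GQA sharing)
    n_embd: int = 2048         # hidden dimension
    vocab_size: int = 65_536   # tokenizer vocabulary
    max_seq_len: int = 2_048   # context length
```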
## Training Information
This model was trained in three stages:
### Stage 1: Base Training
- Details: Training report lost
- Purpose: Foundation model training from scratch
### Stage 2: Midtraining
- Duration: ~13 hours
- Iterations: 813
- Batch Size: 524,288 tokens
- Data Type: bfloat16
- Learning Rates:
- Unembedding: 0.004
- Embedding: 0.2
- Matrix: 0.02
- Best Validation: 0.3972 bits per byte
- Hardware: NVIDIA A100-SXM4-40GB (40GB VRAM)
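
A sketch of how the three per-group learning rates above can be wired up in PyTorch. The grouping predicate and the choice of a single AdamW are illustrative; the actual pipeline may use different optimizers per group (nanochat, for example, pairs AdamW with Muon for matrix parameters):

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    """Assign the per-group learning rates listed above (grouping is illustrative)."""
    embed, unembed, matrix = [], [], []
    for name, param in model.named_parameters():
        if "lm_head" in name:      # unembedding / output projection
            unembed.append(param)
        elif "embed" in name:      # token embedding table
            embed.append(param)
        else:                      # everything else: the "matrix" parameters
            matrix.append(param)
    return torch.optim.AdamW(
        [
            {"params": unembed, "lr": 0.004},
            {"params": embed, "lr": 0.2},
            {"params": matrix, "lr": 0.02},
        ],
        weight_decay=0.0,  # matches the optimizer configuration below
    )
```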
### Stage 3: Chat Fine-tuning (SFT)
- Duration: ~14.5 hours
- Training Examples: 22,439 rows
- Iterations: 701
- Epochs: 1
- Target Examples per Step: 32
- Training Loss: 1.3129
- Learning Rate Schedule: 2% initial fraction with warmup
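
The "2% initial fraction with warmup" schedule can be read as a linear ramp over the first 2% of steps; a sketch of that interpretation (the exact ramp shape in the training code may differ):

```python
def lr_multiplier(step: int, total_steps: int, warmup_frac: float = 0.02) -> float:
    """Linear warmup over the first 2% of steps, then constant (illustrative)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return 1.0

# With 701 SFT iterations this gives 14 warmup steps:
assert lr_multiplier(13, 701) == 1.0 and lr_multiplier(0, 701) < 1.0
```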
**Total Training Time:** 27 hours 32 minutes
### Training Infrastructure
- Platform: Linux (128 CPU cores, 160GB RAM)
- GPU: 1× NVIDIA A100-SXM4-40GB
- CUDA: 12.8
- PyTorch: 2.8.0+cu128
- Python: 3.10.19
- Precision: bfloat16
## Improvements Over Base NanoChat
This retrained version features:
- ✅ Optimized learning rates for better convergence
- ✅ Enhanced training pipeline with regular checkpointing
- ✅ Improved inference efficiency
- ✅ Better general performance across tasks
- ✅ Fine-tuned for chat/instruction-following capabilities
## Usage

### Free Inference on Google Colab
You can run this model completely free using Google Colab's free GPU tier!
🚀 Run in Colab
The notebook includes:
- Simple inference interface
- OpenAI-compatible API server (see the client sketch below)
- Web UI for easy interaction
- No local GPU required!
Important: Make sure to enable GPU runtime in Google Colab:
- Go to Runtime → Change runtime type → select GPU (T4)
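
Once the notebook's API server is running, any OpenAI-compatible client can talk to it. A minimal sketch; the base URL and model name below are placeholders, so use whatever values the notebook prints:

```python
from openai import OpenAI

# Placeholder endpoint and model name; substitute the values the notebook reports.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nanochat-retrained",
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```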
### Local Installation

Coming soon: installation instructions will be added.
## Performance
| Stage | Loss | Training Time |
|---|---|---|
| Midtraining | 0.3972 bpb (validation) | ~13 hours |
| Chat SFT | 1.3129 (training) | ~14.5 hours |
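
Note that the two rows are in different units: midtraining is scored in bits per byte, while the SFT figure is a raw cross-entropy training loss (presumably nats per token), so the values are not directly comparable. Under that reading, a per-token loss converts to bits per byte as

$$
\text{bpb} = \frac{\mathcal{L}_{\text{nats/token}}}{\ln 2 \,\cdot\, \bar{b}},
$$

where $\bar{b}$ is the average number of bytes per token under the tokenizer.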
## Limitations
- Context window limited to 2048 tokens
- Optimized for chat/instruction tasks, may not perform as well on other domains
- Training data cutoff and potential biases inherited from base dataset
## Training Details

### Optimizer Configuration
- Weight decay: 0.0
- Gradient accumulation for large effective batch sizes
- Regular evaluation every 150 iterations (midtraining) / 100 iterations (SFT)
- Checkpoint saving every 100 iterations (midtraining) / 50 iterations (SFT)
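
That cadence amounts to simple step-count checks in the training loop; a sketch with the midtraining settings (the callables are stand-ins for the real pipeline's functions):

```python
from typing import Callable

def run_training(
    num_iterations: int,
    train_step: Callable[[], None],          # stand-in: one forward/backward/update
    evaluate: Callable[[], float],           # stand-in: validation pass, returns bpb
    save_checkpoint: Callable[[int], None],  # stand-in: write a checkpoint to disk
    eval_every: int = 150,                   # midtraining cadence (100 for SFT)
    ckpt_every: int = 100,                   # midtraining cadence (50 for SFT)
) -> None:
    for step in range(1, num_iterations + 1):
        train_step()
        if step % eval_every == 0:
            print(f"step {step}: val {evaluate():.4f} bpb")
        if step % ckpt_every == 0:
            save_checkpoint(step)
```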
### Data Processing
- Maximum sequence length: 2048 tokens
- Device batch size: 1 (with gradient accumulation)
- bfloat16 mixed precision training
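
Put together, a device batch size of 1 plus gradient accumulation and bfloat16 autocast looks roughly like the sketch below. On a single GPU, 524,288 tokens per batch ÷ 2,048 tokens per sequence ÷ 1 sequence per micro-step implies 256 accumulation micro-steps; the loss-returning `model(inputs, targets)` signature is an assumption:

```python
import torch

ACCUM_STEPS = 256  # 524,288 tokens / (2,048 tokens x device batch size 1), one GPU

def optimizer_step(model, optimizer, micro_batches):
    """One optimizer step over ACCUM_STEPS bfloat16 micro-batches (illustrative)."""
    optimizer.zero_grad(set_to_none=True)
    for inputs, targets in micro_batches:  # each of shape (1, 2048)
        with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
            loss = model(inputs, targets)  # assumed to return the mean CE loss
        (loss / ACCUM_STEPS).backward()    # average gradients over micro-batches
    optimizer.step()
```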
## Citation
If you use this model, please cite:
@misc{nanochat-retrained,
  author = {Guilherme Keller De Souza (Guilherme34)},
  title = {NanoChat Retrained: An Optimized Training Pipeline},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Guilherme34/nanochat-retrained}}
}
## Acknowledgments
Special thanks to:
- nicoboss - For invaluable help with GPU optimization and solving critical coding problems
- RichardErkhov - For providing the GPU compute power that made this training possible
- Andrej Karpathy - For creating the original nanochat code and model

Additional acknowledgments:
- Original NanoChat architecture and training framework (heavily modified)
- Training conducted on NVIDIA A100 GPU
- Community contributions and feedback
## Contact

For questions, issues, or feedback:
- GitHub: guilh00009/nanochat-inference
## Coming Soon: BigChat → Samantha-standard-big
We're excited to announce that a larger version is currently in training!
BigChat (to be released as Samantha-standard-big) will feature:
- Significantly more parameters for enhanced capabilities
- Improved reasoning and instruction-following
- Better performance across diverse tasks
Stay tuned for updates on the release!
## License & Usage
This model is released under CC BY-NC-ND 4.0.
✅ Allowed:
- Downloading and running the model locally for personal, non-commercial use
- Research and experimentation (non-commercial)
❌ Not allowed:
- Commercial use of any kind (including SaaS, paid APIs, products, training, etc.)
- Fine-tuning, modifying, repackaging, or creating derivative models
- Redistributing the weights or any modified version
Note: This model is a research artifact. Please use responsibly and be aware of potential limitations and biases in AI-generated content.