NanoChat Retrained Model

A retrained and optimized version of NanoChat with improved training efficiency and inference performance.

Model Details

Architecture

  • Base Model: NanoChat (retrained)
  • Parameters: ~134M
  • Context Length: 2048 tokens
  • Vocabulary Size: 65,536 tokens
  • Layers: 32
  • Attention Heads: 16 (16 KV heads)
  • Hidden Dimension: 2048
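
For orientation, the hyperparameters above can be summarized as a GPT-style configuration object. This is a minimal sketch for readability; the field names are illustrative and do not reflect the actual nanochat config API:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:
    vocab_size: int = 65_536  # tokenizer vocabulary
    n_layer: int = 32         # transformer blocks
    n_head: int = 16          # attention heads
    n_kv_head: int = 16       # KV heads (equal to n_head, i.e. full multi-head attention)
    n_embd: int = 2048        # hidden dimension
    block_size: int = 2048    # context length in tokens

config = ModelConfig()
print(config)
```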

Training Information

This model was trained in three stages:

Stage 1: Base Training

  • Details: The training report for this stage was not preserved
  • Purpose: Foundation model training from scratch

Stage 2: Midtraining

  • Duration: ~13 hours
  • Iterations: 813
  • Batch Size: 524,288 tokens
  • Data Type: bfloat16
  • Learning Rates (per parameter group; see the sketch after this list):
    • Unembedding: 0.004
    • Embedding: 0.2
    • Matrix: 0.02
  • Best Validation: 0.3972 bits per byte
  • Hardware: NVIDIA A100-SXM4-40GB (40GB VRAM)
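
The three learning rates above apply to separate parameter groups. Below is a minimal sketch of how such a split could be wired up with plain AdamW; the parameter-name matching (`lm_head`, `wte`) and the optimizer choice are assumptions for illustration, not the actual training code:

```python
import torch

def build_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # Partition parameters by role; the selection logic here is illustrative.
    embedding, unembedding, matrix = [], [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if "lm_head" in name:   # output projection ("unembedding")
            unembedding.append(p)
        elif "wte" in name:     # token embedding table
            embedding.append(p)
        else:                   # all other weight matrices
            matrix.append(p)
    return torch.optim.AdamW(
        [
            {"params": unembedding, "lr": 0.004},
            {"params": embedding,   "lr": 0.2},
            {"params": matrix,      "lr": 0.02},
        ],
        weight_decay=0.0,  # matches the optimizer configuration below
    )
```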

Stage 3: Chat Fine-tuning (SFT)

  • Duration: ~14.5 hours
  • Training Examples: 22,439
  • Iterations: 701
  • Epochs: 1
  • Target Examples per Step: 32
  • Training Loss: 1.3129
  • Learning Rate Schedule: warmup over the initial 2% of steps
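
A minimal sketch of what warmup over the first 2% of steps could look like, assuming linear warmup followed by linear decay (the exact decay shape used in training is not documented here):

```python
def lr_scale(step: int, total_steps: int, warmup_frac: float = 0.02) -> float:
    """Multiplier applied to the base learning rate at a given step."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return (step + 1) / warmup_steps  # linear warmup
    # Linear decay to zero after warmup (assumed shape).
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Example: 701 iterations, as in the SFT run above.
print(lr_scale(0, 701), lr_scale(14, 701), lr_scale(700, 701))
```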

Total Training Time: 27 hours 32 minutes

Training Infrastructure

  • Platform: Linux (128 CPU cores, 160GB RAM)
  • GPU: 1× NVIDIA A100-SXM4-40GB
  • CUDA: 12.8
  • PyTorch: 2.8.0+cu128
  • Python: 3.10.19
  • Precision: bfloat16

Improvements Over Base NanoChat

This retrained version features:

  • ✅ Optimized learning rates for better convergence
  • ✅ Enhanced training pipeline with regular checkpointing
  • ✅ Improved inference efficiency
  • ✅ Better general performance across tasks
  • ✅ Fine-tuned for chat/instruction-following capabilities

Usage

Free Inference on Google Colab

You can run this model at no cost using Google Colab's free GPU tier!

🚀 Run in Colab

The notebook includes:

  • Simple inference interface
  • OpenAI-compatible API server (see the usage sketch after this list)
  • Web UI for easy interaction
  • No local GPU required!
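
Once the notebook's OpenAI-compatible server is running, it can be queried with the standard openai Python client. A hedged sketch, assuming the server listens on http://localhost:8000/v1 and serves the model under the name "nanochat-retrained"; substitute whatever base URL and model name the notebook actually prints:

```python
from openai import OpenAI

# Endpoint and model name are assumptions; use the values the notebook prints.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="nanochat-retrained",
    messages=[{"role": "user", "content": "Hello! What can you do?"}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```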

Important: Make sure to enable GPU runtime in Google Colab:

  • Go to Runtime → Change runtime type → Select GPU (T4)
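
Before loading the model, you can confirm the GPU runtime is active with a quick PyTorch check:

```python
import torch

assert torch.cuda.is_available(), "No GPU found - enable a GPU runtime first"
print(torch.cuda.get_device_name(0))  # e.g. "Tesla T4" on the free tier
```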

Local Installation

# Coming soon - installation instructions will be added

Performance

| Stage       | Loss                    | Training Time |
|-------------|-------------------------|---------------|
| Midtraining | 0.3972 bpb (validation) | ~13 hours     |
| Chat SFT    | 1.3129 (training)       | ~14.5 hours   |
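
For context, bits per byte (bpb) normalizes cross-entropy by the byte length of the evaluated text, which makes the number comparable across tokenizers. Under the usual definition (an assumption about the exact normalization used for this run):

```latex
\mathrm{bpb} = \frac{\sum_i \ell_i}{\ln 2 \cdot N_{\mathrm{bytes}}}
```

where \ell_i is the cross-entropy of token i in nats and N_bytes is the total number of UTF-8 bytes of the evaluated text.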

Limitations

  • Context window limited to 2048 tokens
  • Optimized for chat/instruction tasks; performance may be weaker in other domains
  • Training data cutoff and potential biases inherited from base dataset

Training Details

Optimizer Configuration

  • Weight decay: 0.0
  • Gradient accumulation for large effective batch sizes (see the sketch after this list)
  • Regular evaluation every 150 iterations (midtraining) / 100 iterations (SFT)
  • Checkpoint saving every 100 iterations (midtraining) / 50 iterations (SFT)
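
A minimal sketch of gradient accumulation under bfloat16 autocast, as described above; `model`, `optimizer`, `loss_fn`, and `get_batch` are placeholders rather than the actual training code:

```python
import torch

accum_steps = 256  # see the batch-size arithmetic under Data Processing below

optimizer.zero_grad(set_to_none=True)
for _ in range(accum_steps):
    x, y = get_batch()  # one sequence of up to 2,048 tokens (device batch size 1)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
        loss = loss_fn(logits, y) / accum_steps  # average over micro-batches
    loss.backward()  # gradients from each micro-batch add up
optimizer.step()     # one large effective step
```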

Data Processing

  • Maximum sequence length: 2048 tokens
  • Device batch size: 1 (with gradient accumulation)
  • bfloat16 mixed precision training
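
These three numbers pin down the gradient-accumulation factor for the midtraining batch:

```python
batch_tokens = 524_288  # midtraining batch size in tokens
seq_len = 2_048         # maximum sequence length
device_batch = 1        # sequences per forward pass
print(batch_tokens // (seq_len * device_batch))  # -> 256 accumulation steps
```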

Citation

If you use this model, please cite:

@misc{nanochat-retrained,
  author = {Guilherme Keller De Souza (Guilherme34)},
  title = {NanoChat Retrained: An Optimized Training Pipeline},
  year = {2025},
  publisher = {HuggingFace},
  howpublished = {\url{https://huggingface.co/Guilherme34/nanochat-retrained}}
}

License

Released under CC BY-NC-ND 4.0 (non-commercial use only); see the License & Usage section below for the full terms.

Acknowledgments

Special thanks to:

  • nicoboss - For invaluable help with GPU optimization and solving critical coding problems
  • RichardErkhov - For providing the GPU compute power that made this training possible
  • Andrej Karpathy - For creating the original NanoChat code and model

Additional acknowledgments:

  • Original NanoChat architecture and training framework (heavily modified)
  • Training conducted on an NVIDIA A100 GPU
  • Community contributions and feedback

Contact

For questions, issues, or feedback:

  • GitHub: guilh00009/nanochat-inference

Coming Soon: BigChat → Samantha-standard-big

We're excited to announce that a larger version is currently in training!

BigChat (to be released as Samantha-standard-big) will feature:

  • Significantly more parameters for enhanced capabilities
  • Improved reasoning and instruction-following
  • Better performance across diverse tasks

Stay tuned for updates on the release!


License & Usage

This model is released under CC BY-NC-ND 4.0.

✅ Allowed:

  • Downloading and running the model locally for personal, non-commercial use
  • Research and experimentation (non-commercial)

❌ Not allowed:

  • Commercial use of any kind (including SaaS, paid APIs, products, training, etc.)
  • Fine-tuning, modifying, repackaging, or creating derivative models
  • Redistributing the weights or any modified version

Note: This model is a research artifact. Please use responsibly and be aware of potential limitations and biases in AI-generated content.
