nanochat-d10-raw-700m

d10 model trained on 700M tokens of raw Common Crawl data

Model Description

This is a d10 model (~100M parameters) trained as part of a research project investigating the impact of training data quality on LLM performance.

  • Architecture: 10-layer transformer with 640 hidden dimensions
  • Training framework: nanochat
  • Base tokenizer: BPE with 65K vocab
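
For a rough sense of where the parameter count comes from, the dimensions above can be turned into a back-of-the-envelope estimate. This is a sketch only: it assumes a standard GPT block with 4x MLP expansion, and the exact total depends on nanochat's block details and on whether the embedding and output head share weights.

# Back-of-the-envelope parameter estimate from the listed dimensions
n_layer, n_embd, vocab_size = 10, 640, 65536

embedding = vocab_size * n_embd           # token embedding table: ~42M
per_block = 12 * n_embd * n_embd          # attention (4x n_embd^2) + 4x-expansion MLP (8x n_embd^2)
blocks = n_layer * per_block              # ~49M across 10 layers

print(f"tied head:   ~{(embedding + blocks) / 1e6:.0f}M")      # output head shares the embedding
print(f"untied head: ~{(2 * embedding + blocks) / 1e6:.0f}M")  # separate output projection

The exact count for this checkpoint can be read off the loaded model, as shown at the end of the Usage section.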

Training Details

Dataset

  • Source: Common Crawl (unfiltered)
  • Size: 700M tokens
  • Quality filtering: None - raw data

Hyperparameters

  • Iterations: 14,000
  • Batch size: 32
  • Sequence length: 512
  • Learning rate: 6e-4
  • Training time: 3.8 hours
  • Hardware: 2× NVIDIA RTX 6000 Ada

Training Results

  • Final training loss: 4.3831
  • Average throughput: 74,111 tokens/sec

Research Context

This model is part of Phase 2 of the Oren project, which tests the hypothesis:

"Quality-filtered training data enables smaller, more efficient models with comparable performance."

Experiment Setup

I trained two models with identical architecture and hyperparameters:

  • Model A (this model): Trained on raw data
  • Model B (companion model): Trained on quality-filtered data (top 70%; a filtering sketch follows below)
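
The filtering pipeline behind Model B is not reproduced here, but the idea of keeping the top 70% of documents by quality score can be sketched in a few lines. This is a minimal illustration: the per-document scores are assumed to come from whatever scoring model or heuristic the pipeline uses, and load_raw_docs / score_docs are hypothetical placeholders, not Oren or nanochat APIs.

import numpy as np

def keep_top_fraction(documents, scores, fraction=0.70):
    """Keep the highest-scoring `fraction` of documents by quality score."""
    scores = np.asarray(scores)
    cutoff = np.quantile(scores, 1.0 - fraction)  # 30th percentile -> keep the top 70%
    return [doc for doc, s in zip(documents, scores) if s >= cutoff]

# Hypothetical usage:
# docs = load_raw_docs()        # raw Common Crawl documents (placeholder)
# scores = score_docs(docs)     # one quality score per document (placeholder)
# filtered = keep_top_fraction(docs, scores, fraction=0.70)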

Key Finding: Model B reached a similar final training loss (4.44 vs. 4.38) with 29% less training data and 29% less training time.
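
For intuition, these cross-entropy losses convert to perplexity via exp(loss), so the 0.06 gap in loss corresponds to roughly a 6% difference in perplexity:

import math

# Perplexity = exp(cross-entropy loss)
loss_raw, loss_filtered = 4.38, 4.44
print(f"raw: {math.exp(loss_raw):.1f}, filtered: {math.exp(loss_filtered):.1f}")
# raw: ~79.8, filtered: ~84.8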

Usage

import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import get_tokenizer
from nanochat.engine import Engine

# Load model
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
config = GPTConfig(
    sequence_len=512,
    vocab_size=65536,
    n_layer=10,
    n_head=10,
    n_kv_head=10,
    n_embd=640,
)

model = GPT(config)
model.load_state_dict(checkpoint)
model.eval()

# Generate text
tokenizer = get_tokenizer()
engine = Engine(model, tokenizer)

prompt_tokens = tokenizer("The capital of France is", prepend="<|bos|>")
output, _ = engine.generate_batch(prompt_tokens, max_tokens=50, temperature=0.7)
print(tokenizer.decode(output[0]))
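
As a quick sanity check on the model size, the exact parameter count can be read from the loaded model (plain PyTorch; this assumes the snippet above has already been run):

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")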

Limitations

  • Small model: ~100M parameters - not suitable for complex reasoning
  • Limited training: Only 700M tokens of training data
  • No instruction tuning: This is a base model, not aligned for chat
  • Research artifact: Trained to validate data quality hypothesis, not for production use

Ethical Considerations

  • Trained on Common Crawl data, which may contain biases
  • Should not be used for critical applications without further evaluation
  • May generate offensive or incorrect content

Citation

If you use this model, please cite:

@software{oren2025,
  title={Oren: Quality Auditing for LLM Training Data},
  author={Amir Valizadeh},
  year={2025},
  url={https://github.com/vitalune/Oren}
}

License

MIT License - See LICENSE file for details
