# nanochat-d10-raw-700m

A d10 model trained on 700M tokens of raw, unfiltered Common Crawl data.
## Model Description
This is a d10 model (~100M parameters) trained as part of a research project investigating the impact of training data quality on LLM performance.
- Architecture: 10-layer transformer with a hidden dimension of 640
- Training framework: nanochat
- Tokenizer: BPE with a 65,536-token (65K) vocabulary
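The ~100M figure can be roughly reproduced from the numbers above. This is a back-of-the-envelope estimate that assumes a standard GPT block layout (about 12·n_embd² parameters per layer from attention plus MLP) and a single token-embedding table; the exact nanochat count may differ slightly (layer norms, untied output head, etc.).

```python
# Rough parameter-count estimate for the d10 configuration described above.
n_layer, n_embd, vocab_size = 10, 640, 65536

block_params = 12 * n_embd ** 2         # per-layer attention (~4*d^2) + MLP (~8*d^2)
embedding_params = vocab_size * n_embd  # token embedding table
total = n_layer * block_params + embedding_params

print(f"~{total / 1e6:.0f}M parameters")  # ~91M, consistent with the ~100M figure
```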
## Training Details

### Dataset
- Source: Common Crawl (unfiltered)
- Size: 700M tokens
- Quality filtering: none (raw data)
### Hyperparameters
- Iterations: 14,000
- Batch size: 32
- Sequence length: 512
- Learning rate: 6e-4
- Training time: 3.8 hours
- Hardware: 2× NVIDIA RTX 6000 Ada
### Training Results
- Final training loss: 4.3831
- Average throughput: 74,111 tokens/sec
## Research Context
This model is part of Phase 2 of the Oren project, which tests the hypothesis:

> "Quality-filtered training data enables smaller, more efficient models with comparable performance."
### Experiment Setup
I trained two identically configured models:

- Model A (this model): trained on raw, unfiltered data
- Model B (companion model): trained on quality-filtered data (the top 70%)
**Key finding:** Model B reached a comparable final loss (4.44 vs. 4.38) with 29% less training data and 29% less training time.
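For illustration, here is a minimal sketch of the kind of top-70% selection used to build Model B's training set. `quality_score` and `load_shard` are hypothetical placeholders, not the actual Oren implementation; only the quantile-based cutoff logic is shown.

```python
import numpy as np

def filter_top_fraction(documents, scores, keep_fraction=0.70):
    """Keep the highest-scoring `keep_fraction` of documents."""
    cutoff = np.quantile(scores, 1.0 - keep_fraction)
    return [doc for doc, score in zip(documents, scores) if score >= cutoff]

# Hypothetical usage (placeholder helpers, not part of this repo):
# docs = load_shard("common_crawl_shard.jsonl")
# scores = [quality_score(d) for d in docs]           # document-quality metric
# model_b_corpus = filter_top_fraction(docs, scores)  # ~70% of the raw corpus
```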
## Usage
```python
import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import get_tokenizer
from nanochat.engine import Engine

# Load the checkpoint weights (CPU is fine for a model of this size)
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")

# Model configuration matching the architecture described above
config = GPTConfig(
    sequence_len=512,
    vocab_size=65536,
    n_layer=10,
    n_head=10,
    n_kv_head=10,
    n_embd=640,
)
model = GPT(config)
model.load_state_dict(checkpoint)
model.eval()

# Tokenize a prompt and generate a continuation
tokenizer = get_tokenizer()
engine = Engine(model, tokenizer)
prompt_tokens = tokenizer("The capital of France is", prepend="<|bos|>")
output, _ = engine.generate_batch(prompt_tokens, max_tokens=50, temperature=0.7)
print(tokenizer.decode(output[0]))
```
## Limitations
- Small model: ~100M parameters - not suitable for complex reasoning
- Limited training: Only 700M tokens of training data
- No instruction tuning: This is a base model, not aligned for chat
- Research artifact: Trained to validate data quality hypothesis, not for production use
## Ethical Considerations
- Trained on Common Crawl data, which may contain biases
- Should not be used for critical applications without further evaluation
- May generate offensive or incorrect content
## Citation
If you use this model, please cite:
```bibtex
@software{oren2025,
  title={Oren: Quality Auditing for LLM Training Data},
  author={Amir Valizadeh},
  year={2025},
  url={https://github.com/vitalune/Oren}
}
```
## Related Models

- Model B: the quality-filtered companion to this model, trained on the top 70% of the same Common Crawl corpus (see Research Context above).
## Contact
## License

MIT License. See the LICENSE file for details.