nanochat-d10-raw-700m

d10 model trained on 700M tokens of raw Common Crawl data

Model Description

This is a d10 model (~100M parameters) trained as part of a research project investigating the impact of training data quality on LLM performance.

  • Architecture: 10-layer transformer with 640 hidden dimensions
  • Training framework: nanochat
  • Base tokenizer: BPE with 65K vocab
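
For a rough sense of where the parameter count comes from, the dimensions above can be turned into a back-of-the-envelope estimate. This is a sketch only: it assumes a standard GPT block with 4x MLP expansion, and the exact total depends on nanochat's block details and on whether the embedding and output head share weights.

# Back-of-the-envelope parameter estimate from the listed dimensions
n_layer, n_embd, vocab_size = 10, 640, 65536

embedding = vocab_size * n_embd           # token embedding table: ~42M
per_block = 12 * n_embd * n_embd          # attention (4x n_embd^2) + 4x-expansion MLP (8x n_embd^2)
blocks = n_layer * per_block              # ~49M across 10 layers

print(f"tied head:   ~{(embedding + blocks) / 1e6:.0f}M")      # output head shares the embedding
print(f"untied head: ~{(2 * embedding + blocks) / 1e6:.0f}M")  # separate output projection

The exact count for this checkpoint can be read off the loaded model, as shown at the end of the Usage section.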

Training Details

Dataset

  • Source: Common Crawl (unfiltered)
  • Size: 700M tokens
  • Quality filtering: None - raw data

Hyperparameters

  • Iterations: 14,000
  • Batch size: 32
  • Sequence length: 512
  • Learning rate: 6e-4
  • Training time: 3.8 hours
  • Hardware: 2× NVIDIA RTX 6000 Ada

Training Results

  • Final training loss: 4.3831
  • Average throughput: 74,111 tokens/sec

Research Context

This model is part of Phase 2 of the Oren project, which tests the hypothesis:

"Quality-filtered training data enables smaller, more efficient models with comparable performance."

Experiment Setup

I trained two models with identical architecture and hyperparameters:

  • Model A (this model): Trained on raw data
  • Model B (companion model): Trained on quality-filtered data (top 70%; a filtering sketch follows below)
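
The filtering pipeline behind Model B is not reproduced here, but the idea of keeping the top 70% of documents by quality score can be sketched in a few lines. This is a minimal illustration: the per-document scores are assumed to come from whatever scoring model or heuristic the pipeline uses, and load_raw_docs / score_docs are hypothetical placeholders, not Oren or nanochat APIs.

import numpy as np

def keep_top_fraction(documents, scores, fraction=0.70):
    """Keep the highest-scoring `fraction` of documents by quality score."""
    scores = np.asarray(scores)
    cutoff = np.quantile(scores, 1.0 - fraction)  # 30th percentile -> keep the top 70%
    return [doc for doc, s in zip(documents, scores) if s >= cutoff]

# Hypothetical usage:
# docs = load_raw_docs()        # raw Common Crawl documents (placeholder)
# scores = score_docs(docs)     # one quality score per document (placeholder)
# filtered = keep_top_fraction(docs, scores, fraction=0.70)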

Key Finding: Model B reached a similar final training loss (4.44 vs. 4.38) with 29% less training data and 29% less training time.
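
For intuition, these cross-entropy losses convert to perplexity via exp(loss), so the 0.06 gap in loss corresponds to roughly a 6% difference in perplexity:

import math

# Perplexity = exp(cross-entropy loss)
loss_raw, loss_filtered = 4.38, 4.44
print(f"raw: {math.exp(loss_raw):.1f}, filtered: {math.exp(loss_filtered):.1f}")
# raw: ~79.8, filtered: ~84.8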

Usage

import torch
from nanochat.gpt import GPT, GPTConfig
from nanochat.tokenizer import get_tokenizer
from nanochat.engine import Engine

# Load model
checkpoint = torch.load("pytorch_model.bin", map_location="cpu")
config = GPTConfig(
    sequence_len=512,
    vocab_size=65536,
    n_layer=10,
    n_head=10,
    n_kv_head=10,
    n_embd=640,
)

model = GPT(config)
model.load_state_dict(checkpoint)
model.eval()

# Generate text
tokenizer = get_tokenizer()
engine = Engine(model, tokenizer)

prompt_tokens = tokenizer("The capital of France is", prepend="<|bos|>")
output, _ = engine.generate_batch(prompt_tokens, max_tokens=50, temperature=0.7)
print(tokenizer.decode(output[0]))
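
As a quick sanity check on the model size, the exact parameter count can be read from the loaded model (plain PyTorch; this assumes the snippet above has already been run):

num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params / 1e6:.1f}M parameters")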

Limitations

  • Small model: ~100M parameters - not suitable for complex reasoning
  • Limited training: Only 700M tokens of training data
  • No instruction tuning: This is a base model, not aligned for chat
  • Research artifact: Trained to validate data quality hypothesis, not for production use

Ethical Considerations

  • Trained on Common Crawl data, which may contain biases
  • Should not be used for critical applications without further evaluation
  • May generate offensive or incorrect content

Citation

If you use this model, please cite:

@software{oren2025,
  title={Oren: Quality Auditing for LLM Training Data},
  author={Amir Valizadeh},
  year={2025},
  url={https://github.com/vitalune/Oren}
}

License

MIT License - See LICENSE file for details
