Differential Transformer
Paper: arXiv:2410.05258
From-scratch pretraining on English-only data: no synthetic data, no code, 3 epochs over roughly 1 GB of data for the ~135M-parameter model.
Test network using the Differential Transformer (differential attention). Aside from some alterations to the attention, such as 16 heads instead of 9 and the use of differential attention, this is the same setup as https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct.
inference.py runs the model with some test prompts. test_train.py runs with the exact configuration used to train this model and serves as the reproduction script. Training data is assumed to be in JSONL format, one {"text": "..."} object per line (see the sketch below).

The model appears to be very competent: it learned significantly faster than the GQA control and achieved a slightly better minimum loss. The runtime at this scale is about on par with the GQA/MHA control.
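As a rough illustration of the expected data layout (the filename below is an assumption, not taken from the repository scripts), each line of the JSONL file is a single JSON object with one "text" field:

```python
# Hedged illustration of the expected JSONL layout: one JSON object per line,
# each carrying a single "text" field. "data.jsonl" is a made-up filename.
import json

examples = [{"text": "example text"}, {"text": "another training document"}]

with open("data.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

with open("data.jsonl", "r", encoding="utf-8") as f:
    docs = [json.loads(line)["text"] for line in f]
print(docs)
```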