MTLM1-200M (M1 Series) 🧠

Model Architecture: Custom Llama-style Transformer (Progressive Growth)

Parameters: ~200M

Tokens Trained: 3.5 billion (7 billion total)

Author: Madras1 (Gabriel)

License: MIT

GPU: A100 40GB (12 hours), ~$8-15

📖 Model Description

The MTLM1-200M is a compact but highly efficient language model built from scratch using a custom PyTorch implementation. It follows the modern Llama architecture principles, optimized for research and educational purposes.

This model demonstrates a significant performance leap compared to its predecessor (the 88M parameter version), validating the efficiency of well-executed layer stacking in this specific compute regime. It serves as a proof-of-concept for scalable training strategies on limited hardware.
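
The exact configuration is defined by the repository's custom modeling code. For orientation only, a Llama-style model in this size class can be sketched with Hugging Face's `LlamaConfig`; every dimension below is an illustrative assumption, not the published shape:

```python
# Illustrative-only shape for a ~200M Llama-style model; MTLM1-200M uses its
# own custom modeling code, and all dimensions here are assumptions.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=32_000,          # assumed tokenizer size
    hidden_size=768,
    intermediate_size=2304,     # SwiGLU MLP width
    num_hidden_layers=24,       # assumed; consistent with depth-doubling a 12-layer base
    num_attention_heads=12,
    num_key_value_heads=12,
    rms_norm_eps=1e-5,
    rope_theta=10_000.0,
    tie_word_embeddings=True,
)
model = LlamaForCausalLM(config)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")
```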

βš™οΈ Training Methodology (The "Stacking" Strategy)

The training process employed a dynamic, parameter-efficient method to maximize resource usage:

  1. Phase 1 (Base Learning): Training started with a smaller base model (~88M-100M parameters), allowing for rapid convergence on core linguistic patterns.
  2. Phase 2 (Layer Stacking): Using a custom expansion technique, the layers were duplicated and stacked to effectively double the model depth (a code sketch follows this list).
  3. Phase 3 (Refinement): The expanded 200M model continued training for 1 epoch over 3.5 billion tokens, stabilizing the new weights and integrating the "M1 Blend" knowledge.
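
The expansion code itself is not published; below is a minimal sketch of how such a depth-doubling step can look in PyTorch, assuming a Llama-style model whose decoder blocks sit in an `nn.ModuleList` called `layers`. The function name and the interleaved duplication order are assumptions:

```python
# Minimal sketch of the Phase 2 depth-doubling step; the real expansion code
# is not published, and the interleaving order shown here is an assumption.
import copy

import torch.nn as nn


def stack_layers(model: nn.Module) -> nn.Module:
    """Duplicate every decoder block so the model depth doubles.

    Each copy starts from the trained weights of its original block, so the
    expanded model stays close to the Phase 1 optimum and Phase 3 only has
    to stabilize the new parameters.
    """
    doubled = []
    for block in model.layers:                 # assumes blocks live in an nn.ModuleList
        doubled.append(block)                  # original trained block
        doubled.append(copy.deepcopy(block))   # stacked duplicate
    model.layers = nn.ModuleList(doubled)
    if hasattr(model, "config"):               # keep config in sync if one exists
        model.config.num_hidden_layers = len(doubled)
    return model
```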

📚 Training Data (The "M1 Blend")

The dataset was curated to prioritize reasoning (a blending sketch follows the list):

  • Synthetic & Textbook Quality: Subsets from Cosmopedia and FineWeb-Edu.
  • Web-Scale Foundation: Filtered portions of FineWeb.
  • Custom Knowledge Base: A collection of scraped Wikipedia articles, technical documents, and verified texts.
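
The exact proportions and the custom scrape are not published. The sketch below shows one way such a blend can be assembled with a recent version of the `datasets` library in streaming mode; the subset names and mixing probabilities are illustrative assumptions, not the actual "M1 Blend" recipe:

```python
# One way to assemble a reasoning-heavy blend; subset choices and mixing
# probabilities are illustrative assumptions, not the actual "M1 Blend".
from datasets import load_dataset, interleave_datasets

cosmopedia  = load_dataset("HuggingFaceTB/cosmopedia", "web_samples_v2",
                           split="train", streaming=True).select_columns(["text"])
fineweb_edu = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                           split="train", streaming=True).select_columns(["text"])
fineweb     = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT",
                           split="train", streaming=True).select_columns(["text"])

blend = interleave_datasets(
    [cosmopedia, fineweb_edu, fineweb],
    probabilities=[0.3, 0.4, 0.3],   # assumed weights
    seed=42,
    stopping_strategy="all_exhausted",
)

for example in blend.take(3):        # quick sanity check
    print(example["text"][:200])
```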

πŸ› οΈ Technical Specifications

  • Architecture: Llama-style (RMSNorm, SwiGLU, RoPE).
  • Attention: Flash Attention 2 (BF16 support).
  • Optimizer: AdamW with a cosine learning-rate schedule (setup sketched after this list).
  • Precision: Mixed Precision (BF16/AMP).
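
The actual hyperparameters are not published. The following is a minimal sketch of an AdamW + cosine-schedule + BF16 autocast loop of the kind the list above describes; `model` and `dataloader` are assumed to exist (with the model returning a Hugging Face-style output exposing `.loss`), and the learning rate, warmup, and step counts are illustrative:

```python
# Minimal sketch of the AdamW + cosine-schedule + BF16 setup listed above.
# Hyperparameters are illustrative; `model` and `dataloader` are assumed.
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR


def cosine_with_warmup(optimizer, warmup_steps, total_steps, min_ratio=0.1):
    """Linear warmup followed by cosine decay to `min_ratio` of the peak LR."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return min_ratio + (1 - min_ratio) * 0.5 * (1 + math.cos(math.pi * progress))
    return LambdaLR(optimizer, lr_lambda)


optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.1, betas=(0.9, 0.95))
scheduler = cosine_with_warmup(optimizer, warmup_steps=1_000, total_steps=50_000)

for batch in dataloader:
    optimizer.zero_grad(set_to_none=True)
    # BF16 autocast: forward/backward run in bfloat16 while master weights stay
    # FP32, so no GradScaler is needed (unlike FP16 AMP).
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(**batch).loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```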

📊 Evaluation Results (Benchmarks)

Performance on standard zero-shot/few-shot benchmarks (log-probability scoring) highlights the effectiveness of the stacking strategy compared to the previous 88M iteration:

| Benchmark | Metric | Score (%) |
|---|---|---|
| Winogrande | Accuracy | 50.00 |
| COPA | Accuracy | 49.00 |
| BoolQ | Accuracy | 44.25 |
| Winograd | Accuracy | 43.27 |
| TruthfulQA (MC2) | Accuracy | 41.42 |
| ARC Easy | Accuracy | 38.64 |
| OpenBookQA | Accuracy | 34.20 |
| HellaSwag | Accuracy | 27.91 |
| Aqua-RAT | Accuracy | 26.38 |
| TruthfulQA (MC1) | Accuracy | 24.60 |
| ARC Challenge | Accuracy | 23.55 |
| CommonSense QA | Accuracy | 20.56 |
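
The card does not name the evaluation tool. Zero-shot scores like these are commonly reproduced with EleutherAI's lm-evaluation-harness; a minimal sketch using its Python API (v0.4+), where the task list, dtype, and batch size are assumptions rather than the card's actual setup:

```python
# Hypothetical reproduction sketch with EleutherAI's lm-evaluation-harness;
# task selection and batch size are assumptions, not the card's actual setup.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Madras1/MTLM1-200M,trust_remote_code=True,dtype=bfloat16",
    tasks=["winogrande", "arc_easy", "arc_challenge", "openbookqa",
           "hellaswag", "boolq", "truthfulqa_mc2"],
    batch_size=16,
)

for task, metrics in results["results"].items():
    print(task, metrics.get("acc,none", metrics))
```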

🆚 Comparative Analysis

When compared to other models in the "Micro-LLM" class (<500M params) and even larger baselines, MTLM1-200M demonstrates competitive performance in knowledge retrieval and academic reasoning tasks, outperforming the classic GPT-2 and OPT-125M on key benchmarks.

| Benchmark | Task | MTLM1-200M (Ours) | OPT-125M | GPT-2 (124M) | TinyLlama 1.1B |
|---|---|---|---|---|---|
| OpenBookQA | Knowledge Retrieval | 34.20% | ~30% | ~32% | 32.20% |
| ARC-Easy | Science (Elementary) | 39.64% | ~35.0% | 33.5% | 49.62% |
| ARC-Challenge | Reasoning | 23.55% | 22.87% | 21.8% | 28.67% |
| TruthfulQA | Factuality (MC2) | 41.42% | 42.87% | ~40.8% | 39.15% |
| HellaSwag | Common Sense | 27.91% | 31.47% | 29.4% | 53.81% |

🚀 Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Madras1/MTLM1-200M"

tokenizer = AutoTokenizer.from_pretrained(model_id)
# trust_remote_code=True is required for custom modeling
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

inputs = tokenizer("Machine Learning is ", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
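
Greedy decoding can get repetitive with a small base model; sampling usually produces more natural continuations. The settings below are illustrative defaults, not recommendations from the model author:

```python
# Sampled generation (values are illustrative defaults, not tuned settings)
outputs = model.generate(
    **inputs,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.8,
    top_p=0.95,
    repetition_penalty=1.1,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```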

⚠️ Limitations

Size: With 200M parameters, complex reasoning is limited compared to larger models.

Knowledge Cutoff: Dependent on the training dataset used.

Hallucinations: Like all LLMs, it may generate incorrect information.

Scope: This is a base model intended for research, educational purposes, and edge-device experimentation. It has not undergone supervised fine-tuning (SFT) or Reinforcement Learning from Human Feedback (RLHF) for safety alignment.
