YALM-80M

YALM (Yet Another Language Model) is a family of experimental small language models developed through my ongoing exploration of language modeling and LLM architectures.

YALM-80M is the first model in this family. It is trained on a diverse corpus of English, Hindi, math, and Python code to test its capacity for multilingual and technical reasoning.

Note: There is a bug in the tokenizer that may cause errors during generation for certain inputs.

Model Overview

  • Architecture: Llama
  • Pretraining steps: 34k
  • Pretraining tokens: 36B
  • Precision: bfloat16
  • Number of Parameters: 79.7M
  • Number of Parameters (Non-Embedding): 62.9M
  • Number of Layers: 16
  • Number of Attention Heads (GQA): 8 for Q and 4 for KV
  • Context Length: 2048
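
These settings can be sanity-checked against the repository configuration. A minimal sketch, assuming the standard Hugging Face LlamaConfig field names (the field names are an assumption, not something stated in this card):

>>> from transformers import AutoConfig
>>> config = AutoConfig.from_pretrained("kp7742/YALM-80M")  # downloads config.json only
>>> config.num_hidden_layers        # expected: 16
>>> config.num_attention_heads      # expected: 8 query heads
>>> config.num_key_value_heads      # expected: 4 KV heads (GQA)
>>> config.max_position_embeddings  # expected: 2048 context length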

Usage

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> # Load the tokenizer and model weights from the Hugging Face Hub
>>> tokenizer = AutoTokenizer.from_pretrained("kp7742/YALM-80M")
>>> model = AutoModelForCausalLM.from_pretrained("kp7742/YALM-80M")
>>> # Tokenize a prompt and generate up to 100 new tokens
>>> inputs = tokenizer("Hey how are you doing?", return_tensors="pt")
>>> out = model.generate(**inputs, max_new_tokens=100)
>>> print(tokenizer.batch_decode(out))
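
By default, generate uses greedy decoding. For more varied output you can enable sampling; the values below are illustrative and not settings recommended or tuned for this model:

>>> out = model.generate(
...     **inputs,
...     max_new_tokens=100,
...     do_sample=True,   # sample from the predicted distribution instead of greedy decoding
...     temperature=0.7,  # illustrative value, not tuned for YALM-80M
...     top_p=0.9,        # nucleus sampling cutoff (also illustrative)
... )
>>> print(tokenizer.batch_decode(out, skip_special_tokens=True))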

Training

Data

This model is pre-trained on the YALM-pretrain5-60M dataset.

Hyperparameters

  • learning_rate: 0.007812
  • train_batch_size: 16
  • eval_batch_size: 16
  • distributed_type: multi-GPU DDP
  • num_devices: 8
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 512
  • total_eval_batch_size: 128
  • optimizer: AdamW with betas=(0.9, 0.95) and epsilon=1e-08
  • lr_scheduler_type: warmup_stable_decay
  • lr_scheduler_warmup_steps: 3400
  • training_steps: 34000
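
The effective train batch size follows from 16 per device × 8 GPUs × 4 gradient-accumulation steps = 512 (and 16 × 8 = 128 for eval). As a rough illustration of how the settings above would map onto Hugging Face TrainingArguments (the actual training script is not published in this card, so treat the mapping as an assumption):

>>> from transformers import TrainingArguments
>>> args = TrainingArguments(
...     output_dir="yalm-80m",                     # placeholder path, not from the card
...     learning_rate=0.007812,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     gradient_accumulation_steps=4,             # 16 * 8 GPUs * 4 = 512 effective
...     max_steps=34_000,
...     lr_scheduler_type="warmup_stable_decay",
...     warmup_steps=3_400,
...     adam_beta1=0.9,
...     adam_beta2=0.95,
...     adam_epsilon=1e-8,
...     bf16=True,                                 # matches the bfloat16 precision above
... )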

Hardware

  • GPUs: 8 x RTX 4090

Framework versions

  • Transformers 4.53.1
  • Pytorch 2.7.1+cu128
  • Datasets 3.6.0
  • Tokenizers 0.21.2

Evaluation

All evaluations are zero-shot unless stated otherwise, and I used lighteval to run them.

It achieves the following results on the test set:

  • Loss: 2.78
  • Perplexity: 16.10
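
The reported perplexity is the exponential of the test loss (perplexity = exp(cross-entropy)); the small gap comes from the loss being rounded to two decimals:

>>> import math
>>> round(math.exp(2.78), 2)  # perplexity = exp(loss)
16.12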

Base pre-trained model

Metric                YALM-80M
MMLU (cloze)          27.33
MMLU Pro              8.72
BBH (5-shot)          12.61
ARC (Average)         29.87
HellaSwag             32.16
PIQA                  62.89
SCIQ                  69.50
CommonsenseQA         28.75
Winogrande            50.59
OpenBookQA            29.60
TruthfulQA            22.78
TriviaQA              0.17
GSM8K (5-shot)        0.83

Limitations

YALM models primarily understand and generate content in English and Hindi. They can produce text on a variety of topics, but because their world knowledge is limited, the generated content may not always be factually accurate, logically consistent, or free from biases present in the training data.
