# 🧠 Haipai-nano: A Readable, Modern GPT-Style Transformer
Haipai ("high-pie") is a compact, fully readable decoder-only Transformer you can understand, train, and extend yourself. It is designed to behave like a scaled-down GPT-style model while remaining small enough to fit comfortably on a single modern GPU (or even run inference on a CPU).
The model implements the same modern components as today's large LMs: RMSNorm, SwiGLU, Rotary Position Embeddings (RoPE), causal SDPA attention, tied embeddings, and EMA-smoothed weights.
This release includes:
- ✅ EMA-averaged weights in `.safetensors`
- ✅ Tokenizer (trained on mixed Wikipedia + CNN/DailyMail)
- ✅ Minimal PyTorch model implementation
- ✅ Inference script with temperature/top-p sampling
## ⚡ Highlights
| Feature | Description |
|---|---|
| Architecture | Decoder-only Transformer (GPT-style) |
| Modern features | RMSNorm, SwiGLU, RoPE, SDPA, EMA |
| Parameter count | ~50M (depending on vocab) |
| Context length | 1,024 tokens (RoPE scaled ×2) |
| Precision | fp16 / bf16 |
| Tokenizer | BPE, 50k vocab, lowercase + whitespace |
| Export format | .safetensors (EMA-smoothed) |
| Intended use | Pretraining demos, teaching, small-scale research |
| License | Apache-2.0 |
## 🧱 Architecture Overview
HaipaiLM is a GPT-style stack of decoder blocks, each applying RMSNorm → Attention → Residual followed by RMSNorm → SwiGLU → Residual, topped off by a final RMSNorm and a tied embedding head.
It uses RoPE for positional encoding applied to both Q and K, and causal SDPA (PyTorch fused attention) for fast inference.
```
Input IDs
   ↓
Embedding
   ↓
[× N blocks]
   ├─ RMSNorm → Multi-Head Attention (RoPE on Q/K, causal SDPA) → Add Residual
   └─ RMSNorm → SwiGLU → Add Residual
   ↓
Final RMSNorm
   ↓
Tied LM Head (shared with Embedding)
```
## 🧩 Diagram
The schematic above shows how HaipaiLM processes tokens. Each block uses:
- RMSNorm 1 → RoPE → Multi-Head Attention → Residual
- RMSNorm 2 → SwiGLU → Residual
- A final RMSNorm normalizes activations before the output head
- The Embedding and LM Head share the same weights (tied)
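For concreteness, here is a minimal PyTorch sketch of one such block. The names (`Block`, `rope`), the "rotate-half" RoPE pairing, and the fused QKV projection are illustrative assumptions; the actual layout in `modeling_haipai.py` may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, d, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d))
        self.eps = eps

    def forward(self, x):
        # Scale by the reciprocal RMS of the features; no mean subtraction, unlike LayerNorm
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps) * self.weight

def rope(x, base=10000.0):
    # x: (batch, heads, seq, head_dim); rotate channel pairs by position-dependent angles
    _, _, T, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=torch.float32) / half)
    angles = torch.arange(T, device=x.device, dtype=torch.float32)[:, None] * freqs  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class Block(nn.Module):
    def __init__(self, d_model=384, n_head=6, d_ff=2048):
        super().__init__()
        self.n_head = n_head
        self.norm1, self.norm2 = RMSNorm(d_model), RMSNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        B, T, C = x.shape
        # RMSNorm -> Multi-Head Attention (RoPE on Q/K, causal SDPA) -> residual
        h = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2) for t in h)
        q, k = rope(q), rope(k)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(out.transpose(1, 2).reshape(B, T, C))
        # RMSNorm -> SwiGLU -> residual
        h = self.norm2(x)
        return x + self.w_down(F.silu(self.w_gate(h)) * self.w_up(h))
```

A quick shape check: `Block()(torch.randn(1, 16, 384))` returns a tensor of the same shape as its input.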
## ⚙️ Training Configuration
| Component | Setting |
|---|---|
| Layers | 12 |
| Hidden size (d_model) | 384 |
| Attention heads (n_head) | 6 |
| Head dimension | 64 |
| Feed-forward (SwiGLU) | 2048 |
| Norm | RMSNorm (ε=1e-6) |
| Dropout | 0.0β0.01 |
| RoPE scale | 2.0 |
| Optimizer | AdamW (β₁=0.9, β₂=0.95, ε=1e-8) |
| Weight decay | 0.1 |
| Gradient clip | 0.8 |
| LR schedule | Linear warmup → cosine decay |
| Warmup steps | 1000 |
| Max LR | 3e-3 |
| EMA decay | 0.9995 |
| Precision | AMP (bf16/fp16) |
| Objective | Causal LM (next-token prediction) |
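The optimizer and schedule rows translate roughly into the PyTorch setup below. This is a sketch rather than the actual training script (which is not part of this release): it assumes `model` is an instantiated `HaipaiLM`, `max_steps` is illustrative, and norm-based gradient clipping is an assumption.

```python
import math
import torch

def lr_lambda(step, warmup=1000, max_steps=60_000):  # max_steps is illustrative
    # Linear warmup to the max LR, then cosine decay
    if step < warmup:
        return step / max(1, warmup)
    progress = min(1.0, (step - warmup) / max(1, max_steps - warmup))
    return 0.5 * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-3,
                              betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Per step: loss.backward()
#           torch.nn.utils.clip_grad_norm_(model.parameters(), 0.8)  # assuming clipping by global norm
#           optimizer.step(); scheduler.step(); optimizer.zero_grad()
```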
## 📚 Training Recipe Summary
Dataset mix:
- 60% Wikipedia (latest English dump)
- 40% CNN/DailyMail (news articles)
- ~500M total tokens
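One way to assemble a comparable 60/40 stream with the `datasets` library is sketched below. The exact dumps and mixing procedure used for this release are not specified here, so the dataset IDs, configs, and the 3:2 document-level interleave are assumptions.

```python
from datasets import load_dataset

# Illustrative sources; the actual training dumps may differ
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
news = load_dataset("cnn_dailymail", "3.0.0", split="train", streaming=True)

def iter_texts():
    # Roughly 60/40 by documents: 3 Wikipedia articles for every 2 news articles
    wiki_it, news_it = iter(wiki), iter(news)
    try:
        while True:
            for _ in range(3):
                yield next(wiki_it)["text"]
            for _ in range(2):
                yield next(news_it)["article"]
    except StopIteration:
        return
```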
Tokenizer:
- Trained BPE (50k vocab)
- Lowercased, whitespace pre-tokenized
- Saved in the `/tokenizer/` folder
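A tokenizer with these properties can be reproduced with the `tokenizers` library roughly as follows; the special-token names and the `iter_texts()` corpus stream (from the sketch above, or any text iterator) are assumptions.

```python
from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.normalizer = normalizers.Lowercase()          # lowercased
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()   # whitespace pre-tokenized
trainer = trainers.BpeTrainer(vocab_size=50_000, special_tokens=["<unk>", "<pad>", "<eos>"])
tokenizer.train_from_iterator(iter_texts(), trainer=trainer)
tokenizer.save("tokenizer/tokenizer.json")
```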
Validation perplexity:
- ~60 after early stage (1k–2k steps)
- ~25–30 after full 500M-token run
- EMA weights yield smoother generations
Note: This release includes both the EMA-smoothed weights (`model.safetensors`) for inference and the raw checkpoint (`haipai_step7500.pt`) for continued training or fine-tuning. Do not resume training from the EMA file, since it lacks optimizer state.
## 🧪 Local Inference
```python
import torch, os, json
from safetensors.torch import load_file
from transformers import AutoTokenizer
from modeling_haipai import HaipaiLM

repo_dir = "./model_safe"

# 1. Load config + tokenizer
with open(os.path.join(repo_dir, "config.json")) as f:
    cfg = json.load(f)
tok = AutoTokenizer.from_pretrained(os.path.join(repo_dir, "tokenizer"), local_files_only=True)
if tok.pad_token is None and tok.eos_token is not None:
    tok.pad_token = tok.eos_token

# 2. Load model (EMA-smoothed weights)
state = load_file(os.path.join(repo_dir, "model.safetensors"))
model = HaipaiLM(
    vocab_size=tok.vocab_size,
    d_model=cfg["d_model"], n_layer=cfg["n_layer"], n_head=cfg["n_head"],
    d_ff=cfg["d_ff"], max_seq_len=cfg["max_position_embeddings"],
    rope_scale=cfg.get("rope_scale", 2.0),
).to("cpu").eval()
model.load_state_dict(state, strict=False)

# 3. Forward pass: logits for next-token prediction
prompt = "Doctor told"
ids = tok(prompt, return_tensors="pt").input_ids
with torch.no_grad():
    logits, _ = model(ids)
```
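To turn the returned logits into text, you can pick the most likely next token greedily (assuming logits are shaped `[batch, seq_len, vocab]`); the bundled `inference.py` does temperature/top-p sampling instead.

```python
next_id = logits[0, -1].argmax().item()   # greedy pick from the last position
print(prompt + tok.decode([next_id]))
```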
## 💻 CLI Inference
```bash
python inference.py --local_dir "./model_safe" --prompt "Doctor told" --device cpu
```
Options:
```text
--local_dir            Path to model + tokenizer
--prompt               Input prompt
--device               cuda / cpu
--max_new_tokens       (default 120)
--temperature          (default 0.7)
--top_p                (default 0.9)
--repetition_penalty   (default 1.15)
```
## 🔥 Sampling Defaults
```python
temperature = 0.7
top_p = 0.9
repetition_penalty = 1.15
max_new_tokens = 150
```
You can also experiment with:
- Lower temperature (0.6) for more focused, deterministic output
- Higher top_p (0.95) for more variety
- Larger context windows for longer text generation
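The sketch below shows how these three knobs interact in a sampling loop. It assumes the model's forward pass returns `(logits, loss)` as in the snippet above; it mirrors, but is not guaranteed to match, the bundled `inference.py`.

```python
import torch

@torch.no_grad()
def sample(model, tok, prompt, max_new_tokens=150,
           temperature=0.7, top_p=0.9, repetition_penalty=1.15):
    ids = tok(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits, _ = model(ids)
        logits = logits[0, -1].float()
        # Repetition penalty: damp logits of tokens already present in the context
        for t in set(ids[0].tolist()):
            logits[t] = logits[t] / repetition_penalty if logits[t] > 0 else logits[t] * repetition_penalty
        probs = torch.softmax(logits / temperature, dim=-1)
        # Top-p (nucleus): keep the smallest prefix of sorted tokens covering probability mass top_p
        sorted_p, sorted_idx = probs.sort(descending=True)
        keep = (sorted_p.cumsum(0) - sorted_p) < top_p
        probs = torch.zeros_like(probs).scatter_(0, sorted_idx[keep], sorted_p[keep])
        next_id = torch.multinomial(probs / probs.sum(), num_samples=1)
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    return tok.decode(ids[0], skip_special_tokens=True)

# print(sample(model, tok, "Doctor told"))
```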
## 📉 Evaluate Perplexity
```python
import math, torch

@torch.no_grad()
def ppl_on_texts(model, tok, texts):
    losses = []
    for text in texts:
        enc = tok(text, return_tensors="pt")
        ids = enc.input_ids
        # Shift labels left so position t is scored against token t+1;
        # the final position has no target and is masked with -100
        labels = ids.clone()
        labels[:, :-1] = ids[:, 1:]
        labels[:, -1] = -100
        _, loss = model(ids, labels)
        losses.append(loss.item())
    # Perplexity = exp(mean cross-entropy loss)
    return math.exp(sum(losses) / len(losses))
```
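For example (the sample sentences are arbitrary):

```python
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Paris is the capital of France.",
]
print(f"perplexity: {ppl_on_texts(model, tok, texts):.2f}")
```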
## 🧩 Design Notes
- **RMSNorm**: a lightweight alternative to LayerNorm, numerically stable for small models.
- **SwiGLU**: gated nonlinearity SiLU(W₁x) ⊙ W₂x followed by an output projection W₃, for better expressiveness.
- **RoPE**: encodes relative positions via sinusoidal rotation of Q/K.
- **Causal SDPA**: PyTorch's built-in scaled dot-product attention (FlashAttention kernels on supported GPUs).
- **EMA**: maintains a running exponential average of the weights during training, improving validation loss and sample smoothness.
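The EMA update itself is only a couple of lines. A minimal sketch (the helper name is hypothetical, `model` is an instantiated `HaipaiLM`, and the training script is not part of this release):

```python
import copy
import torch

ema_model = copy.deepcopy(model)  # initialize the shadow copy from the current weights

@torch.no_grad()
def update_ema(ema_model, model, decay=0.9995):
    # ema_w <- decay * ema_w + (1 - decay) * w, applied after every optimizer step
    for ema_p, p in zip(ema_model.parameters(), model.parameters()):
        ema_p.mul_(decay).add_(p, alpha=1.0 - decay)
```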
## ⚠️ Limitations
- ~50M parameters: intended for research, not production deployment.
- Not instruction-tuned; won't follow chat-style prompts.
- Quality depends on corpus balance (Wikipedia + news).
- May produce factual or stylistic inconsistencies. Use responsibly with human review.
## 🔧 Extending Haipai
You can scale the model easily:
- 100M: double `d_model` and `d_ff`
- 1B+: use Mixture-of-Experts (MoE) for efficient scaling
- Replace `HaipaiBlock` with sparse MoE or gated variants
- Try longer context windows by rebuilding the RoPE cache
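As a starting point for a wider variant, the constructor from the Local Inference snippet can simply be called with larger dimensions; the values below are illustrative, and the resulting parameter count depends on the vocabulary size.

```python
# Hypothetical wider configuration: roughly double the width, keep the depth
model_wide = HaipaiLM(
    vocab_size=tok.vocab_size,
    d_model=768, n_layer=12, n_head=12,   # head_dim stays 64
    d_ff=4096, max_seq_len=1024,
    rope_scale=2.0,
)
```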
## 📖 Citation
```bibtex
@software{haipai2025,
  title  = {Haipai: A Minimal, Modern GPT-Style Language Model (~50M, EMA)},
  author = {Md Rakibul Islam Rocky},
  year   = {2025},
  url    = {https://huggingface.co/rocky1410/haipai-nano}
}
```
## ⚖️ License
This repository is released for research and educational use under the Apache 2.0 License.