---
license: apache-2.0
datasets:
- japhba/pubmed_simple
language:
- en
tags:
- v2_pretrain_medassist
- gqa
- rope
- swiglu
- rmsnorm
- medical
---

# 🧠 MedAssist-GPT-401M

**Mid-sized medical-domain LLM pretraining project.**

⚠️ *Strictly for research. Not for clinical or diagnostic use.*

---

## 🧩 TL;DR

* **Architecture:** Transformer with **RoPE**, **GQA**, **SwiGLU** MLP, and **RMSNorm**
* **Tokenizer:** `tiktoken` `p50k_base` (vocab ≈ **50,281**)
* **Context length:** 1,024 tokens
* **Parameters:** ≈ **401 M** (`d_model=1024`, `n_heads=32`, `blocks=24`, `d_ff=2048`)
* **GQA:** groups of 8 query heads → 4 KV heads for the 32 query heads
* **Dropout:** 0.0 (pretraining)
* **Precision:** **bf16** mixed precision
* **Training objective:** Next-token prediction
* **Effective batch:** 32 × 4 = 128

---

## 📚 Data

| Field                   | Value                                |
| ----------------------- | ------------------------------------ |
| **Dataset**             | `japhba/pubmed_simple`               |
| **Text column**         | `abstract`                           |
| **Train/Val split**     | 95 / 5                               |
| **Samples used**        | 100 k abstracts                      |
| **Seq length / stride** | 1,024 / 1,024 (non-overlapping)      |
| **Cleaning**            | `use_clean=False` (raw abstracts)    |

---

## ⚙️ Training

| Item                       | Value                                                                 |
| -------------------------- | --------------------------------------------------------------------- |
| **Framework**              | PyTorch                                                               |
| **Precision**              | bf16                                                                  |
| **Objective**              | Causal LM (next-token prediction)                                     |
| **Optimizer**              | AdamW (`β₁ = 0.9`, `β₂ = 0.95`, `eps = 1e-8`)                         |
| **Learning rate**          | 3 × 10⁻⁴ (100-step warmup, then linear decay)                         |
| **Weight decay**           | 0.1                                                                   |
| **Batch size**             | 32 (× 4 grad acc → 128 effective)                                     |
| **Grad clip**              | 1.0                                                                   |
| **Total steps**            | 100 k                                                                 |
| **Eval**                   | every 500 steps (100 eval iterations)                                 |
| **Checkpoint save**        | every 1 k steps                                                       |
| **Seed**                   | 7979797                                                               |
| **Gradient checkpointing** | ✅ Enabled                                                             |
| **WandB**                  | `kunjcr2-dreamable/MedAssist-GPT-Pretraining` (`medassist-401M-test`) |
| **HF repo**                | `kunjcr2/MedAssist-GPT-401M`                                          |

---

## 🧮 Training Environment

| Item                | Value                  |
| ------------------- | ---------------------- |
| **Hardware**        | 1× NVIDIA A100 (80 GB) |
| **Precision dtype** | bf16                   |
| **Runtime**         | ~15 hours              |
| **Scheduler**       | Linear LR decay        |
| **Mixed precision** | Native AMP (bf16)      |

---

## 📈 Loss Curves

*(Placeholder — will update post-training)*

![train_loss](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F67c358189919777813863c48%2FbQGVqgx4GoqXZTcMh8KhM.png)
![val_loss](/static-proxy?url=https%3A%2F%2Fcdn-uploads.huggingface.co%2Fproduction%2Fuploads%2F67c358189919777813863c48%2FjhNnS_Wvhj4-fzNoO2dRN.png)

---

## 🚀 Minimal Inference

```python
# pip install torch tiktoken huggingface_hub safetensors
import torch
import tiktoken
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download

from MedAssistGPT import MedAssistGPT, MODEL_CONFIG

REPO_ID = "kunjcr2/MedAssist-GPT-401M"

# Download the safetensors checkpoint and load it on CPU
weights = hf_hub_download(REPO_ID, "model.safetensors")
state = load_file(weights, device="cpu")

model = MedAssistGPT(MODEL_CONFIG)
model.load_state_dict(state, strict=True)  # load_state_dict does not return the model
model.eval()

enc = tiktoken.get_encoding("p50k_base")
prompt = (
    "A patient was admitted with severe headache. "
    "Initial assessment revealed"
)
ids = torch.tensor([enc.encode(prompt)], dtype=torch.long)

# Sample 100 new tokens at temperature 0.6
with torch.no_grad():
    for _ in range(100):
        logits = model(ids)[:, -1, :]
        next_id = torch.multinomial(torch.softmax(logits / 0.6, dim=-1), 1)
        ids = torch.cat([ids, next_id], dim=1)

print(enc.decode(ids[0].tolist()))
```
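The snippet imports `MedAssistGPT` and `MODEL_CONFIG` from the training script listed under **Files**. For orientation, here is a minimal sketch of what that config plausibly contains, assembled from the TL;DR numbers above; the key names are illustrative assumptions, not the repo's actual schema:

```python
# Hypothetical config mirroring the TL;DR; key names are assumptions.
MODEL_CONFIG = {
    "vocab_size": 50281,     # tiktoken p50k_base
    "context_length": 1024,  # max sequence length
    "d_model": 1024,
    "n_heads": 32,           # query heads
    "n_kv_heads": 4,         # GQA: every 8 query heads share one KV head
    "n_blocks": 24,
    "d_ff": 2048,            # SwiGLU MLP hidden width
    "dropout": 0.0,          # no dropout during pretraining
    "use_rope": True,        # rotary position embeddings
    "norm": "rmsnorm",
}
```

Note that the sampling loop above appends tokens without any windowing, so prompts plus generated tokens should stay within the 1,024-token context length.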
Initial assessment revealed" )], dtype=torch.long) for _ in range(100): logits = model(ids)[:, -1, :] next_id = torch.multinomial(torch.softmax(logits / 0.6, dim=-1), 1) ids = torch.cat([ids, next_id], dim=1) print(enc.decode(ids[0].tolist())) ``` --- ## ๐Ÿ’พ Checkpoints * Main run: `medassist-401M-test` * Checkpoint: `/checkpoints/checkpoint_step_44500.pt` --- ## ๐Ÿงช Intended Use For research and experimentation only โ€” e.g., * domain-adapted pretraining, * architecture exploration, * fine-tuning for medical text understanding. ๐Ÿšซ **Not intended for clinical or production medical use.** --- ## ๐Ÿ”ฎ Future Work Next update includes: * **Supervised fine-tuning (SFT)** * **Reinforcement Learning (PPO) for alignment** --- ## ๐Ÿ“ Files * 'checkpoints/' * `config.json`, `tokenizer_config.json` * Training script / notebook defining `MedAssistGPT` --- ## ๐Ÿชช License Apache 2.0