Update README.md
README.md
CHANGED
@@ -1,80 +1,160 @@

# MedAssist-GPT

**NOT for clinical use.**

* **Tokenizer:** `tiktoken` **p50k_base** (vocab ≈ 50,281).
* **Context:** 1,024 tokens (default).
* **Size (default config):** ~125M params (d_model=512, n_heads=16, layers=16, d_ff=2048).
* **Trained on** about 2.2B tokens of pure medical data.

Do **NOT** use for medical decisions.

* `config.json`, `tokenizer_config.json`

Apache-2.0

---
license: apache-2.0
datasets:
- japhba/pubmed_simple
language:
- en
tags:
- v2_pretrain_medassist
- gqa
- rope
- swiglu
- rmsnorm
- medical
---

# 🧠 MedAssist-GPT-401M

**Mid-sized medical-domain LLM pretraining project.**
⚠️ *Strictly for research. Not for clinical or diagnostic use.*

---
## 🧩 TL;DR

* **Architecture:** Transformer with **RoPE**, **GQA**, **SwiGLU** MLP, and **RMSNorm** (see the config sketch below)
* **Tokenizer:** `tiktoken` `p50k_base` (vocab ≈ **50,281**)
* **Context length:** 1,024 tokens
* **Parameters:** ≈ **401M** (`d_model=1024`, `n_heads=32`, `blocks=24`, `d_ff=2048`)
* **GQA groups:** 8 → 4 KV heads per 32 query heads
* **Dropout:** 0.0 (pretraining)
* **Precision:** **bf16** mixed precision
* **Training objective:** Next-token prediction
* **Effective batch:** 32 × 4 = 128
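The exact `MODEL_CONFIG` lives in the training script; the dictionary below is only an illustrative sketch of how the numbers above fit together, with hypothetical key names.

```python
# Hypothetical sketch of a config matching the numbers above; the real
# MODEL_CONFIG in the training script may use different key names.
MODEL_CONFIG_SKETCH = {
    "vocab_size": 50_281,      # tiktoken p50k_base
    "context_length": 1024,
    "d_model": 1024,
    "n_layers": 24,            # "blocks" in the list above
    "n_heads": 32,             # query heads
    "n_kv_heads": 4,           # GQA: 32 query heads / group size 8
    "d_ff": 2048,              # SwiGLU MLP width
    "dropout": 0.0,
    "norm": "rmsnorm",
    "pos_emb": "rope",
}
```

With a GQA group size of 8, each of the 4 KV heads is shared by 8 of the 32 query heads.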
---

## 📊 Data

| Field | Value |
| --- | --- |
| **Dataset** | `japhba/pubmed_simple` |
| **Text column** | `abstract` |
| **Train/Val split** | 95 / 5 |
| **Samples used** | 100k abstracts |
| **Seq length / stride** | 1,024 / 1,024 |
| **Cleaning** | `use_clean=False` (raw abstracts) |
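The preprocessing itself is done by the training script; roughly, the table corresponds to a pipeline like the sketch below (function names, the split seed, and the chunking helper are illustrative, not the repo's actual code).

```python
# Illustrative sketch of the data pipeline in the table above; the real
# preprocessing lives in the training script and may differ in details.
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("p50k_base")

ds = load_dataset("japhba/pubmed_simple", split="train")
ds = ds.select(range(100_000))                               # 100k abstracts
splits = ds.train_test_split(test_size=0.05, seed=7979797)   # 95 / 5 (seed illustrative)

def chunk(texts, seq_len=1024, stride=1024):
    """Concatenate raw abstracts and cut them into fixed 1,024-token blocks."""
    ids = []
    for t in texts:
        ids.extend(enc.encode(t))
    return [ids[i:i + seq_len] for i in range(0, len(ids) - seq_len + 1, stride)]

train_blocks = chunk(splits["train"]["abstract"])
val_blocks = chunk(splits["test"]["abstract"])
```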
---

## ⚙️ Training

| Item | Value |
| --- | --- |
| **Framework** | PyTorch |
| **Precision** | bf16 |
| **Objective** | Causal LM (next-token prediction) |
| **Optimizer** | AdamW (`β₁ = 0.9`, `β₂ = 0.95`, `eps = 1e-8`) |
| **Learning rate** | 3 × 10⁻⁴ (linear decay + 100-step warmup) |
| **Weight decay** | 0.1 |
| **Batch size** | 32 (× 4 grad accumulation → 128 effective) |
| **Grad clip** | 1.0 |
| **Total steps** | 100k |
| **Eval** | every 500 steps × 100 iters |
| **Checkpoint save** | every 1k steps |
| **Seed** | 7979797 |
| **Gradient checkpointing** | ✅ Enabled |
| **WandB** | `kunjcr2-dreamable/MedAssist-GPT-Pretraining` (`medassist-401M-test`) |
| **HF repo** | `kunjcr2/MedAssist-GPT-401M` |
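Put together, the optimizer rows above translate to roughly the following step. This is a sketch only: it assumes `model` returns logits and `loader` yields `(input, target)` token batches of shape `(32, 1024)`, and it omits eval, checkpointing, and WandB logging.

```python
# Sketch of the optimization setup from the table; the actual loop (eval,
# checkpointing, gradient checkpointing, WandB logging) is in the repo script.
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(
    model.parameters(), lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1
)

warmup, total_steps, grad_accum = 100, 100_000, 4

def lr_lambda(step):
    # 100-step linear warmup, then linear decay over the remaining steps
    if step < warmup:
        return step / warmup
    return max(0.0, (total_steps - step) / (total_steps - warmup))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step, (x, y) in enumerate(loader):                   # x, y: (32, 1024) token ids
    with torch.autocast("cuda", dtype=torch.bfloat16):   # bf16 mixed precision
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    (loss / grad_accum).backward()
    if (step + 1) % grad_accum == 0:                     # 32 x 4 = 128 effective batch
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad(set_to_none=True)
```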
---

## 🧮 Training Environment

| Item | Value |
| --- | --- |
| **Hardware** | 1× NVIDIA A100 (80 GB) |
| **Precision dtype** | bf16 |
| **Runtime** | ~15 hours |
| **Scheduler** | Linear LR decay |
| **Mixed precision** | Native AMP (bf16) |

---
## 📉 Loss Curves

*(Placeholder – will update post-training)*

![Training Loss](loss.png)
![Validation Loss](val_loss.png)

---

## 🚀 Minimal Inference

```python
# pip install torch tiktoken huggingface_hub safetensors
import torch, tiktoken
from safetensors.torch import load_file
from huggingface_hub import hf_hub_download
from MedAssistGPT import MedAssistGPT, MODEL_CONFIG  # defined in the repo's training script

REPO_ID = "kunjcr2/MedAssist-GPT-401M"
weights = hf_hub_download(REPO_ID, "model.safetensors")
state = load_file(weights, device="cpu")

model = MedAssistGPT(MODEL_CONFIG)
model.load_state_dict(state, strict=True)  # load_state_dict does not return the model
model.eval()

enc = tiktoken.get_encoding("p50k_base")
ids = torch.tensor([enc.encode(
    "A patient was admitted with severe headache. Initial assessment revealed"
)], dtype=torch.long)

with torch.no_grad():
    for _ in range(100):
        logits = model(ids)[:, -1, :]                                         # last-position logits
        next_id = torch.multinomial(torch.softmax(logits / 0.6, dim=-1), 1)   # temperature 0.6
        ids = torch.cat([ids, next_id], dim=1)
print(enc.decode(ids[0].tolist()))
```
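The loop above uses plain temperature sampling (T = 0.6). A small wrapper with top-k filtering, assuming the same `model(ids) -> logits` interface, could look like this (not part of the repo; `top_k=50` is an arbitrary choice):

```python
# Convenience wrapper (not part of the repo) assuming the same
# model(ids) -> logits interface; top_k=50 is an arbitrary choice.
import torch

@torch.no_grad()
def generate(model, enc, prompt, max_new_tokens=100, temperature=0.6, top_k=50):
    ids = torch.tensor([enc.encode(prompt)], dtype=torch.long)
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature
        top_vals, top_idx = torch.topk(logits, top_k)        # keep the k most likely tokens
        probs = torch.softmax(top_vals, dim=-1)
        next_id = top_idx.gather(-1, torch.multinomial(probs, 1))
        ids = torch.cat([ids, next_id], dim=1)
    return enc.decode(ids[0].tolist())
```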
---

## 💾 Checkpoints

* Main run: `medassist-401M-test`
* Checkpoint: `/checkpoints/checkpoint_step_44500.pt` (see the loading sketch below)
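The `.pt` checkpoint layout depends on how the training script saves it; a minimal loading sketch, assuming it holds either a raw `state_dict` or a dict with a `"model"` entry:

```python
# Minimal sketch for loading a .pt checkpoint; assumes it holds either a raw
# state_dict or a dict with a "model" entry; check the training script for
# the exact layout it saves.
import torch
from MedAssistGPT import MedAssistGPT, MODEL_CONFIG

ckpt = torch.load("checkpoints/checkpoint_step_44500.pt", map_location="cpu")
state = ckpt.get("model", ckpt)   # raw state_dict, or nested under "model"

model = MedAssistGPT(MODEL_CONFIG)
model.load_state_dict(state)
model.eval()
```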
---

## 🧪 Intended Use

For research and experimentation only, e.g.:

* domain-adapted pretraining,
* architecture exploration,
* fine-tuning for medical text understanding.

🚫 **Not intended for clinical or production medical use.**

---

## 🔮 Future Work

The next update will include:

* **Supervised fine-tuning (SFT)**
* **Reinforcement Learning (PPO) for alignment**

---

## 📁 Files

* `checkpoints/`
* `config.json`, `tokenizer_config.json`
* Training script / notebook defining `MedAssistGPT`

---

## 🪪 License

Apache 2.0