Added W8A16 to model card.
README.md CHANGED
@@ -23,7 +23,7 @@ This repository provides **quantized runtime builds** of
 **PrimeIntellect/INTELLECT-3**, repackaged for **vLLM** using the **compressed-tensors** format.
 
 > **TL;DR**
-> - **Quantized** branch: **W4A16** (INT4 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
+> - **Quantized** branches: **W4A16** (INT4 weights / A16 activations) and **W8A16** (INT8 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
 > - Same calibration recipe as our recent cards: **512** chat samples at **2048** tokens max from **`neuralmagic/LLM_compression_calibration`** (rendered with the model’s chat template).
 > - Weight-only **AWQ**, **group size 128**, **symmetric** quant, `lm_head` left in higher precision, exported with `save_compressed=True`.
 > - Parent is a **GLM-4.5-Air MoE** finetune; notes below cover MoE-specific considerations.
@@ -36,11 +36,13 @@ This repository provides **quantized runtime builds** of
 
 - **main** — placeholder / landing page
 - **W4A16** — 4-bit weights / 16-bit activations (compressed-tensors)
+- **W8A16** — 8-bit weights / 16-bit activations (compressed-tensors)
 
 **Quick links**
 
 - main: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/main
 - W4A16: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/W4A16
+- W8A16: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/W8A16
 
 ---
 
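The TL;DR above says both branches are meant to be consumed by vLLM via `--quantization compressed-tensors`. As an illustration only, here is a minimal sketch using vLLM's offline Python API: the repository and branch names come from the quick links in the diff, while `tensor_parallel_size` and the sampling settings are illustrative assumptions rather than values from the card.

```python
# Minimal sketch (not from the card): loading the W8A16 branch with vLLM's offline API.
# Branch selection uses `revision`; the quantization argument mirrors the CLI flag
# `--quantization compressed-tensors` named in the TL;DR.
from vllm import LLM, SamplingParams

llm = LLM(
    model="TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    revision="W8A16",                     # or "W4A16"
    quantization="compressed-tensors",
    tensor_parallel_size=4,               # illustrative; size to your hardware (the MoE parent is large)
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain weight-only INT8 quantization in two sentences."], params)
print(outputs[0].outputs[0].text)
```

The server-side equivalent should be roughly `vllm serve TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors --revision W8A16 --quantization compressed-tensors`, matching the flag named in the TL;DR.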
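The TL;DR also summarizes the calibration and quantization recipe: weight-only AWQ, group size 128, symmetric, `lm_head` left in higher precision, 512 chat samples at 2048 tokens from `neuralmagic/LLM_compression_calibration`, exported with `save_compressed=True`. The sketch below shows roughly how such a recipe can be expressed with `llm-compressor`; it is a hedged reconstruction modeled on that library's published AWQ examples, not the script used to produce these branches, and the dataset column name and modifier arguments in particular are assumptions.

```python
# Hedged sketch of the described W4A16 recipe with llm-compressor. Treat argument
# names and dataset handling as assumptions; only the numeric settings (512 samples,
# 2048 tokens), the dataset name, and the lm_head exclusion come from the card.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "PrimeIntellect/INTELLECT-3"
NUM_SAMPLES = 512     # per the card
MAX_SEQ_LEN = 2048    # per the card

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration set named on the card; the "messages" column and the chat-template
# rendering below are assumptions about its layout.
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda s: tokenizer(s["text"], max_length=MAX_SEQ_LEN, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# Weight-only AWQ, lm_head kept in higher precision. Assumption: the "W4A16"
# preset scheme covers the 4-bit / group-size-128 / symmetric settings.
recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQ_LEN,
    num_calibration_samples=NUM_SAMPLES,
)

model.save_pretrained("INTELLECT-3_W4A16", save_compressed=True)
tokenizer.save_pretrained("INTELLECT-3_W4A16")
```

For the W8A16 branch the analogous scheme name would presumably be `W8A16`; per the MoE note in the TL;DR, router/gate modules are also commonly added to `ignore`, though the card does not state this.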