---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- compressed-tensors
- awq
- w4a16
- quantized
- moe
base_model: PrimeIntellect/INTELLECT-3
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---
# INTELLECT-3 — **Quantized** (compressed-tensors for vLLM, GLM-4.5-Air MoE finetune)
This repository provides **quantized runtime builds** of
**PrimeIntellect/INTELLECT-3**, repackaged for **vLLM** using the **compressed-tensors** format.
> **TL;DR**
> - **Quantized** branch: **W4A16** (INT4 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
> - Same calibration recipe as our recent cards: **512** chat samples at **2048** tokens max from **`neuralmagic/LLM_compression_calibration`** (rendered with the model’s chat template).
> - Weight-only **AWQ**, **group size 128**, **symmetric** quant, `lm_head` left in higher precision, exported with `save_compressed=True`.
> - Parent is a **GLM-4.5-Air MoE** finetune; notes below cover MoE-specific considerations.
---
## Revisions & Branches
> The **`main`** branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.
- **main** — placeholder / landing page
- **W4A16** — 4-bit weights / 16-bit activations (compressed-tensors)
**Quick links**
- main: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/main
- W4A16: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/W4A16
---
## What’s inside (per revision)
- Sharded **quantized** weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with **compressed-tensors** metadata (`weight_format`, `quantization`, `quantization_config`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
- Optional: `chat_template.jinja` (inherits the finetune’s chat style)
> Exact file lists may differ between branches — see **Files and versions** for each revision.
---
## Quantization & calibration details (same script/recipe family as previous card)
**Method / flow**
- `llmcompressor` **oneshot** pipeline with an **AWQModifier** (weight-only).
**Targets / exclusions**
- Quantize **Linear** layers across the model (including MoE expert linear projections).
- **Ignore** `lm_head` (kept in higher precision).
**Weights / grouping**
- **INT4** (`num_bits=4`, `type="int"`, `symmetric=True`)
- Strategy: `"group"` with **`group_size=128`** (Marlin-friendly)
- **Activations are not quantized** (runtime **A16**: BF16/FP16)
**Calibration dataset & preprocessing**
- Dataset: **`neuralmagic/LLM_compression_calibration`**, split **`train`**
- **NUM_CALIBRATION_SAMPLES = 512** (random subset with fixed seed)
- **MAX_SEQUENCE_LENGTH = 2048**
- Each sample’s `messages` list is rendered via the model tokenizer’s
`apply_chat_template(..., tokenize=False)`, then tokenized with:
- `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`
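As a rough illustration, the preprocessing above corresponds to something like the following (a sketch only, not the published release script; the seed value is an assumption):

```python
# Sketch of the calibration preprocessing described above.
# The datasets/transformers calls are standard; the seed value and exact
# script structure are assumptions, not the published release script.
from datasets import load_dataset
from transformers import AutoTokenizer

MODEL_ID = "PrimeIntellect/INTELLECT-3"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Random fixed-seed subset of the calibration split (seed value assumed).
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    # Render the chat `messages` with the model's chat template, then tokenize.
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)
```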
**Compression call**
- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset
**Export for vLLM**
- Saved with **`save_compressed=True`** so **vLLM** reads the **compressed-tensors** runtime layout directly
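Put together, the compression and export step looks roughly like this (a hedged sketch continuing from the preprocessing snippet above and following the public `llmcompressor` AWQ examples; import paths and argument spellings may vary across `llmcompressor` versions):

```python
# Hedged sketch of the compression/export step, continuing from the
# preprocessing sketch above (MODEL_ID, tokenizer, ds). It follows the public
# llmcompressor AWQ examples, not the exact release script.
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")

# W4A16 preset: INT4 weights, symmetric, group size 128; lm_head is ignored
# so it stays in higher precision. Activations are left unquantized (A16).
recipe = AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,  # preprocessed calibration samples from the sketch above
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    tokenizer=tokenizer,
)

# Export in the compressed-tensors runtime layout that vLLM reads directly.
SAVE_DIR = "INTELLECT-3_W4A16-compressed-tensors"  # output path is illustrative
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```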
---
## GLM-4.5-Air MoE notes
- **Mixture-of-Experts (MoE)** means most transformer blocks host multiple expert FFNs with a router/gating network that activates a subset per token.
- **Quantization impact:** AWQ weight-only quantization is applied to expert **Linear** layers as well as shared projections; the **router** (a small linear layer per MoE block) is quantized like any other Linear layer.
- **Serving tips (vLLM):**
- Ensure your vLLM build supports MoE routing for the GLM-family architecture.
- Throughput depends on **expert parallelism** + **tensor parallelism**; scale `--tensor-parallel-size` to your GPUs and mind interconnect bandwidth.
- Expert weights and per-token routing add memory pressure on top of the **KV cache**; keep `--max-model-len` aligned with your hardware.
---
## Context length
- **Calibration context:** up to **2048 tokens** per sample (as above).
- **Model context window:** inherited from **PrimeIntellect/INTELLECT-3**; quantization does **not** change RoPE/positional encodings, only the numeric representation of the weights.
---
## Quickstart — vLLM (compressed-tensors)
Install vLLM (recent version recommended):

```bash
pip install vllm
```

Serve (adjust to your hardware):

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```
Example Chat Completions request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    "messages": [
      {"role": "system", "content": "You are INTELLECT — helpful, precise, and safe."},
      {"role": "user", "content": "Outline a plan for multi-document retrieval with MoE models."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```
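The same endpoint can also be called from Python with the OpenAI client (a minimal sketch; assumes `pip install openai` and the server launched above listening on `localhost:8000`):

```python
# Minimal sketch: call the vLLM OpenAI-compatible server started above.
# Assumes the openai package is installed and the server is on localhost:8000.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are INTELLECT — helpful, precise, and safe."},
        {"role": "user", "content": "Outline a plan for multi-document retrieval with MoE models."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(response.choices[0].message.content)
```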
> **Note:** `compressed-tensors` is a **vLLM runtime** format. Loading directly with vanilla 🤗 Transformers is **not supported**.
> For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision finetune.
---
## Prompting / chat template
This package follows the **finetuned parent’s** chat conventions. If a `chat_template.jinja` is present, libraries that support `apply_chat_template` will automatically format messages.
Guidelines:
- Keep the **system** message concise (behavior, tone, safety constraints).
- Provide clear **user** instructions; for multi-step tasks, list steps explicitly.
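For example, a message list can be rendered with the packaged template like this (a minimal sketch; assumes the tokenizer artifacts on the **W4A16** branch):

```python
# Minimal sketch: format messages with the packaged chat template.
# Assumes the tokenizer artifacts on the W4A16 branch of this repo.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors", revision="W4A16"
)
messages = [
    {"role": "system", "content": "You are INTELLECT — helpful, precise, and safe."},
    {"role": "user", "content": "List three steps for summarizing a long report."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```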
---
## Intended use & safety
This quantization:
- Is **not intended** to change underlying behavior or content tendencies (small numerical deviations from 4-bit quantization are possible).
- **Only** changes weight storage for efficient inference.
Apply appropriate **content filters / policies** for your deployment context.
---
## Lineage
- **Finetuned parent:** https://huggingface.co/PrimeIntellect/INTELLECT-3
- **This repo:** **Quantized child** of the finetune (**compressed-tensors** for vLLM)
---
## Hardware tips
- 100B+-class MoE models benefit from **multi-GPU** tensor parallel; interconnect bandwidth matters (NVLink/IB).
- Long contexts are **KV-cache** heavy — tune `--max-model-len` and batch size.
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
- Consider CUDA Graphs if stable in your environment.
---
## Changelog
- **v1 (current)** — Initial **compressed-tensors W4A16** quantization with **512-sample / 2048-token** AWQ calibration; vLLM-ready packaging.