---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
  - text-generation
  - conversational
  - compressed-tensors
  - awq
  - w4a16
  - quantized
  - moe
base_model: PrimeIntellect/INTELLECT-3
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# INTELLECT-3 — **Quantized** (compressed-tensors for vLLM, GLM-4.5-Air MoE finetune)

This repository provides **quantized runtime builds** of  
**PrimeIntellect/INTELLECT-3**, repackaged for **vLLM** using the **compressed-tensors** format.

> **TL;DR**
> - **Quantized** branch: **W4A16** (INT4 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
> - Same calibration recipe as our recent cards: **512** chat samples at **2048** tokens max from **`neuralmagic/LLM_compression_calibration`** (rendered with the model’s chat template).
> - Weight-only **AWQ**, **group size 128**, **symmetric** quant, `lm_head` left in higher precision, exported with `save_compressed=True`.
> - Parent is a **GLM-4.5-Air MoE** finetune; notes below cover MoE-specific considerations.

---

## Revisions & Branches

> The **`main`** branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

- **main** — placeholder / landing page  
- **W4A16** — 4-bit weights / 16-bit activations (compressed-tensors)

**Quick links**

- main: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/main  
- W4A16: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/W4A16

---

## What’s inside (per revision)

- Sharded **quantized** weights (`*.safetensors`) + index (`model.safetensors.index.json`)  
- `config.json` with **compressed-tensors** metadata (`weight_format`, `quantization`, `quantization_config`, etc.)  
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)  
- Optional: `chat_template.jinja` (inherits the finetune’s chat style)

> Exact file lists may differ between branches — see **Files and versions** for each revision.

---

## Quantization & calibration details (same script/recipe family as our previous cards)

**Method / flow**
- `llmcompressor` **oneshot** pipeline with an **AWQModifier** (weight-only).

**Targets / exclusions**
- Quantize **Linear** layers across the model (including MoE expert linear projections).  
- **Ignore** `lm_head` (kept in higher precision).

**Weights / grouping**
- **INT4** (`num_bits=4`, `type="int"`, `symmetric=True`)  
- Strategy: `"group"` with **`group_size=128`** (Marlin-friendly)  
- **Activations are not quantized** (runtime **A16**: BF16/FP16)

**Calibration dataset & preprocessing**
- Dataset: **`neuralmagic/LLM_compression_calibration`**, split **`train`**  
- **NUM_CALIBRATION_SAMPLES = 512** (random subset with fixed seed)  
- **MAX_SEQUENCE_LENGTH = 2048**  
- Each sample’s `messages` list is rendered via the model tokenizer’s  
  `apply_chat_template(..., tokenize=False)`, then tokenized with:
  - `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`

**Compression call**
- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset

**Export for vLLM**
- Saved with **`save_compressed=True`** so **vLLM** reads the **compressed-tensors** runtime layout directly
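
For reference, the bullets above roughly correspond to the following `llmcompressor` sketch. It is a reconstruction, not the exact script behind this repo: import paths and `AWQModifier` arguments differ between `llmcompressor` releases, and the `W4A16` scheme name is assumed here to stand in for the INT4 / symmetric / group-128 settings listed above.

    # Hedged reconstruction of the AWQ W4A16 recipe described in this section.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot  # older releases: from llmcompressor.transformers import oneshot
    from llmcompressor.modifiers.awq import AWQModifier

    MODEL_ID = "PrimeIntellect/INTELLECT-3"
    NUM_CALIBRATION_SAMPLES = 512
    MAX_SEQUENCE_LENGTH = 2048

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    # Calibration set: render each sample's `messages` with the model's chat
    # template, then tokenize without padding or extra special tokens.
    ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
    ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))  # fixed seed; 42 is illustrative

    def preprocess(example):
        text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
        return tokenizer(
            text,
            max_length=MAX_SEQUENCE_LENGTH,
            truncation=True,
            padding=False,
            add_special_tokens=False,
        )

    ds = ds.map(preprocess, remove_columns=ds.column_names)

    # Weight-only AWQ: INT4, symmetric, group size 128; lm_head stays in higher precision.
    recipe = AWQModifier(targets=["Linear"], ignore=["lm_head"], scheme="W4A16")

    oneshot(
        model=model,
        dataset=ds,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
        tokenizer=tokenizer,
    )

    # Export in the compressed-tensors runtime layout that vLLM loads directly.
    model.save_pretrained("INTELLECT-3-W4A16", save_compressed=True)
    tokenizer.save_pretrained("INTELLECT-3-W4A16")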

---

## GLM-4.5-Air MoE notes

- **Mixture-of-Experts (MoE)** means most transformer blocks host multiple expert FFNs with a router/gating network that activates a subset per token.
- **Quantization impact:** AWQ weight-only quantization is applied to the expert **Linear** layers as well as the shared projections; the **router** (a small gating linear per MoE block) is quantized like any other Linear layer.
- **Serving tips (vLLM):**
  - Ensure your vLLM build supports MoE routing for the GLM-family architecture.
  - Throughput depends on **expert parallelism** + **tensor parallelism**; scale `--tensor-parallel-size` to your GPUs and mind interconnect bandwidth.
  - The experts activated for each token add some memory pressure on top of the **KV cache**; keep `--max-model-len` aligned with your hardware (see the sketch below).
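
A minimal offline-inference sketch with the vLLM Python API is shown below; it is illustrative only, and simply exposes the same knobs (`tensor_parallel_size`, `max_model_len`, memory utilization) as the `vllm serve` command in the Quickstart.

    # Illustrative offline use of the quantized weights with vLLM's Python API.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
        revision="W4A16",                  # runnable weights live on the W4A16 branch
        quantization="compressed-tensors",
        tensor_parallel_size=8,            # scale to your GPU count and interconnect
        max_model_len=2048,                # longer contexts grow the KV cache
        gpu_memory_utilization=0.70,
        dtype="bfloat16",
    )

    params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=256)
    outputs = llm.generate(["Explain mixture-of-experts routing in two sentences."], params)
    print(outputs[0].outputs[0].text)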

---

## Context length

- **Calibration context:** up to **2048 tokens** per sample (as above).  
- **Model context window:** inherited from **PrimeIntellect/INTELLECT-3**; quantization does **not** change RoPE/position encodings, only the numeric representation of the weights.

---

## Quickstart — vLLM (compressed-tensors)

Install vLLM (recent version recommended):

    pip install vllm

Serve (adjust to your hardware):

    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
    vllm serve TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors \
      --revision W4A16 \
      --quantization compressed-tensors \
      --tensor-parallel-size 8 \
      --max-model-len 2048 \
      --gpu-memory-utilization 0.70 \
      --dtype bfloat16

Example Chat Completions request:

    curl http://localhost:8000/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
        "messages": [
          {"role":"system","content":"You are INTELLECT — helpful, precise, and safe."},
          {"role":"user","content":"Outline a plan for multi-document retrieval with MoE models."}
        ],
        "max_tokens": 512,
        "temperature": 0.7,
        "top_p": 0.95
      }'
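
The same endpoint can also be called from Python with the official `openai` client (a sketch; the API key is a placeholder because vLLM does not verify it by default):

    # Python client for the OpenAI-compatible server started above.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    resp = client.chat.completions.create(
        model="TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
        messages=[
            {"role": "system", "content": "You are INTELLECT — helpful, precise, and safe."},
            {"role": "user", "content": "Outline a plan for multi-document retrieval with MoE models."},
        ],
        max_tokens=512,
        temperature=0.7,
        top_p=0.95,
    )
    print(resp.choices[0].message.content)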

> **Note:** `compressed-tensors` is a **vLLM runtime** format. Loading directly with vanilla 🤗 Transformers is **not supported**.  
> For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision finetune.

---

## Prompting / chat template

This package follows the **finetuned parent’s** chat conventions. If a `chat_template.jinja` is present, libraries that support `apply_chat_template` will automatically format messages.
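
For example, a minimal sketch with 🤗 `transformers` (the tokenizer files are assumed to be pulled from the `W4A16` branch):

    # Render a conversation with the packaged chat template before sending it
    # to a raw completions endpoint.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained(
        "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors", revision="W4A16"
    )

    messages = [
        {"role": "system", "content": "You are a concise, helpful assistant."},
        {"role": "user", "content": "Summarize the difference between W4A16 and W8A8."},
    ]

    prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print(prompt)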

Guidelines:
- Keep the **system** message concise (behavior, tone, safety constraints).  
- Provide clear **user** instructions; for multi-step tasks, list steps explicitly.

---

## Intended use & safety

This quantization:
- Is intended to **preserve** the parent model's underlying behavior and content tendencies (small numerical deviations from the full-precision weights are possible).  
- **Only** changes how the weights are stored, for more efficient inference.

Apply appropriate **content filters / policies** for your deployment context.

---

## Lineage

- **Finetuned parent:** https://huggingface.co/PrimeIntellect/INTELLECT-3  
- **This repo:** **Quantized child** of the finetune (**compressed-tensors** for vLLM)

---

## Hardware tips

- 100B+-class MoE models benefit from **multi-GPU** tensor parallel; interconnect bandwidth matters (NVLink/IB).  
- Long contexts are **KV-cache** heavy — tune `--max-model-len` and batch size.  
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.  
- Consider CUDA Graphs if stable in your environment.

---

## Changelog

- **v1 (current)** — Initial **compressed-tensors W4A16** quantization with **512-sample / 2048-token** AWQ calibration; vLLM-ready packaging.