phaedawg committed · Commit f728346 · verified · 1 Parent(s): 176f5b4

First Readme

Files changed (1): README.md (added, +180 / -0 lines)
 
---
language:
- en
library_name: vllm
pipeline_tag: text-generation
tags:
- text-generation
- conversational
- compressed-tensors
- awq
- w4a16
- quantized
- moe
base_model: PrimeIntellect/INTELLECT-3
base_model_relation: quantized
quantized_by: TheHouseOfTheDude
license: other
---

# INTELLECT-3 — **Quantized** (compressed-tensors for vLLM, GLM-4.5-Air MoE finetune)

This repository provides **quantized runtime builds** of
**PrimeIntellect/INTELLECT-3**, repackaged for **vLLM** using the **compressed-tensors** format.

> **TL;DR**
> - **Quantized** branch: **W4A16** (INT4 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
> - Same calibration recipe as our recent cards: **512** chat samples at **2048** tokens max from **`neuralmagic/LLM_compression_calibration`** (rendered with the model’s chat template).
> - Weight-only **AWQ**, **group size 128**, **symmetric** quant, `lm_head` left in higher precision, exported with `save_compressed=True`.
> - Parent is a **GLM-4.5-Air MoE** finetune; notes below cover MoE-specific considerations.

---

## Revisions & Branches

> The **`main`** branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.

- **main** — placeholder / landing page
- **W4A16** — 4-bit weights / 16-bit activations (compressed-tensors)

**Quick links**

- main: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/main
- W4A16: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/W4A16
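
Since `main` only hosts this card, pull a runnable branch by pinning it as the revision. A minimal sketch with `huggingface_hub` (the local directory name is illustrative):

```python
from huggingface_hub import snapshot_download

# fetch the W4A16 branch of this repo into a local folder (path is illustrative)
snapshot_download(
    "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    revision="W4A16",
    local_dir="./INTELLECT-3_W4A16",
)
```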

---

## What’s inside (per revision)

- Sharded **quantized** weights (`*.safetensors`) + index (`model.safetensors.index.json`)
- `config.json` with **compressed-tensors** metadata (`weight_format`, `quantization`, `quantization_config`, etc.)
- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
- Optional: `chat_template.jinja` (inherits the finetune’s chat style)

> Exact file lists may differ between branches — see **Files and versions** for each revision, or list them programmatically as sketched below.
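
A small `huggingface_hub` sketch for listing a branch’s files without downloading anything (the branch name follows the table above):

```python
from huggingface_hub import list_repo_files

# enumerate the files shipped on the W4A16 branch
files = list_repo_files(
    "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    revision="W4A16",
)
for name in sorted(files):
    print(name)
```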

---

## Quantization & calibration details (same script/recipe family as previous card)

**Method / flow**
- `llmcompressor` **oneshot** pipeline with an **AWQModifier** (weight-only).

**Targets / exclusions**
- Quantize **Linear** layers across the model (including MoE expert linear projections).
- **Ignore** `lm_head` (kept in higher precision).

**Weights / grouping**
- **INT4** (`num_bits=4`, `type="int"`, `symmetric=True`)
- Strategy: `"group"` with **`group_size=128`** (Marlin-friendly)
- **Activations are not quantized** (runtime **A16**: BF16/FP16)

**Calibration dataset & preprocessing**
- Dataset: **`neuralmagic/LLM_compression_calibration`**, split **`train`**
- **NUM_CALIBRATION_SAMPLES = 512** (random subset with fixed seed)
- **MAX_SEQUENCE_LENGTH = 2048**
- Each sample’s `messages` list is rendered via the model tokenizer’s
  `apply_chat_template(..., tokenize=False)`, then tokenized with:
  - `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`

**Compression call**
- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset

**Export for vLLM**
- Saved with **`save_compressed=True`** so **vLLM** reads the **compressed-tensors** runtime layout directly; the full flow is sketched below.

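Putting the bullets above together, here is a condensed sketch of the recipe (import paths and argument names follow current `llmcompressor` examples and may differ slightly across versions; the `W4A16` preset corresponds to INT4, symmetric, group size 128; the seed and output directory are illustrative):

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.awq import AWQModifier

MODEL_ID = "PrimeIntellect/INTELLECT-3"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# 512 random chat samples, rendered through the model's own chat template
ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))

def preprocess(example):
    text = tokenizer.apply_chat_template(example["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )

ds = ds.map(preprocess, remove_columns=ds.column_names)

# weight-only AWQ: INT4 weights, group_size=128, symmetric; lm_head kept in higher precision
recipe = [AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])]

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    tokenizer=tokenizer,
)

# export in the compressed-tensors runtime layout that vLLM reads directly
model.save_pretrained("INTELLECT-3_W4A16", save_compressed=True)
tokenizer.save_pretrained("INTELLECT-3_W4A16")
```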

---

## GLM-4.5-Air MoE notes

- **Mixture-of-Experts (MoE)** means most transformer blocks host multiple expert FFNs with a router/gating network that activates a subset of experts per token (the gating pattern is sketched below).
- **Quantization impact:** AWQ weight-only quantization is applied to expert **Linear** layers as well as shared projections; the **router** (a small linear layer) is quantized like other Linear layers.
- **Serving tips (vLLM):**
  - Ensure your vLLM build supports MoE routing for the GLM-family architecture.
  - Throughput depends on **expert parallelism** + **tensor parallelism**; scale `--tensor-parallel-size` to your GPUs and mind interconnect bandwidth.
  - Expert weights leave less headroom for the **KV cache**, so memory pressure rises quickly with context; keep `--max-model-len` aligned with your hardware.
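
For intuition only, a toy top-k gate showing the routing pattern described above; this is a generic sketch, not the GLM-4.5-Air implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTopKRouter(nn.Module):
    """Generic token -> expert gating: each token activates only k experts."""

    def __init__(self, hidden_size: int, num_experts: int, k: int = 2):
        super().__init__()
        # the "router" is just a small Linear producing one score per expert
        self.gate = nn.Linear(hidden_size, num_experts, bias=False)
        self.k = k

    def forward(self, x: torch.Tensor):
        scores = F.softmax(self.gate(x), dim=-1)           # [tokens, num_experts]
        weights, expert_ids = scores.topk(self.k, dim=-1)  # keep only the top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        # downstream, each token's output is the weighted mix of its selected experts' FFNs
        return weights, expert_ids

router = ToyTopKRouter(hidden_size=64, num_experts=8, k=2)
w, ids = router(torch.randn(5, 64))
print(ids.shape)  # torch.Size([5, 2]) -> two experts chosen per token
```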

---

## Context length

- **Calibration context:** up to **2048 tokens** per sample (as above).
- **Model context window:** inherited from **PrimeIntellect/INTELLECT-3** (a quick config check is sketched below); quantization does **not** change rope/position encodings—only the numeric representation of the weights.
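
A quick way to confirm the inherited limits is to read the checkpoint’s config (this only fetches `config.json`; add `trust_remote_code=True` if your Transformers version lacks the GLM-4.5 MoE config class). The attribute names below follow the usual Transformers conventions and may differ for this architecture:

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    revision="W4A16",
)
# typical names; printed as None if this architecture uses different keys
print("max_position_embeddings:", getattr(cfg, "max_position_embeddings", None))
print("rope_theta:", getattr(cfg, "rope_theta", None))
```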

---

## Quickstart — vLLM (compressed-tensors)

Install vLLM (a recent version is recommended):

```bash
pip install vllm
```

Serve (adjust to your hardware):

```bash
# Note: the runnable weights live on the W4A16 branch (main is a placeholder);
# if your vLLM version supports it, pin the branch with --revision W4A16.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
vllm serve TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors \
  --quantization compressed-tensors \
  --tensor-parallel-size 8 \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.70 \
  --dtype bfloat16
```

Example Chat Completions request:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    "messages": [
      {"role":"system","content":"You are INTELLECT — helpful, precise, and safe."},
      {"role":"user","content":"Outline a plan for multi-document retrieval with MoE models."}
    ],
    "max_tokens": 512,
    "temperature": 0.7,
    "top_p": 0.95
  }'
```
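
The same request through the OpenAI Python client (the base URL assumes the local server above; vLLM accepts any placeholder API key unless you configure one):

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    messages=[
        {"role": "system", "content": "You are INTELLECT — helpful, precise, and safe."},
        {"role": "user", "content": "Outline a plan for multi-document retrieval with MoE models."},
    ],
    max_tokens=512,
    temperature=0.7,
    top_p=0.95,
)
print(resp.choices[0].message.content)
```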

> **Note:** `compressed-tensors` is a **vLLM runtime** format. Loading the weights directly with vanilla 🤗 Transformers is **not supported**.
> For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision finetune.

---

## Prompting / chat template

This package follows the **finetuned parent’s** chat conventions. If a `chat_template.jinja` is present, libraries that support `apply_chat_template` will automatically format messages.
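
Rendering is tokenizer-only, so it works even though the compressed weights themselves are not loadable in Transformers; a short sketch (the example prompt is arbitrary):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained(
    "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
    revision="W4A16",
)
messages = [
    {"role": "system", "content": "You are INTELLECT — helpful, precise, and safe."},
    {"role": "user", "content": "Summarize the W4A16 quantization recipe in two sentences."},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```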

Guidelines:
- Keep the **system** message concise (behavior, tone, safety constraints).
- Provide clear **user** instructions; for multi-step tasks, list the steps explicitly.

---

## Intended use & safety

This quantization:
- **Does not** alter the parent model’s training, alignment, or content tendencies (beyond the small numerical differences inherent to quantization).
- **Only** changes how the weights are stored, for efficient inference.

Apply appropriate **content filters / policies** for your deployment context.

---

## Lineage

- **Finetuned parent:** https://huggingface.co/PrimeIntellect/INTELLECT-3
- **This repo:** **Quantized child** of the finetune (**compressed-tensors** for vLLM)

---

## Hardware tips

- MoE models in the 100B+ class benefit from **multi-GPU** tensor parallelism; interconnect bandwidth matters (NVLink/InfiniBand).
- Long contexts are **KV-cache** heavy — tune `--max-model-len` and batch size (a back-of-envelope estimate is sketched below).
- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
- Consider CUDA Graphs if they are stable in your environment.
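
A rough KV-cache estimate is 2 (K and V) × layers × KV heads × head_dim × bytes per element × tokens; the sketch below uses placeholder dimensions, so substitute the real values from this checkpoint’s `config.json` before trusting the number:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   tokens: int, dtype_bytes: int = 2) -> int:
    """Rough KV-cache footprint for one sequence: K and V per layer per token."""
    return 2 * n_layers * n_kv_heads * head_dim * tokens * dtype_bytes

# Placeholder dimensions for illustration only (NOT the real GLM-4.5-Air values).
est = kv_cache_bytes(n_layers=48, n_kv_heads=8, head_dim=128, tokens=2048)
print(f"~{est / 2**30:.2f} GiB per 2048-token sequence at BF16")
```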

---

## Changelog

- **v1 (current)** — Initial **compressed-tensors W4A16** quantization with **512-sample / 2048-token** AWQ calibration; vLLM-ready packaging.