+---
+language:
+- en
+library_name: vllm
+pipeline_tag: text-generation
+tags:
+  - text-generation
+  - conversational
+  - compressed-tensors
+  - awq
+  - w4a16
+  - quantized
+  - moe
+base_model: PrimeIntellect/INTELLECT-3
+base_model_relation: quantized
+quantized_by: TheHouseOfTheDude
+license: other
+---
+# INTELLECT-3 — **Quantized** (compressed-tensors for vLLM, GLM-4.5-Air MoE finetune)
+This repository provides **quantized runtime builds** of
+**PrimeIntellect/INTELLECT-3**, repackaged for **vLLM** using the **compressed-tensors** format.
+> **TL;DR**
+> - **Quantized** branch: **W4A16** (INT4 weights / A16 activations) for vLLM via `--quantization compressed-tensors`.
+> - Same calibration recipe as our recent cards: **512** chat samples at **2048** tokens max from **`neuralmagic/LLM_compression_calibration`** (rendered with the model’s chat template).
+> - Weight-only **AWQ**, **group size 128**, **symmetric** quant, `lm_head` left in higher precision, exported with `save_compressed=True`.
+> - Parent is a **GLM-4.5-Air MoE** finetune; notes below cover MoE-specific considerations.
+---
+## Revisions & Branches
+> The **`main`** branch is a landing page (model card + links). Runnable artifacts live in per-quant branches.
+- **main** — placeholder / landing page
+- **W4A16** — 4-bit weights / 16-bit activations (compressed-tensors)
+**Quick links**
+- main: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/main
+- W4A16: https://huggingface.co/TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors/tree/W4A16
+---
+## What’s inside (per revision)
+- Sharded **quantized** weights (`*.safetensors`) + index (`model.safetensors.index.json`)
+- `config.json` with **compressed-tensors** metadata under `quantization_config` (e.g. `quant_method`, `format`, `config_groups`, `ignore`)
+- Tokenizer artifacts (`tokenizer.json`, `tokenizer.model`, merges/vocab as applicable)
+- Optional: `chat_template.jinja` (inherits the finetune’s chat style)
+> Exact file lists may differ between branches — see **Files and versions** for each revision.
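To check what a downloaded revision actually contains, the quantization metadata can be read straight from `config.json`. The sketch below is a minimal example under assumptions: the key names (`quantization_config`, `config_groups`, `weights`, `ignore`) follow the layout that compressed-tensors exports typically use, and the helper name `summarize_quantization` is hypothetical. Verify the keys against the file in the branch you pull.

```python
import json


def summarize_quantization(config: dict) -> dict:
    """Collect the weight-quantization settings from a parsed config.json.

    Assumption: the metadata sits under "quantization_config" with one
    entry per group in "config_groups". Missing keys yield None or empty
    containers rather than raising.
    """
    qcfg = config.get("quantization_config") or {}
    summary = {
        "quant_method": qcfg.get("quant_method"),
        "ignore": qcfg.get("ignore") or [],
        "groups": {},
    }
    for name, group in (qcfg.get("config_groups") or {}).items():
        weights = group.get("weights") or {}
        summary["groups"][name] = {
            key: weights.get(key)
            for key in ("num_bits", "type", "symmetric", "strategy", "group_size")
        }
    return summary


def summarize_file(path: str) -> dict:
    """Load config.json from disk and summarize it."""
    with open(path, "r", encoding="utf-8") as handle:
        return summarize_quantization(json.load(handle))
```

For a W4A16 build as described in this card, the summary would be expected to show 4-bit symmetric integer weights with group size 128 and `lm_head` in the ignore list.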
+---
+## Quantization & calibration details (same script/recipe family as previous card)
+**Method / flow**
+- `llmcompressor` **oneshot** pipeline with an **AWQModifier** (weight-only).
+**Targets / exclusions**
+- Quantize **Linear** layers across the model (including MoE expert linear projections).
+- **Ignore** `lm_head` (kept in higher precision).
+**Weights / grouping**
+- **INT4** (`num_bits=4`, `type="int"`, `symmetric=True`)
+- Strategy: `"group"` with **`group_size=128`** (Marlin-friendly)
+- **Activations are not quantized** (runtime **A16**: BF16/FP16)
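To make the weight format above concrete, here is a minimal pure-Python sketch of symmetric INT4 group-wise quantization with `group_size=128`. It illustrates only the numeric representation (one scale per group, no zero-point). It is not the AWQ algorithm itself, which additionally searches for per-channel scaling before rounding, and it is not the packed layout that compressed-tensors writes to disk. The function names are hypothetical.

```python
from typing import List, Tuple

GROUP_SIZE = 128    # group_size from the card
QMIN, QMAX = -8, 7  # signed 4-bit integer range


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Quantize one group symmetrically: one scale, no zero-point."""
    max_abs = max(abs(w) for w in weights)
    # Assumption: the scale maps the largest magnitude onto QMAX.
    # An all-zero group gets a dummy scale of 1.0 to avoid dividing by zero.
    scale = max_abs / QMAX if max_abs > 0 else 1.0
    codes = [max(QMIN, min(QMAX, round(w / scale))) for w in weights]
    return codes, scale


def quantize_row(row: List[float], group_size: int = GROUP_SIZE):
    """Split a weight row into consecutive groups and quantize each one."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size])
        for i in range(0, len(row), group_size)
    ]


def dequantize_row(groups) -> List[float]:
    """Reconstruct approximate float weights from (codes, scale) pairs."""
    return [code * scale for codes, scale in groups for code in codes]
```

If each group stores one 16-bit scale (an assumption about the storage dtype), the cost works out to roughly 4 + 16/128 = 4.125 bits per quantized weight.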
+**Calibration dataset & preprocessing**
+- Dataset: **`neuralmagic/LLM_compression_calibration`**, split **`train`**
+- **NUM_CALIBRATION_SAMPLES = 512** (random subset with fixed seed)
+- **MAX_SEQUENCE_LENGTH = 2048**
+- Each sample’s `messages` list is rendered via the model tokenizer’s
+  `apply_chat_template(..., tokenize=False)`, then tokenized with:
+  - `max_length=2048`, `truncation=True`, `padding=False`, `add_special_tokens=False`
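The preprocessing steps above can be sketched as two small functions. This is a sketch under assumptions, not the author's script: the seed value is a placeholder (the card only says the seed is fixed), the helper names are hypothetical, and `tokenizer` is expected to behave like a 🤗 Transformers tokenizer that exposes `apply_chat_template` and is callable.

```python
import random

NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
SEED = 42  # placeholder: the card states a fixed seed but not its value


def select_subset(samples, k=NUM_CALIBRATION_SAMPLES, seed=SEED):
    """Pick a reproducible random subset of the calibration samples."""
    rng = random.Random(seed)
    return rng.sample(list(samples), min(k, len(samples)))


def preprocess(sample, tokenizer):
    """Render one sample's messages with the chat template, then tokenize.

    The tokenization arguments mirror the ones listed in the card.
    """
    text = tokenizer.apply_chat_template(sample["messages"], tokenize=False)
    return tokenizer(
        text,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        padding=False,
        add_special_tokens=False,
    )
```

`add_special_tokens=False` avoids doubling special tokens, on the assumption that the chat template already emits them in the rendered text.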
+**Compression call**
+- `oneshot(..., max_seq_length=2048, num_calibration_samples=512, tokenizer=tokenizer)` on the preprocessed dataset
+**Export for vLLM**
+- Saved with **`save_compressed=True`** so **vLLM** reads the **compressed-tensors** runtime layout directly
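Put together, the compression and export steps might look like the sketch below. It assumes `llmcompressor` is installed, that `model`, `tokenizer`, and the preprocessed `dataset` are already loaded, and that the preset scheme name `"W4A16"` corresponds to the symmetric INT4, group-size-128 settings listed above. The exact modifier arguments used for this repository are not shown in the card, so treat them as assumptions. The function name is hypothetical, and MoE architectures may need additional AWQ layer mappings that this sketch omits.

```python
def quantize_w4a16(model, tokenizer, dataset, output_dir):
    """One-shot AWQ weight-only quantization followed by a compressed export.

    Imports are local so that defining this function does not require
    llmcompressor to be installed.
    """
    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQModifier

    # Assumption: quantize all Linear layers except lm_head, as the card describes.
    recipe = [AWQModifier(targets=["Linear"], scheme="W4A16", ignore=["lm_head"])]

    oneshot(
        model=model,
        dataset=dataset,
        recipe=recipe,
        max_seq_length=2048,
        num_calibration_samples=512,
        tokenizer=tokenizer,
    )

    # save_compressed=True writes the compressed-tensors layout that vLLM reads.
    model.save_pretrained(output_dir, save_compressed=True)
    tokenizer.save_pretrained(output_dir)
```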
+---
+## GLM-4.5-Air MoE notes
+- **Mixture-of-Experts (MoE)** means most transformer blocks host multiple expert FFNs with a router/gating network that activates a subset per token.
+- **Quantization impact:** AWQ weight-only quantization is applied to expert **Linear** layers as well as shared projections. Because only `lm_head` is excluded, the **router** (one or more small Linear layers) is quantized like the other Linear layers.
+- **Serving tips (vLLM):**
+  - Ensure your vLLM build supports MoE routing for the GLM-family architecture.
+  - Throughput depends on **expert parallelism** + **tensor parallelism**; scale `--tensor-parallel-size` to your GPUs and mind interconnect bandwidth.
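+  - Only a subset of experts is active per token, but all expert weights must stay resident in GPU memory. **KV-cache** size is driven by context length and batch size rather than by routing, so keep `--max-model-len` aligned with your hardware.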
+---
+## Context length
+- **Calibration context:** up to **2048 tokens** per sample (as above).
+- **Model context window:** inherited from **PrimeIntellect/INTELLECT-3**. Quantization does **not** change RoPE/position encodings; it changes only the numeric representation of the weights.
+---
+## Quickstart — vLLM (compressed-tensors)
+Install vLLM (recent version recommended):
+    pip install vllm
+Serve (adjust to your hardware; the weights live on the `W4A16` branch, so pass `--revision`):
+    CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+    vllm serve TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors \
+      --revision W4A16 \
+      --quantization compressed-tensors \
+      --tensor-parallel-size 8 \
+      --max-model-len 2048 \
+      --gpu-memory-utilization 0.70 \
+      --dtype bfloat16
+Example Chat Completions:
+    curl http://localhost:8000/v1/chat/completions \
+      -H "Content-Type: application/json" \
+      -d '{
+        "model": "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors",
+        "messages": [
+          {"role":"system","content":"You are INTELLECT — helpful, precise, and safe."},
+          {"role":"user","content":"Outline a plan for multi-document retrieval with MoE models."}
+        ],
+        "max_tokens": 512,
+        "temperature": 0.7,
+        "top_p": 0.95
+      }'
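The same request can be issued from Python using only the standard library. This is a minimal sketch that assumes the server from the serve command is listening on `localhost:8000`. The helper names `build_chat_request` and `chat` and the default system prompt are illustrative, not part of vLLM.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default local vLLM endpoint
MODEL = "TheHouseOfTheDude/INTELLECT-3_Compressed-Tensors"


def build_chat_request(user_prompt,
                       system_prompt="You are a helpful, precise assistant.",
                       max_tokens=512, temperature=0.7, top_p=0.95):
    """Build the POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
        "max_tokens": max_tokens,
        "temperature": temperature,
        "top_p": top_p,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(user_prompt, **kwargs):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(user_prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Outline a plan for multi-document retrieval with MoE models."))` would print the generated reply.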
+> **Note:** These builds are packaged for the **vLLM** runtime. Loading them directly with vanilla 🤗 Transformers is **not supported** by this repository.
+> For Transformers, use a compatible export (e.g., GPTQ/AWQ for Transformers) or the full-precision finetune.
+---
+## Prompting / chat template
+This package follows the **finetuned parent’s** chat conventions. If a `chat_template.jinja` is present, libraries that support `apply_chat_template` will automatically format messages.
+Guidelines:
+- Keep the **system** message concise (behavior, tone, safety constraints).
+- Provide clear **user** instructions; for multi-step tasks, list steps explicitly.
+---
+## Intended use & safety
+This quantization:
+- **Does not** intentionally change underlying behavior or content tendencies, although small output differences from the full-precision model are possible.
+- **Only** changes weight storage for efficient inference.
+Apply appropriate **content filters / policies** for your deployment context.
+---
+## Lineage
+- **Finetuned parent:** https://huggingface.co/PrimeIntellect/INTELLECT-3
+- **This repo:** **Quantized child** of the finetune (**compressed-tensors** for vLLM)
+---
+## Hardware tips
+- MoE models in the 100B+ parameter class benefit from **multi-GPU** tensor parallelism; interconnect bandwidth matters (NVLink/IB).
+- Long contexts are **KV-cache** heavy — tune `--max-model-len` and batch size.
+- Prefer **BF16** on GPUs with native support; otherwise **FP16**.
+- Consider CUDA Graphs if stable in your environment.
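To reason about how `--max-model-len` and batch size affect memory, a rough KV-cache estimate helps. The sketch below uses the standard per-token accounting for attention with grouped KV heads. The layer count, KV-head count, and head dimension in the example are placeholder values, not INTELLECT-3's actual configuration, so read the real ones from the model's `config.json`. The estimate also ignores allocator overhead and vLLM's block-based paging.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Approximate KV-cache size in bytes.

    The factor of 2 accounts for storing both keys and values per layer;
    bytes_per_elem=2 corresponds to BF16/FP16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Placeholder architecture values, for illustration only.
example = kv_cache_bytes(num_layers=48, num_kv_heads=8, head_dim=128, seq_len=2048)
print(f"{example / 2**20:.0f} MiB per sequence")  # prints "384 MiB per sequence"
```

The estimate scales linearly with both sequence length and batch size, which is why long contexts and large batches compete for the same memory budget.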
+---
+## Changelog
+- **v1 (current)** — Initial **compressed-tensors W4A16** quantization with **512-sample / 2048-token** AWQ calibration; vLLM-ready packaging.