---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: Qwen/Qwen3-235B-A22B-Thinking-2507
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-235B-A22B-Thinking-2507/blob/main/LICENSE
base_model_relation: quantized
tags:
- imatrix
- conversational
- qwen3_moe
- ik_llama.cpp
---
## `ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-235B-A22B-Thinking-2507
This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork, which supports ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.
Some of ik's new quants are also supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCpp.
These quants provide best-in-class perplexity for the given memory footprint.
## Big Thanks
Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!
## Quant Collection
Perplexity computed against *wiki.test.raw*. These first two are just test quants for baseline perplexity comparison:

* `bf16` 437.989 GiB (16.003 BPW)
  - Final estimate: PPL = TODO
* `Q8_0` 232.769 GiB (8.505 BPW)
  - Final estimate: PPL = TODO
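The size and BPW figures are related by simple arithmetic, which can help when budgeting RAM and VRAM. The sketch below assumes BPW means total file bits divided by total weight count and that 1 GiB is 2^30 bytes; the `bpw` helper and the `n_weights` variable are hypothetical names, not part of `ik_llama.cpp`. It backs the weight count out of the `bf16` row and then re-derives BPW for two of the quant sizes listed in this card.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of ik_llama.cpp):
# bits per weight from a file size in GiB and a weight count.
# Assumes BPW = total bits / total weights and 1 GiB = 1024^3 bytes.
bpw() {
  awk -v gib="$1" -v n="$2" 'BEGIN { printf "%.3f\n", gib * 1024^3 * 8 / n }'
}

# Weight count implied by the bf16 row: 437.989 GiB at 16.003 BPW.
n_weights=$(awk 'BEGIN { printf "%.0f\n", 437.989 * 1024^3 * 8 / 16.003 }')

echo "implied weights: $n_weights"
bpw 161.722 "$n_weights"   # IQ5_K size from this card
bpw 81.866 "$n_weights"    # IQ2_KL size from this card
```

Because the published sizes and BPW values are rounded, the recomputed numbers should only be expected to agree with the card to about the last printed digit.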
## `IQ5_K` 161.722 GiB (5.909 BPW)
Final estimate: PPL = TODO
<details>
<summary>Secret Recipe</summary>
```bash
#!/usr/bin/env bash
# Repeating Layers [0-93]
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
/mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ5_K.gguf \
IQ5_K \
192
```
</details>
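The recipes pass their per-tensor rules to `--custom-q` as one comma-separated string, built from the commented multi-line list by the `grep`/`sed` pipeline. Here is a minimal sketch of that collapsing step on a shortened rule list, assuming GNU `sed` (the `-z` flag is a GNU extension); the `collapse_rules` function name is hypothetical and only wraps the same pipeline for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the recipe's grep/sed pipeline.
# Requires GNU sed: -z treats the whole input as one record,
# so the \n matches in the substitutions span line breaks.
collapse_rules() {
  echo "$1" | grep -v '^#' | \
    sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
}

# Shortened rule list in the same "regex=quant_type" form as the recipes.
rules="
# Attention
blk\..*\.attn_k.*=q8_0
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq6_k
"

# Comment lines are dropped, runs of newlines become single commas,
# and the leading and trailing commas are trimmed.
collapse_rules "$rules"
echo
```

The comment lines exist only for readability; nothing after `grep -v '^#'` ever sees them, so the rules can be grouped and annotated freely as long as each comment starts at the beginning of its line.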
## `IQ4_K` 134.183 GiB (4.903 BPW)
Final estimate: PPL = TODO
<details>
<summary>Secret Recipe</summary>
```bash
#!/usr/bin/env bash
# Repeating Layers [0-93]
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
/mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ4_K.gguf \
IQ4_K \
192
```
</details>
## `IQ3_K` 106.644 GiB (3.897 BPW)
Final estimate: PPL = TODO
<details>
<summary>Secret Recipe</summary>
```bash
#!/usr/bin/env bash
# Repeating Layers [0-93]
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
# Token Embedding
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 1 -m 1 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
/mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ3_K.gguf \
IQ3_K \
192
```
</details>
## `IQ2_KL` 81.866 GiB (2.991 BPW)
Final estimate: PPL = 4.6608 +/- 0.02720
<details>
<summary>Secret Recipe</summary>
```bash
#!/usr/bin/env bash
# Repeating Layers [0-93]
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
# Token Embedding
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/imatrix-Qwen3-235B-A22B-Thinking-2507-BF16.dat \
/mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-BF16-00001-of-00010.gguf \
/mnt/raid/models/ubergarm/Qwen3-235B-A22B-Thinking-2507-GGUF/Qwen3-235B-A22B-Thinking-2507-IQ2_KL.gguf \
IQ2_KL \
192
```
</details>
## Quick Start
This example is for hybrid inferencing on a single CUDA GPU plus CPU/RAM. Check the ik_llama.cpp discussions or my other quants for more examples, such as multi-GPU setups.
```bash
./build/bin/llama-server \
--model /models/IQ5_K/Qwen3-235B-A22B-Thinking-IQ5_K-00001-of-00004.gguf \
--alias ubergarm/Qwen3-235B-A22B-Thinking-2507 \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-c 32768 \
-ngl 99 \
-ot "blk\.[0-9]\.ffn.*=CUDA0" \
-ot "blk.*\.ffn.*=CPU" \
--threads 16 \
-ub 4096 -b 4096 \
--host 127.0.0.1 \
--port 8080
```
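The two `-ot` (override tensor) patterns decide which FFN tensors stay on the GPU and which go to system RAM. The sketch below only illustrates the pattern matching: it assumes the overrides are tried in the order given and that the first match wins, and it uses `grep -E` as a stand-in for the server's own regex engine. The `placement` function and the sample tensor names are hypothetical, modelled on the `blk.N.*` names used in the recipes above.

```bash
#!/usr/bin/env bash
# Hypothetical helper: which target the two -ot rules would pick for a
# tensor name, ASSUMING rules are checked in order and the first match wins.
placement() {
  if echo "$1" | grep -Eq 'blk\.[0-9]\.ffn.*'; then
    echo CUDA0      # single-digit layers 0-9
  elif echo "$1" | grep -Eq 'blk.*\.ffn.*'; then
    echo CPU        # every other layer's ffn tensors
  else
    echo default    # no override; placement is left to -ngl
  fi
}

# Illustrative tensor names only.
for name in blk.3.ffn_down_exps.weight \
            blk.10.ffn_up_exps.weight \
            blk.42.attn_q.weight; do
  printf '%s -> %s\n' "$name" "$(placement "$name")"
done
```

Under that assumption, only the FFN tensors of layers 0 to 9 are pinned to `CUDA0`. If VRAM allows, widening the first pattern (for example `blk\.(1?[0-9])\.ffn.*` for layers 0 to 19) should keep more layers on the GPU.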
## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
* [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)