ubergarm committed
Commit 28d1a1f · 1 Parent(s): 5251ef4

initial commit

Files changed (2)
  1. .gitattributes +3 -0
  2. README.md +515 -0
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ imatrix-*.dat filter=lfs diff=lfs merge=lfs -text
+ *.gguf filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,518 @@
  ---
+ quantized_by: ubergarm
+ pipeline_tag: text-generation
+ base_model: Qwen/Qwen3-30B-A3B-Thinking-2507
  license: apache-2.0
+ license_link: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507/blob/main/LICENSE
+ base_model_relation: quantized
+ tags:
+ - imatrix
+ - conversational
+ - qwen3_moe
+ - ik_llama.cpp
  ---
+
+ ## `ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-30B-A3B-Thinking-2507
+ This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork, which supports ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
+
+ *NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.
+
+ Some of ik's new quants are also supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCpp.
+
+ These quants provide best-in-class perplexity for the given memory footprint.
+
+ ## Big Thanks
+ Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!
+
+ Also thanks to all the folks in the quanting and inferencing community on the [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!
+
+ ## Quant Collection
+ Perplexity is computed against *wiki.test.raw* (a sketch of a typical measurement command follows the chart).
+
+ ![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")
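+
+ For reference, perplexity numbers like these are typically measured with `llama-perplexity` from the same build; a minimal sketch (the context size, threads, and paths here are assumptions, not the recorded invocation):
+
+ ```bash
+ # Hedged sketch: measure PPL of one quant against wiki.test.raw
+ ./build/bin/llama-perplexity \
+     --model Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
+     -f wiki.test.raw \
+     -fa \
+     --ctx-size 512 \
+     --threads 8
+ ```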
33
+
34
+ These first two are just test quants for baseline perplexity comparison:
35
+ * `bf16` 56.894 GiB (16.007 BPW)
36
+ - Final estimate: PPL = TODO
37
+ * `Q8_0` 30.247 GiB (8.510 BPW)
38
+ - Final estimate: PPL = TODO
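+
+ The baselines need no custom recipe; for the `Q8_0` case a plain `llama-quantize` call suffices. A sketch, assuming the same paths and thread count as the recipes below:
+
+ ```bash
+ # Assumed invocation: straight Q8_0 baseline, no --custom-q or imatrix needed
+ ./build/bin/llama-quantize \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-Q8_0.gguf \
+     Q8_0 \
+     192
+ ```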
+
+ ## `IQ5_K` 21.324 GiB (5.999 BPW)
+ Final estimate: PPL = TODO
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # 48 Repeating Layers [0-47]
+
+ # Attention
+ blk\.(0)\.attn_q.*=q8_0
+ blk\.(0)\.attn_k.*=q8_0
+ blk\.(0)\.attn_v.*=q8_0
+ blk\.(0)\.attn_output.*=q8_0
+
+ blk\..*\.attn_q.*=iq5_k
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq5_k
+
+ # Routed Experts
+ blk\.(0|47)\.ffn_down_exps\.weight=q8_0
+ blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
+
+ blk\..*\.ffn_down_exps\.weight=iq6_k
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq6_k
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ5_K.gguf \
+     IQ5_K \
+     192
+ ```
+
+ </details>
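+
+ The `grep | sed` pipeline in each recipe strips the comment lines and joins the remaining `regex=type` rules into the single comma-separated string that `--custom-q` expects. A minimal sketch of the same transformation on a made-up two-rule list:
+
+ ```bash
+ # Illustrative input only; the real recipes use the rule lists above
+ rules="
+ # this comment line is dropped by grep -v '^#'
+ token_embd\.weight=iq6_k
+ output\.weight=iq6_k
+ "
+ echo "$rules" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ # prints: token_embd\.weight=iq6_k,output\.weight=iq6_k
+ ```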
+
+ ## `IQ4_K` 17.878 GiB (5.030 BPW)
+ Final estimate: PPL = TODO
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # 48 Repeating Layers [0-47]
+
+ # Attention
+ blk\.(0)\.attn_q.*=q8_0
+ blk\.(0)\.attn_k.*=q8_0
+ blk\.(0)\.attn_v.*=q8_0
+ blk\.(0)\.attn_output.*=q8_0
+
+ blk\..*\.attn_q.*=iq5_k
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq5_k
+
+ # Routed Experts
+ blk\.(0|47)\.ffn_down_exps\.weight=q8_0
+ blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
+
+ blk\..*\.ffn_down_exps\.weight=iq5_k
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_k
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ4_K.gguf \
+     IQ4_K \
+     192
+ ```
+
+ </details>
+
+ ## `IQ4_KSS` 15.531 GiB (4.370 BPW)
+ Final estimate: PPL = TODO
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # 48 Repeating Layers [0-47]
+
+ # Attention
+ blk\.(0)\.attn_q.*=q8_0
+ blk\.(0)\.attn_k.*=q8_0
+ blk\.(0)\.attn_v.*=q8_0
+ blk\.(0)\.attn_output.*=q8_0
+
+ blk\..*\.attn_q.*=iq5_k
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq5_k
+
+ # Routed Experts
+ blk\.(0|47)\.ffn_down_exps\.weight=q8_0
+ blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
+
+ blk\..*\.ffn_down_exps\.weight=iq4_ks
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_k
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ4_KSS.gguf \
+     IQ4_KSS \
+     192
+ ```
+
+ </details>
+
+ ## `IQ3_K` 14.509 GiB (4.082 BPW)
+ Final estimate: PPL = TODO
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # 48 Repeating Layers [0-47]
+
+ # Attention
+ blk\.(0)\.attn_q.*=q8_0
+ blk\.(0)\.attn_k.*=q8_0
+ blk\.(0)\.attn_v.*=q8_0
+ blk\.(0)\.attn_output.*=q8_0
+
+ blk\..*\.attn_q.*=iq5_k
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq5_k
+
+ # Routed Experts
+ blk\.(0|47)\.ffn_down_exps\.weight=q8_0
+ blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
+
+ blk\..*\.ffn_down_exps\.weight=iq4_k
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_k
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ3_K.gguf \
+     IQ3_K \
+     192
+ ```
+
+ </details>
+
+ ## `IQ3_KS` 13.633 GiB (3.836 BPW)
+ Final estimate: PPL = TODO
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # 48 Repeating Layers [0-47]
+
+ # Attention
+ blk\.(0)\.attn_q.*=q8_0
+ blk\.(0)\.attn_k.*=q8_0
+ blk\.(0)\.attn_v.*=q8_0
+ blk\.(0)\.attn_output.*=q8_0
+
+ blk\..*\.attn_q.*=iq4_ks
+ blk\..*\.attn_k.*=iq5_ks
+ blk\..*\.attn_v.*=iq5_ks
+ blk\..*\.attn_output.*=iq4_ks
+
+ # Routed Experts
+ blk\.(0|47)\.ffn_down_exps\.weight=q8_0
+ blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
+
+ blk\..*\.ffn_down_exps\.weight=iq4_ks
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_k
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
+     IQ3_KS \
+     192
+ ```
+
+ </details>
+
+ ## `IQ2_KL` 11.516 GiB (3.240 BPW)
+ Final estimate: PPL = TODO
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # 48 Repeating Layers [0-47]
+
+ # Attention
+ blk\.(0)\.attn_q.*=q8_0
+ blk\.(0)\.attn_k.*=q8_0
+ blk\.(0)\.attn_v.*=q8_0
+ blk\.(0)\.attn_output.*=q8_0
+
+ blk\..*\.attn_q.*=iq5_k
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq5_k
+
+ # Routed Experts
+ blk\.(0|47)\.ffn_down_exps\.weight=q8_0
+ blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
+
+ blk\..*\.ffn_down_exps\.weight=iq3_ks
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_k
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ2_KL.gguf \
+     IQ2_KL \
+     192
+ ```
+
+ </details>
+
+ ## `IQ2_KT` 9.469 GiB (2.664 BPW)
+ Final estimate: PPL = TODO
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # 48 Repeating Layers [0-47]
+
+ # Attention
+ blk\.(0)\.attn_q.*=iq5_ks
+ blk\.(0)\.attn_k.*=iq6_k
+ blk\.(0)\.attn_v.*=iq6_k
+ blk\.(0)\.attn_output.*=iq5_ks
+
+ blk\..*\.attn_q.*=iq4_kt
+ blk\..*\.attn_k.*=iq5_ks
+ blk\..*\.attn_v.*=iq5_ks
+ blk\..*\.attn_output.*=iq4_kt
+
+ # Routed Experts
+ blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
+ blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt
+
+ blk\..*\.ffn_down_exps\.weight=iq3_kt
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_kt
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ2_KT.gguf \
+     IQ2_KT \
+     192
+ ```
+
+ </details>
+
+ ## `IQ1_KT` 7.583 GiB (2.133 BPW)
+ Final estimate: PPL = TODO
+
+ <details>
+
+ <summary>👈 Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # 48 Repeating Layers [0-47]
+
+ # Attention
+ blk\.(0)\.attn_q.*=iq5_ks
+ blk\.(0)\.attn_k.*=iq6_k
+ blk\.(0)\.attn_v.*=iq6_k
+ blk\.(0)\.attn_output.*=iq5_ks
+
+ blk\..*\.attn_q.*=iq4_kt
+ blk\..*\.attn_k.*=iq5_ks
+ blk\..*\.attn_v.*=iq5_ks
+ blk\..*\.attn_output.*=iq4_kt
+
+ # Routed Experts
+ blk\.(0|47)\.ffn_down_exps\.weight=iq4_kt
+ blk\.(0|47)\.ffn_(gate|up)_exps\.weight=iq4_kt
+
+ blk\..*\.ffn_down_exps\.weight=iq2_kt
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_kt
+ output\.weight=iq6_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-30B-A3B-Thinking-2507-GGUF/Qwen3-30B-A3B-Thinking-2507-IQ1_KT.gguf \
+     IQ1_KT \
+     192
+ ```
+
+ </details>
+
+ ## Quick Start
+ #### Full GPU Offload with CUDA or Vulkan (e.g. for AMD GPUs)
+ ```bash
+ # Compile CUDA backend
+ cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
+ cmake --build ./build --config Release -j $(nproc)
+
+ # Compile Vulkan backend
+ # Experimental: doesn't work with all quant types yet; needs more testing
+ # https://github.com/ikawrakow/ik_llama.cpp/discussions/590
+ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
+ cmake --build build --config Release -j $(nproc)
+
+ # Run Server
+ ./build/bin/llama-server \
+     --model Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
+     --alias ubergarm/Qwen3-30B-A3B-Thinking-2507 \
+     --ctx-size 32768 \
+     -ctk q8_0 -ctv q8_0 \
+     -fa -fmoe \
+     -ngl 99 \
+     --parallel 1 \
+     --threads 1 \
+     --host 127.0.0.1 \
+     --port 8080
+ ```
+
+ #### CPU-only Backend
+ ```bash
+ # Compile
+ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=0 -DGGML_VULKAN=0
+ cmake --build build --config Release -j $(nproc)
+
+ # Run Server
+ ./build/bin/llama-server \
+     --model Qwen3-30B-A3B-Thinking-2507-IQ3_KS.gguf \
+     --alias ubergarm/Qwen3-30B-A3B-Thinking-2507 \
+     --ctx-size 32768 \
+     -ctk q8_0 -ctv q8_0 \
+     -fa -fmoe \
+     -ub 4096 -b 4096 \
+     --parallel 1 \
+     --threads 8 \
+     --host 127.0.0.1 \
+     --port 8080 \
+     --no-mmap
+ ```
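+
+ Once the server is up (either backend), you can smoke-test it against its OpenAI-compatible chat endpoint; a minimal sketch assuming the default host/port and alias from the commands above:
+
+ ```bash
+ # Hedged example request; adjust host/port if you changed them
+ curl -s http://127.0.0.1:8080/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{
+           "model": "ubergarm/Qwen3-30B-A3B-Thinking-2507",
+           "messages": [{"role": "user", "content": "Briefly, what is 7 * 6?"}],
+           "max_tokens": 64
+         }'
+ ```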
+
+ ## imatrix note
+ I used @eaddario's [eaddario-imatrix-corpus-combined-all-medium](https://huggingface.co/datasets/eaddario/imatrix-calibration/blob/main/combined_all_medium.parquet) converted to text like so:
+ ```bash
+ $ apt-get install duckdb
+ $ duckdb -ascii -c "SELECT * FROM read_parquet('combined_all_medium.parquet');" > eaddario-imatrix-corpus-combined-all-medium.txt
+ $ du -h eaddario-imatrix-corpus-combined-all-medium.txt
+ 9.4M	eaddario-imatrix-corpus-combined-all-medium.txt
+ $ sha1sum eaddario-imatrix-corpus-combined-all-medium.txt
+ 4cde1d5401abdc399b22ab9ede82b63684ad6bb4  eaddario-imatrix-corpus-combined-all-medium.txt
+ ```
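+
+ From that text corpus, the `imatrix-*.dat` file referenced in the recipes above is produced with `llama-imatrix`; a hedged sketch of the step (the exact flags and settings used are assumptions, not the recorded invocation):
+
+ ```bash
+ # Assumed invocation, not the recorded one: compute the imatrix on the BF16 model
+ ./build/bin/llama-imatrix \
+     --model Qwen3-30B-A3B-Thinking-2507-BF16-00001-of-00002.gguf \
+     -f eaddario-imatrix-corpus-combined-all-medium.txt \
+     -o imatrix-eaddario-combined-all-medium-Qwen3-30B-A3B-Thinking-2507-BF16.dat
+ ```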
+
+ ## References
+ * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
+ * [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
+ * [ubergarm-imatrix-calibration-corpus-v02.txt](https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a?permalink_comment_id=5682584#gistcomment-5682584)
+ * [eaddario/imatrix-calibration](https://huggingface.co/datasets/eaddario/imatrix-calibration)