---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: moonshotai/Kimi-K2-Thinking
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/LICENSE
base_model_relation: quantized
tags:
- mla
- imatrix
- conversational
- ik_llama.cpp
---

## imatrix Quantization of moonshotai/Kimi-K2-Thinking
The "full quality" baseline `Q4_X` quant runs on both mainline llama.cpp and ik_llama.cpp. The other quants in this collection **REQUIRE** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

*NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP, which has Windows builds for CUDA 12.9. Also check out the [Windows builds by Thireus here](https://github.com/Thireus/ik_llama.cpp/releases), which have been built against CUDA 12.8.
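
If you haven't set up the fork yet, the build follows the usual llama.cpp CMake flow; here's a minimal sketch assuming a Linux box with the CUDA toolkit installed (see the ik_llama.cpp README and the Getting Started guide in the References below for the current options):

```bash
# Minimal ik_llama.cpp build sketch (drop -DGGML_CUDA=ON for a CPU-only build;
# check the repo README for the currently recommended flags)
git clone https://github.com/ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
# binaries end up in ./build/bin/ (llama-server, llama-quantize, llama-perplexity, ...)
```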

These quants provide best in class perplexity for the given memory footprint.

## Big Thanks
Great job to ngxson, compilade, DevQuasar, Bartowski, AesSedai, and more folks who pulled together hacking to get this out quickly! 🫢 Thanks also to jukofyork for the `Q4_X` patch!

Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs)!  **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!

Finally, I *really* appreciate all the support from [aifoundry.org](https://aifoundry.org) so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants!

## Quant Collection
Perplexity computed against *wiki.test.raw*.

![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")
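
The "Final estimate" numbers below are `llama-perplexity` output; roughly, the measurement looks like this (model path, context size, and thread count shown here are illustrative, not the exact invocation used):

```bash
# Rough sketch of a perplexity run against the wikitext-2 raw test split
# (model path, context size, and thread count are placeholders)
./build/bin/llama-perplexity \
    --model /path/to/Kimi-K2-Thinking-IQ3_K.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    --threads 96
```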

## Q4_X 543.617 GiB (4.549 BPW)

The `Q4_X` version scores perplexity equivalent to a full 1TB Q8_0 test quant, using a one-line patch to adjust q4_0 to better fit the original QAT target quantization. Discussions are ongoing on [llama.cpp PR#17069](https://github.com/ggml-org/llama.cpp/pull/17069) and [directly with Moonshot AI on their huggingface discussions](https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/26), as it seems they may have only used 15 of the 16 possible 4-bit values.

Final estimate: PPL = 2.0818 +/- 0.00903

This is the "full quality" baseline version of the model and the only one in this collection which works on *both* ik_llama.cpp and mainline llama.cpp. It does *not* use an imatrix and was created by going from the original model to full bf16 before further quantization. The exact PR used is linked below in the references. This quant was used to make the imatrix for the rest of the collection.

<details>

<summary>πŸ‘ˆ Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Q4_0 (patched) routed experts approximating original QAT design
# Q8_0 everything else

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=q4_0
blk\..*\.ffn_(gate|up)_exps\.weight=q4_0

token_embd\.weight=q8_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-Q8_0-Q4_0.gguf \
    Q8_0 \
    128
```

</details>
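
The imatrix for the rest of the collection was then computed from this Q4_X baseline against my calibration corpus (linked in the References below); roughly, that step looks like this (context size, thread count, and exact flags shown are illustrative, not the precise invocation):

```bash
# Rough sketch of the imatrix step run against the Q4_X baseline
# (calibration corpus is ubergarm-imatrix-calibration-corpus-v02.txt from the References;
#  context size and thread count here are placeholders)
./build/bin/llama-imatrix \
    --model /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-Q8_0-Q4_0.gguf \
    -f ubergarm-imatrix-calibration-corpus-v02.txt \
    -o /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    --ctx-size 512 \
    --threads 96
```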

## smol-IQ4_KSS 485.008 GiB (4.059 BPW)
Final estimate: PPL = 2.1343 +/- 0.00934

<details>

<summary>πŸ‘ˆ Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ4_KSS.gguf \
    IQ4_KSS \
    128
```

</details>

## IQ3_K 459.432 GiB (3.845 BPW)
Final estimate: PPL = 2.1456 +/- 0.00941

*NOTE*: Given there were some issues with the original q4_0 quantization, I've replaced the original IQ3_K with this new smaller one using the patched q4_x quantization. The original one was `474.772 GiB (3.973 BPW)` and will be squash-deleted soon to save on public quota. This new one uses the patched q4_x and only applies the imatrix to the iq3_k tensors, but *not* to the q8_0 or q4_x tensors. More details in [discussion 4 here](https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/4#6918a268149cb086f69915ce). It has almost the same perplexity, so it's a good improvement.

<details>

<summary>πŸ‘ˆ Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=q4_0
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    --include-weights ffn_gate_exps \
    --include-weights ffn_up_exps \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-IQ3_K.gguf \
    IQ3_K \
    128
```

</details>

## smol-IQ3_KS 398.561 GiB (3.336 BPW)
Final estimate: PPL = 2.2331 +/- 0.01001

<details>

<summary>πŸ‘ˆ Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\.(1|2|3|60)\.ffn_down_exps\.weight=q4_0
blk\.(1|2|3|60)\.ffn_(gate|up)_exps\.weight=q4_0
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ3_KS.gguf \
    IQ3_KS \
    128
```

</details>

## IQ2_KL 348.883 GiB (2.920 BPW)
Final estimate: PPL = 2.3735 +/- 0.01082

<details>

<summary>πŸ‘ˆ Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-IQ2_KL.gguf \
    IQ2_KL \
    128
```

</details>

## smol-IQ2_KL 329.195 GiB (2.755 BPW)
Final estimate: PPL = 2.4550 +/- 0.01129

<details>

<summary>πŸ‘ˆ Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ2_KL.gguf \
    IQ2_KL \
    128
```

</details>

## smol-IQ2_KS 270.133 GiB (2.261 BPW)
Final estimate: PPL = 2.9361 +/- 0.01451

<details>

<summary>πŸ‘ˆ Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ2_KS.gguf \
    IQ2_KS \
    128
```

</details>

## smol-IQ1_KT 218.936 GiB (1.832 BPW)
Final estimate: PPL = 3.5931 +/- 0.01889

*only for the desperate*

Also keep in mind that `KT` trellis quants are generally slower during TG, given the likely compute bottleneck when running on CPU, but if it is all you can fit then well...

<details>

<summary>πŸ‘ˆ Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq1_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-IQ1_KT.gguf \
    IQ1_KT \
    128
```

</details>

## Quick Start
You will want to override the chat template, given they patched the original template here: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/chat_template.jinja
You can do this with something like `--jinja --chat-template-file ./my-custom-template.jinja`.
Depending on the endpoint and client used, you will also need to pass `--special` for it to output the `<think>` and `</think>` tags correctly (thanks [u/Melodic-Network4374](https://www.reddit.com/r/LocalLLaMA/comments/1oqo57j/comment/nnpqxjx/)), but note it will then also print out `<|im_end|>`, so set your client to use that as a stop string.
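
One way to do that is to pull down the upstream template and point the server at it; a minimal sketch, assuming the standard huggingface raw-file URL (the local filename just matches the example command below):

```bash
# Grab the patched upstream chat template (standard huggingface "raw" URL path);
# name the local file whatever you like and pass it via --chat-template-file
curl -L -o updatedChatTemplate.jinja \
    https://huggingface.co/moonshotai/Kimi-K2-Thinking/raw/main/chat_template.jinja
```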

```bash
# Example running Hybrid CPU+GPU(s) on ik_llama.cpp
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Kimi-K2-Thinking-GGUF \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 \
    -ngl 99 \
    -ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
    -ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja \
    --chat-template-file updatedChatTemplate.jinja \
    --special

# Example running mainline llama.cpp
# remove `-mla 3` from commands and you should be :gucci:
```

If you have no GPU(s), just remove the `-ngl` and `-ot` lines.

If you don't have enough RAM+VRAM, remove `--no-mmap` to "troll rig" it with mmap(), paging weights read-only off of disk for maybe a couple tok/sec depending on your storage; a rough CPU-only sketch combining these tweaks is shown below.
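
This is just the example command above with the offload lines dropped, not a tuned invocation:

```bash
# CPU-only variant: no -ngl/-ot offload, mmap left enabled so weights can page
# off disk when RAM is short (expect low tok/sec in that case)
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Kimi-K2-Thinking-GGUF \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --host 127.0.0.1 \
    --port 8080 \
    --jinja \
    --chat-template-file updatedChatTemplate.jinja \
    --special
```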

Adjust `--threads` and `--threads-batch` as needed. For smaller CPUs I recommend setting them both equal to the number of physical cores; for an AMD 9950X that would be `-t 16`, for example. Experiment on larger rigs, especially with multi-socket NUMA considerations (avoid cross-NUMA memory access if possible).
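
If you're not sure of your physical core count or NUMA layout, a quick look with standard Linux tools (nothing ik-specific) helps before picking thread counts:

```bash
# Inspect physical cores, SMT, and NUMA nodes before choosing --threads/--threads-batch
lscpu | grep -E '^(Socket|Core|Thread|NUMA)'
numactl --hardware

# e.g. on a 16-core desktop (like the 9950X above) you might use:
#   --threads 16 --threads-batch 16
```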

With ik_llama.cpp you can free up some extra VRAM by using `-amb 512` to fix the size of the MLA computation buffers (this only works on models with MLA-style attention like Kimi-K2 and DeepSeek).

## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
* [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
* [ubergarm-imatrix-calibration-corpus-v02.txt](https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a?permalink_comment_id=5682584#gistcomment-5682584)
* [moonshotai/Kimi-K2-Thinking/discussions/2](https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/2)
* [vllm-project/compressed-tensors/issues/511](https://github.com/vllm-project/compressed-tensors/issues/511)
* [llama.cpp PR#17069](https://github.com/ggml-org/llama.cpp/pull/17069#issuecomment-3500870165)