ubergarm committed
Commit 2a831b0 · 1 Parent(s): cd63168

Update readme

Files changed (1)
  1. README.md +5 -7
README.md CHANGED
@@ -14,7 +14,7 @@ tags:
  ---

  ## imatrix Quantization of moonshotai/Kimi-K2-Thinking
- The "full quality" baseline `Q8_0-Q4_0` quant runs on both mainline llama.cpp and ik_llama.cpp. The other quants in this collection **REQUIRE** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
+ The "full quality" baseline `Q4_X` quant runs on both mainline llama.cpp and ik_llama.cpp. The other quants in this collection **REQUIRE** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

  *NOTE* `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.

@@ -36,18 +36,14 @@ Perplexity computed against *wiki.test.raw*.

  ![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")

- ## Q8_0-Q4_0 543.617 GiB (4.549 BPW)
+ ## Q4_X 543.617 GiB (4.549 BPW)

- *NOTE*: I'll probably delete the original Q8_0-Q4_0 one as it does not seem optimal given the original QAT. Check out the new `Q4_X` version, which scores perplexity equivalent to a full 1TB Q8_0 test quant using a one-line patch to adjust q4_0 to better fit the original QAT target quantization. Discussions are ongoing on [llama.cpp PR#17069](https://github.com/ggml-org/llama.cpp/pull/17069) and [directly with Moonshot AI on their Hugging Face discussions](https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/26), as it seems they may have used only 15 of the 16 possible 4-bit values.
+ The `Q4_X` version scores perplexity equivalent to a full 1TB Q8_0 test quant using a one-line patch to adjust q4_0 to better fit the original QAT target quantization. Discussions are ongoing on [llama.cpp PR#17069](https://github.com/ggml-org/llama.cpp/pull/17069) and [directly with Moonshot AI on their Hugging Face discussions](https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/26), as it seems they may have used only 15 of the 16 possible 4-bit values.
-
- Final estimate: PPL = 2.1257 +/- 0.00934

  Final estimate: PPL = 2.0818 +/- 0.00903

  This is the "full quality" baseline version of the model and the only one in this collection which works on *both* ik_llama.cpp and mainline llama.cpp. It does *not* use an imatrix and was created by going from the original model to full bf16 before further quantization. The exact PR used is linked below in references. This quant was used to make the imatrix for the rest of the collection.

- After doing more perplexity measurements, I'm not sure q4_0 is the best choice despite fairly closely matching the original QAT target format... Needs more research... *EDIT*: The Q4_X is the result of this further research. Give it a test if you can fit it!
-
  <details>

  <summary>👈 Secret Recipe</summary>
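
For readers curious about the q4_0 adjustment referenced above: GGUF's stock q4_0 stores one scale per block of 32 weights and maps each weight to a 4-bit code over 16 levels (-8…+7), while a QAT target that only ever uses 15 symmetric levels (-7…+7) leaves that grid slightly mismatched. The snippet below is *not* the actual one-line patch from the linked PR, just a minimal C sketch of the idea; the `divisor` parameter (8 for the stock grid, 7 for a hypothetical symmetric 15-level grid) is an illustrative assumption.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define QK4_0 32  /* q4_0 block size used by GGUF */

/*
 * Sketch of q4_0-style block quantization: one scale d per 32 weights,
 * each weight stored as a 4-bit code so that w ~= d * (code - 8).
 * Stock q4_0 derives d from the signed max so codes span 0..15
 * (levels -8..+7). If the QAT checkpoint really only uses 15 symmetric
 * levels (-7..+7), dividing by 7 instead of 8 keeps the grid symmetric
 * and leaves code 0 unused -- an illustrative stand-in for the real patch.
 */
static void quantize_block_q4(const float *x, float *d_out, uint8_t codes[QK4_0],
                              float divisor) /* 8.0f = stock grid, 7.0f = symmetric sketch */
{
    float amax = 0.0f, maxv = 0.0f;
    for (int j = 0; j < QK4_0; ++j) {
        if (fabsf(x[j]) > amax) { amax = fabsf(x[j]); maxv = x[j]; }
    }
    const float d  = maxv / -divisor;            /* per-block scale */
    const float id = d != 0.0f ? 1.0f / d : 0.0f;
    for (int j = 0; j < QK4_0; ++j) {
        int q = (int)roundf(x[j] * id) + 8;      /* shift to unsigned 4-bit code */
        if (q < 0)  q = 0;
        if (q > 15) q = 15;
        codes[j] = (uint8_t)q;
    }
    *d_out = d;
}

int main(void) {
    float w[QK4_0];
    for (int j = 0; j < QK4_0; ++j) w[j] = sinf(0.37f * (float)j);  /* toy weights */

    float d;
    uint8_t q[QK4_0];

    quantize_block_q4(w, &d, q, 8.0f);  /* stock-style 16-level grid */
    printf("divisor 8: scale=%g code[0]=%d dequant[0]=%g\n", d, (int)q[0], d * (q[0] - 8));

    quantize_block_q4(w, &d, q, 7.0f);  /* symmetric 15-level grid */
    printf("divisor 7: scale=%g code[0]=%d dequant[0]=%g\n", d, (int)q[0], d * (q[0] - 8));
    return 0;
}
```

With a divisor of 7 the largest-magnitude weight in each block lands exactly on the grid and code 0 simply goes unused, which is consistent with the "only 15 of the 16 possible 4-bit values" observation above.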
@@ -158,6 +154,8 @@ numactl -N ${SOCKET} -m ${SOCKET} \
  ## IQ3_K 474.772 GiB (3.973 BPW)
  *NOTE*: as mentioned in the Q8_0-Q4_0 section above, there were some issues with the first q4_0 quantization type tensors like the ones this quant uses. So I'd hold off on this specific quant for now and choose one that does *not* use q4_0, or, if you can fit it, the `Q4_X` is the full quality version with patched q4_0 tensors.

+ If folks want it, I have a slightly smaller adjusted IQ3_K recipe that now uses q4_x and applies the imatrix only to the iq3_k tensors. Holler at me in this discussion: https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/4#6918a268149cb086f69915ce
+
  Final estimate: PPL = 2.1420 +/- 0.00938

  <details>
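
For context on the `Final estimate: PPL = ...` figures quoted throughout: perplexity here is measured over *wiki.test.raw* and is the exponential of the mean negative log-likelihood per token, with the `+/-` term being the tool's reported uncertainty on that estimate. Below is a minimal sketch of the formula only, using made-up per-token log-probabilities rather than anything measured from these quants.

```c
#include <math.h>
#include <stdio.h>

/*
 * Perplexity over a token stream:
 *   PPL = exp( -(1/N) * sum_i log p(token_i | context) )
 * i.e. the exponential of the mean negative log-likelihood per token.
 */
static double perplexity(const double *logprobs, int n) {
    double nll = 0.0;
    for (int i = 0; i < n; ++i) {
        nll -= logprobs[i];  /* accumulate negative log-likelihood */
    }
    return exp(nll / (double)n);
}

int main(void) {
    /* made-up per-token log-probabilities (natural log), purely illustrative */
    const double lp[] = { log(0.50), log(0.25), log(0.80), log(0.40) };
    printf("PPL = %.4f\n", perplexity(lp, 4));  /* ~2.24 for this toy stream */
    return 0;
}
```

Lower perplexity at a given BPW is the figure of merit here, which is what the perplexity chart above is plotting.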
 