---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: moonshotai/Kimi-K2-Thinking
license: other
license_name: modified-mit
license_link: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/LICENSE
base_model_relation: quantized
tags:
- mla
- imatrix
- conversational
- ik_llama.cpp
---

## imatrix Quantization of moonshotai/Kimi-K2-Thinking

*UPDATE*: The `smol-IQ3_KS` scored 77.3% on the [aider polyglot benchmark](https://aider.chat/docs/leaderboards/) with a 2x speed-up over the similarly sized mainline `UD-IQ3_XXS`! Details [in discussion 14 here](https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/14#691e699a1d650ccb35814793). Thanks Fernanda24!

The "full quality" baseline `Q4_X` quant runs on both mainline llama.cpp and ik_llama.cpp. The other quants in this collection **REQUIRE** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

*NOTE*: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.

Some of ik's new quants are supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP, with Windows builds for CUDA 12.9. Also check out the [Windows builds by Thireus](https://github.com/Thireus/ik_llama.cpp/releases), which have been built against CUDA 12.8.

These quants provide best-in-class perplexity for the given memory footprint.

## Big Thanks

Great job ngxson, compilade, DevQuasar, Bartowski, AesSedai, and more folks who pulled together hacking to get this out quickly! 🫶 And jukofyork for the `Q4_X` patch!

Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!

Finally, I *really* appreciate all the support from [aifoundry.org](https://aifoundry.org), so check out their open source RISC-V solutions, and of course huggingface for hosting all these big quants!

## Quant Collection

Perplexity computed against *wiki.test.raw*.

![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")

## Q4_X 543.617 GiB (4.549 BPW)

The `Q4_X` version scores perplexity equivalent to a full 1TB Q8_0 test quant, using a one-line patch that adjusts q4_0 to better fit the original QAT target quantization. Discussion is ongoing on [llama.cpp PR#17069](https://github.com/ggml-org/llama.cpp/pull/17069) and [directly with moonshot on their huggingface discussions](https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/26), as it seems they may have only used 15 of the 16 possible 4-bit values.

Final estimate: PPL = 2.0818 +/- 0.00903

This is the "full quality" baseline version of the model and the only one in this collection which works on *both* ik_llama.cpp and mainline llama.cpp.
It does *not* use an imatrix and was created by first converting the original model to full bf16 before further quantization. The exact PR used is linked below in the References. This quant was then used to compute the imatrix for the rest of the collection.
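For orientation, here is a minimal sketch of what that BF16 conversion step typically looks like with mainline's `convert_hf_to_gguf.py`. This is *not* the exact command used (the actual conversion relied on the PR linked in the References), and all paths here are placeholders:

```bash
# Hypothetical sketch only: convert the original safetensors release to a BF16
# GGUF before quantizing. Paths are placeholders; the real conversion used the
# work-in-progress PR linked in the References, which may add model-specific handling.
python3 convert_hf_to_gguf.py \
    --outtype bf16 \
    --outfile /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-BF16.gguf \
    /mnt/data/models/moonshotai/Kimi-K2-Thinking/
```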
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Q4_0 (patched) routed experts approximating original QAT design
# Q8_0 everything else

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=q4_0
blk\..*\.ffn_(gate|up)_exps\.weight=q4_0

token_embd\.weight=q8_0
output\.weight=q8_0
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-Q8_0-Q4_0.gguf \
    Q8_0 \
    128
```

</details>
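The imatrix for the rest of the collection was computed from this Q4_X quant against my calibration corpus (linked in the References). A hedged sketch of a typical `llama-imatrix` invocation; flags and paths are illustrative, not the exact command used:

```bash
# Illustrative only: compute an importance matrix from the Q4_X quant using the
# calibration corpus linked in the References. Exact flags/paths may differ.
numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-imatrix \
    -m /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-Q8_0-Q4_0.gguf \
    -f ubergarm-imatrix-calibration-corpus-v02.txt \
    -o /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    --ctx-size 512 \
    --threads 128
```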
## smol-IQ4_KSS 485.008 GiB (4.059 BPW)
Final estimate: PPL = 2.1343 +/- 0.00934
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq4_kss
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss

token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ4_KSS.gguf \
    IQ4_KSS \
    128
```

</details>
## IQ3_K 459.432 GiB (3.845 BPW)
Final estimate: PPL = 2.1456 +/- 0.00941

*NOTE*: Given there were some issues with the original q4_0 quantization, I've replaced the original IQ3_K with this new smaller one using the patched q4_x quantization. The original was `474.772 GiB (3.973 BPW)` and will be squash-deleted soon to save on public quota. This new one uses the patched q4_x and only applies the imatrix to the iq3_k tensors, *not* to the q8_0 or q4_x tensors. More details in [discussion 4 here](https://huggingface.co/ubergarm/Kimi-K2-Thinking-GGUF/discussions/4#6918a268149cb086f69915ce). It has almost the same perplexity at a smaller size, so a good improvement.
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=q4_0
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    --include-weights ffn_gate_exps \
    --include-weights ffn_up_exps \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-IQ3_K.gguf \
    IQ3_K \
    128
```

</details>
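If you want to double-check which quantization type each tensor actually ended up with in a finished file, the `gguf_dump.py` script shipped with llama.cpp's `gguf-py` package can list per-tensor types. A rough sketch only; the script path and output format vary between checkouts:

```bash
# Illustrative only: dump tensor metadata and filter for the routed experts to
# confirm their quant types. The script path may differ in your checkout.
python3 ./gguf-py/gguf/scripts/gguf_dump.py \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-IQ3_K.gguf \
    | grep -E 'ffn_(gate|up|down)_exps'
```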
## smol-IQ3_KS 388.258 GiB (3.249 BPW)
Final estimate: PPL = 2.2363 +/- 0.01004
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_ks

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ3_KS.gguf \
    IQ3_KS \
    128
```

</details>
## IQ2_KL 348.883 GiB (2.920 BPW)
Final estimate: PPL = 2.3735 +/- 0.01082
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-IQ2_KL.gguf \
    IQ2_KL \
    128
```

</details>
## smol-IQ2_KL 329.195 GiB (2.755 BPW)
Final estimate: PPL = 2.4550 +/- 0.01129
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ2_KL.gguf \
    IQ2_KL \
    128
```

</details>
## smol-IQ2_KS 270.133 GiB (2.261 BPW)
Final estimate: PPL = 2.9361 +/- 0.01451
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq2_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ2_KS.gguf \
    IQ2_KS \
    128
```

</details>
## smol-IQ1_KT 218.936 GiB (1.832 BPW)
Final estimate: PPL = 3.5931 +/- 0.01889

*Only for the desperate.* Also keep in mind that `KT` trellis quants are generally slower during token generation (TG) due to a likely compute bottleneck when running on CPU, but if this is all you can fit, then it may still be worth it.
<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

custom="
## Attention [0-60] (GPU)
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0

# Balance of attn tensors
blk\..*\.attn_kv_a_mqa\.weight=q8_0
blk\..*\.attn_q_a\.weight=q8_0
blk\..*\.attn_q_b\.weight=q8_0
blk\..*\.attn_output\.weight=q8_0

## First Single Dense Layer [0] (GPU)
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

## Shared Expert [1-60] (GPU)
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

## Routed Experts [1-60] (CPU)
blk\..*\.ffn_down_exps\.weight=iq1_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq1_kt

token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N ${SOCKET} -m ${SOCKET} \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/imatrix-Kimi-K2-Thinking-Q8_0-Q4_0.dat \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/-384x14B-BF16-00001-of-00046.gguf \
    /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-IQ1_KT.gguf \
    IQ1_KT \
    128
```

</details>
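If you are weighing this quant against one of the larger ones, `llama-bench` gives a quick apples-to-apples look at prompt processing (pp) and token generation (tg) speed on your own hardware. A minimal sketch only; paths and thread count are placeholders, and you would normally add the same offload flags you use with `llama-server`:

```bash
# Illustrative only: benchmark two quants back to back to compare pp/tg speed.
# Add -ngl / offload flags to match how you actually run the model, if supported.
./build/bin/llama-bench \
    -m /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-IQ1_KT.gguf \
    -m /mnt/data/models/ubergarm/Kimi-K2-Thinking-GGUF/Kimi-K2-Thinking-smol-IQ2_KS.gguf \
    -p 512 \
    -n 128 \
    -t 96
```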
## Quick Start

You will want to override the chat template, given they have since patched the original template here: https://huggingface.co/moonshotai/Kimi-K2-Thinking/blob/main/chat_template.jinja

You can do stuff like `--jinja --chat-template-file ./models/templates/Kimi-K2-Thinking.jinja`. You will also need to pass `--special` for it to output the `<think>` and `</think>` tags correctly, depending on the endpoint and client used, thanks [u/Melodic-Network4374](https://www.reddit.com/r/LocalLLaMA/comments/1oqo57j/comment/nnpqxjx/). Note it will then also print out `<|im_end|>`, so set your client to use that as a stop string.

```bash
# Example running hybrid CPU+GPU(s) on ik_llama.cpp
./build/bin/llama-server \
    --model "$model" \
    --alias ubergarm/Kimi-K2-Thinking-GGUF \
    --ctx-size 32768 \
    -ctk q8_0 \
    -mla 3 \
    -ngl 99 \
    -ot "blk\.(1|2|3)\.ffn_.*=CUDA0" \
    -ot "blk\.(4|5|6)\.ffn_.*=CUDA1" \
    -ot exps=CPU \
    --parallel 1 \
    --threads 96 \
    --threads-batch 128 \
    --host 127.0.0.1 \
    --port 8080 \
    --no-mmap \
    --jinja \
    --chat-template-file updatedChatTemplate.jinja \
    --special

# Example running mainline llama.cpp
# remove `-mla 3` from the command and you should be :gucci:
```

If you have no GPU(s), just remove the `-ngl` and `-ot` lines. If you don't have enough RAM+VRAM, remove `--no-mmap` to mmap() "troll rig" it, paging weights read-only off of disk for maybe a couple tok/sec depending on your storage.

Adjust `--threads` and `--threads-batch` as needed. For smaller CPUs I recommend setting them both equal to the number of physical cores; for an AMD 9950X that would be `-t 16`, for example. Experiment on larger rigs, especially with multi-socket NUMA considerations (avoid cross-NUMA memory access if possible).

With ik_llama.cpp you can free up some extra VRAM by using `-amb 512` to fix the size of the MLA computation buffers (only works on models with MLA-style attention like Kimi-K2 and DeepSeek).

## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
* [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
* [ubergarm-imatrix-calibration-corpus-v02.txt](https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a?permalink_comment_id=5682584#gistcomment-5682584)
* [moonshotai/Kimi-K2-Thinking/discussions/2](https://huggingface.co/moonshotai/Kimi-K2-Thinking/discussions/2)
* [vllm-project/compressed-tensors/issues/511](https://github.com/vllm-project/compressed-tensors/issues/511)
* [llama.cpp PR#17069](https://github.com/ggml-org/llama.cpp/pull/17069#issuecomment-3500870165)
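Once the `llama-server` from the Quick Start above is running, a quick way to sanity-check it is a request against the OpenAI-compatible chat endpoint. This is just an illustrative snippet (port and stop string follow the example command above):

```bash
# Illustrative only: hit the llama-server OpenAI-compatible endpoint started in
# the Quick Start example, using <|im_end|> as the stop string since --special
# causes it to be printed literally.
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/Kimi-K2-Thinking-GGUF",
          "messages": [{"role": "user", "content": "Say hello and stop."}],
          "stop": ["<|im_end|>"],
          "max_tokens": 256
        }'
```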