Updating models with new recipes!

#4
by ubergarm - opened

UPDATING RECIPES

In further testing and benchmarking I've discovered that Kimi-K2-Instruct is very sensitive to quantization of the attn/shexp/blk.0.ffn.* tensors, likely related to it having half the attention heads and only a third as many leading dense FFN layers compared to DeepSeek (and/or possibly to my MLA imatrix methodology). As such, I've modified my recipes to increase the quality of those layers, which improves perplexity while adding only about 6GiB to the final size. These layers are likely running on GPU/VRAM anyway, so the speed penalty should be minimal and worth it for the extra generation quality in most cases.
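
For anyone curious what this kind of recipe tweak looks like mechanically, here is a rough sketch using ik_llama.cpp's llama-quantize per-tensor overrides (assuming the --custom-q regex=type syntax); the regexes and quant types below are illustrative placeholders, not the actual v0.2 recipe.

```bash
#!/usr/bin/env bash
# Illustrative sketch only: keep attention, shared-expert, and the first dense
# FFN layer at higher-quality types while routed experts stay at low BPW.
# Regexes and quant types are placeholders, not the real v0.2 recipe.
custom_q="\
blk\..*\.attn_.*=iq5_ks,\
blk\..*\.ffn_.*shexp.*=iq5_ks,\
blk\.0\.ffn_.*=iq5_ks,\
blk\..*\.ffn_(up|gate|down)_exps.*=iq2_kl"

# Final positional args: input GGUF, output GGUF, base type, thread count.
./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom_q" \
    /path/to/Kimi-K2-Instruct-BF16-00001-of-XXXXX.gguf \
    /path/to/Kimi-K2-Instruct-custom.gguf \
    IQ2_KL 24
```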

If you prefer the original recipe, I've tagged those quants as v0.1 which will remain available.

Sorry for any inconvenience as you re-download these large quants! It will take some time to update and upload each one. Please ask any questions in this discussion! Thanks!

I will update the model card README to indicate the available version next to the name of each quant as each one finishes uploading.

Owner • edited Jul 20

More info on:

  • PPL benchmarks here, with the new recipe shown on the graph as v0.2.
  • Discussion on imatrix MLA tensors here for anyone interested in the details.

Doing some final testing and vibe checking on the new versions, and waiting to hear anything more on the above discussions before uploading.

Earliest I'll upload is end of Sunday July 20th 2025 or early this week.

Heads up: the ik_llama.cpp fork is currently missing; more info here, and hopefully it can come back soon: https://www.reddit.com/r/LocalLLaMA/comments/1m4vw29/ikllamacpp_repository_gone_or_it_is_only_me/

In the meantime there are some other forks you can use to run these quants.

sweep-bench-kimi-k2-second-batch.png

Quick sweep-bench and PPL comparison in one chart. This is a CPU-only build. If you have even a single 3090 Ti, I'm hoping the newer v0.2 will still run at about the same speed, since those ~6GiB of extra tensors will be offloaded onto the GPU. Will test.
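
If you want to sanity-check the PPL side on your own hardware, a minimal run looks roughly like the sketch below (assuming the standard llama-perplexity tool and the usual wiki.test.raw corpus; paths and thread count are placeholders, and absolute values shift a bit with context settings).

```bash
# Rough perplexity-run sketch (CPU-only build shown; paths are placeholders).
./build/bin/llama-perplexity \
    -m /path/to/Kimi-K2-Instruct-smol-IQ1_KT-00001-of-00005.gguf \
    -f /path/to/wiki.test.raw \
    --ctx-size 512 \
    --threads 52
```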

What happened to the ik_llama repo? I can't access it.

@gopi87 see the above comment https://huggingface.co/ubergarm/Kimi-K2-Instruct-GGUF/discussions/4#687d5f8d34bda8c9e4e09d04

I'm continuing to release quants. You can get all the commits (plus more) from https://github.com/Thireus/ik_llama.cpp for now; my impression is that ik will likely attempt to recover the account and restore the project's GitHub repo. I'll keep folks posted on the reddit thread mentioned above if I hear any more.
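
If you need a working build right now, grabbing and building that fork is the usual CMake routine; here is a minimal sketch (drop the CUDA flag for a CPU-only build, and note the option names are assumed to match the usual llama.cpp/ik_llama.cpp conventions).

```bash
# Minimal build sketch for the Thireus fork (flag names assumed to match
# upstream conventions; adjust to taste).
git clone https://github.com/Thireus/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j "$(nproc)"
ls build/bin/   # llama-server, llama-quantize, llama-perplexity, etc.
```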

Owner • edited Jul 21

Introducing the world's smallest Kimi-K2-Instruct quant, smol-IQ1_KT, weighing in at "only" 219.375 GiB (1.835 BPW)!! Sorry, it is too big to fit on your two new RTX 6000 Pro Blackwells with 198GiB of VRAM! 😂🤣😆😭

image.png

ppl-Kimi-K2-uploading-v02.png

The v0.2 series sacrifices just a bit of TG speed for a large improvement in perplexity (generation quality). The KT quants also require more CPU for TG and do well offloaded onto GPU. And to be fair, unsloth did a good job with that iq2_xxs from my measurements (haven't tested KLD yet).

Cheers, and please wish ik all the best on the reddit thread above until this gets sorted! You will need some of the merged PRs available on main of https://github.com/Thireus/ik_llama.cpp for now. Thanks @Thireus

219GB, cool! Finally I can use it on my 256GB RAM server.

Got it, thanks mate, and thanks for the support.

Any update on the ERNIE-4.5-300B-A47B models? I just checked and the Q1 went very well. I'm just looking to find out which model is the best to run on a ~$1.3k budget.

Waiting for my big boy Xeon CPU to arrive from overseas; this IQ1 should hold me over until then... downloading now to test! My poor AM5 memory bus is already shaking in its boots lmao.
Thanks for everything!

Thank you @ubergarm for uploading the updated v0.2 quants.
May I ask, do you have any idea how they compare in composition and PPL with @mradermacher's?

On that latest build of ik_llama.cpp (before the disappearance), are tool calls supposed to be working?

With those last quants, I couldn't seem to get tools working on ik_llama despite messing with the tokenizer for a while.

> On that latest build of ik_llama.cpp (before the disappearance), are tool calls supposed to be working?
>
> With those last quants, I couldn't seem to get tools working on ik_llama despite messing with the tokenizer for a while.

I think the tool calling still wasn't merged into the main repo. I was using a different repo back then, but unfortunately the whole thing got deleted.

I think it was PR #628 (?) with tool support that I merged as well, but that wasn't helping. I left off right about there when the repo got nuked.

Tools are being weird in llama.cpp as well, but claude-code does kinda almost work as intended.

CUDA_VISIBLE_DEVICES="0" LLAMA_ARG_NUMA="numactl" GGML_CUDA_ENABLE_UNIFIED_MEMORY=0 numactl --cpunodebind=0 --membind=0 ./bin/llama-server --model "/home/gopi/Kimi-K2-Instruct-smol-IQ1_KT-00001-of-00005.gguf" --ctx-size 22144 -mla 2 -fa -amb 512 -fmoe --n-gpu-layers 95 --override-tensor exps=CPU -b 1048 -ub 500 --parallel 1 --threads 52 --threads-batch 56 --temp 0.7 --min-p 0.05 --run-time-repack --top-p 0.8 --host 127.0.0.1 --port 8080

I am currently using this and getting 3.3 t/s with 256GB of DDR4 RAM, dual E5 Xeon 280 v4 CPUs, and one RTX 3060 with 12GB VRAM; the responses are pretty good.

@gopi87

> Any update on the ERNIE-4.5-300B-A47B models? I just checked and the Q1 went very well. I'm just looking to find out which model is the best to run on a ~$1.3k budget.

I had an issue open on ik_llama.cpp to add ERNIE support; however, that is gone now. Waiting to see if it comes back, and where. Be warned though that A47B could run slower for TG than big old Kimi-K2-Instruct-1000B-A32B because of the number of active parameters...
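
As a rough back-of-the-envelope for why active parameter count dominates TG speed, here is an illustrative calculation; the 200 GB/s bandwidth and ~2 BPW figures are made-up placeholders, and real throughput also depends on quant type, offload, and overhead.

```bash
# Crude TG ceiling: memory bandwidth / bytes of active weights read per token.
# All numbers are illustrative placeholders.
bw_gbs=200   # hypothetical effective CPU memory bandwidth, GB/s
bpw=2        # assume ~2 bits per weight for a low-bit quant
for entry in "ERNIE-4.5-A47B:47" "Kimi-K2-A32B:32"; do
  name=${entry%%:*}; active_b=${entry##*:}            # active params in billions
  gb_per_tok=$(echo "$active_b * $bpw / 8" | bc -l)   # GB read per generated token
  tps=$(echo "$bw_gbs / $gb_per_tok" | bc -l)
  printf '%s: ~%.1f GB/token -> ~%.0f tok/s ceiling\n' "$name" "$gb_per_tok" "$tps"
done
```

So even though Kimi-K2 is roughly three times larger overall, its smaller active parameter count means fewer bytes touched per generated token.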


@sousekd

> May I ask, do you have any idea how they compare in composition and PPL with @mradermacher's?

I'm not sure if @nicoboss used the latest mainline lcpp PR from compilade to generate the imatrix including the MLA tensors (attn_k_b / attn_v_b), but if so then theirs are likely among the better quantizations available. In general, mradermacher, unsloth, and bartowski quants are fairly similar in perplexity, given they use similar mixes of mainline lcpp quantizations.

My quantizations use ik_llama.cpp exclusives like iq5_ks, iq2_kl, and the newest iq1_kt to really pack in quality at the lowest BPW.

In general, your best bet is probably to find the most BPW you can fit into your specific RAM+VRAM configuration, regardless of who quantizes it. But yeah, it takes a lot of resources to download and test these giant models, so sorry, I have no specific data points on them this time.
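
For a concrete way to think about "the most BPW you can fit", here is a tiny back-of-the-envelope sketch; the ~1027B total parameter count is derived from the 219.375 GiB / 1.835 BPW figures above, and the 10% headroom for KV cache and buffers is just a guess.

```bash
# Back-of-the-envelope: largest average BPW that fits a given RAM+VRAM budget.
# ~1027B total params is derived from 219.375 GiB at 1.835 BPW quoted above.
params_b=1027
ram_gib=256
vram_gib=24
budget_gib=$(echo "($ram_gib + $vram_gib) * 0.9" | bc -l)   # ~10% headroom guess
max_bpw=$(echo "$budget_gib * 1024^3 * 8 / ($params_b * 10^9)" | bc -l)
printf 'Usable budget ~%.0f GiB -> aim for at most ~%.2f BPW\n' "$budget_gib" "$max_bpw"
```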

Some more general discussion of the topic is available here on reddit: https://www.reddit.com/r/LocalLLaMA/comments/1khwxal/the_great_quant_wars_of_2025/

tl;dr:

Q: Who provides the best GGUFs now?

A: They're all pretty good.


> Tools are being weird in llama.cpp as well, but claude-code does kinda almost work as intended.

So a guy from Leipzig had just started working on improving tool calling, but it was not complete nor merged. @mtcl had suggested using some tool-calling wrapper front end that could manage the template properly, I think? I haven't tried it myself. The chat endpoint didn't have the tool role added when I did the PR, so you would probably have to use the completions endpoint and manage the template on the client side, maybe?
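
If anyone wants to experiment with that client-side approach, a bare-bones sketch against the plain completions endpoint might look like this; the prompt string is a placeholder, and you would render Kimi-K2's actual chat template (system prompt, tool definitions, conversation) on the client before sending.

```bash
# Rough sketch: skip the chat endpoint and manage the template client-side.
# The prompt below is a placeholder -- substitute the real client-rendered
# Kimi-K2 template, including tool definitions, before sending.
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "<client-rendered Kimi-K2 template with tool definitions here>",
        "n_predict": 512,
        "temperature": 0.7,
        "stream": false
      }' | jq -r '.content'
```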


> I am currently using this and getting 3.3 t/s with 256GB of DDR4 RAM, dual E5 Xeon 280 v4 CPUs, and one RTX 3060 with 12GB VRAM; the responses are pretty good.

Great job @gopi87! You might see some benefit in TG speed from increasing --threads a little more, given the KT quants tend to be CPU-bound during TG, unlike all the other quants, which are memory-bandwidth bound.
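
If you want to find the sweet spot empirically, a quick thread-count sweep with llama-bench is one way to check; this sketch sticks to the basic flags (model path is a placeholder, and it ignores the GPU-offload options from your server command, so treat the numbers as relative).

```bash
# Quick sketch: compare TG speed across a few thread counts with llama-bench.
# -t takes a comma-separated list; -p 0 skips the prompt-processing test and
# -n 32 generates 32 tokens per run. Model path is a placeholder.
./build/bin/llama-bench \
    -m /path/to/Kimi-K2-Instruct-smol-IQ1_KT-00001-of-00005.gguf \
    -t 48,52,56,60 \
    -p 0 -n 32
```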
