Thank you

#10
by alain401 - opened

Just a big thank you for posting ik-llama.cpp quants.

I find that ik_llama.cpp is both faster and more stable than llama.cpp at this point... and the llama-server webui now works great with ik, but doesn't really with llama.cpp.

Having the optimized quants makes a big difference.

Owner

Glad to hear from a fellow ik_llama.cpp GGUF enjoyer!

Yeah, feel free to put your full command in here if you want to workshop it for optimizing PP/TG etc. depending on your application. Also your CPU/RAM and OS (Linux/Windows). If you haven't tried llama-sweep-bench, it is quite a nice tool for optimizing your setup and offloads. Basically replace your llama-server command with llama-sweep-bench --warmup-batch and a lowish context window to sweep, e.g. -c 8192 etc...
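To sketch that out (the model path and thread count here are placeholders, not a recommendation; the flags mirror the ones mentioned above):

```shell
# Same flags as your llama-server command, but run llama-sweep-bench instead.
# --warmup-batch primes things before measuring; -c 8192 keeps the sweep short
# while still showing how PP/TG change with context depth.
./build/bin/llama-sweep-bench \
    -m models/your-model.gguf \
    --warmup-batch \
    -c 8192 \
    -t 32
```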

Anyway thanks for the good vibes!

My setup is a simple EPYC board with 12 channel RAM, no GPU. So I kept things simple.

My command is:
./build/bin/llama-server -m models/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf -fa -mla 3 -fmoe -ctk q8_0 -ctv q8_0 -t 32 -tb 64

It is a single CPU setup, so I haven't messed with the BIOS yet and don't have numa optimizations to worry about. Any advice on further optimizations would be great.

Owner
•
edited Aug 9

Oh yeah for CPU-only ik_llama.cpp is the way to go. I quantize and test most of these quants on a big dual socket AMD EPYC 9965 192-Core Processor with 768GB RAM per socket presenting as two numa nodes. Nice to be quantizing or inferencing on one socket while calculating perplexity on the other socket haha...

My initial thoughts specific to DeepSeek-R1-0528 (Qwen does not use -mla 3, so it's a bit different):

  1. Even on a single CPU, check that the BIOS is in NPS1 mode for a single NUMA node on the single socket (sounds like you already did this).
  2. Increase batch sizes for higher throughput at the cost of some latency for short prompts, e.g. -ub 4096 -b 4096 is my usual. Should give you pretty good PP gains, which is great for longer context depth applications.
  3. No need to specify -ctv q8_0, as this is MLA so it just takes the single value from -ctk and uses that. Doesn't hurt anything to add, it just prints out a little note on startup.
  4. You might get a little benefit by dropping caches with sudo sync; echo 3 | sudo tee /proc/sys/vm/drop_caches and then running with --no-mmap and waiting for it to malloc(). Kind of slow to start up, but gives a better chance at using THPs (transparent huge pages) for everything, which might help but probably isn't worth the effort hah. You'd really have to A/B test with llama-sweep-bench to notice, probably.

Enjoy that sweet rig!

Didn't know about llama-sweep-bench, super useful.

Ran a few tests with various combinations of threads and batch sizes:

(two attached llama-sweep-bench graphs: PP/TG across thread and batch-size combinations)

All with THP on always and --no-mmap. I find the results a bit puzzling.

@alain401

Very nice, glad you got the llama-sweep-bench figured out including making nice graphs!

All with THP on always and --no-mmap. I find the results a bit puzzling.

It's not clear given this discussion is on a Qwen3 repo, but I'll assume you're testing DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf in the graphs above. Given this is one of my older quants, you'll see it is a "pre-repacked" _r4 quant, which does not benefit from larger batch sizes like the non _r4 quants.

THP (transparent huge pages) doesn't give a very noticeable boost on most rigs, probably. I only saw it on a big 1.5TB dual socket Intel Xeon 6980P rig, which wasn't saturating the memory bandwidth anyway.

Thanks, it all makes sense now.

Sorry for posting on the wrong thread, got excited about trying your DeepSeek quants after adding RAM to my rig...

All good, glad you're having fun and we sorted it out! If you want to try some of my non _r4 DeepSeek quants I believe this one was when I switched away from releasing _r4 quants: https://huggingface.co/ubergarm/DeepSeek-R1-0528-GGUF#iq3_ks-281463-gib-3598-bpw or any of these: https://huggingface.co/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF or the Kimi-K2 models.

Enjoy!
