Request for IQ2_KS

#8
by gnadlr - opened

Hello, thank you for the incredible work you have done on these IQK quants. Any chance of an IQ2 that fits into 128GB RAM? I know most of the quants are built for some common amount of RAM plus a small GPU, but surprisingly the speed running on CPU+RAM alone is pretty usable.

gnadlr changed discussion title from Request for IQ2_KS/KSS to Request for IQ2_KS

@gnadlr

Ahh so total 128GB RAM+VRAM where VRAM=0 ? haha... I think I understand. Yes some of these big MoEs are quite good on CPU-only backend if you have enough memory bandwidth.

I could possibly use the IQ2_KT and IQ3_KT, though these tend to be CPU-bottlenecked for TG, given that calculating the trellis on CPU is expensive.

So maybe an IQ2_K/IQ3_K, or KS, yes, would be about the right size at say a 110GB target, to leave enough RAM for context and a few tabs of firefox open lol...

Let me see if I have all the files still handy, might be able to pop one out today

Doing some quick estimation the two closest options look like:

  • smol-IQ2_KL ~117 GiB ffn_(gate|down|up)_exps at iq2_kl
  • IQ2_KS ~ 110 GiB ffn_(gate|up)_exps at iq2_ks and down at iq3_ks

I'm not gonna cook both to compare PPL; I'll just go with the IQ2_KS, as I think the smol-IQ2_KL would still be a bit too tight for comfortable daily driving with sufficient context and enough RAM left over for your desktop.

@gnadlr

Okay here you go, enjoy! https://huggingface.co/ubergarm/GLM-4.5-GGUF#iq2_ks-111404-gib-2671-bpw

IQ2_KS 111.404 GiB (2.671 BPW)
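For anyone wanting to redo this kind of fit check for a different RAM size, here is a minimal sketch of the arithmetic, using only the two numbers quoted above (file size in GiB and average BPW). The function names are made up for illustration, and reading "128GB RAM" as 128 GiB is an assumption; the OS and desktop will take their share of that too.

```python
GIB = 2**30  # bytes per GiB

def implied_params(size_gib: float, bpw: float) -> float:
    """Number of weights implied by a file size and an average bits-per-weight."""
    return size_gib * GIB * 8 / bpw

def size_gib(params: float, bpw: float) -> float:
    """File size in GiB for a given weight count and average bits-per-weight."""
    return params * bpw / 8 / GIB

def headroom_gib(total_ram_gib: float, model_gib: float) -> float:
    """RAM left for context, OS, and desktop once the weights are loaded."""
    return total_ram_gib - model_gib

# Numbers from the quant listing above: 111.404 GiB at 2.671 BPW.
params = implied_params(111.404, 2.671)
print(f"implied weights: {params / 1e9:.0f}B")
print(f"headroom in 128 GiB: {headroom_gib(128, 111.404):.1f} GiB")
```

This works out to roughly 358 billion weights and about 16.6 GiB of headroom before context and everything else running on the machine.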

Thank you very much, that was very fast. I am downloading now. I think it should fit nicely, since I was also able to fit your Qwen3 235B Q4_KSS at 115 GiB.

I think we will soon see consumer motherboards able to run 4 sticks of 64GB at > 6000MT. So I'm holding off on investing in VRAM for now.

> I think we will soon see consumer motherboards able to run 4 sticks of 64GB at > 6000MT. So I'm holding off on investing in VRAM for now.

Yeah, while AMD only guarantees up to DDR5-3600MT/s in the 4x populated DIMM configuration, it is getting more common to find people winning the "silicon lottery" and getting 6000MT/s working on the newer boards and RAM, like this redditor discusses with me here: https://www.reddit.com/r/LocalLLaMA/comments/1n1h6xx/comment/naycx1g/
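To put rough numbers on why the DIMM speed matters for CPU inference, here is a small sketch of theoretical peak bandwidth. It assumes a dual-channel consumer platform with a 64-bit (8-byte) data path per channel, so four DIMMs still means two channels. The `tg_upper_bound` helper and the 10 GiB per token figure in the example are purely hypothetical placeholders, not measurements of any model in this thread, and real throughput lands below these ceilings.

```python
def peak_bandwidth_gbs(mt_per_s: int, channels: int = 2, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (decimal) for a given transfer rate."""
    return mt_per_s * 1e6 * bus_bytes * channels / 1e9

def tg_upper_bound(bandwidth_gbs: float, active_gib_per_token: float) -> float:
    """Ceiling on tokens/s if every token streams this many GiB of weights from RAM."""
    return bandwidth_gbs * 1e9 / (active_gib_per_token * 2**30)

for speed in (3600, 6000, 6400):
    bw = peak_bandwidth_gbs(speed)
    # 10 GiB of active weights per token is a hypothetical figure for illustration.
    print(f"DDR5-{speed}: {bw:.1f} GB/s peak, <= {tg_upper_bound(bw, 10):.1f} tok/s")
```

Going from DDR5-3600 to DDR5-6000 raises the theoretical peak from 57.6 GB/s to 96 GB/s, which is why getting 4 sticks stable at 6000MT/s is such a big deal for TG.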

Having a single GPU is still nice to speed up PP and kv-cache stuff even on the big MoEs: keep the attn/shexp/first N dense layers, which are always-active parameters, on the GPU, and the routed experts on CPU/RAM.
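As a rough illustration of that split, here is a sketch that sorts tensor names into GPU and CPU buckets with a regex. The tensor names are assumed GGUF-style examples (the `ffn_(gate|down|up)_exps` pattern comes from the recipes earlier in this thread, the rest are illustrative), and `placement` is a hypothetical helper, not something ik_llama.cpp provides.

```python
import re

# Assumption: routed expert tensors are the ones named ffn_(gate|down|up)_exps.
ROUTED_EXPERTS = re.compile(r"\.ffn_(gate|down|up)_exps\.")

def placement(tensor_name: str) -> str:
    """Routed experts go to CPU/RAM; everything else (attn, shexp, dense) goes to GPU."""
    return "CPU" if ROUTED_EXPERTS.search(tensor_name) else "GPU"

# Illustrative tensor names only, not read from a real model file.
names = [
    "blk.0.attn_q.weight",          # attention
    "blk.0.ffn_up.weight",          # dense FFN in an early layer
    "blk.5.ffn_gate_shexp.weight",  # shared expert
    "blk.5.ffn_down_exps.weight",   # routed experts
]
for name in names:
    print(f"{name:30s} -> {placement(name)}")
```

The point of the split is that the small always-active tensors fit in 24GB of VRAM, while the routed experts, which make up most of the file size, stay in system RAM.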

But yeeah that VRAM is still very pricy!

Oh, found a good video by Wendell about 4x DIMMs here: https://youtu.be/JLZ9Au-4DJs?t=615

Sorry if this is off topic, but are any of the quant types particularly well optimized for x86 CPUs? For instance, are the R4 quants like iq2_k_r4 (2.375 bpw) faster on CPU than iq2_ks (2.1875 bpw) or iq2_kl (2.6875 bpw)?

What about GPU? For example, is there a reason you chose iq5_ks instead of iq5_ks_r4?

Is there any point in even looking at the bitnet iq2_bn_r4? I know all the KT trellis quants are slow on CPU, but I'm still trying to wrap my head around the rest.


I'm asking because I bit the bullet and ordered 2x 64GB sticks for my rig, and have some upcoming time to try these recipes myself! Or maybe Thireus's quant script, along with a backlog of other stuff...

Oh, and for reference, my setup will be a 3090 + 2x 64GB at 6000MT/s, hopefully. That's right in the range of your IQ2_KL, but I intend to try similar recipes with the GLM-4.5 base, and with ERNIE 300B.

Also, slightly more on topic, this is probably a great quant for those Ryzen AI Max+ 395 folks out there. I wonder if any are running ik_llama.cpp.

I'd love an ITX board with one TBH, with an x8 slot for a GPU. But it does not exist :(

More info on which boards and which sticks have been validated to work with 4x 64GB sticks at >= 6000MT/s:
https://www.gskill.com/community/1502239313/1745234238/G.SKILL-Reveals-Worlds-First-Large-Capacity-256GB-64GBx4-DDR5-U-DIMM-Memory-at-DDR5-6000-CL32-Overclocked-Speed
ASUS ROG CROSSHAIR X870E HERO
MSI MPG X870E CARBON WIFI
MSI MEG X870E GODLIKE
MSI MAG B850M MORTAR WIFI
Even B850 boards are already able to support it, which means there should be a lot more capable mid-range boards in the future.
@ubergarm Can you share which rigs you are running?

@gnadlr

> Can you share which rigs you are running?

My home gaming rig is an AMD 9950X with 2x48GB DDR5-6400MT/s and a 3090 Ti FE with 24GB VRAM.

I have access to some remote rigs with Wendell of level1techs.com, including a Threadripper Pro 24-core (Zen4) and a big dual-socket Zen5 EPYC rig with a ton of RAM, which I use for most of the quanting and imatrix runs, CPU only!
