The duality of a budget system...

#3
by phakio - opened

I can run the IQ5 with partial GPU offload, much slower but very accurate with great results, or I can say screw it, shove the IQ1 into my 96GB of VRAM, and send it full speed and get a surprise output! haha Thanks for the quants @ubergarm ! Always a blast to play with the latest models.

Heads up, I wasn't able to run this model on my week-old build of ik_llama; if you are getting errors, recompile from the latest source!

Correct, loading these models required a small PR to support them: https://github.com/ikawrakow/ik_llama.cpp/pull/814

so yep, you'd have to pull + rebuild to be able to run these.
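
For anyone hitting the same errors, a minimal rebuild sketch (assuming a CUDA build and an existing checkout; adjust the cmake flags to your setup):

# pull the latest ik_llama.cpp source so the GLM-4.6 support PR is included
cd ~/ik_llama.cpp && git pull
# reconfigure and rebuild; drop -DGGML_CUDA=ON for a CPU-only build
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j $(nproc)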

I am honestly disappointed and feel a bit rugpulled that there was no 4.6-Air. I would like an update to the Air, as it fits into 96GB of VRAM at 4-bit and would be fast and relatively nimble.

The full model is also good, don't get me wrong. I can run regular 4.6 at IQ4 (or IQ5) with hybrid inference, but maybe it's the model architecture that causes token gen speeds to crater at long context (I use DeepSeek because of this; its TG speeds at longer contexts don't drop as hard from 0).

I agree that this model's architecture is a little slower, especially for hybrid inference. I can run Kimi K2 at 17 t/s generation (albeit IQ2_KL, but still...), whereas this model, with fewer total parameters, is a solid 11.8 t/s generation. Both this model and Kimi K2 have 32B active parameters, so there really is no reason except architecture for why this model is slower at token gen.

I'm still waiting for (hopefully soon) either variable active parameters or the switch to smaller active parameters, like Qwen3-Next 80B-A3B... I can keep dreaming!

switch to smaller active parameters

I hope not. Those models aren't very intelligent; to me they are a nightmare coming off of dense models. Good for data retrieval and not much else. I'll take my 13 t/s from smol_IQ4_KSS or UD-Q3KM_XL and be happy.

I can run Kimi K2 at 17 t/s generation (albeit IQ2_KL, but still...), whereas this model, with fewer total parameters, is a solid 11.8 t/s generation. Both this model and Kimi K2 have 32B active parameters, so there really is no reason except architecture for why this model is slower at token gen.

mainline llama.cpp is faster* with GLM-4.6 for some reason, but the PPL for similar-sized quants just sucks.

  * though I vaguely feel déjà vu, like I've seen this discussed before and there was a way to speed it up in ik_llama.cpp

I am honestly disappointed and feel a bit rugpulled that there was no 4.6-Air.

It is a shame there's no Air for this one, as this one is pretty good. It feels like they distilled Sonnet4 when using it as a standard assistant with /nothink (I appended /nothink and reloaded several of my Sonnet4 chats and got almost the same response).
If this is just 4.5 with some Sonnet distillation (maybe o3 for creative writing), I wonder if we could LoRA 4.6-full outputs -> air-4.5-base.

Owner

@SFPLM

maybe it's the model architecture that causes token gen speeds to crater at long context

Are you using -ctk q8_0 -ctv q8_0? In benchmarks on GLM-4.5, @AesSedai and I found that quantized KV cache seems to tank speeds at long context faster than the unquantized full f16 KV cache, iirc. Might be worth trying full f16 and doing two llama-sweep-bench runs to compare.
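
Something like this, i.e. two llama-sweep-bench runs that differ only in the KV cache type (model path, context size, and offload flags here are placeholders; double-check the flags against your build's --help):

# baseline: unquantized f16 KV cache
./build/bin/llama-sweep-bench -m /path/to/GLM-4.6-quant.gguf -c 32768 -ngl 99 -fa -fmoe
# same run but with quantized q8_0 KV cache, to see where TG drops off faster at long context
./build/bin/llama-sweep-bench -m /path/to/GLM-4.6-quant.gguf -c 32768 -ngl 99 -fa -fmoe -ctk q8_0 -ctv q8_0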

feel a bit rugpulled that there was no 4.6-Air. I would like an update to the Air, as it fits into 96GB of RAM

I'm currently downloading my smol-IQ2_KS, 97.990 GiB (2.359 BPW), to try locally on my AMD 9950X with 2x48GB DDR5@6400MT/s and a 3090 Ti FE 24GB VRAM rig; it's roughly the size of what a GLM-4.6-Air-Q8_0 would have been, hahah... If I can pack it in okay I'll try to get some llama-sweep-bench runs comparing with and without quantized KV cache etc.

mainline llama.cpp is faster* with GLM-4.6 for some reason

I'm curious how you're running it on both systems to confirm your sentiment, as lately ik has been faster again, especially with any CPU/RAM in play.

I'm currently downloading my smol-IQ2_KS, 97.990 GiB (2.359 BPW), to try locally on my AMD 9950X with 2x48GB DDR5@6400MT/s and a 3090 Ti FE 24GB VRAM rig; it's roughly the size of what a GLM-4.6-Air-Q8_0 would have been, hahah... If I can pack it in okay I'll try to get some llama-sweep-bench runs comparing with and without quantized KV cache etc.

Ah, got it. If I wanted something that almost fits into 4x3090 or 3x5090 (well, with a bit of memory gaps because of multi-GPU) or 1x 6000 Blackwell, this might be it, or the IQ1_KT one (that one fits fully).
I would probably try IQ4/IQ5 first, as I can run it with hybrid inference.

It is a shame there's no Air for this one, as this one is pretty good. It feels like they distilled Sonnet4 when using it as a standard assistant with /nothink (I appended /nothink and reloaded several of my Sonnet4 chats and got almost the same response).
If this is just 4.5 with some Sonnet distillation (maybe o3 for creative writing), I wonder if we could LoRA 4.6-full outputs -> air-4.5-base.

Yes exactly! I still would love a miniature version of this!

Are you using -ctk q8_0 -ctv q8_0? In benchmarks on GLM-4.5, AesSedai and I found that quantized KV cache seems to tank speeds at long context faster than the unquantized full f16 KV cache, iirc. Might be worth trying full f16 and doing two llama-sweep-bench runs to compare.

@ubergarm Hmm, so I lost the evidence, but in the past I measured GLM-4.5 Full at around 12 t/s TG at zero context, dropping to around 7-8 t/s at 32k and plummeting even further beyond that (I may have used -ctk and -ctv at the time, but it's been a while). DeepSeek IQ4_K goes from 12.9 t/s at zero context to about 10 t/s at 96k. Recently I tested GLM-4.5 Air Unsloth Q4_K_XL fully offloaded to GPU on ik_llama.cpp, fp16 for both K and V, 4096 u/ub, and it was around 100 t/s TG and ~3300 t/s PP at zero context, then it decremented all the way down to 15 t/s TG and 300 t/s PP at the final chunks (126xxx-131xxx). Though I think it would be similar with your 4-bit quants or others had I used those.

Owner

I downloaded and ran some benchmarks with the smol-IQ2_KS as mentioned above, threw the graph into a discussion here: https://huggingface.co/ubergarm/GLM-4.6-GGUF/discussions/5

Surprisingly usable; probably one of the best models to run locally on my gaming rig with hybrid CPU+GPU. Not blazing fast, but enough for local single-user chats. It might work with MCP agentic stuff too; I might mess with that again eventually, given recent model updates seem to be prioritizing that capability.
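
For anyone wanting to try something similar, a rough starting-point command for hybrid CPU+GPU inference (paths, context size, and thread count are placeholders to adjust for your own rig; see the linked discussion for the exact setup and numbers):

# -ot exps=CPU keeps attention/shared experts on the GPU and routes the big expert tensors to system RAM
# q8_0 KV cache saves VRAM at some long-context speed cost (see the KV cache discussion above)
./build/bin/llama-server -m /path/to/GLM-4.6-smol-IQ2_KS.gguf -c 32768 -ngl 99 -ot exps=CPU -fa -fmoe -ctk q8_0 -ctv q8_0 -t 16 --host 127.0.0.1 --port 8080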

I'm curious how you're running it on both systems to confirm your sentiment, as lately ik has been faster again, especially with any CPU/RAM in play.

Re-reading my post, I should clarify: this performance discrepancy only shows up when fully offloading one of the mainline-compatible "Good luck everybody" quants I can fit in VRAM. Specifically:

GLM-4.6-UD-IQ2_XXS-00001-of-00003.gguf (Final estimate: PPL = 5.5401 +/- 0.03646)

Pretty bad, so I'm not going to bother. And once I touch the CPU/DDR5, ik_llama.cpp runs circles around mainline llama.cpp.
If I get the chance I'll do a llama-bench rather than just a vague "mainline is faster" lol.

P.S. I found this interesting quant earlier: sm54/GLM-4.6-MXFP4_MOE

👈 Perplexity
system_info: n_threads = 24 / 48 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 269.314 ms
perplexity: calculating perplexity over 73 chunks, n_ctx=4096, batch_size=2048, n_seq=1
perplexity: 35.69 seconds per pass - ETA 43.42 minutes
[1]1.9677,[2]2.9376,[3]2.9671,[4]3.2373,[5]3.6088,[6]3.9264,[7]4.0722,[8]4.3010,[9]4.1256,[10]3.9063,[11]3.8073,[12]3.6721,[13]3.6322,[14]3.8034,[15]3.8205,[16]3.7266,[17]3.7582,[18]3.7377,[19]3.7300,[20]3.7499,[21]3.7773,[22]3.9340,[23]3.8383,[24]3.9181,[25]3.9323,[26]3.9272,[27]3.9143,[28]3.9069,[29]3.9151,[30]3.9383,[31]3.9948,[32]4.0408,[33]4.0775,[34]4.1167,[35]4.1171,[36]4.0586,[37]4.1233,[38]4.0776,[39]4.1003,[40]4.0875,[41]4.0667,[42]4.0187,[43]4.0271,[44]4.0476,[45]4.0747,[46]4.1562,[47]4.1719,[48]4.2335,[49]4.2836,[50]4.3408,[51]4.2996,[52]4.2375,[53]4.1815,[54]4.1492,[55]4.1097,[56]4.0647,[57]4.0456,[58]4.0278,[59]4.0489,[60]4.0471,[61]4.0954,[62]4.1066,[63]4.1042,[64]4.1214,[65]4.1083,[66]4.1255,[67]4.1027,[68]4.0948,[69]4.0693,[70]4.0413,[71]4.0163,[72]3.9893,[73]4.0090,
Final estimate: PPL = 4.0090 +/- 0.02391
At that size / perplexity it's not practical to use, but interesting to see people making GPT-OSS style quants now.

@ubergarm Here's a comparison. Same quant, 6xRTX3090. I noticed the parameter count and size don't match.

mainline

~/mainline_llamacpp.cpp/build/bin/llama-bench -m /models/gguf/GLM-4.6-UD/GLM-4.6-UD-IQ2_XXS-00001-of-00003.gguf -ngl 999 -fa on

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B IQ2_XXS - 2.0625 bpw | 107.47 GiB | 356.79 B | CUDA | 999 | pp512 | 239.17 ± 2.17 |
| glm4moe 355B.A32B IQ2_XXS - 2.0625 bpw | 107.47 GiB | 356.79 B | CUDA | 999 | tg128 | 28.40 ± 0.15 |

ik_llama.cpp

~/ik_llama.cpp/build/bin/llama-bench -m /models/gguf/GLM-4.6-UD/GLM-4.6-UD-IQ2_XXS-00001-of-00003.gguf -ngl 999 -fa on

| model | size | params | backend | ngl | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B IQ2_XXS - 2.0625 bpw | 106.09 GiB | 352.80 B | CUDA | 999 | pp512 | 238.95 ± 2.58 |
| glm4moe 355B.A32B IQ2_XXS - 2.0625 bpw | 106.09 GiB | 352.80 B | CUDA | 999 | tg128 | 21.51 ± 0.08 |
Owner

@gghfez

At that size / perplexity it's not practical to use, but interesting to see people making GPT-OSS style quants now.

Oh, so strange that they are using MXFP4 quantization on this model. Just because you can, doesn't always mean you should xD... Assume you are using wiki.test.raw perplexity with the usual 512 default context; it doesn't seem to be doing great compared to other quantizations of that size. A 200GiB model clocking ~4 perplexity? I didn't look at their exact mix, but I keep most of the attn/shexp/first 3 dense layers much larger, or even full size, and only smash the routed experts. But fascinating to see what people are trying!
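
For reference, that kind of run is just the standard wiki.test.raw perplexity pass, roughly like this (model path and offload flags are placeholders; -c 512 is passed explicitly rather than relying on whatever the build's default is):

# wiki.test.raw perplexity with 512-token chunks, the context size most published numbers use
./build/bin/llama-perplexity -m /path/to/GLM-4.6-quant.gguf -f wiki.test.raw -c 512 -ngl 99 -fa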

Here's a comparison. Same quant, 6xRTX3090. I noticed the parameter count and size don't match.

Ahh, thanks for following up with more details! For ik_llama.cpp you still want to run fused MoE, so add -fmoe 1 to llama-bench and you should see a boost up to or beyond mainline. Also add -ub 4096 -b 4096, assuming you don't OOM, and PP will blow mainline out of the water.
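
i.e. taking your ik_llama.cpp command from above, something like:

# fused MoE plus larger batch/ubatch; back the batch sizes off if you OOM
~/ik_llama.cpp/build/bin/llama-bench -m /models/gguf/GLM-4.6-UD/GLM-4.6-UD-IQ2_XXS-00001-of-00003.gguf -ngl 999 -fa on -fmoe 1 -ub 4096 -b 4096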

Just because you can, doesn't always mean you should xD.

Haha yeah, though tbf, it's just been dumped up there with no model card / hype.

You were right: with -fmoe 1 it matches mainline, thanks!

| model | size | params | backend | ngl | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B IQ2_XXS - 2.0625 bpw | 106.09 GiB | 352.80 B | CUDA | 999 | 0 | pp512 | 253.20 ± 2.55 |
| glm4moe 355B.A32B IQ2_XXS - 2.0625 bpw | 106.09 GiB | 352.80 B | CUDA | 999 | 0 | tg128 | 28.96 ± 0.14 |

-b / -ub don't affect prompt processing* (I guess because I'm only testing pp512)

| model | size | params | backend | ngl | n_batch | n_ubatch | fmoe | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| glm4moe 355B.A32B IQ2_XXS - 2.0625 bpw | 106.09 GiB | 352.80 B | CUDA | 999 | 4096 | 4096 | 0 | pp512 | 244.78 ± 2.16 |
| glm4moe 355B.A32B IQ2_XXS - 2.0625 bpw | 106.09 GiB | 352.80 B | CUDA | 999 | 4096 | 4096 | 0 | tg128 | 28.94 ± 0.13 |

Makes a difference at longer contexts as expected.

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
| --- | --- | --- | --- | --- | --- | --- |
| 4096 | 1024 | 0 | 10.007 | 409.31 | 35.005 | 29.25 |
| 4096 | 1024 | 4096 | 13.223 | 309.77 | 41.073 | 24.93 |

Assume you are using wiki.test.raw perplexity with the usual 512 default context

I've been running it at 4096 🤦‍♂️
That explains the discrepancies between my numbers for the quants and your graph:

GLM-4.6-IQ3_KS-00001-of-00003.gguf (4096)

Final estimate: PPL = 4.0287 +/- 0.02365

GLM-4.6-IQ2_KL-00001-of-00003.gguf (4096)

Final estimate: PPL = 4.3505 +/- 0.02625

But yeah, wiki.test.raw. I'm going to settle on IQ3_KS.
Thanks for the tip about -fmoe 1.

Most fire discussions on Hugging Face. Thank you all for the info.

Heads up, someone has tried a novel SVD-based distillation of GLM-4.6 -> GLM-4.5-Air.
I haven't tested it yet, but in theory it could work really well, given the 4.5 -> 4.6 knowledge delta is likely fairly small.
BasedBase/GLM-4.5-Air-GLM-4.6-Distill

His Qwen 480B -> 30B models are getting praised in the discussions.

Edit: the "Heads up" must have gotten into my mind because of the

Heads up, I ...

in the OP. But what I mean is, this looks exciting and worth trying out!

Owner

But what I mean is, this looks exciting and worth trying out!

Thanks for the heads up! Did anyone give it a go or get perplexity/KLD/benchmarks? Also, supposedly an official GLM-4.6-Air might become a reality within a couple of weeks; I'll definitely try to quantize the official one despite its annoying tensor sizes restricting us to iq4_nl, hah

@ubergarm Yeah my bad, I should have actually tried it myself before posting it here (I got a little too excited).

Turns out it didn't actually do anything at all, and all the praise it was getting was either placebo effect or different samplers.

If you downcast the weights to bf16 and then tensor-diff them, they're identical to GLM-4.5-Air 😂

Note: I think he genuinely believed it was working / had good intentions.
He vibe-coded all the tooling and Gemini probably convinced him that it would work.
