Actually slower than Q4K and Q3K quants

#1 opened by Lockout

In IK_llama these quants are slower than much larger full-size GLM quants. Yeah, you can fit a few more layers, but the overhead from dequantizing MXFP4 is pretty big. To date, no proof has been posted that they're any better than traditional quantizations.

Do they win on size? Well... I have no other quants of this model to compare against.
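
For context, this is roughly what expanding one MXFP4 block involves (a conceptual Python sketch based on the OCP Microscaling format of 32 E2M1 weights sharing one E8M0 power-of-two scale; the nibble packing here is my assumption, not necessarily ggml's exact layout). Every block has to go through this lookup-and-rescale before the matmul can run in FP16/BF16, which is where the overhead comes from when there's no fused kernel.

```python
import numpy as np

# E2M1 (FP4) code -> value table from the OCP Microscaling spec:
# codes 0-7 are +{0, 0.5, 1, 1.5, 2, 3, 4, 6}, codes 8-15 are their negatives.
FP4_VALUES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)

def dequant_mxfp4_block(scale_e8m0: int, packed: np.ndarray) -> np.ndarray:
    """Expand one 32-weight MXFP4 block to float32.

    scale_e8m0 : shared E8M0 scale byte, value = 2 ** (scale_e8m0 - 127)
    packed     : 16 uint8 bytes holding 32 packed 4-bit E2M1 codes
                 (nibble order is an assumption, not ggml's exact layout)
    """
    codes = np.concatenate([packed & 0x0F, packed >> 4])   # unpack 32 codes
    scale = np.float32(2.0) ** (int(scale_e8m0) - 127)     # power-of-two scale
    return FP4_VALUES[codes] * scale                       # lookup + rescale per weight
```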

The hardware support isn't there yet, that's true. I've added a modified IQ3_XXS quant that was made using an imatrix, and I think it should be comparable to this one at a slightly smaller size (to fit in 128 GB). The model is here; it should also be faster:

https://huggingface.co/sm54/GLM-4.6-REAP-268B-A32B-128GB-GGUF
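
For reference, the rough pipeline for that kind of quant looks like this (a sketch: binary names and flags are the mainline llama.cpp ones, the file names are placeholders, and ik_llama.cpp's fork may differ slightly):

```python
import subprocess

# 1. Collect an importance matrix from calibration text against a high-precision GGUF.
subprocess.run([
    "./llama-imatrix",
    "-m", "GLM-4.6-REAP-268B-A32B-BF16.gguf",  # placeholder source model
    "-f", "calibration.txt",                   # placeholder calibration data
    "-o", "imatrix.dat",
], check=True)

# 2. Quantize to IQ3_XXS, using the imatrix to weight the rounding error.
subprocess.run([
    "./llama-quantize",
    "--imatrix", "imatrix.dat",
    "GLM-4.6-REAP-268B-A32B-BF16.gguf",
    "GLM-4.6-REAP-268B-A32B-IQ3_XXS.gguf",
    "IQ3_XXS",
], check=True)
```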

I don't think it's even a matter of HW support. The MXFP4 weights get dequantized to FP16/BF16/etc. on every backend, and CPU offload will never use the format natively :(

The prune itself is also pretty cooked. I'm trying it, and all it can be is mean. It got into a loop quickly.

[image attachment]

That's the kitty it can make. The model often loses coherence with ChatML, and its English is affected this time. I'll try the larger one and compare it to AesSedai's Q4K; the latter was mostly alright in the native template and OOD.
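
A quick way to compare the two prompt formats is something like this (a sketch; the repo id is a placeholder, so point it at whichever GLM-4.6 / REAP checkpoint is actually under test):

```python
from transformers import AutoTokenizer

# Repo id is a placeholder; use the checkpoint you're testing.
tok = AutoTokenizer.from_pretrained("zai-org/GLM-4.6")

messages = [{"role": "user", "content": "Draw an ASCII kitty."}]

# Native template, exactly as shipped in the tokenizer config.
native_prompt = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Hand-rolled ChatML, which this prune seems to handle much worse.
chatml_prompt = "<|im_start|>user\nDraw an ASCII kitty.<|im_end|>\n<|im_start|>assistant\n"

print(native_prompt)
print(chatml_prompt)
```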

Yeah, I agree re the prunes. Both of the models I've quantised have high perplexity, so I think significant quality has been lost, which may be why there aren't other quantised versions available.
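
In case anyone wants to check for themselves, perplexity is usually measured with something like this (a sketch: mainline llama.cpp binary name and flags, placeholder paths; the test text is whatever you prefer, wikitext-2 here):

```python
import subprocess

# Measure perplexity of the quant over the wikitext-2 raw test split.
subprocess.run([
    "./llama-perplexity",
    "-m", "GLM-4.6-REAP-268B-A32B-IQ3_XXS.gguf",  # quant under test (placeholder)
    "-f", "wiki.test.raw",                        # wikitext-2 raw test split
    "-c", "512",                                  # evaluation context length
], check=True)
```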

sm54 changed discussion status to closed
