Any chance for an IQ3_K?

#6
by original-el8 - opened

The IQ4_KSS is good - and does fit on 2x 6000 Blackwells, but I gotta keep context relatively small. Any chance for an IQ3 that could hit that space between IQ4_KSS and IQ2_KL, before ppl starts going off the rails? Maybe at something like ~160GB?

Thanks for all you do!

Owner
•
edited Aug 9

Yeah, maybe so. I just did an IQ3_KS for Air and might be able to use the same basic recipe here given they are similar. Will see what I can do!

Also, if you're on 2x 6000 Blackwells you can use the KT quants, which are based on the QTIP paper and similar to exllamav3 EXL3 quants. I know turboderp just got Air going: https://huggingface.co/turboderp/GLM-4.5-Air-exl3 but not sure if the full-size model is available or in the works.

thank you so much!

Looks like MikeRoz is working on the big boi:
https://huggingface.co/MikeRoz/GLM-4.5-exl3

Also, looks like Thireus has a ton of options. I just grabbed his IQ4_KT special sauce quant, which weighs in at 168GB, and I'm running 90k context - pretty sweet! I'll do some ppl measurements on it as well just to see where it's at.

Oh nice, that sounds like a good size! I'm uploading an IQ3_KT right now: 147.565 GiB (3.537 BPW), Final estimate: PPL = 3.4369 +/- 0.01975. I actually used iq4_kss on the ffn_down_exps instead of iq4_kt, as some of my previous testing suggested they are similar (both exactly 4.0bpw) and the iq4_kss would have faster TG if anyone had to run it on CPU.

Would love to see any numbers you get! I have my perplexity workflow here: https://huggingface.co/ubergarm/GLM-4.5-GGUF/discussions/4#6896071d1bc2e44f792ce8f8 (mine tend to run just a tiny bit high on the CPU-only backend on this rig, I've noticed, compared to some CUDA folks; not sure what that is about).

Have fun playing with all the quants!

For the IQ4_KT_Special quant from Thireus:

Final estimate: PPL = 3.3351 +/- 0.01906. Makes sense considering your IQ4_KSS.

How do you calculate the total bpw?

original-el8 changed discussion status to closed
original-el8 changed discussion status to open

@original-el8

For the IQ4_KT_Special quant from Thireus:

Oh nice, yes just a bit higher than my IQ4_KSS. I too am curious what size that is exactly.

How do you calculate the total bpw?

I just look at the logs when starting llama-server or running llama-perplexity; it will show something like this:

llm_load_print_meta: model type       = 355B.A32B
llm_load_print_meta: model ftype      = IQ3_KT - 3.125 bpw
llm_load_print_meta: model params     = 358.338 B
llm_load_print_meta: model size       = 147.565 GiB (3.537 BPW) # <--- I copy paste this line for total size/BPW
llm_load_print_meta: repeating layers = 146.560 GiB (3.529 BPW, 356.786 B parameters)
llm_load_print_meta: general.name     = GLM 4.5
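
In other words, total BPW is just the model size in bits divided by the parameter count. A quick sanity check of the numbers from that log (a sketch using awk; the size and param values are copied from the lines above):

```shell
# BPW = (model size in bytes * 8) / parameter count.
# Values copied from the llm_load_print_meta lines above.
size_gib=147.565     # model size in GiB
params_b=358.338     # parameters in billions
awk -v s="$size_gib" -v p="$params_b" \
    'BEGIN { printf "%.3f BPW\n", s * 1024^3 * 8 / (p * 1e9) }'
# prints: 3.537 BPW
```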

Here's some interesting results:

$ docker run --rm --gpus all -v models:/models ik_llama:latest /usr/local/bin/llama-sweep-bench -m /models/GLM-4.5-GGUF/IQ3_KT/GLM-4.5-IQ3_KT-00001-of-00004.gguf -c 32764 -ngl 999 --no-mmap --threads 16 -b 4096 -ub 4096 -fa

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 | 0 | 3.896 | 1051.35 | 39.357 | 26.02 |
| 4096 | 1024 | 4096 | 4.326 | 946.89 | 43.548 | 23.51 |
| 4096 | 1024 | 8192 | 4.858 | 843.10 | 48.862 | 20.96 |
| 4096 | 1024 | 12288 | 5.381 | 761.18 | 53.799 | 19.03 |
| 4096 | 1024 | 16384 | 5.909 | 693.21 | 58.304 | 17.56 |
| 4096 | 1024 | 20480 | 6.460 | 634.02 | 63.597 | 16.10 |
| 4096 | 1024 | 24576 | 7.213 | 567.88 | 68.713 | 14.90 |
| 4096 | 1024 | 28672 | 8.192 | 499.99 | 73.088 | 14.01 |

$ docker run --rm --gpus all -v models:/models ik_llama:latest /usr/local/bin/llama-sweep-bench -m /models/GLM-4.5-GGUF/IQ4_KT-Special/GLM-4.5-THIREUS-IQ4_KT-SPECIAL_TENSOR-00001-of-01762.gguf -c 32764 -ngl 999 --no-mmap --threads 16 -b 4096 -ub 4096 -fa

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|--------|----------|
| 4096 | 1024 | 0 | 3.918 | 1045.42 | 36.259 | 28.24 |
| 4096 | 1024 | 4096 | 4.400 | 931.00 | 40.682 | 25.17 |
| 4096 | 1024 | 8192 | 4.934 | 830.24 | 45.855 | 22.33 |
| 4096 | 1024 | 12288 | 5.451 | 751.40 | 50.933 | 20.10 |
| 4096 | 1024 | 16384 | 5.985 | 684.41 | 55.334 | 18.51 |
| 4096 | 1024 | 20480 | 6.552 | 625.12 | 60.364 | 16.96 |
| 4096 | 1024 | 24576 | 7.328 | 558.94 | 65.469 | 15.64 |
| 4096 | 1024 | 28672 | 8.411 | 486.97 | 69.932 | 14.64 |

Definitely interesting results. Not sure yet what to make of it.

@original-el8

Oh great, thanks for making some llama-sweep-bench charts comparing mine and Thireus' quants. A couple thoughts/questions:

  1. Can you tell me the size of that Thireus-IQ4_KT quant? E.g. this line in the startup logs: llm_load_print_meta: model size = 147.565 GiB (3.537 BPW)
  2. You can add --warmup-batch on ik's fork (no need on my mainline port of llama-sweep-bench, where it is hardcoded on). Without it the first data point can be low, though since you're fully offloading it probably won't affect much.
  3. Since you're fully offloaded, setting threads to exactly 1, e.g. --threads 1 or -t 1, can sometimes give a few more percent boost.

Yeah, each quantization type has a different kernel implementation per backend (CUDA/Vulkan/CPU AVX2/CPU AVX_VNNI/CPU NEON, etc.), so different mixes can perform differently.

I used iq4_kss on the ffn_down_exps instead of iq4_kt actually as some of my previous testing suggested they are similar (both exactly 4.0bpw) and the iq4_kss would have faster TG if anyone had to run it on CPU.

Do you plan releasing IQ3_KL for us CPU-bound folks, or this IQ3_KT shouldn't be any slower than IQ3_KL?

Owner
•
edited Aug 11

@Aver0

Do you plan releasing IQ3_KL for us CPU-bound folks, or this IQ3_KT shouldn't be any slower than IQ3_KL?

The recent addition of iq2_kl has been useful, but to be honest I've never tried the IQ3_KL : 4 bpw non-linear quantization mix, whose description oddly suggests it is the same size as both the iq4_kt and the new iq4_kss. So an IQ3_KL would probably end up about the same size as the existing IQ4_KSS.

The more TG CPU-friendly version would likely be using iq3_ks or iq3_k.

I'm guessing it would be slower only for TG, but given that only ffn_(up|gate)_exps are trellis quants, it would be interesting to see how well it keeps up with iq3_ks etc.

What target RAM+VRAM are you looking for? Perhaps I'll release one more in this range using non-trellis quants. Also feel free to give it a try and report back how it performs on your system, including ram/vram/cpu/os info. Thanks!

What target RAM+VRAM are you looking for

160 GB RAM + 12 GB VRAM. IQ3_K should be the best, leaving a bit for some limited context.
I don't think my tests would be representative, since I have such an unusual RAM configuration (2x48 + 2x32 GB DDR5 running at 62 GB/s).

Owner
•
edited Aug 12

@Aver0

160 GB RAM + 12 GB VRAM. IQ3_K should be the best, leaving a bit for some limited context.
I don't think my tests would be representative, since I have such an unusual RAM configuration (2x48 + 2x32 GB DDR5 running at 62 GB/s).

Oh, interesting combination, yes. I haven't measured how much VRAM the attn/shexp/first N dense ffn layers take up when offloading with the usual -ngl 99 -ot exps=CPU, i.e. how much room you have left over for kv-cache.

I did some llama-sweep-benches on an all-CPU configuration and, interestingly, the IQ3_KT is not suffering too much on TG. Granted this is a huge AMD EPYC with a ton of cores, but throwing more cores at it actually slowed down TG (which is typical of the non-KT quants).
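
As a rough sanity check for CPU-bound TG, you can estimate an upper bound from memory bandwidth divided by the active weight bytes read per token. A back-of-envelope sketch, using the ~32B active parameters of GLM-4.5 (355B-A32B), the IQ3_KT's 3.537 BPW, and the ~62 GB/s bandwidth mentioned above:

```shell
# TG upper bound ~= RAM bandwidth / bytes of active weights per token.
awk 'BEGIN {
    active = 32e9            # active parameters per token (355B-A32B)
    bpw    = 3.537           # IQ3_KT overall bits per weight
    bw     = 62e9            # reported RAM bandwidth in bytes/s
    printf "%.1f tok/s ceiling\n", bw / (active * bpw / 8)
}'
# prints: 4.4 tok/s ceiling
```

Real numbers land below this ceiling once attention, kv-cache, and cache effects are factored in, but it tells you roughly what the hardware can deliver.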

Also, I was able to get back some of the performance using the experimental ik_llama.cpp branch ik/q8_k_r8_avx512, which supports the Zen5 avx_vnni CPU flag. So definitely try that if you have a Zen5 chip on AM5 like the AMD 9950X (my personal home gaming rig uses this and sees a benefit, mostly as PP uplift).

Finally, you could probably get a little more TG uplift experimenting with a draft model e.g. https://huggingface.co/jukofyork/GLM-4.5-DRAFT-0.6B-v3.0-GGUF running with something like:

    -md DRAFT-0.6B-Q4_0.gguf \
    -ngld 99 \
    --draft 32 \

sweep-bench-GLM-4.5.png

So for now I won't upload a new model in that ~3.5bpw range; curious what you see if you give the existing IQ3_KT a try. Thanks!

hi @ubergarm

Can you share your full args to quantize this model? I want to make an IQ4_XS_R8 version with good quality. (Do I need to use an imatrix or calibration data?)

Based on https://github.com/ikawrakow/ik_llama.cpp/pull/624, are there any adjustments to the custom-q recipe for GLM-4.5 on CPU?

I tried this:


#!/usr/bin/env bash

custom="
# 93 Repeating Layers [0-92]

# Attention
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=q8_0

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq6_k
blk\..*\.nextn\.shared_head_head\.weight=iq6_k
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ5_K.gguf \
    IQ5_K \
    192
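
For reference, the grep/sed pipeline in the script above just strips the comment lines and joins the remaining rules with commas, which is the format --custom-q expects. A minimal standalone demonstration of that transform (these two attention rules are taken from the script above, shortened to two entries):

```shell
# Strip comment lines, then join the remaining rules with commas,
# exactly as the pipeline in the script above does.
custom="
# Attention
blk\..*\.attn_q.*=q8_0
blk\..*\.attn_k.*=q8_0
"
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')
echo "$custom"
# prints: blk\..*\.attn_q.*=q8_0,blk\..*\.attn_k.*=q8_0
```

Note that sed -z (null-data mode, GNU sed) is what lets the substitution see the newlines as ordinary characters to replace.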

@CalvinZero

Can you share your full args to quantize this model?

I try to provide my full commands in all of the model cards for every quant; let me know if I missed something, but I believe it is all there. You can find additional information in my quant cookers guide here: https://github.com/ikawrakow/ik_llama.cpp/discussions/434

You are welcome to use my imatrix and any data that I have provided to make your own quant! Let me know how it goes and I'd love to see any llama-sweep-bench or llama-perplexity comparisons as well!

I want to make an IQ4_XS_R8 version with good quality.

I've never messed around with _R8 quants and thought they existed mainly for internal use with activations etc. I no longer release _R4 -rtr quants either, as going with larger -ub 4096 -b 4096 batches tends to favor non-repacked quants for PP speed now. I'd also advise against iq4_xs as it is an older mainline quant; you can use the newer iq4_ks and iq4_kss SOTA quants at similar BPW but likely better perplexity.

Regarding PR624 which you link: it only affects Q2_K, Q3_K, Q4_K, Q5_K, IQ2_KS, IQ3_KS, and IQ3_K, which do not appear in the recipe you listed? If you read all of PR624 you can see that you might have to try with and without that branch compiled in and measure perplexity yourself to see which one is "better", as the tweaks don't seem to be 100% better across all quants/models but may vary a bit, hence why it is unmerged so far.

What is your goal here? Are you trying to fit the best quality model into a specific RAM+VRAM target size? Anyway, have fun, quantizing is a cool hobby!

Good luck and cheers!

Thanks for the tips.

My target is to run the model on pure CPU with Zen4 AVX512 as fast as possible, with reasonable perplexity.

I read somewhere that non-linear quantization and _R4/_R8 are good for CPU. I will try comparing the normal and _Rx versions with -rtr.

@CalvinZero

My target is to run the model on pure CPU with Zen4 AVX512 as fast as possible, with reasonable perplexity.

Zen4 doesn't get much speed-up from AVX512 as the instructions still take multiple CPU cycles. Zen5 gets the real avx_vnni CPU flag, which is faster: https://github.com/ikawrakow/ik_llama.cpp/pull/710

I read some where non-linear quantization and _R4 _R8 is good for CPU. I will try compare normal and _Rx version with -rtr.

Yes, the repacked row-interleaved quants can be good for CPU/RAM inferencing, especially at lower batch sizes. Larger batch sizes, e.g. -ub 4096 -b 4096, can improve PP significantly even on MoE though; you will want to run llama-sweep-bench tests to compare results, as shown in the link above where I'm doing some CPU-only benchmarks.

-rtr gives the same result as _R4 for all quants running on CPU/RAM, except for cases like IQ1_S, which is not symmetric with IQ1_S_R4 I think, but the rest are... double check the closed PRs on ik_llama.cpp though. You can also "offline repack" using llama-quantize yourself to prepare an _r4 version of any of my quants; then you don't need -rtr and can still use mmap() if needed for faster startup etc.

@ubergarm

Thank you for the advice. Sorry I didn't answer earlier; I thought I'd only write after testing every possible optimization, but my progress kinda stalled, so I might as well recap what I tried.

I have a Zen4 CPU (7700), so using AVX512 actually decreases performance due to heating.

The answer to

how much room you have left-over for kv-cache

is: just 2k tokens:

llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 512 -ub 256 -c 2816 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 99

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|----|------|--------|----------|--------|----------|
| 256 | 64 | 0 | 18.887 | 13.55 | 22.644 | 2.83 |
| 256 | 64 | 256 | 18.531 | 13.81 | 23.138 | 2.77 |
| 256 | 64 | 512 | 19.379 | 13.21 | 22.950 | 2.79 |
| 256 | 64 | 768 | 19.192 | 13.34 | 23.809 | 2.69 |
| 256 | 64 | 1024 | 18.912 | 13.54 | 24.057 | 2.66 |
| 256 | 64 | 1280 | 18.736 | 13.66 | 26.112 | 2.45 |
| 256 | 64 | 1536 | 18.861 | 13.57 | 25.447 | 2.52 |
| 256 | 64 | 1792 | 19.062 | 13.43 | 23.655 | 2.71 |
| 256 | 64 | 2048 | 18.884 | 13.56 | 23.371 | 2.74 |
| 256 | 64 | 2304 | 19.376 | 13.21 | 23.514 | 2.72 |
| 256 | 64 | 2560 | 19.188 | 13.34 | 23.557 | 2.72 |

Obviously 2k isn't enough for anything, so I have to decrease ngl:

llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 1024 -ub 512 -c 10240 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 83

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 | 0 | 30.990 | 16.52 | 54.778 | 2.34 |
| 512 | 128 | 512 | 30.682 | 16.69 | 56.867 | 2.25 |
| 512 | 128 | 1024 | 31.399 | 16.31 | 56.414 | 2.27 |
| 512 | 128 | 1536 | 30.918 | 16.56 | 56.719 | 2.26 |
| 512 | 128 | 2048 | 30.664 | 16.70 | 57.066 | 2.24 |
| 512 | 128 | 2560 | 31.150 | 16.44 | 56.874 | 2.25 |
| 512 | 128 | 3072 | 31.179 | 16.42 | 56.605 | 2.26 |
| 512 | 128 | 3584 | 30.571 | 16.75 | 57.030 | 2.24 |
| 512 | 128 | 4096 | 30.791 | 16.63 | 57.521 | 2.23 |
| 512 | 128 | 4608 | 31.040 | 16.49 | 58.079 | 2.20 |
| 512 | 128 | 5120 | 31.328 | 16.34 | 57.964 | 2.21 |
| 512 | 128 | 5632 | 31.479 | 16.26 | 58.625 | 2.18 |
| 512 | 128 | 6144 | 31.463 | 16.27 | 58.985 | 2.17 |
| 512 | 128 | 6656 | 31.315 | 16.35 | 58.621 | 2.18 |
| 512 | 128 | 7168 | 31.317 | 16.35 | 59.899 | 2.14 |
| 512 | 128 | 7680 | 31.168 | 16.43 | 60.143 | 2.13 |
| 512 | 128 | 8192 | 32.308 | 15.85 | 60.282 | 2.12 |
| 512 | 128 | 8704 | 31.878 | 16.06 | 59.274 | 2.16 |
| 512 | 128 | 9216 | 30.871 | 16.59 | 60.104 | 2.13 |

10k context is already enough for many simple tasks, and TG is still not too bad for a GPU-poor setup. I'm less concerned about PP with small contexts, so I don't set -ub 4096 which would require reducing ngl further.

BTW, when I tried ffn=CPU with -ngl 99 instead of exps=CPU, thus keeping only the attn of all blocks on GPU, the results were worse:

llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 2048 -ub 512 -c 10240 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "ffn=CPU" -ngl 99

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-----|-----|------|--------|----------|--------|----------|
| 512 | 128 | 0 | 30.252 | 16.92 | 60.091 | 2.13 |
| 512 | 128 | 512 | 29.573 | 17.31 | 61.239 | 2.09 |
| 512 | 128 | 1024 | 29.529 | 17.34 | 60.954 | 2.10 |
| 512 | 128 | 1536 | 29.696 | 17.24 | 63.248 | 2.02 |
| 512 | 128 | 2048 | 29.843 | 17.16 | 62.106 | 2.06 |
| 512 | 128 | 2560 | 29.749 | 17.21 | 62.514 | 2.05 |
| 512 | 128 | 3072 | 30.754 | 16.65 | 60.991 | 2.10 |
| 512 | 128 | 3584 | 32.811 | 15.60 | 62.740 | 2.04 |
| 512 | 128 | 4096 | 30.536 | 16.77 | 62.033 | 2.06 |
| 512 | 128 | 4608 | 30.473 | 16.80 | 61.400 | 2.08 |
| 512 | 128 | 5120 | 30.109 | 17.00 | 62.899 | 2.04 |
| 512 | 128 | 5632 | 30.044 | 17.04 | 64.052 | 2.00 |
| 512 | 128 | 6144 | 29.956 | 17.09 | 64.646 | 1.98 |
| 512 | 128 | 6656 | 29.830 | 17.16 | 62.800 | 2.04 |
| 512 | 128 | 7168 | 30.906 | 16.57 | 63.378 | 2.02 |
| 512 | 128 | 7680 | 31.361 | 16.33 | 63.318 | 2.02 |
| 512 | 128 | 8192 | 31.476 | 16.27 | 62.979 | 2.03 |
| 512 | 128 | 8704 | 30.113 | 17.00 | 62.722 | 2.04 |
| 512 | 128 | 9216 | 29.939 | 17.10 | 65.556 | 1.95 |
| 512 | 128 | 9728 | 30.818 | 16.61 | 63.102 | 2.03 |

Now, for 32k context the batch size 4096 makes sense:

llama-sweep-bench -m GLM-4.5-IQ3_KT.gguf -b 4096 -ub 4096 -c 32768 --no-mmap -t 8 -tb 7 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 39

| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4096 | 1024 | 0 | 28.031 | 146.12 | 514.676 | 1.99 |
| 4096 | 1024 | 4096 | 28.891 | 141.77 | 566.682 | 1.81 |
| 4096 | 1024 | 8192 | 30.607 | 133.83 | 613.948 | 1.67 |
| 4096 | 1024 | 12288 | 32.868 | 124.62 | 671.010 | 1.53 |
| 4096 | 1024 | 16384 | 37.314 | 109.77 | 762.135 | 1.34 |
| 4096 | 1024 | 20480 | 38.373 | 106.74 | 859.401 | 1.19 |
| 4096 | 1024 | 24576 | 42.212 | 97.03 | 935.687 | 1.09 |

I've tried GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf. It slows down TG by ~1.25x when used with llama-server, and llama-sweep-bench just ignores the -md switch, so no benchmarks here. It's just slow.

I've also tried --no-kv-offload with -ngl 99; it's much slower, especially after 20k tokens.

Usually when TG is lower than 1 t/s I find it too slow and switch to a lower quant. Fortunately IQ3_KT only slows down to 1.0 after 32k tokens, and coincidentally that's where all models start overlooking older parts of context, so I tend to never use more than 32k anyway.

So, for now I'll be using IQ3_KT. Thank you for the good quant. The only thing that bothers me is the 10GB of free RAM that could be used for decreasing perplexity (e.g. storing ffn_(gate|up) of the Routed Experts in IQ3_K_R4 or IQ3_KM, not sure which is better).

Oh, actually, why does the IQ3_KT store attn_k and attn_v of the Routed Experts in Q8_0 while your IQ4_KSS stores them in IQ6_K?
Shouldn't the lower quant use the same IQ6_K? (It would really help to store more of them in my super-limited VRAM.)

I was thinking about cooking a balanced "IQ3_K" quant using Thireus' GGUF-Tool-Suite, but since I'm still on Windows, it doesn't want to cooperate without some wrestling.

@Aver0

I have a Zen4 CPU (7700), so using AVX512 actually decreases performance due to heating.

Yeah, Zen4 AVX512 instructions take multiple CPU clocks to complete, so no big benefit for PP like on Zen5, unfortunately. Interesting that it heats up your CPU.

Obviously 2k isn't enough for anything, so I have to decrease ngl:

So I wouldn't recommend reducing ngl; the strategy for MoE is -ngl 99 with all the routed exps on CPU/RAM. But I understand you have only 12GB VRAM, which is quite low despite a lot of RAM. You could also play with reducing kv-cache size in VRAM with heavier quantization, e.g. -ctk q6_0 -ctv q6_0 or -ctk q4_1 -ctv q4_1 or -ctk iq4_nl -ctv iq4_nl, something better than q4_0 but smaller than q8_0. Even so you might not get enough context fully offloading... hrmm...

BTW, when I tried ffn=CPU with -ngl 99 instead of exps=CPU, thus offloading only attn of all blocks, the results were worse:

Of course: you were offloading the first N dense layers' ffn_(gate|down|up) as well as the shared expert ffn_(gate|down|up)_shexp, which are always-active weights for every token, onto CPU. The strategy for MoE is to always keep those on GPU/VRAM and only offload the routed experts tensors ffn_(gate|down|up)_exps onto CPU/RAM. You can check the model card side-bar on huggingface to see the exact names of tensors for regex matching.

-t 8 -tb 7

You have 8 physical CPU cores; I'd recommend just using -t 8 and being done with it. Not sure why you are using fewer for threads-batch (prompt processing/prefill)? Typically tb is set higher, but only on big many-core CPUs. For your system 8 and 8 should be best.

I've also tried --no-kv-offload with -ngl 99, it's much slower, especially after 20k tokens.

Yes, I've heard some use this strategy only if they require a ton of kv-cache and are willing to go very, very slowly. Otherwise, always keep the kv-cache on GPU/VRAM.

I've tried GLM-4.5-DRAFT-0.6B-32k-Q4_0.gguf. It slows down TG by ~1.25x when used with llama-server. llama-sweep-bench just ignores the -md switch so no benchmarks here. It's just slow.

Ahh yeah, I've heard from some that it doesn't have enough valid token matches to be worth it for many types of applications. Thanks for trying. Also, with only 12GB VRAM you barely have enough to run the main model, let alone add another small model into VRAM. Huh, I thought llama-sweep-bench would use it; I have a test with a different model showing about a 1 tok/sec speed-up with llama-sweep-bench. Oh well, probably not worth more exploration on your setup.

The only thing that bothers me is 10GB of free RAM that could be used for decreasing perplexity (e.g. store ffn_(gate|up) of Routed Experts in IQ3_K_R4 or IQ3_KM, not sure which is better).

Be careful trying to mini-max so much by splitting up ffn tensors across different devices. -fmoe needs to have (gate|up) on the same device at a minimum to work, for example. Also, depending on PCIe, it could add extra communication overhead going back and forth for each layer. 10GB of RAM probably isn't going to make a noticeable difference.

Oh, actually, why the IQ3_KT stores attn_k and attn_v of Routed Experts in Q8_0 while your IQ4_KSS stores them in IQ6_K?
Shouldn't the lower quant use same IQ6_K? (would really help store more of them in my superlimited VRAM)

The attn tensors don't belong to the routed experts. Each layer can have a mix of many tensor types. I chose the larger q8_0 attn tensors for the KT because sometimes that slightly larger size over iq6_k gives a noticeable boost in perplexity. I generally design assuming a 16GB VRAM minimum, where it wouldn't matter so much; but on your system, the small savings of iq6_k over q8_0 would allow you some more context etc. Sorry about that. There are no hard and fast rules about "shouldn't the lower quant..." really. There is a tradition of quantization mix schemes, e.g. Q4_K_M or Q4_K_XL, which have some meaning hard-coded into llama-quantize. The unsloth quants are basically just this with a slightly different mix. I only use custom quantizations and have never limited myself to the traditional mixes, as bartowski, mradermacher, and unsloth already do a fine job with those flavors.

In general keeping attn tensors a bit higher compared to the rest of the mix can give pretty good perplexity boost for the size.

How fast is your NVMe SSD? If it is PCIe Gen 5, e.g. a Crucial T700 drive, you might be better off going with the iq4_kss with its smaller attn tensors, letting the model hang out of RAM onto SSD and letting the default read-only mmap() operate off of the page cache. I've run DeepSeek 671B like this at up to 4-5 tok/sec with only 96GB RAM.

The main thing I'd recommend exploring is -rtr for run-time-repack, which disables mmap() and mallocs the entire model on start with the CPU/RAM tensors repacked into row-interleaved format. This can give a boost to TG as it improves cpu/ram/cache effectiveness. You can also play with -ub 1024 -b 2048 or batches smaller than 4096 etc. The default batches are -ub 512 -b 2048 fwiw.

Okay keep hacking and maybe you can squeeze another tok/sec out of your system!

@ubergarm
So, after everything else failed to yield an improvement (I have a Gen4 SSD), I went and cooked my first custom IQ3_K-like quant.
Here's the recipe:

## Quant mix recipe created using Thireus' GGUF Tool Suite - https://gguf.thireus.com/

## Model head & embeddings — qbits: 32 8 6
output_norm\.weight=f32
token_embd\.weight=iq6_k
output\.weight=q8_0

## Multi-headed attention parameters — qbits: 32 4
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight=iq4_ks
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.bias=f32
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.weight=iq4_ks
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.weight=iq4_ks
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.weight=iq4_ks
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.bias=f32
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k_norm\.weight=f32
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.bias=f32
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_norm\.weight=f32
blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q_norm\.weight=f32

## Core FFN weights — qbits: 32 8 5 4
blk\.[0-2]\.ffn_gate\.weight=iq4_k
blk\.[0-1]\.ffn_down\.weight=iq4_k
blk\.1\.ffn_up\.weight=iq4_ks
blk\.(0|2)\.ffn_up\.weight=iq5_k
blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_inp\.weight=f32
blk\.2\.ffn_down\.weight=q8_0

## Other tensors — qbits: 32 4
blk\.([0-9]|[1-8][0-9]|9[0-2])\.post_attention_norm\.weight=f32
blk\.92\.nextn\.eh_proj\.weight=iq4_kss
blk\.92\.nextn\.embed_tokens\.weight=iq4_kss
blk\.92\.nextn\.shared_head_norm\.weight=f32
blk\.([3-9]|[1-8][0-9]|9[0-2])\.exp_probs_b\.bias=f32
blk\.92\.nextn\.shared_head_head\.weight=iq4_kss
blk\.92\.nextn\.enorm\.weight=f32
blk\.92\.nextn\.hnorm\.weight=f32

## GPU-loaded ffn_*_shexp
# ffn_down_shexp (down-projection) — qbits: 8 6 5 4
blk\.(26|30|32|91|2[8-9])\.ffn_down_shexp\.weight=q8_0
blk\.(19|23|42)\.ffn_down_shexp\.weight=iq6_k
blk\.(3|7|9|18|37|41|84|1[0-5])\.ffn_down_shexp\.weight=iq5_k
blk\.(27|34|36|40|2[0-2]|8[6-7]|7[7-8]|2[4-5])\.ffn_down_shexp\.weight=iq4_ks
blk\.([4-6]|8|31|33|35|[5-6][0-9]|79|85|90|92|1[6-7]|3[8-9]|8[8-9]|7[0-6]|8[0-3]|4[3-9])\.ffn_down_shexp\.weight=iq4_k

# ffn_up_shexp (up-projection) — qbits: 8 6 5 4
blk\.(9|11|35|84|91|3[1-2]|3[8-9])\.ffn_up_shexp\.weight=q8_0
blk\.(3|7|24|26)\.ffn_up_shexp\.weight=iq6_k
blk\.(4|23|28|34|43|45|55|90|1[2-5])\.ffn_up_shexp\.weight=iq5_k
blk\.(5|8|10|16|29|30|33|37|40|86|92|7[0-5]|2[0-2]|8[0-2]|5[0-4]|6[0-9]|1[8-9]|8[8-9]|7[7-9]|[4-5][6-9])\.ffn_up_shexp\.weight=iq4_k
blk\.(6|17|25|27|36|44|76|83|85|87|4[1-2])\.ffn_up_shexp\.weight=iq4_ks

# ffn_gate_shexp (gate-projection) — qbits: 8 6 5 4
blk\.(3|14|25|30|33|36|46|86|9[0-1])\.ffn_gate_shexp\.weight=q8_0
blk\.60\.ffn_gate_shexp\.weight=iq6_k
blk\.(5|9|10|12|15|19|23|35|41|44|68|88|2[0-1]|8[4-5])\.ffn_gate_shexp\.weight=iq5_k
blk\.(17|22|27|40|47|3[1-2]|7[5-6]|3[7-9])\.ffn_gate_shexp\.weight=iq4_ks
blk\.(4|[6-8]|11|13|16|18|24|26|34|45|69|87|89|92|6[1-7]|7[0-4]|5[0-9]|2[8-9]|4[8-9]|4[2-3]|8[0-3]|7[7-9])\.ffn_gate_shexp\.weight=iq4_k

## CPU-loaded ffn_*_exps
# ffn_down_exps (down-extraction) — qbits: 5 4 3 2
blk\.(5|22|61)\.ffn_down_exps\.weight=iq5_ks_r4
blk\.([3-4]|11|18|35|92|3[0-1])\.ffn_down_exps\.weight=iq5_k_r4
blk\.(6|15|25|55|57|60|68|71|4[3-5]|5[2-3]|4[7-9])\.ffn_down_exps\.weight=iq4_kss
blk\.(8|14|46|56)\.ffn_down_exps\.weight=iq4_ks_r4
blk\.(13|19|24|37|42|54|59|78|8[1-2]|5[0-1]|[6-7][2-6])\.ffn_down_exps\.weight=iq3_k_r4
blk\.(7|16|28|33|41|58|67|69|70|77|79|80|8[4-9]|9[0-1])\.ffn_down_exps\.weight=iq3_ks
blk\.(10|12|17|23|26|29|32|34|36|39)\.ffn_down_exps\.weight=iq2_ks
blk\.(9|27|38|40|83|2[0-1])\.ffn_down_exps\.weight=iq2_k_r4

# ffn_up_exps (up-extraction) — qbits: 5 4 3 2
blk\.(19|28|92)\.ffn_up_exps\.weight=iq5_k_r4
blk\.(11|15)\.ffn_up_exps\.weight=iq4_kss
blk\.20\.ffn_up_exps\.weight=iq4_ks_r4
blk\.(41|43|58|78|4[6-8]|1[6-7]|3[0-1]|6[0-1]|[1-2][3-4]|5[0-6]|7[0-6]|6[7-9]|6[4-5])\.ffn_up_exps\.weight=iq3_k_r4
blk\.([3-4]|12|21|39|42|49|57|59|66|77|79|8[0-2]|8[4-9]|9[0-1]|6[2-3]|4[4-5])\.ffn_up_exps\.weight=iq3_ks
blk\.(5|7|9|10|18|25|27|29|32|40|3[7-8]|3[4-5])\.ffn_up_exps\.weight=iq2_ks
blk\.(6|8|22|26|33|36|83)\.ffn_up_exps\.weight=iq2_k_r4

# ffn_gate_exps (gate-extraction) — qbits: 5 4 3 2
blk\.(19|28|92)\.ffn_gate_exps\.weight=iq5_k_r4
blk\.(11|15)\.ffn_gate_exps\.weight=iq4_kss
blk\.20\.ffn_gate_exps\.weight=iq4_ks_r4
blk\.(41|43|58|78|4[6-8]|1[6-7]|3[0-1]|6[0-1]|[1-2][3-4]|5[0-6]|7[0-6]|6[7-9]|6[4-5])\.ffn_gate_exps\.weight=iq3_k_r4
blk\.([3-4]|12|21|39|42|49|57|59|66|77|79|8[0-2]|8[4-9]|9[0-1]|6[2-3]|4[4-5])\.ffn_gate_exps\.weight=iq3_ks
blk\.(5|7|9|10|18|25|27|29|32|40|3[7-8]|3[4-5])\.ffn_gate_exps\.weight=iq2_ks
blk\.(6|8|22|26|33|36|83)\.ffn_gate_exps\.weight=iq2_k_r4

## Summary of tensor sizes per class
# GPU Total: 9.509 GiB (88.4%) | 10.75 GiB max, if all were q8_0 | 9.04 GiB min, if all were iq4_ks
# CPU Total: 131.155 GiB (60.1%) | 218.28 GiB max, if all were iq5_k_r4 | 88.72 GiB min, if all were iq2_ks
# GPU+CPU Total: 140.664 GiB (74.3%)

## Summary of tensor counts and bpw per qtype
#
# GPU-loaded quants:
# QTYPE		Count	BPW	Assigned GiB	% Assigned	Max GiB (all)
# +f32       	835	32.0  	  0.28 GiB	-		-
# +q8_0      	1  	8.5   	  0.77 GiB	-		-
# q8_0      	26 	8.5   	  0.26 GiB	7.5%		3.43
# iq6_k     	9  	6.625 	  0.65 GiB	24.2%		2.67
# iq5_k     	43 	5.5   	  0.29 GiB	12.9%		2.22
# iq4_k     	164	4.5   	  0.82 GiB	45.2%		1.82
# +iq4_ks    	372	4.25  	  6.27 GiB	-		-
# iq4_ks    	38 	4.25  	  0.18 GiB	10.2%		1.71
#
# CPU-loaded quants:
# QTYPE		Count	BPW	Assigned GiB	% Assigned	Max GiB (all)
# +iq5_k_r4  	3  	5.5   	  2.42 GiB	-		-
# iq5_k_r4  	11 	5.5   	  8.86 GiB	4.1%		215.11
# iq5_ks_r4 	3  	5.25  	  2.31 GiB	1.1%		205.33
# iq4_k     	0  	4.5   	  0.00 GiB	0.0%		176.00
# iq4_ks_r4 	6  	4.25  	  3.74 GiB	2.2%		166.22
# +iq4_kss   	3  	4.0   	  0.75 GiB	-		-
# iq4_kss   	20 	4.0   	 11.72 GiB	7.5%		156.45
# iq3_k_r4  	94 	3.4375	 47.33 GiB	35.2%		134.45
# iq3_ks    	74 	3.1875	 34.55 GiB	27.7%		124.67
# iq2_k_r4  	21 	2.375 	  7.31 GiB	7.9%		92.89
# iq2_ks    	38 	2.1875	 12.18 GiB	14.2%		85.56
#
# -Average BPW: 3.3719
#
# - Command used:
# quant_assign.py ppl_results.csv --tolerance 0.01 --cpu-tensors-max-size 131 --gpu-tensors-max-size 9.5 \
# --exponential-factor 1 --skip-gpg --cpu-tensors 'blk\.([3-9]|[1-8][0-9]|9[0-9])\.ffn_down_exps\.weight' \
# 'blk\.([3-9]|[1-8][0-9]|9[0-9])\.ffn_up_exps\.weight' 'blk\.([3-9]|[1-8][0-9]|9[0-9])\.ffn_gate_exps\.weight' \
# --gpu-tensors '.*' --cpu-quants iq5_k_r4 iq5_ks_r4 iq4_k iq4_ks_r4 iq4_kss iq3_k_r4 iq3_ks iq2_k_r4 iq2_ks \
# --gpu-quants q8_0 iq6_k iq5_k iq4_k iq4_ks --gpu-assign-qtype iq4_ks --gpu-assign-tensors 'output\.weight=q8_0' \
# --cpu-assign-tensors 'blk\.(92)\.ffn_down_exps\.weight=iq5_k_r4' 'blk\.(92)\.ffn_up_exps\.weight=iq5_k_r4' \
# 'blk\.(92)\.ffn_gate_exps\.weight=iq5_k_r4' 'blk\.92\.nextn\.shared_head_head\.weight=iq4_kss' \
# 'blk\.92\.nextn\.embed_tokens\.weight=iq4_kss' 'blk\.92\.nextn\.eh_proj\.weight=iq4_kss' --harmonize-tensors \
# 'blk\..*\.ffn_up_exps.*,blk\..*\.ffn_gate_exps.*' --harmonization-technique 3
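
One easy way to sanity-check recipes like this is to test the layer-range regexes in isolation. For example, the `([0-9]|[1-8][0-9]|9[0-2])` pattern used throughout the recipe should cover exactly the 93 repeating layers [0-92]:

```shell
# Count how many of layers 0..99 the range regex actually matches;
# GLM-4.5 has 93 repeating layers [0-92], so we expect 93.
seq 0 99 | grep -cE '^([0-9]|[1-8][0-9]|9[0-2])$'
# prints: 93
```

If the count is off, a tensor silently falls through to the default quant type instead of the one the recipe intended.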

Somehow this 3.37 BPW quant produces very low PPL!

llama-perplexity -m GLM-4.5-THIREUS-test3.gguf --ctx-size 2048 -b 2048 -ub 2048 -f wiki.test.raw -fa -fmoe -ot "exps=CPU" -ngl 69 --seed 1337 --threads 8

[1]1.6807,[2]1.9940,[3]2.2606,[4]2.6403,[5]2.5938,[6]2.4223,[7]2.4690,[8]2.5832,[9]2.7854,[10]2.9249,[11]3.0186,[12]2.9407,[13]3.0277,[14]3.0931,[15]3.1681,[16]3.2853,[17]3.1949,[18]3.1483,[19]3.0985,[20]3.0447,[21]3.0188,[22]2.9310,[23]2.8580,[24]2.7836,[25]2.7441,[26]2.7852,[27]2.8774,[28]2.9328,[29]2.8714,[30]2.8529,[31]2.8040,[32]2.7740,[33]2.7651,[34]2.7732,[35]2.8026,[36]2.7988,[37]2.7752,[38]2.7772,[39]2.7825,[40]2.8111,[41]2.8155,[42]2.8744,[43]2.9478,[44]2.9024,[45]2.8491,[46]2.9158,[47]2.9756,[48]3.0113,[49]3.0049,[50]3.0100,[51]3.0154,[52]3.0178,[53]3.0194,[54]3.0178,[55]3.0046,[56]3.0071,[57]2.9934,[58]2.9792,[59]2.9612,[60]2.9948,[61]3.0263,[62]3.0505,[63]3.0288,[64]3.0578,[65]3.0696,[66]3.0715,[67]3.0715,[68]3.0765,[69]3.0428,[70]3.0469,[71]3.0640,[72]3.0858,[73]3.0866,[74]3.0610,[75]3.0377,[76]3.0110,[77]3.0226,[78]3.0013,[79]3.0043,[80]2.9847,[81]2.9655,[82]2.9521,[83]2.9512,[84]2.9560,[85]2.9617,[86]2.9851,[87]2.9683,[88]2.9495,[89]2.9805,[90]3.0059,[91]3.0251,[92]3.0460,[93]3.0705,[94]3.0799,[95]3.1033,[96]3.1185,[97]3.1322,[98]3.1085,[99]3.0887,[100]3.0606,[101]3.0469,[102]3.0207,[103]2.9992,[104]2.9983,[105]2.9880,[106]2.9690,[107]2.9618,[108]2.9535,[109]2.9336,[110]2.9265,[111]2.9503,[112]2.9505,[113]2.9598,[114]2.9681,[115]2.9767,[116]2.9691,[117]2.9669,[118]2.9762,[119]2.9860,[120]2.9936,[121]2.9980,[122]3.0139,[123]3.0247,[124]3.0379,[125]3.0416,[126]3.0586,[127]3.0662,[128]3.0636,[129]3.0532,[130]3.0439,[131]3.0363,[132]3.0341,[133]3.0251,[134]3.0167,[135]3.0015,[136]2.9998,[137]2.9877,[138]2.9679,[139]2.9557,[140]2.9556,[141]2.9680,
Final estimate: PPL = 2.9680 +/- 0.01564

Do you see this PPL??
What's going on? There must be some mistake. Am I even doing it right? I'm using the latest ik_llama. Here's what it outputs for the IQ3_KT quant:

llama-perplexity -m GLM-4.5-IQ3_KT.gguf --ctx-size 2048 -b 2048 -ub 2048 -f wiki.test.raw -fa -fmoe -ot "exps=CPU" -ngl 69 --seed 1337 --threads 8

[1]1.6796,[2]2.9507,[3]3.1831,[4]3.5334,[5]3.4014,[6]3.0574,[7]3.0257,[8]3.1106,[9]3.2925,[10]3.3990,[11]3.4909,[12]3.3635,[13]3.4267,[14]3.4866,[15]3.5566,[16]3.6910,[17]3.5740,[18]3.5219,[19]3.4623,[20]3.4978,[21]3.5729,[22]3.5191,[23]3.5499,[24]3.5704,[25]3.5411,[26]3.6347,[27]3.7202,[28]3.7773,[29]3.6799,[30]3.6564,[31]3.6035,[32]3.5541,[33]3.5381,[34]3.5401,[35]3.5615,[36]3.5350,[37]3.4880,[38]3.4787,[39]3.4709,[40]3.4955,[41]3.4890,[42]3.5500,[43]3.6316,[44]3.6127,[45]3.5694,[46]3.6391,[47]3.6985,[48]3.7313,[49]3.7155,[50]3.7182,[51]3.7138,[52]3.7064,[53]3.6990,[54]3.6864,[55]3.6617,[56]3.6568,[57]3.6331,[58]3.6076,[59]3.5725,[60]3.6072,[61]3.6406,[62]3.6628,[63]3.6354,[64]3.6659,[65]3.6770,[66]3.6703,[67]3.6695,[68]3.6711,[69]3.6340,[70]3.6344,[71]3.6491,[72]3.6737,[73]3.6665,[74]3.6377,[75]3.6051,[76]3.5931,[77]3.6044,[78]3.5832,[79]3.5975,[80]3.5678,[81]3.5388,[82]3.5212,[83]3.5174,[84]3.5200,[85]3.5234,[86]3.5469,[87]3.5473,[88]3.5186,[89]3.5535,[90]3.5791,[91]3.6007,[92]3.6249,[93]3.6498,[94]3.6601,[95]3.6852,[96]3.7092,[97]3.7214,[98]3.6949,[99]3.6651,[100]3.6277,[101]3.6317,[102]3.6147,[103]3.5866,[104]3.6075,[105]3.5912,[106]3.5680,[107]3.5663,[108]3.5630,[109]3.5406,[110]3.5277,[111]3.5532,[112]3.5504,[113]3.5573,[114]3.5620,[115]3.5678,[116]3.5545,[117]3.5526,[118]3.5594,[119]3.5670,[120]3.5741,[121]3.5782,[122]3.5935,[123]3.6041,[124]3.6162,[125]3.6161,[126]3.6322,[127]3.6378,[128]3.6301,[129]3.6207,[130]3.6120,[131]3.5997,[132]3.5937,[133]3.5817,[134]3.5702,[135]3.5506,[136]3.5611,[137]3.5458,[138]3.5240,[139]3.5264,[140]3.5250,[141]3.5396,
Final estimate: PPL = 3.5396 +/- 0.02052

PPL slightly varies with different batch sizes, but I never expected to get sub-3. And the model does feel smarter now, at least in instruction following.
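For reference, the reported figure is just exp of the mean negative log-likelihood over the scored tokens, so it is directly sensitive to how much context each token gets to condition on; a minimal sketch (token probabilities made up for illustration):

```python
import math

# llama-perplexity reports exp of the mean negative log-likelihood over
# all scored tokens; the probabilities below are made up for illustration.
def perplexity(token_probs):
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Giving the model more context per scored token raises the average
# probability, which lowers PPL - so runs with different -c values
# are not directly comparable.
short_ctx = [0.25, 0.5, 0.5]  # hypothetical probs with little context
long_ctx = [0.5, 0.5, 0.5]    # same tokens, more context to condition on
print(perplexity(short_ctx) > perplexity(long_ctx))  # True
```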

The custom quant is also ~1.25x faster than the IQ3_KT quant, even though I now have to set -ngl lower to keep the same context. That's baffling, since attention tensors are quantized lower in my recipe than in the IQ3_KT recipe - I thought I'd be able to fit more layers into VRAM, but nope, and I have no idea why.
But even with reduced -ngl the new quant is faster, thanks to the CPU-friendly types used.

llama-sweep-bench -m GLM-4.5-THIREUS-test3.gguf -b 1024 -ub 512 -c 10240 --no-mmap -t 8 -fa -ctk q8_0 -ctv q8_0 -fmoe -ot "exps=CPU" -ngl 74

PP TG N_KV T_PP s S_PP t/s T_TG s S_TG t/s
512 128 0 25.079 20.42 43.361 2.95
512 128 512 24.859 20.60 43.377 2.95
512 128 1024 24.435 20.95 43.915 2.91
512 128 1536 24.849 20.60 44.305 2.89
512 128 2048 24.615 20.80 44.708 2.86
512 128 2560 23.940 21.39 44.640 2.87
512 128 3072 24.610 20.80 45.472 2.81
512 128 3584 24.622 20.79 45.265 2.83
512 128 4096 24.423 20.96 45.643 2.80
512 128 4608 24.531 20.87 46.334 2.76
512 128 5120 24.594 20.82 46.774 2.74
512 128 5632 24.312 21.06 51.196 2.50
512 128 6144 24.008 21.33 51.472 2.49
512 128 6656 24.952 20.52 51.181 2.50
512 128 7168 24.964 20.51 50.818 2.52
512 128 7680 24.296 21.07 50.414 2.54
512 128 8192 24.287 21.08 51.282 2.50
512 128 8704 24.727 20.71 51.092 2.51
512 128 9216 24.602 20.81 50.434 2.54

All of this feels like black magic. The custom quant works too well.

Shoutouts to @Thireus for this awesome idea of quants assembler! I think that's the future of GGUF quantizations.

It even makes me start eyeing Kimi

@Aver0 , glad you like the tool but there are a few things to check first:

  • Double check the ik_llama.cpp logs to verify that the bpw matches and that the quant types also match. Sometimes a tensor cannot be quantized to the specified quant type and will fall back to a higher one; if you observe a discrepancy between what ik_llama.cpp reports and what the recipe says, manually check the tensors.map file of the faulty quant type to verify this.
  • du -h . or ls -lh to check whether the total model size matches the advertised 140.664 GiB. In your case it seems it may not, and the model size is probably higher, which could explain the PPL.
  • It seems you are using an older version of the tool suite, which is subject to this bug: https://github.com/Thireus/GGUF-Tool-Suite/issues/21 - as a result attn_output.weight layer may be q8_0 (not that it's a bad thing but it is not what the recipe mentions). Please try to produce the recipe again using the latest version (it will bring the PPL a bit higher, but it will effectively respect the recipe rules). I can see you have a single GGUF file, so not sure if you have quantized the model using the recipe file or if you have merged the shards downloaded using the quant_downloader.sh tool.
  • PPL benchmarks should be computed with the parameters -ctk f16 -c 512 -b 4096 -ub 4096 to compare them against @ubergarm 's PPL values. Changing any of these parameters will alter the PPL. In particular, reducing -b 4096 -ub 4096 increases the PPL, while increasing them decreases the PPL.
  • See PPL curve here for comparison: https://github.com/Thireus/GGUF-Tool-Suite/blob/main/ppl_graphs/GLM-4.5.svg
  • You might be able to obtain an even better PPL by reducing the quant types to the ones around 3.3bpw: --cpu-quants iq4_kss iq3_k_r4 iq2_k_r4
  • It would also be useful to display the rest of the recipe file, specifically the hashes of the different files it used.

It is also possible that you have found a better approach to using the tool suite for this model. From experience, I have mainly played with quants that perform well on Intel, and I've observed that PPL reaches its optimum level when restricting the quant types to a total of 3-4 around the expected bpw for cpu-friendly tensors (depending on the model calibration PPL curves observed of course).

@Thireus

Right, the file is larger - 154 332 383 136 bytes for the merged GGUF (I always merge the shards when I move them into my /models folder for inference).
It's still smaller than IQ3_KT quant that I used as a reference point.

ik_llama.cpp log:

llm_load_print_meta: model type       = 355B.A32B
llm_load_print_meta: model ftype      = IQ1_S - 1.5625 bpw
llm_load_print_meta: model params     = 358.338 B
llm_load_print_meta: model size       = 143.724 GiB (3.445 BPW)
llm_load_print_meta: repeating layers = 142.358 GiB (3.427 BPW, 356.786 B parameters)
llm_load_print_meta: general.name     = GLM 4.5

I've noticed the file size is generally ~2-3 GB higher than advertised, so I quickly got used to this quirk.
Is there any reason quant_assign.py doesn't account for the fact that some tensors cannot be quantized to the specified quant type?

The funny thing is, I stumbled upon this lucky recipe on my 3rd try. Then I spent several days trying to improve it within my RAM/VRAM budget, testing several dozen promising tweaks, but ultimately couldn't improve it.
I can't decrease PPL without a significant increase in file size, and I can't reduce file size without an increase in PPL.
The best IQ3_XSS-like quant I could make (for cases when I need 40k context) is a 3.1064 BPW quant (141 041 612 704 bytes merged GGUF) with PPL 3.1655 +/- 0.01711, which is still very good, but not as amazing as sub-3.

I'll check GGUF-Tool-Suite update and retest everything tomorrow.

You might be able to obtain an even better PPL by reducing the quant types to the ones around 3.3bpw

Oh, I've tried many combinations, all of them increase PPL.
I kinda developed a bunch of intuitions, will formulate them later.

I've noticed the file size is generally ~2-3 GB higher than advertised, so I quickly got used to this quirk.

This is mainly due to this issue that got fixed in the latest release: https://github.com/Thireus/GGUF-Tool-Suite/issues/21 - as explained in my previous message, your attn_output.weight layer ends up being q8_0. If this is what you'd like, you can still add --gpu-assign-tensors '^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight$=q8_0' in the newest version.

You should also check this section of the logs:

llama_model_loader: - type  f32:  145 tensors
llama_model_loader: - type q5_0:    1 tensors
llama_model_loader: - type iq2_kt:   51 tensors
llama_model_loader: - type iq3_kt:  188 tensors
llama_model_loader: - type iq4_kt:   13 tensors

Which is what I meant when I said "verify [...] that the quant types also match". And compare against the count you see in the recipe. That will help you identify which tensor counts don't match and troubleshoot.
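A quick way to diff those counts against the recipe is to parse them straight out of the loader log; a hypothetical sketch (not part of ik_llama.cpp or the tool suite):

```python
import re
from collections import Counter

# Pull "llama_model_loader: - type <qtype>: <N> tensors" lines out of the
# loader log so the per-qtype counts can be compared with the recipe.
LINE_RE = re.compile(r"llama_model_loader: - type\s+(\S+):\s+(\d+) tensors")

def qtype_counts(log_lines):
    counts = Counter()
    for line in log_lines:
        m = LINE_RE.search(line)
        if m:
            counts[m.group(1)] += int(m.group(2))
    return counts

log = [
    "llama_model_loader: - type  f32:  145 tensors",
    "llama_model_loader: - type iq3_kt:  188 tensors",
]
print(qtype_counts(log))
```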

Is there any reason the quant_assign.py doesn't consider that some tensors cannot be quantized to the specified quant type?

The complexity to handle this scenario is atrocious, so I've decided to only implement such check for the benchmarking scripts. quant_assign.sh trusts that the quant types provided by the user are pure for all tensors, which may not always be the case - so it is left to the user to double check that the quants used match the end result. This mainly happens for low bit quantization or when some tensor shapes are simply incompatible. See example below:

https://huggingface.co/Thireus/Kimi-K2-Instruct-0905-THIREUS-IQ2_K-SPECIAL_SPLIT/blob/main/tensors.map#L7-L8 - attn_k_b.weight layer falls back to iq4_nl instead of iq2_k during quantization (this is ik_llama.cpp deciding on its own to use this iq4_nl quantization for this tensor layer specifically)

I should probably create a pre-assessment script that warns the user if some tensors use a different quantization type due to incompatibility.
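Such a pre-assessment could be sketched like this (hypothetical; it assumes you can already extract (tensor name, actual qtype) pairs from the tensors.map):

```python
import re

# Compare the actual quant type of each tensor against the recipe's
# regex -> qtype rules and report any fallbacks. Hypothetical helper,
# not part of the GGUF Tool Suite.
def find_fallbacks(tensor_qtypes, recipe_rules):
    mismatches = []
    for name, actual in tensor_qtypes:
        for pattern, wanted in recipe_rules:
            if re.fullmatch(pattern, name):
                if actual != wanted:
                    mismatches.append((name, wanted, actual))
                break  # first matching rule wins
    return mismatches

rules = [(r"blk\.\d+\.attn_k_b\.weight", "iq2_k")]
tensors = [("blk.0.attn_k_b.weight", "iq4_nl"),  # fell back during quantization
           ("blk.1.attn_k_b.weight", "iq2_k")]
print(find_fallbacks(tensors, rules))
# [('blk.0.attn_k_b.weight', 'iq2_k', 'iq4_nl')]
```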

Please have a look at the other points I raised in my previous answer. Specifically about computing the PPL using  -ctk f16 -c 512 -b 4096 -ub 4096.
Otherwise, very interesting results you've got here - it does take a bit of trial and error to identify which combination of quants is best to use, and I'm glad you've nailed it!

I kinda developed a bunch of intuitions, will formulate them later.

Yes please. I've been tempted to create a script that automatically selects a bunch of quants for the user, based on my observations/intuition. But I decided it was too early for exactly this reason: my intuition may not apply to all models - and I think that's what you've demonstrated here.

The custom quant is also ~1.25x faster than IQ3_KT quant

Not sure if you're using Intel or AMD, but yeah KT can be slower, especially on Intel - see https://github.com/Thireus/GGUF-Tool-Suite/tree/main/quants_graphs

@Thireus

Okay, with "-c 512 -b 4096 -ub 4096" the results are more down-to-earth.

llama-perplexity -m GLM-4.5-THIREUS-test3fix.gguf -c 512 -b 4096 -ub 4096 -f wiki.test.raw -fa -fmoe -ctk f16 -ot "exps=CPU" -ngl 50 --seed 1337 --threads 8

[1]2.5972,[2]3.3418,[3]2.5321,[4]2.1730,[5]2.0564,[6]1.9767,[7]1.9867,[8]1.9368,[9]1.9744,[10]1.9370,[11]1.9426,[12]2.0704,[13]2.0989,[14]2.1349,[15]2.2696,[16]2.3646,[17]2.4922,[18]2.6856,[19]2.6841,[20]2.6651,[21]2.7263,[22]2.7287,[23]2.6923,[24]2.6467,[25]2.6115,[26]2.6019,[27]2.5931,[28]2.6334,[29]2.6694,[30]2.7380,[31]2.7986,[32]2.8524,[33]2.9098,[34]2.9481,[35]3.0197,[36]3.0604,[37]3.0768,[38]3.1496,[39]3.1910,[40]3.2285,[41]3.3004,[42]3.3054,[43]3.3273,[44]3.3602,[45]3.4470,[46]3.5082,[47]3.4595,[48]3.3992,[49]3.3493,[50]3.3296,[51]3.3528,[52]3.3753,[53]3.4091,[54]3.4143,[55]3.4300,[56]3.4581,[57]3.4389,[58]3.4435,[59]3.4473,[60]3.4825,[61]3.5173,[62]3.5676,[63]3.5984,[64]3.6157,[65]3.6175,[66]3.6003,[67]3.5708,[68]3.5486,[69]3.5665,[70]3.5537,[71]3.5326,[72]3.5329,[73]3.5411,[74]3.5686,[75]3.5723,[76]3.5302,[77]3.4918,[78]3.4511,[79]3.4160,[80]3.3739,[81]3.3457,[82]3.3212,[83]3.3092,[84]3.2888,[85]3.2612,[86]3.2499,[87]3.2309,[88]3.2272,[89]3.1991,[90]3.1732,[91]3.1496,[92]3.1300,[93]3.1113,[94]3.0955,[95]3.0709,[96]3.0617,[97]3.0762,[98]3.0634,[99]3.0466,[100]3.0370,[101]3.0515,[102]3.0504,[103]3.0550,[104]3.0543,[105]3.0727,[106]3.0979,[107]3.1547,[108]3.1674,[109]3.1773,[110]3.2164,[111]3.2415,[112]3.2197,[113]3.1966,[114]3.1801,[115]3.1613,[116]3.1507,[117]3.1391,[118]3.1398,[119]3.1279,[120]3.1193,[121]3.1096,[122]3.0993,[123]3.0810,[124]3.0662,[125]3.0510,[126]3.0359,[127]3.0228,[128]3.0163,[129]3.0139,[130]3.0030,[131]2.9970,[132]2.9896,[133]2.9879,[134]2.9976,[135]3.0158,[136]3.0077,[137]3.0073,[138]2.9954,[139]2.9881,[140]3.0018,[141]3.0014,[142]3.0015,[143]2.9993,[144]2.9987,[145]2.9991,[146]2.9969,[147]2.9897,[148]2.9900,[149]2.9871,[150]2.9896,[151]2.9834,[152]2.9805,[153]2.9851,[154]2.9826,[155]2.9826,[156]2.9856,[157]2.9873,[158]2.9882,[159]3.0007,[160]3.0113,[161]3.0158,[162]3.0133,[163]3.0091,[164]3.0193,[165]3.0248,[166]3.0446,[167]3.0660,[168]3.0747,[169]3.1022,[170]3.1211,[171]3.1332,[172]3.1582,[173]3.1497,[174]3.1342,[175]3.1195,[176]3.1067,
[177]3.0944,[178]3.0801,[179]3.0659,[180]3.0515,[181]3.0479,[182]3.0629,[183]3.0820,[184]3.1037,[185]3.1208,[186]3.1290,[187]3.1482,[188]3.1739,[189]3.1936,[190]3.2068,[191]3.2217,[192]3.2284,[193]3.2336,[194]3.2366,[195]3.2331,[196]3.2354,[197]3.2449,[198]3.2605,[199]3.2570,[200]3.2601,[201]3.2613,[202]3.2623,[203]3.2585,[204]3.2667,[205]3.2740,[206]3.2799,[207]3.2824,[208]3.2832,[209]3.2842,[210]3.2812,[211]3.2847,[212]3.2839,[213]3.2837,[214]3.2846,[215]3.2865,[216]3.2867,[217]3.2876,[218]3.2966,[219]3.2914,[220]3.2895,[221]3.2880,[222]3.2882,[223]3.2883,[224]3.2918,[225]3.2926,[226]3.2983,[227]3.2946,[228]3.2909,[229]3.2795,[230]3.2711,[231]3.2668,[232]3.2673,[233]3.2686,[234]3.2667,[235]3.2606,[236]3.2626,[237]3.2596,[238]3.2616,[239]3.2721,[240]3.2852,[241]3.2950,[242]3.3046,[243]3.3159,[244]3.3264,[245]3.3375,[246]3.3481,[247]3.3606,[248]3.3672,[249]3.3692,[250]3.3692,[251]3.3574,[252]3.3485,[253]3.3430,[254]3.3423,[255]3.3443,[256]3.3507,[257]3.3529,[258]3.3527,[259]3.3538,[260]3.3536,[261]3.3543,[262]3.3570,[263]3.3562,[264]3.3550,[265]3.3545,[266]3.3565,[267]3.3552,[268]3.3536,[269]3.3540,[270]3.3604,[271]3.3605,[272]3.3569,[273]3.3567,[274]3.3496,[275]3.3436,[276]3.3321,[277]3.3244,[278]3.3180,[279]3.3204,[280]3.3261,[281]3.3303,[282]3.3385,[283]3.3462,[284]3.3499,[285]3.3550,[286]3.3638,[287]3.3762,[288]3.3755,[289]3.3753,[290]3.3775,[291]3.3796,[292]3.3741,[293]3.3644,[294]3.3584,[295]3.3552,[296]3.3488,[297]3.3414,[298]3.3343,[299]3.3281,[300]3.3227,[301]3.3210,[302]3.3133,[303]3.3091,[304]3.3005,[305]3.2927,[306]3.2888,[307]3.2878,[308]3.2927,[309]3.3035,[310]3.2925,[311]3.2849,[312]3.2780,[313]3.2741,[314]3.2696,[315]3.2676,[316]3.2622,[317]3.2576,[318]3.2533,[319]3.2481,[320]3.2449,[321]3.2410,[322]3.2392,[323]3.2320,[324]3.2276,[325]3.2258,[326]3.2217,[327]3.2230,[328]3.2215,[329]3.2213,[330]3.2194,[331]3.2174,[332]3.2206,[333]3.2242,[334]3.2277,[335]3.2297,[336]3.2298,[337]3.2307,[338]3.2309,[339]3.2311,[340]3.2333,[341]3.2349,[342]3.2377,
[343]3.2445,[344]3.2509,[345]3.2613,[346]3.2617,[347]3.2543,[348]3.2499,[349]3.2505,[350]3.2440,[351]3.2361,[352]3.2305,[353]3.2297,[354]3.2330,[355]3.2402,[356]3.2520,[357]3.2562,[358]3.2599,[359]3.2677,[360]3.2764,[361]3.2777,[362]3.2825,[363]3.2863,[364]3.2922,[365]3.2945,[366]3.2986,[367]3.3031,[368]3.3058,[369]3.3126,[370]3.3180,[371]3.3214,[372]3.3306,[373]3.3411,[374]3.3483,[375]3.3512,[376]3.3555,[377]3.3586,[378]3.3697,[379]3.3805,[380]3.3825,[381]3.3797,[382]3.3776,[383]3.3797,[384]3.3859,[385]3.3894,[386]3.3943,[387]3.3976,[388]3.4016,[389]3.4071,[390]3.4086,[391]3.3993,[392]3.3924,[393]3.3841,[394]3.3813,[395]3.3748,[396]3.3693,[397]3.3621,[398]3.3538,[399]3.3475,[400]3.3394,[401]3.3316,[402]3.3278,[403]3.3191,[404]3.3111,[405]3.3057,[406]3.2976,[407]3.2897,[408]3.2812,[409]3.2741,[410]3.2673,[411]3.2609,[412]3.2566,[413]3.2570,[414]3.2526,[415]3.2495,[416]3.2470,[417]3.2398,[418]3.2325,[419]3.2382,[420]3.2326,[421]3.2302,[422]3.2319,[423]3.2277,[424]3.2226,[425]3.2201,[426]3.2185,[427]3.2158,[428]3.2128,[429]3.2087,[430]3.2053,[431]3.2056,[432]3.2012,[433]3.1957,[434]3.1897,[435]3.1867,[436]3.1792,[437]3.1728,[438]3.1688,[439]3.1685,[440]3.1692,[441]3.1692,[442]3.1687,[443]3.1745,[444]3.1834,[445]3.1810,[446]3.1787,[447]3.1783,[448]3.1773,[449]3.1817,[450]3.1824,[451]3.1824,[452]3.1860,[453]3.1931,[454]3.1959,[455]3.1974,[456]3.2002,[457]3.1995,[458]3.2026,[459]3.2040,[460]3.2088,[461]3.2130,[462]3.2151,[463]3.2147,[464]3.2122,[465]3.2103,[466]3.2164,[467]3.2160,[468]3.2156,[469]3.2212,[470]3.2231,[471]3.2272,[472]3.2321,[473]3.2340,[474]3.2337,[475]3.2356,[476]3.2373,[477]3.2402,[478]3.2410,[479]3.2430,[480]3.2445,[481]3.2481,[482]3.2501,[483]3.2535,[484]3.2496,[485]3.2523,[486]3.2521,[487]3.2578,[488]3.2629,[489]3.2680,[490]3.2686,[491]3.2731,[492]3.2767,[493]3.2799,[494]3.2847,[495]3.2901,[496]3.2899,[497]3.2899,[498]3.2910,[499]3.2925,[500]3.2946,[501]3.2948,[502]3.2952,[503]3.2999,[504]3.3052,[505]3.3051,[506]3.3056,[507]3.3076,[508]3.3120,[509]3.3188,
[510]3.3204,[511]3.3253,[512]3.3192,[513]3.3130,[514]3.3080,[515]3.3082,[516]3.3061,[517]3.3051,[518]3.3038,[519]3.2995,[520]3.2980,[521]3.2970,[522]3.2933,[523]3.2916,[524]3.2930,[525]3.2924,[526]3.2907,[527]3.2924,[528]3.2882,[529]3.2831,[530]3.2785,[531]3.2751,[532]3.2752,[533]3.2732,[534]3.2719,[535]3.2691,[536]3.2657,[537]3.2600,[538]3.2550,[539]3.2490,[540]3.2467,[541]3.2472,[542]3.2449,[543]3.2411,[544]3.2411,[545]3.2378,[546]3.2363,[547]3.2352,[548]3.2325,[549]3.2280,[550]3.2227,[551]3.2173,[552]3.2123,[553]3.2085,[554]3.2050,[555]3.1995,[556]3.1954,[557]3.1914,[558]3.1923,[559]3.1898,[560]3.1894,[561]3.1902,[562]3.1938,[563]3.1984,[564]3.2017,[565]3.2007,
Final estimate: PPL = 3.2007 +/- 0.01770

This is how the recipe looks for updated GGUF-Tool-Suite. It creates identical file.

## Quant mix recipe created using Thireus' GGUF Tool Suite - https://gguf.thireus.com/

## Model head & embeddings β€” qbits: 32 8 6 
^output_norm\.weight$=f32
^token_embd\.weight$=iq6_k
^output\.weight$=q8_0

## Multi-headed attention parameters β€” qbits: 32 8 4 
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight$=q8_0
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.weight$=iq4_ks
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.weight$=iq4_ks
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_v\.weight$=iq4_ks
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_q_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k_norm\.weight$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_k\.bias$=f32
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_norm\.weight$=f32

## Core FFN weights β€” qbits: 32 8 5 4 
^blk\.[0-2]\.ffn_gate\.weight$=iq4_k
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.ffn_gate_inp\.weight$=f32
^blk\.[0-1]\.ffn_down\.weight$=iq4_k
^blk\.2\.ffn_down\.weight$=q8_0
^blk\.1\.ffn_up\.weight$=iq4_ks
^blk\.(0|2)\.ffn_up\.weight$=iq5_k

## Other tensors β€” qbits: 32 4 
^blk\.92\.nextn\.shared_head_norm\.weight$=f32
^blk\.92\.nextn\.enorm\.weight$=f32
^blk\.92\.nextn\.eh_proj\.weight$=iq4_kss
^blk\.92\.nextn\.embed_tokens\.weight$=iq4_kss
^blk\.([0-9]|[1-8][0-9]|9[0-2])\.post_attention_norm\.weight$=f32
^blk\.([3-9]|[1-8][0-9]|9[0-2])\.exp_probs_b\.bias$=f32
^blk\.92\.nextn\.shared_head_head\.weight$=iq4_kss
^blk\.92\.nextn\.hnorm\.weight$=f32

## GPU-loaded ffn_*_shexp
# ffn_down_shexp (down-projection) β€” qbits: 8 6 5 4 
^blk\.(26|2[8-9]|30|32|91)\.ffn_down_shexp\.weight$=q8_0
^blk\.(19|23|42)\.ffn_down_shexp\.weight$=iq6_k
^blk\.(3|7|9|1[0-5]|18|37|41|84)\.ffn_down_shexp\.weight$=iq5_k
^blk\.(2[0-2]|2[4-5]|27|34|36|40|7[7-8]|8[6-7])\.ffn_down_shexp\.weight$=iq4_ks
^blk\.([4-6]|8|1[6-7]|31|33|35|3[8-9]|[5-6][0-9]|4[3-9]|7[0-6]|79|8[0-3]|85|8[8-9]|90|92)\.ffn_down_shexp\.weight$=iq4_k

# ffn_up_shexp (up-projection) β€” qbits: 8 6 5 4 
^blk\.(9|11|3[1-2]|35|3[8-9]|84|91)\.ffn_up_shexp\.weight$=q8_0
^blk\.(3|7|24|26)\.ffn_up_shexp\.weight$=iq6_k
^blk\.(4|1[2-5]|23|28|34|43|45|55|90)\.ffn_up_shexp\.weight$=iq5_k
^blk\.(5|8|10|16|1[8-9]|2[0-2]|29|30|33|37|40|4[6-9]|5[0-4]|6[0-9]|5[6-9]|7[0-5]|7[7-9]|8[0-2]|86|8[8-9]|92)\.ffn_up_shexp\.weight$=iq4_k
^blk\.(6|17|25|27|36|4[1-2]|44|76|83|85|87)\.ffn_up_shexp\.weight$=iq4_ks

# ffn_gate_shexp (gate-projection) β€” qbits: 8 6 5 4 
^blk\.(3|14|25|30|33|36|46|86|9[0-1])\.ffn_gate_shexp\.weight$=q8_0
^blk\.60\.ffn_gate_shexp\.weight$=iq6_k
^blk\.(5|9|10|12|15|19|2[0-1]|23|35|41|44|68|8[4-5]|88)\.ffn_gate_shexp\.weight$=iq5_k
^blk\.(17|22|27|3[1-2]|3[7-9]|40|47|7[5-6])\.ffn_gate_shexp\.weight$=iq4_ks
^blk\.(4|[6-8]|11|13|16|18|24|26|2[8-9]|34|4[2-3]|45|5[0-9]|4[8-9]|6[1-7]|69|7[0-4]|7[7-9]|8[0-3]|87|89|92)\.ffn_gate_shexp\.weight$=iq4_k

## CPU-friendly ffn_*_exps
# ffn_down_exps (down-extraction) β€” qbits: 5 4 3 2 
^blk\.(5|22|61)\.ffn_down_exps\.weight$=iq5_ks_r4
^blk\.([3-4]|11|18|3[0-1]|35|92)\.ffn_down_exps\.weight$=iq5_k_r4
^blk\.(6|15|25|4[3-5]|4[7-9]|5[2-3]|55|57|60|68|71)\.ffn_down_exps\.weight$=iq4_kss
^blk\.(8|14|46|56)\.ffn_down_exps\.weight$=iq4_ks_r4
^blk\.(13|19|24|37|42|5[0-1]|54|59|6[2-6]|7[2-6]|78|8[1-2])\.ffn_down_exps\.weight$=iq3_k_r4
^blk\.(7|16|28|33|41|58|67|69|70|77|79|80|8[4-9]|9[0-1])\.ffn_down_exps\.weight$=iq3_ks
^blk\.(10|12|17|23|26|29|32|34|36|39)\.ffn_down_exps\.weight$=iq2_ks
^blk\.(9|2[0-1]|27|38|40|83)\.ffn_down_exps\.weight$=iq2_k_r4

# ffn_up_exps (up-extraction) β€” qbits: 5 4 3 2 
^blk\.(19|28|92)\.ffn_up_exps\.weight$=iq5_k_r4
^blk\.(11|15)\.ffn_up_exps\.weight$=iq4_kss
^blk\.20\.ffn_up_exps\.weight$=iq4_ks_r4
^blk\.(1[3-4]|1[6-7]|2[3-4]|3[0-1]|41|43|4[6-8]|5[0-6]|58|6[0-1]|6[4-5]|6[7-9]|7[0-6]|78)\.ffn_up_exps\.weight$=iq3_k_r4
^blk\.([3-4]|12|21|39|42|4[4-5]|49|57|59|6[2-3]|66|77|79|8[0-2]|8[4-9]|9[0-1])\.ffn_up_exps\.weight$=iq3_ks
^blk\.(5|7|9|10|18|25|27|29|32|3[4-5]|3[7-8]|40)\.ffn_up_exps\.weight$=iq2_ks
^blk\.(6|8|22|26|33|36|83)\.ffn_up_exps\.weight$=iq2_k_r4

# ffn_gate_exps (gate-extraction) β€” qbits: 5 4 3 2 
^blk\.(19|28|92)\.ffn_gate_exps\.weight$=iq5_k_r4
^blk\.(11|15)\.ffn_gate_exps\.weight$=iq4_kss
^blk\.20\.ffn_gate_exps\.weight$=iq4_ks_r4
^blk\.(1[3-4]|1[6-7]|2[3-4]|3[0-1]|41|43|4[6-8]|5[0-6]|58|6[0-1]|6[4-5]|6[7-9]|7[0-6]|78)\.ffn_gate_exps\.weight$=iq3_k_r4
^blk\.([3-4]|12|21|39|42|4[4-5]|49|57|59|6[2-3]|66|77|79|8[0-2]|8[4-9]|9[0-1])\.ffn_gate_exps\.weight$=iq3_ks
^blk\.(5|7|9|10|18|25|27|29|32|3[4-5]|3[7-8]|40)\.ffn_gate_exps\.weight$=iq2_ks
^blk\.(6|8|22|26|33|36|83)\.ffn_gate_exps\.weight$=iq2_k_r4

## Summary of tensor sizes per class
# GPU Total: 12.404 GiB (90.9%) | 13.65 GiB max, if all were q8_0 | 11.93 GiB min, if all were iq4_ks
# CPU Total: 131.155 GiB (60.1%) | 218.28 GiB max, if all were iq5_k_r4 | 88.72 GiB min, if all were iq2_ks
# GPU+CPU Total: 143.559 GiB (75.5%)

## Summary of tensor counts and bpw per qtype
#
# GPU-loaded quants:
# QTYPE		Count	BPW	Assigned GiB	% Assigned	Max GiB (all)
# +f32       	835	32.0  	  0.28 GiB	-		-
# +q8_0      	94 	8.5   	  6.56 GiB	-		-
# q8_0      	26 	8.5   	  0.26 GiB	7.5%		3.43
# iq6_k     	9  	6.625 	  0.65 GiB	24.2%		2.67
# iq5_k     	43 	5.5   	  0.29 GiB	12.9%		2.22
# iq4_k     	164	4.5   	  0.82 GiB	45.2%		1.82
# +iq4_ks    	279	4.25  	  3.38 GiB	-		-
# iq4_ks    	38 	4.25  	  0.18 GiB	10.2%		1.71
#
# CPU-friendly quants:
# QTYPE		Count	BPW	Assigned GiB	% Assigned	Max GiB (all)
# +iq5_k_r4  	3  	5.5   	  2.42 GiB	-		-
# iq5_k_r4  	11 	5.5   	  8.86 GiB	4.1%		215.11
# iq5_ks_r4 	3  	5.25  	  2.31 GiB	1.1%		205.33
# iq4_k     	0  	4.5   	  0.00 GiB	0.0%		176.00
# iq4_ks_r4 	6  	4.25  	  3.74 GiB	2.2%		166.22
# +iq4_kss   	3  	4.0   	  0.75 GiB	-		-
# iq4_kss   	20 	4.0   	 11.72 GiB	7.5%		156.45
# iq3_k_r4  	94 	3.4375	 47.33 GiB	35.2%		134.45
# iq3_ks    	74 	3.1875	 34.55 GiB	27.7%		124.67
# iq2_k_r4  	21 	2.375 	  7.31 GiB	7.9%		92.89
# iq2_ks    	38 	2.1875	 12.18 GiB	14.2%		85.56
#
# -Average BPW: 3.4413
#
# -Notes:
# - '+' means user-defined pre-assigned tensors, or tensor missing from csv data or f32 tensors
# - Script SHA-256: 555c369a7691ac7edfbb245ead4013c5b4284437e7d5dbe98d450870c9db71ef
# - Calibration dataset 'ppl_results.csv' SHA-256: 319798956511f81ea6fbf4292a2733927adeaff4edc8dc056e1c0797fcf22358
# - tensors.bf16.map SHA-256: 4e8b7b435f6257174a7adfc90290ac92c36758fef201ba0f5358338eea7606b8
# - tensors.bf16.map model name: GLM-4.5-THIREUS-BF16-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq5_k_r4.map SHA-256: c13be13e7b38a6206c03c5c77d5fda27e281b38fff35a1cbe26c024d8cc98741
# - tensors.iq5_k_r4.map model name: GLM-4.5-THIREUS-IQ5_K_R4-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq5_ks_r4.map SHA-256: af4fb823a9b77217346ee4757708588a750e309dbcdebf8c9066728c732a6385
# - tensors.iq5_ks_r4.map model name: GLM-4.5-THIREUS-IQ5_KS_R4-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq4_k.map SHA-256: deae5223105990b50335e53fc0d850e5739e3c9a49ff1a7f7abf7a2747e3d78c
# - tensors.iq4_k.map model name: GLM-4.5-THIREUS-IQ4_K-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq4_ks_r4.map SHA-256: de7c261ff73467ffac8d6be3831a0a1edd3443cd88c746ba4786402539daf9e8
# - tensors.iq4_ks_r4.map model name: GLM-4.5-THIREUS-IQ4_KS_R4-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq4_kss.map SHA-256: 78644e76c921c329b6cf32d1c8711766170edea7e8960fcd3e9eb6d94601bc4b
# - tensors.iq4_kss.map model name: GLM-4.5-THIREUS-IQ4_KSS-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq3_k_r4.map SHA-256: 0f60dbf1522e6918a09b2035bdaaffdb98e73fde0c9ec5c45f2c21757ff0dfd0
# - tensors.iq3_k_r4.map model name: GLM-4.5-THIREUS-IQ3_K_R4-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq3_ks.map SHA-256: 4a33c7b3901cadf1a4e6130aeaed8806168249a0d386219db4ffec31188cb6af
# - tensors.iq3_ks.map model name: GLM-4.5-THIREUS-IQ3_KS-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq2_k_r4.map SHA-256: adae6ac545dccb5f4c2e92762fb9a59518f19844175cb68e80f2d2add27c1441
# - tensors.iq2_k_r4.map model name: GLM-4.5-THIREUS-IQ2_K_R4-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq2_ks.map SHA-256: 4ed9fc5a73d854ad30f9f75577a1d826150cb23283a0bb54dba45d6aba6c9de2
# - tensors.iq2_ks.map model name: GLM-4.5-THIREUS-IQ2_KS-SPECIAL_TENSOR-01762-of-01762
# - tensors.q8_0.map SHA-256: 2814e1547cf288d327264135ed3f83e612b879826640283037d45f95a22ebfe2
# - tensors.q8_0.map model name: GLM-4.5-THIREUS-Q8_0-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq6_k.map SHA-256: 8ef4d5c379126fc13dfb46bbc8c10308d2c8e78602c0b3f6cea197d963fc80f1
# - tensors.iq6_k.map model name: GLM-4.5-THIREUS-IQ6_K-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq5_k.map SHA-256: 3a994a079301788e106d98319189fd23b40220b6a6b290f3a5ddabc4e2a63bbc
# - tensors.iq5_k.map model name: GLM-4.5-THIREUS-IQ5_K-SPECIAL_TENSOR-01762-of-01762
# - tensors.iq4_ks.map SHA-256: d0b98943a74acb3cacb78bebf0865eb050c8dece3f4ee963fadca909bbc5bb1a
# - tensors.iq4_ks.map model name: GLM-4.5-THIREUS-IQ4_KS-SPECIAL_TENSOR-01762-of-01762
# - GPG signatures: DISABLED
# - Command used:
# quant_assign.py ppl_results.csv --tolerance 0.01 --cpu-tensors-max-size 131 --gpu-tensors-max-size 12.4 \
# --exponential-factor 1 --skip-gpg --cpu-tensors 'blk\.([3-9]|[1-8][0-9]|9[0-9])\.ffn_down_exps\.weight' \
# 'blk\.([3-9]|[1-8][0-9]|9[0-9])\.ffn_up_exps\.weight' 'blk\.([3-9]|[1-8][0-9]|9[0-9])\.ffn_gate_exps\.weight' \
# --gpu-tensors '.*' --cpu-quants iq5_k_r4 iq5_ks_r4 iq4_k iq4_ks_r4 iq4_kss iq3_k_r4 iq3_ks iq2_k_r4 iq2_ks \
# --gpu-quants q8_0 iq6_k iq5_k iq4_k iq4_ks --cpu-assign-tensors '^blk\.(92)\.ffn_down_exps\.weight=iq5_k_r4' \
# '^blk\.(92)\.ffn_up_exps\.weight=iq5_k_r4' '^blk\.(92)\.ffn_gate_exps\.weight=iq5_k_r4' \
# '^blk\.92\.nextn\.shared_head_head\.weight=iq4_kss' '^blk\.92\.nextn\.embed_tokens\.weight=iq4_kss' \
# '^blk\.92\.nextn\.eh_proj\.weight=iq4_kss' --gpu-assign-tensors '^output\.weight=q8_0' \
# '^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight=q8_0' --gpu-assign-qtype iq4_ks --harmonize-tensors \
# 'blk\..*\.ffn_up_exps.*,blk\..*\.ffn_gate_exps.*' --harmonization-technique 3

BTW, if you wonder why Script SHA-256 is different: since I use Windows, I had to fix quant_assign.py (change line 379 to cmd = ["bash", "tensor_downloader.sh", qtype.upper(), "0", TMP_DIR, f"tensors.{qtype}.map"])

My CPU is an AMD Ryzen 7700, so KT quants are definitely slower on AMD too, not just Intel. Also, the speedup is likely caused by _r4 (just using -rtr won't do anything with *_kt quants).

it is left to the user to double check that the quants used match the end result

Maybe quant_downloader.sh should generate a text report file with a section listing all non-matching quants? That, or a pre-assessment script.

@Thireus
Here's my thoughts about quant cooking with GGUF Tool Suite.


The greatest advantage of this curve-fitting method is that it automatically finds Super Experts among the Routed Experts and gives them the highest quant type from --cpu-quants.

For GLM4.5 these clearly are blocks [3, 4, 5, 11, 18, 19, 22, 28, 30, 31, 35, 61], from what I observed in every recipe generated.

Quality of these experts directly affects quality of the model output. That's why we need to provide not just several quant types around our target BPW, but also a couple of very high quants for Super Experts.

My target was 3.5bpw, so my initial recipe had --cpu-quants iq4_ks iq4_kss iq3_k iq3_ks. The result was mostly uniform, no matter the --exponential-factor, and PPL was on par with traditionally-made quants like IQ3_KT.

Then I added higher quants to the list - iq5_ks and iq4_k, plus a lower quant iq2_k to compensate. As a result, PPL decreased dramatically. All Super Experts took iq5_ks, while iq4_k was completely ignored (0 uses), and of course some other exps had to take this new iq2_k, because RAM budget was as limited as before.

Then I added iq5_k and iq2_ks to extrapolate the experiment further, and that's when I got the best recipe with very small PPL. Super Experts took iq5_k (a couple of them took iq5_ks), other exps had to take leftovers.
BTW, some exps consistently take lowest quants, but while "Super Experts" sure exist, I'm not so sure about "Unimportant Experts". It's likely just that wiki.test.raw and imatrix introduce biases in PPL calculation. Maybe those exps that always take iq2_k and iq2_ks are responsible for non-English stuff or something... Anyway, I don't want to add iq1_m to my precious, even if it helps with PPL.

Then I added iq6_k, and PPL got worse; with q8_0 it got much worse. The Super Experts did take q8_0, but everything else had to move to low quants for that feast to fit within the limited RAM. The average was still ~3.5 bpw, but the median was more like ~2.3 bpw.

I think the max quant in --cpu-quants should be about 1.7x your target BPW, and the min quant shouldn't be smaller than target BPW / 1.7.
You should also provide a "slightly smaller quant" (iq5_ks in my case) so there's still some competition among Super Experts, which makes things easier for the ordinary exps. Oh, and a "slightly higher quant" (iq2_k in my case) for the lowest exps.

That's the final picture for Routed Experts: 2 high quants, several quants around target BPW, 2 low quants.
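The "1.7x" guideline can be sketched as a simple filter over the quant ladder (BPW figures taken from the recipe summary earlier in this thread; the rule itself is just my heuristic, not a tool-suite feature):

```python
# BPW values as listed in the recipe summary earlier in this thread.
QTYPE_BPW = {
    "iq2_ks": 2.1875, "iq2_k": 2.375, "iq3_ks": 3.1875, "iq3_k": 3.4375,
    "iq4_kss": 4.0, "iq4_ks": 4.25, "iq4_k": 4.5,
    "iq5_ks": 5.25, "iq5_k": 5.5, "iq6_k": 6.625, "q8_0": 8.5,
}

# Keep only quant types whose bpw falls within [target/1.7, target*1.7].
def ladder(target_bpw, spread=1.7):
    lo, hi = target_bpw / spread, target_bpw * spread
    return [q for q, b in QTYPE_BPW.items() if lo <= b <= hi]

# For a 3.5 bpw target this keeps iq2_ks..iq5_k and rejects iq6_k/q8_0,
# consistent with the observation that adding iq6_k or q8_0 made PPL worse.
print(ladder(3.5))
```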

No need to cover the gap between the highest quants and the quants in the middle - iq4_k was never taken by any expert in my tests. And the gap between the middle and the 2 low quants is actually useful for fencing the middle exps from leaking lower under pressure from the Super Experts. When I removed the gap by adding iq3_kt, it was instantly used by many of the middle exps, but none of the lowest exps moved up; instead, one of the Super Experts went from iq5_ks to iq5_k, and PPL got worse.

And this is why I set --exponential-factor 1, because there's no need to further aggravate disparity.
In fact, when PPL doesn't meaningfully change between two tests, I choose the one with more uniformity in the quant types used. PPL is only a metric - better not to put all eggs in one basket, at least when the PPL difference is negligible.

For "GPU quants" the most important thing is to find a good size: attn quality should match the quality of the Routed Experts.
--gpu-tensors-max-size is not necessarily the size of your VRAM; in my case it's bigger and forces me to offload many layers to RAM, but I can endure some slowdown for a tangible quality increase.

Here's how I would suggest a newbie (like me a week ago) approach quant cooking with GGUF Tool Suite:

  • Determine your RAM+VRAM size, subtract 5-10 GB for OS and browser tabs, subtract a dozen GB for small context and batching, then calculate target BPW of your quant. In my case, I have 12GB VRAM + 160GB RAM, which leaves me with 155 GB for GGUF filesize (and even less if I want big context). GLM4.5 has 355B parameters, so 155/355 = 0.437 bytes = 3.5 bits per weight. This means my primary quant type for Routed Experts will be iq3_k (3.4375 BPW).

  • So my --cpu-quants will be iq4_kss iq3_k iq3_ks. Well, maybe also iq4_ks if I feel optimistic about the Super Experts' greediness. Then I add quants for the Super Experts - iq5_k iq5_ks, per the "1.7x" guideline. Then a dump for those unlucky exps that must suffer for the Super Experts to thrive - iq2_ks iq2_k. Don't add extremely low quant types unless you target below 2.5 BPW: low quants might get chosen because of their indifference to PPL, but the model could suffer in unexpected ways. Final command: --cpu-quants iq5_k_r4 iq5_ks_r4 iq4_ks_r4 iq4_kss iq3_k_r4 iq3_ks iq2_k_r4 iq2_ks. I've added _r4 wherever possible because in my case the exps will always be loaded into RAM, never into VRAM.

  • For --gpu-quants we must absolutely provide q8_0 for the important stuff and iq4_k or iq4_ks for the unimportant stuff - maybe even iq4_kss when your target is below 2.5 BPW. Then add iq6_k as a slightly lower quant for the important stuff, and iq5_k for good measure. Even when you're cooking a low-BPW quant, attn should be high quality. You can add something low like iq3_kt to --gpu-quants, but this low type shouldn't end up being used by many tensors. In fact, if the "GPU-loaded quants" section of the recipe shows the lowest quant type being used a lot, you're VRAM-bottlenecked and should increase the GPU budget until that low quant type is only used by a few tensors. Final command: --gpu-quants q8_0 iq6_k iq5_k iq4_k iq4_ks.

  • Don't use the old *_xs quants; use the newer *_ks types instead. ik_llama.cpp's newer quant types are always preferable.

  • Avoid having two quant types of the same BPW (for example, having both iq4_ks and iq4_xs makes the curve fitting seemingly randomly choose iq4_xs over iq4_ks, which results in worse PPL)

  • Use _r4 for CPU-loaded Routed Experts when possible, these types are faster. Don't use *_kt for CPU-loaded tensors, these types are slower. Do use *_kt for GPU-loaded tensors. But personally I don't use KT even in --gpu-quants, because my VRAM is so small I have to offload half of attn and ffn layers to CPU RAM.

  • Set --exponential-factor 1, so the curve fitting doesn't become brazen.

Additional tips for people with too much free time:

  • Some tensors don't have PPL data in the CSV file, so they use the default quant type, which is often too generic. If you have time for experiments, don't just tweak their type in bulk using --gpu-assign-qtype; instead, learn what they are by first creating a dummy recipe with something ridiculous like "--gpu-assign-qtype q8_kv_r8" (assuming q8_kv_r8 isn't used anywhere else in the recipe), and then tweak each detected tensor separately using --gpu-assign-tensors.

  • Some tensors are so small they can be left at f32, don't bother optimizing.

  • There's always a correlation between the quality of attention and experts. The big CPU budget is mostly fixed, defined by target BPW, but the small GPU budget can vary by a couple GB here and there (taking these gigabytes from the CPU budget, of course, but the CPU budget is usually so big it won't notice). So, try tweaking --gpu-tensors-max-size to find the best value that corresponds to your --cpu-tensors-max-size. The value isn't in percents; better to write it in gigabytes for more control.

  • All tensors have a certain threshold, above which they don't improve much. Sometimes there are several such thresholds (going below each threshold destroys more and more structures in the tensor). The key to a perfect custom quant is to find a point just above an acceptable threshold - for as many GPU-loaded tensors as possible, then bind them to this smallest viable quant type, and leave all other untested tensors at a good-enough quant type determined by curve fitting. So, if after some tests you have a hunch about the adequate quant type for a certain tensor (or group) - go ahead and pin this tensor with --gpu-assign-tensors or --cpu-assign-tensors, so that future tests won't affect the tensor, and the tensor won't affect future tests. Example: --gpu-assign-tensors "^output\.weight=q8_0" "^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight=q8_0" --cpu-assign-tensors "^blk\.(92)\.ffn_down_exps\.weight=iq5_k_r4" "^blk\.(92)\.ffn_up_exps\.weight=iq5_k_r4" "^blk\.(92)\.ffn_gate_exps\.weight=iq5_k_r4"
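Those tensor-name regexes are easy to get wrong, so it's worth sanity-checking one before baking it into a recipe. A quick way to do that in plain Python (the tensor names below are illustrative):

```python
import re

# Matches blk.0 .. blk.92 attn_output tensors, same pattern as the example above.
pattern = re.compile(r"^blk\.([0-9]|[1-8][0-9]|9[0-2])\.attn_output\.weight")

names = ["blk.0.attn_output.weight", "blk.92.attn_output.weight",
         "blk.93.attn_output.weight", "blk.9.ffn_up_exps.weight"]
print([n for n in names if pattern.match(n)])
# ['blk.0.attn_output.weight', 'blk.92.attn_output.weight']
```

Note the three-branch alternation is what caps the range at 92; a plain `[0-9]+` would silently pin layers you didn't intend.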

  • Every time you generate a new recipe, look at its "Summary" section before actually downloading shards and testing PPL. Some ideas can be rejected right there, before even checking PPL. Most of my ideas were born when looking at the Summary table and comparing it with reference recipe (Alt-Tabbing back and forth).

  • Don't test more than one idea simultaneously.


If you find any of these thoughts useful, feel free to add them to readme.md

@Aver0 , thank you so much for sharing your observations and tips! I think this is really useful and I will add a reference in the readme to your post.

Regarding change line 379, is it because bash wasn't used when the script was invoked? If so, I'll populate your change to the scripts to ensure this is always the case.

The GPU tensors could also be calibrated (PPL benchmarked) but would require a lot of additional benchmarking time, and because they are proportionally so small in size compared to the rest I always decide to leave them out. See "=locked" mentioned here: https://github.com/Thireus/GGUF-Tool-Suite/blob/main/benchmark_each_tensor.sh#L228-L246 for example.

Very interesting to read about potential "biases in PPL calculation" due to wiki.test.raw being used. I've been wondering exactly this, and have been discussing it with @ubergarm in separate threads - https://www.reddit.com/r/LocalLLaMA/comments/1ndibn1/comment/ndlo76j/. I've been wanting to use a different dataset to compute the PPL; if you think of any, please let me know. I'd really like to compare the ppl curve of the calibration data to see if it has the same shape across different datasets - if it does, then it would mean there is no bias (or at least that the biases are not significant enough to negatively affect the results).

Very glad to see you have a deep understanding of how the script functions and that you have dived into using advanced capabilities of the script such as setting --exponential-factor 1. That's impressive, and I would happily welcome any additional improvement suggestions beyond those mentioned in your last 2 replies - don't hesitate to create an issue on https://github.com/Thireus/GGUF-Tool-Suite/issues.

@Aver0 @Thireus

Interesting reading here for all the quant cookers for sure! It is def a fun albeit somewhat "dark art" haha...

@ArtusDev sent me a cool repo/paper to check out too, it looks like they are doing something similar to @Thireus by evaluating the layers to rank sensitivity to quantization: https://github.com/IST-DASLab/gptq-gguf-toolkit?tab=readme-ov-file#key-findings-1

@Aver0 , if you have some time could you please have a look at the new quant assign algorithm that @Stealt91 has implemented: https://github.com/Thireus/GGUF-Tool-Suite/discussions/23#discussioncomment-14654893

We are testing and comparing with the current algo. It looks very promising, but since you're the best person I know with excellent results on GLM I wanted to have your assessment as well. We have kld_results.txt data that is available for GLM-4.6.

So, the tl;dr is that:

  1. You can use quant_assign.py with kld_results.csv (the csv found in the GLM-4.6 model folder), which provides more accurate data about individual tensor degradation than ppl_results.txt
  2. See whether your methodology produces better results with kld_results.csv than --use-greedy-quant-assign, which @Stealt91 has implemented. But you'll need to ensure your recipes produce the same bpw (which can be tricky).

So far my observations are that his algo produces recipes that beat my results on GLM-4.6.

@Thireus
I will try doing it in a few days. Still haven't even tried GLM4.6. Sorry.

That's one epic thread there! I look forward to finally shifting from simple PPL to KLD and Top P.
The greedy algorithm is an interesting option too, though personally it feels precarious to rely on metrics so completely. But I wouldn't be surprised if it ends up being the better option to have by default, at least for casual users who want good-enough results asap.

I was skeptical about PPL since reading all those articles and discussions, you know, from unsloth, bartowski, and others.
I even thought the only reliable metric would be running actual benchmarks like SimpleQA/MMLU/Aider/etc - for each single tensor drop. Imagine how expensive that would be.
But then I realized even that heavyweight approach wouldn't be perfect, since the underlying strategy itself (evaluating tensors one-by-one) has the inherent limitation of ignoring synergy between tensors.
The synergy is negligible between different experts, so the method works very well for arranging exps into buckets. But it should work less well for distributing relative importance between attn and exps, for example. Or between up_exps and gate_exps of the same expert. Or between any other dependent tensors.
But that's the best we can have short of combinatorial explosion. So, with these limitations considered, a weighted mix of 3 simple metrics should be enough!
(For one, I'd try using 15% PPL/35% KLD/50% Top_P, ofc this needs to be tested a lot)
Any more complex metrics likely won't help, since the metric is no longer the bottleneck of the method. For pure PPL it was.
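A weighted mix like the one suggested above could look something like this. To be clear, this is my own illustrative sketch, not anything in quant_assign.py: the function name, the min-max normalization step, and the input shape are all assumptions, and the 15/35/50 weights are the untested guess from the post.

```python
def blend_metrics(ppl_deg, kld_deg, top_p_deg, weights=(0.15, 0.35, 0.50)):
    """Blend three per-tensor degradation dicts (name -> score) into a
    single ranking score. Each metric is min-max normalized first so
    that no single scale dominates the mix."""
    def normalize(d):
        lo, hi = min(d.values()), max(d.values())
        span = (hi - lo) or 1.0  # avoid division by zero on flat data
        return {k: (v - lo) / span for k, v in d.items()}
    norms = [normalize(d) for d in (ppl_deg, kld_deg, top_p_deg)]
    return {k: sum(w * m[k] for w, m in zip(weights, norms)) for k in ppl_deg}
```

The interesting knob here is the normalization: without it, KLD (typically tiny absolute values) would be drowned out by PPL deltas regardless of the chosen weights.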

Applying a low-pass filter to KLD results (smoothing the curve) feels like patching holes, honestly. But it could give minor improvements; maybe it should be an option in quant_assign.py.
Because we're already applying several patches to compensate for the strategy's limitations. Exponential-factor is one such patch. Harmonization is another (it accounts for synergy between up and gate of the same expert).
Maybe with enough tricks the strategy can approach the effectiveness of the brute-force method.
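For reference, the low-pass filter being debated here could be as simple as a centered moving average over the per-layer KLD values (a sketch under my own assumptions; the function name and window size are made up, not from the Tool Suite):

```python
def smooth_layers(kld_per_layer, window=3):
    """Centered moving average over per-layer degradation values;
    edges use a shrunken window instead of padding."""
    n = len(kld_per_layer)
    half = window // 2
    out = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        out.append(sum(kld_per_layer[lo:hi]) / (hi - lo))
    return out

# A single noisy spike at layer 2 gets spread over its neighbours:
print(smooth_layers([0.0, 0.0, 3.0, 0.0, 0.0]))  # [0.0, 1.0, 1.0, 1.0, 0.0]
```

Which is exactly the "patching holes" concern: the spike might be real sensitivity rather than measurement noise, and smoothing can't tell the difference.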

By the way, benchmarking attn (and in fact, every single tensor) would be really nice to have... Especially now that metrics are more reliable.
A full optimization of attn would likely save less than a couple GB, but it's VRAM saved, which is much more valuable than RAM. Very relevant to GPU-poor people who have to decrease -ngl in order to fit some context.

About having an overall budget for CPU+GPU: I still think there should be an option to let the user set the budgets separately.
It's yet another patch/trick to compensate for the fact that the IsolatedTensor-based metric doesn't provide enough information about group dynamics.
Meanwhile, the user can gather such information by benchmarking a few cooked recipes and learning about the sensitivity of a group of tensors, e.g. of the entire attn. That's what "my methodology" boils down to - just trying things and being really attentive to the results.
Actually, it would be great in the future to have a way for the user to feed PPL/KLD/TopP info from finished recipes back into the CSV database. Not sure what algorithm could use that information though. Sounds too advanced. But that's the only thing that would eliminate the need for manual tweaking.

Hey! I'm @Stealt91 from GitHub (different name here, GitHub account is ancient)

Those are some great insights @Aver0 . I've considered using a mix of metrics before, but the greedy algo I implemented needs absolute degradation to work properly, so PPL isn't suitable for this; KLD and same-top-P do work as metrics.

Smoothing the curve across layers is something that needs to be explored, and for smoothing between related tensors of the same layer, it appears that synergy there is so strong that assigning the same quant seems to give the best results. It's true that just smoothing data feels like patching holes, but at the same time it is very difficult to properly incorporate synergies for harmonizing quants across layers without doing so. Smoothing the curve might end up as a solution which, despite not being optimal, can get you to a 99% match of an optimal, obsessively handcrafted solution that takes days to narrow down.

Benchmarking attention is also something I was asking for, and if we can indeed confirm that harmonizing all related tensors is the optimal solution, then we can collect degradation data on those groups instead of individual tensors, making it much cheaper and faster to collect data for everything.

In terms of having combined or separate gpu/cpu budgets, this is already supported by my method! I am just not using this feature for eval as I want to have an algorithm which delivers optimal results out of the box. I have incorporated quant comparisons to estimate quant quality, the size differential between quantization methods applied to tensors, as well as the per-tensor degradation metric, to make a solid metric to assign quants without any need for further tweaking. I also incorporate exponents to account for non-linearities, so that degradation metrics of small tensors can be mapped to a space where they are about linear when compared to larger tensors (i.e. quanting 10 small tensors down and adding the degradation estimates closely compares to quanting down a single large tensor). So far in my experiments the method has worked very well, and even with manual tweaks only the most minor improvements compared to a fully automatic assignment of tensors could be made.

Still, the possibility for manual tweaks is open - you can pre-assign tensors, split the budget into gpu/cpu, pick harmonization, select different quant pools, etc. Feel free to play around with it!
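For readers following along, the core greedy idea being discussed can be sketched in a few lines. To be clear, this is not @Stealt91's actual implementation (his also folds in quant-quality comparisons and the non-linearity exponents mentioned above); it's just the basic "best degradation reduction per extra byte" loop, with all names and data shapes my own:

```python
import heapq

def greedy_assign(tensors, budget_bytes):
    """Greedy upgrade pass: start every tensor at its smallest option, then
    repeatedly take the upgrade with the best degradation reduction per
    extra byte until the budget is spent. 'tensors' maps name -> list of
    (size_bytes, degradation) options sorted from smallest to largest size.
    Illustrative only - marginal gain-per-byte is not globally optimal."""
    assign = {name: 0 for name in tensors}               # chosen option index
    used = sum(opts[0][0] for opts in tensors.values())  # baseline total size
    heap = []

    def push_upgrade(name, idx):
        options = tensors[name]
        if idx < len(options):
            s0, d0 = options[idx - 1]
            s1, d1 = options[idx]
            # max-gain-per-byte first, via negated key in a min-heap
            heapq.heappush(heap, (-(d0 - d1) / (s1 - s0), name, idx))

    for name in tensors:
        push_upgrade(name, 1)
    while heap:
        _, name, idx = heapq.heappop(heap)
        if assign[name] != idx - 1:
            continue  # stale heap entry
        extra = tensors[name][idx][0] - tensors[name][idx - 1][0]
        if used + extra > budget_bytes:
            continue  # can't afford this upgrade
        used += extra
        assign[name] = idx
        push_upgrade(name, idx + 1)
    return assign

# Two tensors, each with a cheap and an expensive option; 30-byte budget:
opts = {"a": [(10, 5.0), (20, 1.0)], "b": [(10, 2.0), (20, 1.5)]}
print(greedy_assign(opts, 30))  # upgrades "a" (best gain per byte), "b" stays small
```

The appeal over curve fitting is that this needs no exponential-factor style tuning, which matches the "optimal results out of the box" goal stated above.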

Hi, @LagOps ! I was obviously reading the github thread as well. Huge stuff going on, this is. Sorry can't join for now.

> Benchmarking attention is also something I was asking for and if we can indeed confirm that harmonizing all related tensors is the optimal solution, then we can collect degradation data on those groups instead of individual tensors, making it much cheaper and faster to collect data for everything.

Wait, but if we only benchmark the relative importance of whole layers, how is the algo going to know that e.g. blk.10.attn_k should be Q5 while blk.10.ffn_up_exps should be Q3, and blk.9.attn_k should be Q4 while blk.9.ffn_up_exps should be Q2? It'd only know that blk.10 should have 1.4x the bpw relative to blk.9.
Unless I'm missing something, there's still a need to discern between different tensors within groups.
For strictly harmonized ffn_up_exps/ffn_gate_exps, benchmarking can be unified of course. Not sure about ffn_down_exps.

Speaking of separate budgets for GPU/CPU: I mean, the real distinction isn't between GPU/CPU, but between hi-bpw/low-bpw groups. It's just that these splits often coincide: high-bpw tensors go to VRAM, low-bpw go to RAM. So the budget split is kinda artificial, not as grounded as the names imply - and the ratio definitely shouldn't depend on the user's hardware. For example, I have only 12GB VRAM, but I set the GPU budget to 12.4 for 3.5BPW quants, and then have to offload almost half of attn to RAM, as I aim for quality instead of speed.
The ability to manually split the budget into 2 groups is still a useful lever that comes in handy when the user knows something that the benchmarking CSV doesn't provide.

Only related tensors would be quanted to the same type, so attention would be one group, shared experts would be one group, and routed experts would be one group. So "blk.10.attn_k should be Q5 while blk.10.ffn_up_exps should be Q3" is absolutely a configuration you can (and should) run!

It's just that in testing, so far, it seems like quanting all tensors in a single group of a layer, at least for experts, to the same quant appears to be optimal. This is a bit surprising as common wisdom has it that ffn_down_exps should be quanted higher than ffn_up_exps/ffn_gate_exps. If one were quanting blindly, then indeed quanting ffn_down_exps higher gives the best performance boost for the size increase, but quanting important tensors together at a higher quant on some key layers appears to be superior so far during testing. The intuition is that since those tensors are all part of the same operation, having all 3 tensors of the same layer at a higher quant gives a higher uplift than having 3x individual tensor uplifts in different layers, even if per-tensor degradation values make the latter seem like the better choice.

Edit: further testing has shown that for quant mixes where only a few tensors can be uplifted to a higher quant level, no harmonization performs better. So for the large expert tensors, it makes sense to use individual per-tensor degradation, but for smaller tensors, such as attention or shared experts, it likely makes sense to group tensors together, as the individual tensors are small and there is less of a budget issue.

@LagOps

> It's just that in testing, so far, it seems like quanting all tensors in a single group of a layer, at least for experts, to the same quant appears to be optimal. This is a bit surprising as common wisdom has it that ffn_down_exps should be quanted higher than ffn_up_exps/ffn_gate_exps.

I too have found that for some models/quantization levels it seems to be fine to keep ffn_(down|gate|up)_exps all at the same level. In my quant collections I refer to this convention as smol-... e.g. smol-IQ4_KSS. In the "default" (no smol prefix), ffn_down is generally one size larger.
