testing IQ5_K
W790E Sage + QYFS + 512G + RTX5090
IQ5_K:
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 66846720 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 66846720 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_gate_exps.weight (size = 865075200 bytes) -- ignoring
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_down_exps.weight (size = 1042022400 bytes) -- ignoring
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_up_exps.weight (size = 865075200 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 55705600 bytes) -- ignoring
model has unused tensor blk.92.nextn.embed_tokens.weight (size = 642580480 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_head.weight (size = 642580480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors: CPU buffer size = 41144.77 MiB
llm_load_tensors: CPU buffer size = 42960.02 MiB
llm_load_tensors: CPU buffer size = 43128.77 MiB
llm_load_tensors: CPU buffer size = 43785.02 MiB
llm_load_tensors: CPU buffer size = 43128.77 MiB
llm_load_tensors: CPU buffer size = 35970.83 MiB
llm_load_tensors: CPU buffer size = 612.81 MiB
llm_load_tensors: CUDA0 buffer size = 16308.63 MiB
....................................................................................................
MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
llama_new_context_with_model: n_ctx = 58368
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 11264.67 MiB
llama_new_context_with_model: KV self size = 11264.62 MiB, K (q8_0): 5632.31 MiB, V (q8_0): 5632.31 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2905.77 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 992.05 MiB
llama_new_context_with_model: graph nodes = 4457
llama_new_context_with_model: graph splits = 180
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
main: n_kv_max = 58368, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 21.111 | 194.02 | 109.352 | 9.36 |
| 4096 | 1024 | 4096 | 21.349 | 191.86 | 97.248 | 10.53 |
| 4096 | 1024 | 8192 | 22.908 | 178.80 | 115.277 | 8.88 |
| 4096 | 1024 | 12288 | 22.586 | 181.35 | 133.793 | 7.65 |
| 4096 | 1024 | 16384 | 22.803 | 179.62 | 141.520 | 7.24 |
I have some questions:
- "model has unused tensor blk.92 -- ignoring" -- is this correct?
- How can I increase the context size? The maximum for my workstation is less than 60K.
Hmm, I'm curious about the QYFS + 512 GB DDR5 + Sage SE combo... Have you tried DeepSeek (the latest V3 or R1)? I'd like to know the token generation speeds you get from, say, an IQ4 quant at 32K context.
DeepSeek-V3-0324-IQ4_K_R4
main: n_kv_max = 65536, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 42.215 | 96.88 | 97.254 | 10.51 |
| 4090 | 1022 | 4090 | 43.060 | 94.98 | 95.129 | 10.74 |
| 4090 | 1022 | 8180 | 43.872 | 93.23 | 103.784 | 9.85 |
| 4090 | 1022 | 12270 | 44.895 | 91.10 | 105.761 | 9.66 |
DeepSeek-R1-0528-IQ4_KS_R4
main: n_kv_max = 65536, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 39.589 | 103.31 | 91.805 | 11.13 |
| 4090 | 1022 | 4090 | 40.485 | 101.03 | 98.592 | 10.37 |
| 4090 | 1022 | 8180 | 41.546 | 98.44 | 104.753 | 9.76 |
| 4090 | 1022 | 12270 | 42.236 | 96.84 | 108.134 | 9.45 |
GLM 4.5 IQ4_K
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 52101120 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 52101120 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_gate_exps.weight (size = 707788800 bytes) -- ignoring
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_down_exps.weight (size = 865075200 bytes) -- ignoring
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_up_exps.weight (size = 707788800 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 6512640 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 6512640 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 55705600 bytes) -- ignoring
model has unused tensor blk.92.nextn.embed_tokens.weight (size = 533463040 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_head.weight (size = 533463040 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors: CPU buffer size = 40746.39 MiB
llm_load_tensors: CPU buffer size = 42230.00 MiB
llm_load_tensors: CPU buffer size = 42380.00 MiB
llm_load_tensors: CPU buffer size = 42905.00 MiB
llm_load_tensors: CPU buffer size = 37416.98 MiB
llm_load_tensors: CPU buffer size = 416.25 MiB
llm_load_tensors: CUDA0 buffer size = 13323.87 MiB
....................................................................................................
MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
llama_new_context_with_model: n_ctx = 68352
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 13191.51 MiB
llama_new_context_with_model: KV self size = 13191.47 MiB, K (q8_0): 6595.73 MiB, V (q8_0): 6595.73 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2978.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1148.05 MiB
llama_new_context_with_model: graph nodes = 4457
llama_new_context_with_model: graph splits = 180
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
main: n_kv_max = 68352, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 20.941 | 195.59 | 96.096 | 10.66 |
| 4096 | 1024 | 4096 | 21.013 | 194.93 | 81.857 | 12.51 |
| 4096 | 1024 | 8192 | 21.443 | 191.01 | 109.799 | 9.33 |
| 4096 | 1024 | 12288 | 22.150 | 184.92 | 116.912 | 8.76 |
| 4096 | 1024 | 16384 | 22.374 | 183.07 | 122.584 | 8.35 |
Hmm... I thought GLM, being relatively smaller, would have higher TG speed than the DeepSeeks.
model has unused tensor blk.92 -- ignoring ? is this correct?
Yes, for both GLM models the last layer is set with the new TENSOR_SKIP flag and is not involved in inference. It only takes up space on disk and is not loaded into VRAM/RAM. This layer holds the NextN MTP tensors, which are not used anywhere yet but may be in the future.
How can I increase the context size, the maximum for my workstation is less than 60K
You can use -ctk q6_0 -ctv q6_0 to free up some more VRAM to fit more context, for example, or offload fewer extra exps layers to VRAM, etc.
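To put rough numbers on that, here is a sketch of the savings against the IQ5_K log above, assuming q8_0 costs about 8.5 bits per cached value (34 bytes per 32-value block) and q6_0 about 6.5 (26 bytes per block):

```python
# Rough sketch: KV-cache savings from q8_0 -> q6_0 cache quantization.
# Assumes q8_0 ~ 8.5 bits/value and q6_0 ~ 6.5 bits/value (incl. scales).
Q8_0_BPV = 8.5
Q6_0_BPV = 6.5

kv_q8_mib = 11264.62  # "KV self size" from the IQ5_K log above
kv_q6_mib = kv_q8_mib * Q6_0_BPV / Q8_0_BPV
print(f"q6_0 KV cache: {kv_q6_mib:.0f} MiB, freeing {kv_q8_mib - kv_q6_mib:.0f} MiB")

# With a fixed KV-cache VRAM budget, context scales inversely with bits/value:
n_ctx_q8 = 58368
print(f"the same budget fits roughly {int(n_ctx_q8 * Q8_0_BPV / Q6_0_BPV)} tokens")
```

The real gain ends up a bit smaller than this naive ratio because the compute buffers also grow with context.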
hmm.... I thought the GLM being relatively less in size would make it have more TG speed than the Deepseeks
DeepSeek has about 37B active parameters and GLM-4.5 about 32B, so for TG GLM should be slightly faster, assuming memory bandwidth is the limit and the quantization sizes are similar.
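A memory-bandwidth back-of-envelope supports that. The bits/weight and bandwidth figures here are my assumptions, not measured, and this ignores the experts held in VRAM, so treat it as a loose ceiling:

```python
# TG ceiling if every active weight must be streamed from RAM once per token.
# Assumptions: ~4.5 bits/weight for IQ4-class quants, ~250 GB/s effective
# bandwidth for 8-channel DDR5.
def tg_ceiling(active_params_b, bits_per_weight=4.5, bw_gb_s=250.0):
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bw_gb_s * 1e9 / bytes_per_token

print(f"DeepSeek, ~37B active: {tg_ceiling(37):.1f} t/s")
print(f"GLM-4.5, ~32B active:  {tg_ceiling(32):.1f} t/s")
```

The measured results above sit somewhat under these ceilings, and the expected gap between the two models is roughly the 37/32 ratio.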
I'd have to see the full command from @shewin (including how many GPUs, the system CPU and RAM speed, and whether it's Linux or Windows) to know for sure. In general I no longer use -rtr; instead I omit it and use -ub 4096 -b 4096, which can really boost PP speed. However, leaving the batch sizes at default, going with -rtr, and using the extra VRAM to offload a few more exps layers may improve TG on GLM-4.5, though I have not tested that.
With -ctk q6_0 -ctv q6_0, context size increased by 10-20K:
IQ5_K:
main: n_kv_max = 71168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 21.294 | 192.35 | 102.750 | 9.97 |
| 4096 | 1024 | 4096 | 21.940 | 186.69 | 107.686 | 9.51 |
| 4096 | 1024 | 8192 | 21.658 | 189.12 | 129.563 | 7.90 |
| 4096 | 1024 | 12288 | 22.404 | 182.83 | 136.833 | 7.48 |
| 4096 | 1024 | 16384 | 22.742 | 180.11 | 148.500 | 6.90 |
IQ4_K:
main: n_kv_max = 90112, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 21.029 | 194.78 | 99.277 | 10.31 |
| 4096 | 1024 | 4096 | 21.703 | 188.73 | 104.923 | 9.76 |
| 4096 | 1024 | 8192 | 22.113 | 185.23 | 111.030 | 9.22 |
| 4096 | 1024 | 12288 | 22.666 | 180.71 | 121.368 | 8.44 |
| 4096 | 1024 | 16384 | 22.762 | 179.95 | 126.116 | 8.12 |
main: n_kv_max = 131072, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, n_gpu_layers = 999, n_threads = 57, n_threads_batch = 57
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 84.193 | 97.30 | 205.887 | 9.95 |
| 8192 | 2048 | 8192 | 40.191 | 203.82 | 222.147 | 9.22 |
| 8192 | 2048 | 16384 | 42.153 | 194.34 | 237.941 | 8.61 |
| 8192 | 2048 | 24576 | 44.176 | 185.44 | 254.582 | 8.04 |
I also get similar numbers with a similar config using anikifoss' HQ4 quant. I stopped there because I did not want to wait.
@ubergarm
Do you know if MLA could potentially be supported for GLM models? Does this play a role in TG speeds?
n_batch = 8192, n_ubatch = 8192
Some folks have reported success with the larger batch sizes, but others have had issues going higher. I usually stick to 4096 or below, but it looks like you got it working there!
Do you know if MLA could potentially be supported for GLM models? Does this play a role in TG speeds?
MLA would be multi-head latent attention, similar to DeepSeek and Kimi-K2. There are some papers discussing retrofitting MLA onto existing architectures in place of, say, GQA, but it would take at least some fine-tuning to get working, so adapting it ourselves is beyond the scope of most of us home users. MLA has both pros and cons: it can greatly shrink the kv-cache (each token caches a small compressed latent instead of full K/V heads), but it takes more computation to compress/decompress to/from latent space, so it is not necessarily faster.
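For a sense of scale, here is the per-token, per-layer cache arithmetic. The MLA side uses DeepSeek's published dims (kv_lora_rank = 512 plus a 64-dim decoupled RoPE key); the GQA dims are assumed purely for comparison:

```python
# Values cached per token per layer, MLA vs. a typical GQA setup.
# MLA (DeepSeek-style): one compressed latent plus a shared RoPE key.
mla_per_token = 512 + 64        # kv_lora_rank + rope key dim
# GQA (assumed dims): full K and V for every KV head.
gqa_per_token = 2 * 8 * 128     # 2 * n_kv_heads * head_dim
print(f"MLA caches {mla_per_token} values vs {gqa_per_token} for GQA "
      f"(~{gqa_per_token / mla_per_token:.1f}x smaller)")
```

Both grow linearly with context; MLA just shrinks the constant factor, at the cost of the extra projection work at attention time.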