testing IQ5_K

#5
by shewin - opened

W790E Sage + QYFS + 512G + RTX5090

IQ5_K:

Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 66846720 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 66846720 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_gate_exps.weight (size = 865075200 bytes) -- ignoring
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_down_exps.weight (size = 1042022400 bytes) -- ignoring
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_up_exps.weight (size = 865075200 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 55705600 bytes) -- ignoring
model has unused tensor blk.92.nextn.embed_tokens.weight (size = 642580480 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_head.weight (size = 642580480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors: CPU buffer size = 41144.77 MiB
llm_load_tensors: CPU buffer size = 42960.02 MiB
llm_load_tensors: CPU buffer size = 43128.77 MiB
llm_load_tensors: CPU buffer size = 43785.02 MiB
llm_load_tensors: CPU buffer size = 43128.77 MiB
llm_load_tensors: CPU buffer size = 35970.83 MiB
llm_load_tensors: CPU buffer size = 612.81 MiB
llm_load_tensors: CUDA0 buffer size = 16308.63 MiB
....................................................................................................

MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA

llama_new_context_with_model: n_ctx = 58368
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 11264.67 MiB
llama_new_context_with_model: KV self size = 11264.62 MiB, K (q8_0): 5632.31 MiB, V (q8_0): 5632.31 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2905.77 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 992.05 MiB
llama_new_context_with_model: graph nodes = 4457
llama_new_context_with_model: graph splits = 180
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF

main: n_kv_max = 58368, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4096 | 1024 | 0     | 21.111 | 194.02   | 109.352 | 9.36     |
| 4096 | 1024 | 4096  | 21.349 | 191.86   | 97.248  | 10.53    |
| 4096 | 1024 | 8192  | 22.908 | 178.80   | 115.277 | 8.88     |
| 4096 | 1024 | 12288 | 22.586 | 181.35   | 133.793 | 7.65     |
| 4096 | 1024 | 16384 | 22.803 | 179.62   | 141.520 | 7.24     |

I have some questions:

  1. "model has unused tensor blk.92 -- ignoring": is this correct?
  2. How can I increase the context size? The maximum for my workstation is less than 60K.

Hmm, I want to know since you have the QYFS + 512 GB DDR5 + Sage SE combo...

Have you tried DeepSeek (the latest V3 or R1)? I'd like to know the token-generation speeds you get from, say, an IQ4 quant at 32K context.

DeepSeek-V3-0324-IQ4_K_R4

main: n_kv_max = 65536, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4090 | 1022 | 0     | 42.215 | 96.88    | 97.254  | 10.51    |
| 4090 | 1022 | 4090  | 43.060 | 94.98    | 95.129  | 10.74    |
| 4090 | 1022 | 8180  | 43.872 | 93.23    | 103.784 | 9.85     |
| 4090 | 1022 | 12270 | 44.895 | 91.10    | 105.761 | 9.66     |

DeepSeek-R1-0528-IQ4_KS_R4

main: n_kv_max = 65536, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101

| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4090 | 1022 | 0     | 39.589 | 103.31   | 91.805  | 11.13    |
| 4090 | 1022 | 4090  | 40.485 | 101.03   | 98.592  | 10.37    |
| 4090 | 1022 | 8180  | 41.546 | 98.44    | 104.753 | 9.76     |
| 4090 | 1022 | 12270 | 42.236 | 96.84    | 108.134 | 9.45     |

GLM 4.5 IQ4_K

Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 52101120 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 52101120 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_gate_exps.weight (size = 707788800 bytes) -- ignoring
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_down_exps.weight (size = 865075200 bytes) -- ignoring
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_up_exps.weight (size = 707788800 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 6512640 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 6512640 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 55705600 bytes) -- ignoring
model has unused tensor blk.92.nextn.embed_tokens.weight (size = 533463040 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_head.weight (size = 533463040 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors: CPU buffer size = 40746.39 MiB
llm_load_tensors: CPU buffer size = 42230.00 MiB
llm_load_tensors: CPU buffer size = 42380.00 MiB
llm_load_tensors: CPU buffer size = 42905.00 MiB
llm_load_tensors: CPU buffer size = 37416.98 MiB
llm_load_tensors: CPU buffer size = 416.25 MiB
llm_load_tensors: CUDA0 buffer size = 13323.87 MiB
....................................................................................................

MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA

llama_new_context_with_model: n_ctx = 68352
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 13191.51 MiB
llama_new_context_with_model: KV self size = 13191.47 MiB, K (q8_0): 6595.73 MiB, V (q8_0): 6595.73 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2978.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1148.05 MiB
llama_new_context_with_model: graph nodes = 4457
llama_new_context_with_model: graph splits = 180
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF

main: n_kv_max = 68352, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4096 | 1024 | 0     | 20.941 | 195.59   | 96.096  | 10.66    |
| 4096 | 1024 | 4096  | 21.013 | 194.93   | 81.857  | 12.51    |
| 4096 | 1024 | 8192  | 21.443 | 191.01   | 109.799 | 9.33     |
| 4096 | 1024 | 12288 | 22.150 | 184.92   | 116.912 | 8.76     |
| 4096 | 1024 | 16384 | 22.374 | 183.07   | 122.584 | 8.35     |

Hmm...
I thought GLM, being relatively smaller, would give higher TG speed than the DeepSeek models.

Owner

@shewin

model has unused tensor blk.92 -- ignoring ? is this correct?

Yes, for both GLM models the last layer is set with the new TENSOR_SKIP flag and is not involved in inference. It only takes up space on disk and is not loaded into VRAM/RAM. This layer holds the NextN MTP tensors, which are not used anywhere yet but may be in the future.
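For a rough sense of scale, one can sum the sizes of the skipped blk.92 tensors printed in the IQ5_K load log above (a quick sketch; the byte counts are copied straight from that log):

```python
# Sizes (bytes) of the unused blk.92 tensors from the IQ5_K load log above,
# in the order they are printed. These are skipped at load time and only
# occupy disk space, not VRAM/RAM.
unused_blk92_bytes = [
    20480, 66846720, 5570560, 5570560, 49152, 4096, 4096,
    66846720, 512, 512, 20480, 3276800, 640,
    865075200, 1042022400, 865075200,
    8355840, 8355840, 8355840,
    55705600, 642580480, 20480, 20480, 642580480, 20480,
]

total = sum(unused_blk92_bytes)
print(f"{total / 2**30:.2f} GiB skipped on load")  # roughly 4 GiB of disk-only data
```

So for this quant the NextN layer costs about 4 GiB of disk, dominated by the expert FFN tensors and the two embedding/head tensors.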

How can I increase the context size, the maximum for my workstation is less than 60K

You can use -ctk q6_0 -ctv q6_0 to free up some more VRAM to fit more context, for example, or offload fewer of the extra exps layers to VRAM, etc.

@SFPLM

hmm.... I thought the GLM being relatively less in size would make it have more TG speed than the Deepseeks

DeepSeek has about 37B active parameters and GLM-4.5 about 32B, so for TG GLM-4.5 should be slightly faster, assuming memory bandwidth is the limit and the quantization sizes are similar.
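A back-of-the-envelope sketch of that reasoning: if TG is memory-bandwidth bound, every generated token has to stream the active weights once, so the token rate is roughly bandwidth divided by active bytes. The bandwidth and bits-per-weight figures below are illustrative assumptions, not measurements of this system:

```python
# If TG is memory-bandwidth bound, each generated token must stream the active
# weights once, so t/s <= bandwidth / active_bytes. The 37B/32B active-parameter
# counts are from the discussion above; 4.5 bits per weight and 300 GB/s
# effective bandwidth are illustrative assumptions.
def tg_upper_bound(active_params_b, bpw, bandwidth_gbs):
    active_gb = active_params_b * bpw / 8  # GB streamed per generated token
    return bandwidth_gbs / active_gb

ddr5_gbs = 300.0  # assumed effective system bandwidth
print(f"DeepSeek ~37B active: {tg_upper_bound(37, 4.5, ddr5_gbs):.1f} t/s ceiling")
print(f"GLM-4.5  ~32B active: {tg_upper_bound(32, 4.5, ddr5_gbs):.1f} t/s ceiling")
```

These are ceilings, not predictions; attention, the KV cache, and the CPU/GPU split all add overhead that can easily mask the 32B-vs-37B difference.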

I'd have to see the full command from @shewin, including how many GPUs, the system CPU and RAM speed, and whether it's Linux or Windows, to know for sure. In general I no longer use -rtr; instead I omit it and use -ub 4096 -b 4096, which can really boost PP speed. However, leaving the batch sizes at their defaults, going with -rtr, and using the extra VRAM to offload a few more exps layers may improve TG on GLM-4.5, but I have not tested that.

With -ctk q6_0 -ctv q6_0 the maximum context size increased by 10-20K:

IQ5_K:
main: n_kv_max = 71168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4096 | 1024 | 0     | 21.294 | 192.35   | 102.750 | 9.97     |
| 4096 | 1024 | 4096  | 21.940 | 186.69   | 107.686 | 9.51     |
| 4096 | 1024 | 8192  | 21.658 | 189.12   | 129.563 | 7.90     |
| 4096 | 1024 | 12288 | 22.404 | 182.83   | 136.833 | 7.48     |
| 4096 | 1024 | 16384 | 22.742 | 180.11   | 148.500 | 6.90     |

IQ4_K:
main: n_kv_max = 90112, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101

| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 4096 | 1024 | 0     | 21.029 | 194.78   | 99.277  | 10.31    |
| 4096 | 1024 | 4096  | 21.703 | 188.73   | 104.923 | 9.76     |
| 4096 | 1024 | 8192  | 22.113 | 185.23   | 111.030 | 9.22     |
| 4096 | 1024 | 12288 | 22.666 | 180.71   | 121.368 | 8.44     |
| 4096 | 1024 | 16384 | 22.762 | 179.95   | 126.116 | 8.12     |

main: n_kv_max = 131072, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, n_gpu_layers = 999, n_threads = 57, n_threads_batch = 57

| PP   | TG   | N_KV  | T_PP s | S_PP t/s | T_TG s  | S_TG t/s |
|------|------|-------|--------|----------|---------|----------|
| 8192 | 2048 | 0     | 84.193 | 97.30    | 205.887 | 9.95     |
| 8192 | 2048 | 8192  | 40.191 | 203.82   | 222.147 | 9.22     |
| 8192 | 2048 | 16384 | 42.153 | 194.34   | 237.941 | 8.61     |
| 8192 | 2048 | 24576 | 44.176 | 185.44   | 254.582 | 8.04     |

I also get similar numbers with a similar config using anikifoss' HQ4 quant. I decided to stop there because I did not want to wait.

@ubergarm
Do you know if MLA could potentially be supported for GLM models? Does this play a role in TG speeds?

Owner

@SFPLM

n_batch = 8192, n_ubatch = 8192

Some folks have reported success with the larger batch sizes, but others have had issues going higher. I tend to stick to 4096 or below, but it looks like you got it working there!

Do you know if MLA has potential to be supported for GLM models? does this play a role in the impact of TG speeds?

MLA is multi-head latent attention, as used by DeepSeek and Kimi-K2. There are papers discussing implementing MLA in place of, say, GQA on existing architectures, but it would take at least some fine-tuning to get working, so adapting it ourselves is beyond the scope of most of us home users. MLA has both pros and cons: it can greatly reduce the size of the kv-cache, but it takes extra computation to compress/decompress to and from the latent space, so it is not necessarily faster.
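To make the cache-size argument concrete, here is a sketch comparing per-token cache entries for MLA versus naive MHA, using the published DeepSeek-V3 attention dimensions (128 heads of dim 128, kv_lora_rank 512, decoupled RoPE key dim 64; treat these as assumptions when reasoning about other models):

```python
# Per-token, per-layer KV-cache entries: MLA stores one compressed latent
# (kv_lora_rank) plus the decoupled RoPE key (qk_rope_head_dim), while naive
# MHA would store full K and V for every head. Dimensions below are the
# DeepSeek-V3 attention config; other models would differ.
n_head, head_dim = 128, 128
kv_lora_rank, qk_rope_dim = 512, 64

mha_entries = 2 * n_head * head_dim       # full K + V per token per layer
mla_entries = kv_lora_rank + qk_rope_dim  # compressed latent + RoPE key

print(f"MHA: {mha_entries} values/token/layer")
print(f"MLA: {mla_entries} values/token/layer")
```

Both caches still grow linearly with context length; MLA just shrinks the per-token constant dramatically, at the cost of extra compute to project in and out of the latent space.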
