testing IQ5_K
W790E Sage + QYFS + 512G + RTX5090
IQ5_K:
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 66846720 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 66846720 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_gate_exps.weight (size = 865075200 bytes) -- ignoring
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_down_exps.weight (size = 1042022400 bytes) -- ignoring
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_up_exps.weight (size = 865075200 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 55705600 bytes) -- ignoring
model has unused tensor blk.92.nextn.embed_tokens.weight (size = 642580480 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_head.weight (size = 642580480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors: CPU buffer size = 41144.77 MiB
llm_load_tensors: CPU buffer size = 42960.02 MiB
llm_load_tensors: CPU buffer size = 43128.77 MiB
llm_load_tensors: CPU buffer size = 43785.02 MiB
llm_load_tensors: CPU buffer size = 43128.77 MiB
llm_load_tensors: CPU buffer size = 35970.83 MiB
llm_load_tensors: CPU buffer size = 612.81 MiB
llm_load_tensors: CUDA0 buffer size = 16308.63 MiB
....................................................................................................
MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
llama_new_context_with_model: n_ctx = 58368
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 11264.67 MiB
llama_new_context_with_model: KV self size = 11264.62 MiB, K (q8_0): 5632.31 MiB, V (q8_0): 5632.31 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2905.77 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 992.05 MiB
llama_new_context_with_model: graph nodes = 4457
llama_new_context_with_model: graph splits = 180
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
main: n_kv_max = 58368, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 21.111 | 194.02 | 109.352 | 9.36 |
| 4096 | 1024 | 4096 | 21.349 | 191.86 | 97.248 | 10.53 |
| 4096 | 1024 | 8192 | 22.908 | 178.80 | 115.277 | 8.88 |
| 4096 | 1024 | 12288 | 22.586 | 181.35 | 133.793 | 7.65 |
| 4096 | 1024 | 16384 | 22.803 | 179.62 | 141.520 | 7.24 |
I have some questions:
- "model has unused tensor blk.92 -- ignoring" -- is this correct?
- How can I increase the context size? The maximum for my workstation is less than 60K.
Hmm, I'm curious about the QYFS + 512 GB DDR5 + Sage SE combo... Have you tried DeepSeek (the latest V3 or R1)? I'd like to know the token generation speeds you get from, say, an IQ4 quant at 32K context.
DeepSeek-V3-0324-IQ4_K_R4
main: n_kv_max = 65536, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 42.215 | 96.88 | 97.254 | 10.51 |
| 4090 | 1022 | 4090 | 43.060 | 94.98 | 95.129 | 10.74 |
| 4090 | 1022 | 8180 | 43.872 | 93.23 | 103.784 | 9.85 |
| 4090 | 1022 | 12270 | 44.895 | 91.10 | 105.761 | 9.66 |
DeepSeek-R1-0528-IQ4_KS_R4
main: n_kv_max = 65536, n_batch = 4090, n_ubatch = 4090, flash_attn = 1, n_gpu_layers = 63, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4090 | 1022 | 0 | 39.589 | 103.31 | 91.805 | 11.13 |
| 4090 | 1022 | 4090 | 40.485 | 101.03 | 98.592 | 10.37 |
| 4090 | 1022 | 8180 | 41.546 | 98.44 | 104.753 | 9.76 |
| 4090 | 1022 | 12270 | 42.236 | 96.84 | 108.134 | 9.45 |
GLM 4.5 IQ4_K
Tensor blk.91.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_down_exps.weight buffer type overriden to CPU
Tensor blk.91.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.attn_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.attn_q.weight (size = 52101120 bytes) -- ignoring
model has unused tensor blk.92.attn_k.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_v.weight (size = 5570560 bytes) -- ignoring
model has unused tensor blk.92.attn_q.bias (size = 49152 bytes) -- ignoring
model has unused tensor blk.92.attn_k.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_v.bias (size = 4096 bytes) -- ignoring
model has unused tensor blk.92.attn_output.weight (size = 52101120 bytes) -- ignoring
model has unused tensor blk.92.attn_q_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.attn_k_norm.weight (size = 512 bytes) -- ignoring
model has unused tensor blk.92.post_attention_norm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_inp.weight (size = 3276800 bytes) -- ignoring
model has unused tensor blk.92.exp_probs_b.bias (size = 640 bytes) -- ignoring
Tensor blk.92.ffn_gate_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_gate_exps.weight (size = 707788800 bytes) -- ignoring
Tensor blk.92.ffn_down_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_down_exps.weight (size = 865075200 bytes) -- ignoring
Tensor blk.92.ffn_up_exps.weight buffer type overriden to CPU
model has unused tensor blk.92.ffn_up_exps.weight (size = 707788800 bytes) -- ignoring
model has unused tensor blk.92.ffn_gate_shexp.weight (size = 6512640 bytes) -- ignoring
model has unused tensor blk.92.ffn_down_shexp.weight (size = 8355840 bytes) -- ignoring
model has unused tensor blk.92.ffn_up_shexp.weight (size = 6512640 bytes) -- ignoring
model has unused tensor blk.92.nextn.eh_proj.weight (size = 55705600 bytes) -- ignoring
model has unused tensor blk.92.nextn.embed_tokens.weight (size = 533463040 bytes) -- ignoring
model has unused tensor blk.92.nextn.enorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.hnorm.weight (size = 20480 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_head.weight (size = 533463040 bytes) -- ignoring
model has unused tensor blk.92.nextn.shared_head_norm.weight (size = 20480 bytes) -- ignoring
llm_load_tensors: offloading 93 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 94/94 layers to GPU
llm_load_tensors: CPU buffer size = 40746.39 MiB
llm_load_tensors: CPU buffer size = 42230.00 MiB
llm_load_tensors: CPU buffer size = 42380.00 MiB
llm_load_tensors: CPU buffer size = 42905.00 MiB
llm_load_tensors: CPU buffer size = 37416.98 MiB
llm_load_tensors: CPU buffer size = 416.25 MiB
llm_load_tensors: CUDA0 buffer size = 13323.87 MiB
....................................................................................................
MLA is only available for LLM_ARCH_DEEPSEEK2 -> turning off MLA
llama_new_context_with_model: n_ctx = 68352
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 13191.51 MiB
llama_new_context_with_model: KV self size = 13191.47 MiB, K (q8_0): 6595.73 MiB, V (q8_0): 6595.73 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 2978.02 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1148.05 MiB
llama_new_context_with_model: graph nodes = 4457
llama_new_context_with_model: graph splits = 180
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MUL_MAT_ID to OFF
XXXXXXXXXXXXXXXXXXXXX Setting offload policy for op MOE_FUSED_UP_GATE to OFF
main: n_kv_max = 68352, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 20.941 | 195.59 | 96.096 | 10.66 |
| 4096 | 1024 | 4096 | 21.013 | 194.93 | 81.857 | 12.51 |
| 4096 | 1024 | 8192 | 21.443 | 191.01 | 109.799 | 9.33 |
| 4096 | 1024 | 12288 | 22.150 | 184.92 | 116.912 | 8.76 |
| 4096 | 1024 | 16384 | 22.374 | 183.07 | 122.584 | 8.35 |
Hmm... I thought GLM, being relatively smaller, would have higher TG speed than the DeepSeeks.
model has unused tensor blk.92 -- ignoring ? is this correct?
Yes, for both GLM models the last layer is set with the new TENSOR_SKIP flag and is not involved in inference. It only takes up space on disk and is not loaded into VRAM/RAM. This layer holds the NextN MTP tensors, which are not used anywhere yet but may be in the future.
How can I increase the context size, the maximum for my workstation is less than 60K
You can use -ctk q6_0 -ctv q6_0 to free up some more VRAM to fit more context, for example, or offload fewer extra exps layers to VRAM, etc.
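To put rough numbers on that, here is a sketch of the savings against the IQ5_K log above, assuming q8_0 costs about 8.5 bits per cached value (34 bytes per 32-value block) and q6_0 about 6.5 (26 bytes per block):

```python
# Rough sketch: KV-cache savings from q8_0 -> q6_0 cache quantization.
# Assumes q8_0 ~ 8.5 bits/value and q6_0 ~ 6.5 bits/value (incl. scales).
Q8_0_BPV = 8.5
Q6_0_BPV = 6.5

kv_q8_mib = 11264.62  # "KV self size" from the IQ5_K log above
kv_q6_mib = kv_q8_mib * Q6_0_BPV / Q8_0_BPV
print(f"q6_0 KV cache: {kv_q6_mib:.0f} MiB, freeing {kv_q8_mib - kv_q6_mib:.0f} MiB")

# With a fixed KV-cache VRAM budget, context scales inversely with bits/value:
n_ctx_q8 = 58368
print(f"the same budget fits roughly {int(n_ctx_q8 * Q8_0_BPV / Q6_0_BPV)} tokens")
```

The real gain ends up a bit smaller than this naive ratio because the compute buffers also grow with context.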
hmm.... I thought the GLM being relatively less in size would make it have more TG speed than the Deepseeks
DeepSeek has about 37B active parameters and GLM-4.5 about 32B, so for TG GLM should be slightly faster, assuming memory bandwidth is the limit and the quantization sizes are similar.
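A memory-bandwidth back-of-envelope supports that. The bits/weight and bandwidth figures here are my assumptions, not measured, and this ignores the experts held in VRAM, so treat it as a loose ceiling:

```python
# TG ceiling if every active weight must be streamed from RAM once per token.
# Assumptions: ~4.5 bits/weight for IQ4-class quants, ~250 GB/s effective
# bandwidth for 8-channel DDR5.
def tg_ceiling(active_params_b, bits_per_weight=4.5, bw_gb_s=250.0):
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bw_gb_s * 1e9 / bytes_per_token

print(f"DeepSeek, ~37B active: {tg_ceiling(37):.1f} t/s")
print(f"GLM-4.5, ~32B active:  {tg_ceiling(32):.1f} t/s")
```

The measured results above sit somewhat under these ceilings, and the expected gap between the two models is roughly the 37/32 ratio.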
I'd have to see the full command from @shewin (including how many GPUs, the system CPU and RAM speed, and whether it's Linux or Windows) to know for sure. In general I no longer use -rtr; instead I omit it and use -ub 4096 -b 4096, which can really boost PP speed. However, leaving the batch sizes at default, going with -rtr, and using the extra VRAM to offload a few more exps layers may improve TG on GLM-4.5, though I have not tested that.
With -ctk q6_0 -ctv q6_0, context size increased by 10-20K:
IQ5_K:
main: n_kv_max = 71168, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 21.294 | 192.35 | 102.750 | 9.97 |
| 4096 | 1024 | 4096 | 21.940 | 186.69 | 107.686 | 9.51 |
| 4096 | 1024 | 8192 | 21.658 | 189.12 | 129.563 | 7.90 |
| 4096 | 1024 | 12288 | 22.404 | 182.83 | 136.833 | 7.48 |
| 4096 | 1024 | 16384 | 22.742 | 180.11 | 148.500 | 6.90 |
IQ4_K:
main: n_kv_max = 90112, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 99, n_threads = 101, n_threads_batch = 101
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 4096 | 1024 | 0 | 21.029 | 194.78 | 99.277 | 10.31 |
| 4096 | 1024 | 4096 | 21.703 | 188.73 | 104.923 | 9.76 |
| 4096 | 1024 | 8192 | 22.113 | 185.23 | 111.030 | 9.22 |
| 4096 | 1024 | 12288 | 22.666 | 180.71 | 121.368 | 8.44 |
| 4096 | 1024 | 16384 | 22.762 | 179.95 | 126.116 | 8.12 |
main: n_kv_max = 131072, n_batch = 8192, n_ubatch = 8192, flash_attn = 1, n_gpu_layers = 999, n_threads = 57, n_threads_batch = 57
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|---|---|---|---|---|---|---|
| 8192 | 2048 | 0 | 84.193 | 97.30 | 205.887 | 9.95 |
| 8192 | 2048 | 8192 | 40.191 | 203.82 | 222.147 | 9.22 |
| 8192 | 2048 | 16384 | 42.153 | 194.34 | 237.941 | 8.61 |
| 8192 | 2048 | 24576 | 44.176 | 185.44 | 254.582 | 8.04 |
I also get similar numbers with a similar config using anikifoss' HQ4 quant. I stopped there because I did not want to wait.
@ubergarm
Do you know if MLA could potentially be supported for GLM models? Does this play a role in TG speeds?
n_batch = 8192, n_ubatch = 8192
Some folks have reported success with the larger batch sizes, but others have had issues going higher. I usually stick to 4096 or below, but it looks like you got it working there!
Do you know if MLA could potentially be supported for GLM models? Does this play a role in TG speeds?
MLA would be multi-head latent attention, similar to DeepSeek and Kimi-K2. There are some papers discussing retrofitting MLA onto existing architectures in place of, say, GQA, but it would take at least some fine-tuning to get working, so adapting it ourselves is beyond the scope of most of us home users. MLA has both pros and cons: it can greatly shrink the kv-cache (each token caches a small compressed latent instead of full K/V heads), but it takes more computation to compress/decompress to/from latent space, so it is not necessarily faster.
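For a sense of scale, here is the per-token, per-layer cache arithmetic. The MLA side uses DeepSeek's published dims (kv_lora_rank = 512 plus a 64-dim decoupled RoPE key); the GQA dims are assumed purely for comparison:

```python
# Values cached per token per layer, MLA vs. a typical GQA setup.
# MLA (DeepSeek-style): one compressed latent plus a shared RoPE key.
mla_per_token = 512 + 64        # kv_lora_rank + rope key dim
# GQA (assumed dims): full K and V for every KV head.
gqa_per_token = 2 * 8 * 128     # 2 * n_kv_heads * head_dim
print(f"MLA caches {mla_per_token} values vs {gqa_per_token} for GQA "
      f"(~{gqa_per_token / mla_per_token:.1f}x smaller)")
```

Both grow linearly with context; MLA just shrinks the constant factor, at the cost of the extra projection work at attention time.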