---
license: mit
language:
- en
- zh
pipeline_tag: text-generation
base_model:
- zai-org/GLM-4.5-Base
base_model_relation: quantized
tags:
- imatrix
- conversational
- ik_llama.cpp
---

GLM-4.5-**Base**, quantized down to 124GB (V2) and 118GB (V1), specifically for 128GB RAM + small GPU setups.

```
llm_load_tensors:       CPU buffer size = 114156.88 MiB
llm_load_tensors: CUDA_Host buffer size =   416.25 MiB
llm_load_tensors:     CUDA0 buffer size = 10584.35 MiB
```

It uses ik_llama.cpp's new IQ2_KL quantization: https://github.com/ikawrakow/ik_llama.cpp/pull/602

The mix below is derived from ubergarm's GLM-4.5 (Instruct) quantizations: https://huggingface.co/ubergarm/GLM-4.5-GGUF

```
# Attention
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq5_ks
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq5_ks
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks

# Routed Experts Layers [3-19]
blk\.[3-9]\.ffn_down_exps\.weight=iq3_ks
blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.[1-1][0-9]\.ffn_down_exps\.weight=iq3_ks
blk\.[1-1][0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks

# Routed Experts Layers [81-92]
blk\.[8-8][1-9]\.ffn_down_exps\.weight=iq3_ks
blk\.[8-8][1-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
blk\.[9-9][0-2]\.ffn_down_exps\.weight=iq3_ks
blk\.[9-9][0-2]\.ffn_(gate|up)_exps\.weight=iq3_ks

# Routed Experts Layers [20-80]
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```

- Mostly iq5_ks for the GPU-resident layers, to minimize loss cheaply, keep things fast (the iqX_ks quantizations are very fast), and keep the number of quantization types down.
- iq3_ks routed experts near the beginning and end of the model, as this seems to be where the perplexity 'bumps' are.
- iq2_kl for the 'middle' routed experts.

It works well on 128GB RAM, with room for 24K of F16 context in 24GB VRAM and RAM to spare for the system. It's awesome for story continuation.

Do NOT load with mmap! It requires ik_llama.cpp; see ubergarm's GLM-4.5 page. And let me know if you want a different mix (such as one more optimal for 8-11GB GPUs).
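As a reference point, a launch command along these lines should give the 128GB RAM + 24GB VRAM split described above. The model path, thread count, and host/port are placeholders, and the exact flag set is a suggestion rather than a requirement: `-ngl 99` offloads all layers, `-ot exps=CPU` then keeps the routed experts in system RAM, and `--no-mmap` forces the weights to actually be loaded into memory.

```
# -ngl 99 offloads all layers; -ot exps=CPU pins the routed experts back to
# system RAM, giving roughly the ~114GB CPU / ~10.5GB CUDA0 split shown above.
./build/bin/llama-server \
    --model /path/to/GLM-4.5-Base-IQ2_KL-V2.gguf \
    --ctx-size 24576 \
    -fa -fmoe \
    -ngl 99 \
    -ot exps=CPU \
    --no-mmap \
    --threads 16 \
    --host 127.0.0.1 --port 8080
```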
**Old V1 Recipe**

```
# Attention
blk\..*\.attn_q.*=iq5_ks_r4
blk\..*\.attn_k.*=iq6_k
blk\..*\.attn_v.*=iq6_k
blk\..*\.attn_output.*=iq5_ks_r4

# First 3 Dense Layers [0-2]
blk\..*\.ffn_down\.weight=iq4_kt
blk\..*\.ffn_(gate|up)\.weight=iq4_kt

# Shared Expert Layers [3-92]
blk\..*\.ffn_down_shexp\.weight=iq5_ks_r4
blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks_r4

# Routed Experts Layers [3-92]
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# NextN MTP Layer [92]
blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
blk\..*\.nextn\.eh_proj\.weight=q8_0

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
```
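If you want to reproduce or tweak either mix yourself, a recipe like the ones above can be fed to ik_llama.cpp's llama-quantize through `--custom-q`, which takes a comma-separated list of regex=type pairs. The sketch below assumes the recipe is saved to `recipe.txt`; the imatrix path, GGUF paths, fallback type, and thread count are placeholders.

```
# Strip the '#' comment lines from recipe.txt and join the remaining
# regex=type pairs with commas, which is the form --custom-q expects.
custom=$(grep -v '^#' recipe.txt | sed -Ez 's:\n+:,:g; s:^,::; s:,$::')

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /path/to/imatrix-GLM-4.5-Base.dat \
    /path/to/GLM-4.5-Base-BF16.gguf \
    /path/to/GLM-4.5-Base-IQ2_KL.gguf \
    IQ2_KL \
    24
```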