Update README.md
README.md CHANGED
@@ -13,7 +13,7 @@ tags:
 - ik_llama.cpp
 ---
 
-GLM-4.5-**Base**, quantized down to 118GB
+GLM-4.5-**Base**, quantized down to 124GB (V2) and 118GB (V1), specifically for 128GB RAM + small GPU setups.
 
 It uses ik_llama.cpp's new IQ2_KL quantization:
 
@@ -23,39 +23,6 @@ With the following mix, derived from ubergarm's GLM-4.5 (Instruct) quantizations
 
 https://huggingface.co/ubergarm/GLM-4.5-GGUF
 
-```
-# Attention
-blk\..*\.attn_q.*=iq5_ks_r4
-blk\..*\.attn_k.*=iq6_k
-blk\..*\.attn_v.*=iq6_k
-blk\..*\.attn_output.*=iq5_ks_r4
-
-# First 3 Dense Layers [0-2]
-blk\..*\.ffn_down\.weight=iq4_kt
-blk\..*\.ffn_(gate|up)\.weight=iq4_kt
-
-# Shared Expert Layers [3-92]
-blk\..*\.ffn_down_shexp\.weight=iq5_ks_r4
-blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks_r4
-
-# Routed Experts Layers [3-92]
-blk\..*\.ffn_down_exps\.weight=iq2_kl
-blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
-
-# NextN MTP Layer [92]
-blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
-blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
-blk\..*\.nextn\.eh_proj\.weight=q8_0
-
-# Non-Repeating Layers
-token_embd\.weight=iq4_k
-output\.weight=iq6_k
-```
-
-# V2:
-
-The 'V2' file uses this config, now that I better understand how offloading works:
-
 ```
 # Attention
 blk\..*\.attn_q.*=iq5_ks
@@ -99,7 +66,8 @@ blk\..*\.nextn\.eh_proj\.weight=q8_0
 # Non-Repeating Layers
 token_embd\.weight=iq4_k
 output\.weight=iq6_k
-
+```
+
 
 - Mostly iq5_ks GPU layers to keep it fast and minimize the number of quantization types and loss here
 
@@ -107,6 +75,40 @@ output\.weight=iq6_k
 
 - iq2_kl 'middle' shared experts
 
+<details>
+<summary>Old V1 Recipe</summary>
+```
+# Attention
+blk\..*\.attn_q.*=iq5_ks_r4
+blk\..*\.attn_k.*=iq6_k
+blk\..*\.attn_v.*=iq6_k
+blk\..*\.attn_output.*=iq5_ks_r4
+
+# First 3 Dense Layers [0-2]
+blk\..*\.ffn_down\.weight=iq4_kt
+blk\..*\.ffn_(gate|up)\.weight=iq4_kt
+
+# Shared Expert Layers [3-92]
+blk\..*\.ffn_down_shexp\.weight=iq5_ks_r4
+blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks_r4
+
+# Routed Experts Layers [3-92]
+blk\..*\.ffn_down_exps\.weight=iq2_kl
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+# NextN MTP Layer [92]
+blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
+blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
+blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+```
+
+</details>
+
+
 
-Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and
+Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and RAM to spare for the system, but do NOT load with mmap! It's awesome for story continuation! Requires ik_llama.cpp, see ubergarm's GLM 4.5 page.
 
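The rules in these recipes are in the `regex=quant_type` form that ubergarm's GLM-4.5 recipes pass to ik_llama.cpp's `llama-quantize` via its `--custom-q` option. Below is a rough sketch of that workflow; the file names, imatrix path, and thread count are placeholders, and the flag spelling should be checked against `llama-quantize --help` in your ik_llama.cpp build.

```
# Sketch only: applying a custom per-tensor mix with ik_llama.cpp's llama-quantize.
# Paste the full rule list from the recipe above into $custom; only a few rules are
# shown here as placeholders. Comment lines are stripped and the rest joined with
# commas, which is the form --custom-q expects.
custom="
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /path/to/imatrix.dat \
    GLM-4.5-Base-BF16.gguf GLM-4.5-Base-IQ2_KL.gguf IQ2_KL 32
```

Tensors not matched by any rule should fall back to the normal selection for the base type named on the command line.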
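The closing note corresponds to a fairly standard ik_llama.cpp launch for this class of hardware: offload every layer, then override the routed-expert tensors back to system RAM so only the attention, dense, and shared-expert weights (the mostly iq5_ks parts) sit in the 24GB of VRAM, with mmap disabled as the card insists. A sketch under those assumptions follows; the binary path, `-ot` pattern, thread count, and file name are illustrative, not taken from this card.

```
# Sketch only: V2 quant on 128GB RAM + one 24GB GPU with ik_llama.cpp.
# -ngl 99 offloads all layers, -ot "exps=CPU" keeps the routed experts in RAM,
# -c 24576 is the 24K context mentioned above, and --no-mmap loads the weights
# outright instead of memory-mapping them.
./build/bin/llama-server \
    -m GLM-4.5-Base-IQ2_KL-V2.gguf \
    -c 24576 \
    -ngl 99 \
    -ot "exps=CPU" \
    --no-mmap \
    -t 16
```

Disabling mmap matters here because the weights come close to filling 128GB; memory-mapping them instead tends to thrash the page cache once the KV cache and the rest of the system take their share.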