Downtown-Case committed on
Commit fce0aa6 · verified · 1 Parent(s): 3b340dd

Update README.md

Files changed (1): README.md (+57 -7)
README.md CHANGED
@@ -13,9 +13,9 @@ tags:
  - ik_llama.cpp
  ---
 
- GLM-4.5-**Base**, quantized down to 118GB, specifically for 128GB RAM + small GPU setups.
+ GLM-4.5-**Base**, quantized down to 118GB/124GB, specifically for 128GB RAM + small GPU setups.
 
- It uses ik_llama.cpp's excellent new IQ2_KL quantization, thanks to @ikawrakow
+ It uses ik_llama.cpp's new IQ2_KL quantization:
 
  https://github.com/ikawrakow/ik_llama.cpp/pull/602
 
 
@@ -52,11 +52,61 @@ token_embd\.weight=iq4_k
  output\.weight=iq6_k
  ```
 
- - iq5/iq6 attention and shared experts for minimal loss there.
- - iq4_kt dense layer ffns to save VRAM for context, since they will be offloaded to GPU.
- - iq2_kl ffn up *and* down experts, as it results in a optimal size with a good format as opposed to quantizing up/down differently.
 
- Works well on dual channel DDR5 + a single 3090, with room for 24K F16 context and plenty of RAM to spare for the system. It's awesome for story continuation! Requires ik_llama.cpp, see ubergarm's GLM 4.5 page.
 
- I may tweak the formula by unifying the iq5/iq6 layers to one quantization type, for possibly better speed... But if anyone wants a slightly different formula or model, just ask.
+ # V2:
+
+ The 'V2' file uses this config, now that I better understand how offloading works:
+
+ ```
+ # Attention
+ blk\..*\.attn_q.*=iq5_ks
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq5_ks
+
+ # First 3 Dense Layers [0-2]
+ blk\..*\.ffn_down\.weight=iq5_ks
+ blk\..*\.ffn_(gate|up)\.weight=iq5_ks
+
+ # Shared Expert Layers [3-92]
+ blk\..*\.ffn_down_shexp\.weight=iq5_ks
+ blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks
+
+ # Routed Experts Layers [3-9]
+ blk\.[3-9]\.ffn_down_exps\.weight=iq3_ks
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Routed Experts Layers [10-19]
+ blk\.[1-1][0-9]\.ffn_down_exps\.weight=iq3_ks
+ blk\.[1-1][0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Routed Experts Layers [81-89]
+ blk\.[8-8][1-9]\.ffn_down_exps\.weight=iq3_ks
+ blk\.[8-8][1-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Routed Experts Layers [90-92]
+ blk\.[9-9][0-2]\.ffn_down_exps\.weight=iq3_ks
+ blk\.[9-9][0-2]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Routed Experts Layers [20-80]
+ blk\..*\.ffn_down_exps\.weight=iq2_kl
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+ # NextN MTP Layer [92]
+ blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
+ blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
+ blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_k
+ output\.weight=iq6_k
+ ```
+
+ - Mostly iq5_ks for the layers that go to the GPU, to keep them fast and to minimize both the number of quantization types and the loss there
+
+ - iq3_ks for the routed experts near the beginning and end of the stack, as this seems to be where the perplexity 'bumps' are
+
+ - iq2_kl for the 'middle' routed experts
+
+
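For anyone who wants to reproduce or tweak the recipe, the sketch below shows how a rule block like the one above is typically fed to ik_llama.cpp's `llama-quantize` through its `--custom-q` option (comma-separated `regex=type` rules). This is an illustrative sketch only, not the exact command used for this upload: the source GGUF, imatrix and output names are placeholders, and if the option name differs in your build, check `llama-quantize --help`.

```
# Sketch only: turn the recipe above into llama-quantize's --custom-q argument.
# All paths and file names below are placeholders.
custom="
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
# ... the rest of the rules from the block above ...
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
# Drop comment and blank lines, then join the remaining rules with commas.
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:^,::;s:,$::')

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom" \
    /path/to/GLM-4.5-Base-BF16.gguf \
    /path/to/GLM-4.5-Base-IQ2_KL-V2.gguf \
    IQ2_KL
```

Note that the catch-all `blk\..*` rules for the [20-80] routed experts come after the more specific layer ranges; the recipe appears to rely on earlier rules taking precedence for the layers they already match.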
+ Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and plenty of RAM to spare for the system, but do NOT load it with mmap! It's awesome for story continuation! Requires ik_llama.cpp; see ubergarm's GLM 4.5 page.
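As a rough illustration of that setup (attention, dense-FFN and shared-expert tensors in 24GB of VRAM, routed experts in system RAM, mmap disabled), a launch command along these lines should work. It is a sketch assuming ik_llama.cpp's `llama-server` with its `-ot`/`--override-tensor`, `-fa` and `-fmoe` options; the model path, thread count and port are placeholders, not the author's actual command.

```
# Sketch only: serve the V2 file on a 128GB RAM + 24GB VRAM machine.
# -ngl 99 offloads all layers to the GPU, while -ot "exps=CPU" keeps the
# routed experts (the iq2_kl/iq3_ks *_exps tensors) in system RAM, so only
# the iq5_ks attention/dense/shared-expert tensors land in VRAM.
# --no-mmap follows the "do NOT load with mmap" note above.
./build/bin/llama-server \
    -m /path/to/GLM-4.5-Base-IQ2_KL-V2.gguf \
    -c 24576 \
    -ngl 99 \
    -ot "exps=CPU" \
    -fa -fmoe \
    --no-mmap \
    -t 16 \
    --host 127.0.0.1 --port 8080
```

With `-ngl 99` plus the `exps=CPU` override, only the comparatively small iq5_ks tensors and the 24K-token F16 KV cache have to fit on the 24GB card, which matches the headroom described above.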