ubergarm committed
Commit 266d765 · 1 Parent(s): 1eb4462

Prepping IQ4_K

Files changed (1)
  1. README.md +59 -2
README.md CHANGED
@@ -43,7 +43,6 @@ These first two are just test quants for baseline perplexity comparison:
 * `Q8_0` 354.794 GiB (8.505 BPW)
 - Final estimate: PPL = 3.1746 +/- 0.01784
 
-
 ## IQ5_K 250.296 GiB (6.000 BPW)
 Final estimate: PPL = 3.1690 +/- 0.01779
 
@@ -102,6 +101,64 @@ numactl -N 0 -m 0 \
 
 </details>
 
+## IQ4_K TODO
+Final estimate: PPL = TODO
+
+<details>
+
+<summary>👈 Secret Recipe</summary>
+
+```bash
+#!/usr/bin/env bash
+custom="
+# 93 Repeating Layers [0-92]
+
+# Attention
+blk\..*\.attn_q.*=iq6_k
+blk\..*\.attn_k.*=q8_0
+blk\..*\.attn_v.*=q8_0
+blk\..*\.attn_output.*=iq6_k
+
+# First 3 Dense Layers [0-2]
+blk\..*\.ffn_down\.weight=q8_0
+blk\..*\.ffn_(gate|up)\.weight=iq6_k
+
+# Shared Expert Layers [3-92]
+blk\..*\.ffn_down_shexp\.weight=q8_0
+blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k
+
+# Routed Experts Layers [3-92]
+blk\..*\.ffn_down_exps\.weight=iq5_k
+blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
+
+# NextN MTP Layer [92]
+blk\..*\.nextn\.embed_tokens\.weight=iq5_k
+blk\..*\.nextn\.shared_head_head\.weight=iq5_k
+blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+"
+
+custom=$(
+  echo "$custom" | grep -v '^#' | \
+  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 0 -m 0 \
+./build/bin/llama-quantize \
+    --custom-q "$custom" \
+    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
+    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
+    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ4_K.gguf \
+    IQ4_K \
+    192
+```
+
+</details>
+
+
 ## IQ4_KSS 173.726 GiB (4.164 BPW)
 Final estimate: PPL = 3.3261 +/- 0.01899
 
@@ -227,7 +284,7 @@ numactl -N 1 -m 1 \
 If you want to disable thinking, add `/nothink` (correct, no underscore) at the *end* of your prompt.
 
 ```bash
-# Clone and checkout experimental PR
+# Clone and checkout experimental PR (hopefully merged into main soon)
 $ git clone https://github.com/ikawrakow/ik_llama.cpp
 $ cd ik_llama.cpp
 $ git remote add Thireus https://github.com/Thireus/ik_llama.cpp.git
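
For readers skimming the recipe added above: the `custom` variable is just a newline-separated list of `tensor-regex=quant-type` rules, and the `grep`/`sed` pipeline collapses it into the single comma-separated string that `--custom-q` is given. A minimal sketch of that transformation with two rules lifted from the recipe (GNU sed is assumed for the `-z` flag; the sample tensor names are illustrative):

```bash
#!/usr/bin/env bash
# Two rules taken from the recipe above, in the same multi-line format.
custom="
# Attention
blk\..*\.attn_q.*=iq6_k

# Routed Experts
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
"

# Drop comment lines, then join the remaining rules with commas
# (runs of newlines collapse to one comma; leading/trailing commas stripped).
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

# Prints: blk\..*\.attn_q.*=iq6_k,blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
echo "$custom"

# Sanity-check which rule a tensor name would match (example names only):
echo 'blk.10.attn_q_a.weight'       | grep -Eq 'blk\..*\.attn_q.*'                    && echo "matches iq6_k rule"
echo 'blk.12.ffn_gate_exps.weight'  | grep -Eq 'blk\..*\.ffn_(gate|up)_exps\.weight'  && echo "matches iq4_k rule"
```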
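
The "Final estimate: PPL" figures quoted in the diff come from perplexity runs against a text corpus. A minimal sketch of such a run, assuming the same build tree as the quantize command above; the corpus file name is an assumption, not taken from this commit:

```bash
# Hypothetical perplexity run over the freshly quantized IQ4_K file.
./build/bin/llama-perplexity \
    -m /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ4_K.gguf \
    -f wiki.test.raw
# The run ends with a line of the form: Final estimate: PPL = ... +/- ...
```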