ubergarm committed
Commit 266d765 · 1 Parent(s): 1eb4462

Prepping IQ4_K

Files changed (1)
  1. README.md +59 -2
README.md CHANGED
@@ -43,7 +43,6 @@ These first two are just test quants for baseline perplexity comparison:
 * `Q8_0` 354.794 GiB (8.505 BPW)
 - Final estimate: PPL = 3.1746 +/- 0.01784
 
-
 ## IQ5_K 250.296 GiB (6.000 BPW)
 Final estimate: PPL = 3.1690 +/- 0.01779
 
@@ -102,6 +101,64 @@ numactl -N 0 -m 0 \
 
 </details>
 
+## IQ4_K TODO
+Final estimate: PPL = TODO
+
+<details>
+
+<summary>👈 Secret Recipe</summary>
+
+```bash
+#!/usr/bin/env bash
+custom="
+# 93 Repeating Layers [0-92]
+
+# Attention
+blk\..*\.attn_q.*=iq6_k
+blk\..*\.attn_k.*=q8_0
+blk\..*\.attn_v.*=q8_0
+blk\..*\.attn_output.*=iq6_k
+
+# First 3 Dense Layers [0-2]
+blk\..*\.ffn_down\.weight=q8_0
+blk\..*\.ffn_(gate|up)\.weight=iq6_k
+
+# Shared Expert Layers [3-92]
+blk\..*\.ffn_down_shexp\.weight=q8_0
+blk\..*\.ffn_(gate|up)_shexp\.weight=iq6_k
+
+# Routed Experts Layers [3-92]
+blk\..*\.ffn_down_exps\.weight=iq5_k
+blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
+
+# NextN MTP Layer [92]
+blk\..*\.nextn\.embed_tokens\.weight=iq5_k
+blk\..*\.nextn\.shared_head_head\.weight=iq5_k
+blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+"
+
+custom=$(
+  echo "$custom" | grep -v '^#' | \
+  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 0 -m 0 \
+./build/bin/llama-quantize \
+    --custom-q "$custom" \
+    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-GGUF/imatrix-GLM-4.5-BF16.dat \
+    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-160x21B-4.5-BF16-00001-of-00015.gguf \
+    /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ4_K.gguf \
+    IQ4_K \
+    192
+```
+
+</details>
+
+
 ## IQ4_KSS 173.726 GiB (4.164 BPW)
 Final estimate: PPL = 3.3261 +/- 0.01899
 
@@ -227,7 +284,7 @@ numactl -N 1 -m 1 \
 If you want to disable thinking, add `/nothink` (correct, no underscore) at the *end* of your prompt.
 
 ```bash
-# Clone and checkout experimental PR
+# Clone and checkout experimental PR (hopefully merged into main soon)
 $ git clone https://github.com/ikawrakow/ik_llama.cpp
 $ cd ik_llama.cpp
 $ git remote add Thireus https://github.com/Thireus/ik_llama.cpp.git
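
For readers skimming the recipe added above: the `custom` variable is just a newline-separated list of `tensor-regex=quant-type` rules, and the `grep`/`sed` pipeline collapses it into the single comma-separated string that `--custom-q` is given. A minimal sketch of that transformation with two rules lifted from the recipe (GNU sed is assumed for the `-z` flag; the sample tensor names are illustrative):

```bash
#!/usr/bin/env bash
# Two rules taken from the recipe above, in the same multi-line format.
custom="
# Attention
blk\..*\.attn_q.*=iq6_k

# Routed Experts
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
"

# Drop comment lines, then join the remaining rules with commas
# (runs of newlines collapse to one comma; leading/trailing commas stripped).
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

# Prints: blk\..*\.attn_q.*=iq6_k,blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k
echo "$custom"

# Sanity-check which rule a tensor name would match (example names only):
echo 'blk.10.attn_q_a.weight'       | grep -Eq 'blk\..*\.attn_q.*'                    && echo "matches iq6_k rule"
echo 'blk.12.ffn_gate_exps.weight'  | grep -Eq 'blk\..*\.ffn_(gate|up)_exps\.weight'  && echo "matches iq4_k rule"
```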
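
The "Final estimate: PPL" figures quoted in the diff come from perplexity runs against a text corpus. A minimal sketch of such a run, assuming the same build tree as the quantize command above; the corpus file name is an assumption, not taken from this commit:

```bash
# Hypothetical perplexity run over the freshly quantized IQ4_K file.
./build/bin/llama-perplexity \
    -m /mnt/raid/models/ubergarm/GLM-4.5-GGUF/GLM-4.5-IQ4_K.gguf \
    -f wiki.test.raw
# The run ends with a line of the form: Final estimate: PPL = ... +/- ...
```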