Downtown-Case committed on
Commit fce0aa6 · verified · 1 Parent(s): 3b340dd

Update README.md

Files changed (1): README.md (+57 -7)
README.md CHANGED
@@ -13,9 +13,9 @@ tags:
  - ik_llama.cpp
  ---
 
- GLM-4.5-**Base**, quantized down to 118GB, specifically for 128GB RAM + small GPU setups.
+ GLM-4.5-**Base**, quantized down to 118GB/124GB, specifically for 128GB RAM + small GPU setups.
 
- It uses ik_llama.cpp's excellent new IQ2_KL quantization, thanks to @ikawrakow
+ It uses ik_llama.cpp's new IQ2_KL quantization:
 
  https://github.com/ikawrakow/ik_llama.cpp/pull/602
 
 
@@ -52,11 +52,61 @@ token_embd\.weight=iq4_k
  output\.weight=iq6_k
  ```
 
- - iq5/iq6 attention and shared experts for minimal loss there.
- - iq4_kt dense layer ffns to save VRAM for context, since they will be offloaded to GPU.
- - iq2_kl ffn up *and* down experts, as it results in a optimal size with a good format as opposed to quantizing up/down differently.
 
- Works well on dual channel DDR5 + a single 3090, with room for 24K F16 context and plenty of RAM to spare for the system. It's awesome for story continuation! Requires ik_llama.cpp, see ubergarm's GLM 4.5 page.
 
- I may tweak the formula by unifying the iq5/iq6 layers to one quantization type, for possibly better speed... But if anyone wants a slightly different formula or model, just ask.
+ # V2:
+
+ The 'V2' file uses this config, now that I better understand how offloading works:
+
+ ```
+ # Attention
+ blk\..*\.attn_q.*=iq5_ks
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq5_ks
+
+ # First 3 Dense Layers [0-2]
+ blk\..*\.ffn_down\.weight=iq5_ks
+ blk\..*\.ffn_(gate|up)\.weight=iq5_ks
+
+ # Shared Expert Layers [3-92]
+ blk\..*\.ffn_down_shexp\.weight=iq5_ks
+ blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks
+
+ # Routed Experts Layers [3-9]
+ blk\.[3-9]\.ffn_down_exps\.weight=iq3_ks
+ blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Routed Experts Layers [10-19]
+ blk\.[1-1][0-9]\.ffn_down_exps\.weight=iq3_ks
+ blk\.[1-1][0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Routed Experts Layers [81-89]
+ blk\.[8-8][1-9]\.ffn_down_exps\.weight=iq3_ks
+ blk\.[8-8][1-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Routed Experts Layers [90-92]
+ blk\.[9-9][0-2]\.ffn_down_exps\.weight=iq3_ks
+ blk\.[9-9][0-2]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+ # Routed Experts Layers [20-80]
+ blk\..*\.ffn_down_exps\.weight=iq2_kl
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+ # NextN MTP Layer [92]
+ blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
+ blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
+ blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_k
+ output\.weight=iq6_k
+ ```
+
+ - Mostly iq5_ks for the layers that go to the GPU, to keep them fast and to minimize both the number of quantization types and the loss there
+
+ - iq3_ks for the routed experts near the beginning and end of the stack, as this seems to be where the perplexity 'bumps' are
+
+ - iq2_kl for the 'middle' routed experts
+
+
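For anyone who wants to reproduce or tweak the recipe, the sketch below shows how a rule block like the one above is typically fed to ik_llama.cpp's `llama-quantize` through its `--custom-q` option (comma-separated `regex=type` rules). This is an illustrative sketch only, not the exact command used for this upload: the source GGUF, imatrix and output names are placeholders, and if the option name differs in your build, check `llama-quantize --help`.

```
# Sketch only: turn the recipe above into llama-quantize's --custom-q argument.
# All paths and file names below are placeholders.
custom="
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq6_k
# ... the rest of the rules from the block above ...
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
# Drop comment and blank lines, then join the remaining rules with commas.
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:^,::;s:,$::')

./build/bin/llama-quantize \
    --imatrix /path/to/imatrix.dat \
    --custom-q "$custom" \
    /path/to/GLM-4.5-Base-BF16.gguf \
    /path/to/GLM-4.5-Base-IQ2_KL-V2.gguf \
    IQ2_KL
```

Note that the catch-all `blk\..*` rules for the [20-80] routed experts come after the more specific layer ranges; the recipe appears to rely on earlier rules taking precedence for the layers they already match.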
+ Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and plenty of RAM to spare for the system, but do NOT load it with mmap! It's awesome for story continuation! Requires ik_llama.cpp; see ubergarm's GLM 4.5 page.
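As a rough illustration of that setup (attention, dense-FFN and shared-expert tensors in 24GB of VRAM, routed experts in system RAM, mmap disabled), a launch command along these lines should work. It is a sketch assuming ik_llama.cpp's `llama-server` with its `-ot`/`--override-tensor`, `-fa` and `-fmoe` options; the model path, thread count and port are placeholders, not the author's actual command.

```
# Sketch only: serve the V2 file on a 128GB RAM + 24GB VRAM machine.
# -ngl 99 offloads all layers to the GPU, while -ot "exps=CPU" keeps the
# routed experts (the iq2_kl/iq3_ks *_exps tensors) in system RAM, so only
# the iq5_ks attention/dense/shared-expert tensors land in VRAM.
# --no-mmap follows the "do NOT load with mmap" note above.
./build/bin/llama-server \
    -m /path/to/GLM-4.5-Base-IQ2_KL-V2.gguf \
    -c 24576 \
    -ngl 99 \
    -ot "exps=CPU" \
    -fa -fmoe \
    --no-mmap \
    -t 16 \
    --host 127.0.0.1 --port 8080
```

With `-ngl 99` plus the `exps=CPU` override, only the comparatively small iq5_ks tensors and the 24K-token F16 KV cache have to fit on the 24GB card, which matches the headroom described above.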