Downtown-Case committed
Commit d53ae02 · verified · 1 Parent(s): 13bd7bc

Update README.md

Files changed (1):
  1. README.md +38 -36

README.md CHANGED
@@ -13,7 +13,7 @@ tags:
  - ik_llama.cpp
  ---

- GLM-4.5-**Base**, quantized down to 118GB/124GB, specifically for 128GB RAM + small GPU setups.
+ GLM-4.5-**Base**, quantized down to 124GB (V2) and 118GB (V1), specifically for 128GB RAM + small GPU setups.

  It uses ik_llama.cpp's new IQ2_KL quantization:

@@ -23,39 +23,6 @@ With the following mix, derived from ubergarm's GLM-4.5 (Instruct) quantizations

  https://huggingface.co/ubergarm/GLM-4.5-GGUF

- ```
- # Attention
- blk\..*\.attn_q.*=iq5_ks_r4
- blk\..*\.attn_k.*=iq6_k
- blk\..*\.attn_v.*=iq6_k
- blk\..*\.attn_output.*=iq5_ks_r4
-
- # First 3 Dense Layers [0-2]
- blk\..*\.ffn_down\.weight=iq4_kt
- blk\..*\.ffn_(gate|up)\.weight=iq4_kt
-
- # Shared Expert Layers [3-92]
- blk\..*\.ffn_down_shexp\.weight=iq5_ks_r4
- blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks_r4
-
- # Routed Experts Layers [3-92]
- blk\..*\.ffn_down_exps\.weight=iq2_kl
- blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
-
- # NextN MTP Layer [92]
- blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
- blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
- blk\..*\.nextn\.eh_proj\.weight=q8_0
-
- # Non-Repeating Layers
- token_embd\.weight=iq4_k
- output\.weight=iq6_k
- ```
-
- # V2:
-
- The 'V2' file uses this config, now that I better understand how offloading works:
-
  ```
  # Attention
  blk\..*\.attn_q.*=iq5_ks
@@ -99,7 +66,8 @@ blk\..*\.nextn\.eh_proj\.weight=q8_0
  # Non-Repeating Layers
  token_embd\.weight=iq4_k
  output\.weight=iq6_k
- "
+ ```
+

  - Mostly iq5_ks GPU layers to keep it fast and minimize the number of quantization types and loss here

@@ -107,6 +75,40 @@ output\.weight=iq6_k

  - iq2_kl 'middle' shared experts

+ <details>
+ <summary>Old V1 Recipe</summary>
+ ```
+ # Attention
+ blk\..*\.attn_q.*=iq5_ks_r4
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq5_ks_r4
+
+ # First 3 Dense Layers [0-2]
+ blk\..*\.ffn_down\.weight=iq4_kt
+ blk\..*\.ffn_(gate|up)\.weight=iq4_kt
+
+ # Shared Expert Layers [3-92]
+ blk\..*\.ffn_down_shexp\.weight=iq5_ks_r4
+ blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks_r4
+
+ # Routed Experts Layers [3-92]
+ blk\..*\.ffn_down_exps\.weight=iq2_kl
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+ # NextN MTP Layer [92]
+ blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
+ blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
+ blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+ # Non-Repeating Layers
+ token_embd\.weight=iq4_k
+ output\.weight=iq6_k
+ ```
+
+ </details>
+
+

- Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and plenty of RAM to spare for the system, but do NOT load with mmap! It's awesome for story continuation! Requires ik_llama.cpp, see ubergarm's GLM 4.5 page.
+ Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and RAM to spare for the system, but do NOT load with mmap! It's awesome for story continuation! Requires ik_llama.cpp, see ubergarm's GLM 4.5 page.

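For reference, a per-tensor mix like the V2 recipe in this README is applied at quantization time. The sketch below is a hypothetical invocation of ik_llama.cpp's llama-quantize with its --custom-q option; the recipe, imatrix, and GGUF filenames are placeholders, not files from this repo:

```
# Hypothetical sketch (not from this repo): apply a custom per-tensor mix with
# ik_llama.cpp's llama-quantize. Assumes the V2 recipe above was saved as
# glm45-base-v2.recipe; the BF16 GGUF and imatrix paths are placeholders.

# --custom-q expects a comma-separated list of regex=type pairs, so strip the
# comment and blank lines from the recipe and join what remains with commas.
CUSTOM_Q=$(grep -vE '^[[:space:]]*(#|$)' glm45-base-v2.recipe | paste -sd, -)

./build/bin/llama-quantize \
    --imatrix GLM-4.5-Base.imatrix \
    --custom-q "$CUSTOM_Q" \
    GLM-4.5-Base-BF16.gguf \
    GLM-4.5-Base-IQ2_KL-V2.gguf \
    IQ2_KL
```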
 
 
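As a sketch of the "128GB RAM + small GPU" setup the README describes: the command below assumes ik_llama.cpp's llama-server with the repeating layers offloaded to the GPU, routed experts overridden back into system RAM, roughly 24K context, and mmap disabled. The model filename and the exact offload split are assumptions for illustration, not the author's command:

```
# Hypothetical launch sketch for a 24GB GPU + 128GB RAM machine (not the
# author's exact command; the model filename is a placeholder).
#   -ngl 99        : offload all repeating layers to the GPU by default
#   -ot "exps=CPU" : override the routed-expert tensors back into system RAM
#   -c 24576       : ~24K tokens of context, left at F16 as noted above
#   --no-mmap      : the README says do NOT load with mmap
./build/bin/llama-server \
    -m GLM-4.5-Base-IQ2_KL-V2.gguf \
    -c 24576 \
    -ngl 99 \
    -ot "exps=CPU" \
    --no-mmap \
    -fa
```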
114