Update README.md
README.md CHANGED
@@ -13,7 +13,7 @@ tags:
 - ik_llama.cpp
 ---
 
-GLM-4.5-**Base**, quantized down to 118GB
+GLM-4.5-**Base**, quantized down to 124GB (V2) and 118GB (V1), specifically for 128GB RAM + small GPU setups.
 
 It uses ik_llama.cpp's new IQ2_KL quantization:
 
@@ -23,39 +23,6 @@ With the following mix, derived from ubergarm's GLM-4.5 (Instruct) quantizations
 
 https://huggingface.co/ubergarm/GLM-4.5-GGUF
 
-```
-# Attention
-blk\..*\.attn_q.*=iq5_ks_r4
-blk\..*\.attn_k.*=iq6_k
-blk\..*\.attn_v.*=iq6_k
-blk\..*\.attn_output.*=iq5_ks_r4
-
-# First 3 Dense Layers [0-2]
-blk\..*\.ffn_down\.weight=iq4_kt
-blk\..*\.ffn_(gate|up)\.weight=iq4_kt
-
-# Shared Expert Layers [3-92]
-blk\..*\.ffn_down_shexp\.weight=iq5_ks_r4
-blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks_r4
-
-# Routed Experts Layers [3-92]
-blk\..*\.ffn_down_exps\.weight=iq2_kl
-blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
-
-# NextN MTP Layer [92]
-blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
-blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
-blk\..*\.nextn\.eh_proj\.weight=q8_0
-
-# Non-Repeating Layers
-token_embd\.weight=iq4_k
-output\.weight=iq6_k
-```
-
-# V2:
-
-The 'V2' file uses this config, now that I better understand how offloading works:
-
 ```
 # Attention
 blk\..*\.attn_q.*=iq5_ks
@@ -99,7 +66,8 @@ blk\..*\.nextn\.eh_proj\.weight=q8_0
 # Non-Repeating Layers
 token_embd\.weight=iq4_k
 output\.weight=iq6_k
-
+```
+
 
 - Mostly iq5_ks GPU layers to keep it fast and minimize the number of quantization types and loss here
 
@@ -107,6 +75,40 @@ output\.weight=iq6_k
 
 - iq2_kl 'middle' shared experts
 
+<details>
+<summary>Old V1 Recipe</summary>
+```
+# Attention
+blk\..*\.attn_q.*=iq5_ks_r4
+blk\..*\.attn_k.*=iq6_k
+blk\..*\.attn_v.*=iq6_k
+blk\..*\.attn_output.*=iq5_ks_r4
+
+# First 3 Dense Layers [0-2]
+blk\..*\.ffn_down\.weight=iq4_kt
+blk\..*\.ffn_(gate|up)\.weight=iq4_kt
+
+# Shared Expert Layers [3-92]
+blk\..*\.ffn_down_shexp\.weight=iq5_ks_r4
+blk\..*\.ffn_(gate|up)_shexp\.weight=iq4_ks_r4
+
+# Routed Experts Layers [3-92]
+blk\..*\.ffn_down_exps\.weight=iq2_kl
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+# NextN MTP Layer [92]
+blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
+blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
+blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+```
+
+</details>
+
+
 
-Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and
+Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and RAM to spare for the system, but do NOT load with mmap! It's awesome for story continuation! Requires ik_llama.cpp, see ubergarm's GLM 4.5 page.
 
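The rules in these recipes are in the `regex=quant_type` form that ubergarm's GLM-4.5 recipes pass to ik_llama.cpp's `llama-quantize` via its `--custom-q` option. Below is a rough sketch of that workflow; the file names, imatrix path, and thread count are placeholders, and the flag spelling should be checked against `llama-quantize --help` in your ik_llama.cpp build.

```
# Sketch only: applying a custom per-tensor mix with ik_llama.cpp's llama-quantize.
# Paste the full rule list from the recipe above into $custom; only a few rules are
# shown here as placeholders. Comment lines are stripped and the rest joined with
# commas, which is the form --custom-q expects.
custom="
# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::')

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /path/to/imatrix.dat \
    GLM-4.5-Base-BF16.gguf GLM-4.5-Base-IQ2_KL.gguf IQ2_KL 32
```

Tensors not matched by any rule should fall back to the normal selection for the base type named on the command line.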
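The closing note corresponds to a fairly standard ik_llama.cpp launch for this class of hardware: offload every layer, then override the routed-expert tensors back to system RAM so only the attention, dense, and shared-expert weights (the mostly iq5_ks parts) sit in the 24GB of VRAM, with mmap disabled as the card insists. A sketch under those assumptions follows; the binary path, `-ot` pattern, thread count, and file name are illustrative, not taken from this card.

```
# Sketch only: V2 quant on 128GB RAM + one 24GB GPU with ik_llama.cpp.
# -ngl 99 offloads all layers, -ot "exps=CPU" keeps the routed experts in RAM,
# -c 24576 is the 24K context mentioned above, and --no-mmap loads the weights
# outright instead of memory-mapping them.
./build/bin/llama-server \
    -m GLM-4.5-Base-IQ2_KL-V2.gguf \
    -c 24576 \
    -ngl 99 \
    -ot "exps=CPU" \
    --no-mmap \
    -t 16
```

Disabling mmap matters here because the weights come close to filling 128GB; memory-mapping them instead tends to thrash the page cache once the KV cache and the rest of the system take their share.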