Update README.md
README.md CHANGED
@@ -13,9 +13,9 @@ tags:
- ik_llama.cpp
---

-GLM-4.5-**Base**, quantized down to 118GB, specifically for 128GB RAM + small GPU setups.
+GLM-4.5-**Base**, quantized down to 118GB/124GB, specifically for 128GB RAM + small GPU setups.

-It uses ik_llama.cpp's
+It uses ik_llama.cpp's new IQ2_KL quantization:

https://github.com/ikawrakow/ik_llama.cpp/pull/602
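IQ2_KL and the other iqN_k / iqN_ks types used in the recipes below exist only in ik_llama.cpp, so a reasonably recent build of that fork is needed to make or load these files. A minimal build sketch, assuming a CUDA toolchain; the cmake flags shown are the stock ones and may vary by version/platform (see the repo's build docs):

```bash
# Sketch: clone and build ik_llama.cpp with CUDA support.
# Adjust -DGGML_CUDA=ON (or drop it for a CPU-only build) to match your hardware.
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
```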
@@ -52,11 +52,61 @@ token_embd\.weight=iq4_k
output\.weight=iq6_k
```

-- iq4_kt dense layer ffns to save VRAM for context, since they will be offloaded to GPU.
-- iq2_kl ffn up *and* down experts, as it results in a optimal size with a good format as opposed to quantizing up/down differently.
+# V2:
+
+The 'V2' file uses this config, now that I better understand how offloading works:
+
+```
+# Attention
+blk\..*\.attn_q.*=iq5_ks
+blk\..*\.attn_k.*=iq6_k
+blk\..*\.attn_v.*=iq6_k
+blk\..*\.attn_output.*=iq5_ks
+
+# First 3 Dense Layers [0-2]
+blk\..*\.ffn_down\.weight=iq5_ks
+blk\..*\.ffn_(gate|up)\.weight=iq5_ks
+
+# Shared Expert Layers [3-92]
+blk\..*\.ffn_down_shexp\.weight=iq5_ks
+blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks
+
+# Routed Experts Layers [3-9]
+blk\.[3-9]\.ffn_down_exps\.weight=iq3_ks
+blk\.[3-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+# Routed Experts Layers [10-19]
+blk\.[1-1][0-9]\.ffn_down_exps\.weight=iq3_ks
+blk\.[1-1][0-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+# Routed Experts Layers [81-89]
+blk\.[8-8][1-9]\.ffn_down_exps\.weight=iq3_ks
+blk\.[8-8][1-9]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+# Routed Experts Layers [90-92]
+blk\.[9-9][0-2]\.ffn_down_exps\.weight=iq3_ks
+blk\.[9-9][0-2]\.ffn_(gate|up)_exps\.weight=iq3_ks
+
+# Routed Experts Layers [20-80]
+blk\..*\.ffn_down_exps\.weight=iq2_kl
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+# NextN MTP Layer [92]
+blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
+blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
+blk\..*\.nextn\.eh_proj\.weight=q8_0
+
+# Non-Repeating Layers
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+```
+
+- Mostly iq5_ks for the GPU layers, to keep things fast and to minimize the number of quantization types (and the loss) there
+
+- iq3_ks routed experts near the beginning and end, as this seems to be where the perplexity 'bumps' are
+
+- iq2_kl for the 'middle' routed experts
+
+Works well on 128GB RAM, with room for 24K F16 context in 24GB VRAM and plenty of RAM to spare for the system, but do NOT load with mmap! It's awesome for story continuation! Requires ik_llama.cpp; see ubergarm's GLM 4.5 page.
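For anyone who wants to reproduce or tweak the V2 mix: in ik_llama.cpp, a recipe like the block above is normally passed to llama-quantize as a comma-separated list of regex=type rules via --custom-q. A rough sketch, assuming the recipe is saved in a local file and that a BF16 GGUF plus an imatrix are already on disk; every file name here is a placeholder, and the trailing IQ2_KL is just the default for any tensor the rules don't cover:

```bash
#!/usr/bin/env bash
# Sketch only: apply a custom regex=type recipe with ik_llama.cpp's llama-quantize.
# glm45-base-v2.recipe, the imatrix file and the GGUF paths are placeholders.

# Drop comment/blank lines and join the remaining rules with commas.
rules=$(grep -vE '^\s*(#|$)' glm45-base-v2.recipe | paste -sd, -)

./build/bin/llama-quantize \
  --imatrix imatrix-glm-4.5-base.dat \
  --custom-q "$rules" \
  GLM-4.5-Base-BF16.gguf \
  GLM-4.5-Base-IQ2_KL-V2.gguf \
  IQ2_KL 16
```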
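And a sketch of the kind of launch the "do NOT load with mmap" advice above is pointing at: all layers nominally offloaded with -ngl, the routed experts overridden back to system RAM with -ot, and mmap disabled. The model name, context size and thread count are illustrative, and -fa / -fmoe are optional ik_llama.cpp performance toggles; double-check the exact flags against llama-server --help on your build:

```bash
#!/usr/bin/env bash
# Sketch: 24GB-VRAM / 128GB-RAM style launch for the V2 file.
#   -ngl 99        : send all repeating layers to the GPU...
#   -ot "exps=CPU" : ...then keep the routed-expert tensors in system RAM
#   --no-mmap      : load weights into RAM up front, as the card advises
./build/bin/llama-server \
  -m GLM-4.5-Base-IQ2_KL-V2.gguf \
  -c 24576 \
  -ngl 99 \
  -ot "exps=CPU" \
  -fa -fmoe \
  --no-mmap \
  --threads 16
```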