Tags: Text Generation, Transformers, GGUF, 64k context, high speed, all use cases, creative, creative writing, all genres, tool calls, tool use, problem solving, deep thinking, reasoning, deep reasoning, story, qwen3_moe, writing, fiction, roleplaying, bfloat16, role play, sillytavern, backyard, Lmstudio, Mixture of Experts, mixture of experts, 4 experts activated, 128 experts, NEO Imatrix, Max Imatrix, qwen3, imatrix, conversational
Update README.md
README.md CHANGED
@@ -74,12 +74,11 @@ This method close to doubles the speed of the model and uses 1.5B (of 30B) param
 use the regular model ("30B-A3B"), and use this model for simpler use case(s) although I did not notice any loss of function during
 routine (but not extensive) testing.
 
-GGUF NEO Imatrix ggufs
-context as per Qwen tech notes using "YARN".
+GGUF NEO Imatrix ggufs have extended context of 64k (65535), up from 32k (32768), as per Qwen tech notes using "YARN".
 
 NEO Imatrix dataset was developed in house after testing and evaluating over 50 Imatrix datasets and a lot of "tinkering".
 
-The quants (and specific Imatrix processes) were specially designed for Qwen3 30B-
+The quants (and specific Imatrix processes) were specially designed for the Qwen3 30B-A1.5B model and use recent changes in llama.cpp (April 15 2025 / B5127 onwards) to customize the quant structure itself.
 
 That being said, "Team Qwen" deserves all the credit. Qwen3s are SOTA.
 
@@ -103,13 +102,7 @@ Use Jinja Template or CHATML template.
 
 <B>IQ1_M MAX / IQ1_M MAX PLUS and Higher Quants:</B>
 
-
-I suggest using prompts with a bit more direction/information (see two example generations) than a standard prompt with these specific quants to compensate
-for losses at this very low bit level.
-
-IQ1_M MAX PLUS has additional optimizations (vs IQ1_M MAX) at critical points in the model.
-
-IQ2s will be a lot stronger than the IQ1_Ms.
+IQ2s work well.
 
 Q2K/Q2KS will be faster (tokens per second) on CPU/RAM-only usage, but performance will be lower than IQ2s.
 
@@ -124,8 +117,6 @@ Q5s will be very high performance.
 
 Q6 will be peak performance, but with minimal NEO imatrix effect(s).
 
-Q8s (specialized) will be well... excellent performance.
-
 NOTES:
 - IQ3s will outperform Q3 quants, likewise for IQ2s vs Q2 quants.
 - IQ4_XS / IQ4_NL will perform at or outperform Q4s.
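The first hunk's added line states that these GGUFs ship with context extended to 64k via "YARN", per Qwen's tech notes. As a minimal sketch (not part of the card), this is how one of the quants could be loaded at the full extended window with llama-cpp-python, assuming the YARN rope-scaling parameters are already baked into the GGUF metadata; the file name is a hypothetical placeholder:

```python
# Hedged sketch: load a NEO Imatrix GGUF at its full extended context.
# Assumes the YARN rope-scaling settings are embedded in the GGUF metadata,
# so only the context window needs to be requested at load time.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A1.5B-NEO-Imatrix-IQ4_XS.gguf",  # hypothetical file name
    n_ctx=65535,       # the 64k window stated in the diff (up from 32k/32768)
    n_gpu_layers=-1,   # offload all layers to GPU when available
)

out = llm("Continue this scene: the rain had not stopped for nine days.", max_tokens=200)
print(out["choices"][0]["text"])
```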
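The first hunk's header also mentions a method that "close to doubles the speed of the model" while using 1.5B (of 30B) parameters; per the card's tags ("4 experts activated", "128 experts"), this corresponds to running 4 experts per token instead of the Qwen3-30B-A3B default of 8. A hedged sketch of reproducing that on a stock quant via a GGUF metadata override follows; the override key is an assumption based on llama.cpp's "{arch}.expert_used_count" naming convention for the qwen3moe architecture, and this repo's quants reportedly ship with the reduced count already set:

```python
# Hedged sketch: halve the number of active experts on a stock Qwen3 MoE GGUF.
# Both the override key and the file name are assumptions, not details
# confirmed by this model card.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Q4_K_M.gguf",          # hypothetical stock quant
    kv_overrides={"qwen3moe.expert_used_count": 4},  # 4 of 128 experts instead of the default 8
    n_ctx=32768,
)
```

Activating half the experts roughly halves the per-token expert compute, which lines up with the "close to doubles the speed" claim; the trade-off is the potential quality loss that the author says did not appear in routine (but not extensive) testing.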