|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: Qwen/Qwen3-30B-A3B-Thinking-2507 |
|
|
base_model_relation: quantized |
|
|
tags: |
|
|
- Qwen |
|
|
- Qwen3 Thinking 2507 |
|
|
- GGUF |
|
|
- quantized |
|
|
- 4-bit |
|
|
--- |
|
|
|
|
|
## Llama.cpp hybrid layer quantization of Qwen3-30B-A3B-Thinking-2507 by Qwen |
|
|
|
|
|
Original model: https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507 |
|
|
|
|
|
The hybrid quant employs different quantization levels on a per-layer basis to increase
flexibility in trading off performance against file size. Fewer parameter bits are used at deep layers
and more bits at cortex layers to simultaneously optimize quantized size and model performance.
For this file the layer quants are as follows:
|
|
``` |
|
|
LAYER_TYPES='[ |
|
|
[0 ,"Q4_K_M"],[1 ,"Q4_K_M"],[2 ,"Q4_K_S"],[3 ,"Q3_K_L"],[4 ,"Q3_K_M"],[5 ,"Q3_K_M"],[6 ,"Q3_K_M"],[7 ,"Q3_K_M"], |
|
|
[8 ,"Q3_K_L"],[9 ,"Q3_K_M"],[10,"Q3_K_L"],[11,"Q3_K_M"],[12,"Q3_K_L"],[13,"Q3_K_M"],[14,"Q3_K_L"],[15,"Q3_K_M"], |
|
|
[16,"Q3_K_L"],[17,"Q3_K_M"],[18,"Q3_K_L"],[19,"Q3_K_M"],[20,"Q3_K_L"],[21,"Q3_K_L"],[22,"Q3_K_L"],[23,"Q3_K_L"], |
|
|
[24,"Q3_K_L"],[25,"Q3_K_L"],[26,"Q3_K_L"],[27,"Q3_K_L"],[28,"Q4_K_S"],[29,"Q3_K_L"],[30,"Q4_K_S"],[31,"Q3_K_L"], |
|
|
[32,"Q4_K_S"],[33,"Q3_K_L"],[34,"Q4_K_S"],[35,"Q3_K_L"],[36,"Q4_K_S"],[37,"Q4_K_S"],[38,"Q4_K_S"],[39,"Q4_K_S"], |
|
|
[40,"Q4_K_S"],[41,"Q4_K_S"],[42,"Q4_K_S"],[43,"Q4_K_S"],[44,"Q4_K_M"],[45,"Q5_K_S"],[46,"Q5_K_M"],[47,"Q6_K" ] |
|
|
]' |
|
|
FLAGS="--token-embedding-type Q6_K --output-tensor-type Q6_K --layer-types-high" |
|
|
``` |
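For reference, a hypothetical quantization invocation is sketched below. It assumes a llama-quantize build patched with the per-layer type support described in the discussion linked at the end of this card; how LAYER_TYPES is consumed (environment variable vs. command-line flag) depends on that patch, and the file names are placeholders.

```
# Sketch only: assumes a llama-quantize build patched for hybrid layer quants
# (see the discussion linked at the bottom of this card). File names are
# placeholders; how LAYER_TYPES is read depends on the patch.
export LAYER_TYPES FLAGS
./llama-quantize $FLAGS \
  Qwen3-30B-A3B-Thinking-2507.BF16.gguf \
  Qwen3-30B-A3B-Thinking-2507.Q4_K_H.gguf Q4_K_M
```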
|
|
The layer quants were optimized for good performance on the non-thinking variant of 30B A3B 2507 and reused verbatim
on the thinking version. Tests show it performs well, about grade B, on a set of curated test prompts, even getting
one IQ-test-like problem right that virtually every other tested model (including strong ones such as QwQ and GLM Z1)
fails, while tripping up on some other, easier problems. Nonetheless the evals show fairly solid performance across
a wide range of diverse problems.
|
|
|
|
|
Comparison: |
|
|
|
|
|
Quant | Size (bytes) | PPL | Comment
---------|---------|------|-----------
IQ4_XS | 16.6e9 | 7.4 | default embed and output; unstable with greedy sampling
Q4_K_H | 16.8e9 | 7.5 | Q6_K embed, Q6_K output; stable with greedy sampling
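Numbers like these can be reproduced roughly as sketched below; the evaluation text and prompts used here are not stated on this card, so the files and settings shown are assumptions.

```
# Sketch: measure perplexity on a raw text file, then do a greedy-sampling
# generation check (--temp 0 makes sampling greedy). File names are placeholders.
./llama-perplexity -m Qwen3-30B-A3B-Thinking-2507.Q4_K_H.gguf -ngl 99 -f wiki.test.raw
./llama-cli -m Qwen3-30B-A3B-Thinking-2507.Q4_K_H.gguf -ngl 99 --temp 0 \
  -p "Explain why the sky appears blue." -n 512
```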
|
|
|
|
|
Usage: |
|
|
|
|
|
Compared to the first Qwen3-30B-A3B, this model differs as follows:

1) Larger native context of 256k, extendable to 1M with YaRN (a launch sketch is given after this list).

2) Only thinking mode is available. It is a dedicated RL-trained thinking model with a think-block header,
similar to QwQ and the think mode of the original Qwen3 series. Just like QwQ, overthinking is baked into the
model training. It might be possible to nudge the model toward less overthinking via the prompt, but this
was not tested. GLM Z1 9B is an example of a model that does not overthink while still being able to solve
some fairly tricky problems correctly.
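A minimal llama.cpp launch sketch for the YaRN-extended context is shown below. The scale factor and target context length are assumptions based on the native 256k window, not tested settings; check the upstream Qwen model card before relying on them.

```
# Sketch: extend context beyond the native 256k with YaRN in llama.cpp.
# Only the YaRN-related flags are shown; the 4x scale and 1M context are assumptions.
./llama-server -m Qwen3-30B-A3B-Thinking-2507.Q4_K_H.gguf \
  -c 1000000 --rope-scaling yarn --rope-scale 4 --yarn-orig-ctx 262144
```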
|
|
|
|
|
This MoE model can be run efficiently by offloading the expert tensors to CPU via -ot exps=CPU
to open up very large context space. The smaller size of the optimally quantized parameters gives
an effective boost in CPU processing speed because less memory bandwidth is needed to repeatedly copy them
from main memory into SIMD registers. The model can also run fully offloaded on GPU via RPC or on a high-VRAM GPU.
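A minimal sketch of that expert offload, assuming a recent llama.cpp build with --override-tensor (-ot) support; the context size shown is just an example.

```
# Sketch: keep attention/dense weights on GPU, push MoE expert tensors to CPU.
# "exps" is a regex matching the *_exps expert tensors; -c is an example value.
./llama-server -m Qwen3-30B-A3B-Thinking-2507.Q4_K_H.gguf \
  -ngl 99 -ot exps=CPU -c 131072
```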
|
|
|
|
|
The recommended speculator for the model is Qwen3-0.6B, provided the inference platform supports
vocabulary translation between draft and target. Approximate performance using a 4070 GPU and a 9900K
CPU with a downstream speculator in llama.cpp:
|
|
|
|
|
Config | Gen speed (block-4 draft, think mode)
---------|--------
2x 4070, RPC, fully offloaded to GPU | 42 t/s
1x 4070, -ot exps=CPU, CPU = 9900K | 18 t/s
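A sketch of a comparable speculative decoding setup with stock llama.cpp is given below. The draft model file, draft length, and offload settings are assumptions, and it requires the draft and target vocabularies to be accepted by the build in use.

```
# Sketch: speculative decoding with a small Qwen3 draft model.
# --draft-max 4 mirrors the "block 4" figures above; file names are placeholders.
./llama-server -m Qwen3-30B-A3B-Thinking-2507.Q4_K_H.gguf -ngl 99 \
  -md Qwen3-0.6B.Q8_0.gguf -ngld 99 --draft-max 4 --draft-min 1
```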
|
|
|
|
|
Benchmarks: |
|
|
|
|
|
Evals for the model will eventually be given here: https://huggingface.co/spaces/steampunque/benchlm. |
|
|
|
|
|
## Download the file below:
|
|
| Link | Type | Size/e9 B | Notes |
|------|------|-----------|-------|
| [Qwen3-30B-A3B-Thinking-2507.Q4_K_H.gguf](https://huggingface.co/steampunque/Qwen3-30B-A3B-Thinking-2507-Hybrid-GGUF/resolve/main/Qwen3-30B-A3B-Thinking-2507.Q4_K_H.gguf) | Q4_K_H | 16.8e9 B | ~IQ4_XS size |
|
|
|
|
|
A discussion thread about the hybrid layer quant approach can be found in the llama.cpp GitHub repository:
|
|
|
|
|
https://github.com/ggml-org/llama.cpp/discussions/13040 |