# Q3 Quantization Format Comparison Summary
## Executive Summary
This document compares three 3-bit quantization formats for the Qwen3-0.6B model: **Q3_K_S**, **Q3_K_M**, and **Q3_HIFI**. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters.
---
## Performance Metrics
| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|--------|--------|--------|---------|
| **Perplexity** | 35.85 ± 0.32 | 31.81 ± 0.29 | **31.22 ± 0.28** ⭐ |
| **File Size** | 366.19 MiB | 389.12 MiB | **308.23 MiB** ⭐ |
| **Bits Per Weight** | 4.09 bpw | 4.34 bpw | **3.44 bpw** ⭐ |
| **Inference Speed** | 179.80 tok/s | **223.44 tok/s** ⭐ | 189.23 tok/s |
| **Memory Usage** | 888 MiB | 911 MiB | **830 MiB** ⭐ |
| **Quality Rank** | 3rd | 2nd | **1st** ⭐ |
| **Size Rank** | 2nd | 3rd | **1st** ⭐ |
| **Speed Rank** | 3rd | **1st** ⭐ | 2nd |
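
As a consistency check on the table, the file size and bits-per-weight rows can be related to each other. The sketch below assumes bpw is simply the total file size in bits divided by the number of weights (metadata ignored); this definition is an assumption, not something stated in this document. If the figures are consistent, all three formats should imply roughly the same weight count.

```python
MIB = 1024 * 1024  # bytes per MiB

# (file size in MiB, bits per weight) as reported in the table above
reported = {
    "Q3_K_S": (366.19, 4.09),
    "Q3_K_M": (389.12, 4.34),
    "Q3_HIFI": (308.23, 3.44),
}

def implied_weights(size_mib: float, bpw: float) -> float:
    """Weight count implied by a file size and a bits-per-weight figure."""
    return size_mib * MIB * 8 / bpw

implied = {name: implied_weights(*row) for name, row in reported.items()}
for name, n in implied.items():
    print(f"{name}: ~{n / 1e6:.0f}M weights")
```

All three rows land near 0.75 billion weights, so the size and bpw columns agree with each other to within rounding.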
---
## Detailed Analysis
### 1. Q3_K_S (Small) - The Balanced Option
**Perplexity:** 35.85 ± 0.32
**File Size:** 366.19 MiB (4.09 bpw)
**Speed:** 179.80 tokens/second
#### ✅ Pros:
- **Reasonable compromise** - mid-sized file with usable quality and speed
- **Smaller than Q3_K_M** (366 MiB vs 389 MiB)
- **Speed close to Q3_HIFI** (180 tok/s vs 189 tok/s, about 5% slower)
- **Simple quantization** - zero configuration required
- **Automatic tensor upgrades** - uses Q6_K for output.weight
- **Production-ready** - works out of the box
#### ❌ Cons:
- **Worst quality** - 4.63 points worse perplexity than Q3_HIFI, 4.04 points worse than Q3_K_M
- **Not the best in any category** - lowest quality and speed of the three, middle on file size
- **Lower precision** - fewer automatic upgrades than Q3_K_M
#### 🎯 Best For:
- General-purpose applications where you need a reasonable compromise
- When you want good-enough quality without optimization effort
- Production deployments where simplicity is valued over maximum quality
- Systems where file size matters but you can't invest in optimization
---
### 2. Q3_K_M (Medium) - The Speed Champion
**Perplexity:** 31.81 ± 0.29
**File Size:** 389.12 MiB (4.34 bpw)
**Speed:** 223.44 tokens/second
#### ✅ Pros:
- **Fastest inference** - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
- **Good quality** - only 0.59 points worse than Q3_HIFI
- **Automatic tensor upgrades** - uses Q4_K and Q5_K for critical tensors
- **Production-ready** - zero configuration required
- **CPU_REPACK support** - optimized memory layout (91 MiB repack buffer)
- **Best speed-to-quality ratio** - excellent performance for the quality level
#### ❌ Cons:
- **Largest file size** - 389 MiB (26% larger than Q3_HIFI, 6% larger than Q3_K_S)
- **Higher memory usage** - 911 MiB total
- **Not the best quality** - 0.59 points worse than Q3_HIFI
#### 🎯 Best For:
- **Real-time applications** where speed is critical
- **Interactive systems** requiring low latency
- **Production deployments** where speed matters more than file size
- **Systems with sufficient storage** but need maximum throughput
- **When you want good quality without configuration effort**
---
### 3. Q3_HIFI (Optimized) - The Quality & Size Champion
**Perplexity:** 31.22 ± 0.28 ⭐ **BEST**
**File Size:** 308.23 MiB (3.44 bpw) ⭐ **SMALLEST**
**Speed:** 189.23 tokens/second
#### ✅ Pros:
- **Best quality** - 31.22 perplexity (0.59 points better than Q3_K_M, 4.63 points better than Q3_K_S)
- **Smallest file size** - 308 MiB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M)
- **Lowest memory usage** - 830 MiB total
- **Best quality-to-size ratio** - highest quality in smallest package
- **IMatrix-guided quantization** - uses importance matrix for optimal outlier selection
- **Expanded tensor coverage** - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - preserves precision for critical weights
#### ❌ Cons:
- **Slower inference than Q3_K_M** - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S)
- **Requires configuration** - needs IMatrix generation and tensor-type specification
- **More setup effort** - must generate imatrix file and specify quantization strategy
- **Longer quantization time** - IMatrix generation takes 30-60 minutes
#### 🎯 Best For:
- **Quality-critical applications** where accuracy matters most
- **Storage-constrained systems** - mobile devices, embedded systems
- **Offline deployments** where file size is a concern
- **When you can invest in proper quantization setup**
- **Research and development** where quality is the priority
- **Production systems** where quality > speed
---
## Head-to-Head Comparisons
### Quality Comparison
| Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------------|------------|-----------|-----------|
| **Q3_HIFI** | **31.22** | Baseline | **-0.59** ⭐ | **-4.63** ⭐ |
| Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 |
| Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline |
**Winner:** Q3_HIFI (best quality by 0.59 points over Q3_K_M)
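
The perplexity figures come with error estimates, so it is worth checking how large each gap is relative to them. The sketch below treats the ± values as standard errors and the runs as independent, both of which are simplifying assumptions (all formats were scored on the same test set), and expresses each gap in units of the combined error. The helper name `separation` is illustrative only.

```python
import math

def separation(a: float, a_err: float, b: float, b_err: float) -> float:
    """Gap between two estimates in units of their combined uncertainty."""
    return abs(a - b) / math.hypot(a_err, b_err)

# Perplexity and error values taken from the Performance Metrics table
hifi_vs_km = separation(31.22, 0.28, 31.81, 0.29)
hifi_vs_ks = separation(31.22, 0.28, 35.85, 0.32)

print(f"Q3_HIFI vs Q3_K_M: {hifi_vs_km:.1f} combined errors apart")
print(f"Q3_HIFI vs Q3_K_S: {hifi_vs_ks:.1f} combined errors apart")
```

Under these assumptions the Q3_HIFI lead over Q3_K_M is about 1.5 combined errors, a modest margin, while the lead over Q3_K_S is more than 10 and not in doubt.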
### File Size Comparison
| Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------|------------|-----------|-----------|
| **Q3_HIFI** | **308 MiB** | Baseline | **-81 MiB** ⭐ | **-58 MiB** ⭐ |
| Q3_K_S | 366 MiB | +58 MiB | -23 MiB | Baseline |
| Q3_K_M | 389 MiB | +81 MiB | Baseline | +23 MiB |
**Winner:** Q3_HIFI (smallest by 58-81 MiB)
### Speed Comparison
| Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|-------|------------|-----------|-----------|
| **Q3_K_M** | **223 tok/s** | **+34 tok/s** ⭐ | Baseline | **+44 tok/s** ⭐ |
| Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s |
| Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline |
**Winner:** Q3_K_M (fastest by 18-24%)
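
The percentage figures quoted throughout this document can be reproduced from the measured values. The sketch below recomputes them as signed differences relative to the format being compared against; the helper names `pct_diff` and `compare` are illustrative, not part of any tool.

```python
speed = {"Q3_K_S": 179.80, "Q3_K_M": 223.44, "Q3_HIFI": 189.23}  # tok/s
size = {"Q3_K_S": 366.19, "Q3_K_M": 389.12, "Q3_HIFI": 308.23}   # MiB

def pct_diff(value: float, baseline: float) -> float:
    """Signed percentage difference of value relative to baseline."""
    return (value - baseline) / baseline * 100

def compare(metric: dict[str, float], name: str) -> dict[str, float]:
    """Percentage difference of `name` against every other format."""
    return {other: round(pct_diff(metric[name], v), 1)
            for other, v in metric.items() if other != name}

km_speed = compare(speed, "Q3_K_M")
hifi_size = compare(size, "Q3_HIFI")
print("Q3_K_M speed vs others:", km_speed)
print("Q3_HIFI size vs others:", hifi_size)
```

With these inputs Q3_K_M comes out about 24% faster than Q3_K_S and 18% faster than Q3_HIFI, and Q3_HIFI about 16% smaller than Q3_K_S and 21% smaller than Q3_K_M.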
---
## Recommendations
### 🎯 Best Overall: **Q3_HIFI** (With IMatrix + Expanded Coverage)
- ✅ **Best quality** (31.22) - beats Q3_K_M by 0.59 points
- ✅ **Smallest file size** (308 MiB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
- ✅ **Best quality-to-size ratio** - best quality in smallest package
- ⚠️ Requires IMatrix + tensor-type configuration
- ⚠️ Slower inference than Q3_K_M (189 tok/s vs 223 tok/s)
- **Use when:** You want the best quality in the smallest file and can invest in proper quantization setup
### ⚡ Best Speed: **Q3_K_M** (Out of the Box)
- ✅ **Fastest inference** (223 tok/s) - 18% faster than Q3_HIFI
- ✅ **Good quality** (31.81) - only 0.59 points worse than Q3_HIFI
- ✅ Automatic tensor upgrades
- ✅ Production-ready immediately - zero configuration
- ⚠️ Largest file size (389 MiB)
- **Use when:** Speed is critical and you want good quality without configuration effort
### ⚖️ Best Balance: **Q3_K_S**
- ✅ Usable quality (35.85) and speed (180 tok/s), though it trails both other formats on each
- ✅ Smaller than Q3_K_M (366 MiB vs 389 MiB)
- ✅ Simple quantization process - zero configuration
- ⚠️ Not the best in any category
- **Use when:** You need a balanced compromise without optimization effort
---
## Technical Notes
### Why Q3_HIFI Achieves Best Quality
Q3_HIFI (with proper configuration) achieves the best quality through:
- **IMatrix-guided outlier selection** - Uses importance weights to select the most critical outliers
- **Expanded tensor coverage** - Applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - Preserves precision for the most important weights
This combination allows Q3_HIFI to achieve **31.22 perplexity** - better than Q3_K_M's 31.81.
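
To make the idea of importance-guided outlier selection concrete, here is a minimal sketch. The scoring rule (importance times squared weight), the toy block of 16 weights, and the function name `select_outliers` are all assumptions for illustration; this document does not specify how Q3_HIFI actually scores candidates or how large its blocks are.

```python
import heapq

N_OUTLIERS = 6  # outliers kept per block, per the notes above

def select_outliers(weights, importance, k=N_OUTLIERS):
    """Return sorted indices of the k weights with the largest
    importance-weighted squared magnitude (an assumed scoring rule)."""
    scores = [imp * w * w for w, imp in zip(weights, importance)]
    top = heapq.nlargest(k, range(len(weights)), key=scores.__getitem__)
    return sorted(top)

# Toy block; real block sizes are not stated in this document.
weights = [0.02, -0.9, 0.05, 0.4, -0.03, 0.7, 0.01, -0.6,
           0.08, 0.3, -0.02, 0.5, 0.04, -0.07, 0.06, 0.09]

uniform = [1.0] * len(weights)
boosted = list(uniform)
boosted[15] = 20.0  # pretend the importance matrix flags weight 15

by_magnitude = select_outliers(weights, uniform)
by_importance = select_outliers(weights, boosted)
print(by_magnitude)
print(by_importance)
```

With uniform importance the six largest-magnitude weights are kept. Once weight 15 is flagged as important, it displaces a larger but less important weight, which is the effect the importance matrix is meant to have.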
### Why Q3_HIFI is Slower
Q3_HIFI's unique architecture (6 FP16 outliers per block) requires:
- More memory lookups (scattered access pattern for outlier indices)
- Additional FP16-to-FP32 conversions for outliers
- More complex dequantization logic
- At the time of these measurements, limited to CPU with basic vectorization (no GPU/SIMD optimizations)
SIMD/GPU optimizations have since been implemented, so speed should improve in future builds.
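
The extra work listed above can be illustrated with a toy dequantization routine. The block layout used here (one scale per block, signed 3-bit codes, a list of outlier positions, and raw FP16 bit patterns) is a hypothetical simplification, not the actual Q3_HIFI format, which this document does not describe in detail.

```python
import struct

def fp16_to_float(bits: int) -> float:
    """Decode a raw 16-bit pattern as an IEEE half-precision float."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def dequantize_block(q, scale, outlier_idx, outlier_fp16):
    """Dequantize one block: dense pass, then scattered outlier overwrite.

    q            -- signed 3-bit codes in [-4, 3], one per weight
    scale        -- per-block scale factor (assumed layout)
    outlier_idx  -- positions of the outliers within the block
    outlier_fp16 -- raw FP16 bit patterns of the outlier values
    """
    out = [scale * code for code in q]        # regular, easily vectorized pass
    for i, bits in zip(outlier_idx, outlier_fp16):
        out[i] = fp16_to_float(bits)          # extra conversion and scattered write
    return out

block = dequantize_block(
    q=[1, -2, 3, 0, -4, 2, 1, -1],
    scale=0.05,
    outlier_idx=[2, 5],
    outlier_fp16=[0x3C00, 0xC000],  # 1.0 and -2.0 in FP16
)
print(block)
```

The dense pass is the part that K-quant kernels already handle well. The second loop is the additional cost: its writes land at data-dependent positions, which is harder to vectorize than the regular pass.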
### Why Q3_K_M is Fastest
Q3_K_M benefits from:
- **CPU_REPACK optimization** - Optimized memory layout (91 MiB repack buffer)
- **Mature optimizations** - Well-optimized SIMD/GPU kernels
- **Automatic tensor upgrades** - Uses Q4_K and Q5_K for critical tensors, reducing computation
- **Efficient block structure** - Optimized for speed over extreme compression
### Quantization Configuration
**Q3_HIFI (Optimized):**
```powershell
.\build\bin\Release\llama-quantize.exe `
--imatrix .\qwen3-0.6b-imatrix.gguf `
--tensor-type "attn_v=q3_hifi" `
--tensor-type "attn_q=q3_hifi" `
--tensor-type "attn_k=q3_hifi" `
--tensor-type "ffn_down=q3_hifi" `
--tensor-type "ffn_gate=q3_hifi" `
--tensor-type "ffn_up=q3_hifi" `
--tensor-type "attn_output.weight=q4_k" `
--tensor-type "output.weight=q6_k" `
--tensor-type ".*=q3_k" `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_HIFI.gguf `
Q3_HIFI
```
**Q3_K_M (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_K_M.gguf `
Q3_K_M
```
**Q3_K_S (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_K_S.gguf `
Q3_K_S
```
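
The three invocations above differ only in the optional `--imatrix` and `--tensor-type` arguments. As a sketch, the argument list can be assembled from a small mapping so that the per-tensor overrides live in one place. The helper `quantize_args` is hypothetical; only the flags and file names shown in the commands above are used.

```python
def quantize_args(src, dst, ftype, imatrix=None, tensor_types=None):
    """Build the llama-quantize argument list from the options used above."""
    args = []
    if imatrix:
        args += ["--imatrix", imatrix]
    for pattern, qtype in (tensor_types or {}).items():
        args += ["--tensor-type", f"{pattern}={qtype}"]
    return args + [src, dst, ftype]

# Per-tensor overrides from the Q3_HIFI command, in the same order
HIFI_TENSORS = {name: "q3_hifi" for name in
                ("attn_v", "attn_q", "attn_k", "ffn_down", "ffn_gate", "ffn_up")}
HIFI_TENSORS.update({"attn_output.weight": "q4_k",
                     "output.weight": "q6_k",
                     ".*": "q3_k"})

simple = quantize_args("Qwen3-0.6B-f16.gguf",
                       "Qwen3-0.6B-f16-Q3_K_M.gguf", "Q3_K_M")
hifi = quantize_args("Qwen3-0.6B-f16.gguf",
                     "Qwen3-0.6B-f16-Q3_HIFI.gguf", "Q3_HIFI",
                     imatrix="qwen3-0.6b-imatrix.gguf",
                     tensor_types=HIFI_TENSORS)
print(simple)
print(hifi)
```

Either list can be appended to the path of the `llama-quantize` executable and passed to `subprocess.run`.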
---
## Decision Matrix
| Priority | Recommended Format | Reason |
|----------|-------------------|--------|
| **Quality** | Q3_HIFI | Best perplexity (31.22) |
| **File Size** | Q3_HIFI | Smallest (308 MiB) |
| **Speed** | Q3_K_M | Fastest (223 tok/s) |
| **Simplicity** | Q3_K_S or Q3_K_M | Zero configuration |
| **Quality + Size** | Q3_HIFI | Best of both |
| **Speed + Quality** | Q3_K_M | Good balance |
| **Production (Simple)** | Q3_K_M | Fast + good quality |
| **Production (Optimized)** | Q3_HIFI | Best quality + smallest |
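
The same decision can be expressed as a constraint filter over the measured values: pick the lowest-perplexity format that satisfies a throughput floor, a size ceiling, and a setup-effort limit. This is only an illustrative sketch; the function `pick_format` and its parameters are hypothetical, and the numbers are those measured for Qwen3-0.6B in this document.

```python
FORMATS = {
    # name: (perplexity, size in MiB, speed in tok/s, needs imatrix setup)
    "Q3_K_S": (35.85, 366.19, 179.80, False),
    "Q3_K_M": (31.81, 389.12, 223.44, False),
    "Q3_HIFI": (31.22, 308.23, 189.23, True),
}

def pick_format(min_speed=0.0, max_size=float("inf"), allow_setup=True):
    """Return the lowest-perplexity format meeting the constraints, or None."""
    candidates = [
        (ppl, name)
        for name, (ppl, size, speed, setup) in FORMATS.items()
        if speed >= min_speed and size <= max_size and (allow_setup or not setup)
    ]
    return min(candidates)[1] if candidates else None

print(pick_format())                    # no constraints
print(pick_format(min_speed=200))       # throughput floor
print(pick_format(allow_setup=False))   # zero-configuration only
```

With no constraints the filter returns Q3_HIFI; a 200 tok/s floor or a zero-configuration requirement both lead to Q3_K_M, matching the matrix above.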
---
## Conclusion
**For most users:**
- **Choose Q3_K_M** if speed is your priority and you want good quality without configuration
- **Choose Q3_HIFI** if quality and file size are your priorities and you can invest in setup
- **Choose Q3_K_S** if you want a simple, balanced option
**For quality-critical applications:**
Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MiB), though it requires more setup effort and runs about 15% slower than Q3_K_M.
**For speed-critical applications:**
Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration.
---
**Generated:** 2025-12-03
**Test Dataset:** wikitext-2-raw/wiki.test.raw
**Model:** Qwen3-0.6B
**Evaluation Parameters:** --ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512