# Q3 Quantization Format Comparison Summary
## Executive Summary
This document compares three 3-bit quantization formats for the Qwen3-0.6B model: **Q3_K_S**, **Q3_K_M**, and **Q3_HIFI**. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters.
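For reference, the perplexity figures in this document come from a llama.cpp perplexity run. A hedged reconstruction of the likely invocation, assuming the same Windows build layout as the quantization commands shown later (the model path is illustrative; the parameters match those listed at the end of this document):

```powershell
# Hedged reconstruction of the evaluation command (paths are assumptions).
# Run once per quantized model to reproduce the perplexity figures.
.\build\bin\Release\llama-perplexity.exe `
  -m .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  -f .\wikitext-2-raw\wiki.test.raw `
  --ppl-stride 0 --ppl-output-type 0 `
  -b 2048 -c 512
```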
---
## Performance Metrics
| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|--------|--------|--------|---------|
| **Perplexity** | 35.85 ± 0.32 | 31.81 ± 0.29 | **31.22 ± 0.28** ⭐ |
| **File Size** | 366.19 MiB | 389.12 MiB | **308.23 MiB** ⭐ |
| **Bits Per Weight** | 4.09 bpw | 4.34 bpw | **3.44 bpw** ⭐ |
| **Inference Speed** | 179.80 tok/s | **223.44 tok/s** ⭐ | 189.23 tok/s |
| **Memory Usage** | 888 MiB | 911 MiB | **830 MiB** ⭐ |
| **Quality Rank** | 3rd | 2nd | **1st** ⭐ |
| **Size Rank** | 2nd | 3rd | **1st** ⭐ |
| **Speed Rank** | 3rd | **1st** ⭐ | 2nd |
---
## Detailed Analysis
### 1. Q3_K_S (Small) - The Balanced Option
**Perplexity:** 35.85 ± 0.32
**File Size:** 366.19 MiB (4.09 bpw)
**Speed:** 179.80 tokens/second
#### ✅ Pros:
- **Good balance** between quality, size, and speed
- **Smaller than Q3_K_M** (366 MB vs 389 MB)
- **Simple quantization** - zero configuration required
- **Automatic tensor upgrades** - uses Q6_K for output.weight
- **Production-ready** - works out of the box
#### ❌ Cons:
- **Worst quality** - 4.63 points worse perplexity than Q3_HIFI
- **Not the best in any category** - and the slowest of the three (179.80 tok/s)
- **Lower precision** - fewer automatic upgrades than Q3_K_M
#### 🎯 Best For:
- General-purpose applications where you need a reasonable compromise
- When you want good-enough quality without optimization effort
- Production deployments where simplicity is valued over maximum quality
- Systems where file size matters but you can't invest in optimization
---
### 2. Q3_K_M (Medium) - The Speed Champion
**Perplexity:** 31.81 ± 0.29
**File Size:** 389.12 MiB (4.34 bpw)
**Speed:** 223.44 tokens/second
#### ✅ Pros:
- **Fastest inference** - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
- **Good quality** - only 0.59 points worse than Q3_HIFI
- **Automatic tensor upgrades** - uses Q4_K and Q5_K for critical tensors
- **Production-ready** - zero configuration required
- **CPU_REPACK support** - optimized memory layout (91 MiB repack buffer)
- **Best speed-to-quality ratio** - excellent performance for the quality level
#### ❌ Cons:
- **Largest file size** - 389 MB (26% larger than Q3_HIFI, 6% larger than Q3_K_S)
- **Higher memory usage** - 911 MiB total
- **Not the best quality** - 0.59 points worse than Q3_HIFI
#### 🎯 Best For:
- **Real-time applications** where speed is critical
- **Interactive systems** requiring low latency
- **Production deployments** where speed matters more than file size
- **Systems with sufficient storage** but need maximum throughput
- **When you want good quality without configuration effort**
---
### 3. Q3_HIFI (Optimized) - The Quality & Size Champion
**Perplexity:** 31.22 ± 0.28 ⭐ **BEST**
**File Size:** 308.23 MiB (3.44 bpw) ⭐ **SMALLEST**
**Speed:** 189.23 tokens/second
#### ✅ Pros:
- **Best quality** - 31.22 perplexity (0.59 points better than Q3_K_M, 4.6 points better than Q3_K_S)
- **Smallest file size** - 308 MB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M)
- **Lowest memory usage** - 830 MiB total
- **Best quality-to-size ratio** - highest quality in smallest package
- **IMatrix-guided quantization** - uses importance matrix for optimal outlier selection
- **Expanded tensor coverage** - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - preserves precision for critical weights
#### ❌ Cons:
- **Slower than Q3_K_M** - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S)
- **Requires configuration** - needs IMatrix generation and tensor-type specification
- **More setup effort** - must generate imatrix file and specify quantization strategy
- **Longer quantization time** - IMatrix generation takes 30-60 minutes
#### 🎯 Best For:
- **Quality-critical applications** where accuracy matters most
- **Storage-constrained systems** - mobile devices, embedded systems
- **Offline deployments** where file size is a concern
- **When you can invest in proper quantization setup**
- **Research and development** where quality is the priority
- **Production systems** where quality > speed
---
## Head-to-Head Comparisons
### Quality Comparison
| Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------------|------------|-----------|-----------|
| **Q3_HIFI** | **31.22** | Baseline | **-0.59** ⭐ | **-4.63** ⭐ |
| Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 |
| Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline |
**Winner:** Q3_HIFI (best quality by 0.59 points over Q3_K_M)
### File Size Comparison
| Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------|------------|-----------|-----------|
| **Q3_HIFI** | **308 MB** | Baseline | **-81 MB** ⭐ | **-58 MB** ⭐ |
| Q3_K_S | 366 MB | +58 MB | -23 MB | Baseline |
| Q3_K_M | 389 MB | +81 MB | Baseline | +23 MB |
**Winner:** Q3_HIFI (smallest by 58-81 MB)
### Speed Comparison
| Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|-------|------------|-----------|-----------|
| **Q3_K_M** | **223 tok/s** | **+34 tok/s** ⭐ | Baseline | **+44 tok/s** ⭐ |
| Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s |
| Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline |
**Winner:** Q3_K_M (fastest by 18-24%)
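Throughput figures like these can be reproduced with llama.cpp's `llama-bench` tool. A sketch assuming the same build layout as the quantization commands below (the `-p`/`-n` values here are illustrative defaults, not necessarily the ones used for the table above):

```powershell
# Benchmark all three quantized models in one run; llama-bench accepts
# repeated -m flags and reports prompt and generation tok/s per model.
.\build\bin\Release\llama-bench.exe `
  -m .\Qwen3-0.6B-f16-Q3_K_S.gguf `
  -m .\Qwen3-0.6B-f16-Q3_K_M.gguf `
  -m .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  -p 512 -n 128
```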
---
## Recommendations
### 🎯 Best Overall: **Q3_HIFI** (With IMatrix + Expanded Coverage)
- ✅ **Best quality** (31.22) - beats Q3_K_M by 0.59 points
- ✅ **Smallest file size** (308 MB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
- ✅ **Best quality-to-size ratio** - best quality in smallest package
- ⚠️ Requires IMatrix + tensor-type configuration
- ⚠️ Slower inference (189 tok/s vs 223 tok/s)
- **Use when:** You want the best quality in the smallest file and can invest in proper quantization setup
### ⚡ Best Speed: **Q3_K_M** (Out of the Box)
- ✅ **Fastest inference** (223 tok/s) - 18% faster than Q3_HIFI
- ✅ **Good quality** (31.81) - only 0.59 points worse than Q3_HIFI
- ✅ Automatic tensor upgrades
- ✅ Production-ready immediately - zero configuration
- ⚠️ Largest file size (389 MB)
- **Use when:** Speed is critical and you want good quality without configuration effort
### ⚖️ Best Balance: **Q3_K_S**
- ✅ Simple middle ground on size, with acceptable quality (35.85) and speed (180 tok/s)
- ✅ Smaller than Q3_K_M (366 MB vs 389 MB)
- ✅ Simple quantization process - zero configuration
- ⚠️ Not the best in any category
- **Use when:** You need a balanced compromise without optimization effort
---
## Technical Notes
### Why Q3_HIFI Achieves Best Quality
Q3_HIFI (with proper configuration) achieves the best quality through:
- **IMatrix-guided outlier selection** - Uses importance weights to select the most critical outliers
- **Expanded tensor coverage** - Applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - Preserves precision for the most important weights
This combination allows Q3_HIFI to achieve **31.22 perplexity** - better than Q3_K_M's 31.81.
### Why Q3_HIFI is Slower
Q3_HIFI's unique architecture (6 FP16 outliers per block) requires:
- More memory lookups (scattered access pattern for outlier indices)
- Additional FP16-to-FP32 conversions for outliers
- More complex dequantization logic
- Currently limited to CPU with basic vectorization (no mature SIMD/GPU kernels in this build)
However, with SIMD/GPU optimizations for this format now being implemented, speed should improve significantly in future builds.
### Why Q3_K_M is Fastest
Q3_K_M benefits from:
- **CPU_REPACK optimization** - Optimized memory layout (91 MiB repack buffer)
- **Mature optimizations** - Well-optimized SIMD/GPU kernels
- **Automatic tensor upgrades** - Uses Q4_K and Q5_K for critical tensors, reducing computation
- **Efficient block structure** - Optimized for speed over extreme compression
### Quantization Configuration
**Q3_HIFI (Optimized):**
```powershell
.\build\bin\Release\llama-quantize.exe `
--imatrix .\qwen3-0.6b-imatrix.gguf `
--tensor-type "attn_v=q3_hifi" `
--tensor-type "attn_q=q3_hifi" `
--tensor-type "attn_k=q3_hifi" `
--tensor-type "ffn_down=q3_hifi" `
--tensor-type "ffn_gate=q3_hifi" `
--tensor-type "ffn_up=q3_hifi" `
--tensor-type "attn_output.weight=q4_k" `
--tensor-type "output.weight=q6_k" `
--tensor-type ".*=q3_k" `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_HIFI.gguf `
Q3_HIFI
```
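The `--imatrix` file referenced above must be generated first. A sketch using llama.cpp's `llama-imatrix` tool, assuming the same build layout (`calibration.txt` is a placeholder name; any representative text corpus works):

```powershell
# Generate the importance matrix from a calibration corpus.
# calibration.txt is an assumed placeholder; this is the 30-60 minute
# step of the Q3_HIFI setup mentioned earlier.
.\build\bin\Release\llama-imatrix.exe `
  -m .\Qwen3-0.6B-f16.gguf `
  -f .\calibration.txt `
  -o .\qwen3-0.6b-imatrix.gguf
```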
**Q3_K_M (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_K_M.gguf `
Q3_K_M
```
**Q3_K_S (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_K_S.gguf `
Q3_K_S
```
---
## Decision Matrix
| Priority | Recommended Format | Reason |
|----------|-------------------|--------|
| **Quality** | Q3_HIFI | Best perplexity (31.22) |
| **File Size** | Q3_HIFI | Smallest (308 MB) |
| **Speed** | Q3_K_M | Fastest (223 tok/s) |
| **Simplicity** | Q3_K_S or Q3_K_M | Zero configuration |
| **Quality + Size** | Q3_HIFI | Best of both |
| **Speed + Quality** | Q3_K_M | Good balance |
| **Production (Simple)** | Q3_K_M | Fast + good quality |
| **Production (Optimized)** | Q3_HIFI | Best quality + smallest |
---
## Conclusion
**For most users:**
- **Choose Q3_K_M** if speed is your priority and you want good quality without configuration
- **Choose Q3_HIFI** if quality and file size are your priorities and you can invest in setup
- **Choose Q3_K_S** if you want a simple, balanced option
**For quality-critical applications:**
Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MB), though it requires more setup effort and is slightly slower.
**For speed-critical applications:**
Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration.
---
**Generated:** 2025-12-03
**Test Dataset:** wikitext-2-raw/wiki.test.raw
**Model:** Qwen3-0.6B
**Evaluation Parameters:** --ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512