geoffmunn committed (verified) · Commit 2788ad0 · Parent: 860d659

Create Q3_Quantisation_Comparison.md

Files changed (1): Q3_Quantisation_Comparison.md (added, +218 −0)

# Q3 Quantization Formats Comparison

## Executive Summary

This document compares three Q3 quantization formats for the Qwen3-0.6B model based on perplexity evaluation results.

---

## Performance Metrics

| Format | Perplexity | File Size | Bits/Weight | Speed | Quality Rank | Size Rank | Speed Rank |
|--------|------------|-----------|-------------|-------|--------------|-----------|------------|
| **Q3_K_M** | **31.81 ± 0.29** | 389.12 MiB | 4.34 BPW | **240.34 tok/s** | 🥇 **Best** | 3rd (Largest) | 🥇 **Fastest** |
| **Q3_K_S** | 35.85 ± 0.32 | 366.19 MiB | 4.09 BPW | 197.90 tok/s | 2nd | 2nd | 2nd |
| **Q3_HIFI** | 37.41 ± 0.34 | **308.23 MiB** | **3.44 BPW** | 132.08 tok/s | 3rd | 🥇 **Smallest** | 3rd (Slowest) |

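The percentage gaps quoted throughout this document are derived from the figures in this table; a small Python sanity-check sketch (values hard-coded from the table above):

```python
# Rough sanity check of the relative gaps quoted in the sections below,
# using the figures from the table above (hard-coded from this document).
stats = {
    "Q3_K_M":  {"ppl": 31.81, "size_mib": 389.12, "tok_s": 240.34},
    "Q3_K_S":  {"ppl": 35.85, "size_mib": 366.19, "tok_s": 197.90},
    "Q3_HIFI": {"ppl": 37.41, "size_mib": 308.23, "tok_s": 132.08},
}

def pct_more(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100

def pct_less(a: float, b: float) -> float:
    """How much smaller a is than b, as a percentage of b."""
    return (b - a) / b * 100

km, ks, hifi = stats["Q3_K_M"], stats["Q3_K_S"], stats["Q3_HIFI"]

print(f"Q3_HIFI perplexity vs Q3_K_M: +{hifi['ppl'] - km['ppl']:.1f} points "
      f"({pct_more(hifi['ppl'], km['ppl']):.0f}% higher)")                                   # +5.6 points, ~18%
print(f"Q3_HIFI size vs Q3_K_M: {pct_less(hifi['size_mib'], km['size_mib']):.0f}% smaller")  # ~21%
print(f"Q3_HIFI size vs Q3_K_S: {pct_less(hifi['size_mib'], ks['size_mib']):.0f}% smaller")  # ~16%
print(f"Q3_K_M speed vs Q3_HIFI: {pct_more(km['tok_s'], hifi['tok_s']):.0f}% faster")        # ~82%
```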
---

## Detailed Analysis

### Q3_K_M (Medium) - Best Quality & Speed

**Tensor Distribution:**
- f32: 113 tensors (norm layers)
- q3_K: 113 tensors
- q4_K: 81 tensors (upgraded for quality)
- q5_K: 3 tensors (critical layers)
- q6_K: 1 tensor (output.weight)

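The tensor-type breakdowns listed in this document can be checked directly against the GGUF files; a minimal sketch, assuming the `gguf` Python package from llama.cpp's `gguf-py` and a placeholder local path:

```python
# Count tensor types in a GGUF file, e.g. to reproduce the distribution
# listed above. Assumes the `gguf` package from llama.cpp's gguf-py;
# the file path below is a placeholder.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("Qwen3-0.6B-Q3_K_M.gguf")  # placeholder path
counts = Counter(t.tensor_type.name for t in reader.tensors)

# Print one line per quantization type, most common first.
for qtype, n in counts.most_common():
    print(f"{qtype:>6}: {n} tensors")
```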
**Pros:**
- ✅ **Best perplexity** (31.81) - 5.6 points better than Q3_HIFI
- ✅ **Fastest inference** (240.34 tokens/sec) - 82% faster than Q3_HIFI
- ✅ **Balanced approach** - Uses mixed precision (Q3/Q4/Q5/Q6) for optimal quality
- ✅ **Automatic tensor upgrades** - Intelligently upgrades critical tensors
- ✅ **Best for production** - Excellent quality-to-speed ratio

**Cons:**
- ❌ **Largest file size** (389 MB) - 26% larger than Q3_HIFI
- ❌ **Higher memory usage** - Requires more RAM

**When to Use:**
- Production deployments requiring the best quality
- Applications where speed matters
- When file size is not a primary constraint
- General-purpose language model tasks

---

### Q3_K_S (Small) - Balanced Option

**Tensor Distribution:**
- f32: 113 tensors (norm layers)
- q3_K: 197 tensors (most tensors)
- q6_K: 1 tensor (output.weight)

**Pros:**
- ✅ **Good balance** - Better quality than Q3_HIFI, smaller than Q3_K_M
- ✅ **Reasonable speed** (197.90 tokens/sec) - 50% faster than Q3_HIFI
- ✅ **Smaller than Q3_K_M** - 6% reduction in file size
- ✅ **Simpler quantization** - Less aggressive tensor upgrades

**Cons:**
- ❌ **Worse quality than Q3_K_M** - 4.0 points higher perplexity
- ❌ **Slower than Q3_K_M** - 18% slower inference
- ❌ **Still larger than Q3_HIFI** - 19% bigger file

**When to Use:**
- When you need better quality than Q3_HIFI in a smaller file than Q3_K_M
- Moderate quality requirements
- Balanced size/quality/speed trade-offs

---

### Q3_HIFI - Smallest Size

**Tensor Distribution:**
- f32: 113 tensors (norm layers)
- q3_K: 198 tensors (most tensors use Q3_K, not Q3_HIFI!)

**Note:** This appears to be a hybrid model in which most tensors are Q3_K rather than pure Q3_HIFI.

**Pros:**
- ✅ **Smallest file size** (308 MB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
- ✅ **Lowest bits/weight** (3.44 BPW) - Most efficient compression
- ✅ **Unique architecture** - 6 FP16 outliers per block for precision
- ✅ **Best for storage-constrained** environments

**Cons:**
- ❌ **Worst perplexity** (37.41) - 5.6 points worse than Q3_K_M
- ❌ **Slowest inference** (132.08 tokens/sec) - 45% slower than Q3_K_M
- ❌ **Limited tensor coverage** - Most tensors still use Q3_K instead of Q3_HIFI
- ❌ **No automatic upgrades** - Missing the mixed-precision benefits of Q3_K_S/M

**When to Use:**
- Storage-constrained environments (mobile, embedded)
- When file size is the primary concern
- Offline/archival purposes
- When quality can be sacrificed for size

---

## Quality Comparison

```
Perplexity (Lower is Better):
Q3_K_M:  ████████████████████████████████ 31.81 ⭐ Best
Q3_K_S:  ████████████████████████████████████ 35.85
Q3_HIFI: █████████████████████████████████████ 37.41
```

**Quality Gap:**
- Q3_HIFI's perplexity is **18% higher** than Q3_K_M's (5.6 points)
- Q3_HIFI's perplexity is **4% higher** than Q3_K_S's (1.6 points)

---

## Size Comparison

```
File Size (Smaller is Better):
Q3_HIFI: ███████████████████████████████ 308 MB ⭐ Smallest
Q3_K_S:  █████████████████████████████████████ 366 MB
Q3_K_M:  ███████████████████████████████████████ 389 MB
```

**Size Savings:**
- Q3_HIFI is **16% smaller** than Q3_K_S
- Q3_HIFI is **21% smaller** than Q3_K_M

---

## Speed Comparison

```
Inference Speed (Higher is Better):
Q3_K_M:  ████████████████████████████████████████ 240 tok/s ⭐ Fastest
Q3_K_S:  █████████████████████████████████ 198 tok/s
Q3_HIFI: ██████████████████████ 132 tok/s
```

**Speed Advantage:**
- Q3_K_M is **82% faster** than Q3_HIFI
- Q3_K_S is **50% faster** than Q3_HIFI

---

## Recommendations

### 🎯 Best Overall: **Q3_K_M**
- Best quality and speed
- Worth the extra 81 MB for most use cases
- Recommended for production deployments

### 💾 Best for Storage: **Q3_HIFI**
- Smallest file size
- Acceptable if quality/speed are secondary
- Good for mobile/embedded systems

### ⚖️ Best Balance: **Q3_K_S**
- Middle ground between quality and size
- Good compromise when Q3_K_M is too large but Q3_HIFI quality is insufficient

---

## Technical Notes

### Why Q3_K_M is Best Quality

Q3_K_M uses **automatic tensor upgrades**:
- Critical tensors (first/last layers) → Q5_K or Q6_K
- Important tensors (attention outputs) → Q4_K
- Standard tensors → Q3_K

This mixed-precision approach preserves accuracy where it matters most.

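As a rough illustration of that upgrade idea only (a simplified sketch in the spirit of the rules above, not llama.cpp's actual selection code; tensor names and layer rules here are examples):

```python
# Illustrative sketch only: a simplified mixed-precision policy along the
# lines described above. llama.cpp's real type-selection logic is more
# involved; the names and thresholds below are examples, not its rules.
def pick_quant_type(tensor_name: str, layer: int, n_layers: int) -> str:
    # The output head is the most sensitive single tensor.
    if tensor_name == "output.weight":
        return "Q6_K"
    # First and last blocks are treated as critical layers.
    if layer in (0, n_layers - 1):
        return "Q5_K"
    # Attention output / value projections get a quality bump.
    if "attn_output" in tensor_name or "attn_v" in tensor_name:
        return "Q4_K"
    # Everything else stays at the base 3-bit type.
    return "Q3_K"

# Example: a mid-stack feed-forward tensor keeps the base type
# (layer counts here are illustrative).
print(pick_quant_type("blk.14.ffn_down.weight", layer=14, n_layers=28))  # Q3_K
```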
### Why Q3_HIFI is Slower

Q3_HIFI's unique architecture (6 FP16 outliers per block) costs speed because of:
- More memory lookups (scattered access pattern)
- No optimized SIMD/GPU kernels yet
- Additional dequantization overhead for the outliers

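The following is a minimal, illustrative sketch of the outlier-patching idea, not the actual Q3_HIFI kernel; the 256-element block size is an assumption, and only the outlier count (6) comes from the description above. It shows why the outlier pass is a scattered, hard-to-vectorise step on top of the bulk dequantization:

```python
# Illustrative outlier-patching dequantization in numpy. NOT the real
# Q3_HIFI kernel; it only shows why per-block FP16 outliers add a
# scattered second pass on top of the bulk 3-bit dequantization.
import numpy as np

BLOCK_SIZE = 256   # assumption: K-quant style super-block
N_OUTLIERS = 6     # from the format description above

def dequant_block(q3: np.ndarray, scale: np.float16,
                  outlier_idx: np.ndarray, outlier_val: np.ndarray) -> np.ndarray:
    """q3: (256,) ints in [-4, 3]; 6 positions are overridden by FP16 outliers."""
    # Bulk path: one multiply per element, SIMD/GPU friendly.
    out = q3.astype(np.float32) * np.float32(scale)
    # Outlier path: scattered indexed writes, the part that (per the notes
    # above) lacks optimized kernels and breaks the regular access pattern.
    out[outlier_idx] = outlier_val.astype(np.float32)
    return out

rng = np.random.default_rng(0)
block = dequant_block(
    q3=rng.integers(-4, 4, BLOCK_SIZE),
    scale=np.float16(0.02),
    outlier_idx=rng.choice(BLOCK_SIZE, N_OUTLIERS, replace=False),
    outlier_val=rng.standard_normal(N_OUTLIERS).astype(np.float16),
)
print(block.shape)  # (256,)
```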
### Why Q3_HIFI Quality is Lower

The current Q3_HIFI model appears to be a hybrid:
- Most tensors use Q3_K (not Q3_HIFI)
- Limited Q3_HIFI coverage reduces its benefits
- It is missing the automatic tensor upgrades of Q3_K_S/M

**Note:** A properly optimized Q3_HIFI with expanded coverage and an IMatrix can achieve **31.10 perplexity** (better than Q3_K_M!), but this requires:
- An IMatrix file for better outlier selection
- Expanded tensor-type arguments
- More quantization time

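A hedged sketch of that workflow, assuming a llama.cpp-style toolchain where `llama-imatrix` and `llama-quantize --imatrix` behave as in upstream, and a build that accepts `Q3_HIFI` as a target type; all paths and the calibration file are placeholders:

```python
# Sketch of the IMatrix workflow described above, via subprocess calls to
# llama.cpp-style tools. Binary names and flags follow upstream llama.cpp;
# Q3_HIFI as a target type presumably needs the build that adds it.
import subprocess

MODEL_F16 = "Qwen3-0.6B-F16.gguf"  # unquantized source model (placeholder)
CALIB = "calibration.txt"          # calibration text for the imatrix (placeholder)

# 1. Collect an importance matrix over calibration data.
subprocess.run(
    ["llama-imatrix", "-m", MODEL_F16, "-f", CALIB, "-o", "imatrix.dat"],
    check=True,
)

# 2. Quantize with the imatrix guiding outlier/type selection.
subprocess.run(
    ["llama-quantize", "--imatrix", "imatrix.dat",
     MODEL_F16, "Qwen3-0.6B-Q3_HIFI.gguf", "Q3_HIFI"],
    check=True,
)
```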
---

## Conclusion

**For most users:** Choose **Q3_K_M** - it offers the best quality and speed with only a modest size increase.

**For storage-constrained users:** Choose **Q3_HIFI** - accept the quality/speed trade-off for maximum compression.

**For balanced needs:** Choose **Q3_K_S** - good middle ground.

---

## Test Configuration

- **Model:** Qwen3-0.6B
- **Dataset:** wiki.test.raw (wikitext-2-raw)
- **Context:** 512 tokens
- **Hardware:** 16 threads, AVX2, FMA enabled
- **Build:** 7173 (6a7ff532) with MSVC 19.44.35217.0

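To reproduce the runs behind these numbers, something along the following lines should work, assuming a llama.cpp build that provides the `llama-perplexity` tool with its usual flags; model paths are placeholders:

```python
# Re-run the perplexity evaluation behind the table above with llama.cpp's
# perplexity tool. Flags mirror the test configuration listed here; the
# model filenames are placeholders.
import subprocess

for model in ("Qwen3-0.6B-Q3_K_M.gguf",
              "Qwen3-0.6B-Q3_K_S.gguf",
              "Qwen3-0.6B-Q3_HIFI.gguf"):
    subprocess.run(
        ["llama-perplexity",
         "-m", model,
         "-f", "wiki.test.raw",   # wikitext-2-raw test split
         "-c", "512",             # context size used in this comparison
         "-t", "16"],             # threads used in this comparison
        check=True,
    )
```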
---

*Generated from perplexity evaluation results*