# Q3 Quantization Format Comparison Summary

## Executive Summary

This document compares three 3-bit quantization formats for the Qwen3-0.6B model: **Q3_K_S**, **Q3_K_M**, and **Q3_HIFI**. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters.

---

## Performance Metrics

| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|--------|--------|--------|---------|
| **Perplexity** | 35.85 ± 0.32 | 31.81 ± 0.29 | **31.22 ± 0.28** ⭐ |
| **File Size** | 366.19 MiB | 389.12 MiB | **308.23 MiB** ⭐ |
| **Bits Per Weight** | 4.09 bpw | 4.34 bpw | **3.44 bpw** ⭐ |
| **Inference Speed** | 179.80 tok/s | **223.44 tok/s** ⭐ | 189.23 tok/s |
| **Memory Usage** | 888 MiB | 911 MiB | **830 MiB** ⭐ |
| **Quality Rank** | 3rd | 2nd | **1st** ⭐ |
| **Size Rank** | 2nd | 3rd | **1st** ⭐ |
| **Speed Rank** | 3rd | **1st** ⭐ | 2nd |

---

## Detailed Analysis

### 1. Q3_K_S (Small) - The Simple Option

**Perplexity:** 35.85 ± 0.32
**File Size:** 366.19 MiB (4.09 bpw)
**Speed:** 179.80 tokens/second

#### ✅ Pros:
- **Smaller than Q3_K_M** (366 MB vs 389 MB)
- **Lower memory than Q3_K_M** (888 MiB vs 911 MiB)
- **Simple quantization** - zero configuration required
- **Automatic tensor upgrades** - uses Q6_K for output.weight
- **Production-ready** - works out of the box

#### ❌ Cons:
- **Worst quality** - 4.6 points worse perplexity than Q3_HIFI
- **Slowest inference** - 180 tok/s, behind both Q3_HIFI and Q3_K_M
- **Lower precision** - fewer automatic upgrades than Q3_K_M

#### 🎯 Best For:
- General-purpose applications where you need a no-effort compromise
- When you want good-enough quality without optimization effort
- Production deployments where simplicity is valued over maximum quality
- Systems where file size matters but you can't invest in optimization

---

### 2. Q3_K_M (Medium) - The Speed Champion

**Perplexity:** 31.81 ± 0.29
**File Size:** 389.12 MiB (4.34 bpw)
**Speed:** 223.44 tokens/second

#### ✅ Pros:
- **Fastest inference** - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
- **Good quality** - only 0.59 points worse than Q3_HIFI
- **Automatic tensor upgrades** - uses Q4_K and Q5_K for critical tensors
- **Production-ready** - zero configuration required
- **CPU_REPACK support** - optimized memory layout (91 MiB repack buffer)
- **Best speed-to-quality ratio** - excellent performance for the quality level

#### ❌ Cons:
- **Largest file size** - 389 MB (26% larger than Q3_HIFI, 6% larger than Q3_K_S)
- **Highest memory usage** - 911 MiB total
- **Not the best quality** - 0.59 points worse than Q3_HIFI

#### 🎯 Best For:
- **Real-time applications** where speed is critical
- **Interactive systems** requiring low latency
- **Production deployments** where speed matters more than file size
- **Systems with sufficient storage** that need maximum throughput
- **When you want good quality without configuration effort**

---

### 3. Q3_HIFI (Optimized) - The Quality & Size Champion

**Perplexity:** 31.22 ± 0.28 ⭐ **BEST**
**File Size:** 308.23 MiB (3.44 bpw) ⭐ **SMALLEST**
**Speed:** 189.23 tokens/second

#### ✅ Pros:
- **Best quality** - 31.22 perplexity (0.59 points better than Q3_K_M, 4.6 points better than Q3_K_S)
- **Smallest file size** - 308 MB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M)
- **Lowest memory usage** - 830 MiB total
- **Best quality-to-size ratio** - highest quality in the smallest package
- **IMatrix-guided quantization** - uses an importance matrix for optimal outlier selection
- **Expanded tensor coverage** - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - preserves precision for critical weights

#### ❌ Cons:
- **Slower than Q3_K_M** - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S)
- **Requires configuration** - needs IMatrix generation and tensor-type specification
- **More setup effort** - must generate an imatrix file and specify the quantization strategy
- **Longer quantization time** - IMatrix generation takes 30-60 minutes

#### 🎯 Best For:
- **Quality-critical applications** where accuracy matters most
- **Storage-constrained systems** - mobile devices, embedded systems
- **Offline deployments** where file size is a concern
- **When you can invest in proper quantization setup**
- **Research and development** where quality is the priority
- **Production systems** where quality > speed

---

## Head-to-Head Comparisons

### Quality Comparison

| Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------------|------------|-----------|-----------|
| **Q3_HIFI** | **31.22** | Baseline | **-0.59** ⭐ | **-4.63** ⭐ |
| Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 |
| Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline |

**Winner:** Q3_HIFI (best quality by 0.59 points over Q3_K_M)

### File Size Comparison

| Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------|------------|-----------|-----------|
| **Q3_HIFI** | **308 MB** | Baseline | **-81 MB** ⭐ | **-58 MB** ⭐ |
| Q3_K_S | 366 MB | +58 MB | -23 MB | Baseline |
| Q3_K_M | 389 MB | +81 MB | Baseline | +23 MB |

**Winner:** Q3_HIFI (smallest by 58-81 MB)

### Speed Comparison

| Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|-------|------------|-----------|-----------|
| **Q3_K_M** | **223 tok/s** | **+34 tok/s** ⭐ | Baseline | **+44 tok/s** ⭐ |
| Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s |
| Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline |

**Winner:** Q3_K_M (fastest by 18-24%)

---

## Recommendations

### 🎯 Best Overall: **Q3_HIFI** (With IMatrix + Expanded Coverage)
- ✅ **Best quality** (31.22) - beats Q3_K_M by 0.59 points
- ✅ **Smallest file size** (308 MB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
- ✅ **Best quality-to-size ratio** - best quality in the smallest package
- ⚠️ Requires IMatrix + tensor-type configuration
- ⚠️ Slower inference (189 tok/s vs 223 tok/s)
- **Use when:** You want the best quality in the smallest file and can invest in proper quantization setup

### ⚡ Best Speed: **Q3_K_M** (Out of the Box)
- ✅ **Fastest inference** (223 tok/s) - 18% faster than Q3_HIFI
- ✅ **Good quality** (31.81) - only 0.59 points worse than Q3_HIFI
- ✅ Automatic tensor upgrades
- ✅ Production-ready immediately - zero configuration
- ⚠️ Largest file size (389 MB)
- **Use when:** Speed is critical and you want good quality without configuration effort

### ⚖️ Simplest Option: **Q3_K_S**
- ✅ Simple quantization process - zero configuration
- ✅ Smaller than Q3_K_M (366 MB vs 389 MB)
- ⚠️ Worst quality (35.85) and slowest speed (180 tok/s) of the three
- **Use when:** You need a no-setup option and Q3_K_M's extra 23 MB matters

---

## Technical Notes

### Why Q3_HIFI Achieves Best Quality

Q3_HIFI (with proper configuration) achieves the best quality through:

- **IMatrix-guided outlier selection** - uses importance weights to select the most critical outliers
- **Expanded tensor coverage** - applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - preserves precision for the most important weights

This combination allows Q3_HIFI to achieve **31.22 perplexity** - better than Q3_K_M's 31.81.

### Why Q3_HIFI is Slower

Q3_HIFI's unique architecture (6 FP16 outliers per block) requires:

- More memory lookups (scattered access pattern for outlier indices)
- Additional FP16-to-FP32 conversions for outliers
- More complex dequantization logic
- A CPU path with only basic vectorization so far (no mature SIMD/GPU kernels yet)

As the SIMD/GPU kernels for Q3_HIFI mature, speed should improve significantly in future builds.

### Why Q3_K_M is Fastest

Q3_K_M benefits from:

- **CPU_REPACK optimization** - optimized memory layout (91 MiB repack buffer)
- **Mature optimizations** - well-optimized SIMD/GPU kernels
- **Automatic tensor upgrades** - uses Q4_K and Q5_K for critical tensors, reducing computation
- **Efficient block structure** - optimized for speed over extreme compression

### Quantization Configuration

**Q3_HIFI (Optimized):**
```powershell
.\build\bin\Release\llama-quantize.exe `
  --imatrix .\qwen3-0.6b-imatrix.gguf `
  --tensor-type "attn_v=q3_hifi" `
  --tensor-type "attn_q=q3_hifi" `
  --tensor-type "attn_k=q3_hifi" `
  --tensor-type "ffn_down=q3_hifi" `
  --tensor-type "ffn_gate=q3_hifi" `
  --tensor-type "ffn_up=q3_hifi" `
  --tensor-type "attn_output.weight=q4_k" `
  --tensor-type "output.weight=q6_k" `
  --tensor-type ".*=q3_k" `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  Q3_HIFI
```

**Q3_K_M (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_K_M.gguf `
  Q3_K_M
```

**Q3_K_S (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_K_S.gguf `
  Q3_K_S
```

---

## Decision Matrix

| Priority | Recommended Format | Reason |
|----------|-------------------|--------|
| **Quality** | Q3_HIFI | Best perplexity (31.22) |
| **File Size** | Q3_HIFI | Smallest (308 MB) |
| **Speed** | Q3_K_M | Fastest (223 tok/s) |
| **Simplicity** | Q3_K_S or Q3_K_M | Zero configuration |
| **Quality + Size** | Q3_HIFI | Best of both |
| **Speed + Quality** | Q3_K_M | Good balance |
| **Production (Simple)** | Q3_K_M | Fast + good quality |
| **Production (Optimized)** | Q3_HIFI | Best quality + smallest |

---

## Conclusion

**For most users:**

- **Choose Q3_K_M** if speed is your priority and you want good quality without configuration
- **Choose Q3_HIFI** if quality and file size are your priorities and you can invest in setup
- **Choose Q3_K_S** if you want the simplest option and can accept the lowest quality and speed

**For quality-critical applications:** Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MB), though it requires more setup effort and is slower than Q3_K_M.

**For speed-critical applications:** Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration.

---

**Generated:** 2025-12-03
**Test Dataset:** wikitext-2-raw/wiki.test.raw
**Model:** Qwen3-0.6B
**Evaluation Parameters:** `--ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512`
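
---

## Appendix: Sanity-Checking Bits Per Weight

The bits-per-weight column follows directly from file size divided by the model's total weight count. The sketch below back-derives it; the parameter count (~751.5M total weights for Qwen3-0.6B) is an assumption inferred from the reported size/bpw figures themselves, not an official count.

```python
# Back-derive bits-per-weight (bpw) from the reported GGUF file sizes.
# N_PARAMS is an assumed total weight count (~751.5M), inferred from the
# table's own size and bpw figures; treat it as illustrative.
N_PARAMS = 751_500_000

SIZES_MIB = {"Q3_K_S": 366.19, "Q3_K_M": 389.12, "Q3_HIFI": 308.23}

def bits_per_weight(size_mib: float, n_params: int = N_PARAMS) -> float:
    """bpw = total bits in the file / number of weights."""
    return size_mib * 1024 * 1024 * 8 / n_params

for name, size_mib in SIZES_MIB.items():
    print(f"{name}: {bits_per_weight(size_mib):.2f} bpw")
```

If the assumed parameter count is off, the absolute bpw values shift proportionally, but the relative ordering of the three formats is unchanged.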