| # Q3 Quantization Format Comparison Summary | |
| ## Executive Summary | |
| This document compares three 3-bit quantization formats for the Qwen3-0.6B model: **Q3_K_S**, **Q3_K_M**, and **Q3_HIFI**. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters. | |
| --- | |
| ## Performance Metrics | |
| | Metric | Q3_K_S | Q3_K_M | Q3_HIFI | | |
| |--------|--------|--------|---------| | |
| | **Perplexity** | 35.85 ± 0.32 | 31.81 ± 0.29 | **31.22 ± 0.28** ⭐ | | |
| | **File Size** | 366.19 MiB | 389.12 MiB | **308.23 MiB** ⭐ | | |
| | **Bits Per Weight** | 4.09 bpw | 4.34 bpw | **3.44 bpw** ⭐ | | |
| | **Inference Speed** | 179.80 tok/s | **223.44 tok/s** ⭐ | 189.23 tok/s | | |
| | **Memory Usage** | 888 MiB | 911 MiB | **830 MiB** ⭐ | | |
| | **Quality Rank** | 3rd | 2nd | **1st** ⭐ | | |
| | **Size Rank** | 2nd | 3rd | **1st** ⭐ | | |
| | **Speed Rank** | 3rd | **1st** ⭐ | 2nd | | |
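The rank rows can be derived mechanically from the measured rows. The following is a minimal Python sketch, with the numbers copied from the table above; the names `METRICS`, `rank`, and `implied_weights` are illustrative and not part of any tool. It also estimates the weight count implied by each file size and bits-per-weight figure as a rough consistency check. That estimate is approximate because a GGUF file also holds metadata.

```python
# Metrics copied from the Performance Metrics table.
METRICS = {
    "Q3_K_S": {"ppl": 35.85, "size_mib": 366.19, "bpw": 4.09, "tok_s": 179.80},
    "Q3_K_M": {"ppl": 31.81, "size_mib": 389.12, "bpw": 4.34, "tok_s": 223.44},
    "Q3_HIFI": {"ppl": 31.22, "size_mib": 308.23, "bpw": 3.44, "tok_s": 189.23},
}


def rank(metric: str, higher_is_better: bool = False) -> dict:
    """Return {format: rank}, where rank 1 is the best value for `metric`."""
    ordered = sorted(
        METRICS,
        key=lambda name: METRICS[name][metric],
        reverse=higher_is_better,
    )
    return {name: position for position, name in enumerate(ordered, start=1)}


def implied_weights(name: str) -> float:
    """Estimate the weight count: file size in bits divided by bits per weight."""
    size_bits = METRICS[name]["size_mib"] * 1024 * 1024 * 8
    return size_bits / METRICS[name]["bpw"]


if __name__ == "__main__":
    print("quality rank:", rank("ppl"))            # lower perplexity is better
    print("size rank:   ", rank("size_mib"))       # smaller file is better
    print("speed rank:  ", rank("tok_s", higher_is_better=True))
    for name in METRICS:
        print(f"{name}: ~{implied_weights(name) / 1e6:.0f}M weights implied")
```

If the three implied weight counts agree closely, the size and bpw columns are at least consistent with each other.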
| --- | |
| ## Detailed Analysis | |
| ### 1. Q3_K_S (Small) - The Balanced Option | |
| **Perplexity:** 35.85 ± 0.32 | |
| **File Size:** 366.19 MiB (4.09 bpw) | |
| **Speed:** 179.80 tokens/second | |
| #### ✅ Pros: | |
| - **Good balance** between quality, size, and speed | |
| - **Smaller than Q3_K_M** (366 MiB vs 389 MiB) | |
| - **Speed close to Q3_HIFI** (180 tok/s vs 189 tok/s, about 5% slower) | |
| - **Simple quantization** - zero configuration required | |
| - **Automatic tensor upgrades** - uses Q6_K for output.weight | |
| - **Production-ready** - works out of the box | |
| #### ❌ Cons: | |
| - **Worst quality** - 4.63 points worse perplexity than Q3_HIFI, 4.04 points worse than Q3_K_M | |
| - **Not the best in any category** - last in quality and speed, second in file size | |
| - **Lower precision** - fewer automatic upgrades than Q3_K_M | |
| #### 🎯 Best For: | |
| - General-purpose applications where you need a reasonable compromise | |
| - When you want good-enough quality without optimization effort | |
| - Production deployments where simplicity is valued over maximum quality | |
| - Systems where file size matters but you can't invest in optimization | |
| --- | |
| ### 2. Q3_K_M (Medium) - The Speed Champion | |
| **Perplexity:** 31.81 ± 0.29 | |
| **File Size:** 389.12 MiB (4.34 bpw) | |
| **Speed:** 223.44 tokens/second | |
| #### ✅ Pros: | |
| - **Fastest inference** - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S) | |
| - **Good quality** - only 0.59 points worse than Q3_HIFI | |
| - **Automatic tensor upgrades** - uses Q4_K and Q5_K for critical tensors | |
| - **Production-ready** - zero configuration required | |
| - **CPU_REPACK support** - optimized memory layout (91 MiB repack buffer) | |
| - **Best speed-to-quality ratio** - excellent performance for the quality level | |
| #### ❌ Cons: | |
| - **Largest file size** - 389 MiB (26% larger than Q3_HIFI, 6% larger than Q3_K_S) | |
| - **Higher memory usage** - 911 MiB total | |
| - **Not the best quality** - 0.59 points worse than Q3_HIFI | |
| #### 🎯 Best For: | |
| - **Real-time applications** where speed is critical | |
| - **Interactive systems** requiring low latency | |
| - **Production deployments** where speed matters more than file size | |
| - **Systems with sufficient storage** but need maximum throughput | |
| - **When you want good quality without configuration effort** | |
| --- | |
| ### 3. Q3_HIFI (Optimized) - The Quality & Size Champion | |
| **Perplexity:** 31.22 ± 0.28 ⭐ **BEST** | |
| **File Size:** 308.23 MiB (3.44 bpw) ⭐ **SMALLEST** | |
| **Speed:** 189.23 tokens/second | |
| #### ✅ Pros: | |
| - **Best quality** - 31.22 perplexity (0.59 points better than Q3_K_M, 4.63 points better than Q3_K_S) | |
| - **Smallest file size** - 308 MiB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M) | |
| - **Lowest memory usage** - 830 MiB total | |
| - **Best quality-to-size ratio** - highest quality in smallest package | |
| - **IMatrix-guided quantization** - uses importance matrix for optimal outlier selection | |
| - **Expanded tensor coverage** - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up) | |
| - **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight | |
| - **6 FP16 outliers per block** - preserves precision for critical weights | |
| #### ❌ Cons: | |
| - **Slower inference than Q3_K_M** - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S) | |
| - **Requires configuration** - needs IMatrix generation and tensor-type specification | |
| - **More setup effort** - must generate imatrix file and specify quantization strategy | |
| - **Longer quantization time** - IMatrix generation takes 30-60 minutes | |
| #### 🎯 Best For: | |
| - **Quality-critical applications** where accuracy matters most | |
| - **Storage-constrained systems** - mobile devices, embedded systems | |
| - **Offline deployments** where file size is a concern | |
| - **When you can invest in proper quantization setup** | |
| - **Research and development** where quality is the priority | |
| - **Production systems** where quality > speed | |
| --- | |
| ## Head-to-Head Comparisons | |
| ### Quality Comparison | |
| | Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S | | |
| |--------|------------|------------|-----------|-----------| | |
| | **Q3_HIFI** | **31.22** | Baseline | **-0.59** ⭐ | **-4.63** ⭐ | | |
| | Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 | | |
| | Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline | | |
| **Winner:** Q3_HIFI (best quality by 0.59 points over Q3_K_M) | |
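To put the perplexity gaps in context against the reported ± values, the sketch below expresses each gap in units of the combined uncertainty. It assumes the ± figures are independent one-standard-error estimates. That is an assumption: all three runs use the same test text, so the errors are probably correlated and this simple combination is only a rough guide. The function name `gap_in_sigmas` is illustrative.

```python
import math


def gap_in_sigmas(ppl_a: float, err_a: float, ppl_b: float, err_b: float) -> float:
    """Return (ppl_b - ppl_a) divided by the combined error sqrt(err_a^2 + err_b^2).

    Treats the two errors as independent, which is an assumption here.
    """
    return (ppl_b - ppl_a) / math.sqrt(err_a ** 2 + err_b ** 2)


if __name__ == "__main__":
    # Values copied from the Performance Metrics table (lower perplexity is better).
    hifi = (31.22, 0.28)
    k_m = (31.81, 0.29)
    k_s = (35.85, 0.32)
    print(f"Q3_HIFI vs Q3_K_M: {gap_in_sigmas(*hifi, *k_m):.2f} combined errors")
    print(f"Q3_HIFI vs Q3_K_S: {gap_in_sigmas(*hifi, *k_s):.2f} combined errors")
```

Under these assumptions the gap to Q3_K_S is many times the combined error, while the 0.59-point gap to Q3_K_M is only about one and a half times it, so the ordering of the top two formats is less certain than the headline numbers suggest.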
| ### File Size Comparison | |
| | Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S | | |
| |--------|------|------------|-----------|-----------| | |
| | **Q3_HIFI** | **308 MiB** | Baseline | **-81 MiB** ⭐ | **-58 MiB** ⭐ | | |
| | Q3_K_S | 366 MiB | +58 MiB | -23 MiB | Baseline | | |
| | Q3_K_M | 389 MiB | +81 MiB | Baseline | +23 MiB | | |
| **Winner:** Q3_HIFI (smallest by 58-81 MiB) | |
| ### Speed Comparison | |
| | Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S | | |
| |--------|-------|------------|-----------|-----------| | |
| | **Q3_K_M** | **223 tok/s** | **+34 tok/s** ⭐ | Baseline | **+44 tok/s** ⭐ | | |
| | Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s | | |
| | Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline | | |
| **Winner:** Q3_K_M (fastest by 18-24%) | |
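The relative differences between formats follow directly from the measured sizes and speeds. A minimal Python sketch of that arithmetic, using the unrounded values from the Performance Metrics table; the helper names `pct_diff` and `pairwise` are illustrative:

```python
# Measured values copied from the Performance Metrics table.
SIZE_MIB = {"Q3_K_S": 366.19, "Q3_K_M": 389.12, "Q3_HIFI": 308.23}
TOK_S = {"Q3_K_S": 179.80, "Q3_K_M": 223.44, "Q3_HIFI": 189.23}


def pct_diff(value: float, baseline: float) -> float:
    """Percentage by which `value` differs from `baseline` (negative means lower)."""
    return (value / baseline - 1.0) * 100.0


def pairwise(values: dict) -> dict:
    """Return {(format, baseline): percent difference} for every ordered pair."""
    return {
        (a, b): pct_diff(values[a], values[b])
        for a in values
        for b in values
        if a != b
    }


if __name__ == "__main__":
    for label, values in (("size", SIZE_MIB), ("speed", TOK_S)):
        for (a, b), diff in pairwise(values).items():
            print(f"{label}: {a} vs {b}: {diff:+.1f}%")
```

Note that "X% faster" and "X% slower" are not symmetric: the percentage depends on which format is taken as the baseline.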
| --- | |
| ## Recommendations | |
| ### 🎯 Best Overall: **Q3_HIFI** (With IMatrix + Expanded Coverage) | |
| - ✅ **Best quality** (31.22) - beats Q3_K_M by 0.59 points | |
| - ✅ **Smallest file size** (308 MiB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M | |
| - ✅ **Best quality-to-size ratio** - best quality in smallest package | |
| - ⚠️ Requires IMatrix + tensor-type configuration | |
| - ⚠️ Slower inference (189 tok/s vs 223 tok/s) | |
| - **Use when:** You want the best quality in the smallest file and can invest in proper quantization setup | |
| ### ⚡ Best Speed: **Q3_K_M** (Out of the Box) | |
| - ✅ **Fastest inference** (223 tok/s) - 18% faster than Q3_HIFI | |
| - ✅ **Good quality** (31.81) - only 0.59 points worse than Q3_HIFI | |
| - ✅ Automatic tensor upgrades | |
| - ✅ Production-ready immediately - zero configuration | |
| - ⚠️ Largest file size (389 MiB) | |
| - **Use when:** Speed is critical and you want good quality without configuration effort | |
| ### ⚖️ Best Balance: **Q3_K_S** | |
| - ✅ Good middle ground - reasonable quality (35.85) and speed (180 tok/s) | |
| - ✅ Smaller than Q3_K_M (366 MiB vs 389 MiB) | |
| - ✅ Simple quantization process - zero configuration | |
| - ⚠️ Not the best in any category | |
| - **Use when:** You need a balanced compromise without optimization effort | |
| --- | |
| ## Technical Notes | |
| ### Why Q3_HIFI Achieves Best Quality | |
| Q3_HIFI (with proper configuration) achieves the best quality through: | |
| - **IMatrix-guided outlier selection** - Uses importance weights to select the most critical outliers | |
| - **Expanded tensor coverage** - Applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up) | |
| - **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight | |
| - **6 FP16 outliers per block** - Preserves precision for the most important weights | |
| This combination allows Q3_HIFI to achieve **31.22 perplexity** - better than Q3_K_M's 31.81. | |
| ### Why Q3_HIFI is Slower | |
| Q3_HIFI's unique architecture (6 FP16 outliers per block) requires: | |
| - More memory lookups (scattered access pattern for outlier indices) | |
| - Additional FP16-to-FP32 conversions for outliers | |
| - More complex dequantization logic | |
| - In the build tested here, limited to CPU with basic vectorization (no dedicated GPU/SIMD kernels for this format) | |
| SIMD/GPU optimizations have recently been implemented, so speed is expected to improve in future builds. | |
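The extra work described above can be illustrated with a toy dequantization routine. This is a sketch only and not the actual Q3_HIFI block layout, which this summary does not specify: the single per-block scale, the signed integer codes, and the names `dequantize_block` and `fp16_to_float` are all assumptions made for illustration.

```python
import struct


def fp16_to_float(raw: bytes) -> float:
    """Decode a 2-byte little-endian IEEE 754 half-precision value."""
    return struct.unpack("<e", raw)[0]


def dequantize_block(scale, codes, outlier_idx, outlier_fp16):
    """Dequantize one hypothetical block.

    scale        -- per-block scale factor (assumed layout)
    codes        -- low-bit integer codes, one per weight (assumed signed)
    outlier_idx  -- positions in the block stored at higher precision
    outlier_fp16 -- the FP16 bytes for those positions, in the same order
    """
    # Dense pass: sequential access, easy to vectorize.
    weights = [scale * code for code in codes]
    # Sparse pass: one indexed write and one FP16 conversion per outlier.
    # This scattered access is the extra cost the list above refers to.
    for idx, raw in zip(outlier_idx, outlier_fp16):
        weights[idx] = fp16_to_float(raw)
    return weights


if __name__ == "__main__":
    block = dequantize_block(
        scale=0.5,
        codes=[1, -2, 3, 0],
        outlier_idx=[2],
        outlier_fp16=[struct.pack("<e", 1.5)],
    )
    print(block)
```

The dense pass maps well onto SIMD lanes, whereas the outlier pass needs per-element indexing, which is one plausible reason an unoptimized kernel would trail the K-quant kernels.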
| ### Why Q3_K_M is Fastest | |
| Q3_K_M benefits from: | |
| - **CPU_REPACK optimization** - Optimized memory layout (91 MiB repack buffer) | |
| - **Mature optimizations** - Well-optimized SIMD/GPU kernels | |
| - **Automatic tensor upgrades** - Uses Q4_K and Q5_K for critical tensors, reducing computation | |
| - **Efficient block structure** - Optimized for speed over extreme compression | |
| ### Quantization Configuration | |
| **Q3_HIFI (Optimized):** | |
| ```powershell | |
| .\build\bin\Release\llama-quantize.exe ` | |
| --imatrix .\qwen3-0.6b-imatrix.gguf ` | |
| --tensor-type "attn_v=q3_hifi" ` | |
| --tensor-type "attn_q=q3_hifi" ` | |
| --tensor-type "attn_k=q3_hifi" ` | |
| --tensor-type "ffn_down=q3_hifi" ` | |
| --tensor-type "ffn_gate=q3_hifi" ` | |
| --tensor-type "ffn_up=q3_hifi" ` | |
| --tensor-type "attn_output.weight=q4_k" ` | |
| --tensor-type "output.weight=q6_k" ` | |
| --tensor-type ".*=q3_k" ` | |
| .\Qwen3-0.6B-f16.gguf ` | |
| .\Qwen3-0.6B-f16-Q3_HIFI.gguf ` | |
| Q3_HIFI | |
| ``` | |
| **Q3_K_M (Simple):** | |
| ```powershell | |
| .\build\bin\Release\llama-quantize.exe ` | |
| .\Qwen3-0.6B-f16.gguf ` | |
| .\Qwen3-0.6B-f16-Q3_K_M.gguf ` | |
| Q3_K_M | |
| ``` | |
| **Q3_K_S (Simple):** | |
| ```powershell | |
| .\build\bin\Release\llama-quantize.exe ` | |
| .\Qwen3-0.6B-f16.gguf ` | |
| .\Qwen3-0.6B-f16-Q3_K_S.gguf ` | |
| Q3_K_S | |
| ``` | |
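When the same quantization has to be repeated for several models, the argument list can be assembled programmatically. The following Python sketch builds the same arguments as the commands above. The helper `build_quantize_command` and the dictionary `HIFI_TENSOR_TYPES` are hypothetical names introduced here; only the `--imatrix` and `--tensor-type` flags and the positional order (input, output, type) are taken from the commands in this section.

```python
# Per-tensor overrides mirroring the Q3_HIFI command in this section.
# Insertion order is preserved, so flags are emitted in the same order.
HIFI_TENSOR_TYPES = {
    "attn_v": "q3_hifi",
    "attn_q": "q3_hifi",
    "attn_k": "q3_hifi",
    "ffn_down": "q3_hifi",
    "ffn_gate": "q3_hifi",
    "ffn_up": "q3_hifi",
    "attn_output.weight": "q4_k",
    "output.weight": "q6_k",
    ".*": "q3_k",
}


def build_quantize_command(exe, src, dst, ftype, imatrix=None, tensor_types=None):
    """Return the llama-quantize argument list as used in this document."""
    cmd = [exe]
    if imatrix is not None:
        cmd += ["--imatrix", imatrix]
    for pattern, qtype in (tensor_types or {}).items():
        cmd += ["--tensor-type", f"{pattern}={qtype}"]
    # Positional arguments come last: input file, output file, target type.
    cmd += [src, dst, ftype]
    return cmd


if __name__ == "__main__":
    cmd = build_quantize_command(
        r".\build\bin\Release\llama-quantize.exe",
        r".\Qwen3-0.6B-f16.gguf",
        r".\Qwen3-0.6B-f16-Q3_HIFI.gguf",
        "Q3_HIFI",
        imatrix=r".\qwen3-0.6b-imatrix.gguf",
        tensor_types=HIFI_TENSOR_TYPES,
    )
    print(" ".join(cmd))
    # To execute it: import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with patterns such as `.*=q3_k`.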
| --- | |
| ## Decision Matrix | |
| | Priority | Recommended Format | Reason | | |
| |----------|-------------------|--------| | |
| | **Quality** | Q3_HIFI | Best perplexity (31.22) | | |
| | **File Size** | Q3_HIFI | Smallest (308 MiB) | | |
| | **Speed** | Q3_K_M | Fastest (223 tok/s) | | |
| | **Simplicity** | Q3_K_S or Q3_K_M | Zero configuration | | |
| | **Quality + Size** | Q3_HIFI | Best of both | | |
| | **Speed + Quality** | Q3_K_M | Good balance | | |
| | **Production (Simple)** | Q3_K_M | Fast + good quality | | |
| | **Production (Optimized)** | Q3_HIFI | Best quality + smallest | | |
| --- | |
| ## Conclusion | |
| **For most users:** | |
| - **Choose Q3_K_M** if speed is your priority and you want good quality without configuration | |
| - **Choose Q3_HIFI** if quality and file size are your priorities and you can invest in setup | |
| - **Choose Q3_K_S** if you want a simple, balanced option | |
| **For quality-critical applications:** | |
| Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MiB), though it requires more setup effort and runs about 15% slower than Q3_K_M. | |
| **For speed-critical applications:** | |
| Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration. | |
| --- | |
| **Generated:** 2025-12-03 | |
| **Test Dataset:** wikitext-2-raw/wiki.test.raw | |
| **Model:** Qwen3-0.6B | |
| **Evaluation Parameters:** --ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512 | |