Q3 Quantization Format Comparison Summary
Executive Summary
This document compares three 3-bit quantization formats for the Qwen3-0.6B model: Q3_K_S, Q3_K_M, and Q3_HIFI. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters.
Performance Metrics
| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|---|---|---|---|
| Perplexity | 35.85 ± 0.32 | 31.81 ± 0.29 | 31.22 ± 0.28 ✅ |
| File Size | 366.19 MiB | 389.12 MiB | 308.23 MiB ✅ |
| Bits Per Weight | 4.09 bpw | 4.34 bpw | 3.44 bpw ✅ |
| Inference Speed | 179.80 tok/s | 223.44 tok/s ✅ | 189.23 tok/s |
| Memory Usage | 888 MiB | 911 MiB | 830 MiB ✅ |
| Quality Rank | 3rd | 2nd | 1st ✅ |
| Size Rank | 2nd | 3rd | 1st ✅ |
| Speed Rank | 3rd | 1st ✅ | 2nd |
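The file-size and bits-per-weight rows can be cross-checked against each other. The sketch below assumes bpw is simply the file size in bits divided by the parameter count, ignoring metadata overhead. The helper name `implied_params` is illustrative and does not come from any tool.

```python
MIB = 1024 * 1024  # bytes per MiB

# (file size in MiB, bits per weight) copied from the table above
FORMATS = {
    "Q3_K_S": (366.19, 4.09),
    "Q3_K_M": (389.12, 4.34),
    "Q3_HIFI": (308.23, 3.44),
}

def implied_params(size_mib: float, bpw: float) -> float:
    """Parameter count implied by a file size and a bits-per-weight figure."""
    return size_mib * MIB * 8 / bpw

for name, (size_mib, bpw) in FORMATS.items():
    print(f"{name}: ~{implied_params(size_mib, bpw) / 1e6:.0f}M parameters")
```

All three rows imply roughly 0.75 billion parameters, so the size and bpw columns are consistent with one another.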
Detailed Analysis
1. Q3_K_S (Small) - The Balanced Option
Perplexity: 35.85 ± 0.32
File Size: 366.19 MiB (4.09 bpw)
Speed: 179.80 tokens/second
✅ Pros:
- Good balance between quality, size, and speed
- Smaller than Q3_K_M (366 MiB vs 389 MiB)
- Simple quantization - zero configuration required
- Automatic tensor upgrades - uses Q6_K for output.weight
- Production-ready - works out of the box
❌ Cons:
- Worst quality - 4.63 points worse perplexity than Q3_HIFI, 4.04 worse than Q3_K_M
- Not the best in any category - lowest quality and slowest inference (5% slower than Q3_HIFI), mid-sized file
- Lower precision - fewer automatic upgrades than Q3_K_M
🎯 Best For:
- General-purpose applications where you need a reasonable compromise
- When you want good-enough quality without optimization effort
- Production deployments where simplicity is valued over maximum quality
- Systems where file size matters but you can't invest in optimization
2. Q3_K_M (Medium) - The Speed Champion
Perplexity: 31.81 ± 0.29
File Size: 389.12 MiB (4.34 bpw)
Speed: 223.44 tokens/second
✅ Pros:
- Fastest inference - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
- Good quality - only 0.59 points worse than Q3_HIFI
- Automatic tensor upgrades - uses Q4_K and Q5_K for critical tensors
- Production-ready - zero configuration required
- CPU_REPACK support - optimized memory layout (91 MiB repack buffer)
- Best speed-to-quality ratio - excellent performance for the quality level
❌ Cons:
- Largest file size - 389 MiB (26% larger than Q3_HIFI, 6% larger than Q3_K_S)
- Higher memory usage - 911 MiB total
- Not the best quality - 0.59 points worse than Q3_HIFI
🎯 Best For:
- Real-time applications where speed is critical
- Interactive systems requiring low latency
- Production deployments where speed matters more than file size
- Systems with sufficient storage that need maximum throughput
- When you want good quality without configuration effort
3. Q3_HIFI (Optimized) - The Quality & Size Champion
Perplexity: 31.22 ± 0.28 ✅ BEST
File Size: 308.23 MiB (3.44 bpw) ✅ SMALLEST
Speed: 189.23 tokens/second
✅ Pros:
- Best quality - 31.22 perplexity (0.59 points better than Q3_K_M, 4.63 points better than Q3_K_S)
- Smallest file size - 308 MiB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M)
- Lowest memory usage - 830 MiB total
- Best quality-to-size ratio - highest quality in smallest package
- IMatrix-guided quantization - uses importance matrix for optimal outlier selection
- Expanded tensor coverage - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- Automatic upgrades - Q6_K for output.weight, Q4_K for attn_output.weight
- 6 FP16 outliers per block - preserves precision for critical weights
❌ Cons:
- Slower inference than Q3_K_M - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S)
- Requires configuration - needs IMatrix generation and tensor-type specification
- More setup effort - must generate imatrix file and specify quantization strategy
- Longer quantization time - IMatrix generation takes 30-60 minutes
🎯 Best For:
- Quality-critical applications where accuracy matters most
- Storage-constrained systems - mobile devices, embedded systems
- Offline deployments where file size is a concern
- When you can invest in proper quantization setup
- Research and development where quality is the priority
- Production systems where quality > speed
Head-to-Head Comparisons
Quality Comparison
| Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_HIFI | 31.22 | Baseline | -0.59 ✅ | -4.63 ✅ |
| Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 |
| Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline |
Winner: Q3_HIFI (best quality by 0.59 points over Q3_K_M)
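To put the 0.59-point gap in context, it helps to compare it with the reported ± values. The sketch below treats those values as independent standard errors and combines them in quadrature. That is an assumption: all three models were scored on the same text, so the errors are probably correlated and a paired comparison could give a different answer. The function name `gap_in_sigmas` is illustrative.

```python
import math

def gap_in_sigmas(ppl_a: float, err_a: float, ppl_b: float, err_b: float) -> float:
    """Perplexity gap divided by the combined uncertainty (errors added in quadrature)."""
    return abs(ppl_a - ppl_b) / math.hypot(err_a, err_b)

hifi_vs_k_m = gap_in_sigmas(31.22, 0.28, 31.81, 0.29)
hifi_vs_k_s = gap_in_sigmas(31.22, 0.28, 35.85, 0.32)

print(f"Q3_HIFI vs Q3_K_M: {hifi_vs_k_m:.1f} sigma")
print(f"Q3_HIFI vs Q3_K_S: {hifi_vs_k_s:.1f} sigma")
```

Under this assumption the lead over Q3_K_M is about 1.5 combined standard errors, while the lead over Q3_K_S is far outside the error bars.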
File Size Comparison
| Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_HIFI | 308 MiB | Baseline | -81 MiB ✅ | -58 MiB ✅ |
| Q3_K_S | 366 MiB | +58 MiB | -23 MiB | Baseline |
| Q3_K_M | 389 MiB | +81 MiB | Baseline | +23 MiB |
Winner: Q3_HIFI (smallest by 58-81 MiB)
Speed Comparison
| Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_K_M | 223 tok/s | +34 tok/s ✅ | Baseline | +44 tok/s ✅ |
| Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s |
| Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline |
Winner: Q3_K_M (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
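Relative speed figures depend on which format is taken as the baseline, so "X% faster" and "X% slower" are not interchangeable. A small sketch using the measured throughputs from the table (the helper `percent_diff` is illustrative):

```python
SPEED = {"Q3_K_S": 179.80, "Q3_K_M": 223.44, "Q3_HIFI": 189.23}  # tok/s

def percent_diff(value: float, reference: float) -> float:
    """Signed difference of value relative to reference, in percent."""
    return (value - reference) / reference * 100

for other in ("Q3_HIFI", "Q3_K_S"):
    faster = percent_diff(SPEED["Q3_K_M"], SPEED[other])   # baseline: the slower format
    slower = -percent_diff(SPEED[other], SPEED["Q3_K_M"])  # baseline: Q3_K_M
    print(f"Q3_K_M is {faster:.0f}% faster than {other}; "
          f"{other} is {slower:.0f}% slower than Q3_K_M")
```

The same 34 tok/s gap reads as 18% when measured from Q3_HIFI and as 15% when measured from Q3_K_M.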
Recommendations
🎯 Best Overall: Q3_HIFI (With IMatrix + Expanded Coverage)
- ✅ Best quality (31.22) - beats Q3_K_M by 0.59 points
- ✅ Smallest file size (308 MiB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
- ✅ Best quality-to-size ratio - best quality in smallest package
- ⚠️ Requires IMatrix + tensor-type configuration
- ⚠️ Slower inference (189 tok/s vs 223 tok/s)
- Use when: You want the best quality in the smallest file and can invest in proper quantization setup
⚡ Best Speed: Q3_K_M (Out of the Box)
- ✅ Fastest inference (223 tok/s) - 18% faster than Q3_HIFI
- ✅ Good quality (31.81) - only 0.59 points worse than Q3_HIFI
- ✅ Automatic tensor upgrades
- ✅ Production-ready immediately - zero configuration
- ⚠️ Largest file size (389 MiB)
- Use when: Speed is critical and you want good quality without configuration effort
⚖️ Best Balance: Q3_K_S
- ✅ Good middle ground - reasonable quality (35.85) and speed (180 tok/s)
- ✅ Smaller than Q3_K_M (366 MiB vs 389 MiB)
- ✅ Simple quantization process - zero configuration
- ⚠️ Not the best in any category
- Use when: You need a balanced compromise without optimization effort
Technical Notes
Why Q3_HIFI Achieves Best Quality
Q3_HIFI (with proper configuration) achieves the best quality through:
- IMatrix-guided outlier selection - Uses importance weights to select the most critical outliers
- Expanded tensor coverage - Applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- Automatic upgrades - Q6_K for output.weight, Q4_K for attn_output.weight
- 6 FP16 outliers per block - Preserves precision for the most important weights
This combination allows Q3_HIFI to achieve 31.22 perplexity - better than Q3_K_M's 31.81.
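The idea behind importance-guided outlier selection can be illustrated with a toy example. This is only a sketch: it assumes each weight is scored as importance × weight², which is a common way to use an importance matrix but is not confirmed here as the rule Q3_HIFI applies. The function `select_outliers` and the sample values are hypothetical.

```python
def select_outliers(weights, importance, k=6):
    """Indices of the k weights with the largest importance-weighted squared magnitude."""
    scores = [imp * w * w for w, imp in zip(weights, importance)]
    ranked = sorted(range(len(weights)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

weights = [0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -1.1]
uniform = [1.0] * len(weights)                       # no imatrix: pure magnitude ranking
skewed = [1.0, 0.01, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # weight 1 rarely affects activations

print(select_outliers(weights, uniform, k=2))
print(select_outliers(weights, skewed, k=2))
```

With uniform importance the two largest-magnitude weights are kept. With the skewed importance, the largest weight is dropped from the outlier set because it contributes little to the output, and the slot goes to a weight that matters more.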
Why Q3_HIFI is Slower
Q3_HIFI's unique architecture (6 FP16 outliers per block) requires:
- More memory lookups (scattered access pattern for outlier indices)
- Additional FP16-to-FP32 conversions for outliers
- More complex dequantization logic
- No SIMD/GPU kernels in the build tested here (CPU only, with basic vectorization)
However, these results predate the recently implemented SIMD/GPU optimizations, so speed should improve in future builds.
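The extra dequantization work can be seen in a toy model of an outlier-preserving block. Everything about the layout below is an assumption made for illustration (one scale per block, signed 3-bit codes, outliers stored as index/FP16 pairs); the actual Q3_HIFI block format is not described in this document.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_block(weights, outlier_idx):
    """Toy layout: one scale, a 3-bit code per weight, FP16 copies of the outliers."""
    outlier_idx = set(outlier_idx)
    inlier_max = max((abs(w) for i, w in enumerate(weights) if i not in outlier_idx),
                     default=0.0)
    scale = inlier_max / 3 or 1.0  # codes in [-3, 3] fit in 3 bits
    codes = [0 if i in outlier_idx else max(-3, min(3, round(w / scale)))
             for i, w in enumerate(weights)]
    outliers = {i: to_fp16(weights[i]) for i in sorted(outlier_idx)}
    return scale, codes, outliers

def dequantize_block(scale, codes, outliers):
    out = [c * scale for c in codes]   # dense pass over every weight
    for i, value in outliers.items():  # extra scattered writes, one per outlier
        out[i] = value
    return out

def max_error(weights, outlier_idx):
    restored = dequantize_block(*quantize_block(weights, outlier_idx))
    return max(abs(w - r) for w, r in zip(weights, restored))

weights = [0.1, -2.0, 0.3, 1.5, -0.2, 0.05, 0.9, -1.1]
print(f"no outliers:  max error {max_error(weights, []):.3f}")
print(f"two outliers: max error {max_error(weights, [1, 3]):.3f}")
```

Pulling the two largest weights out of the block shrinks the scale for the remaining weights and lowers the worst-case error, at the cost of a second, index-driven pass in `dequantize_block`. That second pass is the scattered access the list above refers to.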
Why Q3_K_M is Fastest
Q3_K_M benefits from:
- CPU_REPACK optimization - Optimized memory layout (91 MiB repack buffer)
- Mature optimizations - Well-optimized SIMD/GPU kernels
- Automatic tensor upgrades - Uses Q4_K and Q5_K for critical tensors, reducing computation
- Efficient block structure - Optimized for speed over extreme compression
Quantization Configuration
Q3_HIFI (Optimized):
.\build\bin\Release\llama-quantize.exe `
--imatrix .\qwen3-0.6b-imatrix.gguf `
--tensor-type "attn_v=q3_hifi" `
--tensor-type "attn_q=q3_hifi" `
--tensor-type "attn_k=q3_hifi" `
--tensor-type "ffn_down=q3_hifi" `
--tensor-type "ffn_gate=q3_hifi" `
--tensor-type "ffn_up=q3_hifi" `
--tensor-type "attn_output.weight=q4_k" `
--tensor-type "output.weight=q6_k" `
--tensor-type ".*=q3_k" `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_HIFI.gguf `
Q3_HIFI
Q3_K_M (Simple):
.\build\bin\Release\llama-quantize.exe `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_K_M.gguf `
Q3_K_M
Q3_K_S (Simple):
.\build\bin\Release\llama-quantize.exe `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_K_S.gguf `
Q3_K_S
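For scripted runs, the three commands above can be generated from one helper. The sketch below only assembles argument lists from the flags and paths already shown (`--imatrix`, `--tensor-type`, input, output, type) and does not invoke anything. The function name `build_quantize_args` is illustrative.

```python
# Tensor-type overrides copied verbatim from the Q3_HIFI command above.
HIFI_TENSOR_TYPES = {
    "attn_v": "q3_hifi",
    "attn_q": "q3_hifi",
    "attn_k": "q3_hifi",
    "ffn_down": "q3_hifi",
    "ffn_gate": "q3_hifi",
    "ffn_up": "q3_hifi",
    "attn_output.weight": "q4_k",
    "output.weight": "q6_k",
    ".*": "q3_k",
}

def build_quantize_args(exe, src, dst, ftype, imatrix=None, tensor_types=None):
    """Assemble the llama-quantize argument list in the order used above."""
    args = [exe]
    if imatrix is not None:
        args += ["--imatrix", imatrix]
    for pattern, qtype in (tensor_types or {}).items():
        args += ["--tensor-type", f"{pattern}={qtype}"]
    return args + [src, dst, ftype]

EXE = r".\build\bin\Release\llama-quantize.exe"
SRC = r".\Qwen3-0.6B-f16.gguf"

hifi_args = build_quantize_args(
    EXE, SRC, r".\Qwen3-0.6B-f16-Q3_HIFI.gguf", "Q3_HIFI",
    imatrix=r".\qwen3-0.6b-imatrix.gguf", tensor_types=HIFI_TENSOR_TYPES,
)
simple_args = build_quantize_args(EXE, SRC, r".\Qwen3-0.6B-f16-Q3_K_M.gguf", "Q3_K_M")

print(" ".join(hifi_args))
print(" ".join(simple_args))
```

Either list can be passed to `subprocess.run(args, check=True)` to perform the quantization.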
Decision Matrix
| Priority | Recommended Format | Reason |
|---|---|---|
| Quality | Q3_HIFI | Best perplexity (31.22) |
| File Size | Q3_HIFI | Smallest (308 MiB) |
| Speed | Q3_K_M | Fastest (223 tok/s) |
| Simplicity | Q3_K_S or Q3_K_M | Zero configuration |
| Quality + Size | Q3_HIFI | Best of both |
| Speed + Quality | Q3_K_M | Good balance |
| Production (Simple) | Q3_K_M | Fast + good quality |
| Production (Optimized) | Q3_HIFI | Best quality + smallest |
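The single-metric rows of the matrix follow directly from the measurements. A minimal sketch that picks the winner for one metric (the dictionary keys and the function `best_format` are illustrative names):

```python
METRICS = {
    "Q3_K_S": {"perplexity": 35.85, "size_mib": 366.19, "speed_tok_s": 179.80},
    "Q3_K_M": {"perplexity": 31.81, "size_mib": 389.12, "speed_tok_s": 223.44},
    "Q3_HIFI": {"perplexity": 31.22, "size_mib": 308.23, "speed_tok_s": 189.23},
}

LOWER_IS_BETTER = {"perplexity", "size_mib"}

def best_format(metric: str) -> str:
    """Name of the format that wins on a single metric."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(METRICS, key=lambda name: METRICS[name][metric])

for metric in ("perplexity", "size_mib", "speed_tok_s"):
    print(f"{metric}: {best_format(metric)}")
```

Rows that combine priorities, such as "Speed + Quality", involve a judgment about trade-offs that this lookup does not capture.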
Conclusion
For most users:
- Choose Q3_K_M if speed is your priority and you want good quality without configuration
- Choose Q3_HIFI if quality and file size are your priorities and you can invest in setup
- Choose Q3_K_S if you want a simple, balanced option
For quality-critical applications: Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MiB), though it requires more setup effort and runs about 15% slower than Q3_K_M.
For speed-critical applications: Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration.
Generated: 2025-12-03
Test Dataset: wikitext-2-raw/wiki.test.raw
Model: Qwen3-0.6B
Evaluation Parameters: --ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512