
Q3 Quantization Format Comparison Summary

Executive Summary

This document compares three 3-bit quantization formats for the Qwen3-0.6B model: Q3_K_S, Q3_K_M, and Q3_HIFI. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters.
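
For reference, the perplexity figures below were presumably produced with llama.cpp's llama-perplexity tool (the evaluation parameters listed at the end of this document match its flags). A minimal sketch of such a run is shown here; the model path is illustrative only:

```powershell
# Measure perplexity on wikitext-2-raw with the parameters used for this comparison
.\build\bin\Release\llama-perplexity.exe `
  -m .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  -f .\wikitext-2-raw\wiki.test.raw `
  --ppl-stride 0 --ppl-output-type 0 `
  -b 2048 -c 512
```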


Performance Metrics

| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|---|---|---|---|
| Perplexity | 35.85 ± 0.32 | 31.81 ± 0.29 | 31.22 ± 0.28 ⭐ |
| File Size | 366.19 MiB | 389.12 MiB | 308.23 MiB ⭐ |
| Bits Per Weight | 4.09 bpw | 4.34 bpw | 3.44 bpw ⭐ |
| Inference Speed | 179.80 tok/s | 223.44 tok/s ⭐ | 189.23 tok/s |
| Memory Usage | 888 MiB | 911 MiB | 830 MiB ⭐ |
| Quality Rank | 3rd | 2nd | 1st ⭐ |
| Size Rank | 2nd | 3rd | 1st ⭐ |
| Speed Rank | 3rd | 1st ⭐ | 2nd |

Detailed Analysis

1. Q3_K_S (Small) - The Balanced Option

Perplexity: 35.85 ± 0.32
File Size: 366.19 MiB (4.09 bpw)
Speed: 179.80 tokens/second

✅ Pros:

  • Good balance between quality, size, and speed
  • Smaller than Q3_K_M (366 MB vs 389 MB)
  • Nearly as fast as Q3_HIFI (180 tok/s vs 189 tok/s)
  • Simple quantization - zero configuration required
  • Automatic tensor upgrades - uses Q6_K for output.weight
  • Production-ready - works out of the box

❌ Cons:

  • Worst quality - 4.6 points worse perplexity than Q3_HIFI
  • Not the best in any category - and the slowest of the three formats
  • Lower precision - fewer automatic upgrades than Q3_K_M

🎯 Best For:

  • General-purpose applications where you need a reasonable compromise
  • When you want good-enough quality without optimization effort
  • Production deployments where simplicity is valued over maximum quality
  • Systems where file size matters but you can't invest in optimization

2. Q3_K_M (Medium) - The Speed Champion

Perplexity: 31.81 ± 0.29
File Size: 389.12 MiB (4.34 bpw)
Speed: 223.44 tokens/second

✅ Pros:

  • Fastest inference - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
  • Good quality - only 0.59 points worse than Q3_HIFI
  • Automatic tensor upgrades - uses Q4_K and Q5_K for critical tensors
  • Production-ready - zero configuration required
  • CPU_REPACK support - optimized memory layout (91 MiB repack buffer)
  • Best speed-to-quality ratio - excellent performance for the quality level

❌ Cons:

  • Largest file size - 389 MB (26% larger than Q3_HIFI, 6% larger than Q3_K_S)
  • Higher memory usage - 911 MiB total
  • Not the best quality - 0.59 points worse than Q3_HIFI

🎯 Best For:

  • Real-time applications where speed is critical
  • Interactive systems requiring low latency
  • Production deployments where speed matters more than file size
  • Systems with sufficient storage but need maximum throughput
  • When you want good quality without configuration effort

3. Q3_HIFI (Optimized) - The Quality & Size Champion

Perplexity: 31.22 ± 0.28 ⭐ BEST
File Size: 308.23 MiB (3.44 bpw) ⭐ SMALLEST
Speed: 189.23 tokens/second

✅ Pros:

  • Best quality - 31.22 perplexity (0.59 points better than Q3_K_M, 4.6 points better than Q3_K_S)
  • Smallest file size - 308 MB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M)
  • Lowest memory usage - 830 MiB total
  • Best quality-to-size ratio - highest quality in smallest package
  • IMatrix-guided quantization - uses importance matrix for optimal outlier selection
  • Expanded tensor coverage - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
  • Automatic upgrades - Q6_K for output.weight, Q4_K for attn_output.weight
  • 6 FP16 outliers per block - preserves precision for critical weights

❌ Cons:

  • Slower inference - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S)
  • Requires configuration - needs IMatrix generation and tensor-type specification
  • More setup effort - must generate imatrix file and specify quantization strategy
  • Longer quantization time - IMatrix generation takes 30-60 minutes

🎯 Best For:

  • Quality-critical applications where accuracy matters most
  • Storage-constrained systems - mobile devices, embedded systems
  • Offline deployments where file size is a concern
  • When you can invest in proper quantization setup
  • Research and development where quality is the priority
  • Production systems where quality > speed

Head-to-Head Comparisons

Quality Comparison

| Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_HIFI | 31.22 | Baseline | -0.59 ⭐ | -4.63 ⭐ |
| Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 |
| Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline |

Winner: Q3_HIFI (best quality by 0.59 points over Q3_K_M)

File Size Comparison

| Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_HIFI | 308 MB | Baseline | -81 MB ⭐ | -58 MB ⭐ |
| Q3_K_S | 366 MB | +58 MB | -23 MB | Baseline |
| Q3_K_M | 389 MB | +81 MB | Baseline | +23 MB |

Winner: Q3_HIFI (smallest by 58-81 MB)

Speed Comparison

| Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_K_M | 223 tok/s | +34 tok/s ⭐ | Baseline | +44 tok/s ⭐ |
| Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s |
| Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline |

Winner: Q3_K_M (fastest by 18-24%)
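
The throughput numbers can be cross-checked with llama.cpp's llama-bench tool. The sketch below is illustrative only (model path, prompt/generation lengths, and thread count are assumptions, and this is not necessarily how the figures above were measured):

```powershell
# Benchmark prompt processing and token generation speed for one quantized model
.\build\bin\Release\llama-bench.exe `
  -m .\Qwen3-0.6B-f16-Q3_K_M.gguf `
  -p 512 -n 128 -t 8
```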


Recommendations

🎯 Best Overall: Q3_HIFI (With IMatrix + Expanded Coverage)

  • ✅ Best quality (31.22) - beats Q3_K_M by 0.59 points
  • ✅ Smallest file size (308 MB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
  • ✅ Best quality-to-size ratio - best quality in smallest package
  • ⚠️ Requires IMatrix + tensor-type configuration
  • ⚠️ Slower inference (189 tok/s vs 223 tok/s)
  • Use when: You want the best quality in the smallest file and can invest in proper quantization setup

⚡ Best Speed: Q3_K_M (Out of the Box)

  • ✅ Fastest inference (223 tok/s) - 18% faster than Q3_HIFI
  • ✅ Good quality (31.81) - only 0.59 points worse than Q3_HIFI
  • ✅ Automatic tensor upgrades
  • ✅ Production-ready immediately - zero configuration
  • ⚠️ Largest file size (389 MB)
  • Use when: Speed is critical and you want good quality without configuration effort

⚖️ Best Balance: Q3_K_S

  • ✅ Good middle ground - reasonable quality (35.85) and speed (180 tok/s)
  • ✅ Smaller than Q3_K_M (366 MB vs 389 MB)
  • ✅ Simple quantization process - zero configuration
  • ⚠️ Not the best in any category
  • Use when: You need a balanced compromise without optimization effort

Technical Notes

Why Q3_HIFI Achieves Best Quality

Q3_HIFI (with proper configuration) achieves the best quality through:

  • IMatrix-guided outlier selection - Uses importance weights to select the most critical outliers
  • Expanded tensor coverage - Applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
  • Automatic upgrades - Q6_K for output.weight, Q4_K for attn_output.weight
  • 6 FP16 outliers per block - Preserves precision for the most important weights

This combination allows Q3_HIFI to achieve 31.22 perplexity - better than Q3_K_M's 31.81.

Why Q3_HIFI is Slower

Q3_HIFI's unique architecture (6 FP16 outliers per block) requires:

  • More memory lookups (scattered access pattern for outlier indices)
  • Additional FP16-to-FP32 conversions for outliers
  • More complex dequantization logic
  • Currently limited to CPU with basic vectorization (no GPU/SIMD optimizations yet)

However, with SIMD/GPU optimizations recently implemented, speed should improve significantly in future builds.

Why Q3_K_M is Fastest

Q3_K_M benefits from:

  • CPU_REPACK optimization - Optimized memory layout (91 MiB repack buffer)
  • Mature optimizations - Well-optimized SIMD/GPU kernels
  • Automatic tensor upgrades - Uses Q4_K and Q5_K for critical tensors, reducing computation
  • Efficient block structure - Optimized for speed over extreme compression

Quantization Configuration

Q3_HIFI (Optimized):

```powershell
.\build\bin\Release\llama-quantize.exe `
  --imatrix .\qwen3-0.6b-imatrix.gguf `
  --tensor-type "attn_v=q3_hifi" `
  --tensor-type "attn_q=q3_hifi" `
  --tensor-type "attn_k=q3_hifi" `
  --tensor-type "ffn_down=q3_hifi" `
  --tensor-type "ffn_gate=q3_hifi" `
  --tensor-type "ffn_up=q3_hifi" `
  --tensor-type "attn_output.weight=q4_k" `
  --tensor-type "output.weight=q6_k" `
  --tensor-type ".*=q3_k" `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  Q3_HIFI
```
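
The importance matrix referenced above must be generated first. A minimal sketch using llama.cpp's llama-imatrix tool is shown below; the calibration text file and context size are illustrative assumptions, not the exact settings used for these results:

```powershell
# Generate an importance matrix from a calibration text file (calibration.txt is a placeholder)
.\build\bin\Release\llama-imatrix.exe `
  -m .\Qwen3-0.6B-f16.gguf `
  -f .\calibration.txt `
  -o .\qwen3-0.6b-imatrix.gguf `
  -c 512
```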

Q3_K_M (Simple):

```powershell
.\build\bin\Release\llama-quantize.exe `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_K_M.gguf `
  Q3_K_M
```

Q3_K_S (Simple):

```powershell
.\build\bin\Release\llama-quantize.exe `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_K_S.gguf `
  Q3_K_S
```

Decision Matrix

| Priority | Recommended Format | Reason |
|---|---|---|
| Quality | Q3_HIFI | Best perplexity (31.22) |
| File Size | Q3_HIFI | Smallest (308 MB) |
| Speed | Q3_K_M | Fastest (223 tok/s) |
| Simplicity | Q3_K_S or Q3_K_M | Zero configuration |
| Quality + Size | Q3_HIFI | Best of both |
| Speed + Quality | Q3_K_M | Good balance |
| Production (Simple) | Q3_K_M | Fast + good quality |
| Production (Optimized) | Q3_HIFI | Best quality + smallest |

Conclusion

For most users:

  • Choose Q3_K_M if speed is your priority and you want good quality without configuration
  • Choose Q3_HIFI if quality and file size are your priorities and you can invest in setup
  • Choose Q3_K_S if you want a simple, balanced option

For quality-critical applications: Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MB), though it requires more setup effort and is slightly slower.

For speed-critical applications: Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration.


Generated: 2025-12-03
Test Dataset: wikitext-2-raw/wiki.test.raw
Model: Qwen3-0.6B
Evaluation Parameters: --ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512