
Q3 Quantization Format Comparison Summary

Executive Summary

This document compares three 3-bit quantization formats for the Qwen3-0.6B model: Q3_K_S, Q3_K_M, and Q3_HIFI. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters.
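
For reference, the perplexity figures below were presumably produced with llama.cpp's llama-perplexity tool (the evaluation parameters listed at the end of this document match its flags). A minimal sketch of such a run is shown here; the model path is illustrative only:

```powershell
# Measure perplexity on wikitext-2-raw with the parameters used for this comparison
.\build\bin\Release\llama-perplexity.exe `
  -m .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  -f .\wikitext-2-raw\wiki.test.raw `
  --ppl-stride 0 --ppl-output-type 0 `
  -b 2048 -c 512
```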


Performance Metrics

| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|---|---|---|---|
| Perplexity | 35.85 ± 0.32 | 31.81 ± 0.29 | 31.22 ± 0.28 ⭐ |
| File Size | 366.19 MiB | 389.12 MiB | 308.23 MiB ⭐ |
| Bits Per Weight | 4.09 bpw | 4.34 bpw | 3.44 bpw ⭐ |
| Inference Speed | 179.80 tok/s | 223.44 tok/s ⭐ | 189.23 tok/s |
| Memory Usage | 888 MiB | 911 MiB | 830 MiB ⭐ |
| Quality Rank | 3rd | 2nd | 1st ⭐ |
| Size Rank | 2nd | 3rd | 1st ⭐ |
| Speed Rank | 3rd | 1st ⭐ | 2nd |

Detailed Analysis

1. Q3_K_S (Small) - The Balanced Option

Perplexity: 35.85 ± 0.32
File Size: 366.19 MiB (4.09 bpw)
Speed: 179.80 tokens/second

✅ Pros:

  • Good balance between quality, size, and speed
  • Smaller than Q3_K_M (366 MB vs 389 MB)
  • Nearly as fast as Q3_HIFI (180 tok/s vs 189 tok/s)
  • Simple quantization - zero configuration required
  • Automatic tensor upgrades - uses Q6_K for output.weight
  • Production-ready - works out of the box

❌ Cons:

  • Worst quality - 4.6 points worse perplexity than Q3_HIFI
  • Not the best in any category - and the slowest of the three formats
  • Lower precision - fewer automatic upgrades than Q3_K_M

🎯 Best For:

  • General-purpose applications where you need a reasonable compromise
  • When you want good-enough quality without optimization effort
  • Production deployments where simplicity is valued over maximum quality
  • Systems where file size matters but you can't invest in optimization

2. Q3_K_M (Medium) - The Speed Champion

Perplexity: 31.81 ± 0.29
File Size: 389.12 MiB (4.34 bpw)
Speed: 223.44 tokens/second

✅ Pros:

  • Fastest inference - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
  • Good quality - only 0.59 points worse than Q3_HIFI
  • Automatic tensor upgrades - uses Q4_K and Q5_K for critical tensors
  • Production-ready - zero configuration required
  • CPU_REPACK support - optimized memory layout (91 MiB repack buffer)
  • Best speed-to-quality ratio - excellent performance for the quality level

❌ Cons:

  • Largest file size - 389 MB (26% larger than Q3_HIFI, 6% larger than Q3_K_S)
  • Higher memory usage - 911 MiB total
  • Not the best quality - 0.59 points worse than Q3_HIFI

🎯 Best For:

  • Real-time applications where speed is critical
  • Interactive systems requiring low latency
  • Production deployments where speed matters more than file size
  • Systems with sufficient storage but need maximum throughput
  • When you want good quality without configuration effort

3. Q3_HIFI (Optimized) - The Quality & Size Champion

Perplexity: 31.22 ± 0.28 ⭐ BEST
File Size: 308.23 MiB (3.44 bpw) ⭐ SMALLEST
Speed: 189.23 tokens/second

✅ Pros:

  • Best quality - 31.22 perplexity (0.59 points better than Q3_K_M, 4.6 points better than Q3_K_S)
  • Smallest file size - 308 MB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M)
  • Lowest memory usage - 830 MiB total
  • Best quality-to-size ratio - highest quality in smallest package
  • IMatrix-guided quantization - uses importance matrix for optimal outlier selection
  • Expanded tensor coverage - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
  • Automatic upgrades - Q6_K for output.weight, Q4_K for attn_output.weight
  • 6 FP16 outliers per block - preserves precision for critical weights

❌ Cons:

  • Slower inference - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S)
  • Requires configuration - needs IMatrix generation and tensor-type specification
  • More setup effort - must generate imatrix file and specify quantization strategy
  • Longer quantization time - IMatrix generation takes 30-60 minutes

🎯 Best For:

  • Quality-critical applications where accuracy matters most
  • Storage-constrained systems - mobile devices, embedded systems
  • Offline deployments where file size is a concern
  • When you can invest in proper quantization setup
  • Research and development where quality is the priority
  • Production systems where quality > speed

Head-to-Head Comparisons

Quality Comparison

| Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_HIFI | 31.22 | Baseline | -0.59 ⭐ | -4.63 ⭐ |
| Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 |
| Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline |

Winner: Q3_HIFI (best quality by 0.59 points over Q3_K_M)

File Size Comparison

| Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_HIFI | 308 MB | Baseline | -81 MB ⭐ | -58 MB ⭐ |
| Q3_K_S | 366 MB | +58 MB | -23 MB | Baseline |
| Q3_K_M | 389 MB | +81 MB | Baseline | +23 MB |

Winner: Q3_HIFI (smallest by 58-81 MB)

Speed Comparison

| Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|---|---|---|---|---|
| Q3_K_M | 223 tok/s | +34 tok/s ⭐ | Baseline | +44 tok/s ⭐ |
| Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s |
| Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline |

Winner: Q3_K_M (fastest by 18-24%)
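
The throughput numbers can be cross-checked with llama.cpp's llama-bench tool. The sketch below is illustrative only (model path, prompt/generation lengths, and thread count are assumptions, and this is not necessarily how the figures above were measured):

```powershell
# Benchmark prompt processing and token generation speed for one quantized model
.\build\bin\Release\llama-bench.exe `
  -m .\Qwen3-0.6B-f16-Q3_K_M.gguf `
  -p 512 -n 128 -t 8
```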


Recommendations

🎯 Best Overall: Q3_HIFI (With IMatrix + Expanded Coverage)

  • ✅ Best quality (31.22) - beats Q3_K_M by 0.59 points
  • ✅ Smallest file size (308 MB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
  • ✅ Best quality-to-size ratio - best quality in smallest package
  • ⚠️ Requires IMatrix + tensor-type configuration
  • ⚠️ Slower inference (189 tok/s vs 223 tok/s)
  • Use when: You want the best quality in the smallest file and can invest in proper quantization setup

⚡ Best Speed: Q3_K_M (Out of the Box)

  • ✅ Fastest inference (223 tok/s) - 18% faster than Q3_HIFI
  • ✅ Good quality (31.81) - only 0.59 points worse than Q3_HIFI
  • ✅ Automatic tensor upgrades
  • ✅ Production-ready immediately - zero configuration
  • ⚠️ Largest file size (389 MB)
  • Use when: Speed is critical and you want good quality without configuration effort

⚖️ Best Balance: Q3_K_S

  • ✅ Good middle ground - reasonable quality (35.85) and speed (180 tok/s)
  • ✅ Smaller than Q3_K_M (366 MB vs 389 MB)
  • ✅ Simple quantization process - zero configuration
  • ⚠️ Not the best in any category
  • Use when: You need a balanced compromise without optimization effort

Technical Notes

Why Q3_HIFI Achieves Best Quality

Q3_HIFI (with proper configuration) achieves the best quality through:

  • IMatrix-guided outlier selection - Uses importance weights to select the most critical outliers
  • Expanded tensor coverage - Applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
  • Automatic upgrades - Q6_K for output.weight, Q4_K for attn_output.weight
  • 6 FP16 outliers per block - Preserves precision for the most important weights

This combination allows Q3_HIFI to achieve 31.22 perplexity - better than Q3_K_M's 31.81.

Why Q3_HIFI is Slower

Q3_HIFI's unique architecture (6 FP16 outliers per block) requires:

  • More memory lookups (scattered access pattern for outlier indices)
  • Additional FP16-to-FP32 conversions for outliers
  • More complex dequantization logic
  • Currently limited to CPU with basic vectorization (no GPU/SIMD optimizations yet)

However, with SIMD/GPU optimizations recently implemented, speed should improve significantly in future builds.

Why Q3_K_M is Fastest

Q3_K_M benefits from:

  • CPU_REPACK optimization - Optimized memory layout (91 MiB repack buffer)
  • Mature optimizations - Well-optimized SIMD/GPU kernels
  • Automatic tensor upgrades - Uses Q4_K and Q5_K for critical tensors, reducing computation
  • Efficient block structure - Optimized for speed over extreme compression

Quantization Configuration

Q3_HIFI (Optimized):

```powershell
.\build\bin\Release\llama-quantize.exe `
  --imatrix .\qwen3-0.6b-imatrix.gguf `
  --tensor-type "attn_v=q3_hifi" `
  --tensor-type "attn_q=q3_hifi" `
  --tensor-type "attn_k=q3_hifi" `
  --tensor-type "ffn_down=q3_hifi" `
  --tensor-type "ffn_gate=q3_hifi" `
  --tensor-type "ffn_up=q3_hifi" `
  --tensor-type "attn_output.weight=q4_k" `
  --tensor-type "output.weight=q6_k" `
  --tensor-type ".*=q3_k" `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  Q3_HIFI
```
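
The importance matrix referenced above must be generated first. A minimal sketch using llama.cpp's llama-imatrix tool is shown below; the calibration text file and context size are illustrative assumptions, not the exact settings used for these results:

```powershell
# Generate an importance matrix from a calibration text file (calibration.txt is a placeholder)
.\build\bin\Release\llama-imatrix.exe `
  -m .\Qwen3-0.6B-f16.gguf `
  -f .\calibration.txt `
  -o .\qwen3-0.6b-imatrix.gguf `
  -c 512
```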

Q3_K_M (Simple):

```powershell
.\build\bin\Release\llama-quantize.exe `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_K_M.gguf `
  Q3_K_M
```

Q3_K_S (Simple):

```powershell
.\build\bin\Release\llama-quantize.exe `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_K_S.gguf `
  Q3_K_S
```

Decision Matrix

| Priority | Recommended Format | Reason |
|---|---|---|
| Quality | Q3_HIFI | Best perplexity (31.22) |
| File Size | Q3_HIFI | Smallest (308 MB) |
| Speed | Q3_K_M | Fastest (223 tok/s) |
| Simplicity | Q3_K_S or Q3_K_M | Zero configuration |
| Quality + Size | Q3_HIFI | Best of both |
| Speed + Quality | Q3_K_M | Good balance |
| Production (Simple) | Q3_K_M | Fast + good quality |
| Production (Optimized) | Q3_HIFI | Best quality + smallest |

Conclusion

For most users:

  • Choose Q3_K_M if speed is your priority and you want good quality without configuration
  • Choose Q3_HIFI if quality and file size are your priorities and you can invest in setup
  • Choose Q3_K_S if you want a simple, balanced option

For quality-critical applications: Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MB), though it requires more setup effort and is slightly slower.

For speed-critical applications: Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration.


Generated: 2025-12-03
Test Dataset: wikitext-2-raw/wiki.test.raw
Model: Qwen3-0.6B
Evaluation Parameters: --ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512