# Q3 Quantization Format Comparison Summary
## Executive Summary
This document compares three 3-bit quantization formats for the Qwen3-0.6B model: **Q3_K_S**, **Q3_K_M**, and **Q3_HIFI**. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters.
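For reference, the perplexity figures in this document come from a llama.cpp perplexity run. A hedged reconstruction of the likely invocation, assuming the same Windows build layout as the quantization commands shown later (the model path is illustrative; the parameters match those listed at the end of this document):

```powershell
# Hedged reconstruction of the evaluation command (paths are assumptions).
# Run once per quantized model to reproduce the perplexity figures.
.\build\bin\Release\llama-perplexity.exe `
  -m .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  -f .\wikitext-2-raw\wiki.test.raw `
  --ppl-stride 0 --ppl-output-type 0 `
  -b 2048 -c 512
```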
---
## Performance Metrics
| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|--------|--------|--------|---------|
| **Perplexity** | 35.85 ± 0.32 | 31.81 ± 0.29 | **31.22 ± 0.28** ⭐ |
| **File Size** | 366.19 MiB | 389.12 MiB | **308.23 MiB** ⭐ |
| **Bits Per Weight** | 4.09 bpw | 4.34 bpw | **3.44 bpw** ⭐ |
| **Inference Speed** | 179.80 tok/s | **223.44 tok/s** ⭐ | 189.23 tok/s |
| **Memory Usage** | 888 MiB | 911 MiB | **830 MiB** ⭐ |
| **Quality Rank** | 3rd | 2nd | **1st** ⭐ |
| **Size Rank** | 2nd | 3rd | **1st** ⭐ |
| **Speed Rank** | 3rd | **1st** ⭐ | 2nd |
---
## Detailed Analysis
### 1. Q3_K_S (Small) - The Balanced Option
**Perplexity:** 35.85 ± 0.32
**File Size:** 366.19 MiB (4.09 bpw)
**Speed:** 179.80 tokens/second
#### ✅ Pros:
- **Good balance** between quality, size, and speed
- **Smaller than Q3_K_M** (366 MB vs 389 MB)
- **Simple quantization** - zero configuration required
- **Automatic tensor upgrades** - uses Q6_K for output.weight
- **Production-ready** - works out of the box
#### ❌ Cons:
- **Worst quality** - 4.63 points worse perplexity than Q3_HIFI
- **Not the best in any category** - and the slowest of the three (179.80 tok/s)
- **Lower precision** - fewer automatic upgrades than Q3_K_M
#### 🎯 Best For:
- General-purpose applications where you need a reasonable compromise
- When you want good-enough quality without optimization effort
- Production deployments where simplicity is valued over maximum quality
- Systems where file size matters but you can't invest in optimization
---
### 2. Q3_K_M (Medium) - The Speed Champion
**Perplexity:** 31.81 ± 0.29
**File Size:** 389.12 MiB (4.34 bpw)
**Speed:** 223.44 tokens/second
#### ✅ Pros:
- **Fastest inference** - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
- **Good quality** - only 0.59 points worse than Q3_HIFI
- **Automatic tensor upgrades** - uses Q4_K and Q5_K for critical tensors
- **Production-ready** - zero configuration required
- **CPU_REPACK support** - optimized memory layout (91 MiB repack buffer)
- **Best speed-to-quality ratio** - excellent performance for the quality level
#### ❌ Cons:
- **Largest file size** - 389 MB (26% larger than Q3_HIFI, 6% larger than Q3_K_S)
- **Higher memory usage** - 911 MiB total
- **Not the best quality** - 0.59 points worse than Q3_HIFI
#### 🎯 Best For:
- **Real-time applications** where speed is critical
- **Interactive systems** requiring low latency
- **Production deployments** where speed matters more than file size
- **Systems with sufficient storage** but need maximum throughput
- **When you want good quality without configuration effort**
---
### 3. Q3_HIFI (Optimized) - The Quality & Size Champion
**Perplexity:** 31.22 ± 0.28 ⭐ **BEST**
**File Size:** 308.23 MiB (3.44 bpw) ⭐ **SMALLEST**
**Speed:** 189.23 tokens/second
#### ✅ Pros:
- **Best quality** - 31.22 perplexity (0.59 points better than Q3_K_M, 4.6 points better than Q3_K_S)
- **Smallest file size** - 308 MB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M)
- **Lowest memory usage** - 830 MiB total
- **Best quality-to-size ratio** - highest quality in smallest package
- **IMatrix-guided quantization** - uses importance matrix for optimal outlier selection
- **Expanded tensor coverage** - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - preserves precision for critical weights
#### ❌ Cons:
- **Slower than Q3_K_M** - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S)
- **Requires configuration** - needs IMatrix generation and tensor-type specification
- **More setup effort** - must generate imatrix file and specify quantization strategy
- **Longer quantization time** - IMatrix generation takes 30-60 minutes
#### 🎯 Best For:
- **Quality-critical applications** where accuracy matters most
- **Storage-constrained systems** - mobile devices, embedded systems
- **Offline deployments** where file size is a concern
- **When you can invest in proper quantization setup**
- **Research and development** where quality is the priority
- **Production systems** where quality > speed
---
## Head-to-Head Comparisons
### Quality Comparison
| Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------------|------------|-----------|-----------|
| **Q3_HIFI** | **31.22** | Baseline | **-0.59** ⭐ | **-4.63** ⭐ |
| Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 |
| Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline |
**Winner:** Q3_HIFI (best quality by 0.59 points over Q3_K_M)
### File Size Comparison
| Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------|------------|-----------|-----------|
| **Q3_HIFI** | **308 MB** | Baseline | **-81 MB** ⭐ | **-58 MB** ⭐ |
| Q3_K_S | 366 MB | +58 MB | -23 MB | Baseline |
| Q3_K_M | 389 MB | +81 MB | Baseline | +23 MB |
**Winner:** Q3_HIFI (smallest by 58-81 MB)
### Speed Comparison
| Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|-------|------------|-----------|-----------|
| **Q3_K_M** | **223 tok/s** | **+34 tok/s** ⭐ | Baseline | **+44 tok/s** ⭐ |
| Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s |
| Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline |
**Winner:** Q3_K_M (fastest by 18-24%)
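Throughput figures like these can be reproduced with llama.cpp's `llama-bench` tool. A sketch assuming the same build layout as the quantization commands below (the `-p`/`-n` values here are illustrative defaults, not necessarily the ones used for the table above):

```powershell
# Benchmark all three quantized models in one run; llama-bench accepts
# repeated -m flags and reports prompt and generation tok/s per model.
.\build\bin\Release\llama-bench.exe `
  -m .\Qwen3-0.6B-f16-Q3_K_S.gguf `
  -m .\Qwen3-0.6B-f16-Q3_K_M.gguf `
  -m .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  -p 512 -n 128
```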
---
## Recommendations
### 🎯 Best Overall: **Q3_HIFI** (With IMatrix + Expanded Coverage)
- ✅ **Best quality** (31.22) - beats Q3_K_M by 0.59 points
- ✅ **Smallest file size** (308 MB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
- ✅ **Best quality-to-size ratio** - best quality in smallest package
- ⚠️ Requires IMatrix + tensor-type configuration
- ⚠️ Slower inference (189 tok/s vs 223 tok/s)
- **Use when:** You want the best quality in the smallest file and can invest in proper quantization setup
### ⚡ Best Speed: **Q3_K_M** (Out of the Box)
- ✅ **Fastest inference** (223 tok/s) - 18% faster than Q3_HIFI
- ✅ **Good quality** (31.81) - only 0.59 points worse than Q3_HIFI
- ✅ Automatic tensor upgrades
- ✅ Production-ready immediately - zero configuration
- ⚠️ Largest file size (389 MB)
- **Use when:** Speed is critical and you want good quality without configuration effort
### ⚖️ Best Balance: **Q3_K_S**
- ✅ Simple middle ground on size, with acceptable quality (35.85) and speed (180 tok/s)
- ✅ Smaller than Q3_K_M (366 MB vs 389 MB)
- ✅ Simple quantization process - zero configuration
- ⚠️ Not the best in any category
- **Use when:** You need a balanced compromise without optimization effort
---
## Technical Notes
### Why Q3_HIFI Achieves Best Quality
Q3_HIFI (with proper configuration) achieves the best quality through:
- **IMatrix-guided outlier selection** - Uses importance weights to select the most critical outliers
- **Expanded tensor coverage** - Applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - Preserves precision for the most important weights
This combination allows Q3_HIFI to achieve **31.22 perplexity** - better than Q3_K_M's 31.81.
### Why Q3_HIFI is Slower
Q3_HIFI's unique architecture (6 FP16 outliers per block) requires:
- More memory lookups (scattered access pattern for outlier indices)
- Additional FP16-to-FP32 conversions for outliers
- More complex dequantization logic
- Currently limited to CPU with basic vectorization (no mature SIMD/GPU kernels in this build)
However, with SIMD/GPU optimizations for this format now being implemented, speed should improve significantly in future builds.
### Why Q3_K_M is Fastest
Q3_K_M benefits from:
- **CPU_REPACK optimization** - Optimized memory layout (91 MiB repack buffer)
- **Mature optimizations** - Well-optimized SIMD/GPU kernels
- **Automatic tensor upgrades** - Uses Q4_K and Q5_K for critical tensors, reducing computation
- **Efficient block structure** - Optimized for speed over extreme compression
### Quantization Configuration
**Q3_HIFI (Optimized):**
```powershell
.\build\bin\Release\llama-quantize.exe `
--imatrix .\qwen3-0.6b-imatrix.gguf `
--tensor-type "attn_v=q3_hifi" `
--tensor-type "attn_q=q3_hifi" `
--tensor-type "attn_k=q3_hifi" `
--tensor-type "ffn_down=q3_hifi" `
--tensor-type "ffn_gate=q3_hifi" `
--tensor-type "ffn_up=q3_hifi" `
--tensor-type "attn_output.weight=q4_k" `
--tensor-type "output.weight=q6_k" `
--tensor-type ".*=q3_k" `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_HIFI.gguf `
Q3_HIFI
```
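The `--imatrix` file referenced above must be generated first. A sketch using llama.cpp's `llama-imatrix` tool, assuming the same build layout (`calibration.txt` is a placeholder name; any representative text corpus works):

```powershell
# Generate the importance matrix from a calibration corpus.
# calibration.txt is an assumed placeholder; this is the 30-60 minute
# step of the Q3_HIFI setup mentioned earlier.
.\build\bin\Release\llama-imatrix.exe `
  -m .\Qwen3-0.6B-f16.gguf `
  -f .\calibration.txt `
  -o .\qwen3-0.6b-imatrix.gguf
```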
**Q3_K_M (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_K_M.gguf `
Q3_K_M
```
**Q3_K_S (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
.\Qwen3-0.6B-f16.gguf `
.\Qwen3-0.6B-f16-Q3_K_S.gguf `
Q3_K_S
```
---
## Decision Matrix
| Priority | Recommended Format | Reason |
|----------|-------------------|--------|
| **Quality** | Q3_HIFI | Best perplexity (31.22) |
| **File Size** | Q3_HIFI | Smallest (308 MB) |
| **Speed** | Q3_K_M | Fastest (223 tok/s) |
| **Simplicity** | Q3_K_S or Q3_K_M | Zero configuration |
| **Quality + Size** | Q3_HIFI | Best of both |
| **Speed + Quality** | Q3_K_M | Good balance |
| **Production (Simple)** | Q3_K_M | Fast + good quality |
| **Production (Optimized)** | Q3_HIFI | Best quality + smallest |
---
## Conclusion
**For most users:**
- **Choose Q3_K_M** if speed is your priority and you want good quality without configuration
- **Choose Q3_HIFI** if quality and file size are your priorities and you can invest in setup
- **Choose Q3_K_S** if you want a simple, balanced option
**For quality-critical applications:**
Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MB), though it requires more setup effort and is slightly slower.
**For speed-critical applications:**
Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration.
---
**Generated:** 2025-12-03
**Test Dataset:** wikitext-2-raw/wiki.test.raw
**Model:** Qwen3-0.6B
**Evaluation Parameters:** --ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512