# Q3 Quantization Format Comparison Summary

## Executive Summary

This document compares three 3-bit quantization formats for the Qwen3-0.6B model: **Q3_K_S**, **Q3_K_M**, and **Q3_HIFI**. All models were evaluated on the same test dataset (wikitext-2-raw/wiki.test.raw) with identical parameters.

---

## Performance Metrics

| Metric | Q3_K_S | Q3_K_M | Q3_HIFI |
|--------|--------|--------|---------|
| **Perplexity** | 35.85 ± 0.32 | 31.81 ± 0.29 | **31.22 ± 0.28** ⭐ |
| **File Size** | 366.19 MiB | 389.12 MiB | **308.23 MiB** ⭐ |
| **Bits Per Weight** | 4.09 bpw | 4.34 bpw | **3.44 bpw** ⭐ |
| **Inference Speed** | 179.80 tok/s | **223.44 tok/s** ⭐ | 189.23 tok/s |
| **Memory Usage** | 888 MiB | 911 MiB | **830 MiB** ⭐ |
| **Quality Rank** | 3rd | 2nd | **1st** ⭐ |
| **Size Rank** | 2nd | 3rd | **1st** ⭐ |
| **Speed Rank** | 3rd | **1st** ⭐ | 2nd |
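
As a quick consistency check on the table, file size and bits per weight together imply a weight count, and total memory minus file size gives the memory that is not model weights. The sketch below uses only the figures from the table. The helper names are illustrative, and reading the remainder as runtime buffers (KV cache, compute buffers) is an assumption, not something the benchmark output states.

```python
# Figures copied from the Performance Metrics table.
MIB = 1024 * 1024  # bytes per MiB

formats = {
    # name:     (file size MiB, bits per weight, total memory MiB)
    "Q3_K_S":  (366.19, 4.09, 888),
    "Q3_K_M":  (389.12, 4.34, 911),
    "Q3_HIFI": (308.23, 3.44, 830),
}

def implied_params(file_mib, bpw):
    """Weight count implied by file size and bpw (ignores file metadata)."""
    return file_mib * MIB * 8 / bpw

def overhead_mib(file_mib, memory_mib):
    """Memory not accounted for by the weight file itself."""
    return memory_mib - file_mib

for name, (size, bpw, mem) in formats.items():
    print(f"{name}: ~{implied_params(size, bpw) / 1e6:.0f}M weights, "
          f"{overhead_mib(size, mem):.1f} MiB non-weight memory")
```

All three formats come out at roughly 751-752M weights and about 522 MiB of non-weight memory, so the memory differences between formats track the file-size differences almost exactly.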

---

## Detailed Analysis

### 1. Q3_K_S (Small) - The Balanced Option

**Perplexity:** 35.85 ± 0.32  
**File Size:** 366.19 MiB (4.09 bpw)  
**Speed:** 179.80 tokens/second

#### ✅ Pros:
- **Good balance** between quality, size, and speed
- **Smaller than Q3_K_M** (366 MiB vs 389 MiB)
- **Simple quantization** - zero configuration required
- **Automatic tensor upgrades** - uses Q6_K for output.weight
- **Production-ready** - works out of the box

#### ❌ Cons:
- **Worst quality** - perplexity is 4.63 points worse than Q3_HIFI and 4.04 points worse than Q3_K_M
- **Slowest inference** - 180 tok/s (5% slower than Q3_HIFI, 20% slower than Q3_K_M)
- **Not the best in any category** - in the middle on file size and memory, last on quality and speed
- **Lower precision** - fewer automatic upgrades than Q3_K_M

#### 🎯 Best For:
- General-purpose applications where you need a reasonable compromise
- When you want good-enough quality without optimization effort
- Production deployments where simplicity is valued over maximum quality
- Systems where file size matters but you can't invest in optimization

---

### 2. Q3_K_M (Medium) - The Speed Champion

**Perplexity:** 31.81 ± 0.29  
**File Size:** 389.12 MiB (4.34 bpw)  
**Speed:** 223.44 tokens/second

#### ✅ Pros:
- **Fastest inference** - 223 tok/s (18% faster than Q3_HIFI, 24% faster than Q3_K_S)
- **Good quality** - only 0.59 points worse than Q3_HIFI
- **Automatic tensor upgrades** - uses Q4_K and Q5_K for critical tensors
- **Production-ready** - zero configuration required
- **CPU_REPACK support** - optimized memory layout (91 MiB repack buffer)
- **Best speed-to-quality ratio** - excellent performance for the quality level

#### ❌ Cons:
- **Largest file size** - 389 MiB (26% larger than Q3_HIFI, 6% larger than Q3_K_S)
- **Higher memory usage** - 911 MiB total
- **Not the best quality** - 0.59 points worse than Q3_HIFI

#### 🎯 Best For:
- **Real-time applications** where speed is critical
- **Interactive systems** requiring low latency
- **Production deployments** where speed matters more than file size
- **Systems with sufficient storage** but need maximum throughput
- **When you want good quality without configuration effort**

---

### 3. Q3_HIFI (Optimized) - The Quality & Size Champion

**Perplexity:** 31.22 ± 0.28 ⭐ **BEST**  
**File Size:** 308.23 MiB (3.44 bpw) ⭐ **SMALLEST**  
**Speed:** 189.23 tokens/second

#### ✅ Pros:
- **Best quality** - 31.22 perplexity (0.59 points better than Q3_K_M, 4.6 points better than Q3_K_S)
- **Smallest file size** - 308 MiB (16% smaller than Q3_K_S, 21% smaller than Q3_K_M)
- **Lowest memory usage** - 830 MiB total
- **Best quality-to-size ratio** - highest quality in smallest package
- **IMatrix-guided quantization** - uses importance matrix for optimal outlier selection
- **Expanded tensor coverage** - Q3_HIFI applied to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - preserves precision for critical weights

#### ❌ Cons:
- **Slower inference than Q3_K_M** - 189 tok/s (15% slower than Q3_K_M, though 5% faster than Q3_K_S)
- **Requires configuration** - needs IMatrix generation and tensor-type specification
- **More setup effort** - must generate imatrix file and specify quantization strategy
- **Longer quantization time** - IMatrix generation takes 30-60 minutes

#### 🎯 Best For:
- **Quality-critical applications** where accuracy matters most
- **Storage-constrained systems** - mobile devices, embedded systems
- **Offline deployments** where file size is a concern
- **When you can invest in proper quantization setup**
- **Research and development** where quality is the priority
- **Production systems** where quality > speed

---

## Head-to-Head Comparisons

### Quality Comparison

| Format | Perplexity | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------------|------------|-----------|-----------|
| **Q3_HIFI** | **31.22** | Baseline | **-0.59** ⭐ | **-4.63** ⭐ |
| Q3_K_M | 31.81 | +0.59 | Baseline | -4.04 |
| Q3_K_S | 35.85 | +4.63 | +4.04 | Baseline |

**Winner:** Q3_HIFI (best quality by 0.59 points over Q3_K_M)
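
To put the gaps in context, each perplexity difference can be compared with the reported ± uncertainties. The sketch below treats the ± values as independent standard errors. That is a simplifying assumption: all three runs score the same text, so their errors are correlated and a paired comparison would likely be tighter. `z_score` is an illustrative helper, not part of any benchmarking tool.

```python
import math

def z_score(ppl_a, err_a, ppl_b, err_b):
    """Gap between two perplexities in units of their combined uncertainty.

    Assumes the two +/- values are independent standard errors.
    """
    return (ppl_b - ppl_a) / math.sqrt(err_a ** 2 + err_b ** 2)

# (perplexity, +/-) pairs from the Performance Metrics table.
hifi = (31.22, 0.28)
k_m = (31.81, 0.29)
k_s = (35.85, 0.32)

print(f"Q3_HIFI vs Q3_K_M: {z_score(*hifi, *k_m):.2f} combined sigma")
print(f"Q3_HIFI vs Q3_K_S: {z_score(*hifi, *k_s):.2f} combined sigma")
```

Under this assumption the Q3_HIFI lead over Q3_K_M is about 1.5 combined sigma, while the lead over Q3_K_S is more than 10, so the first gap is far less clear-cut than the second.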

### File Size Comparison

| Format | Size | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|------|------------|-----------|-----------|
| **Q3_HIFI** | **308 MiB** | Baseline | **-81 MiB** ⭐ | **-58 MiB** ⭐ |
| Q3_K_S | 366 MiB | +58 MiB | -23 MiB | Baseline |
| Q3_K_M | 389 MiB | +81 MiB | Baseline | +23 MiB |

**Winner:** Q3_HIFI (smallest by 58-81 MiB)

### Speed Comparison

| Format | Speed | vs Q3_HIFI | vs Q3_K_M | vs Q3_K_S |
|--------|-------|------------|-----------|-----------|
| **Q3_K_M** | **223 tok/s** | **+34 tok/s** ⭐ | Baseline | **+44 tok/s** ⭐ |
| Q3_HIFI | 189 tok/s | Baseline | -34 tok/s | +9 tok/s |
| Q3_K_S | 180 tok/s | -9 tok/s | -44 tok/s | Baseline |

**Winner:** Q3_K_M (fastest by 18-24%)
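
The percentage differences quoted in this document depend on which format is taken as the reference: "A is X% larger than B" and "B is Y% smaller than A" are different numbers. A minimal sketch of that arithmetic, using the measured sizes and speeds (`pct_diff` is an illustrative helper):

```python
def pct_diff(value, reference):
    """Signed percentage by which `value` differs from `reference`."""
    return (value - reference) / reference * 100

# Measured values from the Performance Metrics table.
size_mib = {"Q3_K_S": 366.19, "Q3_K_M": 389.12, "Q3_HIFI": 308.23}
speed_tps = {"Q3_K_S": 179.80, "Q3_K_M": 223.44, "Q3_HIFI": 189.23}

for other in ("Q3_K_S", "Q3_K_M"):
    diff = pct_diff(size_mib["Q3_HIFI"], size_mib[other])
    print(f"Q3_HIFI size vs {other}: {diff:+.1f}%")

for other in ("Q3_HIFI", "Q3_K_S"):
    diff = pct_diff(speed_tps["Q3_K_M"], speed_tps[other])
    print(f"Q3_K_M speed vs {other}: {diff:+.1f}%")
```

For example, Q3_K_M is about 26% larger than Q3_HIFI, but Q3_HIFI is about 21% smaller than Q3_K_M. Both statements describe the same 81 MiB gap.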

---

## Recommendations

### 🎯 Best Overall: **Q3_HIFI** (With IMatrix + Expanded Coverage)
- ✅ **Best quality** (31.22) - beats Q3_K_M by 0.59 points
- ✅ **Smallest file size** (308 MiB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
- ✅ **Best quality-to-size ratio** - best quality in smallest package
- ⚠️ Requires IMatrix + tensor-type configuration
- ⚠️ Slower inference (189 tok/s vs 223 tok/s)
- **Use when:** You want the best quality in the smallest file and can invest in proper quantization setup

### ⚡ Best Speed: **Q3_K_M** (Out of the Box)
- ✅ **Fastest inference** (223 tok/s) - 18% faster than Q3_HIFI
- ✅ **Good quality** (31.81) - only 0.59 points worse than Q3_HIFI
- ✅ Automatic tensor upgrades
- ✅ Production-ready immediately - zero configuration
- ⚠️ Largest file size (389 MiB)
- **Use when:** Speed is critical and you want good quality without configuration effort

### ⚖️ Best Balance: **Q3_K_S**
- ✅ Good middle ground - reasonable quality (35.85) and speed (180 tok/s)
- ✅ Smaller than Q3_K_M (366 MiB vs 389 MiB)
- ✅ Simple quantization process - zero configuration
- ⚠️ Not the best in any category
- **Use when:** You need a balanced compromise without optimization effort

---

## Technical Notes

### Why Q3_HIFI Achieves Best Quality

Q3_HIFI (with proper configuration) achieves the best quality through:
- **IMatrix-guided outlier selection** - Uses importance weights to select the most critical outliers
- **Expanded tensor coverage** - Applies Q3_HIFI to 6 tensor types (attn_v/q/k, ffn_down/gate/up)
- **Automatic upgrades** - Q6_K for output.weight, Q4_K for attn_output.weight
- **6 FP16 outliers per block** - Preserves precision for the most important weights

This combination allows Q3_HIFI to achieve **31.22 perplexity** - better than Q3_K_M's 31.81.
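
The outlier idea can be illustrated with a toy block quantizer. This is a sketch only: the real Q3_HIFI block layout, block size, base 3-bit scheme, and selection rule are not described in this document. The symmetric codes in [-3, 3], the importance-weighted squared-magnitude ranking, and all function names below are assumptions made for illustration.

```python
import struct

def to_fp16(x):
    """Round-trip a float through IEEE 754 half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_block(weights, importance, n_outliers=6):
    """Return (scale, codes, outliers) for one block of weights."""
    # Assumed selection rule: largest importance-weighted squared magnitude.
    ranked = sorted(range(len(weights)),
                    key=lambda i: importance[i] * weights[i] ** 2,
                    reverse=True)
    outliers = {i: to_fp16(weights[i]) for i in ranked[:n_outliers]}
    # Remaining weights share one scale and use 3-bit codes in [-3, 3].
    rest = [abs(w) for i, w in enumerate(weights) if i not in outliers]
    scale = (max(rest) / 3 if rest else 0.0) or 1.0
    codes = [0 if i in outliers else max(-3, min(3, round(w / scale)))
             for i, w in enumerate(weights)]
    return scale, codes, outliers

def dequantize_block(scale, codes, outliers):
    values = [q * scale for q in codes]
    for i, w in outliers.items():  # scattered writes to patch outliers back in
        values[i] = w
    return values

def max_error(weights, importance, n_outliers):
    restored = dequantize_block(*quantize_block(weights, importance, n_outliers))
    return max(abs(a - b) for a, b in zip(weights, restored))

# Toy block: mostly small weights plus six large ones.
weights = [0.01 * ((i % 7) - 3) for i in range(32)]
for i, w in zip((3, 9, 14, 20, 25, 30), (1.0, -0.8, 0.6, -0.9, 0.7, -0.5)):
    weights[i] = w
importance = [1.0] * len(weights)  # stand-in for imatrix-derived weights

print("max error, no outliers:", max_error(weights, importance, 0))
print("max error, 6 outliers: ", max_error(weights, importance, 6))
```

Pulling the largest weights out of the block lets the remaining weights share a much smaller scale, which is where the precision gain comes from in this toy setup. The patch-back loop in `dequantize_block` is the kind of extra, scattered work that makes dequantization more expensive.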

### Why Q3_HIFI is Slower

Q3_HIFI's unique architecture (6 FP16 outliers per block) requires:
- More memory lookups (scattered access pattern for outlier indices)
- Additional FP16-to-FP32 conversions for outliers
- More complex dequantization logic
- A CPU path with only basic vectorization in the build measured here (no dedicated SIMD/GPU kernels)

SIMD/GPU optimizations have since been implemented, so speed should improve in future builds.

### Why Q3_K_M is Fastest

Q3_K_M benefits from:
- **CPU_REPACK optimization** - Optimized memory layout (91 MiB repack buffer)
- **Mature optimizations** - Well-optimized SIMD/GPU kernels
- **Automatic tensor upgrades** - Uses Q4_K and Q5_K for critical tensors, reducing computation
- **Efficient block structure** - Optimized for speed over extreme compression

### Quantization Configuration

**Q3_HIFI (Optimized):**
```powershell
.\build\bin\Release\llama-quantize.exe `
  --imatrix .\qwen3-0.6b-imatrix.gguf `
  --tensor-type "attn_v=q3_hifi" `
  --tensor-type "attn_q=q3_hifi" `
  --tensor-type "attn_k=q3_hifi" `
  --tensor-type "ffn_down=q3_hifi" `
  --tensor-type "ffn_gate=q3_hifi" `
  --tensor-type "ffn_up=q3_hifi" `
  --tensor-type "attn_output.weight=q4_k" `
  --tensor-type "output.weight=q6_k" `
  --tensor-type ".*=q3_k" `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_HIFI.gguf `
  Q3_HIFI
```

**Q3_K_M (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_K_M.gguf `
  Q3_K_M
```

**Q3_K_S (Simple):**
```powershell
.\build\bin\Release\llama-quantize.exe `
  .\Qwen3-0.6B-f16.gguf `
  .\Qwen3-0.6B-f16-Q3_K_S.gguf `
  Q3_K_S
```
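
For scripted runs, the same invocations can be assembled programmatically. The sketch below only builds the argument list. It uses the `--imatrix` and `--tensor-type` flags exactly as they appear in the commands above. `build_quantize_args` and `HIFI_OVERRIDES` are hypothetical names, and the paths are the ones used in this document.

```python
def build_quantize_args(exe, src, dst, ftype, imatrix=None, tensor_types=None):
    """Assemble a llama-quantize argument list (does not run anything)."""
    args = [exe]
    if imatrix:
        args += ["--imatrix", imatrix]
    # Dicts preserve insertion order, so overrides keep the order given here.
    for pattern, qtype in (tensor_types or {}).items():
        args += ["--tensor-type", f"{pattern}={qtype}"]
    return args + [src, dst, ftype]

# Per-tensor overrides copied from the Q3_HIFI command above.
HIFI_OVERRIDES = {
    "attn_v": "q3_hifi", "attn_q": "q3_hifi", "attn_k": "q3_hifi",
    "ffn_down": "q3_hifi", "ffn_gate": "q3_hifi", "ffn_up": "q3_hifi",
    "attn_output.weight": "q4_k", "output.weight": "q6_k", ".*": "q3_k",
}

args = build_quantize_args(
    r".\build\bin\Release\llama-quantize.exe",
    r".\Qwen3-0.6B-f16.gguf",
    r".\Qwen3-0.6B-f16-Q3_HIFI.gguf",
    "Q3_HIFI",
    imatrix=r".\qwen3-0.6b-imatrix.gguf",
    tensor_types=HIFI_OVERRIDES,
)
print(" ".join(args))
# To execute: import subprocess; subprocess.run(args, check=True)
```

Passing no `imatrix` and no `tensor_types` yields the plain Q3_K_M and Q3_K_S commands.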

---

## Decision Matrix

| Priority | Recommended Format | Reason |
|----------|-------------------|--------|
| **Quality** | Q3_HIFI | Best perplexity (31.22) |
| **File Size** | Q3_HIFI | Smallest (308 MiB) |
| **Speed** | Q3_K_M | Fastest (223 tok/s) |
| **Simplicity** | Q3_K_S or Q3_K_M | Zero configuration |
| **Quality + Size** | Q3_HIFI | Best of both |
| **Speed + Quality** | Q3_K_M | Good balance |
| **Production (Simple)** | Q3_K_M | Fast + good quality |
| **Production (Optimized)** | Q3_HIFI | Best quality + smallest |
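
The matrix can also be read as a constrained choice: pick the lowest-perplexity format that fits a size budget and a speed floor. A minimal sketch using the measured values follows. `pick_format` is an illustrative helper, and it deliberately ignores setup effort, which the matrix treats as a separate priority.

```python
# Measured values from the Performance Metrics table.
MEASURED = {
    "Q3_K_S":  {"ppl": 35.85, "size_mib": 366.19, "tok_s": 179.80},
    "Q3_K_M":  {"ppl": 31.81, "size_mib": 389.12, "tok_s": 223.44},
    "Q3_HIFI": {"ppl": 31.22, "size_mib": 308.23, "tok_s": 189.23},
}

def pick_format(max_size_mib=None, min_tok_s=None):
    """Lowest-perplexity format meeting the optional constraints, or None."""
    candidates = [
        name for name, m in MEASURED.items()
        if (max_size_mib is None or m["size_mib"] <= max_size_mib)
        and (min_tok_s is None or m["tok_s"] >= min_tok_s)
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda name: MEASURED[name]["ppl"])

print(pick_format())                  # no constraints
print(pick_format(min_tok_s=200))     # speed floor
print(pick_format(max_size_mib=300))  # size budget that nothing meets
```

The speed figures come from a single machine, so any `min_tok_s` threshold only makes sense relative to the hardware used for this benchmark.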

---

## Conclusion

**For most users:**
- **Choose Q3_K_M** if speed is your priority and you want good quality without configuration
- **Choose Q3_HIFI** if quality and file size are your priorities and you can invest in setup
- **Choose Q3_K_S** if you want a simple, balanced option

**For quality-critical applications:**
Q3_HIFI is the clear winner, offering the best quality (31.22 perplexity) in the smallest package (308 MiB), though it requires more setup effort and runs about 15% slower than Q3_K_M.

**For speed-critical applications:**
Q3_K_M is the best choice, offering the fastest inference (223 tok/s) with good quality (31.81 perplexity) and zero configuration.

---

**Generated:** 2025-12-03  
**Test Dataset:** wikitext-2-raw/wiki.test.raw  
**Model:** Qwen3-0.6B  
**Evaluation Parameters:** --ppl-stride 0 --ppl-output-type 0 -b 2048 -c 512