geoffmunn committed (verified) · Commit 2788ad0 · Parent: 860d659

Create Q3_Quantisation_Comparison.md

Files changed (1): Q3_Quantisation_Comparison.md (added, +218 −0)

# Q3 Quantization Formats Comparison

## Executive Summary

This document compares three Q3 quantization formats for the Qwen3-0.6B model based on perplexity evaluation results.

---

## Performance Metrics

| Format | Perplexity | File Size | Bits/Weight | Speed | Quality Rank | Size Rank | Speed Rank |
|--------|------------|-----------|-------------|-------|--------------|-----------|------------|
| **Q3_K_M** | **31.81 ± 0.29** | 389.12 MiB | 4.34 BPW | **240.34 tok/s** | 🥇 **Best** | 3rd (Largest) | 🥇 **Fastest** |
| **Q3_K_S** | 35.85 ± 0.32 | 366.19 MiB | 4.09 BPW | 197.90 tok/s | 2nd | 2nd | 2nd |
| **Q3_HIFI** | 37.41 ± 0.34 | **308.23 MiB** | **3.44 BPW** | 132.08 tok/s | 3rd | 🥇 **Smallest** | 3rd (Slowest) |

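The percentage gaps quoted throughout this document are derived from the figures in this table; a small Python sanity-check sketch (values hard-coded from the table above):

```python
# Rough sanity check of the relative gaps quoted in the sections below,
# using the figures from the table above (hard-coded from this document).
stats = {
    "Q3_K_M":  {"ppl": 31.81, "size_mib": 389.12, "tok_s": 240.34},
    "Q3_K_S":  {"ppl": 35.85, "size_mib": 366.19, "tok_s": 197.90},
    "Q3_HIFI": {"ppl": 37.41, "size_mib": 308.23, "tok_s": 132.08},
}

def pct_more(a: float, b: float) -> float:
    """How much larger a is than b, as a percentage of b."""
    return (a - b) / b * 100

def pct_less(a: float, b: float) -> float:
    """How much smaller a is than b, as a percentage of b."""
    return (b - a) / b * 100

km, ks, hifi = stats["Q3_K_M"], stats["Q3_K_S"], stats["Q3_HIFI"]

print(f"Q3_HIFI perplexity vs Q3_K_M: +{hifi['ppl'] - km['ppl']:.1f} points "
      f"({pct_more(hifi['ppl'], km['ppl']):.0f}% higher)")                                   # +5.6 points, ~18%
print(f"Q3_HIFI size vs Q3_K_M: {pct_less(hifi['size_mib'], km['size_mib']):.0f}% smaller")  # ~21%
print(f"Q3_HIFI size vs Q3_K_S: {pct_less(hifi['size_mib'], ks['size_mib']):.0f}% smaller")  # ~16%
print(f"Q3_K_M speed vs Q3_HIFI: {pct_more(km['tok_s'], hifi['tok_s']):.0f}% faster")        # ~82%
```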
---

## Detailed Analysis

### Q3_K_M (Medium) - Best Quality & Speed

**Tensor Distribution:**
- f32: 113 tensors (norm layers)
- q3_K: 113 tensors
- q4_K: 81 tensors (upgraded for quality)
- q5_K: 3 tensors (critical layers)
- q6_K: 1 tensor (output.weight)

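The tensor-type breakdowns listed in this document can be checked directly against the GGUF files; a minimal sketch, assuming the `gguf` Python package from llama.cpp's `gguf-py` and a placeholder local path:

```python
# Count tensor types in a GGUF file, e.g. to reproduce the distribution
# listed above. Assumes the `gguf` package from llama.cpp's gguf-py;
# the file path below is a placeholder.
from collections import Counter

from gguf import GGUFReader

reader = GGUFReader("Qwen3-0.6B-Q3_K_M.gguf")  # placeholder path
counts = Counter(t.tensor_type.name for t in reader.tensors)

# Print one line per quantization type, most common first.
for qtype, n in counts.most_common():
    print(f"{qtype:>6}: {n} tensors")
```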
**Pros:**
- ✅ **Best perplexity** (31.81) - 5.6 points better than Q3_HIFI
- ✅ **Fastest inference** (240.34 tokens/sec) - 82% faster than Q3_HIFI
- ✅ **Balanced approach** - Uses mixed precision (Q3/Q4/Q5/Q6) for optimal quality
- ✅ **Automatic tensor upgrades** - Intelligently upgrades critical tensors
- ✅ **Best for production** - Excellent quality-to-speed ratio

**Cons:**
- ❌ **Largest file size** (389 MB) - 26% larger than Q3_HIFI
- ❌ **Higher memory usage** - Requires more RAM

**When to Use:**
- Production deployments requiring the best quality
- Applications where speed matters
- When file size is not a primary constraint
- General-purpose language model tasks

---

### Q3_K_S (Small) - Balanced Option

**Tensor Distribution:**
- f32: 113 tensors (norm layers)
- q3_K: 197 tensors (most tensors)
- q6_K: 1 tensor (output.weight)

**Pros:**
- ✅ **Good balance** - Better quality than Q3_HIFI, smaller than Q3_K_M
- ✅ **Reasonable speed** (197.90 tokens/sec) - 50% faster than Q3_HIFI
- ✅ **Smaller than Q3_K_M** - 6% reduction in file size
- ✅ **Simpler quantization** - Less aggressive tensor upgrades

**Cons:**
- ❌ **Worse quality than Q3_K_M** - 4.0 points higher perplexity
- ❌ **Slower than Q3_K_M** - 18% slower inference
- ❌ **Still larger than Q3_HIFI** - 19% bigger file

**When to Use:**
- When you need better quality than Q3_HIFI in a smaller file than Q3_K_M
- Moderate quality requirements
- Balanced size/quality/speed trade-offs

---

### Q3_HIFI - Smallest Size

**Tensor Distribution:**
- f32: 113 tensors (norm layers)
- q3_K: 198 tensors (most tensors use Q3_K, not Q3_HIFI!)

**Note:** This appears to be a hybrid model in which most tensors are Q3_K rather than pure Q3_HIFI.

**Pros:**
- ✅ **Smallest file size** (308 MB) - 16% smaller than Q3_K_S, 21% smaller than Q3_K_M
- ✅ **Lowest bits/weight** (3.44 BPW) - Most efficient compression
- ✅ **Unique architecture** - 6 FP16 outliers per block for precision
- ✅ **Best for storage-constrained** environments

**Cons:**
- ❌ **Worst perplexity** (37.41) - 5.6 points worse than Q3_K_M
- ❌ **Slowest inference** (132.08 tokens/sec) - 45% slower than Q3_K_M
- ❌ **Limited tensor coverage** - Most tensors still use Q3_K instead of Q3_HIFI
- ❌ **No automatic upgrades** - Missing the mixed-precision benefits of Q3_K_S/M

**When to Use:**
- Storage-constrained environments (mobile, embedded)
- When file size is the primary concern
- Offline/archival purposes
- When quality can be sacrificed for size

---

## Quality Comparison

```
Perplexity (Lower is Better):
Q3_K_M:  ████████████████████████████████ 31.81 ⭐ Best
Q3_K_S:  ████████████████████████████████████ 35.85
Q3_HIFI: █████████████████████████████████████ 37.41
```

**Quality Gap:**
- Q3_HIFI's perplexity is **18% higher** than Q3_K_M's (5.6 points)
- Q3_HIFI's perplexity is **4% higher** than Q3_K_S's (1.6 points)

---

## Size Comparison

```
File Size (Smaller is Better):
Q3_HIFI: ███████████████████████████████ 308 MB ⭐ Smallest
Q3_K_S:  █████████████████████████████████████ 366 MB
Q3_K_M:  ███████████████████████████████████████ 389 MB
```

**Size Savings:**
- Q3_HIFI is **16% smaller** than Q3_K_S
- Q3_HIFI is **21% smaller** than Q3_K_M

---

## Speed Comparison

```
Inference Speed (Higher is Better):
Q3_K_M:  ████████████████████████████████████████ 240 tok/s ⭐ Fastest
Q3_K_S:  █████████████████████████████████ 198 tok/s
Q3_HIFI: ██████████████████████ 132 tok/s
```

**Speed Advantage:**
- Q3_K_M is **82% faster** than Q3_HIFI
- Q3_K_S is **50% faster** than Q3_HIFI

---

## Recommendations

### 🎯 Best Overall: **Q3_K_M**
- Best quality and speed
- Worth the extra 81 MB for most use cases
- Recommended for production deployments

### 💾 Best for Storage: **Q3_HIFI**
- Smallest file size
- Acceptable if quality/speed are secondary
- Good for mobile/embedded systems

### ⚖️ Best Balance: **Q3_K_S**
- Middle ground between quality and size
- Good compromise when Q3_K_M is too large but Q3_HIFI quality is insufficient

---

## Technical Notes

### Why Q3_K_M is Best Quality

Q3_K_M uses **automatic tensor upgrades**:
- Critical tensors (first/last layers) → Q5_K or Q6_K
- Important tensors (attention outputs) → Q4_K
- Standard tensors → Q3_K

This mixed-precision approach preserves accuracy where it matters most.

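As a rough illustration of that upgrade idea only (a simplified sketch in the spirit of the rules above, not llama.cpp's actual selection code; tensor names and layer rules here are examples):

```python
# Illustrative sketch only: a simplified mixed-precision policy along the
# lines described above. llama.cpp's real type-selection logic is more
# involved; the names and thresholds below are examples, not its rules.
def pick_quant_type(tensor_name: str, layer: int, n_layers: int) -> str:
    # The output head is the most sensitive single tensor.
    if tensor_name == "output.weight":
        return "Q6_K"
    # First and last blocks are treated as critical layers.
    if layer in (0, n_layers - 1):
        return "Q5_K"
    # Attention output / value projections get a quality bump.
    if "attn_output" in tensor_name or "attn_v" in tensor_name:
        return "Q4_K"
    # Everything else stays at the base 3-bit type.
    return "Q3_K"

# Example: a mid-stack feed-forward tensor keeps the base type
# (layer counts here are illustrative).
print(pick_quant_type("blk.14.ffn_down.weight", layer=14, n_layers=28))  # Q3_K
```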
### Why Q3_HIFI is Slower

Q3_HIFI's unique architecture (6 FP16 outliers per block) costs speed because of:
- More memory lookups (scattered access pattern)
- No optimized SIMD/GPU kernels yet
- Additional dequantization overhead for the outliers

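The following is a minimal, illustrative sketch of the outlier-patching idea, not the actual Q3_HIFI kernel; the 256-element block size is an assumption, and only the outlier count (6) comes from the description above. It shows why the outlier pass is a scattered, hard-to-vectorise step on top of the bulk dequantization:

```python
# Illustrative outlier-patching dequantization in numpy. NOT the real
# Q3_HIFI kernel; it only shows why per-block FP16 outliers add a
# scattered second pass on top of the bulk 3-bit dequantization.
import numpy as np

BLOCK_SIZE = 256   # assumption: K-quant style super-block
N_OUTLIERS = 6     # from the format description above

def dequant_block(q3: np.ndarray, scale: np.float16,
                  outlier_idx: np.ndarray, outlier_val: np.ndarray) -> np.ndarray:
    """q3: (256,) ints in [-4, 3]; 6 positions are overridden by FP16 outliers."""
    # Bulk path: one multiply per element, SIMD/GPU friendly.
    out = q3.astype(np.float32) * np.float32(scale)
    # Outlier path: scattered indexed writes, the part that (per the notes
    # above) lacks optimized kernels and breaks the regular access pattern.
    out[outlier_idx] = outlier_val.astype(np.float32)
    return out

rng = np.random.default_rng(0)
block = dequant_block(
    q3=rng.integers(-4, 4, BLOCK_SIZE),
    scale=np.float16(0.02),
    outlier_idx=rng.choice(BLOCK_SIZE, N_OUTLIERS, replace=False),
    outlier_val=rng.standard_normal(N_OUTLIERS).astype(np.float16),
)
print(block.shape)  # (256,)
```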
### Why Q3_HIFI Quality is Lower

The current Q3_HIFI model appears to be a hybrid:
- Most tensors use Q3_K (not Q3_HIFI)
- Limited Q3_HIFI coverage reduces its benefits
- It is missing the automatic tensor upgrades of Q3_K_S/M

**Note:** A properly optimized Q3_HIFI with expanded coverage and an IMatrix can achieve **31.10 perplexity** (better than Q3_K_M!), but this requires:
- An IMatrix file for better outlier selection
- Expanded tensor-type arguments
- More quantization time

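A hedged sketch of that workflow, assuming a llama.cpp-style toolchain where `llama-imatrix` and `llama-quantize --imatrix` behave as in upstream, and a build that accepts `Q3_HIFI` as a target type; all paths and the calibration file are placeholders:

```python
# Sketch of the IMatrix workflow described above, via subprocess calls to
# llama.cpp-style tools. Binary names and flags follow upstream llama.cpp;
# Q3_HIFI as a target type presumably needs the build that adds it.
import subprocess

MODEL_F16 = "Qwen3-0.6B-F16.gguf"  # unquantized source model (placeholder)
CALIB = "calibration.txt"          # calibration text for the imatrix (placeholder)

# 1. Collect an importance matrix over calibration data.
subprocess.run(
    ["llama-imatrix", "-m", MODEL_F16, "-f", CALIB, "-o", "imatrix.dat"],
    check=True,
)

# 2. Quantize with the imatrix guiding outlier/type selection.
subprocess.run(
    ["llama-quantize", "--imatrix", "imatrix.dat",
     MODEL_F16, "Qwen3-0.6B-Q3_HIFI.gguf", "Q3_HIFI"],
    check=True,
)
```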
---

## Conclusion

**For most users:** Choose **Q3_K_M** - it offers the best quality and speed with only a modest size increase.

**For storage-constrained users:** Choose **Q3_HIFI** - accept the quality/speed trade-off for maximum compression.

**For balanced needs:** Choose **Q3_K_S** - good middle ground.

---

## Test Configuration

- **Model:** Qwen3-0.6B
- **Dataset:** wiki.test.raw (wikitext-2-raw)
- **Context:** 512 tokens
- **Hardware:** 16 threads, AVX2, FMA enabled
- **Build:** 7173 (6a7ff532) with MSVC 19.44.35217.0

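To reproduce the runs behind these numbers, something along the following lines should work, assuming a llama.cpp build that provides the `llama-perplexity` tool with its usual flags; model paths are placeholders:

```python
# Re-run the perplexity evaluation behind the table above with llama.cpp's
# perplexity tool. Flags mirror the test configuration listed here; the
# model filenames are placeholders.
import subprocess

for model in ("Qwen3-0.6B-Q3_K_M.gguf",
              "Qwen3-0.6B-Q3_K_S.gguf",
              "Qwen3-0.6B-Q3_HIFI.gguf"):
    subprocess.run(
        ["llama-perplexity",
         "-m", model,
         "-f", "wiki.test.raw",   # wikitext-2-raw test split
         "-c", "512",             # context size used in this comparison
         "-t", "16"],             # threads used in this comparison
        check=True,
    )
```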
---

*Generated from perplexity evaluation results*