geoffmunn committed on
Commit
7880ae3
·
verified ·
1 Parent(s): 04912b7

Add Q2–Q8_0 quantized models with per-model cards, MODELFILE, CLI examples, and auto-upload

MODELFILE CHANGED
@@ -7,11 +7,11 @@ f16: cpu
7
 
8
  # Chat template using ChatML (used by Qwen)
9
  prompt_template: >-
10
- <|im_start|>system
11
  You are a helpful assistant.<|im_end|>
12
- <|im_start|>user
13
  {prompt}<|im_end|>
14
- <|im_start|>assistant
15
 
16
  # Stop sequences help end generation cleanly
17
  stop: "<|im_end|>"
 
7
 
8
  # Chat template using ChatML (used by Qwen)
9
  prompt_template: >-
10
+ <|im_start|>system
11
  You are a helpful assistant.<|im_end|>
12
+ <|im_start|>user
13
  {prompt}<|im_end|>
14
+ <|im_start|>assistant
15
 
16
  # Stop sequences help end generation cleanly
17
  stop: "<|im_end|>"
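For reference, the ChatML template above can also be exercised by hand against one of the GGUF files. A minimal sketch, assuming a local llama.cpp build (`llama-cli`) and the Q4_K_M file in the current directory; adjust the binary name and path to your setup.

```bash
# Sketch only: apply the MODELFILE's ChatML template manually with llama.cpp.
# The llama-cli binary name and the model filename are assumptions about your setup.
./llama-cli -m ./Qwen3-0.6B-Q4_K_M.gguf \
  --temp 0.6 -n 128 \
  -r "<|im_end|>" \
  -p $'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nSay hello in five words.<|im_end|>\n<|im_start|>assistant\n'
```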
Qwen3-0.6B-Q2_K/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 347 MB
23
  - **Precision**: Q2_K
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 332 MB
24
  - **Precision**: Q2_K
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q2_K;2D",
101
+ "prompt": "Respond exactly as follows: Repeat the word 'hello' five times separated by commas.",
102
+ "temperature": 0.1,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
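As a concrete follow-up to the intent-classification use case listed under Usage Tips, here is a minimal sketch against the same Ollama endpoint; the HELP/OTHER labels and the sample message are illustrative assumptions, not a tested prompt.

```bash
# Sketch: single-word intent classification with the Q2_K quant via the Ollama API.
# Temperature 0 keeps the one-word label as deterministic as possible.
curl http://localhost:11434/api/generate -s -d '{
  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q2_K",
  "prompt": "Classify the intent of this message as HELP or OTHER. Reply with one word only.\nMessage: My order never arrived, what should I do?",
  "temperature": 0.0,
  "stream": false
}' | jq -r '.response'
```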
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q3_K_M/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 414 MB
23
  - **Precision**: Q3_K_M
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 395 MB
24
  - **Precision**: Q3_K_M
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q3_K_M;2D",
101
+ "prompt": "Respond exactly as follows: Repeat the word 'hello' five times separated by commas.",
102
+ "temperature": 0.1,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
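To check that tip on your own hardware, a rough latency comparison is easy to script. A minimal sketch, assuming both quant tags have already been pulled into Ollama; timings are indicative only.

```bash
# Sketch: send the same request to two quant levels and compare wall-clock latency.
for tag in Q3_K_M Q4_K_S; do
  echo "== $tag =="
  time curl http://localhost:11434/api/generate -s -d '{
    "model": "hf.co/geoffmunn/Qwen3-0.6B:'"$tag"'",
    "prompt": "Repeat the word hello five times separated by commas.",
    "stream": false
  }' | jq -r '.response'
done
```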
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q3_K_S/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 390 MB
23
  - **Precision**: Q3_K_S
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 372 MB
24
  - **Precision**: Q3_K_S
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q3_K_S;2D",
101
+ "prompt": "Respond exactly as follows: Repeat the word 'hello' five times separated by commas.",
102
+ "temperature": 0.1,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q4_K_M/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 484 MB
23
  - **Precision**: Q4_K_M
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 462 MB
24
  - **Precision**: Q4_K_M
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q4_K_M;2D",
101
+ "prompt": "Respond exactly as follows: Write a short joke about cats.",
102
+ "temperature": 0.8,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
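The example above sets `"stream": false` so the reply arrives as a single JSON object. A minimal sketch of the streaming variant, assuming the same Ollama endpoint; each chunk is a separate JSON object, so `jq` joins the incremental text as it arrives.

```bash
# Sketch: streamed generation. Ollama emits newline-delimited JSON chunks;
# jq -rj prints each chunk's "response" fragment without inserting newlines.
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q4_K_M",
  "prompt": "Write a short joke about cats.",
  "temperature": 0.8,
  "stream": true
}' | jq -rj '.response'
echo
```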
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q4_K_S/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 471 MB
23
  - **Precision**: Q4_K_S
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 449 MB
24
  - **Precision**: Q4_K_S
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q4_K_S;2D",
101
+ "prompt": "Respond exactly as follows: Repeat the word 'hello' five times separated by commas.",
102
+ "temperature": 0.1,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q5_K_M/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 551 MB
23
  - **Precision**: Q5_K_M
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 526 MB
24
  - **Precision**: Q5_K_M
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q5_K_M;2D",
101
+ "prompt": "Respond exactly as follows: Explain what gravity is in one sentence suitable for a child.",
102
+ "temperature": 0.6,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: `0.6` here keeps the explanation clear with a little natural variety; go lower (`0.1`) for deterministic output or higher (`0.8`) for jokes.
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q5_K_S/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 544 MB
23
  - **Precision**: Q5_K_S
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 519 MB
24
  - **Precision**: Q5_K_S
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q5_K_S;2D",
101
+ "prompt": "Respond exactly as follows: Write a short joke about cats.",
102
+ "temperature": 0.8,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q6_K/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 623 MB
23
  - **Precision**: Q6_K
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 594 MB
24
  - **Precision**: Q6_K
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q6_K;2D",
101
+ "prompt": "Respond exactly as follows: Explain what gravity is in one sentence suitable for a child.",
102
+ "temperature": 0.6,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: `0.6` here keeps the explanation clear with a little natural variety; go lower (`0.1`) for deterministic output or higher (`0.8`) for jokes.
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q8_0/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 805 MB
23
  - **Precision**: Q8_0
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 768 MB
24
  - **Precision**: Q8_0
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q8_0;2D",
101
+ "prompt": "Respond exactly as follows: Explain what gravity is in one sentence suitable for a child.",
102
+ "temperature": 0.6,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: `0.6` here keeps the explanation clear with a little natural variety; go lower (`0.1`) for deterministic output or higher (`0.8`) for jokes.
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
README.md CHANGED
@@ -1,29 +1,34 @@
1
  ---
2
  license: apache-2.0
3
  tags:
4
- - gguf
5
- - qwen
6
- - llama.cpp
7
- - quantized
8
- - text-generation
9
- - tiny-model
10
- - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
 
13
  language:
14
- - en
 
15
  ---
16
 
17
  # Qwen3-0.6B-GGUF
18
 
19
- This is a **GGUF-quantized version** of the **[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)** language model — a compact **600M-parameter** LLM designed for **ultra-fast inference on low-resource devices**.
20
 
21
- Converted for use with `llama.cpp` and compatible tools like OpenWebUI, LM Studio, GPT4All, and more.
22
 
23
  > ⚠️ **Note**: This is a *very small* model. It will not match larger models (e.g., 4B+) in reasoning, coding, or factual accuracy. However, it shines in **speed, portability, and efficiency**.
24
 
25
  ## Available Quantizations (from f16)
26
 
 
 
27
  | Level | Quality | Speed | Size | Recommendation |
28
  |----------|--------------|----------|-----------|----------------|
29
  | Q2_K | Minimal | ⚡ Fastest | 347 MB | Use only on severely constrained systems (e.g., Raspberry Pi). Severely degraded output. |
@@ -70,14 +75,11 @@ Load this model using:
70
 
71
  Each model includes its own `README.md` and `MODELFILE` for optimal configuration.
72
 
73
- ## Verification
74
-
75
- Use \`SHA256SUMS.txt\` to verify file integrity:
76
 
77
- ```bash
78
- sha256sum -c SHA256SUMS.txt
79
- ```
80
 
81
- ## License
82
 
83
- Apache 2.0 see base model for full terms.
 
1
  ---
2
  license: apache-2.0
3
  tags:
4
+ - gguf
5
+ - qwen
6
+ - llama.cpp
7
+ - quantized
8
+ - text-generation
9
+ - chat
10
+ - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
+ pipeline_tag: text-generation
15
  language:
16
+ - en
17
+ - zh
18
  ---
19
 
20
  # Qwen3-0.6B-GGUF
21
 
22
+ This is a **GGUF-quantized version** of the **[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)** language model — a compact **600-million-parameter** LLM designed for **ultra-fast inference on low-resource devices**.
23
 
24
+ Converted for use with `llama.cpp`, [LM Studio](https://lmstudio.ai), [OpenWebUI](https://openwebui.com), and [GPT4All](https://gpt4all.io), enabling private AI anywhere — even offline.
25
 
26
  > ⚠️ **Note**: This is a *very small* model. It will not match larger models (e.g., 4B+) in reasoning, coding, or factual accuracy. However, it shines in **speed, portability, and efficiency**.
27
 
28
  ## Available Quantizations (from f16)
29
 
30
+ These variants were built from an **f16** base model to ensure consistency across quant levels.
31
+
32
  | Level | Quality | Speed | Size | Recommendation |
33
  |----------|--------------|----------|-----------|----------------|
34
  | Q2_K | Minimal | ⚡ Fastest | 347 MB | Use only on severely constrained systems (e.g., Raspberry Pi). Severely degraded output. |
 
75
 
76
  Each model includes its own `README.md` and `MODELFILE` for optimal configuration.
77
 
78
+ ## Author
 
 
79
 
80
+ 👤 Geoff Munn (@geoffmunn)
81
+ 🔗 [Hugging Face Profile](https://huggingface.co/geoffmunn)
 
82
 
83
+ ## Disclaimer
84
 
85
+ This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.