Add Q2–Q8_0 quantized models with per-model cards, MODELFILE, CLI examples, and auto-upload
Changed files:

- MODELFILE (+3 -3)
- Qwen3-0.6B-Q2_K/README.md (+58 -4)
- Qwen3-0.6B-Q3_K_M/README.md (+58 -4)
- Qwen3-0.6B-Q3_K_S/README.md (+58 -4)
- Qwen3-0.6B-Q4_K_M/README.md (+58 -4)
- Qwen3-0.6B-Q4_K_S/README.md (+58 -4)
- Qwen3-0.6B-Q5_K_M/README.md (+58 -4)
- Qwen3-0.6B-Q5_K_S/README.md (+58 -4)
- Qwen3-0.6B-Q6_K/README.md (+58 -4)
- Qwen3-0.6B-Q8_0/README.md (+58 -4)
- README.md (+20 -18)
MODELFILE CHANGED

```diff
@@ -7,11 +7,11 @@ f16: cpu
 
 # Chat template using ChatML (used by Qwen)
 prompt_template: >-
-
+  <|im_start|>system
   You are a helpful assistant.<|im_end|>
-
+  <|im_start|>user
   {prompt}<|im_end|>
-
+  <|im_start|>assistant
 
 # Stop sequences help end generation cleanly
 stop: "<|im_end|>"
```
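For reference, the rendered prompt the updated template should produce looks like this — a minimal sketch, assuming the runtime substitutes `{prompt}` with the user's input verbatim:

```bash
# Hypothetical rendering of the updated ChatML template; {prompt} is the
# literal placeholder the runtime replaces with the user's message.
cat <<'EOF'
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
EOF
```

Generation then continues from the trailing `<|im_start|>assistant` line and halts at the configured `stop: "<|im_end|>"` sequence.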
Qwen3-0.6B-Q2_K/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 332M
 - **Precision**: Q2_K
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q2_K",
+  "prompt": "Repeat the word \"hello\" five times separated by commas.",
+  "temperature": 0.1,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
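For interactive use without hand-writing JSON, Ollama can also pull GGUF quantizations directly from Hugging Face by repo path and tag — a sketch, assuming the `hf.co/geoffmunn/Qwen3-0.6B` repository exposes the `Q2_K` tag used in the card above:

```bash
# Pull the Q2_K quantization straight from Hugging Face, then run a
# one-off prompt suited to a tiny model (intent classification).
ollama pull hf.co/geoffmunn/Qwen3-0.6B:Q2_K
ollama run hf.co/geoffmunn/Qwen3-0.6B:Q2_K \
  "Classify this message as a question or a statement: Where is the dock?"
```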
Qwen3-0.6B-Q3_K_M/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 395M
 - **Precision**: Q3_K_M
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q3_K_M",
+  "prompt": "Repeat the word \"hello\" five times separated by commas.",
+  "temperature": 0.1,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
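The card's example sets `"stream": false` to receive one JSON object. For token-by-token output, a streaming variant is a one-field change — a sketch, assuming Ollama's line-delimited JSON streaming format:

```bash
# With "stream": true, each output line is a JSON object whose
# .response field carries the next text fragment; jq -j joins them.
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q3_K_M",
  "prompt": "List three colours.",
  "stream": true
}' | jq -rj '.response'
```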
Qwen3-0.6B-Q3_K_S/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 372M
 - **Precision**: Q3_K_S
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q3_K_S",
+  "prompt": "Repeat the word \"hello\" five times separated by commas.",
+  "temperature": 0.1,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
Qwen3-0.6B-Q4_K_M/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 462M
 - **Precision**: Q4_K_M
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q4_K_M",
+  "prompt": "Write a short joke about cats.",
+  "temperature": 0.8,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
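Because the compatibility list ends with "Directly via `llama.cpp`", a serverless one-shot run is worth showing too — a sketch, assuming a recent llama.cpp build whose CLI binary is named `llama-cli` and a locally downloaded `Qwen3-0.6B-Q4_K_M.gguf`:

```bash
# One-shot generation straight from the GGUF file, mirroring the
# card's recommended sampling defaults for creative output.
llama-cli -m ./Qwen3-0.6B-Q4_K_M.gguf \
  --temp 0.8 --top-p 0.95 --top-k 20 --repeat-penalty 1.1 \
  -n 64 -p "Write a short joke about cats."
```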
Qwen3-0.6B-Q4_K_S/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 449M
 - **Precision**: Q4_K_S
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q4_K_S",
+  "prompt": "Repeat the word \"hello\" five times separated by commas.",
+  "temperature": 0.1,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
Qwen3-0.6B-Q5_K_M/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 526M
 - **Precision**: Q5_K_M
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q5_K_M",
+  "prompt": "Explain what gravity is in one sentence suitable for a child.",
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
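The same request against a Text Generation Inference server takes a different route and payload shape — a sketch, assuming TGI is serving this model on port 8080 with its standard `/generate` endpoint:

```bash
# TGI expects "inputs" plus a nested "parameters" object rather than
# Ollama's flat field layout; the reply carries .generated_text.
curl http://localhost:8080/generate -s \
  -H 'Content-Type: application/json' \
  -d '{
    "inputs": "Explain what gravity is in one sentence suitable for a child.",
    "parameters": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "max_new_tokens": 64}
  }' | jq -r '.generated_text'
```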
Qwen3-0.6B-Q5_K_S/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 519M
 - **Precision**: Q5_K_S
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q5_K_S",
+  "prompt": "Write a short joke about cats.",
+  "temperature": 0.8,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
Qwen3-0.6B-Q6_K/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 594M
 - **Precision**: Q6_K
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q6_K",
+  "prompt": "Explain what gravity is in one sentence suitable for a child.",
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
Qwen3-0.6B-Q8_0/README.md CHANGED

````diff
@@ -6,8 +6,9 @@ tags:
 - llama.cpp
 - quantized
 - text-generation
-
+- chat
 - edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
 ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
 ## Model Info
 
 - **Format**: GGUF (for llama.cpp and compatible runtimes)
-- **Size**:
+- **Size**: 768M
 - **Precision**: Q8_0
 - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
 - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
 | Min-P | 0.0 |
 | Repeat Penalty | 1.1 |
 
-Stop sequences:
+Stop sequences: `<|im_end|>`, `<|im_start|>`
+
+> ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
+
+## 💡 Usage Tips
+
+> This model is best suited for lightweight tasks:
+>
+> ### ✅ Ideal Uses
+> - Quick replies and canned responses
+> - Intent classification (e.g., “Is this user asking for help?”)
+> - UI prototyping and local AI testing
+> - Embedded/NPU deployment
+>
+> ### ❌ Limitations
+> - No complex reasoning or multi-step logic
+> - Poor math and code generation
+> - Limited world knowledge
+> - May repeat or hallucinate frequently at higher temps
+>
+> ---
+>
+> 🔄 **Fast Iteration Friendly**
+> Perfect for developers building prompt templates or testing UI integrations.
+>
+> 🔋 **Runs on Almost Anything**
+> Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
+>
+> 📦 **Tiny Footprint**
+> Fits easily on USB drives, microSD cards, or IoT devices.
+
+## 🖥️ CLI Example Using Ollama or a TGI Server
+
+Here’s how to query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
+
+```bash
+curl http://localhost:11434/api/generate -s -N -d '{
+  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q8_0",
+  "prompt": "Explain what gravity is in one sentence suitable for a child.",
+  "temperature": 0.6,
+  "top_p": 0.95,
+  "top_k": 20,
+  "min_p": 0.0,
+  "repeat_penalty": 1.1,
+  "stream": false
+}' | jq -r '.response'
+```
+
+🎯 **Why this works well**:
+- The prompt is meaningful yet achievable for a tiny model.
+- Temperature is tuned to the task: lower for deterministic output (`0.1`), higher for creative tasks like jokes (`0.8`).
+- Uses `jq` to extract a clean response.
+
+> 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
 
 ## Verification
 
@@ -75,7 +129,7 @@ Compatible with:
 - [LM Studio](https://lmstudio.ai) – local AI model runner
 - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
 - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
-- Directly via
+- Directly via `llama.cpp`
 
 ## License
 
````
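With every level published under one repository, comparing quantizations side by side is a short loop — a sketch, assuming a local Ollama server and the tag names used throughout these cards:

```bash
# Ask selected quantization levels the same question and label each
# answer, making quality differences easy to eyeball.
for q in Q2_K Q4_K_M Q8_0; do
  echo "== $q =="
  curl http://localhost:11434/api/generate -s -d "{
    \"model\": \"hf.co/geoffmunn/Qwen3-0.6B:$q\",
    \"prompt\": \"Explain what gravity is in one sentence suitable for a child.\",
    \"temperature\": 0.6,
    \"stream\": false
  }" | jq -r '.response'
done
```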
README.md CHANGED

````diff
@@ -1,29 +1,34 @@
 ---
 license: apache-2.0
 tags:
-- gguf
-- qwen
-- llama.cpp
-- quantized
-- text-generation
-
-- edge-ai
+- gguf
+- qwen
+- llama.cpp
+- quantized
+- text-generation
+- chat
+- edge-ai
+- tiny-model
 base_model: Qwen/Qwen3-0.6B
 author: geoffmunn
+pipeline_tag: text-generation
 language:
-- en
+- en
+- zh
 ---
 
 # Qwen3-0.6B-GGUF
 
-This is a **GGUF-quantized version** of the **[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)** language model — a compact **
+This is a **GGUF-quantized version** of the **[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)** language model — a compact **600-million-parameter** LLM designed for **ultra-fast inference on low-resource devices**.
 
-Converted for use with `llama.cpp
+Converted for use with `llama.cpp`, [LM Studio](https://lmstudio.ai), [OpenWebUI](https://openwebui.com), and [GPT4All](https://gpt4all.io), enabling private AI anywhere — even offline.
 
 > ⚠️ **Note**: This is a *very small* model. It will not match larger models (e.g., 4B+) in reasoning, coding, or factual accuracy. However, it shines in **speed, portability, and efficiency**.
 
 ## Available Quantizations (from f16)
 
+These variants were built from an **f16** base model to ensure consistency across quant levels.
+
 | Level | Quality | Speed | Size | Recommendation |
 |----------|--------------|----------|-----------|----------------|
 | Q2_K | Minimal | ⚡ Fastest | 347 MB | Use only on severely constrained systems (e.g., Raspberry Pi). Severely degraded output. |
@@ -70,14 +75,11 @@ Load this model using:
 
 Each model includes its own `README.md` and `MODELFILE` for optimal configuration.
 
-##
-
-Use \`SHA256SUMS.txt\` to verify file integrity:
-
-
-```
-
-##
+## Author
 
+👤 Geoff Munn (@geoffmunn)
+🔗 [Hugging Face Profile](https://huggingface.co/geoffmunn)
 
+## Disclaimer
 
+This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.
````