geoffmunn committed on
Commit
7880ae3
·
verified ·
1 Parent(s): 04912b7

Add Q2–Q8_0 quantized models with per-model cards, MODELFILE, CLI examples, and auto-upload

MODELFILE CHANGED
@@ -7,11 +7,11 @@ f16: cpu
7
 
8
  # Chat template using ChatML (used by Qwen)
9
  prompt_template: >-
10
- <|im_start|>system
11
  You are a helpful assistant.<|im_end|>
12
- <|im_start|>user
13
  {prompt}<|im_end|>
14
- <|im_start|>assistant
15
 
16
  # Stop sequences help end generation cleanly
17
  stop: "<|im_end|>"
 
7
 
8
  # Chat template using ChatML (used by Qwen)
9
  prompt_template: >-
10
+ <|im_start|>system
11
  You are a helpful assistant.<|im_end|>
12
+ <|im_start|>user
13
  {prompt}<|im_end|>
14
+ <|im_start|>assistant
15
 
16
  # Stop sequences help end generation cleanly
17
  stop: "<|im_end|>"
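For reference, the ChatML template above can also be exercised by hand against one of the GGUF files. A minimal sketch, assuming a local llama.cpp build (`llama-cli`) and the Q4_K_M file in the current directory; adjust the binary name and path to your setup.

```bash
# Sketch only: apply the MODELFILE's ChatML template manually with llama.cpp.
# The llama-cli binary name and the model filename are assumptions about your setup.
./llama-cli -m ./Qwen3-0.6B-Q4_K_M.gguf \
  --temp 0.6 -n 128 \
  -r "<|im_end|>" \
  -p $'<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nSay hello in five words.<|im_end|>\n<|im_start|>assistant\n'
```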
Qwen3-0.6B-Q2_K/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 347 MB
23
  - **Precision**: Q2_K
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 332 MB
24
  - **Precision**: Q2_K
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q2_K;2D",
101
+ "prompt": "Respond exactly as follows: Repeat the word 'hello' five times separated by commas.",
102
+ "temperature": 0.1,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
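As a concrete follow-up to the intent-classification use case listed under Usage Tips, here is a minimal sketch against the same Ollama endpoint; the HELP/OTHER labels and the sample message are illustrative assumptions, not a tested prompt.

```bash
# Sketch: single-word intent classification with the Q2_K quant via the Ollama API.
# Temperature 0 keeps the one-word label as deterministic as possible.
curl http://localhost:11434/api/generate -s -d '{
  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q2_K",
  "prompt": "Classify the intent of this message as HELP or OTHER. Reply with one word only.\nMessage: My order never arrived, what should I do?",
  "temperature": 0.0,
  "stream": false
}' | jq -r '.response'
```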
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q3_K_M/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 414 MB
23
  - **Precision**: Q3_K_M
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 395 MB
24
  - **Precision**: Q3_K_M
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q3_K_M;2D",
101
+ "prompt": "Respond exactly as follows: Repeat the word 'hello' five times separated by commas.",
102
+ "temperature": 0.1,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
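To check that tip on your own hardware, a rough latency comparison is easy to script. A minimal sketch, assuming both quant tags have already been pulled into Ollama; timings are indicative only.

```bash
# Sketch: send the same request to two quant levels and compare wall-clock latency.
for tag in Q3_K_M Q4_K_S; do
  echo "== $tag =="
  time curl http://localhost:11434/api/generate -s -d '{
    "model": "hf.co/geoffmunn/Qwen3-0.6B:'"$tag"'",
    "prompt": "Repeat the word hello five times separated by commas.",
    "stream": false
  }' | jq -r '.response'
done
```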
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q3_K_S/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 390 MB
23
  - **Precision**: Q3_K_S
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 372 MB
24
  - **Precision**: Q3_K_S
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q3_K_S;2D",
101
+ "prompt": "Respond exactly as follows: Repeat the word 'hello' five times separated by commas.",
102
+ "temperature": 0.1,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q4_K_M/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 484 MB
23
  - **Precision**: Q4_K_M
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 462 MB
24
  - **Precision**: Q4_K_M
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q4_K_M;2D",
101
+ "prompt": "Respond exactly as follows: Write a short joke about cats.",
102
+ "temperature": 0.8,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
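The example above sets `"stream": false` so the reply arrives as a single JSON object. A minimal sketch of the streaming variant, assuming the same Ollama endpoint; each chunk is a separate JSON object, so `jq` joins the incremental text as it arrives.

```bash
# Sketch: streamed generation. Ollama emits newline-delimited JSON chunks;
# jq -rj prints each chunk's "response" fragment without inserting newlines.
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-0.6B:Q4_K_M",
  "prompt": "Write a short joke about cats.",
  "temperature": 0.8,
  "stream": true
}' | jq -rj '.response'
echo
```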
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q4_K_S/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 471 MB
23
  - **Precision**: Q4_K_S
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 449 MB
24
  - **Precision**: Q4_K_S
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q4_K_S;2D",
101
+ "prompt": "Respond exactly as follows: Repeat the word 'hello' five times separated by commas.",
102
+ "temperature": 0.1,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q5_K_M/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 551 MB
23
  - **Precision**: Q5_K_M
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 526 MB
24
  - **Precision**: Q5_K_M
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q5_K_M;2D",
101
+ "prompt": "Respond exactly as follows: Explain what gravity is in one sentence suitable for a child.",
102
+ "temperature": 0.6,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: `0.6` here keeps the explanation clear with a little natural variety; go lower (`0.1`) for deterministic output or higher (`0.8`) for jokes.
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q5_K_S/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 544 MB
23
  - **Precision**: Q5_K_S
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 519 MB
24
  - **Precision**: Q5_K_S
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q5_K_S;2D",
101
+ "prompt": "Respond exactly as follows: Write a short joke about cats.",
102
+ "temperature": 0.8,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: lower for deterministic output (`0.1`), higher for jokes (`0.8`).
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q6_K/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 623 MB
23
  - **Precision**: Q6_K
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 594 MB
24
  - **Precision**: Q6_K
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q6_K;2D",
101
+ "prompt": "Respond exactly as follows: Explain what gravity is in one sentence suitable for a child.",
102
+ "temperature": 0.6,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: `0.6` here keeps the explanation clear with a little natural variety; go lower (`0.1`) for deterministic output or higher (`0.8`) for jokes.
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
Qwen3-0.6B-Q8_0/README.md CHANGED
@@ -6,8 +6,9 @@ tags:
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
- - tiny-model
10
  - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
13
  ---
@@ -19,7 +20,7 @@ Quantized version of [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) a
19
  ## Model Info
20
 
21
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
22
- - **Size**: 805 MB
23
  - **Precision**: Q8_0
24
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
25
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
@@ -59,7 +60,60 @@ Recommended defaults:
59
  | Min-P | 0.0 |
60
  | Repeat Penalty | 1.1 |
61
 
62
- Stop sequences: \`<|im_end|>\`, \`<|im_start|>\`
63
 
64
  ## Verification
65
 
@@ -75,7 +129,7 @@ Compatible with:
75
  - [LM Studio](https://lmstudio.ai) – local AI model runner
76
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
77
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
78
- - Directly via \`llama.cpp\`
79
 
80
  ## License
81
 
 
6
  - llama.cpp
7
  - quantized
8
  - text-generation
9
+ - chat
10
  - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
  ---
 
20
  ## Model Info
21
 
22
  - **Format**: GGUF (for llama.cpp and compatible runtimes)
23
+ - **Size**: 768 MB
24
  - **Precision**: Q8_0
25
  - **Base Model**: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)
26
  - **Conversion Tool**: [llama.cpp](https://github.com/ggerganov/llama.cpp)
 
60
  | Min-P | 0.0 |
61
  | Repeat Penalty | 1.1 |
62
 
63
+ Stop sequences: `<|im_end|>`, `<|im_start|>`
64
+
65
+ > ⚠️ Due to model size, avoid temperatures above 0.9 — outputs become highly unpredictable.
66
+
67
+ ## 💡 Usage Tips
68
+
69
+ > This model is best suited for lightweight tasks:
70
+ >
71
+ > ### ✅ Ideal Uses
72
+ > - Quick replies and canned responses
73
+ > - Intent classification (e.g., “Is this user asking for help?”)
74
+ > - UI prototyping and local AI testing
75
+ > - Embedded/NPU deployment
76
+ >
77
+ > ### ❌ Limitations
78
+ > - No complex reasoning or multi-step logic
79
+ > - Poor math and code generation
80
+ > - Limited world knowledge
81
+ > - May repeat or hallucinate frequently at higher temps
82
+ >
83
+ > ---
84
+ >
85
+ > 🔄 **Fast Iteration Friendly**
86
+ > Perfect for developers building prompt templates or testing UI integrations.
87
+ >
88
+ > 🔋 **Runs on Almost Anything**
89
+ > Even a Raspberry Pi Zero W can run Q2_K with swap enabled.
90
+ >
91
+ > 📦 **Tiny Footprint**
92
+ > Fits easily on USB drives, microSD cards, or IoT devices.
93
+
94
+ ## 🖥️ CLI Example Using Ollama or TGI Server
95
+
96
+ Here’s how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
97
+
98
+ ```bash
99
+ curl http://localhost:11434/api/generate -s -N -d '{
100
+ "model": "hf.co/geoffmunn/Qwen3-0.6B:Q8_0;2D",
101
+ "prompt": "Respond exactly as follows: Explain what gravity is in one sentence suitable for a child.",
102
+ "temperature": 0.6,
103
+ "top_p": 0.95,
104
+ "top_k": 20,
105
+ "min_p": 0.0,
106
+ "repeat_penalty": 1.1,
107
+ "stream": false
108
+ }' | jq -r '.response'
109
+ ```
110
+
111
+ 🎯 **Why this works well**:
112
+ - The prompt is meaningful yet achievable for a tiny model.
113
+ - Temperature tuned appropriately: `0.6` here keeps the explanation clear with a little natural variety; go lower (`0.1`) for deterministic output or higher (`0.8`) for jokes.
114
+ - Uses `jq` to extract a clean response.
115
+
116
+ > 💬 Tip: For ultra-low-latency use, try `Q3_K_M` or `Q4_K_S` on older laptops.
117
 
118
  ## Verification
119
 
 
129
  - [LM Studio](https://lmstudio.ai) – local AI model runner
130
  - [OpenWebUI](https://openwebui.com) – self-hosted AI interface
131
  - [GPT4All](https://gpt4all.io) – private, offline AI chatbot
132
+ - Directly via `llama.cpp`
133
 
134
  ## License
135
 
README.md CHANGED
@@ -1,29 +1,34 @@
1
  ---
2
  license: apache-2.0
3
  tags:
4
- - gguf
5
- - qwen
6
- - llama.cpp
7
- - quantized
8
- - text-generation
9
- - tiny-model
10
- - edge-ai
 
11
  base_model: Qwen/Qwen3-0.6B
12
  author: geoffmunn
 
13
  language:
14
- - en
 
15
  ---
16
 
17
  # Qwen3-0.6B-GGUF
18
 
19
- This is a **GGUF-quantized version** of the **[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)** language model — a compact **600M-parameter** LLM designed for **ultra-fast inference on low-resource devices**.
20
 
21
- Converted for use with `llama.cpp` and compatible tools like OpenWebUI, LM Studio, GPT4All, and more.
22
 
23
  > ⚠️ **Note**: This is a *very small* model. It will not match larger models (e.g., 4B+) in reasoning, coding, or factual accuracy. However, it shines in **speed, portability, and efficiency**.
24
 
25
  ## Available Quantizations (from f16)
26
 
 
 
27
  | Level | Quality | Speed | Size | Recommendation |
28
  |----------|--------------|----------|-----------|----------------|
29
  | Q2_K | Minimal | ⚡ Fastest | 347 MB | Use only on severely constrained systems (e.g., Raspberry Pi). Severely degraded output. |
@@ -70,14 +75,11 @@ Load this model using:
70
 
71
  Each model includes its own `README.md` and `MODELFILE` for optimal configuration.
72
 
73
- ## Verification
74
-
75
- Use \`SHA256SUMS.txt\` to verify file integrity:
76
 
77
- ```bash
78
- sha256sum -c SHA256SUMS.txt
79
- ```
80
 
81
- ## License
82
 
83
- Apache 2.0 see base model for full terms.
 
1
  ---
2
  license: apache-2.0
3
  tags:
4
+ - gguf
5
+ - qwen
6
+ - llama.cpp
7
+ - quantized
8
+ - text-generation
9
+ - chat
10
+ - edge-ai
11
+ - tiny-model
12
  base_model: Qwen/Qwen3-0.6B
13
  author: geoffmunn
14
+ pipeline_tag: text-generation
15
  language:
16
+ - en
17
+ - zh
18
  ---
19
 
20
  # Qwen3-0.6B-GGUF
21
 
22
+ This is a **GGUF-quantized version** of the **[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)** language model — a compact **600-million-parameter** LLM designed for **ultra-fast inference on low-resource devices**.
23
 
24
+ Converted for use with `llama.cpp`, [LM Studio](https://lmstudio.ai), [OpenWebUI](https://openwebui.com), and [GPT4All](https://gpt4all.io), enabling private AI anywhere — even offline.
25
 
26
  > ⚠️ **Note**: This is a *very small* model. It will not match larger models (e.g., 4B+) in reasoning, coding, or factual accuracy. However, it shines in **speed, portability, and efficiency**.
27
 
28
  ## Available Quantizations (from f16)
29
 
30
+ These variants were built from an **f16** base model to ensure consistency across quant levels.
31
+
32
  | Level | Quality | Speed | Size | Recommendation |
33
  |----------|--------------|----------|-----------|----------------|
34
  | Q2_K | Minimal | ⚡ Fastest | 347 MB | Use only on severely constrained systems (e.g., Raspberry Pi). Severely degraded output. |
 
75
 
76
  Each model includes its own `README.md` and `MODELFILE` for optimal configuration.
77
 
78
+ ## Author
 
 
79
 
80
+ 👤 Geoff Munn (@geoffmunn)
81
+ 🔗 [Hugging Face Profile](https://huggingface.co/geoffmunn)
 
82
 
83
+ ## Disclaimer
84
 
85
+ This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team.