---
license: apache-2.0
tags:
- gguf
- qwen
- qwen3-0.6b
- qwen3-0.6b-q6
- qwen3-0.6b-q6_k
- qwen3-0.6b-q6_k-gguf
- llama.cpp
- quantized
- text-generation
- chat
- edge-ai
- tiny-model
base_model: Qwen/Qwen3-0.6B
author: geoffmunn
---
# Qwen3-0.6B-f16:Q6_K

Quantized version of Qwen/Qwen3-0.6B at Q6_K level, derived from f16 base weights.
## Model Info
- Format: GGUF (for llama.cpp and compatible runtimes)
- Size: 623 MB
- Precision: Q6_K
- Base Model: Qwen/Qwen3-0.6B
- Conversion Tool: llama.cpp
## Quality & Performance
| Metric | Value |
|---|---|
| Speed | Slow |
| RAM Required | ~1.4 GB |
| Recommendation | Showed up in a few results, but not recommended. |
## Prompt Template (ChatML)
This model follows the ChatML prompt format used by Qwen:

```text
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
Set this in your app (LM Studio, OpenWebUI, etc.) for best results.
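If you are driving `llama.cpp` directly instead, you can pass an already formatted ChatML prompt yourself. The snippet below is a minimal sketch: the local filename and the `llama-cli` binary name (recent llama.cpp builds) are assumptions, so adjust them to your setup.

```bash
# Minimal sketch: hand-build the ChatML prompt and pass it to llama.cpp's CLI.
llama-cli -m ./Qwen3-0.6B-f16:Q6_K.gguf -n 128 --temp 0.6 \
  -p '<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Explain what gravity is in one sentence suitable for a child.<|im_end|>
<|im_start|>assistant
'
```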
## Generation Parameters
Recommended defaults:
| Parameter | Value |
|---|---|
| Temperature | 0.6 |
| Top-P | 0.95 |
| Top-K | 20 |
| Min-P | 0.0 |
| Repeat Penalty | 1.1 |
Stop sequences: `<|im_end|>`, `<|im_start|>`
⚠️ Due to the model's small size, avoid temperatures above 0.9; outputs become highly unpredictable.
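As a rough guide, the same defaults map onto llama.cpp's sampling flags as shown below (flag names are from recent llama.cpp builds; the model path is an assumption):

```bash
# The table above expressed as CLI flags; -cnv opens an interactive chat session.
llama-cli -m ./Qwen3-0.6B-f16:Q6_K.gguf -cnv \
  --temp 0.6 --top-p 0.95 --top-k 20 --min-p 0.0 --repeat-penalty 1.1
```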
## Usage Tips
This model is best suited for lightweight tasks:
### ✅ Ideal Uses
- Quick replies and canned responses
- Intent classification (e.g., "Is this user asking for help?"); a sketch follows this list
- UI prototyping and local AI testing
- Embedded/NPU deployment
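
For instance, a minimal intent-classification call might look like this sketch; the labels, the user question, and the file path are all illustrative, and the `/no_think` hint (a Qwen3 soft switch) plus a low temperature keep the reply short and deterministic.

```bash
# Hypothetical intent classifier: the system prompt constrains output to one label.
llama-cli -m ./Qwen3-0.6B-f16:Q6_K.gguf -n 32 --temp 0.1 \
  -p '<|im_start|>system
Classify the user message as one of: HELP_REQUEST, FEEDBACK, OTHER. Reply with the label only. /no_think<|im_end|>
<|im_start|>user
How do I reset my password?<|im_end|>
<|im_start|>assistant
'
```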
### ❌ Limitations
- No complex reasoning or multi-step logic
- Poor math and code generation
- Limited world knowledge
- May repeat or hallucinate frequently at higher temps
### Fast Iteration Friendly

Perfect for developers building prompt templates or testing UI integrations.

### Runs on Almost Anything

Even a Raspberry Pi Zero W can run Q2_K with swap enabled.

### Tiny Footprint

Fits easily on USB drives, microSD cards, or IoT devices.
## Customisation & Troubleshooting
Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`.
In this case, try these steps:

- Download the GGUF file:

```bash
wget https://huggingface.co/geoffmunn/Qwen3-0.6B-f16/resolve/main/Qwen3-0.6B-f16%3AQ6_K.gguf
```

- Create a `Modelfile` (e.g. `nano Modelfile`) and enter these details:
```
FROM ./Qwen3-0.6B-f16:Q6_K.gguf

# Chat template using ChatML (used by Qwen)
SYSTEM You are a helpful assistant

TEMPLATE "{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"

PARAMETER stop <|im_start|>
PARAMETER stop <|im_end|>

# Default sampling
PARAMETER temperature 0.6
PARAMETER top_p 0.95
PARAMETER top_k 20
PARAMETER min_p 0.0
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 4096
```
The `num_ctx` value has been reduced to 4096 to increase speed significantly.
- Then run this command:
```bash
ollama create Qwen3-0.6B-f16:Q6_K -f Modelfile
```
You will now see "Qwen3-0.6B-f16:Q6_K" in your Ollama model list.
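As a quick smoke test once the import finishes, you can query the new model straight from the terminal:

```bash
ollama run Qwen3-0.6B-f16:Q6_K "Explain what gravity is in one sentence suitable for a child."
```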
These import steps are also useful if you want to customise the default parameters or system prompt.
## CLI Example Using Ollama or TGI Server
Here's how you can query this model via API using `curl` and `jq`. Replace the endpoint with your local server (e.g., Ollama, Text Generation Inference).
```bash
curl http://localhost:11434/api/generate -s -N -d '{
  "model": "hf.co/geoffmunn/Qwen3-0.6B-f16:Q6_K",
  "prompt": "Respond exactly as follows: Explain what gravity is in one sentence suitable for a child.",
  "temperature": 0.6,
  "top_p": 0.95,
  "top_k": 20,
  "min_p": 0.0,
  "repeat_penalty": 1.1,
  "stream": false
}' | jq -r '.response'
```
**Why this works well:**

- The prompt is meaningful yet achievable for a tiny model.
- Temperature is tuned appropriately: lower for deterministic output (0.1), higher for jokes (0.8).
- Uses `jq` to extract a clean response.

> Tip: For ultra-low-latency use, try Q3_K_M or Q4_K_S on older laptops.
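If you imported the model with the Modelfile above, Ollama's chat endpoint is an alternative sketch that applies the ChatML template for you; the model name here assumes that local import rather than the `hf.co/...` pull.

```bash
curl http://localhost:11434/api/chat -s -d '{
  "model": "Qwen3-0.6B-f16:Q6_K",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain what gravity is in one sentence suitable for a child."}
  ],
  "stream": false
}' | jq -r '.message.content'
```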
## Verification
Check integrity:
```bash
sha256sum -c ../SHA256SUMS.txt
```
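If you only downloaded this one file, a manual spot-check is a reasonable sketch (filename assumed from this repo's naming):

```bash
# Hash the local file and compare the digest against the matching line in SHA256SUMS.txt.
sha256sum "Qwen3-0.6B-f16:Q6_K.gguf"
```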
## Usage
Compatible with:
- LM Studio: local AI model runner
- OpenWebUI: self-hosted AI interface
- GPT4All: private, offline AI chatbot
- Directly via `llama.cpp`
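
For the `llama.cpp` route, one minimal sketch is to serve the file with llama.cpp's built-in HTTP server (binary name and flags as in recent builds; adjust the path and port to your setup):

```bash
# Exposes an OpenAI-compatible endpoint on http://localhost:8080
llama-server -m ./Qwen3-0.6B-f16:Q6_K.gguf -c 4096 --port 8080
```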
## License
Apache 2.0. See the base model for full terms.