|
|
--- |
|
|
license: apache-2.0 |
|
|
tags: |
|
|
- gguf |
|
|
- qwen |
|
|
- qwen3 |
|
|
- qwen3-0.6b |
|
|
- qwen3-0.6b-gguf |
|
|
- llama.cpp |
|
|
- quantized |
|
|
- text-generation |
|
|
- chat |
|
|
- edge-ai |
|
|
- tiny-model |
|
|
- imatrix |
|
|
base_model: Qwen/Qwen3-0.6B |
|
|
author: geoffmunn |
|
|
pipeline_tag: text-generation |
|
|
language: |
|
|
- en |
|
|
- zh |
|
|
--- |
|
|
|
|
|
# Qwen3-0.6B-f16-GGUF |
|
|
|
|
|
This is a **GGUF-quantized version** of the **[Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)** language model: a compact **600-million-parameter** LLM designed for **ultra-fast inference on low-resource devices**.
|
|
|
|
|
Converted for use with `llama.cpp`, [LM Studio](https://lmstudio.ai), [OpenWebUI](https://openwebui.com), and [GPT4All](https://gpt4all.io), enabling private AI anywhere, even offline.
|
|
|
|
|
> ⚠️ **Note**: This is a *very small* model. It will not match larger models (e.g., 4B+) in reasoning, coding, or factual accuracy. However, it shines in **speed, portability, and efficiency**.
|
|
|
|
|
## Available Quantizations (from f16) |
|
|
|
|
|
These variants were built from an **f16** base model to ensure consistency across quant levels.
|
|
|
|
|
| Level | Speed | Size | Recommendation |
|-----------|-----------|------------|--------------------------------------------------------------------|
| Q2_K | Fastest | 347 MB | 🚨 **DO NOT USE.** Could not provide an answer to any question. |
| Q3_K_S | Fast | 390 MB | Not recommended; did not appear in any top-3 results. |
| Q3_K_M | Fast | 414 MB | First place in the bat & ball question, no other top-3 appearances. |
| Q4_K_S | Fast | 471 MB | A good option for technical, low-temperature questions. |
| Q4_K_M | Fast | 484 MB | Showed up in a few results, but not recommended. |
| 🥈 Q5_K_S | Medium | 544 MB | 🥈 A very close second place. Good for all query types. |
| 🥇 Q5_K_M | Medium | 551 MB | 🥇 **Best overall model.** Highly recommended for all query types. |
| Q6_K | Slow | 623 MB | Showed up in a few results, but not recommended. |
| 🥉 Q8_0 | Slow | 805 MB | 🥉 Very good for non-technical, creative-style questions. |
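Each quant level above is a separate `.gguf` file in this repo. If you only need one of them, a minimal sketch using the Hugging Face CLI looks like this (it assumes the file name follows the `Qwen3-0.6B-f16:<QUANT>.gguf` pattern used in this repo; swap in whichever level you prefer):

```bash
# Install the Hugging Face CLI if you don't already have it
pip install -U huggingface_hub

# Download just the Q5_K_M build into the current directory
huggingface-cli download geoffmunn/Qwen3-0.6B-f16 "Qwen3-0.6B-f16:Q5_K_M.gguf" --local-dir .
```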
|
|
|
|
|
## Why Use a 0.6B Model? |
|
|
|
|
|
While limited in capability compared to larger models, **Qwen3-0.6B** excels at: |
|
|
- Running **instantly** on CPUs without GPU |
|
|
- Fitting into **<2GB RAM**, even when quantized |
|
|
- Enabling **offline AI on microcontrollers, phones, or edge devices** |
|
|
- Serving as a **fast baseline** for lightweight NLP tasks (intent detection, short responses) |
|
|
|
|
|
It's ideal for:
|
|
- Chatbots with simple flows |
|
|
- On-device assistants |
|
|
- Educational demos |
|
|
- Rapid prototyping |
|
|
|
|
|
## Model analysis and rankings
|
|
|
|
|
I have run each of these models across 6 questions and ranked them based on the quality of the answers.

**Qwen3-0.6B-f16:Q5_K_M** is the best model across all question types, but if you want to play it safe with a higher-precision model, consider **Qwen3-0.6B-f16:Q8_0**.
|
|
|
|
|
You can read the results here: [Qwen3-0.6b-f16-analysis.md](Qwen3-0.6b-f16-analysis.md) |
|
|
|
|
|
If you find this useful, please give the project a ❤️ like.
|
|
|
|
|
## Usage |
|
|
|
|
|
Load this model using: |
|
|
- [OpenWebUI](https://openwebui.com): self-hosted AI interface with RAG & tools
- [LM Studio](https://lmstudio.ai): desktop app with GPU support and chat templates
- [GPT4All](https://gpt4all.io): private, local AI chatbot (offline-first)
- Or directly via `llama.cpp` (see the example below)
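For a quick command-line test with `llama.cpp`, something like the following should work. This is a minimal sketch: it assumes you have built the `llama-cli` binary and downloaded the Q5_K_M file into the current directory.

```bash
# One-shot prompt against the local GGUF file, using the sampling settings recommended below
./llama-cli -m ./Qwen3-0.6B-f16:Q5_K_M.gguf \
  -p "Explain what a GGUF file is in one paragraph." \
  -n 256 --temp 0.6 --top-p 0.95 --top-k 20
```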
|
|
|
|
|
Each quantized model includes its own `README.md` and shares a common `MODELFILE` for optimal configuration. |
|
|
|
|
|
Importing directly into Ollama should work, but you might encounter this error: `Error: invalid character '<' looking for beginning of value`. |
|
|
In this case try these steps: |
|
|
|
|
|
1. `wget https://huggingface.co/geoffmunn/Qwen3-0.6B-f16/resolve/main/Qwen3-0.6B-f16%3AQ3_K_M.gguf` (replace the quantised version with the one you want) |
|
|
2. `nano Modelfile` and enter these details (again, replacing Q3_K_M with the version you want): |
|
|
```text |
|
|
FROM ./Qwen3-0.6B-f16:Q3_K_M.gguf |
|
|
|
|
|
# Chat template using ChatML (used by Qwen) |
|
|
SYSTEM You are a helpful assistant |
|
|
|
|
|
TEMPLATE "{{ if .System }}<|im_start|>system |
|
|
{{ .System }}<|im_end|>{{ end }}<|im_start|>user |
|
|
{{ .Prompt }}<|im_end|> |
|
|
<|im_start|>assistant |
|
|
" |
|
|
PARAMETER stop <|im_start|> |
|
|
PARAMETER stop <|im_end|> |
|
|
|
|
|
# Default sampling |
|
|
PARAMETER temperature 0.6 |
|
|
PARAMETER top_p 0.95 |
|
|
PARAMETER top_k 20 |
|
|
PARAMETER min_p 0.0 |
|
|
PARAMETER repeat_penalty 1.1 |
|
|
PARAMETER num_ctx 4096 |
|
|
``` |
|
|
|
|
|
The `num_ctx` value has been lowered to 4096 to increase speed significantly; raise it if you need a longer context window.
|
|
|
|
|
3. Then run this command: `ollama create Qwen3-0.6B-f16:Q3_K_M -f Modelfile` |
|
|
|
|
|
You will now see "Qwen3-0.6B-f16:Q3_K_M" in your Ollama model list. |
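To confirm the import worked, you can run a quick one-shot prompt (the tag must match whatever name you gave `ollama create`):

```bash
# Ask the imported model a short question directly from the shell
ollama run Qwen3-0.6B-f16:Q3_K_M "Summarise what a GGUF file is in two sentences."
```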
|
|
|
|
|
These import steps are also useful if you want to customise the default parameters or system prompt. |
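For example, to adjust an already-imported model without re-downloading anything, you can dump its Modelfile, edit it, and rebuild it under a new tag. This is a sketch; the `-custom` tag name is just an illustration.

```bash
# Export the current Modelfile, tweak it, then create a new variant from it
ollama show Qwen3-0.6B-f16:Q3_K_M --modelfile > Modelfile
nano Modelfile   # e.g. change the SYSTEM prompt or PARAMETER temperature
ollama create Qwen3-0.6B-f16:Q3_K_M-custom -f Modelfile
```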
|
|
|
|
|
## Author |
|
|
|
|
|
Geoff Munn (@geoffmunn)

[Hugging Face Profile](https://huggingface.co/geoffmunn)
|
|
|
|
|
## Disclaimer |
|
|
|
|
|
This is a community conversion for local inference. Not affiliated with Alibaba Cloud or the Qwen team. |
|
|
|