Update README.md
README.md CHANGED
@@ -8,11 +8,9 @@ pipeline_tag: text-generation
 ---
 ## Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ
 This is a version of the
-<a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1"> Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ).
+<a href="https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1"> Mixtral-8x7B-Instruct-v0.1 model</a> quantized with a mix of 4-bit and 2-bit via Half-Quadratic Quantization (HQQ). More specifically, the attention layers are quantized to 4-bit and the experts are quantized to 2-bit.
 
-
-
-The difference between this model and <a href="https://huggingface.co/mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bit-HQQ"> this </a> is that this one offloads the metadata to the CPU and you only need 13GB Vram to run it instead of 20GB!
+This model was designed to get the best quality at a budget of ~13GB of VRAM. It reaches an impressive <b>70.01</b> LLM leaderboard score, not too far from the original model's <b>72.62</b>.
 
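For reference, loading a pre-quantized HQQ model like this one is typically a couple of lines with the hqq library. The snippet below is a minimal sketch, assuming the `HQQModelForCausalLM` wrapper from `hqq.engine.hf` as used in mobiuslabsgmbh model cards; the exact API may vary between hqq releases.

```python
# Minimal loading sketch (assumption: hqq's Hugging Face engine wrapper;
# the API may differ across hqq versions).
from hqq.engine.hf import HQQModelForCausalLM, AutoTokenizer

model_id = "mobiuslabsgmbh/Mixtral-8x7B-Instruct-v0.1-hf-attn-4bit-moe-2bitgs8-metaoffload-HQQ"

# from_quantized() fetches the pre-quantized weights. The quantization
# metadata is kept offloaded on the CPU, which is what brings the VRAM
# footprint down to ~13GB (attention: 4-bit; experts: 2-bit, group size 8).
model = HQQModelForCausalLM.from_quantized(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```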