---
license: apache-2.0
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen3-VL-30B-A3B-Instruct
datasets:
- neuralmagic/calibration
---

# Qwen3-VL-30B-A3B-Instruct-NVFP4

NVFP4 quantization using [llm-compressor](https://github.com/vllm-project/llm-compressor) v0.8.2.dev28+g0f346cf7 (and transformers v4.57.1), based on the official [NVFP4 example script](https://github.com/vllm-project/llm-compressor/blob/main/examples/quantization_w4a4_fp4/qwen3_vl_moe_w4a4_fp4.py) for [Qwen3-VL-235B-A22B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-235B-A22B-Instruct).

## Script adjustments

* The model ID has been changed from `Qwen/Qwen3-VL-235B-A22B-Instruct` to `Qwen/Qwen3-VL-30B-A3B-Instruct`
* The number of calibration samples has been increased from 20 to 512
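
For reference, the resulting flow might look roughly like the sketch below. This is a simplified, schematic adaptation rather than the exact script that produced this checkpoint: it calibrates on text only, assumes a chat-style `messages` column (and default config/split) in `neuralmagic/calibration`, and omits the MoE- and vision-specific calibration handling done by the upstream example, which remains the authoritative reference.

```python
# Schematic sketch only: adapted from the generic llm-compressor NVFP4 examples,
# not the exact script used for this checkpoint (see the linked example instead).
from datasets import load_dataset
from transformers import AutoModelForImageTextToText, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"   # changed from the 235B-A22B model ID
NUM_CALIBRATION_SAMPLES = 512                 # increased from 20
MAX_SEQUENCE_LENGTH = 2048

# The upstream script may load the explicit Qwen3-VL MoE class instead of the auto class.
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data: dataset config/split and column names are assumptions,
# adjust them to the actual schema used by the upstream example.
ds = load_dataset("neuralmagic/calibration", split="train")
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))
ds = ds.map(lambda ex: {"text": tokenizer.apply_chat_template(ex["messages"], tokenize=False)})
ds = ds.map(
    lambda ex: tokenizer(ex["text"], max_length=MAX_SEQUENCE_LENGTH, truncation=True, add_special_tokens=False),
    remove_columns=ds.column_names,
)

# NVFP4 (W4A4) on the language-model Linear layers; keep lm_head and the vision tower
# in higher precision, as the upstream example does (module name patterns are assumptions).
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*"],
)

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Qwen3-VL-30B-A3B-Instruct-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-VL-30B-A3B-Instruct-NVFP4")
```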

## vLLM execution

Because this is an NVFP4 MoE model, you might have some trouble running it with the current vLLM release (v0.11.0), which fails with `no kernel available`. To launch it you will need to compile the CUTLASS FP4 GEMM MoE kernel for SM100 (RTX Pro 6000) or SM120 (RTX 5090). vLLM can do this automatically for you with the following configuration:

```bash
docker run -ti --name Qwen3-VL-30B-A3B-NVFP4 --gpus all -v '/srv/mountpoint_with_freespace/cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --enable-auto-tool-choice --tool-call-parser hermes
```

The important part here is the `VLLM_USE_FLASHINFER_MOE_FP4=1` environment variable, which instructs vLLM to compile the FP4 MoE kernel for your GPU architecture. The more CPU cores you have, the more RAM you will need for the CUDA compilation.

For now you need the `vllm/vllm-openai:nightly` image (currently targeting `0.11.1rc4.dev6+g66a168a19`), but once v0.11.1 is out this should no longer be necessary.
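
Once the server is up, you can talk to it through the OpenAI-compatible API. A minimal sanity check with the `openai` Python client might look like the following sketch (the image URL is a placeholder, the model name matches `--served-model-name` above, and the API key can be any string since no `--api-key` was configured):

```python
# Minimal sanity check against the OpenAI-compatible endpoint started above.
# Assumes the container is reachable on localhost:8000; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B",  # matches --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": "https://example.com/some_image.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```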

### A note for 5090 owners

While it is possible to run the model, there is a high chance that you:

* are running Windows with WSL2, and thus only giving half of your RAM to the WSL virtual machine
* have a lot of CPU cores

This will most likely create a situation where the FP4 MoE kernel compilation triggers an OOM kill within the container. Here is a small guide on how to get it running:

1. First you need to edit the `%USERPROFILE%/.wslconfig` file to reduce the number of CPU cores given to WSL (and so to the Docker containers you will run) and to increase its RAM allocation. Reducing the number of available cores reduces the number of parallel compilation jobs and therefore the RAM consumption. If you have 64 GiB of RAM, the following configuration will work (otherwise reduce it):

```text
[wsl2]
processors=6
memory=50G
```

2. Once the file has been saved, log out and log back in so that Docker Desktop starts with the new limits
3. Execute the following command in a PowerShell terminal:

```powershell
docker run -ti --name Qwen3-VL-30B-A3B-NVFP4 --gpus all -v 'E:\cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --gpu-memory-utilization 0.8 --max-model-len 46K --enable-auto-tool-choice --tool-call-parser hermes --limit-mm-per-prompt '{\"image\": 2, \"video\": 0}'
```

a. Adjust `E:\cache` to a folder of your liking. It will contain the Hugging Face download cache, the vLLM cache (mostly for torch compilation) and a bunch of other folders you want to keep between runs.

b. `gpu-memory-utilization` and `max-model-len` have been adjusted for the 32 GiB limit of the RTX 5090 and the fact that the host system still needs a share of it (a quick way to verify the resulting limits is shown after this list).

c. `limit-mm-per-prompt` has been adjusted to match the reduced model length (at most 2 images and no videos per prompt)

4. Let vLLM cook. You can use the Docker Desktop `Exec` tab to check the compilation activity (and RAM usage!) with `htop`, for example: `apt update && apt install -y htop && htop`
5. Once the service has successfully started, `CTRL-C` the execution to stop the container.
6. Edit `%USERPROFILE%/.wslconfig` again to restore your original values, then log out and back in to apply them.
7. Open Docker Desktop and simply press the start button of the `Qwen3-VL-30B-A3B-NVFP4` container. From now on you can manage it from the UI whenever you need it.
8. Enjoy fast NVFP4 inference!
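
As mentioned in note b, you can quickly confirm that the model name and context length were picked up as expected by querying the server's `/v1/models` endpoint. vLLM usually includes a `max_model_len` field in the response, although the exact fields may vary between versions; a minimal check:

```python
# Quick check of the served model and its context length.
# max_model_len is a vLLM-specific field and may differ across versions.
import requests

info = requests.get("http://localhost:8000/v1/models", timeout=10).json()
for model in info["data"]:
    print(model["id"], model.get("max_model_len", "n/a"))
```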