Qwen3-VL-30B-A3B-Instruct-NVFP4
NVFP4 quantization using llm-compressor v0.8.2.dev28+g0f346cf7 (and transformers v4.57.1), based on the official NVFP4 example script for Qwen3-VL-235B-A22B-Instruct.
Script adjustments
- The model ID has, of course, been changed from `Qwen/Qwen3-VL-235B-A22B-Instruct` to `Qwen/Qwen3-VL-30B-A3B-Instruct`
- The number of calibration samples has been increased from 20 to 512 (see the sketch below)
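For reference, here is a heavily abridged sketch of what the adjusted quantization run looks like. It follows the general layout of the official llm-compressor NVFP4 examples, but the model class, calibration dataset and ignore list below are simplified assumptions; the real script builds its own multimodal calibration set and keeps additional modules out of quantization, so refer to it for the exact recipe.

```python
# Heavily abridged sketch of the adjusted NVFP4 quantization run.
# The calibration dataset, ignore list and model class are simplified
# assumptions; refer to the official Qwen3-VL NVFP4 example script for
# the exact multimodal calibration pipeline and recipe.
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Adjustment 1: 30B-A3B instead of 235B-A22B
MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"
# Adjustment 2: 512 calibration samples instead of the example's 20
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
SAVE_DIR = "Qwen3-VL-30B-A3B-Instruct-NVFP4"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 (FP4 weights + activations) on the Linear layers; the official script
# keeps more modules (e.g. vision tower, MoE router gates) in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    processor=processor,      # older llm-compressor versions call this `tokenizer`
    dataset="open_platypus",  # text-only placeholder; the example uses its own calibration data
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```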
vLLM execution
Because this is an NVFP4 MoE model, you might have some trouble running it with the current vLLM version (v0.11.0): no kernel is available out of the box. To launch it you will need to compile the CUTLASS FP4 GEMM kernel for SM100 (RTX Pro 6000) or SM120 (RTX 5090). vLLM can do this automatically for you with the following configuration:
```
docker run -ti --name Qwen3-VL-30B-A3B-NVFP4 --gpus all -v '/srv/mountpoint_with_freespace/cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --enable-auto-tool-choice --tool-call-parser hermes
```
The important part here is the `VLLM_USE_FLASHINFER_MOE_FP4=1` environment variable, which instructs vLLM to compile the FP4 MoE kernel for your GPU architecture. Note that the more CPU cores you have, the more RAM the CUDA compilation will need (more parallel compilation jobs).
For now you need the `vllm/vllm-openai:nightly` image (currently targeting 0.11.1rc4.dev6+g66a168a19), but once v0.11.1 is out this should no longer be necessary.
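Once the server is up, it exposes the usual OpenAI-compatible API on port 8000. Below is a small sketch of a chat request exercising the tool-calling setup enabled above (`--enable-auto-tool-choice --tool-call-parser hermes`); the base URL, API key and tool definition are placeholder assumptions for a local deployment.

```python
# Quick check against the local vLLM OpenAI-compatible endpoint.
# Base URL, API key and the tool schema are placeholders for a local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, just to exercise tool calling
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B",  # matches --served-model-name
    messages=[{"role": "user", "content": "What is the weather like in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
# With --enable-auto-tool-choice, the model may answer directly or emit a tool call.
print(message.tool_calls or message.content)
```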
A note for 5090 owners
While it is possible for you to run the model, chances are high that you:
- are running Windows with WSL2, and are thus only giving half of your memory to the WSL virtual machine
- have a lot of CPU cores
This will most likely create a situation where the FP4 MoE kernel compilation triggers an OOM kill inside the container. Here is a small guide on how to get it running:
- First, edit the `%USERPROFILE%/.wslconfig` file to reduce the number of CPU cores given to WSL (and thus to the Docker containers you will run) and to increase its RAM allocation. Reducing the number of available cores reduces the number of parallel compilation jobs, and therefore the RAM consumption. If you have 64GiB of RAM the following configuration will work (otherwise reduce it):
```
[wsl2]
processors=6
memory=50G
```
- Once the file has been saved, log out and log back in so that Docker Desktop starts with the new limits
- Execute the following command in a PowerShell terminal:
```
docker run -ti --name Qwen3-VL-30B-A3B-NVFP4 --gpus all -v 'E:\cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --gpu-memory-utilization 0.8 --max-model-len 46K --enable-auto-tool-choice --tool-call-parser hermes --limit-mm-per-prompt '{\"image\": 2, \"video\": 0}'
```
  a. Adjust `E:\cache` to a folder of your liking. It will contain the Hugging Face download cache, the vLLM cache (mostly for torch compilation artifacts), and a bunch of other folders you want to keep between runs.
  b. `gpu-memory-utilization` and `max-model-len` have been adjusted for the 32GiB limit of the RTX 5090 and the fact that the host system still needs a piece of it.
  c. `limit-mm-per-prompt` has been adjusted to match the reduced model length (at most 2 images and no videos per prompt)
- Let vLLM cook. You can use the Docker Desktop `Exec` tab to check the compilation activity (and RAM usage!) with `htop`, for example: `apt update && apt install -y htop && htop`
- Once the service has successfully started, `CTRL-C` the execution to stop the container.
- Edit `%USERPROFILE%/.wslconfig` back to restore your original values, then log out / log in to start fresh with those values.
- Open Docker Desktop and simply press the start button of the `Qwen3-VL-30B-A3B-NVFP4` container. You can now manage it through the UI whenever you need it (see the quick check below).
- Enjoy fast NVFP4 inference!
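As a quick smoke test of the 5090 setup, here is a small sketch of a vision request against the restarted container. It stays within the `--limit-mm-per-prompt` budget set above (at most 2 images, no video); the base URL, API key and image URL are placeholder assumptions.

```python
# Minimal vision smoke test against the local container (placeholder URL/key).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B",  # matches --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                # At most 2 images per prompt, as configured with --limit-mm-per-prompt
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```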