Qwen3-VL-30B-A3B-Instruct-NVFP4
NVFP4 quantization using llm-compressor v0.8.2.dev28+g0f346cf7 (and transformers v4.57.1), based on the official NVFP4 example script for Qwen3-VL-235B-A22B-Instruct.
Script adjustments
- The model ID has, of course, been changed from `Qwen/Qwen3-VL-235B-A22B-Instruct` to `Qwen/Qwen3-VL-30B-A3B-Instruct`
- The number of calibration samples has been increased from 20 to 512 (see the sketch below)
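For reference, here is a heavily abridged sketch of what the adjusted quantization run looks like. It follows the general layout of the official llm-compressor NVFP4 examples, but the model class, calibration dataset and ignore list below are simplified assumptions; the real script builds its own multimodal calibration set and keeps additional modules out of quantization, so refer to it for the exact recipe.

```python
# Heavily abridged sketch of the adjusted NVFP4 quantization run.
# The calibration dataset, ignore list and model class are simplified
# assumptions; refer to the official Qwen3-VL NVFP4 example script for
# the exact multimodal calibration pipeline and recipe.
from transformers import AutoModelForImageTextToText, AutoProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

# Adjustment 1: 30B-A3B instead of 235B-A22B
MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"
# Adjustment 2: 512 calibration samples instead of the example's 20
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
SAVE_DIR = "Qwen3-VL-30B-A3B-Instruct-NVFP4"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID)

# NVFP4 (FP4 weights + activations) on the Linear layers; the official script
# keeps more modules (e.g. vision tower, MoE router gates) in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    processor=processor,      # older llm-compressor versions call this `tokenizer`
    dataset="open_platypus",  # text-only placeholder; the example uses its own calibration data
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```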
vLLM execution
Because this is an NVFP4 MoE model, you might have some trouble running it with the current vLLM version (v0.11.0): no kernel is available out of the box. To launch it you will need to compile the CUTLASS FP4 GEMM kernel for SM100 (RTX Pro 6000) or SM120 (RTX 5090). vLLM can do this automatically for you with the following configuration:
```
docker run -ti --name Qwen3-VL-30B-A3B-NVFP4 --gpus all -v '/srv/mountpoint_with_freespace/cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --enable-auto-tool-choice --tool-call-parser hermes
```
The important part here is the `VLLM_USE_FLASHINFER_MOE_FP4=1` environment variable, which instructs vLLM to compile the FP4 MoE kernel for your GPU architecture. Note that the more CPU cores you have, the more RAM the CUDA compilation will need (more parallel compilation jobs).
For now you need the `vllm/vllm-openai:nightly` image (currently targeting 0.11.1rc4.dev6+g66a168a19), but once v0.11.1 is out this should no longer be necessary.
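Once the server is up, it exposes the usual OpenAI-compatible API on port 8000. Below is a small sketch of a chat request exercising the tool-calling setup enabled above (`--enable-auto-tool-choice --tool-call-parser hermes`); the base URL, API key and tool definition are placeholder assumptions for a local deployment.

```python
# Quick check against the local vLLM OpenAI-compatible endpoint.
# Base URL, API key and the tool schema are placeholders for a local setup.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, just to exercise tool calling
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B",  # matches --served-model-name
    messages=[{"role": "user", "content": "What is the weather like in Paris?"}],
    tools=tools,
)

message = response.choices[0].message
# With --enable-auto-tool-choice, the model may answer directly or emit a tool call.
print(message.tool_calls or message.content)
```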
A note for 5090 owners
While it is possible for you to run the model, chances are high that you:
- are running Windows with WSL2, and are thus only giving half of your memory to the WSL virtual machine
- have a lot of CPU cores
This will most likely create a situation where the FP4 MoE kernel compilation triggers an OOM kill inside the container. Here is a small guide on how to get it running:
- First, edit the `%USERPROFILE%/.wslconfig` file to reduce the number of CPU cores given to WSL (and thus to the Docker containers you will run) and to increase its RAM allocation. Reducing the number of available cores reduces the number of parallel compilation jobs, and therefore the RAM consumption. If you have 64GiB of RAM the following configuration will work (otherwise reduce it):
```
[wsl2]
processors=6
memory=50G
```
- Once the file has been saved, log out and log back in so that Docker Desktop starts with the new limits
- Execute the following command in a PowerShell terminal:
```
docker run -ti --name Qwen3-VL-30B-A3B-NVFP4 --gpus all -v 'E:\cache:/root/.cache' -e VLLM_USE_FLASHINFER_MOE_FP4=1 -p 8000:8000 "vllm/vllm-openai:nightly" "ig1/Qwen3-VL-30B-A3B-Instruct-NVFP4" --served-model-name Qwen3-VL-30B-A3B --gpu-memory-utilization 0.8 --max-model-len 46K --enable-auto-tool-choice --tool-call-parser hermes --limit-mm-per-prompt '{\"image\": 2, \"video\": 0}'
```
  a. Adjust `E:\cache` to a folder of your liking. It will contain the Hugging Face download cache, the vLLM cache (mostly for torch compilation artifacts), and a bunch of other folders you want to keep between runs.
  b. `gpu-memory-utilization` and `max-model-len` have been adjusted for the 32GiB limit of the RTX 5090 and the fact that the host system still needs a piece of it.
  c. `limit-mm-per-prompt` has been adjusted to match the reduced model length (at most 2 images and no videos per prompt)
- Let vLLM cook. You can use the Docker Desktop `Exec` tab to check the compilation activity (and RAM usage!) with `htop`, for example: `apt update && apt install -y htop && htop`
- Once the service has successfully started, `CTRL-C` the execution to stop the container.
- Edit `%USERPROFILE%/.wslconfig` back to restore your original values, then log out / log in to start fresh with those values.
- Open Docker Desktop and simply press the start button of the `Qwen3-VL-30B-A3B-NVFP4` container. You can now manage it through the UI whenever you need it (see the quick check below).
- Enjoy fast NVFP4 inference!
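As a quick smoke test of the 5090 setup, here is a small sketch of a vision request against the restarted container. It stays within the `--limit-mm-per-prompt` budget set above (at most 2 images, no video); the base URL, API key and image URL are placeholder assumptions.

```python
# Minimal vision smoke test against the local container (placeholder URL/key).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen3-VL-30B-A3B",  # matches --served-model-name
    messages=[
        {
            "role": "user",
            "content": [
                # At most 2 images per prompt, as configured with --limit-mm-per-prompt
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
                {"type": "text", "text": "Describe this image in one sentence."},
            ],
        }
    ],
    max_tokens=128,
)

print(response.choices[0].message.content)
```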