Works well with vLLM, just no tool calling
#1 by Ununnilium
What does "This model could not run on vLLM" mean? For me, it works with vLLM 0.10.1 on an A5000 24 GB as expected:
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Intel/Mistral-Small-3.2-24B-Instruct-2506-int4-AutoRound --max-model-len 48000 --max-seq-len-to-capture 48000 --gpu-memory-utilization 0.95 --limit-mm-per-prompt '{"image": 10}'
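For anyone who wants to sanity-check the server once it is up, a minimal request against vLLM's OpenAI-compatible endpoint looks roughly like this (the default port 8000 is an assumption; adjust if you passed --port):

```
# Hypothetical smoke test against the default vLLM OpenAI-compatible server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Intel/Mistral-Small-3.2-24B-Instruct-2506-int4-AutoRound",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```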
Only tool calling does not work: if I add --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice, I get a CUDA OOM error. I think there is a bug in the vLLM Mistral tokenizer (which is needed for tool calling).
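For clarity, the full invocation that triggers the OOM would be the working command above plus the tool-calling flags, i.e. roughly (untested sketch, same flags as reported):

```
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Intel/Mistral-Small-3.2-24B-Instruct-2506-int4-AutoRound \
  --max-model-len 48000 \
  --max-seq-len-to-capture 48000 \
  --gpu-memory-utilization 0.95 \
  --limit-mm-per-prompt '{"image": 10}' \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice
```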
Sorry, I only just noticed this, since I wasn't watching this space.
Thank you for the information; the model card has been updated.