Works well with vLLM, just no tool calling
#1 by Ununnilium
What does "This model could not run on vLLM" mean? For me, it works with vLLM 0.10.1 on an A5000 24 GB as expected:
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Intel/Mistral-Small-3.2-24B-Instruct-2506-int4-AutoRound --max-model-len 48000 --max-seq-len-to-capture 48000 --gpu-memory-utilization 0.95 --limit-mm-per-prompt '{"image": 10}'
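For anyone who wants to sanity-check the server once it is up, a minimal request against vLLM's OpenAI-compatible endpoint looks roughly like this (the default port 8000 is an assumption; adjust if you passed --port):

```
# Hypothetical smoke test against the default vLLM OpenAI-compatible server
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Intel/Mistral-Small-3.2-24B-Instruct-2506-int4-AutoRound",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```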
Only tool calling does not work: if I add --tokenizer_mode mistral --config_format mistral --load_format mistral --tool-call-parser mistral --enable-auto-tool-choice, I get a CUDA OOM error. I think there is a bug in the vLLM Mistral tokenizer (which is needed for tool calling).
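For clarity, the full invocation that triggers the OOM would be the working command above plus the tool-calling flags, i.e. roughly (untested sketch, same flags as reported):

```
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve Intel/Mistral-Small-3.2-24B-Instruct-2506-int4-AutoRound \
  --max-model-len 48000 \
  --max-seq-len-to-capture 48000 \
  --gpu-memory-utilization 0.95 \
  --limit-mm-per-prompt '{"image": 10}' \
  --tokenizer_mode mistral \
  --config_format mistral \
  --load_format mistral \
  --tool-call-parser mistral \
  --enable-auto-tool-choice
```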
Sorry, I only just noticed this, since I wasn't watching this space.
Thank you for the information; the model card has been updated.