Incorrect caption generated by ONNX version – need help reproducing the reference output
Dear Joshua Lochner (@Xenova),
I have created a Colab notebook that loads the ONNX files (vision_encoder.onnx, embed_tokens.onnx, decoder_model_merged.onnx) from this repository and runs inference with ONNX Runtime. The notebook executes without errors, but the caption produced for the sample image (Lake Zurich) does not match the one shown in the official Hugging Face Space demo ("A lake with trees in the background and a ball in the foreground.").
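For context, the sessions are set up roughly as follows (the repo id placeholder, the `onnx/` subfolder, and the `.onnx_data` suffix are my assumptions about this repository's layout; adjust as needed):

```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download

REPO_ID = "<owner>/<repo>"  # placeholder: replace with this repository's id

# Download each graph together with its external-data companion.
# hf_hub_download preserves the repo layout, so the .onnx_data file ends up
# next to its .onnx file and ONNX Runtime can resolve it.
names = ["vision_encoder", "embed_tokens", "decoder_model_merged"]
paths = {}
for name in names:
    paths[name] = hf_hub_download(repo_id=REPO_ID, filename=f"onnx/{name}.onnx")
    hf_hub_download(repo_id=REPO_ID, filename=f"onnx/{name}.onnx_data")

vision_sess = ort.InferenceSession(paths["vision_encoder"])
embed_sess = ort.InferenceSession(paths["embed_tokens"])
decoder_sess = ort.InferenceSession(paths["decoder_model_merged"])

# Dump the expected input names, shapes, and dtypes for each session.
for label, sess in [("vision", vision_sess), ("embed", embed_sess), ("decoder", decoder_sess)]:
    print(label, [(i.name, i.shape, i.type) for i in sess.get_inputs()])
```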
What I have tried:
- Installed `optimum[onnxruntime]>=1.20`, `transformers`, `torch`, and `onnxruntime`, as in the original demo.
- Downloaded the three ONNX model files and their `.onnx_data` companions via `hf_hub_download`.
- Built separate ONNX Runtime `InferenceSession`s for the vision encoder, token embedder, and decoder (as in the snippet above).
- Implemented several generation strategies (greedy, top‑k/top‑p sampling, no‑cache, BOS/EOS‑corrected); a sketch of the no‑cache greedy loop follows this list.
- Verified that the `<image>` token ID from the tokenizer matches `config.image_token_id`.
- Inspected the decoder inputs with `show_inputs()` – shapes and types appear correct.
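The no‑cache greedy variant looks roughly like this. The tensor names (`pixel_values`, `input_ids`, `inputs_embeds`, `attention_mask`, `logits`) and the splicing of image features into the `<image>` positions are my best guesses from inspecting `get_inputs()`/`get_outputs()`; the merged decoder may additionally expect empty `past_key_values` tensors and a `use_cache_branch` flag, which I have omitted here:

```python
import numpy as np

def greedy_caption(pixel_values, input_ids, image_token_id, eos_token_id, max_new_tokens=64):
    # 1. Encode the image once (output name/shape assumed).
    image_feats = vision_sess.run(None, {"pixel_values": pixel_values})[0]  # (1, n_img, hidden)?

    generated = list(input_ids)
    for _ in range(max_new_tokens):
        ids = np.array([generated], dtype=np.int64)

        # 2. Embed the current text sequence (re-embedded every step, no KV cache).
        embeds = embed_sess.run(None, {"input_ids": ids})[0]  # (1, seq, hidden)?

        # 3. Splice the image features into the <image> placeholder positions.
        mask = ids[0] == image_token_id
        if mask.sum() != image_feats.shape[1]:
            raise ValueError("number of <image> tokens does not match image features")
        embeds[0, mask] = image_feats[0]

        # 4. Run the decoder and greedily pick the next token from the last position.
        attn = np.ones_like(ids)
        logits = decoder_sess.run(None, {"inputs_embeds": embeds, "attention_mask": attn})[0]
        next_id = int(logits[0, -1].argmax())
        generated.append(next_id)
        if next_id == eos_token_id:
            break

    return generated[len(input_ids):]
```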
Observed behavior: All generation functions return captions unrelated to the image (e.g., repeated “3D‑CAM camera with a lens and a tripod…”).
Request:
Could the ONNX Runtime / Optimum community help identify why the decoder logits differ from the original PyTorch implementation? Specifically, I need guidance on constructing the exact input sequence (including the <image> placeholder and correct BOS/EOS IDs) that the ONNX model expects, or any other missing preprocessing step.
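Concretely, I am unsure whether the following is the correct way to build that sequence; the prompt text and image file name are placeholders, and I may be missing a preprocessing step that the Space performs:

```python
from transformers import AutoProcessor
from PIL import Image

# Assumption: the same processor as the PyTorch model produces the exact
# input sequence (image placeholder expansion, BOS, chat template).
processor = AutoProcessor.from_pretrained(REPO_ID)  # same placeholder repo id as above

image = Image.open("lake_zurich.png")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")

# inputs["input_ids"] should already contain the expanded <image> placeholders
# and the correct BOS; inputs["pixel_values"] feeds the vision encoder.
print(processor.tokenizer.decode(inputs["input_ids"][0]))
```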
The full notebook and a minimal script are available in the GitHub repository: https://github.com/harisnae/granite-docling-ONNX. Any insights, patches, or suggestions are greatly appreciated.