Incorrect caption generated by ONNX version – need help reproducing the reference output
Dear Joshua Lochner (@Xenova),
I have created a Colab notebook that loads the ONNX files (vision_encoder.onnx, embed_tokens.onnx, decoder_model_merged.onnx) from this repository and runs inference with ONNX Runtime. The notebook executes without errors, but the caption produced for the sample image (Lake Zurich) does not match the one shown in the official Hugging Face Space demo ("A lake with trees in the background and a ball in the foreground.").
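For context, the sessions are set up roughly as follows (the repo id placeholder, the `onnx/` subfolder, and the `.onnx_data` suffix are my assumptions about this repository's layout; adjust as needed):

```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download

REPO_ID = "<owner>/<repo>"  # placeholder: replace with this repository's id

# Download each graph together with its external-data companion.
# hf_hub_download preserves the repo layout, so the .onnx_data file ends up
# next to its .onnx file and ONNX Runtime can resolve it.
names = ["vision_encoder", "embed_tokens", "decoder_model_merged"]
paths = {}
for name in names:
    paths[name] = hf_hub_download(repo_id=REPO_ID, filename=f"onnx/{name}.onnx")
    hf_hub_download(repo_id=REPO_ID, filename=f"onnx/{name}.onnx_data")

vision_sess = ort.InferenceSession(paths["vision_encoder"])
embed_sess = ort.InferenceSession(paths["embed_tokens"])
decoder_sess = ort.InferenceSession(paths["decoder_model_merged"])

# Dump the expected input names, shapes, and dtypes for each session.
for label, sess in [("vision", vision_sess), ("embed", embed_sess), ("decoder", decoder_sess)]:
    print(label, [(i.name, i.shape, i.type) for i in sess.get_inputs()])
```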
What I have tried:
- Installed `optimum[onnxruntime]>=1.20`, `transformers`, `torch`, and `onnxruntime`, as in the original demo.
- Downloaded the three ONNX model files and their `.onnx_data` companions via `hf_hub_download`.
- Built separate ONNX Runtime `InferenceSession`s for the vision encoder, token embedder, and decoder (as in the snippet above).
- Implemented several generation strategies (greedy, top‑k/top‑p sampling, no‑cache, BOS/EOS‑corrected); a sketch of the no‑cache greedy loop follows this list.
- Verified that the `<image>` token ID from the tokenizer matches `config.image_token_id`.
- Inspected the decoder inputs with `show_inputs()` – shapes and types appear correct.
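The no‑cache greedy variant looks roughly like this. The tensor names (`pixel_values`, `input_ids`, `inputs_embeds`, `attention_mask`, `logits`) and the splicing of image features into the `<image>` positions are my best guesses from inspecting `get_inputs()`/`get_outputs()`; the merged decoder may additionally expect empty `past_key_values` tensors and a `use_cache_branch` flag, which I have omitted here:

```python
import numpy as np

def greedy_caption(pixel_values, input_ids, image_token_id, eos_token_id, max_new_tokens=64):
    # 1. Encode the image once (output name/shape assumed).
    image_feats = vision_sess.run(None, {"pixel_values": pixel_values})[0]  # (1, n_img, hidden)?

    generated = list(input_ids)
    for _ in range(max_new_tokens):
        ids = np.array([generated], dtype=np.int64)

        # 2. Embed the current text sequence (re-embedded every step, no KV cache).
        embeds = embed_sess.run(None, {"input_ids": ids})[0]  # (1, seq, hidden)?

        # 3. Splice the image features into the <image> placeholder positions.
        mask = ids[0] == image_token_id
        if mask.sum() != image_feats.shape[1]:
            raise ValueError("number of <image> tokens does not match image features")
        embeds[0, mask] = image_feats[0]

        # 4. Run the decoder and greedily pick the next token from the last position.
        attn = np.ones_like(ids)
        logits = decoder_sess.run(None, {"inputs_embeds": embeds, "attention_mask": attn})[0]
        next_id = int(logits[0, -1].argmax())
        generated.append(next_id)
        if next_id == eos_token_id:
            break

    return generated[len(input_ids):]
```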
Observed behavior: All generation functions return captions unrelated to the image (e.g., repeated “3D‑CAM camera with a lens and a tripod…”).
Request:
Could the ONNX Runtime / Optimum community help identify why the decoder logits differ from the original PyTorch implementation? Specifically, I need guidance on constructing the exact input sequence (including the <image> placeholder and correct BOS/EOS IDs) that the ONNX model expects, or any other missing preprocessing step.
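Concretely, I am unsure whether the following is the correct way to build that sequence; the prompt text and image file name are placeholders, and I may be missing a preprocessing step that the Space performs:

```python
from transformers import AutoProcessor
from PIL import Image

# Assumption: the same processor as the PyTorch model produces the exact
# input sequence (image placeholder expansion, BOS, chat template).
processor = AutoProcessor.from_pretrained(REPO_ID)  # same placeholder repo id as above

image = Image.open("lake_zurich.png")
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="np")

# inputs["input_ids"] should already contain the expanded <image> placeholders
# and the correct BOS; inputs["pixel_values"] feeds the vision encoder.
print(processor.tokenizer.decode(inputs["input_ids"][0]))
```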
The full notebook and a minimal script are available in the GitHub repository: https://github.com/harisnae/granite-docling-ONNX. Any insights, patches, or suggestions are greatly appreciated.