---
datasets:
- letxbe/BoundingDocs
language:
- en
pipeline_tag: visual-question-answering
tags:
- Visual-Question-Answering
- Question-Answering
- Document
license: apache-2.0
---

# DocExplainer: Document VQA with Bounding Box Localization

DocExplainer is an approach to Document Visual Question Answering (Document VQA) with bounding box localization. Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable. It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.

- **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
- **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
- **License:** apache-2.0
- **Paper:** ["Towards Reliable and Interpretable Document Question Answering via VLMs"](https://arxiv.org/abs/2509.10129) by Alessio Chen et al.
## Model Details

DocExplainer is a fine-tuned [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384)-based regressor that predicts bounding box coordinates for answer localization in document images.

The system operates in a two-stage process:

1. **Question Answering**: Any VLM is used as a black-box component to generate a textual answer, given a document image and a question as input.
2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer to predict the coordinates of the supporting evidence.

## Model Architecture

DocExplainer builds on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.

![DocExplainer architecture diagram](https://i.postimg.cc/2yN9GYwb/Blank-diagram-5.jpg)

## Training Procedure

- Visual and textual embeddings from SigLIP2 are projected into a shared latent space and fused via fully connected layers.
- A regression head outputs normalized coordinates `[x1, y1, x2, y2]`.
- **Backbone**: SigLIP2 Giant (frozen).
- **Loss Function**: Smooth L1 (Huber) loss applied to normalized coordinates in [0, 1].

#### Training Setup

- **Dataset**: [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs)
- **Epochs**: 20
- **Optimizer**: AdamW
- **Hardware**: 1 × NVIDIA L40S-1-48G GPU
- **Model Selection**: Best checkpoint chosen by highest mean IoU on the validation split.

## Quick Start

Here is a simple example of how to use `DocExplainer` to get an answer and its corresponding bounding box from a document image.

```python
from PIL import Image
import requests
import torch
from transformers import AutoModel, AutoModelForImageTextToText, AutoProcessor
import json

url = "https://i.postimg.cc/BvftyvS3/image-1d100e9.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "What is the invoice number?"

# -----------------------
# 1. Load SmolVLM2-2.2B for answer generation
# -----------------------
vlm_model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

PROMPT = """Based only on the document image, answer the following question:
Question: {QUESTION}
Provide ONLY a JSON response in the following format (no trailing commas!):
{{
    "content": "answer"
}}
"""

prompt_text = PROMPT.format(QUESTION=question)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt_text},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(vlm_model.device, dtype=torch.bfloat16)

input_length = inputs['input_ids'].shape[1]
generated_ids = vlm_model.generate(**inputs, do_sample=False, max_new_tokens=2056)
output_ids = generated_ids[:, input_length:]

generated_texts = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
)

decoded_output = generated_texts[0].replace("Assistant:", "", 1).strip()
answer = json.loads(decoded_output)['content']
print(f"Answer: {answer}")

# -----------------------
# 2. Load DocExplainer for bounding box prediction
# -----------------------
explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)
bbox = explainer.predict(image, answer)
print(f"Predicted bounding box (normalized): {bbox}")
```
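Note that loading the VLM with `attn_implementation="flash_attention_2"` requires the `flash-attn` package and a compatible GPU. If that is not available, a likely workaround (an assumption on our part, not part of the original example) is to fall back to PyTorch's built-in scaled-dot-product attention:

```python
# Assumed fallback when flash-attn is unavailable: use PyTorch SDPA attention instead.
vlm_model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="sdpa",
)
```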
**Example Output:**

**Question**: What is the invoice number?
**Answer**: 3Y8M2d-846

**Predicted BBox**: [0.6353235244750977, 0.03685223311185837, 0.8617828488349915, 0.058749228715896606]
**Visualized Answer Location**: *(invoice image with the predicted bounding box overlaid)*
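To reproduce a visualization like this, the normalized `[x1, y1, x2, y2]` box returned by `explainer.predict` can be scaled back to pixel coordinates and drawn with Pillow. A minimal sketch, reusing the `image` and `bbox` variables from the Quick Start example:

```python
from PIL import ImageDraw

# Scale the normalized [x1, y1, x2, y2] box back to pixel coordinates.
w, h = image.size
x1, y1, x2, y2 = bbox
pixel_box = (x1 * w, y1 * h, x2 * w, y2 * h)

# Draw the predicted answer location on a copy of the document image.
annotated = image.copy()
draw = ImageDraw.Draw(annotated)
draw.rectangle(pixel_box, outline="red", width=3)
annotated.save("answer_location.png")
```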
## Performance

| Architecture                 | Prompting | ANLS         | MeanIoU      |
|------------------------------|-----------|--------------|--------------|
| Smolvlm-2.2B                 | Zero-shot | 0.527        | 0.011        |
|                              | Anchors   | 0.543        | 0.026        |
|                              | CoT       | 0.561        | 0.011        |
| Qwen2-vl-7B                  | Zero-shot | 0.691        | 0.048        |
|                              | Anchors   | 0.694        | 0.051        |
|                              | CoT       | <u>0.720</u> | 0.038        |
| Claude Sonnet 4              | Zero-shot | **0.737**    | 0.031        |
| Smolvlm-2.2B + DocExplainer  | Zero-shot | 0.572        | 0.175        |
| Qwen2-vl-7B + DocExplainer   | Zero-shot | 0.689        | 0.188        |
| Smol + Naive OCR             | Zero-shot | 0.556        | <u>0.405</u> |
| Qwen + Naive OCR             | Zero-shot | 0.690        | **0.494**    |

Document VQA performance of different models and prompting strategies on the [BoundingDocs v2.0 dataset](https://huggingface.co/datasets/letxbe/BoundingDocs). The best value is shown in **bold**, the second-best value is <u>underlined</u>.
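For reference, MeanIoU is the intersection-over-union between the predicted and ground-truth answer boxes, presumably averaged over the evaluation questions. A minimal, illustrative IoU helper for normalized `[x1, y1, x2, y2]` boxes (not the official evaluation code):

```python
def iou(box_a, box_b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes in normalized coordinates."""
    # Intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    # Union = sum of the two areas minus the intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```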
## Citation

If you use `DocExplainer`, please cite:

```bibtex
@misc{chen2025reliableinterpretabledocumentquestion,
      title={Towards Reliable and Interpretable Document Question Answering via VLMs},
      author={Alessio Chen and Simone Giovannini and Andrea Gemelli and Fabio Coppini and Simone Marinai},
      year={2025},
      eprint={2509.10129},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.10129},
}
```

## Limitations

- **Prototype only**: Intended as a first approach, not a production-ready solution.
- **Dataset constraints**: The current evaluation is limited to cases where the answer fits in a single bounding box. Answers that require reasoning over multiple regions, or that are not fully captured by OCR, cannot be properly evaluated.