---
datasets:
- letxbe/BoundingDocs
language:
- en
pipeline_tag: visual-question-answering
tags:
- Visual-Question-Answering
- Question-Answering
- Document
license: apache-2.0
---
# DocExplainer: Document VQA with Bounding Box Localization

DocExplainer is an approach to Document Visual Question Answering (Document VQA) with bounding box localization.
Unlike standard VLMs that only provide text-based answers, DocExplainer adds **visual evidence through bounding boxes**, making model predictions more interpretable.
It is designed as a **plug-and-play module** to be combined with existing Vision-Language Models (VLMs), decoupling answer generation from spatial grounding.
- **Authors:** Alessio Chen, Simone Giovannini, Andrea Gemelli, Fabio Coppini, Simone Marinai
- **Affiliations:** [Letxbe AI](https://letxbe.ai/), [University of Florence](https://www.unifi.it/it)
- **License:** apache-2.0
- **Paper:** ["Towards Reliable and Interpretable Document Question Answering via VLMs"](https://arxiv.org/abs/2509.10129) by Alessio Chen et al.
## Model Details
DocExplainer is a fine-tuned [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384)-based regressor that predicts bounding box coordinates for answer localization in document images. The system operates in a two-stage process:
1. **Question Answering**: Any VLM is used as a black-box component to generate a textual answer from the input document image and question.
2. **Bounding Box Explanation**: DocExplainer takes the image, question, and generated answer to predict the coordinates of the supporting evidence.
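
Because the two stages are decoupled, any answer-generating VLM can be wrapped together with DocExplainer behind a single helper. The sketch below is illustrative: the `vlm_answer` callable and `answer_with_evidence` are hypothetical names, and only `explainer.predict` mirrors the Quick Start further down.

```python
# Minimal sketch of the two-stage pipeline. `vlm_answer` stands in for any
# black-box VLM call returning a textual answer; only `explainer.predict`
# reflects this repository's usage (see Quick Start).
from typing import Callable, List, Tuple
from PIL import Image

def answer_with_evidence(
    vlm_answer: Callable[[Image.Image, str], str],  # stage 1: any VLM, used as a black box
    explainer,                                      # stage 2: DocExplainer (AutoModel, trust_remote_code=True)
    image: Image.Image,
    question: str,
) -> Tuple[str, List[float]]:
    answer = vlm_answer(image, question)       # textual answer
    bbox = explainer.predict(image, answer)    # normalized [x1, y1, x2, y2] supporting the answer
    return answer, bbox
```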
## Model Architecture
DocExplainer builds on [SigLIP2 Giant](https://huggingface.co/google/siglip2-giant-opt-patch16-384) visual and text embeddings.

## Training Procedure
- Visual and textual embeddings from SigLIP2 are projected into a shared latent space and fused via fully connected layers (a minimal sketch of this design follows this list).
- A regression head outputs normalized coordinates `[x1, y1, x2, y2]`.
- **Backbone**: SigLIP2 Giant (frozen).
- **Loss Function**: Smooth L1 (Huber) loss applied to normalized coordinates in `[0, 1]`.
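
The exact layer sizes of the released model are not reproduced here; the following is a minimal sketch of the projection, fusion, and regression design described above, with assumed dimensions and an assumed sigmoid to keep outputs in `[0, 1]`.

```python
# Sketch of the projection -> fusion -> regression design. Layer sizes and the
# sigmoid output are assumptions, not the exact released architecture.
import torch
import torch.nn as nn

class BBoxRegressor(nn.Module):
    def __init__(self, vis_dim=1536, txt_dim=1536, hidden=1024):  # hypothetical dimensions
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hidden)   # project SigLIP2 image embedding
        self.txt_proj = nn.Linear(txt_dim, hidden)   # project SigLIP2 text embedding (question + answer)
        self.fusion = nn.Sequential(                 # fuse the two modalities
            nn.Linear(2 * hidden, hidden), nn.GELU(),
            nn.Linear(hidden, hidden), nn.GELU(),
        )
        self.head = nn.Linear(hidden, 4)             # regression head: [x1, y1, x2, y2]

    def forward(self, vis_emb, txt_emb):
        z = torch.cat([self.vis_proj(vis_emb), self.txt_proj(txt_emb)], dim=-1)
        return torch.sigmoid(self.head(self.fusion(z)))  # normalized coordinates in [0, 1]

criterion = nn.SmoothL1Loss()  # Huber-style loss on normalized coordinates
```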
### Training Setup
- **Dataset**: [BoundingDocs v2.0](https://huggingface.co/datasets/letxbe/BoundingDocs)
- **Epochs**: 20
- **Optimizer**: AdamW
- **Hardware**: 1 × NVIDIA L40S-1-48G GPU
- **Model Selection**: Best checkpoint chosen by highest mean IoU on the validation split (an illustrative IoU computation is sketched below).
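
Checkpoint selection relies on the mean intersection-over-union between predicted and ground-truth boxes. The snippet below is a standard IoU computation over normalized boxes, shown for illustration; it is not the repository's evaluation code.

```python
# Illustrative IoU between two normalized [x1, y1, x2, y2] boxes; the released
# evaluation code may differ in edge-case handling.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Mean IoU over (prediction, ground truth) pairs drives checkpoint selection.
def mean_iou(preds, targets):
    return sum(iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```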
## Quick Start
Here is a simple example of how to use `DocExplainer` to get an answer and its corresponding bounding box from a document image.
```python
import json

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoModelForImageTextToText, AutoProcessor

# Load an example document image and define the question
url = "https://i.postimg.cc/BvftyvS3/image-1d100e9.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
question = "What is the invoice number?"

# -----------------------
# 1. Load SmolVLM2-2.2B for answer generation
# -----------------------
vlm_model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-2.2B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2",
)
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-2.2B-Instruct")

PROMPT = """Based only on the document image, answer the following question:
Question: {QUESTION}
Provide ONLY a JSON response in the following format (no trailing commas!):
{{
  "content": "answer"
}}
"""
prompt_text = PROMPT.format(QUESTION=question)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt_text},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(vlm_model.device, dtype=torch.bfloat16)

# Generate the answer and decode only the newly generated tokens
input_length = inputs["input_ids"].shape[1]
generated_ids = vlm_model.generate(**inputs, do_sample=False, max_new_tokens=2056)
output_ids = generated_ids[:, input_length:]
generated_texts = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
)

# Strip the role prefix and parse the JSON answer
decoded_output = generated_texts[0].replace("Assistant:", "", 1).strip()
answer = json.loads(decoded_output)["content"]
print(f"Answer: {answer}")

# -----------------------
# 2. Load DocExplainer for bounding box prediction
# -----------------------
explainer = AutoModel.from_pretrained("letxbe/DocExplainer", trust_remote_code=True)
bbox = explainer.predict(image, answer)
print(f"Predicted bounding box (normalized): {bbox}")
```
Example output:

- **Question**: What is the invoice number?
- **Answer**: 3Y8M2d-846
- **Predicted BBox**: `[0.6353235244750977, 0.03685223311185837, 0.8617828488349915, 0.058749228715896606]`
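
Visualized answer location: the predicted coordinates are normalized to `[0, 1]`, so they must be scaled by the image's pixel width and height before drawing. Below is a minimal illustrative helper using Pillow; it is post-processing only and not part of the released code.

```python
# Convert the normalized box to pixel coordinates and draw it on the image.
from PIL import ImageDraw

def draw_bbox(image, bbox, color="red", width=3):
    w, h = image.size
    x1, y1, x2, y2 = bbox
    pixel_box = (x1 * w, y1 * h, x2 * w, y2 * h)  # scale normalized coords to pixels
    out = image.copy()
    ImageDraw.Draw(out).rectangle(pixel_box, outline=color, width=width)
    return out

draw_bbox(image, bbox).save("answer_location.png")
```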
## Performance
| Architecture | Prompting | ANLS | Mean IoU |
|-------------------------------|-----------|-----------|-------------|
| SmolVLM2-2.2B | Zero-shot | 0.527 | 0.011 |
| | Anchors | 0.543 | 0.026 |
| | CoT | 0.561 | 0.011 |
| Qwen2-VL-7B | Zero-shot | 0.691 | 0.048 |
| | Anchors | 0.694 | 0.051 |
| | CoT | <u>0.720</u> | 0.038 |
| Claude Sonnet 4 | Zero-shot | **0.737** | 0.031 |
| SmolVLM2-2.2B + DocExplainer | Zero-shot | 0.572 | 0.175 |
| Qwen2-VL-7B + DocExplainer | Zero-shot | 0.689 | 0.188 |
| SmolVLM2-2.2B + Naive OCR | Zero-shot | 0.556 | <u>0.405</u> |
| Qwen2-VL-7B + Naive OCR | Zero-shot | 0.690 | **0.494** |

Document VQA performance of different models and prompting strategies on the [BoundingDocs v2.0 dataset](https://huggingface.co/datasets/letxbe/BoundingDocs).
The best value is shown in **bold**, the second-best value is underlined.
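
ANLS (Average Normalized Levenshtein Similarity) scores each prediction against the ground-truth answers and zeroes out matches below a similarity threshold (0.5 in the standard Document VQA setup). The sketch below illustrates the metric; the paper's exact evaluation script may differ in string normalization details.

```python
# Minimal ANLS sketch (threshold 0.5, as commonly used for Document VQA).
def levenshtein(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def anls(prediction: str, ground_truths: list[str], threshold: float = 0.5) -> float:
    best = 0.0
    for gt in ground_truths:
        p, g = prediction.strip().lower(), gt.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)  # normalized Levenshtein distance
        best = max(best, (1.0 - nl) if nl < threshold else 0.0)
    return best
```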
## Citation
If you use `DocExplainer`, please cite:
```bibtex
@misc{chen2025reliableinterpretabledocumentquestion,
  title={Towards Reliable and Interpretable Document Question Answering via VLMs},
  author={Alessio Chen and Simone Giovannini and Andrea Gemelli and Fabio Coppini and Simone Marinai},
  year={2025},
  eprint={2509.10129},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.10129},
}
```
## Limitations
- **Prototype only**: Intended as a first approach, not a production-ready solution.
- **Dataset constraints**: Current evaluation is limited to cases where the answer fits in a single bounding box. Answers that require reasoning over multiple regions, or that are not fully captured by OCR, cannot be properly evaluated.