qwen2_5vl-3b-roi-K21T3-152k-v1bf16Mheads-twiginit-filled

This model accompanies the paper [Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception](https://arxiv.org/abs/2509.16944).

## Introduction

While recent methods leverage a Region-of-Interest (RoI) mechanism to focus on salient areas, they typically present a difficult trade-off: training-based approaches depend on large-scale annotated datasets, while training-free methods that utilize the model's internal attention are computationally inefficient, requiring either multi-pass prefill stages or reliance on the slow auto-regressive decoding process for RoI identification. We propose an efficient, annotation-free Self-Distilled Region Proposal Network (SD-RPN) that resolves this trade-off. Our core innovation is a pipeline that processes and denoises the noisy cross-attention maps from the MLLM's middle layers to generate pseudo-RoI labels. We then use these labels to train a lightweight and tunable Region Proposal Network (RPN) that is built upon the frozen MLLM backbone. Our RPN predicts the RoI in a single forward pass using features available from the MLLM's middle layers, completely decoupling RoI identification from the auto-regressive generation process and avoiding costly multi-pass operations.
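The pseudo-labeling idea described above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `pseudo_roi_labels` and `roi_bbox`, the min-max normalization, and the two thresholds are all assumptions chosen for illustration. It only shows how a noisy attention map might be turned into confident foreground, confident background, and ignored ambiguous positions before training a region predictor.

```python
def pseudo_roi_labels(attn, fg_thresh=0.6, bg_thresh=0.2):
    """Map a 2-D attention grid (list of lists) to pseudo-labels.

    1 = likely RoI, 0 = likely background, -1 = ambiguous (to be ignored).
    The thresholds are hypothetical values, not taken from the paper.
    """
    flat = [v for row in attn for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    labels = []
    for row in attn:
        out = []
        for v in row:
            # Min-max normalize so the thresholds are comparable across images.
            norm = (v - lo) / span if span > 0 else 0.0
            if norm >= fg_thresh:
                out.append(1)
            elif norm <= bg_thresh:
                out.append(0)
            else:
                out.append(-1)
        labels.append(out)
    return labels


def roi_bbox(labels):
    """Return (row_min, col_min, row_max, col_max) of label-1 cells, or None."""
    cells = [(r, c) for r, row in enumerate(labels)
             for c, v in enumerate(row) if v == 1]
    if not cells:
        return None
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    return min(rows), min(cols), max(rows), max(cols)


if __name__ == "__main__":
    attn = [[0.0, 0.1, 0.0],
            [0.1, 1.0, 0.9],
            [0.0, 0.5, 0.0]]
    labels = pseudo_roi_labels(attn)
    print(labels)
    print(roi_bbox(labels))
```

Keeping an explicit "ignore" label is one way to avoid training on positions where the attention signal is unreliable; the actual denoising procedure and label assignment are described in the paper and repository.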

For more details, code, and training instructions, visit the [GitHub repository](https://github.com/YuHengsss/SD-RPN).

## Citation

If you use this model, please cite the original paper:

```bibtex
@misc{shi2025catching,
  title={Catching the Details: Self-Distilled RoI Predictors for Fine-Grained MLLM Perception},
  author={Yuheng Shi and Xiaohuan Pei and Minjing Dong and Chang Xu},
  year={2025},
  eprint={2509.16944},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
```

Model size: 4B params (Safetensors)

Tensor type: BF16

Base model: Qwen/Qwen2.5-VL-3B-Instruct