arXiv:2511.02650

Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models

Published on Nov 4 · Submitted by Kailin Jiang (蒋凯林) on Nov 5

Abstract

AI-generated summary: UniPruneBench is a unified benchmark for evaluating visual token pruning in multimodal LLMs, providing standardized protocols and system-level metrics to assess performance across various tasks and models.

Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
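To make finding (1) concrete, below is a minimal PyTorch sketch of random visual-token pruning. It illustrates the general technique only; it is not code from UniPruneBench or from any of the benchmarked methods, and the function name, tensor shapes, and `ratio` argument are assumptions made for the example. The `ratio` argument plays the role of the pruning ratio that finding (4) identifies as the dominant factor in performance degradation.

```python
# Illustrative sketch only (assumed PyTorch API and shapes); not the
# UniPruneBench implementation or any benchmarked method's code.
import torch

def random_prune(visual_tokens: torch.Tensor, ratio: float) -> torch.Tensor:
    """Randomly keep a (1 - ratio) fraction of visual tokens per sample.

    visual_tokens: (batch, num_tokens, hidden_dim) output of the image encoder,
    before the tokens are concatenated with text tokens for the LLM prefill.
    """
    batch, num_tokens, hidden_dim = visual_tokens.shape
    num_keep = max(1, int(num_tokens * (1.0 - ratio)))
    # Random permutation of token indices per sample; keep the first num_keep,
    # then sort them so the surviving tokens stay in their original order.
    perm = torch.rand(batch, num_tokens, device=visual_tokens.device).argsort(dim=1)
    keep = perm[:, :num_keep].sort(dim=1).values
    return torch.gather(
        visual_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, hidden_dim)
    )

# Example: a 75% pruning ratio shrinks LLaVA-v1.5's 576 visual tokens to 144.
tokens = torch.randn(2, 576, 4096)
pruned = random_prune(tokens, ratio=0.75)  # shape: (2, 144, 4096)
```

Because prefill cost grows with the length of the combined visual-plus-text sequence, shortening the visual portion in this way is what drives the runtime and prefilling-latency gains the benchmark measures.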

Community

Paper author Paper submitter

Large multimodal models (LMMs) often suffer from severe inference inefficiency due to the large number of visual tokens introduced by image encoders. While recent token compression methods, such as pruning and merging, have shown promise in reducing redundancy, their evaluation remains fragmented and inconsistent. In this work, we present UniPruneBench, a unified and extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench provides standardized protocols across six ability dimensions and ten datasets, covering ten representative compression algorithms and three families of LMMs (LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates system-level metrics such as runtime and prefilling latency to provide a holistic view. Our experiments uncover several key findings: (1) random pruning is a surprisingly strong baseline, (2) no single method consistently outperforms others across scenarios, (3) pruning sensitivity varies significantly across tasks, with OCR being most vulnerable, and (4) pruning ratio is the dominant factor governing performance degradation. We believe UniPruneBench will serve as a reliable foundation for future research on efficient multimodal modeling.
