arxiv:2510.27363

ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use

Published on Oct 31
· Submitted by Mengjie Deng on Nov 4

Abstract

AI-generated summary: ToolScope, an agentic framework for multimodal large language models, enhances visual question answering by integrating external tools, achieving significant performance improvements across various benchmarks.

Recently, large language models (LLMs) have demonstrated remarkable problem-solving capabilities by autonomously integrating with external tools for collaborative reasoning. However, due to the inherently complex and diverse nature of multimodal information, enabling multimodal large language models (MLLMs) to flexibly and efficiently utilize external tools during reasoning remains an underexplored challenge. In this work, we introduce ToolScope, an agentic framework designed to unify global planning with local multimodal perception, adopting a specialized Perceive tool to mitigate visual context degradation in long-horizon VQA tasks. ToolScope comprises three primary components: the Global Navigator, the Agentic Executor, and the Response Synthesizer. The Global Navigator functions as a "telescope", offering high-level strategic guidance. The Agentic Executor operates iteratively to augment the MLLM with local perception through the integration of external tools: Search, Code, and Perceive. Finally, the Response Synthesizer consolidates and organizes the reasoning process into a coherent, user-friendly output. We evaluate ToolScope on four VQA benchmarks across diverse domains, including VQA 2.0, ScienceQA, MAT-Search and MathVista. It demonstrates strong generalization capabilities, achieving an average performance improvement of up to +6.69% across all datasets.

Community


Paper author · Paper submitter

๐Ÿ› ๏ธ ToolScope: Agentic Framework for Vision-Guided Tool Use

We're excited to share our work on ToolScope, a training-free framework that enhances multimodal LLMs with adaptive tool use for complex visual reasoning tasks.

🎯 Key Contributions

1. Novel Three-Phase Architecture

  • Global Navigator: Strategic task decomposition and tool selection
  • Agentic Executor: Iterative tool-augmented reasoning with dynamic visual re-attention
  • Response Synthesizer: Trajectory condensation and answer formatting

2. Visual Context Preservation
Our dedicated Perceive tool addresses a critical challenge in long-horizon reasoning: it lets the model dynamically re-attend to visual details, mitigating the context degradation that plagues traditional approaches (a minimal sketch of how Perceive fits into the reasoning loop follows this list).

3. Strong Empirical Results

  • Up to +6.69% average gains across 4 VQA benchmarks (VQA 2.0, ScienceQA, MAT-Search, MathVista)
  • +9.12% peak improvement on retrieval-heavy tasks
  • Consistent performance across Qwen2.5-VL, InternVL3, and MiMo-VL backends

4. Plug-and-Play Design

  • No task-specific fine-tuning required
  • Works with off-the-shelf MLLMs via vLLM
  • Modular toolkit: Search (knowledge retrieval) + Code (computation) + Perceive (visual grounding)
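
To make the flow concrete, here is a minimal Python sketch of the Navigator → Executor → Synthesizer loop with the Search/Code/Perceive toolkit. The names (`toolscope_answer`, `call_model`, the `tools` dict) and the prompt strings are illustrative assumptions, not ToolScope's actual API; see the repository for the real implementation.

```python
# Minimal sketch of the Navigator -> Executor -> Synthesizer loop described
# above. All names here (toolscope_answer, call_model, the tools dict) are
# illustrative assumptions, not ToolScope's actual API.
from typing import Callable, Dict, List, Tuple


def toolscope_answer(
    question: str,
    image_ref: str,
    call_model: Callable[[str], str],        # wraps the backbone MLLM (e.g. served via vLLM)
    tools: Dict[str, Callable[[str], str]],  # {"Search": ..., "Code": ..., "Perceive": ...}
    max_steps: int = 6,
) -> str:
    # 1) Global Navigator ("telescope"): high-level plan and tool hints.
    plan = call_model(
        f"Outline the steps and tools (Search, Code, Perceive) needed to answer: {question}"
    )

    # 2) Agentic Executor: iterate, calling tools and logging observations.
    trajectory: List[Tuple[str, str]] = []
    for _ in range(max_steps):
        decision = call_model(
            f"Question: {question}\nImage: {image_ref}\nPlan: {plan}\n"
            f"History: {trajectory}\n"
            "Reply '<Tool>: <argument>' to call a tool, or 'FINAL: <answer>' when done."
        )
        if decision.startswith("FINAL:"):
            break
        tool_name, _, argument = decision.partition(":")
        handler = tools.get(tool_name.strip())
        observation = handler(argument.strip()) if handler else "unknown tool"
        trajectory.append((decision, observation))

    # 3) Response Synthesizer: condense the trajectory into a readable answer.
    return call_model(
        f"Using this reasoning trace {trajectory}, give a concise final answer to: {question}"
    )
```

A caller would wire `call_model` to a vLLM-served backbone and register concrete Search, Code, and Perceive handlers in `tools`; in particular, the Perceive handler would route back to the model (or a grounding module) to re-inspect the image, which is what keeps visual context fresh over long trajectories.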

🚀 Try It Out

Compatible with any vLLM-supported model including InternVL3-8B, Qwen2.5-VL series, and more!
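
As a starting point, the snippet below sketches loading one of the listed backbones through vLLM's offline `LLM.chat` API and asking a single VQA question. The model name, image URL, and prompt are placeholders, and the multimodal message format assumes a recent vLLM release with vision support; it is not ToolScope's own launch script.

```python
# Sketch only: serve an off-the-shelf VLM with vLLM and ask one VQA question.
# Model name, image URL, and prompt are illustrative placeholders.
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-VL-7B-Instruct", max_model_len=8192)
params = SamplingParams(temperature=0.0, max_tokens=512)

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
        {"type": "text", "text": "What is the highest value shown in this chart?"},
    ],
}]

outputs = llm.chat(messages, sampling_params=params)
print(outputs[0].outputs[0].text)
```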

📄 Paper: https://arxiv.org/abs/2510.27363
💻 Code: https://github.com/dengmengjie/ToolScope
