---
base_model:
- Qwen/Qwen3-VL-8B-Instruct
datasets:
- OneThink/OneThinker-train-data
pipeline_tag: any-to-any
library_name: transformers
license: apache-2.0
---

# OneThinker: All-in-one Reasoning Model for Image and Video



This repository contains the **SFT model** presented in the paper [OneThinker: All-in-one Reasoning Model for Image and Video](https://arxiv.org/pdf/2512.03043).

It is an intermediate checkpoint intended as the starting point for subsequent RL training.

For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).
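
As a quick sanity check, the checkpoint should load through the standard `transformers` image-text-to-text interface (the card's `library_name` is `transformers` and the base model is Qwen3-VL). The snippet below is a minimal sketch under those assumptions, not an official example: the repository id `OneThink/OneThinker-SFT` is a placeholder, and the prompt, image URL, and generation settings are ours; refer to the GitHub repository for the supported inference pipeline.

```python
# Minimal loading sketch (assumptions: repo id, generic Auto classes, and that
# the processor's chat template accepts image items via a "url" field).
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "OneThink/OneThinker-SFT"  # placeholder; use this repository's actual id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A single image question-answering turn in the standard chat-message format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/demo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe this image, thinking step by step."},
        ],
    }
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt.
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```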



## 👀 About OneThinker

<div align="center">
  <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
</div>

We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.

OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality chain-of-thought (CoT) annotations for the SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations.
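
To make the reward-balancing idea concrete, the sketch below illustrates one plausible reading of the EMA-GRPO normalization step: each task maintains an exponential moving average of its reward standard deviation, which is then used to rescale group-relative advantages so that no single task's reward scale dominates the policy update. This is only an illustration of the description above, not the authors' implementation; names and values such as `ema_std`, `beta`, and `eps` are our assumptions, and the official code lives in the GitHub repository.

```python
# Illustrative sketch of task-wise EMA reward-std balancing (not the official
# EMA-GRPO code; hyperparameter names and values are assumptions).
from collections import defaultdict
import statistics

beta = 0.99   # EMA decay (assumed)
eps = 1e-6    # numerical floor (assumed)
ema_std = defaultdict(lambda: 1.0)  # per-task moving average of reward std

def balanced_advantages(task, rewards):
    """Normalize one task's GRPO rollout-group rewards by that task's EMA std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Track a task-wise moving average of reward standard deviations ...
    ema_std[task] = beta * ema_std[task] + (1.0 - beta) * std
    # ... and divide by it (rather than the per-group std), so tasks with
    # chronically noisy or flat rewards contribute comparably to the update.
    return [(r - mean) / (ema_std[task] + eps) for r in rewards]

# Example: two tasks with very different reward scales get comparable advantages.
print(balanced_advantages("qa", [1.0, 0.0, 1.0, 0.0]))
print(balanced_advantages("grounding", [0.61, 0.58, 0.64, 0.59]))
```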

OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.



## 📄 Citations

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{feng2025onethinker,
  title={OneThinker: All-in-one Reasoning Model for Image and Video},
  author={Feng, Kaituo and Zhang, Manyuan and Li, Hongyu and Fan, Kaixuan and Chen, Shuang and Jiang, Yilei and Zheng, Dian and Sun, Peiwen and Zhang, Yiyuan and Sun, Haoze and others},
  journal={arXiv preprint arXiv:2512.03043},
  year={2025}
}
```