---
base_model:
- Qwen/Qwen3-VL-8B-Instruct
datasets:
- OneThink/OneThinker-train-data
pipeline_tag: any-to-any
library_name: transformers
license: apache-2.0
---

# OneThinker: All-in-one Reasoning Model for Image and Video

This repository contains the **SFT model** presented in the paper [OneThinker: All-in-one Reasoning Model for Image and Video](https://arxiv.org/pdf/2512.03043).

This is an intermediate model prepared for subsequent RL training.

For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).
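
A minimal inference sketch follows. It assumes this checkpoint exposes the standard Qwen3-VL chat interface through `transformers` (`AutoProcessor` / `AutoModelForImageTextToText`); the repo id and the image URL are placeholders, not values taken from this card.

```python
# Minimal usage sketch (assumptions: the checkpoint loads via the standard
# Qwen3-VL image-text-to-text interface in a recent transformers release;
# the repo id and image URL below are placeholders).
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "OneThink/OneThinker-SFT"  # placeholder: replace with this model card's repo id

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/demo.jpg"},  # placeholder image
            {"type": "text", "text": "Describe the image and answer: how many people are visible?"},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=512)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```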

## About OneThinker

<div align="center">
<img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
</div>

We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.

OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality CoT annotations for SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations for balanced optimization.
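
As an illustration of the reward-balancing idea behind EMA-GRPO, the sketch below keeps a task-wise exponential moving average (EMA) of reward standard deviations and uses it to rescale group-wise advantages. This is only a rough reading of the description above, not the authors' implementation; the class name, update rule, and hyperparameters (`decay`, `eps`) are assumptions.

```python
# Rough sketch of the EMA-GRPO reward-balancing idea (assumed form, not the
# authors' code): track an EMA of each task's reward standard deviation and
# rescale advantages so that tasks with very different reward scales receive
# comparably sized policy updates.
import numpy as np


class TaskWiseRewardBalancer:
    def __init__(self, decay: float = 0.99, eps: float = 1e-6):
        self.decay = decay  # EMA decay for the per-task reward std
        self.eps = eps      # numerical floor to avoid division by zero
        self.ema_std: dict[str, float] = {}  # task name -> EMA of reward std

    def update(self, task: str, rewards: np.ndarray) -> None:
        """Fold the std of this task's reward group into its moving average."""
        batch_std = float(rewards.std())
        if task not in self.ema_std:
            self.ema_std[task] = batch_std
        else:
            self.ema_std[task] = self.decay * self.ema_std[task] + (1.0 - self.decay) * batch_std

    def balance(self, task: str, advantages: np.ndarray) -> np.ndarray:
        """Rescale GRPO-style advantages by the task's EMA reward std."""
        return advantages / (self.ema_std.get(task, 1.0) + self.eps)
```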

OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.

## Citations

If you find our work helpful for your research, please consider citing it:

```bibtex
@article{feng2025onethinker,
  title={OneThinker: All-in-one Reasoning Model for Image and Video},
  author={Feng, Kaituo and Zhang, Manyuan and Li, Hongyu and Fan, Kaixuan and Chen, Shuang and Jiang, Yilei and Zheng, Dian and Sun, Peiwen and Zhang, Yiyuan and Sun, Haoze and others},
  journal={arXiv preprint arXiv:2512.03043},
  year={2025}
}
```