Any-to-Any · Transformers · Safetensors · qwen3_vl · image-to-text
KaituoFeng committed (verified)
Commit 37d21ee · 1 Parent(s): 2a13d6e

Update README.md

Files changed (1)
README.md +9 -43
README.md CHANGED
@@ -10,63 +10,29 @@ license: apache-2.0

# OneThinker: All-in-one Reasoning Model for Image and Video

- [[📖 Paper](https://huggingface.co/papers/2512.03043)] [[🤗 OneThinker-8B-model](https://huggingface.co/OneThink/OneThinker-8B)] [[🤗 OneThinker-SFT-model](https://huggingface.co/OneThink/OneThinker-SFT-Qwen3-8B)] [[🤗 OneThinker-train-data](https://huggingface.co/datasets/OneThink/OneThinker-train-data)] [[🤗 OneThinker-eval](https://huggingface.co/datasets/OneThink/OneThinker-eval)] [[🔗 Code](https://github.com/tulerfeng/OneThinker)]

- ## 👀 About OneThinker

- <div align="center">
- <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
- </div>

- We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.

- OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality CoT annotations for SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations for balanced optimization.

- OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.

- All code, models, and data are fully released.

- ## 🔥 News
- - [2025/12/03] We release the code, model, and data of OneThinker.

- ## 📝 Features

- + Support Qwen3-VL training
- + Support image-video mixed training
- + Support reward types for diverse visual tasks
- + Provide the full pipeline (dataset, SFT training, RL training, evaluation, etc.)

- ## 🔍 Dataset

- Our dataset covers both image and video modalities and spans a series of fundamental visual reasoning tasks, including rule-based QA, open-ended QA, captioning, spatial grounding, temporal grounding, spatio-temporal grounding, tracking, and segmentation.

- <div align="center">
- <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/dataset.png" alt="OneThinker Dataset Overview" width="90%">
- </div>

- To enable effective SFT initialization for reasoning, we leverage a strong proprietary model, Seed1.5-VL, to produce CoT annotations.

- ## 🏆 Performance

- Trained on top of Qwen3-VL-Instruct-8B, our model obtains significant performance gains across diverse visual tasks. For example, OneThinker-8B reaches 70.6% accuracy on MMMU, 64.3% on MathVerse, 66.2% on VideoMMMU, 93.7 on RefCOCO-testA, and 54.9 J&F on ReasonVOS.

<div align="center">
- <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/performance.png" alt="OneThinker Performance Table" width="90%">
</div>

- In addition, we observe beneficial cross-task and cross-modality knowledge transfer, along with promising preliminary zero-shot generalization under unified training. This highlights the effectiveness and generalization ability of our unified training framework across diverse visual tasks.

- ## 🎥 Demo

- For detailed interactive demos with reasoning examples across various tasks (QA, Tracking, Segmentation), please refer to the [GitHub repository's Demo section](https://github.com/tulerfeng/OneThinker#demo).

- ## 🚀 Inference & Evaluation

- For inference on a single example, you may refer to the provided script in the GitHub repository:
- ```bash
- python ./Evaluation/inference_single/inference.py
- ```
- For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).

## 📄 Citations

# OneThinker: All-in-one Reasoning Model for Image and Video

+ [[📖 Paper](https://huggingface.co/papers/2512.03043)]

+ This repository contains the **SFT model** presented in *OneThinker: All-in-one Reasoning Model for Image and Video*.

+ This is an intermediate model prepared for subsequent RL training.

+ For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).
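
As a minimal usage sketch (not the official inference script), the SFT checkpoint should be loadable through the standard Transformers image-text-to-text interface. The repo id `OneThink/OneThinker-SFT-Qwen3-8B` comes from the model links elsewhere in this diff; the class names, chat-template handling, generation settings, image path, and question below are assumptions and placeholders, so consult the GitHub repository for the supported workflow.

```python
# Hypothetical usage sketch; NOT the official OneThinker inference script.
# Assumes a recent transformers release with Qwen3-VL support.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "OneThink/OneThinker-SFT-Qwen3-8B"  # SFT checkpoint from the model links
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A single image-QA example; "example.jpg" and the question are placeholders.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},
        {"type": "text", "text": "How many people are in the image? Think step by step."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens (the model's reasoning and answer).
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```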

+ ## 👀 About OneThinker

<div align="center">
+ <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
</div>

+ We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.

+ OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality CoT annotations for SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations for balanced optimization.
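
To make the EMA-GRPO description above concrete, the sketch below illustrates the stated idea of tracking a task-wise exponential moving average of reward standard deviations and using it to normalize group advantages, so that tasks with very different reward scales are optimized in a balanced way. This is an illustrative reconstruction, not the released implementation; the class, the momentum value, and the task names are hypothetical.

```python
# Illustrative sketch of the EMA-GRPO idea described above; NOT the released
# implementation. A task-wise exponential moving average (EMA) of the reward
# standard deviation replaces the per-group std when normalizing advantages,
# balancing heterogeneous reward scales across tasks.
from collections import defaultdict

import numpy as np


class TaskRewardNormalizer:
    def __init__(self, momentum: float = 0.99, eps: float = 1e-6):
        self.momentum = momentum  # hypothetical EMA momentum
        self.eps = eps
        self.ema_std = defaultdict(lambda: None)  # task name -> EMA of reward std

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        """Group-relative advantages for one prompt's sampled rollouts."""
        std = rewards.std()
        prev = self.ema_std[task]
        # Track the task-wise moving average of the reward standard deviation.
        self.ema_std[task] = std if prev is None else (
            self.momentum * prev + (1.0 - self.momentum) * std
        )
        # Center rewards within the group, then scale by the task-level EMA std
        # instead of the per-group std used in vanilla GRPO.
        return (rewards - rewards.mean()) / (self.ema_std[task] + self.eps)


# Example: a grounding task with a narrow reward spread vs. a QA task with 0/1 rewards.
norm = TaskRewardNormalizer()
print(norm.advantages("spatial_grounding", np.array([0.71, 0.75, 0.69, 0.73])))
print(norm.advantages("rule_based_qa", np.array([1.0, 0.0, 0.0, 1.0])))
```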

+ OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.

## 📄 Citations