Any-to-Any · Transformers · Safetensors · qwen3_vl · image-to-text
KaituoFeng committed (verified)
Commit 37d21ee · 1 Parent(s): 2a13d6e

Update README.md

Files changed (1)
README.md +9 -43
README.md CHANGED
@@ -10,63 +10,29 @@ license: apache-2.0

# OneThinker: All-in-one Reasoning Model for Image and Video

- [[📖 Paper](https://huggingface.co/papers/2512.03043)] [[🤗 OneThinker-8B-model](https://huggingface.co/OneThink/OneThinker-8B)] [[🤗 OneThinker-SFT-model](https://huggingface.co/OneThink/OneThinker-SFT-Qwen3-8B)] [[🤗 OneThinker-train-data](https://huggingface.co/datasets/OneThink/OneThinker-train-data)] [[🤗 OneThinker-eval](https://huggingface.co/datasets/OneThink/OneThinker-eval)] [[🔗 Code](https://github.com/tulerfeng/OneThinker)]

- ## 👀 About OneThinker

- <div align="center">
- <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
- </div>

- We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.

- OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality CoT annotations for SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations for balanced optimization.

- OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.

- All code, models, and data are fully released.

- ## 🔥 News
- - [2025/12/03] We release the code, model, and data of OneThinker.

- ## 📝 Features

- + Support Qwen3-VL training
- + Support image-video mixed training
- + Support reward types for diverse visual tasks
- + Provide the full pipeline (dataset, SFT training, RL training, evaluation, etc.)

- ## 🔍 Dataset

- Our dataset covers both image and video modalities and spans a series of fundamental visual reasoning tasks, including rule-based QA, open-ended QA, captioning, spatial grounding, temporal grounding, spatio-temporal grounding, tracking, and segmentation.

- <div align="center">
- <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/dataset.png" alt="OneThinker Dataset Overview" width="90%">
- </div>

- To enable effective SFT initialization for reasoning, we leverage a strong proprietary model, Seed1.5-VL, to produce CoT annotations.

- ## 🏆 Performance

- Trained on top of Qwen3-VL-Instruct-8B, our model obtains significant performance gains across diverse visual tasks. For example, OneThinker-8B reaches 70.6% accuracy on MMMU, 64.3% on MathVerse, 66.2% on VideoMMMU, 93.7 on RefCOCO-testA, and 54.9 J&F on ReasonVOS.

<div align="center">
- <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/performance.png" alt="OneThinker Performance Table" width="90%">
</div>

- In addition, we observe beneficial cross-task and cross-modality knowledge transfer, along with promising preliminary zero-shot generalization under unified training. This highlights the effectiveness and generalization ability of our unified training framework across diverse visual tasks.

- ## 🎥 Demo

- For detailed interactive demos with reasoning examples across various tasks (QA, Tracking, Segmentation), please refer to the [GitHub repository's Demo section](https://github.com/tulerfeng/OneThinker#demo).

- ## 🚀 Inference & Evaluation

- For inference on a single example, you may refer to the provided script in the GitHub repository:
- ```bash
- python ./Evaluation/inference_single/inference.py
- ```
- For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).

## 📄 Citations

# OneThinker: All-in-one Reasoning Model for Image and Video

+ [[📖 Paper](https://huggingface.co/papers/2512.03043)]

+ This repository contains the **SFT model** presented in *OneThinker: All-in-one Reasoning Model for Image and Video*.

+ This is an intermediate model prepared for subsequent RL training.

+ For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).
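
As a minimal usage sketch (not the official inference script), the SFT checkpoint should be loadable through the standard Transformers image-text-to-text interface. The repo id `OneThink/OneThinker-SFT-Qwen3-8B` comes from the model links elsewhere in this diff; the class names, chat-template handling, generation settings, image path, and question below are assumptions and placeholders, so consult the GitHub repository for the supported workflow.

```python
# Hypothetical usage sketch; NOT the official OneThinker inference script.
# Assumes a recent transformers release with Qwen3-VL support.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "OneThink/OneThinker-SFT-Qwen3-8B"  # SFT checkpoint from the model links
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A single image-QA example; "example.jpg" and the question are placeholders.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "example.jpg"},
        {"type": "text", "text": "How many people are in the image? Think step by step."},
    ],
}]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens (the model's reasoning and answer).
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```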

+ ## 👀 About OneThinker

<div align="center">
+ <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
</div>

+ We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.

+ OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality CoT annotations for SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations for balanced optimization.
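
To make the EMA-GRPO description above concrete, the sketch below illustrates the stated idea of tracking a task-wise exponential moving average of reward standard deviations and using it to normalize group advantages, so that tasks with very different reward scales are optimized in a balanced way. This is an illustrative reconstruction, not the released implementation; the class, the momentum value, and the task names are hypothetical.

```python
# Illustrative sketch of the EMA-GRPO idea described above; NOT the released
# implementation. A task-wise exponential moving average (EMA) of the reward
# standard deviation replaces the per-group std when normalizing advantages,
# balancing heterogeneous reward scales across tasks.
from collections import defaultdict

import numpy as np


class TaskRewardNormalizer:
    def __init__(self, momentum: float = 0.99, eps: float = 1e-6):
        self.momentum = momentum  # hypothetical EMA momentum
        self.eps = eps
        self.ema_std = defaultdict(lambda: None)  # task name -> EMA of reward std

    def advantages(self, task: str, rewards: np.ndarray) -> np.ndarray:
        """Group-relative advantages for one prompt's sampled rollouts."""
        std = rewards.std()
        prev = self.ema_std[task]
        # Track the task-wise moving average of the reward standard deviation.
        self.ema_std[task] = std if prev is None else (
            self.momentum * prev + (1.0 - self.momentum) * std
        )
        # Center rewards within the group, then scale by the task-level EMA std
        # instead of the per-group std used in vanilla GRPO.
        return (rewards - rewards.mean()) / (self.ema_std[task] + self.eps)


# Example: a grounding task with a narrow reward spread vs. a QA task with 0/1 rewards.
norm = TaskRewardNormalizer()
print(norm.advantages("spatial_grounding", np.array([0.71, 0.75, 0.69, 0.73])))
print(norm.advantages("rule_based_qa", np.array([1.0, 0.0, 0.0, 1.0])))
```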

+ OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.

## 📄 Citations