Improve model card: Add tags, license, detailed description, and performance (#1)

- Improve model card: Add tags, license, detailed description, and performance (362721bcedbaeb7a246af06ac928b328521f4cf2)

Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)

README.md +76 -4

@@ -1,10 +1,82 @@
 ---
-datasets:
-- OneThink/OneThinker-train-data
 base_model:
 - Qwen/Qwen3-VL-8B-Instruct
 ---
-This repository contains the model presented in: [OneThinker: All-in-one Reasoning Model for Image and Video](https://arxiv.org/abs/2512.03043)
-Code: https://github.com/tulerfeng/OneThinker

 ---
 base_model:
 - Qwen/Qwen3-VL-8B-Instruct
+datasets:
+- OneThink/OneThinker-train-data
+pipeline_tag: any-to-any
+library_name: transformers
+license: apache-2.0
 ---
+# OneThinker: All-in-one Reasoning Model for Image and Video
+[[📖 Paper](https://huggingface.co/papers/2512.03043)] [[🤗 OneThinker-8B-model](https://huggingface.co/OneThink/OneThinker-8B)] [[🤗 OneThinker-SFT-model](https://huggingface.co/OneThink/OneThinker-SFT-Qwen3-8B)] [[🤗 OneThinker-train-data](https://huggingface.co/datasets/OneThink/OneThinker-train-data)] [[🤗 OneThinker-eval](https://huggingface.co/datasets/OneThink/OneThinker-eval)] [[🔗 Code](https://github.com/tulerfeng/OneThinker)]
+## 👀 About OneThinker
+<div align="center">
+  <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/teaser.png" alt="OneThinker Teaser Image" width="95%">
+</div>
+We introduce **OneThinker**, an all-in-one multimodal reasoning generalist that is **capable of thinking across a wide range of fundamental visual tasks within a single model**.
+OneThinker unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the large-scale **OneThinker-600k** multi-task training corpus and build **OneThinker-SFT-340k** with high-quality CoT annotations for SFT cold start. Furthermore, we propose **EMA-GRPO**, a new RL method that balances heterogeneous reward signals across diverse visual tasks by tracking task-wise moving averages of reward standard deviations for balanced optimization.
+OneThinker demonstrates **strong performance on 31 benchmarks across 10 fundamental vision tasks**, while showing effective knowledge transfer between certain tasks and promising zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist.
+All code, models, and data are fully released.
+## 🔥 News
+- [2025/12/03] We release the code, model, and data of OneThinker.
+## 📍 Features
++ Support Qwen3-VL Training
++ Support Image-Video mixed training
++ Support reward types in diverse visual tasks
++ Provide the full pipeline (dataset, SFT training, RL training, evaluation, etc.)
+## 🔍 Dataset
+Our dataset covers both image and video modalities and spans a series of fundamental visual reasoning tasks, including rule-based QA, open-ended QA, captioning, spatial grounding, temporal grounding, spatio-temporal grounding, tracking, and segmentation.
+<div align="center">
+  <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/dataset.png" alt="OneThinker Dataset Overview" width="90%">
+</div>
+To enable effective SFT initialization for reasoning, we leverage a strong proprietary model, Seed1.5-VL, to produce CoT annotations.
+## 🏆 Performance
+Starting from Qwen3-VL-8B-Instruct, our training yields significant performance gains across diverse visual tasks. For example, OneThinker-8B reaches 70.6% accuracy on MMMU, 64.3% on MathVerse, 66.2% on VideoMMMU, 93.7 on RefCOCO-testA, and 54.9 J&F on ReasonVOS.
+<div align="center">
+  <img src="https://github.com/tulerfeng/OneThinker/raw/main/assets/performance.png" alt="OneThinker Performance Table" width="90%">
+</div>
+We also observe beneficial cross-task and cross-modality knowledge transfer, along with promising preliminary zero-shot generalization under unified training. This highlights the effectiveness and generalization ability of our unified training framework across diverse visual tasks.
+## 🎥 Demo
+For detailed interactive demos with reasoning examples across various tasks (QA, Tracking, Segmentation), please refer to the [GitHub repository's Demo section](https://github.com/tulerfeng/OneThinker#demo).
+## 🚀 Inference & Evaluation
+For inference on a single example, you may refer to the provided script in the GitHub repository:
+```bash
+python ./Evaluation/inference_single/inference.py
+```
+For more detailed instructions on environment setup, training scripts, and comprehensive evaluation, please refer to the [OneThinker GitHub repository](https://github.com/tulerfeng/OneThinker).
+## 📄 Citations
+If you find our work helpful for your research, please consider citing it.
+```bibtex
+@article{feng2025onethinker,
+  title={OneThinker: All-in-one Reasoning Model for Image and Video},
+  author={Feng, Kaituo and Zhang, Manyuan and Li, Hongyu and Fan, Kaixuan and Chen, Shuang and Jiang, Yilei and Zheng, Dian and Sun, Peiwen and Zhang, Yiyuan and Sun, Haoze and others},
+  journal={arXiv preprint arXiv:2512.03043},
+  year={2025}
+}
+```
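
Note on EMA-GRPO: the model card describes it as balancing heterogeneous reward signals "by tracking task-wise moving averages of reward standard deviations". The snippet below is a minimal sketch of that idea only, not the authors' implementation. The class name `TaskwiseEmaStd`, the `decay` and `eps` parameters, and the choice to center rewards on the group mean before scaling are all assumptions made for illustration; the actual update rule is defined in the paper and the GitHub repository.

```python
from statistics import fmean, pstdev


class TaskwiseEmaStd:
    """Hypothetical helper: keeps one exponential moving average (EMA)
    of the reward standard deviation per task, and uses it to scale
    group-centered rewards into advantages."""

    def __init__(self, decay=0.99, eps=1e-6):
        self.decay = decay   # assumed smoothing factor, not from the source
        self.eps = eps       # guards against division by zero
        self.ema_std = {}    # task name -> running EMA of reward std

    def update(self, task, rewards):
        """Fold the std of the current reward group into the task's EMA."""
        batch_std = pstdev(rewards)
        prev = self.ema_std.get(task)
        if prev is None:
            # First group seen for this task: initialize with its std.
            self.ema_std[task] = batch_std
        else:
            self.ema_std[task] = self.decay * prev + (1 - self.decay) * batch_std
        return self.ema_std[task]

    def advantages(self, task, rewards):
        """Center rewards on the group mean, then divide by the task-level
        EMA std instead of the std of this single group."""
        scale = self.update(task, rewards) + self.eps
        mean = fmean(rewards)
        return [(r - mean) / scale for r in rewards]


tracker = TaskwiseEmaStd(decay=0.5)
qa_adv = tracker.advantages("qa", [0.0, 1.0, 0.0, 1.0])
grounding_adv = tracker.advantages("grounding", [0.2, 0.4])
```

Because each task keeps its own running scale, a task with sparse 0/1 rewards (such as rule-based QA) and a task with dense, small-variance rewards (such as an overlap score for grounding) are normalized separately, and a single low-variance group does not blow up the advantages the way per-group normalization could.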