---
license: mit
datasets:
- cadene/droid_1.0.1
language:
- en
base_model:
- stabilityai/stable-video-diffusion-img2vid
pipeline_tag: robotics
tags:
- action_conditioned_video_model
---

# Ctrl-World: A Controllable Generative World Model for Robot Manipulation

[Yanjiang Guo*](https://robert-gyj.github.io), [Lucy Xiaoyang Shi*](https://lucys0.github.io), [Jianyu Chen](http://people.iiis.tsinghua.edu.cn/~jychen/), [Chelsea Finn](https://ai.stanford.edu/~cbfinn/) \*Equal contribution; Stanford University, Tsinghua University
## TL;DR

[**Ctrl-World**](https://sites.google.com/view/ctrl-world) is an action-conditioned world model that is compatible with modern VLA policies and supports policy-in-the-loop rollouts entirely in imagination. These imagined rollouts can be used to evaluate and improve the **instruction-following** ability of a VLA policy.


## Model Details

This repository contains the Ctrl-World model checkpoint trained on the open-sourced [**DROID dataset**](https://droid-dataset.github.io/) (~95k trajectories, 564 scenes). The DROID platform consists of a Franka Panda robotic arm equipped with a Robotiq gripper and three cameras: two randomly placed third-person cameras and one wrist-mounted camera.

## Usage

See the official [**Ctrl-World GitHub repo**](https://github.com/Robert-gyj/Ctrl-World/tree/main) for detailed usage; a minimal checkpoint-download sketch is included at the end of this card.

## Acknowledgement

Ctrl-World is built on the open-sourced video foundation model [Stable-Video-Diffusion](https://github.com/Stability-AI/generative-models). The VLA model used in this repo comes from [openpi](https://github.com/Physical-Intelligence/openpi). We thank the authors for their efforts!

## Bibtex

If you find our work helpful, please leave us a star and cite our paper. Thank you!

```
@article{guo2025ctrl,
  title={Ctrl-World: A Controllable Generative World Model for Robot Manipulation},
  author={Guo, Yanjiang and Shi, Lucy Xiaoyang and Chen, Jianyu and Finn, Chelsea},
  journal={arXiv preprint arXiv:2510.10125},
  year={2025}
}
```
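
## Downloading the checkpoint

A minimal sketch of fetching the checkpoint files referenced in the Usage section above, using the standard `huggingface_hub` API. The repo id below is a placeholder (use the id shown at the top of this model card), and the local directory name is arbitrary; actual inference and policy-in-the-loop rollout are driven by the scripts in the Ctrl-World GitHub repo.

```python
# Minimal sketch: download the Ctrl-World checkpoint files from the Hugging Face Hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<this-model-card-repo-id>",   # placeholder: replace with this repo's id
    local_dir="checkpoints/ctrl-world",    # arbitrary local target directory
)
print(f"Checkpoint files downloaded to {local_dir}")
```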