---
license: apache-2.0
base_model:
- facebook/chameleon-7b
tags:
- VLA
- Robotics
---

# WorldVLA: Towards Autoregressive Action World Model

If our project helps you, please give us a star ⭐ on GitHub to support us. 🙏🙏
[![arXiv](https://img.shields.io/badge/Arxiv-2506.21539-AD1C18.svg?logo=arXiv)](https://arxiv.org/pdf/2506.21539) [![GitHub](https://img.shields.io/badge/GitHub-WorldVLA-9cf?logo=github)](https://github.com/alibaba-damo-academy/WorldVLA) [![hf_checkpoint](https://img.shields.io/badge/🤗-Checkpoints-9C276A.svg)](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA) [![License](https://img.shields.io/badge/License-Apache%202.0-yellow)](https://raw.githubusercontent.com/alibaba-damo-academy/WorldVLA/main/LICENSE)
## 🌟 Introduction

WorldVLA is an autoregressive action world model that unifies action and image understanding and generation. It integrates a Vision-Language-Action (VLA) model (the action model) and a world model in a single framework.
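Conceptually, both capabilities reduce to next-token prediction over one shared discrete vocabulary: images are encoded into discrete tokens (as in the Chameleon base model) and actions are discretized into tokens as well. The sketch below only illustrates that unified interface; the helper names and prompt layouts are hypothetical, not the repository's actual API.

```python
# Illustrative only: how the two tasks share one autoregressive interface.
# Helper names and layouts are hypothetical, not WorldVLA's actual API.

def build_prompt(mode, text_tokens=None, image_tokens=None, action_tokens=None):
    """Both modes are next-token prediction over one discrete vocabulary:
    images are VQ-encoded into tokens and actions are discretized into tokens."""
    if mode == "action_model":   # text + image -> model continues with action tokens
        return text_tokens + image_tokens
    if mode == "world_model":    # image + action -> model continues with next-frame tokens
        return image_tokens + action_tokens
    raise ValueError(f"unknown mode: {mode}")
```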

### Action Model Results (Text + Image -> Action)

The action model generates actions given the text instruction and image observations.
- Input: Open the middle drawer of the cabinet.
- Input: Pick up the alphabet soup and place it in the basket.
- Input: Pick up the black bowl between the plate and the ramekin and place it on the plate.
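The decoded action tokens must be mapped back to continuous robot commands. Below is a minimal sketch of one common recipe (per-dimension uniform binning, as popularized by OpenVLA); the bin count and normalization bounds are assumptions for illustration, not WorldVLA's exact settings.

```python
import numpy as np

N_BINS = 256           # assumed per-dimension bin count (OpenVLA-style)
ACTION_DIM = 7         # LIBERO: xyz delta, rotation delta, gripper
LOW, HIGH = -1.0, 1.0  # assumed normalization range of continuous actions

def detokenize_action(action_token_ids):
    """Map ACTION_DIM discrete bin indices back to a continuous action vector."""
    bins = np.asarray(action_token_ids, dtype=np.float32)
    # Return the center of each bin on a uniform grid over [LOW, HIGH].
    return LOW + (bins + 0.5) * (HIGH - LOW) / N_BINS

# e.g. the middle bin in every dimension decodes to roughly the zero action
print(detokenize_action([128] * ACTION_DIM))
```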
### World Model Results (Action + Image -> Image)

The world model generates the next frame given the current frame and action control.
- Input: Action sequence of "Open the top drawer and put the bowl inside".
- Input: Action sequence of "Push the plate to the front of the stove".
- Input: Action sequence of "Put the bowl on the stove".
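Because the world model predicts one frame at a time, multi-step rollouts like those above can be produced by feeding each generated frame back in together with the next action. A minimal rollout sketch follows; `model`, `encode_frame`, and `decode_frame` are hypothetical placeholders for the repository's actual tokenizer and generation code.

```python
def rollout(model, encode_frame, decode_frame, first_frame, actions):
    """Roll out a trajectory autoregressively: frame_t + action_t -> frame_{t+1}.

    Hypothetical interfaces: encode_frame/decode_frame wrap the image
    tokenizer; model.generate continues a token prompt with image tokens.
    """
    frames = [first_frame]
    frame_tokens = encode_frame(first_frame)
    for action_tokens in actions:
        prompt = frame_tokens + action_tokens
        next_frame_tokens = model.generate(prompt)   # predicted image tokens
        frames.append(decode_frame(next_frame_tokens))
        frame_tokens = next_frame_tokens             # feed the prediction back in
    return frames
```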
## Model Zoo

| Model (256 × 256) | HF Link | Success Rate (%) |
| :---------------: | :-----: | :--------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_spatial) | 85.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_object) | 89.0 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_goal) | 82.6 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_256/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_256/libero_10) | 59.0 |

| Model (512 × 512) | HF Link | Success Rate (%) |
| :---------------: | :-----: | :--------------: |
| LIBERO-Spatial | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_spatial](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_spatial) | 87.6 |
| LIBERO-Object | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_object](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_object) | 96.2 |
| LIBERO-Goal | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_goal](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_goal) | 83.4 |
| LIBERO-Long | [Alibaba-DAMO-Academy/WorldVLA/model_512/libero_10](https://huggingface.co/Alibaba-DAMO-Academy/WorldVLA/tree/main/model_512/libero_10) | 60.0 |
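Each entry above is a subfolder of the same HF repository, so a single checkpoint can be fetched without cloning everything. A minimal sketch with `huggingface_hub`, using the 256-resolution LIBERO-Spatial folder as the example:

```python
from huggingface_hub import snapshot_download

# Download only one checkpoint folder from the Model Zoo above.
local_dir = snapshot_download(
    repo_id="Alibaba-DAMO-Academy/WorldVLA",
    allow_patterns=["model_256/libero_spatial/*"],
)
print(local_dir)  # local path containing model_256/libero_spatial/
```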
## Citation

If you find this project helpful for your research, please consider citing our paper:

```bibtex
@article{cen2025worldvla,
  title={WorldVLA: Towards Autoregressive Action World Model},
  author={Cen, Jun and Yu, Chaohui and Yuan, Hangjie and Jiang, Yuming and Huang, Siteng and Guo, Jiayan and Li, Xin and Song, Yibing and Luo, Hao and Wang, Fan and others},
  journal={arXiv preprint arXiv:2506.21539},
  year={2025}
}
```

## Acknowledgment

This project builds upon [Lumina-mGPT](https://github.com/Alpha-VLLM/Lumina-mGPT), [Chameleon](https://github.com/facebookresearch/chameleon), and [OpenVLA](https://github.com/openvla/openvla). We thank these teams for their open-source contributions.