Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds
Abstract
Lumine, a vision-language model-based agent, completes complex missions in real-time across different 3D open-world environments with human-like efficiency and zero-shot cross-game generalization.
We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine's effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.
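The two rates quoted in the abstract (5 Hz perception, 30 Hz keyboard-mouse actions) imply that six action steps fit between consecutive frames. The following is a minimal sketch of that arithmetic only. The function names and the evenly spaced schedule are illustrative assumptions, not the authors' implementation; the abstract does not say how actions are grouped or timed.

```python
from fractions import Fraction

PERCEPTION_HZ = 5   # frames processed per second (from the abstract)
ACTION_HZ = 30      # keyboard-mouse actions per second (from the abstract)


def actions_per_frame(perception_hz: int, action_hz: int) -> int:
    """Number of action steps that fit between two consecutive frames."""
    if action_hz % perception_hz != 0:
        raise ValueError("action rate must be a multiple of the perception rate")
    return action_hz // perception_hz


def action_timestamps_ms(frame_index: int, perception_hz: int, action_hz: int) -> list[float]:
    """Start times (ms) of the action steps following one frame.

    Assumes an evenly spaced schedule, which is a hypothetical choice
    made for illustration only.
    """
    n = actions_per_frame(perception_hz, action_hz)
    frame_start = Fraction(1000 * frame_index, perception_hz)
    step = Fraction(1000, action_hz)
    # Exact rational arithmetic avoids drift before converting to float.
    return [float(frame_start + k * step) for k in range(n)]


steps = actions_per_frame(PERCEPTION_HZ, ACTION_HZ)
schedule = action_timestamps_ms(1, PERCEPTION_HZ, ACTION_HZ)
```

Under these assumptions, each frame covers a 200 ms window and each action step covers roughly 33 ms of it.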
Community
Proposes Lumine, an open, end-to-end vision-language agent for generalist, long-horizon tasks in 3D open worlds, achieving human-level efficiency and zero-shot cross-game generalization without fine-tuning.
This work is so amazing!!!
Genshin mentioned
Amazing work!
Genshin Impact, launch!
arXiv explained breakdown of this paper 👉 https://arxivexplained.com/papers/lumine-an-open-recipe-for-building-generalist-agents-in-3d-open-worlds
Hoping this gets open-sourced.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Game-TARS: Pretrained Foundation Models for Scalable Generalist Multimodal Game Agents (2025)
- AnywhereVLA: Language-Conditioned Exploration and Mobile Manipulation (2025)
- ManiAgent: An Agentic Framework for General Robotic Manipulation (2025)
- PhysiAgent: An Embodied Agent Framework in Physical World (2025)
- MTRDrive: Memory-Tool Synergistic Reasoning for Robust Autonomous Driving in Corner Cases (2025)
- Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert (2025)
- MoTVLA: A Vision-Language-Action Model with Unified Fast-Slow Reasoning (2025)
Excellent work! Interestingly, Lumine’s remarkable generalization ability in games like Wuthering Waves further proves that it’s essentially a textbook Genshin-like game, haha
Genshin Impact, launch!!!