LongVie 2: Multimodal Controllable Ultra-Long Video World Model
Abstract
LongVie 2, an end-to-end autoregressive framework, enhances controllability, visual quality, and temporal consistency in video world models through three progressive training stages.
Building video world models upon pretrained video generation systems represents an important yet challenging step toward general spatiotemporal intelligence. A world model should possess three essential properties: controllability, long-term visual quality, and temporal consistency. To this end, we take a progressive approach: first enhancing controllability, then extending toward long-term, high-quality generation. We present LongVie 2, an end-to-end autoregressive framework trained in three stages: (1) Multimodal guidance, which integrates dense and sparse control signals to provide implicit world-level supervision and improve controllability; (2) Degradation-aware training on the input frame, which bridges the gap between training and long-term inference to maintain high visual quality; and (3) History-context guidance, which aligns contextual information across adjacent clips to ensure temporal consistency. We further introduce LongVGenBench, a comprehensive benchmark comprising 100 high-resolution one-minute videos covering diverse real-world and synthetic environments. Extensive experiments demonstrate that LongVie 2 achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity, and supports continuous video generation lasting up to five minutes, marking a significant step toward unified video world modeling.
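The abstract ties the three training stages to inference-time behavior. The sketch below illustrates one plausible way such a clip-by-clip autoregressive rollout could be wired together; it is a minimal sketch under stated assumptions, and every identifier (the model interface, `degrade_frame`, `encode_history`, the control-signal layout, clip length) is an illustrative assumption rather than the authors' released implementation.

```python
import torch

def degrade_frame(frame: torch.Tensor, noise_std: float = 0.05) -> torch.Tensor:
    """Stage 2 (degradation-aware training), assumed form: perturb the conditioning
    frame during training so the model sees inputs resembling the lower-quality
    frames produced during long autoregressive rollouts. Gaussian noise is only
    one plausible degradation; the paper does not specify the exact operators."""
    return (frame + noise_std * torch.randn_like(frame)).clamp(0.0, 1.0)

@torch.no_grad()
def rollout(model, first_frame, dense_ctrl, sparse_ctrl, num_clips: int,
            frames_per_clip: int = 16) -> torch.Tensor:
    """Generate a long video clip by clip (hypothetical model interface).

    dense_ctrl / sparse_ctrl: per-clip control signals, e.g. depth maps and
        point trajectories (stage 1, multimodal guidance).
    history: contextual features carried over from the previous clip
        (stage 3, history-context guidance).
    """
    video, history = [], None
    cond_frame = first_frame                      # (B, C, H, W)
    for i in range(num_clips):
        clip = model(
            cond_frame=cond_frame,                # last frame of the previous clip
            dense_control=dense_ctrl[i],          # dense signal for this clip
            sparse_control=sparse_ctrl[i],        # sparse signal for this clip
            history=history,                      # aligns context across adjacent clips
            num_frames=frames_per_clip,
        )                                         # clip: (B, T, C, H, W)
        video.append(clip)
        cond_frame = clip[:, -1]                  # condition the next clip on the last frame
        history = model.encode_history(clip)      # assumed helper producing context features
    return torch.cat(video, dim=1)                # concatenate clips along the time axis
```

During training, `degrade_frame` would be applied to `cond_frame` so that the training distribution matches what the model actually receives at step `i > 0` of this loop; the specific degradations and the history encoder are choices the paper leaves to its method section.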
Community
The following papers were recommended by the Semantic Scholar API
- CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion (2025)
- Astra: General Interactive World Model with Autoregressive Denoising (2025)
- ID-Crafter: VLM-Grounded Online RL for Compositional Multi-Subject Video Generation (2025)
- MotionDuet: Dual-Conditioned 3D Human Motion Generation with Video-Regularized Text Learning (2025)
- UnityVideo: Unified Multi-Modal Multi-Task Learning for Enhancing World-Aware Video Generation (2025)
- TempoMaster: Efficient Long Video Generation via Next-Frame-Rate Prediction (2025)
- EgoLCD: Egocentric Video Generation with Long Context Diffusion (2025)