Task adaptation of Vision-Language-Action model: 1st Place Solution for the 2025 BEHAVIOR Challenge
Abstract
A vision-action policy using correlated noise for flow matching and learnable mixed-layer attention wins the 2025 BEHAVIOR Challenge, achieving a 26% q-score across 50 diverse long-horizon household tasks.
We present a vision-action policy that won 1st place in the 2025 BEHAVIOR Challenge, a large-scale benchmark featuring 50 diverse long-horizon household tasks in photo-realistic simulation that require bimanual manipulation, navigation, and context-aware decision making. Building on the Pi0.5 architecture, we introduce several innovations. Our primary contribution is correlated noise for flow matching, which improves training efficiency and enables correlation-aware inpainting for smooth action sequences. We also apply learnable mixed-layer attention and System 2 stage tracking for ambiguity resolution. Training employs multi-sample flow matching to reduce variance, while inference uses action compression and challenge-specific correction rules. Our approach achieves a 26% q-score across all 50 tasks on both the public and private leaderboards.
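To make the primary contribution concrete, the sketch below illustrates one way correlated noise can be plugged into a conditional flow-matching objective: instead of drawing i.i.d. Gaussian noise independently for each timestep of the action chunk, the noise follows an AR(1) process along the time axis, so adjacent noise vectors are correlated. This is a minimal sketch under our own assumptions; the AR(1) parameterization, the `rho` value, and the `policy(obs, x_tau, tau)` signature are illustrative and not the authors' released implementation.

```python
# Minimal sketch of flow matching with temporally correlated noise.
# Assumptions (not from the paper): AR(1) noise structure, rho=0.9,
# and a policy callable with signature policy(obs, x_tau, tau).
import torch

def sample_correlated_noise(batch, horizon, action_dim, rho=0.9):
    """Gaussian noise whose timesteps follow an AR(1) process, so
    adjacent noise vectors in the action chunk are correlated."""
    eps = torch.randn(batch, horizon, action_dim)
    noise = torch.empty_like(eps)
    noise[:, 0] = eps[:, 0]
    scale = (1.0 - rho ** 2) ** 0.5  # keeps the marginal variance at 1
    for t in range(1, horizon):
        noise[:, t] = rho * noise[:, t - 1] + scale * eps[:, t]
    return noise

def flow_matching_loss(policy, obs, actions, rho=0.9):
    """Standard conditional flow-matching objective, with the i.i.d.
    noise source swapped for the correlated sampler above."""
    b, h, d = actions.shape
    noise = sample_correlated_noise(b, h, d, rho)
    tau = torch.rand(b, 1, 1)                    # interpolation time in [0, 1)
    x_tau = tau * actions + (1.0 - tau) * noise  # noisy action chunk
    target_v = actions - noise                   # velocity target x1 - x0
    pred_v = policy(obs, x_tau, tau)             # predicted velocity field
    return ((pred_v - target_v) ** 2).mean()
```

Because the correlation structure is known in closed form, the same covariance can in principle be exploited at inference time for the correlation-aware inpainting of action sequences that the abstract mentions.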
Community
We present our 1st place solution to the 2025 NeurIPS BEHAVIOR Challenge, where a single Vision-Language-Action robotics policy is trained to perform 50 household manipulation tasks in a photorealistic simulator. The approach builds on Pi0.5 with several practical modifications to the architecture, training procedure, and inference pipeline. We open-source the full solution, including code, model weights, and a detailed technical report, for anyone exploring or adapting VLAs to real tasks.
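One of the training modifications named in the abstract is multi-sample flow matching to reduce variance. The sketch below shows one hedged reading of that idea: averaging the flow-matching loss over K independent noise draws per action chunk, so each gradient step sees a lower-variance estimate of the objective. The value of K, and the reuse of `sample_correlated_noise` and `policy` from the previous sketch, are assumptions for illustration.

```python
# Minimal sketch of multi-sample flow matching as variance reduction.
# Assumption (not from the paper): "multi-sample" means averaging the
# loss over K noise draws; K=4 and the reused helpers are illustrative.
import torch

def multi_sample_fm_loss(policy, obs, actions, K=4, rho=0.9):
    b, h, d = actions.shape
    losses = []
    for _ in range(K):
        noise = sample_correlated_noise(b, h, d, rho)  # from previous sketch
        tau = torch.rand(b, 1, 1)
        x_tau = tau * actions + (1.0 - tau) * noise
        pred_v = policy(obs, x_tau, tau)
        losses.append(((pred_v - (actions - noise)) ** 2).mean())
    return torch.stack(losses).mean()
```

The trade-off is K forward passes per batch in exchange for a smoother loss signal, which matters most when the action chunks are long and a single noise draw makes the per-step objective noisy.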