Abstract
Fast-FoundationStereo achieves real-time zero-shot stereo generalization by combining knowledge distillation, blockwise neural architecture search, and structured pruning.
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rates. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search to automatically discover optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning to eliminate redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model runs over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/
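To make the distillation component concrete, below is a minimal PyTorch sketch of feature-level knowledge distillation from a large teacher backbone into a compact student. The module names, layer widths, and the L1 feature-matching loss are illustrative assumptions, not the paper's actual implementation; the real pipeline distills FoundationStereo's hybrid backbone and also draws on the pseudo-labeled in-the-wild pairs described in the abstract.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for the teacher's hybrid backbone and a compact student;
# the actual FoundationStereo modules are not reproduced here.
class TeacherBackbone(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

class StudentBackbone(nn.Module):
    def __init__(self, out_channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, out_channels, 3, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)

def distillation_step(teacher, student, optimizer, images):
    """One feature-distillation step: the student learns to mimic the frozen teacher's features."""
    teacher.eval()
    with torch.no_grad():
        target = teacher(images)          # teacher features, no gradients
    pred = student(images)                # student features at matching resolution
    loss = F.l1_loss(pred, target)        # L1 feature-matching loss (an assumed choice)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    teacher, student = TeacherBackbone(), StudentBackbone()
    optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
    batch = torch.randn(2, 3, 256, 512)   # left (or right) images; stereo views would each be encoded
    print(distillation_step(teacher, student, optimizer, batch))
```

In the divide-and-conquer strategy described above, a distilled student of this kind would replace the hybrid backbone, while blockwise NAS and structured pruning handle the cost filtering and iterative refinement modules, respectively.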
Community
A real-time foundation model for stereo depth estimation, which is crucial for 3D spatial perception in robotics and humanoid systems.
Librarian Bot found the following similar papers, recommended by the Semantic Scholar API:
- Lite Any Stereo: Efficient Zero-Shot Stereo Matching (2025)
- Generalized Geometry Encoding Volume for Real-time Stereo Matching (2025)
- CogStereo: Neural Stereo Matching with Implicit Spatial Cognition Embedding (2025)
- RobIA: Robust Instance-aware Continual Test-time Adaptation for Deep Stereo (2025)
- PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching (2025)
- StereoWorld: Geometry-Aware Monocular-to-Stereo Video Generation (2025)
- MAFNet: Multi-frequency Adaptive Fusion Network for Real-time Stereo Matching (2025)