StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors
Guibao Shen1,3†, Yihua Du1, Wenhang Ge1,3*†, Jing He1, Chirui Chang3, Donghao Zhou4, Zhen Yang1, Luozhou Wang1, Xin Tao3, Ying-Cong Chen1,2‡
1HKUST(GZ), 2HKUST, 3Kling Team, Kuaishou Technology, 4CUHK
(*Equal contribution, †This work was conducted during the author's internship at Kling, ‡Corresponding author)
Introduction
TL;DR: We propose StereoPilot, an efficient feed-forward architecture that leverages pretrained video diffusion transformers to directly synthesize novel views without iterative denoising, overcoming the limitations of Depth-Warp-Inpaint methods. With a domain switcher and a cycle consistency loss, it enables robust multi-format stereo conversion. We also introduce UniStereo, the first large-scale unified dataset featuring both parallel and converged stereo formats.
Updates
- [2025.12.16]: Release inference code and Project Page.
Requirements
Our inference environment:
- Python 3.12
- CUDA 12.1
- PyTorch 2.4.1
- GPU: NVIDIA A800 (only ~23GB VRAM required)
Installation
Step 1: Clone the repository
git clone https://github.com/KlingTeam/StereoPilot.git
cd StereoPilot
Step 2: Create conda environment
conda create -n StereoPilot python=3.12
conda activate StereoPilot
Step 3: Install dependencies
pip install -r requirements.txt
pip install flash-attn==2.7.4.post1 --no-build-isolation
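To confirm that the core dependencies installed correctly (a quick check, assuming the versions pinned above), you can run:
# Print the PyTorch version and whether CUDA is visible; importing flash_attn verifies the flash-attn build
python -c "import torch, flash_attn; print(torch.__version__, torch.cuda.is_available())"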
Step 4: Download model checkpoints
Place the following files in the ckpt/ directory:
| File | Description |
|---|---|
| `StereoPilot.safetensors` | StereoPilot model weights |
| `Wan2.1-T2V-1.3B/` | Base Wan2.1 model directory |
Download StereoPilot.safetensors & the Wan2.1-T2V-1.3B base model:
pip install "huggingface_hub[cli]"
huggingface-cli download KlingTeam/StereoPilot StereoPilot.safetensors --local-dir ./ckpt
huggingface-cli download Wan-AI/Wan2.1-T2V-1.3B --local-dir ./ckpt/Wan2.1-T2V-1.3B
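After downloading, ckpt/ should contain the StereoPilot weights next to the base model folder. A quick sanity check:
# List the checkpoint directory; it should show the weights file and the base model folder
ls ckpt/
# Expected: StereoPilot.safetensors  Wan2.1-T2V-1.3B/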
Inference
Input Requirements
For each input video, you need:
- Video file (`.mp4`): Monocular video, 81 frames, 832×480 resolution, 16 fps
- Prompt file (`.txt`): Text description of the video content (same name as the video)
Example (you can try the cases in the sample/ folder):
sample/
├── my_video.mp4
└── my_video.txt
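If your source clip does not already match this format, one way to conform it is with ffmpeg (a sketch with hypothetical file names; note that a plain scale ignores the original aspect ratio, so crop or pad beforehand if needed):
# Resize to 832x480, resample to 16 fps, keep the first 81 frames, and drop audio
ffmpeg -i input.mp4 -vf "scale=832:480,fps=16" -frames:v 81 -an sample/my_video.mp4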
Running Inference
Basic usage:
# Edit toml/infer.toml to customize model paths. If you followed the steps above, no changes are needed.
python sample.py \
--config toml/infer.toml \
--input /path/to/input_video.mp4 \
--output_folder /path/to/output \
--device cuda:0
Using the example script:
bash sample.sh
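To process every sample in the folder, a small loop using the same flags as above (the outputs/ path is just an example):
# Run inference on each .mp4 in sample/; each video is expected to have a matching .txt prompt
for video in sample/*.mp4; do
  python sample.py \
    --config toml/infer.toml \
    --input "$video" \
    --output_folder outputs/ \
    --device cuda:0
done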
Generate Stereo Visualization
After inference, you can generate Side-by-Side (SBS) and Red-Cyan anaglyph stereo videos for visualization:
python utils/stereo_video.py \
--left /path/to/left_eye.mp4 \
--right /path/to/right_eye.mp4
Output files:
| Output | Description | Viewing Device |
|---|---|---|
| `{name}_sbs.mp4` | Side-by-Side stereo video | VR Headset |
| `{name}_anaglyph.mp4` | Red-Cyan anaglyph stereo video | 3D Glasses |
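As a quick alternative for checking alignment, ffmpeg's hstack filter can also stack the two eye videos side by side (this is independent of utils/stereo_video.py and assumes both clips share the same resolution and frame count):
# Horizontally stack the left and right views into a side-by-side preview
ffmpeg -i left_eye.mp4 -i right_eye.mp4 -filter_complex "hstack=inputs=2" sbs_preview.mp4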
Dataset
We introduce UniStereo, the first large-scale unified stereo video dataset featuring both parallel and converged stereo formats.
UniStereo consists of two parts:
- 3DMovie - Converged stereo format from 3D movies
- Stereo4D - Parallel stereo format (coming soon)
For detailed data processing instructions, please refer to StereoPilot_Dataprocess.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Wan2.1 - Base video generation model
- Diffusion Pipe - Training code base
Citation
If you find our work helpful, please consider citing:
@misc{shen2025stereopilot,
title={StereoPilot: Learning Unified and Efficient Stereo Conversion via Generative Priors},
author={Shen, Guibao and Du, Yihua and Ge, Wenhang and He, Jing and Chang, Chirui and Zhou, Donghao and Yang, Zhen and Wang, Luozhou and Tao, Xin and Chen, Ying-Cong},
year={2025},
eprint={2512.16915},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.16915},
}