---
license: mit
---
# Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability
[Project Page](https://zhazhan.github.io/ssvae.github.io)
[arXiv](https://arxiv.org/abs/2512.05394)
Most existing video VAEs prioritize reconstruction fidelity, often overlooking the latent structure's impact on
downstream diffusion training. Our research identifies properties of video VAE latent spaces that facilitate diffusion
training through statistical analysis of VAE latents. Our key finding is that biased, rather than uniform, spectra lead
to improved diffusability. Motivated by this, we introduce **SSVAE (Spectral-Structured VAE)**, which optimizes the
**spectral properties** of the latent space to enhance its **"Diffusability"**.
<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/SSVAE/refs/heads/main/assets/figs/teaser.png" alt="Figure 1" width="400">
</div>
## 🔥 Key Highlights
* **Spectral Analysis of Latents**: We identify two statistical properties essential for efficient diffusion training: a
**low-frequency biased spatio-temporal spectrum** and a **few-mode biased channel eigenspectrum**.
* **Local Correlation Regularization (LCR)**: A lightweight regularizer that explicitly enhances local spatio-temporal
correlations to induce low-frequency bias.
* **Latent Masked Reconstruction (LMR)**: A mechanism that simultaneously promotes few-mode bias and improves decoder
robustness against noise.
* **Superior Performance**:
    * **3× Faster Convergence**: Accelerates text-to-video generation convergence by 3× compared to strong baselines.
    * **Higher Quality**: Achieves a **10% gain** in video reward scores (UnifiedReward).
    * **Outperforms SOTA**: Surpasses open-source VAEs (e.g., Wan 2.2, CogVideoX) in generation quality with fewer
      parameters.
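
The two spectral properties named in the highlights can be probed on any latent tensor with a few lines of NumPy. The sketch below is illustrative only and is not taken from the SSVAE codebase: the function names, the `(C, T, H, W)` layout, the `cutoff` value, and the two summary statistics (fraction of energy at low spatio-temporal frequencies, and normalized eigenvalues of the channel covariance) are assumptions chosen for the demonstration, and the paper may define its measurements differently.

```python
import numpy as np


def low_frequency_energy_fraction(latent, cutoff=0.25):
    """Share of spectral energy below `cutoff` times the largest frequency radius.

    `latent` is assumed to have shape (C, T, H, W). Hypothetical diagnostic.
    """
    _, t, h, w = latent.shape
    # Remove each channel's mean so the DC term does not dominate the ratio.
    centered = latent - latent.mean(axis=(1, 2, 3), keepdims=True)
    # 3D power spectrum over (T, H, W), summed across channels.
    power = (np.abs(np.fft.fftn(centered, axes=(1, 2, 3))) ** 2).sum(axis=0)
    ft = np.fft.fftfreq(t)[:, None, None]
    fh = np.fft.fftfreq(h)[None, :, None]
    fw = np.fft.fftfreq(w)[None, None, :]
    radius = np.sqrt(ft ** 2 + fh ** 2 + fw ** 2)
    low = radius <= cutoff * radius.max()
    return float(power[low].sum() / power.sum())


def channel_eigenspectrum(latent):
    """Eigenvalues of the channel covariance, sorted descending and normalized to sum to 1."""
    c = latent.shape[0]
    cov = np.cov(latent.reshape(c, -1))  # rows are channels, columns are positions
    eigvals = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    return eigvals / eigvals.sum()


def smooth(x):
    """3-tap circular box filter along T, H, and W; raises local correlation."""
    for axis in (1, 2, 3):
        x = (np.roll(x, 1, axis) + x + np.roll(x, -1, axis)) / 3.0
    return x


rng = np.random.default_rng(0)
shape = (8, 8, 16, 16)  # toy (C, T, H, W) latent

# Flat spectrum (white noise) versus a locally correlated latent.
white = rng.standard_normal(shape)
white_frac = low_frequency_energy_fraction(white)
smooth_frac = low_frequency_energy_fraction(smooth(white))

# Near-uniform channel spectrum versus channels driven by only two sources.
sources = rng.standard_normal((2,) + shape[1:])
mixing = rng.standard_normal((shape[0], 2))
few_mode = np.einsum("cs,sthw->cthw", mixing, sources) + 0.05 * rng.standard_normal(shape)
white_spec = channel_eigenspectrum(white)
few_spec = channel_eigenspectrum(few_mode)

print("low-frequency fraction (white, smoothed):", white_frac, smooth_frac)
print("top-2 eigenvalue share (white, few-mode):", white_spec[:2].sum(), few_spec[:2].sum())
```

On this toy data, smoothing concentrates energy at low frequencies, and mixing a small number of sources concentrates channel variance in a few eigen-directions. These are the qualitative patterns that the terms "low-frequency biased" and "few-mode biased" describe.
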
## Using the Model
For installation and usage instructions, please see our [GitHub repository](https://github.com/zai-org/SSVAE).
## Citation
If you find this work useful in your research, please consider citing:
```bibtex
@misc{liu2025delvinglatentspectralbiasing,
title={Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability},
author={Shizhan Liu and Xinran Deng and Zhuoyi Yang and Jiayan Teng and Xiaotao Gu and Jie Tang},
year={2025},
eprint={2512.05394},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.05394},
}
```