---
license: mit
---

# Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

[Project Page](https://zhazhan.github.io/ssvae.github.io)

[arXiv](https://arxiv.org/abs/2512.05394)

Most existing video VAEs prioritize reconstruction fidelity and often overlook how latent structure affects
downstream diffusion training. Through statistical analysis of VAE latents, we identify properties of video VAE latent
spaces that facilitate diffusion training. Our key finding is that biased, rather than uniform, spectra lead to improved
diffusability. Motivated by this, we introduce **SSVAE (Spectral-Structured VAE)**, which optimizes the
**spectral properties** of the latent space to enhance its **"Diffusability"**.

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/SSVAE/refs/heads/main/assets/figs/teaser.png" alt="Figure 1" width="400">
</div>

## 🔥 Key Highlights

* **Spectral Analysis of Latents**: We identify two statistical properties essential for efficient diffusion training: a
  **low-frequency biased spatio-temporal spectrum** and a **few-mode biased channel eigenspectrum**.
* **Local Correlation Regularization (LCR)**: A lightweight regularizer that explicitly enhances local spatio-temporal
  correlations to induce low-frequency bias.
* **Latent Masked Reconstruction (LMR)**: A mechanism that simultaneously promotes few-mode bias and improves decoder
  robustness against noise.
* **Superior Performance**:
  * **3× Faster Convergence**: Accelerates text-to-video generation convergence by 3× compared to strong baselines.
  * **Higher Quality**: Achieves a **10% gain** in video reward scores (UnifiedReward).
  * **Outperforms SOTA**: Surpasses open-source VAEs (e.g., Wan 2.2, CogVideoX) in generation quality with fewer
    parameters.

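
The two spectral properties named in the first highlight can be probed with simple diagnostics. The following is a minimal NumPy sketch, not the analysis code from the paper: the function names, the `(C, T, H, W)` latent layout, the radial `cutoff` of 0.25 cycles per sample, and the synthetic latents are all assumptions made for illustration.

```python
import numpy as np


def low_frequency_energy_fraction(latent: np.ndarray, cutoff: float = 0.25) -> float:
    """Share of spatio-temporal spectral energy at radial frequency <= cutoff.

    `latent` has shape (C, T, H, W). This summary statistic and the cutoff
    are illustrative assumptions, not the metric used in the paper.
    """
    # Remove each channel's mean so the DC term does not dominate.
    centered = latent - latent.mean(axis=(1, 2, 3), keepdims=True)
    # Power spectrum over the (T, H, W) axes, averaged over channels.
    power = (np.abs(np.fft.fftn(centered, axes=(1, 2, 3))) ** 2).mean(axis=0)
    # Per-axis frequencies in cycles/sample, each in [-0.5, 0.5).
    grids = np.meshgrid(*(np.fft.fftfreq(n) for n in power.shape), indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return float(power[radius <= cutoff].sum() / power.sum())


def channel_eigenspectrum(latent: np.ndarray) -> np.ndarray:
    """Eigenvalues of the channel covariance, descending, normalized to sum to 1.

    Assumes at least two channels (C >= 2).
    """
    flat = latent.reshape(latent.shape[0], -1)  # (C, T*H*W)
    cov = np.cov(flat)                          # rows are treated as variables
    eigvals = np.linalg.eigvalsh(cov)[::-1]     # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)       # guard against tiny negatives
    return eigvals / eigvals.sum()


# Synthetic stand-ins for VAE latents (hypothetical data, C=8, T=8, H=W=16).
rng = np.random.default_rng(0)
white = rng.standard_normal((8, 8, 16, 16))

# Crude low-pass: circular 3-tap averaging along T, H, and W.
smooth = white.copy()
for axis in (1, 2, 3):
    smooth = (np.roll(smooth, 1, axis) + smooth + np.roll(smooth, -1, axis)) / 3.0

# Few-mode channels: 8 channels driven by only 2 underlying sources, plus small noise.
mixing = rng.standard_normal((8, 2))
sources = rng.standard_normal((2, 8 * 16 * 16))
few_mode = (mixing @ sources).reshape(8, 8, 16, 16) + 0.05 * white

for name, z in [("white", white), ("smooth", smooth), ("few_mode", few_mode)]:
    top2 = float(channel_eigenspectrum(z)[:2].sum())
    print(name, round(low_frequency_energy_fraction(z), 3), round(top2, 3))
```

On these synthetic tensors, the smoothed latent should concentrate far more energy at low frequencies than white noise, and the two-source mixture should place nearly all channel variance in its top two modes. Applying the same functions to real encoder outputs would require loading latents from the model, which is not shown here.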
## Using the Model

Please see our [GitHub repository](https://github.com/zai-org/SSVAE) for usage instructions.

## Citation

If you find this work useful in your research, please consider citing:

```bibtex
@misc{liu2025delvinglatentspectralbiasing,
      title={Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability},
      author={Shizhan Liu and Xinran Deng and Zhuoyi Yang and Jiayan Teng and Xiaotao Gu and Jie Tang},
      year={2025},
      eprint={2512.05394},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.05394},
}
```