Update README.md
README.md
CHANGED
---
license: mit
---

# Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability

Most existing video VAEs prioritize reconstruction fidelity and often overlook how latent structure affects
downstream diffusion training. Through statistical analysis of VAE latents, we identify properties of video VAE
latent spaces that facilitate diffusion training. Our key finding is that biased, rather than uniform, spectra lead
to improved diffusability. Motivated by this, we introduce **SSVAE (Spectral-Structured VAE)**, which optimizes the
**spectral properties** of the latent space to enhance its **"Diffusability"**.

<div align="center">
<img src="https://raw.githubusercontent.com/zai-org/SSVAE/refs/heads/main/assets/figs/teaser.png" alt="Figure 1" width="400">
</div>

## 🔥 Key Highlights

* **Spectral Analysis of Latents**: We identify two statistical properties essential for efficient diffusion training: a
  **low-frequency biased spatio-temporal spectrum** and a **few-mode biased channel eigenspectrum**.
* **Local Correlation Regularization (LCR)**: A lightweight regularizer that explicitly enhances local spatio-temporal
  correlations to induce low-frequency bias.
* **Latent Masked Reconstruction (LMR)**: A mechanism that simultaneously promotes few-mode bias and improves decoder
  robustness against noise.
* **Superior Performance**:
  * **3× Faster Convergence**: Accelerates text-to-video generation convergence by 3× compared to strong baselines.
  * **Higher Quality**: Achieves a **10% gain** in video reward scores (UnifiedReward).
  * **Outperforms SOTA**: Surpasses open-source VAEs (e.g., Wan 2.2, CogVideoX) in generation quality with fewer
    parameters.
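
To make the two spectral properties above more concrete, here is a minimal NumPy sketch of how one might measure them on a single latent tensor. It is an illustration under stated assumptions, not the measurement protocol used for SSVAE: the function names, the `(C, T, H, W)` layout, and the `cutoff` value are all hypothetical choices made for this example.

```python
import numpy as np


def low_frequency_energy_ratio(latent, cutoff=0.25):
    """Share of spatio-temporal spectral energy at or below `cutoff`.

    `latent` is assumed to have shape (C, T, H, W). `cutoff` is a radial
    frequency in cycles per sample and is an arbitrary illustrative value.
    """
    _, t, h, w = latent.shape
    # Remove each channel's mean so the DC term does not dominate the ratio.
    centered = latent - latent.mean(axis=(1, 2, 3), keepdims=True)
    # Power spectrum over (T, H, W), summed across channels.
    power = (np.abs(np.fft.fftn(centered, axes=(1, 2, 3))) ** 2).sum(axis=0)
    # Radial frequency of every FFT bin.
    ft = np.fft.fftfreq(t)[:, None, None]
    fh = np.fft.fftfreq(h)[None, :, None]
    fw = np.fft.fftfreq(w)[None, None, :]
    radius = np.sqrt(ft**2 + fh**2 + fw**2)
    return float(power[radius <= cutoff].sum() / power.sum())


def channel_eigenvalue_shares(latent):
    """Eigenvalues of the channel covariance, sorted descending, summing to 1.

    A few-mode biased eigenspectrum puts most of the mass in the first entries.
    """
    c = latent.shape[0]
    flat = latent.reshape(c, -1)
    flat = flat - flat.mean(axis=1, keepdims=True)
    cov = flat @ flat.T / flat.shape[1]
    # eigvalsh returns ascending eigenvalues for a symmetric matrix.
    eig = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    return eig / eig.sum()


if __name__ == "__main__":
    # Stand-in latent (random noise); replace with real VAE latents.
    rng = np.random.default_rng(0)
    latent = rng.standard_normal((4, 8, 16, 16))
    print("low-frequency energy ratio:", low_frequency_energy_ratio(latent))
    print("channel eigenvalue shares:", channel_eigenvalue_shares(latent))
```

Under these assumptions, a latent with a low-frequency biased spectrum would yield a higher energy ratio than white noise of the same shape, and a few-mode biased latent would concentrate most of its eigenvalue share in the leading entries.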

## Using the Model

Please see our [GitHub repository](https://github.com/zai-org/SSVAE).