ZHANGYUXUAN-zR committed on
Commit 6fe7493 · verified · 1 Parent(s): 606bfe4

Update README.md

Files changed (1)
  1. README.md +34 -3
README.md CHANGED
@@ -1,3 +1,34 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ ---
+
+ # Delving into Latent Spectral Biasing of Video VAEs for Superior Diffusability
+
+
+ Most existing video VAEs prioritize reconstruction fidelity, often overlooking the latent structure's impact on
+ downstream diffusion training. Through statistical analysis of VAE latents, we identify properties of video VAE
+ latent spaces that facilitate diffusion training. Our key finding is that biased, rather than uniform, spectra lead
+ to improved diffusability. Motivated by this, we introduce **SSVAE (Spectral-Structured VAE)**, which optimizes the
+ **spectral properties** of the latent space to enhance its **"Diffusability"**.
+
+ <div align="center">
+ <img src="https://raw.githubusercontent.com/zai-org/SSVAE/refs/heads/main/assets/figs/teaser.png" alt="Figure 1" width="400">
+ </div>
+
+ ## 🔥 Key Highlights
+
+ * **Spectral Analysis of Latents**: We identify two statistical properties essential for efficient diffusion training: a
+   **low-frequency biased spatio-temporal spectrum** and a **few-mode biased channel eigenspectrum**.
+ * **Local Correlation Regularization (LCR)**: A lightweight regularizer that explicitly enhances local spatio-temporal
+   correlations to induce low-frequency bias.
+ * **Latent Masked Reconstruction (LMR)**: A mechanism that simultaneously promotes few-mode bias and improves decoder
+   robustness against noise.
+ * **Superior Performance**:
+     * 🚀 **3× Faster Convergence**: Accelerates text-to-video generation convergence by 3× compared to strong baselines.
+     * 📈 **Higher Quality**: Achieves a **10% gain** in video reward scores (UnifiedReward).
+     * 🏆 **Outperforms SOTA**: Surpasses open-source VAEs (e.g., Wan 2.2, CogVideoX) in generation quality with fewer
+       parameters.
+
+ ## Using the Model
+
+ Please see our [GitHub repository](https://github.com/zai-org/SSVAE).
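
The two spectral properties named under Key Highlights can be measured on any latent tensor. The block below is a minimal NumPy sketch of such a measurement, not code from the SSVAE repository: the function names, the `(C, T, H, W)` latent layout, and the `cutoff` value are all assumptions made for illustration.

```python
import numpy as np


def low_frequency_energy_fraction(latent: np.ndarray, cutoff: float = 0.25) -> float:
    """Share of spatio-temporal spectral energy at or below `cutoff`.

    `latent` is assumed to have shape (C, T, H, W). Frequencies are in
    cycles per sample, so each axis spans [-0.5, 0.5).
    """
    # Remove each channel's mean so the DC bin does not dominate the ratio.
    centered = latent - latent.mean(axis=(1, 2, 3), keepdims=True)
    power = np.abs(np.fft.fftn(centered, axes=(1, 2, 3))) ** 2
    # Radial frequency of every (t, h, w) bin.
    grids = np.meshgrid(*(np.fft.fftfreq(n) for n in latent.shape[1:]), indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    low = radius <= cutoff
    return float(power[:, low].sum() / power.sum())


def channel_eigenspectrum(latent: np.ndarray) -> np.ndarray:
    """Eigenvalues of the channel covariance, descending, normalized to sum to 1."""
    flat = latent.reshape(latent.shape[0], -1)  # one row per channel
    cov = np.atleast_2d(np.cov(flat))
    eig = np.clip(np.linalg.eigvalsh(cov)[::-1], 0.0, None)
    return eig / eig.sum()


def box_smooth(latent: np.ndarray) -> np.ndarray:
    """3-tap circular box filter along T, H and W (a toy stand-in for local correlation)."""
    out = latent
    for axis in (1, 2, 3):
        out = (np.roll(out, -1, axis) + out + np.roll(out, 1, axis)) / 3.0
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    white = rng.standard_normal((8, 8, 16, 16))  # roughly flat spectrum
    smooth = box_smooth(white)                   # locally correlated version
    print("low-frequency share, white :", low_frequency_energy_fraction(white))
    print("low-frequency share, smooth:", low_frequency_energy_fraction(smooth))
    print("top channel modes, white   :", channel_eigenspectrum(white)[:3])
```

On white noise both diagnostics should look close to uniform, while the box-smoothed copy should report a larger low-frequency share. In this reading, a "low-frequency biased" latent scores high on the first function and a "few-mode biased" latent concentrates mass in the first few entries of the second.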