WAN2.1 VAE - 3D Causal Video Variational Autoencoder

WAN2.1 VAE is a 3D causal variational autoencoder designed for high-quality video generation and compression. This repository contains the standalone VAE component used in the WAN (Open and Advanced Large-Scale Video Generative Models) framework.

Model Description

The WAN2.1 VAE is built for efficient video compression and reconstruction, featuring:

  • 3D Causal Architecture: Maintains temporal causality across video sequences
  • Unlimited Length Support: Can encode and decode unlimited-length 1080P videos without losing historical temporal information
  • High Compression Efficiency: Advanced spatio-temporal compression with minimal quality loss
  • Memory Optimized: Reduced memory footprint compared to traditional video VAEs
  • Temporal Information Preservation: Ensures consistent temporal dynamics across long sequences

Key Innovations

  1. Improved Spatio-Temporal Compression: Enhanced compression ratios while maintaining visual fidelity (see the shape sketch after this list)
  2. Causal Temporal Processing: Ensures frame-to-frame causality for coherent video generation
  3. Efficient Memory Usage: Optimized for consumer-grade GPU deployment
  4. High-Resolution Support: Native support for 1080P video encoding/decoding
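
A minimal sketch of the shape arithmetic implied by the commonly cited WAN2.1 strides (4x temporal with the first frame kept causally, 8x8 spatial, 16 latent channels); these values are assumptions here and should be verified against the loaded vae.config.

# Hypothetical shape calculator for WAN2.1-style causal compression.
# The strides and channel count are assumptions based on published
# WAN2.1 specifications; verify against vae.config after loading.
def wan_latent_shape(frames: int, height: int, width: int,
                     t_stride: int = 4, s_stride: int = 8,
                     z_channels: int = 16) -> tuple[int, int, int, int]:
    """Return (channels, latent_frames, latent_h, latent_w)."""
    assert (frames - 1) % t_stride == 0, "use 4k+1 frames, e.g. 17 or 81"
    latent_frames = 1 + (frames - 1) // t_stride  # first frame kept causally
    return z_channels, latent_frames, height // s_stride, width // s_stride

print(wan_latent_shape(81, 480, 832))  # -> (16, 21, 60, 104)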

Repository Contents

E:\huggingface\wan21-vae\
└── vae/
    └── wan/
        └── wan21-vae.safetensors (243 MB)

Model Files

File                   Size    Format       Description
wan21-vae.safetensors  243 MB  SafeTensors  WAN2.1 VAE weights

Total Repository Size: 243 MB
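
To sanity-check the download, you can list the tensors in the SafeTensors file without materializing all weights; the path below mirrors the repository layout above.

from safetensors import safe_open

# Read tensor names from the checkpoint header; tensors load lazily.
path = "E:/huggingface/wan21-vae/vae/wan/wan21-vae.safetensors"
with safe_open(path, framework="pt") as f:
    keys = list(f.keys())
    print(f"{len(keys)} tensors; first entry: {keys[0]}")
    print("shape of first tensor:", f.get_tensor(keys[0]).shape)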

Hardware Requirements

Minimum Requirements

  • VRAM: 4 GB (inference only)
  • RAM: 8 GB system memory
  • Disk Space: 500 MB (including dependencies)
  • GPU: CUDA-compatible GPU (NVIDIA GTX 1060 or equivalent)

Recommended Requirements

  • VRAM: 8+ GB for optimal performance
  • RAM: 16 GB system memory
  • Disk Space: 1 GB
  • GPU: NVIDIA RTX 3060 or better

Resolution-Specific Requirements

  • 480P Video: 4-6 GB VRAM
  • 720P Video: 6-8 GB VRAM
  • 1080P Video: 8-12 GB VRAM
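
To see which tier your GPU falls into, query the total VRAM before loading the model:

import torch

# Report total VRAM so it can be matched against the tiers above.
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device found; CPU inference will be very slow.")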

Usage Examples

Basic VAE Loading

import torch
from diffusers import AutoencoderKLWan

# Load the WAN2.1 VAE. In diffusers the Wan VAE has its own class,
# AutoencoderKLWan; the 2D AutoencoderKL cannot load these 3D weights.
# from_pretrained expects a config.json alongside the weights; the official
# Wan-AI diffusers repos provide one under subfolder="vae".
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

print(f"VAE loaded: {vae.config}")

Video Encoding Example

import torch
import numpy as np
from diffusers import AutoencoderKLWan

# Load VAE (AutoencoderKLWan is the Wan-specific class in diffusers)
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Prepare video frames (dummy data; real frames should be scaled to [-1, 1])
# Shape: [batch, channels, frames, height, width]; the causal VAE expects
# frame counts of the form 4k+1 (e.g., 17 or 81), hence 17 here
video_frames = torch.randn(1, 3, 17, 480, 720).half().to("cuda")

# Encode video to latent space
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.sample()

print(f"Latent shape: {latents.shape}")
print(f"Compression ratio: {np.prod(video_frames.shape) / np.prod(latents.shape):.2f}x")

Video Decoding Example

import torch
from diffusers import AutoencoderKLWan

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Decode latents back to video frames
# (reusing the latents produced in the encoding example above)
with torch.no_grad():
    reconstructed_video = vae.decode(latents).sample

print(f"Reconstructed video shape: {reconstructed_video.shape}")

Integration with WAN Models

import torch
from diffusers import WanPipeline, AutoencoderKLWan

# Load custom VAE (float32 is commonly recommended for the Wan VAE)
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float32
)

# Load the WAN text-to-video pipeline with the custom VAE
# (the diffusers-format repo id carries a "-Diffusers" suffix)
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    vae=vae,
    torch_dtype=torch.bfloat16
).to("cuda")

# Generate video (frame count of the form 4k+1)
prompt = "A serene beach at sunset with waves crashing"
video = pipe(prompt, num_frames=17, height=480, width=720).frames[0]

print(f"Generated video: {len(video)} frames")

Model Specifications

Architecture Details

  • Type: 3D Causal Variational Autoencoder
  • Architecture: Causal spatio-temporal convolutions
  • Compression: 4x temporal and 8x8 spatial downsampling (4x8x8 latent stride)
  • Causality: Temporal causal processing for frame consistency
  • Latent Dimensions: 16 latent channels

Technical Specifications

  • Precision: FP16 (Half precision) recommended
  • Format: SafeTensors (secure, efficient loading)
  • Framework: PyTorch >= 2.4.0
  • Library: Diffusers (Hugging Face)
  • Temporal Support: Unlimited frame sequences
  • Resolution Support: Up to 1080P native

Supported Operations

  • Video encoding (frames β†’ latents)
  • Video decoding (latents β†’ frames)
  • Temporal compression
  • Spatial compression
  • Causal frame generation
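
A round-trip encode/decode is a quick way to verify these operations end to end; the PSNR helper below is a generic sketch (not part of the WAN API) and reuses the vae and video_frames from the encoding example:

import torch

def psnr(a: torch.Tensor, b: torch.Tensor, data_range: float = 2.0) -> float:
    """PSNR for tensors in [-1, 1] (data_range = max - min = 2)."""
    mse = torch.mean((a.float() - b.float()) ** 2)
    return float(10 * torch.log10(data_range ** 2 / mse))

# Deterministic round trip: use the distribution mode instead of sampling.
with torch.no_grad():
    z = vae.encode(video_frames).latent_dist.mode()
    recon = vae.decode(z).sample

print(f"Round-trip PSNR: {psnr(video_frames, recon):.1f} dB")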

Performance Tips and Optimization

Memory Optimization

# Gradient checkpointing lowers memory during training (no effect at inference)
vae.enable_gradient_checkpointing()

# Slice encode/decode over the batch dimension to reduce peak VRAM
vae.enable_slicing()

# Tile large frames so high resolutions fit in limited VRAM
vae.enable_tiling()

# Note: sequential CPU offload and attention slicing are pipeline-level
# features; with a full pipeline, use pipe.enable_sequential_cpu_offload()

Speed Optimization

# Compile model for faster inference (PyTorch 2.0+)
vae = torch.compile(vae, mode="reduce-overhead")

# xFormers attention (may be unsupported or a no-op for this mostly
# convolutional VAE)
vae.enable_xformers_memory_efficient_attention()

# Half precision is faster, but the Wan VAE is often kept in float32
# for quality; benchmark both on your own content
vae = vae.half()

Batch Processing

# Process multiple video clips in one forward pass (frame count 4k+1)
batch_size = 4
video_clips = torch.randn(batch_size, 3, 17, 480, 720).half().to("cuda")

with torch.no_grad():
    latents = vae.encode(video_clips).latent_dist.sample()
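
If a full batch exceeds available VRAM, a plain loop over sub-batches (a generic pattern, not a WAN-specific API) keeps peak memory bounded:

# Encode one clip at a time and concatenate the latents.
chunks = []
with torch.no_grad():
    for clip in video_clips.split(1, dim=0):
        chunks.append(vae.encode(clip).latent_dist.sample())
latents = torch.cat(chunks, dim=0)
print(f"Latents: {latents.shape}")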

Resolution Guidelines

  • 480P (854Γ—480): Best for real-time applications, lowest VRAM
  • 720P (1280Γ—720): Balanced quality and performance
  • 1080P (1920Γ—1080): Maximum quality, requires high-end GPU

License

This model is released under a custom WAN license. Please refer to the official WAN repository for detailed licensing terms and usage restrictions.

License Type: Other (Custom WAN License)

Usage Restrictions

  • Check official WAN-AI repository for commercial usage terms
  • Attribution required for research and non-commercial use
  • Refer to WAN-AI Organization for updates

Citation

If you use this VAE in your research or applications, please cite the WAN project:

@misc{wan2025,
  title={WAN: Open and Advanced Large-Scale Video Generative Models},
  author={WAN-AI Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/Wan-AI}}
}

Related Resources

Related Models

  • WAN2.2 VAE: Latest VAE with 64x compression (4Γ—16Γ—16)
  • WAN2.1 T2V: Text-to-video generation models
  • WAN2.1 I2V: Image-to-video generation models
  • WAN2.2 Animate: Character animation models

Community & Support

  • Hugging Face WAN-AI discussions
  • GitHub issues and community forums
  • Research papers and technical documentation

Model Card Contact

For questions, issues, or collaboration inquiries, use the community channels listed under Community & Support above.


Version: v1.3 | Last Updated: 2025-10-14 | Model Size: 243 MB | Format: SafeTensors
