---
license: other
library_name: diffusers
pipeline_tag: text-to-video
tags:
- wan
- text-to-video
- image-generation
---

# WAN2.1 VAE - 3D Causal Video Variational Autoencoder

WAN2.1 VAE is a 3D causal Variational Autoencoder designed for high-quality video generation and compression. This repository contains the standalone VAE component used in the WAN (Open and Advanced Large-Scale Video Generative Models) framework.

## Model Description

The WAN2.1 VAE advances video compression and reconstruction, featuring:

- **3D Causal Architecture**: Maintains temporal causality across video sequences
- **Unlimited Length Support**: Encodes and decodes 1080P videos of arbitrary length without losing historical temporal information
- **High Compression Efficiency**: Strong spatio-temporal compression with minimal quality loss
- **Memory Optimized**: Reduced memory footprint compared to traditional video VAEs
- **Temporal Information Preservation**: Ensures consistent temporal dynamics across long sequences

### Key Innovations

1. **Improved Spatio-Temporal Compression**: Higher compression ratios while maintaining visual fidelity
2. **Causal Temporal Processing**: Ensures frame-to-frame causality for coherent video generation
3. **Efficient Memory Usage**: Optimized for consumer-grade GPU deployment
4. **High-Resolution Support**: Native support for 1080P video encoding/decoding

## Repository Contents

```
E:\huggingface\wan21-vae\
└── vae/
    └── wan/
        └── wan21-vae.safetensors (243 MB)
```

### Model Files

| File | Size | Format | Description |
|------|------|--------|-------------|
| `wan21-vae.safetensors` | 243 MB | SafeTensors | WAN2.1 VAE weights |

**Total Repository Size**: 243 MB

## Hardware Requirements

### Minimum Requirements

- **VRAM**: 4 GB (inference only)
- **RAM**: 8 GB system memory
- **Disk Space**: 500 MB (including dependencies)
- **GPU**: CUDA-compatible GPU (NVIDIA GTX 1060 or equivalent)

### Recommended Requirements

- **VRAM**: 8+ GB for optimal performance
- **RAM**: 16 GB system memory
- **Disk Space**: 1 GB
- **GPU**: NVIDIA RTX 3060 or better

### Resolution-Specific Requirements

- **480P Video**: 4-6 GB VRAM
- **720P Video**: 6-8 GB VRAM
- **1080P Video**: 8-12 GB VRAM

## Usage Examples

### Basic VAE Loading

```python
import torch
from diffusers import AutoencoderKLWan  # WAN-specific 3D causal VAE class (diffusers >= 0.33)

# Load the WAN2.1 VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

print(f"VAE loaded: {vae.config}")
```

### Video Encoding Example

```python
import torch
import numpy as np
from diffusers import AutoencoderKLWan

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Prepare video frames (example with dummy data)
# Shape: [batch, channels, frames, height, width]
# The causal design expects a frame count of the form 4k + 1 (e.g. 17, 81)
video_frames = torch.randn(1, 3, 17, 480, 720).half().to("cuda")

# Encode video to latent space
with torch.no_grad():
    latents = vae.encode(video_frames).latent_dist.sample()

print(f"Latent shape: {latents.shape}")
print(f"Compression ratio: {np.prod(video_frames.shape) / np.prod(latents.shape):.2f}x")
```

### Video Decoding Example

```python
import torch
from diffusers import AutoencoderKLWan

# Load VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Decode latents back to video frames
# Assuming you have latents from the encoding step
with torch.no_grad():
    reconstructed_video = vae.decode(latents).sample

print(f"Reconstructed video shape: {reconstructed_video.shape}")
```
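As a sanity check on encoder and decoder shapes, the expected latent geometry can be computed by hand. The following is a minimal sketch, assuming the commonly published WAN2.1 VAE configuration of 4× temporal stride, 8× spatial stride, and 16 latent channels; verify these against `vae.config` for your checkpoint:

```python
def expected_latent_shape(batch, frames, height, width,
                          latent_channels=16, t_stride=4, s_stride=8):
    # Causal scheme: the first frame becomes one latent frame on its own,
    # then every further group of t_stride frames adds one latent frame
    latent_frames = 1 + (frames - 1) // t_stride
    return (batch, latent_channels, latent_frames,
            height // s_stride, width // s_stride)

# A 17-frame 480x720 clip should encode to a (1, 16, 5, 60, 90) latent
print(expected_latent_shape(batch=1, frames=17, height=480, width=720))
```

If the shape printed here disagrees with `latents.shape` from the encoding example, the stride or channel assumptions above do not match your checkpoint.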
### Integration with WAN Models

```python
import torch
from diffusers import AutoencoderKLWan, WanPipeline

# Load custom VAE
vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
)

# Load a Diffusers-format WAN checkpoint with the custom VAE
pipe = WanPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
    vae=vae,
    torch_dtype=torch.float16
).to("cuda")

# Generate video (num_frames should be of the form 4k + 1)
prompt = "A serene beach at sunset with waves crashing"
video = pipe(prompt, num_frames=17, height=480, width=720).frames[0]

print(f"Generated video: {len(video)} frames")
```

## Model Specifications

### Architecture Details

- **Type**: 3D Causal Variational Autoencoder
- **Architecture**: Causal spatio-temporal convolutions
- **Compression**: 4× temporal and 8×8 spatial downsampling (a 4×8×8 reduction of the T×H×W grid)
- **Causality**: Temporal causal processing for frame consistency
- **Latent Dimensions**: Optimized for video generation tasks

### Technical Specifications

- **Precision**: FP16 (half precision) recommended
- **Format**: SafeTensors (secure, efficient loading)
- **Framework**: PyTorch >= 2.4.0
- **Library**: Diffusers (Hugging Face)
- **Temporal Support**: Unlimited frame sequences
- **Resolution Support**: Up to 1080P native

### Supported Operations

- Video encoding (frames → latents)
- Video decoding (latents → frames)
- Temporal compression
- Spatial compression
- Causal frame generation

## Performance Tips and Optimization

### Memory Optimization

```python
# Gradient checkpointing trades compute for memory (training only)
vae.enable_gradient_checkpointing()

# Sliced and tiled VAE decoding reduce peak VRAM for large frames
vae.enable_slicing()
vae.enable_tiling()

# CPU offloading and attention slicing are pipeline-level features:
# pipe.enable_sequential_cpu_offload()
# pipe.enable_attention_slicing(1)
```

### Speed Optimization

```python
# Compile model for faster inference (PyTorch 2.0+)
vae = torch.compile(vae, mode="reduce-overhead")

# Use xFormers for memory-efficient attention (if installed)
vae.enable_xformers_memory_efficient_attention()

# Use half precision for faster inference
vae = vae.half()
```

### Batch Processing

```python
# Process multiple video clips efficiently (frame count of the form 4k + 1)
batch_size = 4
video_clips = torch.randn(batch_size, 3, 17, 480, 720).half().to("cuda")

with torch.no_grad():
    latents = vae.encode(video_clips).latent_dist.sample()
```

### Resolution Guidelines

- **480P (854×480)**: Best for real-time applications, lowest VRAM
- **720P (1280×720)**: Balanced quality and performance
- **1080P (1920×1080)**: Maximum quality, requires a high-end GPU
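For 1080P work near the 8-12 GB budget above, tiled decoding bounds peak VRAM by processing overlapping spatial tiles instead of the full frame at once. The following is a minimal sketch, assuming the checkpoint loads with diffusers' `AutoencoderKLWan`, that your diffusers version exposes its tiling support, and the 4×8×8 stride / 16-channel latent layout described earlier:

```python
import torch
from diffusers import AutoencoderKLWan

vae = AutoencoderKLWan.from_pretrained(
    "E:/huggingface/wan21-vae/vae/wan",
    torch_dtype=torch.float16
).to("cuda")

# Decode each frame in overlapping spatial tiles; borders are blended,
# so quality stays close to a full-frame decode at a fraction of the VRAM
vae.enable_tiling()

# Dummy 1080P latents: 16 channels, 5 latent frames, (1080/8) x (1920/8) grid
latents = torch.randn(1, 16, 5, 135, 240).half().to("cuda")

with torch.no_grad():
    frames = vae.decode(latents).sample  # expected shape: [1, 3, 17, 1080, 1920]
```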
## License

This model is released under a custom WAN license. Please refer to the official WAN repository for detailed licensing terms and usage restrictions.

**License Type**: Other (Custom WAN License)

### Usage Restrictions

- Check the official WAN-AI repository for commercial usage terms
- Attribution required for research and non-commercial use
- Refer to [WAN-AI Organization](https://huggingface.co/Wan-AI) for updates

## Citation

If you use this VAE in your research or applications, please cite the WAN project:

```bibtex
@misc{wan2025,
  title={WAN: Open and Advanced Large-Scale Video Generative Models},
  author={WAN-AI Team},
  year={2025},
  publisher={Hugging Face},
  howpublished={https://huggingface.co/Wan-AI}
}
```

## Related Resources

### Official Links

- **WAN Organization**: https://huggingface.co/Wan-AI
- **WAN2.1 T2V 1.3B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-1.3B
- **WAN2.1 T2V 14B Model**: https://huggingface.co/Wan-AI/Wan2.1-T2V-14B
- **WAN2.2 Models**: https://huggingface.co/Wan-AI (latest versions)
- **GitHub Repository**: https://github.com/Wan-Video

### Related Models

- **WAN2.2 VAE**: Latest VAE with 64x element compression (4×16×16 spatio-temporal downsampling)
- **WAN2.1 T2V**: Text-to-video generation models
- **WAN2.1 I2V**: Image-to-video generation models
- **WAN2.2 Animate**: Character animation models

### Community & Support

- Hugging Face WAN-AI discussions
- GitHub issues and community forums
- Research papers and technical documentation

## Model Card Contact

For questions, issues, or collaboration inquiries:

- Visit the [WAN-AI Hugging Face Organization](https://huggingface.co/Wan-AI)
- Check the [official GitHub repository](https://github.com/Wan-Video)
- Review model-specific documentation on individual model cards

---

**Version**: v1.3
**Last Updated**: 2025-10-14
**Model Size**: 243 MB
**Format**: SafeTensors