---
license: apache-2.0
pipeline_tag: text-to-video
library_name: diffusers
---
# PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation

[Paper](https://huggingface.co/papers/2512.04025) | [Project Page](http://ziplab.co/PSA) | [Code](https://github.com/ziplab/Pyramid-Sparse-Attention)

Official PyTorch implementation of [PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation](https://huggingface.co/papers/2512.04025).

<p align="center">
<img src="https://github.com/ziplab/Pyramid-Sparse-Attention/raw/main/figures/prompt007comparison.jpg" width="100%">
</p>

<p align="center"><em>Visual comparison of sparse attention methods at similar sparsity levels (~90%). PSA maintains visual fidelity close to full attention, while other methods show noticeable artifacts.</em></p>

Pyramid Sparse Attention (PSA) is a versatile attention module designed to overcome the quadratic complexity bottleneck of attention mechanisms in foundation models. It introduces multi-level pooled Key-Value (KV) representations, enabling finer mask granularity than traditional binary masking: critical KV blocks receive full-resolution attention, while less important blocks use progressively pooled representations, forming an informative interpolation between full retention and complete pruning. This design mitigates information loss while preserving computational efficiency. PSA applies to both video understanding and generation tasks, matching or outperforming existing sparse attention baselines with a superior efficiency-quality trade-off.
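
To make the pooled-KV idea concrete, here is a minimal, self-contained PyTorch sketch: the most important KV blocks are kept at full resolution, the remaining blocks are replaced by progressively coarser average-pooled copies, and dense attention runs over the shortened KV sequence. The block-importance heuristic, pooling schedule, and function name below are illustrative assumptions for exposition, not the repository's optimized kernels, and the actual module's mask construction is more fine-grained.

```python
# Illustrative pooled-KV attention sketch (NOT the repo's optimized kernel).
# The block-importance score and pooling schedule are simplifying assumptions.
import torch
import torch.nn.functional as F

def pooled_kv_attention(q, k, v, block_size=64, keep_ratio=0.1, pool_factors=(4, 16)):
    """q, k, v: [B, H, N, D], with N divisible by block_size."""
    B, H, N, D = q.shape
    nb = N // block_size
    kb = k.view(B, H, nb, block_size, D)
    vb = v.view(B, H, nb, block_size, D)

    # Rough global importance per KV block: similarity between the mean query
    # and each key block's mean (averaged over batch and heads for simplicity).
    q_mean = q.mean(dim=(0, 1, 2))                      # [D]
    k_mean = kb.mean(dim=(0, 1, 3))                     # [nb, D]
    rank = (k_mean @ q_mean).argsort(descending=True)   # most important first

    # Level 0 keeps the top blocks at full resolution; the rest are split
    # across progressively coarser average-pooling levels.
    n_full = max(1, int(keep_ratio * nb))
    groups = [rank[:n_full]] + list(rank[n_full:].chunk(len(pool_factors)))

    k_parts, v_parts = [], []
    for level, idx in enumerate(groups):
        if idx.numel() == 0:
            continue
        k_sel = kb[:, :, idx].reshape(B, H, -1, D)
        v_sel = vb[:, :, idx].reshape(B, H, -1, D)
        if level > 0:  # pool less important blocks along the sequence axis
            f = pool_factors[level - 1]
            k_sel = F.avg_pool1d(k_sel.transpose(-1, -2).reshape(B * H, D, -1), f)
            k_sel = k_sel.reshape(B, H, D, -1).transpose(-1, -2)
            v_sel = F.avg_pool1d(v_sel.transpose(-1, -2).reshape(B * H, D, -1), f)
            v_sel = v_sel.reshape(B, H, D, -1).transpose(-1, -2)
        k_parts.append(k_sel)
        v_parts.append(v_sel)

    # Dense (bidirectional) attention over the shortened KV sequence.
    k_red, v_red = torch.cat(k_parts, dim=2), torch.cat(v_parts, dim=2)
    return F.scaled_dot_product_attention(q, k_red, v_red)

# Example: 2k tokens, ~10% of KV blocks kept at full resolution.
q, k, v = (torch.randn(1, 8, 2048, 64) for _ in range(3))
out = pooled_kv_attention(q, k, v)   # [1, 8, 2048, 64]
```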
> **Note:** This release is **inference-only** and supports **bidirectional attention**. Causal attention masks and backward propagation (training) are still being optimized and will be released in a future update.
## Installation

### Using uv (Recommended)

```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .
```

### Using pip

```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
> For best performance, we recommend using a PyTorch nightly build.
## Download Weights

### CogVideoX-5B LoRA (4-step)

```bash
huggingface-cli download GYP666/BLADE cogvideox-5b-psa-lora/pytorch_lora_weights.safetensors --local-dir ./weights
```
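
If you prefer to download from Python rather than the CLI, the equivalent call with `huggingface_hub` (same repo and filename as above) looks like this:

```python
from huggingface_hub import hf_hub_download

# Fetch the CogVideoX-5B PSA LoRA weights into ./weights, mirroring the
# huggingface-cli command above.
lora_file = hf_hub_download(
    repo_id="GYP666/BLADE",
    filename="cogvideox-5b-psa-lora/pytorch_lora_weights.safetensors",
    local_dir="./weights",
)
print(lora_file)  # local path to the downloaded .safetensors file
```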
**Note:** After downloading, update the `lora_path` in `examples/configs/model_configs.py` to point to your weights directory.
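
For reference, the edit amounts to pointing `lora_path` at the downloaded file; the snippet below is illustrative, and the actual layout of `model_configs.py` may differ:

```python
# In examples/configs/model_configs.py (illustrative; the real config structure may differ):
lora_path = "./weights/cogvideox-5b-psa-lora/pytorch_lora_weights.safetensors"
```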
## Quick Start (Inference)

### CogVideoX1.5-5B

```bash
python examples/inference/cogvideo/cogvideo_5b.py \
    --model cogvideo1.5_5b \
    --prompt "your prompt here" \
    --use_psa
```

### Wan2.1-1.3B

```bash
python examples/inference/wan21/wan21_1.3b.py \
    --prompt "your prompt here" \
    --use_psa --no_warmup
```

For more inference examples, see [examples/README.md](https://github.com/ziplab/Pyramid-Sparse-Attention/blob/main/examples/README.md).
## Citation

If you find this work useful, please cite our paper:

```bibtex
@misc{li2025psapyramidsparseattention,
      title={PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation},
      author={Xiaolong Li and Youping Gu and Xi Lin and Weijie Wang and Bohan Zhuang},
      year={2025},
      eprint={2512.04025},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.04025},
}
```