---
license: apache-2.0
pipeline_tag: text-to-video
library_name: diffusers
---
# PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
[Paper](https://huggingface.co/papers/2512.04025) | [Project Page](http://ziplab.co/PSA) | [Code](https://github.com/ziplab/Pyramid-Sparse-Attention)
Official PyTorch implementation of [PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation](https://huggingface.co/papers/2512.04025).
<p align="center">
<img src="https://github.com/ziplab/Pyramid-Sparse-Attention/raw/main/figures/prompt007comparison.jpg" width="100%">
</p>
<p align="center"><em>Visual comparison of sparse attention methods at similar sparsity levels (~90%). PSA maintains visual fidelity close to full attention while other methods show noticeable artifacts.</em></p>
Pyramid Sparse Attention (PSA) is a versatile attention module designed to overcome the quadratic complexity bottleneck of attention in foundation models. It introduces multi-level pooled Key-Value (KV) representations, enabling finer mask granularity than traditional binary masking: critical KV blocks receive full-resolution attention, while less important blocks use progressively pooled representations, yielding an informative interpolation between full retention and complete pruning that mitigates information loss while preserving computational efficiency. PSA applies to both video understanding and generation, consistently outperforming or matching existing sparse attention baselines with a superior efficiency-quality trade-off.
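To make the idea concrete, below is a minimal plain-PyTorch sketch of block-wise attention with pyramid-pooled KV. It is not the repository's fused kernel: each (query-block, key-block) pair is assigned a level, where level 0 keeps the KV block at full resolution, level L average-pools it by a factor of 2^L, and -1 prunes it entirely. The block size, the level map, and all function names are illustrative assumptions.

```python
# Conceptual sketch of pyramid-pooled KV attention (illustrative only, not the PSA kernel).
import torch
import torch.nn.functional as F


def pooled_kv(k_blk, v_blk, level):
    """Average-pool a (block, dim) KV block by 2**level along the sequence axis."""
    if level == 0:
        return k_blk, v_blk
    stride = 2 ** level
    k_p = F.avg_pool1d(k_blk.t().unsqueeze(0), kernel_size=stride, stride=stride).squeeze(0).t()
    v_p = F.avg_pool1d(v_blk.t().unsqueeze(0), kernel_size=stride, stride=stride).squeeze(0).t()
    return k_p, v_p


def pyramid_sparse_attention(q, k, v, level_map, block=64):
    """q, k, v: (seq, dim); level_map: (n_blocks, n_blocks) ints, -1 = fully pruned."""
    seq, dim = q.shape
    n_blocks = seq // block
    out = torch.empty_like(q)
    for i in range(n_blocks):
        q_blk = q[i * block:(i + 1) * block]
        ks, vs = [], []
        for j in range(n_blocks):
            lvl = int(level_map[i, j])
            if lvl < 0:                               # pruned key block: skip entirely
                continue
            k_blk = k[j * block:(j + 1) * block]
            v_blk = v[j * block:(j + 1) * block]
            k_j, v_j = pooled_kv(k_blk, v_blk, lvl)   # coarser KV for higher levels
            ks.append(k_j)
            vs.append(v_j)
        k_cat = torch.cat(ks, dim=0)                  # variable-length KV per query block
        v_cat = torch.cat(vs, dim=0)
        attn = torch.softmax(q_blk @ k_cat.t() / dim ** 0.5, dim=-1)
        out[i * block:(i + 1) * block] = attn @ v_cat
    return out


# Toy usage: 4 blocks of 64 tokens; diagonal blocks kept at full resolution,
# off-diagonal blocks progressively pooled.
q, k, v = (torch.randn(256, 128) for _ in range(3))
level_map = torch.tensor([[0, 1, 2, 2],
                          [1, 0, 1, 2],
                          [2, 1, 0, 1],
                          [2, 2, 1, 0]])
print(pyramid_sparse_attention(q, k, v, level_map, block=64).shape)  # torch.Size([256, 128])
```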
> **Note:** This release focuses on **inference-only** with **bidirectional attention**. Support for causal attention masks and backward propagation (training) is still under optimization and will be released in a future update.
## Installation
### Using uv (Recommended)
```bash
uv venv --python 3.11
source .venv/bin/activate
uv pip install -e .
```
### Using pip
```bash
python -m venv .venv
source .venv/bin/activate
pip install -e .
```
> For best performance, we recommend using a PyTorch nightly build.
## Download Weights
### CogVideoX-5B LoRA (4-step)
```bash
huggingface-cli download GYP666/BLADE cogvideox-5b-psa-lora/pytorch_lora_weights.safetensors --local-dir ./weights
```
**Note:** After downloading, update the `lora_path` in `examples/configs/model_configs.py` to point to your weights directory.
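For reference, after the download command above the entry might look like the snippet below; the exact structure of `model_configs.py` may differ across versions, so treat the field as illustrative.

```python
# examples/configs/model_configs.py (illustrative; the real file may organize this differently)
lora_path = "./weights/cogvideox-5b-psa-lora/pytorch_lora_weights.safetensors"
```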
## Quick Start (Inference)
### CogVideoX1.5-5B
```bash
python examples/inference/cogvideo/cogvideo_5b.py \
--model cogvideo1.5_5b \
--prompt "your prompt here" \
--use_psa
```
### Wan2.1-1.3B
```bash
python examples/inference/wan21/wan21_1.3b.py \
--prompt "your prompt here" \
--use_psa --no_warmup
```
For more inference examples, see [examples/README.md](https://github.com/ziplab/Pyramid-Sparse-Attention/blob/main/examples/README.md).
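The scripts above wire PSA into the attention layers for you. If you only want to smoke-test the downloaded 4-step LoRA with stock `diffusers` (without PSA's sparse attention), a minimal sketch might look like the following; the base model id, step count, and guidance setting are assumptions rather than values taken from this repository.

```python
# Stock-diffusers sketch for the downloaded 4-step LoRA only; it does NOT enable PSA.
# Assumptions: the LoRA targets THUDM/CogVideoX-5b and runs with guidance disabled at 4 steps.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16).to("cuda")
pipe.load_lora_weights("./weights/cogvideox-5b-psa-lora", weight_name="pytorch_lora_weights.safetensors")

video = pipe(
    prompt="your prompt here",
    num_inference_steps=4,   # assumed to match the 4-step distillation LoRA
    guidance_scale=1.0,      # assumption: distilled models typically skip CFG
).frames[0]
export_to_video(video, "output.mp4", fps=8)
```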
## Citation
If you find this work useful, please cite our paper:
```bibtex
@misc{li2025psapyramidsparseattention,
title={PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation},
author={Xiaolong Li and Youping Gu and Xi Lin and Weijie Wang and Bohan Zhuang},
year={2025},
eprint={2512.04025},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2512.04025},
}
```