arxiv:2512.13492

Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10×

Published on Dec 15, 2025
AI-generated summary

The T3-Video model improves the efficiency and quality of 4K video generation through a multi-scale weight-sharing window attention mechanism and hierarchical blocking.

Abstract

Native 4K (2160×3840) video generation remains a critical challenge due to the quadratic computational explosion of full attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed T3 (Transform Trained Transformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, T3-Video introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that T3-Video substantially outperforms existing approaches: while delivering performance improvements (+4.29↑ VQA and +0.08↑ VTC), it accelerates native 4K video generation by more than 10×. Project page at https://zhangzjn.github.io/projects/T3-Video
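The central idea described in the abstract is to restrict attention to local windows computed at several spatial-temporal scales while reusing the same projection weights across scales, so a full-attention pretrained model can be retrofitted without adding parameters. The sketch below illustrates that general idea in PyTorch; the class name, window sizes, partitioning scheme, and the averaging of per-scale outputs are illustrative assumptions, not the T3-Video implementation.

# A minimal sketch of multi-scale weight-sharing window attention, assuming
# video tokens laid out as (batch, frames, height, width, channels).
# Window sizes, the even-division requirement, and the output averaging are
# illustrative choices, not the paper's actual design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleSharedWindowAttention(nn.Module):
    def __init__(self, dim, num_heads=8, window_sizes=((4, 8, 8), (4, 16, 16))):
        super().__init__()
        self.num_heads = num_heads
        self.window_sizes = window_sizes
        # One set of projections shared by every scale (the "weight sharing").
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def _window_attend(self, x, ws):
        B, T, H, W, C = x.shape
        wt, wh, ww = ws  # assumes T, H, W are divisible by the window size
        # Partition the video into non-overlapping (wt, wh, ww) windows.
        x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
        x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
        qkv = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(t.shape[0], t.shape[1], self.num_heads, -1).transpose(1, 2)
                   for t in qkv)
        out = F.scaled_dot_product_attention(q, k, v)  # attention within each window only
        out = out.transpose(1, 2).reshape(-1, wt * wh * ww, C)
        out = self.proj(out)
        # Undo the window partition back to (B, T, H, W, C).
        out = out.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
        out = out.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
        return out

    def forward(self, x):
        # Run the same weights at several window scales and merge the results.
        outs = [self._window_attend(x, ws) for ws in self.window_sizes]
        return torch.stack(outs, dim=0).mean(dim=0)

Because each token attends only within a fixed-size window, the cost grows roughly linearly with the number of tokens instead of quadratically, which is the kind of saving that makes a 10× speedup plausible at 4K resolution.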
