Abstract
DiP, a pixel space diffusion framework, combines a Diffusion Transformer and a Patch Detailer Head to achieve computational efficiency and high-quality image generation without using VAEs.
Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches to construct global structure efficiently, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP attains up to 10× faster inference than the previous pixel space method while increasing the total parameter count by only 0.3%, and achieves an FID of 1.79 on ImageNet 256×256.
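The global/local decoupling can be pictured with a short sketch. The module names, hidden size, and patch size below are illustrative assumptions, not the authors' implementation: a DiT-style backbone denoises a grid of large pixel patches, and a small MLP head decodes each contextual token back into its pixel patch.

```python
# Minimal sketch of the two-stage design described above.
# All shapes, names, and the patch size are assumptions for illustration.
import torch
import torch.nn as nn


class PatchDetailerHead(nn.Module):
    """Lightweight head mapping each coarse patch token back to pixels,
    conditioned on the backbone's contextual features."""

    def __init__(self, hidden_dim=1152, patch_size=16, out_channels=3):
        super().__init__()
        self.patch_size = patch_size
        self.out_channels = out_channels
        # A small MLP keeps the added parameter count negligible
        # relative to the transformer backbone.
        self.mlp = nn.Sequential(
            nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, patch_size * patch_size * out_channels),
        )

    def forward(self, tokens, grid_hw):
        # tokens: (B, N, hidden_dim) contextual features from the backbone
        b, n, _ = tokens.shape
        h, w = grid_hw
        p, c = self.patch_size, self.out_channels
        pixels = self.mlp(tokens)                      # (B, N, p*p*C)
        pixels = pixels.view(b, h, w, p, p, c)
        pixels = pixels.permute(0, 5, 1, 3, 2, 4)      # (B, C, h, p, w, p)
        return pixels.reshape(b, c, h * p, w * p)


class DiPSketch(nn.Module):
    """Global stage: DiT-style backbone over large pixel patches.
    Local stage: PatchDetailerHead restores fine-grained detail."""

    def __init__(self, backbone, hidden_dim=1152, patch_size=16):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, hidden_dim,
                                     kernel_size=patch_size, stride=patch_size)
        self.backbone = backbone              # any transformer with (tokens, t, cond) -> tokens
        self.detailer = PatchDetailerHead(hidden_dim, patch_size)

    def forward(self, noisy_image, t, cond):
        feats = self.patch_embed(noisy_image)          # (B, D, H/p, W/p)
        grid_hw = feats.shape[-2:]
        tokens = feats.flatten(2).transpose(1, 2)      # (B, N, D)
        tokens = self.backbone(tokens, t, cond)        # global structure
        return self.detailer(tokens, grid_hw)          # pixel-space prediction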
Community
Novel Pixel-Space Diffusion Framework: We propose DiP, an end-to-end pixel-level diffusion framework. By eliminating the dependency on VAEs, DiP successfully alleviates the long-standing trade-off between computational efficiency and generation quality.
Extreme Efficiency Enhancement: By reducing sequence length (see the token-count sketch after this list), DiP achieves an inference speed 10× faster than the previous state-of-the-art pixel-level method (PixelFlow-XL). Furthermore, it maintains a computational cost comparable to mainstream LDMs, enabling highly efficient pixel-level generation.
High Performance with Negligible Overhead: The introduction of a lightweight Patch Detailer Head adds only 0.3% to the total parameter count while significantly enhancing the fidelity of fine-grained image details.
State-of-the-Art (SOTA) Generation Quality: On the ImageNet 256×256 benchmark, DiP achieves a remarkable FID of 1.79, outperforming leading latent-based models such as DiT-XL and SiT-XL.
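The sequence-length argument in the efficiency bullet above can be made concrete with a back-of-the-envelope count; the patch and latent sizes here are illustrative assumptions, not figures from the paper.

```python
# Token counts for self-attention: large pixel patches keep the backbone's
# sequence length in the same regime as an LDM on downsampled latents.
def num_tokens(resolution, patch):
    return (resolution // patch) ** 2

ldm_tokens  = num_tokens(256 // 8, 2)   # DiT on 32x32 latents, patch 2  -> 256 tokens
pixel_small = num_tokens(256, 4)        # pixel space, small patches     -> 4096 tokens
pixel_large = num_tokens(256, 16)       # pixel space, large patches     -> 256 tokens

# Self-attention cost grows roughly with the square of the token count.
print(ldm_tokens, pixel_small, pixel_large)   # 256 4096 256
print((pixel_small / pixel_large) ** 2)       # ~256x attention-cost gap
```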