arxiv:2511.18822

DiP: Taming Diffusion Models in Pixel Space

Published on Nov 24 · Submitted by chen on Dec 1
AI-generated summary

DiP, a pixel space diffusion framework, combines a Diffusion Transformer and a Patch Detailer Head to achieve computational efficiency and high-quality image generation without using VAEs.

Abstract

Diffusion models face a fundamental trade-off between generation quality and computational efficiency. Latent Diffusion Models (LDMs) offer an efficient solution but suffer from potential information loss and non-end-to-end training. In contrast, existing pixel space models bypass VAEs but are computationally prohibitive for high-resolution synthesis. To resolve this dilemma, we propose DiP, an efficient pixel space diffusion framework. DiP decouples generation into a global and a local stage: a Diffusion Transformer (DiT) backbone operates on large patches to construct global structure efficiently, while a co-trained lightweight Patch Detailer Head leverages contextual features to restore fine-grained local details. This synergistic design achieves computational efficiency comparable to LDMs without relying on a VAE. DiP delivers up to 10× faster inference than the previous pixel-space method while increasing the total parameter count by only 0.3%, and achieves a 1.79 FID score on ImageNet 256×256.
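
To make the decoupling concrete, below is a minimal sketch of the two-stage forward pass. The module names, shapes, and patch size (16×16 on 256×256 images) are illustrative assumptions, not the paper's exact configuration; timestep/class conditioning and the precise prediction target are omitted.

```python
# Hypothetical sketch of DiP's global/local decoupling (not the authors' code).
import torch
import torch.nn as nn

class DiPSketch(nn.Module):
    def __init__(self, img_size=256, patch=16, dim=768, depth=12, heads=12):
        super().__init__()
        self.patch = patch
        self.num_patches = (img_size // patch) ** 2
        patch_dim = 3 * patch * patch
        # Global stage: a DiT-style Transformer over a short sequence of large patches.
        self.embed = nn.Linear(patch_dim, dim)
        self.pos = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim,
                                           batch_first=True, norm_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        # Local stage: a lightweight "Patch Detailer"-style head (tiny per-patch MLP)
        # that sees the contextual token plus the noisy patch pixels.
        self.detailer = nn.Sequential(
            nn.Linear(dim + patch_dim, dim // 4),
            nn.GELU(),
            nn.Linear(dim // 4, patch_dim),
        )

    def forward(self, noisy_img):
        b, c, h, w = noisy_img.shape
        p = self.patch
        # Patchify: (B, 3, H, W) -> (B, N, 3*p*p) with N = (H/p) * (W/p).
        x = noisy_img.unfold(2, p, p).unfold(3, p, p)          # B, 3, H/p, W/p, p, p
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        # Global structure from the coarse token sequence.
        ctx = self.backbone(self.embed(x) + self.pos)           # B, N, dim
        # Fine-grained per-patch prediction from context + noisy pixels.
        out = self.detailer(torch.cat([ctx, x], dim=-1))        # B, N, 3*p*p
        # Un-patchify back to image space.
        g = h // p
        out = out.reshape(b, g, g, c, p, p).permute(0, 3, 1, 4, 2, 5).reshape(b, c, h, w)
        return out  # e.g. predicted noise/velocity, depending on the parameterization
```

The point of the sketch is the cost split: the Transformer only attends over the short coarse-patch sequence, while the detailer is a small MLP applied per patch, so its parameters and FLOPs are negligible next to the backbone.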

Community

Paper author · Paper submitter
  1. Novel Pixel-Space Diffusion Framework: We propose DiP, an end-to-end pixel-level diffusion framework. By eliminating the dependency on VAEs, DiP successfully alleviates the long-standing trade-off between computational efficiency and generation quality.

  2. Extreme Efficiency Enhancement: By reducing the sequence length, DiP achieves inference 10× faster than the previous state-of-the-art pixel-level method (PixelFlow-XL), while maintaining a computational cost comparable to mainstream LDMs and enabling highly efficient pixel-level generation (see the token-count sketch after this list).

  3. High Performance with Negligible Overhead: The introduction of a lightweight Patch Detailer Head adds only 0.3% to the total parameter count while significantly enhancing the fidelity of fine-grained image details.

  4. State-of-the-Art (SOTA) Generation Quality: On the ImageNet 256×256 benchmark, DiP achieves a remarkable FID of 1.79, outperforming leading latent-based models such as DiT-XL and SiT-XL.
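
Regarding point 2, a rough back-of-the-envelope on why sequence length dominates the cost. The 2×2 and 16×16 patch sizes below are assumptions chosen for illustration, not values taken from the paper:

```python
# Token counts for a 256x256 image at two illustrative patch sizes.
def tokens(img_size, patch):
    return (img_size // patch) ** 2

fine = tokens(256, 2)     # fine patches, typical of pixel-space DiTs: 16384 tokens
coarse = tokens(256, 16)  # large patches for a global stage: 256 tokens

print(fine, coarse)            # 16384 256
print(fine / coarse)           # 64x shorter sequence
print((fine / coarse) ** 2)    # ~4096x cheaper self-attention (quadratic in length)
```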

