DiffusionVL: Translating Any Autoregressive Models into
Diffusion Vision Language Models

SOTA dVLM Performance with <5% Data & 2.0Γ— Inference Speedup!

Lunbin Zeng1,*, Jingfeng Yao1,*, Bencheng Liao1, Hongyuan Tao1, Wenyu Liu1, Xinggang Wang1,βœ‰οΈ

1Huazhong University of Science and Technology

*equal contribution, βœ‰οΈcorresponding author, [email protected]

arXiv Β· Hugging Face Paper Β· GitHub Β· Hugging Face

πŸ“° News

  • [2025.12.18] πŸŽ‰ Our paper DiffusionVL is released on arXiv! And we release the DiffusionVL models translated from Qwen2.5VL at huggingface. The training code and more models are comming soon!

πŸ“„ Introduction

The diffusion paradigm has emerged as a promising alternative to autoregressive (AR) models, offering the potential for efficient parallel decoding. However, existing diffusion vision language models (dVLMs) still lag substantially behind mainstream autoregressive vision language models in performance, primarily due to the capability limitations of their base diffusion language models.

DiffusionVL bridges this gap by answering a fundamental question: Can we directly translate any existing autoregressive model into a powerful diffusion vision language model? We propose a diffusion finetuning framework that "translates" any pretrained AR model into a diffusion vision language model through a simple paradigm shift and modality shift. Unlike prior dVLMs restricted by fixed generation lengths, DiffusionVL introduces a novel block decoding strategy, which allows arbitrary-length generation and KV-cache reuse. With this integrated design, despite using less than 5% of the training data required by previous methods, DiffusionVL translated from AR-VLMs achieves state-of-the-art performance among existing dVLMs and delivers a 2.0Γ— inference speedup.
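To make the block decoding idea concrete, here is a minimal, self-contained sketch in plain PyTorch. It is not the released implementation: `denoise_step`, `MASK_ID`, and the schedule constants are hypothetical placeholders. It only illustrates how each fixed-size block of masked tokens is denoised in parallel over a few steps, with the most confident predictions committed first, and then frozen so its KV states can be cached and reused like an autoregressive prefix.

import torch

MASK_ID = -1          # hypothetical mask-token id
BLOCK_SIZE = 32       # tokens generated per block
STEPS_PER_BLOCK = 8   # parallel denoising steps inside each block
VOCAB_SIZE = 100      # toy vocabulary for the stand-in denoiser


def denoise_step(context_ids, block_ids):
    """Stand-in for one denoising pass over the current block.
    Returns predicted token ids and confidences for every block position.
    (context_ids is unused here; a real model would attend to it.)"""
    logits = torch.randn(block_ids.shape[0], VOCAB_SIZE)  # placeholder for model logits
    conf, ids = logits.softmax(-1).max(-1)
    return ids, conf


def generate_blockwise(prompt_ids, gen_length=128):
    sequence = prompt_ids.clone()  # finished tokens; their KV states can be cached and reused
    for _ in range(0, gen_length, BLOCK_SIZE):
        block = torch.full((BLOCK_SIZE,), MASK_ID)  # start each block fully masked
        for step in range(STEPS_PER_BLOCK):
            ids, conf = denoise_step(sequence, block)
            still_masked = block == MASK_ID
            if not still_masked.any():
                break
            # "low-confidence remasking": commit only the most confident masked
            # positions this step, leaving the rest masked for later steps
            threshold = conf[still_masked].quantile(1.0 - (step + 1) / STEPS_PER_BLOCK)
            commit = still_masked & (conf >= threshold)
            block = torch.where(commit, ids, block)
        # the block is now final, so it is appended to the cached prefix
        sequence = torch.cat([sequence, block])
    return sequence


print(generate_blockwise(torch.tensor([1, 2, 3])).shape)  # prompt + 128 generated tokens

Because every committed block behaves like ordinary AR context, the decoder is not tied to a fixed generation length and can keep extending the sequence block by block, which is where the speedup over fixed-length dVLM decoding comes from.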

✨ Highlights

  • Universal Translation Framework: Translate any AR models into dVLMs with a simple yet effective approach.

  • Superior Performance: Achieve SOTA dVLM performance using <5% training data (738K vs 16.5M samples).

  • 2.0Γ— Faster Inference: Block decoding strategy enables KV-cache reuse and 2.0Γ— speedup over previous dVLMs.

(Figures: benchmark comparison and DiffusionVL framework overview)

🎯 Inference with Pre-trained Models

  • Download Pre-trained Models:
| Model | Base Model | Download |
| --- | --- | --- |
| DiffusionVL-Qwen2.5VL-3B | Qwen2.5-VL-3B | HuggingFace |
| DiffusionVL-Qwen2.5VL-7B | Qwen2.5-VL-7B | HuggingFace |
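Optionally, the checkpoints can be fetched ahead of time with the standard huggingface_hub API. This is a generic download sketch, not a repo-specific script:

# Optional: pre-download a DiffusionVL checkpoint with huggingface_hub.
from huggingface_hub import snapshot_download

local_dir = snapshot_download("hustvl/DiffusionVL-Qwen2.5VL-7B")
print(f"Checkpoint files cached at: {local_dir}")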
  • Environment Setup:

The core environment dependencies are listed as follows:

torch==2.6.0
torchvision==0.21.0
torchaudio==2.6.0
transformers==4.55.0
accelerate==1.10.1
pillow==10.4.0
requests==2.32.5
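To double-check that the installed versions match the list above, a quick and purely optional check with the Python standard library could look like this (only the core ML packages are shown):

# Optional sanity check of installed dependency versions.
from importlib.metadata import version

for pkg in ["torch", "torchvision", "torchaudio", "transformers", "accelerate"]:
    print(f"{pkg}=={version(pkg)}")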
  • Quick Start:
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import requests
import torch

# Load model with trust_remote_code
model = AutoModelForCausalLM.from_pretrained(
    "hustvl/DiffusionVL-Qwen2.5VL-7B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# Load processor (includes tokenizer)
processor = AutoProcessor.from_pretrained("hustvl/DiffusionVL-Qwen2.5VL-7B", trust_remote_code=True)

url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."}
    ]}
]
text = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = {k: v.to(model.device) if hasattr(v, 'to') else v for k, v in inputs.items()}

# Generate with diffusion
output_ids = model.generate(
    inputs=inputs["input_ids"],
    images=inputs.get("pixel_values"),
    image_grid_thws=inputs.get("image_grid_thw"),
    gen_length=128,
    steps=8,
    temperature=0.0,
    remasking_strategy="low_confidence_static",
)

# Decode output
output_text = processor.decode(output_ids[0], skip_special_tokens=True)
print(output_text)
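Note that, depending on the custom generate implementation, output_ids may contain the prompt tokens followed by the generated ones, as in the standard Transformers generate. If so, the following sketch keeps only the newly generated text; this is an assumption about the return format, so skip it if the model already strips the prompt:

# Assumption: output_ids = [prompt tokens, generated tokens], as in the
# standard Transformers generate(); skip this if the prompt is already stripped.
prompt_len = inputs["input_ids"].shape[1]
generated_text = processor.decode(output_ids[0][prompt_len:], skip_special_tokens=True)
print(generated_text)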

❀️ Acknowledgements

This repo is mainly built on Qwen2.5-VL, LLaDA-V, BD3LMs and SDAR. We thank the authors for their open-source contributions.

πŸ“ Citation

If you find our work useful, please cite our paper:

@misc{zeng2025diffusionvltranslatingautoregressivemodels,
      title={DiffusionVL: Translating Any Autoregressive Models into Diffusion Vision Language Models}, 
      author={Lunbin Zeng and Jingfeng Yao and Bencheng Liao and Hongyuan Tao and Wenyu Liu and Xinggang Wang},
      year={2025},
      eprint={2512.15713},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.15713}, 
}