SafeQwen2.5-VL-7B: Vision-Language Model with Visual Guard Module

🌐 Website | 📑 Paper

SafeQwen2.5-VL-7B is a safety-tuned multimodal large language model that extends Qwen2.5-VL-7B-Instruct with built-in image safety classification. It generates text responses to visual questions while simultaneously classifying potentially unsafe image content across 20 safety categories.

Model Description

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Architecture: Vision-language model with Visual Guard Module (VGM)
  • Training Data: HoliSafe train set
  • Training Method: LoRA fine-tuning
  • Parameters: 7B (base) + 6.46M (VGM)
  • Safety Categories: 20 categories based on HoliSafe taxonomy

Key Features

  1. Multimodal Understanding: Processes images and text for comprehensive visual understanding
  2. Safety Classification: Identifies unsafe content in images across 20 categories
  3. Non-invasive Architecture: Maintains full Qwen2.5-VL capabilities while adding safety features
  4. End-to-end Training: VGM is jointly trained with the vision-language model

The model classifies images into the following 20 safety categories:

Category ID | Category Name
----------- | --------------------------
0           | Safe
1           | Gender discrimination
2           | Race discrimination
3           | Religion discrimination
4           | Harassment
5           | Disability discrimination
6           | Drug Related Hazards
7           | Property crime
8           | Facial data exposure
9           | Identity data exposure
10          | Physical self-injury
11          | Suicide
12          | Animal abuse
13          | Obscene gestures
14          | Physical altercation
15          | Terrorism
16          | Weapon-related violence
17          | Sexual content
18          | Financial advice
19          | Medical advice
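
At runtime, the same ID-to-name mapping is exposed on the model config (as used in the usage example below), so you can inspect it directly once the model is loaded; note that the runtime label strings use lowercase underscore form (e.g., weapon_related_violence):

# Inspect the ID -> name mapping shipped with the model config
safety_categories = model.config.safety_categories  # list of 20 label strings
for idx, name in enumerate(safety_categories):
    print(idx, name)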

Installation

pip install transformers torch accelerate qwen-vl-utils

Usage

Basic Inference with Safety Classification

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForVision2Seq.from_pretrained(
    "etri-vilab/SafeQwen2.5-VL-7B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://dl.dropbox.com/scl/fi/fkb6g5hame1wnip6983qx/test_guns.png?rlkey=l1rs5s1yg4akr29ife1v9my03&dl=1"},
            {"type": "text", "text": "How to use this?"},
        ],
    }
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Forward pass with safety classification
with torch.no_grad():
    outputs = model(**inputs, do_safety=True)

# Get safety predictions
safety_probs = outputs.img_safety_probs[0]  # [20] probabilities
safety_categories = model.config.safety_categories

print("\nSafety Classification Results:")
for category, prob in zip(safety_categories, safety_probs):
    if prob > 0.1:  # Show categories with >10% probability
        print(f"  {category}: {prob:.3f}")

# >>> Safety Classification Results:
# >>> weapon_related_violence: 0.997

# Generate text response
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print("\nModel Response:")
print(output_text[0])

# >>> Model Response:
# >>> I'm sorry, I can't assist with that request. The image depicts a firearm, which is inherently dangerous and requires careful handling by trained professionals. Discussing or providing guidance on how to use such weapons can be extremely hazardous and is not appropriate. It's important to prioritize safety and adhere to legal regulations regarding firearms. If you have questions about firearm safety or need assistance, please consult a certified professional or relevant authorities.
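
If you only need the single most likely label rather than the thresholded list, take the argmax over the same probabilities (continuing from the variables defined above):

# Top-1 safety category, continuing from safety_probs / safety_categories above
top_idx = int(torch.argmax(safety_probs))
print(f"Top category: {safety_categories[top_idx]} ({safety_probs[top_idx].item():.3f})")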

Text Generation Only (Without Safety Classification)

# Set do_safety=False to skip safety classification during generation
generated_ids = model.generate(**inputs, max_new_tokens=256, do_safety=False)
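
A common pattern is to gate generation on the classifier's output and refuse up front when the image is flagged. The sketch below simply combines the two calls shown above; the 0.5 threshold and the assumption that category 0 is labeled "safe" are illustrative choices, not official recommendations.

def guarded_generate(model, processor, inputs, threshold=0.5, max_new_tokens=256):
    """Minimal sketch: run the Visual Guard Module first, then generate only if the
    image is not flagged. The threshold and the "safe" label string are assumptions."""
    with torch.no_grad():
        outputs = model(**inputs, do_safety=True)
    probs = outputs.img_safety_probs[0]
    top_idx = int(torch.argmax(probs))
    category = model.config.safety_categories[top_idx]
    if category != "safe" and probs[top_idx].item() > threshold:
        return f"Request declined: image flagged as '{category}' ({probs[top_idx].item():.2f})."
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_safety=False)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

print(guarded_generate(model, processor, inputs))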

Model Architecture

SafeQwen2.5-VL consists of:

  1. Base Vision-Language Model: Standard Qwen2.5-VL architecture
  2. Visual Guard Module (a.k.a. safety head):
    • Input: Pooled image-token features from the last hidden layer
    • Architecture: Multi-layer perceptron (MLP)
    • Hidden size: 0.5 × model hidden size (1792 for the 7B model)
    • Output: 20-dimensional logits for safety categories

The VGM operates on pooled image features extracted from the model's hidden states, ensuring minimal interference with the base model's text generation capabilities.
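
As a rough illustration of the head's shape (not the released implementation), the VGM can be pictured as a small MLP over mean-pooled image-token features. The layer sizes below follow the 0.5 × hidden-size rule above and the 7B model's 3584-dim hidden states; the activation choice is an assumption. A two-layer MLP of this shape has roughly 6.46M parameters, matching the figure quoted earlier.

import torch.nn as nn

class VisualGuardHead(nn.Module):
    """Conceptual sketch of the VGM safety head: pooled image features -> 20 logits."""
    def __init__(self, hidden_size=3584, num_categories=20):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),    # 3584 -> 1792
            nn.GELU(),                                   # activation is an assumption
            nn.Linear(hidden_size // 2, num_categories), # 1792 -> 20
        )

    def forward(self, image_hidden_states):  # [num_image_tokens, hidden_size]
        pooled = image_hidden_states.mean(dim=0)  # mean-pool over image tokens
        return self.mlp(pooled)                   # [num_categories] logits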

Training Details

  • Training Data: HoliSafe train dataset
  • Training Epochs: 5
  • LoRA Configuration:
    • Rank: 64
    • Alpha: 64
    • Target modules: Language model attention and MLP layers
  • Learning Rates:
    • Base model: 1e-5
    • Safety head: 1e-5
  • Batch Size: Effective batch size reached via gradient accumulation (see the paper for exact values)
  • Optimizer: AdamW
  • Mixed Precision: FP16

Please see the paper for full training details.
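
For reference, the LoRA settings above roughly correspond to a peft configuration like the following sketch; the target-module names and dropout value are assumptions, since this card only states "attention and MLP layers":

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                  # LoRA rank, as listed above
    lora_alpha=64,         # LoRA alpha, as listed above
    target_modules=[       # attention + MLP projections (exact names assumed)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,      # illustrative; not stated in this card
    task_type="CAUSAL_LM",
)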

Device Handling

When using device_map="auto", always ensure inputs are moved to the model's device:

# ✓ Correct - move inputs to model device
inputs = processor(...).to(model.device)
outputs = model(**inputs, do_safety=True)

# ✗ Incorrect - may cause device mismatch errors
inputs = processor(...)  # inputs on CPU
outputs = model(**inputs, do_safety=True)  # model on GPU

This is especially important when using safety classification (do_safety=True), as the model needs to access input_ids on the same device as the hidden states.
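
A one-line sanity check before the forward pass can surface this early (illustrative only):

# Illustrative check: inputs should sit on the model's (first) device
assert inputs.input_ids.device == model.device, "Call inputs.to(model.device) before the forward pass"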

Ethical Considerations

This model is designed to assist in identifying potentially unsafe visual content. It should be used responsibly:

  • Do not rely solely on this model for critical safety decisions
  • Be aware of potential biases in safety classifications
  • Regularly evaluate model performance on your specific use case
  • Combine with human review for important content moderation tasks

License

Please refer to LICENSE.md for details.

Citation

If you use SafeQwen2.5-VL in your research, please cite:

@article{lee2025holisafe,
  title={HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model},
  author={Lee, Youngwan and Kim, Kangsan and Park, Kwanyong and Jung, Ilchae and Jang, Soojin and Lee, Seanie and Lee, Yong-Ju and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2506.04704},
  year={2025},
  url={https://arxiv.org/abs/2506.04704},
  archivePrefix={arXiv},
  eprint={2506.04704},
  primaryClass={cs.AI},
}

Acknowledgments

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training, 45%), (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration, 45%), and (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), 10%).

Contact

For questions, issues, or feedback, please open an issue on the repository or contact the team directly.

📬 E-mail: [email protected]
