SafeQwen2.5-VL-7B: Vision-Language Model with Visual Guard Module
SafeQwen2.5-VL-7B is a safe multimodal large language model that extends Qwen2.5-VL-7B-Instruct with built-in image safety classification. It generates text responses to visual questions while simultaneously classifying potentially unsafe image content across 20 safety categories.
Model Description
- Base Model: Qwen2.5-VL-7B-Instruct
- Architecture: Vision-language model with Visual Guard Module (VGM)
- Training Data: HoliSafe train set
- Training Method: LoRA fine-tuning
- Parameters: 7B (base) + 6.46M (VGM)
- Safety Categories: 20 categories based on HoliSafe taxonomy
Key Features
- Multimodal Understanding: Processes images and text for comprehensive visual understanding
- Safety Classification: Identifies unsafe content in images across 20 categories
- Non-invasive Architecture: Maintains full Qwen2.5-VL capabilities while adding safety features
- End-to-end Training: VGM is jointly trained with the vision-language model
The model classifies images into the following 20 safety categories:
| Category ID | Category Name |
|---|---|
| 0 | Safe |
| 1 | Gender discrimination |
| 2 | Race discrimination |
| 3 | Religion discrimination |
| 4 | Harassment |
| 5 | Disability discrimination |
| 6 | Drug Related Hazards |
| 7 | Property crime |
| 8 | Facial data exposure |
| 9 | Identity data exposure |
| 10 | Physical self-injury |
| 11 | Suicide |
| 12 | Animal abuse |
| 13 | Obscene gestures |
| 14 | Physical altercation |
| 15 | Terrorism |
| 16 | Weapon-related violence |
| 17 | Sexual content |
| 18 | Financial advice |
| 19 | Medical advice |
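At inference time these names are exposed, in table order, through model.config.safety_categories (used in the usage example below). The helper below is a minimal, illustrative sketch of turning the Visual Guard Module's probability vector into its most likely label; it is not part of the released API.

import torch

# Illustrative helper: map the 20-dim probability vector from the Visual Guard
# Module to its top (label, probability) pair. Index 0 corresponds to "Safe".
def top_safety_category(safety_probs: torch.Tensor, categories: list[str]) -> tuple[str, float]:
    idx = int(torch.argmax(safety_probs))
    return categories[idx], float(safety_probs[idx])

# Example, after running the inference code in the Usage section:
# label, prob = top_safety_category(outputs.img_safety_probs[0], model.config.safety_categories)
# print(label, prob)  # e.g. "weapon_related_violence 0.997"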
Installation
pip install transformers torch qwen-vl-utils
Usage
Basic Inference with Safety Classification
import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from qwen_vl_utils import process_vision_info
# Load model and processor
model = AutoModelForVision2Seq.from_pretrained(
"etri-vilab/SafeQwen2.5-VL-7B",
torch_dtype=torch.float16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# Prepare input
messages = [
{
"role": "user",
"content": [
{"type": "image", "image": "https://dl.dropbox.com/scl/fi/fkb6g5hame1wnip6983qx/test_guns.png?rlkey=l1rs5s1yg4akr29ife1v9my03&dl=1"},
{"type": "text", "text": "How to use this?"},
],
}
]
# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
text=[text],
images=image_inputs,
videos=video_inputs,
padding=True,
return_tensors="pt",
).to(model.device)
# Forward pass with safety classification
with torch.no_grad():
outputs = model(**inputs, do_safety=True)
# Get safety predictions
safety_probs = outputs.img_safety_probs[0] # [20] probabilities
safety_categories = model.config.safety_categories
print("\nSafety Classification Results:")
for category, prob in zip(safety_categories, safety_probs):
if prob > 0.1: # Show categories with >10% probability
print(f" {category}: {prob:.3f}")
# >>> Safety Classification Results:
# >>> weapon_related_violence: 0.997
# Generate text response
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("\nModel Response:")
print(output_text[0])
# >>> Model Response:
# >>> I'm sorry, I can't assist with that request. The image depicts a firearm, which is inherently dangerous and requires careful handling by trained professionals. Discussing or providing guidance on how to use such weapons can be extremely hazardous and is not appropriate. It's important to prioritize safety and adhere to legal regulations regarding firearms. If you have questions about firearm safety or need assistance, please consult a certified professional or relevant authorities.
Text Generation Only (Without Safety Classification)
# Set do_safety=False to skip safety classification during generation
generated_ids = model.generate(**inputs, max_new_tokens=256, do_safety=False)
Model Architecture
SafeQwen2.5-VL consists of:
- Base Vision-Language Model: Standard Qwen2.5-VL architecture
- Visual Guard Module (a.k.a. safety head):
- Input: Pooled image token features from last hidden layer
- Architecture: Multi-layer perceptron (MLP)
- Hidden size: 0.5 × model hidden size (1792 for the 7B model)
- Output: 20-dimensional logits for safety categories
The VGM operates on pooled image features extracted from the model's hidden states, ensuring minimal interference with the base model's text generation capabilities.
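For intuition, a conceptual sketch of such a head is shown below. The dimensions follow the description above (3584 → 1792 → 20 for the 7B model, roughly matching the 6.46M VGM parameters listed earlier under these assumptions); the mean pooling, GELU activation, and exact layer layout are assumptions, not the released implementation.

import torch
import torch.nn as nn

class VisualGuardModuleSketch(nn.Module):
    """Conceptual sketch of the safety head: pooled image-token features -> 20 logits.
    Dimensions follow the description above; the activation, bias terms, and
    mean pooling are assumptions."""

    def __init__(self, model_hidden_size: int = 3584, num_categories: int = 20):
        super().__init__()
        head_hidden = model_hidden_size // 2          # 0.5 x hidden size = 1792 for 7B
        self.mlp = nn.Sequential(
            nn.Linear(model_hidden_size, head_hidden),
            nn.GELU(),
            nn.Linear(head_hidden, num_categories),   # 20 safety-category logits
        )

    def forward(self, last_hidden: torch.Tensor, image_token_mask: torch.Tensor) -> torch.Tensor:
        # Mean-pool the image-token positions of the last hidden layer, then classify.
        mask = image_token_mask.unsqueeze(-1).to(last_hidden.dtype)         # [B, T, 1]
        pooled = (last_hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return self.mlp(pooled)                                             # [B, 20]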
Training Details
- Training Data: HoliSafe train dataset
- Training Epochs: 5
- LoRA Configuration:
- Rank: 64
- Alpha: 64
- Target modules: Language model attention and MLP layers
- Learning Rates:
- Base model: 1e-5
- Safety head: 1e-5
- Batch Size: Gradient accumulation used to reach a larger effective batch size
- Optimizer: AdamW
- Mixed Precision: FP16
Please see the paper for full details.
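For orientation, the LoRA settings above could be expressed with the peft library roughly as follows; only the rank and alpha come from the list, while the target module names and dropout are assumptions meant to cover the language model's attention and MLP projections.

from peft import LoraConfig

# Sketch of the LoRA setup described above. Only r and lora_alpha are taken
# from the training details; target_modules and lora_dropout are assumptions.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",    # attention projections
        "gate_proj", "up_proj", "down_proj",       # MLP projections
    ],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)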
Device Handling
When using device_map="auto", always ensure inputs are moved to the model's device:
# ✅ Correct - move inputs to model device
inputs = processor(...).to(model.device)
outputs = model(**inputs, do_safety=True)
# ❌ Incorrect - may cause device mismatch errors
inputs = processor(...) # inputs on CPU
outputs = model(**inputs, do_safety=True) # model on GPU
This is especially important when using safety classification (do_safety=True), as the model needs to access input_ids on the same device as the hidden states.
Ethical Considerations
This model is designed to assist in identifying potentially unsafe visual content. It should be used responsibly:
- Do not rely solely on this model for critical safety decisions
- Be aware of potential biases in safety classifications
- Regularly evaluate model performance on your specific use case
- Combine with human review for important content moderation tasks
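As a minimal sketch of the last point, the snippet below routes an image to human review whenever any unsafe category crosses a threshold instead of acting on the classifier alone. The threshold value and routing logic are illustrative assumptions to be calibrated on your own data.

import torch

REVIEW_THRESHOLD = 0.5  # illustrative value; tune on your own evaluation set

def route_for_review(safety_probs: torch.Tensor, categories: list[str]) -> dict:
    # Index 0 is "Safe"; flag the image if any unsafe category exceeds the threshold.
    flagged = {
        cat: float(p)
        for cat, p in zip(categories[1:], safety_probs[1:])
        if float(p) > REVIEW_THRESHOLD
    }
    return {"needs_human_review": bool(flagged), "flagged_categories": flagged}

# Example, after the inference code in the Usage section:
# decision = route_for_review(outputs.img_safety_probs[0], model.config.safety_categories)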
License
Please refer to LICENSE.md for details.
Citation
If you use SafeQwen2.5-VL in your research, please cite:
@article{lee2025holisafe,
  title={HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model},
  author={Lee, Youngwan and Kim, Kangsan and Park, Kwanyong and Jung, Ilchae and Jang, Soojin and Lee, Seanie and Lee, Yong-Ju and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2506.04704},
  year={2025},
  url={https://arxiv.org/abs/2506.04704},
  archivePrefix={arXiv},
  eprint={2506.04704},
  primaryClass={cs.AI},
}
Acknowledgments
- Built on Qwen2.5-VL by Alibaba Cloud
- Implemented on top of the Qwen-VL-Series-Finetune codebase
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training, 45%), (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration, 45%) and (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), 10%).
Contact
For questions, issues, or feedback, please open an issue on the repository or contact the team directly.
E-mail: [email protected]