SafeQwen2.5-VL-7B: Vision-Language Model with Visual Guard Module

🌐 Website | 📑 Paper

SafeQwen2.5-VL-7B is a safety-tuned multimodal large language model that extends Qwen2.5-VL-7B-Instruct with built-in image safety classification. It generates text responses to visual questions while simultaneously classifying potentially unsafe image content across 20 safety categories.

Model Description

  • Base Model: Qwen2.5-VL-7B-Instruct
  • Architecture: Vision-language model with Visual Guard Module (VGM)
  • Training Data: HoliSafe train set
  • Training Method: LoRA fine-tuning
  • Parameters: 7B (base) + 6.46M (VGM)
  • Safety Categories: 20 categories based on HoliSafe taxonomy

Key Features

  1. Multimodal Understanding: Processes images and text for comprehensive visual understanding
  2. Safety Classification: Identifies unsafe content in images across 20 categories
  3. Non-invasive Architecture: Maintains full Qwen2.5-VL capabilities while adding safety features
  4. End-to-end Training: VGM is jointly trained with the vision-language model

The model classifies images into the following 20 safety categories:

Category ID | Category Name
----------- | --------------------------
0           | Safe
1           | Gender discrimination
2           | Race discrimination
3           | Religion discrimination
4           | Harassment
5           | Disability discrimination
6           | Drug Related Hazards
7           | Property crime
8           | Facial data exposure
9           | Identity data exposure
10          | Physical self-injury
11          | Suicide
12          | Animal abuse
13          | Obscene gestures
14          | Physical altercation
15          | Terrorism
16          | Weapon-related violence
17          | Sexual content
18          | Financial advice
19          | Medical advice
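
At runtime, the same ID-to-name mapping is exposed on the model config (as used in the usage example below), so you can inspect it directly once the model is loaded; note that the runtime label strings use lowercase underscore form (e.g., weapon_related_violence):

# Inspect the ID -> name mapping shipped with the model config
safety_categories = model.config.safety_categories  # list of 20 label strings
for idx, name in enumerate(safety_categories):
    print(idx, name)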

Installation

pip install transformers torch accelerate qwen-vl-utils

Usage

Basic Inference with Safety Classification

import torch
from transformers import AutoModelForVision2Seq, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor
model = AutoModelForVision2Seq.from_pretrained(
    "etri-vilab/SafeQwen2.5-VL-7B",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Prepare input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://dl.dropbox.com/scl/fi/fkb6g5hame1wnip6983qx/test_guns.png?rlkey=l1rs5s1yg4akr29ife1v9my03&dl=1"},
            {"type": "text", "text": "How to use this?"},
        ],
    }
]

# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Forward pass with safety classification
with torch.no_grad():
    outputs = model(**inputs, do_safety=True)

# Get safety predictions
safety_probs = outputs.img_safety_probs[0]  # [20] probabilities
safety_categories = model.config.safety_categories

print("\nSafety Classification Results:")
for category, prob in zip(safety_categories, safety_probs):
    if prob > 0.1:  # Show categories with >10% probability
        print(f"  {category}: {prob:.3f}")

# >>> Safety Classification Results:
# >>> weapon_related_violence: 0.997

# Generate text response
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)

print("\nModel Response:")
print(output_text[0])

# >>> Model Response:
# >>> I'm sorry, I can't assist with that request. The image depicts a firearm, which is inherently dangerous and requires careful handling by trained professionals. Discussing or providing guidance on how to use such weapons can be extremely hazardous and is not appropriate. It's important to prioritize safety and adhere to legal regulations regarding firearms. If you have questions about firearm safety or need assistance, please consult a certified professional or relevant authorities.
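
If you only need the single most likely label rather than the thresholded list, take the argmax over the same probabilities (continuing from the variables defined above):

# Top-1 safety category, continuing from safety_probs / safety_categories above
top_idx = int(torch.argmax(safety_probs))
print(f"Top category: {safety_categories[top_idx]} ({safety_probs[top_idx].item():.3f})")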

Text Generation Only (Without Safety Classification)

# Set do_safety=False to skip safety classification during generation
generated_ids = model.generate(**inputs, max_new_tokens=256, do_safety=False)
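
A common pattern is to gate generation on the classifier's output and refuse up front when the image is flagged. The sketch below simply combines the two calls shown above; the 0.5 threshold and the assumption that category 0 is labeled "safe" are illustrative choices, not official recommendations.

def guarded_generate(model, processor, inputs, threshold=0.5, max_new_tokens=256):
    """Minimal sketch: run the Visual Guard Module first, then generate only if the
    image is not flagged. The threshold and the "safe" label string are assumptions."""
    with torch.no_grad():
        outputs = model(**inputs, do_safety=True)
    probs = outputs.img_safety_probs[0]
    top_idx = int(torch.argmax(probs))
    category = model.config.safety_categories[top_idx]
    if category != "safe" and probs[top_idx].item() > threshold:
        return f"Request declined: image flagged as '{category}' ({probs[top_idx].item():.2f})."
    generated_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_safety=False)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    return processor.batch_decode(trimmed, skip_special_tokens=True)[0]

print(guarded_generate(model, processor, inputs))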

Model Architecture

SafeQwen2.5-VL consists of:

  1. Base Vision-Language Model: Standard Qwen2.5-VL architecture
  2. Visual Guard Module (a.k.a. safety head):
    • Input: Pooled image-token features from the last hidden layer
    • Architecture: Multi-layer perceptron (MLP)
    • Hidden size: 0.5 × model hidden size (1792 for the 7B model)
    • Output: 20-dimensional logits for safety categories

The VGM operates on pooled image features extracted from the model's hidden states, ensuring minimal interference with the base model's text generation capabilities.
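
As a rough illustration of the head's shape (not the released implementation), the VGM can be pictured as a small MLP over mean-pooled image-token features. The layer sizes below follow the 0.5 × hidden-size rule above and the 7B model's 3584-dim hidden states; the activation choice is an assumption. A two-layer MLP of this shape has roughly 6.46M parameters, matching the figure quoted earlier.

import torch.nn as nn

class VisualGuardHead(nn.Module):
    """Conceptual sketch of the VGM safety head: pooled image features -> 20 logits."""
    def __init__(self, hidden_size=3584, num_categories=20):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, hidden_size // 2),    # 3584 -> 1792
            nn.GELU(),                                   # activation is an assumption
            nn.Linear(hidden_size // 2, num_categories), # 1792 -> 20
        )

    def forward(self, image_hidden_states):  # [num_image_tokens, hidden_size]
        pooled = image_hidden_states.mean(dim=0)  # mean-pool over image tokens
        return self.mlp(pooled)                   # [num_categories] logits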

Training Details

  • Training Data: HoliSafe train dataset
  • Training Epochs: 5
  • LoRA Configuration:
    • Rank: 64
    • Alpha: 64
    • Target modules: Language model attention and MLP layers
  • Learning Rates:
    • Base model: 1e-5
    • Safety head: 1e-5
  • Batch Size: Effective batch size reached via gradient accumulation (see the paper for exact values)
  • Optimizer: AdamW
  • Mixed Precision: FP16

Please see the paper for full training details.
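
For reference, the LoRA settings above roughly correspond to a peft configuration like the following sketch; the target-module names and dropout value are assumptions, since this card only states "attention and MLP layers":

from peft import LoraConfig

lora_config = LoraConfig(
    r=64,                  # LoRA rank, as listed above
    lora_alpha=64,         # LoRA alpha, as listed above
    target_modules=[       # attention + MLP projections (exact names assumed)
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0.0,      # illustrative; not stated in this card
    task_type="CAUSAL_LM",
)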

Device Handling

When using device_map="auto", always ensure inputs are moved to the model's device:

# ✓ Correct - move inputs to model device
inputs = processor(...).to(model.device)
outputs = model(**inputs, do_safety=True)

# ✗ Incorrect - may cause device mismatch errors
inputs = processor(...)  # inputs on CPU
outputs = model(**inputs, do_safety=True)  # model on GPU

This is especially important when using safety classification (do_safety=True), as the model needs to access input_ids on the same device as the hidden states.
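
A one-line sanity check before the forward pass can surface this early (illustrative only):

# Illustrative check: inputs should sit on the model's (first) device
assert inputs.input_ids.device == model.device, "Call inputs.to(model.device) before the forward pass"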

Ethical Considerations

This model is designed to assist in identifying potentially unsafe visual content. It should be used responsibly:

  • Do not rely solely on this model for critical safety decisions
  • Be aware of potential biases in safety classifications
  • Regularly evaluate model performance on your specific use case
  • Combine with human review for important content moderation tasks

License

Please refer to LICENSE.md for details.

Citation

If you use SafeQwen2.5-VL in your research, please cite:

@article{lee2025holisafe,
  title={HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model},
  author={Lee, Youngwan and Kim, Kangsan and Park, Kwanyong and Jung, Ilchae and Jang, Soojin and Lee, Seanie and Lee, Yong-Ju and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2506.04704},
  year={2025},
  url={https://arxiv.org/abs/2506.04704},
  archivePrefix={arXiv},
  eprint={2506.04704},
  primaryClass={cs.AI},
}

Acknowledgments

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training, 45%), (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration, 45%), and (No. 2019-0-00075, Artificial Intelligence Graduate School Program (KAIST), 10%).

Contact

For questions, issues, or feedback, please open an issue on the repository or contact the team directly.

📬 E-mail: [email protected]
