SafeGem-12B: Vision-Language Model with Visual Guard Module
SafeGem-12B is a safety-aware multimodal large language model that extends Gemma-3-12B-IT with a built-in image safety classifier. In a single forward pass, it generates text responses to visual questions while classifying potentially unsafe image content across 20 safety categories.
Note on Naming: We named our model 'SafeGem' instead of 'SafeGemma3' to comply with Google's Gemma Terms of Use and trademark policies, abbreviating 'Gemma' to 'Gem' in the name.
Model Description
- Base Model: Gemma-3-12B-IT
- Architecture: Vision-language model with Visual Guard Module (VGM)
- Training Data: HoliSafe train set
- Training Method: LoRA fine-tuning
- Parameters: 12B (base) + VGM
- Safety Categories: 20 categories based on HoliSafe taxonomy
Key Features
- Multimodal Understanding: Processes images and text for comprehensive visual understanding
- Safety Classification: Identifies unsafe content in images across 20 categories
- Non-invasive Architecture: Maintains full Gemma-3 capabilities while adding safety features
- End-to-end Training: VGM is jointly trained with the vision-language model
Safety Categories
The model classifies images into the following 20 safety categories:
| Category ID | Category Name |
|---|---|
| 0 | Safe |
| 1 | Gender discrimination |
| 2 | Race discrimination |
| 3 | Religion discrimination |
| 4 | Harassment |
| 5 | Disability discrimination |
| 6 | Drug-related hazards |
| 7 | Property crime |
| 8 | Facial data exposure |
| 9 | Identity data exposure |
| 10 | Physical self-injury |
| 11 | Suicide |
| 12 | Animal abuse |
| 13 | Obscene gestures |
| 14 | Physical altercation |
| 15 | Terrorism |
| 16 | Weapon-related violence |
| 17 | Sexual content |
| 18 | Financial advice |
| 19 | Medical advice |
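At runtime the label strings are exposed via model.config.safety_categories (used in the Usage section below). For reference, they are expected to follow the snake_case form of the table above; the list below is an illustration, and the authoritative strings come from the model config:
# Illustrative snake_case labels mirroring the table above.
# The authoritative list is model.config.safety_categories;
# exact strings may differ from this sketch.
SAFETY_CATEGORIES = [
    "safe", "gender_discrimination", "race_discrimination",
    "religion_discrimination", "harassment", "disability_discrimination",
    "drug_related_hazards", "property_crime", "facial_data_exposure",
    "identity_data_exposure", "physical_self_injury", "suicide",
    "animal_abuse", "obscene_gestures", "physical_altercation",
    "terrorism", "weapon_related_violence", "sexual_content",
    "financial_advice", "medical_advice",
]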
Installation
pip install transformers torch pillow requests accelerate
Usage
Basic Inference with Safety Classification
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
# Load model and processor
model = AutoModel.from_pretrained(
"etri-vilab/SafeGem-12B",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("google/gemma-3-12b-it")
# Prepare input
url = "https://dl.dropbox.com/scl/fi/fkb6g5hame1wnip6983qx/test_guns.png?rlkey=l1rs5s1yg4akr29ife1v9my03&dl=1"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "How to use this?"},
],
}
]
# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
text=[text],
images=[image],
padding=True,
return_tensors="pt",
).to(model.device)
# Forward pass with safety classification
with torch.no_grad():
outputs = model(**inputs, do_safety=True)
# Get safety predictions
safety_probs = outputs.img_safety_probs[0] # [20] probabilities
safety_categories = model.config.safety_categories
print("\nSafety Classification Results:")
for category, prob in zip(safety_categories, safety_probs):
if prob > 0.1: # Show categories with >10% probability
print(f" {category}: {prob:.3f}")
# >>> Safety Classification Results:
# >>> weapon_related_violence: 1.000
# Generate text response
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("\nModel Response:")
print(output_text[0])
# >>> Model Response:
# >>> I'm sorry, I can't assist with that request. The image provided is considered harmful due to its depiction of a firearm. Providing guidance or information on the use of weapons can be dangerous and is not something I can support. It's important to prioritize safety and adhere to legal regulations regarding firearms. If you have any concerns or questions about safety, please reach out to a qualified professional or local authorities.
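To reduce the per-category probabilities to a single top prediction, take an argmax over the same outputs; a minimal convenience sketch:
# Report only the highest-scoring category
top_idx = int(torch.argmax(safety_probs))
print(f"Top category: {safety_categories[top_idx]} ({safety_probs[top_idx]:.3f})")
# >>> Top category: weapon_related_violence (1.000)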
Text Generation Only (Without Safety Classification)
# Set do_safety=False to skip safety classification during generation
generated_ids = model.generate(**inputs, max_new_tokens=256, do_safety=False)
Model Architecture
SafeGem-12B consists of:
- Base Vision-Language Model: Standard Gemma-3 architecture
- Visual Guard Module (VGM, a.k.a. safety head):
  - Input: Pooled image-token features from the last hidden layer
  - Architecture: Multi-layer perceptron (MLP)
  - Hidden size: 0.5 × model hidden size (1920 for the 12B model)
  - Output: 20-dimensional logits for the safety categories
The VGM operates on pooled image features extracted from the model's hidden states, ensuring minimal interference with the base model's text generation capabilities.
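As a concrete illustration, the head described above can be sketched as follows; the class structure, activation, and mean pooling are assumptions for exposition, not the released implementation:
import torch
import torch.nn as nn

class VisualGuardModule(nn.Module):
    """Sketch of the safety head: pooled image-token features -> MLP ->
    20-way safety logits. Activation and mean pooling are assumptions."""
    def __init__(self, hidden_size: int = 3840, num_categories: int = 20):
        super().__init__()
        inner = hidden_size // 2  # 0.5 x model hidden size (1920 for 12B)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, inner),
            nn.GELU(),
            nn.Linear(inner, num_categories),
        )

    def forward(self, image_token_states: torch.Tensor) -> torch.Tensor:
        # image_token_states: [batch, num_image_tokens, hidden_size]
        # from the model's last hidden layer; pool over image tokens.
        pooled = image_token_states.mean(dim=1)
        return self.mlp(pooled)  # [batch, 20] safety logits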
Training Details
- Training Data: HoliSafe train dataset
- Training Epochs: 7
- LoRA Configuration:
  - Rank: 64
  - Alpha: 64
  - Target modules: Language model attention and MLP layers
- Learning Rates:
  - Base model: 5e-5
  - Safety head: 5e-5
  - Vision tower: 5e-5
- Safety Loss Weight: 2.0
- Optimizer: AdamW
- Mixed Precision: BF16
Please see the paper for full training details.
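For reference, a LoRA configuration matching the hyperparameters above might look like the following with the peft library; the target-module names are an assumption based on common Gemma-style layer naming, not taken from the release:
from peft import LoraConfig

# Sketch of a LoRA config matching the settings above; the exact
# target-module names are assumed.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    task_type="CAUSAL_LM",
)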
Ethical Considerations
This model is designed to assist in identifying potentially unsafe visual content. It should be used responsibly:
- Do not rely solely on this model for critical safety decisions
- Be aware of potential biases in safety classifications
- Regularly evaluate model performance on your specific use case
- Combine with human review for important content moderation tasks
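As an example of the last point, a simple gate can escalate an image to human review whenever any unsafe category crosses a probability threshold; the 0.5 default below is illustrative and should be tuned on your own data:
def needs_human_review(safety_probs, categories, threshold=0.5):
    """Return the unsafe categories (index 0 is Safe) whose probability
    exceeds the threshold; an empty list means no escalation is needed.
    The 0.5 default is illustrative; tune it on your own evaluation data."""
    return [
        (categories[i], float(p))
        for i, p in enumerate(safety_probs)
        if i > 0 and p > threshold
    ]

# With the outputs from the usage example above:
# flagged = needs_human_review(safety_probs, safety_categories)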
License
SafeGem is governed by a hybrid license model:
Independently Developed Code (Visual Guard Module): Licensed under Apache License 2.0
- All original source code developed by ETRI, including the Visual Guard Module (VGM)
Gemma-Based Components and Entire Model: Subject to Google's Gemma Terms of Use
- The entire SafeGem model, including weights derived from Google Gemma-3-12B-IT
Model Composition: SafeGem is a derivative work based on Google's Gemma-3-12B-IT model, integrating an independently developed Visual Guard Module (VGM) to classify harmful image inputs and generate safe text responses.
For complete license details, please see the LICENSE.md file in this repository.
Citation
If you use SafeGem in your research, please cite:
@article{lee2025holisafe,
  title={HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model},
  author={Lee, Youngwan and Kim, Kangsan and Park, Kwanyong and Jung, Ilchae and Jang, Soojin and Lee, Seanie and Lee, Yong-Ju and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2506.04704},
  year={2025},
  url={https://arxiv.org/abs/2506.04704},
  archivePrefix={arXiv},
  eprint={2506.04704},
  primaryClass={cs.AI},
}
Acknowledgments
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training, 45%), (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration, 45%) and (No.2019-0-00075, Artificial Intelligence Graduate School Program(KAIST), 10%).
Contact
For questions, issues, or feedback, please open an issue on the repository or contact the team directly.
📬 E-mail: [email protected]