SafeGem-12B: Vision-Language Model with Visual Guard Module
SafeGem-12B is a safety-aware multimodal large language model that extends Gemma-3-12B-IT with a built-in image safety classifier. In a single forward pass, it generates text responses to visual questions while classifying potentially unsafe image content across 20 safety categories.
Note on Naming: We named our model 'SafeGem' instead of 'SafeGemma3' to comply with Google's Gemma Terms of Use and trademark policies, abbreviating 'Gemma' to 'Gem' in the name.
Model Description
- Base Model: Gemma-3-12B-IT
- Architecture: Vision-language model with Visual Guard Module (VGM)
- Training Data: HoliSafe train set
- Training Method: LoRA fine-tuning
- Parameters: 12B (base) + VGM
- Safety Categories: 20 categories based on HoliSafe taxonomy
Key Features
- Multimodal Understanding: Processes images and text for comprehensive visual understanding
- Safety Classification: Identifies unsafe content in images across 20 categories
- Non-invasive Architecture: Maintains full Gemma-3 capabilities while adding safety features
- End-to-end Training: VGM is jointly trained with the vision-language model
Safety Categories
The model classifies images into the following 20 safety categories:
| Category ID | Category Name |
|---|---|
| 0 | Safe |
| 1 | Gender discrimination |
| 2 | Race discrimination |
| 3 | Religion discrimination |
| 4 | Harassment |
| 5 | Disability discrimination |
| 6 | Drug-related hazards |
| 7 | Property crime |
| 8 | Facial data exposure |
| 9 | Identity data exposure |
| 10 | Physical self-injury |
| 11 | Suicide |
| 12 | Animal abuse |
| 13 | Obscene gestures |
| 14 | Physical altercation |
| 15 | Terrorism |
| 16 | Weapon-related violence |
| 17 | Sexual content |
| 18 | Financial advice |
| 19 | Medical advice |
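At runtime the label strings are exposed via model.config.safety_categories (used in the Usage section below). For reference, they are expected to follow the snake_case form of the table above; the list below is an illustration, and the authoritative strings come from the model config:
# Illustrative snake_case labels mirroring the table above.
# The authoritative list is model.config.safety_categories;
# exact strings may differ from this sketch.
SAFETY_CATEGORIES = [
    "safe", "gender_discrimination", "race_discrimination",
    "religion_discrimination", "harassment", "disability_discrimination",
    "drug_related_hazards", "property_crime", "facial_data_exposure",
    "identity_data_exposure", "physical_self_injury", "suicide",
    "animal_abuse", "obscene_gestures", "physical_altercation",
    "terrorism", "weapon_related_violence", "sexual_content",
    "financial_advice", "medical_advice",
]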
Installation
pip install transformers torch pillow requests accelerate
Usage
Basic Inference with Safety Classification
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image
import requests
# Load model and processor
model = AutoModel.from_pretrained(
"etri-vilab/SafeGem-12B",
torch_dtype=torch.bfloat16,
device_map="auto",
trust_remote_code=True
)
processor = AutoProcessor.from_pretrained("google/gemma-3-12b-it")
# Prepare input
url = "https://dl.dropbox.com/scl/fi/fkb6g5hame1wnip6983qx/test_guns.png?rlkey=l1rs5s1yg4akr29ife1v9my03&dl=1"
image = Image.open(requests.get(url, stream=True).raw)
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "How to use this?"},
],
}
]
# Process inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
text=[text],
images=[image],
padding=True,
return_tensors="pt",
).to(model.device)
# Forward pass with safety classification
with torch.no_grad():
outputs = model(**inputs, do_safety=True)
# Get safety predictions
safety_probs = outputs.img_safety_probs[0] # [20] probabilities
safety_categories = model.config.safety_categories
print("\nSafety Classification Results:")
for category, prob in zip(safety_categories, safety_probs):
if prob > 0.1: # Show categories with >10% probability
print(f" {category}: {prob:.3f}")
# >>> Safety Classification Results:
# >>> weapon_related_violence: 1.000
# Generate text response
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [
out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("\nModel Response:")
print(output_text[0])
# >>> Model Response:
# >>> I'm sorry, I can't assist with that request. The image provided is considered harmful due to its depiction of a firearm. Providing guidance or information on the use of weapons can be dangerous and is not something I can support. It's important to prioritize safety and adhere to legal regulations regarding firearms. If you have any concerns or questions about safety, please reach out to a qualified professional or local authorities.
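To reduce the per-category probabilities to a single top prediction, take an argmax over the same outputs; a minimal convenience sketch:
# Report only the highest-scoring category
top_idx = int(torch.argmax(safety_probs))
print(f"Top category: {safety_categories[top_idx]} ({safety_probs[top_idx]:.3f})")
# >>> Top category: weapon_related_violence (1.000)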
Text Generation Only (Without Safety Classification)
# Set do_safety=False to skip safety classification during generation
generated_ids = model.generate(**inputs, max_new_tokens=256, do_safety=False)
Model Architecture
SafeGem-12B consists of:
- Base Vision-Language Model: Standard Gemma-3 architecture
- Visual Guard Module (VGM, a.k.a. safety head):
  - Input: Pooled image-token features from the last hidden layer
  - Architecture: Multi-layer perceptron (MLP)
  - Hidden size: 0.5 × model hidden size (1920 for the 12B model)
  - Output: 20-dimensional logits for the safety categories
The VGM operates on pooled image features extracted from the model's hidden states, ensuring minimal interference with the base model's text generation capabilities.
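As a concrete illustration, the head described above can be sketched as follows; the class structure, activation, and mean pooling are assumptions for exposition, not the released implementation:
import torch
import torch.nn as nn

class VisualGuardModule(nn.Module):
    """Sketch of the safety head: pooled image-token features -> MLP ->
    20-way safety logits. Activation and mean pooling are assumptions."""
    def __init__(self, hidden_size: int = 3840, num_categories: int = 20):
        super().__init__()
        inner = hidden_size // 2  # 0.5 x model hidden size (1920 for 12B)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_size, inner),
            nn.GELU(),
            nn.Linear(inner, num_categories),
        )

    def forward(self, image_token_states: torch.Tensor) -> torch.Tensor:
        # image_token_states: [batch, num_image_tokens, hidden_size]
        # from the model's last hidden layer; pool over image tokens.
        pooled = image_token_states.mean(dim=1)
        return self.mlp(pooled)  # [batch, 20] safety logits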
Training Details
- Training Data: HoliSafe train dataset
- Training Epochs: 7
- LoRA Configuration:
  - Rank: 64
  - Alpha: 64
  - Target modules: Language model attention and MLP layers
- Learning Rates:
  - Base model: 5e-5
  - Safety head: 5e-5
  - Vision tower: 5e-5
- Safety Loss Weight: 2.0
- Optimizer: AdamW
- Mixed Precision: BF16
Please see the paper for full training details.
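For reference, a LoRA configuration matching the hyperparameters above might look like the following with the peft library; the target-module names are an assumption based on common Gemma-style layer naming, not taken from the release:
from peft import LoraConfig

# Sketch of a LoRA config matching the settings above; the exact
# target-module names are assumed.
lora_config = LoraConfig(
    r=64,
    lora_alpha=64,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention
        "gate_proj", "up_proj", "down_proj",     # MLP
    ],
    task_type="CAUSAL_LM",
)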
Ethical Considerations
This model is designed to assist in identifying potentially unsafe visual content. It should be used responsibly:
- Do not rely solely on this model for critical safety decisions
- Be aware of potential biases in safety classifications
- Regularly evaluate model performance on your specific use case
- Combine with human review for important content moderation tasks
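As an example of the last point, a simple gate can escalate an image to human review whenever any unsafe category crosses a probability threshold; the 0.5 default below is illustrative and should be tuned on your own data:
def needs_human_review(safety_probs, categories, threshold=0.5):
    """Return the unsafe categories (index 0 is Safe) whose probability
    exceeds the threshold; an empty list means no escalation is needed.
    The 0.5 default is illustrative; tune it on your own evaluation data."""
    return [
        (categories[i], float(p))
        for i, p in enumerate(safety_probs)
        if i > 0 and p > threshold
    ]

# With the outputs from the usage example above:
# flagged = needs_human_review(safety_probs, safety_categories)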
License
SafeGem is governed by a hybrid license model:
Independently Developed Code (Visual Guard Module): Licensed under Apache License 2.0
- All original source code developed by ETRI, including the Visual Guard Module (VGM)
Gemma-Based Components and Entire Model: Subject to Google's Gemma Terms of Use
- The entire SafeGem model, including weights derived from Google Gemma-3-12B-IT
Model Composition: SafeGem is a derivative work based on Google's Gemma-3-12B-IT model, integrating an independently developed Visual Guard Module (VGM) to classify harmful image inputs and generate safe text responses.
For complete license details, please see the LICENSE.md file in this repository.
Citation
If you use SafeGem in your research, please cite:
@article{lee2025holisafe,
  title={HoliSafe: Holistic Safety Benchmarking and Modeling for Vision-Language Model},
  author={Lee, Youngwan and Kim, Kangsan and Park, Kwanyong and Jung, Ilchae and Jang, Soojin and Lee, Seanie and Lee, Yong-Ju and Hwang, Sung Ju},
  journal={arXiv preprint arXiv:2506.04704},
  year={2025},
  url={https://arxiv.org/abs/2506.04704},
  archivePrefix={arXiv},
  eprint={2506.04704},
  primaryClass={cs.AI},
}
Acknowledgments
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. RS-2022-00187238, Development of Large Korean Language Model Technology for Efficient Pre-training, 45%), (No. 2022-0-00871, Development of AI Autonomy and Knowledge Enhancement for AI Agent Collaboration, 45%) and (No.2019-0-00075, Artificial Intelligence Graduate School Program(KAIST), 10%).
Contact
For questions, issues, or feedback, please open an issue on the repository or contact the team directly.
📬 E-mail: [email protected]