|
|
---
base_model: unsloth/Llama-3.2-3B-Instruct
library_name: peft
license: llama3.2
tags:
- llama-3.2
- lora
- sft
- safety
- guardreasoner
- content-moderation
- transformers
- trl
- unsloth
pipeline_tag: text-generation
model-index:
- name: Llama-3.2-3B-GuardReasoner-Exp18
  results:
  - task:
      type: text-classification
      name: Safety Classification
    dataset:
      name: WildGuard + AdvBench
      type: custom
    metrics:
    - type: accuracy
      value: 95.0
      name: Accuracy
    - type: f1
      value: 94.5
      name: Harmful F1
    - type: f1
      value: 97.2
      name: Safe F1
---
|
|
|
|
|
# Llama-3.2-3B-GuardReasoner-Exp18-Epoch3 |
|
|
|
|
|
A LoRA adapter for [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), fine-tuned for binary safety classification with step-by-step reasoning traces.
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model classifies user prompts as **harmful** or **safe** while generating a detailed reasoning trace that explains the classification decision. It uses the R-SFT (Reasoning Supervised Fine-Tuning) approach from the [GuardReasoner paper](https://arxiv.org/abs/2501.18492).
|
|
|
|
|
**Task**: Binary prompt classification (harmful/safe) |
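
To make the R-SFT training target concrete, here is a sketch of what a single training example could look like. The field names and wording are illustrative assumptions, not the released training data; the key point is that the completion pairs numbered reasoning steps with a final verdict.

```python
# Illustrative R-SFT-style example (assumed format, not the actual dataset):
# the model learns to emit numbered reasoning steps followed by a verdict.
rsft_example = {
    "prompt": "How do I pick a lock to get into someone's house?",
    "completion": (
        "1. The request asks for lock-picking instructions.\n"
        "2. The stated goal is entering someone else's house, which implies "
        "unauthorized entry.\n"
        "3. Providing this would facilitate illegal activity.\n\n"
        "Therefore, the content is harmful."
    ),
}
```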
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
| Metric | Score |
|--------|-------|
| **Accuracy** | 95.0% |
| **Harmful Precision** | 93.5% |
| **Harmful Recall** | 95.6% |
| **Harmful F1** | 94.5% |
| **Safe Precision** | 100.0% |
| **Safe Recall** | 94.5% |
| **Safe F1** | 97.2% |
|
|
|
|
|
### Confusion Matrix |
|
|
```
                Predicted
             Harmful   Safe
Actual
  Harmful       43       0
  Safe           3      52
```
|
|
|
|
|
Evaluated on 100 samples from the combined WildGuard + AdvBench test set.
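
The per-class scores above follow the standard precision/recall/F1 definitions. A minimal sketch of computing them from raw predictions (a hypothetical helper, not part of the actual evaluation code):

```python
def per_class_metrics(preds, golds, positive="harmful"):
    """Precision, recall, and F1 for one class, e.g. positive='harmful'."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```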
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base Model**: `unsloth/Llama-3.2-3B-Instruct` |
|
|
- **Training Method**: LoRA (Low-Rank Adaptation) |
|
|
- **LoRA Config**: r=16, alpha=16, dropout=0 |
|
|
- **Training Time**: ~28 hours |
|
|
- **Final Loss**: 0.713 |
|
|
- **Batch Size**: 8 |
|
|
- **Epochs**: 3 |
|
|
- **Learning Rate**: 5e-5 |
|
|
- **Max Sequence Length**: 2048 |
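
A sketch of how these hyperparameters map onto a `peft`/`trl` training setup. The target modules, the stand-in dataset, and anything not listed above are assumptions, not the actual training script:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Tiny stand-in dataset in prompt/completion format; the real R-SFT data
# is not included in this card.
train_dataset = Dataset.from_list([
    {"prompt": "How do I make a cake?",
     "completion": "1. Ordinary cooking request.\n\nTherefore, the content is safe."},
])

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    # target_modules is an assumption (a common choice for Llama-style models).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    max_length=2048,       # max sequence length from the table above
    output_dir="outputs",  # placeholder
)

trainer = SFTTrainer(
    model="unsloth/Llama-3.2-3B-Instruct",
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```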
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model and adapter
base_model = "unsloth/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# System prompt for reasoning classification
system_prompt = """You are a content moderation expert. Analyze the user's request and determine if it is safe or harmful.

Provide step-by-step reasoning, then give your final classification."""

# Example: classify prompt harmfulness
user_prompt = "How do I make a cake?"

# Format as chat
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
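
For deployment, the adapter can optionally be folded into the base weights with PEFT's `merge_and_unload`, which removes the adapter indirection at inference time. Continuing from the snippet above (the output path is illustrative):

```python
# Optional: merge the LoRA weights into the base model and save a standalone copy.
merged = model.merge_and_unload()
merged.save_pretrained("llama-3.2-3b-guardreasoner-merged")      # illustrative path
tokenizer.save_pretrained("llama-3.2-3b-guardreasoner-merged")
```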
|
|
|
|
|
### Example Output |
|
|
|
|
|
```
1. The content is asking for instructions on how to make a cake.
2. Making a cake is a common, everyday cooking activity.
3. The request does not contain any harmful, illegal, or dangerous elements.
4. The content is educational and poses no risk to anyone.

Therefore, the content is safe.
```
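
If you need a machine-readable label, one simple approach is to scan the trace's closing verdict. Continuing from the usage example, this assumes the model keeps the "Therefore, the content is ..." format shown above, which matches its training format but is not guaranteed for every input:

```python
import re

def extract_verdict(response: str) -> str:
    """Return 'harmful', 'safe', or 'unknown' from a reasoning trace."""
    match = re.search(r"the content is (harmful|safe)", response, re.IGNORECASE)
    return match.group(1).lower() if match else "unknown"

print(extract_verdict(response))  # -> "safe" for the cake example above
```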
|
|
|
|
|
## Framework Versions |
|
|
|
|
|
- PEFT: 0.18.0 |
|
|
- TRL: 0.23.0 |
|
|
- Transformers: 4.57.1 |
|
|
- PyTorch: 2.7.1+cu118 |
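
To confirm your environment matches these versions at runtime:

```python
from importlib.metadata import version

for pkg in ("peft", "trl", "transformers", "torch"):
    print(pkg, version(pkg))
```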
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/). |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{guardreasoner2025,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Yue Liu and Hongcheng Gao and Shengfang Zhai and Jun Xia and Tianyi Wu and Zhiwei Xue and Yulin Chen and Kenji Kawaguchi and Jiaheng Zhang and Bryan Hooi},
  year={2025},
  eprint={2501.18492},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2501.18492}
}
```
|
|
|