---
base_model: unsloth/Llama-3.2-3B-Instruct
library_name: peft
license: llama3.2
tags:
- llama-3.2
- lora
- sft
- safety
- guardreasoner
- content-moderation
- transformers
- trl
- unsloth
pipeline_tag: text-generation
model-index:
- name: Llama-3.2-3B-GuardReasoner-Exp18
results:
- task:
type: text-classification
name: Safety Classification
dataset:
name: WildGuard + AdvBench
type: custom
metrics:
- type: accuracy
value: 95.0
name: Accuracy
- type: f1
value: 94.5
name: Harmful F1
- type: f1
value: 97.2
name: Safe F1
---
# Llama-3.2-3B-GuardReasoner-Exp18-Epoch3
A LoRA fine-tuned version of [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) for binary safety classification with reasoning traces.
## Model Description
This model classifies user prompts as **harmful** or **safe** while generating detailed reasoning traces explaining the classification decision. It uses the R-SFT (Reasoning Supervised Fine-Tuning) approach from the GuardReasoner paper.
**Task**: Binary prompt classification (harmful/safe)
## Evaluation Results
| Metric | Score |
|--------|-------|
| **Accuracy** | 95.0% |
| **Harmful Precision** | 93.5% |
| **Harmful Recall** | 95.6% |
| **Harmful F1** | 94.5% |
| **Safe Precision** | 100.0% |
| **Safe Recall** | 94.5% |
| **Safe F1** | 97.2% |
### Confusion Matrix
```
                 Predicted
                 Harmful   Safe
Actual Harmful        43      0
       Safe            3     52
```
Evaluated on 100 samples from the WildGuard + AdvBench test set.
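The per-class numbers above can be reproduced from raw predictions with `scikit-learn`. Below is a minimal sketch, assuming the gold labels and the model's parsed verdicts have already been collected into two lists; the variable names and the tiny example lists are illustrative placeholders, not part of the released evaluation code.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Illustrative placeholders: in practice these come from running the model
# over the held-out WildGuard + AdvBench samples and parsing its verdicts.
y_true = ["harmful", "harmful", "harmful", "safe", "safe", "safe"]
y_pred = ["harmful", "harmful", "safe", "safe", "safe", "safe"]

# Per-class precision, recall, and F1, matching the table layout above
print(classification_report(y_true, y_pred, labels=["harmful", "safe"], digits=3))

# Confusion matrix: rows = actual class, columns = predicted class
print(confusion_matrix(y_true, y_pred, labels=["harmful", "safe"]))
```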
## Training Details
- **Base Model**: `unsloth/Llama-3.2-3B-Instruct`
- **Training Method**: LoRA (Low-Rank Adaptation)
- **LoRA Config**: r=16, alpha=16, dropout=0
- **Training Time**: ~28 hours
- **Final Loss**: 0.713
- **Batch Size**: 8
- **Epochs**: 3
- **Learning Rate**: 5e-5
- **Max Sequence Length**: 2048
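The hyperparameters above map onto a PEFT/TRL setup roughly as sketched below. This is a hedged reconstruction rather than the exact training script: `target_modules`, the output directory, and any argument not listed in the bullets are assumptions, and the name of the sequence-length argument differs across TRL versions.

```python
from peft import LoraConfig
from trl import SFTConfig

# LoRA settings reported above; target_modules is an assumption
# (the usual attention/MLP projections for Llama-style models).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Trainer-level settings reported above; everything else left at defaults.
# Older TRL releases call the sequence-length argument `max_seq_length`.
sft_config = SFTConfig(
    output_dir="outputs",
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    max_length=2048,
    bf16=True,
)
```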
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch
# Load base model and adapter
base_model = "unsloth/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
base_model,
torch_dtype=torch.bfloat16,
device_map="auto"
)
model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
tokenizer = AutoTokenizer.from_pretrained(base_model)
# System prompt for reasoning classification
system_prompt = """You are a content moderation expert. Analyze the user's request and determine if it is safe or harmful.
Provide step-by-step reasoning, then give your final classification."""
# Example: Classify prompt harmfulness
user_prompt = "How do I make a cake?"
# Format as chat
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": user_prompt}
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
outputs = model.generate(
**inputs,
max_new_tokens=256,
temperature=0.1,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
# Decode only the newly generated tokens (otherwise the prompt is echoed back)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
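For deployment without the `peft` dependency, the adapter can be folded into the base weights. A minimal sketch continuing from the `model` object loaded above (the output directory name is just an example):

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llama-3.2-3b-guardreasoner-merged")
tokenizer.save_pretrained("llama-3.2-3b-guardreasoner-merged")
```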
### Example Output
```
1. The content is asking for instructions on how to make a cake.
2. Making a cake is a common, everyday cooking activity.
3. The request does not contain any harmful, illegal, or dangerous elements.
4. The content is educational and poses no risk to anyone.
Therefore, the content is safe.
```
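Downstream moderation pipelines usually need a hard label rather than the full trace. One simple heuristic, assuming the trace ends with a conclusion like the one above, is to take the last occurrence of either keyword; treat this as a sketch, not a robust parser for every possible output.

```python
import re

def extract_verdict(reasoning: str) -> str:
    """Return 'harmful', 'safe', or 'unknown' from a generated reasoning trace."""
    # The conclusion ("Therefore, the content is safe.") tends to be the last
    # place either keyword appears, so take the final match.
    matches = re.findall(r"\b(harmful|safe)\b", reasoning.lower())
    return matches[-1] if matches else "unknown"

print(extract_verdict("Therefore, the content is safe."))  # -> safe
```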
## Framework Versions
- PEFT: 0.18.0
- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.7.1+cu118
## License
This model is released under the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/).
## Citation
```bibtex
@misc{guardreasoner2024,
title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
author={Yue Liu and Nilay Pochhi and Zhaorun Chen and Hanjie Chen},
year={2024},
url={https://github.com/yueliuofficial/GuardReasoner}
}
```