---
base_model: unsloth/Llama-3.2-3B-Instruct
library_name: peft
license: llama3.2
tags:
- llama-3.2
- lora
- sft
- safety
- guardreasoner
- content-moderation
- transformers
- trl
- unsloth
pipeline_tag: text-generation
model-index:
- name: Llama-3.2-3B-GuardReasoner-Exp18
  results:
  - task:
      type: text-classification
      name: Safety Classification
    dataset:
      name: WildGuard + AdvBench
      type: custom
    metrics:
    - type: accuracy
      value: 95.0
      name: Accuracy
    - type: f1
      value: 94.5
      name: Harmful F1
    - type: f1
      value: 97.2
      name: Safe F1
---

# Llama-3.2-3B-GuardReasoner-Exp18-Epoch3

A LoRA fine-tuned version of [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) for binary safety classification with reasoning traces.

## Model Description

This model classifies user prompts as **harmful** or **safe** while generating detailed reasoning traces explaining the classification decision. It uses the R-SFT (Reasoning Supervised Fine-Tuning) approach from the GuardReasoner paper.

**Task**: Binary prompt classification (harmful/safe)

## Evaluation Results

| Metric | Score |
|--------|-------|
| **Accuracy** | 95.0% |
| **Harmful Precision** | 93.5% |
| **Harmful Recall** | 95.6% |
| **Harmful F1** | 94.5% |
| **Safe Precision** | 100.0% |
| **Safe Recall** | 94.5% |
| **Safe F1** | 97.2% |

### Confusion Matrix
```
             Predicted
            Harmful  Safe
Actual
Harmful      43       0
Safe          3      52
```

Evaluated on 100 samples from the WildGuard + AdvBench test set.
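
These per-class scores are standard precision/recall/F1 computed from the confusion matrix. A minimal sketch of the arithmetic, using the safe-class counts read off the matrix above:

```python
def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Per-class precision, recall, and F1 from raw confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Safe class from the matrix above: 52 correct safe predictions,
# 0 harmful prompts mislabeled safe, 3 safe prompts flagged harmful.
p, r, f1 = prf(tp=52, fp=0, fn=3)
print(f"precision={p:.1%}  recall={r:.1%}  f1={f1:.1%}")
# precision=100.0%  recall=94.5%  f1=97.2%
```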

## Training Details

- **Base Model**: `unsloth/Llama-3.2-3B-Instruct`
- **Training Method**: LoRA (Low-Rank Adaptation)
- **LoRA Config**: r=16, alpha=16, dropout=0
- **Training Time**: ~28 hours
- **Final Loss**: 0.713
- **Batch Size**: 8
- **Epochs**: 3
- **Learning Rate**: 5e-5
- **Max Sequence Length**: 2048
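
As a reproducibility aid, here is a minimal sketch of how these hyperparameters map onto a `peft` + `trl` setup. The `target_modules` list, `output_dir`, and `train_dataset` are assumptions not stated in this card, and the original run used Unsloth, so the actual code differed:

```python
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Adapter hyperparameters from the table above; target_modules is an
# assumption (the usual attention/MLP projections for Llama models).
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# Trainer hyperparameters from the table; output_dir is hypothetical.
args = SFTConfig(
    output_dir="guardreasoner-exp18",   # hypothetical
    per_device_train_batch_size=8,
    num_train_epochs=3,
    learning_rate=5e-5,
    max_length=2048,                    # max_seq_length in older TRL versions
)

trainer = SFTTrainer(
    model="unsloth/Llama-3.2-3B-Instruct",
    args=args,
    train_dataset=train_dataset,        # hypothetical R-SFT dataset
    peft_config=peft_config,
)
trainer.train()
```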

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model and adapter
base_model = "unsloth/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# System prompt for reasoning classification
system_prompt = """You are a content moderation expert. Analyze the user's request and determine if it is safe or harmful.

Provide step-by-step reasoning, then give your final classification."""

# Example: Classify prompt harmfulness
user_prompt = "How do I make a cake?"

# Format as chat
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt}
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id
)

# Decode only the newly generated tokens, not the echoed prompt
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)
```
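
With `do_sample=True` and a low temperature the output is near-deterministic; for fully greedy, reproducible classification you can pass `do_sample=False` instead. If you want to serve the model without loading the adapter at runtime, the LoRA weights can be folded into the base model (a minimal sketch using peft's `merge_and_unload`; the save path is hypothetical):

```python
# Fold the LoRA weights into the base model for adapter-free inference
merged = model.merge_and_unload()
merged.save_pretrained("Llama-3.2-3B-GuardReasoner-merged")    # hypothetical path
tokenizer.save_pretrained("Llama-3.2-3B-GuardReasoner-merged")
```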

### Example Output

```
1. The content is asking for instructions on how to make a cake.
2. Making a cake is a common, everyday cooking activity.
3. The request does not contain any harmful, illegal, or dangerous elements.
4. The content is educational and poses no risk to anyone.

Therefore, the content is safe.
```
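
Because the trace closes with a verdict sentence, downstream code can map it to a binary label with a simple string check. A minimal sketch, assuming the output always ends with a `Therefore, the content is safe/harmful.` line as above:

```python
def extract_label(response: str) -> str:
    """Recover "harmful"/"safe" from the final verdict line of a trace.

    Assumes the trace ends with a sentence like
    "Therefore, the content is safe." (see the example above).
    """
    verdict = response.strip().splitlines()[-1].lower()
    # Check harmful/unsafe first: "unsafe" contains "safe" as a substring.
    if "harmful" in verdict or "unsafe" in verdict:
        return "harmful"
    if "safe" in verdict:
        return "safe"
    return "unknown"  # trace deviated from the expected pattern

print(extract_label(response))  # -> "safe" for the cake example
```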

## Framework Versions

- PEFT: 0.18.0
- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.7.1+cu118

## License

This model is released under the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/).

## Citation

```bibtex
@misc{guardreasoner2025,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Yue Liu and Hongcheng Gao and Shengfang Zhai and Jun Xia and Tianyi Wu and Zhiwei Xue and Yulin Chen and Kenji Kawaguchi and Jiaheng Zhang and Bryan Hooi},
  year={2025},
  eprint={2501.18492},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2501.18492}
}
```