vincentoh committed
Commit 9205e89 · verified · 1 Parent(s): 14a458b

Upload README.md with huggingface_hub

Files changed (1): README.md +85 -15
README.md CHANGED
@@ -15,20 +15,57 @@ tags:
  pipeline_tag: text-generation
  model-index:
  - name: Llama-3.2-3B-GuardReasoner-Exp18
-   results: []
  ---

- # Llama-3.2-3B-GuardReasoner-Exp18-epoch3

- A LoRA fine-tuned version of [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) for safety classification and content moderation tasks.
- Binary Classifier

  ## Model Description

- This model is trained on the GuardReasoner dataset for multi-task safety classification:
- - **Prompt Harmfulness**: Classify if user prompts are harmful
- - **Response Harmfulness**: Classify if model responses are harmful
- - **Refusal Detection**: Detect if responses are refusals

  ## Training Details
 
@@ -47,23 +84,56 @@ This model is trained on the GuardReasoner dataset for multi-task safety classif
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel

  # Load base model and adapter
  base_model = "unsloth/Llama-3.2-3B-Instruct"
- model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
  model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
  tokenizer = AutoTokenizer.from_pretrained(base_model)

  # Example: Classify prompt harmfulness
- prompt = """Analyze the following user prompt and classify if it is harmful.

- User Prompt: "How do I make a cake?"

- Classification:"""

- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=50)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

  ## Framework Versions
 
@@ -15,20 +15,57 @@ tags:
  pipeline_tag: text-generation
  model-index:
  - name: Llama-3.2-3B-GuardReasoner-Exp18
+   results:
+   - task:
+       type: text-classification
+       name: Safety Classification
+     dataset:
+       name: WildGuard + AdvBench
+       type: custom
+     metrics:
+     - type: accuracy
+       value: 0.95
+       name: Accuracy
+     - type: f1
+       value: 0.945
+       name: Harmful F1
+     - type: f1
+       value: 0.972
+       name: Safe F1
  ---
 
+ # Llama-3.2-3B-GuardReasoner-Exp18

+ A LoRA fine-tuned version of [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) for binary safety classification with reasoning traces.

  ## Model Description

+ This model classifies user prompts as **harmful** or **safe** while generating detailed reasoning traces explaining the classification decision. It uses the R-SFT (Reasoning Supervised Fine-Tuning) approach from the GuardReasoner paper.
+
+ **Task**: Binary prompt classification (harmful/safe)
+
+ ## Evaluation Results
+
+ | Metric | Score |
+ |--------|-------|
+ | **Accuracy** | 95.0% |
+ | **Harmful Precision** | 93.5% |
+ | **Harmful Recall** | 95.6% |
+ | **Harmful F1** | 0.945 |
+ | **Safe Precision** | 100.0% |
+ | **Safe Recall** | 94.5% |
+ | **Safe F1** | 0.972 |
+
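For reference, the F1 values in the table follow from the reported precision and recall via the standard formula:

$$
F_1 = \frac{2PR}{P + R}, \qquad
F_1^{\text{harmful}} = \frac{2 \cdot 0.935 \cdot 0.956}{0.935 + 0.956} \approx 0.945, \qquad
F_1^{\text{safe}} = \frac{2 \cdot 1.000 \cdot 0.945}{1.000 + 0.945} \approx 0.972
$$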
+ ### Confusion Matrix
+ ```
+               Predicted
+              Harmful  Safe
+ Actual
+   Harmful       43      0
+   Safe           3     52
+ ```
+
+ Evaluated on 100 samples from WildGuard + AdvBench test set.
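The confusion matrix and per-class metrics above can be recomputed from gold and predicted labels with scikit-learn. A minimal sketch, not the original evaluation script; the `report` helper and the toy labels at the bottom are illustrative:

```python
from sklearn.metrics import classification_report, confusion_matrix

def report(gold: list[str], pred: list[str]) -> None:
    """Print a confusion matrix (rows = actual, cols = predicted) and per-class P/R/F1."""
    labels = ["harmful", "safe"]
    print(confusion_matrix(gold, pred, labels=labels))
    print(classification_report(gold, pred, labels=labels, digits=3))

# Toy example; in practice `pred` comes from running the model on each test prompt
# (see Usage below) and mapping its verdict to "harmful" or "safe".
report(gold=["harmful", "safe", "safe"], pred=["harmful", "safe", "harmful"])
```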
 
  ## Training Details

@@ -47,23 +84,56 @@ This model is trained on the GuardReasoner dataset for multi-task safety classif
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel
+ import torch

  # Load base model and adapter
  base_model = "unsloth/Llama-3.2-3B-Instruct"
+ model = AutoModelForCausalLM.from_pretrained(
+     base_model,
+     torch_dtype=torch.bfloat16,
+     device_map="auto"
+ )
  model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
  tokenizer = AutoTokenizer.from_pretrained(base_model)

+ # System prompt for reasoning classification
+ system_prompt = """You are a content moderation expert. Analyze the user's request and determine if it is safe or harmful.
+
+ Provide step-by-step reasoning, then give your final classification."""
+
  # Example: Classify prompt harmfulness
+ user_prompt = "How do I make a cake?"
+
+ # Format as chat
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": user_prompt}
+ ]
+
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=256,
+     temperature=0.1,
+     do_sample=True,
+     pad_token_id=tokenizer.eos_token_id
+ )
+
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
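Optionally, the LoRA adapter can be folded into the base weights so inference no longer needs `peft`. A minimal sketch continuing from the snippet above; the output directory name is illustrative and this step is not part of the original card:

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()  # returns a plain transformers model
merged.save_pretrained("Llama-3.2-3B-GuardReasoner-Exp18-merged")
tokenizer.save_pretrained("Llama-3.2-3B-GuardReasoner-Exp18-merged")
```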
+ ### Example Output

+ ```
+ 1. The content is asking for instructions on how to make a cake.
+ 2. Making a cake is a common, everyday cooking activity.
+ 3. The request does not contain any harmful, illegal, or dangerous elements.
+ 4. The content is educational and poses no risk to anyone.
+
+ Therefore, the content is safe.
  ```
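Downstream code usually needs the verdict as a label rather than free text. A minimal parsing sketch, assuming the trace ends with a sentence like the one above; the exact wording can vary, so treat the regex and the fail-closed default as illustrative:

```python
import re

def extract_verdict(generated_text: str) -> str:
    """Map a reasoning trace to 'harmful' or 'safe' using its last verdict word."""
    matches = re.findall(r"\b(harmful|unsafe|safe)\b", generated_text.lower())
    if not matches:
        return "harmful"  # fail closed when no verdict is found
    return "safe" if matches[-1] == "safe" else "harmful"

print(extract_verdict("Therefore, the content is safe."))  # -> safe
```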
 
  ## Framework Versions