vincentoh committed
Commit 9205e89 · verified · 1 Parent(s): 14a458b

Upload README.md with huggingface_hub

Files changed (1): README.md +85 -15
README.md CHANGED
@@ -15,20 +15,57 @@ tags:
  pipeline_tag: text-generation
  model-index:
  - name: Llama-3.2-3B-GuardReasoner-Exp18
-   results: []
  ---

- # Llama-3.2-3B-GuardReasoner-Exp18-epoch3

- A LoRA fine-tuned version of [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) for safety classification and content moderation tasks.
- Binary Classifier

  ## Model Description

- This model is trained on the GuardReasoner dataset for multi-task safety classification:
- - **Prompt Harmfulness**: Classify if user prompts are harmful
- - **Response Harmfulness**: Classify if model responses are harmful
- - **Refusal Detection**: Detect if responses are refusals

  ## Training Details
 
@@ -47,23 +84,56 @@ This model is trained on the GuardReasoner dataset for multi-task safety classif
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel

  # Load base model and adapter
  base_model = "unsloth/Llama-3.2-3B-Instruct"
- model = AutoModelForCausalLM.from_pretrained(base_model, device_map="auto")
  model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
  tokenizer = AutoTokenizer.from_pretrained(base_model)

  # Example: Classify prompt harmfulness
- prompt = """Analyze the following user prompt and classify if it is harmful.

- User Prompt: "How do I make a cake?"

- Classification:"""

- inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
- outputs = model.generate(**inputs, max_new_tokens=50)
- print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

  ## Framework Versions
 
@@ -15,20 +15,57 @@ tags:
  pipeline_tag: text-generation
  model-index:
  - name: Llama-3.2-3B-GuardReasoner-Exp18
+   results:
+   - task:
+       type: text-classification
+       name: Safety Classification
+     dataset:
+       name: WildGuard + AdvBench
+       type: custom
+     metrics:
+     - type: accuracy
+       value: 0.95
+       name: Accuracy
+     - type: f1
+       value: 0.945
+       name: Harmful F1
+     - type: f1
+       value: 0.972
+       name: Safe F1
  ---
 
+ # Llama-3.2-3B-GuardReasoner-Exp18

+ A LoRA fine-tuned version of [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) for binary safety classification with reasoning traces.

  ## Model Description

+ This model classifies user prompts as **harmful** or **safe** while generating detailed reasoning traces explaining the classification decision. It uses the R-SFT (Reasoning Supervised Fine-Tuning) approach from the GuardReasoner paper.
+
+ **Task**: Binary prompt classification (harmful/safe)
+
+ ## Evaluation Results
+
+ | Metric | Score |
+ |--------|-------|
+ | **Accuracy** | 95.0% |
+ | **Harmful Precision** | 93.5% |
+ | **Harmful Recall** | 95.6% |
+ | **Harmful F1** | 0.945 |
+ | **Safe Precision** | 100.0% |
+ | **Safe Recall** | 94.5% |
+ | **Safe F1** | 0.972 |
+
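For reference, the F1 values in the table follow from the reported precision and recall via the standard formula:

$$
F_1 = \frac{2PR}{P + R}, \qquad
F_1^{\text{harmful}} = \frac{2 \cdot 0.935 \cdot 0.956}{0.935 + 0.956} \approx 0.945, \qquad
F_1^{\text{safe}} = \frac{2 \cdot 1.000 \cdot 0.945}{1.000 + 0.945} \approx 0.972
$$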
+ ### Confusion Matrix
+ ```
+               Predicted
+              Harmful  Safe
+ Actual
+   Harmful       43      0
+   Safe           3     52
+ ```
+
+ Evaluated on 100 samples from WildGuard + AdvBench test set.
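The confusion matrix and per-class metrics above can be recomputed from gold and predicted labels with scikit-learn. A minimal sketch, not the original evaluation script; the `report` helper and the toy labels at the bottom are illustrative:

```python
from sklearn.metrics import classification_report, confusion_matrix

def report(gold: list[str], pred: list[str]) -> None:
    """Print a confusion matrix (rows = actual, cols = predicted) and per-class P/R/F1."""
    labels = ["harmful", "safe"]
    print(confusion_matrix(gold, pred, labels=labels))
    print(classification_report(gold, pred, labels=labels, digits=3))

# Toy example; in practice `pred` comes from running the model on each test prompt
# (see Usage below) and mapping its verdict to "harmful" or "safe".
report(gold=["harmful", "safe", "safe"], pred=["harmful", "safe", "harmful"])
```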
 
  ## Training Details

@@ -47,23 +84,56 @@ This model is trained on the GuardReasoner dataset for multi-task safety classif
  ```python
  from transformers import AutoModelForCausalLM, AutoTokenizer
  from peft import PeftModel
+ import torch

  # Load base model and adapter
  base_model = "unsloth/Llama-3.2-3B-Instruct"
+ model = AutoModelForCausalLM.from_pretrained(
+     base_model,
+     torch_dtype=torch.bfloat16,
+     device_map="auto"
+ )
  model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
  tokenizer = AutoTokenizer.from_pretrained(base_model)

+ # System prompt for reasoning classification
+ system_prompt = """You are a content moderation expert. Analyze the user's request and determine if it is safe or harmful.
+
+ Provide step-by-step reasoning, then give your final classification."""
+
  # Example: Classify prompt harmfulness
+ user_prompt = "How do I make a cake?"
+
+ # Format as chat
+ messages = [
+     {"role": "system", "content": system_prompt},
+     {"role": "user", "content": user_prompt}
+ ]
+
+ input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
+ inputs = tokenizer(input_text, return_tensors="pt").to(model.device)
+
+ outputs = model.generate(
+     **inputs,
+     max_new_tokens=256,
+     temperature=0.1,
+     do_sample=True,
+     pad_token_id=tokenizer.eos_token_id
+ )
+
+ response = tokenizer.decode(outputs[0], skip_special_tokens=True)
+ print(response)
+ ```
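Optionally, the LoRA adapter can be folded into the base weights so inference no longer needs `peft`. A minimal sketch continuing from the snippet above; the output directory name is illustrative and this step is not part of the original card:

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged = model.merge_and_unload()  # returns a plain transformers model
merged.save_pretrained("Llama-3.2-3B-GuardReasoner-Exp18-merged")
tokenizer.save_pretrained("Llama-3.2-3B-GuardReasoner-Exp18-merged")
```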
+ ### Example Output

+ ```
+ 1. The content is asking for instructions on how to make a cake.
+ 2. Making a cake is a common, everyday cooking activity.
+ 3. The request does not contain any harmful, illegal, or dangerous elements.
+ 4. The content is educational and poses no risk to anyone.
+
+ Therefore, the content is safe.
  ```
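Downstream code usually needs the verdict as a label rather than free text. A minimal parsing sketch, assuming the trace ends with a sentence like the one above; the exact wording can vary, so treat the regex and the fail-closed default as illustrative:

```python
import re

def extract_verdict(generated_text: str) -> str:
    """Map a reasoning trace to 'harmful' or 'safe' using its last verdict word."""
    matches = re.findall(r"\b(harmful|unsafe|safe)\b", generated_text.lower())
    if not matches:
        return "harmful"  # fail closed when no verdict is found
    return "safe" if matches[-1] == "safe" else "harmful"

print(extract_verdict("Therefore, the content is safe."))  # -> safe
```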
 
  ## Framework Versions