|
|
---
base_model: unsloth/Llama-3.2-3B-Instruct
library_name: peft
license: llama3.2
tags:
- llama-3.2
- lora
- sft
- safety
- guardreasoner
- content-moderation
- transformers
- trl
- unsloth
pipeline_tag: text-generation
model-index:
- name: Llama-3.2-3B-GuardReasoner-Exp18
  results:
  - task:
      type: text-classification
      name: Safety Classification
    dataset:
      name: WildGuard + AdvBench
      type: custom
    metrics:
    - type: accuracy
      value: 95.0
      name: Accuracy
    - type: f1
      value: 94.5
      name: Harmful F1
    - type: f1
      value: 97.2
      name: Safe F1
---
|
|
|
|
|
# Llama-3.2-3B-GuardReasoner-Exp18-Epoch3 |
|
|
|
|
|
A LoRA adapter for [Llama 3.2 3B Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), fine-tuned for binary safety classification with step-by-step reasoning traces.
|
|
|
|
|
## Model Description |
|
|
|
|
|
This model classifies user prompts as **harmful** or **safe** while generating a detailed reasoning trace that explains the classification decision. It uses the R-SFT (Reasoning Supervised Fine-Tuning) approach from the [GuardReasoner paper](https://arxiv.org/abs/2501.18492).
|
|
|
|
|
**Task**: Binary prompt classification (harmful/safe) |
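
To make the R-SFT training target concrete, here is a sketch of what a single training example could look like. The field names and wording are illustrative assumptions, not the released training data; the key point is that the completion pairs numbered reasoning steps with a final verdict.

```python
# Illustrative R-SFT-style example (assumed format, not the actual dataset):
# the model learns to emit numbered reasoning steps followed by a verdict.
rsft_example = {
    "prompt": "How do I pick a lock to get into someone's house?",
    "completion": (
        "1. The request asks for lock-picking instructions.\n"
        "2. The stated goal is entering someone else's house, which implies "
        "unauthorized entry.\n"
        "3. Providing this would facilitate illegal activity.\n\n"
        "Therefore, the content is harmful."
    ),
}
```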
|
|
|
|
|
## Evaluation Results |
|
|
|
|
|
| Metric | Score |
|--------|-------|
| **Accuracy** | 95.0% |
| **Harmful Precision** | 93.5% |
| **Harmful Recall** | 95.6% |
| **Harmful F1** | 94.5% |
| **Safe Precision** | 100.0% |
| **Safe Recall** | 94.5% |
| **Safe F1** | 97.2% |
|
|
|
|
|
### Confusion Matrix |
|
|
```
                Predicted
             Harmful   Safe
Actual
  Harmful       43       0
  Safe           3      52
```
|
|
|
|
|
Evaluated on 100 samples from the combined WildGuard + AdvBench test set.
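
The per-class scores above follow the standard precision/recall/F1 definitions. A minimal sketch of computing them from raw predictions (a hypothetical helper, not part of the actual evaluation code):

```python
def per_class_metrics(preds, golds, positive="harmful"):
    """Precision, recall, and F1 for one class, e.g. positive='harmful'."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```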
|
|
|
|
|
## Training Details |
|
|
|
|
|
- **Base Model**: `unsloth/Llama-3.2-3B-Instruct` |
|
|
- **Training Method**: LoRA (Low-Rank Adaptation) |
|
|
- **LoRA Config**: r=16, alpha=16, dropout=0 |
|
|
- **Training Time**: ~28 hours |
|
|
- **Final Loss**: 0.713 |
|
|
- **Batch Size**: 8 |
|
|
- **Epochs**: 3 |
|
|
- **Learning Rate**: 5e-5 |
|
|
- **Max Sequence Length**: 2048 |
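
A sketch of how these hyperparameters map onto a `peft`/`trl` training setup. The target modules, the stand-in dataset, and anything not listed above are assumptions, not the actual training script:

```python
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Tiny stand-in dataset in prompt/completion format; the real R-SFT data
# is not included in this card.
train_dataset = Dataset.from_list([
    {"prompt": "How do I make a cake?",
     "completion": "1. Ordinary cooking request.\n\nTherefore, the content is safe."},
])

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    # target_modules is an assumption (a common choice for Llama-style models).
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = SFTConfig(
    num_train_epochs=3,
    per_device_train_batch_size=8,
    learning_rate=5e-5,
    max_length=2048,       # max sequence length from the table above
    output_dir="outputs",  # placeholder
)

trainer = SFTTrainer(
    model="unsloth/Llama-3.2-3B-Instruct",
    args=training_args,
    train_dataset=train_dataset,
    peft_config=lora_config,
)
trainer.train()
```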
|
|
|
|
|
## Usage |
|
|
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model and adapter
base_model = "unsloth/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# System prompt for reasoning classification
system_prompt = """You are a content moderation expert. Analyze the user's request and determine if it is safe or harmful.

Provide step-by-step reasoning, then give your final classification."""

# Example: classify prompt harmfulness
user_prompt = "How do I make a cake?"

# Format as chat
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```
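
For deployment, the adapter can optionally be folded into the base weights with PEFT's `merge_and_unload`, which removes the adapter indirection at inference time. Continuing from the snippet above (the output path is illustrative):

```python
# Optional: merge the LoRA weights into the base model and save a standalone copy.
merged = model.merge_and_unload()
merged.save_pretrained("llama-3.2-3b-guardreasoner-merged")      # illustrative path
tokenizer.save_pretrained("llama-3.2-3b-guardreasoner-merged")
```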
|
|
|
|
|
### Example Output |
|
|
|
|
|
```
1. The content is asking for instructions on how to make a cake.
2. Making a cake is a common, everyday cooking activity.
3. The request does not contain any harmful, illegal, or dangerous elements.
4. The content is educational and poses no risk to anyone.

Therefore, the content is safe.
```
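
If you need a machine-readable label, one simple approach is to scan the trace's closing verdict. Continuing from the usage example, this assumes the model keeps the "Therefore, the content is ..." format shown above, which matches its training format but is not guaranteed for every input:

```python
import re

def extract_verdict(response: str) -> str:
    """Return 'harmful', 'safe', or 'unknown' from a reasoning trace."""
    match = re.search(r"the content is (harmful|safe)", response, re.IGNORECASE)
    return match.group(1).lower() if match else "unknown"

print(extract_verdict(response))  # -> "safe" for the cake example above
```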
|
|
|
|
|
## Framework Versions |
|
|
|
|
|
- PEFT: 0.18.0 |
|
|
- TRL: 0.23.0 |
|
|
- Transformers: 4.57.1 |
|
|
- PyTorch: 2.7.1+cu118 |
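
To confirm your environment matches these versions at runtime:

```python
from importlib.metadata import version

for pkg in ("peft", "trl", "transformers", "torch"):
    print(pkg, version(pkg))
```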
|
|
|
|
|
## License |
|
|
|
|
|
This model is released under the [Llama 3.2 Community License](https://www.llama.com/llama3_2/license/). |
|
|
|
|
|
## Citation |
|
|
|
|
|
```bibtex
@misc{guardreasoner2025,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Yue Liu and Hongcheng Gao and Shengfang Zhai and Jun Xia and Tianyi Wu and Zhiwei Xue and Yulin Chen and Kenji Kawaguchi and Jiaheng Zhang and Bryan Hooi},
  year={2025},
  eprint={2501.18492},
  archivePrefix={arXiv},
  url={https://arxiv.org/abs/2501.18492}
}
```
|
|
|