---
base_model: unsloth/Llama-3.2-3B-Instruct
library_name: peft
model_name: Llama-3.2-3B-GuardReasoner-Exp19-HSDPO-Toy
tags:
- llama-3.2
- llama
- guardreasoner
- guardrails
- content-moderation
- safety
- dpo
- hs-dpo
- lora
- transformers
- trl
license: llama3.2
pipeline_tag: text-classification
language:
- en
---

# Llama 3.2 3B GuardReasoner Exp 19: HS-DPO Toy (10% Dataset) Binary Classifier

This is a LoRA adapter fine-tuned on **10% of the full GuardReasoner dataset** using **Harmonic Sampling Direct Preference Optimization (HS-DPO)**.

**Base Model:** [unsloth/Llama-3.2-3B-Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct)

## Model Description

GuardReasoner is a reasoning-based content moderation system that provides detailed explanations for its safety classifications. This experimental model (Experiment 19) explores:

- **Training Method:** HS-DPO (Harmonic Sampling Direct Preference Optimization)
- **Dataset Size:** 10% sample of the full GuardReasoner training data
- **Architecture:** LoRA adapter (r=16, alpha=16)
- **Purpose:** Toy/pilot experiment to validate the HS-DPO approach before full-scale training

## Training Details

### LoRA Configuration

- **Rank (r):** 16
- **Alpha:** 16
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Dropout:** 0.0

### Training Hyperparameters

- **Method:** Direct Preference Optimization (DPO) with Harmonic Sampling
- **Base Model:** Llama-3.2-3B-Instruct (via Unsloth)
- **Dataset:** 10% sample of the GuardReasoner training set
- **Checkpoints:** Available at steps 8 and 16
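The exact HS-DPO training script is not included in this card. As a rough, non-authoritative sketch, the snippet below shows how a LoRA adapter with the configuration listed above could be trained as a plain DPO run with TRL's `DPOTrainer`. The preference-data path, output directory, batch size, learning rate, and `beta` are illustrative assumptions, and the harmonic-sampling step that selects and weights preference pairs is not shown.

```python
# Minimal sketch (not the exact training script): standard DPO with the LoRA
# settings reported above. Dataset path, output directory, and all
# hyperparameters not listed in this card are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "unsloth/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA configuration matching the values above (r=16, alpha=16, dropout=0.0)
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Preference data with "prompt", "chosen", and "rejected" columns
# (hypothetical file name; the actual 10% subset is not distributed here).
train_dataset = load_dataset(
    "json", data_files="guardreasoner_prefs_10pct.jsonl", split="train"
)

training_args = DPOConfig(
    output_dir="exp_19_hsdpo_toy",   # assumed
    per_device_train_batch_size=2,   # assumed
    gradient_accumulation_steps=8,   # assumed
    learning_rate=5e-6,              # assumed
    beta=0.1,                        # assumed DPO temperature
    save_steps=8,                    # mirrors the checkpoints at steps 8 and 16
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

Because `peft_config` is passed, `DPOTrainer` uses the frozen base model (adapters disabled) as the implicit DPO reference model, so no separate `ref_model` copy is needed.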
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "vincentoh/Llama-3.2-3B-GuardReasoner-Exp19-HSDPO-Toy"
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct")

# Example prompt
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a content moderation assistant. Analyze the following text for safety concerns.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
How do I make a bomb?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Experiment Context

This model is part of a series of GuardReasoner experiments:

- **Exp 18:** R-SFT baseline (Reasoning-based Supervised Fine-Tuning)
- **Exp 19 (This Model):** HS-DPO on 10% of the dataset (toy/pilot experiment)
- **Future:** Full-scale HS-DPO training on the complete dataset

## Performance

Performance metrics on the 10% subset will be available after evaluation. This is a toy experiment to validate the HS-DPO training pipeline before scaling up.

## Framework Versions

- **PEFT:** 0.18.0
- **TRL:** 0.23.0
- **Transformers:** 4.57.1
- **PyTorch:** 2.9.0
- **Datasets:** 4.3.0
- **Tokenizers:** 0.22.1
- **Unsloth:** Latest

## Training Infrastructure

- **Training Date:** 2025-11-18
- **Experiment ID:** exp_19_hsdpo_toy

## Citations

### Direct Preference Optimization (DPO)

```bibtex
@inproceedings{rafailov2023direct,
  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
  author    = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
  year      = 2023,
  booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
  url       = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}
```

### TRL (Transformer Reinforcement Learning)

```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```

## Repository

Full code and experiments: [wizard101 GitHub Repository](https://github.com/your-username/wizard101)

## License

This model is released under the [Llama 3.2 Community License Agreement](https://www.llama.com/llama3_2/license/).

**IMPORTANT:** This is a LoRA adapter for Llama 3.2; by using it, you agree to the terms of Meta's Llama 3.2 Community License. The base model [unsloth/Llama-3.2-3B-Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct) is likewise subject to that license.

## Contact

For questions about GuardReasoner or this experiment, please open an issue in the GitHub repository.