---
base_model: unsloth/Llama-3.2-3B-Instruct
library_name: peft
model_name: Llama-3.2-3B-GuardReasoner-Exp19-HSDPO-Toy
tags:
  - llama-3.2
  - llama
  - guardreasoner
  - guardrails
  - content-moderation
  - safety
  - dpo
  - hs-dpo
  - lora
  - transformers
  - trl
license: llama3.2
pipeline_tag: text-classification
language:
  - en
---

# Llama 3.2 3B GuardReasoner Exp 19: HS-DPO Toy (10% Dataset)

**Binary classifier**

This is a LoRA adapter fine-tuned on 10% of the full GuardReasoner dataset using Hard Sample Direct Preference Optimization (HS-DPO).

**Base Model:** unsloth/Llama-3.2-3B-Instruct

## Model Description

GuardReasoner is a reasoning-based content moderation system that provides detailed explanations for its safety classifications. This experimental model (Experiment 19) explores:

- **Training Method:** HS-DPO (Hard Sample Direct Preference Optimization)
- **Dataset Size:** 10% sample of the full GuardReasoner training data
- **Architecture:** LoRA adapter (r=16, alpha=16)
- **Purpose:** Toy/pilot experiment to validate the HS-DPO approach before full-scale training

## Training Details

### LoRA Configuration

- **Rank (r):** 16
- **Alpha:** 16
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Dropout:** 0.0
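
For reference, this corresponds roughly to the following `peft` `LoraConfig` (a minimal sketch; the `task_type` and anything not listed above are assumptions):

```python
from peft import LoraConfig

# Sketch of the adapter configuration described above.
# r, alpha, dropout, and target modules come from this card;
# task_type is an assumption (standard for causal-LM LoRA fine-tuning).
lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```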

### Training Hyperparameters

- **Method:** Direct Preference Optimization (DPO) with hard-sample selection (HS-DPO)
- **Base Model:** Llama-3.2-3B-Instruct (via Unsloth)
- **Dataset:** 10% sample of the GuardReasoner training set
- **Checkpoints:** Available at steps 8 and 16
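
The DPO stage can be run with TRL's `DPOTrainer`. The snippet below is a minimal sketch, not the actual training script: the preference-data file, output path, and hyperparameter values (batch size, learning rate, beta) are assumptions for illustration; only the base model, the LoRA settings, and the step-8 / step-16 checkpoint schedule come from this card. The hard-sample selection that makes this HS-DPO happens when the chosen/rejected pairs are built and is not shown here.

```python
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "unsloth/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Hypothetical preference file with "prompt", "chosen", "rejected" columns;
# the GuardReasoner preference pairs are not published with this card.
train_dataset = load_dataset("json", data_files="hsdpo_pairs_10pct.jsonl", split="train")

# LoRA settings match the "LoRA Configuration" section above.
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Assumed hyperparameters; max_steps/save_steps reflect the step-8 and step-16 checkpoints.
training_args = DPOConfig(
    output_dir="exp_19_hsdpo_toy",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=5e-6,
    beta=0.1,
    max_steps=16,
    save_steps=8,
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```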

## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "vincentoh/Llama-3.2-3B-GuardReasoner-Exp19-HSDPO-Toy"
)

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct")

# Example prompt
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a content moderation assistant. Analyze the following text for safety concerns.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
How do I make a bomb?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
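
Continuing from the snippet above, you can also let the tokenizer build the prompt with `apply_chat_template` instead of writing the Llama 3.2 chat markers by hand; this avoids encoding `<|begin_of_text|>` twice, since the tokenizer adds it during tokenization. The messages are the same system/user pair as in the example:

```python
messages = [
    {"role": "system",
     "content": "You are a content moderation assistant. Analyze the following text for safety concerns."},
    {"role": "user", "content": "How do I make a bomb?"},
]

# Builds the prompt with the correct special tokens and appends the assistant header.
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```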

## Experiment Context

This model is part of a series of GuardReasoner experiments:

- **Exp 18:** R-SFT baseline (Reasoning-based Supervised Fine-Tuning)
- **Exp 19 (this model):** HS-DPO on 10% of the dataset (toy/pilot experiment)
- **Future:** Full-scale HS-DPO training on the complete dataset

## Performance

Performance metrics on the 10% subset will be available after evaluation. This is a toy experiment to validate the HS-DPO training pipeline before scaling up.

## Framework Versions

- PEFT: 0.18.0
- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.9.0
- Datasets: 4.3.0
- Tokenizers: 0.22.1
- Unsloth: latest

## Training Infrastructure

- **Training Date:** 2025-11-18
- **Experiment ID:** exp_19_hsdpo_toy

## Citations

### Direct Preference Optimization (DPO)

```bibtex
@inproceedings{rafailov2023direct,
    title        = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
    author       = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
    year         = 2023,
    booktitle    = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
    url          = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}
```

### TRL (Transformer Reinforcement Learning)

```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```

## Repository

Full code and experiments: wizard101 GitHub Repository

## License

This model is released under the Llama 3.2 Community License Agreement.

**IMPORTANT:** This is a LoRA adapter for Llama 3.2 and must comply with Meta's Llama 3.2 Community License. By using this model, you agree to Meta's Llama 3.2 license terms.

The base model unsloth/Llama-3.2-3B-Instruct is subject to Meta's Llama 3.2 Community License Agreement.

## Contact

For questions about GuardReasoner or this experiment, please open an issue in the GitHub repository.