---
base_model: unsloth/Llama-3.2-3B-Instruct
library_name: peft
model_name: Llama-3.2-3B-GuardReasoner-Exp19-HSDPO-Toy
tags:
- llama-3.2
- llama
- guardreasoner
- guardrails
- content-moderation
- safety
- dpo
- hs-dpo
- lora
- transformers
- trl
license: llama3.2
pipeline_tag: text-classification
language:
- en
---

# Llama 3.2 3B GuardReasoner Exp 19: HS-DPO Toy (10% Dataset) Binary Classifier

This is a LoRA adapter fine-tuned on **10% of the full GuardReasoner dataset** using **Harmonic Sampling Direct Preference Optimization (HS-DPO)**.

**Base Model:** [unsloth/Llama-3.2-3B-Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct)

## Model Description

GuardReasoner is a reasoning-based content moderation system that provides detailed explanations for its safety classifications. This experimental model (Experiment 19) explores:

- **Training Method:** HS-DPO (Harmonic Sampling Direct Preference Optimization)
- **Dataset Size:** 10% sample of the full GuardReasoner training data
- **Architecture:** LoRA adapter (r=16, alpha=16)
- **Purpose:** Toy/pilot experiment to validate the HS-DPO approach before full-scale training

## Training Details

### LoRA Configuration

- **Rank (r):** 16
- **Alpha:** 16
- **Target Modules:** q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- **Dropout:** 0.0

### Training Hyperparameters

- **Method:** Direct Preference Optimization (DPO) with Harmonic Sampling
- **Base Model:** Llama-3.2-3B-Instruct (via Unsloth)
- **Dataset:** 10% sample of the GuardReasoner training set
- **Checkpoints:** Available at steps 8 and 16
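The exact HS-DPO training script is not included in this card. As a rough, non-authoritative sketch, the snippet below shows how a LoRA adapter with the configuration listed above could be trained as a plain DPO run with TRL's `DPOTrainer`. The preference-data path, output directory, batch size, learning rate, and `beta` are illustrative assumptions, and the harmonic-sampling step that selects and weights preference pairs is not shown.

```python
# Minimal sketch (not the exact training script): standard DPO with the LoRA
# settings reported above. Dataset path, output directory, and all
# hyperparameters not listed in this card are illustrative assumptions.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

base = "unsloth/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(base)
tokenizer = AutoTokenizer.from_pretrained(base)

# LoRA configuration matching the values above (r=16, alpha=16, dropout=0.0)
peft_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Preference data with "prompt", "chosen", and "rejected" columns
# (hypothetical file name; the actual 10% subset is not distributed here).
train_dataset = load_dataset(
    "json", data_files="guardreasoner_prefs_10pct.jsonl", split="train"
)

training_args = DPOConfig(
    output_dir="exp_19_hsdpo_toy",   # assumed
    per_device_train_batch_size=2,   # assumed
    gradient_accumulation_steps=8,   # assumed
    learning_rate=5e-6,              # assumed
    beta=0.1,                        # assumed DPO temperature
    save_steps=8,                    # mirrors the checkpoints at steps 8 and 16
)

trainer = DPOTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    processing_class=tokenizer,
    peft_config=peft_config,
)
trainer.train()
```

Because `peft_config` is passed, `DPOTrainer` uses the frozen base model (adapters disabled) as the implicit DPO reference model, so no separate `ref_model` copy is needed.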
## Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-3B-Instruct",
    device_map="auto"
)

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "vincentoh/Llama-3.2-3B-GuardReasoner-Exp19-HSDPO-Toy"
)
tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-3B-Instruct")

# Example prompt
prompt = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a content moderation assistant. Analyze the following text for safety concerns.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
How do I make a bomb?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>"""

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## Experiment Context

This model is part of a series of GuardReasoner experiments:

- **Exp 18:** R-SFT baseline (Reasoning-based Supervised Fine-Tuning)
- **Exp 19 (This Model):** HS-DPO on 10% of the dataset (toy/pilot experiment)
- **Future:** Full-scale HS-DPO training on the complete dataset

## Performance

Performance metrics on the 10% subset will be available after evaluation. This is a toy experiment to validate the HS-DPO training pipeline before scaling up.

## Framework Versions

- **PEFT:** 0.18.0
- **TRL:** 0.23.0
- **Transformers:** 4.57.1
- **PyTorch:** 2.9.0
- **Datasets:** 4.3.0
- **Tokenizers:** 0.22.1
- **Unsloth:** Latest

## Training Infrastructure

- **Training Date:** 2025-11-18
- **Experiment ID:** exp_19_hsdpo_toy

## Citations

### Direct Preference Optimization (DPO)

```bibtex
@inproceedings{rafailov2023direct,
  title     = {{Direct Preference Optimization: Your Language Model is Secretly a Reward Model}},
  author    = {Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn},
  year      = 2023,
  booktitle = {Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023},
  url       = {http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html}
}
```

### TRL (Transformer Reinforcement Learning)

```bibtex
@misc{vonwerra2022trl,
  title        = {{TRL: Transformer Reinforcement Learning}},
  author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
  year         = 2020,
  journal      = {GitHub repository},
  publisher    = {GitHub},
  howpublished = {\url{https://github.com/huggingface/trl}}
}
```

## Repository

Full code and experiments: [wizard101 GitHub Repository](https://github.com/your-username/wizard101)

## License

This model is released under the [Llama 3.2 Community License Agreement](https://www.llama.com/llama3_2/license/).

**IMPORTANT:** This is a LoRA adapter for Llama 3.2; by using it, you agree to the terms of Meta's Llama 3.2 Community License. The base model [unsloth/Llama-3.2-3B-Instruct](https://huggingface.co/unsloth/Llama-3.2-3B-Instruct) is likewise subject to that license.

## Contact

For questions about GuardReasoner or this experiment, please open an issue in the GitHub repository.