RefAlign: RL with Similarity-based Rewards
GitHub repository: https://github.com/mzhaoshuai/RefAlign
This repository contains a PEFT (Parameter-Efficient Fine-Tuning) adapter for the Llama-2-13b-hf base model; the adapter is an SFT (Supervised Fine-Tuning) model trained on the CONQORD dataset.
RefAlign is a REINFORCE-style alignment algorithm designed to make Large Language Models (LLMs) helpful, harmless, and honest without relying on binary human preference data or explicit reward models. Instead, it uses language generation evaluation metrics, such as BERTScore, computed between sampled generations and unary reference answers as surrogate rewards. This approach extends to diverse scenarios, including safety, confidence, and general preference alignment.
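As a rough illustration of the surrogate-reward idea, the sketch below scores sampled generations against unary reference answers with the `bert-score` package. The function name and arguments are illustrative assumptions, not RefAlign's actual API; see the GitHub repository for the real implementation.

```python
# Minimal sketch of a BERTScore-based surrogate reward (assumes the
# `bert-score` package is installed: pip install bert-score).
from bert_score import score


def similarity_rewards(sampled_generations, reference_answers, lang="en"):
    """Return BERTScore F1 between each sampled generation and its reference."""
    # `score` returns (precision, recall, F1) tensors, one entry per pair.
    _, _, f1 = score(sampled_generations, reference_answers, lang=lang, verbose=False)
    return f1.tolist()


# Higher similarity to the reference answer yields a higher reward.
rewards = similarity_rewards(
    ["Paris is the capital of France."],
    ["The capital of France is Paris."],
)
print(rewards)
```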
For more details on the methodology, full implementation, and additional models, please refer to the official GitHub repository.
To obtain the full model, you typically need to merge this adapter with its base model. You can use a utility script such as merge_model.py (e.g., the one found in mzhaoshuai/Llama-2-7b-hf-conf-sft) for this step.
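If merge_model.py is not at hand, a minimal merge can also be done directly with the PEFT library, as sketched below. The base-model ID and adapter path are assumptions; substitute the actual locations for this repository.

```python
# Hedged sketch: merge the LoRA adapter into its base model with PEFT.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Llama-2-13b-hf"          # assumed base model ID
adapter_id = "path/or/hub-id/of/this-adapter"  # this repository's adapter

base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
model = PeftModel.from_pretrained(base, adapter_id)
merged = model.merge_and_unload()              # fold adapter weights into the base

tokenizer = AutoTokenizer.from_pretrained(base_id)
merged.save_pretrained("llama-2-13b-conqord-sft-merged")
tokenizer.save_pretrained("llama-2-13b-conqord-sft-merged")
```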
Framework versions
- PEFT 0.11.1