RefAlign: RL with Similarity-based Rewards

GitHub repository: https://github.com/mzhaoshuai/RefAlign

Paper: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.

This repository contains a PEFT (Parameter-Efficient Fine-Tuning) adapter for the Llama-2-13b-hf base model. The adapter is an SFT (Supervised Fine-Tuning) checkpoint trained on the CONQORD dataset.
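The sketch below shows one typical way to run inference with this adapter applied on top of the base model via PEFT and transformers; it is a minimal illustration, and the base-model/tokenizer repository (assumed here to be meta-llama/Llama-2-13b-hf) and generation settings are assumptions rather than taken from this repository.

```python
# Minimal inference sketch (illustrative, not the repository's own example):
# load the Llama-2 13B base weights with this adapter attached and generate.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "mzhaoshuai/Llama-2-13b-hf-conf-sft"
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")  # assumed base repo

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```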

RefAlign is a REINFORCE-style alignment algorithm designed to make Large Language Models (LLMs) helpful, harmless, and honest without relying on binary human preference data or explicit reward models. It achieves this by utilizing language generation evaluation metrics, such as BERTScore, between sampled generations and unary reference answers as surrogate rewards. This approach can be extended to diverse scenarios, including safety, confidence, and general preference alignment.
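As a rough illustration of the surrogate-reward idea (not the repository's exact implementation), the sketch below uses the bert_score package to score sampled generations against unary reference answers; the resulting F1 values would play the role of per-sample rewards in a REINFORCE-style update. Function and variable names here are illustrative assumptions.

```python
# Minimal sketch of similarity-based rewards, assuming the `bert_score` package;
# this is not RefAlign's actual code, only the core idea.
from bert_score import score

def surrogate_rewards(generations, references, lang="en"):
    """Return BERTScore F1 between each sampled generation and its reference answer."""
    precision, recall, f1 = score(generations, references, lang=lang, verbose=False)
    return f1.tolist()  # one reward per (generation, reference) pair

rewards = surrogate_rewards(
    ["Paris is the capital city of France."],
    ["The capital of France is Paris."],
)
# `rewards` would then weight the log-probabilities of the sampled tokens
# in the policy-gradient (REINFORCE-style) objective.
```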

For more details on the methodology, full implementation, and additional models, please refer to the official GitHub repository.

To obtain the full model, you typically need to merge this adapter with its base model. You can use utility scripts like merge_model.py (e.g., found in mzhaoshuai/Llama-2-7b-hf-conf-sft) for this process.
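If you prefer not to use the repository script, a merge can also be done directly with the PEFT API. The sketch below is an assumption about typical usage; the base-model tokenizer repository and output path are illustrative.

```python
# Minimal merge sketch using PEFT's merge_and_unload; paths are illustrative.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "mzhaoshuai/Llama-2-13b-hf-conf-sft"

# Load the base model with the adapter attached, then fold the adapter weights
# into the base weights so the result can be used without PEFT.
model = AutoPeftModelForCausalLM.from_pretrained(adapter_id, torch_dtype=torch.bfloat16)
merged = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")  # assumed base repo
merged.save_pretrained("./Llama-2-13b-hf-conf-sft-merged")
tokenizer.save_pretrained("./Llama-2-13b-hf-conf-sft-merged")
```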

Framework versions

  • PEFT 0.11.1