RefAlign: RL with Similarity-based Rewards

GitHub repository: https://github.com/mzhaoshuai/RefAlign

Paper: Learning from Reference Answers: Versatile Language Model Alignment without Binary Human Preference Data.

This repository contains a PEFT (Parameter-Efficient Fine-Tuning) adapter for the Llama-2-13b-hf base model. The adapter is an SFT (Supervised Fine-Tuning) checkpoint trained on the CONQORD dataset.
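The sketch below shows one typical way to run inference with this adapter applied on top of the base model via PEFT and transformers; it is a minimal illustration, and the base-model/tokenizer repository (assumed here to be meta-llama/Llama-2-13b-hf) and generation settings are assumptions rather than taken from this repository.

```python
# Minimal inference sketch (illustrative, not the repository's own example):
# load the Llama-2 13B base weights with this adapter attached and generate.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "mzhaoshuai/Llama-2-13b-hf-conf-sft"
model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")  # assumed base repo

prompt = "Q: What is the capital of France?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```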

RefAlign is a REINFORCE-style alignment algorithm designed to make Large Language Models (LLMs) helpful, harmless, and honest without relying on binary human preference data or explicit reward models. It achieves this by utilizing language generation evaluation metrics, such as BERTScore, between sampled generations and unary reference answers as surrogate rewards. This approach can be extended to diverse scenarios, including safety, confidence, and general preference alignment.
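As a rough illustration of the surrogate-reward idea (not the repository's exact implementation), the sketch below uses the bert_score package to score sampled generations against unary reference answers; the resulting F1 values would play the role of per-sample rewards in a REINFORCE-style update. Function and variable names here are illustrative assumptions.

```python
# Minimal sketch of similarity-based rewards, assuming the `bert_score` package;
# this is not RefAlign's actual code, only the core idea.
from bert_score import score

def surrogate_rewards(generations, references, lang="en"):
    """Return BERTScore F1 between each sampled generation and its reference answer."""
    precision, recall, f1 = score(generations, references, lang=lang, verbose=False)
    return f1.tolist()  # one reward per (generation, reference) pair

rewards = surrogate_rewards(
    ["Paris is the capital city of France."],
    ["The capital of France is Paris."],
)
# `rewards` would then weight the log-probabilities of the sampled tokens
# in the policy-gradient (REINFORCE-style) objective.
```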

For more details on the methodology, full implementation, and additional models, please refer to the official GitHub repository.

To obtain the full model, you typically need to merge this adapter with its base model. You can use utility scripts like merge_model.py (e.g., found in mzhaoshuai/Llama-2-7b-hf-conf-sft) for this process.
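If you prefer not to use the repository script, a merge can also be done directly with the PEFT API. The sketch below is an assumption about typical usage; the base-model tokenizer repository and output path are illustrative.

```python
# Minimal merge sketch using PEFT's merge_and_unload; paths are illustrative.
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "mzhaoshuai/Llama-2-13b-hf-conf-sft"

# Load the base model with the adapter attached, then fold the adapter weights
# into the base weights so the result can be used without PEFT.
model = AutoPeftModelForCausalLM.from_pretrained(adapter_id, torch_dtype=torch.bfloat16)
merged = model.merge_and_unload()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")  # assumed base repo
merged.save_pretrained("./Llama-2-13b-hf-conf-sft-merged")
tokenizer.save_pretrained("./Llama-2-13b-hf-conf-sft-merged")
```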

Framework versions

  • PEFT 0.11.1