Model Sources
We release the models before and after RLHF to facilitate reproduction of the Energy Loss Phenomenon introduced in [1], which offers a new perspective on mitigating reward hacking in RLHF. The corresponding implementation is available at Energy-Loss-Phenomenon (GitHub).
[1] Y. Miao, S. Zhang, L. Ding, Y. Zhang, L. Zhang, and D. Tao. The Energy Loss Phenomenon in RLHF: A New Perspective on Mitigating Reward Hacking. In Proceedings of the 42nd International Conference on Machine Learning (ICML 2025).
Citation
If you find our work useful in your research, please consider citing our paper:
@inproceedings{miao2025the,
title={The Energy Loss Phenomenon in {RLHF}: A New Perspective on Mitigating Reward Hacking},
author={Yuchun Miao and Sen Zhang and Liang Ding and Yuqi Zhang and Lefei Zhang and Dacheng Tao},
booktitle={Forty-second International Conference on Machine Learning},
year={2025},
url={https://openreview.net/forum?id=82A81az3V5}
}
Model tree for mycccc/Energy-Loss-Phenomenon-Demo
Base model: meta-llama/Llama-2-7b