Supervised Reinforcement Learning: From Expert Trajectories to Step-wise Reasoning • Paper • 2510.25992 • Published 14 days ago • 41
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory • Paper • 2509.25140 • Published Sep 29 • 11
Vibe Checker: Aligning Code Evaluation with Human Preference • Paper • 2510.07315 • Published Oct 8 • 31
mDPO: Conditional Preference Optimization for Multimodal Large Language Models • Paper • 2406.11839 • Published Jun 17, 2024 • 39