Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark Paper • 2304.03279 • Published Apr 6, 2023
When Your AI Deceives You: Challenges with Partial Observability of Human Evaluators in Reward Learning Paper • 2402.17747 • Published Feb 27, 2024