- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute — arXiv:2503.23803, published Mar 31
- A.S.E: A Repository-Level Benchmark for Evaluating Security in AI-Generated Code — arXiv:2508.18106, published Aug 25
- Where LLM Agents Fail and How They Can Learn From Failures — arXiv:2509.25370, published Sep 29
- SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents — arXiv:2505.20411, published May 26