Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench Paper • 2510.26865 • Published 21 days ago • 11
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? Paper • 2509.16941 • Published Sep 21 • 21
FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions Paper • 2509.17177 • Published Sep 21 • 13
Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information Paper • 2508.11252 • Published Aug 15 • 3
SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications Paper • 2506.18951 • Published Jun 23 • 21
Letting Large Models Debate: The First Multilingual LLM Debate Competition Article • Nov 20, 2024 • 33
Open LLM Leaderboard 🏆 Track, rank and evaluate open LLMs and chatbots • 13.7k
WARM: On the Benefits of Weight Averaged Reward Models Paper • 2401.12187 • Published Jan 22, 2024 • 19 • 7