RIMO: An Easy-to-Evaluate, Hard-to-Solve Olympiad Benchmark for Advanced Mathematical Reasoning • arXiv:2509.07711 • Published Sep 9, 2025
Pushing on Multilingual Reasoning Models with Language-Mixed Chain-of-Thought • arXiv:2510.04230 • Published Oct 5, 2025
ProBench: Benchmarking Large Language Models in Competitive Programming • arXiv:2502.20868 • Published Feb 28, 2025
From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation • arXiv:2507.08924 • Published Jul 11, 2025
When AI Co-Scientists Fail: SPOT, a Benchmark for Automated Verification of Scientific Research • arXiv:2505.11855 • Published May 17, 2025