Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics Paper • 2510.05137 • Published Oct 1 • 4
Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics Paper • 2510.05137 • Published Oct 1 • 4 • 2
Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned Paper • 2509.23250 • Published Sep 27 • 5
Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned Paper • 2509.23250 • Published Sep 27 • 5 • 2
OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always! Paper • 2509.26495 • Published Sep 30 • 10 • 2
OffTopicEval: When Large Language Models Enter the Wrong Chat, Almost Always! Paper • 2509.26495 • Published Sep 30 • 10