AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios Paper • 2505.16944 • Published May 22
DeepResearchGym: A Free, Transparent, and Reproducible Evaluation Sandbox for Deep Research Paper • 2505.19253 • Published May 25