Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision Paper • 2509.14234 • Published Sep 17 • 5
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs Paper • 2509.09677 • Published Sep 11 • 34 • 4
The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs Paper • 2509.09677 • Published Sep 11 • 34
answer-matching Collection Free-form datasets, human annotations, and sample-level model outputs for "Answer Matching Outperforms Multiple Choice for Language Model Evaluation" • 2 items • Updated Jul 3 • 2
Answer Matching Outperforms Multiple Choice for Language Model Evaluation Paper • 2507.02856 • Published Jul 3 • 8 • 2
Pitfalls in Evaluating Language Model Forecasters Paper • 2506.00723 • Published May 31 • 3 • 2
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation Paper • 2502.19414 • Published Feb 26 • 20
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation Paper • 2502.19414 • Published Feb 26 • 20 • 2
Running 3.55k The Ultra-Scale Playbook 🌌 3.55k The ultimate guide to training LLM on large GPU Clusters
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach Paper • 2502.05171 • Published Feb 7 • 151