Shashwat Goel

shash42

https://www.shash42.github.io

AI & ML interests

Science of Deep Learning, Safe AI

Recent Activity

upvoted a paper 3 months ago

Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision

Paper • 2509.14234 • Published Sep 17 • 5

commented on a paper 3 months ago

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Paper • 2509.09677 • Published Sep 11 • 34

upvoted a paper 3 months ago

The Illusion of Diminishing Returns: Measuring Long Horizon Execution in LLMs

Paper • 2509.09677 • Published Sep 11 • 34

liked a dataset 3 months ago

arvindh75/Long-Horizon-Execution

Viewer • Updated Sep 16 • 100 • 200 • 13

New activity in ByteDance-Seed/Seed-OSS-36B-Instruct 4 months ago

Official vllm support

👀 2

#1 opened 4 months ago by shash42

upvoted a collection 5 months ago

answer-matching

Collection

Free-form datasets, human annotations, and sample-level model outputs for "Answer Matching Outperforms Multiple Choice for Language Model Evaluation" • 2 items • Updated Jul 3 • 2

commented on a paper 5 months ago

Answer Matching Outperforms Multiple Choice for Language Model Evaluation

Paper • 2507.02856 • Published Jul 3 • 8

upvoted a paper 6 months ago

Pitfalls in Evaluating Language Model Forecasters

Paper • 2506.00723 • Published May 31 • 3

commented on a paper 6 months ago

Pitfalls in Evaluating Language Model Forecasters

Paper • 2506.00723 • Published May 31 • 3

updated a dataset 7 months ago

shash42/GPQA-Diamond-Verify

Viewer • Updated May 9 • 792 • 20

published a dataset 7 months ago

shash42/GPQA-Diamond-Verify

Viewer • Updated May 9 • 792 • 20

updated a dataset 7 months ago

shash42/MATH-Verify

Viewer • Updated May 9 • 19.7k • 16

published a dataset 7 months ago

shash42/MATH-Verify

Viewer • Updated May 9 • 19.7k • 16

updated a dataset 7 months ago

shash42/MMLU-Pro-Verify

Viewer • Updated May 9 • 114k • 9

published a dataset 7 months ago

shash42/MMLU-Pro-Verify

Viewer • Updated May 9 • 114k • 9

liked a dataset 9 months ago

bethgelab/REFUTE

Viewer • Updated Feb 28 • 324 • 6 • 5

authored a paper 9 months ago

Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation

Paper • 2502.19414 • Published Feb 26 • 20

commented on a paper 9 months ago

Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation

Paper • 2502.19414 • Published Feb 26 • 20

liked a Space 10 months ago

The Ultra-Scale Playbook

🌌

3.55k

The ultimate guide to training LLM on large GPU Clusters

upvoted a paper 10 months ago

Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Paper • 2502.05171 • Published Feb 7 • 151
