Sherlock

eyuansu71

https://scholar.google.com/citations?user=75pkx3YAAAAJ&hl=en

AI & ML interests

None yet

Recent Activity

upvoted a paper 16 days ago

Do Vision-Language Models Measure Up? Benchmarking Visual Measurement Reading with MeasureBench

Paper • 2510.26865 • Published 21 days ago • 11

upvoted 2 papers about 2 months ago

SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?

Paper • 2509.16941 • Published Sep 21 • 21

FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Paper • 2509.17177 • Published Sep 21 • 13

upvoted a paper 3 months ago

Beyond Solving Math Quiz: Evaluating the Ability of Large Reasoning Models to Ask for Information

Paper • 2508.11252 • Published Aug 15 • 3

commented on a paper 4 months ago

One Token to Fool LLM-as-a-Judge

Paper • 2507.08794 • Published Jul 11 • 31

upvoted a paper 5 months ago

SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications

Paper • 2506.18951 • Published Jun 23 • 21

updated a dataset 7 months ago

FlagEval/HMMT_2025

Viewer • Updated May 6 • 30 • 61 • 1

published a dataset 7 months ago

FlagEval/HMMT_2025

Viewer • Updated May 6 • 30 • 61 • 1

liked a dataset 7 months ago

zwhe99/DeepMath-103K

Viewer • Updated May 29 • 103k • 6.06k • 269

liked a dataset 10 months ago

KingNish/reasoning-base-20k

Viewer • Updated May 15 • 19.9k • 409 • 229

updated a model 11 months ago

FlagEval/flageval_judgemodel

Text Generation • 33B • Updated Dec 30, 2024 • 1

published an article about 1 year ago

Article

Letting Large Models Debate: The First Multilingual LLM Debate Competition

Nov 20, 2024

liked a model about 1 year ago

Shitao/OmniGen-v1

Text-to-Image • Updated Nov 7, 2024 • 1.95k • 321

liked a Space about 1 year ago

Open LLM Leaderboard

🏆

13.7k

Track, rank and evaluate open LLMs and chatbots

updated a dataset over 1 year ago

FlagEval/CLCC_v1

Viewer • Updated Jul 29, 2024 • 760 • 15 • 3

liked a dataset over 1 year ago

FlagEval/CLCC_v1

Viewer • Updated Jul 29, 2024 • 760 • 15 • 3

liked a Space over 1 year ago

Open Chinese LLM Leaderboard

🏆

122

Explore and submit LLM benchmarks

commented on 2 papers almost 2 years ago

WARM: On the Benefits of Weight Averaged Reward Models

Paper • 2401.12187 • Published Jan 22, 2024 • 19

updated a dataset almost 2 years ago

eyuansu71/vg

Updated Jan 3, 2024 • 17