Eval Leaderboards - an andrewrreed Collection

Eval Leaderboards

updated Jun 17

Running

4.66k

LMArena Leaderboard

🏆

Display LMArena Leaderboard

Running on CPU Upgrade

13.7k

Open LLM Leaderboard

🏆

Track, rank and evaluate open LLMs and chatbots

Running on CPU Upgrade

6.7k

MTEB Leaderboard

🥇

Embedding Leaderboard

Running

573

LLM-Perf Leaderboard

🏆

Explore hardware performance for LLMs

Running on CPU Upgrade

1.13k

Open ASR Leaderboard

🏆

Display and request speech recognition model benchmarks

Running

1.46k

Big Code Models Leaderboard

📈

Submit code models for evaluation and view leaderboard

Runtime error

144

Hallucinations Leaderboard

🔥

View and submit LLM evaluations

Runtime error

105

Enterprise Scenarios Leaderboard

🥇

Running on CPU Upgrade

93

LLM Safety Leaderboard

🥇

Explore and submit LLM benchmarks

Running

231

AI2 WildBench Leaderboard (V2)

🦁

Display and explore a leaderboard of language models

Running

173

Open Object Detection Leaderboard

🏆

Request evaluation for a new model

Running

30

Contextual Leaderboard

🐨

Submit and evaluate models for contextual understanding tasks

Running

191

Yet Another LLM Leaderboard

🌖

Generate interactive web apps with Streamlit

Running on CPU Upgrade

936

Open VLM Leaderboard

🌎

VLMEvalKit Evaluation Results Collection

Running

557

Vision Arena (Testing VLMs side-by-side)

🖼

Display image analysis results

Running

39

Leaderboard

🐠

Display LiveCodeBench Leaderboard

Runtime error

432

Open Medical-LLM Leaderboard

🥇

Explore and submit models for benchmarking

Running on CPU Upgrade

57

Open CoT Leaderboard

🥇

Track, rank and evaluate open LLMs' CoT quality

Running

23

MM-UPD Leaderboard

🥇

Submit and evaluate model results on MM-UPD benchmarks

Running

226

BigCodeBench Leaderboard

🥇

Explore and analyze code completion benchmarks

Runtime error

10

MJ Bench Leaderboard

🥇

Display and filter multimodal model leaderboard results

Running

409

Reward Bench Leaderboard

📐

Display and analyze reward model evaluation results

Running on CPU Upgrade

434

Agent Leaderboard

💬

Ranking of LLMs for agentic tasks

Running

116

Find a leaderboard

🔍

Explore and discover all leaderboards from the HF community

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

Paper • 2506.11763 • Published Jun 13 • 71