BabyBabelLM Collection • A multilingual collection of datasets modeling the language a person observes from birth until they acquire a native language. • 45 items • Updated 19 days ago • 7
Democratizing Diplomacy: A Harness for Evaluating Any Large Language Model on Full-Press Diplomacy Paper • 2508.07485 • Published Aug 10 • 10
Pretraining Language Models for Diachronic Linguistic Change Discovery Paper • 2504.05523 • Published Apr 7 • 5
Scaling Analysis of Interleaved Speech-Text Language Models Paper • 2504.02398 • Published Apr 3 • 31
Can this Model Also Recognize Dogs? Zero-Shot Model Search from Weights Paper • 2502.09619 • Published Feb 13 • 35
Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation Paper • 2412.03304 • Published Dec 4, 2024 • 21
LiveXiv -- A Multi-Modal Live Benchmark Based on Arxiv Papers Content Paper • 2410.10783 • Published Oct 14, 2024 • 27
SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification Paper • 2410.05057 • Published Oct 7, 2024 • 7
The ShareLM Collection and Plugin: Contributing Human-Model Chats for the Benefit of the Community Paper • 2408.08291 • Published Aug 15, 2024 • 11
Benchmark Agreement Testing Done Right: A Guide for LLM Benchmark Evaluation Paper • 2407.13696 • Published Jul 18, 2024 • 5
Is It Really Long Context if All You Need Is Retrieval? Towards Genuinely Difficult Long Context NLP Paper • 2407.00402 • Published Jun 29, 2024 • 23