Observation: instruction-tuned vs base models across reasoning benchmarks

#1151
by prithvi1029 - opened

While exploring the Open LLM Leaderboard, I noticed a recurring pattern:

Instruction-tuned models often show strong gains on reasoning-heavy benchmarks (e.g., HellaSwag, ARC),
while base models sometimes hold up more consistently on multilingual or code-related tasks.

I’m curious about:
• how much prompt formatting influences these scores (rough sketch of what I mean after this list)
• whether instruction tuning introduces benchmark-specific bias
• how stable these rankings remain across leaderboard updates
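On the first point, here's the kind of quick check I have in mind. It's not the leaderboard's actual evaluation harness; the model name, the ARC-style question, and the scoring setup are purely illustrative. It scores the same question once with a plain completion-style prompt and once through the model's own chat template, so you can see how much the answer log-likelihoods (and the top pick) shift from formatting alone:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"  # illustrative pick; any model with a chat template works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

question = "Which property of a mineral can be determined just by looking at it?"
choices = ["luster", "mass", "weight", "hardness"]


def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probs the model assigns to `completion`, conditioned on `prompt`.

    Simplification: assumes tokenizing `prompt` alone gives a prefix of
    tokenizing `prompt + completion`, which holds for typical tokenizers here.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # position t predicts token t+1, so drop the last logit and shift targets by one
    log_probs = torch.log_softmax(logits[:, :-1, :].float(), dim=-1)
    targets = full_ids[:, 1:]
    start = prompt_ids.shape[1] - 1  # index of the first completion token in `targets`
    picked = log_probs[0, start:].gather(-1, targets[0, start:].unsqueeze(-1))
    return picked.sum().item()


# Format 1: plain continuation prompt, roughly how base models get scored
plain_prompt = f"Question: {question}\nAnswer:"

# Format 2: the model's own chat template, closer to how instruction-tuned models are used
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

for label, prompt in [("plain", plain_prompt), ("chat template", chat_prompt)]:
    scores = {c: completion_logprob(prompt, " " + c) for c in choices}
    best = max(scores, key=scores.get)
    print(f"[{label}] predicted: {best}")
    for c in sorted(scores, key=scores.get, reverse=True):
        print(f"    {c:<10} {scores[c]:8.2f}")
```

The lm-evaluation-harness backend handles this properly at scale, of course; a toy script like this just makes the formatting effect easy to eyeball for a single model pair (instruction-tuned vs its base checkpoint).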

Thanks for maintaining this leaderboard — it’s extremely useful.
Happy to explore specific model comparisons if helpful.
