Observation: instruction-tuned vs base models across reasoning benchmarks

#1151
by prithvi1029 - opened

While exploring the Open LLM Leaderboard, I noticed a recurring pattern:

Instruction-tuned models often show strong gains on reasoning-heavy benchmarks (e.g., HellaSwag, ARC),
while base models sometimes hold up more consistently on multilingual or code-related tasks.

I’m curious about:
• how much prompt formatting influences these scores (rough sketch of what I mean after this list)
• whether instruction tuning introduces benchmark-specific bias
• how stable these rankings remain across leaderboard updates
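On the first point, here's the kind of quick check I have in mind. It's not the leaderboard's actual evaluation harness; the model name, the ARC-style question, and the scoring setup are purely illustrative. It scores the same question once with a plain completion-style prompt and once through the model's own chat template, so you can see how much the answer log-likelihoods (and the top pick) shift from formatting alone:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "HuggingFaceH4/zephyr-7b-beta"  # illustrative pick; any model with a chat template works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

question = "Which property of a mineral can be determined just by looking at it?"
choices = ["luster", "mass", "weight", "hardness"]


def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probs the model assigns to `completion`, conditioned on `prompt`.

    Simplification: assumes tokenizing `prompt` alone gives a prefix of
    tokenizing `prompt + completion`, which holds for typical tokenizers here.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # position t predicts token t+1, so drop the last logit and shift targets by one
    log_probs = torch.log_softmax(logits[:, :-1, :].float(), dim=-1)
    targets = full_ids[:, 1:]
    start = prompt_ids.shape[1] - 1  # index of the first completion token in `targets`
    picked = log_probs[0, start:].gather(-1, targets[0, start:].unsqueeze(-1))
    return picked.sum().item()


# Format 1: plain continuation prompt, roughly how base models get scored
plain_prompt = f"Question: {question}\nAnswer:"

# Format 2: the model's own chat template, closer to how instruction-tuned models are used
chat_prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": question}],
    tokenize=False,
    add_generation_prompt=True,
)

for label, prompt in [("plain", plain_prompt), ("chat template", chat_prompt)]:
    scores = {c: completion_logprob(prompt, " " + c) for c in choices}
    best = max(scores, key=scores.get)
    print(f"[{label}] predicted: {best}")
    for c in sorted(scores, key=scores.get, reverse=True):
        print(f"    {c:<10} {scores[c]:8.2f}")
```

The lm-evaluation-harness backend handles this properly at scale, of course; a toy script like this just makes the formatting effect easy to eyeball for a single model pair (instruction-tuned vs its base checkpoint).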

Thanks for maintaining this leaderboard — it’s extremely useful.
Happy to explore specific model comparisons if helpful.
