Observation: instruction-tuned vs base models across reasoning benchmarks
#1151 opened by prithvi1029
While exploring the Open LLM Leaderboard, I noticed a recurring pattern: instruction-tuned models often post strong gains on reasoning-oriented benchmarks (e.g., ARC, HellaSwag), while their base counterparts sometimes hold steadier scores on multilingual and code-related tasks.
I’m curious about:
• how much prompt formatting influences these scores (a small sketch after this list illustrates the difference)
• whether instruction tuning introduces benchmark-specific bias
• how stable these rankings remain across leaderboard updates (see the second sketch at the end of the post)
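
To make the prompt-formatting point concrete, here is a minimal sketch of how the same multiple-choice question reaches a base model versus an instruction-tuned one. The model name is just one example of a tokenizer that ships a chat template, and the question is a made-up ARC-style item; this is an illustration, not how the leaderboard harness is actually configured.

```python
# Sketch: raw completion prompt vs. chat-templated prompt.
# The model name below is only an example of a tokenizer with a chat
# template; the question is a made-up ARC-style item.
from transformers import AutoTokenizer

question = (
    "Which gas do plants absorb from the atmosphere?\n"
    "A. Oxygen\nB. Carbon dioxide\nC. Nitrogen\nD. Helium"
)

# Base-model style: plain completion text, no special tokens.
raw_prompt = f"Question: {question}\nAnswer:"

# Instruction-tuned style: the chat template wraps the same content
# in role markers and special tokens before scoring/generation.
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
chat_prompt = tok.apply_chat_template(
    [{"role": "user", "content": f"{question}\nAnswer with the letter only."}],
    tokenize=False,
    add_generation_prompt=True,
)

print(repr(raw_prompt))
print(repr(chat_prompt))  # note the added <|user|>/<|assistant|> scaffolding
```

If a harness scores log-likelihoods against the raw format, an instruction-tuned model may be evaluated outside its trained input distribution, which seems like one plausible source of the gaps above.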
Thanks for maintaining this leaderboard — it’s extremely useful.
Happy to explore specific model comparisons if helpful.
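
As a rough starting point for the stability question, a Spearman rank correlation between two leaderboard snapshots would quantify how much the ordering actually moves. The model names and scores below are made-up placeholders, not real leaderboard numbers:

```python
# Sketch: rank stability across two leaderboard snapshots via Spearman's rho.
# All model names and scores here are hypothetical placeholders.
from scipy.stats import spearmanr

snapshot_a = {"model-x": 71.2, "model-y": 69.8, "model-z": 66.4}  # hypothetical
snapshot_b = {"model-x": 70.9, "model-y": 70.3, "model-z": 67.1}  # hypothetical

models = sorted(snapshot_a)
rho, p = spearmanr(
    [snapshot_a[m] for m in models],
    [snapshot_b[m] for m in models],
)
print(f"Spearman rho across snapshots: {rho:.2f} (p={p:.2f})")
```

A rho close to 1 would suggest updates mostly rescale scores without reshuffling ranks; a lower value would point to genuine rank churn.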