Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Abstract
LLMs exhibit reasoning gaps compared to humans, underutilizing cognitive elements and failing to deploy meta-cognitive controls, but test-time guidance can improve their performance on complex problems.
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning and knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. Meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.
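As a rough illustration of what such test-time reasoning guidance could look like in practice, here is a minimal prompt-scaffolding sketch. The scaffold steps, wording, and function names below are illustrative assumptions, not the paper's actual taxonomy or method.

```python
# Minimal sketch of test-time reasoning guidance via prompt scaffolding.
# Assumption: the steps below only paraphrase the kinds of cognitive elements
# the paper discusses; they are NOT the paper's exact 28-element taxonomy.

SCAFFOLD_STEPS = [
    "Restate the problem abstractly and identify what kind of problem it is.",
    "Outline a high-level plan before working through any details.",
    "Consider at least two different representations of the problem.",
    "After each major step, check intermediate results for consistency.",
    "Before answering, verify the solution against the original constraints.",
]

def build_scaffolded_prompt(question: str) -> str:
    """Wrap a raw question with explicit reasoning-structure instructions."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(SCAFFOLD_STEPS, start=1))
    return (
        "Solve the problem below, structuring your reasoning as follows:\n"
        f"{steps}\n\nProblem:\n{question}"
    )

if __name__ == "__main__":
    # The scaffolded prompt is sent to the model in place of the bare question.
    print(build_scaffolded_prompt("If a train leaves at 3pm traveling 60 mph..."))
```

The point the abstract makes is that models already possess these behaviors; a scaffold like this merely elicits them at inference time rather than teaching anything new.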
Community
We built a comprehensive framework grounded in cognitive science to understand how LLMs actually reason. By analyzing 192K+ reasoning traces from 18 models alongside human think-aloud traces, we found that:
- Models use fundamentally different reasoning strategies than humans: they rely on shallow sequential processing instead of hierarchical planning and metacognitive monitoring
- Models deploy behaviors inversely to what success requires: they narrow their behavioral repertoire precisely when they should expand it
- Latent capabilities exist but aren't spontaneously expressed: test-time guidance unlocks up to 72% performance gains
- Research focuses on easily measured behaviors while neglecting the meta-cognitive controls that correlate with success
Check out our paper, code, dataset, and blog:
Excellent work! The 28-element cognitive framework and 192K annotated traces provide exactly the principled diagnostic lens our field needs.
Clarifying question on scope: The empirical analysis focuses on open models (1.5B-671B params) where traces are accessible, but the title suggests broader applicability. Do you expect similar cognitive element underutilization patterns in frontier foundation models (GPT-4o, Claude 3.5/4, Gemini)?
I ask because: (1) frontier models demonstrably benefit from scaffolding (CoT, self-reflection), suggesting your framework remains relevant, but (2) they often exhibit more spontaneous meta-cognition at scale. Without trace access to proprietary models, it's unclear if the cognitive gaps persist or are more pronounced in smaller models.
Suggestion: Would explicitly scoping findings to "open reasoning SLMs" (while noting frontier models as future work) strengthen the claims? Or perhaps outcome-only proxy experiments on frontier models could test generalizability?
Regardless, this taxonomy is exactly what we need to move beyond ad-hoc prompt engineering. Excited about applications to inference-time scaffolding and domain-specific reasoning frameworks!
P.S. Anthropic's Extended Thinking traces ARE available
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models (2025)
- Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization (2025)
- Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective (2025)
- MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models (2025)
- Modeling Hierarchical Thinking in Large Reasoning Models (2025)
- An Empirical Study of Reasoning Steps in Thinking Code LLMs (2025)
- What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation (2025)