Cognitive Foundations for Reasoning and Their Manifestation in LLMs
Abstract
LLMs exhibit reasoning gaps compared to humans, underutilizing cognitive elements and failing to deploy meta-cognitive controls, but test-time guidance can improve their performance on complex problems.
Large language models (LLMs) solve complex problems yet fail on simpler variants, suggesting they achieve correct outputs through mechanisms fundamentally different from human reasoning. To understand this gap, we synthesize cognitive science research into a taxonomy of 28 cognitive elements spanning reasoning invariants, meta-cognitive controls, representations for organizing reasoning and knowledge, and transformation operations. We introduce a fine-grained evaluation framework and conduct the first large-scale empirical analysis of 192K traces from 18 models across text, vision, and audio, complemented by 54 human think-aloud traces, which we make publicly available. We find that models under-utilize cognitive elements correlated with success, narrowing to rigid sequential processing on ill-structured problems where diverse representations and meta-cognitive monitoring are critical. Human traces show more abstraction and conceptual processing, while models default to surface-level enumeration. Meta-analysis of 1.6K LLM reasoning papers reveals that the research community concentrates on easily quantifiable elements (sequential organization: 55%, decomposition: 60%) while neglecting meta-cognitive controls (self-awareness: 16%) that correlate with success. Models possess behavioral repertoires associated with success but fail to deploy them spontaneously. Leveraging these patterns, we develop test-time reasoning guidance that automatically scaffolds successful structures, improving performance by up to 66.7% on complex problems. By establishing a shared vocabulary between cognitive science and LLM research, our framework enables systematic diagnosis of reasoning failures and principled development of models that reason through robust cognitive mechanisms rather than spurious shortcuts, while providing tools to test theories of human cognition at scale.
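As a rough illustration of what such test-time reasoning guidance could look like in practice, here is a minimal prompt-scaffolding sketch. The scaffold steps, wording, and function names below are illustrative assumptions, not the paper's actual taxonomy or method.

```python
# Minimal sketch of test-time reasoning guidance via prompt scaffolding.
# Assumption: the steps below only paraphrase the kinds of cognitive elements
# the paper discusses; they are NOT the paper's exact 28-element taxonomy.

SCAFFOLD_STEPS = [
    "Restate the problem abstractly and identify what kind of problem it is.",
    "Outline a high-level plan before working through any details.",
    "Consider at least two different representations of the problem.",
    "After each major step, check intermediate results for consistency.",
    "Before answering, verify the solution against the original constraints.",
]

def build_scaffolded_prompt(question: str) -> str:
    """Wrap a raw question with explicit reasoning-structure instructions."""
    steps = "\n".join(f"{i}. {s}" for i, s in enumerate(SCAFFOLD_STEPS, start=1))
    return (
        "Solve the problem below, structuring your reasoning as follows:\n"
        f"{steps}\n\nProblem:\n{question}"
    )

if __name__ == "__main__":
    # The scaffolded prompt is sent to the model in place of the bare question.
    print(build_scaffolded_prompt("If a train leaves at 3pm traveling 60 mph..."))
```

The point the abstract makes is that models already possess these behaviors; a scaffold like this merely elicits them at inference time rather than teaching anything new.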
Community
We built a comprehensive framework grounded in cognitive science to understand how LLMs actually reason. By analyzing 192K+ reasoning traces from 18 models alongside human think-aloud traces, we found that:
- Models use fundamentally different reasoning strategies than humans: they rely on shallow sequential processing instead of hierarchical planning and metacognitive monitoring
- Models deploy behaviors inversely to what success requires: they narrow their behavioral repertoire precisely when they should expand it
- Latent capabilities exist but aren't spontaneously expressed: test-time guidance unlocks up to 72% performance gains
- Research focuses on easily measured behaviors while neglecting the meta-cognitive controls that correlate with success
Check out our paper, code, dataset, and blog:
Excellent work! The 28-element cognitive framework and 192K annotated traces provide exactly the principled diagnostic lens our field needs.
Clarifying question on scope: The empirical analysis focuses on open models (1.5B-671B params) where traces are accessible, but the title suggests broader applicability. Do you expect similar cognitive element underutilization patterns in frontier foundation models (GPT-4o, Claude 3.5/4, Gemini)?
I ask because: (1) frontier models demonstrably benefit from scaffolding (CoT, self-reflection), suggesting your framework remains relevant, but (2) they often exhibit more spontaneous meta-cognition at scale. Without trace access to proprietary models, it's unclear if the cognitive gaps persist or are more pronounced in smaller models.
Suggestion: Would explicitly scoping findings to "open reasoning SLMs" (while noting frontier models as future work) strengthen the claims? Or perhaps outcome-only proxy experiments on frontier models could test generalizability?
Regardless, this taxonomy is exactly what we need to move beyond ad-hoc prompt engineering. Excited about applications to inference-time scaffolding and domain-specific reasoning frameworks!
P.S. Anthropic's Extended Thinking traces ARE available
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Reasoning Riddles: How Explainability Reveals Cognitive Limits in Vision-Language Models (2025)
- Bridging Reasoning to Learning: Unmasking Illusions using Complexity Out of Distribution Generalization (2025)
- Benchmarking Chinese Commonsense Reasoning with a Multi-hop Reasoning Perspective (2025)
- MM-OPERA: Benchmarking Open-ended Association Reasoning for Large Vision-Language Models (2025)
- Modeling Hierarchical Thinking in Large Reasoning Models (2025)
- An Empirical Study of Reasoning Steps in Thinking Code LLMs (2025)
- What Defines Good Reasoning in LLMs? Dissecting Reasoning Steps with Multi-Aspect Evaluation (2025)