Large language models perform impressively on benchmarks, coding, and natural language generation, yet they still fail on reasoning problems that humans find simple, especially when several constraints must be combined step by step.
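For concreteness, here is the kind of multi-step composition I mean. The task and helper below are invented purely for illustration: each individual step is trivial, and the whole chain is trivially executable as code, yet models frequently drift when asked to track every intermediate state in sequence.

```python
# Toy "compositional" task: apply a chain of simple string
# transformations in a fixed order. Each step is easy; the
# difficulty for an LLM lies in composing them exactly.
# (Illustrative only -- task and helper are invented here.)

def compose(*fns):
    """Return a function that applies fns left to right."""
    def composed(x):
        for f in fns:
            x = f(x)
        return x
    return composed

steps = [
    lambda s: s[::-1],              # reverse the string
    lambda s: s.upper(),            # uppercase it
    lambda s: s.replace("A", "4"),  # substitute A -> 4
]

task = compose(*steps)
print(task("banana"))  # -> 4N4N4B
```

A program gets this right by construction; a model must serialize the same computation through its own intermediate tokens, which is where errors tend to compound.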
I want to understand this from a technical perspective.
Why do these models remain weak at compositional reasoning despite scaling of parameter count and training data?
Is this primarily a limitation of autoregressive next-token prediction, training objective mismatch, lack of symbolic structure, or benchmark contamination effects?
I’m looking for an explanation grounded in model architecture and learning behavior rather than anecdotal examples.