Large language models perform impressively on benchmarks, coding, and natural language generation, yet they still fail on reasoning problems that humans find simple, especially when several constraints must be combined step by step.
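For concreteness, here is the kind of multi-step composition I mean. The task and helper below are invented purely for illustration: each individual step is trivial, and the whole chain is trivially executable as code, yet models frequently drift when asked to track every intermediate state in sequence.

```python
# Toy "compositional" task: apply a chain of simple string
# transformations in a fixed order. Each step is easy; the
# difficulty for an LLM lies in composing them exactly.
# (Illustrative only -- task and helper are invented here.)

def compose(*fns):
    """Return a function that applies fns left to right."""
    def composed(x):
        for f in fns:
            x = f(x)
        return x
    return composed

steps = [
    lambda s: s[::-1],              # reverse the string
    lambda s: s.upper(),            # uppercase it
    lambda s: s.replace("A", "4"),  # substitute A -> 4
]

task = compose(*steps)
print(task("banana"))  # -> 4N4N4B
```

A program gets this right by construction; a model must serialize the same computation through its own intermediate tokens, which is where errors tend to compound.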
I want to understand this from a technical perspective.
Why do these models remain weak at compositional reasoning despite scaling of parameter count and training data?
Is this primarily a limitation of autoregressive next-token prediction, training objective mismatch, lack of symbolic structure, or benchmark contamination effects?
I’m looking for an explanation grounded in model architecture and learning behavior rather than anecdotal examples.