
Transformers are designed to capture long-range dependencies better than RNNs and LSTMs, but in practice, many models still fail to maintain consistent long-term reasoning.

For example, when working with:

  • Multi-step reasoning tasks
  • Chain-of-thought explanations
  • Long context sequences
  • Multi-hop QA

I still observe issues like hallucinations, loss of context, and sudden logical jumps.

What are the main reasons for this?

  • Is it a limitation of self-attention?
  • Context window length?
  • Token decay?
  • Training data distribution?
  • Optimization constraints?

Also, is there any research on architectures that improve long-term reasoning (e.g., RNN-Transformers, State-Space Models, Memory-Augmented LLMs)?

1 Answer

Transformers use self-attention, unlike RNNs/LSTMs, which propagate information sequentially and as a result suffer from vanishing/exploding gradients and a compression bottleneck in their fixed-size hidden state.
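
To make the contrast concrete, here is a toy NumPy sketch (illustrative only; the weight matrices and dimensions are made up): the RNN must squeeze the entire history through one fixed-size hidden vector, while self-attention gives every position direct access to every other position regardless of distance.

```python
import numpy as np

def rnn_step(h, x, W_h, W_x):
    # Recurrent update: all past information must pass through
    # the fixed-size hidden state h (the compression bottleneck).
    return np.tanh(W_h @ h + W_x @ x)

def self_attention(X):
    # Single-head self-attention with projections omitted for brevity:
    # every position attends directly to every other position, so a
    # distance-1000 dependency costs the same as a distance-1 dependency.
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                        # (n, n) pairwise scores
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)                   # softmax over each row
    return w @ X                                         # each output mixes all inputs

rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.normal(size=(n, d))

# RNN: information flows one step at a time through h.
h = np.zeros(d)
W_h = rng.normal(size=(d, d)) * 0.1
W_x = rng.normal(size=(d, d)) * 0.1
for x in X:
    h = rnn_step(h, x, W_h, W_x)

out = self_attention(X)
print(h.shape, out.shape)  # (4,) (8, 4)
```

Note the shapes: the RNN's summary of the whole sequence is a single `d`-dimensional vector, whereas attention produces one output per position, each built from the full sequence.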

Self-attention still suffers from issues that can cause models to fail to capture long-term dependencies, such as:

  • Attention weights dilute over long contexts.
  • Positional encodings degrade at long distances (e.g., RoPE extrapolation).
  • Models are trained mostly on shorter sequences.
  • No explicit memory, which makes multi-hop reasoning harder.

In practice, the ability to handle long contexts varies greatly between models, e.g. see this ICML 2025 paper: NoLiMa: Long-Context Evaluation Beyond Literal Matching (GitHub).

It's even more challenging when using linear attention, e.g. see Lizard: An Efficient Linearization Framework for Large Language Models.


