
I am looking for academic references or frameworks that draw a formal distinction between Large Reasoning Models (LRMs), such as DeepSeek-R1 or the OpenAI o-series, and standard instruction-following LLMs (e.g., GPT-4o, Qwen-2.5).

While standard models can also generate Chain-of-Thought (CoT) traces, recent research suggests that "true" reasoning is not just a matter of response length or the presence of thinking tokens, but of the internal cognitive structure of the trace.

Specifically, I am interested in:

Structural Taxonomies: Frameworks like ThinkARM (https://arxiv.org/abs/2512.19995), which uses Schoenfeld's Episode Theory to decompose a reasoning trace into functional episodes such as Analyze, Explore, Verify, and Monitor.

Dynamic Patterns: Evidence of "cognitive heartbeats" or characteristic transition loops (e.g., Explore-Monitor/Verify loops) that are prevalent in reasoning models but largely absent in standard ones.

Efficiency vs. Reasoning: How pruning or efficiency methods (like L1 or ThinkPrune) alter these functional episodes rather than just shortening the text.
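To make concrete what I mean by "dynamic patterns" above, here is a toy sketch (the episode labels and traces are hypothetical, hand-annotated examples, not from any of the cited papers): segment a trace into functional episodes, count bigram transitions, and look for Explore-Verify loops that an LRM might exhibit but a standard model might not.

```python
from collections import Counter

def transition_counts(episodes):
    """Count transitions between consecutive functional episodes
    in a trace segmented with Schoenfeld-style labels."""
    return Counter(zip(episodes, episodes[1:]))

# Hypothetical, hand-annotated episode sequences for illustration only.
lrm_trace = ["Analyze", "Explore", "Verify", "Explore", "Monitor",
             "Explore", "Verify", "Monitor"]
std_trace = ["Analyze", "Explore", "Explore", "Explore"]

# The LRM trace shows repeated Explore->Verify transitions (a "loop");
# the standard trace degenerates into monotone Explore->Explore steps.
print(transition_counts(lrm_trace))
print(transition_counts(std_trace))
```

This is the kind of process-level statistic (transition matrices over functional episodes rather than token counts) that I am hoping existing work has already formalized.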

Are there other seminal papers or emerging benchmarks (besides Omni-MATH or GSM8K) that focus on process-oriented evaluation of these models rather than only the final answer?
