I am looking for academic references or frameworks that provide a formal distinction between Large Reasoning Models (LRMs)—like DeepSeek-R1 or the OpenAI o-series—and standard instruction-following LLMs (e.g., GPT-4o, Qwen-2.5).
While standard models can generate Chain-of-Thought (CoT) traces, recent research suggests that "true" reasoning is not merely a matter of response length or the presence of thinking tokens, but of the internal cognitive structure of the trace.
Specifically, I am interested in:
Structural Taxonomies: Frameworks like ThinkARM (https://arxiv.org/abs/2512.19995), which uses Schoenfeld’s Episode Theory to break down reasoning into functional steps like Analyze, Explore, Verify, and Monitor.
Dynamic Patterns: Evidence of "cognitive heartbeats" or specific transition loops (e.g., Explore-Monitor/Verify loops) that are prevalent in reasoning models but absent in standard ones.
Efficiency vs. Reasoning: How pruning or efficiency methods (like L1 or ThinkPrune) alter these functional episodes rather than just shortening the text.
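To make the "dynamic patterns" point concrete, here is a minimal sketch of the kind of process-level analysis I have in mind: counting episode-to-episode transitions in a labeled trace to see whether an Explore-Verify cycle dominates. The episode labels and the example trace are hypothetical, purely to illustrate the analysis, not taken from any specific paper.

```python
from collections import Counter

# Hypothetical episode labels for one reasoning trace, using a
# Schoenfeld-style taxonomy (Analyze / Explore / Verify / Monitor).
trace = ["Analyze", "Explore", "Verify", "Explore",
         "Verify", "Monitor", "Explore", "Verify"]

# Count bigram transitions between consecutive episodes.
transitions = Counter(zip(trace, trace[1:]))

# A "cognitive heartbeat" would appear as a dominant recurring cycle,
# e.g. Explore -> Verify, rather than a single linear pass.
total = sum(transitions.values())
for (src, dst), n in transitions.most_common():
    print(f"{src} -> {dst}: {n} ({n / total:.0%})")
```

The interesting comparison would be between such transition distributions for an LRM versus a standard instruction-following model on the same problems.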
Are there other seminal papers or emerging benchmarks (besides Omni-MATH or GSM8K) that focus on process-oriented evaluation of these models, rather than final-answer accuracy alone?