fix(cli): buffer reasoning preview chunks and fix duplicate display by teknium1 · Pull Request #3013 · NousResearch/hermes-agent

teknium1 · 2026-03-25T19:16:30Z

Summary

Three improvements to reasoning/thinking display in the CLI:

1. Buffer tiny reasoning chunks. Providers like DeepSeek stream reasoning one word at a time, which produced a separate "[thinking] <word>" line per token. This PR adds a buffer that coalesces chunks and flushes at natural boundaries (newlines, sentence endings, terminal width).

2. Fix duplicate reasoning display. Centralizes callback selection into _current_reasoning_callback() — one method instead of 4 scattered inline ternaries. Prevents the streaming box and preview callback from firing simultaneously.

3. Fix post-response reasoning box guard. Changes the check from not self._stream_started to not self._reasoning_stream_started, so the final reasoning box is only suppressed when reasoning was actually streamed live, not whenever any text was streamed.
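
The buffering idea in item 1 can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the class name `ReasoningPreviewBuffer`, its `feed`/`flush` methods, and the exact boundary rules are assumptions made for the example.

```python
import shutil
from typing import Callable, Optional


class ReasoningPreviewBuffer:
    """Hypothetical helper: coalesce tiny reasoning chunks into preview lines."""

    # Assumed sentence boundaries: punctuation followed by a space.
    SENTENCE_ENDINGS = (". ", "! ", "? ")

    def __init__(self, emit: Callable[[str], None], width: Optional[int] = None) -> None:
        self._emit = emit
        # Fall back to the terminal width (80 columns if it cannot be detected).
        self._width = max(1, width or shutil.get_terminal_size((80, 24)).columns)
        self._buf = ""

    def feed(self, chunk: str) -> None:
        """Add a streamed chunk and emit every complete line it produces."""
        self._buf += chunk
        while True:
            cut = self._next_cut()
            if cut is None:
                return
            line, self._buf = self._buf[:cut].strip(), self._buf[cut:]
            if line:
                self._emit(line)

    def flush(self) -> None:
        """Emit whatever is left, e.g. when the reasoning stream ends."""
        line, self._buf = self._buf.strip(), ""
        if line:
            self._emit(line)

    def _next_cut(self) -> Optional[int]:
        """Return the index to cut at, or None if no boundary was reached."""
        cuts = []
        newline = self._buf.find("\n")
        if newline != -1:
            cuts.append(newline + 1)
        for ending in self.SENTENCE_ENDINGS:
            idx = self._buf.find(ending)
            if idx != -1:
                cuts.append(idx + len(ending))
        if cuts:
            return min(cuts)  # earliest natural boundary wins
        if len(self._buf) >= self._width:
            # Prefer breaking at the last space that fits within the width.
            space = self._buf.rfind(" ", 0, self._width)
            return space + 1 if space > 0 else self._width
        return None
```

A caller would pass each streamed reasoning chunk to `feed()` and call `flush()` once the stream ends, so the preview prints whole phrases instead of one line per token.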

Cherry-picked from PR #2781 by @juanfradb.

Test plan

Unit tests: All 62 reasoning + streaming tests pass (7 new tests for buffering, mode selection, and terminal width)

Live PTY testing:

Claude Sonnet via OpenRouter: reasoning box + tool call + response — clean, no duplicates
Claude Sonnet via Anthropic direct: clean, tool calls work
Verbose mode with /reasoning on → off toggle: correct mode switching, no [thinking] leak alongside reasoning box
Quick query mode (-q): works on both OpenRouter and Anthropic direct
Zero ANSI artifacts across all tests

Three improvements to reasoning/thinking display in the CLI: 1. Buffer tiny reasoning chunks: providers like DeepSeek stream reasoning one word at a time, producing a separate [thinking] line per token. Add a buffer that coalesces chunks and flushes at natural boundaries (newlines, sentence endings, terminal width). 2. Fix duplicate reasoning display: centralize callback selection into _current_reasoning_callback() — one place instead of 4 scattered inline ternaries. Prevents both the streaming box AND the preview callback from firing simultaneously. 3. Fix post-response reasoning box guard: change the check from 'not self._stream_started' to 'not self._reasoning_stream_started' so the final reasoning box is only suppressed when reasoning was actually streamed live, not when any text was streamed. Cherry-picked from PR #2781 by juanfradb.
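
Items 2 and 3 of this commit message can be illustrated with a small sketch. Only `_current_reasoning_callback`, `_stream_delta`, `_stream_started`, and `_reasoning_stream_started` are names taken from the PR text; the class, the `show_reasoning` and `verbose` flags, and the other methods are hypothetical stand-ins, and the real selection logic may differ.

```python
from typing import Callable, List, Optional, Tuple


class CliSketch:
    """Hypothetical stand-in for the CLI object, for illustration only."""

    def __init__(self, show_reasoning: bool, verbose: bool) -> None:
        self.show_reasoning = show_reasoning  # assumed: set by /reasoning on|off
        self.verbose = verbose                # assumed: verbose-mode flag
        self._stream_started = False
        self._reasoning_stream_started = False
        self.events: List[Tuple[str, str]] = []  # records what was displayed

    def _stream_reasoning_delta(self, text: str) -> None:
        # Live reasoning box (hypothetical name).
        self._reasoning_stream_started = True
        self.events.append(("box", text))

    def _on_reasoning_preview(self, text: str) -> None:
        # "[thinking]" preview line (hypothetical name).
        self.events.append(("preview", text))

    def _stream_delta(self, text: str) -> None:
        # Ordinary response text; does not count as streamed reasoning.
        self._stream_started = True
        self.events.append(("text", text))

    def _current_reasoning_callback(self) -> Optional[Callable[[str], None]]:
        """Single decision point: at most one reasoning display is active."""
        if self.show_reasoning:
            return self._stream_reasoning_delta
        if self.verbose:
            return self._on_reasoning_preview
        return None

    def should_show_final_reasoning_box(self, reasoning: Optional[str]) -> bool:
        # Guard on reasoning having streamed, not on any text having streamed.
        return bool(reasoning) and self.show_reasoning and not self._reasoning_stream_started
```

Because every call site asks the same method, the streaming box and the preview callback cannot both be selected for one turn, and the final box still appears when only response text was streamed.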

…ng during streaming Local models (Ollama, LM Studio) embed reasoning in <think> tags in delta.content. During streaming, _stream_delta() already displays these blocks. Then _build_assistant_message() extracts them again and fires reasoning_callback, causing duplicate display. Track whether reasoning came from structured fields (reasoning_content) vs <think> tag extraction. Only fire the callback for <think>-extracted reasoning when stream_delta_callback is NOT active. Structured reasoning always fires regardless. Salvaged from PR #2076 by dusterbloom (Fix A only — Fix B was already covered by PR #3013's _current_reasoning_callback centralization). Closes #2069.
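
The rule described in this commit message can be sketched as below. The names `reasoning_content`, `reasoning_callback`, and `stream_delta_callback` come from the message; the two functions, the regular expression, and the way multiple blocks are joined are assumptions for illustration and are not the code in `_build_assistant_message()`.

```python
import re
from typing import Callable, Optional, Tuple

# Assumed pattern for inline reasoning blocks in delta.content.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_reasoning(content: str, reasoning_content: Optional[str]) -> Tuple[Optional[str], bool]:
    """Return (reasoning, from_think_tags); structured fields take priority."""
    if reasoning_content:
        return reasoning_content, False
    blocks = [b.strip() for b in _THINK_RE.findall(content or "")]
    blocks = [b for b in blocks if b]
    if blocks:
        return "\n\n".join(blocks), True
    return None, False


def maybe_fire_reasoning_callback(
    content: str,
    reasoning_content: Optional[str],
    reasoning_callback: Optional[Callable[[str], None]],
    stream_delta_callback: Optional[Callable[[str], None]],
) -> bool:
    """Fire reasoning_callback unless streaming already showed the <think> text."""
    reasoning, from_think_tags = extract_reasoning(content, reasoning_content)
    if not reasoning or reasoning_callback is None:
        return False
    # <think> blocks were already displayed by the streaming path, so skip them.
    if from_think_tags and stream_delta_callback is not None:
        return False
    # Structured reasoning always fires, whether or not streaming is active.
    reasoning_callback(reasoning)
    return True
```

Tracking where the reasoning came from lets the duplicate be suppressed only for tag-extracted text, which is the one case the streaming path has already displayed.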

teknium1 merged commit 8f6ef04 into main Mar 25, 2026
1 of 2 checks passed

teknium1 mentioned this pull request Mar 25, 2026

Fix reasoning preview streaming and duplicate display #2781

Closed

teknium1 mentioned this pull request Mar 26, 2026

fix(reasoning): skip duplicate callback for <think>-extracted reasoning during streaming #3116

Merged

teknium1 mentioned this pull request Mar 26, 2026

fix(reasoning): prevent duplicate display and [thinking] spam for local models #2076

Closed

teknium1 merged 1 commit into main from hermes/hermes-7d7ac769