Skip to content

fix(cli): buffer reasoning preview chunks and fix duplicate display#3013

Merged
teknium1 merged 1 commit intomainfrom
hermes/hermes-7d7ac769
Mar 25, 2026
Merged

fix(cli): buffer reasoning preview chunks and fix duplicate display#3013
teknium1 merged 1 commit intomainfrom
hermes/hermes-7d7ac769

Conversation

@teknium1
Copy link
Copy Markdown
Contributor

Summary

Three improvements to reasoning/thinking display in the CLI:

1. Buffer tiny reasoning chunks. Providers like DeepSeek stream reasoning one word at a time, producing a separate [thinking] word line per token. Adds a buffer that coalesces chunks and flushes at natural boundaries (newlines, sentence endings, terminal width).

2. Fix duplicate reasoning display. Centralizes callback selection into _current_reasoning_callback() — one method instead of 4 scattered inline ternaries. Prevents the streaming box and preview callback from firing simultaneously.

3. Fix post-response reasoning box guard. Changes check from not self._stream_started to not self._reasoning_stream_started, so the final reasoning box is only suppressed when reasoning was actually streamed live.

Cherry-picked from PR #2781 by @juanfradb.

Test plan

Unit tests: All 62 reasoning + streaming tests pass (7 new tests for buffering, mode selection, and terminal width)

Live PTY testing:

  • Claude Sonnet via OpenRouter: reasoning box + tool call + response — clean, no duplicates
  • Claude Sonnet via Anthropic direct: clean, tool calls work
  • Verbose mode with /reasoning on → off toggle: correct mode switching, no [thinking] leak alongside reasoning box
  • Quick query mode (-q): works on both OpenRouter and Anthropic direct
  • Zero ANSI artifacts across all tests
Three improvements to reasoning/thinking display in the CLI:

1. Buffer tiny reasoning chunks: providers like DeepSeek stream reasoning
   one word at a time, producing a separate [thinking] line per token.
   Add a buffer that coalesces chunks and flushes at natural boundaries
   (newlines, sentence endings, terminal width).

2. Fix duplicate reasoning display: centralize callback selection into
   _current_reasoning_callback() — one place instead of 4 scattered
   inline ternaries. Prevents both the streaming box AND the preview
   callback from firing simultaneously.

3. Fix post-response reasoning box guard: change the check from
   'not self._stream_started' to 'not self._reasoning_stream_started'
   so the final reasoning box is only suppressed when reasoning was
   actually streamed live, not when any text was streamed.

Cherry-picked from PR #2781 by juanfradb.
@teknium1 teknium1 merged commit 8f6ef04 into main Mar 25, 2026
1 of 2 checks passed
InB4DevOps pushed a commit to InB4DevOps/hermes-agent that referenced this pull request Mar 25, 2026
…ousResearch#3013)

Three improvements to reasoning/thinking display in the CLI:

1. Buffer tiny reasoning chunks: providers like DeepSeek stream reasoning
   one word at a time, producing a separate [thinking] line per token.
   Add a buffer that coalesces chunks and flushes at natural boundaries
   (newlines, sentence endings, terminal width).

2. Fix duplicate reasoning display: centralize callback selection into
   _current_reasoning_callback() — one place instead of 4 scattered
   inline ternaries. Prevents both the streaming box AND the preview
   callback from firing simultaneously.

3. Fix post-response reasoning box guard: change the check from
   'not self._stream_started' to 'not self._reasoning_stream_started'
   so the final reasoning box is only suppressed when reasoning was
   actually streamed live, not when any text was streamed.

Cherry-picked from PR NousResearch#2781 by juanfradb.
teknium1 added a commit that referenced this pull request Mar 26, 2026
…ng during streaming

Local models (Ollama, LM Studio) embed reasoning in <think> tags in
delta.content. During streaming, _stream_delta() already displays these
blocks. Then _build_assistant_message() extracts them again and fires
reasoning_callback, causing duplicate display.

Track whether reasoning came from structured fields (reasoning_content)
vs <think> tag extraction. Only fire the callback for <think>-extracted
reasoning when stream_delta_callback is NOT active. Structured reasoning
always fires regardless.

Salvaged from PR #2076 by dusterbloom (Fix A only — Fix B was already
covered by PR #3013's _current_reasoning_callback centralization).
Closes #2069.
teknium1 added a commit that referenced this pull request Mar 26, 2026
…ng during streaming (#3116)

Local models (Ollama, LM Studio) embed reasoning in <think> tags in
delta.content. During streaming, _stream_delta() already displays these
blocks. Then _build_assistant_message() extracts them again and fires
reasoning_callback, causing duplicate display.

Track whether reasoning came from structured fields (reasoning_content)
vs <think> tag extraction. Only fire the callback for <think>-extracted
reasoning when stream_delta_callback is NOT active. Structured reasoning
always fires regardless.

Salvaged from PR #2076 by dusterbloom (Fix A only — Fix B was already
covered by PR #3013's _current_reasoning_callback centralization).
Closes #2069.
outsourc-e pushed a commit to outsourc-e/hermes-agent that referenced this pull request Mar 26, 2026
…ousResearch#3013)

Three improvements to reasoning/thinking display in the CLI:

1. Buffer tiny reasoning chunks: providers like DeepSeek stream reasoning
   one word at a time, producing a separate [thinking] line per token.
   Add a buffer that coalesces chunks and flushes at natural boundaries
   (newlines, sentence endings, terminal width).

2. Fix duplicate reasoning display: centralize callback selection into
   _current_reasoning_callback() — one place instead of 4 scattered
   inline ternaries. Prevents both the streaming box AND the preview
   callback from firing simultaneously.

3. Fix post-response reasoning box guard: change the check from
   'not self._stream_started' to 'not self._reasoning_stream_started'
   so the final reasoning box is only suppressed when reasoning was
   actually streamed live, not when any text was streamed.

Cherry-picked from PR NousResearch#2781 by juanfradb.
outsourc-e pushed a commit to outsourc-e/hermes-agent that referenced this pull request Mar 26, 2026
…ng during streaming (NousResearch#3116)

Local models (Ollama, LM Studio) embed reasoning in <think> tags in
delta.content. During streaming, _stream_delta() already displays these
blocks. Then _build_assistant_message() extracts them again and fires
reasoning_callback, causing duplicate display.

Track whether reasoning came from structured fields (reasoning_content)
vs <think> tag extraction. Only fire the callback for <think>-extracted
reasoning when stream_delta_callback is NOT active. Structured reasoning
always fires regardless.

Salvaged from PR NousResearch#2076 by dusterbloom (Fix A only — Fix B was already
covered by PR NousResearch#3013's _current_reasoning_callback centralization).
Closes NousResearch#2069.
StreamOfRon pushed a commit to StreamOfRon/hermes-agent that referenced this pull request Mar 29, 2026
…ousResearch#3013)

Three improvements to reasoning/thinking display in the CLI:

1. Buffer tiny reasoning chunks: providers like DeepSeek stream reasoning
   one word at a time, producing a separate [thinking] line per token.
   Add a buffer that coalesces chunks and flushes at natural boundaries
   (newlines, sentence endings, terminal width).

2. Fix duplicate reasoning display: centralize callback selection into
   _current_reasoning_callback() — one place instead of 4 scattered
   inline ternaries. Prevents both the streaming box AND the preview
   callback from firing simultaneously.

3. Fix post-response reasoning box guard: change the check from
   'not self._stream_started' to 'not self._reasoning_stream_started'
   so the final reasoning box is only suppressed when reasoning was
   actually streamed live, not when any text was streamed.

Cherry-picked from PR NousResearch#2781 by juanfradb.
StreamOfRon pushed a commit to StreamOfRon/hermes-agent that referenced this pull request Mar 29, 2026
…ng during streaming (NousResearch#3116)

Local models (Ollama, LM Studio) embed reasoning in <think> tags in
delta.content. During streaming, _stream_delta() already displays these
blocks. Then _build_assistant_message() extracts them again and fires
reasoning_callback, causing duplicate display.

Track whether reasoning came from structured fields (reasoning_content)
vs <think> tag extraction. Only fire the callback for <think>-extracted
reasoning when stream_delta_callback is NOT active. Structured reasoning
always fires regardless.

Salvaged from PR NousResearch#2076 by dusterbloom (Fix A only — Fix B was already
covered by PR NousResearch#3013's _current_reasoning_callback centralization).
Closes NousResearch#2069.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

1 participant