fix(agent): detect thinking-budget exhaustion on truncation, skip useless retries #3444

Merged
teknium1 merged 1 commit into main from hermes/hermes-9420d6a3
Mar 27, 2026
Conversation

@teknium1
Contributor

Summary

When finish_reason='length' and the response contains only reasoning content (think blocks or empty text), the model exhausted its entire output token budget on thinking with nothing left for the actual response.

Before this fix, two things happened depending on the API mode:

  • chat_completions: 3 useless continuation retry attempts (the model hits the same token limit every time, wasting ~30s and 3 API calls)
  • anthropic/codex: generic "Response truncated due to output length limit" error with rollback — gives no indication that reasoning was the cause

After this fix, the think-only + length condition is detected immediately and a targeted error is returned:

Model used all output tokens on reasoning with none left for the response. Try lowering reasoning effort or increasing max_tokens.

This saves 2 wasted API calls on the chat_completions path and gives users actionable guidance.
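To make the failure mode concrete, here are illustrative response shapes that would trigger the new error path in each API mode. These payloads are assumptions based on the PR description, not captures from the actual APIs:

```python
# chat_completions mode: the model hit its token limit and message.content
# holds only a (truncated, unclosed) think block — no real answer.
chat_response = {
    "choices": [{
        "finish_reason": "length",
        "message": {"content": "<think>Let me work through this step by st"},
    }]
}

# anthropic mode: the response contains only thinking-type content blocks
# and no text-type blocks when the output budget runs out.
anthropic_response = {
    "stop_reason": "max_tokens",
    "content": [{"type": "thinking", "thinking": "Let me work through..."}],
}
```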

What's NOT changed

The existing think-only retry logic for finish_reason='stop' is untouched — that path handles genuine model glitches where the model stopped intentionally but only produced reasoning. Retrying there is correct and useful.

How it works

After extracting finish_reason and before entering the continuation/rollback paths, the code now:

  1. Extracts the text content from the response (mode-aware: chat_completions reads message.content, anthropic reads text-type content blocks)
  2. Checks if content is think-only using the existing _has_content_after_think_block() helper
  3. If think-only AND finish_reason='length' → return immediately with the targeted error
  4. If there's real content → fall through to the existing continuation/rollback logic unchanged
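The four steps above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `extract_text_content`, `check_thinking_exhaustion`, and the inline stand-in for `_has_content_after_think_block()` are all hypothetical names and logic inferred from the description.

```python
import re

THINKING_EXHAUSTED_ERROR = (
    "Model used all output tokens on reasoning with none left for the "
    "response. Try lowering reasoning effort or increasing max_tokens."
)

def _think_only(text: str) -> bool:
    # Stand-in for the existing _has_content_after_think_block() helper:
    # True when nothing but <think> blocks (possibly truncated) remains.
    leftover = re.sub(r"<think>.*?(</think>|$)", "", text, flags=re.DOTALL)
    return not leftover.strip()

def extract_text_content(response: dict, mode: str) -> str:
    # Step 1: mode-aware extraction (hypothetical function name).
    if mode == "chat_completions":
        return response["choices"][0]["message"].get("content") or ""
    # anthropic/codex: concatenate text-type content blocks only.
    return "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )

def check_thinking_exhaustion(response: dict, mode: str, finish_reason: str):
    # Steps 2-3: return the targeted error on think-only + length;
    # step 4: return None so callers fall through to the existing
    # continuation/rollback logic unchanged.
    text = extract_text_content(response, mode)
    if finish_reason == "length" and _think_only(text):
        return THINKING_EXHAUSTED_ERROR
    return None
```

The check runs before the continuation/rollback paths, so the chat_completions retry loop never starts when the budget was already exhausted by reasoning.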

Companion to PR #3426

PR #3426 increased the Anthropic adapter's max_tokens from 16K to the model's native limit (64-128K), which dramatically reduces how often thinking-budget exhaustion occurs. This PR handles the remaining edge cases where it still can happen (user-configured low max_tokens, very complex reasoning with high effort, etc.).

Test plan

  • 3 new tests: think-only + length skips continuation, empty content + length detected, normal truncation still continues
  • Updated existing parametrized test (removed the think-only + continuation case which is now handled differently)
  • Full suite: 6497 passed, 1 pre-existing unrelated failure (test_auto_does_not_select_copilot_from_github_token — env pollution from local HuggingFace token)
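The three new scenarios can be sketched like this, using a minimal stand-in for the detection predicate (the real tests exercise the agent's internals; everything here, including `is_thinking_exhausted`, is illustrative):

```python
import re

def is_thinking_exhausted(text: str, finish_reason: str) -> bool:
    # Stand-in predicate: truncated output whose body is only <think> content.
    leftover = re.sub(r"<think>.*?(</think>|$)", "", text, flags=re.DOTALL)
    return finish_reason == "length" and not leftover.strip()

# 1. think-only + length: continuation is skipped, targeted error surfaces
assert is_thinking_exhausted("<think>ran out mid-reason", "length")

# 2. empty content + length: also detected
assert is_thinking_exhausted("", "length")

# 3. normal truncation (real content present): continuation still runs
assert not is_thinking_exhausted("<think>done</think>Partial answ", "length")
```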
…less retries

When finish_reason='length' and the response contains only reasoning
(think blocks or empty content), the model exhausted its output token
budget on thinking with nothing left for the actual response.

Previously, this fell into either:
- chat_completions: 3 useless continuation retries (model hits same limit)
- anthropic/codex: generic 'Response truncated' error with rollback

Now: detect the think-only + length condition early and return immediately
with a targeted error message: 'Model used all output tokens on reasoning
with none left for the response. Try lowering reasoning effort or
increasing max_tokens.'

This saves 2 wasted API calls on the chat_completions path and gives
users actionable guidance instead of a cryptic error.

The existing think-only retry logic (finish_reason='stop') is unchanged —
that's a genuine model glitch where retrying can help.
@teknium1 force-pushed the hermes/hermes-9420d6a3 branch from bc4ab38 to 4da939a on March 27, 2026 at 20:31
@teknium1 merged commit 8fdfc4b into main on Mar 27, 2026
4 checks passed
