fix(agent): detect thinking-budget exhaustion on truncation, skip useless retries #3444
Merged
…less retries

When finish_reason='length' and the response contains only reasoning (think blocks or empty content), the model exhausted its output token budget on thinking with nothing left for the actual response. Previously, this fell into either:
- chat_completions: 3 useless continuation retries (model hits same limit)
- anthropic/codex: generic 'Response truncated' error with rollback

Now: detect the think-only + length condition early and return immediately with a targeted error message: 'Model used all output tokens on reasoning with none left for the response. Try lowering reasoning effort or increasing max_tokens.'

This saves 2 wasted API calls on the chat_completions path and gives users actionable guidance instead of a cryptic error.

The existing think-only retry logic (finish_reason='stop') is unchanged; that's a genuine model glitch where retrying can help.
bc4ab38 to 4da939a
Summary
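When finish_reason='length' and the response contains only reasoning content (think blocks or empty text), the model exhausted its entire output token budget on thinking with nothing left for the actual response.
Before this fix, one of two things happened depending on the API mode:
- chat_completions: 3 useless continuation retries (the model hits the same limit each time)
- anthropic/codex: a generic 'Response truncated' error with rollback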
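After this fix: the think-only + length condition is detected immediately and a targeted error is returned:
'Model used all output tokens on reasoning with none left for the response. Try lowering reasoning effort or increasing max_tokens.'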
This saves 2 wasted API calls on the chat_completions path and gives users actionable guidance.
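As a rough illustration of the decision described above, the following minimal Python sketch separates the think-only + length case (fail fast) from the think-only + stop case (retry). It is not the code from this PR: the names `is_think_only` and `next_action`, the `<think>` tag format, and the returned action labels are all assumptions made for the example.

```python
import re

# Assumption: reasoning is wrapped in <think>...</think> tags. The real
# project may use a different marker and its own helper for this check.
_THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# A <think> block cut off by truncation never closes; treat it as reasoning too.
_OPEN_THINK = re.compile(r"<think>.*\Z", re.DOTALL)

# Error text quoted from the PR description.
EXHAUSTION_ERROR = (
    "Model used all output tokens on reasoning with none left for the "
    "response. Try lowering reasoning effort or increasing max_tokens."
)


def is_think_only(text):
    """Return True when the text holds only reasoning, or nothing at all."""
    stripped = _THINK_BLOCK.sub("", text or "")
    stripped = _OPEN_THINK.sub("", stripped)
    return stripped.strip() == ""


def next_action(finish_reason, text):
    """Pick a (hypothetical) action label for one model response."""
    if is_think_only(text):
        if finish_reason == "length":
            # Budget exhausted on reasoning: retrying would hit the same limit.
            return ("error", EXHAUSTION_ERROR)
        if finish_reason == "stop":
            # Model stopped on its own with only reasoning: retrying can help.
            return ("retry", None)
    if finish_reason == "length":
        # Truncated but with real content: continuation is still meaningful.
        return ("continue", None)
    return ("done", None)
```

Under these assumptions, `next_action("length", "<think>still reasoning")` yields the targeted error rather than a continuation attempt, while `next_action("stop", "<think>x</think>")` still takes the retry path.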
What's NOT changed
The existing think-only retry logic for finish_reason='stop' is untouched. That path handles genuine model glitches where the model stopped intentionally but only produced reasoning. Retrying there is correct and useful.
How it works
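After extracting finish_reason and before entering the continuation/rollback paths, the code now:
- Extracts the response text (chat_completions reads message.content, anthropic reads text-type content blocks)
- Checks for content after the think block with the _has_content_after_think_block() helper
- If the response is think-only and finish_reason='length' → returns immediately with the targeted error
Companion to PR #3426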
PR #3426 increased the Anthropic adapter's max_tokens from 16K to the model's native limit (64-128K), which dramatically reduces how often thinking-budget exhaustion occurs. This PR handles the remaining edge cases where it still can happen (user-configured low max_tokens, very complex reasoning with high effort, etc.).
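The first step listed under "How it works" reads the response text differently per API mode. The sketch below shows one way that extraction could look, assuming responses are available as plain dicts shaped like the public chat-completions and Anthropic Messages payloads. The function name `extract_visible_text` and the `api_mode` strings are hypothetical, and the codex path is omitted because the description does not say how it is read.

```python
def extract_visible_text(api_mode, response):
    """Collect the user-facing text from a response given as a plain dict.

    Hypothetical helper for illustration; not taken from the PR.
    """
    if api_mode == "chat_completions":
        # Chat-completions style: text lives in message.content, which can
        # be None when the model produced no visible output.
        message = response["choices"][0]["message"]
        return message.get("content") or ""
    if api_mode == "anthropic":
        # Anthropic style: content is a list of typed blocks; only blocks
        # with type "text" carry the visible answer.
        blocks = response.get("content") or []
        return "".join(
            block.get("text", "") for block in blocks if block.get("type") == "text"
        )
    raise ValueError(f"unsupported api_mode: {api_mode!r}")
```

An empty string from this helper combined with finish_reason='length' is the situation the PR targets: the output budget was spent before any visible text was produced.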
Test plan
(… test_auto_does_not_select_copilot_from_github_token: env pollution from local HuggingFace token)