
Fix context overrun crash with local LLM backends#403

Merged
teknium1 merged 1 commit into NousResearch:main from ch3ronsa:fix/context-size-error-phrase
Mar 5, 2026

Conversation

Contributor

ch3ronsa commented Mar 4, 2026

Fixes #348

Problem

Local inference backends (LM Studio, Ollama, llama.cpp) return HTTP 400 with error messages like "Context size has been exceeded" when the context window is full. The context-length error phrase list did not include "context size" or "context window", so these errors fell through to the generic 4xx abort handler — crashing the session instead of triggering compression.

Error flow before this fix:

LM Studio returns 400: "Context size has been exceeded"
  → 413 check: no match
  → 4xx abort check: status 400 matches (400 >= 400 and < 500)
  → ❌ "Non-retryable client error. Aborting immediately."
  → Session crashes, context never compressed
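
The detection is a substring scan over the error body, so the fix comes down to which phrases are in the list. A minimal sketch of that check (the function and constant names here are illustrative, not the project's actual identifiers; the phrases marked NEW are the ones this PR adds):

```python
# Hypothetical sketch of phrase-based context-overrun detection.
CONTEXT_LENGTH_PHRASES = [
    "maximum context length",   # OpenAI / vLLM
    "context length exceeded",
    "context size",             # NEW: LM Studio
    "context window",           # NEW: Ollama
]

def is_context_length_error(message: str) -> bool:
    """Return True if the error body looks like a context-overrun error."""
    lowered = message.lower()
    return any(phrase in lowered for phrase in CONTEXT_LENGTH_PHRASES)
```

Before this PR, "Context size has been exceeded" matched none of the phrases, so the check returned False and the generic 4xx abort fired.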

Fix

  1. Moved context-length check above the generic 4xx handler (same pattern as the existing 413 check) so context errors are caught before they reach the abort path
  2. Added missing phrases to the detection list: "context size" (LM Studio), "context window" (Ollama)
  3. Guarded the 4xx handler with not is_context_length_error so context-related 400s are never treated as non-retryable
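
Putting the three steps together, the handler ordering looks roughly like this (a self-contained sketch with illustrative names, not the project's actual code):

```python
# Sketch of the reordered error handling described above.
CONTEXT_PHRASES = ("maximum context length", "context size", "context window")

def is_context_length_error(message: str) -> bool:
    return any(p in message.lower() for p in CONTEXT_PHRASES)

def handle_http_error(status: int, message: str) -> str:
    if status == 413:
        return "compress"   # existing 413 path, unchanged
    if is_context_length_error(message):
        return "compress"   # step 1: context check now runs before the abort path
    if 400 <= status < 500 and not is_context_length_error(message):
        return "abort"      # steps 2-3: guard is defensive; the early return
                            # above already catches context-related 400s
    return "retry"          # 5xx and other transient errors
```

The `not is_context_length_error` guard is redundant once the check runs first, but it makes the abort branch safe even if the ordering regresses later.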

Error flow after this fix:

LM Studio returns 400: "Context size has been exceeded"
  → 413 check: no match
  → Context-length check: "context size" matches!
  → ⚠️ "Context length exceeded - attempting compression..."
  → 🗜️ Compressed 42 → 18 messages, retrying...
  → ✅ Session continues

Tested error messages

Backend      HTTP  Error message                        Result
LM Studio    400   "Context size has been exceeded"     COMPRESS
Ollama       400   "context window exceeded"            COMPRESS
llama.cpp    400   "the context size is too small"      COMPRESS
vLLM         400   "maximum context length is 8192"     COMPRESS
OpenAI       400   "maximum context length is 128000"   COMPRESS
Auth error   401   "invalid api key"                    ABORT
Model error  404   "model not found"                    ABORT
Generic 400  400   "invalid json in request body"       ABORT
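
The matrix above can be checked mechanically; a self-contained sketch where `classify` is a stand-in for the real handler (names hypothetical):

```python
# Stand-in classifier mirroring the fixed behavior.
PHRASES = ("context size", "context window", "maximum context length")

def classify(status: int, message: str) -> str:
    if status == 413 or any(p in message.lower() for p in PHRASES):
        return "COMPRESS"
    if 400 <= status < 500:
        return "ABORT"
    return "RETRY"

MATRIX = [
    (400, "Context size has been exceeded", "COMPRESS"),    # LM Studio
    (400, "context window exceeded", "COMPRESS"),           # Ollama
    (400, "the context size is too small", "COMPRESS"),     # llama.cpp
    (400, "maximum context length is 8192", "COMPRESS"),    # vLLM
    (400, "maximum context length is 128000", "COMPRESS"),  # OpenAI
    (401, "invalid api key", "ABORT"),
    (404, "model not found", "ABORT"),
    (400, "invalid json in request body", "ABORT"),
]

for status, message, expected in MATRIX:
    assert classify(status, message) == expected, message
```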

Test plan

  • Verified LM Studio's exact error message from #348 now triggers compression
  • Verified Ollama and llama.cpp error patterns also match
  • Verified non-context 4xx errors (auth, model, generic) still abort correctly
  • 413 handling unchanged
…#348)

Local backends (LM Studio, Ollama, llama.cpp) return HTTP 400
with messages like "Context size has been exceeded" when the
context window is full. The error phrase list did not include
"context size" or "context window", so these errors fell through
to the generic 4xx abort handler instead of triggering compression.

Changes:
- Move context-length check above generic 4xx handler so it runs
  first (same pattern as the existing 413 check)
- Add "context size" and "context window" to the phrase list
- Guard 4xx handler with `not is_context_length_error` to prevent
  context-related 400s from being treated as non-retryable
teknium1 merged commit 3220bb8 into NousResearch:main Mar 5, 2026
Contributor

teknium1 commented Mar 5, 2026

Merged in commit 3220bb8. Your PR was based on an older main where the error handler ordering was different, so it had merge conflicts, but the fix was applied with your changes preserved (added the "context size" and "context window" phrases, removed error code 400 from the non-retryable list). Thanks for the thorough analysis and test matrix! 🙏

teknium1 mentioned this pull request Mar 5, 2026
