feat: query local servers for actual context window size#2091
Merged
teknium1 merged 4 commits into NousResearch:main on Mar 20, 2026
Conversation
Instead of defaulting to 2M for unknown local models, query the server API for the real context length. Supports Ollama (/api/show), vLLM (max_model_len), and LM Studio (/v1/models). Results are cached to avoid repeated queries.
When LM Studio has a model loaded with a custom context size (e.g., 122K), prefer that over the model's max_context_length (e.g., 1M). This makes the TUI status bar show the actual runtime context window.
Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall
back to 2M tokens when /v1/models doesn't include context_length.
Adds _query_local_context_length() which queries server-specific APIs:
- LM Studio: /api/v1/models (max_context_length + loaded instances)
- Ollama: /api/show (model_info + num_ctx parameters)
- llama.cpp: /props (n_ctx from default_generation_settings)
- vLLM: /v1/models/{model} (max_model_len)
Prefers loaded instance context over max (e.g., 122K loaded vs 1M max).
Results are cached via save_context_length() to avoid repeated queries.
Also fixes detect_local_server_type() misidentifying LM Studio as
Ollama (LM Studio returns 200 for /api/tags with an error body).
Local models (especially Qwen 3.5) sometimes wrap their entire response inside <think> tags, leaving the actual content empty. Previously this caused 3 retries and then an error, wasting tokens and failing the request. Now, when retries are exhausted and reasoning_text contains the response, it is used as final_response instead of returning an error. The user sees the actual answer instead of "Model generated only think blocks."
Summary
Partially addresses #2057 — auto-detects context window size for local servers instead of falling back to 2M.
Also fixes a bug where models that wrap their entire response in <think> tags cause 3 retries and an error, even though the response content is available in the reasoning.
Changes
1. Local context window detection
Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall back to 2M tokens when /v1/models doesn't include context_length. Adds _query_local_context_length(), which queries server-specific APIs:
- LM Studio: /api/v1/models (max_context_length, loaded_instances[].config.context_length)
- Ollama: /api/show (model_info.*context_length, parameters.num_ctx)
- llama.cpp: /props (default_generation_settings.n_ctx)
- vLLM: /v1/models/{model} (max_model_len)
Prefers the loaded instance context over the model's maximum (e.g., 122K loaded vs 1M max).
2. LM Studio detection fix
detect_local_server_type() misidentified LM Studio as Ollama: LM Studio returns 200 for /api/tags with an error body. Fixed by checking for the "models" key and probing LM Studio first.
3. Think-block-only response recovery
Local models (Qwen 3.5) sometimes wrap their entire response in <think> tags, leaving the content empty. Previously: 3 retries, then an error. Now: the reasoning text is used as the response content.
How to test
96 tests pass (19 new + 77 existing). Full suite: 0 new regressions.
Platform
Tested on Linux (WSL2, Python 3.12) against LM Studio 0.3.x with 44 models.