feat: query local servers for actual context window size#2091

Merged
teknium1 merged 4 commits into NousResearch:main from dusterbloom:fix/lmstudio-context-length-detection
Mar 20, 2026
Conversation

@dusterbloom dusterbloom commented Mar 19, 2026

Summary

Partially addresses #2057 — auto-detects context window size for local servers instead of falling back to 2M.

Also fixes a bug where models that wrap their entire response in <think> tags caused 3 retries and an error, even though the response content was available in the reasoning.

Changes

1. Local context window detection

Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall back to 2M tokens when /v1/models doesn't include context_length. Adds _query_local_context_length() which queries server-specific APIs:

| Server | Endpoint | Context key |
| --- | --- | --- |
| LM Studio | `/api/v1/models` | `max_context_length`, `loaded_instances[].config.context_length` |
| Ollama | `/api/show` | `model_info.*context_length`, `parameters.num_ctx` |
| llama.cpp | `/props` | `default_generation_settings.n_ctx` |
| vLLM | `/v1/models/{model}` | `max_model_len` |

Prefers loaded instance context over max (e.g., 122K loaded vs 1M max).

2. LM Studio detection fix

detect_local_server_type() misidentified LM Studio as Ollama — LM Studio returns 200 for /api/tags with an error body. Fixed by checking for "models" key and probing LM Studio first.
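The essence of the fix can be sketched as a body check rather than a status check (name and shape here are illustrative, not the PR's code):

```python
# Hypothetical sketch: a 200 from /api/tags is only treated as Ollama when
# the body actually contains a "models" key, because LM Studio also answers
# 200 on that path but with an error body.
def looks_like_ollama(status: int, body: dict) -> bool:
    return status == 200 and "models" in body

print(looks_like_ollama(200, {"error": "Unexpected endpoint"}))  # False (LM Studio)
print(looks_like_ollama(200, {"models": []}))                    # True (Ollama)
```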

3. Think-block-only response recovery

Local models (Qwen 3.5) sometimes wrap their entire response in <think> tags, leaving content empty. Previously: 3 retries then error. Now: uses reasoning text as the response content.

How to test

# Context detection (with LM Studio running)
python3 -c "
from agent.model_metadata import _query_local_context_length
print(_query_local_context_length('your-model', 'http://localhost:1234/v1'))
"

# Tests
python3 -m pytest tests/test_model_metadata_local_ctx.py tests/agent/test_model_metadata.py -v

96 tests pass (19 new + 77 existing). Full suite: 0 new regressions.

Platform

Tested on Linux (WSL2, Python 3.12) against LM Studio 0.3.x with 44 models.

mraxai added 4 commits March 19, 2026 21:24
Instead of defaulting to 2M for unknown local models, query the server
API for the real context length. Supports Ollama (/api/show), vLLM
(max_model_len), and LM Studio (/v1/models). Results are cached to
avoid repeated queries.
When LM Studio has a model loaded with a custom context size (e.g.,
122K), prefer that over the model's max_context_length (e.g., 1M).
This makes the TUI status bar show the actual runtime context window.
Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall
back to 2M tokens when /v1/models doesn't include context_length.

Adds _query_local_context_length() which queries server-specific APIs:
- LM Studio: /api/v1/models (max_context_length + loaded instances)
- Ollama: /api/show (model_info + num_ctx parameters)
- llama.cpp: /props (n_ctx from default_generation_settings)
- vLLM: /v1/models/{model} (max_model_len)

Prefers loaded instance context over max (e.g., 122K loaded vs 1M max).
Results are cached via save_context_length() to avoid repeated queries.

Also fixes detect_local_server_type() misidentifying LM Studio as
Ollama (LM Studio returns 200 for /api/tags with an error body).
… blocks

Local models (especially Qwen 3.5) sometimes wrap their entire response
inside <think> tags, leaving actual content empty. Previously this caused
3 retries and then an error, wasting tokens and failing the request.

Now when retries are exhausted and reasoning_text contains the response,
it is used as final_response instead of returning an error. The user
sees the actual answer instead of "Model generated only think blocks."
@dusterbloom dusterbloom changed the title feat: query local servers (LM Studio, Ollama, llama.cpp) for actual context window size Mar 19, 2026
@teknium1 teknium1 merged commit 3a9a1bb into NousResearch:main Mar 20, 2026
1 check passed
@dusterbloom dusterbloom deleted the fix/lmstudio-context-length-detection branch March 30, 2026 12:59