feat: query local servers for actual context window size#2091
Merged
teknium1 merged 4 commits into NousResearch:main on Mar 20, 2026
Conversation
Instead of defaulting to 2M for unknown local models, query the server API for the real context length. Supports Ollama (/api/show), vLLM (max_model_len), and LM Studio (/v1/models). Results are cached to avoid repeated queries.
When LM Studio has a model loaded with a custom context size (e.g., 122K), prefer that over the model's max_context_length (e.g., 1M). This makes the TUI status bar show the actual runtime context window.
Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall
back to 2M tokens when /v1/models doesn't include context_length.
Adds _query_local_context_length() which queries server-specific APIs:
- LM Studio: /api/v1/models (max_context_length + loaded instances)
- Ollama: /api/show (model_info + num_ctx parameters)
- llama.cpp: /props (n_ctx from default_generation_settings)
- vLLM: /v1/models/{model} (max_model_len)
Prefers loaded instance context over max (e.g., 122K loaded vs 1M max).
Results are cached via save_context_length() to avoid repeated queries.
Also fixes detect_local_server_type() misidentifying LM Studio as
Ollama (LM Studio returns 200 for /api/tags with an error body).
Local models (especially Qwen 3.5) sometimes wrap their entire response inside <think> tags, leaving the actual content empty. Previously this caused 3 retries and then an error, wasting tokens and failing the request. Now, when retries are exhausted and reasoning_text contains the response, it is used as final_response instead of returning an error. The user sees the actual answer instead of "Model generated only think blocks."
Summary
Partially addresses #2057 — auto-detects context window size for local servers instead of falling back to 2M.
Also fixes a bug where models that wrap their entire response in <think> tags cause 3 retries and an error, even though the response content is available in the reasoning.
Changes
1. Local context window detection
Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall back to 2M tokens when /v1/models doesn't include context_length. Adds _query_local_context_length(), which queries server-specific APIs:
- LM Studio: /api/v1/models (max_context_length, loaded_instances[].config.context_length)
- Ollama: /api/show (model_info.*context_length, parameters.num_ctx)
- llama.cpp: /props (default_generation_settings.n_ctx)
- vLLM: /v1/models/{model} (max_model_len)
Prefers the loaded instance context over the model's maximum (e.g., 122K loaded vs 1M max).
2. LM Studio detection fix
detect_local_server_type() misidentified LM Studio as Ollama: LM Studio returns 200 for /api/tags with an error body. Fixed by checking for the "models" key and probing LM Studio first.
3. Think-block-only response recovery
Local models (Qwen 3.5) sometimes wrap their entire response in <think> tags, leaving the content empty. Previously: 3 retries, then an error. Now: the reasoning text is used as the response content.
How to test
96 tests pass (19 new + 77 existing). Full suite: 0 new regressions.
Platform
Tested on Linux (WSL2, Python 3.12) against LM Studio 0.3.x with 44 models.