docs: add context length detection references to FAQ and quickstart by teknium1 · Pull Request #2179 · NousResearch/hermes-agent

teknium1 · 2026-03-20T15:29:22Z

Adds context length detection documentation to the two places users are most likely to look:

quickstart.md:

- Custom Endpoint row now includes Ollama
- Tip box mentions the context length prompt and links to the full detection docs

faq.md:

- "Can I use local models?": rewritten with the hermes model flow showing the context length prompt, plus an Ollama num_ctx tip
- "Context length exceeded": expanded with detection troubleshooting, a /context check, config.yaml override examples, a custom_providers per-model example, and a link to the full docs

All three doc locations now cross-reference each other:

- Quickstart → Configuration (context length detection)
- FAQ → Configuration (context length detection)
- Configuration → (canonical reference)
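
The context length prompt referenced above is described elsewhere in this thread as accepting shorthand such as 32k or 128K. A minimal sketch of how such input could be parsed follows. The function name `parse_context_length` is hypothetical, and the multipliers (k = 1,000, m = 1,000,000) are an assumption; the actual hermes implementation may differ.

```python
import re

# Assumption: "k" means 1,000 tokens and "m" means 1,000,000 tokens.
_MULTIPLIERS = {"": 1, "k": 1_000, "m": 1_000_000}
_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([kKmM]?)\s*$")


def parse_context_length(text: str) -> int:
    """Convert input such as '32k', '128K', or '8192' into a token count.

    Hypothetical helper for illustration only.
    """
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"unrecognised context length: {text!r}")
    number, suffix = match.groups()
    value = round(float(number) * _MULTIPLIERS[suffix.lower()])
    if value < 1:
        raise ValueError("context length must be at least 1 token")
    return value
```

Rejecting unrecognised input with an error, rather than silently guessing, keeps a typo at the prompt from being saved as a wrong per-model value.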

- quickstart.md: mention context length prompt for custom endpoints, link to configuration docs, add Ollama to provider table
- faq.md: rewrite local models section with hermes model flow and context length prompt example, add Ollama num_ctx tip, expand context-length-exceeded troubleshooting with detection override options and config.yaml examples

@Nebula037

* fix(delegate): save parent tool names before child construction mutates global * feat: query local server for actual context window size Instead of defaulting to 2M for unknown local models, query the server API for the real context length. Supports Ollama (/api/show), vLLM (max_model_len), and LM Studio (/v1/models). Results are cached to avoid repeated queries. * fix: prefer loaded instance context size over max for LM Studio When LM Studio has a model loaded with a custom context size (e.g., 122K), prefer that over the model's max_context_length (e.g., 1M). This makes the TUI status bar show the actual runtime context window. * feat: query local servers for actual context window size Custom endpoints (LM Studio, Ollama, vLLM, llama.cpp) silently fall back to 2M tokens when /v1/models doesn't include context_length. Adds _query_local_context_length() which queries server-specific APIs: - LM Studio: /api/v1/models (max_context_length + loaded instances) - Ollama: /api/show (model_info + num_ctx parameters) - llama.cpp: /props (n_ctx from default_generation_settings) - vLLM: /v1/models/{model} (max_model_len) Prefers loaded instance context over max (e.g., 122K loaded vs 1M max). Results are cached via save_context_length() to avoid repeated queries. Also fixes detect_local_server_type() misidentifying LM Studio as Ollama (LM Studio returns 200 for /api/tags with an error body). * fix: normalize MCP object schemas without properties * fix: use reasoning content as response when model only produces think blocks Local models (especially Qwen 3.5) sometimes wrap their entire response inside <think> tags, leaving actual content empty. Previously this caused 3 retries and then an error, wasting tokens and failing the request. Now when retries are exhausted and reasoning_text contains the response, it is used as final_response instead of returning an error. The user sees the actual answer instead of "Model generated only think blocks." 
* fix(cli): expand session list columns for full ID visibility Show complete session IDs in 'hermes sessions list' instead of truncating to 20 characters. Widens title column from 20→30 chars and adjusts header widths accordingly. Fixes NousResearch#2068. Based on PR NousResearch#2085 by @Nebula037 with a correction to preserve the no-titles layout (the original PR accidentally replaced the Preview/Src header with a duplicate Title/Preview header). * feat: show reasoning/thinking blocks when show_reasoning is enabled - Add <thinking> tag to streaming filter's tag list - When show_reasoning is on, route XML reasoning content to the reasoning display box instead of silently discarding it - Expand _strip_think_blocks to handle all tag variants: <think>, <thinking>, <THINKING>, <reasoning>, <REASONING_SCRATCHPAD> * fix: preserve Ollama model:tag colons in context length detection (NousResearch#2149) The colon-split logic in get_model_context_length() and _query_local_context_length() assumed any colon meant provider:model format (e.g. "local:my-model"). But Ollama uses model:tag format (e.g. "qwen3.5:27b"), so the split turned "qwen3.5:27b" into just "27b" — which matches nothing, causing a fallback to the 2M token probe tier. Now only recognised provider prefixes (local, openrouter, anthropic, etc.) are stripped. Ollama model:tag names pass through intact. Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com> * fix: complete session reset — missing compressor counters + test Follow-up to PR NousResearch#2101 (InB4DevOps). Adds three missing context compressor resets in reset_session_state(): - compression_count (displayed in status bar) - last_total_tokens - _context_probed (stale context-error flag) Also fixes the test_cli_new_session.py prompt_toolkit mock (missing auto_suggest stub) and adds a regression test for NousResearch#2099 that verifies all token counters and compressor state are zeroed on /new. 
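
The Ollama model:tag fix described in this commit log (strip a prefix only when it is a recognised provider name) can be sketched roughly as below. This is an illustration under assumptions, not the project's code: `strip_provider_prefix` and `KNOWN_PROVIDER_PREFIXES` are hypothetical names, and the set lists only the three prefixes the commit message names explicitly.

```python
# Hypothetical set: the commit message names local, openrouter and anthropic
# and says "etc.", so the real list is presumably longer.
KNOWN_PROVIDER_PREFIXES = frozenset({"local", "openrouter", "anthropic"})


def strip_provider_prefix(model: str) -> str:
    """Drop a leading 'provider:' only when the provider is recognised.

    Ollama-style 'model:tag' names are returned unchanged because the text
    before the colon is not a known provider.
    """
    prefix, sep, rest = model.partition(":")
    if sep and rest and prefix.lower() in KNOWN_PROVIDER_PREFIXES:
        return rest
    return model
```

With this rule, `strip_provider_prefix("qwen3.5:27b")` keeps the full name, while `strip_provider_prefix("local:my-model")` yields `my-model`.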
* fix: skip model auto-detection for custom/local providers When the user is on a custom provider (provider=custom, localhost, or 127.0.0.1 endpoint), /model <name> no longer tries to auto-detect a provider switch. The model name changes on the current endpoint as-is. To switch away from a custom endpoint, users must use explicit provider:model syntax (e.g. /model openai-codex:gpt-5.2-codex). A helpful tip is printed when changing models on a custom endpoint. This prevents the confusing case where someone on LM Studio types /model gpt-5.2-codex, the auto-detection tries to switch providers, fails or partially succeeds, and requests still go to the old endpoint. Also fixes the missing prompt_toolkit.auto_suggest mock stub in test_cli_init.py (same issue already fixed in test_cli_new_session.py). * fix(honcho): read HONCHO_BASE_URL for local/self-hosted instances Cherry-picked from PR NousResearch#2120 by @unclebumpy. - from_env() now reads HONCHO_BASE_URL and enables Honcho when base_url is set, even without an API key - from_global_config() reads baseUrl from config root with HONCHO_BASE_URL env var as fallback - get_honcho_client() guard relaxed to allow base_url without api_key for no-auth local instances - Added HONCHO_BASE_URL to OPTIONAL_ENV_VARS registry Result: Setting HONCHO_BASE_URL=http://localhost:8000 in ~/.hermes/.env now correctly routes the Honcho client to a local instance. * fix: update claude 4.6 context length from 200K to 1M (NousResearch#2155) * fix: preserve Ollama model:tag colons in context length detection The colon-split logic in get_model_context_length() and _query_local_context_length() assumed any colon meant provider:model format (e.g. "local:my-model"). But Ollama uses model:tag format (e.g. "qwen3.5:27b"), so the split turned "qwen3.5:27b" into just "27b" — which matches nothing, causing a fallback to the 2M token probe tier. Now only recognised provider prefixes (local, openrouter, anthropic, etc.) are stripped. 
Ollama model:tag names pass through intact. * fix: update claude-opus-4-6 and claude-sonnet-4-6 context length from 200K to 1M Both models support 1,000,000 token context windows. The hardcoded defaults were set before Anthropic expanded the context for the 4.6 generation. Verified via models.dev and OpenRouter API data. --------- Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com> Co-authored-by: Test <test@test.com> * fix(signal): handle Note to Self messages with echo-back protection Support Signal 'Note to Self' messages in single-number setups where signal-cli is linked as a secondary device on the user's own account. syncMessage.sentMessage envelopes addressed to the bot's own account are now promoted to dataMessage for normal processing, while other sync events (read receipts, typing, etc.) are still filtered. Echo-back prevention mirrors the WhatsApp bridge pattern: - Track timestamps of recently sent messages (bounded set of 50) - When a Note to Self sync arrives, check if its timestamp matches a recent outbound — skip if so (agent echo-back) - Only process sync messages that are genuinely user-initiated Based on PR NousResearch#2115 by @Stonelinks with added echo-back protection. * docs(signal): add Note to Self section to Signal setup guide * fix(openai): route api.openai.com to Responses API for GPT-5.x Based on PR NousResearch#1859 by @magi-morph (too stale to cherry-pick, reimplemented). GPT-5.x models reject tool calls + reasoning_effort on /v1/chat/completions with a 400 error directing to /v1/responses. 
This auto-detects api.openai.com in the base URL and switches to codex_responses mode in three places: - AIAgent.__init__: upgrades chat_completions → codex_responses - _try_activate_fallback(): same routing for fallback model - runtime_provider.py: _detect_api_mode_for_url() for both custom provider and openrouter runtime resolution paths Also extracts _is_direct_openai_url() helper to replace the inline check in _max_tokens_param(). * fix(display): show spinners and tool progress during streaming mode When streaming was enabled, two visual feedback mechanisms were completely suppressed: 1. The thinking spinner (TUI toolbar) was skipped because the entire spinner block was gated on 'not self._has_stream_consumers()'. Now the thinking_callback fires in streaming mode too — the raw KawaiiSpinner is still skipped (would conflict with streamed tokens) but the TUI toolbar widget works fine alongside streaming. 2. Tool progress lines (the ┊ feed) were invisible because _vprint was blanket-suppressed when stream consumers existed. But during tool execution, no tokens are actively streaming, so printing is safe. Added an _executing_tools flag that _vprint respects to allow output during tool execution even with stream consumers registered. * fix(cron): remove send_message/clarify from cron agents + autonomous prompt Cron jobs run unattended with no user present. Previously the agent had send_message and clarify tools available, which makes no sense — the final response is auto-delivered, and there's nobody to ask questions to. 
Changes: - Disable messaging and clarify toolsets for cron agent sessions - Update cron platform hint to emphasize autonomous execution: no user present, cannot ask questions, must execute fully and make decisions - Update cronjob tool schema description to match (remove stale send_message guidance) * feat: overhaul context length detection with models.dev and provider-aware resolution (NousResearch#2158) Replace the fragile hardcoded context length system with a multi-source resolution chain that correctly identifies context windows per provider. Key changes: - New agent/models_dev.py: Fetches and caches the models.dev registry (3800+ models across 100+ providers with per-provider context windows). In-memory cache (1hr TTL) + disk cache for cold starts. - Rewritten get_model_context_length() resolution chain: 0. Config override (model.context_length) 1. Custom providers per-model context_length 2. Persistent disk cache 3. Endpoint /models (local servers) 4. Anthropic /v1/models API (max_input_tokens, API-key only) 5. OpenRouter live API (existing, unchanged) 6. Nous suffix-match via OpenRouter (dot/dash normalization) 7. models.dev registry lookup (provider-aware) 8. Thin hardcoded defaults (broad family patterns) 9. 128K fallback (was 2M) - Provider-aware context: same model now correctly resolves to different context windows per provider (e.g. claude-opus-4.6: 1M on Anthropic, 128K on GitHub Copilot). Provider name flows through ContextCompressor. - DEFAULT_CONTEXT_LENGTHS shrunk from 80+ entries to ~16 broad patterns. models.dev replaces the per-model hardcoding. - CONTEXT_PROBE_TIERS changed from [2M, 1M, 512K, 200K, 128K, 64K, 32K] to [128K, 64K, 32K, 16K, 8K]. Unknown models no longer start at 2M. - hermes model: prompts for context_length when configuring custom endpoints. Supports shorthand (32k, 128K). Saved to custom_providers per-model config. 
- custom_providers schema extended with optional models dict for per-model context_length (backward compatible). - Nous Portal: suffix-matches bare IDs (claude-opus-4-6) against OpenRouter's prefixed IDs (anthropic/claude-opus-4.6) with dot/dash normalization. Handles all 15 current Nous models. - Anthropic direct: queries /v1/models for max_input_tokens. Only works with regular API keys (sk-ant-api*), not OAuth tokens. Falls through to models.dev for OAuth users. Tests: 5574 passed (18 new tests for models_dev + updated probe tiers) Docs: Updated configuration.md context length section, AGENTS.md Co-authored-by: Test <test@test.com> * feat(gateway): add webhook platform adapter for external event triggers Add a generic webhook platform adapter that receives HTTP POSTs from external services (GitHub, GitLab, JIRA, Stripe, etc.), validates HMAC signatures, transforms payloads into agent prompts, and routes responses back to the source or to another platform. Features: - Configurable routes with per-route HMAC secrets, event filters, prompt templates with dot-notation payload access, skill loading, and pluggable delivery (github_comment, telegram, discord, log) - HMAC signature validation (GitHub SHA-256, GitLab token, generic) - Rate limiting (30 req/min per route, configurable) - Idempotency cache (1hr TTL, prevents duplicate runs on retries) - Body size limits (1MB default, checked before reading payload) - Setup wizard integration with security warnings and docs links - 33 tests (29 unit + 4 integration), all passing Security: - HMAC secret required per route (startup validation) - Setup wizard warns about internet exposure for webhook/SMS platforms - Sandboxing (Docker/VM) recommended in docs for public-facing deployments Files changed: - gateway/config.py — Platform.WEBHOOK enum + env var overrides - gateway/platforms/webhook.py — WebhookAdapter (~420 lines) - gateway/run.py — factory wiring + auth bypass for webhook events - hermes_cli/config.py — WEBHOOK_* 
env var definitions - hermes_cli/setup.py — webhook section in setup_gateway() - tests/gateway/test_webhook_adapter.py — 29 unit tests - tests/gateway/test_webhook_integration.py — 4 integration tests - website/docs/user-guide/messaging/webhooks.md — full user docs - website/docs/reference/environment-variables.md — WEBHOOK_* vars - website/sidebars.ts — nav entry * fix(cron): add Matrix to scheduler delivery platform_map Matrix is a supported gateway platform but was missing from the cron scheduler's delivery platform_map, causing cron job results to silently fail delivery when targeting Matrix rooms. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * feat: context pressure warnings for CLI and gateway (NousResearch#2159) * feat: context pressure warnings for CLI and gateway User-facing notifications as context approaches the compaction threshold. Warnings fire at 60% and 85% of the way to compaction — relative to the configured compression threshold, not the raw context window. CLI: Formatted line with a progress bar showing distance to compaction. Cyan at 60% (approaching), bold yellow at 85% (imminent). ◐ context ▰▰▰▰▰▰▰▰▰▰▰▰▱▱▱▱▱▱▱▱ 60% to compaction 100k threshold (50%) · approaching compaction ⚠ context ▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰▱▱▱ 85% to compaction 100k threshold (50%) · compaction imminent Gateway: Plain-text notification sent to the user's chat via the new status_callback mechanism (asyncio.run_coroutine_threadsafe bridge, same pattern as step_callback). Does NOT inject into the message stream. The LLM never sees these warnings. Flags reset after each compaction cycle. 
Files changed: - agent/display.py — format_context_pressure(), format_context_pressure_gateway() - run_agent.py — status_callback param, _context_50/70_warned flags, _emit_context_pressure(), flag reset in _compress_context() - gateway/run.py — _status_callback_sync bridge, wired to AIAgent - tests/test_context_pressure.py — 23 tests * Merge remote-tracking branch 'origin/main' into hermes/hermes-7ea545bf --------- Co-authored-by: Test <test@test.com> * docs: add context length detection references to FAQ and quickstart (NousResearch#2179) - quickstart.md: mention context length prompt for custom endpoints, link to configuration docs, add Ollama to provider table - faq.md: rewrite local models section with hermes model flow and context length prompt example, add Ollama num_ctx tip, expand context-length-exceeded troubleshooting with detection override options and config.yaml examples Co-authored-by: Test <test@test.com> * fix(gateway): strip orphaned tool_results + let /reset bypass running agent (NousResearch#2180) Two fixes for Telegram/gateway-specific bugs: 1. Anthropic adapter: strip orphaned tool_result blocks (mirror of existing tool_use stripping). Context compression or session truncation can remove an assistant message containing a tool_use while leaving the subsequent tool_result intact. Anthropic rejects these with a 400: 'unexpected tool_use_id found in tool_result blocks'. The adapter now collects all tool_use IDs and filters out any tool_result blocks referencing IDs not in that set. 2. Gateway: /reset and /new now bypass the running-agent guard (like /status already does). Previously, sending /reset while an agent was running caused the raw text to be queued and later fed back as a user message with the same broken history — replaying the corrupted session instead of resetting it. Now the running agent is interrupted, pending messages are cleared, and the reset command dispatches immediately. 
Tests updated: existing tests now include proper tool_use→tool_result pairs; two new tests cover orphaned tool_result stripping. Co-authored-by: Test <test@test.com> * fix: add missing platforms to cron/send_message delivery maps and tool schema Matrix, Mattermost, Home Assistant, and DingTalk were missing from the platform_map in both cron/scheduler.py and tools/send_message_tool.py, causing delivery to those platforms to silently fail. Also updates the cronjob tool schema description to list all available delivery targets so the model knows its options. * fix: 6 bugs in model metadata, reasoning detection, and delegate tool Cherry-picked from PR NousResearch#2169 by @0xbyt4. 1. _strip_provider_prefix: skip Ollama model:tag names (qwen:0.5b) 2. Fuzzy match: remove reverse direction that made claude-sonnet-4 resolve to 1M instead of 200K 3. _has_content_after_think_block: reuse _strip_think_blocks() to handle all tag variants (thinking, reasoning, REASONING_SCRATCHPAD) 4. models.dev lookup: elif→if so nous provider also queries models.dev 5. Disk cache fallback: use 5-min TTL instead of full hour so network is retried soon 6. 
Delegate build: wrap child construction in try/finally so _last_resolved_tool_names is always restored on exception * docs: fill documentation gaps from recent PRs (NousResearch#2183) - slash-commands.md: add /approve, /deny (gateway-only), /statusbar (CLI-only); update Notes section with new platform-specific commands - messaging/index.md: add Webhooks to architecture diagram, platform toolsets table, and Next Steps links; add /approve and /deny to Chat Commands table - environment-variables.md: add HONCHO_BASE_URL for self-hosted Honcho instances - configuration.md: add Context Pressure Warnings section (separate from iteration budget pressure); add base_url to OpenAI TTS config; add display.show_cost to Display Settings - tts.md: add base_url to OpenAI TTS config example Co-authored-by: Test <test@test.com> * fix(whatsapp): image downloading, bridge reuse, LID allowlist, Baileys 7.x compat Salvaged from PR NousResearch#2162 by @Zindar. Reply prefix changes excluded (already on main via NousResearch#1756 configurable prefix). Bridge improvements (bridge.js): - Download incoming images to ~/.hermes/image_cache/ via downloadMediaMessage so the agent can actually see user-sent photos - Add getMessage callback required for Baileys 7.x E2EE session re-establishment (without it, some messages arrive as null) - Build LID→phone reverse map for allowlist resolution (WhatsApp LID format) - Add placeholder body for media without caption: [image received] - Bind express to 127.0.0.1 instead of 0.0.0.0 for security - Use 127.0.0.1 consistently throughout (more reliable than localhost) Adapter improvements (whatsapp.py): - Detect and reuse already-running bridge (only if status=connected) - Handle local file paths from bridge-cached images in _build_message_event - Don't kill external bridges on disconnect - Use 127.0.0.1 throughout for consistency with bridge binding Fix vs original PR: bridge reuse now checks status=connected, not just HTTP 200. 
A disconnected bridge gets restarted instead of reused. Co-authored-by: Zindar <zindar@users.noreply.github.com> * fix(acp): preserve leading whitespace in streaming chunks * feat: add /queue command to queue prompts without interrupting (NousResearch#2191) Adds /queue <prompt> (alias /q) that queues a message for the next turn while the agent is busy, without interrupting the current run. - CLI: /queue <prompt> puts it in _pending_input for the next turn - Gateway: /queue <prompt> creates a pending MessageEvent on the adapter, picked up after the current agent run finishes - Enter still interrupts as usual (no behavior change) - /queue with no prompt shows usage - /queue when agent is idle tells user to just type normally Co-authored-by: Test <test@test.com> * fix: persistent event loop in _run_async prevents 'Event loop is closed' (NousResearch#2190) Cherry-picked from PR NousResearch#2146 by @crazywriter1. Fixes NousResearch#2104. asyncio.run() creates and closes a fresh event loop each call. Cached httpx/AsyncOpenAI clients bound to the dead loop crash on GC with 'Event loop is closed'. This hit vision_analyze on first use in CLI. Two-layer fix: - model_tools._run_async(): replace asyncio.run() with persistent loop via _get_tool_loop() + run_until_complete() - auxiliary_client._get_cached_client(): track which loop created each async client, discard stale entries if loop is closed 6 regression tests covering loop lifecycle, reuse, and full vision dispatch chain. 
Co-authored-by: Test <test@test.com>
* passed gateway runner entirely
* added default lattice server
---------
Co-authored-by: ygd58 <buraysandro9@gmail.com>
Co-authored-by: Peppi Littera <giuseppe.littera@gmail.com>
Co-authored-by: hermes <hermes@hermes-vps>
Co-authored-by: Test <test@test.com>
Co-authored-by: Teknium <127238744+teknium1@users.noreply.github.com>
Co-authored-by: kshitijk4poor <82637225+kshitijk4poor@users.noreply.github.com>
Co-authored-by: bunting szn <108427749+buntingszn@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Zindar <zindar@users.noreply.github.com>
Co-authored-by: Dilee <uzmpsk.dilekakbas@gmail.com>
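
The multi-source resolution chain described in the commit log above (config override first, then further sources in order, ending at a 128K fallback) can be pictured as an ordered list of lookups where the first usable answer wins. The sketch below is a simplified illustration, not the actual get_model_context_length() code: `resolve_context_length`, `Resolver`, and `FALLBACK_CONTEXT_LENGTH` are hypothetical names, "128K" is read as 128,000 tokens by assumption, and skipping a source that raises is a design choice made for this sketch.

```python
from typing import Callable, Optional, Sequence

# Assumption: the "128K fallback" is taken to mean 128,000 tokens.
FALLBACK_CONTEXT_LENGTH = 128_000

# A resolver returns a context length for a model, or None if it has no answer.
Resolver = Callable[[str], Optional[int]]


def resolve_context_length(model: str, resolvers: Sequence[Resolver]) -> int:
    """Return the first positive answer from the ordered resolvers.

    A resolver that raises (for example, a failed network lookup) is skipped
    so that later sources still get a chance.
    """
    for resolver in resolvers:
        try:
            value = resolver(model)
        except Exception:
            continue
        if value is not None and value > 0:
            return value
    return FALLBACK_CONTEXT_LENGTH


# Usage: plain dicts stand in for the config override and a registry lookup.
config_override = {"my-local-model": 32_000}
registry = {"claude-opus-4-6": 1_000_000}
resolvers = [config_override.get, registry.get]

print(resolve_context_length("my-local-model", resolvers))  # 32000
print(resolve_context_length("unknown-model", resolvers))   # 128000
```

Ordering the list from most specific (user configuration) to most general (defaults) means an explicit override always beats a detected or registry value.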


teknium1 merged commit 80e578d into main Mar 20, 2026
1 of 2 checks passed

teknium1 merged 1 commit into main from hermes/hermes-3369cdb1
