feat: use endpoint metadata for custom model context and pricing #1906

Merged
teknium1 merged 2 commits into main from hermes/hermes-562a3784 on Mar 18, 2026

@teknium1

Salvage of PR #1875 by @kshitijk4poor (cherry-picked with authorship preserved, 2 commits).

Summary

Custom endpoints (Chutes, local llama.cpp, etc.) were getting wrong context lengths because get_model_context_length() fell through to fuzzy name-matching against hardcoded defaults — e.g. zai-org/GLM-5-TEE on Chutes would match the unrelated glm-5 entry.

This PR queries the endpoint's own /models API for real metadata instead of guessing.
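To illustrate the failure mode described above, here is a minimal sketch of how a substring-style fuzzy match can pick the wrong entry. The table name, the matching rule, and the context value are all hypothetical placeholders; the actual logic in `get_model_context_length()` may differ.

```python
from typing import Optional

# Hypothetical hardcoded defaults. The number is a placeholder, not a real limit.
DEFAULT_CONTEXT_LENGTHS = {"glm-5": 128_000}


def fuzzy_context_length(model: str) -> Optional[int]:
    """Return the first default whose key appears anywhere in the model name."""
    name = model.lower()
    for known, length in DEFAULT_CONTEXT_LENGTHS.items():
        if known in name:
            return length
    return None


# The Chutes model name contains "glm-5", so it inherits that entry's value
# even though the endpoint may serve the model with a different context window.
guessed = fuzzy_context_length("zai-org/GLM-5-TEE")
```

Under these assumptions `guessed` is whatever the `glm-5` entry holds, regardless of what the endpoint itself reports.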

Changes

Commit 1 (perf cleanup):

  • Cache base_url.lower() via a property setter (_base_url_lower) — eliminates ~15 repeated .lower() calls throughout run_agent.py
  • Consolidate 3 separate load_config() calls in __init__ into one
  • Hoist _READ_SEARCH_TOOLS set to module level in model_tools.py
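The property-setter caching in Commit 1 can be sketched roughly as follows. The `Agent` class and the `is_openrouter()` helper are hypothetical stand-ins; only the `base_url` / `_base_url_lower` names come from this PR.

```python
class Agent:
    """Hypothetical stand-in for the agent class in run_agent.py."""

    def __init__(self, base_url: str) -> None:
        # Assignment goes through the setter, which fills the cache.
        self.base_url = base_url

    @property
    def base_url(self) -> str:
        return self._base_url

    @base_url.setter
    def base_url(self, value: str) -> None:
        self._base_url = value
        # Lowercased once per assignment instead of once per read.
        self._base_url_lower = value.lower() if value else ""

    def is_openrouter(self) -> bool:
        # Hypothetical caller: reads the cached value, no .lower() call here.
        return "openrouter" in self._base_url_lower
```

Because the cache is refreshed in the setter, later reassignments of `base_url` keep `_base_url_lower` in sync without any extra bookkeeping at the call sites.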

Commit 2 (endpoint metadata):

  • New fetch_endpoint_model_metadata() in model_metadata.py — queries /models on custom OpenAI-compatible endpoints, cached 5 min per base URL
  • Extraction helpers for context length, max completion tokens, and pricing from varied API response formats
  • Custom endpoints check their own /models before fuzzy name-matching; unknown third-party endpoints skip fuzzy matching entirely and fall back to probe tiers
  • Pricing integration: custom endpoints that expose pricing in /models get accurate cost estimates
  • Model alias support: provider/model-name entries also get a bare model-name alias in the cache
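A minimal sketch of the caching, extraction, and alias behaviour listed above, not the code in `model_metadata.py`. The function names, the `fetch_models` callback (injected so the sketch needs no network access), and the candidate field names are assumptions; real endpoints may report context length under other keys.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple

CACHE_TTL_SECONDS = 300  # 5 minutes per base URL, as described in the PR

# base URL -> (time stored, metadata keyed by model name)
_cache: Dict[str, Tuple[float, Dict[str, Dict[str, Any]]]] = {}

# Assumed field names; servers differ in how they report context length.
_CONTEXT_KEYS = ("context_length", "max_model_len", "max_context_length")


def extract_context_length(entry: Dict[str, Any]) -> Optional[int]:
    """Return the first positive integer found under a known key."""
    for key in _CONTEXT_KEYS:
        value = entry.get(key)
        if isinstance(value, int) and not isinstance(value, bool) and value > 0:
            return value
    return None


def build_metadata(payload: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    """Index a /models response by model id, adding bare-name aliases."""
    result: Dict[str, Dict[str, Any]] = {}
    for entry in payload.get("data", []):
        model_id = entry.get("id")
        if not isinstance(model_id, str) or not model_id:
            continue
        meta = {"context_length": extract_context_length(entry)}
        result[model_id] = meta
        if "/" in model_id:
            # "provider/model-name" also becomes reachable as "model-name".
            result.setdefault(model_id.rsplit("/", 1)[-1], meta)
    return result


def get_endpoint_metadata(
    base_url: str,
    fetch_models: Callable[[str], Dict[str, Any]],
    now: Callable[[], float] = time.monotonic,
) -> Dict[str, Dict[str, Any]]:
    """Return cached metadata for base_url, refetching after the TTL expires."""
    key = base_url.rstrip("/").lower()
    cached = _cache.get(key)
    if cached is not None and now() - cached[0] < CACHE_TTL_SECONDS:
        return cached[1]
    metadata = build_metadata(fetch_models(base_url))
    _cache[key] = (now(), metadata)
    return metadata
```

A caller would pass a function that performs the HTTP GET against `/models` and returns the parsed JSON. A `None` context length signals that the endpoint gave no usable value, so the caller can move on to its next fallback.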

Test plan

  • pytest tests/agent/test_model_metadata.py tests/agent/test_usage_pricing.py tests/agent/test_context_compressor.py — 100 passed
  • Full suite — 5349 passed (7 pre-existing failures in test_anthropic_adapter.py and test_whatsapp_reply_prefix.py)
…nfig(), hoist set constant

run_agent.py:
- Add base_url property that auto-caches _base_url_lower on every
  assignment, eliminating 12+ redundant .lower() calls per API cycle
  across __init__, _build_api_kwargs, _supports_reasoning_extra_body,
  and the main conversation loop
- Consolidate three separate load_config() disk reads in __init__
  (memory, skills, compression) into a single call, reusing the
  result dict for all three config sections

model_tools.py:
- Hoist _READ_SEARCH_TOOLS set to module level (was rebuilt inside
  handle_function_call on every tool invocation)
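
The hoisting change in model_tools.py follows a common pattern, sketched below. The tool names and the `is_read_or_search()` helper are hypothetical, and a `frozenset` is used here for illustration where the PR describes a set.

```python
# Built once at import time rather than on every call.
# The member names are placeholders, not the project's real tool list.
_READ_SEARCH_TOOLS = frozenset({"read_file", "search_files"})


def is_read_or_search(tool_name: str) -> bool:
    """Membership test against the module-level constant."""
    return tool_name in _READ_SEARCH_TOOLS
```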