fix(agent): prevent AsyncOpenAI/httpx cross-loop deadlock in gateway mode by ctlst · Pull Request #2701 · NousResearch/hermes-agent

ctlst · 2026-03-24T00:37:42Z

Summary

Fix the async tool deadlock in gateway mode, where vision_analyze, web_extract, and session_search hang forever because cached AsyncOpenAI clients are reused across different event loops
Include event loop identity in the async client cache key so each loop gets its own client instance
Replace session_search_tool.py's manual asyncio.run() in ThreadPoolExecutor with the centralized _run_async() bridge

Fixes #2681. Related to #2338.

Relationship to #2682

PR #2682 fixes the same issue but only for vision_analyze by switching to sync call_llm. This PR fixes the root cause in the client cache layer, so all async tools are fixed without modifying each tool individually:

| Tool | Uses `async_call_llm` | Fixed by #2682 | Fixed here |
| --- | --- | --- | --- |
| `vision_analyze` | Yes | ✅ | ✅ |
| `web_extract` | Yes | ❌ | ✅ |
| `session_search` | Yes | ❌ | ✅ |
| `mixture_of_agents` | Yes | ❌ | ✅ |

Both approaches are compatible — #2682's sync switch is a reasonable defense-in-depth for vision specifically, while this PR prevents the class of bug from affecting any current or future async tool.

Root Cause

In gateway mode, _run_async() spawns a new thread and calls asyncio.run() in it, which creates a fresh event loop. But _get_cached_client() returns an AsyncOpenAI client that was created on (and bound to) a different loop. Since httpx.AsyncClient cannot operate across event loop boundaries, await client.chat.completions.create() hangs indefinitely.

session_search_tool.py had the same bug independently — its own asyncio.run() in a ThreadPoolExecutor created the same cross-loop conflict.
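The loop mismatch described above can be observed without any network client. The following is a minimal sketch using only the standard library; the helper name `loop_in_new_thread` is hypothetical and not taken from the codebase. It shows that asyncio.run() inside a worker thread runs on a loop distinct from the caller's, which is why anything bound to the caller's loop cannot be assumed to work there.

```python
import asyncio
import threading


def loop_in_new_thread() -> asyncio.AbstractEventLoop:
    """Run asyncio.run() in a worker thread and return the loop it used."""
    seen = {}

    async def capture():
        seen["loop"] = asyncio.get_running_loop()

    worker = threading.Thread(target=lambda: asyncio.run(capture()))
    worker.start()
    worker.join()
    return seen["loop"]


async def main():
    outer = asyncio.get_running_loop()
    # Blocks the outer loop while the thread runs; acceptable for a demo.
    inner = loop_in_new_thread()
    return outer, inner


outer_loop, inner_loop = asyncio.run(main())
print(outer_loop is inner_loop)  # False: the thread got its own loop
print(inner_loop.is_closed())    # True: asyncio.run() closes its loop on exit
```

An object that captured `outer_loop` at construction time would still point at it when used from the worker thread, which matches the failure mode this PR describes for the cached client.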

Changes

agent/auxiliary_client.py — Add id(current_loop) to the async client cache key so each event loop gets its own AsyncOpenAI instance. Sync clients (no loop binding) are unaffected.
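As a rough illustration of the cache-key change, here is a minimal sketch. It is not the code in agent/auxiliary_client.py: the names `get_cached_client`, `_client_cache`, and `FakeAsyncClient` are hypothetical, and the fake class stands in for AsyncOpenAI so the example runs offline.

```python
import asyncio

_client_cache = {}  # hypothetical module-level cache


class FakeAsyncClient:
    """Stand-in for AsyncOpenAI; a real client would own an httpx.AsyncClient."""


def _loop_id():
    """Return id() of the running loop, or None when no loop is running."""
    try:
        return id(asyncio.get_running_loop())
    except RuntimeError:
        return None


def get_cached_client(provider: str, async_mode: bool):
    # Sync clients carry no loop binding, so their key omits the loop.
    loop_key = _loop_id() if async_mode else None
    key = (provider, async_mode, loop_key)
    if key not in _client_cache:
        _client_cache[key] = FakeAsyncClient() if async_mode else object()
    return _client_cache[key]


async def fetch_twice():
    return get_cached_client("openai", True), get_cached_client("openai", True)


# Keep both loops alive at once so their id() values cannot collide.
loop_a = asyncio.new_event_loop()
loop_b = asyncio.new_event_loop()
try:
    a1, a2 = loop_a.run_until_complete(fetch_twice())
    b1, _ = loop_b.run_until_complete(fetch_twice())
finally:
    loop_a.close()
    loop_b.close()

s1 = get_cached_client("openai", False)

print(a1 is a2)  # True: same loop reuses the cached client
print(a1 is b1)  # False: a different loop gets its own client
```

Because id() values can be reused once an object is garbage collected, a key built this way likely also needs some handling for closed loops, as the "Closed loop client is discarded" test in this PR suggests.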

tools/session_search_tool.py — Replace manual asyncio.run() in ThreadPoolExecutor with _run_async() which properly handles loop lifecycle across CLI, gateway, and worker-thread contexts.
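For readers unfamiliar with this kind of bridge, the sketch below shows one plausible shape for a sync-to-async helper. It is an assumption for illustration, not the project's _run_async() implementation; `run_coro_sync` and `double` are hypothetical names.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_coro_sync(coro):
    """Run a coroutine to completion from synchronous code."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running in this thread: asyncio.run() can own one directly.
        return asyncio.run(coro)
    # A loop is already running here, so asyncio.run() would raise.
    # Run the coroutine on a fresh loop in a worker thread and wait for it.
    with ThreadPoolExecutor(max_workers=1) as pool:
        return pool.submit(asyncio.run, coro).result()


async def double(x):
    await asyncio.sleep(0)
    return x * 2


# Called with no loop running (CLI-like context).
result_plain = run_coro_sync(double(21))


async def caller():
    # Called while a loop is running (gateway-like context).
    return run_coro_sync(double(4))


result_nested = asyncio.run(caller())
print(result_plain, result_nested)  # 42 8
```

The second branch always executes on a new event loop, so a helper like this is only safe when loop-bound resources such as async HTTP clients are created per loop rather than shared.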

tests/test_crossloop_client_cache.py — 5 new tests:

Same loop reuses cached client
Different loops get separate clients
Sync clients shared globally (not affected)
Gateway simulation (asyncio.run in thread gets fresh client)
Closed loop client is discarded

How to Test

Run hermes in gateway mode (Telegram) with a multimodal model
Send an image and ask the bot to describe it (triggers vision_analyze)
Ask the bot to search past sessions (triggers session_search)
Both should complete without timeout — previously both would deadlock

Tested On

Linux (Docker, Python 3.11) — Telegram gateway with Qwen3.5-27B via llama.cpp
macOS (Python 3.14) — unit tests

…mode In gateway mode, async tools (vision_analyze, web_extract, session_search) deadlock because _run_async() spawns a thread with asyncio.run(), creating a new event loop, but _get_cached_client() returns an AsyncOpenAI client bound to a different loop. httpx.AsyncClient cannot work across event loop boundaries, causing await client.chat.completions.create() to hang forever. Fix: include the event loop identity in the async client cache key so each loop gets its own AsyncOpenAI instance. Also fix session_search_tool.py which had its own broken asyncio.run()-in-thread pattern — now uses the centralized _run_async() bridge.

teknium1 merged commit 281100e into NousResearch:main Mar 26, 2026

teknium1 mentioned this pull request Mar 26, 2026

fix(vision): use sync call_llm to avoid AsyncOpenAI deadlock in gateway mode #2682

Closed

teknium1 merged 1 commit into NousResearch:main from ctlst:fix/async-tool-crossloop-deadlock