
fix: thread safety for concurrent subagent delegation#1672

Merged
teknium1 merged 2 commits into main from hermes/hermes-3218df83 on Mar 17, 2026

Conversation

@teknium1
Contributor

Summary

Salvage of PR #1471 by @peteromallet — thread safety fixes for concurrent subagent delegation.

The problem

Running 3+ subagents concurrently via delegate_task in batch mode causes segfaults, data corruption, and intermittent crashes. These stem from five distinct race conditions, each addressed by one of the fixes below.

Fixes

1. Remove redirect_stdout/redirect_stderr from delegate_tool
contextlib.redirect_stdout mutates the global sys.stdout. When multiple child agents start concurrently in a ThreadPoolExecutor, the race between redirect and the spinner thread corrupts the file descriptor, causing segfaults. The redirect was redundant — children already run with quiet_mode=True.
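To illustrate why the redirect is unsafe under threads, here is a minimal sketch (not the code from delegate_tool). It only shows that a redirect entered in one thread is visible from another, because sys.stdout is a process-wide attribute. The function name stdout_seen_by_other_thread is hypothetical.

```python
import contextlib
import io
import sys
import threading


def stdout_seen_by_other_thread() -> bool:
    """Return True if a redirect entered here is visible from another thread."""
    buffer = io.StringIO()
    seen = {}

    def observer():
        # Runs in a separate thread while the redirect is active.
        seen["is_buffer"] = sys.stdout is buffer

    # redirect_stdout swaps the global sys.stdout, not a per-thread value.
    with contextlib.redirect_stdout(buffer):
        t = threading.Thread(target=observer)
        t.start()
        t.join()
    return seen["is_buffer"]


redirect_is_global = stdout_seen_by_other_thread()
```

With several children entering and leaving such a redirect at different times, whichever thread exits last decides what sys.stdout ends up pointing at, which is the kind of interleaving this fix avoids.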

2. Split agent construction from execution
_run_single_child() is split into _build_child_agent() (main thread, serial) + _run_single_child() (worker thread, parallel). AIAgent construction creates httpx clients and initializes SSL contexts, which is not thread-safe to do concurrently.
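The build/run split can be sketched roughly as follows. This is an illustration under assumptions, not the actual delegate_tool code: ChildAgent is a hypothetical stand-in for AIAgent, and the function names and signatures are simplified.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class ChildAgent:
    """Hypothetical stand-in for AIAgent; records where it was constructed."""

    def __init__(self, task):
        self.task = task
        self.built_on = threading.current_thread().name

    def run(self):
        return f"done: {self.task}"


def build_child_agent(task):
    # Construction (client/SSL setup in the real agent) happens serially.
    return ChildAgent(task)


def run_single_child(child):
    # Only execution is handed to a worker thread.
    return child.run()


def delegate_batch(tasks, max_workers=3):
    # Phase 1: build every child on the calling thread, one at a time.
    children = [build_child_agent(t) for t in tasks]
    # Phase 2: run the already-built children in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_single_child, children))
    return children, results
```

pool.map preserves input order, so results line up with the task list even though the children finish in arbitrary order.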

3. Add threading.Lock to SessionDB
Subagents share the parent's SessionDB and call create_session(), append_message(), etc. from worker threads with no synchronization. Every database-accessing method is now wrapped in with self._lock:.
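A minimal sketch of the locking pattern, assuming a SQLite-backed store (the PR description does not state the backend). LockedSessionDB, its schema, and count_messages are hypothetical; only the with self._lock: wrapping mirrors the fix described above.

```python
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor


class LockedSessionDB:
    """Hypothetical stand-in for SessionDB with every DB method locked."""

    def __init__(self, path=":memory:"):
        self._lock = threading.Lock()
        # check_same_thread=False lets worker threads reuse this connection;
        # the lock below is what serializes access to it.
        self._conn = sqlite3.connect(path, check_same_thread=False)
        with self._lock:
            self._conn.execute(
                "CREATE TABLE messages (session_id TEXT, content TEXT)"
            )
            self._conn.commit()

    def append_message(self, session_id, content):
        with self._lock:
            self._conn.execute(
                "INSERT INTO messages VALUES (?, ?)", (session_id, content)
            )
            self._conn.commit()

    def count_messages(self):
        with self._lock:
            row = self._conn.execute("SELECT COUNT(*) FROM messages").fetchone()
            return row[0]


def write_concurrently(db, n_threads=4, per_thread=50):
    def worker(i):
        for j in range(per_thread):
            db.append_message(f"s{i}", f"m{j}")

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(worker, range(n_threads)))
    return db.count_messages()
```

One design point to keep in mind: threading.Lock is not reentrant, so a locked method that calls another locked method on the same object would deadlock. Either keep locked methods from calling each other or use threading.RLock.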

4. Add _active_children_lock to AIAgent
interrupt() iterates _active_children while worker threads append/remove children. interrupt() now copies the list under the lock and iterates over the copy.
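The snapshot-under-lock idea looks roughly like this. Parent is a hypothetical stand-in for AIAgent, and add_child/remove_child are illustrative helper names; only _active_children, _active_children_lock, and interrupt() come from the description above.

```python
import threading


class Parent:
    """Hypothetical stand-in for AIAgent."""

    def __init__(self):
        self._active_children = []
        self._active_children_lock = threading.Lock()
        self.interrupted = False

    def add_child(self, child):
        with self._active_children_lock:
            self._active_children.append(child)

    def remove_child(self, child):
        with self._active_children_lock:
            if child in self._active_children:
                self._active_children.remove(child)

    def interrupt(self):
        self.interrupted = True
        # Take a snapshot while holding the lock...
        with self._active_children_lock:
            children = list(self._active_children)
        # ...then iterate the copy with the lock released, so a child's
        # interrupt() never runs while this lock is held.
        for child in children:
            child.interrupt()
```

Releasing the lock before calling into the children keeps the critical section short and avoids holding one agent's lock while acquiring another's.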

5. Add _client_cache_lock to auxiliary_client
Multiple subagent threads may resolve auxiliary clients concurrently via call_llm(). A double-checked locking pattern (check the cache, take the lock, check again) prevents duplicate client creation.

What was NOT included from the original PR

  • Per-task model/provider overrides in delegate_task schema (feature addition, not a safety fix)
  • resolve_provider_credentials() helper (utility, not needed for the safety fixes)
  • _apply_provider_credentials() extraction in run_agent.py (refactoring, not a safety fix)

Files changed

| File | Change |
| --- | --- |
| tools/delegate_tool.py | Split build/run, remove redirect, use lock |
| hermes_state.py | Add threading.Lock to all DB methods |
| run_agent.py | Add _active_children_lock, use in interrupt() |
| agent/auxiliary_client.py | Add _client_cache_lock, double-checked locking |
| 6 test files | Update for new _run_single_child signature + add _active_children_lock |

Tests

Full suite: 4911 passed, 8 pre-existing failures (unrelated), 200 skipped.

Credit

Original implementation by @peteromallet (PR #1471).
Closes #1471

peteromallet and others added 2 commits March 17, 2026 02:51
Five thread-safety fixes that prevent crashes and data races when
running multiple subagents concurrently via delegate_task:

1. Remove redirect_stdout/stderr from delegate_tool — mutating global
   sys.stdout races with the spinner thread when multiple children start
   concurrently, causing segfaults. Children already run with
   quiet_mode=True so the redirect was redundant.

2. Split _run_single_child into _build_child_agent (main thread) +
   _run_single_child (worker thread). AIAgent construction creates
   httpx/SSL clients which are not thread-safe to initialize
   concurrently.

3. Add threading.Lock to SessionDB — subagents share the parent's
   SessionDB and call create_session/append_message from worker threads
   with no synchronization.

4. Add _active_children_lock to AIAgent — interrupt() iterates
   _active_children while worker threads append/remove children.

5. Add _client_cache_lock to auxiliary_client — multiple subagent
   threads may resolve clients concurrently via call_llm().

Based on PR #1471 by peteromallet.
…type

Two features salvaged from PR #1576:

1. Honcho base_url override: allows pointing Hermes at a remote
   self-hosted Honcho deployment via config.yaml:

     honcho:
       base_url: "http://192.168.x.x:8000"

   When set, this overrides the Honcho SDK's environment mapping
   (production/local), enabling LAN/VPN Honcho deployments without
   requiring the server to live on localhost. Uses config.yaml instead
   of env var (HONCHO_URL) per project convention.

2. Quick command alias type: adds a new 'alias' quick command type
   that rewrites to another slash command before normal dispatch:

     quick_commands:
       sc:
         type: alias
         target: /context

   Supports both CLI and gateway. Arguments are forwarded to the
   target command.

Based on PR #1576 by redhelix.
@teknium1 teknium1 force-pushed the hermes/hermes-3218df83 branch from 5d64871 to a6777ad Compare March 17, 2026 09:53
@teknium1 teknium1 merged commit 1d5a39e into main Mar 17, 2026
1 check failed


3 participants