fix: thread safety for concurrent subagent delegation#1672
Merged
Conversation
13 tasks
Four thread-safety fixes that prevent crashes and data races when running multiple subagents concurrently via delegate_task: 1. Remove redirect_stdout/stderr from delegate_tool — mutating global sys.stdout races with the spinner thread when multiple children start concurrently, causing segfaults. Children already run with quiet_mode=True so the redirect was redundant. 2. Split _run_single_child into _build_child_agent (main thread) + _run_single_child (worker thread). AIAgent construction creates httpx/SSL clients which are not thread-safe to initialize concurrently. 3. Add threading.Lock to SessionDB — subagents share the parent's SessionDB and call create_session/append_message from worker threads with no synchronization. 4. Add _active_children_lock to AIAgent — interrupt() iterates _active_children while worker threads append/remove children. 5. Add _client_cache_lock to auxiliary_client — multiple subagent threads may resolve clients concurrently via call_llm(). Based on PR #1471 by peteromallet.
…type Two features salvaged from PR #1576: 1. Honcho base_url override: allows pointing Hermes at a remote self-hosted Honcho deployment via config.yaml: honcho: base_url: "http://192.168.x.x:8000" When set, this overrides the Honcho SDK's environment mapping (production/local), enabling LAN/VPN Honcho deployments without requiring the server to live on localhost. Uses config.yaml instead of env var (HONCHO_URL) per project convention. 2. Quick command alias type: adds a new 'alias' quick command type that rewrites to another slash command before normal dispatch: quick_commands: sc: type: alias target: /context Supports both CLI and gateway. Arguments are forwarded to the target command. Based on PR #1576 by redhelix.
5d64871 to
a6777ad
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Salvage of PR #1471 by @peteromallet — thread safety fixes for concurrent subagent delegation.
The problem
Running 3+ subagents concurrently via
delegate_taskin batch mode causes segfaults, data corruption, and intermittent crashes from four distinct race conditions.Fixes
1. Remove
redirect_stdout/redirect_stderrfrom delegate_toolcontextlib.redirect_stdoutmutates the globalsys.stdout. When multiple child agents start concurrently in aThreadPoolExecutor, the race between redirect and the spinner thread corrupts the file descriptor, causing segfaults. The redirect was redundant — children already run withquiet_mode=True.2. Split agent construction from execution
_run_single_child()→_build_child_agent()(main thread, serial) +_run_single_child()(worker thread, parallel).AIAgentconstruction creates httpx clients and initializes SSL contexts, which are not thread-safe to do concurrently.3. Add
threading.LocktoSessionDBSubagents share the parent's
SessionDBand callcreate_session(),append_message(), etc. from worker threads with no synchronization. Every database-accessing method is now wrapped inwith self._lock:.4. Add
_active_children_locktoAIAgentinterrupt()iterates_active_childrenwhile worker threads append/remove children. Now copies the list under lock before iterating.5. Add
_client_cache_locktoauxiliary_clientMultiple subagent threads may resolve auxiliary clients concurrently via
call_llm(). Double-checked locking pattern prevents duplicate client creation.What was NOT included from the original PR
model/provideroverrides indelegate_taskschema (feature addition, not a safety fix)resolve_provider_credentials()helper (utility, not needed for the safety fixes)_apply_provider_credentials()extraction inrun_agent.py(refactoring, not a safety fix)Files changed
tools/delegate_tool.pyhermes_state.pythreading.Lockto all DB methodsrun_agent.py_active_children_lock, use ininterrupt()agent/auxiliary_client.py_client_cache_lock, double-checked locking_run_single_childsignature + add_active_children_lockTests
Full suite: 4911 passed, 8 pre-existing failures (unrelated), 200 skipped.
Credit
Original implementation by @peteromallet (PR #1471).
Closes #1471