fix(gateway): recover from hung agents — /stop hard-kills session lock#3104

Merged
teknium1 merged 1 commit into main from hermes/hermes-9f3f51e2 on Mar 26, 2026
@teknium1
Contributor

Summary

Salvage of PR #2498 by @Mibayy onto current main.

When an agent thread hangs (truly blocked, never checks _interrupt_requested), /stop now force-cleans _running_agents to unlock the session immediately. Previously, /stop called agent.interrupt() which sets a flag the hung agent never reads — the session stayed locked forever, showing "writing..." with no output.
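The failure mode can be reproduced in isolation. The following is a minimal sketch, assuming a cooperative flag like `_interrupt_requested`; `FlagAgent`, `run_cooperative`, `run_hung`, and `stops_after_interrupt` are hypothetical names for illustration, not the gateway's real code.

```python
import threading
import time


class FlagAgent:
    """Hypothetical stand-in for an agent; not the project's real class."""

    def __init__(self, block_event):
        self._interrupt_requested = False
        self._block = block_event  # stands in for a call that never returns
        self.finished = threading.Event()

    def interrupt(self):
        # Only sets a flag; the worker must poll it to notice.
        self._interrupt_requested = True

    def run_cooperative(self):
        # Healthy agent: checks the flag between steps.
        while not self._interrupt_requested:
            time.sleep(0.01)
        self.finished.set()

    def run_hung(self):
        # Hung agent: blocked inside a call, so the flag is never read.
        self._block.wait()
        self.finished.set()


def stops_after_interrupt(method_name, wait=0.3):
    """Return True if the worker finishes within `wait` seconds of interrupt()."""
    block = threading.Event()
    agent = FlagAgent(block)
    worker = threading.Thread(target=getattr(agent, method_name), daemon=True)
    worker.start()
    agent.interrupt()
    stopped = agent.finished.wait(timeout=wait)
    block.set()  # release the hung worker so the demo exits cleanly
    return stopped
```

Calling `stops_after_interrupt("run_cooperative")` reports that the worker stopped, while `stops_after_interrupt("run_hung")` does not, which is why a flag alone cannot unlock the session.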

Changes

Early /stop intercept — New block in the running-agent guard (following the existing /new intercept pattern) that catches /stop, calls interrupt() on the agent, then force-deletes the entry from _running_agents and clears pending messages. Returns immediately with a confirmation.
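A rough sketch of the intercept described above, under stated assumptions: `_running_agents` comes from this PR, while `StubAgent`, `ToyGateway`, `_pending_messages`, the queueing of other messages, and the reply strings are hypothetical and only illustrate the control flow.

```python
class StubAgent:
    """Hypothetical agent exposing only interrupt()."""

    def __init__(self):
        self.interrupted = False

    def interrupt(self):
        self.interrupted = True


class ToyGateway:
    """Hypothetical gateway; only the /stop path mirrors the PR description."""

    def __init__(self):
        self._running_agents = {}    # session key -> agent (name from the PR)
        self._pending_messages = {}  # session key -> queued texts (assumed name)

    def handle_message(self, session_key, text):
        if session_key in self._running_agents:
            if text.strip() == "/stop":
                # Best-effort cooperative interrupt first...
                self._running_agents[session_key].interrupt()
                # ...then drop the lock regardless of whether the agent reacts.
                self._running_agents.pop(session_key, None)
                self._pending_messages.pop(session_key, None)
                return "stopped"
            # Assumption: other messages wait while the agent is busy.
            self._pending_messages.setdefault(session_key, []).append(text)
            return "queued"
        return "no agent running"
```

The key design point is that the entry is removed unconditionally, so unlocking does not depend on the agent thread ever observing the interrupt.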

Sentinel /stop force-clean: /stop during agent startup now force-cleans the sentinel instead of returning "nothing to stop yet", so the session actually unlocks.
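For illustration, the sentinel case might look like the sketch below. It assumes the sentinel is a placeholder stored in the running-agents mapping while the real agent is still being created; `AGENT_STARTING`, `force_stop`, and the reply strings are hypothetical names, not taken from the codebase.

```python
AGENT_STARTING = object()  # hypothetical placeholder stored while an agent boots


def force_stop(running_agents, session_key):
    """Remove the session's entry whether it is a real agent or the sentinel."""
    entry = running_agents.pop(session_key, None)
    if entry is None:
        return "nothing to stop"
    if entry is not AGENT_STARTING:
        # A real agent still gets the cooperative interrupt.
        entry.interrupt()
    # The sentinel has no interrupt(); removing it is what unlocks the session.
    return "stopped"
```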

10-minute hard timeout — Wraps loop.run_in_executor() in asyncio.wait_for(timeout=600). On timeout, interrupts the agent and constructs a synthetic response. The thread keeps running (Python can't kill threads) but the session lock is released.
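The timeout pattern described here can be sketched as follows. This is an assumption-laden illustration, not the merged code: the commit message in this PR states that the 10-minute timeout was removed from the final commit. `run_with_timeout`, `on_timeout`, and the shape of the synthetic response are hypothetical.

```python
import asyncio
import threading


async def run_with_timeout(blocking_fn, timeout, on_timeout):
    """Run blocking_fn in the default executor, giving up after `timeout` seconds."""
    loop = asyncio.get_running_loop()
    future = loop.run_in_executor(None, blocking_fn)
    try:
        return await asyncio.wait_for(future, timeout=timeout)
    except asyncio.TimeoutError:
        # The worker thread cannot be killed; it keeps running in the background.
        on_timeout()  # e.g. request a cooperative interrupt
        # Synthetic response so the caller can release the session lock.
        return {"final_response": "Agent timed out.", "timed_out": True}
```

Note that `asyncio.run()` joins the default executor's threads on shutdown, so a truly hung worker can still delay process exit even after the session lock is released.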

Follow-up improvements over original PR

  • Consolidated duplicate resolve_command imports — single early resolution shared by /stop and /new intercepts
  • Updated _handle_stop_command() to also force-clean for consistency (both paths now behave identically)
  • Added zombie thread documentation on the timeout handler

Tests

  • Updated test 6 (sentinel /stop) to verify force-cleanup
  • Added test 6b: /stop hard-kills a running agent
  • Added test 6c: /stop clears pending messages

All 6178 tests pass.

Closes #2491. Cherry-picked from #2498 by @Mibayy.

When an agent thread hangs (truly blocked, never checks _interrupt_requested),
/stop now force-cleans _running_agents to unlock the session immediately.

Two changes:
- Early /stop intercept in the running-agent guard: bypasses normal command
  dispatch to force-interrupt and unlock the session. Follows the same pattern
  as the existing /new intercept.
- Sentinel /stop: force-cleans the sentinel instead of returning 'nothing to
  stop yet', so /stop during slow startup actually unlocks the session.

Follow-up improvements over original PR:
- Consolidated duplicate resolve_command imports into single early resolution
- Updated _handle_stop_command to also force-clean for consistency
- Removed 10-minute hard timeout on the executor (would kill legitimate
  long-running agent tasks; the /stop force-clean handles recovery)

Cherry-picked from Mibayy's PR #2498.
@teknium1 force-pushed the hermes/hermes-9f3f51e2 branch from 19742e2 to ebfbfa5 on March 26, 2026 01:38
@github-actions

⚠️ Supply Chain Risk Detected

This PR contains patterns commonly associated with supply chain attacks. This does not mean the PR is malicious — but these patterns require careful human review before merging.

⚠️ WARNING: Outbound network calls (POST/PUT)

Outbound POST/PUT requests in new code could be data exfiltration. Verify the destination URLs are legitimate.

Matches (first 10):

144:+        with urllib.request.urlopen(req, timeout=15) as resp:
225:+        with urllib.request.urlopen(req, timeout=10) as resp:

Automated scan triggered by supply-chain-audit. If this is a false positive, a maintainer can approve after manual review.

@teknium1 merged commit 59575d6 into main on Mar 26, 2026
1 of 2 checks passed
outsourc-e pushed a commit to outsourc-e/hermes-agent that referenced this pull request Mar 26, 2026
…ousResearch#3104)

Co-authored-by: Mibayy <Mibayy@users.noreply.github.com>
StreamOfRon pushed a commit to StreamOfRon/hermes-agent that referenced this pull request Mar 29, 2026
…ousResearch#3104)
