Background processes are lost when hermes-gateway restarts #1144
Description
Summary
When hermes-gateway restarts, background processes started from a messaging session can be lost.
This causes Hermes to later report that the process ID no longer exists, even though the job was launched successfully and the user expects Hermes to keep tracking it.
Why this is a bug
A gateway restart should not make Hermes forget user-started background jobs.
At minimum, Hermes should persist enough metadata to recover tracking after restart:
- process/session id
- pid
- originating platform/chat/thread
- command
- output/log paths
- current status
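As a rough illustration, the fields listed above could be captured in one serializable record per job. This is only a sketch: the `JobRecord` name, its field names, and the JSON layout are assumptions made for the example, not Hermes's actual schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class JobRecord:
    """Hypothetical durable record for one background job."""
    session_id: str            # process/session id, e.g. a proc_* handle
    pid: int                   # OS pid of the child process
    platform: str              # originating platform, e.g. "telegram"
    chat_id: str               # originating chat
    thread_id: Optional[str]   # originating thread, if any
    command: str               # command line that was launched
    stdout_path: str           # where stdout is being written
    stderr_path: str           # where stderr is being written
    status: str = "running"    # current status
    exit_code: Optional[int] = None


def to_json(record: JobRecord) -> str:
    """Serialize a record so it can be written to disk."""
    return json.dumps(asdict(record), sort_keys=True)


def from_json(text: str) -> JobRecord:
    """Rebuild a record from its serialized form."""
    return JobRecord(**json.loads(text))
```

Sending output to files (rather than pipes owned by the gateway) is what would let the log paths stay meaningful after the gateway process itself is replaced.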
If the child process is still alive, Hermes should reattach.
If the child process died because of the restart, Hermes should still report that clearly with preserved stderr/exit state.
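Deciding between those two cases requires a liveness check on the saved pid. A minimal POSIX-only sketch, assuming the gateway runs on Linux or macOS (the `pid_is_alive` helper is hypothetical, not an existing Hermes function):

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        # Signal 0 performs the existence and permission checks
        # without actually delivering a signal.
        os.kill(pid, 0)
    except ProcessLookupError:
        return False   # no such process
    except PermissionError:
        return True    # process exists but belongs to another user
    return True
```

A pid alone is not a reliable identity, because the OS can reuse it for an unrelated process. A real implementation would probably also compare something like the process start time or command line before reattaching.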
Right now the behavior is effectively:
- background job starts
- gateway restarts
- process registry is reset
- later poll returns "No process with ID ..."
- user loses the job state and useful failure information
Repro
- Start hermes-gateway
- Launch a long-running background command from Telegram
- Restart hermes-gateway
- Ask Hermes to poll/check the background process
Observed
Hermes reports the background process is gone / not found.
Example:
- process launched successfully with a proc_* session id
- after gateway restart, Hermes responds with: "No process with ID ..."
- or says the handle is gone
- ~/.hermes/processes.json is reset on restart, so in-memory tracking is lost
Expected
One of:
- Hermes reattaches to the existing background process and continues tracking it
- Hermes restores durable job state and can report final outcome/log path after restart
- If the restart terminates the child, Hermes reports that explicitly as a restart-induced termination, not “process not found”
Likely root cause
Background process tracking appears too dependent on in-memory state / processes.json state that is rewritten on restart.
The gateway should persist background job metadata durably and recover it on boot.
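One way to make the registry survive a restart is to load the existing file on boot instead of overwriting it, and to write updates atomically so a crash mid-write cannot leave a truncated file. The sketch below assumes a single JSON file keyed by session id; `save_registry` and `load_registry` are illustrative names, and the actual format of processes.json may differ.

```python
import json
import os
import tempfile


def save_registry(path, records):
    """Atomically write the registry: temp file, fsync, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".processes-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(records, fh, indent=2, sort_keys=True)
            fh.flush()
            os.fsync(fh.fileno())
        # os.replace swaps the file in one step, so readers never see a partial write.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def load_registry(path):
    """Load saved records on boot; a missing file means no jobs yet, not a reset."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}
```

On startup the gateway would call `load_registry` first and only call `save_registry` after merging in any changes, rather than starting from an empty registry.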
Impact
High for messaging users:
- long backtests/research jobs become unreliable
- users lose output and failure traces
- restart during an active job silently breaks trust in background execution
Suggested fix direction
- persist background watcher/job metadata durably
- recover watchers on gateway startup
- attempt reattach by pid/session id
- preserve stderr/exit info across restart
- distinguish “process exited” from “registry forgot process”