Background processes are lost when hermes-gateway restarts #1144

@spanishflu-est1918

Summary

When hermes-gateway restarts, background processes started from a messaging session can be lost.

This causes Hermes to later report that the process ID no longer exists, even though the job was launched successfully and the user expects Hermes to keep tracking it.

Why this is a bug

A gateway restart should not make Hermes forget user-started background jobs.

At minimum, Hermes should persist enough metadata to recover tracking after restart:

  • process/session id
  • pid
  • originating platform/chat/thread
  • command
  • output/log paths
  • current status
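The fields above could be captured in a small record that is written to disk atomically, so a restart mid-write cannot leave a truncated file behind. This is a minimal sketch, not Hermes's actual code: `JobRecord`, `save_records`, `load_records`, and the field names are all hypothetical, and the storage path is left to the caller.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Optional


@dataclass
class JobRecord:
    # Hypothetical schema mirroring the list above; not Hermes's real format.
    session_id: str            # e.g. a proc_* session id
    pid: int
    platform: str              # originating platform, e.g. "telegram"
    chat_id: str
    thread_id: Optional[str]
    command: str
    log_path: str              # where stdout/stderr are captured
    status: str                # e.g. "running" or "exited"


def save_records(path: Path, records: List[JobRecord]) -> None:
    """Write all records to a temp file, then atomically swap it into place."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=str(path.parent), suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump([asdict(r) for r in records], fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic rename on the same filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def load_records(path: Path) -> List[JobRecord]:
    """Load records on startup; a missing file means no known jobs."""
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [JobRecord(**item) for item in json.load(fh)]
```

The key design point is that startup calls `load_records` instead of starting from an empty registry, and every state change goes through `save_records`.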

If the child process is still alive, Hermes should reattach.
If the child process died because of the restart, Hermes should still report that clearly with preserved stderr/exit state.
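Whether a child survived the restart can be checked from its saved pid. The sketch below assumes a POSIX host and uses a hypothetical helper name, `pid_is_alive`; it sends signal 0, which performs the existence and permission checks without delivering a signal.

```python
import os


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this pid currently exists (POSIX only)."""
    if pid <= 0:
        return False
    try:
        os.kill(pid, 0)          # signal 0: existence check only
    except ProcessLookupError:
        return False             # no such process
    except PermissionError:
        return True              # exists, but owned by another user
    return True
```

Two caveats apply. Pids can be reused, so a real check would likely also compare something like the process start time or command line against the saved record. And after a restart the gateway is no longer the parent of the job, so it cannot collect the exit code itself; preserving exit state would probably need a wrapper that writes the code to a file next to the log.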

Right now the behavior is effectively:

  • background job starts
  • gateway restarts
  • process registry is reset
  • later poll returns No process with ID ...
  • user loses the job state and useful failure information

Repro

  1. Start hermes-gateway
  2. Launch a long-running background command from Telegram
  3. Restart hermes-gateway
  4. Ask Hermes to poll/check the background process

Observed

Hermes reports the background process is gone / not found.

Example:

  • process launched successfully with a proc_* session id
  • after gateway restart, Hermes responds with:
    • No process with ID ...
    • or says the handle is gone
  • ~/.hermes/processes.json is reset on restart, so tracking state does not survive it

Expected

One of:

  1. Hermes reattaches to the existing background process and continues tracking it
  2. Hermes restores durable job state and can report final outcome/log path after restart
  3. If the restart terminates the child, Hermes reports that explicitly as a restart-induced termination, not “process not found”

Likely root cause

Background process tracking appears to depend on in-memory state and on a processes.json file that is rewritten on restart.

The gateway should persist background job metadata durably and recover it on boot.

Impact

High for messaging users:

  • long backtests/research jobs become unreliable
  • users lose output and failure traces
  • restart during an active job silently breaks trust in background execution

Suggested fix direction

  • persist background watcher/job metadata durably
  • recover watchers on gateway startup
  • attempt reattach by pid/session id
  • preserve stderr/exit info across restart
  • distinguish “process exited” from “registry forgot process”
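As a rough illustration of the last point, startup recovery could resolve each saved job to an explicit status, and polling could then answer differently for "we know this job and it ended" versus "we have no record at all". Everything here is hypothetical (function names, status strings, record keys), and `terminated_by_restart` is only an inference: the process was marked running, is now gone, and left no saved exit code.

```python
from typing import Dict, Optional


def resolve_status(record: Dict, alive: bool, exit_code: Optional[int]) -> str:
    """Decide a saved job's status when the gateway boots."""
    if record["status"] != "running":
        return record["status"]        # already final before the restart
    if alive:
        return "running"               # candidate for reattach
    if exit_code is not None:
        return "exited"                # finished and its exit code was saved
    return "terminated_by_restart"     # gone without a saved exit code


def poll_message(registry: Dict[str, Dict], session_id: str) -> str:
    """Build a user-facing reply that never conflates the two failure modes."""
    record = registry.get(session_id)
    if record is None:
        return f"No record of {session_id}"
    status = record["status"]
    if status == "running":
        return f"{session_id} is still running (pid {record['pid']})"
    if status == "exited":
        return (f"{session_id} exited with code {record.get('exit_code')}; "
                f"log: {record['log_path']}")
    if status == "terminated_by_restart":
        return (f"{session_id} stopped during a gateway restart; "
                f"log: {record['log_path']}")
    return f"{session_id} has status {status}"
```

With this split, the bare "no record" reply is reserved for ids the gateway has truly never seen, and every known job keeps its log path in the reply.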
