
fix(cron): prevent recurring job re-fire on gateway crash/restart loop #3396

Merged
teknium1 merged 1 commit into main from hermes/hermes-9420d6a3
Mar 27, 2026

Conversation

@teknium1
Contributor

Summary

When a gateway crashes mid-job execution (before mark_job_run can persist the updated next_run_at), recurring cron jobs fire again on every restart attempt within the grace window. For a daily 6:15 AM job with a 2-hour grace period, rapidly restarting the gateway could trigger dozens of duplicate runs.

Root cause: tick() calls run_job() (which spawns a full agent session — potentially minutes of execution) before mark_job_run() updates next_run_at on disk. If the process dies between these two calls, the next restart finds the job still due and fires it again.
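
The race window can be sketched with a minimal in-memory stand-in for the on-disk job state. The names (tick, run_job, mark_job_run, next_run_at) mirror the PR; the bodies are illustrative, not the actual Hermes code:

```python
# Minimal reproduction of the pre-fix ordering. The in-memory "store"
# stands in for the persisted job state on disk.
store = {"job": {"next_run_at": 100.0, "interval": 86400.0}}

def run_job(job_id):
    # Stands in for spawning a full agent session (potentially minutes).
    raise SystemExit("gateway crashed mid-run")  # simulate the crash

def mark_job_run(job_id, now):
    # Never reached on crash -- next_run_at is never advanced.
    store[job_id]["next_run_at"] = now + store[job_id]["interval"]

def tick(now):
    job = store["job"]
    if job["next_run_at"] <= now:  # job is due
        run_job("job")             # crash happens here...
        mark_job_run("job", now)   # ...so this persist never happens

try:
    tick(now=100.0)
except SystemExit:
    pass

# On restart, the job is still due within the grace window -> it fires
# again, on every restart attempt.
assert store["job"]["next_run_at"] <= 100.0
```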

Reported by: ludw1 (OP; DietPi user) — their gateway was unstable due to a missing dbus-user-session package, causing repeated restarts that flooded their Telegram with duplicate morning wake-up messages.

Changes

  • cron/jobs.py — Added advance_next_run(job_id): for recurring jobs (cron/interval), preemptively computes and persists the next future next_run_at before execution begins. One-shot jobs are left alone so they can retry on restart.
  • cron/scheduler.py — tick() now calls advance_next_run() before run_job(). If the process crashes mid-run, the persisted next_run_at is already in the future, preventing re-fire.
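
A minimal sketch of the fixed ordering, assuming an interval job and an in-memory stand-in for persistence (advance_next_run, run_job, and mark_job_run are named in the PR; their bodies here are assumptions):

```python
# Sketch of the fixed tick() ordering: persist the advanced next_run_at
# BEFORE execution begins, so a mid-run crash cannot cause a re-fire.
store = {"job": {"next_run_at": 100.0, "interval": 86400.0, "kind": "interval"}}

def advance_next_run(job_id, now):
    job = store.get(job_id)
    if job is None or job["kind"] == "one_shot":
        return  # one-shot jobs are left alone so they retry on restart
    nxt = job["next_run_at"]
    while nxt <= now:             # skip forward to the next *future* slot
        nxt += job["interval"]
    job["next_run_at"] = nxt      # persisted before execution begins

def run_job(job_id):
    raise SystemExit("gateway crashed mid-run")  # simulate the crash

def mark_job_run(job_id, now):
    # Still runs after a successful execution to re-confirm the next run.
    store[job_id]["next_run_at"] = now + store[job_id]["interval"]

def tick(now):
    job = store["job"]
    if job["next_run_at"] <= now:
        advance_next_run("job", now)  # persist first...
        run_job("job")                # ...so a crash here can't re-fire
        mark_job_run("job", now)

try:
    tick(now=100.0)
except SystemExit:
    pass

# On restart, next_run_at is already in the future: no duplicate fire.
assert store["job"]["next_run_at"] > 100.0
```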

Semantics change

Recurring jobs move from at-least-once to at-most-once delivery. Missing one scheduled run due to a crash is far better than sending dozens of duplicates in a crash loop. mark_job_run() still runs after successful execution and re-confirms the next run time.

Test plan

  • 7 new tests covering: interval advance, cron advance, one-shot skip, nonexistent job, already-future jobs, crash-safety scenario, and tick call ordering
  • Full suite: 6445 passed, 0 failed
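
Three of those edge cases (one-shot skip, nonexistent job, already-future) can be sketched as pytest-style tests against a self-contained stand-in; the real signatures and storage are assumptions, only the behaviors come from the PR:

```python
# Stand-in for advance_next_run with an explicit store argument so each
# test is isolated. Returns True only when next_run_at was advanced.
def advance_next_run(store, job_id, now):
    job = store.get(job_id)
    if job is None or job["kind"] == "one_shot":
        return False                 # nonexistent / one-shot: no change
    if job["next_run_at"] > now:
        return False                 # already in the future: no-op
    nxt = job["next_run_at"]
    while nxt <= now:
        nxt += job["interval"]
    job["next_run_at"] = nxt
    return True

def test_one_shot_skip():
    store = {"j": {"kind": "one_shot", "next_run_at": 50.0}}
    assert advance_next_run(store, "j", now=100.0) is False
    assert store["j"]["next_run_at"] == 50.0  # still due -> retries on restart

def test_nonexistent_job():
    assert advance_next_run({}, "missing", now=100.0) is False

def test_already_future():
    store = {"j": {"kind": "interval", "next_run_at": 500.0, "interval": 60.0}}
    assert advance_next_run(store, "j", now=100.0) is False
    assert store["j"]["next_run_at"] == 500.0  # untouched

test_one_shot_skip(); test_nonexistent_job(); test_already_future()
```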
When a gateway crashes mid-job execution (before mark_job_run can persist
the updated next_run_at), the job would fire again on every restart attempt
within the grace window. For a daily 6:15 AM job with a 2-hour grace,
rapidly restarting the gateway could trigger dozens of duplicate runs.

Fix: call advance_next_run() BEFORE run_job() in tick(). For recurring
jobs (cron/interval), this preemptively advances next_run_at to the next
future occurrence and persists it to disk. If the process then crashes
during execution, the job won't be considered due on restart.

One-shot jobs are left unchanged — they still retry on restart since
there's no future occurrence to advance to.

This changes the scheduler from at-least-once to at-most-once semantics
for recurring jobs, which is the correct tradeoff: missing one daily
message is far better than sending it dozens of times.
@teknium1 teknium1 merged commit eb2127c into main Mar 27, 2026
2 checks passed
StreamOfRon pushed a commit to StreamOfRon/hermes-agent that referenced this pull request Mar 29, 2026
fix(cron): prevent recurring job re-fire on gateway crash/restart loop (NousResearch#3396)
