fix(cron): prevent recurring job re-fire on gateway crash/restart loop by teknium1 · Pull Request #3396 · NousResearch/hermes-agent

teknium1 · 2026-03-27T14:09:43Z

Summary

When a gateway crashes mid-job execution (before mark_job_run can persist the updated next_run_at), recurring cron jobs fire again on every restart attempt within the grace window. For a daily 6:15 AM job with a 2-hour grace period, rapidly restarting the gateway could trigger dozens of duplicate runs.

Root cause: tick() calls run_job() (which spawns a full agent session — potentially minutes of execution) before mark_job_run() updates next_run_at on disk. If the process dies between these two calls, the next restart finds the job still due and fires it again.

Reported by: ludw1OP (DietPi user). Their gateway was unstable because of a missing dbus-user-session package, and the repeated restarts flooded their Telegram with duplicate morning wake-up messages.

Changes

cron/jobs.py — Added advance_next_run(job_id): for recurring jobs (cron/interval), preemptively computes and persists the next future next_run_at before execution begins. One-shot jobs are left alone so they can retry on restart.
cron/scheduler.py — tick() now calls advance_next_run() before run_job(). If the process crashes mid-run, the persisted next_run_at is already in the future, preventing re-fire.
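
The ordering described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the in-memory `JOBS` dict stands in for the on-disk job store, the `now` parameter and the `kind` / `every_seconds` fields are hypothetical, and only interval jobs are handled (cron expressions are omitted).

```python
from datetime import datetime, timedelta

# Hypothetical in-memory stand-in for the persisted job store.
JOBS = {}


def advance_next_run(job_id, now):
    """Move a recurring job's next_run_at past `now`. Return True if it changed."""
    job = JOBS.get(job_id)
    if job is None or job["kind"] != "interval":
        return False  # unknown job or one-shot: leave it alone so it can retry
    if job["next_run_at"] > now:
        return False  # already in the future, nothing to do
    step = timedelta(seconds=job["every_seconds"])
    # Skip every missed period so the result is strictly after `now`.
    periods = (now - job["next_run_at"]) // step + 1
    job["next_run_at"] = job["next_run_at"] + periods * step
    return True


def tick(now, run_job):
    """Run due jobs, advancing each recurring job BEFORE executing it."""
    for job_id, job in list(JOBS.items()):
        if job["next_run_at"] <= now:
            advance_next_run(job_id, now)  # persisted before the long-running call
            run_job(job_id)                # a crash here no longer leaves the job due
```

If `run_job` raises or the process dies inside it, the stored `next_run_at` for a recurring job is already in the future, so a later `tick()` skips it.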

Semantics change

Recurring jobs move from at-least-once to at-most-once delivery. Missing one scheduled run due to a crash is far better than sending dozens of duplicates in a crash loop. mark_job_run() still runs after successful execution and re-confirms the next run time.
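
To make the tradeoff concrete, here is a small simulation of a crash loop under both orderings. It is a sketch under simplifying assumptions: `count_fires` is a hypothetical helper, the clock is held fixed across restarts, the grace window is ignored, and every run is assumed to crash before the post-run update.

```python
from datetime import datetime, timedelta


def count_fires(advance_first, restarts, now, next_run_at, interval):
    """Count how often a job starts when every run crashes before completing."""
    state = {"next_run_at": next_run_at}  # stands in for the on-disk record
    fires = 0
    for _ in range(restarts):
        if state["next_run_at"] > now:
            continue  # job is not due after this restart
        if advance_first:
            # New ordering: persist a future next_run_at before running.
            while state["next_run_at"] <= now:
                state["next_run_at"] += interval
        fires += 1
        # Simulated crash: the post-run update never reaches disk.
    return fires


due = datetime(2026, 3, 27, 6, 15)
now = datetime(2026, 3, 27, 6, 20)
day = timedelta(days=1)

print(count_fires(False, 12, now, due, day))  # old ordering: 12
print(count_fires(True, 12, now, due, day))   # new ordering: 1
```

With the old ordering the job starts once per restart. With the new ordering it starts once, and that single run is lost if the crash interrupts it.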

Test plan

7 new tests covering: interval advance, cron advance, one-shot skip, nonexistent job, already-future jobs, crash-safety scenario, and tick call ordering
Full suite: 6445 passed, 0 failed
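
As an illustration of the "tick call ordering" case, a test along these lines could assert that the advance happens before execution. This is not the test from the PR: the `tick` below is a simplified stand-in that takes its collaborators as arguments, which the real scheduler may not do.

```python
from unittest.mock import Mock, call


def tick(due_job_ids, advance_next_run, run_job):
    """Simplified stand-in for the scheduler loop (hypothetical signature)."""
    for job_id in due_job_ids:
        advance_next_run(job_id)
        run_job(job_id)


def test_tick_advances_before_running():
    # A single parent Mock records calls to its children in order.
    calls = Mock()
    tick(["daily"], calls.advance_next_run, calls.run_job)
    assert calls.mock_calls == [
        call.advance_next_run("daily"),
        call.run_job("daily"),
    ]
```

Routing both collaborators through one parent `Mock` is what makes the relative order observable. Two independent mocks would only show that each was called.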

When a gateway crashes mid-job execution (before mark_job_run can persist the updated next_run_at), the job would fire again on every restart attempt within the grace window. For a daily 6:15 AM job with a 2-hour grace, rapidly restarting the gateway could trigger dozens of duplicate runs.

Fix: call advance_next_run() BEFORE run_job() in tick(). For recurring jobs (cron/interval), this preemptively advances next_run_at to the next future occurrence and persists it to disk. If the process then crashes during execution, the job won't be considered due on restart. One-shot jobs are left unchanged: they still retry on restart since there's no future occurrence to advance to.

This changes the scheduler from at-least-once to at-most-once semantics for recurring jobs, which is the correct tradeoff: missing one daily message is far better than sending it dozens of times.

teknium1 merged commit eb2127c into main Mar 27, 2026
2 checks passed

teknium1 mentioned this pull request Mar 27, 2026

Response truncated due to output length limit #2706

Closed

teknium1 merged 1 commit into main from hermes/hermes-9420d6a3