fix(cron): scale missed-job grace window with schedule frequency #2112

Closed
ticketclosed-wontfix wants to merge 1 commit into NousResearch:main from ticketclosed-wontfix:fix/cron-missed-job-grace-window

@ticketclosed-wontfix
Contributor

What

Replaces the hardcoded 120-second grace window in get_due_jobs() with a dynamic window that scales with the job's scheduling frequency.

Formula: min(period / 2, 2 hours), floored at 120 seconds.

Schedule      | Grace window
Every 5 min   | 2.5 min
Every 10 min  | 5 min
Hourly        | 30 min
Daily         | 2 hours (capped)
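
The formula can be sketched in a few lines. This is an illustration only, not the code from the PR: the function name grace_for_period and the idea of passing the period as a plain number of seconds are assumptions, since the real _compute_grace_seconds(schedule) helper takes a schedule object whose shape is not shown here.

```python
# Hypothetical sketch of the grace formula: min(period / 2, 2 hours), floored at 120 s.
# Names and the "period in seconds" input are assumptions for illustration.
MIN_GRACE_SECONDS = 120          # floor: the previous hardcoded value
MAX_GRACE_SECONDS = 2 * 60 * 60  # cap: 2 hours


def grace_for_period(period_seconds: float) -> float:
    """Return half the schedule period, clamped to [120 s, 7200 s]."""
    capped = min(period_seconds / 2, MAX_GRACE_SECONDS)
    return max(capped, MIN_GRACE_SECONDS)


if __name__ == "__main__":
    examples = [
        ("Every 5 min", 5 * 60),
        ("Every 10 min", 10 * 60),
        ("Hourly", 60 * 60),
        ("Daily", 24 * 60 * 60),
    ]
    for label, period in examples:
        print(f"{label}: {grace_for_period(period) / 60:g} min")
```

A schedule that fires every minute would get half a minute from the period alone, so the 120-second floor applies and the behaviour matches the old hardcoded window.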

Why

The current hardcoded 120-second threshold silently skips recurring cron jobs that fire even 2 minutes late. For daily jobs this is problematic — a brief gateway reconnect, network blip, or load spike at the scheduled time causes the job to be fast-forwarded to the next day with no retry.

This was hit in production: a ~1-minute gateway reconnect at 07:05 caused a daily briefing (scheduled at 5 7 * * *) to be silently skipped. The other two daily jobs at 07:00 and 07:10 ran fine — only the one whose tick landed during the reconnect was lost.

Changes

  • cron/jobs.py: Added _compute_grace_seconds(schedule) helper that calculates grace as half the schedule period, clamped to [120s, 7200s]. Updated get_due_jobs() to use it instead of the hardcoded 120. Updated the log message to include the grace value for observability.
  • tests/cron/test_jobs.py: Updated two existing tests to match the new dynamic grace behaviour for hourly jobs.
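
As a rough sketch of the decision that get_due_jobs() makes with the dynamic window: a job that is late by no more than its grace window still fires, and a job that is later than that is fast-forwarded. The function classify_overdue, its return labels, and the inclusive boundary are assumptions for illustration, not the actual code in cron/jobs.py.

```python
from datetime import datetime, timedelta, timezone

MIN_GRACE_SECONDS = 120          # floor from the PR description
MAX_GRACE_SECONDS = 2 * 60 * 60  # 2-hour cap from the PR description


def grace_seconds(period_seconds: float) -> float:
    """Half the period, clamped to [120 s, 7200 s]."""
    return max(MIN_GRACE_SECONDS, min(period_seconds / 2, MAX_GRACE_SECONDS))


def classify_overdue(next_run_at: datetime, now: datetime, period_seconds: float) -> str:
    """Hypothetical helper: decide what to do with a job at tick time.

    Returns "not_due", "fire", or "fast_forward". Treating lateness equal to
    the grace window as still in time is an assumption.
    """
    lateness = (now - next_run_at).total_seconds()
    if lateness < 0:
        return "not_due"
    if lateness <= grace_seconds(period_seconds):
        return "fire"
    return "fast_forward"


if __name__ == "__main__":
    now = datetime(2026, 3, 22, 9, 0, tzinfo=timezone.utc)
    daily = 24 * 60 * 60
    # Daily job missed by 10 minutes: inside the 2-hour grace window.
    print(classify_overdue(now - timedelta(minutes=10), now, daily))
    # Daily job missed by 3 hours: outside the window.
    print(classify_overdue(now - timedelta(hours=3), now, daily))
```

The two calls in the main block mirror the manual test cases for a daily job: 10 minutes late and 3 hours late.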

How to test

  1. Create a daily cron job: hermes cron create "test" --schedule "0 9 * * *" --name test-daily
  2. Manually set its next_run_at to 10 minutes ago in ~/.hermes/cron/jobs.json
  3. Run python -c "from cron.scheduler import tick; tick()" — the job should fire (within 2h grace)
  4. Set next_run_at to 3 hours ago — the job should be fast-forwarded, not fired

Platforms tested

  • Linux (Ubuntu, production gateway)

Note

Happy to adjust the approach if you'd prefer to make the grace window user-configurable (e.g. a per-job grace_seconds field or a global config option), or if you'd rather use a different calculation for the grace period. Open to feedback on the formula.

Replace hardcoded 120-second grace period with a dynamic window that
scales with the job's scheduling frequency (half the period, clamped
to [120s, 2h]). Daily jobs now catch up if missed by up to 2 hours
instead of being silently skipped after just 2 minutes.
@teknium1
Contributor

Merged via #2449. Cherry-picked with authorship preserved. Great fix — the production scenario in the PR description made this easy to evaluate. Thanks!

@teknium1 closed this Mar 22, 2026