fix(gateway): retry Telegram 409 polling conflicts before giving up #2297

Closed
robbyczgw-cla wants to merge 1 commit into NousResearch:main from robbyczgw-cla:fix/telegram-polling-conflict-retry

@robbyczgw-cla

Summary

A single Telegram 409 Conflict from getUpdates permanently killed Telegram polling with no recovery possible. This PR adds retry logic so transient conflicts during gateway restarts resolve automatically.

Problem

_handle_polling_conflict() (introduced in 5a2fcaa) calls _set_fatal_error("telegram_polling_conflict", ..., retryable=False) on the first 409 error. While this prevents endless retry-spam from genuine dual-instance conflicts, it is too aggressive for production deployments with process supervisors.

Transient 409s are expected during:

  • --replace handoffs: SIGTERM kills the old gateway, but Telegram's server may still hold the previous long-poll session for a few seconds
  • Restart=on-failure respawns: systemd restarts the gateway after a 409-triggered exit, and the new instance's first poll overlaps with the dying instance's cleanup

Additionally, python-telegram-bot's built-in network_retry_loop already handles 409s with exponential backoff (max_retries=-1), but the error_callback overrides this by immediately marking the error as fatal and stopping the updater.

Changes

gateway/platforms/telegram.py:

  • _handle_polling_conflict() now retries up to 3 times with a 10-second delay between attempts
  • On successful retry, the conflict counter resets to 0
  • If a retry's start_polling() call fails, it returns and waits for the next conflict to trigger another attempt
  • After exhausting all retries, the error is marked permanently fatal (same behavior as before)
  • New instance attributes _polling_conflict_count and _polling_error_callback_ref initialized in __init__
  • The error callback reference is stored during connect() for reuse in retries
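The retry flow described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the class `PollingConflictSketch`, its injected `start_polling` callable, the `fatal_error` attribute, and the constant names are all hypothetical stand-ins, and the exact point at which the counter is compared against the limit is an assumption. Only the retry count of 3, the 10-second delay, the counter reset on success, and the `_set_fatal_error("telegram_polling_conflict", ..., retryable=False)` call come from the description.

```python
import asyncio

MAX_CONFLICT_RETRIES = 3      # retry count stated in the PR description
CONFLICT_RETRY_DELAY = 10.0   # seconds between attempts, per the description


class PollingConflictSketch:
    """Hypothetical stand-in for the Telegram adapter; not the real class."""

    def __init__(self, start_polling, retry_delay=CONFLICT_RETRY_DELAY):
        # start_polling: async callable that restarts polling (assumed shape).
        self._start_polling = start_polling
        self._retry_delay = retry_delay
        self._polling_conflict_count = 0
        self.fatal_error = None  # hypothetical record of the fatal state

    def _set_fatal_error(self, code, message, retryable):
        # The real adapter also stops the updater and notifies a handler;
        # this sketch only records the arguments.
        self.fatal_error = (code, message, retryable)

    async def _handle_polling_conflict(self, error):
        self._polling_conflict_count += 1

        # Retries exhausted: fall back to the original fatal behavior.
        if self._polling_conflict_count > MAX_CONFLICT_RETRIES:
            self._set_fatal_error(
                "telegram_polling_conflict", str(error), retryable=False
            )
            return

        # Give a stale server-side long-poll session time to expire.
        await asyncio.sleep(self._retry_delay)
        try:
            await self._start_polling()
        except Exception:
            # Restart failed: return and let the next conflict trigger
            # another attempt, leaving the counter untouched.
            return

        # Polling restarted cleanly, so the conflict is considered resolved.
        self._polling_conflict_count = 0
```

Passing `retry_delay=0` makes the sketch easy to exercise in tests without waiting through the real delay; with a `start_polling` that always raises, the fourth conflict in a row is the one that sets the fatal error under this sketch's assumptions.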

tests/gateway/test_telegram_conflict.py:

  • Split test_polling_conflict_stops_polling_and_notifies_handler into two tests:
    • test_polling_conflict_retries_before_fatal: verifies a single 409 triggers a retry (not fatal)
    • test_polling_conflict_becomes_fatal_after_retries: verifies fatal error after exhausting retries

How to test

  1. Run the gateway via systemd with Restart=on-failure
  2. systemctl --user restart hermes-gateway — transient 409 during handoff should auto-recover
  3. Run two gateway instances with the same bot token — after 3 failed retries, polling stops (same as before)
  4. pytest tests/gateway/test_telegram_conflict.py -v — all 5 tests pass

Platform tested

  • Linux (Ubuntu 24.04, kernel 6.17)

Closes #2296

A single Telegram 409 Conflict from getUpdates permanently killed
Telegram polling with no recovery possible (retryable=False on
first occurrence).  This is too aggressive for production use with
process supervisors.

Transient 409s are expected during:
- --replace handoffs where the old long-poll session lingers on
  Telegram servers for a few seconds after SIGTERM
- systemd Restart=on-failure respawns that overlap with the dying
  instance cleanup

Now _handle_polling_conflict() retries up to 3 times with a
10-second delay between attempts.  The 30-second total retry window
lets stale server-side sessions expire.  If all retries fail, the
error is still marked as permanently fatal — preserving the original
protection against genuine dual-instance conflicts.

Tests updated: split the single conflict test into two — one verifying
retry on transient conflict, one verifying fatal after exhausted
retries.

Closes NousResearch#2296
@teknium1
Contributor

Merged via PR #2312. Your commit was cherry-picked onto current main with authorship preserved. Excellent issue report and clean fix — thanks for the contribution!

@teknium1 teknium1 closed this Mar 21, 2026