fix(gateway): retry Telegram 409 polling conflicts before giving up #2297
Closed
robbyczgw-cla wants to merge 1 commit into NousResearch:main from
A single Telegram 409 Conflict from getUpdates permanently killed Telegram polling with no recovery possible (retryable=False on first occurrence). This is too aggressive for production use with process supervisors.

Transient 409s are expected during:
- --replace handoffs where the old long-poll session lingers on Telegram servers for a few seconds after SIGTERM
- systemd Restart=on-failure respawns that overlap with the dying instance cleanup

Now _handle_polling_conflict() retries up to 3 times with a 10-second delay between attempts. The 30-second total retry window lets stale server-side sessions expire. If all retries fail, the error is still marked as permanently fatal, preserving the original protection against genuine dual-instance conflicts.

Tests updated: split the single conflict test into two, one verifying retry on transient conflict and one verifying fatal after exhausted retries.

Closes NousResearch#2296
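The retry budget described above can be sketched as a small counter-based policy. This is only an illustration of the decision rule, not the code from the PR: the class name `ConflictRetryPolicy`, its fields, and the `"retry"`/`"fatal"` return values are hypothetical, and it assumes the first three conflicts are retried and the fourth is treated as fatal.

```python
from dataclasses import dataclass


@dataclass
class ConflictRetryPolicy:
    """Hypothetical model of the bounded retry rule for 409 conflicts."""

    max_retries: int = 3          # retries allowed before giving up
    delay_seconds: float = 10.0   # wait between attempts
    conflict_count: int = 0       # conflicts seen so far

    def on_conflict(self) -> str:
        """Record one 409 conflict and return the action to take."""
        self.conflict_count += 1
        if self.conflict_count <= self.max_retries:
            return "retry"
        # Budget exhausted: assumed to be a genuine dual-instance conflict.
        return "fatal"

    @property
    def retry_window_seconds(self) -> float:
        """Total time spent waiting if every retry is used."""
        return self.max_retries * self.delay_seconds


policy = ConflictRetryPolicy()
decisions = [policy.on_conflict() for _ in range(4)]
```

With the defaults, `decisions` holds three retries followed by one fatal decision, and `retry_window_seconds` works out to the 30-second window mentioned in the commit message.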
This was referenced Mar 21, 2026
Contributor
Merged via PR #2312. Your commit was cherry-picked onto current main with authorship preserved. Excellent issue report and clean fix; thanks for the contribution!
Summary

A single Telegram `409 Conflict` from `getUpdates` permanently killed Telegram polling with no recovery possible. This PR adds retry logic so transient conflicts during gateway restarts resolve automatically.

Problem

`_handle_polling_conflict()` (introduced in 5a2fcaa) calls `_set_fatal_error("telegram_polling_conflict", ..., retryable=False)` on the first 409 error. While this prevents endless retry-spam from genuine dual-instance conflicts, it is too aggressive for production deployments with process supervisors.

Transient 409s are expected during:
- `--replace` handoffs: SIGTERM kills the old gateway, but Telegram's server may still hold the previous long-poll session for a few seconds
- `Restart=on-failure` respawns: systemd restarts the gateway after a 409-triggered exit, and the new instance's first poll overlaps with the dying instance's cleanup

Additionally, `python-telegram-bot`'s built-in `network_retry_loop` already handles 409s with exponential backoff (`max_retries=-1`), but the `error_callback` overrides this by immediately marking the error as fatal and stopping the updater.

Changes

`gateway/platforms/telegram.py`:
- `_handle_polling_conflict()` now retries up to 3 times with a 10-second delay between attempts
- If the `start_polling()` call fails, it returns and waits for the next conflict to trigger another attempt
- `_polling_conflict_count` and `_polling_error_callback_ref` initialized in `__init__`
- … `connect()` for reuse in retries

`tests/gateway/test_telegram_conflict.py`:
- Split `test_polling_conflict_stops_polling_and_notifies_handler` into two tests:
  - `test_polling_conflict_retries_before_fatal`: verifies a single 409 triggers a retry (not fatal)
  - `test_polling_conflict_becomes_fatal_after_retries`: verifies fatal error after exhausting retries

How to test

- … `Restart=on-failure`
- `systemctl --user restart hermes-gateway`: transient 409 during handoff should auto-recover
- `pytest tests/gateway/test_telegram_conflict.py -v`: all 5 tests pass

Platform tested

Closes #2296
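For readers who want to see the handler flow and the two test scenarios side by side, here is a rough asyncio sketch. It does not reproduce the adapter in `gateway/platforms/telegram.py`: `FakeUpdater`, `ConflictHandler`, and `simulate` are hypothetical stand-ins, the stop, wait, restart ordering is an assumption, and the delay is set to zero in the simulation so it runs instantly.

```python
import asyncio


class FakeUpdater:
    """Hypothetical stand-in for the real updater; counts restart attempts."""

    def __init__(self) -> None:
        self.start_calls = 0
        self.running = True

    async def stop(self) -> None:
        self.running = False

    async def start_polling(self) -> None:
        self.start_calls += 1
        self.running = True


class ConflictHandler:
    """Hypothetical handler: retry a bounded number of times, then go fatal."""

    MAX_RETRIES = 3

    def __init__(self, updater: FakeUpdater, retry_delay: float = 10.0) -> None:
        self.updater = updater
        self.retry_delay = retry_delay
        self.conflict_count = 0
        self.fatal_error = None  # (code, retryable) once retries are exhausted

    async def handle_polling_conflict(self) -> None:
        self.conflict_count += 1
        if self.conflict_count <= self.MAX_RETRIES:
            # Assumed order: stop polling, let the stale session expire, restart.
            await self.updater.stop()
            await asyncio.sleep(self.retry_delay)
            try:
                await self.updater.start_polling()
            except Exception:
                # Restart failed: return and let the next conflict retry.
                pass
            return
        # Retry budget exhausted: mark the error as permanently fatal.
        await self.updater.stop()
        self.fatal_error = ("telegram_polling_conflict", False)


async def simulate(conflicts: int) -> ConflictHandler:
    """Feed a number of consecutive 409 conflicts into a fresh handler."""
    handler = ConflictHandler(FakeUpdater(), retry_delay=0.0)
    for _ in range(conflicts):
        await handler.handle_polling_conflict()
    return handler
```

Calling `asyncio.run(simulate(1))` corresponds to the transient case (one retry, no fatal error), while `asyncio.run(simulate(4))` corresponds to the exhausted case, where the fourth conflict sets the fatal error after three restart attempts.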