feat: auto-reconnect failed gateway platforms with exponential backoff #2584
Merged
When a messaging platform fails to connect at startup (e.g. transient DNS failure) or disconnects at runtime with a retryable error, the gateway now queues it for background reconnection instead of giving up permanently.

- New _platform_reconnect_watcher background task runs alongside the existing session expiry watcher
- Exponential backoff: 30s, 60s, 120s, 240s, 300s cap
- Max 20 retry attempts before giving up on a platform
- Non-retryable errors (bad auth token, etc.) are not retried
- Runtime disconnections via _handle_adapter_fatal_error now queue retryable failures instead of triggering gateway shutdown
- On successful reconnect, adapter is wired up and channel directory is rebuilt automatically

Fixes the case where a DNS blip during gateway startup caused Telegram and Discord to be permanently unavailable until manual restart.
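As a rough illustration of the schedule listed above, the delays can be reproduced with the formula given in the Design section of this PR, min(30 * 2^(attempt-1), 300). The function and constant names below are hypothetical and are not taken from gateway/run.py.

```python
BASE_DELAY_S = 30   # first retry delay, from the schedule above
MAX_DELAY_S = 300   # cap, from the schedule above
MAX_ATTEMPTS = 20   # retry limit, from the list above


def reconnect_delay(attempt: int) -> int:
    """Seconds to wait before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)


schedule = [reconnect_delay(n) for n in range(1, 7)]
print(schedule)  # [30, 60, 120, 240, 300, 300]
```

If a delay precedes each of the 20 attempts (an assumption; the PR text does not say when the first retry fires), the delays add up to 5250 seconds, roughly 87.5 minutes, before a platform is given up on.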
outsourc-e pushed a commit to outsourc-e/hermes-agent that referenced this pull request on Mar 26, 2026
aashizpoudel pushed a commit to aashizpoudel/hermes-agent that referenced this pull request on Mar 30, 2026
Summary
When a messaging platform fails to connect at startup (e.g. transient DNS failure, network timeout) or disconnects at runtime with a retryable error, the gateway now queues it for background reconnection instead of giving up permanently.
Problem: A DNS blip during gateway startup caused Telegram and Discord to be permanently unavailable until manual restart. The gateway had no retry mechanism for failed platform connections.
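The queueing decision described in the summary can be sketched as follows. This is a simplified stand-in, not the code from this PR: the Gateway and FailedPlatform classes, the on_connect_failure method, and the boolean retryable flag are all assumptions made for illustration, and the real _failed_platforms dict may store different data.

```python
from dataclasses import dataclass, field


@dataclass
class FailedPlatform:
    """Hypothetical queue entry; the real _failed_platforms values may differ."""
    attempts: int = 0
    last_error: str = ""


@dataclass
class Gateway:
    """Stand-in for the gateway runner, reduced to the queueing decision."""
    failed_platforms: dict[str, FailedPlatform] = field(default_factory=dict)

    def on_connect_failure(self, platform: str, error: str, retryable: bool) -> bool:
        """Queue a retryable failure; return True if the platform was queued."""
        if not retryable:
            # e.g. a bad auth token: retrying cannot help, so do not queue.
            return False
        entry = self.failed_platforms.setdefault(platform, FailedPlatform())
        entry.last_error = error
        return True


gw = Gateway()
gw.on_connect_failure("telegram", "DNS lookup failed", retryable=True)
gw.on_connect_failure("discord", "invalid token", retryable=False)
print(sorted(gw.failed_platforms))  # ['telegram']
```

Keying the queue by platform name means a platform that fails repeatedly keeps a single entry, which a background task could then use to track its attempt count.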
Changes
◆ gateway/run.py (core reconnection logic):
  - Added _failed_platforms tracking dict to GatewayRunner.__init__
  - _platform_reconnect_watcher() background task with exponential backoff (30s → 60s → 120s → 240s → 300s cap, max 20 attempts)
  - _handle_adapter_fatal_error() now queues retryable runtime disconnections for reconnection instead of triggering gateway shutdown
◆ tests/gateway/test_platform_reconnect.py: 13 new tests covering the reconnection logic
◆ tests/gateway/test_runner_fatal_adapter.py: updated existing test to reflect new behavior (retryable errors now queue for reconnection instead of shutting down)
Design
- min(30 * 2^(attempt-1), 300) seconds between retries
Test plan
- tests/gateway/test_platform_reconnect.py (13 tests)
- test_runner_fatal_adapter.py reflects new reconnection behavior
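The tests themselves are not reproduced on this page. As an illustration only, retry logic of this kind can be exercised without real waiting by injecting a fake sleep function. Everything below (FlakyAdapter, reconnect_with_backoff, the sleep parameter) is a hypothetical sketch and not the contents of tests/gateway/test_platform_reconnect.py or gateway/run.py.

```python
import asyncio


class FlakyAdapter:
    """Hypothetical adapter that fails a fixed number of times, then connects."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.connected = False

    async def connect(self) -> None:
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("transient failure")
        self.connected = True


async def reconnect_with_backoff(adapter, max_attempts=20, base=30.0, cap=300.0,
                                 sleep=asyncio.sleep):
    """Retry adapter.connect() with capped exponential backoff.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        await sleep(min(base * 2 ** (attempt - 1), cap))
        try:
            await adapter.connect()
        except ConnectionError:
            continue  # treated as retryable: wait longer and try again
        return attempt
    return None


async def demo():
    waits = []

    async def fake_sleep(seconds):
        waits.append(seconds)  # record the delay instead of sleeping

    adapter = FlakyAdapter(failures=2)
    attempts = await reconnect_with_backoff(adapter, sleep=fake_sleep)
    return attempts, waits, adapter.connected


result = asyncio.run(demo())
print(result)  # (3, [30.0, 60.0, 120.0], True)
```

Passing the sleep function as a parameter keeps the sketch fast to run while still letting assertions check the exact delays requested.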