feat: auto-reconnect failed gateway platforms with exponential backoff #2584

Merged
teknium1 merged 1 commit into main from hermes/hermes-f9506ecc on Mar 23, 2026


@teknium1
Contributor

Summary

When a messaging platform fails to connect at startup (e.g. transient DNS failure, network timeout) or disconnects at runtime with a retryable error, the gateway now queues it for background reconnection instead of giving up permanently.

Problem: A DNS blip during gateway startup caused Telegram and Discord to be permanently unavailable until manual restart. The gateway had no retry mechanism for failed platform connections.

Changes

gateway/run.py — Core reconnection logic:

  • Added _failed_platforms tracking dict to GatewayRunner.__init__
  • Startup connection loop now queues failed platforms for retry (retryable errors only)
  • New _platform_reconnect_watcher() background task with exponential backoff (30s → 60s → 120s → 240s → 300s cap, max 20 attempts)
  • _handle_adapter_fatal_error() now queues retryable runtime disconnections for reconnection instead of triggering gateway shutdown
  • On successful reconnect: adapter is wired up, delivery router updated, channel directory rebuilt
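
The bookkeeping described above might look roughly like the following. This is a minimal sketch, not the code in gateway/run.py: `ReconnectQueue`, `FailedPlatform`, `NonRetryableError`, `tick`, and `watch` are hypothetical names, and the real watcher also wires up the adapter, updates the delivery router, and rebuilds the channel directory on success, which is omitted here.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable

BASE_DELAY = 30.0    # seconds, first retry delay (from the PR description)
MAX_DELAY = 300.0    # seconds, backoff cap (from the PR description)
MAX_ATTEMPTS = 20    # attempts before giving up (from the PR description)


class NonRetryableError(Exception):
    """Hypothetical marker for errors such as a bad auth token."""


@dataclass
class FailedPlatform:
    """Hypothetical record for one queued platform."""
    connect: Callable[[], Awaitable[bool]]  # returns True on success
    attempts: int = 0
    next_retry: float = 0.0


class ReconnectQueue:
    """Illustrative stand-in for the _failed_platforms tracking dict."""

    def __init__(self, clock=time.monotonic):
        self.failed = {}        # platform name -> FailedPlatform
        self.connected = set()  # names that reconnected successfully
        self._clock = clock

    def queue(self, name, connect, retryable):
        """Queue a platform for retry; non-retryable failures are ignored."""
        if not retryable:
            return False
        self.failed[name] = FailedPlatform(
            connect, attempts=0, next_retry=self._clock() + BASE_DELAY
        )
        return True

    async def tick(self):
        """Try every platform whose retry time has arrived."""
        now = self._clock()
        for name, entry in list(self.failed.items()):
            if now < entry.next_retry:
                continue  # not due yet
            entry.attempts += 1
            try:
                ok = await entry.connect()
            except NonRetryableError:
                del self.failed[name]  # never retried again
                continue
            except Exception:
                ok = False  # treated as a retryable failure in this sketch
            if ok:
                del self.failed[name]
                self.connected.add(name)
            elif entry.attempts >= MAX_ATTEMPTS:
                del self.failed[name]  # give up on this platform
            else:
                # Delay before attempt n+1 is min(30 * 2^n, 300) seconds.
                delay = min(BASE_DELAY * 2 ** entry.attempts, MAX_DELAY)
                entry.next_retry = self._clock() + delay


async def watch(queue, interval=10.0):
    """Background loop: poll the queue every `interval` seconds."""
    while True:
        await queue.tick()
        await asyncio.sleep(interval)
```

Passing the clock in as a parameter is a choice made for this sketch so the backoff timing can be exercised without sleeping; whether the actual tests do something similar is not stated in the PR.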

tests/gateway/test_platform_reconnect.py — 13 new tests covering:

  • Startup failure queueing
  • Reconnect success/failure/backoff/max-attempts/idle behavior
  • Non-retryable error removal from queue
  • Runtime disconnection queueing and shutdown prevention

tests/gateway/test_runner_fatal_adapter.py — Updated existing test to reflect new behavior (retryable errors now queue for reconnection instead of shutting down)

Design

  • Backoff: min(30 * 2^(attempt-1), 300) seconds between retries
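  • Max 20 attempts before giving up (20 × 300 s = 100 min if every delay were at the cap; about 87.5 min of cumulative backoff once the ramp-up from 30 s is counted)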
  • Non-retryable errors (bad token, auth failure) are never retried
  • Watcher checks every 10 seconds for platforms due for retry
  • When all adapters disconnect but platforms are queued, gateway stays alive
  • Watcher runs even when no platforms initially failed (handles runtime disconnections)
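
To make the backoff arithmetic concrete, here is a small sketch of the formula from the list above. `backoff_delay` is a hypothetical helper name, not necessarily what gateway/run.py uses; the constants are the ones stated in this PR.

```python
def backoff_delay(attempt: int, base: float = 30.0, cap: float = 300.0) -> float:
    """Delay in seconds before retry number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(base * 2 ** (attempt - 1), cap)


# Delays before each of the 20 allowed attempts.
schedule = [backoff_delay(n) for n in range(1, 21)]
print(schedule[:6])      # [30.0, 60.0, 120.0, 240.0, 300.0, 300.0]

# Cumulative backoff across all 20 attempts, in minutes.
total_minutes = sum(schedule) / 60
print(total_minutes)     # 87.5
```

The cap is reached at the fifth attempt, so 16 of the 20 delays are 300 s. The total excludes time spent inside each connection attempt and the up-to-10-second granularity of the watcher's polling.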

Test plan

  • All 5940 tests pass (0 failures)
  • New test file: tests/gateway/test_platform_reconnect.py (13 tests)
  • Updated test: test_runner_fatal_adapter.py reflects new reconnection behavior

When a messaging platform fails to connect at startup (e.g. transient DNS
failure) or disconnects at runtime with a retryable error, the gateway now
queues it for background reconnection instead of giving up permanently.

- New _platform_reconnect_watcher background task runs alongside the
  existing session expiry watcher
- Exponential backoff: 30s, 60s, 120s, 240s, 300s cap
- Max 20 retry attempts before giving up on a platform
- Non-retryable errors (bad auth token, etc.) are not retried
- Runtime disconnections via _handle_adapter_fatal_error now queue
  retryable failures instead of triggering gateway shutdown
- On successful reconnect, adapter is wired up and channel directory
  is rebuilt automatically

Fixes the case where a DNS blip during gateway startup caused Telegram
and Discord to be permanently unavailable until manual restart.
teknium1 merged commit 3b509da into main on Mar 23, 2026
1 check passed
outsourc-e pushed a commit to outsourc-e/hermes-agent that referenced this pull request Mar 26, 2026
aashizpoudel pushed a commit to aashizpoudel/hermes-agent that referenced this pull request Mar 30, 2026