feat: auto-reconnect failed gateway platforms with exponential backoff by teknium1 · Pull Request #2584 · NousResearch/hermes-agent

teknium1 · 2026-03-23T06:46:45Z

Summary

When a messaging platform fails to connect at startup (e.g. transient DNS failure, network timeout) or disconnects at runtime with a retryable error, the gateway now queues it for background reconnection instead of giving up permanently.

Problem: A DNS blip during gateway startup caused Telegram and Discord to be permanently unavailable until manual restart. The gateway had no retry mechanism for failed platform connections.

Changes

◆ gateway/run.py — Core reconnection logic:

Added _failed_platforms tracking dict to GatewayRunner.__init__
Startup connection loop now queues failed platforms for retry (retryable errors only)
New _platform_reconnect_watcher() background task with exponential backoff (30s → 60s → 120s → 240s → 300s cap, max 20 attempts)
_handle_adapter_fatal_error() now queues retryable runtime disconnections for reconnection instead of triggering gateway shutdown
On successful reconnect: adapter is wired up, delivery router updated, channel directory rebuilt
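The watcher described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in gateway/run.py: `FailedPlatform`, `RetryableError`, `FatalError`, `reconnect_pass`, and `watcher` are hypothetical names, and the shape of the real `_failed_platforms` entries is not shown in this PR. The sketch also assumes that the backoff formula is applied after each failed attempt.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Dict


@dataclass
class FailedPlatform:
    """Hypothetical queue entry; the real dict's value shape is an assumption."""
    attempts: int = 0
    next_retry: float = 0.0  # timestamp at which the next attempt is allowed


class RetryableError(Exception):
    """Stand-in for a transient failure (DNS blip, network timeout)."""


class FatalError(Exception):
    """Stand-in for a non-retryable failure (bad token, auth failure)."""


async def reconnect_pass(
    failed: Dict[str, FailedPlatform],
    connect: Callable[[str], Awaitable[None]],
    now: float,
    max_attempts: int = 20,
) -> None:
    """Try every queued platform whose retry time has arrived."""
    for name, info in list(failed.items()):  # copy: the dict is mutated below
        if now < info.next_retry:
            continue  # not due yet
        info.attempts += 1
        try:
            await connect(name)
        except RetryableError:
            if info.attempts >= max_attempts:
                del failed[name]  # give up on this platform
            else:
                # Backoff from the PR: min(30 * 2^(attempt-1), 300) seconds
                delay = min(30 * 2 ** (info.attempts - 1), 300)
                info.next_retry = now + delay
        except FatalError:
            del failed[name]  # non-retryable errors leave the queue
        else:
            del failed[name]  # connected; the caller would rewire the adapter


async def watcher(
    failed: Dict[str, FailedPlatform],
    connect: Callable[[str], Awaitable[None]],
    stop: asyncio.Event,
    interval: float = 10.0,
) -> None:
    """Poll the queue every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        await reconnect_pass(failed, connect, time.monotonic())
        try:
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass  # interval elapsed; run another pass
```

Splitting the single pass from the polling loop keeps the retry logic testable without sleeping: a test can call `reconnect_pass` with a fake `connect` coroutine and an explicit `now`.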

◆ tests/gateway/test_platform_reconnect.py — 13 new tests covering:

Startup failure queueing
Reconnect success/failure/backoff/max-attempts/idle behavior
Non-retryable error removal from queue
Runtime disconnection queueing and shutdown prevention

◆ tests/gateway/test_runner_fatal_adapter.py — Updated existing test to reflect new behavior (retryable errors now queue for reconnection instead of shutting down)
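The behavior change that the updated test reflects can be sketched as a small decision function. This is a hypothetical illustration, not the body of `_handle_adapter_fatal_error()`: the `Action` enum, the `decide` function, and its parameters are invented here, and the handling of non-retryable errors (shut down only when nothing is connected or queued) is an assumption rather than something the PR states.

```python
from enum import Enum


class Action(Enum):
    QUEUE_RECONNECT = "queue_reconnect"  # hand the platform to the watcher
    DROP_PLATFORM = "drop_platform"      # stop using it, keep the gateway up
    SHUTDOWN = "shutdown"                # nothing left to serve


def decide(retryable: bool, live_adapters: int, queued_platforms: int) -> Action:
    """Choose what to do after an adapter reports a fatal error.

    `live_adapters` counts the other adapters still connected;
    `queued_platforms` counts platforms already awaiting reconnection.
    """
    if retryable:
        # Per the PR: retryable disconnections are queued, not fatal
        return Action.QUEUE_RECONNECT
    if live_adapters == 0 and queued_platforms == 0:
        # Assumption: with nothing connected and nothing queued, stop
        return Action.SHUTDOWN
    return Action.DROP_PLATFORM
```

For example, `decide(True, 0, 0)` returns `Action.QUEUE_RECONNECT`, which corresponds to the case where the last adapter drops with a retryable error and the gateway stays alive.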

Design

Backoff: min(30 * 2^(attempt-1), 300) seconds between retries
Max 20 attempts before giving up (about 88 min of cumulative backoff under the formula above; 100 min if every wait were at the 300s cap)
Non-retryable errors (bad token, auth failure) are never retried
Watcher checks every 10 seconds for platforms due for retry
When all adapters disconnect but platforms are queued, gateway stays alive
Watcher runs even when no platforms initially failed (handles runtime disconnections)
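The backoff formula above can be checked with a few lines of Python. The constants come from this PR; the names `BASE_DELAY_S`, `MAX_DELAY_S`, `MAX_ATTEMPTS`, and `backoff_delay` are illustrative and may not match the identifiers in gateway/run.py.

```python
BASE_DELAY_S = 30   # first backoff step, from the PR description
MAX_DELAY_S = 300   # cap, from the PR description
MAX_ATTEMPTS = 20   # attempts before giving up, from the PR description


def backoff_delay(attempt: int) -> int:
    """Backoff in seconds associated with attempt number `attempt` (1-based)."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return min(BASE_DELAY_S * 2 ** (attempt - 1), MAX_DELAY_S)


schedule = [backoff_delay(n) for n in range(1, MAX_ATTEMPTS + 1)]
print(schedule[:6])        # [30, 60, 120, 240, 300, 300]
print(sum(schedule) / 60)  # 87.5
```

Summed over all 20 attempts the delays come to 5250 seconds, or 87.5 minutes. Real elapsed time would be somewhat longer, since this ignores the 10-second polling granularity and the time each connection attempt itself takes.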

Test plan

All 5940 tests pass (0 failures)
New test file: tests/gateway/test_platform_reconnect.py (13 tests)
Updated test: test_runner_fatal_adapter.py reflects new reconnection behavior

When a messaging platform fails to connect at startup (e.g. transient DNS failure) or disconnects at runtime with a retryable error, the gateway now queues it for background reconnection instead of giving up permanently.

- New _platform_reconnect_watcher background task runs alongside the existing session expiry watcher
- Exponential backoff: 30s, 60s, 120s, 240s, 300s cap
- Max 20 retry attempts before giving up on a platform
- Non-retryable errors (bad auth token, etc.) are not retried
- Runtime disconnections via _handle_adapter_fatal_error now queue retryable failures instead of triggering gateway shutdown
- On successful reconnect, adapter is wired up and channel directory is rebuilt automatically

Fixes the case where a DNS blip during gateway startup caused Telegram and Discord to be permanently unavailable until manual restart.


teknium1 merged commit 3b509da into main Mar 23, 2026
1 check passed

teknium1 merged 1 commit into main from hermes/hermes-f9506ecc
