fix(status): surface degraded gateway health for Telegram runtime failures #4393

Open
ajmeese7 wants to merge 2 commits into NousResearch:main from ajmeese7:fix/telegram-health-status

ajmeese7 commented Apr 1, 2026

Summary

Fix gateway status reporting so Hermes no longer appears healthy when the gateway process is alive but Telegram runtime health is degraded.

This addresses the case where Telegram polling breaks behind the scenes, the gateway service stays running, and hermes status / hermes gateway status misleadingly suggest everything is fine.

Problem

Previously, Hermes could end up in a bad-but-running state:

  • the gateway process stayed alive
  • Telegram polling was degraded or dead behind the scenes
  • hermes status still looked effectively healthy
  • stale runtime entries could also poison later health output

That made failures on a primary messaging platform hard to detect and forced manual investigation.

What changed

Runtime health tracking

  • Mark the Telegram platform as reconnecting while polling recovery attempts are in progress
  • Return to connected when polling recovery succeeds
  • Preserve fatal behavior for real unrecoverable failures
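
As a rough illustration of the transitions listed above, here is a minimal sketch of a per-platform state tracker. The names PlatformState and PollingHealth, the method names, and the rule that a fatal state is terminal are all assumptions made for this example; they are not taken from the Hermes codebase.

```python
from enum import Enum
from typing import Optional


class PlatformState(str, Enum):
    """Hypothetical runtime states, mirroring the names used in this PR."""
    CONNECTED = "connected"
    RECONNECTING = "reconnecting"
    DISCONNECTED = "disconnected"
    FATAL = "fatal"


class PollingHealth:
    """Hypothetical tracker for one platform's runtime state."""

    def __init__(self) -> None:
        self.state = PlatformState.CONNECTED
        self.last_error: Optional[str] = None

    def recovery_started(self, error: Exception) -> None:
        # Assumption: fatal is terminal, so a later retry must not mask it.
        if self.state is not PlatformState.FATAL:
            self.state = PlatformState.RECONNECTING
            self.last_error = str(error)

    def recovery_succeeded(self) -> None:
        # Only a platform that was reconnecting returns to connected.
        if self.state is PlatformState.RECONNECTING:
            self.state = PlatformState.CONNECTED
            self.last_error = None

    def unrecoverable(self, error: Exception) -> None:
        # Real unrecoverable failures keep their fatal classification.
        self.state = PlatformState.FATAL
        self.last_error = str(error)
```

Keeping the transitions in one small object like this would make it easy to assert on state changes in tests without starting a real polling loop.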

Runtime state hygiene

  • Reset persisted platform runtime state on gateway startup
  • Record the enabled platform set for the current process
  • Prevent stale platform entries from previous runs from affecting current status
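
The startup reset described above could look roughly like the following sketch. The file name runtime_state.json, the JSON layout, and the function reset_runtime_state are hypothetical and chosen only for illustration; the actual persistence format in Hermes may differ.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable


def reset_runtime_state(path: Path, enabled_platforms: Iterable[str]) -> dict:
    """Overwrite any persisted state with a clean record for this process."""
    state = {
        # Record which platforms this process actually enabled.
        "enabled_platforms": sorted(set(enabled_platforms)),
        # Per-platform entries start empty and are filled in at runtime.
        "platforms": {},
    }
    path.write_text(json.dumps(state, indent=2), encoding="utf-8")
    return state


# Demo: a stale entry left by a previous run disappears after the reset.
with tempfile.TemporaryDirectory() as tmp:
    state_file = Path(tmp) / "runtime_state.json"
    stale = {"platforms": {"discord": {"state": "fatal"}}}
    state_file.write_text(json.dumps(stale), encoding="utf-8")

    reset_runtime_state(state_file, ["telegram"])
    reloaded = json.loads(state_file.read_text(encoding="utf-8"))
```

Overwriting the whole record, rather than merging into it, is what keeps entries from earlier runs out of later status output in this sketch.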

Status classification

  • Distinguish service liveness from runtime health
  • Only degrade overall status for relevant configured messaging platforms
  • Treat reconnecting / fatal as meaningful runtime problems
  • Do not degrade overall health for plain disconnected entries
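
The classification rules above can be sketched as a small pure function. The function name classify_gateway_health and the result labels "stopped", "degraded", and "healthy" are assumptions for this example, not identifiers confirmed by the PR.

```python
from typing import Dict, Iterable

# States that count as meaningful runtime problems, per the list above.
PROBLEM_STATES = {"reconnecting", "fatal"}


def classify_gateway_health(
    service_running: bool,
    enabled_platforms: Iterable[str],
    platform_states: Dict[str, str],
) -> str:
    """Combine service liveness with runtime health into one label."""
    if not service_running:
        return "stopped"
    for name in enabled_platforms:
        # Only platforms enabled for this process are considered, so stale
        # or irrelevant entries in platform_states cannot degrade the result.
        if platform_states.get(name) in PROBLEM_STATES:
            return "degraded"
    # A plain "disconnected" entry does not degrade overall health.
    return "healthy"
```

For example, under these assumptions a running gateway with Telegram in the reconnecting state is reported as degraded, while a leftover fatal entry for a platform that is not enabled is ignored.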

CLI output

  • Show platform runtime details in hermes status
  • Show degraded runtime state in hermes status / hermes gateway status
  • Render warning icons in yellow at the CLI layer
  • Keep runtime status helpers presentation-agnostic
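
One way to keep the helpers presentation-agnostic while coloring warnings at the CLI layer is sketched below. The helper returns plain data, and only the rendering function knows about icons and ANSI codes. The names runtime_status_rows and render_rows, the icons, and the output format are hypothetical.

```python
from typing import Dict, List, Tuple

YELLOW = "\033[33m"  # standard ANSI escape for yellow foreground
RESET = "\033[0m"
WARNING_STATES = {"reconnecting", "fatal"}

Row = Tuple[str, str, bool]


def runtime_status_rows(platform_states: Dict[str, str]) -> List[Row]:
    """Presentation-agnostic helper: plain data, no colors or icons."""
    return [
        (name, state, state in WARNING_STATES)
        for name, state in sorted(platform_states.items())
    ]


def render_rows(rows: List[Row], use_color: bool = True) -> List[str]:
    """CLI layer: chooses icons and applies color to warnings only."""
    lines = []
    for name, state, is_warning in rows:
        icon = "⚠" if is_warning else "✓"
        if is_warning and use_color:
            icon = f"{YELLOW}{icon}{RESET}"
        lines.append(f"{icon} {name}: {state}")
    return lines
```

With this split, tests for the helper can compare plain tuples, and color handling can be switched off (for example when output is not a terminal) without touching the status logic.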

Example outcome

Before:

  • Gateway process running
  • Telegram broken
  • Status output still looked healthy, which was misleading

After:

  • Gateway process can be running
  • Runtime can independently be degraded
  • Status clearly surfaces Telegram reconnect/failure state when it matters

Tests

Targeted tests added/updated for:

  • runtime health classification
  • stale state reset on startup
  • Telegram reconnect state reporting
  • CLI degraded status rendering
  • filtering irrelevant/stale platforms out of health output

Example targeted run:

source venv/bin/activate && python -m pytest \
  tests/gateway/test_status.py \
  tests/gateway/test_telegram_runtime_health.py \
  tests/hermes_cli/test_gateway_runtime_health.py \
  tests/hermes_cli/test_status.py \
  tests/hermes_cli/test_gateway.py \
  tests/hermes_cli/test_status_model_provider.py \
  tests/gateway/test_runner_startup_failures.py -q

All targeted tests passed locally.

Related work

This complements prior Telegram gateway reliability fixes.

This PR focuses on a separate but related problem: making hermes status and hermes gateway status accurately reflect degraded runtime health when the gateway process is still alive.
