fix(core): prevent hanging turn/start due to websocket warming issues #14838

Merged
owenlin0 merged 1 commit into main from owen/fix_websocket_prewarm_hang on Mar 17, 2026

owenlin0 (Collaborator) commented Mar 16, 2026

Description

This PR fixes a bad first-turn failure mode in app-server when the startup websocket prewarm hangs. Before this change, initialize -> thread/start -> turn/start could sit behind the prewarm for up to five minutes, so the client would not see turn/started, and even turn/interrupt would block because the turn had not actually started yet.

Now:

  • websocket startup has a configurable timeout, 15s by default, exposed as websocket_connect_timeout_ms in config.toml
  • turn/started is sent immediately on turn/start, even if the websocket is still connecting
  • turn/interrupt can cancel a turn that is still waiting on the websocket warmup
  • the turn task waits for the full websocket warmup timeout (15s by default) before falling back
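
The behavior in the list above can be sketched with the standard library alone. This is only an illustration under stated assumptions, not the code from this PR: `TurnSignal`, `TurnStart`, `await_prewarm`, and `start_turn` are hypothetical names, and a channel stands in for whatever the real turn task waits on. It shows turn/started being recorded before the wait begins, and the wait ending on warmup, interrupt, or timeout, whichever comes first.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Signals a waiting turn can receive (hypothetical names for illustration).
#[derive(Debug, PartialEq)]
enum TurnSignal {
    PrewarmReady,
    Interrupt,
}

/// How the turn proceeds once the wait is over (hypothetical).
#[derive(Debug, PartialEq)]
enum TurnStart {
    UseWarmWebsocket,
    FallBack,
    Interrupted,
}

/// Wait for the prewarm, but never longer than `timeout`.
/// An interrupt ends the wait early; a timeout or a dropped sender falls back.
fn await_prewarm(signals: &Receiver<TurnSignal>, timeout: Duration) -> TurnStart {
    match signals.recv_timeout(timeout) {
        Ok(TurnSignal::PrewarmReady) => TurnStart::UseWarmWebsocket,
        Ok(TurnSignal::Interrupt) => TurnStart::Interrupted,
        Err(RecvTimeoutError::Timeout) => TurnStart::FallBack,
        Err(RecvTimeoutError::Disconnected) => TurnStart::FallBack,
    }
}

/// Record turn/started first, then wait on the websocket warmup.
fn start_turn(
    events: &mut Vec<&'static str>,
    signals: &Receiver<TurnSignal>,
    timeout: Duration,
) -> TurnStart {
    events.push("turn/started");
    await_prewarm(signals, timeout)
}
```

Because the lifecycle event is pushed before the bounded wait, a client in this model always sees turn/started, and an interrupt sent during the warmup is handled without waiting for the timeout to expire.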

Why

The old behavior made app-server feel stuck at exactly the moment the client expects turn lifecycle events to start flowing. That was especially painful for external clients, because from their point of view the server had accepted the request but then went silent for minutes.

Configuring the websocket startup timeout

You can set it in config.toml like this:

[model_providers.openai]
supports_websockets = true
websocket_connect_timeout_ms = 15000
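
As a rough sketch of how such a setting might be resolved into a duration: `ProviderConfig`, `DEFAULT_WEBSOCKET_CONNECT_TIMEOUT_MS`, and `websocket_connect_timeout` below are hypothetical stand-ins, not the actual types in the codebase. The only facts taken from this PR are the key names and the 15s default.

```rust
use std::time::Duration;

/// Default from the PR description (15s). The constant name is an assumption.
const DEFAULT_WEBSOCKET_CONNECT_TIMEOUT_MS: u64 = 15_000;

/// Hypothetical stand-in for one [model_providers.*] entry in config.toml.
struct ProviderConfig {
    supports_websockets: bool,
    websocket_connect_timeout_ms: Option<u64>,
}

/// Resolve the effective websocket startup timeout.
/// Returns None when the provider does not use websockets at all.
fn websocket_connect_timeout(cfg: &ProviderConfig) -> Option<Duration> {
    if !cfg.supports_websockets {
        return None;
    }
    let ms = cfg
        .websocket_connect_timeout_ms
        .unwrap_or(DEFAULT_WEBSOCKET_CONNECT_TIMEOUT_MS);
    Some(Duration::from_millis(ms))
}
```

In this sketch, omitting the key yields the 15s default, and an explicit value overrides it.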

owenlin0 force-pushed the owen/fix_websocket_prewarm_hang branch from c96827b to e0db860 on March 16, 2026 18:59
owenlin0 marked this pull request as ready for review on March 16, 2026 19:00
owenlin0 requested a review from pakrym-oai on March 16, 2026 19:05
owenlin0 force-pushed the owen/fix_websocket_prewarm_hang branch 4 times, most recently from abe0e47 to c6c288f on March 16, 2026 22:38
owenlin0 changed the title fix(core): fallback to default task if websocket warm is not ready (Mar 16, 2026)
owenlin0 (Collaborator, Author)

@codex review

owenlin0 force-pushed the owen/fix_websocket_prewarm_hang branch from ac37a80 to cf34cf3 on March 16, 2026 22:59
chatgpt-codex-connector (bot, Contributor) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d880b09652

owenlin0 force-pushed the owen/fix_websocket_prewarm_hang branch from bc399c5 to 7fa8577 on March 16, 2026 23:52
    Ok(new_conn) => new_conn,
    Err(err) => {
        if matches!(err, ApiError::Transport(TransportError::Timeout)) {
            self.reset_websocket_session();
Collaborator

Why is this needed? Don't we auto retry?

Collaborator Author

A retry does happen, but at the turn level, not by creating a brand-new websocket client from scratch.

When a request fails, Codex retries the same turn using the same turn-scoped session object. If we do not call reset_websocket_session(), that retry can carry leftover websocket state from the timed-out attempt and act as if it were continuing a connection that never really came up.
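
To make the point concrete, here is a minimal model of a turn-scoped session that is reused across retries. Everything in it is an assumption for illustration: `TurnSession`, `ConnectError`, the `pending_handshake` and `connection_id` fields, and the `connect` helper are hypothetical, and only the method name reset_websocket_session() comes from the diff under review.

```rust
/// Hypothetical connection errors; the real code matches on
/// ApiError::Transport(TransportError::Timeout).
#[derive(Debug, PartialEq)]
enum ConnectError {
    Timeout,
    Other,
}

/// Hypothetical turn-scoped session that survives turn-level retries.
#[derive(Default)]
struct TurnSession {
    connection_id: Option<u32>,
    pending_handshake: bool,
}

impl TurnSession {
    /// Drop any websocket state so the next attempt starts clean.
    fn reset_websocket_session(&mut self) {
        self.connection_id = None;
        self.pending_handshake = false;
    }

    /// Try to connect. On a timeout, clear the half-built state before
    /// returning, because the retry will reuse this same session object.
    fn connect(
        &mut self,
        attempt: impl FnOnce() -> Result<u32, ConnectError>,
    ) -> Result<u32, ConnectError> {
        self.pending_handshake = true;
        match attempt() {
            Ok(id) => {
                self.connection_id = Some(id);
                self.pending_handshake = false;
                Ok(id)
            }
            Err(err) => {
                if matches!(err, ConnectError::Timeout) {
                    self.reset_websocket_session();
                }
                Err(err)
            }
        }
    }
}
```

In this model, a timed-out attempt leaves no trace on the session, so the turn-level retry behaves like a first attempt instead of resuming a handshake that never completed.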

owenlin0 force-pushed the owen/fix_websocket_prewarm_hang branch from 10f193c to 0a87c31 on March 17, 2026 00:39
owenlin0 force-pushed the owen/fix_websocket_prewarm_hang branch 6 times, most recently from fc3b456 to 4f88caa on March 17, 2026 16:25
use similar::TextDiff;
use tempfile::TempDir;

fn trim_single_trailing_newline(contents: &str) -> &str {
Collaborator

this is random

owenlin0 force-pushed the owen/fix_websocket_prewarm_hang branch from 4f88caa to be145ab on March 17, 2026 16:39
owenlin0 merged commit 6ea0410 into main on Mar 17, 2026
33 checks passed
owenlin0 deleted the owen/fix_websocket_prewarm_hang branch on March 17, 2026 17:07
github-actions (bot) locked and limited conversation to collaborators on Mar 17, 2026