Implement SSH connection pool for runner instances#3936

Merged
r4victor merged 26 commits into master from issue_3920_instances_ssh_pool
Jun 8, 2026
@r4victor r4victor commented Jun 5, 2026

Collaborator

Part of #3920
Closes #2933

Add InstanceConnectionPool and InstanceConnection classes that allow reusing SSH connections to runner instances for shim and runner API port forwarding. Previously, the dstack server had to constantly re-establish SSH connections, which increased CPU load and slowed down processing. The runner_ssh_tunnel() decorator is updated to use the pool, so clients are mostly unchanged.
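
The idea of reusing one connection per instance behind a decorator can be illustrated with a minimal sketch. This is not the code from this PR: `FakeConnection`, `SketchConnectionPool`, `with_pooled_connection`, and `healthcheck` are hypothetical names, and the "connection" is a stand-in that only counts how often it is opened, where a real one would manage an SSH process and forwarded ports.

```python
import asyncio
import functools
from dataclasses import dataclass
from typing import Dict


@dataclass
class FakeConnection:
    """Hypothetical stand-in for an SSH tunnel to one instance."""

    instance_id: str
    open_count: int = 0
    alive: bool = False

    async def open(self) -> None:
        # Placeholder for the expensive step (SSH handshake, port forwarding).
        self.open_count += 1
        self.alive = True


class SketchConnectionPool:
    """Keeps at most one connection per instance and reuses it across calls."""

    def __init__(self) -> None:
        self._connections: Dict[str, FakeConnection] = {}
        self._locks: Dict[str, asyncio.Lock] = {}

    async def get(self, instance_id: str) -> FakeConnection:
        # A per-instance lock prevents two tasks from opening the same tunnel twice.
        lock = self._locks.setdefault(instance_id, asyncio.Lock())
        async with lock:
            conn = self._connections.get(instance_id)
            if conn is None:
                conn = FakeConnection(instance_id)
                self._connections[instance_id] = conn
            if not conn.alive:
                # Only (re)open when there is no usable connection.
                await conn.open()
            return conn


POOL = SketchConnectionPool()


def with_pooled_connection(func):
    """Hypothetical decorator: resolves instance_id to a pooled connection."""

    @functools.wraps(func)
    async def wrapper(instance_id: str, *args, **kwargs):
        conn = await POOL.get(instance_id)
        return await func(conn, *args, **kwargs)

    return wrapper


@with_pooled_connection
async def healthcheck(conn: FakeConnection) -> str:
    return f"{conn.instance_id}: ok"


async def demo() -> int:
    # Three calls against the same instance share a single connection.
    for _ in range(3):
        await healthcheck("cloud-fleet-0")
    conn = await POOL.get("cloud-fleet-0")
    return conn.open_count
```

In this sketch, callers keep passing an instance identifier and never see the pool, which is one way a decorator can switch to pooled connections without changing its call sites.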

Impact

The run startup time (#3920) on a provisioned instance with a pulled image went from ~7s to 1-2s (it was mostly limited by SSH connection re-creation):

[2026-06-05 14:32:29] [👤admin] [run blue-dog-1] Run submitted. Status: SUBMITTED
[2026-06-05 14:32:29] [job blue-dog-1-0-0] Job created on run submission. Status: SUBMITTED
[2026-06-05 14:32:29] [instance cloud-fleet-0] Instance status changed IDLE -> BUSY
[2026-06-05 14:32:29] [job blue-dog-1-0-0, instance cloud-fleet-0] Job assigned to instance. Instance blocks: 1/1 busy
[2026-06-05 14:32:29] [job blue-dog-1-0-0] Job status changed SUBMITTED -> PROVISIONING
[2026-06-05 14:32:30] [job blue-dog-1-0-0] Job status changed PROVISIONING -> PULLING
[2026-06-05 14:32:31] [job blue-dog-1-0-0] Job status changed PULLING -> RUNNING
[2026-06-05 14:32:31] [run blue-dog-1] Run status changed SUBMITTED -> RUNNING

Also, CPU utilization on the dstack server machine no longer spikes from constantly opening many SSH connections to many instances (#2933).

Notes and implementation details

  • The pool is unbounded. One active instance is expected to add ~2-10MB of RAM usage. The pool is disabled by default and can be enabled with DSTACK_SERVER_SSH_POOL_ENABLED. The plan is to test the pool in the next release, then enable it by default and document how to opt out if RAM usage is a concern.
  • The pool is not currently intended for arbitrary port forwarding, only for shim and runner ports. It's not used to forward service ports for probes or router-worker communication. This can probably be generalized later.
  • The pool is not used for container-based backends. (Connections from dstack-server to runner's sshd are expected to be short as the inactivity_duration feature distinguishes user and server connections based on duration.)
  • The pool is incompatible with multiple dstack server processes on one host with the same DSTACK_SERVER_DIR. It's expected that the pool is disabled in such setups. (It's already kinda half-working with gateway connections.)
  • Dropped all params from runner_ssh_tunnel(), incl. retries=3. Retries seem to be legacy here and are no longer needed after "Introduce JOB_DISCONNECTED_RETRY_TIMEOUT" (#2627), which gives a 2m timeout before a running job is kicked from an unreachable instance. Added and documented the DSTACK_SERVER_SSH_CONNECT_TIMEOUT env var to increase the default ConnectTimeout if server-instance latency is always >3s.
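
The two environment variables mentioned in the notes above could be consumed roughly as follows. This is a sketch under assumptions, not the server's actual settings code: the accepted truthy values, the integer-seconds format, and the 3-second default (inferred only from the ">3s" remark) are all guesses, and the helper names are hypothetical. `ConnectTimeout` itself is a standard OpenSSH client option.

```python
import os
from typing import List, Mapping

# Assumption: the default is inferred from the ">3s" remark, not confirmed.
ASSUMED_DEFAULT_CONNECT_TIMEOUT = 3  # seconds


def pool_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Return True if the pool is switched on (truthy values are an assumption)."""
    raw = env.get("DSTACK_SERVER_SSH_POOL_ENABLED", "")
    return raw.strip().lower() in {"1", "true", "yes"}


def connect_timeout(env: Mapping[str, str] = os.environ) -> int:
    """Return the SSH connect timeout in seconds, falling back to the assumed default."""
    raw = env.get("DSTACK_SERVER_SSH_CONNECT_TIMEOUT")
    return int(raw) if raw else ASSUMED_DEFAULT_CONNECT_TIMEOUT


def ssh_options(env: Mapping[str, str] = os.environ) -> List[str]:
    """Build the OpenSSH command-line option that carries the timeout."""
    return ["-o", f"ConnectTimeout={connect_timeout(env)}"]
```

For example, a deployment with consistently high server-to-instance latency might set DSTACK_SERVER_SSH_CONNECT_TIMEOUT=10, which in this sketch yields `-o ConnectTimeout=10`.
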
@r4victor r4victor changed the title Implement SSH pool for runner instances Jun 5, 2026
@r4victor r4victor requested review from jvstme and un-def June 5, 2026 09:55
@r4victor r4victor merged commit 1203e3e into master Jun 8, 2026
25 checks passed
@r4victor r4victor deleted the issue_3920_instances_ssh_pool branch June 8, 2026 10:26