Skip to content

shim: avoid re-downloading runner/shim binaries on forced re-install#3954

Open
peterschmidt85 wants to merge 2 commits into
masterfrom
fix/shim-redownload-loop
Open

shim: avoid re-downloading runner/shim binaries on forced re-install#3954
peterschmidt85 wants to merge 2 commits into
masterfrom
fix/shim-redownload-loop

Conversation

@peterschmidt85

@peterschmidt85 peterschmidt85 commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Bounds the wasted egress from the runner/shim re-install loop (#3953): the shim re-downloaded the full binaries on every server-issued install (force=true), and a transient HTTP error during bootstrap could brick the instance. Two changes, both in the download path.

downloadFile — cache + revalidate instead of always re-downloading

runner/internal/shim/components/utils.go. Installing url into path now:

  1. force=false and path exists → return (unchanged).
  2. Resolve a per-URL cache file next to the binary: <path>.<sha256(url)[:16]>.cache (+ a .etag sidecar). Keying by URL means different versions get separate caches instead of overwriting one slot.
  3. ensureCached GETs url with If-None-Match: <stored etag>:
    • 304 Not Modified → cache is current, no body transferred;
    • 200 → stream into the cache (temp file + atomic rename) and store the response ETag.
  4. Install the cached bytes into path: copy → chmod → atomic rename. Skipped if path already matches the cache (same size). If chmod/rename fails, the next call retries from the cache — no re-download.
  5. pruneCaches keeps the newest maxCachedVersions (5) and deletes older ones, so disk stays bounded.

Net: a repeated/forced install of an unchanged version is a 304 with no transfer; two servers alternating between versions download each version once, then 304.

Bootstrap curl — fail instead of saving an error page

src/dstack/_internal/core/backends/base/compute.py. The shim-bootstrap curl gains -f and chains mv, so a transient 403/5xx is no longer written to dstack-shim and then executed (/usr/local/bin/dstack-shim: 2: Syntax error: newline unexpected); it fails cleanly.

Tests

runner/internal/shim/components/utils_test.go (pass under -race):

  • unchanged artifact under force → body served once;
  • changed ETag → re-downloaded and updated;
  • force=false + existing file → no request;
  • two alternating versions → each downloaded exactly once (guards the per-URL-cache regression).

Validated end-to-end (AWS)

Two servers on one DB + backend, one real eu-west-1 instance, custom-versioned builds on the staging bucket, measured via S3 request metrics:

  • Both shims carry this fix (0.20.30 ↔ 0.20.31): 22 conditional GETs over 25 min of continuous flip-flop → 1 real transfer (~17 MB warmup), the rest 304. Steady-state egress ≈ 0 (unfixed: ~430 MB and growing).
  • Mixed (unfixed 0.20.29 shim ↔ fixed 0.20.31): ~454 MB in 11 min, climbing — see scope below.

Scope — what this does NOT fix (tracked in #3953)

This PR bounds the egress, not the loop:

  • With two servers expecting different versions the version never converges, so the server keeps issuing installs and the shim still restarts every ~30s — now at zero bytes, but the churn remains.
  • Because the fix lives in the shim binary, a mixed fleet whose in-control shim predates this PR still re-downloads — the older binary has no cache (the ~454 MB case above).

Fully stopping the loop needs a server-side guard (no-downgrade on comparable versions + an attempt cap/backoff). That's deferred: dstack's version space (real semver vs CI build-numbers like 27305 vs dev/None) has no reliable ordering yet, so the server can't safely decide which version wins. See #3953.

🤖 Generated with Claude Code

The shim re-downloaded the full runner/shim binary on every server-issued
install (which is always force=true), with no content check. When an
instance's installed version never converges to the version the server
expects, the server re-issues the install every check, so the same binaries
were re-downloaded indefinitely -- unbounded egress from the downloads bucket.

Cache the downloaded bytes next to the target and revalidate them with the
object's ETag via a conditional request: an unchanged artifact returns 304 and
no body is transferred, even under force. Install from the cache so a
finalization (chmod/rename) failure is retried without re-downloading, and
propagate those errors instead of masking them. A genuinely changed artifact
(different ETag) is still downloaded exactly once; without force an existing
file is still left untouched.

This bounds the loop to at most one download per distinct artifact regardless
of why versions fail to converge.

Refs #3953

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The shim download cache used a single slot per binary, so when two server
versions reconcile the same instance (the confirmed real-world trigger), each
version-swap overwrote the slot and re-downloaded the full binary every cycle.

- downloadFile now keys the cache + ETag by URL, so alternating versions each
  persist and are revalidated independently -> each version is downloaded once,
  then 304s -> egress ~0 even while the binary keeps swapping. Bounded by
  maxCachedVersions (oldest pruned).
- bootstrap curl: add -f and chain `mv`, so a transient 403/5xx is never saved as
  the shim binary (which otherwise runs as a shell script -> "Syntax error:
  newline unexpected" crash loop) -- fail cleanly instead.
- add a flip-flop test (each of two alternating versions downloaded exactly once).

Refs #3953

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@peterschmidt85 peterschmidt85 force-pushed the fix/shim-redownload-loop branch from 5bfbf12 to 59f28e8 Compare June 10, 2026 20:59
@peterschmidt85 peterschmidt85 marked this pull request as ready for review June 10, 2026 22:40
@peterschmidt85 peterschmidt85 requested a review from un-def June 10, 2026 22:40

// Install the cached bytes; skip if path already matches the cache.
if !downloaded {
installed, err := sameSize(path, cachePath)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same-size adjacent Go builds are unlikely but not impossible.

Suggested fix: after a successful rename in installFile, write the cache's urlKey to a <path>.installed marker and replace sameSize with a marker comparison.


// downloadFile ensures path holds the artifact at url.
//
// Bytes are cached next to path (one cache per URL, validated by ETag), so a repeated

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't quite like putting these helper-files in /usr/local/bin. How about using a shimHomeDir subdir?

case "/0.20.23/runner":
body, etag = "RUNNER-0.20.23", `"e23"`
case "/0.20.18/runner":
body, etag = "RUNNER-0.20.18-x", `"e18"` // intentionally different length

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See tests only work with different lengths!

# 403/5xx is never saved as the shim binary; chain `mv` so a failed download
# is never installed (otherwise it would run as a script -> "Syntax error").
f'sudo curl -fsS --compressed --connect-timeout 60 --max-time 240 --retry 1 --output "$dlpath" "{url}"'
f' && sudo mv "$dlpath" {dstack_shim_binary_path}',

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if && is needed since commands are chained with &&. Adding -f should be sufficient

@r4victor r4victor left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice overall

@github-actions

Copy link
Copy Markdown

This PR is stale because it has been open for 14 days with no activity.

@github-actions github-actions Bot added the stale label Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

2 participants