Add JarvisLabs backend #3875

Merged
peterschmidt85 merged 5 commits into master from jarvislabs
May 21, 2026

Conversation

@peterschmidt85 peterschmidt85 commented May 12, 2026

Contributor

Adds JarvisLabs as a dstack backend.

Implementation notes:

  • Adds backend registration, config models, configurator, API client, compute implementation, docs, and backend tests.
  • Uses the JarvisLabs provider from gpuhunt for offer selection. This branch depends on gpuhunt#231 (Add JarvisLabs provider) until the provider is released.
  • Supports JarvisLabs VM workloads only. GPU VMs and CPU VMs use separate JarvisLabs create/destroy APIs.
  • Supports GPU spot by passing the selected offer's spot flag to JarvisLabs GPU VM creation. CPU spot is not emitted by gpuhunt and is not supported.
  • Does not select a JarvisLabs template or custom image; provisioning uses the provider default VM image.
  • Validates configured regions against gpuhunt's JarvisLabs supported-region map and fails closed if an unsupported region reaches a regional API call.
  • Registers the project SSH key in JarvisLabs before creating an instance.
  • Starts the dstack shim over SSH and persists hostname only after shim startup succeeds, so provisioning can retry after a server restart.
  • Maps immediate and delayed JarvisLabs create capacity failures to NoCapacityError and destroys any failed machine whose id JarvisLabs returned before retrying another offer. A failed create status that is not capacity-related raises ProvisioningError. After a VM is running, interruption/unreachability is handled by the generic VM health path, as with other VM backends.
  • Wraps JarvisLabs request failures and malformed success responses as BackendError instead of leaking raw transport/JSON exceptions.
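
The capacity-failure handling in the notes above can be pictured roughly as follows. This is a minimal sketch, not the code from this PR: the exception classes are local stand-ins for dstack's NoCapacityError and ProvisioningError, and `FakeClient`, `create_vm`, `destroy_vm`, the response fields, and the capacity-matching rule are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


class NoCapacityError(Exception):
    """Local stand-in for dstack's NoCapacityError."""


class ProvisioningError(Exception):
    """Local stand-in for dstack's ProvisioningError."""


@dataclass
class CreateResult:
    # Hypothetical response shape; the real JarvisLabs API may differ.
    machine_id: Optional[str]
    status: str  # assumed values: "Running" or "Failed"
    error: str = ""


@dataclass
class FakeClient:
    """Hypothetical in-memory client, used only to make the sketch runnable."""

    result: CreateResult
    destroyed: List[str] = field(default_factory=list)

    def create_vm(self, is_spot: bool) -> CreateResult:
        return self.result

    def destroy_vm(self, machine_id: str) -> None:
        self.destroyed.append(machine_id)


def is_capacity_error(message: str) -> bool:
    # Assumed matching rule; the actual backend may classify errors differently.
    return "capacity" in message.lower()


def create_or_raise(client: FakeClient, is_spot: bool) -> str:
    """Return the machine id on success, otherwise clean up and raise."""
    result = client.create_vm(is_spot=is_spot)
    if result.status != "Failed" and result.machine_id is not None:
        return result.machine_id
    # Destroy the failed machine, if the provider returned an id, so it does
    # not linger while the caller moves on to another offer. Doing this on the
    # non-capacity path as well is an assumption of this sketch.
    if result.machine_id is not None:
        client.destroy_vm(result.machine_id)
    if is_capacity_error(result.error):
        raise NoCapacityError(result.error)
    raise ProvisioningError(result.error or "VM creation failed")
```

In this sketch a caller would catch NoCapacityError to try the next offer, while ProvisioningError would be allowed to propagate.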

E2E validation:

  • CPU on-demand task provisioned and completed on JarvisLabs.
  • L4 GPU on-demand task provisioned and completed CUDA tensor matmul on the GPU.
  • H100 GPU spot task provisioned with JarvisLabs is_spot: true and completed CUDA tensor matmul on the GPU.
  • Requested 120GB/200GB disks were visible inside containers in the live disk checks.
  • Server restart was tested while JarvisLabs runs were active; provisioning resumed instead of losing the run.
  • L4 spot no-capacity was observed from JarvisLabs and handled as a capacity failure.

Added tests cover config validation, API payloads, API error normalization, spot flag propagation, region failure behavior, capacity-failure mapping and cleanup, CPU/GPU provisioning data, disk sizing, SSH username parsing, termination routing, and restart-safe hostname persistence.
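
As an illustration of what an "API error normalization" test could look like, here is a minimal sketch. It is not taken from the PR's test suite: `BackendError` is a local stand-in for dstack's exception, `parse_machine_id` is a hypothetical helper, and the `machine_id` field name is an assumption rather than the documented JarvisLabs response format.

```python
import json


class BackendError(Exception):
    """Local stand-in for dstack's BackendError."""


def parse_machine_id(raw_body: str) -> str:
    """Hypothetical helper: extract a machine id or raise BackendError."""
    try:
        payload = json.loads(raw_body)
    except json.JSONDecodeError as e:
        # Hide the raw JSON exception behind a backend-level error.
        raise BackendError("JarvisLabs returned invalid JSON") from e
    # "machine_id" is an assumed field name, not the documented API.
    if not isinstance(payload, dict) or "machine_id" not in payload:
        raise BackendError("JarvisLabs response is missing machine_id")
    return str(payload["machine_id"])


def test_malformed_success_response_is_normalized() -> None:
    # Invalid JSON, wrong top-level type, and a missing field should all
    # surface as BackendError rather than a raw exception.
    for body in ("not json", "[]", "{}"):
        try:
            parse_machine_id(body)
        except BackendError:
            continue
        raise AssertionError(f"expected BackendError for {body!r}")


def test_valid_response_returns_machine_id() -> None:
    assert parse_machine_id('{"machine_id": 42}') == "42"
```

Both functions follow pytest's naming convention, so they would be collected by a plain `pytest` run if placed in a test module.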

@peterschmidt85 peterschmidt85 force-pushed the jarvislabs branch 4 times, most recently from c8850b2 to 3ad620f Compare May 12, 2026 21:01
@peterschmidt85 peterschmidt85 marked this pull request as ready for review May 12, 2026 21:17
@peterschmidt85 peterschmidt85 requested a review from jvstme May 12, 2026 21:17
Comment thread pyproject.toml
Comment thread src/dstack/_internal/core/backends/jarvislabs/compute.py Outdated
Comment thread src/dstack/_internal/core/backends/jarvislabs/compute.py
Comment thread src/dstack/_internal/core/backends/jarvislabs/compute.py Outdated
Comment thread src/dstack/_internal/core/backends/jarvislabs/compute.py Outdated
@peterschmidt85 peterschmidt85 requested a review from jvstme May 20, 2026 15:16
@peterschmidt85 peterschmidt85 merged commit 0bc4300 into master May 21, 2026
25 checks passed
@peterschmidt85 peterschmidt85 deleted the jarvislabs branch May 21, 2026 07:47
2 participants