HPC Batch Jobs Use Cases

This document presents two use cases that demonstrate the use of Kueue to manage multi-team batch jobs within a cluster.

Use Cases:

Use Case 1: HPC Financial Analysis Using Monte Carlo Simulations (Team: hpc-team-b)

This example is an adaptation from Google Cloud Platform's Risk and Research Blueprints.

Requirements

  • Docker Registry Connectivity If you are using a private cluster with private nodes, they must be able to fetch Kueue Docker images from registry.k8s.io. This can be done by adding Cloud NAT to the private nodes' network, having your own NAT set up on your cluster network, or by following the tutorial for Artifact Registry Remote Repositories.

  • Kubectl with Cluster Connection If using a private cluster, you can use Connect Gateway.

    gcloud container fleet memberships get-credentials CLUSTER-NAME --project=YOUR-CLUSTER-PROJECT --location=YOUR-CLUSTER-REGION

    If you have access to a specific namespace, you can run:

    gcloud container fleet scopes namespaces get-credentials NAMESPACE
  • Kueue

    • Option 1 (Cluster Network with NAT): Install Kueue by running the following command:

      kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml

      Note: To uninstall a released version from your cluster, run:

      kubectl delete -f https://github.com/kubernetes-sigs/kueue/releases/download/v0.10.1/manifests.yaml

      Wait for Kueue installation to complete:

      kubectl wait deploy/kueue-controller-manager -n kueue-system --for=condition=available --timeout=5m
    • Option 2: Install Kueue by following the tutorial for Artifact Registry Remote Repositories

  • Cluster Toolkit (gcluster)

    This guide assumes you have gcluster installed in your home directory. For more information on how to set up gcluster, see the following link

Create Namespaces

Add the hpc-team-a and hpc-team-b Namespaces in the Fleetscope Repository

Typically, the application namespaces are created in 3-fleetscope and specified in 6-appsource.

  1. Navigate to the Fleetscope repository and add the hpc-team-a and hpc-team-b namespaces to terraform.tfvars, if they have not been created already:

    namespace_ids = {
    +    "hpc-team-a"     = "your-hpc-team-a-group@yourdomain.com",
    +    "hpc-team-b"     = "your-hpc-team-b-group@yourdomain.com",
         ...
    }
  2. Apply the changes by committing to a named environment branch (development, nonproduction, production). After the build associated with the Fleetscope repository finishes its execution, the namespaces should be present in the cluster.
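
Once the build completes, you can optionally confirm that the namespaces exist. This is only a quick sanity check; the exact namespace names depend on the environment suffix applied by Fleetscope (later steps in this guide use hpc-team-a-development and hpc-team-b-development):

kubectl get namespaces | grep hpc-team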

Create Teams Environments and Infrastructure

Create projects in 4-appfactory

This example includes a sample terraform.tfvars that you can use to create hpc-team-a and hpc-team-b.

applications = {
+  "hpc" = {
+    "hpc-team-a" = {
+      create_infra_project = true
+      create_admin_project = true
+    },
+    "hpc-team-b" = {
+      create_infra_project = true
+      create_admin_project = true
+    }
+  }
}

cloudbuildv2_repository_config = {
  repo_type = "GITLABv2"
  repositories = {
+    hpc-team-a = {
+      repository_name = "hpc-team-a-i-r"
+      repository_url  = "https://gitlab.com/user/hpc-team-a-i-r.git"
+    },
+    hpc-team-b = {
+      repository_name = "hpc-team-b-i-r"
+      repository_url  = "https://gitlab.com/user/hpc-team-b-i-r.git"
+    }
  }
  # The Secret ID format is: projects/PROJECT_NUMBER/secrets/SECRET_NAME
  gitlab_authorizer_credential_secret_id      = "REPLACE_WITH_READ_API_SECRET_ID"
  gitlab_read_authorizer_credential_secret_id = "REPLACE_WITH_READ_USER_SECRET_ID"
  gitlab_webhook_secret_id                    = "REPLACE_WITH_WEBHOOK_SECRET_ID"
  # If you are using a self-hosted instance, you may change the URL below accordingly
  gitlab_enterprise_host_uri = "https://gitlab.com"
}

After updating the variables, apply the modifications by pushing the code to a named environment branch.

Deploy baseline infrastructure in 5-appinfra

Under 5-appinfra you will find the two environment folders. They just need to be copied to your AppInfra pipeline repository and pushed to a named branch.

Apply Kueue Resources

Run the following command to create the necessary Kueue resources (ClusterQueue and LocalQueue). This step should be performed by a Batch Administrator after the namespaces are created, and it should be run only once:

kubectl apply -f manifests/kueue-resources.yaml

The queues that are created in this step will later be used to schedule batch jobs.
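
For reference, below is a minimal sketch of the kind of resources a file like manifests/kueue-resources.yaml typically defines. The flavor name, queue names, namespace, and quotas here are illustrative assumptions; the actual values are defined in the manifest shipped with this example:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor              # illustrative flavor name
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: cluster-queue               # shared, cluster-scoped queue
spec:
  namespaceSelector: {}             # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory"]
      flavors:
        - name: default-flavor
          resources:
            - name: "cpu"
              nominalQuota: 10      # example quota
            - name: "memory"
              nominalQuota: 64Gi    # example quota
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: hpc-team-b-queue            # illustrative name; one LocalQueue per team namespace
  namespace: hpc-team-b-development
spec:
  clusterQueue: cluster-queue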

Usage

Permissions within the Developer Platform

Team members will run the code through a Vertex AI Workbench instance. They must have permission to connect to the instance, and the instance will have permission to apply changes in their respective team namespace.

If the team member belongs to the hpc-team group defined in 3-fleetscope, they will have ADMIN permissions on the namespace (see module fleet_app_operator_permissions on 3-fleetscope).

If the team member wants to manage Kubernetes resources outside the instance, they will also need permission to connect to the cluster using ConnectGateway. For more information on managing ConnectGateway, refer to the following documentation.

If the user lacks the necessary privileges to assign these permissions, they can submit a pull request (PR) to the 3-fleetscope repository. This will allow the relevant personnel in charge of the cluster to review and address the request. Basic Kubernetes RBAC roles can be assigned using terraform with the following module.

Example PRs requesting permission assignment

For example, the user can open a PR against the 3-fleetscope terraform.tfvars file, adding an identity with ADMIN permissions on the namespace.

additional_namespace_identities = {
+  "hpc-team-b" = ["vertex-ai-instance-sa@infra-project-id.iam.gserviceaccount.com"]
}

And add Terraform code to assign ConnectGateway permissions:

+resource "google_project_iam_member" "compute_sa_roles" {
+  for_each = toset([
+    "roles/gkehub.connect",
+    "roles/gkehub.viewer",
+    "roles/gkehub.gatewayReader",
+    "roles/gkehub.scopeEditorProjectLevel"
+  ])
+  role    = each.key
+  project = var.fleet_project_id
+  member  = "serviceAccount:vertex-ai-instance-sa@infra-project-id.iam.gserviceaccount.com"
+}

Set Project for gcloud Commands

gcloud config set project REPLACE_WITH_YOUR_INFRA_PROJECT

Run gcluster Blueprint

The fsi-montecarlo-on-batch.yaml file contains a blueprint that is deployed with gcluster (Cluster Toolkit). It will create a notebook instance in the infrastructure project, along with its dependencies.

To deploy the blueprint, navigate to the source directory and run the following commands. Make sure you replace CLUSTER_NAME with your environment's cluster name, and use the team infrastructure project that was created in 4-appfactory as the PROJECT_ID:

PROJECT_ID=REPLACE_WITH_YOUR_INFRA_PROJECT
CLUSTER_NAME=REPLACE_WITH_CLUSTER_NAME
CLUSTER_PROJECT=REPLACE_WITH_CLUSTER_PROJECT

~/cluster-toolkit/gcluster deploy fsi-montecarlo-on-batch.yaml --vars "project_id=$PROJECT_ID,cluster_name=$CLUSTER_NAME,cluster_project=$CLUSTER_PROJECT" --auto-approve

NOTE: the example code is deployed for hpc-team-b. If you wish to deploy the example on hpc-team-a environment, you will need to adjust settings.tpl.toml and change the namespace and LocalQueue name.
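
Once the deployment finishes, you can optionally confirm that the notebook instance was created. This is only a quick sanity check: a Workbench instance is backed by a Compute Engine VM, so listing the VMs in the infrastructure project should show it:

gcloud compute instances list --project=$PROJECT_ID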

Run the Simulation Jobs and Visualize the Results

Prerequisites before running

Before running the jobs, historical stock data must be downloaded and uploaded to a bucket, which will then be used by the containers that run the batch jobs. This procedure allows isolating the containers from the external network and running the simulation in a secure environment.

You will find an auxiliary script named download_data.py in the helpers directory. The script uses the yfinance library to download stock data and the Google Cloud Storage Python client to upload this data in the format required by the application. Here is a step-by-step guide to download the data and upload it using the script.

IMPORTANT: The script must be run in an authenticated environment that has access to the internet. It will use Application Default Credentials (ADC) to authenticate with the bucket that was created on 5-appinfra stage.

  1. Navigate to the helpers directory.

  2. Before running the script, you will need to install its dependencies by running:

    pip install -r download_data_requirements.txt
  3. A bucket is created in the 5-appinfra stage and is passed to the script as a flag (--bucket_name=YOUR_BUCKET_NAME); you should use the bucket created in 5-appinfra for this purpose. The bucket follows the naming convention ${var.infra_project}-stocks-historical-data. Alternatively, if you have access to the Terraform state, you may also retrieve the bucket name by running terraform -chdir="../../5-appinfra/envs/development" output -raw stocks_data_bucket_name.

  4. To download data for all tickers that will be used for the simulation, execute the script by running the following command:

    BUCKET_NAME=YOUR_BUCKET_NAME
    python3 download_data.py --bucket_name=$BUCKET_NAME

    NOTE: Please be aware that the script processes a significant amount of stock data. As a result, it may take approximately 10 minutes to complete, depending on your machine's specifications and your network bandwidth.

After uploading the data to the bucket, you may proceed.
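
If you want to confirm that the upload succeeded before moving on, one simple check (assuming the gcloud CLI is installed and authenticated) is to list the bucket contents:

gcloud storage ls gs://${BUCKET_NAME}/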

Follow the tutorial in the original repository

Follow the steps outlined in the following document, after the "Open the Vertex AI Workbench Notebook" section:

Open the Vertex AI Workbench Notebook

IMPORTANT: Your Vertex AI Workbench instance will be located in the application infrastructure project that was created in 4-appfactory.

Use Case 2: HPC AI Model Training with GPU (Team: hpc-team-a)

This use case is based on the following example: Training with a Single GPU on Google Cloud.

Step 1: Connect to the Cluster

Before proceeding, ensure that the user is a member of the hpc-team-a group and has the necessary permissions to connect using ConnectGateway:

  • roles/gkehub.connect
  • roles/gkehub.viewer
  • roles/gkehub.gatewayReader

Once confirmed, execute the following command to connect to your cluster:

gcloud container fleet memberships get-credentials CLUSTER-NAME --project=YOUR-CLUSTER-PROJECT --location=YOUR-CLUSTER-REGION

Replace the placeholders as follows:

  • CLUSTER-NAME: The name of your cluster.
  • YOUR-CLUSTER-PROJECT: The project ID where your cluster is located.
  • YOUR-CLUSTER-REGION: The region of your cluster.

Step 2: Deploy the Example

  1. Retrieve the value for the IMAGE_URL variable:

    terraform -chdir=./5-appinfra/hpc/hpc-team-a/envs/development init
    export IMAGE_URL=$(terraform -chdir=./5-appinfra/hpc/hpc-team-a/envs/development output -raw image_url)

    NOTE: If you don't have access to the terraform state, the IMAGE_URL format is: us-central1-docker.pkg.dev/INFRA_PROJECT/private-images/ai-train:v1 where INFRA_PROJECT is your hpc-team-a infrastructure project ID.

  2. Run the job in the hpc-team-a-development namespace, using the namespace's LocalQueue and the variables retrieved above:

    envsubst < ./6-appsource/manifests/ai-training-job.yaml | kubectl -n hpc-team-a-development apply -f -
  3. Validate that your job finished by looking at the container logs and searching for "Training finished. Model saved":

    kubectl -n hpc-team-a-development logs jobs/mnist-training-job -c tensorflow
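
You can also check how the job was handled by Kueue. Kueue creates a Workload object for every submitted Job, so a quick way to inspect queueing and admission status (these are generic Kueue resources, not specific to this example) is:

kubectl -n hpc-team-a-development get localqueues
kubectl -n hpc-team-a-development get workloads

Workloads that have been admitted against the ClusterQueue quota are released to run; pending ones stay queued until quota becomes available.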