This tutorial shows you how to deploy a large language model (LLM) on Google Kubernetes Engine (GKE) by using GKE Inference Gateway. It includes steps for cluster setup, model deployment, GKE Inference Gateway configuration, and handling LLM requests.
This tutorial is intended for machine learning (ML) engineers, platform admins and operators, and data and AI specialists who want to deploy and manage LLM applications on GKE by using GKE Inference Gateway.
Before you read this page, make sure that you're familiar with the following:
Background
This section describes the key technologies used in this tutorial. For more information about model serving concepts and terminology, and how GKE generative AI capabilities can enhance and support your model serving performance, see About model inference on GKE.
vLLM
vLLM is a highly optimized open source LLM serving framework that increases serving throughput on GPUs, with features such as:
- An optimized transformer implementation with PagedAttention
- Continuous batching to improve the overall serving throughput
- Tensor parallelism and distributed serving across multiple GPUs
For more information, see the vLLM documentation.
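For a quick sanity check outside of GKE, you can start the same OpenAI-compatible vLLM server that this tutorial later deploys as a container. The following is a minimal sketch only; it assumes a machine with a supported GPU, the vllm Python package installed, and a Hugging Face token exported as HF_TOKEN, and it uses the same entrypoint and arguments as the Deployment manifest later in this tutorial:
# Minimal local sketch (assumes a GPU host with `pip install vllm` and an
# exported HF_TOKEN); the GKE Deployment later in this tutorial runs the same
# entrypoint with LoRA support enabled.
export HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
python3 -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-3.1-8B-Instruct \
    --tensor-parallel-size 1 \
    --port 8000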
GKE Inference Gateway
GKE Inference Gateway extends GKE's capabilities for serving LLMs. It optimizes inference workloads with features such as:
- Inference load balancing optimized on load metrics.
- Dense multi-workload serving of LoRA adapters.
- Model-aware routing for simplified operations.
For more information, see About GKE Inference Gateway.
Objectives
Before you begin
- Sign in to your Google Cloud account. If you're new to Google Cloud, create an account to evaluate how our products perform in real-world scenarios. New customers also get $300 in free credits to run, test, and deploy workloads.
- In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
- Make sure that billing is enabled for your Google Cloud project.
- Enable the required API.
- Make sure that you have the following role or roles on the project: roles/container.admin, roles/iam.serviceAccountAdmin
Check for the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- In the Principal column, find all rows that identify you or a group that you're included in. To learn which groups you're included in, contact your administrator.
- For all rows that specify or include you, check the Role column to see whether the list of roles includes the required roles.
Grant the roles
- In the Google Cloud console, go to the IAM page.
- Select the project.
- Click Grant access.
- In the New principals field, enter your user identifier. This is typically the email address for a Google Account.
- In the Select a role list, select a role.
- To grant additional roles, click Add another role and add each additional role.
- Click Save.
- Create a Hugging Face account, if you don't already have one.
- Make sure your project has sufficient quota for H100 GPUs. To learn more, see Plan GPU quota and Allocation quotas.
Get access to the model
To deploy the Llama3.1 model to GKE, sign the license consent agreement and generate a Hugging Face access token.
Sign the license consent agreement
You must sign the consent agreement to use the Llama3.1 model. Follow these instructions:
- Access the consent page and verify consent using your Hugging Face account.
- Accept the model terms.
Generate an access token
To access the model through Hugging Face, you need a Hugging Face token. Follow these steps to generate a new token if you don't already have one. An optional command to verify the token follows these steps.
- Click Your Profile > Settings > Access Tokens.
- Select New Token.
- Specify a name of your choice and a role of at least Read.
- Select Generate a token.
- Copy the generated token to your clipboard.
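To optionally confirm that the token works before you use it in the cluster, you can call the Hugging Face whoami-v2 API. This is a sketch that assumes the public huggingface.co API and that the token is exported as HF_TOKEN:
# Optional check: a valid token returns your account details; an invalid or
# expired token returns a 401 error. Assumes HF_TOKEN holds the token.
curl -s -H "Authorization: Bearer ${HF_TOKEN}" https://huggingface.co/api/whoami-v2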
Prepare your environment
In this tutorial, you use Cloud Shell to manage resources hosted on Google Cloud. Cloud Shell comes preinstalled with the software you need for this tutorial, including kubectl and the gcloud CLI.
To set up your environment with Cloud Shell, follow these steps:
In the Google Cloud console, launch a Cloud Shell session by clicking Activate Cloud Shell. This launches a session in the bottom pane of the Google Cloud console.
Set the default environment variables:
gcloud config set project PROJECT_ID
export PROJECT_ID=$(gcloud config get project)
export REGION=REGION
export CLUSTER_NAME=CLUSTER_NAME
export HF_TOKEN=HF_TOKEN
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
- HF_TOKEN: the Hugging Face token you generated earlier.
Create and configure Google Cloud resources
To create the required resources, follow these instructions.
Create a GKE cluster and node pool
Deploy LLMs on GPUs in a GKE Autopilot or Standard cluster. We recommend an Autopilot cluster for a fully managed Kubernetes experience. To choose the GKE mode of operation that's the best fit for your workloads, see Choose a GKE mode of operation.
Autopilot
In Cloud Shell, run the following command:
gcloud container clusters create-auto CLUSTER_NAME \
--project=PROJECT_ID \
--region=REGION \
--release-channel=rapid \
--cluster-version=1.32.3-gke.1170000
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
GKE creates an Autopilot cluster with CPU and GPU nodes as requested by the deployed workloads.
Standard
In Cloud Shell, run the following command to create a Standard cluster:
gcloud container clusters create CLUSTER_NAME \
    --project=PROJECT_ID \
    --region=REGION \
    --workload-pool=PROJECT_ID.svc.id.goog \
    --release-channel=rapid \
    --num-nodes=1 \
    --enable-managed-prometheus \
    --monitoring=SYSTEM,DCGM \
    --cluster-version=1.32.3-gke.1170000
Replace the following values:
- PROJECT_ID: your Google Cloud project ID.
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
The cluster creation might take several minutes.
To create a node pool with the appropriate disk size for running the Llama-3.1-8B-Instruct model, run the following command:
gcloud container node-pools create gpupool \
    --accelerator type=nvidia-h100-80gb,count=2,gpu-driver-version=latest \
    --project=PROJECT_ID \
    --location=REGION \
    --node-locations=REGION-a \
    --cluster=CLUSTER_NAME \
    --machine-type=a3-highgpu-2g \
    --num-nodes=1 \
    --disk-type="pd-standard"
GKE creates a node pool with a single a3-highgpu-2g node that has two H100 GPUs.
To set up authorization to scrape metrics, create the inference-gateway-sa-metrics-reader-secret secret:
kubectl apply -f - <<EOF
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-metrics-reader
rules:
- nonResourceURLs:
  - /metrics
  verbs:
  - get
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: inference-gateway-sa-metrics-reader
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: inference-gateway-sa-metrics-reader-role-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: inference-gateway-sa-metrics-reader
  namespace: default
roleRef:
  kind: ClusterRole
  name: inference-gateway-metrics-reader
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: v1
kind: Secret
metadata:
  name: inference-gateway-sa-metrics-reader-secret
  namespace: default
  annotations:
    kubernetes.io/service-account.name: inference-gateway-sa-metrics-reader
type: kubernetes.io/service-account-token
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: inference-gateway-sa-metrics-reader-secret-read
rules:
- resources:
  - secrets
  apiGroups: [""]
  verbs: ["get", "list", "watch"]
  resourceNames: ["inference-gateway-sa-metrics-reader-secret"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: gmp-system:collector:inference-gateway-sa-metrics-reader-secret-read
  namespace: default
roleRef:
  name: inference-gateway-sa-metrics-reader-secret-read
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
subjects:
- name: collector
  namespace: gmp-system
  kind: ServiceAccount
EOF
Create a Kubernetes Secret for Hugging Face credentials
In Cloud Shell, do the following:
Configure kubectl to communicate with your cluster:
gcloud container clusters get-credentials CLUSTER_NAME \
    --location=REGION
Replace the following values:
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
Create a Kubernetes Secret that contains the Hugging Face token:
kubectl create secret generic HF_SECRET \
    --from-literal=token=HF_TOKEN \
    --dry-run=client -o yaml | kubectl apply -f -
Replace the following:
- HF_TOKEN: the Hugging Face token you generated earlier.
- HF_SECRET: the name of the Kubernetes Secret. For example, hf-secret.
Install the InferenceModel and InferencePool CRDs
In this section, you install the necessary custom resource definitions (CRDs) for GKE Inference Gateway.
CRDs extend the Kubernetes API. This lets you define new resource types. To use GKE Inference Gateway, install the InferencePool and InferenceModel CRDs in your GKE cluster by running the following command:
kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml
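To confirm that the CRDs were installed before you continue, you can list them. This quick check assumes the inference.networking.x-k8s.io API group used by the manifests in this tutorial:
# Both the InferencePool and InferenceModel CRDs should be listed.
kubectl get crd | grep inference.networking.x-k8s.io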
Deploy the model server
This example deploys a Llama3.1 model by using a vLLM model server. The Deployment is labeled app:vllm-llama3-8b-instruct. The Deployment also uses two LoRA adapters from Hugging Face, named food-review and cad-fabricator. You can update this Deployment with your own model server and model container, serving port, and deployment name. You can optionally configure LoRA adapters in the Deployment, or deploy the base model.
To deploy on an nvidia-h100-80gb accelerator type, save the following manifest as vllm-llama3-8b-instruct.yaml. This manifest defines a Kubernetes Deployment with your model and model server:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3-8b-instruct
spec:
  replicas: 3
  selector:
    matchLabels:
      app: vllm-llama3-8b-instruct
  template:
    metadata:
      labels:
        app: vllm-llama3-8b-instruct
    spec:
      containers:
      - name: vllm
        image: "vllm/vllm-openai:latest"
        imagePullPolicy: Always
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--tensor-parallel-size"
        - "1"
        - "--port"
        - "8000"
        - "--enable-lora"
        - "--max-loras"
        - "2"
        - "--max-cpu-loras"
        - "12"
        env:
        # Enabling LoRA support temporarily disables automatic v1, we want to force it on
        # until 0.8.3 vLLM is released.
        - name: VLLM_USE_V1
          value: "1"
        - name: PORT
          value: "8000"
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token
        - name: VLLM_ALLOW_RUNTIME_LORA_UPDATING
          value: "true"
        ports:
        - containerPort: 8000
          name: http
          protocol: TCP
        lifecycle:
          preStop:
            # vLLM stops accepting connections when it receives SIGTERM, so we need to sleep
            # to give upstream gateways a chance to take us out of rotation. The time we wait
            # is dependent on the time it takes for all upstreams to completely remove us from
            # rotation. Older or simpler load balancers might take upwards of 30s, but we expect
            # our deployment to run behind a modern gateway like Envoy which is designed to
            # probe for readiness aggressively.
            sleep:
              # Upstream gateway probers for health should be set on a low period, such as 5s,
              # and the shorter we can tighten that bound the faster that we release
              # accelerators during controlled shutdowns. However, we should expect variance,
              # as load balancers may have internal delays, and we don't want to drop requests
              # normally, so we're often aiming to set this value to a p99 propagation latency
              # of readiness -> load balancer taking backend out of rotation, not the average.
              #
              # This value is generally stable and must often be experimentally determined on
              # for a given load balancer and health check period. We set the value here to
              # the highest value we observe on a supported load balancer, and we recommend
              # tuning this value down and verifying no requests are dropped.
              #
              # If this value is updated, be sure to update terminationGracePeriodSeconds.
              #
              seconds: 30
            #
            # IMPORTANT: preStop.sleep is beta as of Kubernetes 1.30 - for older versions
            # replace with this exec action.
            #exec:
            #  command:
            #  - /usr/bin/sleep
            #  - 30
        livenessProbe:
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          # vLLM's health check is simple, so we can more aggressively probe it. Liveness
          # check endpoints should always be suitable for aggressive probing.
          periodSeconds: 1
          successThreshold: 1
          # vLLM has a very simple health implementation, which means that any failure is
          # likely significant. However, any liveness triggered restart requires the very
          # large core model to be reloaded, and so we should bias towards ensuring the
          # server is definitely unhealthy vs immediately restarting. Use 5 attempts as
          # evidence of a serious problem.
          failureThreshold: 5
          timeoutSeconds: 1
        readinessProbe:
          httpGet:
            path: /health
            port: http
            scheme: HTTP
          # vLLM's health check is simple, so we can more aggressively probe it. Readiness
          # check endpoints should always be suitable for aggressive probing, but may be
          # slightly more expensive than liveness probes.
          periodSeconds: 1
          successThreshold: 1
          # vLLM has a very simple health implementation, which means that any failure is
          # likely significant.
          failureThreshold: 1
          timeoutSeconds: 1
        # We set a startup probe so that we don't begin directing traffic or checking
        # liveness to this instance until the model is loaded.
        startupProbe:
          # Failure threshold is when we believe startup will not happen at all, and is set
          # to the maximum possible time we believe loading a model will take. In our
          # default configuration we are downloading a model from HuggingFace, which may
          # take a long time, then the model must load into the accelerator. We choose
          # 10 minutes as a reasonable maximum startup time before giving up and attempting
          # to restart the pod.
          #
          # IMPORTANT: If the core model takes more than 10 minutes to load, pods will crash
          # loop forever. Be sure to set this appropriately.
          failureThreshold: 3600
          # Set delay to start low so that if the base model changes to something smaller
          # or an optimization is deployed, we don't wait unnecessarily.
          initialDelaySeconds: 2
          # As a startup probe, this stops running and so we can more aggressively probe
          # even a moderately complex startup - this is a very important workload.
          periodSeconds: 1
          httpGet:
            # vLLM does not start the OpenAI server (and hence make /health available)
            # until models are loaded. This may not be true for all model servers.
            path: /health
            port: http
            scheme: HTTP
        resources:
          limits:
            nvidia.com/gpu: 1
          requests:
            nvidia.com/gpu: 1
        volumeMounts:
        - mountPath: /data
          name: data
        - mountPath: /dev/shm
          name: shm
        - name: adapters
          mountPath: "/adapters"
      initContainers:
      - name: lora-adapter-syncer
        tty: true
        stdin: true
        image: us-central1-docker.pkg.dev/k8s-staging-images/gateway-api-inference-extension/lora-syncer:main
        restartPolicy: Always
        imagePullPolicy: Always
        env:
        - name: DYNAMIC_LORA_ROLLOUT_CONFIG
          value: "/config/configmap.yaml"
        volumeMounts: # DO NOT USE subPath, dynamic configmap updates don't work on subPaths
        - name: config-volume
          mountPath: /config
      restartPolicy: Always
      # vLLM allows VLLM_PORT to be specified as an environment variable, but a user might
      # create a 'vllm' service in their namespace. That auto-injects VLLM_PORT in docker
      # compatible form as `tcp://<IP>:<PORT>` instead of the numeric value vLLM accepts
      # causing CrashLoopBackoff. Set service environment injection off by default.
      enableServiceLinks: false
      # Generally, the termination grace period needs to last longer than the slowest request
      # we expect to serve plus any extra time spent waiting for load balancers to take the
      # model server out of rotation.
      #
      # An easy starting point is the p99 or max request latency measured for your workload,
      # although LLM request latencies vary significantly if clients send longer inputs or
      # trigger longer outputs. Since steady state p99 will be higher than the latency
      # to drain a server, you may wish to slightly adjust this value either experimentally or
      # via the calculation below.
      #
      # For most models you can derive an upper bound for the maximum drain latency as
      # follows:
      #
      # 1. Identify the maximum context length the model was trained on, or the maximum
      #    allowed length of output tokens configured on vLLM (llama2-7b was trained to
      #    4k context length, while llama3-8b was trained to 128k).
      # 2. Output tokens are the more compute intensive to calculate and the accelerator
      #    will have a maximum concurrency (batch size) - the time per output token at
      #    maximum batch with no prompt tokens being processed is the slowest an output
      #    token can be generated (for this model it would be about 100ms TPOT at a max
      #    batch size around 50)
      # 3. Calculate the worst case request duration if a request starts immediately
      #    before the server stops accepting new connections - generally when it receives
      #    SIGTERM (for this model that is about 4096 / 10 ~ 40s)
      # 4. If there are any requests generating prompt tokens that will delay when those
      #    output tokens start, and prompt token generation is roughly 6x faster than
      #    compute-bound output token generation, so add 20% to the time from above (40s +
      #    16s ~ 55s)
      #
      # Thus we think it will take us at worst about 55s to complete the longest possible
      # request the model is likely to receive at maximum concurrency (highest latency)
      # once requests stop being sent.
      #
      # NOTE: This number will be lower than steady state p99 latency since we stop receiving
      #       new requests which require continuous prompt token computation.
      # NOTE: The max timeout for backend connections from gateway to model servers should
      #       be configured based on steady state p99 latency, not drain p99 latency
      #
      # 5. Add the time the pod takes in its preStop hook to allow the load balancers to stop
      #    sending us new requests (55s + 30s ~ 85s)
      #
      # Because the termination grace period controls when the Kubelet forcibly terminates a
      # stuck or hung process (a possibility due to a GPU crash), there is operational safety
      # in keeping the value roughly proportional to the time to finish serving. There is also
      # value in adding a bit of extra time to deal with unexpectedly long workloads.
      #
      # 6. Add a 50% safety buffer to this time since the operational impact should be low
      #    (85s * 1.5 ~ 130s)
      #
      # One additional source of drain latency is that some workloads may run close to
      # saturation and have queued requests on each server. Since traffic in excess of the
      # max sustainable QPS will result in timeouts as the queues grow, we assume that failure
      # to drain in time due to excess queues at the time of shutdown is an expected failure
      # mode of server overload. If your workload occasionally experiences high queue depths
      # due to periodic traffic, consider increasing the safety margin above to account for
      # time to drain queued requests.
      terminationGracePeriodSeconds: 130
      nodeSelector:
        cloud.google.com/gke-accelerator: "nvidia-h100-80gb"
      volumes:
      - name: data
        emptyDir: {}
      - name: shm
        emptyDir:
          medium: Memory
      - name: adapters
        emptyDir: {}
      - name: config-volume
        configMap:
          name: vllm-llama3-8b-adapters
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-llama3-8b-adapters
data:
  configmap.yaml: |
    vLLMLoRAConfig:
      name: vllm-llama3.1-8b-instruct
      port: 8000
      defaultBaseModel: meta-llama/Llama-3.1-8B-Instruct
      ensureExist:
        models:
        - id: food-review
          source: Kawon/llama3.1-food-finetune_v14_r8
        - id: cad-fabricator
          source: redcathode/fabricator
---
kind: HealthCheckPolicy
apiVersion: networking.gke.io/v1
metadata:
  name: health-check-policy
  namespace: default
spec:
  targetRef:
    group: "inference.networking.x-k8s.io"
    kind: InferencePool
    name: vllm-llama3-8b-instruct
  default:
    config:
      type: HTTP
      httpHealthCheck:
        requestPath: /health
        port: 8000
Apply the manifest to your cluster:
kubectl apply -f vllm-llama3-8b-instruct.yaml
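Downloading and loading the model can take several minutes, as reflected by the startup probe settings in the manifest. To watch the rollout, you can use standard kubectl checks such as the following:
# Check the model server Pods, then wait for the Deployment to report that it
# is fully available.
kubectl get pods -l app=vllm-llama3-8b-instruct
kubectl wait --for=condition=Available deployment/vllm-llama3-8b-instruct --timeout=15m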
Create an InferencePool resource
The InferencePool Kubernetes custom resource defines a group of Pods with a common base LLM and compute configuration.
The InferencePool custom resource includes the following key fields:
- selector: specifies which Pods belong to this pool. The labels in this selector must exactly match the labels applied to your model server Pods.
- targetPort: defines the port used by the model server within the Pods.
The InferencePool resource enables GKE Inference Gateway to route traffic to your model server Pods.
To create an InferencePool by using Helm, perform the following steps:
helm install vllm-llama3-8b-instruct \
--set inferencePool.modelServers.matchLabels.app=vllm-llama3-8b-instruct \
--set provider.name=gke \
--version v0.3.0 \
oci://registry.k8s.io/gateway-api-inference-extension/charts/inferencepool
Change the following field to match your Deployment:
- inferencePool.modelServers.matchLabels.app: the key of the label used to select your model server Pods.
This command creates an InferencePool object that logically represents your model server Deployment and references the model endpoint services within the Pods that the Selector selects.
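To inspect the created object and confirm that its selector matches the labels on your model server Pods, you can use a check like the following. This sketch assumes that the InferencePool is named after the Helm release, vllm-llama3-8b-instruct:
# Inspect the InferencePool and list the Pods that its selector should match.
kubectl get inferencepool vllm-llama3-8b-instruct -o yaml
kubectl get pods -l app=vllm-llama3-8b-instruct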
Create InferenceModel resources with a serving criticality
The InferenceModel Kubernetes custom resource defines a specific model, including LoRA-tuned models, and its serving criticality.
The InferenceModel custom resource includes the following key fields:
- modelName: specifies the name of the base model or LoRA adapter.
- criticality: specifies the serving criticality of the model.
- poolRef: references the InferencePool that the model is served on.
The InferenceModel enables GKE Inference Gateway to route traffic to your model server Pods based on the model name and criticality.
To create an InferenceModel, perform the following steps:
Save the following sample manifest as inferencemodel.yaml:
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: inferencemodel-sample
spec:
  modelName: MODEL_NAME
  criticality: CRITICALITY
  poolRef:
    name: INFERENCE_POOL_NAME
Replace the following:
- MODEL_NAME: the name of the base model or LoRA adapter. For example, food-review.
- CRITICALITY: the serving criticality you choose. Choose from Critical, Standard, or Sheddable. For example, Standard.
- INFERENCE_POOL_NAME: the name of the InferencePool that you created in the previous step. For example, vllm-llama3-8b-instruct.
Apply the sample manifest to your cluster:
kubectl apply -f inferencemodel.yaml
The following sample creates an InferenceModel object that configures the food-review LoRA model on the vllm-llama3-8b-instruct InferencePool with a Standard serving criticality. The InferenceModel object also configures the base model to be served with a Critical priority level.
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: food-review
spec:
  modelName: food-review
  criticality: Standard
  poolRef:
    name: vllm-llama3-8b-instruct
  targetModels:
  - name: food-review
    weight: 100
---
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: llama3-base-model
spec:
  modelName: meta-llama/Llama-3.1-8B-Instruct
  criticality: Critical
  poolRef:
    name: vllm-llama3-8b-instruct
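After you apply your InferenceModel manifests, you can list the objects to confirm that they reference the expected InferencePool, for example:
# Both the LoRA adapter and the base model entries should be listed.
kubectl get inferencemodels -n default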
Create the Gateway
The Gateway resource acts as the entry point for external traffic into your Kubernetes cluster. It defines the listeners that accept incoming connections.
GKE Inference Gateway supports the gke-l7-rilb and gke-l7-regional-external-managed Gateway classes. For more information, see the Gateway classes section of the GKE documentation.
To create a Gateway, perform the following steps:
Save the following sample manifest as gateway.yaml:
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: GATEWAY_NAME
spec:
  gatewayClassName: gke-l7-regional-external-managed
  listeners:
    - protocol: HTTP # Or HTTPS for production
      port: 80 # Or 443 for HTTPS
      name: http
Replace GATEWAY_NAME with a unique name for your Gateway resource. For example, inference-gateway.
Apply the manifest to your cluster:
kubectl apply -f gateway.yaml
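Provisioning the load balancer can take several minutes. To check when the Gateway is programmed and has been assigned an address, you can use a check like the following:
# Wait for the Gateway to report the Programmed condition and show its address.
kubectl get gateway GATEWAY_NAME
kubectl wait --for=condition=Programmed gateway/GATEWAY_NAME --timeout=15m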
Create the HTTPRoute resource
In this section, you create an HTTPRoute resource to define how the Gateway routes incoming HTTP requests to your InferencePool.
The HTTPRoute resource defines how the GKE Gateway routes incoming HTTP requests to backend services, which in this case is your InferencePool. It specifies matching rules (for example, headers or paths) and the backend to which traffic should be forwarded.
To create an HTTPRoute, perform the following steps:
Save the following sample manifest as httproute.yaml:
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: HTTPROUTE_NAME
spec:
  parentRefs:
  - name: GATEWAY_NAME
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: PATH_PREFIX
    backendRefs:
    - name: INFERENCE_POOL_NAME
      group: inference.networking.x-k8s.io
      kind: InferencePool
Replace the following:
- HTTPROUTE_NAME: a unique name for your HTTPRoute resource. For example, my-route.
- GATEWAY_NAME: the name of the Gateway resource that you created. For example, inference-gateway.
- PATH_PREFIX: the path prefix that the route uses to match incoming requests. For example, / to match everything.
- INFERENCE_POOL_NAME: the name of the InferencePool resource that you want to route traffic to. For example, vllm-llama3-8b-instruct.
Apply the manifest to your cluster:
kubectl apply -f httproute.yaml
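To verify that the Gateway accepted the route, you can inspect the HTTPRoute status conditions, for example:
# The route status should show an Accepted condition from the parent Gateway.
kubectl get httproute HTTPROUTE_NAME -o yaml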
Send an inference request
After you configure GKE Inference Gateway, you can send inference requests to your deployed model.
To send inference requests, perform the following steps:
- Retrieve the Gateway endpoint.
- Construct a properly formatted JSON request.
- Use curl to send the request to the /v1/completions endpoint.
This lets you generate text based on your input prompt and specified parameters.
To get the Gateway endpoint, run the following command:
IP=$(kubectl get gateway/GATEWAY_NAME -o jsonpath='{.status.addresses[0].address}')
PORT=PORT_NUMBER # Use 443 for HTTPS, or 80 for HTTP
Replace the following:
- GATEWAY_NAME: the name of the Gateway resource.
- PORT_NUMBER: the port number that you configured in the Gateway.
To send a request to the /v1/completions endpoint by using curl, run the following command:
curl -i -X POST https://${IP}:${PORT}/v1/completions \
    -H 'Content-Type: application/json' \
    -H "Authorization: Bearer $(gcloud auth print-access-token)" \
    -d '{
      "model": "MODEL_NAME",
      "prompt": "PROMPT_TEXT",
      "max_tokens": MAX_TOKENS,
      "temperature": "TEMPERATURE"
    }'
Replace the following (a filled-in example follows this list):
- MODEL_NAME: the name of the model or LoRA adapter to use.
- PROMPT_TEXT: the input prompt for the model.
- MAX_TOKENS: the maximum number of tokens to generate in the response.
- TEMPERATURE: controls the randomness of the output. Use 0 for deterministic output, or a higher number for more creative output.
Note the following behaviors:
- Request body: the request body can include additional parameters, such as stop and top_p. For a complete list of options, see the OpenAI API specification.
- Error handling: implement proper error handling in your client code to handle potential errors in the response. For example, check the HTTP status code in the curl response; a non-200 status code generally indicates an error.
- Authentication and authorization: for production deployments, secure your API endpoint with authentication and authorization mechanisms. Include the appropriate headers, such as Authorization, in your requests.
Configure observability for Inference Gateway
GKE Inference Gateway provides visibility into the health, performance, and behavior of your inference workloads. This helps you identify and resolve issues, optimize resource utilization, and ensure the reliability of your applications. You can view these observability metrics in Cloud Monitoring through Metrics Explorer.
To configure observability for GKE Inference Gateway, see Configure observability.
Delete the deployed resources
To avoid incurring charges to your Google Cloud account for the resources that you created in this guide, run the following command:
gcloud container clusters delete CLUSTER_NAME \
--region=REGION
Replace the following values:
- REGION: a region that supports the accelerator type you want to use, for example, us-central1 for H100 GPUs.
- CLUSTER_NAME: the name of your cluster.
What's next
- Learn about GKE Inference Gateway.
- Learn how to deploy GKE Inference Gateway.