Build Scalable LLM Apps With Kubernetes: A Step-by-Step Guide

Large language models (LLMs) like GPT-4 have transformed the possibilities of AI, unlocking new advancements in natural language processing, conversational AI and content creation. Their impact stretches across industries, from powering chatbots and virtual assistants to automating document analysis and enhancing customer engagement.
But while LLMs promise immense potential, deploying them effectively in real-world scenarios presents unique challenges. These models demand significant computational resources, seamless scalability and efficient traffic management to meet the demands of production environments.
That’s where Kubernetes comes in. Recognized as the leading container orchestration platform, Kubernetes can provide a dynamic and reliable framework for managing and scaling LLM-based applications in a cloud native ecosystem. Kubernetes’ ability to handle containerized workloads makes it an essential tool for organizations looking to operationalize AI solutions without compromising on performance or flexibility.
This step-by-step guide will take you through the process of deploying and scaling an LLM-powered application using Kubernetes. Understanding how to scale AI applications efficiently is the difference between a model stuck in research environments and one delivering actionable results in production. We’ll consider how to containerize LLM applications, deploy them to Kubernetes, configure autoscaling to meet fluctuating demands and manage user traffic for optimal performance.
This is about turning cutting-edge AI into a practical, scalable engine driving innovation for your organization.
Prerequisites
Before beginning this tutorial, ensure you have the following in place:
- Basic knowledge of Kubernetes: Familiarity with kubectl, deployments, services and pods is a must.
- Docker installed and configured on your system.
- A running Kubernetes cluster, either local (such as minikube) or in the cloud (AWS Elastic Kubernetes Service, Google Kubernetes Engine or Microsoft Azure Kubernetes Service).
- The OpenAI and Flask packages installed in your Python environment to create the LLM application.
Install the necessary Python dependencies:
pip install openai flask
Step 1: Creating an LLM-Powered Application
We’ll start by building a simple Python-based API for interacting with an LLM (for instance, OpenAI’s GPT-4).
Code for the Application
Create a file named app.py:

from flask import Flask, request, jsonify
from openai import OpenAI
import os

# Initialize Flask app
app = Flask(__name__)

# Configure the OpenAI client with the API key from the environment
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

@app.route("/generate", methods=["POST"])
def generate():
    try:
        data = request.get_json()
        prompt = data.get("prompt", "")

        # Generate a response using GPT-4
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100
        )
        return jsonify({"response": response.choices[0].message.content.strip()})
    except Exception as e:
        return jsonify({"error": str(e)}), 500

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
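Before containerizing anything, it helps to sanity-check the API locally. The commands below are a quick sketch that assumes you have a valid OpenAI API key exported in your shell and that port 5000 is free:

# Run the app locally
export OPENAI_API_KEY="your_openai_api_key"
python app.py

# In a second terminal, send a test prompt
curl -X POST http://localhost:5000/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Say hello."}'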
Step 2: Containerizing the Application
To deploy the application to Kubernetes, we need to package it in a Docker container.
Dockerfile
Create a Dockerfile in the same directory as app.py:

# Use an official Python runtime as the base image
FROM python:3.9-slim

# Set the working directory
WORKDIR /app

# Copy application files
COPY app.py /app

# Install dependencies
RUN pip install flask openai

# Expose the application port
EXPOSE 5000

# Run the application
CMD ["python", "app.py"]
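Optionally, a .dockerignore file in the same directory keeps local artifacts out of the build context. The entries below are illustrative and assume a typical Python project layout:

__pycache__/
*.pyc
.venv/
.git/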
Step 3: Building and Pushing the Docker Image
Build the Docker image and push it to a container registry (such as Docker Hub).
# Build the image
docker build -t your-dockerhub-username/llm-app:v1 .

# Push the image
docker push your-dockerhub-username/llm-app:v1
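Before pushing, you can optionally smoke-test the image locally. This sketch assumes your API key is supplied as an environment variable and that port 5000 is free on your machine:

docker run --rm -p 5000:5000 \
  -e OPENAI_API_KEY="your_openai_api_key" \
  your-dockerhub-username/llm-app:v1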
Step 4: Deploying the Application to Kubernetes
We’ll create a Kubernetes deployment and service to manage and expose the LLM application.
Deployment YAML
Create a file named deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-app
  template:
    metadata:
      labels:
        app: llm-app
    spec:
      containers:
      - name: llm-app
        image: your-dockerhub-username/llm-app:v1
        ports:
        - containerPort: 5000
        env:
        - name: OPENAI_API_KEY
          valueFrom:
            secretKeyRef:
              name: openai-secret
              key: api-key
---
apiVersion: v1
kind: Service
metadata:
  name: llm-app-service
spec:
  selector:
    app: llm-app
  ports:
  - protocol: TCP
    port: 80
    targetPort: 5000
  type: LoadBalancer
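One detail worth calling out: the CPU-based autoscaling configured in Step 6 only works if the container declares CPU requests. As a hedged example, you could add a resources block like the following under the container spec; the values are illustrative and should be tuned to your workload:

        resources:
          requests:
            cpu: "250m"
            memory: "256Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"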
Secret for API Key
Create a Kubernetes secret to securely store the OpenAI API key:
kubectl create secret generic openai-secret --from-literal=api-key="your_openai_api_key"
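If you prefer to manage the secret declaratively, an equivalent manifest might look like the sketch below; stringData lets you supply the value unencoded, so keep such files out of version control:

apiVersion: v1
kind: Secret
metadata:
  name: openai-secret
type: Opaque
stringData:
  api-key: "your_openai_api_key"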
Step 5: Applying the Deployment and Service
Deploy the application to the Kubernetes cluster:
kubectl apply -f deployment.yaml

Verify the deployment:

kubectl get deployments
kubectl get pods
kubectl get services
Once the service is running, note the external IP address (if using a cloud provider) or the NodePort (if using minikube).
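On minikube, for example, you can ask for a reachable URL for the service; this assumes the service name from deployment.yaml:

minikube service llm-app-service --url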
Step 6: Configuring Autoscaling
Kubernetes Horizontal Pod Autoscaler (HPA) allows you to scale pods based on CPU or memory utilization. Note that CPU-based scaling relies on resource metrics, so make sure the Metrics Server (installed in Step 7) is running in your cluster and that the container declares CPU requests.
Apply HPA
kubectl autoscale deployment llm-app --cpu-percent=50 --min=3 --max=10
Check the status of the HPA:
kubectl get hpa
The autoscaler will adjust the number of pods in the llm-app deployment based on the load.
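If you prefer a declarative setup, the imperative command above corresponds roughly to an autoscaling/v2 manifest like this sketch:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-app
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50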
Step 7: Monitoring and Logging
Monitoring and logging are critical for maintaining and troubleshooting LLM applications.
Enable Monitoring
Use tools like Prometheus and Grafana to monitor Kubernetes clusters. For basic monitoring, Kubernetes Metrics Server can provide resource usage data.
Install Metrics Server:
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
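Once Metrics Server is running, you can check resource usage directly:

kubectl top nodes
kubectl top pods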
View Logs
Inspect logs from the running pods:
kubectl logs <pod-name>
For aggregated logs, consider tools like Fluentd, Elasticsearch and Kibana.
Step 8: Testing the Application
Test the LLM API using a tool like curl or Postman:
curl -X POST http://<external-ip>/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Explain Kubernetes in simple terms."}'
Expected output:
{
  "response": "Kubernetes is an open-source platform that manages containers..."
}
Step 9: Scaling Beyond Kubernetes
To handle more advanced workloads or deploy across multiple regions:
- Use service mesh: Tools like Istio can manage traffic between microservices.
- Implement multicluster deployments: Tools like KubeFed or cloud provider solutions (like Google Anthos) enable multicluster management.
- Integrate CI/CD: Automate deployments using pipelines with Jenkins, GitHub Actions or GitLab CI (a minimal workflow sketch follows this list).
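As a rough illustration of the CI/CD point, a minimal GitHub Actions workflow might build and push the image on every push to main. The secret names and tag scheme here are assumptions, not a fixed convention:

name: build-and-push
on:
  push:
    branches: [main]
jobs:
  docker:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          context: .
          push: true
          tags: your-dockerhub-username/llm-app:${{ github.sha }}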
Conclusion
Building and deploying a scalable LLM application using Kubernetes might seem complex, but as we’ve seen, the process is both achievable and rewarding. Starting from creating an LLM-powered API to deploying and scaling it within a Kubernetes cluster, you now have a blueprint for making your applications robust, scalable and ready for production environments.
With Kubernetes’ features including autoscaling, monitoring and service discovery, your setup is built to handle real-world demands effectively. From here, you can push boundaries even further by exploring advanced enhancements such as canary deployments, A/B testing or integrating serverless components using Kubernetes native tools like Knative. The possibilities are endless, and this foundation is just the start.
Want to learn more about LLMs? Discover how to leverage LangChain and optimize large language models effectively in Andela’s guide, “Using Langchain to Benchmark LLM Application Performance.”