How to Deploy Local LLMs to Kubernetes

Provision GPU nodes with at least 24 GB VRAM and install the NVIDIA GPU Operator via Helm.
Verify GPU availability by checking nvidia.com/gpu in node allocatable resources.
Select a serving framework — start with a plain vLLM Deployment for single-model setups.
Configure GPU resource requests/limits and taint nodes for dedicated inference workloads.
Deploy vLLM using the provided Helm chart with your Hugging Face token secret.
Expose custom Prometheus metrics (queue depth) via the Prometheus Adapter.
Enable HPA autoscaling targeting vllm_queue_depth with tuned stabilization windows.
Validate the full pipeline by sending a completion request and confirming a JSON response.

Running local LLMs on Kubernetes gives DevOps teams a self-hosted inference path with health checks, autoscaling, and rolling updates, without depending on costly cloud API endpoints.

Prerequisites: Preparing Your Cluster for GPU Workloads
Choosing a Serving Framework: KServe vs. Ray Serve vs. Simple Deployment
Resource Management: GPU Requests, Limits, and Bin-Packing
Autoscaling Inference: HPA on Custom Metrics
Full Walkthrough: Helm Chart for a vLLM Service
Implementation Checklist
Where to Go Next

Deploying these workloads on Kubernetes brings together data privacy guarantees, infrastructure-bound costs that scale with hardware utilization rather than per-token API pricing, and lower latency by keeping model serving inside the cluster boundary (eliminating the network round-trip to external APIs, typically 50 to 200 ms).

Kubernetes fits inference workload management well because its core orchestration primitives, including scheduling, health checks, rolling updates, and horizontal scaling, map directly onto the operational requirements of serving large language models. GPU-aware scheduling ensures pods land on nodes with available accelerators, while liveness and readiness probes guard against serving stale or crashed model instances. Rolling updates then enable zero-downtime model version swaps.

This guide walks through the full pipeline: from preparing GPU nodes and installing the NVIDIA GPU Operator, through selecting a serving framework, to deploying a complete Helm chart for vLLM with custom-metric autoscaling. The target audience is DevOps and platform engineers with intermediate Kubernetes experience who want a reproducible, opinionated deployment rather than scattered documentation fragments. For broader context on running models outside cloud APIs, the Running LLMs Locally hub covers the wider set of approaches.

Prerequisites: Preparing Your Cluster for GPU Workloads

Hardware and Cluster Requirements

A minimum viable GPU node for serving a 7B-parameter model (such as Mistral 7B or Llama 2 7B) requires an NVIDIA GPU with at least 24 GB of VRAM. 24 GB provides headroom for the KV-cache beyond the ~14 GB weight footprint; a 16 GB GPU is feasible for small batch sizes but will constrain throughput. The NVIDIA A10G and L4 are common choices in cloud environments, while the A100 (40 GB or 80 GB) provides headroom for larger models or higher throughput via increased KV-cache capacity. Each inference node should have sufficient system RAM (at least 32 GB) and fast local or network-attached storage if model weights will be cached locally rather than streamed from object storage.
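The sizing arithmetic above can be sketched in a few lines of Python. This is a rough estimate under stated assumptions: the parameter count is rounded to 7 billion, gigabytes are decimal (10^9 bytes), and the function names are hypothetical. It ignores activation memory and CUDA context overhead, so treat the result as an upper bound on usable headroom.

```python
# Rough VRAM budget for model weights; all figures are estimates, not measurements.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_footprint_gb(num_params: float, dtype: str = "float16") -> float:
    """Approximate size of the model weights in decimal GB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def headroom_gb(vram_gb: float, num_params: float, dtype: str = "float16") -> float:
    """VRAM left over for KV-cache, activations, and runtime overhead."""
    return vram_gb - weight_footprint_gb(num_params, dtype)


# Compare the two GPU sizes discussed above for a 7B model in float16.
for vram in (16, 24):
    print(f"{vram} GB GPU: {headroom_gb(vram, 7e9):.1f} GB left after weights")
```

The gap between the two results is why a 16 GB card is workable only for small batch sizes.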

This guide targets Kubernetes 1.27 or later. Individual components have lower minimums: the autoscaling/v2 HPA API requires Kubernetes 1.23+, and GPU Operator 23.x requires Kubernetes 1.24+. Managed Kubernetes services (EKS, GKE, AKS) simplify GPU node provisioning through dedicated node pools with pre-configured machine types. Bare-metal clusters require manual NVIDIA driver management unless the GPU Operator handles it, which is the approach outlined below.

The full walkthrough requires the following:

Prometheus deployed and configured to scrape pod annotations in the inference namespace (e.g., via kube-prometheus-stack).
Prometheus Adapter installed and configured (covered in the autoscaling section below).
A Hugging Face account with Mistral-7B-Instruct-v0.2 model terms accepted and an API token generated. Gated models require authentication; without a token, pod startup will fail.

Installing the NVIDIA GPU Operator

The NVIDIA GPU Operator automates the full stack needed to run GPU containers on Kubernetes: host NVIDIA drivers, the NVIDIA Container Toolkit, the Kubernetes device plugin that advertises nvidia.com/gpu resources, and optional GPU monitoring via DCGM Exporter. Rather than baking drivers into node images and managing version drift, the Operator deploys everything as DaemonSets.

# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install the GPU Operator into its own namespace
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set dcgmExporter.enabled=true

Setting driver.enabled=true tells the Operator to install and manage NVIDIA drivers on the host. On managed cloud node pools where drivers are pre-installed, set this to false to avoid conflicts. The dcgmExporter.enabled=true flag deploys NVIDIA DCGM Exporter, which exposes GPU utilization, temperature, and memory metrics to Prometheus.

Verifying GPU Availability

After the Operator pods reach a Running state (which typically takes 3 to 8 minutes on first install as drivers compile; see the NVIDIA GPU Operator documentation for version-specific timing), verify that GPU resources are visible to the scheduler. You can poll readiness with:

kubectl get pods -n gpu-operator -w

Wait until all pods show Running or Completed, then verify GPU resources:

# Check that the node advertises GPU resources
kubectl describe node <gpu-node-name> | grep -A 5 "Allocatable"
# Expected output should include:
#   nvidia.com/gpu:  1

# Run a quick test pod to confirm nvidia-smi works inside a container
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: gpu-operator
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Check the output
kubectl logs -n gpu-operator gpu-test

# Clean up the test pod after verification
kubectl delete pod -n gpu-operator gpu-test

The nvidia-smi output should display the GPU model, driver version, and available memory. If nvidia.com/gpu does not appear in allocatable resources, the device plugin DaemonSet likely has not started correctly. Check GPU Operator pod logs in the gpu-operator namespace.
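On clusters with many nodes, the same allocatable check can be scripted. The sketch below assumes the standard NodeList JSON returned by kubectl get nodes -o json; the function names are hypothetical, and fetch_nodes needs kubectl with cluster access.

```python
import json
import subprocess


def gpu_allocatable(node_list: dict) -> dict:
    """Map node name -> allocatable nvidia.com/gpu count (0 if not advertised)."""
    result = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        allocatable = node.get("status", {}).get("allocatable", {})
        # Kubernetes reports quantities as strings, e.g. "1".
        result[name] = int(allocatable.get("nvidia.com/gpu", "0"))
    return result


def fetch_nodes() -> dict:
    """Run kubectl and return the parsed NodeList (requires cluster access)."""
    completed = subprocess.run(
        ["kubectl", "get", "nodes", "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(completed.stdout)
```

Calling gpu_allocatable(fetch_nodes()) against a live cluster returns one entry per node. A GPU node that reports 0 points back at the device plugin DaemonSet.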

Choosing a Serving Framework: KServe vs. Ray Serve vs. Simple Deployment

Option 1: Plain Kubernetes Deployment

A standard Kubernetes Deployment wrapping a vLLM or similar container is the simplest approach. It fits single-model, low-complexity setups where teams want full control over the pod spec and no additional abstractions. You get no extra CRDs, no framework-specific operational overhead, and straightforward debugging. On the other hand, the orchestration layer provides no dynamic batching management (though vLLM handles continuous batching internally), no multi-model routing, and no traffic splitting unless you add an ingress layer manually.

Option 2: Ray Serve on Kubernetes (KubeRay)

Ray Serve, deployed via the KubeRay operator, suits teams running multi-model pipelines or who are deeply invested in the Python ecosystem. A typical use case: a pipeline chaining an embedding model with a reranker and a generator, all managed as a single deployment graph. Ray Serve provides autoscaling at the actor level, dynamic batching, and model composition within that graph. The cost is operational complexity: a Ray head node, worker nodes, and the KubeRay CRDs add moving parts. Resource management across Ray actors and Kubernetes pods creates a two-layer scheduling problem that can be difficult to debug.

Option 3: KServe (ModelMesh or Serverless)

KServe targets teams serving many models at enterprise scale, and it standardizes on the V2 inference protocol. ModelMesh multiplexes many models onto shared GPU pods, which is efficient when serving dozens of smaller models. KServe's serverless (scale-to-zero) mode requires Knative Serving and an ingress controller (Istio or Kourier). Its RawDeployment mode requires neither Knative nor a service mesh, making it significantly lighter to operate.

Decision Matrix

Criteria	Plain Deployment	Ray Serve (KubeRay)	KServe
Setup complexity	Low	Medium-High	High
Multi-model support	None (manual)	Native (deployment graph)	Native (ModelMesh)
Autoscaling granularity	HPA on custom metrics	Per-actor autoscaling	KPA / HPA with Knative
Community maturity	Mature (core K8s primitives)	Growing	Established
GPU utilization efficiency	One model per GPU	Flexible actor placement	Model multiplexing

Start with a plain Deployment running vLLM. vLLM's internal continuous batching and PagedAttention memory management handle the serving-layer optimizations, while Kubernetes handles orchestration. Teams can graduate to KServe or Ray Serve as multi-model, canary, or pipeline requirements emerge. For a deeper comparison of serving engines including Ollama, see the Ollama vs vLLM article, which contextualizes why vLLM's throughput characteristics make it a strong choice for production deployments.

Resource Management: GPU Requests, Limits, and Bin-Packing

Setting GPU Requests and Limits

GPU resources in Kubernetes behave differently from CPU and memory. The nvidia.com/gpu resource is integer-only and non-overcommittable: a request of 1 reserves one entire GPU, and the standard device plugin does not support fractional requests. Because of this, requests and limits for nvidia.com/gpu must be identical, whereas CPU and memory may differ between request and limit. Time-slicing (configured through the GPU Operator) lets multiple pods share a GPU, but without memory isolation.

resources:
  requests:
    nvidia.com/gpu: 1
    cpu: "4"
    memory: "32Gi"
  limits:
    nvidia.com/gpu: 1
    cpu: "8"
    memory: "48Gi"

For a 7B-parameter model running in float16, model weights alone consume roughly 14 GB of VRAM. The remaining VRAM on a 24 GB GPU serves the KV-cache. Setting CPU requests to 4 cores and memory to 32 GB accounts for tokenization overhead, model loading, and the serving framework's host-side memory.
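To turn the remaining VRAM into a rough token budget, the KV-cache cost per token can be estimated from the model architecture. The sketch below is illustrative only and rests on several assumptions that are not stated in this guide: Mistral 7B's published architecture (32 layers, 8 key/value heads, head dimension 128), a float16 KV-cache, vLLM's default --gpu-memory-utilization of 0.9, decimal gigabytes, and no allowance for activations or fragmentation. Function names are hypothetical.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV-cache per token: one key and one value vector per layer."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def kv_token_capacity(vram_gb: float, weights_gb: float, per_token_bytes: int,
                      gpu_memory_utilization: float = 0.9) -> int:
    """Approximate number of tokens that fit in the KV-cache budget."""
    budget_bytes = (vram_gb * gpu_memory_utilization - weights_gb) * 1e9
    return max(0, int(budget_bytes // per_token_bytes))


# Assumed Mistral 7B shape; adjust for the model you actually serve.
per_token = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128)
for vram in (16, 24):
    tokens = kv_token_capacity(vram, weights_gb=14, per_token_bytes=per_token)
    print(f"{vram} GB GPU: roughly {tokens} cached tokens across all requests")
```

The budget is shared by every in-flight request, so longer contexts directly reduce how many requests can run concurrently.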

Dealing with Bin-Packing and Fragmentation

Because GPU allocation is all-or-nothing at the device level, a pod using 14 GB on a 24 GB GPU leaves 10 GB stranded. Kubernetes cannot schedule another pod onto that GPU. Two strategies address this. NVIDIA MIG on supported hardware (A100, H100, A30) partitions a physical GPU into isolated instances with dedicated memory and compute slices. Note that the A10G and L4 recommended in this guide do not support MIG; use MPS on those GPUs instead. NVIDIA Multi-Process Service (MPS) allows multiple processes to share a GPU, though without the memory isolation guarantees of MIG.

At the Kubernetes level, dedicating GPU nodes to inference workloads via taints prevents non-GPU pods from occupying these expensive nodes. Apply the taint to each GPU node:

# Idempotent: --overwrite prevents failure if taint already exists.
# Run this for each GPU node. List nodes with: kubectl get nodes
kubectl taint nodes <gpu-node-name> workload=inference:NoSchedule --overwrite

Then in the pod spec, add a matching toleration and node affinity:

spec:
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "inference"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: In
            values:
            - "true"
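The matching rule between the taint and the toleration above can be expressed as a small function. This is a simplified sketch of the documented Kubernetes matching semantics (key, operator, value, effect) with a hypothetical function name. It is not the scheduler's actual code.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if the toleration matches the taint."""
    key = toleration.get("key", "")
    operator = toleration.get("operator", "Equal")  # Equal is the default
    effect = toleration.get("effect", "")

    # An empty effect on the toleration matches every effect.
    if effect and effect != taint.get("effect"):
        return False
    # An empty key is only valid with Exists and matches every taint key.
    if not key:
        return operator == "Exists"
    if key != taint.get("key"):
        return False
    if operator == "Exists":
        return True
    return toleration.get("value", "") == taint.get("value", "")


taint = {"key": "workload", "value": "inference", "effect": "NoSchedule"}
toleration = {"key": "workload", "operator": "Equal",
              "value": "inference", "effect": "NoSchedule"}
print(tolerates(toleration, taint))
```

A toleration only permits scheduling onto the tainted node. The node affinity rule is what steers the pod toward GPU nodes.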

Autoscaling Inference: HPA on Custom Metrics

Why Standard CPU/Memory HPA Fails for LLMs

LLM inference is GPU-bound and queue-bound. The CPU on an inference node may idle at 10% while the GPU is saturated and dozens of requests wait in the serving queue. A standard HPA targeting CPU utilization will never trigger scale-up under these conditions, so queued requests wait too long.

Exposing Custom Metrics (Queue Depth)

vLLM exposes a Prometheus-compatible /metrics endpoint with several metrics critical for autoscaling decisions. Before configuring the Prometheus Adapter, verify the exact metric names exposed by vLLM in your version:

kubectl exec -n inference <vllm-pod-name> -- curl -s http://localhost:8000/metrics | grep -i "waiting\|cache"

Confirm the metric names match those used in the adapter configuration below. Metric names vary between vLLM versions. Many releases expose names with a colon (e.g., vllm:num_requests_waiting), even though Prometheus convention reserves colons for recording rules; other setups may surface underscore names (e.g., vllm_num_requests_waiting). Do not assume either form: use the exact name returned by the /metrics endpoint.
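If you prefer to check names programmatically rather than by eye, the Prometheus text exposition format is simple to scan. The sketch below uses a hypothetical function name and a hand-written sample, not captured vLLM output; feed it the real text returned by the /metrics endpoint.

```python
def metric_names(exposition_text: str, needle: str) -> list:
    """Return sorted unique metric names containing `needle` (case-insensitive)."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blanks and the "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if needle.lower() in name.lower():
            names.add(name)
    return sorted(names)


# Hand-written illustrative input; real names depend on your vLLM version.
sample = """\
# TYPE vllm_num_requests_waiting gauge
vllm_num_requests_waiting{model_name="example"} 3.0
vllm_num_requests_running{model_name="example"} 1.0
"""
print(metric_names(sample, "waiting"))
```

Whatever name this returns for your deployment is the one to use in the adapter's seriesQuery and metricsQuery.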

These metrics need to be surfaced to the Kubernetes HPA controller via the Prometheus Adapter. KEDA ScaledObject configuration for vLLM is outside the scope of this guide; see the KEDA documentation for a Prometheus scaler example.

First, install the Prometheus Adapter if it is not already present in your cluster:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.url=http://<prometheus-service>.<prometheus-namespace>.svc \
  --set prometheus.port=9090 \
  -f prometheus-adapter-config.yaml

Replace <prometheus-service> and <prometheus-namespace> with the actual Prometheus service name and namespace in your cluster (e.g., http://prometheus-kube-prometheus-prometheus.monitoring.svc).

Verify the adapter is running:

kubectl get apiservice v1beta1.custom.metrics.k8s.io
# AVAILABLE column should show True

The Prometheus Adapter configuration translates Prometheus queries into Kubernetes custom metrics API responses. Create prometheus-adapter-config.yaml with the following content. Important: Run the verification command above first to confirm the exact metric name. The configuration below uses vllm_num_requests_waiting (underscores), which is the raw metric name typically exposed by vLLM. If your version uses a different name, adjust accordingly:

# prometheus-adapter-config.yaml
# IMPORTANT: Verify the exact metric name before deploying:
# kubectl exec -n inference <vllm-pod> -- curl -s http://localhost:8000/metrics | grep -E "waiting|queue"
# Replace vllm_num_requests_waiting below with the exact name returned.
rules:
- seriesQuery: 'vllm_num_requests_waiting{namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  name:
    matches: "^vllm_num_requests_waiting$"
    as: "vllm_queue_depth"
  metricsQuery: 'sum(vllm_num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)'

This configuration queries vllm_num_requests_waiting, maps it to Kubernetes namespace and pod labels, and exposes it as a custom metric named vllm_queue_depth that the HPA can target.
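Once the adapter is serving the metric, you can read it back through the custom metrics API with kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/inference/pods/*/vllm_queue_depth". The sketch below parses that response. It assumes the v1beta1 MetricValueList shape (items with describedObject and value) and handles only plain and milli-unit ("m") quantities; function names are hypothetical.

```python
from decimal import Decimal


def parse_quantity(quantity: str) -> Decimal:
    """Parse a plain ("7") or milli-unit ("2500m") Kubernetes quantity."""
    if quantity.endswith("m"):
        return Decimal(quantity[:-1]) / 1000
    return Decimal(quantity)


def queue_depth_by_pod(metric_value_list: dict) -> dict:
    """Map pod name -> current value of the custom metric."""
    return {
        item["describedObject"]["name"]: parse_quantity(item["value"])
        for item in metric_value_list.get("items", [])
    }
```

An empty items list usually means Prometheus is not scraping the pods or the seriesQuery does not match the metric name.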

Configuring the HPA

Cost note: minReplicas: 1 keeps at least one GPU pod running at all times, which means continuous GPU node cost even during idle periods. On cloud providers, consider using cluster autoscaler node scale-down in combination with this setting, or set minReplicas: 0 if your setup supports scale-to-zero (requires KEDA or Knative).

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
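    name: vllm-server  # Must match your Deployment's name; the Helm chart later in this guide creates <release-name>-vllm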
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_queue_depth
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120

The averageValue of 5 means the HPA targets no more than 5 waiting requests per pod on average; when queue depth exceeds this, it requests new replicas. The scaleDown.stabilizationWindowSeconds of 300 seconds is critical because replacing a pod is expensive: vLLM model loading can take 30 to 120 seconds depending on model size and storage speed, and for models >13B parameters or on slow NFS/S3-backed PVCs it can exceed 300 seconds (tune initialDelaySeconds and the stabilization windows accordingly). Scaling down too aggressively destroys pods that must then be recreated and reloaded when traffic returns. A 300-second window prevents this thrashing at the cost of delaying capacity reduction after traffic drops. If traffic arrives in bursts with 10-minute peaks, set the stabilization window to match the peak duration so the HPA does not scale down mid-burst. Keeping at least one warm replica (minReplicas: 1) avoids cold-start latency on the first request.
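To see how the target value translates into replica counts, the core HPA calculation can be reproduced in Python. This sketch follows the documented formula, desired = ceil(current replicas × current metric / target), including the default 10% tolerance band. It deliberately ignores stabilization windows and scaling policies, which only limit how fast the result is applied. The function name is hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_avg: float, target_avg: float,
                     min_replicas: int = 1, max_replicas: int = 4,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA would aim for, before behavior policies apply."""
    ratio = current_avg / target_avg
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# One pod with 12 queued requests against a target of 5 per pod.
print(desired_replicas(current_replicas=1, current_avg=12, target_avg=5))
```

With the scale-up policy of one pod per period, the HPA reaches this count step by step instead of jumping to it at once.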

Full Walkthrough: Helm Chart for a vLLM Service

Chart Structure Overview

vllm-chart/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    └── hpa.yaml

# Chart.yaml
apiVersion: v2
name: vllm-server
description: Helm chart for deploying vLLM on Kubernetes with GPU support
type: application
version: 0.1.0
# appVersion must match image.tag in values.yaml; update both when upgrading vLLM.
appVersion: "v0.4.2"

`values.yaml`: Configurable Parameters

Important: Verify the latest vLLM image tag at https://github.com/vllm-project/vllm/releases before deploying. The tag below was current at time of writing but may not exist on the registry if the project has moved on.

# values.yaml
model:
  name: "mistralai/Mistral-7B-Instruct-v0.2"
  downloadFromHub: true
  # pvcName: "model-storage"  # Uncomment to use a PVC instead

replicaCount: 1

image:
  repository: vllm/vllm-openai
  tag: "v0.4.2"
  pullPolicy: IfNotPresent

# Name of a Kubernetes Secret containing the Hugging Face token.
# Create it with: kubectl create secret generic hf-token --from-literal=token=<YOUR_TOKEN> -n inference
huggingFaceSecret: "hf-token"
# Key within the Secret that holds the token value. Change if your Secret uses a different key name.
huggingFaceSecretKey: "token"

resources:
  requests:
    nvidia.com/gpu: 1
    cpu: "4"
    memory: "32Gi"
  limits:
    nvidia.com/gpu: 1
    cpu: "8"
    memory: "48Gi"

service:
  type: ClusterIP
  port: 8000

hpa:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  targetQueueDepth: 5        # Integer only. Must be a whole number for AverageValue quantity.
  scaleDownStabilizationSeconds: 300
  scaleUpStabilizationSeconds: 60

prometheus:
  scrape: true

Deployment Template

# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-vllm
  namespace: {{ .Release.Namespace }}
  labels:
    app: {{ .Release.Name }}-vllm
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-vllm
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-vllm
      annotations:
        {{- if .Values.prometheus.scrape }}
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
        {{- end }}
    spec:
      terminationGracePeriodSeconds: 120
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "inference"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
      containers:
      - name: vllm
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        args:
          - "--model"
          - "{{ .Values.model.name }}"
          - "--host"
          - "0.0.0.0"
          - "--port"
          - "8000"
        ports:
        - containerPort: 8000
          name: http
        env:
        {{- if .Values.huggingFaceSecret }}
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: {{ .Values.huggingFaceSecret }}
              key: {{ .Values.huggingFaceSecretKey }}
        {{- end }}
        resources:
          {{- toYaml .Values.resources | nindent 10 }}
        lifecycle:
          preStop:
            exec:
              # Allow endpoint propagation before SIGTERM is sent.
              command: ["sleep", "15"]
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          # Must exceed readiness delay to prevent restart loop during loading.
          initialDelaySeconds: 180
          periodSeconds: 15
          failureThreshold: 6
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          # Matches worst-case model load time stated in text; increase for models >13B or slow PVCs.
          initialDelaySeconds: 120
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 10

The initialDelaySeconds on the liveness probe is set to 180 seconds, deliberately higher than the readiness probe (120 seconds), to accommodate model loading time. The failureThreshold: 6 on the liveness probe provides 90 seconds of tolerance (6 × 15s) after the initial delay before Kubernetes kills the pod, preventing restart loops under heavy GPU inference load. If the model is larger or storage is slow, both values may need to increase further to prevent Kubernetes from killing the pod during startup.
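The probe arithmetic above generalizes to a quick check against a measured model load time. The sketch below is an approximation: it uses the same initial delay plus failureThreshold × periodSeconds budget as the text and ignores probe timeouts and kubelet scheduling jitter. Function names are hypothetical.

```python
def liveness_budget_s(initial_delay_s: int, period_s: int,
                      failure_threshold: int) -> int:
    """Approximate seconds after container start before a restart is triggered."""
    return initial_delay_s + period_s * failure_threshold


def survives_startup(model_load_s: int, initial_delay_s: int = 180,
                     period_s: int = 15, failure_threshold: int = 6) -> bool:
    """True if the model finishes loading inside the liveness budget."""
    return model_load_s < liveness_budget_s(initial_delay_s, period_s,
                                            failure_threshold)


# Chart defaults: 180 s initial delay plus 6 probes at 15 s intervals.
print(liveness_budget_s(180, 15, 6))
print(survives_startup(model_load_s=300))
```

A load time at or beyond the budget means the pod will be restarted mid-load, so raise initialDelaySeconds or failureThreshold before deploying larger models.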

Service Template

# templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-vllm
  namespace: {{ .Release.Namespace }}
spec:
  type: {{ .Values.service.type }}
  selector:
    app: {{ .Release.Name }}-vllm
  ports:
  - port: {{ .Values.service.port }}
    targetPort: 8000
    protocol: TCP
    name: http

HPA Template

# templates/hpa.yaml
{{- if .Values.hpa.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ .Release.Name }}-vllm-hpa
  namespace: {{ .Release.Namespace }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .Release.Name }}-vllm
  minReplicas: {{ .Values.hpa.minReplicas }}
  maxReplicas: {{ .Values.hpa.maxReplicas }}
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_queue_depth
      target:
        type: AverageValue
        averageValue: "{{ .Values.hpa.targetQueueDepth | int }}"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: {{ .Values.hpa.scaleUpStabilizationSeconds }}
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: {{ .Values.hpa.scaleDownStabilizationSeconds }}
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
{{- end }}

Deploying and Verifying

Before deploying, create the Hugging Face token secret in the target namespace. This is required for downloading gated models such as Mistral-7B-Instruct-v0.2:

# Create the inference namespace and HF token secret
kubectl create namespace inference
kubectl create secret generic hf-token \
  --from-literal=token=<YOUR_HUGGING_FACE_TOKEN> \
  -n inference

Security note: Do not pass the token as a plain environment variable or commit it to version control. The Secret-based approach above keeps the token out of your Helm values and pod specs.

Now install the chart:

# Install the chart (override image tag with --set image.tag=<version> if needed)
helm install vllm-inference ./vllm-chart \
  --namespace inference

# Watch pods come up
kubectl get pods -n inference -w

# Check logs for model loading progress (look for auth errors or download progress)
kubectl logs -n inference -l app=vllm-inference-vllm --tail=50

# Verify no authentication errors (case-insensitive; "token" is omitted because it also matches harmless tokenizer lines)
kubectl logs -n inference -l app=vllm-inference-vllm | grep -iE "error|401|gated"
# Expected: no matches

# Once the readiness probe passes, test the endpoint
kubectl port-forward -n inference svc/vllm-inference-vllm 8000:8000 &
PF_PID=$!
sleep 3

# This uses the legacy completions API. For chat-style inference, use /v1/chat/completions with a messages array.
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "Explain Kubernetes in one sentence:",
    "max_tokens": 64
  }'

# Clean up the port-forward when done
kill "$PF_PID"

A successful response returns a JSON object with the generated completion, confirming the full pipeline works end to end: GPU Operator, device plugin, vLLM container, and Kubernetes networking.
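For repeatable smoke tests, the same request can be issued from Python using only the standard library. This is a minimal sketch with hypothetical function names. It assumes the port-forward from the previous step is active on localhost:8000 and that the response follows the OpenAI-style completions shape (choices[0].text).

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build the POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: dict) -> str:
    """Pull the generated text out of a completions response."""
    return response_body["choices"][0]["text"]


def complete(base_url: str, model: str, prompt: str,
             max_tokens: int = 64, timeout_s: float = 60.0) -> str:
    """Send the request and return the generated text (needs a live server)."""
    request = build_completion_request(base_url, model, prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return extract_text(json.load(response))
```

With the port-forward running, complete("http://localhost:8000", "mistralai/Mistral-7B-Instruct-v0.2", "Explain Kubernetes in one sentence:") returns the generated text, or raises an exception if the endpoint is unreachable or returns an HTTP error.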

Implementation Checklist

GPU nodes provisioned and labeled.
NVIDIA GPU Operator installed and verified.
Inference nodes tainted for dedicated workloads.
Hugging Face token Secret created in the inference namespace.
Model artifacts accessible (PVC, S3, or Hugging Face Hub with valid token).
vLLM Helm chart values reviewed for resource sizing.
Prometheus deployed and scraping confirmed for vLLM metrics in the inference namespace.
Prometheus Adapter installed and custom metrics API available.
HPA deployed and tested under synthetic load.
Liveness/readiness probes validated (ensure initialDelaySeconds exceeds model load time).
Scale-down stabilization window tuned for model load time.

Where to Go Next

To mature this platform, add model versioning and canary rollouts with KServe, implement A/B traffic splitting for model evaluation, and integrate GPU-aware cost monitoring tools to track inference spend per model and per team.

For a broader view of local LLM tooling options, the Running LLMs Locally guide covers alternative approaches. Teams evaluating serving engines should also review the Ollama vs vLLM comparison to understand where each tool fits in the deployment spectrum.
