How to Deploy Local LLMs to Kubernetes
- Provision GPU nodes with at least 24 GB VRAM and install the NVIDIA GPU Operator via Helm.
- Verify GPU availability by checking nvidia.com/gpu in node allocatable resources.
- Select a serving framework; start with a plain vLLM Deployment for single-model setups.
- Configure GPU resource requests/limits and taint nodes for dedicated inference workloads.
- Deploy vLLM using the provided Helm chart with your Hugging Face token secret.
- Expose custom Prometheus metrics (queue depth) via the Prometheus Adapter.
- Enable HPA autoscaling targeting vllm_queue_depth with tuned stabilization windows.
- Validate the full pipeline by sending a completion request and confirming a JSON response.
Running local LLMs on Kubernetes gives DevOps teams a self-hosted inference path that is health-checked, autoscaled, and rolling-updatable, without depending on costly cloud API endpoints. This guide walks through the full pipeline: from preparing GPU nodes and installing the NVIDIA GPU Operator, through selecting a serving framework, to deploying a complete Helm chart for vLLM with custom-metric autoscaling.
Table of Contents
- Prerequisites: Preparing Your Cluster for GPU Workloads
- Choosing a Serving Framework: KServe vs. Ray Serve vs. Simple Deployment
- Resource Management: GPU Requests, Limits, and Bin-Packing
- Autoscaling Inference: HPA on Custom Metrics
- Full Walkthrough: Helm Chart for a vLLM Service
- Implementation Checklist
- Where to Go Next
Deploying these workloads on Kubernetes brings together data privacy guarantees, infrastructure-bound costs that scale with hardware utilization rather than per-token API pricing, and lower latency by keeping model serving inside the cluster boundary (eliminating the network round-trip to external APIs, typically 50 to 200 ms).
Kubernetes fits inference workload management well because its core orchestration primitives, including scheduling, health checks, rolling updates, and horizontal scaling, map directly onto the operational requirements of serving large language models. GPU-aware scheduling ensures pods land on nodes with available accelerators, while liveness and readiness probes guard against serving stale or crashed model instances. Rolling updates then enable zero-downtime model version swaps.
The target audience is DevOps and platform engineers with intermediate Kubernetes experience who want a reproducible, opinionated deployment rather than scattered documentation fragments. For broader context on running models outside cloud APIs, the Running LLMs Locally hub covers the wider set of approaches.
Prerequisites: Preparing Your Cluster for GPU Workloads
Hardware and Cluster Requirements
A minimum viable GPU node for serving a 7B-parameter model (such as Mistral 7B or Llama 2 7B) requires an NVIDIA GPU with at least 24 GB of VRAM. 24 GB provides headroom for the KV-cache beyond the ~14 GB weight footprint; a 16 GB GPU is feasible for small batch sizes but will constrain throughput. The NVIDIA A10G and L4 are common choices in cloud environments, while the A100 (40 GB or 80 GB) provides headroom for larger models or higher throughput via increased KV-cache capacity. Each inference node should have sufficient system RAM (at least 32 GB) and fast local or network-attached storage if model weights will be cached locally rather than streamed from object storage.
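The sizing arithmetic behind these numbers can be sketched in a few lines of Python. This is a back-of-the-envelope estimate rather than a measurement: it assumes a round 7 billion parameters, decimal gigabytes, and ignores activation memory and framework overhead. The function names are hypothetical.

```python
def weight_footprint_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate VRAM needed for model weights alone, in decimal GB."""
    return num_params * bytes_per_param / 1e9


def headroom_gb(vram_gb: float, num_params: float, bytes_per_param: int = 2) -> float:
    """VRAM left for the KV-cache and overhead once weights are loaded."""
    return vram_gb - weight_footprint_gb(num_params, bytes_per_param)


# float16 stores each parameter in 2 bytes.
for vram in (16, 24, 40, 80):
    left = headroom_gb(vram, 7e9)
    print(f"{vram} GB GPU: about {left:.0f} GB left after float16 weights")
```

The 2 GB left on a 16 GB card is why small batch sizes are the practical limit there, while 24 GB leaves roughly 10 GB for the KV-cache.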
This guide assumes Kubernetes 1.27 or later. Individual components have lower floors: the autoscaling/v2 HPA API requires Kubernetes 1.23+, and GPU Operator 23.x requires Kubernetes 1.24+. Managed Kubernetes services (EKS, GKE, AKS) simplify GPU node provisioning through dedicated node pools with pre-configured machine types. Bare-metal clusters require manual NVIDIA driver management unless the GPU Operator handles it, which is the approach outlined below.
The full walkthrough requires the following:
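- Prometheus deployed and configured to scrape pod annotations in the inference namespace (e.g., via kube-prometheus-stack; note that the stack does not honor prometheus.io/* annotations by default, so it needs an additional scrape job or a PodMonitor).
- Prometheus Adapter installed and configured (covered in the autoscaling section below).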
- A Hugging Face account with Mistral-7B-Instruct-v0.2 model terms accepted and an API token generated. Gated models require authentication; without a token, pod startup will fail.
Installing the NVIDIA GPU Operator
The NVIDIA GPU Operator automates the full stack needed to run GPU containers on Kubernetes: host NVIDIA drivers, the NVIDIA Container Toolkit, the Kubernetes device plugin that advertises nvidia.com/gpu resources, and optional GPU monitoring via DCGM Exporter. Rather than baking drivers into node images and managing version drift, the Operator deploys everything as DaemonSets.
# Add the NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install the GPU Operator into its own namespace
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator \
--create-namespace \
--set driver.enabled=true \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set dcgmExporter.enabled=true
Setting driver.enabled=true tells the Operator to install and manage NVIDIA drivers on the host. On managed cloud node pools where drivers are pre-installed, set this to false to avoid conflicts. The dcgmExporter.enabled=true flag deploys NVIDIA DCGM Exporter, which exposes GPU utilization, temperature, and memory metrics to Prometheus.
Verifying GPU Availability
After the Operator pods reach a Running state (which typically takes 3 to 8 minutes on first install as drivers compile; see the NVIDIA GPU Operator documentation for version-specific timing), verify that GPU resources are visible to the scheduler. You can poll readiness with:
kubectl get pods -n gpu-operator -w
Wait until all pods show Running or Completed, then verify GPU resources:
# Check that the node advertises GPU resources
kubectl describe node <gpu-node-name> | grep -A 10 "Allocatable"
# Expected output should include:
# nvidia.com/gpu: 1
# Run a quick test pod to confirm nvidia-smi works inside a container
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
  namespace: gpu-operator
spec:
  restartPolicy: Never
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.3.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
# Check the output
kubectl logs -n gpu-operator gpu-test
# Clean up the test pod after verification
kubectl delete pod -n gpu-operator gpu-test
The nvidia-smi output should display the GPU model, driver version, and available memory. If nvidia.com/gpu does not appear in allocatable resources, the device plugin DaemonSet likely has not started correctly. Check GPU Operator pod logs in the gpu-operator namespace.
Choosing a Serving Framework: KServe vs. Ray Serve vs. Simple Deployment
Option 1: Plain Kubernetes Deployment
A standard Kubernetes Deployment wrapping a vLLM or similar container is the simplest approach. It fits single-model, low-complexity setups where teams want full control over the pod spec and no additional abstractions. You get no extra CRDs, no framework-specific operational overhead, and straightforward debugging. On the other hand, the orchestration layer gives you no dynamic batching management (though vLLM handles continuous batching internally), and multi-model routing or traffic splitting requires adding an ingress layer manually.
Option 2: Ray Serve on Kubernetes (KubeRay)
Ray Serve, deployed via the KubeRay operator, suits teams running multi-model pipelines or who are deeply invested in the Python ecosystem. A typical use case: a pipeline chaining an embedding model with a reranker and a generator, all managed as a single deployment graph. Ray Serve provides autoscaling at the actor level, dynamic batching, and model composition within that graph. The cost is operational complexity: a Ray head node, worker nodes, and the KubeRay CRDs add moving parts. Resource management across Ray actors and Kubernetes pods creates a two-layer scheduling problem that can be difficult to debug.
Option 3: KServe (ModelMesh or Serverless)
KServe targets teams serving many models at enterprise scale and offers a standardized V2 inference protocol. ModelMesh multiplexes many models onto shared GPU pods, which is efficient when serving dozens of smaller models. KServe's serverless (scale-to-zero) mode requires Knative Serving and an ingress controller (Istio or Kourier). KServe's RawDeployment mode requires neither Knative nor a service mesh, making it significantly lighter to operate.
Decision Matrix
| Criteria | Plain Deployment | Ray Serve (KubeRay) | KServe |
|---|---|---|---|
| Setup complexity | Low | Medium-High | High |
| Multi-model support | None (manual) | Native (deployment graph) | Native (ModelMesh) |
| Autoscaling granularity | HPA on custom metrics | Per-actor autoscaling | KPA / HPA with Knative |
| Community maturity | Mature (core K8s primitives) | Growing | Established |
| GPU utilization efficiency | One model per GPU | Flexible actor placement | Model multiplexing |
Start with a plain Deployment running vLLM. vLLM's internal continuous batching and PagedAttention memory management handle the serving-layer optimizations, while Kubernetes handles orchestration. Teams can graduate to KServe or Ray Serve as multi-model, canary, or pipeline requirements emerge. For a deeper comparison of serving engines including Ollama, see the Ollama vs vLLM article, which contextualizes why vLLM's throughput characteristics make it a strong choice for production deployments.
Resource Management: GPU Requests, Limits, and Bin-Packing
Setting GPU Requests and Limits
GPU resources in Kubernetes behave differently from CPU and memory. The nvidia.com/gpu resource is integer-only and non-overcommittable: a request of 1 reserves one entire GPU, and the standard device plugin does not support fractional requests. For the same reason, requests and limits for nvidia.com/gpu must be identical, whereas CPU and memory may differ between request and limit. Time-slicing (via GPU Operator config) enables overcommit, but without memory isolation.
resources:
  requests:
    nvidia.com/gpu: 1
    cpu: "4"
    memory: "32Gi"
  limits:
    nvidia.com/gpu: 1
    cpu: "8"
    memory: "48Gi"
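The rules for the GPU fields can be checked mechanically before a manifest reaches the cluster. The sketch below is a hypothetical pre-flight helper, not part of any Kubernetes tooling. It encodes three rules for extended resources: whole numbers only, request equal to limit when both are set, and no request without a limit.

```python
def check_gpu_resources(resources: dict, key: str = "nvidia.com/gpu") -> list[str]:
    """Return a list of problems; an empty list means the GPU fields look valid."""
    requests = resources.get("requests", {})
    limits = resources.get("limits", {})
    problems = []
    # Extended resources cannot be requested without a matching limit.
    if key in requests and key not in limits:
        problems.append(f"{key} is requested but has no limit")
    # When both are present they must be equal (no overcommit).
    if key in requests and key in limits and str(requests[key]) != str(limits[key]):
        problems.append(f"{key} request {requests[key]} != limit {limits[key]}")
    # Fractional GPUs are not accepted by the standard device plugin.
    for section, values in (("requests", requests), ("limits", limits)):
        if key in values and not str(values[key]).isdigit():
            problems.append(f"{key} in {section} must be a whole number, got {values[key]!r}")
    return problems


guide_resources = {
    "requests": {"nvidia.com/gpu": 1, "cpu": "4", "memory": "32Gi"},
    "limits": {"nvidia.com/gpu": 1, "cpu": "8", "memory": "48Gi"},
}
print(check_gpu_resources(guide_resources))
```

CPU and memory are deliberately left unchecked, since those may legitimately differ between request and limit.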
For a 7B-parameter model running in float16, model weights alone consume roughly 14 GB of VRAM. The remaining VRAM on a 24 GB GPU serves the KV-cache. Setting CPU requests to 4 cores and memory to 32 GB accounts for tokenization overhead, model loading, and the serving framework's host-side memory.
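To get a feel for what the remaining VRAM buys, the KV-cache cost per token can be estimated from the model architecture. The sketch below assumes the layer count, KV-head count, and head dimension commonly published for Mistral 7B (32, 8, and 128); treat them as assumptions and check the model's config.json. It gives an upper bound, because the serving framework reserves part of the GPU for other uses.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_value: int = 2) -> int:
    """Bytes of KV-cache one token occupies across all layers."""
    # Factor of 2: one key vector and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(free_vram_bytes: int, per_token_bytes: int) -> int:
    """Upper bound on tokens that fit in the given amount of free VRAM."""
    return free_vram_bytes // per_token_bytes


# Assumed Mistral 7B architecture values (verify against config.json).
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(f"{per_token} bytes per token")
print(f"{max_cached_tokens(10 * 10**9, per_token)} tokens fit in 10 GB")
```

That token budget is shared by every request in flight, so it bounds the product of concurrency and context length rather than either one alone.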
Dealing with Bin-Packing and Fragmentation
Because GPU allocation is all-or-nothing at the device level, a pod using 14 GB on a 24 GB GPU leaves 10 GB stranded. Kubernetes cannot schedule another pod onto that GPU. Two strategies address this. NVIDIA MIG on supported hardware (A100, H100, A30) partitions a physical GPU into isolated instances with dedicated memory and compute slices. Note that the A10G and L4 recommended in this guide do not support MIG; use MPS on those GPUs instead. NVIDIA Multi-Process Service (MPS) allows multiple processes to share a GPU, though without the memory isolation guarantees of MIG.
At the Kubernetes level, dedicating GPU nodes to inference workloads via taints prevents non-GPU pods from occupying these expensive nodes. Apply the taint to each GPU node:
# Idempotent: --overwrite prevents failure if taint already exists.
# Run this for each GPU node. List nodes with: kubectl get nodes
kubectl taint nodes <gpu-node-name> workload=inference:NoSchedule --overwrite
Then in the pod spec, add a matching toleration and node affinity:
spec:
  tolerations:
  - key: "workload"
    operator: "Equal"
    value: "inference"
    effect: "NoSchedule"
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.present
            operator: In
            values:
            - "true"
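How the scheduler pairs the taint with the toleration can be sketched as a small matching function. This is a simplified model of the documented matching rules, written for illustration; it covers taints only and does not model node affinity. The helper names are hypothetical.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """True if this toleration matches the given taint."""
    # An empty effect on the toleration matches every effect.
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    key = toleration.get("key")
    if toleration.get("operator", "Equal") == "Exists":
        # Exists ignores the value; an empty key matches every key.
        return not key or key == taint["key"]
    return key == taint["key"] and toleration.get("value", "") == taint.get("value", "")


def schedulable(tolerations: list, taints: list) -> bool:
    """True if every scheduling-blocking taint is tolerated."""
    blocking = [t for t in taints if t["effect"] in ("NoSchedule", "NoExecute")]
    return all(any(tolerates(tol, taint) for tol in tolerations) for taint in blocking)


node_taints = [{"key": "workload", "value": "inference", "effect": "NoSchedule"}]
pod_tolerations = [{"key": "workload", "operator": "Equal",
                    "value": "inference", "effect": "NoSchedule"}]

print(schedulable(pod_tolerations, node_taints))  # inference pod
print(schedulable([], node_taints))               # pod without the toleration
```

The toleration only permits scheduling onto the tainted nodes; it is the node affinity rule that actually steers the pod toward GPU nodes.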
Autoscaling Inference: HPA on Custom Metrics
Why Standard CPU/Memory HPA Fails for LLMs
LLM inference is GPU-bound and queue-bound. The CPU on an inference node may idle at 10% while the GPU is saturated and dozens of requests wait in the serving queue. A standard HPA targeting CPU utilization will never trigger scale-up under these conditions, so queued requests wait too long.
Exposing Custom Metrics (Queue Depth)
vLLM exposes a Prometheus-compatible /metrics endpoint with several metrics critical for autoscaling decisions. Before configuring the Prometheus Adapter, verify the exact metric names exposed by vLLM in your version:
kubectl exec -n inference <vllm-pod-name> -- curl -s http://localhost:8000/metrics | grep -i "waiting\|cache"
Confirm the metric names match those used in the adapter configuration below. Names vary between vLLM versions. Many releases expose names with a colon after the prefix (e.g., vllm:num_requests_waiting) directly from /metrics, even though Prometheus convention reserves colons for recording rules; other versions or relabeling setups may produce underscores (e.g., vllm_num_requests_waiting). Use the exact name returned by the /metrics endpoint.
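If the grep output is long, the name check can be scripted. The sketch below parses Prometheus text exposition output and lists metric names containing "waiting". The sample text is made up for illustration; the output of your own /metrics endpoint is the source of truth.

```python
import re

# Metric names may contain letters, digits, underscores, and colons.
_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def metric_names(exposition: str) -> set[str]:
    """Collect metric names from Prometheus text-format output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _NAME.match(line)
        if match:
            names.add(match.group(0))
    return names


def queue_metric_candidates(exposition: str) -> list[str]:
    """Names that look like a waiting-queue gauge."""
    return sorted(name for name in metric_names(exposition) if "waiting" in name)


# Hypothetical sample output, for illustration only.
SAMPLE = """\
# HELP vllm:num_requests_waiting Number of requests waiting to be processed.
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 3.0
vllm:num_requests_running{model_name="demo"} 2.0
"""
print(queue_metric_candidates(SAMPLE))
```

Whatever name this returns for your deployment is the one to use in the adapter rule.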
These metrics need to be surfaced to the Kubernetes HPA controller via the Prometheus Adapter. KEDA ScaledObject configuration for vLLM is outside the scope of this guide; see the KEDA documentation for a Prometheus scaler example.
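First, install the Prometheus Adapter if it is not already present in your cluster. The command passes prometheus-adapter-config.yaml with -f, so create that file (shown further below) before running it: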
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--namespace monitoring \
--create-namespace \
--set prometheus.url=http://<prometheus-service>.<prometheus-namespace>.svc \
--set prometheus.port=9090 \
-f prometheus-adapter-config.yaml
Replace <prometheus-service> and <prometheus-namespace> with the actual Prometheus service name and namespace in your cluster (e.g., http://prometheus-kube-prometheus-prometheus.monitoring.svc).
Verify the adapter is running:
kubectl get apiservice v1beta1.custom.metrics.k8s.io
# AVAILABLE column should show True
The Prometheus Adapter configuration translates Prometheus queries into Kubernetes custom metrics API responses. Create prometheus-adapter-config.yaml with the following content. Important: Run the verification command above first to confirm the exact metric name. The configuration below uses vllm_num_requests_waiting (underscores); if your vLLM version returns a different name, such as the colon form vllm:num_requests_waiting, substitute it in seriesQuery, name.matches, and metricsQuery:
# prometheus-adapter-config.yaml
# IMPORTANT: Verify the exact metric name before deploying:
# kubectl exec -n inference <vllm-pod> -- curl -s http://localhost:8000/metrics | grep -E "waiting|queue"
# Replace vllm_num_requests_waiting below with the exact name returned.
# The prometheus-community Helm chart expects custom rules under rules.custom.
rules:
  custom:
  - seriesQuery: 'vllm_num_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^vllm_num_requests_waiting$"
      as: "vllm_queue_depth"
    metricsQuery: 'sum(vllm_num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
This configuration queries vllm_num_requests_waiting, maps it to Kubernetes namespace and pod labels, and exposes it as a custom metric named vllm_queue_depth that the HPA can target.
Configuring the HPA
Cost note: minReplicas: 1 keeps at least one GPU pod running at all times, which means continuous GPU node cost even during idle periods. On cloud providers, consider using cluster autoscaler node scale-down in combination with this setting, or set minReplicas: 0 if your setup supports scale-to-zero (requires KEDA or Knative).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_queue_depth
      target:
        type: AverageValue
        averageValue: "5"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
The averageValue of 5 means the HPA targets no more than 5 waiting requests per pod on average; when queue depth exceeds this, it requests new replicas. The scaleDown.stabilizationWindowSeconds of 300 is critical because replacement capacity is slow to arrive: vLLM model loading can take 30 to 120 seconds depending on model size and storage speed, and for models larger than 13B parameters or on slow NFS/S3-backed PVCs it can exceed 300 seconds. Scaling down too aggressively means pods are destroyed and recreated repeatedly, each time paying that load cost. A 300-second window prevents this thrashing but delays capacity reduction after traffic drops. If traffic arrives in bursts with 10-minute peaks, set the window to match the peak duration so the HPA does not scale down mid-burst, and tune initialDelaySeconds alongside it for slow-loading models. Keeping at least one warm replica (minReplicas: 1) avoids cold-start latency on the first request.
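The replica count the HPA asks for follows the formula in the Kubernetes documentation: the current replica count multiplied by the ratio of observed to target metric, rounded up. A minimal sketch of that calculation, assuming the default 10% tolerance and ignoring the stabilization windows and per-period scaling policies, which further limit how fast the count actually changes:

```python
import math


def desired_replicas(current_replicas: int, avg_queue_depth: float, target: float,
                     min_replicas: int = 1, max_replicas: int = 4,
                     tolerance: float = 0.1) -> int:
    """Replica count the HPA would request for one evaluation cycle."""
    ratio = avg_queue_depth / target
    # Within the tolerance band the controller leaves the count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(1, 12, 5))   # queue well above target
print(desired_replicas(2, 5.2, 5))  # inside the tolerance band
print(desired_replicas(2, 30, 5))   # capped by maxReplicas
print(desired_replicas(3, 1, 5))    # traffic has dropped
```

With one replica and 12 waiting requests the formula asks for three pods, but a scale-up policy of one pod per period would add them one at a time.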
Full Walkthrough: Helm Chart for a vLLM Service
Chart Structure Overview
vllm-chart/
├── Chart.yaml
├── values.yaml
└── templates/
    ├── deployment.yaml
    ├── service.yaml
    └── hpa.yaml
# Chart.yaml
apiVersion: v2
name: vllm-server
description: Helm chart for deploying vLLM on Kubernetes with GPU support
type: application
version: 0.1.0
# appVersion must match image.tag in values.yaml; update both when upgrading vLLM.
appVersion: "v0.4.2"
values.yaml: Configurable Parameters
Important: Verify the latest vLLM image tag at https://github.com/vllm-project/vllm/releases before deploying. The tag below was current at time of writing but may not exist on the registry if the project has moved on.
# values.yaml
model:
  name: "mistralai/Mistral-7B-Instruct-v0.2"
  downloadFromHub: true
  # pvcName: "model-storage" # Uncomment to use a PVC instead
replicaCount: 1
image:
  repository: ghcr.io/vllm-project/vllm-openai
  tag: "v0.4.2"
  pullPolicy: IfNotPresent
# Name of a Kubernetes Secret containing the Hugging Face token.
# Create it with: kubectl create secret generic hf-token --from-literal=token=<YOUR_TOKEN> -n inference
huggingFaceSecret: "hf-token"
# Key within the Secret that holds the token value. Change if your Secret uses a different key name.
huggingFaceSecretKey: "token"
resources:
  requests:
    nvidia.com/gpu: 1
    cpu: "4"
    memory: "32Gi"
  limits:
    nvidia.com/gpu: 1
    cpu: "8"
    memory: "48Gi"
service:
  type: ClusterIP
  port: 8000
hpa:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  targetQueueDepth: 5 # Integer only. Must be a whole number for AverageValue quantity.
  scaleDownStabilizationSeconds: 300
  scaleUpStabilizationSeconds: 60
prometheus:
  scrape: true
Deployment Template
# templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-vllm
  namespace: {{ .Release.Namespace }}
  labels:
    app: {{ .Release.Name }}-vllm
spec:
  replicas: {{ .Values.replicaCount }}
  selector:
    matchLabels:
      app: {{ .Release.Name }}-vllm
  template:
    metadata:
      labels:
        app: {{ .Release.Name }}-vllm
      annotations:
        {{- if .Values.prometheus.scrape }}
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
        {{- end }}
    spec:
      terminationGracePeriodSeconds: 120
      tolerations:
      - key: "workload"
        operator: "Equal"
        value: "inference"
        effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: nvidia.com/gpu.present
                operator: In
                values:
                - "true"
      containers:
      - name: vllm
        image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
        imagePullPolicy: {{ .Values.image.pullPolicy }}
        args:
        - "--model"
        - "{{ .Values.model.name }}"
        - "--host"
        - "0.0.0.0"
        - "--port"
        - "8000"
        ports:
        - containerPort: 8000
          name: http
        env:
        {{- if .Values.huggingFaceSecret }}
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: {{ .Values.huggingFaceSecret }}
              key: {{ .Values.huggingFaceSecretKey }}
        {{- end }}
        resources:
          {{- toYaml .Values.resources | nindent 10 }}
        lifecycle:
          preStop:
            exec:
              # Allow endpoint propagation before SIGTERM is sent.
              command: ["sleep", "15"]
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          # Must exceed readiness delay to prevent restart loop during loading.
          initialDelaySeconds: 180
          periodSeconds: 15
          failureThreshold: 6
          timeoutSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          # Matches worst-case model load time stated in text; increase for models >13B or slow PVCs.
          initialDelaySeconds: 120
          periodSeconds: 10
          failureThreshold: 3
          timeoutSeconds: 10
The initialDelaySeconds on the liveness probe is set to 180 seconds, deliberately higher than the readiness probe (120 seconds), to accommodate model loading time. The failureThreshold: 6 on the liveness probe provides 90 seconds of tolerance (6 × 15s) after the initial delay before Kubernetes kills the pod, preventing restart loops under heavy GPU inference load. If the model is larger or storage is slow, both values may need to increase further to prevent Kubernetes from killing the pod during startup.
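The probe arithmetic can be turned into a quick check of whether a given model load time fits inside the liveness budget. The sketch below follows the same approximation as the paragraph above (initial delay plus failure threshold times period); exact kubelet timing differs by a few seconds, so treat the result as a rough bound. The function names are hypothetical.

```python
def liveness_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time after container start before a never-healthy pod is restarted."""
    return initial_delay + period * failure_threshold


def load_time_fits(load_seconds: int, initial_delay: int = 180,
                   period: int = 15, failure_threshold: int = 6) -> bool:
    """True if the model is expected to be serving before the liveness probe gives up."""
    return load_seconds < liveness_budget_seconds(initial_delay, period, failure_threshold)


print(liveness_budget_seconds(180, 15, 6))  # budget with the chart defaults
print(load_time_fits(120))                  # typical 7B load time
print(load_time_fits(300))                  # slow storage or a larger model
```

A load time that does not fit is the signal to raise initialDelaySeconds or failureThreshold before deploying.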
Service Template
# templates/service.yaml
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-vllm
  namespace: {{ .Release.Namespace }}
spec:
  type: {{ .Values.service.type }}
  selector:
    app: {{ .Release.Name }}-vllm
  ports:
  - port: {{ .Values.service.port }}
    targetPort: 8000
    protocol: TCP
    name: http
HPA Template
# templates/hpa.yaml
{{- if .Values.hpa.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: {{ .Release.Name }}-vllm-hpa
  namespace: {{ .Release.Namespace }}
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: {{ .Release.Name }}-vllm
  minReplicas: {{ .Values.hpa.minReplicas }}
  maxReplicas: {{ .Values.hpa.maxReplicas }}
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_queue_depth
      target:
        type: AverageValue
        averageValue: "{{ .Values.hpa.targetQueueDepth | int }}"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: {{ .Values.hpa.scaleUpStabilizationSeconds }}
      policies:
      - type: Pods
        value: 1
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: {{ .Values.hpa.scaleDownStabilizationSeconds }}
      policies:
      - type: Pods
        value: 1
        periodSeconds: 120
{{- end }}
Deploying and Verifying
Before deploying, create the Hugging Face token secret in the target namespace. This is required for downloading gated models such as Mistral-7B-Instruct-v0.2:
# Create the inference namespace and HF token secret
kubectl create namespace inference
kubectl create secret generic hf-token \
--from-literal=token=<YOUR_HUGGING_FACE_TOKEN> \
-n inference
Security note: Do not pass the token as a plain environment variable or commit it to version control. The Secret-based approach above keeps the token out of your Helm values and pod specs.
Now install the chart:
# Install the chart (override image tag with --set image.tag=<version> if needed)
helm install vllm-inference ./vllm-chart \
--namespace inference
# Watch pods come up
kubectl get pods -n inference -w
# Check logs for model loading progress (look for auth errors or download progress)
kubectl logs -n inference -l app=vllm-inference-vllm --tail=50
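# Verify no authentication errors. --tail=-1 reads the full log; with a label
# selector, kubectl logs otherwise shows only the last 10 lines per pod.
# "token" is left out of the pattern because it also matches ordinary tokenizer log lines.
kubectl logs -n inference -l app=vllm-inference-vllm --tail=-1 | grep -iE "error|401|unauthorized|gated"
# Expected: no matches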
# Once the readiness probe passes, test the endpoint
kubectl port-forward -n inference svc/vllm-inference-vllm 8000:8000 &
PF_PID=$!
sleep 3
# This uses the legacy completions API. For chat-style inference, use /v1/chat/completions with a messages array.
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"prompt": "Explain Kubernetes in one sentence:",
"max_tokens": 64
}'
# Clean up the port-forward when done
kill "$PF_PID"
A successful response returns a JSON object with the generated completion, confirming the full pipeline works end to end: GPU Operator, device plugin, vLLM container, and Kubernetes networking.
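The same check can be scripted for smoke tests. The sketch below uses only the Python standard library and assumes the port-forward from the previous step is still active on localhost:8000. It relies on the OpenAI-style completions response shape, in which the generated text sits under choices[0].text. The helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 64) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "max_tokens": max_tokens})
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(response_body: dict) -> str:
    """Pull the generated text out of a completions response."""
    choices = response_body.get("choices") or []
    if not choices:
        raise ValueError("response contained no choices")
    return choices[0]["text"]


def complete(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the completion text (needs a reachable server)."""
    request = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_text(json.load(response))


# Example call, assuming the port-forward is running:
# print(complete("http://localhost:8000", "mistralai/Mistral-7B-Instruct-v0.2",
#                "Explain Kubernetes in one sentence:"))
```

Splitting request building and response parsing from the network call keeps both halves testable without a running cluster.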
Implementation Checklist
- GPU nodes provisioned and labeled.
- NVIDIA GPU Operator installed and verified.
- Inference nodes tainted for dedicated workloads.
- Hugging Face token Secret created in the inference namespace.
- Model artifacts accessible (PVC, S3, or Hugging Face Hub with valid token).
- vLLM Helm chart values reviewed for resource sizing.
- Prometheus deployed and scraping confirmed for vLLM metrics in the inference namespace.
- Prometheus Adapter installed and custom metrics API available.
- HPA deployed and tested under synthetic load.
- Liveness/readiness probes validated (ensure initialDelaySeconds exceeds model load time).
- Scale-down stabilization window tuned for model load time.
Where to Go Next
To mature this platform, add model versioning and canary rollouts with KServe, implement A/B traffic splitting for model evaluation, and integrate GPU-aware cost monitoring tools to track inference spend per model and per team.
For a broader view of local LLM tooling options, the Running LLMs Locally guide covers alternative approaches. Teams evaluating serving engines should also review the Ollama vs vLLM comparison to understand where each tool fits in the deployment spectrum.

