Running local LLMs has shifted from a hobbyist experiment to a core part of the developer toolkit. The convergence of efficient quantization, serving tools like Ollama (v0.3+) and LM Studio (v0.3+) that have shipped stable releases for over two years, and open-weight models that approach GPT-4-class performance at 8B to 14B parameters means that a single consumer GPU or an Apple Silicon laptop can now produce 20+ tokens per second with context windows above 2K tokens. This guide walks through the hardware requirements, tooling decisions, installation, Python integration, and performance tuning needed to go from zero to a working local LLM stack.
Table of Contents
- Why Run LLMs Locally in 2026?
- Hardware Requirements: What You Actually Need
- Local LLM Tooling Options in 2026
- Getting Started with Ollama
- Getting Started with LM Studio
- Integrating Local LLMs into Python Projects
- Choosing the Right Model for Your Use Case
- Performance Optimization and Troubleshooting
- Complete Setup Script: From Zero to Local LLM in 5 Minutes
- Where to Go from Here
Why Run LLMs Locally in 2026?
Privacy and Data Sovereignty
Any workflow involving sensitive codebases, proprietary business logic, or customer data introduces risk the moment that data leaves the local machine. Regulated industries operating under HIPAA or GDPR face strict data residency and processing requirements that cloud-hosted LLM APIs complicate significantly. Running inference locally eliminates this entire class of concern. The machine sends no tokens to third-party servers, and no external provider's data retention policy applies. Verify telemetry settings in each tool's documentation; for example, confirm Ollama's telemetry policy at ollama.com before assuming no data is collected. For security-conscious teams, local inference is not a preference but a requirement.
Cost, Speed, and Offline Capability
Per-token API pricing adds up quickly for high-volume use cases like batch code review, document summarization pipelines, or continuous agent loops. Local inference replaces ongoing operational cost with a one-time hardware investment. Latency also improves: once a local model is loaded and generating, inter-token latency can reach tens of milliseconds per token, avoiding the network round-trip penalty inherent to cloud API calls. Time-to-first-token and cold-load times are higher, but for interactive developer tools and IDE integrations, sustained generation speed matters more. Local models work identically whether the machine is connected to the internet or not. Flights, air-gapped environments, outages: nothing changes.
Customization and Control
When you run models locally, you control system prompts, fine-tuning, and behavior without any provider-imposed restrictions. You can swap models per task, apply custom fine-tunes for domain-specific workloads, and engineer prompts without worrying about content filtering policies that cloud providers enforce. You avoid vendor lock-in: to switch model families, pull a different weight file rather than rewriting API integration code.
Hardware Requirements: What You Actually Need
GPU vs. CPU vs. Apple Silicon
VRAM is the primary bottleneck for local LLM inference. NVIDIA GPUs from the RTX 4060 (base model, 8GB VRAM) and above represent the entry point for GPU-accelerated inference. The mapping is roughly linear: a 7B parameter model at Q4 quantization requires approximately 4 to 5GB of VRAM, a 14B model needs 8 to 10GB for full GPU offload (partial offload to CPU is possible on 8GB cards at reduced speed of approximately 15 tok/s), and a 30B model demands 16 to 20GB. The RTX 4090 with 24GB VRAM handles up to 30B models comfortably, while 70B models require multi-GPU setups or aggressive quantization to fit.
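The sizing rule above can be approximated with simple arithmetic. The sketch below is a rough estimate, not a measurement: it assumes about 4.5 effective bits per weight for Q4 (4-bit values plus per-block scaling data) and a flat 0.5GB allowance for runtime buffers. Both figures and the function name are illustrative assumptions, and K-quant variants typically run somewhat higher.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.5,   # assumed effective bits for Q4
                     overhead_gb: float = 0.5) -> float:  # assumed runtime buffers
    """Rough VRAM estimate in GB for a quantized model at short context."""
    # Billions of parameters times bytes per parameter gives gigabytes of weights
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb


for size in (7, 14, 30):
    print(f"{size}B at Q4: ~{estimate_vram_gb(size):.1f} GB")
```

The results land inside the ranges quoted above for 7B, 14B, and 30B models. Long context windows add KV-cache memory on top of this figure.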
Apple Silicon machines use a unified memory architecture, where system RAM and GPU memory share the same pool, so the practical entry point is an M2 Pro with 32GB of unified memory. The base M2 Pro at 16GB is insufficient for 14B models at Q4. An M2 Pro with 32GB can run 14B models at roughly 20 tok/s using Metal acceleration, and M3 Max or M4 Max configurations with 64GB or more handle 30B models effectively. llama.cpp's Metal backend has closed much of the gap with CUDA-based inference over the past two years.
CPU-only inference remains viable for models at 7B parameters and below, but at significantly reduced throughput. Expect 2 to 5 tokens per second on a modern high-core-count desktop CPU (such as an AMD Ryzen 9 or Intel Core i9) compared to 30 to 80 tokens per second on a mid-range GPU (RTX 4060 to 4070 range) for the same model.
RAM, Storage, and Practical Minimums
System RAM of 16GB is the bare minimum for running a 7B model alongside normal development tools. 32GB is strongly preferred, and 64GB opens up 30B+ models on Apple Silicon. SSD storage matters because model files range from roughly 4GB for a heavily quantized 7B model to 30GB or more for larger models at higher quantization levels. Mechanical drives will introduce painful load times.
Hardware and Model Size Benchmark Reference
| Configuration | 7B (Q4) | 14B (Q4) | 30B (Q4) | 70B (Q4) |
|---|---|---|---|---|
| Budget laptop (16GB RAM, CPU only) | ~3 tok/s, viable | ~1 tok/s, marginal | Not viable | Not viable |
| Mid-range desktop (RTX 4060, 8GB VRAM, 32GB RAM) | ~40 tok/s, excellent | ~15 tok/s, good (partial CPU offload) | Partial offload, ~5 tok/s | Not viable |
| High-end workstation (RTX 4090, 24GB VRAM, 64GB RAM) | ~80 tok/s, excellent | ~50 tok/s, excellent | ~25 tok/s, good | Partial offload, ~8 tok/s |
| Mac Studio (M2 Ultra, 192GB unified) | ~35 tok/s, excellent | ~25 tok/s, excellent | ~15 tok/s, good | ~8 tok/s, viable |
Throughput numbers are approximate and vary with context length, quantization method, and specific model architecture. They represent output token generation speed (excluding prompt processing time) at moderate context lengths (2K to 4K tokens). The 14B figure on the RTX 4060 reflects partial CPU offload since the model exceeds the card's 8GB VRAM; a 12GB+ card is needed for full GPU offload.
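To reproduce figures like these on your own hardware, generation speed can be measured from the first streamed chunk onward so that prompt processing is excluded. The following is a minimal sketch; the helper names are hypothetical, and it counts streamed chunks, which only approximate tokens.

```python
import time
from typing import Iterable, List, Sequence


def record_arrival_times(chunks: Iterable[str]) -> List[float]:
    """Consume a stream and note when each non-empty chunk arrives."""
    return [time.perf_counter() for chunk in chunks if chunk]


def generation_speed(arrival_times: Sequence[float]) -> float:
    """Chunks per second from the first chunk onward (prompt time excluded)."""
    if len(arrival_times) < 2:
        raise ValueError("need at least two chunks to measure speed")
    elapsed = arrival_times[-1] - arrival_times[0]
    if elapsed <= 0:
        raise ValueError("timestamps must be increasing")
    return (len(arrival_times) - 1) / elapsed


# Synthetic timestamps: first chunk at 0.5 s, then one every 50 ms
print(round(generation_speed([0.50, 0.55, 0.60, 0.65, 0.70]), 1))
```

In practice, pass the text deltas from a streaming response to record_arrival_times and feed the result to generation_speed.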
Local LLM Tooling Options in 2026
Ollama: CLI and Server Tool for Local LLMs
Ollama has become the default CLI and server tool for running local LLMs. It wraps llama.cpp with a single-command interface for model management, provides an OpenAI-compatible REST API out of the box, and handles model pulling, quantization selection, and GPU offloading automatically. Its model library covers all major open-weight model families. Developers comfortable with terminal workflows and those building automated pipelines gravitate toward Ollama because it behaves like infrastructure: start a server, hit an endpoint, get completions.
LM Studio: The GUI-First Experience
LM Studio provides a visual model browser, one-click downloads from Hugging Face, a built-in chat interface for testing, and a local server mode that exposes an OpenAI-compatible API on port 1234 by default. It targets developers who prefer a graphical workflow for exploring models, comparing outputs, and adjusting inference parameters through a UI. LM Studio focuses on GGUF format models, which is the dominant format for quantized local inference.
Other Notable Tools
llama.cpp is the foundational C++ inference engine that both Ollama and LM Studio build upon. It provides the lowest-level control and is the right choice when custom compilation flags or hardware-specific optimizations are needed.
vLLM targets production serving scenarios requiring high-throughput batched inference with PagedAttention, primarily on CUDA GPUs with growing community support for other backends. LocalAI functions as a broader OpenAI API drop-in replacement supporting multiple backends.
Jan, GPT4All, and Mozilla's llamafile each occupy different niches. Jan offers an Electron-based desktop experience. GPT4All focuses on simplicity for non-technical users. llamafile packages models as single executable files (bundling the llama.cpp runtime with model weights, which results in multi-gigabyte executables for larger models) for maximum portability.
Tool Comparison Reference
| Dimension | Ollama | LM Studio | llama.cpp | vLLM | LocalAI |
|---|---|---|---|---|---|
| Ease of setup | Very easy | Very easy | Moderate | Moderate | Moderate |
| API compatibility | OpenAI-compatible | OpenAI-compatible | OpenAI-compatible (via llama-server) | OpenAI-compatible | OpenAI-compatible |
| GPU support | CUDA, Metal, ROCm | CUDA, Metal | CUDA, Metal, ROCm, Vulkan | CUDA primarily | CUDA, Metal |
| Model format | GGUF (auto-managed) | GGUF | GGUF | Safetensors, AWQ, GPTQ | GGUF, others |
| OS support | macOS, Linux, Windows | macOS, Linux, Windows | All major | Linux primarily | Linux, macOS |
| Best use case | CLI workflows, APIs, automation | Exploration, GUI testing | Custom builds, embedded | Production serving | Drop-in API replacement |
Getting Started with Ollama
Prerequisites
- macOS 13+ or Linux (Ubuntu 20.04+ tested). Windows users should download from ollama.com/download and run the Python steps manually.
- Python 3.8 or higher required. Verify with python3 --version. ChromaDB requires 3.8+; the OpenAI SDK requires 3.7.1+.
- Bash 4.0+ recommended. macOS ships with Bash 3.x; install a newer version via brew install bash if needed.
Installation on macOS, Linux, and Windows
Ollama installs with a single command on macOS and Linux. On Windows, download the installer from the official site.
# macOS and Linux — install via the official install script
# Security note: This pipes a remote script directly to sh. To inspect first:
# curl -fsSL https://ollama.com/install.sh -o install.sh && less install.sh && sh install.sh
curl -fsSL https://ollama.com/install.sh | sh
# Windows — download the installer from https://ollama.com/download
# After installation, Ollama is available in the system PATH.
# Verify installation on any OS
ollama --version
The version command confirms that the Ollama binary is correctly installed and accessible. On macOS, Ollama also installs as a menu bar application that manages the background server process.
Pulling and Running Your First Model
Ollama uses a Docker-like pull and run workflow. Models are downloaded from the Ollama model library and cached locally. The recommended starter models in 2026 span several families: Qwen 3 8B from Alibaba provides excellent multilingual and reasoning capability, and Gemma 3 4B from Google handles summarization, Q&A, and light code tasks well enough to match several 7B-class models on standard benchmarks like MMLU. Meta's Llama 3.1 8B offers strong general performance at the 8B tier; Llama 4 Scout, by contrast, is a mixture-of-experts model with 17B active and 109B total parameters and does not fit this tier.
Quantization tags appended to model names control the precision and size trade-off. Q4_K_M is the medium variant of llama.cpp's 4-bit k-quant scheme: weights are stored at roughly 4 bits each with block-wise scaling factors, and a few sensitive tensors are kept at higher precision. It offers the best balance of size reduction and quality retention for most use cases. Q5_K_M provides slightly higher quality at roughly 20% more memory usage. Q8_0 is an 8-bit format with near-full-precision quality, suitable when VRAM is abundant and quality is paramount.
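The size and precision trade-off can be illustrated with a deliberately simplified block quantizer. This sketch is not the K-quant algorithm used by llama.cpp; it assumes plain symmetric rounding with one 16-bit scale per block of 32 weights, and all function names are illustrative.

```python
import random


def quantize_block(values, bits):
    """Symmetric absmax quantization of one block to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    if scale == 0:
        scale = 1.0  # all-zero block: any scale reproduces it exactly
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate weights from the scale and integer codes."""
    return [c * scale for c in codes]


def bits_per_weight(bits, block_size=32, scale_bits=16):
    """Storage cost per weight, including one scale per block."""
    return (block_size * bits + scale_bits) / block_size


def max_abs_error(values, bits):
    """Largest round-trip error for one block at the given bit width."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


random.seed(0)
block = [random.gauss(0, 1) for _ in range(32)]
for bits in (4, 5, 8):
    print(bits, bits_per_weight(bits), max_abs_error(block, bits))
```

Under these assumptions, moving from 4 to 5 bits raises storage from 4.5 to 5.5 bits per weight, in line with the roughly 20% figure above, while the worst-case rounding error shrinks with each extra bit.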
Note: Model tags in Ollama's registry can change over time. Before pulling, verify the exact tag at ollama.com/library for your chosen model. If the tag below returns a "model not found" error, check the library for the current canonical tag.
# Pull a specific quantized model variant
# Verify this tag exists at ollama.com/library/qwen3 before pulling
ollama pull qwen3:8b-q5_K_M
# Run the model interactively
ollama run qwen3:8b-q5_K_M
# At the interactive prompt, type your query:
>>> Explain the difference between a mutex and a semaphore in three sentences.
# Sample response:
# A mutex is a locking mechanism that allows only one thread to access a
# critical section at a time, and it must be released by the same thread
# that acquired it. A semaphore is a signaling mechanism that maintains a
# count, allowing multiple threads to access a limited pool of resources
# concurrently. The key distinction is ownership — mutexes enforce it,
# semaphores do not.
Using the Ollama API
Ollama automatically runs a local HTTP server on port 11434. This server exposes an OpenAI-compatible chat completions endpoint, making it a drop-in replacement for the OpenAI API in existing code.
# Ensure Ollama is running (it starts automatically on macOS;
# on Linux, run `ollama serve` in a separate terminal if needed)
# Send a chat completion request
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b-q5_K_M",
    "messages": [
      {"role": "system", "content": "You are a concise technical assistant."},
      {"role": "user", "content": "What is the CAP theorem?"}
    ],
    "stream": false
  }'
# Response structure matches OpenAI format:
# {
# "id": "chatcmpl-...",
# "object": "chat.completion",
# "choices": [{
# "index": 0,
# "message": {
# "role": "assistant",
# "content": "The CAP theorem states that a distributed system..."
# },
# "finish_reason": "stop"
# }],
# "usage": {"prompt_tokens": 28, "completion_tokens": 64, "total_tokens": 92}
# }
For streaming responses, set "stream": true in the request body. The server will return server-sent events in the same format as the OpenAI streaming API.
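A streamed response arrives as lines beginning with data:, each carrying a JSON chunk, followed by a final data: [DONE] marker. The sketch below shows one way to extract the text deltas from such lines; the function name and the sample lines are illustrative, not captured server output.

```python
import json
from typing import Iterable, Iterator


def iter_stream_content(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from OpenAI-style server-sent event lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines in the streaming format
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "The CAP"}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": " theorem"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_stream_content(sample)))  # prints: The CAP theorem
```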
Getting Started with LM Studio
Installation and Model Discovery
Download LM Studio as a desktop app from lmstudio.ai for macOS, Windows, and Linux. After installation, the built-in model hub lets you browse and download GGUF-format models directly from Hugging Face. The search interface filters by model family, parameter count, and quantization level, making it straightforward to find and download a specific variant without manually navigating Hugging Face repositories.
Running Inference and Starting a Local Server
The chat interface provides immediate testing of any downloaded model with adjustable parameters including temperature, top-p, and context length. For programmatic access, LM Studio's local server mode is enabled through the developer tab, which starts an OpenAI-compatible API endpoint on port 1234 by default.
# With LM Studio's local server running on port 1234
# Note: The model identifier in LM Studio reflects the filename on disk,
# which may differ from Ollama's tag format. Check the exact model string
# in LM Studio's model list before making API calls.
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "your-model-identifier-from-lm-studio",
    "messages": [
      {"role": "user", "content": "Write a Python function to flatten a nested list."}
    ],
    "temperature": 0.7
  }'
The response format is identical to Ollama's and OpenAI's, enabling the same client code to target any of the three backends by changing only the base URL.
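As a sketch of that portability, the snippet below builds the same chat request for any of the three backends using only the standard library. The BACKENDS table and the build_chat_request helper are illustrative names, and the request is constructed but not sent.

```python
import json
import urllib.request

# Base URLs for the three backends discussed above
BACKENDS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "openai": "https://api.openai.com/v1",
}


def build_chat_request(backend, model, messages, api_key="not-needed"):
    """Build (but do not send) a chat completion request for one backend."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BACKENDS[backend]}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # ignored by local servers
        },
        method="POST",
    )


req = build_chat_request(
    "lmstudio",
    "your-model-identifier-from-lm-studio",
    [{"role": "user", "content": "Hello"}],
)
print(req.full_url)
```

Sending the request is then a matter of passing it to urllib.request.urlopen with a suitable timeout while the chosen server is running.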
When to Use LM Studio vs. Ollama
LM Studio excels at exploration and experimentation. Its visual interface makes it easy to compare models side by side, adjust parameters with sliders, and quickly test different prompts without writing scripts. Ollama is the better choice for automation, scripting, CI/CD integration, and headless server deployments where a GUI is unnecessary overhead. Many developers use both: LM Studio for evaluation and selection, Ollama for production integration.
Integrating Local LLMs into Python Projects
Using the OpenAI Python SDK with Local Endpoints
The official OpenAI Python library (version 1.0.0 or higher) works with local LLM servers by simply overriding the base_url parameter. This means existing code written for the OpenAI API can migrate to local inference with a one-line change. No fork, no wrapper library, no code rewrite.
Install with: pip install 'openai>=1.0.0'
from openai import OpenAI
# Point at Ollama's local server (or LM Studio on port 1234)
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="not-needed"  # Local servers don't require auth
)
# Standard chat completion, identical to OpenAI API usage
response = client.chat.completions.create(
    model="qwen3:8b-q5_K_M",
    messages=[
        {"role": "system", "content": "You are a senior Python developer."},
        {"role": "user", "content": "Explain Python's GIL and its impact on threading."}
    ],
    temperature=0.7,
    timeout=120
)
print(response.choices[0].message.content)
# Streaming variant
print("\n--- Streaming ---")
stream = client.chat.completions.create(
    model="qwen3:8b-q5_K_M",
    messages=[
        {"role": "user", "content": "List five common Python performance pitfalls."}
    ],
    stream=True,
    timeout=120
)
for chunk in stream:
    # Some chunks carry no choices; skip them
    if not chunk.choices:
        continue
    delta_content = chunk.choices[0].delta.content
    if delta_content:
        print(delta_content, end="", flush=True)
print()
Building a Simple RAG Pipeline with a Local LLM
Retrieval-augmented generation combines document retrieval with LLM generation to ground responses in specific source material. The following pipeline loads a text file, chunks it, generates embeddings using a local embedding model, stores them in ChromaDB, and generates an answer using the local LLM.
Install dependencies with: pip install 'openai>=1.0.0' 'chromadb>=0.4.0'
from typing import List
import chromadb
from openai import OpenAI
# Initialize clients
llm_client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
chroma_client = chromadb.EphemeralClient()  # chromadb>=0.4.0; use PersistentClient(path=...) for persistence
collection = chroma_client.get_or_create_collection(name="docs")
# --- Chunking utility ---
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Chunk text by character count with overlap; chunk ends snap back to whitespace."""
    chunks = []
    start = 0
    length = len(text)
    while start < length:
        end = min(start + chunk_size, length)
        # Walk back to nearest whitespace to avoid ending a chunk mid-word
        if end < length:
            boundary = text.rfind(" ", start, end)
            if boundary > start:
                end = boundary
        chunks.append(text[start:end].strip())
        if end >= length:
            break  # Final chunk emitted; avoid a redundant overlap-only tail
        # Step back by `overlap` characters, but always make forward progress
        start = end - overlap if end - overlap > start else end
    return [c for c in chunks if c]
# Step 1: Load and chunk a document
# Ensure documentation.txt exists; create a sample for testing:
# echo "Your documentation content here" > documentation.txt
# Replace documentation.txt with your actual file path.
with open("documentation.txt", "r", encoding="utf-8") as f:
    text = f.read()
chunks = chunk_text(text, chunk_size=500, overlap=50)
# Step 2: Generate embeddings locally and store in ChromaDB
# Ollama exposes an embeddings endpoint compatible with standard clients
for idx, chunk in enumerate(chunks):
    try:
        embedding_resp = llm_client.embeddings.create(
            model="nomic-embed-text",  # Pull first: ollama pull nomic-embed-text
            input=chunk,
            timeout=30
        )
        collection.add(
            ids=[f"chunk_{idx}"],
            embeddings=[embedding_resp.data[0].embedding],
            documents=[chunk]
        )
    except Exception as exc:
        print(f"Warning: Failed to embed chunk {idx}: {exc}")
# Step 3: Query by embedding the question and retrieving relevant chunks
query = "What are the authentication requirements?"
query_embedding = llm_client.embeddings.create(
    model="nomic-embed-text", input=query, timeout=30
).data[0].embedding
results = collection.query(query_embeddings=[query_embedding], n_results=3)
context = "\n".join(results["documents"][0])
# Step 4: Generate an answer grounded in retrieved context
response = llm_client.chat.completions.create(
    model="qwen3:8b-q5_K_M",
    messages=[
        {"role": "system", "content": f"Answer based on this context:\n{context}"},
        {"role": "user", "content": query}
    ],
    timeout=120
)
print(response.choices[0].message.content)
This pipeline requires pulling the embedding model first with ollama pull nomic-embed-text. The nomic-embed-text and mxbai-embed-large models are both available through Ollama's library and work well for RAG applications.
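Conceptually, the retrieval step ranks stored chunks by how close their embeddings are to the query embedding. The sketch below illustrates the idea with cosine similarity over toy three-dimensional vectors; it is not how ChromaDB is implemented (ChromaDB uses its own index and a configurable distance metric), and the vectors and helper names are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, doc_vecs, k=3):
    """Indices of the k stored vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    scored.sort(reverse=True)  # highest similarity first
    return [i for _, i in scored[:k]]


# Toy embeddings; real ones have hundreds of dimensions
docs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(top_k([1.0, 0.05, 0.0], docs, k=2))  # prints: [0, 1]
```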
Structured Output and Function Calling Locally
Ollama supports structured output through JSON mode and schema enforcement, enabling reliable extraction of structured data from unstructured text. This is critical for pipelines that need to parse LLM output programmatically rather than display it to humans.
Version requirement: The json_schema response format requires a recent Ollama release (structured outputs were introduced in Ollama 0.5.0; check the release notes for your platform) and a model that supports structured output. Verify your version with ollama --version. Smaller models, particularly at Q3 or lower quantization, frequently produce malformed JSON or ignore required fields entirely.
from openai import OpenAI
import json
from json import JSONDecodeError
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
# Define the expected output schema
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "company": {"type": "string"},
        "role": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}}
    },
    "required": ["name", "email", "company", "role", "topics"]
}
# Unstructured input text
unstructured_text = """
Hi, I'm Priya Sharma from Dataflow Systems. I'm their lead ML engineer
and I'd love to discuss model optimization, edge deployment, and
inference acceleration. Reach me at priya.sharma@dataflow.io.
"""
response = client.chat.completions.create(
    model="qwen3:8b-q5_K_M",
    messages=[
        {"role": "system", "content": "Extract contact information from the text."},
        {"role": "user", "content": unstructured_text}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "contact_extraction",
            "schema": schema
        }
    },
    timeout=120
)
raw_content = response.choices[0].message.content
try:
    result = json.loads(raw_content)
except JSONDecodeError as exc:
    raise ValueError(
        f"Model did not return valid JSON. Raw output: {raw_content!r}"
    ) from exc
print(json.dumps(result, indent=2))
# Expected output:
# {
# "name": "Priya Sharma",
# "email": "priya.sharma@dataflow.io",
# "company": "Dataflow Systems",
# "role": "lead ML engineer",
# "topics": ["model optimization", "edge deployment", "inference acceleration"]
# }
Schema enforcement depends on model capability. Larger and more capable models follow schemas more reliably. When using smaller models (4B and below), expect occasional schema violations that downstream code should handle gracefully.
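One way to handle violations gracefully is to check the parsed object against the schema before passing it downstream. The following is a minimal sketch covering only required fields and basic types; the helper name is hypothetical, and a full validator such as the jsonschema package would cover far more of the specification.

```python
# Maps JSON Schema type names to Python types (partial, for illustration)
TYPE_MAP = {"string": str, "array": list, "object": dict}


def find_schema_problems(data, schema):
    """Return a list of human-readable problems; empty means the check passed."""
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    problems = []
    for key in schema.get("required", []):
        if key not in data:
            problems.append(f"missing required field: {key}")
    for key, spec in schema.get("properties", {}).items():
        if key not in data:
            continue
        expected = TYPE_MAP.get(spec.get("type"))
        if expected and not isinstance(data[key], expected):
            problems.append(f"wrong type for {key}: expected {spec['type']}")
    return problems


contact_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "topics": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "email", "topics"],
}
# A typical small-model failure: one field missing, one with the wrong type
print(find_schema_problems({"name": "Priya Sharma", "topics": "ML"}, contact_schema))
```

A caller can retry the request, fall back to a larger model, or log the raw output whenever the returned list is non-empty.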
Choosing the Right Model for Your Use Case
Code Generation and Developer Tools
Qwen 2.5 Coder leads code-centric tasks at sizes runnable on consumer hardware. The 7B variant handles code completion, explanation, and refactoring well in its Q5 quantized form. DeepSeek Coder V2 Lite at 16B parameters provides strong code generation quality, requiring 10 to 12GB of VRAM at Q4 quantization. The full-size DeepSeek models are far larger (DeepSeek-V3 is a 671B-parameter MoE model) and are not suitable for consumer hardware. More parameters generally produce better code, especially for complex multi-file reasoning, but demand proportionally more VRAM.
General Chat, Summarization, and Writing
Meta's Llama 3.1 8B offers a strong general-purpose starting point, balancing capability across summarization, analysis, and conversational tasks. (Llama 4 Scout is a mixture-of-experts model with 17B active and 109B total parameters, well beyond the 8B tier.) Mistral models continue to offer strong performance at smaller sizes. Gemma 3 from Google matches several 7B-class models on MMLU at just 4B parameters, making it a strong choice for resource-constrained environments where a 7B+ model does not fit comfortably.
Specialized Tasks: Embeddings, Vision, and Agents
For RAG embedding pipelines, nomic-embed-text and mxbai-embed-large are both available through Ollama and produce embeddings suitable for semantic search and retrieval. Multimodal models capable of processing images locally now support interleaved image-text prompts and handle 1080p input, with Gemma 3 vision variants and successors to the LLaVA family supporting image understanding alongside text. Agent-oriented workloads benefit from models with strong instruction following and tool-use capabilities, where Qwen 3 and Llama 4 Scout both perform well.
Performance Optimization and Troubleshooting
Maximizing Inference Speed
GPU layer offloading is the single most impactful performance lever. Both Ollama and llama.cpp support specifying how many model layers run on the GPU versus the CPU. Ollama handles this automatically in most cases, but manual configuration through the num_gpu parameter in a Modelfile allows fine-tuning. Setting all layers to GPU (num_gpu 999) forces full GPU offload when VRAM allows. A minimal Modelfile example:
FROM qwen3:8b-q5_K_M
PARAMETER num_gpu 999
Context window size directly impacts memory consumption and speed. A 7B Q4 model at 4K context uses roughly 5GB of VRAM; the same model at 32K context requires roughly 8 to 9GB. For tasks that do not require long-context reasoning, explicitly setting a shorter context window frees VRAM and improves throughput.
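The extra memory at long context comes mostly from the KV cache, which stores a key and a value vector for every layer and every token position. The sketch below estimates its size; the default dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) are assumptions roughly matching a 7B-class model with grouped-query attention, and real models differ.

```python
def kv_cache_gib(n_ctx: int,
                 n_layers: int = 32,        # assumed transformer layer count
                 n_kv_heads: int = 8,       # assumed KV heads (grouped-query attention)
                 head_dim: int = 128,       # assumed per-head dimension
                 bytes_per_value: int = 2   # 16-bit cache entries
                 ) -> float:
    """KV cache size in GiB for a given context length."""
    # Factor of 2 covers keys and values
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value
    return total_bytes / 2**30


for n_ctx in (4096, 32768):
    print(f"{n_ctx} tokens: {kv_cache_gib(n_ctx):.1f} GiB")
```

Under these assumptions the cache grows from 0.5 GiB at 4K to 4 GiB at 32K, which accounts for most of the difference between the two VRAM figures above.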
Batch size tuning matters for throughput-oriented workloads. When processing many independent requests, increasing the batch size allows the GPU to process multiple sequences in parallel. This is more relevant in vLLM deployments than in Ollama, which is optimized for single-user interactive use.
Common Issues and Fixes
"Out of memory" errors are the most frequent problem. The solutions, in order of increasing impact: reduce quantization level (switch from Q5 to Q4 or Q3), lower the context window length, reduce the number of GPU-offloaded layers to split processing between GPU and CPU, or select a smaller model.
Slow time-to-first-token usually means the model is reloading from disk. Ollama keeps models loaded in memory based on a keep-alive timer (5 minutes by default in recent versions; check the Ollama documentation for your release). Setting OLLAMA_KEEP_ALIVE=-1 keeps models loaded indefinitely, eliminating reload delays between requests at the cost of persistent VRAM usage. Run ollama ps after a request to confirm the model remains listed.
Warning: Keeping models permanently loaded consumes VRAM continuously. On shared or memory-constrained systems, this may prevent other GPU workloads from running.
Garbled or nonsensical output usually indicates a prompt template mismatch. Each model family expects a specific chat template format (ChatML, Llama format, Alpaca format). Ollama handles this automatically for models in its library, but manually loaded GGUF files through llama.cpp or LM Studio need the correct template set explicitly. This problem appears most often with raw Hugging Face GGUF downloads that lack embedded template metadata.
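To make the mismatch concrete, the sketch below renders a message list in ChatML, one of the formats named above. A model trained on a different template would see these marker strings as ordinary text, which is what produces garbled output. The function name is illustrative.

```python
def to_chatml(messages):
    """Render chat messages as a ChatML prompt ending with an open assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # cue the model to respond
    return "".join(parts)


prompt = to_chatml([
    {"role": "system", "content": "You are a concise technical assistant."},
    {"role": "user", "content": "What is the CAP theorem?"},
])
print(prompt)
```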
Troubleshooting Checklist
- Verify GPU driver versions: run nvidia-smi (NVIDIA) or check system info (Apple Silicon)
- Confirm CUDA/cuDNN installation matches Ollama requirements
- Check available VRAM before loading: nvidia-smi shows current utilization
- Verify model compatibility with your hardware tier
- Test API connectivity: curl http://localhost:11434/api/tags (native Ollama endpoint; /v1/models is available only with the OpenAI-compatibility layer and may not exist in all versions)
- Validate prompt format: use Ollama's built-in templates for library models
- Monitor for silent CPU fallback: if nvidia-smi shows no GPU utilization during inference, offloading has failed
- On macOS, use Activity Monitor to confirm Metal GPU usage under the GPU History tab
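The connectivity check from the list above can also be scripted. This sketch queries the native /api/tags endpoint with the standard library and returns None when the server cannot be reached; the helper names are hypothetical, and the sample JSON is a trimmed, hand-written illustration of the response shape.

```python
import json
import urllib.error
import urllib.request


def parse_model_names(tags_json: str):
    """Extract model names from an /api/tags response body."""
    data = json.loads(tags_json)
    return [m["name"] for m in data.get("models", [])]


def list_local_models(base_url="http://localhost:11434", timeout=5):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return parse_model_names(resp.read().decode("utf-8"))
    except (urllib.error.URLError, OSError):
        return None


# Hand-written sample of the response shape (real responses carry more fields)
sample = '{"models": [{"name": "qwen3:8b-q5_K_M"}, {"name": "nomic-embed-text:latest"}]}'
print(parse_model_names(sample))
```

Calling list_local_models() against a running server returns the same kind of list; a None result points to a server that is not running or a wrong port.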
Monitoring Resource Usage
On NVIDIA systems, nvidia-smi is the essential monitoring tool, showing VRAM usage, GPU utilization percentage, and running processes. Running watch -n 1 nvidia-smi provides a live updating view during inference. On macOS, Activity Monitor's GPU History tab shows Metal utilization, and sudo powermetrics --samplers gpu_power provides command-line GPU monitoring. The most common silent failure is GPU offload not engaging, where the model runs entirely on CPU without any error message. Spotting this requires checking that GPU utilization actually increases during generation.
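Checking for silent CPU fallback can be automated by sampling nvidia-smi while a generation is in progress. The sketch below uses nvidia-smi's CSV query mode; the helper names and the 5% threshold are assumptions, and the parser expects plain integer fields, so it will fail on cards that report [N/A].

```python
import subprocess

# CSV query mode prints one line per GPU, e.g. "87, 6210, 8188"
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text):
    """Parse nvidia-smi CSV output into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        util, used, total = (int(field.strip()) for field in line.split(","))
        stats.append({"util_pct": util, "mem_used_mib": used, "mem_total_mib": total})
    return stats


def read_gpu_stats():
    """Run nvidia-smi once and return parsed stats (requires NVIDIA drivers)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(out)


def looks_like_cpu_fallback(stats, min_util_pct=5):
    """True if no GPU shows meaningful utilization (threshold is an assumption)."""
    return all(gpu["util_pct"] < min_util_pct for gpu in stats)


# Parsing a hand-written sample line; call read_gpu_stats() on a real system
sample_stats = parse_gpu_stats("87, 6210, 8188\n")
print(sample_stats, looks_like_cpu_fallback(sample_stats))
```

Sample read_gpu_stats() a few times while the model is generating; a single reading taken between tokens can understate utilization.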
Complete Setup Script: From Zero to Local LLM in 5 Minutes
The following script automates the entire setup process for getting a local LLM running with Python integration tooling installed. It is designed to be copied, pasted, and executed directly.
Important: Run this script with bash, not sh: bash setup.sh. The script uses bash-specific syntax that will fail under a POSIX sh shell.
#!/bin/bash
set -e
echo "=== Local LLM Setup Script ==="
echo "This script installs Ollama, pulls a model, sets up a Python environment,"
echo "and verifies everything works."
echo "Requires: Python 3.8+, bash 4.0+, curl, internet connection"
echo ""
OLLAMA_PID=""
cleanup() {
    if [[ -n "$OLLAMA_PID" ]] && kill -0 "$OLLAMA_PID" 2>/dev/null; then
        echo "Stopping Ollama server (PID $OLLAMA_PID)..."
        kill "$OLLAMA_PID"
    fi
}
trap cleanup EXIT INT TERM
# Step 1: Detect OS
OS="unknown"
if [[ "$OSTYPE" == "darwin"* ]]; then
    OS="macos"
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    OS="linux"
else
    echo "ERROR: This script supports macOS and Linux. For Windows, install Ollama"
    echo "from https://ollama.com/download and run the Python steps manually."
    exit 1
fi
echo "[1/7] Detected OS: $OS"
# Step 2: Install Ollama (if not already present)
if command -v ollama &> /dev/null; then
    echo "[2/7] Ollama is already installed: $(ollama --version)"
else
    echo "[2/7] Installing Ollama..."
    # To inspect the install script before running: curl -fsSL https://ollama.com/install.sh -o install.sh && less install.sh && sh install.sh
    curl -fsSL https://ollama.com/install.sh | sh
    echo " Installed: $(ollama --version)"
fi
# Step 3: Start Ollama server if it is not already running
# (the Linux installer may register a background service; macOS auto-starts)
if curl -sf http://localhost:11434/ > /dev/null 2>&1; then
    echo "[3/7] Ollama server is already running"
elif [[ "$OS" == "linux" ]]; then
    echo "[3/7] Starting Ollama server in background..."
    nohup ollama serve > /tmp/ollama-setup.log 2>&1 &
    OLLAMA_PID=$!
    echo " Ollama PID: $OLLAMA_PID (logs: /tmp/ollama-setup.log)"
    # Wait for server to be ready (up to 15 seconds)
    SERVER_READY=0
    for i in {1..15}; do
        if curl -sf http://localhost:11434/ > /dev/null 2>&1; then
            SERVER_READY=1
            echo " Ollama server ready after ${i}s"
            break
        fi
        echo " Waiting for Ollama server... ($i/15)"
        sleep 1
    done
    # Verify server is actually responding
    if [[ "$SERVER_READY" -eq 0 ]]; then
        echo "ERROR: Ollama server did not start within 15 seconds."
        echo "Check logs: /tmp/ollama-setup.log"
        exit 1
    fi
else
    echo "[3/7] Ollama server auto-managed on macOS"
fi
# Step 4: Pull a recommended starter model
# Verify this tag at ollama.com/library/gemma3 if pull fails
MODEL="gemma3:4b-q5_K_M"
echo "[4/7] Pulling model: $MODEL (this may take a few minutes)..."
ollama pull "$MODEL"
# Step 5: Create a Python virtual environment
echo "[5/7] Creating Python virtual environment (./llm-env)..."
python3 -m venv llm-env
source llm-env/bin/activate
# Step 6: Install Python dependencies
echo "[6/7] Installing Python packages (openai>=1.0.0, chromadb>=0.4.0)..."
pip install 'openai>=1.0.0' 'chromadb>=0.4.0' || { echo "ERROR: pip install failed"; exit 1; }
# Step 7: Run a test inference call
echo "[7/7] Running test inference..."
export OLLAMA_MODEL="$MODEL"
python3 - <<'PYEOF'
import os
from openai import OpenAI
model = os.environ.get("OLLAMA_MODEL")
client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    timeout=60,
)
print("Model response:", response.choices[0].message.content)
PYEOF
# Clear the PID so the trap does not kill the server on normal exit
# (remove this line if you want the server to stop when the script finishes)
OLLAMA_PID=""
echo ""
echo "=== Setup Complete ==="
echo "Your local LLM stack is ready. Next steps:"
echo " 1. Activate your environment: source llm-env/bin/activate"
echo " 2. Try interactive chat: ollama run $MODEL"
echo " 3. Use the API endpoint: http://localhost:11434/v1/chat/completions"
echo " 4. Explore more models: ollama.com/library"
echo " 5. Add embeddings for RAG: ollama pull nomic-embed-text"
Where to Go from Here
Starting with Ollama and a single model like Gemma 3 4B or Qwen 3 8B provides immediate value for code generation assistance, document analysis, and chat-based exploration. Graduating to RAG pipelines with ChromaDB and structured output extraction opens up production-grade application patterns.
For those ready to go further, the natural next steps include fine-tuning local models on domain-specific datasets using tools like Unsloth or Axolotl, deploying Ollama or vLLM as team-shared services behind a reverse proxy, and integrating local LLMs into IDE workflows through extensions like Continue and Sourcegraph Cody, both of which support local model backends.

