How to Set Up Ollama for Local LLM Development

  1. Verify your hardware: 8 GB RAM at minimum (16 GB recommended), plus a modern CPU or a GPU with sufficient VRAM for your target model size.
  2. Install Ollama via Homebrew (macOS), the official install script (Linux), or winget (Windows).
  3. Start the Ollama daemon with ollama serve and confirm it responds to ollama list.
  4. Pull your first model using ollama pull llama3.1:8b-q4_K_M and test with ollama run.
  5. Configure environment variables like OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS to match your hardware.
  6. Create custom Modelfiles with tailored system prompts, temperature, and context window settings.
  7. Integrate Ollama into your workflow via the REST API, IDE extensions (Continue, Cody), or Python with langchain-ollama.
  8. Deploy Open WebUI as a browser-based chat frontend for team access to local models.

Why Ollama Dominates Local LLM Tooling in 2026

Local-first AI development has accelerated sharply through 2025 and into 2026. Data privacy mandates, unpredictable API pricing, and the growing need for offline-capable workflows all drive adoption. Ollama has established itself as the de facto standard CLI tool for running local LLMs, providing a streamlined interface for downloading, configuring, and serving open-weight language models across every major operating system. Its architecture wraps llama.cpp inference behind a simple command-line and REST API layer, abstracting away the complexity of model quantization, GPU memory allocation, and model file management.

This guide was written assuming Ollama v0.6.x or later. Run ollama --version to confirm your installation matches.

By the end of this guide, you will have a fully working Ollama installation, optimized configuration for your hardware, custom model profiles, and integration points into Python projects, IDEs, and chat frontends. The walkthrough assumes intermediate comfort with terminal commands, a basic understanding of what LLMs are, and ideally 16GB of system RAM (8GB is workable for 7B models, but slow). With 16GB and a modern CPU, expect roughly 5 to 15 tok/s from a 7B q4_K_M model in CPU-only mode.

What You Need Before Installing Ollama

Hardware Requirements

Local LLM inference is memory-bound. The critical bottleneck is RAM (for CPU inference) or VRAM (for GPU-accelerated inference). A rough sizing rule: budget approximately 0.6 GB per billion parameters at q4_K_M quantization, then add headroom for context. A 7B model at q4_K_M needs 4 to 6 GB. A 13B model at the same quantization needs 8 to 10 GB. For 70B-class models at q4_K_M, expect 38 to 48 GB depending on context length. Larger context windows add memory overhead on top of these figures.
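The sizing rule above can be sketched as a small helper. This is only an illustration of the rule of thumb, not a measurement: the function name and the 1 GB default for context headroom are assumptions made for this example.

```python
def estimate_q4_memory_gb(params_billion: float, context_headroom_gb: float = 1.0) -> float:
    """Rough RAM/VRAM estimate for a q4_K_M model.

    Uses the ~0.6 GB per billion parameters rule of thumb from the text.
    The default headroom value is an assumption for illustration only.
    """
    gb_per_billion_q4 = 0.6
    return params_billion * gb_per_billion_q4 + context_headroom_gb


if __name__ == "__main__":
    for size in (7, 13, 70):
        print(f"{size}B at q4_K_M: about {estimate_q4_memory_gb(size):.1f} GB")
```

With the default headroom, the 7B, 13B, and 70B estimates fall inside the ranges quoted above. Larger context windows would need a larger headroom value.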

CPU-only inference works but runs far slower. On a modern 8-core processor (e.g., Apple M2 or AMD Ryzen 7 5800X), expect 5 to 15 tokens per second for a 7B q4_K_M model. GPU-accelerated inference on an NVIDIA RTX 4090 (24GB VRAM) or Apple M-series with unified memory pushes that to 40 to 80+ tokens per second for the same model at default context length.

Supported Operating Systems

Ollama supports macOS (Apple Silicon and Intel), Linux (Ubuntu/Debian, Fedora, Arch and derivatives), and Windows (x86_64 and ARM64). Recent releases added native Windows ARM64 support, meaning Surface Pro and Snapdragon-based devices now run Ollama without emulation overhead.

| Component | Minimum | Recommended |
|---|---|---|
| RAM | 8GB (7B models, slow) | 16GB+ (13B models comfortably) |
| GPU VRAM | None (CPU-only) | 8GB+ NVIDIA/AMD; Apple Silicon unified |
| Disk Space | 10GB free | 50GB+ (multiple models) |
| macOS | 12+ (Intel), 13+ (Apple Silicon) | Latest macOS, Apple Silicon |
| Linux | Kernel 5.x+, glibc 2.31+ | Ubuntu 22.04+/Fedora 38+ |
| Windows | Windows 10 21H2+ (x86_64) | Windows 11 (ARM64 native supported) |
| NVIDIA GPU | Driver 525+, CUDA 12.x | Driver 550+, CUDA 12.4+ |
| AMD GPU | ROCm 6.x (Linux only) | ROCm 6.2+ |

Installing Ollama on macOS, Linux, and Windows

macOS Installation

The two supported methods are Homebrew and direct download from ollama.com. Homebrew is the preferred approach for developers who already use it for package management.

brew install ollama
ollama --version

On Apple Silicon Macs (M1 through M4), macOS enables Metal GPU acceleration by default. You need no additional configuration, drivers, or flags. The unified memory architecture means system RAM and GPU memory are shared, so a 32GB M-series Mac can load models that would require a dedicated GPU on other platforms.

Linux Installation

The official install script detects your system architecture and installs the appropriate binary.

Before running, you can inspect the script at https://ollama.com/install.sh. For added security, download first and verify:

#!/usr/bin/env bash
set -euo pipefail

INSTALL_SCRIPT="$(mktemp)"
EXPECTED_SHA256="REPLACE_WITH_HASH_FROM_OLLAMA_GITHUB_RELEASES"

echo "==> Downloading Ollama install script..."
curl -fsSL https://ollama.com/install.sh -o "$INSTALL_SCRIPT"

echo "==> Verifying script integrity..."
ACTUAL_SHA256="$(sha256sum "$INSTALL_SCRIPT" | awk '{print $1}')"
if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
  echo "ERROR: SHA256 mismatch. Expected: $EXPECTED_SHA256  Got: $ACTUAL_SHA256"
  rm -f "$INSTALL_SCRIPT"
  exit 1
fi

echo "==> Installing Ollama..."
sh "$INSTALL_SCRIPT"
rm -f "$INSTALL_SCRIPT"

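Replace REPLACE_WITH_HASH_FROM_OLLAMA_GITHUB_RELEASES with a SHA256 hash you trust: one published by the Ollama project if a checksum for the install script is available, or otherwise the hash of a copy of the script that you have downloaded and reviewed yourself. Until the placeholder is replaced, the check fails by design and nothing is installed.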

Then enable and start the service:

sudo systemctl enable ollama
sudo systemctl start ollama
ollama --version

For NVIDIA GPU acceleration, the NVIDIA driver must be installed and functional before Ollama can detect the GPU. Verify with:

nvidia-smi

This command should display the driver version (525+ required, 550+ recommended) and list available GPUs. If nvidia-smi is not found or shows errors, install or update the driver before GPU-accelerated inference will work.

Windows Installation

Install via the MSI installer from ollama.com or use winget from PowerShell:

winget install Ollama.Ollama
# Verify it's in PATH
ollama --version

If ollama is not recognized after installation, a new terminal session may be needed for PATH changes to take effect. For advanced users, running Ollama inside WSL2 is an alternative that provides the Linux experience. Note that ROCm support for AMD GPUs under WSL2 is experimental, hardware-dependent, and not officially supported by Ollama. Verify your specific GPU is on AMD's WSL2 ROCm support matrix before attempting this path.

New in 2026: Windows ARM64 devices receive a native build. Previous versions required x86 emulation, which imposed a measurable performance penalty. The native ARM64 build eliminates this overhead entirely.

Verifying Your Installation

Regardless of platform, confirm the Ollama daemon is running and responsive. If no background service is managing it yet, start it in the foreground:

ollama serve

In a separate terminal (or if the service is already running via systemd/launchd):

ollama list

A fresh installation returns an empty model list. If the command responds without error, the daemon is operational and ready to pull models.
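The same readiness check can be done from a script. The sketch below polls the /api/tags endpoint (the one used by the setup scripts later in this guide) with the Python standard library. The function name and the two-second timeout are assumptions for illustration.

```python
import json
import urllib.error
import urllib.request


def daemon_ready(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if the Ollama API answers GET /api/tags with valid JSON."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            json.load(resp)  # raises ValueError if the body is not JSON
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error, or malformed response
        return False


if __name__ == "__main__":
    print("Ollama daemon reachable:", daemon_ready())
```

Returning a plain boolean keeps the helper easy to drop into a retry loop or a health check.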

Pulling and Running Your First Local LLM

Understanding the Ollama Model Registry

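Ollama maintains a model library at ollama.com/library and names models with a name:tag convention. The tag typically encodes the parameter count, the variant, and the quantization level. For example, llama3.1:8b-q4_K_M refers to the Llama 3.1 model at 8 billion parameters with Q4_K_M quantization. Exact tag names differ between models, and some include a variant such as instruct before the quantization suffix, so confirm the tag on the model's page in the library before pulling. Common quantization suffixes include q4_0, q4_K_M, q5_K_M, q8_0, and fp16, listed here in order of increasing precision and memory requirements.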
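To make the name:tag convention concrete, here is a minimal sketch that splits a reference into its parts and looks for one of the quantization suffixes listed above. The ModelRef type and parse_model_ref function are hypothetical helpers written for this example, not part of Ollama. The sketch assumes plain library-style names (no registry host or port) and that a missing tag means latest.

```python
from typing import NamedTuple, Optional


class ModelRef(NamedTuple):
    name: str
    tag: str
    quantization: Optional[str]


# Suffixes named in the text; real tags may use others
KNOWN_QUANT_SUFFIXES = ("q4_0", "q4_K_M", "q5_K_M", "q8_0", "fp16")


def parse_model_ref(ref: str) -> ModelRef:
    """Split 'name:tag' and detect a trailing quantization suffix, if any."""
    name, sep, tag = ref.partition(":")
    if not sep:
        tag = "latest"  # assumption: an omitted tag resolves to 'latest'
    quant = next(
        (q for q in KNOWN_QUANT_SUFFIXES if tag.lower().endswith(q.lower())),
        None,
    )
    return ModelRef(name, tag, quant)


if __name__ == "__main__":
    print(parse_model_ref("llama3.1:8b-q4_K_M"))
    print(parse_model_ref("mistral"))
```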

Downloading and Running a Model

ollama pull llama3.1:8b-q4_K_M
ollama run llama3.1:8b-q4_K_M

After the pull completes, ollama run starts an interactive chat session:

>>> Explain dependency injection in three sentences.
Dependency injection is a design pattern where an object receives its dependencies
from external sources rather than creating them internally. This decouples
component creation from component use, making code more testable and modular.
Frameworks like Spring and .NET's built-in DI container automate this wiring at
application startup.

Under the hood, ollama pull downloads model layers as content-addressable blobs, similar to how container registries work. Repeated pulls of models that share base layers skip already-cached data, saving bandwidth and disk space.

Model Comparison: Choosing the Right Model for Your Hardware

Model versions shown below are current as of early 2026. Check ollama.com/library for the latest available tags.

Speed tiers are approximate tok/s ranges measured on an RTX 4090 (24 GB VRAM) at q4_K_M with default context length: Fast = 50 to 80+ tok/s, Moderate = 20 to 50 tok/s, Slow = 5 to 20 tok/s. Your results will vary by hardware, quantization, and context size.
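To place your own hardware in these tiers, you can compute tokens per second from the timing metadata that the API returns and map it to a tier. The sketch below assumes the response fields eval_count (generated tokens) and eval_duration (nanoseconds). Because the tier ranges share their endpoints, it also assumes that a boundary value counts toward the faster tier.

```python
def tokens_per_second(eval_count: int, eval_duration_ns: int) -> float:
    """Generation speed from Ollama's timing metadata (duration in nanoseconds)."""
    return eval_count / (eval_duration_ns / 1e9)


def speed_tier(tok_per_s: float) -> str:
    """Map a measured speed to the tiers used in this guide."""
    if tok_per_s >= 50:
        return "Fast"
    if tok_per_s >= 20:
        return "Moderate"
    if tok_per_s >= 5:
        return "Slow"
    return "Below listed tiers"


if __name__ == "__main__":
    # Hypothetical numbers: 256 tokens generated in 4 seconds
    rate = tokens_per_second(256, 4_000_000_000)
    print(f"{rate:.1f} tok/s -> {speed_tier(rate)}")
```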

| Model | Parameters | Quantization Options | Min RAM | Rec. VRAM | Best Use Case | Relative Speed |
|---|---|---|---|---|---|---|
| Llama 3.1 | 8B, 70B | q4_K_M, q8_0, fp16 | 6GB (8B-q4) | 8GB+ | General purpose, coding | Fast (8B) |
| Llama 3.3 | 70B | q4_K_M, q8_0 | 40GB | 48GB+ | Complex tasks, analysis | Slow |
| Mistral | 7B | q4_K_M, q5_K_M, q8_0 | 6GB | 8GB | Instruction following | Fast |
| Phi-4 | 14B | q4_K_M, q8_0 | 10GB | 12GB+ | Reasoning, compact | Moderate |
| Gemma 3 | 9B, 27B | q4_K_M, q8_0 | 6GB (9B) | 8GB (9B), 20GB (27B) | Multilingual, long context | Moderate |
| Qwen 3 | 8B, 32B, 72B | q4_K_M, q8_0 | 6GB (8B) | 8GB (8B), 48GB (72B) | Coding, multilingual | Fast (8B) |
| DeepSeek-R1 (distill) | 7B, 14B | q4_K_M, q8_0 | 6GB (7B) | 8GB (7B), 12GB (14B) | Reasoning chains | Moderate |
| Gemma 3 | 9B | q4_K_M, q8_0 | 6GB | 8GB | Code generation (with coding system prompt) | Fast |

Configuring Ollama for Optimal Performance

Environment Variables and Server Configuration

Environment variables control Ollama's behavior. The key ones for performance tuning and multi-model workflows:

# Bind address — default is 127.0.0.1:11434
# WARNING: Binding to 0.0.0.0 exposes the Ollama API on ALL network interfaces
# with NO authentication. Only use on trusted networks AND with a firewall rule.
export OLLAMA_HOST=0.0.0.0:11434

# Restrict access to your local subnet (adjust CIDR to match your network)
sudo ufw allow from 192.168.1.0/24 to any port 11434
sudo ufw deny 11434

Warning: Binding to 0.0.0.0 exposes the Ollama API on all network interfaces with no authentication. Any device on the same network can query your API and load arbitrary models. The firewall rules above are mandatory when using this setting. Adjust the CIDR range (192.168.1.0/24) to match your actual local subnet. If you do not need remote access, keep the default 127.0.0.1:11434.
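Before relying on a firewall rule, it can help to confirm that the CIDR range you chose actually covers the machines that need access and nothing else. A minimal sketch using Python's standard ipaddress module follows. The function name and sample addresses are made up for illustration.

```python
import ipaddress


def is_allowed_client(client_ip: str, allowed_cidr: str = "192.168.1.0/24") -> bool:
    """Return True if client_ip falls inside the allowed subnet."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(allowed_cidr)


if __name__ == "__main__":
    for ip in ("192.168.1.42", "192.168.2.5", "10.0.0.7"):
        status = "allowed" if is_allowed_client(ip) else "blocked"
        print(f"{ip}: {status}")
```

This only checks the arithmetic of the range. The firewall itself remains the enforcement point.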

# Custom model storage location (useful for external drives or NAS)
export OLLAMA_MODELS=/mnt/fast-ssd/ollama-models

# Number of parallel request slots per model
export OLLAMA_NUM_PARALLEL=4

# Maximum number of models loaded in memory simultaneously
export OLLAMA_MAX_LOADED_MODELS=2

Where to persist these settings depends on the OS. On macOS, set them with launchctl setenv when Ollama runs as the desktop app, in a launchd plist (e.g., ~/Library/LaunchAgents/com.ollama.env.plist), or in your shell profile if you start ollama serve from a terminal. On Linux with systemd, use sudo systemctl edit ollama to create an override file and add lines like Environment="OLLAMA_NUM_PARALLEL=4" under a [Service] section, then restart the service. On Windows, set them through System Environment Variables (Machine scope) or in a PowerShell profile. Note that if Ollama runs as a Windows service under the SYSTEM account, User-scoped environment variables will not be visible to it, so use Machine scope instead.
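For the systemd case, the override file is plain text and easy to generate. The sketch below renders the [Service] section from a dictionary of settings so that you can paste it into the editor that sudo systemctl edit ollama opens. The function name is hypothetical and the values are examples only.

```python
def render_systemd_override(env: dict[str, str]) -> str:
    """Render a systemd drop-in that sets environment variables for a service."""
    lines = ["[Service]"]
    for key, value in env.items():
        # Quote the whole assignment, matching the form shown in the text
        lines.append(f'Environment="{key}={value}"')
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    settings = {
        "OLLAMA_NUM_PARALLEL": "4",
        "OLLAMA_MAX_LOADED_MODELS": "2",
    }
    print(render_systemd_override(settings), end="")
```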

GPU Memory Management

Ollama does not expose a direct VRAM fraction setting. Use OLLAMA_MAX_LOADED_MODELS and model quantization selection to manage VRAM headroom. For example, setting OLLAMA_MAX_LOADED_MODELS=1 ensures only a single model occupies VRAM at a time, while choosing q4_K_M over q8_0 cuts VRAM requirements by roughly 40 to 50 percent.

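When running multiple models, monitor VRAM with nvidia-smi (NVIDIA) and check ollama ps to see which models are loaded and whether each is running on the GPU, the CPU, or a mix of both. On macOS, system_profiler SPDisplaysDataType reports the GPU configuration rather than live usage. If OOM errors occur, switch to a more aggressively quantized model variant or drop OLLAMA_MAX_LOADED_MODELS to 1.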
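On NVIDIA systems, VRAM headroom can also be checked programmatically. The sketch below parses the output of nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits, which reports values in MiB, one line per GPU. The helper names are assumptions for this example, and the parser assumes that every field is a plain integer.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_vram_csv(output: str) -> list[tuple[int, int]]:
    """Parse 'used, total' lines (MiB) into one tuple per GPU."""
    rows = []
    for line in output.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def vram_headroom_mib(output: str) -> list[int]:
    """Free VRAM per GPU in MiB."""
    return [total - used for used, total in parse_vram_csv(output)]


def query_vram() -> list[tuple[int, int]]:
    """Run nvidia-smi and return (used, total) per GPU; requires the NVIDIA driver."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_vram_csv(result.stdout)
```

Calling query_vram() before loading a second model gives a quick indication of whether it is likely to fit.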

Creating Custom Modelfiles

Modelfiles let developers create named model configurations with custom system prompts, parameters, and behavioral constraints:

# Modelfile-coding-assistant
FROM llama3.1:8b-q4_K_M

SYSTEM """You are a senior software engineer. Provide concise, production-ready code. Always include error handling. Prefer TypeScript unless otherwise specified."""

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER stop "<|eot_id|>"

The stop token must match the base model's tokenizer. <|eot_id|> is correct for Llama-3 family models; for Phi family models, use <|end|> instead. Do not combine stop tokens from different model families in the same Modelfile, as the wrong token may fire unexpectedly.

Avoid using triple-backtick stop tokens in Modelfiles embedded in Markdown documentation, as they break code fence rendering. Use tokens like <|end|> or <|eot_id|> as appropriate for your model family.

Build and use it:

ollama create coding-assistant -f Modelfile-coding-assistant
ollama run coding-assistant

This pattern supports multiple configurations from the same base model: a coding assistant with low temperature, a writing assistant with higher temperature and a different system prompt, or a RAG-optimized profile with a large context window and strict stop tokens. The base model blobs are shared on disk, so creating variants incurs negligible additional storage.
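One way to manage several such profiles is to generate the Modelfiles from a small table of settings. The sketch below is an illustration under stated assumptions: render_modelfile is a hypothetical helper, and the writing-assistant values are invented for this example. It emits only the FROM, SYSTEM, and PARAMETER instructions shown above.

```python
def render_modelfile(base: str, system_prompt: str, temperature: float, num_ctx: int) -> str:
    """Render a minimal Modelfile for one profile."""
    return (
        f"FROM {base}\n\n"
        f'SYSTEM """{system_prompt}"""\n\n'
        f"PARAMETER temperature {temperature}\n"
        f"PARAMETER num_ctx {num_ctx}\n"
    )


# Example profiles sharing one base model (values are illustrative)
PROFILES = {
    "coding-assistant": ("You are a senior software engineer.", 0.3, 8192),
    "writing-assistant": ("You are a careful technical editor.", 0.8, 4096),
}

if __name__ == "__main__":
    for name, (prompt, temp, ctx) in PROFILES.items():
        path = f"Modelfile-{name}"
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(render_modelfile("llama3.1:8b-q4_K_M", prompt, temp, ctx))
        print(f"ollama create {name} -f {path}")
```

Running the script writes one Modelfile per profile and prints the matching ollama create commands to run afterwards.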

Integrating Ollama into Your Development Workflow

Using the REST API

Ollama exposes a local REST API on port 11434. For single-turn completions:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b-q4_K_M",
  "prompt": "Write a Python function to flatten a nested list.",
  "stream": false
}'

Note: On Windows cmd.exe, single-quoted JSON strings will fail. Use double-quotes and escape internal quotes, or use PowerShell's Invoke-WebRequest.

For multi-turn conversation with message history:

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b-q4_K_M",
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "What is a monad?"},
    {"role": "assistant", "content": "A monad is a design pattern for chaining operations..."},
    {"role": "user", "content": "Give me a practical example in JavaScript."}
  ],
  "stream": false
}'

Both endpoints return JSON with the generated text in the response (generate) or message.content (chat) field, along with timing metadata.
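The same calls can be made from Python without extra dependencies. The sketch below builds the /api/generate request with the standard library and extracts the text from either response shape described above. The helper names and the 120-second timeout are assumptions for illustration.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str,
    model: str = "llama3.1:8b-q4_K_M",
    base_url: str = "http://localhost:11434",
) -> urllib.request.Request:
    """Build a non-streaming POST request for /api/generate."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_text(body: dict) -> str:
    """Return generated text from a generate ('response') or chat ('message.content') reply."""
    if "response" in body:
        return body["response"]
    return body["message"]["content"]


def generate(prompt: str, **kwargs) -> str:
    """Send the prompt to a running Ollama daemon and return the generated text."""
    request = build_generate_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as resp:
        return extract_text(json.load(resp))
```

With the daemon running and the model pulled, generate("Write a Python function to flatten a nested list.") returns the completion as a string.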

IDE and Editor Integration

To get Ollama completions inside your editor, start with VS Code. Extensions like Continue (which supports autocomplete and chat with local models) and Cody (configurable to use a local backend instead of Sourcegraph's cloud) connect directly to a local Ollama instance. JetBrains IDEs support Ollama through community plugins such as "Ollama Integration." In each case, set the endpoint to http://localhost:11434 and specify the model name.

Connecting to LangChain and Python Projects

To call Ollama from Python, use the langchain-ollama package:

import os
import sys

import httpx
from langchain_ollama import ChatOllama

# Endpoint is overridable for containerized or remote deployments
base_url = os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434")

llm = ChatOllama(
    model="llama3.1:8b-q4_K_M",
    base_url=base_url,
    temperature=0.4,
)

try:
    # Print tokens as they arrive instead of waiting for the full reply
    for chunk in llm.stream("Explain the CAP theorem in distributed systems."):
        print(chunk.content, end="", flush=True)
    print()
except (ConnectionError, httpx.ConnectError):
    # Depending on the client version, a refused connection surfaces as the
    # built-in ConnectionError or as httpx.ConnectError
    print(f"\nERROR: Cannot connect to Ollama at {base_url}. Is the daemon running?",
          file=sys.stderr)
    sys.exit(1)
except Exception as e:
    print(f"\nERROR: Streaming failed: {e}", file=sys.stderr)
    sys.exit(1)

Install the package with pip install langchain-ollama==0.2.0. Pin to a specific version in production to ensure reproducible behavior. Set the OLLAMA_BASE_URL environment variable to override the default endpoint for containerized or remote deployments. The example streams tokens as they are generated, which keeps output responsive in interactive applications.

Using Open WebUI as a Chat Frontend

Open WebUI gives you a ChatGPT-style browser interface for local models. Replace the version tag below with the latest stable release from the Open WebUI releases page and verify the image digest for reproducible, secure deployments:

# Check https://github.com/open-webui/open-webui/releases for the latest stable tag
# and replace v0.3.35 below accordingly.
docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart unless-stopped \
  ghcr.io/open-webui/open-webui:v0.3.35

After pulling, verify the image digest:

docker inspect --format='{{index .RepoDigests 0}}' ghcr.io/open-webui/open-webui:v0.3.35

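Navigate to http://localhost:3000; Open WebUI should auto-detect the Ollama instance running on the host. host.docker.internal resolves differently on macOS with Docker Desktop than on Linux, so if auto-detection fails, manually configure the Ollama URL in Open WebUI's settings. On Linux in particular, the container reaches the host over the Docker bridge rather than loopback, so a daemon bound only to 127.0.0.1 may be unreachable until OLLAMA_HOST is changed (see the binding warning in the configuration section). This gives teams a shared chat interface for local models without exposing anything to the public internet.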

Troubleshooting Common Ollama Issues

Model Download Failures and Disk Space

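Ollama supports resumable downloads. If a pull is interrupted, running the same ollama pull command resumes from where it left off. If a download is corrupted and will not resume, you can clear partial blobs by deleting the contents of the model storage path, but note that this also removes every model you have already downloaded. The default path is ~/.ollama/models/ on macOS and for Linux user-level installs; a Linux systemd install typically stores models under /usr/share/ollama/.ollama/models. The internal directory structure is subject to change. To relocate model storage to a larger drive, set OLLAMA_MODELS to the desired path before starting the daemon.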
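Many failed pulls come down to a full disk, so a quick free-space check before a large download can save time. The sketch below uses shutil.disk_usage from the standard library. The function name and the 2 GB safety margin are assumptions for illustration, and model_size_gb is a figure you supply yourself.

```python
import shutil


def has_room_for_model(path: str, model_size_gb: float, margin_gb: float = 2.0) -> bool:
    """Return True if the filesystem holding `path` has space for the model plus a margin."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= model_size_gb + margin_gb


if __name__ == "__main__":
    # Check the current directory's filesystem for a roughly 5 GB model
    print("Enough space:", has_room_for_model(".", 5.0))
```

Point path at the directory configured in OLLAMA_MODELS (or the default storage location) to check the drive that will actually receive the download.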

GPU Not Detected

For NVIDIA GPUs, ensure the driver version is 525 or newer (550+ recommended) and that nvidia-smi runs successfully. Outdated or mismatched drivers are a common cause of detection failures. For AMD GPUs, this guide assumes ROCm 6.x on Linux; check Ollama's GPU documentation for the current state of AMD support on other platforms. On Apple Silicon, macOS system frameworks provide Metal GPU acceleration. No additional tools are required for Ollama to use the GPU.

Slow Inference Performance

The most impactful lever is quantization level. Moving from q8_0 to q4_K_M cuts memory usage by roughly 40 to 50 percent and increases throughput, though perplexity typically rises 1 to 3 percent on standard benchmarks (see the llama.cpp quantization quality comparison for model-specific numbers). Large context lengths (num_ctx) also consume proportionally more memory and slow inference. Tuning OLLAMA_NUM_PARALLEL above 1 enables concurrent requests but splits available memory across slots. If a model does not fit entirely in VRAM, Ollama offloads remaining layers to the CPU, mixing GPU and CPU inference speeds. For example, a 13B model split roughly 60/40 between GPU and CPU may drop from 40 tok/s to around 15 tok/s. In those cases, consider using a smaller model or more aggressive quantization rather than accepting the CPU offload penalty.
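To see why context length matters, consider the key/value cache, which grows linearly with num_ctx. The sketch below estimates its size for a Llama-3.1-8B-style architecture. The defaults (32 layers, 8 KV heads, head dimension 128, 16-bit cache values) are assumptions based on that model's published configuration, and the runtime may allocate differently, for example when serving several parallel slots.

```python
def kv_cache_bytes(
    num_ctx: int,
    n_layers: int = 32,
    n_kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_value: int = 2,
) -> int:
    """Approximate KV cache size: one K and one V tensor per layer, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


if __name__ == "__main__":
    for ctx in (2048, 8192, 32768):
        gib = kv_cache_bytes(ctx) / 1024**3
        print(f"num_ctx={ctx}: about {gib:.2f} GiB of KV cache")
```

Under these assumptions, doubling num_ctx doubles the cache, which is memory the model weights can no longer use.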

Quick-Start Setup Scripts

These scripts automate the full setup from installation through verification.

Linux (Bash; macOS users should install with Homebrew as shown earlier, since install.sh targets Linux):

#!/usr/bin/env bash
set -euo pipefail

# Optional: Uncomment and set these variables to customize your setup
# export OLLAMA_MODELS=/mnt/fast-ssd/ollama-models
# export OLLAMA_HOST=127.0.0.1:11434

INSTALL_SCRIPT="$(mktemp)"
EXPECTED_SHA256="REPLACE_WITH_HASH_FROM_OLLAMA_GITHUB_RELEASES"

echo "==> Downloading Ollama install script..."
curl -fsSL https://ollama.com/install.sh -o "$INSTALL_SCRIPT"

echo "==> Verifying script integrity..."
ACTUAL_SHA256="$(sha256sum "$INSTALL_SCRIPT" | awk '{print $1}')"
if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
  echo "ERROR: SHA256 mismatch. Expected: $EXPECTED_SHA256  Got: $ACTUAL_SHA256"
  rm -f "$INSTALL_SCRIPT"
  exit 1
fi

echo "==> Installing Ollama..."
sh "$INSTALL_SCRIPT"
rm -f "$INSTALL_SCRIPT"

echo "==> Configuring environment variables..."
if command -v systemctl &>/dev/null && systemctl is-active --quiet ollama 2>/dev/null; then
  echo "NOTE: Ollama is managed by systemd. Exporting vars in this shell will NOT affect"
  echo "      the running service. To persist settings, run:"
  echo "        sudo systemctl edit ollama"
  echo "      and add under [Service]:"
  echo "        Environment=OLLAMA_NUM_PARALLEL=2"
  echo "        Environment=OLLAMA_MAX_LOADED_MODELS=1"
else
  export OLLAMA_NUM_PARALLEL=2
  export OLLAMA_MAX_LOADED_MODELS=1
fi

# Note: These exports apply only for this session and the daemon started
# by this script. To persist settings across reboots, add them to your
# shell profile or a systemd override file (see Configuration section).

echo "==> Starting Ollama daemon..."
if command -v systemctl &>/dev/null && systemctl is-active --quiet ollama 2>/dev/null; then
  echo "Ollama is already running via systemd."
else
  OLLAMA_LOG="${HOME}/.ollama/serve.log"
  mkdir -p "$(dirname "$OLLAMA_LOG")"

  echo "    (log: $OLLAMA_LOG)"
  ollama serve >>"$OLLAMA_LOG" 2>&1 &
  OLLAMA_PID=$!

  # Ensure daemon is killed on script exit if we started it
  trap 'kill "$OLLAMA_PID" 2>/dev/null || true' EXIT
fi

echo "==> Waiting for Ollama daemon to become ready..."
timeout 30 bash -c \
  'until curl -sf --max-time 2 http://localhost:11434/api/tags >/dev/null; do sleep 1; done' \
  || { echo "ERROR: Ollama daemon did not start within 30 seconds. Check ~/.ollama/serve.log"; exit 1; }

echo "==> Pulling default model (llama3.1:8b-q4_K_M)..."
ollama pull llama3.1:8b-q4_K_M

echo "==> Running test prompt..."
echo "Say hello in one sentence." | ollama run llama3.1:8b-q4_K_M

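if [ -n "${OLLAMA_PID:-}" ]; then
  echo "==> Setup complete. The daemon started by this script stops when the script exits;"
  echo "    run 'ollama serve' (or enable the service) to start it again."
else
  echo "==> Setup complete. Ollama is ready."
fi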

Windows (PowerShell):

#Requires -RunAsAdministrator

Write-Host "==> Installing Ollama..."
winget install Ollama.Ollama --accept-source-agreements --accept-package-agreements
if ($LASTEXITCODE -ne 0) {
    Write-Error "ERROR: winget install failed with exit code $LASTEXITCODE."
    exit 1
}

Write-Host "==> Setting environment variables..."
# Machine scope is required so the Ollama Windows service (running as SYSTEM) inherits these values.
# This script must be run as Administrator.
if (-not ([Security.Principal.WindowsPrincipal][Security.Principal.WindowsIdentity]::GetCurrent()
         ).IsInRole([Security.Principal.WindowsBuiltInRole]::Administrator)) {
    Write-Error "This script must be run as Administrator to set Machine-scoped environment variables."
    exit 1
}
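[System.Environment]::SetEnvironmentVariable("OLLAMA_NUM_PARALLEL", "2", "Machine")
[System.Environment]::SetEnvironmentVariable("OLLAMA_MAX_LOADED_MODELS", "1", "Machine")
# Machine-scoped values are not visible to this already-running session,
# so also set them here for the daemon started below.
$env:OLLAMA_NUM_PARALLEL = "2"
$env:OLLAMA_MAX_LOADED_MODELS = "1"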

Write-Host "==> Refreshing PATH..."
$env:Path = [System.Environment]::GetEnvironmentVariable("Path", "Machine") + ";" + [System.Environment]::GetEnvironmentVariable("Path", "User")

Write-Host "==> Starting Ollama daemon..."
$ollamaProc = Start-Process -FilePath "ollama" -ArgumentList "serve" -PassThru -WindowStyle Hidden

Write-Host "==> Waiting for Ollama daemon to become ready..."
$timeout = 30
$elapsed = 0
do {
    Start-Sleep -Seconds 1
    $elapsed++
    try {
        $response = Invoke-WebRequest -Uri "http://localhost:11434/api/tags" -UseBasicParsing -TimeoutSec 2 -ErrorAction Stop
        break
    } catch {
        if ($elapsed -ge $timeout) {
            Write-Error "ERROR: Ollama daemon did not start within ${timeout}s."
            Stop-Process -Id $ollamaProc.Id -Force -ErrorAction SilentlyContinue
            exit 1
        }
    }
} while ($true)

Write-Host "==> Pulling default model..."
ollama pull llama3.1:8b-q4_K_M

Write-Host "==> Running test prompt..."
ollama run llama3.1:8b-q4_K_M "Say hello in one sentence."

Write-Host "==> Setup complete. Ollama is ready."

Both scripts can be customized by changing the model name, adjusting environment variable values, or adding additional ollama pull commands for multiple models.

What to Build Next with Local LLMs

With Ollama running and integrated into your workflow, here are three concrete next steps:

  • Fine-tune on domain-specific data. Ollama handles inference only. Use a tool like Unsloth or axolotl to fine-tune a LoRA adapter, then import the result with ollama create.
  • Build a retrieval-augmented generation (RAG) pipeline. Combine langchain-ollama with a local vector store like ChromaDB. The LangChain docs include a step-by-step RAG tutorial that works with any Ollama-served model.
  • Deploy Open WebUI as a team AI assistant. The Docker command from the integration section above gives your team a shared, self-hosted chat interface with no external API dependencies.

Each builds directly on what this guide covered.

SitePoint Team