How to Run Claude Code on Ollama
- Install Ollama via Homebrew or the official install script and start the server.
- Pull a coding model such as qwen2.5-coder:14b with ollama pull.
- Verify Ollama's OpenAI-compatible endpoint responds at localhost:11434/v1.
- Install Claude Code globally via npm install -g @anthropic-ai/claude-code.
- Unset any existing ANTHROPIC_API_KEY to prevent accidental API billing.
- Export environment variables pointing Claude Code to the local Ollama endpoint.
- Launch Claude Code in your project directory and confirm the local model name appears.
- Confirm local routing by checking for active connections to port 11434 during a session.
Running Claude Code against Anthropic's API gets expensive fast. Run Claude Code against a local model through Ollama and you pay zero marginal cost per query. This tutorial walks through the complete setup, from installing Ollama and pulling an appropriate coding model, to configuring Claude Code's environment variables, to running real coding tasks against a React and Node.js project.
Table of Contents
- Why Your Claude Code API Bill Is a Problem
- What Is Claude Code and Why Go Local?
- Understanding the Architecture: Claude Code + Ollama + OpenAI-Compatible APIs
- Prerequisites and System Requirements
- Step 1: Install and Configure Ollama
- Step 2: Install and Configure Claude Code for Local Use
- Step 3: Take It for a Spin with a React + Node.js Project
- Performance Tuning and Optimization
- Complete Implementation Checklist and Model Comparison Table
- Troubleshooting Common Issues
- When to Use Local vs. API: A Practical Framework
- What Comes Next
Why Your Claude Code API Bill Is a Problem
Running Claude Code against Anthropic's API gets expensive fast. Developers on the Anthropic subreddit and various forums have reported spending between $100 and $200 in a single day of heavy agentic coding sessions. One widely cited, self-reported community account described burning through $175 in just four hours while refactoring a medium-sized codebase (results will vary significantly by task type and codebase size). Even conservative usage patterns, involving periodic prompts for code reviews, test generation, and debugging, can easily generate monthly bills exceeding $500 according to similar anecdotal reports. The token-intensive nature of agentic workflows, where Claude Code reads entire files, reasons across multiple steps, and writes back changes, compounds the cost far beyond what a single chat-style API call would.
Run Claude Code against a local model through Ollama and you pay zero marginal cost per query. The model runs on hardware already sitting on the desk.
This tutorial walks through the complete setup, from installing Ollama and pulling an appropriate coding model, to configuring Claude Code's environment variables, to running real coding tasks against a React and Node.js project. The target reader is a developer with intermediate familiarity with CLI tools, Node.js, and local development environments.
Claude Code version compatibility: Claude Code is under rapid development and its configuration interface, including supported environment variables, may change between releases. This guide documents one approach to local model routing via OpenAI-compatible endpoints. After installation, run claude --version and consult Anthropic's current documentation or claude --help to confirm the exact environment variable names supported by your installed version. If variable names have changed, adapt the instructions accordingly.
What Is Claude Code and Why Go Local?
Claude Code in 60 Seconds
Claude Code is Anthropic's agentic command-line coding tool. Unlike GitHub Copilot, which operates primarily as an inline autocomplete engine, or Cursor, which embeds AI within a custom IDE fork, Claude Code functions as a standalone CLI agent. It reads project files, reasons about codebases, writes and edits code across multiple files, runs shell commands, and iterates on its own output. Its default operating model requires an Anthropic API key, routing all requests to Claude Sonnet 4 or Claude Opus, with costs determined by token consumption. A typical multi-step agentic task can consume tens of thousands of tokens per interaction.
The Case for Local Models
Running Claude Code against a local model solves three problems. Privacy and data sovereignty come first: source code never leaves the developer's machine, which matters for proprietary codebases and organizations with strict data handling policies. You also eliminate per-query costs after the one-time hardware investment. And the setup works without an internet connection, so you keep working when connectivity drops.
The trade-offs deserve honest acknowledgment. Local models, even the best open-weight coding models in the 7B to 16B parameter range, do not match Claude Sonnet 4 or Opus in complex multi-file reasoning, nuanced architectural decisions, or large-context understanding. For straightforward tasks like boilerplate generation, refactoring, and test scaffolding, local models often produce usable output on the first attempt, particularly for single-file edits. For tasks requiring deep contextual reasoning across thousands of lines, the quality gap remains significant.
Understanding the Architecture: Claude Code + Ollama + OpenAI-Compatible APIs
How the Pieces Fit Together
Claude Code supports third-party model providers through OpenAI-compatible API endpoints. This is the mechanism that makes local usage possible. Ollama, a local model server, exposes exactly such an endpoint at localhost:11434/v1. When you configure the right environment variables, Claude Code sends its requests to this local endpoint instead of Anthropic's servers.
The request flow is straightforward:
Claude Code CLI → http://localhost:11434/v1/chat/completions → Ollama Server → Local LLM (e.g., qwen2.5-coder:14b)
[prompt] [OpenAI-compatible API] [inference] [response]
Claude Code constructs its prompts and tool-use payloads in the OpenAI chat completions format. Ollama receives these, runs inference on the specified local model, and returns the completion. From Claude Code's perspective, it talks to an OpenAI-compatible provider. From the model's perspective, it handles standard chat completion requests.
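To make the request flow concrete, here is a minimal sketch of the HTTP request an OpenAI-compatible client sends to Ollama. The helper name buildChatRequest is hypothetical, and the URL, model name, and placeholder key are simply the values used elsewhere in this guide; it illustrates the payload shape only and is not Claude Code's actual implementation.

```javascript
// Build the HTTP request an OpenAI-compatible client would send to Ollama.
// buildChatRequest is a hypothetical helper used only for illustration.
function buildChatRequest(baseUrl, model, userPrompt) {
  // Strip trailing slashes so the path is joined exactly once.
  const url = `${baseUrl.replace(/\/+$/, '')}/chat/completions`;
  const body = {
    model,
    stream: false,
    messages: [{ role: 'user', content: userPrompt }],
  };
  return {
    url,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        // Ollama does not check the key; the placeholder only satisfies clients.
        Authorization: 'Bearer not-a-real-key-local-ollama-only',
      },
      body: JSON.stringify(body),
    },
  };
}

const request = buildChatRequest(
  'http://localhost:11434/v1',
  'qwen2.5-coder:14b',
  'Write a hello world function in JavaScript'
);
console.log(request.url);
```

With an Ollama server running, passing request.url and request.init to the global fetch in Node.js 18 or later would send the request.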
Prerequisites and System Requirements
Hardware Considerations
Local LLM inference is memory-bound. The RAM figures below refer to available (free) RAM, not total installed RAM. For 7B parameter models at Q4 quantization, you need at least 16GB of available RAM. Running 13B or 14B parameter models comfortably requires 32GB or more, and models with 30B+ parameters typically demand 64GB of available RAM or a GPU with substantial VRAM. Higher quantization levels (e.g., Q8) roughly double the RAM requirement compared to Q4 variants.
For GPU acceleration, Ollama supports NVIDIA GPUs via CUDA, Apple Silicon via Metal (automatic on macOS), and AMD GPUs via ROCm on Linux. Disk space requirements vary by model: expect 4GB to 10GB per quantized model file.
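As a rough way to reason about these figures, the sketch below estimates the memory taken by the weights alone as parameters times bits per weight divided by 8. The function name estimateWeightsGB is hypothetical, and the estimate deliberately ignores the KV cache, runtime buffers, and the fact that real Q4 files average somewhat more than 4 bits per weight, so treat the result as a lower bound rather than a sizing guarantee.

```javascript
// Lower-bound estimate of weight memory: parameters x bits per weight / 8.
// Ignores KV cache, runtime buffers, and mixed-precision layers.
function estimateWeightsGB(paramsBillions, bitsPerWeight) {
  // Billions of parameters and bytes-per-GB (1e9) cancel out.
  return (paramsBillions * bitsPerWeight) / 8;
}

for (const [label, params] of [['7B', 7], ['14B', 14], ['34B', 34]]) {
  const q4 = estimateWeightsGB(params, 4);
  const q8 = estimateWeightsGB(params, 8);
  console.log(`${label}: ~${q4.toFixed(1)} GB at 4-bit, ~${q8.toFixed(1)} GB at 8-bit`);
}
```

The 8-bit figure is exactly twice the 4-bit figure in this model, which is where the "Q8 roughly doubles the requirement" rule of thumb comes from. Actual downloads are larger than the 4-bit estimate, so leave headroom.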
Software Requirements
The setup requires Node.js 18 or later (with npm), Ollama installed and running as a local server, and the Claude Code CLI installed globally via npm.
Step 1: Install and Configure Ollama
Installing Ollama
On macOS and Linux, Ollama installs with a single command. Windows users can download the installer from the Ollama website.
# macOS (via Homebrew)
brew install ollama
# Linux (official install script)
# For sensitive environments, download and inspect the script before executing:
# curl -fsSL https://ollama.com/install.sh -o install.sh && cat install.sh && sh install.sh
curl -fsSL https://ollama.com/install.sh | sh
# Verify installation
ollama --version
# Start the Ollama server
# On macOS (Homebrew): check if already running with `brew services list`.
# If not running: brew services start ollama
# Running `ollama serve` manually when the Homebrew service is active causes a port conflict.
# On Linux: ollama serve starts the server process.
ollama serve
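On macOS, start the Homebrew service with brew services start ollama if brew services list shows it is not already running. On Linux, the official install script typically registers a systemd service, so the server may already be running; otherwise ollama serve starts it in the foreground. Either way, verify that the server is listening on port 11434 before continuing.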
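One way to check that the server is listening is to request its root path, which answers with a short plain-text banner when Ollama is up. The sketch below assumes Node.js 18 or later (for the global fetch); isOllamaUp is a hypothetical helper, and the fetch implementation is injectable so the function can be tried without a running server.

```javascript
// Check whether an Ollama server answers on the given base URL.
// fetchImpl defaults to the global fetch available in Node.js 18+.
async function isOllamaUp(baseUrl = 'http://localhost:11434', fetchImpl = fetch) {
  try {
    const res = await fetchImpl(baseUrl);
    if (!res.ok) return false;
    const text = await res.text();
    // A running server replies on the root path with a plain-text banner.
    return text.includes('Ollama is running');
  } catch {
    // Connection refused and similar network errors mean "not reachable".
    return false;
  }
}
```

Calling isOllamaUp().then((up) => console.log(up ? 'listening' : 'not reachable')) reports the status for the default port.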
Pulling the Right Model
Not all models handle code generation equally. The following models are well-suited for coding tasks through Claude Code:
- For the best balance of quality and resource usage, pull qwen2.5-coder:14b. It handles multi-file edits in Python, TypeScript, and Go with fewer syntax errors than other models in this parameter range.
- deepseek-coder-v2:16b generates syntactically valid Python and JavaScript in single-file tasks (performance varies by task; evaluate against your own workload).
- Meta's codellama:13b is a purpose-built coding model released in 2023 on the older Llama 2 architecture, so the newer alternatives above generally produce better results.
- When RAM is tight, llama3.1:8b provides a lighter-weight general-purpose option.
Model choice directly affects output quality. Purpose-built coding models like Qwen 2.5 Coder produce noticeably better structured code, handle edge cases more reliably, and follow coding conventions more consistently than general-purpose models of equivalent size.
# Pull the recommended model
ollama pull qwen2.5-coder:14b
# Verify available models
ollama list
The ollama list command should show the model name, size, and modification date, confirming the weights are downloaded and ready.
Verifying the Local API
Before configuring Claude Code, confirm that Ollama's OpenAI-compatible endpoint is responding:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer not-a-real-key-local-ollama-only" \
-d '{
"model": "qwen2.5-coder:14b",
"stream": false,
"messages": [{"role": "user", "content": "Write a hello world function in JavaScript"}]
}'
A successful response returns a single JSON object containing the model's completion. If this command fails with "connection refused," Ollama is not running. If it returns a model-not-found error, the model name does not match what was pulled.
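For scripted checks, the completion text sits at choices[0].message.content in an OpenAI-style chat completion response. The sketch below shows one way to extract it; extractCompletion is a hypothetical helper, and the sample object is an abbreviated, illustrative response rather than captured output.

```javascript
// Pull the assistant text out of an OpenAI-style chat completion response.
function extractCompletion(response) {
  const content = response?.choices?.[0]?.message?.content;
  if (typeof content !== 'string') {
    throw new Error('Unexpected response shape: no choices[0].message.content');
  }
  return content;
}

// Abbreviated sample in the chat-completions shape (values are illustrative).
const sample = {
  object: 'chat.completion',
  model: 'qwen2.5-coder:14b',
  choices: [
    {
      index: 0,
      message: { role: 'assistant', content: 'function hello() {}' },
      finish_reason: 'stop',
    },
  ],
};

console.log(extractCompletion(sample));
```

Throwing on an unexpected shape makes error payloads, such as a model-not-found response, fail loudly instead of yielding undefined.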
Step 2: Install and Configure Claude Code for Local Use
Installing Claude Code CLI
Install Claude Code globally through npm:
npm install -g @anthropic-ai/claude-code
# Verify the installation
claude --version
This installs the claude command globally. The CLI requires Node.js 18 or later. Note the version number displayed — the environment variables described below are version-dependent. Run claude --help to confirm the supported configuration options for your version.
Configuring Claude Code to Use Ollama
First: if you have ANTHROPIC_API_KEY set in your environment, unset it. Leaving it set may cause Claude Code to route requests to Anthropic's API instead of Ollama, silently incurring costs.
unset ANTHROPIC_API_KEY
You configure Claude Code's third-party provider support with environment variables. The exact variable names depend on your Claude Code version. Run claude --help to confirm the correct names. The variables below represent one documented configuration approach — verify them against the current Anthropic documentation for your installed version:
# Use a placeholder that cannot be mistaken for a real credential.
# Prefer placing these in a .env file loaded via `direnv` or `source .env`
# rather than inline shell exports, so the values do not appear in shell history.
export OPENAI_API_KEY="not-a-real-key-local-ollama-only"
export ANTHROPIC_BASE_URL="http://localhost:11434/v1"
export CLAUDE_CODE_USE_OPENAI=1
export CLAUDE_MODEL="qwen2.5-coder:14b"
Version-dependent variables: The variable names CLAUDE_CODE_USE_OPENAI, CLAUDE_MODEL, and the choice between ANTHROPIC_BASE_URL and OPENAI_BASE_URL may differ across Claude Code releases. Confirm them with claude --help or the Anthropic documentation for your version. If the variables are incorrect, Claude Code may silently fall back to the Anthropic API, incurring costs.
You set OPENAI_API_KEY to a placeholder string because Ollama does not require authentication, but Claude Code refuses to start without a non-empty key value. ANTHROPIC_BASE_URL points to the local Ollama server's OpenAI-compatible API path. CLAUDE_CODE_USE_OPENAI signals Claude Code to use the OpenAI-compatible provider path rather than the Anthropic API. CLAUDE_MODEL specifies which Ollama model to use and must match the model name exactly as shown by ollama list, including the tag (e.g., :14b).
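A small preflight script can catch the most common mistakes before launch. The sketch below checks the four variables exactly as named in this guide, so adjust the names if your Claude Code version uses different ones; checkLocalEnv is a hypothetical helper and not part of any CLI.

```javascript
// Preflight check for the environment variables used in this guide.
// Names may differ in your Claude Code version; adapt them as needed.
function checkLocalEnv(env) {
  const problems = [];
  if (env.ANTHROPIC_API_KEY) {
    problems.push('ANTHROPIC_API_KEY is set; unset it to avoid accidental API billing.');
  }
  let host = null;
  try {
    host = new URL(env.ANTHROPIC_BASE_URL ?? '').hostname;
  } catch {
    problems.push('ANTHROPIC_BASE_URL is missing or not a valid URL.');
  }
  if (host !== null && !['localhost', '127.0.0.1', '[::1]'].includes(host)) {
    problems.push(`ANTHROPIC_BASE_URL points at ${host}, not the local machine.`);
  }
  if (!env.OPENAI_API_KEY) {
    problems.push('OPENAI_API_KEY must be a non-empty placeholder.');
  }
  if (env.CLAUDE_CODE_USE_OPENAI !== '1') {
    problems.push('CLAUDE_CODE_USE_OPENAI is not set to 1.');
  }
  if (!env.CLAUDE_MODEL) {
    problems.push('CLAUDE_MODEL is not set.');
  } else if (!env.CLAUDE_MODEL.includes(':')) {
    problems.push('CLAUDE_MODEL has no tag; ollama list shows names with tags such as :14b.');
  }
  return problems;
}

// Report any problems found in the current shell environment.
for (const problem of checkLocalEnv(process.env)) {
  console.warn(problem);
}
```

An empty result only means the variables look consistent; it does not prove that requests are routed locally.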
For persistence, add these exports to ~/.bashrc, ~/.zshrc, or a project-level .env file. If using a project-level .env file, ensure it is listed in .gitignore to prevent accidental commits.
Windows users (PowerShell):
$env:OPENAI_API_KEY="not-a-real-key-local-ollama-only"
$env:ANTHROPIC_BASE_URL="http://localhost:11434/v1"
$env:CLAUDE_CODE_USE_OPENAI="1"
$env:CLAUDE_MODEL="qwen2.5-coder:14b"
Launching Claude Code in Local Mode
With the environment variables set, start Claude Code in any project directory:
cd /path/to/your/project
claude
On startup, Claude Code should display the configured model name (e.g., qwen2.5-coder:14b) rather than a Claude Sonnet or Opus identifier. This is an initial indicator that configuration was applied, but displaying the model name alone does not guarantee local routing — the configured variable value could be shown even if routing fails. To definitively confirm that requests reach Ollama, monitor connections during a session:
# In a separate terminal, verify traffic reaches Ollama:
lsof -i :11434 | grep ESTABLISHED
# Windows equivalent:
# netstat -ano | findstr :11434
You should see an active TCP connection to 127.0.0.1:11434. If no connection is shown, requests may be going to Anthropic's servers.
Step 3: Take It for a Spin with a React + Node.js Project
Scaffolding a Test Project
Create a minimal project that gives Claude Code real files to work with:
npm create vite@latest test-project -- --template react
cd test-project
npm install
npm install express
Add a minimal Express server at the project root. Because the Vite scaffold creates an ES module project ("type": "module" in package.json), the CommonJS require() syntax will not work in a .js file by default. Either name the file server.cjs, move the server into a subdirectory with its own package.json containing "type": "commonjs", or rewrite it using ES module import syntax. The example below uses the .cjs approach:
// server.cjs
const express = require('express');
const app = express();
const PORT = process.env.PORT ?? 3001;
app.use(express.json());
app.get('/', (req, res) => {
res.json({ message: 'Server is running' });
});
const server = app.listen(PORT, () => {
console.log(`Server listening on port ${PORT}`);
});
server.on('error', (err) => {
if (err.code === 'EADDRINUSE') {
console.error(`Port ${PORT} is already in use. Set PORT env var to use a different port.`);
} else {
console.error('Server failed to start:', err);
}
process.exit(1);
});
This provides both a React frontend and a Node.js backend for Claude Code to operate on.
Running Real Coding Tasks
With Claude Code running in the project directory, issue a practical prompt:
Add a /api/health endpoint to server.cjs that returns { status: "healthy", uptime: process.uptime() }
and create a React component called HealthStatus that fetches and displays this data.
With qwen2.5-coder:14b, expect output structured like this (your results will vary based on prompt phrasing and model state):
// Addition to server.cjs
app.get('/api/health', (req, res) => {
res.json({
status: 'healthy',
uptime: process.uptime(),
timestamp: new Date().toISOString()
});
});
// src/components/HealthStatus.jsx
import { useState, useEffect } from 'react';
const API_BASE = import.meta.env.VITE_API_BASE ?? 'http://localhost:3001';
function HealthStatus() {
const [health, setHealth] = useState(null);
const [loading, setLoading] = useState(true);
const [error, setError] = useState(null);
useEffect(() => {
const controller = new AbortController();
fetch(`${API_BASE}/api/health`, { signal: controller.signal })
.then((res) => {
if (!res.ok) throw new Error(`HTTP ${res.status}`);
return res.json();
})
.then((data) => {
setHealth(data);
setLoading(false);
})
.catch((err) => {
if (err.name === 'AbortError') return;
console.error('Failed to fetch health status:', err);
setError(err.message);
setLoading(false);
});
return () => controller.abort();
}, []);
if (loading) return <p>Loading health status...</p>;
if (error) return <p>Error: {error}</p>;
if (!health) return <p>Unable to reach server.</p>;
return (
<div>
<h2>Server Health</h2>
<p>Status: {health.status}</p>
<p>Uptime: {Math.round(health.uptime)}s</p>
</div>
);
}
export default HealthStatus;
Note on fetch URLs: The React frontend runs on the Vite dev server (typically port 5173), while the Express backend runs on port 3001. The component above uses the VITE_API_BASE environment variable to configure the API origin, falling back to http://localhost:3001 for local development. For production or containerised deployments, set VITE_API_BASE to the appropriate backend URL. Alternatively, configure a Vite proxy by adding server: { proxy: { '/api': 'http://localhost:3001' } } to vite.config.js and use relative fetch paths.
Claude Code's agentic capabilities mean it reads the existing server.cjs, identifies where to insert the new endpoint, writes the changes, creates the new component file, and can even update imports in App.jsx if prompted.
Evaluating Output Quality
Local models in the 7B to 14B range handle boilerplate code, CRUD endpoint generation, simple component creation, test scaffolding, and straightforward refactoring well. For single-endpoint handlers and isolated component files, they often produce usable output on the first attempt with little or no manual correction.
Where local models fall short is in complex multi-file reasoning: tracing a bug across several interconnected modules, making architectural decisions that require understanding a full codebase's patterns, or producing correct output when the context window fills up. Claude Sonnet 4 handles these scenarios with noticeably higher accuracy. For example, Sonnet is more likely to trace cross-module type errors that qwen2.5-coder:14b can miss even after multiple attempts, and it maintains coherence across longer context windows.
Performance Tuning and Optimization
Ollama Configuration for Better Performance
Ollama exposes several environment variables and configuration options that affect inference speed:
# Allow parallel request handling. The model weights are loaded once, but each
# parallel slot gets its own context allocation, so memory use grows with
# OLLAMA_NUM_PARALLEL x context length. Check available memory with `free -h`
# (Linux) or `vm_stat` (macOS) before raising above 1.
export OLLAMA_NUM_PARALLEL=2
# Context window configuration:
# OLLAMA_NUM_CTX as a shell export is NOT reliably supported across all
# Ollama versions as a global override. Use one of the two verified methods:
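# Method 1: per-request (pass in the options field of the request body). This
# applies to Ollama's native /api/chat and /api/generate endpoints; the
# OpenAI-compatible /v1 endpoint has no equivalent field, so prefer Method 2
# when the client only speaks the OpenAI format:
# "options": { "num_ctx": 8192 }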
# Method 2: persistent per-model Modelfile override:
# ollama show qwen2.5-coder:14b --modelfile > Modelfile.custom
# # Add or update: PARAMETER num_ctx 8192
# ollama create qwen2.5-coder-ctx8k -f Modelfile.custom
# # Then use model name: qwen2.5-coder-ctx8k
# Verify active context length (no output means the Modelfile sets no num_ctx
# and Ollama's default applies):
ollama show qwen2.5-coder:14b --modelfile | grep num_ctx
# For NVIDIA GPUs, control GPU layer offloading in a Modelfile:
# PARAMETER num_gpu 99
# (This instructs Ollama to offload up to 99 layers to the GPU;
# the actual number offloaded is capped by available VRAM.)
Setting OLLAMA_NUM_PARALLEL above 1 enables concurrent request handling, which matters less for single-user Claude Code sessions but helps if other tools share the same Ollama instance. Increasing the context length allows the model to reason over more code at once, but increases memory consumption significantly; very long contexts can consume substantially more memory than the base model load.
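To illustrate the per-request method, the sketch below builds a request body for Ollama's native chat endpoint (/api/chat) with a num_ctx override in the options field. buildNativeChatBody is a hypothetical helper, and the prompt text is illustrative; whether a given client forwards such options is outside the scope of this sketch.

```javascript
// Request body for Ollama's native /api/chat endpoint with a per-request
// context window. buildNativeChatBody is a hypothetical helper.
function buildNativeChatBody(model, userPrompt, numCtx) {
  if (!Number.isInteger(numCtx) || numCtx <= 0) {
    throw new RangeError('numCtx must be a positive integer');
  }
  return {
    model,
    stream: false,
    messages: [{ role: 'user', content: userPrompt }],
    // Ollama-specific runtime options; num_ctx sets the context length in tokens.
    options: { num_ctx: numCtx },
  };
}

const nativeBody = buildNativeChatBody('qwen2.5-coder:14b', 'Summarize server.cjs', 8192);
console.log(JSON.stringify(nativeBody, null, 2));
```

Posting this body to http://localhost:11434/api/chat applies the larger context to that request only, which makes it convenient for measuring memory use before committing to a Modelfile override.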
Choosing the Right Model for the Task
A practical strategy is to keep multiple models pulled and switch between them. Use a smaller model like llama3.1:8b for quick completions and simple edits where speed matters. Switch to qwen2.5-coder:14b or deepseek-coder-v2:16b for tasks requiring higher code quality. Switching models requires only changing the CLAUDE_MODEL environment variable (or the equivalent for your Claude Code version) and restarting Claude Code.
Complete Implementation Checklist and Model Comparison Table
Setup Checklist
- Install Ollama (brew install ollama or curl install script) and verify with ollama --version
- Start Ollama server (ollama serve or brew services start ollama on macOS) and confirm port 11434 is listening
- Pull a coding model (ollama pull qwen2.5-coder:14b) and verify with ollama list
- Test the API endpoint with curl http://localhost:11434/v1/chat/completions (include "stream": false in the request body)
- Install Claude Code (npm install -g @anthropic-ai/claude-code) and verify with claude --version
- Unset ANTHROPIC_API_KEY if present (unset ANTHROPIC_API_KEY)
- Check claude --help to confirm the correct environment variable names for your version
- Set environment variables (OPENAI_API_KEY, ANTHROPIC_BASE_URL, CLAUDE_CODE_USE_OPENAI, CLAUDE_MODEL), adapting variable names if your version differs
- Launch Claude Code in a project directory and confirm the model name in startup output
- Run lsof -i :11434 (or netstat -ano | findstr :11434 on Windows) during a session to verify local routing
- Run a test prompt and verify the response comes from the local model
Local Coding Model Comparison Table
| Model | Size | Min. Free RAM (Q4) | Coding Quality* | Speed | Best For |
|---|---|---|---|---|---|
| llama3.1:8b | ~4.7GB | 16GB | Moderate | Fast | Quick completions, simple edits |
| codellama:13b | ~7.4GB | 32GB** | Good | Moderate | General code generation |
| qwen2.5-coder:14b | ~8.9GB | 32GB | Very Good | Moderate | Best overall for coding tasks |
| deepseek-coder-v2:16b | ~9.1GB | 32GB | Very Good | Moderate | Complex code generation |
| codellama:34b | ~19GB | 64GB | Excellent | Slow | Maximum local quality |
| llama3.1:70b | ~40GB | 64GB+ | Excellent | Very Slow | Near-API quality (if hardware allows) |
*Coding Quality ratings reflect informal single-file pass rates on HumanEval-style tasks. "Moderate" = frequent manual fixes needed; "Good" = occasional fixes; "Very Good" = first-attempt success on most single-file tasks; "Excellent" = consistent first-attempt success including multi-function files.
**16GB is the technical minimum for codellama:13b; 32GB is recommended for stable inference without swapping.
Sizes and RAM figures assume Q4 quantization; Q8 quantization approximately doubles RAM requirements. Verify actual on-disk size with ollama list after pulling.
Best overall pick: qwen2.5-coder:14b offers the strongest balance of code generation quality, reasonable resource requirements, and practical inference speed for iterative development workflows.
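The table can also be encoded as data to shortlist models for a given machine. In the sketch below the figures are copied from the table above, the "64GB+" entry is treated as 64 for simplicity, and modelsThatFit is a hypothetical helper; it filters on the minimum free RAM column only and says nothing about speed or quality.

```javascript
// Figures copied from the comparison table above (Q4 quantization).
// The table lists llama3.1:70b as "64GB+"; 64 is used here as a floor.
const MODELS = [
  { name: 'llama3.1:8b', sizeGB: 4.7, minFreeRamGB: 16 },
  { name: 'codellama:13b', sizeGB: 7.4, minFreeRamGB: 32 },
  { name: 'qwen2.5-coder:14b', sizeGB: 8.9, minFreeRamGB: 32 },
  { name: 'deepseek-coder-v2:16b', sizeGB: 9.1, minFreeRamGB: 32 },
  { name: 'codellama:34b', sizeGB: 19, minFreeRamGB: 64 },
  { name: 'llama3.1:70b', sizeGB: 40, minFreeRamGB: 64 },
];

// Return the names of models whose minimum free-RAM figure fits the machine.
function modelsThatFit(freeRamGB) {
  return MODELS.filter((m) => m.minFreeRamGB <= freeRamGB).map((m) => m.name);
}

console.log(modelsThatFit(32));
```

To use the machine's current free memory, pass require('node:os').freemem() / 1024 ** 3 as the argument.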
Troubleshooting Common Issues
Connection Refused or Model Not Found
If Claude Code reports connection errors, verify that ollama serve is running and that http://localhost:11434 responds to requests. On macOS, check whether the Homebrew service is already running with brew services list — running ollama serve manually when the service is active causes a port conflict. A "model not found" error means the value in CLAUDE_MODEL does not exactly match the model name shown by ollama list, including the tag (e.g., :14b).
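A name mismatch is easy to diagnose programmatically. The sketch below compares a configured model name against the list Ollama reports from its GET /api/tags endpoint, whose response has the shape { models: [{ name, ... }] }. diagnoseModelName is a hypothetical helper, and the sample data is illustrative.

```javascript
// Compare a configured model name against the names Ollama reports.
// tagsResponse has the shape returned by GET /api/tags.
function diagnoseModelName(configured, tagsResponse) {
  const names = (tagsResponse.models ?? []).map((m) => m.name);
  if (names.includes(configured)) {
    return { ok: true, suggestions: [] };
  }
  // Suggest pulled models that share the same base name before the tag.
  const base = configured.split(':')[0];
  const suggestions = names.filter((n) => n.split(':')[0] === base);
  return { ok: false, suggestions };
}

// Illustrative data standing in for a real /api/tags response.
const pulled = { models: [{ name: 'qwen2.5-coder:14b' }, { name: 'llama3.1:8b' }] };
console.log(diagnoseModelName('qwen2.5-coder:7b', pulled));
```

Here the configured tag :7b was never pulled, so the helper reports a mismatch and suggests the :14b variant that is available.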
Slow Responses or Out-of-Memory Errors
If inference is unacceptably slow or the system runs out of memory, reduce the context window (via the Modelfile PARAMETER num_ctx or the per-request options field), switch to a smaller quantized model, or verify that GPU offloading is active. On NVIDIA systems, nvidia-smi confirms whether Ollama is utilizing the GPU. On Apple Silicon, Metal acceleration is automatic.
Claude Code Ignoring Local Config
Conflicting environment variables are a common cause of routing mistakes. If you have an ANTHROPIC_API_KEY set in the shell environment or in a global configuration file, Claude Code may prioritize the Anthropic provider over the OpenAI-compatible path. Unset any Anthropic-specific variables (unset ANTHROPIC_API_KEY) before launching Claude Code in local mode. Additionally, verify that the environment variable names you are using match those supported by your installed Claude Code version; run claude --help to confirm.
Warning: If environment variables are misconfigured, Claude Code may silently route requests to Anthropic's API, incurring unexpected costs. Always verify local routing by checking for active connections to localhost:11434 during your session.
When to Use Local vs. API: A Practical Framework
Use local models for iterative development, boilerplate generation, test writing, refactoring, and work on private or proprietary codebases where data must not leave the machine. Use the Anthropic API for complex architectural reasoning, large-context multi-file changes that exceed local model capabilities, and code that ships to production without additional human review.
The most practical approach is a hybrid one: default to local for the bulk of daily coding tasks and switch to the API selectively for heavy lifts. This pattern captures the majority of cost savings while preserving access to frontier model quality when it matters.
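The hybrid rule of thumb can be written down as a small decision function. Everything in the sketch below is an assumption made for illustration: the task fields, the chooseBackend name, and especially the LOCAL_FILE_LIMIT threshold, which is arbitrary and should be tuned to your own model and codebase.

```javascript
// Arbitrary illustrative threshold for "large multi-file change".
const LOCAL_FILE_LIMIT = 3;

// Map a hypothetical task descriptor to 'local' or 'api'.
function chooseBackend(task) {
  // Privacy requirement wins: the code must not leave the machine.
  if (task.privateCode) return 'local';
  // Unreviewed production changes and architectural work go to the frontier model.
  if (task.shipsWithoutReview || task.kind === 'architecture') return 'api';
  // Large multi-file changes tend to exceed what small local models handle well.
  if ((task.filesTouched ?? 1) > LOCAL_FILE_LIMIT) return 'api';
  return 'local';
}

console.log(chooseBackend({ kind: 'tests', filesTouched: 1 }));
console.log(chooseBackend({ kind: 'architecture', filesTouched: 12 }));
```

Putting the privacy check first reflects the constraint that proprietary code stays local even when the task would otherwise benefit from the API.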
What Comes Next
This setup eliminates API costs for the majority of routine coding agent interactions. Developers who previously spent $100 or more per day on Anthropic API credits can reserve that spend for tasks that genuinely require frontier model capabilities. Actual savings depend on individual workflow composition and the ratio of local-suitable tasks to those requiring frontier models.
From here, the natural next steps are experimenting with additional models as the open-weight ecosystem evolves and creating task-specific Modelfile configurations tuned for particular programming languages or frameworks. Beyond that, you can integrate local Claude Code sessions into CI workflows for automated code review on private repositories.

