The demand for a local AI coding assistant has surged as developers seek GitHub Copilot-level code suggestions without routing proprietary source code through third-party servers or paying recurring subscription fees. Two viable paths now exist: Cursor, a VS Code fork with built-in AI capabilities, and a fully open stack combining VS Code, Ollama, and the Continue extension.
With DeepSeek-Coder-V2 and recent Continue extension updates narrowing the gap with cloud-based alternatives, a private, on-device setup is a realistic daily driver for intermediate developers. This tutorial walks through both approaches, compares them head-to-head, and provides working configurations to get started.
Table of Contents
- Why Go Local? The Privacy and Cost Case for On-Device AI
- The Two Paths: Cursor vs VS Code + Ollama + Continue
- Setting Up Ollama with DeepSeek-Coder-V2
- Integrating Continue into VS Code
- The Cursor Alternative: Built-In AI with a Privacy Mode
- Quality Comparison: Local 7B/16B Models vs GPT-4o
- Implementation Checklist and Recommendations
- Final Thoughts
Tested Versions: This tutorial was tested with Ollama, the Continue extension for VS Code, and the deepseek-coder-v2:16b model tag as served by Ollama. Because all three components are under active development, check their respective changelogs if setup behavior differs from what is described here. Run ollama --version and check the Continue extension version in the VS Code Extensions panel to confirm your installed versions.
Why Go Local? The Privacy and Cost Case for On-Device AI
The Code Privacy Problem with Cloud AI
Every prompt sent to a cloud-hosted coding assistant carries fragments of the codebase with it: proprietary business logic, internal API structures, environment variables, and sometimes credentials that developers forget to strip from context. For organizations operating under GDPR, SOC 2, or HIPAA compliance frameworks, the mere transmission of source code to external servers can trigger audit findings or outright policy violations. Many enterprise security teams have blanket prohibitions against cloud AI tools touching production codebases, regardless of vendor data-retention promises. The risk calculus is straightforward. If the code never leaves the machine, an entire category of data exposure simply disappears.
What You Give Up and What You Don't
Honesty matters here. Local models in the 7B to 16B parameter range are smaller and less capable than GPT-4o or Claude 3.5 Sonnet, and on modest hardware they are slower as well. They will not match that output quality on complex architectural reasoning, multi-file refactoring, or unfamiliar API usage. For inline completions, boilerplate generation, and context-aware single-file suggestions in well-represented domains such as standard React and Express patterns, however, they usually produce usable first-try completions. That routine work fills most autocomplete interactions in a typical coding session. And after the initial hardware investment, the marginal cost is zero: no monthly subscription, no per-token billing, no usage caps.
For a broader look at the AI coding assistant landscape, see our overview of AI coding assistants.
The Two Paths: Cursor vs VS Code + Ollama + Continue
Before diving into setup, it helps to understand the two architectures at a high level.
Path A: Cursor is a fork of VS Code with AI features built directly into the editor. It ships with tight integration for cloud models like GPT-4o and Claude on paid plans, but also offers a privacy mode and experimental local model support via Ollama. It requires zero configuration out of the box: preconfigured keybindings, a working chat panel, and automatic cloud model setup on paid tiers. VS Code extensions remain compatible.
Path B: VS Code + Ollama + Continue is a fully open, modular stack. Ollama serves as a local model inference server, Continue acts as the AI interface inside VS Code, and the developer retains complete control over model selection, context providers, and data flow. Nothing leaves the machine unless explicitly configured to do so.
The sections that follow walk through Path B step by step, then evaluate how Path A compares.
Setting Up Ollama with DeepSeek-Coder-V2
Installing Ollama
Ollama provides installers for macOS, Linux, and Windows. For security-hardened or CI/CD environments, download and verify the install script before executing, or use a versioned binary directly from the Ollama GitHub releases page:
# Install Ollama (macOS/Linux) — integrity-verified path
# Option A: Review script manually before execution
curl -fsSL https://ollama.com/install.sh -o /tmp/ollama_install.sh
# Verify checksum against published value at https://github.com/ollama/ollama/releases
shasum -a 256 /tmp/ollama_install.sh
# If checksum matches, execute:
sh /tmp/ollama_install.sh
# Option B (preferred for CI/hardened environments): download versioned binary directly
# https://github.com/ollama/ollama/releases — pin to a specific release tag
# Verify installation
ollama --version
On Windows 10/11, download and run the installer from ollama.com, then verify from PowerShell or Command Prompt with the same ollama --version command.
Pulling and Running DeepSeek-Coder-V2
DeepSeek-Coder-V2 is the recommended model for this setup. It scores well on HumanEval and MBPP (see the model card for specific numbers), supports native fill-in-the-middle (critical for inline autocomplete), and follows multi-step prompts without dropping constraints in most single-file tasks. It comes in two practical sizes: the 16B parameter variant, which requires approximately 16GB of VRAM for GPU-accelerated inference (RTX 3080-class or Apple M2 Pro with 16GB unified memory), and a lite variant suitable for machines with less VRAM. CPU-only inference with the 16B model requires 16GB+ RAM but produces tokens too slowly (typically 1-3 tokens/sec) for practical autocomplete use. Check available VRAM with nvidia-smi (NVIDIA) or system_profiler SPDisplaysDataType (macOS).
For machines with sufficient VRAM, the 16B model is the better choice. It produces noticeably higher-quality completions, especially for framework-specific patterns in React, Express, and similar ecosystems. If you only have 8GB of VRAM or are running CPU-only, use the lite variant instead -- verify the specific tag's quantization level and memory requirements with ollama show deepseek-coder-v2:lite --modelfile.
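To make the sizing guidance above concrete, the following is a minimal sketch of the usual back-of-the-envelope estimate: parameter count times bits per weight. It covers weights only, so the real footprint of a given Ollama tag will be higher and depends on its quantization. Both function names are hypothetical, and the 16GB cutoff simply mirrors the rule of thumb stated in this section.

```javascript
// Rough weight-only memory estimate: parameters x bits per weight / 8 bytes.
// Excludes KV cache, activations, and runtime overhead, so treat it as a floor.
function estimateWeightGiB(paramsBillions, bitsPerWeight) {
  const bytes = paramsBillions * 1e9 * (bitsPerWeight / 8);
  return bytes / 2 ** 30;
}

// Hypothetical chooser mirroring this article's rule of thumb:
// 16GB or more of VRAM -> 16b tag, otherwise the lite tag.
function chooseModelTag(vramGiB) {
  return vramGiB >= 16 ? 'deepseek-coder-v2:16b' : 'deepseek-coder-v2:lite';
}

// Example: compare a 16B model at common quantization widths.
for (const bits of [4, 8, 16]) {
  console.log(`${bits}-bit: ~${estimateWeightGiB(16, bits).toFixed(1)} GiB of weights`);
}
```

The spread between 4-bit and 16-bit weights is why checking the quantization level of the specific tag matters before assuming a model will or will not fit.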
Confirm FIM is active by checking Continue's autocomplete logs (Continue panel → Output). The model tag deepseek-coder-v2:16b on Ollama uses the instruct template by default; verify that Continue's tabAutocompleteOptions specifies "template": "deepseek" or the appropriate FIM template for your Continue version.
# Pull the model
ollama pull deepseek-coder-v2:16b
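# Quick smoke test (with timeout to prevent indefinite hang on slow hardware)
# Linux (GNU coreutils). On macOS, timeout may not be installed by default;
# Homebrew's coreutils package provides it as gtimeout, or drop the prefix.
timeout 120 ollama run deepseek-coder-v2:16b "Write a React useState hook that manages a shopping cart array"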
The smoke test should return a coherent React hook implementation within 5-15 seconds on GPU-accelerated hardware; CPU-only inference will be substantially slower. If response time exceeds 60 seconds, switch to the lite variant.
Confirming the Local API Is Running
Ollama exposes a local REST API at localhost:11434 whenever the Ollama server is running; models are loaded on demand when a request names them. Ensure Ollama is running (ollama serve or via system service) before launching VS Code with Continue -- Ollama may not auto-start after a reboot depending on your installation method. This endpoint serves Continue and other local tools.
Security note: The Ollama API on localhost:11434 has no authentication by default. Any process on the host can send requests to it. Be aware of this in multi-user or shared environments.
A quick test confirms everything is wired up:
macOS/Linux:
# With timeout and error output
curl --max-time 120 --silent --show-error \
  http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-coder-v2:16b",
    "prompt": "function fibonacci(n) {",
    "stream": false
  }' | python3 -m json.tool
Windows (PowerShell -- recommended over cmd.exe to avoid quoting issues):
# PowerShell (Windows) — avoids cmd.exe quoting issues entirely
$body = @{
    model  = "deepseek-coder-v2:16b"
    prompt = "function fibonacci(n) {"
    stream = $false
} | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
    -Method POST `
    -ContentType "application/json" `
    -Body $body `
    -TimeoutSec 120
A successful response returns a JSON object containing the model's completion of the Fibonacci function. If the request fails, verify that Ollama is running in the background (ollama serve) and that no firewall rules are blocking localhost traffic on port 11434.
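The same check can be scripted from Node.js, which also makes it easy to measure generation speed against the tokens-per-second figures mentioned earlier. This is a sketch, assuming Node 18+ (for the global fetch) and assuming the non-streaming response carries response, eval_count, and eval_duration (in nanoseconds) as described in Ollama's API documentation; confirm the field names against your installed version. The helper names are hypothetical.

```javascript
const OLLAMA_BASE = 'http://localhost:11434';

// Build the same request the curl example sends.
function buildGenerateRequest(model, prompt) {
  return {
    url: `${OLLAMA_BASE}/api/generate`,
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model, prompt, stream: false }),
    },
  };
}

// Assumption: eval_count is the number of generated tokens and
// eval_duration is the generation time in nanoseconds.
function tokensPerSecond(result) {
  if (!result.eval_count || !result.eval_duration) return null;
  return result.eval_count / (result.eval_duration / 1e9);
}

// Sends one prompt and reports the completion plus measured throughput.
async function smokeTest(model, prompt, timeoutMs = 120000) {
  const { url, options } = buildGenerateRequest(model, prompt);
  const res = await fetch(url, { ...options, signal: AbortSignal.timeout(timeoutMs) });
  if (!res.ok) throw new Error(`Ollama returned HTTP ${res.status}`);
  const result = await res.json();
  return { text: result.response, tokensPerSecond: tokensPerSecond(result) };
}
```

Calling smokeTest('deepseek-coder-v2:16b', 'function fibonacci(n) {').then(console.log) with Ollama running should print the completion and a throughput number. A value in the low single digits suggests the model is running on CPU.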
For deeper analysis of how local models perform across different hardware configurations, see our guide to benchmarking local models.
Integrating Continue into VS Code
Installing the Continue Extension
Continue is available directly from the VS Code marketplace. Search for "Continue" by publisher Continue (ID: Continue.continue), or install via the terminal:
# Verify that the 'code' CLI is available before running
which code || echo "VS Code 'code' command not found — add it to PATH first (macOS: Cmd+Shift+P → 'Shell Command: Install')"
code --install-extension Continue.continue
Installing Continue adds a chat panel to the sidebar, inline completion capabilities, and a set of keyboard shortcuts for triggering AI actions. The default keybindings place chat access on Cmd+L (macOS) or Ctrl+L (Linux/Windows), while inline editing uses Cmd+I or Ctrl+I.
Configuring Continue for Ollama
Continue stores its configuration in ~/.continue/config.json (Continue v0.7.x) or ~/.continue/config.yaml (Continue v0.8+). On Windows, the path is %USERPROFILE%\.continue\. Check your installed version in the VS Code Extensions panel and consult docs.continue.dev for the matching schema. The config file tells Continue which models to load, how to connect, and which context providers to use. The following configuration (JSON format, applicable to Continue v0.7.x) sets up DeepSeek-Coder-V2 as both the chat model and the tab autocomplete model, with context providers enabled:
{
  "models": [
    {
      "title": "DeepSeek Coder V2 (Local)",
      "provider": "ollama",
      "model": "deepseek-coder-v2:16b",
      "apiBase": "http://localhost:11434",
      "requestOptions": {
        "timeout": 30000
      }
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder-v2:16b",
    "apiBase": "http://localhost:11434",
    "requestOptions": {
      "timeout": 8000
    },
    "maxTokens": 1024
  },
  "contextProviders": [
    { "name": "code" },
    { "name": "docs", "params": { "urls": [] } },
    { "name": "diff" }
  ]
}
After saving the configuration, harden the file permissions to prevent other users or processes from reading any API keys that may be added later:
chmod 600 ~/.continue/config.json
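To confirm the permission change took effect, the mode bits can be inspected programmatically. The sketch below assumes a POSIX system (on Windows, Node reports mode bits that do not reflect ACLs, so the result is not meaningful there). Both helper names are hypothetical.

```javascript
const fs = require('node:fs');

// True when group and others have no permission bits set (for example 0o600).
function isOwnerOnly(mode) {
  return (mode & 0o077) === 0;
}

// Hypothetical helper: report the permission bits of a file on disk.
function checkConfigPermissions(filePath) {
  const { mode } = fs.statSync(filePath);
  return {
    path: filePath,
    octal: (mode & 0o777).toString(8),
    ownerOnly: isOwnerOnly(mode),
  };
}
```

For example, checkConfigPermissions(require('node:os').homedir() + '/.continue/config.json') should report ownerOnly: true after the chmod above.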
Note on the terminal context provider: A terminal provider is available in Continue that shares recent terminal output with the AI model. This is excluded from the default configuration above because terminal output can inadvertently contain environment variables, secrets, or command history. Only add { "name": "terminal" } to contextProviders if you have confirmed that no sensitive data appears in your terminal sessions.
Connecting Context Providers
Continue uses context providers to inject relevant information into prompts beyond the current file. The code provider allows referencing specific files or symbols with @code in chat. The docs provider enables indexing project-specific documentation that the model can query, which is particularly useful for internal APIs or custom libraries. The diff provider feeds current git changes into the context.
In practice, typing @code followed by a filename in the chat panel lets the model reason about code in other files. Point the @docs provider at local Markdown files or external URLs to give the model awareness of project conventions. Note that pointing @docs at external URLs requires outbound network requests. For a fully offline setup, limit doc sources to local file paths.
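Since a fully offline setup depends on a handful of config fields staying local, it can help to audit the parsed config before trusting it. The following is a sketch, not part of Continue: auditContinueConfig is a hypothetical helper, it assumes the JSON schema shown earlier in this section, and it assumes docs sources are listed as plain strings under params.urls.

```javascript
// Returns true when a URL points at the local machine.
function isLoopback(urlString) {
  try {
    const { hostname } = new URL(urlString);
    return ['localhost', '127.0.0.1', '[::1]'].includes(hostname);
  } catch {
    return false; // not a parseable URL
  }
}

// Hypothetical audit helper: lists settings that could move data off-machine.
function auditContinueConfig(config) {
  const warnings = [];
  const models = [...(config.models ?? [])];
  if (config.tabAutocompleteModel) models.push(config.tabAutocompleteModel);

  for (const m of models) {
    // Assumption: an ollama entry without apiBase talks to a local default.
    const local = m.apiBase ? isLoopback(m.apiBase) : m.provider === 'ollama';
    if (!local) warnings.push(`model "${m.title}" may send prompts off-machine`);
  }

  for (const p of config.contextProviders ?? []) {
    if (p.name === 'terminal') {
      warnings.push('terminal provider shares recent terminal output with the model');
    }
    if (p.name === 'docs') {
      for (const u of p.params?.urls ?? []) {
        // Assumption: entries are plain strings; only http(s) URLs need the network.
        if (/^https?:\/\//i.test(u) && !isLoopback(u)) {
          warnings.push(`docs source ${u} requires outbound requests`);
        }
      }
    }
  }
  return warnings;
}
```

Feeding it JSON.parse of the config file contents should return an empty array for the configuration shown above, and one warning per risky entry otherwise.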
Testing Your Setup with a Real Task
The best way to validate the setup is to open a real component and exercise both inline completion and chat refactoring. Create or open a React component like the following:
// components/ShoppingCart.jsx
import { useState } from 'react';

export default function ShoppingCart() {
  const [items, setItems] = useState([]);

  const addItem = (product) => {
    if (!product.id) {
      console.warn('addItem: product missing stable id', product);
    }
    setItems(prev => {
      const existing = prev.findIndex(i => i.id === product.id);
      if (existing !== -1) {
        return prev.map((item, idx) =>
          idx === existing ? { ...item, quantity: item.quantity + 1 } : item
        );
      }
      return [...prev, { ...product, quantity: 1 }];
    });
  };

  // Ask Continue: "Add a removeItem function and a total price calculation"

  return (
    <div>
      {items.map((item) => (
        <div key={item.id}>{item.name} - ${item.price.toFixed(2)}</div>
      ))}
    </div>
  );
}
Place the cursor after the addItem function and trigger inline completion. DeepSeek-Coder-V2 should suggest a removeItem function or similar pattern. Then open the Continue chat panel and type: "Add a removeItem function and a total price calculation to this component, using @code ShoppingCart.jsx for context." The model should produce a coherent refactored version with both features added.
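Inline completion at the cursor is where fill-in-the-middle comes in: the editor sends the text before and after the cursor, and the model generates what belongs between them. The sketch below illustrates the idea only. The marker strings are placeholders invented for this example, not DeepSeek's actual special tokens, and Continue builds the real prompt internally.

```javascript
// Split a buffer at the cursor offset into the two halves a FIM model sees.
function splitAtCursor(source, cursorOffset) {
  return {
    prefix: source.slice(0, cursorOffset),
    suffix: source.slice(cursorOffset),
  };
}

// Placeholder markers only: each model family defines its own special tokens.
const PLACEHOLDER_MARKERS = { begin: '<FIM_BEGIN>', hole: '<FIM_HOLE>', end: '<FIM_END>' };

// Prefix-suffix-middle layout: the model is asked to produce the missing middle.
function buildFimPrompt({ prefix, suffix }, markers = PLACEHOLDER_MARKERS) {
  return `${markers.begin}${prefix}${markers.hole}${suffix}${markers.end}`;
}

// Example: cursor sits on the empty line between two statements.
const buffer = 'const addItem = () => {};\n\nreturn null;';
const parts = splitAtCursor(buffer, buffer.indexOf('\n') + 1);
console.log(buildFimPrompt(parts));
```

Because the suffix is part of the prompt, a FIM-capable model can avoid suggesting code that duplicates or conflicts with what already follows the cursor.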
The Cursor Alternative: Built-In AI with a Privacy Mode
What Cursor Offers Out of the Box
Cursor is a fork of VS Code that bundles AI features natively. It supports Cmd+K for inline editing, a chat panel, and codebase-wide context via the @codebase directive, which indexes the entire project for retrieval-augmented generation. Note that @codebase indexing may send code embeddings to Cursor's servers to build the search index; verify Cursor's current indexing data handling before enabling in compliance environments. Because it is a VS Code fork, virtually all existing extensions remain compatible, though Cursor may occasionally lag behind the latest VS Code release, causing brief extension incompatibility windows. Paid plans unlock cloud model access (GPT-4o, Claude), with a free tier offering limited usage.
Cursor's Privacy Mode and Local Model Support
Cursor's privacy mode prevents code from being stored on Cursor's servers, but prompts and code snippets still transit Cursor's infrastructure for cloud model inference. For GDPR, HIPAA, or SOC 2 environments where transmission itself is restricted, privacy mode does not satisfy a "data never leaves the machine" requirement. Only the local Ollama path provides that guarantee. For teams that need absolute certainty that code never leaves the machine, this is an important distinction.
Cursor also supports connecting to a local Ollama instance as an experimental feature, giving access to the same local models used in the Continue setup. However, it exposes fewer configuration options and less control over context providers than Continue does.
Pros and Cons: Cursor vs VS Code + Continue
| Criteria | Cursor | VS Code + Ollama + Continue |
|---|---|---|
| Setup Effort | Low (built-in) | Medium (manual config) |
| Cost | Free tier limited; Pro $20/mo (verify at cursor.com/pricing) | Fully free |
| Privacy Control | Privacy mode (trust required; prompts still transit servers) | Complete (never leaves machine) |
| Local Model Quality | Same Ollama models | Same Ollama models |
| Context Awareness | Excellent (@codebase) | Good (configurable providers) |
| Extension Ecosystem | VS Code compatible (may lag VS Code releases slightly) | Full VS Code ecosystem |
| Customization | Limited | Highly configurable |
| Cloud Fallback | Built-in (paid) | Manual (add API key) |
For more on how cloud-based AI tools like Claude Code compare, see our Claude Code analysis.
Quality Comparison: Local 7B/16B Models vs GPT-4o
Test Methodology
Three tasks provide a useful cross-section of real coding work: inline autocomplete for a Node.js Express route, chat-based refactoring of a React component, and multi-file context resolution involving imports across files. The models under comparison are DeepSeek-Coder-V2 16B (local via Ollama), CodeLlama 7B (local via Ollama), and GPT-4o (cloud, accessed through Cursor).
Side-by-Side Results
The following prompt was given to all three models for the Express middleware task:
// Create an Express middleware that validates a JWT token
// from the Authorization header and attaches the decoded
// user object to req.user
// Note: This example uses CommonJS require() syntax typical for Node.js/Express.
// Adjust to import syntax if your project uses ES modules.
// Requires: npm install jsonwebtoken@^9.0.2
// JWT_SECRET must be set as an environment variable — never hardcode
const jwt = require('jsonwebtoken');

function authMiddleware(req, res, next) {
  const authHeader = req.headers['authorization'];
  if (!authHeader || !authHeader.startsWith('Bearer ')) {
    return res.status(401).json({ error: 'Missing or malformed Authorization header' });
  }

  const token = authHeader.split(' ')[1];
  const secret = process.env.JWT_SECRET;
  if (!secret) {
    // Fail closed: do not proceed without a configured secret
    return res.status(500).json({ error: 'Server misconfiguration: JWT_SECRET not set' });
  }

  try {
    const decoded = jwt.verify(token, secret, { algorithms: ['HS256'] });
    req.user = decoded;
    next();
  } catch (err) {
    if (err.name === 'TokenExpiredError') {
      return res.status(401).json({ error: 'Token expired' });
    }
    return res.status(403).json({ error: 'Invalid token' });
  }
}

module.exports = authMiddleware;
DeepSeek-Coder-V2 16B produced a complete middleware extracting the Bearer token from the Authorization header, wrapping jwt.verify in a try-catch, attaching the decoded payload to req.user, and calling next(). It included a 401 response for missing tokens and a 403 for invalid ones. The code was clean and production-appropriate.
CodeLlama 7B generated a functional but minimal version. It extracted the token and called jwt.verify but omitted error handling for the missing Authorization header case and did not differentiate between missing and invalid tokens in its error responses.
GPT-4o produced the most thorough output, including the same Bearer token extraction and error differentiation as DeepSeek, plus optional configuration for the JWT secret via environment variables and inline JSDoc comments explaining each step.
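The missing-header case that separated these outputs is easy to demonstrate in isolation. In the sketch below, extractUnguarded is a hypothetical stand-in for token extraction without the guard (it is not CodeLlama's actual output), while extractBearerToken shows the guarded form. Neither depends on Express or jsonwebtoken.

```javascript
// Guarded extraction: returns null instead of throwing on bad input.
function extractBearerToken(authHeader) {
  if (typeof authHeader !== 'string' || !authHeader.startsWith('Bearer ')) {
    return null;
  }
  const token = authHeader.slice('Bearer '.length).trim();
  return token.length > 0 ? token : null;
}

// Hypothetical unguarded version: assumes the header is always present.
function extractUnguarded(authHeader) {
  return authHeader.split(' ')[1]; // TypeError when authHeader is undefined
}

console.log(extractBearerToken('Bearer abc.def.ghi')); // the token string
console.log(extractBearerToken(undefined)); // null, so the caller can send a 401
```

Without the guard, a request that omits the Authorization header throws inside the middleware, which typically surfaces to the client as a generic server error rather than a clear 401.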
Where Local Models Shine (and Where They Don't)
Local models in the 7B to 16B range perform well on boilerplate code, well-known framework patterns like Express middleware and React hooks, and single-file completions where the required context is visible in the current buffer. They handle the repetitive, pattern-heavy work that makes up the bulk of daily autocomplete interactions.
Ask a 16B model to trace how a TypeScript interface defined in types.ts flows through a service layer and into a React component three imports away, though, and the output degrades fast. The same applies to novel or recently released API usage that may not appear in training data, and to complex architectural suggestions that require holistic project understanding. In informal testing across the tasks above, local models handled common autocomplete interactions -- boilerplate, patterns, and single-file completions -- competently, while falling short on multi-file reasoning and novel API usage. No formal benchmark is claimed; teams should validate against their own codebases. For the remaining gaps, keeping a cloud model available as a fallback is the pragmatic approach.
Implementation Checklist and Recommendations
## Local AI Coding Assistant Setup Checklist
### Prerequisites
- [ ] 16GB+ VRAM for GPU inference with the 16B model (or 8GB+ for lite variant)
- [ ] 16GB+ RAM minimum if running CPU-only (expect slow inference with 16B model)
- [ ] macOS, Linux, or Windows 10/11
- [ ] VS Code installed (latest stable)
- [ ] GPU: NVIDIA (CUDA support) or Apple Silicon recommended for practical autocomplete speed
### Ollama Setup
- [ ] Install Ollama (review install script or use versioned binary for security)
- [ ] Verify Ollama is running (`ollama serve` or system service)
- [ ] Pull deepseek-coder-v2:16b (or :lite for lower VRAM)
- [ ] Verify API at localhost:11434 (use platform-appropriate curl/PowerShell command)
### Continue Extension
- [ ] Install Continue from VS Code marketplace (publisher: Continue, ID: Continue.continue)
- [ ] Check installed Continue version and consult docs.continue.dev for config format
- [ ] Configure config.json or config.yaml with Ollama provider (path varies by OS)
- [ ] Add tabAutocompleteModel with explicit apiBase for inline completions
- [ ] Configure context providers (code, docs, diff)
- [ ] Harden config file permissions: `chmod 600 ~/.continue/config.json`
- [ ] Test inline completion in a real project file
- [ ] Test chat refactoring with @code references
### Optional Enhancements
- [ ] Add a cloud model API key as fallback for complex tasks (note: keys are stored in plaintext in the config file — ensure file permissions are restricted)
- [ ] Configure project-specific @docs context (use local paths only for fully offline setup)
- [ ] Set up keyboard shortcuts for Continue commands
- [ ] Try Cursor's privacy mode for comparison
- [ ] Only add `terminal` context provider if terminal sessions are confirmed free of secrets
Which Path Should You Choose?
The answer depends on context. Solo developers and open-source contributors benefit most from VS Code + Ollama + Continue: it is completely free, offers maximum control over data flow, and is endlessly configurable. Teams with budget that value polish and built-in cloud model access should consider Cursor Pro with privacy mode enabled. If every network call must be auditable, VS Code + Ollama + Continue is the only path that guarantees it -- no data leaves the machine unless the developer explicitly adds an external API key.
Final Thoughts
Local AI coding assistance has crossed the threshold of practical daily utility. The combination of Ollama's simple model serving, DeepSeek-Coder-V2's capable code generation, and Continue's flexible VS Code integration delivers a private, zero-cost setup that handles routine coding tasks -- boilerplate, known patterns, single-file completions -- as demonstrated in the test results above. The gap with cloud models like GPT-4o is real, but each new release of open-weight models closes part of it. The most effective strategy is to start local for privacy and cost control, then layer in cloud access selectively for tasks that genuinely demand it.

