How to Set Up a Local LLM on 10GB VRAM
- Measure your actual free VRAM with nvidia-smi under normal desktop conditions to establish your real memory budget.
- Select a quantized 7B–8B model (Llama 3.1 8B or DeepSeek Coder 6.7B) at Q5_K_M quantization (~5.3 GB).
- Install Ollama via the one-line installer and verify GPU/CUDA detection in the startup logs.
- Pull your chosen model with ollama pull llama3.1:8b-instruct-q5_K_M.
- Create a custom Modelfile setting num_gpu 32, num_ctx 4096, and num_thread to your physical core count.
- Benchmark token throughput with the included Python script, targeting 30–45 tokens/sec.
- Integrate with your IDE using the Continue extension pointed at localhost:11434.
- Optimize by closing VRAM-consuming apps, adjusting context window size, and monitoring GPU thermals.
Table of Contents
- Why 10GB VRAM Is the New Sweet Spot for Local LLMs
- Hardware Reality Check: What 10GB VRAM Actually Means
- Model Selection: Choosing the Right LLM for Your VRAM
- Installing Ollama and Pulling Your First Model
- Performance Optimization: Hitting 30 to 45 Tokens/Sec
- Practical Use Cases: Putting Your Local LLM to Work
- Troubleshooting Common 10GB VRAM Issues
- Implementation Checklist and Next Steps
Why 10GB VRAM Is the New Sweet Spot for Local LLMs
Running a local LLM on a 10GB VRAM budget has become a genuinely practical option in 2026. Cloud API costs accumulate fast for developers iterating on LLM-powered features. Round-trip latency to hosted endpoints adds friction to code completion workflows. Every prompt sent to a third-party service is a prompt that leaves the developer's control, and for teams handling proprietary codebases or sensitive data, that last point alone justifies local inference.
The key insight most guides overlook: a 12GB or 16GB GPU does not give developers 12 or 16 gigabytes of working memory for models. The operating system reserves VRAM for display buffers and the compositor, and the CUDA runtime itself claims a slice. On a typical desktop running a display server, an RTX 4060 Ti 16GB leaves roughly 14.8GB for model weights, while the RTX 4070's 12GB yields approximately 10.8GB. That 10 to 11GB effective budget on the RTX 4070 is the tighter of the two, and a setup that fits within it also runs on the 4060 Ti, so it is the constraint this guide targets.
By the end of this article, you will have a fully operational local LLM setup producing 30 to 45 tokens per second (community-reported ranges for Q5_K_M 8B models; verify with the included benchmark script), integrated with development tooling, and tuned for the specific memory constraints of your hardware. Every command is copy-paste ready.
Hardware Reality Check: What 10GB VRAM Actually Means
RTX 4060 Ti (16GB) vs RTX 4070: Specs That Matter for LLM Inference
Two specs dominate LLM inference performance on consumer GPUs: VRAM capacity and memory bandwidth. The RTX 4060 Ti 16GB provides 16GB of GDDR6 on a 128-bit bus with 288 GB/s of memory bandwidth. The RTX 4070 ships with 12GB of GDDR6X on a 192-bit bus, delivering 504 GB/s of bandwidth. That bandwidth gap matters enormously. Token generation in autoregressive LLMs is memory-bandwidth-bound: each token requires reading the full model weights from VRAM. The RTX 4070's higher bandwidth translates directly into faster token generation, even though it has less total VRAM.
CUDA core count (4352 on the 4060 Ti, 5888 on the 4070 -- verify against the NVIDIA spec sheet at nvidia.com for your specific SKU) matters primarily during the prefill phase, which is compute-bound rather than memory-bound. For interactive use where generation speed is the bottleneck users feel, bandwidth is the limiting factor.
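The bandwidth argument can be turned into a rough ceiling: if every generated token requires streaming all model weights from VRAM once, throughput cannot exceed memory bandwidth divided by weight size. The sketch below is only a back-of-the-envelope estimate. It assumes the ~5.3 GB Q5_K_M weight size used later in this guide and ignores KV cache reads, kernel launch overhead, and cache effects; the function name is illustrative.

```python
def bandwidth_ceiling_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on tokens/sec if each token reads all weights exactly once."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


WEIGHTS_GB = 5.3  # assumed size of an 8B model at Q5_K_M

# Memory bandwidth figures quoted in this section (GB/s)
GPUS = {"RTX 4060 Ti 16GB": 288.0, "RTX 4070": 504.0}

for name, bandwidth in GPUS.items():
    ceiling = bandwidth_ceiling_tps(bandwidth, WEIGHTS_GB)
    print(f"{name}: theoretical ceiling ~{ceiling:.0f} tokens/sec")
```

Measured throughput lands well below these ceilings, so treat the result as a sanity bound rather than a prediction.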
Calculating Your Effective VRAM Budget
The formula is straightforward: Model Weight Budget = Total VRAM - OS/Display Overhead - KV Cache Reservation.
On a standard Linux or Windows desktop with a display attached, OS overhead consumes 400 to 800 MB (varies by display configuration; measure with nvidia-smi on your specific system). The KV cache for a 4096-token context window on Llama 3.1 8B (with 8 GQA KV heads at float16) requires approximately 500 to 600 MB; other models with different architectures will differ.
| GPU | Total VRAM | OS/Display Overhead | KV Cache (4K ctx) | Model Weight Budget |
|---|---|---|---|---|
| RTX 4060 Ti 16GB | 16 GB | ~600 MB | ~550 MB | ~14.8 GB |
| RTX 4070 12GB | 12 GB | ~600 MB | ~550 MB | ~10.8 GB |
The RTX 4060 Ti has headroom for larger quantizations or wider context windows. The RTX 4070 demands more careful model selection but rewards it with faster generation. To decide between them, consider whether your workload needs longer context (favor the 4060 Ti) or faster output (favor the 4070).
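The budget arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch using the approximate overhead figures from the table (600 MB for OS/display, 550 MB for a 4K-context KV cache); substitute the values you measure on your own system. The function name is illustrative.

```python
def model_weight_budget_gb(
    total_gb: float,
    os_overhead_mb: float = 600.0,  # approximate figure from the table
    kv_cache_mb: float = 550.0,     # approximate 4K-context KV cache from the table
) -> float:
    """Total VRAM minus OS/display overhead and KV cache reservation, in GB."""
    return total_gb - (os_overhead_mb + kv_cache_mb) / 1000.0


for name, total in {"RTX 4060 Ti 16GB": 16.0, "RTX 4070 12GB": 12.0}.items():
    print(f"{name}: ~{model_weight_budget_gb(total):.2f} GB for model weights")
```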
To check actual available VRAM on a given system:
# Check VRAM allocation with nvidia-smi
nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
# For a more precise measurement from within PyTorch (requires: pip install torch):
python3 -c "
import torch
free, total = torch.cuda.mem_get_info(0)
print(f'Total VRAM: {total / 1024**3:.2f} GB')
print(f'Free VRAM: {free / 1024**3:.2f} GB')
print(f'Used VRAM: {(total - free) / 1024**3:.2f} GB')
"
Run these commands with the desktop environment active and typical background applications open to get a realistic baseline, not an optimistic one.
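If you want to log the baseline over time, the CSV output of the nvidia-smi query above is easy to parse. The following is a sketch that assumes the usual layout of that output (a header row, then one row per GPU with values such as "12282 MiB"); the sample numbers are hypothetical, not measurements.

```python
import csv
import io


def parse_memory_csv(text: str) -> list[dict]:
    """Parse output of:
    nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
    """
    reader = csv.reader(io.StringIO(text.strip()), skipinitialspace=True)
    next(reader)  # skip the header row
    gpus = []
    for row in reader:
        if not row:
            continue
        # Each cell looks like "12282 MiB"; keep the integer part.
        total, used, free = (int(cell.split()[0]) for cell in row)
        gpus.append({"total_mib": total, "used_mib": used, "free_mib": free})
    return gpus


# Hypothetical sample output for a single 12 GB card
SAMPLE = """memory.total [MiB], memory.used [MiB], memory.free [MiB]
12282 MiB, 1190 MiB, 11092 MiB
"""

for index, gpu in enumerate(parse_memory_csv(SAMPLE)):
    print(f"GPU {index}: {gpu['free_mib'] / 1024:.2f} GiB free")
```

On a live system, the text could come from subprocess.run([...], capture_output=True, text=True).stdout with the same nvidia-smi arguments.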
Model Selection: Choosing the Right LLM for Your VRAM
Why Quantized 7B to 8B Models Are the Target
An unquantized (FP16) 7B-parameter model requires approximately 14GB of VRAM for weights alone, since each parameter takes two bytes. That already exceeds the RTX 4070's effective budget. Quantization compresses model weights to lower-precision formats, trading a small amount of output quality for dramatically reduced memory requirements. An 8B model at Q5_K_M quantization fits comfortably in about 5.3GB, leaving ample room for KV cache and runtime overhead.
Models in the 13B parameter range at Q4 quantization require roughly 7.5 to 8GB for weights alone. While this technically fits in 10GB, it leaves minimal headroom for context windows beyond 2048 tokens, and any VRAM pressure from other applications causes out-of-memory failures. The 7B to 8B tier at Q4 through Q6 quantization is the reliable operating range.
Recommended Models for 10GB VRAM (2026)
Llama 3.1 8B Instruct remains the strongest general-purpose option at this parameter count. Community benchmarks consistently rank its reasoning and instruction following near the top of the 8B class, and the model ships in every common GGUF quantization format through Ollama's library.
If your primary workload is code generation and local Copilot-style assistance, DeepSeek Coder 6.7B Instruct is the better fit (this is the 7B-class DeepSeek Coder model, not to be confused with DeepSeek-V3, a separate 671B MoE model). It outperforms general-purpose alternatives on HumanEval and MBPP (verify current scores at the model's repository), while fitting within the same VRAM envelope.
Mistral 7B v0.3 is worth considering when generation speed matters more than peak benchmark scores (verify the current latest release at ollama.com/library/mistral). Its architecture generates tokens faster per VRAM-byte than the other options listed here, making it a practical choice for latency-sensitive workflows.
For multilingual workloads, Qwen 2.5 7B handles CJK and European languages better than the English-centric alternatives above. Phi-3 Mini targets constrained-context reasoning tasks where its training data composition gives it an edge.
Understanding GGUF Quantization Levels
| Quantization | Bits/Weight | VRAM (8B Model) | Quality Impact | Speed Impact |
|---|---|---|---|---|
| Q4_K_M | ~4.8 avg* | ~4.9 GB | Noticeable on nuanced reasoning | Fastest |
| Q5_K_M | ~5.5 avg* | ~5.3 GB | Minimal for most tasks | Fast |
| Q6_K | ~6.5 avg* | ~6.1 GB | Near-imperceptible | Moderate |
| Q8_0 | 8.0 | ~8.0 GB | Negligible | Slower (more data to read) |
*K-quant formats (e.g., Q4_K_M) use mixed precision, quantizing different weight tensors at different bit depths; the listed value is the weighted average bits per weight across all tensors.
Q5_K_M represents the optimal trade-off for the 10GB VRAM tier. It preserves nearly all model quality while leaving 4 to 5GB free for KV cache and overhead on both target GPUs. When you need wider context windows and can tolerate some quality loss on reasoning-heavy prompts, drop to Q4_K_M to reclaim roughly 400MB. Q8_0 fits on the RTX 4060 Ti 16GB but pushes the RTX 4070 to its limits.
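The selection rule described here can be written as a small helper: pick the highest-quality quantization whose weights still leave the headroom you want. This is a sketch that uses the approximate sizes from the table above; the 5 GB headroom default mirrors the upper end of the "4 to 5GB free" figure and is an assumption you should tune.

```python
# (quantization, approximate 8B weight size in GB), highest quality first,
# taken from the table above
QUANT_SIZES_GB = [
    ("Q8_0", 8.0),
    ("Q6_K", 6.1),
    ("Q5_K_M", 5.3),
    ("Q4_K_M", 4.9),
]


def pick_quantization(weight_budget_gb: float, headroom_gb: float = 5.0):
    """Return the highest-quality quantization that fits, or None if none do."""
    for name, size_gb in QUANT_SIZES_GB:
        if size_gb + headroom_gb <= weight_budget_gb:
            return name
    return None


print("RTX 4070 (~10.8 GB budget):", pick_quantization(10.8))
print("RTX 4060 Ti (~14.8 GB budget):", pick_quantization(14.8))
```

Lowering headroom_gb makes the helper more aggressive; for example, a smaller context window justifies a smaller reservation.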
Installing Ollama and Pulling Your First Model
Ollama Installation (Linux, macOS, Windows)
# Linux (one-line install — review the script at the URL before running if preferred)
curl -fsSL https://ollama.com/install.sh | sh
# macOS (via Homebrew)
brew install ollama
# Windows — download the installer from https://ollama.com/download/windows
# Or via winget:
winget install Ollama.Ollama
# Verify installation and version
ollama --version
# Verify GPU detection:
# On Linux, Ollama installs a systemd service; check status with:
systemctl status ollama
# If running manually: first verify no instance is running with:
pgrep -x ollama
# If no process is found, start with:
ollama serve
# Check logs for "NVIDIA" or "CUDA" detection messages
# Also verify driver with:
nvidia-smi
Ollama automatically detects CUDA-capable GPUs at startup. Ollama bundles its own CUDA runtime, so a separate CUDA toolkit installation is not required -- only the NVIDIA driver (version 535 or later) is needed. If the logs show CPU-only mode, verify that NVIDIA drivers are properly installed.
Pulling a Quantized Model
Ollama's model library uses a name:tag convention where the tag specifies the variant, size, and quantization. Verify the exact tag at ollama.com/library/llama3.1 before running, as tags are case-sensitive and naming conventions may change.
# Pull the recommended model
ollama pull llama3.1:8b-instruct-q5_K_M
# Verify the model is available locally
ollama list
# Expected output (approximate):
# NAME ID SIZE MODIFIED
# llama3.1:8b-instruct-q5_K_M a1b2c3d4... 5.3 GB Just now
The download size matches the on-disk size closely since GGUF files are already in their quantized format. No additional conversion step is needed.
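To make the name:tag convention concrete, here is a small sketch that splits a model reference into its parts. It assumes the hyphen-separated tag layout seen in this guide's examples (size, variant, quantization); tags in the library do not always follow that pattern, so treat the parts as hints rather than a schema. The function name is illustrative.

```python
def split_model_ref(ref: str) -> dict:
    """Split an Ollama-style 'name:tag' reference into name, tag, and tag parts."""
    name, _, tag = ref.partition(":")
    # When no tag is given, Ollama resolves the reference to the 'latest' tag.
    tag = tag or "latest"
    return {"name": name, "tag": tag, "tag_parts": tag.split("-")}


info = split_model_ref("llama3.1:8b-instruct-q5_K_M")
print(info["name"])       # model family
print(info["tag_parts"])  # size, variant, quantization in this example
```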
Running Your First Inference
# Start an interactive chat session
ollama run llama3.1:8b-instruct-q5_K_M
# At the prompt, type a test query:
# >>> Explain the difference between a mutex and a semaphore in three sentences.
# To check model metadata and performance info within the session:
# >>> /show info
# For a quick benchmark from the command line:
echo "Write a Python function to compute Fibonacci numbers iteratively." | \
ollama run llama3.1:8b-instruct-q5_K_M --verbose 2>&1 | tail -5
# The --verbose flag outputs eval rate (tokens/sec) at the end
On an RTX 4070 with default settings, expect 35 to 40 tokens per second on first run. The RTX 4060 Ti will land closer to 28 to 34 tokens per second due to its lower memory bandwidth.
Performance Optimization: Hitting 30 to 45 Tokens/Sec
Configuring GPU Layers and Context Window
The num_gpu parameter controls how many model layers are loaded onto the GPU versus remaining on CPU. For an 8B model that fits entirely in VRAM, set this to the total layer count (32 for Llama 3.1 8B; verify with ollama show <model> --modelfile) to avoid any CPU offloading.
The context window (num_ctx) has a direct, linear relationship to VRAM consumption through the KV cache. For Llama 3.1 8B (with 8 GQA KV heads, 128 head dim, at float16), each additional 1024 tokens of context consumes approximately 128 to 160 MB of VRAM; other models will differ. Expanding from the default 4096 to 8192 tokens costs roughly 500 to 650 MB.
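The linear relationship follows from how the KV cache is laid out: one key vector and one value vector per KV head, per layer, per token. The sketch below computes that size from the Llama 3.1 8B figures quoted above (32 layers, 8 KV heads, head dimension 128, float16). It covers the raw cache only; the runtime allocates additional buffers, which is why the ranges in the text run somewhat higher.

```python
def kv_cache_bytes(
    num_ctx: int,
    n_layers: int = 32,        # Llama 3.1 8B transformer blocks
    n_kv_heads: int = 8,       # GQA key/value heads
    head_dim: int = 128,
    bytes_per_value: int = 2,  # float16
) -> int:
    """Raw KV cache size: keys and values for every layer, KV head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_ctx


MIB = 1024 ** 2
for ctx in (2048, 4096, 8192):
    print(f"num_ctx={ctx}: {kv_cache_bytes(ctx) / MIB:.0f} MiB")
```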
# Create a custom Modelfile with tuned parameters
cat > Modelfile-optimized <<'EOF'
FROM llama3.1:8b-instruct-q5_K_M
# Llama 3.1 8B has 32 transformer blocks.
# Adjust num_gpu to match the actual layer count for other models:
# ollama show <model> --modelfile
PARAMETER num_gpu 32
PARAMETER num_ctx 4096
# Uncomment and set to physical CPU core count (not hyperthreaded logical cores).
# Linux: lscpu | grep 'Core(s) per socket'
# macOS: sysctl -n hw.physicalcpu
# PARAMETER num_thread 8
SYSTEM "You are a precise technical assistant."
EOF
# Build the custom model
ollama create llama31-optimized -f Modelfile-optimized
# Run it
ollama run llama31-optimized
Uncomment and adjust the num_thread parameter to match the number of physical CPU cores (not hyperthreaded logical cores) for optimal prompt processing performance.
VRAM-Saving Techniques That Actually Work
Current llama.cpp builds (the inference backend Ollama uses) enable Flash Attention by default, reducing peak activation memory during the attention computation. It does not reduce KV cache size, which num_ctx and model architecture determine. No manual configuration is required in Ollama.
For speed-critical tasks like code completion where low latency matters more than long context, reducing num_ctx to 2048 frees several hundred megabytes and measurably improves token throughput.
Close VRAM-consuming applications before inference sessions. Chromium-based browsers with hardware acceleration enabled and multiple tabs open can consume 500MB or more of VRAM. Game launchers (Steam, Epic) often hold VRAM allocations even when idle.
Benchmarking Your Setup
Consistent benchmarking requires a fixed prompt, a fixed generation length, and multiple runs to account for variance. Note that the first inference run after loading a model is slower due to model warmup; the benchmark script below performs an automatic warmup run before recording results. Expected ranges with Q5_K_M quantization on an 8B model at 4096 context: RTX 4060 Ti achieves 30 to 38 tokens/sec, and the RTX 4070 reaches 35 to 45 tokens/sec.
#!/usr/bin/env python3
"""benchmark_ollama.py: Standardized Ollama inference benchmark."""
import os
import time
import json
import warnings

import requests  # requires: pip install requests

OLLAMA_URL = os.environ.get("OLLAMA_URL", "http://localhost:11434/api/generate")
MODEL = os.environ.get("BENCHMARK_MODEL", "llama31-optimized")
PROMPT = "Explain how a B-tree index works in a relational database. Be thorough."
NUM_RUNS = int(os.environ.get("BENCHMARK_RUNS", "3"))
CONNECT_TIMEOUT = 10   # seconds to establish connection
READ_TIMEOUT = 120     # seconds between streaming chunks


def benchmark_once(warmup: bool = False) -> dict:
    start = time.perf_counter()
    first_token_time = None
    eval_count = None
    tps = 0.0

    try:
        resp = requests.post(
            OLLAMA_URL,
            json={
                "model": MODEL,
                "prompt": PROMPT,
                "stream": True,
                "options": {"num_predict": 256},
            },
            stream=True,
            timeout=(CONNECT_TIMEOUT, READ_TIMEOUT),
        )
        resp.raise_for_status()
    except requests.exceptions.RequestException as exc:
        raise RuntimeError(f"Failed to connect to Ollama at {OLLAMA_URL}: {exc}") from exc

    for line in resp.iter_lines():
        if not line:
            continue
        try:
            data = json.loads(line)
        except json.JSONDecodeError as exc:
            warnings.warn(f"Skipping malformed response line: {exc}")
            continue

        if data.get("done", False):
            eval_count = data.get("eval_count")
            eval_duration_ns = data.get("eval_duration")  # nanoseconds, from Ollama
            if eval_count and eval_duration_ns:
                # Use Ollama's own timing for TPS: most accurate source
                tps = eval_count / (eval_duration_ns / 1e9)
            elif eval_count:
                # Fallback: wall-clock minus TTFT covers generation window only
                gen_elapsed = time.perf_counter() - (first_token_time or start)
                tps = eval_count / gen_elapsed if gen_elapsed > 0 else 0.0
            else:
                warnings.warn(
                    "eval_count missing from Ollama response; "
                    "TPS cannot be computed accurately. "
                    "Check Ollama version compatibility."
                )
                tps = 0.0
        else:
            if first_token_time is None:
                first_token_time = time.perf_counter()

    elapsed = time.perf_counter() - start
    ttft = (first_token_time - start) if first_token_time else 0.0
    return {
        "tokens": eval_count or 0,
        "ttft_ms": ttft * 1000,
        "tps": tps,
        "total_s": elapsed,
        "warmup": warmup,
    }


def main() -> None:
    # Warmup run: model already loaded, but execution pipeline may be cold
    print("[warmup] Running warmup pass (results excluded)...", flush=True)
    try:
        benchmark_once(warmup=True)
    except RuntimeError as exc:
        print(f"[warmup] FAILED: {exc}")
        return

    print(f"\n{'Run':<5} {'Tokens':<8} {'TTFT (ms)':<12} {'Tokens/s':<10} {'Total (s)':<10}")
    print("-" * 50)
    for i in range(NUM_RUNS):
        try:
            r = benchmark_once()
        except RuntimeError as exc:
            print(f"{i+1:<5} ERROR: {exc}")
            continue
        print(
            f"{i+1:<5} {r['tokens']:<8} {r['ttft_ms']:<12.1f} "
            f"{r['tps']:<10.1f} {r['total_s']:<10.2f}"
        )


if __name__ == "__main__":
    main()
Run with python3 benchmark_ollama.py while Ollama is serving. Compare results against the expected ranges above to identify configuration issues.
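Comparing runs against the expected ranges is easier with a mean and spread than by eyeballing rows. The helper below is a sketch of that comparison; the sample values are hypothetical, and the (35, 45) range is the RTX 4070 figure quoted above.

```python
from statistics import mean, stdev


def summarize_tps(samples: list[float], expected_range: tuple[float, float]) -> dict:
    """Average tokens/sec across runs and compare against an expected range."""
    if not samples:
        raise ValueError("at least one benchmark sample is required")
    low, high = expected_range
    avg = mean(samples)
    spread = stdev(samples) if len(samples) > 1 else 0.0
    if avg < low:
        verdict = "below expected range"
    elif avg > high:
        verdict = "above expected range"
    else:
        verdict = "within expected range"
    return {"mean": avg, "stdev": spread, "verdict": verdict}


# Hypothetical tokens/sec values from three benchmark runs
runs = [38.2, 39.1, 37.6]
summary = summarize_tps(runs, expected_range=(35.0, 45.0))
print(f"mean={summary['mean']:.1f} stdev={summary['stdev']:.2f} -> {summary['verdict']}")
```

A large standard deviation relative to the mean usually points to background VRAM or thermal interference rather than a configuration problem.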
Practical Use Cases: Putting Your Local LLM to Work
Code Completion with DeepSeek Coder
# Pull DeepSeek Coder 6.7B Instruct
ollama pull deepseek-coder:6.7b-instruct-q5_K_M
To integrate with VS Code, install the Continue extension, then configure it to point at the local Ollama endpoint:
{
  "models": [
    {
      "title": "DeepSeek Coder Local",
      "provider": "ollama",
      "model": "deepseek-coder:6.7b-instruct-q5_K_M",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "DeepSeek Coder Autocomplete",
    "provider": "ollama",
    "model": "deepseek-coder:6.7b-instruct-q5_K_M"
  }
}
Place this in the Continue extension's config.json:
- Linux/macOS: ~/.continue/config.json
- Windows: %USERPROFILE%\.continue\config.json
(You can also access this file via the Continue sidebar panel in VS Code -> gear icon -> "Open config.json". Note that the config schema may change across Continue extension versions.)
Autocomplete latency with the RTX 4070 runs under 300ms for short completions (≤20 tokens) on a warmed model.
RAG on Local Documents
A lightweight retrieval-augmented generation pipeline on 10GB VRAM pairs Ollama with a small embedding model (such as nomic-embed-text at approximately 300MB VRAM; verify with nvidia-smi after loading) and a local vector store like ChromaDB (pip install chromadb). The critical constraint is co-residency: the embedding model and the generation model share the same VRAM pool. At Q5_K_M, an 8B generation model plus the embedding model totals approximately 5.6GB, leaving room on both target GPUs. Expanding to Q8_0 or using wider context windows makes co-residency tight on the RTX 4070.
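The retrieval step itself is simple once embeddings exist. The sketch below shows it in pure Python with hand-written three-dimensional vectors standing in for real embeddings; in an actual pipeline the vectors would come from the embedding model and the ranking from the vector store. All names and data here are illustrative.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], store: list[tuple[str, list[float]]], k: int = 2) -> list[str]:
    """Return the k passages whose vectors are most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble retrieved passages and the question into a generation prompt."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy store: (passage, stand-in embedding)
STORE = [
    ("Ollama listens on port 11434 by default.", [0.9, 0.1, 0.0]),
    ("Q5_K_M weights for an 8B model take about 5.3 GB.", [0.1, 0.9, 0.1]),
    ("GPU throttling can begin above 83 degrees Celsius.", [0.0, 0.2, 0.9]),
]

query_vec = [0.2, 0.8, 0.1]  # stand-in embedding for the question below
question = "How much VRAM do the model weights need?"
print(build_prompt(question, top_k(query_vec, STORE, k=1)))
```

The resulting prompt string is what you would send to the generation model through the API shown in the next section.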
API Integration for Application Development
Ollama exposes an OpenAI-compatible API endpoint, so developers can swap local inference into any application using the standard openai Python library with a single base URL change.
import os
from openai import OpenAI

# Ollama does not enforce authentication but the openai library requires the field.
# For real OpenAI endpoints, use: api_key=os.environ["OPENAI_API_KEY"]
client = OpenAI(
    base_url=os.environ.get("OLLAMA_BASE_URL", "http://localhost:11434/v1"),
    api_key="ollama",  # placeholder value; Ollama ignores it
    timeout=120.0,     # seconds; prevents indefinite hang on slow generation
)

try:
    response = client.chat.completions.create(
        model="llama31-optimized",
        messages=[
            {
                "role": "user",
                "content": (
                    "Write a SQL query to find duplicate email addresses "
                    "in a users table."
                ),
            }
        ],
        temperature=0.2,
        max_tokens=512,  # explicit cap; avoids unbounded generation
    )
    print(response.choices[0].message.content)
except Exception as exc:
    # Surface actionable context: endpoint, model, and root cause
    raise RuntimeError(
        f"Ollama request failed (endpoint={client.base_url}, "
        f"model=llama31-optimized): {exc}"
    ) from exc
The api_key field is required by the library but ignored by Ollama. Ollama binds to 127.0.0.1:11434 by default. If you change OLLAMA_HOST to 0.0.0.0, the unauthenticated endpoint becomes accessible on your local network -- add firewall rules to restrict access in that case.
Note: Ollama's OpenAI-compatible endpoint covers chat completions and basic generation. Advanced features (function calling, vision) require verification against your Ollama version.
Troubleshooting Common 10GB VRAM Issues
"Out of Memory" Errors
The two most common triggers are a model that exceeds available VRAM and a context window configuration that pushes KV cache beyond remaining headroom. Run nvidia-smi during model load to diagnose. If VRAM usage hits the ceiling before the model finishes loading, step down one quantization level (Q5_K_M to Q4_K_M saves roughly 400MB). If the model loads but OOM occurs during generation, reduce num_ctx from 4096 to 2048.
Slower Than Expected Performance
First, verify that all model layers sit on the GPU. With the model loaded, run ollama ps: the PROCESSOR column shows how the model is split between CPU and GPU, and the Ollama server logs report how many layers were offloaded at load time. If the model is not fully on the GPU, VRAM pressure is forcing partial CPU offload, which devastates generation speed.
Second, check the PCIe slot configuration. A GPU in a PCIe 3.0 x8 slot (common on some motherboards' secondary slots) provides approximately 8 GB/s bandwidth, roughly one-quarter the ~32 GB/s of a PCIe 4.0 x16 slot. While this does not affect steady-state generation speed (which is VRAM-bandwidth-bound), it increases model loading time and can slow the prefill phase for very long inputs.
Third, monitor GPU temperature. Sustained inference workloads push thermal loads higher than typical gaming patterns. If the GPU clock throttles due to temperatures above 83 degrees Celsius, generation speed drops proportionally.
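A quick way to watch for throttling is to poll the temperature and flag readings above the threshold. The sketch below parses the output of nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader,nounits, which prints one integer per GPU; the 83 degree limit is the figure cited above, and the sample readings are hypothetical.

```python
THROTTLE_TEMP_C = 83  # threshold cited in this section


def parse_temps(text: str) -> list[int]:
    """Parse one-integer-per-line temperature output into a list of readings."""
    return [int(line.strip()) for line in text.splitlines() if line.strip()]


def hot_gpus(temps: list[int], limit: int = THROTTLE_TEMP_C) -> list[int]:
    """Return indices of GPUs whose temperature exceeds the limit."""
    return [index for index, temp in enumerate(temps) if temp > limit]


# Hypothetical readings for a two-GPU system
sample_output = "71\n86\n"
temps = parse_temps(sample_output)
for index in hot_gpus(temps):
    print(f"GPU {index} at {temps[index]} C: likely throttling, check cooling")
```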
Model Quality Feels Degraded
Q4 quantization can produce degraded output on tasks requiring nuanced reasoning, mathematical computation, or structured data generation. Symptoms include incorrect intermediate steps in multi-step math, malformed JSON output, and instructions followed partially rather than completely. The quality gap between Q4_K_M and Q5_K_M is generally larger than between Q5_K_M and Q8_0, though this varies by model and task. If output quality is unsatisfactory, run the same five prompts at both Q4_K_M and Q5_K_M, compare outputs side-by-side, and decide based on the specific tasks that matter for your use case rather than relying on general benchmark scores.
Implementation Checklist and Next Steps
## 10GB VRAM Local LLM — Implementation Checklist
### Hardware Verification
1. Run: nvidia-smi --query-gpu=memory.total,memory.used,memory.free --format=csv
2. Confirm free VRAM ≥ 10GB with desktop environment running
3. Verify PCIe slot (requires root on most distros):
sudo lspci -vv | grep -A30 -i "vga\|3d controller" | grep -i "lnksta"
(look for "LnkSta: Speed 16GT/s, Width x16")
### Ollama Installation
4. Linux: curl -fsSL https://ollama.com/install.sh | sh
macOS: brew install ollama
Windows: winget install Ollama.Ollama
5. Run: ollama --version (confirm latest 2026 release)
6. Check if Ollama is already running: pgrep -x ollama
On Linux (systemd): systemctl status ollama
If not running: ollama serve
Check logs for CUDA/GPU detection.
### Model Setup
7. Verify tags at ollama.com/library before pulling.
8. Run: ollama pull llama3.1:8b-instruct-q5_K_M
9. Run: ollama list (verify model appears with ~5.3GB size)
10. For code tasks: ollama pull deepseek-coder:6.7b-instruct-q5_K_M
### Custom Modelfile
11. Create Modelfile-optimized with num_ctx 4096, num_gpu 32, num_thread <physical cores>
Find physical cores — Linux: lscpu | grep 'Core(s) per socket'
macOS: sysctl -n hw.physicalcpu
12. Run: ollama create llama31-optimized -f Modelfile-optimized
### Benchmark Validation
13. Install Python dependency: pip install requests
14. Run benchmark_ollama.py (from article)
15. Confirm tokens/sec: RTX 4060 Ti → 30–38, RTX 4070 → 35–45
16. If below range: check GPU layer count, close background apps, verify thermals
### IDE Integration
17. Install Continue extension in VS Code
18. Configure config.json with local Ollama endpoint
(Linux/macOS: ~/.continue/config.json | Windows: %USERPROFILE%\.continue\config.json)
19. Test autocomplete latency (target: <300ms for short completions)
### API Integration
20. Test OpenAI-compatible endpoint: http://localhost:11434/v1
21. Verify drop-in compatibility with existing application code
(Note: covers chat completions; advanced features may differ by Ollama version)
Re-run your benchmarks after each Ollama update -- the llama.cpp backend receives regular inference speed optimizations, and you may see measurable gains without changing any configuration.

