How to Run DeepSeek R1 Locally

  1. Verify your hardware meets VRAM requirements for your chosen model variant (1.5B–671B).
  2. Install NVIDIA drivers with CUDA 12.x, or confirm sufficient Apple Silicon unified memory.
  3. Download Ollama and pull the target DeepSeek R1 model using ollama pull deepseek-r1:7b.
  4. Test CLI inference with a reasoning prompt and confirm <think> tags appear in the output.
  5. Containerize the deployment using Docker Compose with GPU passthrough for reproducibility.
  6. Optimize inference by selecting a quantization level (Q4_K_M, Q5_K_M, or Q8_0) and tuning context window size.
  7. Build a Node.js API proxy with SSE streaming and a React chat UI that parses reasoning blocks.
  8. Benchmark tokens per second and iterate on model size and quantization to match your performance needs.

Tested with: Ollama 0.6.x, Node.js 20.x LTS, Docker 24.x with Compose v2, CUDA 12.4.x, Ubuntu 22.04 / macOS 14+. Pin these versions for reproducibility; commands may require adjustment for newer releases.

Why Run DeepSeek R1 Locally?

Running DeepSeek R1 locally addresses three persistent concerns with cloud-hosted large language models: data privacy, recurring API costs, and dependence on network availability. If you handle sensitive data or operate in a regulated environment, local inference keeps every prompt and response on your own hardware, so no third party ever sees them.

DeepSeek R1 is an open-weights reasoning model. It produces chain-of-thought output, exposing intermediate reasoning steps before delivering a final answer. The model ships in a full 671B mixture-of-experts (MoE) configuration and a range of distilled variants (1.5B, 7B, 8B, 14B, 32B, and 70B), making it accessible across a wide spectrum of hardware. MoE models activate only a subset of their parameters on each forward pass, which affects both VRAM requirements and throughput characteristics; the distilled variants are dense models built on Qwen and Llama bases, so all of their parameters are active for every token. Per DeepSeek's published benchmarks, R1's reasoning capabilities rival proprietary models on math, code generation, and logic-heavy tasks.

This guide covers hardware selection, environment setup, Docker-based deployment, inference optimization, and a complete Node.js plus React integration layer for building privacy-first AI applications. It targets intermediate developers comfortable with the command line, Docker, and JavaScript.

Hardware Requirements for DeepSeek R1

Minimum vs. Recommended Specs

The model variant you choose dictates everything about the hardware conversation. Smaller distilled variants fit within a single 8-24 GB consumer GPU with room for OS overhead, while the full 671B model demands multi-GPU server configurations.

| Model Variant   | Quantization | Approx. VRAM Required | RAM (CPU Fallback) | Storage | Est. Tokens/sec (GPU)† |
| 1.5B            | Q4_K_M       | ~1.5 GB               | 8 GB               | ~1.5 GB | 40-60                  |
| 7B              | Q4_K_M       | ~5 GB                 | 16 GB              | ~4.5 GB | 25-40                  |
| 8B              | Q4_K_M       | ~6 GB                 | 16 GB              | ~5 GB   | 20-35                  |
| 14B             | Q4_K_M       | ~9 GB                 | 32 GB              | ~8.5 GB | 15-25                  |
| 32B             | Q4_K_M       | ~20 GB                | 64 GB              | ~19 GB  | 8-15                   |
| 70B             | Q4_K_M       | ~40 GB                | 128 GB             | ~40 GB  | 4-8                    |
| 671B (full MoE) | Q4_K_M       | ~350+ GB‡             | 512+ GB            | ~350 GB | 1-3                    |

†Illustrative estimates; single GPU, Q4_K_M quantization, 2048-token context, batch size 1. Actual results vary significantly by GPU model, driver version, CUDA version, system RAM bandwidth, and thermal state. Use the benchmarking script in this guide to measure your own hardware.

‡MoE models activate only a fraction of parameters per token; practical VRAM depends on expert routing and framework support. 350 GB represents full model weight storage, not necessarily peak active VRAM.

FP16 needs roughly three to four times the memory of Q4_K_M, because it stores 16 bits per weight where Q4_K_M averages a little under 5. Q5_K_M sits between Q4_K_M and Q8_0, offering a modest quality improvement at roughly 15-20% more memory than Q4_K_M. Q8_0 quantization stores weights as 8-bit integers, yielding approximately 50% of FP16 memory requirements, with quality close to FP16 on most benchmarks.
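As a rough cross-check on these ratios, weight memory can be estimated as parameter count times bits per weight. The sketch below assumes ballpark bits-per-weight figures for each format (the Q4_K_M, Q5_K_M, and Q8_0 values are approximations for llama.cpp-style quantization, not exact numbers), and estimateWeightGB is a hypothetical helper name. It ignores KV cache and runtime overhead, so treat the result as a lower bound rather than a VRAM requirement.

```javascript
// Approximate bits stored per weight for each format.
// Assumed ballpark values for illustration; real model files vary.
const BITS_PER_WEIGHT = {
  FP16: 16,
  Q8_0: 8.5,
  Q5_K_M: 5.7,
  Q4_K_M: 4.85,
};

// Estimate weight memory in GB (10^9 bytes).
// paramsBillions: parameter count in billions (e.g. 7 for the 7B variant).
function estimateWeightGB(paramsBillions, format) {
  const bits = BITS_PER_WEIGHT[format];
  if (bits === undefined) {
    throw new Error(`Unknown format: ${format}`);
  }
  // params * bits / 8 gives bytes; billions of parameters map directly to GB.
  return (paramsBillions * bits) / 8;
}

for (const format of Object.keys(BITS_PER_WEIGHT)) {
  console.log(`7B at ${format}: ~${estimateWeightGB(7, format).toFixed(1)} GB of weights`);
}
```

The figures in the table are somewhat higher than this estimate because they also allow for context and runtime overhead.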

Choosing the Right Model Size for Your Hardware

For an NVIDIA RTX 4090 (24 GB VRAM), the 14B Q4_K_M variant is the practical ceiling for single-GPU inference with headroom for context. Dual 3090s or 4090s support 32B quantized models. The 70B variant realistically requires an A100 (80 GB) or multiple consumer GPUs with the model split across them (layer splitting in Ollama and llama.cpp, tensor parallelism in vLLM).

Apple Silicon M-series machines use unified memory shared between the CPU and GPU, so an M2 Ultra with 192 GB can run the 70B Q4 variant, per community benchmarks. Expect roughly 30-60% of the tokens/sec of an NVIDIA GPU with comparable memory; benchmark to confirm. M3 and M4 Max chips with 64-128 GB unified memory handle 32B variants well.

You can run the 1.5B and 7B variants on CPU alone, but expect roughly 2-5 tokens per second on a modern 8-16 core desktop CPU (e.g., Intel Core i9-13900K); results vary with core count and memory bandwidth. This makes CPU inference suitable for testing and development rather than interactive use.
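The sizing advice in this section can be turned into a small lookup. The sketch below copies the approximate Q4_K_M VRAM figures from the table above and picks the largest variant that fits a given memory budget. The 25% headroom factor and the pickVariant name are assumptions for illustration, not measured values; adjust the margin to your context window size.

```javascript
// Approximate Q4_K_M VRAM needs in GB, copied from the table above.
const VARIANTS = [
  { tag: "deepseek-r1:1.5b", vramGB: 1.5 },
  { tag: "deepseek-r1:7b", vramGB: 5 },
  { tag: "deepseek-r1:8b", vramGB: 6 },
  { tag: "deepseek-r1:14b", vramGB: 9 },
  { tag: "deepseek-r1:32b", vramGB: 20 },
  { tag: "deepseek-r1:70b", vramGB: 40 },
];

// Return the tag of the largest variant whose weights plus headroom fit in
// availableGB, or null if none fits. headroom = 1.25 reserves 25% for the
// KV cache and runtime overhead (an assumed margin).
function pickVariant(availableGB, headroom = 1.25) {
  let best = null;
  for (const variant of VARIANTS) {
    if (variant.vramGB * headroom <= availableGB) best = variant;
  }
  return best ? best.tag : null;
}

console.log(pickVariant(24)); // deepseek-r1:14b
console.log(pickVariant(48)); // deepseek-r1:32b
```

With these assumptions a 24 GB card lands on the 14B variant, which matches the RTX 4090 guidance above.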

Environment Setup and Prerequisites

Installing Core Dependencies

The application layer requires Node.js 20 or later with npm or pnpm, plus Git and curl. Python 3.11 or later and standard build tools (gcc, make) are only needed if you also run Python-based serving backends or compile tooling from source; Ollama itself ships as a prebuilt binary. For NVIDIA GPUs, install a recent NVIDIA driver (550+ series recommended) and the CUDA toolkit 12.x. AMD GPU users need ROCm 6.x; Ollama's ROCm support is available via the ollama/ollama:rocm Docker image. To confirm the container can see the GPU, start it with docker run --rm --device /dev/kfd --device /dev/dri ollama/ollama:rocm and check the startup log, and consult Ollama's ROCm documentation for host driver requirements.

Installing Ollama for Local Model Management

Ollama provides a single binary that manages model downloads, quantization variants, and a local REST API for inference. It abstracts away the complexity of llama.cpp configuration while exposing a clean HTTP interface.

# Install Ollama (macOS and Linux)
# Security note: inspect the script at ollama.com/install.sh before executing.
# Alternatively, use the native package or binary from ollama.com/download.
curl -fsSL https://ollama.com/install.sh | sh

# Windows: Download the native installer from ollama.com/download.
# WSL2 is an alternative if you prefer a Linux environment.

# Pull the DeepSeek R1 7B distilled variant (adjust tag for other sizes)
ollama pull deepseek-r1:7b

# For the 14B variant:
# ollama pull deepseek-r1:14b

# For the 32B variant:
# ollama pull deepseek-r1:32b

# Verify the model is downloaded
ollama list

# Run a quick CLI test
ollama run deepseek-r1:7b "What is the sum of the first 50 prime numbers? Think step by step."

The CLI test should produce visible <think> tags containing the model's reasoning chain, followed by a final answer. If the response appears, the model is functional and ready for application integration.

Running DeepSeek R1 with Docker

Docker Setup for Containerized Deployment

Docker provides reproducible, isolated environments that simplify deployment across development and production machines. Containerizing Ollama ensures consistent behavior regardless of the host OS and makes teardown trivial.

GPU Passthrough Configuration

NVIDIA GPU passthrough requires the NVIDIA Container Toolkit. Install it following NVIDIA's official documentation for the host OS, then verify:

# Replace the tag with a version matching your installed CUDA toolkit.
# Verify available tags at hub.docker.com/r/nvidia/cuda/tags
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi

With GPU access confirmed, define the Ollama service in a docker-compose.yml file:

# docker-compose.yml
# Docker Compose v2+: the top-level version field is deprecated and should be omitted

services:
  ollama:
    image: ollama/ollama:0.6.5
    container_name: deepseek-ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

volumes:
  ollama_data:

Note: The deploy.resources syntax requires Docker Compose v2 (docker compose, not the legacy docker-compose v1 binary).

After running docker compose up -d, pull the model inside the container:

docker exec -it deepseek-ollama ollama pull deepseek-r1:7b
docker exec -it deepseek-ollama ollama run deepseek-r1:7b "Explain why 0.1 + 0.2 != 0.3 in IEEE 754."

Validate GPU access with docker exec -it deepseek-ollama nvidia-smi. If the GPU is not visible, confirm the NVIDIA Container Toolkit is installed and the Docker daemon has been restarted.

Inference Optimization Techniques

Quantization and Performance Tuning

For most local deployments, Q4_K_M strikes the best tradeoff: it cuts weight memory to roughly a third of FP16, while community perplexity benchmarks typically show a perplexity increase of less than 2 points over FP16 on standard evals. Run your own perplexity comparison to confirm for your workload. For tasks demanding higher precision (complex mathematical proofs, nuanced code generation), Q5_K_M or Q8_0 can improve output quality at the cost of memory and speed.

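A larger context window uses more VRAM, because the KV cache grows linearly with the number of tokens held in context. Ollama applies a default context length unless the model's Modelfile sets one (inspect the model with ollama show deepseek-r1:7b). Override it per request with the num_ctx option, fix it in a custom Modelfile with PARAMETER num_ctx 4096, or, in recent Ollama releases, set OLLAMA_CONTEXT_LENGTH=4096 before starting ollama serve. Moving from 2048 to 4096 or 8192 tokens doubles or quadruples the KV cache on top of the fixed weight memory.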
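To get a feel for how context length translates into memory, the KV cache can be estimated from a model's attention shape. This is a sketch under stated assumptions: the layer count, KV head count, and head dimension below are hypothetical example values for a 7B-class model with grouped-query attention, and the cache is assumed to hold 16-bit values. Read the real numbers from the model card before relying on the result.

```javascript
// Estimate KV cache size in bytes for a given context length.
// Keys and values (factor of 2) are cached per layer, per KV head, per token.
function kvCacheBytes({ layers, kvHeads, headDim, contextTokens, bytesPerElement = 2 }) {
  return 2 * layers * kvHeads * headDim * contextTokens * bytesPerElement;
}

// Hypothetical architecture values, for illustration only.
const exampleModel = { layers: 28, kvHeads: 4, headDim: 128 };

for (const contextTokens of [2048, 4096, 8192]) {
  const mib = kvCacheBytes({ ...exampleModel, contextTokens }) / 2 ** 20;
  console.log(`${contextTokens} tokens: ~${mib.toFixed(0)} MiB of KV cache`);
}
```

Doubling contextTokens doubles the estimate, which is the linear growth to budget for when raising the context window.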

Tune the CPU thread count with the num_thread option, which can be set per request in the options object or fixed in a custom Modelfile with PARAMETER num_thread. It affects CPU execution threads only. For CPU-bound portions of inference, match the physical core count rather than the hyperthreaded count; in informal tests, physical-core-only configurations outperformed hyperthreaded counts by roughly 10-20% on CPU-bound layers.
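Ollama's generate API also accepts runtime settings per request through an options object, which is a convenient way to experiment with context size and thread count without restarting the server. The sketch below builds such a request body; buildGenerateBody is a hypothetical helper name, while num_ctx and num_thread are option keys from Ollama's API.

```javascript
// Build a /api/generate request body with optional per-request tuning.
// numCtx sets the context window in tokens; numThread sets CPU threads.
function buildGenerateBody(model, prompt, { numCtx, numThread } = {}) {
  const options = {};
  if (Number.isInteger(numCtx) && numCtx > 0) options.num_ctx = numCtx;
  if (Number.isInteger(numThread) && numThread > 0) options.num_thread = numThread;

  const body = { model, prompt, stream: false };
  // Only attach options when at least one override was given,
  // so the server falls back to the model's own defaults otherwise.
  if (Object.keys(options).length > 0) body.options = options;
  return body;
}

const body = buildGenerateBody("deepseek-r1:7b", "Why is the sky blue?", {
  numCtx: 4096,
  numThread: 8,
});
console.log(JSON.stringify(body, null, 2));
```

The resulting object can be passed to JSON.stringify and sent as the POST body to /api/generate.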

Monitoring Performance

Ollama's /api/generate endpoint returns metadata including total_duration, load_duration, prompt_eval_count, prompt_eval_duration, eval_count, and eval_duration. The duration fields are reported in nanoseconds, so dividing a token count by its duration (converted to seconds) gives tokens per second.

// benchmark.mjs — Node.js performance monitoring script
// Requires Node.js 18+. Confirm with: node --version
const OLLAMA_URL = process.env.OLLAMA_URL || "http://localhost:11434";
const MODEL = process.env.MODEL || "deepseek-r1:7b";

async function benchmark(prompt) {
  const start = Date.now();
  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), 120_000);

  try {
    const res = await fetch(`${OLLAMA_URL}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: MODEL,
        prompt,
        stream: false,
      }),
      signal: controller.signal,
    });

    if (!res.ok) {
      const text = await res.text();
      console.error(`Ollama returned HTTP ${res.status}: ${text}`);
      return;
    }

    const data = await res.json();

    const evalDurSec = data.eval_duration / 1e9;
    const promptDurSec = data.prompt_eval_duration / 1e9;
    const evalTokensPerSec = evalDurSec > 0 ? data.eval_count / evalDurSec : 0;
    const promptTokensPerSec = promptDurSec > 0 ? data.prompt_eval_count / promptDurSec : 0;

    console.log(`Model: ${MODEL}`);
    console.log(`Prompt eval: ${data.prompt_eval_count} tokens at ${promptTokensPerSec.toFixed(1)} t/s`);
    console.log(`Generation: ${data.eval_count} tokens at ${evalTokensPerSec.toFixed(1)} t/s`);
    console.log(`Model load time: ${(data.load_duration / 1e9).toFixed(2)}s`);
    console.log(`Total duration: ${(data.total_duration / 1e9).toFixed(2)}s`);
    console.log(`Wall clock: ${((Date.now() - start) / 1000).toFixed(2)}s`);
    console.log(`\nResponse:\n${(data.response ?? "(no response)").slice(0, 300)}...`);
  } catch (err) {
    if (err.name === "AbortError") {
      console.error("Benchmark timed out after 120s");
    } else {
      console.error("Benchmark error:", err);
    }
  } finally {
    clearTimeout(timeout);
  }
}

benchmark(
  "Solve this step by step: A farmer has 17 sheep. All but 9 die. How many are left? Explain your reasoning."
);

Run with node benchmark.mjs (requires Node.js 18+ for native fetch). If tokens per second falls below expectations for the hardware, consider reducing the context window size, switching to a lower-bit quantization, or verifying that GPU offloading is active (ollama ps shows how a loaded model is split between CPU and GPU).
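A single benchmark run can be noisy, especially when the first request includes model load time. As a minimal sketch, assuming you have collected several parsed /api/generate responses into an array, the helper below averages generation throughput across them. The summarizeRuns name is hypothetical; eval_count and eval_duration are the response fields described above.

```javascript
// Summarize generation throughput across several /api/generate responses.
// Each run needs eval_count (tokens) and eval_duration (nanoseconds).
function summarizeRuns(runs) {
  const rates = runs
    .filter((run) => run.eval_duration > 0)
    .map((run) => run.eval_count / (run.eval_duration / 1e9));
  if (rates.length === 0) return null;

  const mean = rates.reduce((sum, rate) => sum + rate, 0) / rates.length;
  return {
    runs: rates.length,
    mean,
    min: Math.min(...rates),
    max: Math.max(...rates),
  };
}

// Example with two made-up runs: 100 tokens in 2 s and 300 tokens in 4 s.
const summary = summarizeRuns([
  { eval_count: 100, eval_duration: 2e9 },
  { eval_count: 300, eval_duration: 4e9 },
]);
console.log(summary); // { runs: 2, mean: 62.5, min: 50, max: 75 }
```

Comparing the mean for two model tags or quantization levels on the same prompts gives a steadier signal than any single run.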

For workloads requiring concurrent requests or higher throughput, vLLM and llama.cpp's server mode offer more granular control over batching, KV cache management, and tensor parallelism. These backends demand more configuration, but per community benchmarks and llama.cpp documentation they outperform Ollama under heavy concurrent load.

Building a Node.js + React Integration

Creating the Node.js API Server

The Express server acts as a proxy between the React frontend and the local Ollama instance, streaming responses via Server-Sent Events.

First, scaffold the project and install dependencies:

mkdir deepseek-server && cd deepseek-server
npm init -y
npm pkg set type=module
npm install express cors

⚠️ Production warning: The server shown below is a development proxy. Before any network-accessible deployment, add authentication (e.g., a static API key header check), rate limiting (e.g., express-rate-limit), and review the prompt length cap. An unauthenticated inference endpoint allows anyone on the network to saturate your GPU resources.

// server.js
// Requires Node.js 18+. Confirm with: node --version
import express from "express";
import cors from "cors";

const app = express();

const CORS_ORIGIN = process.env.CORS_ORIGIN || "http://localhost:5173";
app.use(cors({ origin: CORS_ORIGIN }));
app.use(express.json());

const OLLAMA_URL = process.env.OLLAMA_URL || "http://localhost:11434";
const MODEL = process.env.MODEL || "deepseek-r1:7b";
const TIMEOUT_MS = parseInt(process.env.TIMEOUT_MS || "120000", 10) || 120000;
const MAX_PROMPT_BYTES = parseInt(process.env.MAX_PROMPT_BYTES || "32768", 10);

app.post("/api/chat", async (req, res) => {
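  // req.body can be undefined when no JSON body was sent, so default to {}
  const { prompt } = req.body ?? {};
  if (typeof prompt !== "string" || !prompt.trim()) {
    return res.status(400).json({ error: "prompt must be a non-empty string" });
  }
  if (Buffer.byteLength(prompt, "utf8") > MAX_PROMPT_BYTES) {
    return res.status(413).json({ error: "prompt exceeds maximum allowed length" });
  }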

  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const controller = new AbortController();
  const timeout = setTimeout(() => controller.abort(), TIMEOUT_MS);

  let reader;

  try {
    const ollamaRes = await fetch(`${OLLAMA_URL}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model: MODEL, prompt, stream: true }),
      signal: controller.signal,
    });

    if (!ollamaRes.ok) {
      res.write(`data: ${JSON.stringify({ error: "Ollama error", status: ollamaRes.status })}\n\n`);
      // The finally block sends [DONE] and closes the response exactly once.
      return;
    }

    reader = ollamaRes.body.getReader();
    const decoder = new TextDecoder();
    let buffer = "";

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // Ollama streams newline-delimited JSON, and a network chunk can end
      // mid-line, so keep the trailing partial line buffered until complete.
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop();
      for (const line of lines.filter(Boolean)) {
        try {
          const parsed = JSON.parse(line);
          res.write(`data: ${JSON.stringify({ token: parsed.response, done: parsed.done })}\n\n`);
        } catch {
          // skip malformed lines
        }
      }
    }
  } catch (err) {
    if (reader) {
      await reader.cancel().catch(() => {});
    }
    if (err.name === "AbortError") {
      res.write(`data: ${JSON.stringify({ error: "Request timed out" })}\n\n`);
    } else {
      console.error("Stream error:", err);
      res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    }
  } finally {
    clearTimeout(timeout);
    // Guard against writing to a response that has already been ended
    if (!res.writableEnded) {
      res.write("data: [DONE]\n\n");
      res.end();
    }
  }
});

const PORT = process.env.PORT || 3001;
app.listen(PORT, () => console.log(`API server running on port ${PORT}`));
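
The production warning above mentions a static API key header check. A minimal sketch of that idea as Express-style middleware follows. The x-api-key header name, the requireApiKey and safeEqual helper names, and the API_KEY environment variable are all assumptions for illustration. The hand-written comparison avoids returning early on the first mismatched character; in a real deployment, Node's crypto.timingSafeEqual is the more robust choice.

```javascript
// Compare two strings without returning early on the first mismatch.
function safeEqual(a, b) {
  let diff = a.length ^ b.length;
  for (let i = 0; i < a.length; i++) {
    // charCodeAt returns NaN past the end of b; treat that as 0.
    diff |= a.charCodeAt(i) ^ (b.charCodeAt(i) || 0);
  }
  return diff === 0;
}

// Middleware factory: rejects requests whose x-api-key header
// does not match expectedKey with HTTP 401.
function requireApiKey(expectedKey) {
  if (!expectedKey) throw new Error("expectedKey must be a non-empty string");
  return function apiKeyMiddleware(req, res, next) {
    const provided = req.headers["x-api-key"];
    if (typeof provided !== "string" || !safeEqual(provided, expectedKey)) {
      return res.status(401).json({ error: "invalid or missing API key" });
    }
    next();
  };
}

// Usage in server.js, registered before the /api/chat route
// (API_KEY is a hypothetical environment variable):
// app.use("/api", requireApiKey(process.env.API_KEY));
```

The frontend would then need to send the same header with each request, and rate limiting would still have to be added separately.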

Building the React Chat Interface

DeepSeek R1 wraps its reasoning steps inside <think>...</think> tags. The frontend parses these to display reasoning separately from the final answer.

First, scaffold the React project:

npm create vite@latest deepseek-ui -- --template react
cd deepseek-ui
npm install

Create a .env file in the deepseek-ui directory:

VITE_API_URL=http://localhost:3001

Then replace the contents of src/App.jsx with an import of the Chat component, and create src/Chat.jsx:

// src/Chat.jsx
import { useState, useRef, useEffect } from "react";

const API_URL = import.meta.env.VITE_API_URL || "http://localhost:3001";

function parseThinkBlocks(text) {
  const thinkRegex = /<think>([\s\S]*?)<\/think>/g;
  const reasoning = [];
  let match;
  while ((match = thinkRegex.exec(text)) !== null) {
    reasoning.push(match[1].trim());
  }
  thinkRegex.lastIndex = 0;
  const answer = text.replace(thinkRegex, "").trim();
  return { reasoning, answer };
}

export default function Chat() {
  const [prompt, setPrompt] = useState("");
  const [rawResponse, setRawResponse] = useState("");
  const [loading, setLoading] = useState(false);
  const [showReasoning, setShowReasoning] = useState(false);
  const abortRef = useRef(null);

  // Clean up any in-flight stream on unmount
  useEffect(() => {
    return () => abortRef.current?.abort();
  }, []);

  async function handleSubmit(e) {
    e.preventDefault();
    if (!prompt.trim() || loading) return;

    setRawResponse("");
    setLoading(true);
    setShowReasoning(false);

    const controller = new AbortController();
    abortRef.current = controller;

    try {
      const res = await fetch(`${API_URL}/api/chat`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ prompt }),
        signal: controller.signal,
      });

      // Validation failures (400/413) come back as plain JSON, not as a stream
      if (!res.ok) {
        throw new Error(`Server returned HTTP ${res.status}`);
      }

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let accumulated = "";
      let buffer = "";
      let doneStreaming = false;

      while (!doneStreaming) {
        const { done, value } = await reader.read();
        if (done) break;
        // Keep any partial SSE line buffered until the rest of it arrives
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop();
        for (const line of lines) {
          if (!line.startsWith("data: ")) continue;
          const payload = line.slice(6);
          if (payload === "[DONE]") { doneStreaming = true; break; }
          try {
            const parsed = JSON.parse(payload);
            if (parsed.error) {
              accumulated += `\n[Error: ${parsed.error}]`;
            } else if (parsed.token) {
              accumulated += parsed.token;
            }
          } catch {
            // skip malformed payloads
          }
        }
        setRawResponse(accumulated);
      }
    } catch (err) {
      if (err.name !== "AbortError") {
        setRawResponse((prev) => prev + `\n[Error: ${err.message}]`);
      }
    } finally {
      setLoading(false);
    }
  }

  const { reasoning, answer } = parseThinkBlocks(rawResponse);

  return (
    <div style={{ maxWidth: 720, margin: "2rem auto", fontFamily: "system-ui" }}>
      <h1>DeepSeek R1 Local Chat</h1>
      <form onSubmit={handleSubmit}>
        <textarea
          value={prompt}
          onChange={(e) => setPrompt(e.target.value)}
          rows={4}
          style={{ width: "100%", fontSize: "1rem", padding: "0.5rem" }}
          placeholder="Enter a reasoning-heavy prompt..."
        />
        <div style={{ marginTop: "0.5rem", display: "flex", gap: "0.5rem" }}>
          <button type="submit" disabled={loading}>
            {loading ? "Generating..." : "Send"}
          </button>
          <button
            type="button"
            onClick={() => abortRef.current?.abort()}
            disabled={!loading}
          >
            Cancel
          </button>
        </div>
      </form>

      {reasoning.length > 0 && (
        <div style={{ marginTop: "1rem" }}>
          <button onClick={() => setShowReasoning(!showReasoning)}>
            {showReasoning ? "Hide" : "Show"} Reasoning ({reasoning.length} block
            {reasoning.length > 1 ? "s" : ""})
          </button>
          {showReasoning && (
            <pre
              style={{
                background: "#f0f4f8",
                padding: "1rem",
                borderRadius: 6,
                whiteSpace: "pre-wrap",
                marginTop: "0.5rem",
                fontSize: "0.9rem",
              }}
            >
              {reasoning.join("\n\n---\n\n")}
            </pre>
          )}
        </div>
      )}

      {answer && (
        <div style={{ marginTop: "1rem", lineHeight: 1.6 }}>
          <h2>Answer</h2>
          <div style={{ whiteSpace: "pre-wrap" }}>{answer}</div>
        </div>
      )}
    </div>
  );
}
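
One caveat with parseThinkBlocks in Chat.jsx: while a response is still streaming, the closing </think> tag has usually not arrived yet, so the regex finds no match and the partial reasoning is shown as part of the answer. A possible variant that treats an unterminated block as reasoning in progress is sketched below. The parseThinkBlocksStreaming name and the extra thinking flag are assumptions for illustration, and a tag that is split across chunks (for example a trailing "<thi") will still appear briefly in the answer.

```javascript
// Split model output into reasoning and answer, tolerating a <think> block
// whose closing tag has not arrived yet (as happens mid-stream).
function parseThinkBlocksStreaming(text) {
  const OPEN = "<think>";
  const CLOSE = "</think>";
  const reasoning = [];
  let answer = "";
  let cursor = 0;
  let thinking = false;

  while (cursor < text.length) {
    const open = text.indexOf(OPEN, cursor);
    if (open === -1) {
      // No more reasoning blocks: the remainder is answer text.
      answer += text.slice(cursor);
      break;
    }
    answer += text.slice(cursor, open);
    const bodyStart = open + OPEN.length;
    const close = text.indexOf(CLOSE, bodyStart);
    if (close === -1) {
      // Unterminated block: everything after <think> is reasoning so far.
      reasoning.push(text.slice(bodyStart).trim());
      thinking = true;
      break;
    }
    reasoning.push(text.slice(bodyStart, close).trim());
    cursor = close + CLOSE.length;
  }

  return { reasoning, answer: answer.trim(), thinking };
}

console.log(parseThinkBlocksStreaming("<think>17 minus 8"));
// { reasoning: [ '17 minus 8' ], answer: '', thinking: true }
```

The thinking flag could drive a "Reasoning..." indicator in the UI until the closing tag arrives.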

Running the Full Stack Locally

Start the services in order: Ollama (or the Docker container), then the Node.js server, then the React development server.

# Terminal 1: Start Ollama (only if NOT already running as a system service)
# If Ollama was installed via the install script and is already running as a system service,
# skip this command to avoid a port conflict on 11434.
# Check with: systemctl status ollama (Linux)
ollama serve

# Terminal 2: Start the Node.js API server
cd deepseek-server
node server.js

# Terminal 3: Start the React dev server
cd deepseek-ui
npm run dev

Test with a reasoning-heavy prompt such as "If a bat and a ball cost $1.10 together, and the bat costs $1.00 more than the ball, how much does the ball cost? Show all reasoning."

Common issues: "connection refused" typically means Ollama is not running or is bound to a different port. Out-of-memory errors indicate the model variant exceeds available VRAM; drop to a smaller variant or a lower-bit quantization. Slow first-token latency is expected on the initial request as the model loads into GPU memory; subsequent requests reuse the loaded model until Ollama unloads it after a period of inactivity.

Complete Implementation Checklist

  1. ☐ Verify hardware meets minimum requirements for chosen model variant (see VRAM table above)
  2. ☐ Install NVIDIA drivers and CUDA toolkit 12.x (or confirm Apple Silicon unified memory is sufficient)
  3. ☐ Install Ollama and pull the target DeepSeek R1 model variant
  4. ☐ Verify CLI inference works with a test prompt and confirm <think> tags appear
  5. ☐ Set up Docker with GPU passthrough using the provided docker-compose.yml (optional but recommended)
  6. ☐ Tune quantization level (Q4_K_M default, Q5_K_M or Q8_0 for higher fidelity) and context window size
  7. ☐ Benchmark tokens per second using the Node.js benchmarking script and confirm acceptable performance
  8. ☐ Build the Node.js Express API proxy with SSE streaming support
  9. ☐ Build the React chat UI with <think> block parsing and collapsible reasoning display
  10. ☐ Test end-to-end with complex reasoning prompts across multiple domains (math, logic, code)
  11. ☐ Configure for production: add authentication, rate limiting, and review prompt length caps on the API server; add a process manager (pm2 or systemd), structured error handling, and request logging

What Comes Next

The entire pipeline now runs on local hardware with prompts and responses staying on your machine. Review Ollama's telemetry settings to confirm no usage data is transmitted if full network isolation is required.

Natural next steps include fine-tuning distilled variants on domain-specific datasets using LoRA adapters (practical for the 7B and 14B distilled variants on the hardware described here; larger variants require significantly more VRAM and training infrastructure), integrating RAG pipelines with local vector databases like ChromaDB, and deploying the containerized setup on private servers for team-wide access. The Ollama model library (ollama.com/library) and DeepSeek's official documentation provide updated model tags and configuration references.

To find the right balance between quality and throughput for your workload, run the benchmarking script from this guide against two quantization levels and compare output quality on a fixed eval set.

SitePoint Team


© 2000 – 2026 SitePoint Pty. Ltd.