How to Build a Private Local RAG System

Install Ollama and pull a local LLM (e.g., Mistral 7B) plus an embedding model (nomic-embed-text).
Run ChromaDB as a Docker container bound to localhost for vector storage.
Initialize a Node.js backend with LangChain, Express, and the ChromaDB client.
Build a document ingestion pipeline that parses PDFs, Markdown, and text files into chunks.
Generate embeddings locally for each chunk and store them in ChromaDB with source metadata.
Create a RAG query chain that embeds the user's question, retrieves top-k chunks, and streams an LLM response.
Connect a React frontend with drag-and-drop upload and a streaming chat interface with source attribution.
Verify network isolation by confirming no outbound connections leave the host during operation.

Why Go Local with RAG?
How Local RAG Works: Core Concepts
Setting Up the Local AI Infrastructure
Building the Document Ingestion Pipeline
Vector Storage with ChromaDB
The RAG Query Engine: Tying It Together
Building the React Frontend
Implementation Checklist and Performance Tuning
Security and Privacy Considerations
Next Steps

Why Go Local with RAG?

Retrieval-Augmented Generation (RAG) grounds large language model responses in specific documents by retrieving relevant content at query time and injecting it into the prompt context. Instead of relying on a model's training data alone, RAG retrieves chunks of text from a local corpus, producing answers that cite actual source material. Building a local RAG pipeline with JavaScript keeps sensitive data entirely on private infrastructure, eliminating cloud API costs, network round trips to external services, and the privacy risks that come with sending proprietary documents to third parties.

The privacy problem is simple: every document sent to a cloud API like OpenAI or Anthropic traverses external networks and lands on infrastructure outside an organization's control. For legal documents, medical records, internal company knowledge bases, and financial data, that exposure may violate regulatory requirements or internal security policies. A properly configured local system keeps all data on the machine.

This tutorial builds a complete, private document Q&A system using JavaScript, Node.js, and React. The stack replaces the Python-dominated RAG ecosystem with a full-stack JS approach, using open-source models and local vector storage. By the end, readers will have a working pipeline that ingests documents, stores embeddings, retrieves relevant context, and generates grounded answers through a chat interface.

Prerequisites:

Node.js 18 or later (node --version to confirm)
Docker Desktop or Docker Engine running
Basic familiarity with React
Minimum 8GB RAM (supports quantized 3B models only); 16GB recommended for Mistral 7B
Comfort with CLI tools
macOS or Linux recommended (Windows works but path differences are not covered here)
NVIDIA GPU users: CUDA toolkit 11.3 or later and compatible drivers installed (see Performance Tips)

How Local RAG Works: Core Concepts

The RAG Pipeline Explained

The pipeline follows a linear flow: document ingestion, chunking, embedding, vector storage, query embedding, retrieval, and generation. Loaders first parse documents into raw text. The chunker splits that text into fixed-size segments. The embedding model then converts each chunk into a numerical vector that captures its semantic meaning, and the pipeline stores those vectors in a vector database. When a user submits a question, the same model embeds the question, and the vector store returns the most semantically similar chunks. The prompt template receives those chunks alongside the user's question, and a local LLM generates an answer grounded in the retrieved context.

The flow looks like this:

PDF/MD/TXT → Loader → Chunker → Embedding Model → ChromaDB
                                                        ↓
User Question → Embedding Model → Similarity Search → Top-K Chunks
                                                        ↓
                                              Prompt Template + LLM → Answer
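
The retrieval step in the flow above can be sketched in a few lines of plain JavaScript. This is a toy illustration, not what ChromaDB runs internally: the three-dimensional vectors and chunk IDs are made up, and a real vector store uses an approximate index rather than scoring every entry. It does show the core idea that chunks are ranked by how closely their vectors point in the same direction as the query vector.

```javascript
// Cosine similarity: 1 means same direction, 0 means unrelated.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every stored entry against the query and keep the k best.
function topK(queryVector, entries, k) {
  return entries
    .map((e) => ({ ...e, score: cosineSimilarity(queryVector, e.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Hypothetical 3-dimensional "embeddings" (real ones have hundreds of dimensions)
const store = [
  { id: "refund-policy", vector: [0.9, 0.1, 0.0] },
  { id: "shipping-times", vector: [0.1, 0.9, 0.1] },
  { id: "office-hours", vector: [0.0, 0.2, 0.9] },
];

const results = topK([0.8, 0.2, 0.0], store, 2);
console.log(results.map((r) => r.id));
// [ 'refund-policy', 'shipping-times' ]
```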

Why Local Models Change the Equation

Local embedding models, such as nomic-embed-text served through Ollama, replace cloud embedding APIs entirely. Local LLMs served via Ollama replace calls to OpenAI or Anthropic endpoints. The trade-offs are real: local models are typically slower than cloud inference (especially without GPU acceleration), smaller models may produce lower quality outputs than GPT-4 class models, and hardware requirements become the developer's responsibility. But the privacy guarantee is strong. With outbound traffic blocked at the firewall, nothing leaves the machine: no API keys and no data exposure to third parties. Verify with netstat as described in the Security section to confirm no outbound connections.

Note: By default, Ollama may perform update checks to ollama.com, and ollama pull needs network access to download models. To enforce complete network isolation, pull the models first, then block outbound connections at the firewall level.

Setting Up the Local AI Infrastructure

Installing Ollama for Local LLM Inference

Ollama wraps model downloading, quantization, and serving behind a single local HTTP API.

# macOS (via Homebrew)
brew install ollama

# Linux (official install script)
curl -fsSL https://ollama.com/install.sh | sh

# Windows: download installer from https://ollama.com/download

# Start the Ollama server
ollama serve

# Pull a suitable model for generation (in a separate terminal)
ollama pull mistral

# Pull an embedding model
ollama pull nomic-embed-text

# Verify installation with a test prompt
ollama run mistral "What is retrieval-augmented generation?"

Mistral 7B is a reasonable default for 16 GB machines; it produces multi-paragraph answers and runs without swap pressure. Llama 3.1 8B is an alternative with a larger training corpus and similar RAM requirements. For embedding, nomic-embed-text targets retrieval tasks and runs efficiently on consumer hardware.

Project Initialization and Dependencies

mkdir local-rag && cd local-rag
mkdir backend frontend

# Backend setup
cd backend
npm init -y

# Install core dependencies
npm install langchain @langchain/community @langchain/ollama @langchain/textsplitters chromadb pdf-parse express multer cors

# Commit package-lock.json; use npm ci for reproducible installs.

cd ../frontend
npm create vite@latest . -- --template react
npm install

The backend package.json dependencies should include:

{
  "name": "local-rag-backend",
  "version": "1.0.0",
  "type": "module",
  "dependencies": {
    "@langchain/community": "^0.3.0",
    "@langchain/ollama": "^0.1.0",
    "@langchain/textsplitters": "^0.1.0",
    "chromadb": "^1.9.2",
    "cors": "^2.8.5",
    "express": "^4.21.0",
    "langchain": "^0.3.0",
    "multer": "^1.4.5-lts.1",
    "pdf-parse": "^1.1.1"
  }
}

Important: LangChain's 0.x releases may introduce breaking import path changes between minor versions. Always commit package-lock.json to version control and use npm ci instead of npm install when setting up in new environments.

Building the Document Ingestion Pipeline

Loading and Parsing Documents

The ingestion layer needs to handle PDF, Markdown, and plain text files. LangChain provides document loaders for each format, and Express with Multer handles file uploads.

Note: Multer saves uploaded files with generated temp names that have no file extension. The loadDocument function must use the original filename to determine the file type, not the temp file path.

// backend/src/ingest.js
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";
import { TextLoader } from "langchain/document_loaders/fs/text";
import fs from "fs";
import path from "path";

export async function loadDocument(filePath, originalName) {
  const ext = path.extname(originalName).toLowerCase();
  let loader;

  switch (ext) {
    case ".pdf":
      loader = new PDFLoader(filePath);
      break;
    case ".md":
    case ".txt":
      loader = new TextLoader(filePath);
      break;
    default:
      throw new Error(`Unsupported file type: ${ext}`);
  }

  const docs = await loader.load();
  return docs;
}

Below is the complete server.js file. Both the ingestion and query routes are included in a single file — do not split them into separate files or duplicate app.listen.

// backend/src/server.js
import express from "express";
import multer from "multer";
import cors from "cors";
import path from "path";
import fs from "fs";
import { loadDocument } from "./ingest.js";
import { chunkDocuments } from "./chunker.js";
import { embedAndStore } from "./embedder.js";
import { queryRAG } from "./ragchain.js";

const app = express();

// Restrict CORS to the known frontend origin only
app.use(cors({ origin: process.env.CORS_ORIGIN ?? "http://localhost:5173" }));
app.use(express.json());

// Ensure uploads directory exists
fs.mkdirSync("uploads/", { recursive: true });

const upload = multer({
  dest: "uploads/",
  limits: { fileSize: 50 * 1024 * 1024 }, // 50 MB hard cap
  fileFilter: (req, file, cb) => {
    const allowed = [".pdf", ".md", ".txt"];
    const ext = path.extname(file.originalname).toLowerCase();
    if (allowed.includes(ext)) {
      cb(null, true);
    } else {
      cb(new Error(`File type ${ext} not allowed. Accepted: ${allowed.join(", ")}`));
    }
  },
});

app.post("/api/ingest", upload.single("document"), async (req, res) => {
  // Guard: Multer may have rejected the file
  if (!req.file) {
    return res.status(400).json({ error: "No valid file received." });
  }

  const filePath = req.file.path;
  const originalName = req.file.originalname;

  try {
    const docs = await loadDocument(filePath, originalName);
    const chunks = await chunkDocuments(docs, originalName);
    await embedAndStore(chunks);

    res.json({
      success: true,
      filename: originalName,
      chunksCreated: chunks.length,
    });
  } catch (err) {
    // Do not expose internal error details to the client
    console.error({ event: "ingest_error", filename: originalName, error: err.message, ts: Date.now() });
    res.status(500).json({ error: "Ingestion failed. Check server logs." });
  } finally {
    // Always attempt cleanup regardless of success or failure
    fs.unlink(filePath, (unlinkErr) => {
      if (unlinkErr) console.warn("Temp file cleanup failed:", unlinkErr.message);
    });
  }
});

app.post("/api/query", async (req, res) => {
  const { question } = req.body;
  if (!question || typeof question !== "string" || !question.trim()) {
    return res.status(400).json({ error: "Question required" });
  }
  // Truncate to prevent prompt manipulation via very long questions
  const safeQuestion = question.trim().slice(0, 1000);
  await queryRAG(safeQuestion, res);
});

const PORT = process.env.PORT ?? 3001;
app.listen(PORT, () => console.log(`RAG backend running on port ${PORT}`));

Chunking Strategy for Optimal Retrieval

Chunk size and overlap directly affect retrieval quality. Chunks that are too large dilute the relevant signal with noise; chunks that are too small lose necessary context. Recursive character text splitting breaks text along natural boundaries (paragraphs, sentences, then characters) before falling back to hard splits.

Note: chunkSize is measured in characters, not tokens. At average English prose density, 500 characters is roughly 125 tokens. 500-character chunks with a 50-character overlap are a common starting point for prose-heavy documents, though code documentation or structured data may benefit from larger chunks.

// backend/src/chunker.js
// Primary import for langchain ^0.3.x:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
// Fallback if the above package is absent (older installs):
// import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";

export async function chunkDocuments(docs, sourceFilename) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 500,
    chunkOverlap: 50,
    separators: ["\n\n", "\n", ". ", " ", ""],
  });

  const chunks = await splitter.splitDocuments(docs);

  // Attach metadata for source tracking
  return chunks.map((chunk, index) => ({
    ...chunk,
    metadata: {
      ...chunk.metadata,
      source: sourceFilename,
      chunkIndex: index,
    },
  }));
}
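
To build intuition for what chunkSize and chunkOverlap do, here is a deliberately simplified fixed-window splitter. It is not how RecursiveCharacterTextSplitter works (that class prefers paragraph and sentence boundaries, so its chunks vary in length); slidingWindowChunks is a hypothetical helper that only shows how consecutive chunks share their edges.

```javascript
// Simplified sketch: cut text into fixed-size windows that overlap.
function slidingWindowChunks(text, chunkSize, chunkOverlap) {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap; // how far each window advances
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last window reached the end
  }
  return chunks;
}

// 1,200 characters of repeating a..z so the overlap is easy to check
const sample = Array.from({ length: 1200 }, (_, i) =>
  String.fromCharCode(97 + (i % 26))
).join("");

const windows = slidingWindowChunks(sample, 500, 50);
console.log(windows.map((c) => c.length));
// [ 500, 500, 300 ]
console.log(windows[0].slice(-50) === windows[1].slice(0, 50));
// true
```

The last 50 characters of each window reappear at the start of the next one, which is what lets a sentence that straddles a boundary survive intact in at least one chunk.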

Generating Embeddings Locally

The nomic-embed-text model through Ollama generates embeddings entirely on the local machine. The model converts each chunk into a vector that captures its semantic meaning, enabling similarity-based retrieval later.

To avoid configuration drift, both the ingestion and retrieval paths share a single embeddings instance:

// backend/src/embeddings.js
import { OllamaEmbeddings } from "@langchain/ollama";

const embeddings = new OllamaEmbeddings({
  model: process.env.EMBED_MODEL ?? "nomic-embed-text",
  baseUrl: process.env.OLLAMA_BASE_URL ?? "http://localhost:11434",
});

export default embeddings;

// backend/src/embedder.js
import crypto from "crypto";
import embeddings from "./embeddings.js";
import { getChromaCollection } from "./vectorstore.js";

const BATCH_SIZE = 32;

export async function embedAndStore(chunks) {
  if (!chunks || chunks.length === 0) {
    console.warn("embedAndStore called with empty chunks array — skipping.");
    return 0;
  }

  const collection = await getChromaCollection();

  const texts = chunks.map((c) => c.pageContent);
  const metadatas = chunks.map((c) => c.metadata);

  // Use content-based IDs so re-ingesting the same document upserts rather than duplicates
  const ids = chunks.map((c) => {
    const hash = crypto
      .createHash("sha256")
      // Include content so re-upload of different file with same name produces distinct IDs
      .update(`${c.metadata.source}_${c.metadata.chunkIndex}_${c.pageContent}`)
      .digest("hex")
      .slice(0, 16);
    return `chunk_${hash}`;
  });

  // Batch embedding to avoid single oversized request to Ollama
  const vectors = [];
  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    const batch = texts.slice(i, i + BATCH_SIZE);
    const batchVectors = await embeddings.embedDocuments(batch);
    vectors.push(...batchVectors);
  }

  console.log(
    `Generated ${vectors.length} embeddings, dimension: ${vectors[0].length}`
  );
  // Expected dimension for nomic-embed-text: 768

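  // upsert overwrites entries whose IDs already exist, so re-ingesting a document does not duplicate chunks
  await collection.upsert({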
    ids,
    embeddings: vectors,
    documents: texts,
    metadatas,
  });

  return vectors.length;
}

Vector Storage with ChromaDB

Running ChromaDB Locally

ChromaDB requires no external dependencies beyond Docker and exposes a simple HTTP API, which keeps the setup minimal. It runs as a Docker container accessed via an HTTP client from Node.js.

Note: The JavaScript chromadb npm package is an HTTP client only. True in-process (embedded) mode requires the Python ChromaDB library.

// backend/src/vectorstore.js
import { ChromaClient } from "chromadb";

const client = new ChromaClient({ path: "http://localhost:8000" });

export async function getChromaCollection() {
  const collection = await client.getOrCreateCollection({
    name: "local_rag_docs",
    metadata: { "hnsw:space": "cosine" },
  });
  return collection;
}

Start ChromaDB before starting the backend. The backend server expects ChromaDB to be available on port 8000 at startup.

# Pin the ChromaDB image version to match the chromadb npm package.
# Binding to 127.0.0.1 restricts access to localhost only (recommended).
# The volume flag persists data across container restarts.
docker run -d -p 127.0.0.1:8000:8000 -v chroma-data:/chroma/chroma chromadb/chroma:0.5.20

Important: Ensure the ChromaDB Docker image version matches your chromadb npm package version. Version mismatches between client and server cause 400/422 errors or silent data corruption. Verify the server version with curl http://localhost:8000/api/v1/version and compare the version string to the npm package compatibility matrix.

Querying the Vector Store

Similarity search finds the chunks whose embeddings are closest to the query embedding in vector space. Retrieving the top-k most relevant chunks (typically 3 to 5) provides enough context for the LLM without overwhelming the prompt window.

// backend/src/retriever.js
import embeddings from "./embeddings.js";
import { getChromaCollection } from "./vectorstore.js";

export async function retrieveRelevantChunks(query, topK = 3) {
  const collection = await getChromaCollection();
  const queryEmbedding = await embeddings.embedQuery(query);

  const results = await collection.query({
    queryEmbeddings: [queryEmbedding],
    nResults: topK,
  });

  // Guard against empty collection or malformed response
  if (!results.documents?.[0]?.length) return [];

  return results.documents[0].map((doc, i) => ({
    content: doc,
    metadata: results.metadatas[0][i],
    distance: results.distances[0][i],
  }));
}
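
The distance field returned above is worth using. With the collection configured for cosine space, the reported distance is 1 minus the cosine similarity, so smaller values mean closer matches. A minimal sketch of a post-filter follows; filterByDistance and the 0.5 cutoff are assumptions for illustration, and a sensible threshold depends on your documents and embedding model.

```javascript
// Drop weak matches and attach a similarity score for display.
// maxDistance = 0.5 is a hypothetical starting value to tune.
function filterByDistance(chunks, maxDistance = 0.5) {
  return chunks
    .filter((c) => c.distance <= maxDistance)
    .map((c) => ({ ...c, similarity: 1 - c.distance }));
}

// Shape matches what retrieveRelevantChunks returns; values are made up
const retrieved = [
  {
    content: "Refunds are issued within 14 days.",
    metadata: { source: "policy.md", chunkIndex: 3 },
    distance: 0.18,
  },
  {
    content: "The office is closed on public holidays.",
    metadata: { source: "handbook.md", chunkIndex: 7 },
    distance: 0.71,
  },
];

const relevant = filterByDistance(retrieved);
console.log(relevant.map((c) => c.metadata.source));
// [ 'policy.md' ]
```

Filtering this way means an off-topic question yields an empty context, which pairs well with a prompt that tells the model to admit when it has nothing to go on.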

The RAG Query Engine: Tying It Together

Building the Retrieval-Augmented Generation Chain

This is the central piece. The retrieval step pulls relevant chunks, a prompt template injects them as context, and the local Ollama LLM generates a grounded response. Streaming the response back through Express gives the frontend real-time output.

// backend/src/ragchain.js
import { ChatOllama } from "@langchain/ollama";
import { retrieveRelevantChunks } from "./retriever.js";

const llm = new ChatOllama({
  model: process.env.LLM_MODEL ?? "mistral",
  baseUrl: process.env.OLLAMA_BASE_URL ?? "http://localhost:11434",
  streaming: true,
});

function buildPrompt(context, question) {
  return `You are a helpful assistant. Answer ONLY from the context below.
If the context is insufficient, respond: "I don't have enough information in the provided documents to answer that question."

<context>
${context}
</context>

Question: ${question}

Answer:`;
}

export async function queryRAG(question, res) {
  // Set SSE headers before any await that could throw
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");

  const writeEvent = (payload) => {
    if (!res.writableEnded) {
      res.write(`data: ${JSON.stringify(payload)}\n\n`);
    }
  };

  try {
    const chunks = await retrieveRelevantChunks(question, 3);
    const context = chunks.map((c) => c.content).join("\n\n---\n\n");

    const sources = chunks.map((c) => ({
      source: c.metadata.source,
      chunkIndex: c.metadata.chunkIndex,
      excerpt: c.content.substring(0, 120) + "...",
    }));

    writeEvent({ type: "sources", sources });

    const prompt = buildPrompt(context, question);
    const stream = await llm.stream(prompt);

    for await (const chunk of stream) {
      // Stop writing if client already disconnected
      if (res.writableEnded) break;
      const text = chunk.content;
      if (text) writeEvent({ type: "token", text });
    }

    writeEvent({ type: "done" });
  } catch (err) {
    console.error({ event: "query_error", error: err.message, ts: Date.now() });
    // Send a structured error event if the stream is still open
    writeEvent({ type: "error", message: "Query failed. Check server logs." });
  } finally {
    if (!res.writableEnded) res.end();
  }
}

Prompt Engineering for Local Models

Local models like Mistral 7B need more explicit prompt structure than GPT-4 class models. The prompt template above includes two critical elements: an explicit instruction to answer only from provided context, and a fallback instruction for when the context is insufficient. Without the context-only instruction, Mistral 7B frequently answers from training data instead of the retrieved chunks, which defeats the purpose of RAG entirely. Keeping the instruction direct and the context clearly delimited with <context> tags helps local models stay grounded. Avoid complex multi-step instructions; one clear directive produces more reliable results with 7B-parameter models.
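
One way to keep the context block tidy for a small model is to label each chunk with its source and cap the total size. The sketch below is an optional variation, not part of the tutorial's ragchain.js: buildContext and the maxChars budget are hypothetical, and the budget is approximate because separators are not counted.

```javascript
// Join retrieved chunks into one context string, labelled by source,
// stopping before the (approximate) character budget is exceeded.
function buildContext(chunks, maxChars = 2000) {
  const parts = [];
  let used = 0;
  for (const chunk of chunks) {
    const block = `[${chunk.metadata.source} #${chunk.metadata.chunkIndex}]\n${chunk.content}`;
    if (used + block.length > maxChars) break; // budget reached, drop the rest
    parts.push(block);
    used += block.length;
  }
  return parts.join("\n\n---\n\n");
}

// Example chunks in the shape used by the retriever; contents are made up
const retrieved = [
  { content: "Refunds are issued within 14 days.", metadata: { source: "policy.md", chunkIndex: 3 } },
  { content: "Shipping takes 3 to 5 business days.", metadata: { source: "faq.md", chunkIndex: 0 } },
];

console.log(buildContext(retrieved));
```

Because chunks arrive ordered by relevance, truncating from the end drops the weakest matches first.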

Building the React Frontend

Document Upload Interface

// frontend/src/components/DocumentUpload.jsx
import { useState, useCallback } from "react";

const API_BASE = import.meta.env.VITE_API_BASE ?? "http://localhost:3001";

export default function DocumentUpload({ onIngested }) {
  const [uploading, setUploading] = useState(false);
  const [status, setStatus] = useState(null);
  const [dragOver, setDragOver] = useState(false);

  const handleFile = useCallback(async (file) => {
    setUploading(true);
    setStatus(`Ingesting ${file.name}...`);

    const formData = new FormData();
    formData.append("document", file);

    try {
      const res = await fetch(`${API_BASE}/api/ingest`, {
        method: "POST",
        body: formData,
      });
      const data = await res.json();

      if (data.success) {
        setStatus(`${data.filename}: ${data.chunksCreated} chunks indexed`);
        onIngested?.(data);
      } else {
        setStatus(`Error: ${data.error}`);
      }
    } catch (err) {
      setStatus(`Upload failed: ${err.message}`);
    } finally {
      setUploading(false);
    }
  }, [onIngested]);

  const onDrop = (e) => {
    e.preventDefault();
    setDragOver(false);
    const file = e.dataTransfer.files[0];
    if (file) handleFile(file);
  };

  return (
    <div
      onDragOver={(e) => { e.preventDefault(); setDragOver(true); }}
      onDragLeave={() => setDragOver(false)}
      onDrop={onDrop}
      style={{
        border: `2px dashed ${dragOver ? "#4f46e5" : "#ccc"}`,
        padding: "2rem",
        textAlign: "center",
        borderRadius: "8px",
        background: dragOver ? "#eef2ff" : "#fafafa",
      }}
    >
      <p>Drag and drop a PDF, MD, or TXT file here</p>
      <input
        type="file"
        accept=".pdf,.md,.txt"
        onChange={(e) => e.target.files[0] && handleFile(e.target.files[0])}
        disabled={uploading}
      />
      {status && <p style={{ marginTop: "1rem" }}>{status}</p>}
    </div>
  );
}

Chat Interface for Document Q&A

// frontend/src/components/ChatInterface.jsx
import { useState } from "react";

const API_BASE = import.meta.env.VITE_API_BASE ?? "http://localhost:3001";

export default function ChatInterface() {
  const [messages, setMessages] = useState([]);
  const [input, setInput] = useState("");
  const [loading, setLoading] = useState(false);

  const sendQuery = async () => {
    if (!input.trim() || loading) return;
    const question = input.trim();
    setInput("");
    setMessages((prev) => [...prev, { role: "user", text: question }]);
    setLoading(true);

    try {
      const res = await fetch(`${API_BASE}/api/query`, {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ question }),
      });

      // Guard: ReadableStream may be unavailable
      if (!res.body) throw new Error("Streaming not supported in this environment.");

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let assistantText = "";
      let sources = [];
      let buffer = "";

      setMessages((prev) => [...prev, { role: "assistant", text: "", sources: [] }]);

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;

        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop(); // Hold back the last (potentially incomplete) line

        for (const line of lines) {
          if (!line.startsWith("data: ")) continue;
          try {
            const payload = JSON.parse(line.slice(6));

            if (payload.type === "sources") {
              sources = payload.sources;
            } else if (payload.type === "token") {
              assistantText += payload.text;
            } else if (payload.type === "error") {
              assistantText = `Error: ${payload.message}`;
            }
            // "done" type is intentionally a no-op; stream ends via reader done flag
          } catch (parseErr) {
            console.warn("Skipping malformed SSE line:", line);
          }
        }

        setMessages((prev) => {
          const updated = [...prev];
          updated[updated.length - 1] = {
            role: "assistant",
            text: assistantText,
            sources,
          };
          return updated;
        });
      }
    } catch (err) {
      setMessages((prev) => [
        ...prev,
        { role: "assistant", text: `Error: ${err.message}`, sources: [] },
      ]);
    } finally {
      setLoading(false);
    }
  };

  return (
    <div style={{ maxWidth: "700px", margin: "0 auto" }}>
      <div style={{ minHeight: "400px", padding: "1rem" }}>
        {messages.map((msg, i) => (
          <div key={i} style={{ marginBottom: "1rem" }}>
            <strong>{msg.role === "user" ? "You" : "Assistant"}:</strong>
            <p style={{ whiteSpace: "pre-wrap" }}>{msg.text}</p>
            {msg.sources?.length > 0 && (
              <div style={{ fontSize: "0.85rem", color: "#666", marginTop: "0.5rem" }}>
                <strong>Sources:</strong>
                {msg.sources.map((s, j) => (
                  <div key={j}>
                    📄 {s.source} (chunk {s.chunkIndex}): {s.excerpt}
                  </div>
                ))}
              </div>
            )}
          </div>
        ))}
      </div>
      <div style={{ display: "flex", gap: "0.5rem" }}>
        <input
          value={input}
          onChange={(e) => setInput(e.target.value)}
          onKeyDown={(e) => e.key === "Enter" && !loading && sendQuery()}
          placeholder="Ask a question about your documents..."
          style={{ flex: 1, padding: "0.5rem" }}
          disabled={loading}
        />
        <button onClick={sendQuery} disabled={loading}>
          {loading ? "..." : "Ask"}
        </button>
      </div>
    </div>
  );
}
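
The buffering logic inside the read loop is the part most likely to go wrong, because network chunks can end in the middle of a JSON payload. Pulling it into a small pure function makes it testable without a browser. The sketch below assumes the same "data: " line format the backend writes; createSSEParser is a hypothetical helper, not something the component above depends on.

```javascript
// Returns a feed(text) function that emits one parsed payload per complete line.
function createSSEParser(onEvent) {
  let buffer = "";
  return function feed(textChunk) {
    buffer += textChunk;
    const lines = buffer.split("\n");
    buffer = lines.pop(); // keep the trailing partial line for the next call
    for (const line of lines) {
      if (!line.startsWith("data: ")) continue; // skips the blank separator lines
      try {
        onEvent(JSON.parse(line.slice(6)));
      } catch {
        // Malformed payload: skip it rather than abort the stream
      }
    }
  };
}

// Simulate a payload that arrives split across two network chunks
const events = [];
const feed = createSSEParser((e) => events.push(e));
feed('data: {"type":"token","te');
feed('xt":"Hel"}\n\ndata: {"type":"token","text":"lo"}\n\n');
feed('data: {"type":"done"}\n\n');

const answer = events
  .filter((e) => e.type === "token")
  .map((e) => e.text)
  .join("");
console.log(answer);
// Hello
```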

Implementation Checklist and Performance Tuning

Use this checklist to verify that every component of the local RAG pipeline is operational:

Ollama installed and models pulled (mistral and nomic-embed-text)
ChromaDB running locally (Docker on port 8000, version matching npm client)
Node.js backend with document ingestion endpoint at /api/ingest
PDF/MD/TXT parsing working across all three file types
Chunking configured (500 characters, 50 overlap)
Local embeddings generating correctly (verify dimension is 768 in console output)
Vectors stored in ChromaDB with metadata (source filename, chunk index)
RAG query chain returning grounded answers at /api/query
React frontend uploading documents via drag-and-drop
Chat UI streaming responses with source attribution
Prompt template handles "no context" gracefully with explicit fallback
Tested with 3+ real documents of varying length

Performance Tips

Adjust chunk size to the document type. Code documentation and structured content may benefit from 800 to 1000 character chunks to preserve function-level context (a common heuristic; experiment with your specific content), while 500 characters is a reasonable starting point for dense prose. For model selection, smaller models like Phi-3 trade answer quality for speed; benchmark on your own hardware to measure the difference, since throughput varies widely by CPU, RAM, and quantization level. When ingesting large document sets, batch embedding (processing multiple chunks per call) reduces per-request overhead.
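
Before changing chunk settings, it helps to estimate what they do to ingestion volume. The following is a rough back-of-the-envelope sketch: estimateIngestion is a hypothetical helper, it assumes fixed-size windows (the recursive splitter usually produces somewhat more, smaller chunks), and it reuses the roughly 4 characters per token ratio and the batch size of 32 from earlier in this tutorial.

```javascript
// Rough estimate of chunk count and embedding batches for a document.
function estimateIngestion(
  totalChars,
  { chunkSize = 500, chunkOverlap = 50, batchSize = 32 } = {}
) {
  if (totalChars <= 0) return { chunks: 0, batches: 0, approxTokensPerChunk: 0 };
  const step = chunkSize - chunkOverlap; // characters of new text per chunk
  const chunks =
    totalChars <= chunkSize ? 1 : Math.ceil((totalChars - chunkOverlap) / step);
  return {
    chunks,
    batches: Math.ceil(chunks / batchSize),
    approxTokensPerChunk: Math.round(chunkSize / 4), // ~4 characters per token
  };
}

console.log(estimateIngestion(100000));
// { chunks: 223, batches: 7, approxTokensPerChunk: 125 }
console.log(estimateIngestion(100000, { chunkSize: 1000, chunkOverlap: 100 }));
// { chunks: 111, batches: 4, approxTokensPerChunk: 250 }
```

Doubling the chunk size roughly halves the number of embedding calls, at the cost of each retrieved chunk consuming twice as much of the prompt.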

Ollama uses GPU acceleration automatically on Apple Silicon. On NVIDIA hardware, CUDA toolkit 11.3 or later and compatible drivers must be installed first. Verify with nvidia-smi during inference to confirm GPU memory usage increases. Systems with less than 16GB of RAM may struggle with 7B-parameter models; 8GB machines should use smaller quantized models or the 3B-parameter variants.

Security and Privacy Considerations

Confirm that no traffic leaves the host. Run netstat -an | grep ESTABLISHED or a network monitor during operation to verify no outbound connections beyond localhost exist. By default, Ollama may check for updates or send telemetry, so for strict isolation block outbound connections at the firewall level once the models have been pulled.
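
Reading netstat output by eye gets tedious, so a small filter can flag anything that is not loopback. This is a sketch under assumptions: findExternalConnections is a hypothetical helper, it expects the six-column TCP layout that netstat -an prints on Linux and macOS, and the sample lines (including the 203.0.113.7 documentation address) are invented. Anything it flags deserves a closer look rather than being proof of a leak.

```javascript
// Matches loopback addresses in both Linux (addr:port) and macOS (addr.port) styles
const LOOPBACK = /^(127\.\d+\.\d+\.\d+|::1|\[::1\])[.:]\d+$/;

// Return established connections whose remote end is not loopback.
function findExternalConnections(netstatOutput) {
  return netstatOutput
    .split("\n")
    .map((line) => line.trim().split(/\s+/))
    .filter((cols) => cols.length >= 6 && cols[5] === "ESTABLISHED")
    .filter((cols) => !LOOPBACK.test(cols[4]))
    .map((cols) => ({ local: cols[3], remote: cols[4] }));
}

// Invented sample in Linux netstat format
const sample = [
  "tcp        0      0 127.0.0.1:11434         127.0.0.1:52344         ESTABLISHED",
  "tcp        0      0 127.0.0.1:8000          127.0.0.1:40112         ESTABLISHED",
  "tcp        0      0 192.168.1.20:52410      203.0.113.7:443         ESTABLISHED",
  "tcp        0      0 0.0.0.0:3001            0.0.0.0:*               LISTEN",
].join("\n");

console.log(findExternalConnections(sample));
// [ { local: '192.168.1.20:52410', remote: '203.0.113.7:443' } ]
```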

File system permissions on the uploads/ directory should be restricted so that only the Node.js process can read and write uploaded documents. ChromaDB in its default configuration has no authentication; in multi-user scenarios, place it behind a reverse proxy with access controls. The Docker command in this tutorial binds ChromaDB to 127.0.0.1 only, which prevents access from other machines on the network. For compliance-sensitive environments (legal, healthcare, finance), implement audit logging on the ingestion and query endpoints to track which documents were indexed and what questions were asked, creating a verifiable chain of access.
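
As a starting point for audit logging, each request can be reduced to one JSON line. The sketch below only builds the line; auditRecord and its field names are hypothetical, and where the lines are written (an append-only file, a log collector) is left to the deployment.

```javascript
// Build one JSON log line per audited action.
// `now` is injectable so the output is reproducible in tests.
function auditRecord(event, details, now = new Date()) {
  return JSON.stringify({ ts: now.toISOString(), event, ...details });
}

const fixedTime = new Date("2024-01-01T00:00:00.000Z");

console.log(auditRecord("ingest", { filename: "contract.pdf", chunksCreated: 42 }, fixedTime));
// {"ts":"2024-01-01T00:00:00.000Z","event":"ingest","filename":"contract.pdf","chunksCreated":42}

console.log(auditRecord("query", { question: "What is the notice period?" }, fixedTime));
// {"ts":"2024-01-01T00:00:00.000Z","event":"query","question":"What is the notice period?"}
```

Keep in mind that logged questions can themselves be sensitive, so the audit log needs the same access restrictions as the uploads/ directory.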

Next Steps

This tutorial delivers a fully private, local RAG system built entirely in JavaScript. The pipeline handles document ingestion, chunking, embedding, vector storage, retrieval, and grounded generation without any data leaving the host machine (when configured for network isolation as described above). The most valuable next extension is adding a docker-compose.yml at the project root to bundle ChromaDB alongside the application services for single-command startup, since that eliminates the most common source of "it doesn't work on my machine" issues. Beyond that, consider multi-modal document support (images and tables), conversation memory for multi-turn interactions, and user authentication for shared deployments. The project structure outlined here maps directly to a GitHub repository: backend/src/ for the Node.js modules and frontend/src/components/ for the React UI.