This metrics tool terrifies bad developers

Start free trial
SitePoint Premium
Stay Relevant and Grow Your Career in Tech
  • Premium Results
  • Publish articles on SitePoint
  • Daily curated jobs
  • Learning Paths
  • Discounts to dev tools
Start Free Trial

7 Day Free Trial. Cancel Anytime.

Running a model inside the network perimeter shifts the attack surface rather than eliminating it. This article provides a security and compliance checklist with working code, specifically designed for local and on-premise LLM deployments—covering threat modeling, network hardening, access control, audit logging, and input/output guardrails.

Table of Contents

Local Doesn't Mean Secure

The rapid adoption of enterprise local AI deployments, powered by models like Llama 3 (e.g., 8B/70B) and Mistral (e.g., 7B v0.3), reflects genuine concerns about data sovereignty, inference costs, and regulatory exposure. Organizations across finance, healthcare, and legal services are pulling LLM inference behind their own perimeters to maintain control over sensitive data. But a dangerous assumption persists: that local LLM security is somehow a solved problem simply because the model runs on owned infrastructure. It is not.

Running a model inside the network perimeter shifts the attack surface rather than eliminating it. Prompt injection, model weight exfiltration, training data poisoning, and unaudited access all remain active threats. The difference is that there is no vendor-side safety net. AI model governance, content filtering, access control, and audit logging all become the enterprise's responsibility from day one.

Running a model inside the network perimeter shifts the attack surface rather than eliminating it. Prompt injection, model weight exfiltration, training data poisoning, and unaudited access all remain active threats.

This article targets CTOs, DevSecOps engineers, and platform teams responsible for enterprise AI compliance. It provides a security and compliance checklist with working code, specifically designed for local and on-premise LLM deployments. Each section pairs concrete implementation patterns in Node.js with the reasoning behind them, covering network hardening, access control, audit logging, and input/output guardrails. The final deliverable is a printable checklist suitable for quarterly security reviews.

For broader context on the strategic case for running models locally, see the pillar article on enterprise local AI strategy.

Prerequisites

All code examples in this article have been tested with the following versions. Behavior may differ across major versions (e.g., express-rate-limit v6→v7 changed option names; http-proxy-middleware v2→v3 changed the constructor API).

# Requires Node.js 20 LTS
npm install express@^4.19 express-rate-limit@^7 http-proxy-middleware@^3

You will also need:

  • A running local model server (e.g., Ollama, llama.cpp server, or vLLM) on localhost:8080 exposing an OpenAI-compatible /v1/completions endpoint.
  • OpenSSL for generating internal CA, server, and client certificates (see the mTLS section below).
  • All Express snippets require app.use(express.json()) registered before any middleware that reads req.body.
  • A PROMPT_HMAC_SECRET environment variable set to a cryptographically random secret (used for prompt hashing in audit logs).

Threat Model for Local LLMs

Prompt Injection Attacks

Prompt injection remains a primary vulnerability class in LLM deployments (ranked #1 on the OWASP Top 10 for LLM Applications), and local installations receive no exemption. Direct prompt injection occurs when a user crafts input that overrides the system prompt, causing the model to behave outside its intended scope. Indirect prompt injection is subtler: adversarial instructions embedded in documents, emails, or database records that the model processes as context, causing it to follow attacker-controlled directives.

Cloud-hosted LLM APIs from vendors like OpenAI and Anthropic apply their own input filtering layers, but a locally deployed Llama 3 or Mistral model ships with none of these protections. Consider an internal HR chatbot backed by a local model. An employee submits a query containing a hidden instruction that causes the model to output salary data from its retrieval context, data the employee should never see. Without application-layer guardrails, the model complies.

Model Theft and Intellectual Property Exposure

Attackers who reach the file system can copy multi-gigabyte weight files, especially fine-tuned models encoding months of compute and proprietary training data. Exfiltration risk is concrete: if the inference VLAN permits outbound connections on arbitrary ports, an attacker with local access transfers those weights off-network with no friction. Supply-chain attacks present another vector. If teams source fine-tuning datasets or base model weights from public repositories without verifying integrity, a compromised artifact introduces backdoors into inference behavior. Model safeguarding policies that cover storage encryption, access logging on weight files, and integrity checksums are non-negotiable for any serious deployment.

Training Data Poisoning

When organizations fine-tune models on internal datasets, adversarial actors, whether external attackers or malicious insiders, inject poisoned examples into the training corpus. In compliance-sensitive domains like legal, finance, and healthcare, as few as tens of poisoned examples in a fine-tuning set of thousands shift model behavior enough to generate outputs that appear authoritative but contain incorrect regulatory guidance, fabricated case citations, or hallucinated medical dosages. This ties directly to enterprise AI compliance requirements: a poisoned model producing incorrect compliance advice creates legal liability and regulatory risk that no disclaimer fully mitigates.

Network Security: Isolating Your Inference Layer

Air-Gapping Requirements and Network Segmentation

The first line of defense for local LLM deployments is network isolation. Full air-gapping, where the inference cluster has zero connectivity to the internet, is warranted in environments handling classified data, certain healthcare records under HIPAA, or financial data subject to PCI-DSS or ITAR requirements. For most enterprises, restricted VLAN segmentation provides a practical alternative: the inference cluster sits on a dedicated network segment with no default route to the internet and tightly scoped firewall rules.

The architecture places the inference cluster (GPU nodes running the model server) on an isolated VLAN. Application servers that need inference access connect through a defined set of firewall rules permitting only the inference API port. The firewall denies all egress from the inference VLAN by default. This functions as network-layer LLM isolation, ensuring that even a compromised inference node cannot reach external endpoints or lateral corporate network segments without explicit allowlisting.

Implementing mTLS for Inference Traffic

Standard TLS encrypts traffic in transit but does not authenticate the client. For internal east-west traffic between application servers and the inference API, this is insufficient. A compromised service on the same network could make unauthorized inference calls. Mutual TLS (mTLS) solves this by requiring both the server and the client to present valid certificates issued by a trusted internal CA.

Note: The code below references certificate files under /certs/. You must generate these using your internal PKI or OpenSSL. For example, to create a self-signed CA and server certificate for testing:

# Generate internal CA
openssl req -x509 -newkey rsa:4096 -keyout /certs/internal-ca-key.pem -out /certs/internal-ca-cert.pem -days 365 -nodes -subj "/CN=Internal CA"

# Generate server key and CSR
openssl req -newkey rsa:4096 -keyout /certs/server-key.pem -out /certs/server.csr -nodes -subj "/CN=inference-server"

# Sign server cert with CA
openssl x509 -req -in /certs/server.csr -CA /certs/internal-ca-cert.pem -CAkey /certs/internal-ca-key.pem -CAcreateserial -out /certs/server-cert.pem -days 365

Repeat a similar process for client certificates. On Windows, replace /certs/ with an appropriate local path, or use path.join(__dirname, 'certs', ...) in the code.

The following Node.js Express server configures mTLS for an internal LLM inference API endpoint. Note that apiKeyMiddleware and auditLogMiddleware are registered on the proxy route so that every request is authenticated and logged even though the upstream response is piped by the proxy:

const https = require('https');
const fs = require('fs');
const express = require('express');
const { createProxyMiddleware } = require('http-proxy-middleware');
const apiKeyMiddleware = require('./apiKeyMiddleware');
const auditLogMiddleware = require('./auditLogMiddleware');

const app = express();

app.use(express.json()); // Required for all middleware that reads req.body

// Fail fast with an actionable message if certs are missing
let serverOptions;
try {
  serverOptions = {
    key: fs.readFileSync('/certs/server-key.pem'),
    cert: fs.readFileSync('/certs/server-cert.pem'),
    ca: fs.readFileSync('/certs/internal-ca-cert.pem'),
    requestCert: true,
    rejectUnauthorized: true
  };
} catch (e) {
  console.error(`Certificate load failed: ${e.message}`);
  process.exit(1);
}

// Auth and audit MUST precede the proxy so req.context is populated
app.use(
  '/v1/completions',
  apiKeyMiddleware,
  auditLogMiddleware,
  createProxyMiddleware({
    target: 'http://localhost:8080',
    changeOrigin: true
  })
);

const PORT = process.env.PORT || 8443;
const server = https.createServer(serverOptions, app);
server.listen(PORT, () => {
  console.log(`mTLS inference proxy listening on port ${PORT}`);
});

The server loads certificates and the internal CA certificate, then enforces requestCert: true and rejectUnauthorized: true so that any client without a valid certificate signed by the internal CA gets rejected at the TLS handshake. apiKeyMiddleware ensures a valid API key is also required, providing team attribution even for mTLS-authenticated clients. The proxy route then forwards validated requests to the local model server.

For containerized environments, mTLS can be handled transparently by a service mesh like Istio or Linkerd. See the Deploying to Kubernetes guide for details on that approach.

Access Control: API Keys and Usage Quotas

Designing an Internal API Key Scheme

Even within a private network, unauthenticated access to the inference API is a security gap. Every internal service calling the LLM should present an API key. Scope keys by team, application, and environment (development, staging, production) to enable fine-grained access control and audit attribution. A key rotation strategy with a maximum 90-day lifetime and immediate revocation capability through a central key management service prevents long-lived credentials from becoming persistent attack vectors.

// WARNING: In-memory key store shown for illustration only.
// Never commit credentials to source control. Load keys at runtime from
// a secrets manager (e.g., HashiCorp Vault, AWS Secrets Manager) or
// environment variables.
const API_KEYS = {
  'key-engineering-prod-001': { team: 'engineering', scope: 'inference', env: 'production' },
  'key-legal-prod-001': { team: 'legal', scope: 'inference', env: 'production' }
};

function apiKeyMiddleware(req, res, next) {
  const authHeader = req.headers['authorization'];

  if (!authHeader || !authHeader.startsWith('Bearer ')) {
    return res.status(401).json({ error: 'Missing or malformed API key' });
  }

  const key = authHeader.slice(7);
  const meta = API_KEYS[key];

  if (!meta) {
    return res.status(401).json({ error: 'Invalid API key' });
  }

  // Store a truncated key fingerprint, not the raw key, to avoid
  // accidental exposure in logs or error responses.
  req.context = {
    keyId: key.slice(0, 8) + '...',
    team: meta.team,
    scope: meta.scope,
    env: meta.env
  };

  next();
}

module.exports = apiKeyMiddleware;

The middleware extracts the API key from the Authorization header, validates it against a store, and attaches team and scope metadata to req.context for use by downstream audit logging. In production, replace the in-memory object with calls to a secrets manager or database that supports key rotation and revocation.

Enforcing Usage Quotas and Rate Limiting

GPU resources are expensive and finite; a single A100 tops out at roughly 30-60 concurrent requests depending on model size and sequence length. Without rate limiting, one team or runaway process monopolizes inference capacity, creating a noisy-neighbor problem that degrades service for everyone. Both token-based and request-based rate limiting should be considered, though request-based limits are simpler to implement at the API gateway layer.

const express = require('express');
const rateLimit = require('express-rate-limit');
const apiKeyMiddleware = require('./apiKeyMiddleware');
// This assumes apiKeyMiddleware.js exports the middleware from the Access Control section.

const app = express();
app.use(express.json()); // Required before middleware that reads req.body

// Required for correct req.ip resolution behind load balancers / k8s ingress.
// Without this, all requests behind a reverse proxy share a single IP bucket,
// making IP-based rate limit fallback trivially bypassable.
app.set('trust proxy', 1);

const inferenceRateLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 30,
  keyGenerator: (req) => req.context?.keyId || req.ip,
  standardHeaders: true,
  legacyHeaders: false,
  message: { error: 'Rate limit exceeded. Try again later.' }
});

// apiKeyMiddleware must be registered before the rate limiter so that
// req.context.keyId is available. If middleware order is reversed,
// the rate limiter falls back to keying by IP address.
app.use('/v1/completions', apiKeyMiddleware, inferenceRateLimiter);

Using express-rate-limit, this enforces 30 requests per 60-second fixed window, keyed to the API key ID attached by the authentication middleware. For high-availability deployments, replace the in-memory store with a Redis-backed store (e.g., rate-limit-redis) to enforce rate limits consistently across multiple API gateway instances. Note that express-rate-limit with a Redis store uses a fixed window by default; for a true sliding window algorithm, consider a library like rate-limiter-flexible.

Role-Based Access for Model Management

Access control must distinguish between inference consumers and model administrators. Teams that query the model should have no ability to deploy, swap, or fine-tune models. Model management operations, including uploading new weights, modifying system prompts, and triggering fine-tuning jobs, require a separate set of credentials with elevated privileges. Enforce these restrictions through an RBAC system integrated with your organization's identity provider.

Audit Logging: Accountability Without Privacy Violations

What to Log (and What Not to Log)

Effective audit logging for LLM inference requires capturing enough detail for security investigation and compliance reporting without creating new privacy liabilities. Mandatory fields include: timestamp, API key or team ID, model name and version, token count (input and output), response latency, and HTTP response status. These fields support usage attribution, anomaly detection, and capacity planning.

Sensitive fields require careful handling. Full prompt text may contain PII, trade secrets, or privileged legal communications. Logging raw prompts creates GDPR exposure (if the prompts contain personal data and the organization lacks a lawful basis for processing that data within logs), HIPAA violations (storing protected health information in log systems without adequate safeguards), or SOC 2 audit findings. Log an HMAC of the prompt using a server-side secret key instead. This enables correlation without retaining content and prevents offline brute-forcing of short or predictable prompts. Redact any PII from completions before writing log entries. This is the operational core of enterprise AI compliance.

Logging raw prompts creates GDPR exposure, HIPAA violations, or SOC 2 audit findings. Log an HMAC of the prompt using a server-side secret key instead. This enables correlation without retaining content and prevents offline brute-forcing of short or predictable prompts.

Building Audit Logging Middleware

The following shared PII pattern set is used by both the audit logging middleware and the guardrails module to ensure consistent redaction:

// piiPatterns.js — shared PII pattern definitions
const PII_PATTERNS = [
  { regex: /\b\d{3}-\d{2}-\d{4}\b/i,                              label: 'SSN' },
  { regex: /\b[\w.-]+@[\w.-]+\.\w{2,}\b/i,                        label: 'EMAIL' },
  { regex: /\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b/,         label: 'CREDIT_CARD' },
  { regex: /\b(\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b/, label: 'PHONE' }
];

module.exports = { PII_PATTERNS };
// auditLogMiddleware.js
const crypto = require('crypto');
const { PII_PATTERNS } = require('./piiPatterns');

function redactPII(text) {
  if (!text) return '';
  let result = text;

  for (const { regex, label } of PII_PATTERNS) {
    const globalRegex = new RegExp(regex.source, 'g' + regex.flags.replace('g', ''));
    result = result.replace(globalRegex, `[${label}_REDACTED]`);
  }

  return result;
}

function auditLogMiddleware(req, res, next) {
  const startTime = Date.now();
  const originalJson = res.json.bind(res);
  let logged = false;

  const hmacSecret = process.env.PROMPT_HMAC_SECRET;
  if (!hmacSecret) {
    // Fail closed: do not serve requests without the HMAC secret
    return res.status(500).json({ error: 'Audit logging misconfigured' });
  }

  res.json = (body) => {
    if (!logged) {
      logged = true;

      const logEntry = {
        timestamp: new Date().toISOString(),
        keyId: req.context?.keyId || 'unknown',
        team: req.context?.team || 'unknown',
        model: req.body?.model || 'default',
        inputTokens: req.body?.prompt?.split(/\s+/).length || 0,
        // WARNING: whitespace split is not an LLM token count. For accurate
        // billing and compliance logging, use a tokenizer matching your model
        // (e.g., @dqbd/tiktoken) or call the model server's /tokenize endpoint.
        outputTokens: body?.choices?.[0]?.text?.split(/\s+/).length || 0,
        promptHash: crypto
          .createHmac('sha256', hmacSecret)
          .update(req.body?.prompt || '')
          .digest('hex'),
        // Retain the full 64-char hex digest for collision resistance in forensic audit trails.
        responseStatus: res.statusCode,
        latencyMs: Date.now() - startTime,
        redactedOutput: redactPII(body?.choices?.[0]?.text?.slice(0, 200) || '')
      };

      console.log(JSON.stringify(logEntry));
    }

    return originalJson(body);
  };

  next();
}

module.exports = { auditLogMiddleware, redactPII };

The middleware intercepts responses, constructs structured JSON log entries with an HMAC of the prompt (using the PROMPT_HMAC_SECRET environment variable) instead of raw text, applies PII redaction to a snippet of the output, and writes to stdout for consumption by a log aggregator. The logged guard prevents duplicate log entries if res.json is called more than once per request.

Important: This middleware captures responses only when the inference gateway itself calls res.json(). In proxy architectures using http-proxy-middleware (as in the mTLS section above), the upstream response is piped directly and res.json() is never called. For proxied requests, use the on.proxyRes hook instead to intercept and log the upstream response:

const { createProxyMiddleware } = require('http-proxy-middleware');
const crypto = require('crypto');
const { redactPII } = require('./auditLogMiddleware');

app.use('/v1/completions', createProxyMiddleware({
  target: 'http://localhost:8080',
  changeOrigin: true,
  on: {
    proxyRes: (proxyRes, req, res) => {
      const hmacSecret = process.env.PROMPT_HMAC_SECRET;
      const startTime = req._auditStart || Date.now();
      const chunks = [];

      proxyRes.on('data', (chunk) => {
        // Accumulate Buffer objects; never coerce chunks to string mid-stream
        chunks.push(Buffer.isBuffer(chunk) ? chunk : Buffer.from(chunk));
      });

      proxyRes.on('end', () => {
        let parsed = {};

        try {
          parsed = JSON.parse(Buffer.concat(chunks).toString('utf8'));
        } catch (_) {
          // Non-JSON upstream response; log what we can
        }

        const logEntry = {
          timestamp: new Date().toISOString(),
          keyId: req.context?.keyId || 'unknown',
          team: req.context?.team || 'unknown',
          model: req.body?.model || 'default',
          promptHash: hmacSecret
            ? crypto
                .createHmac('sha256', hmacSecret)
                .update(req.body?.prompt || '')
                .digest('hex')
            : 'HMAC_SECRET_MISSING',
          responseStatus: proxyRes.statusCode,
          latencyMs: Date.now() - startTime,
          redactedOutput: redactPII(
            parsed?.choices?.[0]?.text?.slice(0, 200) || ''
          )
        };

        console.log(JSON.stringify(logEntry));
      });
    }
  }
}));

Production note: In production, replace console.log with a structured logging library (e.g., pino, winston) configured with a durable transport. Relying on stdout alone is not sufficient for tamper-evident audit logging, as entries can be silently dropped under high load or misconfigured container log rotation.

Log Storage, Retention, and Tamper-Proofing

Audit logs should be forwarded to a SIEM system (Splunk, Elastic Security) or an append-only object store. Retention periods must align with the applicable compliance framework:

  • HIPAA requires six-year retention for Privacy Rule policies and procedures (45 CFR §164.530(j)); audit log retention periods for covered entities are not federally prescribed and must be determined by the organization's risk analysis and applicable state law.
  • SOC 2 does not prescribe a specific log retention period; requirements derive from the organization's System Description commitments and auditor interpretation of the CC7 monitoring criteria. Consult your auditor before setting retention policy.
  • GDPR requires retention to be limited to what is necessary for the stated purpose.

Tamper-proofing can be achieved through hash chaining (each log entry includes a hash of the previous entry) or write-once storage such as AWS S3 Object Lock or Azure Immutable Blob Storage.

Input/Output Sanitization: Guardrails for Local LLMs

Why Guardrails Matter for Local Deployments

When using a cloud LLM API, the provider typically applies content filtering, toxicity detection, and basic prompt injection defenses. Local deployments have none of this. The organization owns content filtering end-to-end.

The risks break down by likelihood and impact:

  1. Prompt injection - highest likelihood; any user-facing deployment faces this on day one.
  2. PII leakage in model outputs - high likelihood in RAG or fine-tuned deployments where training data contains personal records.
  3. Hallucinated compliance or regulatory advice - moderate likelihood, but the compliance and legal consequences are severe.
  4. Toxic or harmful content generation - likelihood depends on use case and audience; reputational damage is the primary concern.

An application-layer guardrail, sitting between the user-facing API and the model inference server, catches these risks before they reach end users.

Implementing Guardrails with Node.js

const { PII_PATTERNS } = require('./piiPatterns');

const INJECTION_PATTERNS = [
  /ignore\s+(all\s+)?previous\s+instructions/i,
  /you\s+are\s+now\s+in\s+developer\s+mode/i,
  /disregard\s+(your|all)\s+(rules|instructions)/i,
  /system\s*:\s*/i
];

// Note: Do not use the `g` flag with .test() on reused RegExp instances.
// The `g` flag causes RegExp objects to maintain `lastIndex` state between
// calls, which leads to intermittent false negatives when the same pattern
// object is reused across function invocations.

const INFERENCE_TIMEOUT_MS = parseInt(process.env.INFERENCE_TIMEOUT_MS || '30000', 10);

function withTimeout(promise, ms) {
  return Promise.race([
    promise,
    new Promise((_, reject) =>
      setTimeout(() => reject(new Error(`Model server timeout after ${ms}ms`)), ms)
    )
  ]);
}

function preProcess(prompt) {
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(prompt)) {
      return { blocked: true, reason: 'Prompt injection pattern detected' };
    }
  }

  for (const { regex, label } of PII_PATTERNS) {
    if (regex.test(prompt)) {
      return { blocked: true, reason: `Input contains ${label}` };
    }
  }

  return { blocked: false, sanitizedPrompt: prompt.trim() };
}

function postProcess(completion, maxLength = 2000) {
  let output = completion.slice(0, maxLength);

  for (const { regex, label } of PII_PATTERNS) {
    // Construct a new regex with the `g` flag for replacement,
    // preserving any existing flags (e.g., `i` for case-insensitive matching)
    const globalRegex = new RegExp(
      regex.source,
      'g' + regex.flags.replace('g', '')
    );
    output = output.replace(globalRegex, `[${label}_REDACTED]`);
  }

  return output;
}

/**
 * @param {string} prompt - The user's input prompt.
 * @param {function(string): Promise<string>} callModel - Async function that
 *   calls the local model server and returns the completion string. You must
 *   implement this to match your model server's API (e.g., an HTTP POST to
 *   your Ollama or vLLM endpoint).
 */
async function guardrailedInference(prompt, callModel) {
  const preCheck = preProcess(prompt);

  if (preCheck.blocked) {
    return { error: preCheck.reason, status: 400 };
  }

  try {
    const rawCompletion = await withTimeout(
      callModel(preCheck.sanitizedPrompt),
      INFERENCE_TIMEOUT_MS
    );
    const safeOutput = postProcess(rawCompletion);
    return { completion: safeOutput, status: 200 };
  } catch (e) {
    const isTimeout = e.message.includes('timeout');
    return {
      error: isTimeout ? 'Model server unavailable' : 'Inference error',
      status: isTimeout ? 503 : 500
    };
  }
}

module.exports = { preProcess, postProcess, guardrailedInference };

The module wraps the inference call with pre-processing and post-processing stages. Pre-processing validates input against deny-list regex patterns for known prompt injection phrases and PII patterns (SSNs, emails, credit card numbers, phone numbers). Post-processing checks the model's completion for PII leakage and applies output truncation. guardrailedInference includes a configurable timeout (defaulting to 30 seconds via INFERENCE_TIMEOUT_MS) so that an unresponsive model server returns a structured error instead of hanging indefinitely. The modular structure allows teams to add custom validators. For organizations running Python-based validation frameworks such as Guardrails AI, the Node.js service can call them via HTTP as a sidecar container.

Important: Regex deny-lists detect only known, literal injection patterns and are trivially bypassed via paraphrasing, encoding, character substitution, or non-English phrasing. Treat this as a first-pass filter only. Pair with a semantic classifier or dedicated prompt injection detection model for production deployments.

Note on PII detection: No single regex covers all international phone formats or all PII categories. The patterns above are illustrative. For production use, consider a dedicated PII detection library or service.

Continuous Guardrail Testing

Guardrails degrade over time as new injection techniques emerge. Conduct red-team exercises with adversarial prompt suites on a regular schedule, not just at deployment. Automating guardrail regression tests within CI/CD pipelines ensures that updates to the model, system prompt, or guardrail rules do not introduce regressions. Each adversarial test case should be version-controlled alongside the guardrail code itself.

The Complete Security & Compliance Checklist

CategoryAction ItemPriorityStatusSection Reference
Threat ModelingDocument direct and indirect prompt injection vectorsCritical[ ]Threat Model
Threat ModelingImplement integrity checksums for model weightsCritical[ ]Threat Model
Threat ModelingAudit fine-tuning dataset provenance and integrityHigh[ ]Threat Model
Threat ModelingEstablish model safeguarding policy for weight storageCritical[ ]Threat Model
Network SecurityIsolate inference cluster on dedicated VLANCritical[ ]Network Security
Network SecurityDeny all egress from inference VLAN by defaultCritical[ ]Network Security
Network SecurityImplement mTLS for all inference API trafficCritical[ ]Network Security
Network SecurityIssue and rotate internal CA certificatesHigh[ ]Network Security
Access ControlDeploy API key authentication for all inference consumersCritical[ ]Access Control
Access ControlScope keys by team, application, and environmentHigh[ ]Access Control
Access ControlEnforce 90-day key rotation policyHigh[ ]Access Control
Access ControlImplement per-key rate limitingHigh[ ]Access Control
Access ControlSeparate inference consumer roles from model admin rolesCritical[ ]Access Control
Audit LoggingLog timestamp, key ID, model, token count, latency, statusCritical[ ]Audit Logging
Audit LoggingHMAC prompts instead of logging raw textCritical[ ]Audit Logging
Audit LoggingApply PII redaction to logged output snippetsCritical[ ]Audit Logging
Audit LoggingForward logs to SIEM or append-only storeHigh[ ]Audit Logging
Audit LoggingSet retention periods per compliance frameworkHigh[ ]Audit Logging
Audit LoggingImplement tamper-proofing (hash chaining or WORM storage)Medium[ ]Audit Logging
SanitizationDeploy pre-processing guardrails for prompt injectionCritical[ ]Sanitization
SanitizationDeploy pre-processing guardrails for PII in inputCritical[ ]Sanitization
SanitizationDeploy post-processing guardrails for PII in outputCritical[ ]Sanitization
SanitizationEnforce output truncation limitsMedium[ ]Sanitization
SanitizationSchedule regular red-team exercises against guardrailsHigh[ ]Sanitization
SanitizationAutomate guardrail regression tests in CI/CDHigh[ ]Sanitization

Store this checklist in version control alongside infrastructure-as-code. Teams should review it during quarterly security assessments, updating status and adding new items as new attack vectors emerge. Printing it or exporting it as a shared document ensures that security reviews for local LLM deployments have a concrete, reproducible baseline rather than ad hoc conversations.

Your Security Responsibilities When Deploying Locally

Deploying models locally transfers the full weight of security responsibility to the enterprise. There is no vendor to absorb the risk of prompt injection. No managed service handles access control. No external safety layer filters harmful outputs. The five pillars covered here, threat modeling, network isolation, access control, audit logging, and input/output sanitization, form the minimum viable security posture for any production local LLM deployment.

Deploying models locally transfers the full weight of security responsibility to the enterprise. There is no vendor to absorb the risk of prompt injection. No managed service handles access control. No external safety layer filters harmful outputs.

The code examples provided are designed to be integrated directly into a Node.js-based inference gateway. Adopt the checklist, wire up the middleware, and schedule a red-team exercise within the first 30 days of deployment. For the strategic rationale behind local AI and guidance on containerized deployment, refer to the enterprise local AI pillar article and the Kubernetes deployment guide.

SitePoint TeamSitePoint Team

Sharing our passion for building incredible internet things.

© 2000 – 2026 SitePoint Pty. Ltd.
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.