Running multiple local models simultaneously has become a practical necessity for developers building agentic AI workflows, RAG pipelines, and task-specific routing systems. This tutorial walks through building a Node.js orchestration layer paired with a React monitoring dashboard that manages multiple local model lifecycles with dynamic memory allocation, model swapping, and real-time observability.
Table of Contents
- Prerequisites
- Why Multi-Model Setups Are the New Default
- Understanding Memory Constraints When Running Multiple Local Models
- Core Memory Management Strategies
- Building the Orchestration Layer with Node.js
- Building a Real-Time Monitoring Dashboard with React
- Performance Tuning and Troubleshooting
- Complete Implementation Checklist
- Scaling Locally Without Scaling Hardware
Prerequisites
This tutorial assumes the following environment. Pin these versions (or compatible ones) to ensure reproducibility:
- Node.js ≥ 20 LTS
- npm ≥ 9
- Express ^4.18
- cors ^2.8
- React ^18
- Inference runtime: Ollama ≥ 0.1.32 or a recent llama.cpp build (post May 2024)
- OS: Linux (NVIDIA GPU with drivers exposing nvidia-smi in PATH) or macOS (Apple Silicon with unified memory). Windows is not covered.
- CUDA: Compatible with your NVIDIA driver version (if applicable)
Project Structure
Create the following files in a single project directory:
multi-model-orchestrator/
├── package.json
├── orchestrator.js
├── model-manager.js
├── model-registry.js
├── semaphore.js
└── client/                  # React app (e.g., via create-react-app)
    └── src/
        └── Dashboard.jsx
Minimal package.json:
{
  "name": "multi-model-orchestrator",
  "version": "1.0.0",
  "main": "orchestrator.js",
  "scripts": {
    "start": "node orchestrator.js"
  },
  "dependencies": {
    "express": "^4.18",
    "cors": "^2.8"
  }
}
Run npm install before proceeding.
Why Multi-Model Setups Are the New Default
Running multiple local models simultaneously has become a practical necessity for developers building agentic AI workflows, retrieval-augmented generation (RAG) pipelines, and task-specific routing systems. A typical production-adjacent setup might involve a coding model, a summarization model, and an embedding model all expected to respond on demand. This is a far cry from the single-model tutorial pattern that dominates most local LLM guides.
The core problem is straightforward: GPU VRAM and system RAM are finite resources, and naively loading multiple models causes out-of-memory (OOM) crashes or severely degrades inference speed. A 7B parameter model at FP16 precision already consumes roughly 14 GB of VRAM. Add a second model of similar size and the weights alone need about 28 GB, more than a 24 GB consumer GPU can hold, before accounting for KV cache or runtime overhead.
This tutorial walks through building a Node.js orchestration layer paired with a React monitoring dashboard that manages multiple local model lifecycles with dynamic memory allocation, model swapping, and real-time observability.
Understanding Memory Constraints When Running Multiple Local Models
How LLMs Consume Memory
A local LLM consumes memory in three ways: model weights, the key-value (KV) cache, and runtime overhead.
Parameter count and precision determine model weight size. A 7B parameter model at FP16 (2 bytes per parameter) requires approximately 14 GB. The pattern scales linearly. Q8_0 quantization uses approximately 1.06 bytes per parameter due to per-block scale factors, putting a 7B model at roughly 7.4 GB. At Q4_K_M (approximately 0.54 bytes per parameter including quantization metadata), a 7B model drops to about 3.8 GB.
| Model Size | FP16 | Q8_0 | Q4_K_M |
|---|---|---|---|
| 7B | ~14 GB | ~7.4 GB | ~3.8 GB |
| 13B | ~26 GB | ~13.8 GB | ~7 GB |
| 70B | ~140 GB | ~74 GB | ~37-38 GB |
A 70B model at FP16 demands around 140 GB, making full-precision loading impossible on all but the most specialized multi-GPU setups. Even at Q4_K_M, it still requires approximately 37-38 GB.
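The figures in the table follow from multiplying parameter count by bytes per parameter. A minimal sketch of that arithmetic, using the approximate per-parameter costs quoted in this section (the function name and lookup table are illustrative and not part of any runtime's API):

```javascript
// Approximate bytes per parameter, taken from the figures quoted in this section.
const BYTES_PER_PARAM = { FP16: 2, Q8_0: 1.06, Q4_K_M: 0.54 };

// Estimate weight memory in GB (1 GB = 1e9 bytes) for a model with
// `paramsBillion` billion parameters at the given quantization.
function estimateWeightsGB(paramsBillion, quantization) {
  const bytesPerParam = BYTES_PER_PARAM[quantization];
  if (bytesPerParam === undefined) {
    throw new Error(`Unknown quantization: ${quantization}`);
  }
  // billions of parameters * bytes per parameter = billions of bytes = GB
  return Math.round(paramsBillion * bytesPerParam * 10) / 10;
}

console.log(estimateWeightsGB(7, 'Q4_K_M'));
```

This covers weights only; KV cache and runtime overhead come on top, so treat the result as a lower bound when budgeting.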
The KV cache grows dynamically during inference based on context length and batch size. For long-context workloads, KV cache can consume several additional gigabytes beyond the static weight footprint.
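To make the KV cache cost concrete, here is a rough sizing sketch. It assumes a standard transformer that stores one key and one value vector per layer, per KV head, per token; the function name is hypothetical, and the example values (32 layers, 32 KV heads, head dimension 128) correspond to a Llama-2-7B-class model. Models that use grouped-query attention have fewer KV heads and a proportionally smaller cache.

```javascript
// Rough KV cache size in MiB.
// Factor of 2 accounts for storing both keys and values.
function estimateKVCacheMiB({
  layers,
  kvHeads,
  headDim,
  contextTokens,
  bytesPerElement = 2, // FP16 cache entries
  batchSize = 1,
}) {
  const bytes =
    2 * layers * kvHeads * headDim * contextTokens * bytesPerElement * batchSize;
  return bytes / (1024 * 1024);
}

// Llama-2-7B-class shape at a full 4096-token context
const fullContext = estimateKVCacheMiB({
  layers: 32,
  kvHeads: 32,
  headDim: 128,
  contextTokens: 4096,
});
console.log(`${fullContext} MiB`);
```

Because the result scales linearly with both context length and batch size, halving the context window halves the cache, which is why capping context per model is an effective lever.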
The inference engine itself (llama.cpp, Ollama, or similar) uses another few hundred MB as a baseline cost.
It is also critical to distinguish between GPU VRAM, system RAM, and unified/shared memory (as found on Apple Silicon). System RAM bandwidth on DDR4 typically ranges from 30-50 GB/s, which is roughly 20-35x lower than consumer GPU memory (for example, ~1 TB/s on an RTX 4090 with GDDR6X) and lower still compared to data-center HBM (~2 TB/s on an 80 GB A100). Token generation is largely bound by memory bandwidth, so models loaded into system RAM instead of VRAM will generate tokens roughly proportionally slower, but this can be a viable strategy for secondary or infrequently accessed models.
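The bandwidth gap can be turned into a back-of-the-envelope speed estimate. The sketch below assumes that generation is memory-bandwidth-bound and that producing each token reads every weight once, so it gives an optimistic upper bound rather than a measured figure. The function name and the example bandwidths (40 GB/s for DDR4, 1000 GB/s for a high-end GPU) are illustrative assumptions.

```javascript
// Upper-bound tokens per second if each generated token requires
// streaming the full set of weights from memory once.
function estimateMaxTokensPerSecond(modelSizeGB, bandwidthGBps) {
  if (modelSizeGB <= 0) {
    throw new RangeError('modelSizeGB must be positive');
  }
  return bandwidthGBps / modelSizeGB;
}

// A ~4 GB quantized 7B model in system RAM versus VRAM
console.log(estimateMaxTokensPerSecond(4, 40));   // DDR4-class bandwidth
console.log(estimateMaxTokensPerSecond(4, 1000)); // high-end GPU bandwidth
```

Real throughput will be lower because of compute, KV cache reads, and scheduling overhead, but the ratio between the two results is a reasonable first guess at the slowdown.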
Identifying Your Memory Budget
Before loading any models, the orchestration layer must query available resources. The following Node.js script reads available VRAM via nvidia-smi on NVIDIA systems and falls back to system_profiler on macOS to detect unified memory:
Important: This script assumes a single NVIDIA GPU. On multi-GPU systems, nvidia-smi returns one line per GPU; the code below parses only the first GPU's data. Extend the parsing logic if you need multi-GPU aggregation.
Apple Silicon note: system_profiler reports total installed physical RAM, not a GPU-specific memory partition. On Apple Silicon, GPU memory is dynamically allocated by macOS and there is no OS-level API that exposes "free GPU VRAM" directly. The value returned here is an approximation only. For more precise GPU memory pressure data, use the Metal API or Activity Monitor.
const { execSync } = require('child_process');
const os = require('os');
function getMemoryBudget() {
  const budget = {
    systemRAM: {
      totalMB: Math.round(os.totalmem() / 1024 / 1024),
      freeMB: Math.round(os.freemem() / 1024 / 1024),
    },
    gpuVRAM: null,
  };
  try {
    const smiOutput = execSync(
      'nvidia-smi --query-gpu=memory.total,memory.free,memory.used --format=csv,noheader,nounits',
      { encoding: 'utf-8', timeout: 5000 }
    );
    // Single-GPU only; extend for multi-GPU aggregation.
    const firstLine = smiOutput.trim().split('\n')[0];
    const [total, free, used] = firstLine.split(',').map((s) => Number(s.trim()));
    budget.gpuVRAM = { totalMB: total, freeMB: free, usedMB: used };
  } catch {
    // nvidia-smi not available; check for macOS unified memory
    if (process.platform === 'darwin') {
      try {
        const spOutput = execSync(
          'system_profiler SPHardwareDataType | grep "Memory"',
          { encoding: 'utf-8', timeout: 5000 }
        );
        const match = spOutput.match(/(\d+)\s*GB/);
        if (match) {
          const totalGB = parseInt(match[1], 10);
          budget.gpuVRAM = {
            totalMB: totalGB * 1024,
            freeMB: null,
            // Apple Silicon does not expose per-process GPU allocation.
            // Budget against system free RAM as an approximation only;
            // actual GPU pressure requires Metal API or Activity Monitor.
            usedMB: null,
            note: 'Apple unified memory: VRAM is shared with system RAM',
          };
        }
      } catch {
        // No GPU info available
      }
    }
  }
  return budget;
}
console.log(JSON.stringify(getMemoryBudget(), null, 2));
This provides the foundational data that every subsequent memory management decision depends on.
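For multi-GPU machines, the single-GPU parsing can be extended along the following lines. This is a sketch, not part of the tutorial's code: parseNvidiaSmiCsv and aggregateGpus are hypothetical helper names, and the sample string mimics the one-line-per-GPU CSV that nvidia-smi emits with the csv,noheader,nounits format.

```javascript
// Parse "total, free, used" CSV lines (one per GPU) into objects.
function parseNvidiaSmiCsv(output) {
  return output
    .trim()
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line, index) => {
      const [totalMB, freeMB, usedMB] = line
        .split(',')
        .map((s) => Number(s.trim()));
      return { index, totalMB, freeMB, usedMB };
    });
}

// Sum memory figures across all GPUs.
function aggregateGpus(gpus) {
  return gpus.reduce(
    (acc, gpu) => ({
      totalMB: acc.totalMB + gpu.totalMB,
      freeMB: acc.freeMB + gpu.freeMB,
      usedMB: acc.usedMB + gpu.usedMB,
    }),
    { totalMB: 0, freeMB: 0, usedMB: 0 }
  );
}

// Sample output for a machine with two 24 GB GPUs
const sample = '24576, 23000, 1576\n24576, 20000, 4576\n';
console.log(aggregateGpus(parseNvidiaSmiCsv(sample)));
```

Keep the per-GPU entries around as well as the sum: a model that is not split across devices must fit within the free memory of a single GPU, so the aggregate alone can overstate what is loadable.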
Core Memory Management Strategies
Strategy 1: Model Quantization for Reduced Footprint
Quantization reduces model precision from FP16 (16-bit floating point) to lower bit-widths like Q8_0 (8-bit) or Q4_K_M (4-bit). The memory savings are roughly proportional: FP16 uses 2 bytes per parameter, Q8_0 uses approximately 1.06 bytes (including per-block scale factors), and Q4_K_M approximately 0.54 bytes (including quantization metadata).
The tradeoff is output quality. For embedding models, Q8 quantization typically preserves quality well enough for retrieval tasks. Coding models and summarization models benefit from staying at Q8 or higher when possible. Q4 works for draft generation or low-stakes conversational tasks; on standard benchmarks for 7B-class models, the perplexity increase from FP16 to Q4_K_M is typically under 1%, but you should measure on your own eval set to confirm acceptable quality for your use case.
A structured model registry encodes these decisions. Save this as model-registry.js:
const MODEL_REGISTRY = {
  codeLlama7B: {
    name: 'CodeLlama-7B',
    parametersBillion: 7,
    quantization: 'Q8_0',
    estimatedVRAM_MB: 7400, // ~7.4 GB (includes quantization overhead)
    maxContextLength: 4096,
    role: 'coding',
    priority: 1, // lower = higher priority for staying loaded
  },
  mistral7BInstruct: {
    name: 'Mistral-7B-Instruct',
    parametersBillion: 7,
    quantization: 'Q4_K_M',
    estimatedVRAM_MB: 3800, // ~3.8 GB (includes quantization overhead)
    maxContextLength: 8192,
    role: 'summarization',
    priority: 2,
  },
  bgeSmallEn: {
    name: 'BGE-Small-EN',
    parametersBillion: 0.033,
    quantization: 'FP16',
    estimatedVRAM_MB: 128, // includes runtime overhead; raw weights ~66 MB
    maxContextLength: 512,
    role: 'embedding',
    priority: 0, // always loaded (pinned)
  },
};
module.exports = { MODEL_REGISTRY };
Each entry carries its estimated VRAM footprint, which the orchestrator uses to decide whether loading is feasible without eviction.
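As a small illustration of that feasibility decision, the sketch below checks whether a candidate fits alongside the models already loaded. The function name is hypothetical and the inline registry is a trimmed copy of the entries above, included only so the snippet runs on its own.

```javascript
// Trimmed copy of the registry: only the field this check needs.
const registry = {
  codeLlama7B: { estimatedVRAM_MB: 7400 },
  mistral7BInstruct: { estimatedVRAM_MB: 3800 },
  bgeSmallEn: { estimatedVRAM_MB: 128 },
};

// Returns true if `candidateId` fits within `budgetMB` alongside the
// models in `loadedIds`, without evicting anything.
function fitsWithoutEviction(reg, loadedIds, candidateId, budgetMB) {
  const candidate = reg[candidateId];
  if (!candidate) {
    throw new Error(`Unknown model: ${candidateId}`);
  }
  const usedMB = loadedIds.reduce(
    (sum, id) => sum + reg[id].estimatedVRAM_MB,
    0
  );
  return usedMB + candidate.estimatedVRAM_MB <= budgetMB;
}

console.log(fitsWithoutEviction(registry, ['bgeSmallEn'], 'codeLlama7B', 8000));
```

When the check fails, the orchestrator has to free memory first, which is the subject of the next strategy.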
Strategy 2: Dynamic Model Loading and Unloading
Only actively needed models belong in VRAM. The "hot swap" pattern evicts idle models when memory is required, using LRU (Least Recently Used) eviction: the model that has gone the longest without receiving an inference request gets unloaded first.
Warning: The _spawnModelProcess method below is a placeholder stub that simulates model loading without actually starting an inference runtime. Before using this orchestrator with real workloads, you must replace _spawnModelProcess with an actual integration — for example, spawning a llama.cpp server process or calling the Ollama API. Without this replacement, no real inference will occur.
Save this as model-manager.js:
class ModelManager {
  constructor(memoryBudgetMB) {
    this.memoryBudgetMB = memoryBudgetMB;
    this.loadedModels = new Map(); // modelId -> { config, lastUsed, process }
  }
  getUsedMemoryMB() {
    let total = 0;
    for (const [, model] of this.loadedModels) {
      total += model.config.estimatedVRAM_MB;
    }
    return total;
  }
  async loadModel(modelId, config, liveBudgetMB) {
    if (this.loadedModels.has(modelId)) {
      this.loadedModels.get(modelId).lastUsed = Date.now();
      return { status: 'already_loaded' };
    }
    // Use provided live budget if given, else fall back to construction-time budget
    const effectiveBudget = liveBudgetMB !== undefined
      ? liveBudgetMB
      : this.memoryBudgetMB;
    // Evict models until there is room, with an iteration cap to prevent infinite loops
    const maxEvictions = this.loadedModels.size + 1;
    let evictions = 0;
    while (
      this.getUsedMemoryMB() + config.estimatedVRAM_MB >
      effectiveBudget
    ) {
      if (evictions++ >= maxEvictions) {
        throw new Error(
          `Cannot free enough memory for ${modelId} after ${evictions - 1} evictions. ` +
          `Need ${config.estimatedVRAM_MB} MB, budget ${effectiveBudget} MB, ` +
          `used ${this.getUsedMemoryMB()} MB.`
        );
      }
      const evicted = await this._evictLRU();
      if (!evicted) {
        throw new Error(
          `Cannot free enough memory for ${modelId}. ` +
          `Need ${config.estimatedVRAM_MB} MB, budget ${effectiveBudget} MB, ` +
          `used ${this.getUsedMemoryMB()} MB.`
        );
      }
    }
    // PLACEHOLDER: replace with actual model process spawn
    // (e.g., Ollama pull + run, llama.cpp server)
    const proc = await this._spawnModelProcess(modelId, config);
    this.loadedModels.set(modelId, {
      config,
      lastUsed: Date.now(),
      process: proc,
    });
    return { status: 'loaded', memoryUsedMB: this.getUsedMemoryMB() };
  }
  async unloadModel(modelId) {
    const entry = this.loadedModels.get(modelId);
    if (!entry) return { status: 'not_loaded' };
    // Remove from map first so no new requests are routed here
    this.loadedModels.delete(modelId);
    if (entry.process && typeof entry.process.kill === 'function') {
      entry.process.kill('SIGTERM');
      // If process does not exit within 5 seconds, force kill
      const forceKillTimer = setTimeout(() => {
        try { entry.process.kill('SIGKILL'); } catch { /* already exited */ }
      }, 5000);
      // Prevent timer from keeping the event loop alive after server shutdown
      if (forceKillTimer.unref) forceKillTimer.unref();
    }
    return { status: 'unloaded', memoryUsedMB: this.getUsedMemoryMB() };
  }
  async _evictLRU() {
    let oldest = null;
    let oldestKey = null;
    for (const [key, model] of this.loadedModels) {
      // Never evict priority-0 (pinned) models
      if (model.config.priority === 0) continue;
      if (!oldest || model.lastUsed < oldest.lastUsed) {
        oldest = model;
        oldestKey = key;
      }
    }
    if (!oldestKey) return false;
    await this.unloadModel(oldestKey);
    return true;
  }
  async _spawnModelProcess(modelId, config) {
    // PLACEHOLDER: Integration point — spawn llama.cpp server, call Ollama API, etc.
    // Replace this with your actual runtime integration before use.
    console.log(`[STUB] Spawning process for ${config.name} (${config.quantization})`);
    return { kill: () => console.log(`[STUB] Killed process for ${config.name}`) };
  }
}
module.exports = { ModelManager };
The priority field protects critical models (like a small embedding model) from eviction. All other models compete on recency.
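One way to replace the stub is to launch a llama.cpp HTTP server per model. The following is a sketch under several assumptions: the binary is named llama-server and is on PATH (recent llama.cpp builds use this name; older builds shipped it as server), and the registry entries carry two extra hypothetical fields, modelPath (path to a GGUF file) and gpuLayers. Neither field exists in the registry shown earlier.

```javascript
const { spawn } = require('child_process');

// Build command-line arguments for llama.cpp's HTTP server.
// `config.modelPath` and `config.gpuLayers` are hypothetical registry fields.
function buildLlamaServerArgs(config, port) {
  return [
    '--model', config.modelPath,
    '--port', String(port),
    '--ctx-size', String(config.maxContextLength),
    // Defaults to 0 (CPU only) when no layer count is configured
    '--n-gpu-layers', String(config.gpuLayers ?? 0),
  ];
}

// Start the server and return the ChildProcess. It exposes kill(signal),
// which is the only method unloadModel() relies on.
function spawnLlamaServer(config, port, binary = 'llama-server') {
  const child = spawn(binary, buildLlamaServerArgs(config, port), {
    stdio: ['ignore', 'inherit', 'inherit'],
  });
  child.on('error', (err) => {
    console.error(`Failed to start ${binary} for ${config.name}:`, err.message);
  });
  return child;
}

console.log(
  buildLlamaServerArgs(
    { modelPath: '/models/example.gguf', maxContextLength: 4096, gpuLayers: 20 },
    8081
  ).join(' ')
);
```

A real integration would also assign each model its own port and wait for the server to report readiness before marking the model as loaded, since spawn returns as soon as the process starts rather than when the weights are in memory.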
Strategy 3: Model Layering Across GPU and CPU
What happens when a model's weight footprint exceeds available VRAM? Layer offloading splits the model between GPU and CPU memory. llama.cpp exposes this via the --n-gpu-layers parameter. For example, a 70B Q4 model (~37-38 GB) on a 24 GB GPU loads 40 of its 80 layers onto the GPU and keeps the rest in system RAM (LLaMA-2-70B architecture specifically; layer counts vary by model family).
From Node.js, you configure this when spawning the model process by passing the layer count as a command-line argument. The tradeoff is latency: layers that run from system RAM are bottlenecked by its bandwidth (typically 30-50 GB/s on DDR4; DDR5 and LPDDR5 platforms reach 80-200 GB/s) compared to GPU memory bandwidth (e.g., ~1 TB/s on an RTX 4090 with GDDR6X, or ~2 TB/s on an A100 with HBM2e). Per-token time is the sum of the time spent in each memory tier, so the CPU-resident layers dominate: with 50% of layers offloaded to DDR4, expect generation to be several times slower than a fully GPU-resident model, not merely 2x. DDR5 narrows the gap, and the exact factor is worth benchmarking on your own hardware.
The orchestrator can calculate an appropriate n-gpu-layers value dynamically by dividing available VRAM by the per-layer memory cost, which varies by model architecture but can be estimated as total model VRAM divided by total layer count.
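That calculation can be sketched as follows. The function and parameter names are hypothetical, and reserveMB is an added assumption: headroom held back for KV cache and runtime overhead so the GPU is not filled with weights alone. The estimate treats all layers as equal in size, which is only approximately true.

```javascript
// Estimate how many layers fit on the GPU, assuming equal-sized layers.
function computeGpuLayers({
  modelVRAM_MB,
  layerCount,
  availableVRAM_MB,
  reserveMB = 0, // headroom for KV cache and runtime overhead
}) {
  const perLayerMB = modelVRAM_MB / layerCount;
  const usableMB = Math.max(0, availableVRAM_MB - reserveMB);
  // Never report more layers than the model has
  return Math.min(layerCount, Math.floor(usableMB / perLayerMB));
}

// 70B Q4_K_M (~37.8 GB, 80 layers) on a 24 GB GPU with 2 GB held back
console.log(
  computeGpuLayers({
    modelVRAM_MB: 37800,
    layerCount: 80,
    availableVRAM_MB: 24000,
    reserveMB: 2000,
  })
);
```

The result is the value to pass as --n-gpu-layers. Starting a few layers below the estimate and increasing while watching VRAM usage is a safer way to converge than starting at the limit.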
Strategy 4: Request Queuing and Batching
Simultaneous inference requests to multiple loaded models can spike memory usage beyond the budget, even with careful loading and eviction. KV cache growth during active inference is the main culprit. A memory-aware request queue prevents this.
Tuning note: The maxConcurrent value below is set to 2 as a conservative default. Set this to the number of concurrent inference slots your GPU can support without OOM, based on your model sizes and available headroom. The minHeadroomMB of 512 MB is a starting point; for long-context inference where KV cache can grow by several gigabytes, increase this to 2048 MB or more.
Save this as semaphore.js:
class MemoryAwareSemaphore {
  constructor(getAvailableMemoryMB, minHeadroomMB = 512, maxConcurrent = 2) {
    this.getAvailableMemoryMB = getAvailableMemoryMB;
    this.minHeadroomMB = minHeadroomMB;
    this.maxConcurrent = maxConcurrent;
    this.waiters = []; // pending retry callbacks, one per blocked acquire()
    this.running = 0;
  }
  acquire() {
    return new Promise((resolve) => {
      const tryRun = () => {
        const available = this.getAvailableMemoryMB();
        if (
          this.running < this.maxConcurrent &&
          available > this.minHeadroomMB
        ) {
          this.running++;
          resolve();
        } else {
          // Park this waiter; release() will retry it
          this.waiters.push(tryRun);
        }
      };
      tryRun();
    });
  }
  release() {
    this.running = Math.max(0, this.running - 1);
    if (this.waiters.length > 0) {
      const next = this.waiters.shift();
      // Retry on next tick to allow memory to reflect released state
      setImmediate(next);
    }
  }
}
module.exports = { MemoryAwareSemaphore };
When available memory drops below the configured headroom, incoming requests wait until a running inference completes and frees KV cache memory. The maxConcurrent cap prevents unbounded parallel execution regardless of apparent memory availability.
Building the Orchestration Layer with Node.js
Architecture Overview
The system consists of three layers: a Node.js Express server (the orchestrator), local model runtimes (llama.cpp server instances, Ollama API, or equivalent), and a React dashboard consuming the orchestrator's REST API.
The REST API exposes four primary endpoints: GET /models (list status of all registry models), POST /load (load a model by ID), POST /unload (unload a model by ID), and POST /infer (run inference on a loaded model). A GET /health endpoint provides real-time memory telemetry.
Implementing the Model Orchestrator
Save this as orchestrator.js:
Warning: When nvidia-smi is unavailable (e.g., on a system without an NVIDIA GPU), getGPUFreeMB() falls back to reporting system free RAM. The orchestrator will then manage models against system RAM figures, not actual GPU VRAM. This may not reflect real GPU memory availability on AMD GPU or integrated-graphics systems. A warning is logged when this fallback is used.
const express = require('express');
const cors = require('cors');
const { ModelManager } = require('./model-manager');
const { MemoryAwareSemaphore } = require('./semaphore');
const { MODEL_REGISTRY } = require('./model-registry');
const { execSync } = require('child_process');
const os = require('os');
const app = express();
app.use(cors({ origin: process.env.CORS_ORIGIN || 'http://localhost:3000' }));
app.use(express.json());
const MAX_PROMPT_LENGTH = 100000; // ~100 KB; adjust per deployment
const MIN_BUDGET_MB = 0;
// Cached GPU reading to avoid blocking execSync on every request
let _gpuCacheValue = null;
let _gpuCacheTime = 0;
const GPU_CACHE_TTL_MS = 1000;
// Query current GPU free memory with a short TTL cache
// to avoid blocking the event loop
function getGPUFreeMB() {
  const now = Date.now();
  if (_gpuCacheValue !== null && now - _gpuCacheTime < GPU_CACHE_TTL_MS) {
    return _gpuCacheValue;
  }
  try {
    const output = execSync(
      'nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits',
      { encoding: 'utf-8', timeout: 5000 }
    );
    // Single-GPU only; takes the first GPU's value
    const firstLine = output.trim().split('\n')[0];
    const parsed = parseInt(firstLine.trim(), 10);
    if (!Number.isFinite(parsed)) throw new Error('Unparseable nvidia-smi output');
    _gpuCacheValue = parsed;
    _gpuCacheTime = now;
    return _gpuCacheValue;
  } catch {
    console.warn(
      '[WARNING] nvidia-smi not available. Falling back to system free RAM ' +
      'for memory budget. This may not reflect actual GPU VRAM.'
    );
    _gpuCacheValue = Math.round(os.freemem() / 1024 / 1024);
    _gpuCacheTime = now;
    return _gpuCacheValue;
  }
}
// Compute budget dynamically per-request to reflect current GPU state
function getLiveMemoryBudgetMB() {
  const free = getGPUFreeMB();
  const budget = free - 1024; // reserve 1 GB for system/driver overhead
  return Math.max(MIN_BUDGET_MB, budget);
}
// Use initial budget for ModelManager ceiling;
// live checks happen at request time
const INITIAL_BUDGET_MB = getLiveMemoryBudgetMB();
const manager = new ModelManager(INITIAL_BUDGET_MB);
const semaphore = new MemoryAwareSemaphore(
  () => getLiveMemoryBudgetMB() - manager.getUsedMemoryMB(),
  512
);
app.get('/models', (req, res) => {
  const models = Object.entries(MODEL_REGISTRY).map(([id, config]) => ({
    id,
    ...config,
    loaded: manager.loadedModels.has(id),
    lastUsed: manager.loadedModels.get(id)?.lastUsed || null,
  }));
  res.json(models);
});
app.post('/load', async (req, res) => {
  const { modelId } = req.body;
  if (!modelId || !Object.hasOwn(MODEL_REGISTRY, modelId)) {
    return res.status(404).json({ error: 'Model not in registry' });
  }
  const config = MODEL_REGISTRY[modelId];
  const currentBudget = getLiveMemoryBudgetMB();
  // Pre-flight check: reject models that cannot fit even with everything evicted
  if (config.estimatedVRAM_MB > currentBudget) {
    return res.status(400).json({
      error: `Model requires ${config.estimatedVRAM_MB} MB but total budget is ${currentBudget} MB`,
    });
  }
  try {
    // If free memory is short, loadModel evicts idle models (LRU) to make room
    const result = await manager.loadModel(modelId, config, currentBudget);
    res.json(result);
  } catch (err) {
    res.status(507).json({ error: err.message });
  }
});
app.post('/unload', async (req, res) => {
  const { modelId } = req.body;
  const result = await manager.unloadModel(modelId);
  res.json(result);
});
app.post('/infer', async (req, res) => {
  const { modelId, prompt } = req.body;
  if (!modelId || typeof prompt !== 'string' || prompt.length === 0) {
    return res.status(400).json({
      error: 'modelId and a non-empty prompt string are required',
    });
  }
  if (prompt.length > MAX_PROMPT_LENGTH) {
    return res.status(400).json({
      error: `Prompt exceeds maximum length of ${MAX_PROMPT_LENGTH} characters`,
    });
  }
  if (!Object.hasOwn(MODEL_REGISTRY, modelId)) {
    return res.status(400).json({ error: 'Model not in registry' });
  }
  if (!manager.loadedModels.has(modelId)) {
    return res.status(400).json({ error: 'Model not loaded' });
  }
  await semaphore.acquire();
  try {
    // Re-check after waiting: the model may have been evicted while this request was queued
    const entry = manager.loadedModels.get(modelId);
    if (!entry) {
      return res.status(409).json({ error: 'Model was unloaded while the request was queued' });
    }
    entry.lastUsed = Date.now();
    // PLACEHOLDER: forward prompt to actual model runtime
    const mockResponse = `[Response from ${MODEL_REGISTRY[modelId].name}]: ...`;
    res.json({ modelId, response: mockResponse });
  } finally {
    semaphore.release();
  }
});
app.get('/health', (req, res) => {
  const currentBudget = getLiveMemoryBudgetMB();
  const loadedDetails = [];
  for (const [id, entry] of manager.loadedModels) {
    loadedDetails.push({
      id,
      name: entry.config.name,
      estimatedVRAM_MB: entry.config.estimatedVRAM_MB,
      lastUsed: entry.lastUsed,
    });
  }
  res.json({
    timestamp: Date.now(),
    memoryBudgetMB: currentBudget,
    memoryUsedMB: manager.getUsedMemoryMB(),
    memoryFreeMB: currentBudget - manager.getUsedMemoryMB(),
    systemRAM_FreeMB: Math.round(os.freemem() / 1024 / 1024),
    loadedModels: loadedDetails,
    activeInferences: semaphore.running,
    queuedRequests: semaphore.waiters.length,
  });
});
const server = app.listen(3001, () =>
  console.log(`Orchestrator running. Initial memory budget: ${INITIAL_BUDGET_MB} MB`)
);
server.on('error', (err) => {
  console.error('Server failed to start:', err);
  process.exit(1);
});
module.exports = app;
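The /load endpoint performs a pre-flight memory check before attempting to load, and returns HTTP 507 (Insufficient Storage) if eviction cannot free enough space. The memory budget refreshes on each request to account for external GPU processes that may have started or stopped since the orchestrator launched. One caveat applies once the stub is replaced with a real runtime: nvidia-smi's memory.free already excludes the memory held by the orchestrator's own models, so subtracting getUsedMemoryMB() from a budget derived from memory.free counts those models twice. At that point, either compare the candidate's footprint against memory.free directly or derive the budget from memory.total.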
Building a Real-Time Monitoring Dashboard with React
Dashboard Layout and Components
The dashboard is organized around three parts: a memory gauge (a visual bar for budget usage), a model list (loaded models with status and unload buttons), and an inference log (recent requests with latencies; left as an exercise for the reader). In the code below, the gauge and the model list are rendered inline in a single Dashboard component rather than as separate components. Together they give operators immediate visibility into the memory state of the multi-model system.
Connecting to the Node.js Backend
Save this as client/src/Dashboard.jsx. Configure the backend URL via the REACT_APP_API_URL environment variable (defaults to http://localhost:3001 for local development):
import React, { useState, useEffect } from 'react';
const API_BASE = process.env.REACT_APP_API_URL || 'http://localhost:3001';
function Dashboard() {
  const [health, setHealth] = useState(null);
  const [error, setError] = useState(null);
  useEffect(() => {
    const interval = setInterval(() => {
      fetch(`${API_BASE}/health`)
        .then((res) => res.json())
        .then((data) => {
          setHealth(data);
          setError(null);
        })
        .catch((err) => setError(err.message));
    }, 2000);
    return () => clearInterval(interval);
  }, []);
  function handleUnload(modelId) {
    fetch(`${API_BASE}/unload`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ modelId }),
    })
      .then((res) => {
        if (!res.ok) throw new Error(`Unload failed: ${res.status}`);
        return res.json();
      })
      .then(() => {
        // Trigger immediate health refresh rather than waiting for next poll
        fetch(`${API_BASE}/health`)
          .then((r) => r.json())
          .then(setHealth)
          .catch((err) => setError(err.message));
      })
      .catch((err) => setError(`Unload error: ${err.message}`));
  }
  if (error) return <div className="error">Connection error: {error}</div>;
  if (!health) return <div>Loading...</div>;
  const rawPercent =
    health.memoryBudgetMB > 0
      ? (health.memoryUsedMB / health.memoryBudgetMB) * 100
      : 0;
  const usagePercent = Math.min(100, Math.max(0, Math.round(rawPercent)));
  return (
    <div style={{ fontFamily: 'monospace', padding: 20 }}>
      <h2>Memory Usage</h2>
      <div style={{
        background: '#333',
        borderRadius: 4,
        height: 30,
        width: '100%'
      }}>
        <div
          style={{
            background: usagePercent > 85 ? '#e74c3c' : '#2ecc71',
            height: '100%',
            width: `${usagePercent}%`,
            borderRadius: 4,
            transition: 'width 0.3s',
          }}
        />
      </div>
      <p>
        {health.memoryUsedMB} MB / {health.memoryBudgetMB} MB ({usagePercent}%)
        {' | '}Active: {health.activeInferences}, Queued: {health.queuedRequests}
      </p>
      <h2>Loaded Models</h2>
      {health.loadedModels.map((model) => (
        <div
          key={model.id}
          style={{
            border: '1px solid #555',
            padding: 10,
            marginBottom: 8,
          }}
        >
          <strong>{model.name}</strong>: {model.estimatedVRAM_MB} MB
          <button
            style={{ marginLeft: 12 }}
            onClick={() => handleUnload(model.id)}
          >
            Unload
          </button>
        </div>
      ))}
    </div>
  );
}
export default Dashboard;
The component polls /health every 2 seconds. The memory bar turns red when usage exceeds 85%, providing a visual warning before OOM conditions arise.
Triggering Model Load/Unload from the UI
The unload button shown above makes a direct POST to /unload. Load functionality follows the same pattern, calling /load with the target modelId. For a production setup, optimistic UI updates (immediately reflecting the expected state change and reverting on error) improve responsiveness, particularly when the /load endpoint triggers eviction that may take several seconds to complete.
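The optimistic update can be kept as a pure function so it is easy to test and to revert. The sketch below assumes the /health payload shape used by the dashboard (memoryUsedMB, memoryFreeMB, loadedModels); the function name is hypothetical.

```javascript
// Return a copy of the health snapshot as it should look once `modelId`
// is unloaded. The input object is not mutated, so the caller can keep
// it around and restore it if the request fails.
function applyOptimisticUnload(health, modelId) {
  const target = health.loadedModels.find((m) => m.id === modelId);
  if (!target) return health; // nothing to change
  return {
    ...health,
    memoryUsedMB: health.memoryUsedMB - target.estimatedVRAM_MB,
    memoryFreeMB: health.memoryFreeMB + target.estimatedVRAM_MB,
    loadedModels: health.loadedModels.filter((m) => m.id !== modelId),
  };
}

// Intended use inside handleUnload (React state setters shown as comments):
//   const previous = health;
//   setHealth(applyOptimisticUnload(health, modelId));
//   fetch(...).catch(() => setHealth(previous)); // revert on error
const snapshot = {
  memoryUsedMB: 7528,
  memoryFreeMB: 472,
  loadedModels: [
    { id: 'codeLlama7B', estimatedVRAM_MB: 7400 },
    { id: 'bgeSmallEn', estimatedVRAM_MB: 128 },
  ],
};
console.log(applyOptimisticUnload(snapshot, 'codeLlama7B'));
```

The next successful poll of /health replaces the optimistic state with the server's view, so any drift between the two is short-lived.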
Performance Tuning and Troubleshooting
Common Memory Pitfalls
Developers most frequently overlook KV cache growth as an OOM cause during long-context inference. A model loaded within VRAM budget can still exceed it mid-inference if the context window fills. Memory fragmentation after repeated load/unload cycles can also reduce the effective contiguous VRAM available, particularly on older CUDA driver versions. The Node.js orchestrator process itself is another overlooked cost, although it draws on system RAM rather than VRAM: expect roughly 100 MB at baseline, rising toward 300 MB with large route tables or verbose logging enabled.
Optimization Tips
Pre-warming critical models during predictable low-traffic periods (e.g., on server startup or during scheduled maintenance windows) avoids cold-start latency during peak demand. Reducing maxContextLength for models that do not need long context (embedding models, for instance, rarely need more than 512 tokens) directly reduces KV cache memory. Pinning the most frequently used model (by setting its priority to 0 in the registry) ensures it is never evicted, while secondary models swap based on demand.
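A startup pre-warm pass can be sketched as a greedy selection over the registry: take models in priority order and keep each one that still fits. The function name is hypothetical, and the inline registry is a trimmed copy of the earlier entries so the snippet runs on its own.

```javascript
// Trimmed copy of the registry: only the fields this selection needs.
const registry = {
  codeLlama7B: { priority: 1, estimatedVRAM_MB: 7400 },
  mistral7BInstruct: { priority: 2, estimatedVRAM_MB: 3800 },
  bgeSmallEn: { priority: 0, estimatedVRAM_MB: 128 },
};

// Pick models to load at startup: lowest priority number first,
// skipping any model that would push usage past the budget.
function selectPrewarmModels(reg, budgetMB) {
  const ordered = Object.entries(reg).sort(
    ([, a], [, b]) => a.priority - b.priority
  );
  const selected = [];
  let usedMB = 0;
  for (const [id, config] of ordered) {
    if (usedMB + config.estimatedVRAM_MB > budgetMB) continue;
    selected.push(id);
    usedMB += config.estimatedVRAM_MB;
  }
  return { selected, usedMB };
}

console.log(selectPrewarmModels(registry, 8000));
```

The selected IDs can then be passed one by one to the /load endpoint (or to manager.loadModel directly) before the server starts accepting inference traffic.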
Complete Implementation Checklist
- Audit hardware: query GPU VRAM and system RAM using nvidia-smi or system_profiler
- Build a model registry with quantization level, estimated VRAM (inclusive of quantization overhead), priority, and role metadata
- Implement ModelManager with LRU eviction and a configurable memory ceiling
- Replace the _spawnModelProcess stub with your chosen inference runtime (Ollama API, llama.cpp server, etc.)
- Configure layer offloading via --n-gpu-layers for models exceeding available VRAM
- Add a memory-aware request queue with a concurrency cap and minimum headroom threshold
- Expose orchestration via REST API with pre-load memory checks returning 507 on failure
- Build a React monitoring dashboard polling /health every 2 seconds
- Set up recurring health checks and alerting thresholds (85% VRAM as warning)
- Test under load: simulate concurrent multi-model inference requests to validate eviction and queuing
- Profile and tune context windows, quantization levels, and eviction policies based on observed usage patterns
Scaling Locally Without Scaling Hardware
Memory management, not model capability, is the real bottleneck in multi-model local deployments. The orchestration pattern described here combines quantization-aware registries, LRU eviction, layer offloading, and memory-gated request queuing to run several specialized models on a single machine without additional hardware. The most impactful next step is replacing the polling-based dashboard with WebSocket connections for true real-time updates.

