The economics of team local AI are difficult to ignore. A capable inference GPU costs between $2,000 and $10,000, and most of the time it sits idle on a single developer's desk. This tutorial builds a complete shared local AI inference server that puts one GPU to work for your entire team.

How to Share One GPU Across Multiple Developers

  1. Provision a GPU workstation with 24 GB+ VRAM and verify NVIDIA drivers and CUDA are installed.
  2. Install Ollama on the workstation, pull your target model, and bind to the LAN with OLLAMA_HOST=0.0.0.0.
  3. Deploy Redis with AOF persistence enabled to back the job queue.
  4. Build a Node.js API gateway with Express, API-key authentication, and proxy endpoints mirroring Ollama's API.
  5. Configure BullMQ for fair queuing with priority based on per-developer usage counts and concurrency of 1.
  6. Add WebSocket broadcasting so queue depth and GPU status update in real time.
  7. Create a React monitoring dashboard that consumes the WebSocket feed and displays queue and GPU state.
  8. Distribute API keys and the gateway URL to your team, pointing their tools at the gateway instead of a local Ollama instance.

Why Your Team Doesn't Need a GPU Per Developer

The economics of team local AI are difficult to ignore. A capable inference GPU, an RTX 4090 or A6000, costs between $2,000 and $10,000 at current street prices. Multiply that across five or six developers and the hardware budget alone becomes a serious line item. The alternative, routing every request through a cloud API, introduces its own problems: per-token costs that compound month over month, latency that disrupts flow states, rate limits during peak usage, and the ongoing risk of sending proprietary code to third-party servers.

But GPU inference has a characteristic that works in a team's favor: most interactive developer interactions with AI are bursty. A developer sends a prompt, waits a few seconds for the response, then spends minutes reading, editing, and thinking before sending the next one. A single GPU sitting on one person's desk is idle the vast majority of the time. Sharing one GPU across multiple developers is not a compromise. It is a rational allocation of an expensive, underutilized resource. (Batch or CI-driven workloads sustain higher utilization and should be evaluated separately.)

This tutorial builds a complete shared local AI inference server: a GPU workstation running Ollama as the model backend, a Node.js API gateway handling authentication and fair queuing, and a React dashboard providing real-time visibility into queue depth and GPU status.

Prerequisites

  • GPU workstation: One machine with an NVIDIA GPU (24GB+ VRAM recommended). NVIDIA driver ≥ 525.x and CUDA ≥ 11.8 required. Verify with nvidia-smi (check the "CUDA Version" in the header) and nvcc --version. If CUDA is absent, install via the NVIDIA CUDA Toolkit.
  • Node.js: version 20 or later (the client example relies on the built-in fetch).
  • Redis 6.2 or later with AOF persistence enabled (see "Project Scaffolding" below).
  • Basic familiarity with REST APIs and React.

Architecture Overview: How Team Local AI Sharing Works

Core Components

The system consists of three layers. The GPU workstation runs Ollama at the bottom, loading models into VRAM and exposing a local HTTP API for inference. A Node.js API gateway sits in the middle, authenticating incoming requests, placing them into a fair queue, and proxying them to Ollama when the GPU is available. A React dashboard on top connects over WebSocket to display live queue state, per-developer statistics, and GPU health.

How Requests Flow

A developer's tool sends a request to the gateway. The gateway validates the API key, identifies the developer, and enqueues the job in BullMQ. The queue worker picks up jobs one at a time (or at whatever concurrency the GPU can sustain), forwards the request to Ollama's API, and returns the response back through the gateway to the developer. Simultaneously, the gateway broadcasts queue state changes over WebSocket to the monitoring dashboard.

The flow is: Developer Tool → API Gateway (auth + enqueue) → BullMQ Queue → Worker → Ollama (GPU inference) → Response → Developer Tool. The dashboard taps into the queue state at the gateway level via WebSocket.

Why This Stack (Node.js + React)

Node.js fits this role well compared to thread-per-request servers. The event loop handles concurrent I/O-bound proxy connections without blocking, which matters when multiple developers are waiting on queued responses simultaneously. (CPU-intensive synchronous operations should be offloaded to worker threads.) React provides a lightweight path to a real-time dashboard. And practically speaking, most frontend and full-stack teams already have both in their toolchain, which lowers the barrier to maintaining and extending this system.

Project File Structure

Before proceeding, create the following directory layout. Each code section below corresponds to a specific file:

team-ai-gateway/
├── server.js
├── auth.js
├── queue.js
├── websocket.js
├── .env
├── .gitignore
└── package.json

Setting Up the GPU Host with Ollama

Installing Ollama and Pulling a Model

Ollama provides the simplest path from bare metal to a running inference server. The following commands install Ollama, pull a model appropriate for code assistance, and configure the server to accept connections from other machines on the local network.

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull a code-focused model (adjust based on VRAM capacity)
ollama pull codellama:13b

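# Configure Ollama to listen on all network interfaces.
# The install script runs Ollama as a systemd service, which does not read
# ~/.bashrc, so set the variable in a service override instead:
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"

# Reload systemd and restart Ollama with the new binding
sudo systemctl daemon-reload
sudo systemctl restart ollama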

By default, Ollama binds to 127.0.0.1, which prevents any access from other machines. Setting OLLAMA_HOST=0.0.0.0 makes it accessible on the LAN. Ollama itself does not authenticate requests, so anyone who can reach port 11434 bypasses the gateway's API keys and queue. This is acceptable inside a trusted network but should be paired with firewall rules if the workstation is exposed to a broader network. For example, to restrict access to a local subnet:

sudo ufw allow from 192.168.0.0/24 to any port 11434

Adjust the subnet to match your LAN.

Verifying GPU Utilization

Before building anything on top, verify that the model is loaded onto the GPU and that the API responds correctly.

# Test the Ollama API from another machine on the network
curl http://GPU_MACHINE_IP:11434/api/generate -d '{
  "model": "codellama:13b",
  "prompt": "Write a JavaScript function to debounce input.",
  "stream": false
}'

# After the first inference request completes, confirm GPU is being used
nvidia-smi

The nvidia-smi output should show the ollama process occupying VRAM. Note that the model lazy-loads: the ollama process will not appear in nvidia-smi until the first inference request is issued. The curl command should return a JSON response with the generated text. If the response comes back but nvidia-smi shows no GPU usage, the model is likely running on CPU because the CUDA drivers are not properly installed. Verify with nvidia-smi that "CUDA Version" shows 11.8 or higher.

Building the Node.js API Gateway

Project Scaffolding and Dependencies

{
  "name": "team-ai-gateway",
  "version": "1.0.0",
  "type": "module",
  "engines": {
    "node": ">=20.0.0"
  },
  "scripts": {
    "start": "node server.js"
  },
  "dependencies": {
    "express": "^4.18.2",
    "bullmq": "^5.1.0",
    "ioredis": "^5.3.2",
    "jsonwebtoken": "^9.0.2",
    "dotenv": "^16.3.1",
    "ws": "^8.16.0",
    "axios": "^1.6.0"
  }
}

BullMQ requires Redis, so a Redis instance must be running on the gateway machine or accessible on the network. Install Redis via apt install redis-server or run it in a container. Enable AOF persistence (appendonly yes in redis.conf) so that queued jobs survive Redis restarts. If Redis requires authentication, add password to the IORedis connection options. Set maxmemory-policy noeviction to prevent Redis from silently dropping data under memory pressure.

Startup order: Redis must be running before you start the gateway (node server.js). If Redis is unreachable, IORedis will emit a connection error (logged by the gateway).

Before running npm install, create a .gitignore file:

.env
node_modules/

Then run npm install.

API Key Authentication Middleware

Authentication uses a simple API key scheme. A .env file maps keys to developer identities.

Never commit .env to version control. Ensure .env is listed in .gitignore (done above). Set file permissions: chmod 600 .env. Rotate the example keys below before deploying — do not use them in production.

API_KEYS={"sk-alice-abc123":"alice","sk-bob-def456":"bob","sk-carol-ghi789":"carol"}
OLLAMA_URL=http://GPU_MACHINE_IP:11434
REDIS_HOST=127.0.0.1
REDIS_PORT=6379

The middleware extracts the key from the Authorization header and attaches the developer identity to the request. Save this as auth.js:

import dotenv from 'dotenv';
dotenv.config();

let apiKeys;
try {
  if (!process.env.API_KEYS) {
    throw new Error('API_KEYS environment variable is not set');
  }
  apiKeys = JSON.parse(process.env.API_KEYS);
  // JSON.parse('null') yields null, whose typeof is also 'object'
  if (apiKeys === null || typeof apiKeys !== 'object' || Array.isArray(apiKeys)) {
    throw new Error('API_KEYS must be a JSON object mapping keys to developer IDs');
  }
} catch (err) {
  console.error('[auth] Fatal: failed to parse API_KEYS —', err.message);
  process.exit(1);
}

export function authMiddleware(req, res, next) {
  const authHeader = req.headers['authorization'] ?? '';
  const key = authHeader.replace(/^Bearer\s+/i, '');

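  // Object.hasOwn rejects inherited names such as "constructor",
  // which a plain apiKeys[key] lookup would treat as a valid key.
  if (!key || !Object.hasOwn(apiKeys, key)) {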
    return res.status(401).json({ error: 'Invalid API key' });
  }

  req.developerId = apiKeys[key];
  next();
}

This is intentionally minimal. For teams that already have an identity provider, swapping in JWT validation or OAuth token introspection requires replacing roughly 20 lines in auth.js.
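One way to check the middleware's behavior without starting Express is to call it with stub request and response objects. The sketch below is an assumption-laden variant of auth.js: createAuthMiddleware and stubResponse are hypothetical helpers introduced here so the key map can be injected, and the test key is made up.

```javascript
// Hypothetical factory mirroring auth.js, so the key map can be injected.
function createAuthMiddleware(apiKeys) {
  return function authMiddleware(req, res, next) {
    const authHeader = req.headers['authorization'] ?? '';
    const key = authHeader.replace(/^Bearer\s+/i, '');
    // Own-property check: inherited names like "constructor" are rejected.
    if (!key || !Object.hasOwn(apiKeys, key)) {
      return res.status(401).json({ error: 'Invalid API key' });
    }
    req.developerId = apiKeys[key];
    next();
  };
}

// Minimal stand-in for an Express response: records status and body.
function stubResponse() {
  const res = { statusCode: 200, body: null };
  res.status = (code) => { res.statusCode = code; return res; };
  res.json = (body) => { res.body = body; return res; };
  return res;
}

const auth = createAuthMiddleware({ 'sk-test-key': 'alice' });

// Valid key: next() is called and the identity is attached.
const okReq = { headers: { authorization: 'Bearer sk-test-key' } };
let nextCalled = false;
auth(okReq, stubResponse(), () => { nextCalled = true; });
console.log(nextCalled, okReq.developerId); // true alice

// Unknown key: the middleware answers 401 and never calls next().
const badRes = stubResponse();
auth({ headers: { authorization: 'Bearer nope' } }, badRes, () => {});
console.log(badRes.statusCode); // 401
```

Injecting the key map keeps the check independent of .env, which makes it easy to run in CI.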

Fair Request Queue with BullMQ

In most single-GPU setups, the GPU processes one inference request at a time. Without a queue, concurrent requests block each other, cause out-of-memory errors, or time out. BullMQ provides persistent job queuing backed by Redis (persistence depends on Redis AOF/RDB configuration), with built-in support for concurrency control and job prioritization.

The fair scheduling approach assigns priority based on each developer's recent request count. A developer who has sent more requests in the current window receives lower priority on their next request, preventing any single person from monopolizing the resource.
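The rule can be sketched as a pure function, separate from BullMQ. This is an illustration under stated assumptions: priorityFor, the cap of 100, and the usage counts are hypothetical; the only property taken from the article is that a lower priority number is served earlier.

```javascript
// Sketch: map a developer's request count in the current window to a
// priority number, where a LOWER number is served EARLIER.
// MAX_PRIORITY = 100 is an illustrative cap chosen for this example.
const MAX_PRIORITY = 100;

function priorityFor(requestCount) {
  // First request in the window -> 1 (served first); heavy users cap at 100.
  return Math.min(MAX_PRIORITY, Math.max(1, requestCount));
}

// Hypothetical usage counts for three developers in one window.
const counts = { alice: 42, bob: 3, carol: 150 };

// Sort ascending by priority number to see whose next job is favored.
const order = Object.keys(counts).sort(
  (a, b) => priorityFor(counts[a]) - priorityFor(counts[b])
);

console.log(order); // [ 'bob', 'alice', 'carol' ]
```

The lightest user's next request jumps ahead, while the cap puts all very heavy users in the same lowest tier.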

Save this as queue.js:

import { Queue, Worker, QueueEvents } from 'bullmq';
import IORedis from 'ioredis';
import axios from 'axios';

function makeConnection() {
  const conn = new IORedis({
    host: process.env.REDIS_HOST || '127.0.0.1',
    port: parseInt(process.env.REDIS_PORT || '6379', 10),
    maxRetriesPerRequest: null, // required by BullMQ
  });
  conn.on('error', (err) => {
    console.error('[redis] connection error:', err.message);
  });
  return conn;
}

// BullMQ requires separate IORedis instances for Queue, Worker, and QueueEvents
const queueConnection = makeConnection();
const workerConnection = makeConnection();
const eventsConnection = makeConnection();

const inferenceQueue = new Queue('inference', { connection: queueConnection });
const queueEvents = new QueueEvents('inference', { connection: eventsConnection });

// Track per-developer request counts for fair scheduling.
// Note: counts are in-memory and reset on restart. For durable fairness
// across restarts, persist counts in Redis using
// redis.incr(`devcount:${developerId}`) with a TTL matching the reset interval.
let devRequestCounts = {};

export async function enqueueRequest(developerId, payload) {
  devRequestCounts[developerId] = (devRequestCounts[developerId] || 0) + 1;

  // BullMQ: lower number = higher priority, so use the request count directly.
  // A developer's first request in the window gets 1; heavy users drift toward 100.
  // Capped at 100 so all heavy users share the same lowest tier.
  const priority = Math.min(100, devRequestCounts[developerId]);

  const job = await inferenceQueue.add('generate', {
    developerId,
    payload
  }, { priority });

  return job;
}

export const OLLAMA_TIMEOUT_MS = 110_000;

// Worker processes one job at a time (concurrency: 1)
const worker = new Worker('inference', async (job) => {
  const { payload } = job.data;

  // Chat requests carry a `messages` array; everything else goes to /api/generate.
  const endpoint = Array.isArray(payload.messages) ? '/api/chat' : '/api/generate';

  const response = await axios.post(
    `${process.env.OLLAMA_URL}${endpoint}`,
    { ...payload, stream: false },
    { timeout: OLLAMA_TIMEOUT_MS }
  );

  return response.data;
}, { connection: workerConnection, concurrency: 1 });

worker.on('error', (err) => {
  console.error('[worker] error:', err.message);
});

// Reset counts every 10 minutes to prevent permanent deprioritization.
// Reassign to free memory for departed developers.
setInterval(() => {
  devRequestCounts = {};
}, 600_000);

export { inferenceQueue, queueEvents };

Setting concurrency: 1 ensures only one request hits the GPU at a time. For machines with large VRAM running smaller models, this can be increased, but testing under load is essential before doing so.

Proxy Endpoint: /api/generate and /api/chat

The gateway exposes endpoints that mirror Ollama's API format for non-streaming generate and chat calls. When stream: true is requested, the gateway uses SSE transport to return the completed response as a single frame. Full token-level streaming is not implemented; the gateway buffers responses server-side before delivering them. When stream: false (or omitted), the response is returned as standard JSON. Time-to-first-token equals full inference time in both cases.
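The single-frame SSE behavior described above can be sketched in isolation. This is a minimal illustration, not the gateway's code: sseFrame and parseSseBody are hypothetical helpers, and the sample payload is only shaped like an Ollama generate response.

```javascript
// Build one SSE frame: a "data:" line terminated by a blank line.
function sseFrame(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Client side: pull JSON payloads out of a fully buffered SSE body,
// skipping the "[DONE]" sentinel.
function parseSseBody(body) {
  return body
    .split('\n\n')
    .map((frame) => frame.trim())
    .filter((frame) => frame.startsWith('data: '))
    .map((frame) => frame.slice('data: '.length))
    .filter((data) => data !== '[DONE]')
    .map((data) => JSON.parse(data));
}

// Hypothetical result object for illustration.
const body = sseFrame({ response: 'hello', done: true }) + 'data: [DONE]\n\n';
console.log(parseSseBody(body)); // [ { response: 'hello', done: true } ]
```

Because JSON.stringify never emits raw newlines, each payload fits on one data line and the blank-line delimiter stays unambiguous.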

Save this as server.js:

import express from 'express';
import { authMiddleware } from './auth.js';
import { enqueueRequest, queueEvents, OLLAMA_TIMEOUT_MS } from './queue.js';
import { broadcastQueueState } from './websocket.js';

const app = express();
// Limit body size to prevent memory exhaustion from oversized payloads
app.use(express.json({ limit: '1mb' }));

// Gateway wait timeout, slightly shorter than the worker's Ollama timeout so the
// client receives a clean error first. The timer starts at enqueue, so time
// spent waiting behind other jobs counts against it.
const GATEWAY_WAIT_MS = OLLAMA_TIMEOUT_MS - 5_000; // 105 000 ms

app.post('/api/generate', authMiddleware, async (req, res) => {
  // NOTE: This gateway does not support true token-level streaming.
  // All requests are buffered server-side regardless of the stream field.
  // Callers requesting stream:true receive a single SSE frame.
  const useSSE = req.body.stream === true;

  if (useSSE) {
    res.setHeader('Content-Type', 'text/event-stream');
    res.setHeader('Cache-Control', 'no-cache');
    res.setHeader('Connection', 'keep-alive');
  }

  let job;
  try {
    job = await enqueueRequest(req.developerId, { ...req.body, stream: false });
  } catch (err) {
    console.error('[gateway] enqueue error:', err.message);
    if (useSSE) {
      res.write(`data: ${JSON.stringify({ error: 'Failed to enqueue request' })}\n\n`);
      return res.end();
    }
    return res.status(500).json({ error: 'Failed to enqueue request' });
  }

  // Await so broadcast reflects the just-enqueued job count
  await broadcastQueueState();

  try {
    const result = await job.waitUntilFinished(queueEvents, GATEWAY_WAIT_MS);
    if (useSSE) {
      res.write(`data: ${JSON.stringify(result)}\n\n`);
      res.write('data: [DONE]\n\n');
      return res.end();
    }
    return res.json(result);
  } catch (err) {
    console.error('[gateway] waitUntilFinished error:', err.message);
    if (useSSE) {
      res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
      return res.end();
    }
    return res.status(500).json({ error: err.message });
  }
});

app.post('/api/chat', authMiddleware, async (req, res) => {
  let job;
  try {
    job = await enqueueRequest(req.developerId, req.body);
  } catch (err) {
    console.error('[gateway] enqueue error:', err.message);
    return res.status(500).json({ error: 'Failed to enqueue request' });
  }

  await broadcastQueueState();

  try {
    const result = await job.waitUntilFinished(queueEvents, GATEWAY_WAIT_MS);
    return res.json(result);
  } catch (err) {
    console.error('[gateway] waitUntilFinished error:', err.message);
    return res.status(500).json({ error: err.message });
  }
});

app.listen(3000, () => console.log('[gateway] running on port 3000'));

Building the React Monitoring Dashboard

Dashboard Features

The dashboard built here provides two core views: real-time queue depth (waiting and active jobs) and current GPU status (busy or idle). These give the team a quick answer to the question that matters: "How long until my request gets processed?" Per-developer queue position and request counts would require adding those fields to the WebSocket payload.

A developer leaderboard listing each team member's request count and average response time is a natural extension left as an exercise.

WebSocket Integration for Live Updates

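The Node.js gateway broadcasts queue state to every connected dashboard. In the code shown here, broadcastQueueState is called when a job is enqueued; to refresh the dashboard when a job starts or finishes as well, call it from listeners on BullMQ's QueueEvents (active, completed, failed). Save this as websocket.js: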

import { WebSocketServer } from 'ws';
import { inferenceQueue } from './queue.js';

const wss = new WebSocketServer({ port: 3001 });
const clients = new Set();

wss.on('connection', (ws) => {
  clients.add(ws);

  ws.on('close', () => clients.delete(ws));

  ws.on('error', (err) => {
    console.error('[ws] client error:', err.message);
    clients.delete(ws);
  });
});

let broadcastTimer = null;

export function broadcastQueueState() {
  // Debounce: coalesce rapid successive calls into one Redis read + send
  if (broadcastTimer) return Promise.resolve();

  return new Promise((resolve) => {
    broadcastTimer = setTimeout(async () => {
      broadcastTimer = null;

      try {
        const state = {
          queueDepth: await inferenceQueue.getJobCounts(),
          timestamp: Date.now(),
        };
        const msg = JSON.stringify(state);

        for (const client of clients) {
          // Only send to fully open connections
          if (client.readyState === client.OPEN) {
            try {
              client.send(msg);
            } catch (err) {
              console.error('[ws] send error:', err.message);
              clients.delete(client);
            }
          } else {
            clients.delete(client);
          }
        }
      } catch (err) {
        console.error('[ws] broadcastQueueState error:', err.message);
      }

      resolve();
    }, 200); // 200 ms trailing-edge debounce
  });
}
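The listing above only broadcasts when something calls broadcastQueueState. A possible way to refresh the dashboard on every job transition is to subscribe to queue events, sketched below. wireBroadcast is a hypothetical helper, and it accepts any object with an on(name, handler) method so it can be tried without Redis; in the gateway that object would presumably be the queueEvents instance exported from queue.js.

```javascript
// Sketch: re-broadcast whenever a job changes state.
// `events` is any object exposing .on(name, handler).
const JOB_EVENTS = ['waiting', 'active', 'completed', 'failed'];

function wireBroadcast(events, broadcast) {
  const logError = (err) => console.error('[ws] broadcast failed:', err.message);

  for (const name of JOB_EVENTS) {
    events.on(name, () => {
      try {
        // broadcast may return a promise; attach a handler so a rejection
        // never surfaces as an unhandled rejection.
        const pending = broadcast();
        if (pending && typeof pending.catch === 'function') pending.catch(logError);
      } catch (err) {
        logError(err);
      }
    });
  }
}

// Assumed wiring inside websocket.js (not part of the original listing):
//   wireBroadcast(queueEvents, broadcastQueueState);
```

Since broadcastQueueState already coalesces calls within 200 ms, bursts of events should still produce at most one Redis read per window.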

On the React side, a hook connects to the WebSocket and keeps state current. Replace GATEWAY_IP with the actual LAN IP of the gateway machine. In HTTPS contexts, use wss:// instead of ws://.

import { useState, useEffect, useRef } from 'react';

const BASE_RETRY_MS = 1_000;
const MAX_RETRY_MS = 30_000;

export function useQueueState() {
  const [queueState, setQueueState] = useState({ queueDepth: {}, timestamp: null });
  const retryDelay = useRef(BASE_RETRY_MS);
  const retryTimer = useRef(null);

  useEffect(() => {
    let ws;
    let destroyed = false;

    function connect() {
      ws = new WebSocket('ws://GATEWAY_IP:3001');

      ws.onopen = () => {
        retryDelay.current = BASE_RETRY_MS; // reset backoff on successful connect
      };

      ws.onmessage = (event) => {
        try {
          setQueueState(JSON.parse(event.data));
        } catch {
          console.warn('[useQueueState] received non-JSON message');
        }
      };

      ws.onerror = (err) => {
        console.error('[useQueueState] WebSocket error', err);
      };

      ws.onclose = () => {
        if (destroyed) return;
        retryTimer.current = setTimeout(() => {
          retryDelay.current = Math.min(retryDelay.current * 2, MAX_RETRY_MS);
          connect();
        }, retryDelay.current);
      };
    }

    connect();

    return () => {
      destroyed = true;
      clearTimeout(retryTimer.current);
      ws?.close();
    };
  }, []); // empty dep array: connect once on mount

  return queueState;
}
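To make the reconnect behavior concrete, the delays the hook waits between attempts can be computed on their own. The constants match the hook above; backoffSchedule and nextDelay are hypothetical helpers added for illustration.

```javascript
const BASE_RETRY_MS = 1_000;
const MAX_RETRY_MS = 30_000;

// Delay to use after `current`, doubling up to the cap.
function nextDelay(current) {
  return Math.min(current * 2, MAX_RETRY_MS);
}

// List the waits before each of the first `attempts` reconnects.
function backoffSchedule(attempts) {
  const delays = [];
  let delay = BASE_RETRY_MS;
  for (let i = 0; i < attempts; i += 1) {
    delays.push(delay);
    delay = nextDelay(delay);
  }
  return delays;
}

console.log(backoffSchedule(7).join(', '));
// 1000, 2000, 4000, 8000, 16000, 30000, 30000
```

After roughly half a minute of failures the dashboard settles into one attempt every 30 seconds, and a successful connection resets the delay to one second.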

Dashboard UI Components

The dashboard prioritizes function over aesthetics. A queue status card shows current waiting and active job counts. A GPU health indicator reflects whether the worker is currently processing or idle.

import { useQueueState } from './useQueueState';

export function Dashboard() {
  const { queueDepth, timestamp } = useQueueState();

  return (
    <div style={{ fontFamily: 'monospace', padding: '2rem' }}>
      <h2>Team AI Gateway: Live Status</h2>
      <div style={{ display: 'flex', gap: '2rem' }}>
        <div>
          <h3>Queue</h3>
          <p>Waiting: {queueDepth?.waiting ?? '...'}</p>
          <p>Active: {queueDepth?.active ?? '...'}</p>
        </div>
        <div>
          <h3>GPU Status</h3>
          <p>{queueDepth?.active > 0 ? '🔴 Busy' : '🟢 Idle'}</p>
        </div>
      </div>
      <p style={{ fontSize: '0.8rem', color: '#888' }}>
        Last update: {timestamp ? new Date(timestamp).toLocaleTimeString() : 'connecting...'}
      </p>
    </div>
  );
}

Developer Workflow: Connecting to the Shared GPU

Pointing Developer Tools at the Gateway

Because the gateway mirrors Ollama's API format for non-streaming generate and chat calls, developers connect by pointing their tools at the gateway's address instead of a local Ollama instance. For VS Code extensions like Continue and Cody, set the tool's Ollama base URL (an environment variable or a config field, depending on the tool) and make sure the tool can send the Authorization: Bearer header the gateway expects:

# Tool-specific variable (NOT the Ollama server bind address):
export OLLAMA_BASE_URL=http://GATEWAY_IP:3000

# Refer to your tool's documentation for the correct env var name.
# Do NOT set this on the GPU workstation — it is unrelated to OLLAMA_HOST.

Custom scripts, CLI tools, and any HTTP client that was previously targeting localhost:11434 only need the URL updated.

Example: Using the Gateway from a Node.js Script

const response = await fetch('http://GATEWAY_IP:3000/api/generate', {
  method: 'POST',
  headers: {
    'Content-Type': 'application/json',
    'Authorization': 'Bearer sk-alice-abc123'
  },
  body: JSON.stringify({
    model: 'codellama:13b',
    prompt: 'Refactor this function to use async/await:\nfunction getData(cb) { fetch("/api").then(r => r.json()).then(cb); }',
    stream: false
  })
});

const data = await response.json();
console.log(data.response);

From the developer's perspective, the experience is nearly identical to using a local Ollama installation, with the addition of an API key header.

Performance Tuning and Scaling Considerations

Concurrency and Batching

The default concurrency of 1 is the safe starting point for a single GPU. With larger VRAM GPUs (48GB on an A6000, for example) running smaller models (7B parameter), you can increase concurrency to 2 if nvidia-smi shows 8 GB or more of free VRAM under typical load. This requires testing: monitor VRAM usage via nvidia-smi and watch for out-of-memory failures. Request batching, grouping similar short prompts into a single inference call, is another optimization, though it adds complexity and works best for uniform workloads like batch code review rather than interactive use.
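The free-VRAM check can be scripted rather than read off the screen. The sketch below assumes the output of nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits, which reports MiB values; freeVramMiB and canTryConcurrencyTwo are hypothetical helpers and the sample readings are made up.

```javascript
// Parse one line such as "30120, 49140" (used MiB, total MiB).
function freeVramMiB(csvLine) {
  const [used, total] = csvLine.split(',').map((v) => Number(v.trim()));
  if (!Number.isFinite(used) || !Number.isFinite(total)) {
    throw new Error(`Unexpected nvidia-smi output: ${csvLine}`);
  }
  return total - used;
}

// The 8 GB threshold comes from the paragraph above; taken here as 8192 MiB.
const MIN_FREE_MIB = 8 * 1024;

function canTryConcurrencyTwo(csvLine) {
  return freeVramMiB(csvLine) >= MIN_FREE_MIB;
}

// Made-up readings for a 48 GB card.
console.log(canTryConcurrencyTwo('30120, 49140')); // true  (19020 MiB free)
console.log(canTryConcurrencyTwo('44000, 49140')); // false (5140 MiB free)
```

A single reading is only a snapshot, so sampling under typical load over a working day gives a more trustworthy answer.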

Model Swapping Costs

Ollama keeps the active model loaded in VRAM. When a request arrives for a different model, Ollama must unload the current model and load the new one. This takes 10 to 30 seconds for a 13B model on NVMe SSD, and 60 seconds or more on SATA SSD. For a shared setup, this penalty is especially disruptive because it affects everyone in the queue.

The practical solution is to standardize the team on one or two models and document the choice. If different models are needed at different times, pinning models by time-of-day (code completion model during working hours, a larger reasoning model overnight for batch tasks) avoids constant swapping.
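Time-of-day pinning could be enforced in the gateway by overwriting the requested model before the job is enqueued. The following is a sketch under assumptions: the hours, the SCHEDULE object, and the night model name are placeholders, and modelForHour and pinModel are hypothetical helpers.

```javascript
// Hypothetical schedule: hours and model names are placeholders.
const SCHEDULE = {
  workStartHour: 8,  // inclusive, 24-hour local time
  workEndHour: 19,   // exclusive
  dayModel: 'codellama:13b',
  nightModel: 'larger-reasoning-model', // placeholder, not a real model tag
};

function modelForHour(hour, schedule = SCHEDULE) {
  const working = hour >= schedule.workStartHour && hour < schedule.workEndHour;
  return working ? schedule.dayModel : schedule.nightModel;
}

// Overwrite whatever model the client asked for with the pinned one.
function pinModel(payload, now = new Date()) {
  return { ...payload, model: modelForHour(now.getHours()) };
}

console.log(modelForHour(10)); // codellama:13b
console.log(modelForHour(23)); // larger-reasoning-model
```

Pinning at the gateway means a misconfigured client cannot trigger a model swap for everyone else in the queue.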

When One GPU Isn't Enough

A single GPU is saturated when queue depth stays high. If the queue consistently holds more than five pending jobs, or if p95 latency exceeds the team's tolerance (typically 30 to 60 seconds for interactive use), it is time to scale. The options are: adding a second GPU to the same machine (as of Ollama 0.1.29+, Ollama can use multiple GPUs for larger models; verify support for your specific model with ollama show <model>), or deploying a second GPU workstation behind a load balancer. The gateway architecture described here adapts to either approach, since adding a second Ollama backend is a matter of routing logic in the queue worker.
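The scaling rule of thumb can be expressed as a small check over collected metrics. This sketch uses a nearest-rank percentile; the thresholds come from the paragraph above, while percentile, shouldScale, and the latency samples are hypothetical.

```javascript
// Nearest-rank percentile over latency samples (milliseconds).
function percentile(samples, p) {
  if (samples.length === 0) return 0;
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// Thresholds from the text; tune them to your team's tolerance.
const MAX_PENDING_JOBS = 5;
const MAX_P95_MS = 60_000;

function shouldScale({ pendingJobs, latenciesMs }) {
  return pendingJobs > MAX_PENDING_JOBS || percentile(latenciesMs, 95) > MAX_P95_MS;
}

// Made-up samples for illustration.
const latenciesMs = [4000, 6000, 5000, 9000, 70000, 8000, 7000, 5500, 6500, 7500];
console.log(percentile(latenciesMs, 95));                  // 70000
console.log(shouldScale({ pendingJobs: 2, latenciesMs })); // true
```

With only ten samples a single slow request dominates p95, so a decision like this is better made on a week of data than on one afternoon.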

Implementation Checklist

Team Local AI Gateway: Setup Checklist

  1. ☐ Provision GPU workstation (minimum: RTX 3090/4090 with 24GB VRAM, or A4000 with 16GB VRAM suitable for 7B models only)
  2. ☐ Verify NVIDIA driver ≥ 525.x and CUDA ≥ 11.8 (nvidia-smi)
  3. ☐ Install Ollama and pull target model(s)
  4. ☐ Configure Ollama to bind to LAN (OLLAMA_HOST=0.0.0.0)
  5. ☐ Verify inference via curl from another machine
  6. ☐ Install and configure Redis (≥ 6.2, AOF persistence enabled)
  7. ☐ Scaffold Node.js gateway project with dependencies
  8. ☐ Create .gitignore (include .env and node_modules/)
  9. ☐ Create .env with API keys and configuration; chmod 600 .env
  10. ☐ Implement API key authentication middleware (auth.js)
  11. ☐ Set up BullMQ fair queue with appropriate concurrency (queue.js)
  12. ☐ Create proxy endpoints (server.js)
  13. ☐ Add WebSocket broadcasting for queue state (websocket.js)
  14. ☐ Start services in order: Redis → node server.js
  15. ☐ Build React monitoring dashboard
  16. ☐ Distribute API keys and gateway URL to team
  17. ☐ Configure developer tools to point at the gateway (using the tool-specific base URL variable, not OLLAMA_HOST)
  18. ☐ Monitor queue depth and latency for first week; tune concurrency
  19. ☐ Document model standards and usage guidelines for the team

One GPU, Whole Team

What this tutorial produces is a complete shared AI inference stack: a single GPU workstation serving an entire development team through an authenticated, fairly queued API gateway with live monitoring. The cost arithmetic is illustrative but compelling. One ~$2,000 GPU can stand in for individual GPU hardware across the team. Compared to cloud API spend, which varies significantly by usage volume, model tier, and provider, teams with moderate interactive usage can come out ahead within months. (For example, assuming ~2,000 requests/day across a five-person team at typical hosted-model token pricing, monthly API costs can easily reach $500 or more, at which point a $2,000 card pays for itself in about four months, before electricity and maintenance.)
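The break-even arithmetic is simple enough to rerun with your own numbers. The helper below is a hypothetical sketch that ignores electricity, maintenance, and the cost of the host machine.

```javascript
// Months until a one-off hardware purchase undercuts ongoing API spend.
function breakEvenMonths(hardwareCost, monthlyApiSpend) {
  if (monthlyApiSpend <= 0) return Infinity;
  return Math.ceil(hardwareCost / monthlyApiSpend);
}

// Illustrative figures from the paragraph above.
console.log(breakEvenMonths(2000, 500)); // 4
```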

Ollama can be swapped for vLLM when throughput requirements increase. Routing logic in the gateway can direct requests to different models based on task type. Queue infrastructure integrates naturally with CI/CD pipelines for automated AI-assisted code review. And because the gateway mirrors standard Ollama API endpoints for generate and chat calls, adopting it requires minimal changes to existing developer workflows.

After deployment, track actual queue metrics for the first week: peak queue depth, average and p95 latency, and per-developer usage distribution. Those numbers will determine whether the single GPU is sufficient or whether the team has the kind of workload that justifies scaling to a second card.

SitePoint Team

© 2000 – 2026 SitePoint Pty. Ltd.