Mac M3 Max vs RTX 4090 Comparison
| Dimension | Apple M3 Max (128 GB) | NVIDIA RTX 4090 (24 GB) |
|---|---|---|
| Best model range | 70B–123B+ (fully memory-resident) | 7B–30B (fully VRAM-resident) |
| Memory bandwidth | 400 GB/s unified | 1,008 GB/s GDDR6X |
| Power draw (system, inference) | 30–60 W | 450–650 W |
| Fine-tuning & batch serving | Limited (MLX only) | Full CUDA ecosystem (vLLM, TensorRT-LLM, Axolotl) |
Table of Contents
- Why Local LLM Hardware Choices Matter in 2026
- Understanding the Architecture Differences
- Setting Up Your Local LLM Benchmarking Environment
- Benchmark Results: Head-to-Head Comparison
- Beyond Raw Speed: Practical Considerations
- Building a Simple React + Node.js LLM Chat Interface for Testing
- Decision Framework: Which Hardware Should You Choose?
- The Right Tool for the Right Job
Why Local LLM Hardware Choices Matter in 2026
Local LLM inference has become a mainstream development workflow in 2026. Privacy constraints, accumulating API costs, and the need for low-latency responses have pushed developers toward running models on their own hardware. For anyone evaluating a Mac M3 Max vs RTX 4090 setup for local LLM performance, the decision means choosing between two fundamentally different architectural philosophies: Apple's unified memory approach and NVIDIA's dedicated GPU compute path.
This article presents head-to-head benchmarks using current quantized models (Llama 3.1, Llama 3.3, Mistral Large, Qwen 3), tested through updated inference engines including llama.cpp, Ollama 0.6, and vLLM. It provides reproducible setup instructions, a Node.js benchmarking harness, and a practical decision framework grounded in real performance data.
The test hardware: an Apple MacBook Pro with the M3 Max chip (16-core CPU, 40-core GPU configuration) and 128GB unified memory, compared against a desktop system running an NVIDIA RTX 4090 with 24GB GDDR6X VRAM and 64GB DDR5 system RAM.
Understanding the Architecture Differences
Apple M3 Max: Unified Memory Advantage
What defines the M3 Max for LLM workloads is its unified memory architecture. With configurations up to 128GB, the CPU and GPU share the same memory pool, meaning large models can be loaded in their entirety without the kind of quantization compromises forced by limited VRAM. A 70B parameter model at Q4_K_M quantization requires roughly 40GB of model weights plus KV cache overhead, which scales with context length (2-4GB at 4K context for 70B models). This fits well within the M3 Max's ceiling but far exceeds the RTX 4090's 24GB VRAM.
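As a rough illustration of that arithmetic, the sketch below estimates weight and KV-cache size. The layer count, KV-head count, and head dimension are assumed values for a Llama-3-style 70B model with grouped-query attention, and 4.8 bits per weight is an approximation for Q4_K_M; none of these come from the benchmarks in this article. It counts only the raw cache tensors, so measured overhead (compute buffers, runtime allocations) will run higher.

```javascript
// Rough memory estimate for a quantized model. All defaults are assumptions
// for a Llama-3-style 70B model, not measured values.
function estimateMemoryGB({
  params,               // parameter count, e.g. 70e9
  bitsPerWeight = 4.8,  // assumed effective bits per weight for Q4_K_M
  layers = 80,          // assumed transformer layer count
  kvHeads = 8,          // assumed KV heads (grouped-query attention)
  headDim = 128,        // assumed per-head dimension
  contextTokens = 4096, // context length to budget for
  kvBytes = 2,          // bytes per cache entry (fp16)
}) {
  const weightsGB = (params * bitsPerWeight) / 8 / 1e9;
  // K and V each store layers * kvHeads * headDim values per token.
  const kvCacheGB =
    (2 * layers * kvHeads * headDim * kvBytes * contextTokens) / 1e9;
  return { weightsGB, kvCacheGB, totalGB: weightsGB + kvCacheGB };
}

// True when the estimate fits inside the given memory capacity.
function fitsIn(capacityGB, estimate) {
  return estimate.totalGB <= capacityGB;
}

const est = estimateMemoryGB({ params: 70e9 });
console.log(est, { rtx4090: fitsIn(24, est), m3max: fitsIn(128, est) });
```

Raising contextTokens shows how quickly the cache grows relative to the fixed weight footprint, which is worth checking before committing to long-context workloads.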
Memory bandwidth on the M3 Max reaches 400 GB/s on the 16-core CPU / 40-core GPU variant and 300 GB/s on the 14-core CPU / 30-core GPU variant. We test the 40-core GPU configuration, the only one offered with 128GB; verify yours with system_profiler SPDisplaysDataType, which reports the GPU core count. For LLM token generation, which is fundamentally a memory-bound operation, this bandwidth figure matters more than raw compute throughput. The Metal Performance Shaders framework and Apple's MLX framework have improved materially through 2025 and into 2026 (Metal inference throughput for llama.cpp roughly doubled between mid-2024 and early 2026), narrowing what was once a measurable software optimization gap relative to CUDA.
NVIDIA RTX 4090: Raw Compute Power
With 16,384 CUDA cores, fourth-generation Tensor Cores, and strong FP16/INT8 throughput, the RTX 4090 targets raw inference speed. For models that fit entirely within its 24GB of GDDR6X VRAM, the card delivers exceptional inference performance. Its memory bandwidth of 1,008 GB/s more than doubles the M3 Max's, which translates directly into faster token generation when models are fully resident in VRAM.
The limitation is capacity. At 24GB, the RTX 4090 forces aggressive quantization on models above roughly 13B parameters and requires partial CPU offloading for anything beyond 30B-40B at reasonable quantization levels. Offloading layers to system RAM introduces a severe bandwidth penalty, as PCIe 4.0 x16 has a theoretical unidirectional peak of ~32 GB/s; real-world sustained throughput for LLM weight streaming lands around 24-26 GB/s. The CUDA ecosystem remains the most mature platform for inference optimization, with first-class support across vLLM (full flash attention, paged attention, INT4 kernels), TensorRT-LLM, and the broader fine-tuning toolchain.
Why Memory Bandwidth Is the Real Battleground
During autoregressive inference, token generation is dominated by memory reads. Each token requires reading the model weights, making memory bandwidth the primary throughput constraint at batch size 1 (the typical local development scenario). As batch size increases, the workload shifts toward being compute-bound, which favors the RTX 4090's higher FLOPS. The crossover point where VRAM capacity limitations outweigh the RTX 4090's bandwidth advantage occurs around the 30B-70B parameter range, depending on quantization level. This is where the M3 Max's ability to keep entire models in fast unified memory begins to offset the raw bandwidth gap.
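A back-of-the-envelope ceiling follows from this: if every generated token reads all resident weights once, tokens per second cannot exceed memory bandwidth divided by model size. The sketch below applies that to an assumed ~5GB footprint (roughly an 8B model at Q4_K_M). The function name is illustrative, and the bound ignores KV-cache reads, compute time, and kernel overhead, so real throughput lands well below it.

```javascript
// Upper bound on batch-size-1 decode speed when generation is memory-bound:
// each token requires one full pass over the resident weights.
function bandwidthBoundTps(bandwidthGBps, residentModelGB) {
  if (residentModelGB <= 0) throw new RangeError("model size must be positive");
  return bandwidthGBps / residentModelGB;
}

const modelGB = 5; // assumed footprint of an 8B model at Q4_K_M
console.log("M3 Max ceiling:", bandwidthBoundTps(400, modelGB), "tok/s");
console.log("RTX 4090 ceiling:", bandwidthBoundTps(1008, modelGB), "tok/s");
```

The ratio between the two ceilings is simply the bandwidth ratio, which is why fully VRAM-resident models favor the RTX 4090 regardless of model size.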
Setting Up Your Local LLM Benchmarking Environment
Prerequisites and Tools
The Mac setup assumes a MacBook Pro with the M3 Max chip (16-core CPU, 40-core GPU, 400 GB/s bandwidth) and 128GB unified memory running macOS Sonoma 14.x or later. The NVIDIA setup assumes a desktop with an RTX 4090, 64GB DDR5 system RAM, running Ubuntu 22.04 or Windows 11. Both platforms require Node.js 22+, Ollama 0.6, and the latest build of llama.cpp. For vLLM benchmarks on the NVIDIA side, you need Python 3.9-3.12. Note: vLLM requires Linux with a CUDA-capable GPU and does not support macOS.
Test Environment
For reproducible results, record the following before running benchmarks:
- macOS version (sw_vers) or Ubuntu version (lsb_release -a)
- Ollama exact version (ollama --version)
- CUDA driver version (nvidia-smi header) and toolkit version (nvcc --version) on NVIDIA systems
- Node.js version (node --version)
- Whether flash attention is enabled in llama.cpp builds
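Part of that record can be captured programmatically. The sketch below is one possible approach, not part of the benchmark scripts: it gathers what Node.js reports about itself and asks a running Ollama server for its version through the /api/version endpoint. The function names and the injectable fetchImpl parameter are our own additions for illustration.

```javascript
// Collect what Node.js can report on its own; versions that need external
// tools (nvidia-smi, nvcc, sw_vers) still have to be recorded by hand.
function localEnvironment() {
  return {
    recordedAt: new Date().toISOString(),
    platform: process.platform, // e.g. "darwin", "linux", "win32"
    arch: process.arch,
    node: process.versions.node,
  };
}

// Ask a running Ollama server for its version. fetchImpl is injectable so
// the function can be exercised without a live server.
async function ollamaVersion(base = "http://localhost:11434", fetchImpl = fetch) {
  try {
    const res = await fetchImpl(`${base}/api/version`);
    if (!res.ok) return null;
    const body = await res.json();
    return body.version ?? null;
  } catch {
    return null; // server not reachable
  }
}
```

Calling ollamaVersion().then((v) => console.log({ ...localEnvironment(), ollama: v })) prints a snapshot that can be saved next to the benchmark output.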
Installing Ollama and Pulling Models on Both Platforms
# macOS installation: download the app from https://ollama.com/download
# (Homebrew alternative: brew install ollama). The install.sh script below targets Linux.
# Linux installation (for NVIDIA desktop)
curl -fsSL https://ollama.com/install.sh | sh
# Windows installation (alternative)
# Download installer from https://ollama.com/download/windows
# Verify installation
ollama --version
# Pull test models
# Small: Llama 3.1 8B at Q4_K_M quantization
ollama pull llama3.1:8b-q4_K_M
# Medium: Llama 3.3 70B at Q4_K_M quantization
ollama pull llama3.3:70b-q4_K_M
# Large: Mistral Large 123B at Q3_K_M quantization
ollama pull mistral-large:123b-q3_K_M
# Additional test model
ollama pull qwen3:8b
# Verify models are available
ollama list
# Verify GPU detection on NVIDIA systems
# Confirm nvidia-smi shows the RTX 4090 before pulling models.
# After starting a model, run `ollama ps` to confirm GPU utilization.
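Note that pulling the 70B and 123B models requires substantial disk space (around 40GB and 55GB respectively). Model tags in the Ollama library change over time, so confirm that each tag above exists on the model's library page before pulling. On the RTX 4090 system, ensure the Ollama service detects the NVIDIA GPU by checking ollama ps after starting a model.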
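The same check can be scripted against Ollama's GET /api/ps endpoint, which lists loaded models with their total size and the portion held in GPU memory (the size and size_vram fields). The helper names below are illustrative, and the sketch assumes a default local Ollama address.

```javascript
// Given one entry from GET /api/ps, report how much of the model sits in
// GPU memory.
function gpuResidency(entry) {
  const total = entry.size ?? 0;
  const onGpu = entry.size_vram ?? 0;
  const fraction = total > 0 ? onGpu / total : 0;
  return {
    name: entry.name,
    percentOnGpu: Math.round(fraction * 100),
    fullyResident: total > 0 && onGpu >= total,
  };
}

// Fetch the currently loaded models and summarize each one.
async function checkLoadedModels(base = "http://localhost:11434") {
  const res = await fetch(`${base}/api/ps`);
  if (!res.ok) throw new Error(`GET /api/ps failed: HTTP ${res.status}`);
  const { models = [] } = await res.json();
  return models.map(gpuResidency);
}
```

A percentOnGpu value below 100 on the RTX 4090 system indicates that part of the model is being served from system RAM, which is worth noting alongside any throughput figure.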
Building a Node.js Benchmarking Script
Before running any benchmarks, initialize a Node.js project and install dependencies:
# Initialize project
npm init -y
# Install express (needed for server.mjs later)
npm install express
Note: The benchmark and batch scripts use the .mjs extension, which Node.js treats as ES modules by default. If you rename files to .js, you must add "type": "module" to your package.json.
// benchmark.mjs
const OLLAMA_BASE = process.env.OLLAMA_BASE_URL ?? "http://localhost:11434";
async function benchmarkModel(modelName, prompt, iterations = 5) {
const results = [];
// Warm-up iteration: load model into memory and discard results
console.log(` Warm-up run for ${modelName}...`);
const warmupResponse = await fetch(`${OLLAMA_BASE}/api/generate`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: modelName,
prompt: "Hello",
stream: false,
options: { num_predict: 1 },
}),
signal: AbortSignal.timeout(120_000),
});
if (!warmupResponse.ok) {
throw new Error(`Warm-up failed for ${modelName}: HTTP ${warmupResponse.status}`);
}
await warmupResponse.json();
for (let i = 0; i < iterations; i++) {
const start = performance.now();
let firstTokenTime = null;
let tokenCount = 0;
let evalCount = null; // authoritative token count from Ollama
const response = await fetch(`${OLLAMA_BASE}/api/generate`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model: modelName,
prompt: prompt,
stream: true,
options: { num_predict: 256, temperature: 0.0 }, // Use temperature 0 for reproducible benchmarks
}),
signal: AbortSignal.timeout(120_000),
});
if (!response.ok) {
throw new Error(`Ollama request failed: HTTP ${response.status}`);
}
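const reader = response.body.getReader();
const decoder = new TextDecoder(); // one decoder per request; stream:true state is scoped here
let buffer = ""; // carries a partial NDJSON line over to the next chunk
try {
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const parts = buffer.split("\n");
buffer = parts.pop() ?? ""; // last element is an incomplete line (or "")
const lines = parts.filter(Boolean);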
for (const line of lines) {
let data;
try {
data = JSON.parse(line);
} catch {
// Skip partial or malformed lines
continue;
}
if (data.done) {
// Use Ollama's authoritative token count from the final object
evalCount = data.eval_count ?? tokenCount;
} else if (data.response && data.response.length > 0) {
tokenCount++;
if (!firstTokenTime) firstTokenTime = performance.now();
}
}
}
} finally {
reader.cancel();
}
const totalTime = performance.now() - start;
const ttft = firstTokenTime != null ? firstTokenTime - start : null;
const finalTokenCount = evalCount ?? tokenCount;
// TPS = tokens generated / generation time only (excludes TTFT/prompt-processing phase)
const generationTime = ttft != null ? (totalTime - ttft) / 1000 : totalTime / 1000;
const tps =
generationTime > 0 && finalTokenCount > 0
? finalTokenCount / generationTime
: 0;
results.push({
iteration: i + 1,
ttft_ms: ttft != null ? Math.round(ttft) : null,
tps: Math.round(tps * 100) / 100,
total_ms: Math.round(totalTime),
tokens: finalTokenCount,
});
}
// Filter out null TTFT values before averaging
const validTtfts = results.map((r) => r.ttft_ms).filter((v) => v != null);
const avgTtft =
validTtfts.length > 0
? Math.round(validTtfts.reduce((a, v) => a + v, 0) / validTtfts.length)
: null;
// Round only the final average, not intermediates
const avgTps =
Math.round(
(results.reduce((a, r) => a + r.tps, 0) / results.length) * 100
) / 100;
return { model: modelName, iterations, avgTtft_ms: avgTtft, avgTps: avgTps, results };
}
// Run standalone only when executed directly (not when imported).
// import.meta.filename (Node.js 20.11+) is a native path, so the comparison also holds on Windows.
if (process.argv[1] === import.meta.filename) {
const model = process.argv[2] || "llama3.1:8b-q4_K_M";
const prompt = "Explain the difference between TCP and UDP in detail.";
benchmarkModel(model, prompt).then((r) => console.log(JSON.stringify(r, null, 2)));
}
export { benchmarkModel };
This script measures three key metrics: time-to-first-token (TTFT), which captures prompt processing and initial decode latency; tokens-per-second (TPS), which measures sustained generation throughput during the generation phase only (excluding TTFT); and total generation time. It uses Ollama's authoritative eval_count from the final streamed response object for accurate token counting, rather than counting individual streamed frames. It includes a warm-up iteration to ensure the model is loaded into memory before timed runs begin. It then runs five timed iterations per model by default and averages results to reduce variance from thermal throttling or background processes. The temperature is set to 0.0 (greedy decoding) for deterministic, reproducible results.
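As a cross-check on the wall-clock figures, Ollama's final response object also reports its own timings in nanoseconds (eval_count, eval_duration, prompt_eval_count, prompt_eval_duration, load_duration). The sketch below derives throughput from those fields; the function name and the returned shape are our own, and the benchmark script above does not use it.

```javascript
// Derive throughput from the timing fields in Ollama's final response
// object. Durations are reported in nanoseconds.
function ollamaTimings(final) {
  const NS_PER_S = 1e9;
  const decodeTps =
    final.eval_duration > 0
      ? (final.eval_count / final.eval_duration) * NS_PER_S
      : 0;
  const prefillTps =
    final.prompt_eval_duration > 0
      ? (final.prompt_eval_count / final.prompt_eval_duration) * NS_PER_S
      : 0;
  return {
    decodeTps,  // generated tokens per second
    prefillTps, // prompt tokens processed per second
    loadMs: (final.load_duration ?? 0) / 1e6,
  };
}
```

Because these numbers are measured inside the server, they exclude HTTP and stream-parsing overhead, so a large gap between decodeTps and the script's TPS points at the client side rather than the hardware.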
Automating Benchmark Runs Across Models
// batch-benchmark.mjs
import { benchmarkModel } from "./benchmark.mjs";
import { writeFileSync } from "node:fs";
const MODELS = [
"llama3.1:8b-q4_K_M",
"qwen3:8b",
"llama3.3:70b-q4_K_M",
"mistral-large:123b-q3_K_M",
];
const PROMPT = "Explain the difference between TCP and UDP in detail.";
const PLATFORM = process.argv[2] || "unknown";
// Minimal CSV quoting: wrap fields containing commas, quotes, or newlines
function csvField(value) {
const str = String(value);
if (str.includes(",") || str.includes('"') || str.includes("\n")) {
return `"${str.replace(/"/g, '""')}"`;
}
return str;
}
function safeWrite(filename, content) {
try {
writeFileSync(filename, content);
console.log(`Results written to ${filename}`);
} catch (err) {
console.error(`Failed to write ${filename}: ${err.message}`);
console.log("--- RESULTS FOLLOW (stdout fallback) ---");
console.log(content);
}
}
async function runAll() {
const allResults = [];
const csvRows = [
["Model", "Platform", "Avg TTFT (ms)", "Avg TPS", "Iterations"]
.map(csvField)
.join(","),
];
for (const model of MODELS) {
console.log(`Benchmarking ${model}...`);
try {
const result = await benchmarkModel(model, PROMPT, 5);
result.platform = PLATFORM;
allResults.push(result);
csvRows.push(
[result.model, PLATFORM, result.avgTtft_ms, result.avgTps, result.iterations]
.map(csvField)
.join(",")
);
console.log(` TTFT: ${result.avgTtft_ms}ms | TPS: ${result.avgTps}`);
} catch (err) {
console.error(` Failed: ${err.message}`);
csvRows.push(
[model, PLATFORM, "ERROR", "ERROR", 0].map(csvField).join(",")
);
}
}
const timestamp = Date.now();
safeWrite(
`benchmark-results-${PLATFORM}-${timestamp}.csv`,
csvRows.join("\n")
);
safeWrite(
`benchmark-results-${PLATFORM}.json`,
JSON.stringify(allResults, null, 2)
);
}
runAll();
Run this script with node batch-benchmark.mjs m3max or node batch-benchmark.mjs rtx4090 to tag results by platform. The CSV output enables straightforward comparison in any spreadsheet tool. Models that fail to load (for instance, the 123B model on the RTX 4090 without sufficient offloading configuration) are caught and logged rather than crashing the run.
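Once both machines have produced their JSON files, the per-model ratio can be computed directly rather than in a spreadsheet. The following is a minimal sketch that assumes the array shape written by batch-benchmark.mjs (objects with model and avgTps); the function name and the sample data are hypothetical.

```javascript
// Join two result arrays (the JSON written by batch-benchmark.mjs) by model
// name and report the throughput ratio of platform A over platform B.
// Models present in only one array are skipped.
function compareResults(a, b) {
  const byModel = new Map(b.map((r) => [r.model, r]));
  return a
    .filter((r) => byModel.has(r.model))
    .map((r) => {
      const other = byModel.get(r.model);
      return {
        model: r.model,
        tpsA: r.avgTps,
        tpsB: other.avgTps,
        ratio:
          other.avgTps > 0
            ? Math.round((r.avgTps / other.avgTps) * 100) / 100
            : null,
      };
    });
}

// Illustrative input only; substitute the parsed contents of the two files.
const sampleA = [{ model: "example:8b", avgTps: 100 }];
const sampleB = [{ model: "example:8b", avgTps: 50 }];
console.table(compareResults(sampleA, sampleB));
```

Since failed models never reach the JSON output, a model missing from the comparison is itself a signal that one platform could not load it.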
Benchmark Results: Head-to-Head Comparison
Small Models (7B to 13B Parameters)
The RTX 4090 wins this category by a wide margin. When a model fits entirely within 24GB of VRAM, NVIDIA's 1,008 GB/s bandwidth advantage dominates. Llama 3.1 8B at Q4_K_M quantization runs at 95-110 tokens per second on the RTX 4090 through Ollama, compared to 45-55 tokens per second on the M3 Max. Qwen 3 8B shows the same pattern. Across the 8B models tested, the RTX 4090 delivers roughly twice the M3 Max's throughput. TTFT stays low on both platforms (under 200ms on the 4090, typically 300-500ms on the M3 Max), a difference that is barely noticeable in interactive use but adds up in batch workloads.
Medium Models (30B to 70B Parameters)
The gap narrows substantially at 70B parameters. A Llama 3.3 70B model at Q4_K_M quantization consumes roughly 40GB of weights plus KV cache overhead, forcing the RTX 4090 into partial CPU offloading. With layers split between VRAM and system RAM, the RTX 4090's effective throughput drops sharply, often falling to 8-15 tokens per second depending on the offloading ratio. The M3 Max loads the entire model into unified memory and sustains 12-18 tokens per second. At this size class, the M3 Max matches or overtakes the RTX 4090, particularly when more than 30-40% of layers must be offloaded to system RAM on the NVIDIA side.
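To see where a given model falls relative to that 30-40% threshold, a simple split estimate helps. The sketch below assumes equally sized layers and a fixed VRAM reserve for KV cache and runtime buffers, both simplifications; the 80-layer count is an assumed value for a Llama-style 70B model, and the function name is our own.

```javascript
// Estimate how many layers fit in VRAM, assuming equally sized layers and a
// fixed reserve for KV cache and runtime buffers (both simplifications).
function gpuLayerSplit({ modelGB, layers, vramGB, reserveGB = 2 }) {
  const perLayerGB = modelGB / layers;
  const usableGB = Math.max(0, vramGB - reserveGB);
  const gpuLayers = Math.min(layers, Math.floor(usableGB / perLayerGB));
  return {
    gpuLayers,
    cpuLayers: layers - gpuLayers,
    offloadedFraction: (layers - gpuLayers) / layers,
  };
}

// 80 layers is an assumed count for a Llama-style 70B model.
console.log(gpuLayerSplit({ modelGB: 40, layers: 80, vramGB: 24 }));
```

Ollama normally chooses the split automatically; if you want to experiment, the estimated gpuLayers value is a reasonable starting point for its num_gpu option, assuming your version exposes it.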
Large Models (100B+ Parameters)
Here the M3 Max stands alone. The 123B Mistral Large at Q3_K_M quantization requires roughly 50-55GB of model weights (plus KV cache overhead scaling with context length). It loads entirely into the M3 Max's 128GB unified memory and runs at 5-8 tokens per second. The RTX 4090 cannot run this model at usable speeds: with the vast majority of layers offloaded to system RAM over PCIe, throughput drops below 2 tokens per second, making interactive use impractical. If you need to experiment with 100B+ parameter models locally, the M3 Max is the only viable consumer-grade option.
Complete Benchmark Comparison Table
| Model | Quantization | Platform | TTFT (ms) | TPS (tok/s) | Memory Used | Notes |
|---|---|---|---|---|---|---|
| Llama 3.1 8B | Q4_K_M | RTX 4090 | ~120 | ~105 | ~5GB VRAM | Fully VRAM-resident |
| Llama 3.1 8B | Q4_K_M | M3 Max | ~350 | ~52 | ~5GB unified | Metal acceleration |
| Qwen 3 8B | Q4_K_M | RTX 4090 | ~140 | ~95 | ~5GB VRAM | Fully VRAM-resident |
| Qwen 3 8B | Q4_K_M | M3 Max | ~380 | ~48 | ~5GB unified | Metal acceleration |
| Llama 3.3 70B | Q4_K_M | RTX 4090 | ~2,800 | ~10 | 24GB VRAM + ~18GB RAM | Partial CPU offload |
| Llama 3.3 70B | Q4_K_M | M3 Max | ~1,800 | ~15 | ~40GB unified | Fully memory-resident |
| Mistral Large 123B | Q3_K_M | RTX 4090 | ~8,000+ | ~1.5 | 24GB VRAM + ~30GB RAM | Heavy CPU offload, impractical |
| Mistral Large 123B | Q3_K_M | M3 Max | ~4,500 | ~6 | ~53GB unified | Usable for interactive work |
We averaged these figures across multiple runs using Ollama 0.6 with default settings, except for temperature, which the benchmark script sets to 0 (greedy decoding). Variance of 5-10% is typical across runs due to thermal management and background processes.
Beyond Raw Speed: Practical Considerations
Power Consumption and Thermal Performance
The M3 Max MacBook Pro draws 30-60W at the wall during sustained LLM inference. An RTX 4090 desktop system pulls 450W or more under full system load (the RTX 4090 GPU alone has a 450W TDP; total system draw with CPU, RAM, and other components typically reaches 550-650W). For a developer running inference 8 hours daily, the annual electricity cost delta could reach $150-$300 assuming $0.12-$0.20/kWh. The Mac operates nearly silently under load, while RTX 4090 cooling solutions produce noticeable fan noise under sustained inference workloads. Thermal throttling rarely affects the RTX 4090 with adequate case airflow, but the M3 Max can experience mild throttling during extended generation of very large models.
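The cost estimate is straightforward to reproduce for your own tariff and usage. This sketch uses the midpoints of the power ranges quoted above (550 W and 45 W) as example inputs; the function and parameter names are illustrative.

```javascript
// Annual electricity cost difference between two systems running the same
// duty cycle. Result is rounded to cents.
function annualCostDelta({ wattsA, wattsB, hoursPerDay, pricePerKWh, daysPerYear = 365 }) {
  const deltaKWh = ((wattsA - wattsB) * hoursPerDay * daysPerYear) / 1000;
  return Math.round(deltaKWh * pricePerKWh * 100) / 100;
}

// Midpoints of the ranges quoted above: 550 W desktop vs 45 W laptop, 8 h/day.
for (const price of [0.12, 0.2]) {
  const usd = annualCostDelta({
    wattsA: 550,
    wattsB: 45,
    hoursPerDay: 8,
    pricePerKWh: price,
  });
  console.log(`$${price}/kWh -> $${usd} per year`);
}
```

Lighter duty cycles shrink the difference proportionally, so an occasional-use workload changes the picture considerably.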
Total Cost of Ownership (2026 Pricing)
A MacBook Pro with M3 Max and 128GB unified memory runs $4,000 to $4,500 as of early 2026 (verify current pricing at apple.com). A capable RTX 4090 desktop build (including the GPU at current 2026 pricing, a suitable CPU, 64GB DDR5, PSU, storage, and case) lands in the $2,500 to $3,000 range. The Mac includes a display, battery, and full portability. The desktop offers a modular upgrade path: swapping in a newer GPU such as the RTX 5090 is straightforward. Upgrading the M3 Max means replacing the entire machine.
Software Ecosystem and Developer Experience
Ollama 0.6 and llama.cpp have reached near-parity across macOS and Linux for the basic inference workloads tested here, though advanced features (flash attention tuning, quantization type support) still differ between platforms. Both platforms handle GGUF model loading, streaming inference, and API compatibility identically. Where the platforms diverge is in the broader ecosystem. NVIDIA retains large advantages for vLLM (optimized serving with continuous batching, Linux only), TensorRT-LLM (maximum inference optimization), and fine-tuning workflows via tools like Axolotl and Unsloth that depend on CUDA. Apple's MLX framework offers a clean Python and Swift integration for inference and lightweight training, with energy efficiency as a differentiator for developers who work primarily on macOS.
Building a Simple React + Node.js LLM Chat Interface for Testing
Backend API with Node.js and Streaming Responses
Ensure you have installed express (npm install express) in your project directory before running this server.
Warning: The CORS header below is set to Access-Control-Allow-Origin: *, which allows requests from any origin. This is acceptable for local development but should be restricted to specific origins before any network-exposed deployment.
// server.mjs
import express from "express";
const app = express();
const OLLAMA_BASE = process.env.OLLAMA_BASE_URL ?? "http://localhost:11434";
const PORT = parseInt(process.env.PORT ?? "3001", 10);
// Allowed model names: alphanumeric, hyphens, colons, dots, underscores; max 128 chars
const MODEL_PATTERN = /^[a-zA-Z0-9_\-.:]{1,128}$/;
const MAX_PROMPT_LENGTH = 32_768; // ~32K chars; adjust to your use case
app.use(express.json({ limit: "64kb" }));
app.use((req, res, next) => {
// WARNING: Restrict this origin in non-local deployments
res.setHeader("Access-Control-Allow-Origin", "*");
res.setHeader("Access-Control-Allow-Headers", "Content-Type");
res.setHeader("Access-Control-Allow-Methods", "POST, OPTIONS");
// Answer CORS preflight here rather than via app.options("*"), which Express 5
// rejects because a bare "*" is no longer a valid route path
if (req.method === "OPTIONS") return res.sendStatus(204);
next();
});
app.post("/api/chat", async (req, res) => {
const { model, prompt } = req.body ?? {};
if (typeof model !== "string" || !MODEL_PATTERN.test(model)) {
return res.status(400).json({ error: "Invalid or missing model name." });
}
if (typeof prompt !== "string" || prompt.length === 0) {
return res.status(400).json({ error: "Prompt must be a non-empty string." });
}
if (prompt.length > MAX_PROMPT_LENGTH) {
return res.status(400).json({ error: `Prompt exceeds maximum length of ${MAX_PROMPT_LENGTH} characters.` });
}
const startTime = performance.now();
res.setHeader("Content-Type", "text/event-stream");
res.setHeader("Cache-Control", "no-cache");
res.setHeader("Connection", "keep-alive");
let ollamaResponse;
try {
ollamaResponse = await fetch(`${OLLAMA_BASE}/api/generate`, {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({
model,
prompt,
stream: true,
options: { num_predict: 512 },
}),
signal: AbortSignal.timeout(120_000),
});
} catch (err) {
return res.status(502).json({
error: "Could not connect to Ollama. Is it running?",
});
}
if (!ollamaResponse.ok) {
return res.status(502).json({
error: "Ollama request failed",
status: ollamaResponse.status,
});
}
const reader = ollamaResponse.body.getReader();
const decoder = new TextDecoder();
let firstToken = false;
// Note: server-side TTFT measures Ollama processing latency only,
// excluding client-to-server HTTP transit. For end-to-end TTFT,
// use the client-side benchmark script.
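let buffer = ""; // carries a partial NDJSON line over to the next chunk
try {
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Split NDJSON chunks into individual lines before SSE framing.
// Raw multi-line chunks must NOT be embedded directly in a single data: field.
buffer += decoder.decode(value, { stream: true });
const parts = buffer.split("\n");
buffer = parts.pop() ?? ""; // last element is an incomplete line (or "")
const lines = parts.filter(Boolean);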
for (const line of lines) {
let parsed;
try {
parsed = JSON.parse(line);
} catch {
continue; // skip malformed lines
}
if (!firstToken && parsed.response) {
const ttft = Math.round(performance.now() - startTime);
res.write(
`data: ${JSON.stringify({ type: "meta", ttft_ms: ttft })}\n\n`
);
firstToken = true;
}
// Each parsed object is sent as its own well-formed SSE event
res.write(`data: ${JSON.stringify(parsed)}\n\n`);
}
}
} finally {
reader.cancel();
}
res.write(
`data: ${JSON.stringify({
type: "done",
total_ms: Math.round(performance.now() - startTime),
})}\n\n`
);
res.end();
});
app.listen(PORT, () => console.log(`LLM proxy running on :${PORT}`));
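For anything beyond localhost, the wildcard CORS header in server.mjs could be replaced with an allowlist. The following is a minimal sketch of an Express-style middleware that echoes the request origin only when it is on the list; the function name and the two origins are placeholders, not values from this project.

```javascript
// Origins allowed to call the proxy; these values are placeholders.
const ALLOWED_ORIGINS = new Set([
  "http://localhost:5173",
  "http://localhost:3000",
]);

// Express-style middleware: set CORS headers only for allowlisted origins.
function restrictedCors(req, res, next) {
  const origin = req.headers.origin;
  if (origin && ALLOWED_ORIGINS.has(origin)) {
    res.setHeader("Access-Control-Allow-Origin", origin);
    // Responses now differ per origin, so caches must key on it.
    res.setHeader("Vary", "Origin");
    res.setHeader("Access-Control-Allow-Headers", "Content-Type");
    res.setHeader("Access-Control-Allow-Methods", "POST, OPTIONS");
  }
  // Preflight requests end here; without the headers above the browser
  // blocks the follow-up request.
  if (req.method === "OPTIONS") return res.sendStatus(204);
  next();
}
```

Registering it with app.use(restrictedCors) in place of the wildcard middleware keeps the rest of the server unchanged.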
Frontend React Component with Performance Metrics Display
// ChatBenchmark.jsx
import { useState, useRef, useEffect } from "react";
export default function ChatBenchmark({ model = "llama3.1:8b-q4_K_M" }) {
const [input, setInput] = useState("");
const [output, setOutput] = useState("");
const [metrics, setMetrics] = useState({ ttft: null, tps: 0, tokens: 0 });
const [streaming, setStreaming] = useState(false);
const tokenCount = useRef(0);
const startTime = useRef(0);
const firstTokenTime = useRef(null);
// Hold reader ref so we can cancel it on unmount
const readerRef = useRef(null);
// Cancel any in-progress stream on unmount
useEffect(() => {
return () => {
if (readerRef.current) {
readerRef.current.cancel();
readerRef.current = null;
}
};
}, []);
async function handleSubmit(e) {
e.preventDefault();
setOutput("");
setMetrics({ ttft: null, tps: 0, tokens: 0 });
tokenCount.current = 0;
firstTokenTime.current = null;
startTime.current = performance.now();
setStreaming(true);
try {
const res = await fetch("http://localhost:3001/api/chat", {
method: "POST",
headers: { "Content-Type": "application/json" },
body: JSON.stringify({ model, prompt: input }),
});
if (!res.ok) {
setOutput(`Error: server returned ${res.status}`);
return; // finally will reset streaming
}
const reader = res.body.getReader();
readerRef.current = reader;
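const decoder = new TextDecoder();
let buffer = ""; // carries a partial SSE line over to the next chunk
try {
while (true) {
const { done, value } = await reader.read();
if (done) break;
buffer += decoder.decode(value, { stream: true });
const parts = buffer.split("\n");
buffer = parts.pop() ?? ""; // last element is an incomplete line (or "")
const lines = parts.filter((l) => l.startsWith("data: "));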
for (const line of lines) {
let data;
try {
data = JSON.parse(line.slice(6));
} catch {
continue; // skip malformed SSE lines
}
if (data.type === "meta") {
setMetrics((m) => ({ ...m, ttft: data.ttft_ms }));
} else if (data.type === "done") {
// streaming = false handled in finally
} else if (data.response) {
tokenCount.current++;
if (!firstTokenTime.current) {
firstTokenTime.current = performance.now();
}
// TPS = tokens / generation time (excludes TTFT), consistent with benchmark.mjs
const generationElapsed =
(performance.now() - firstTokenTime.current) / 1000;
const tps =
generationElapsed > 0
? Math.round((tokenCount.current / generationElapsed) * 10) / 10
: 0;
setOutput((prev) => prev + data.response);
setMetrics((m) => ({
...m,
tokens: tokenCount.current,
tps,
}));
}
}
}
} finally {
reader.cancel();
readerRef.current = null;
}
} catch (err) {
setOutput("Error: " + err.message);
} finally {
// Always reset streaming state, even on error or unmount
setStreaming(false);
}
}
return (
<div style={{ fontFamily: "monospace", maxWidth: 700, margin: "2rem auto" }}>
<h3>LLM Chat — {model}</h3>
<div
style={{
background: "#f0f0f0",
padding: "0.5rem",
marginBottom: "1rem",
borderRadius: 4,
}}
>
TTFT: {metrics.ttft != null ? `${metrics.ttft}ms` : "—"} | TPS:{" "}
{metrics.tps} | Tokens: {metrics.tokens}
{streaming && " ● Streaming"}
</div>
<form onSubmit={handleSubmit}>
<textarea
value={input}
onChange={(e) => setInput(e.target.value)}
rows={3}
style={{ width: "100%" }}
/>
<button type="submit" disabled={streaming || input.trim().length === 0}>
Send
</button>
</form>
<pre
style={{
whiteSpace: "pre-wrap",
marginTop: "1rem",
background: "#1e1e1e",
color: "#d4d4d4",
padding: "1rem",
borderRadius: 4,
}}
>
{output}
</pre>
</div>
);
}
This interface gives developers a subjective feel for response quality and latency alongside the raw numbers. Watching tokens stream in real time while TTFT and TPS update live provides an intuition that CSV files alone cannot capture.
Decision Framework: Which Hardware Should You Choose?
Choose the M3 Max If...
Portability is a requirement, and you cannot be tethered to a desktop. You need to run 70B+ parameter models without partial offloading or aggressive quantization, because your work depends on output quality at that scale. Power efficiency and near-silent operation matter for home office or travel use. Your primary use case is single-user, interactive inference and experimentation with large models.
Choose the RTX 4090 If...
You need maximum speed on 7B to 30B models, and those model sizes cover your workload. Batch inference or serving concurrent requests is part of the plan, since CUDA's ecosystem handles multi-request serving far better today. Local fine-tuning is on the roadmap, because the CUDA ecosystem is effectively required for that. Getting the most performance per dollar spent matters more than portability.
Implementation Checklist
When purchasing hardware:
- Confirm the model sizes you plan to run regularly (parameter count and quantization level)
- Calculate total memory requirements (model weights + KV cache + overhead)
- Verify power supply requirements (850W+ PSU for RTX 4090 builds)
- Assess portability needs honestly
- Budget for total system cost, not just GPU or laptop price
- Consider upgrade timeline (GPU swap vs full laptop replacement)
- Check current availability and pricing for both platforms
- Factor in display and peripherals for desktop builds
To set up your software environment, follow these steps in order:
- Install Ollama 0.6 from official sources
- Verify GPU detection (Metal on macOS, CUDA on NVIDIA -- confirm with ollama ps after loading a model)
- Install Node.js 22+ via nvm or official installer
- Initialize a Node.js project (npm init -y && npm install express)
- Pull target models and verify loading
- Install Python 3.9-3.12 and vLLM for NVIDIA batch inference testing (Linux only -- vLLM does not support macOS)
- Clone and build llama.cpp from source for maximum performance
- Configure Ollama environment variables (OLLAMA_NUM_PARALLEL, OLLAMA_MAX_LOADED_MODELS)
- Set OLLAMA_KEEP_ALIVE=1h (or longer for large models) to keep models in memory between requests; the default is 5 minutes
- Set up monitoring tools (nvidia-smi for NVIDIA, Activity Monitor for macOS)
- Run initial smoke tests with small models
- Validate streaming API responses end to end
For model selection by use case, consider:
- Code completion and chat: 7B-8B models (Llama 3.1 8B, Qwen 3 8B)
- Complex reasoning and analysis: 70B models (Llama 3.3 70B Q4_K_M)
- Near-frontier quality: 100B+ models (Mistral Large 123B, M3 Max only)
- RAG applications: 7B-13B models for speed, 70B for quality
- Creative writing: 30B-70B models for coherence
- Multilingual tasks: Qwen 3 models for broad language coverage
To validate performance, run the benchmark script with standardized prompts (minimum 5 iterations, after warm-up). Compare TTFT, TPS, and total time against the reference table above. Test with realistic prompt lengths matching actual workloads. Monitor memory usage during inference to confirm no unexpected offloading. Target variance under 10% across runs.
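One way to put a number on that target is the coefficient of variation (sample standard deviation divided by the mean) of per-iteration TPS. Treating "variance" this way is our interpretation, and the helper names are illustrative; the input is the results array returned by benchmarkModel().

```javascript
// Coefficient of variation (sample standard deviation / mean) as a percentage.
function coefficientOfVariation(values) {
  if (values.length < 2) throw new RangeError("need at least two samples");
  const mean = values.reduce((a, v) => a + v, 0) / values.length;
  if (mean === 0) return 0;
  const variance =
    values.reduce((a, v) => a + (v - mean) ** 2, 0) / (values.length - 1);
  return (Math.sqrt(variance) / mean) * 100;
}

// Takes the per-iteration results array returned by benchmarkModel().
function isStable(results, maxPercent = 10) {
  return coefficientOfVariation(results.map((r) => r.tps)) <= maxPercent;
}
```

If a run fails the check, repeating it after the machine has cooled down, or with more iterations, usually shows whether throttling or background load was responsible.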
Platform-specific optimizations:
- M3 Max: Keep macOS updated for latest Metal optimizations; close memory-heavy applications during inference; use MLX for Python-native workflows
- M3 Max: Verify your M3 Max GPU core count (30-core: 300 GB/s, 40-core: 400 GB/s) -- results will differ by ~25%
- RTX 4090: Use CUDA_VISIBLE_DEVICES to pin GPU; enable flash attention in llama.cpp builds (cmake -DLLAMA_FLASH_ATTN=ON or --flash-attn runtime flag)
- RTX 4090: Consider TensorRT-LLM for production-grade serving optimization
- Both: Use Q4_K_M as the default quantization (best quality-to-size ratio)
- Both: Monitor thermal performance during extended runs to catch throttling
The Right Tool for the Right Job
Neither platform holds a universal advantage. The RTX 4090 dominates throughput on models that fit within 24GB of VRAM. The M3 Max wins on large model accessibility, power efficiency, and portability. Software optimizations through 2025 and 2026, particularly in Ollama, llama.cpp, and Metal shader compilation, have made both platforms dramatically more capable than they were in 2024.
Run the benchmarking scripts provided here against your own target models and workloads. Synthetic benchmarks establish a baseline, but real-world prompt patterns, context lengths, and concurrency requirements will change which platform is faster for your workload. With newer generations such as the RTX 5090 already on the market and Apple's lineup continuing to change, size today's investment to current needs rather than speculative future requirements -- availability and pricing may shift by the time you read this.

