Running local LLMs on Apple Silicon has shifted from experimental curiosity to a genuinely viable workflow for developers and AI practitioners. The unified memory architecture of M1, M2, and M3 chips removes the single biggest constraint hobbling consumer PCs for large model inference: discrete GPU VRAM limits. Yet most users still run models with default settings, leaving substantial performance untapped. This guide covers concrete optimization techniques, from Metal GPU acceleration and quantization strategy to Neural Engine trade-offs and memory management, with benchmarks and working configurations throughout.

Prerequisites

macOS 13 Ventura or later is recommended for full Metal acceleration (macOS 11+ minimum for Ollama installation). You will need an Apple Silicon Mac (M1 or later) and Xcode Command Line Tools (xcode-select --install).

  • CMake 3.14+ (for the llama.cpp path): brew install cmake — verify with cmake --version.
  • Python 3.9–3.12 (for the MLX path): verify with python3 --version. A virtual environment is recommended: python3 -m venv mlx-env && source mlx-env/bin/activate.
  • Disk space: 7B Q4 models require ~4GB; 70B Q4 models require ~40GB. Internet access is required for model downloads.

Why Apple Silicon Suits Local LLM Inference

Unified Memory Architecture Advantage

On conventional desktop and laptop hardware, the GPU's dedicated VRAM sets a hard ceiling on model size. A 12GB RTX 4070 cannot hold a 70B-parameter model at any practical quantization level without offloading layers to system RAM over PCIe, which tanks throughput. Apple Silicon's unified memory architecture (UMA) removes this bottleneck. The CPU, GPU, and Neural Engine all share a single high-bandwidth memory pool, so the GPU reads model weights without PCIe transfer overhead. The M2 Ultra offers up to 192GB of unified memory and the M3 Ultra up to 512GB, enough to hold a full 70B model at higher-precision quantization levels, or even larger models (such as Mixtral-based 120B+ variants) under aggressive quantization, all accessible by the GPU at full memory bandwidth.

Metal vs. CUDA for LLM Inference

CUDA has long dominated GPU-accelerated inference, but Metal acceleration has matured significantly across every major inference framework. Ollama, llama.cpp, and Apple's own MLX framework all treat Metal as a first-class backend. The llama.cpp project enabled Metal acceleration via the GGML Metal backend, and Ollama uses this under the hood. MLX, developed by Apple's machine learning research team, was built from the ground up for Apple Silicon and exploits UMA directly. In 2025, Apple Silicon is no longer a second-tier platform for local AI workloads; it is a primary target for framework developers.

Hardware Tier Breakdown: M1 vs. M2 vs. M3 for LLMs

Performance Benchmarks by Chip and Model Size

The following table synthesizes community benchmarks from Ollama and llama.cpp users (circa late 2024 – early 2025; no standardized methodology), representing typical inference throughput for quantized models. Actual results vary with quantization format, context length, model family, and background system load. The "Max Unified Memory" column reflects the highest-memory SKU for each chip; lower-memory configurations exist for most tiers.

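| Chip | Max Unified Memory | GPU Cores | ~tok/s (7B Q4) | ~tok/s (13B Q4) | Max Practical Model (Quantized) |
|---|---|---|---|---|---|
| M1 (8GB) | 8GB | 7–8 | 10–15 | 5–8 | 7B Q4 only |
| M1 (16GB) | 16GB | 7–8 | 15–20 | 8–12 | 7B Q4 |
| M1 Pro | 32GB | 14–16 | 22–28 | 12–16 | 13B Q4 |
| M1 Max | 64GB | 24–32 | 30–35 | 18–22 | 30B Q4 |
| M1 Ultra | 128GB | 48–64 | 40–50 | 25–30 | 65B Q4 |
| M2 | 24GB | 8–10 | 18–24 | 10–14 | 7B–13B Q4 |
| M2 Pro | 32GB | 16–19 | 26–32 | 14–18 | 13B Q4 |
| M2 Max | 96GB | 30–38 | 35–42 | 22–28 | 30B–65B Q4 |
| M2 Ultra | 192GB | 60 or 76 (SKU-dependent) | 50–60 | 30–38 | 70B+ Q4/Q5 |
| M3 | 24GB | 8–10 | 20–26 | 12–16 | 7B–13B Q4 |
| M3 Pro | 36GB | 14–18 | 28–35 | 16–22 | 13B–30B Q4 |
| M3 Max | 128GB | 30–40 | 40–50 | 26–34 | 30B–65B Q4 |
| M3 Ultra | 512GB | 60–80 | 55–68 | 35–45 | 70B+ Q5/Q6 |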

GPU core count is fixed per SKU; verify yours via Apple Menu > About This Mac or system_profiler SPDisplaysDataType.

Which Models Fit Your Mac?

The practical rule: match RAM tier to model parameter count, leaving overhead for the operating system and the context window. An 8GB machine caps out at a 7B model in Q4 quantization. With 32GB, 30B Q4 models run comfortably. At 64GB and above, 70B Q4 or Q5 models become feasible. The Neural Engine can accelerate certain operations in CoreML-converted models, but for standard transformer inference the GPU, driven through Metal, handles the heavy lifting. The Neural Engine's primary benefit emerges in smaller, production-deployed models converted through Apple's CoreML pipeline (using coremltools) rather than in general-purpose LLM inference. Size thresholds and supported quantization formats depend on the CoreML Tools version used; consult Apple's CoreML documentation for current limits.
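
The sizing rule above comes down to simple arithmetic: parameter count times bits per weight. The following is a minimal sketch of that estimate, not a definitive calculator. The bits-per-weight figures are assumed round approximations for common GGUF formats (real files vary by architecture), and the function name `weight_footprint_gb` is hypothetical.

```python
# Approximate bits per weight for common GGUF formats.
# ASSUMPTION: rounded, illustrative figures; actual files differ by model family.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
}


def weight_footprint_gb(params_billion: float, quant: str) -> float:
    """Estimate the size of the quantized weights in GB (10^9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    total_bytes = params_billion * 1e9 * bits / 8
    return total_bytes / 1e9


# Weights only: the context window (KV cache) and macOS need memory on top.
for params in (7, 13, 30, 70):
    print(f"{params}B Q4_K_M ~ {weight_footprint_gb(params, 'Q4_K_M'):.1f} GB")
```

Under these assumptions a 7B Q4_K_M model lands near 4GB and a 70B one slightly above 40GB, which is in line with the disk-space figures in the prerequisites.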

Setting Up Your Local LLM Environment

Installing Ollama with Metal Acceleration

Ollama is the fastest path from zero to running inference. It automatically detects and uses the Metal backend on Apple Silicon.

# Download installer separately; verify SHA before executing
curl -fsSL https://ollama.com/install.sh -o /tmp/ollama_install.sh

# Obtain expected SHA from https://ollama.com/download (published alongside installer)
EXPECTED_SHA256="<paste-published-sha256-here>"
ACTUAL_SHA256=$(shasum -a 256 /tmp/ollama_install.sh | awk '{print $1}')

if [ "$ACTUAL_SHA256" != "$EXPECTED_SHA256" ]; then
  echo "ERROR: SHA256 mismatch. Aborting installation." >&2
  echo "Expected: $EXPECTED_SHA256" >&2
  echo "Got:      $ACTUAL_SHA256" >&2
  rm /tmp/ollama_install.sh
  exit 1
fi

sh /tmp/ollama_install.sh
rm /tmp/ollama_install.sh

Security note: for production or enterprise use, download the signed GUI installer from https://ollama.com/download and verify the package signature, rather than executing a downloaded shell script.

# Verify installation and version
ollama --version

# Verify Ollama is running (the macOS installer starts it automatically; 10s timeout):
curl -s --max-time 10 http://localhost:11434/api/tags | python3 -m json.tool \
  || echo "ERROR: Ollama not responding on port 11434 within 10s" >&2
# If not running, start with: ollama serve
# (Run in a separate terminal — do not background with & if unsure whether the daemon is already active,
#  as a second instance will fail with "address already in use" on port 11434.)

# Pull a quantized model (tag availability may change over time; check the Ollama model library)
ollama pull llama3.1:8b-instruct-q4_K_M

# Run a quick test
ollama run llama3.1:8b-instruct-q4_K_M "Explain unified memory in two sentences."
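
The same health check can be scripted from Python using only the standard library. This is a sketch under the assumption that the server listens on the default port used in the curl check above; the helper names `installed_models` and `fetch_tags` are hypothetical, and the sample payload is hand-written to show the response shape, not captured from a real server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address, as in the curl check


def installed_models(tags_payload: dict) -> list:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 10.0) -> dict:
    """GET /api/tags from a running Ollama server (raises if it is unreachable)."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return json.load(resp)


# Hand-written payload illustrating the shape only (not real server output):
sample = {"models": [{"name": "llama3.1:8b-instruct-q4_K_M"}]}
print("llama3.1:8b-instruct-q4_K_M" in installed_models(sample))
```

Against a live server, `installed_models(fetch_tags())` would return the tags that are already pulled.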

Alternative Runtimes: llama.cpp and MLX

Ollama wraps llama.cpp and provides convenience, but building llama.cpp directly grants granular control over GPU layer offloading, batch sizes, and quantization parameters. Apple's MLX framework targets researchers and developers who want native Apple Silicon performance with a NumPy-like API. Choose Ollama for quick setup, llama.cpp for fine-grained tuning, and MLX for Python-native research workflows.

This guide requires llama.cpp built from a commit after July 2024. Confirm the binary name with ls build/bin/; older builds use main instead of llama-cli.

# Clone and build llama.cpp with Metal support
git clone --depth 1 https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

# Record commit for reproducibility
git rev-parse HEAD

cmake -B build -DGGML_METAL=ON

NCPU=$(sysctl -n hw.ncpu 2>/dev/null)
if [ -z "$NCPU" ] || ! [[ "$NCPU" =~ ^[0-9]+$ ]]; then
  echo "WARNING: Could not determine CPU count; defaulting to -j4" >&2
  NCPU=4
fi
cmake --build build --config Release -j"$NCPU"

# Verify the binary name:
ls build/bin/
# Use llama-cli on recent builds, or 'main' on pre-July 2024 builds.

# Run inference with all layers offloaded to GPU
./build/bin/llama-cli \
  -m models/llama-3.1-8b-instruct-q4_K_M.gguf \
  -ngl 99 \
  -p "Describe the advantage of unified memory for LLM inference." \
  -n 128

The -ngl 99 flag instructs llama.cpp to offload as many layers as possible to the Metal GPU. Setting this value higher than the actual layer count is safe; the runtime clamps to the model's maximum.

Neural Engine vs. GPU vs. CPU: Understanding Execution Paths

How Inference Workloads Map to Apple Silicon Components

Metal's GPU backend handles the bulk of transformer inference: large matrix multiplications in attention and feed-forward layers. These operations consume the overwhelming majority of compute time. Apple optimized the Neural Engine for CoreML-converted models, and it delivers high throughput for fixed-graph, smaller models, but its inflexibility with dynamic shapes and large parameter counts limits its usefulness for general LLM inference. The CPU serves as a fallback for unsupported operations. Efficiency cores handle background tasks; performance cores pick up any CPU-bound inference operations, though in a well-configured setup, CPU fallback should be minimal.

When the Neural Engine Actually Helps

Converting a model to CoreML format via Apple's coremltools unlocks the Neural Engine, but the pipeline imposes restrictions: model size limits (which vary by CoreML Tools version and target hardware), limited quantization format support, and conversion complexity. For mainstream local LLM use in 2025 and 2026, GPU via Metal remains the primary and recommended execution path. The Neural Engine becomes relevant for smaller, production-deployed models in app-embedded contexts rather than for interactive 7B+ model inference.

Memory Optimization and Configuration

Controlling GPU Memory Allocation

# Set environment variables before launching Ollama
export OLLAMA_MAX_LOADED_MODELS=1
export OLLAMA_NUM_PARALLEL=2

# Ensure base model is present before creating derived model
BASE_MODEL="llama3.1:8b-instruct-q4_K_M"

# -F matches the tag literally (the dots in the tag are not treated as regex wildcards)
if ! ollama list 2>/dev/null | grep -qF "$BASE_MODEL"; then
  echo "Base model '$BASE_MODEL' not found locally. Pulling now..." >&2
  ollama pull "$BASE_MODEL" || { echo "ERROR: Failed to pull base model." >&2; exit 1; }
fi

# In a custom Modelfile, control GPU layer offloading
cat <<EOF > Modelfile
FROM ${BASE_MODEL}
PARAMETER num_gpu 99
PARAMETER num_ctx 4096
EOF

ollama create my-optimized-llama -f Modelfile
ollama run my-optimized-llama

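Setting OLLAMA_MAX_LOADED_MODELS=1 prevents multiple models from competing for memory. OLLAMA_NUM_PARALLEL=2 lets each loaded model serve two requests at once, which increases the memory reserved for context; set it to 1 on memory-constrained machines. These exported variables only affect a server started from the same shell with ollama serve; when Ollama runs as the macOS app, set them with launchctl setenv and restart the app. The num_gpu 99 parameter in the Modelfile requests full GPU offloading.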

Quantization Strategy: Choosing the Right Format

The trade-off across quantization levels is direct: lower bit-width means less memory and faster throughput but reduced output quality. Q4_K_M offers a strong balance for most users, though results vary by model family and task. Q5_K_M improves output quality (model-dependent; typically 0.1 to 0.3 perplexity improvement over Q4_K_M on standard benchmarks) at moderate memory cost. Q6_K and Q8_0 cut quantization error and nearly match full-precision output for most tasks, though no format guarantees parity across all model families.

#!/bin/bash
set -uo pipefail

# Benchmark multiple quantization levels of the same model
MODELS=("llama3.1:8b-instruct-q4_K_M" "llama3.1:8b-instruct-q5_K_M" "llama3.1:8b-instruct-q8_0")
PROMPT="Write a concise explanation of backpropagation."

for MODEL in "${MODELS[@]}"; do
  echo "=== Benchmarking: $MODEL ==="

  # Capture the exit code directly; inside "if ! cmd", $? would already be 0
  ollama pull "$MODEL" 2>&1
  PULL_EXIT=$?
  if [ "$PULL_EXIT" -ne 0 ]; then
    echo "Pull failed for $MODEL (exit $PULL_EXIT)" >&2
    continue
  fi

  # Note: output field names (e.g., "eval rate", "total duration") may vary by Ollama version.
  # If grep returns nothing, inspect the full --verbose output.
  # Capture run output; with pipefail set, $? reflects a failure anywhere in the pipeline
  OUTPUT=$(echo "$PROMPT" | ollama run "$MODEL" --verbose 2>&1)
  RUN_EXIT=$?

  if [ "$RUN_EXIT" -ne 0 ]; then
    echo "ERROR: ollama run failed for $MODEL (exit $RUN_EXIT)" >&2
    continue
  fi

  echo "$OUTPUT" | grep -E "eval rate|total duration"

  echo "--- Memory pressure (macOS only) ---"
  memory_pressure | grep -E "System-wide memory free percentage|pressure level"
  echo ""
done

Note: The memory_pressure command is macOS-only and will not work on Linux.
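
As an alternative to grepping the --verbose output, throughput can be computed from the JSON that Ollama's /api/generate endpoint returns, which reports eval_count (generated tokens) and eval_duration (nanoseconds). The following is a sketch assuming a local server on the default port; the function names are hypothetical, and the sample numbers are hand-written to show the arithmetic, not a measurement.

```python
import json
import urllib.request


def tokens_per_second(response: dict) -> float:
    """Generation speed from an /api/generate response.

    eval_count is the number of generated tokens; eval_duration is in nanoseconds.
    """
    return response["eval_count"] / response["eval_duration"] * 1e9


def run_once(model: str, prompt: str,
             base_url: str = "http://localhost:11434") -> dict:
    """Send one non-streaming generation request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)


# Hand-written numbers to show the arithmetic (not a measurement):
fake = {"eval_count": 120, "eval_duration": 4_000_000_000}
print(f"{tokens_per_second(fake):.1f} tok/s")
```

With a server running, `tokens_per_second(run_once("llama3.1:8b-instruct-q4_K_M", "Write a concise explanation of backpropagation."))` would give a figure comparable to the "eval rate" line in the shell benchmark.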

Swap and Memory Pressure Management

When a model plus its context window exceeds available unified memory, macOS swaps to disk. Performance degrades catastrophically, with throughput dropping by 10x or more in community reports (observed on M2 Max with NVMe SSD; severity varies by SSD speed and swap volume). Monitor with memory_pressure or Activity Monitor's Memory tab. The rule of thumb: keep the model and its context window within 80% of total unified memory to leave headroom for macOS and background processes.

| RAM Tier | Max Model Size | Recommended Quantization | Expected Context Window |
|---|---|---|---|
| 8GB | 7B | Q4_K_M | 2048 |
| 16GB | 7B–13B | Q4_K_M / Q5_K_M | 4096 |
| 32GB | 30B | Q4_K_M | 4096–8192 |
| 64GB | 65B–70B | Q4_K_M / Q5_K_M | 8192 |
| 96GB+ | 70B+ | Q5_K_M / Q6_K | 8192–16384 |
| 192GB | 70B+ or 120B+ | Q6_K / Q8_0 | 16384+ |
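
The 80% rule of thumb can be written as a one-line check. This is a minimal sketch; the function name `fits_in_memory` is hypothetical, and the weight and cache sizes in the examples are illustrative assumptions rather than measured values.

```python
def fits_in_memory(total_gb: float, weights_gb: float, kv_cache_gb: float,
                   budget_fraction: float = 0.8) -> bool:
    """True if weights plus KV cache stay within the budgeted share of unified memory."""
    return weights_gb + kv_cache_gb <= total_gb * budget_fraction


# ASSUMED sizes: 7B Q4 (~4 GB) plus ~1 GB of cache on an 8 GB Mac: 5 <= 6.4
print(fits_in_memory(8, 4.0, 1.0))
# ASSUMED sizes: 70B Q4 (~40 GB) plus ~2 GB of cache on a 32 GB Mac: 42 > 25.6
print(fits_in_memory(32, 40.0, 2.0))
```

A configuration that fails this check is a candidate for a smaller quantization, a shorter context window, or a smaller model before swap becomes the bottleneck.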

Advanced Optimization Techniques

Tuning Context Length and Batch Size

Larger context lengths grow the KV cache linearly with sequence length, which pressures unified memory, and grow attention compute quadratically, which slows prompt processing. Batch size controls how many tokens the runtime processes at once during prompt evaluation.

# Custom Ollama Modelfile with tuned context and batch
cat <<EOF > Modelfile-tuned
FROM llama3.1:8b-instruct-q4_K_M
PARAMETER num_ctx 8192
PARAMETER num_batch 512
PARAMETER num_gpu 99
EOF

ollama create tuned-llama -f Modelfile-tuned
ollama run tuned-llama

# Equivalent llama.cpp flags
./build/bin/llama-cli \
  -m models/llama-3.1-8b-instruct-q4_K_M.gguf \
  -ngl 99 \
  --ctx-size 8192 \
  --batch-size 512 \
  -p "Summarize the key trade-offs of quantized LLM inference." \
  -n 256

Increasing num_batch above the default (typically 512 in current Ollama/llama.cpp versions; verify with ollama show --modelfile <model>) improves prompt evaluation speed but increases memory usage during that phase. Reducing context length from 8192 to 4096 halves the KV cache; the memory freed scales with the model's layer count and the size of its key/value projections.
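
To make the context-length trade-off concrete, the KV cache size can be estimated from the model's shape. This is a sketch assuming a 16-bit cache (2 bytes per element) and the published Llama 3.1 8B configuration (32 layers, 8 key/value heads, head dimension 128); the function name `kv_cache_bytes` is hypothetical, and runtimes may allocate somewhat more in practice.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_ctx: int, bytes_per_element: int = 2) -> int:
    """Keys and values (factor of 2) for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_element


# Llama 3.1 8B shape: 32 layers, 8 KV heads, head dimension 128.
# ASSUMPTION: 16-bit cache entries (bytes_per_element=2).
for n_ctx in (4096, 8192):
    size = kv_cache_bytes(32, 8, 128, n_ctx)
    print(f"n_ctx={n_ctx}: {size / 2**30:.2f} GiB")
```

Under these assumptions, halving the context from 8192 to 4096 frees about half a GiB for an 8B model; the saving is proportionally larger for models with more layers or wider key/value projections.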

Metal Shader Compilation Caching

The first inference run on a new model triggers Metal shader compilation, which can add several seconds of latency. Subsequent runs benefit from macOS's persistent shader cache. There is no manual pre-warming step required; simply running a short prompt after model load populates the cache. This penalty is per-model and persists across reboots unless the system's shader cache is cleared. (Behavior may vary; Apple does not publicly document Metal shader cache invalidation triggers.)

Using MLX for Peak Apple Silicon Throughput

# Install the MLX language model package (pin to a tested version)
pip install mlx-lm==0.19.2
# For hash-verified installs, pip expects hashes in a requirements file, and
# --require-hashes then applies to every dependency as well:
#   pip install --require-hashes -r requirements.txt
# Check https://github.com/ml-explore/mlx-examples for the current release

# Python: run inference with a Hugging Face model
# Tested with mlx-lm 0.19.2; the generate() API may change across versions
from mlx_lm import load, generate

try:
    model, tokenizer = load('mlx-community/Meta-Llama-3.1-8B-Instruct-4bit')
    response = generate(
        model,
        tokenizer,
        prompt='Explain the benefit of unified memory for LLM inference.',
        max_tokens=128,
        verbose=True,
    )
except Exception as e:
    raise RuntimeError(f'MLX inference failed: {e}') from e

if not isinstance(response, str):
    raise TypeError(
        f'Unexpected generate() return type: {type(response)}. '
        'Check mlx-lm version compatibility.'
    )

print(response)

MLX reports tokens per second directly in verbose mode. Because MLX operates natively on Apple Silicon's unified memory without translation layers, it achieves competitive or superior throughput to llama.cpp for supported models in tested configurations (e.g., Llama 3.1 8B 4-bit; no universal cross-framework benchmark exists as of early 2025), particularly in research and experimentation contexts.

Keeping macOS Lean During Inference

Disable Spotlight indexing on directories containing model files by adding them to the Privacy exclusion list. On macOS 13+: System Settings > Siri & Spotlight > Privacy. On macOS 12: System Preferences > Spotlight > Privacy. Close browsers and Electron-based applications, which can consume several gigabytes of memory. In Energy settings, disable low-power mode and prevent sleep during inference runs to maintain consistent GPU clock speeds.

Practical Performance Expectations

Real-World Throughput Numbers

Apple Silicon excels at running memory-bound large models that simply will not fit in a consumer discrete GPU's VRAM. A 70B Q4 model running at 8 to 12 tokens per second on an M2 Ultra (community-reported; see llama.cpp GitHub discussions, circa Q1 2025; results vary by model family and system load) has no equivalent on a 24GB RTX 4090 without severe layer offloading penalties. However, for models that fit entirely in VRAM (e.g., Llama 3.1 8B Q4_K_M), an RTX 4090 (24GB) typically delivers roughly 2x to 3x the raw token generation throughput of an M3 Ultra (community-reported comparisons; no standardized cross-platform benchmark exists as of early 2025). The RTX 5090 (32GB) extends this VRAM ceiling, but comparative benchmarks are limited as of early 2025. Apple Silicon's advantage is capacity; NVIDIA's advantage is raw compute density.
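
One way to sanity-check these figures: when token generation is memory-bound, each new token requires reading roughly all of the weights once, so memory bandwidth divided by model size gives a rough ceiling. The sketch below assumes a dense model and ignores KV cache reads and compute; the function name is hypothetical, the 800 GB/s figure is Apple's stated M2 Ultra memory bandwidth, and the 40GB model size follows the 70B Q4 estimate used earlier.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Rough upper bound on tokens/s if every token reads all weights once."""
    return bandwidth_gb_s / weights_gb


# M2 Ultra (800 GB/s) with a ~40 GB 70B Q4 model.
# ASSUMPTION: dense model, no KV cache traffic, no compute limits.
print(f"{decode_ceiling_tok_s(800, 40):.0f} tok/s ceiling")
```

The community-reported 8 to 12 tokens per second sits below this ceiling, as expected once real-world overheads are included.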

Key Recommendations

Coding assistants and RAG pipelines that need sub-second first-token latency and 20+ tok/s generation pair well with an M3 Pro or Max (36 to 128GB), running 8B to 30B models via Ollama. Research workflows requiring larger models benefit from MLX on M2 Ultra (up to 192GB) or M3 Ultra (up to 512GB) hardware, which unlocks 70B+ models that do not fit in any consumer GPU's VRAM. Apple Silicon's UMA remains its defining advantage: the ability to run models that exceed any consumer GPU's VRAM, at interactive speeds, on a laptop or desktop. If Apple keeps raising maximum unified memory and memory bandwidth with each chip generation, this capacity lead should persist as model sizes increase.

SitePoint Team
