How to Run Llama 4 Scout on Apple Silicon via Ollama MLX

  1. Verify your Mac has Apple Silicon (M1+) with at least 16 GB unified memory and macOS 13+.
  2. Install Ollama via Homebrew and confirm the MLX backend is active in server logs.
  3. Select a quantization level (Q4–Q8) matched to your Mac's memory tier.
  4. Pull the appropriate Llama 4 Scout model tag from the Ollama registry.
  5. Configure environment variables for memory management and context window size.
  6. Run interactive inference with --verbose to validate token throughput.
  7. Monitor memory pressure during generation to ensure swap stays minimal.
  8. Integrate with Python applications using Ollama's OpenAI-compatible REST API.

Scout's MoE sparsity and Apple's unified memory solve each other's bottlenecks. Running Llama 4 Scout locally on Apple Silicon through Ollama eliminates cloud inference costs while delivering performance that consumer hardware couldn't reach before 2025.

A note on URLs and tags: Model tags, download slugs, and version-specific features change frequently. All URLs in this guide were accurate at publication. Verify model tags at ollama.com/library and release features at github.com/ollama/ollama/releases before running commands.

Why Llama 4 Scout Belongs on Your Mac

Scout uses a mixture-of-experts (MoE) architecture with 16 experts, activating only 17 billion parameters per forward pass out of 109 billion total, and supports a context window of up to 10 million tokens (per Meta's model card; practical limits on consumer hardware are far lower, as covered in the Context Window Tuning section). Apple's unified memory architecture means the GPU, CPU, and Neural Engine all access the same memory pool without the copy overhead that bottlenecks discrete GPU setups.

This guide covers end-to-end setup, quantization selection matched to specific Mac hardware tiers, performance tuning, real-time monitoring, and Python integration through Ollama's API. It targets intermediate developers comfortable with the terminal and familiar with core LLM concepts.

Prerequisites and Hardware Requirements

Supported Apple Silicon Tiers

The critical constraint for running any large language model locally is memory. Scout's MoE architecture means the full 109B parameter set must reside in memory even though only 17B parameters activate per inference step. Some implementations may support partial weight offloading, but the baseline assumption is that all weights are memory-resident. Quantization reduces this footprint dramatically, but the baseline requirement remains significant.
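The footprint of the weights alone can be approximated from the parameter count and the effective bits per weight. The sketch below is a rough estimate, not a measurement: the bits-per-weight figures are approximate, commonly quoted values for GGUF quantization types and are assumptions here, and the result excludes the KV cache and runtime overhead.

```python
# Approximate effective bits per weight for common quantization types.
# These figures are assumptions (rough, commonly quoted values), not measurements.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
    "Q8_0": 8.5,
    "BF16": 16.0,
}

TOTAL_PARAMS = 109e9  # Scout's total parameter count, as stated in the text


def estimate_weight_gib(total_params: float, bits_per_weight: float) -> float:
    """Return the approximate size of the weights alone, in GiB."""
    return total_params * bits_per_weight / 8 / 1024**3


def print_estimates() -> None:
    """Print a weights-only estimate for each quantization type."""
    for name, bits in APPROX_BITS_PER_WEIGHT.items():
        size = estimate_weight_gib(TOTAL_PARAMS, bits)
        print(f"{name:7s} ~{size:6.1f} GiB (weights only)")
```

Call print_estimates() and compare the figures with the size that ollama show reports for the tag you actually pulled; the reported size is authoritative, and the estimate is only a sanity check against your machine's unified memory.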

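| Chip Family | Base | Pro | Max | Ultra |
|---|---|---|---|---|
| M1 | 8/16GB | 16/32GB | 32/64GB | 64/128GB |
| M2 | 8/16/24GB | 16/32GB | 32/64/96GB | 64/128/192GB |
| M3 | 8/16/24GB | 18/36GB | 36/48/64/128GB | 96/192GB |
| M4 | 16/24/32GB | 24/48GB | 36/64/128GB | 192GB |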

The absolute minimum is 16GB of unified memory, which constrains usage to aggressive Q4 quantization with limited context windows. For comfortable Q5 or Q6 inference with reasonable context lengths, 36GB or more is recommended. The M4 Pro and M4 Max generations bring improved memory bandwidth, which matters most for token generation because that phase is bound by how fast weights can be read from memory. Note that MLX runs on the CPU and GPU; it does not target the Neural Engine.

Software Requirements

You need macOS 13 (Ventura) or later per Ollama's documented requirements, though macOS 14 or 15 is recommended. You will also need Xcode Command Line Tools, Homebrew for package management, and Python 3.11+ for the API integration sections later in the guide.

# Environment verification — run this block to confirm readiness
echo "=== macOS Version ==="
sw_vers

echo "=== Chip Type ==="
sysctl -n machdep.cpu.brand_string

echo "=== Unified Memory (Total Physical) ==="
# Note: on VMs this may not reflect GPU-accessible memory
sysctl -n hw.memsize | awk '{printf "%.0f GB\n", $1/1073741824}'

echo "=== Xcode CLI Tools ==="
xcode-select -p 2>/dev/null && echo "Installed" || echo "NOT INSTALLED — run: xcode-select --install"

echo "=== Homebrew ==="
brew --version 2>/dev/null || echo "NOT INSTALLED — see https://brew.sh"

echo "=== Python ==="
python3 --version 2>/dev/null || echo "Python 3 not found"

If Xcode Command Line Tools are missing, run xcode-select --install before proceeding.

Understanding the MLX Backend in Ollama

What Changed: From llama.cpp to MLX

Ollama has historically relied on llama.cpp for inference on all platforms, and on macOS it runs llama.cpp with Metal acceleration. llama.cpp supports Metal, but its design centers on CUDA and CPU inference paths. A native MLX backend integration is under development; check the release notes for its availability before following this guide. MLX was built from the ground up for Apple's unified memory architecture, so the GPU operates on tensors in place without a CPU-to-GPU copy. For MoE models like Scout, this is particularly advantageous: only the active expert weights are read for a given token, and the inactive experts stay resident in the same memory pool, so no bus transfer is needed when the router selects them.

A separate path exists for running Scout directly via the mlx-lm Python package as a standalone inference engine, independent of Ollama. The custom quantization section below covers this workflow.

Performance Gains at a Glance

The following benchmark comparison uses an M3 Max with 64GB unified memory as the reference platform.

⚠️ The following figures are illustrative estimates without a documented source, Ollama version, or run methodology. Treat as rough order-of-magnitude guidance and verify with your own benchmarks.

| Quantization | Backend | Prompt Eval (tok/s) | Generation (tok/s) | Peak Memory | Source |
|---|---|---|---|---|---|
| Q4_K_M | llama.cpp | ~18 | ~12 | ~32GB | Unverified estimate |
| Q4_K_M | MLX | ~45 | ~28 | ~30GB | Unverified estimate |
| Q5_K_M | llama.cpp | ~14 | ~9 | ~38GB | Unverified estimate |
| Q5_K_M | MLX | ~35 | ~22 | ~36GB | Unverified estimate |
| Q6_K | llama.cpp | ~10 | ~7 | ~44GB | Unverified estimate |
| Q6_K | MLX | ~28 | ~18 | ~42GB | Unverified estimate |
| Q8_0 | llama.cpp | ~7 | ~5 | ~56GB | Unverified estimate |
| Q8_0 | MLX | ~18 | ~12 | ~54GB | Unverified estimate |

Based on these estimates, MLX delivers roughly 2.3-2.8x faster prompt eval and generation across quantization levels, with about 2GB lower peak memory usage at each level. The prompt evaluation speedup is particularly notable because MLX batches token processing more effectively on the Metal compute pipeline. Run your own benchmarks on your specific hardware to confirm actual performance.
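The speedup ratios implied by the table can be recomputed directly, which is also a convenient template for comparing your own measurements. This is a minimal sketch: the figures are copied from the unverified estimates above, and the names ESTIMATES and speedups are illustrative.

```python
# (prompt eval tok/s, generation tok/s) per backend, copied from the table above.
# These are the article's unverified estimates, not measured results.
ESTIMATES = {
    "Q4_K_M": {"llama.cpp": (18, 12), "MLX": (45, 28)},
    "Q5_K_M": {"llama.cpp": (14, 9), "MLX": (35, 22)},
    "Q6_K": {"llama.cpp": (10, 7), "MLX": (28, 18)},
    "Q8_0": {"llama.cpp": (7, 5), "MLX": (18, 12)},
}


def speedups(estimates: dict) -> dict:
    """Return {quant: (prompt_speedup, generation_speedup)} for MLX over llama.cpp."""
    result = {}
    for quant, backends in estimates.items():
        base_prompt, base_gen = backends["llama.cpp"]
        mlx_prompt, mlx_gen = backends["MLX"]
        result[quant] = (mlx_prompt / base_prompt, mlx_gen / base_gen)
    return result


def print_speedups() -> None:
    """Print the ratios in a compact, comparable form."""
    for quant, (prompt_x, gen_x) in speedups(ESTIMATES).items():
        print(f"{quant:7s} prompt {prompt_x:.2f}x  generation {gen_x:.2f}x")
```

Replace the tuples with your own --verbose readings to see whether your hardware follows the same pattern.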

Installing and Configuring Ollama with MLX

Installing Ollama

The Homebrew installation method is the most straightforward path on macOS. Not all Ollama builds default to the MLX backend, so verification is essential after installation.

# Install Ollama via Homebrew
brew install ollama

# Start the Ollama service
ollama serve &
OLLAMA_PID=$!

# Wait for Ollama to bind port 11434 (timeout 30s)
for i in $(seq 1 30); do
    curl -sf http://localhost:11434/ > /dev/null 2>&1 && break
    sleep 1
done

curl -sf http://localhost:11434/ > /dev/null 2>&1 \
    || { echo "ERROR: Ollama did not start within 30s"; kill $OLLAMA_PID; exit 1; }

# Verify version — check release notes to confirm
# whether your version includes MLX backend support
ollama --version

# Check backend logs for MLX activation
grep -i "mlx\|backend\|metal" ~/.ollama/logs/server.log

# Confirm Metal/GPU acceleration is recognized
system_profiler SPDisplaysDataType | grep -i "metal"

If the logs show llama.cpp as the active backend rather than MLX, the installed build may predate MLX integration. Run brew upgrade ollama and check the release notes for MLX support.

Pulling the Right Llama 4 Scout Model

Ollama's model registry uses tag naming conventions that encode the model variant and quantization level. Tag names change as models are updated. Community-quantized models may appear alongside officially registered variants; the official models are preferable for reproducibility and verified quality.

# Pull the Q4_K_M variant (smallest, fits 16GB machines)
# ⚠️ The tags below are examples and may not match current registry names.
ollama pull llama4-scout:q4_K_M

# Pull the Q5_K_M variant (balanced quality/performance for 36GB+)
ollama pull llama4-scout:q5_K_M

# Pull the Q6_K variant (higher quality for 48GB+ machines)
ollama pull llama4-scout:q6_K

# Inspect model metadata
ollama show llama4-scout:q5_K_M

# Check ~/.ollama/logs/server.log for backend confirmation after
# loading the model, or use 'ollama ps' which shows running model
# metadata. The exact fields shown by 'ollama show' vary by version.

Quantization Guide by Mac Tier

How Quantization Affects Scout's MoE Performance

Quantization reduces the precision of model weights from their original floating-point representation to lower-bit formats. For dense models, dropping from Q8 to Q4 typically degrades quality modestly. MoE models like Scout may be more sensitive to aggressive quantization because the router that selects experts computes its scores from learned weights. When those weights, or the activations feeding them, lose precision, the router can select different experts than the full-precision model would, which can degrade output. This means the jump from Q4 to Q5 on Scout may produce a more noticeable quality improvement than the same jump on a dense 17B model, though rigorous comparative studies on MoE quantization sensitivity are still emerging.
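To make the precision loss concrete, the sketch below quantizes a set of synthetic weights at several bit widths and measures the round-trip error. It uses simple per-tensor symmetric quantization as a stand-in; this is not the block-wise K-quant scheme that GGUF formats use, and the Gaussian weights are synthetic, so only the trend is meaningful.

```python
import random


def quantize_roundtrip(weights, bits):
    """Quantize to signed integers of the given bit width, then dequantize."""
    levels = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit, 127 for 8-bit
    scale = max(abs(w) for w in weights) / levels
    if scale == 0:
        return list(weights)
    return [round(w / scale) * scale for w in weights]


def mean_abs_error(weights, bits):
    """Average absolute difference between original and restored weights."""
    restored = quantize_roundtrip(weights, bits)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)


def demo_weights(n=4096, seed=0):
    """Synthetic, normally distributed weights (an assumption for illustration)."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.02) for _ in range(n)]


def print_errors():
    """Print the round-trip error at each bit width."""
    weights = demo_weights()
    for bits in (4, 5, 6, 8):
        print(f"{bits}-bit mean abs error: {mean_abs_error(weights, bits):.6f}")
```

Each extra bit roughly halves the step size, so the error shrinks quickly between 4 and 6 bits and more slowly in absolute terms after that.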

Recommended Quantization by Hardware

Quality ratings below reflect subjective coherence on instruction-following tasks. No perplexity measurements are available; treat these as rough tiers, not calibrated scores.

| Mac Configuration | Unified Memory | Recommended Quant | Expected tok/s (gen) | Quality |
|---|---|---|---|---|
| M1/M2 Base (16GB) | 16GB | Q4_K_M | 8-12 | Acceptable for drafting; noticeable degradation on reasoning tasks |
| M2/M3 Pro (18-36GB) | 18-36GB | Q5_K_M | 12-18 | Good for most development workflows |
| M3/M4 Max (48-128GB) | 48-128GB | Q6_K / Q8_0 | 18-30 | Near-FP16 quality |
| M2 Ultra (128-192GB) | 128-192GB | Q8_0 / FP16 | 25-40 | Near-FP16 quality |
| M4 Ultra (192GB) | 192GB | Q8_0 / FP16 | 25-40 | Near-FP16 quality |

On 16GB machines, running Q4_K_M with a constrained context window is workable but leaves almost no headroom for other applications. The 36GB tier hits the sweet spot for most developers, balancing quality and speed at Q5_K_M.

Custom Quantization with mlx-lm (Advanced)

When the pre-quantized models in the Ollama registry do not match a specific need, or when a particular quantization variant is unavailable, quantizing directly from HuggingFace weights using mlx-lm provides full control.

⚠️ Prerequisites before running the commands below:

  1. Accept Meta's Llama 4 license at huggingface.co/meta-llama.
  2. Run huggingface-cli login and paste your HF access token. Never hard-code the token in a script.
  3. Ensure 250GB+ free disk space. The full-precision bfloat16 weights will likely exceed 200GB.
  4. Confirm the exact HuggingFace model path at huggingface.co/meta-llama before running.
# Create and activate a virtual environment
python3 -m venv .venv && source .venv/bin/activate

# Install mlx-lm
pip install mlx-lm

# Download and convert Llama 4 Scout from HuggingFace
# ⚠️ Verify the --hf-path slug at huggingface.co/meta-llama before running
mlx_lm.convert \
  --hf-path meta-llama/Llama-4-Scout-17B-16E \
  --mlx-path ./llama4-scout-mlx

# Quantize to 5-bit MLX format (distinct from GGUF Q5_K_M; not format-equivalent)
# Use the already-downloaded local path to avoid re-downloading
mlx_lm.convert \
  --hf-path ./llama4-scout-mlx \
  --mlx-path ./llama4-scout-q5 \
  --quantize \
  --q-bits 5

# Store the absolute path for the Modelfile so it works from any directory
SCOUT_WEIGHTS="$(pwd)/llama4-scout-q5"

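# Create a custom Ollama Modelfile
# Note: Ollama's documented import formats are GGUF and Safetensors. Confirm that
# your Ollama version can import MLX-format weights before relying on this step.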
cat > Modelfile <<EOF
FROM ${SCOUT_WEIGHTS}
PARAMETER num_ctx 8192
PARAMETER temperature 0.7
SYSTEM "You are a helpful assistant optimized for long-context reasoning."
EOF

# Register the custom model with Ollama
ollama create llama4-scout-custom -f Modelfile

# Verify
ollama list | grep llama4-scout-custom

This workflow requires sufficient disk space for the intermediate conversion files, which will likely exceed 200GB for full-precision bfloat16 weights; plan for 250GB+ free disk space.
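The converted weights can also be run directly with mlx-lm, without Ollama. The following is a minimal sketch, assuming mlx-lm is installed in the active virtual environment, that your mlx-lm version supports the Llama 4 architecture, and that the weights live at the ./llama4-scout-q5 path used above. The helper names build_messages and run_with_mlx_lm are illustrative.

```python
def build_messages(question, system=None):
    """Build a chat-style message list for the tokenizer's chat template."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": question})
    return messages


def run_with_mlx_lm(model_path, question, max_tokens=256):
    """Load MLX-format weights from disk and generate a single reply."""
    # Imported lazily so build_messages stays usable without mlx-lm installed.
    from mlx_lm import load, generate

    model, tokenizer = load(model_path)
    # Assumption: instruct variants ship a chat template; base models may not,
    # in which case the raw question is used as the prompt.
    if getattr(tokenizer, "chat_template", None):
        prompt = tokenizer.apply_chat_template(
            build_messages(question), add_generation_prompt=True, tokenize=False
        )
    else:
        prompt = question
    return generate(model, tokenizer, prompt=prompt, max_tokens=max_tokens)


# Example (requires Apple Silicon, mlx-lm, and the converted weights):
# print(run_with_mlx_lm("./llama4-scout-q5", "Summarize mixture-of-experts routing."))
```

Loading the model this way is a quick check that the quantized weights are usable before you spend time on the Ollama import step.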

Running Llama 4 Scout: First Inference and Testing

Interactive Chat via Terminal

With the model pulled and the backend confirmed, running interactive inference is straightforward. Scout excels at long-context reasoning and multilingual tasks, so setting a system prompt that targets these strengths improves output quality.

First, start the interactive REPL:

# Enter the interactive REPL with verbose output
ollama run llama4-scout:q5_K_M --verbose

Then, inside the REPL session, configure parameters and interact with the model:

>>> /set parameter num_ctx 16384
>>> /set parameter temperature 0.7

>>> /set system You are a technical analyst. Provide detailed, well-structured answers. When analyzing documents, cite specific sections.

>>> Explain the key differences between mixture-of-experts and dense transformer architectures.

>>> Now compare the inference cost characteristics of each on consumer hardware with unified memory.

>>> Summarize the trade-offs in a table format.

Note: /set parameter commands are REPL commands that must be typed inside the interactive session. They cannot be passed as shell arguments to ollama run.

The --verbose flag outputs token-per-second statistics after each response, providing immediate feedback on inference performance.
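The same statistics are available programmatically: the final response from Ollama's native API includes eval_count, eval_duration, prompt_eval_count, and prompt_eval_duration, with durations in nanoseconds. A minimal sketch for turning those fields into tokens per second follows; the function names are illustrative, and the sample values in the comment are made up.

```python
def tokens_per_second(count: int, duration_ns: int) -> float:
    """Convert a token count and a nanosecond duration into tokens per second."""
    if duration_ns <= 0:
        return 0.0
    return count / (duration_ns / 1e9)


def summarize_timings(response: dict) -> dict:
    """Extract prompt-eval and generation throughput from an Ollama response dict."""
    return {
        "prompt_tok_s": tokens_per_second(
            response.get("prompt_eval_count", 0),
            response.get("prompt_eval_duration", 0),
        ),
        "gen_tok_s": tokens_per_second(
            response.get("eval_count", 0),
            response.get("eval_duration", 0),
        ),
    }


# Example with made-up numbers (not a benchmark):
# summarize_timings({"eval_count": 100, "eval_duration": 5_000_000_000})
```

Logging these values per request makes it easier to spot the moment throughput drops, for example after raising num_ctx.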

Monitoring Performance in Real Time

Tracking memory utilization and generation speed during inference reveals whether the system is running optimally or hitting memory pressure.

# Check running models and their resource usage
ollama ps

# Watch memory pressure in real time (run in a separate terminal)
# Note: memory_pressure may require sudo on some macOS configurations.
# If it fails, use 'vm_stat' (available on all macOS versions) or
# Activity Monitor > Memory tab as alternatives.
# Press Ctrl-C to stop cleanly.
trap 'echo "Stopped."; exit 0' INT TERM

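while true; do
    # Fall back to vm_stat only when memory_pressure itself fails.
    # (Piping memory_pressure straight into head would hide its exit status.)
    if out=$(memory_pressure 2>/dev/null); then
        printf '%s\n' "$out" | head -5
    else
        vm_stat | head -10
    fi
    sleep 2
    echo "---"
done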

# Parse Ollama logs for tokens/sec metrics (in another terminal)
tail -f ~/.ollama/logs/server.log | grep --line-buffered -E "eval.*tokens|total duration"

If memory_pressure reports system-wide memory pressure at a critical level, the model is spilling into swap, and generation speed will collapse. The fix is either a lower quantization level or a shorter context window.
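Swap usage can also be checked from a script. The sketch below parses the output of sysctl vm.swapusage, which is a different tool from the memory_pressure loop above. It assumes the usual macOS output shape (for example "used = 1024.50M"); the 2 GB threshold mirrors the checklist at the end of this guide, and the function names are illustrative.

```python
import re
import subprocess

_UNIT_BYTES = {"K": 1024, "M": 1024**2, "G": 1024**3}
SWAP_LIMIT_BYTES = 2 * 1024**3  # 2 GB guideline from the checklist in this guide


def parse_swap_used_bytes(sysctl_output: str) -> float:
    """Parse the 'used = <number><unit>' field from `sysctl vm.swapusage` output."""
    match = re.search(r"used\s*=\s*([\d.]+)([KMG])", sysctl_output)
    if match is None:
        raise ValueError(f"Unrecognized vm.swapusage output: {sysctl_output!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_BYTES[unit]


def swap_used_bytes() -> float:
    """Run sysctl (macOS only) and return the swap currently in use, in bytes."""
    result = subprocess.run(
        ["sysctl", "vm.swapusage"], capture_output=True, text=True, check=True
    )
    return parse_swap_used_bytes(result.stdout)


def swap_is_excessive(used_bytes: float, limit: float = SWAP_LIMIT_BYTES) -> bool:
    """Return True when swap use exceeds the guideline threshold."""
    return used_bytes > limit
```

Calling swap_is_excessive(swap_used_bytes()) during a long generation gives a simple pass/fail signal to pair with the visual monitoring loop.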

Tuning Throughput and Memory Usage

Memory Management

Unified memory is shared between the OS, applications, and the model. Closing memory-intensive applications (browsers, IDEs, Docker) before running Scout can recover 2-8 GB depending on your application mix, sometimes enough to step up from Q4 to Q5. Ollama exposes environment variables that control how many models stay loaded simultaneously and how many parallel requests are served:

  • OLLAMA_MAX_LOADED_MODELS controls the maximum number of models kept in memory. Set this to 1 so Scout gets maximum available memory.
  • For single-user local inference, keep concurrent request slots minimal: set OLLAMA_NUM_PARALLEL to 1.

On lower-memory machines, setting num_gpu to a value less than the total number of layers enables hybrid CPU/GPU inference, trading speed for the ability to run a model that would otherwise not fit in GPU-accessible memory.

Context Window Tuning

Scout's 10-million-token context window is a theoretical maximum defined in the model architecture. On consumer Macs, this maximum is not achievable. On a 64GB M3 Max running Q5_K_M, a context window of 32,768 tokens is comfortable. Pushing to 65,536 tokens is possible but leaves minimal headroom. On 16GB machines, keep context at 4,096 to 8,192 tokens. Each token in the context window consumes memory for key-value cache storage, so the practical ceiling scales directly with available unified memory after model weights are loaded.
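The per-token cost of the key-value cache can be estimated with the standard formula for grouped-query attention. In the sketch below, the layer count, KV head count, and head dimension are illustrative placeholders rather than confirmed Scout values; read the real ones from the model's configuration. The runtime may also use a different cache layout or precision, so treat the result as an order-of-magnitude guide.

```python
def kv_cache_bytes(num_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Estimate KV cache size: one key and one value tensor per layer, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * num_tokens


# Placeholder architecture values (assumptions, not taken from Meta's model card).
EXAMPLE_LAYERS = 48
EXAMPLE_KV_HEADS = 8
EXAMPLE_HEAD_DIM = 128


def print_kv_estimates():
    """Print the estimated cache size for several context lengths (16-bit cache)."""
    for ctx in (4096, 8192, 32768, 65536):
        size = kv_cache_bytes(ctx, EXAMPLE_LAYERS, EXAMPLE_KV_HEADS, EXAMPLE_HEAD_DIM)
        print(f"num_ctx={ctx:6d} -> ~{size / 1024**3:.2f} GiB")
```

Because the cache grows linearly with num_ctx, doubling the context window doubles this figure, which is why the comfortable ceiling depends on what is left after the weights are loaded.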

Environment Variables for Optimization

# Add to ~/.zshrc (default shell on macOS Catalina and later).
# Use ~/.bash_profile only if your shell is bash (run 'echo $SHELL' to confirm).

# Limit Ollama to one loaded model (maximize memory for Scout)
export OLLAMA_MAX_LOADED_MODELS=1

# Single parallel request slot for local use
export OLLAMA_NUM_PARALLEL=1

# Keep-alive duration — prevents model unloading between queries
# Note: some Ollama versions accept duration strings ("1h") rather than
# integer seconds. Check your version's documentation for the accepted format.
export OLLAMA_KEEP_ALIVE=3600

# Reload shell profile
source ~/.zshrc

Consult github.com/ollama/ollama for the current list of supported environment variables; Ollama silently ignores variables it does not recognize, so a misspelled or unsupported name has no effect.

Python Integration via Ollama's API

Local REST API Basics

Ollama exposes an OpenAI-compatible REST API on localhost:11434 by default. This enables integration with any tooling that supports the OpenAI SDK format. Both streaming and non-streaming responses are supported.

# Method 1: Using the native ollama Python package
import ollama

response = ollama.chat(
    model="llama4-scout:q5_K_M",
    messages=[
        {"role": "system", "content": "You are a technical assistant."},
        {"role": "user", "content": "Explain MoE routing in Llama 4 Scout."}
    ],
    stream=True,
    options={"temperature": 0.7, "num_ctx": 16384}
)

# Use attribute access which works with ollama >=0.2.0 (ChatResponse objects).
# If you are on an older version that returns dicts, upgrade: pip install --upgrade ollama
try:
    for chunk in response:
        content = getattr(chunk, "message", None)
        if content is not None:
            print(getattr(content, "content", ""), end="", flush=True)
finally:
    print()  # newline after streaming completes

# Method 2: Using the OpenAI SDK with Ollama's compatible endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama4-scout:q5_K_M",
    messages=[
        {"role": "system", "content": "You are a technical assistant."},
        {"role": "user", "content": "Explain MoE routing in Llama 4 Scout."}
    ],
    temperature=0.7,
    stream=True
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

Install the packages with pip install ollama openai (preferably inside a virtual environment).
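If you would rather avoid extra dependencies, the native endpoint can be called with the standard library alone. This is a minimal sketch using Ollama's native /api/chat route (distinct from the OpenAI-compatible /v1 route used in Method 2). The model tag is the same example tag used throughout this guide and may not match the current registry; the function names are illustrative.

```python
import json
import urllib.request

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"


def build_chat_payload(model: str, user_message: str, num_ctx: int = 8192) -> dict:
    """Build a non-streaming request body for Ollama's native chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "stream": False,
        "options": {"num_ctx": num_ctx},
    }


def post_chat(payload: dict, url: str = OLLAMA_CHAT_URL, timeout: int = 300) -> dict:
    """POST the payload to a running Ollama server and return the parsed JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Example (requires a running Ollama server and a pulled model):
# reply = post_chat(build_chat_payload("llama4-scout:q5_K_M", "Hello"))
# print(reply["message"]["content"])
```

Setting stream to False returns one JSON object, which keeps the example short; the packages above are the easier route when you need streaming.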

Building a Simple RAG Pipeline with Scout's Long Context

Scout's context window can hold full documents, removing the need for vector retrieval in many cases. For documents that fit within the context window, a straightforward chunking-and-stuffing approach works.

import ollama

# Approximate: 1 token ≈ 4 characters; guard at 80% of num_ctx
MAX_CONTEXT_CHARS = int(32768 * 4 * 0.80)  # ~104,857 chars


def load_and_chunk(filepath, chunk_size=2000, overlap=200):
    """Load a text file and split into overlapping chunks."""
    if overlap >= chunk_size:
        raise ValueError(
            f"overlap ({overlap}) must be less than chunk_size ({chunk_size})"
        )
    step = chunk_size - overlap
    with open(filepath, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i : i + chunk_size])
    # Note: the final chunk may be shorter than chunk_size.
    return chunks


def rag_query(chunks, question, model="llama4-scout:q5_K_M"):
    """Pass document chunks and a question to Scout via Ollama."""
    selected, total = [], 0
    for chunk in chunks:
        if total + len(chunk) > MAX_CONTEXT_CHARS:
            break
        selected.append(chunk)
        total += len(chunk)

    if not selected:
        raise ValueError("First chunk exceeds MAX_CONTEXT_CHARS; reduce chunk_size.")

    context = "\n\n---\n\n".join(selected)
    prompt = (
        f"Based on the following document excerpts, answer the question.\n\n"
        f"DOCUMENTS:\n{context}\n\n"
        f"QUESTION: {question}\n\n"
        f"Provide a detailed answer citing specific sections where relevant."
    )
    response = ollama.chat(
        model=model,
        messages=[
            {"role": "system", "content": "You answer questions based strictly on provided documents."},
            {"role": "user", "content": prompt}
        ],
        options={"temperature": 0.3, "num_ctx": 32768}
    )
    message = getattr(response, "message", None)
    if message is None:
        raise RuntimeError(f"Unexpected response schema: {response!r}")
    return getattr(message, "content", "")


# Usage
chunks = load_and_chunk("technical_report.txt")
answer = rag_query(chunks, "What were the key findings regarding memory bandwidth?")
print(answer)

This approach works well for documents under roughly 50,000 characters. The code above guards at ~104K characters (80% of the 32,768-token context), but real-world tokenization ratios vary, and longer documents tend to dilute retrieval precision. For larger corpora, add a vector retrieval step to select the most relevant chunks before context stuffing.
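A retrieval step can be as simple as ranking chunks against the question before stuffing. The sketch below uses bag-of-words cosine similarity from the standard library as a lexical stand-in; it is not embedding-based vector retrieval, which would normally replace the scoring function. The name top_k_chunks is illustrative, and its output is a list of chunks like the one load_and_chunk produces.

```python
import math
import re
from collections import Counter


def _tokenize(text: str) -> list:
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_chunks(chunks: list, question: str, k: int = 5) -> list:
    """Return the k chunks most similar to the question, in document order."""
    query = Counter(_tokenize(question))
    ranked = sorted(
        range(len(chunks)),
        key=lambda i: _cosine(Counter(_tokenize(chunks[i])), query),
        reverse=True,
    )
    keep = sorted(ranked[:k])  # restore original order so excerpts read naturally
    return [chunks[i] for i in keep]
```

Passing top_k_chunks(chunks, question) to rag_query in place of the full chunk list keeps the prompt focused on the passages most likely to contain the answer.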

Troubleshooting Common Issues

Model Won't Load / Out of Memory

Fix first: Switch to a lower quantization level (Q4_K_M) or reduce num_ctx to 4096.

If ollama run fails immediately or macOS kills the process, check memory_pressure output (or vm_stat if memory_pressure requires elevated privileges) and compare the model size from ollama show against available memory. On 16GB machines, even Q4_K_M may require closing all other applications.

Slow Generation (Below Expected tok/s)

If generation speed falls well below the expected range for the hardware tier, check for swap usage via Activity Monitor's Memory tab. Competing processes consuming unified memory are the most common cause. Confirm the correct backend is active by checking ollama ps output and server logs.

MLX Backend Not Activating

Older Ollama builds may lack MLX support entirely. Confirm the installed version supports MLX by checking the Ollama release notes. If the backend falls back to llama.cpp:

  1. Run brew upgrade ollama.
  2. Restart the service with ollama serve.
  3. Check server logs again for MLX activation.

Consult github.com/ollama/ollama for the current list of supported environment variables; Ollama silently ignores variables it does not recognize, so setting an undocumented variable will not force backend selection.

Implementation Checklist

  1. ☐ Verify Apple Silicon chip and macOS version (13+ required, 14+ recommended)
  2. ☐ Install Xcode Command Line Tools and Homebrew
  3. ☐ Install Ollama via Homebrew and confirm your version supports the target backend
  4. ☐ Confirm active backend in server logs (~/.ollama/logs/server.log)
  5. ☐ Pull the appropriate Scout quantization for the target hardware tier
  6. ☐ Set environment variables (OLLAMA_MAX_LOADED_MODELS, OLLAMA_NUM_PARALLEL, etc.) and confirm each is documented for your Ollama version
  7. ☐ Run first interactive inference and verify tok/s with --verbose
  8. ☐ Monitor memory pressure under load in a separate terminal
  9. ☐ Configure context window size appropriate to available memory
  10. ☐ Integrate with Python application via Ollama's REST API
  11. ☐ Verify swap usage stays under 2 GB during sustained inference (swap beyond this degrades throughput sharply)

Beyond Scout: Maverick and Fine-Tuning

Based on the unverified estimates above, MLX brings Scout to interactive speeds on M2 Pro and above, where llama.cpp throughput was practical mainly for batch work. The roughly 2.3-2.8x speedup in prompt eval and generation makes a practical difference for iterative development workflows. For developers looking ahead, Meta's Llama 4 Maverick, the larger MoE variant with more experts and a higher total parameter count, will likely demand high-memory hardware (confirm against official Meta model cards and community benchmarks before treating this as a firm requirement).

SitePoint's LLM development resources cover further topics including fine-tuning workflows, multi-model orchestration, and production deployment patterns. If you run through this guide on your own hardware, post your tok/s results by Mac tier in the SitePoint community forums to help build a more granular performance picture across the Apple Silicon lineup.

SitePoint Team
