OpenClaw in Production: Lessons from 4 Weeks of Self-Hosted AI Agents

Self-Hosted AI Agents Comparison

Dimension | Self-Hosted OpenClaw | Cloud API (GPT-4o-mini equiv.)
Cost per 1,000 tasks (at 47K/mo) | $104 | $100
Break-even volume (monthly tasks) | ~80,000+ | Below 30,000
Median latency (optimized) | 1.9 seconds | ~3.2 seconds
Data sovereignty | Full control, no data egress | Third-party processing
Ops overhead (weekly hours) | ~8 hours | ~2 hours

Running OpenClaw as a self-hosted AI agent platform sounded like the right call. After 30 days in a real production environment with actual workloads, the picture is messier than any pitch deck suggests. This is the full story.

OpenClaw is an open source framework for orchestrating multiple AI agents on your own infrastructure. It gives you a declarative configuration layer for defining agent roles, connecting them to tools and APIs, and managing their lifecycle. Think of it as a self-hosted control plane for AI agents: you wire up LLM-powered workers to your internal systems without routing every request through a third-party cloud endpoint. It sits in the same conceptual space as AutoGen, CrewAI, and LangGraph, but leans harder into self-hosted deployment and resource management.

Our hypothesis going in was simple: for sustained, predictable workloads, self-hosting would be cheaper and more controllable than paying per-token to cloud APIs. We expected a rough first week, a stable second week, and smooth sailing after that. Reality had other plans. What follows is a four-week timeline covering architecture decisions, cost analysis, optimization work, and an honest assessment of who should and shouldn't go down this road.

The Starting Architecture: Day 0 Hardware and Software Decisions

Hardware Specs and Why We Chose Them

We committed to a single bare-metal server as our primary compute node, with a cloud VM as a fallback. The bare-metal spec: dual AMD EPYC 7443 CPUs (48 cores total), 256 GB DDR4 ECC RAM, two NVIDIA A10 GPUs (24 GB VRAM each), and 2 TB NVMe storage in a RAID 1 configuration. Monthly colocation cost including power and bandwidth came to roughly $1,200.

The bare-metal decision was deliberate. Cloud GPU instances (A10G on AWS, L4 on GCP) would have run $2,800 or more per month for equivalent specs, and we wanted to eliminate variable pricing from the experiment. Network latency to our application servers was under 2ms since everything sat in the same facility. If your team doesn't have existing colocation relationships, cloud VMs are the pragmatic starting point. But the economics shift hard once you cross a utilization threshold.

OpenClaw Configuration and Initial Agent Setup

We deployed OpenClaw using Docker Compose with five agents, each scoped to a different task type: document summarization, data extraction, customer query routing, code review triage, and report generation. Here is the core deployment configuration from Day 0:

# docker-compose.yml — OpenClaw production stack, Day 0
# Requires: Docker Compose v2+, NVIDIA Container Toolkit installed on host
# Set PG_PASSWORD in a .env file before running: docker compose up -d

services:
  openclaw-core:
    image: openclaw/core:latest
    restart: always
    ports:
      - "8400:8400"
    environment:
      - OPENCLAW_MODEL_BACKEND=vllm
      - OPENCLAW_STATE_STORE=redis
      - OPENCLAW_LOG_LEVEL=info
      - OPENCLAW_MAX_AGENTS=10
      - VLLM_MODEL=mistralai/Mistral-7B-Instruct-v0.3
      - VLLM_GPU_MEMORY_UTILIZATION=0.85
      - VLLM_MAX_MODEL_LEN=8192
    volumes:
      - ./config:/etc/openclaw
      - model-cache:/models
    deploy:
      resources:
        limits:
          memory: 32G
        reservations:
          devices:
            - driver: nvidia
              count: 2
              capabilities: [gpu]
    depends_on:
      - redis
      - postgres

  redis:
    image: redis:7-alpine
    restart: always
    ports:
      - "6379:6379"

  postgres:
    image: postgres:16-alpine
    restart: always
    environment:
      - POSTGRES_DB=openclaw
      - POSTGRES_USER=openclaw
      - POSTGRES_PASSWORD=${PG_PASSWORD}
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  model-cache:
  pgdata:

Each agent was defined in a separate YAML config specifying its role, model assignment, tool access, and timeout behavior:

# config/agents/data-extraction.yml
agent:
  name: data-extraction
  description: "Extracts structured fields from unstructured documents"
  model: mistralai/Mistral-7B-Instruct-v0.3
  max_context_tokens: 6144
  timeout_seconds: 120
  retry_policy:
    max_retries: 2
    backoff_multiplier: 1.5
  tools:
    - name: pdf_parser
      endpoint: http://toolserver:9001/parse
    - name: schema_validator
      endpoint: http://toolserver:9001/validate
  guardrails:
    max_tool_calls_per_task: 10
    allow_external_network: false

Integration points included a REST API for task submission, PostgreSQL for state and audit logging, and Redis for ephemeral session data. Each agent connected to an internal tool server exposing PDF parsing, database lookups, and schema validation over HTTP.

Week 1: Baseline Performance and Early Surprises

Throughput, Latency, and Resource Utilization Benchmarks

We established baselines across all five agent types using a k6 load testing harness. The summarization agent handled roughly 14 requests per minute with a median response latency of 3.2 seconds. The data extraction agent was slower at 8 requests per minute because of its multi-step tool calls. GPU utilization averaged 72% across both A10s, while CPU utilization rarely exceeded 35%. Memory consumption was the surprise: even with a 32 GB container limit, the OpenClaw core process climbed to 28 GB by day three.

Compared to routing the same tasks through a cloud API (GPT-4o-mini for equivalent workloads), our self-hosted latency was actually 40% lower for single requests. The cloud API won on burst throughput, though. It handled sudden spikes with zero provisioning effort on our side.

The First Failures: Memory Leaks and Agent Drift

Day five. First real issue. Long-running agent sessions accumulated context state that was never fully released between tasks. The symptom: a gradual climb in VRAM consumption until the vLLM backend was OOM-killed and restarted. We traced this to agent sessions not closing properly when tasks completed with tool-call errors.
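One common guard against this class of leak is to release the session in a finally block, so cleanup runs even when a tool call raises. The sketch below is illustrative only: AgentSession, run_task, and close are hypothetical names standing in for whatever the framework exposes, not OpenClaw APIs.

```python
from contextlib import contextmanager


class AgentSession:
    """Hypothetical stand-in for an agent session that holds context state."""

    def __init__(self, agent_name):
        self.agent_name = agent_name
        self.context = []
        self.closed = False

    def run_task(self, task, tool):
        # Record the task in context, then call the tool (which may raise).
        self.context.append(task)
        return tool(task)

    def close(self):
        # Drop accumulated context so the memory can be reclaimed.
        self.context.clear()
        self.closed = True


@contextmanager
def agent_session(agent_name):
    """Yield a session and always close it, even if the task fails."""
    session = AgentSession(agent_name)
    try:
        yield session
    finally:
        session.close()


def failing_tool(task):
    """Simulates a tool call that errors out mid-task."""
    raise RuntimeError("tool-call error")


def run_safely(agent_name, task, tool):
    """Run one task and return (session, error) for inspection."""
    session = None
    try:
        with agent_session(agent_name) as session:
            session.run_task(task, tool)
    except RuntimeError as exc:
        return session, exc
    return session, None
```

Calling run_safely("code-review-triage", {"id": 1}, failing_tool) returns a session whose closed flag is set and whose context is empty, even though the tool raised.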

More subtle was agent drift. The code review triage agent, after processing roughly 200 consecutive pull requests, started misclassifying severity levels. Its accumulated conversation history was polluting its system prompt context. We solved this by enforcing hard session resets after every 50 tasks, which we tracked using this observability configuration:

# prometheus/openclaw-targets.yml
# Loaded by the main prometheus.yml through its top-level scrape_config_files option.
# Ensure prometheus.yml includes: scrape_config_files: ["openclaw-targets.yml"]
scrape_configs:
  - job_name: "openclaw-agents"
    scrape_interval: 15s
    static_configs:
      - targets: ["openclaw-core:8400"]
    metrics_path: /metrics
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: "openclaw_agent_tasks_total|openclaw_agent_latency_seconds|openclaw_gpu_memory_bytes|openclaw_agent_errors_total|openclaw_session_context_tokens"
        action: keep

The openclaw_session_context_tokens metric became our early warning system for drift. Any agent session exceeding 80% of its max context window triggered an alert and forced a reset.
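The two reset triggers described here (a hard cap of 50 tasks and an 80% context-window threshold) can be expressed as a single predicate. This is a minimal sketch of that policy, not OpenClaw's implementation; the function name and parameters are assumptions for illustration.

```python
def should_reset_session(tasks_completed, context_tokens, max_context_tokens,
                         max_tasks=50, context_threshold=0.80):
    """Return True when a session should be recycled.

    A session is reset when it has handled max_tasks tasks or when its
    accumulated context exceeds context_threshold of the window.
    """
    if max_context_tokens <= 0:
        raise ValueError("max_context_tokens must be positive")
    hit_task_cap = tasks_completed >= max_tasks
    hit_context_cap = context_tokens > context_threshold * max_context_tokens
    return hit_task_cap or hit_context_cap


# With the data-extraction agent's 6,144-token window, the context
# trigger fires once a session holds more than 4,915 tokens.
print(should_reset_session(10, 4_916, 6_144))
print(should_reset_session(10, 4_000, 6_144))
```

Keeping both triggers matters: the task cap bounds drift from accumulated history, while the context check catches sessions that fill up faster than expected on long documents.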

Week 2: Scaling Challenges and the First Architecture Pivot

When Concurrent Agents Broke Everything

On day 10, we tried scaling from 5 to 12 concurrent agents to handle a backlog. Everything fell apart. The vLLM backend could not serve 12 agents simultaneously on two A10 GPUs without severe contention. Requests queued internally, timeouts cascaded, and the PostgreSQL state store filled with incomplete task records. The bottleneck was GPU memory: 24 GB per card is generous for one or two model instances, but with 12 agents all expecting low-latency inference, the KV-cache pressure was unsustainable.
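A back-of-the-envelope estimate shows why the KV cache becomes the limit. The sketch below computes cache size per token from the model's attention shape. The Mistral-7B figures (32 layers, 8 key-value heads, head dimension 128) and the FP16 cache are assumptions based on the model's published configuration, not measurements from this experiment. vLLM allocates cache in pages on demand, so treat the results as an upper bound for sequences that fill their context.

```python
GIB = 1024 ** 3


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache that one token occupies across all layers."""
    # The factor of 2 covers the separate key and value tensors per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed attention shape for Mistral-7B (not measured in this experiment).
MISTRAL_7B = {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128}

per_token = kv_cache_bytes_per_token(**MISTRAL_7B)
# One sequence at VLLM_MAX_MODEL_LEN from the Day 0 compose file.
full_sequence_gib = per_token * 8192 / GIB
# Twelve agents, each filling the 6,144-token max_context_tokens limit.
twelve_agents_gib = 12 * per_token * 6144 / GIB

print(f"{per_token // 1024} KiB per token")
print(f"{full_sequence_gib:.1f} GiB per full 8,192-token sequence")
print(f"{twelve_agents_gib:.1f} GiB for 12 agents at 6,144 tokens each")
```

Under these assumptions, twelve full contexts need about 9 GiB of cache on top of the model weights, which leaves little headroom on a 24 GB card once weights and activation memory are accounted for.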

Architecture Pivot: Adding a Task Queue and Agent Pool Manager

We rearchitected mid-flight. Instead of agents pulling tasks directly from the API, we introduced Redis Streams as a task queue with priority levels, retry logic, and dead-letter handling. Agents became pooled workers: five warm agents ran continuously, and additional agents spun up only when queue depth exceeded a threshold.

# task_queue/worker.py — Agent pool worker with Redis Streams
# Requires: pip install redis requests
import os
import json
import time
import logging

import redis
import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

r = redis.Redis(host="redis", port=6379, decode_responses=True)

STREAM = "openclaw:tasks"
GROUP = "agent-pool"
CONSUMER = f"worker-{os.getpid()}"
DLQ_STREAM = "openclaw:tasks:dlq"
MAX_RETRIES = 3

# Create consumer group if it doesn't exist
try:
    r.xgroup_create(STREAM, GROUP, id="0", mkstream=True)
except redis.exceptions.ResponseError:
    # Group already exists
    pass


def process_task(task_data):
    response = requests.post(
        "http://openclaw-core:8400/v1/tasks",
        json=task_data,
        timeout=task_data.get("timeout_seconds", 120),
    )
    response.raise_for_status()
    return response.json()


while True:
    try:
        entries = r.xreadgroup(
            GROUP, CONSUMER, {STREAM: ">"}, count=1, block=5000
        )
    except redis.exceptions.ConnectionError:
        logger.warning("Redis connection lost, retrying in 5s...")
        time.sleep(5)
        continue

    if not entries:
        continue

    for stream_name, messages in entries:
        for msg_id, fields in messages:
            retries = int(fields.get("retries", 0))
            priority = fields.get("priority", "normal")

            try:
                # Parse inside the try block so a malformed payload is routed
                # to the DLQ path instead of crashing the worker loop
                task_data = json.loads(fields["payload"])
                process_task(task_data)
                r.xack(STREAM, GROUP, msg_id)
                logger.info("Task %s completed successfully", msg_id)
            except Exception as e:
                logger.error("Task %s failed: %s", msg_id, e)
                payload = fields.get("payload", "")
                if retries >= MAX_RETRIES:
                    # Out of retries: park the task on the dead-letter stream
                    r.xadd(
                        DLQ_STREAM,
                        {
                            "payload": payload,
                            "error": str(e),
                            "original_id": msg_id,
                        },
                    )
                else:
                    # Re-enqueue a copy with an incremented retry counter
                    r.xadd(
                        STREAM,
                        {
                            "payload": payload,
                            "retries": str(retries + 1),
                            "priority": priority,
                        },
                    )
                # Acknowledge only after the copy is written, so a crash
                # between the two calls cannot silently drop the task
                r.xack(STREAM, GROUP, msg_id)

After this pivot, throughput stabilized at 45 tasks per minute with seven warm agents, and the queue absorbed traffic spikes without cascading failures. This was the single most impactful change we made across the entire 30 days.
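For completeness, here is a minimal sketch of the producer side and of a queue-depth scaling rule. The field names (payload, retries, priority) match what the worker reads. Everything else is an assumption for illustration: the function names are hypothetical, the threshold values in the example call are placeholders, and the article does not state the actual scaling threshold.

```python
import json

STREAM = "openclaw:tasks"


def enqueue_task(client, task, priority="normal"):
    """Add a task to the stream using the fields the worker expects.

    client is any object with a redis-py style xadd(name, fields) method.
    """
    fields = {
        "payload": json.dumps(task),
        "retries": "0",
        "priority": priority,
    }
    return client.xadd(STREAM, fields)


def desired_workers(queue_depth, warm, max_workers,
                    scale_threshold, tasks_per_extra_worker):
    """Return how many workers should run for a given backlog.

    The warm pool always runs. Extra workers are added only once the
    backlog passes scale_threshold, and the total never exceeds max_workers.
    """
    if queue_depth <= scale_threshold:
        return warm
    backlog = queue_depth - scale_threshold
    extra = -(-backlog // tasks_per_extra_worker)  # ceiling division
    return min(max_workers, warm + extra)


# Placeholder values: one extra worker per 10 queued tasks beyond 20.
print(desired_workers(35, warm=5, max_workers=9,
                      scale_threshold=20, tasks_per_extra_worker=10))
```

Capping the pool matters as much as growing it: the ceiling keeps the scaler from recreating the GPU contention that motivated the queue in the first place.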

Week 3: Cost Analysis Deep Dive

Raw Infrastructure Costs: The Full Breakdown

Cost Category | Monthly Amount
Bare-metal colocation (server + power + bandwidth) | $1,200
Hardware amortization (server purchase / 36 months) | $680
Software licensing | $0 (open source)
Ops labor (estimated 8 hrs/week at $95/hr) | $3,040
Monitoring tools (Grafana Cloud free tier + Prometheus) | $0
Total monthly cost | $4,920

Over the 30-day period, we processed 47,200 tasks. That puts our all-in cost per task at approximately $0.104, or $104 per 1,000 tasks.

OpenClaw Self-Hosted vs. Cloud API: Side-by-Side Comparison

For an equivalent workload using cloud APIs, we priced out the same task mix:

Line Item | Self-Hosted OpenClaw | Cloud API (GPT-4o-mini equiv.)
Compute / API costs | $1,880 | $3,776
Storage | $0 (included) | $120 (S3 logs + context)
Bandwidth | $0 (included) | $85 (data egress)
Ops labor | $3,040 | $760 (integration maintenance)
Total (monthly, 47K tasks) | $4,920 | $4,741
Cost per 1,000 tasks | $104 | $100

The numbers surprised us. At our current volume, the costs are nearly identical. The break-even point where self-hosting clearly wins is around 80,000 tasks per month, because infrastructure costs stay flat while API costs scale linearly. Below 30,000 tasks per month, cloud APIs win handily once you factor in ops labor.
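The flat-versus-linear argument can be checked with the table's own figures. The sketch below is a simplified model, not the article's calculator: it assumes self-hosted costs stay fixed at any volume the server can absorb, and that every cloud line item except integration maintenance scales linearly with task count.

```python
# Monthly figures taken from the comparison table.
SELF_HOSTED_FIXED = 1_200 + 680 + 3_040   # colocation + amortization + ops labor
CLOUD_FIXED = 760                          # integration maintenance
CLOUD_VARIABLE_AT_47K = 3_776 + 120 + 85   # API + storage + egress
TASKS_MEASURED = 47_200

# Assumption: cloud variable costs scale linearly with task volume.
CLOUD_PER_TASK = CLOUD_VARIABLE_AT_47K / TASKS_MEASURED


def cloud_monthly_cost(tasks):
    """Estimated monthly cloud bill at a given task volume."""
    return CLOUD_FIXED + CLOUD_PER_TASK * tasks


def crossover_volume():
    """Task volume at which both options cost the same per month."""
    return (SELF_HOSTED_FIXED - CLOUD_FIXED) / CLOUD_PER_TASK


for volume in (30_000, 47_200, 80_000):
    print(volume, SELF_HOSTED_FIXED, round(cloud_monthly_cost(volume)))
print("crossover:", round(crossover_volume()))
```

Under this linear model the raw crossover sits near 49,000 tasks per month. At 80,000 tasks the cloud bill would be roughly $7,500 against a flat $4,920, which is consistent with treating that volume as the point where self-hosting clearly wins rather than merely ties.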

The hidden costs cut both ways. Self-hosting carries debugging time, hardware failure risk, and the opportunity cost of skilled engineers babysitting infrastructure. Cloud APIs carry rate limits that force architectural compromises, vendor lock-in that makes switching painful, and data egress fees that grow with logging and audit requirements.

Cost Calculator Methodology

We built an interactive cost calculator to help teams model their own scenarios. The variables it accounts for: hardware tier (consumer GPU, data center GPU, cloud VM), task complexity (simple single-shot, multi-step with tools, long-context), target concurrency level, ops engineer hourly rate, and amortization period for purchased hardware. The calculator compares self-hosted costs against current published pricing for OpenAI, Anthropic, and Amazon Bedrock endpoints.

The methodology is straightforward. Self-hosted cost equals (hardware amortized + power + bandwidth + ops hours × rate) divided by measured throughput at your chosen concurrency. Cloud cost equals (tokens per task × price per token × task count) plus integration maintenance. The calculator outputs a break-even task volume and a monthly savings or loss figure. Use it as a starting framework for your own analysis, not as a guarantee. Your actual numbers will depend on workload characteristics, utilization rates, and local labor costs.
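The two formulas translate directly into code. This is a sketch of the stated methodology, not the calculator itself. The function names are hypothetical, and the token count and per-token price in the example are placeholders chosen so that their product matches the $0.08 API cost per task implied by the comparison table ($3,776 over 47,200 tasks). They are not measured values.

```python
def self_hosted_cost_per_task(hardware_amortized, power_and_bandwidth,
                              ops_hours, hourly_rate, tasks_per_month):
    """(hardware + power + bandwidth + ops hours x rate) / throughput."""
    monthly = hardware_amortized + power_and_bandwidth + ops_hours * hourly_rate
    return monthly / tasks_per_month


def cloud_cost_per_task(tokens_per_task, price_per_token,
                        integration_maintenance, tasks_per_month):
    """Token spend per task plus a per-task share of maintenance."""
    return (tokens_per_task * price_per_token
            + integration_maintenance / tasks_per_month)


def monthly_savings(self_hosted_per_task, cloud_per_task, tasks_per_month):
    """Positive means self-hosting is cheaper; negative means a loss."""
    return (cloud_per_task - self_hosted_per_task) * tasks_per_month


TASKS = 47_200
self_hosted = self_hosted_cost_per_task(
    hardware_amortized=680, power_and_bandwidth=1_200,
    ops_hours=32, hourly_rate=95, tasks_per_month=TASKS,
)
cloud = cloud_cost_per_task(
    tokens_per_task=4_000,      # placeholder
    price_per_token=0.00002,    # placeholder; 4,000 x 0.00002 = $0.08 per task
    integration_maintenance=760, tasks_per_month=TASKS,
)
print(f"self-hosted: ${self_hosted:.4f} per task")
print(f"cloud:       ${cloud:.4f} per task")
print(f"monthly savings: ${monthly_savings(self_hosted, cloud, TASKS):.0f}")
```

This version leaves out cloud storage and egress, so it shows a slightly larger gap in the cloud's favor than the full comparison table does.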

Week 4: Stabilization, Optimization, and What We'd Do Differently

Performance Optimizations That Actually Moved the Needle

The biggest single gain came from switching the vLLM backend to AWQ 4-bit quantization. Quality degradation on our task suite was under 2% (measured by a rubric-scored evaluation set), while throughput jumped 38%. The second biggest gain: tuning batch sizes and session management. Here is the before/after configuration diff:

# BEFORE (Week 1 defaults)
openclaw_core:
  inference:
    quantization: none
    max_batch_size: 4
    request_timeout: 120
  agents:
    session_max_tasks: 0        # unlimited
    idle_timeout: 600            # 10 minutes
    memory_limit_per_agent: 4G

# AFTER (Week 4 optimized)
openclaw_core:
  inference:
    quantization: awq-4bit
    max_batch_size: 16
    request_timeout: 90
  agents:
    session_max_tasks: 50        # hard reset after 50 tasks
    idle_timeout: 120            # 2 minutes, recycle faster
    memory_limit_per_agent: 2G   # tighter limit, rely on pooling

Raising max_batch_size from 4 to 16 pushed GPU utilization from 72% to 89%. Tightening idle timeouts freed resources for the pool manager to spin up fresh agents faster. Combined with quantization, our effective throughput went from 45 tasks per minute to 63.

What We'd Change on Day 0 If We Started Over

Four things.

Start with the task queue from day one. The direct-API-to-agent pattern does not survive contact with variable load.

Over-provision GPU memory by at least 30% beyond your initial estimates. KV-cache growth under concurrent agents is steeper than any calculator predicts.

Deploy your full observability stack (Prometheus, alerting rules, dashboards) before deploying a single agent. We wasted two days in Week 1 debugging blind.

Set hard guardrails on agent autonomy scope from the start. Letting agents access broad tool sets without explicit per-agent allow-lists creates debugging nightmares when things go sideways.
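As an illustration of the last point, a per-agent allow-list can be enforced with a small wrapper that sits in front of every tool call. This is a sketch under stated assumptions: ToolGuard and ToolCallDenied are hypothetical names, and the tool names and call budget mirror the data-extraction agent config rather than any built-in OpenClaw mechanism.

```python
class ToolCallDenied(Exception):
    """Raised when an agent requests a tool it may not use."""


class ToolGuard:
    """Enforces an explicit tool allow-list and a per-task call budget."""

    def __init__(self, allowed_tools, max_calls_per_task=10):
        self.allowed_tools = frozenset(allowed_tools)
        self.max_calls_per_task = max_calls_per_task
        self.calls = 0

    def check(self, tool_name):
        """Approve one tool call or raise ToolCallDenied."""
        if tool_name not in self.allowed_tools:
            raise ToolCallDenied(f"{tool_name!r} is not on the allow-list")
        if self.calls >= self.max_calls_per_task:
            raise ToolCallDenied("tool-call budget exhausted for this task")
        self.calls += 1

    def reset(self):
        """Start a fresh budget for the next task."""
        self.calls = 0


guard = ToolGuard({"pdf_parser", "schema_validator"}, max_calls_per_task=10)
guard.check("pdf_parser")          # allowed
try:
    guard.check("shell_exec")      # not listed, so it is rejected
except ToolCallDenied as exc:
    print("denied:", exc)
```

Denying by default keeps failure modes legible: a rejected call shows up as one explicit exception in the logs instead of an unexpected side effect somewhere downstream.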

Hard Truths About Self-Hosted AI Agents

It's Not "Set and Forget"

We spent an average of 8 hours per week on maintenance: patching dependencies, tuning configurations, investigating failed tasks, and rotating secrets. Model updates from upstream required testing against our evaluation suite before promotion. Dependency drift between the OpenClaw core, the vLLM backend, and CUDA drivers caused two minor outages when an automated OS update pulled in an incompatible NVIDIA driver version. Pin your driver versions. Pin everything.

Self-hosting makes sense for teams with stable, high-volume workloads and existing infrastructure chops. It does not make sense for teams that want to prototype quickly, deal with unpredictable traffic patterns, or lack anyone comfortable debugging GPU memory allocation issues at 2 AM.

The Talent and Knowledge Bottleneck

Running this stack requires a blend of ML ops, DevOps, and AI agent design knowledge. That combination is rare. Our team had two engineers with overlapping skills, and it still stretched thin during Week 2's architecture pivot. Documentation across the self-hosted AI agent space is fragmented. Community forums exist but skew toward enthusiasts running weekend experiments, not production operators sharing battle-tested patterns. Expect to be on your own for the hard problems.

Data Sovereignty: The One Unambiguous Win

For organizations in regulated industries (healthcare, finance, legal, government), self-hosting is the clearest path to ensuring no customer data or proprietary content leaves your network perimeter. No telemetry phones home if you configure it correctly, no prompts land in a third-party training pipeline, and you maintain full audit logs on infrastructure you control. This single benefit may justify the operational overhead regardless of cost comparisons. Verify telemetry defaults, disable outbound calls for model downloads after initial setup, and audit every third-party tool integration for data leakage.

30-Day Scorecard: Final Metrics and Recommendations

By the Numbers

Metric | Week 1 | Week 4 | Change
Tasks processed (weekly) | 8,400 | 14,800 | +76%
Uptime | 94.2% | 99.6% | +5.4pp
Median latency (seconds) | 3.2 | 1.9 | -41%
Cost per 1,000 tasks | $168 | $78 | -54%
Peak concurrent agents (stable) | 5 | 9 | +80%

Total infrastructure spend (30 days): $4,920
Total tasks processed (30 days): 47,200

The trajectory is clear: self-hosted AI agents get dramatically better with tuning, but the first two weeks are expensive and unstable. Budget for that ramp-up period.

Who Should (and Shouldn't) Self-Host AI Agents Today

Good fit: Teams with ML ops experience, predictable workloads exceeding 80,000 tasks per month, data sensitivity requirements, and existing GPU infrastructure or colocation relationships.

Poor fit: Small teams without dedicated infrastructure engineers, workloads under 30,000 tasks per month, unpredictable traffic patterns, or organizations unwilling to commit to ongoing maintenance hours.

The hybrid approach is often the right answer. Self-host for your steady-state baseline workload where you can optimize utilization and keep data in-house. Burst to cloud APIs for traffic spikes that would otherwise require over-provisioning expensive GPU capacity that sits idle most of the time.
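A hybrid setup needs a routing rule somewhere in the submission path. The sketch below shows one possible policy, assuming a measurable queue depth and a per-task sensitivity flag. The function name, parameters, and capacity figure are hypothetical, and a real deployment would likely add hysteresis to avoid flapping between backends.

```python
def choose_backend(queue_depth, self_hosted_capacity,
                   contains_sensitive_data=False):
    """Pick where a task should run: 'self-hosted' or 'cloud'.

    Sensitive tasks never leave the self-hosted stack. Other tasks stay
    local until the backlog reaches the capacity the local pool can absorb.
    """
    if contains_sensitive_data:
        return "self-hosted"
    if queue_depth < self_hosted_capacity:
        return "self-hosted"
    return "cloud"


# Placeholder capacity of 200 queued tasks.
print(choose_backend(50, 200))
print(choose_backend(450, 200))
print(choose_backend(450, 200, contains_sensitive_data=True))
```

Checking sensitivity first means the data sovereignty guarantee holds even during spikes, at the cost of longer queue times for those tasks.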

If you're exploring managed alternatives that handle the infrastructure complexity for you, ClawPilot offers a hosted AI agent platform that eliminates the ops overhead discussed above while still giving you control over agent orchestration and workflows.

What's Next: OpenClaw Roadmap and Our Plans

Several features on the OpenClaw roadmap would have solved problems we ran into: native task queue integration (eliminating the need for our Redis Streams workaround), built-in agent session lifecycle management with automatic context pruning, and first-class Prometheus metrics endpoints with pre-built Grafana dashboards. Multi-node distributed deployments are also in discussion, which would address the single-server scaling ceiling we hit in Week 2.

For months two and three, we plan to expand to a second bare-metal node, test multi-model agent configurations (pairing a smaller model for routing with a larger model for complex tasks), and build a proper evaluation pipeline that runs automatically on every config change.

If you're running self-hosted AI agents in production, whether with OpenClaw or another framework, publish your numbers. The community desperately needs more honest production data and fewer "getting started" tutorials. Share your costs, your failure modes, your architecture decisions. That's how the whole space gets better.

Matt Mickiewicz

Matt is the co-founder of SitePoint, 99designs and Flippa. He lives in Vancouver, Canada.

© 2000 – 2026 SitePoint Pty. Ltd.