A note on numbers: Anthropic changes rate limits, pricing, and tier thresholds without notice. Every specific value in this article reflects mid-2025 documentation. Before making purchasing or architectural decisions, verify current figures at docs.anthropic.com/en/api/rate-limits and anthropic.com/pricing.

The 6% Mystery

The scenario is familiar to any developer working with Claude Code. The Anthropic dashboard reads 6% usage for the day, plenty of quota remaining by any reasonable interpretation. And yet the next command returns a 429 error, halting a coding session mid-refactor. The natural assumption is that something is broken.

The actual explanation: Claude Code rate limits operate as a system of three independent, overlapping constraints, and the dashboard percentage reflects only one of them.

This guide breaks down how those constraints interact, why agentic coding tools like Claude Code trigger them in ways that chat usage never does, and what developers on Claude Pro ($20/month), Claude Max ($100-$200/month), or direct API plans can do to diagnose, avoid, and handle rate limit errors in production workflows.

It covers the tier restructuring Anthropic rolled out across 2025 into 2026, including usage-based tiers with automatic scaling, and the specific ways Claude Code's token consumption patterns collide with per-minute throughput ceilings that sit well below daily quotas.

How Claude Code Consumes Your API Quota

What Happens Under the Hood When You Run a Command

Claude Code does not send a single prompt to the API and wait for a response. Each interaction is a multi-turn conversation that includes the system prompt, the accumulated conversation history, the contents of files pulled into context, and tool-use tokens generated by operations like file reads, bash command execution, and codebase search. A seemingly simple "edit this file" command can consume between 50,000 and 150,000 tokens in a single API call once the full context window is assembled. A 10-file project might use 50k tokens per request; a 500-file monorepo can exceed 500k. Every follow-up message in the same session appends to this context, so the input size of each request grows roughly linearly with the number of turns, and the total tokens consumed across the session grow faster than linearly because every turn resends the full prior history.

Why Claude Code Is Different from Chat Usage

A typical Claude.ai chat interaction might consume roughly 1,000 to 5,000 tokens per exchange. Claude Code routinely consumes 10x to 100x that amount because of the agentic loop: the tool reads a file (input tokens), generates a proposed edit (output tokens), executes a bash command to test the result (more input and output tokens for the tool-use round trip), and then re-reads the modified file to verify (yet more input tokens). Each of these steps is a separate API call or a continuation of a multi-turn conversation that carries the full prior context.

Long coding sessions make this problem acute. As the context window grows, each subsequent request carries more weight. A developer who starts a session and issues 15 iterative commands may find the final command sending 200,000+ input tokens simply because the entire conversation history is included. This is the fundamental reason Claude Code users hit rate limits that chat users never encounter at the same subscription tier.
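To see why the final request in a session is so much heavier than the first, here is a rough sketch of the arithmetic. The starting context (20,000 tokens) and the growth per turn (13,000 tokens) are illustrative assumptions, not measured values, and `session_token_profile` is a hypothetical helper name.

```python
def session_token_profile(base_context, tokens_added_per_turn, turns):
    """Return (per_request_input, cumulative_input) lists, one entry per turn.

    Assumes every request resends the full prior history, so the input for
    turn n (starting at 0) is base_context + n * tokens_added_per_turn.
    """
    per_request = []
    cumulative = []
    total = 0
    for turn in range(turns):
        request_input = base_context + turn * tokens_added_per_turn
        total += request_input
        per_request.append(request_input)
        cumulative.append(total)
    return per_request, cumulative


# Illustrative assumptions: 20k tokens of initial context, 13k added per turn.
per_request, cumulative = session_token_profile(20_000, 13_000, 15)
print(f"First request input: {per_request[0]:,} tokens")
print(f"Final request input: {per_request[-1]:,} tokens")
print(f"Total input sent:    {cumulative[-1]:,} tokens")
```

Under these assumptions the per-request input grows in a straight line, while the session total grows roughly with the square of the number of turns, which is why long sessions get expensive quickly.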

The Three Rate Limit Types Explained

Requests Per Minute (RPM)

A single user-visible command in Claude Code can generate multiple API calls due to tool use, so a "lint, fix, test, fix" cycle might produce 8 to 12 API calls within 60 seconds. Every one of those calls, including retries, counts against RPM. Limits vary by tier: the free tier allows roughly 5 RPM, Tier 1 (accessible after a $5 credit purchase) allows 50 RPM, Tier 2 provides 1,000 RPM, Tier 3 offers 2,000 RPM, and Tier 4 reaches 4,000 RPM. For Claude Code users on subscription plans, the effective RPM depends on which underlying API tier the subscription maps to. At 8 to 12 calls per cycle, four to six back-to-back cycles are enough to exhaust a 50-RPM budget, and because Claude Code often retries failed tool calls automatically, a burst of activity can drain RPM before the developer is aware any retries occurred.
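For scripts that call the API directly, one way to stay under an RPM ceiling is to pace requests on the client. The sketch below is a simple sliding-window counter; `RequestPacer` is a hypothetical name, and the server's own accounting may not match a 60-second sliding window exactly, so treat it as a conservative guard rather than a mirror of Anthropic's limiter.

```python
import time
from collections import deque


class RequestPacer:
    """Client-side sliding-window limiter: at most `rpm` calls per 60 seconds."""

    def __init__(self, rpm, clock=time.monotonic):
        self.rpm = rpm
        self.clock = clock  # injectable so the pacer can be tested without sleeping
        self.sent = deque()  # timestamps of requests sent in the last 60 seconds

    def seconds_until_allowed(self):
        """Return 0.0 if a request may be sent now, else the wait in seconds."""
        now = self.clock()
        # Drop timestamps that have aged out of the 60-second window.
        while self.sent and now - self.sent[0] >= 60.0:
            self.sent.popleft()
        if len(self.sent) < self.rpm:
            return 0.0
        return 60.0 - (now - self.sent[0])

    def record(self):
        """Call once for every request actually sent, including retries."""
        self.sent.append(self.clock())
```

A caller would check `seconds_until_allowed()`, sleep for that long if it is non-zero, send the request, and then call `record()`.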

Tokens Per Minute (TPM)

Anthropic tracks input tokens and output tokens separately against per-minute limits. Input tokens tend to dominate in Claude Code because each turn appends to the context, meaning input-side TPM is the constraint most Claude Code users actually hit. The hidden multiplier is a large codebase in context: if Claude Code indexes or reads several files to understand a refactoring task, the input token count for a single request can reach the TPM ceiling on its own.

TPM limits scale with tiers. According to Anthropic's rate limits documentation, Tier 1 provides approximately 20,000 input TPM and 4,000 output TPM for Claude Sonnet-class models. Tier 2 raises input TPM to 40,000 and output to 8,000. Tier 3 reaches 80,000 input and 16,000 output. Tier 4 doubles those again. For Claude Opus-class models, limits are lower at each tier. Specific limits vary by model and model generation; consult the Anthropic rate limits documentation for the model you are using.

The following Python script demonstrates how to inspect rate limit headers programmatically after an API call:

import anthropic

# Requires the ANTHROPIC_API_KEY environment variable.
# Set with: export ANTHROPIC_API_KEY="sk-ant-..."
# pip install "anthropic>=0.25.0"
client = anthropic.Anthropic()

# with_raw_response exposes the HTTP headers alongside the message,
# so no separate call is needed to read the rate limit state.
raw_response = client.messages.with_raw_response.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Explain the purpose of this function."}],
)

message = raw_response.parse()  # the usual Message object
headers = raw_response.http_response.headers
print(f"Token limit:      {headers.get('anthropic-ratelimit-tokens-limit')}")
print(f"Tokens remaining: {headers.get('anthropic-ratelimit-tokens-remaining')}")
print(f"Tokens reset at:  {headers.get('anthropic-ratelimit-tokens-reset')}")
# Input and output tokens are also reported separately; .get() returns None if absent.
print(f"Input remaining:  {headers.get('anthropic-ratelimit-input-tokens-remaining')}")
print(f"Output remaining: {headers.get('anthropic-ratelimit-output-tokens-remaining')}")
print(f"Request limit:    {headers.get('anthropic-ratelimit-requests-limit')}")
print(f"Requests left:    {headers.get('anthropic-ratelimit-requests-remaining')}")
print(f"Requests reset:   {headers.get('anthropic-ratelimit-requests-reset')}")

Daily Token Quotas (The Ceiling You Don't See)

Daily quotas cap total token consumption over a 24-hour window. The dashboard percentage that reads "6%" reflects consumption against this daily ceiling. The critical insight is that hitting 6% of a daily quota does not mean 94% of per-minute throughput is available. RPM and TPM limits are completely independent of the daily quota. A developer can be at 6% daily usage and simultaneously at 100% of their TPM allocation for the current minute.

This is the "burst within budget" problem. The daily quota is generous enough to sustain hours of work, but the per-minute limits gate how fast that work can happen. A developer who concentrates their usage into intense 30-minute bursts will hit per-minute limits long before approaching the daily cap.
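A small worked example makes the gap between the two budgets concrete. The numbers below are the approximate Tier 1 figures used in this article (about 10M tokens per day and 20,000 input TPM) and are assumptions for illustration; both function names are hypothetical.

```python
def budget_snapshot(tokens_used_today, daily_quota, tokens_this_minute, tpm_limit):
    """Return (fraction of daily quota used, fraction of this minute's TPM used)."""
    return tokens_used_today / daily_quota, tokens_this_minute / tpm_limit


def minutes_to_spend_quota(daily_quota, tpm_limit):
    """Minimum minutes needed to consume the whole daily quota at full per-minute speed."""
    return daily_quota / tpm_limit


# Illustrative: 600k tokens used today, and 20k tokens sent in the current minute.
daily, minute = budget_snapshot(600_000, 10_000_000, 20_000, 20_000)
print(f"Daily quota used:    {daily:.0%}")
print(f"Current minute used: {minute:.0%}")
print(f"Minutes to spend the full daily quota: {minutes_to_spend_quota(10_000_000, 20_000):.0f}")
```

With these figures the dashboard would read 6% while the current minute is fully spent, and even at the maximum per-minute rate it would take several hours of continuous use to reach the daily ceiling.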

How the Three Limits Interact

RPM caps how often you call the API. TPM caps how much data those calls carry per minute. The daily quota sets the total budget for the day. A generous daily budget does not help if per-minute throughput is too narrow for the workload's demands.

| Error Behavior | Likely Limit Hit | Key Indicator |
| --- | --- | --- |
| Errors on rapid successive commands, resolves after ~60s | RPM | requests-remaining near 0 |
| Errors on large-context commands even if spaced apart | TPM | tokens-remaining near 0 |
| Errors persist across minutes, dashboard shows high % | Daily quota | Dashboard usage at or near 100% |
| Errors appear on first command of the day | Likely 529 (overload), not rate limit | Check status code: 529 vs 429 |

Rate Limit Tiers: Free, Pro, Max, and Enterprise Compared

Tier Breakdown Table

The following table consolidates Anthropic's rate limits across subscription and API tiers. For Claude Code users, the "Approx. Claude Code Sessions/Day" column is an estimate that assumes an average session consumes approximately 500,000 to 1,000,000 tokens. That figure comes from empirical observation, and session length and project size cause wide variation.

†All TPM, RPM, and quota values are as of publication and subject to change.

| Tier | RPM† | Input TPM† | Output TPM† | Daily Token Quota† | Approx. Claude Code Sessions/Day |
| --- | --- | --- | --- | --- | --- |
| Free (API) | ~5 | ~20,000 | ~4,000 | Very limited | <1 |
| Tier 1 ($5 credits) | ~50 | ~20,000 | ~4,000 | ~300M/mo (~10M/day) | 10-20 |
| Tier 2 ($40 credits) | ~1,000 | ~40,000 | ~8,000 | ~1B/mo (~33M/day) | 33-66 |
| Tier 3 ($200 credits) | ~2,000 | ~80,000 | ~16,000 | ~2.5B/mo (~83M/day) | 83-166 |
| Tier 4 ($400 credits) | ~4,000 | ~160,000 | ~32,000 | ~5B/mo (~166M/day) | 166+ |
| Claude Pro ($20/mo) | ~50 | ~20,000 | ~4,000 | Subject to "activity limits" | 5-15 (variable) |
| Claude Max ($100/mo) | Not published - contact Anthropic | Not published - contact Anthropic | Not published - contact Anthropic | ~5x Pro | 25-75 (variable) |
| Claude Max ($200/mo) | Not published - contact Anthropic | Not published - contact Anthropic | Not published - contact Anthropic | ~20x Pro | 100+ (variable) |

~Values are approximate. Tier credit thresholds for advancement ($5, $40, $200, $400) are subject to change; verify at Anthropic Console.

To estimate tier needs: multiply average tokens per Claude Code session (check API logs) by expected sessions per day, then compare against daily quotas. For per-minute constraints, divide the largest single-request token count by TPM limits to see if any individual request could exceed the minute window.
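The estimate described above fits in a few lines. This is a sketch with a hypothetical function name and made-up workload figures; substitute the token counts from your own API logs and the current limits for your tier.

```python
def estimate_tier_fit(tokens_per_session, sessions_per_day,
                      largest_request_tokens, daily_quota, input_tpm):
    """Compare a workload against a tier's daily quota and per-minute input limit."""
    daily_need = tokens_per_session * sessions_per_day
    return {
        "daily_need": daily_need,
        "fits_daily_quota": daily_need <= daily_quota,
        # A ratio above 1.0 means one request alone exceeds the minute budget.
        "largest_request_to_tpm_ratio": largest_request_tokens / input_tpm,
    }


# Illustrative workload: 8 sessions of 750k tokens, largest single request 60k tokens,
# checked against the approximate Tier 1 figures from the table.
fit = estimate_tier_fit(750_000, 8, 60_000, daily_quota=10_000_000, input_tpm=20_000)
print(fit)
```

In this example the workload fits comfortably inside the daily quota, yet the largest request is three times the per-minute input budget, so the per-minute limit is the binding constraint.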

Claude Max vs Claude Pro: Is the Upgrade Worth It?

For Claude Code users specifically, the upgrade question hinges on per-minute throughput rather than daily quota. Claude Pro's activity limits throttle extended coding sessions, often triggering rate limits within 15 to 30 minutes of intensive use. Claude Max at the $100 tier provides a multiplier over Pro, which Anthropic describes as "5x the usage" (per Anthropic's pricing page). However, whether this 5x applies to per-minute limits (RPM/TPM) or only to the daily activity limit is not publicly documented; contact Anthropic support to confirm the exact mechanics. If the multiplier covers only the activity limit, developers who hit per-minute limits on Pro will likely hit the same pattern on Max during burst usage, just with more daily headroom. The $200 Max tier offers 20x Pro's usage and is the subscription option most likely to eliminate daily quota issues entirely, though per-minute limits will still apply to bursts such as ten or more rapid multi-file edits in sequence.

API Direct vs Claude Max Subscription

For developers spending more than roughly $60 to $80 per month on API credits, direct API access with prepaid credits costs less per token than a Max subscription, with the added benefit of explicit, documented rate limits and automatic tier scaling as spend increases. The API approach also offers the Batch API (see docs.anthropic.com/en/api/message-batches), which provides a cost reduction for non-time-sensitive work and does not count against standard rate limits. The trade-off: API-direct requires managing authentication, billing, and infrastructure, whereas Claude Max provides an integrated Claude Code experience with no additional setup.

Decoding the Error Messages

The 429 "Rate Limit Exceeded" Error

A 429 response from the Anthropic API includes an error object with type: "rate_limit_error" and a message field describing which limit you exceeded. The response includes a retry-after header indicating the number of seconds to wait before retrying. The value in retry-after reflects the reset window for the specific limit that triggered the error. If you exceeded TPM, retry-after aligns with the anthropic-ratelimit-tokens-reset timestamp. If you exceeded RPM, it aligns with anthropic-ratelimit-requests-reset. The anthropic-ratelimit-* headers on the response distinguish which limit triggered the 429.

The "Overloaded" Error (529)

A 529 status code indicates server-side capacity constraints and is not a rate limit. The API is temporarily unable to handle the request regardless of the developer's quota. Developers frequently conflate 529 with 429 because the symptom is similar: the request fails and must be retried. However, 529 errors require a different strategy. Do not count them against your rate limit backoff timer, but still use exponential backoff for retries. Aggressive immediate retries on a 529 increase server load and reduce your own success rate. A starting delay of 1-5 seconds with exponential growth is appropriate. Anthropic's documentation explicitly distinguishes 529 as an overloaded error separate from rate limiting.

Claude Code-Specific Error Messages

Within the Claude Code CLI, rate limits surface as "You've been rate limited" messages, sometimes accompanied by a "Waiting for capacity" spinner. Claude Code has built-in retry logic that silently retries on some rate limit responses, meaning developers may experience unexplained pauses without seeing an explicit error. When the silent retry also fails, the error surfaces to the user. This built-in retry behavior means that by the time a developer sees a rate limit error in Claude Code, the tool has already exhausted its internal retry budget.

The following Node.js snippet demonstrates catching and differentiating between 429 and 529 errors:

import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

// error.headers may be a plain object or a Fetch Headers instance depending on
// the SDK version, so read values through a helper that handles both.
function getHeader(headers, name) {
  if (!headers) return undefined;
  return typeof headers.get === "function" ? headers.get(name) : headers[name];
}

async function callWithErrorHandling(prompt) {
  try {
    const message = await client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 1024,
      messages: [{ role: "user", content: prompt }],
    });
    return message;
  } catch (error) {
    if (error instanceof Anthropic.RateLimitError) {
      const retryAfter = getHeader(error.headers, "retry-after") ?? 60;
      const tokensRemainingRaw = getHeader(error.headers, "anthropic-ratelimit-tokens-remaining");
      const requestsRemainingRaw = getHeader(error.headers, "anthropic-ratelimit-requests-remaining");

      // parseInt yields NaN for a missing header, and NaN === 0 is false.
      const tokensRemaining = parseInt(tokensRemainingRaw, 10);
      const requestsRemaining = parseInt(requestsRemainingRaw, 10);

      console.error(`429 Rate Limited. Retry after: ${retryAfter}s`);
      console.error(`  Tokens remaining: ${tokensRemainingRaw}`);
      console.error(`  Requests remaining: ${requestsRemainingRaw}`);

      if (tokensRemaining === 0) {
        console.error("  → Hit TPM limit. Reduce context size or wait.");
      } else if (requestsRemaining === 0) {
        console.error("  → Hit RPM limit. Slow down request frequency.");
      }
    } else if (error instanceof Anthropic.APIError && error.status === 529) {
      // Match 529 by HTTP status on the APIError base class.
      console.error("529 Overloaded: Server capacity issue, not a rate limit.");
      console.error("  → Retry with exponential backoff (1–5s initial delay).");
    } else if (error instanceof Anthropic.APIError) {
      console.error(`API error: ${error.status} - ${error.message}`);
    } else {
      console.error(`Unexpected error: ${error?.message ?? error}`);
    }
    throw error;
  }
}

Diagnosing Which Limit You're Actually Hitting

Rate Limit Diagnostic Flowchart:

[Request Failed] 
    → Check HTTP status code
        → 529? → Server overload, not your rate limit. Retry with exponential backoff.
        → 429? → Check response headers:
            → anthropic-ratelimit-requests-remaining = 0?
                → You hit RPM. Slow down command frequency.
            → anthropic-ratelimit-tokens-remaining = 0?
                → You hit TPM. Reduce context size or space out large requests.
            → Both above > 0? 
                → Check the 429 error message body for the `message` field — 
                  it specifies the exact limit type (e.g., daily quota, 
                  model-specific limit, or input/output token sub-limit).
                  Daily quota exhaustion will be stated explicitly.
        → 400/401/403? → Not a rate limit. Check API key and request format.
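The flowchart can also be expressed as a small function. This is a sketch: `classify_failure` and its return labels are hypothetical names, and `headers` is assumed to be a plain dict with lowercase header names (an `httpx` headers object, which is case-insensitive, would also work with `.get`).

```python
def classify_failure(status_code, headers):
    """Map a failed response to the likely cause, following the flowchart order."""
    if status_code == 529:
        return "overloaded"          # server capacity, not your quota
    if status_code in (400, 401, 403):
        return "not_rate_limit"      # check API key and request format
    if status_code != 429:
        return "other"

    def remaining(name):
        # Headers arrive as strings and may be missing entirely.
        try:
            return int(headers.get(name))
        except (TypeError, ValueError):
            return None

    if remaining("anthropic-ratelimit-requests-remaining") == 0:
        return "rpm"
    if remaining("anthropic-ratelimit-tokens-remaining") == 0:
        return "tpm"
    # Neither counter is at zero: the error body's message field names the limit.
    return "check_message_body"
```

The function only reads the two counters named in the flowchart; when both are above zero it defers to the error message rather than guessing.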

Reading Rate Limit Response Headers

Every API response from Anthropic includes the following rate limit headers (the exact set may vary by API version):

  • anthropic-ratelimit-requests-limit: Maximum requests permitted per minute
  • anthropic-ratelimit-requests-remaining: Requests remaining in the current minute window
  • anthropic-ratelimit-requests-reset: ISO 8601 timestamp when the request counter resets
  • anthropic-ratelimit-tokens-limit: Maximum tokens permitted per minute
  • anthropic-ratelimit-tokens-remaining: Tokens remaining in the current minute window
  • anthropic-ratelimit-tokens-reset: ISO 8601 timestamp when the token counter resets

Additionally, 429 responses include a retry-after header with the number of seconds to wait. The anthropic-ratelimit-* headers appear on both successful and failed responses, making proactive monitoring possible before a limit is actually hit.
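Because the reset headers carry timestamps rather than durations, a script usually needs to convert them into a wait time. The helper below is a minimal sketch with a hypothetical name; it assumes the value is an ISO 8601 timestamp such as `2025-06-01T12:00:30Z` and treats a timestamp without a timezone as UTC.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reset(reset_header: str, now: Optional[datetime] = None) -> float:
    """Return the seconds remaining until the reset timestamp (never negative)."""
    # datetime.fromisoformat accepts a trailing "Z" only on Python 3.11+,
    # so normalize it to an explicit UTC offset first.
    reset_at = datetime.fromisoformat(reset_header.replace("Z", "+00:00"))
    if reset_at.tzinfo is None:
        reset_at = reset_at.replace(tzinfo=timezone.utc)  # assumption: naive means UTC
    if now is None:
        now = datetime.now(timezone.utc)
    return max(0.0, (reset_at - now).total_seconds())
```

Passing `now` explicitly is only for testing; in normal use the function compares against the current UTC time.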

Using the Anthropic Dashboard Effectively

The Anthropic Console shows usage data, but there is a reported lag between actual API consumption and dashboard updates. Developers should not rely on the dashboard percentage as a real-time indicator. For accurate monitoring, header-based tracking is the only reliable method. Usage alerts in the Anthropic Console trigger notifications at configurable spend thresholds, but these track billing, not per-minute rate consumption.

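The dashboard percentage reflects daily quota only, so a 6% reading says nothing about per-minute availability. The following Python script polls the API at a fixed interval and prints the remaining per-minute budget from the response headers: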

import anthropic
import time
import signal
import threading

# Requires ANTHROPIC_API_KEY environment variable.
# pip install "anthropic>=0.25.0"
client = anthropic.Anthropic()

def monitor_rate_limits(interval_seconds=60):
    """Poll the API with a minimal request to check current rate limit status.
    Default 60s; values below 30s will consume significant RPM quota on lower tiers.
    Minimum enforced interval is 30s.
    """
    interval_seconds = max(interval_seconds, 30)
    stop_event = threading.Event()

    def _handle_signal(signum, frame):
        print("Shutdown signal received — stopping monitor.")
        stop_event.set()

    signal.signal(signal.SIGTERM, _handle_signal)
    signal.signal(signal.SIGINT, _handle_signal)

    while not stop_event.is_set():
        try:
            raw = client.messages.with_raw_response.create(
                model="claude-sonnet-4-20250514",
                max_tokens=1,
                messages=[{"role": "user", "content": "hi"}],
            )
            h = raw.http_response.headers
            print(f"[{time.strftime('%H:%M:%S')}] "
                  f"RPM: {h.get('anthropic-ratelimit-requests-remaining') or 'N/A'}"
                  f"/{h.get('anthropic-ratelimit-requests-limit') or 'N/A'} | "
                  f"TPM: {h.get('anthropic-ratelimit-tokens-remaining') or 'N/A'}"
                  f"/{h.get('anthropic-ratelimit-tokens-limit') or 'N/A'} | "
                  f"Reset: {h.get('anthropic-ratelimit-tokens-reset') or 'N/A'}")
        except anthropic.RateLimitError:
            print(f"[{time.strftime('%H:%M:%S')}] ⚠ Currently rate limited!")
        except anthropic.AuthenticationError:
            print(f"[{time.strftime('%H:%M:%S')}] FATAL: Authentication failed — stopping monitor")
            raise
        except Exception as exc:
            print(f"[{time.strftime('%H:%M:%S')}] ERROR: Unexpected exception: {exc!r}")
            raise
        stop_event.wait(timeout=interval_seconds)

if __name__ == "__main__":
    monitor_rate_limits()

Note that this approach itself consumes one request per poll, so the interval should not be too aggressive. The default of 60 seconds is reasonable for background monitoring; do not set below 30 seconds on Tier 1.

Practical Strategies to Avoid Rate Limits in Claude Code

Reduce Context Window Bloat

Irrelevant files waste tokens. Create a .claudeignore file (analogous to .gitignore; see Claude Code documentation for syntax) in the project root to exclude directories like node_modules/, dist/, .git/, large data files, and generated code. Structure prompts to reference specific files by path rather than asking Claude Code to "look at the project." Breaking large refactoring tasks into smaller, scoped operations (one module at a time rather than "refactor the entire codebase") reduces per-request context and keeps individual requests within TPM limits.

Implement Exponential Backoff with Jitter

What happens when three failed requests all retry at the same instant? They create a "thundering herd" that overwhelms the rate limit window and fails again. Naive retry loops that immediately re-send failed requests amplify this problem. The correct approach is exponential backoff with random jitter, respecting the retry-after header when present.

import anthropic
import time
import random

# Requires ANTHROPIC_API_KEY environment variable.
# pip install "anthropic>=0.25.0"
# max_retries=0 turns off the SDK's built-in retries so this wrapper
# is the only layer deciding when to retry.
client = anthropic.Anthropic(max_retries=0)

def call_with_backoff(messages, max_retries=5):
    last_exc = None
    for attempt in range(max_retries):
        try:
            return client.messages.create(
                model="claude-sonnet-4-20250514",
                max_tokens=4096,
                messages=messages,
            )
        except anthropic.RateLimitError as e:
            last_exc = e
            # Status errors expose the underlying httpx response as e.response.
            # retry-after may be absent; safely parse to float.
            raw_retry = e.response.headers.get("retry-after")
            try:
                retry_after = float(raw_retry) if raw_retry is not None else 0.0
            except (ValueError, TypeError):
                retry_after = 0.0
            # attempt=0 → max(retry_after, 1), attempt=1 → max(retry_after, 2), etc.
            base_delay = max(retry_after, float(2 ** attempt))
            jitter = random.uniform(0, base_delay * 0.5)
            wait_time = base_delay + jitter
            print(f"Rate limited (attempt {attempt + 1}/{max_retries}). "
                  f"Waiting {wait_time:.1f}s...")
            time.sleep(wait_time)
        except anthropic.APIStatusError as e:
            if e.status_code != 529:
                raise
            last_exc = e
            # 529 = server overload: random 1-5s initial delay, doubled each attempt.
            time.sleep(random.uniform(1, 5) * 2 ** attempt)
    raise RuntimeError("Max retries exceeded (rate limited or overloaded).") from last_exc

This wrapper respects retry-after, applies exponential backoff starting at 1 second (attempt 0) and doubling each subsequent attempt, adds random jitter up to 50% of the base delay to desynchronize concurrent retries, and handles 529 overload errors with a short random delay and backoff rather than aggressive immediate retries.

Batch and Sequence Your Requests

Instead of firing rapid successive commands, batch related changes into a single, well-scoped prompt. The /compact command in Claude Code (see Claude Code documentation for current syntax and behavior) summarizes the conversation so far, which can dramatically reduce input tokens for the next request. Alternatively, starting a new conversation achieves a clean reset. Strategic session management matters: after 10 to 15 exchanges in a session, the context can grow large enough that a single request exceeds the TPM limit. Starting a fresh conversation at that point trades conversation continuity for rate limit headroom.
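For scripts that drive the API directly, the decision of when to reset can be automated from the token counts each response already reports in its `usage` field (`input_tokens` and `output_tokens`). The class below is a sketch: `ContextBudget` is a hypothetical name, the 80% threshold is an arbitrary assumption, and the estimate ignores the size of the next user message.

```python
class ContextBudget:
    """Tracks the estimated input size of the next request in a conversation."""

    def __init__(self, input_tpm, threshold=0.8):
        self.input_tpm = input_tpm
        self.threshold = threshold  # fraction of TPM at which to reset
        self.estimated_next_input = 0

    def update(self, input_tokens, output_tokens):
        """Record the usage reported by the most recent response."""
        # The next request resends everything sent so far plus the last reply.
        self.estimated_next_input = input_tokens + output_tokens

    def should_reset(self):
        """True when the next request alone would use most of a minute's budget."""
        return self.estimated_next_input >= self.threshold * self.input_tpm


budget = ContextBudget(input_tpm=20_000)
budget.update(input_tokens=15_000, output_tokens=2_000)
print("Start a fresh conversation:", budget.should_reset())
```

When `should_reset()` returns true, summarizing the history or starting a new conversation before the next call keeps that request under the per-minute ceiling.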

Use Model Routing to Stay Under Limits

Rate limits are tracked per model (verify this behavior in Anthropic's rate limits documentation, as account-level limits may aggregate across models). Routing simpler tasks like formatting, adding comments, or making small syntactic fixes to Claude Haiku (the fastest, cheapest model) preserves Sonnet and Opus quota for complex reasoning tasks. Claude Code supports model configuration; consult Claude Code documentation for how to specify the model per task to prevent a linting fix from consuming the same quota pool as a complex architectural refactor.
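In a script that calls the API directly, routing can be as simple as a lookup on the kind of task. Everything here is an assumption for illustration: the task labels are hypothetical, and the two model IDs are examples that should be checked against Anthropic's current model list.

```python
# Hypothetical task labels that are cheap enough for a smaller model.
SIMPLE_TASKS = {"format", "comment", "rename", "lint_fix"}


def pick_model(task_kind,
               simple_model="claude-3-5-haiku-20241022",
               complex_model="claude-sonnet-4-20250514"):
    """Return the model ID to use for a given kind of task."""
    return simple_model if task_kind in SIMPLE_TASKS else complex_model


print(pick_model("lint_fix"))
print(pick_model("architecture_refactor"))
```

Unknown task kinds fall through to the larger model, which errs on the side of quality rather than quota savings.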

Time-Shift Heavy Workloads

Per-minute counters reset every 60 seconds. Daily quotas reset on a schedule that varies by account type (verify reset timing for your account in the Anthropic Console or API documentation). Scheduling large refactoring sessions to start immediately after a reset maximizes available throughput. Distributing work across the day rather than concentrating it in a two-hour burst prevents repeated TPM ceiling collisions.
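Spreading work out can be planned ahead when the approximate size of each request is known. The sketch below packs requests, in order, into consecutive one-minute windows so that no window exceeds the TPM limit. `schedule_by_minute` is a hypothetical helper, and it assumes fixed 60-second windows, which is a simplification of how the server may actually replenish capacity.

```python
from typing import List


def schedule_by_minute(request_tokens: List[int], tpm_limit: int) -> List[List[int]]:
    """Group requests, in order, into minute windows that each stay within tpm_limit."""
    windows: List[List[int]] = []
    used = 0
    for tokens in request_tokens:
        if tokens > tpm_limit:
            # No amount of spacing helps; the request itself must be made smaller.
            raise ValueError(f"request of {tokens} tokens exceeds the TPM limit")
        if not windows or used + tokens > tpm_limit:
            windows.append([])  # start the next minute
            used = 0
        windows[-1].append(tokens)
        used += tokens
    return windows


plan = schedule_by_minute([12_000, 7_000, 5_000, 18_000, 1_000], tpm_limit=20_000)
for minute, batch in enumerate(plan):
    print(f"minute {minute}: {batch} ({sum(batch):,} tokens)")
```

The number of windows is a lower bound on how many minutes the batch needs; a request larger than the limit raises an error because it can only be fixed by reducing its context.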

Building a Rate Limit Monitor for Your Team

For teams with multiple developers sharing API access or using Claude Code against a shared organizational account, centralized rate limit monitoring prevents one developer's burst from impacting others.

import express from "express";
import Anthropic from "@anthropic-ai/sdk";

const app = express();
const MAX_LOG_SIZE = 1000;
const rateLimitLog = [];

const client = new Anthropic();

const ALLOWED_FIELDS = new Set([
  "model", "messages", "max_tokens", "system", "temperature", "stop_sequences",
]);

function sanitizeBody(body) {
  if (!body || typeof body !== "object" || Array.isArray(body)) {
    throw new Error("Request body must be a JSON object");
  }
  const filtered = Object.fromEntries(
    Object.entries(body).filter(([k]) => ALLOWED_FIELDS.has(k))
  );
  if (!filtered.model || !filtered.messages) {
    throw new Error("Required fields 'model' and 'messages' are missing");
  }
  return filtered;
}

function sanitizeHeaderValue(value) {
  if (typeof value !== "string") return "unknown";
  // Strip control characters and limit length
  return value.replace(/[^\x20-\x7E]/g, "").slice(0, 64) || "unknown";
}

function appendLog(entry) {
  rateLimitLog.push(entry);
  // Trim to enforce hard cap; handles concurrent push race
  if (rateLimitLog.length > MAX_LOG_SIZE) {
    rateLimitLog.splice(0, rateLimitLog.length - MAX_LOG_SIZE);
  }
}

// Shared-secret check for development/internal use only.
// Replace with real authentication before deploying anywhere exposed.
app.use((req, res, next) => {
  const expected = process.env.INTERNAL_TOKEN;
  // Reject when the token is unset; otherwise a missing header would match undefined.
  if (!expected || req.headers["x-internal-token"] !== expected) {
    return res.status(401).json({ error: "Unauthorized" });
  }
  next();
});

async function proxyAndLog(req, res) {
  try {
    const sanitizedBody = sanitizeBody(req.body);
    // .withResponse() returns the parsed message plus the raw fetch Response.
    const { data, response } = await client.messages
      .create(sanitizedBody)
      .withResponse();
    const headers = response.headers;
    const entry = {
      timestamp: new Date().toISOString(),
      user: sanitizeHeaderValue(req.headers["x-team-user"]),
      requestsRemaining: headers.get("anthropic-ratelimit-requests-remaining"),
      requestsLimit: headers.get("anthropic-ratelimit-requests-limit"),
      tokensRemaining: headers.get("anthropic-ratelimit-tokens-remaining"),
      tokensLimit: headers.get("anthropic-ratelimit-tokens-limit"),
      tokensReset: headers.get("anthropic-ratelimit-tokens-reset"),
    };
    appendLog(entry);
    console.log(JSON.stringify(entry));

    const tokensRemaining = parseInt(entry.tokensRemaining, 10);
    const tokensLimit = parseInt(entry.tokensLimit, 10);

    if (!isNaN(tokensRemaining) && !isNaN(tokensLimit) && tokensLimit > 0) {
      if (tokensRemaining < tokensLimit * 0.1) {
        console.warn(`⚠ Token budget below 10%, user: ${entry.user}`);
      }
    }

    res.json(data);
  } catch (error) {
    // Validation failures thrown by sanitizeBody map to 400.
    if (error instanceof Error &&
        (error.message.startsWith("Required fields") ||
         error.message === "Request body must be a JSON object")) {
      return res.status(400).json({ error: error.message });
    }
    const isApiError = error instanceof Anthropic.APIError;
    // Connection-level API errors carry no HTTP status; report those as 502.
    const status = isApiError ? (error.status ?? 502) : 500;
    // Do not expose raw error.message for non-API errors (may contain stack/internals)
    const message = isApiError
      ? error.message
      : "Internal proxy error";
    res.status(status).json({ error: message });
  }
}

app.post("/v1/messages", express.json(), proxyAndLog);
app.get("/rate-limit-log", (req, res) => res.json(rateLimitLog.slice(-100)));
app.listen(3001, () => console.log("Rate limit proxy running on :3001"));

This proxy intercepts all API traffic, validates request bodies against an allowlist of fields, sanitizes caller-supplied headers, extracts rate limit headers, logs per-user consumption, and warns when token budget drops below 10%. The /rate-limit-log endpoint provides a quick audit trail. Note: this code uses ES module syntax and requires "type": "module" in package.json or a .mjs file extension.

When to Consider Enterprise Tier or API Proxy

Your team has outgrown Max subscriptions when multiple developers hit rate limit errors during normal working hours, when CI/CD pipelines that use Claude Code need guaranteed throughput, and when combined spending exceeds $400/month across team members. Anthropic's Enterprise tier offers custom rate limits negotiated directly, higher base quotas, and dedicated capacity (per Anthropic enterprise sales materials). Third-party API proxy tools pool rate limits across team members, implement queuing, and provide granular usage analytics beyond what the Anthropic Console offers natively.

Key Takeaways and Quick Reference

  • RPM (how often), TPM (how much per minute), and daily quota (how much total) are three independent limits. They do not share a counter.
  • The dashboard percentage reflects daily quota only. A 6% reading tells you nothing about per-minute availability.
  • Claude Code burns through tokens at 10-100x the rate of chat because of multi-turn conversations, growing context windows, and tool-use round trips. Budget accordingly.
  • Inspect anthropic-ratelimit-* headers on every response to identify which specific limit is approaching. The diagnostic flowchart above maps status codes and header values to root causes.
  • Both 429 (rate limit) and 529 (overload) errors require exponential backoff. The difference: do not count 529 against your rate limit backoff timer, and never hammer an overloaded server with aggressive retries.
  • Reduce context aggressively using .claudeignore, scoped prompts, and periodic /compact resets.
  • Route simple tasks to cheaper models to preserve Sonnet/Opus quota for complex work.
  • Monitor per-minute consumption programmatically rather than relying on the lagging dashboard. The Python and Node.js examples above provide a starting point.
  • If your monthly spend exceeds $60-$80, compare the cost of direct API access against Max subscription pricing. Direct API gives you explicit, documented limits and automatic tier scaling.

Use the tier comparison table to evaluate whether the current subscription matches actual usage patterns. The per-minute limits are almost always the constraint behind the 6% mystery, and solving them requires understanding the mechanics rather than simply upgrading tiers.

SitePoint Team

Sharing our passion for building incredible internet things.

© 2000 – 2026 SitePoint Pty. Ltd.