Claude API 429 Error Handling | Production Python Guide

Handling Claude API 429 errors in production Python applications demands more than a basic retry loop. Anthropic's rate-limiting system enforces multiple overlapping constraints simultaneously, and a naive approach to retries can amplify failures rather than absorb them.

This article builds a complete, production-ready rate-limit handler from the ground up. It parses Anthropic's proprietary response headers, implements jitter-aware exponential backoff, adds proactive throttling before a 429 ever fires, and wraps the entire mechanism in a circuit breaker for sustained outage resilience. Code examples are designed to be assembled into a single module (claude_rate_limiter.py); a complete runnable file is available for download at the end of the article.

Note: The official Anthropic Python SDK includes built-in retry and rate-limit handling. This article is appropriate when using raw HTTP via requests, or when you need custom control beyond SDK defaults (proactive throttling, circuit breaking, custom observability hooks).

Prerequisites

Python 3.11+ is required for reliable ISO 8601 timezone parsing via datetime.fromisoformat. For Python 3.8-3.10, substitute python-dateutil (see note in the header-parsing section). Dependencies (tested versions): requests>=2.31, tenacity>=8.2, responses>=0.23 (test only). The test examples also use pytest. Install with pip install requests tenacity responses pytest. You will also need an active Anthropic API key, stored in the ANTHROPIC_API_KEY environment variable.

Why 429 Errors Deserve Their Own Strategy
Decoding Claude's Rate-Limit Headers
Implementing Exponential Backoff with Jitter
Building the Production-Ready ClaudeRateLimitHandler Class
Adding a Circuit Breaker for Sustained Outages
Testing Your Retry Logic Without Burning API Credits
Key Takeaways and Quick-Reference Checklist

Why 429 Errors Deserve Their Own Strategy

What a 429 Actually Means (and Why Claude's Implementation Differs)

HTTP 429 Too Many Requests signals that a client has exceeded a server's rate limit. Most APIs enforce a single rate-limit dimension. Anthropic enforces three simultaneously: request-rate limits (requests per minute), token-rate limits (tokens per minute, covering both input and output tokens), and concurrent-request limits (the number of in-flight requests at any instant). Each dimension has its own quota and its own reset window. (See Anthropic's rate limits documentation for current per-tier quotas.)

Generic retry logic typically assumes one throttle boundary. With Claude, a request that passes the requests-per-minute check can still be rejected by the token-budget gate, or blocked independently by the concurrency ceiling. Any retry strategy that ignores this multi-bucket architecture will misinterpret which constraint it violated and miscalculate how long to wait.

The Cost of Getting It Wrong

Unhandled or poorly handled 429 errors compound within seconds. When multiple workers retry at the same cadence, they create retry storms: synchronized bursts that spike load and trigger further rejections. This is the thundering-herd problem, where each retry round recreates the original overload. The cost is concrete. A streaming request that disconnects partway through can still consume billable tokens (verify current billing behavior for aborted streaming requests in Anthropic's pricing documentation). In batch processing scenarios, a single aggressive job can exhaust an organization-level rate limit, effectively locking every other service on that API key out of the Claude API for the duration of the reset window. The downstream result: compounding latency, wasted spend, and degraded availability across the entire system.

Decoding Claude's Rate-Limit Headers

The `anthropic-ratelimit-*` Header Family

Every response from the Anthropic API includes a set of proprietary headers that expose the current state of each rate-limit bucket (as documented in Anthropic's rate limits reference). These headers follow a consistent naming pattern across both the request and token dimensions.

For the request bucket:

anthropic-ratelimit-requests-limit: maximum requests allowed in the current window.
anthropic-ratelimit-requests-remaining: requests left before hitting the cap.
anthropic-ratelimit-requests-reset: ISO 8601 timestamp when the request bucket resets.

The token bucket follows the same structure:

anthropic-ratelimit-tokens-limit: maximum tokens allowed in the current window.
anthropic-ratelimit-tokens-remaining: tokens left in the current window.
anthropic-ratelimit-tokens-reset: ISO 8601 timestamp when the token bucket resets.

The reset timestamps are formatted as ISO 8601 strings (e.g., 2025-01-15T12:00:30Z). Calculating wait time requires parsing the timestamp, comparing it to the current UTC time, and converting the difference to seconds.
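The parse, compare, and convert steps described above can be sketched in isolation. This is a minimal illustration, not part of the article's module: the helper name seconds_until_reset and its now parameter are hypothetical, and treating a timestamp without a UTC offset as UTC is an assumption.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reset(reset_header: str, now: Optional[datetime] = None) -> float:
    """Return the non-negative number of seconds from `now` until the reset time."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Swap a trailing "Z" for an explicit offset so older fromisoformat accepts it.
    reset_time = datetime.fromisoformat(reset_header.replace("Z", "+00:00"))
    if reset_time.tzinfo is None:
        # Assumption: a timestamp without an offset is interpreted as UTC.
        reset_time = reset_time.replace(tzinfo=timezone.utc)
    # A reset time already in the past means no wait is needed.
    return max((reset_time - now).total_seconds(), 0.0)


fixed_now = datetime(2025, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_reset("2025-01-15T12:00:30Z", now=fixed_now))  # 30.0
```

Passing now explicitly keeps the calculation deterministic in tests; production callers would normally omit it.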

`Retry-After` vs. Reset Headers: Which to Obey

As of this writing, Anthropic includes a Retry-After header only on actual 429 responses (verify against current API documentation before deployment). When present, this header gives a server-computed value in integer seconds representing the minimum time a client should wait before retrying. (The HTTP spec also allows an HTTP-date format for Retry-After; this implementation assumes seconds only, which matches Anthropic's current behavior.) The reset headers, by contrast, appear on every response regardless of status code.

Prioritize deterministically: use Retry-After when present, since it reflects the server's most accurate assessment of when capacity will free up. Fall back to computing wait time from the nearest reset timestamp if Retry-After is absent. Resort to exponential backoff as the last fallback when neither header is usable.

import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("claude_rate_limiter")


def parse_rate_limit_headers(response) -> dict:
    """Extract Anthropic rate-limit headers and compute actionable wait time.

    Returns a dict with processed rate-limit info (keys like 'requests_remaining',
    'tokens_remaining', etc.) — not raw HTTP header names.
    """
    headers = response.headers

    def _to_int(val: Optional[str]) -> Optional[int]:
        if val is None:
            return None
        try:
            result = int(val)
            return result if result >= 0 else None
        except (ValueError, TypeError):
            return None

    info = {
        "requests_limit": _to_int(headers.get("anthropic-ratelimit-requests-limit")),
        "requests_remaining": _to_int(headers.get("anthropic-ratelimit-requests-remaining")),
        "requests_reset": headers.get("anthropic-ratelimit-requests-reset"),
        "tokens_limit": _to_int(headers.get("anthropic-ratelimit-tokens-limit")),
        "tokens_remaining": _to_int(headers.get("anthropic-ratelimit-tokens-remaining")),
        "tokens_reset": headers.get("anthropic-ratelimit-tokens-reset"),
        "retry_after": headers.get("Retry-After"),
        "wait_seconds": None,
    }

    if info["retry_after"] is not None:
        try:
            info["wait_seconds"] = float(info["retry_after"])
        except (ValueError, TypeError):
            pass  # fall through to reset-header logic
        else:
            return info

    now = datetime.now(timezone.utc)
    earliest_wait = None
    for key in ("requests_reset", "tokens_reset"):
        raw = info.get(key)
        if raw:
            try:
                # Requires Python 3.11+; for 3.8–3.10 use:
                #   from dateutil.parser import parse as parse_dt
                #   reset_time = parse_dt(raw)
                reset_time = datetime.fromisoformat(raw.replace("Z", "+00:00"))
                delta = (reset_time - now).total_seconds()
                wait = max(delta, 0.0)
                if earliest_wait is None or wait < earliest_wait:
                    earliest_wait = wait
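            except (ValueError, TypeError):
                # TypeError covers a timestamp without a UTC offset
                # (subtracting an aware datetime from a naive one).
                logger.warning("Unparseable rate-limit reset header %r: %r", key, raw)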

    info["wait_seconds"] = earliest_wait
    return info

This function returns a structured dictionary that downstream retry logic can inspect directly. If Retry-After exists and parses as a number, it takes precedence. Otherwise, the function computes the smallest non-negative wait from the two reset timestamps; a reset time that has already passed counts as zero. A None value for wait_seconds signals that no header-based wait could be determined and the caller should fall back to backoff math. Numeric header values are cast to int at parse time and validated to be non-negative; malformed or missing values return as None. Malformed reset timestamps are logged and skipped rather than raising an exception.
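The parser above treats Retry-After as a number of seconds only. If you want to tolerate the HTTP-date form the HTTP spec also allows, one possible sketch follows. The helper name parse_retry_after is hypothetical, and treating a date without a timezone as UTC is an assumption; it relies only on the standard library's email.utils.parsedate_to_datetime.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Convert a Retry-After value (delta-seconds or HTTP-date) to seconds."""
    if value is None:
        return None
    value = value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "2".
        return float(value)
    try:
        retry_at = parsedate_to_datetime(value)  # HTTP-date form
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if retry_at.tzinfo is None:
        # Assumption: a date without a timezone is interpreted as UTC.
        retry_at = retry_at.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return max((retry_at - now).total_seconds(), 0.0)


fixed_now = datetime(2025, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
print(parse_retry_after("2"))  # 2.0
print(parse_retry_after("Wed, 15 Jan 2025 12:00:30 GMT", now=fixed_now))  # 30.0
```

Returning None for anything unparseable keeps the same contract as wait_seconds: the caller falls through to the reset headers or to backoff.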

Implementing Exponential Backoff with Jitter

Why Naive Retries Make Things Worse

When multiple workers receive a 429 at the same instant and all retry after exactly the same fixed delay, they arrive at the API in lockstep. This synchronized burst recreates the original overload condition at every retry interval. Rather than a smooth curve, the load graph shows a series of sharp spikes, each large enough to trigger another round of 429 rejections. Without jitter, the retry pattern itself sustains the congestion, and the system never recovers.
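The lockstep effect is easy to see in a toy simulation. The sketch below is illustrative only: the worker count, the 2-second delay, and the 0.1-second bucket width are arbitrary assumptions, and all function names are hypothetical. It compares how many retries land in the busiest time bucket with a fixed delay versus a uniformly random one.

```python
import random
from collections import Counter
from typing import List


def fixed_delay_retries(workers: int, delay: float) -> List[float]:
    """Every worker retries after exactly the same delay."""
    return [delay] * workers


def jittered_retries(workers: int, delay: float, rng: random.Random) -> List[float]:
    """Every worker picks a uniformly random delay between 0 and `delay`."""
    return [rng.uniform(0, delay) for _ in range(workers)]


def peak_arrivals(times: List[float], window: float = 0.1) -> int:
    """Largest number of retries landing in any single `window`-second bucket."""
    buckets = Counter(int(t / window) for t in times)
    return max(buckets.values())


rng = random.Random(7)  # seeded so the run is repeatable
workers = 20
print("fixed delay peak:", peak_arrivals(fixed_delay_retries(workers, 2.0)))
print("jittered peak:", peak_arrivals(jittered_retries(workers, 2.0, rng)))
```

With a fixed delay, all 20 retries fall into the same bucket; with random delays they spread across the window, so the busiest bucket holds only a fraction of them.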

Full-Jitter vs. Decorrelated-Jitter Strategies

Adding randomness to retry timing breaks the synchronization. Two well-studied approaches dominate:

Full jitter computes the sleep duration as:

# pseudocode: random(0, min(cap, base * 2**attempt))

Each retry selects a uniformly random value between zero and the exponentially increasing ceiling. This approach is simple to implement and effective at dispersing retries across the wait window.

Decorrelated jitter uses the formula:

# pseudocode: sleep = min(cap, random(base, previous_sleep * 3))
# Initialize previous_sleep = base before the first attempt.
# On each retry, update previous_sleep to the actual sleep duration used.

Because each sleep depends on the prior sleep rather than the attempt count alone, the resulting wait sequence is less predictable and tends to spread retries more evenly under sustained high-throughput conditions.

Full jitter is a reasonable default for most applications integrating with the Claude API. Decorrelated jitter becomes advantageous at concurrency levels where the probability of retry collisions becomes measurable, roughly 50+ simultaneous callers sharing one API key (the AWS Architecture Blog analysis covers the statistical tradeoffs in detail).
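Both formulas translate directly into small functions. The following is a minimal sketch of the two pseudocode lines above; the function names, the default base and cap values, and the optional rng parameter are illustrative assumptions.

```python
import random
from typing import Optional


def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0,
                rng: Optional[random.Random] = None) -> float:
    """Sleep drawn uniformly between 0 and min(cap, base * 2**attempt)."""
    uniform = rng.uniform if rng else random.uniform
    return uniform(0, min(cap, base * (2 ** attempt)))


def decorrelated_jitter(previous_sleep: float, base: float = 1.0, cap: float = 60.0,
                        rng: Optional[random.Random] = None) -> float:
    """Sleep drawn uniformly between base and previous_sleep * 3, clamped to cap."""
    uniform = rng.uniform if rng else random.uniform
    return min(cap, uniform(base, previous_sleep * 3))


rng = random.Random(42)   # seeded only to make the demo repeatable
previous = 1.0            # initialize previous_sleep to base
for attempt in range(1, 6):
    previous = decorrelated_jitter(previous, rng=rng)
    print(f"attempt {attempt}: full={full_jitter(attempt, rng=rng):.2f}s "
          f"decorrelated={previous:.2f}s")
```

Note that the decorrelated variant carries state between retries (the previous sleep), while full jitter needs only the attempt number.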

Integrating with the `tenacity` Library

Hand-rolled retry loops work, but the tenacity library offers composability, built-in logging hooks, and retry statistics out of the box. The key design decision is writing a custom retry predicate that triggers only on HTTP 429, not on client errors (400) or server errors (500), which require different handling paths.

Note: The following code block assumes parse_rate_limit_headers is defined as shown in the "Decoding Claude's Rate-Limit Headers" section above, or imported from a shared module (e.g., from claude_rate_limiter import parse_rate_limit_headers).

# Needed only when this block lives in a separate file; omit inside claude_rate_limiter.py.
from claude_rate_limiter import parse_rate_limit_headers

import random
import logging
import requests
from tenacity import (
    retry, stop_after_attempt, retry_if_exception,
    before_sleep_log, RetryCallState
)

logger = logging.getLogger("claude_retry")


class RateLimitError(Exception):
    def __init__(self, response):
        self.status_code = response.status_code
        # Keep header lookups case-insensitive; a plain dict() copy would make
        # headers.get("Retry-After") miss a lowercase "retry-after".
        self.headers = requests.structures.CaseInsensitiveDict(response.headers)
        self.body = response.text
        # Store a lightweight proxy object instead of the full Response
        # to avoid unbounded memory retention through tenacity's retry state.
        self._response_proxy = type("ResponseProxy", (), {
            "status_code": response.status_code,
            "headers": self.headers,
        })()
        super().__init__("429 Too Many Requests")

    @property
    def response(self):
        return self._response_proxy


def _header_aware_wait(retry_state: RetryCallState) -> float:
    exc = retry_state.outcome.exception()
    if isinstance(exc, RateLimitError):
        info = parse_rate_limit_headers(exc.response)
        if info["wait_seconds"] is not None:
            return info["wait_seconds"] + random.uniform(0, 0.5)

    attempt = retry_state.attempt_number
    base, cap = 1.0, 60.0
    full_jitter_wait = random.uniform(0, min(cap, base * (2 ** attempt)))
    return full_jitter_wait


@retry(
    retry=retry_if_exception(lambda e: isinstance(e, RateLimitError)),
    wait=_header_aware_wait,
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def call_claude_api(url: str, headers: dict, payload: dict) -> dict:
    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    if resp.status_code == 429:
        raise RateLimitError(resp)
    resp.raise_for_status()
    return resp.json()

The custom wait callback inspects the Retry-After and reset headers first, adding a small random jitter (up to 0.5 seconds) to prevent synchronization even among clients that receive identical Retry-After values. When no header information is available, it falls back to full-jitter exponential backoff capped at 60 seconds. The before_sleep_log hook logs every retry event with its computed wait time.

Building the Production-Ready `ClaudeRateLimitHandler` Class

Class Architecture Overview

The handler class wraps any callable that returns a requests.Response and unifies retry logic, proactive throttling, and monitoring hooks in a single object. It follows a composition-over-inheritance design: rather than subclassing an HTTP client, it accepts a request function and orchestrates the execution loop around it. Callback hooks for on_retry, on_success, and on_failure let you integrate with any monitoring system without coupling the handler to a specific observability stack.

Thread safety: This implementation is not thread-safe. If you share a single ClaudeRateLimitHandler instance across threads, protect calls to execute() with a threading.Lock, or instantiate one handler per thread.
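One way to apply the lock suggested above is a thin wrapper around the handler. This is a sketch under stated assumptions: the LockedExecutor name is hypothetical, and it works with any object exposing an execute(request_fn, *args, **kwargs) method, which is the only thing it assumes about the handler.

```python
import threading
from typing import Any, Callable


class LockedExecutor:
    """Serializes calls to a wrapped handler's execute() method."""

    def __init__(self, handler: Any):
        self._handler = handler
        self._lock = threading.Lock()

    def execute(self, request_fn: Callable, *args: Any, **kwargs: Any) -> Any:
        # Only one thread at a time may run the handler's retry loop.
        with self._lock:
            return self._handler.execute(request_fn, *args, **kwargs)
```

The lock is held for the whole call, including the HTTP request and any sleeps, so requests are fully serialized. If that costs too much throughput, one handler per thread is the simpler alternative.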

Proactive Throttling: Don't Wait for the 429

The most effective 429 handling strategy is to avoid the 429 entirely. Since every Anthropic API response includes requests-remaining and tokens-remaining headers, the handler can track these values from the most recent successful response. Before dispatching a new request, it checks whether remaining quota has fallen below a configurable threshold. If so, it sleeps until the next reset window rather than sending a request that will exceed the remaining quota. This eliminates wasted round trips and keeps the client within its allocation.

Concurrency note: Effectiveness depends on proactive_threshold being set relative to your concurrent worker count. With proactive_threshold=2 and 10 concurrent workers sharing one handler instance, most workers will still fire requests before the throttle engages.

Full Class Implementation Walk-Through

Note: This class depends on parse_rate_limit_headers defined in the "Decoding Claude's Rate-Limit Headers" section. All code in this article is designed to live in a single claude_rate_limiter.py module.

import time
import random
import logging
from datetime import datetime, timezone
from typing import Callable, Optional, Any

logger = logging.getLogger("claude_rate_limiter")


class MaxRetriesExceeded(Exception):
    """Raised when all retry attempts for 429 rate limiting have been exhausted."""
    def __init__(self, message: str, retry_count: int = 0):
        super().__init__(message)
        self.retry_count = retry_count


class ClaudeRateLimitHandler:
    """Wraps a request callable with retry logic, proactive throttling, and observability hooks.

    Attributes:
        _last_rate_limit_info: Stores processed rate-limit info as returned by
            parse_rate_limit_headers (keys like 'requests_remaining', 'tokens_remaining'),
            not raw HTTP header names.
        proactive_threshold: If either requests-remaining or tokens-remaining falls
            at or below this value, the handler sleeps until the corresponding reset.
    """

    def __init__(
        self,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        max_retries: int = 5,
        proactive_threshold: int = 2,
        on_retry: Optional[Callable] = None,
        on_success: Optional[Callable] = None,
        on_failure: Optional[Callable] = None,
    ):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.proactive_threshold = proactive_threshold
        self.on_retry = on_retry
        self.on_success = on_success
        self.on_failure = on_failure
        self._last_rate_limit_info: dict = {}

    def _parse_headers(self, response) -> dict:
        return parse_rate_limit_headers(response)

    def _calculate_wait(self, attempt: int, response=None) -> float:
        if response is not None:
            info = self._parse_headers(response)
            if info["wait_seconds"] is not None:
                return info["wait_seconds"] + random.uniform(0, 0.5)
        cap = min(self.max_delay, self.base_delay * (2 ** attempt))
        return random.uniform(0, cap)

    def _should_throttle_proactively(self) -> Optional[float]:
        for resource in ("requests", "tokens"):
            remaining = self._last_rate_limit_info.get(f"{resource}_remaining")
            reset = self._last_rate_limit_info.get(f"{resource}_reset")
            # remaining is already Optional[int] after parse_rate_limit_headers
            if remaining is not None and remaining <= self.proactive_threshold:
                if reset:
                    try:
                        reset_time = datetime.fromisoformat(
                            reset.replace("Z", "+00:00")
                        )
                        wait = (reset_time - datetime.now(timezone.utc)).total_seconds()
                        return max(wait, 0.0)
                    except ValueError:
                        logger.warning("Unparseable reset header for %r: %r", resource, reset)
        return None

    def execute(self, request_fn: Callable, *args, **kwargs) -> Any:
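        # Keep self._last_rate_limit_info from the previous call so proactive
        # throttling can act on the most recent response's quota headers.
        retry_count = 0  # local variable, safe for concurrent use of separate calls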

        for attempt in range(self.max_retries + 1):
            throttle_wait = self._should_throttle_proactively()
            if throttle_wait and throttle_wait > 0:
                logger.info("Proactive throttle: sleeping %.2fs", throttle_wait)
                time.sleep(throttle_wait)

            response = request_fn(*args, **kwargs)
            self._last_rate_limit_info = self._parse_headers(response)

            if response.status_code // 100 == 2:
                if self.on_success:
                    self.on_success(response)
                return response

            if response.status_code == 429:
                retry_count += 1
                if retry_count > self.max_retries:
                    break
                wait = self._calculate_wait(attempt + 1, response)
                logger.warning(
                    "429 received (attempt %d/%d), waiting %.2fs",
                    attempt + 1, self.max_retries + 1, wait,
                )
                if self.on_retry:
                    self.on_retry(attempt, wait, response)
                time.sleep(wait)
                continue

            if self.on_failure:
                self.on_failure(retry_count)
            response.raise_for_status()

        if self.on_failure:
            self.on_failure(retry_count)
        raise MaxRetriesExceeded(
            f"Max retries ({self.max_retries}) exceeded for 429 rate limiting",
            retry_count=retry_count,
        )

The execute() method is the primary entry point. On each iteration, it first checks whether proactive throttling is warranted. If the remaining quota for either requests or tokens sits at or below proactive_threshold, the handler sleeps until the corresponding reset window. When a 429 does arrive, the three-tier wait calculation kicks in: Retry-After first, reset-header math second, jittered exponential backoff third. Non-429 errors raise immediately rather than entering the retry loop, keeping error handling responsibility clean. Any 2xx status code (200, 201, etc.) counts as success. The retry_count is a local variable so that concurrent calls on the same handler instance (even though the handler is not fully thread-safe) do not corrupt each other's counters. If all retries are exhausted, including the edge case where max_retries=0, a MaxRetriesExceeded exception fires rather than silently returning None.
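The callback hooks are called as on_retry(attempt, wait, response), on_success(response), and on_failure(retry_count) in the class above. A minimal sketch of a collector that matches those signatures follows; the RetryMetrics class is hypothetical, and its in-memory counters stand in for whatever metrics client you actually use.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class RetryMetrics:
    """Hypothetical in-memory counters; swap in a real metrics client as needed."""
    retries: int = 0
    successes: int = 0
    failures: int = 0
    waits: List[float] = field(default_factory=list)

    def on_retry(self, attempt: int, wait: float, response: Any) -> None:
        # Called before each sleep that follows a 429.
        self.retries += 1
        self.waits.append(wait)

    def on_success(self, response: Any) -> None:
        # Called once when a 2xx response is returned.
        self.successes += 1

    def on_failure(self, retry_count: int) -> None:
        # Called before a non-429 error is raised or when retries run out.
        self.failures += 1


metrics = RetryMetrics()
# Wiring, assuming the ClaudeRateLimitHandler class defined above:
# handler = ClaudeRateLimitHandler(
#     on_retry=metrics.on_retry,
#     on_success=metrics.on_success,
#     on_failure=metrics.on_failure,
# )
```

Because the hooks are plain callables, the same object can feed logs, counters, or a tracing span without the handler knowing which.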

Usage Example: Calling Claude's Messages API

import os
import requests

ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")
if not ANTHROPIC_API_KEY:
    raise ValueError("Set ANTHROPIC_API_KEY environment variable")

API_URL = "https://api.anthropic.com/v1/messages"


def send_message() -> requests.Response:
    return requests.post(
        API_URL,
        headers={
            "x-api-key": ANTHROPIC_API_KEY,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={
            "model": "claude-sonnet-4-5",  # Replace with current model ID: https://docs.anthropic.com/en/docs/about-claude/models
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": "Explain rate limiting."}],
        },
        timeout=30,
    )


handler = ClaudeRateLimitHandler(max_retries=5, proactive_threshold=3)
response = handler.execute(send_message)
print(response.json()["content"][0]["text"])

Adding a Circuit Breaker for Sustained Outages

When Retries Aren't Enough

Exponential backoff with a retry cap recovers from transient 429 spikes within the configured retry budget. It does not handle sustained unavailability. When Anthropic enforces an organization-level rate limit or experiences an incident lasting minutes, retries with backoff still consume compute resources, hold threads in sleep states, and delay any fallback logic the application might use (cached responses, alternative models, graceful degradation). A circuit breaker solves this by failing fast once the system detects that retries are consistently futile.

Circuit Breaker States: Closed, Open, Half-Open

The circuit breaker operates in three states. Closed is the normal state: requests flow through and failures are counted. After a configured number of consecutive failures (the implementation below counts failures since the last success and does not apply a time window), the breaker transitions to Open. In this state, the breaker immediately rejects all requests with a CircuitOpenError; no API call is made and no thread blocks on sleep. After a cooldown period elapses, the breaker moves to Half-Open: it permits a probe request through (single-threaded only; in multi-threaded applications, multiple threads may enter Half-Open simultaneously, see the threading note below). If the probe succeeds, the breaker returns to Closed. If it fails, the breaker reopens and the cooldown resets.

Thread safety: This CircuitBreaker implementation is not thread-safe. In multi-threaded applications, protect allow_request(), record_success(), and record_failure() with a threading.Lock to enforce single-probe behavior in the Half-Open state.

import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.state = CircuitState.CLOSED
        self._failure_count = 0
        self._last_failure_time: float = 0

    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.time()
        if self._failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

    def record_success(self):
        self._failure_count = 0
        self._last_failure_time = 0
        self.state = CircuitState.CLOSED

    def allow_request(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            if time.time() - self._last_failure_time >= self.cooldown:
                self.state = CircuitState.HALF_OPEN
                return True
            raise CircuitOpenError("Circuit is open; failing fast")
        return True  # HALF_OPEN: allow probe (single-threaded only)

To integrate with ClaudeRateLimitHandler, wrap the execute() call. Note that CircuitOpenError is caught separately to avoid incorrectly incrementing the failure count when the circuit is open. Fast-fail rejections are not new failures:

breaker = CircuitBreaker(failure_threshold=5, cooldown=60.0)
handler = ClaudeRateLimitHandler(max_retries=5, proactive_threshold=3)


def execute_with_circuit_breaker(request_fn, *args, **kwargs):
    breaker.allow_request()  # Raises CircuitOpenError if open
    try:
        response = handler.execute(request_fn, *args, **kwargs)
        breaker.record_success()
        return response
    except CircuitOpenError:
        raise  # fast-fail: not a new failure, do not increment counter
    except Exception:
        breaker.record_failure()
        raise

Call breaker.allow_request() before dispatching, breaker.record_success() on a successful response, and breaker.record_failure() each time execution fails. Because allow_request() runs before the try block, a fast-fail rejection propagates without touching the failure count; the explicit except CircuitOpenError clause additionally guards against a CircuitOpenError raised from inside execute(). Counting fast-fail rejections as failures would keep inflating the count and could prevent the circuit from ever recovering. This keeps the circuit breaker external to the handler: retry logic stays in the handler, availability policy stays in the breaker.
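For multi-threaded applications, the lock-based protection mentioned in the threading note can be sketched as a standalone variant. This is an illustration under stated assumptions, not a drop-in replacement: the ThreadSafeCircuitBreaker name, the string state values, and the injectable clock parameter are hypothetical, and it redefines CircuitOpenError so the block runs on its own.

```python
import threading
import time
from typing import Callable


class CircuitOpenError(Exception):
    pass


class ThreadSafeCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 60.0,
                 clock: Callable[[], float] = time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock
        self._lock = threading.Lock()
        self._state = "closed"
        self._failures = 0
        self._opened_at = 0.0

    @property
    def state(self) -> str:
        with self._lock:
            return self._state

    def allow_request(self) -> None:
        """Return normally if a request may proceed; raise CircuitOpenError otherwise."""
        with self._lock:
            if self._state == "closed":
                return
            if self._state == "open" and self._clock() - self._opened_at >= self.cooldown:
                # Exactly one caller wins the transition and becomes the probe.
                self._state = "half_open"
                return
            # Still cooling down, or a probe is already in flight.
            raise CircuitOpenError("Circuit is open; failing fast")

    def record_success(self) -> None:
        with self._lock:
            self._failures = 0
            self._state = "closed"

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            if self._state == "half_open" or self._failures >= self.failure_threshold:
                self._state = "open"
                self._opened_at = self._clock()
```

One caveat of this design: if the probe caller never reports success or failure, the breaker stays in the half-open state, so the probe should always be wrapped in try/except as in execute_with_circuit_breaker above.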

Testing Your Retry Logic Without Burning API Credits

Mocking 429 Responses with `pytest` and `responses`

The responses library intercepts outbound HTTP calls from requests and returns preconfigured mock responses, including custom headers. This makes it possible to simulate exact Anthropic 429 behavior, complete with Retry-After and the full anthropic-ratelimit-* header set, without making a single real API call.

Note: The test blocks below assume all classes and functions are importable from your claude_rate_limiter module:
from claude_rate_limiter import (
    parse_rate_limit_headers,
    ClaudeRateLimitHandler,
    MaxRetriesExceeded,
    CircuitBreaker,
    CircuitState,
    CircuitOpenError,
)

Testing the Circuit Breaker Transition

Driving the circuit breaker through its full state lifecycle in a single test validates both the failure accumulation logic and the recovery path. The test issues enough mocked 429 responses to trip the breaker, asserts that CircuitOpenError is raised, advances time past the cooldown, and verifies that a subsequent success closes the circuit.

import pytest
import responses
import time
from unittest.mock import patch

from claude_rate_limiter import (
    parse_rate_limit_headers,
    ClaudeRateLimitHandler,
    MaxRetriesExceeded,
    CircuitBreaker,
    CircuitState,
    CircuitOpenError,
)

MOCK_URL = "https://api.anthropic.com/v1/messages"


@responses.activate
def test_exponential_backoff_honors_retry_after():
    responses.add(
        responses.POST, MOCK_URL, status=429,
        headers={"Retry-After": "2", "anthropic-ratelimit-requests-remaining": "0"},
    )
    responses.add(
        responses.POST, MOCK_URL, status=429,
        headers={"Retry-After": "1", "anthropic-ratelimit-requests-remaining": "0"},
    )
    responses.add(responses.POST, MOCK_URL, status=200, json={"content": [{"text": "ok"}]})

    import requests as req

    def send():
        return req.post(MOCK_URL, json={}, timeout=30)

    handler = ClaudeRateLimitHandler(max_retries=5, base_delay=1.0)

    with patch("claude_rate_limiter.time.sleep") as mock_sleep:
        resp = handler.execute(send)

    assert resp.status_code == 200
    assert len(responses.calls) == 3
    sleep_calls = [c.args[0] for c in mock_sleep.call_args_list]
    # First sleep: Retry-After=2 + jitter in [0, 0.5]
    # (random.uniform may return the upper endpoint, so the bound is inclusive)
    assert 2.0 <= sleep_calls[0] <= 2.5
    # Second sleep: Retry-After=1 + jitter in [0, 0.5]
    assert 1.0 <= sleep_calls[1] <= 1.5


def test_circuit_breaker_lifecycle():
    breaker = CircuitBreaker(failure_threshold=3, cooldown=5.0)

    for _ in range(3):
        breaker.record_failure()
    assert breaker.state == CircuitState.OPEN

    with pytest.raises(CircuitOpenError):
        breaker.allow_request()

    with patch("time.time", return_value=time.time() + 6):
        assert breaker.allow_request() is True
        assert breaker.state == CircuitState.HALF_OPEN

    breaker.record_success()
    assert breaker.state == CircuitState.CLOSED

The first test verifies that the handler respects Retry-After headers across consecutive 429 responses and eventually succeeds on the third attempt. By mocking time.sleep, the test runs instantly and asserts the exact sleep durations rather than relying on wall-clock timing, making it reliable in CI environments. The second test exercises the full breaker lifecycle without any network calls at all, patching time.time to simulate cooldown expiry.

Key Takeaways and Quick-Reference Checklist

Parse all anthropic-ratelimit-* headers on every response, not just 429s. They provide real-time visibility into quota consumption.
Respect Retry-After first, then fall back to reset-header math, then jittered exponential backoff. This three-tier priority matches Anthropic's server-side expectations.
Apply full-jitter exponential backoff to break thundering-herd synchronization. Reserve decorrelated jitter for sustained high-concurrency pipelines (50+ simultaneous callers per key). Initialize previous_sleep = base on the first attempt.
Avoid the 429 entirely by checking remaining quota before sending requests. Proactive throttling is always cheaper than reactive retry.
Deploy a circuit breaker around the retry handler for sustained outages. Failing fast preserves threads, compute, and the ability to fall back gracefully.
Every retry event should be logged with attempt number, computed wait time, and the header values that informed the decision. Observability is not optional in production.

For the latest rate-limit thresholds and header specifications, consult Anthropic's official rate-limits documentation, which reflects current per-tier quotas and any changes to header semantics.
