Handling Claude API 429 errors in production Python applications demands more than a basic retry loop. Anthropic's rate-limiting system enforces multiple overlapping constraints simultaneously, and a naive approach to retries can amplify failures rather than absorb them.
This article builds a complete, production-ready rate-limit handler from the ground up. It parses Anthropic's proprietary response headers, implements jitter-aware exponential backoff, adds proactive throttling before a 429 ever fires, and wraps the entire mechanism in a circuit breaker for sustained outage resilience. Code examples are designed to be assembled into a single module (claude_rate_limiter.py); a complete runnable file is available for download at the end of the article.
Note: The official Anthropic Python SDK includes built-in retry and rate-limit handling. This article is appropriate when using raw HTTP via requests, or when you need custom control beyond SDK defaults (proactive throttling, circuit breaking, custom observability hooks).
Prerequisites
Python 3.11+ is required for reliable ISO 8601 timezone parsing via datetime.fromisoformat. For Python 3.8-3.10, substitute python-dateutil (see note in the header-parsing section). Dependencies (tested versions): requests>=2.31, tenacity>=8.2, responses>=0.23 (test only). Install with pip install requests tenacity responses. You will also need an active Anthropic API key, stored in the ANTHROPIC_API_KEY environment variable.
Table of Contents
- Why 429 Errors Deserve Their Own Strategy
- Decoding Claude's Rate-Limit Headers
- Implementing Exponential Backoff with Jitter
- Building the Production-Ready ClaudeRateLimitHandler Class
- Adding a Circuit Breaker for Sustained Outages
- Testing Your Retry Logic Without Burning API Credits
- Key Takeaways and Quick-Reference Checklist
Why 429 Errors Deserve Their Own Strategy
What a 429 Actually Means (and Why Claude's Implementation Differs)
HTTP 429 Too Many Requests signals that a client has exceeded a server's rate limit. Most APIs enforce a single rate-limit dimension. Anthropic enforces three simultaneously: request-rate limits (requests per minute), token-rate limits (tokens per minute, covering both input and output tokens), and concurrent-request limits (the number of in-flight requests at any instant). Each dimension has its own quota and its own reset window. (See Anthropic's rate limits documentation for current per-tier quotas.)
Generic retry logic typically assumes one throttle boundary. With Claude, a request that passes the requests-per-minute check can still be rejected by the token budget or blocked independently by the concurrency ceiling. Any retry strategy that ignores this multi-bucket architecture will misinterpret which constraint it violated and miscalculate how long to wait.
The Cost of Getting It Wrong
Unhandled or poorly handled 429 errors compound within seconds. When multiple workers retry at the same cadence, they create retry storms: synchronized bursts that spike load and trigger further rejections. This is the thundering-herd problem, where each retry round recreates the original overload. The cost is real. A streaming request that is cut off partway through may still be billed for the tokens already processed (verify current billing behavior for aborted streaming requests in Anthropic's pricing documentation). In batch processing scenarios, a single aggressive job can exhaust an organization-level rate limit, effectively locking every other service on that API key out of the Claude API until the reset window passes. The downstream result: compounding latency, wasted spend, and degraded availability across the entire system.
Decoding Claude's Rate-Limit Headers
The anthropic-ratelimit-* Header Family
Every response from the Anthropic API includes a set of proprietary headers that expose the current state of each rate-limit bucket (as documented in Anthropic's rate limits reference). These headers follow a consistent naming pattern across both the request and token dimensions.
For the request bucket:
- anthropic-ratelimit-requests-limit: maximum requests allowed in the current window.
- anthropic-ratelimit-requests-remaining: requests left before hitting the cap.
- anthropic-ratelimit-requests-reset: ISO 8601 timestamp when the request bucket resets.
The token bucket follows the same structure:
- anthropic-ratelimit-tokens-limit: maximum tokens allowed in the current window.
- anthropic-ratelimit-tokens-remaining: tokens left in the current window.
- anthropic-ratelimit-tokens-reset: ISO 8601 timestamp when the token bucket resets.
The reset timestamps are formatted as ISO 8601 strings (e.g., 2025-01-15T12:00:30Z). Calculating wait time requires parsing the timestamp, comparing it to the current UTC time, and converting the difference to seconds.
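The parse-compare-convert arithmetic described above can be sketched in isolation. This is a minimal illustration, not part of the article's module: the helper name seconds_until_reset and the injectable now parameter are hypothetical, added so the example is deterministic.

```python
from datetime import datetime, timezone
from typing import Optional


def seconds_until_reset(reset_header: str, now: Optional[datetime] = None) -> float:
    """Return the non-negative number of seconds until an ISO 8601 reset time."""
    if now is None:
        now = datetime.now(timezone.utc)
    # Older Python versions reject a trailing "Z", so normalise it to "+00:00".
    reset_time = datetime.fromisoformat(reset_header.replace("Z", "+00:00"))
    if reset_time.tzinfo is None:
        # Assumption: a timestamp without an offset is treated as UTC.
        reset_time = reset_time.replace(tzinfo=timezone.utc)
    # A reset time in the past means the bucket has already refilled.
    return max((reset_time - now).total_seconds(), 0.0)


fixed_now = datetime(2025, 1, 15, 12, 0, 0, tzinfo=timezone.utc)
print(seconds_until_reset("2025-01-15T12:00:30Z", now=fixed_now))  # 30.0
print(seconds_until_reset("2025-01-15T11:59:00Z", now=fixed_now))  # 0.0
```

Clamping at zero matters because clock skew between client and server can make a reset timestamp appear slightly in the past.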
Retry-After vs. Reset Headers: Which to Obey
As of this writing, Anthropic includes a Retry-After header only on actual 429 responses (verify against current API documentation before deployment). When present, this header gives a server-computed value in integer seconds representing the minimum time a client should wait before retrying. (The HTTP spec also allows an HTTP-date format for Retry-After; this implementation assumes seconds only, which matches Anthropic's current behavior.) The reset headers, by contrast, appear on every response regardless of status code.
Prioritize deterministically: use Retry-After when present, since it reflects the server's most accurate assessment of when capacity will free up. Fall back to computing wait time from the nearest reset timestamp if Retry-After is absent. Resort to exponential backoff as the last fallback when neither header is usable.
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("claude_rate_limiter")


def parse_rate_limit_headers(response) -> dict:
    """Extract Anthropic rate-limit headers and compute actionable wait time.

    Returns a dict with processed rate-limit info (keys like 'requests_remaining',
    'tokens_remaining', etc.), not raw HTTP header names.
    """
    headers = response.headers

    def _to_int(val: Optional[str]) -> Optional[int]:
        # Missing, malformed, or negative values are normalised to None.
        if val is None:
            return None
        try:
            result = int(val)
            return result if result >= 0 else None
        except (ValueError, TypeError):
            return None

    info = {
        "requests_limit": _to_int(headers.get("anthropic-ratelimit-requests-limit")),
        "requests_remaining": _to_int(headers.get("anthropic-ratelimit-requests-remaining")),
        "requests_reset": headers.get("anthropic-ratelimit-requests-reset"),
        "tokens_limit": _to_int(headers.get("anthropic-ratelimit-tokens-limit")),
        "tokens_remaining": _to_int(headers.get("anthropic-ratelimit-tokens-remaining")),
        "tokens_reset": headers.get("anthropic-ratelimit-tokens-reset"),
        "retry_after": headers.get("Retry-After"),
        "wait_seconds": None,
    }

    # Tier 1: a parseable Retry-After wins outright.
    if info["retry_after"] is not None:
        try:
            info["wait_seconds"] = float(info["retry_after"])
        except (ValueError, TypeError):
            pass  # fall through to reset-header logic
        else:
            return info

    # Tier 2: earliest of the two reset timestamps.
    now = datetime.now(timezone.utc)
    earliest_wait = None
    for key in ("requests_reset", "tokens_reset"):
        raw = info.get(key)
        if raw:
            try:
                # Requires Python 3.11+; for 3.8-3.10 use:
                # from dateutil.parser import parse as parse_dt
                # reset_time = parse_dt(raw)
                reset_time = datetime.fromisoformat(raw.replace("Z", "+00:00"))
                delta = (reset_time - now).total_seconds()
                wait = max(delta, 0.0)
                if earliest_wait is None or wait < earliest_wait:
                    earliest_wait = wait
            except ValueError:
                logger.warning("Unparseable rate-limit reset header %r: %r", key, raw)

    # None here tells the caller to fall back to backoff math.
    info["wait_seconds"] = earliest_wait
    return info
This function returns a structured dictionary that downstream retry logic can inspect directly. If a parseable Retry-After exists, it takes precedence. Otherwise, the function computes the smallest non-negative wait from the two reset timestamps (a reset time already in the past counts as zero). A None value for wait_seconds signals that no header-based wait could be determined and the caller should fall back to backoff math. Numeric header values are cast to int at parse time and validated to be non-negative; malformed or missing values return as None. Malformed reset timestamps are logged and skipped rather than raising an exception.
Implementing Exponential Backoff with Jitter
Why Naive Retries Make Things Worse
When multiple workers receive a 429 at the same instant and all retry after exactly the same fixed delay, they arrive at the API in lockstep. This synchronized burst recreates the original overload condition at every retry interval. Rather than a smooth curve, the load graph shows a series of sharp spikes, each large enough to trigger another round of 429 rejections. Without jitter, the retry pattern itself sustains the congestion, and the system never recovers.
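The lockstep effect is easy to reproduce offline. The sketch below is a toy simulation rather than a measurement of real API traffic: the function name retry_times, the worker count, and the 8-second delay are illustrative assumptions. It buckets each worker's retry by whole second and compares the peak per-second load with and without randomization.

```python
import random
from collections import Counter
from typing import List


def retry_times(workers: int, delay: float, jitter: bool, rng: random.Random) -> List[int]:
    """Return the whole second at which each worker sends its retry."""
    times = []
    for _ in range(workers):
        # Fixed delay: everyone waits exactly `delay`.
        # Jittered: each worker picks a random point in [0, delay].
        wait = rng.uniform(0, delay) if jitter else delay
        times.append(int(wait))
    return times


rng = random.Random(42)  # seeded so the run is repeatable
fixed = Counter(retry_times(100, 8.0, jitter=False, rng=rng))
jittered = Counter(retry_times(100, 8.0, jitter=True, rng=rng))

print(max(fixed.values()))     # 100: every worker retries in the same second
print(max(jittered.values()))  # peak per-second load once retries are spread out
```

With a fixed delay the entire population lands in one bucket, which is exactly the spike that triggers the next round of rejections.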
Full-Jitter vs. Decorrelated-Jitter Strategies
Adding randomness to retry timing breaks the synchronization. Two well-studied approaches dominate:
Full jitter computes the sleep duration as:
# pseudocode: random(0, min(cap, base * 2**attempt))
Each retry selects a uniformly random value between zero and the exponentially increasing ceiling. This approach is simple to implement and effective at dispersing retries across the wait window.
Decorrelated jitter uses the formula:
# pseudocode: sleep = min(cap, random(base, previous_sleep * 3))
# Initialize previous_sleep = base before the first attempt.
# On each retry, update previous_sleep to the actual sleep duration used.
Because each sleep depends on the prior sleep rather than the attempt count alone, the resulting wait sequence is less predictable and tends to spread retries more evenly under sustained high-throughput conditions.
Full jitter is a reasonable default for most applications integrating with the Claude API. Decorrelated jitter becomes advantageous at concurrency levels where the probability of retry collisions becomes measurable, roughly 50+ simultaneous callers sharing one API key (the AWS Architecture Blog analysis covers the statistical tradeoffs in detail).
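Both formulas translate directly into a few lines of Python. The following is a minimal sketch of the two pseudocode expressions above; the function names and the default base and cap values are illustrative choices, not values mandated by any library.

```python
import random


def full_jitter(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Sleep anywhere between 0 and the capped exponential ceiling."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def decorrelated_jitter(previous_sleep: float, base: float = 1.0, cap: float = 60.0) -> float:
    """Sleep between base and three times the previous sleep, capped."""
    return min(cap, random.uniform(base, previous_sleep * 3))


# Decorrelated jitter carries state between retries:
# start at base, then feed each result into the next call.
previous = 1.0
schedule = []
for _ in range(5):
    previous = decorrelated_jitter(previous)
    schedule.append(previous)

print([round(s, 2) for s in schedule])          # five waits, each within [1.0, 60.0]
print(round(full_jitter(3), 2))                 # somewhere within [0, 8.0]
```

The practical difference is where the state lives: full jitter needs only the attempt number, while decorrelated jitter requires the caller to remember the last sleep it actually used.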
Integrating with the tenacity Library
Hand-rolled retry loops work, but the tenacity library offers composability, built-in logging hooks, and retry statistics out of the box. The key design decision is writing a custom retry predicate that triggers only on HTTP 429, not on client errors (400) or server errors (500), which require different handling paths.
Note: The following code block assumes parse_rate_limit_headers is defined as shown in the "Decoding Claude's Rate-Limit Headers" section above, or imported from a shared module (e.g., from claude_rate_limiter import parse_rate_limit_headers).
from claude_rate_limiter import parse_rate_limit_headers  # required for standalone use
import random
import logging
import requests
from requests.structures import CaseInsensitiveDict
from tenacity import (
    retry, stop_after_attempt, retry_if_exception,
    before_sleep_log, RetryCallState
)

logger = logging.getLogger("claude_retry")


class RateLimitError(Exception):
    def __init__(self, response):
        # Copy headers into a case-insensitive mapping so that lookups such as
        # headers.get("Retry-After") still match if the server sends "retry-after".
        headers = CaseInsensitiveDict(response.headers)
        self.status_code = response.status_code
        self.headers = headers
        self.body = response.text
        # Store a lightweight mock-like object instead of the full Response
        # to avoid unbounded memory retention through tenacity's retry state.
        self._response_proxy = type("ResponseProxy", (), {
            "status_code": response.status_code,
            "headers": headers,
        })()
        super().__init__("429 Too Many Requests")

    @property
    def response(self):
        return self._response_proxy


def _header_aware_wait(retry_state: RetryCallState) -> float:
    exc = retry_state.outcome.exception()
    if isinstance(exc, RateLimitError):
        info = parse_rate_limit_headers(exc.response)
        if info["wait_seconds"] is not None:
            # Small jitter so clients with identical Retry-After values desynchronize.
            return info["wait_seconds"] + random.uniform(0, 0.5)
    # No usable header: full-jitter exponential backoff.
    attempt = retry_state.attempt_number
    base, cap = 1.0, 60.0
    full_jitter_wait = random.uniform(0, min(cap, base * (2 ** attempt)))
    return full_jitter_wait


@retry(
    retry=retry_if_exception(lambda e: isinstance(e, RateLimitError)),
    wait=_header_aware_wait,
    stop=stop_after_attempt(6),
    before_sleep=before_sleep_log(logger, logging.WARNING),
)
def call_claude_api(url: str, headers: dict, payload: dict) -> dict:
    resp = requests.post(url, headers=headers, json=payload, timeout=30)
    if resp.status_code == 429:
        raise RateLimitError(resp)
    resp.raise_for_status()  # non-429 errors are not retried
    return resp.json()
The custom wait callback inspects the Retry-After and reset headers first, adding a small random jitter (up to 0.5 seconds) to prevent synchronization even among clients that receive identical Retry-After values. When no header information is available, it falls back to full-jitter exponential backoff capped at 60 seconds. The before_sleep_log hook logs every retry event with its computed wait time.
Building the Production-Ready ClaudeRateLimitHandler Class
Class Architecture Overview
The handler class wraps any callable that returns a requests.Response and unifies retry logic, proactive throttling, and monitoring hooks in a single object. It follows a composition-over-inheritance design: rather than subclassing an HTTP client, it accepts a request function and orchestrates the execution loop around it. Callback hooks for on_retry, on_success, and on_failure let you integrate with any monitoring system without coupling the handler to a specific observability stack.
Thread safety: This implementation is not thread-safe. If you share a single ClaudeRateLimitHandler instance across threads, protect calls to execute() with a threading.Lock, or instantiate one handler per thread.
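One way to apply the lock is a thin wrapper that serializes execute() calls. The sketch below is an assumption about how such a wrapper might look, not part of the article's module: LockedExecutor is a hypothetical name, and _StubHandler stands in for ClaudeRateLimitHandler so the example runs without network access.

```python
import threading
from typing import Any, Callable


class LockedExecutor:
    """Serializes calls to a wrapped handler's execute() method."""

    def __init__(self, handler: Any):
        self._handler = handler
        self._lock = threading.Lock()

    def execute(self, request_fn: Callable, *args, **kwargs) -> Any:
        # Only one thread at a time may run the handler's retry loop.
        with self._lock:
            return self._handler.execute(request_fn, *args, **kwargs)


class _StubHandler:
    """Stand-in for the real handler; records how many calls overlap."""

    def __init__(self):
        self.active = 0
        self.max_active = 0
        self.calls = 0

    def execute(self, request_fn, *args, **kwargs):
        self.active += 1
        self.max_active = max(self.max_active, self.active)
        result = request_fn(*args, **kwargs)
        self.calls += 1
        self.active -= 1
        return result


stub = _StubHandler()
shared = LockedExecutor(stub)
threads = [threading.Thread(target=shared.execute, args=(lambda: "ok",)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(stub.calls, stub.max_active)  # 8 1
```

The trade-off is throughput: because the lock is held across retries and sleeps, only one request is in flight at a time. One handler per thread avoids that cost but gives each thread its own, independent view of the remaining quota.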
Proactive Throttling: Don't Wait for the 429
The most effective 429 handling strategy is to avoid the 429 entirely. Since every Anthropic API response includes requests-remaining and tokens-remaining headers, the handler can track these values from the most recent successful response. Before dispatching a new request, it checks whether remaining quota has fallen below a configurable threshold. If so, it sleeps until the next reset window rather than sending a request that will exceed the remaining quota. This eliminates wasted round trips and keeps the client within its allocation.
Concurrency note: Effectiveness depends on proactive_threshold being set relative to your concurrent worker count. With proactive_threshold=2 and 10 concurrent workers sharing one handler instance, most workers will still fire requests before the throttle engages.
Full Class Implementation Walk-Through
Note: This class depends on parse_rate_limit_headers defined in the "Decoding Claude's Rate-Limit Headers" section. All code in this article is designed to live in a single claude_rate_limiter.py module.
import time
import random
import logging
from datetime import datetime, timezone
from typing import Callable, Optional, Any

logger = logging.getLogger("claude_rate_limiter")


class MaxRetriesExceeded(Exception):
    """Raised when all retry attempts for 429 rate limiting have been exhausted."""

    def __init__(self, message: str, retry_count: int = 0):
        super().__init__(message)
        self.retry_count = retry_count


class ClaudeRateLimitHandler:
    """Wraps a request callable with retry logic, proactive throttling, and observability hooks.

    Attributes:
        _last_rate_limit_info: Stores processed rate-limit info as returned by
            parse_rate_limit_headers (keys like 'requests_remaining', 'tokens_remaining'),
            not raw HTTP header names.
        proactive_threshold: If either requests-remaining or tokens-remaining falls
            at or below this value, the handler sleeps until the corresponding reset.
    """

    def __init__(
        self,
        base_delay: float = 1.0,
        max_delay: float = 60.0,
        max_retries: int = 5,
        proactive_threshold: int = 2,
        on_retry: Optional[Callable] = None,
        on_success: Optional[Callable] = None,
        on_failure: Optional[Callable] = None,
    ):
        self.base_delay = base_delay
        self.max_delay = max_delay
        self.max_retries = max_retries
        self.proactive_threshold = proactive_threshold
        self.on_retry = on_retry
        self.on_success = on_success
        self.on_failure = on_failure
        self._last_rate_limit_info: dict = {}

    def _parse_headers(self, response) -> dict:
        return parse_rate_limit_headers(response)

    def _calculate_wait(self, attempt: int, response=None) -> float:
        # Prefer header-derived waits; otherwise use full-jitter backoff.
        if response is not None:
            info = self._parse_headers(response)
            if info["wait_seconds"] is not None:
                return info["wait_seconds"] + random.uniform(0, 0.5)
        cap = min(self.max_delay, self.base_delay * (2 ** attempt))
        return random.uniform(0, cap)

    def _should_throttle_proactively(self) -> Optional[float]:
        for resource in ("requests", "tokens"):
            remaining = self._last_rate_limit_info.get(f"{resource}_remaining")
            reset = self._last_rate_limit_info.get(f"{resource}_reset")
            # remaining is already Optional[int] after parse_rate_limit_headers
            if remaining is not None and remaining <= self.proactive_threshold:
                if reset:
                    try:
                        reset_time = datetime.fromisoformat(
                            reset.replace("Z", "+00:00")
                        )
                        wait = (reset_time - datetime.now(timezone.utc)).total_seconds()
                        return max(wait, 0.0)
                    except ValueError:
                        logger.warning("Unparseable reset header for %r: %r", resource, reset)
        return None

    def execute(self, request_fn: Callable, *args, **kwargs) -> Any:
        # _last_rate_limit_info is deliberately NOT cleared here: the quota seen on
        # the previous call is what lets the first attempt throttle proactively.
        retry_count = 0  # local variable, safe for concurrent use of separate calls
        for attempt in range(self.max_retries + 1):
            throttle_wait = self._should_throttle_proactively()
            if throttle_wait and throttle_wait > 0:
                logger.info("Proactive throttle: sleeping %.2fs", throttle_wait)
                time.sleep(throttle_wait)

            response = request_fn(*args, **kwargs)
            self._last_rate_limit_info = self._parse_headers(response)

            if response.status_code // 100 == 2:
                if self.on_success:
                    self.on_success(response)
                return response

            if response.status_code == 429:
                retry_count += 1
                if retry_count > self.max_retries:
                    break
                wait = self._calculate_wait(attempt + 1, response)
                logger.warning(
                    "429 received (attempt %d/%d), waiting %.2fs",
                    attempt + 1, self.max_retries + 1, wait,
                )
                if self.on_retry:
                    self.on_retry(attempt, wait, response)
                time.sleep(wait)
                continue

            # Any other error status is surfaced immediately, without retrying.
            if self.on_failure:
                self.on_failure(retry_count)
            response.raise_for_status()

        if self.on_failure:
            self.on_failure(retry_count)
        raise MaxRetriesExceeded(
            f"Max retries ({self.max_retries}) exceeded for 429 rate limiting",
            retry_count=retry_count,
        )
The execute() method is the primary entry point. On each iteration, it first checks whether proactive throttling is warranted. If the remaining quota for either requests or tokens sits at or below proactive_threshold, the handler sleeps until the corresponding reset window. When a 429 does arrive, the three-tier wait calculation kicks in: Retry-After first, reset-header math second, jittered exponential backoff third. Non-429 errors raise immediately rather than entering the retry loop, keeping error handling responsibility clean. Any 2xx status code (200, 201, etc.) counts as success. The retry_count is a local variable so that concurrent calls on the same handler instance (even though the handler is not fully thread-safe) do not corrupt each other's counters. If all retries are exhausted, including the edge case where max_retries=0, a MaxRetriesExceeded exception fires rather than silently returning None.
Usage Example: Calling Claude's Messages API
import os
import requests

ANTHROPIC_API_KEY = os.environ.get("ANTHROPIC_API_KEY")
if not ANTHROPIC_API_KEY:
    raise ValueError("Set ANTHROPIC_API_KEY environment variable")

API_URL = "https://api.anthropic.com/v1/messages"


def send_message() -> requests.Response:
    return requests.post(
        API_URL,
        headers={
            "x-api-key": ANTHROPIC_API_KEY,
            "anthropic-version": "2023-06-01",
            "content-type": "application/json",
        },
        json={
            "model": "claude-sonnet-4-5",  # Replace with current model ID: https://docs.anthropic.com/en/docs/about-claude/models
            "max_tokens": 1024,
            "messages": [{"role": "user", "content": "Explain rate limiting."}],
        },
        timeout=30,
    )


handler = ClaudeRateLimitHandler(max_retries=5, proactive_threshold=3)
response = handler.execute(send_message)
print(response.json()["content"][0]["text"])
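The callback hooks can feed a simple in-process counter before you wire up a real metrics backend. The sketch below is illustrative only: RetryMetrics is a hypothetical class, and its method signatures are chosen to match how the handler above invokes on_retry(attempt, wait, response), on_success(response), and on_failure(retry_count). The hooks are called directly here so the example runs without the handler or network access.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class RetryMetrics:
    """Accumulates counts that a dashboard or log line could report."""
    retries: int = 0
    successes: int = 0
    failures: int = 0
    total_wait: float = 0.0

    def on_retry(self, attempt: int, wait: float, response: Any) -> None:
        self.retries += 1
        self.total_wait += wait

    def on_success(self, response: Any) -> None:
        self.successes += 1

    def on_failure(self, retry_count: int) -> None:
        self.failures += 1


metrics = RetryMetrics()

# With the handler from this article, the hooks would be passed in like so:
# handler = ClaudeRateLimitHandler(
#     on_retry=metrics.on_retry,
#     on_success=metrics.on_success,
#     on_failure=metrics.on_failure,
# )

# Simulate two retries followed by a success.
metrics.on_retry(0, 2.1, None)
metrics.on_retry(1, 1.3, None)
metrics.on_success(None)

print(metrics.retries, metrics.successes, round(metrics.total_wait, 1))  # 2 1 3.4
```

Keeping the metrics object separate from the handler means the same hooks could later forward to a metrics client without touching the retry logic.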
Adding a Circuit Breaker for Sustained Outages
When Retries Aren't Enough
Exponential backoff with a retry cap recovers from transient 429 spikes within the configured retry budget. It does not handle sustained unavailability. When Anthropic enforces an organization-level rate limit or experiences an incident lasting minutes, retries with backoff still consume compute resources, hold threads in sleep states, and delay any fallback logic the application might use (cached responses, alternative models, graceful degradation). A circuit breaker solves this by failing fast once the system detects that retries are consistently futile.
Circuit Breaker States: Closed, Open, Half-Open
The circuit breaker operates in three states. Closed is the normal state: requests flow through and failures are counted. After a configured number of consecutive failures (failure_threshold), the breaker transitions to Open. In this state, the breaker immediately rejects all requests with a CircuitOpenError; no API call is made and no thread blocks on sleep. After a cooldown period elapses, the breaker moves to Half-Open: it permits a probe request through (single-threaded only; in multi-threaded applications, multiple threads may enter Half-Open simultaneously, see the threading note below). If the probe succeeds, the breaker returns to Closed. If it fails, the breaker reopens and the cooldown resets.
Thread safety: This CircuitBreaker implementation is not thread-safe. In multi-threaded applications, protect allow_request(), record_success(), and record_failure() with a threading.Lock to enforce single-probe behavior in the Half-Open state.
import time
from enum import Enum


class CircuitState(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitOpenError(Exception):
    pass


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.state = CircuitState.CLOSED
        self._failure_count = 0
        self._last_failure_time: float = 0

    def record_failure(self):
        self._failure_count += 1
        self._last_failure_time = time.time()
        if self._failure_count >= self.failure_threshold:
            self.state = CircuitState.OPEN

    def record_success(self):
        self._failure_count = 0
        self._last_failure_time = 0
        self.state = CircuitState.CLOSED

    def allow_request(self) -> bool:
        if self.state == CircuitState.CLOSED:
            return True
        if self.state == CircuitState.OPEN:
            # Cooldown elapsed: let a probe request through.
            if time.time() - self._last_failure_time >= self.cooldown:
                self.state = CircuitState.HALF_OPEN
                return True
            raise CircuitOpenError("Circuit is open; failing fast")
        return True  # HALF_OPEN: allow probe (single-threaded only)
To integrate with ClaudeRateLimitHandler, wrap the execute() call. Note that CircuitOpenError is caught separately to avoid incorrectly incrementing the failure count when the circuit is open. Fast-fail rejections are not new failures:
breaker = CircuitBreaker(failure_threshold=5, cooldown=60.0)
handler = ClaudeRateLimitHandler(max_retries=5, proactive_threshold=3)


def execute_with_circuit_breaker(request_fn, *args, **kwargs):
    breaker.allow_request()  # Raises CircuitOpenError if open
    try:
        response = handler.execute(request_fn, *args, **kwargs)
        breaker.record_success()
        return response
    except CircuitOpenError:
        raise  # fast-fail: not a new failure, do not increment counter
    except Exception:
        breaker.record_failure()
        raise
Call breaker.allow_request() before dispatching, breaker.record_success() on a successful response, and breaker.record_failure() each time execution fails. CircuitOpenError is explicitly re-raised without recording a failure: counting fast-fail rejections as failures would inflate the failure count, keep refreshing the last-failure timestamp, and could prevent the circuit from ever recovering. This keeps the circuit breaker external to the handler: retry logic stays in the handler, availability policy stays in the breaker.
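For multi-threaded callers, the single-probe rule in the Half-Open state can be enforced with a lock and a flag. The following is a sketch under stated assumptions rather than a drop-in replacement for the class above: LockedCircuitBreaker is a hypothetical name, it tracks an opened-at timestamp instead of the three-value enum, and it accepts an injectable clock (defaulting to time.monotonic instead of the time.time used earlier) so the example can run deterministically.

```python
import threading
import time


class CircuitOpenError(Exception):
    pass


class LockedCircuitBreaker:
    """Circuit breaker that admits exactly one probe after the cooldown."""

    def __init__(self, failure_threshold=5, cooldown=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self._clock = clock
        self._lock = threading.Lock()
        self._failures = 0
        self._opened_at = None          # None means the circuit is closed
        self._probe_in_flight = False

    def allow_request(self) -> bool:
        with self._lock:
            if self._opened_at is None:
                return True
            cooled = self._clock() - self._opened_at >= self.cooldown
            if cooled and not self._probe_in_flight:
                self._probe_in_flight = True   # half-open: first caller probes
                return True
            raise CircuitOpenError("Circuit is open; failing fast")

    def record_success(self) -> None:
        with self._lock:
            self._failures = 0
            self._opened_at = None
            self._probe_in_flight = False

    def record_failure(self) -> None:
        with self._lock:
            self._failures += 1
            if self._probe_in_flight or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # (re)open and restart the cooldown
            self._probe_in_flight = False


# Deterministic walk-through using a fake clock.
now = [0.0]
breaker = LockedCircuitBreaker(failure_threshold=2, cooldown=5.0, clock=lambda: now[0])
breaker.record_failure()
breaker.record_failure()          # threshold reached: circuit opens at t=0
now[0] = 6.0                      # cooldown has elapsed

first = breaker.allow_request()   # the probe is admitted
try:
    breaker.allow_request()       # a second caller during the probe
    second = True
except CircuitOpenError:
    second = False

print(first, second)  # True False
```

This sketch does not distinguish a late failure from a request that started before the circuit opened from a failed probe; a production version would likely tag the probe explicitly.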
Testing Your Retry Logic Without Burning API Credits
Mocking 429 Responses with pytest and responses
The responses library intercepts outbound HTTP calls from requests and returns preconfigured mock responses, including custom headers. This makes it possible to simulate exact Anthropic 429 behavior, complete with Retry-After and the full anthropic-ratelimit-* header set, without making a single real API call.
Note: The test blocks below assume all classes and functions are importable from your claude_rate_limiter module:
from claude_rate_limiter import (
    parse_rate_limit_headers,
    ClaudeRateLimitHandler,
    MaxRetriesExceeded,
    CircuitBreaker,
    CircuitState,
    CircuitOpenError,
)
Testing the Circuit Breaker Transition
Driving the circuit breaker through its full state lifecycle in a single test validates both the failure accumulation logic and the recovery path. The test issues enough mocked 429 responses to trip the breaker, asserts that CircuitOpenError is raised, advances time past the cooldown, and verifies that a subsequent success closes the circuit.
import pytest
import requests
import responses
import time
from unittest.mock import patch

from claude_rate_limiter import (
    parse_rate_limit_headers,
    ClaudeRateLimitHandler,
    MaxRetriesExceeded,
    CircuitBreaker,
    CircuitState,
    CircuitOpenError,
)

MOCK_URL = "https://api.anthropic.com/v1/messages"


@responses.activate
def test_exponential_backoff_honors_retry_after():
    # Two 429s with different Retry-After values, then a success.
    responses.add(
        responses.POST, MOCK_URL, status=429,
        headers={"Retry-After": "2", "anthropic-ratelimit-requests-remaining": "0"},
    )
    responses.add(
        responses.POST, MOCK_URL, status=429,
        headers={"Retry-After": "1", "anthropic-ratelimit-requests-remaining": "0"},
    )
    responses.add(responses.POST, MOCK_URL, status=200, json={"content": [{"text": "ok"}]})

    def send():
        return requests.post(MOCK_URL, json={}, timeout=30)

    handler = ClaudeRateLimitHandler(max_retries=5, base_delay=1.0)
    with patch("claude_rate_limiter.time.sleep") as mock_sleep:
        resp = handler.execute(send)

    assert resp.status_code == 200
    assert len(responses.calls) == 3
    sleep_calls = [c.args[0] for c in mock_sleep.call_args_list]
    # random.uniform may return its upper bound, so the range check is inclusive.
    # First sleep: Retry-After=2 + jitter in [0, 0.5]
    assert 2.0 <= sleep_calls[0] <= 2.5
    # Second sleep: Retry-After=1 + jitter in [0, 0.5]
    assert 1.0 <= sleep_calls[1] <= 1.5


def test_circuit_breaker_lifecycle():
    breaker = CircuitBreaker(failure_threshold=3, cooldown=5.0)
    for _ in range(3):
        breaker.record_failure()
    assert breaker.state == CircuitState.OPEN

    with pytest.raises(CircuitOpenError):
        breaker.allow_request()

    # Jump the clock past the cooldown so the breaker admits a probe.
    with patch("time.time", return_value=time.time() + 6):
        assert breaker.allow_request() is True
        assert breaker.state == CircuitState.HALF_OPEN

    breaker.record_success()
    assert breaker.state == CircuitState.CLOSED
The first test verifies that the handler respects Retry-After headers across consecutive 429 responses and eventually succeeds on the third attempt. By mocking time.sleep, the test runs instantly and asserts the expected range of each sleep duration (the Retry-After value plus up to 0.5 seconds of jitter) rather than relying on wall-clock timing, making it reliable in CI environments. The second test exercises the full breaker lifecycle without any network calls at all, patching time.time to simulate cooldown expiry.
Key Takeaways and Quick-Reference Checklist
- Parse all anthropic-ratelimit-* headers on every response, not just 429s. They provide real-time visibility into quota consumption.
- Respect Retry-After first, then fall back to reset-header math, then jittered exponential backoff. This three-tier priority prefers the server's own guidance over client-side guesses.
- Apply full-jitter exponential backoff to break thundering-herd synchronization. Reserve decorrelated jitter for sustained high-concurrency pipelines (50+ simultaneous callers per key); if you use it, initialize previous_sleep = base before the first attempt.
- Avoid the 429 entirely by checking remaining quota before sending requests. Proactive throttling is usually cheaper than reactive retry.
- Deploy a circuit breaker around the retry handler for sustained outages. Failing fast preserves threads, compute, and the ability to fall back gracefully.
- Every retry event should be logged with attempt number, computed wait time, and the header values that informed the decision. Observability is not optional in production.
For the latest rate-limit thresholds and header specifications, consult Anthropic's official rate-limits documentation, which reflects current per-tier quotas and any changes to header semantics.

