Claude API costs in production applications can spiral quickly, especially when teams deploy without a deliberate strategy around token management, caching, and model selection. Most developers overspend not because the API is inherently expensive, but because unoptimized prompts, absent caching layers, and blanket use of high-tier models waste tokens at scale. This article presents a systematic, quantified framework for Claude API token optimization that combines prompt engineering, exact-match caching with Redis, and intelligent model routing to reduce spend by 60% or more.
Table of Contents
- Prerequisites
- Understanding Claude API Token Economics
- Prompt Engineering for Token Reduction
- Exact-Match Caching with Redis
- Intelligent Model Routing and Tiering
- Measuring and Monitoring Token Usage
- Token Savings Calculator and Optimization Checklist
- Putting It All Together: Real-World Savings Breakdown
- Key Takeaways
Prerequisites
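The Python examples require Python ≥ 3.8. Install the Anthropic SDK (≥ 0.25.0) with pip install "anthropic>=0.25.0" and, for the caching examples, redis-py (≥ 4.0.0) with pip install "redis>=4.0.0". Quote the version specifiers so the shell does not treat > as an output redirect. The client.messages.count_tokens() calls need a more recent SDK release than 0.25.0; upgrade to the latest version if that method is missing.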
For JavaScript examples, you need Node.js ≥ 18 and @anthropic-ai/sdk: run npm install @anthropic-ai/sdk.
- Set the ANTHROPIC_API_KEY environment variable: export ANTHROPIC_API_KEY=<your_key>
- Run a Redis instance at localhost:6379 for caching examples (or configure your own host/port)
- Prices and model IDs in this article were verified as of July 2025. Always confirm current pricing at anthropic.com/pricing before building cost models.
Understanding Claude API Token Economics
How Claude Tokenization Works
Claude's tokenizer follows the same general approach as Byte Pair Encoding (BPE), which most large language models use. BPE iteratively merges the most frequent pairs of subword units in the training corpus into single tokens, producing a vocabulary in which common words are single tokens and rare or compound words split into several subword units. The exact tokenizer vocabulary for Claude is not publicly released.
The practical consequence for developers: token count does not map neatly to word count. A rule of thumb for typical English prose is about 1.3 tokens per word, so a 1,000-word prompt consumes roughly 1,300 tokens. The ratio runs higher for technical jargon, URLs, code, and non-English text. Use client.messages.count_tokens() for precise counts before committing to cost estimates.
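For early budgeting, before any API call is made, the rule of thumb can be turned into a rough offline estimator. The sketch below is a heuristic only: the function name estimate_tokens and the default ratio of 1.3 are illustrative assumptions based on the prose-English figure in the text, and the result is no substitute for count_tokens().

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from a whitespace word count (heuristic, not exact)."""
    word_count = len(text.split())
    return round(word_count * tokens_per_word)


# Stand-in for a 1,000-word English prompt
sample_prompt = " ".join(["word"] * 1000)
print(f"Estimated tokens: {estimate_tokens(sample_prompt)}")
```

Raise tokens_per_word for code-heavy or non-English prompts, and calibrate it against real count_tokens() results from your own traffic.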
Claude bills input tokens (the prompt, system message, and any conversation history sent to the model) and output tokens (the model's generated response) at different rates. Output tokens cost 5x as much per token as input tokens across the three model tiers shown in this article. Verify the current ratio for any model at anthropic.com/pricing.
Current Claude Pricing Breakdown
The Haiku model used in code examples is claude-3-5-haiku-20241022 (Haiku 3.5). The Sonnet and Opus models are claude-sonnet-4-20250514 and claude-opus-4-20250514, respectively. Always verify current pricing and available model IDs at anthropic.com/pricing before deploying.
| Model | Model ID | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Input Cost (per 1K tokens) | Output Cost (per 1K tokens) |
|---|---|---|---|---|---|
| Claude Haiku 3.5 | claude-3-5-haiku-20241022 | $0.25 | $1.25 | $0.00025 | $0.00125 |
| Claude Sonnet 4 | claude-sonnet-4-20250514 | $3.00 | $15.00 | $0.003 | $0.015 |
| Claude Opus 4 | claude-opus-4-20250514 | $15.00 | $75.00 | $0.015 | $0.075 |
The key insight here is the 5x multiplier on output tokens. A response that generates 500 unnecessary output tokens on Sonnet 4 costs the same as 2,500 wasted input tokens. Optimizing output length, through format constraints, stop sequences, and assistant prefilling, delivers disproportionate savings compared to trimming input alone.
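The equivalence between wasted output and wasted input tokens can be checked with a few lines of arithmetic. This is a minimal sketch using the Sonnet 4 prices from the table; the helper name request_cost is an illustrative assumption, not part of the Anthropic SDK.

```python
SONNET_4_INPUT_PER_MTOK = 3.00    # USD per 1M input tokens (from the table)
SONNET_4_OUTPUT_PER_MTOK = 15.00  # USD per 1M output tokens (from the table)


def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_mtok: float = SONNET_4_INPUT_PER_MTOK,
                 output_price_per_mtok: float = SONNET_4_OUTPUT_PER_MTOK) -> float:
    """Cost in USD of one request, given token counts and per-million-token prices."""
    return (input_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000


wasted_output = request_cost(0, 500)    # 500 unnecessary output tokens
wasted_input = request_cost(2_500, 0)   # 2,500 unnecessary input tokens
print(f"500 output tokens:  ${wasted_output:.4f}")
print(f"2,500 input tokens: ${wasted_input:.4f}")
```

Both lines print the same amount, which is why trimming output usually pays off faster than trimming input.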
Note: Pricing is subject to change. If you hardcode these values, periodically verify them against Anthropic's current pricing page.
import anthropic
client = anthropic.Anthropic()
prompt_text = "Explain the key differences between REST and GraphQL APIs in detail."
# Count tokens before sending to estimate cost
result = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt_text}]
)
token_count = result.input_tokens
# Estimate cost assuming Sonnet 4 pricing
input_cost_per_token = 3.00 / 1_000_000
estimated_input_cost = token_count * input_cost_per_token
print(f"Token count: {token_count}")
print(f"Estimated input cost (Sonnet 4): ${estimated_input_cost:.6f}")
Prompt Engineering for Token Reduction
Eliminating Verbose System Prompts
System prompts are a persistent source of token waste. Many production applications carry system prompts bloated with redundant instructions, restated constraints, and conversational padding that consume tokens on every single request without improving model behavior.
Strip your system prompt to its minimal functional form and test whether output quality degrades. In our tests, we compressed a 500-token system prompt to 180 tokens with identical behavioral results by removing filler phrases, consolidating overlapping instructions, and using structured shorthand instead of natural language paragraphs. Measure your own prompts with count_tokens() to confirm.
Replace prose-style instructions with terse role definitions and bulleted constraints. The model does not need polite framing or motivation; it needs clear directives.
import anthropic
client = anthropic.Anthropic()
# Verbose system prompt: prose-style, with filler and restated constraints.
# This sample is shorter than the ~500-token prompt described in the text;
# rely on the count_tokens() results printed at the end for the real numbers.
verbose_system = """You are a highly knowledgeable and professional customer support assistant
working for TechCorp Inc. Your primary responsibility is to help customers with their questions
about our software products. You should always be polite, thorough, and helpful in your responses.
Please make sure to provide accurate information based on what you know about our products.
If you don't know the answer to something, please let the customer know that you will escalate
their issue to a human agent. Always maintain a professional and friendly tone throughout the
conversation. Do not provide information about competitor products. Do not make promises about
future features or releases. Keep your responses focused and relevant to the customer's question."""
# Optimized system prompt: terse role definition plus bulleted rules
optimized_system = """Role: TechCorp customer support agent.
Rules:
- Answer product questions accurately
- Unknown answers → escalate to human agent
- No competitor info
- No future feature promises
- Professional tone, concise responses"""
# Compare token counts — pass system prompts as the system parameter, not as user messages
verbose_count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    system=verbose_system,
    messages=[{"role": "user", "content": "Hello"}]
).input_tokens
optimized_count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    system=optimized_system,
    messages=[{"role": "user", "content": "Hello"}]
).input_tokens
savings = verbose_count - optimized_count
print(f"Verbose: {verbose_count} tokens | Optimized: {optimized_count} tokens")
print(f"Savings per request: {savings} tokens ({(savings/verbose_count)*100:.0f}% reduction)")
Constraining Output Length and Format
Treat max_tokens as an active optimization tool, not merely a safety net against runaway responses. Setting it to the minimum viable length for a given task prevents the model from generating verbose completions that inflate output token costs.
Instructing Claude to return structured JSON instead of prose can cut output tokens by 40 to 70% on extraction and classification tasks, though actual compression depends on task type and prompt structure. Measure on your own workload to confirm. A natural language answer that consumes 200 tokens often compresses to 60 to 80 tokens as a JSON object containing only the essential data fields.
Stop sequences provide an additional lever. By specifying delimiters that signal the end of useful output, applications prevent the model from appending unnecessary elaboration.
Prerequisites for JavaScript examples: Node.js ≥18. Run npm install @anthropic-ai/sdk before executing.
const Anthropic = require("@anthropic-ai/sdk");

const client = new Anthropic();

async function optimizedApiCall() {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 150,
    // Stop as soon as the model starts a "---" separator on a new line
    stop_sequences: ["\n---"],
    messages: [
      {
        role: "user",
        content: `Extract the product name, price, and availability from this text.
Return ONLY valid JSON with keys: product, price, available.
Text: "The UltraWidget Pro is currently in stock at $49.99 with free shipping."`,
      },
    ],
  });

  if (!response.content.length || response.content[0].type !== "text") {
    console.error("Unexpected response structure:", response.content);
    return null;
  }

  console.log("Response:", response.content[0].text);
  console.log("Input tokens:", response.usage.input_tokens);
  console.log("Output tokens:", response.usage.output_tokens);

  // Sonnet 4 pricing: $3.00 input / $15.00 output per 1M tokens
  const outputCost = (response.usage.output_tokens / 1_000_000) * 15.0;
  const inputCost = (response.usage.input_tokens / 1_000_000) * 3.0;
  console.log(`Estimated cost: $${(inputCost + outputCost).toFixed(6)}`);
}

optimizedApiCall().catch((error) => console.error("API call failed:", error.message));
Prefilling Assistant Responses
The assistant message prefill technique involves providing a partial response in the assistant role that steers the model's output from the first token. This constrains the model's generation search space, eliminating preamble tokens like "Here is the JSON response:" or "Sure, I'd be happy to help with that."
When prefilling with a partial JSON structure, the model completes the structure directly, saving tokens that would otherwise be spent on conversational framing.
import anthropic

client = anthropic.Anthropic()
API_TIMEOUT_SECONDS = 30.0
TICKET = "I can't log into my account after resetting my password"


def extract_text(response) -> str:
    """Safely extract text from a Claude response, raising clearly on unexpected structure."""
    if not response.content:
        raise ValueError(f"Empty content in response. Stop reason: {response.stop_reason}")
    block = response.content[0]
    if block.type != "text":
        raise ValueError(f"Expected text block, got: {block.type}")
    return block.text


# Without prefill
response_no_prefill = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    timeout=API_TIMEOUT_SECONDS,
    messages=[
        {"role": "user", "content": f"Classify this support ticket as billing, technical, or general: '{TICKET}'"}
    ]
)

# With assistant prefill. Both requests carry the same ticket so the
# comparison is like for like. The response continues from the prefilled
# text, so prepend '{"category": "' before parsing the result as JSON.
response_with_prefill = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    timeout=API_TIMEOUT_SECONDS,
    messages=[
        {"role": "user", "content": f"Classify this support ticket as billing, technical, or general and provide a brief reason. Return JSON only. Ticket: '{TICKET}'"},
        {"role": "assistant", "content": '{"category": "'}
    ]
)

print(f"Without prefill, output tokens: {response_no_prefill.usage.output_tokens}")
print(f"With prefill, output tokens: {response_with_prefill.usage.output_tokens}")
print(f"Token savings: {response_no_prefill.usage.output_tokens - response_with_prefill.usage.output_tokens}")
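One detail worth sketching: with prefill, the API returns only the continuation, so the prefilled text has to be rejoined with the response before it parses as JSON. The helper name parse_prefilled_json and the sample continuation string below are hypothetical stand-ins for the text you would get from extract_text(response).

```python
import json

PREFILL = '{"category": "'


def parse_prefilled_json(prefill: str, completion: str) -> dict:
    """Rejoin the prefill with the model's continuation and parse the result as JSON."""
    return json.loads(prefill + completion)


# Hypothetical continuation, standing in for extract_text(response_with_prefill)
sample_completion = 'technical", "reason": "Login fails after password reset"}'
parsed = parse_prefilled_json(PREFILL, sample_completion)
print(parsed["category"])
```

If a tight max_tokens value truncates the response, json.loads raises json.JSONDecodeError, so production code should catch that error and retry or escalate.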
Exact-Match Caching with Redis
Why Caching Matters
Caching strategies that match prompts against previously seen requests can eliminate redundant API calls entirely. The simplest approach, and the one implemented below, is exact-match caching, where each prompt is hashed and matched against cached keys. This captures identical repeat queries effectively.
Note: exact-match caching will not match semantically equivalent but differently worded prompts ("What is your return policy?" vs. "How do returns work?"). True semantic caching requires an embedding model and a vector store (e.g., pgvector or Redis Vector Similarity Search) with a cosine similarity threshold, which is beyond the scope of this implementation.
Implementing an Exact-Match Redis Cache Layer
The architecture is direct: hash each incoming prompt, check Redis for a matching key, and return the cached response on a hit or call the Claude API on a miss. Set TTL values to match how often the underlying content changes. Static reference data can tolerate TTLs of hours or days; queries about frequently changing information need shorter windows. Cache hit rate is highly application-specific and depends entirely on your traffic patterns and prompt diversity. Measure your actual prompt diversity before projecting savings.
Warning: The code below falls back to direct API calls when Redis is unreachable or a cache operation fails, but it swallows those errors silently. In production, log cache failures so a degraded cache does not go unnoticed. Also configure Redis authentication if your instance is exposed to a network.
import hashlib
import json
import time

import redis
import anthropic

client = anthropic.Anthropic()
API_TIMEOUT_SECONDS = 30.0

try:
    cache = redis.Redis(host="localhost", port=6379, db=0)
    cache.ping()
except redis.exceptions.ConnectionError:
    print("WARNING: Redis unavailable. Caching disabled; all calls go to API.")
    cache = None

CACHE_TTL_SECONDS = 3600  # 1 hour for general queries

PRICING = {
    "claude-3-5-haiku-20241022": {"input": 0.25, "output": 1.25},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-opus-4-20250514": {"input": 15.00, "output": 75.00},
}


def extract_text(response) -> str:
    """Safely extract text from a Claude response, raising clearly on unexpected structure."""
    if not response.content:
        raise ValueError(f"Empty content in response. Stop reason: {response.stop_reason}")
    block = response.content[0]
    if block.type != "text":
        raise ValueError(f"Expected text block, got: {block.type}")
    return block.text


def get_prompt_hash(prompt: str) -> str:
    # Do NOT lowercase: case may be semantically significant
    return hashlib.sha256(prompt.strip().encode()).hexdigest()


def cached_claude_call(prompt: str, model: str = "claude-sonnet-4-20250514", max_tokens: int = 300):
    # Include model in cache key to prevent cross-model cache collisions
    cache_key = f"claude:{model}:{get_prompt_hash(prompt)}"

    # Check cache
    if cache is not None:
        try:
            cached = cache.get(cache_key)
            if cached:
                result = json.loads(cached)
                print(f"CACHE HIT | Saved ~${result.get('estimated_cost', 0):.6f}")
                return result["response"]
        except redis.exceptions.RedisError:
            pass  # Fall through to API call

    # Cache miss: call the API
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        timeout=API_TIMEOUT_SECONDS,
        messages=[{"role": "user", "content": prompt}]
    )
    response_text = extract_text(response)

    # Unknown model IDs fall back to Sonnet 4 pricing for the estimate
    pricing = PRICING.get(model, PRICING["claude-sonnet-4-20250514"])
    input_cost = (response.usage.input_tokens / 1_000_000) * pricing["input"]
    output_cost = (response.usage.output_tokens / 1_000_000) * pricing["output"]
    total_cost = input_cost + output_cost

    cache_entry = {
        "response": response_text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "estimated_cost": total_cost,
        "cached_at": time.time()
    }
    if cache is not None:
        try:
            cache.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(cache_entry))
        except redis.exceptions.RedisError:
            pass  # Cache write failure is non-fatal

    print(f"CACHE MISS | Cost: ${total_cost:.6f} | Tokens: {response.usage.input_tokens}in/{response.usage.output_tokens}out")
    return response_text
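The cache key above covers only the model and the prompt text. If the same prompt can be sent with different system prompts or max_tokens values, a cached answer from one configuration could be served for another. One possible extension, sketched here with a hypothetical helper named build_cache_key, hashes every request field that can change the response.

```python
import hashlib
import json


def build_cache_key(model: str, prompt: str, system: str = "", max_tokens: int = 300) -> str:
    """Hash every request field that can change the response into one cache key."""
    payload = json.dumps(
        {
            "model": model,
            "system": system,
            "max_tokens": max_tokens,
            "prompt": prompt.strip(),  # same whitespace handling as get_prompt_hash
        },
        sort_keys=True,  # stable field order so equal requests hash identically
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"claude:{model}:{digest}"


key_a = build_cache_key("claude-sonnet-4-20250514", "What is your return policy?")
key_b = build_cache_key("claude-sonnet-4-20250514", "What is your return policy?", system="Role: support agent.")
print(key_a != key_b)
```

Adding fields to the key lowers the hit rate slightly but prevents stale cross-configuration hits, which is usually the safer trade.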
Anthropic's Built-in Prompt Caching
Anthropic offers native prompt caching that operates at the infrastructure level. Developers mark cacheable sections of their requests using cache_control breakpoints, typically on large system prompts or static context blocks. When subsequent requests reuse the same cached prefix, the API charges cached input tokens (reads) at 10% of the base input token rate. Anthropic bills cache write requests (cache_creation_input_tokens) at a premium above the base rate; consult Anthropic's current pricing documentation for the exact write multiplier.
Prompt caching requires a minimum cacheable prefix length (consult Anthropic's documentation for the current minimum, typically 1,024 tokens for Claude Sonnet). The * 3 in the example below is illustrative only; replace with your actual large context in production.
The response metadata includes cache_creation_input_tokens (charged at the write premium on first cache population) and cache_read_input_tokens (charged at the discounted rate on subsequent hits). Native caching and application-level Redis caching are complementary: Anthropic's caching reduces the per-token cost of cache misses in the Redis layer, while Redis prevents redundant API calls entirely.
import anthropic
client = anthropic.Anthropic()
API_TIMEOUT_SECONDS = 30.0
# Illustrative system prompt. In production, replace it with your actual
# large context block. The prompt must meet the minimum cacheable token
# threshold (typically 1,024 tokens for Sonnet) for caching to activate.
# As written, this sample is only a few hundred tokens even after * 3, so
# both cache counters below report 0 until you substitute a larger context.
large_system_prompt = """You are an expert financial analyst assistant. You have deep knowledge
of accounting standards (GAAP, IFRS), financial modeling, valuation methodologies, and market
analysis techniques. Your responses should reference specific standards where applicable
and provide quantitative analysis when possible. Always cite the relevant accounting standard
numbers. Format all financial figures with appropriate precision.""" * 3

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    timeout=API_TIMEOUT_SECONDS,
    system=[
        {
            "type": "text",
            "text": large_system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What is the current GAAP treatment for lease accounting under ASC 842?"}
    ]
)
# Safely access cache usage attributes (not present on all responses or SDK versions)
cache_creation = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0
print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation tokens: {cache_creation}")
print(f"Cache read tokens: {cache_read}")
print(f"Output tokens: {response.usage.output_tokens}")
# Note: cache_creation_input_tokens are billed at a write premium above the
# base input rate. Include this cost when calculating first-request expense.
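To see how the three usage counters translate into input cost, the following sketch prices a first request (cache write) against a later request (cache read). The function name input_cost_with_caching is hypothetical, and the write_multiplier default of 1.25 is a placeholder assumption; substitute the write premium from Anthropic's current pricing documentation. The 10% read rate and the Sonnet 4 base price come from the text.

```python
def input_cost_with_caching(
    uncached_tokens: int,
    cache_write_tokens: int,
    cache_read_tokens: int,
    base_price_per_mtok: float = 3.00,  # Sonnet 4 input price from the table
    read_multiplier: float = 0.10,      # cache reads billed at 10% of base
    write_multiplier: float = 1.25,     # ASSUMPTION: placeholder write premium
) -> float:
    """Input-side cost in USD for one request, given its three usage counters."""
    per_token = base_price_per_mtok / 1_000_000
    billable = (
        uncached_tokens
        + cache_write_tokens * write_multiplier
        + cache_read_tokens * read_multiplier
    )
    return per_token * billable


# Hypothetical counters: a 2,000-token cached prefix plus a 50-token question
first_request = input_cost_with_caching(50, cache_write_tokens=2_000, cache_read_tokens=0)
later_request = input_cost_with_caching(50, cache_write_tokens=0, cache_read_tokens=2_000)
print(f"First request (cache write): ${first_request:.6f}")
print(f"Later request (cache read):  ${later_request:.6f}")
```

Pass response.usage.input_tokens, cache_creation_input_tokens, and cache_read_input_tokens as the three counters; the cached and uncached counts are reported separately, so they add up rather than overlap.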
Intelligent Model Routing and Tiering
Classifying Requests by Complexity
Not every request warrants the capability or cost of Sonnet or Opus. Haiku 3.5, at $0.25 per million input tokens and $1.25 per million output tokens, handles classification, entity extraction, and simple question-answering well enough for many production use cases. Benchmark Haiku on your specific task and set an acceptance threshold (e.g., >95% classification accuracy) before committing to it for a given route. Sonnet fits generation and analytical tasks. Reserve Opus for complex multi-step reasoning where lower tiers demonstrably fail.
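The economics are substantial: at the prices in the table above, Haiku 3.5 costs one-twelfth as much per token as Sonnet 4 for both input ($0.25 vs. $3.00 per million) and output ($1.25 vs. $15.00 per million), a 12x saving on every request routed down a tier. Because the ratio is the same for input and output, the per-request saving does not depend on your input/output mix; overall savings depend on the share of traffic you can route to Haiku without a quality loss.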
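A per-request comparison across tiers makes the gap concrete. This sketch uses the prices from the pricing table and a token mix of 800 input and 400 output tokens; the short tier labels and the helper name cost_by_tier are illustrative assumptions.

```python
# USD per 1M tokens, copied from the pricing table in this article
PRICES_PER_MTOK = {
    "haiku-3.5": {"input": 0.25, "output": 1.25},
    "sonnet-4": {"input": 3.00, "output": 15.00},
    "opus-4": {"input": 15.00, "output": 75.00},
}


def cost_by_tier(input_tokens: int, output_tokens: int) -> dict:
    """Return the cost in USD of one request on each model tier."""
    return {
        tier: (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000
        for tier, price in PRICES_PER_MTOK.items()
    }


costs = cost_by_tier(input_tokens=800, output_tokens=400)
for tier, cost in costs.items():
    relative = cost / costs["sonnet-4"]
    print(f"{tier}: ${cost:.6f} per request ({relative:.2f}x the Sonnet 4 cost)")
```

Update PRICES_PER_MTOK from the current pricing page before relying on the ratios.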
Building a Simple Model Router
A heuristic classifier routes requests with negligible added latency: the version below runs only a local keyword scan and a word-count check. Keyword matching, prompt length thresholds, and task-type detection provide a starting point. For higher-fidelity routing, a Haiku pre-check that classifies the request's complexity before selecting the responding model adds roughly $0.00004 per routed request and can improve routing accuracy. The API does not return a confidence score, so escalating from Haiku to Sonnet requires an application-defined signal, such as a failed output validation; the example below does not implement escalation.
const Anthropic = require("@anthropic-ai/sdk");

const client = new Anthropic();

// USD per 1M tokens
const MODEL_PRICING = {
  "claude-3-5-haiku-20241022": { input: 0.25, output: 1.25 },
  "claude-sonnet-4-20250514": { input: 3.0, output: 15.0 },
  "claude-opus-4-20250514": { input: 15.0, output: 75.0 },
};

function classifyComplexity(prompt) {
  const wordCount = prompt.split(/\s+/).length;
  const complexKeywords = [
    "analyze",
    "compare",
    "synthesize",
    "evaluate",
    "design",
    "architect",
  ];
  const simpleKeywords = [
    "classify",
    "extract",
    "list",
    "define",
    "what is",
    "summarize",
  ];
  // Substring matching is deliberately crude: "list" also matches "realistic"
  const lowerPrompt = prompt.toLowerCase();
  const complexScore = complexKeywords.filter((k) =>
    lowerPrompt.includes(k)
  ).length;
  const simpleScore = simpleKeywords.filter((k) =>
    lowerPrompt.includes(k)
  ).length;

  // Opus only when BOTH complexity signals are present
  if (complexScore >= 2 && wordCount > 500) return "opus";
  // Haiku for explicitly simple tasks regardless of length
  if (simpleScore >= 1 && complexScore === 0) return "haiku";
  return "sonnet";
}

async function routedApiCall(prompt) {
  const tier = classifyComplexity(prompt);
  const modelMap = {
    haiku: "claude-3-5-haiku-20241022",
    sonnet: "claude-sonnet-4-20250514",
    opus: "claude-opus-4-20250514",
  };
  const model = modelMap[tier];

  try {
    const response = await client.messages.create({
      model,
      max_tokens: 300,
      messages: [{ role: "user", content: prompt }],
    });

    if (!response.content.length || response.content[0].type !== "text") {
      console.error("Unexpected response structure:", response.content);
      return null;
    }

    const pricing = MODEL_PRICING[model];
    const cost =
      (response.usage.input_tokens / 1_000_000) * pricing.input +
      (response.usage.output_tokens / 1_000_000) * pricing.output;

    console.log(`Routed to: ${tier} (${model})`);
    console.log(
      `Tokens: ${response.usage.input_tokens}in / ${response.usage.output_tokens}out`
    );
    console.log(`Cost: $${cost.toFixed(6)}`);
    return response.content[0].text;
  } catch (error) {
    console.error(`API call failed for model ${model}:`, error.message);
    throw error;
  }
}

// Simple task → routes to Haiku
routedApiCall("Classify this as positive or negative: 'I love this product!'").catch(
  () => process.exit(1)
);
Measuring and Monitoring Token Usage
Extracting Usage Metadata from API Responses
Every Claude API response includes a usage object with input_tokens and output_tokens fields. Parse these on every call and feed them into a cost tracker to identify optimization opportunities and detect regressions.
Note: The CSV log below writes to the current working directory. In production, set an absolute log path or use a logging framework.
import csv
from datetime import datetime, timezone
from functools import wraps

import anthropic

client = anthropic.Anthropic()
API_TIMEOUT_SECONDS = 30.0

PRICING = {
    "claude-3-5-haiku-20241022": {"input": 0.25, "output": 1.25},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-opus-4-20250514": {"input": 15.00, "output": 75.00},
}
LOG_FILE = "token_usage_log.csv"


def extract_text(response) -> str:
    """Safely extract text from a Claude response, raising clearly on unexpected structure."""
    if not response.content:
        raise ValueError(f"Empty content in response. Stop reason: {response.stop_reason}")
    block = response.content[0]
    if block.type != "text":
        raise ValueError(f"Expected text block, got: {block.type}")
    return block.text


def track_usage(func):
    """Decorator that reads model from the API response, not a separate parameter."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        response = func(*args, **kwargs)

        # Read the model actually used from the response object
        model = response.model
        # Unknown model IDs fall back to Sonnet 4 pricing for the estimate
        pricing = PRICING.get(model, PRICING["claude-sonnet-4-20250514"])
        input_cost = (response.usage.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (response.usage.output_tokens / 1_000_000) * pricing["output"]
        total_cost = input_cost + output_cost

        row = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost_usd": f"{total_cost:.6f}"
        }
        fieldnames = list(row.keys())

        with open(LOG_FILE, "a", newline="") as f:
            # Append mode starts at end of file, so position 0 means the file is empty
            is_empty = f.tell() == 0
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            if is_empty:
                writer.writeheader()
            writer.writerow(row)

        print(f"[{row['timestamp']}] {model} | {response.usage.input_tokens}in/{response.usage.output_tokens}out | ${total_cost:.6f}")
        return response
    return wrapper


@track_usage
def ask_claude(prompt):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        timeout=API_TIMEOUT_SECONDS,
        messages=[{"role": "user", "content": prompt}]
    )
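Once the log accumulates, a per-model rollup shows where spend concentrates. The sketch below assumes the CSV columns written by the tracker above (timestamp, model, input_tokens, output_tokens, cost_usd); the function name summarize_usage and the two sample rows are illustrative, not real measurements.

```python
import csv
import io
from collections import defaultdict


def summarize_usage(csv_file) -> dict:
    """Aggregate request count, tokens, and cost per model from a usage log."""
    totals = defaultdict(
        lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    )
    for row in csv.DictReader(csv_file):
        entry = totals[row["model"]]
        entry["requests"] += 1
        entry["input_tokens"] += int(row["input_tokens"])
        entry["output_tokens"] += int(row["output_tokens"])
        entry["cost_usd"] += float(row["cost_usd"])
    return dict(totals)


# Illustrative rows in the same format the tracker writes
sample_log = io.StringIO(
    "timestamp,model,input_tokens,output_tokens,cost_usd\n"
    "2025-07-01T00:00:00+00:00,claude-sonnet-4-20250514,800,400,0.008400\n"
    "2025-07-01T00:01:00+00:00,claude-sonnet-4-20250514,200,100,0.002100\n"
)
summary = summarize_usage(sample_log)
for model, stats in summary.items():
    print(f"{model}: {stats['requests']} requests, ${stats['cost_usd']:.4f}")
```

To run it against the real log, pass an open file handle, for example open("token_usage_log.csv", newline="").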
Setting Budget Alerts and Rate Limits
Application-layer spend caps provide an essential safety net. Implement daily or weekly budget thresholds that pause or degrade API calls when exceeded to prevent runaway costs from bugs, prompt injection, or unexpected traffic spikes. Anthropic's usage dashboard (see console.anthropic.com) and built-in API rate limits offer a secondary layer of protection, but relying on them exclusively is insufficient for granular cost control.
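An application-layer spend cap can be as small as a counter that resets each day. The following is a minimal in-process sketch; the class name DailyBudget and its methods are hypothetical, and a deployment with multiple workers would need a shared store (for example, a Redis counter) rather than instance state.

```python
from datetime import date


class DailyBudget:
    """Tracks spend for the current day and reports whether more calls are allowed."""

    def __init__(self, limit_usd: float, today=None):
        self.limit_usd = limit_usd
        self.day = today or date.today()
        self.spent_usd = 0.0

    def _roll_over(self, today) -> None:
        # Reset the counter when the calendar day changes
        if today != self.day:
            self.day = today
            self.spent_usd = 0.0

    def record(self, cost_usd: float, today=None) -> None:
        """Add the cost of a completed request to today's total."""
        self._roll_over(today or date.today())
        self.spent_usd += cost_usd

    def allows_request(self, today=None) -> bool:
        """Return True while today's spend is still under the limit."""
        self._roll_over(today or date.today())
        return self.spent_usd < self.limit_usd


budget = DailyBudget(limit_usd=50.00)
if budget.allows_request():
    # Call the API here, then record the cost computed from response.usage
    budget.record(0.0084)
print(f"Spent today: ${budget.spent_usd:.4f} of ${budget.limit_usd:.2f}")
```

When allows_request() returns False, the application can pause calls, serve cached answers only, or degrade to a cheaper model, depending on how critical the feature is.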
Token Savings Calculator and Optimization Checklist
Interactive Token Savings Calculator
Teams can estimate their savings by plugging four variables into a straightforward formula:
- Average prompt size (tokens)
- Average output size (tokens)
- Requests per day
- Current model tier
Monthly baseline cost = (prompt_tokens × input_price + output_tokens × output_price) × requests_per_day × 30
Apply reductions sequentially: prompt trimming reduces input tokens by approximately 20%; caching eliminates 30 to 50% of requests entirely; model routing shifts 60% of remaining requests to a cheaper tier. The formula is transparent enough that any team can implement it as a spreadsheet or internal tool.
For a concrete example: 50,000 daily requests averaging 800 input tokens and 400 output tokens on Sonnet 4 yields a baseline of $12,600 per month. After applying the full optimization stack, the same workload costs at least 60% less.
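The formula and the sequential reductions can be sketched as a short script. The inputs are the example workload and the table prices from this article; the 35% cache hit rate is one value from the 30 to 50% range, and the script assumes routed requests keep the same token profile and that Haiku costs one-twelfth of Sonnet 4 for both input and output. The function name monthly_cost is illustrative.

```python
def monthly_cost(input_tokens: float, output_tokens: float, requests_per_day: int,
                 input_price_per_mtok: float, output_price_per_mtok: float,
                 days: int = 30) -> float:
    """Monthly cost in USD; prices are per 1M tokens, token counts are per request."""
    per_request = (input_tokens * input_price_per_mtok
                   + output_tokens * output_price_per_mtok) / 1_000_000
    return per_request * requests_per_day * days


# Baseline: Sonnet 4, 50,000 requests/day, 800 input + 400 output tokens
baseline = monthly_cost(800, 400, 50_000, 3.00, 15.00)

# Layer 1: prompt trimming cuts input tokens by 20%
after_trim = monthly_cost(800 * 0.80, 400, 50_000, 3.00, 15.00)

# Layer 2: caching removes 35% of requests (ASSUMPTION: hit rate within 30-50%)
after_cache = after_trim * (1 - 0.35)

# Layer 3: 60% of remaining requests move to Haiku at 1/12 the Sonnet 4 price
haiku_share = 0.60
haiku_price_ratio = 0.25 / 3.00
after_routing = after_cache * ((1 - haiku_share) + haiku_share * haiku_price_ratio)

for label, cost in [("Baseline", baseline), ("After trimming", after_trim),
                    ("After caching", after_cache), ("After routing", after_routing)]:
    print(f"{label}: ${cost:,.0f}/month ({1 - cost / baseline:.0%} below baseline)")
```

Swap in your own token averages, hit rate, and routing share; the order of the layers matters because each percentage applies to what the previous layer left.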
Prompt Optimization Checklist
No single optimization delivers the full 60% reduction, but the combination of all twelve compounds to reach or exceed that threshold. Work through each item systematically:
- Audit system prompt length and eliminate padding
- Remove redundant or restated instructions
- Specify output format as structured JSON where possible
- Set max_tokens to the minimum viable value for each task
- Use stop sequences to prevent unnecessary generation
- Apply assistant prefill to eliminate preamble tokens
- Enable Anthropic's native prompt caching on static context
- Implement an exact-match Redis cache layer for repeated queries
- Set cache TTL values based on content volatility
- Classify incoming requests by complexity tier
- Route each request to the cheapest model that meets quality requirements
- Monitor and log per-request token usage and cost
Putting It All Together: Real-World Savings Breakdown
Before and After Cost Comparison
Consider a customer support bot handling 50,000 requests per day with an average of 800 input tokens and 400 output tokens per request, running entirely on Sonnet 4.
Baseline calculation: (800 × $3.00/1M + 400 × $15.00/1M) × 50,000 × 30 = $12,600/month.
| Optimization Layer | Monthly Cost | Incremental Savings |
|---|---|---|
| Baseline (Sonnet 4, no optimization) | $12,600 | — |
| After prompt optimization (20% input token reduction) | $11,880 | $720 (5.7%) |
| After caching (35% fewer API calls) | $7,722 | $4,158 (33%) |
| After model routing (60% of remaining requests to Haiku) | ~$3,475 | ~$4,247 (34%) |
| Total optimized | ~$3,475 | ~$9,125 (72%) |
A necessary caveat: actual savings depend entirely on the application's traffic patterns, prompt diversity, and task complexity distribution. We have seen 40 to 70% reductions across workloads with cache hit rates of 20% or higher, but your results will vary. Some optimizations compound while others overlap. Applications with highly diverse prompts will see lower cache hit rates; applications that already use Haiku for most tasks have less headroom from model routing. The rows above apply each reduction to the output of the previous layer (caching after prompt optimization, routing on un-cached requests only) and assume every routed request is acceptable on Haiku at the listed prices, so the routing row is $7,722 × (0.40 + 0.60 ÷ 12). Escalations back to Sonnet, a lower routing share, or a lower cache hit rate all shrink the total. The 60% figure is achievable but not universal; treat it as the target to validate against your own measurements.
Key Takeaways
Output tokens cost 5x input tokens for the model tiers covered here. That makes output optimization the highest-leverage target. Caching, both Anthropic's native prompt caching and application-level Redis caching, delivers the highest ROI for most production applications. Model routing adds one Haiku call (~$0.00004) per routed request when using a pre-check classifier, and heuristic routing costs even less. There is rarely a reason to send classification tasks to Opus. None of these optimizations work without per-request measurement. Track every token, every call, every dollar.

