Claude API Cost Optimization: Reduce Costs 60%

Claude API costs in production applications can spiral quickly, especially when teams deploy without a deliberate strategy around token management, caching, and model selection. Most developers overspend not because the API is inherently expensive, but because unoptimized prompts, absent caching layers, and blanket use of high-tier models waste tokens at scale. This article presents a systematic, quantified framework for Claude API token optimization that combines prompt engineering, exact-match caching with Redis, and intelligent model routing to reduce spend by 60% or more.

Prerequisites
Understanding Claude API Token Economics
Prompt Engineering for Token Reduction
Exact-Match Caching with Redis
Intelligent Model Routing and Tiering
Measuring and Monitoring Token Usage
Token Savings Calculator and Optimization Checklist
Putting It All Together: Real-World Savings Breakdown
Key Takeaways

Prerequisites

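Python ≥ 3.8 for all Python examples. Install the Anthropic SDK and redis-py (≥ 4.0.0) for the caching examples with pip install --upgrade anthropic "redis>=4.0.0". Quote version specifiers, because the shell treats an unquoted >= as output redirection. Use a current SDK release: the token-counting examples call client.messages.count_tokens(), which releases as old as 0.25.0 do not provide.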

For JavaScript examples, you need Node.js ≥ 18 and @anthropic-ai/sdk: run npm install @anthropic-ai/sdk.

Set the ANTHROPIC_API_KEY environment variable: export ANTHROPIC_API_KEY=<your_key>
Run a Redis instance at localhost:6379 for caching examples (or configure your own host/port)
Prices and model IDs in this article were verified as of July 2025. Always confirm current pricing at anthropic.com/pricing before building cost models.

Understanding Claude API Token Economics

How Claude Tokenization Works

Claude models tokenize text with an approach in the same family as Byte Pair Encoding (BPE), which most large language models use. BPE iteratively merges the most frequent pairs of subword units in the training corpus into single tokens, producing a vocabulary in which common words are single tokens and rare or compound words split into several subword units. The exact tokenizer vocabulary for Claude is not publicly released.

The practical consequence for developers: token count does not map neatly to word count. A rough rule of thumb for typical English prose is about 1.3 tokens per word, so a 1,000-word prompt consumes around 1,300 tokens. The ratio runs higher for technical jargon, URLs, code, and non-English text. Use client.messages.count_tokens() for precise counts before committing to cost estimates.

Claude bills input tokens (the prompt, system message, and any conversation history sent to the model) and output tokens (the model's generated response) at different rates. Output tokens cost five times as much per token as input tokens across the three model tiers shown in this article. Verify the current ratio for any model at anthropic.com/pricing.

Current Claude Pricing Breakdown

The Haiku model used in code examples is claude-3-5-haiku-20241022 (Haiku 3.5). The Sonnet and Opus models are claude-sonnet-4-20250514 and claude-opus-4-20250514, respectively. Always verify current pricing and available model IDs at anthropic.com/pricing before deploying.

Model	Model ID	Input Cost (per 1M tokens)	Output Cost (per 1M tokens)	Input Cost (per 1K tokens)	Output Cost (per 1K tokens)
Claude Haiku 3.5	`claude-3-5-haiku-20241022`	$0.25	$1.25	$0.00025	$0.00125
Claude Sonnet 4	`claude-sonnet-4-20250514`	$3.00	$15.00	$0.003	$0.015
Claude Opus 4	`claude-opus-4-20250514`	$15.00	$75.00	$0.015	$0.075

The key insight here is the 5x multiplier on output tokens. A response that generates 500 unnecessary output tokens on Sonnet 4 costs the same as 2,500 wasted input tokens. Optimizing output length, through format constraints, stop sequences, and assistant prefilling, delivers disproportionate savings compared to trimming input alone.
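The asymmetry can be checked with a few lines of arithmetic. The sketch below hardcodes the Sonnet 4 prices from the table above; `SONNET_4` and `request_cost` are illustrative names, not part of the Anthropic SDK.

```python
# Sonnet 4 prices from the table above, in USD per 1M tokens.
# Hardcoded for illustration; verify against the current pricing page.
SONNET_4 = {"input": 3.00, "output": 15.00}


def request_cost(input_tokens: int, output_tokens: int, prices: dict) -> float:
    """Return the estimated USD cost of one request."""
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000


wasted_output = request_cost(0, 500, SONNET_4)
wasted_input = request_cost(2_500, 0, SONNET_4)

print(f"500 wasted output tokens:  ${wasted_output:.4f}")
print(f"2,500 wasted input tokens: ${wasted_input:.4f}")
```

Both lines print the same amount, which is why a modest cut in output length can outweigh a much larger cut in prompt length.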

Note: Pricing is subject to change. If you hardcode these values, periodically verify them against Anthropic's current pricing page.

import anthropic

client = anthropic.Anthropic()

prompt_text = "Explain the key differences between REST and GraphQL APIs in detail."

# Count tokens before sending to estimate cost
result = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    messages=[{"role": "user", "content": prompt_text}]
)
token_count = result.input_tokens

# Estimate cost assuming Sonnet 4 pricing
input_cost_per_token = 3.00 / 1_000_000
estimated_input_cost = token_count * input_cost_per_token

print(f"Token count: {token_count}")
print(f"Estimated input cost (Sonnet 4): ${estimated_input_cost:.6f}")

Prompt Engineering for Token Reduction

Eliminating Verbose System Prompts

System prompts are a persistent source of token waste. Many production applications carry system prompts bloated with redundant instructions, restated constraints, and conversational padding that consume tokens on every single request without improving model behavior.

Strip your system prompt to its minimal functional form and test whether output quality degrades. In our tests, we compressed a 500-token system prompt to 180 tokens with identical behavioral results by removing filler phrases, consolidating overlapping instructions, and using structured shorthand instead of natural language paragraphs. Measure your own prompts with count_tokens() to confirm.

Replace prose-style instructions with terse role definitions and bulleted constraints. The model does not need polite framing or motivation; it needs clear directives.

import anthropic

client = anthropic.Anthropic()

# Verbose system prompt (illustrative; its real token count is printed below)
verbose_system = """You are a highly knowledgeable and professional customer support assistant
working for TechCorp Inc. Your primary responsibility is to help customers with their questions
about our software products. You should always be polite, thorough, and helpful in your responses.
Please make sure to provide accurate information based on what you know about our products.
If you don't know the answer to something, please let the customer know that you will escalate
their issue to a human agent. Always maintain a professional and friendly tone throughout the
conversation. Do not provide information about competitor products. Do not make promises about
future features or releases. Keep your responses focused and relevant to the customer's question."""

# Optimized system prompt (same rules in structured shorthand)
optimized_system = """Role: TechCorp customer support agent.
Rules:
- Answer product questions accurately
- Unknown answers → escalate to human agent
- No competitor info
- No future feature promises
- Professional tone, concise responses"""

# Compare token counts — pass system prompts as the system parameter, not as user messages
verbose_count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    system=verbose_system,
    messages=[{"role": "user", "content": "Hello"}]
).input_tokens

optimized_count = client.messages.count_tokens(
    model="claude-sonnet-4-20250514",
    system=optimized_system,
    messages=[{"role": "user", "content": "Hello"}]
).input_tokens

savings = verbose_count - optimized_count

print(f"Verbose: {verbose_count} tokens | Optimized: {optimized_count} tokens")
print(f"Savings per request: {savings} tokens ({(savings/verbose_count)*100:.0f}% reduction)")

Constraining Output Length and Format

Treat max_tokens as an active optimization tool, not merely a safety net against runaway responses. Setting it to the minimum viable length for a given task prevents the model from generating verbose completions that inflate output token costs.
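One way to find the "minimum viable" value is to derive it from output sizes you have already logged for a task. The sketch below is a heuristic under stated assumptions: `suggest_max_tokens` and `was_truncated` are hypothetical helpers, the 95th-percentile and 10% headroom defaults are arbitrary starting points, and the sample sizes are made up.

```python
def suggest_max_tokens(observed_output_tokens, percentile_pct=95, headroom_pct=10):
    """Suggest a max_tokens cap from logged output sizes.

    Takes the nearest-rank percentile of the observations and adds headroom.
    Integer arithmetic avoids floating-point rounding surprises.
    """
    if not observed_output_tokens:
        raise ValueError("need at least one observed output size")
    ordered = sorted(observed_output_tokens)
    # Nearest-rank percentile: ceil(pct / 100 * n), clamped to at least rank 1.
    rank = max(1, (percentile_pct * len(ordered) + 99) // 100)
    base = ordered[rank - 1]
    # Ceiling of base * (1 + headroom_pct / 100).
    return (base * (100 + headroom_pct) + 99) // 100


def was_truncated(stop_reason):
    """True when a response hit the max_tokens cap instead of finishing."""
    return stop_reason == "max_tokens"


# Hypothetical output sizes pulled from usage logs for one task type.
observed = [40, 42, 45, 50, 52, 55, 60, 61, 64, 120]

print(suggest_max_tokens(observed))                     # cap covering the long tail
print(suggest_max_tokens(observed, percentile_pct=90))  # tighter cap, more truncation risk
```

After lowering the cap, track how often `response.stop_reason` equals "max_tokens"; a rising rate suggests the cap is cutting off useful output.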

Instructing Claude to return structured JSON instead of prose can cut output tokens by 40 to 70% on extraction and classification tasks, though actual compression depends on task type and prompt structure. Measure on your own workload to confirm. A natural language answer that consumes 200 tokens often compresses to 60 to 80 tokens as a JSON object containing only the essential data fields.

Stop sequences provide an additional lever. By specifying delimiters that signal the end of useful output, applications prevent the model from appending unnecessary elaboration.

Prerequisites for JavaScript examples: Node.js ≥18. Run npm install @anthropic-ai/sdk before executing.

const Anthropic = require("@anthropic-ai/sdk");

const client = new Anthropic();

async function optimizedApiCall() {
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 150,
    stop_sequences: ["\n\n---"],
    messages: [
      {
        role: "user",
        content: `Extract the product name, price, and availability from this text.
Return ONLY valid JSON with keys: product, price, available.

Text: "The UltraWidget Pro is currently in stock at $49.99 with free shipping."`,
      },
    ],
  });

  if (!response.content.length || response.content[0].type !== "text") {
    console.error("Unexpected response structure:", response.content);
    return null;
  }

  console.log("Response:", response.content[0].text);
  console.log("Input tokens:", response.usage.input_tokens);
  console.log("Output tokens:", response.usage.output_tokens);

  const outputCost = (response.usage.output_tokens / 1_000_000) * 15.0;
  const inputCost = (response.usage.input_tokens / 1_000_000) * 3.0;
  console.log(`Estimated cost: $${(inputCost + outputCost).toFixed(6)}`);
}

optimizedApiCall();

Prefilling Assistant Responses

The assistant message prefill technique involves providing a partial response in the assistant role that steers the model's output from the first token. This constrains the model's generation search space, eliminating preamble tokens like "Here is the JSON response:" or "Sure, I'd be happy to help with that."

When prefilling with a partial JSON structure, the model completes the structure directly, saving tokens that would otherwise be spent on conversational framing.

import anthropic

client = anthropic.Anthropic()

API_TIMEOUT_SECONDS = 30.0


def extract_text(response) -> str:
    """Safely extract text from a Claude response, raising clearly on unexpected structure."""
    if not response.content:
        raise ValueError(f"Empty content in response. Stop reason: {response.stop_reason}")
    block = response.content[0]
    if block.type != "text":
        raise ValueError(f"Expected text block, got: {block.type}")
    return block.text


# Without prefill
response_no_prefill = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    timeout=API_TIMEOUT_SECONDS,
    messages=[
        {"role": "user", "content": "Classify this support ticket as billing, technical, or general: 'I can't log into my account after resetting my password'"}
    ]
)

# With assistant prefill
response_with_prefill = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=100,
    timeout=API_TIMEOUT_SECONDS,
    messages=[
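        {"role": "user", "content": "Classify this support ticket as billing, technical, or general and provide a brief reason. Return JSON only. Ticket: 'I can't log into my account after resetting my password'"},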
        {"role": "assistant", "content": '{"category": "'}
    ]
)

print(f"Without prefill — output tokens: {response_no_prefill.usage.output_tokens}")
print(f"With prefill — output tokens: {response_with_prefill.usage.output_tokens}")
print(f"Token savings: {response_no_prefill.usage.output_tokens - response_with_prefill.usage.output_tokens}")
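One practical detail when using a prefill: the response contains only the continuation, so the prefilled text has to be prepended before parsing. A minimal sketch, where `join_prefill` is a hypothetical helper and the continuation string is a stand-in for a real model response:

```python
import json

PREFILL = '{"category": "'


def join_prefill(prefill: str, continuation: str) -> dict:
    """Rebuild the full JSON document from the prefill and the model's continuation."""
    return json.loads(prefill + continuation)


# Stand-in for the text of a prefilled response; a real continuation will differ.
example_continuation = 'technical", "reason": "Login failure after password reset"}'

ticket = join_prefill(PREFILL, example_continuation)
print(ticket["category"])
```

If the continuation is cut off by max_tokens, `json.loads` raises `json.JSONDecodeError`, so callers may want to catch it and retry with a larger cap.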

Exact-Match Caching with Redis

Why Caching Matters

Caching strategies that match prompts against previously seen requests can eliminate redundant API calls entirely. The simplest approach, and the one implemented below, is exact-match caching, where each prompt is hashed and matched against cached keys. This captures identical repeat queries effectively.

Note: exact-match caching will not match semantically equivalent but differently worded prompts ("What is your return policy?" vs. "How do returns work?"). True semantic caching requires an embedding model and a vector store (e.g., pgvector or Redis Vector Similarity Search) with a cosine similarity threshold, which is beyond the scope of this implementation.

Implementing an Exact-Match Redis Cache Layer

The architecture is direct: hash each incoming prompt, check Redis for a matching key, and return the cached response on a hit or call the Claude API on a miss. Set TTL values to match how often the underlying content changes. Static reference data can tolerate TTLs of hours or days; queries about frequently changing information need shorter windows. Cache hit rate is highly application-specific and depends entirely on your traffic patterns and prompt diversity. Measure your actual prompt diversity before projecting savings.

Note: The example below tolerates Redis failures: if the startup connection check or any cache read or write fails, the call falls through to the Claude API. In production, log those failures instead of silently passing, and configure Redis authentication if your instance is exposed to a network.

import hashlib
import json
import time
import redis
import anthropic

client = anthropic.Anthropic()

API_TIMEOUT_SECONDS = 30.0

try:
    cache = redis.Redis(host="localhost", port=6379, db=0)
    cache.ping()
except redis.exceptions.ConnectionError:
    print("WARNING: Redis unavailable. Caching disabled; all calls go to API.")
    cache = None

CACHE_TTL_SECONDS = 3600  # 1 hour for general queries

PRICING = {
    "claude-3-5-haiku-20241022": {"input": 0.25, "output": 1.25},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-opus-4-20250514": {"input": 15.00, "output": 75.00},
}


def extract_text(response) -> str:
    """Safely extract text from a Claude response, raising clearly on unexpected structure."""
    if not response.content:
        raise ValueError(f"Empty content in response. Stop reason: {response.stop_reason}")
    block = response.content[0]
    if block.type != "text":
        raise ValueError(f"Expected text block, got: {block.type}")
    return block.text


def get_prompt_hash(prompt: str) -> str:
    # Do NOT lowercase: case may be semantically significant
    return hashlib.sha256(prompt.strip().encode()).hexdigest()


def cached_claude_call(prompt: str, model: str = "claude-sonnet-4-20250514", max_tokens: int = 300):
    # Include model in cache key to prevent cross-model cache collisions
    cache_key = f"claude:{model}:{get_prompt_hash(prompt)}"

    # Check cache
    if cache is not None:
        try:
            cached = cache.get(cache_key)
            if cached:
                result = json.loads(cached)
                print(f"CACHE HIT | Saved ~${result.get('estimated_cost', 0):.6f}")
                return result["response"]
        except redis.exceptions.RedisError:
            pass  # Fall through to API call

    # Cache miss — call API
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        timeout=API_TIMEOUT_SECONDS,
        messages=[{"role": "user", "content": prompt}]
    )

    response_text = extract_text(response)

    pricing = PRICING.get(model, PRICING["claude-sonnet-4-20250514"])
    input_cost = (response.usage.input_tokens / 1_000_000) * pricing["input"]
    output_cost = (response.usage.output_tokens / 1_000_000) * pricing["output"]
    total_cost = input_cost + output_cost

    cache_entry = {
        "response": response_text,
        "input_tokens": response.usage.input_tokens,
        "output_tokens": response.usage.output_tokens,
        "estimated_cost": total_cost,
        "cached_at": time.time()
    }

    if cache is not None:
        try:
            cache.setex(cache_key, CACHE_TTL_SECONDS, json.dumps(cache_entry))
        except redis.exceptions.RedisError:
            pass  # Cache write failure is non-fatal

    print(f"CACHE MISS | Cost: ${total_cost:.6f} | Tokens: {response.usage.input_tokens}in/{response.usage.output_tokens}out")

    return response_text
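The cache key above covers only the model and the prompt text. If requests also vary by system prompt or max_tokens, two different requests could share one cached answer. One option, sketched here with a hypothetical `request_cache_key` helper, is to hash every parameter that can change the response:

```python
import hashlib
import json


def request_cache_key(model: str, prompt: str, system: str = "", max_tokens: int = 300) -> str:
    """Build a cache key from every parameter that can change the response."""
    payload = json.dumps(
        {
            "model": model,
            "system": system,
            "max_tokens": max_tokens,
            "prompt": prompt.strip(),  # same whitespace handling as get_prompt_hash
        },
        sort_keys=True,      # stable key order, so equal requests hash equally
        ensure_ascii=False,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"claude:{model}:{digest}"


key = request_cache_key("claude-sonnet-4-20250514", "What is your return policy?")
print(key)
```

Extend the dictionary with any other parameter your application varies per request, such as temperature or stop sequences.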

Anthropic's Built-in Prompt Caching

Anthropic offers native prompt caching that operates at the infrastructure level. Developers mark cacheable sections of their requests using cache_control breakpoints, typically on large system prompts or static context blocks. When subsequent requests reuse the same cached prefix, the API charges cached input tokens (reads) at 10% of the base input token rate. Anthropic bills cache write requests (cache_creation_input_tokens) at a premium above the base rate; consult Anthropic's current pricing documentation for the exact write multiplier.

Prompt caching requires a minimum cacheable prefix length (consult Anthropic's documentation for the current minimum, typically 1,024 tokens for Claude Sonnet). The * 3 in the example below is illustrative only: the repeated string is still only a few hundred tokens, well short of that minimum, so both cache counters will report 0 until you replace it with your actual large context.

The response metadata includes cache_creation_input_tokens (charged at the write premium on first cache population) and cache_read_input_tokens (charged at the discounted rate on subsequent hits). Native caching and application-level Redis caching are complementary: Anthropic's caching reduces the per-token cost of cache misses in the Redis layer, while Redis prevents redundant API calls entirely.

import anthropic

client = anthropic.Anthropic()

API_TIMEOUT_SECONDS = 30.0

# Illustrative large system prompt. In production, replace with your actual
# large context block. The prompt must meet the minimum cacheable token
# threshold (typically 1,024 tokens for Sonnet) for caching to activate.
large_system_prompt = """You are an expert financial analyst assistant. You have deep knowledge
of accounting standards (GAAP, IFRS), financial modeling, valuation methodologies, and market
analysis techniques. Your responses should reference specific standards where applicable
and provide quantitative analysis when possible. Always cite the relevant accounting standard
numbers. Format all financial figures with appropriate precision.""" * 3

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=200,
    timeout=API_TIMEOUT_SECONDS,
    system=[
        {
            "type": "text",
            "text": large_system_prompt,
            "cache_control": {"type": "ephemeral"}
        }
    ],
    messages=[
        {"role": "user", "content": "What is the current GAAP treatment for lease accounting under ASC 842?"}
    ]
)

# Safely access cache usage attributes (not present on all responses or SDK versions)
cache_creation = getattr(response.usage, "cache_creation_input_tokens", 0) or 0
cache_read = getattr(response.usage, "cache_read_input_tokens", 0) or 0

print(f"Input tokens: {response.usage.input_tokens}")
print(f"Cache creation tokens: {cache_creation}")
print(f"Cache read tokens: {cache_read}")
print(f"Output tokens: {response.usage.output_tokens}")

# Note: cache_creation_input_tokens are billed at a write premium above the
# base input rate. Include this cost when calculating first-request expense.
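To fold these usage fields into a cost estimate, the three token counts can be weighted separately. This is a sketch under assumptions: it treats input_tokens as the uncached portion of the prompt, uses the 10% read rate stated above, and takes the write multiplier as a parameter; the 1.25 used in the example is an assumed value to confirm on the pricing page. `input_cost_with_caching` is a hypothetical helper.

```python
def input_cost_with_caching(
    input_tokens,
    cache_read_tokens,
    cache_creation_tokens,
    base_input_price,
    cache_write_multiplier,
    cache_read_multiplier=0.10,
):
    """Estimate the USD input cost of one request. Prices are per 1M tokens."""
    billable = (
        input_tokens
        + cache_read_tokens * cache_read_multiplier
        + cache_creation_tokens * cache_write_multiplier
    )
    return billable * base_input_price / 1_000_000


SONNET_INPUT = 3.00              # USD per 1M tokens, from the pricing table
ASSUMED_WRITE_MULTIPLIER = 1.25  # assumption for illustration only

# A 2,000-token cached prefix plus a 20-token question.
first = input_cost_with_caching(20, 0, 2_000, SONNET_INPUT, ASSUMED_WRITE_MULTIPLIER)
repeat = input_cost_with_caching(20, 2_000, 0, SONNET_INPUT, ASSUMED_WRITE_MULTIPLIER)
uncached = input_cost_with_caching(2_020, 0, 0, SONNET_INPUT, ASSUMED_WRITE_MULTIPLIER)

print(f"First request (cache write): ${first:.5f}")
print(f"Repeat request (cache read): ${repeat:.5f}")
print(f"Same request without caching: ${uncached:.5f}")
```

The first request costs more than an uncached one, so native caching pays off only when the prefix is reused before the cache entry expires.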

Intelligent Model Routing and Tiering

Classifying Requests by Complexity

Not every request warrants the capability or cost of Sonnet or Opus. Haiku 3.5, at $0.25 per million input tokens and $1.25 per million output tokens, handles classification, entity extraction, and simple question-answering well enough for many production use cases. Benchmark Haiku on your specific task and set an acceptance threshold (e.g., >95% classification accuracy) before committing to it for a given route. Sonnet fits generation and analytical tasks. Reserve Opus for complex multi-step reasoning where lower tiers demonstrably fail.

The economics are substantial: at the prices in the table above, a request routed from Sonnet 4 to Haiku costs 12x less on both input tokens ($3.00 vs. $0.25 per million) and output tokens ($15.00 vs. $1.25 per million). Overall savings depend on the share of traffic you can route to the cheaper tier; use the pricing table to calculate the exact figure for your workload.
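The effect of routing only part of the traffic can be estimated with a one-line formula. The sketch below assumes routed requests use the same token counts on either model and that the price ratio is the same for input and output, as it is for Sonnet 4 versus Haiku in the table above; `blended_cost_fraction` is a hypothetical helper.

```python
def blended_cost_fraction(share_routed: float, price_ratio: float) -> float:
    """Fraction of the original cost that remains after routing.

    share_routed: portion of requests moved to the cheaper model (0.0 to 1.0)
    price_ratio:  expensive price / cheap price (12 at the table's prices)
    """
    if not 0.0 <= share_routed <= 1.0:
        raise ValueError("share_routed must be between 0 and 1")
    if price_ratio <= 0:
        raise ValueError("price_ratio must be positive")
    return (1.0 - share_routed) + share_routed / price_ratio


for share in (0.25, 0.50, 0.75):
    remaining = blended_cost_fraction(share, 12)
    print(f"{share:.0%} routed -> {1 - remaining:.1%} saved")
```

The unrouted share dominates the result, so raising the routed share usually matters more than finding an even cheaper model.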

Building a Simple Model Router

A heuristic classifier routes requests without adding latency beyond a keyword scan and a word count. Keyword matching, prompt length thresholds, and task-type detection provide a starting point. For higher-fidelity routing, a Haiku pre-check that classifies the request's complexity before selecting the responding model adds roughly $0.00004 per routed request while improving routing accuracy. The example below implements only the heuristic classifier; a production router would also add fallback logic so that a Haiku response that fails a quality check escalates to Sonnet.

const Anthropic = require("@anthropic-ai/sdk");

const client = new Anthropic();

const MODEL_PRICING = {
  "claude-3-5-haiku-20241022": { input: 0.25, output: 1.25 },
  "claude-sonnet-4-20250514": { input: 3.0, output: 15.0 },
  "claude-opus-4-20250514": { input: 15.0, output: 75.0 },
};

function classifyComplexity(prompt) {
  const wordCount = prompt.split(/\s+/).length;
  const complexKeywords = [
    "analyze",
    "compare",
    "synthesize",
    "evaluate",
    "design",
    "architect",
  ];
  const simpleKeywords = [
    "classify",
    "extract",
    "list",
    "define",
    "what is",
    "summarize",
  ];

  const lowerPrompt = prompt.toLowerCase();
  const complexScore = complexKeywords.filter((k) =>
    lowerPrompt.includes(k)
  ).length;
  const simpleScore = simpleKeywords.filter((k) =>
    lowerPrompt.includes(k)
  ).length;

  // Opus only when BOTH complexity signals are present
  if (complexScore >= 2 && wordCount > 500) return "opus";
  // Haiku for explicitly simple tasks regardless of length
  if (simpleScore >= 1 && complexScore === 0) return "haiku";
  return "sonnet";
}

async function routedApiCall(prompt) {
  const tier = classifyComplexity(prompt);
  // Keys in MODEL_PRICING must match these model IDs exactly
  const modelMap = {
    haiku: "claude-3-5-haiku-20241022",
    sonnet: "claude-sonnet-4-20250514",
    opus: "claude-opus-4-20250514",
  };
  const model = modelMap[tier];

  try {
    const response = await client.messages.create({
      model,
      max_tokens: 300,
      messages: [{ role: "user", content: prompt }],
    });

    if (!response.content.length || response.content[0].type !== "text") {
      console.error("Unexpected response structure:", response.content);
      return null;
    }

    const pricing = MODEL_PRICING[model];
    const cost =
      (response.usage.input_tokens / 1_000_000) * pricing.input +
      (response.usage.output_tokens / 1_000_000) * pricing.output;

    console.log(`Routed to: ${tier} (${model})`);
    console.log(
      `Tokens: ${response.usage.input_tokens}in / ${response.usage.output_tokens}out`
    );
    console.log(`Cost: $${cost.toFixed(6)}`);

    return response.content[0].text;
  } catch (error) {
    console.error(`API call failed for model ${model}:`, error.message);
    throw error;
  }
}

// Simple task → routes to Haiku
routedApiCall("Classify this as positive or negative: 'I love this product!'");
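The fallback escalation mentioned before the router is not part of the JavaScript example. A minimal Python sketch of the idea follows. It substitutes a caller-supplied acceptance check for a confidence score, and `answer_with_fallback`, `looks_like_label_json`, and `fake_call_model` are all hypothetical names; the stub stands in for a real API call so the logic can run offline.

```python
import json
from typing import Callable, Sequence, Tuple


def answer_with_fallback(
    prompt: str,
    call_model: Callable[[str, str], str],
    is_acceptable: Callable[[str], bool],
    tiers: Sequence[str] = ("haiku", "sonnet", "opus"),
) -> Tuple[str, str]:
    """Try tiers cheapest-first; return (tier, answer) for the first acceptable answer.

    If no tier passes the check, the last tier's answer is returned as-is.
    """
    if not tiers:
        raise ValueError("tiers must not be empty")
    answer = ""
    for tier in tiers:
        answer = call_model(tier, prompt)
        if is_acceptable(answer):
            return tier, answer
    return tiers[-1], answer


def looks_like_label_json(text: str) -> bool:
    """Example quality check: the answer must be a JSON object with a 'label' key."""
    try:
        parsed = json.loads(text)
    except ValueError:
        return False
    return isinstance(parsed, dict) and "label" in parsed


def fake_call_model(tier: str, prompt: str) -> str:
    # Stand-in for a real API call: the cheapest tier returns unusable text.
    return "I think it's positive." if tier == "haiku" else '{"label": "positive"}'


tier, answer = answer_with_fallback(
    "Classify this as positive or negative: 'I love this product!'",
    fake_call_model,
    looks_like_label_json,
)
print(tier, answer)
```

An escalated request pays for every tier it tried, so the acceptance check should be cheap and the escalation rate worth logging alongside token usage.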

Measuring and Monitoring Token Usage

Extracting Usage Metadata from API Responses

Every Claude API response includes a usage object with input_tokens and output_tokens fields. Parse these on every call and feed them into a cost tracker to identify optimization opportunities and detect regressions.

Note: The CSV log below writes to the current working directory. In production, set an absolute log path or use a logging framework.

import anthropic
import csv
from datetime import datetime, timezone
from functools import wraps

client = anthropic.Anthropic()

API_TIMEOUT_SECONDS = 30.0

PRICING = {
    "claude-3-5-haiku-20241022": {"input": 0.25, "output": 1.25},
    "claude-sonnet-4-20250514": {"input": 3.00, "output": 15.00},
    "claude-opus-4-20250514": {"input": 15.00, "output": 75.00},
}

LOG_FILE = "token_usage_log.csv"


def extract_text(response) -> str:
    """Safely extract text from a Claude response, raising clearly on unexpected structure."""
    if not response.content:
        raise ValueError(f"Empty content in response. Stop reason: {response.stop_reason}")
    block = response.content[0]
    if block.type != "text":
        raise ValueError(f"Expected text block, got: {block.type}")
    return block.text


def track_usage(func):
    """Decorator that reads model from the API response, not a separate parameter."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        response = func(*args, **kwargs)
        # Read the model actually used from the response object
        model = response.model
        pricing = PRICING.get(model, PRICING["claude-sonnet-4-20250514"])

        input_cost = (response.usage.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (response.usage.output_tokens / 1_000_000) * pricing["output"]
        total_cost = input_cost + output_cost

        row = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "input_tokens": response.usage.input_tokens,
            "output_tokens": response.usage.output_tokens,
            "cost_usd": f"{total_cost:.6f}"
        }

        fieldnames = list(row.keys())
        with open(LOG_FILE, "a", newline="") as f:
            # Use file position to determine if header is needed (avoids TOCTOU race)
            is_empty = f.tell() == 0
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            if is_empty:
                writer.writeheader()
            writer.writerow(row)

        print(f"[{row['timestamp']}] {model} | {response.usage.input_tokens}in/{response.usage.output_tokens}out | ${total_cost:.6f}")
        return response
    return wrapper


@track_usage
def ask_claude(prompt):
    return client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=200,
        timeout=API_TIMEOUT_SECONDS,
        messages=[{"role": "user", "content": prompt}]
    )
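Once the log has accumulated rows, a short script can roll them up per model to show where the spend goes. The sketch below assumes the column names written by the decorator above; `summarize_usage` is a hypothetical helper, and the sample rows are made up so the block runs without a log file.

```python
import csv
import io
from collections import defaultdict


def summarize_usage(log_file):
    """Aggregate request count, tokens, and cost per model from a usage log.

    log_file is any iterable of CSV lines with the columns written by
    track_usage: timestamp, model, input_tokens, output_tokens, cost_usd.
    """
    totals = defaultdict(
        lambda: {"requests": 0, "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
    )
    for row in csv.DictReader(log_file):
        entry = totals[row["model"]]
        entry["requests"] += 1
        entry["input_tokens"] += int(row["input_tokens"])
        entry["output_tokens"] += int(row["output_tokens"])
        entry["cost_usd"] += float(row["cost_usd"])
    return dict(totals)


# Made-up sample rows in the same format; in practice pass
# open("token_usage_log.csv", newline="") instead.
sample_log = io.StringIO(
    "timestamp,model,input_tokens,output_tokens,cost_usd\n"
    "2025-07-01T00:00:00+00:00,claude-sonnet-4-20250514,100,50,0.001050\n"
    "2025-07-01T00:01:00+00:00,claude-sonnet-4-20250514,200,100,0.002100\n"
    "2025-07-01T00:02:00+00:00,claude-opus-4-20250514,100,10,0.002250\n"
)

summary = summarize_usage(sample_log)
for model, stats in summary.items():
    avg_output = stats["output_tokens"] / stats["requests"]
    print(f"{model}: {stats['requests']} requests, "
          f"${stats['cost_usd']:.6f}, avg output {avg_output:.0f} tokens")
```

Comparing average output tokens per model week over week is a simple way to spot a prompt change that made responses longer.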

Setting Budget Alerts and Rate Limits

Application-layer spend caps provide an essential safety net. Implement daily or weekly budget thresholds that pause or degrade API calls when exceeded to prevent runaway costs from bugs, prompt injection, or unexpected traffic spikes. Anthropic's usage dashboard (see console.anthropic.com) and built-in API rate limits offer a secondary layer of protection, but relying on them exclusively is insufficient for granular cost control.
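A daily cap can be as small as a counter that resets when the date changes. The sketch below is an in-process, single-threaded illustration; `DailyBudget` is a hypothetical class, and a multi-worker deployment would need to keep the counter in shared storage such as Redis.

```python
from datetime import date
from typing import Callable


class DailyBudget:
    """Track spend for the current day and refuse calls that would exceed a cap."""

    def __init__(self, limit_usd: float, today: Callable[[], date] = date.today):
        self.limit_usd = limit_usd
        self._today = today          # injectable clock, useful for testing
        self._day = today()
        self.spent_usd = 0.0

    def _roll_over(self) -> None:
        """Reset the counter when the calendar day changes."""
        current = self._today()
        if current != self._day:
            self._day = current
            self.spent_usd = 0.0

    def allow(self, estimated_cost_usd: float = 0.0) -> bool:
        """Return True if a call with this estimated cost fits in today's budget."""
        self._roll_over()
        return self.spent_usd + estimated_cost_usd <= self.limit_usd

    def record(self, cost_usd: float) -> None:
        """Add the actual cost of a completed call to today's total."""
        self._roll_over()
        self.spent_usd += cost_usd


budget = DailyBudget(limit_usd=50.0)
budget.record(49.99)
print(budget.allow(0.005))  # small request still fits
print(budget.allow(0.02))   # this one would cross the cap
```

When `allow` returns False, the application can pause, serve cached answers only, or downgrade to a cheaper model, depending on which degradation is acceptable.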

Token Savings Calculator and Optimization Checklist

Interactive Token Savings Calculator

Teams can estimate their savings by plugging four variables into a straightforward formula:

Average prompt size (tokens)
Average output size (tokens)
Requests per day
Current model tier

Monthly baseline cost = (prompt_tokens × input_price + output_tokens × output_price) × requests_per_day × 30

Apply reductions sequentially: prompt trimming reduces input tokens by approximately 20%; caching eliminates 30 to 50% of requests entirely; model routing shifts 60% of remaining requests to a cheaper tier. The formula is transparent enough that any team can implement it as a spreadsheet or internal tool.

For a concrete example: 50,000 daily requests averaging 800 input tokens and 400 output tokens on Sonnet 4 cost (800 × $3.00/1M + 400 × $15.00/1M) × 50,000 × 30 = $12,600 per month at baseline. After applying the optimization stack, the same workload costs roughly 60% less.
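The formula and the first two reductions translate directly into code. In this sketch the function and parameter names are illustrative; routing is left as a caller-supplied `routing_factor` (the fraction of cost remaining after routing), because that figure depends on the cheaper tier's prices and on how much traffic it can absorb.

```python
def monthly_cost(prompt_tokens, output_tokens, requests_per_day,
                 input_price, output_price, days=30):
    """Baseline formula. Prices are USD per 1M tokens."""
    per_request = (prompt_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_day * days


def optimized_monthly_cost(prompt_tokens, output_tokens, requests_per_day,
                           input_price, output_price,
                           input_reduction=0.0, cache_hit_rate=0.0, routing_factor=1.0):
    """Apply prompt trimming, then caching, then an optional routing factor."""
    trimmed_prompt = prompt_tokens * (1 - input_reduction)   # fewer input tokens per call
    api_requests = requests_per_day * (1 - cache_hit_rate)   # cache hits never reach the API
    cost = monthly_cost(trimmed_prompt, output_tokens, api_requests,
                        input_price, output_price)
    return cost * routing_factor


# The example workload: 800 input / 400 output tokens, 50,000 requests per day, Sonnet 4.
workload = dict(prompt_tokens=800, output_tokens=400, requests_per_day=50_000,
                input_price=3.00, output_price=15.00)

baseline = monthly_cost(**workload)
trimmed = optimized_monthly_cost(**workload, input_reduction=0.20)
cached = optimized_monthly_cost(**workload, input_reduction=0.20, cache_hit_rate=0.35)

print(f"Baseline:               ${baseline:,.0f}")
print(f"After prompt trimming:  ${trimmed:,.0f}")
print(f"After caching:          ${cached:,.0f}")
```

Swap in your own measured reduction and hit rates; the 20% and 35% values here are the illustrative figures used in this article, not guarantees.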

Prompt Optimization Checklist

No single optimization delivers the full 60% reduction, but the combination of all twelve compounds to reach or exceed that threshold. Work through each item systematically:

Audit system prompt length and eliminate padding
Remove redundant or restated instructions
Specify output format as structured JSON where possible
Set max_tokens to the minimum viable value for each task
Use stop sequences to prevent unnecessary generation
Apply assistant prefill to eliminate preamble tokens
Enable Anthropic's native prompt caching on static context
Implement an exact-match Redis cache layer for repeated queries
Set cache TTL values based on content volatility
Classify incoming requests by complexity tier
Route each request to the cheapest model that meets quality requirements
Monitor and log per-request token usage and cost

Putting It All Together: Real-World Savings Breakdown

Before and After Cost Comparison

Consider a customer support bot handling 50,000 requests per day with an average of 800 input tokens and 400 output tokens per request, running entirely on Sonnet 4.

Baseline calculation: (800 × $3.00/1M + 400 × $15.00/1M) × 50,000 × 30 = $12,600/month.

Optimization Layer	Monthly Cost	Incremental Savings
Baseline (Sonnet 4, no optimization)	$12,600	—
After prompt optimization (20% input token reduction)	$11,880	$720 (5.7%)
After caching (35% fewer API calls)	$7,722	$4,158 (33%)
After model routing (60% to Haiku)	~$5,040	~$2,682 (21%)
Total optimized	~$5,040	~$7,560 (60%)

A necessary caveat: actual savings depend entirely on the application's traffic patterns, prompt diversity, and task complexity distribution. We have seen 40 to 70% reductions across workloads with cache hit rates of 20% or higher, but your results will vary. Some optimizations compound while others overlap. Applications with highly diverse prompts will see lower cache hit rates; applications that already use Haiku for most tasks have less headroom from model routing. The 60% figure is achievable but not universal. The intermediate rows above are approximate because these optimizations interact (e.g., caching applies after prompt optimization, routing applies to un-cached requests), so per-layer figures may shift depending on your workload. The bottom-line 60% reduction is the target to validate against your own measurements.

Key Takeaways

Output tokens cost 5x input tokens for the model tiers covered here. That makes output optimization the highest-leverage target. Caching, both Anthropic's native prompt caching and application-level Redis caching, delivers the highest ROI for most production applications. Model routing adds one Haiku call (~$0.00004) per routed request when using a pre-check classifier, and heuristic routing costs even less. There is rarely a reason to send classification tasks to Opus. None of these optimizations work without per-request measurement. Track every token, every call, every dollar.
