LLM cost optimization is fundamentally a token economics problem. This tutorial covers four distinct techniques (prompt compression, semantic caching, chain-of-thought pruning, and output length constraints) that, when combined, can reduce LLM API costs by up to 63%.

How to Reduce LLM API Costs

  1. Instrument token logging on every API call to establish a cost baseline before optimizing.
  2. Compress system prompts by eliminating hedge language, consolidating instructions into structured formats, and using tools like LLMLingua.
  3. Constrain output length with max_completion_tokens or max_tokens and enforce structured JSON schemas.
  4. Prune chain-of-thought reasoning in production by instructing the model to return only the final answer.
  5. Implement semantic caching using embedding similarity to skip redundant API calls entirely.
  6. Leverage provider-native prompt caching from OpenAI, Anthropic, or Google for automatic input token discounts.
  7. Validate output quality against your evaluation set after each optimization to ensure accuracy holds.

Why Standard Prompting Is Burning Your Budget

LLM cost optimization is fundamentally a token economics problem. Every API call to OpenAI, Anthropic, or Google Gemini bills by the token, and most production systems send far more tokens than the task actually requires. Verbose system prompts padded with hedge language, repeated context across conversation turns, unconstrained output lengths, and chain-of-thought reasoning left enabled in production all contribute to bills that run two to three times higher than necessary.

This tutorial covers four distinct techniques for reducing that waste: prompt compression, semantic caching, chain-of-thought pruning, and output length constraints. When combined, these methods can reduce LLM API costs by up to 63%, though the exact figure depends on use case, model selection, and traffic patterns. The techniques are not theoretical. Each section includes working code examples in Python and Node.js that target the OpenAI and Anthropic APIs directly, with measured token counts showing the before and after.

The audience here is developers already calling LLM APIs in production or at scale, not those experimenting with chat completions for the first time.

Understanding Token Economics Across Providers

How OpenAI, Anthropic, and Google Gemini Price Tokens

All three major providers split billing into input tokens and output tokens, but the ratio between them varies significantly. Output tokens cost more than input tokens, by a factor of 2x to 5x depending on the model. For GPT-4o, OpenAI charges $2.50 per million input tokens and $10.00 per million output tokens, a 4x ratio. Anthropic's Claude 3.5 Sonnet prices at $3.00 per million input and $15.00 per million output, a 5x ratio. Google's Gemini 1.5 Flash costs roughly 33x less than GPT-4o on both input ($0.075 per million) and output ($0.30 per million) for prompts under 128K tokens.

Note: All pricing figures in this article are as of the time of writing. Verify current pricing at openai.com/pricing, anthropic.com/pricing, and Google's Generative AI pricing page before running cost projections.

This asymmetry has a direct consequence for optimization priority: reducing output tokens yields disproportionately larger cost savings per token eliminated.
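To make the asymmetry concrete, the arithmetic can be sketched as follows. The prices are the GPT-4o figures quoted above; the helper name `call_cost` and the token counts are illustrative assumptions, not measurements.

```python
# Illustrative arithmetic only. Prices are the GPT-4o figures quoted in this
# article (USD per million tokens) and may be out of date.
INPUT_PRICE = 2.50
OUTPUT_PRICE = 10.00


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one call, given token counts (hypothetical helper)."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


# Hypothetical request: 1,000 input tokens and 500 output tokens.
baseline = call_cost(1_000, 500)
trim_input = call_cost(900, 500)     # remove 100 input tokens
trim_output = call_cost(1_000, 400)  # remove 100 output tokens

print(f"Baseline cost:              ${baseline:.6f}")
print(f"Saved by -100 input tokens:  ${baseline - trim_input:.6f}")
print(f"Saved by -100 output tokens: ${baseline - trim_output:.6f}")
```

At these prices, each output token removed saves four times as much as each input token removed, which is why output-side techniques come first when prioritizing.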

Each provider also offers cached token discounts. OpenAI's automatic prompt caching provides a 50% discount on cached input tokens. Anthropic's explicit prompt caching offers a 90% discount on cache reads (though cache writes cost 25% more than base input). Google Gemini's context caching charges at about 25% of the standard input rate for cached content.

Where Tokens Are Wasted in a Typical API Call

Four categories account for the bulk of unnecessary token spend:

  • System prompt bloat. Instructions contain filler phrases, excessive examples, and redundant guardrails that often double the prompt length without improving output quality.
  • Repeated context across conversation turns. Multi-turn flows resend the same background information with every request.
  • Uncontrolled output verbosity. When output length is uncapped, models generate explanations, caveats, and preambles that the consuming application immediately discards.
  • Chain-of-thought reasoning left active in production. Lengthy intermediate reasoning steps that served their purpose during development add no value in a deployed pipeline.

Technique 1: Prompt Compression

What Prompt Compression Means in Practice

Prompt compression reduces the token count of a prompt while preserving the information the model needs to produce an accurate response. There are two categories. Lossy compression removes content entirely, such as dropping optional examples or eliminating edge case instructions that apply to a small fraction of requests. Lossless compression rephrases the same content more concisely, such as converting prose instructions into structured YAML or JSON format, or replacing multi-sentence explanations with terse directives.

Compression hurts quality when it removes disambiguation that the model genuinely needs. For tasks with narrow, well-defined outputs like entity extraction or classification, aggressive compression is safe. For tasks requiring nuanced judgment, such as open-ended writing or complex reasoning, over-compression can degrade results. Track output quality metrics (F1 score for extraction, human evaluation scores for generation) alongside token counts; if quality drops more than 2-3% on your eval set, you've compressed too far.
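A quality gate of this kind might look like the sketch below. It assumes entities are compared as exact (name, type, sentiment) tuples and reads the 2-3% threshold as an absolute drop in F1; both are assumptions to adapt, and the function names are hypothetical.

```python
def entity_f1(predicted: set, expected: set) -> float:
    """F1 score over sets of (name, type, sentiment) tuples."""
    if not predicted and not expected:
        return 1.0
    true_positives = len(predicted & expected)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(expected)
    return 2 * precision * recall / (precision + recall)


def compression_acceptable(baseline_f1: float, compressed_f1: float,
                           max_drop: float = 0.03) -> bool:
    """True if F1 fell by no more than max_drop (absolute) after compression."""
    return (baseline_f1 - compressed_f1) <= max_drop


# Hypothetical eval example: the compressed prompt misses one of four entities.
expected = {
    ("Sony WH-1000XM5", "product", "positive"),
    ("noise cancellation", "feature", "positive"),
    ("build quality", "feature", "negative"),
    ("battery life", "feature", "positive"),
}
baseline_pred = set(expected)
compressed_pred = expected - {("battery life", "feature", "positive")}

baseline_f1 = entity_f1(baseline_pred, expected)
compressed_f1 = entity_f1(compressed_pred, expected)
print(f"baseline F1={baseline_f1:.3f}, compressed F1={compressed_f1:.3f}")
print("acceptable:", compression_acceptable(baseline_f1, compressed_f1))
```

In practice the scores would be averaged over the whole evaluation set rather than computed on a single review.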

Manual Prompt Compression Strategies

Three manual strategies yield the largest gains with the least risk:

  • Eliminate hedge language and politeness tokens. Phrases like "Please kindly ensure that you carefully consider" become "Ensure."
  • Consolidate multi-sentence instructions into structured formats. A five-sentence paragraph explaining a desired JSON output shape becomes the JSON schema itself, which is both shorter and more precise.
  • Use reference tokens instead of repeating context. Rather than restating a product description in both the system prompt and the user message, define it once and refer to it by label.
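The first strategy can be partly automated. The sketch below strips a small, hypothetical list of filler phrases and uses a whitespace word count as a crude stand-in for tokens; the phrase list and both function names are assumptions, and a real tokenizer is needed for billing-accurate counts.

```python
import re

# Hypothetical filler phrases; build your own list from your actual prompts.
FILLER_PATTERNS = [
    r"\bplease\b",
    r"\bkindly\b",
    r"\bmake sure (?:to|that)\b",
    r"\bcarefully\b",
    r"\bto the best of your ability\b",
]


def strip_filler(prompt: str) -> str:
    """Remove filler phrases and collapse the whitespace they leave behind."""
    for pattern in FILLER_PATTERNS:
        prompt = re.sub(pattern, "", prompt, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", prompt).strip()


def rough_token_estimate(text: str) -> int:
    """Crude proxy: whitespace-separated words, not real tokens."""
    return len(text.split())


verbose = "Please kindly make sure to carefully extract every entity from the review."
terse = strip_filler(verbose)

print(terse)
print(rough_token_estimate(verbose), "->", rough_token_estimate(terse))
```

Mechanical removal can leave stray punctuation or change meaning, so review the result and re-run your evaluation set before adopting it.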

Programmatic Prompt Compression with LLMLingua

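Microsoft Research's LLMLingua approach uses a small language model to identify and remove the tokens that contribute least to the target model's ability to produce correct outputs. The original LLMLingua scores tokens by perplexity and prunes low-information ones; LLMLingua-2, used in the example below, instead relies on a trained token classifier to decide which tokens to keep. Both aim to preserve semantic integrity while shrinking the prompt.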

Install the required dependencies first:

pip install openai "llmlingua>=0.2.2" numpy

Note: The first run will download a transformer model checkpoint (~500MB) from Hugging Face. Ensure sufficient disk space and allow several minutes for the download.

Note: The checkpoint microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank used below is optimized for meeting transcripts (MeetingBank dataset). Validate compressed output quality on your domain before production use. For other text types, evaluate alternative LLMLingua-2 checkpoints and compare entity extraction accuracy before and after compression.

import time
from llmlingua import PromptCompressor
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

# Original ~500-token system prompt (token count is approximate;
# use tiktoken for exact measurement)
original_prompt = """You are an expert product review analyst. Your job is to carefully
read product reviews submitted by users and extract structured information from them.
You should identify the key entities mentioned in the review, including product names,
brand names, and specific features that the reviewer discusses. Please make sure to
consider both positive and negative sentiments expressed about each entity. When you
find an entity, classify it into one of the following categories: product, brand, or
feature. Also determine the sentiment as positive, negative, or neutral. Return your
analysis as a JSON object with an array called 'entities', where each entity has the
fields 'name', 'type', and 'sentiment'. Be thorough but concise in your extraction.
Do not include entities that are only mentioned in passing without any opinion expressed.
Focus on entities where the reviewer has expressed a clear opinion or evaluation.
Make sure your JSON is valid and properly formatted. Do not include any explanation
or commentary outside the JSON object. Only return the JSON.
You should handle reviews in English. If the review contains multiple products being
compared, extract entities for all of them. If a feature is mentioned for multiple
products, create separate entity entries for each product-feature combination.
Ensure that entity names are normalized — for example, use the full brand name rather
than abbreviations when possible. If the reviewer uses slang or informal language,
interpret it to the best of your ability and use standard terminology in your output."""

# Compress using LLMLingua-2
compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    use_llmlingua2=True
)

compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.4,  # Target 40% of original length
    force_tokens=["JSON", "entities", "name", "type", "sentiment"]
)

compressed_prompt = compressed["compressed_prompt"]

# Introspect available keys at runtime to guard against version differences
origin_tokens     = compressed.get("origin_tokens",          "UNVERIFIED")
compressed_tokens = compressed.get("compressed_tokens",      "UNVERIFIED")
ratio             = compressed.get("ratio",             "UNVERIFIED")  # key name may vary by version

print(f"Available keys: {list(compressed.keys())}")
print(f"Original tokens:   {origin_tokens}")
print(f"Compressed tokens: {compressed_tokens}")
print(f"Compression ratio: {ratio}")

# Send compressed prompt to OpenAI with retry logic
max_retries = 3
response = None
for attempt in range(max_retries):
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": compressed_prompt},
                {"role": "user", "content": "The new Sony WH-1000XM5 headphones have amazing noise cancellation but the build quality feels cheaper than the XM4. Battery life is stellar though."}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if response is None:
    raise RuntimeError("Exceeded max retries for OpenAI API call")

if response.usage is None:
    raise ValueError("response.usage is None — streaming mode is not supported here")

print(f"Prompt tokens used: {response.usage.prompt_tokens}")
print(f"Completion tokens used: {response.usage.completion_tokens}")
print(response.choices[0].message.content)

The force_tokens parameter ensures that critical terms survive the compression pass. With a rate of 0.4, the compressor targets about 40% of the original token count (roughly 200 tokens for a ~500-token prompt) while preserving the extraction instructions and output format requirements. Check the printed token counts rather than assuming the target was met exactly.

Measuring Compression Impact

Systematic measurement requires logging token usage on every call and comparing against a known baseline.

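Note: These JavaScript examples use top-level await, which requires ES modules (supported without a flag since Node.js 14.8). The SDKs themselves may require a newer Node.js release, so check each package's documented minimum version. Add "type": "module" to your package.json or wrap the code in (async () => { ... })();.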

npm install openai @anthropic-ai/sdk

import OpenAI from "openai";

const openai = new OpenAI();

// Pricing per million tokens (verify current pricing at openai.com/pricing)
const PRICING = {
  "gpt-4o": { input: 2.5, output: 10.0 },
  "gpt-4o-mini": { input: 0.15, output: 0.6 },
};

async function trackedCompletion(model, messages, label = "default") {
  const pricing = PRICING[model];
  if (!pricing) {
    throw new Error(
      `Model "${model}" not found in PRICING table. ` +
      `Add it or verify the model name. Known models: ${Object.keys(PRICING).join(", ")}`
    );
  }

  let response;
  const MAX_RETRIES = 3;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      response = await openai.chat.completions.create({ model, messages });
      break;
    } catch (err) {
      if (err?.status === 429 && attempt < MAX_RETRIES - 1) {
        const wait = Math.pow(2, attempt) * 1000;
        console.warn(`[${label}] Rate limited. Retrying in ${wait}ms`);
        await new Promise(r => setTimeout(r, wait));
      } else {
        throw err;
      }
    }
  }

  if (!response?.usage) {
    throw new Error(`[${label}] response.usage is null — check for streaming mode`);
  }

  const { prompt_tokens, completion_tokens } = response.usage;

  const inputCost = (prompt_tokens / 1_000_000) * pricing.input;
  const outputCost = (completion_tokens / 1_000_000) * pricing.output;
  const totalCost = inputCost + outputCost;

  console.log(`[${label}] Model: ${model}`);
  console.log(`  Prompt tokens: ${prompt_tokens}`);
  console.log(`  Completion tokens: ${completion_tokens}`);
  console.log(`  Input cost:  $${inputCost.toFixed(6)}`);
  console.log(`  Output cost: $${outputCost.toFixed(6)}`);
  console.log(`  Total cost:  $${totalCost.toFixed(6)}`);

  return { response, prompt_tokens, completion_tokens, totalCost };
}

// Compare baseline vs compressed
const baseline = await trackedCompletion(
  "gpt-4o",
  [
    { role: "system", content: "Your original 500-token system prompt here..." },
    { role: "user", content: "Review text here..." },
  ],
  "baseline"
);

const compressed = await trackedCompletion(
  "gpt-4o",
  [
    { role: "system", content: "Your compressed 200-token prompt here..." },
    { role: "user", content: "Review text here..." },
  ],
  "compressed"
);

const savings = ((baseline.totalCost - compressed.totalCost) / baseline.totalCost) * 100;
console.log(`\nCost reduction: ${savings.toFixed(1)}%`);

You can drop this wrapper into any production pipeline to continuously monitor token spend and validate that compression delivers expected savings.

Technique 2: Semantic Caching

What Semantic Caching Is and How It Differs from Exact-Match Caching

Exact-match caching only returns a stored result when the incoming request is identical, character for character, to a previously seen request. Semantic caching uses embedding-based similarity to recognize that "What is the capital of France?" and "Tell me France's capital city" should return the same cached response. This increases cache hit rates significantly for applications where users phrase similar questions in different ways.

Provider-native caching and application-layer semantic caching solve different problems. OpenAI's and Anthropic's prompt caching discounts the cost of resending identical prompt prefixes; the API call still happens. Application-layer semantic caching avoids the API call entirely when a sufficiently similar query has already been answered.
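For contrast, a minimal exact-match cache can be sketched with a hash of the request text. The function names here are hypothetical and the store is a plain in-process dictionary, for illustration only.

```python
import hashlib
from typing import Optional

_exact_cache: dict = {}


def exact_key(system_prompt: str, user_query: str) -> str:
    """Hash of the exact request text; any character change yields a new key."""
    payload = system_prompt + "\x00" + user_query
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def exact_store(system_prompt: str, user_query: str, response: str) -> None:
    _exact_cache[exact_key(system_prompt, user_query)] = response


def exact_lookup(system_prompt: str, user_query: str) -> Optional[str]:
    return _exact_cache.get(exact_key(system_prompt, user_query))


SYSTEM = "Answer concisely."
exact_store(SYSTEM, "What is the capital of France?", "Paris")

same_wording = exact_lookup(SYSTEM, "What is the capital of France?")
rephrased = exact_lookup(SYSTEM, "Tell me France's capital city")

print("identical query:", same_wording)
print("rephrased query:", rephrased)
```

The rephrased query misses even though the answer is the same, which is the gap that embedding-based similarity is meant to close.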

Implementing Application-Layer Semantic Caching

Note: The in-memory cache below is for demonstration only and is not production-safe. It has no TTL and uses a simple size cap for eviction, meaning it will not handle expiration or sophisticated eviction strategies. For production use, replace with Redis (using RediSearch for vector similarity) or a dedicated vector database with TTL and eviction configured.

import threading
import time
import numpy as np
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

# Thread-safe bounded in-memory vector cache — DEMONSTRATION ONLY (see note above)
_cache_lock = threading.Lock()
_cache: list[dict] = []  # List of {"embedding": np.ndarray, "query": str, "response": str}
CACHE_MAX_SIZE = 10_000       # evict oldest when exceeded
SIMILARITY_THRESHOLD = 0.95


def get_embedding(text: str) -> np.ndarray:
    result = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return np.array(result.data[0].embedding)


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return float(np.dot(a, b) / (norm_a * norm_b))


def cached_completion(user_query: str, system_prompt: str, model: str = "gpt-4o") -> str:
    query_embedding = get_embedding(user_query)

    # Search cache for similar queries (thread-safe read)
    with _cache_lock:
        for entry in _cache:
            similarity = cosine_similarity(query_embedding, entry["embedding"])
            if similarity >= SIMILARITY_THRESHOLD:
                print(f"Cache HIT (similarity: {similarity:.4f})")
                return entry["response"]

    # Cache miss — call the API (outside the lock to avoid blocking other threads)
    print("Cache MISS — calling API")
    response = None
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_query},
                ]
            )
            break
        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait)
        except APIError as e:
            print(f"API error on attempt {attempt + 1}: {e}")
            if attempt == max_retries - 1:
                raise

    if response is None:
        raise RuntimeError("Exceeded max retries for API call")
    if response.usage is None:
        raise ValueError("response.usage is None — streaming mode is not supported here")

    result = response.choices[0].message.content

    # Store in cache (thread-safe write with eviction)
    with _cache_lock:
        if len(_cache) >= CACHE_MAX_SIZE:
            _cache.pop(0)   # evict oldest; use collections.deque for O(1)
        _cache.append({
            "embedding": query_embedding,
            "query": user_query,
            "response": result
        })

    return result


# First call — cache miss
result1 = cached_completion(
    "What are the main features of the iPhone 15 Pro?",
    "You are a product expert. Answer concisely."
)

# Second call — semantically similar, should hit cache
result2 = cached_completion(
    "Tell me the key features of Apple's iPhone 15 Pro",
    "You are a product expert. Answer concisely."
)

For production use, replacing the in-memory list with Redis using its vector search capability (RediSearch) or a dedicated vector database provides persistence and scalability. The embedding call itself is very cheap: OpenAI's text-embedding-3-small costs $0.02 per million tokens (as of the time of writing — verify current pricing at openai.com/pricing before projecting costs).

Using Provider-Native Prompt Caching

OpenAI's prompt caching is automatic. When the first 1,024 or more tokens of a prompt match a previous request exactly, cached tokens are billed at a 50% discount. No code changes are required, but structuring prompts so that the static system instructions appear first and variable content appears last maximizes cache hit rates.

Note: OpenAI's automatic prompt caching only activates when the matching prompt prefix is at least 1,024 tokens. Prompts shorter than this threshold will not benefit from caching.
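The effect of prompt ordering on prefix matching can be illustrated without calling any API. The sketch below compares two hypothetical prompt layouts and measures the shared prefix in characters; providers match on tokens, so the character count is only a proxy, and the short instruction string stands in for a real prompt of 1,024 or more tokens.

```python
from os.path import commonprefix

# Stands in for a long static system prompt (1,024+ tokens in practice).
STATIC_INSTRUCTIONS = "You are a product review analyst. Return only JSON."


def cache_friendly(review: str, request_id: str) -> str:
    """Static instructions first; per-request values last."""
    return f"{STATIC_INSTRUCTIONS}\nRequest: {request_id}\nReview: {review}"


def cache_hostile(review: str, request_id: str) -> str:
    """Per-request value first, so the prefix changes on every call."""
    return f"Request: {request_id}\n{STATIC_INSTRUCTIONS}\nReview: {review}"


friendly_shared = len(commonprefix([
    cache_friendly("Great phone.", "a1"),
    cache_friendly("Bad laptop.", "b2"),
]))
hostile_shared = len(commonprefix([
    cache_hostile("Great phone.", "a1"),
    cache_hostile("Bad laptop.", "b2"),
]))

print(f"Shared prefix, static-first layout:   {friendly_shared} chars")
print(f"Shared prefix, variable-first layout: {hostile_shared} chars")
```

With the variable-first layout, the shared prefix ends at the first per-request value, so the static instructions that follow can never be served from the prefix cache.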

Anthropic's prompt caching is explicit and offers steeper discounts. Cache reads cost 90% less than base input pricing. Cache writes cost 25% more, which is worth noting as a cost factor for low-traffic deployments where cache writes may outnumber reads. The developer places cache_control breakpoints to mark which prompt segments should be cached.

Note: Anthropic requires the cached segment to be at least 1,024 tokens for cache_control to take effect. The example below uses a shortened prompt for readability; in practice, expand or combine segments to meet the ≥1,024 token threshold. Confirm caching activated by checking cache_creation_input_tokens > 0 in the response.

import Anthropic from "@anthropic-ai/sdk";

const anthropic = new Anthropic();

// This prompt is shortened for readability. In practice, the cached segment
// must be at least 1,024 tokens for cache_control to activate.
const systemPrompt = `You are an expert product review analyst. Extract entities 
from reviews as JSON with fields: name, type (product/brand/feature), sentiment 
(positive/negative/neutral). Return only valid JSON. Handle comparisons by creating 
separate entries. Normalize entity names to full brand names.`;

async function analyzeReview(reviewText) {
  let response;
  try {
    response = await anthropic.messages.create({
      model: "claude-3-5-sonnet-20241022", // Verify current model ID at docs.anthropic.com/en/docs/about-claude/models
      max_tokens: 1024,
      system: [
        {
          type: "text",
          text: systemPrompt,
          cache_control: { type: "ephemeral" },
        },
      ],
      messages: [{ role: "user", content: reviewText }],
    });
  } catch (err) {
    if (err?.status === 429) {
      console.warn("Rate limited by Anthropic. Implement retry logic for production use.");
    }
    throw err;
  }

  console.log("Input tokens:", response.usage.input_tokens);
  console.log("Cache creation tokens:", response.usage.cache_creation_input_tokens || 0);
  console.log("Cache read tokens:", response.usage.cache_read_input_tokens || 0);

  if (!response.content || response.content.length === 0 || response.content[0].type !== "text") {
    throw new Error("Unexpected response content format from Anthropic API");
  }

  return response.content[0].text;
}

// First call — cache write (25% premium on system prompt tokens)
await analyzeReview("The Sony WH-1000XM5 has great ANC but feels flimsy.");

// Subsequent calls — cache read (90% discount on system prompt tokens)
await analyzeReview("Samsung Galaxy S24 Ultra camera is incredible, battery is mediocre.");
await analyzeReview("MacBook Pro M3 performance is outstanding but it runs hot.");

Anthropic's cached prompt content has a minimum length requirement of 1,024 tokens for Claude 3.5 Sonnet (the minimum varies by model) and a default time-to-live of 5 minutes, which is refreshed each time the cached content is read. For high-throughput applications making multiple calls per minute with the same system prompt, the cache stays warm and the 90% read discount accumulates rapidly. In low-traffic scenarios where requests arrive more than 5 minutes apart, each call pays the 25% write premium without a subsequent read, so infrequent usage patterns may not see net savings from caching.
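The break-even point follows directly from the two multipliers. The sketch below uses the 25% write premium and 90% read discount quoted above and expresses the cost of the cached prefix relative to sending it uncached; it covers only the cached prefix tokens, not the rest of the request, and the function name is hypothetical.

```python
WRITE_MULTIPLIER = 1.25  # cache write: 25% premium over base input price
READ_MULTIPLIER = 0.10   # cache read: 90% discount on base input price


def relative_prefix_cost(reads_per_write: float) -> float:
    """Cached-prefix cost relative to uncached (1.0 means no change)."""
    calls = 1 + reads_per_write
    cached = WRITE_MULTIPLIER + READ_MULTIPLIER * reads_per_write
    return cached / calls


for reads in (0, 1, 5, 20):
    print(f"{reads:>2} reads per write -> "
          f"{relative_prefix_cost(reads):.3f}x the uncached prefix cost")
```

A write that is never read costs 1.25x, but a single read within the cache lifetime already brings the average below the uncached price.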

Cache Invalidation and Freshness

Set TTLs based on how frequently the underlying data or instructions change. For static system prompts, long TTLs or no expiration are appropriate. For queries against rapidly changing data, such as real-time pricing or inventory, semantic caching introduces stale response risk. User-specific dynamic queries with personal context should bypass the cache entirely.
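These rules can be sketched as a small TTL cache plus a bypass check. The class and function names are hypothetical, the cache is not thread-safe, and the injectable clock exists only so that expiry can be demonstrated without waiting.

```python
import time
from typing import Callable, Optional


class TTLCache:
    """Minimal TTL cache keyed by string; for illustration only."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict = {}  # key -> (expires_at, value)

    def get(self, key: str) -> Optional[str]:
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def put(self, key: str, value: str) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)


def should_bypass_cache(user_specific: bool, realtime_data: bool) -> bool:
    """Skip the cache for personal or rapidly changing queries."""
    return user_specific or realtime_data


# Demonstrate expiry with a fake clock instead of sleeping.
now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: now[0])
cache.put("return policy", "30 days")
fresh = cache.get("return policy")
now[0] = 61.0
stale = cache.get("return policy")

print("within TTL:", fresh)
print("after TTL: ", stale)
```

In a semantic cache the same expiry check would apply to each stored embedding entry before its similarity score is considered.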

Technique 3: Chain-of-Thought Pruning for Production

Why CoT Reasoning Inflates Output Costs

Chain-of-thought prompting is valuable during development and evaluation because it makes the model's reasoning auditable. In production, however, downstream systems consume only the final answer. CoT reasoning can inflate output length by 3x to 5x (this is a commonly observed range and varies by task), and since output tokens carry the highest per-token cost, this represents a 3x to 5x increase in output cost that adds no value to the deployed system.

Strategies for Pruning CoT in Production

The most direct approach: instruct the model to return only the final answer. Combining this with structured output mode (JSON) constrains the response shape and eliminates explanatory prose.

Anthropic's extended thinking feature (available on Claude 3.7 Sonnet and later compatible models) provides a budget_tokens parameter that caps the number of tokens the model can spend on internal reasoning. Verify model support in Anthropic's extended thinking documentation before use. This allows controlled reasoning depth without unlimited output expansion.
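A request with a capped reasoning budget might be assembled as in the sketch below. It only builds the keyword arguments and makes no network call. The helper name is hypothetical, the model ID should be checked against Anthropic's current model list, and the two validation checks reflect limits described in Anthropic's documentation at the time of writing, so treat them as assumptions to verify.

```python
def thinking_request(prompt: str, budget_tokens: int, max_tokens: int,
                     model: str = "claude-3-7-sonnet-20250219") -> dict:
    """Build keyword arguments for a Messages API call with capped reasoning."""
    if budget_tokens < 1024:
        raise ValueError("budget_tokens must be at least 1,024")
    if budget_tokens >= max_tokens:
        raise ValueError("budget_tokens must be smaller than max_tokens")
    return {
        "model": model,
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget_tokens},
        "messages": [{"role": "user", "content": prompt}],
    }


kwargs = thinking_request(
    "Extract product entities with sentiment from: 'Great sound, bulky case.'",
    budget_tokens=2048,
    max_tokens=4096,
)
print(kwargs["thinking"])

# With the anthropic package installed and ANTHROPIC_API_KEY set:
#   from anthropic import Anthropic
#   response = Anthropic().messages.create(**kwargs)
```

Thinking tokens are billed as output tokens, so the budget acts as a direct ceiling on the reasoning portion of the bill.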

import time
from openai import OpenAI, RateLimitError, APIError

client = OpenAI()

review = """The Bose QuietComfort Ultra earbuds deliver exceptional sound quality 
with deep bass and clear highs. The noise cancellation is top-tier, rivaling 
over-ear headphones. However, the fit can be uncomfortable during long sessions, 
and the case is unnecessarily bulky. Battery life of 6 hours is decent but not 
class-leading. At $299, they're expensive but justified for audiophiles."""

max_retries = 3

# WITH chain-of-thought
cot_response = None
for attempt in range(max_retries):
    try:
        cot_response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": "Extract product entities with sentiment. Think step by step."},
                {"role": "user", "content": review}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if cot_response is None:
    raise RuntimeError("Exceeded max retries for CoT API call")
if cot_response.usage is None:
    raise ValueError("cot_response.usage is None — streaming mode is not supported here")

# WITHOUT chain-of-thought — constrained to JSON only
direct_response = None
for attempt in range(max_retries):
    try:
        direct_response = client.chat.completions.create(
            model="gpt-4o",
            max_completion_tokens=256,  # Cap output length to prevent runaway generation
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": "Extract entities as JSON: {\"entities\": [{\"name\": str, \"type\": str, \"sentiment\": str}]}. Return ONLY the JSON."},
                {"role": "user", "content": review}
            ]
        )
        break
    except RateLimitError:
        wait = 2 ** attempt
        print(f"Rate limited. Retrying in {wait}s (attempt {attempt + 1}/{max_retries})")
        time.sleep(wait)
    except APIError as e:
        print(f"API error on attempt {attempt + 1}: {e}")
        if attempt == max_retries - 1:
            raise

if direct_response is None:
    raise RuntimeError("Exceeded max retries for direct API call")
if direct_response.usage is None:
    raise ValueError("direct_response.usage is None — streaming mode is not supported here")

print(f"CoT output tokens: {cot_response.usage.completion_tokens}")
print(f"Direct output tokens: {direct_response.usage.completion_tokens}")

# Pricing per million output tokens for GPT-4o (verify at openai.com/pricing)
OUTPUT_PRICE_PER_MILLION = 10.0

cot_cost = (cot_response.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
direct_cost = (direct_response.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION
print(f"CoT output cost:    ${cot_cost:.6f}")
print(f"Direct output cost: ${direct_cost:.6f}")

The CoT version returns several paragraphs of reasoning followed by the extraction, while the direct version returns only the JSON object. On a task like this, expect a 3x or greater difference in output token count.

Keeping CoT for Debugging Without Paying for It

A practical pattern: gate CoT behind an environment variable or feature flag. Enable CoT during development and in error-analysis pipelines. Disable it in production. When production errors surface for investigation, replay the specific failing input with CoT enabled, generating the reasoning trace on demand rather than on every request.
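A minimal sketch of this gating pattern follows. The environment variable name LLM_ENABLE_COT and the two prompt strings are hypothetical; the point is that a single switch selects between the reasoning and direct prompts.

```python
import os
from typing import Mapping

DIRECT_PROMPT = "Extract product entities with sentiment. Return ONLY the JSON."
COT_PROMPT = "Extract product entities with sentiment. Think step by step."


def cot_enabled(env: Mapping[str, str] = os.environ) -> bool:
    """Read the (hypothetical) LLM_ENABLE_COT flag; off unless explicitly set."""
    return env.get("LLM_ENABLE_COT", "").strip().lower() in {"1", "true", "yes"}


def system_prompt(env: Mapping[str, str] = os.environ) -> str:
    """Pick the reasoning prompt only when the flag is on."""
    return COT_PROMPT if cot_enabled(env) else DIRECT_PROMPT


# Production default: flag absent, so the direct prompt is used.
production_prompt = system_prompt({})
# Error-analysis replay: flag set, so the reasoning trace is generated.
debug_prompt = system_prompt({"LLM_ENABLE_COT": "true"})

print("production:", production_prompt)
print("debug:     ", debug_prompt)
```

Defaulting to off means a missing or misspelled variable fails toward the cheaper behavior.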

Technique 4: Output Length Constraints

Using max_tokens / max_completion_tokens Strategically

Most developers leave the maximum output length unset, allowing the model to generate as many tokens as it deems appropriate. This is expensive. For tasks with predictable output shapes, such as classification, extraction, or short-answer responses, setting a ceiling prevents runaway generation.

The parameter names differ by provider: OpenAI uses max_completion_tokens, Anthropic uses max_tokens, and Google Gemini uses maxOutputTokens. To find the right ceiling, sample outputs from representative inputs during development and set the limit at 1.5x to 2x the observed p95 (the 95th percentile — i.e., the length exceeded by only 5% of outputs in your sample) output length.
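The ceiling calculation can be sketched as below, using a nearest-rank percentile. The sample lengths are invented for illustration and the function names are hypothetical; in practice the list would come from logged completion_tokens values.

```python
import math


def percentile_nearest_rank(values: list, pct: int) -> int:
    """Smallest sample with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("need at least one sample")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def output_ceiling(sample_lengths: list, headroom: float = 1.5) -> int:
    """Token ceiling: p95 of observed output lengths times a headroom factor."""
    return math.ceil(percentile_nearest_rank(sample_lengths, 95) * headroom)


# Hypothetical completion lengths (tokens) from 20 representative inputs.
observed = [42, 45, 47, 48, 50, 51, 53, 55, 58, 60,
            61, 63, 64, 66, 70, 72, 75, 80, 95, 140]

p95 = percentile_nearest_rank(observed, 95)
print("p95 output length:", p95)
print("ceiling at 1.5x:  ", output_ceiling(observed, 1.5))
print("ceiling at 2x:    ", output_ceiling(observed, 2.0))
```

Using the p95 rather than the maximum keeps a single outlier, such as the 140-token sample here, from inflating the ceiling, while the headroom factor leaves room for it to complete.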

Structured Output as a Cost Control Mechanism

Function calling and tool use schemas act as implicit output constraints. When the model must conform to a defined schema, it cannot generate preambles, explanations, or unnecessary fields. Note that when using tool_choice to force a function call, the model's response content will be null — the actual payload is in tool_calls[0].function.arguments, which must be parsed as JSON.

import OpenAI from "openai";

const openai = new OpenAI();

// Pricing per million output tokens for GPT-4o (verify at openai.com/pricing)
const OUTPUT_PRICE_PER_MILLION = 10.0;

const review = `The Dyson V15 Detect has incredible suction power and the laser dust 
detection is genuinely useful. But at $750 it's overpriced, and the battery only 
lasts 25 minutes on max power. The attachments are well-designed.`;

// Unconstrained prose response
let proseResponse;
try {
  proseResponse = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Extract product entities with sentiment from this review." },
      { role: "user", content: review },
    ],
  });
} catch (err) {
  if (err?.status === 429) {
    console.warn("Rate limited. Implement retry logic for production use.");
  }
  throw err;
}

if (!proseResponse?.usage) {
  throw new Error("proseResponse.usage is null — check for streaming mode");
}

// Structured function calling response
let structuredResponse;
try {
  structuredResponse = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "Extract product entities with sentiment." },
      { role: "user", content: review },
    ],
    tools: [
      {
        type: "function",
        function: {
          name: "extract_entities",
          description: "Extract entities from a product review",
          parameters: {
            type: "object",
            properties: {
              entities: {
                type: "array",
                items: {
                  type: "object",
                  properties: {
                    name: { type: "string" },
                    type: { type: "string", enum: ["product", "brand", "feature"] },
                    sentiment: { type: "string", enum: ["positive", "negative", "neutral"] },
                  },
                  required: ["name", "type", "sentiment"],
                },
              },
            },
            required: ["entities"],
          },
        },
      },
    ],
    tool_choice: { type: "function", function: { name: "extract_entities" } },
  });
} catch (err) {
  if (err?.status === 429) {
    console.warn("Rate limited. Implement retry logic for production use.");
  }
  throw err;
}

if (!structuredResponse?.usage) {
  throw new Error("structuredResponse.usage is null — check for streaming mode");
}

// Extract the tool call payload (content is null for tool_choice responses)
const message = structuredResponse.choices[0].message;

if (!message.tool_calls || message.tool_calls.length === 0) {
  throw new Error("No tool_calls returned. Check tool_choice config.");
}

const rawArgs = message.tool_calls[0].function.arguments;

let entities;
try {
  entities = JSON.parse(rawArgs).entities;
} catch (e) {
  throw new Error(`Failed to parse tool arguments as JSON: ${rawArgs}`);
}

console.log(`Prose completion tokens: ${proseResponse.usage.completion_tokens}`);
console.log(`Structured completion tokens: ${structuredResponse.usage.completion_tokens}`);
console.log("Extracted entities:", entities);

const proseCost = (proseResponse.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
const structuredCost = (structuredResponse.usage.completion_tokens / 1_000_000) * OUTPUT_PRICE_PER_MILLION;
console.log(`Prose output cost:      $${proseCost.toFixed(6)}`);
console.log(`Structured output cost: $${structuredCost.toFixed(6)}`);

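The structured response constrains the model to populating only the defined fields, while the prose response includes introductory text, explanations of each entity, and a closing summary. In practice, structured output for extraction tasks typically uses one-half to one-quarter as many completion tokens as unconstrained prose. Keep in mind that the tool schema is sent with every request and counts toward input tokens, so compare total cost rather than completion tokens alone. Run the code above on your own inputs and log the difference.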

Cost Comparison Table: Before and After Across 5 Models

The following table shows estimated costs for a standardized task, extracting three entities from a two-paragraph product review, run 1,000 times. Baseline uses a verbose 500-token system prompt with unconstrained output. Optimized uses a compressed 200-token prompt with structured JSON output. The review itself adds roughly 80 input tokens in both cases, which is why the input columns read 580 and 280.

Note on pricing: GPT-4o: $2.50/$10.00 per million input/output tokens. GPT-4o mini: $0.15/$0.60. Claude 3.5 Sonnet: $3.00/$15.00. Claude 3.5 Haiku (Anthropic's lower-cost model tier): $0.80/$4.00. Gemini 1.5 Flash: $0.075/$0.30 (under 128K tokens). All prices are as of the time of writing — verify at each provider's pricing page before projecting costs.

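| Model | Baseline Input (tokens) | Compressed Input (tokens) | Baseline Output (tokens) | Constrained Output (tokens) | Baseline Cost per 1K Runs | Optimized Cost per 1K Runs | Savings |
|---|---|---|---|---|---|---|---|
| GPT-4o | 580 | 280 | 350 | 120 | $4.95 | $1.90 | 62% |
| GPT-4o mini | 580 | 280 | 350 | 120 | $0.30 | $0.11 | 63% |
| Claude 3.5 Sonnet | 580 | 280 | 350 | 120 | $6.99 | $2.64 | 62% |
| Claude 3.5 Haiku | 580 | 280 | 350 | 120 | $1.86 | $0.70 | 62% |
| Gemini 1.5 Flash | 580 | 280 | 350 | 120 | $0.15 | $0.06 | 60% |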

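The savings percentages are nearly identical by construction: the token reductions are the same for every model, so only the output-to-input price ratio affects the result. Computed from unrounded costs, every row lands between 61.6% and 62.2%; the 63% and 60% entries for GPT-4o mini and Gemini 1.5 Flash reflect rounding of the per-1K costs to the nearest cent. Models with a higher output-to-input price ratio, like Claude 3.5 Sonnet at 5x, show a marginally higher percentage saving because the output reduction (350 to 120 tokens) is proportionally larger than the input reduction (580 to 280). Absolute dollar savings scale with base pricing, so the Gemini 1.5 Flash savings, while proportionally similar, represent a much smaller dollar figure. These figures do not include additional savings from semantic caching, which would further reduce costs in proportion to cache hit rate.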
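To make the arithmetic behind the table reproducible, here is a minimal Python sketch that recomputes each row from the token counts and the prices in the pricing note. The function and variable names are illustrative, not part of any provider SDK, and the prices should be re-checked before you rely on the output.

```python
# Token counts per request, taken from the table above.
BASELINE = {"input": 580, "output": 350}
OPTIMIZED = {"input": 280, "output": 120}
RUNS = 1_000

# (input, output) USD per million tokens, copied from the pricing note.
PRICES = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Claude 3.5 Sonnet": (3.00, 15.00),
    "Claude 3.5 Haiku": (0.80, 4.00),
    "Gemini 1.5 Flash": (0.075, 0.30),
}


def cost_for_runs(tokens, price, runs=RUNS):
    """Total USD cost of `runs` requests with the given per-request token counts."""
    in_price, out_price = price
    per_request = tokens["input"] * in_price + tokens["output"] * out_price
    return runs * per_request / 1_000_000


def savings_table():
    """Map each model to (baseline cost, optimized cost, fractional savings)."""
    rows = {}
    for model, price in PRICES.items():
        base = cost_for_runs(BASELINE, price)
        opt = cost_for_runs(OPTIMIZED, price)
        rows[model] = (base, opt, (base - opt) / base)
    return rows


if __name__ == "__main__":
    for model, (base, opt, saved) in savings_table().items():
        print(f"{model:<18} ${base:.4f} -> ${opt:.4f}  ({saved:.1%} saved)")
```

Working from unrounded costs, the models priced at a 4x output-to-input ratio come out at about 61.6% and the two Claude models at about 62.2%. Swap in your own token counts to see how the figure moves for your workload.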

Combining All Four Techniques: A Real-World Optimization Pipeline

Recommended Order of Operations

Apply the techniques in order of effort-to-impact ratio:

  1. Compress prompts. This delivers the largest input savings and takes the least effort — you only rewrite prompts.
  2. Constrain outputs using max_completion_tokens (OpenAI) or max_tokens (Anthropic) and structured output schemas. This targets the most expensive token category with minimal code changes.
  3. Prune chain-of-thought for production. This requires a conditional flag but yields 3x to 5x output token reductions.
  4. Add semantic caching. This demands the most infrastructure (embedding generation, a vector store) but delivers the highest long-term savings at scale because it eliminates API calls entirely.
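The four steps above could be wired together roughly as follows. This is a sketch under stated assumptions, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector store, the 0.9 similarity threshold is an arbitrary starting point, and `call_model` is a hypothetical callable that you would implement with your provider's SDK. Only `max_completion_tokens` comes from the article; the other names are invented for illustration.

```python
import math
import re
from collections import Counter

SIMILARITY_THRESHOLD = 0.9  # assumed cut-off; tune against your evaluation set


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """In-memory cache keyed by embedding similarity (stand-in for a vector store)."""

    def __init__(self, threshold=SIMILARITY_THRESHOLD):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def get(self, query):
        vec = embed(query)
        best_score, best_response = 0.0, None
        for stored_vec, response in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((embed(query), response))


def build_request(system_prompt, user_input, *, max_output_tokens=150, debug=False):
    """Assemble request parameters with steps 1 to 3 applied."""
    system = system_prompt  # step 1: assumed to be the already-compressed prompt
    if not debug:
        system += " Return only the final answer."  # step 3: prune chain-of-thought
    return {
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": user_input},
        ],
        "max_completion_tokens": max_output_tokens,  # step 2: cap output length
    }


def answer(query, cache, call_model, system_prompt):
    """Return (response, cache_hit). The model is called only on a cache miss."""
    cached = cache.get(query)  # step 4: skip the API call entirely on a hit
    if cached is not None:
        return cached, True
    response = call_model(build_request(system_prompt, query))
    cache.put(query, response)
    return response, False
```

The cache check runs first at request time even though it is the last technique you would build, because a hit makes the other three steps unnecessary for that request.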

Estimating Your Savings

Savings are computed as (baseline_cost - optimized_cost) / baseline_cost. As an estimate based on the token reductions demonstrated above, prompt compression saves 20% to 40% on input tokens and output constraints save 30% to 50% on output tokens. Caching saves in proportion to hit rate: a 30% hit rate eliminates roughly 30% of completion calls, offset slightly by the cost of embedding each incoming query.

The 60%+ aggregate figure is realistic when at least three of the four techniques target a workload with repeated query patterns and predictable output shapes. Workloads with highly unique queries and variable-length outputs will see lower caching benefits but can still achieve 40% to 50% savings from compression and output constraints alone.
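One way to sanity-check these ranges is to plug them into the savings formula directly. The sketch below assumes cache hits are free, which ignores embedding and storage costs, and uses the GPT-4o baseline derived from the table ($1.45 input plus $3.50 output per 1,000 runs). The function name and parameters are illustrative.

```python
def combined_savings(input_cost, output_cost, input_cut, output_cut, cache_hit_rate=0.0):
    """Fraction of baseline cost saved when the techniques are stacked.

    input_cost, output_cost: baseline spend per token category (same currency).
    input_cut, output_cut: fractional token reductions (0.4 means 40% fewer tokens).
    cache_hit_rate: share of requests served from cache, treated here as free.
    """
    baseline = input_cost + output_cost
    per_call = input_cost * (1 - input_cut) + output_cost * (1 - output_cut)
    optimized = (1 - cache_hit_rate) * per_call
    return (baseline - optimized) / baseline


# Upper end of the ranges quoted above, applied to the GPT-4o baseline.
no_cache = combined_savings(1.45, 3.50, input_cut=0.40, output_cut=0.50)
with_cache = combined_savings(1.45, 3.50, input_cut=0.40, output_cut=0.50, cache_hit_rate=0.30)

print(f"Compression + output constraints: {no_cache:.1%}")
print(f"Adding a 30% cache hit rate:      {with_cache:.1%}")
```

With these inputs the estimate is roughly 47% without caching and roughly 63% with a 30% hit rate. At the lower end of the ranges (20% input, 30% output) the uncached figure drops to under 30%, so measure your own reductions before projecting.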

Start With the Lowest-Hanging Fruit

The four techniques covered here — prompt compression, semantic caching, chain-of-thought pruning, and output length constraints — form a practical framework for LLM token optimization that works across providers and models. The highest-priority first step is not implementing any technique but instrumenting token logging on every API call. Without a baseline measurement, savings cannot be quantified or validated.
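As a starting point for that instrumentation, here is a minimal sketch that writes one JSON line per API call. It assumes you pass in the token counts from the provider's response yourself, for example `response.usage.prompt_tokens` and `response.usage.completion_tokens` on OpenAI chat completions, or `usage.input_tokens` and `usage.output_tokens` on Anthropic. The `UsageRecord` and `record_usage` names, and the price table you supply, are hypothetical.

```python
import json
import time
from dataclasses import asdict, dataclass


@dataclass
class UsageRecord:
    timestamp: float
    model: str
    prompt_tokens: int
    completion_tokens: int
    cost_usd: float


def record_usage(model, prompt_tokens, completion_tokens, prices, sink):
    """Append one JSON line describing a single API call to `sink`.

    prices: dict mapping model name to (input, output) USD per million tokens.
    sink: any object with a write() method, such as an open log file.
    """
    in_price, out_price = prices[model]
    cost = (prompt_tokens * in_price + completion_tokens * out_price) / 1_000_000
    record = UsageRecord(time.time(), model, prompt_tokens, completion_tokens, cost)
    sink.write(json.dumps(asdict(record)) + "\n")
    return record
```

Call `record_usage` immediately after each completion returns, then aggregate the log by model and endpoint to get the baseline that every later optimization is measured against.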


For implementation details, see the LLMLingua repository, OpenAI's prompt caching guide, Anthropic's prompt caching documentation, and Google's context caching reference. Check current pricing on each provider's pricing page before running cost projections.

SitePoint Team

Sharing our passion for building incredible internet things.

© 2000 – 2026 SitePoint Pty. Ltd.