A developer I know spent two weeks wiring up a RAG pipeline for a 200-page internal docs site. Vector database, chunking strategy, embedding model selection, retrieval evaluation harness, deployment behind an API. The whole nine yards. Then someone pointed out the entire corpus was about 150K tokens and fit inside a single Gemini prompt.

He could have been done in an afternoon.

This keeps happening. RAG became the default way to ground LLMs in private data, and it made sense when context windows topped out at a few thousand tokens. You had to retrieve selectively. But we went from the roughly 4K ceiling of the GPT-3.5 era to Gemini 1.5 Pro accepting 1M tokens in a single call. For datasets that actually fit inside that window, the retrieval step is overhead you don't need.

I'm not writing a RAG obituary here. RAG is alive, well, and correct for plenty of use cases. What I want to give you is a decision framework: when direct context injection beats RAG, when it doesn't, how to calculate the cost crossover point, and working code for both patterns.

Context Windows in 2026: Where We Actually Are

The Expansion Timeline

GPT-3 shipped with a 2K-token window (2,048 tokens), and the GPT-3.5 models raised that to roughly 4K. GPT-4 pushed to 8K, then 32K. Claude 2 broke six figures at 100K. Then Gemini 1.5 Pro showed up with 1M, and 2M in experimental preview.

Here's where the major providers sit today:

| Model | Max Context Window | Approx. Word Equivalent | Input Cost per 1M Tokens (USD) |
| --- | --- | --- | --- |
| GPT-4o (OpenAI) | 128K tokens | ~96K words | Varies by tier |
| Claude 3.5 Sonnet (Anthropic) | 200K tokens | ~150K words | $3.00 |
| Gemini 1.5 Pro (Google) | 1M tokens (up to 2M in preview) | ~750K words | $1.25 ($0.3125 cached) |

Pricing changes constantly. Check each provider's current pricing page before committing to anything.

What 1M Tokens Actually Looks Like

One million tokens is roughly 750,000 words. That's about 3,000 pages of text. In concrete terms: an entire mid-size corporate wiki, a full codebase with docs, the complete Lord of the Rings trilogy with room to spare, or roughly 80 hours of transcribed speech at a conversational 150 words per minute.

Here's what most teams miss: their "knowledge base" is way smaller than they think. Internal documentation, product catalogs, support wikis, policy manuals for most organizations? Well under 500K tokens. These aren't big data problems. They're small data problems that teams keep solving with big data tools.
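A quick way to test that claim against your own docs is to estimate the token count before building anything. The sketch below is a rough check, not a tokenizer: it assumes about 4 characters per token (the same heuristic the hybrid example later in this article uses), and the function names and the `docs` directory are placeholders of mine.

```python
from pathlib import Path

CHARS_PER_TOKEN = 4  # rough heuristic for English prose; real tokenizers vary


def estimate_tokens(text: str) -> int:
    """Estimate token count from character length."""
    return len(text) // CHARS_PER_TOKEN


def corpus_fits(root: str, context_window: int, pattern: str = "**/*.md") -> tuple[int, bool]:
    """Return (estimated_tokens, fits) for all files under root matching pattern."""
    total = 0
    for path in Path(root).glob(pattern):
        if path.is_file():
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += estimate_tokens(text)
    return total, total <= context_window
```

Calling `corpus_fits("docs", context_window=1_000_000)` gives a ballpark figure in seconds. If the estimate lands anywhere near the limit, switch to the provider's own token counter before deciding.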

RAG in 60 Seconds (And Where It Still Wins)

The Standard Architecture

You know the drill. Split documents into chunks. Generate embeddings. Store them in a vector database. At query time, embed the query, find the most similar chunks, inject those chunks into the prompt, let the model generate a response.
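For readers who want the steps spelled out, here is a toy version of that pipeline in pure Python. It is a sketch under loud assumptions: the "embedding" is just a bag of lowercase term counts and the "vector database" is a list, standing in for a real embedding model and a real vector store. All names and sample documents are illustrative.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Split text into chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase term counts. A real pipeline calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


docs = [
    "API keys are rotated every 90 days and stored in the secrets manager.",
    "The cafeteria menu changes weekly and is posted on the intranet.",
]

# "Index" step: chunk every document and store each chunk with its vector.
index = [(c, embed(c)) for d in docs for c in chunk(d)]

# Query step: retrieve the best chunk and inject it into the prompt.
question = "How are API keys stored?"
top = retrieve(question, index, k=1)
prompt = "Context:\n" + "\n".join(top) + f"\n\nQuestion: {question}"
print(prompt)
```

Everything the model sees is whatever `retrieve` returns, which is why retrieval quality caps answer quality in this architecture.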

This made perfect sense when context windows were tiny. You couldn't fit your data in the prompt, so you had to pick and choose.

When RAG Is Still the Right Call

RAG isn't going anywhere. It's the correct choice when:

  • Your corpus exceeds context limits. Tens of millions of tokens simply won't fit.
  • Your data changes frequently. Incremental indexing beats reloading everything.
  • You need multi-tenant access control. RAG lets you filter retrieval results by user permissions before anything hits the model.
  • Latency matters more than anything. Sending 1M tokens per request is slower than retrieving 5 relevant chunks.
  • Regulators want audit trails. Some industries require traceable retrieval provenance showing exactly which documents informed an answer.
  • Data governance says minimize exposure. Selective retrieval means less data leaving your perimeter.
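The access-control point in the list above is the easiest one to show in code. In this minimal sketch the `Chunk` type, the group names, and the `visible_chunks` helper are all hypothetical; production vector stores usually do this with metadata filters inside the query rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_groups: frozenset[str]


def visible_chunks(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only chunks that share at least one group with the user."""
    return [c for c in chunks if c.allowed_groups & user_groups]


store = [
    Chunk("Salary bands for 2025.", frozenset({"hr"})),
    Chunk("How to request a laptop.", frozenset({"hr", "engineering"})),
]

# An engineer only ever searches (and only ever sends the model) the second chunk.
searchable = visible_chunks(store, {"engineering"})
print([c.text for c in searchable])
```

With full-corpus injection there is no equivalent step: everything in the prompt is visible to the model for every user, so per-user filtering has to happen when the corpus is assembled.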

Long Context Injection: The "Just Stuff It In" Approach

How It Works

Load your entire dataset. Put it in the prompt. Ask your question. No chunking. No embeddings. No vector database.

Three things used to make this impractical: cost was absurd, latency was unacceptable, and models lost track of information buried in long contexts. All three have shifted. Models handle long sequences better architecturally. Prices dropped aggressively, especially with context caching. And long-context accuracy improved dramatically.

Code Example: Full Corpus Q&A with Gemini 1.5 Pro

# Requires: pip install google-generativeai

import os
import glob
import datetime

import google.generativeai as genai
from google.generativeai import caching

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Load all markdown files from a docs directory
docs = []
for filepath in sorted(glob.glob("docs/**/*.md", recursive=True)):
    with open(filepath, "r", encoding="utf-8") as f:
        docs.append(f"--- FILE: {filepath} ---\n{f.read()}")

corpus = "\n\n".join(docs)

# Create a cached context for repeated queries against the same corpus
# Note: Cached content requires a minimum of 32,768 tokens.
# The ttl (time-to-live) controls how long the cache persists.
cache = caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    display_name="internal-docs",
    system_instruction=(
        "You are a helpful assistant. Answer questions using only the provided documents. "
        "Cite the filename when referencing a source."
    ),
    contents=[corpus],
    ttl=datetime.timedelta(hours=1),
)

model = genai.GenerativeModel.from_cached_content(cached_content=cache)

# Query against the full corpus
response = model.generate_content("How do we handle authentication for API keys?")
print(response.text)

# Cleanup when done
# cache.delete()

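Creating the cache uploads and processes the corpus once. Every subsequent query against the same cached content gets reduced latency and the cached pricing tier until the TTL expires. Two gotchas: context caching usually has a minimum token count (Gemini wants at least 32,768 tokens of cached content), so very small corpora won't qualify, and cached tokens are billed for storage for as long as the cache lives, so set the TTL deliberately.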

Code Example: Same Task with Claude's 200K Context

# Requires: pip install anthropic

import os
import glob

import anthropic

client = anthropic.Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])

# Load documentation files
docs = []
for filepath in sorted(glob.glob("docs/**/*.md", recursive=True)):
    with open(filepath, "r", encoding="utf-8") as f:
        docs.append(f"--- FILE: {filepath} ---\n{f.read()}")

corpus = "\n\n".join(docs)

message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=4096,
    system=(
        "You are a helpful assistant. Answer questions using only the provided documents. "
        "Cite the filename when referencing a source."
    ),
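    messages=[
        {
            "role": "user",
            "content": (
                # Wrap the corpus in explicit tags so the model can tell documents
                # from the question (the tag name is illustrative).
                f"<documents>\n{corpus}\n</documents>\n\n"
                "Question: How do we handle authentication for API keys?"
            ),
        }
    ],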
)

print(message.content[0].text)

The pattern works the same regardless of provider. Load files, concatenate with clear delimiters, inject as context, query.

One thing to watch: sending hundreds of thousands of tokens can run into HTTP payload size limits or gateway timeouts depending on your provider and infrastructure. Some providers offer file upload APIs that handle large inputs more gracefully than raw prompt injection.

Head-to-Head: Long Context vs. RAG on Five Dimensions

Accuracy and Recall

Multiple research groups have directly tested whether long-context models can replace retrieval. The findings keep pointing the same direction: long-context models match or beat RAG on most benchmarks when the data fits in the window.

This shouldn't surprise anyone. RAG's retrieval step is lossy by design. Chunking splits documents at arbitrary boundaries. Top-K selection throws away passages that might have been relevant. Re-ranking helps but can't recover information that was never retrieved in the first place.
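The chunk-boundary problem is easy to reproduce. The snippet below is a deliberately naive illustration: it uses fixed-size character windows (one of several chunking strategies, chosen here because it makes the failure obvious) and a made-up policy sentence.

```python
def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Split text into consecutive character windows of length `size`."""
    return [text[i:i + size] for i in range(0, len(text), size)]


policy = (
    "Refunds are available within 30 days. "
    "Exceptions require approval from the finance team."
)

chunks = fixed_size_chunks(policy, 48)
for c in chunks:
    print(repr(c))

# The rule about exceptions is now split across two chunks, so no single
# chunk contains the full statement a retriever would need to surface.
split_across_chunks = not any("Exceptions require approval" in c for c in chunks)
print(split_across_chunks)
```

Overlapping windows or sentence-aware splitting reduce the damage, but each is one more parameter to tune and evaluate.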

Long context sidesteps all of that. The model sees everything. Older long-context models had "lost in the middle" problems where information buried in the center of a long prompt got ignored. I've tested newer models like Gemini 1.5 Pro, and the positional bias is mostly gone. Google's needle-in-a-haystack evaluations show near-perfect retrieval regardless of where you put the target information.

Bottom line: for corpora that fit in the window, long context gives you equal or better recall. Just structure your documents with clear delimiters and metadata headers.

Latency

RAG chains together multiple steps: embed the query, search the vector database, retrieve chunks, build the prompt, call the LLM. You're looking at one to several seconds depending on your infrastructure.

Long context is a single LLM call, but with a lot more input tokens. That first call against a big corpus is slow. Context caching changes the math entirely for repeat queries though.

| Scenario | RAG Latency | Long Context (Uncached) | Long Context (Cached) |
| --- | --- | --- | --- |
| First query, 500K corpus | 1–3s | 5–15s | 5–15s |
| Subsequent queries, same corpus | 1–3s | 5–15s | 1–5s |
| High concurrency (100 QPS) | Scales with infra | Higher cost per query | Efficient with cache |

Cost

This is where it gets interesting. I wrote a cost model:

def compare_monthly_costs(
    corpus_tokens: int,
    queries_per_day: int,
    avg_output_tokens: int = 500,
    avg_retrieved_tokens: int = 3000,
    cache_hit_rate: float = 0.95,
    provider: str = "gemini",
):
    """Compare monthly costs for RAG vs long-context approaches."""
    monthly_queries = queries_per_day * 30

    # Provider pricing (per 1M tokens) - illustrative; verify current rates
    pricing = {
        "gemini": {
            "input": 1.25,
            "cached_input": 0.3125,
            "output": 5.00,
            "embedding": 0.00,
        },
        "openai": {
            "input": 2.50,
            "cached_input": 1.25,
            "output": 10.00,
            "embedding": 0.02,
        },
    }
    p = pricing.get(provider, pricing["gemini"])

    # --- Long Context Cost ---
    first_call_tokens = corpus_tokens * monthly_queries * (1 - cache_hit_rate)
    cached_call_tokens = corpus_tokens * monthly_queries * cache_hit_rate
    output_tokens_total = avg_output_tokens * monthly_queries

    lc_input_cost = (first_call_tokens / 1_000_000) * p["input"]
    lc_cached_cost = (cached_call_tokens / 1_000_000) * p["cached_input"]
    lc_output_cost = (output_tokens_total / 1_000_000) * p["output"]

    long_context_total = lc_input_cost + lc_cached_cost + lc_output_cost

    # --- RAG Cost ---
    embedding_cost = (corpus_tokens / 1_000_000) * p["embedding"]  # one-time embed
    query_embedding_cost = (monthly_queries * 256 / 1_000_000) * p["embedding"]

    rag_input_tokens = (avg_retrieved_tokens + 500) * monthly_queries  # chunks + query
    rag_input_cost = (rag_input_tokens / 1_000_000) * p["input"]
    rag_output_cost = lc_output_cost  # same output volume

    vector_db_cost = 25.0  # monthly managed vector DB estimate

    rag_total = (
        embedding_cost
        + query_embedding_cost
        + rag_input_cost
        + rag_output_cost
        + vector_db_cost
    )

    return {
        "long_context_monthly": round(long_context_total, 2),
        "rag_monthly": round(rag_total, 2),
        "cheaper": "Long Context" if long_context_total < rag_total else "RAG",
    }


# Example: 500K token corpus, 1000 queries/day
result = compare_monthly_costs(500_000, 1000, provider="gemini")
print(f"Long Context: ${result['long_context_monthly']}/mo")
print(f"RAG: ${result['rag_monthly']}/mo")
print(f"Winner: {result['cheaper']}")

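The variables that matter: corpus size, query volume, cache hit rate, and vector DB hosting costs. A managed vector database adds a fixed monthly cost that dominates at low query volumes. Long context costs scale linearly with corpus size times query count. For the example call above (500K tokens, 1,000 queries/day, Gemini rates), the model returns roughly $5,466/month for long context against about $231/month for RAG, so at that volume RAG wins comfortably. Plug in your own provider's current rates; these numbers shift often.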

Implementation Complexity

This is the dimension people underestimate most. A production RAG system needs: vector database provisioning and management, an embedding pipeline with model selection, a chunking strategy (recursive character? semantic? document-based?), retrieval evaluation and tuning, re-ranking configuration, and ongoing maintenance for embedding drift when you swap models.

A long context implementation needs: file loading, token counting, prompt construction, and optionally caching setup.

Realistically? 2 to 5 developer-days for long context versus 2 to 6 weeks for a well-tuned production RAG pipeline. That's not a small gap.

Maintainability

When your source documents change, a RAG system needs re-embedding of affected chunks, careful handling of stale vectors, and awareness that chunk boundary shifts can change retrieval behavior in subtle ways. A long context system needs you to reload the files and invalidate the cache.

That asymmetry compounds over time. Especially if your team doesn't have dedicated ML ops people.
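"Reload the files and invalidate the cache" can be driven by a content fingerprint. This is one possible sketch, not a prescribed method: the `corpus_fingerprint` helper is my own, and how you store the previous digest and rebuild the provider-side cache is left to your setup.

```python
import hashlib
from pathlib import Path


def corpus_fingerprint(root: str, pattern: str = "**/*.md") -> str:
    """Hash file paths and contents so any edit, add, or delete changes the digest."""
    digest = hashlib.sha256()
    for path in sorted(Path(root).glob(pattern)):
        if path.is_file():
            # Include the relative path so renames are detected too.
            digest.update(path.relative_to(root).as_posix().encode("utf-8"))
            digest.update(b"\0")
            digest.update(path.read_bytes())
            digest.update(b"\0")
    return digest.hexdigest()
```

Compare the digest with the one recorded when the cache was created; if they differ, delete the old cache and create a new one from the reloaded corpus.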

The Vector DB vs. Token Cost Calculator

Running the Numbers

The cost comparison function above captures the methodology, but real-world results depend on your actual cache hit rate, your vector DB tier (serverless vs. managed, costing anywhere from a few dollars to $100+ monthly), and your average output length.

I ran the calculator across a bunch of scenarios. Clear patterns emerge:

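Under 200K tokens at low query volume? Long context with caching wins. With the illustrative Gemini rates above, a 95% cache hit rate, and a $25/month vector DB, the break-even for a 200K-token corpus sits around 370 queries per month (roughly 12 per day) on API spend alone. Below that, the vector DB's fixed hosting cost exceeds your total long-context API spend. A pricier vector DB tier, or counting the engineering time RAG requires, pushes the practical threshold higher.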

Over 500K tokens with 5,000+ queries per day? RAG typically wins on cost because you're sending a few thousand retrieved tokens per query instead of the full corpus every time.

Context caching is the swing factor. At Gemini's cached rate of $0.3125 per million input tokens versus $1.25 uncached, a high cache hit rate slashes long-context costs dramatically. The exact savings depend on cache eligibility rules and TTL settings specific to your provider.
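The crossover point can be computed directly instead of found by trial and error. The sketch below reuses the illustrative Gemini rates and the 3,500-token RAG prompt from the cost model; the function name and defaults are mine. Output-token and embedding costs are left out because output is identical on both sides and embedding spend is negligible at these rates.

```python
def breakeven_queries_per_month(
    corpus_tokens: int,
    input_rate: float = 1.25,       # $ per 1M uncached input tokens (illustrative)
    cached_rate: float = 0.3125,    # $ per 1M cached input tokens (illustrative)
    cache_hit_rate: float = 0.95,
    rag_prompt_tokens: int = 3500,  # retrieved chunks + query, as in the cost model
    vector_db_monthly: float = 25.0,
) -> float:
    """Monthly query volume below which long context is cheaper than RAG."""
    # Blended price per 1M corpus tokens, given the cache hit rate.
    blended = cache_hit_rate * cached_rate + (1 - cache_hit_rate) * input_rate
    lc_per_query = corpus_tokens / 1_000_000 * blended
    rag_per_query = rag_prompt_tokens / 1_000_000 * input_rate
    extra_per_query = lc_per_query - rag_per_query
    if extra_per_query <= 0:
        return float("inf")  # long context is never more expensive per query
    # The vector DB's fixed fee buys this many "more expensive" long-context queries.
    return vector_db_monthly / extra_per_query


for size in (50_000, 200_000, 500_000):
    q = breakeven_queries_per_month(size)
    print(f"{size:>7,} tokens: break-even at ~{q:,.0f} queries/month")
```

Under these assumptions a 200K-token corpus breaks even at roughly 370 queries per month. The threshold moves quickly with the vector DB tier and the cache hit rate, which is why it is worth rerunning with your own numbers.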

The Hybrid Architecture: Best of Both

Pattern: RAG as Pre-filter, Long Context as Processor

There's a middle path. Use lightweight retrieval to select relevant documents from a larger corpus, then inject those complete documents (not chunks) as full context.

This avoids the information loss from chunk boundaries cutting through relevant passages while keeping input size manageable. It works especially well for corpora in the 1M to 10M token range, where the full dataset exceeds context limits but the relevant subset fits comfortably.

Code Example: Hybrid Retrieval Plus Full Context

# Requires: pip install rank_bm25 google-generativeai

import os
import glob

from rank_bm25 import BM25Okapi
import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])

# Load documents as complete units (not chunks)
documents = {}
for filepath in sorted(glob.glob("docs/**/*.md", recursive=True)):
    with open(filepath, "r", encoding="utf-8") as f:
        documents[filepath] = f.read()

# Build a lightweight BM25 index over full documents
doc_keys = list(documents.keys())
tokenized_docs = [documents[k].lower().split() for k in doc_keys]
bm25 = BM25Okapi(tokenized_docs)


def hybrid_query(question: str, top_k: int = 30, token_budget: int = 800_000) -> str:
    """Retrieve top documents via BM25, then inject full docs into long context."""
    tokenized_query = question.lower().split()
    scores = bm25.get_scores(tokenized_query)

    # Rank and select top-K complete documents
    ranked_indices = sorted(
        range(len(scores)),
        key=lambda i: scores[i],
        reverse=True,
    )

    selected_docs = []
    total_chars = 0  # rough proxy; use proper token counting in production

    for idx in ranked_indices[:top_k]:
        doc_text = documents[doc_keys[idx]]

        # ~4 chars per token
        if total_chars + len(doc_text) > token_budget * 4:
            break

        selected_docs.append(f"--- FILE: {doc_keys[idx]} ---\n{doc_text}")
        total_chars += len(doc_text)

    context = "\n\n".join(selected_docs)

    model = genai.GenerativeModel("gemini-1.5-pro-001")
    response = model.generate_content(
        "Using the following documents, answer the question. Cite filenames.\n\n"
        f"<documents>\n{context}\n</documents>\n\n"
        f"Question: {question}"
    )
    return response.text


answer = hybrid_query("What are the rate limiting policies for our public API?")
print(answer)

When to Graduate from Long Context to RAG

Think of it as a progression. Start with long context injection. Add retrieval when your corpus grows past the context window, when query volume makes per-request costs untenable, when you need per-user access control at the retrieval layer, or when your data changes so fast that cache invalidation becomes its own problem.

The migration path is smooth. Your long-context prompt templates and document formatting transfer directly to the generation stage of a RAG pipeline. You're bolting a retrieval layer on the front, not rebuilding from scratch.

Practical Implementation Guide

Optimizing Long Context Performance

I've tested a few things that make a real difference:

Put the documents you expect to be most relevant near the beginning or end of the context. Newer models have reduced positional bias a lot, but the ordering costs nothing and helps with any model that still shows it.

Separate documents with consistent markers: XML tags, horizontal rules, or header blocks that include metadata like filename, last-modified date, and document type. Models latch onto structure.
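One way to apply that advice is a small wrapper that emits the same header for every document. The tag and attribute names below are my own choice, not a format any provider requires; what matters is that the structure is consistent.

```python
from datetime import date
from xml.sax.saxutils import quoteattr


def wrap_document(filename: str, text: str, last_modified: date, doc_type: str) -> str:
    """Wrap one document in an XML-style tag carrying its metadata as attributes."""
    return (
        f"<document filename={quoteattr(filename)} "
        f"last_modified={quoteattr(last_modified.isoformat())} "
        f"type={quoteattr(doc_type)}>\n"
        f"{text.strip()}\n"
        "</document>"
    )


block = wrap_document(
    "auth.md",
    "Use OAuth for user-facing flows.\n",
    date(2025, 1, 2),
    "guide",
)
print(block)
```

`quoteattr` handles quoting and escaping of attribute values, so filenames containing quotes or ampersands do not break the markup.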

Tell the model to cite sources explicitly in the system prompt. Without this instruction, models tend to synthesize answers without grounding, and you lose the ability to verify anything.

Set up context caching for any scenario where multiple queries target the same corpus. It amortizes the input cost and latency across requests.

Token Budget Management

# Requires: pip install tiktoken

import tiktoken


def assemble_context(
    documents: list[dict],  # [{"id": str, "text": str, "priority": float}]
    token_budget: int = 900_000,
    model_encoding: str = "cl100k_base",
) -> str:
    """Assemble optimal context within token budget, prioritizing high-value docs.

    Note: tiktoken with cl100k_base gives accurate counts for OpenAI models.
    For Gemini or Claude, token counts will differ since each provider uses its
    own tokenizer. Use this as an approximation, or use each provider's own
    token counting utility for precision.
    """
    enc = tiktoken.get_encoding(model_encoding)

    # Sort by priority descending
    sorted_docs = sorted(documents, key=lambda d: d["priority"], reverse=True)

    selected = []
    tokens_used = 0

    for doc in sorted_docs:
        doc_tokens = len(enc.encode(doc["text"]))
        if tokens_used + doc_tokens > token_budget:
            continue  # skip this doc; try smaller ones

        selected.append(f"--- DOC ID: {doc['id']} ---\n{doc['text']}")
        tokens_used += doc_tokens

    context = "\n\n".join(selected)
    print(f"Context assembled: {tokens_used:,} tokens across {len(selected)} documents")
    return context

This sorts documents by a priority heuristic (recency, relevance score, manual weighting, whatever you want), then greedily fills the token budget. It skips documents that would exceed the limit rather than truncating them, which preserves document integrity. The tiktoken counts are accurate for OpenAI models; for Gemini or Claude, treat them as approximations and use each provider's own counting utility when precision matters.

Error Handling and Fallbacks

Build in graceful degradation for when your corpus approaches the window limit. Monitor token counts per request. Set alerts at 80% of the context limit. When you cross that threshold, fall back to the hybrid pattern (pre-filter, then inject) rather than silently truncating content.
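The threshold check itself is only a few lines. In this sketch the `Strategy` enum and `pick_strategy` function are hypothetical names, and the token count is assumed to come from whatever counter you already run per request.

```python
from enum import Enum


class Strategy(Enum):
    FULL_CONTEXT = "full_context"
    HYBRID = "hybrid"


def pick_strategy(corpus_tokens: int, context_limit: int, alert_ratio: float = 0.8) -> Strategy:
    """Fall back to the hybrid pattern once the corpus crosses the alert threshold."""
    if corpus_tokens > context_limit * alert_ratio:
        return Strategy.HYBRID
    return Strategy.FULL_CONTEXT


for tokens in (500_000, 850_000):
    choice = pick_strategy(tokens, context_limit=1_000_000)
    print(f"{tokens:,} tokens -> {choice.value}")
```

Making the decision explicit means a growing corpus changes the code path visibly instead of failing, or truncating, at the provider.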

And this is worth saying explicitly: scan your corpus for PII and secrets before sending it to external APIs. What used to be a retrieval-time concern becomes an ingestion-time requirement when you're shipping the entire knowledge base in every prompt.
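A pre-flight scan can be as simple as a handful of regular expressions run over each file before it joins the corpus. The patterns below are illustrative and far from exhaustive (email addresses, the `AKIA`-prefixed AWS access key ID format, and PEM private key headers); a dedicated secret scanner will catch much more.

```python
import re

# Illustrative patterns only; extend or replace with a dedicated scanner.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def scan_text(text: str) -> dict[str, int]:
    """Return a count of matches per pattern name, omitting patterns with no hits."""
    hits = {name: len(pattern.findall(text)) for name, pattern in PATTERNS.items()}
    return {name: count for name, count in hits.items() if count}


sample = "Escalate to ops@example.com. Old key: AKIAEXAMPLEEXAMPLE12"
findings = scan_text(sample)
if findings:
    print(f"Blocked: {findings}")
```

Treat any hit as a reason to exclude or redact the file before it is sent, rather than as a warning to log and move past.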

What This Means for the Vector Database Market

Vector database companies built significant businesses on the RAG thesis, and their products solve real problems. But the addressable market for "you need a vector DB to build any LLM application" is narrower than the 2023 hype cycle suggested.

Long context doesn't kill vector search. It narrows the use case. Vector databases are still the right tool for semantic search interfaces, recommendation systems, multimodal retrieval, anomaly detection, deduplication, clustering, and any corpus that exceeds context window limits.

The shift is from "every LLM app needs a vector DB" to "large-scale retrieval applications need a vector DB." For the long tail of developers building internal tools, support bots, and document Q&A systems over modest corpora, the vector database is unnecessary infrastructure.

Choose Boring Architecture (Until You Can't)

The argument is simple. Start with the simplest thing that works. For most developers' actual data sizes, that's long context injection right now. Load your documents, stuff them in the prompt, ask your question. No vector database. No embedding pipeline. No chunking strategy debates.

RAG is still the answer at scale. But "scale" starts much higher than most teams realize.

Your triggers for adding complexity should be concrete. The corpus exceeds the context window (hard limit). Per-request latency violates your SLOs (soft limit). Query volume pushes long-context costs past the RAG crossover (economic limit). Data governance requires minimizing what gets sent to external providers (organizational limit).
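Those four triggers translate into a checklist you can run against real measurements. The `Workload` fields and the function name below are assumptions for illustration; feed them from your own monitoring and from the cost model.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    corpus_tokens: int
    context_window: int
    p95_latency_s: float
    latency_slo_s: float
    long_context_monthly_cost: float
    rag_monthly_cost: float
    must_minimize_external_data: bool


def reasons_to_add_retrieval(w: Workload) -> list[str]:
    """Return the triggers that currently apply; an empty list means stay simple."""
    reasons = []
    if w.corpus_tokens > w.context_window:
        reasons.append("hard limit: corpus exceeds the context window")
    if w.p95_latency_s > w.latency_slo_s:
        reasons.append("soft limit: latency violates the SLO")
    if w.long_context_monthly_cost > w.rag_monthly_cost:
        reasons.append("economic limit: long context costs more than RAG")
    if w.must_minimize_external_data:
        reasons.append("organizational limit: governance requires minimal exposure")
    return reasons


docs_bot = Workload(
    corpus_tokens=150_000,
    context_window=1_000_000,
    p95_latency_s=4.0,
    latency_slo_s=10.0,
    long_context_monthly_cost=17.0,
    rag_monthly_cost=27.0,
    must_minimize_external_data=False,
)
print(reasons_to_add_retrieval(docs_bot) or "No trigger hit: keep long context")
```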

Until you hit one of those? Run the cost calculator against your actual numbers. Benchmark long context accuracy on your real corpus. Make the decision with data, not defaults.

The 1M token window is here. For most of us, it's more than enough.


Matt Mickiewicz

Matt is the co-founder of SitePoint, 99designs and Flippa. He lives in Vancouver, Canada.

© 2000 – 2026 SitePoint Pty. Ltd.