The T4 Stack is what happens when you bolt an intelligence layer onto the T3 foundation — Next.js 16, Vercel AI SDK, and local RAG powered by context engineering principles. This tutorial walks through every layer of it with full code examples.
Table of Contents
- The T3 Stack Changed Everything — Now It Needs an Upgrade
- What's New in Next.js 16 That Matters for AI
- The Vercel AI SDK: The Backbone of the T4 Stack
- Context Engineering > Prompt Engineering
- Building Local RAG: Embeddings + Vector Search Without External APIs
- Wiring It All Together: The Full T4 RAG Chat
- Making It "Senior": Production Patterns
- Performance and Cost Comparison
- What's Next for the T4 Stack
- This Is the Stack Senior Engineers Are Shipping
The T3 Stack gave full-stack TypeScript development a backbone. Next.js, TypeScript, tRPC, Tailwind, Prisma or Drizzle: it was the scaffold that let teams ship fast with end-to-end type safety. The T4 Stack is what happens when you bolt an intelligence layer onto that foundation. T4 stands for T3 plus Transformers, and it represents a pattern built around Next.js 16, the Vercel AI SDK, and local RAG powered by context engineering principles. This isn't a theoretical architecture diagram. It's a working stack you can build and deploy today, and this tutorial walks through every layer of it with full code examples.
If you've been writing fetch calls to OpenAI's API and duct-taping JSON parsing around the responses, this article is your off-ramp into something more solid.
The T3 Stack Changed Everything — Now It Needs an Upgrade
What Made T3 the Default (Next.js + TypeScript + tRPC + Tailwind)
The create-t3-app repository sits at over 25,000 stars on GitHub, and for good reason. It solved a real problem: the gap between frontend and backend type systems. Before T3, full-stack TypeScript developers burned enormous effort keeping API contracts in sync. You'd define a response shape on the server, then manually mirror that type on the client, and pray nothing drifted.
T3 closed that gap with a curated set of tools. tRPC killed the API contract problem entirely by sharing types between server and client at compile time. Prisma (and later Drizzle) brought type-safe database queries. Zod handled runtime validation. Tailwind removed the CSS-in-JS debate. The whole thing scaffolded in seconds via the CLI.
The result became the default starting point for thousands of projects. Teams could go from zero to a production-ready, fully typed application without spending a week on boilerplate.
But T3 has a gap. It has no opinion about AI. No embedding layer. No retrieval pattern. No streaming primitives for language model responses. Teams that need AI features today are bolting them on ad hoc, adding route handlers that call OpenAI directly, parsing unstructured text responses with regex, and storing nothing for context. It works, but it's not a stack. It's a workaround.
Enter the T4 Stack: T3 + Transformers
The T4 Stack is a name for a pattern, not an official project (yet). It takes the T3 philosophy of curated, type-safe, opinionated tooling and extends it with an AI layer built on three pillars:
- Next.js 16 as the application framework (with its new async primitives)
- Vercel AI SDK as the model interface and streaming layer
- Local RAG (embeddings + vector search + context assembly) for grounded responses
Here's how the layers connect:
User Query
→ Next.js 16 Route Handler
→ Local Embedding Model (query → vector)
→ Vector Store (SQLite + vector extension)
→ Top-K Relevant Chunks
→ Context Assembly (system prompt + chunks + history)
→ LLM via AI SDK (stream response)
→ React Client (useChat hook, streaming UI)
Why "Transformers" and not just "AI"? Specificity matters. The transformer architecture underpins both the embedding models (which power retrieval) and the language models (which generate responses). Every component in the AI layer of this stack is a transformer. The name is accurate, and it keeps the alliterative T-naming convention alive.
What's New in Next.js 16 That Matters for AI
Next.js 16 introduces changes that directly affect how you build AI features. This isn't a minor version bump. There are breaking changes that require migration, and new primitives that make AI patterns built on React Server Components significantly cleaner.
Async Request APIs and Why They Unlock Streaming AI
The shift to asynchronous request-level APIs — cookies(), headers(), and dynamic params — actually began in Next.js 15, where synchronous access was deprecated. Next.js 16 completes this transition and removes the synchronous fallback entirely. You must await these calls.
This sounds like a small syntactic change, but the architectural implications are real. Async request APIs mean the framework can defer request-scoped work, which aligns perfectly with streaming AI responses. When your route handler needs to read a cookie (say, for a session token to determine user context), then embed a query, then stream an LLM response, the entire pipeline is now naturally async from top to bottom. No synchronous bottlenecks forcing the runtime to block.
React 19 as Default: Server Functions and useActionState
Next.js 16 ships with React 19 as the default (React 19 was also available in Next.js 15, but 16 cements it as the baseline). This brings Server Functions (the evolution of Server Actions) and hooks like useActionState into the core framework. For AI applications, this matters because the Vercel AI SDK's client-side hooks like useChat and useCompletion pair naturally with React 19's model of server-client interaction. You can trigger server-side AI generation from client components using the same patterns you use for form submissions, with built-in support for pending states and optimistic updates.
Turbopack Stable: Dev Speed for AI Iteration
Turbopack reached stable status for development in Next.js 15 and is the default bundler in Next.js 16. If you've worked on AI features, you know the iteration loop: tweak a system prompt, adjust the context assembly logic, reload, test. With Webpack, hot module replacement in large applications could take seconds. Turbopack's Rust-based architecture brings that down dramatically, especially in projects with heavy dependency trees (and AI-heavy apps tend to pull in large model libraries). The faster your feedback loop, the faster you iterate on context engineering.
Here's what the migration looks like in practice for an AI chat endpoint:
Code Example 1: Migrating a Next.js 15 route handler to Next.js 16 async API patterns
// BEFORE: Next.js 15 — app/api/chat/route.ts
import { cookies } from 'next/headers';
export async function POST(request: Request) {
// Synchronous access to cookies() still worked in Next.js 15, but it was deprecated
const cookieStore = cookies();
const sessionToken = cookieStore.get('session')?.value;
const { messages } = await request.json();
// ... AI logic
return new Response('OK');
}
// AFTER: Next.js 16 — app/api/chat/route.ts
import { cookies } from 'next/headers';
export async function POST(request: Request) {
// Next.js 16 removes the synchronous fallback, so cookies() must be awaited
const cookieStore = await cookies();
const sessionToken = cookieStore.get('session')?.value;
const { messages } = await request.json();
// ... AI logic
return new Response('OK');
}
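The change is small in syntax but matters in practice. With the synchronous fallback gone, skipping the await means calling .get() on a Promise: TypeScript reports that as a type error, which fails the build, and untyped code fails at runtime. Every route handler touching request APIs needs this update, and if you have AI endpoints that read session data or custom headers for context, you'll hit this immediately.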
The Vercel AI SDK: The Backbone of the T4 Stack
The Vercel AI SDK is what turns a Next.js application into a full-stack AI architecture. It gives you a unified interface for working with language models, streaming responses to the client, and generating structured output, all with TypeScript types flowing through the entire pipeline.
Unified Provider Interface: One API for Multiple Models
The SDK's provider abstraction is its most practically useful feature. You write your application logic once and swap between model providers by changing configuration. This matters for the T4 Stack because you'll likely use different models in different contexts: a large cloud-hosted model for production, a local model via Ollama for development and testing, and possibly a different provider for embeddings versus generation.
Code Example 2: Setting up the AI SDK with multiple providers
// lib/ai/providers.ts
// npm install ai @ai-sdk/openai ollama-ai-provider
import { createOpenAI } from '@ai-sdk/openai';
import { createOllama } from 'ollama-ai-provider';
// Production: OpenAI
const openai = createOpenAI({
apiKey: process.env.OPENAI_API_KEY,
});
// Local development: Ollama (ensure Ollama is running: ollama serve)
const ollama = createOllama({
baseURL: 'http://localhost:11434/api',
});
// Select provider based on environment
export const chatModel =
process.env.NODE_ENV === 'production'
? openai('gpt-4o')
: ollama('llama3.1');
export const embeddingModel =
process.env.NODE_ENV === 'production'
? openai.embedding('text-embedding-3-small')
: ollama.embedding('nomic-embed-text');
The key insight: your route handlers, your context assembly logic, your streaming UI components never reference a specific provider. They import chatModel or embeddingModel from this config file. When you need to swap from Ollama to Anthropic, you change one line in one file.
Streaming UI Primitives: streamText, streamUI, and generateObject
The SDK provides three server-side functions that cover the major AI response patterns:
- streamText: Streams text from a language model, returning a ReadableStream compatible with Next.js route handlers. This is the workhorse for chat interfaces.
- streamUI: Streams React Server Components from the server to the client. Instead of streaming text that the client renders, you stream actual UI components. Powerful for rich, interactive AI responses. (Note: streamUI is considered experimental and its API may change between SDK versions.)
- generateObject: Generates a structured, typed object from a language model using a Zod schema. This replaces the fragile pattern of asking the LLM for JSON and then parsing it with JSON.parse wrapped in a try-catch. The model is constrained to produce valid output matching your schema, and the SDK validates it.
For structured output, generateObject changes the game. Instead of hoping the model returns valid JSON, you define exactly what shape you need and the SDK handles the rest.
Code Example 3: streamText in a Next.js 16 route handler
// app/api/chat/route.ts
import { streamText } from 'ai';
import { chatModel } from '@/lib/ai/providers';
export async function POST(request: Request) {
const { messages } = await request.json();
const result = streamText({
model: chatModel,
system: 'You are a helpful assistant. Answer concisely.',
messages,
});
return result.toDataStreamResponse();
}
That's the entire route handler for a streaming chat endpoint. The toDataStreamResponse() method returns a Response object with the correct headers and streaming body that the AI SDK's client hooks expect. No manual ReadableStream construction, no chunked encoding setup.
useChat and useCompletion: Client-Side Hooks That Just Work
On the client side, the SDK provides React hooks that manage the full lifecycle of an AI conversation: sending messages, receiving streamed tokens, handling loading states, managing errors, and maintaining conversation history.
Code Example 4: Full useChat client component
// components/Chat.tsx
'use client';
import { useChat } from '@ai-sdk/react';
export function Chat() {
const { messages, input, handleInputChange, handleSubmit, isLoading, error } =
useChat({
api: '/api/chat',
});
return (
<div className="flex flex-col h-screen max-w-2xl mx-auto p-4">
<div className="flex-1 overflow-y-auto space-y-4">
{messages.map((message) => (
<div
key={message.id}
className={`p-3 rounded-lg ${
message.role === 'user'
? 'bg-blue-100 ml-auto max-w-xs'
: 'bg-gray-100 mr-auto max-w-md'
}`}
>
<p className="text-sm font-medium">
{message.role === 'user' ? 'You' : 'Assistant'}
</p>
<p className="text-sm mt-1 whitespace-pre-wrap">
{message.content}
</p>
</div>
))}
{isLoading && (
<div className="bg-gray-100 mr-auto max-w-md p-3 rounded-lg animate-pulse">
<p className="text-sm text-gray-500">Thinking...</p>
</div>
)}
</div>
{error && (
<div className="p-2 mb-2 bg-red-100 text-red-700 text-sm rounded">
Error: {error.message}
</div>
)}
<form onSubmit={handleSubmit} className="flex gap-2 pt-4 border-t">
<input
value={input}
onChange={handleInputChange}
placeholder="Ask something..."
className="flex-1 p-2 border rounded-lg"
disabled={isLoading}
/>
<button
type="submit"
disabled={isLoading}
className="px-4 py-2 bg-blue-600 text-white rounded-lg disabled:opacity-50"
>
Send
</button>
</form>
</div>
);
}
The useChat hook handles state management that would otherwise eat dozens of lines of custom code: optimistic message display, streaming token accumulation, error recovery, and input clearing on submit. It speaks the AI SDK's expected protocol when communicating with the route handler, so the two sides just work together.
Context Engineering > Prompt Engineering
What Context Engineering Actually Means (Not Just a Buzzword)
The concept of context engineering crystallized in mid-2025, building on the idea — articulated by Andrej Karpathy and others — that the quality of an AI system's output depends far more on what information you put into the context window than on how cleverly you phrase your instructions.
Prompt engineering treated the system prompt as the primary lever. You'd craft elaborate instructions, add few-shot examples, and iterate on phrasing to coax the model into better behavior. It worked, up to a point. But the approach is fundamentally static. A carefully crafted prompt doesn't adapt to the user's specific question, the relevant documents in your knowledge base, or the results of tool calls made during the conversation.
Context engineering treats the entire context window as a dynamic data structure assembled at runtime. The system prompt is just one component. The real leverage comes from what else you include: retrieved documents, tool outputs, conversation history, user metadata. This is a systems architecture problem, not a copywriting problem.
The Four Layers of Context in the T4 Stack
The T4 Stack implements context engineering through four distinct layers, each corresponding to a specific part of the codebase:
Layer 1: System Instructions (Static)
Your base system prompt. It defines the AI's role, tone, and constraints. Written once, rarely changes per request. In code, it's a string constant or a template loaded from a config file.
Layer 2: Retrieved Knowledge via RAG (Dynamic, Per-Query)
This is the retrieval augmented generation layer. For each user query, you embed the query, search a vector store for relevant document chunks, and inject those chunks into the context. This grounds the model's responses in your specific data rather than its training set.
Layer 3: Tool/Function Call Results (Dynamic, Per-Turn)
When the model invokes tools (API calls, database lookups, calculations), the results flow back into the context for the next generation step. The AI SDK handles this with its tool calling primitives.
Layer 4: Conversation Memory (Session-Scoped)
The ongoing conversation history. The useChat hook manages this on the client side, and the route handler receives the full message array with each request.
Each layer maps to a concrete module in the T4 architecture. System instructions live in configuration. RAG lives in the embedding and vector search modules. Tool results come from tool definitions registered with the AI SDK. Conversation memory flows through the messages array. When you see the full route handler later in this article, you'll see these layers assembled into a single context window before being sent to the model; that handler registers no tools, so the tool-results layer stays empty there.
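To make the four layers concrete, here is a minimal sketch of context assembly as a pure function. The names used (ContextLayers, RetrievedChunk, ToolResult, assembleContext) are hypothetical and are not part of the AI SDK; the sketch only illustrates how the static instructions, retrieved chunks, tool results, and history could be combined before a model call.

```typescript
// Hypothetical types for illustration only; none of these come from the AI SDK.
type Role = 'system' | 'user' | 'assistant';
interface ChatMessage { role: Role; content: string }
interface RetrievedChunk { source: string; content: string }
interface ToolResult { tool: string; output: string }

interface ContextLayers {
  systemInstructions: string;        // Layer 1: static
  retrievedChunks: RetrievedChunk[]; // Layer 2: dynamic, per query
  toolResults: ToolResult[];         // Layer 3: dynamic, per turn
  history: ChatMessage[];            // Layer 4: session-scoped
}

// Combine layers 1-3 into one system string and pass layer 4 through as messages.
export function assembleContext(
  layers: ContextLayers
): { system: string; messages: ChatMessage[] } {
  const sections: string[] = [layers.systemInstructions];

  if (layers.retrievedChunks.length > 0) {
    const body = layers.retrievedChunks
      .map((chunk) => `[Source: ${chunk.source}]\n${chunk.content}`)
      .join('\n\n');
    sections.push(`## Retrieved Context\n${body}`);
  }

  if (layers.toolResults.length > 0) {
    const body = layers.toolResults
      .map((result) => `${result.tool}: ${result.output}`)
      .join('\n');
    sections.push(`## Tool Results\n${body}`);
  }

  return { system: sections.join('\n\n'), messages: layers.history };
}
```

Empty layers are skipped rather than rendered as empty headings, so the model never sees a section with nothing in it.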
Building Local RAG: Embeddings + Vector Search Without External APIs
Why Local RAG in 2025?
Running retrieval-augmented generation locally, with the embedding model and vector store both on your own infrastructure, has gone from a niche optimization to a pragmatic default for many applications.
Cost: A cloud embedding API charges per token. OpenAI's text-embedding-3-small costs $0.02 per million tokens. That sounds cheap until you're embedding a large corpus, re-embedding it whenever documents or chunking settings change, and embedding every incoming query on top of that. With a local model, there is no per-token fee: the marginal cost per embedding is effectively zero after initial setup.
Privacy: When you send document text to an external API for embedding, that data leaves your infrastructure. For applications handling sensitive documents (legal, medical, financial, internal company knowledge), that's a non-starter. Local RAG keeps everything on your server.
Latency: An API round-trip for embedding generation adds 100 to 300 milliseconds per query. A local model running on CPU can generate embeddings in under 50 milliseconds for a single short query, depending on hardware and model size. For interactive applications where the user is waiting, that difference is noticeable.
Quality: Local embedding models have improved dramatically. The MTEB (Massive Text Embedding Benchmark) leaderboard on Hugging Face tracks embedding model quality across dozens of tasks. Models like Nomic Embed and BGE variants now score competitively with commercial APIs across many benchmark categories. The gap has narrowed to the point where, for most retrieval tasks, local models are more than good enough.
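The cost point above comes down to simple arithmetic, so it can help to run the numbers for your own workload. The sketch below uses the $0.02 per million tokens price quoted above; the corpus size and query volume are made-up inputs, and embeddingCostUSD is a hypothetical helper.

```typescript
// Price quoted in the text for text-embedding-3-small; verify current pricing before relying on it.
const USD_PER_MILLION_TOKENS = 0.02;

// Linear per-token pricing: cost = tokens / 1,000,000 * price per million.
function embeddingCostUSD(
  tokens: number,
  usdPerMillion: number = USD_PER_MILLION_TOKENS
): number {
  return (tokens / 1_000_000) * usdPerMillion;
}

// Hypothetical workload, for illustration only.
const corpusTokens = 50_000_000;   // one full ingest of the corpus
const queriesPerMonth = 1_000_000; // user queries that each need an embedding
const tokensPerQuery = 30;

const ingestCost = embeddingCostUSD(corpusTokens);
const monthlyQueryCost = embeddingCostUSD(queriesPerMonth * tokensPerQuery);

console.log(`Ingest: $${ingestCost.toFixed(2)}`);                // Ingest: $1.00
console.log(`Queries/month: $${monthlyQueryCost.toFixed(2)}`);   // Queries/month: $0.60
```

The per-call fee scales linearly with tokens, so the figure that matters is how often the corpus is re-embedded and how many queries arrive, not the headline price.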
Choosing Your Local Embedding Model
Here's a practical comparison of popular local embedding models for the T4 Stack:
| Model | Dimensions | Size | Best For |
|---|---|---|---|
| nomic-embed-text-v1.5 | 768 (supports Matryoshka dimensions down to 64) | ~274 MB | Best quality/size tradeoff |
| BAAI/bge-base-en-v1.5 | 768 | ~438 MB | Mature, well-documented |
| all-MiniLM-L6-v2 | 384 | ~80 MB | Resource-constrained environments |
For most T4 Stack applications, Nomic Embed hits the sweet spot: strong retrieval quality, reasonable model size, and available in ONNX format for fast CPU inference in Node.js.
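The Matryoshka support mentioned for Nomic Embed means an embedding can be shortened by keeping only a prefix of the vector. Here is a minimal sketch, assuming plain truncation followed by L2 re-normalization; truncateEmbedding is a hypothetical helper, and the Nomic model card describes additional steps (such as a layer normalization before truncating) that are worth checking before relying on shortened vectors.

```typescript
// Keep the first `dims` components of an embedding and rescale the result to unit length.
// Assumption: prefix truncation plus L2 normalization; consult the model card for the
// model-specific preprocessing it recommends.
function truncateEmbedding(embedding: number[], dims: number): number[] {
  if (!Number.isInteger(dims) || dims <= 0 || dims > embedding.length) {
    throw new RangeError(`dims must be an integer between 1 and ${embedding.length}`);
  }
  const head = embedding.slice(0, dims);
  const norm = Math.sqrt(head.reduce((sum, x) => sum + x * x, 0));
  // An all-zero prefix cannot be normalized; return it unchanged.
  return norm === 0 ? head : head.map((x) => x / norm);
}

// Usage: shrink a 3-dimensional toy vector to 2 dimensions.
const shortened = truncateEmbedding([3, 4, 12], 2);
```

Shorter vectors cut storage and search time roughly in proportion to the dimension count, at some cost in retrieval quality; query and document embeddings must be truncated to the same length.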
You have two main options for running these locally:
- @huggingface/transformers (formerly @xenova/transformers): Runs ONNX models directly in Node.js. No external process, no GPU required. The simplest path for a Next.js application.
- Ollama: Runs models as a local service. More overhead to set up but supports a wider range of models and provides a consistent API.
Code Example 5: Generating embeddings locally using @huggingface/transformers
// lib/ai/embeddings.ts
// npm install @huggingface/transformers
import { pipeline, type FeatureExtractionPipeline } from '@huggingface/transformers';
let embeddingPipeline: FeatureExtractionPipeline | null = null;
async function getEmbeddingPipeline(): Promise<FeatureExtractionPipeline> {
if (!embeddingPipeline) {
embeddingPipeline = await pipeline(
'feature-extraction',
'Xenova/nomic-embed-text-v1.5',
{ dtype: 'q8' } // Quantized for faster CPU inference
);
}
return embeddingPipeline;
}
export async function embedText(text: string): Promise<number[]> {
const pipe = await getEmbeddingPipeline();
// Nomic Embed expects a task prefix for best results
const output = await pipe(`search_query: ${text}`, {
pooling: 'mean',
normalize: true,
});
return Array.from(output.data as Float32Array);
}
export async function embedDocuments(texts: string[]): Promise<number[][]> {
const pipe = await getEmbeddingPipeline();
const results: number[][] = [];
for (const text of texts) {
const output = await pipe(`search_document: ${text}`, {
pooling: 'mean',
normalize: true,
});
results.push(Array.from(output.data as Float32Array));
}
return results;
}
The singleton pattern for embeddingPipeline matters here. Loading the model takes a few seconds on first call, but subsequent calls reuse the loaded model and return in milliseconds. Note the different prefixes: search_query: for user queries and search_document: for corpus documents. Nomic Embed uses these prefixes to optimize the embedding space for retrieval.
Vector Storage: SQLite + Vector Extensions for Zero-Infra RAG
For the local vector database layer, you don't need Pinecone, Weaviate, or any managed service. SQLite with a vector search extension gives you cosine similarity search with zero infrastructure beyond a single file on disk.
Several options exist: sqlite-vss (by Alex Garcia), vectorlite, and for a pure JavaScript alternative, orama (which provides full-text and vector search in a single library). For this tutorial, we'll use a SQLite-based approach that keeps things simple and portable.
Code Example 6: Setting up SQLite with vector search
// lib/db/vectorStore.ts
// npm install better-sqlite3
// npm install -D @types/better-sqlite3
import Database from 'better-sqlite3';
import path from 'path';
import fs from 'fs';
const DATA_DIR = path.join(process.cwd(), 'data');
const DB_PATH = path.join(DATA_DIR, 'vectors.db');
const VECTOR_DIMENSIONS = 768; // Nomic Embed output dimensions
let db: ReturnType<typeof Database> | null = null;
export function getDatabase() {
if (!db) {
// Ensure the data directory exists
if (!fs.existsSync(DATA_DIR)) {
fs.mkdirSync(DATA_DIR, { recursive: true });
}
db = new Database(DB_PATH);
db.pragma('journal_mode = WAL');
// Create documents table
db.exec(`
CREATE TABLE IF NOT EXISTS documents (
id INTEGER PRIMARY KEY AUTOINCREMENT,
content TEXT NOT NULL,
source TEXT NOT NULL,
chunk_index INTEGER NOT NULL,
embedding BLOB NOT NULL
)
`);
// Index on source for filtering or deleting by document; it does not speed up the similarity scan
db.exec(`
CREATE INDEX IF NOT EXISTS idx_documents_source
ON documents(source)
`);
}
return db;
}
// Store a document chunk with its embedding
export function insertChunk(
content: string,
source: string,
chunkIndex: number,
embedding: number[]
) {
const db = getDatabase();
const stmt = db.prepare(`
INSERT INTO documents (content, source, chunk_index, embedding)
VALUES (?, ?, ?, ?)
`);
// Store embedding as a binary buffer for efficiency
const buffer = Buffer.from(new Float32Array(embedding).buffer);
stmt.run(content, source, chunkIndex, buffer);
}
// Cosine similarity search
export function searchSimilar(
queryEmbedding: number[],
topK: number = 5
): Array<{ content: string; source: string; score: number }> {
const db = getDatabase();
const rows = db
.prepare('SELECT id, content, source, embedding FROM documents')
.all() as Array<{
id: number;
content: string;
source: string;
embedding: Buffer;
}>;
// Compute cosine similarity in JS
const scored = rows.map((row) => {
const storedEmbedding = Array.from(
new Float32Array(
row.embedding.buffer,
row.embedding.byteOffset,
row.embedding.byteLength / Float32Array.BYTES_PER_ELEMENT
)
);
const score = cosineSimilarity(queryEmbedding, storedEmbedding);
return { content: row.content, source: row.source, score };
});
return scored
.sort((a, b) => b.score - a.score)
.slice(0, topK);
}
function cosineSimilarity(a: number[], b: number[]): number {
let dotProduct = 0;
let normA = 0;
let normB = 0;
for (let i = 0; i < a.length; i++) {
dotProduct += a[i] * b[i];
normA += a[i] * a[i];
normB += b[i] * b[i];
}
const denominator = Math.sqrt(normA) * Math.sqrt(normB);
if (denominator === 0) return 0;
return dotProduct / denominator;
}
This implementation stores embeddings as binary blobs in SQLite and computes cosine similarity in JavaScript. For corpora up to the low tens of thousands of chunks, this brute-force approach is usually fast enough for interactive use, though actual search latency depends on hardware and chunk count. For larger corpora, you'd swap in a proper vector index extension or migrate to a managed vector database. The beauty of this pattern is that you can start simple and scale later without touching the rest of the stack.
Fair warning: this brute-force approach loads all rows into memory for every search. It works well for small to medium corpora (up to low tens of thousands of chunks) but will become a bottleneck beyond that. For larger datasets, use a dedicated vector extension like sqlite-vec (the successor to sqlite-vss, also by Alex Garcia) or an external vector store.
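Before reaching for a vector extension, the brute-force scan itself can be made cheaper. The sketch below is one possible variation, not the code used in this tutorial: it assumes every stored vector is already unit-length (as with normalize: true in the embedding step), so a dot product equals cosine similarity, and it keeps only the best k candidates instead of sorting every row. The names dot, topK, and Scored are hypothetical.

```typescript
interface Scored { index: number; score: number }

// Dot product of two equal-length vectors; equals cosine similarity for unit-length inputs.
function dot(a: Float32Array, b: Float32Array): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Scan all rows once, keeping a small list of the k best scores (highest first).
function topK(query: Float32Array, rows: Float32Array[], k: number): Scored[] {
  if (k <= 0) return [];
  const best: Scored[] = [];
  rows.forEach((row, index) => {
    const score = dot(query, row);
    if (best.length < k || score > best[best.length - 1].score) {
      // Insert at the first position whose score is lower, keeping the list sorted.
      let pos = best.findIndex((entry) => score > entry.score);
      if (pos === -1) pos = best.length;
      best.splice(pos, 0, { index, score });
      if (best.length > k) best.pop(); // drop the weakest candidate
    }
  });
  return best;
}

// Usage with toy 2-dimensional unit vectors.
const hits = topK(
  new Float32Array([1, 0]),
  [new Float32Array([0, 1]), new Float32Array([1, 0]), new Float32Array([0.6, 0.8])],
  2
);
```

Working on Float32Array views directly also avoids copying each stored embedding into a plain number array per search, which is where much of the per-row overhead in the simple version goes.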
The Ingestion Pipeline: Chunking, Embedding, Storing
Before you can search, you need to get your documents into the vector store. The ingestion pipeline reads source documents, splits them into chunks, generates embeddings for each chunk, and stores everything.
Code Example 7: Complete ingestion script
// scripts/ingest.ts
// Run with: npx tsx scripts/ingest.ts
// Requires a 'docs' directory at the project root with .md files
import fs from 'fs';
import path from 'path';
import { embedDocuments } from '../lib/ai/embeddings';
import { insertChunk, getDatabase } from '../lib/db/vectorStore';
const DOCS_DIR = path.join(process.cwd(), 'docs');
const CHUNK_SIZE = 500; // characters
const CHUNK_OVERLAP = 100; // characters
function chunkText(text: string): string[] {
const chunks: string[] = [];
let start = 0;
while (start < text.length) {
const end = Math.min(start + CHUNK_SIZE, text.length);
const chunk = text.slice(start, end).trim();
if (chunk.length > 50) {
// Skip very small trailing chunks
chunks.push(chunk);
}
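// Stop once the final window is emitted; otherwise the tail is re-emitted as a redundant overlap-only chunk
if (end === text.length) break;
start += CHUNK_SIZE - CHUNK_OVERLAP;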
}
return chunks;
}
async function ingest() {
console.log('Starting ingestion pipeline...');
// Ensure docs directory exists
if (!fs.existsSync(DOCS_DIR)) {
console.error(`Docs directory not found: ${DOCS_DIR}`);
console.error('Create a "docs" directory and add .md files to ingest.');
process.exit(1);
}
// Ensure DB is initialized
getDatabase();
// Read all markdown files from docs directory
const files = fs
.readdirSync(DOCS_DIR)
.filter((f) => f.endsWith('.md'));
if (files.length === 0) {
console.error('No .md files found in docs directory.');
process.exit(1);
}
console.log(`Found ${files.length} documents to process.`);
for (const file of files) {
const filePath = path.join(DOCS_DIR, file);
const content = fs.readFileSync(filePath, 'utf-8');
const chunks = chunkText(content);
console.log(` ${file}: ${chunks.length} chunks`);
// Generate embeddings for all chunks in this file
const embeddings = await embedDocuments(chunks);
// Store each chunk with its embedding
for (let i = 0; i < chunks.length; i++) {
insertChunk(chunks[i], file, i, embeddings[i]);
}
}
console.log('Ingestion complete.');
}
ingest().catch(console.error);
Run this script with npx tsx scripts/ingest.ts after placing your markdown files in the docs/ directory. The chunking strategy here uses fixed-size windows with overlap, which ensures information at chunk boundaries isn't lost. For production use, you might consider semantic chunking (splitting on paragraph or section boundaries), but fixed-size chunking is a reliable baseline.
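For comparison with the fixed-size baseline, here is a minimal sketch of the paragraph-boundary chunking mentioned above. It assumes paragraphs are separated by blank lines and packs whole paragraphs into chunks up to a character budget; chunkByParagraph is a hypothetical helper, and real semantic chunking may also look at headings or sentence boundaries.

```typescript
// Split on blank lines, then pack whole paragraphs into chunks of at most maxChars.
function chunkByParagraph(text: string, maxChars: number = 500): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = '';
  for (const paragraph of paragraphs) {
    const candidate = current ? `${current}\n\n${paragraph}` : paragraph;
    if (candidate.length <= maxChars) {
      current = candidate; // paragraph still fits in the current chunk
    } else {
      if (current) chunks.push(current);
      // A single paragraph longer than maxChars is kept whole as an oversized chunk.
      current = paragraph;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}

// Usage: three short paragraphs with a 10-character budget.
const paragraphChunks = chunkByParagraph('aaaa\n\nbbbb\n\ncccc', 10);
```

This version never cuts a paragraph in half, at the price of uneven chunk sizes and no overlap between neighbors.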
Wiring It All Together: The Full T4 RAG Chat
Architecture Diagram
Here's the complete request flow for a RAG-enhanced chat in the T4 Stack:
┌──────────┐ ┌──────────────────────────────────────┐
│ Client │ │ Next.js 16 Server │
│ │ │ │
│ useChat() ├────►│ POST /api/chat │
│ │ │ │ │
│ Stream │ │ ├─ 1. Extract latest user message │
│ Display │◄────│ ├─ 2. Embed query (local model) │
│ │ │ ├─ 3. Vector search (SQLite) │
│ Source │ │ ├─ 4. Assemble context window │
│ Citations│ │ │ ├─ System prompt │
│ │ │ │ ├─ Retrieved chunks │
│ │ │ │ └─ Conversation history │
│ │ │ └─ 5. streamText → LLM response │
└──────────┘ └──────────────────────────────────────┘
The RAG Route Handler
This is the centerpiece of the T4 Stack. It brings together the embedding model, vector store, context assembly, and streaming response into a single route handler.
Code Example 8: Complete RAG route handler
// app/api/chat/route.ts
import { streamText, type Message } from 'ai';
import { chatModel } from '@/lib/ai/providers';
import { embedText } from '@/lib/ai/embeddings';
import { searchSimilar } from '@/lib/db/vectorStore';
const SYSTEM_PROMPT = `You are a knowledgeable assistant that answers questions
based on the provided documentation. Use the retrieved context to give accurate,
grounded answers. If the context does not contain enough information to answer
the question, say so honestly. Always reference which source documents you
used in your answer.`;
export async function POST(request: Request) {
const { messages }: { messages: Message[] } = await request.json();
// 1. Get the latest user message for retrieval
const lastUserMessage = messages
.filter((m) => m.role === 'user')
.pop();
if (!lastUserMessage) {
return new Response('No user message found', { status: 400 });
}
// 2. Embed the user's query locally
const queryEmbedding = await embedText(lastUserMessage.content);
// 3. Search for relevant document chunks
const relevantChunks = searchSimilar(queryEmbedding, 5);
// 4. Build the context block from retrieved chunks
const contextBlock = relevantChunks
.map(
(chunk, _i) =>
`[Source: ${chunk.source} | Relevance: ${chunk.score.toFixed(3)}]\n${chunk.content}`
)
.join('\n\n---\n\n');
// 5. Assemble the full system prompt with retrieved context
const augmentedSystemPrompt = `${SYSTEM_PROMPT}
## Retrieved Context
The following document excerpts are relevant to the user's question:
${contextBlock}
## Instructions
Base your answer on the retrieved context above. Cite the source filenames
when referencing specific information.`;
// 6. Stream the response
const result = streamText({
model: chatModel,
system: augmentedSystemPrompt,
messages,
});
return result.toDataStreamResponse();
}
That's roughly 50 lines of meaningful code, and it implements the complete RAG pattern: embed the query, retrieve relevant documents, inject them into the context, and stream the response. Three of the four context layers are present: the static system prompt, the dynamically retrieved chunks (with source metadata for grounding), and the conversation history (via messages). The tool-results layer is the one piece this handler leaves out, since it registers no tools.
The Chat UI Component with Source Attribution
The client component needs a small enhancement over the basic useChat example: it should display which sources the model used in its response. One approach is to pass the retrieved sources as metadata alongside the streamed response. A simpler approach, used here, is to instruct the model to cite sources inline and render them.
Code Example 9: Chat UI with source attribution
// components/RAGChat.tsx
'use client';
import { useChat } from '@ai-sdk/react';
export function RAGChat() {
const { messages, input, handleInputChange, handleSubmit, isLoading, error } =
useChat({
api: '/api/chat',
});
return (
<div className="flex flex-col h-screen max-w-3xl mx-auto p-4">
<header className="pb-4 border-b mb-4">
<h1 className="text-xl font-bold">T4 Stack RAG Chat</h1>
<p className="text-sm text-gray-500">
Answers grounded in your local documentation
</p>
</header>
<div className="flex-1 overflow-y-auto space-y-4">
{messages.map((message) => (
<div
key={message.id}
className={`p-4 rounded-lg ${
message.role === 'user'
? 'bg-blue-50 ml-auto max-w-lg'
: 'bg-white border mr-auto max-w-2xl'
}`}
>
<div className="flex items-center gap-2 mb-2">
<span className="text-xs font-semibold uppercase text-gray-500">
{message.role === 'user' ? 'You' : 'Assistant'}
</span>
</div>
<div className="text-sm whitespace-pre-wrap leading-relaxed">
{message.content}
</div>
{/* Highlight source citations in assistant messages */}
{message.role === 'assistant' && (
<SourceHighlights content={message.content} />
)}
</div>
))}
{isLoading && (
<div className="border mr-auto max-w-2xl p-4 rounded-lg">
<div className="flex items-center gap-2">
<div className="w-2 h-2 bg-blue-500 rounded-full animate-bounce" />
<div
className="w-2 h-2 bg-blue-500 rounded-full animate-bounce"
style={{ animationDelay: '0.1s' }}
/>
<div
className="w-2 h-2 bg-blue-500 rounded-full animate-bounce"
style={{ animationDelay: '0.2s' }}
/>
</div>
</div>
)}
</div>
{error && (
<div className="p-3 mb-2 bg-red-50 text-red-700 text-sm rounded-lg border border-red-200">
Something went wrong: {error.message}
</div>
)}
<form onSubmit={handleSubmit} className="flex gap-2 pt-4 border-t mt-4">
<input
value={input}
onChange={handleInputChange}
placeholder="Ask a question about your docs..."
className="flex-1 p-3 border rounded-lg focus:outline-none focus:ring-2 focus:ring-blue-500"
disabled={isLoading}
/>
<button
type="submit"
disabled={isLoading || !input.trim()}
className="px-6 py-3 bg-blue-600 text-white rounded-lg font-medium
disabled:opacity-50 hover:bg-blue-700 transition-colors"
>
Send
</button>
</form>
</div>
);
}
// Extract and display source references from the model's response
function SourceHighlights({ content }: { content: string }) {
// Match patterns like [Source: filename.md]
const sourcePattern = /\[Source:\s*([^\]]+)\]/g;
const sources = new Set<string>();
let match;
while ((match = sourcePattern.exec(content)) !== null) {
sources.add(match[1].trim());
}
if (sources.size === 0) return null;
return (
<div className="mt-3 pt-3 border-t border-gray-100">
<p className="text-xs font-semibold text-gray-400 mb-1">Sources used:</p>
<div className="flex flex-wrap gap-1">
{Array.from(sources).map((source) => (
<span
key={source}
className="text-xs bg-blue-50 text-blue-700 px-2 py-0.5 rounded"
>
{source}
</span>
))}
</div>
</div>
);
}
The SourceHighlights component parses the model's response for source citations (which the system prompt instructs it to include) and renders them as tags below the response. This gives users transparency into what information grounded the AI's answer, which is one of the primary benefits of RAG: traceable, auditable responses.
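Because the citation parsing is plain string handling, it can also live in a pure function that is unit-tested separately from React. The following is a minimal sketch, not part of the original component: the name `extractSources` is hypothetical, and it assumes the model follows the `[Source: filename.md]` format that the system prompt asks for.

```typescript
// Hypothetical helper: collects unique source names from inline citations.
// Assumes citations look like "[Source: filename.md]".
function extractSources(content: string): string[] {
  // Created per call, so the global flag's lastIndex state never leaks between calls.
  const sourcePattern = /\[Source:\s*([^\]]+)\]/g;
  const sources = new Set<string>();
  let match: RegExpExecArray | null;
  while ((match = sourcePattern.exec(content)) !== null) {
    sources.add(match[1].trim());
  }
  // A Set preserves insertion order, so sources appear in first-cited order.
  return Array.from(sources);
}
```

If the model drifts from the expected citation format, this returns an empty array, so the UI simply shows no source tags rather than failing.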
Making It "Senior": Production Patterns
A working prototype is not a production system. Here are the patterns that separate a demo from an AI stack a senior developer would put into production.
Structured Output with generateObject for Reliable Parsing
When you need the AI to return data, not prose, generateObject with a Zod schema is the right tool. This pattern eliminates an entire class of bugs: malformed JSON responses, missing fields, type mismatches, and the retry logic that comes with all of them.
Code Example 10: generateObject with a Zod schema
// app/api/extract/route.ts
import { generateObject } from 'ai';
import { z } from 'zod';
import { chatModel } from '@/lib/ai/providers';
const ExtractionSchema = z.object({
intent: z.enum(['question', 'command', 'feedback', 'other']),
entities: z.array(
z.object({
name: z.string(),
type: z.enum(['person', 'product', 'topic', 'date']),
})
),
urgency: z.enum(['low', 'medium', 'high']),
summary: z.string().max(200),
});
type Extraction = z.infer<typeof ExtractionSchema>;
export async function POST(request: Request) {
const { text }: { text: string } = await request.json();
const { object } = await generateObject({
model: chatModel,
schema: ExtractionSchema,
prompt: `Analyze the following user message and extract structured information:\n\n"${text}"`,
});
// `object` is typed as Extraction and already validated against the schema
// No manual JSON.parse or regex; generateObject throws if validation fails
return Response.json(object);
}
The returned object is typed as Extraction at compile time and validated at runtime. If the model can't produce a conforming response (rare with well-designed schemas and capable models), the SDK throws a structured error you can handle, rather than silently returning garbage. Worth noting: structured output reliability varies by model. Smaller or local models may struggle with complex schemas compared to GPT-4o or Claude, which have strong instruction-following capabilities.
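One way to handle that thrown error is to wrap the call in a small retry helper that returns a result object instead of throwing. This is a sketch under stated assumptions: `attemptWithRetry` and `ExtractionAttempt` are hypothetical names, the helper knows nothing about the SDK's specific error classes, and it treats every thrown error as retryable, which may not suit your application.

```typescript
// Hypothetical result type: either a value or the last error message.
type ExtractionAttempt<T> =
  | { ok: true; value: T }
  | { ok: false; message: string };

// Hypothetical wrapper: runs an async producer and retries when it throws.
async function attemptWithRetry<T>(
  produce: () => Promise<T>,
  maxAttempts: number = 2,
): Promise<ExtractionAttempt<T>> {
  let lastMessage = 'unknown error';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return { ok: true, value: await produce() };
    } catch (err) {
      // Remember the most recent failure so the caller can log or report it.
      lastMessage = err instanceof Error ? err.message : String(err);
    }
  }
  return { ok: false, message: lastMessage };
}
```

In the route handler, `produce` would wrap the generateObject call, and a failed result could be mapped to an error response instead of an unhandled exception.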
Guardrails: Input Validation and Output Filtering
Production AI endpoints need defensive layers:
Input validation: Sanitize user input before embedding it. Trim surrounding whitespace, truncate to a maximum length (embedding models have token limits), strip control characters, and reject obviously malicious inputs. A simple helper covers the first three:
function sanitizeInput(text: string): string {
return text
.trim()
.slice(0, 2000) // Max input length
.replace(/[\x00-\x08\x0B\x0C\x0E-\x1F]/g, ''); // Strip control characters
}
Output filtering: The model might generate content that violates your application's policies. Check for blocked patterns, PII leakage, or instructions leaking from the system prompt. For sensitive applications, add a post-generation check using a smaller, faster model or regex-based rules.
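A regex-based output check might look like the sketch below. The patterns are illustrative assumptions, not a complete PII policy, and `checkModelOutput` and `BLOCKED_PATTERNS` are hypothetical names. The prompt-leak check is deliberately crude: it only catches a verbatim copy of the start of the system prompt.

```typescript
interface OutputCheck {
  allowed: boolean;
  reasons: string[];
}

// Illustrative rules only; real policies need patterns tuned to the application.
const BLOCKED_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: 'email address', pattern: /[\w.+-]+@[\w-]+\.[\w.-]+/ },
  { label: 'US SSN-like number', pattern: /\b\d{3}-\d{2}-\d{4}\b/ },
];

function checkModelOutput(output: string, systemPrompt: string): OutputCheck {
  const reasons: string[] = [];
  for (const { label, pattern } of BLOCKED_PATTERNS) {
    // No global flag on the patterns, so test() carries no state between calls.
    if (pattern.test(output)) reasons.push(label);
  }
  // Crude leak check: flag a long verbatim slice of the system prompt.
  const probe = systemPrompt.trim().slice(0, 80);
  if (probe.length >= 40 && output.includes(probe)) {
    reasons.push('system prompt leak');
  }
  return { allowed: reasons.length === 0, reasons };
}
```

Running this on the fully generated text is simplest. Applying it to a stream means buffering, since a pattern can span two chunks.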
Rate limiting: AI endpoints are expensive (in compute, if not in API costs). Apply rate limiting per user session, per IP, or per API key. Next.js middleware is a natural place for this.
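As a rough illustration of the idea, here is a fixed-window limiter kept in process memory. `FixedWindowLimiter` is a hypothetical class, not a Next.js or AI SDK API, and it assumes a single long-lived Node.js process with `maxRequests` of at least 1. On serverless or multi-instance deployments the counters would need a shared store such as Redis.

```typescript
interface WindowState {
  windowStart: number;
  count: number;
}

// Hypothetical fixed-window limiter keyed by session ID, IP, or API key.
class FixedWindowLimiter {
  private readonly windows = new Map<string, WindowState>();
  private readonly maxRequests: number;
  private readonly windowMs: number;

  constructor(maxRequests: number, windowMs: number) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  // Returns true if the request is allowed; `now` is injectable for testing.
  allow(key: string, now: number = Date.now()): boolean {
    const state = this.windows.get(key);
    if (!state || now - state.windowStart >= this.windowMs) {
      // First request for this key, or the previous window has expired.
      this.windows.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (state.count >= this.maxRequests) return false;
    state.count += 1;
    return true;
  }
}
```

A handler would call `allow(key)` before doing any embedding or generation work and return a 429 status when it comes back false.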
Caching Embeddings and Responses
Embedding the same query twice produces the same vector. Cache it.
const embeddingCache = new Map<string, number[]>();
async function cachedEmbed(text: string): Promise<number[]> {
// Case-insensitive key: assumes an uncased embedding model; drop toLowerCase() for cased models
const cacheKey = text.trim().toLowerCase();
if (embeddingCache.has(cacheKey)) {
return embeddingCache.get(cacheKey)!;
}
const embedding = await embedText(text);
embeddingCache.set(cacheKey, embedding);
return embedding;
}
This in-memory cache resets on every server restart and will grow without bound. For production, consider an LRU cache with a size limit, or an external cache like Redis.
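A bounded replacement can be built on the fact that a JavaScript Map iterates in insertion order. The sketch below is a minimal LRU cache under that assumption; `LruCache` is a hypothetical class, not a library API, and it has no TTL or persistence.

```typescript
// Minimal LRU cache built on Map's insertion order (oldest entry first).
class LruCache<V> {
  private readonly entries = new Map<string, V>();
  private readonly maxSize: number;

  constructor(maxSize: number) {
    this.maxSize = maxSize;
  }

  get(key: string): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert to mark this key as most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: string, value: V): void {
    if (this.entries.has(key)) this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxSize) {
      // The first key in iteration order is the least recently used.
      const oldest = this.entries.keys().next().value as string;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Swapping this in for the plain Map keeps the same get/set shape while capping memory at `maxSize` embeddings.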
For deterministic AI responses (like structured extraction where temperature is 0), you can also cache the full response using Next.js caching patterns or a simple LRU cache keyed on the input hash.
Error Boundaries for AI Components
LLM streams fail. The model might time out, the connection might drop mid-stream, or the provider might return a 500. Wrapping your AI UI components in React error boundaries prevents these failures from crashing the entire page:
'use client';
import { Component, type ReactNode } from 'react';
interface Props {
children: ReactNode;
fallback: ReactNode;
}
interface State {
hasError: boolean;
}
export class AIErrorBoundary extends Component<Props, State> {
constructor(props: Props) {
super(props);
this.state = { hasError: false };
}
static getDerivedStateFromError(): State {
return { hasError: true };
}
render() {
if (this.state.hasError) {
return this.props.fallback;
}
return this.props.children;
}
}
Use it to wrap the chat component:
<AIErrorBoundary fallback={<p>Chat is temporarily unavailable. Please try again.</p>}>
<RAGChat />
</AIErrorBoundary>
Graceful degradation is a hallmark of senior engineering. Your application should still work when the AI layer fails. That might mean showing cached responses, falling back to traditional search, or simply displaying an honest error message.
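On the server side, falling back to traditional search could be sketched as follows. All names here (`DocChunk`, `keywordSearch`, `answerWithFallback`, and the `askModel` callback) are hypothetical, and the keyword scoring is intentionally naive: it only counts how many query terms appear in each chunk.

```typescript
interface DocChunk {
  source: string;
  text: string;
}

// Naive keyword scoring: counts how many query terms appear in each chunk.
function keywordSearch(query: string, chunks: DocChunk[], limit: number = 3): DocChunk[] {
  const terms = query.toLowerCase().split(/\W+/).filter((t) => t.length > 2);
  return chunks
    .map((chunk) => {
      const haystack = chunk.text.toLowerCase();
      const score = terms.filter((t) => haystack.includes(t)).length;
      return { chunk, score };
    })
    .filter((entry) => entry.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, limit)
    .map((entry) => entry.chunk);
}

type Answer =
  | { kind: 'ai'; text: string }
  | { kind: 'fallback'; results: DocChunk[] };

// Tries the model first; on any failure, returns plain search results instead.
async function answerWithFallback(
  query: string,
  chunks: DocChunk[],
  askModel: (q: string) => Promise<string>,
): Promise<Answer> {
  try {
    return { kind: 'ai', text: await askModel(query) };
  } catch {
    return { kind: 'fallback', results: keywordSearch(query, chunks) };
  }
}
```

The discriminated `kind` field lets the UI render a list of matching documents when the model is unavailable, instead of an empty chat bubble.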
Performance and Cost Comparison
T4 Stack vs. Traditional API-Only AI Architecture
Here's how the local RAG approach in the T4 Stack compares to the more common pattern of calling external APIs for everything:
| Dimension | T4 Stack (Local RAG) | API-Only Architecture |
|---|---|---|
| Embedding cost | $0 (local model) | ~$0.02 per million tokens (OpenAI) |
| Embedding latency | Under 50ms (warm model, single short query, varies by hardware) | 100-300ms (network round-trip) |
| Infrastructure | SQLite file + Node.js process | Managed vector DB + API keys + network |
| Data privacy | Data stays on your server | Text sent to third-party for embedding |
| Setup complexity | Higher initial (model download, DB setup) | Lower initial (just API keys) |
| Scalability ceiling | Thousands to low tens of thousands of docs | Virtually unlimited with managed services |
| Operational overhead | Minimal (single file DB, no service to manage) | Moderate (vector DB monitoring, API quotas) |
The latency numbers are rough estimates based on typical developer experience rather than published benchmarks. Actual numbers will vary with hardware, model size, and network conditions. The API cost scales linearly with token volume: for a prototype or small application it is negligible. At sustained high volume (bulk ingestion, frequent re-indexing, or heavy query traffic), the per-token charges add up and local models start to save real money.
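To put numbers on the cost side, the arithmetic can be wrapped in a small helper. This is a back-of-the-envelope sketch: the $0.02 per million tokens figure is the one from the table above, the function name and the 30-day month are assumptions, and current provider pricing should be checked before relying on the result.

```typescript
// Assumed price, taken from the comparison table; verify against current pricing.
const PRICE_PER_MILLION_TOKENS_USD = 0.02;

// Estimates the monthly API bill for embedding calls.
function monthlyEmbeddingCostUsd(
  callsPerDay: number,
  avgTokensPerCall: number,
  daysPerMonth: number = 30,
): number {
  const tokensPerMonth = callsPerDay * avgTokensPerCall * daysPerMonth;
  return (tokensPerMonth / 1e6) * PRICE_PER_MILLION_TOKENS_USD;
}
```

For example, 5,000 calls per day at 500 tokens each is 75 million tokens per month, which comes to about $1.50 under these assumptions. Plugging in your own ingestion and query volumes shows where, if anywhere, the API bill becomes significant.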
When to Stay Local vs. When to Go Managed
The decision framework is straightforward:
Stay local when:
- Your corpus is under 50,000 document chunks
- Data privacy requirements prevent sending content to external APIs
- You want zero marginal cost for experimentation
- You're building a single-tenant application or developer tool
- You need low embedding latency for interactive features
Migrate to managed services when:
- Your corpus exceeds what SQLite can search efficiently (hundreds of thousands of vectors)
- You need multi-tenant isolation with separate indexes per customer
- You're running at a scale where managing infrastructure is worth the cost
- You need advanced features like hybrid search, metadata filtering, or automatic reindexing
The graduation path is smooth. Because the AI SDK abstracts the model provider, swapping from a local Ollama embedding model to OpenAI's embedding API means changing one line in your provider config. Swapping from SQLite to Pinecone means changing the vector store module. The rest of the application — the route handlers, the context assembly logic, the client components — stays the same. That's the payoff of building on abstractions.
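The vector store swap is easiest when the rest of the application depends on a small interface rather than on SQLite directly. The sketch below shows one possible shape. `VectorStore`, `SearchHit`, and `InMemoryVectorStore` are hypothetical names, not the modules used earlier in this article, and the in-memory class stands in for whichever backend you actually use. It assumes all vectors have the same length.

```typescript
interface SearchHit {
  id: string;
  score: number;
}

// Hypothetical contract the route handlers would depend on.
// Methods are async so a remote, managed backend can implement it too.
interface VectorStore {
  upsert(id: string, embedding: number[]): Promise<void>;
  search(query: number[], topK: number): Promise<SearchHit[]>;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Stand-in implementation: brute-force cosine search over a Map.
class InMemoryVectorStore implements VectorStore {
  private readonly vectors = new Map<string, number[]>();

  async upsert(id: string, embedding: number[]): Promise<void> {
    this.vectors.set(id, embedding);
  }

  async search(query: number[], topK: number): Promise<SearchHit[]> {
    const hits: SearchHit[] = [];
    this.vectors.forEach((embedding, id) => {
      hits.push({ id, score: cosineSimilarity(query, embedding) });
    });
    return hits.sort((x, y) => y.score - x.score).slice(0, topK);
  }
}
```

A SQLite-backed or managed-service class would implement the same two methods, so the calling code only changes where the store is constructed.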
What's Next for the T4 Stack
MCP (Model Context Protocol) Integration
Anthropic's Model Context Protocol is emerging as a standard for how AI applications connect to external tools and data sources. The Vercel AI SDK has been adding MCP support, which could become the standard tool layer in the T4 Stack. Instead of hand-coding tool definitions for each data source, you'd connect to MCP servers that expose tools and resources through a standardized protocol. Think of it as a typed, discoverable interface between your application and its capabilities.
Edge RAG with Next.js Middleware
An ambitious frontier: running lightweight vector search at the edge, closer to users. This would require WebAssembly-compatible embedding models and edge-compatible storage. The pieces are coming together (ONNX models compile to WASM, edge runtimes support more APIs each year), but the stack isn't production-ready for this pattern yet. When it is, you could embed queries and retrieve context at the edge, then route to the LLM from there, cutting latency further.
The T4 Community and Getting Involved
The T3 Stack succeeded because of its community: thousands of developers contributing, filing issues, sharing patterns, and building on top of the scaffold. The T4 pattern needs the same energy. A create-t4-app CLI that scaffolds the full stack described in this article (Next.js 16 + AI SDK + local embedding model + SQLite vector store) would lower the barrier to entry dramatically. The open questions for the community to figure out: standard patterns for context window management, embedding model selection heuristics, and migration paths from local to managed infrastructure.
This Is the Stack Senior Engineers Are Shipping
The T4 Stack isn't a product. It's a pattern: Next.js 16 + TypeScript + tRPC + Tailwind + Vercel AI SDK + Local RAG. It takes the type-safe, full-stack philosophy of T3 and extends it with an intelligence layer that's grounded, private, and fast.
The core insight from this entire architecture: context engineering is a systems design problem, not a prompt-writing problem. Your system prompt is one input. Your retrieved documents, your tool results, your conversation history, your structured output schemas — all of it is part of the context you engineer. The code modules that assemble that context are the real product.
If you want to build this yourself: start with a fresh Next.js 16 application, install the AI SDK (ai and @ai-sdk/react), add @huggingface/transformers for local embeddings, set up SQLite with better-sqlite3, and follow the code examples above. Drop your own markdown documentation into the docs/ folder, run the ingestion script, and start asking questions. You'll have a working, local, streaming RAG application in under an hour.
That's the T4 Stack. Type-safe. Grounded. Local-first. And ready to ship.


