RAG Architecture Patterns: Design Decisions That Actually Matter
There's a pattern I've noticed with teams building LLM-powered applications: they rush to implement RAG (Retrieval-Augmented Generation) without understanding the architectural decisions that determine success or failure. The result is systems that are slow, expensive, and return irrelevant results.
RAG isn't complicated in concept - retrieve relevant context, augment your prompt, generate a response. But the devil is in the details. This post covers the design decisions that actually matter.
What RAG Solves (And What It Doesn't)
Before diving into architecture, let's be clear about what RAG is for.
RAG solves:
- Knowledge cutoff limitations (your LLM doesn't know about your internal docs)
- Hallucination reduction (grounding responses in actual documents)
- Domain-specific accuracy (legal, medical, proprietary information)
- Dynamic knowledge (information that changes frequently)
RAG doesn't solve:
- Poor data quality (garbage in, garbage out)
- Reasoning limitations of the underlying model
- Tasks requiring real-time computation or actions
- Problems better solved by fine-tuning or traditional search
If your documents are poorly written or contradictory, RAG will faithfully retrieve and amplify that confusion. Start with data quality.
The Core Architecture
A RAG system has two phases: indexing (offline) and retrieval + generation (runtime).
INDEXING PIPELINE
─────────────────────────────────────────────────────────
Documents → Chunking → Embedding → Vector Database
    ↓                      ↓
Metadata             Embedding Model
Extraction           (OpenAI, Cohere, etc.)

QUERY PIPELINE
─────────────────────────────────────────────────────────
Query → Embedding → Vector Search → Reranking → LLM
                          ↓             ↓        ↓
                   Top-K Results  Filtered   Generated
                                  Context    Response
Each component involves trade-offs. Let's examine them.
Decision 1: Chunking Strategy
Chunking is where most RAG implementations go wrong first. Your chunking strategy directly impacts retrieval quality.
The Trade-off
| Chunk Size | Pros | Cons |
|---|---|---|
| Small (100-200 tokens) | Precise retrieval, less noise | Loses context, more chunks to search |
| Large (1000+ tokens) | Preserves context, relationships | Dilutes relevance signal, slower |
| Medium (300-500 tokens) | Balanced | May still miss optimal boundaries |
Beyond Fixed-Size Chunking
Fixed-size chunking ignores document structure. Consider these alternatives:
Semantic chunking: Split on topic boundaries rather than token counts. Use an LLM or topic model to identify natural breakpoints.
Recursive chunking: Start with large chunks, recursively split only when they exceed your threshold. Preserves structure better than fixed-size.
Document-aware chunking: Respect document structure - don't split mid-paragraph, keep headers with their content, preserve code blocks.
Hierarchical chunking: Store both large and small chunks. Retrieve small chunks for precision, expand to parent chunks for context.
My Recommendation
Start with recursive chunking at 400-600 tokens with 50-100 token overlap. Use document-aware splitting to respect natural boundaries. Then measure retrieval quality and adjust. There's no universal optimal size - it depends on your content and query patterns.
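To make the recommendation concrete, here is a minimal sketch of recursive chunking with overlap. It is illustrative rather than production-ready: the `countTokens` helper approximates tokens by counting whitespace-separated words (an assumption, swap in your model's tokenizer), and the separator list and function names are my own.

```typescript
// Rough token estimate: whitespace-separated words (assumption, not a real tokenizer).
const countTokens = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

// Separators tried in order, from coarsest (paragraphs) to finest (words).
const SEPARATORS = ['\n\n', '\n', '. ', ' '];

function recursiveSplit(text: string, maxTokens: number, level = 0): string[] {
  if (countTokens(text) <= maxTokens || level >= SEPARATORS.length) {
    return text.trim() ? [text.trim()] : [];
  }
  const sep = SEPARATORS[level] ?? ' ';
  const pieces = text.split(sep).filter((p) => p.trim().length > 0);
  const chunks: string[] = [];
  let current = '';
  for (const piece of pieces) {
    const candidate = current ? current + sep + piece : piece;
    if (countTokens(candidate) <= maxTokens) {
      current = candidate; // keep packing pieces into the current chunk
      continue;
    }
    if (current) chunks.push(current.trim());
    current = '';
    if (countTokens(piece) <= maxTokens) {
      current = piece; // start a new chunk with this piece
    } else {
      // A single piece is still too large: split it with the next finer separator.
      chunks.push(...recursiveSplit(piece, maxTokens, level + 1));
    }
  }
  if (current) chunks.push(current.trim());
  return chunks;
}

// Prepend the last `overlapTokens` words of the previous chunk to each chunk.
// Note: chunks can then exceed maxTokens by up to overlapTokens.
function addOverlap(chunks: string[], overlapTokens: number): string[] {
  return chunks.map((chunk, i) => {
    if (i === 0 || overlapTokens <= 0) return chunk;
    const prevWords = (chunks[i - 1] ?? '').split(/\s+/).filter(Boolean);
    return prevWords.slice(-overlapTokens).join(' ') + ' ' + chunk;
  });
}
```

Calling `addOverlap(recursiveSplit(documentText, 500), 75)` would land in the range suggested above. Paragraph boundaries win whenever a paragraph fits, and finer separators only kick in for oversized pieces.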
Decision 2: Embedding Model Selection
Your embedding model determines how well semantic similarity maps to actual relevance.
Key Considerations
Dimensionality: Higher-dimensional embeddings (for example, 1536 for OpenAI's text-embedding-3-small, 1024 for Cohere's Embed v3 models) capture more nuance but increase storage and search costs. For most applications, 768-1536 dimensions are sufficient.
Domain alignment: General-purpose embeddings (OpenAI, Cohere) work well for broad content. For specialized domains (legal, medical, code), consider domain-specific models or fine-tuning.
Multilingual needs: If your content spans languages, choose models trained for multilingual understanding. Don't assume English-optimized models transfer well.
Cost vs. quality: OpenAI's embeddings are convenient but add up at scale. Open-source models (sentence-transformers, Instructor) can match quality at a fraction of the cost for self-hosted deployments.
AWS Bedrock Embeddings
If you're already in the AWS ecosystem, Bedrock provides solid embedding options without managing infrastructure:
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const bedrock = new BedrockRuntimeClient({ region: 'us-east-1' });

async function getEmbedding(text: string): Promise<number[]> {
  const response = await bedrock.send(new InvokeModelCommand({
    modelId: 'amazon.titan-embed-text-v2:0',
    body: JSON.stringify({
      inputText: text,
      dimensions: 1024, // Titan v2 supports 256, 512, or 1024
      normalize: true,
    }),
  }));

  const result = JSON.parse(new TextDecoder().decode(response.body));
  return result.embedding;
}
Amazon Titan Text Embeddings V2 offers configurable dimensions (256/512/1024) - useful for trading off precision vs. storage costs. For multilingual content, Cohere Embed Multilingual on Bedrock supports 100+ languages.
The advantage of Bedrock: unified billing, no API key management, and integration with other AWS services. The trade-off: slightly higher latency than direct API calls and regional availability constraints.
The Asymmetric Retrieval Problem
Here's a subtle issue: queries and documents are fundamentally different. Queries are short and intent-driven; documents are long and information-dense. Some embedding models handle this asymmetry poorly.
Models like Instructor and E5 allow you to prefix inputs differently for queries vs. documents, improving retrieval quality. If you're seeing semantically relevant documents not surfacing, this might be why.
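As a small illustration of the prefixing idea, here is a sketch for E5-style models, which expect `query: ` and `passage: ` prefixes on their inputs. The `EmbedFn` type and `embedForRetrieval` wrapper are hypothetical names for whatever embedding call you already have. Instructor uses free-form instruction strings instead, so check your model's card for the exact format.

```typescript
type InputKind = 'query' | 'passage';

// Hypothetical signature for your existing embedding call.
type EmbedFn = (input: string) => Promise<number[]>;

// E5-family models are trained with "query: " / "passage: " input prefixes.
function withE5Prefix(text: string, kind: InputKind): string {
  return `${kind}: ${text.trim()}`;
}

// Use 'passage' at indexing time and 'query' at search time.
async function embedForRetrieval(
  embed: EmbedFn,
  text: string,
  kind: InputKind
): Promise<number[]> {
  return embed(withE5Prefix(text, kind));
}
```

The important part is consistency: every document chunk gets the passage prefix when indexed, and every user query gets the query prefix when searched.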
Decision 3: Vector Database Selection
Vector databases are the hot new category, but your choice matters less than you think - at least initially.
When Simple Is Enough
For prototypes and smaller datasets (<100K vectors), you don't need a dedicated vector database:
- pgvector: If you're already on PostgreSQL, add vector search without new infrastructure
- SQLite with extensions: Great for local development and embedded applications
- In-memory (FAISS, Annoy): Fastest for small datasets, no persistence overhead
When to Scale Up
Dedicated vector databases (Pinecone, Weaviate, Qdrant, Milvus) become necessary when:
- Dataset exceeds what fits in memory
- You need distributed search across multiple nodes
- Filtering on metadata is as important as vector similarity
- You need real-time index updates at scale
Hybrid Search: The Underrated Pattern
Pure vector search has a weakness: it misses exact keyword matches that matter. "Error code E-4012" might not surface if the embedding doesn't capture that specific identifier.
Hybrid search combines vector similarity with traditional keyword search (BM25). Most production RAG systems benefit from this approach:
Final Score = α × Vector Score + (1 - α) × BM25 Score
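Here α (between 0 and 1) controls how much weight the vector score gets. Normalize both scores to a comparable range before combining them, since raw BM25 scores are unbounded while cosine similarities are not. Databases like Weaviate and Elasticsearch support hybrid search natively. If yours doesn't, implement it at the application layer.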
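If you do end up combining scores in application code, a sketch might look like the following. It assumes you already have two result lists (one from vector search, one from BM25) and uses min-max normalization, which is one common choice rather than the only one. The `ScoredDoc` shape and function names are illustrative.

```typescript
interface ScoredDoc {
  id: string;
  score: number;
}

// Rescale scores to [0, 1] so vector and BM25 scores are comparable.
function minMaxNormalize(results: ScoredDoc[]): Map<string, number> {
  const scores = results.map((r) => r.score);
  const min = Math.min(...scores);
  const range = Math.max(...scores) - min;
  const normalized = new Map<string, number>();
  for (const r of results) {
    normalized.set(r.id, range === 0 ? 1 : (r.score - min) / range);
  }
  return normalized;
}

// Final score = alpha * vector + (1 - alpha) * BM25, over the union of both lists.
// A document missing from one list contributes 0 for that component.
function hybridRank(
  vectorResults: ScoredDoc[],
  bm25Results: ScoredDoc[],
  alpha = 0.5
): ScoredDoc[] {
  const vector = minMaxNormalize(vectorResults);
  const keyword = minMaxNormalize(bm25Results);
  const ids = new Set<string>([
    ...Array.from(vector.keys()),
    ...Array.from(keyword.keys()),
  ]);
  return Array.from(ids)
    .map((id) => ({
      id,
      score: alpha * (vector.get(id) ?? 0) + (1 - alpha) * (keyword.get(id) ?? 0),
    }))
    .sort((a, b) => b.score - a.score);
}
```

A lower α favors exact keyword hits like the error-code example above, and a higher α favors semantic matches. Treat α as something to tune against your evaluation set, not a constant.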
Decision 4: Retrieval Strategy
Retrieval is where architectural decisions compound. A poor strategy wastes good embeddings.
Beyond Naive Top-K
Simple top-K retrieval has problems:
- Redundancy: Top results often contain overlapping information
- Diversity loss: Clustering around one interpretation of the query
- Recency blindness: No awareness of document freshness
MMR (Maximal Marginal Relevance) addresses redundancy by penalizing chunks similar to already-selected ones. It's a simple win for most applications.
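For reference, MMR fits in a few dozen lines. This sketch assumes you have embeddings for the query and each candidate. The `Candidate` shape and the default `lambda` of 0.7 are illustrative choices, not recommendations from any particular library.

```typescript
interface Candidate {
  id: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Greedy MMR: each step picks the candidate maximizing
//   lambda * sim(query, c) - (1 - lambda) * max sim(c, alreadySelected)
function mmr(
  query: number[],
  candidates: Candidate[],
  k: number,
  lambda = 0.7
): Candidate[] {
  const selected: Candidate[] = [];
  const remaining = [...candidates];
  while (selected.length < k && remaining.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    remaining.forEach((candidate, i) => {
      const relevance = cosine(query, candidate.embedding);
      const redundancy = selected.length === 0
        ? 0
        : Math.max(...selected.map((s) => cosine(candidate.embedding, s.embedding)));
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    });
    selected.push(...remaining.splice(bestIndex, 1));
  }
  return selected;
}
```

With `lambda = 1` this reduces to plain top-K by relevance. Lowering it trades relevance for diversity.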
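Reranking with a cross-encoder model (like Cohere Rerank or a local model) usually improves precision substantially, because the model scores the query and each candidate together instead of comparing precomputed vectors. The pattern: retrieve top-50 with fast vector search, rerank to top-5 with a more expensive model.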
Query expansion generates multiple query variants to improve recall. An LLM can rephrase your query into 3-5 alternative formulations, each retrieving candidates that get merged.
Metadata Filtering
Don't overlook metadata. User permissions, document dates, categories, and source systems are powerful filters that should apply before or during vector search, not after.
Query: "vacation policy"
Metadata filters:
- document_type: "policy"
- department: user.department
- status: "current"
Without filtering, you'll retrieve outdated policies from irrelevant departments.
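To show the ordering (filter first, then rank), here is an in-memory sketch. In a real deployment you would pass the filter to your vector database's query API instead. The `ChunkMetadata` fields mirror the example above, the other names are hypothetical, and the dot product assumes normalized embeddings.

```typescript
interface ChunkMetadata {
  documentType: string;
  department: string;
  status: string;
}

interface IndexedChunk {
  id: string;
  embedding: number[]; // assumed to be L2-normalized
  metadata: ChunkMetadata;
}

const dot = (a: number[], b: number[]): number =>
  a.reduce((sum, value, i) => sum + value * (b[i] ?? 0), 0);

// Apply metadata filters BEFORE similarity ranking, so filtered-out
// chunks never compete for a top-K slot.
function filteredSearch(
  queryEmbedding: number[],
  chunks: IndexedChunk[],
  filter: Partial<ChunkMetadata>,
  topK: number
): IndexedChunk[] {
  const keys = Object.keys(filter) as (keyof ChunkMetadata)[];
  const matches = (m: ChunkMetadata): boolean =>
    keys.every((key) => m[key] === filter[key]);

  return chunks
    .filter((chunk) => matches(chunk.metadata))
    .map((chunk) => ({ chunk, score: dot(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map((scored) => scored.chunk);
}
```

Filtering after retrieval is the failure mode to avoid: if the top-K is dominated by outdated or off-limits chunks, post-filtering can leave you with nothing to send to the LLM.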
Decision 5: Context Window Management
You've retrieved relevant chunks. Now what? How you construct the prompt matters enormously.
The Context Window Trade-off
More context gives the LLM more information but:
- Increases cost (tokens aren't free)
- Increases latency
- Risks "lost in the middle" - LLMs pay less attention to middle context
- Can introduce contradictory information
Strategies
Position matters: Put the most relevant information at the beginning and end of your context. LLMs attend to these positions more reliably.
Summarization: For large document sets, summarize before including. Trade some fidelity for token efficiency.
Source attribution: Include source identifiers so the LLM can cite specific documents. This helps users verify answers and improves trust.
Explicit instructions: Tell the LLM how to handle conflicts, missing information, and uncertainty. "If the documents don't contain the answer, say so clearly."
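The "position matters" strategy is easy to sketch: sort by relevance, then alternate chunks between the front and the back of the context so the weakest ones land in the middle. This is one simple way to do it, and the function name is my own.

```typescript
// Reorder chunks so the most relevant sit at the start and end of the
// context and the least relevant end up in the middle.
function orderForContext<T extends { score: number }>(chunks: T[]): T[] {
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  const front: T[] = [];
  const back: T[] = [];
  ranked.forEach((chunk, i) => {
    if (i % 2 === 0) {
      front.push(chunk); // ranks 1, 3, 5, ... fill from the start
    } else {
      back.unshift(chunk); // ranks 2, 4, 6, ... fill from the end
    }
  });
  return [...front, ...back];
}
```

For five chunks ranked 1 to 5, the resulting order is 1, 3, 5, 4, 2. Whether this helps depends on the model and context length, so measure it against your evaluation set before keeping it.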
Generation with Claude on Bedrock
Here's a practical example using Claude for the generation step:
import { BedrockRuntimeClient, InvokeModelCommand } from '@aws-sdk/client-bedrock-runtime';

const bedrock = new BedrockRuntimeClient({ region: 'us-east-1' });

interface RetrievedChunk {
  content: string;
  source: string;
  score: number;
}

async function generateAnswer(
  query: string,
  chunks: RetrievedChunk[]
): Promise<string> {
  // Format chunks with source attribution
  const context = chunks
    .map((chunk, i) => `[Source ${i + 1}: ${chunk.source}]\n${chunk.content}`)
    .join('\n\n---\n\n');

  const systemPrompt = `You are a helpful assistant that answers questions based on the provided context.
Rules:
- Only use information from the provided context
- Cite sources using [Source N] notation
- If the context doesn't contain enough information, say so clearly
- Never make up information not present in the context`;

  const response = await bedrock.send(new InvokeModelCommand({
    // Bedrock model IDs end in a version suffix. Some regions also require an
    // inference profile ID (for example, a "us." prefix) for this model.
    modelId: 'anthropic.claude-sonnet-4-20250514-v1:0',
    body: JSON.stringify({
      anthropic_version: 'bedrock-2023-05-31',
      max_tokens: 1024,
      system: systemPrompt,
      messages: [{
        role: 'user',
        content: `Context:\n${context}\n\n---\n\nQuestion: ${query}`,
      }],
    }),
  }));

  // The Messages API returns a list of content blocks; the answer is the first text block
  const result = JSON.parse(new TextDecoder().decode(response.body));
  return result.content[0].text;
}
Claude's 200K context window is particularly useful for RAG - you can include more chunks when relevance scores are close, reducing the risk of missing important information. The trade-off is cost: more input tokens means higher per-request costs.
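One way to act on "include more chunks when relevance scores are close" while keeping cost bounded is a score margin plus a token budget. The sketch below is illustrative: the 4-characters-per-token estimate is a rough assumption, and the `scoreMargin` default is arbitrary.

```typescript
interface RankedChunk {
  content: string;
  score: number;
}

// Very rough token estimate (assumption: ~4 characters per token).
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Keep every chunk whose score is within `scoreMargin` of the best one,
// as long as the running total stays under `maxTokens`.
function selectWithinBudget(
  chunks: RankedChunk[],
  maxTokens: number,
  scoreMargin = 0.1
): RankedChunk[] {
  const ranked = [...chunks].sort((a, b) => b.score - a.score);
  const top = ranked[0];
  if (!top) return [];

  const cutoff = top.score - scoreMargin;
  const selected: RankedChunk[] = [];
  let used = 0;
  for (const chunk of ranked) {
    if (chunk.score < cutoff) break; // everything after this is weaker still
    const cost = estimateTokens(chunk.content);
    if (used + cost > maxTokens) continue; // too big to fit, try smaller chunks
    selected.push(chunk);
    used += cost;
  }
  return selected;
}
```

When one chunk clearly dominates, this sends a short prompt. When several chunks score similarly, it sends more of them, up to the budget.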
When RAG Is The Wrong Choice
RAG isn't always the answer. Consider alternatives:
Fine-tuning is better when:
- You need to change the model's behavior or style
- Your knowledge is relatively static
- You want facts baked into the model weights
- Query latency is critical (no retrieval step)
Long context windows (Claude's 200K, GPT-4 Turbo's 128K) are better when:
- Your entire knowledge base fits in context
- Documents are highly interrelated
- Retrieval precision is hard to achieve
Traditional search is better when:
- Users need to browse results, not get a single answer
- Exact matching matters more than semantic similarity
- You need explainable ranking
Structured queries are better when:
- Your data is already in databases
- Questions map to SQL/GraphQL queries
- Accuracy requires deterministic retrieval
The Managed Alternative: Bedrock Knowledge Bases
Before building custom RAG infrastructure, consider whether a managed solution fits your needs. Amazon Bedrock Knowledge Bases handles the entire RAG pipeline:
Documents (S3) → Automatic Chunking → Titan Embeddings → OpenSearch Serverless
↓
Query → Retrieval → Claude → Response
When Managed Makes Sense
Bedrock Knowledge Bases is worth considering when:
- Time-to-market matters: Skip weeks of infrastructure work
- Your team lacks RAG expertise: Sensible defaults out of the box
- Documents are in S3: Native integration, with incremental syncs that re-index only added, changed, or deleted files
- You need guardrails: Built-in content filtering and PII redaction
When to Build Custom
Build your own when:
- You need fine-grained control: Custom chunking, hybrid search, specific reranking
- Cost optimization is critical: Managed services have overhead
- You have specialized requirements: Domain-specific embeddings, complex metadata filtering
- Multi-cloud or vendor independence matters
Quick Setup with CDK
import * as cdk from 'aws-cdk-lib';
import * as bedrock from 'aws-cdk-lib/aws-bedrock';
import * as s3 from 'aws-cdk-lib/aws-s3';

// Runs inside a Stack. Assumes kbRole (IAM role for the knowledge base),
// collection (OpenSearch Serverless collection), and region are defined elsewhere.
const documentBucket = new s3.Bucket(this, 'DocumentBucket');

const knowledgeBase = new bedrock.CfnKnowledgeBase(this, 'KnowledgeBase', {
  name: 'company-docs-kb',
  roleArn: kbRole.roleArn,
  knowledgeBaseConfiguration: {
    type: 'VECTOR',
    vectorKnowledgeBaseConfiguration: {
      embeddingModelArn: `arn:aws:bedrock:${region}::foundation-model/amazon.titan-embed-text-v2:0`,
    },
  },
  storageConfiguration: {
    type: 'OPENSEARCH_SERVERLESS',
    opensearchServerlessConfiguration: {
      collectionArn: collection.attrArn,
      vectorIndexName: 'docs-index',
      fieldMapping: {
        vectorField: 'embedding',
        textField: 'text',
        metadataField: 'metadata',
      },
    },
  },
});

// Register the S3 bucket as a data source. New or changed documents are
// indexed when a sync (ingestion job) runs, not on every S3 write.
new bedrock.CfnDataSource(this, 'DocsDataSource', {
  knowledgeBaseId: knowledgeBase.attrKnowledgeBaseId,
  name: 'company-documents',
  dataSourceConfiguration: {
    type: 'S3',
    s3Configuration: {
      bucketArn: documentBucket.bucketArn,
    },
  },
});
The honest assessment: Bedrock Knowledge Bases gets you 80% of the way with 20% of the effort. For many use cases, that's the right trade-off. You can always migrate to custom infrastructure later when requirements demand it.
Common Failure Modes
After seeing many RAG implementations, these patterns emerge:
1. Ignoring Evaluation
Teams ship without measuring retrieval quality. Build evaluation sets early:
- Curate 50-100 realistic queries with expected relevant documents
- Measure recall@k and precision@k for retrieval
- Use LLM-as-judge for answer quality (with human spot-checking)
Without measurement, you're optimizing blindly.
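The retrieval metrics themselves are only a few lines each. This sketch assumes each evaluation case gives you the ranked list of retrieved document IDs and the set of IDs you consider relevant. The function names are my own.

```typescript
// recall@k: what fraction of the relevant documents appear in the top k?
function recallAtK(retrieved: string[], relevant: string[], k: number): number {
  if (relevant.length === 0) return 0;
  const topK = new Set(retrieved.slice(0, k));
  const hits = relevant.filter((id) => topK.has(id)).length;
  return hits / relevant.length;
}

// precision@k: what fraction of the top k results are relevant?
function precisionAtK(retrieved: string[], relevant: string[], k: number): number {
  if (k <= 0) return 0;
  const relevantSet = new Set(relevant);
  const hits = retrieved.slice(0, k).filter((id) => relevantSet.has(id)).length;
  return hits / k;
}

// Average a per-query metric across the whole evaluation set.
function mean(values: number[]): number {
  return values.length === 0
    ? 0
    : values.reduce((sum, v) => sum + v, 0) / values.length;
}
```

Run these over your golden queries after every chunking or retrieval change and track the averages over time. A drop in recall@k tells you the right documents stopped surfacing before any user does.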
2. One-Size-Fits-All Chunking
Different document types need different chunking. Code documentation, legal contracts, and support tickets have different structures. A chunking strategy that works for one may fail on others.
3. Neglecting the Query Side
All optimization effort goes into indexing. But query understanding matters equally:
- Query classification (is this a factual lookup, comparison, or synthesis?)
- Query rewriting for retrieval optimization
- Handling conversational context in multi-turn interactions
4. Over-Engineering Early
Teams add reranking, query expansion, hybrid search, and hierarchical retrieval before validating that basic retrieval works. Start simple, measure, then add complexity where it helps.
A Practical Starting Point
If I were building a RAG system today, here's where I'd start:
Option A: Custom Build
- Chunking: Recursive with 500 token target, document-aware boundaries
- Embedding: Titan Embed v2 on Bedrock (or OpenAI text-embedding-3-small)
- Vector store: pgvector if on Postgres, OpenSearch Serverless on AWS
- Retrieval: Top-20 vector search → Cohere rerank to top-5
- Generation: Claude Sonnet on Bedrock with explicit citation instructions
- Evaluation: 50 golden queries, measure weekly
Option B: Managed (Faster Start)
- Platform: Bedrock Knowledge Bases
- Storage: S3 for documents, OpenSearch Serverless for vectors
- Model: Claude for generation, Titan for embeddings
- Customization: Add guardrails, configure chunking parameters
- Evaluation: Still build your golden query set - managed doesn't mean unmeasured
Start with Option B if speed matters and your requirements are standard. Move to Option A when you hit limitations.
Then iterate based on observed failures. The right architecture depends on your specific data, queries, and quality requirements.
Conclusion
RAG architecture is about trade-offs, not best practices. The optimal chunking strategy, embedding model, and retrieval approach depend on your specific content, query patterns, and accuracy requirements.
Start simple, measure rigorously, and add complexity only when measurements show it helps. The teams that succeed with RAG are the ones that treat it as an iterative engineering problem, not a one-time implementation.
The most sophisticated RAG architecture is worthless if your source documents are poorly written or your evaluation is nonexistent. Focus on fundamentals first.