
RAG Architecture

Understanding the architecture of a RAG system is crucial for implementing one effectively.

System Overview

A complete RAG system consists of several interconnected components:

```
┌─────────────────────────────────────┐
│        Knowledge Base Setup         │
│  (Documents → Chunks → Embeddings)  │
│        Vector Store / Index         │
└──────────────────┬──────────────────┘
                   │
         ┌─────────┴─────────┐
         │                   │
  ┌──────▼───────┐   ┌───────▼────────┐
  │  Retrieval   │   │   Reranking    │
  │   Engine     │   │   (Optional)   │
  └──────┬───────┘   └───────┬────────┘
         │                   │
         └─────────┬─────────┘
                   │
       ┌───────────▼─────────┐
       │   Prompt Assembly   │
       │ (Context Assembly)  │
       └───────────┬─────────┘
                   │
       ┌───────────▼─────────┐
       │   Language Model    │
       │    (Generation)     │
       └───────────┬─────────┘
                   │
       ┌───────────▼─────────┐
       │   Post-Processing   │
       │  (Formatting, etc.) │
       └───────────┬─────────┘
                   │
                   ▼
               Response
```

Detailed Components

1. Document Processing

Prepares raw documents for retrieval:

  • Chunking: Break documents into manageable pieces (typically 200-1000 tokens)
  • Cleaning: Remove noise, format text, extract metadata
  • Metadata Addition: Include source, date, category, etc.
  • Deduplication: Remove duplicate content
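The chunking step can be sketched as a simple overlapping token-window splitter. This uses whitespace tokens as a rough stand-in for model tokens; the `chunk_size` and `overlap` values are illustrative, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size tokens."""
    tokens = text.split()  # whitespace tokens as a proxy for model tokens
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the text
    return chunks

# A 500-token document yields three 200-token chunks with 50-token overlap.
doc = " ".join(f"token{i}" for i in range(500))
chunks = chunk_text(doc)
```

The overlap keeps a sentence that straddles a chunk boundary fully visible in at least one chunk, at the cost of some index redundancy.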

2. Embedding Generation

Converts text into numerical vectors:

  • Embedding Model: Uses models like text-embedding-3-small or all-MiniLM-L6-v2
  • Dimensionality: Typically 384-3072 dimensions
  • Similarity: Enables semantic search in vector space
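A minimal sketch of why embeddings enable semantic search: similarity between two texts reduces to cosine similarity between their vectors. The four-dimensional vectors below are toy values; real models produce 384-3072 dimensions:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": doc_a points in roughly the same direction as the
# query, doc_b does not, so doc_a scores higher.
query = [0.9, 0.1, 0.0, 0.2]
doc_a = [0.8, 0.2, 0.1, 0.1]
doc_b = [0.0, 0.1, 0.9, 0.0]
```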

3. Vector Store / Index

Stores and retrieves embeddings efficiently:

Popular options:

  • Pinecone: Managed vector database
  • Weaviate: Open-source vector database
  • FAISS: Meta's library for fast similarity search
  • Milvus: Cloud-native vector database
  • Elasticsearch: Full-text + vector search
  • PostgreSQL with pgvector: SQL + vectors

4. Retrieval Engine

Finds relevant documents for a query:

Semantic Search

  • Embed the query in the same vector space as the documents
  • Find the k nearest neighbors
  • Score results by vector similarity
  • Best for queries that need semantic understanding rather than exact keyword matches
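A brute-force version of this search can be sketched as follows. Production vector stores replace the linear scan with approximate nearest-neighbor indexes (e.g. HNSW or IVF); the store class and the toy three-dimensional vectors here are illustrative only:

```python
import math

class InMemoryVectorStore:
    """Brute-force vector store: fine for demos, replaced by ANN-indexed
    stores (FAISS, Pinecone, pgvector, ...) in production."""

    def __init__(self):
        self.vectors: list[list[float]] = []
        self.texts: list[str] = []

    def add(self, vector: list[float], text: str) -> None:
        self.vectors.append(vector)
        self.texts.append(text)

    def search(self, query: list[float], k: int = 2) -> list[str]:
        """Return the texts of the k most cosine-similar vectors."""
        def cos(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(x * x for x in b))
            return dot / (na * nb)
        ranked = sorted(range(len(self.vectors)),
                        key=lambda i: cos(query, self.vectors[i]),
                        reverse=True)
        return [self.texts[i] for i in ranked[:k]]

store = InMemoryVectorStore()
store.add([1.0, 0.0, 0.0], "doc about vectors")
store.add([0.0, 1.0, 0.0], "doc about cooking")
store.add([0.9, 0.4, 0.0], "doc about vector indexes")
```

Swapping this class for a managed store changes the index implementation, not the add/search interface the rest of the pipeline sees.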

5. Reranking (Optional)

Improves retrieval quality:

  • Takes top-k results from retrieval
  • Re-scores using cross-encoder model
  • Filters out irrelevant results
  • Improves precision (but adds latency)
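The reranking pattern can be sketched independently of any particular model. Here `score_fn` stands in for a cross-encoder that scores (query, passage) pairs jointly; the word-overlap scorer below is a toy substitute, not a real reranker:

```python
def rerank(query: str, candidates: list[str], score_fn,
           top_n: int = 3, threshold: float = 0.0) -> list[str]:
    """Re-score retrieval candidates, keep the top_n above threshold."""
    scored = [(score_fn(query, c), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:top_n] if score >= threshold]

# Toy stand-in scorer: shared-word count. A real system would call a
# cross-encoder model here instead.
def overlap_score(query: str, passage: str) -> int:
    return len(set(query.lower().split()) & set(passage.lower().split()))

results = rerank(
    "vector database index",
    ["a vector database stores embeddings",
     "cooking pasta at home",
     "how to index a vector database"],
    overlap_score, top_n=2, threshold=1,
)
```

The `threshold` filter is what drops irrelevant results outright, while `top_n` bounds how much context reaches the prompt.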

6. Prompt Assembly

Constructs the final prompt:

```python
system_prompt = """You are a helpful assistant.
Use the provided context to answer questions.
If you cannot find the answer in the context, say so."""

user_prompt = f"""
Context:
{retrieved_context}

Question: {user_query}

Answer:"""
```

7. Language Model

Generates the response:

  • Takes assembled prompt
  • Generates coherent, contextual answer
  • Can use various models (GPT-4, Claude, etc.)
  • Model size can be chosen to balance quality against cost and latency

8. Post-Processing

Refines the final output:

  • Citation Extraction: Link answers to source documents
  • Format Validation: Ensure output matches expected format
  • Quality Checks: Verify response quality
  • Fallback Handling: Handle edge cases
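As one illustration of post-processing, here is a sketch of citation extraction, assuming the model was prompted to emit `[doc_id]` markers in its answer (a common convention, not a fixed standard):

```python
import re

def extract_citations(answer: str, sources: dict[str, str]) -> list[str]:
    """Map [doc_id] markers in a generated answer back to source titles.

    Unknown IDs are silently dropped; a stricter pipeline might flag
    them as a quality-check failure instead.
    """
    cited_ids = re.findall(r"\[(\w+)\]", answer)
    return [sources[i] for i in cited_ids if i in sources]

sources = {"d1": "RAG Guide", "d2": "Vector DB Docs"}
answer = "Chunk size matters [d1]; indexes trade recall for speed [d2]."
citations = extract_citations(answer, sources)
```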

Common Architectural Patterns

Simple RAG

Query → Retrieve → Prompt → LLM → Response

Best for: Quick prototypes, simple use cases
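The Simple RAG flow above can be wired together in a few lines. `embed`, `search`, and `llm` are placeholder stand-ins for an embedding model, a vector-store lookup, and a language-model call:

```python
def simple_rag(query: str, embed, search, llm, k: int = 3) -> str:
    """Query -> Retrieve -> Prompt -> LLM -> Response, in one pass."""
    context = "\n\n".join(search(embed(query), k))
    prompt = (
        "Use the provided context to answer the question.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
    return llm(prompt)

# Stand-in components, just to show the wiring.
def embed(text):
    return [len(text)]  # not a real embedding

def search(vec, k):
    return ["RAG retrieves documents before generating."][:k]

captured = {}
def llm(prompt):
    captured["prompt"] = prompt  # record what the model would see
    return "stub answer"

response = simple_rag("What does RAG do?", embed, search, llm)
```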

Advanced RAG

Query → Multi-stage Retrieval → Reranking → Prompt Enhancement → LLM → Post-Processing → Response

Best for: Production systems, high accuracy needs

Iterative RAG

Query → Retrieve → Generate → Check Quality → If low quality: Retrieve more / Refine → Generate → Response

Best for: Complex queries, multi-step reasoning
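The iterative pattern is essentially a retrieve-generate loop that widens retrieval when answer quality is low. `retrieve`, `generate`, and `quality` are placeholders for real components; the 0.7 threshold and the widening factor are illustrative:

```python
def iterative_rag(query: str, retrieve, generate, quality,
                  max_rounds: int = 3, k: int = 3,
                  threshold: float = 0.7) -> str:
    """Retrieve-generate loop that retrieves more when quality is low."""
    answer = ""
    for round_num in range(1, max_rounds + 1):
        context = retrieve(query, k * round_num)  # widen retrieval each retry
        answer = generate(query, context)
        if quality(answer) >= threshold:
            break  # good enough, stop iterating
    return answer  # best effort after max_rounds

# Stand-in components, just to exercise the loop.
def retrieve(query, k):
    return [f"passage {i}" for i in range(k)]

def generate(query, context):
    return f"answer from {len(context)} passages"

def quality(answer):
    # Pretend quality improves once enough context is retrieved.
    return 0.3 if ("3 passages" in answer or "6 passages" in answer) else 0.9

result = iterative_rag("complex question", retrieve, generate, quality)
```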

The choice of architecture depends on your use case, latency requirements, and accuracy needs.

Performance Considerations

| Component         | Impact | Cost   |
|-------------------|--------|--------|
| Retrieval Quality | High   | Low    |
| Model Size        | High   | High   |
| Reranking         | Medium | Medium |
| Context Length    | Medium | High   |
| Embedding Quality | High   | Medium |

Focus on improving retrieval first—it’s usually the biggest bottleneck.
