
RAG Best Practices

Proven strategies for building effective RAG systems.

Document Preparation

1. Quality Over Quantity

A smaller set of high-quality documents beats a massive noisy dataset.

  • Remove duplicates and near-duplicates (see the sketch after this list)
  • Filter out low-quality or irrelevant content
  • Validate sources
  • Remove personally identifiable information
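
As a minimal sketch of the deduplication step (normalize and deduplicate are illustrative names, not from any particular library): hashing normalized text catches exact duplicates and trivially reformatted copies, while fuzzier near-duplicates need techniques such as MinHash.

import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so trivial variations hash identically
    return " ".join(text.lower().split())

def deduplicate(documents):
    seen_hashes = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            unique_docs.append(doc)
    return unique_docs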

2. Proper Chunking

The right chunk size is critical:

  • Too small (< 256 tokens): Loses context
  • Too large (> 2000 tokens): Dilutes signal with noise
  • Sweet spot: 512-1024 tokens with 100-200 token overlap

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=128,
    separators=["\n\n", "\n", ".", " "]  # Respect document structure
)

3. Rich Metadata

Add contextual metadata to chunks:

chunk.metadata = {
    "source": "product_manual.pdf",
    "chapter": "Installation",
    "version": "2.0",
    "date": "2024-01-15",
    "type": "tutorial"
}

Benefits:

  • Filter results by date, type, or source (see the example after this list)
  • Better context in responses
  • Improved retrieval precision
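
As a small example of the first benefit, retrieval can be restricted by metadata at query time. A minimal sketch, assuming a LangChain vector store (such as Chroma) whose similarity_search accepts a metadata filter argument:

# Only retrieve chunks whose metadata marks them as tutorials
results = vectorstore.similarity_search(
    "How do I install the product?",
    k=5,
    filter={"type": "tutorial"}
)

for doc in results:
    print(doc.metadata["source"], doc.metadata["chapter"])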

4. Preprocessing

import re

def preprocess_text(text):
    # Remove extra whitespace
    text = " ".join(text.split())
    # Remove URLs
    text = re.sub(r'http\S+', '', text)
    # Remove special characters but keep structure
    text = re.sub(r'[^\w\s.,!?-]', '', text)
    return text

Retrieval Optimization

1. Hybrid Search

Combine dense and sparse retrieval:

# Dense (semantic)
dense_results = vectorstore.similarity_search(query, k=10)

# Sparse (keyword)
sparse_results = bm25.retrieve(query, k=10)

# Combine
combined = combine_results(dense_results, sparse_results)

Hybrid search typically improves recall by 10-20% with minimal latency increase.
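
The combine_results helper above is left undefined; one common way to implement it is reciprocal rank fusion. A minimal sketch, assuming each result is a document whose page_content can serve as its identity:

def combine_results(dense_results, sparse_results, k=60, top_n=10):
    # Reciprocal rank fusion: score each document by 1 / (k + rank)
    # across both result lists, then keep the highest-scoring ones.
    scores = {}
    docs = {}
    for results in (dense_results, sparse_results):
        for rank, doc in enumerate(results):
            key = doc.page_content  # assumes content identifies the chunk
            docs[key] = doc
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank + 1)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [docs[key] for key in ranked[:top_n]]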

2. Reranking

Use cross-encoders to improve precision:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/mmarco-mMiniLMv2-L12-H384-v1')

# Get initial candidates
candidates = retriever.retrieve(query, k=20)

# Rerank
scores = reranker.predict([
    [query, doc.page_content] for doc in candidates
])

# Keep top-k (sort by score only, so ties never compare documents)
ranked_docs = [d for _, d in sorted(
    zip(scores, candidates),
    key=lambda pair: pair[0],
    reverse=True
)][:5]

3. Query Expansion

Generate multiple query variants:

def expand_query(query):
    """Generate related queries"""
    variants = [
        query,                        # Original
        paraphrase(query),            # Rephrase
        break_into_questions(query),  # Sub-questions
        add_synonyms(query)           # With synonyms
    ]
    return variants

# Retrieve from all variants
all_results = []
for variant in expand_query(query):
    all_results.extend(vectorstore.search(variant, k=5))

# Deduplicate and rank
final_docs = deduplicate_and_rank(all_results)
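
deduplicate_and_rank is likewise left open; a simple interpretation keeps one copy of each chunk and ranks chunks by how many query variants retrieved them (a sketch, not the only reasonable choice):

from collections import Counter

def deduplicate_and_rank(all_results, top_n=5):
    # Chunks retrieved by more query variants rank higher;
    # the stable sort preserves first-seen order among ties.
    counts = Counter(doc.page_content for doc in all_results)
    seen, unique_docs = set(), []
    for doc in all_results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            unique_docs.append(doc)
    unique_docs.sort(key=lambda d: counts[d.page_content], reverse=True)
    return unique_docs[:top_n]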

Prompt Engineering

1. Clear Instructions

system = """You are a helpful assistant. Use the provided context
to answer the user's question. Follow these rules:

1. Only use information from the context
2. If the answer isn't in the context, say "I don't know"
3. Always cite which document you're using
4. Be concise and clear"""

2. Few-Shot Examples

examples = [
    {
        "context": "The Earth orbits the Sun...",
        "question": "What does Earth orbit?",
        "answer": "According to the document, Earth orbits the Sun."
    },
    {
        "context": "Photosynthesis is the process...",
        "question": "How do plants make food?",
        "answer": "The document explains that plants use photosynthesis..."
    }
]

3. Structured Output

prompt = """Based on the context, provide:

1. Direct Answer: [1-2 sentences]
2. Supporting Evidence: [key facts from context]
3. Source Document: [which document this comes from]
4. Confidence: [high/medium/low]

Context: {context}

Question: {question}"""

Quality Assurance

1. Evaluation Metrics

Track these metrics in production:

Metric              | Target | How
--------------------|--------|---------------------------------
Context Relevance   | > 0.8  | LLM evaluates retrieved context
Answer Faithfulness | > 0.85 | Response uses only context info
Answer Relevance    | > 0.9  | Response answers the question
Latency             | < 2s   | Track p95 response time
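
The first three metrics are usually scored with an LLM-as-judge prompt. A minimal sketch of a faithfulness check; call_llm is a placeholder for whatever chat-completion client is in use:

FAITHFULNESS_PROMPT = """Rate from 0 to 1 how well the answer is supported
by the context alone. Reply with only the number.

Context: {context}
Answer: {answer}"""

def score_faithfulness(context, answer, call_llm):
    # call_llm: any function that sends a prompt to an LLM and returns its text
    reply = call_llm(FAITHFULNESS_PROMPT.format(context=context, answer=answer))
    try:
        return float(reply.strip())
    except ValueError:
        return 0.0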

2. User Feedback

# Ask users to rate responses
response = {
    "answer": "...",
    "feedback_url": "/api/feedback?id={query_id}"
}

# Store feedback for improvement
def log_feedback(query_id, rating, comment):
    db.save({
        "query_id": query_id,
        "rating": rating,
        "comment": comment,
        "timestamp": now()
    })

3. Continuous Testing

# Test set of known Q&A pairs
test_cases = [
    {"query": "What is RAG?", "expected": "Retrieval-Augmented Generation..."},
    {"query": "How does it work?", "expected": "It retrieves documents and..."}
]

for test in test_cases:
    result = rag.query(test["query"])
    score = evaluate_similarity(result, test["expected"])
    assert score > 0.7, f"Test failed: {test['query']}"
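
evaluate_similarity is not defined above; one simple option is embedding cosine similarity with sentence-transformers (a sketch; the model name is just a common lightweight choice):

from sentence_transformers import SentenceTransformer, util

_model = SentenceTransformer("all-MiniLM-L6-v2")

def evaluate_similarity(answer, expected):
    # Cosine similarity between the two sentence embeddings (~0 to 1 for typical text)
    embeddings = _model.encode([answer, expected])
    return float(util.cos_sim(embeddings[0], embeddings[1]))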

Performance Optimization

Strategy    | Description
------------|----------------------------------------------------------
Caching     | Cache embeddings and frequent queries to reduce latency
Batching    | Process multiple queries together for better throughput
Indexing    | Use proper database indices for faster retrieval
Compression | Compress stored embeddings to reduce memory
Async       | Use async/await for I/O-bound operations
CDN         | Distribute vector database geographically
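
As an example of the caching row, an in-process LRU cache avoids re-embedding repeated queries (a minimal sketch; embed_text stands in for whatever embedding call is in use):

from functools import lru_cache

@lru_cache(maxsize=10_000)
def embed_query_cached(query: str):
    # embed_text is a placeholder for the embedding provider's API call;
    # repeated queries are served from memory instead of re-embedded.
    # Returning a tuple keeps the cached vector immutable.
    return tuple(embed_text(query))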

Common Pitfalls to Avoid

Don’t:

  • Use massive chunks that dilute signal
  • Ignore document quality
  • Forget to add metadata
  • Use outdated embedding models
  • Skip evaluation and testing
  • Deploy without monitoring
  • Rely on a single retrieval method
  • Forget to update the knowledge base

Do:

  • ✅ Start simple, iterate
  • ✅ Measure everything
  • ✅ Use multiple retrieval methods
  • ✅ Add monitoring and logging
  • ✅ Get user feedback
  • ✅ Version your documents and models
  • ✅ Have a fallback strategy
  • ✅ Regularly audit results

Security Considerations

1. Input Validation

def sanitize_query(query):
    # Prevent prompt injection
    if len(query) > 500:
        return None
    if contains_dangerous_chars(query):
        return None
    return query.strip()
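
contains_dangerous_chars is left undefined above; a very rough heuristic is a regex blocklist of common injection phrases (a sketch only; production systems typically pair this with a dedicated prompt-injection classifier):

import re

INJECTION_PATTERNS = re.compile(
    r"ignore (all|previous) instructions|system prompt|<\|",
    re.IGNORECASE
)

def contains_dangerous_chars(query):
    # Flags a few well-known injection phrases and special delimiter tokens
    return bool(INJECTION_PATTERNS.search(query))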

2. Output Validation

def validate_response(response):
    # Ensure response is grounded in context
    citations = extract_citations(response)
    for citation in citations:
        if citation not in retrieved_docs:
            flag_for_review(response)

3. Access Control

# Only allow access to relevant documents
def filter_by_user_permissions(docs, user_id):
    return [d for d in docs if user_has_access(user_id, d.source)]

Monitoring Checklist

  • Track query volume and latency
  • Monitor retrieval quality (precision, recall)
  • Log failed queries for analysis
  • Track embedding costs
  • Monitor vector store size and growth
  • Regular evaluation against test cases
  • User feedback analysis
  • Document freshness (when was the last update?)
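
A minimal sketch of the first three checklist items, emitting one structured log record per query (field names are illustrative; rag.query reuses the pipeline from the testing example above):

import json
import logging
import time

logger = logging.getLogger("rag")

def answer_with_logging(rag, query):
    start = time.perf_counter()
    try:
        result = rag.query(query)
        status = "ok"
        return result
    except Exception:
        status = "failed"
        raise
    finally:
        logger.info(json.dumps({
            "query": query,
            "status": status,
            "latency_ms": round((time.perf_counter() - start) * 1000, 1),
        }))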