Evaluation Methods for RAG
Measuring RAG performance is critical for improving your system.
Evaluation Framework
A complete RAG evaluation has three main components:
Query → [Retrieval] → [Generation] → Response
              ↓               ↓              ↓
     Context Relevance   Faithfulness   Answer Relevance

Retrieval Evaluation
1. Standard IR Metrics
Precision@k: What fraction of the top-k results are relevant?

Precision@5 = (relevant in top-5) / 5

Example: Retrieved [A✓, B✗, C✓, D✓, E✗]
Precision@5 = 3/5 = 0.6

Recall@k: What fraction of all relevant docs appear in the top-k?

Recall@5 = (relevant in top-5) / (total relevant)

Example: 10 relevant docs total, 3 found in top-5
Recall@5 = 3/10 = 0.3

Mean Reciprocal Rank (MRR): The average, over queries, of 1 / (rank of the first relevant result).

MRR = 1/N * Σ(1 / rank of first relevant)

Example: [B✗, A✓, C✓, D✗] → first relevant result at rank 2
Reciprocal rank = 1/2 = 0.5

NDCG (Normalized Discounted Cumulative Gain)
Considers both relevance and ranking position:

DCG = Σ(rel_i / log2(i+1))
NDCG = DCG / ideal_DCG
More sophisticated: it accounts for graded relevance scores as well as position. (A code sketch for MRR and NDCG follows the precision/recall implementation below.)

2. Implementation
# Define ground truth
queries = [
    {
        "query": "What is RAG?",
        "relevant_docs": ["doc1", "doc3", "doc5"]
    }
]

# Evaluate retrieval
results = []
for q in queries:
    retrieved = retrieve(q["query"], k=5)  # retrieve() is your system's retriever
    retrieved_ids = [d.id for d in retrieved]

    # Calculate metrics
    relevant_set = set(q["relevant_docs"])
    retrieved_set = set(retrieved_ids)
    tp = len(relevant_set & retrieved_set)
    precision = tp / len(retrieved_set)
    recall = tp / len(relevant_set)

    results.append({
        "query": q["query"],
        "precision": precision,
        "recall": recall
    })

# Average across all queries
avg_precision = sum(r["precision"] for r in results) / len(results)
avg_recall = sum(r["recall"] for r in results) / len(results)
print(f"Avg Precision: {avg_precision:.3f}")
print(f"Avg Recall: {avg_recall:.3f}")
Generation Evaluation

1. Context Relevance
Is the retrieved context actually relevant to the query?
from langchain.evaluation import load_evaluator

# Use an LLM to judge relevance against a custom criterion
evaluator = load_evaluator(
    "criteria",
    criteria={"relevance": "The context directly addresses the query"}
)
result = evaluator.evaluate_strings(
    prediction=retrieved_context,
    input=query
)
score = result["score"]

2. Faithfulness
Is the response grounded in the retrieved documents?
def evaluate_faithfulness(response, context):
    """Check whether the response uses only information from the context."""
    prompt = f"""
    Does the response only use information from the context?

    Consider these claims from the response:
    {extract_claims(response)}

    Context:
    {context}

    Rate faithfulness 0-1:
    """
    # extract_claims() and llm_evaluate() are placeholder helpers:
    # claim extraction and LLM-as-judge scoring, respectively.
    rating = llm_evaluate(prompt)
    return rating

3. Answer Relevance
Does the response actually answer the question?
def evaluate_answer_relevance(question, response):
    """Check whether the response addresses the question."""
    prompt = f"""
    Question: {question}
    Response: {response}

    Does the response answer the question? (0-1):
    """
    score = llm_evaluate(prompt)
    return score
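The llm_evaluate helper used above is a placeholder for an LLM-as-judge call. One possible sketch using the OpenAI Python client; the model name and score parsing are illustrative assumptions, and any judge model works:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def llm_evaluate(prompt: str) -> float:
    """Ask a judge model for a 0-1 score and parse the reply."""
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative choice of judge model
        messages=[
            {"role": "system", "content": "Reply with a single number between 0 and 1."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
    )
    try:
        return float(completion.choices[0].message.content.strip())
    except ValueError:
        return 0.0  # treat unparseable replies as failures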
4. Automated Metrics

BLEU Score
Measures overlap between generated and reference text.
from nltk.translate.bleu_score import sentence_bleu

reference = "the cat is on the mat"
candidate = "the cat is on the mat"

score = sentence_bleu(
    [reference.split()],
    candidate.split()
)
# Returns 1.0 for a perfect match

Pros: Fast, standard metric
Cons: Doesn’t capture semantic similarity
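A common complement that does capture semantic similarity is cosine similarity between sentence embeddings. A sketch assuming the sentence-transformers package; the model name and sentences are illustrative:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

reference = "RAG grounds answers in retrieved documents."
candidate = "Retrieval-augmented generation bases its answers on retrieved context."

# Cosine similarity between the two embeddings (higher = more similar)
embeddings = model.encode([reference, candidate])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"Semantic similarity: {similarity:.3f}")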
Building an Evaluation Suite
Create Test Dataset
# Curate quality test cases
test_set = [
    {
        "query": "What is RAG?",
        "expected_answer": "Retrieval-Augmented Generation is...",
        "relevant_docs": ["doc1", "doc3"],
        "category": "definition"
    },
    {
        "query": "How does RAG improve accuracy?",
        "expected_answer": "RAG reduces hallucinations by...",
        "relevant_docs": ["doc2", "doc4"],
        "category": "how-it-works"
    }
]

Define Evaluation Criteria
# Each "method" names a scoring function; the evaluation function below
# resolves these names to callables.
criteria = {
    "context_relevance": {
        "weight": 0.2,
        "target": 0.8,
        "method": "llm_evaluate"
    },
    "retrieval_recall": {
        "weight": 0.2,
        "target": 0.7,
        "method": "metric_recall"
    },
    "answer_relevance": {
        "weight": 0.3,
        "target": 0.9,
        "method": "llm_evaluate"
    },
    "faithfulness": {
        "weight": 0.3,
        "target": 0.85,
        "method": "llm_evaluate"
    }
}

Implement Evaluation Function
def evaluate_rag_system(test_set, criteria):
    results = []
    for test in test_set:
        # Get RAG response
        response = rag.query(test["query"])
        retrieved = response["source_documents"]
        answer = response["answer"]

        # Evaluate each criterion
        scores = {}
        for criterion, config in criteria.items():
            # SCORING_METHODS maps the method names used in `criteria`
            # ("llm_evaluate", "metric_recall") to scoring functions with
            # the signature (test, response, retrieved) -> float
            score = SCORING_METHODS[config["method"]](
                test,
                response,
                retrieved
            )
            scores[criterion] = score

        # Calculate weighted score
        weighted = sum(
            scores[c] * criteria[c]["weight"]
            for c in criteria
        )

        results.append({
            "query": test["query"],
            "scores": scores,
            "weighted_score": weighted,
            "answer": answer
        })

    return results

Analyze Results
# Aggregate scores
avg_scores = {}
for criterion in criteria:
    scores = [r["scores"][criterion] for r in results]
    avg = sum(scores) / len(scores)
    avg_scores[criterion] = avg
    target = criteria[criterion]["target"]
    print(f"{criterion}: {avg:.3f} (target: {target:.3f})")
    if avg < target:
        print("  ⚠️ Below target")

Identify Failure Cases
# Find worst performing queries
failures = sorted(
    results,
    key=lambda x: x["weighted_score"]
)[:5]

for failure in failures:
    print(f"Query: {failure['query']}")
    print(f"Score: {failure['weighted_score']:.3f}")
    print(f"Issue: {analyze_failure(failure)}")  # analyze_failure() is a placeholder diagnostic helper

Iterate
Based on failures, improve:
- Retrieval quality
- Prompt engineering
- Document preprocessing
- Model selection
Continuous Evaluation
1. Production Monitoring
from datetime import datetime, timezone

def log_evaluation(query, response):
    """Log each response for later offline evaluation."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "query": query,
        "answer": response["answer"],
        "sources": response["sources"],
        "latency": response["latency"],
        "model": response["model"]
    }
    database.log_event(event)  # `database` is whatever store you log events to

2. Batch Evaluation
from datetime import datetime, timedelta, timezone

# Periodically evaluate recent queries
recent_queries = database.get_queries(
    since=datetime.now(timezone.utc) - timedelta(hours=24)
)

scores = evaluate_rag_system(recent_queries, criteria)
avg_score = sum(r["weighted_score"] for r in scores) / len(scores)

# Alert if metrics drop
if avg_score < threshold:
    send_alert(f"RAG quality dropped to {avg_score:.3f}")

3. Comparative Testing
# Compare models/approaches
variants = {
    "baseline": old_rag_system,
    "new_model": updated_rag_system,
    "with_reranking": rag_with_reranker
}

for name, system in variants.items():
    # assumes evaluate_rag_system can be told which system to query,
    # e.g. via an extra argument alongside the criteria
    results = evaluate_rag_system(test_set, criteria, system=system)
    avg = sum(r["weighted_score"] for r in results) / len(results)
    print(f"{name}: {avg:.3f}")

Evaluation Benchmarks
Common target ranges for RAG systems:
- Retrieval Recall: 0.7-0.9 (depends on how many missed documents you can tolerate)
- Answer Relevance: 0.85-0.95 (strict requirement)
- Faithfulness: 0.80-0.95 (no hallucinations)
- Context Relevance: 0.75-0.90 (good retrieval)
- Latency: < 2 seconds (typical requirement)
Tools and Frameworks
Popular libraries for RAG evaluation:
# RAGAS - RAG Assessment
from ragas.metrics import (
    context_precision,
    context_recall,
    faithfulness,
    answer_relevancy
)

# TruLens - LLM evaluation
from trulens_eval import Tru

# DeepEval - LLM testing
from deepeval.metrics import AnswerRelevancyMetric

# LangChain evaluators
from langchain.evaluation import load_evaluator

Start with RAGAS; it is designed specifically for RAG systems.
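A minimal RAGAS sketch, assuming the ragas evaluate() API and a Hugging Face datasets Dataset with the columns RAGAS expects (exact column names vary slightly between versions); the sample data is illustrative, and the LLM judge needs an API key configured:

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall

eval_data = Dataset.from_dict({
    "question": ["What is RAG?"],
    "answer": ["Retrieval-Augmented Generation combines retrieval with an LLM."],
    "contexts": [["RAG retrieves documents and passes them to the model as context."]],
    "ground_truth": ["Retrieval-Augmented Generation grounds LLM answers in retrieved documents."],
})

# Scores each sample with an LLM judge and reports one value per metric
report = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(report)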