Generation Models in RAG
The language model that generates responses from retrieved context is crucial for RAG performance.
Model Selection Criteria
When choosing a generation model for RAG, consider:
- Context Length: How much context can the model handle?
- Quality: Does it produce accurate, coherent responses?
- Speed: Latency requirements for your application
- Cost: Token pricing and inference costs
- Availability: API availability and reliability
- Capabilities: Does it support function calling, structured output, etc?
Popular Models for RAG
Proprietary Models
GPT-4 Turbo
- Context: 128K tokens
- Quality: Excellent reasoning
- Cost: $10/M input, $30/M output tokens
- Best for: High-accuracy requirements
Claude 3.5 Sonnet
- Context: 200K tokens
- Quality: Excellent, balanced
- Cost: $3/M input, $15/M output tokens
- Best for: General purpose RAG
Gemini 1.5 Pro
- Context: up to 1M tokens
- Quality: Very good
- Cost: Competitive
- Best for: Long documents
Open-Weight Models
Llama 2/3 (via API)
- Context: 4K-128K
- Quality: Good
- Cost: Low
- Best for: Cost-conscious projects
Context Length Considerations
Longer context windows allow more retrieved documents, but have tradeoffs:
| Context Length | Advantage | Disadvantage |
|---|---|---|
| 4K | Fast, cheap | Limited context |
| 8K-16K | Good cost/context balance | Fits only a handful of chunks |
| 32K-128K | More context | Higher cost/latency |
| 200K+ | Maximum context | Expensive, slower |
For RAG, you typically don’t need the longest context lengths. A 4K-8K window with good retrieval is often sufficient.
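If you manage the budget yourself, a greedy cutoff over ranked chunks is often enough. A minimal sketch, assuming chunks arrive sorted by retrieval score and using a rough four-characters-per-token estimate:

def fit_chunks_to_budget(chunks, max_context_tokens=4000):
    """Greedily keep the highest-ranked chunks that fit the token budget.

    Assumes `chunks` is sorted by retrieval score (best first) and uses
    a crude ~4-characters-per-token heuristic for length estimation.
    """
    selected, used = [], 0
    for chunk in chunks:
        estimated_tokens = len(chunk) // 4
        if used + estimated_tokens > max_context_tokens:
            break
        selected.append(chunk)
        used += estimated_tokens
    return selected

Swap the character heuristic for a real tokenizer if exact counts matter.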
Model-Specific Considerations
Instruction-Following
Models fine-tuned to follow instructions work better with RAG:
- Explicitly tell the model to use the provided context
- Format instructions clearly
- Provide examples
Hallucination Prevention
RAG reduces hallucinations by providing context, but:
- Instruct model to cite sources
- Ask it to say “I don’t know” if unsure
- Fine-tune on grounded responses
Structured Output
Some models support JSON or other structured output (see the sketch after this list):
- Helpful for downstream processing
- Enables validation
- Easier integration with systems
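A hedged sketch of requesting and validating JSON with the Anthropic Python SDK; retrieved_context and user_query are placeholders standing in for your retrieval step's output:

import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# retrieved_context and user_query are placeholders from your pipeline
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    system=(
        "Answer using only the provided context. "
        'Respond with a JSON object: {"answer": "...", "sources": ["..."]}'
    ),
    messages=[{
        "role": "user",
        "content": f"Context:\n{retrieved_context}\n\nQuestion: {user_query}",
    }],
)

try:
    parsed = json.loads(response.content[0].text)  # validate before downstream use
except json.JSONDecodeError:
    parsed = None  # retry, repair, or fall back to plain text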
Prompt Engineering for RAG
System Prompt
You are a helpful assistant that answers questions
using the provided context. Always cite your sources.
If the context doesn't contain the answer, say so
rather than making up information.

User Prompt Format
Context:
{retrieved_context}
Question: {user_query}
Please provide a detailed answer based on the context above.
If you use information from the context, cite the source.

Best Practices
- Be explicit: Tell the model to use the context
- Show examples: Few-shot examples help
- Structure clearly: Clear delimiters between sections
- Request citations: Ask for source references
- Set expectations: Define output format
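A minimal sketch of a prompt builder that applies these practices (clear delimiters, numbered source labels, an explicit citation request); build_rag_prompt and its arguments are illustrative names, not a library API:

def build_rag_prompt(chunks, query):
    """Assemble a RAG user prompt with delimiters and source labels."""
    context = "\n\n".join(
        f"[Source {i + 1}]\n{chunk}" for i, chunk in enumerate(chunks)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question: {query}\n\n"
        "Please provide a detailed answer based on the context above. "
        "If you use information from the context, cite the source "
        "(e.g., [Source 1])."
    )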
Token Efficiency
A typical RAG prompt includes 2-5 retrieved chunks as context:
Input tokens:
System prompt: ~100 tokens
Context: ~1000-2000 tokens (2-5 chunks)
Query: ~50-100 tokens
Total: ~1200-2200 tokens
Output tokens:
Typical response: 200-500 tokens

Model Configuration
Important parameters when calling the model:
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# system_prompt and user_prompt come from the prompt engineering section above
response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,   # limit output length
    temperature=0.3,   # lower = more consistent
    top_p=0.9,         # nucleus sampling
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}],
)

Temperature:
- 0.0: Deterministic, focused
- 0.5: Balanced
- 1.0+: Creative, diverse
Lower temperature is usually better for RAG (we want consistency).
Cost Optimization
Strategies
- Use smaller models: e.g., Claude 3.5 Haiku instead of Claude 3 Opus
- Limit context: Only include relevant chunks
- Cached prompts: Cache system + context for repeated queries
- Batch processing: Process multiple queries together
- Fine-tune small model: Instead of using large model
Calculation
Cost per query =
(input_tokens / 1M × input_price_per_M) +
(output_tokens / 1M × output_price_per_M)
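The same calculation as a small helper, a sketch with Claude 3.5 Sonnet's rates ($3/M input, $15/M output) as illustrative defaults:

def cost_per_query(input_tokens, output_tokens,
                   input_price_per_m=3.00, output_price_per_m=15.00):
    """Estimated cost in USD; prices are quoted per million tokens."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

cost_per_query(2000, 300)  # -> 0.0105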
For 2000 input tokens + 300 output tokens with Claude 3.5 Sonnet ($3/M input, $15/M output):

(2000 / 1M × $3) + (300 / 1M × $15)
= $0.0060 + $0.0045 = $0.0105 ≈ $0.01 per query

Prompt caching can reduce costs by 80-90% if you’re making multiple queries with the same context.
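With the Anthropic API, the cacheable prefix (system prompt plus shared context) is marked with cache_control; a sketch reusing the client and variables from the configuration example above (older SDK versions may require the prompt-caching beta header):

# Mark the static prefix as cacheable so repeated queries over the same
# context reuse it instead of paying full input price each time.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": f"{system_prompt}\n\nContext:\n{retrieved_context}",
        "cache_control": {"type": "ephemeral"},  # cached across repeated calls
    }],
    messages=[{"role": "user", "content": user_query}],
)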
Evaluating Generation Quality
Key metrics to track:
- Accuracy: Does response match retrieved facts?
- Relevance: Does it answer the actual question?
- Faithfulness: Is it grounded in provided context?
- Readability: Is it well-formatted and clear?
- Citation Quality: Are sources cited correctly?
Always evaluate against ground truth answers when possible.
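As a starting point, a naive word-overlap heuristic can flag obviously ungrounded answers; this is purely illustrative, and production pipelines typically use an LLM judge or an NLI model for faithfulness:

def naive_faithfulness(answer, context, min_overlap=0.5):
    """Fraction of answer sentences whose words mostly appear in the
    retrieved context. A crude proxy for faithfulness, not a real metric."""
    context_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = 0
    for sentence in sentences:
        words = set(sentence.lower().split())
        if words and len(words & context_words) / len(words) >= min_overlap:
            grounded += 1
    return grounded / len(sentences)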