
Generation Models in RAG

The language model that generates responses from retrieved context is crucial for RAG performance.

Model Selection Criteria

When choosing a generation model for RAG, consider:

  • Context Length: How much context can the model handle?
  • Quality: Does it produce accurate, coherent responses?
  • Speed: Latency requirements for your application
  • Cost: Token pricing and inference costs
  • Availability: API availability and reliability
  • Capabilities: Does it support function calling, structured output, etc.?

Proprietary Models

GPT-4 Turbo

  • Context: 128K tokens
  • Quality: Excellent reasoning
  • Cost: $10 (input) / $30 (output) per M tokens
  • Best for: High-accuracy requirements

Claude 3.5 Sonnet

  • Context: 200K tokens
  • Quality: Excellent, balanced
  • Cost: $3 (input) / $15 (output) per M tokens
  • Best for: General purpose RAG

Gemini Pro

  • Context: 1M tokens (experimental)
  • Quality: Very good
  • Cost: Competitive
  • Best for: Long documents

Llama 2/3 (via API)

  • Context: 4K-128K
  • Quality: Good
  • Cost: Low
  • Best for: Cost-conscious projects

Context Length Considerations

Longer context windows allow more retrieved documents, but have tradeoffs:

| Context Length | Advantage | Disadvantage |
| --- | --- | --- |
| 4K | Fast, cheap | Limited context |
| 8K-16K | Balanced | Standard |
| 32K-128K | More context | Higher cost/latency |
| 200K+ | Maximum context | Expensive, slower |

For RAG, you typically don’t need the longest context lengths. A 4K-8K window with good retrieval is often sufficient.

Model-Specific Considerations

Instruction-Following

Models fine-tuned to follow instructions work better for RAG. When prompting them:

  • Explicitly tell the model to use the provided context
  • Format instructions clearly
  • Provide examples

Hallucination Prevention

RAG reduces hallucinations by grounding the model in retrieved context, but additional safeguards help:

  • Instruct model to cite sources
  • Ask it to say “I don’t know” if unsure
  • Fine-tune on grounded responses

Structured Output

Some models support JSON/structured output (see the sketch after this list):

  • Helpful for downstream processing
  • Enables validation
  • Easier integration with systems
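
As one illustration of why structured output helps downstream processing, the sketch below asks the model for a small JSON object and parses it defensively. The `answer`/`sources` schema and the instruction wording are assumptions for this example, not part of any model's API.

```python
import json

# Hypothetical instruction appended to the system prompt; the schema
# (an "answer" string plus a list of "sources") is illustrative only.
structured_instruction = (
    "Answer using only the provided context. Respond with JSON of the form "
    '{"answer": "...", "sources": ["..."]} and nothing else.'
)

def parse_structured_response(raw_text: str) -> dict:
    """Parse the model's JSON reply; fall back to plain text if parsing fails."""
    try:
        parsed = json.loads(raw_text)
    except json.JSONDecodeError:
        return {"answer": raw_text, "sources": []}
    if not isinstance(parsed, dict):
        return {"answer": raw_text, "sources": []}
    # Validation step: make sure the expected keys are present.
    parsed.setdefault("answer", "")
    parsed.setdefault("sources", [])
    return parsed
```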

Prompt Engineering for RAG

System Prompt

You are a helpful assistant that answers questions using the provided context. Always cite your sources. If the context doesn't contain the answer, say so rather than making up information.

User Prompt Format

```
Context:
{retrieved_context}

Question: {user_query}

Please provide a detailed answer based on the context above. If you use information from the context, cite the source.
```
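
A minimal sketch of assembling that user prompt from retrieved chunks. The chunk dictionary keys (`source`, `text`) and the `---` delimiter are assumptions; adapt them to whatever your retriever returns.

```python
def build_rag_prompt(retrieved_chunks: list[dict], user_query: str) -> str:
    """Fill the user prompt template with retrieved chunks and the question."""
    context_blocks = [
        f"[Source: {chunk['source']}]\n{chunk['text']}"  # assumed chunk fields
        for chunk in retrieved_chunks
    ]
    context = "\n\n---\n\n".join(context_blocks)  # clear delimiters between chunks
    return (
        f"Context:\n{context}\n\n"
        f"Question: {user_query}\n\n"
        "Please provide a detailed answer based on the context above. "
        "If you use information from the context, cite the source."
    )
```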

Best Practices

  • Be explicit: Tell the model to use the context
  • Show examples: Few-shot examples help
  • Structure clearly: Clear delimiters between sections
  • Request citations: Ask for source references
  • Set expectations: Define output format

Token Efficiency

RAG typically uses 1-5 complete document chunks as context:

```
Input tokens:
  System prompt:    ~100 tokens
  Context:          ~1000-2000 tokens (2-5 chunks)
  Query:            ~50-100 tokens
  Total:            ~1200-2200 tokens

Output tokens:
  Typical response: 200-500 tokens
```
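
Exact token counts depend on the tokenizer, but a rough character-based estimate (roughly 4 characters per token for English text) is usually enough for a budget check like the one above. A minimal sketch:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return max(1, len(text) // 4)

def estimate_input_tokens(system_prompt: str, chunks: list[str], query: str) -> int:
    """Sum rough estimates for every part of the request before sending it."""
    return (
        estimate_tokens(system_prompt)
        + sum(estimate_tokens(chunk) for chunk in chunks)
        + estimate_tokens(query)
    )
```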

Model Configuration

Important parameters when calling the model:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-sonnet-20240229",
    max_tokens=1024,     # Limit output length
    temperature=0.3,     # Lower = more consistent
    top_p=0.9,           # Nucleus sampling
    system=system_prompt,
    messages=[{"role": "user", "content": user_prompt}],
)
```

Temperature:

  • 0.0: Deterministic, focused
  • 0.5: Balanced
  • 1.0+: Creative, diverse

Lower temperature is usually better for RAG (we want consistency).

Cost Optimization

Strategies

  1. Use smaller models: Claude 3.5 Haiku vs Opus
  2. Limit context: Only include the most relevant chunks (see the sketch after this list)
  3. Cached prompts: Cache system + context for repeated queries
  4. Batch processing: Process multiple queries together
  5. Fine-tune a small model: Instead of relying on a larger general-purpose model
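
Strategy 2 (limit context) can be as simple as keeping only the highest-scoring chunks that fit a token budget. The sketch below assumes chunks arrive as `(score, text)` pairs and uses a rough ~4-characters-per-token estimate; both are assumptions for illustration.

```python
def select_chunks(scored_chunks: list[tuple[float, str]], max_tokens: int = 2000) -> list[str]:
    """Keep the highest-scoring chunks until the rough token budget is used up."""
    selected, used = [], 0
    for score, text in sorted(scored_chunks, key=lambda pair: pair[0], reverse=True):
        cost = max(1, len(text) // 4)  # rough ~4 chars per token estimate
        if used + cost > max_tokens:
            break
        selected.append(text)
        used += cost
    return selected
```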

Calculation

```
Cost per query = (input_tokens × input_price) + (output_tokens × output_price)

For 2,000 input tokens + 300 output tokens with Claude 3.5 Sonnet
($3/M input, $15/M output):

(2,000 × $3/M) + (300 × $15/M) = $0.0060 + $0.0045 ≈ $0.01 per query
```
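
The same arithmetic as a small helper, with prices passed in per million tokens so you can plug in any model's rates:

```python
def cost_per_query(input_tokens: int, output_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one query given per-million-token prices."""
    return (input_tokens * input_price_per_m + output_tokens * output_price_per_m) / 1_000_000

# Example from above: 2,000 input + 300 output tokens at $3/M in, $15/M out
print(round(cost_per_query(2000, 300, 3.00, 15.00), 4))  # 0.0105
```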

Prompt caching can reduce costs by 80-90% if you’re making multiple queries with the same context.
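
A minimal sketch of prompt caching with the Anthropic SDK: the large, repeated part of the prompt (system instructions plus shared context) is marked cacheable so follow-up questions over the same context reuse it. Caching support, eligible models, and minimum cacheable sizes vary, so treat the parameters below as an assumption and check the provider's prompt-caching documentation; `system_prompt`, `shared_context`, and `user_query` are illustrative variables.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # a model that supports prompt caching
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt + "\n\nContext:\n" + shared_context,
            "cache_control": {"type": "ephemeral"},  # cache this large block
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)
```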

Evaluating Generation Quality

Key metrics to track:

  • Accuracy: Does response match retrieved facts?
  • Relevance: Does it answer the actual question?
  • Faithfulness: Is it grounded in provided context?
  • Readability: Is it well-formatted and clear?
  • Citation Quality: Are sources cited correctly?

Always evaluate against ground truth answers when possible.
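
Faithfulness is the hardest of these to check by hand at scale, so a common approach is to use a second model as a judge. The sketch below is one minimal version of that idea; the judge prompt and the 0-1 scale are assumptions, not a standard metric definition.

```python
import anthropic

client = anthropic.Anthropic()

def judge_faithfulness(context: str, answer: str) -> str:
    """Ask a model to rate how well the answer is supported by the context."""
    prompt = (
        "Rate from 0 to 1 how well the Answer below is supported by the Context. "
        "Reply with only the number.\n\n"
        f"Context:\n{context}\n\nAnswer:\n{answer}"
    )
    response = client.messages.create(
        model="claude-3-sonnet-20240229",
        max_tokens=10,
        temperature=0.0,  # deterministic judging
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```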
