Evaluating RAG

Measure and improve RAG system quality with the right evaluation metrics.

Why Evaluate RAG?

Without evaluation, you can't tell if your RAG system is working well. You need metrics for:

  • Retrieval quality: Are we finding the right documents?
  • Generation quality: Is the answer accurate and relevant?
  • End-to-end: Does the system answer correctly?

Key Metrics

Retrieval Metrics

  • Recall@k: What fraction of relevant documents appear in the top-k results?
  • MRR (Mean Reciprocal Rank): On average, how highly does the first relevant result rank?
  • NDCG (Normalized Discounted Cumulative Gain): How well does the ranking order documents by relevance, weighting top positions most heavily?
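
These retrieval metrics can be computed directly from a ranked list of document IDs and a set of ground-truth relevance judgments. A minimal sketch (the `doc` IDs and the graded-relevance convention for NDCG are illustrative):

```python
import math

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant docs that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result (0.0 if none retrieved).

    MRR is the mean of this value over all queries in a test set.
    """
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], relevance: dict[str, float], k: int) -> float:
    """NDCG@k with graded relevance scores (missing docs count as 0)."""
    dcg = sum(
        relevance.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved[:k], start=1)
    )
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0

# Example: doc2 and doc5 are the only relevant documents for this query
retrieved = ["doc1", "doc2", "doc3", "doc4", "doc5"]
relevant = {"doc2", "doc5"}

print(recall_at_k(retrieved, relevant, k=3))   # 0.5 — only doc2 is in the top 3
print(reciprocal_rank(retrieved, relevant))    # 0.5 — first hit at rank 2
```

Note that `reciprocal_rank` is a per-query value; report MRR by averaging it across your whole evaluation set.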

Generation Metrics

  • Faithfulness: Is every claim in the answer supported by the retrieved context? (no hallucination)
  • Answer Relevance: Does the answer address the question?
  • Context Relevance: Is the retrieved context actually relevant?

RAGAs Framework

RAGAs (Retrieval Augmented Generation Assessment) is a popular evaluation framework that uses LLMs as judges to score RAG pipelines automatically, largely without requiring human-written reference answers.

Example

python
import anthropic

client = anthropic.Anthropic()

def evaluate_faithfulness(question: str, answer: str, context: str) -> float:
    """
    Check if the answer is faithful to the context (0-1 score).
    High score = answer is grounded in context (no hallucination).
    """
    prompt = f"""Rate how faithful this answer is to the provided context.

Context: {context}

Question: {question}
Answer: {answer}

Score from 0.0 to 1.0 where:
1.0 = Everything in the answer is directly supported by the context
0.5 = Answer is partially supported but adds some unsupported claims
0.0 = Answer contradicts or ignores the context entirely

Respond with ONLY a decimal number between 0.0 and 1.0."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        # Clamp so an out-of-range reply still yields a valid score
        return min(1.0, max(0.0, float(response.content[0].text.strip())))
    except ValueError:
        return 0.5  # neutral fallback if the reply isn't a number

def evaluate_answer_relevance(question: str, answer: str) -> float:
    """Rate how relevant the answer is to the question (0-1 score)."""
    prompt = f"""Rate how relevant this answer is to the question.

Question: {question}
Answer: {answer}

Score 0.0 to 1.0. Respond with ONLY a decimal."""

    response = client.messages.create(
        model="claude-3-5-haiku-20241022",
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}]
    )
    try:
        # Clamp so an out-of-range reply still yields a valid score
        return min(1.0, max(0.0, float(response.content[0].text.strip())))
    except ValueError:
        return 0.5  # neutral fallback if the reply isn't a number

def evaluate_rag_response(
    question: str,
    answer: str,
    context: str
) -> dict:
    """Comprehensive RAG evaluation."""
    faithfulness = evaluate_faithfulness(question, answer, context)
    relevance = evaluate_answer_relevance(question, answer)

    overall = (faithfulness + relevance) / 2

    return {
        "faithfulness": faithfulness,
        "answer_relevance": relevance,
        "overall_score": overall,
        "quality": "good" if overall > 0.7 else "fair" if overall > 0.4 else "poor"
    }

# Test evaluation
result = evaluate_rag_response(
    question="What is RAG?",
    answer="RAG stands for Retrieval Augmented Generation, which combines document retrieval with LLM generation.",
    context="RAG (Retrieval Augmented Generation) is a technique that enhances LLM responses by retrieving relevant documents first."
)
print(result)