Measuring RAG Performance with Advanced Evaluation Metrics

Introduction

In evaluating a RAG model, assessing whether the output response is accurate is just one of many factors to consider. An effective evaluation approach should not only focus on measuring the relevance of the retrieved document, but also on how the ranker prioritized useful information and on the similarity of the output response to the answer. Metrics such as Answer Correctness will help determine the factual correctness and semantic equivalence of the output response. Meanwhile, Context Relevancy is used to assess the relevancy of the retrieved context to the input question of the user. On the other hand, Hit Rate checks whether a relevant document is among the top-k retrieved documents, while MRR assesses the ranking of the relevant document.

Pioneers in Cloud Consulting & Migration Services

Reduced infrastructural costs
Accelerated application deployment

Get Started

Metric 1: Answer Correctness

What It Measures

Answer Correctness evaluates how close the generated answer is to the ground truth answer. It combines both factual overlap and semantic similarity to provide an overall measure of answer quality.

Formulation

Where:

F1 overlap measures the token-level similarity between the generated answer and the ground truth
Semantic Similarity measures meaning-level similarity using embeddings
w₁ and w₂ are weighting factors

Example

Ground Truth: “The boiling point of water is 100°C at standard atmospheric pressure.”

Generated Answer: “Water boils at 100 degrees Celsius under normal pressure conditions.”

Even though the wording is different, both sentences convey nearly the same meaning.

A higher score indicates the generated answer is both factually and semantically close to the correct answer.

Metric 2: Context Relevancy

What It Measures

Context Relevancy assesses the degree of relevance between the retrieved context and the question. Context Precision considers the relevancy of each chunk based on ground truth, while Context Relevancy assesses its relevancy to the question without any reference point.

Formulation

An LLM judges which sentences in the context are relevant to answering the given question.

Why This Matters

If your retriever pulls in long, multi-topic documents, a lot of the content is noise. High context relevancy means the retrieved passages are tightly focused, which generally leads to better generation because the LLM has less irrelevant content to navigate.

Metric 3: Hit Rate (Recall@k)

What It Measures

The Hit Rate measure tests if any relevant documents are present among the top k retrieved documents. This method is straightforward and quick for evaluating the efficiency of the retriever.

Formulation

Where:

Q = set of all questions in the evaluation set
The indicator function equals 1 if at least one relevant document is in the top-k, 0 otherwise

Example: With , if 80 out of 100 questions had at least one relevant document in their top-5 results:

Hit Rate is a coarse but fast metric, useful for quick sanity checks on your retriever.

Metric 4: Mean Reciprocal Rank (MRR)

What It Measures

MRR is not just concerned with whether a document relevant to the query appears in the top-k positions, but also with its ranking within that range. A document relevant to the query that ranks #1 is far more useful than one that ranks #5.

Formulation

Where:

the rank of the first relevant document retrieved for question q
If no relevant document is found, the reciprocal rank for that query is 0

Example: For 3 questions, the first relevant document appears at ranks 1, 3, and 2:

Metric 5: Semantic Similarity (Embedding-Based)

What It Measures

Semantic Similarity refers to the degree of similarity between two texts in terms of meaning, despite differences in language. Embeddings are used instead of exact word matching. This metric is widely used in the assessment of Answer Correctness and Answer Relevancy.

Formulation

Where:

embed(A) = embedding of generated answer
embed (GT) = embedding of ground truth answer

Why Token Overlap Is Not Enough

Token overlap alone cannot accurately capture meaning.

Example

Ground Truth: “The car is very fast.”
Generated Answer: “The vehicle moves quickly.”

Even though the wording is different, both sentences convey nearly the same meaning. Semantic Similarity helps capture this meaning-level alignment.

RAGAS implementation

1. Import Required Libraries

Initialize Amazon Bedrock Client

Configure Embedding and LLM Models

Prepare and Store Documents in Vector Database

Prepare Evaluation Dataset

Run RAGAS Evaluation

Save and Analyze Results

Conclusion

The improved evaluation techniques for RAG allow gaining a better understanding of the quality of the system, not only in terms of the correctness of retrieving data but also in regard to the correctness of the answer provided by the model.

For example, metrics such as Answer Correctness and Semantic Similarity allow us to estimate the closeness of the response given by the system to the correct one from the semantic point of view. On the other hand, metrics such as Context Relevancy can be used to evaluate whether the context is actually relevant to the question asked by the user, while MRR and Hit rate will estimate the efficiency and quality of the retrieved data.

Drop a query if you have any questions regarding RAG, and we will get back to you quickly.

Making IT Networks Enterprise-ready – Cloud Management Services

Accelerated cloud migration
End-to-end view of the cloud environment

Get Started

About CloudThat

CloudThat is an award-winning company and the first in India to offer cloud training and consulting services worldwide. As an AWS Premier Tier Services Partner, AWS Advanced Training Partner, Microsoft Solutions Partner, and Google Cloud Platform Partner, CloudThat has empowered over 1.1 million professionals through 1000+ cloud certifications, winning global recognition for its training excellence, including 20 MCT Trainers in Microsoft’s Global Top 100 and an impressive 14 awards in the last 9 years. CloudThat specializes in Cloud Migration, Data Platforms, DevOps, Security, IoT, and advanced technologies like Gen AI & AI/ML. It has delivered over 750 consulting projects for 850+ organizations in 30+ countries as it continues to empower professionals and enterprises to thrive in the digital-first world.

FAQs

1. What is the purpose of answer correctness in RAG evaluation?

ANS: – The metric of Answer Correctness captures the resemblance between the generated answer and the ground truth answer using factual overlap and semantic similarity.

2. What is the importance of Context Relevancy in the RAG pipeline?

ANS: – The importance of Context Relevancy lies in helping determine whether the selected context is relevant to answering the user’s question.

3. What is the meaning of Hit Rate in retrieval metrics?

ANS: – This metric evaluates whether at least one relevant document exists among the top-k retrieved documents.

WRITTEN BY Aniket Bembale

Aniket Bembale is Senior Research Associate – Data & AIoT at CloudThat, focusing on Generative AI, Agentic AI solutions and Cloud Computing. He is involved in building scalable AI-driven applications and implementing modern data and AI technologies to solve real-world business challenges.