You, as a data scientist, have developed a RAG system and want to evaluate the quality of the responses it generates. What methodology should you adopt? At this point, we’re not evaluating how good your retrieval or context generation is, but rather focusing solely on the final response.
A great way to evaluate these responses is by using BERTScore, and that’s the topic of today’s post. At the end of this post, in the References section, you’ll find the original article that served as the basis for writing this.
The only requirement to use this evaluation metric is having a set of question-answer pairs. With that, you will generate new responses for these questions using the language model you wish to evaluate.
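For instance, the setup might look like the sketch below, where `generate_answer` is a hypothetical stand-in for whatever produces your RAG system's final response; all that matters for the metric is ending up with aligned lists of generated and expected answers.

```python
# Hypothetical stand-in for the final response generation of your RAG pipeline.
def generate_answer(question: str) -> str:
    ...

# The question-answer pairs you already have (the only requirement).
qa_pairs = [
    ("Who is Rob Halford?",
     "Rob Halford is known as the singer of the famous metal band Judas Priest."),
    # ...
]

# Generate a new answer for every question with the model under evaluation,
# keeping the expected answers aligned by index.
answers = [generate_answer(question) for question, _ in qa_pairs]
expected_answers = [expected for _, expected in qa_pairs]
```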
BERTScore allows you to compare a tokenized reference sentence (in our case, the expected answer) against a tokenized candidate sentence (in our case, the answer produced by the LLM). It does this by extracting contextual embeddings for each token of both sentences and measuring the cosine similarity between them, weighted by inverse document frequency (IDF).
The embeddings are extracted from BERT, and the calculations performed with these embeddings are described below:
- Recall ($R_{BERT}$): each token in the reference sentence $x$ is compared with all the tokens in the candidate sentence $\hat{x}$. The goal is to find, for each token $x_i$ in $x$, the most similar token in $\hat{x}$, as per the equation below, where $\mathbf{x}_i^{\top}\hat{\mathbf{x}}_j$ denotes the cosine similarity between the pre-normalized embeddings of $x_i$ and $\hat{x}_j$:

  $$R_{BERT} = \frac{\sum_{x_i \in x} \mathrm{idf}(x_i)\, \max_{\hat{x}_j \in \hat{x}} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j}{\sum_{x_i \in x} \mathrm{idf}(x_i)}$$
- Precision ($P_{BERT}$): similarly, each token in the candidate sentence $\hat{x}$ is matched to its most similar token in the reference sentence $x$, as shown below:

  $$P_{BERT} = \frac{\sum_{\hat{x}_j \in \hat{x}} \mathrm{idf}(\hat{x}_j)\, \max_{x_i \in x} \mathbf{x}_i^{\top} \hat{\mathbf{x}}_j}{\sum_{\hat{x}_j \in \hat{x}} \mathrm{idf}(\hat{x}_j)}$$
- F1 Score ($F_{BERT}$): the harmonic mean used to balance recall and precision, according to the equation below:

  $$F_{BERT} = 2\,\frac{P_{BERT} \cdot R_{BERT}}{P_{BERT} + R_{BERT}}$$
Note that the equations for $R_{BERT}$ and $P_{BERT}$ include a weighting by IDF. The idea behind this weighting is to give more relevance to less frequent words, since these are often more indicative of similarity than common words (such as articles and prepositions). The IDF of a token $w$ is computed as

$$\mathrm{idf}(w) = -\log \frac{1}{M} \sum_{i=1}^{M} \mathbb{I}\left[w \in x^{(i)}\right]$$

where $M$ is the number of sentences in the reference corpus, and $\mathbb{I}[w \in x^{(i)}]$ is an indicator function that equals 1 if the token $w$ is present in sentence $x^{(i)}$, and 0 otherwise. The result is a weight that grows for words that appear less frequently in the corpus.
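To make these equations concrete, here is a minimal sketch of the computation in Python. The tiny corpus, the token lists, and the random embeddings are all made-up stand-ins (real contextual embeddings come from BERT); the goal is only to show how the IDF weights and the greedy token matching combine into recall, precision, and F1.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- IDF weights ---
# Toy reference corpus with M = 3 tokenized sentences, only to illustrate the
# idf formula above; in practice this is the full set of reference answers.
corpus = [
    ["rob", "halford", "is", "a", "singer"],
    ["he", "is", "a", "heavy", "metal", "vocalist"],
    ["judas", "priest", "is", "a", "metal", "band"],
]
M = len(corpus)

def idf(token: str) -> float:
    # -log of the fraction of reference sentences containing the token:
    # frequent tokens ("is", "a") get weight ~0, rare tokens get larger weights.
    return -np.log(sum(token in sent for sent in corpus) / M)

# --- Greedy matching ---
# Reference and candidate sentences as tokens, plus dummy contextual embeddings
# standing in for the BERT embeddings (random here, L2-normalized).
ref_tokens = ["rob", "halford", "is", "a", "singer"]
cand_tokens = ["rob", "halford", "is", "a", "metal", "vocalist"]

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

ref_emb = normalize(rng.normal(size=(len(ref_tokens), 768)))
cand_emb = normalize(rng.normal(size=(len(cand_tokens), 768)))

ref_idf = np.array([idf(t) for t in ref_tokens])
cand_idf = np.array([idf(t) for t in cand_tokens])

# Cosine similarity between every pair of tokens (dot product of normalized vectors).
sim = ref_emb @ cand_emb.T

# Each token is matched to its most similar counterpart, weighted by idf.
recall = (ref_idf * sim.max(axis=1)).sum() / ref_idf.sum()
precision = (cand_idf * sim.max(axis=0)).sum() / cand_idf.sum()
f1 = 2 * precision * recall / (precision + recall)

print(recall, precision, f1)
```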
You can easily use this in Python; just install the library with:
pip install bert-score
And use it as follows:
from bert_score import score

# Generated answers and their corresponding expected answers, aligned by index.
answers = [...]
expected_answers = [...]

# Returns one precision, recall, and F1 value per answer pair.
P, R, F1 = score(answers, expected_answers, lang='en', rescale_with_baseline=True)
print(P.mean().item(), R.mean().item(), F1.mean().item())
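Here, `score` returns one precision, recall, and F1 value per candidate-reference pair, as PyTorch tensors. Two options worth knowing: `idf=True` enables the IDF weighting discussed above (it is off by default), and `rescale_with_baseline=True` linearly rescales the raw scores against a precomputed baseline so that they cover a more interpretable range without changing their ranking.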
Let’s look at an example. Consider the following candidate answers to the question “Who is Rob Halford?”:
- Candidate 1: Rob Halford is the lead vocalist of the heavy metal band Judas Priest. He is widely recognized for his powerful voice and stage presence, becoming an iconic figure in the metal genre. Throughout his career, Halford has helped define the sound and aesthetics of heavy metal, influencing countless bands and artists.
- Candidate 2: Rob Halford is a jazz musician famous for playing the trumpet in various jazz bands. He is best known for his smooth melodies and improvisation skills, which have made him a renowned figure in the jazz world.
- Candidate 3: Rob Halford was an English outlaw and folk hero, best known for robbing from the rich and giving to the poor. He became a legendary figure in medieval England, often associated with Sherwood Forest and his band of Merry Men.
Now let’s adopt the following answer as the correct one:
- Expected Answer: Rob Halford is known as the singer of the famous metal band Judas Priest. With his distinctive voice and energetic performances, he has become a legend in the world of metal. His contributions were crucial in shaping the style and image of heavy metal, inspiring many musicians over the years.
Notice that among the candidates, the first is correct, providing a good answer to the question. The second is incorrect but still treats Rob Halford as a musician. The third is entirely wrong.
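The scores reported below were obtained with the same kind of call shown earlier; a sketch of how to reproduce them follows. The candidate and reference strings are abbreviated here, so use the full texts above, and the exact numbers may vary slightly depending on the model version used by the library.

```python
from bert_score import score

# Abbreviated here; use the full candidate answers and expected answer from above.
candidates = [
    "Rob Halford is the lead vocalist of the heavy metal band Judas Priest. ...",
    "Rob Halford is a jazz musician famous for playing the trumpet ...",
    "Rob Halford was an English outlaw and folk hero ...",
]
reference = "Rob Halford is known as the singer of the famous metal band Judas Priest. ..."

# Score every candidate against the same expected answer.
P, R, F1 = score(candidates, [reference] * len(candidates),
                 lang='en', rescale_with_baseline=True)

for i, (p, r, f) in enumerate(zip(P.tolist(), R.tolist(), F1.tolist()), start=1):
    print(f"Candidate {i}: recall={r:.3f} precision={p:.3f} F1={f:.3f}")
```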
These three candidates produced the following metrics:
| Candidate | Recall | Precision | F1 |
|---|---|---|---|
| 1 | 0.670 | 0.666 | 0.668 |
| 2 | 0.434 | 0.377 | 0.406 |
| 3 | 0.176 | 0.200 | 0.189 |
For more details, I invite you to read the article referenced below. Thanks for reading!