Figure 1. Adopted Approaches in the Applied Strategies
Two approaches were adopted for document vectorization: TF-IDF and sentence-BERT. TF-IDF was applied using the Scikit-Learn library. To generate the embeddings via BERT, the pre-trained model multi-qa-mpnet-base-dot-v1 was used. This model was fine-tuned for semantic search on 215 million question-answer pairs from various sources; it accepts inputs of up to 512 word pieces and produces 768-dimensional embeddings.
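For illustration, a minimal sketch of the two vectorization steps, assuming a small placeholder corpus (the `documents` list below is hypothetical; the Scikit-Learn and sentence-transformers calls are standard):

```python
# Minimal sketch of both vectorization approaches; `documents` is a placeholder corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer

documents = ["first document text", "second document text"]  # stand-in for the 1,239 documents

# TF-IDF with Scikit-Learn: one sparse row per document, one column per vocabulary term
tfidf = TfidfVectorizer()
tfidf_vectors = tfidf.fit_transform(documents)

# Dense embeddings with the pre-trained sentence-BERT model (768 dimensions per document)
model = SentenceTransformer("multi-qa-mpnet-base-dot-v1")
bert_vectors = model.encode(documents)
```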
Two types of indexing were applied: Flat and IVFFlat, both available in the FAISS library.
With Flat indexing, search is conducted by calculating the similarity between the query and every document. IVFFlat indexing, in contrast, clusters the documents based on Euclidean distance; 10 clusters were generated in this work. During an IVFFlat search, the query is first compared to the centroid of each cluster, and similarity is then calculated only against the documents in the cluster to which the query was assigned, thus reducing the search space. Under both approaches, the search ranks the relevant documents according to the similarity found.
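A minimal sketch of how the two index types can be built with FAISS, assuming the 768-dimensional BERT vectors; the random data is a stand-in for the real embeddings, and `nlist=10` follows the 10 clusters used in this work:

```python
# Sketch of the two FAISS index types; random vectors stand in for the real embeddings.
import faiss
import numpy as np

d = 768                                        # embedding dimensionality (BERT case)
vectors = np.random.rand(1239, d).astype("float32")
faiss.normalize_L2(vectors)                    # with unit vectors, inner product = cosine similarity

# Flat: exhaustive search, the query is compared against every document
flat_index = faiss.IndexFlatIP(d)
flat_index.add(vectors)

# IVFFlat: documents are clustered into nlist cells; the query is compared to the
# centroids first and then only to documents in its assigned cluster
nlist = 10
quantizer = faiss.IndexFlatL2(d)               # Euclidean distance for the clustering step
ivf_index = faiss.IndexIVFFlat(quantizer, d, nlist)
ivf_index.train(vectors)                       # learn the 10 centroids
ivf_index.add(vectors)

query = vectors[:1]                            # stand-in query vector
scores, ids = flat_index.search(query, 10)     # top-10 most similar documents
```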
The ranking function that determines the relevance of a document given a query is the cosine similarity, given by the following equation:
\begin{equation} sim(\vec{q}, \vec{d_j}) = \frac{\vec{q} \cdot \vec{d_j}}{|\vec{q}| \times |\vec{d_j}|} \end{equation}
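A direct NumPy implementation of this ranking function, for illustration:

```python
# Cosine similarity between a query vector and a document vector, as in the equation above.
import numpy as np

def cosine_similarity(q: np.ndarray, d_j: np.ndarray) -> float:
    return float(np.dot(q, d_j) / (np.linalg.norm(q) * np.linalg.norm(d_j)))
```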
The adopted strategies are evaluated through precision, recall, precision at 5 and at 10 retrieved documents (P@5 and P@10), the precision vs. recall curve, the precision-R histogram, and the Mean Reciprocal Rank (MRR).
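As an illustration, minimal implementations of two of these metrics, P@k and MRR; the ranked result lists and relevance judgments are placeholders:

```python
# Sketch of P@k and MRR; `ranked_ids` and `relevant_ids` stand in for real judgments.

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k

def mean_reciprocal_rank(all_ranked, all_relevant):
    """Average over queries of 1/rank of the first relevant document (0 if none retrieved)."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(all_ranked, all_relevant):
        for rank, doc in enumerate(ranked_ids, start=1):
            if doc in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(all_ranked)
```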
All the libraries mentioned in Section 3 are available for Python, so all experiments were conducted in this language. The FAISS library, which provides the indexing and search options, can run on CPU or GPU; in this work, CPU processing was used. The machine used had the following configuration: a Ryzen 5 3600 processor (6 cores, 12 threads, 3.6 GHz) and two 16 GB DDR4 RAM modules at 2666 MHz.
All the code used to run the experiments described in this work is available on GitHub.
The vectors produced by sentence-BERT have 768 dimensions, while those produced by TF-IDF have 13,171 dimensions for the documents without preprocessing and 13,141 dimensions for the documents with stopword removal.
Among the 1,239 documents, 79 (6.4%) were truncated when no preprocessing was applied, and 25 (2%) when stopwords were removed.
Figure 2 presents the boxplot of the precision results for the adopted strategies. This chart provides a visual summary of the precision statistics over all queries. The average precision ranged from 20.1% (IVFFlat indexing with TF-IDF vectorization and stopword removal) to 27.1% (Flat indexing with BERT vectorization and no preprocessing).
Figure 2. Precision
A two-tailed t-test for the means of two independent samples was used to check for a statistical difference between the experiments conducted with and without stopword removal. The null hypothesis for this test is that the two independent samples have identical means. In all cases, the observed p-value was greater than 0.05 (5% significance level), so the null hypothesis cannot be rejected.
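The test itself is a one-liner with SciPy; the per-query precision values below are placeholders for the real data:

```python
# Two-tailed t-test for the means of two independent samples; values are placeholders.
from scipy.stats import ttest_ind

precision_no_preproc = [0.27, 0.31, 0.18, 0.22]    # per-query precision, no preprocessing
precision_no_stopwords = [0.25, 0.30, 0.20, 0.21]  # per-query precision, stopwords removed

stat, p_value = ttest_ind(precision_no_preproc, precision_no_stopwords)
if p_value > 0.05:
    print("Cannot reject the null hypothesis of identical means")
```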
Figure 3 presents the recall results. The t-test was applied to these results and, as with precision, no statistical difference (p-value > 0.05) was detected between the strategies with and without stopword removal.
Figure 3. Recall
The precision vs. recall curves are presented in Figure 4. This chart is generated from the average precision at the 11 standard recall levels (from 0% to 100%). The chart on the left shows the strategies whose vectors were generated from documents without preprocessing, while the chart on the right shows the strategies that used texts with stopwords removed. Note that the precision values of the strategies at the higher recall levels (>60%) swap between the two charts: while the strategy represented by the black line ends at 100% recall with precision above 10% in the left chart, the same strategy ends with precision below 10% in the right chart. The same pattern can be observed in the other strategies.
Figure 4. Precision vs Recall
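A sketch of the 11-point interpolated precision behind these curves, for a single query (the chart averages this over all queries); the raw recall/precision points are assumed to come from walking down the ranked result list:

```python
# 11-point interpolated precision for one query; interpolated precision at recall
# level r is the maximum precision observed at any recall >= r.
import numpy as np

def eleven_point_curve(recalls, precisions):
    recalls, precisions = np.asarray(recalls), np.asarray(precisions)
    levels = np.linspace(0.0, 1.0, 11)
    interpolated = [precisions[recalls >= r].max() if (recalls >= r).any() else 0.0
                    for r in levels]
    return levels, np.array(interpolated)
```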
Figures 5 and 6 present violin plots of the precision results at 5 and 10 documents, respectively. This type of chart displays a statistical summary together with the distribution of the results. In most cases, there is no difference between the quartiles of the blue (no preprocessing) and gray (no stopwords) distributions, which is corroborated by the t-test results, where in all cases p-value > 0.05.
Figure 5. Precision in 5 Documents
Figure 6. Precision in 10 Documents
The MRR (Mean Reciprocal Rank) is presented in Figure 7. With the documents from which stopwords were removed, fewer relevant documents were retrieved in the top positions, and more queries returned no relevant document within the first five positions. This was observed in all strategies except the one using Flat indexing and TF-IDF vectorization.
Figure 7. Mean Reciprocal Rank (MRR)
Considering the average results presented so far, two strategies proved more robust under the conditions of the experiment: Flat indexing with BERT vectorization and no preprocessing, and Flat indexing with TF-IDF vectorization and stopword removal. Figure 8 presents the precision-R histogram for the first 20 queries. The first strategy has higher precision than the second in 50% of the queries, the second is higher in 30%, and they tie in 20%.
Figure 8. Precision-R Histogram
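For reference, the per-query precision-R value can be computed by reusing the `precision_at_k` sketch above, with k set to the number of documents relevant to the query:

```python
def r_precision(ranked_ids, relevant_ids):
    """Precision at R, where R is the number of documents relevant to the query."""
    return precision_at_k(ranked_ids, relevant_ids, len(relevant_ids))
```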
The last factor evaluated was the execution time of the 1,239 queries for each strategy. Table 1 presents the times in seconds. The strategies using IVFFlat indexing were on average 6.5 times faster than those using Flat indexing. Regarding the type of vectorization, the speed gain of IVFFlat over Flat was larger with TF-IDF vectorization (7.1x) than with BERT vectorization (5.8x).
Table 1. Total Execution Time of the Queries (s)
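A sketch of how the total query time per strategy might be measured, reusing the `flat_index` and `vectors` names from the FAISS sketch above (the query batch is a stand-in):

```python
# Illustrative timing of a batch of queries against one index.
import time

queries = vectors                               # stand-in: the 1,239 vectorized queries

start = time.perf_counter()
scores, ids = flat_index.search(queries, 10)    # FAISS searches can be batched
elapsed = time.perf_counter() - start
print(f"Total execution time: {elapsed:.2f} s")
```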
Among all the strategies tested, the best overall results were achieved by combining Flat indexing with vector representation via sentence-BERT. However, Flat search time is directly proportional to the size of the collection; this type of search is therefore advantageous when precision matters more than speed and/or when the collection is small.
As observed, there was no statistically significant difference in the results of the experiments conducted here, whether stopword removal was applied or no preprocessing technique was used.
The FAISS library proved to be a good tool for indexing and searching documents. Its greatest strength is ease of implementation, requiring only a few lines of code for this stage.