Requirements Similarity and Retrieval: July 2024
1 Introduction
This preprint will appear as a chapter in a book provisionally titled "Natural Language Processing for Requirements Engineering", to be published by Springer
we present and discuss the obtained results. Finally, Section 5 concludes the
chapter with future directions.
2 Linguistic Similarity
tically different but semantically similar inputs like the example requirements
described above.
The process of computing linguistic similarity for various NLP tasks, includ-
ing requirements retrieval and reuse, can be categorized into the following steps:
data pre-processing, data representation, similarity computation methods, and
performance evaluation of the developed NLP pipelines. We discuss these steps
The second part of the technique is IDF, which reduces the effect of common
words such as “the”, “and” etc. IDF values the rarity of the words in a corpus
and assigns more weight to words that occur less frequently across the pool of
documents. IDF is calculated across the entire corpus, therefore, for a given word
The TFIDF algorithm considers the lexical aspects of the input requirements
and derives the term matrix. Based on this matrix, feature vectors for
individual requirements can be extracted. It is important to highlight that the
TFIDF approach cannot capture the semantics of words or syntactic information;
it only weights the terms.
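To make the weighting concrete, the following is a minimal sketch of one common TFIDF variant. It assumes whitespace tokenization, term frequency normalized by document length, and IDF computed as log(N / df); these choices, the function name tfidf_vectors, and the example requirements are illustrative assumptions and not necessarily the exact formulation used in the chapter's pipelines.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # IDF is computed once over the entire corpus (assumed form: log(N / df)).
    idf = {term: math.log(n_docs / doc_freq[term]) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # guard against empty documents
        vectors.append([(counts[t] / length) * idf[t] for t in vocabulary])
    return vocabulary, vectors


# Hypothetical example requirements.
requirements = [
    "the system shall log the error",
    "the system shall display the warning",
    "the operator shall acknowledge the warning",
]
vocabulary, vectors = tfidf_vectors(requirements)
```

In this sketch, a word that occurs in every requirement (such as "the") receives an IDF of zero and therefore does not contribute to any feature vector, whereas a word that occurs in a single requirement receives the highest weight.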
The Continuous Bag of Words (CBOW) language model aims to predict the target
word vector based on its context, which consists of the preceding and succeeding
words. The CBOW architecture is a shallow feed-forward neural network with
three layers. The first layer takes the context words, which are averaged into
a word vector of fixed length, regardless of the context size, and projected to
the middle hidden layer. Finally, the last layer correlates the output and
improves the representation by backpropagating the error made in predicting
the middle word from its context words.
The Skip-Gram (SG) model for representing words as feature vectors is opposite
in approach to CBOW but similar in architecture. The SG language model
estimates the context words given the current word as input. The first, input
layer represents the current word, and the second, projection layer is used to
predict a range of context words. This range is defined by the window size,
i.e., the number of words to be considered for prediction. The window slides
across the given input sentence to calculate the likelihood of context words
given a specific target word [40].
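The difference between the two models can be illustrated by the training examples that a sliding window produces. The sketch below does not train a neural network; it only shows, under the assumption of a symmetric window and whitespace tokens, which inputs and prediction targets CBOW and SG would see. The function names and the example sentence are hypothetical.

```python
def windowed_contexts(tokens, window_size):
    """Return (target, context_words) for every position in the sentence."""
    result = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window_size):i]
        right = tokens[i + 1:i + 1 + window_size]
        result.append((target, left + right))
    return result


def cbow_examples(tokens, window_size):
    # CBOW: the context words are the input, the middle word is predicted.
    return [(context, target)
            for target, context in windowed_contexts(tokens, window_size)]


def skipgram_examples(tokens, window_size):
    # SG: the current word is the input, each context word is predicted.
    return [(target, context_word)
            for target, context in windowed_contexts(tokens, window_size)
            for context_word in context]


tokens = "the system shall log errors".split()
cbow = cbow_examples(tokens, window_size=1)
skipgram = skipgram_examples(tokens, window_size=1)
```

With a window size of one, CBOW predicts "system" from the context ["the", "shall"], whereas SG produces the two separate examples ("system", "the") and ("system", "shall").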
CBOW and SG are Word2Vec models that capture the semantic relationship
between words but struggle with polysemy, i.e., the same word having
multiple meanings. For example, unless specifically trained as a single
vector, both CBOW and SG models treat a term like “Windows 11” as two distinct
vectors, “Windows” and “11”, which results in only taking into account the local
context of the words. This leads to less accurate predictions and poor
performance on NLP tasks. To address such issues, researchers proposed
two enhanced distributed representation algorithms, GloVe and FastText,
which are commonly used in multiple NLP tasks. Below, we discuss them briefly.
Measure: Dice. Range: [0, 1].
  Strengths: 1. Effective for binary and sparse datasets. 2. Emphasize the relative importance of shared data items.
  Weaknesses: 1. Less effective for continuous or non-binary data. 2. Sensitive to duplicate elements.
Measure: Edit distance. Range: [0, 1].
  Strengths: 1. Effective for comparing two sets where the order of elements is essential. 2. Offer a granularity measure of
  Weaknesses: 1. Less effective for comparing numerical vectors. 2. Ignore semantics between data items.
Measure: JSI. Range: [0, 1].
  Strengths: 1. Effective for sparse data. 2. Focuses on data items overlap ratio.
  Weaknesses: 1. Sensitive to the size of data items. 2. Ignores the order of data
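The three measures in the table can be sketched in a few lines. The code below assumes that JSI denotes the Jaccard similarity index, that Dice and JSI operate on sets of tokens, and that the edit distance is the Levenshtein distance normalized to [0, 1] by the length of the longer input; the normalization and all function names are illustrative assumptions.

```python
def dice(a, b):
    """Dice coefficient over the sets of items in a and b."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a, b):
    """Jaccard similarity index: overlap ratio of the two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def levenshtein(s, t):
    """Minimum number of insertions, deletions, and substitutions."""
    previous = list(range(len(t) + 1))
    for i, s_item in enumerate(s, start=1):
        current = [i]
        for j, t_item in enumerate(t, start=1):
            cost = 0 if s_item == t_item else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(s, t):
    """Order-sensitive similarity in [0, 1] (assumed normalization)."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1 - levenshtein(s, t) / longest


# Hypothetical token sequences for two requirements.
r1 = "the system shall log errors".split()
r2 = "the system shall report errors".split()
scores = (dice(r1, r2), jaccard(r1, r2), edit_similarity(r1, r2))
```

Because Dice and JSI work on sets, they ignore the order of tokens, while the edit-based score changes when the same tokens are reordered.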
two vectors in multi-dimensional feature space and does not depend on their
magnitude. The similarity score between two non-zero vectors ranges from -1
to 1, with -1 representing opposite vectors, 0 representing no similarity (or-
thogonal vectors), and 1 indicating perfect similarity (proportional vectors).
In the case of non-negative vectors, the cosine similarity ranges from 0 to 1.
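A minimal sketch of the cosine similarity computation, assuming two non-zero feature vectors of equal length given as plain Python lists (the function name is illustrative):

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


proportional = cosine_similarity([1, 2, 3], [2, 4, 6])  # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])          # no shared component
opposite = cosine_similarity([1, 2], [-1, -2])          # opposite direction
```

Scaling either vector by a positive constant leaves the score unchanged, which reflects the independence from vector magnitude.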
Companies often do not develop products from scratch but reuse components
from existing projects. In such contexts, when a new project is to
be delivered to a new customer, finding reuse opportunities for existing software
based on similar requirements could save time and boost confidence in the end
products. Reusing existing software increases confidence in the end product
because the reused software has already been tested and proven in other
products [2]. In addition, it reduces the development, certification, and
testing time. To aid reuse in such cases, new requirements are compared to
requirements from existing projects to recommend reuse and avoid redundant
development efforts. This chapter demonstrates similarity-driven requirements
retrieval for two cases within the RE domain, as follows.
ilarity using relevant metrics, such as precision, recall, and their harmonic
mean (F1 score).
– Case 2: Requirements-driven software retrieval for reuse goes beyond
identifying similar requirements and looks into the relevance of the retrieved
software. This case demonstrates how similar requirements could be used
to retrieve software for reuse. Furthermore, this case evaluates the
similarity computation pipelines in light of the relevance of the retrieved
software by correlating requirements similarity to software similarity. This
case is demonstrated on both an industrial and a public dataset.
STSb-roberta-base-v2
Mini-LM
In basic pre-processing, the text is converted to lower case, and special characters
are replaced with a blank space; lemmatization is then applied.
Full pre-processing uses basic pre-processing and also removes stop-words.
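The two pre-processing variants can be sketched as follows. The stop-word list is a deliberately tiny, hypothetical one, and lemmatization is represented by a pluggable function with an identity placeholder, since the concrete lemmatizer is not specified here; a real pipeline would substitute an actual lemmatizer and a complete stop-word list.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "be", "is"}


def identity_lemmatizer(token):
    """Placeholder: a real pipeline would plug in an actual lemmatizer."""
    return token


def basic_preprocess(text, lemmatize=identity_lemmatizer):
    """Lower-case, replace special characters with a space, then lemmatize."""
    lowered = text.lower()
    cleaned = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return [lemmatize(token) for token in cleaned.split()]


def full_preprocess(text, lemmatize=identity_lemmatizer):
    """Basic pre-processing followed by stop-word removal."""
    return [token for token in basic_preprocess(text, lemmatize)
            if token not in STOP_WORDS]


requirement = "The system shall log errors (and warnings)!"
basic_tokens = basic_preprocess(requirement)
full_tokens = full_preprocess(requirement)
```

Keeping the lemmatizer as a parameter makes it straightforward to compare pipelines that differ only in this step.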
[Figure: workflow diagram. Recoverable labels: "realized by", "Pairs", "Source Code", "Computation", "Human-rated", "Similarity", "Pairs Selector".]
In Table 2, we also present the total number of words in the datasets and
the average number of words per requirement/phrase in three cases, i.e., un-
processed (No P.), basic pre-processing with lemmatization (Basic P.), and
full pre-processing with stop words removal (P.). In addition, to provide some
insights into the data, Figure 1 presents the top ten most frequently occurring
words in both datasets.
was selected based on its effect on the data imbalance between the similar and
non-similar groups of pairs. In other words, the chosen 60% threshold resulted
in a nearly perfectly balanced dataset. As shown in Case 1 of Figure 2, the
resulting data was subjected to similarity computation with the selected pipelines.
In particular, the first set of sentences in the pairs are used as queries to retrieve
the most similar sentences from the second set of sentences based on computed
similarity. Since the ground truth of most similar sentences to query sentences is
already available and the pairs are already grouped into similar and non-similar
pairs, we can calculate relevant standard metrics for performance evaluation,
such as precision, recall, and F1 score. In this context, True Positives (TP) are
requirements that are correctly identified as the most similar and match the ground truth.
False Negatives (FN) are instances where requirements are similar (ground truth
= True), but not identified as the most similar by the pipeline. Similarly, False
Positives (FP) represent requirement instances misidentified as the most similar
when they are actually not the most similar ones (ground truth = False).
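The metrics follow directly from these counts. The sketch below assumes one possible encoding of the evaluation: for each query, retrieved[i] states whether the pipeline returned the paired sentence as the most similar one, and ground_truth[i] states whether the pair belongs to the similar group. This encoding, the function names, and the example values are illustrative assumptions.

```python
def confusion_counts(retrieved, ground_truth):
    """Count TP, FP, and FN from per-query boolean outcomes."""
    pairs = list(zip(retrieved, ground_truth))
    tp = sum(1 for r, g in pairs if r and g)      # identified and truly similar
    fp = sum(1 for r, g in pairs if r and not g)  # identified but not similar
    fn = sum(1 for r, g in pairs if not r and g)  # similar but missed
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F1 score)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical outcomes for five queries.
retrieved = [True, True, False, True, False]
ground_truth = [True, False, True, True, False]
counts = confusion_counts(retrieved, ground_truth)
metrics = precision_recall_f1(*counts)
```

Returning zero when a denominator is empty is a convention chosen for this sketch; other conventions, such as reporting the metric as undefined, are equally valid.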
As shown in the bottom right part of Figure 2, we also compute the simi-
larity among the pairs of phrases in the STS dataset using the same similarity
measuring pipelines. We then use the association between the computed similar-
ities and human-rated similarities as a means of evaluation. The STS dataset is
quite similar to our industrial dataset as the phrases could be used to represent
requirements, and the scaled human-rated similarity could be used to repre-
sent software similarity. Therefore, both tasks could be performed on both of
the considered datasets and will enable replication. We provide our replication
package with the source code and dataset to allow replication and support
future research on the topic.
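One way to quantify the association between computed and human-rated similarities is a correlation coefficient. The sketch below uses the Pearson correlation as an assumption, since the specific association measure is not stated here; a rank-based coefficient would be an equally plausible choice. The scores shown are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant sequences")
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical scores: pipeline similarities and human ratings for four pairs.
computed = [0.91, 0.35, 0.62, 0.10]
human_rated = [4.8, 1.5, 3.0, 0.4]
association = pearson(computed, human_rated)
```

Because the coefficient is invariant to linear rescaling, the human ratings do not need to be mapped to the range of the computed similarities before the comparison.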
This section presents and discusses the results of the two considered example
cases where the similarity analysis is relevant. In particular, we first present and
discuss the performance of the pipeline in requirements reuse based on standard
metrics like precision, recall and F1 score. Note that for the requirements reuse,
we manipulated the STS benchmark for demonstration, and therefore, the results
may vary in other cases. We also present the software retrieval performance of
[Figure 3. Panels: Precision (STS), Recall (STS), F1 Score (STS); the pipeline labels on the axes are not recoverable.]
We apply the pipelines presented in Section 3.2 to the STS dataset, as described
in Section 3.3. To demonstrate the applicability of the pipelines in the context of
requirements reuse, we present standard metrics that evaluate the performance
of the various pipelines, shown in Figure 3. Below, we discuss the results briefly.
(0.78). The shorter length of the sentences and common vocabulary in the STS
dataset could explain this. On the other hand, the GLV and FT-based pipelines
with pre-processing tend to perform slightly better than TFIDF in terms of
recall and F1 score. However, we observe that the TFIDF-based pipeline with
pre-processing achieved a slightly higher precision. It is important to consider
that recall may take precedence over precision (or vice versa) in some scenarios.
[Figure: pipeline comparison plots; the only recoverable axis label is "Pipelines".]
In this section, we first present areas of future research and then conclude the
chapter with a summary.
Pre-processing and similarity. In NLP for RE, we mainly borrow standard
pre-processing pipelines from the text mining and NLP communities. These
borrowed pre-processing pipelines for textual requirements often use
domain-generic part-of-speech (POS) and entity tagging that guides
lemmatization and other tasks to eventually produce the input for similarity
computation. However, current similarity-driven tasks in the field rarely
consider software engineering-related named entity recognition (such as
classes, components, and parameters) or other meta information (such as input
and conditions in the requirements) for similarity computation. A recent
study also suggests that engineers perceive two requirements to be similar if
they share similar input processing and have similar conditions [1]. Such
additional information is currently not extracted for similarity computation;
if it were, it could guide the similarity computation in the right direction.
5.2 Conclusions
This work has been supported by and received funding from the ITEA SmartDelta [50] and the KDT AIDOaRT projects.
