Requirements Similarity and Retrieval
1 Introduction
This preprint will appear as a chapter in a book provisionally titled "Natural Language Processing for Requirements Engineering", to be published by Springer.
In Section 4, we present and discuss the obtained results. Finally, Section 5 concludes the chapter with future directions.
2 Linguistic Similarity
Purely lexical measures can fail on syntactically different but semantically similar inputs like the example requirements described above.
The process of computing linguistic similarity for various NLP tasks, includ-
ing requirements retrieval and reuse, can be categorized into the following steps:
data pre-processing, data representation, similarity computation methods, and
performance evaluation of the developed NLP pipelines. We discuss these steps
below.
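To make the pre-processing step concrete, the following minimal sketch (in Python, assuming spaCy and its en_core_web_sm model are installed; the requirement string is an invented example) implements the basic and full pre-processing variants used later in this chapter:

```python
import re
import spacy

# Assumes the model has been installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def preprocess(text: str, remove_stop_words: bool = False) -> list[str]:
    """Basic pre-processing: lower-casing, replacing special characters with a
    blank space, and lemmatization. Full pre-processing also removes stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [t.lemma_ for t in nlp(text)
            if not t.is_space and not (remove_stop_words and t.is_stop)]

req = "The system shall log all failed login attempts."  # invented example
print(preprocess(req))                          # basic pre-processing
print(preprocess(req, remove_stop_words=True))  # full pre-processing
```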
The second part of the technique is IDF, which reduces the effect of common words such as "the", "and", etc. IDF values the rarity of words in a corpus and assigns more weight to words that occur less frequently across the pool of documents. IDF is calculated across the entire corpus; therefore, for a given word w in a corpus of N documents, IDF(w) = log(N / df(w)), where df(w) is the number of documents that contain w.
The TFIDF algorithm considers the lexical aspects of the input requirements and derives the term matrix. Based on this feature matrix, feature vectors for individual requirements can be extracted. It is important to highlight that the TFIDF approach cannot capture words' semantics or syntactic information; it only weights the terms.
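As an illustration of this step, a minimal sketch using scikit-learn (the requirement strings are invented for the example) derives the TFIDF term matrix for a small pool of requirements and extracts the feature vector of an individual requirement:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative requirement pool (invented examples)
requirements = [
    "The system shall log all failed login attempts.",
    "Failed authentication attempts must be recorded in the audit log.",
    "The train shall brake automatically when a red signal is passed.",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
term_matrix = vectorizer.fit_transform(requirements)  # rows = requirements, columns = terms

# Feature vector of an individual requirement (one sparse row of the term matrix)
print(term_matrix[0].toarray())
print(vectorizer.get_feature_names_out()[:10])
```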
The Continuous Bag of Words (CBOW) language model aims to predict the target word vector based on its context, which consists of the preceding and succeeding words. The CBOW architecture is a shallow feed-forward neural network with three layers. The first layer takes the context words and produces an average word vector of fixed length, regardless of the context size, which is projected to the middle hidden layer. Finally, the last layer produces the output, and the representation is improved through backpropagation of the error to predict the middle word based on its context words.
The Skip-Gram (SG) model for representing words as feature vectors is opposite in approach to CBOW but similar in architecture. The SG language model estimates the context words given the current word as input. The first input layer represents the current word, and the second projection layer predicts a range of context words. This range is defined by the window size, which determines the words to be considered for prediction. The window slides across the given input sentence to calculate the likelihood of context words given a specific target word [40].
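Both architectures are implemented in common toolkits such as Gensim; the sketch below (with a toy tokenized corpus invented for the example; real training requires far more data) trains a CBOW and a Skip-Gram model by toggling the sg flag:

```python
from gensim.models import Word2Vec

# Toy corpus of tokenized requirements (invented; for illustration only)
sentences = [
    ["system", "shall", "log", "failed", "login", "attempts"],
    ["failed", "authentication", "attempts", "recorded", "audit", "log"],
    ["train", "shall", "brake", "automatically", "red", "signal"],
]

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # Skip-Gram

print(cbow.wv["log"][:5])                      # first entries of the vector for "log"
print(skipgram.wv.most_similar("log", topn=3)) # nearest neighbors in vector space
```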
CBOW and SG are Word2Vec models that capture the semantic relationship between words but struggle with polysemy, i.e., the same word having multiple meanings. For example, unless specifically trained as a single vector, both CBOW and SG models treat a term like "Windows 11" as two distinct vectors, "Windows" and "11", taking into account only the local context of the words. This can lead to less accurate predictions and poor performance on NLP tasks. To address such issues, researchers proposed two enhanced distributed representation algorithms, GloVe and FastText, which are commonly used in multiple NLP tasks. Below, we discuss them briefly.
3 https://nlp.stanford.edu/projects/glove/
4 https://fasttext.cc/
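As a brief illustration of FastText's ability to handle words unseen during training through subword (character n-gram) vectors, here is a sketch using Gensim's FastText implementation (toy corpus invented for the example):

```python
from gensim.models import FastText

# Toy corpus of tokenized requirements (invented; for illustration only)
sentences = [
    ["system", "shall", "log", "failed", "login", "attempts"],
    ["failed", "authentication", "attempts", "recorded", "audit", "log"],
]

model = FastText(sentences, vector_size=100, window=5, min_count=1)

# Unlike plain Word2Vec, FastText composes subword (character n-gram) vectors,
# so even a word never seen during training still gets a vector.
print(model.wv["logging"][:5])  # "logging" is out-of-vocabulary here
```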
Dice [0, 1]
  Advantages: 1. Effective for binary and sparse datasets. 2. Emphasizes the relative importance of shared data items.
  Disadvantages: 1. Less effective for continuous or non-binary data. 2. Sensitive to duplicate elements.

Edit distance [0, 1]
  Advantages: 1. Effective for comparing two sets where the order of elements is essential. 2. Offers a granular measure of similarity.
  Disadvantages: 1. Less effective for comparing numerical vectors. 2. Ignores semantics between data items.

JSI [0, 1]
  Advantages: 1. Effective for sparse data. 2. Focuses on the overlap ratio of data items.
  Disadvantages: 1. Sensitive to the size of data items. 2. Ignores the order of data items.
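The set-based and string-based measures above are simple to implement; the following sketch (plain Python, with invented token sets) computes the Dice coefficient, the Jaccard similarity index (JSI), and a length-normalized edit distance:

```python
def dice(a: set, b: set) -> float:
    """Dice coefficient: emphasizes the shared items between two sets."""
    return 2 * len(a & b) / (len(a) + len(b))

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity index (JSI): overlap ratio of two sets."""
    return len(a & b) / len(a | b)

def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

a = {"system", "shall", "log", "failed", "attempts"}   # invented token sets
b = {"failed", "attempts", "recorded", "audit", "log"}
print(dice(a, b), jaccard(a, b))
d = edit_distance("log attempt", "log attempts")
print(1 - d / max(len("log attempt"), len("log attempts")))  # normalized to [0, 1]
```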
Cosine similarity measures the cosine of the angle between two vectors in a multi-dimensional feature space and does not depend on their magnitude. The similarity score between two non-zero vectors ranges from -1 to 1, with -1 representing opposite vectors, 0 representing no similarity (orthogonal vectors), and 1 indicating perfect similarity (proportional vectors). In the case of non-negative vectors, the cosine similarity ranges between 0 and 1.
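A minimal NumPy sketch of cosine similarity over two illustrative (invented) feature vectors:

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v; independent of vector magnitude."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([0.2, 0.0, 0.7, 0.1])  # e.g., TFIDF vector of requirement A (invented)
v = np.array([0.1, 0.3, 0.6, 0.0])  # e.g., TFIDF vector of requirement B (invented)
print(cosine_similarity(u, v))      # non-negative vectors, so the score is in [0, 1]
```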
Companies often do not develop products from scratch but reuse components from existing projects. In such contexts, when a new project is to be delivered to a new customer, finding reuse opportunities for existing software based on similar requirements could save time and boost confidence in the end products. Reusing existing software increases confidence in the end product because the reused components have already been tested and proven in other products [2]. In addition, it reduces development, certification, and testing time. In such cases, to aid reuse, new requirements are compared to requirements from existing projects to recommend reuse and avoid redundant development efforts. This chapter demonstrates similarity-driven requirements retrieval for two cases within the RE domain, as follows:
– Case 1: Requirements retrieval for reuse identifies the most similar existing requirements for a given query requirement and evaluates the computed similarity using relevant metrics, such as precision, recall, and their harmonic mean (F1 score).
– Case 2: Requirements-driven software retrieval for reuse goes beyond identifying similar requirements and looks into the relevance of the retrieved software. This case demonstrates how similar requirements could be used to retrieve software for reuse. Furthermore, this case evaluates the similarity computation pipelines in light of the relevance of the retrieved software by correlating requirements similarity to software similarity. This case is demonstrated on both an industrial and a public dataset.
5 STSb-roberta-base-v2, available online at Hugging Face.
6 Mini-LM, available online at Hugging Face.
7 In basic pre-processing, the text is converted to lower case, and special characters are replaced with a blank space; this is followed by lemmatization.
8 Full pre-processing uses basic pre-processing and also removes stop words.
[Figure 2: Overview of the evaluation setup, showing requirement pairs, the source code realizing them, human-rated similarity, a pairs selector, and the similarity computation steps.]
In Table 2, we also present the total number of words in the datasets and the average number of words per requirement/phrase in three cases, i.e., unprocessed (No P.), basic pre-processing with lemmatization (Basic P.), and full pre-processing with stop-word removal (P.). In addition, to provide some insight into the data, Figure 1 presents the ten most frequently occurring words in both datasets.
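Such corpus statistics are straightforward to reproduce; here is a minimal sketch (plain Python, with an invented token list standing in for a pre-processed dataset):

```python
from collections import Counter

# Invented pre-processed tokens standing in for a whole dataset
tokens = ["system", "shall", "log", "failed", "login", "attempt",
          "failed", "attempt", "record", "audit", "log", "system"]

counts = Counter(tokens)
print(len(tokens))             # total number of words
print(counts.most_common(10))  # ten most frequently occurring words
```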
The 60% similarity threshold was selected based on its effect on the data imbalance between the similar and non-similar groups of pairs. In other words, the chosen 60% threshold resulted in a nearly perfectly balanced dataset. As shown in Case 1 of Figure 2, the resulting data was subjected to similarity computation with the selected pipelines. In particular, the first set of sentences in the pairs is used as queries to retrieve the most similar sentences from the second set based on the computed similarity. Since the ground truth of the most similar sentences for the query sentences is already available and the pairs are already grouped into similar and non-similar pairs, we can calculate relevant standard metrics for performance evaluation, such as precision, recall, and F1 score. In this context, True Positives (TP) are requirements correctly identified as the most similar, matching the ground truth. False Negatives (FN) are instances where requirements are similar (ground truth = True) but not identified as the most similar by the pipeline. Similarly, False Positives (FP) represent requirement instances misidentified as the most similar when they are actually not (ground truth = False).
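From these counts, the standard metrics follow directly; a minimal sketch (plain Python; the TP, FP, and FN counts are placeholders, not results from the chapter):

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard retrieval metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Placeholder counts for illustration
print(precision_recall_f1(tp=80, fp=20, fn=40))  # approximately (0.80, 0.67, 0.73)
```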
As shown in the bottom right part of Figure 2, we also compute the similarity among the pairs of phrases in the STS dataset using the same similarity measuring pipelines. We then use the association between the computed similarities and the human-rated similarities as a means of evaluation. The STS dataset is quite similar to our industrial dataset, as the phrases could be used to represent requirements, and the scaled human-rated similarity could be used to represent software similarity. Therefore, both tasks could be performed on both of the considered datasets, which enables replication. We provide our replication package with the source code and dataset⁹ to allow replication and support future research on the topic.
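Such an association can be quantified with a rank correlation such as Spearman's rho; a sketch using SciPy (the score lists are illustrative placeholders, not data from the chapter):

```python
from scipy.stats import spearmanr

# Placeholder scores for the same pairs (invented for illustration)
computed_similarity = [0.91, 0.40, 0.75, 0.10, 0.66]  # from a similarity pipeline
human_rated_similarity = [4.8, 2.1, 3.9, 0.5, 3.0]    # e.g., 0-5 STS ratings

rho, p_value = spearmanr(computed_similarity, human_rated_similarity)
print(rho, p_value)  # rank correlation between computed and human-rated similarity
```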
This section presents and discusses the results of the two considered example cases where similarity analysis is relevant. In particular, we first present and discuss the performance of the pipelines in requirements reuse based on standard metrics like precision, recall, and F1 score. Note that for requirements reuse, we adapted the STS benchmark for demonstration, and therefore, the results may vary in other cases. We also present the software retrieval performance of the pipelines.
9 Replication package, https://github.com/a66as/ReqSim/
[Figure 3: Precision (STS), Recall (STS), and F1 Score (STS) per pipeline. Pipelines (x-axis): bTFIDF, pTFIDF, bJSI, pJSI, bGLV, pGLV, bFT, pFT, bUSE, pUSE, bBERT-Avg, pBERT-Avg, bBERT-CLS, pBERT-CLS, bMini-LM, pMini-LM, bST-Roberta, pST-Roberta (the prefixes b and p denote the basic and full pre-processing variants, respectively).]
We apply the pipelines presented in Section 3.2 to the STS dataset, as described
in Section 3.3. To demonstrate the applicability of the pipelines in the context of
requirements reuse, we present standard metrics that evaluate the performance
of the various pipelines, shown in Figure 3. Below, we discuss the results briefly.
(0.78). The shorter length of the sentences and the common vocabulary in the STS dataset could explain this. On the other hand, the GLV- and FT-based pipelines with pre-processing tend to perform slightly better than TFIDF in terms of recall and F1 score. However, we observe that the TFIDF-based pipeline with pre-processing achieved slightly higher precision. It is important to consider that recall may take precedence over precision (or vice versa) in some scenarios.
[Figure: Correlation (Spearman's rho) between the computed and reference similarities for each pipeline, shown in two panels, one per dataset, with the same pipeline set as in Figure 3.]
In this section, we first present areas of future research and then conclude the
chapter with a summary.
Pre-processing and similarity. In NLP for RE, we mainly borrow standard pre-processing pipelines from text mining and the NLP community. These borrowed pre-processing pipelines for textual requirements often use domain-generic part-of-speech (POS) and entity tagging to guide lemmatization and other tasks that eventually produce the input for similarity computation. However, current similarity-driven tasks in the field rarely consider software engineering-specific named entity recognition (e.g., classes, components, and parameters) or other meta-information (such as inputs and conditions in the requirements) for similarity computation. A recent study also suggests that engineers perceive two requirements to be similar if they share similar input processing and have similar conditions [1]. Such additional information is currently not extracted for similarity and, if it were, could guide the similarity computation in the right direction.
5.2 Conclusions
Acknowledgements
This work has been supported by and received funding from the ITEA Smart-
Delta [50] and the KDT AIDOaRT projects.
References
1. Abbas, M., Ferrari, A., Shatnawi, A., Enoiu, E., Saadatmand, M., Sundmark, D.:
On the relationship between similar requirements and similar software: A case
study in the railway domain. Requirements Engineering 28(1), 23–47 (2023)
2. Abbas, M., Jongeling, R., Lindskog, C., Enoiu, E.P., Saadatmand, M., Sundmark,
D.: Product line adoption in industry: An experience report from the railway do-
main. In: Proceedings of the 24th ACM Conference on Systems and Software Prod-
uct Line: Volume A - Volume A. SPLC ’20, Association for Computing Machinery,
New York, NY, USA (2020). https://doi.org/10.1145/3382025.3414953
3. Abbas, M., Saadatmand, M., Enoiu, E., Sundmark, D., Lindskog, C.: Automated
reuse recommendation of product line assets based on natural language require-
ments. In: International Conference on Software and Software Reuse. pp. 173–189.
Springer (2020)
4. Abualhaija, S., Arora, C., Sabetzadeh, M., Briand, L.C., Traynor, M.: Automated
demarcation of requirements in textual specifications: a machine learning-based
approach. Empirical Software Engineering 25, 5454–5497 (2020)
5. Aizawa, A.: An information-theoretic perspective of tf–idf measures. Information
Processing & Management 39(1), 45–65 (2003)
6. Arora, C., Sabetzadeh, M., Goknil, A., Briand, L.C., Zimmer, F.: Change impact
analysis for natural language requirements: An nlp approach. In: 2015 IEEE 23rd
International Requirements Engineering Conference (RE). pp. 6–15. IEEE (2015)
7. Arsan, T., Köksal, E., Bozkus, Z.: Comparison of collaborative filtering algorithms
with various similarity measures for movie recommendation. International Journal
of Computer Science, Engineering and Applications (IJCSEA) 6(3), 1–20 (2016)
8. Balazs, J.A., Velásquez, J.D.: Opinion mining and information fusion: a survey.
Information Fusion 27, 95–110 (2016)
9. Bashir, S., Abbas, M., Ferrari, A., Saadatmand, M., Lindberg, P.: Requirements
classification for smart allocation: A case study in the railway industry. In: 31st
IEEE International Requirements Engineering Conference (September 2023)
10. Bashir, S., Abbas, M., Saadatmand, M., Enoiu, E.P., Bohlin, M., Lindberg, P.:
Requirement or not, that is the question: A case from the railway industry. In:
International Working Conference on Requirements Engineering: Foundation for
Software Quality. pp. 105–121. Springer (2023)
11. Bengio, Y., Ducharme, R., Vincent, P.: A neural probabilistic language model.
Advances in neural information processing systems 13 (2000)
12. Berry, D.M.: Empirical evaluation of tools for hairy requirements engineering tasks.
Empirical Software Engineering 26(6), 1–77 (2021)
13. Birner, B.J.: Introduction to pragmatics. John Wiley & Sons (2012)
14. Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with
subword information. Transactions of the association for computational linguistics
5, 135–146 (2017)
15. Borg, M., Wnuk, K., Regnell, B., Runeson, P.: Supporting change impact analy-
sis using a recommendation system: An industrial case study in a safety-critical
context. IEEE Transactions on Software Engineering 43(7), 675–700 (2016)
16. Bowman, S.R., Angeli, G., Potts, C., Manning, C.D.: A large annotated corpus for
learning natural language inference. arXiv preprint arXiv:1508.05326 (2015)
17. Bybee, J.: Phonology and language use, vol. 94. Cambridge University Press (2003)
18. Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: Semeval-2017 task
1: Semantic textual similarity-multilingual and cross-lingual focused evaluation.
arXiv preprint arXiv:1708.00055 (2017)
19. Cer, D., Yang, Y., Kong, S.y., Hua, N., Limtiaco, N., John, R.S., Constant, N.,
Guajardo-Cespedes, M., Yuan, S., Tar, C., et al.: Universal sentence encoder. arXiv
preprint arXiv:1803.11175 (2018)
20. Chandrasekaran, D., Mago, V.: Evolution of semantic similarity—a survey. ACM
Computing Surveys (CSUR) 54(2), 1–37 (2021)
21. Clark, K., Khandelwal, U., Levy, O., Manning, C.D.: What does bert look at? an
analysis of bert’s attention. arXiv preprint arXiv:1906.04341 (2019)
22. Natt och Dag, J., Regnell, B., Gervasi, V., Brinkkemper, S.: A linguistic-engineering approach to large-scale requirements management. IEEE Software 22(1), 32–39 (2005)
23. Davidson, D., Harman, G.: Semantics of natural language. Philosophy of language:
The central topics pp. 57–63 (2008)
24. Deza, E., Deza, M.M., Deza, M.M., Deza, E.: Encyclopedia of distances. Springer
(2009)
25. Dice, L.R.: Measures of the amount of ecologic association between species. Ecology
26(3), 297–302 (1945)
26. Efstathiou, V., Chatzilenas, C., Spinellis, D.: Word embeddings for the software
engineering domain. In: 2018 IEEE/ACM 15th International Conference on Mining
Software Repositories (MSR). pp. 38–41 (2018)
27. Elmore, K.L., Richman, M.B.: Euclidean distance as a similarity metric for prin-
cipal component analysis. Monthly weather review 129(3), 540–549 (2001)
28. Falessi, D., Cantone, G., Canfora, G.: A comprehensive characterization of nlp
techniques for identifying equivalent requirements. In: Proceedings of the 2010
ACM-IEEE international symposium on empirical software engineering and mea-
surement. pp. 1–10 (2010)
29. Falessi, D., Cantone, G., Canfora, G.: Empirical principles and an industrial case
study in retrieving equivalent requirements via natural language processing tech-
niques. IEEE Transactions on Software Engineering 39(1), 18–44 (2011)
30. Faruqui, M., Dodge, J., Jauhar, S.K., Dyer, C., Hovy, E., Smith, N.A.: Retrofitting
word vectors to semantic lexicons. arXiv preprint arXiv:1411.4166 (2014)
31. Fischbach, J., Frattini, J., Vogelsang, A., Mendez, D., Unterkalmsteiner, M.,
Wehrle, A., Henao, P.R., Yousefi, P., Juricic, T., Radduenz, J., et al.: Automatic
creation of acceptance tests by extracting conditionals from requirements: Nlp ap-
proach and case study. Journal of Systems and Software 197, 111549 (2023)
32. Guo, J., Cheng, J., Cleland-Huang, J.: Semantically enhanced software traceability
using deep learning techniques. In: 2017 IEEE/ACM 39th International Conference
on Software Engineering (ICSE). pp. 3–14. IEEE (2017)
33. Halliday, M.A.K., Webster, J.J.: On Language and Linguistics: Volume 3. A&C
Black (2003)
34. Haspelmath, M., Sims, A.: Understanding morphology. Routledge (2013)
35. Hinkle, D.E., Wiersma, W., Jurs, S.G.: Applied statistics for the behavioral sciences. Houghton Mifflin, 5th edn. (2003), https://cir.nii.ac.jp/crid/1130012535369496890
36. Ilyas, M., Kung, J.: A similarity measurement framework for requirements en-
gineering. In: 2009 Fourth International Multi-Conference on Computing in the
Global Information Technology. pp. 31–34. IEEE (2009)
37. Kotonya, G., Sommerville, I.: Requirements engineering: processes and techniques.
Wiley Publishing (1998)
38. Latif, S., Bashir, S., Agha, M.M.A., Latif, R.: Backward-forward sequence genera-
tive network for multiple lexical constraints. In: Artificial Intelligence Applications
and Innovations: 16th IFIP WG 12.5 International Conference, AIAI 2020, Neos
Marmaras, Greece, June 5–7, 2020, Proceedings, Part II 16. pp. 39–50. Springer
(2020)
39. Manning, C.D.: An introduction to information retrieval. Cambridge university
press (2009)
40. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word repre-
sentations in vector space. arXiv preprint arXiv:1301.3781 (2013)
41. Mikolov, T., Yih, W.t., Zweig, G.: Linguistic regularities in continuous space word
representations. In: Proceedings of the 2013 conference of the north american chap-
ter of the association for computational linguistics: Human language technologies.
pp. 746–751 (2013)
42. Mohammad, S.M., Hirst, G.: Distributional measures of semantic distance: A sur-
vey. arXiv preprint arXiv:1203.1858 (2012)
43. Naseem, U., Razzak, I., Khan, S.K., Prasad, M.: A comprehensive survey on word
representation models: From classical to state-of-the-art word representation lan-
guage models. Transactions on Asian and Low-Resource Language Information
Processing 20(5), 1–35 (2021)
44. Navarro, G.: A guided tour to approximate string matching. ACM computing sur-
veys (CSUR) 33(1), 31–88 (2001)
45. Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word repre-
sentation. In: Proceedings of the 2014 conference on empirical methods in natural
language processing (EMNLP). pp. 1532–1543 (2014)
46. Prechelt, L., Malpohl, G., Philippsen, M., et al.: Finding plagiarisms among a set
of programs with jplag. J. UCS 8(11), 1016 (2002)
47. Raychev, V., Vechev, M., Yahav, E.: Code completion with statistical language
models. In: Proceedings of the 35th ACM SIGPLAN conference on programming
language design and implementation. pp. 419–428 (2014)
48. Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-
networks. arXiv preprint arXiv:1908.10084 (2019)
49. Rodriguez, D.V., Carver, D.L.: Comparison of information retrieval techniques for
traceability link recovery. In: 2019 IEEE 2nd International Conference on Infor-
mation and Computer Technologies (ICICT). pp. 186–193. IEEE (2019)
50. Saadatmand, M., Abbas, M., Enoiu, E.P., Schlingloff, B.H., Afzal, W., Dornauer,
B., Felderer, M.: Smartdelta project: Automated quality assurance and optimiza-
tion across product versions and variants. Microprocessors and Microsystems 103,
104967 (2023). https://doi.org/10.1016/j.micpro.2023.104967
51. Schnabel, T., Labutov, I., Mimno, D., Joachims, T.: Evaluation methods for un-
supervised word embeddings. In: Proceedings of the 2015 conference on empirical
methods in natural language processing. pp. 298–307 (2015)
52. Schütze, H., Manning, C.D., Raghavan, P.: Introduction to information retrieval,
vol. 39. Cambridge University Press Cambridge (2008)
53. Sunilkumar, P., Shaji, A.P.: A survey on semantic similarity. In: 2019 International
Conference on Advances in Computing, Communication and Control (ICAC3).
pp. 1–8. IEEE (2019)
54. Tabassum, J., Maddela, M., Xu, W., Ritter, A.: Code and named entity recognition
in StackOverflow. In: Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics. pp. 4913–4926. ACL (2020)
55. Van Valin, R.D.: An introduction to syntax. Cambridge university press (2001)
56. Williams, A., Nangia, N., Bowman, S.R.: A broad-coverage challenge corpus for
sentence understanding through inference. arXiv preprint arXiv:1704.05426 (2017)
57. Zhang, Y., Jin, R., Zhou, Z.H.: Understanding bag-of-words model: a statistical
framework. International journal of machine learning and cybernetics 1, 43–52
(2010)
58. Zhao, L., Alhoshan, W., Ferrari, A., Letsholo, K.J., Ajagbe, M.A., Chioasca, E.V.,
Batista-Navarro, R.T.: Natural language processing for requirements engineering:
A systematic mapping study. ACM Computing Surveys (CSUR) 54(3), 1–41 (2021)
59. Zhao, Y., Scholer, F., Tsegay, Y.: Effective pre-retrieval query performance pre-
diction using similarity and variability evidence. In: Advances in Information Re-
trieval: 30th European Conference on IR Research, ECIR 2008, Glasgow, UK,
March 30-April 3, 2008. Proceedings 30. pp. 52–64. Springer (2008)