
International Conference on Intelligent Computing and Communication Techniques, JNU, New Delhi, June 2024

Comparative Analysis of FastText and Doc2Vec for Semantic Feature-Oriented Software Defect Prediction

Authors

Priya Singh
Department of Software Engineering, Delhi Technological University, Delhi, India
Priya.singh.academia@gmail.com

Gaurav Sharma
Department of Software Engineering, Delhi Technological University, Delhi, India
sharmaGaurav171691@gmail.com
INTRODUCTION
 Software defects are an unavoidable aspect of the development process. They include logical errors, coding mistakes, and design flaws.

 Software defect prediction (SDP) provides developers with a crucial opportunity to address issues proactively, refining their approach and optimizing the allocation of testing resources.

 Modern SDP models emphasize innovation, drawing on machine learning and natural language processing techniques.

 To improve the predictive capability of models, different embedding techniques such as Word2Vec, GloVe, FastText, and Doc2Vec are used.

 Current work lacks a comprehensive comparison of word embedding techniques for model performance evaluation.
OBJECTIVES
 Examines how different embedding techniques impact SDP tasks.

 Compares the most commonly used embedding techniques, such as FastText and Doc2Vec, for SDP tasks based on evaluation metrics.

 Evaluates the performance of various deep learning models used for SDP.

 Analyzes how combinations of embedding techniques and deep learning models work together to enhance SDP.

 Provides insights for selecting the most suitable embedding technique for SDP tasks.
SOFTWARE DEFECT PREDICTION
SDP is the process of using ML and DL techniques to predict the likelihood of defects in a software system. The goal of SDP is to identify potential defects before they occur, enabling software development teams to take preventive measures and improve software quality. Figure 1 shows the process of SDP.

Figure 1. Flowchart of SDP
EMBEDDING TECHNIQUES
Embeddings represent words, phrases, or even entire documents as vectors within a continuous vector space, based on both syntactic relationships and contextual meaning.

 FastText is a word embedding model developed by Facebook that extends Word2Vec by representing words as character n-grams, enabling it to capture sub-word information and handle out-of-vocabulary words effectively.

 Doc2Vec is an extension of Word2Vec that learns fixed-length feature representations for variable-length texts, such as sentences or documents, by capturing the semantic meaning of the entire text.
DATASET DESCRIPTION
The dataset used in this research is a set of 10 open-source Java projects taken from the PROMISE repository.
METHODOLOGY
The methodology of this research encompasses several key stages for analyzing Java code.

 Corpus Generation Using AST
 Generation of Sequence Tokens
 Fine-Tuning of Pre-Trained Models
 Generation of Embeddings
 Comparison of Techniques

Figure 2 illustrates the process followed during this study.

Figure 2. Flowchart for predicting bugs
Corpus Generation Using AST
 Source files are the Java projects taken from the PROMISE Software Engineering repository.
 An AST represents Java code in a tree-like structure.
 From the repository, each Java project is taken (e.g., ANT) and its AST is generated.
 Figure 3 illustrates the AST of a sample code.

Figure 3. AST of sample code
Generation of Sequence Tokens
 Once the AST of a Java project is generated, a sequence token file for that project is created.
 The categories of AST nodes selected for the sequence token file are control-flow nodes, class declarations, and method invocations.
 When any of the tokens shown in Figure 4 is found, it is added to the sequence token file of that project.

Figure 4. AST Selected Nodes
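The study extracts these node categories from Java ASTs; the same idea can be illustrated with Python's standard ast module (an analogy only — the paper itself parses Java source, and the node names below are Python's, not Java's):

```python
import ast

# Node categories analogous to the paper's selection:
# control-flow nodes, class declarations, and method/function invocations
SELECTED = (ast.If, ast.For, ast.While, ast.ClassDef, ast.Call)

source = """
class Demo:
    def run(self, items):
        for x in items:
            if x > 0:
                print(x)
"""

# Parse the source into an AST, then walk it and keep only
# the node types of interest as a flat sequence of tokens
tree = ast.parse(source)
tokens = [type(node).__name__ for node in ast.walk(tree)
          if isinstance(node, SELECTED)]
print(tokens)  # contains ClassDef, For, If, and Call
```

In the paper, the analogous token sequence per Java project version is what gets written to that project's sequence token file.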
FINE-TUNING OF PRE-TRAINED MODELS
 Transfer learning was applied to Doc2Vec and FastText models imported from the Gensim library, and they were trained on the AST-generated corpus for each Java project.

 Both models were initialized with key parameters: a vector size of 100, a window size of 5, and a minimum word count of 5.

 The models were trained for 10 epochs, meaning the dataset was iterated over 10 times during training.

 Four CPU cores were utilized to accelerate the training process.

 The trained models were then used to generate embeddings from the tokens.
Generation of Embeddings
 The pre-trained models were fine-tuned using the corpus.

 Sequence tokens from each project version were input into these models.

 This allowed the models to produce embeddings.


Comparison of Techniques
 Embeddings are input into deep learning models (ANN, CNN, LSTM, and GRU).

 The trained models output "1" for detected bugs and "0" for bug-free
software.

 A comparative analysis is performed using evaluation metrics to assess the effectiveness of the embeddings.
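The comparison step rests on standard confusion-matrix metrics. A minimal sketch of how the quantities reported later (precision, recall, F1, TNR, FPR, FNR) follow from the models' 0/1 outputs — the labels below are made-up examples, not results from the study:

```python
def defect_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary defect labels (1 = buggy, 0 = clean)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # recall = 1 - FNR
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    tnr = tn / (tn + fp) if tn + fp else 0.0      # TNR = 1 - FPR
    return {"precision": precision, "recall": recall, "f1": f1,
            "tnr": tnr, "fpr": 1 - tnr, "fnr": 1 - recall}

# Toy ground truth vs. model predictions
m = defect_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Lower FNR/FPR and higher TNR, as reported for Doc2Vec in the conclusion, correspond directly to fewer missed bugs and fewer false alarms in this computation.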
HYPERPARAMETER SETTINGS
 Training Duration: Extended over 200 epochs for comprehensive learning.

 Activation Functions: Sigmoid for ANN layers, ReLU for dense layers in RNNs, and sigmoid for RNN output layers.

 Loss Function: Binary cross-entropy applied across all models.

 Optimization: Adam optimizer used for parameter tuning.

 Model Architecture: Included two dense layers with 64 and 32 neurons for capturing intricate data patterns.

 CNN Specifics: 1D convolutional layers used to capture spatial dependencies in sequential data, enhancing model performance.
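A minimal NumPy sketch of the forward pass and loss for the described ANN: two dense layers of 64 and 32 neurons, sigmoid activations, a sigmoid output, and binary cross-entropy. The weights here are random placeholders (the study presumably used a deep learning framework, which is an assumption; this only illustrates the stated architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, sizes=(100, 64, 32, 1)):
    """Forward pass through dense layers of 64 and 32 neurons plus a
    single sigmoid output unit, matching the architecture above.
    Weights are random here, standing in for trained parameters."""
    a = x
    for n_in, n_out in zip(sizes, sizes[1:]):
        w = rng.normal(scale=0.1, size=(n_in, n_out))
        a = sigmoid(a @ w)  # sigmoid activations, per the ANN setting
    return a

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy, the loss applied across all models."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

# One batch of 4 embedding vectors (dimension 100, as produced earlier)
x = rng.normal(size=(4, 100))
probs = forward(x)                # shape (4, 1), each value in (0, 1)
loss = binary_cross_entropy(np.array([[1.0], [0.0], [1.0], [0.0]]), probs)
```

In training, the Adam optimizer would then update the weights to minimize this loss; thresholding the sigmoid outputs at 0.5 yields the "1" (buggy) / "0" (bug-free) predictions described earlier.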
RESULT
The following table and figures show the performance of the different embedding techniques across all the deep learning models used.
CONCLUSION
 Doc2Vec embeddings showed improved recall, F1 scores, and precision, offering superior semantic representations.

 Doc2Vec enhances the models' accuracy in identifying positive and negative cases.

 Lower FNR and FPR values are associated with Doc2Vec embeddings.

 This implies a reduced frequency of false positives and false negatives, enhancing model dependability.

 Higher TNR values with Doc2Vec embeddings demonstrate effective detection of negative (bug-free) instances.
FUTURE WORK
 Use advanced embeddings like BERT, Code-BERT, RoBERTa, ELMO, and
XLNet for better semantic understanding.

 Consider the PROMISE dataset and include NASA's dataset to expand the corpus for future studies.

 Larger datasets may lead to more universal and effective models.

 Apply Cross-Project Defect Prediction (CPDP) instead of Within-Project Defect Prediction.

 CPDP enhances model generalization by training on data from multiple projects.

 Advanced techniques and diverse datasets can significantly improve SDP models.
