
International Conference on Intelligent Computing and Communication Techniques, JNU, New Delhi, June 2024

Comparative Analysis of FastText and Doc2Vec for Semantic Feature-Oriented Software Defect Prediction

Authors

Priya Singh
Department of Software Engineering, Delhi Technological University, Delhi, India
Priya.singh.academia@gmail.com

Gaurav Sharma
Department of Software Engineering, Delhi Technological University, Delhi, India
sharmaGaurav171691@gmail.com
INTRODUCTION
 Software defects are an unavoidable aspect of the development process. They include logical errors, coding mistakes, and design flaws.

 Software defect prediction (SDP) provides developers with a crucial opportunity to address issues proactively, refining their approach and optimizing the allocation of testing resources.

 Modern SDP models emphasize innovation, drawing on machine learning and natural language processing techniques.

 To improve the predictive capability of models, different embedding techniques such as Word2Vec, GloVe, FastText, and Doc2Vec are used.

 Current work lacks a comprehensive comparison of word embedding techniques for model performance evaluation.
OBJECTIVES
 Examines how different embedding techniques impact SDP tasks.

 Compares the most commonly used embedding techniques, such as FastText and Doc2Vec, for SDP tasks based on evaluation metrics.

 Evaluates the performance of various deep learning models used for SDP.

 Analyzes how combinations of embedding techniques and deep learning models work together to enhance SDP.

 Provides insights for selecting the most suitable embedding technique for SDP tasks.
SOFTWARE DEFECT PREDICTION
SDP is the process of using ML and DL techniques to predict the likelihood of defects in a software system. The goal of SDP is to identify potential defects before they occur, enabling software development teams to take preventive measures and improve software quality. Figure 1 shows the process of SDP.

Figure 1. Flowchart of SDP
EMBEDDING TECHNIQUES
Embeddings represent words, phrases, or even entire documents as vectors within a continuous vector space, based on both syntactic relationships and contextual meaning.

 FastText is a word embedding model developed by Facebook that extends Word2Vec by representing words as character n-grams, enabling it to capture sub-word information and handle out-of-vocabulary words effectively.

 Doc2Vec is an extension of Word2Vec that learns fixed-length feature representations for variable-length texts, such as sentences or documents, by capturing the semantic meaning of the entire text.
DATASET DESCRIPTION
The dataset used in this research is a set of 10 open-source Java projects taken from the PROMISE repository.
METHODOLOGY
The methodology of this research encompasses several key stages for analyzing Java code.

 Corpus Generation Using AST
 Generation of Sequence Tokens
 Fine-Tuning of Pre-Trained Models
 Generation of Embeddings
 Comparison of Techniques

Figure 2 illustrates the process followed during this study.

Figure 2. Flowchart for predicting bugs
Corpus Generation Using AST
 Source files are the Java projects taken from the PROMISE Software Engineering repository.
 An AST represents Java code in a tree-like structure.
 From the repository, each Java project is taken (e.g., ANT) and its AST is generated.
 Figure 3 illustrates the AST of a sample code.

Figure 3. AST of sample code
Generation of Sequence Tokens
 Once the AST of a Java project is generated, a sequence token file for that project is created.
 The categories of AST nodes selected for the sequence token file are control-flow nodes, class declarations, and method invocations.
 When any of the tokens shown in Figure 4 is found, it is added to the sequence token file of that project.

Figure 4. AST Selected Nodes
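The study extracts these node categories from Java ASTs; the same idea can be illustrated with Python's standard ast module (an analogy only — the paper itself parses Java source, and the node names below are Python's, not Java's):

```python
import ast

# Node categories analogous to the paper's selection:
# control-flow nodes, class declarations, and method/function invocations
SELECTED = (ast.If, ast.For, ast.While, ast.ClassDef, ast.Call)

source = """
class Demo:
    def run(self, items):
        for x in items:
            if x > 0:
                print(x)
"""

# Parse the source into an AST, then walk it and keep only
# the node types of interest as a flat sequence of tokens
tree = ast.parse(source)
tokens = [type(node).__name__ for node in ast.walk(tree)
          if isinstance(node, SELECTED)]
print(tokens)  # contains ClassDef, For, If, and Call
```

In the paper, the analogous token sequence per Java project version is what gets written to that project's sequence token file.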
FINE-TUNING OF PRE-TRAINED MODELS
 Transfer learning was applied to Doc2Vec and FastText models imported from the Gensim library, and they were trained on the AST-generated corpus for each Java project.

 Both models were initialized with key parameters: a vector size of 100, a window size of 5, and a minimum word count of 5.

 The models were trained for 10 epochs, meaning the dataset was iterated over 10 times during training.

 Four CPU cores were utilized to accelerate the training process.

 The trained models were then used to generate embeddings from the tokens.
Generation of Embeddings
 The pre-trained models were fine-tuned using the corpus.

 Sequence tokens from each project version were input into these models.

 This allowed the models to produce embeddings.


Comparison of Techniques
 Embeddings are input into deep learning models (ANN, CNN, LSTM, and GRU).

 The trained models output "1" for detected bugs and "0" for bug-free
software.

 A comparative analysis is performed using evaluation metrics to assess the effectiveness of the embeddings.
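The comparison step rests on standard confusion-matrix metrics. A minimal sketch of how the quantities reported later (precision, recall, F1, TNR, FPR, FNR) follow from the models' 0/1 outputs — the labels below are made-up examples, not results from the study:

```python
def defect_metrics(y_true, y_pred):
    """Confusion-matrix metrics for binary defect labels (1 = buggy, 0 = clean)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0   # recall = 1 - FNR
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    tnr = tn / (tn + fp) if tn + fp else 0.0      # TNR = 1 - FPR
    return {"precision": precision, "recall": recall, "f1": f1,
            "tnr": tnr, "fpr": 1 - tnr, "fnr": 1 - recall}

# Toy ground truth vs. model predictions
m = defect_metrics([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

Lower FNR/FPR and higher TNR, as reported for Doc2Vec in the conclusion, correspond directly to fewer missed bugs and fewer false alarms in this computation.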
HYPERPARAMETER SETTINGS
 Training Duration: Extended over 200 epochs for comprehensive learning.

 Activation Functions: Sigmoid for ANN layers, ReLU for dense layers in RNNs, and sigmoid for RNN output layers.

 Loss Function: Binary cross-entropy applied across all models.

 Optimization: Adam optimizer used for parameter tuning.

 Model Architecture: Included two dense layers with 64 and 32 neurons for capturing intricate data patterns.

 CNN Specifics: 1D convolutional layers used to capture spatial dependencies in sequential data, enhancing model performance.
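A minimal NumPy sketch of the forward pass and loss for the described ANN: two dense layers of 64 and 32 neurons, sigmoid activations, a sigmoid output, and binary cross-entropy. The weights here are random placeholders (the study presumably used a deep learning framework, which is an assumption; this only illustrates the stated architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, sizes=(100, 64, 32, 1)):
    """Forward pass through dense layers of 64 and 32 neurons plus a
    single sigmoid output unit, matching the architecture above.
    Weights are random here, standing in for trained parameters."""
    a = x
    for n_in, n_out in zip(sizes, sizes[1:]):
        w = rng.normal(scale=0.1, size=(n_in, n_out))
        a = sigmoid(a @ w)  # sigmoid activations, per the ANN setting
    return a

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Binary cross-entropy, the loss applied across all models."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

# One batch of 4 embedding vectors (dimension 100, as produced earlier)
x = rng.normal(size=(4, 100))
probs = forward(x)                # shape (4, 1), each value in (0, 1)
loss = binary_cross_entropy(np.array([[1.0], [0.0], [1.0], [0.0]]), probs)
```

In training, the Adam optimizer would then update the weights to minimize this loss; thresholding the sigmoid outputs at 0.5 yields the "1" (buggy) / "0" (bug-free) predictions described earlier.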
RESULT
The following table and figures show the performance of the different embedding techniques across all the deep learning models used.
CONCLUSION
 Doc2Vec embeddings showed improved recall, F1 scores, and precision, offering superior semantic representations.

 Doc2Vec enhances the models' accuracy in identifying positive and negative cases.

 Lower FNR and FPR values are associated with Doc2Vec embeddings.

 This implies a reduced frequency of false positives and false negatives, enhancing model dependability.

 Higher TNR values with Doc2Vec embeddings demonstrate effective detection of negative (bug-free) instances.
FUTURE WORK
 Use advanced embeddings like BERT, Code-BERT, RoBERTa, ELMO, and
XLNet for better semantic understanding.

 Consider the PROMISE dataset and include NASA's dataset to expand the corpus for future studies.

 Larger datasets may lead to more universal and effective models.

 Apply Cross-Project Defect Prediction (CPDP) instead of Within-Project Defect Prediction.

 CPDP enhances model generalization by training on data from multiple projects.

 Advanced techniques and diverse datasets can significantly improve SDP models.
