IET Information Security - 2023 - Prabakaran
DOI: 10.1049/ise2.12106
ORIGINAL RESEARCH
- -Revised: 21 September 2022 Accepted: 24 December 2022
1 Department of Information Technology, Thiagarajar College of Engineering, Madurai, India
2 Department of Computer Science and Engineering, Sethu Institute of Technology, Virudhunagar, India
3 Department of Computer Science and Engineering, Mepco Schlenk Engineering College, Sivakasi, India
Correspondence
Manoj Kumar Prabakaran, Department of Information Technology, Thiagarajar College of Engineering, Madurai, India.
Email: mano.btechme@gmail.com
Abstract
Phishing attacks have become one of the powerful sources for cyber criminals to impose various forms of security attacks in which fake website Uniform Resource Locators (URL) are circulated around the Internet community in the form of email, messages etc., in order to deceive users, resulting in the loss of their valuable assets. The phishing URLs are predicted using several blacklist‐based traditional phishing website detection techniques. However, numerous phishing websites are frequently constructed and launched on the Internet over time; these blacklist‐based traditional methods do not accurately predict most phishing websites. In order to effectively identify malicious URLs, an enhanced deep learning‐based phishing detection approach has been proposed by integrating the strength of Variational Autoencoders (VAE) and deep neural networks (DNN). In the proposed framework, the inherent features of a raw URL are automatically extracted by the VAE model by reconstructing the original input URL to enhance phishing URL detection. For experimentation, around 1 lakh URLs were crawled from two publicly available datasets, namely ISCX‐URL‐2016 dataset and Kaggle dataset. The experimental results suggested that the proposed model has reached a maximum accuracy of 97.45% and exhibits a quicker response time of 1.9 s, which is better when compared to all the other experimented models.
KEYWORDS
deep learning, deep neural network, machine learning, phishing, uniform resource locator, variational
autoencoders
1 | INTRODUCTION
The cybercrime referred to as ‘phishing’ is the process of enticing people into visiting fraudulent websites and persuading them to enter identifying information such as usernames, passwords, addresses, social security numbers, personal identification numbers and anything else that can be made to appear to be plausible [1]. These websites have similar webpage content and closely associated Uniform Resource Locators (URLs) to the legitimate websites, which in turn deceives users into entering the website and providing their valuable credentials.
Taking the website of one of the largest banks in India, HDFC Bank, as an example, the cyber criminal may imitate the official website URL, ‘https://www.hdfcbank.com/’, with a look‐alike such as ‘https://www.hbfcbank.com/’, with the aid of a few unethical techniques such as URL hijacking, so that users mistakenly log in to the fraudulent website, since its URL and web page content are close enough to those of the legitimate website [2]. Once the user is logged in, the hacker tries to collect the sensitive data of the user, which can be used in identity theft, to remove funds from a customer account and in the theft of online resources [3].
[Correction added on 9 March 2023, after first online publication. The below correction needs to be noted: The order of the authors should be as follows: Manoj Kumar Prabakaran,
Parvathy Meenakshi Sundaram and Abinaya Devi Chandrasekar.]
-
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is
properly cited.
© 2023 The Authors. IET Information Security published by John Wiley & Sons Ltd on behalf of The Institution of Engineering and Technology.
The number of phishing attacks has grown exponentially in recent years. According to the reports published by APWG [4], an international coalition unifying the global response to cybercrime across industry, government and law‐enforcement sectors and NGO communities, between 68,000 and 94,000 phishing attacks per month have been observed since the early 2020s. By 2021, APWG recorded 260,642 malicious website URLs, the highest monthly attack count in the organisation's history.
Conditions specific to the COVID‐19 pandemic led to an increase in the number of cyberattacks and cyber security problems. Phishing was by far the most prevalent form of attack; it had distinctive features and was intended to take advantage of COVID‐19's quirks in order to maximise its likelihood of success [5]. The majority of prominent phishing attacks that took place during COVID‐19 focussed on posing as government agencies (such as the WHO or the US Centers for Disease Control and Prevention (CDC)) by creating a fake email or website to lure users into entering their sensitive information.
Research on this subject has been significantly prompted by the rise in phishing attempts over the last few years, particularly during the COVID‐19 pandemic, which resulted in the loss of valuable human assets and serious real‐world consequences. Since most phishing‐based crimes involve malicious website URLs to perform various forms of cyberattacks, it is essential to deploy an effective strategy to detect malicious URLs in a real‐time environment. Hence, to mitigate such attacks, various anti‐phishing tools have been developed by many organisations in the recent past [6, 7]. Most of these tools use a technique referred to as blacklists to analyse the legitimacy of a website URL [8]. Blacklists are databases of URLs for known phishing websites that are created and maintained by Internet communities like PhishTank. Blacklists are used by most of the lookup‐based anti‐phishing security toolbars found in web browsers. However, blacklists require time to update their most recent listings, so many people are victimised in the meantime [9].
While these methods are rapid and are intended to have low False Positive (FP) rates, one major drawback is that they cannot be entirely comprehensive, and they particularly fail when confronted with newly generated URLs. Since new URLs are generated on a daily basis, this is a serious constraint.
Hence, to cope with newly generated URLs, a few researchers have come up with the idea of adopting machine learning algorithms to detect malicious URLs [10–12]. These ML‐based techniques, in general, make use of statistical/lexical features of URLs, namely length of the URL, number of periods in the URL, top‐level domain etc., to classify URLs into two broad categories, namely benign and phishing. The main advantages of the ML‐based phishing detection approach are that it eliminates the dependency on collecting blacklists and that it can detect new kinds of malicious URLs. Although ML‐based approaches result in better detection accuracy, these techniques suffer from a few considerable drawbacks: (a) Inability to extract semantic patterns, that is, not all of the phishing website's qualities are extracted since the URL is evaluated from a particular perspective. (b) The feature extraction process involves manual feature engineering that requires the assistance of engineering experts. They must manually identify the features needed for classifier training from URLs and HTML text, which would lead to errors. (c) Inability to cope with URLs that do not conform to manually crafted features, for example, the effect of lexical features in categorisation has been inadvertently reduced due to the shift in the structure of modern‐day URLs, which are substantially shorter in length.
Though many technical issues disrupt the effective performance of machine learning approaches, one way to reduce the cost of developing and maintaining machine learning approaches is to move beyond manual feature engineering, which is often considered the most time‐consuming aspect of machine learning.
In recent years, deep learning (DL) algorithms have become popular in the field of phishing URL detection. Deep learning algorithms such as Convolutional Neural Network (CNN), Auto‐Encoder (AE) etc. have been used by various researchers [13, 14] to automatically extract abstract higher‐level features from the raw URL. These algorithms have the aptitude to detect representations (features) that match the classifier's requirement from raw data. This helps in eliminating the dependency on experts for manual feature engineering and overcomes the problem of lexical‐based feature extraction. Also, DL models are scalable to a huge volume of data, and in fact their ability to classify increases as the amount of input data becomes higher. On the other hand, DL models require the input data to be converted into numerical vectors, and hence data needs to be preprocessed before being fed into the model. Also, the time taken to train a neural network model is a significant overhead since there are lots of training parameters involved.
Although DL models have quite a few concerns, their ability to automatically extract higher‐level features from the URL and precisely classify the input data makes them well suited for malicious URL detection. Following this success and in consideration of the complexities involved, we have proposed a hybrid DL model named VAE‐DNN that combines Variational Autoencoders (VAE) (a special form of the Autoencoder model) and a Deep Neural Network (DNN) for effectively classifying malicious URLs. By integrating these two neural network models, the detection accuracy can be significantly improved.
In particular, VAE‐DNN receives raw URL input in the form of a string and forwards it to a preprocessing stage where the input data is converted into a numerical matrix by means of a One Hot Encoding (OHE) mechanism. Once the URL is converted into numerical vectors, they are fed into the VAE architecture for the process of feature extraction/dimensionality reduction. After this stage, the VAE model produces a dimensionally reduced abstract representation of the input URL features from its latent space layer. Those extracted features are then fed into the DNN classifier for validation. Finally, the DNN model is trained and tested with the higher‐level extracted features produced from the latent space/bottleneck layer of the VAE model. After thorough experimentation, it has been observed that the proposed VAE‐DNN model is able to effectively classify malicious URLs. The model is being
17518717, 2023, 3, Downloaded from https://ietresearch.onlinelibrary.wiley.com/doi/10.1049/ise2.12106 by Test, Wiley Online Library on [20/08/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License
PRABAKARAN ET AL.
experimented with the data obtained from various resources such as Kaggle and the ISCX‐URL‐2016 dataset. Experimental results suggest that our model reaches a maximum accuracy of 97.45%, which is significantly better when compared with all the other experimented DL models. Following are the main contributions of our research. (1) We have incorporated the OHE mechanism to preprocess the input URL into a numerical matrix. This process embeds every URL into a fixed length numerical matrix with N × M dimension. We have adopted a strategy in setting the threshold value for N and M, such that the model shall be exposed to all kinds of possible characters that could probably form a URL for evaluation. (2) Also, we have adopted the VAE model for automatically extracting higher‐level features from the input URL, which eliminates the dependency on manual feature engineering. Since the VAE model has the ability to effectively reconstruct the original input, it is possible to obtain inherent information from the input URL and at the same time reduce the dimensionality of the input. (3) Since the VAE model produces dimensionally reduced inherent features of the URL, those features shall be used for effectively training the classifier. This significantly reduces the training time overhead of the classifier since the DNN model is trained with a smaller number of features extracted from the latent space of the VAE model. (4) Since our model does not employ expert‐assisted manually crafted features, the model eliminates the impediment of fixed human cognition to identify relevant features. Our architecture possesses the ability to discover a correlation between input features, implying that malicious URLs have more inherent properties. As a result, our approach can detect previously unknown malicious URL samples efficiently.
The rest of the paper is organised as follows. Section 2 discusses the related work undertaken in the field of malicious URL detection. Section 3 provides a detailed overview of the proposed VAE‐DNN framework. Section 4 demonstrates the various experiments conducted and the results obtained. Finally, in Section 5, discussions and conclusions are presented.
2 | RELATED WORK
In this section, we investigate the recent studies that have been conducted in the field of information security. Numerous smart security solutions have been proposed in the recent past to detect the various intrusive attacks in a real‐world networking environment. Probabilistic graphical models [15], Big Data‐based Hierarchical DL System [16], stepping‐stone intrusion detection based on clustering mechanisms [17], Deep Neural Network (DNN), CNN and other neural network models [18] have been effectively utilised to tackle the various network security issues.
However, the number of intrusive attacks based on phishing‐based mechanisms has been growing exponentially. To effectively classify the phishing URLs from benign ones, many researchers have contributed in various ways to identify an optimal mechanism to generate a novel phishing detection model. Earlier studies reveal that rule‐based phishing solutions were used to identify fake website URLs; these adopted blacklist‐based techniques that inform the end user about the legitimacy of a particular website based on a fixed set of rules. Although these static approaches respond quickly, they are still vulnerable to new URLs that are not available in the list, which makes these techniques less effective in a real‐time environment. Following list‐based approaches, many researchers focussed on incorporating machine learning and DL‐based models for malicious URL detection. We have done an extensive survey on the various ML and DL‐based approaches for phishing URL detection, where we have analysed the performance of the classifier with respect to manually crafted and automatically extracted features.
2.1 | Malicious URL detection using machine learning
In Ref. [19], the authors proposed an effective machine learning approach for phishing detection based on lexical/statistical features of the URL and the web page content. In this work, the researchers have considered 30 features based on the address bar, anomaly, domain and webpage response for determining whether a website is phishing or not. For experimentation, data were collected from the publicly available UCI repository. Various machine learning models were deployed for the purpose of classification. Among those models, the Random Forest (RF) classifier outperformed all the other models with the maximum accuracy of 96.87%. Although this approach seemed to be effective, this kind of technique is exposed to 0‐day attacks since the model is trained with fixed statistical/lexical‐based features which can be easily decoded.
Saleem et al. [20] proposed a malicious URL detection mechanism based on lexical features by using machine learning algorithms. Similar to the work conducted by the authors in Ref. [19], in this work, fixed lexical features of the URL were considered for classification. Exactly 27 URL features were crafted based on the URL length, domain length, alphabets in URL etc., that represent the lexical behaviour of the URL. After extracting those features, a few irrelevant features were removed from the feature set and only 20 features were selected. For experimentation, URLs were collected from the UNB dataset that is composed of around 66,000 URLs. The results suggested that the RF classifier produced the maximum accuracy of 99.5%. In comparison to the previously discussed methodology, this is a lightweight approach that offers better performance since it employs a minimum number of features for training the model.
Gupta et al. [21] devised an ML approach for phishing URL detection based on lexical properties of the URL. In this work, only nine lexical features of the URL were fixed based on the lexical characteristics of the URL. This work used the ISCX‐URL‐2016 dataset for collecting both benign and phishing URLs. Around 11,000 instances of the URL were compiled and evaluated with four different machine learning classifiers. A maximum accuracy of 99.7% was achieved by the
RF classifier. The uniqueness of this anti‐phishing approach is that this model does not rely on third party services for feature extraction.
Ekta and Deepak [22] conducted a study on spoofed website detection by using a machine learning approach. In this study, the authors constructed 20 statistical features relevant to websites based on page, HTML and URL data. For experimentation, six machine learning classifiers, namely NB, Decision Table, KNN, SVM, RF and Adaboost, were used. The dataset was collected from the PhishTank and OpenPhish websites and contains both legitimate and phishing websites. Around 5000 website URLs were collected from the mentioned resources and used for experimentation. Experimental results suggested that both RF and Adaboost algorithms produced better classification accuracy of more than 99% and minimal error in detecting appropriate URLs. Although this seems to be an effective approach for spoofed website detection, the model was trained with only a small amount of URL data, which makes it vulnerable to new kinds of URLs that are not available in the training dataset.
Based on the study conducted, although it can be presumed that ML‐based approaches for detecting malicious URLs seem to be an optimal approach, these techniques are still vulnerable to various issues such as reliance on third‐party services for feature extraction, manual feature engineering, inability to cope with a huge volume of URLs, dependency on a fixed number of features for prediction etc. Moreover, modern‐day hackers are generating URLs of which quite a few are shorter in length, and certain other URLs have a unique structure that is difficult to interpret. To overcome such issues, many researchers [23–26] have adopted DL‐based mechanisms to classify the URL, which do not require features to be manually extracted.
2.2 | Malicious URL detection using deep learning
Wang et al. [23] proposed PDRCNN (Precise Phishing Detection with Recurrent Convolutional Neural Networks), a quick phishing website detection method that entirely depends on the website's URL. This was the first way of detecting phishing in the context of cyber security issues using a DL model. To extract global features from the input URL, a bidirectional Long Short Term Memory network was used, and CNN was used to capture the significant components of the extracted features. PDRCNN had a detection accuracy of 97% and an Area under curve (AUC) value of 99%, which is much better than other approaches that employed artificial features for classification, although it took a long time to train the model. Furthermore, the model is oblivious to whether the website corresponding to the URL being evaluated for training is active or not.
To enhance phishing website prediction, the authors of Ref. [24] presented an integrated intelligent phishing website prediction approach using deep neural networks that integrates evolutionary algorithm‐based feature selection and weighting approaches. The genetic algorithm (GA) was used to intuitively determine the most influential features and the optimal weights of website features. The research's main contribution is the adoption of evolutionary algorithms to select suitable weights and attributes for URLs. The detection accuracy of the model was 89.5%. On the other hand, the proposed model has the following deficiencies: (i) Limitation of data sources. (ii) Selecting and weighting features by using GAs took considerably longer.
Liqun et al. [25] developed a unique approach based on a non‐inverse matrix online sequence extreme learning machine (NIOSELM). To reduce the detection model's dependency on the majority class, an Adaptive Synthetic Sampling (ADASYN) approach was used. In addition, an enhanced denoising auto‐encoder (SDAE) was built to reduce the size of the experimental dataset. To balance the dataset and reduce dimensionality, the suggested model utilises unique preprocessing procedures. The detection accuracy, on the other hand, may not be as good as the existing approaches.
In order to overcome the issues prevailing in supervised classification in terms of robustness against 0‐day phishing attacks, the authors of Ref. [26] proposed a hybrid model combining the convolution operation to model the character‐level features of the URL and a deep Convolutional Autoencoder (CAE) to consider the nature of the 0‐day attacks. Around 1.5 lakh URLs were collected from three real‐world datasets, namely PhishStorm, PhishTank and ISCX‐URL‐2016, for experimentation. The results demonstrate that the proposed method showed an accuracy of 96.42%. In comparison with the supervised approaches, the proposed model exhibited better performance. Incorporating autoencoders for phishing detection had a strong impact on the performance of the classification results. However, the proposed method tends to misclassify several phishing URLs as benign, since only character‐level features were considered for individual URLs. Hence there is a need to identify a suitable preprocessing and feature extraction mechanism that reduces the FP rate of the model.
The models adopted in Refs. [25, 26] used various forms of Auto‐encoder (AE) to effectively classify phishing URLs, and the findings reveal that adoption of AE was highly efficient in extracting higher‐level abstract features from URLs. Taking advantage of the efficacy of the AE model, we have proposed a hybrid framework combining VAE and deep neural network (DNN) for effective classification of malicious URLs.
3 | METHODOLOGY
Various machine learning classifiers and hybrid algorithms for phishing detection models have been proposed in the literature, with most authors claiming to achieve the maximum accuracy in detecting phishing websites. Machine learning algorithms combine feature selection approaches with classification algorithms to provide a quick way to identify various forms of attacks. Although ML‐based algorithms have shown promising results in malicious URL detection, these algorithms are susceptible to the following
issues: (a) They require features of the URL to be manually extracted, which depends on third‐party services to obtain certain important features. (b) Inability to handle huge volumes of data. (c) Inability to cope with URLs that do not conform to the manually constructed features.
Deep learning algorithms, on the other hand, have grown in prominence as a result of their benefits over traditional machine learning classifiers. Deep learning algorithms possess the ability to automatically learn features from an input through successive forward and backward propagation of data. A few researchers in recent years have adopted neural network‐based models for malicious URL detection. Various neural network‐based approaches were used for automatically extracting inherent features from raw URLs, facilitating the unsupervised learning methodology. Despite their success, DL‐based models have the following constraints: (a) The URL data should be converted into numeric vectors before feeding it to the model. (b) Training the model with raw inputs takes considerably longer. (c) The choice of the neural network model for feature extraction plays a vital role in the successful classification of the URL. (d) Huge numbers of hyper‐parameter values are involved in DL‐based approaches, and the process of fine tuning the parameters is tedious.
Following the success of DL approaches in malicious URL detection and also considering the complexities involved, our research aims at proposing a novel DL‐based malicious URL detection model that adopts the optimal techniques for preprocessing, feature selection and classification. For experimentation, we have collected raw URL samples from various resources that consist of both malicious and benign URL data. Around half a million malicious URLs have been crawled from the PhishTank and ISCX‐URL‐2016 datasets and another half a million benign URLs are obtained from the Kaggle website.
Our study mainly focuses on alleviating the following issues that were prevailing in the existing phishing detection‐based approaches. (i) Reliance on manually extracted URL features that require a third‐party service. (ii) Huge dimensions of the automatically extracted features that lead to training time overhead. (iii) Poor preprocessing of URL data leading to inconsistency in identifying the important aspects of the URL. (iv) The poor response time of the model, which is the consequence of the lack of optimality in the aforementioned issues.
Hence, we have proposed a detection model that incorporates optimal preprocessing and feature selection mechanisms, and at the same time adopts necessary strategies to reduce the training time overhead and improve the response time of the model. The steps involved in the proposed model are depicted in Figure 1.
3.1 | URL preprocessing
Since a DL model accepts input only in the form of numerical vectors, initially the data should be converted into either a single or multidimensional matrix with numerical values. Our model receives input in the form of URL strings in which each character has significance in determining the nature of the URL. Our approach employs an OHE [27]‐based preprocessing mechanism that converts every URL string into a numerical vector with N × M dimension. Each character in the URL is represented by a fixed length vector with M dimensions that consists of a sequence of 0's and 1's representing the position of the character. Depending on the length of the URL, a URL vector of N × M dimension is obtained.
In particular, M denotes the maximum number of distinct characters that might appear in a URL. As suggested in Ref. [28], there are 84 possible characters (a–z, A–Z, 0–9, ‐ . !*’();:&= +$,/?#[]) from which a URL can be formed. Hence we set the threshold value of M to be 84 such that our model captures all kinds of characters for evaluating the URL.
Here, N denotes the length of the URL, that is, the number of characters in the URL. Since URLs differ in length, it is not possible to set the value of N according to the length of individual URLs. Hence, a fixed length is determined for all URLs by calculating the average length of all the URLs. We therefore set the threshold value of N to be 116, which is the mean length of all the URLs in the dataset. If a particular URL exceeds the fixed length value, then the extra characters are trimmed. In contrast, if a URL is shorter than the fixed URL length value, then a zero padding mechanism is adopted in which 0's are appended to the vector.
Finally, after the preprocessing stage, the raw URL string is transformed into a 116 × 84 dimensional matrix with 116 rows and 84 columns, representing each character in a one hot encoded form.
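The preprocessing in Section 3.1 can be illustrated with a minimal sketch. It is not the authors' implementation: the names `URL_ALPHABET`, `CHAR_TO_COLUMN` and `one_hot_encode_url` are hypothetical, and the 84-character alphabet is assumed here to be the unreserved and reserved character sets of RFC 3986, which may differ from the list given in Ref. [28]. Leaving characters outside the alphabet as all-zero rows is a further assumption.

```python
import string

# Assumption: the 84-character alphabet is taken to be the RFC 3986
# unreserved and reserved characters; the paper's own list may differ.
URL_ALPHABET = (
    string.ascii_letters + string.digits   # a-z, A-Z, 0-9 (62 characters)
    + "-._~"                               # remaining unreserved characters
    + ":/?#[]@"                            # general delimiters
    + "!$&'()*+,;="                        # sub-delimiters
)
CHAR_TO_COLUMN = {ch: col for col, ch in enumerate(URL_ALPHABET)}

N_ROWS = 116                 # fixed URL length N from Section 3.1
M_COLS = len(URL_ALPHABET)   # alphabet size M (84 for this alphabet)


def one_hot_encode_url(url, n_rows=N_ROWS):
    """Return an n_rows x M_COLS list-of-lists matrix of 0s and 1s."""
    matrix = [[0] * M_COLS for _ in range(n_rows)]   # zero padding by default
    for row, ch in enumerate(url[:n_rows]):          # trim characters beyond N
        col = CHAR_TO_COLUMN.get(ch)
        if col is not None:   # assumption: unknown characters stay all-zero rows
            matrix[row][col] = 1
    return matrix


encoded = one_hot_encode_url("https://www.hdfcbank.com/")
print(len(encoded), len(encoded[0]))   # matrix shape (rows, columns)
print(sum(map(sum, encoded)))          # number of characters that were encoded
```

Plain nested lists are used only to keep the sketch dependency-free; an array library would be the more usual choice when feeding the matrix to a neural network.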
3.2 | Variational autoencoder (VAE) of the input data. The Helmholtz Machine [30], which was
perhaps the first model to use a recognition model, inspired
After preprocessing the URL inputs, instead of directly feeding VAE. Its wake‐sleep algorithm, on the other hand, was inef-
them to a neural network‐based model for classification, our ficient and did not optimise a single goal. Instead, the VAE
approach adopts a feature reduction/extraction technique to learning rules are based on a single approximation to the
select certain inherent features of a URL to optimise the per- maximum likelihood goal.
formance of the classifier. Reducing the dimensionality of the The basic principle behind the working of a VAE model is
input URL significantly decreases the training time overhead as follows: (I) Input Data is mapped to a latent space using a
associated with the classifier. The process of reducing the neural network. (i) The posterior and prior distribution of the
dimensionality of the input and recognising specific input latent space are modelled as Gaussian distribution. (ii) The
features have certain pitfalls: (a) While reducing the dimen- output of the corresponding neural network is two parameters:
sionality of the input, if suitable reduction mechanisms are not (a) mean and (b) covariance, which are parameters of the
adopted, then there are chances for compromising important posterior distribution. (II) A random sample from the latent
features that may lead to optimism. (b) Recognising valiant space distribution is assumed to generate data that is similar to
features from a URL input involves the consideration of inner the input data. (III) The latent space vector is mapped to the
relationship among the characters which is quite complex. generated data by using another neural network which in turns
Hence the neural network model should possess the aptitude produces a reconstructed output corresponding to the mean of
to properly reduce the dimensions of the input and at the same the Gaussian distribution.
time retain the important aspects of the input features. The overview of VAE is shown in Figure 2. Variational
In order to automatically extract salient features from the Autoencoders is composed of three main components namely,
input URL vector, we developed an AutoEncoder (AE)‐based Encoder, Decoder and Regularised loss function.
feature extraction approach. An autoencoder is a special The encoder takes input training data, say X, and produces
form of feed forward neural network which is mainly designed a latent representation, say Z, which in turn is generated using
to encode the input into a compressed and meaningful representation the parameters of the Gaussian distribution, namely mean (μ) and
and then decode it back such that the reconstructed covariance (Σ). Here, the latent space is stochastic in nature,
input is as similar as possible to the original one [29]. that is, they are the parameters of a probability distribution.
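The stochastic latent space described above can be illustrated with a small sketch of how a latent vector might be drawn once an encoder has produced its Gaussian parameters. This is an illustration under stated assumptions rather than the paper's implementation: the function name `sample_latent` is hypothetical, the covariance is assumed to be diagonal (one variance per latent coordinate), and the numeric values stand in for the outputs of a trained encoder.

```python
import math
import random


def sample_latent(mu, var, rng=random):
    """Draw one latent vector z with z[i] ~ N(mu[i], var[i])."""
    if len(mu) != len(var):
        raise ValueError("mu and var must have the same length")
    # Diagonal covariance (assumption): each coordinate is sampled independently.
    return [rng.gauss(m, math.sqrt(v)) for m, v in zip(mu, var)]


rng = random.Random(0)        # seeded generator for repeatability
mu = [0.5, -1.0, 2.0]         # hypothetical encoder output: means
var = [0.04, 0.25, 1.0]       # hypothetical encoder output: variances
z = sample_latent(mu, var, rng)
print(z)                      # one random draw from the latent distribution
```

Encoding the same input twice therefore yields different latent vectors, which is the sense in which the latent space is stochastic.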
There are two primary components of an Autoencoder The encoding function is given as
model (a) Encoder and (b) Decoder. The Encoder part is
dedicated to compress the original input vector into a reduced Qφ ðZjXÞ → Z ð1Þ
form which will be present in the latent space or sometimes
referred to as the bottleneck layer of the model. From this where Z in equation (1) can be expressed as
intermediate layer, the decoder decodes the compressed data to
reconstruct the original input data. The core principle behind Z → N ðµðXÞ; Σ ðXÞÞ ð2Þ
this model is to minimise the loss incurred by calculating the
difference among the original and the reconstructed data. Here, the latent space follows a standard multivariate
A traditional AE model can be constructed in one of the Gaussian distribution. The probability density function of a
following ways. (a) Fixing the number of units in the hidden multivariate Gaussian is given by
layer to be considerably lower than the number of units in the
input layer. (b) Fixing the number of units in the hidden layer � �
1 1
higher than the input layer units. P ðZ : µ; ΣÞ ¼ exp − ðZ − μÞ Σ ðZ − μÞ
T −1
ð2πÞn jΣj 2
The former case is referred to as an undercomplete
autoencoder in which the dimensions of the hidden layer are ð3Þ
significantly lesser than the dimension of the input layer. The 10
latter case is called an overcomplete autoencoder where the z1
hidden layer has more units than the input layer. Both B : C
With sample vector Z = B C
@ : A , mean vector μ =
the undercomplete and overcomplete autoencoder suffer from
the problem of generalisation since they get confined into the Zn
0 1
process of copying the original inputs directly without learning
meaningful representation from the inputs. This is mainly due 0 1 B σ 2 ρσ σ C
µ1 B 1 1 n C
to the problem of non‐regularisation in latent space of an B C
B : C B C
autoencoder model, that is, data will be distributed in an un- B C and Covariance matrix Σ = B C
@ : A B C
even manner, and certain parts of the latent space do not B C
µn B C
represent the observed pattern leading to severe overfitting. @ ρσ 1 σ n σ 2n A
To avoid the problem of overfitting and non‐regularisation
in the latent space of a traditional AE model, we have adopted Where ρσ 1 σ n are covariance and are considered to be
a Variational AutoEncoder (VAE), a special form of AE model diagonal and in turn expressed as an identity matrix with zero
that is capable of learning smooth latent space representation covariance and unit variance that is, Z ~ N (0, I) for each
dimension and hence the variance for each dimension is one. Hence, to generate samples Z, only the mean vector μ and the variance vector σ² need to be considered; the full covariance matrix is not needed since it is diagonal, and there will be no interaction between the features. Drawing the sample Z directly from the parameters μ and σ is a non‐differentiable operation. Hence it is not possible to back propagate through the VAE model, since gradients cannot pass through this sampling step. To alleviate this issue, a standard technique called reparameterisation is adopted: Z is sampled with the help of an additional parameter, say ϵ, which is drawn from the standard normal distribution N (0, I), multiplied by the standard deviation vector σ and added to the mean vector. Since ϵ is sampled independently of the parameters of both the encoder and the decoder, back propagation becomes feasible.

After applying the reparameterisation technique for sampling, the final sample Z is expressed as

$$Z = \mu + \sigma * \epsilon \tag{4}$$

In order to optimise the learning mechanism of the model, we used a log variance vector instead of a plain variance vector. The log variance vector is represented as

$$\text{Log variance} \rightarrow \log(\sigma^{2}) \tag{5}$$

By rearranging Equation (5), σ becomes

$$\log(\sigma^{2}) = 2\log(\sigma) \;\rightarrow\; \frac{\log(\sigma^{2})}{2} = \log(\sigma) \;\rightarrow\; \sigma = e^{\log(\sigma^{2})/2} \tag{6}$$

Substituting σ in Equation (4) results in

$$Z = \mu + e^{\log(\sigma^{2})/2} * \epsilon \tag{7}$$

Z is sampled from the output of the encoder and given as input to the decoder. The decoding function is represented as

$$P_{\theta}(X \mid Z) \rightarrow \hat{X} \tag{8}$$

The decoder takes the input obtained from the latent space Z and outputs an estimate of the input training data X. In particular, it expands the data from the lower latent dimension back to the higher input dimension and reconstructs the input data. The reconstructed data X̂ is compared against the original input sample X, and the core expectation is that the reconstructed data is as close as possible to the original input data.

The working principle of the VAE architecture is depicted in Figure 3. In order to train the model to generate the desired output, a loss function is calculated based on two terms, namely (a) the data reconstruction loss and (b) a regulariser. In general, the VAE loss function L is expressed as follows:

$$L = -D_{KL}\left[Q_{\phi}(Z \mid X) \,\|\, P_{\theta}(Z)\right] + E_{Q_{\phi}(Z \mid X)}\left[\log\left(P_{\theta}(X \mid Z)\right)\right] \tag{9}$$

The first part of the loss function, $-D_{KL}[Q_{\phi}(Z \mid X) \,\|\, P_{\theta}(Z)]$ in Equation (9), is referred to as the Kullback‐Leibler (KL) divergence loss term, a regulariser which ensures that the distribution parameters obtained from the encoder are as close as possible to the unit normal distribution N (0, I).

In the KL divergence term, $Q_{\phi}(Z \mid X)$ denotes the output of the encoder that provides mean μ(X) and the covariance
vector Σ(X) (constrained to be diagonal) corresponding to the Gaussian, and $P_{\theta}(Z)$ is another Gaussian with zero mean and unit standard deviation. Hence the KL divergence loss term L1 can be denoted as

$$L_1 = -D_{KL}\left[\mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, I)\right] \tag{10}$$

The second part of the loss function, $E_{Q_{\phi}(Z \mid X)}[\log(P_{\theta}(X \mid Z))]$ in Equation (9), is referred to as the reconstruction loss; it is computed from the output of the decoder, that is, the reconstructed input with respect to the sample Z. The difference between the original training samples and the reconstructed samples is calculated with the mean squared error function L2, based on the squared difference between the input and the output, and can be expressed as

$$L_2 = \sqrt{\sum_{i=1}^{d} \left(X_i - \hat{X}_i\right)^{2}} \tag{11}$$

Hence the overall VAE loss function L can be given as

$$L = -D_{KL}\left[\mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0, I)\right] + \sqrt{\sum_{i=1}^{d} \left(X_i - \hat{X}_i\right)^{2}} \tag{12}$$

3.3 | Proposed VAE‐DNN architecture

To automatically extract higher‐level abstract features from the input data and effectively identify malicious URLs, we have proposed a hybrid approach combining a Variational Autoencoder (VAE) and a Deep Neural Network (DNN). The steps involved in the proposed framework are as follows:

1. Initially, the raw URL inputs are preprocessed by converting them into numerical vectors of a fixed dimension using the OHE mechanism.
2. Then, the preprocessed URL vectors are passed as input to the VAE model for an unsupervised learning process. The model undergoes training in which it tries to reconstruct the original input through a series of forward and backward propagation passes. Once an optimal loss is achieved, we extract the dimensionally reduced URL vector samples from the latent/bottleneck space of the VAE model.
3. The extracted low‐dimensional feature vectors are combined to construct a new dataset, which is fed as input to the DNN model for malicious URL detection. The dataset obtained from the VAE model is divided into training and testing datasets and passed into the DNN model for supervised training and classification.

The detailed workflow of the proposed VAE‐DNN framework is depicted in Figure 4.

3.3.1 | Feature extraction using unsupervised learning

In this section, we describe the design and steps involved in reducing the dimension of the input vectors using the VAE architecture. An unsupervised learning methodology is adopted in which the raw URL inputs, converted into OHE‐based vectors during preprocessing, are given as input to the encoder part of the model.

The encoder part of the model is itself a fully connected feed‐forward neural network with one input layer, two hidden layers and an output layer comprising two vectors, namely the mean and the variance. The number of units in the input layer of the encoder is fixed to be 84, since it should match the dimension of the input. The numbers of hidden layer units are fixed to be 70 and 64, respectively. The numbers of units in the output layer of the encoder consisting of the mean and
variance vectors are fixed to be 48, which is double the size of the latent space. The number of latent space units L is fixed to be 24 based on several empirical experiments with respect to the loss function.

Similarly, the decoder part of the VAE model is also a fully connected feed‐forward neural network with one input layer, two hidden layers and an output layer. The number of units in the input layer is fixed to be 24, since the latent space vectors are fed into the decoder for training. Hidden layers one and two comprise 64 and 70 units, respectively, and the output layer consists of 84 units, matching the number of units in the input layer of the encoder.

The training process of the VAE model comprises a sequence of forward and backward propagation passes. In the forward pass, the URL vectors are propagated through the encoder, which produces lower‐dimensional URL samples as output; these are then passed through the decoder, where the lower‐dimensional vectors are reconstructed back to the original input dimension. After the forward propagation, a loss function is calculated based on the difference between the original and the reconstructed output. In order to minimise the loss, backward propagation is performed, in which a suitable optimisation technique is adopted to adjust the parameters of the decoder and encoder as the error is propagated backwards from the decoder to the encoder, and the process repeats.

Hence, for training the VAE model, around one lakh (100,000) URL input samples of dimension 116 × 84 were fed as input to the model. We set the number of epochs to 25. The learning rate is fixed to be 0.001 and the Adam optimiser is used as the optimisation technique. The mean squared error and the KL divergence loss term were used for calculating the loss function.
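The encoder, reparameterised sampling, decoder and loss described above can be put together in a few lines of PyTorch. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the class name `URLVAE`, the helper names, the ReLU activations in the hidden layers, the equal weighting of the two loss terms and the random stand-in batch are all hypothetical. The encoder outputs a log variance, the reconstruction term uses PyTorch's plain mean squared error, and the summed loss is minimised, which corresponds to maximising the expression in Equation (9). Each 116 × 84 URL matrix is passed row by row through layers acting on the last (84‐unit) dimension.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class URLVAE(nn.Module):
    """Hypothetical VAE using the layer sizes quoted in the text: 84-70-64-24."""

    def __init__(self, in_dim=84, h1=70, h2=64, latent=24):
        super().__init__()
        # Encoder: two hidden layers, then 2 * latent outputs (mean and log variance).
        # ReLU is an assumption; the text does not name the VAE activations.
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, h1), nn.ReLU(),
            nn.Linear(h1, h2), nn.ReLU(),
            nn.Linear(h2, 2 * latent),
        )
        # Decoder mirrors the encoder: 24 -> 64 -> 70 -> 84.
        self.decoder = nn.Sequential(
            nn.Linear(latent, h2), nn.ReLU(),
            nn.Linear(h2, h1), nn.ReLU(),
            nn.Linear(h1, in_dim),
        )

    def encode(self, x):
        # Split the 48 encoder outputs into a 24-unit mean and a 24-unit log variance.
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        return mu, logvar

    def reparameterise(self, mu, logvar):
        # Z = mu + exp(log(sigma^2) / 2) * eps, with eps ~ N(0, I).
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = self.reparameterise(mu, logvar)
        return self.decoder(z), mu, logvar


def vae_loss(x_hat, x, mu, logvar):
    # Reconstruction term: mean squared error between input and reconstruction.
    recon = F.mse_loss(x_hat, x, reduction="mean")
    # Closed-form KL divergence between N(mu, sigma^2) and N(0, I),
    # averaged over all latent elements.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    # Equal weighting of the two terms is an assumption.
    return recon + kl


def train_step(model, optimiser, batch):
    # One forward pass, loss computation and backward pass.
    optimiser.zero_grad()
    x_hat, mu, logvar = model(batch)
    loss = vae_loss(x_hat, batch, mu, logvar)
    loss.backward()
    optimiser.step()
    return loss.item()


model = URLVAE()
optimiser = torch.optim.Adam(model.parameters(), lr=0.001)
# Random stand-in for a batch of 8 one-hot encoded URLs of shape 116 x 84.
demo_batch = torch.rand(8, 116, 84)
demo_loss = train_step(model, optimiser, demo_batch)
```

After training, `model.encode(x)[0]` would give a 116 × 24 mean vector per URL, which is one plausible reading of the "features from the bottleneck layer" used to build the reduced dataset.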
After the training process, the feature vectors from the bottleneck layer of the VAE model, of dimension 116 × L (L: latent space size), were extracted to construct a new dataset with dimensionally reduced data.

Once the features were extracted and a new dataset was constructed from the VAE architecture, those data were finally fed into a neural network model for malicious URL detection. In our work, a deep neural network architecture was proposed for effectively classifying malicious URLs. Around one lakh URL samples of dimension 116 × L (L: latent space size) were collected from the VAE model and divided into two parts, that is, 80% of the dataset was allocated for training the DNN model, and the remaining 20% of the data was retained for testing purposes.

The DNN architecture is composed of an input layer, two hidden layers and a probabilistic output layer. The number of units in the input layer was set according to the latent space size L. The hidden layers H1 and H2 were fixed to have 20 and 10 units, respectively. The output layer comprises two units. The input data were divided into equal batches of size 100. The number of training epochs was set to 25, and the learning rate was fixed to be 0.001.

For minimising the loss, the negative log‐likelihood loss function was used. Batch normalisation was performed using a drop‐out mechanism. The SGD optimiser was used for the optimisation of the loss function. The Rectified Linear Unit was used as the activation function at every successive layer. For calculating the probabilistic outcome, the log softmax function was used as the activation function.

Finally, after the completion of training, the DNN model is tested against the testing data samples for malicious URL detection.

4 | RESULTS

4.1 | Dataset description

and defacement). Kaggle is an open source dataset framework that comprises recently updated phishing and benign data from across the Internet. We have collected around 50,000 benign URL samples from this forum (https://www.kaggle.com/xwolf12/malicious-and-benign-websites) for experimentation.

The proposed VAE‐DNN model was evaluated based on the following metrics, namely the confusion matrix, classification accuracy, precision, recall, F1 score, True Positive (TP) Rate (TPR), True Negative (TN) Rate (TNR), False Positive (FP) Rate (FPR) and False Negative (FN) Rate (FNR).

A confusion matrix is a two‐dimensional matrix used to represent the detection accuracy of the model based on the following terms, namely TP, TN, FP and FN. This matrix gives the total number of samples correctly and incorrectly identified by the classification model. Table 1 shows the general form of a confusion matrix.

True Positive refers to the number of benign samples correctly identified as benign. True Negative denotes the total number of malicious samples correctly detected as malicious. False Positive is the number of malicious samples incorrectly identified as benign. False Negative is the number of benign samples incorrectly identified as malicious.

Classification accuracy of the model is calculated as the ratio of the number of correctly identified URL samples to the total number of samples. The accuracy of the model is given by

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN} \tag{13}$$

Precision is one of the important metrics, calculated as the ratio of the total number of correctly identified benign samples to the total number of samples identified as benign.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{14}$$
Recall, also called TP rate (TPR) or sensitivity, is the ratio of the total number of correctly identified benign samples to the actual number of benign samples in the given dataset.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{15}$$

False Positive Rate denotes the proportion of malicious samples incorrectly identified and is calculated based on the following formula,

$$\text{FPR} = \frac{FP}{FP + TN} \tag{16}$$

TNR, or specificity, is the proportion of malicious samples correctly identified and can be denoted as

$$\text{TNR} = \frac{TN}{TN + FP} \tag{17}$$

False Negative Rate denotes the proportion of benign samples incorrectly identified and can be expressed as

$$\text{FNR} = \frac{FN}{TP + FN} \tag{18}$$

The F1 score is the harmonic mean of precision and recall and is given by

$$\text{F1 score} = 2 * \left(\frac{\text{precision} * \text{recall}}{\text{precision} + \text{recall}}\right) \tag{19}$$

4.3 | Experimentation design

The entire experiment was carried out using Google Colab Pro, a setup‐free, cloud‐based Jupyter notebook environment. For our work, we have used the Graphics Processing Unit (GPU) of Google Colab Pro (High RAM). The hardware specifications offered by Google Colab Pro are a Tesla V100 PCIe GPU accelerator with a single‐precision performance of 14 TFLOPS, offering a memory bandwidth of 900 GB/sec, along with an HDD of size 125 GB and a memory size of 25 GB.

We have implemented our VAE‐DNN model using the PyTorch framework, an open source machine learning library for Python which performs tensor computations on graphics processing units for building the different neural network models.

The proposed model comprises two distinct neural network architectures combined for the purpose of feature extraction and classification of input data. The Variational Autoencoder (VAE) plays the vital role of reducing the dimension of the input data along with extracting a higher‐level abstract representation of the original data. The Deep Neural Network (DNN), on the other hand, acts as a classifier by receiving the input features from the VAE model, which are the reduced version of the original input features.

To evaluate the performance of the proposed VAE‐DNN architecture, various experiments have been conducted based on several metrics to assess the accuracy of the model in effectively detecting malicious URLs. The following steps were taken to assess the efficacy of the proposed model.

(a) Initially, the proposed model was compared against a traditional deep neural network model that does not comprise a feature extraction mechanism. This traditional model was trained and tested against the original input data without feature reduction. The final classification results obtained from this model were recorded for comparison against the proposed VAE‐DNN model.
(b) To fix the latent space size 'L' of the VAE model, various experiments were conducted to finalise a fixed length for the bottleneck space of the VAE architecture. The VAE model was separately trained on the entire input data with a range of L values for a fixed number of iterations.
(c) After finalising an optimal L value for the VAE model based on the experimental results, the dimensionally reduced feature vectors from the latent space of the model were given as input to the DNN for malicious URL detection. The performance of the model was assessed based on various metrics, namely confusion matrix, precision, recall, F1 score and accuracy.
(d) To highlight the advantages of incorporating the VAE model for feature selection, various AE‐based neural network models were also considered for the feature selection mechanism. Hybrid architectures combining different AE models with the DNN were evaluated, and the results were compared against the proposed VAE‐DNN model. The metrics used for evaluating the different AE‐based models were TP rate (TPR), FP rate (FPR), TN rate (TNR) and FN rate (FNR).
(e) Finally, to evaluate the responsiveness of the model in identifying malicious URLs, a random set of sample URLs was taken from the real‐world environment and tested against the model. The response times of the different AE‐based DNN models were taken for evaluation.

4.4 | Results

In this section we describe the process involved in analysing the performance of the proposed model. Various metrics have been used to assess the efficiency of the model with respect to malicious URL detection. Initially, the performance of the model is measured based on the confusion matrix and classification accuracy. Then, the model's performance is compared against a traditional neural network model. In addition, different autoencoder models were taken into consideration for experimentation, and their detection accuracy is assessed based on various metrics, namely precision, recall and F1 score. Finally, an ROC curve is plotted to compare the proposed model with all the other experimented models.
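The classifier stage used in these experiments (input layer sized to the latent space, hidden layers of 20 and 10 units, two output units, ReLU activations, drop‐out, log softmax with the negative log‐likelihood loss, SGD with learning rate 0.001 and batches of 100) could be sketched in PyTorch as follows. This is an illustrative sketch only: the names `URLClassifier` and `train_epoch`, the drop‐out probability of 0.2 and the treatment of every sample as a single flat feature vector are assumptions, since the text does not state how the 116 rows of each URL are aggregated.

```python
import torch
import torch.nn as nn


class URLClassifier(nn.Module):
    """Hypothetical DNN classifier: in_features -> 20 -> 10 -> 2."""

    def __init__(self, in_features=24, h1=20, h2=10, n_classes=2, p_drop=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_features, h1), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(h1, h2), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(h2, n_classes),
            # Log softmax output, to be paired with the negative log-likelihood loss.
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, x):
        return self.net(x)


def train_epoch(model, features, labels, optimiser, batch_size=100):
    """Run one pass over the data in mini-batches and return the mean loss."""
    model.train()
    loss_fn = nn.NLLLoss()
    total = 0.0
    for start in range(0, features.shape[0], batch_size):
        xb = features[start:start + batch_size]
        yb = labels[start:start + batch_size]
        optimiser.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimiser.step()
        # Weight each batch loss by its size so the final value is a per-sample mean.
        total += loss.item() * xb.shape[0]
    return total / features.shape[0]


# VAE-DNN variant: 24 latent features as input.
classifier = URLClassifier(in_features=24)
sgd = torch.optim.SGD(classifier.parameters(), lr=0.001)

# Baseline without feature extraction: same network with 84 input units.
baseline = URLClassifier(in_features=84)
```

Keeping the input width as a constructor argument makes the comparison in step (a) a one‐line change, so both variants share the same hidden layers and training loop.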
4.4.1 | Hyper‐parameters of VAE architecture

Our work focuses on deploying a VAE for extracting a dimensionally reduced higher‐level representation of the raw URL inputs. As described in Section 3.2, the VAE model is composed of an encoder part, a latent space and a decoder part. During the training process, the model tries to reconstruct the original input through a series of forward and backward propagation passes. Once the training is completed, the features from the latent space of the model are fed as input to the deep neural network for further classification. The main purpose of training a VAE model with the original input features is to reduce the dimensionality of the input along with extracting significant features.

Hence, the number of units in the latent space heavily determines the performance of the classifier, since it is also the number of units in the input layer of the DNN. To identify an optimal latent space size L, experiments were conducted with a range of values, and a suitable L was chosen based on the final loss obtained after a fixed number of iterations during training.

The value of L was set to 5, 10, 24, 48 and 64. Since the original input URL is of dimension 116 × 84 after preprocessing, the values taken for L were kept below 84. For each value of L, the model was trained for a fixed 25 epochs and the final loss was assessed. Figure 5 shows the loss curves obtained for VAE models with different latent space sizes. For L = 5 and L = 64, the final loss achieved after 25 epochs does not reach an optimal value when compared with the other L values.

For latent space sizes L = 10 and L = 48, although the initial loss value obtained for the first epoch was considerably low when compared to the other L values, the final loss value achieved was not as expected. For latent space size L = 24, the optimal loss value was achieved, reaching a final loss of 0.05.

Hence, based on the experimental results, the final latent space size L was fixed to 24, since it reaches an optimal loss value after 25 epochs.

Apart from the latent space size, the number of units in the input and output layers of the VAE was fixed as 84. For calculating the loss value, the MSE loss function and the KL divergence term were used. The learning rate was fixed as 0.001. For optimising the loss function, the Adam optimiser is used.

4.4.2 | Performance analysis of the VAE‐DNN model

After training the VAE model, a dataset is constructed based on the dimensionally reduced features extracted from the bottleneck layer of the VAE and fed as input to the DNN for training and testing. Around one lakh URLs of dimension 116 × 24 were split into training and testing data in an 80:20 ratio. A total of 79,745 URL samples were taken for training, and the remaining 19,913 samples were allotted for testing the model. The number of training epochs was fixed to 25, and the learning rate was fixed to 0.001. The structure of the DNN classifier is already described in Section 3.3.2.

Figures 6 and 7 depict the loss curve and accuracy curve of the VAE‐DNN model. The final loss achieved after 25 epochs during training and testing was below 0.1. The classification accuracy of the model during training reached a maximum of 98.52% and the testing accuracy reached a maximum of 97.45%. The total time taken for training the VAE‐DNN model was around 268 s.

In order to assess the detection accuracy of the model with respect to both benign and malicious samples, a confusion matrix was constructed to identify the total number of TP, TN, FP and FN samples. Among the 19,913 samples, our model was able to correctly identify 19,423 URL samples, leading to a detection accuracy of 97.45%. In the testing dataset, the total number of benign samples was 10,002, and the number of malicious samples was 9911.

Table 2 shows the confusion matrix of the VAE‐DNN model. As can be inferred from the table, out of the 19,913 samples, exactly 490 URL samples were wrongly identified. Of these, 280 benign samples were wrongly identified as phishing samples, and 210 phishing samples were misinterpreted as benign samples.

In order to further assess the performance of the proposed model, a comparative analysis of VAE‐DNN is performed by constructing a model that is composed of only a deep neural network; the autoencoder part that was integrated in our work for feature extraction is not included in this architecture. This newly constructed DNN model is composed of one input layer with 84 units, two hidden layers with 20 and 10 units, respectively, and an output layer with two units. This DNN model is structured in almost the same way as the DNN in our proposed work, except for the number of units in the input layer, which was set to 24 in our case.

The main intention of this experiment is to check whether feature extraction based on VAE plays a vital role in effectively classifying the input URLs. Hence, the feature extraction part is taken out of the picture, and the DNN part alone is considered for classification with all the preprocessed original input URLs of dimension 116 × 84. The model is trained and tested under the same conditions and with a similar hyper‐parameter configuration to those under which our proposed model was implemented.

Figure 8 shows the detection accuracy of the DNN model without the feature extraction mechanism.

As can be inferred from Figure 8, the maximum accuracy achieved by the model during the training phase was 92.5%, and while testing the model it reached a highest accuracy value of 89.95%. Hence, from the results, it can be seen that our proposed model provides better results in terms of accuracy when compared with the newly constructed model that does not comprise a feature extraction mechanism. Hence it can be summarised that the role of the VAE model in extracting higher‐level abstract features from the inputs leads to effective malicious URL detection.
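As a worked check of Equations (13) to (19), the metrics can be recomputed from the counts quoted above. The sketch below is illustrative only: the function name `classification_metrics` is hypothetical, and the four confusion‐matrix cells are derived from the prose (10,002 benign samples with 280 misclassified, 9911 malicious samples with 210 misclassified) under the paper's convention that benign is the positive class. Because the inputs are taken from the text rather than from Table 2 itself, the printed values are a consistency check and may differ slightly from the rounded figures reported.

```python
def classification_metrics(tp, tn, fp, fn):
    """Evaluate Equations (13)-(19) for one confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,                      # also TPR / sensitivity
        "fpr": fp / (fp + tn),
        "tnr": tn / (tn + fp),                 # specificity
        "fnr": fn / (tp + fn),
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts taken from the prose; benign is treated as the positive class.
benign_total, malicious_total = 10_002, 9_911
fn = 280   # benign samples wrongly identified as phishing
fp = 210   # phishing samples wrongly identified as benign
tp = benign_total - fn
tn = malicious_total - fp

metrics = classification_metrics(tp, tn, fp, fn)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

By construction, recall and FNR sum to one, as do TNR and FPR, which gives a quick sanity check on any reported table of rates.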
FIGURE 5 Experimentation with different latent space size for optimisation of loss value.
TABLE 4 True Positive Rate (TPR), False Positive Rate (FPR), True Negative Rate (TNR) and False Negative Rate (FNR) values for all the models

| Algorithms | True positive rate (%) | False positive rate (%) | True negative rate (%) | False negative rate (%) |
|---|---|---|---|---|
| AE‐DNN | 90 | 7.08 | 92.92 | 10.01 |
the input data resulting in a localised space contraction, yielding a robust feature on the activation layer.

Tables 3 and 4 depict the performance of all the experimented models with respect to various metrics, namely precision, recall, F1 score, accuracy, TP rate, TN rate, FP rate and FN rate.

The results from Tables 3 and 4 suggest the following inferences. In terms of detection accuracy, the VAE‐DNN model outperforms all the other experimented models with the highest accuracy of 97.45%. The traditional (vanilla) AE‐DNN model performed the worst in terms of detection accuracy. In terms of precision and recall, both VAE‐DNN and contractive AE‐DNN produced close results, which were the highest among all the models, reaching above 96%.

Both the Denoising AE‐DNN and the Convolutional AE‐DNN achieved high precision values, almost reaching 96%, but in terms of recall both models tend to produce average results. Regarding the F1 score, apart from the traditional and deep AE‐based neural network models, all the other models achieved a maximum of 95%, with the highest value achieved by our proposed model, reaching 97.54%.

Although all the experimented models produced good results in terms of accuracy and F1 score, it is necessary to assess each model's ability to effectively identify malicious URLs, which can be assessed based on two metrics, namely the FP rate and the FN rate. As can be observed from Table 4, VAE‐DNN has the lowest false alarm rate of 2.19% and FN rate of 2.80% when compared to all the other models in the experimentation. This result indicates that the VAE‐DNN model is considerably better in terms of classification accuracy as well as in accurately detecting the input URLs based on their categorisation.

The AUC score was computed for all the experimented models in order to further validate the strength of the proposed model and is depicted in Table 5.

TABLE 5 Area under curve (AUC) score for all the experimented models

| Algorithms | AUC score |
|---|---|
| Vanilla AE‐DNN | 0.9351 |
| Deep AE‐DNN | 0.9415 |
| Sparse AE‐DNN | 0.9555 |
| Denoising AE‐DNN | 0.9607 |
| Convolutional AE‐DNN | 0.9699 |
| Contractive AE‐DNN | 0.9789 |
| Variational AE‐DNN | 0.9858 |

The Area under Curve is used to measure the capability of a model in discriminating between classes. The higher the AUC value, the better the outcome of the prediction. VAE‐DNN achieves the maximum AUC score of 0.9858, which is higher than that of all the other models.

Figure 9 represents the AUC‐ROC curve. This curve plots the tradeoff between the TP rate and the FP rate at different classification thresholds. The AUC score is calculated by measuring the entire two‐dimensional area beneath the ROC curve. It provides an aggregate measure of a model's performance across different threshold values.
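To make the relationship between the ROC curve and the AUC score concrete, the following plain‐Python sketch sweeps the decision threshold over a set of classifier scores, records the (FPR, TPR) pair at each threshold and integrates the resulting curve with the trapezoidal rule. The function names `roc_points` and `auc` and the toy scores are hypothetical and are not taken from the paper; the sketch assumes binary labels with 1 for the positive class and at least one sample of each class.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by lowering the threshold over the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    # Rank samples from the highest score to the lowest.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under a piecewise-linear ROC curve (trapezoidal rule)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Toy example: six samples, three positive and three negative.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = auc(roc_points(toy_scores, toy_labels))
```

A classifier that ranks every positive sample above every negative one yields an area of 1, while one that cannot separate the classes yields about 0.5, which is why curves that stay close to one across thresholds indicate a more consistent model.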
FIGURE 9 AUC‐ROC curve for the experimented models
From Figure 9, it can be seen that the traditional AE‐based models are not consistent in classifying the data across different threshold values, as there is a clear fluctuation in the curves of both the TVAE and Deep AE models. Among the other AE‐based approaches, the contractive AE‐based model performed better at varying thresholds; however, our VAE‐DNN proved to be much more consistent than all the other models, as it remained closer to the value one for the maximum number of threshold values.
The response time of a model is an important measure to consider when using it in a real‐time context, because the deployed model must be able to respond to URLs within a very short time. A model's response time can be calculated as the difference between the time at which the URL is fed to the model and the time at which the predicted result is produced. Once the URL is fed into a model, it goes through various steps, including URL preprocessing, feature selection and prediction of whether the URL is benign or phishing.
Accordingly, a study was carried out to evaluate the tested models in terms of response time. Table 6 depicts the findings of the investigation. A random set of URLs from both the training and testing sets was fed into all of the experimented models, and the response time was calculated for each model.
Table 6 suggests that the traditional VAE model has the lowest response time of 1.3 s owing to the simplicity of its structure. Our proposed VAE‐DNN model achieved a response time of 1.9 s, making it the second fastest among all the models. Convolutional AE‐DNN took the longest time to respond owing to the complexity of its structure. The runtime achieved by VAE‐DNN is optimal and is considered suitable for deployment in a real‐world environment.
4.5 | Discussion
Based on the experimental analysis, a few insights have been inferred regarding the impact of the proposed VAE‐DNN model on malicious URL detection. The adoption of the OHE mechanism for converting URL data to numerical vectors significantly improves the performance of the model, since all the possible characters required to form a URL were taken into consideration, and a two‐dimensional vector is formed based on the average length of the URLs and the fixed number of possible characters. As we fixed 84 as the length of each row in the constructed matrix, each URL is represented by an L × 84 dimensional vector, which is quite large for a neural network model to process. However, this technique creates a vector that allows space for all kinds of possible characters to be represented in the matrix, which helps the neural network models to extract inherent features effectively and classify URLs optimally.
To analyse the impact of the VAE model on the final classification of URLs, we experimented with different autoencoder models and observed the final results obtained by the classifier when each AE model was adopted as the feature extractor. The results suggest that, among all the AE models selected for the feature extraction process, the VAE model produced the best results in terms of various metrics, namely precision, recall, F1 score etc. Also, the VAE‐DNN approach was compared against the traditional DNN classifier that does not employ a feature extraction mechanism, and the results suggested that our approach delivers 5% more accuracy than the standalone classifier. To finalise the latent space size of the VAE, various experiments were conducted in accordance with the final loss obtained, and the optimal latent space size of 24 was fixed to reduce the dimensionality of the input features.
Our approach eradicates the complexity involved in manual feature engineering and the reliance on third‐party services to extract certain features. Our model simply accepts raw URL data, which can be preprocessed and dimensionally reduced to abstract higher‐level features of the URL that are easily extracted using the VAE. Although our model performs better in terms of classification accuracy, when exposed to a random set of URLs after being trained, the response time of the model is not quite optimal, since it took roughly 2 s to respond to a URL, which should definitely be improved.
5 | CONCLUSION AND FUTURE WORK
To overcome the complexities involved in identifying whether a particular website is legitimate or not, our work explores a DL‐based phishing detection mechanism that combines the
strength of two different neural network models to effectively detect the malicious nature of a particular URL. Inspired by the reconstruction ability of the autoencoder model, which can learn a better representation of the input data while also reducing the dimensionality of the input, we have adopted the variational autoencoder (VAE) model for feature extraction. Since DL models only accept numerical vectors, the URLs, which are originally strings, have been converted into numerical matrices using the OHE mechanism. Finally, to classify the URL data as either benign or malicious, a deep neural network (DNN) classifier has been used. Our work mainly employs the VAE model for extracting abstract higher‐level features from the raw URL and the DNN model for classifying the URL. Experimental results demonstrate that the proposed model achieves a maximum accuracy of 97.85%, which is significantly higher than all the other experimental DL models. The novelty of the proposed model is the automatic extraction of inherent features from the URL that reveal the character‐level inner relationships existing within the individual URL, which heavily impacts the performance of the classifier. Although the detection accuracy of the proposed model is quite high, the model exhibits a relatively high FP rate of 2.19%, which needs to be reduced further. To overcome this issue, in our future work we plan to incorporate a generative modelling technique, which will allow us to generate fake URLs that resemble the original URLs and to train the model with a dataset comprising both original and generated URLs, so that our model is exposed to further variants of the URLs, which might help in reducing the false alarm rate of the model.
AUTHOR CONTRIBUTIONS
Manoj Kumar Prabakaran: Conceptualization; Formal analysis; Investigation; Methodology; Software; Writing – original draft; Writing – review & editing. Parvathy Meenakshi Sundaram: Project administration; Supervision. Abinaya Devi Chandrasekar: Formal analysis; Validation; Visualization.
CONFLICT OF INTEREST
The authors declare no conflict of interest.
DATA AVAILABILITY STATEMENT
Data openly available in a public repository that issues datasets with DOIs.
ORCID
Manoj Kumar Prabakaran https://orcid.org/0000-0001-8814-0793
Parvathy Meenakshi Sundaram https://orcid.org/0000-0002-1600-9136
Abinaya Devi Chandrasekar https://orcid.org/0000-0002-3736-6386
How to cite this article: Prabakaran, M.K., Meenakshi Sundaram, P., Chandrasekar, A.D.: An enhanced deep learning‐based phishing detection mechanism to effectively identify malicious URLs using variational autoencoders. IET Inf. Secur. 17(3), 423–440 (2023). https://doi.org/10.1049/ise2.12106