base paper
6, 2022.
Digital Object Identifier 10.1109/ACCESS.2021.3137636
ABSTRACT Phishing attackers spread phishing links through e-mail, text messages, and social media
platforms. They use social engineering skills to trick users into visiting phishing websites and entering
crucial personal information. In the end, the stolen personal information is used to defraud the trust of
regular websites or financial institutions to obtain illegal benefits. With the development and applications
of machine learning technology, many machine learning-based solutions for detecting phishing have been
proposed. Some solutions are based on the features extracted by rules, and some of the features need to rely on
third-party services, which will cause instability and time-consuming issues in the prediction service. In this
paper, we propose a deep learning-based framework for detecting phishing websites. We have implemented
the framework as a browser plug-in that determines in real time whether a web page poses a phishing risk
when the user visits it and, if so, displays a warning message. The real-time prediction service combines
multiple strategies to improve accuracy, reduce false alarm rates, and reduce calculation time, including
whitelist filtering, blacklist interception, and machine learning (ML) prediction. In the ML prediction
module, we compared multiple machine learning models using several datasets. From the experimental
results, the RNN-GRU model obtained the highest accuracy of 99.18%, demonstrating the feasibility of
the proposed solution.
INDEX TERMS Phishing detection, machine learning, deep learning, RNN-GRU, web browser extension.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
VOLUME 10, 2022 1509
L. Tang, Q. H. Mahmoud: Deep Learning-Based Framework for Phishing Website Detection
an empirical threshold. Academic research reports show that the number of effective rules is within 100 [5], [6]. Cybercriminals can also develop new attack strategies based on these rules. The rules are interpretable, and the logic of the rules is limited, so detection methods based on the rules can be easily cracked and exploited by attackers. For example, whether a URL’s scheme is HTTPS is a feature used in many research papers and assigned high importance. However, the APWG report showed that an average of 83 percent of phishing websites used the HTTPS scheme in the first quarter of 2021 [3].
With the rapid development of machine learning, there are more and more applications in the field of cybersecurity. Some scholars and experts have proposed solutions for detecting phishing links based on machine learning, and many academic journal articles show that machine learning-based solutions have achieved high accuracy [7]–[10]. However, in the application scenario of a real-time environment, there are still many challenges. For instance, the real-time system requires the response time of the predictive service to be on the order of milliseconds; a high false-positive rate will affect user experience and user trust.
In this paper, we propose a deep learning-based framework to detect phishing links in a real-time web browsing environment. We developed a browser plug-in to receive client information, call the background prediction service, and show the prediction results to users. When the URL of the current tab of the browser is predicted to be a phishing link, the current page will receive an obvious warning prompt. The prediction result is obtained by the core prediction service calling a trained machine learning model. We introduced multiple models with multiple data sets for comparison and backup. It is concluded from the experimental results that the RNN-GRU model obtains the highest accuracy rate of 99.18%, which is better than SVM, Logistic Regression, and Random Forest. The contributions of this paper are:
1) A deep learning-based framework for detecting phishing URLs. We trained and tested the models using seven custom datasets generated from four existing data sources, and we achieved the highest accuracy of 99.18% with the RNN-GRU model.
2) A prototype implementation of the proposed framework as a Chrome browser extension.
We organized the rest of the paper as follows: Section II summarizes the related work focusing on deep learning models and real-time frameworks. Section III presents the design and architecture of the proposed framework. Section IV discusses the prototype implementation, including some open-source frameworks, services, and tools that we have utilized. Experimental results and analysis are reported in Section V. Finally, Section VI concludes the paper and offers ideas for future work.

II. RELATED WORK
Phishing attacks represent a serious problem, and the technology for detecting and intercepting phishing attacks is constantly evolving. Filtering good URLs through a whitelist and blocking phishing URLs through a blacklist is the most accurate and fastest approach. However, the list method cannot detect new phishing links, and because of the low cost of creating a phishing URL, the attacker does not rely on using the same phishing link multiple times.
Many research reports based on machine learning have been published, and high accuracy results have been obtained in experiments. However, in the actual network environment, there are still many victims of phishing attacks every year, causing economic losses. There is still a certain gap between the experimental data results and real network security solutions. Therefore, it is very important to study anti-phishing solutions in a real-time environment.
We divide the related work into two parts: (1) deep learning-based methods for detecting phishing websites; (2) frameworks with prototype implementations.

A. DEEP LEARNING-BASED METHODS
In this part, we review some state-of-the-art deep learning-based solutions for phishing website detection.
Bu and Cho [11] proposed a deep autoencoder model to detect zero-day phishing attacks and obtained 97.34% accuracy. They extracted character-level features from URL strings and executed experiments on three different datasets collected from Phish Storm [2], ISCX-URL-2016 [12], and Phish Tank [13]. They used receiver-operating characteristic curve analysis and N-fold cross-validation to evaluate the experimental results. Comparing the root mean square error (RMSE) in the reconstruction phase between legitimate URLs and phishing URLs, they found the RMSE increased significantly for the phishing URL.
Somesha et al. [14] introduced deep learning models for detecting phishing websites only using ten features extracted from HTML and a third-party service. They compared three deep learning models and calculated 18 features’ weights. The experimental results demonstrated that the Long Short-Term Memory (LSTM) model achieved the highest accuracy of 99.57%. However, they only used one published dataset with 3526 instances. The dataset is obviously too small for deep learning training. The high accuracy rate in the experimental results may be due to the uneven distribution and poor diversity of the test data.
Adebowale et al. [15] combined the convolutional neural network (CNN) and long short-term memory (LSTM) algorithm to classify phishing websites. The hybrid classifier obtained an accuracy of 93.28% and an average computational time of 25 s by using image, frame and text features. They collected URLs from Phish Tank and Common Crawl and extracted image features from URLs. The image features are fed to the offline CNN model, and the text features are fed to the LSTM classifier. The innovative point of this solution is to combine the characteristics of pictures and text. However, from the experimental results, there is still room for improvement in the accuracy rate, and the
FIGURE 1. Architecture of the deep learning-based framework for detecting phishing URLs.
calculation time is too long to meet the requirements of real-time prediction products.

B. FRAMEWORKS AND SYSTEMS
When detecting whether a webpage is at risk of phishing attacks, the core service is a prediction service based on machine learning. The response time of the predictive service is the most important indicator to measure the feasibility of this real-time system.
Atimorathanna et al. [16] introduced an anti-phishing protection system, which consists of a web browser extension, an e-mail detection plug-in, filters, and a machine learning-based phishing detecting server. The browser extension is used to extract the current URL, capture a screenshot, and store the user’s visit history as a profile on the client-side. The server mainly uses the following processes to detect phishing links: (1) using the blacklist and whitelist of third-party services to filter new URLs; (2) using a machine learning model based on 13 features to predict whether the URL is a phishing link; (3) using computer vision technology to detect website logos and compare the similarity of screenshots of web pages. The logo detector in the article is used to identify 20 well-known online banks and some commonly used website logos.
The authors collected and established their own database for the training of the logo detection model and obtained an accuracy rate of more than 95%. The comparison of the similarity of the two screenshots uses Python’s OpenCV library. The experimental results of the URL analyzer showed that the Random Forest classifier achieved the highest accuracy of 96.257%. It is a complete online real-time detection system for phishing, combining multiple methods to protect users from being attacked effectively. However, there is still room for improvement in the machine learning model’s performance, and the number of logos that the logo classifier can detect is too small.
Maurya et al. [17] introduced an anti-phishing system, which contains a web browser extension. The browser plug-in obtains the current URL in real-time and extracts features based on the DOM structure, then detects whether there is a risk of phishing attacks and prompts the user. The detection service is divided into three stages, namely whitelist matching, blacklist filtering, and prediction based on a machine learning model. The prediction phase determines the URL that meets the criteria as a phishing link based on character-level features. For example, there are no hyperlinks to the web page, and the number of hyperlinks to external domain names exceeds a certain percentage. Such rules are vulnerable to attackers, and some normal URLs are likely to be misidentified. In addition, the authors improve accuracy by combining three basic classification models.
Shah et al. [18] presented a machine learning-based browser extension for detecting phishing URLs. They trained the Random Forest model using the UCI dataset, which contains 11,055 instances with 30 normalized features. Therefore, it is required to extract features based on the current URL string in a real-time environment. In the article, the authors extracted 16 features that do not rely on third-party services. Experimental results show that the accuracy rate is 89.6%, which has a lot of room for improvement.
Sundaram et al. [19] built a Chrome browser extension for phishing website detection. They used the UCI data set to
train the model and packaged the trained model into a browser extension. The article did not describe the implementation and results in detail, nor does it give the average calculation time for real-time detection of a URL. However, the feature extraction process relies on third-party domain name services.
Abiodun et al. [20] developed a website to verify whether a link is a phishing URL or not. The detector was implemented in the Java programming language with a library named JSoup HTML Parser (JHP). This solution is mainly divided into three stages. The first is to use JSoup to parse the DOM structure of the website to be detected. The second is to count the link tags <a> in the DOM structure and analyze the value of the “href” attribute. The attribute value is classified as an empty link, an external link, or an internal link. Third, the link calculator computes an indicator, which has a value between 0 and 1. When the value exceeds 0.8, the URL to be verified is considered a phishing link. Since no machine learning model is introduced, there is no training process. In the experiment, the authors used 300 URLs to test the performance of the link calculator. The testing results showed they achieved 99.97% accuracy and a 0.03 false-negative rate. They will need to use a larger test data set to verify this solution in the future. From the analysis, it is unreliable to judge the phishing risk by analyzing the characteristics of the link tag from the website source code alone, and it is easy for attackers to circumvent this rule.
A web browser architecture with an intelligent engine for phishing website detection named EPDB is presented in [21]. Compared to the traditional web browser architectures, the EPDB integrates an intelligent engine with a machine learning model for detection in a real-time environment. They used the UCI dataset to train machine learning models. In the predictive process, a rule-based extraction framework is applied, which can extract 30 features of a website. The experimental results showed the Random Forest classifier obtained the highest accuracy of 99.36%. Although the accuracy of the experimental data is very high, this solution also has some limitations and challenges. First, developing a browser is a highly complex task. Some functions of the browser need to be compatible with mature browser functions before they can be promoted to users. In addition, the model is trained on a single data set, and the robustness of the model needs to be verified again. Finally, the rule-based feature extraction framework relies on third-party services.

III. FRAMEWORK DESIGN
Figure 1 depicts the architecture of the components of our proposed framework. There are four modules in terms of data collection tasks, machine learning (ML), cloud application, and web browser extension. The data collection module is an independent scheduled task application. The ML module is used for training models. The web browser extension is a client-side product. The cloud application is built to deal with false alarms and phishing URLs reported by users from the web browser extension. The orange lines with arrows in the figure show the data interaction process.
The core process of this framework is mainly divided into the following six steps: the first is to collect and integrate data from various data sources; the second is to combine different data sets for machine learning model training, and store the trained model in a file system; the third is that the interface for predicting phishing risk calls the trained model to make predictions; the fourth is that the browser extension calls the prediction interface to perform real-time detection and display the detection results; the fifth is that users can submit real-time feedback when they disagree with the detection results, such as a misjudgment or a missed alarm; finally, the report submitted by the user is verified through manual review and automatic review strategy, and the verification result is synchronized to the data set.

A. DATA COLLECTION TASKS
Data is the core of the field of machine learning. The quality and quantity of data significantly impact the performance of machine learning-based modules [22]. The data collection module is the foundation of this system. A data collecting task is divided into two parts: obtaining data from different data sources, then analyzing and storing data.
We collected data from different open sources shown in Table 1. The Phish Storm [2] dataset contains 96,018 URLs: 48,009 legitimate URLs and 48,009 phishing URLs. The ISCX-URL2016 [12] dataset contains 35,378 legitimate URLs and 9,965 phishing URLs. We loaded around 350,000 benign URLs from an open Kaggle project [23]. In addition, we initially collected 400,000 data and regularly grabbed new data from the Phish Tank platform [13] every day.

TABLE 1. Data sources.

We analyzed the basic structure of the URL and parsed out basic information such as protocol, domain, subdomain, top-level domain, and path [24]. Table 2 presents the major fields of a table named URL. We stored data in a relational database, as it is flexible and efficient for providing data services by reading based on SQL. These data services can combine multiple data sets. For example, select 20,000 phishing links from Phish Tank and 20,000 good links from Kaggle, and combine them into a balanced data set with 40,000 instances.

B. MACHINE LEARNING
The machine learning module is mainly responsible for model training and model testing. In this framework, the data
of the training model is updated regularly, and the training and testing processes of all models are automatically and regularly triggered. The system will record each run’s parameters and data collection types and dump the model to the file storage system. It is flexible to add new models to the ML module. This research developed six machine learning models, namely Logistic Regression, Support vector machines (SVM), Random Forest, RNN, RNN-GRU, and RNN-Long short-term memory (LSTM).

FIGURE 2. The characters dictionary with 100 ASCII characters which are widely used in URLs.
TABLE 2. The database table URL’s structure.

1) PARAMETER CONFIGURATION
The parameter configuration process initializes the model parameters according to the configuration file. The configuration file includes a parameter grid corresponding to each model, and each parameter has a discrete number of values. In the model training process, one of the permutations and combinations of these parameter values will be selected for each training. When all the combinations are applied to the model and the training is completed, the optimal parameter combination can be obtained by comparing the accuracy of the model.
2) DATA LOADING
The dataset used for model training is obtained from the database through the data service. The data service supports the flexible selection of different data source combinations and datasets of varying data volumes. Each data instance contains a URL string and a label that indicates whether the URL is a phishing link or a legitimate link. The label values are normalized as 1 and 0.
3) FEATURE EXTRACTION
We treat the URL string as a document containing semantics and apply the Natural language processing (NLP) technology to extract features. The feature extraction process converts a collection of text documents to a matrix of token counts, and each token stands for one word. In classical machine learning models, the tokenization process converts a URL string to a list of words. Therefore, the number of features equals the vocabulary size found by analyzing the data.
In deep learning models, the tokenization process parses a URL string to a list of characters (character-level tokens). The characters in the URL come from the ASCII character set. We chose the most common 100 characters as the character set dictionary for this study. Figure 2 shows all the arranged characters and the corresponding index.
The maximum length of a URL is 2083 characters [24]. Because of the calculation time of the deep learning model and the analysis of the statistical data of the existing data set, we set the maximum number of URL characters to 200. Therefore, each URL can be transformed into a 200 × 100 matrix. The position of the dictionary corresponding to each character is marked as 1, and the remaining values are 0. Figure 3 shows the process of forming a matrix using Google’s official website as an example.

FIGURE 3. The process of creating a feature matrix from a URL string and the character dictionary.

4) MODELLING
One solution is to treat a URL as a document and use character separators to parse words as features. However, many words in URLs lack semantics. Moreover, word-level analysis results in an extensive dictionary that slows the calculation. Therefore, we choose to analyze at the character level and treat the characters of the entire URL as a sequence. The
recurrent neural network (RNN) is a feedback neural network that stores temporary states. It’s suitable for training sequence data [25]. Figure 4 shows a regular RNN architecture that consists of an input layer, several hidden layers and an output layer. Compared to the feedforward artificial neural networks (ANN), RNNs have a unique architecture with a connection function between neurons in hidden layers. The figure shows that the current hidden state is related to the previous hidden state and the current input. The current hidden state’s functional form can be represented as Eq. (1) and (2). The tanh is a nonlinear function, W represents the weights between the neurons, and b is the bias vector of the setting. The softmax calculates the output value as an activation function, as shown in Eq. (3), and the model prediction value is related to the current hidden state.

$$h_t = f_W(h_{t-1}, x_t) \tag{1}$$
$$h_t = \tanh(W_{hx} x_t + W_{hh} h_{t-1} + b_h) \tag{2}$$
$$Y_t = \mathrm{softmax}(W_{yh} h_t + b_y) \tag{3}$$

The scenario that detects the phishing link is a many-to-one task type, the input is character-level sequence data, and the output is a category. Figure 5 shows the structure of one hidden layer.
training is the process of optimizing the weights parameters by calculating each error. First, randomly initialize the weights matrix, then calculate the difference between the actual value and the predicted value, then use the optimized algorithm to find the optimal solution to minimize the difference, and finally adjust each weight by calculating the step each time.
Depending on the architecture of RNN and activation functions used, the basic RNN architecture does not perform well for handling inputs for long sequences because of the vulnerability to gradient vanishing or exploding problems [26]. To address these, Hochreiter and Schmidhuber introduced a gradient-based model named long short-term memory (LSTM) in 1997 [27]. They invented a long short-term memory unit instead of tanh function to compute hidden states. The LSTM unit consists of three gates and two memory cells. Cho et al. proposed a novel model with a hidden unit, which was motivated by LSTM in 2014 [28].
Since the hidden unit contains two gates to control and calculate the hidden state, this model is also named gated recurrent unit (GRU). Figure 6 demonstrates the architecture of the gate units. It can be said that long short-term memory network (LSTM) and gated recurrent unit (GRU) are two enhanced versions of RNN. Many studies and experimental data show that for sequence data training, the LSTM and GRU architecture can achieve better performance than the basic RNN architecture [29]–[31].
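To make Eq. (2) and Eq. (3) concrete, the following is a minimal pure-Python sketch of a many-to-one forward pass of the basic RNN, not the PyTorch models trained in this work. The function names `rnn_forward` and `random_matrix`, the zero initial hidden state, the layer sizes, and the random weights are all assumptions chosen for illustration.

```python
import math
import random


def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the illustration."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def rnn_forward(xs, W_hx, W_hh, b_h, W_yh, b_y):
    """Many-to-one pass: consume the whole sequence, classify from the last state."""
    hidden = len(b_h)
    h = [0.0] * hidden  # assumed initial hidden state h_0 = 0
    for x in xs:  # one input vector per character
        # Eq. (2): h_t = tanh(W_hx x_t + W_hh h_{t-1} + b_h)
        h = [
            math.tanh(
                sum(W_hx[i][j] * x[j] for j in range(len(x)))
                + sum(W_hh[i][k] * h[k] for k in range(hidden))
                + b_h[i]
            )
            for i in range(hidden)
        ]
    # Eq. (3): Y_t = softmax(W_yh h_t + b_y), evaluated at the last step only
    logits = [
        sum(W_yh[c][i] * h[i] for i in range(hidden)) + b_y[c]
        for c in range(len(b_y))
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
# Two one-hot "characters" over a toy dictionary of size 3 (assumed sizes).
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = rnn_forward(
    xs,
    random_matrix(4, 3, rng),  # W_hx: hidden x input
    random_matrix(4, 4, rng),  # W_hh: hidden x hidden
    [0.0] * 4,                 # b_h
    random_matrix(2, 4, rng),  # W_yh: classes x hidden
    [0.0] * 2,                 # b_y
)
```

Here `probs` holds two class probabilities that sum to one. The LSTM and GRU variants replace the tanh update inside the loop with gated updates while keeping the same many-to-one structure.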
FIGURE 6. (a) An LSTM unit. G stands for a gate. h_t means the current hidden state, and C_t means the current memory cell state. The t-1 is the previous time step. (b) A GRU unit. G stands for a gate. h_t means the current hidden state, and h_{t-1} means the previous hidden state.

will automatically detect whether the newly opened URL is at risk of phishing. If there is a risk, the user will be interactively prompted through a popup box, and an entry point for reporting detection errors will be provided. The extension will call the HTTP interface of the prediction service to obtain the detection result and save the detection result in Chrome’s storage.

D. CLOUD APPLICATION
When a false alarm or missed alarm occurs in the prediction service, the user can take the initiative to report the current falsely detected URL from the browser plug-in portal. We have developed a website to receive these reports. Once the report is submitted to the system, the system has a manual review process to confirm the risks of these URLs. In addition, there are automatic audit strategies to improve audit efficiency. Once the review is completed, these URLs will be regularly synchronized to the data collection module, and the source is reported. In addition, the website provides a detection interface for browser plug-ins, supports multi-strategy

A. WEB FRAMEWORK
We used Python as our core language, which is a modern high-level programming language in the field of data mining and machine learning. There are various frameworks and libraries for the Python language. In our system, data collection, data storage, model training, websites, and HTTP services are all supported by mature libraries and frameworks. In addition, the access and use of these packages are very simple and convenient.
Considering the usage scenarios and read and write performance, we chose the MySQL relational database. First, the website has user management, report management, model version management and other functions, which require a relational database. In addition, the data set used for model training is acquired dynamically. It is very flexible to combine different data sources and data volumes to form a new data set for model training. For example, we obtained 200,000 phishing URLs from Phish Tank and 340,000 legitimate URLs from Kaggle. A balanced data set with 40,000 URLs can be flexibly combined, including 20,000 phishing URLs and 20,000 legitimate URLs.
We imported Flask as a web framework to provide HTTP service and maintain the official website. Flask is a lightweight web framework and easy to extend [34]. For example, the flask-user package provides user authorization services.
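The dynamic combination of data sources described above can be sketched with plain SQL. The example below is only an illustration under stated assumptions: it uses Python's built-in sqlite3 with an in-memory database as a stand-in for MySQL, and the table name `url`, the columns `url`, `label`, and `source`, the source names, and the function `balanced_dataset` are hypothetical, since the actual fields of Table 2 are not reproduced here.

```python
import sqlite3


def balanced_dataset(conn, phishing_source, benign_source, per_class):
    """Draw the same number of phishing (label 1) and legitimate (label 0) rows."""
    # SQLite spells the random ordering RANDOM(); MySQL would use RAND().
    query = (
        "SELECT url, label FROM url "
        "WHERE source = ? AND label = ? "
        "ORDER BY RANDOM() LIMIT ?"
    )
    phishing = conn.execute(query, (phishing_source, 1, per_class)).fetchall()
    benign = conn.execute(query, (benign_source, 0, per_class)).fetchall()
    return phishing + benign


# Toy data: 50 made-up phishing URLs and 80 made-up legitimate URLs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE url (url TEXT, label INTEGER, source TEXT)")
rows = [(f"http://phish{i}.example/login", 1, "phishtank") for i in range(50)]
rows += [(f"https://site{i}.example/", 0, "kaggle") for i in range(80)]
conn.executemany("INSERT INTO url VALUES (?, ?, ?)", rows)

sample = balanced_dataset(conn, "phishtank", "kaggle", per_class=20)
```

With `per_class=20`, `sample` contains 40 rows, half from each class. Scaling the same query to 20,000 rows per class gives the balanced 40,000-URL data set mentioned in the text.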
FIGURE 8. A screenshot of the Chrome browser extension’s alert warning message window when it detected a phishing URL. The current URL is “http://srv172932.hoster-test.ru/Notice/webmail/main%20all/login.html”.

B. TRAINING MODEL
1) SCIKIT-LEARN
scikit-learn is an open-source library that is widely used for predictive data analysis in the machine learning field [35]. We imported the scikit-learn library to train three traditional machine learning models: Logistic Regression, Random Forest, and support vector machine.
2) PYTORCH
PyTorch is an open-source deep learning framework and development platform. We used the dataset module to build a custom dataset as input for the training model. In the deep learning models’ construction, we imported the linear layer, RNN layer, GRU layer, and LSTM layer. We imported the torch.cuda package, which utilizes GPUs for parallel computation [36].
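The custom dataset has to deliver each URL as the 200 × 100 one-hot matrix described in the feature extraction step. The sketch below shows that conversion in pure Python, without any PyTorch dependency. It rests on assumptions: the dictionary in Figure 2 is not reproduced in the text, so `string.printable` (which happens to contain exactly 100 characters) is used as a stand-in, and the names `CHARSET`, `CHAR_INDEX`, and `url_to_matrix` are hypothetical.

```python
import string

# Assumed stand-in for the 100-character dictionary of Figure 2.
CHARSET = string.printable
CHAR_INDEX = {ch: i for i, ch in enumerate(CHARSET)}
MAX_LEN = 200  # maximum number of URL characters kept, as stated in the text


def url_to_matrix(url, max_len=MAX_LEN):
    """Return a max_len x len(CHARSET) one-hot matrix for the URL."""
    matrix = [[0] * len(CHARSET) for _ in range(max_len)]
    # Characters beyond max_len are truncated; shorter URLs leave zero rows.
    for pos, ch in enumerate(url[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:  # characters outside the dictionary stay all-zero
            matrix[pos][idx] = 1
    return matrix


example = url_to_matrix("https://www.google.com")
```

Each row of `example` holds at most a single 1, placed at the dictionary index of the character at that position. In a PyTorch pipeline such a matrix would typically be converted to a tensor inside the dataset's item accessor.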
TABLE 3. The GRU model with different datasets.

mathematical calculations of four atomic statistical indicators in terms of the number of correctly identified positive instances (TP), the number of correctly identified negative data points (TN), the number of negative data points predicted by the model as positive (FP), and the number of positive instances labelled as negative (FN). In the article, we use the F1 score to summarize recall and precision. In addition, in cybersecurity detection applications, false alarms can affect the user experience and trust, and missed alarms are likely to directly cause user losses. Therefore, we use accuracy, F1, false-positive rate, and false-negative rate to measure the efficiency of models. The mathematical formulas for these metrics are given in Equations (4), (7), (8), and (9).

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$
$$\text{precision} = \frac{TP}{TP + FP} \tag{5}$$
$$\text{recall} = \frac{TP}{TP + FN} \tag{6}$$
$$F1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}} = \frac{TP}{TP + \frac{1}{2}(FP + FN)} \tag{7}$$
$$\text{false positive rate} = \frac{FP}{FP + TN} \tag{8}$$
$$\text{false negative rate} = \frac{FN}{FN + TP} \tag{9}$$

performs best. In KPT datasets, the false rate decreases linearly as the number of data increases.
Furthermore, Average precision (AP) is a widely used metric in evaluating the accuracy of deep learning models by computing the average precision value for recall value over 0 to 1; higher is better. Mean average precision (mAP) is the average of AP. Equation (10) shows the calculation logic. In this scene, the number of classes is two.

$$\text{mAP} = \frac{1}{|\text{classes}|} \sum_{c \in \text{classes}} \frac{TP(c)}{TP(c) + FP(c)} \tag{10}$$
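As a quick check of Equations (4) to (9), the metrics can be computed directly from the four confusion-matrix counts. The sketch below is illustrative only: the function name `classification_metrics` is hypothetical, the counts are made up rather than taken from the experiments, and it assumes every denominator is nonzero.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the metrics of Eq. (4)-(9) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # Eq. (5)
    recall = tp / (tp + fn)     # Eq. (6)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),          # Eq. (4)
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),  # Eq. (7)
        "false_positive_rate": fp / (fp + tn),                # Eq. (8)
        "false_negative_rate": fn / (fn + tp),                # Eq. (9)
    }


# Made-up counts for illustration only.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
```

For these counts the F1 value agrees with the closed form TP / (TP + (FP + FN) / 2) on the right-hand side of Eq. (7), which is a convenient sanity check when implementing the metrics.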
FIGURE 13. The accuracy and F1 score in different models with KPT-12
dataset.
D. COMPARISON
This section compares the RNN-GRU model to existing solutions that train deep learning models to detect phishing websites. Table 6 shows a comparison from different dimensions, such as data collection, models, performance indicators, and limitations.
As for the limitations of the proposed solution implementation, since there are no short links in the data set used to train the model, the current prediction service cannot accurately detect whether short links are at risk of phishing. Furthermore, we keep only the first 200 characters of the URL, so for URLs with more than 200 characters, part of the information is lost, which may affect the detection results. In addition, the automatic review of reports is currently based on rules such as remote IP address, client information, and the number of times the URL has been submitted. This strategy can easily be abused by phishing attackers. In the future, more data will be needed to support automatic review results, for example, by obtaining the HTML of the current URL, identifying the similarity between the logo image and the whitelisted website, and checking whether there is an input box in the HTML.
TABLE 7. (Continued.) Comparison of proposed RNN-GRU model with detection in a real-time browsing environment. The novel
other deep learning-based solutions.
features of the framework are:
1) We utilized closed-loop data to drive better perfor-
mance of machine learning models. A dataset is fun-
damental to model training, and high-quality data can
improve the performance of a model. The feedback data
from users are high-quality data with advancement,
accuracy, and sensitivity.
2) The system is running in a real-time environment with-
out delays. The prediction results are displayed when
the web page is opened.
3) Experimental data can be tracked. The model training
process is an automated task, and each execution result
is stored in a real-time database.
4) We have developed a browser extension as a client
product that every ordinary netizen can use.
5) The implementation of predictive services is
extendable, and individual detection services can be
combined. For example, you can introduce a blacklist
filtering service, computer vision service.
6) The feature extraction process in the deep learning
model is independent of third-party services.
In the future, we will deploy the whole system to a cloud
platform. Configure machines with NVIDIA GPUs for model
training and increase efficiency with GPU’s parallel com-
puting power. Afterward, users can download the extension
through the Chrome Web Store. In addition, we plan to imple-
ment our framework as a plug-in for other browsers.
REFERENCES
[1] L. Tang and Q. H. Mahmoud, ‘‘A survey of machine learning-based solu-
tions for phishing website detection,’’ Mach. Learn. Knowl. Extraction,
vol. 3, no. 3, pp. 672–694, Aug. 2021, doi: 10.3390/make3030034.
[2] S. Marchal, J. Francois, R. State, and T. Engel, ‘‘PhishStorm: Detecting
phishing with streaming analytics,’’ IEEE Trans. Netw. Service Manage.,
vol. 11, no. 4, pp. 458–471, Dec. 2014.
[3] (Jun. 2021). Phishing Activity Trends Report 1st Quarter 2021.
APWG. Accessed: Oct. 20, 2021. [Online]. Available: https://docs.apwg.
org/reports/apwg_trends_report_q1_2021.pdf
[4] (2020). 2020 Internet Crime Report. [Online]. Available: https://www.
ic3.gov/Media/PDF/AnnualReport/2020_IC3Report.pdf
[5] R. M. Mohammad, F. Thabtah, and L. McCluskey, ‘‘Predicting phish-
ing websites based on self-structuring neural network,’’ Neural Com-
put. Appl., vol. 25, no. 2, pp. 443–458, Nov. 2013, doi: 10.1007/
s00521-013-1490-z.
[6] M. A. El-Rashidy, ‘‘A smart model for web phishing detection based
on new proposed feature selection technique,’’ Menoufia J. Electron.
Eng. Res., vol. 30, no. 1, pp. 97–104, Jan. 2021, doi: 10.21608/
mjeer.2021.146286.
[7] B. B. Gupta, K. Yadav, I. Razzak, K. Psannis, A. Castiglione, and
X. Chang, ‘‘A novel approach for phishing URLs detection using lexical
based machine learning in a real-time environment,’’ Comput. Commun.,
vol. 175, pp. 47–57, Jul. 2021, doi: 10.1016/j.comcom.2021.04.023.
[8] E. Gandotra and D. Gupta, ‘‘Improving spoofed website detection using
machine learning,’’ Cybern. Syst., vol. 52, no. 2, pp. 169–190, Oct. 2020,
doi: 10.1080/01969722.2020.1826659.
VI. CONCLUSION AND FUTURE WORK
Many machine learning-based solutions have been proposed
in recent years to deal with phishing attacks, but results have
not been verified in live browsing environments, and there is
a lack of analysis and research of products for phishing detection.
In this paper, we proposed a framework for phishing
[11] S.-J. Bu and S.-B. Cho, ‘‘Deep character-level anomaly detection based [32] D. P. Kingma and J. Ba, ‘‘Adam: A method for stochastic optimization,’’
on a convolutional autoencoder for zero-day phishing URL detection,’’ 2014, arXiv:1412.6980.
Electronics, vol. 10, no. 12, p. 1492, Jun. 2021, doi: 10.3390/electron- [33] Y. Ho and S. Wookey, ‘‘The real-world-weight cross-entropy loss function:
ics10121492. Modeling the costs of mislabeling,’’ IEEE Access, vol. 8, pp. 4806–4813,
[12] URL 2016 | Datasets | Research | Canadian Institute for Cybersecu- 2020, doi: 10.1109/ACCESS.2019.2962617.
rity | UNB. Accessed: Oct. 20, 2021. www.unb.ca. [Online]. Available: [34] Welcome to Flask—Flask Documentation (2.0.x).
https://www.unb.ca/cic/datasets/url-2016.html flask.palletsprojects.com. [Online]. Available: https://flask.palletsprojects.
[13] PhishTank > See All Suspected Phish Submissions. Accessed: com/en/2.0.x/
Oct. 20, 2021. www.phishtank.com.[Online]. Available: https://www. [35] (2019). Scikit-Learn: Machine Learning in Python—Scikit-Learn 0.20.3
phishtank.com/phish_archive.php Documentation. Scikit-learn.org. [Online]. Available: https://scikit-
[14] M. Somesha, A. R. Pais, R. S. Rao, and V. S. Rathour, ‘‘Efficient deep learn.org/stable/index.html
learning techniques for the detection of phishing websites,’’ Sādhanā, [36] Torch.Cuda—Pytorch 1.9.1 Documentation. Pytorch.org.
vol. 45, no. 1, pp. Jun. 2020, doi: 10.1007/s12046-020-01392-4. Accessed: Oct. 14, 2021. [Online]. Available: https://pytorch.org/
[15] M. A. Adebowale, K. T. Lwin, and M. A. Hossain, ‘‘Intelligent docs/stable/cuda.html
phishing detection scheme using deep learning algorithms,’’ J. [37] Welcome. (Nov. 9, 2020). Chrome Developers. Accessed: Oct. 14, 2021.
Enterprise Inf. Manage., to be published. [Online]. Available: [Online]. Available: https://developer.chrome.com/docs/extensions/mv3/
https://www.emerald.com/insight/content/doi/10.1108/JEIM-01-2020- [38] V. Babel, K. Singh, S. K. Jangir, B. Singh, and S. Kumar.
0036/full/html, doi: 10.1108/jeim-01-2020-0036. (2019). Journal of Analysis and Computation (JAC) Evaluation
[16] D. N. Atimorathanna, T. S. Ranaweera, R. A. H. Devdunie Pabasara, Methods for Machine Learning. Accessed: Oct. 14, 2021.
J. R. Perera, and K. Y. Abeywardena, ‘‘NoFish; total anti-phishing pro- [Online]. Available: http://www.ijaconline.com/wp-content/uploads/
tection system,’’ in Proc. 2nd Int. Conf. Advancements Comput. (ICAC), 2019/06/ICITDA_2019_paper_88-.pdf
Dec. 2020, pp. 470–475, doi: 10.1109/ICAC51239.2020.9357145. [39] H. N. A. Pham and E. Triantaphyllou, ‘‘The impact of overfitting and
[17] S. Maurya, H. Singh, and A. Jain, ‘‘Browser extension based hybrid anti- overgeneralization on the classification accuracy in data mining,’’ in Soft
phishing framework using feature selection,’’ Int. J. Adv. Comput. Sci. Computing for Knowledge Discovery and Data Mining, O. Maimon and
Appl., vol. 10, no. 11, pp. 1–10, 2019, doi: 10.14569/ijacsa.2019.0101178. L. Rokach, Eds. 2008, pp. 391–431, doi: 10.1007/978-0-387-69935-6_16.
[18] B. Shah, K. Dharamshi, M. B. Patel, and V. Gaikwad. (2020). [40] IBM Cloud Education. (Mar. 3, 2021). What is Overfitting. www.ibm.com.
Chrome Extension for Detecting Phishing Websites. Semantic Scholar. [Online]. Available: https://www.ibm.com/cloud/learn/overfitting
[Online]. Available: https://www.semanticscholar.org/paper/Chrome- [41] X. Ying, ‘‘An overview of overfitting and its solutions,’’ J. Phys.,
Extension-for-Detecting-Phishing-Websites-Shah- Conf. Ser., vol. 1168, Feb. 2019, Art. no. 022022, doi: 10.1088/1742-
Dharamshi/fa99621bdc27cbd32ed799d7a6c1848ac644e8a8 6596/1168/2/022022.
[19] K. M. Sundaram, R. Sasikumar, A. S. Meghana, A. Anuja, and [42] G. H. Lokesh and G. BoreGowda, ‘‘Phishing website detection based on
C. Praneetha, ‘‘Detecting phishing websites using an efficient feature- effective machine learning approach,’’ J. Cyber Secur. Technol., vol. 5,
based machine learning framework,’’ Revista Gestão Inovação e Tecnolo- no. 1, pp. 1–14, Jan. 2021, doi: 10.1080/23742917.2020.1813396.
gias, vol. 11, no. 2, pp. 2106–2112, Jun. 2021, doi: 10.47059/revistagein- [43] Visualizing Models, Data, and Training With Tensorboard—Pytorch
tec.v11i2.1832. Tutorials 1.9.1+Cu102 Documentation. Pytorch.org. Accessed:
[20] O. Abiodun, A. S. Sodiya, and S. O. Kareem, ‘‘Linkcalculator—An effi- Oct. 15, 2021. [Online]. Available: https://pytorch.org/tutorials/
cient link-based phishing detection tool,’’ Acta Inf. Malaysia, vol. 4, no. 2, intermediate/tensorboard_tutorial.html
pp. 37–44, Oct. 2020, doi: 10.26480/aim.02.2020.37.44. [44] J. Zhang, Y. Ou, D. Li, and Y. Xin, ‘‘A prior-based transfer learning method
[21] M. G. Hr, M. V. Adithya, and S. Vinay, ‘‘Development of anti-phishing for the phishing detection,’’ J. Netw., vol. 7, no. 8, p. 1201, Aug. 2012, doi:
browser based on random forest and rule of extraction framework,’’ Cyber- 10.4304/jnw.7.8.1201-1207.
security, vol. 3, no. 1, pp. 1–14, Oct. 2020, doi: 10.1186/s42400-020- [45] M. Chatterjee and A.-S. Namin, ‘‘Detecting phishing websites through
00059-1. deep reinforcement learning,’’ in Proc. IEEE 43rd Annu. Comput. Softw.
[22] N. Gupta, S. Mujumdar, H. Patel, S. Masuda, N. Panwar, S. Bandyopad- Appl. Conf. (COMPSAC), Jul. 2019, pp. 227–232.
hyay, S. Mehta, S. Guttula, S. Afzal, R. Sharma Mittal, and V. Munigala,
‘‘Data quality for machine learning tasks,’’ in Proc. 27th ACM SIGKDD
Conf. Knowl. Discovery Data Mining, Aug. 2021, pp. 4040–4041, doi:
10.1145/3447548.3470817.
[23] S. Kumar. (2019). Malicious and Benign URLs. kaggle.com.
Accessed: Oct. 20, 2021. [Online]. Available: https://www.kaggle.com/
LIZHEN TANG received the bachelor’s degree in mechanical engineering
and automation from Zhejiang Sci-Tech University, Zhejiang, China.
She is currently pursuing the M.A.Sc. degree in electrical and
computer engineering with Ontario Tech University. She worked as a
Software Developer at Alibaba Group, Hangzhou, China, for eight years.
Her research interests include machine learning, neural networks,
cybersecurity, and data science.

QUSAY H. MAHMOUD (Senior Member, IEEE) was the Founding Chair at the
Department of Electrical, Computer and Software Engineering, Ontario
Tech University, Canada. He has worked as an Associate Dean with the
Faculty of Engineering and Applied Science, Ontario Tech University.
He is currently a Professor of software engineering. His research
interests include intelligent software systems and cybersecurity.