Project

CHAPTER 1
1
INTRODUCTION
Phishing costs Internet users billions of dollars per year. It refers to luring techniques used by
identity thieves to fish for personal information in a pond of unsuspecting Internet users.
Phishers use spoofed e-mail, phishing software to steal personal information and financial
account details such as usernames and passwords. This paper deals with methods for detecting
phishing Web sites by analysing various features of benign and phishing URLs by Machine
learning techniques. We discuss the methods used for detection of phishing Web sites based on
lexical features, host properties and page importance properties. We consider various machine
learning algorithms for evaluation of the features in order to get a better understanding of the
structure of URLs that spread phishing. The fine-tuned parameters are useful in selecting the
apt machine learning algorithm for separating the phishing sites from benign sites.
The criminals, who want to obtain sensitive data, first create unauthorized replicas of a real
website and e-mail, usually from a financial institution or another company that deals with
financial information. The e-mail will be created using logos and slogans of a legitimate
company. The nature of website creation is one of the reasons that the Internet has grown so
rapidly as a communication medium, it also permits the abuse of trademarks, trade names, and
other corporate identifiers upon which consumers have come to rely as mechanisms for
authentication. Phisher then send the "spoofed" e-mails to as many people as possible in an
attempt to lure them in to the scheme. When these e-mails are opened or when a link in the
mail is clicked, the consumers are redirected to a spoofed website, appearing to be from the
legitimate entity.
1.1 PROBLEM DEFINITION
Web Security has become very important in recent years as internet connectivity has
penetrated more and more regions across the world. While this penetration is great for global
connectivity, it also means that more people have access to websites that can potentially attack
them using malwares, viruses, and other malicious agents. Thus, it becomes more important
than ever to identify and deal with such websites before a normal user has access to them
(JangJaccard and Nepal, 2014). Current approaches to deal with this problem have many
limitations in terms of effectiveness and efficiency Eshete, Villafiorita and Weldemariam,
2011. The aim of this study is to detect malicious websites using a group of machine learning
algorithms called classifiers, we will try to detect malware on websites.
2
This will help in safe web surfing and better user experience. By timely reporting malicious
websites, the users will be able to avoid any violation and serious privacy breach. The users
will also be able to avoid any illegal activities that they can get involved in. Labelling malicious
websites will also help to eliminating fraud, as users become victim of attacks that use
blackmailing and false information to get monetary advantage of their victim. For example,
ransomware attacks are getting quite common. Systems get infected by such viruses through
surfing malicious websites.
1.2 PROJECT PURPOSE
The main purpose of the project is to detect the fake or phishing websites who are trying to get
access to the sensitive data or by creating the fake websites and trying to get access of the user
personal credentials. We are using machine learning algorithms to safeguard the sensitive data
and to detect the phishing websites who are trying to gain access on sensitive data.
1.3 BACKGROUND OF THE STUDY
The number of websites on the internet is increasing at a rapid rate. In 2018, there were over
1.6 Billion websites on the world wide web. Total number of Websites - Internet Live Stats,
2020. And as the time passes the number is increasing
3
Figure 1 above shows the number of websites over time. However, as the internet is expanding
so is the risk of malware attacks to web services. Corrupt web developers release malware
through their websites to hack personal computers and servers and breach privacies for
blackmailing, fraud, and theft. These attackers ask for ransom money and can create serious
problems for the victims. Attackers can publish private data of their victims, can steal money
from their accounts.
Figure 2 below shows the top ten categories of websites that have malicious content and can
potentially harm their user. Soft media, 2016. As can be observed the list of categories below
contain some of the most common websites that can have a lot of utility and can make life of a
user easier. When malicious, these can become a nightmare for the user. These websites such
as gambling, shopping and business all prompt users for credit card information. This
information can easily get in the hands of the wrong people and can cause financial harm to the
users.
Figure 2: Top 10 Categories of evil websites Soft media, 2016
4
Following is a summary of potential impacts of malicious websites on computers:
• Disrupts operations and automated programs that maybe handling some important processes
• Steals sensitive information.
• Provides unauthorized access to system resources to other malicious software’s
• Reduces computer or web browser speeds by increasing dummy processes
• Creates network connectivity issues
• Cause frequent freezing or crashing
1.4 PROJECT FEATURES
One of the challenges faced by our research was the unavailability of reliable training datasets.
In fact, this challenge faces any researcher in the field. However, although plenty of articles
about predicting phishing websites using data mining techniques have been disseminated these
days, no reliable training dataset has been published publically, maybe because there is no
agreement in literature on the definitive features that characterize phishing websites, hence it
is difficult to shape a dataset that covers all possible features.
In this article, we shed light on the important features that have proved to be sound and effective
in predicting phishing websites. In addition, we proposed some new features, experimentally
assign new rules to some well-known features and update some other features.
1.5 PROJECT OVERVIEW
In recent years, with the increasing use of mobile devices, there is a growing trend to move
almost all real-world operations to the cyberworld. Although this makes easy our daily lives, it
also brings many security breaches due to the anonymous structure of the Internet. Used
antivirus programs and firewall systems can prevent most of the attacks. However, experienced
attackers target on the weakness of the computer users by trying to phish them with bogus
webpages. These pages imitate some popular banking, social media, e-commerce, etc. sites to
steal some sensitive information such as, user-ids, passwords, bank account, credit card
numbers, etc. Phishing detection is a challenging problem, and many different solutions are
proposed in the market as a blacklist, rule-based detection, anomaly-based detection, etc. In the
literature, it is seen that current works tend on the use of machine learning-based anomaly
detection due to its dynamic structure, especially for catching the “zero-day” attacks. In this
paper, we proposed a machine learning-based phishing detection system by using eight
5
different algorithms to analyse the URLs, and three different datasets to compare the results
with other works. The experimental results depict that the proposed models have an outstanding
performance with a success rate. Phishing is a form type of a cybersecurity attack where an
attacker gains control on sensitive website user accounts by learning sensitive information such
as login credentials, credit card information by sending a malicious URL in email or
masquerading as a reputable person in email or through other communication channels. The
victim receives a message from known contacts, persons, entities or organizations and looks
very much genuine in its appeal. The received message might contain malicious links, software
that might target the user computer or the malicious link might direct the user to some forged
website which is similar in look and feel of a popular website, further victim might be tricked
to divulge his personal information e.g. credit card information, login and password details and
other sensitive information like account id details etc. Phishing is the most popular type of
cybersecurity attack and very common among the attackers. Phishing attacks are generally easy
as most of the victims are not well aware of the intricacies about the web applications and
computer networks and its technologies and are easy prey for getting tricked or spoofed. It is
very easy to phishing unsuspecting users using forged websites and luring them for clicking
the websites for some prize and offers than targeting the computer defense system. The
malicious website is designed in such a way that it has a similar look and feel and it appears
very genuine in its appearance as it contains the organization's logos and other copyrighted
contents. As many users unwittingly clicking the phishing websites URLs and this results in
huge financial and loss of reputation to the person and to the concerned organization In our
daily life, we carry out most of our work on digital platforms. Using a computer and the internet
in many areas facilitates our business and private life. It allows us to complete our transaction
and operations quickly in areas such as trade, health, education, communication, banking,
aviation, research, engineering, entertainment, and public services. The users who need to
access a local network have been able to easily connect to the Internet anywhere and anytime
with the development of mobile and wireless technologies. Although this situation provides
great convenience, it has revealed serious deficits in terms of information security. Thus, the
need for users in cyberspace to take measures against possible cyber-attacks has emerged.
Attacks can be carried out by people such as cybercriminals, pirates, or non-malicious
whitecapped attackers and hacktivists. The aim is to reach the computer or the information it
contains or to capture personal information in different ways. The attacks, as internet worms
(Morris Worm), started in 1988, and they have been carried out until today. These attacks are
6
mainly targeted in the following areas: fraud, forgery, force, shakedown, hacking, service
blocking, malware applications, illegal digital contents and social engineering. Reaching with
a wide range of target users, attackers aim to get a lot of information and/or money. According
to Kaspersky's data, the average cost of an attack in 2019 depending on the size of the attack is
between $ 108K and $ 1.4 billion. In addition, the money spent on global security products and
services is around $ 124 billion . Among these attacks, the most widespread and also critical
one is “phishing attacks”. In this type of attack, cybercriminals especially use an email or other
social networking communication channels. Attackers reach the victim users by giving the
impression that the post was sent from a reliable source, such as a bank, e-commerce site, or
similar. Thus, they try to access sensitive information of them. Attackers then access their
victims’ accounts by using this information. Thus, it causes pecuniary loss and intangible
damages.
1.6 MODULE
• Presence of IP address in URL Presence of @ symbol in URL

• Number of dots in Hostname
• Prefix or Suffix separated by (-) to domain URL redirection
• Information submission to Email
• URL Shortening Services “Tiny URL”
• Length of Host name
• Presence of sensitive words in URL
Presence of IP address in URL: If IP address present in URL then the feature is set to 1 else
set to 0. Most of the benign sites do not use IP address as an URL to download a webpage. Use
of IP address in URL indicates that attacker is trying to steal sensitive information.
Presence of @ symbol in URL: If @ symbol present in URL then the feature is set to 1 else
set to 0. Phishers add special symbol @ in the URL leads the browser to ignore everything
preceding the “@” symbol and the real address often follows the “@” symbol [4].
Number of dots in Hostname: Phishing URLs have many dots in URL. For example
http://shop.fun.amazon.phishing.com, in this URL phishing.com is an actual domain name,
whereas use of “amazon” word is to trick users to click on it. Average number of dots in benign
URLs is 3. If the number of dots in URLs is more than 3 then the feature is set to 1 else to 0.
7
Prefix or Suffix separated by (-) to domain: If domain name separated by dash (-) symbol
then feature is set to 1 else to 0. The dash symbol is rarely used in legitimate URLs. Phishers
add dash symbol (-) to the domain name so that users feel that they are dealing with a legitimate
webpage. For example Actual site is http://www.onlineamazon.com but phisher can create
another fake website like http://www.online-amazon.com to confuse the innocent users.
URL redirection: If “//” present in URL path then feature is set to 1 else to 0. The existence of
“//” within the URL path means that the user will be redirected to another website
Information submission to Email: Phisher might use “mail()” or “mailto:” functions to

redirect the user’s information to his personal email[4]. If such functions are present in the URL
then feature is set to 1 else to 0.
URL Shortening Services “Tiny URL”: Tiny URL service allows phisher to hide long
phishing URL by making it short. The goal is to redirect user to phishing websites. If the URL
is crafted using shortening services like bit.ly then feature is set to 1 else 0
Length of Host name: Average length of the benign URLs is found to be a 25, If URL’s length
is greater than 25 then the feature is set to 1 else to 0
Presence of sensitive words in URL: Phishing sites use sensitive words in its URL so that
users feel that they are dealing with a legitimate webpage. Below are the words that found in
many phishing URLs :- 'confirm', 'account', 'banking', 'secure', ', 'web src', 'sign in', 'mail',
'install', 'toolbar', 'backup', 'pay pal', 'password', 'username', etc;
8
CHAPTER 2
9
LITERATURE REVIEW
The purpose or goal behind phishing is data, money or personal information stealing through
the fake website. The best strategy for avoiding the contact with the phishing web site is to
detect real time malicious URL. Phishing websites can be determined on the basis of their
domains. They usually are related to URL which needs to be registered (low-level domain and
upper-level domain, path, query). Recently acquired status of intra-URL relationship is used to
evaluate it using distinctive properties extracted from words that compose a URL based on
query data from various search engines such as Google and Yahoo. These properties are further
led to the machine-learning based classification for the identification of phishing URLs from a
real dataset. This paper focus on real time URL phishing against phishing content by using
phish-STORM. For this a few relationship between the register domain rest of the URL are
consider also intra URL relentless is consider which help to dusting wish between phishing or
non phishing URL. For detecting a phishing website certain typical blacklisted urls are used,
but this technique is unproductive as the duration of phishing websites is very short. Phishing
is the name of avenue. It can be defined as the manner of deception of an organization's
customer to communicate with their confidential information in an unacceptable behaviour. It
can also be defined as intentionally using harsh weapons such as Spasm to automatically target
the victims and targeting their private information. As many of the failures being occurred in
the SMTP are exploiting vectors for the phishing websites, there is a greater availability of
communication for malicious message deliveries.
2.1 In Intelligent Ensemble Learning Approach for Phishing Website Detection Based on
Weighted Soft Voting
The continuous development of network technologies plays a major role in increasing the
utilization of these technologies in many aspects of our lives, including e-commerce, electronic
banking, social media, e-health, and e-learning. In recent times, phishing websites have
emerged as a major cybersecurity threat. Phishing websites are fake web pages that are created
by hackers to mimic the web pages of real websites to deceive people and steal their private
information, such as account usernames and passwords. Accurate detection of phishing
websites is a challenging problem because it depends on several dynamic factors. Ensemble
methods are considered the state-of-theart solution for many classification tasks. Ensemble
learning combines the predictions of several separate classifiers to obtain a higher performance
than a single classifier. This paper proposes an intelligent ensemble learning approach for
phishing website detection based on weighted soft voting to enhance the detection of phishing
10
websites. First, a base classifier consisting of four heterogeneous machine-learning algorithms
was utilized to classify the websites as phishing or legitimate websites. Second, a novel
weighted soft voting method based on Kappa statistics was employed to assign greater weights
of influence to stronger base learners and lower weights of influence to weaker base learners,
and then integrate the results of each classifier based on the soft weighted voting to differentiate
between phishing websites and legitimate websites. The experiments were conducted using the
publicly available phishing website dataset from the UCI Machine Learning Repository, which
consists of 4898 phishing websites and 6157 legitimate websites. The experimental results
showed that the suggested intelligent approach for phishing website detection outperformed
the base classifiers and soft voting method and achieved the highest accuracy of 95% and an
Area Under the Curve (AUC) of 98.8%. Due to their flexibility, convenience, and simplicity
of use, the number of web users who utilize online services, e-banking, and online shopping
has increased rapidly in recent years. This massive increase in the use of online services and
ecommerce has encouraged phishers and cyber attackers to create misleading and phishing
websites in order to obtain financial and other sensitive information. Online phishing sites
typically utilize similar page layouts, fonts, and blocks to imitate official web pages in order to
persuade web visitors to provide personal information, such as login credentials. Due to the
evolution of online hacking techniques and a lack of public awareness, internet users are
frequently exposed to cyber dangers, such as phishing, spam, trojans, and adware. Phishing has
grown in popularity as a means of collecting users’ private information, such as login details,
credit card information, and social security numbers, via fraudulent websites. Therefore,
phishing attacks represent a serious cybersecurity problem that significantly affects commercial
websites and the users of the web. Personal information collected in this way can be used to
steal money via stolen credit cards, debit cards, bank account fraud, and gaining illegal access
to people’s social media profiles. Phishing attacks have already resulted in significant losses
and may have a negative impact on the victim, not just financially, but also in terms of
reputation and national security. In comparison to 2018 and 2019, in 2020, there was a 15%
increase in the number of phishing attacks. In addition, Kaspersky Lab’s antiphishing security
systems stopped over 482 million phishing threats in 2018, a twofold increase over 2017. Based
on the Anti Phishing Working Group’s (APWG) report (APWG 2020), the number of phishing
attacks is rising continually, with 146,994 phishing websites discovered in the second quarter
of 2020. In 2020, the anticipated average cost of a business breach caused by phishing attacks
was 2.8 million USD. It is important to utilize anti-phishing methods to avoid such significant
losses.
11
2.2 Anti-phishing Based on Automated Individual White-List
In phishing and pharming, users could be easily tricked into submitting their
username/passwords into fraudulent web sites whose appearances look similar as the genuine
ones. The traditional blacklist approach for anti-phishing is partially effective due to its partial
list of global phishing sites. In this paper, we present a novel anti-phishing approach named
Automated Individual White List (AIWL). AIWL automatically tries to maintain a white-list
of user’s all familiar Login User Interfaces (LUIs) of web sites. Once a user tries to submit
his/her confidential information to an LUI that is not in the white-list, AIWL will alert the user
to the possible attack. Next, AIWL can efficiently defend against pharming attacks, because
AIWL will alert the user when the legitimate IP is maliciously changed; the legitimate IP
addresses, as one of the contents of LUI, are recorded in the white-list and our experiment
shows that popular web sites’ IP addresses are basically stable. Furthermore, we use Naïve
Bayesian classifier to automatically maintain the white-list in AIWL. Finally, we conclude
through experiments that AIWL is an efficient automated tool specializing in detecting
phishing and pharming Most of the techniques for phishing detection are based on blacklist
[30]. In the blacklist approaches, once the user visits a web site that is in the blacklist, he/she
will be warned of the potential attack. But maintaining a blacklist requires a great deal of
resources for reporting and verification of the suspicious web sites. In addition, phishing sites
emerge endlessly, so it is difficult to keep a global blacklist up to date. Contrary to blacklist,
white-list approach maintains a list containing all legitimate web sites. But a global white-list
approach is likewise hardly used because it is impossible for a white-list to cover all legitimate
web sites in the entire cyber world. In this paper, we present a novel approach, named
Automated Individual White-List (AIWL). AIWL uses a white list that records all familiar
Login User Interfaces (LUIs) of web sites for a user. A familiar LUI of a web site refers to the
characteristic information of a legitimate login page on which the user wants to input his/her
username/password. Every time a user tries to submit his/her sensitive information into an LUI
that is not included in the white-list, the user will be alerted to the possible attack. Here, LUI
refers to the user interface where user inputs his/her username/passwords. For instance, a
typical LUI is composed of URL address, page feature, DNS-IP mapping. Once the user tries
to submit the confidential information into a web site that is in the white-list, LUI information
of current web site will be collected and compared with the pre-stored one in the white-list.
Any mismatch will also cause warning to the user. To conveniently set up the white-list in
AIWL, we use the Naïve Bayesian classifier to identify a successful login process. After a web
site has been logged in successfully several times, it is believed to be a familiar one of the user
12
and the LUI information of the web site can be added to the white-list automatically after user’s
confirmation. The rest of our paper is organized as follows: in section 2, we introduce
background and motivation of the paper; section 3 introduces the overall approach of AIWL
and discusses some important issues in the approach; section 4 describes the experiments for
evaluation; section discusses the advantages of AIWL on the basis of its comparison with other
solutions and consider the limitations of AIWL; section 6 introduces the related work; and
section 7 summarizes our paper and introduces future work. Phishing attackers use both social
engineering and technical subterfuge to steal user’s identity data as well as financial account
information. By sending “spoofed” e-mails, social-engineering schemes lead users to
counterfeit web sites that are designed to trick recipients into divulging financial data such as
credit card numbers, account usernames, passwords and social security numbers. In order to
persuade the recipients to respond, phishers often hijack brand names of banks, e-retailers and
credit card companies. Furthermore, technical subterfuge schemes often plant crimewares, such
as Trojan, keylogger spyware, into victims’ machines to steal user’s credentials. Phishing attack
not only leads to great loss to users but also influences the expansion of ecommerce. Rampant
phishing attacks would cause the whole e-commerce environment to be dangerous and
aggressive. Furthermore, it is difficult for common users to distinguish fraudulent web site from
the genuine one. Thus, users would feel hesitant to use e-banking and online shopping services
in such an environment.
2.3 Phishing Detection using Machine Learning based URL Analysis:
A Survey” As we have moved most of our financial, work related and other daily activities to
the internet, we are exposed to greater risks in the form of cybercrimes. URL based phishing
attacks are one of the most common threats to the internet users. In this type of attack, the
attacker exploits the human vulnerability rather than software flaws. It targets both individuals
and organizations, induces them to click on URLs that look secure, and steal confidential
information or inject malware on our system. Different machine learning algorithms are being
used for the detection of phishing URLs, that is, to classify a URL as phishing or legitimate.
Researchers are constantly trying to improve the performance of existing models and increase
their accuracy. In this work we aim to review various machine learning methods used for this
purpose, along with datasets and URL features used to train the machine learning models. The
performance of different machine learning algorithms and the methods used to increase their
accuracy measures are discussed and analyzed. The goal is to create a survey resource for
researchers to learn the current developments in the field and contribute in making phishing
detection models that yield more accurate results. The year 2020 saw peoples’ life being
13
completely dependent on technology due to the global pandemic. Since digitalization became
significant in this scenario, cyber criminals went on an internet crime spree. Recent reports and
researches point to an increased number of security breaches that costs the victims a huge sum
of money or disclosure of confidential data. Phishing is a cybercrime that employs both social
engineering and technical subterfuge in order to steal personal identity data or financial account
credentials of victims. In phishing, attackers counterfeit trusted websites and misdirect people
to these websites, where they are tricked into sharing usernames, passwords, banking or credit
card details and other sensitive credentials. These phishing URLs may be sent to the consumers
through email, instant message or text message. According to the FBI crime report 2020,
phishing was the most common type of cyber attack in 2020 and phishing incidents nearly
doubled from 114,702 in 2019 to 241,342 in 2020. The Verizon 2020 Data Breach Investigation
Report states that 22% of data breaches in 2020 involved phishing. The number of phishing
attacks as observed by the Anti Phishing Work Group (APWG) grew through 2020, doubling
over the course of the year. In the 4th quarter of 2020, it was found that phishing attacks against
financial institutions were the most prevalent. Phishing attacks against SaaS and Webmail sites
were down and attacks against E-commerce sites escalated, while attacks against media
companies decreased slightly from 12.6% to 11.8%. In light of the prevailing pandemic
situation, there have been many phishing attacks that exploit the global focus on Covid-19.
According to WHO, many hackers and cyber scammers are sending fraudulent emails and
WhatsApp messages to people, taking advantage of the coronavirus disease. These attacks are
coming in the form of fake job offers, fabricated messages from health organizations, covid
vaccine themed phishing and brand impersonation. A URL based phishing attack is carried out
by sending malicious links, that seems legitimate to the users, and tricking them into clicking
on it. In phishing detection, an incoming URL is identified as phishing or not by analyzing the
different features of the URL and is classified accordingly. Different machine learning
algorithms are trained on various datasets of URL features to classify a given URL as phishing
or legitimate.
2.4 A Phishing Sites Blacklist Generator
Phishing is an increasing web attack both in volume and techniques sophistication. Blacklists
are used to resist this type of attack, but fail to make their lists up to-date. This paper proposes
a new technique and architecture for a blacklist generator that maintains an up-to-date blacklist
of phishing sites. When a page claims that it belongs to a given company, the company’s name
is searched in a powerful search engine like Google. The domain of the page is then compared
with the domain of each of the Google’s top10 searched results. If a matching domain is found,
14
the page is considered as a legitimate page, and otherwise as a phishing site. Preliminary
evaluation of our technique has shown an accuracy of 91% in detecting legitimate pages and
100% in detecting phishing sites. Phishing attack is a type of identity theft that aims to deceit
users into revealing their personal information which could be exploited for illegal financial
purposes. A phishing attack begins with an email that claims it is from a legal company like
eBay. The content of email motivates the user to click on a malicious link in the email. The
link connects the user to an illegitimate page that mimics the outward appearance of original
site. The phishing page then requests user's personal information, like online banking
passwords and credit card information. The number of phishing attacks has grown rapidly.
According to trend reports by Anti-Phishing Working Group APWG the number of unique
phishing sites has been reported 37,444 sites in October 2006, increased from 4,367 sites in
October 2005. Other statistics show the increase in the volume of the Phishing attack and their
techniques are becoming much more advanced. A number of techniques have been studied and
practiced against phishing and a large number of them use phishing blacklists to battle against
phishing. Blacklists of phishing sites are valuable sources that are in use by anti-phishing
toolbars to notify users and deny their access to phishing sites, web and email filters to filter
spam and phishing emails, and phishing termination communities to terminate the phishing
sites. Blacklist indicates whether a URL is good or bad. A bad URL means that it is known to
be used by attackers to steel users' information. The blacklist publisher assigns the “goodness”
the URLs that are not in the list and the “badness” the URLs that are in the list) to all internet
URLs. Many browsers now check blacklist databases to address phishing problem and notify
users when they browse phishing pages. Internet Explorer 7, Netscape Browser 8.1, Google
Safe Browsing a feature of the Google Toolbar for Firefox are important browsers which use
blacklists to protect users when they navigating phishing sites. Due to the wide use of blacklists
of phishing sites against phishing, it is very important to introduce techniques that generate the
updated blacklists of phishing sites. The problem of the blacklist is that it is hard to keep the
list up-to-date since it is easy to register new domains in the Internet. In this paper we propose
a technique to detect deceptive phishing pages, as well as our proposed architecture for a
blacklist of phishing sites generator. The rest of paper is organized as follows. Section 2
discusses related works. Section 3 presents our proposed algorithm and the architecture of our
blacklist generator. The evaluation of the approach is given in section 4. Our proposed
technique tries to generate an updated blacklist of phishing sites. Each web page belongs to a
web site and most of them show this relation using the site’s logo. Phishing pages also use
15
legitimate site’s logo to make their pages credible and claim that they belong to that site. Thus
we can find which site a page claims to belong, using its logo. On the other hand, the domain
of a legal site can be found by searching its name in a search engine like Google. Our technique
is based on these two properties of the web pages and search engines to detect phishing pages.
Figure 1 demonstrates our algorithm which gets a URL as input and returns True if the page is
phishing and False if the page is a legitimate one.
2.5 Detection Of Phishing Websites Using Machine Learning
Phishing is a social manipulation assault aimed at leveraging the vulnerability found in the
program at the end of the user. For example, a program may be technically secure enough for
password theft, but an unrecognized user can leak his / her password when an attacker sends a
request for a false password update via a fake website. To resolve this problem, a layer of
security must be added for use. As of late, there have been a few examinations that attempted
to tackle the phishing issue. A few analysts utilized the URL furthermore, contrasted it and,
existing boycotts that contain arrangements of vindictive sites, which they have been making,
and others have utilized the URL in a contrary way, to be specifically contrasting the URL and
a whitelist of real sites. The latter approach uses heuristics, which is used Database of signatures
for any known attacks that match the Signature of the heuristic template to determine whether
it's a phishing This is the platform. Also besides tracking traffic on Alexa 's website is another
way in which researchers have been applied to detect websites for phishing. Phishing is a social
engineering attack that aims at exploiting the weakness found in system processes as caused
by system users. For example, a system can be technically secure enough against password
theft, however unaware end users may leak their passwords if an attacker asked them to update
their passwords via a given Hypertext Transfer Protocol (HTTP) link, which ultimately
threatens the overall security of the system. Moreover, technical vulnerabilities (e.g. Domain
Name System (DNS) cache poisoning) can be used by attackers to construct far more
persuading socially-engineered messages (i.e. use of legitimate, but spoofed, domain names
can be far more persuading than using different domain names). This makes phishing attacks a
layered problem, and an effective mitigation would require addressing issues at the technical
and human layers. Since phishing attacks aim at exploiting weaknesses found in humans (i.e.
system end-users), it is difficult to mitigate them. For example, as evaluated in [1], end-users
failed to detect 29% of phishing attacks even when trained with the best performing user
awareness program. On the other hand, software phishing detection techniques are evaluated
against bulk phishing attacks, which makes their performance practically unknown with
regards to targeted forms of phishing attacks. These limitations in phishing mitigation
16
techniques have practically resulted in security breaches against several organizations
including leading information security providers. The definition by Colin Whittaker et. al. aims
to be broader than Phish Tank’s definition in a sense that attackers goals are no longer restricted
to stealing personal information from victims. On the other hand, the definition still restricts
phishing attacks to ones that act on behalf of third parties, which is not always true. For example
phishing attacks may communicate socially engineered messages to lure victims into installing
MITB malware by attracting the victims to websites that are supposed to deliver safe content
(e.g. video streaming). Once the malware (or crimeware as often named by Anti-Phishing
Working Group (APWG)2 ) is installed, it may log the victim’s keystrokes to steal their
passwords. Note that the attacker in this scenario did not claim the identity of any third party
in the phishing process, but merely communicated messages with links (or attachments) to lure
victims to view videos or multimedia content. In order to address the limitations of the previous
definitions above, we consider phishing attacks as semantic attacks which use electronic
communication channels such as emails, HTTP, SMS, VoIP, etc. . . to communicate socially
engineered messages to persuade victims to perform certain actions (without restricting the
actions) for an attacker’s benefit without restricting the benefits. See Definition 1. Definition
1: Phishing is a type of computer attack that communicates socially engineered messages to
humans via electronic communication channels in order to persuade them to perform certain
actions for the attacker’s benefit. For example, the performed action which the attacker
persuades the victim to perform it for a PayPal user is submitting his/her login credentials to a
fake website that looks similar to PayPal. As a perquisite, this also implies that the attack should
create a need for the end user to perform such action, such as informing him that his/her account
would be suspended unless he logs in to update certain pieces of information.
2.6 A Literature Survey of Phishing Attack Technique
It is a crime to practice phishing by employing technical tricks and social engineering to exploit
the innocence of unaware users. This methodology usually covers up a trustworthy entity so as
to influence a consumer to execute an action if asked by the imitated entity. Most of the times,
phishing attacks are being noticed by the practiced users but security is a main motive for the
basic users as they are not aware of such circumstances. However, some methodologies are
limited to look after the phishing attacks only and the delay in detection is mandatory. In this
paper we emphasize the various techniques used for the detection of phishing attacks. We have
also discovered various techniques for detection and prevention of phishing. Apart from that,
we have introduced a new model for detection and prevention of phishing attacks. The purpose
or goal behind phishing is data, money or personal information stealing through the fake
17
website. The best strategy for avoiding the contact with the phishing web site is to detect real
time malicious URL. Phishing websites can be determined on the basis of their domains. They
usually are related to URL which needs to be registered (low-level domain and upper-level
Recently acquired status of intra-URL relationship is used to evaluate it using distinctive
properties extracted from words that compose a URL based on query data from various search
engines such as Google and Yahoo. These properties are further led to the machine-learning
based classification for the identification of phishing URLs from a real dataset. This paper
focus on real time URL phishing against phishing content by using phish-STORM. For this a
few relationship between the register domain rest of the URL are consider also intra URL
relentless is consider which help to dusting wish between phishing or non phishing URL. For
detecting a phishing website certain typical blacklisted urls are used, but this technique is
unproductive as the duration of phishing websites is very short. Phishing is the name of avenue.
It can be defined as the manner of deception of an organization's customer to communicate
with their confidential information in an unacceptable behavior. It can also be defined as
intentionally using harsh weapons such as Spasm to automatically target the victims and
targeting their private information. As many of the failures being occurred in the SMTP are
exploiting vectors for the phishing websites, there is a greater availability of communication
for malicious message deliveries. Along with the various criminal enterprises, if there is enough
amount of money generated through the mode of phishing, hunting of various other systems of
message delivery can be done, even though the errors are closed eventually in SMTP. Along
with the ever increasing dishonesty through phishing scams, organizations are getting more
attention from their customers regarding the security of their personal information. Anti Phish
is used to avoid users from using fraudulent web sites which in turn may lead to phishing attack.
Here, Anti Phish traces the sensitive information to be filled by the user and alerts the user
whenever he/she is attempting to share his/her information to a untrusted web site. The much
effective elucidation for this is cultivating the users to approach only for trusted websites.
However, this approach is unrealistic. Anyhow, the user may get tricked. Hence, it becomes
mandatory for the associates to present such explanations to overcome the problem of phishing.
Widely accepted alternatives are based on the creepy websites for the identification of “clones”
and maintenance of records of phishing websites which are in hit list. An alternative for
detecting these attacks is a relevant process of reliability of machine on a trait intended for the
reflection of the besieged deception of user by means of electronic communication. This
approach can be used in the detection of phishing websites, or the text messages sent through
emails that are used for trapping the victims. Approximately, 800 phishing mails and 7,000 non
18
phishing mails are traced till date and are detected accurately over 95% of them along with the
categorization on the basis of 0.09% of the genuine emails. We can just wrap up with the
methods for identifying the deception, along with the progressing nature of attacks. Very
complex and dynamic to be identified and classified. Due to the involvement of various
ambiguities in the detection, certain crucial data mining techniques may prove an effective
means in keeping the e-commerce websites safe since it deals with considering various quality
factors rather than exact values. In this paper, an effective approach to overcome the
“fuzziness” in the e-banking phishing website assessment is used an intelligent resilient and
effective model for detecting e-banking phishing websites is put forth. The applied model is
based on fuzzy logics along with data mining algorithms to consider various effective factors
of the e-banking phishing website. . An alternative for detecting these attacks is a relevant
process of reliability of machine on a trait intended for the reflection of the besieged deception
of user by means of electronic communication. This approach can be used in the detection of
phishing websites, or the text messages sent through emails that are used for trapping the
victims. Phishing websites can be determined on the basis of their domains. They usually are
related to URL which needs to be registered (low-level domain and upper-level Recently
acquired status of intra-URL relationship is used to evaluate it using distinctive properties
extracted from words that compose a URL based on query data from various search engines
such as Google and Yahoo. These properties are further led to the machine-learning based
classification for the identification of phishing URLs from a real dataset
2.7 MACHINE LEARNING
Machine learning is a branch of Artificial Intelligence which deals with teaching computers the
ability to learn and improve from experience Kersting, 2018. The primary aim is for the
machine to be able to access data and use it to learn and discover patterns in it which can then
be used to make predictions, perform categorization and clustering Kersting, 2018. ML is
broadly classified into two types – supervised and unsupervised machine learning:
• Supervised Machine learning uses a labelled set of examples to learn patterns and
relationships between the data and the outcome. It then uses these learnings to make predictions
for new data.
19
• Unsupervised Machine Learning tries to uncover the hidden structure from data that is
un labelled. Here it is not about figuring out the right output but instead the focus is on drawing
inferences from the datasets.
In this project we will use Supervised Machine Learning to learn patterns that will help us in
predicting whether a website is malicious or benign. We will do this using the labelled dataset
available at Kaggle. Below I describe a few supervised learning algorithms that we will use to
train our machine learning model.
2.7.1 LOGISTIC REGRESSION
LG is a machine learning algorithm used to train classifiers. It is basically the logistic or sigmoid
function layered on top a linear regression model. Mathematically it is represented as below:
𝑧=𝑤𝑥+𝑏
𝑦=𝑠𝑖𝑔𝑚𝑜𝑖𝑑(𝑧)
𝑠𝑖𝑔𝑚𝑜𝑖𝑑(𝑧)=1/11+𝑒−𝑡
Figure 3: The sigmoid function (Obeid, 2019)
Figure 3 shows that as the value for z gets larger and larger the value of Y tends to 1. On the
other hand, as the value of z gets smaller and smaller the value of Y tends to 0. This means that
20
the output of this model is always in the range of 0 to 1 which basically gives us the probability
of an observation being 1 or 0 Sperandei, 2014.
Logistic regression calculates the probability of a binary outcome and by setting a threshold, it
classifies the data points into either outcome. Our data will have a binary outcome as we must
decide whether a website is malicious or not.
LOGISTIC REGRESSION COEFFICIENT TABLE
Once the logistic regression model is trained, we can see how each individual predictor variable
in the model relates with the target model using the coefficient table Logistic Regression
Essentials in R - Articles - STHDA, 2020; Peng et al., 2002.
Table 1. Coefficient Standard Error Z_ Value P_ Value

Example logistic
regression
coefficient table
Predictor
(Intercept) -3.990 1.140 -3.500 0.000
gre 0.002 0.001 2.070 0.038
gpa 0.804 0.332 2.423 0.015
rank2 -0.675 0.316 -2.134 0.033
rank3 -1.340 0.345 -3.881 0.000
rank4 -1.551 0.418 -3.713 0.000
This table can be interpreted as follows:

• For continuous variables such as ‘gre’ and ‘gpa’ the coefficient column tells us the
change in the log-odds of admission per unit change in the predictor variable.
• Categorical variables like rank have a slightly different interpretation. Here the
categories of the variable are compared to one reference category of the same variable.
• Variables and their coefficients can be deemed statistically significant based on the p-
value column according to pre-defined error rate.
2.7.2 SUPPORT VECTOR MACHINE

The objective of a SVM is to find the decision boundary in a n-dimensional space that
maximizes the distance between datapoints of target classes. New data points that fall on either
side of this decision boundary are then classified into their respective classes. Support vectors
are those data points that are closest to the decision boundary. These influence the position and
21
orientation of the boundary itself and are used to optimize the decision boundaries location.
The output of an SVM lies in the range [-1,1]. SVM is not restricted to binary outcomes, though
we can use it for our purpose. SVM will use kernel tricks to classify websites into malicious or
benign/safe category. Kernel technique es include sigmoid, linear, polynomial and radial. There
are several other techniques, but we can focus on these. Figure below shows a typical plot of
support vector machine.
In Figure 4, the red line represents the decision boundary generated using SVM. The dashed
lines represent the support vectors. Any data point lying beyond the positive support vector is
classified to the blue class while any data point lying beyond the negative support vector is
classified to the green class.
Figure 4: A 2-dimensional decision boundary generated using SVM
2.7.3 RANDOM FOREST
Random forests are group of machine learning models. This means that they use predictions
made by many weak models and combine them to generate the actual prediction. In the case of
random forests, the weak models are trained as decision trees. The random forest method also
uses the technique of bagging. What this means is that for training each decision tree in the
forest, a random sample of size N is sampled from the original training data. Along with
sampling from training observations a random sample of features is also sampled. This means
that no single tree is trained using all training data and features. This randomness ensures that
each decision tree is uncorrelated to other decision trees. All decision trees are trained
independently of each other. Once each individual tree has decided, a collective decision is
taken using a voting method. The process of bagging is further shown in Figure 5.
22
Figure 5: Bagging in Random Forests
2.7.4 K- NEAREST NEIGHBOUR
The k-nearest neighbor algorithm, also known as KNN or k-NN, is a non-parametric,

supervised learning classifier, which uses proximity to make classifications or predictions
about the grouping of an individual data point. While it can be used for either regression or
classification problems, it is typically used as a classification algorithm, working off the
assumption that similar points can be found near one another.
2.7.5 NAVIE BAYES

Naïve Bayes algorithm is a supervised learning algorithm, which is based on Bayes theorem
and used for solving classification problems. It is mainly used in text classification that includes
a high-dimensional training dataset. Naïve Bayes Classifier is one of the simple and most
effective Classification algorithms which helps in building the fast machine learning models
that can make quick predictions. It is a probabilistic classifier, which means it predicts on the
basis of the probability of an object. Some popular examples of Naïve Bayes Algorithm are
spam filtration, Sentimental analysis, and classifying articles. Bayes' theorem is also known as
Bayes' Rule or Bayes' law, which is used to determine the probability of a hypothesis with
prior knowledge. It depends on the conditional probability.
The formula for Bayes' theorem is given as:
23
2.8 SOFTWARE DESCRIPTION
2.8.1 Selection of programming language - Python
Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and dynamic
binding, make it very attractive for Rapid Application Development, as well as for use as a
scripting or glue language to connect existing components together. Python's simple, easy to
learn syntax emphasizes readability and therefore reduces the cost of program maintenance.
Python supports modules and packages, which encourages program modularity and code reuse.
The Python interpreter and the extensive standard library are available in source or binary form
without charge for all major platforms and can be freely distributed.
Programmers prefer python because of the increased productivity it provides. Since there is no
compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python programs is
easy. A bug or bad input will never cause a segmentation fault. Instead, when the interpreter
discovers an error, it raises an exception. When the program doesn't catch the exception, the
interpreter prints a stack trace. A source level debugger allows inspection of local and global
variables, evaluation of arbitrary expressions, setting breakpoints, stepping through the code a
line at a time, and so on. On the other hand, often the quickest way to debug a program is to
add a few print statements to the source. The fast edit-test debug cycle makes this simple
approach very effective.
24
2.8.2 JUPYTER NOTEBOOK
The Jupyter Notebook App is a server-customer application that permits altering and running
note pad records by means of an internet browser. The Jupyter Notebook App can be executed
on a nearby work area requiring no web access (as portrayed in this report) or can be introduced
on a remote server and got to through the web. Notwithstanding showing/altering/running note
pad archives, the Jupyter Notebook App has a "Dashboard" (Notebook Dashboard), a "control
board" indicating nearby records and permitting to open note pad reports or closing down their
portions.
A scratch pad part is a "computational motor" that executes the code contained in a Notebook
record. The python part, referenced in this guide, executes python code. Portions for some,
different dialects exist (official parts).When you open a Notebook report, the related part is
consequently propelled. At the point when the scratch pad is executed (either cell-by-cell or
with menu Cell - > Run All), the portion plays out the calculation and produces the outcomes.
Contingent upon the sort of calculations, the piece may expend critical CPU and RAM. Note
that the RAM isn't discharged until the part is closed down, he Notebook Dashboard is the part
which is indicated first when you dispatch Jupyter Notebook App. The Notebook Dashboard
is essentially used to open note pad archives, and to deal with the running portions (picture and
shutdown).The Notebook Dashboard has different highlights like a record director, in particular
exploring organizers and renaming/erasing documents.
2.8.3 MATPLOTLIB
People are exceptionally visual animals: we comprehend things better when we see things
envisioned. Not with standing, the progression to showing investigations, results or bits of
knowledge can be a bottleneck: you probably won't realize where to begin or you may have as
of now a correct configuration as a top priority, however then inquiries like "Is this the correct
method to imagine the bits of knowledge that I need to convey to my group of onlookers?" will
have unquestionably gone over your brain.
When you're working with the Python plotting library Matplotlib, the initial step to responding
to the above inquiries is by structure up information on themes like: The life structures of a
Matplotlib plot: what is a subplot? What are the Axes? What precisely is a figure? Plot creation,
which could bring up issues about what module you precisely need to import (pylab or pyplot?),
how you precisely ought to approach instating the figure and the Axes of your plot, how to
utilize matplotlib in Jupyter note pads, and so on. Plotting schedules, from straightforward
25
approaches to plot your information to further developed methods for picturing your
information. Essential plot customizations, with an emphasis on plot legends and content, titles,
tomahawks marks and plot format.
Sparing, appearing, your plots: demonstrate the plot, spare at least one figures to, for instance,
pdf documents, clear the tomahawks, clear the figure or close the plot, and so on. In conclusion,
you'll quickly cover two manners by which you can alter Matplotlib: with templates and the
settings. Since all is set for you to begin plotting your information, it's an ideal opportunity to
investigate some plotting schedules. You'll regularly go over capacities like plot() and
disperse(), which either draw focuses with lines or markers interfacing them, or draw detached
focuses, which are scaled or shaded. In any case, as you have just found in the case of the
primary area, you shouldn't neglect to pass the information that you need these capacities to
utilize!
2.8.4 NUMPHY
NumPy is, much the same as SciPy, Scikit-Learn, Pandas, and so forth one of the bundles that
you can't miss when you're learning information science, principally in light of the fact that this
library gives you a cluster information structure that holds a few advantages over Python
records, for example, being increasingly reduced, quicker access in perusing and composing
things, being progressively advantageous and increasingly productive.
NumPy exhibits are somewhat similar to Python records, yet at the same time particularly
unique in the meantime. For those of you who are new to the subject, how about we clear up
what it precisely is and what it's useful for. As the name gives away, a NumPy cluster is a focal
information structure of the numpy library. The library's name is another way to say "Numeric
Python" or "Numerical Python".
At the end of the day, NumPy is a Python library that is the center library for logical registering
in Python. It contains an accumulation of apparatuses and strategies that can be utilized to settle
on a PC numerical models of issues in Science and Engineering. One of these apparatuses is an
elite multidimensional cluster object that is an incredible information structure for effective
calculation of exhibits and lattices. To work with these clusters, there's a tremendous measure
of abnormal state scientific capacities work on these grids and exhibits. since you have set up
your condition, it's the ideal opportunity for the genuine work. In fact, you have officially gone
for some stuff with exhibits in the above Data Camp Light pieces. Be that as it may, you haven't
generally gotten any genuine hands- on training with them, since you originally expected to
26
introduce NumPy all alone pc. Since you have done this current, it's a great opportunity to
perceive what you have to do so as to run the above code pieces without anyone else.
2.8.5 PANDAS
Pandas is an open-source, BSD-authorized Python library giving elite, simple to-utilize

information structures and information examination instruments for the Python programming
language. Python with Pandas is utilized in a wide scope of fields including scholastic and
business areas including money, financial matters, Statistics, examination, and so on. In this
instructional exercise, we will get familiar with the different highlights of Python Pandas and
how to utilize them practically speaking. This instructional exercise has been set up for the
individuals who try to become familiar with the essentials and different elements of Pandas. It
will be explicitly valuable for individuals working with information purging and examination.
In the wake of finishing this instructional exercise, you will wind up at a moderate dimension
of ability from where you can take yourself to more elevated amounts of skill. You ought to
have a fundamental comprehension of Computer Programming phrasings. A fundamental
comprehension of any of the programming dialects is an or more. Pandas library utilizes the
vast majority of the functionalities of NumPy. It is recommended that you experience our
instructional exercise on NumPy before continuing with this instructional exercise.
27
CHAPTER 3
28
SYSTEM ANALYSIS
3.1 FUNCTIONAL REQUIREMENTS
A function of software system is defined in functional requirement and the behavior of the
system is evaluated when presented with specific inputs or conditions which may include
calculations, data manipulation and processing and other specific functionality.
• Our system should be able to load air quality data and preprocess data.
• It should be able to analyze the air quality data.
• It should be able to group data based on hidden patterns.
• It should be able to assign a label based on its data groups.
• It should be able to split data into trainset and testset.
• It should be able to train model using trainset.
• It must validate trained model using testset.
• It should be able to display the trained model accuracy.
• It should be able to accurately predict the air quality on unseen data.
3.2 NON-FUNCTIONAL REQUIREMENTS
Nonfunctional requirements describe how a system must behave and establish constraints of its
functionality. This type of requirements is also known as the system’s quality attributes.
Attributes such as performance, security, usability, compatibility are not the feature of the
system, they are a required characteristic. They are "developing" properties that emerge from
the whole arrangement and hence we can't compose a particular line of code to execute them.
Any attributes required by the customer are described by the specification. We must include
only those requirements that are appropriate for our project. Some Non-Functional
Requirements are as follows:
• Reliability
• Maintainability
• Performance
• Portability
• Scalability
• Flexibility
Some of the quality attributes are as follows:
29
3.2.1 ACCESIBILITY
Availability is a general term used to depict how much an item, gadget, administration, or
condition is open by however many individuals as would be prudent.
In our venture individuals who have enrolled with the cloud can get to the cloud to store and
recover their information with the assistance of a mystery key sent to their email ids.
UI is straightforward and productive and simple to utilize.
3.2.2 MAINTAINABILITY
In programming designing, viability is the simplicity with which a product item can be altered so as to:
• Correct absconds
• Meet new necessities
New functionalities can be included in the task based the client necessities just by adding
the proper documents to existing venture utilizing ASP.net and C# programming dialects. Since
the writing computer programs is extremely straightforward, it is simpler to discover and
address the imperfections and to roll out the improvements in the undertaking.
3.2.3 SCALABILITY
Framework is fit for taking care of increment all out throughput under an expanded burden
when assets (commonly equipment) are included.
Framework can work ordinarily under circumstances, for example, low data transfer capacity
and substantial number of clients.
3.2.4 PORTABILITY
Convey ability is one of the key ideas of abnormal state programming. Convenient is the
product code base component to have the capacity to reuse the current code as opposed to
making new code while moving programming from a domain to another. Venture can be
executed under various activity conditions gave it meet its base setups. Just framework records
and dependent congregations would need to be designed in such case. The functional
requirements for a system describe what the system should do.
Those requirments depend on the type of software being developed, the expected users of the
software. These are the statement of services the system should provide, how the system should
react to particular inputs and how the system should behave in particular situation.
• Extracting data from CSV files

• Cleaning the data.
• Vector Representation.
30
Non-functional requirements is not about functionality or behavior of system, but rather are
used to specify the capacity of a system. They are more related to properties of system such
as quality, reliability and quick response time. Non- functional requirements come up via
customer needs, because of budget, interoperability need such as software and hardware
requirement, organizational policies or due to some external factors such as:-
• Basic Operational Requirements
• Organizational Requirements
• Product Requirements
• User Requirements
3.2.4.1 BASIC OPERATIONAL REQUIREMENTS

The four primary functions of systems engineering are all performed by the end users, which
is the customers. Operational requirements which are given by:-
• Mission profile or scenario: It is a map which describes the procedures and leads us
to the final goal/ objective. The goal of proposed system is, to predict the crop yield
prediction for future year using previous year dataset.
• Performance: It basically gives system parameters to reach our goal. Parameters for the
proposed system are accurate predicted value which is compared to the existing system.
Utilization environments: It enlists the different permutations and combinations a
system can be reused in many other applications which gives better prediction, as
well as gives a new approach to prediction techniques.
• Life cycle: It discuss about the life span of a system. As number of data increases the
number of iterations increases, which will give more accuracy to the output.
3.2.4.2 ORGANIZATIONAL REQUIREMENTS
The Organizational requirement consists of the following types:
• Process Standards: To make sure the system is a quality product, IEEE standards have
been used during system development.
• Design Methods: Design is an important step, on which all other steps in the
engineering process are based on.
31
3.2.4.3 PRODUCT REQUIREMENTS
• Portability: As the system is Python based, it will run on a platform which is supported
by ANACONDA.
• Correctness: The system has been put through rigorous testing after it has followed
strict guidelines and rules. The testing has validated the data.
• Ease of Use: The user interface allows the user to interact with the system at a very
comfortable level with no hassles.
• Modularity: The many different modules in the system are neatly defined for ease of
use and to make the product as flexible as possible with different permutations and
combinations.
3.2.4.4 USER REQUIREMENTS
• The user should able to have User Interface Window with Visualize Graphics.
• The user should able to configure with neat GUI all the parameters.
3.3 HARDWARE REQUIREMENTS
The following is the hardware requirements of the system for the proposed system:
• Processor : Any Processor above 500 MHz

• RAM : 8 GB
• Hard Disk : 1 TB
• Input device : Standard keyboard and mouse
3.4 SOFTWARE REQUIREMENTS
The following is the software requirements of the system for the proposed system:
• OS : Windows 10
• Platform : Jupyter Notebook
• Language : Python
• IDE/tool : Anaconda 3-5.0.3
32
3.5 DATASETS
The key to success in the field of machine learning or to become a great data scientist is to
practice with different types of datasets. But discovering a suitable dataset for each kind of
machine learning project is a difficult task. So, in this topic, we will provide the detail of the
sources from where you can easily get the dataset according to your project. Before knowing
the sources of the machine learning dataset, let's discuss datasets.
A dataset is a collection of data in which data is arranged in some order. A dataset can contain
any data from a series of an array to a database table. Below table shows an example of the
dataset:
A tabular dataset can be understood as a database table or matrix, where each column
corresponds to a particular variable, and each row corresponds to the fields of the dataset.
The most supported file type for a tabular dataset is "Comma Separated File," or CSV.
But to store a "tree-like data," we can use the JSON file more efficiently.
Types of data in datasets

Numerical data: Such as house price, temperature, etc
Categorical data: Such as Yes/No, True/False, Blue/green, etc.
Ordinal data: These data are similar to categorical data but can be measured on the basis of
comparison.
Note: A real-world dataset is of huge size, which is difficult to manage and process at
the initial level. Therefore, to practice machine learning algorithms, we can use any
dummy dataset.
Need of Dataset
To work with machine learning projects, we need a huge amount of data, because, without the
data, one cannot train ML/AI models. Collecting and preparing the dataset is one of the most
crucial parts while creating an ML/AI project.The technology applied behind any ML projects
cannot work properly if the dataset is not well prepared and pre- processed During the
development of the ML project, the developers completely rely on the datasets. In building ML
applications, datasets are divided into two parts training and testing:
33
Kaggle Datasets
Kaggle is one of the best sources for providing datasets for Data Scientists and Machine
Learners. It allows users to find, download, and publish datasets in an easy way. It also provides
the opportunity to work with other machine learning engineers and solve difficult Data Science
related tasks.
34
CHAPTER 4
35
ARCHITECTURE AND DESIGN
4.1 DATA FLOW DIAGRAM
The database may be defined as an organized collection of related information. The organized
information serves as a base from which further recognizing can be retrieved desired information or
processing the data. The most important aspect of building an application system is the design of tables.
The data flow diagram is used for classifying system requirements to major transformation that will
become programs in system design. This is starting point of the design phase that functionally
decomposes the required specifications down to the lower level of details. It consists of a series of
bubbles joined together by lines.
LEVEL 0 DATA FLOW DIAGRAM
Fig 4.1.1 level 0 dataflow diagram
36
37
38
39
4.2 USECASE DIAGRAM
A use case diagram at its simplest is a representation of a user's interaction with the system that
shows the relationship between the user and the different use cases in which the user is
involved. A use case diagram can identify the different types of users of a system and the
different use cases and will often be accompanied by other types of diagrams as well. The figure
shows the use case diagram for the system.
The following figure shows the use case diagram:
Fig 4.2.1 use case diagram
40
4.3 SYSTEM SEQUENCE DIAGRAM
A system sequence diagram is, as the name suggests, a type of sequence diagram in
UML. These charts show the details of events that are generated by actors from outside the
system. Standard sequence diagrams show the progression of events over a certain amount of
time, while system sequence diagrams go a step further and present sequences for specific use
cases. Use case diagrams are simply another diagram type which represents a user's interaction
with the system. An SSD shows – for one particular scenario of a use case –
Fig 4.3.1 system sequence diagram
41
CHAPTER 5
42
IMPLEMENTATION
Implementation is the process of defining how the system should be built, ensuring that it is
operational and meets quality standards. It is a systematic and structured approach for
effectively integrating a software-based service or component into the requirements of end
users. This chapter of the report illustrates the approach employed to classify the URLs as either
phishing or legitimate. The methodology involves building a training set. The training set is used
for training a machine learning model, i.e., the classifier. Fig shows the diagrammatic
representation of the implementation.
43
5.1 Overview of system implementation
The plan contains an overview of the system, a brief description of the major tasks involved
in the implementation, the overall resources needed to support the implementation effort
and any site-specific implementation requirements.
5.1.1 Selection of programming language - Python

Python is an interpreted, object-oriented, high-level programming language with dynamic
semantics. Its high-level built in data structures, combined with dynamic typing and
dynamic binding, make it very attractive for Rapid Application Development, as well as
for use as a scripting or glue language to connect existing components together. Python's
simple, easy to learn syntax emphasizes readability and therefore reduces the cost of
program maintenance. Python supports modules and packages, which encourages program
modularity and code reuse. The Python interpreter and the extensive standard library are
available in source or binary form without charge for all major platforms and can be freely
distributed.
Programmers prefer python because of the increased productivity it provides. Since there
is no compilation step, the edit-test-debug cycle is incredibly fast. Debugging Python
programs is easy. A bug or bad input will never cause a segmentation fault. Instead, when
the interpreter discovers an error, it raises an exception. When the program doesn't catch
the exception, the interpreter prints a stack trace. A source level debugger allows inspection
of local and global variables, evaluation of arbitrary expressions, setting breakpoints,
stepping through the code a line at a time, and so on. On the other hand, often the quickest
way to debug a program is to add a few print statements to the source. The fast edit- test
debug cycle makes this simple approach very effective.
5.2 SYSTEM MAINTENANCE

System implementation is the important stage of project when the theoretical design is tuned
into practical system.
The main stages in the implementation are as follows
• Planning Training
• System test
• Changeover Planning
Planning is the first task in the system implementation. At the time of implementation of
any system people from different departments and system analysis involve. They are
confirmed to practical problem of controlling various activities of people outside their
44
own data processing departments. The line managers controlled through an
implementation coordinating committee.
The committee considers ideas, problems and complaints of user department, it must also
consider
• The implication of system environment;
• Self selection and allocation for implementation tasks;
• Consultation with union sand resources available;
• Standby facilities and channels of communication

5.2.1 TRAINING
To achieve the objectives and benefits from computer based system, it is essential for the
people who will be involved to be confident of their role in new system. These involve
them in understanding overall system and its effect on the organization and in being able
to carry out effectively their specified task. So training must take place at an early stage.
Training session must give user staff, the skills required in their new jobs.
5.2.2 SYSTEM TESTING
It is the stage of implementation, which ensures that system works accurately and
effectively before the live operation commences. It is a confirmation that all are correct
and opportunity to show the users that the system must be tested with text data and show
that the system will operate successfully and produce expected results under expected
conditions .Before implementation, the proposed system must be tested with raw data to
ensure that the modules of the system work correctly and satisfactorily. The system must
be tested with valid data to achieve its objective. The purpose of system testing is to
identify and correct errors in the candidate system. As important this phase is, it is one that
is frequently compromised. Typically, the project schedule or the user is eager to go
directly to conversion.
This creates two problems
• The time lag between the cause and appearance of the problem
• The effect of system errors on files and records within the system, a small system
error can conceivably exploded into much larger problem. Effectively early in the
process translates directly into long term cost savings from reduce number of
errors.
45
5.2.3 CHANGEOVER
Changeover is the process where the existing system is converted into the new system.
The changeover from old to new system takes place when:
• The objective of this project is to train machine learning models and deep neural nets
on the dataset created to predict phishing websites.
• Both phishing and benign URLs of websites are gathered to form a dataset and from
them required URL and website content-based features are extracted.
• The performance level of each model is measures and compared.
• A phishing website is a common social engineering method that mimics trustful
uniform resource locators (URLs) and webpages.
46
CHAPTER 6
47
TESTING
The purpose of testing is to discover errors. Testing is the process of trying to discover every
conceivable fault or weakness in a work product. It provides a way to check the functionality
of components, sub assemblies, assemblies and/or a finished product it is the process of
exercising software with the intent of ensuring that the Software system meets its requirements
and user expectations and does not fail in an unacceptable manner. There are various types of
test. Each test type addresses a specific testing requirement.
6.1 TYPES OF TESTS
6.1.1 UNIT TESTING

Unit testing involves the design of test cases that validate that the internal program logic is
functioning properly, and that program inputs produce valid outputs. All decision branches and
internal code flow should be validated. It is the testing of individual software units of the
application. It is done after the completion of an individual unit before integration. This is a
structural testing, that relies on knowledge of its construction and is invasive. Unit tests perform
basic tests at component level and test a specific business process, application, and/or system
configuration. Unit tests ensure that each unique path of a business process performs accurately
to the documented specifications and contains clearly defined inputs and expected results.
6.1.2 INTEGRATION TESTING
Integration tests are designed to test integrated software components to determine if they
actually run as one program. Testing is event driven and is more concerned with the basic
outcome of screens or fields. Integration tests demonstrate that although the components were
individually satisfaction, as shown by successfully unit testing, the combination of components
is correct and consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.
6.1.3 VALIDATION TESTING

An Engineering Validation Test (EVT) is performed on first engineering prototypes, to ensure
that the basic unit performs to design goals and specifications. It is important in identifying
design problems, and solving them as early in the design cycle as possible, is the key to keeping
projects on time and within budget. Too often, product design and performance problems are
not detected until late in the product development cycle — when the product is ready to be
48
shipped. The old adage holds true: It costs a penny to make a change in engineering, a dime in
production and a dollar after a product is in the field.
Verification is a Quality control process that is used to evaluate whether or not a product,
service, or system complies with regulations, specifications, or conditions imposed at the start
of a development phase. Verification can be in development, scale-up, or production. This is
often an internal process.
Validation is a Quality assurance process of establishing evidence that provides a high degree
of assurance that a product, service, or system accomplishes its intended requirements. This
often involves acceptance of fitness for purpose with end users and other product stakeholders.
The testing process overview is as follows:
Figure: The testing process
6.2 SYSTEM TESTING

System testing of software or hardware is testing conducted on a complete, integrated system
to evaluate the system's compliance with its specified requirements. System testing falls within
the scope of black box testing, and as such, should require no knowledge of the inner design of
the code or logic.
As a rule, system testing takes, as its input, all of the "integrated" software components that
have successfully passed integration testing and also the software system itself integrated with
any applicable hardware system(s).
49
System testing is a more limited type of testing; it seeks to detect defects both within the "inter
assemblages" and also within the system as a whole.
System testing is performed on the entire system in the context of a Functional Requirement
Specification FRS and/or a System Requirement Specification SRS.
System testing tests not only the design, but also the behavior and even the believed
expectations of the customer. It is also intended to test up to and beyond the bound defined in
the software/hardware requirements specification.
50
CHAPTER 7
51
RESULT
RESLUT 7.1 Features of dataset
ATTRIBUTES USED IN THE DATASET
IP_ address, URL_ length, shortening_ service, at_ symbol, double_ slash_ red, prefix_ suffix,
sub_ dom, ssl final_ state, domain_ registration and so on…
52
7.2 OUTPUT PREDICTION
Fig7.2.1 output 1
Fig7.2.2 output 2
53
Fig7.2.3 output 3
54
Fig7.2.4 output 4
Fig7.2.5 output 5
55
Fig7.2.6 output 6
56
Fig7.2.7 output 7
Fig7.2.8 output 8
57
58
Fig7.2.9 output 9
59
Fig7.2.10 output 10
60
Fig7.2.11 output 11
Fig7.2.12 output 12
61
Fig7.2.13 output 13
Fig7.2.14 output 14
62
Fig7.2.15 output 15
63
Fig7.2.16 output 16
64
Fig7.2.17 output 17
The logistic regression algorithm and random forest got more accuracy to detect phishing
websites, URL, ip address etc
65
CONCLUSION
In this project, we have explored how well to classify phishing URLs from the given set of
URLs containing benign and phishing URLs. We have also discussed the randomization of
the dataset, feature engineering, feature extraction using lexical analysis host-based features
and statistical analysis. We have also used different classifiers for the comparative study and
found that the findings are almost consistent across the different classifiers. We also
observed dataset randomization yielded a great optimization and the accuracy of the
classifier improved significantly. We have adopted a simple approach to extract the features
from the URLs using simple regular expressions. There could be more features that can be
experimented and that might lead to improving further the accuracy of the system. The
dataset used in this paper contains the URLs list which may be a little old, hence regular
continuous training along with a new dataset would enhance the model accuracy and
performance significantly. In our experiment we have not used the content based features as
the main problem with the content-based strategy for detecting phishing URLs is the
nonavailability of phishing web-sites and the life span of the phishing website is small, and
it is difficult to train an ML classifier based on its content-based features. In the future, we
would like to incorporate a rule-based prediction based on the content analysis of a URL.
Hence, the combination of classification based lexical analyzer along with a rule-based URL
content analyzer for phishing URL detection would provide a comprehensive solution. It is
outstanding that a decent enemy of phishing apparatus ought to anticipate the phishing
assaults in a decent timescale. We accept that the accessibility of a decent enemy of phishing
device at a decent time scale is additionally imperative to build the extent of anticipating
phishing sites. This apparatus ought to be improved continually through consistent
retraining. As a matter of fact, the accessibility of crisp and cutting-edge preparing dataset
which may gained utilizing our very own device 30, 32 will help us to retrain our model
consistently and handle any adjustments in the highlights, which are influential in deciding
the site class. Albeit neural system demonstrates its capacity to tackle a wide assortment of
classification issues, the procedure of finding the ideal structure is very difficult, and much
of the time, this structure is controlled by experimentation. Our model takes care of this issue
via computerizing the way toward organizing a neural system conspire; hence, on the off
chance that we construct an enemy of phishing model and for any reasons we have to refresh
it, at that point our model will encourage this procedure, that is, since our model will
mechanize the organizing procedure and will request scarcely any client defined parameters.
66
CHAPTER 8
67
REFERENCES
[1] Matthew Dunlop, Stephen Groat, David Shelly (2010) " Gold Phish: Using Images for
Content-Based Phishing Analysis”
[2] Rishikesh Mahajan (2018) “Phishing Website Detection using Machine Learning
Algorithms”
[3] Purvi Pujara, M. B. Chaudhari (2018) “Phishing Website Detection using Machine
Learning : A Review”
[4] David G. Dobolyi, Ahmed Abbasi (2016) “Phish Monger: A Free and Open Source
Public Archive of Real-World Phishing Websites”
[5] Satish.S, Suresh Babu. K (2013) “Phishing Websites Detection Based On Web Source
Code And Url In The Webpage”
[6] Purvi Pujara, M. B. Chaudhari (2018) “Phishing Website Detection using Machine
Learning : A Review”
[7] Satish. S, Suresh Babu. K (2013) “Phishing Websites Detection Based On Web Source
Code And Url In The Webpage”
[8] Tenzin Dakpa, Peter Augustine (2017) “Study of Phishing Attacks and Preventions”
[9] Ping Yi (2018) “Web Phishing Detection Using a Deep Learning Framework” [
10] Jalil Nour mohammadi Khiarak (2017) “What is Machine Learning”
[11] Sadia Afroz, Rachel Greenstadt (2018) “Phish Zoo: An Automated Web Phishing
Detection Approach Based on Profiling and Fuzzy Matching”
[12] Arun Kulkarni, Leonard L. Brown (2019) “Phishing Websites Detection using Machine
Learning”
[13] Rohan Saraf , Mayur Khatri , Mona Mulchandani (2014) “Phish Tank-A Phishing
Detection Tool”
[14] Sadia Afroz, Rachel Greenstadt (2017) “Phish Zoo: Detecting Phishing Websites By
Looking at Them”
[15] Matthew Dunlop, Stephen Groat, David Shelly (2010) " Gold Phish: Using Images for
Content-Based Phishing Analysis” 2020 16th IEEE International Colloquium on Signal
Processing & its Applications (CSPA 2020), 28-29 Feb. 2020, Langkawi, Malaysia
68

Project

Uploaded by

Copyright:

Available Formats

Project

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Project

Uploaded by

Copyright:

Available Formats

CHAPTER 1

1.1 PROBLEM DEFINITION

1.2 PROJECT PURPOSE

1.3 BACKGROUND OF THE STUDY

1.4 PROJECT FEATURES

1.5 PROJECT OVERVIEW

• Presence of IP address in URL Presence of @ symbol in URL

Information submission to Email: Phisher might use “mail()” or “mailto:” functions to

2.7 MACHINE LEARNING

2.7.1 LOGISTIC REGRESSION

Figure 3: The sigmoid function (Obeid, 2019)

LOGISTIC REGRESSION COEFFICIENT TABLE

Table 1. Coefficient Standard Error Z_ Value P_ Value

This table can be interpreted as follows:

2.7.2 SUPPORT VECTOR MACHINE

Figure 4: A 2-dimensional decision boundary generated using SVM

2.7.3 RANDOM FOREST

2.7.4 K- NEAREST NEIGHBOUR

The k-nearest neighbor algorithm, also known as KNN or k-NN, is a non-parametric,

2.7.5 NAVIE BAYES

Pandas is an open-source, BSD-authorized Python library giving elite, simple to-utilize

3.2 NON-FUNCTIONAL REQUIREMENTS

• Extracting data from CSV files

• Basic Operational Requirements

3.2.4.1 BASIC OPERATIONAL REQUIREMENTS

• Processor : Any Processor above 500 MHz

3.4 SOFTWARE REQUIREMENTS

Types of data in datasets

4.1 DATA FLOW DIAGRAM

LEVEL 0 DATA FLOW DIAGRAM

Fig 4.1.1 level 0 dataflow diagram

Fig 4.1.2 level 1 dataflow diagram

Fig 4.1.3 level 1 dataflow diagram

Fig 4.1.4 level 2 dataflow diagram

The following figure shows the use case diagram:

Fig 4.2.1 use case diagram

Fig 4.3.1 system sequence diagram

5.1.1 Selection of programming language - Python

5.2 SYSTEM MAINTENANCE

• Self selection and allocation for implementation tasks;

• Consultation with union sand resources available;

• Standby facilities and channels of communication

This creates two problems

6.1 TYPES OF TESTS

6.1.1 UNIT TESTING

6.1.2 INTEGRATION TESTING

6.1.3 VALIDATION TESTING

The testing process overview is as follows:

Figure: The testing process

6.2 SYSTEM TESTING

RESLUT 7.1 Features of dataset

ATTRIBUTES USED IN THE DATASET

10] Jalil Nour mohammadi Khiarak (2017) “What is Machine Learning”

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.