
DETECTING FAKE NEWS ON

SOCIAL MEDIA
A Project report submitted in partial fulfilment of the
requirements for the award of the Degree of

BACHELOR OF COMPUTER SCIENCE

Submitted by

ARULMOZHI.I (21CS1108)
GAYATHRI.E (21CS1114)
PAVITHRA.B (21CS1121)

Under the guidance of

Mr. R.BABU M.C.A., M.E.,


(Assistant Professor, Department of Computer Science)

DEPARTMENT OF COMPUTER SCIENCE


PSV COLLEGE OF ARTS AND SCIENCE
PONDICHERRY UNIVERSITY

PONDICHERRY, INDIA.

May 2024

PSV COLLEGE OF ARTS AND SCIENCE
PONDICHERRY UNIVERSITY
PUDUCHERRY – 607 402

DEPARTMENT OF COMPUTER SCIENCE

BONAFIDE CERTIFICATE

This is to certify that this project work entitled “DETECTING FAKE
NEWS ON SOCIAL MEDIA” is a bonafide work done by ARULMOZHI.I
(21CS1108), GAYATHRI.E (21CS1114), PAVITHRA.B (21CS1121) in partial
fulfillment of the requirements for the award of the Degree of Bachelor of
Science in Computer Science by Pondicherry University during the academic
year 2021-2024.

PROJECT GUIDE HEAD OF THE DEPARTMENT

Submitted for the University Examination held on _____________________

INTERNAL EXAMINER EXTERNAL EXAMINER

ACKNOWLEDGEMENT

We are greatly thankful to all those who helped us in making this project successful.

We find great pleasure in thanking our Chairman and Managing Director

Thiru. S.SELVAMANI and our Secretary Dr. S.VIKNESH M.Tech, Ph.D., for all their

support and encouragement, and for providing us with a better environment for studying as

well as equipping ourselves for the learning environment.

It is a great pleasure to thank our Director Dr. N.GOBU M.E., MBA, Ph.D., MISTE,

AMIE, MIIW for his valuable guidance, support and encouragement in doing our project work.

It gives us great pleasure to convey our deep and sincere thanks to our Principal

Dr. K.KAMALAKKAN, Ph.D., of PSV College of Arts and Science for having given us

permission to take up this project and for his kind patronage.

We wish to express our deep sense of gratitude to our Head of the Department

Mr. R.BABU M.C.A., M.E., for his valuable guidance and encouragement throughout this

project work.

We wish to express our grateful thanks to our Project Guide Mr. R.BABU M.C.A.,

M.E., Assistant Professor, for his able guidance and useful suggestions, which helped us in

completing the project work in time.

We also thank all our Department Faculty and Lab Administrators for their

timely guidance in the conduct of our project work and for all their valuable assistance in the

project work.

Finally, yet importantly, we would like to express our heartfelt thanks to our beloved

parents for their blessings, and to our friends, classmates, and seniors for their help and wishes

for the successful completion of this project.

CHAPTER NO    TITLE

     ABSTRACT
I    INTRODUCTION
II   PROBLEM DEFINITION
III  SYSTEM STUDY
     3.1 FEASIBILITY STUDY
     3.2 ECONOMICAL FEASIBILITY
     3.3 TECHNICAL FEASIBILITY
     3.4 SOCIAL FEASIBILITY
IV   EXISTING SYSTEM
V    PROPOSED SYSTEM
     5.1 OVERVIEW
     5.2 ADVANTAGES
     5.3 RANDOM FOREST ALGORITHM
VI   SYSTEM REQUIREMENTS SPECIFICATION
     6.1 HARDWARE REQUIREMENTS
     6.2 SOFTWARE REQUIREMENTS
VII  SYSTEM ANALYSIS
     7.1 ARCHITECTURE
     7.2 DATA FLOW DIAGRAM
VIII SYSTEM IMPLEMENTATION
     8.1 MODULES
     8.2 MODULE DESCRIPTION
IX   SOFTWARE ENVIRONMENT
X    SYSTEM TESTING
XI   APPENDIX
     11.1 SOURCE CODE
     11.2 SCREENSHOTS
XII  CONCLUSION AND FUTURE ENHANCEMENT
XIII BIBLIOGRAPHY
ABSTRACT
The rapid development of social media and content-sharing platforms
has been largely exploited to spread misinformation and fake news that make
people believe harmful stories, influence public opinion, and can cause panic
and chaos among the population. Thus, fake news detection has become an
important research topic, aiming at flagging a specific content as fake or
legitimate. Fake news detection solutions can be divided into three main
categories: content-based, social context-based, and knowledge-based
approaches. In this paper, we propose a novel hybrid fake news detection
system that combines linguistic and knowledge-based approaches and inherits
their advantages, by employing two different sets of features: linguistic
features, and a novel set of knowledge-based features called fact-verification
features. The latter comprise three types of information, namely the reputation
of the website where the news is published; coverage, i.e., the number of
sources that published the news; and fact-check, i.e., the opinion of well-known
fact-checking websites about the news, i.e., true or false. The proposed system
employs only eight features, which is fewer than most state-of-the-art
approaches. Also, the evaluation results on a fake news dataset show that the
proposed system employing both types of features can reach an accuracy of
94.4%, which is better than that obtained from separately employing linguistic
features or fact-verification features.

CHAPTER I

INTRODUCTION
Social media are taking an increasing part in our professional and personal
lives. More and more people tend to search for and consume news via social
media rather than traditional media outlets. It has become common for important
news to be first broadcast on social networks before being released by traditional
media such as television or radio. Due to the massive propagation of news on
social networks, users rarely check the accuracy of the information they share. It
is therefore common to see false and manipulated information circulating on
social media, such as hoaxes, rumors, urban legends, and fake news. Moreover,
it is difficult to stop the spread of fake news once it has already been shared many
times and at large scale. This massive dissemination of false information can
have a serious negative impact on individuals and society. First, fake news can
negatively influence public opinion. Second, fake news changes the way people
interpret and react to real news. For example, some fake news can make people
suspicious and affect their ability to discern real news from fake news. In the
literature, many approaches have been proposed for fake news detection. Early
approaches were mainly based on linguistic-based techniques, which rely on
language usage and its analysis to predict deception. The goal of these
approaches is to look for instances of leakage found in the content of a text at
different levels (i.e., the word, sentence, character, and document levels). These
approaches implement different methods, such as data representation, deep
syntax, sentiment, and semantic analyses. In data representation methods, each
word is considered a single unit, and individual words are aggregated and
analyzed to reveal linguistic cues of deception. In deep syntax methods, sentences
are converted into a set of rewrite rules (i.e., a parse tree) in order to describe the
syntax structure. Semantic analysis determines the truthfulness of authors by
describing the degree of compatibility of personal experience compared to
content derived from a collection of analogous data. Finally, sentiment analysis
focuses on the extraction of opinion, which involves examining written texts
about people’s attitudes, sentiments, and evaluations using analytical techniques.
Recent research has shown that linguistic-based techniques alone are not
sufficient to reach high detection accuracy.

CHAPTER II

PROBLEM DEFINITION
The extensive spread of fake news has the potential for extremely negative
impacts on individuals and society. Therefore, fake news detection on social
media has recently become an emerging research topic that is attracting
tremendous attention. Fake news detection on social media presents unique
characteristics and challenges that make existing detection algorithms from
traditional news media ineffective or not applicable. First, fake news is
intentionally written to mislead readers into believing false information, which
makes it difficult and nontrivial to detect based on news content alone; therefore,
we need to include auxiliary information, such as user social engagements on
social media, to help make a determination. Second, exploiting this auxiliary
information is challenging in and of itself, as users’ social engagements with fake
news produce data that is big, incomplete, unstructured, and noisy.

CHAPTER III

SYSTEM STUDY
3.1 FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business
proposal is put forth with a very general plan for the project and some cost
estimates. During system analysis, the feasibility study of the proposed system is
carried out. This is to ensure that the proposed system is not a burden to the
company. For feasibility analysis, some understanding of the major
requirements for the system is essential.

3.2 ECONOMICAL FEASIBILITY

This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can pour into
the research and development of the system is limited, so the expenditures must
be justified. Thus the developed system is well within the budget, and this was
achieved because most of the technologies used are freely available. Only the
customized products had to be purchased.

3.3 TECHNICAL FEASIBILITY


This study is carried out to check the technical feasibility, that is, the
technical requirements of the system. Any system developed must not place a
high demand on the available technical resources, as this would in turn lead to
high demands being placed on the client. The developed system must therefore
have modest requirements, and only minimal or no changes are required for
implementing this system.

3.4 SOCIAL FEASIBILITY

This aspect of the study is to check the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently.
The user must not feel threatened by the system, but must instead accept it as a
necessity. The level of acceptance by the users solely depends on the methods
that are employed to educate the user about the system and make the user familiar
with it. The user's level of confidence must be raised so that the user is also able
to offer some constructive criticism, which is welcomed, as the user is the final
user of the system.

CHAPTER IV

EXISTING SYSTEM
The major challenges that hinder the efficiency of existing fake news
detection solutions are related to the highly versatile nature of deceptive
information. Indeed, it is very difficult to obtain a generalized dataset for fake
news detection. Thus, it is very difficult to extract relevant features that can well
represent and allow detection of fake news in various domains. Some existing
solutions rely on ontologies in order to model fake news domain knowledge,
which can then be used to distinguish fake from real news content. As previously
discussed, the existing fake news detection solutions are linguistic-based,
knowledge-based, or social context-based. Considering the limitations of the
aforementioned categories, it would be a good idea to investigate combining two
different categories in order to overcome their respective limitations.

DISADVANTAGES

 Hosseinimotlagh and Papalexakis investigated the problem of identifying
the different types of fake news with high accuracy.
 Logistic regression is a linear algorithm used for binary classification
problems.
 The objective of this algorithm is to build a training model that is used to
predict the class of the target variable; the decision tree uses this model to
solve the classification problem.

CHAPTER V

PROPOSED SYSTEM
5.1 OVERVIEW

In this paper, we propose a hybrid fake news detection system that takes
advantage of both linguistic-based and knowledge-based approaches. To the best
of our knowledge, our work is the first to propose this hybridization in the
context of fake news detection. Some metaheuristic algorithms have also been
proposed to deal with the fake news detection issue. The proposed fake news
detection system consists of two phases, namely training and testing. Both phases
include a preprocessing task, which consists of cleaning and preparing the
training and testing datasets of real and fake news. In the training phase, the
feature extraction task extracts a set of relevant features from the training dataset,
which are then fed to several machine learning algorithms to build a fake news
detection model. In the testing phase, the detection model is applied to test data
to decide whether the provided news articles are real or fake. The overall
architecture of the proposed fake news detection system is presented in
Section 7.1.

5.2 ADVANTAGES

 A hybrid linguistic and knowledge-based fake news detection system that
combines linguistic features and a novel set of knowledge-based features,
called fact-verification features.
 The proposed system employs only eight features, which is fewer than most
state-of-the-art approaches.
 The evaluation results show that the proposed combination of features
records more than 99% accuracy for fake news detection, and allows an
increase of more than 7% compared to linguistic-based features.

5.3 RANDOM FOREST ALGORITHM

XGBoost is a decision-tree-based ensemble Machine Learning algorithm
that uses a gradient boosting framework. In prediction problems involving
unstructured data (images, text, etc.), artificial neural networks tend to
outperform all other algorithms or frameworks. However, when it comes to
small-to-medium structured/tabular data, decision-tree-based algorithms are
considered best-in-class right now; tree-based algorithms have evolved
considerably over the years.

The XGBoost algorithm was developed as a research project at the University of

Washington. Tianqi Chen and Carlos Guestrin presented their paper at the

SIGKDD Conference in 2016 and took the Machine Learning world by storm.

Since its introduction, this algorithm has not only been credited with winning

numerous Kaggle competitions but also with being the driving force under the

hood of several cutting-edge industry applications. As a result, there is a strong

community of data scientists contributing to the XGBoost open source project,

with ~350 contributors and ~3,600 commits on GitHub.

The algorithm differentiates itself in the following ways:

 A wide range of applications: Can be used to solve regression,


classification, ranking, and user-defined prediction problems.
 Portability: Runs smoothly on Windows, Linux, and OS X.
 Languages: Supports all major programming languages including C++,
Python, R, Java, Scala, and Julia.
 Cloud Integration: Supports AWS, Azure, and Yarn clusters and works
well with Flink, Spark, and other ecosystems.
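The gradient-boosting workflow described above can be sketched in a few lines. Since the xgboost package may not be installed everywhere, this sketch uses scikit-learn's GradientBoostingClassifier as a stand-in; xgboost.XGBClassifier exposes the same fit/predict interface, so the two are interchangeable here. The dataset and parameter values are illustrative assumptions, not the report's actual configuration.

```python
# Sketch: tree-based gradient boosting on tabular data.
# GradientBoostingClassifier stands in for xgboost.XGBClassifier,
# which follows the same scikit-learn fit/predict API.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic tabular data stands in for a real structured dataset.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                   max_depth=3, random_state=42)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)   # fraction of correct predictions
print(f"test accuracy: {accuracy:.2f}")
```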
CHAPTER VI

SYSTEM REQUIREMENTS SPECIFICATION
6.1 HARDWARE REQUIREMENTS

 Processor - Intel i3, i5, i7, AMD Processor
 RAM - above 5 GB
 Hard Disk - above 500 GB

6.2 SOFTWARE REQUIREMENTS

 Operating System - Windows 7/8/10
 Front End - HTML, CSS
 Scripts - Python language
 Tool - Python IDLE

CHAPTER VII

SYSTEM ANALYSIS
7.1 ARCHITECTURE

UML DIAGRAM

ACTIVITY DIAGRAM

7.2 DATA FLOW DIAGRAM

LEVEL 0

LEVEL 1

LEVEL 2

OVERALL DIAGRAM

ER DIAGRAM
CHAPTER VIII

SYSTEM IMPLEMENTATION
8.1 MODULES

1. Data collection
2. Data preprocessing
3. Feature extraction
4. Model creation using Random Forest
5. Hyperparameter Tuning

8.2 MODULE DESCRIPTION

8.2.1 DATA COLLECTION

Data is the prime ingredient of this project, as the data features are
extracted using Natural Language Processing. Using these features of the data,
Machine Learning algorithms are trained and models are created. In this
proposal, we have news with equal proportions of fake and real. The data is saved
in Comma-Separated Values (CSV) format. This dataset is divided for the
training and testing of the algorithms.
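A minimal sketch of this data-collection step, assuming pandas and scikit-learn are available. The inline toy rows stand in for the real CSV file; the file name "news.csv" and the column names "text" and "label" are hypothetical placeholders, not the report's actual schema.

```python
# Loading the news dataset and splitting it for training/testing.
# Toy rows stand in for the real CSV; column names are placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "text": ["stocks rally on earnings", "aliens endorse candidate",
             "council approves budget", "miracle cure found in kitchen"],
    "label": [0, 1, 0, 1],   # 0 = real, 1 = fake, in equal proportions
})
# In the project this would instead be: df = pd.read_csv("news.csv")

train_df, test_df = train_test_split(df, test_size=0.25, random_state=1)
print(len(train_df), len(test_df))   # 3 training rows, 1 test row
```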

8.2.2 DATA PREPROCESSING

The six-label classification problem was translated into a binary
classification problem with True and False labels for the proposed scheme. In
addition, only the news headline was used as the input for classification. Thus, in
the preprocessing stage, the labels were first mapped using the above-mentioned
mapping, after which only the label and news statement columns were extracted
from the dataset and saved in .csv format for future use. Following the
preprocessing, we were able to obtain the following three cleaned files: • train.csv
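The six-to-two label mapping described above could look like the following sketch. The six label names used here follow the LIAR dataset convention, which is an assumption, since the report does not list its source labels; the sample statements are invented for illustration.

```python
# Mapping a six-label truthfulness scale down to binary True/False.
# Label names follow the LIAR convention -- an assumption, since the
# report does not name its source labels.
import pandas as pd

label_map = {
    "true": True, "mostly-true": True, "half-true": True,
    "barely-true": False, "false": False, "pants-fire": False,
}

df = pd.DataFrame({
    "statement": ["headline A", "headline B", "headline C"],
    "label": ["true", "pants-fire", "half-true"],
})
df["label"] = df["label"].map(label_map)

# Keep only the two columns used downstream, then persist for reuse.
df[["label", "statement"]].to_csv("train.csv", index=False)
print(df["label"].tolist())
```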

8.2.3 FEATURE EXTRACTION

The following feature extraction method is used to help the machine learning
models gain insights from news headlines: Count Vectorizer. First, the English
stop words were stripped from all of the news headlines using scikit-learn's Count
Vectorizer, and the headlines were then tokenized using spaces and punctuation
marks as the delimiters. After all of the headlines had been tokenized, a sparse
matrix with all of the news headlines as rows and the tokens as columns was
returned. In addition to their morphological use, a number of n-grams were
returned to make the tokens reflect the sense in which they were used.

8.2.4 MODEL CREATION

Logistic Regression, Random Forest Classifier, Naive Bayes, SVM
Classifier, and a voting classifier were the models used for training. The features
extracted from the Count Vectorizer are used to train the models. After that, using
Grid Search CV and 5-fold cross-validation, all of the models were
hyperparameter-tuned over all of the different possible parameters. The aim of
this hyperparameter tuning was to boost the models' f1-score. After the models
were fine-tuned, they were evaluated on a test set, and evaluation metrics for the
models were determined.
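A sketch of this training-and-evaluation step, assuming scikit-learn; synthetic toy data stands in for the real Count Vectorizer features, and the model settings are illustrative defaults rather than the report's tuned values.

```python
# Training the classifiers named above (Logistic Regression, Random
# Forest, Naive Bayes, SVM, and a voting ensemble) and comparing
# f1-scores on a held-out test set.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(random_state=0),
    "nb": GaussianNB(),
    "svm": SVC(),
}
# Hard-voting ensemble over the four base models above.
models["voting"] = VotingClassifier(
    estimators=[(name, m) for name, m in models.items()])

scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = f1_score(y_te, model.predict(X_te))
print(scores)
```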

8.2.5 HYPERPARAMETER TUNING

Parameters that define the model architecture are referred to as
hyperparameters, and thus the process of searching for the ideal parameters is
referred to as hyperparameter tuning. We have used Grid Search CV to tune the
parameters of each algorithm. The grid of values for each parameter is given as
input, and Grid Search CV will methodically build and evaluate a model for each
combination of algorithm parameters specified in the grid. The model with the
best parameter values is given as output.
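The Grid Search CV procedure described above, as a sketch: the grid values, estimator, and data are illustrative assumptions, not the report's actual grid.

```python
# Grid Search CV sketch: build and evaluate a model for every
# combination in the parameter grid, scoring f1 with 5-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1", cv=5)
search.fit(X, y)
print(search.best_params_)   # the combination with the best mean f1
```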

CHAPTER IX

SOFTWARE ENVIRONMENT
INTRODUCTION TO PYTHON
Python is a high-level object-oriented programming language that was created
by Guido van Rossum. It is also called a general-purpose programming
language, as it is used in almost every domain we can think of, such as those
mentioned below:

 Web Development
 Software Development
 Game Development
 AI & ML
 Data Analytics

This list can go on, but why is Python so popular? Let's see in
the next topic.

WHY PYTHON PROGRAMMING?


You might have a question in mind: why Python? Why not another
programming language?

So let me explain:

Every programming language serves some purpose or use-case according to a
domain. For example, JavaScript is the most popular language amongst web
developers, as it gives the developer the power to build applications via different
frameworks like React, Vue, and Angular, which are used to build beautiful user
interfaces. Every language has its pros and cons. Python, being general-purpose,
is widely used in every domain; the reason is that it is very simple to understand
and scalable, because of which the speed of development is fast. Besides that,
learning Python doesn't require any programming background, which is why it
is popular amongst developers as well. Python has a simpler syntax, similar to
the English language, and the syntax allows developers to write programs with
fewer lines of code. Since it is open-source, there are many libraries available
that make developers' jobs easier, which ultimately results in high productivity.
Developers can easily focus on business logic, and Python is a demanding skill
in the digital era, where information is available in large data sets.

HOW DO WE GET STARTED?

Now, in the era of the digital world, there is a lot of information available
on the internet that might confuse us, believe me. What we can do is follow the
documentation, which is a good starting point. Once we are familiar with the
concepts and terminology, we can dive deeper.

Following are references where we can start our journey:

Official Website: https://www.python.org/

Udemy Course: https://www.udemy.com/course/python-the-complete-python-


developer-course/

YouTube: https://www.youtube.com/watch?v=_uQrJ0TkZlc

CodeAcademy: https://www.codecademy.com/catalog/language/python

I hope you are now excited to get started. You might be wondering where
you can start coding; there are a lot of options available in the market. We can
use any IDE we are comfortable with, but for those who are new to the
programming world, I am listing some IDEs for Python below:

1) Visual Studio: https://visualstudio.microsoft.com/

2) PyCharm: https://www.jetbrains.com/pycharm/

3) Spyder: https://www.spyder-ide.org/

4) Atom: https://atom.io/

5) Google Colab: https://research.google.com/colaboratory/

Real-World Examples:
1) NASA (National Aeronautics and Space Administration): One of NASA's Shuttle
Support Contractors, United Space Alliance, developed a Workflow Automation
System (WAS) which is fast. An internal resource within the critical project stated
that:

“Python allows us to tackle the complexity of programs like the WAS without
getting bogged down in the language”.

NASA also published a website (https://code.nasa.gov/) where there are 400 open
source projects which use Python.

2) Netflix: There are various projects in Netflix which use Python, as follows:

 Central Alert Gateway


 Chaos Gorilla
 Security Monkey
 Chronos

Amongst all these projects, Regional Failover stands out: the system
decreased outage time from 45 minutes to 7 minutes with no additional cost.

3) Instagram: Instagram also uses Python extensively. They have built a photo-
sharing social platform using Django, which is a web framework for Python. Also,
they were able to successfully upgrade their framework without any technical
challenges.

Applications of Python Programming:


1) Web Development: Python offers different frameworks for web development,
like Django, Pyramid, and Flask. These frameworks are known for security,
flexibility, and scalability.

2) Game Development: PySoy and PyGame are two Python libraries that are
used for game development.

3) Artificial Intelligence and Machine Learning: There are a large number of
open-source libraries which can be used while developing AI/ML applications.

4) Desktop GUI: Python offers many toolkits and frameworks with which we
can build desktop applications. PyQt, PyGTK, and PyGUI are some of the
GUI frameworks.

How to Become a Better Programmer:

The last but most important thing in getting better at whatever programming you
choose is practice. Practical knowledge is only acquired by playing with things,
so you will get more exposure to real-world scenarios. Consistency is more
important than anything: if you practice for some days and then do nothing, it
will be difficult to start again and practice consistently. So I request you to learn
by doing projects; it will help you understand how things get done, and the
important thing is to have fun at the same time.

Approach to be followed to master Python:
“Beginning is the end and end is the beginning”. I know what you are thinking
about. It is basically a famous quote from a web series named “Dark”. Now, how
does it relate to Python programming?

If you research on Google, YouTube, or any development communities out there,
you will find people explaining how you can master programming in, let's say,
some “x” number of days, and the like.

Well, the reality is like the symbol of infinity. In the programming realm, there
is no such thing as mastery; it's simply a trial-and-error process. For example,
yesterday I was writing some code where I was trying to print the value of a
variable before declaring it inside a function. There I saw a new error named
“UnboundLocalError”.

So the important thing to keep in mind is that programming is a surprising realm.
Throughout your entire career, you will be seeing new errors and exceptions. Just
remember the quote – “Practice makes a man perfect”.
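The error described above can be reproduced in a few lines (the exception's actual name is UnboundLocalError); the function and variable names here are made up for illustration:

```python
# Reading a local variable before its assignment inside a function
# raises UnboundLocalError: the assignment below makes 'count' local
# to demo(), so the print cannot fall back to any global value.
def demo():
    print(count)   # looked up before the local assignment below
    count = 1

try:
    demo()
except UnboundLocalError as exc:
    print(type(exc).__name__)   # UnboundLocalError
```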

Now here is the main part. What approach to follow in order to master Python
Programming?

Well here it is:

Step-1: Start with a “Hello World” Program


If you have learned some programming languages before, then I am sure you are
aware of what I am talking about. The “Hello World” program is like a tradition
in the developer community. If you want to master any programming language,
this should be the very first line of code we write.

Simple Hello World Program in Python:

print("Hello World")

Step-2: Start learning about variables


Now, once we have mastered the “Hello World” program in Python, the next step
is to master variables in Python. Variables are like containers that are used to store
values.

Variables in Python:

my_var = 100

As you can see here, we have created a variable named “my_var” and assigned
the value 100 to it.

Step-3: Start learning about Data Types and Data Structures


The next step is to learn about data types. Here I have seen that there is a lot
of confusion between data types and data structures. The important thing to keep
in mind here is that data types represent the type of data. For example, in Python
we have things like int, string, and float. These are called data types as they
indicate the type of data we are dealing with.

While data structures are responsible for deciding how to store this data in a
computer’s memory.

String data type in Python:

my_str = "ABCD"

As you can see here, we have assigned a value “ABCD” to a variable my_str.
This is basically a string data type in Python.

Data Structure in Python:

my_dict={1:100,2:200,3:300}

This is known as a dictionary data structure in Python.

Again this is just the tip of the iceberg. There are lots of data types and data
structures in Python. To give a basic idea about data structures in Python, here is
the complete list:

1. Lists

2. Dictionary

3. Sets

4. Tuples

5. Frozenset
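A quick sketch showing one literal (or builtin) form of each structure in the list above:

```python
# One instance of each built-in data structure listed above.
my_list = [1, 2, 3]                     # list: ordered, mutable
my_dict = {1: 100, 2: 200, 3: 300}      # dictionary: key-value pairs
my_set = {1, 2, 3}                      # set: unique elements, unordered
my_tuple = (1, 2, 3)                    # tuple: ordered, immutable
my_frozenset = frozenset({1, 2, 3})     # frozenset: immutable set

for value in (my_list, my_dict, my_set, my_tuple, my_frozenset):
    print(type(value).__name__)
```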

Step-4: Start learning about conditionals and loops


In any programming language, conditionals and loops are considered part of the
backbone.

Python is no exception. These are among the most important concepts
that we need to master.

IF-ELIF-ELSE conditionals:

x = 5

if x < 10:
    print("x is less than 10")
elif x > 10:
    print("x is greater than 10")
else:
    print("Do nothing")

As you can see in the above example, we have created what is known as the if-
elif-else ladder. (The value x = 5 is just an example so the snippet runs.)

For loop:

for i in "Python":
    print(i)

The above code is basically an example of a for loop in Python.

PRO Tip:
Once you start programming with Python, you will see that if we miss any
white space in Python, then Python will start giving errors. This is known as
indentation in Python. Python is very strict about indentation. Python was
created with the mindset of helping everyone become a neat programmer. This
indentation scheme was introduced in one of Python's early PEPs (Python
Enhancement Proposals).

THE PYTHON STANDARD LIBRARY

While The Python Language Reference describes the exact syntax and
semantics of the Python language, this library reference manual describes the
standard library that is distributed with Python. It also describes some of the
optional components that are commonly included in Python distributions.

Python’s standard library is very extensive, offering a wide range of


facilities as indicated by the long table of contents listed below. The library
contains built-in modules (written in C) that provide access to system
functionality such as file I/O that would otherwise be inaccessible to Python
programmers, as well as modules written in Python that provide standardized

solutions for many problems that occur in everyday programming. Some of these
modules are explicitly designed to encourage and enhance the portability of
Python programs by abstracting away platform-specifics into platform-neutral
APIs.

The Python installers for the Windows platform usually include the entire
standard library and often also include many additional components. For Unix-
like operating systems Python is normally provided as a collection of packages,
so it may be necessary to use the packaging tools provided with the operating
system to obtain some or all of the optional components.

In addition to the standard library, there is a growing collection of several


thousand components (from individual programs and modules to packages and
entire application development frameworks), available from the Python Package
Index.

What Is a Python Package?


To understand Python packages, we’ll briefly look at scripts and modules. A
“script” is something you execute in the shell to accomplish a defined task. To
write a script, you’d type your code into your favorite text editor and save it
with the .py extension. You can then use the python command in a terminal to
execute your script.
A module on the other hand is a Python program that you import, either
in interactive mode or into your other programs. “Module” is really an umbrella
term for reusable code.
A Python package usually consists of several modules. Physically, a package is
a folder containing modules and maybe other folders that themselves may
contain more folders and modules. Conceptually, it’s a namespace. This simply
means that a package’s modules are bound together by a package name, by
which they may be referenced.

Circling back to our earlier definition of a module as reusable, importable code,
we note that every package is a module — but not every module is a package.
A package folder usually contains one file named __init__.py that basically tells
Python: “Hey, this directory is a package!” The init file may be empty, or it may
contain code to be executed upon package initialization.
You’ve probably come across the term “library” as well. For Python, a library
isn’t as clearly defined as a package or a module, but a good rule of thumb is
that whenever a package has been published, it may be referred to as a library.
HOW TO USE A PYTHON PACKAGE
We’ve mentioned namespaces, publishing packages and importing modules. If
any of these terms or concepts aren’t entirely clear to you, we’ve got you! In
this section, we’ll cover everything you’ll need to really grasp the pipeline of
using Python packages in your code.
Importing a Python Package
We'll import a package using the import statement. Let's assume that we haven't
yet installed any packages. Python comes with a big collection of pre-installed
packages known as the Python Standard Library. It includes tools for a range of
use cases, such as text processing and doing math. Let's import the latter:

import math

You might think of an import statement as a search trigger for a module.

Searches are strictly organized: first, Python looks for a module in the cache,
then in the standard library, and finally in a list of paths. This list may be accessed
after importing sys (another standard library module).

The sys.path command returns all the directories in which Python will try to
find a package. It may happen that you've downloaded a package, but when you
try importing it, you get an error. In such cases, check whether your imported
package has been placed in one of Python's search paths. If it hasn't, you can
always expand your list of search paths.

At that point, the interpreter will have one more location to look in for
packages after receiving an import statement.
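The search-path inspection and extension described above, as a sketch; the appended directory is a hypothetical placeholder, not a real package location:

```python
# Inspecting and extending Python's module search path.
import sys

print(len(sys.path))                  # directories Python searches for modules
sys.path.append("/tmp/my_packages")   # hypothetical extra search path
print(sys.path[-1])                   # the newly appended directory
```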
Namespaces and Aliasing
When we imported the math module, we initialized the math namespace.
This means that we can now refer to functions and classes from the math module
by way of “dot notation”:

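For example:

```python
import math

# Dot notation: <module>.<name>
print(math.factorial(5))  # 120
```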
Assume that we were only interested in our math module’s factorial function,
and that we’re also tired of using dot notation. In that case, we can proceed as
follows:

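That is, we import the name directly into our own namespace:

```python
from math import factorial

# factorial is now bound directly in our namespace; no dot notation needed.
print(factorial(4))  # 24
```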
If you’d like to import multiple resources from the same source, you can simply
comma-separate them in the import statement:

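For instance:

```python
# Import several names from the same module in one statement.
from math import factorial, log

print(factorial(3))  # 6
print(log(1))        # 0.0
```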
There is, however, always a small risk that your variables will clash with other
variables in your namespace. What if one of the variables in your code was
named log, too? It would overwrite the log function, causing bugs. To avoid
that, it’s better to import the package as we did before. If you want to save typing
time, you can alias your package to give it a shorter name:

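A sketch of aliasing:

```python
import math as m  # alias the module to a shorter name

print(m.sqrt(25))  # 5.0
```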
Aliasing is a pretty common technique. Some packages have commonly used


aliases: For instance, the numerical computation library NumPy is almost
always imported as “np.”
Another option is to import all a module’s resources into your namespace:

However, this method poses a serious risk, since you usually don't know all the
names contained in a package, increasing the likelihood of your variables being
overwritten. It's for this reason that most seasoned Python programmers will
discourage use of the wildcard * in imports. Also, as the Zen of Python states,
“namespaces are one honking great idea!”
How to Install a Python Package
How about packages that are not part of the standard library? The official
repository for finding and downloading such third-party packages is the Python
Package Index, usually referred to simply as PyPI. To install packages from
PyPI, use the package installer pip:

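A sketch of the command (the package name requests is only an example):

```shell
pip install requests

# Or, more explicitly, tie pip to a specific interpreter:
python -m pip install requests
```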
pip can install Python packages from any source, not just PyPI. If you installed
Python using Anaconda or Miniconda, you can also use the conda command to
install Python packages.

While conda is very easy to use, it’s not as versatile as pip. So if you cannot
install a package using conda, you can always try pip instead.
Reloading a Module
If you're programming in interactive mode and you change a module's script,
these changes won't be imported, even if you issue another import statement. In
such cases, you'll want to use the reload() function from the importlib library:

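A minimal sketch, using the standard library's math module as the module being reloaded:

```python
import importlib
import math

# Re-execute the module's code so that changes to its script take effect.
math = importlib.reload(math)
print(math.pi)
```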
How to Create Your Own Python Package


Packaging your code for further use doesn’t necessarily mean you’ll want it
published to PyPI. Maybe you just want to share it with a friend, or reuse it
yourself. Whatever your aim, there are several files that you should include in
your project. We’ve already mentioned the __init__.py file.
Another important file is setup.py. Using the setuptools package, this file
provides detailed information about your project and lists all dependencies —
packages required by your code to run properly.
Publishing to PyPI is beyond the scope of this introductory tutorial. But if you
do have a package for distribution, your project should include two more files:
a README.md written in Markdown, and a license. Check out the official
Python Packaging User Guide (PyPUG) if you want to know more.
INSTALLING PACKAGES

This section covers the basics of how to install Python packages.

It’s important to note that the term “package” in this context is being used to
describe a bundle of software to be installed (i.e. as a synonym for a distribution).
It does not refer to the kind of package that you import in your Python source
code (i.e. a container of modules). It is common in the Python community to refer
to a distribution using the term “package”. Using the term “distribution” is often
not preferred, because it can easily be confused with a Linux distribution, or
another larger software distribution like Python itself.

Requirements for Installing Packages

This section describes the steps to follow before installing other Python packages.

Ensure you can run Python from the command line

Before you go any further, make sure you have Python and that the expected
version is available from your command line. You can check this by running:

Unix/macOS

python3 --version

Windows

py --version
You should get some output like Python 3.6.3. If you do not have Python, please
install the latest 3.x version from python.org or refer to the Installing
Python section of the Hitchhiker’s Guide to Python.

Note

If you’re a newcomer and you get an error like this:

>>> python --version
Traceback (most recent call last):

File "<stdin>", line 1, in <module>

NameError: name 'python' is not defined

It’s because this command and other suggested commands in this tutorial are
intended to be run in a shell (also called a terminal or console). See the Python
for Beginners getting started tutorial for an introduction to using your operating
system’s shell and interacting with Python.
Note

If you’re using an enhanced shell like IPython or the Jupyter notebook, you can
run system commands like those in this tutorial by prefacing them with
a ! character:

In [1]: import sys


!{sys.executable} --version

Python 3.6.3

It’s recommended to write {sys.executable} rather than plain python in order to


ensure that commands are run in the Python installation matching the currently
running notebook (which may not be the same Python installation that
the python command refers to).
Note

Due to the way most Linux distributions are handling the Python 3 migration,
Linux users using the system Python without creating a virtual environment first
should replace the python command in this tutorial with python3 and
the python -m pip command with python3 -m pip --user. Do not run any of the
commands in this tutorial with sudo: if you get a permissions error, come back to
the section on creating virtual environments, set one up, and then continue with
the tutorial as written.
Ensure you can run pip from the command line

Additionally, you’ll need to make sure you have pip available. You can check
this by running:

Unix/macOS

python3 -m pip --version

Windows

py -m pip --version
If you installed Python from source, with an installer from python.org, or


via Homebrew you should already have pip. If you’re on Linux and installed
using your OS package manager, you may have to install pip separately,
see Installing pip/setuptools/wheel with Linux Package Managers.

If pip isn’t already installed, then first try to bootstrap it from the standard library:

Unix/macOS

python3 -m ensurepip --default-pip

Windows

py -m ensurepip --default-pip
If that still doesn’t allow you to run python -m pip:

 Securely Download get-pip.py

 Run python get-pip.py. This will install or upgrade pip. Additionally, it
will install setuptools and wheel if they're not installed already.

Warning

Be cautious if you’re using a Python install that’s managed by your


operating system or another package manager. get-pip.py does not
coordinate with those tools, and may leave your system in an inconsistent
state. You can use python get-pip.py --prefix=/usr/local/ to install
in /usr/local which is designed for locally-installed software.
Ensure pip, setuptools, and wheel are up to date

While pip alone is sufficient to install from pre-built binary archives, up to date
copies of the setuptools and wheel projects are useful to ensure you can also
install from source archives:

Unix/macOS

python3 -m pip install --upgrade pip setuptools wheel

Windows

py -m pip install --upgrade pip setuptools wheel
Optionally, create a virtual environment

See section below for details, but here's the basic venv command to use on a
typical Linux system:

Unix/macOS

python3 -m venv tutorial_env


source tutorial_env/bin/activate

Windows

py -m venv tutorial_env
tutorial_env\Scripts\activate
This will create a new virtual environment in the tutorial_env subdirectory, and
configure the current shell to use it as the default python environment.

Creating Virtual Environments

Python “Virtual Environments” allow Python packages to be installed in an


isolated location for a particular application, rather than being installed globally.
If you are looking to safely install global command line tools, see Installing stand
alone command line tools.

Imagine you have an application that needs version 1 of LibFoo, but another
application requires version 2. How can you use both these applications? If you
install everything into /usr/lib/python3.6/site-packages (or whatever your
platform’s standard location is), it’s easy to end up in a situation where you
unintentionally upgrade an application that shouldn’t be upgraded.

Or more generally, what if you want to install an application and leave it be? If
an application works, any change in its libraries or the versions of those libraries
can break the application.

Also, what if you can’t install packages into the global site-packages directory?
For instance, on a shared host.

In all these cases, virtual environments can help you. They have their own
installation directories and they don’t share libraries with other virtual
environments.

Currently, there are two common tools for creating Python virtual environments:

 venv is available by default in Python 3.3 and later, and


installs pip and setuptools into created virtual environments in Python 3.4
and later.
 virtualenv needs to be installed separately, but supports Python 2.7+ and
Python 3.3+, and pip, setuptools and wheel are always installed into
created virtual environments by default (regardless of Python version).

The basic usage is like so:

Using venv:

Unix/macOS

python3 -m venv <DIR>


source <DIR>/bin/activate

Windows

py -m venv <DIR>
<DIR>\Scripts\activate
Using virtualenv:

Unix/macOS

python3 -m virtualenv <DIR>
source <DIR>/bin/activate

Windows

py -m virtualenv <DIR>
<DIR>\Scripts\activate
For more information, see the venv docs or the virtualenv docs.

The use of source under Unix shells ensures that the virtual environment’s
variables are set within the current shell, and not in a subprocess (which then
disappears, having no useful effect).

In both of the above cases, Windows users should not use


the source command, but should rather run the activate script directly from the
command shell like so:

<DIR>\Scripts\activate

Managing multiple virtual environments directly can become tedious, so


the dependency management tutorial introduces a higher level tool, Pipenv, that
automatically manages a separate virtual environment for each project and
application that you work on.

Use pip for Installing

pip is the recommended installer. Below, we’ll cover the most common usage
scenarios. For more detail, see the pip docs, which includes a complete Reference
Guide.

Installing from PyPI

The most common usage of pip is to install from the Python Package Index using
a requirement specifier. Generally speaking, a requirement specifier is composed
of a project name followed by an optional version specifier. PEP 440 contains
a full specification of the currently supported specifiers. Below are some
examples.

To install the latest version of “SomeProject”:

Unix/macOS

python3 -m pip install "SomeProject"

Windows

To install a specific version:

Unix/macOS

python3 -m pip install "SomeProject==1.4"

Windows

To install greater than or equal to one version and less than another:

Unix/macOS

python3 -m pip install "SomeProject>=1,<2"

Windows

To install a version that’s “compatible” with a certain version:

Unix/macOS

python3 -m pip install "SomeProject~=1.4.2"

Windows

In this case, this means to install any version matching “==1.4.*” that is also
“>=1.4.2”.

Source Distributions vs Wheels

pip can install from either Source Distributions (sdist) or Wheels, but if both are
present on PyPI, pip will prefer a compatible wheel. You can override pip's
default behavior by using its --no-binary option.

Wheels are a pre-built distribution format that provides faster installation


compared to Source Distributions (sdist), especially when a project contains
compiled extensions.

If pip does not find a wheel to install, it will locally build a wheel and cache it for
future installs, instead of rebuilding the source distribution in the future.

Upgrading packages

Upgrade an already installed SomeProject to the latest from PyPI.

Unix/macOS

python3 -m pip install --upgrade SomeProject

Windows

Installing to the User Site

To install packages that are isolated to the current user, use the --user flag:

Unix/macOS

python3 -m pip install --user SomeProject

Windows

For more information see the User Installs section from the pip docs.

Note that the --user flag has no effect when inside a virtual environment - all
installation commands will affect the virtual environment.

If SomeProject defines any command-line scripts or console entry points, --


user will cause them to be installed inside the user base’s binary directory, which
may or may not already be present in your shell’s PATH. (Starting in version 10,
pip displays a warning when installing any scripts to a directory outside PATH.)
If the scripts are not available in your shell after installation, you’ll need to add
the directory to your PATH:

 On Linux and macOS you can find the user base binary directory by
running python -m site --user-base and adding bin to the end. For
example, this will typically print ~/.local (with ~ expanded to the absolute
path to your home directory) so you’ll need to add ~/.local/bin to
your PATH. You can set your PATH permanently by modifying ~/.profile.
 On Windows you can find the user base binary directory by running py -
m site --user-site and replacing site-packages with Scripts. For example,
this could return C:\Users\Username\AppData\Roaming\Python36\site-
packages so you would need to set your PATH to
include C:\Users\Username\AppData\Roaming\Python36\Scripts. You
can set your user PATH permanently in the Control Panel. You may need
to log out for the PATH changes to take effect.
Requirements files

Install a list of requirements specified in a Requirements File.

Unix/macOS

python3 -m pip install -r requirements.txt

Windows

Installing from VCS

Install a project from VCS in “editable” mode. For a full breakdown of the syntax,
see pip’s section on VCS Support.

Unix/macOS

python3 -m pip install -e git+https://git.repo/some_pkg.git#egg=SomeProject


# from git

python3 -m pip install -e hg+https://hg.repo/some_pkg#egg=SomeProject
# from mercurial
python3 -m pip install -e svn+svn://svn.repo/some_pkg/trunk/#egg=SomeProject
# from svn
python3 -m pip install -e
git+https://git.repo/some_pkg.git@feature#egg=SomeProject # from a branch

Windows

Installing from other Indexes

Install from an alternate index

Unix/macOS

python3 -m pip install --index-url http://my.package.repo/simple/ SomeProject

Windows

Search an additional index during install, in addition to PyPI

Unix/macOS

python3 -m pip install --extra-index-url http://my.package.repo/simple


SomeProject

Windows

Installing from a local src tree

Installing from local src in Development Mode, i.e. in such a way that the project
appears to be installed but is still editable from the src tree.

Unix/macOS

python3 -m pip install -e <path>

Windows

You can also install normally from src

Unix/macOS

python3 -m pip install <path>

Windows

Installing from local archives

Install a particular source archive file.

Unix/macOS

python3 -m pip install ./downloads/SomeProject-1.0.4.tar.gz

Windows

Install from a local directory containing archives (and don’t check PyPI)

Unix/macOS

python3 -m pip install --no-index --find-links=file:///local/dir/ SomeProject


python3 -m pip install --no-index --find-links=/local/dir/ SomeProject
python3 -m pip install --no-index --find-links=relative/dir/ SomeProject

Windows

Installing from other sources

To install from other data sources (for example Amazon S3 storage) you can
create a helper application that presents the data in a PEP 503 compliant index
format, and use the --extra-index-url flag to direct pip to use that index.

./s3helper --port=7777
python -m pip install --extra-index-url http://localhost:7777 SomeProject

Installing Prereleases

Find pre-release and development versions, in addition to stable versions. By


default, pip only finds stable versions.

Unix/macOS

python3 -m pip install --pre SomeProject

Windows

Installing Setuptools “Extras”

Install setuptools extras.

Unix/macOS

python3 -m pip install SomePackage[PDF]


python3 -m pip install SomePackage[PDF]==3.0
python3 -m pip install -e .[PDF] # editable project in current directory

CHAPTER X

SYSTEM TESTING
UNIT TESTING

Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is
the testing of individual software units of the application. It is done after the
completion of an individual unit, before integration. This is structural testing
that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at the component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path
of a business process performs accurately to the documented specifications and
contains clearly defined inputs and expected results.
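As an illustration, here is a minimal unit test for a hypothetical helper that maps a model score to a label (the function, its threshold, and the values are assumptions for the example, not part of the project code):

```python
import unittest

def label_prediction(score, threshold=0.5):
    """Hypothetical unit under test: map a model score to a news label."""
    return 'REAL' if score >= threshold else 'FAKE'

class TestLabelPrediction(unittest.TestCase):
    def test_real(self):
        # A score at or above the threshold should be labelled REAL.
        self.assertEqual(label_prediction(0.9), 'REAL')

    def test_fake(self):
        # A score below the threshold should be labelled FAKE.
        self.assertEqual(label_prediction(0.1), 'FAKE')

# Run with: python -m unittest <test_file>
```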

INTEGRATION TESTING

Integration tests are designed to test integrated software components to
determine if they actually run as one program. Testing is event driven and is
more concerned with the basic outcome of screens or fields. Integration tests
demonstrate that although the components were individually satisfactory, as
shown by successful unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.

FUNCTIONAL TEST

Functional tests provide systematic demonstrations that functions tested are


available as specified by the business and technical requirements, system
documentation, and user manuals.

Functional testing is centered on the following items:

Valid Input : identified classes of valid input must be accepted.

Invalid Input : identified classes of invalid input must be rejected.

Functions : identified functions must be exercised.

Output : identified classes of application outputs must be exercised.

Systems/Procedures: interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements,
key functions, or special test cases. In addition, systematic coverage pertaining to
identifying business process flows, data fields, predefined processes, and successive
processes must be considered for testing. Before functional testing is complete,
additional tests are identified and the effective value of current tests is
determined.

SYSTEM TEST

System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An
example of system testing is the configuration oriented system integration test.
System testing is based on process descriptions and flows, emphasizing pre-
driven process links and integration points.

WHITE BOX TESTING

White Box Testing is testing in which the software tester has knowledge of
the inner workings, structure and language of the software, or at least its
purpose. It is used to test areas that cannot be reached from a black box level.

BLACK BOX TESTING

Black Box Testing is testing the software without any knowledge of the
inner workings, structure or language of the module being tested. Black box tests,
as most other kinds of tests, must be written from a definitive source document,
such as a specification or requirements document. It is testing in which the
software under test is treated as a black box: you cannot “see” into it. The test
provides inputs and responds to outputs without considering how the software
works.

CHAPTER XI

APPENDIX
11.1 CODING
import numpy as np
import pandas as pd
import os

# List every input file available to the notebook
for dirname, _, filenames in os.walk('input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

fake = pd.read_csv("D:/Dowload D/Fake_News_Detection-master/jupyter code/Fake.csv")
true = pd.read_csv("D:/Dowload D/Fake_News_Detection-master/jupyter code/True.csv")
fake.head()
fake.info(), true.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480

Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 23481 non-null object
1 text 23481 non-null object
2 subject 23481 non-null object
3 date 23481 non-null object
dtypes: object(4)
memory usage: 733.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 21417 non-null object
1 text 21417 non-null object
2 subject 21417 non-null object
3 date 21417 non-null object
dtypes: object(4)
memory usage: 669.4+ KB
(None, None)
# There is no null value in the datasets
fake_df=fake[['text']]
true_df=true[['text']]
fake_df['label']=0
true_df['label']=1
C:\Users\Admin\AppData\Local\Temp\ipykernel_6284\1271111993.py:3:
SettingWithCopyWarning:

A value is trying to be set on a copy of a slice from a DataFrame.

Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-


docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

fake_df['label']=0

C:\Users\Admin\AppData\Local\Temp\ipykernel_6284\1271111993.py:4:
SettingWithCopyWarning:

A value is trying to be set on a copy of a slice from a DataFrame.

Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-


docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

true_df['label']=1
# All we need are just text and label (! I added the label myself )
data=pd.concat([fake_df,true_df], axis=0)
data=data.sample(frac=0.7)
data.tail(7)
# Adding and randomizing process of two datasets
df=data.reset_index()
df.drop(['index'], axis=1, inplace=True)
df.head()
X=df['text']
y=df['label']
x_train, x_test, y_train, y_test= train_test_split(X, y, test_size=0.2,
random_state=7)
tfid_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
train_tfid=tfid_vectorizer.fit_transform(x_train)
test_tfid=tfid_vectorizer.transform(x_test)
PA_classifier=PassiveAggressiveClassifier(max_iter=50)
PA_classifier.fit(train_tfid, y_train)
pa_predict=PA_classifier.predict(test_tfid)
print("Classification Report:", classification_report(y_test, pa_predict))
print("Accuracy Score:", accuracy_score(y_test, pa_predict))
con_pa=confusion_matrix(y_test, pa_predict)
con_mat=ConfusionMatrixDisplay(con_pa, display_labels=['Fake','True'])
con_mat.plot()
plt.show()
Classification Report: precision recall f1-score support

0 0.99 0.99 0.99 3290


1 0.99 0.99 0.99 2996

accuracy 0.99 6286


macro avg 0.99 0.99 0.99 6286
weighted avg 0.99 0.99 0.99 6286

Accuracy Score: 0.9925230671333122

Wow! Look at this: 99.25% accuracy.


linear_cl=LogisticRegression()
linear_cl.fit(train_tfid,y_train)
linear_predict=linear_cl.predict(test_tfid)
print("Classification Report:", classification_report(y_test,
linear_predict))
print("Accuracy Score:", accuracy_score(y_test, linear_predict))

con_pa=confusion_matrix(y_test, linear_predict)
con_mat=ConfusionMatrixDisplay(con_pa, display_labels=['Fake','True'])
con_mat.plot()
plt.show()
Classification Report: precision recall f1-score support

0 0.98 0.98 0.98 3290


1 0.98 0.98 0.98 2996

accuracy 0.98 6286


macro avg 0.98 0.98 0.98 6286
weighted avg 0.98 0.98 0.98 6286

Accuracy Score: 0.9804327076041998

Logistic Regression is working with 98.04% accuracy
tree_model=DecisionTreeClassifier()
tree_model.fit(train_tfid, y_train)
tree_predict=tree_model.predict(test_tfid)
print("Classification Report:", classification_report(y_test, tree_predict))
print("Accuracy Score:", accuracy_score(y_test, tree_predict))

con_pa=confusion_matrix(y_test, tree_predict)
con_mat=ConfusionMatrixDisplay(con_pa, display_labels=['Fake','True'])
con_mat.plot()
plt.show()
Classification Report: precision recall f1-score support

0 0.99 1.00 0.99 3290


1 0.99 0.99 0.99 2996

accuracy 0.99 6286


macro avg 0.99 0.99 0.99 6286
weighted avg 0.99 0.99 0.99 6286

Accuracy Score: 0.9944320712694877

Decision Tree model is working with 99.44% accuracy
xgboost_model=XGBClassifier()
xgboost_model.fit(train_tfid,y_train)
xgb_predict=xgboost_model.predict(test_tfid)
print("Classification Report:", classification_report(y_test, xgb_predict))
print("Accuracy Score:", accuracy_score(y_test, xgb_predict))

con_pa=confusion_matrix(y_test, xgb_predict)
con_mat=ConfusionMatrixDisplay(con_pa, display_labels=['Fake','True'])
con_mat.plot()
plt.show()
Classification Report: precision recall f1-score support

0 1.00 1.00 1.00 3290


1 1.00 1.00 1.00 2996

accuracy 1.00 6286
macro avg 1.00 1.00 1.00 6286
weighted avg 1.00 1.00 1.00 6286

Accuracy Score: 0.9968183264397072

import pickle
pickle.dump(xgboost_model,open("modell.pkl","wb"))

# TfidfVectorizer Explanation

Convert a collection of raw documents to a matrix of TF-IDF features

TF-IDF where TF means term frequency, and IDF means Inverse


Document frequency.

from sklearn.feature_extraction.text import TfidfVectorizer

text = ['Hello Diwakar here, fake news detection','Welcome to the Machine
learning' ]

vect = TfidfVectorizer()

vect.fit(text)

## TF counts the frequency of each word in a document, and IDF down-weights words that appear across many documents.

print(vect.idf_)

[1.40546511 1.40546511 1.40546511 1.40546511 1.40546511 1.40546511


1.40546511 1.40546511 1.40546511 1.40546511 1.40546511]

print(vect.vocabulary_)

{'hello': 3, 'diwakar': 1, 'here': 4, 'fake': 2, 'news': 7, 'detection': 0, 'welcome': 10,


'to': 9, 'the': 8, 'machine': 6, 'learning': 5}

### A word that is present in all the documents will have a low IDF value.
Unique words are highlighted by their higher IDF values.

example = text[0]

example

'Hello Diwakar here, fake news detection'

example = vect.transform([example])

print(example.toarray())

[[0.4078241 0.4078241 0.4078241 0. 0.4078241 0.29017021


0.4078241 0.29017021 0. 0. 0. ]]

### Here, a 0 appears at the index of each word that is not present in the
given sentence.

## PassiveAggressiveClassifier

## Let's start the work

import os

os.chdir("D:/Fake News Detection")

import pandas as pd

dataframe = pd.read_csv('news.csv')

dataframe.head()

x = dataframe['text']

y = dataframe['label']

0 Daniel Greenfield, a Shillman Journalism Fello...


1 Google Pinterest Digg Linkedin Reddit Stumbleu...
2 U.S. Secretary of State John F. Kerry said Mon...
3 — Kaydee King (@KaydeeKing) November 9, 2016 T...
4 It's primary day in New York and front-runners...
...
6330 The State Department told the Republican Natio...
6331 The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...
6332 Anti-Trump Protesters Are Tools of the Oligar...
6333 ADDIS ABABA, Ethiopia —President Obama convene...
6334 Jeb Bush Is Suddenly Attacking Trump. Here's W...
Name: text, Length: 6335, dtype: object

0 FAKE
1 FAKE
2 REAL
3 FAKE
4 REAL
...
6330 REAL
6331 FAKE
6332 FAKE
6333 REAL
6334 REAL
Name: label, Length: 6335, dtype: object

from sklearn.model_selection import train_test_split


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

x_train,x_test,y_train,y_test =
train_test_split(x,y,test_size=0.2,random_state=0)
y_train
2402 REAL
1922 REAL
3475 FAKE
6197 REAL
4748 FAKE
...
4931 REAL
3264 REAL
1653 FAKE
2607 FAKE
2732 REAL
Name: label, Length: 5068, dtype: object
tfvect = TfidfVectorizer(stop_words='english',max_df=0.7)
tfid_x_train = tfvect.fit_transform(x_train)
tfid_x_test = tfvect.transform(x_test)
* max_df = 0.50 means "ignore terms that appear in more than 50% of the
documents".
* max_df = 25 means "ignore terms that appear in more than 25 documents".
classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(tfid_x_train,y_train)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
early_stopping=False, fit_intercept=True,
loss='hinge', max_iter=50, n_iter_no_change=5,
n_jobs=None, random_state=None, shuffle=True,
tol=0.001, validation_fraction=0.1, verbose=0,
warm_start=False)
y_pred = classifier.predict(tfid_x_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
Accuracy: 93.69%
cf = confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
print(cf)
[[575 40]
[ 40 612]]
def fake_news_det(news):
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = classifier.predict(vectorized_input_data)
    print(prediction)
fake_news_det('U.S. Secretary of State John F. Kerry said Monday that he will
stop in Paris later this week, amid criticism that no top American officials
attended Sunday’s unity march against terrorism.')
['REAL']
fake_news_det("""Go to Article
President Barack Obama has been campaigning hard for the woman who is
supposedly going to extend his legacy four more years. The only problem with
stumping for Hillary Clinton, however, is she’s not exactly a candidate easy
to get too enthused about. """)
['FAKE']
import pickle
pickle.dump(classifier,open('model.pkl', 'wb'))

# load the model from disk
loaded_model = pickle.load(open('model.pkl', 'rb'))
def fake_news_det1(news):
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = loaded_model.predict(vectorized_input_data)
    print(prediction)
fake_news_det1("""Go to Article
President Barack Obama has been campaigning hard for the woman who is
supposedly going to extend his legacy four more years. The only problem with
stumping for Hillary Clinton, however, is she’s not exactly a candidate easy
to get too enthused about. """)
['FAKE']
fake_news_det1("""U.S. Secretary of State John F. Kerry said Monday that he
will stop in Paris later this week, amid criticism that no top American officials
attended Sunday’s unity march against terrorism.""")
['REAL']
fake_news_det('''U.S. Secretary of State John F. Kerry said Monday that he will
stop in Paris later this week, amid criticism that no top American officials
attended Sunday’s unity march against terrorism.''')
['REAL']

11.2 SCREENSHOTS

CHAPTER XII

CONCLUSION
In this project, we have proposed a hybrid fake news detection system
that employs two types of features: linguistic and fact-verification features. The
proposed detection system employs only eight features, which is fewer than
state-of-the-art approaches. It operates in two phases: training and testing. In
the training phase, the detection system runs four machine learning algorithms,
i.e., Logistic Regression (LR), Random Forest (RF), Additional Trees
Discriminant, and XGBoost, in order to select the best classifier for the testing
phase. Evaluation results on the News data set show that the proposed detection
system achieves an accuracy of 99% under XGBoost. As future work, we aim to
improve the accuracy of our detection system by investigating other
discriminating features, such as visual-based and style-based features. Moreover,
we plan to extend detection to other types of false information, such as
biased/inaccurate news and misleading/ambiguous news.
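The training-phase selection described above can be sketched as follows. This is an illustrative outline only: synthetic data replaces the News data set, and scikit-learn's GradientBoostingClassifier stands in for XGBoost so the example needs no external dependency.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the eight-feature News data set.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

candidates = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(random_state=0),
    "ExtraTrees": ExtraTreesClassifier(random_state=0),
    "GradientBoosting": GradientBoostingClassifier(random_state=0),
}

# Training phase: score each candidate with 5-fold cross-validation,
# then keep the best-performing classifier for the testing phase.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)
best_model = candidates[best_name].fit(X, y)
print(best_name, round(scores[best_name], 3))
```

Selecting by cross-validated accuracy rather than training accuracy guards against choosing a classifier that merely overfits the training split.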

FUTURE ENHANCEMENT
The future scope of this project is that fake news detectors can help to filter
different websites that contain fake news and the motive is to help users such that
they can’t get attracted by click bait. The project can also be used on many social
media platforms where there is a massive amount of fake data which can cause
damage to the society, with some modifications to remove the same. Fake account
creators are constantly adapting their tactics to evade detection. Future work
could focus on developing machine learning models that can adapt to these
evolving tactics and remain effective in identifying fake accounts.

CHAPTER XIII

REFERENCES
[1]. S. Yang, K. Shu, S. Wang, R. Gu, F. Wu and H. Liu, "Unsupervised Fake
News Detection on Social Media: A Generative Approach", Proceedings of the
AAAI Conference on Artificial Intelligence, vol. 33, pp. 5644-5651, 2019.

[2]. M. Hlaing and N. Kham, "Defining News Authenticity on Social Media
Using Machine Learning Approach", 2020 IEEE Conference on Computer
Applications (ICCA), 2020.

[3]. F. Ozbay and B. Alatas, "Fake news detection within online social media
using supervised artificial intelligence algorithms", Physica A: Statistical
Mechanics and its Applications, vol. 540, Art. no. 123174, 2020.

[4]. P. Faustini and T. Covões, "Fake news detection in multiple platforms and
languages", Expert Systems with Applications, vol. 158, Art. no. 113503, 2020.

[5]. K. Yazdi, A. Yazdi, S. Khodayi, J. Hou, W. Zhou and S. Saedy, "Improving
Fake News Detection Using K-means and Support Vector Machine
Approaches", World Academy of Science, Engineering and Technology
International Journal of Electronics and Communication Engineering, vol. 14,
no. 2, 2020.

[6]. Y. Lin, "10 Twitter Statistics Every Marketer Should Know in 2021
[Infographic]", My.oberlo.com, 2021.

[7]. B. Collins, D. T. Hoang, N. T. Nguyen and D. Hwang, "Trends in combating
fake news on social media - a survey", Journal of Information and
Telecommunication, vol. 5, no. 2, pp. 247-266, 2020.

[8]. J. W. Waweru Muigai, "Understanding fake news", International Journal
of Scientific and Research Publications (IJSRP), vol. 9, no. 1, 2019.

[9]. L. Gimenez, "6 steps for data cleaning and why it matters", Geotab, 2020.

[10]. P. S. Reddy, D. Roy, P. Manoj, M. Keerthana and P. V. Tijare, "A Study on
Fake News Detection Using Naive Bayes, SVM, Neural Networks and
LSTM", Journal of Advanced Research in Dynamical & Control Systems, vol. 11,
no. 06, 2019.

