Final Report
SOCIAL MEDIA
A Project report submitted in partial fulfilment for the award of
requirement of the Degree of
Submitted by
ARULMOZHI.I (21CS1108)
GAYATHRI.E (21CS1114)
PAVITHRA.B (21CS1121)
PONDICHERRY, INDIA.
May 2024
PSV COLLEGE OF ARTS AND SCIENCE
PONDICHERRY UNIVERSITY
PUDUCHERRY – 607 402
BONAFIDE CERTIFICATE
ACKNOWLEDGEMENT
We sincerely thank all those who helped us make this project successful.
Thiru. S.SELVAMANI and our secretary Dr. S.VIKNESH M.Tech, Ph.D., for all their
It is a great pleasure to thank our Director, Dr. N.GOBU M.E., MBA, Ph.D., MISTE,
AMIE, MIIW, for his valuable guidance, support and encouragement in our project work.
It gives us great pleasure to convey our deep and sincere thanks to our Principal,
Dr. K.KAMALAKKAN, Ph.D., of PSV College of Arts and Science for having given us
We wish to express our deep sense of gratitude to our Head of the Department
Mr. R.BABU M.C.A., M.E., for his valuable guidance and encouragement throughout this
project work.
We wish to express our gratitude to our Project Guide, Mr. R.BABU M.C.A.,
M.E., Assistant Professor, for his able guidance and useful suggestions, which helped us in
We also thank all our department faculty members and the lab administrator for their
timely guidance in the conduct of our project work and for all their valuable assistance in the
project work.
parents for their blessings my friends/classmates/seniors for their help and wishes for the
CHAPTER NO    TITLE
ABSTRACT
I INTRODUCTION
II PROBLEM DEFINITION
IV EXISTING SYSTEM
5.1 OVERVIEW
5.2 ADVANTAGES
6.2 SOFTWARE REQUIREMENTS
7.1 ARCHITECTURE
8.1 MODULES
XI APPENDIX
11.2 SCREENSHOTS
XIII BIBLIOGRAPHY
ABSTRACT
The rapid development of social media and content-sharing platforms has been
widely exploited to spread misinformation and fake news that lead people to
believe harmful stories, influence public opinion, and can cause panic and
chaos among the population. Thus, fake news detection has become an important
research topic, aiming to flag a specific piece of content as fake or
legitimate. Fake news detection solutions can be divided into three main
categories: content-based, social context-based, and knowledge-based
approaches. In this paper, we propose a novel hybrid fake news detection
system that combines linguistic and knowledge-based approaches and inherits
their advantages by employing two different sets of features: linguistic
features and a novel set of knowledge-based features, called fact-verification
features, that comprise three types of information: the reputation of the
website where the news is published; coverage, i.e., the number of sources that
published the news; and fact-check, i.e., the opinion of well-known
fact-checking websites on whether the news is true or false. The proposed
system employs only eight features, fewer than most state-of-the-art
approaches. In addition, evaluation results on a fake news dataset show that
the proposed system, employing both types of features, can reach an accuracy
of 94.4%, which is better than the accuracy obtained by employing linguistic
features or fact-verification features separately.
CHAPTER I
INTRODUCTION
Social media are taking an increasing part in our professional and personal
lives. More and more people tend to search and consume news via social media
rather than traditional media outlets. It has become common for important news
to be broadcast first on social networks before being released by traditional
media such as television or radio. Given the massive volume of news propagated on
social networks, users rarely check the accuracy of the information they share. It
is therefore common to see false and manipulated information circulating
on social media, such as hoaxes, rumors, urban legends, and fake news. Moreover,
it is difficult to stop the spread of fake news once it has been shared many
times and at large scale. This massive dissemination of false information could
have a serious negative impact on individuals and society. First, fake news could
negatively influence public opinion. Second, fake news changes the way
people interpret and react to real news. For example, some fake news could make
people suspicious and affect their ability to discern real news from fake news. In
the literature, many approaches have been proposed for fake news detection.
Early approaches were mainly based on linguistic-based techniques, which rely
on language usage and its analysis to predict deception. The goal of these
approaches is to look for instances of leakage found in the content of a text at
different levels (i.e., the word, sentence, character, and document levels). These
approaches implement different methods, such as data representation, deep
syntax, sentiment, and semantic analyses. In data representation methods, each
word is considered a single unit, and individual words are aggregated and
analyzed to reveal linguistic cues of deception. In deep syntax methods,
sentences are converted into a set of rewrite rules (i.e., a parse tree) in order to
describe the syntactic structure. Semantic analysis assesses the truthfulness
of authors by measuring the degree of compatibility between a personal experience
and the content derived from a collection of analogous data. Finally,
sentiment analysis focuses on the extraction of opinion, which involves
examining written texts for people’s attitudes, sentiments, and evaluations
using analytical techniques. Recent research has shown that linguistic-based
techniques alone are not sufficient to reach high detection accuracy.
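The data representation idea described above, in which each word is treated as a single unit and the counts are aggregated, can be illustrated with a minimal bag-of-words sketch. The function name bag_of_words and the sample headline are illustrative assumptions and are not part of the proposed system.

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase the text, split it into word tokens, and count each word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

# Hypothetical headline used only for illustration.
counts = bag_of_words("Shocking news: shocking claims spread fast")
print(counts.most_common(2))
```

Word counts of this kind are only the simplest form of data representation; the TF-IDF weighting used in the appendix code builds on the same idea.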
CHAPTER II
PROBLEM DEFINITION
The extensive spread of fake news has the potential for extremely negative
impacts on individuals and society. Therefore, fake news detection on social
media has recently become an emerging research area that is attracting tremendous
attention. Fake news detection on social media presents unique characteristics
and challenges that make existing detection algorithms from traditional news
media ineffective or not applicable. First, fake news is intentionally written to
mislead readers to believe false information, which makes it difficult and
nontrivial to detect based on news content; therefore, we need to include auxiliary
information, such as user social engagements on social media, to help make a
determination. Second, exploiting this auxiliary information is challenging in and
of itself as users’ social engagements with fake news produce data that is big,
incomplete, unstructured, and noisy.
CHAPTER III
SYSTEM STUDY
3.1 FEASIBILITY STUDY
The feasibility of the project is analyzed in this phase, and a business proposal
is put forth with a very general plan for the project and some cost estimates.
During system analysis, the feasibility study of the proposed system is
carried out. This is to ensure that the proposed system is not a burden to the
company. For feasibility analysis, some understanding of the major
requirements for the system is essential.
This study is carried out to check the economic impact that the system will
have on the organization. The amount of funds that the company can invest in the
research and development of the system is limited. The expenditures must be
justified. The developed system was well within the budget, and this was
achieved because most of the technologies used are freely available. Only the
customized products had to be purchased.
3.4 SOCIAL FEASIBILITY
This aspect of the study checks the level of acceptance of the system by the
user. This includes the process of training the user to use the system efficiently.
The user must not feel threatened by the system, but must instead accept it as a
necessity. The level of acceptance by users depends solely on the methods
that are employed to educate them about the system and to make them familiar
with it. Their level of confidence must be raised so that they are also able to offer
constructive criticism, which is welcomed, as they are the final users of the system.
CHAPTER IV
EXISTING SYSTEM
The major challenges that hinder the efficiency of the existing fake news
detection solutions are related to the highly versatile nature of deceptive
information. Indeed, it is very difficult to obtain a generalized dataset for fake
news detection. Thus, it is very difficult to extract relevant features that can
represent fake news well and allow it to be detected in various domains. Some existing
solutions rely on ontologies in order to model fake news domain knowledge,
which can be then used to distinguish fake from real news content. As previously
discussed, the existing fake news detection solutions are linguistic-based,
knowledge-based, or social context-based. Considering the limitations of the
aforementioned categories, it would be a good idea to investigate combining two
different categories in order to overcome their respective limitations.
DISADVANTAGES
CHAPTER V
PROPOSED SYSTEM
5.1 OVERVIEW
In this paper, we propose a hybrid fake news detection system that takes
advantage of both linguistic-based and knowledge-based approaches. To the best
of our knowledge, our work is the first to propose this hybridization in the
context of fake news detection. Some metaheuristic algorithms have been
proposed to deal with the fake news detection problem. The proposed fake news
detection system consists of two phases, namely training and testing. Both phases
include a preprocessing task, which consists of cleaning and preparing the
training and testing datasets of real and fake news. In the training phase, the
feature extraction task extracts a set of relevant features from the training dataset,
which are then fed to several machine learning algorithms to build a fake news
detection model. In the testing phase, the detection model is applied to test data
to decide whether the provided news articles are real or fake. Section 7.1 presents the
overall architecture of the proposed fake news detection system.
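As a rough illustration of the hybridization described above, the sketch below concatenates a list of linguistic feature values with the three fact-verification features named in the abstract (reputation, coverage, fact-check). The class name, the function name, the value ranges, and the numeric encoding are all assumptions made for this example; the report does not specify how the features are encoded.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FactVerificationFeatures:
    reputation: float  # assumed score in [0, 1] for the publishing website
    coverage: int      # number of sources that published the news
    fact_check: int    # assumed encoding: 1 = rated true, 0 = rated false

def build_feature_vector(linguistic: List[float],
                         fv: FactVerificationFeatures) -> List[float]:
    """Append the fact-verification features to the linguistic features."""
    return list(linguistic) + [fv.reputation, float(fv.coverage), float(fv.fact_check)]

# Placeholder linguistic values; real ones would come from feature extraction.
vector = build_feature_vector([0.12, 3.0], FactVerificationFeatures(0.8, 5, 1))
print(vector)
```

A vector of this form could then be passed to any of the machine learning algorithms used in the training phase.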
5.2 ADVANTAGES
5.3 RANDOM FOREST ALGORITHM
Washington. Tianqi Chen and Carlos Guestrin presented their paper at the SIGKDD
Conference in 2016 and took the machine learning world by storm. Since its
introduction, this algorithm has not only been credited with winning numerous
Kaggle competitions but also with being the driving force under the hood for
SYSTEM REQUIREMENTS
SPECIFICATION
6.1 HARDWARE REQUIREMENTS
CHAPTER VII
SYSTEM ANALYSIS
7.1 ARCHITECTURE
UML DIAGRAM
ACTIVITY DIAGRAM
7.2 DATA FLOW DIAGRAM
LEVEL 0
LEVEL 1
LEVEL 2
OVERALL DIAGRAM
ER DIAGRAM
CHAPTER VIII
SYSTEM
IMPLEMENTATION
8.1 MODULES
1. Data collection
2. Data preprocessing
3. Feature extraction
4. Model creation using Random Forest
5. Hyperparameter Tuning
Data is the prime ingredient of this project. Features are extracted from the
data using Natural Language Processing, and these features are used to train
Machine Learning algorithms and create models. In this project, the dataset
contains fake and real news in roughly equal proportion. Data is saved
in Comma Separated Values (CSV) format. The dataset is divided into training and
testing sets for the algorithms.
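The division into training and testing sets can be sketched with the standard library alone. The helper name split_rows and the toy rows are hypothetical; the 20% test fraction and the seed of 7 mirror the values passed to train_test_split in the appendix code, which remains the actual implementation.

```python
import random
from typing import List, Tuple

Row = Tuple[str, int]  # (text, label), with 0 = fake and 1 = real

def split_rows(rows: List[Row], test_fraction: float = 0.2,
               seed: int = 7) -> Tuple[List[Row], List[Row]]:
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy balanced dataset: five fake and five real entries.
rows = [(f"fake story {i}", 0) for i in range(5)] + \
       [(f"real story {i}", 1) for i in range(5)]
train, test = split_rows(rows)
print(len(train), len(test))
```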
8.2.3 FEATURE EXTRACTION
CHAPTER IX
SOFTWARE
ENVIRONMENT
INTRODUCTION TO PYTHON
Python is a high-level, object-oriented programming language that was created
by Guido van Rossum. It is also called a general-purpose programming
language because it is used in almost every domain, including those listed
below:
Web Development
Software Development
Game Development
AI & ML
Data Analytics
This list could go on, but why is Python so popular? The next topic
explains.
Interfaces. Similarly, every language has its pros and cons. Python is
general-purpose, which means it is widely used across domains. It is simple to
understand and scalable, which makes development fast. Learning Python does
not require any prior programming background, which is another reason for its
popularity among developers. Python has a simple syntax similar to the English
language, and that syntax allows developers to write programs with fewer lines
of code. Because it is open source, many libraries are available that make
developers’ jobs easier and ultimately result in higher productivity.
Developers can focus on business logic, and Python is an in-demand skill in a
digital era where information is available in large data sets.
In the digital era, there is a great deal of information available
on the internet, which can be confusing. A good starting point is to follow the
official documentation. Once we are familiar with the concepts and
terminology, we can dive deeper.
YouTube: https://www.youtube.com/watch?v=_uQrJ0TkZlc
Codecademy: https://www.codecademy.com/catalog/language/python
To get started with coding, many options are available. We can use any IDE we
are comfortable with, but for those who are new to the programming world, some
IDEs for Python are listed below:
2) PyCharm: https://www.jetbrains.com/pycharm/
3) Spyder: https://www.spyder-ide.org/
4) Atom: https://atom.io/
Real-World Examples:
1) NASA (National Aeronautics and Space Administration): One of NASA’s Shuttle
Support Contractors, United Space Alliance, developed a fast Workflow Automation
System (WAS). An internal source within the project stated
that:
“Python allows us to tackle the complexity of programs like the WAS without
getting bogged down in the language”.
NASA also published a website (https://code.nasa.gov/) listing 400 open
source projects that use Python.
2) Netflix: Various projects at Netflix use Python, as follows:
Among these projects, Regional Failover stands out, as the system
decreases outage time from 45 minutes to 7 minutes at no additional cost.
3) Instagram: Instagram also uses Python extensively. It built its photo-sharing
social platform using Django, which is a web framework for Python. It has also
been able to upgrade its framework successfully without any technical
challenges.
2) Game Development: PySoy and PyGame are two Python libraries that are
used for game development.
4) Desktop GUI: Python offers many toolkits and frameworks with
which we can build desktop applications. PyQt, PyGtk, and PyGUI are some of the
GUI frameworks.
Approach to be followed to master Python:
“Beginning is the end and end is the beginning” is a famous quote from a web
series named “Dark”. How does it relate to Python programming?
The reality resembles the infinity symbol. In the programming realm, there is
no such thing as complete mastery. It is simply a trial-and-error
process. For example, while writing code that tried to print the value of a
variable before assigning it inside a function, I encountered
an unfamiliar error named “UnboundLocalError”.
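The situation described above can be reproduced with a short sketch. The names counter and show_then_assign are hypothetical and chosen only for illustration.

```python
counter = 10  # module-level variable

def show_then_assign():
    # The assignment below makes `counter` local to this function,
    # so reading it before the assignment raises UnboundLocalError.
    print(counter)
    counter = 20

try:
    show_then_assign()
except UnboundLocalError as exc:
    error_name = type(exc).__name__
    print(error_name)
```

Declaring the name with the global keyword inside the function, or passing the value in as a parameter, avoids the error.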
The main question, then, is what approach to follow in order to master Python
programming.
print("Hello World")
Variables in Python:
my_var = 100
As you can see here, we have created a variable named “my_var” and assigned the
value 100 to it.
While data structures are responsible for deciding how to store this data in a
computer’s memory.
my_str = "ABCD"
As you can see here, we have assigned a value “ABCD” to a variable my_str.
This is basically a string data type in Python.
my_dict={1:100,2:200,3:300}
Again, this is just the tip of the iceberg. There are many data types and data
structures in Python. To give a basic idea of the data structures in Python, here is
a list of common built-in ones:
1. Lists
2. Dictionaries
3. Sets
4. Tuples
5. Frozensets
Python is no exception for that as well. This is one of the most important concepts
that we need to master.
IF-ELIF-ELSE conditionals:
else:
    print("Do nothing")
As you can see in the above example, we have created what is known as the
if-elif-else ladder.
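Only the final else branch of the listing survives above. A complete if-elif-else ladder might look like the following sketch, in which the function name describe and the two conditions are hypothetical; only the "Do nothing" branch comes from the original listing.

```python
def describe(number: int) -> str:
    """Return a label chosen by an if-elif-else ladder."""
    if number > 0:
        return "Positive"
    elif number < 0:
        return "Negative"
    else:
        return "Do nothing"

for value in (5, -1, 0):
    print(value, describe(value))
```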
For loop:
for i in "Python":
    print(i)
PRO Tip:
Once you start programming in Python, you will see that missing or inconsistent
whitespace at the start of a line causes errors. This leading whitespace is
known as indentation. Python is very strict about indentation, and the language
was designed to encourage neat, readable code. The
indentation conventions are described in one of Python’s early PEPs (Python
Enhancement Proposals).
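A small sketch can show how strictly indentation is enforced. Here a loop whose body is not indented is passed to the built-in compile function, which is used only so that the error can be caught and inspected; the source string is a made-up example.

```python
# The loop body on the second line is deliberately not indented.
source = "for ch in 'ab':\nprint(ch)\n"

try:
    compile(source, "<example>", "exec")
    outcome = "compiled"
except IndentationError as exc:
    outcome = type(exc).__name__

print(outcome)
```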
While The Python Language Reference describes the exact syntax and
semantics of the Python language, this library reference manual describes the
standard library that is distributed with Python. It also describes some of the
optional components that are commonly included in Python distributions.
solutions for many problems that occur in everyday programming. Some of these
modules are explicitly designed to encourage and enhance the portability of
Python programs by abstracting away platform-specifics into platform-neutral
APIs.
The Python installers for the Windows platform usually include the entire
standard library and often also include many additional components. For Unix-
like operating systems Python is normally provided as a collection of packages,
so it may be necessary to use the packaging tools provided with the operating
system to obtain some or all of the optional components.
Circling back to our earlier definition of a module as reusable, importable code,
we note that every package is a module — but not every module is a package.
A package folder usually contains one file named __init__.py that basically tells
Python: “Hey, this directory is a package!” The init file may be empty, or it may
contain code to be executed upon package initialization.
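The role of __init__.py can be seen in a minimal, self-contained sketch that builds a throwaway package in a temporary directory and imports it. The package name demo_pkg, the module helpers, and their contents are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    pkg_dir = Path(tmp) / "demo_pkg"  # hypothetical package name
    pkg_dir.mkdir()
    # Code in __init__.py runs when the package is first imported.
    (pkg_dir / "__init__.py").write_text("GREETING = 'package initialized'\n")
    (pkg_dir / "helpers.py").write_text("def double(x):\n    return 2 * x\n")

    sys.path.insert(0, tmp)  # make the temporary directory importable
    try:
        package = importlib.import_module("demo_pkg")
        helpers = importlib.import_module("demo_pkg.helpers")
        greeting = package.GREETING
        doubled = helpers.double(21)
    finally:
        sys.path.remove(tmp)

print(greeting, doubled)
```

Here demo_pkg is both a package and a module, while demo_pkg.helpers is a module but not a package.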
You’ve probably come across the term “library” as well. For Python, a library
isn’t as clearly defined as a package or a module, but a good rule of thumb is
that whenever a package has been published, it may be referred to as a library.
HOW TO USE A PYTHON PACKAGE
We’ve mentioned namespaces, publishing packages and importing modules. If
any of these terms or concepts aren’t entirely clear to you, we’ve got you! In
this section, we’ll cover everything you’ll need to really grasp the pipeline of
using Python packages in your code.
Importing a Python Package
We’ll import a package using the import statement:
Let’s assume that we haven’t yet installed any packages. Python comes with a
big collection of pre-installed packages known as the Python Standard Library.
It includes tools for a range of use cases, such as text processing and doing math.
Let’s import the latter:
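A minimal sketch of the import described above, assuming that "the latter" refers to the math module of the standard library (the module named in the following section):

```python
import math

# Once imported, the module's functions are available under its name.
root = math.sqrt(16)
print(root)
```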
The sys.path command returns all the directories in which Python will try to
find a package. It may happen that you’ve downloaded a package but when you
try importing it, you get an error:
In such cases, check whether your imported package has been placed in one of
Python’s search paths. If it hasn’t, you can always expand your list of search
paths:
At that point, the interpreter will have one more location to search for
packages after receiving an import statement.
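Inspecting and extending the search paths can be sketched as follows. The directory /tmp/my_packages is a hypothetical location used only for illustration.

```python
import sys

# sys.path is a list of directory names that Python searches on import.
print(len(sys.path), "search locations before the change")

extra_dir = "/tmp/my_packages"  # hypothetical directory holding a package
if extra_dir not in sys.path:
    sys.path.append(extra_dir)

print(extra_dir in sys.path)
```

The change lasts only for the current interpreter session.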
Namespaces and Aliasing
When we imported the math module, we initialized the math namespace.
This means that we can now refer to functions and classes from the math module
by way of “dot notation”:
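Dot notation on the math namespace might look like this minimal sketch:

```python
import math

# The function is reached through the module's namespace.
value = math.factorial(5)
print(value)
```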
Assume that we were only interested in our math module’s factorial function,
and that we’re also tired of using dot notation. In that case, we can proceed as
follows:
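Importing a single name directly, so that no dot notation is needed, can be sketched as:

```python
from math import factorial

# factorial is now bound directly in the current namespace.
result = factorial(5)
print(result)
```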
If you’d like to import multiple resources from the same source, you can simply
comma-separate them in the import statement:
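A comma-separated import of several names from the same module, sketched with the factorial and log functions mentioned in this section:

```python
from math import factorial, log

natural_log = log(1)          # natural logarithm of 1
print(factorial(4), natural_log)
```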
There is, however, always a small risk that your variables will clash with other
variables in your namespace. What if one of the variables in your code was
named log, too? It would overwrite the log function, causing bugs. To avoid
that, it’s better to import the package as we did before. If you want to save typing
time, you can alias your package to give it a shorter name:
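Aliasing keeps the namespace intact while shortening the name. The alias m below is an arbitrary choice for this sketch.

```python
import math as m

log = "my own variable"   # does not clash with m.log
log_of_one = m.log(1)
print(log, log_of_one)
```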
However, this method poses serious risk since you usually don’t know all the
names contained in a package, increasing the likelihood of your variables being
overwritten. It’s for this reason that most seasoned Python programmers will
discourage use of the wildcard * in imports. Also, as the Zen of Python states,
“namespaces are one honking great idea!”
How to Install a Python Package
How about packages that are not part of the standard library? The official
repository for finding and downloading such third-party packages is the Python
Package Index, usually referred to simply as PyPI. To install packages from
PyPI, use the package installer pip:
pip can install Python packages from any source, not just PyPI. If you installed
Python using Anaconda or Miniconda, you can also use the conda command to
install Python packages.
While conda is very easy to use, it’s not as versatile as pip. So if you cannot
install a package using conda, you can always try pip instead.
Reloading a Module
If you’re programming in interactive mode, and you change a module’s script,
these changes won’t be imported, even if you issue another import statement. In
such case, you’ll want to use the reload() function from the importlib library:
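A minimal sketch of reloading a module with importlib.reload. The standard library module colorsys is used here only as a stand-in for a module whose source has been edited.

```python
import importlib
import colorsys  # stand-in for a module that was changed on disk

# reload() re-executes the module's code and returns the same module object.
reloaded = importlib.reload(colorsys)
print(reloaded is colorsys)
```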
It’s important to note that the term “package” in this context is being used to
describe a bundle of software to be installed (i.e. as a synonym for a distribution).
It does not refer to the kind of package that you import in your Python source
code (i.e. a container of modules). It is common in the Python community to refer
to a distribution using the term “package”. Using the term “distribution” is often
not preferred, because it can easily be confused with a Linux distribution, or
another larger software distribution like Python itself.
This section describes the steps to follow before installing other Python packages.
Before you go any further, make sure you have Python and that the expected
version is available from your command line. You can check this by running:
Unix/macOS
python3 --version
Windows
py --version
You should get some output like Python 3.6.3. If you do not have Python, please
install the latest 3.x version from python.org or refer to the Installing
Python section of the Hitchhiker’s Guide to Python.
Note
>>> python --version
Traceback (most recent call last):
It’s because this command and other suggested commands in this tutorial are
intended to be run in a shell (also called a terminal or console). See the Python
for Beginners getting started tutorial for an introduction to using your operating
system’s shell and interacting with Python.
Note
If you’re using an enhanced shell like IPython or the Jupyter notebook, you can
run system commands like those in this tutorial by prefacing them with
a ! character:
Python 3.6.3
Due to the way most Linux distributions are handling the Python 3 migration,
Linux users using the system Python without creating a virtual environment first
should replace the python command in this tutorial with python3 and
the python -m pip command with python3 -m pip --user. Do not run any of the
commands in this tutorial with sudo: if you get a permissions error, come back to
the section on creating virtual environments, set one up, and then continue with
the tutorial as written.
Ensure you can run pip from the command line
Additionally, you’ll need to make sure you have pip available. You can check
this by running:
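Unix/macOS
python3 -m pip --version
Windows
py -m pip --version
If pip isn’t already installed, then first try to bootstrap it from the standard library:
Unix/macOS
python3 -m ensurepip --default-pip
Windows
py -m ensurepip --default-pip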
Warning
While pip alone is sufficient to install from pre-built binary archives, up to date
copies of the setuptools and wheel projects are useful to ensure you can also
install from source archives:
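Unix/macOS
python3 -m pip install --upgrade pip setuptools wheel
Windows
py -m pip install --upgrade pip setuptools wheel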
Optionally, create a virtual environment
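See section below for details, but here’s the basic venv command to use on a
typical Linux system:
Unix/macOS
python3 -m venv tutorial_env
source tutorial_env/bin/activate
Windows
py -m venv tutorial_env
tutorial_env\Scripts\activate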
This will create a new virtual environment in the tutorial_env subdirectory, and
configure the current shell to use it as the default python environment.
Imagine you have an application that needs version 1 of LibFoo, but another
application requires version 2. How can you use both these applications? If you
install everything into /usr/lib/python3.6/site-packages (or whatever your
platform’s standard location is), it’s easy to end up in a situation where you
unintentionally upgrade an application that shouldn’t be upgraded.
Or more generally, what if you want to install an application and leave it be? If
an application works, any change in its libraries or the versions of those libraries
can break the application.
Also, what if you can’t install packages into the global site-packages directory?
For instance, on a shared host.
In all these cases, virtual environments can help you. They have their own
installation directories and they don’t share libraries with other virtual
environments.
Currently, there are two common tools for creating Python virtual environments:
Using venv:
Unix/macOS
python3 -m venv <DIR>
source <DIR>/bin/activate
Windows
py -m venv <DIR>
<DIR>\Scripts\activate
Using virtualenv:
Unix/macOS
python3 -m virtualenv <DIR>
source <DIR>/bin/activate
Windows
virtualenv <DIR>
<DIR>\Scripts\activate
For more information, see the venv docs or the virtualenv docs.
The use of source under Unix shells ensures that the virtual environment’s
variables are set within the current shell, and not in a subprocess (which then
disappears, having no useful effect).
pip is the recommended installer. Below, we’ll cover the most common usage
scenarios. For more detail, see the pip docs, which includes a complete Reference
Guide.
Installing from PyPI
The most common usage of pip is to install from the Python Package Index using
a requirement specifier. Generally speaking, a requirement specifier is composed
of a project name followed by an optional version specifier. PEP 440 contains
a full specification of the currently supported specifiers. Below are some
examples.
Unix/macOS
Windows
Unix/macOS
Windows
To install greater than or equal to one version and less than another:
Unix/macOS
Windows
Unix/macOS
Windows
In this case, this means to install any version “==1.4.*” version that’s also
“>=1.4.2”.
pip can install from either Source Distributions (sdist) or Wheels, but if both are
present on PyPI, pip will prefer a compatible wheel. You can override pip’s
default behavior by, e.g., using its --no-binary option.
If pip does not find a wheel to install, it will locally build a wheel and cache it for
future installs, instead of rebuilding the source distribution in the future.
Upgrading packages
Unix/macOS
Windows
To install packages that are isolated to the current user, use the --user flag:
Unix/macOS
python3 -m pip install --user SomeProject
Windows
py -m pip install --user SomeProject
For more information see the User Installs section from the pip docs.
Note that the --user flag has no effect when inside a virtual environment - all
installation commands will affect the virtual environment.
On Linux and macOS you can find the user base binary directory by
running python -m site --user-base and adding bin to the end. For
example, this will typically print ~/.local (with ~ expanded to the absolute
path to your home directory) so you’ll need to add ~/.local/bin to
your PATH. You can set your PATH permanently by modifying ~/.profile.
On Windows you can find the user base binary directory by running
py -m site --user-site and replacing site-packages with Scripts. For example,
this could return C:\Users\Username\AppData\Roaming\Python36\site-packages
so you would need to set your PATH to
include C:\Users\Username\AppData\Roaming\Python36\Scripts. You
can set your user PATH permanently in the Control Panel. You may need
to log out for the PATH changes to take effect.
Requirements files
Unix/macOS
Windows
Install a project from VCS in “editable” mode. For a full breakdown of the syntax,
see pip’s section on VCS Support.
Unix/macOS
python3 -m pip install -e hg+https://hg.repo/some_pkg#egg=SomeProject  # from mercurial
python3 -m pip install -e svn+svn://svn.repo/some_pkg/trunk/#egg=SomeProject  # from svn
python3 -m pip install -e git+https://git.repo/some_pkg.git@feature#egg=SomeProject  # from a branch
Windows
Unix/macOS
Windows
Unix/macOS
Windows
Installing from a local src tree
Installing from local src in Development Mode, i.e. in such a way that the project
appears to be installed, but yet is still editable from the src tree.
Unix/macOS
Windows
Unix/macOS
Windows
Unix/macOS
Windows
Install from a local directory containing archives (and don’t check PyPI)
Unix/macOS
Windows
To install from other data sources (for example Amazon S3 storage) you can
create a helper application that presents the data in a PEP 503 compliant index
format, and use the --extra-index-url flag to direct pip to use that index.
./s3helper --port=7777
python -m pip install --extra-index-url http://localhost:7777 SomeProject
Installing Prereleases
Unix/macOS
Windows
Unix/macOS
CHAPTER X
SYSTEM TESTING
UNIT TESTING
Unit testing involves the design of test cases that validate that the internal
program logic is functioning properly, and that program inputs produce valid
outputs. All decision branches and internal code flow should be validated. It is
the testing of individual software units of the application. It is done after the
completion of an individual unit, before integration. This is structural testing
that relies on knowledge of the unit's construction and is invasive. Unit tests perform
basic tests at the component level and test a specific business process, application,
and/or system configuration. Unit tests ensure that each unique path of a business
process performs accurately to the documented specifications and contains
clearly defined inputs and expected results.
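A unit test of this kind can be sketched with Python's built-in unittest module. The function clean_text is a hypothetical preprocessing unit invented for this example and is not taken from the project code; each test supplies a clearly defined input and an expected result.

```python
import unittest

def clean_text(text: str) -> str:
    """Hypothetical preprocessing unit: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())

class CleanTextTest(unittest.TestCase):
    def test_lowercases_and_trims(self):
        self.assertEqual(clean_text("  Breaking   NEWS "), "breaking news")

    def test_empty_input(self):
        self.assertEqual(clean_text(""), "")

# Run the two test cases explicitly and collect the result.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CleanTextTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```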
INTEGRATION TESTING
FUNCTIONAL TEST
Valid Input : identified classes of valid input must be accepted.
SYSTEM TEST
System testing ensures that the entire integrated software system meets
requirements. It tests a configuration to ensure known and predictable results. An
example of system testing is the configuration oriented system integration test.
System testing is based on process descriptions and flows, emphasizing pre-
driven process links and integration points.
White Box Testing is testing in which the software tester has
knowledge of the inner workings, structure and language of the software, or at
least its purpose. It is used to test areas that cannot be reached from
a black box level.
Black Box Testing is testing the software without any knowledge of the
inner workings, structure or language of the module being tested. Black box tests,
like most other kinds of tests, must be written from a definitive source document,
such as a specification or requirements document. It is testing in which the
software under test is treated as a black box: you cannot “see” into it. The test
provides inputs and responds to
outputs without considering how the software works.
CHAPTER XI
APPENDIX
11.1 CODING
import numpy as np
import pandas as pd
import os
# List every file under the input directory
for dirname, _, filenames in os.walk('input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.metrics import (classification_report, confusion_matrix,
                             ConfusionMatrixDisplay)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import (PassiveAggressiveClassifier,
                                  LogisticRegression)
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
# Load the fake and real news datasets
fake=pd.read_csv("D:/Dowload D/Fake_News_Detection-master/jupyter code/Fake.csv")
true=pd.read_csv("D:/Dowload D/Fake_News_Detection-master/jupyter code/True.csv")
fake.head()
fake.info(), true.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23481 entries, 0 to 23480
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 23481 non-null object
1 text 23481 non-null object
2 subject 23481 non-null object
3 date 23481 non-null object
dtypes: object(4)
memory usage: 733.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21417 entries, 0 to 21416
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 title 21417 non-null object
1 text 21417 non-null object
2 subject 21417 non-null object
3 date 21417 non-null object
dtypes: object(4)
memory usage: 669.4+ KB
(None, None)
# There is no null value in the datasets
# Only the text column is needed; the label is added manually (0 = fake, 1 = true).
# .copy() makes independent frames, so adding a column does not raise SettingWithCopyWarning.
fake_df=fake[['text']].copy()
true_df=true[['text']].copy()
fake_df['label']=0
true_df['label']=1
data=pd.concat([fake_df,true_df], axis=0)
data=data.sample(frac=0.7)
data.tail(7)
# The two datasets are combined, and a random 70% sample is kept in shuffled order
df=data.reset_index(drop=True)
df.head()
X=df['text']
y=df['label']
x_train, x_test, y_train, y_test= train_test_split(X, y, test_size=0.2, random_state=7)
tfid_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)
train_tfid=tfid_vectorizer.fit_transform(x_train)
test_tfid=tfid_vectorizer.transform(x_test)
PA_classifier=PassiveAggressiveClassifier(max_iter=50)
PA_classifier.fit(train_tfid, y_train)
pa_predict=PA_classifier.predict(test_tfid)
print("Classification Report:", classification_report(y_test, pa_predict))
print("Accuracy Score:", accuracy_score(y_test, pa_predict))
con_pa=confusion_matrix(y_test, pa_predict)
con_mat=ConfusionMatrixDisplay(con_pa, display_labels=['Fake','True'])
con_mat.plot()
plt.show()
Classification Report: precision recall f1-score support
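The listing does not show the cell that trains the Logistic Regression model and produces linear_predict, which the next lines use. A minimal sketch of that step is given below, assuming it follows the same fit/predict pattern as the other classifiers. The name linear_model and the toy corpus are assumptions made for illustration; they are not taken from the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Toy corpus (illustrative only); label 0 = fake, 1 = true as in the listing.
toy_train = ["aliens endorse candidate in secret meeting",
             "miracle pill cures every disease overnight",
             "parliament passes the annual budget bill",
             "central bank keeps interest rates unchanged"]
toy_labels = [0, 0, 1, 1]
toy_test = ["parliament keeps the budget unchanged"]
toy_test_labels = [1]

toy_vectorizer = TfidfVectorizer(stop_words='english')
toy_train_tfid = toy_vectorizer.fit_transform(toy_train)
toy_test_tfid = toy_vectorizer.transform(toy_test)

# Assumed model name; the original variable is not shown in the listing.
linear_model = LogisticRegression()
linear_model.fit(toy_train_tfid, toy_labels)
linear_predict = linear_model.predict(toy_test_tfid)
print("Accuracy Score:", accuracy_score(toy_test_labels, linear_predict))
```

In the project itself the same three calls would presumably be applied to train_tfid, y_train and test_tfid instead of the toy data.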
con_pa=confusion_matrix(y_test, linear_predict)
con_mat=ConfusionMatrixDisplay(con_pa, display_labels=['Fake','True'])
con_mat.plot()
plt.show()
Classification Report: precision recall f1-score support
Logistic Regression achieves 98.27% accuracy
tree_model=DecisionTreeClassifier()
tree_model.fit(train_tfid, y_train)
tree_predict=tree_model.predict(test_tfid)
print("Classification Report:", classification_report(y_test, tree_predict))
print("Accuracy Score:", accuracy_score(y_test, tree_predict))
con_pa=confusion_matrix(y_test, tree_predict)
con_mat=ConfusionMatrixDisplay(con_pa, display_labels=['Fake','True'])
con_mat.plot()
plt.show()
Classification Report: precision recall f1-score support
The Decision Tree model achieves 99.36% accuracy
xgboost_model=XGBClassifier()
xgboost_model.fit(train_tfid,y_train)
xgb_predict=xgboost_model.predict(test_tfid)
print("Classification Report:", classification_report(y_test, xgb_predict))
print("Accuracy Score:", accuracy_score(y_test, xgb_predict))
con_pa=confusion_matrix(y_test, xgb_predict)
con_mat=ConfusionMatrixDisplay(con_pa, display_labels=['Fake','True'])
con_mat.plot()
plt.show()
Classification Report: precision recall f1-score support
accuracy 1.00 6286
macro avg 1.00 1.00 1.00 6286
weighted avg 1.00 1.00 1.00 6286
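import pickle
# Save the trained XGBoost model; the with-block closes the file after writing
with open("modell.pkl","wb") as model_file:
    pickle.dump(xgboost_model, model_file)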
# TfidfVectorizer Explanation
text = ['Hello Diwakar here, fake news detection','Welcome to the Machine learning']
vect = TfidfVectorizer()
vect.fit(text)
print(vect.idf_)
print(vect.vocabulary_)
### A word that is present in every document gets a low IDF value, so rarer,
more distinctive words stand out with the highest IDF values.
example = text[0]
example
example = vect.transform([example])
print(example.toarray())
### Here, a 0 at an index means that the vocabulary word with that index does not
occur in the given sentence.
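To make the IDF values concrete, the sketch below recomputes them by hand in plain Python. It assumes the smoothed formula that scikit-learn documents for its default setting (smooth_idf=True), idf(t) = ln((1 + n) / (1 + df(t))) + 1, and a two-sentence toy corpus chosen so that some words occur in both documents.

```python
import math
import re

# Toy corpus (illustrative only): "news" and "spreads" occur in both documents.
docs = ["fake news spreads fast", "real news spreads slowly"]

# Same token rule as TfidfVectorizer's default: lowercase, words of 2+ characters.
tokenized = [set(re.findall(r"\b\w\w+\b", d.lower())) for d in docs]
vocabulary = sorted(set().union(*tokenized))

n_docs = len(docs)
idf = {}
for term in vocabulary:
    # Document frequency: number of documents that contain the term.
    df = sum(term in tokens for tokens in tokenized)
    # Smoothed IDF as documented for scikit-learn (smooth_idf=True).
    idf[term] = math.log((1 + n_docs) / (1 + df)) + 1

for term, value in idf.items():
    print(f"{term:8s} {value:.4f}")
```

Words found in both documents get the minimum value of 1.0, while words found in only one document get ln(1.5) + 1, about 1.4055.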
## PassiveAggressiveClassifier
import os
import pandas as pd
dataframe = pd.read_csv('news.csv')
dataframe.head()
x = dataframe['text']
y = dataframe['label']
0 FAKE
1 FAKE
2 REAL
3 FAKE
4 REAL
...
6330 REAL
6331 FAKE
6332 FAKE
6333 REAL
6334 REAL
Name: label, Length: 6335, dtype: object
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.2,random_state=0)
y_train
2402 REAL
1922 REAL
3475 FAKE
6197 REAL
4748 FAKE
...
4931 REAL
3264 REAL
1653 FAKE
2607 FAKE
2732 REAL
Name: label, Length: 5068, dtype: object
tfvect = TfidfVectorizer(stop_words='english',max_df=0.7)
tfid_x_train = tfvect.fit_transform(x_train)
tfid_x_test = tfvect.transform(x_test)
* max_df = 0.50 means "ignore terms that appear in more than 50% of the
documents".
* max_df = 25 means "ignore terms that appear in more than 25 documents".
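The two forms of max_df can be seen on a small example. The sketch below uses a made-up six-document corpus (an assumption for illustration) in which "election" occurs in 4 of 6 documents and "news" in 3 of 6, and compares a fractional threshold with an absolute one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (illustrative only).
docs = ["election news today",
        "election results announced",
        "election fraud claims",
        "election news update",
        "weather news tomorrow",
        "sports scores tonight"]

# Float: drop terms that appear in more than 50% of the documents (more than 3 of 6).
by_fraction = TfidfVectorizer(max_df=0.5).fit(docs)
# Integer: drop terms that appear in more than 2 documents.
by_count = TfidfVectorizer(max_df=2).fit(docs)

print(sorted(by_fraction.vocabulary_))
print(sorted(by_count.vocabulary_))
```

With the fractional threshold only "election" is dropped, because "news" sits exactly at 50%; with the integer threshold both "election" and "news" are dropped.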
classifier = PassiveAggressiveClassifier(max_iter=50)
classifier.fit(tfid_x_train,y_train)
PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
early_stopping=False, fit_intercept=True,
loss='hinge', max_iter=50, n_iter_no_change=5,
n_jobs=None, random_state=None, shuffle=True,
tol=0.001, validation_fraction=0.1, verbose=0,
warm_start=False)
y_pred = classifier.predict(tfid_x_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
Accuracy: 93.69%
cf = confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])
print(cf)
[[575 40]
[ 40 612]]
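The reported accuracy can be checked directly against the confusion matrix. The sketch below redoes the arithmetic in plain Python, reading the rows as actual classes and the columns as predicted classes, in the order FAKE, REAL. Treating FAKE as the class of interest for precision and recall is a choice made here for illustration.

```python
# Confusion matrix reported above: rows = actual, columns = predicted,
# in the order ['FAKE', 'REAL'].
cf = [[575, 40],
      [40, 612]]

total = sum(sum(row) for row in cf)
accuracy = (cf[0][0] + cf[1][1]) / total

# Treating FAKE as the class of interest.
fake_precision = cf[0][0] / (cf[0][0] + cf[1][0])  # predicted FAKE that were actually FAKE
fake_recall = cf[0][0] / (cf[0][0] + cf[0][1])     # actual FAKE that were predicted FAKE

print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"FAKE precision: {fake_precision:.4f}, recall: {fake_recall:.4f}")
```

The 1267 test articles give (575 + 612) / 1267, which rounds to the 93.69% printed by the notebook.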
def fake_news_det(news):
    # Vectorize a single news text with the fitted tfvect and print the predicted label
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = classifier.predict(vectorized_input_data)
    print(prediction)
fake_news_det('U.S. Secretary of State John F. Kerry said Monday that he will '
              'stop in Paris later this week, amid criticism that no top American officials '
              'attended Sunday’s unity march against terrorism.')
['REAL']
fake_news_det("""Go to Article
President Barack Obama has been campaigning hard for the woman who is
supposedly going to extend his legacy four more years. The only problem with
stumping for Hillary Clinton, however, is she’s not exactly a candidate easy
to get too enthused about. """)
['FAKE']
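import pickle
# Save the trained classifier; the with-block closes the file after writing
with open('model.pkl', 'wb') as model_file:
    pickle.dump(classifier, model_file)
# load the model from disk
with open('model.pkl', 'rb') as model_file:
    loaded_model = pickle.load(model_file)
def fake_news_det1(news):
    # tfvect must still be in memory, because only the classifier was pickled
    input_data = [news]
    vectorized_input_data = tfvect.transform(input_data)
    prediction = loaded_model.predict(vectorized_input_data)
    print(prediction)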
fake_news_det1("""Go to Article
President Barack Obama has been campaigning hard for the woman who is
supposedly going to extend his legacy four more years. The only problem with
stumping for Hillary Clinton, however, is she’s not exactly a candidate easy
to get too enthused about. """)
['FAKE']
fake_news_det1("""U.S. Secretary of State John F. Kerry said Monday that he
will stop in Paris later this week, amid criticism that no top American officials
attended Sunday’s unity march against terrorism.""")
['REAL']
fake_news_det('''U.S. Secretary of State John F. Kerry said Monday that he will
stop in Paris later this week, amid criticism that no top American officials
attended Sunday’s unity march against terrorism.''')
['REAL']
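The loading code above still depends on the fitted tfvect being in memory, because only the classifier is written to model.pkl. One possible way to make the saved file usable on its own is to bundle the vectorizer and the classifier in a scikit-learn Pipeline and pickle that single object. The sketch below is an illustration on toy data, not the project's code; the file name pipeline.pkl and the toy texts are assumptions.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.pipeline import Pipeline

# Toy training data (illustrative only, not the project's news.csv).
texts = ["aliens endorse candidate in secret meeting",
         "miracle pill cures every disease overnight",
         "parliament passes the annual budget bill",
         "central bank keeps interest rates unchanged"]
labels = ["FAKE", "FAKE", "REAL", "REAL"]

# One object holds both the fitted vocabulary and the fitted classifier.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english", max_df=0.7)),
    ("clf", PassiveAggressiveClassifier(max_iter=50, random_state=0)),
])
pipeline.fit(texts, labels)

# Hypothetical file name; written to a temporary directory for this demo.
path = os.path.join(tempfile.mkdtemp(), "pipeline.pkl")
with open(path, "wb") as fh:
    pickle.dump(pipeline, fh)

# Loading needs no separate vectorizer: raw text goes straight into predict().
with open(path, "rb") as fh:
    loaded_pipeline = pickle.load(fh)

sample = ["parliament passes budget bill"]
print(loaded_pipeline.predict(sample))
```

With this arrangement a web application or another script only has to load one file and pass it raw news text.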
11.2 SCREENSHOTS
CHAPTER XII
CONCLUSION
In this paper, we have proposed a novel hybrid fake news detection system
that employs two types of features: linguistic and fact-verification features. The
proposed detection system employs only eight features, which is fewer than
the state-of-the-art approaches use. It operates in two phases: training and testing. In
the training phase, the detection system runs four machine learning algorithms,
i.e., Logistic Regression (LR), Random Forest (RF), Additional Trees
Discriminant, and XGBoost, in order to select the best classifier for the testing
phase. Evaluation results on the News data set show that the proposed detection
system achieves an accuracy of 99% under XGBoost. As future work, we aim at
improving the accuracy of our detection system by investigating other
discriminating features such as visual-based and style-based features. Moreover,
we plan to further detect other types of false information such as
biased/inaccurate news and misleading/ambiguous news.
FUTURE ENHANCEMENT
The future scope of this project is that fake news detectors can help to filter
websites that contain fake news, so that users are not drawn in by clickbait.
With some modifications, the project can also be used on social media platforms,
where a massive amount of fake data can cause damage to society, to remove such
content. Fake account creators are constantly adapting their tactics to evade
detection. Future work could focus on developing machine learning models that
can adapt to these evolving tactics and remain effective in identifying fake
accounts.
CHAPTER XIII
REFERENCE
[1]. S. Yang, K. Shu, S. Wang, R. Gu, F. Wu and H. Liu, "Unsupervised Fake
News Detection on Social Media: A Generative Approach", Proceedings of the
AAAI Conference on Artificial Intelligence, vol. 33, pp. 5644-5651, 2019.
[3]. F. Ozbay and B. Alatas, "Fake news detection within online social media
using supervised artificial intelligence algorithms", Physica A: Statistical
Mechanics and its Applications, vol. 540, pp. 123174, 2020.
[4]. P. Faustini and T. Covões, "Fake news detection in multiple platforms and
languages", Expert Systems with Applications, vol. 158, pp. 113503, 2020.
[6]. Y. Lin, "10 Twitter Statistics Every Marketer Should Know in 2021
[Infographic]", My.oberlo.com, 2021.
[9]. L. Gimenez, "6 steps for data cleaning and why it matters", Geotab, 2020.