AI REPORT
Declaration i
Abstract ii
Acknowledgement iii
Contents iv
List of Figures vii
4.3.3 Proposed System 11
4.3.4 Advantages of the Proposed System 11
4.4 System Architecture 12
4.4.1 Data Flow Diagram 12
4.4.2 UML Diagram 12
4.4.3 USE CASE Diagram 13
4.4.4 Class Diagram 13
4.4.5 Sequence Diagram 14
4.4.6 Activity Diagram 15
4.5 Implementation 16
4.5.1 Modules 17
4.6 Screen Shots 18
5 CONCLUSION 20
BIBLIOGRAPHY 21
APPENDIX 22
Appendix A: Abbreviations 23
LIST OF FIGURES
CHAPTER-1
COMPANY PROFILE
ProgMaster Private Limited was established with a vision to become a leading force in the
software industry. Incorporated as a private, non-government company, ProgMaster Pvt Ltd is
headquartered in Govardhana Giri, Chittoor, Andhra Pradesh. Since its inception, the company
has focused on delivering cutting-edge technology solutions and high-quality education in the IT
domain.
Core Activities:
Corporate Training: ProgMaster has trained over 5,000 students, equipping them with
industry-relevant skills.
Real-Time Projects: Through its unique RealTechWorld system, ProgMaster enables
students to work alongside experienced developers on real-time projects, enhancing their
practical experience.
Software Development: The company specializes in creating custom software solutions,
web applications, and full-stack development services for diverse clients.
Project Consultancy: ProgMaster collaborates with various companies, providing
technical expertise and building innovative software products tailored to their needs.
Mission:
ProgMaster's mission is to bridge the gap between theoretical knowledge and practical
application, empowering students and clients to excel in the rapidly evolving IT industry.
Vision:
To become a globally recognized name in software development and IT education, fostering
innovation and creating future-ready professionals.
ProgMaster continues to invest in research, explore new technologies, and contribute to the
growth of the software ecosystem, setting new benchmarks in quality and excellence.
1.1.1 Objectives
To be a world-class research and development organization committed to enhancing stakeholders' value.
To build the best products, socially innovative and of high quality, and to provide excellent education to all.
Zeal to excel and zest for change; respect for the dignity and potential of individuals.
To continuously research futuristic technologies and find ways to simplify them for clients.
Over the years, the company has successfully delivered value to its customers, and it truly believes that its customers' success is its own success. It does not see itself merely as a vendor for their projects; it has gone to great lengths in the interest of its customers' success and works hard to make that success happen.
CHAPTER-2
ABOUT THE DEPARTMENT
2.3 Testing
Testing was done according to corporate standards. As each component was built, unit testing was performed to check whether the desired functionality was obtained. Each component was tested with multiple test cases to verify that it worked properly. The unit-tested components were then integrated with the existing components, and integration testing was performed. Here again, multiple test cases were run to ensure that the newly built component ran in coordination with the existing components. Unit and integration testing were performed iteratively until the complete product was built.
Once the complete product was built, it was again tested against multiple test cases covering all the functionality. A product may work fine in the developer's environment but not necessarily in all the other environments users could be using. Hence, the product was also tested under multiple environments (various operating systems and devices). At every step, if a flaw was observed, the component was rebuilt to fix the bugs. In this way, testing was done hierarchically and iteratively.
CHAPTER-3
TASK PERFORMED
Training Program: The internship is a platform where trainees are assigned specific tasks. In the initial days of the internship, I was trained on the following:
Python Programming
Machine Learning Algorithms
A. Pre-processing Data:
Social media data is highly unstructured: the majority of it is informal communication with typos, slang, bad grammar, and so on. The quest for increased performance and reliability has made it imperative to develop techniques that turn this resource into informed decisions. To achieve better insights, the data must be cleaned before it can be used for predictive modeling. For this purpose, basic pre-processing was done on the news training data. This step comprised:
Data Cleaning:
When reading data, we get it in a structured or unstructured format. A structured format has a well-defined pattern, whereas unstructured data has no proper structure. In between the two lies the semi-structured format, which is better structured than the unstructured format.
Cleaning up the text data is necessary to highlight the attributes that we want our machine learning system to pick up on. Cleaning (or pre-processing) the data typically consists of a number of steps, illustrated in the code sketch after this list:
a) Remove punctuation
Punctuation can provide grammatical context to a sentence, which supports our understanding. But for our vectorizer, which counts words rather than context, it adds no value, so we remove all special characters. e.g., "How are you?" -> "How are you"
b) Tokenization
Tokenizing separates text into units such as sentences or words. It gives structure to previously unstructured text. e.g., "Plata o Plomo" -> ['Plata', 'o', 'Plomo']
c) Remove stopwords
Stopwords are common words that are likely to appear in any text. They don't tell us much about our data, so we remove them. e.g., "silver or lead is fine for me" -> ['silver', 'lead', 'fine']
d) Stemming
Stemming reduces a word to its stem form. It often makes sense to treat related words in the same way. It removes suffixes like "ing", "ly", "s", etc. by a simple rule-based approach. This shrinks the corpus of words, but the actual words are often lost. e.g., Entitling, Entitled -> Entitle. Note: some search engines treat words with the same stem as synonyms.
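The steps above can be sketched with standard Python tooling. The following is a minimal illustration using NLTK; the function name and the exact library calls are illustrative choices, not taken from the report.

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer models
nltk.download("stopwords", quiet=True)  # stopword list

STOPWORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def clean_text(text):
    # (a) Remove punctuation: "How are you?" -> "How are you"
    text = text.translate(str.maketrans("", "", string.punctuation))
    # (b) Tokenization: split the text into word units
    tokens = word_tokenize(text.lower())
    # (c) Remove stopwords: drop common words such as "is", "for", "me"
    tokens = [t for t in tokens if t not in STOPWORDS]
    # (d) Stemming: reduce each word to its stem form
    return [STEMMER.stem(t) for t in tokens]

print(clean_text("silver or lead is fine for me"))  # ['silver', 'lead', 'fine']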
B. Feature Generation:
We can use text data to generate a number of features like word count, frequency of long words, frequency of unique words, n-grams, etc. By creating a representation of words that captures their meanings, semantic relationships, and the numerous contexts they are used in, we can enable computers to understand text and perform clustering, classification, etc.
Vectorizing Data: Vectorizing is the process of encoding text as integers, i.e., in numeric form, to create feature vectors so that machine learning algorithms can understand our data.
1. Vectorizing Data: Bag-of-Words
Bag of Words (BoW), or CountVectorizer, describes the presence of words within the text data: it records 1 if a word is present in a sentence and 0 if it is not. It therefore creates a bag of words with a count for each word in each text document.
2. Vectorizing Data: N-Grams
N-grams are simply all combinations of adjacent words or letters of length n that we can find in our source text. N-grams with n = 1 are called unigrams; similarly, bigrams (n = 2), trigrams (n = 3), and so on can also be used. Unigrams usually don't contain much information compared to bigrams and trigrams. The basic principle behind n-grams is that they capture which letter or word is likely to follow a given one. The longer the n-gram (the higher n), the more context you have to work with.
3. Vectorizing Data: TF-IDF
TF-IDF computes the "relative frequency" with which a word appears in a document compared to its frequency across all documents. The TF-IDF weight represents the relative importance of a term in the document and the entire corpus. TF stands for Term Frequency: it calculates how frequently a term appears in a document. Since document sizes vary, a term may appear more often in a long document than in a short one; term frequency is therefore often divided by the document length. A sketch of all three encodings follows.
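As a concrete illustration of these three encodings, here is a minimal scikit-learn sketch on a two-sentence toy corpus (the corpus and all variable names are illustrative, not from the report):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = [
    "fake news spreads fast",
    "real news spreads slowly",
]

# 1. Bag-of-Words: document-term count matrix
bow = CountVectorizer()
print(bow.fit_transform(corpus).toarray())

# 2. N-grams: unigrams plus bigrams via ngram_range=(1, 2)
bigrams = CountVectorizer(ngram_range=(1, 2))
bigrams.fit(corpus)
print(bigrams.get_feature_names_out())

# 3. TF-IDF: term frequency weighted against corpus-wide document frequency
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(corpus).toarray().round(2))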
CHAPTER-4
REFLECTION NOTES
4.1 Experience
In our experience during the internship, Tequed Labs follows a good work culture and has friendly employees, from the staff level to the management level. The trainers are well versed in their fields and treat everyone equally: there is no distinction between fresh graduates and corporate professionals, and everyone is respected alike. There is a lot of teamwork in every task, be it hard or easy, and a very calm and friendly atmosphere is maintained at all times. There is plenty of scope for self-improvement thanks to the great communication and support available. Interns were treated and taught well, and all our doubts and concerns regarding the training or the companies were properly answered. All in all, Tequed Labs is a great place for a fresher to start a career and for a professional to advance one. It has been a great experience to be an intern in such a reputed organization.
SOFTWARE REQUIREMENTS:
Operating System : Windows or Linux
Platform used : Anaconda Navigator (Jupyter notebook)
The question of determining 'fake news' has also been the subject of particular attention within the literature.
Conroy, Rubin, and Chen outline several approaches that seem promising towards the aim of perfectly classifying misleading articles. They note that simple content-related n-grams and shallow part-of-speech (POS) tagging have proven insufficient for the classification task, often failing to account for important context information. Rather, these methods have been shown useful only in tandem with more complex methods of analysis. Deep syntax analysis using Probabilistic Context-Free Grammars (PCFG) has been shown to be particularly valuable in combination with n-gram methods. Feng, Banerjee, and Choi are able to achieve 85%-91% accuracy in deception-related classification tasks using online review corpora. Feng and Hirst implemented a semantic analysis looking at 'object:descriptor' pairs for contradictions with the text, on top of Feng's initial deep-syntax model, for additional improvement. Rubin and Lukoianova analyze rhetorical structure using a vector space model with similar success. Ciampaglia et al. employ language pattern similarity networks, which require a pre-existing knowledge base.
Figure 4.4.1.1: Dataflow diagram to check the truth probability of the URL
Figure 4.4.6.1: Activity Diagram
4.5 Implementation
A. Static Search Implementation
In the static part, we trained and used 3 of the 4 algorithms for classification: Naïve Bayes, Random Forest, and Logistic Regression.
Step 1: First, we extracted features from the already pre-processed dataset. These features are Bag-of-Words, TF-IDF features, and n-grams.
Step 2: Next, we built the classifiers for fake news detection. The extracted features are fed into the different classifiers. We used the Naïve Bayes, Logistic Regression, and Random Forest classifiers from sklearn. Each of the extracted features was used in all of the classifiers.
Step 3: After fitting the models, we compared the F1 scores and checked the confusion matrices.
Step 4: After fitting all the classifiers, the 2 best-performing models were selected as candidate models for fake news classification.
Step 5: We performed parameter tuning by applying GridSearchCV to these candidate models and chose the best-performing parameters for these classifiers.
Step 6: The finally selected model was used for fake news detection, outputting a probability of truth.
Step 7: Our final, best-performing classifier was Logistic Regression, which was then saved to disk. It is used to classify the fake news.
The system takes a news article as input from the user; the model then produces the final classification, which is shown to the user along with the probability of truth. A condensed sketch of this pipeline follows.
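The steps above can be condensed into a short scikit-learn sketch. The dataset file, column names, and parameter grid below are assumptions made for illustration; only the overall pipeline (TF-IDF features, Logistic Regression, GridSearchCV tuning, model saved to disk) follows the report.

import joblib
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical pre-processed dataset with "text" and "label" (1 = true) columns
df = pd.read_csv("news_train.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # Step 1: feature extraction
    ("clf", LogisticRegression(max_iter=1000)),      # Step 2: classifier
])

# Step 5: parameter tuning with GridSearchCV (grid values are illustrative)
grid = GridSearchCV(pipe, {"clf__C": [0.1, 1.0, 10.0]}, scoring="f1", cv=5)
grid.fit(X_train, y_train)

# Steps 6-7: save the selected model and report a probability of truth
joblib.dump(grid.best_estimator_, "fake_news_model.pkl")
proba = grid.best_estimator_.predict_proba(["Some news article text"])[0, 1]
print("Probability of truth: {:.0%}".format(proba))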
For the first search field, we used Natural Language Processing to come up with a proper solution to the problem; hence, we attempted to create a model which can classify fake news according to the terms used in newspaper articles. Our application uses NLP techniques like CountVectorization and TF-IDF Vectorization before passing the text through a Passive Aggressive Classifier to output the authenticity of an article as a percentage probability.
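A minimal sketch of this route is below. Note that scikit-learn's PassiveAggressiveClassifier does not expose predict_proba, so the signed decision-function score is printed instead; the toy training data and any mapping of the score to a percentage are assumptions here.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

texts = ["a sample genuine headline", "a sample fabricated headline"]  # toy data
labels = [1, 0]                                                        # 1 = real

vec = TfidfVectorizer()
clf = PassiveAggressiveClassifier(max_iter=50, random_state=42)
clf.fit(vec.fit_transform(texts), labels)

# Signed confidence score: positive leans real, negative leans fake
score = clf.decision_function(vec.transform(["another headline"]))[0]
print("decision score:", score)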
The second search field of the site asks for specific keywords to be searched on the net, upon which it outputs the percentage probability of that term actually being present in an article, or of a similar article existing with those keyword references in it.
The third search field of the site accepts a specific website domain name, upon which the implementation looks for the site in our true sites database or our blacklisted sites database. The true sites database holds the domain names that regularly provide proper and authentic news, and the blacklisted database holds the opposite. If the site isn't found in either database, the implementation doesn't classify the domain; it simply states that the news aggregator does not exist. A simple sketch of this lookup follows.
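In the sketch below, plain Python sets stand in for the true-sites and blacklisted-sites databases, and all domain names are illustrative assumptions.

TRUE_SITES = {"reuters.com", "apnews.com"}     # regularly authentic sources
BLACKLISTED_SITES = {"examplefakenews.com"}    # known fake news sources

def check_domain(domain):
    # Classify only if the domain appears in one of the two databases
    if domain in TRUE_SITES:
        return "Known authentic news source"
    if domain in BLACKLISTED_SITES:
        return "Known fake news source"
    return "News aggregator does not exist in our databases"

print(check_domain("reuters.com"))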
Working-
The problem can be broken down into 3 statements-
1) Use NLP to check the authenticity of a news article.
2) If the user has a query about the authenticity of a search term, he/she can directly search on our platform, and using our custom algorithm we output a confidence score.
3) Check the authenticity of a news source.
These sections have been produced as search fields to take inputs in 3 different forms in our
implementation of the problem statement.
Figure 4.4 : Analysing fake and real news from the dataset.
CHAPTER-5
CONCLUSION
In the 21st century, the majority of tasks are done online. Newspapers that were earlier preferred as hard copies are now being substituted by applications like Facebook and Twitter and by news articles read online; WhatsApp forwards are also a major source. The growing problem of fake news only makes things more complicated and tries to change or hamper people's opinion of and attitude towards the use of digital technology. When a person is deceived by fake news, they may start believing that their assumptions about a particular topic are true. Thus, in order to curb this phenomenon, we have developed our Fake News Detection system, which takes input from the user and classifies it as true or fake. To implement this, various NLP and Machine Learning techniques have been used.
The model is trained using an appropriate dataset, and performance evaluation is also done using various performance measures. The best model, i.e. the model with the highest accuracy, is used to classify the news headlines or articles. As shown above for static search, our best model came out to be Logistic Regression, with an accuracy of 65%. We then used grid search parameter optimization to increase the performance of Logistic Regression, which gave us an accuracy of 75%. Hence we can say that if a user feeds a particular news article or its headline into our model, there is a 75% chance that it will be classified according to its true nature.
We intend to build our own dataset, which will be kept up to date with the latest news. All live news and latest data will be kept in a database using a web crawler and an online database.
BIBLIOGRAPHY
[1] Kai Shu, Amy Sliva, Suhang Wang, Jiliang Tang, and Huan Liu, "Fake News Detection on Social Media: A Data Mining Perspective," arXiv:1708.01967v3 [cs.SI], 3 Sep. 2017.
[2] M. Granik and V. Mesyura, "Fake news detection using naive Bayes classifier," 2017 IEEE First Ukraine Conference on Electrical and Computer Engineering (UKRCON), Kiev, 2017, pp. 900-903.
[3] Fake news websites. (n.d.). Wikipedia. [Online]. Available: https://en.wikipedia.org/wiki/Fake_news_website. Accessed Feb. 6, 2017.
[4] Cade Metz. (2016, Dec. 16). The bittersweet sweepstakes to build an AI that destroys fake news.
Websites referred
www.google.com
www.w3schools.com
www.youtube.com
www.freecode.com
APPENDIX
Appendix A: Abbreviations
IDE: An integrated development environment (IDE) is software for building applications that
combines common developer tools into a single graphical user interface (GUI).
CSS: Cascading Style Sheets, fondly referred to as CSS, is a simply designed language
intended to simplify the process of making web pages presentable. CSS allows you to apply
styles to web pages. More importantly, CSS enables you to do this independent of the HTML
that makes up each web page.
HTML: HTML stands for HyperText Markup Language. It is used to design the front-end portion of web pages using a markup language. HTML is the combination of hypertext and markup language: hypertext defines the links between web pages, and the markup language is used to define the text within tags, which define the structure of web pages.
JS: JavaScript is a well-known scripting language used to make sites interactive for the user. It is used for everything from enhancing the functionality of a website to running games and web-based software.
TF-IDF: TF-IDF stands for Term Frequency-Inverse Document Frequency. It can be defined as the calculation of how relevant a word in a series or corpus is to a text. The weight increases proportionally to the number of times a word appears in the text but is offset by the word's frequency across the corpus (data set). A standard formulation is given below.
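The report gives no explicit formula, but under the common definitions the weight described above can be written as:

\mathrm{tfidf}(t, d) = \underbrace{\frac{f_{t,d}}{\sum_{t'} f_{t',d}}}_{\text{term frequency}} \times \underbrace{\log \frac{N}{|\{d' : t \in d'\}|}}_{\text{inverse document frequency}}

where f_{t,d} is the number of occurrences of term t in document d and N is the total number of documents in the corpus.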
NLP: Natural Language Processing (NLP) is a subfield of Artificial Intelligence (AI). It is a widely used technology for personal assistants across various business fields. The technology works on the speech provided by the user, breaking it down for proper understanding and processing it accordingly. It is a recent and effective approach, and consequently in very high demand in today's market. Natural Language Processing is a growing field in which many transitions, such as compatibility with smart devices and interactive conversations with humans, have already been made possible.
4.7 Services Offered
These are the services offered by the company Tequed Labs, among which I opted for Artificial Intelligence and Machine Learning, one of the most popular domains in use nowadays in every field.