
VIJAYA VITTALA INSTITUTE OF TECHNOLOGY
Department of Computer Science and Engineering

“Text Summarization based on NLP”


Submitted in partial fulfilment of the 8th semester requirements for the award of the degree of

BACHELOR OF ENGINEERING
In
COMPUTER SCIENCE
Submitted by

HARSHA SAW
1VJ16CS024
HARSHITHA S
1VJ15CS025

Internship carried out at ViswaJothi Technologies Private Limited
CONTENTS

• About the Company


• Work of the Company
• Introduction
• Need for Text Summarization
• Types of Text Summarization
• Literature Survey
• Problem Statement
• Technologies
• Software and Hardware requirements
• Architecture Diagram
• Data Flow Diagram
• Modules
• Advantages and Applications
• References
About the Company
ViswaJothi Technologies Private Limited (VJT) is a mix of advisors, business analysts, consultants, and technology specialists who share a common passion for technology solutions and business, and believe in revolutionizing the industrial technology business.

ViswaJothi Technologies offers services in the following areas:

• Enterprise Application Services

• Web designing and Development

• Mobile Application Development

• Internet of Things

• Training Services

Work of the Company
• IoT (Internet of Things)

• Artificial Intelligence

• Machine learning

• Data Science

• Web Development

• Mobile Application Development

• Python Application Development

• Core Java

Introduction
• Text summarization refers to the technique of shortening long pieces of text.

• The intention is to create a coherent and fluent summary having only the main
points outlined in the document.
• Automatic text summarization aims to transform lengthy documents into
shortened versions, something which could be difficult and costly to undertake if
done manually.
• Machine learning algorithms can be trained to comprehend documents and
identify the sections that convey important facts and information before
producing the required summarized texts. 

Need for Text Summarization
• With the present explosion of data circulating in the digital space, most of which is unstructured textual data, there is a need for automatic text summarization tools that let people extract insights from it easily.
• Currently, we enjoy quick access to enormous amounts of information. However, much of this information is redundant, insignificant, and may not convey the intended meaning.
• Therefore, using automatic text summarizers capable of extracting useful information while leaving out inessential and insignificant data is becoming vital.
• Implementing summarization can enhance the readability of documents, reduce the time spent searching for information, and allow more information to be fitted into a given space.

Types of Text Summarization

• Broadly, there are two approaches to summarizing texts in NLP: extraction and abstraction.

Extraction-based Summarization

• In extraction-based summarization, a subset of words that represent the most important points is pulled from a piece of text and combined to make a summary.
•In machine learning, extractive summarization usually involves weighing the
essential sections of sentences and using the results to generate summaries.
• Different types of algorithms and methods can be used to gauge the weights of the sentences and then rank them according to their relevance and similarity with one another, and further join them to generate a summary.
•Here's an example:
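A minimal sketch of the idea (the three-sentence document and the simple word-frequency scoring below are only illustrative, not the project's actual code):

    # Toy extractive example: score sentences by word frequency and keep the best one.
    document = ("Automatic summarization shortens long documents. "
                "Summarization keeps the sentences that carry the main points. "
                "The weather was pleasant yesterday.")

    sentences = [s.strip() for s in document.split(".") if s.strip()]
    words = document.lower().replace(".", "").split()
    freq = {w: words.count(w) for w in set(words)}

    # A sentence's score is the sum of the frequencies of its words.
    def score(sentence):
        return sum(freq.get(w, 0) for w in sentence.lower().split())

    # The "summary" is a sentence pulled verbatim from the source text.
    print(max(sentences, key=score))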

Abstraction-based Summarization
•In abstraction-based summarization, advanced deep learning techniques
are applied to paraphrase and shorten the original document, just like
humans do.
•Since abstractive machine learning algorithms can generate new phrases
and sentences that represent the most important information from the
source text, they can assist in overcoming the grammatical inaccuracies of
the extraction techniques.

• The abstraction technique entails paraphrasing and shortening parts of the
source document. 
• Abstraction performs better than extraction. However, the text
summarization algorithms required to do abstraction are more difficult to
develop; that’s why the use of extraction is still popular.
• Abstractive summarization methods aim at producing important material in
a new way.
• Here's an example:
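As an illustrative sketch only (the Hugging Face transformers library used here is an assumption and is not part of the project's stated requirements), an off-the-shelf abstractive model can be called as follows:

    # Hedged sketch: abstractive summarization with a pretrained sequence-to-sequence
    # model. The model paraphrases the input rather than copying sentences verbatim.
    from transformers import pipeline

    summarizer = pipeline("summarization")  # downloads a default pretrained model
    article = ("Automatic text summarization aims to transform lengthy documents "
               "into shortened versions, something which could be difficult and "
               "costly to undertake if done manually.")
    result = summarizer(article, max_length=30, min_length=10, do_sample=False)
    print(result[0]["summary_text"])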

Literature Survey
• Peter W. Eklund, "A Survey on Text Summarization using Optimization Algorithm"
  Advantages: considers relevance, redundancy, and summary length; applicable to both single- and multi-document summarization tasks.
  Disadvantages: the genetic algorithm consumes more computational time for generating the summaries and suffers from a local convergence problem.
• Ani Nenkova, "A Survey of Text Summarization Techniques"
  Advantages: sentence scoring or summary selection strategies alter the overall performance of the summarizer.
  Disadvantages: the text is represented by a diverse set of possible indicators of importance, which do not accomplish topic identification.

Problem Statement
• This project describes a system for the summarization of single and multiple
documents.
• The system produces multi as well as single document summaries using
clustering techniques for identifying common terms across the set of
documents.
• The aim is to auto summarize documents.
• This approach utilizes adaptive, incremental learning that evolves its structure and functionality. It proposes the use of part-of-speech disambiguation and is capable of dealing with sequential data.
• The project aims to give the client an application that runs on the client side and produces a summary of the input document as per his/her requirement.
Technologies
Latent Semantic Analysis (LSA)
• LSA is a technique for analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
• LSA assumes that words that are close in meaning will occur in similar pieces of text.
•  A matrix containing word counts per document is constructed from a large
piece of text and a mathematical technique called Singular Value
Decomposition (SVD) is used to reduce the number of rows while preserving
the similarity structure among columns.
• Documents are then compared by taking the cosine of the angle formed by any
two columns.
• Values close to 1 represent very similar documents while values close to 0
represent very dissimilar documents.
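A minimal LSA sketch (assuming scikit-learn and numpy are available; the three short documents are made up, and scikit-learn builds the count matrix as documents x words, i.e. transposed relative to the description above, without changing the document-to-document comparison):

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "text summarization shortens long documents",
        "summarization keeps the important sentences of documents",
        "the weather was pleasant yesterday",
    ]
    counts = CountVectorizer().fit_transform(docs)                  # word counts per document
    concepts = TruncatedSVD(n_components=2).fit_transform(counts)   # SVD-reduced concept space

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(concepts[0], concepts[1]))  # related documents -> closer to 1
    print(cosine(concepts[0], concepts[2]))  # unrelated documents -> closer to 0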
Text Rank Algorithm
• TextRank is a graph-based ranking model for text processing that can be used to find the most relevant sentences in a text and to extract keywords.
• To find the most relevant sentences, a graph is constructed whose vertices represent the sentences of a document and whose edges are based on content overlap, namely the number of words that two sentences have in common.

• Based on this network of sentences, the sentences are fed into the PageRank algorithm, which identifies the most important ones. To extract a summary of the text, we then take only the most important sentences.

• The algorithm basically computes weights between sentences by looking at which words overlap.
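A minimal TextRank-style sketch (assuming nltk with its punkt data and the networkx library are available; the text and variable names are illustrative):

    import networkx as nx
    from nltk.tokenize import sent_tokenize, word_tokenize

    text = ("Text summarization shortens long documents. "
            "Summarization keeps the important sentences of the documents. "
            "The weather was pleasant yesterday.")

    # Build a graph whose vertices are sentences and whose edge weights are
    # the number of words two sentences have in common.
    sentences = sent_tokenize(text)
    graph = nx.Graph()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            overlap = len(set(word_tokenize(sentences[i].lower())) &
                          set(word_tokenize(sentences[j].lower())))
            if overlap:
                graph.add_edge(i, j, weight=overlap)

    # PageRank scores each sentence; the top-ranked ones form the summary.
    scores = nx.pagerank(graph, weight="weight")
    print(sentences[max(scores, key=scores.get)])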

Software and Hardware Requirements

SOFTWARE REQUIREMENTS
1) Python 3.6 or higher
2) Jupyter Notebook / Google Colaboratory
3) Anaconda Navigator

HARDWARE REQUIREMENTS
1) Processor - Pentium 4 or Core i3
2) RAM - 4.00 GB
3) System type - 64-bit operating system
Architecture Diagram

Data Flow Diagram

Modules
 Preparing Data
• First, let’s split the paragraph into its corresponding sentences. A simple way of doing the conversion is to extract a sentence whenever a period appears.
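A minimal sketch of this naive period-based split (the paragraph is made up):

    paragraph = "Text summarization shortens documents. It keeps the main points."
    # Start a new sentence whenever a period appears.
    sentences = [s.strip() + "." for s in paragraph.split(".") if s.strip()]
    print(sentences)
    # ['Text summarization shortens documents.', 'It keeps the main points.']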

 Preprocessing Data
• To ensure the scraped textual data is as noise-free as possible, we’ll perform some basic text cleaning.
• To assist with the processing, we’ll import a list of stopwords from the nltk library, as shown in the sketch after these preprocessing steps.

• We’ll also import PorterStemmer, which is an algorithm for reducing words into their root forms. For example, cleaning and cleaned are reduced to the root clean.
• Furthermore, we’ll create a dictionary table having the frequency of
occurrence of each of the words in the text. 
• Then, we’ll check whether each word is present in the frequency_table. If the word is already in the dictionary, its value is incremented by 1; otherwise, if the word is seen for the first time, its value is set to 1.
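A minimal sketch of these preprocessing steps (assuming nltk is installed with its stopwords and punkt data downloaded; the sample text and variable names are this sketch's own):

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer
    from nltk.tokenize import word_tokenize

    text = "Text summarization shortens documents. It keeps the main points."
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    # Build a frequency table of word stems, skipping stopwords and punctuation.
    frequency_table = {}
    for word in word_tokenize(text.lower()):
        stem = stemmer.stem(word)
        if stem in stop_words or not stem.isalnum():
            continue
        # Increment the count if the stem was seen before, otherwise set it to 1.
        if stem in frequency_table:
            frequency_table[stem] += 1
        else:
            frequency_table[stem] = 1
    print(frequency_table)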

 Tokenization
• To split the article content into a set of sentences, we’ll use the built-in
method from the nltk library.
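A minimal sketch, continuing with the same illustrative text string:

    from nltk.tokenize import sent_tokenize

    text = "Text summarization shortens documents. It keeps the main points."
    # nltk's built-in sentence tokenizer (requires the "punkt" data, downloaded once).
    sentences = sent_tokenize(text)
    print(sentences)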

 Evaluating the Weighted Frequency of Words
• Next we need to find the weighted frequency of occurrences of all the
words.
• We can find the weighted frequency of each word by dividing its frequency by the frequency of the most occurring word.
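Continuing the sketch above (the frequency_table dictionary comes from the preprocessing snippet; the names are illustrative):

    # Normalize the counts: divide each count by the count of the most frequent stem.
    max_frequency = max(frequency_table.values())
    weighted_frequency = {word: count / max_frequency
                          for word, count in frequency_table.items()}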


 Substitute Words with the Weighted Frequencies
• The next step is to substitute the weighted frequency in place of the corresponding words in the original sentences and find their sum.
• It is important to mention that the weighted frequency of the words removed during preprocessing (stop words, punctuation, digits, etc.) is zero and therefore does not need to be added.
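Continuing the sketch (sentences, stemmer, and weighted_frequency come from the earlier snippets):

    from nltk.tokenize import word_tokenize

    # Score each sentence as the sum of the weighted frequencies of its stems;
    # words removed during preprocessing simply contribute zero.
    sentence_scores = {}
    for sentence in sentences:
        sentence_scores[sentence] = sum(
            weighted_frequency.get(stemmer.stem(w), 0)
            for w in word_tokenize(sentence.lower()))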

 Summarization
• The final step is to sort the sentences by their scores and select the ones above a threshold.
• Using the average sentence score as the threshold, we avoid selecting sentences that score lower than the average.
• The sentences with the highest scores summarize the text.
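Continuing the sketch, with the average sentence score used as the threshold:

    # Keep the sentences whose score is at or above the average, in their original order.
    average = sum(sentence_scores.values()) / len(sentence_scores)
    summary = " ".join(s for s in sentences if sentence_scores[s] >= average)
    print(summary)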

Advantages and Applications
Advantages
• Works instantly: reading an entire article, breaking it up, and separating the important ideas from the original text takes time and effort; a summarizer does this in seconds.
• Improves quality: some software summarizes not only documents but also web pages; this greatly improves productivity as it quickens the surfing process.
• Does not miss important facts.


Applications
• Entity timelines
• Storylines of events
• Sentence compression
• Event understanding
• Summarization of user-generated content
References
[1] Alfonseca, E., Rodriguez, P. Description of the UAM system for generating very short summaries at DUC-2003. In HLT/NAACL Workshop on Text Summarization / DUC 2003, 2003.

[2] Angheluta, R., Moens, M. F., De Busser, R. K.U.Leuven Summarization System. In DUC 2007, Edmonton, Alberta, Canada, 2007.

[3] Aone, C., Okurowski, M. E., Gorlinsky, J., et al. A Scalable Summarization System Using Robust NLP. In Proceedings of the ACL'97/EACL'97 Workshop on Intelligent Scalable Text Summarization, pages 66-73, 1997.

[4] Barzilay, R., Elhadad, M. Using Lexical Chains for Text Summarization. In Inderjeet Mani and Mark Maybury, editors, Advances in Automatic Text Summarization. MIT Press, 1999.

