Viswajothi Technologies Private Limited: "Text Summarization Based On NLP"
Department of Computer Science and Engineering
Submitted by
Internship Carried Out At
Dept. of CSE,VVIT 1
• Internet of Things
• Training Services
Work of the Company
• IoT (Internet of Things)
• Artificial Intelligence
• Machine learning
• Data Science
• Web Development
• Core Java
• Text summarization refers to the technique of shortening long pieces of text.
• The intention is to create a coherent and fluent summary having only the main
points outlined in the document.
• Automatic text summarization aims to transform lengthy documents into
shortened versions, something which could be difficult and costly to undertake if
done manually.
• Machine learning algorithms can be trained to comprehend documents and
identify the sections that convey important facts and information before
producing the required summarized texts.
Need for Text Summarization
• With the present explosion of data circulating in the digital space, most of it
unstructured text, there is a need to develop automatic text summarization tools that
allow people to get insights from it easily.
• Currently, we enjoy quick access to enormous amounts of information. However, most
of this information is redundant or insignificant and may not convey the intended meaning.
• Therefore, automatic text summarizers that extract useful information and leave out
inessential and insignificant data are becoming vital.
• Implementing summarization can enhance the readability of documents, reduce the time
spent searching for information, and allow more information to fit in a given space.
Types of Text Summarization
Extraction-based Summarization
Abstraction-based Summarization
• In abstraction-based summarization, advanced deep learning techniques
are applied to paraphrase and shorten the original document, just like
humans do.
• Since abstractive machine learning algorithms can generate new phrases
and sentences that represent the most important information from the
source text, they can assist in overcoming the grammatical inaccuracies of
the extraction techniques.
• The abstraction technique entails paraphrasing and shortening parts of the
source document.
• Abstraction can perform better than extraction. However, the text
summarization algorithms required for abstraction are more difficult to
develop, which is why extraction is still popular.
• Abstractive summarization methods aim at producing important material in
a new way.
• Here's an example:
Literature Survey
Authors: Peter W. Eklund
Algorithms: A Survey on Text Summarization using Optimization Algorithm
Advantages:
• Relevance
• Redundancy
• Length
• Applicable for both single and multi-document summarization tasks.
Disadvantages:
• Genetic algorithm consumes more computational time for generating the summaries.
• Genetic algorithm suffers from local convergence problem.
Authors: Ani Nenkova
Algorithms: A Survey of Text Summarization Techniques
Advantages: Sentence scoring or summary selection strategies alter the overall performance of the summarizer.
Disadvantages: The text is represented by a diverse set of possible indicators of importance which do not find the topic identification.
Problem Statement
• This project describes a system for the summarization of single and multiple
documents.
• The system produces multi-document as well as single-document summaries, using
clustering techniques to identify common terms across the set of documents.
• The aim is to summarize documents automatically.
• The approach uses adaptive, incremental learning that evolves its
structure and functionality. It proposes the use of part-of-speech
disambiguation and is capable of dealing with sequential data.
• The project aims to give the client an application that runs on the
client side and returns a summary of the input document according to the
client's requirements.
Latent Semantic Analysis (LSA)
• It is a technique in natural language processing for analyzing relationships between
a set of documents and the terms they contain by producing a set of concepts related
to the documents and terms.
• LSA assumes that words that are close in meaning will occur in similar pieces of
text.
• A matrix containing word counts per document is constructed from a large
piece of text and a mathematical technique called Singular Value
Decomposition (SVD) is used to reduce the number of rows while preserving
the similarity structure among columns.
• Documents are then compared by taking the cosine of the angle formed by any
two columns.
• Values close to 1 represent very similar documents while values close to 0
represent very dissimilar documents.
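The LSA steps above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: it assumes NumPy is available, and the toy documents and the choice of k = 2 concepts are made up for the example.

```python
import numpy as np

# Toy corpus (illustrative assumption, not data from the project).
docs = [
    "the cat sat on the mat",
    "a cat lay on a mat",
    "stock markets fell sharply as investors sold shares",
]

# Term-document count matrix: one row per word, one column per document.
vocab = sorted({word for doc in docs for word in doc.split()})
counts = np.array(
    [[doc.split().count(word) for doc in docs] for word in vocab],
    dtype=float,
)

# SVD factors the matrix. Keeping only the top k singular values replaces the
# word rows with k "concept" rows; each column still stands for one document.
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
k = 2  # number of concepts kept (assumed value)
doc_vectors = np.diag(S[:k]) @ Vt[:k, :]  # shape: (k, number of documents)

def cosine(a, b):
    """Cosine of the angle between two document columns."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_cats = cosine(doc_vectors[:, 0], doc_vectors[:, 1])   # two cat sentences
sim_mixed = cosine(doc_vectors[:, 0], doc_vectors[:, 2])  # cat vs. finance
print(round(sim_cats, 3), round(sim_mixed, 3))
```

In the reduced space the two cat sentences come out close to 1 and the cat and finance sentences close to 0, which matches the interpretation of the cosine values given above.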
TextRank Algorithm
• It is a graph-based ranking model for text processing that can be used to find the
most relevant sentences in a text and also to find keywords.
• To find the most relevant sentences, a graph is constructed in which each vertex
represents a sentence in the document and the edges between sentences are based on
content overlap, namely the number of words that two sentences have in common.
• This network of sentences is then fed into the PageRank algorithm, which identifies
the most important sentences. To extract a summary of the text, we take only the
most important sentences.
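A minimal sketch of this idea, using only the standard library. It assumes raw word-overlap counts as edge weights (as described above, without the length normalization that some TextRank variants add) and a conventional damping factor of 0.85; the function name and the sample sentences are hypothetical.

```python
def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences with PageRank over a word-overlap graph."""
    words = [set(s.lower().split()) for s in sentences]
    n = len(sentences)
    # Edge weight = number of words the two sentences have in common.
    weight = [
        [len(words[i] & words[j]) if i != j else 0 for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in weight]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge between j and i.
            incoming = sum(
                weight[j][i] / out_sum[j] * scores[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

sentences = [
    "Text summarization shortens long documents",
    "Extractive summarization selects important sentences from documents",
    "Graph ranking scores sentences in documents",
    "Bananas are yellow",
]
scores = textrank(sentences)
top = max(range(len(sentences)), key=scores.__getitem__)
print(sentences[top])
```

The second sentence shares the most words with the others, so it receives the highest score, while the unrelated last sentence receives the lowest.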
Software and Hardware Requirements
1) Python 3.6 or higher
3) Anaconda Navigator
4) Processor - Pentium 4 or Core i3
5) RAM - 4.00 GB
Architecture Diagram
Data Flow Diagram
Preparing Data
• First, let’s split the paragraph into its corresponding sentences. The best
way of doing the conversion is to extract a sentence whenever a period appears.
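A minimal sketch of period-based splitting, using only the standard library. The function name and the sample paragraph are hypothetical, and the rule is deliberately naive: abbreviations such as "Dr." would be split incorrectly.

```python
import re

def split_sentences(paragraph):
    """Split after every period that is followed by whitespace."""
    parts = re.split(r"(?<=\.)\s+", paragraph.strip())
    return [part for part in parts if part]

paragraph = "Summaries save time. They keep the main points. Readers benefit."
sentences = split_sentences(paragraph)
print(sentences)
```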
Preprocessing Data
• To ensure the scraped textual data is as noise-free as possible, we’ll
perform some basic text cleaning.
• To assist us to do the processing, we’ll import a list of stopwords from
the nltk library.
• We’ll also import PorterStemmer, which is an algorithm for reducing
words into their root forms. For example, cleaning, cleaned,
and cleaner can be reduced to the root clean.
• Furthermore, we’ll create a dictionary table having the frequency of
occurrence of each of the words in the text.
• Then, we’ll check if each word is present in the frequency_table. If the
word is already in the dictionary, its count is incremented by 1.
Otherwise, if the word is seen for the first time, its count is set to 1.
• To split the article content into a set of sentences, we’ll use the built-in
method from the nltk library.
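The frequency-table step above can be sketched as follows. To keep the example runnable without downloads, it uses only the standard library: STOP_WORDS is a small hand-written stand-in for nltk's stopword list, and stemming is left out. With nltk installed, `nltk.corpus.stopwords.words("english")` and `nltk.stem.PorterStemmer().stem(word)` would take their place.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words("english").
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to", "in", "it"}

def build_frequency_table(text):
    """Count how often each non-stopword occurs in the text."""
    frequency_table = {}
    for word in re.findall(r"[a-z']+", text.lower()):
        if word in STOP_WORDS:
            continue  # skip noise words
        if word in frequency_table:
            frequency_table[word] += 1  # seen before: increment the count
        else:
            frequency_table[word] = 1   # first occurrence
    return frequency_table

table = build_frequency_table("The summary of the text is a short text.")
print(table)
```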
Evaluating the Weighted Frequency of Words
• Next, we need to find the weighted frequency of occurrences of all the words.
• We can find the weighted frequency of each word by dividing its
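The divisor is cut off in the text above, so the sketch below rests on an assumption: it divides each count by the count of the most frequent word, which is a common normalization for this kind of summarizer. The function name and input table are hypothetical.

```python
def weighted_frequencies(frequency_table):
    """Divide every count by the highest count (assumed normalization)."""
    max_count = max(frequency_table.values())
    return {word: count / max_count for word, count in frequency_table.items()}

weights = weighted_frequencies({"summary": 1, "text": 2, "short": 1})
print(weights)
```

With this choice the most frequent word gets a weight of 1 and every other word a weight between 0 and 1.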
• The final step is to sort the sentences by their score, the sum of the
weighted frequencies of their words.
• Using the average sentence score as a threshold, we can avoid selecting
sentences with a lower score than the average score.
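A minimal sketch of this final step. It assumes a sentence's score is the sum of the weighted frequencies of its words and that the threshold is the average sentence score; the function name, weights, and sentences are hypothetical.

```python
def summarize(sentences, weights):
    """Return the sentences scoring above average, highest score first."""
    scores = {}
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        # Words missing from the table (e.g. stopwords) contribute nothing.
        scores[sentence] = sum(weights.get(word, 0.0) for word in words)
    average = sum(scores.values()) / len(scores)
    ranked = sorted(sentences, key=lambda s: scores[s], reverse=True)
    return [s for s in ranked if scores[s] > average]

weights = {"text": 1.0, "summary": 0.5, "short": 0.5}
sentences = ["The summary is a short text.", "Text summary.", "It rains."]
selected = summarize(sentences, weights)
print(selected)
```

The sentence with no weighted words falls below the average and is left out of the summary.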
Advantages and Applications
• Works instantly: reading an entire article, breaking it down, and separating the
important ideas from the original text takes time and effort.
• Improves quality: some software summarizes not only documents but also web pages.
• This greatly improves productivity because it speeds up the browsing process.