Technical Seminar Report-6607
Bachelor of Technology
In
Department of CSE (Artificial Intelligence & Machine Learning)
SUBMITTED BY:
GOLLAPALLY BHAVANA
20S11A6607
MRITS, CSE(AI&ML) 1
TEXT SUMMARIZATION
ABSTRACT
The volume of data on the Internet has grown exponentially over the past decade. Thus, there
is a need for a solution that transforms this vast raw information into useful information
that a human can understand. One common research technique that helps in dealing with
enormous amounts of data is text summarization. Automatic text summarization condenses a
long document into a shorter form while preserving its information content and overall
meaning. It is a potential solution to information overload. Several automatic summarizers
in the literature are capable of producing high-quality summaries, but they do not focus on
preserving the underlying meaning and semantics of the text. We capture and preserve the
semantics of the text as the fundamental feature for summarizing a document. We propose an
automatic summarizer that uses a distributional semantic model to capture semantics and
produce high-quality summaries.
CONTENTS
❖ INTRODUCTION
❖ RESULTS
❖ CONCLUSION
❖ REFERENCES
INTRODUCTION
Text analytics helps in extracting insights from text. Text processing and automatic text
summarization are the key tasks in this area. Text summarization identifies and extracts key
sentences from the input document to produce an automatic summary of it, and hence is a
potential solution to the information overload problem. Moreover, text summaries reduce the
length of the input documents without compromising their overall meaning and information
content. Hence, text summarization is a data reduction process that lets the user consume
information quickly.
This project highlights the importance of machine learning. We introduce an algorithm that
ranks words based on their frequency of occurrence and summarizes a large amount of input
text into meaningful data.
Over the past ten years, there has been an exponential increase in the quantity of data on the
Internet. Therefore, a system that converts this massive raw data into relevant data that the
human brain can comprehend is required.
Automatic text summarization has been a subject of research since the 1950s. "The Automatic
Creation of Literature Abstracts," a research paper by Hans Peter Luhn published in 1958,
used features such as word frequency to pick keywords from the text for summarizing.
Another notable study, conducted by Harold P. Edmundson in the late 1960s, extracted relevant
sentences for the summary using features such as the presence of cue words, title terms
occurring in the text, and sentence location. Many important works on the topic of automatic
text summarization have been published since then.
The two most common types of text summarization are extractive and abstractive.
Extractive Summarization: This approach works by extracting components from a text, such as
phrases and sentences, and then stacking them together to generate a summary. As a result,
identifying the right sentences for the summary is critical in an extractive approach.
Abstractive Summarization: This approach generates new sentences that paraphrase the source
text instead of copying sentences from it.
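As a rough illustration of extractive selection, the sketch below scores each sentence by the summed frequency of its words and keeps the top-scoring sentences in their original order. It is a minimal frequency-based example, not the summarizer proposed in this report; the stop-word list, the regex sentence splitter, and the function names are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "for", "on", "that"}


def tokenize(text):
    """Lowercase the text and return its alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_summary(text, num_sentences=2):
    """Return the highest-scoring sentences, scored by summed word frequency."""
    # Naive sentence split (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)
    scores = [
        sum(freq[w] for w in tokenize(s) if w not in STOP_WORDS)
        for s in sentences
    ]
    # Take the indices of the best sentences, then restore document order.
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:num_sentences]
    return [sentences[i] for i in sorted(top)]


example_text = "Cats chase mice. Cats like cats and mice. Dogs bark."
example_summary = frequency_summary(example_text, num_sentences=1)
```

Summed scores favour long sentences, so a practical system would usually normalize by sentence length or use a graph-based ranking instead.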
TextRank builds on the PageRank algorithm, which was designed to rank web pages. The
PageRank algorithm outputs a probability distribution that represents the likelihood that a
user clicking on links at random will arrive at a given page. PageRank can be computed for a
collection of documents of any size. Several research papers assume that, at the start of the
computation, the distribution is divided evenly among all documents in the collection. The
computation requires several passes over the collection, called "iterations," to adjust the
approximate PageRank values so that they more closely reflect the theoretical true value.
Assume a small universe of four web pages: A, B, C, and D. Links from a page to itself are
ignored, and multiple outbound links from one page to another are treated as a single link.
PageRank is initialized to the same value for all pages. In the original version of PageRank,
the sum of PageRank over all pages equaled the total number of pages on the web at the time,
so each page in this example would start with a value of 1. Here, however, PageRank is
treated as a probability distribution between 0 and 1, so each page starts with a value of
0.25. On the next iteration, the PageRank transferred from a page to the targets of its
outbound links is divided equally among those links. If the only links in the system were
from pages B, C, and D to A, each link would transfer 0.25 PageRank to A on the next
iteration, for a total of 0.75.
Suppose instead that page B links to pages C and A, page C links to page A, and page D links
to all three other pages. On the first iteration, page B would transfer half of its current
value, 0.125, to page A and the other half, 0.125, to page C. Page C would transfer its whole
current value, 0.25, to the only page it links to, A. Because page D has three outbound
links, it would transfer one third of its current value, roughly 0.083, to A. At the end of
this iteration, page A will have a PageRank of around 0.458.
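The four-page example can be checked with a short sketch. The code below performs one undamped PageRank iteration in which every page splits its current value evenly over its outbound links. The function name and the dictionary layout are illustrative assumptions, and the damping factor used by the full algorithm is left out to match the simplified example.

```python
def pagerank_step(ranks, links):
    """One undamped PageRank iteration.

    Each page divides its current rank equally among its outbound links.
    Pages without outbound links pass nothing on in this simplified sketch.
    """
    new_ranks = {page: 0.0 for page in ranks}
    for page, targets in links.items():
        if not targets:
            continue
        share = ranks[page] / len(targets)
        for target in targets:
            new_ranks[target] += share
    return new_ranks


# Link structure from the example: B -> A, C; C -> A; D -> A, B, C.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
ranks = {page: 0.25 for page in links}  # probability-style initial values

after_one = pagerank_step(ranks, links)
# Page A receives 0.125 (from B) + 0.25 (from C) + 0.25 / 3 (from D).
```

Repeating the step, usually with a damping factor, moves the values toward their stable ranking.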
In the TextRank algorithm, sentences take the place of web pages. The similarity between any
two sentences plays the role of the web-page transition probability. The similarity scores
are kept in a square matrix that resembles the one used for PageRank. TextRank is an
unsupervised extractive text summarization approach. The flow of the TextRank algorithm is
shown in Fig 1.
Fig 1. Flow of the TextRank algorithm
1. A vector representation is found for each sentence.
2. The similarity between sentence vectors is then calculated and stored in a matrix.
3. The similarity matrix is converted into a graph with sentences as vertices and
similarity scores as edges, and this graph is used to determine the rank of each sentence.
4. The top-ranked sentences form the summary.
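The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this report: bag-of-words count vectors stand in for the sentence vectors, cosine similarity is the similarity measure, and the damping value of 0.85 and the function names are choices made for the example.

```python
import math
import re
from collections import Counter


def sentence_vector(sentence):
    """Bag-of-words count vector (assumption: a stand-in for word embeddings)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(sentences):
    """Square matrix of pairwise sentence similarities with a zero diagonal."""
    vectors = [sentence_vector(s) for s in sentences]
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
            for i in range(n)]


def textrank_scores(sim, damping=0.85, iterations=50):
    """Weighted PageRank over the sentence graph using power iteration."""
    n = len(sim)
    scores = [1.0 / n] * n
    out_weight = [sum(row) for row in sim]  # total edge weight leaving each sentence
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(sim[j][i] / out_weight[j] * scores[j]
                           for j in range(n) if out_weight[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


sentences = [
    "The cat sat on the mat.",
    "The cat lay on the mat.",
    "Stock prices fell sharply today.",
]
sim = similarity_matrix(sentences)
scores = textrank_scores(sim)
# Sentences ordered from most to least central.
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

The third sentence shares no words with the other two, so it has no edges in the graph and ends up with the lowest score.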
The TextRank algorithm is used here to create a clean and succinct summary from a collection
of scraped articles. Note that this is a single-domain, multi-document summarization task,
which means a variety of articles is taken as input and a single bullet-point summary is
produced.
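1. Import the required libraries.
import re
import nltk
import numpy as np
import pandas as pd
# Download the Punkt models used by nltk's sentence tokenizer (needed once).
nltk.download('punkt')
2. Read the data.
# Load the scraped articles into a DataFrame.
df = pd.read_csv("tennis_articles_v4.csv")
3. Inspect the data.
# Show the first few rows to check the columns.
df.head()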
4. Split text into sentences. The text must now be broken down into individual sentences.
To do so, we use the sent_tokenize() function from the nltk library.
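A minimal sketch of this step is shown below. It calls nltk's sent_tokenize() when the library and its Punkt data are available and otherwise falls back to a naive regular-expression split. The fallback, the helper name split_sentences, and the sample strings are assumptions added for illustration and are not taken from the dataset.

```python
import re


def split_sentences(text):
    """Split text into sentences, preferring nltk's sent_tokenize when available."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        # Fallback (assumption): naive split after ., ! or ? followed by whitespace.
        return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


# Hypothetical article texts; in the project these would come from the CSV file.
articles = [
    "TextRank ranks sentences. It is unsupervised!",
    "Summaries save time.",
]

# Flatten the per-article sentence lists into one list of sentences.
all_sentences = [sentence for text in articles for sentence in split_sentences(text)]
```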
RESULTS
➢ Table 1. An example of preprocessing.
Before preprocessing: Tatkala perjalanan matahari telah melampaui pertengahan dan telah
menuju kepada terbenamnya dinamai ashr asar.
After preprocessing: 'tatkala', 'jalan', 'matahari', 'lampau', 'tengah', 'benam', 'nama',
'ashr', 'asar'.
➢ Table 2. An example of cosine similarity results.
Document        Cosine similarity score
67-2-10.txt     0.7132
67-2-7.txt      0.6969
67-2-5.txt      0.4472
67-2-6.txt      0.4472
67-2-9.txt      0.3481
67-2-4.txt      0.3333
67-2-3.txt      0.3015
2-132-2.txt     0.2887
67-2-8.txt      0.2673
75-34-9.txt     0.2425
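A ranking like the one in Table 2 can be produced by comparing a query vector with each document vector. The sketch below is illustrative only: it uses raw term counts, and the document names and token lists are hypothetical rather than the files or scores reported above.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def rank_documents(query_tokens, documents):
    """Return (name, score) pairs sorted by descending similarity to the query."""
    query = Counter(query_tokens)
    scored = [
        (name, round(cosine_similarity(query, Counter(tokens)), 4))
        for name, tokens in documents.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical preprocessed documents (token lists after stemming).
documents = {
    "doc-a.txt": ["matahari", "benam", "asar"],
    "doc-b.txt": ["jalan", "matahari", "tengah", "lampau", "nama"],
    "doc-c.txt": ["jalan"],
}
ranking = rank_documents(["matahari", "benam"], documents)
```

Real systems often weight terms with TF-IDF before computing the cosine, which changes the scores but not the procedure.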
CONCLUSION
With the growth of the Internet, a vast quantity of information is now available, and its
rapid increase has led to information overload. Summarizing large quantities of text manually
is extremely difficult for humans, so automated summarization systems are in high demand.
This difficulty can be handled if there are reliable text summarizers that provide a document
summary for the user's convenience. As a result, a system must be developed that allows a
user to quickly obtain a summary of a document.
REFERENCES
1. Egyptian Informatics Journal, "Extractive text summarization using modified PageRank
algorithm,"
https://www.sciencedirect.com/science/article/pii/S1110866519301355
2. Analytics Vidhya, "An Introduction to Text Summarization using the TextRank Algorithm
(with Python implementation),"
https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/