Technical Seminar Report-6607


TEXT SUMMARIZATION

Technical Seminar Report


On

TEXT SUMMARIZATION USING


TEXTRANK ALGORITHM

In partial fulfilment of requirements for the degree of

Bachelor of Technology
In
Department of CSE (Artificial Intelligence & Machine Learning)

SUBMITTED BY:

GOLLAPALLY BHAVANA

20S11A6607

MRITS, CSE(AI&ML) 1

ABSTRACT

The volume of data on the Internet has grown exponentially over the past decade. Thus, a solution is needed that transforms this vast raw information into useful information a human brain can understand. One common research technique for dealing with such enormous data is text summarization. Automatic text summarization condenses a long document into a shorter form while preserving its information content and overall meaning, making it a potential solution to information overload. Several automatic summarizers in the literature are capable of producing high-quality summaries, but they do not focus on preserving the underlying meaning and semantics of the text. We treat the semantics of the text as the fundamental feature for summarizing a document, and propose an automatic summarizer that uses a distributional semantic model to capture those semantics and produce high-quality summaries.

Nowadays, efficient access to enormous amounts of information has become more difficult due to the rapid growth of the Internet, and managing this vast information requires efficient and effective methods and tools. In this paper, a graph-based text summarization method is described that captures the aboutness of a text document. The method uses a modified TextRank, computed based on the PageRank score defined for each page on the Web. It constructs a graph with sentences as the nodes and the similarity between two sentences as the weight of the edge between them. A modified inverse-sentence-frequency cosine similarity gives different weights to different words in a sentence, whereas traditional cosine similarity treats all words equally. The graph is made sparse and partitioned into clusters, under the assumption that sentences within a cluster are similar to each other while sentences in different clusters are dissimilar. Performance evaluation of the proposed summarization technique shows the effectiveness of the method.

The text summarizer reduces unnecessary information by selecting the important sentences. In multi-document summarization, two or more important sentences may share similar information, and including all of them in the summary produces redundancy. This research aims to remove sentences from multiple documents that share similar information, to obtain a more concise text summary.


CONTENTS
❖ INTRODUCTION

❖ TEXT SUMMARIZATION APPROACHES

❖ UNDERSTANDING THE PAGE RANK ALGORITHM

❖ UNDERSTANDING THE TEXT RANK ALGORITHM

❖ IMPLEMENTATION OF TEXTRANK ALGORITHM

❖ RESULTS

❖ CONCLUSION

❖ REFERENCES


INTRODUCTION

Text analytics helps draw insights from text, and text processing and automatic text summarization are key tasks in this area. Text summarization identifies and extracts key sentences from an input document to produce an automatic summary, and is hence a potential solution to the information-overload problem. Text summaries reduce the length of input documents without compromising their overall meaning and information content. Summarization is therefore a data-reduction process that enables quick consumption by the user.

This project highlights the importance of machine learning. We introduce an algorithm that ranks words based on their occurrence and summarizes a large feed of text into meaningful data.

Over the past ten years, there has been an exponential increase in the quantity of data on the
Internet. Therefore, a system that converts this massive raw data into relevant data that the
human brain can comprehend is required.

Automatic text summarization has been a source of fascination since the 1950s. "The Automatic Creation of Literature Abstracts," a research paper by Hans Peter Luhn published in the late 1950s, used factors like word frequency to pick keywords from the text for summarizing.

Another notable study, conducted by Harold P Edmundson in the late 1960s, extracted relevant
sentences for text summary using approaches such as the existence of cue words, terms from
the title occurring in the text, and sentence location. Many important and intriguing works on
the topic of automatic text summarization have been published since then.


TEXT SUMMARIZATION APPROACHES


The two most common types of text summary are extractive and abstractive summarization.

Extractive Summarization: This approach works by extracting components of the text, such as phrases and sentences, and stacking them together to generate a summary. Finding the right sentences for the summary is therefore critical in an extractive approach.

Abstractive Summarization: This approach employs advanced natural language processing techniques to create a completely new summary; some of the wording in the summary may not appear in the original text.

This study will concentrate on the technique of extractive summarization.


UNDERSTANDING THE PAGE RANK ALGORITHM

The PageRank algorithm produces a probability distribution representing the likelihood that a user randomly clicking on links will arrive at any particular page. PageRank can be computed for collections of documents of any size. At the outset of the computation, some research articles assume the distribution is divided evenly among all documents in the collection. The computation requires multiple passes over the collection, referred to as "iterations," to update the estimated PageRank values so they approximate the theoretical true values more closely.

Assume a small universe of four web pages: A, B, C, and D. Links from a page to itself are disregarded, and multiple outbound links from one page to another single page are treated as a single link. PageRank is initialized to the same value for all pages. In the original version of PageRank, the sum of PageRank across all pages equaled the total number of pages on the web at the time, so in this case each page would start with a value of 1; later versions assume a probability distribution between 0 and 1, giving each page an initial value of 0.25. If the system's only links were from pages B, C, and D to A, each link would transmit 0.25 PageRank to A on the following iteration, for a total of 0.75:

PR(A) = PR(B) + PR(C) + PR(D)

Consider what would happen if page B linked to pages C and A, page C linked only to page A, and page D linked to all three other pages. On the first iteration, page B would transmit half of its current value, or 0.125, to page A, and the other half, or 0.125, to page C. Page C would transfer its whole current value, 0.25, to the only page it links to, A. Because page D has three outbound links, it would transmit one third of its current value, or roughly 0.083, to A. At the end of this cycle, page A will have a PageRank of around 0.458:

PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
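The iteration described above can be sketched in a few lines of Python. This is an illustrative toy rather than a full implementation: the damping factor is omitted, dangling pages simply keep their rank out of circulation, and the link structure is the hypothetical four-page example from the text.

```python
def pagerank_step(pr, links):
    """One simplified PageRank iteration (no damping factor): each page
    splits its current rank evenly across its outbound links."""
    new = {page: 0.0 for page in pr}
    for page, outs in links.items():
        if not outs:
            continue  # dangling page: its rank is not redistributed in this sketch
        share = pr[page] / len(outs)
        for target in outs:
            new[target] += share
    return new

# The example from the text: B links to A and C, C links to A,
# D links to all three other pages; A has no outbound links.
links = {"A": [], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
pr = {page: 0.25 for page in links}  # uniform initial distribution
pr = pagerank_step(pr, links)
print(round(pr["A"], 3))  # 0.458, matching PR(A) = PR(B)/2 + PR(C)/1 + PR(D)/3
```

Running further iterations, with a damping factor added, would converge these values toward the theoretical PageRank distribution.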


UNDERSTANDING THE TEXT RANK ALGORITHM

Instead of pages, the TextRank algorithm works with sentences: the probability of a transition between web pages is replaced by the similarity between two sentences. The similarity scores are stored in a square matrix, analogous to the matrix used in PageRank. TextRank is an unsupervised extractive text summarization approach. The flow of the TextRank algorithm is as follows:

Fig 1: Flow of the TextRank algorithm

1. First, find vector representations for each sentence.

2. The similarities between sentence vectors are then calculated and stored in a matrix.

3. The similarity matrix is converted into a graph, with sentences as vertices and similarity scores as edges, to determine the rank of each sentence.

4. Finally, the summary is assembled from a selection of the top-ranked sentences.
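The steps above can be sketched end to end in plain Python. This is a minimal, dependency-free illustration rather than a full implementation: sentence vectors are simple bag-of-words counts, similarity is cosine similarity, and ranking is a power iteration of the weighted PageRank update.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def textrank(sentences, d=0.85, iters=50):
    """Rank sentences on a graph whose edge weights are pairwise similarities."""
    vecs = [Counter(s.lower().split()) for s in sentences]  # sentence vectors
    n = len(sentences)
    sim = [[cosine(vecs[i], vecs[j]) if i != j else 0.0
            for j in range(n)] for i in range(n)]           # similarity matrix
    scores = [1.0 / n] * n
    for _ in range(iters):                                  # rank on the graph
        scores = [(1 - d) / n + d * sum(
                      sim[j][i] / sum(sim[j]) * scores[j]
                      for j in range(n) if sim[j][i] and sum(sim[j]))
                  for i in range(n)]
    return scores

sents = [
    "TextRank builds a graph of sentences.",
    "Edges of the graph carry similarity scores between sentences.",
    "The weather was pleasant yesterday.",
]
scores = textrank(sents)
summary = sents[scores.index(max(scores))]  # pick the top-ranked sentence
```

The off-topic third sentence shares almost no vocabulary with the others, so it ends up weakly connected in the graph and receives the lowest rank.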


IMPLEMENTATION OF TEXT RANK ALGORITHM

Text Rank algorithm is used to create a clean and succinct summary from a collection of
scraped articles. Please bear in mind that this is a single-domain multiple-documents
summarizing project, which means we'll be using a variety of articles as input and creating a
single bullet-point summary.

Using Jupyter Notebook lets implement the Text Rank algorithm.

1. Import Required Libraries:

First, import the libraries we'll be leveraging for this task.

import numpy as np
import pandas as pd
import nltk
nltk.download('punkt')
import re

2. Read the Data:

Now let's read our dataset.

df = pd.read_csv("tennis_articles_v4.csv")

3. Inspect the Data:

Let's take a quick glance at the data.

df.head()

4. Split Text into Sentences:

The material must now be broken down into individual sentences. To do so, we'll use the nltk library's sent_tokenize() method.

5. Applying the PageRank Algorithm:
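The report leaves this step without code; below is a sketch of how it is commonly done (for example, in the Analytics Vidhya tutorial cited in the references). The similarity matrix sim_mat here is a hypothetical stand-in, hard-coded for three sentences; in the real pipeline it would hold the pairwise similarities of the sentence vectors.

```python
import numpy as np
import networkx as nx

# Hypothetical 3x3 sentence-similarity matrix (symmetric, zero diagonal);
# sentences 0 and 1 are closely related, sentence 2 is an outlier.
sim_mat = np.array([
    [0.0, 0.8, 0.1],
    [0.8, 0.0, 0.3],
    [0.1, 0.3, 0.0],
])

nx_graph = nx.from_numpy_array(sim_mat)  # sentences as nodes, similarities as edge weights
scores = nx.pagerank(nx_graph)           # dict: sentence index -> rank

# Sentence indices sorted by rank, highest first; the top few form the summary.
ranked = sorted(scores, key=scores.get, reverse=True)
```

The top-ranked sentence indices are then mapped back to their sentence strings to produce the bullet-point summary.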


RESULTS

➢ Table 1: An example of preprocessing (Indonesian text; roughly, "When the sun's course has passed its midpoint and heads toward its setting, it is called ashr/asar").

Before preprocessing: Tatkala perjalanan matahari telah melampaui pertengahan dan telah menuju kepada terbenamnya dinamai ashr asar.
After preprocessing: 'tatkala', 'jalan', 'matahari', 'lampau', 'tengah', 'benam', 'nama', 'ashr', 'asar'

➢ Table 2: An example of cosine similarity results.

Document         Cosine similarity
'67-2-10.txt'    0.7132
'67-2-7.txt'     0.6969
'67-2-5.txt'     0.4472
'67-2-6.txt'     0.4472
'67-2-9.txt'     0.3481
'67-2-4.txt'     0.3333
'67-2-3.txt'     0.3015
'2-132-2.txt'    0.2887
'67-2-8.txt'     0.2673
'75-34-9.txt'    0.2425


CONCLUSION

With the advancement of the Internet, a vast quantity of information is now available, and summarizing large quantities of text is extremely difficult for humans. The rapid growth of knowledge and the use of the Internet have produced an information overload, so in this age automated summarization systems are in high demand.

This difficulty can be handled by reliable text summarizers that provide a document summary for the user's convenience. A system is therefore needed that allows a user to quickly access and obtain a summary of a document.

Summarizing a document using extractive or abstractive approaches is one such answer. Extractive text summarization is simpler to construct, and in this work we focused on extractive techniques for automatic text summarization, going through a handful of the more common methods. This report gives an overview of recent developments and advancements in automatic summarization methods, along with the most up-to-date information in the field.


REFERENCES

1. Egyptian Informatics Journal, "Extractive text summarization using modified PageRank algorithm," https://www.sciencedirect.com/science/article/pii/S1110866519301355

2. Analytics Vidhya, "An Introduction to Text Summarization Using the TextRank Algorithm (with Python Implementation)," https://www.analyticsvidhya.com/blog/2018/11/introduction-textsummarization-textrank-python/

3. Data Science in Your Pocket, "Text Summarization Using TextRank," https://medium.com/data-science-in-your-pocket/textsummarization-using-textrank-in-nlp-4bce52c5b390

4. OpenGenus, "TextRank for Text Summarization," https://iq.opengenus.org/textrank-for-text-summarization/

5. ResearchGate, "Graph-Based Text Summarization Using Modified TextRank," https://www.researchgate.net/publication/327136473_Graph-
