NLP: Multi-Document Summarization



MULTI-DOCUMENT SUMMARIZATION
By: Aryan Sharma, Raja Pandey, Ranjeev Singh
INTRODUCTION
Large amounts of information are available online, making it very challenging to identify relevant information quickly and efficiently.
The aim is to develop a multi-document summarization model integrating extractive and abstractive techniques.
After obtaining relevant documents, sort them using named entity recognition (NER) and group them on the basis of similarity (see the sketch at the end of this slide).

State-of-the-art models such as BERT and GPT-2 are studied for abstractive summarization.
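As an illustration of the NER-based sorting step above, a minimal sketch in Python (spaCy is an assumed library choice; the slides do not name one):

import spacy

# Minimal sketch of grouping documents by named entities.
# spaCy is an assumed choice; install the model via:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

docs = [
    "Apple unveiled a new iPhone in California.",
    "Tim Cook discussed Apple's quarterly earnings.",
]

for text in docs:
    entities = {(ent.text, ent.label_) for ent in nlp(text).ents}
    print(entities)  # documents that share entities (e.g., "Apple") can be grouped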
TYPES OF SUMMARIZATION

https://medium.com/analytics-vidhya/text-summarization-using-nlp-3e85ad0c6349
EXTRACTIVE SUMMARIZATION
Extracts and combines the most relevant sentences or phrases from the original text
without altering their structure or content.

Key Characteristics:
1. Directly quotes text from the source material.
2. Relies on sentence selection algorithms.
3. Maintains grammatical accuracy.

Methods:
1. TextRank algorithm: a graph-based method that ranks sentences based on their similarity.
2. Cosine similarity is used for identifying sentence relevance (see the formula sketch below).

Use Case: Ideal for summarizing factual or structured content (e.g., news articles).
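For reference, the cosine similarity between two sentence vectors a and b (standard definition, not shown on the slide):

\cos(\mathbf{a}, \mathbf{b}) = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}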
ABSTRACTIVE SUMMARIZATION
Generates a condensed version by paraphrasing the original text using advanced natural
language generation techniques.

Key Characteristics:
1. Can include new words or phrases not present in the source text.
2. Often results in summaries that are more concise and coherent.

Methods:
1. Transformer models like BERT, T5, and GPT-3.
2. Involves training with sequence-to-sequence learning.

Use Case: Suitable for creative or interpretive summarization (e.g., research papers, blogs).
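A minimal abstractive-summarization sketch using the Hugging Face transformers pipeline; the model choice here is illustrative, not the slides' exact setup:

from transformers import pipeline

# Load a pre-trained summarization model (illustrative choice)
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = ("Dogs are loyal animals. They are great companions for humans. "
        "Cats, on the other hand, are independent but affectionate.")

# Generate a short abstractive summary
result = summarizer(text, max_length=40, min_length=5, do_sample=False)
print(result[0]["summary_text"])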
COMPARISON
Original Text: "Dogs are loyal animals. They are great companions for humans. Cats, on the other hand, are independent but affectionate."

Extractive (directly selects key sentences): "Dogs are loyal animals. Cats are independent but affectionate."

Abstractive (paraphrases and generates new text): "Dogs are loyal companions, while cats are independent yet loving pets."
We are focusing on extractive summarization because it directly selects key sentences
from the original text, ensuring accuracy and preserving important details.

The Core Challenge:

In extractive summarization, the goal is to identify and select the most relevant
sentences from a document.
Not every sentence contributes equally to the overall meaning or key ideas of the text.
Some sentences may be redundant or less critical, while others are more central to the
core message.

To rank sentences in extractive summarization, we use algorithms like TextRank to identify the most important ones.

BEFORE GETTING STARTED WITH THE TEXTRANK ALGORITHM, THERE'S ANOTHER ALGORITHM WE SHOULD BECOME FAMILIAR WITH: THE PAGERANK ALGORITHM
PAGERANK ALGORITHM
Graph Representation:
Pages are nodes, and links between pages are directed edges.

PageRank Score:
Represents the probability of a user visiting a page, based on incoming links.

Transition Matrix:
Each entry gives the probability of moving from one page to another: 1/(number of outgoing links) for each outgoing link, 0 otherwise.

Damping Factor:
The probability (typically 0.85) that a user keeps following links rather than jumping to a random page.
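The slide's formula appears to have been a figure that did not survive extraction; for reference, the standard PageRank update over N pages with damping factor d, where In(p_i) is the set of pages linking to p_i and L(p_j) is the number of outgoing links on p_j:

PR(p_i) = \frac{1 - d}{N} + d \sum_{p_j \in \mathrm{In}(p_i)} \frac{PR(p_j)}{L(p_j)}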
Suppose we have 4 web pages — w1, w2, w3, and w4.
These pages contain links pointing to one another.

Some pages might have no outgoing links; these are called dangling pages.

In order to rank these pages, we compute a score called the PageRank score. This score is the probability of a user visiting that page.
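A minimal sketch of PageRank by power iteration. The link structure below is an assumption for illustration; the slide's actual w1-w4 graph was a figure and did not survive extraction:

import numpy as np

# Assumed link structure: w1 -> w2, w3;  w2 -> w3;  w3 -> w1;  w4 -> w3
links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
N, d = 4, 0.85  # page count and damping factor

# Column-stochastic transition matrix: M[j, i] = 1/outdegree(i) if i links to j
M = np.zeros((N, N))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)

# Power iteration: PR = (1 - d)/N + d * M @ PR
pr = np.full(N, 1.0 / N)
for _ in range(50):
    pr = (1 - d) / N + d * (M @ pr)

print({f"w{i+1}": round(score, 3) for i, score in enumerate(pr)})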
TEXTRANK ALGORITHM
TextRank is an extractive and unsupervised text summarization technique

Steps:
1. Text Preprocessing:
Concatenate all the text from the documents.
Split the text into individual sentences.

2. Convert each sentence into a vector using word embeddings (e.g., Word2Vec, GloVe).
3. Calculate Similarities:
Compute similarities between sentence vectors (e.g., using cosine similarity).
Store the similarity values in a matrix.

4. Construct Graph

5. Rank Sentences

6. Generate Summary: select the top-ranked sentences to form the final summary (a minimal sketch follows this list).
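A minimal end-to-end sketch of these six steps. TF-IDF vectors stand in for the Word2Vec/GloVe embeddings named above, and NLTK handles sentence splitting; both are assumed choices, not the slides' exact setup:

import nltk
import numpy as np
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def textrank_summary(text, n_sentences=2):
    # Step 1: split the concatenated text into sentences
    sentences = nltk.sent_tokenize(text)  # requires nltk.download("punkt")
    # Step 2: vectorize each sentence (TF-IDF here, in place of Word2Vec/GloVe)
    vectors = TfidfVectorizer().fit_transform(sentences)
    # Step 3: pairwise cosine-similarity matrix; drop self-similarity
    sim = cosine_similarity(vectors)
    np.fill_diagonal(sim, 0.0)
    # Steps 4-5: build a weighted graph and rank sentences with PageRank
    scores = nx.pagerank(nx.from_numpy_array(sim))
    # Step 6: select the top-ranked sentences, restored to document order
    top = sorted(sorted(scores, key=scores.get, reverse=True)[:n_sentences])
    return " ".join(sentences[i] for i in top)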
1. Sentence Splitting:
Sentence 1: Cats are secretly plotting to rule the world.
Sentence 2: Dogs are just happy to get belly rubs.
Sentence 3: I think my pet hamster knows more than I do.

2. Word Embeddings:
Sentence 1: [0.5, 0.1, 0.3, 0.0, 0.1]
Sentence 2: [0.1, 0.6, 0.2, 0.1, 0.0]
Sentence 3: [0.3, 0.2, 0.5, 0.0, 0.0]

3. Similarity Matrix

4. Weighted Graph Construction

5. Sentence Rank Calculation:
S1: 0.6
S2: 0.5
S3: 0.55

6. Summary Generation
Cats are secretly plotting to rule the world.
I think my pet hamster knows more than I do.
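A sketch reproducing steps 3 through 6 from the toy vectors above. The rank values on the slide are illustrative; the exact numbers this code produces differ, but the top two sentences (S1 and S3) match:

import numpy as np
import networkx as nx

# The slide's toy sentence vectors from step 2
V = np.array([[0.5, 0.1, 0.3, 0.0, 0.1],   # S1: cats plotting
              [0.1, 0.6, 0.2, 0.1, 0.0],   # S2: dogs and belly rubs
              [0.3, 0.2, 0.5, 0.0, 0.0]])  # S3: hamster

# Step 3: cosine-similarity matrix (zero the diagonal to avoid self-loops)
norms = np.linalg.norm(V, axis=1)
sim = (V @ V.T) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

# Steps 4-5: weighted graph + PageRank-style sentence scores
scores = nx.pagerank(nx.from_numpy_array(sim))

# Step 6: take the top 2 sentences, restored to document order
top = sorted(sorted(scores, key=scores.get, reverse=True)[:2])
print(top)  # [0, 2] -> sentences 1 and 3, as in the summary above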
OVERVIEW

https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/#h-pagerank-algorithm

DEMONSTRATION
BART-BASED MODEL
1. Implementation:
Uses a BART-based model from Hugging Face, integrated with Fast.ai through a BaseModelWrapper.

Model Used: facebook/bart-large-cnn, a pre-trained BART model from Hugging Face designed for abstractive summarization.

2. Training Setup:
Fine-tuning BART: the facebook/bart-large-cnn model is fine-tuned on specific datasets for improved summarization performance in various contexts (news articles, research papers, etc.).

Optimized for CNN/DailyMail Summarization: built to handle long-form text summarization, making it particularly effective for summarizing complex and lengthy documents.

3. Optimization:
16-bit Precision: improves memory efficiency and accelerates training, crucial for large-scale models like facebook/bart-large-cnn.

Staged Training: initially freezes the base layers of the model to focus training on the later layers for faster convergence.
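A hedged sketch of this setup using the transformers Trainer. The slides describe a Fast.ai BaseModelWrapper; this is an analogous setup, not the authors' exact code, and the dataset and output path are placeholders:

from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "facebook/bart-large-cnn"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Staged training: freeze the encoder so only the later (decoder) layers train first
for param in model.model.encoder.parameters():
    param.requires_grad = False

args = TrainingArguments(
    output_dir="bart-cnn-finetuned",   # hypothetical output path
    fp16=True,                         # 16-bit precision (requires a GPU)
    per_device_train_batch_size=2,
    num_train_epochs=1,
)

# train_ds is a placeholder for a tokenized summarization dataset
# trainer = Trainer(model=model, args=args, train_dataset=train_ds)
# trainer.train()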
BART-BASED MODEL

https://www.mdpi.com/2079-9292/12/16/3456

DEMONSTRATION

THANK YOU
