NLP-Multi Document Summarization
MULTI DOCUMENT
SUMMARIZATION
BY- ARYAN SHARMA
RAJA PANDEY
RANJEEV SINGH
INTRODUCTION
Large amounts of information are available online, making it very challenging to identify
relevant information quickly and efficiently.
The aim is to develop a multi-document summarization model integrating extractive
and abstractive techniques.
After obtaining relevant documents, sort them using named entity recognition (NER)
and combine them on the basis of similarity (a sketch of this grouping idea follows below).
https://medium.com/analytics-vidhya/text-summarization-using-nlp-3e85ad0c6349
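To make the NER-and-similarity step concrete, here is a minimal sketch of grouping retrieved documents by the named entities they mention, using spaCy's small English model. The grouping rule and the example documents are illustrative assumptions, not the project's actual pipeline.

```python
# Sketch: group retrieved documents by shared named entities so related
# documents can be combined and summarized together.
# Assumes spaCy and its small English model are installed
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def group_by_entities(documents):
    """Map each named entity to the indices of documents mentioning it."""
    groups = defaultdict(set)
    for idx, text in enumerate(documents):
        doc = nlp(text)
        for ent in doc.ents:
            groups[ent.text.lower()].add(idx)
    return groups

# Hypothetical example documents, purely for illustration.
docs = [
    "Apple released a new iPhone in California.",
    "The iPhone launch event was held by Apple.",
    "NASA announced a new Mars mission.",
]
for entity, doc_ids in group_by_entities(docs).items():
    print(entity, sorted(doc_ids))
```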
EXTRACTIVE SUMMARIZATION
Extracts and combines the most relevant sentences or phrases from the original text
without altering their structure or content.
Key Characteristics:
1. Directly quotes text from the source material.
2. Relies on sentence selection algorithms.
3. Maintains grammatical accuracy.
Methods:
1. TextRank Algorithm: A graph-based method ranking sentences
based on their similarity.
2. Cosine similarity is used to identify sentence relevance (a short sketch follows below).
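A minimal sketch of the cosine-similarity step, using scikit-learn TF-IDF vectors in place of word embeddings (an assumption made to keep the example self-contained); any sentence vectors plug in the same way.

```python
# Sketch: score pairwise sentence relevance with cosine similarity.
# TF-IDF vectors stand in for word embeddings here purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Dogs are loyal animals.",
    "They are great companions for humans.",
    "Cats, on the other hand, are independent but affectionate.",
]

vectors = TfidfVectorizer().fit_transform(sentences)
similarity_matrix = cosine_similarity(vectors)   # shape: (3, 3)
print(similarity_matrix.round(2))
```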
ABSTRACTIVE SUMMARIZATION
Generates new sentences that paraphrase the source text rather than quoting it directly.
Key Characteristics:
1. Can include new words or phrases not present in the source text.
2. Often results in summaries that are more concise and coherent.
Methods:
1. Transformer models like BERT, T5, and GPT-3.
2. Involves training with sequence-to-sequence learning.
Use Case: Suitable for creative or interpretive summarization (e.g., research papers, blogs).
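As a quick illustration of the transformer-based approach, the Hugging Face pipeline API can run a pre-trained sequence-to-sequence summarizer in a few lines; the generation settings below are illustrative choices, not the project's configuration.

```python
# Sketch: abstractive summarization with a pre-trained seq2seq transformer.
# Requires the transformers library; the first call downloads model weights.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Dogs are loyal animals. They are great companions for humans. "
    "Cats, on the other hand, are independent but affectionate."
)
# max_length / min_length are token limits on the generated summary.
result = summarizer(text, max_length=30, min_length=5, do_sample=False)
print(result[0]["summary_text"])
```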
COMPARISON
Original Text: "Dogs are loyal animals. They are great companions for humans. Cats, on
the other hand, are independent but affectionate."
Extractive (directly selects key sentences): "Dogs are loyal animals. Cats are independent but affectionate."
Abstractive (paraphrases and generates new text): "Dogs are loyal companions, while cats are independent yet loving pets."
We are focusing on extractive summarization because it directly selects key sentences
from the original text, ensuring accuracy and preserving important details.
In extractive summarization, the goal is to identify and select the most relevant
sentences from a document.
Not every sentence contributes equally to the overall meaning or key ideas of the text.
Some sentences may be redundant or less critical, while others are more central to the
core message.
WE SHOULD BECOME FAMILIAR WITH...
PAGERANK ALGORITHM
Graph Representation: Pages are nodes, and links between pages are directed edges.
PageRank Score: the score of a page w is PR(w) = (1 - d)/N + d * Σ PR(v)/C(v), where the sum runs over pages v linking to w, N is the total number of pages, and C(v) is the number of outbound links on v.
Transition Matrix: an N x N matrix whose (i, j) entry is the probability of moving from page j to page i, i.e. 1/C(j) if j links to i and 0 otherwise.
Damping Factor: the probability d (typically 0.85) of following a link rather than jumping to a random page.
Suppose we have 4 web pages: w1, w2, w3, and w4.
These pages contain links pointing to one another.
Some pages might have no outgoing links; these are called dangling pages.
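A minimal sketch of PageRank by power iteration over four pages w1-w4; the link structure is made up for illustration, and the dangling page's rank is spread uniformly across all pages.

```python
# Sketch: PageRank via power iteration for four pages w1..w4.
# The link structure below is hypothetical; w4 is a dangling page.
import numpy as np

pages = ["w1", "w2", "w3", "w4"]
links = {                       # page -> pages it links to
    "w1": ["w2", "w3"],
    "w2": ["w3"],
    "w3": ["w1"],
    "w4": [],                   # dangling page: no outgoing links
}

n = len(pages)
d = 0.85                        # damping factor
M = np.zeros((n, n))            # transition matrix: M[j, i] = P(i -> j)
for i, page in enumerate(pages):
    outlinks = links[page]
    if outlinks:
        for target in outlinks:
            M[pages.index(target), i] = 1.0 / len(outlinks)
    else:
        M[:, i] = 1.0 / n       # dangling page: jump anywhere with equal probability

rank = np.full(n, 1.0 / n)      # start with uniform scores
for _ in range(50):             # power iteration
    rank = (1 - d) / n + d * M @ rank

print(dict(zip(pages, rank.round(3))))
```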
Steps:
1. Text Preprocessing:
Concatenate all the text from the documents.
Split the text into individual sentences.
2. Convert each sentence into a vector using word embeddings (e.g., Word2Vec, GloVe).
3. Calculate Similarities:
Compute similarities between sentence vectors (e.g., using cosine similarity).
Store the similarity values in a matrix.
4. Construct a graph with sentences as nodes and similarity scores as edge weights.
5. Rank sentences by running PageRank on this graph.
6. Generate Summary: Select the top-ranked sentences to form the final summary (a code sketch of these steps follows below).
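A compact sketch of steps 1-6, assuming TF-IDF sentence vectors in place of Word2Vec/GloVe embeddings and networkx's PageRank for the ranking step, to keep the example self-contained.

```python
# Sketch: extractive summarization with TextRank.
# Steps: split sentences -> vectorize -> similarity matrix -> graph ->
# PageRank -> pick top sentences. TF-IDF replaces Word2Vec/GloVe here
# only to keep the example dependency-light.
import networkx as nx
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)

def textrank_summary(text, num_sentences=2):
    # 1. Text preprocessing: split the text into sentences
    sentences = nltk.sent_tokenize(text)
    # 2. Convert each sentence into a vector
    vectors = TfidfVectorizer().fit_transform(sentences)
    # 3. Pairwise similarities between sentence vectors
    sim_matrix = cosine_similarity(vectors)
    # 4. Construct graph (sentences = nodes, similarities = edge weights)
    graph = nx.from_numpy_array(sim_matrix)
    # 5. Rank sentences with PageRank
    scores = nx.pagerank(graph)
    # 6. Pick the top-ranked sentences, kept in original order
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))

text = ("Cats are secretly plotting to rule the world. "
        "Dogs are just happy to get belly rubs. "
        "I think my pet hamster knows more than I do.")
print(textrank_summary(text))
```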
1. Sentence Splitting
Sentence 1: Cats are secretly plotting to rule the world.
Sentence 2: Dogs are just happy to get belly rubs.
Sentence 3: I think my pet hamster knows more than I do.
2. Word Embeddings
Sentence 1: [0.5, 0.1, 0.3, 0.0, 0.1]
Sentence 2: [0.1, 0.6, 0.2, 0.1, 0.0]
Sentence 3: [0.3, 0.2, 0.5, 0.0, 0.0]
6. Summary Generation
Cats are secretly plotting to rule the world.
I think my pet hamster knows more than I do.
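As a sanity check on the example, this tiny sketch computes the cosine similarities between the three sentence vectors above; these values would fill the similarity matrix used as edge weights in steps 3-4 (sentences 1 and 3 come out as the most similar pair).

```python
# Sketch: cosine similarities for the example sentence vectors above.
import numpy as np

vectors = np.array([
    [0.5, 0.1, 0.3, 0.0, 0.1],   # Sentence 1
    [0.1, 0.6, 0.2, 0.1, 0.0],   # Sentence 2
    [0.3, 0.2, 0.5, 0.0, 0.0],   # Sentence 3
])

# cos(a, b) = (a . b) / (|a| * |b|), computed for every pair at once
norms = np.linalg.norm(vectors, axis=1)
similarity = (vectors @ vectors.T) / np.outer(norms, norms)
print(similarity.round(2))
```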
OVERVIEW
https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/#h-pagerank-algorithm
DEMONSTRATION
BART-BASED MODEL
1. Implementation:
Uses a BART-based model from Hugging Face, integrated with Fast.ai through a BaseModelWrapper.
Model Used: facebook/bart-large-cnn, a pre-trained model from Hugging Face designed for
abstractive summarization.
2. Training Setup:
Fine-tuning BART: The facebook/bart-large-cnn model is fine-tuned on specific datasets for improved
summarization performance in various contexts (news articles, research papers, etc.).
Optimized for CNN/DailyMail-style Summarization: Built to handle long-form text summarization, making it
particularly effective for summarizing complex and lengthy documents.
3. Optimization:
16-bit Precision: Improves memory efficiency and accelerates training, crucial for large-scale models like
facebook/bart-large-cnn.
Staged Training: Initially freezes the base layers of the model to focus training on the later layers for
faster convergence.
BART-BASED MODEL
https://www.mdpi.com/2079-9292/12/16/3456
DEMONSTRATION
THANK YOU