NLP-Multi Document Summarization
MULTI DOCUMENT
SUMMARIZATION
BY- ARYAN SHARMA
RAJA PANDEY
RANJEEV SINGH
INTRODUCTION
Large amounts of information are available online, making it very challenging to identify
relevant information quickly and efficiently.
The aim is to develop a multi-document summarization model integrating extractive
and abstractive techniques.
After obtaining relevant documents, sort them using named entity recognition (NER)
and combine them on the basis of similarity (a sketch of this grouping idea follows below).
https://medium.com/analytics-vidhya/text-summarization-using-nlp-3e85ad0c6349
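To make the NER-and-similarity step concrete, here is a minimal sketch of grouping retrieved documents by the named entities they mention, using spaCy's small English model. The grouping rule and the example documents are illustrative assumptions, not the project's actual pipeline.

```python
# Sketch: group retrieved documents by shared named entities so related
# documents can be combined and summarized together.
# Assumes spaCy and its small English model are installed
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy
from collections import defaultdict

nlp = spacy.load("en_core_web_sm")

def group_by_entities(documents):
    """Map each named entity to the indices of documents mentioning it."""
    groups = defaultdict(set)
    for idx, text in enumerate(documents):
        doc = nlp(text)
        for ent in doc.ents:
            groups[ent.text.lower()].add(idx)
    return groups

# Hypothetical example documents, purely for illustration.
docs = [
    "Apple released a new iPhone in California.",
    "The iPhone launch event was held by Apple.",
    "NASA announced a new Mars mission.",
]
for entity, doc_ids in group_by_entities(docs).items():
    print(entity, sorted(doc_ids))
```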
EXTRACTIVE SUMMARIZATION
Extracts and combines the most relevant sentences or phrases from the original text
without altering their structure or content.
Key Characteristics:
1. Directly quotes text from the source material.
2. Relies on sentence selection algorithms.
3. Maintains grammatical accuracy.
Methods:
1. TextRank Algorithm: A graph-based method ranking sentences
based on their similarity.
2. Cosine similarity is used to identify sentence relevance (a short sketch follows below).
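A minimal sketch of the cosine-similarity step, using scikit-learn TF-IDF vectors in place of word embeddings (an assumption made to keep the example self-contained); any sentence vectors plug in the same way.

```python
# Sketch: score pairwise sentence relevance with cosine similarity.
# TF-IDF vectors stand in for word embeddings here purely for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = [
    "Dogs are loyal animals.",
    "They are great companions for humans.",
    "Cats, on the other hand, are independent but affectionate.",
]

vectors = TfidfVectorizer().fit_transform(sentences)
similarity_matrix = cosine_similarity(vectors)   # shape: (3, 3)
print(similarity_matrix.round(2))
```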
ABSTRACTIVE SUMMARIZATION
Generates new sentences that paraphrase the source text rather than quoting it directly.
Key Characteristics:
1. Can include new words or phrases not present in the source text.
2. Often results in summaries that are more concise and coherent.
Methods:
1. Transformer models like BERT, T5, and GPT-3.
2. Involves training with sequence-to-sequence learning.
Use Case: Suitable for creative or interpretive summarization (e.g., research papers, blogs).
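As a quick illustration of the transformer-based approach, the Hugging Face pipeline API can run a pre-trained sequence-to-sequence summarizer in a few lines; the generation settings below are illustrative choices, not the project's configuration.

```python
# Sketch: abstractive summarization with a pre-trained seq2seq transformer.
# Requires the transformers library; the first call downloads model weights.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Dogs are loyal animals. They are great companions for humans. "
    "Cats, on the other hand, are independent but affectionate."
)
# max_length / min_length are token limits on the generated summary.
result = summarizer(text, max_length=30, min_length=5, do_sample=False)
print(result[0]["summary_text"])
```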
COMPARISON
Original Text: "Dogs are loyal animals. They are great companions for humans. Cats, on
the other hand, are independent but affectionate."
Extractive (directly selects key sentences): "Dogs are loyal animals. Cats are independent but affectionate."
Abstractive (paraphrases and generates new text): "Dogs are loyal companions, while cats are independent yet loving pets."
We are focusing on extractive summarization because it directly selects key sentences
from the original text, ensuring accuracy and preserving important details.
In extractive summarization, the goal is to identify and select the most relevant
sentences from a document.
Not every sentence contributes equally to the overall meaning or key ideas of the text.
Some sentences may be redundant or less critical, while others are more central to the
core message.
WE SHOULD BECOME FAMILIAR WITH...
PAGERANK ALGORITHM
Graph Representation: Pages are nodes, and links between pages are directed edges.
PageRank Score: the score of a page w is PR(w) = (1 - d)/N + d * Σ PR(v)/C(v), where the sum runs over pages v linking to w, N is the total number of pages, and C(v) is the number of outbound links on v.
Transition Matrix: an N x N matrix whose (i, j) entry is the probability of moving from page j to page i, i.e. 1/C(j) if j links to i and 0 otherwise.
Damping Factor: the probability d (typically 0.85) of following a link rather than jumping to a random page.
Suppose we have 4 web pages: w1, w2, w3, and w4.
These pages contain links pointing to one another.
Some pages might have no outgoing links; these are called dangling pages.
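A minimal sketch of PageRank by power iteration over four pages w1-w4; the link structure is made up for illustration, and the dangling page's rank is spread uniformly across all pages.

```python
# Sketch: PageRank via power iteration for four pages w1..w4.
# The link structure below is hypothetical; w4 is a dangling page.
import numpy as np

pages = ["w1", "w2", "w3", "w4"]
links = {                       # page -> pages it links to
    "w1": ["w2", "w3"],
    "w2": ["w3"],
    "w3": ["w1"],
    "w4": [],                   # dangling page: no outgoing links
}

n = len(pages)
d = 0.85                        # damping factor
M = np.zeros((n, n))            # transition matrix: M[j, i] = P(i -> j)
for i, page in enumerate(pages):
    outlinks = links[page]
    if outlinks:
        for target in outlinks:
            M[pages.index(target), i] = 1.0 / len(outlinks)
    else:
        M[:, i] = 1.0 / n       # dangling page: jump anywhere with equal probability

rank = np.full(n, 1.0 / n)      # start with uniform scores
for _ in range(50):             # power iteration
    rank = (1 - d) / n + d * M @ rank

print(dict(zip(pages, rank.round(3))))
```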
Steps:
1. Text Preprocessing:
Concatenate all the text from the documents.
Split the text into individual sentences.
2. Convert each sentence into a vector using word embeddings (e.g., Word2Vec, GloVe).
3. Calculate Similarities:
Compute similarities between sentence vectors (e.g., using cosine similarity).
Store the similarity values in a matrix.
4. Construct a graph with sentences as nodes and similarity scores as edge weights.
5. Rank sentences by running PageRank on this graph.
6. Generate Summary: Select the top-ranked sentences to form the final summary (a code sketch of these steps follows below).
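A compact sketch of steps 1-6, assuming TF-IDF sentence vectors in place of Word2Vec/GloVe embeddings and networkx's PageRank for the ranking step, to keep the example self-contained.

```python
# Sketch: extractive summarization with TextRank.
# Steps: split sentences -> vectorize -> similarity matrix -> graph ->
# PageRank -> pick top sentences. TF-IDF replaces Word2Vec/GloVe here
# only to keep the example dependency-light.
import networkx as nx
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)

def textrank_summary(text, num_sentences=2):
    # 1. Text preprocessing: split the text into sentences
    sentences = nltk.sent_tokenize(text)
    # 2. Convert each sentence into a vector
    vectors = TfidfVectorizer().fit_transform(sentences)
    # 3. Pairwise similarities between sentence vectors
    sim_matrix = cosine_similarity(vectors)
    # 4. Construct graph (sentences = nodes, similarities = edge weights)
    graph = nx.from_numpy_array(sim_matrix)
    # 5. Rank sentences with PageRank
    scores = nx.pagerank(graph)
    # 6. Pick the top-ranked sentences, kept in original order
    top = sorted(scores, key=scores.get, reverse=True)[:num_sentences]
    return " ".join(sentences[i] for i in sorted(top))

text = ("Cats are secretly plotting to rule the world. "
        "Dogs are just happy to get belly rubs. "
        "I think my pet hamster knows more than I do.")
print(textrank_summary(text))
```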
1. Sentence Splitting
Sentence 1: Cats are secretly plotting to rule the world.
Sentence 2: Dogs are just happy to get belly rubs.
Sentence 3: I think my pet hamster knows more than I do.
2. Word Embeddings
Sentence 1: [0.5, 0.1, 0.3, 0.0, 0.1]
Sentence 2: [0.1, 0.6, 0.2, 0.1, 0.0]
Sentence 3: [0.3, 0.2, 0.5, 0.0, 0.0]
6. Summary Generation
Cats are secretly plotting to rule the world.
I think my pet hamster knows more than I do.
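As a sanity check on the example, this tiny sketch computes the cosine similarities between the three sentence vectors above; these values would fill the similarity matrix used as edge weights in steps 3-4 (sentences 1 and 3 come out as the most similar pair).

```python
# Sketch: cosine similarities for the example sentence vectors above.
import numpy as np

vectors = np.array([
    [0.5, 0.1, 0.3, 0.0, 0.1],   # Sentence 1
    [0.1, 0.6, 0.2, 0.1, 0.0],   # Sentence 2
    [0.3, 0.2, 0.5, 0.0, 0.0],   # Sentence 3
])

# cos(a, b) = (a . b) / (|a| * |b|), computed for every pair at once
norms = np.linalg.norm(vectors, axis=1)
similarity = (vectors @ vectors.T) / np.outer(norms, norms)
print(similarity.round(2))
```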
OVERVIEW
https://www.analyticsvidhya.com/blog/2018/11/introduction-text-summarization-textrank-python/#h-pagerank-algorithm
DEMONSTRATION
BART-BASED MODEL
1. Implementation:
Uses a BART-based model from Hugging Face, integrated with Fast.ai through a BaseModelWrapper.
Model Used: facebook/bart-large-cnn, a pre-trained model from Hugging Face designed for
abstractive summarization.
2. Training Setup:
Fine-tuning BART: The facebook/bart-large-cnn model is fine-tuned on specific datasets for improved
summarization performance in various contexts (news articles, research papers, etc.).
Optimized for CNN/DailyMail-style Summarization: Built to handle long-form text summarization, making it
particularly effective for summarizing complex and lengthy documents.
3. Optimization:
16-bit Precision: Improves memory efficiency and accelerates training, crucial for large-scale models like
facebook/bart-large-cnn.
Staged Training: Initially freezes the base layers of the model to focus training on the later layers for
faster convergence.
BART-BASED MODEL
https://www.mdpi.com/2079-9292/12/16/3456
DEMONSTRATION
THANK YOU