NLP 1


SPEECH & NATURAL LANGUAGE PROCESSING

3 MARKS

1. State the common evaluation metrics used to assess the performance of speech
recognition systems.

1. Word Error Rate (WER): Measures the percentage of words incorrectly
recognized.
2. Character Error Rate (CER): Measures the percentage of characters
incorrectly recognized.
3. F-score: The harmonic mean of precision and recall, where precision is
the percentage of correctly recognized words among all words the system
outputs, and recall is the percentage of correctly recognized words among
all words that should have been recognized.
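
In practice, WER is computed as the word-level edit distance (substitutions + deletions + insertions) between the reference transcript and the system output, divided by the number of reference words. A minimal pure-Python sketch (the function name and sample sentences are only illustrative):

def word_error_rate(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33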

2. Define Natural Language Processing (NLP).


Natural Language Processing (NLP)

NLP is like a bridge between computers and humans. It helps computers understand
and process human language in all its forms, including written, spoken, and signed.
By breaking down language into its individual components (words, phrases, etc.),
NLP gives computers the ability to:

• Extract meaning from text


• Summarize information
• Translate languages
• Generate text
• Recognize speech

NLP is essential for building many of the technologies we use today, such as:

• Search engines
• Chatbots
• Machine translation tools
• Speech recognition systems

3. Define Natural Language Understanding (NLU).


Natural Language Understanding (NLU) involves enabling machines to comprehend and
interpret human language in its natural form. It allows computers to:

• Extract meaning from text and speech: Understand the intent behind words,
identify named entities, and analyze sentiment.
• Engage in natural language dialogue: Interact with humans using a
conversational style, responding to questions and providing information.
• Generate human-like text: Produce grammatically correct and meaningful
text for tasks like summarizing, translation, or content creation.

4. Define Natural Language Generation (NLG).


Natural Language Generation (NLG) is a branch of artificial intelligence that enables
computers to generate human-like text. It involves converting structured data or
knowledge into coherent and grammatically correct sentences or paragraphs. NLG
aims to bridge the gap between computers and humans by allowing computers to
produce natural language that can be easily understood and interpreted by humans.

5. State the key components of an NLP system.


Key Components of an NLP System:

1. Natural Language Understanding (NLU): Processes human language (e.g.,
text, speech) and extracts meaning from it.
2. Natural Language Generation (NLG): Converts machine-understandable data
into human-readable text or speech.
3. Knowledge Base: Stores information about the world, such as facts, rules,
and relationships, to enhance language understanding.

6. Identify the challenges faced in natural language processing.


Challenges in Natural Language Processing

Natural Language Processing (NLP) faces several challenges:


1. Ambiguity: Words and phrases can have multiple meanings, making it
difficult for NLP models to accurately interpret language.
2. Context-dependency: The meaning of words and phrases often depends
on the surrounding context. NLP models must consider this context to
make accurate interpretations.
3. Language complexity: Natural language is complex, with a vast vocabulary,
grammar rules, and idiomatic expressions. NLP models must be able to
handle this complexity to effectively process and understand language.

7. State the various levels of language processing tasks in NLP.


Three Levels of Language Processing Tasks:
1. Shallow Processing: Focuses on basic aspects of language, such as
grammar, spelling, and vocabulary. It includes tasks like:
• Part-of-speech tagging: Identifying the type of each word (noun,
verb, etc.) in a sentence.
• Named entity recognition: Detecting and classifying entities in text,
such as names, locations, and organizations.

2. Intermediate Processing: Deals with more complex linguistic phenomena,
including semantics and syntax. It involves tasks such as:
• Text Summarization: Condensing a larger body of text into a concise
summary.
• Machine Translation: Translating text from one language to another.
3. Deep Processing: Aims to understand and generate natural language with
a deeper level of comprehension. It includes tasks like:
• Natural Language Understanding: Identifying the meaning and intent
of a text.
• Natural Language Generation: Creating coherent and meaningful text
from machine-generated data.

8. Define tokenization in text processing.


Tokenization in Text Processing

Tokenization is the process of breaking a text into smaller, meaningful units called
tokens. These units can be words, numbers, punctuation marks, or even smaller
units like characters. Tokenization is the first step in many text processing tasks,
such as:

• Natural Language Understanding (NLU): Tokenization helps NLU models
understand the meaning of text by providing them with the individual
components of a sentence.
• Machine Translation (MT): Tokenization is necessary for MT models to
translate text from one language to another, as it allows them to work
with the individual words and phrases in the text.
• Information Retrieval (IR): Tokenization helps IR systems find relevant
documents by indexing the individual tokens in a document and searching
for them in user queries.

3 Marks Answer in Easy Language:

Tokenization is like cutting a sentence into its building blocks, like words and
punctuation marks. It's the first step in making a computer understand what a piece
of text means.
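
A minimal tokenization sketch using a regular expression (real tokenizers such as those in NLTK or spaCy handle many more edge cases, so this is only illustrative):

import re

def tokenize(text):
    # \w+ matches runs of letters/digits; [^\w\s] matches single punctuation marks
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("NLP helps computers understand text, speech, and more!"))
# ['NLP', 'helps', 'computers', 'understand', 'text', ',', 'speech', ',', 'and', 'more', '!']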

9. Recall the steps involved in named entity recognition (NER).


Steps in Named Entity Recognition (NER):

1. Tokenization:

• Split the text into individual words or tokens.

2. Part-of-Speech (POS) Tagging:

• Label each token with its grammatical category (e.g., noun, verb).

3. Chunking:

• Group related tokens into phrases or chunks (e.g., "New York City").

4. Feature Extraction:

• Identify characteristics of words and chunks (e.g., capitalization, location
in the text).

5. Classification:

• Use machine learning algorithms to classify chunks as named entities
(e.g., person, organization, location).

6. Post-processing:

• Correct any errors, merge overlapping entities, and resolve ambiguities.
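
A hedged sketch of these steps using spaCy, whose pretrained pipeline bundles tokenization, tagging, and statistical entity classification; it assumes spaCy and its en_core_web_sm model are installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama visited New York City in 2015.")
for ent in doc.ents:
    # Typical output: ('Barack Obama', 'PERSON'), ('New York City', 'GPE'), ('2015', 'DATE')
    print(ent.text, ent.label_)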

10. List the importance of hyperparameter tuning on the performance of an NLP model.
Importance of Hyperparameter Tuning for NLP Model Performance
1. Optimized Model Parameters: Hyperparameter tuning adjusts the internal
settings of the model, such as learning rate or batch size, to find the
optimal combination for a specific task and dataset. This fine-tuning
process ensures the model operates efficiently and avoids overfitting or
underfitting.
2. Improved Accuracy and Generalization: By optimizing hyperparameters, the
model can better capture the underlying relationships in the data and
make more accurate predictions. Hyperparameter tuning also helps prevent
the model from overfitting to specific training examples, resulting in better
generalization to new data.
3. Increased Efficiency: Hyperparameter tuning can identify the most suitable
settings for the model's learning process. This can lead to faster training
times, reduced computational resources, and improved efficiency in
deploying the NLP model for real-world applications.
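
A minimal sketch of hyperparameter tuning for a small text classifier, assuming scikit-learn is available; the toy dataset and parameter grid are placeholders chosen only for illustration:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

texts = ["great movie", "awful film", "loved it", "terrible acting", "wonderful plot", "boring story"]
labels = [1, 0, 1, 0, 1, 0]

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LogisticRegression())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "clf__C": [0.1, 1.0, 10.0],               # regularization strength
}
search = GridSearchCV(pipeline, param_grid, cv=2)   # small cv only because the toy data is tiny
search.fit(texts, labels)
print(search.best_params_, search.best_score_)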

11. Predict the impact of dataset bias on the fairness of NLP models.
Dataset bias can significantly impact the fairness of NLP models:
• Biased Training Data:
▪ Models trained on biased data inherit and amplify those biases.
▪ This can lead to unfair predictions and discrimination against certain
groups.

• Underrepresented Groups:
▪ If a dataset lacks data from specific groups, the model may make
unreliable predictions for those groups.
▪ This can result in biased outcomes and perpetuate existing
inequalities.
• Representation in Results:
▪ Biased datasets can skew representation in NLP outputs.
▪ Models may prioritize results that align with the biases present in
the data, rather than providing a balanced view.

12. Discuss the techniques used for feature extraction in speech processing.
Techniques for Feature Extraction in Speech Processing

Feature extraction is crucial in speech processing to represent speech signals in a
way that captures their most important characteristics. Here are three common
techniques:
1. Mel-Frequency Cepstral Coefficients (MFCCs):
• MFCCs imitate the human auditory system by creating a mel-scale
that represents the frequency content of speech.
• They are robust to noise and variations in pitch, making them
highly effective for speech recognition.
2. Linear Predictive Coding (LPC):
• LPC models speech signals as a linear combination of past samples.
• It extracts features that describe the characteristics of the vocal
tract, capturing the formants and other aspects of the speech
waveform.
3. Gammatone Cepstral Coefficients (GTCCs):
• GTCCs are inspired by the gammatone filter bank found in the
human cochlea.
• They provide a more biologically realistic representation of speech,
enhancing the discriminability of speech sounds.
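
A hedged sketch of MFCC extraction with the librosa library (assumed installed); "speech.wav" is a placeholder path:

import librosa

y, sr = librosa.load("speech.wav")                   # waveform and sample rate (resampled to 22,050 Hz by default)
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 coefficients per analysis frame
print(mfccs.shape)                                   # (13, number_of_frames)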

13. Predict the impact of Hidden Markov Model (HMM) work in speech recognition.
Impact of Hidden Markov Models (HMMs) in Speech Recognition:

1. Improved Accuracy: HMMs provide a statistical framework that models the
temporal sequences of speech sounds, allowing for robust recognition even
in noisy conditions.
2. Efficient Representation: HMMs compactly represent the statistical
dependencies between speech units, enabling faster processing and
recognition.
3. Adaptability: HMMs can be trained on large datasets, allowing them to
adapt to different speakers, accents, and environments, improving the
overall recognition performance.

14. Predict the impact of Hidden Markov Model (HMM) work in NLP.
Impact of Hidden Markov Model (HMM) in NLP:
1. Improved Speech Recognition: HMMs have enabled highly accurate speech
recognition systems by modeling the sequential nature of speech and
capturing hidden states in the acoustic signal.
2. Enhanced Text Segmentation: HMMs allow for efficient segmentation of text
into phrases, words, and characters. This aids in tasks like text
summarization and topic labeling.
3. Optimized Language Modeling: HMMs provide a framework for language
modeling, predicting the probability of a word occurring based on previous
words. This benefits tasks like machine translation and language generation.
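
A self-contained toy sketch of the HMM idea, using the Viterbi algorithm to recover the most likely hidden state (tag) sequence for an observation sequence; the states and probabilities are invented illustrative numbers, not trained values:

states = ["Noun", "Verb"]
start_p = {"Noun": 0.6, "Verb": 0.4}
trans_p = {"Noun": {"Noun": 0.3, "Verb": 0.7}, "Verb": {"Noun": 0.6, "Verb": 0.4}}
emit_p = {"Noun": {"dogs": 0.7, "bark": 0.3}, "Verb": {"dogs": 0.2, "bark": 0.8}}

def viterbi(obs):
    # v[t][s]: probability of the best state sequence ending in state s at time t
    # back[t][s]: the previous state on that best path
    v = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        v.append({})
        back.append({})
        for s in states:
            prob, prev = max((v[t - 1][p] * trans_p[p][s] * emit_p[s][obs[t]], p) for p in states)
            v[t][s] = prob
            back[t][s] = prev
    # Follow the back-pointers from the most probable final state.
    best = max(v[-1], key=v[-1].get)
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["dogs", "bark"]))   # ['Noun', 'Verb']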

15. Observe and illustrate the role of text processing in NLP.


Role of Text Processing in NLP:
1. Normalization: Converting text to a consistent format by removing
punctuation, capital letters, and special characters. This helps in better
understanding the text's content.
2. Tokenization: Breaking down text into individual words or phrases called
"tokens." This makes it easier to analyze the text and identify its meaning.
3. Stemming/Lemmatization: Reducing words to their root form. For example,
"running," "ran," and "runs" would all be reduced to "run." This helps
in grouping similar words and understanding the text's overall meaning.
4. Part-of-Speech Tagging: Identifying the grammatical role of each word in
a sentence (e.g., noun, verb, adjective). This helps in understanding
the structure and meaning of the text.
5. Named Entity Recognition: Identifying and labeling entities within the text,
such as persons, organizations, locations, and dates. This helps in
extracting important information from the text for analysis or retrieval.

16. Explain the need for meaning representation.


Meaning Representation is Essential for:

• Understanding Natural Language: It captures the semantic content of
sentences, allowing computers to comprehend what's being said.
• Efficient Communication: It provides a structured way to represent meaning,
making it easier for computers to process and interpret language.
• Knowledge Integration: It enables computers to integrate information from
multiple sources and reason about it in a meaningful way.

17. Explain representation of meaning in Natural Language Processing (NLP).


Representation of Meaning in NLP

NLP aims to understand and process human language, which conveys complex
meaning. To do this, it requires ways to represent this meaning in a computer-
understandable form. Here's how NLP handles representation of meaning:
1. Bag-of-Words: This simple model represents a text as a list of words,
ignoring word order and grammatical structure. It assumes that each word
contributes equally to the overall meaning.
2. Vector Space Models: These models represent words as vectors in a
multi-dimensional space. Each dimension represents a different semantic
property (e.g., positivity, negativity). Words with similar meanings have
similar vectors, making it easier for NLP algorithms to understand their
relationships.
3. Topic Modeling: This technique identifies groups of words that frequently
occur together, known as "topics." It helps discover the overarching
themes and information within a text, providing a higher-level understanding
of its meaning.
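
A small sketch of the bag-of-words and TF-IDF vector-space representations, assuming scikit-learn is installed; the two example sentences are placeholders:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]

bow = CountVectorizer()                    # bag-of-words: raw term counts, word order ignored
print(bow.fit_transform(docs).toarray())
print(bow.get_feature_names_out())

tfidf = TfidfVectorizer()                  # vector-space model with TF-IDF weighting
print(tfidf.fit_transform(docs).toarray().round(2))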

18. Describe the following two approaches of representation with examples: (a)
Distributional Semantics (b) Semantic Networks
Distributional Semantics:

• Represents words based on their context (words appearing before or after
them).
• Example: "Run" and "Walk" are similar because they appear in similar
contexts, such as "I run to the store" and "I walk to the store."

Semantic Networks:

• Represents concepts as nodes and relationships between concepts as
edges.
• Example: A semantic network might have a node for "Dog" and edges
to nodes for "Mammal," "Pet," and "Loyal."
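
A toy sketch of the distributional idea from part (a): each word is represented by the counts of words that co-occur with it, and words used in similar contexts ("run", "walk") end up with similar vectors; the mini corpus and window size are illustrative assumptions:

from collections import Counter
import math

corpus = ["i run to the store", "i walk to the store", "i run every morning", "i walk every morning"]
window = 2

contexts = {}
for sentence in corpus:
    words = sentence.split()
    for i, w in enumerate(words):
        neighbours = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        contexts.setdefault(w, Counter()).update(neighbours)

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = lambda v: math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm(a) * norm(b))

print(round(cosine(contexts["run"], contexts["walk"]), 2))   # 1.0: nearly identical contexts
print(round(cosine(contexts["run"], contexts["store"]), 2))  # 0.5: less similar contexts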

19. Explain Quantifiers in FOPC with example.


Quantifiers in First-Order Predicate Calculus (FOPC)

Quantifiers are symbols that indicate the scope of variables in FOPC statements.
They are used to specify whether a statement applies to all or some members of a
domain of discourse.

Types of Quantifiers:

• Universal quantifier (∀): For all x in the domain, the statement is true.
• Existential quantifier (∃): There exists at least one x in the domain such
that the statement is true.

Example:

"All dogs are mammals" can be represented in FOPC as:


∀x (dog(x) → mammal(x))

• x is the variable representing individual dogs.


• ∀x indicates that the statement applies to all dogs in the domain.
• dog(x) is the predicate indicating that x is a dog.
• mammal(x) is the predicate indicating that x is a mammal.

"There is a dog that is brown" can be represented in FOPC as:


∃x (dog(x) ∧ brown(x))

• ∃x indicates that there exists at least one dog in the domain.


• dog(x) is the predicate indicating that x is a dog.
• brown(x) is the predicate indicating that x is brown.

20. Discuss Traditional CFG Rules and Semantic Augmentations with examples.
Traditional CFG Rules

• Context-free grammar (CFG) rules are a set of rules used to define a
formal language.
• They consist of a set of terminals (the words themselves, e.g., "dog",
"runs"), a set of nonterminals (categories such as S, NP, VP), and a set
of production rules.
• Example:
▪ Terminals: "the", "dog", "runs"
▪ Nonterminals: S (sentence), NP, VP
▪ Production rule: S -> NP VP

Semantic Augmentations

• Enhance CFG rules with semantic information to capture the meaning of
words and phrases.
• Typically accomplished by adding semantic tags or features to nonterminals
and terminals.
• Example extensions:
▪ NP -> [person:John] (noun phrase referring to John)
▪ VP -> [action:run] (verb phrase expressing running)

Example with Semantic Augmentation

Consider the following CFG rule:

• S -> NP VP

Traditional CFG Rule:

• This rule states that a sentence (S) consists of a noun phrase (NP)
followed by a verb phrase (VP).

Semantic Augmentation:

• NP -> [person:John]
• VP -> [action:run]

Semantic Interpretation:

• The augmented rule indicates that a sentence consisting of the noun
phrase "John" and the verb phrase "run" expresses the meaning
"[person:John] [action:run]".
• In other words, the sentence "John runs" is interpreted as "The person
named John is performing the action of running".

Benefits of Semantic Augmentations

• Improved language understanding and generation


• Facilitate the development of natural language processing applications (e.g.,
machine translation, chatbots)
• Enhance the ability to handle ambiguity and context

21. Discuss the following regarding Relations between Senses with examples: (a)
Synonymy (b) Polysemy
(a) Synonymy

Synonymy refers to words that have the same meaning. For example:

• "car" and "automobile"


• "happy" and "joyful"

In language understanding, synonyms pose a challenge because a system must
recognize that different word forms can convey the same meaning. A related
problem in speech recognition is acoustic confusability: words such as "cat"
and "hat", which differ only in their initial sound, can be misrecognized for
one another even though they are not synonyms.

(b) Polysemy

Polysemy refers to words that have multiple meanings depending on the context. For
example:

• "bank" (financial institution or riverbank)


• "pupil" (student or part of the eye)

In natural language processing, polysemy requires disambiguation to determine the
intended meaning of a word based on its context. For instance, in the sentence "I saw
the bank", the word "bank" could refer to a financial institution or to the side of a river.

22. Explain the concept of Word sense disambiguation with example.


Word Sense Disambiguation (WSD)

WSD is the task of identifying the intended meaning of a word when it has multiple
meanings.

Example:

• "Bank": Can refer to a financial institution or a river's edge.


• "Fair": Can mean "just" or "carnival."

How WSD Works:

WSD involves considering the word's context (surrounding words and sentences) to
determine its most likely meaning. It can be done using machine learning algorithms
or dictionary-based methods.

3 Marks Answer:

Word sense disambiguation helps computers understand the intended meaning of words
that have multiple meanings. By considering the word's context, it can distinguish
between different senses. This is important for accurate speech and natural language
processing because it allows computers to better understand the meaning of sentences
and conversations.
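
A simplified Lesk-style sketch of WSD: pick the sense whose dictionary gloss shares the most words with the sentence context; the two glosses for "bank" are toy entries, not taken from a real dictionary:

senses = {
    "bank_finance": "an institution where people deposit and withdraw money",
    "bank_river": "the sloping land along the edge of a river or lake",
}

def disambiguate(context):
    context_words = set(context.lower().split())
    # Score each sense by the overlap between its gloss and the context, and return the best one.
    return max(senses, key=lambda s: len(context_words & set(senses[s].split())))

print(disambiguate("She went to the bank to deposit money"))   # bank_finance
print(disambiguate("They sat on the bank of the river"))       # bank_river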

23. Cite three application areas of Word sense disambiguation with brief description.
1. Machine Translation: Word sense disambiguation helps translate a word
accurately in different languages, considering its context. For instance,
"bank" in English can refer to a riverbank or a financial institution.
Disambiguation ensures the correct meaning is conveyed in the target
language.
2. Information Retrieval: Search engines use word sense disambiguation to
retrieve relevant results. The correct sense of a word in a query helps
identify the user's intent. For example, searching for "interest" may return
either articles on hobbies or financial matters, depending on the
disambiguation.
3. Semantic Parsing: Disambiguating word meanings is crucial for
understanding the semantics of a sentence. For instance, "Alice drew
a card" has a different interpretation depending on whether "drew"
means pulling a card from a deck or sketching one. Disambiguation helps
natural language processors interpret the sentence accurately.

24. Explain Syntax- Driven Semantic analysis approach with figure.


Syntax-Driven Semantic Analysis (SDSA)

SDSA is an approach to semantic analysis that uses the syntactic structure of a
sentence to guide the interpretation of its meaning. It starts with a syntactic parse
of the sentence and then uses rules to map the parse tree to a semantic
representation.

Process:

1. Syntactic Parsing: The sentence is parsed into a syntactic tree representing
its grammatical structure.
2. Rule-Based Mapping: Rules are applied to each node in the syntactic
tree to extract its semantic meaning.
3. Semantic Composition: The semantic meanings of the nodes are combined
to form a complete semantic representation of the sentence.

Example:

Consider the sentence "The dog chased the cat."

Syntactic Parsing:
(S
  (NP The dog)
  (VP (V chased)
      (NP the cat)))

Rule-Based Mapping:

• "The dog" -> Denote a specific dog


• "chased" -> Denote an action of pursuing
• "the cat" -> Denote a specific cat

Semantic Composition:

• "The dog chased the cat" -> An action of the dog pursuing the cat

Diagram:

[Image of a syntactic tree with semantic rules mapped to each node]

25. Distinguish between the following Thematic Roles with example: Agent, Experiencer,
Force
Agent: The entity performing the action of the verb. Example: "The boy (Agent)
kicked the ball (Patient)."

Experiencer: The entity that perceives or feels the state or emotion described by
the verb. Example: "Mary (Experiencer) felt happy."

Force: An abstract or impersonal force that causes the action of the verb. Example:
"The wind (Force) blew the leaves (Patient)."

26. Analyse the statement, “The cat chased the mouse across the room” and identify the
agent and patient with explanation.
Agent: The cat Patient: The mouse

Explanation:

In this sentence, the cat is the one performing the action of chasing, while the
mouse is the one being chased. Therefore, the cat is the agent, and the mouse is
the patient.

3 marks answer:

The agent is the noun phrase that performs the action of the verb. In this case,
the cat is the agent because it is the one doing the chasing. The patient is the
noun phrase that receives the action of the verb. In this case, the mouse is the
patient because it is the one being chased.

27. Differentiate between the following with example: (a) RESULT (b) INSTRUMENT (c)
BENEFICIARY
RESULT

• Definition: The entity that comes into existence as the outcome of an action.
• Example: "The city built a new stadium." ("a new stadium" is the RESULT.)

INSTRUMENT

• Definition: The tool or means used to carry out an action.
• Example: "The speaker amplified her voice with a microphone." ("a microphone"
is the INSTRUMENT.)

BENEFICIARY

• Definition: The person or institution for whose benefit the action is performed.
• Example: "The organizers arranged extra seating for the audience." ("the
audience" is the BENEFICIARY.)

28. Differentiate between Machine Learning and Statistical method for Word Sense
Disambiguation.
Machine Learning (ML) for Word Sense Disambiguation (WSD):

• Learning-based approach: ML algorithms learn the relationships between
words and their different meanings from labeled training data.
• Focus on patterns: ML models identify patterns in the data that help
distinguish between different senses of a word.
• Example: A neural network trained on labeled text data can recognize the
different meanings of "bank" (e.g., financial institution or riverbank).

Statistical Method for WSD:

• Frequency-based approach: Statistical methods rely on the frequency of
words and their contexts in the training data.
• Statistical models: They use statistical models to estimate the probability
of a word having a particular meaning in a given context.
• Example: Lesk's algorithm calculates the overlap in definitions between the
target word and surrounding words to determine its most likely sense.

Key Difference:

The primary difference lies in their approach: ML models learn patterns, while statistical
methods rely on frequency and probability. Both methods have their advantages and
disadvantages, and the choice often depends on the specific WSD task and available
data.

29. Explain how Feature Extraction method helps in disambiguation in semantic meaning.
Feature Extraction for Disambiguation in Semantic Meaning

What is feature extraction? Feature extraction is a technique that extracts relevant
characteristics (features) from speech or text. These features provide valuable
information for understanding the meaning and context of words.

How does feature extraction help disambiguation? When multiple meanings of a word
exist (disambiguation), feature extraction can help distinguish between them. By
analyzing the surrounding context, grammatical structure, and other features, the
disambiguation method can determine the most likely meaning.

Example: Consider the word "bank." It can have multiple meanings, such as a
financial institution or a raised side of a river. Feature extraction would consider the
following:

• Grammatical structure: "bank account" suggests the financial meaning,
while "river bank" suggests the terrain meaning.
• Surrounding context: Mention of money or finance implies the financial
meaning, while references to water or nature indicate the terrain meaning.

30. Analyse the sentence, "She went to the bank to deposit money" and explain why Word
Sense Disambiguation is required.
Word Sense Disambiguation (WSD) is important in this sentence because the word
"bank" can have multiple meanings:

• Financial institution: A place where money is deposited and withdrawn.


• Edge of a river or lake: A raised area of land along a body of water.

Without WSD, a computer might misinterpret the sentence as:

"She went to the edge of the river to deposit money."

This would clearly be incorrect. WSD determines that the intended meaning is the
financial institution, not the geographical feature.

3-Mark Answer:

Word Sense Disambiguation is required in the sentence "She went to the bank to
deposit money" because the word "bank" has multiple meanings. Without WSD, a
computer might interpret the sentence incorrectly.

31. Express the concept of coherence in terms of discourse of texts.


Coherence in Discourse:

Coherence refers to how well the ideas and sentences in a text flow together and
make sense as a whole. It ensures that:

• Related ideas are grouped together: sentences and paragraphs are
organized in a logical way, with related concepts discussed in the same
section.
• Transitions connect ideas: words like "however," "therefore," and "in
conclusion" help guide readers from one idea to the next, indicating
relationships between sentences.
• Information is relevant: all sentences in the text contribute to the main
topic and avoid irrelevant or distracting information.

32. Alice loves to read books. She often visits the library to borrow new ones. The
librarian there is very helpful and always recommends interesting titles. Write the
referring expressions.

1. "She" – refers back to Alice
2. "new ones" – refers back to books
3. "The librarian there" – refers to the librarian at the library ("there"
refers back to the library)

33. Prepare Anaphora Resolution using Hobbs Algorithm for the text: “The dog chased
the cat. It ran away.”
Anaphora Resolution using Hobbs' Algorithm:

Step 1: Identify the anaphor (the pronoun referring to something mentioned earlier):

• "It"

Step 2: Search the pronoun's own sentence ("It ran away.") for a suitable antecedent:

• No candidate noun phrase precedes the pronoun, so the search moves on.

Step 3: Traverse the parse tree of the previous sentence ("The dog chased the cat.")
left-to-right, breadth-first, proposing each NP encountered:

• The first NP proposed is "The dog" (the subject), followed by "the cat".

Step 4: Accept the first proposal that satisfies agreement constraints (number, gender):

• "The dog" agrees with "It", so the antecedent of "It" is "the dog".

The resolved text: The dog chased the cat. The dog ran away.

34. Prepare Anaphora Resolution using Centering Algorithm for the text: “Alice went to
the store. She bought milk. Then, Bob arrived home.”
Sentence 1: Alice went to the store.

• Center: Alice

Sentence 2: She bought milk.

• Center: She (referring to Alice)

Sentence 3: Then, Bob arrived home.

• Center: Bob (new entity, not included in the previous centers)

35. How does Coreference Resolution structure the following text: "Jessica bought a new
car. She is very happy with it."
Coreference Resolution

Coreference Resolution identifies and connects words or phrases that refer to the
same entity. In this text:

• Anaphors: "She" refers back to "Jessica"; "it" refers back to "a new car".
• Coreference Chains: Jessica -> She; a new car -> it

Structure:

• Antecedent: Jessica; Anaphor: She; Referent: the same person (Jessica)
• Antecedent: a new car; Anaphor: it; Referent: the same object (the car)

This coreference structure makes the text coherent by linking each pronoun to the
entity it refers to: the reader understands that "She" is Jessica and that "it" is
the car Jessica bought.

36. Explain the benefits of Discourse Analysis.


Benefits of Discourse Analysis for Speech and Natural Language Processing:
1. Improved Contextual Understanding: Discourse analysis helps machines
understand the context of speech by analyzing relationships between
utterances, identifying discourse markers, and considering the speaker's
intent. This enhances the accuracy of speech recognition and natural
language processing tasks.
2. Enhanced Coherence and Cohesion: By examining how texts are organized
and structured, discourse analysis allows machines to generate more
coherent and cohesive responses. This is crucial for conversational systems
and chatbot development, where coherent responses improve user
engagement.
3. Identification of Discourse Structure: Discourse analysis algorithms can
automatically identify discourse segments (e.g., introduction, body,
conclusion) within texts. This aids in summarizing text, organizing document
structures, and improving search engine optimization (SEO).

37. Write the steps of Coreference Resolution.


Coreference Resolution Steps:
1. Identify Noun Phrases: Extract noun phrases (e.g., "the dog", "the car")
from the text.
2. Compute Pairwise Similarity: Calculate the similarity between each pair of
noun phrases using linguistic features (e.g., number, gender) and
semantic relatedness (e.g., using WordNet).
3. Cluster Noun Phrases: Group similar noun phrases into clusters. Each
cluster represents a single entity referred to by multiple noun phrases.

4. Assign Referents: Assign a unique identifier to each cluster. This identifier
stands for the single entity that all noun phrases within the cluster refer to.
5. Resolve Anaphora and Cataphora: Identify and resolve references that occur
before or after the referent noun phrase (e.g., "he", "they", "her").

38. Write a short note on Treebank.


Treebank

A treebank is a collection of parsed sentences, where each word is assigned a
syntactic category and connected to other words using grammatical rules. It serves
as a valuable resource for training and evaluating natural language processing (NLP)
models.

Key Features:

• Annotated Sentences: Sentences are parsed and annotated with syntactic
information, such as noun phrases, verb phrases, and dependencies.
• Hierarchical Structure: The syntactic annotations are organized in a
hierarchical tree structure, representing the grammatical relationships
between words.
• Language-Dependent: Treebanks are typically created for specific
languages, as the grammatical rules and annotations vary across languages.

Uses:

• NLP Model Training: Treebanks provide training data for NLP models that
require an understanding of sentence syntax.
• Syntactic Analysis Evaluation: They are used to evaluate the performance
of NLP models that perform syntactic parsing.
• Language Research: Treebanks facilitate the study of linguistic patterns
and grammatical structures.

39. Explain the various relationships through which Synsets are linked together in
WordNet.
Synsets in WordNet are linked through the following relationships:

1. Hyponymy and Hypernymy (Is-a relationship):

• A hyponym is a more specific concept within a broader category (hypernym).
• Example: "dog" (hyponym) is "mammal" (hypernym).

2. Meronymy and Holonymy (Part-of relationship):

• A meronym is a part or component of a whole (holonym).


• Example: "finger" (meronym) is part of "hand" (holonym).

3. Troponymy:

• A relation between verb synsets where one verb (the troponym) expresses a
specific manner of performing a more general verb.
• Example: "whisper" (troponym) is a manner of "speak".

4. Entailment:

• A relation between verb synsets where performing one action necessarily
involves performing the other.
• Example: "snore" (entails) "sleep".

5. Causation:

• A relation between verb synsets where one action causes the other.
• Example: "kill" (causes) "die".

40. Develop a Frame titled "Apply_heat" mentioning Lexical Unit and Frame elements
from the sentence: "She cooked a delicious meal in the kitchen". Assume implicit
elements if required.
Frame Title: Apply_heat

Lexical Unit: cook.v (evoked by "cooked")

Frame Elements:

• Cook (Agent): She
• Food: a delicious meal
• Heating_instrument: implicit (e.g., a stove; not mentioned in the sentence)
• Place: in the kitchen

41. Summarize the significances of the British National Corpus (BNC).


Significance of the British National Corpus (BNC):
1. Comprehensive Language Database: The BNC contains over 100 million
words of spoken and written British English, representing a wide range of
genres, styles, and registers. This provides researchers with a rich source
of authentic language data to study and analyze.
2. Benchmark for Linguistic Analysis: The BNC has become a widely accepted
benchmark for linguistic research. Its large size and variety of data enable

researchers to conduct robust statistical analyses and draw reliable
conclusions about language use.
3. Improving Natural Language Processing (NLP): The BNC is heavily used
in the development and evaluation of NLP systems. Its extensive coverage
of real-world language helps researchers test and refine NLP algorithms,
leading to more accurate and effective language processing tools.

42. Compare Lemmatization and Stemming.


Lemmatization and Stemming are techniques used in natural language processing to
reduce words to their base or root form.

Lemmatization:

• Considers the context of the word and its grammatical information (e.g.,
tense, number, part of speech) to identify the base form (lemma).
• Returns a valid dictionary word and retains the word's meaning.
• Example: "running" and "ran" (verbs) are both lemmatized to "run", and
"better" (adjective) is lemmatized to "good".

Stemming:

• Ignores the context and strips suffixes (and sometimes prefixes) using simple rules.
• May produce a form that is not a real word and may alter the meaning or
part of speech.
• Example: "running" is stemmed to "run", while "studies" is stemmed to "studi".
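
A hedged comparison sketch using NLTK (assumed installed; the WordNet lemmatizer also needs the wordnet corpus downloaded via nltk.download("wordnet")):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["running", "ran", "studies", "better"]:
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
# Stemming chops suffixes by rule (e.g., "studies" -> "studi"), while lemmatization
# uses vocabulary and part of speech to return a real base form (e.g., "ran" -> "run").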

43. Summarize the challenges in Coreference Resolution in terms of Ambiguity, Nested
Coreferences and Long-Distance Dependencies.
Ambiguity:

• Pronouns (e.g., "he") can refer to multiple antecedents.


• Multiple entities mentioned in a text may be referred to by the same
pronoun.

Nested Coreferences:

• Coreferences can be embedded within other coreferences.


• Example: "John told Mary that he had seen her." ("he" refers to John,
"her" refers to Mary)

Long-Distance Dependencies:

• Coreferences may span large text distances.


• Example: "The book was on the table. It was a thick book." ("It" refers
to "the book" mentioned earlier in the text)

44. Distinguish between Coreference and Anaphora.


Coreference:

• Refers to two different expressions in a text that refer to the same entity.
• The entities can be people, places, things, or concepts.
• Example: "John went to the store. He bought a bag of chips."

Anaphora:

• A specific type of coreference where a later expression (anaphor) refers


back to an earlier expression (antecedent).
• The anaphor is often a pronoun, while the antecedent is usually a noun
phrase or other earlier expression.
• Example: "John went to the store. He bought a bag of chips and put it
in his car."

45. Summarize the aspects of Discourse segmentation.


Discourse Segmentation

Discourse segmentation is the process of dividing a text or spoken conversation into
meaningful units. It helps to identify the structure and organization of a discourse.

Aspects of Discourse Segmentation:

1. Cohesion: The use of linguistic devices (e.g., pronouns, conjunctions)
to connect and transition between units.
2. Coherence: The logical and sequential flow of information within and
between units, making sense as a whole.
3. Segmentation boundaries: Points where the discourse transitions from one
topic, perspective, or speaker to another.

Understanding discourse segmentation is crucial for:

• Identifying the key sections of a text or conversation


• Summarizing and extracting information
• Generating coherent text or dialogue in natural language processing systems

46. Explain the meaning of "syntactic parsing" and the function it serves in NLP.
Syntactic Parsing

Syntactic parsing is a process in Natural Language Processing (NLP) that breaks
down a sentence into its individual components and analyzes their relationships.

Function in NLP

Syntactic parsing helps computers understand the structure and meaning of sentences
by:

1. Identifying Constituency: It groups words into their corresponding units,
such as noun phrases, verb phrases, and prepositional phrases.
2. Revealing Dependencies: It determines how words in a sentence connect
to each other, showing which words depend on which.
3. Establishing Hierarchy: It organizes the units into a hierarchical structure,
resembling a tree-like graph called a "parse tree."

By understanding the syntactic structure of sentences, NLP systems can:

• Improve text comprehension


• Extract information more accurately
• Generate grammatically correct text
• Identify sentence types (e.g., declarative, interrogative)
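
A hedged sketch of constituency parsing with NLTK's chart parser over a toy grammar; the grammar covers only this one sentence and is meant to illustrate the parse-tree idea, not to be a realistic grammar:

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the'
N -> 'dog' | 'cat'
V -> 'chased'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased the cat".split()):
    print(tree)   # (S (NP (Det the) (N dog)) (VP (V chased) (NP (Det the) (N cat))))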

47. Establish an explanation of "ambiguity" about the sentence construction.


Ambiguity in sentence construction occurs when a sentence can be interpreted in
more than one way. This can be due to:

• Lexical ambiguity: Individual words can have multiple meanings.


• Structural ambiguity: The arrangement of words in a sentence can create
multiple interpretations.
• Scope ambiguity: Certain phrases or clauses can apply to multiple parts
of a sentence.

For example, the sentence "The cat sat on the mat with the mouse" has two
possible interpretations, depending on where "with the mouse" attaches:

• The cat was sitting on the mat together with the mouse (the phrase modifies
the sitting).
• The cat was sitting on the mat that had a mouse on it (the phrase modifies
the mat).

Ambiguity in speech and natural language processing can lead to confusion and errors
in communication. It is important to resolve ambiguity using context, grammar, or
disambiguation techniques to ensure clear and unambiguous communication.

48. Determine the fundamental ideas underlying various parsing strategies, such as
shallow parsing and dynamic programming parsing.
Shallow Parsing:

• Focuses on extracting only surface-level information, such as part-of-speech
tags and chunk boundaries.
• Uses predefined rules or statistical models to assign tags and chunk
constituents.
• Provides a quick and coarse-grained understanding of sentence structure.

Dynamic Programming Parsing:

• Constructs a parse tree by iteratively applying rules or scores to candidate
subtrees.
• Dynamically computes the best parse subtree for each span of the
sentence.
• Uses dynamic programming techniques to optimize for the best overall
parse tree.
• Can handle complex sentences with ambiguous or long-distance
dependencies.

49. Apply the CYK algorithm to parse a simple sentence with probabilities assigned to
CFG rules.
CYK Algorithm (for Probabilistic CFGs):

The CYK algorithm efficiently parses sentences using a Context-Free Grammar
(CFG) in Chomsky normal form, where each rule has a probability associated with it.

Steps:
1. Initialization: Create a probability table P[i, j, A] for all spans from
position i to j in the sentence and all grammar symbols A. For each single
word, set P[i, i, A] to the probability of the lexical rule A -> word_i
(zero if no such rule exists).
2. Recursion: For each longer span from i to j, each split point k with
i ≤ k < j, and each rule A -> B C, compute
P(A -> B C) × P[i, k, B] × P[k+1, j, C] and store the maximum such value in
P[i, j, A], remembering which rule and split point produced it.
3. Extraction: The probability of the best parse is P[1, n, S], where n is the
sentence length and S is the start symbol. The parse tree is recovered by
following the remembered rules and split points back down from P[1, n, S].

Example:

Consider the sentence "The man walks" and the CFG:


S -> NP VP (0.8)
NP -> The man (0.9)
VP -> walks (0.7)

CYK Chart (best probability for each span):

Span              Symbol   Probability
"The man"         NP       0.9
"walks"           VP       0.7
"The man walks"   S        0.8 × 0.9 × 0.7 = 0.504

Parse Tree:
S (0.8)
|
NP (0.9) VP (0.7)
| |
The man walks
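
A self-contained sketch of probabilistic CYK in Python, assuming the grammar is in Chomsky normal form; the rules and probabilities mirror the toy example above, with the VP rewritten as a lexical rule:

from collections import defaultdict

lexical = {                      # A -> word rules with probabilities
    ("DT", "the"): 1.0,
    ("NN", "man"): 1.0,
    ("VP", "walks"): 0.7,
}
binary = {                       # A -> B C rules with probabilities
    ("NP", "DT", "NN"): 0.9,
    ("S", "NP", "VP"): 0.8,
}

def cyk(words):
    n = len(words)
    # chart[(i, j)][A] = best probability of symbol A spanning words[i:j]
    chart = defaultdict(dict)
    for i, w in enumerate(words):
        for (A, word), p in lexical.items():
            if word == w:
                chart[(i, i + 1)][A] = p
    for span in range(2, n + 1):                  # widen spans bottom-up
        for i in range(0, n - span + 1):
            j = i + span
            for k in range(i + 1, j):             # split point
                for (A, B, C), p in binary.items():
                    if B in chart[(i, k)] and C in chart[(k, j)]:
                        prob = p * chart[(i, k)][B] * chart[(k, j)][C]
                        if prob > chart[(i, j)].get(A, 0.0):
                            chart[(i, j)][A] = prob
    return chart[(0, n)].get("S", 0.0)

print(round(cyk("the man walks".split()), 3))     # 0.8 * 0.9 * 0.7 = 0.504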

50. Examine and contrast the capabilities of probabilistic CYK parsing versus regular
CYK parsing.
Probabilistic CYK Parsing:

• Generates a parse tree with probabilities associated with each node.


• Allows for ambiguity in the grammar, where multiple parses are possible
for a given input.
• Can handle noisy or incomplete input data by assigning probabilities to
alternative parses.

Regular CYK Parsing:

• Generates a single parse tree.


• Assumes that the grammar is unambiguous and produces a unique parse
for every input.
• Faster and simpler to implement than probabilistic CYK parsing.

Key Differences:

• Ambiguity handling: Probabilistic CYK parsing can handle ambiguities, while


regular CYK parsing cannot.
• Probability assignment: Probabilistic CYK parsing provides probabilities for
different parses, while regular CYK parsing does not.
• Robustness to noise: Probabilistic CYK parsing can be more robust to
noisy input data due to its ability to assign probabilities to alternative
parses.

51. Illustrate how tense agreement between a verb and its subject is one example of the
additional information that feature structures with unification can capture.
Tense Agreement

In sentences, the verb must agree with the subject in tense. For example:

• "The boy runs" (present tense)


• "The boy ran" (past tense)

Feature Structures with Unification

Feature structures with unification can represent this tense agreement. Each word
(noun and verb) has a feature structure that includes features like person, number,
and tense.

Example

Let's represent the sentence "The boy runs":


• Noun (boy):
▪ Person: 3
▪ Number: singular
• Verb (runs):
▪ Person: 3
▪ Number: singular
▪ Tense: present

Unification

Using unification, we can match the features of the noun and verb:
[Person: 3] [Number: singular] = [Person: 3] [Number: singular]

This shows that the features of the noun and verb match, indicating correct tense
agreement.

Benefits:

Feature structures with unification provide:

• Syntactic Information: Representing grammatical relationships (e.g.,
subject-verb agreement).
• Semantic Information: Capturing meanings of words (e.g., person, number,
tense).
• Compact Representation: Combining different levels of information into a
single structure.
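
A minimal sketch of the unification idea over flat Python dictionaries: two feature structures unify if they agree on every shared feature, and the result merges them; real unification grammars also handle nested structures and variables:

def unify(fs1, fs2):
    for feature in fs1.keys() & fs2.keys():
        if fs1[feature] != fs2[feature]:
            return None            # clash on a shared feature: unification fails
    return {**fs1, **fs2}          # otherwise merge the two structures

subject = {"person": 3, "number": "singular"}
verb_runs = {"person": 3, "number": "singular", "tense": "present"}
verb_run = {"number": "plural", "tense": "present"}

print(unify(subject, verb_runs))   # succeeds: {'person': 3, 'number': 'singular', 'tense': 'present'}
print(unify(subject, verb_run))    # None: number clash, so "The boy run" is rejected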

52. Discover the limitations of Probabilistic Lexicalized CFGs and investigate other
approaches to dealing with complex syntactic occurrences.
Limitations of Probabilistic Lexicalized CFGs:

• Limited handling of structural complexity: PLCFGs struggle with long-distance
dependencies and deeply nested hierarchical structures, which occur in
natural language.
• Data sparsity: Lexicalized rules condition on specific words, so many rule
combinations are rarely seen even in large training datasets, making their
probabilities hard to estimate reliably.
• Ambiguity resolution: PLCFGs cannot fully resolve the ambiguity inherent in
natural language, where multiple parses may be possible for a given
sentence.

Other Approaches:

• Tree-Adjoining Grammar (TAG): TAGs allow the construction of richer
hierarchical structures, addressing the limitations of PLCFGs in handling
complex syntactic phenomena.
• Head-Driven Phrase Structure Grammar (HPSG): HPSG combines
traditional CFGs with feature structures, providing a more expressive
framework for capturing linguistic dependencies.
• Dependency Grammar (DG): DGs focus on the relationships between words
rather than phrases, reducing the complexity of syntactic structures and
resolving ambiguities more effectively.

53. Construct a feature structure representation for a specific grammatical category (e.g.,
noun phrase) with relevant features like number and case.
Noun Phrase Feature Structure:
[NP
Num: Singular/Plural
Case: Nominative/Accusative/Dative/Genitive
Head: Noun
]

Features:

• Num: Indicates the number of the noun phrase (singular or plural).


• Case: Specifies the grammatical case of the noun phrase (e.g., nominative
for subjects, accusative for direct objects).
• Head: The main noun in the noun phrase, which determines its semantic
content.

54. Determine the advantages and disadvantages of shallow parsing compared to full
parsing for specific NLP tasks (e.g., information extraction).
Advantages of Shallow Parsing:
• Speed and Efficiency: Shallow parsing analyzes only limited aspects of a
sentence, making it faster and less computationally intensive than full
parsing.
• Focused Analysis: It focuses on identifying specific linguistic structures
(e.g., noun phrases, verbs) relevant to the task, resulting in tailored
information for the task.

Disadvantages of Shallow Parsing:


• Limited Insight: Shallow parsing provides a surface-level analysis of the
sentence and may miss deeper grammatical relationships or semantic
information.
• Inaccuracy: Due to its limited analysis, shallow parsing is prone to errors
in identifying the correct constituents, which can impact downstream tasks.

Comparison for Information Extraction:

For information extraction tasks, shallow parsing is often preferred because:

• It can quickly extract specific information from documents without needing
a complete understanding of the sentence.
• It is tailored to identify relevant linguistic structures for information extraction,
such as named entities and relations.

However, full parsing may be necessary when a deeper understanding of the sentence
structure and semantics is required for complex information extraction tasks.

55. Write about how well shallow parsing works for jobs that need more in-depth syntactic
analysis, such as sentiment analysis.

Shallow parsing, which focuses on basic sentence structure, is not well-suited for
tasks like sentiment analysis that require more detailed syntactic analysis. To capture
the nuances of sentiment, we need to understand the relationships between words in
a sentence and identify their roles in the sentence. Shallow parsing lacks the depth
to provide this level of insight.

5 MARKS

1. Describe Thematic roles with examples.


Thematic Roles

Thematic roles identify the semantic functions of noun phrases in a sentence. They
describe the relationship between the verb and the participants in the event or action
described by the verb.

Five Essential Thematic Roles:

1. Agent: The one who performs the action (e.g., "John kicked the ball")
2. Patient: The one or thing that undergoes the action (e.g., "The ball was
kicked by John")
3. Instrument: The tool or means by which the action is performed (e.g.,
"John kicked the ball with his foot")
4. Beneficiary: The one who benefits from the action (e.g., "John gave the
gift to Mary")
5. Location: The place where the action occurs (e.g., "John sat on the
bench")

2. Explain how the Word Sense Disambiguation using Dictionary-Based Methods works.
How Word Sense Disambiguation Using Dictionary-Based Methods Works

Step 1: Word Lookup

• Identify the word to be disambiguated in the input text.


• Look up the word in a dictionary or knowledge base.

Step 2: Sense Identification

• The dictionary provides multiple senses (meanings) for the word.


• The algorithm chooses the most appropriate sense based on information
in the dictionary, such as:
▪ Definitions
▪ Part of speech
▪ Semantic relationships

Step 3: Context Analysis

• The algorithm analyzes the surrounding context of the word.


• It looks for clues that indicate the specific meaning intended in the text.
• This may include identifying neighboring words, syntactic structure, or
semantic patterns.

Step 4: Sense Selection

• The algorithm combines the information from the dictionary lookup and
context analysis.
• It selects the sense that best fits the context and creates an interpretation
of the text.

Advantages:

• Fast and efficient


• Relies on existing knowledge bases and dictionaries

3. Describe the advantages and disadvantages of Word Sense Disambiguation using
Dictionary-Based Methods.
Advantages:

• Simple and efficient: Dictionary-based methods are straightforward to
implement and computationally inexpensive.
• High recall: They cover a wide range of word senses, resulting in high
recall (finding as many correct disambiguations as possible).

Disadvantages:

• Low precision: Due to the ambiguity present in natural language,
dictionary-based methods may yield incorrect disambiguations where multiple
senses are plausible.
• Limited coverage: Dictionaries can only disambiguate known words, so
they may not be suitable for handling new or specialized vocabulary.
• Context-insensitivity: These methods do not consider the context of the
word, which can sometimes lead to incorrect disambiguations.

• Data dependence: The performance of dictionary-based methods is heavily
dependent on the quality and completeness of the dictionary used.
• Scalability: As the size of the dictionary increases, the computational time
required for disambiguation can become prohibitive.

4. Describe the following methods of WSD: Dictionary-based, Machine learning,
Statistical.
Dictionary-based WSD:

• Uses a precompiled dictionary with word senses defined by human experts.


• Looks up the word in the dictionary and returns the most appropriate
sense based on context.

Machine Learning-based WSD:

• Trains a machine learning model on a dataset of labeled text.


• The model learns to associate words with their corresponding senses.
• When given a new sentence, the model predicts the correct sense for
each word.

Statistical WSD:

• Calculates the probability of each possible sense based on statistical
information.
• Considers factors such as word frequency, co-occurrence with other words,
and syntactic context.
• Uses statistical techniques to determine the most likely sense.

5. Describe the following regarding requirements of representation: Neural Network
Architectures, Frame Semantics.
Neural Network Architectures:

• Purpose: Capture complex relationships in data.


• Requirement: Large datasets for training, computational power for
processing.
• Benefit: Can learn patterns that are difficult for humans to identify.

Frame Semantics:

• Purpose: Represent semantic roles and relationships in sentences.


• Requirement: Hand-crafted rules or supervised data to define frames.
• Benefit: Provides a structured representation of sentence meaning,
facilitating comprehension.

6. Represent the following statements in FOPC about animals and their habitats, and infer
about their ability to fly. All birds can fly. Penguins are birds. Penguins live in cold
climates. If an animal can fly, it does not live in a cold climate. Question: Do penguins
fly?
Statements in FOPC:

1. ∀x (Bird(x) → Fly(x))
2. ∀x (Penguin(x) → Bird(x))
3. ∀x (Penguin(x) → ColdClimate(x))
4. ∀x (Fly(x) → ¬ColdClimate(x))

Question as a query (for a particular penguin p, so Penguin(p) holds):

5. Fly(p)?

Inference:

From Penguin(p) and statement 2, p is a bird: Bird(p). From statement 1
(Bird(x) → Fly(x)), it follows that Fly(p), i.e., penguins can fly.

However, statement 3 gives ColdClimate(p), and statement 4
(Fly(x) → ¬ColdClimate(x)) then implies ¬Fly(p). The premises therefore lead to
a contradiction, so we cannot consistently conclude that penguins can fly.

Answer:

Based on the given information, we cannot determine whether or not penguins can
fly.

7. Represent the following statements in FOPC about animals and their habitats, and
deduce the identity of Tom's grandmother. Every person has a mother. Julia is Tom's
mother. A grandmother is the mother of a person's mother. Question: Who is Tom's
grandmother?
FOPC Representation:

• ∀x ∃y Mother(y, x)  (every person has a mother)
• Mother(julia, tom)  (Julia is Tom's mother)
• ∀x ∀y ∀z ((Mother(y, x) ∧ Mother(z, y)) → Grandmother(z, x))
  (a grandmother is the mother of a person's mother)

Deduction:

From the second statement, Julia is Tom's mother. From the first statement,
Julia herself has a mother; call her g, so Mother(g, julia). Applying the third
statement with x = tom, y = julia, z = g gives Grandmother(g, tom).

Hence, Tom's grandmother is Julia's mother (her name is not given by the
premises).

8. Given the following data, evaluate the degree of similarity between the common words:
'happy' and 'joyful'.
1. Semantic Similarity: Happy and joyful have similar meanings, both
expressing a state of positive emotion. They convey a sense of contentment
and well-being. This high semantic similarity indicates a strong connection
between the words.
2. Lexical Relationship: Although the two words come from different roots
("joy" for "joyful", "hap" for "happy"), dictionaries and thesauri list them as
near-synonyms, which adds to their degree of similarity.
3. Synonymy: Happy and joyful are often used interchangeably in everyday
speech and writing. They can substitute for each other in many contexts
without significantly altering the meaning. This interchangeability enhances
their similarity.
4. Collocation: Happy and joyful frequently co-occur in phrases and
expressions, such as "happy and joyful news" or "a happy and joyful
occasion." This collocation pattern further supports their association.
5. Usage Patterns: In corpus linguistics, happy and joyful exhibit similar usage
patterns. They are both commonly used to describe emotions, events, and
experiences. This consistency in usage indicates a high degree of similarity
between the words.

9. Differentiate between the following: WSD using Supervised learning, WSD using
Dictionary, WSD using Thesaurus.
WSD using Supervised Learning:

• Uses labeled data to train a model that predicts the correct sense of a
word based on its context.
• Relies on a vast training dataset and requires expert annotation.

WSD using Dictionary:

• Looks up the word in a predefined dictionary and assigns the sense
provided by the dictionary.
• Limited by the scope and accuracy of the dictionary.

WSD using Thesaurus:

• Uses a thesaurus to find synonyms and related terms for the target word.
• Relies on semantic relationships between words and their contexts.

10. Explain Naive Bayes classifier.


Naive Bayes Classifier

The Naive Bayes classifier is a simple but effective probabilistic model used for
classification tasks in NLP. It assumes that the features of a data point are
independent of each other given the class label. This assumption simplifies the training
and prediction process, as it allows us to calculate the probability of a data point
belonging to a class using Bayes' theorem:
P(Class | Features) = (P(Features | Class) * P(Class)) / P(Features)

where:

• P(Class | Features): Probability of the class given the features


• P(Features | Class): Probability of the features given the class
• P(Class): Prior probability of the class
• P(Features): Probability of the features

Advantages:

• Fast to train and predict


• Requires minimal data preprocessing
• Can handle a large number of features
• Effective for text classification tasks

Disadvantages:

• The independence assumption is not always true, which can lead to errors
• Can be sensitive to outliers and noise in the data
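
A hedged sketch of Naive Bayes text classification, assuming scikit-learn is installed; the tiny sentiment dataset is an illustrative placeholder:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["great film", "loved the acting", "wonderful story",
         "awful film", "boring story", "terrible acting"]
labels = ["pos", "pos", "pos", "neg", "neg", "neg"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)          # bag-of-words features
model = MultinomialNB().fit(X, labels)       # estimates P(class) and P(word | class)

test = vectorizer.transform(["wonderful acting", "boring film"])
print(model.predict(test))                   # expected: ['pos' 'neg']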

11. Write down the steps of the Hobb’s algorithm.


Hobbs' Algorithm for Pronoun (Anaphora) Resolution

Steps:

1. Begin at the NP node that immediately dominates the pronoun in the parse
tree of the current sentence.
2. Go up the tree to the first NP or S node encountered; call this node X and
remember the path used to reach it.
3. Search all branches below X that lie to the left of the path, left-to-right
and breadth-first, and propose as antecedent any NP that has an NP or S
node between it and X.
4. If X is the highest S node in the sentence and no antecedent has been
found, move to the parse trees of the preceding sentences, most recent
first; traverse each tree left-to-right, breadth-first, and propose the
first NP encountered.
5. Otherwise, continue moving up the tree from X, repeating the left-to-right,
breadth-first search at each new node, until an antecedent is proposed.
6. Accept a proposed NP only if it satisfies agreement constraints (number,
gender, person) with the pronoun.

12. Write down the steps of Coreference Resolution.


Steps of Coreference Resolution:
1. Identification: Identify candidate words or phrases (e.g., pronouns, nouns)
that may refer to the same entity.
2. Mention Clustering: Group related candidate mentions together based on
their linguistic features (e.g., gender, number).
3. Entity Identification: Determine the entities that each mention cluster refers
to, either in the text or through world knowledge.
4. Coreference Linking: Connect mentions within each entity into chains of
coreferring expressions, creating a coreference graph.
5. Resolution: Resolve the coreferences by assigning a unique identifier to
each entity and replacing mentions with their corresponding identifier
throughout the text.
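
A toy sketch of the pairing/linking idea only, using a crude "most recent mention" heuristic (this is an illustrative assumption, not Hobbs' algorithm or a trained coreference model):

```python
# Toy coreference sketch: link each pronoun to the most recent capitalized mention.
PRONOUNS = {"he", "she", "it", "they"}

def naive_coreference(tokens):
    mentions = []                              # candidate antecedents seen so far
    links = {}                                 # (position, pronoun) -> antecedent
    for i, tok in enumerate(tokens):
        if tok.lower() in PRONOUNS and mentions:
            links[(i, tok)] = mentions[-1]     # "most recent mention" heuristic
        elif tok[0].isupper():                 # crude proper-noun detection
            mentions.append(tok)
    return links

print(naive_coreference("Alice went home . She was tired".split()))
# {(4, 'She'): 'Alice'}
```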

13. Construct the stemmed word for the word 'running' using Porter Stemmer algorithm.
Porter Stemming Algorithm Steps for 'running':
1. Remove Plural Suffixes (-s, -es, -ies): Not applicable here as 'running'
does not end with -s, -es, or -ies.
2. Remove Verb Inflection Suffixes (-ed, -ing): Remove '-ing' from 'running'
to get 'run' (the doubled 'n' is reduced as part of this step).
3. Remove Suffixes (-ly, -ic, -ism, etc.): Not applicable here as 'run' does
not end with these suffixes.
4. First Replacement Rule: Not applicable here as 'run' is not one of the
specified irregular words.
5. Second Replacement Rule: Not applicable here as 'run' is not one of the
specified irregular words.

Stemmed Word: run
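
The same result can be reproduced with NLTK's implementation of the Porter stemmer (a sketch, assuming the nltk package is installed):

```python
# Porter stemming with NLTK (assumes nltk is installed).
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem("running"))   # run
print(stemmer.stem("runs"))      # run
```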

14. Write down the Features of Lemmatization.


Features of Lemmatization:
1. Reduces words to their base form: Lemmatization converts words to their
root form, regardless of tense, number, or other grammatical variations.
(e.g., "run," "running," and "ran" are all lemmatized to "run").
2. Improves accuracy: By using the root form of words, lemmatization reduces
the number of unique words in a text, making it easier for natural
language processing models to understand the content.
3. Increases efficiency: Lemmatization can reduce the processing time and
memory requirements of NLP tasks by handling different forms of the
same word as a single entity.
4. Provides better insights: Lemmatized text offers a clearer understanding of
the main concepts in a document, as it removes grammatical variations
that may obscure the meaning.
5. Improves search results: Lemmatized text can improve search engine
results by matching user queries with the most relevant content, regardless
of word form variations.
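
For illustration, a minimal sketch with NLTK's WordNet lemmatizer (an assumption; it requires the WordNet data to be downloaded):

```python
# Lemmatization with NLTK's WordNet lemmatizer.
# May require: import nltk; nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("running", pos="v"))   # run
print(lemmatizer.lemmatize("ran", pos="v"))       # run
print(lemmatizer.lemmatize("better", pos="a"))    # good
```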

15. Write down the Key aspects of the Penn Treebank.


Key Aspects of the Penn Treebank:
1. Annotation: The Penn Treebank is a large corpus of English text that
has been annotated with syntactic structure, including part-of-speech tags
and phrase structures. This annotation makes it a valuable resource for
training and evaluating natural language processing (NLP) models.
2. Large Size: The Penn Treebank contains over 1 million words of text,
making it one of the largest annotated corpora available. This ensures
that it provides a comprehensive sample of the English language and
allows for statistical analysis and model development.
3. Wide Range: The texts in the Penn Treebank cover various genres,
including newspaper articles, academic papers, and fiction. This diversity
ensures that NLP models trained on the corpus can generalize to a wide
range of text types.
4. Well-Established Standard: The Penn Treebank annotation scheme is a
widely accepted standard in NLP. It has been used to train and evaluate
numerous NLP systems, making it a reliable and comparable benchmark.

5. Historical Significance: The Penn Treebank was one of the first large-
scale annotated corpora, and it has played a pivotal role in the
development of NLP research. It continues to be used as a benchmark
and reference dataset, contributing to the advancement of the field.

16. Design a Syntactic Tree according to the Penn Treebank structure for the sentence:
'The quick brown fox jumps over the lazy dog.'
Syntactic Tree (Penn Treebank Structure):
(ROOT
(S
(NP (DT The) (JJ quick) (JJ brown) (NN fox))
(VP (VBZ jumps)
(PP (IN over)
(NP (DT the) (JJ lazy) (NN dog))))))

Explanation:

• ROOT: Represents the root of the sentence, which is the sentence itself.
• S: Indicates the sentence type (in this case, a declarative sentence).
• NP: Noun Phrase, which represents the subject and object of the sentence.
• DT: Determiner (e.g., "the")
• JJ: Adjective
• NN: Noun
• VBZ: Verb (present tense, third person singular)
• PP: Prepositional Phrase
• IN: Preposition (e.g., "over")

Constituent Structure:

• The NP "The quick brown fox" is the subject of the sentence.


• The VP "jumps over the lazy dog" is the predicate of the sentence.
• The PP "over the lazy dog" is an adverbial phrase that modifies the
verb "jumps."
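
The bracketed structure above can be loaded and displayed with NLTK's Tree class (a sketch, assuming nltk is installed):

```python
# Load and display the Penn Treebank-style bracketing with NLTK.
from nltk import Tree

tree = Tree.fromstring(
    "(ROOT (S (NP (DT The) (JJ quick) (JJ brown) (NN fox)) "
    "(VP (VBZ jumps) (PP (IN over) (NP (DT the) (JJ lazy) (NN dog))))))"
)
tree.pretty_print()          # draws the constituency tree as ASCII art
print(tree.label(), [sub.label() for sub in tree[0]])   # ROOT ['NP', 'VP']
```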

17. Prepare a tagging structure using Brill’s Tagger for the sentence – 'They refuse to
permit us to obtain the refuse permit.'
Brill's Tagger Tagging Structure:

1. Tokenize the Sentence

They, refuse, to, permit, us, to, obtain, the, refuse, permit

2. Assign Initial Tags

Word Tag

They PRP

refuse VB

to TO

permit VB

us PRP

to TO

obtain VB

the DT

refuse NN

permit NN

3. Apply Transformation Rules

• Rule 1: If a word tagged as a base-form verb (VB) immediately follows a
personal pronoun (PRP), retag it as a present-tense verb (VBP).
▪ The first "refuse" (after "They") is retagged as "VBP".
• Rule 2: If a word tagged as a verb immediately follows a determiner (DT),
retag it as a singular noun (NN).
▪ This rule would retag the second "refuse" (after "the") as "NN"; here
the initial tags already mark the second "refuse" and the second
"permit" as nouns, so they are left unchanged.

Word Tag

They PRP

refuse VBP

to TO

permit VB

us PRP

to TO

obtain VB

the DT

refuse NN

permit NN

Final Tagged Sentence:

They/PRP refuse/VBP to/TO permit/VB us/PRP to/TO obtain/VB the/DT refuse/NN
permit/NN
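
A toy sketch of the rule-application step above (the initial lexicon and the transformation rules are illustrative assumptions, not NLTK's nltk.tag.brill implementation):

```python
# Toy Brill-style tagging: one initial tag per word, then context-based rules.
sentence = "They refuse to permit us to obtain the refuse permit".split()

initial_tag = {"They": "PRP", "refuse": "VB", "to": "TO", "permit": "VB",
               "us": "PRP", "obtain": "VB", "the": "DT"}     # made-up lexicon
tags = [initial_tag[w] for w in sentence]

# Transformation rules: (from_tag, to_tag, required_previous_tag)
rules = [("VB", "VBP", "PRP"),   # verb after a personal pronoun -> present-tense verb
         ("VB", "NN", "DT"),     # verb after a determiner -> noun
         ("VB", "NN", "NN")]     # verb after a noun -> noun (handles "refuse permit")

for frm, to, prev in rules:
    for i in range(1, len(tags)):
        if tags[i] == frm and tags[i - 1] == prev:
            tags[i] = to

print(list(zip(sentence, tags)))
# [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'),
#  ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN')]
```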

18. How the word 'car' will be structured in the WordNet based on the following: Synsets,
Hypernyms, Hyponyms, Meronyms, Holonyms.
Synsets:

• A motorized vehicle with wheels for transporting passengers

Hypernyms:

• Vehicle

Hyponyms:

• Sedan
• Coupe
• Convertible
• Hatchback

Meronyms:

• Engine
• Transmission
• Wheels

Holonyms:

• Fleet or convoy (wholes of which a car can be a part). A garage is where a
car is kept, not a whole that has the car as a part, so it is not a holonym.
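
These relations can be inspected directly with NLTK's WordNet interface (a sketch; it assumes nltk and its 'wordnet' corpus are installed, and the exact synsets returned depend on the WordNet version):

```python
# Exploring WordNet relations for "car" (may require: nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

car = wn.synset("car.n.01")                 # the motor-vehicle sense of "car"
print(car.definition())
print(car.hypernyms())                      # e.g. [Synset('motor_vehicle.n.01')]
print(car.hyponyms()[:4])                   # e.g. coupe, convertible, cab, ...
print(car.part_meronyms()[:4])              # parts of a car, e.g. car_door, ...
print(car.part_holonyms())                  # wholes containing a car (often empty here)
```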

19. Explain the structure of semantic roles in PropBank using examples.
Structure of Semantic Roles in PropBank

PropBank is a database of annotated sentences that assigns semantic roles to verb
arguments. These roles capture the relationships between the verb and its arguments,
providing insights into the meaning of the sentence.

Core Roles:

• Agent (A0): The primary actor performing the action. (e.g., "John" in
"John opened the door.")
• Patient (A1): The recipient or target of the action. (e.g., "the door" in
"John opened the door.")
• Instrument (A2): The means by which the action is performed. (e.g., "the
key" in "He unlocked the door with the key.")

Additional Numbered Roles (their interpretation is verb-specific and defined in each verb's frame file):

• Beneficiary (A3): The recipient of the benefits of the action. (e.g., "the
children" in "He bought toys for the children.")
• Destination (A4): The location towards which the action is directed. (e.g.,
"the table" in "He put the book on the table.")

Examples:

• "The dog ate the bone."


▪ Agent (A0): the dog
▪ Patient (A1): the bone
• "The man opened the door with a key."
▪ Agent (A0): the man
▪ Patient (A1): the door
▪ Instrument (A2): a key
• "She bought flowers for her mother."
▪ Agent (A0): she
▪ Patient (A1): flowers
▪ Beneficiary (A3): her mother

By identifying semantic roles, PropBank helps capture the underlying meaning of
sentences and facilitates tasks such as information extraction and natural language
understanding.

20. Provide the structure of annotation in PropBank for the sentence: 'The company sold
the subsidiary.'
Annotation Structure in PropBank:

In PropBank, each verb phrase in a sentence is assigned a semantic role frame
(SRF). The SRF consists of a predicate (the verb) and a set of arguments
(semantic roles).

For the sentence "The company sold the subsidiary":

Predicate: sell

Arguments:

• Agent (A0): The company


• Theme (A1): the subsidiary

Annotation:
[ARG0 The company] [V sold] [ARG1 the subsidiary]

Breakdown:

• [ARG0 The company] represents the entity that performs the action of
selling.
• [ARG1 the subsidiary] represents the entity that is sold.
• [V sold] represents the verb (predicate) that describes the action; its lemma is "sell".

This annotation tells us that the semantic roles of the sentence are:

• The company is the seller (agent).


• The subsidiary is the thing being sold (theme).

21. Briefly describe Coherence relation between utterances and Coherence relation
between entities.
Coherence Relation between Utterances:

• Ensures that individual sentences within a text flow smoothly and connect
meaningfully.
• Can be achieved through logical connectors (e.g., and, but, or), anaphora
(e.g., pronouns), or shared themes.

Coherence Relation between Entities:

• Establishes links between entities (e.g., people, places, events) mentioned
in a text.
• Provides context and facilitates understanding by making relationships
explicit.
• Can be represented in knowledge graphs or concept maps.

22. Briefly describe Unsupervised and Supervised algorithms for Discourse
Segmentation.
Unsupervised Algorithms

• Lexical Cohesion: Identifies segments based on the presence of cohesive
words (e.g., connectors, pronouns).
• Coherence: Analyzes the semantic relatedness of words and sentences
within segments.
• Clustering: Groups similar sentences together based on features like
syntactic structure or semantic content.

Supervised Algorithms

• Machine Learning: Trains a model on labeled data to predict segment
boundaries.
• Hidden Markov Models (HMMs): Statistical models that represent discourse
as sequences of hidden states (segments).
• Conditional Random Fields (CRFs): Graphical models that encode
dependencies between words and segment labels.
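
As a concrete illustration of the lexical-cohesion idea above, here is a toy unsupervised segmenter (the overlap measure, threshold, and example sentences are all illustrative assumptions, not NLTK's TextTiling algorithm):

```python
# Toy unsupervised discourse segmentation based on lexical overlap between
# adjacent sentences: a boundary is placed wherever cohesion drops.
def word_overlap(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    return len(w1 & w2) / max(1, min(len(w1), len(w2)))

def segment(sentences, threshold=0.1):
    segments, current = [], [sentences[0]]
    for prev, sent in zip(sentences, sentences[1:]):
        if word_overlap(prev, sent) < threshold:   # low cohesion -> new segment
            segments.append(current)
            current = []
        current.append(sent)
    segments.append(current)
    return segments

sents = ["The cat chased the mouse", "The mouse hid from the cat",
         "Stock prices fell sharply today", "Investors sold shares as prices fell"]
print(segment(sents))
# [['The cat chased the mouse', 'The mouse hid from the cat'],
#  ['Stock prices fell sharply today', 'Investors sold shares as prices fell']]
```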

23. Describe Hobbs' proposed solutions for the following discourse relations with example:
Result, Explanation, Parallel.
Result:

• Hobbs defines a Result relation when the state or event asserted by the
first utterance causes, or could cause, the one asserted by the second.
• Example: "The brakes failed. The car crashed."

Explanation:

• An Explanation relation holds when the second utterance gives the cause or
reason for the first.
• Example: "I failed the test. I had not studied at all."

Parallel:

• A Parallel relation holds when the two utterances assert similar
propositions about similar entities, often signalled by phrases such as
"similarly" or "in addition."
• Example: "I like pizza. Similarly, I also enjoy pasta."

24. Identify the Initial Center, Backward Center and Forward Center in the sentence: 'Alice
went to the store. She bought milk. Then, Bob arrived home.'
Initial (Preferred) Center: Alice, the highest-ranked entity of the first
utterance ("Alice went to the store.").

Backward-looking Center: Alice; the pronoun "She" in the second utterance
realizes the backward-looking center and links it to the previous utterance.

Forward-looking Centers: the entities each utterance makes available for later
reference: {Alice, the store} after the first utterance, {Alice, milk} after the
second, and {Bob, home} after the third, where the center shifts to Bob.

25. List the set of pre-defined roles used by PropBank.


PropBank Pre-defined Roles:

• Arg0 (A0): The proto-agent, i.e., the primary agent responsible for the action.
• Arg1 (A1): The proto-patient, i.e., the entity affected by the action.
• Arg2 (A2): Typically an instrument, benefactive, or attribute (verb-specific).
• Arg3 (A3): Typically a starting point, benefactive, or attribute (verb-specific).
• Arg4 (A4): Typically an ending point or destination (verb-specific).
• Arg5 (A5): Used rarely, with a verb-specific meaning.
• ArgM: Modifier roles shared by all verbs, such as ArgM-TMP (time),
ArgM-LOC (location), ArgM-MNR (manner), ArgM-CAU (cause), and ArgM-NEG
(negation).

26. Discourse Analysis is beneficial for Information retrieval. Justify this statement.
Discourse Analysis benefits Information Retrieval (IR) by:
1. Identifying Document Structure: Discourse Analysis reveals how text is
organized into coherent units (e.g., paragraphs, headings). This structure
can guide IR systems to identify relevant passages and focus their retrieval
efforts.
2. Understanding Cohesion and Coherence: Discourse Analysis investigates
how sentences and paragraphs connect logically (e.g., through referential
chains, connectors). This understanding helps IR systems identify
relationships between keywords and select relevant documents that provide
cohesive and coherent information.
3. Inferring User Intent: Discourse Analysis analyzes the context in which
users' queries are formulated. It can help IR systems understand the
user's underlying goals, which can inform the selection of appropriate
documents and ranking of results.
4. Evaluating Relevance: Discourse Analysis provides a framework for
evaluating the relevance of documents to a user's query. By analyzing
the coherence and referential relationships within a document, IR systems
can determine if it adequately addresses the user's information need.
5. Improving User Experience: Discourse-aware IR systems can enhance the
user experience by providing more targeted and contextually relevant
results. This can reduce the effort required for users to find the information
they seek and improve overall satisfaction with the IR system.

27. Discourse Analysis is beneficial for Text summarization. Justify this statement.

Discourse Analysis Benefits for Text Summarization:

Discourse analysis, which examines the structure and coherence of a text, provides
valuable insights for effective text summarization. Here are the reasons why:

1. Identification of Key Structures: Discourse analysis reveals the underlying structure
of a text, including its main ideas, supporting arguments, and transitions. This
understanding helps identify the most important parts of the text for inclusion in the
summary.

2. Coherence Analysis: By examining how different parts of the text connect and flow
together, discourse analysis ensures that the summary maintains the original text's
coherence. This results in a summary that is well-organized and easy to follow.

3. Identification of Coreferential Expressions: Coreferential expressions refer to words
or phrases that refer back to previously mentioned concepts. Discourse analysis
identifies these expressions and their referents, allowing the summary to omit redundant
information and maintain focus on the key ideas.

4. Handling Complex Texts: Discourse analysis is particularly useful in summarizing
complex texts that contain multiple subtopics, digressions, or embedded clauses. By
understanding the text's structure and coherence, the summary can effectively condense
and represent the main points.

5. Maintaining the Author's Purpose: Discourse analysis helps ensure that the summary
accurately conveys the author's intended purpose and perspective. By understanding
the text's language use, tone, and implicit meanings, the summary can effectively
capture the author's main message.

28. Discourse Analysis is beneficial for Machine translation. Justify this statement.
Discourse Analysis Enhances Machine Translation

Discourse analysis, the study of how language is used in context, is crucial for
machine translation because it:
1. Captures Contextual Meaning: Discourse analysis helps understand the
pragmatic and semantic relationships within a text, allowing machines to
translate not just individual words but also entire sentences or passages
coherently.
2. Resolves Ambiguities: Context provides clues to resolve ambiguous words
or phrases. By analyzing discourse, machines can identify the intended
meaning and produce more accurate translations.
3. Preserves Text Structure: Discourse analysis aids in understanding the
overall structure and flow of a text, which is essential for maintaining
coherence and cohesion in the translation.
4. Improves Cohesion and Coherence: It helps identify relationships between
text components (e.g., anaphora, ellipsis) and ensures that these
relationships are accurately reflected in the translation.
5. Enhances Cultural Sensitivity: Discourse analysis helps account for cultural
and linguistic differences between languages, reducing the risk of
mistranslations that arise from literal or surface-level interpretations.

29. Discourse Analysis is beneficial for Question answering. Justify this statement.
Discourse Analysis Enhances Question Answering

Discourse analysis, which examines how language is used in context, significantly
benefits question answering tasks by:

• Uncovering Cohesion and Coherence: Discourse analysis identifies
relationships between sentences and paragraphs, revealing the overall
structure and flow of the text. This aids in understanding the context and
extracting relevant information.
• Identifying Topic and Themes: By analyzing discourse, we can pinpoint the
main topics and themes discussed. This enables question answering
systems to focus on the most pertinent content.
• Resolving Ambiguity: Discourse analysis helps resolve ambiguity in
language by understanding the context and relationships between concepts.
This aids in interpreting questions and retrieving accurate answers.
• Inferring Missing Information: Through discourse analysis, we can deduce
information that is not explicitly stated in the text. This enhances question
answering by providing a broader understanding of the subject matter.
• Handling Textual Dynamics: Discourse analysis accounts for the dynamic
nature of text, including changes in perspective, shifts in time, and logical
connections. This allows question answering systems to better navigate
complex texts.

30. Summarize the main aspects of Reference Phenomena.


Reference Phenomena

Reference phenomena refers to how humans use language to refer to objects, events,
and concepts in the real world. There are three main aspects to consider:

1. Referent: The entity (object, event, or concept) that a referring expression
(e.g., a word, phrase) designates.
2. Referring Expression: The linguistic element (e.g., "the book,"
"yesterday") used to establish a connection to the referent.
3. Context: The surrounding linguistic and non-linguistic information that helps
determine the referent.

Reference phenomena is crucial because it allows us to understand how individuals
convey information about shared objects and experiences. It involves the ability to
identify and interpret referring expressions correctly, taking into account the context
and the intentions of the speaker.

31. Summarize the Reference Phenomena for the text: 'Alice loves to read books. She
often visits the library to borrow new ones. The librarian there is very helpful and always
recommends interesting titles.'
Reference Phenomena

In the text, the following reference phenomena can be observed:

1. Anaphora: "She" refers back to Alice, the subject introduced in the first
sentence.
2. One-anaphora: "new ones" refers back to "books" mentioned in the first
sentence, with "ones" substituting for the noun.
3. Locative anaphora: "there" refers back to "the library".
4. Definite reference: "the librarian" refers to a specific individual who is
understood to be unique in the context (the librarian of the library just
mentioned).
5. Coreference: "Alice" and "She" form a coreference chain referring to the
same person.

32. Summarize the Reference Phenomena for the text: 'John bought a new car. He is very
excited about it. The car is sleek and has advanced features. John plans to take it on a
road trip next week.'
Reference Phenomena:

References occur when a word or phrase refers back to something previously
mentioned in the text. Here's how references are used in the given text:

• Anaphora: "He" refers back to "John", and "it" refers back to "a new car".
• Definite reference: "The car" in the third sentence refers back to the car
introduced in the first sentence ("a new car").
• Endophora: All of the references are resolved within the text itself; there
is no exophoric reference to entities outside the text.
• Coreference: "John" and "He" refer to the same individual, and "a new car",
"it", and "The car" refer to the same vehicle.

33. Summarize the Reference Phenomena for the text: 'Alice was walking through the
park when she saw a beautiful flower. She bent down to smell it, but then she noticed a
caterpillar crawling on its petals. Alice screamed and jumped back, startled by the
unexpected creature.'
Reference Phenomena Summary:

The text illustrates a classic example of reference phenomena known as anaphoric
reference. Here, the pronoun "it" refers back to the previously mentioned noun,
"flower". This type of reference is used to connect related elements in a discourse
and helps maintain the coherence of the text.

Specifically, the anaphor "it" is a pronominal reference to the antecedent "flower".


The pronoun substitutes the noun, avoiding repetition and maintaining a smooth flow
of information. By doing so, it establishes a connection between the two entities and
clarifies the relationship between them.

34. Coreference Pairing is an important step in Coreference Resolution. Justify the
statement.
Coreference Pairing: A Crucial Step in Coreference Resolution

Coreference Resolution is the task of identifying and linking words or phrases in a
text that refer to the same entity. Coreference Pairing is a key step in this process,
where pairs of expressions are identified as potential coreferences. Here's why it's
important:

1. Identifying Coreferences: Coreference Pairing helps in identifying potential
coreferences that may not be immediately obvious. For example, "the boy" and "he"
can refer to the same person, but without pairing, they may not be recognized as
such.

2. Reduced Search Space: By pairing potential coreferences, the search space for
coreference resolution is narrowed down. This makes the task more efficient and
reduces the chance of incorrect coreference assignment.

3. Disambiguation: Coreference Pairing assists in disambiguating between different
meanings of a word or phrase. For example, if "John" and "Bill" are mentioned in
a text, pairing can help identify whether "he" refers to John or Bill based on the
context.

4. Improved Performance: Accurate Coreference Pairing leads to better performance in
coreference resolution. It provides a foundation for subsequent steps, such as pronoun
resolution and discourse analysis, to accurately link coreferences throughout the text.

5. Information Extraction and Summarization: Coreference Resolution plays a crucial
role in information extraction and text summarization. Proper Coreference Pairing
ensures that information about an entity is correctly attributed, leading to more accurate
and coherent summaries.

35. Deep learning, particularly neural networks, has revolutionized coreference
resolution. Justify the statement.
Deep learning has revolutionized coreference resolution for several reasons:

1. Enhanced Feature Extraction: Neural networks can learn complex relationships and
patterns in text that traditional natural language processing (NLP) methods may
miss. This allows them to extract more comprehensive and accurate features for
coreference resolution, leading to improved performance.

2. Contextual understanding: Deep learning models consider the wider context of a
text to determine coreference. They analyze the relationships between words and
sentences, which helps in distinguishing between referents that may have similar
surface forms but refer to different entities.

3. Robustness to Variability: Neural networks are more robust to variations in language
usage. They can handle ambiguous pronouns, complex syntactic structures, and
unknown words, making them more effective in resolving coreference in real-world
text.

4. Scalability: Deep learning models can be trained on massive datasets, allowing
them to learn from a wide range of language variations and styles. This scalability
contributes to improved accuracy and generalization ability.

5. Integration with other NLP Tasks: Neural networks for coreference resolution can be
easily integrated with other NLP tasks, such as named entity recognition and part-
of-speech tagging. This integration allows for synergistic performance improvements.

By leveraging these advantages, deep learning approaches have significantly improved
the accuracy and efficiency of coreference resolution, making it an essential component
of modern speech and natural language processing systems.

36. Define the modalities of natural language processing.


Modalities of Natural Language Processing:

NLP involves processing various forms of natural language input, known as modalities.
These include:

1. Text: Written or typed language, such as articles, books, or emails.
2. Speech: Spoken language, captured through audio recordings or transcriptions.
3. Dialogue: Interactive conversations between two or more speakers.
4. Multimodal: Combinations of different modalities, such as text-to-speech or
video-to-text.
5. Non-linguistic Features: Additional information associated with language,
such as sentiment, context, or speaker identity.

37. Recognize the importance of several components that NLP is made up of.
5 Components of NLP and Their Importance
1. Lexicon: A dictionary of words and their meanings. It allows NLP to
understand the words in a text.
2. Syntax: The rules governing how words are combined into sentences. It
helps NLP identify the structure and relationships within text.
3. Semantics: The meaning of words and sentences. It enables NLP to
extract the intent and concepts conveyed by text.
4. Pragmatics: The context and real-world knowledge used to interpret
language. It allows NLP to understand the implied meaning and intentions
behind text.
5. Machine Learning: Algorithms that enable NLP systems to learn from data
and improve their performance over time. It empowers NLP to handle
complex language tasks, such as sentiment analysis and text
summarization.

38. Explain NLP Processing steps.


NLP Processing Steps:

NLP processing involves several steps to extract meaningful information from text
data:

1. Tokenization: Breaking down the text into individual words or phrases
called tokens.
2. Stemming/Lemmatization: Reducing words to their root form to remove
variations (e.g., "running" becomes "run").
3. Part-of-Speech Tagging: Identifying the grammatical role of each token
(e.g., noun, verb, adjective).
4. Parsing: Creating a structural representation of the sentence to understand
its syntax and meaning.
5. Named Entity Recognition: Identifying and classifying specific entities within
the text (e.g., people, places, organizations).

39. Identify and Describe the ambiguity in following sentences - a. The man kept the dog
in the house b. Book the flight c. I saw a bat d. I made her duck e. John and Mary are
married.
a. The man kept the dog in the house

• Ambiguity: Structural (attachment) ambiguity. The phrase "in the house" can
modify the verb "kept" (the keeping took place in the house) or the noun
phrase "the dog" (the particular dog that lives in the house).

b. Book the flight

• Ambiguity: Lexical ambiguity. "Book" can be a noun or a verb, and as a verb
it has several senses (e.g., reserve a seat, record charges against
someone); a parser must pick the intended verb reading "reserve" from
context.

c. I saw a bat

• Ambiguity: Lexical (word-sense) ambiguity. "Bat" can refer to the flying
animal or to a baseball bat, and "saw" can be the past tense of "see" or a
form of the verb "to saw" (cut).

d. I made her duck

• Ambiguity: Both lexical and structural ambiguity. The sentence can mean "I
caused her to lower her head" ("duck" as a verb), "I cooked duck for her",
or "I prepared the duck belonging to her" ("duck" as a noun, with "her" as
either an indirect object or a possessive).

e. John and Mary are married.

• Ambiguity: Collective vs. distributive ambiguity. The sentence can mean that
John and Mary are married to each other, or that each of them is married
(possibly to other people).

40. Describe morphology along with an example in details.


Morphology in linguistics refers to the study of the structure and formation of words,
including their parts, their formation, and their inflection.

For example, the word "beautifully" is composed of three parts:

• "beauty" (the root, which carries the core meaning of the word)
• "-ful" (a derivational suffix meaning "full of" or "having", which forms the
adjective "beautiful")
• "-ly" (an adverbial suffix, which turns the adjective into the adverb
"beautifully")

By analyzing the morphology of "beautifully," we can understand its meaning, its part
of speech, and its relationship to other words in the language. This information is
essential for understanding the syntax and semantics of the sentence in which the
word is used.

41. Summarize the use of FSTs (Finite State Transducers) in morphological analysis.
FSTs (Finite State Transducers) are powerful tools used in morphological analysis,
the study of word structure. They are finite state machines that map input strings to
output strings, allowing for the representation and manipulation of morphological rules
and transformations. FSTs are widely used in NLP (Natural Language Processing)
due to their efficiency, flexibility, and ability to handle complex morphological
phenomena.

42. Summarize the role of Regular Expression in morphological analysis.


Role of Regular Expressions in Morphological Analysis

Regular expressions are powerful tools that play a significant role in morphological
analysis, the study of how words are structured. They allow us to:

• Identify morphological components: Regular expressions can extract
prefixes, suffixes, and stems from words, providing insights into their
morphological structure.
• Match specific patterns: We can create patterns to match particular word
forms or sequences of morphemes, helping us identify different parts of
speech or grammatical functions.
• Automate rule-based morphology: By encoding morphological rules as
regular expressions, we can automate morphological analysis tasks, making
them more efficient and consistent.
• Detect morphological errors: Regular expressions can be used to identify
words that do not conform to expected morphological patterns, indicating
potential errors or misspellings.
• Comparative morphology: Regular expressions facilitate the comparison of
words and morphological patterns across different languages, aiding in
linguistic research and language learning.
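
A toy sketch of the idea (the pattern below handles only a few English suffixes and is purely illustrative):

```python
# Splitting simple suffixes from word forms with a regular expression.
import re

pattern = re.compile(r"^(?P<stem>\w+?)(?P<suffix>ing|ed|s)?$")
for word in ["walking", "walked", "walks", "walk"]:
    m = pattern.match(word)
    print(word, "->", m.group("stem"), "+", (m.group("suffix") or ""))
```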

43. Observe the problems associated with stemming and estimate a solution for the
problem.
Problems with Stemming:

• Loss of Meaning: Chopping off suffixes can produce stems that are not real
words and can blur meaning distinctions (e.g., "computing" and "computer"
both become "comput").
• Over-Stemming: Too much of a word is removed, so unrelated words collapse
to the same stem (e.g., "university" and "universe" both become "univers").
• Under-Stemming: Related forms are not reduced to the same stem (e.g.,
"data" and "datum" remain distinct), leading to imprecise results.

Solution:

One solution to these problems is to use lemmatization, a more sophisticated
technique that considers the context and morphology of words to identify their base
form. Lemmatization involves:

• Identifying the part of speech of the word.


• Examining its morphological structure and grammatical rules.
• Selecting the appropriate stem or lemma that retains the intended meaning.

Lemmatization provides more accurate stems while preserving the original meaning of
words. However, it requires a larger vocabulary and can be computationally intensive
compared to stemming.

44. Discuss the problems associated with lemmatization and estimate a solution for the
problem.
Problems with Lemmatization:

• Homographs: Words with the same spelling but different meanings can
have different lemmas (e.g., "bass" for the fish vs. the musical
instrument).
• Context Dependency: Lemmas can vary depending on the context of the
sentence (e.g., "run" can be a verb or a noun).
• Inflectional Variation: Lemmas represent the root form of a word, but
words can have multiple inflections (e.g., "run," "runs," "running").

Estimated Solution:

Enhanced Lemmatization Techniques:

• Use part-of-speech tagging to identify the grammatical class of a word,
which can help disambiguate homographs.
• Employ syntactic analysis to determine the context and syntactic role of
a word.
• Develop machine learning models trained on large text corpora to predict
lemmas based on word forms and context.

Additional Strategies:

• Consider using stemming algorithms as a less linguistically accurate but
computationally efficient alternative to lemmatization.
• Implement manual rule-based approaches for specific homographic or
context-dependent words.
• Evaluate and iterate: Regularly assess the performance of lemmatization
techniques and refine them as needed.

45. With the help of examples, describe Ngrams, Unigrams, Bigrams, and Trigrams.
Ngrams

Ngrams are sequences of words used in natural language processing. They capture
patterns and relationships within text data.

Types of Ngrams:

1. Unigram (N=1): A single word. Example: "the"

2. Bigram (N=2): A pair of adjacent words. Example: "the dog"

3. Trigram (N=3): A sequence of three adjacent words. Example: "the big dog"

Applications:

• Language Modeling: Predicting the next word in a sequence.


• Machine Translation: Translating text from one language to another.
• Text Classification: Categorizing text into different classes (e.g., spam,
not spam).

Example:

Consider the sentence: "The big brown dog ran down the street."

• Unigrams: the, big, brown, dog, ran, down, street


• Bigrams: the big, big brown, brown dog, dog ran, ran down, down the,
the street
• Trigrams: the big brown, big brown dog, brown dog ran, dog ran down,
ran down the, down the street

By analyzing these ngrams, we can gain insights into the structure and meaning of
the text.
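
A minimal sketch of n-gram extraction in plain Python (nltk.ngrams would give the same result; the tokenization here is a simple whitespace split):

```python
# Generate unigrams, bigrams, and trigrams from a tokenized sentence.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the big brown dog ran down the street".split()
print(ngrams(tokens, 1))   # unigrams: ('the',), ('big',), ...
print(ngrams(tokens, 2))   # bigrams:  ('the', 'big'), ('big', 'brown'), ...
print(ngrams(tokens, 3))   # trigrams: ('the', 'big', 'brown'), ...
```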

46. Employ an example to examine the Word Level Analysis.


Example: Word Level Analysis

Given sentence: "The quick brown fox jumps over the lazy dog."

Word Level Analysis:


• Tokenization: Divide the sentence into individual words:
▪ ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy",
"dog"]
• Stemming: Remove suffixes and prefixes to obtain word stems:
▪ ["The", "quick", "brown", "fox", "jump", "over", "the", "lazy",
"dog"]
• Lemmatization: Identify the base form of words considering their part of
speech:
▪ ["The", "quick", "brown", "fox", "jump", "over", "the", "lazy",
"dog"]
• Part-of-Speech (POS) Tagging: Identify the grammatical role of each word
in the sentence:
▪ ["DET", "ADJ", "ADJ", "NOUN", "VERB", "ADP", "DET",
"ADJ", "NOUN"]
• Chunking: Group words into phrases based on grammatical relationships:
▪ ["The quick brown fox", "jumps over", "the lazy dog"]

Significance of Word Level Analysis:

Word level analysis helps machines understand the individual units of speech in a
sentence, enabling them to:

• Determine the meaning of words and sentences
• Identify grammatical structures
• Recognize relationships between words
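
The first few of these steps can be sketched with NLTK (an assumption; the 'punkt' and 'averaged_perceptron_tagger' data packages need to be downloaded for tokenization and tagging):

```python
# Word-level analysis sketch: tokenization, stemming, and POS tagging with NLTK.
import nltk
from nltk.stem import PorterStemmer

sentence = "The quick brown fox jumps over the lazy dog."
tokens = nltk.word_tokenize(sentence)            # tokenization
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]        # stemming
tagged = nltk.pos_tag(tokens)                    # part-of-speech tagging

print(tokens)
print(stems)
print(tagged)    # e.g. [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ...]
```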

47. How to apply context-free grammar rules to analyze the structure of simple English
sentences.
How to Apply Context-Free Grammar Rules to Analyze English Sentences:

1. Identify the start symbol: This is the highest-level non-terminal symbol,
usually "S" for "Sentence."
2. Parse the sentence: Apply the grammar rules to break down the sentence
into smaller constituents.
3. Follow the tree structure: The grammar rules define how to expand non-
terminal symbols into terminal symbols and other non-terminal symbols,
creating a tree-like structure.
4. Identify sentence constituents: As you apply the rules, you will identify
components like noun phrases, verb phrases, and phrases within phrases.
5. Label the constituents: Assign labels to the constituents based on their
grammatical function (e.g., NP for noun phrase, VP for verb phrase).

Example:

Analyze the sentence "The quick brown fox jumped over the lazy dog." using the
following grammar rules:

• S -> NP VP
• NP -> Det N | Det Adj N | Det Adj Adj N
• VP -> V PP
• PP -> P NP
• Det -> the
• Adj -> quick | brown | lazy
• N -> fox | dog
• V -> jumped
• P -> over

Steps:

1. Start symbol: S
2. Parse the sentence: S -> NP VP -> Det Adj Adj N VP -> "The quick brown
fox" VP -> "The quick brown fox" V PP -> "The quick brown fox" jumped P
NP -> "The quick brown fox" jumped over Det Adj N -> "The quick brown fox
jumped over the lazy dog"
3. Identify constituents: NP "The quick brown fox" (subject), VP "jumped over
the lazy dog" (predicate), PP "over the lazy dog", NP "the lazy dog"
4. Label constituents: NP, VP, PP, NP
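
The same grammar can be run through NLTK's chart parser to produce the tree automatically (a sketch, assuming nltk is installed):

```python
# Parse the sentence with NLTK using the grammar listed above.
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N | Det Adj N | Det Adj Adj N
VP -> V PP
PP -> P NP
Det -> 'the'
Adj -> 'quick' | 'brown' | 'lazy'
N -> 'fox' | 'dog'
V -> 'jumped'
P -> 'over'
""")

parser = nltk.ChartParser(grammar)
tokens = "the quick brown fox jumped over the lazy dog".split()
for tree in parser.parse(tokens):
    tree.pretty_print()
```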

48. Compare and contrast dependency grammar and context-free grammar for
representing sentence structure.
Dependency Grammar

• Represents sentence structure as a tree with words as nodes and
dependency relationships between them.
• Focuses on hierarchical relationships: each word has a single head word
(except for the root).
• Example: "The boy ate a sandwich." has "ate" as the root; "boy" (with its
determiner "The") depends on "ate" as the subject, and "sandwich" (with its
determiner "a") depends on "ate" as the object:

ate
  boy (subject)
    The (determiner)
  sandwich (object)
    a (determiner)

(Indentation indicates the head-to-dependent relationship.)

Context-Free Grammar

• Represents sentence structure as a set of rules that can be used to
generate all valid sentences in a language.
• Uses production rules to define how strings can be transformed into other
strings.
• Example: A rule for generating noun phrases might be:

NounPhrase -> Determiner Noun

Comparison

| Feature | Dependency Grammar | Context-Free Grammar |
|---|---|---|
| Structure | Tree | Rules |
| Relationships | Hierarchical | Sequential |
| Head words | Single head | Not specified |
| Flexibility | Can handle complex sentences | More limited |
| Generative power | Less generative | More generative |

Contrast

• Dependency grammar focuses on the relationships between words, while
context-free grammar focuses on the rules that generate valid sentences.
• Dependency grammar represents sentence structure as a tree, while
context-free grammar uses rules to define how strings can be transformed.
• Dependency grammar is more suitable for representing complex sentences,
while context-free grammar is more generative.
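
For comparison, a dependency parse of the example sentence can be produced with spaCy (an assumption; it requires the spacy package and the en_core_web_sm model to be installed):

```python
# Dependency parsing sketch with spaCy.
# Requires: pip install spacy  and  python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The boy ate a sandwich.")
for token in doc:
    print(f"{token.text:10} {token.dep_:8} <- {token.head.text}")
# Typical output: "boy" is the nsubj of "ate", "sandwich" the dobj of "ate",
# and "The"/"a" are det dependents of their nouns.
```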

49. Determine whether a statement has any possible ambiguities and describe how
various parsing algorithms may handle them.
Ambiguity in Natural Language:

Ambiguity occurs when a sentence can be interpreted in multiple ways due to unclear
grammar or semantics.

How Parsing Algorithms Handle Ambiguity:

1. Top-Down Parsing (Recursive Descent):

• Attempts to match the input sentence against a set of predefined grammar
rules.
• Can get stuck in infinite recursion if the rules are ambiguous.

2. Bottom-Up Parsing (Shift-Reduce):

• Constructs the parse tree by combining smaller constituents into larger
ones.
• Can introduce ambiguity if multiple parse trees can be generated.

3. Chart Parsing:

• Builds a table of all possible constituents and their dependencies.


• Can handle ambiguity by keeping track of multiple interpretations.

4. Augmented Transition Network Parsing:

• Uses a finite-state machine to represent the grammar.


• Can handle ambiguity by allowing multiple paths through the network.

5. Semantic Attachment:

• Associates semantic representations with constituents.


• Can resolve ambiguity by considering the context and meaning of the
sentence.

Example:

Sentence: "John saw the man with a telescope."

Ambiguity: "with a telescope" could modify "John" or "man."

Top-Down Parsing: May fail to determine the intended interpretation.

Bottom-Up Parsing: Can generate multiple parse trees, one for each interpretation.

Chart Parsing: Records both interpretations and allows further processing to resolve
the ambiguity.

Augmented Transition Network Parsing: Follows a separate path through the network
for each interpretation; the augmented registers and tests can be used to prefer
one reading based on the grammar and context.

Semantic Attachment: Uses semantic knowledge to pick the more plausible
attachment, e.g., attaching "with a telescope" to the verb "saw" if the telescope
is understood as the instrument of seeing, or to "man" if he is understood to be
carrying it.
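
The two readings can be made explicit with a small NLTK grammar in which the prepositional phrase may attach either to the verb phrase or to the noun phrase (a sketch, assuming nltk is installed; the grammar is an illustrative assumption):

```python
# A deliberately ambiguous grammar: the chart parser returns two trees for the
# sentence, one per attachment of "with a telescope".
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'John' | Det N | Det N PP
VP -> V NP | V NP PP
PP -> P NP
Det -> 'the' | 'a'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
tokens = "John saw the man with a telescope".split()
for tree in parser.parse(tokens):
    print(tree)    # two parses: PP attached to the VP, or to the NP "the man"
```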

50. For particular NLP jobs, compare and contrast the benefits and drawbacks of shallow
parsing versus dynamic programming parsing.
Shallow Parsing

• Benefits:
▪ Fast and efficient
▪ Can handle ungrammatical sentences
▪ Useful for extracting specific information (e.g., named entities)
• Drawbacks:
▪ Does not provide a complete syntactic analysis
▪ May overgenerate or undergenerate parse trees

Dynamic Programming Parsing

• Benefits:
▪ Provides a complete syntactic analysis
▪ Can handle complex and ambiguous sentences
▪ Supports disambiguation techniques
• Drawbacks:
▪ Slower and more computationally expensive than shallow parsing
▪ May be impractical for large text corpora

Comparison

| Feature | Shallow Parsing | Dynamic Programming Parsing |
|---|---|---|
| Speed | Fast | Slow |
| Completeness | Partial | Complete |
| Grammaticality | Can handle ungrammatical sentences | Requires grammatical sentences |
| Disambiguation | Limited | Extensive |
| Computational cost | Low | High |
| Best suited for | Extracting specific information, named entity recognition | Complete syntactic analysis, ambiguity resolution |

51. Calculate the levenshtein distance between INTENTION and EXECUTION using
Minimum Edit Distance algorithm.
The Levenshtein distance is a measure of the similarity between two strings. It is
computed by counting the minimum number of insertions, deletions, and substitutions
required to transform one string into another.

To calculate the Levenshtein distance between INTENTION and EXECUTION, we build a
matrix with one row for each prefix of EXECUTION and one column for each prefix of
INTENTION (the empty prefix is marked #). The first row and first column are
initialized to 0, 1, 2, ..., 9: the cost of building each prefix from the empty
string by insertions or deletions alone.

The cell in row i and column j holds the minimum number of edits needed to turn
the first i characters of EXECUTION into the first j characters of INTENTION (the
distance is symmetric, so the two strings can be swapped without changing the
result).

To fill in the matrix, we can use the following rules:

• If the ith character of EXECUTION equals the jth character of INTENTION, the
cell simply copies its diagonal neighbour: D[i][j] = D[i-1][j-1].
• Otherwise, the cell is one more than the smallest of its three neighbours:
▪ D[i-1][j] + 1 (delete a character),
▪ D[i][j-1] + 1 (insert a character),
▪ D[i-1][j-1] + 1 (substitute one character for the other).

Using these rules, we can fill in the matrix as follows:

|   | # | I | N | T | E | N | T | I | O | N |
|---|---|---|---|---|---|---|---|---|---|---|
| # | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
| E | 1 | 1 | 2 | 3 | 3 | 4 | 5 | 6 | 7 | 8 |
| X | 2 | 2 | 2 | 3 | 4 | 4 | 5 | 6 | 7 | 8 |
| E | 3 | 3 | 3 | 3 | 3 | 4 | 5 | 6 | 7 | 8 |
| C | 4 | 4 | 4 | 4 | 4 | 4 | 5 | 6 | 7 | 8 |
| U | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 6 | 7 | 8 |
| T | 6 | 6 | 6 | 5 | 6 | 6 | 5 | 6 | 7 | 8 |
| I | 7 | 6 | 7 | 6 | 6 | 7 | 6 | 5 | 6 | 7 |
| O | 8 | 7 | 7 | 7 | 7 | 7 | 7 | 6 | 5 | 6 |
| N | 9 | 8 | 7 | 8 | 8 | 7 | 8 | 7 | 6 | 5 |

The Levenshtein distance between INTENTION and EXECUTION is the value in the
bottom-right cell, which is 5. One minimal edit sequence is: delete "I",
substitute N with E, substitute T with X, insert "C", and substitute N with U,
which turns INTENTION into EXECUTION in 5 edits. (In the textbook variant where a
substitution costs 2, the distance is 8.)
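
The computation can be checked with a short dynamic-programming implementation using the same unit costs (a minimal sketch):

```python
# Minimum edit distance with unit insertion, deletion, and substitution costs.
def levenshtein(source, target):
    m, n = len(source), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                              # delete everything
    for j in range(n + 1):
        d[0][j] = j                              # insert everything
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

print(levenshtein("INTENTION", "EXECUTION"))     # 5
```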

