NLP-Questions (1)
4. Applications:
● Customer Service: Categorizing customer support tickets to route them to the
appropriate department.
● Content Moderation: Identifying and removing harmful or inappropriate content.
● Social Media Monitoring: Analyzing social media posts to understand public
sentiment towards a brand or product.
● Spam Filtering: Identifying and filtering out unwanted emails.
5. Derive the sentence “John ate an apple” using top-down or bottom-up parsing.
Compare top-down parsing with bottom-up parsing. Use the following grammar rules to
create the parse tree.
→ The final rewriting step yields "John ate an apple" (because N → apple).
Thus, "John ate an apple" is successfully derived.
Types of Ambiguity:
1. Lexical Ambiguity (Word-Level)
A single word has multiple meanings.
Example: "Bank" (a financial institution) vs. "Bank" (a riverbank).
Conclusion
● Ambiguity makes NLP challenging but can be handled using AI models.
● Natural Language is flexible, context-dependent, and constantly evolving. These
concepts are essential for developing smart AI-based language applications.
Example in Action:
● User: "Book a flight to Paris for next Monday." → NLU extracts the intent (book flight)
and the entities (Paris, next Monday).
● AI Response: "Your flight to Paris on Monday has been booked successfully." (text
generated by NLG)
2. Morphology:
It deals with how words are constructed from more basic meaning units called morphemes. A
morpheme is the primitive unit of meaning in a language. For example, “truth+ful+ness”.
3. Syntax:
It concerns how words can be put together to form correct sentences and determines what
structural role each word plays in the sentence and what phrases are subparts of other phrases.
For example, “the dog ate my homework”
4. Semantics:
It is a study of the meaning of words and how these meanings combine in sentences to form
sentence meaning. It is the study of context-independent meaning. For example, "plant" can
refer to an industrial plant or a living organism.
Pragmatics concerns how sentences are used in different situations and how usage affects
the interpretation of the sentence. Discourse context deals with how the immediately
preceding sentences affect the interpretation of the next sentence, for example when
interpreting pronouns and interpreting the temporal aspects of the information.
5. Reasoning:
To produce an answer to a question that is not explicitly stored in a database, Natural
Language Interface to Database (NLIDB) carries out reasoning based on data stored in the
database. For example, consider a database that holds student academic information, and a
user poses a query such as: ‘Which student is likely to fail in the Science subject?’ To
answer the query, NLIDB needs a domain expert to narrow down the reasoning process.
13.Explain lexicon, lexeme and the different types of relations that hold between
lexemes.
→ A lexicon is the total stock of words and word elements with meaning in a language, while
a lexeme is a single unit of meaning, often a single word, that can take different forms
through inflection.
Lexemes relate to each other in various ways, including synonymy, antonymy, hyponymy,
and meronymy, which are all types of lexical relations.
Lexicon and Lexeme:
Lexicon:
Think of the lexicon as a language's dictionary, containing all the words and their associated
meanings. It's a vast and constantly evolving collection of vocabulary.
Lexeme:
A lexeme is the fundamental unit of meaning within a language's lexicon. It's the underlying
concept that can be expressed in various forms, such as "run", "runs", "ran", and "running".
All these words are different forms of the same lexeme "RUN".
Types of Lexical Relations:
1) Synonymy:
Words that have very similar meanings, often interchangeable in context.
Absolute Synonyms: Words with exactly the same meaning in all contexts (rare).
Partial Synonyms: Words with closely related meanings, allowing for similar but not identical
use.
2) Antonymy:
Words that have opposite meanings.
Gradable Antonyms: Words on a spectrum of meaning (e.g., hot/cold).
Complementary Antonyms: Words that are mutually exclusive (e.g., alive/dead).
3) Hyponymy:
A hierarchical relationship where one word (hyponym) is a more specific type of another
word (hypernym).
Example: A "dog" is a hyponym of "animal" (hypernym).
4) Meronymy:
A part-whole relationship where one word refers to a part of another word.
Example: "Head" is a meronym of "car".
16.Describe N-gram language model. List the problem associated with N-gram
model
→ N-gram Language Model:
An N-gram language model is a type of probabilistic language model used to predict the next
word in a sequence based on the previous N-1 words. It is based on the Markov assumption
that the probability of a word depends only on the preceding few words.
Definition:
An N-gram is a sequence of N consecutive words from a given sentence or text.
● Unigram (1-gram): considers one word at a time.
● Bigram (2-gram): considers two consecutive words.
● Trigram (3-gram): considers three consecutive words.
Formula:
P(w_n | w_{n-N+1}, ..., w_{n-1}) ≈ Count(w_{n-N+1} ... w_n) / Count(w_{n-N+1} ... w_{n-1})
For a bigram model (N = 2): P(w_n | w_{n-1}) = Count(w_{n-1} w_n) / Count(w_{n-1})
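As an illustration of the formula, here is a minimal bigram (N = 2) model built from raw counts over a tiny made-up corpus; the corpus and the sentence-boundary markers are assumptions for the sketch.

```python
from collections import Counter, defaultdict

# Tiny made-up corpus for illustration only.
corpus = [
    "john ate an apple",
    "john ate an orange",
    "mary ate an apple",
]

bigram_counts = defaultdict(Counter)
context_counts = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    for prev, curr in zip(tokens, tokens[1:]):
        bigram_counts[prev][curr] += 1
        context_counts[prev] += 1

def bigram_prob(prev, curr):
    """P(curr | prev) = Count(prev curr) / Count(prev)."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[prev][curr] / context_counts[prev]

print(bigram_prob("ate", "an"))    # 1.0
print(bigram_prob("an", "apple"))  # 2/3 ≈ 0.667
```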
Advantages:
● Simple to implement.
● Fast and efficient.
● Useful for basic NLP tasks like spelling correction, text generation, etc.
Problems with the N-gram model:
3) Limited Context: N-gram models only consider a fixed-size context (the last N-1
words), so they fail to capture long-distance dependencies in language.
4) Poor Generalization: N-gram models rely heavily on exact word sequences and do
not generalize well to new or rare phrases.
Advantages:
● Does not require annotated training data.
● Uses existing resources like dictionaries and thesauri.
● Easy to implement for many languages.
Limitations:
● Limited by the quality and coverage of the dictionary.
● Performs poorly in complex or ambiguous contexts.
● Relies heavily on surface-level word matching.
b) Regular expression
→
c) finite automata
→
1) Term Frequency (TF): Measures how often a word appears in a document. A higher
frequency suggests greater importance. If a term appears frequently in a document, it
is likely relevant to the document’s content.
Formula:
TF(t, d) = (number of times term t appears in document d) / (total number of terms in d)
Limitations of TF Alone:
● TF does not account for the global importance of a term across the entire
corpus.
● Common words like "the" or "and" may have high TF scores but are not
meaningful in distinguishing documents.
2) Inverse Document Frequency (IDF): Reduces the weight of common words across
multiple documents while increasing the weight of rare words. If a term appears in
fewer documents, it is more likely to be meaningful and specific.
Formula:
IDF(t) = log(N / df(t))
or, with smoothing:
IDF(t) = log(N / (1 + df(t)))
where N is the total number of documents and df(t) is the number of documents containing t.
● The logarithm is used to dampen the effect of very large or very small values,
ensuring the IDF score scales appropriately.
● It also helps balance the impact of terms that appear in extremely few or
extremely many documents.
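To make the two formulas concrete, here is a minimal TF-IDF computation over a tiny made-up corpus; the documents and the choice of the unsmoothed IDF variant are assumptions for the sketch.

```python
import math

# Tiny made-up corpus for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat ate the fish",
]
tokenized = [d.split() for d in docs]

def tf(term, doc_tokens):
    """TF(t, d) = count of t in d / total terms in d."""
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus_tokens):
    """IDF(t) = log(N / df(t)); df(t) = number of documents containing t."""
    df = sum(1 for doc in corpus_tokens if term in doc)
    return math.log(len(corpus_tokens) / df) if df else 0.0

for term in ("the", "cat", "homework"):
    scores = [round(tf(term, doc) * idf(term, tokenized), 3) for doc in tokenized]
    print(term, scores)   # "the" scores 0 everywhere; "homework" scores high in doc 2
```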
c) Parsing:
→
c) Lemmatization:
→ Lemmatization is the process of grouping together the different inflected forms of a word
so they can be analyzed as a single item. Lemmatization is similar to stemming, but it brings
context to the words, linking the different forms that share a meaning to a single base word.
Text preprocessing includes both stemming and lemmatization. The two terms are often
confused, and some treat them as the same. Lemmatization is preferred over stemming
because it performs morphological analysis of the words.
Examples of lemmatization:
→ rocks : rock
→ corpora : corpus
→ better : good
One major difference from stemming is that the lemmatizer takes a part-of-speech parameter,
“pos”. If it is not supplied, the default is “noun.”
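As a concrete illustration, the sketch below uses NLTK's WordNetLemmatizer, whose lemmatize() method takes the "pos" parameter described above; it assumes the WordNet data has been downloaded.

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # assumed one-time download

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("rocks"))            # rock    (default pos is noun)
print(lemmatizer.lemmatize("corpora"))          # corpus
print(lemmatizer.lemmatize("better", pos="a"))  # good    (treated as an adjective)
print(lemmatizer.lemmatize("better"))           # better  (treated as a noun, unchanged)
```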
Lemmatization Techniques
Lemmatization techniques in natural language processing (NLP) involve methods to identify
and transform words into their base or root forms, known as lemmas. These approaches
contribute to text normalization, facilitating more accurate language analysis and processing
in various NLP applications.
Three types of lemmatization techniques are:
1. Rule-Based Lemmatization:
Rule-based lemmatization involves the application of predefined rules to derive the base or
root form of a word. Unlike machine learning-based approaches, which learn from data,
rule-based lemmatization relies on linguistic rules and patterns.
Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.
Example:
Word: “walked”
Rule Application: Remove “-ed”
Result: “walk”
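A minimal sketch of this rule in code, assuming a single hypothetical helper that only handles the regular “-ed” suffix (real rule-based lemmatizers combine many such rules with exception lists for irregular forms):

```python
import re

def rule_based_lemma(word):
    """Apply one illustrative rule: strip a regular '-ed' suffix."""
    if re.fullmatch(r"[a-z]+ed", word) and len(word) > 4:
        return word[:-2]   # "walked" -> "walk"
    return word

print(rule_based_lemma("walked"))  # walk
print(rule_based_lemma("apple"))   # apple (rule does not apply)
```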
c) Parsing:
→
Semantic Constraints:
● Selectional Restrictions: Verbs impose semantic constraints on their arguments. For
example, a "spokesperson" can "announce" but not "repair".
● Subsumption: One entity can be a subtype of another. For instance, "Microsoft is a
company".
● Verb Semantics: Some verbs emphasize particular arguments, biasing pronoun
resolution. For example, in "John telephoned Bill. He lost the laptop," "He" is more
likely to refer to John (subject) than Bill (object).
● World Knowledge: General knowledge about the world can also influence
coreference. For example, "John and Mary are a married couple. They are happy".
● Inference: Reasoning about the relationships between entities can help resolve
coreference. For example, "John hit the ball. It flew into the crowd".
By combining syntactic and semantic constraints, coreference resolution systems can
effectively determine which expressions refer to the same entity, which is crucial for tasks
like natural language understanding and machine translation.
Key Concepts:
● Conciseness: Summaries are shorter than the original text, highlighting the key points
and reducing redundancy.
● Accuracy: The summary should accurately reflect the original text's main ideas and
information.
● Coherence: The summary should be clear and easy to understand, with a logical flow
of ideas.
● Automatic Text Summarization (ATS): This is a branch of NLP that aims to
automate the process of creating summaries using algorithms and machine learning
techniques.
Types of Summarization:
● Extractive Summarization:
1. Selects important sentences or phrases directly from the original text.
2. Does not rephrase content.
3. Example tools: TextRank, LexRank (a minimal frequency-based sketch appears after this list).
● Abstractive Summarization:
1. Generates new sentences that capture the main ideas.
2. Uses deep learning models (e.g., BERT, GPT).
3. Similar to how humans write summaries.
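To illustrate the extractive approach listed above, here is a minimal frequency-based sentence-scoring sketch; it is a simplified stand-in for graph-based methods such as TextRank or LexRank, and the sample text is made up.

```python
import re
from collections import Counter

def extractive_summary(text, num_sentences=2):
    """Score sentences by average word frequency and keep the top ones."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:num_sentences])
    # Preserve the original sentence order in the summary.
    return " ".join(s for s in sentences if s in top)

sample = ("Text summarization shortens a document. Extractive methods pick "
          "important sentences from the document. Abstractive methods generate "
          "new sentences. Summarization is useful for news and reports.")
print(extractive_summary(sample))
```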