NLP-Questions

The document outlines key concepts in Natural Language Processing (NLP), including the structure of a generic NLP system, definitions of morphemes, stems, and affixes, and the history of NLP. It also discusses text classification, parsing techniques, part-of-speech tagging, ambiguity in natural language, and various applications of NLP. Additionally, it differentiates between Natural Language Understanding (NLU) and Natural Language Generation (NLG), and explains levels of NLP and morphological distinctions.

1.​ Define a generic NLP system with a diagram.

→ A Generic Natural Language Processing (NLP) System is a framework designed to process, analyze, and understand human language in a structured manner. It typically consists of multiple components that work together to convert raw text into meaningful insights.

Key Components of a Generic NLP System:


1.​ Text Preprocessing:
●​ Tokenization (splitting text into words or sentences)
●​ Stopword removal (eliminating common words like the, is, and)
●​ Stemming and Lemmatization (reducing words to their root forms)
2.​ Syntactic Analysis (Parsing):
Identifies grammatical structure (e.g., part-of-speech tagging, dependency parsing).
3.​ Semantic Analysis:
Determines the meaning of words and sentences (e.g., Named Entity Recognition,
word sense disambiguation).
4.​ Feature Extraction & Representation:
Converts text into numerical representations like Bag-of-Words (BoW), TF-IDF, or
word embeddings (Word2Vec, GloVe).
5.​ Modeling & Processing:
Uses machine learning (Naïve Bayes, SVM, Decision Trees) or deep learning
(LSTMs, Transformers like BERT & GPT).
6.​ Output Generation:
Performs tasks like sentiment analysis, text summarization, machine translation, or
chatbots.
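
A minimal sketch in plain Python of the first preprocessing stages listed above (tokenization, stopword removal, and a crude suffix-stripping stemmer). The tiny stopword list and suffix rules here are made up purely for illustration; real systems typically use libraries such as NLTK or spaCy.

import re

STOPWORDS = {"the", "is", "and", "a", "an", "of", "in"}  # tiny illustrative list
SUFFIXES = ("ing", "ed", "s")                            # crude stemming rules

def preprocess(text):
    # Tokenization: split the text into lowercase word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stopword removal: drop very common, low-information words.
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Naive stemming: strip a known suffix if the remaining stem is long enough.
    stems = []
    for t in tokens:
        for suffix in SUFFIXES:
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stems.append(t)
    return stems

print(preprocess("The cats are playing in the garden"))  # ['cat', 'are', 'play', 'garden']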

Applications of Generic NLP Systems:


1.​ Chatbots (e.g., ChatGPT, Alexa, Siri)
2.​ Sentiment analysis (e.g., analyzing customer reviews)
3.​ Machine translation (e.g., Google Translate)
4.​ Speech recognition (e.g., voice assistants)

2.​ Define morpheme, stem and affixes.


→ In NLP, a morpheme is the smallest meaningful unit of language, forming the building
blocks of words. A stem is the core meaning-bearing part of a word, while affixes are bound
morphemes that modify the stem's meaning by adding or changing information like tense,
number, or part of speech.
Morpheme:
●​ A morpheme is the smallest unit of language that carries meaning.
●​ It can be a root word or a part of a word that has meaning.
●​ Examples include "cat," "dog," or the plural marker "-s" in "cats".
Stem:
●​ The stem is the core morpheme or part of a word that defines its main meaning.
●​ It's the part of a word to which affixes can be added.
●​ For example, in "unhappy," the stem is "happy," and "un" is an affix.
Affixes:
●​ Affixes are morphemes that are bound to a stem and can't stand alone.
●​ They modify the stem's meaning, adding information like tense, number, or part of
speech.
●​ Examples include prefixes ("un-", "re-"), suffixes ("-s", "-ing", "-ly"), and infixes
("-bloody-" in "abso-bloody-lutely").
● Affixes can be inflectional (changing a word's form but not its category, e.g., "walk" to "walking") or derivational (creating a new word, often changing its category or core meaning, e.g., "happy" to "happiness" or "happy" to "unhappy").
In summary, a morpheme is the smallest unit of meaning, a stem is the core part of a word,
and affixes are bound morphemes that modify the stem's meaning.

3.​ Explain the history of NLP in brief.


→ NLP's history spans from the 1950s, rooted in the Turing Test and early computer
development, to the current era of machine learning and deep learning. Early approaches
were rule-based, but as computing power grew, statistical models and, eventually, machine
learning techniques took over, leading to significant advancements in language understanding
and generation.
Key Milestones:
1950s:
Alan Turing's proposal of the Turing Test (a test for intelligence based on a machine's ability
to communicate naturally) laid the groundwork for NLP. The Georgetown-IBM experiment in
1954 demonstrated early machine translation using hand-coded rules.
1960s-1980s:
Linguistics theories, like Noam Chomsky's generative grammar, influenced early NLP
models. Symbolic approaches (expert systems) and rule-based systems, like Augmented
Transition Networks (ATNs), were used for language processing.
1990s:
Statistical models, enabled by increased computing power, gained prominence, moving away
from handcrafted rules. Machine learning algorithms began to be incorporated into NLP
tasks.
2000s-present:
Machine learning models, like Support Vector Machines and Latent Dirichlet Allocation,
became dominant, allowing systems to learn from data. Deep learning models, such as
Recurrent Neural Networks (RNNs) and transformers, have further revolutionized the field,
enabling more sophisticated language understanding and generation.

4.​ Explain Text Classification in detail.


→ Text classification is a machine learning technique that automatically categorizes text data
into predefined classes or labels. It uses algorithms to analyze the content of text and assign it
the most appropriate category. This process helps organize and structure large amounts of
unstructured text data for various applications like sentiment analysis, spam detection, and
topic labeling.
Here's a more detailed explanation:
1.​ What it is:
Definition:
Text classification is a type of supervised machine learning where a model learns to predict a
class (or category) for a given text input.
Goal:
The goal is to assign the text to the most relevant pre-defined category based on its content.
Examples:
●​ Sentiment analysis: Classifying customer reviews as positive, negative, or neutral.
●​ Spam detection: Identifying emails as spam or not spam.
●​ Topic categorization: Organizing news articles into topics like sports, politics, or
technology.

2.​ How it works:


Data Preparation:
The first step involves preparing the text data by cleaning it, removing unnecessary
characters, and converting it into a format suitable for analysis.
Feature Extraction:
Relevant features are extracted from the text, such as the frequency of words, presence of
certain keywords, or word embeddings.
Model Training:
A machine learning model is trained on labeled data, where each text example is associated
with a specific category.
Prediction:
The trained model is then used to classify new, unseen text data by predicting the most likely
category based on the extracted features.
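
A minimal text-classification sketch with scikit-learn, assuming the library is installed; the tiny training texts and labels are invented for illustration only. It follows the steps above: feature extraction with TF-IDF and model training with Naive Bayes, then prediction on unseen text.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "I love this phone, great battery",
    "Terrible service, very disappointed",
    "Amazing quality and fast delivery",
    "Worst purchase ever, totally broken",
]
train_labels = ["positive", "negative", "positive", "negative"]

# Feature extraction (TF-IDF) and model training (Naive Bayes) in one pipeline.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Prediction: classify new, unseen text.
print(model.predict(["The battery is great but delivery was slow"]))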

3.​ Types of Text Classification:


●​ Sentiment Analysis: Determining the sentiment or emotion expressed in a piece of
text.
●​ Toxicity Detection: Identifying offensive or harmful language online.
●​ Topic Labeling: Categorizing text based on its subject matter.
●​ Language Identification: Determining the language of a given text.

4.​ Applications:
●​ Customer Service: Categorizing customer support tickets to route them to the
appropriate department.
●​ Content Moderation: Identifying and removing harmful or inappropriate content.
●​ Social Media Monitoring: Analyzing social media posts to understand public
sentiment towards a brand or product.
●​ Spam Filtering: Identifying and filtering out unwanted emails.

5.​ Derive the sentence “John ate an apple” using top down or bottom up parsing.
Compare Top down parsing with Bottom up parsing. Use the following grammar
rules to create the parse tree

→ Grammar Rules Given


1.​ S → NP VP
2.​ VP → V NP
3.​ NP → NAME
4.​ NP → ART N
5.​ NAME → John
6.​ V → ate
7.​ ART → an | the
8.​ N → apple

Top-Down Parsing for "John ate an apple"


Step: Start from S and expand step-by-step:
●​ S
→ NP VP
→ NAME VP (because NP → NAME)
→ John VP (because NAME → John)
→ John V NP (because VP → V NP)
→ John ate NP (because V → ate)
→ John ate ART N (because NP → ART N)
→ John ate an N (because ART → an)
→ John ate an apple (because N → apple)
Thus, "John ate an apple" is successfully derived!

Parse Tree (Simple View):
(S (NP (NAME John)) (VP (V ate) (NP (ART an) (N apple))))

Bottom-Up Parsing for "John ate an apple"


Step: Start from words and combine step-by-step:
1.​ John → NAME (by Rule 5)
2.​ NAME → NP (by Rule 3)
3.​ ate → V (by Rule 6)
4.​ an → ART (by Rule 7)
5.​ apple → N (by Rule 8)
6.​ ART + N → NP (by Rule 4)
7.​ V + NP → VP (by Rule 2)

8. NP + VP → S (by Rule 1)

Thus, we have successfully reduced "John ate an apple" to S (the start symbol)!

Comparison: top-down parsing starts from the start symbol S and expands grammar rules until the input words are reached, so it can waste effort on rules that never match the input; bottom-up parsing starts from the words and reduces them to phrases until S is reached, so it can build constituents that never fit into a complete sentence. Both strategies produce the same parse tree for this sentence.

Quick Parse Tree:
(S (NP (NAME John)) (VP (V ate) (NP (ART an) (N apple)))) (the same tree as in top-down parsing, built from the leaves upward)
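
The same derivation can be reproduced with NLTK (assuming it is installed), using the grammar rules from this question: RecursiveDescentParser is a top-down parser and ShiftReduceParser is a bottom-up parser.

import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> NAME | ART N
NAME -> 'John'
V -> 'ate'
ART -> 'an' | 'the'
N -> 'apple'
""")

tokens = "John ate an apple".split()

# Top-down: start from S and expand rules until the input words are matched.
for tree in nltk.RecursiveDescentParser(grammar).parse(tokens):
    tree.pretty_print()

# Bottom-up: shift words onto a stack and reduce them to phrases, then to S.
for tree in nltk.ShiftReduceParser(grammar).parse(tokens):
    print(tree)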


6. What is POS tagging? Explain different techniques of POS tagging.
→ POS tagging, or Part-of-Speech tagging, is the process of assigning a grammatical
category (like noun, verb, adjective) to each word in a sentence. It's a fundamental step in
Natural Language Processing (NLP) because it helps understand the structure and meaning of
text. Different techniques are used for POS tagging, including rule-based, statistical, and
hybrid methods.
1.​ Rule-based tagging:
●​ This approach uses pre-defined rules and a lexicon (dictionary) to assign tags to
words.
●​ For example, a rule might state that words ending in "ing" or "ed" are likely verbs.
●​ Rule-based taggers also use contextual information to disambiguate tags.
2.​ Statistical tagging:
●​ This method uses trained models (like Hidden Markov Models or Conditional
Random Fields) to predict tags based on probability.
●​ It learns from large, tagged datasets and assigns tags based on the likelihood of a word
being a specific part of speech in a given context.
3.​ Hybrid tagging:
●​ This combines the strengths of both rule-based and statistical methods.
●​ It might start with a rule-based system and then refine the tagging using a statistical
model.
Example:
In the sentence "The cat sat on the mat," POS tagging might assign these tags:
"The" - Determiner, "cat" - Noun, "sat" - Verb, "on" - Preposition, "the" - Determiner, and
"mat" - Noun.
7.​ Explain in brief Ambiguity and NL (Natural Language)
→ Ambiguity in NLP & Natural Language (NL)
1. Ambiguity in NLP:
Ambiguity occurs when a word, phrase, or sentence has multiple meanings, making it
difficult for computers to interpret language accurately. It is a major challenge in Natural
Language Processing (NLP).

Types of Ambiguity:
1.​ Lexical Ambiguity (Word-Level)
A single word has multiple meanings.
Example: "Bank" (a financial institution) vs. "Bank" (a riverbank).

2.​ Syntactic Ambiguity (Sentence Structure)


A sentence can have multiple grammatical interpretations.
Example: "The chicken is ready to eat." (Is the chicken eating or being eaten?)

3.​ Semantic Ambiguity (Meaning-Based)


A sentence’s meaning is unclear due to word relationships.
Example: "I saw a man with a telescope." (Did I use the telescope, or did the man have it?)

4.​ Pragmatic Ambiguity (Context-Based)


Meaning changes based on the situation or speaker’s intent.
Example: "Can you pass the salt?" (Is it a question about ability or a polite request?)

Solution to Ambiguity in NLP:


●​ Contextual models (like BERT, GPT)
●​ Named Entity Recognition (NER)
●​ Word Sense Disambiguation (WSD)

2. Natural Language (NL)


Natural Language (NL) refers to the way humans communicate through speech and text in
different languages (e.g., English, Hindi, Spanish). Unlike programming languages, NL is
complex, ambiguous, and evolves over time.
Key Features of Natural Language:
●​ Dynamic & Evolving (new words, slang appear over time).
●​ Highly Contextual (same words can have different meanings in different situations).
●​ Fuzzy & Imprecise (humans use idioms, sarcasm, and informal speech).

Examples of Natural Language Processing (NLP) Applications:


●​ Google Translate (Machine Translation)
●​ ChatGPT, Siri, Alexa (Conversational AI)
●​ Spam Filters (Text Classification)

Conclusion
●​ Ambiguity makes NLP challenging but can be handled using AI models.
●​ Natural Language is flexible, context-dependent, and constantly evolving. These
concepts are essential for developing smart AI-based language applications.

8.​ State different applications of NLP.


→ Applications of NLP
1.​ Machine Translation
Automatically translating text from one language to another.
Example: Google Translate.

2.​ Chatbots and Virtual Assistants


Conversational systems that interact with users.
Example: Siri, Alexa, Google Assistant.

3.​ Sentiment Analysis


Detecting emotions or opinions in text (positive, negative, neutral).
Example: Analyzing product reviews or social media posts.

4.​ Speech Recognition


Converting spoken language into written text.
Example: Voice typing, call center transcription.

5.​ Text Summarization


Creating short summaries from long documents or articles.
Example: News article summarizers.

6.​ Information Extraction


Pulling specific data (names, dates, locations) from a large text.
Example: Extracting keywords from resumes.

7.​ Question Answering Systems


Systems that provide direct answers to user queries.
Example: Google’s featured snippets.
8.​ Spam Detection
Identifying unwanted or harmful emails and messages.
Example: Gmail spam filter.

9.​ Text Classification


Categorizing text into predefined groups.
Example: Sorting emails into promotions, updates, or primary inbox.

10.​Autocorrect and Grammar Checking


Correcting spelling and grammar errors in text.
Example: Grammarly, MS Word spelling suggestions.

9.​ Explain the difference between NLU and NLG



→ The main differences between Natural Language Understanding (NLU) and Natural Language Generation (NLG) are:

● Definition: NLU focuses on interpreting and understanding human language, while NLG focuses on generating human-like text based on input data.
● Objective: NLU converts unstructured text into structured data (extracts meaning); NLG converts structured data into human-readable text.
● Processes Involved: NLU involves parsing, entity recognition, sentiment analysis, and intent detection; NLG involves text planning, sentence structuring, and grammatical correctness.
● Techniques Used: NLU uses Named Entity Recognition (NER), dependency parsing, and Word Sense Disambiguation; NLG uses rule-based text generation and deep learning models (Transformers, GPT, BERT).
● Example Applications: NLU: a chatbot understanding user input, sentiment analysis of reviews. NLG: AI-powered content writing, automated report generation.
● Example in Action: For the user request "Book a flight to Paris for next Monday," NLU extracts the intent (book flight) and the entities (Paris, next Monday); NLG then generates the response "Your flight to Paris on Monday has been booked successfully."

10.​Write a short note on the Levels of NLP.



1.​ Phonology:
It is concerned with the interpretation of speech sounds within and across words.

2.​ Morphology:
It deals with how words are constructed from more basic meaning units called morphemes. A
morpheme is the primitive unit of meaning in a language. For example, “truth+ful+ness”.

3.​ Syntax:
It concerns how words can be put together to form correct sentences and determines what
structural role each word plays in the sentence and what phrases are subparts of other phrases.
For example, “the dog ate my homework”

4. Semantics:
It is the study of the meaning of words and how these meanings combine in sentences to form sentence meaning. It deals with context-independent meaning. For example, "plant" can refer to an industrial plant or to a living organism.

5. Pragmatics and Discourse:
Pragmatics is concerned with how sentences are used in different situations and how the situation affects their interpretation. Discourse deals with how the immediately preceding sentences affect the interpretation of the next sentence, for example when interpreting pronouns or the temporal aspects of the information.

6. Reasoning:
To produce an answer to a question that is not explicitly stored in a database, a Natural Language Interface to Database (NLIDB) carries out reasoning based on the data stored in the database. For example, consider a database that holds students' academic information and a user query such as "Which student is likely to fail in the Science subject?" To answer this query, the NLIDB needs a domain expert to narrow down the reasoning process.

11. Differentiate between inflectional and derivational morphology.

→ Inflectional morphology adds grammatical information such as tense, number, person, or case to a word without changing its category or core meaning (e.g., "walk"/"walked", "cat"/"cats"); the result is a different form of the same lexeme. Derivational morphology creates a new word, often changing its category or meaning (e.g., "happy"/"happiness", "teach"/"teacher"). In short, inflection never changes the part of speech and applies very generally, while derivation can change the part of speech and is more restricted in which words it applies to.
12.​Explain Regular Expression.
→ A regular expression (a.k.a. regex or regexp) is a sequence of characters that specifies a
search pattern in the text. Usually, such patterns are used by string-searching algorithms for
“find” or “find and replace” operations on strings, or for input validation.
Key Concepts of Regular Expressions:
1)​ Literal Characters:
These are the actual characters you want to match.
Example: hello matches the string "hello".
2)​ Meta-characters:
Special characters that define the search pattern.
Some commonly used ones:
●​ . (Dot) – Matches any single character except a newline.
●​ ^ (Caret) – Matches the start of a string.
●​ $ (Dollar) – Matches the end of a string.
●​ [] (Square Brackets) – Matches any one of the characters inside the brackets.
●​ | (Pipe) – Logical OR, used to match either pattern on the left or right side.
●​ () (Parentheses) – Groups part of the regex together.
3)​ Quantifiers:
Indicate how many times an element should appear.
●​ * – Zero or more times.
●​ + – One or more times.
●​ ? – Zero or one time.
●​ {n} – Exactly n times.
●​ {n,} – n or more times.
●​ {n,m} – Between n and m times.
4)​ Character Classes:
Predefined sets of characters you can match:
●​ \d – Matches any digit (equivalent to [0-9]).
●​ \w – Matches any word character (alphanumeric + underscore, equivalent to
[A-Za-z0-9_]).
●​ \s – Matches any whitespace character (space, tab, newline, etc.).
●​ \D – Matches any non-digit character.
●​ \W – Matches any non-word character.
●​ \S – Matches any non-whitespace character.
5)​ Anchors:
Define the position of the match in the string:
●​ ^ – Matches the beginning of a string.
●​ $ – Matches the end of a string.
6)​ Escaping Special Characters:
Use a backslash (\) to escape meta-characters when you want to match them literally. For
example, \. matches a period.

Regular Expressions Use Cases


Regular Expressions are used in various tasks such as:
● Data pre-processing
● Rule-based information mining systems
● Pattern matching
● Text feature engineering
● Web scraping
● Data validation
● Data extraction
An example use case is extracting all hashtags from a tweet, or getting email addresses or
phone numbers from large unstructured text content.
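
A small sketch of that use case with Python's built-in re module; the patterns are simplified for illustration and are not production-grade validators.

import re

text = "Contact us at support@example.com or sales@example.org #help #nlp"

hashtags = re.findall(r"#\w+", text)                   # '#' followed by word characters
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.]+", text)  # rough e-mail pattern

print(hashtags)  # ['#help', '#nlp']
print(emails)    # ['support@example.com', 'sales@example.org']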

13.​Explain lexicon, lexeme and the different types of relations that hold between
lexemes.
→ A lexicon is the total stock of words and word elements with meaning in a language, while
a lexeme is a single unit of meaning, often a single word, that can take different forms
through inflection.
Lexemes relate to each other in various ways, including synonymy, antonymy, hyponymy,
and meronymy, which are all types of lexical relations.
Lexicon and Lexeme:
Lexicon:
Think of the lexicon as a language's dictionary, containing all the words and their associated
meanings. It's a vast and constantly evolving collection of vocabulary.
Lexeme:
A lexeme is the fundamental unit of meaning within a language's lexicon. It's the underlying
concept that can be expressed in various forms, such as "run", "runs", "ran", and "running".
All these words are different forms of the same lexeme "RUN".
Types of Lexical Relations:
1)​ Synonymy:
Words that have very similar meanings, often interchangeable in context.
Absolute Synonyms: Words with exactly the same meaning in all contexts (rare).
Partial Synonyms: Words with closely related meanings, allowing for similar but not identical
use.
2)​ Antonymy:
Words that have opposite meanings.
Gradable Antonyms: Words on a spectrum of meaning (e.g., hot/cold).
Complementary Antonyms: Words that are mutually exclusive (e.g., alive/dead).
3)​ Hyponymy:
A hierarchical relationship where one word (hyponym) is a more specific type of another
word (hypernym).
Example: A "dog" is a hyponym of "animal" (hypernym).
4)​ Meronymy:
A part-whole relationship where one word names a part of the thing named by another word.
Example: "Wheel" is a meronym of "car".
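
These lexical relations can be explored with WordNet through NLTK, assuming NLTK is installed and the wordnet corpus has been downloaded; the printed relations depend on WordNet's coverage.

from nltk.corpus import wordnet as wn

dog = wn.synset("dog.n.01")
car = wn.synset("car.n.01")

print(dog.hypernyms())      # more general synsets (hypernyms), e.g. canine, domestic animal
print(car.part_meronyms())  # parts of a car, e.g. accelerator, car door
print(wn.synsets("happy")[0].lemmas()[0].antonyms())  # antonym lemma, e.g. unhappy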

14.​What is morphological parsing? Explain FST with example.


→ Morphological Parsing:
Morphological parsing is the process of breaking down a word into its constituent
morphemes, which are the smallest meaningful units of language, such as stems, prefixes,
and suffixes. It determines how a word is formed by combining these morphemes and how
they contribute to the word's overall meaning and grammatical function.
Detailed explanation of morphological parsing:
Morphemes:
Morphemes are the fundamental building blocks of words. They can be stems (the core part
of the word), prefixes (attached at the beginning), suffixes (attached at the end), or infixes
(inserted within the word, less common in English).
Goal of Morphological Parsing:
The goal is to identify the morphemes within a word and how they are combined. For example, parsing the word "unhappiness" would identify "un" (prefix), "happy" (stem), and "ness" (suffix); the word "unhappy" itself contains just the prefix and the stem.
Example:
Consider the word "cats". A morphological parser would identify "cat" as the stem (noun)
and "s" as a plural suffix (indicating more than one cat).
Applications:
●​ Morphological parsing is a crucial step in many natural language processing (NLP)
tasks, including:
●​ Word lemmatization: Identifying the base form (lemma) of a word for easier analysis,
like finding the lemma "cat" for "cats".
●​ Text summarization: Breaking down sentences to understand the underlying meaning.
●​ Machine translation: Understanding the structure of words in different languages.
●​ Spelling and grammar correction: Identifying and correcting errors based on
morphological rules.

Finite State Transducer (FST):


A Finite State Transducer (FST) is a model of computation that transforms input strings into
output strings based on a set of rules and states, similar to a finite state automaton but with
output capabilities. It can be visualized as a directed graph where nodes represent states and
labeled edges represent transitions and outputs.
Here's a breakdown with an example:
●​ States: FSTs have a finite number of states that represent the current processing stage.
●​ Transitions: Transitions move the FST from one state to another based on the input.
●​ Input Alphabet: A set of symbols (like letters or numbers) that the FST can process as
input.
●​ Output Alphabet: A set of symbols that the FST can produce as output.
●​ Transition Function: Rules that determine the next state and output given the current
state and input symbol.
●​ Start State: The initial state the FST begins with.
●​ Accepting States: States that indicate the end of a valid input sequence.
Example:
Let's say you want to create an FST that takes a word in English and outputs its plural form.
A simple example:
Input: "cat"
Output: "cats"
Here's how you might represent this in an FST:
States:
●​ State 0 (Start): Initial state
●​ State 1: Processing a word
●​ State 2: End/Accepting state (plural form)
●​ Input Alphabet: {c, a, t}
●​ Output Alphabet: {c, a, t, s}
Transition Function:
●​ From State 0, on input "c", go to State 1, output "c"
●​ From State 1, on input "a", go to State 1, output "a"
●​ From State 1, on input "t", go to State 2, output "t"
●​ From State 2, on input (nothing), go to State 2, output "s" (adding the "s" for plural)
●​ This FST would take "cat" as input and generate "cats" as output. It would process the
input "c", then "a", then "t", and finally, it would add the "s" for the plural form.

15.​Describe HMM with example.


→ A Hidden Markov Model (HMM) is a statistical model used to predict a sequence of
unknown (hidden) variables from a set of observed variables. It's like a secret code where you
observe patterns (observations) and try to infer the hidden system (states) that generated those
patterns.
1.​ Hidden States: These are the underlying, unobserved factors influencing the system.
Example: Your mood (happy, sad, angry) is a hidden state, not directly observable by
others.
2.​ Observations: These are the things you can see or measure, which are affected by the
hidden states. Example: Your facial expressions, body language, and tone of voice are
observations that can help someone infer your mood.
3.​ Transition Probabilities: These are the probabilities of moving from one hidden state
to another over time. Example: The probability of transitioning from a happy state to
a sad state, or from a sad state to a happy state.
4.​ Emission Probabilities: These are the probabilities of observing a specific
observation given a particular hidden state. Example: The probability of seeing a
frown when you are in a sad state, or the probability of hearing a cheerful voice when
you are in a happy state.
How it works:
HMMs use the observed data (observations) and the model's parameters (transition and
emission probabilities) to estimate the most likely sequence of hidden states.
Example: Weather Prediction
●​ Hidden States: The weather (Rainy, Sunny, Cloudy)
●​ Observations: What people wear (Umbrella, Sunglasses, Jacket)
●​ Transition Probabilities: The probability of transitioning from Rainy to Sunny, or
from Sunny to Cloudy, etc.
●​ Emission Probabilities: The probability of someone wearing an umbrella when it's
Rainy, or the probability of someone wearing sunglasses when it's Sunny, etc.
In this example:
●​ You see someone wearing an umbrella for two days and then sunglasses on the third
day.
●​ The HMM analyzes the pattern and calculates the probability of each weather state
(Rainy, Sunny, Cloudy) on each day.
●​ Based on the observations and probabilities, it infers that it was likely Rainy on the
first two days and Sunny on the third day.
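
The inference step can be sketched with the Viterbi algorithm in plain Python; all the probabilities below are made-up illustrative numbers, not values estimated from any real data.

states = ["Rainy", "Sunny"]
start_p = {"Rainy": 0.6, "Sunny": 0.4}
trans_p = {
    "Rainy": {"Rainy": 0.7, "Sunny": 0.3},
    "Sunny": {"Rainy": 0.4, "Sunny": 0.6},
}
emit_p = {
    "Rainy": {"Umbrella": 0.8, "Sunglasses": 0.1, "Jacket": 0.1},
    "Sunny": {"Umbrella": 0.1, "Sunglasses": 0.7, "Jacket": 0.2},
}
observations = ["Umbrella", "Umbrella", "Sunglasses"]

# viterbi[t][s] = probability of the best state path ending in state s at time t
viterbi = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
backpointer = [{}]

for t in range(1, len(observations)):
    viterbi.append({})
    backpointer.append({})
    for s in states:
        prob, prev = max(
            (viterbi[t - 1][p] * trans_p[p][s] * emit_p[s][observations[t]], p)
            for p in states
        )
        viterbi[t][s] = prob
        backpointer[t][s] = prev

# Trace back the most likely sequence of hidden weather states.
last = max(viterbi[-1], key=viterbi[-1].get)
path = [last]
for t in range(len(observations) - 1, 0, -1):
    last = backpointer[t][last]
    path.insert(0, last)

print(path)  # ['Rainy', 'Rainy', 'Sunny'] for these illustrative numbers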

16.​Describe N-gram language model. List the problem associated with N-gram
model
→ N-gram Language Model:
An N-gram language model is a type of probabilistic language model used to predict the next word in a sequence based on the previous N-1 words. It is based on the Markov assumption that the probability of a word depends only on the preceding few words.
Definition:
An N-gram is a sequence of N consecutive words from a given sentence or text.
●​ Unigram (1-gram): considers one word at a time.
●​ Bigram (2-gram): considers two consecutive words.
●​ Trigram (3-gram): considers three consecutive words.
Formula (bigram case): P(w1, w2, ..., wn) ≈ P(w1) × P(w2 | w1) × P(w3 | w2) × ... × P(wn | wn-1), where each conditional probability is estimated from counts, e.g. P(wi | wi-1) = Count(wi-1 wi) / Count(wi-1).

Advantages:
●​ Simple to implement.
●​ Fast and efficient.
●​ Useful for basic NLP tasks like spelling correction, text generation, etc.
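
A minimal bigram (2-gram) sketch in plain Python: conditional probabilities are estimated by counting word pairs in a tiny toy corpus invented for illustration.

from collections import Counter

corpus = "i like nlp i like deep learning i love nlp".split()

bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    # P(word | prev) = Count(prev word) / Count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(bigram_prob("i", "like"))    # 2/3
print(bigram_prob("like", "nlp"))  # 1/2
print(bigram_prob("love", "deep")) # 0.0 -> the data-sparsity problem described below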

Problems Associated with N-gram Model:


1)​ Data Sparsity: Many possible N-grams may never appear in the training data,
leading to zero probabilities for unseen word combinations.

2) High Memory Requirement: As N increases, the number of possible combinations increases exponentially, requiring more storage.

3)​ Limited Context: N-gram models only consider a fixed-size context (last N-1
words), so they fail to capture long-distance dependencies in language.

4)​ Poor Generalization: N-gram models rely heavily on exact word sequences and do
not generalize well to new or rare phrases.

5) Smoothing Needed: To handle zero probabilities, techniques like Laplace Smoothing or Kneser-Ney are required, which add complexity.

17.​Explain in brief POS Tagging with types.


→ Part-of-Speech (POS) Tagging:
POS (Part-of-Speech) tagging is the process of assigning grammatical categories (like noun,
verb, adjective, etc.) to each word in a sentence. It's a fundamental task in Natural Language
Processing (NLP) used to understand sentence structure and meaning. POS tagging can be
done using rule-based or statistical methods.
Example:
Sentence: "The quick brown fox jumps over the lazy dog."
POS Tags in the above sentence are:
●​ The DT
●​ quick JJ
●​ brown JJ
●​ fox NN
●​ jumps VBZ
●​ over IN
●​ the DT
●​ lazy JJ
●​ dog NN

Types of POS Tagging:


1)​ Rule-Based Tagging:
This method uses pre-defined rules to assign tags to words, often based on word endings or
patterns. For example, a rule might assign the "noun" tag to words ending in "-tion".
2)​ Statistical Tagging:
This approach uses statistical models (like Hidden Markov Models) to predict the most likely
POS tag for a word based on its context and the tags of surrounding words. It learns from a
corpus of tagged text.
3)​ Machine Learning-Based POS Tagging:
●​ Uses supervised learning (e.g., Decision Trees, CRF, SVM, or Neural Networks).
●​ Requires labeled training data and learns patterns automatically.
●​ Achieves high accuracy with large datasets.
4)​ Hybrid tagging:
●​ This combines the strengths of both rule-based and statistical methods.
●​ It might start with a rule-based system and then refine the tagging using a statistical
model.

Common POS Tags:
DT (determiner), NN (singular noun), NNS (plural noun), VB (verb, base form), VBZ (verb, third-person singular present), VBD (verb, past tense), JJ (adjective), RB (adverb), IN (preposition), PRP (pronoun), CC (coordinating conjunction).

18.​Dictionary based approach of WSD.


→ The dictionary-based approach to Word Sense Disambiguation (WSD) relies on leveraging
lexical resources like dictionaries, thesauri, and knowledge bases to determine the correct
meaning of a word in context. This method aims to identify the sense of a word by comparing
its definition or related words with the surrounding words and their meanings within the text,
without relying on large corpora of tagged data.
Key aspects of the dictionary-based approach:
1)​ Lexical Resources:
This approach heavily depends on accessing and utilizing information from lexical resources
like dictionaries and thesauri.
2)​ Definition Overlap:
A core principle is to identify the sense that has the most overlap in its definition with the
words in the context of the target word.
3)​ Selectional Restrictions:
The dictionary-based approach may also consider selectional restrictions, which are the
syntactic and semantic relationships between a word and its arguments.
4)​ Semantic Similarity:
Comparing the semantic similarity between the word's definitions and the context words is
another technique used in this approach.
5)​ Key Technique: Lesk Algorithm:
The Lesk algorithm is a well-known dictionary-based method that measures the overlap
between the definitions of a word and the words in its context to disambiguate the intended
sense.
Example:
For the word "bank" in the sentence:
"He sat on the river bank."
Sense 1: Bank (Financial institution) – Gloss: "an organization that stores and lends money"
Sense 2: Bank (River side) – Gloss: "land alongside a river"
→ The context word "river" overlaps with Sense 2, so it is selected.
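
The Lesk algorithm is available in NLTK, assuming NLTK is installed and the wordnet corpus has been downloaded; which synset it picks depends on the overlap between the context and WordNet's glosses.

from nltk.wsd import lesk

sense = lesk("He sat on the river bank".split(), "bank")
print(sense)               # the WordNet synset chosen for 'bank' in this context
print(sense.definition())  # its gloss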

Advantages:
●​ Does not require annotated training data.
●​ Uses existing resources like dictionaries and thesauri.
●​ Easy to implement for many languages.

Limitations:
●​ Limited by the quality and coverage of the dictionary.
●​ Performs poorly in complex or ambiguous contexts.
●​ Relies heavily on surface-level word matching.

19.​Describe Morphological parsing with FST.


→ Morphological parsing using Finite State Transducers (FSTs) involves decomposing
words into their constituent morphemes (like stems and affixes) and analyzing their
morphological features. FSTs are used to map surface word forms to their underlying lexical
representations, including stem, part of speech, and other morphological information. This
method is commonly used in natural language processing for tasks like word analysis and
language model building.
How FSTs are used in morphological parsing:
1.​ Lexicon: A lexicon is created containing stems and affixes, along with their
corresponding morphological information.
2.​ Morphotactics: Rules are defined that specify how morphemes can be combined to
form words, often using a finite-state machine.
3.​ Orthographic rules: Rules are included to handle spelling changes that occur during
word formation, such as vowel changes or consonant doubling.
4.​ FST construction: The lexicon, morphotactics, and orthographic rules are combined
to create a cascade of FSTs.
5.​ Parsing: The FSTs are used to analyze the input word and map it to its underlying
lexical representation, including the stem, part of speech, and any morphological
features.
Example:
Consider the word "foxes". A morphological parser using an FST could identify "fox" as the
stem and "es" as the plural suffix, and assign the word the morphological feature of "plural
noun".
Advantages of using FSTs for morphological parsing:
● Efficiency: FSTs can be efficiently implemented and are relatively fast.
●​ Flexibility: FSTs can be easily modified to handle different languages and
morphological structures.
●​ Accuracy: FSTs can be designed to handle both regular and irregular morphological
forms.

20.​Write a short note on


a) Stop word removal:
→ Stop word removal is a text preprocessing step that eliminates very common words (e.g., "the", "is", "and") which carry little meaning on their own, so that later processing focuses on the informative words. It reduces the size and noise of the text representation in tasks such as text classification and information retrieval.

b) Regular expression:
→ A regular expression is a sequence of characters that specifies a search pattern in text, used for "find" or "find and replace" operations, input validation, and information extraction (see Question 12 for details).

c) Finite automata:
→ A finite automaton is a computational model consisting of a finite set of states and transitions driven by input symbols; starting from an initial state, it accepts or rejects an input string depending on whether processing ends in an accepting state. Finite automata are used to implement regular-expression matching, and extending them with outputs gives the finite state transducers (FSTs) used in morphological parsing (see Question 14).

21.​Describe Named Entity Recognition (NER) with example


→ Named Entity Recognition (NER) is a Natural Language Processing (NLP) task that
identifies and classifies named entities within text. These entities can be people,
organizations, locations, dates, times, and more. NER helps computers understand the key
subjects of a piece of text and is used in various applications like search engines, chatbots,
and data analysis.
Example:
In the sentence "Apple announced its latest product in California," NER would identify:
●​ Apple: as an organization.
●​ California: as a location.
Explanation:
NER models are trained on labeled datasets to learn how to classify different words or
phrases as belonging to specific entity types. For example, a model might be trained to
recognize that "Apple" is usually an organization and "California" is usually a location.
Use Cases:
●​ Search Engines: NER helps search engines provide more relevant results by
understanding the context of search queries.
●​ Chatbots: NER allows chatbots to understand user requests and provide accurate
responses.
●​ Data Analysis: NER can be used to extract key information from large amounts of
unstructured text.
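
A minimal NER sketch with spaCy, assuming the library is installed and the small English model has been fetched (python -m spacy download en_core_web_sm); the exact labels depend on the model.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple announced its latest product in California.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected output along the lines of:
# Apple ORG
# California GPE
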
22. Write short notes on
a) Bag of Words (BoW):
→ The Bag-of-Words model represents a document as an unordered collection ("bag") of its words, recording only which words occur and how often, while ignoring grammar and word order. Each document becomes a vector of word counts over the vocabulary, which can then be fed to machine learning models; its main drawbacks are large, sparse vectors and the loss of word order and context.

b) Term Frequency-Inverse Document Frequency (TF-IDF):


→ TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used in
natural language processing and information retrieval to evaluate the importance of a word in
a document relative to a collection of documents (corpus).
How TF-IDF Works?
TF-IDF combines two components: Term Frequency (TF) and Inverse Document
Frequency (IDF).

1)​ Term Frequency (TF): Measures how often a word appears in a document. A higher
frequency suggests greater importance. If a term appears frequently in a document, it
is likely relevant to the document’s content.
Formula: TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d).

Limitations of TF Alone:
●​ TF does not account for the global importance of a term across the entire
corpus.
●​ Common words like "the" or "and" may have high TF scores but are not
meaningful in distinguishing documents.
2)​ Inverse Document Frequency (IDF): Reduces the weight of common words across
multiple documents while increasing the weight of rare words. If a term appears in
fewer documents, it is more likely to be meaningful and specific.
Formula: IDF(t) = log(N / df(t)), or, with smoothing, IDF(t) = log(N / (1 + df(t))), where N is the total number of documents in the corpus and df(t) is the number of documents that contain the term t.
●​ The logarithm is used to dampen the effect of very large or very small values,
ensuring the IDF score scales appropriately.
●​ It also helps balance the impact of terms that appear in extremely few or
extremely many documents.

Limitations of IDF Alone:


●​ IDF does not consider how often a term appears within a specific document.
●​ A term might be rare across the corpus (high IDF) but irrelevant in a specific
document (low TF).
3)​ TF-IDF Score:
Formula: TF-IDF(t, d) = TF(t, d) × IDF(t). A high score means the term is frequent in this document but rare across the corpus.

●​ Highlights important words in a document.


●​ Reduces the weight of common words that appear in many documents (e.g.,
"the", "is").

c) Parsing:
→ Parsing (syntactic analysis) is the process of analyzing the grammatical structure of a sentence: it determines which structural role each word plays, how words group into phrases, and how those phrases relate to each other, usually producing a parse tree. It can be done top-down, starting from the start symbol and expanding grammar rules, or bottom-up, starting from the words and reducing them to phrases (see Question 5).

d) Lemmatization:
→ Lemmatization is the process of grouping together the different inflected forms of a word so they can be analyzed as a single item. It is similar to stemming, but it takes context and part of speech into account and maps each form to a valid dictionary word (the lemma). Text preprocessing may use either stemming or lemmatization; the two are often confused, but lemmatization is generally preferred because it performs a proper morphological analysis of the word rather than simply chopping off suffixes.
Examples of lemmatization:
→ rocks : rock
→ corpora : corpus
→ better : good
One major difference from stemming is that a lemmatizer (for example, NLTK's WordNetLemmatizer) takes a part-of-speech parameter, "pos"; if it is not supplied, the default is "noun".
Lemmatization Techniques
Lemmatization techniques in natural language processing (NLP) involve methods to identify
and transform words into their base or root forms, known as lemmas. These approaches
contribute to text normalization, facilitating more accurate language analysis and processing
in various NLP applications.
Three types of lemmatization techniques are:
1.​ Rule-Based Lemmatization:
Rule-based lemmatization involves the application of predefined rules to derive the base or
root form of a word. Unlike machine learning-based approaches, which learn from data,
rule-based lemmatization relies on linguistic rules and patterns.
Rule: For regular verbs ending in “-ed,” remove the “-ed” suffix.
Example:
Word: “walked”
Rule Application: Remove “-ed”
Result: “walk”

2.​ Dictionary-Based Lemmatization:


Dictionary-based lemmatization relies on predefined dictionaries or lookup tables to map
words to their corresponding base forms or lemmas. Each word is matched against the
dictionary entries to find its lemma. This method is effective for languages with well-defined
rules.
Suppose we have a dictionary with lemmatized forms for some words:
‘running’ -> ‘run’
‘better’ -> ‘good’
‘went’ -> ‘go’
When we apply dictionary-based lemmatization to a text like “I was running to become a
better athlete, and then I went home,” the resulting lemmatized form would be: “I was run to
become a good athlete, and then I go home.”

3.​ Machine Learning-Based Lemmatization:


Machine learning-based lemmatization leverages computational models to automatically
learn the relationships between words and their base forms. Unlike rule-based or
dictionary-based approaches, machine learning models, such as neural networks or statistical
models, are trained on large text datasets to generalize patterns in language.
Example:
Consider a machine learning-based lemmatizer trained on diverse texts. When encountering
the word ‘went,’ the model, having learned patterns, predicts the base form as ‘go.’ Similarly,
for ‘happier,’ the model deduces ‘happy’ as the lemma. The advantage lies in the model’s
ability to adapt to varied linguistic nuances and handle irregularities, making it robust for
lemmatizing diverse vocabularies.
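
A small sketch comparing stemming with dictionary-based lemmatization in NLTK, assuming NLTK is installed and the wordnet corpus has been downloaded.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["rocks", "corpora", "running", "better"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word))

# The "pos" parameter changes the result: with the default noun POS, 'better'
# stays 'better', but as an adjective it maps to 'good'.
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'
print(lemmatizer.lemmatize("running", pos="v"))  # 'run'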


23.​Describe Syntactic & Semantic constraint on coreference.


→ Coreference resolution, the process of identifying expressions that refer to the same entity,
relies heavily on both syntactic and semantic constraints.
Syntactic constraints, such as number agreement and grammatical role, help narrow down
the possible antecedents for an anaphor.
Semantic constraints, including selectional restrictions and subsumption, further filter out
potential referents by considering the meaning of words and their relationships.
Syntactic Constraints:
●​ Number Agreement: Pronouns and their antecedents must agree in number
(singular/plural). For example, "John has a new Acura. It is red." but not "They are
red.".
●​ Person Agreement: Pronouns must also agree in person (first, second, third) with
their antecedents.
●​ Case Agreement: While less strict than number and person, case agreement can also
play a role.
●​ Grammatical Role: Entities in subject positions tend to be more salient and preferred
referents than those in object or oblique positions.
●​ Binding Theory: This theory, which governs the interpretation of pronouns and
reflexive pronouns, helps constrain the possible antecedents within a clause.
●​ I-within-i Filter: This constraint limits coreference between a parent noun phrase and
any of its embedded noun phrases.
●​ Appositive Constructions: Appositive constructions, where a noun phrase provides
additional information about another, can be exceptions to the i-within-i filter.

Semantic Constraints:
●​ Selectional Restrictions: Verbs impose semantic constraints on their arguments. For
example, a "spokesperson" can "announce" but not "repair".
●​ Subsumption: One entity can be a subtype of another. For instance, "Microsoft is a
company".
●​ Verb Semantics: Some verbs emphasize particular arguments, biasing pronoun
resolution. For example, in "John telephoned Bill. He lost the laptop," "He" is more
likely to refer to John (subject) than Bill (object).
●​ World Knowledge: General knowledge about the world can also influence
coreference. For example, "John and Mary are a married couple. They are happy".
●​ Inference: Reasoning about the relationships between entities can help resolve
coreference. For example, "John hit the ball. It flew into the crowd".
By combining syntactic and semantic constraints, coreference resolution systems can
effectively determine which expressions refer to the same entity, which is crucial for tasks
like natural language understanding and machine translation.

24.​Explain Text summarization.


→ Text summarization is the process of creating a concise, accurate representation of a
longer text. It involves extracting the most important information and condensing it into a
shorter summary, while retaining the core meaning and context. This process can be done
manually or automatically using natural language processing (NLP) techniques.

Key Concepts:
●​ Conciseness: Summaries are shorter than the original text, highlighting the key points
and reducing redundancy.
●​ Accuracy: The summary should accurately reflect the original text's main ideas and
information.
●​ Coherence: The summary should be clear and easy to understand, with a logical flow
of ideas.
●​ Automatic Text Summarization (ATS): This is a branch of NLP that aims to
automate the process of creating summaries using algorithms and machine learning
techniques.

Types of Summarization:
●​ Extractive Summarization:
1.​ Selects important sentences or phrases directly from the original text.
2.​ Does not rephrase content.
3.​ Example tools: TextRank, LexRank.
●​ Abstractive Summarization:
1.​ Generates new sentences that capture the main ideas.
2.​ Uses deep learning models (e.g., BERT, GPT).
3.​ Similar to how humans write summaries.
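
A naive extractive-summarization sketch in plain Python: each sentence is scored by the corpus frequencies of its words and the highest-scoring sentences are kept. Real extractive systems (TextRank, LexRank) and abstractive models are far more sophisticated; the example text is invented for illustration.

import re
from collections import Counter

text = (
    "NLP systems process human language. "
    "Extractive summarization selects important sentences from the text. "
    "Abstractive summarization generates new sentences. "
    "The weather was pleasant yesterday."
)

sentences = re.split(r"(?<=[.!?])\s+", text.strip())
freq = Counter(re.findall(r"\w+", text.lower()))

def score(sentence):
    return sum(freq[w] for w in re.findall(r"\w+", sentence.lower()))

# Keep the two highest-scoring sentences as the summary.
summary = sorted(sentences, key=score, reverse=True)[:2]
print(" ".join(summary))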

Benefits of Text Summarization:


●​ Time Savings: Summaries provide a quick overview of a long text, allowing users to
quickly grasp the main points.
●​ Information Extraction: Summaries help users identify the key information and
topics discussed in the original text.
●​ Improved Learning: Summarization can enhance comprehension and knowledge
retention by helping users focus on the most important details.

Applications of Text Summarization:


●​ News Summaries: Summarizing news articles to provide a quick overview of
important events.
●​ Research Papers: Extracting the key findings and conclusions from scientific papers.
●​ Technical Documents: Simplifying complex information for easier understanding.
●​ Customer Service: Providing concise summaries of customer inquiries.

25.​Describe briefly the Sentiment Analysis techniques.


→ Sentiment analysis techniques, a core part of natural language processing (NLP), involve
using computational linguistics, data mining, and machine learning to determine the
emotional tone (positive, negative, or neutral) expressed in text. These techniques are used to
analyze customer feedback, social media, and other forms of textual data to understand public
opinion and identify trends.
Sentiment analysis techniques:
1.​ Rule-Based Methods: These methods rely on predefined rules and lexicons
(dictionaries of words with associated sentiments) to classify text. For example, a rule
might state that "great" and "amazing" are positive words, while "terrible" and
"awful" are negative. While rule-based methods can be fast and accurate for specific
tasks, they are less adaptable to new and complex contexts.
2.​ Machine Learning Methods: These approaches use algorithms trained on large
datasets of labeled text to predict sentiment. Common machine learning models used
in sentiment analysis include:
●​ Naive Bayes: A probabilistic classifier that predicts sentiment based on the
frequency of words and their associated sentiments.
●​ Support Vector Machines (SVMs): Algorithms that find optimal hyperplanes
to separate different sentiment classes in a feature space.
●​ Neural Networks: Deep learning models that can learn complex patterns in
text data, making them highly adaptable but requiring large training datasets.
3.​ Hybrid Approaches: These methods combine the strengths of rule-based and
machine learning approaches. For example, a hybrid system might use a lexicon to
identify key sentiment words and then employ a machine learning model to make the
final sentiment classification.
4.​ Aspect-Based Sentiment Analysis: This technique focuses on identifying and
analyzing sentiment towards specific aspects or features of a product or service. For
example, it can identify positive sentiment towards a product's design and negative
sentiment towards its battery life, even within the same review.
5.​ Emotion Detection: This type of sentiment analysis aims to identify specific
emotions expressed in text, such as happiness, sadness, anger, or frustration. It goes
beyond simple polarity (positive/negative) to understand the nuances of human
emotion.
6.​ Social Sentiment Analysis: This type of sentiment analysis is specifically tailored for
analyzing sentiment expressed in social media content, such as tweets and status
updates.
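
A toy lexicon-based (rule-based) sentiment sketch in plain Python: count positive and negative words from a tiny hand-made lexicon. Real lexicons (e.g. VADER) are much larger and also handle negation and intensifiers.

POSITIVE = {"great", "amazing", "good", "love", "excellent"}
NEGATIVE = {"terrible", "awful", "bad", "hate", "poor"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("The battery life is great and the camera is amazing"))  # positive
print(sentiment("Terrible support and awful packaging"))                 # negative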

26.​Define rationalist and empiricist approaches to NLP?


→ In Natural Language Processing (NLP), the rationalist approach emphasizes using logic
and pre-defined knowledge structures to model language, while the empiricist approach relies
on statistical analysis of large datasets to learn patterns in language. Empirical methods often
use probabilistic models and machine learning, while rationalist approaches might involve
rule-based systems or symbolic representations.
Rationalist Approach:
●​ Focus: Reasoning, logic, and innate knowledge structures.
●​ Method: May use rule-based systems, symbolic representations, or formal grammars
to model language.
●​ Example: Developing a parser that uses a set of predefined grammatical rules to
analyze sentences.
●​ Assumptions: Language knowledge is largely innate or can be derived through
logical deduction.
●​ Goal: To create models that reflect the underlying cognitive processes of language
understanding and generation.
Empiricist Approach:
●​ Focus: Statistical analysis, data-driven learning, and machine learning.
●​ Method: Uses large corpora (collections of text or speech) to train models that learn
patterns and probabilities.
●​ Example: Using machine learning to predict the next word in a sentence based on the
preceding words in a corpus.
●​ Assumptions: Language knowledge is acquired through experience and exposure to
data.
●​ Goal: To create models that can accurately predict and generate human language
based on observed patterns.
