
Natural Language Processing Question Bank

 What is NLP? Give the advantages and disadvantages of NLP in detail.


o NLP stands for Natural Language Processing, a field at the intersection of computer science,
human language (linguistics), and artificial intelligence. It is the technology used by
machines to understand, analyze, manipulate, and interpret human languages.
o Advantages:
 NLP helps users ask questions about any subject and get a direct response
within seconds.
 NLP offers exact answers to a question; it does not return unnecessary or
unwanted information.
 NLP helps computers communicate with humans in their own languages.
 It is very time efficient.
 Many companies use NLP to improve the efficiency and accuracy of documentation
processes and to identify information in large databases.
o Disadvantages:
 NLP may fail to capture context.
 NLP output can be unpredictable.
 NLP may require more keystrokes.
 NLP systems do not adapt well to new domains; they have limited functionality and
are usually built for a single, specific task.

 Explain the components of NLP.


o The two major components of NLP are:
 Natural Language Understanding (NLU): It helps the machine understand
and analyze human language by extracting metadata from content, such
as concepts, entities, keywords, emotion, relations, and semantic roles. NLU
is mainly used in business applications to understand the customer's problem
in both spoken and written language. NLU involves the following tasks:
 It is used to map the given input into a useful representation.
 It is used to analyze different aspects of the language.

 Natural Language Generation (NLG): It acts as a translator that converts
computerized data into a natural language representation. It mainly involves
text planning, sentence planning, and text realization.

 Give the difference between NLU and NLG.


 Explain the Applications of NLP.
o Applications of NLP are as follows:
 Question Answering
 Spam Detection
 Sentiment Analysis
 Machine Translation
 Spelling Correction
 Speech Recognition
 Chatbot
 Information extraction
 Natural Language Extraction

 What do you mean by Sentiment Analysis in NLP?


o Sentiment Analysis is also known as opinion mining. It is used on the web to
analyze the attitude, behavior, and emotional state of the sender. This application is
implemented through a combination of NLP (Natural Language Processing) and
statistics: values are assigned to the text (positive, negative, or neutral) and the
mood of the context is identified (happy, sad, angry, etc.).
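o As an illustration, here is a minimal, hedged sketch of sentiment analysis in Python using the TextBlob library (listed among the NLP libraries later in this document). The example sentences are invented, and TextBlob must be installed for the snippet to run.

from textblob import TextBlob

# Invented example texts; polarity lies in [-1, 1] (negative to positive)
reviews = [
    "I absolutely love this phone, the camera is amazing!",   # expect a positive score
    "The delivery was late and the product arrived broken.",  # expect a negative score
]

for text in reviews:
    polarity = TextBlob(text).sentiment.polarity
    label = "positive" if polarity > 0 else "negative" if polarity < 0 else "neutral"
    print(f"{label:8s} polarity={polarity:+.2f}  {text}")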

 Write a short note on How to Build an NLP Pipeline.


o The steps to build an NLP pipeline are as follows:
 Sentence Segmentation: Sentence Segment is the first step for building the
NLP pipeline. It breaks the paragraph into separate sentences.

 Word Tokenization: A word tokenizer is used to break the sentence into
separate words or tokens.

 Stemming: Stemming is used to normalize words into their base or root
form. For example, celebrates, celebrated and celebrating all originate
from the single root word "celebrate." The big problem with
stemming is that it sometimes produces a root word that has no meaning.

 Lemmatization: Lemmatization is quite similar to stemming. It is used to
group the different inflected forms of a word into its dictionary form, called the
lemma. The main difference from stemming is that lemmatization produces a root
word that has a meaning.

 Identifying the stop words: In English, there are a lot of words that appear
very frequently like "is", "and", "the", and "a". NLP pipelines will flag these
words as stop words. Stop words might be filtered out before doing any
statistical analysis.

 Dependency Parsing: Dependency parsing is used to find how all the
words in the sentence are related to each other.

 POS Tags: POS stands for parts of speech, which include noun, verb,
adverb, and adjective. A POS tag indicates how a word functions in
meaning as well as grammatically within the sentence. A word can have one or
more parts of speech depending on the context in which it is used.

 Named Entity Recognition: Named Entity Recognition (NER) is the process
of detecting named entities such as person names, movie names,
organization names, or locations.

 Chunking: Chunking is used to collect individual pieces of information
and group them into bigger units (phrases) within sentences.
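o The pipeline stages above can be seen end to end in a short, hedged sketch using spaCy (one of the libraries listed later). It assumes the small English model en_core_web_sm has been downloaded with: python -m spacy download en_core_web_sm; the example text is invented.

import spacy

nlp = spacy.load("en_core_web_sm")  # tokenizer, tagger, parser and NER in one pipeline
doc = nlp("London is the capital of England. Sachin Tendulkar celebrated his century in Mumbai.")

# Sentence segmentation
for sent in doc.sents:
    print("SENTENCE:", sent.text)

# Tokenization, lemmatization (spaCy lemmatizes rather than stems), stop words, POS, dependencies
for token in doc:
    print(f"{token.text:12s} lemma={token.lemma_:12s} pos={token.pos_:6s} "
          f"dep={token.dep_:10s} head={token.head.text:12s} stop={token.is_stop}")

# Named Entity Recognition
for ent in doc.ents:
    print("ENTITY:", ent.text, "->", ent.label_)

# Chunking (noun phrases)
for chunk in doc.noun_chunks:
    print("NOUN CHUNK:", chunk.text)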

 Explain the steps of Natural Language Processing.


o There are five steps in Natural Language Processing:
 Lexical and Morphological Analysis: The first phase of NLP is lexical
analysis. This phase scans the source text as a stream of characters and
converts it into meaningful lexemes. It divides the whole text into
paragraphs, sentences, and words.
 Syntactic Analysis: Syntactic Analysis is used to check grammar, word
arrangements, and shows the relationship among the words.
 Semantic Analysis: Semantic analysis is concerned with the meaning
representation. It mainly focuses on the literal meaning of words, phrases,
and sentences.
 Discourse Integration: Discourse integration depends upon the sentences
that precede it and also invokes the meaning of the sentences that follow it.
 Pragmatic Analysis: Pragmatic analysis is the fifth and last phase of NLP. It helps
you discover the intended effect by applying a set of rules that
characterize cooperative dialogues.

 Explain “Morphological Analysis” and “Syntax Analysis” in Natural Language
Processing. Explain Semantic and Syntactic analysis in NLP.
o Morphological (Lexical) Analysis: The first phase of NLP is lexical analysis. This
phase scans the source text as a stream of characters and converts it into
meaningful lexemes. It divides the whole text into paragraphs, sentences, and
words.
o Syntactic (Syntax) Analysis: Syntactic analysis is used to check grammar and word
arrangement, and shows the relationship among the words.
o Semantic Analysis: Semantic analysis is concerned with meaning
representation. It mainly focuses on the literal meaning of words, phrases, and
sentences.

 State the factors which may make understanding of natural language difficult for a
computer.
o The main reason NLP is difficult is that ambiguity and uncertainty exist in
language. There exist three types of ambiguity:
 Lexical Ambiguity: This happens when a word has two or more meanings.
Example: in "You are looking for a match", the word match has two meanings:
a partner, or a game/tournament the person is looking for.
 Syntactic Ambiguity: This happens when a sentence has two or more
meanings. Example: "I saw the girl with the binoculars." This sentence has two
readings: did I have the binoculars, or did the girl have the
binoculars?
 Referential Ambiguity: Referential ambiguity exists when you refer
to something using a pronoun. Example: "Goku went to Vegeta. He said,
'I'm starving.'" In this sentence it is not clear who is hungry.

 Write a short note on NLP APIs and Libraries.


o NLP APIs: Natural Language Processing APIs allow developers to integrate
human-to-machine communication and perform several useful tasks such as
speech recognition, chatbots, spelling correction, sentiment analysis, etc. Some of
the APIs are:
 IBM Watson API: It combines different sophisticated machine learning
techniques to enable developers to classify text into various custom
categories. It supports multiple languages, such as English, French, Spanish,
German, Chinese, etc.
 Chatbot API: It allows you to create intelligent chatbots for any service. It
supports Unicode characters, text classification, multiple languages, etc.
 Speech to text API: It is used to convert speech to text.
 Sentiment Analysis API: Also called 'opinion mining', it is used to
identify the tone of a user (positive, negative, or neutral).
 The Translation API by SYSTRAN: It is used to translate the text from the
source language to the target language.
 Text Analysis API by AYLIEN: It is used to derive meaning and insights
from the textual content.
o NLP Libraries: Some of the popular NLP libraries are as follows:
 Scikit-learn: It provides a wide range of algorithms for building machine
learning models in Python.
 Natural Language Toolkit (NLTK): NLTK is a complete toolkit for all NLP
techniques.
 Pattern: It is a web mining module for NLP and machine learning.
 TextBlob: It provides an easy interface to learn basic NLP tasks like
sentiment analysis, noun phrase extraction, or pos-tagging.
 Quepy: Quepy is used to transform natural language questions into queries
in a database query language.
 SpaCy: It is an open-source NLP library which is used for Data Extraction,
Data Analysis, Sentiment Analysis, and Text Summarization.
 Gensim: It works with large datasets and processes data streams.

 What do you mean by Parts of Speech Tagging? What is the need of this Task in NLP?
o Parts of Speech Tagging may be defined as the process of assigning one of the
parts of speech to the given word. It is generally called POS tagging. In simple
words, we can say that POS tagging is a task of labelling each word in a sentence
with its appropriate part of speech. We already know that parts of speech include
nouns, verbs, adverbs, adjectives, pronouns, conjunctions, and their sub-categories.
o Need of POS Tagging
 POS tags make it possible for automatic text processing tools to identify
which part of speech each word is. This facilitates the use of linguistic
criteria in addition to statistics.
 For languages where the same word can have different parts of speech,
e.g. "work" in English, POS tags are used to distinguish between the
occurrences of the word when used as a noun or as a verb.
 POS tags are also used to search for examples of grammatical or lexical
patterns without specifying a concrete word, e.g. to find examples of any
plural noun not preceded by an article.
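o A small, hedged example of POS tagging with NLTK, showing how the word "work" receives different tags as a verb and as a noun; it assumes the punkt and averaged_perceptron_tagger resources have been downloaded.

import nltk
from nltk import word_tokenize, pos_tag

# One-time downloads (uncomment on the first run)
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

print(pos_tag(word_tokenize("I work in Mumbai.")))        # 'work' is typically tagged as a verb (VBP)
print(pos_tag(word_tokenize("My work starts at nine.")))  # 'work' is typically tagged as a noun (NN)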
 Write a short note on: WordNet, FrameNet, Stemmer and Perplexity.
o WordNet
 WordNet is a database of words in the English language. Unlike a dictionary
that's organized alphabetically, WordNet is organized by concept and
meaning. In fact, traditional dictionaries were created for humans but what's
needed is a lexical resource more suited for computers. This is where
WordNet becomes useful.
 WordNet is a network of words linked by lexical and semantic relations.
Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive
synonyms, called synsets, each expressing a distinct concept. Synsets are
interlinked by means of conceptual-semantic and lexical relations. The
resulting network of meaningfully related words and concepts can be
navigated with the WordNet browser.
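 A short, hedged example of navigating WordNet programmatically through NLTK (it assumes the wordnet corpus has been downloaded via nltk.download). The word "bank" is used because it reappears in the word sense disambiguation question below.

from nltk.corpus import wordnet as wn

# Each synset is a set of cognitive synonyms expressing one distinct concept
for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())
    print("  lemmas:   ", syn.lemma_names())
    print("  hypernyms:", [h.name() for h in syn.hypernyms()])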
o FrameNet
 The FrameNet corpus is a lexical database of English that is both human-
and machine-readable, based on annotating examples of how words are used
in actual texts. FrameNet is based on a theory of meaning called Frame
Semantics.
 The basic idea is straightforward: that the meanings of most words can best
be understood on the basis of a semantic frame: a description of a type of
event, relation, or entity and the participants in it. For example, the concept
of cooking typically involves a person doing the cooking (Cook), the food
that is to be cooked (Food), something to hold the food while cooking
(Container) and a source of heat (Heating_instrument). In the FrameNet
project, this is represented as a frame called Apply_heat, and the Cook,
Food, Heating_instrument and Container are called frame elements (FEs).
Words that evoke this frame, such as fry, bake, boil, and broil, are called
lexical units (LUs) of the Apply_heat frame. The job of FrameNet is to
define the frames and to annotate sentences to show how the FEs fit
syntactically around the word that evokes the frame.
o Stemmer:
 Stemming is the process of reducing a word to its word stem by stripping
suffixes and prefixes, or to the root form of the word known as the lemma. Stemming
is important in natural language understanding (NLU) and natural language
processing (NLP).
 Stemming is part of linguistic studies in morphology and of artificial
intelligence (AI) information retrieval and extraction. Stemming helps
extract meaningful information from vast sources like big data or
the Internet, since additional forms of a word related to a subject may need to
be searched to get the best results. Stemming is also part of queries and
Internet search engines.
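 A hedged sketch contrasting an NLTK stemmer with a lemmatizer, using the "celebrate" example from the pipeline section (the wordnet resource must be downloaded for the lemmatizer).

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["celebrates", "celebrated", "celebrating", "studies"]:
    print(f"{word:12s} stem={stemmer.stem(word):10s} "
          f"lemma={lemmatizer.lemmatize(word, pos='v')}")
# The stem may not be a real word (e.g. 'studi'), while the lemma ('study') is.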
o Perplexity:
 It is a metric used to judge how good a language model is
 We can define perplexity as the inverse probability of the test set,
normalized by the number of words:
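 In symbols (the standard formulation), for a test set W = w1 w2 ... wN:
PP(W) = P(w1 w2 ... wN)^(-1/N), i.e. the N-th root of 1 / P(w1 w2 ... wN).
 Expanding with the chain rule for a bigram model: PP(W) = ( Π 1 / P(wi | wi-1) )^(1/N),
with the product taken over i = 1 ... N.
 The lower the perplexity, the better the language model.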

 List and explain Smoothing techniques used in NLP.


o To keep a language model from assigning zero probability to unseen events
(n-grams that never occur in the training data), we shave off a bit of probability
mass from some more frequent events and give it to the events we've never seen.
o This modification is called smoothing or discounting.
o Different ways to do smoothing:
 Laplace (add-one) smoothing: The simplest way to do smoothing is to add
one to all the bigram counts before we normalize them into probabilities.
All the counts that used to be zero will now have a count of 1, the counts of
1 will be 2, and so on. This algorithm is called Laplace smoothing.
 Unsmoothed unigram: P(wi) = ci / N
 Laplace-smoothed unigram: PLaplace(wi) = (ci + 1) / (N + V)
 where ci = count of the word wi, N = total number of word
tokens, and V = vocabulary size (number of distinct word types).
 For bigrams: P(wn | wn-1) = C(wn-1 wn) / C(wn-1)
 PLaplace(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)

 Backoff smoothing: In backoff, we use the trigram if the evidence is
sufficient, otherwise we use the bigram, otherwise the unigram. In other
words, we only "back off" to a lower-order n-gram if we have zero evidence
for the higher-order n-gram.
 Interpolation: Linearly combine estimates of N-gram models of increasing
order.
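o A minimal, self-contained Python sketch of add-one (Laplace) smoothing for a bigram model, following the formulas above; the tiny two-sentence corpus is invented for illustration.

from collections import Counter

corpus = [["<s>", "i", "am", "a", "human", "</s>"],
          ["<s>", "i", "am", "not", "a", "stone", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
V = len(unigrams)  # vocabulary size (distinct tokens, here including <s> and </s>)

def p_laplace(w, prev):
    # P_Laplace(w | prev) = (C(prev w) + 1) / (C(prev) + V)
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_laplace("am", "i"))     # seen bigram ('i', 'am'), count 2
print(p_laplace("stone", "i"))  # unseen bigram, but still gets a non-zero probability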

 Write a short note on Language models


o A language model is a core component of modern Natural Language Processing
(NLP). It is a statistical tool that analyzes patterns in human language in order to
predict words.
o NLP-based applications use language models for a variety of tasks, such as audio
to text conversion, speech recognition, sentiment analysis, summarization, spell
correction, etc.
o They are used for:
 Speech Recognition.
 OCR and Handwriting recognition.
 Machine translation.
 Sentence generation.
 Context sensitive spelling correction.

 List the types of Morphology and explain the approaches to Morphology.


o The study of word formation – how words are built up from smaller pieces – is
known as morphology. Types of morphology:
 Inflectional morphology: Modification of a word to express different
grammatical categories. Examples: cats, men etc.
 Derivational Morphology: Creation of a new word from an existing word,
often changing the grammatical category. Examples: happiness, brotherhood, etc.
o Approaches to Morphology:
 Morpheme based morphology: Word forms are analyzed as arrangements of
morphemes. Morphemes- smallest linguistic unit with a grammatical
function.
 Lexeme based morphology: Lexeme-based morphology usually takes what
is called an "item-and-process" approach. Instead of analyzing a word form
as a set of morphemes arranged in sequence, a word form is said to be the
result of applying rules that alter a word-form or stem in order to produce a
new one.
 Word based morphology: Word-based morphology is (usually) a word-and-
paradigm approach. Instead of stating rules to combine morphemes into
word forms, or to generate word forms from stems, word-based morphology
states generalizations that hold between the forms of inflectional paradigms.

 Draw the NLG system architecture and explain its stages.


o Natural Language Generation (NLG) is the subfield of artificial intelligence and
computational linguistics that focuses on computer systems that can produce
understandable texts in English or other human languages.
o NLG system architecture: [figure – the classic architecture is a pipeline of a
Document Planner, a Microplanner, and a Surface Realiser]
o Stages of NLG system architecture:


 Example to work on:
 Grass pollen levels for Friday have increased from the moderate to
high levels of yesterday with values of around 6 to 7 across most parts
of the country. However, in Northern areas, pollen levels will be
moderate with values of 4.
 Pollen counts are expected to remain high at level 6 over most of
Scotland, and even level 7 in the south east. The only relief is in the
Northern Isles and far northeast of mainland Scotland with medium
levels of pollen count.
 Content determination: Deciding what information to mention in the text.
For instance, in the pollen example above, deciding whether to explicitly
mention that pollen level is 7 in the south east.
 Document structuring: Overall organisation of the information to convey.
For example, deciding to describe the areas with high pollen levels first,
instead of the areas with low pollen levels.
 Aggregation: Merging of similar sentences to improve readability and
naturalness. For instance, merging the two following sentences:
 Grass pollen levels for Friday have increased from the moderate to high
levels of yesterday and
 Grass pollen levels will be around 6 to 7 across most parts of the country
 into the following single sentence:
 Grass pollen levels for Friday have increased from the moderate to high
levels of yesterday with values of around 6 to 7 across most parts of the
country.
 Lexical choice: Putting words to the concepts. For example, deciding
whether medium or moderate should be used when describing a pollen level
of 4.
 Referring expression generation: Creating referring expressions that identify
objects and regions. For example, deciding to use in the Northern Isles and
far northeast of mainland Scotland to refer to a certain region in Scotland.
This task also includes making decisions about pronouns and other types of
anaphora.
 Realization: Creating the actual text, which should be correct according to
the rules of syntax, morphology, and orthography.

 Write a short note on Named Entity Recognition.


o NER, short for Named Entity Recognition, is a standard Natural Language
Processing problem which deals with information extraction. The primary
objective is to locate and classify named entities in text into predefined categories
such as the names of persons, organizations, locations, events, expressions of
time, quantities, monetary values, percentages, etc.
o To put it simply, NER deals with extracting real-world entities from text, such
as a person, an organization, or an event. Named Entity Recognition is also
known as entity identification, entity chunking, and entity extraction. NER tags are
assigned in a manner quite similar to POS (part-of-speech) tags.
o Application of NER :
 Classifying content for news providers.
 Automatically Summarizing Resumes.
 Optimizing Search Engine Algorithms.
 Powering Recommendation systems.
 Simplifying Customer Support.
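o A hedged NLTK-based sketch of NER (an alternative to the spaCy pipeline example earlier); it assumes the maxent_ne_chunker and words resources have been downloaded, and the sentence is invented.

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Sundar Pichai announced new Google offices in London."
tree = ne_chunk(pos_tag(word_tokenize(sentence)))

# Subtrees with a label are the detected named entities (PERSON, ORGANIZATION, GPE, ...)
for subtree in tree:
    if hasattr(subtree, "label"):
        entity = " ".join(token for token, tag in subtree.leaves())
        print(subtree.label(), "->", entity)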
 Consider following training data
<s> I am Jack </s>
<s> Jack I am </s>
<s> Jack I like </s>
<s> Jack I do like </s>
<s> do I like Jack</s>
Assume that we use a bi-gram language model. Based on the above data, what is the most
probable next word predicted by the model?
a.) <s> Jack ______
b.) <s> Jack I do _____
c.) <s> Jack I am Jack _____
d.) <s> do I like _____
o a.) <s> Jack I — "I" follows "Jack" in 3 of the 5 sentences, so it is the most probable.
o b.) <s> Jack I do like — "like" and "I" each follow "do" once, so both are equally probable.
o c.) <s> Jack I am Jack I
o d.) <s> do I like Jack — ignoring the end marker, "Jack" is the only word observed after
"like"; if </s> is counted as a token, </s> is in fact the more frequent continuation.
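o The exercise can be reproduced with a small, self-contained sketch that counts bigrams in the five training sentences and ranks the continuations of the last word (only the last word matters in a bigram model).

from collections import Counter, defaultdict

sentences = ["<s> I am Jack </s>", "<s> Jack I am </s>", "<s> Jack I like </s>",
             "<s> Jack I do like </s>", "<s> do I like Jack </s>"]

# Count how often each word follows each other word
counts = defaultdict(Counter)
for sent in sentences:
    tokens = sent.split()
    for w1, w2 in zip(tokens, tokens[1:]):
        counts[w1][w2] += 1

def next_word_distribution(word):
    total = sum(counts[word].values())
    return [(w, c / total) for w, c in counts[word].most_common()]

print(next_word_distribution("Jack"))  # 'I' is the most frequent continuation
print(next_word_distribution("do"))    # 'like' and 'I' are tied
print(next_word_distribution("like"))  # compare '</s>' and 'Jack'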

 For the given Corpus:


<s> I am a human </s>
<s> I am not a stone </s>
<s> I I live in Mumbai </s>
Check the probability of “I I am not” using the bigram model.
o P(S) = P(I I am not) = P(I | <s>) × P(I | I) × P(am | I) × P(not | am) = 3/3 × 1/4 × 2/4 × 1/2
= 0.0625. This is the probability of this word sequence occurring under the
bigram model.

 Explain Bag of Words (BoW) in details.


o The Bag of Words (BoW) model simply records whether or not (or how often) a word occurs
in a document. It gives an obvious and simple representation of a document: a
vocabulary-sized vector which contains a 1 if the word occurs in the document and a 0
otherwise. These representations are usually very sparse.
o One of the biggest problems with text is that it is messy and unstructured, while
machine learning algorithms prefer structured, well-defined, fixed-length inputs;
by using the Bag-of-Words technique we can convert variable-length texts into
fixed-length vectors.
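o A hedged scikit-learn sketch of the Bag-of-Words representation (scikit-learn is listed among the NLP libraries earlier); the two toy documents are invented.

from sklearn.feature_extraction.text import CountVectorizer

docs = ["The cat sat on the mat.", "The dog barked at the cat."]

vectorizer = CountVectorizer()             # word counts; pass binary=True for 0/1 occurrence vectors
X = vectorizer.fit_transform(docs)         # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # one fixed-length vector per document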
 For the given corpus:
D1: The cat sat on the mat. The cat was white.
D2: The brown coloured dog was barking loudly at the cat.
D3: The mat was green in colour.
D4: The dog pulled the mat with his teeth. The cat still sat on the mat.
Calculate:
a.) Term Frequency - tf
b.) Inverse document frequency – idf
c.) Term frequency–inverse document frequency – tf-idf

o Term Frequency (term counts per document, after removing stop words and stemming):
 D1: The cat sat on the mat. The cat was white. → 5 terms
 D2: The brown coloured dog was barking loudly at the cat. → 6 terms
 D3: The mat was green in colour. → 3 terms
 D4: The dog pulled the mat with his teeth. The cat still sat on the mat. → 7 terms
 Vocabulary V = {bark, brown, cat, colour, dog, green, loud, mat, pull, sit, teeth, white}, |V| = 12.
 Normalized tf vectors (components in the vocabulary order above):
 d1 = 1/5 (0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 0, 1)
 d2 = 1/6 (1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
 d3 = 1/3 (0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0)
 d4 = 1/7 (0, 0, 1, 0, 1, 0, 0, 2, 1, 1, 1, 0)
o Inverse document frequency (idf = ln(N/df), with N = 4 documents):
 bark = ln(4/1), brown = ln(4/1), cat = ln(4/3), colour = ln(4/2), dog = ln(4/2), green = ln(4/1),
loud = ln(4/1), mat = ln(4/3), pull = ln(4/1), sit = ln(4/2), teeth = ln(4/1), white = ln(4/1)
o Term frequency–inverse document frequency (tf-idf = tf × idf):
 d1 = 1/5 (0, 0, 2·ln(4/3), 0, 0, 0, 0, 1·ln(4/3), 0, 1·ln(4/2), 0, 1·ln(4/1))
 d2 = 1/6 (1·ln(4/1), 1·ln(4/1), 1·ln(4/3), 1·ln(4/2), 1·ln(4/2), 0, 1·ln(4/1), 0, 0, 0, 0, 0)
 d3 = 1/3 (0, 0, 0, 1·ln(4/2), 0, 1·ln(4/1), 0, 1·ln(4/3), 0, 0, 0, 0)
 d4 = 1/7 (0, 0, 1·ln(4/3), 0, 1·ln(4/2), 0, 0, 2·ln(4/3), 1·ln(4/1), 1·ln(4/2), 1·ln(4/1), 0)
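o The hand calculation above can be checked with a short Python sketch that uses the same definitions (tf = raw count divided by document length, idf = ln(N/df)). Note that library implementations such as scikit-learn's TfidfVectorizer use a slightly different idf formula, so their numbers would not match exactly.

import math
from collections import Counter

# Documents already reduced to their content terms (stop words removed, words stemmed)
docs = {
    "D1": ["cat", "sit", "mat", "cat", "white"],
    "D2": ["brown", "colour", "dog", "bark", "loud", "cat"],
    "D3": ["mat", "green", "colour"],
    "D4": ["dog", "pull", "mat", "teeth", "cat", "sit", "mat"],
}

N = len(docs)
vocab = sorted({t for terms in docs.values() for t in terms})

# idf(t) = ln(N / df(t)), where df(t) = number of documents containing t
df = {t: sum(t in terms for terms in docs.values()) for t in vocab}
idf = {t: math.log(N / df[t]) for t in vocab}

for name, terms in docs.items():
    tf = Counter(terms)
    tfidf = {t: round(tf[t] / len(terms) * idf[t], 3) for t in vocab if tf[t]}
    print(name, tfidf)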

 What do you mean by Word Embedding?


o In natural language processing (NLP), word embedding is a term used for the
representation of words for text analysis, typically in the form of a real-valued
vector that encodes the meaning of the word such that the words that are closer in
the vector space are expected to be similar in meaning. Word embedding can be
obtained using a set of language modeling and feature learning techniques where
words or phrases from the vocabulary are mapped to vectors of real numbers.
o Limitation:
 Inability to handle unknown words.
 No shared representations at sub-word levels.
 Scaling to new languages requires new embedding matrices.
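o A hedged sketch of learning word embeddings with Gensim's Word2Vec (Gensim is listed among the NLP libraries earlier). The toy corpus is far too small to produce meaningful vectors and only illustrates the API; parameter names follow Gensim 4.x.

from gensim.models import Word2Vec

# Each sentence is a list of tokens; a realistic corpus would contain millions of sentences
sentences = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "barked", "at", "the", "cat"],
    ["the", "dog", "chased", "the", "cat"],
]

model = Word2Vec(
    sentences,
    vector_size=50,  # dimensionality of the word vectors
    window=2,        # context window size
    min_count=1,     # keep even rare words in this toy example
    sg=1,            # 1 = skip-gram, 0 = CBOW
)

print(model.wv["cat"][:5])                   # first few dimensions of the vector for 'cat'
print(model.wv.most_similar("cat", topn=2))  # nearest neighbours in the embedding space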

 Write a short note on: CBoW (Continuous Bag of Words), Skip-gram, and GloVe.
o CBoW:
 In the CBOW model, the distributed representations of the context (the
surrounding words) are combined to predict the word in the middle, while in
the Skip-gram model the distributed representation of the input word is used
to predict the context.
 A prerequisite for any neural network or any supervised training technique is
to have labeled training data. How do you train a neural network to predict
word embeddings when you don't have any labeled data, i.e. words and their
corresponding word embeddings?
o Skip-gram:
 We do this by creating a "fake" task for the neural network to train on. We are not
interested in the inputs and outputs of this network; the goal is actually just
to learn the weights of the hidden layer, which are the "word vectors"
we are trying to learn.
 The fake task for the Skip-gram model is: given a word, try to predict its
neighboring words. We define a neighboring word by the window size, a
hyper-parameter. (A sketch that generates these training pairs appears after this answer.)
o GloVe:
 GloVe (Global Vectors) learns word vectors from a global word–word
co-occurrence matrix, combining the strengths of earlier count-based methods
with those of prediction-based models such as Skip-gram.
 Earlier count-based approaches had limitations. HAL (Hyperspace Analogue to
Language) uses a word–word matrix instead of the term–document matrix; its entries
contain counts of target words occurring with context words, but HAL does not handle
frequently occurring words properly. LSA does not actually produce word vectors;
its term vectors do not have the nice properties that word vectors have.
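o To make the "fake" Skip-gram task concrete, here is a minimal plain-Python sketch (invented sentence) that generates the (input word, context word) training pairs for a given window size.

def skipgram_pairs(tokens, window=2):
    # Pair every centre word with each word inside the window on either side
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "grass pollen levels will be high".split()
for center, context in skipgram_pairs(sentence, window=2):
    print(center, "->", context)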

 Write a short note on Word Sense Disambiguation and also explain Lesk Algorithm.
o Word Sense Disambiguation (WSD) is the task of selecting the correct sense of a word
in context. For example, let's disambiguate the word bank: it can mean the place where
we deposit our money, or the edge of a river.
o The (simplified) Lesk algorithm assumes we have some sense-labeled data (dictionary
glosses and example sentences):
 Take all the sentences labeled with the relevant word sense.
 Add these to the gloss + examples for each sense; call this the "signature" of the sense.
 Choose the sense with the most word overlap between the context of the target word
and the signature.
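o NLTK ships a simplified Lesk implementation; a hedged example (assuming the wordnet and punkt resources are downloaded) applied to the bank example above:

from nltk.tokenize import word_tokenize
from nltk.wsd import lesk

sent1 = word_tokenize("I went to the bank to deposit my money")
sent2 = word_tokenize("The fisherman sat on the bank of the river")

# lesk() returns the WordNet synset whose gloss overlaps most with the context words
for context in (sent1, sent2):
    sense = lesk(context, "bank")
    print(sense, "-", sense.definition())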

 Explain Supervised Word Sense Disambiguation technique.


o Supervised methods are based on the assumption that the context can provide
enough evidence on its own to disambiguate words. Probably every machine
learning algorithm going has been applied to WSD, including associated
techniques such as feature selection, parameter optimization, and
ensemble learning. Support vector machines and memory-based learning have
been shown to be the most successful approaches, to date, probably because they
can cope with the high-dimensionality of the feature space. However, these
supervised methods are subject to a new knowledge acquisition bottleneck since
they rely on substantial amounts of manually sense-tagged corpora for training,
which are laborious and expensive to create.
