NLP QB
Stemming: Stemming is used to normalize words into their base or root
form. For example, celebrates, celebrated and celebrating all originate
from the single root word "celebrate." The big problem with stemming is
that it sometimes produces a root word that has no meaning.
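This problem can be seen with a naive suffix-stripping stemmer (a hand-rolled sketch for illustration, not a real algorithm such as Porter's):

```python
# Naive suffix-stripping stemmer: strips the first matching suffix.
# Illustrative only; real stemmers (Porter, Snowball) apply ordered
# rewrite rules with extra conditions on the remaining stem.
SUFFIXES = ["ing", "ed", "es", "s"]

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# All three forms reduce to "celebrat" -- a root that is not a
# meaningful English word, which is exactly the problem noted above.
for w in ["celebrates", "celebrated", "celebrating"]:
    print(w, "->", naive_stem(w))
```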
Identifying the stop words: In English, many words appear very
frequently, such as "is", "and", "the", and "a". NLP pipelines flag these
words as stop words. Stop words are often filtered out before doing any
statistical analysis.
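Stop-word filtering can be sketched in a few lines (the stop-word list here is a tiny illustrative subset; real pipelines use much longer lists, e.g. NLTK's):

```python
STOP_WORDS = {"is", "and", "the", "a"}  # tiny illustrative list

def remove_stop_words(tokens):
    # Keep only tokens that are not flagged as stop words.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("The cat is on a mat".split()))
# -> ['cat', 'on', 'mat']
```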
Dependency Parsing: Dependency Parsing is used to find how all the
words in a sentence are related to each other.
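A dependency parse can be written down as (dependent, relation, head) triples; the parse below is hand-annotated, with relation labels loosely following Universal Dependencies conventions:

```python
# Dependency parse of "The cat sat on the mat" as
# (dependent, relation, head) triples, annotated by hand.
DEPS = [
    ("The", "det", "cat"),
    ("cat", "nsubj", "sat"),
    ("sat", "root", "ROOT"),
    ("on", "case", "mat"),
    ("the", "det", "mat"),
    ("mat", "obl", "sat"),
]

def head_of(word):
    # Return the head that the given word depends on.
    return next(head for dep, _, head in DEPS if dep == word)

print(head_of("cat"))  # -> sat
```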
POS Tags: POS stands for parts of speech, which include noun, verb,
adverb, and adjective. A POS tag indicates how a word functions both in
meaning and grammatically within a sentence. A word can have one or
more parts of speech depending on the context in which it is used.
State the factors which may make understanding of natural language difficult for a
computer.
o The main reason NLP is difficult is that ambiguity and uncertainty exist in
language. There are three types of ambiguity.
Lexical Ambiguity: This happens when a word has two or more meanings.
Example: "Are you looking for a match?" The word match has two
meanings: a partner, or a game in a tournament.
Syntactic Ambiguity: This happens when a sentence has two or more
meanings. Example: "I saw the girl with the binoculars." This sentence has
two readings: did I have the binoculars, or did the girl have the
binoculars?
Referential Ambiguity: Referential ambiguity exists when you refer to
something using a pronoun. Example: Goku went to Vegeta. He said
"I'm starving." In this sentence it is not clear who is starving.
What do you mean by Parts of Speech Tagging? What is the need of this Task in NLP?
o Parts of Speech Tagging may be defined as the process of assigning one of the
parts of speech to a given word. It is generally called POS tagging. In simple
words, POS tagging is the task of labelling each word in a sentence with its
appropriate part of speech. Parts of speech include nouns, verbs, adverbs,
adjectives, pronouns, conjunctions and their sub-categories.
o Need of POS Tagging
POS tags make it possible for automatic text processing tools to identify
which part of speech each word is. This facilitates the use of linguistic
criteria in addition to statistics.
For languages where the same word can have different parts of speech,
e.g. work in English, POS tags distinguish between occurrences of the
word used as a noun and used as a verb.
POS tags are also used to search for examples of grammatical or lexical
patterns without specifying a concrete word, e.g. to find examples of any
plural noun not preceded by an article.
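The ambiguity of a word like work can be made concrete with a toy lexicon-based tagger (a lookup-table sketch; real taggers such as HMM or perceptron taggers also use the surrounding context to pick a single tag):

```python
# Toy lexicon listing the possible POS tags of each word.
# Entries are illustrative, not from a real tagset resource.
LEXICON = {
    "the": {"DET"},
    "work": {"NOUN", "VERB"},   # ambiguous, as noted above
    "was": {"VERB"},
    "hard": {"ADJ", "ADV"},
}

def possible_tags(word):
    # Unknown words get a placeholder tag.
    return LEXICON.get(word.lower(), {"UNK"})

print(possible_tags("work"))  # both NOUN and VERB are possible
```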
Write a short note on: Word net, Frame net, Stemmer and Perplexity.
o Wordnet
WordNet is a database of words in the English language. Unlike a dictionary
that's organized alphabetically, WordNet is organized by concept and
meaning. In fact, traditional dictionaries were created for humans but what's
needed is a lexical resource more suited for computers. This is where
WordNet becomes useful.
WordNet is a network of words linked by lexical and semantic relations.
Nouns, verbs, adjectives and adverbs are grouped into sets of cognitive
synonyms, called synsets, each expressing a distinct concept. Synsets are
interlinked by means of conceptual-semantic and lexical relations. The
resulting network of meaningfully related words and concepts can be
navigated with the WordNet browser.
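The synset-and-relation structure can be mimicked with a small dictionary (the entries and IDs below are invented for illustration, in the style of NLTK's WordNet interface):

```python
# Miniature WordNet-style store: synsets keyed by an ID, each with
# member lemmas and a hypernym ("is-a") link to another synset.
SYNSETS = {
    "dog.n.01": {"lemmas": ["dog", "domestic_dog"], "hypernym": "canine.n.01"},
    "canine.n.01": {"lemmas": ["canine"], "hypernym": "carnivore.n.01"},
    "carnivore.n.01": {"lemmas": ["carnivore"], "hypernym": None},
}

def hypernym_chain(synset_id):
    # Walk the "is-a" links up to the top of the hierarchy.
    chain = []
    while synset_id is not None:
        chain.append(synset_id)
        synset_id = SYNSETS[synset_id]["hypernym"]
    return chain

print(hypernym_chain("dog.n.01"))
# -> ['dog.n.01', 'canine.n.01', 'carnivore.n.01']
```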
o FrameNet
The FrameNet corpus is a lexical database of English that is both human-
and machine-readable, based on annotating examples of how words are used
in actual texts. FrameNet is based on a theory of meaning called Frame
Semantics.
The basic idea is straightforward: that the meanings of most words can best
be understood on the basis of a semantic frame: a description of a type of
event, relation, or entity and the participants in it. For example, the concept
of cooking typically involves a person doing the cooking (Cook), the food
that is to be cooked (Food), something to hold the food while cooking
(Container) and a source of heat (Heating_instrument). In the FrameNet
project, this is represented as a frame called Apply_heat, and the Cook,
Food, Heating_instrument and Container are called frame elements (FEs).
Words that evoke this frame, such as fry, bake, boil, and broil, are called
lexical units (LUs) of the Apply_heat frame. The job of FrameNet is to
define the frames and to annotate sentences to show how the FEs fit
syntactically around the word that evokes the frame.
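The Apply_heat example can be written as a small data structure (a sketch of how a frame, its FEs, and its LUs relate to each other; this is not FrameNet's actual file format):

```python
# The Apply_heat frame from the text, as a plain dictionary.
APPLY_HEAT = {
    "frame": "Apply_heat",
    "frame_elements": ["Cook", "Food", "Heating_instrument", "Container"],
    "lexical_units": ["fry", "bake", "boil", "broil"],
}

def evokes(word, frame):
    # A word evokes a frame if it is one of the frame's lexical units.
    return word in frame["lexical_units"]

print(evokes("fry", APPLY_HEAT))   # True
print(evokes("read", APPLY_HEAT))  # False
```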
o Stemmer:
Stemming is the process of reducing a word to its word stem, the part
that remains after removing affixes (suffixes and prefixes), or to the
root form of the word, known as a lemma. Stemming is important in
natural language understanding (NLU) and natural language processing
(NLP).
Stemming is part of linguistic studies in morphology, and part of
information retrieval and extraction in artificial intelligence (AI).
Stemming helps extract meaningful information from vast sources like big
data or the Internet, since additional forms of a word related to a
subject may need to be searched to get the best results. Stemming is
also used in queries and Internet search engines.
o Perplexity:
It is a metric used to judge how good a language model is.
We can define perplexity as the inverse probability of the test set,
normalized by the number of words:
PP(W) = P(w1 w2 ... wN)^(-1/N)
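Perplexity can be computed directly from this definition; the unigram probabilities below are made up for illustration:

```python
import math

# Toy unigram model; probabilities are illustrative, not learned.
UNIGRAM_P = {"the": 0.5, "cat": 0.25, "sat": 0.25}

def perplexity(tokens, p):
    # PP(W) = P(w1 ... wN)^(-1/N), computed in log space for
    # numerical stability.
    log_prob = sum(math.log(p[t]) for t in tokens)
    return math.exp(-log_prob / len(tokens))

print(perplexity(["the", "cat", "sat"], UNIGRAM_P))
# -> (2 * 4 * 4) ** (1/3), approximately 3.1748
```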
o Term Frequency:
D1: The cat sat on the mat. The cat was white. (|D1| = 5)
D2: The brown coloured dog was barking loudly at the cat. (|D2| = 6)
D3: The mat was green in colour. (|D3| = 3)
D4: The dog pulled the mat with his teeth. The cat still sat on the mat. (|D4| = 7)
V = {bark, brown, cat, colour, dog, green, loud, mat, pull, sit, teeth, white}, |V| = 12.
Normalized tf (counts in vocabulary order, divided by document length):
d1 = (1/5)(0, 0, 2, 0, 0, 0, 0, 1, 0, 1, 0, 1)
d2 = (1/6)(1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0)
d3 = (1/3)(0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0)
d4 = (1/7)(0, 0, 1, 0, 1, 0, 0, 2, 1, 1, 1, 0)
o Inverse document frequency
idf(w) = ln(N / df(w)), where N = 4 is the number of documents and
df(w) is the number of documents containing w. The idf values for each
word are:
bark = ln(4/1)
brown = ln(4/1)
cat = ln(4/3)
colour = ln(4/2)
dog = ln(4/2)
green = ln(4/1)
loud = ln(4/1)
mat = ln(4/3)
pull = ln(4/1)
sit = ln(4/2)
teeth = ln(4/1)
white = ln(4/1)
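These tf and idf values can be reproduced in a few lines; the document token lists below are the content words after stop-word removal and stemming, as in the worked example:

```python
import math

# Documents as stemmed content-word lists, matching the example above.
DOCS = {
    "d1": ["cat", "sit", "mat", "cat", "white"],
    "d2": ["brown", "colour", "dog", "bark", "loud", "cat"],
    "d3": ["mat", "green", "colour"],
    "d4": ["dog", "pull", "mat", "teeth", "cat", "sit", "mat"],
}

def tf(term, doc):
    # Normalized term frequency: count / document length.
    return doc.count(term) / len(doc)

def idf(term, corpus):
    # ln(N / df), where df is the number of docs containing the term.
    df = sum(1 for doc in corpus.values() if term in doc)
    return math.log(len(corpus) / df)

print(tf("cat", DOCS["d1"]))   # 2/5 = 0.4
print(idf("cat", DOCS))        # ln(4/3)
```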
Write a short note on Word Sense Disambiguation and also explain Lesk Algorithm.
o Word Sense Disambiguation (WSD) is the task of identifying which sense of a
word is being used in a sentence. Let's disambiguate the word bank. It can have
two meanings: the institution where we deposit our money, and the edge of a river.
o The Lesk algorithm assumes we have some sense-labelled data.
o Take all the sentences containing the relevant word in each sense.
o Add to these the gloss and examples of each sense; together they form the
"signature" of the sense.
o Choose the sense with the most overlap between the context and the signature.
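A simplified Lesk sketch for the bank example (the glosses are invented for illustration; the real algorithm takes glosses and examples from a dictionary such as WordNet):

```python
# Simplified Lesk: pick the sense whose gloss overlaps most with the
# context words.  Glosses and stop-word list are illustrative only.
STOP = {"a", "an", "the", "of", "at", "our", "we", "i", "that", "you", "where"}

SENSES = {
    "bank_finance": "a financial institution where you deposit money",
    "bank_river": "the sloping land alongside the edge of a river",
}

def simplified_lesk(sentence, senses):
    context = set(sentence.lower().split()) - STOP

    def overlap(sense):
        signature = set(senses[sense].split()) - STOP
        return len(context & signature)

    return max(senses, key=overlap)

print(simplified_lesk("we deposit our money at the bank", SENSES))
# -> bank_finance (overlapping words: "deposit", "money")
```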