Module 3 - NLP
Examples of such data include news websites, blogs, online bookshelves, product reviews, tweets, etc.
● Customer Support
Customers often use social media to express their opinions about, and experiences with, products or services. Text classification is often used to identify the tweets that brands must respond to and those that don't require a response.
● E-commerce
Customers leave reviews for a range of products on e-commerce websites like Amazon, eBay, etc. Understanding and analyzing customers' perception of a product or service based on their comments is commonly known as "sentiment analysis." It's used extensively by brands across the globe to better understand customers. Over time, sentiment analysis has evolved beyond categorizing customer feedback as simply positive, negative, or neutral into a more sophisticated paradigm: "aspect"-based sentiment analysis.
● Other Applications
1. Collect or create a labeled dataset suitable for the task.
2. Split the dataset into two (training and test) or three parts: training, validation (i.e., development), and test sets, then decide on evaluation metric(s).
3. Transform the raw text into feature vectors.
4. Train a classifier using the feature vectors and the corresponding labels from the training set.
5. Using the evaluation metric(s) from Step 2, benchmark the model performance on the test set.
6. Deploy the model to serve the real-world use case and monitor its performance.
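As a concrete illustration, here is a minimal sketch of steps 1-5 using scikit-learn. The toy dataset, the 50/50 split, and the choice of accuracy as the metric are illustrative assumptions, not prescriptions from these notes:

```python
# A minimal text classification pipeline sketched with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Step 1: a (tiny, illustrative) labeled dataset.
texts = ["loved this movie", "terrible acting", "great plot", "boring and slow"]
labels = ["pos", "neg", "pos", "neg"]

# Step 2: split into training and test sets; metric = accuracy.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0
)

# Step 3: transform raw text into feature vectors (bag of words).
vectorizer = CountVectorizer()
train_vecs = vectorizer.fit_transform(X_train)
test_vecs = vectorizer.transform(X_test)

# Step 4: train a classifier on the training vectors and labels.
clf = MultinomialNB()
clf.fit(train_vecs, y_train)

# Step 5: benchmark on the test set with the chosen metric.
print("accuracy:", accuracy_score(y_test, clf.predict(test_vecs)))

# Step 6 (deployment and monitoring) happens outside this script.
```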
Out of all classes c ∈ C, the classifier returns the class ĉ which has the maximum posterior probability given the document d:

ĉ = argmax_{c ∈ C} P(c|d)

The intuition of Bayesian classification is to use Bayes' rule to transform the above equation into:

ĉ = argmax_{c ∈ C} P(d|c) P(c) / P(d)

Why can we drop the denominator P(d)? Because we would be computing P(d) for each possible class, and it does not change across classes; we are always asking about the most likely class for the same document d. So:

ĉ = argmax_{c ∈ C} P(d|c) P(c)
We call naive Bayes a generative model because we can read the above equation as stating a kind of implicit assumption about how a document is generated: first a class is sampled from P(c), and then the words are generated by sampling from P(d|c). (In fact, we could imagine generating artificial documents, or at least their word counts, by following this process.)
● The first is the bag-of-words assumption: we assume position doesn't matter, and that the word "love" has the same effect on classification whether it occurs as the 1st, 20th, or last word in the document. Thus we assume that the features f1, f2, ..., fn only encode word identity and not position.
● The second is commonly called the naive Bayes assumption: this is the conditional independence assumption that the probabilities P(fi|c) are independent given the class c and hence can be 'naively' multiplied as follows:

P(f1, f2, ..., fn | c) = P(f1|c) · P(f2|c) · ... · P(fn|c)
As with language modeling, calculations are done in log space to avoid underflow and to increase speed.
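Concretely, the naive Bayes decision in log space becomes c_NB = argmax_{c ∈ C} [ log P(c) + Σ_i log P(w_i|c) ]. A small illustration of why this matters, with made-up probabilities: the raw product of many small likelihoods underflows to 0.0, while the sum of their logs stays representable:

```python
import math

# 1,000 word likelihoods of 0.001 each: the raw product underflows to 0.0 ...
probs = [0.001] * 1000
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 (floating-point underflow)

# ... but the equivalent sum of logs is perfectly representable.
log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -6907.76
```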
Classifiers that use a linear combination of the inputs to make a classification decision, like naive Bayes and logistic regression, are called linear classifiers.
To learn the probability P(fi|c), we'll assume a feature is just the existence of a word in the document's bag of words, and so we'll want P(wi|c), which we compute as the fraction of times the word wi appears among all words in all documents of topic c. We first concatenate all documents with category c into one big "category c" text. Then we use the frequency of wi in this concatenated document to give a maximum likelihood estimate of the probability:

P̂(wi|c) = count(wi, c) / Σ_{w ∈ V} count(w, c)

Here the vocabulary V consists of the union of all the word types in all classes, not just the words in one class c. There is a problem, however, with maximum likelihood training. Imagine we are trying to estimate the likelihood of the word "fantastic" given class positive, but suppose there are no training documents that both contain the word "fantastic" and are classified as positive. Perhaps the word "fantastic" happens to occur (sarcastically?) in the class negative. In such a case the probability for this feature will be zero:

P̂("fantastic"|positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0

The simplest solution is add-one (Laplace) smoothing:

P̂(wi|c) = (count(wi, c) + 1) / (Σ_{w ∈ V} count(w, c) + |V|)
Please note:
● The vocabulary V consists of the union of all the word types in all classes, not just the words in one class c.
● Remove the unknown words (words that appear in the test data but not in the training vocabulary) from the test documents.
● Sometimes we remove the stop words and sometimes we don't. [Removing the stop words doesn't bring any change in performance, so usually stop words are included.]
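As a sanity check, here is a minimal sketch of the smoothed estimate on a hypothetical two-document corpus (the documents, class names, and counts are illustrative assumptions):

```python
from collections import Counter

# Toy training data: one "positive" and one "negative" document (illustrative).
docs = {
    "positive": "fantastic movie fantastic plot".split(),
    "negative": "boring movie".split(),
}

# Vocabulary V: the union of word types across ALL classes.
vocab = set(w for words in docs.values() for w in words)

def p_word_given_class(word, c):
    """Add-one smoothed estimate: (count(w,c) + 1) / (total words in c + |V|)."""
    counts = Counter(docs[c])
    return (counts[word] + 1) / (len(docs[c]) + len(vocab))

# Without smoothing, P("fantastic"|negative) would be 0; now it is small but nonzero.
print(p_word_given_class("fantastic", "positive"))  # (2+1)/(4+4) = 0.375
print(p_word_given_class("fantastic", "negative"))  # (0+1)/(2+4) ≈ 0.167
```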
A second important addition commonly made when doing text classification for sentiment is to deal with negation. Consider the difference between I really like this movie (positive) and I didn't like this movie (negative). The negation expressed by didn't completely alters the inferences we draw from the predicate like. Similarly, negation can modify a negative word to produce a positive review (don't dismiss this film, doesn't let us get bored). A very simple baseline that is commonly used in sentiment analysis to deal with negation is the following: during text normalization, prepend the prefix NOT_ to every word after a token of logical negation (n't, not, no, never) until the next punctuation mark. Thus the phrase didn't like this movie, but I becomes didn't NOT_like NOT_this NOT_movie, but I. Newly formed 'words' like NOT_like and NOT_recommend will thus occur more often in negative documents and act as cues for negative sentiment, while words like NOT_bored and NOT_dismiss will acquire positive associations (see the sketch below).
In some situations we might have insufficient labeled training data to train accurate naive Bayes classifiers using all words in the training set to estimate positive and negative sentiment. In such cases we can instead derive the positive and negative word features from sentiment lexicons: lists of words that are pre-annotated with positive or negative sentiment.
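A rough sketch of the negation baseline described above, assuming whitespace-tokenized input and a small, illustrative set of negation cues:

```python
import re

NEGATIONS = {"not", "no", "never"}  # plus any token ending in "n't"
PUNCT = re.compile(r"^[.,!?;:]$")

def mark_negation(tokens):
    """Prepend NOT_ to every word after a negation cue, until the next punctuation mark."""
    out, negating = [], False
    for tok in tokens:
        if PUNCT.match(tok):
            negating = False  # punctuation ends the negation scope
            out.append(tok)
        elif negating:
            out.append("NOT_" + tok)
        else:
            out.append(tok)
        if tok.lower() in NEGATIONS or tok.lower().endswith("n't"):
            negating = True  # start negating from the next token onward
    return out

print(mark_negation("didn't like this movie , but I".split()))
# ["didn't", 'NOT_like', 'NOT_this', 'NOT_movie', ',', 'but', 'I']
```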
Unlike naive Bayes, which estimates probabilities based on feature occurrence in classes, logistic regression "learns" the weights for individual features based on how important they are to the classification decision. The goal of logistic regression is to learn a linear separator between classes in the training data, with the aim of maximizing the probability of the data. This "learning" of feature weights and of the probability distribution over all classes is done through a function called the "logistic" function, hence the name logistic regression.
The most important difference between naive Bayes and logistic regression is that logistic
regression is a discriminative classifier while naive Bayes is a generative classifier.
A generative model would have the goal of understanding what dogs look like and what cats
look like. You might literally ask such a model to ‘generate’, i.e., draw, a dog. Given a test
image, the system then asks whether it’s the cat model or the dog model that better fits (is less
surprised by) the image, and chooses that as its label. A discriminative model, by contrast, is
only trying to learn to distinguish the classes (perhaps without learning much about them). So
maybe all the dogs in the training data are wearing collars and the cats aren’t. If that one feature
neatly separates the classes, the model is satisfied. If you ask such a model what it knows
about cats all it can say is that they don’t wear collars.
Consider a single input observation x, which we will represent by a vector of features [x1, x2, ..., xn] (we'll show sample features in the next subsection). The classifier output y can be 1 (meaning the observation is a member of the class) or 0 (the observation is not a member of the class). We want to know the probability P(y = 1|x) that this observation is a member of the class. So perhaps the decision is "positive sentiment" versus "negative sentiment", the features represent counts of words in a document, P(y = 1|x) is the probability that the document has positive sentiment, and P(y = 0|x) is the probability that the document has negative sentiment.
Logistic regression solves this task by learning, from a training set, a vector of weights and a
bias term. Each weight wi is a real number, and is associated with one of the input features xi.
The weight wi represents how important that input feature is to the classification decision, and
can be positive (providing evidence that the instance being classified belongs in the positive
class) or negative (providing evidence that the instance being classified belongs in the negative
class). Thus we might expect in a sentiment task the word awesome to have a high positive
weight, and abysmal to have a very negative weight. The bias term b, also called the intercept, is another real number that's added to the weighted inputs. To make a decision on a test instance, the classifier multiplies each feature xi by its weight wi, sums the weighted features, and adds the bias term:

z = (Σ_i wi xi) + b = w · x + b

To create a probability, we'll pass z through the sigmoid function, σ(z). The sigmoid function (named because it looks like an s) is also called the logistic function, and gives logistic regression its name:

σ(z) = 1 / (1 + e^(−z))
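Putting the pieces together in a minimal sketch; the weights, bias, and word counts below are made up for illustration:

```python
import math

def sigmoid(z):
    """The logistic (sigmoid) function: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative weights: positive weight for "awesome", negative for "abysmal".
weights = {"awesome": 2.0, "abysmal": -3.0}
bias = 0.1

def p_positive(word_counts):
    """P(y=1|x) = sigmoid(w . x + b), where x holds word-count features."""
    z = bias + sum(weights.get(w, 0.0) * n for w, n in word_counts.items())
    return sigmoid(z)

print(p_positive({"awesome": 2}))  # sigmoid(4.1)  ≈ 0.984 -> likely positive
print(p_positive({"abysmal": 1}))  # sigmoid(-2.9) ≈ 0.052 -> likely negative
```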
Information extraction
Information extraction (IE) is a technique or a task of extracting structured information from
unstructured text. It transforms raw text (e.g., articles, emails, social media posts) into
organized data (e.g., databases, tables, or knowledge graphs) that machines can
understand and use for downstream tasks.
Applications of IE
IE tasks
The overarching goal of IE is to extract “knowledge” from text, and each of these tasks provides
different information to do that.
Identifying that the article is about “buyback” or “stock price” relates to the IE task of keyword or
keyphrase extraction (KPE).
Identifying Apple as an organization and Luca Maestri as a person comes under the IE task of
named entity recognition (NER).
Recognizing that Apple is not a fruit, but a company, and that it refers to Apple, Inc. and not
some other company with the word “apple” in its name is the IE task of named entity
disambiguation and linking.
Extracting the information that Luca Maestri is the finance chief of Apple refers to the IE task of
relation extraction.
Advanced IE tasks:
Identifying that this article is about a single event (let's call it "Apple buys back stocks") and being able to link it to other articles talking about the same event over time refers to the IE task of event extraction.
Temporal information extraction aims to extract information about times and dates; it is also useful for developing calendar applications and interactive personal assistants.
IE pipeline
Supervised learning approaches require corpora with texts and their respective keyphrases and
use engineered features or DL techniques. Creating such labeled datasets for KPE is a time-
and cost-intensive endeavor. Hence, unsupervised approaches that do not require a labeled
dataset and are largely domain agnostic are more popular for KPE. These approaches are also
more commonly used in real-world KPE applications.
All the popular unsupervised KPE algorithms are based on the idea of representing the words
and phrases in a text as nodes in a weighted graph where the weight indicates the importance
of that keyphrase. Keyphrases are then identified based on how connected they are with the
rest of the graph. The top-N important nodes from the graph are then returned as keyphrases.
Important nodes are those words and phrases that are frequent enough and also well
connected to different parts of the text. The different graph-based KPE approaches differ in the
way they select potential words/phrases from the text (from a large set of possible words and
phrases in the entire text) and the way these words/phrases are scored in the graph.
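A rough sketch of this graph-based idea using networkx (assumed installed); the whitespace tokenization and fixed co-occurrence window used here are simplifying assumptions, not a full TextRank implementation:

```python
import networkx as nx

def keyphrases(tokens, window=2, top_n=3):
    """Score words by PageRank over a co-occurrence graph; return the top-N."""
    graph = nx.Graph()
    # Connect every pair of words that co-occur within the window.
    for i, w in enumerate(tokens):
        for other in tokens[i + 1 : i + 1 + window]:
            if w != other:
                graph.add_edge(w, other)
    scores = nx.pagerank(graph)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

text = "apple buys back stock apple stock price rises after buy back"
print(keyphrases(text.split()))
```

Real implementations additionally filter candidate words by POS tag and merge adjacent high-scoring words into multi-word phrases.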
● The process of extracting potential n-grams and building the graph with them is sensitive to document length, which can be an issue; one workaround is to use only the first M% and the last N% of the text.
● Since each keyphrase is independently ranked, we sometimes end up seeing overlapping keyphrases (e.g., "buy back stock" and "buy back"). One solution for this could be to use some similarity measure (e.g., cosine similarity) between the top-ranked keyphrases and choose the ones that are most dissimilar to one another (see the sketch after this list).
● Remove unwanted word patterns directly, such as phrases starting with a preposition.
● Improper text extraction (e.g., from PDFs or scanned documents) can affect the rest of the KPE process, so add some post-processing to the extracted keyphrase list to create a final, meaningful list without noise.
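A sketch of the overlap-based filtering mentioned in the second bullet; token-level Jaccard overlap stands in here for the similarity measure (cosine similarity over phrase embeddings would be a common alternative):

```python
def dedupe(ranked_phrases, max_overlap=0.5):
    """Keep a ranked phrase only if it doesn't overlap too much with one already kept."""
    kept = []
    for phrase in ranked_phrases:
        words = set(phrase.split())
        too_similar = any(
            len(words & set(k.split())) / len(words | set(k.split())) > max_overlap
            for k in kept
        )
        if not too_similar:
            kept.append(phrase)
    return kept

print(dedupe(["buy back stock", "buy back", "stock price"]))
# ['buy back stock', 'stock price']  ("buy back" overlaps 2/3 with the first phrase)
```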
An approach that goes beyond a lookup table is rule-based NER, which can be based on a compiled list of patterns over word tokens and POS tags. A more practical approach to NER is to train an ML model which can predict the named entities in unseen text. For each word, a decision has to be made whether or not that word is an entity, and if it is, what type of entity it is. The only difference from ordinary text classification is that NER is a "sequence labeling" problem.
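For example, an off-the-shelf pretrained NER model can be used through spaCy (assuming spaCy and its small English model are installed):

```python
import spacy

# Requires: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple's finance chief Luca Maestri announced a stock buyback.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected (model-dependent) output: Apple ORG, Luca Maestri PERSON
```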
Unlike part-of-speech tagging, where there is no segmentation problem since each word gets
one tag, the task of named entity recognition is to find and label spans of text, and is difficult
partly because of the ambiguity of segmentation. We need to decide what’s an entity and what
isn’t, and where the boundaries are. Indeed, most words in a text will not be named entities.
Another difficulty is caused by type ambiguity. The mention JFK can refer to a person, the
airport in New York, or any number of schools, bridges, and streets around the United States.
The tables below are for your information only: they are not part of the syllabus, but you should know this basic English terminology for your future work.
It is necessary for you to know these terms about the English language in order to understand the working of an NLP model. Please read them carefully.
Ambiguity in NER
Tagging is a disambiguation task; words are ambiguous (they have more than one possible part of speech), and the goal is to find the correct tag for the situation. For example, book can be a verb (book that flight) or a noun (hand me that book).
That can be a determiner (Does that flight serve dinner?) or a complementizer (I thought that your flight was earlier). The goal of POS tagging is to resolve these ambiguities, choosing the proper tag for the context.
England (Organization) won the 2019 World Cup vs. The 2019 World Cup happened in England (Location).
Washington (Location) is the capital of the US vs. The first president of the US was Washington (Person).
There are many types of ambiguity faced by NER systems; some are:
★ Word Sense Disambiguation: Many words in a text can have multiple meanings or
senses. For example, the word "Apple" can refer to a fruit or the technology company.
Determining the correct sense of a word in context is essential for accurate NER.
★ Proper Noun Variations: Proper nouns can have variations in their spellings or forms,
such as abbreviations, misspellings, or alternative names. This variability adds ambiguity
to the NER process. For example, "New York" can be referred to as "NY," "N.Y.," or "Big
Apple."
★ Contextual Ambiguity: The same word can have different named entity categories based
on the context. For instance, the word "Java" can refer to the programming language or
the Indonesian island, and its context determines the correct entity type.
★ Homographs: Homographs are words that are spelled the same but have different
meanings. For example, "lead" can refer to a metal or the act of guiding. Determining the
correct named entity type requires considering the context.
For example, if the previous word was a person name, there's a higher probability that the current word is also part of a person name if it's a noun (e.g., first and last names). That is, the label of a word depends on its surrounding words and labels. A common use case for sequence labeling is POS tagging, where we need information about the parts of speech of the surrounding words to estimate the part of speech of the current word.
Until now, the predictions we saw were made independently of the surroundings. To perform sequence classification, we need data in a format that allows us to model the context.
The labels in such datasets follow what's known as BIO notation: B indicates the beginning of an entity; I (inside) indicates that an entity comprises more than one word; and O (other) indicates non-entities.
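For instance, the sentence from the earlier Apple example would be labeled like this (a hand-constructed illustration):

```python
# BIO labels for: "Luca Maestri is the finance chief of Apple"
tokens = ["Luca", "Maestri", "is", "the", "finance", "chief", "of", "Apple"]
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "O", "B-ORG"]

for tok, lab in zip(tokens, labels):
    print(f"{tok}\t{lab}")
```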
3. Train the classifier: train using sequence labeling algorithms like CRFs, HMMs, etc.
Evaluation of NER
You are expected to give the whole pipeline, in simple words. One or two pages are more than enough, but it should be written after understanding the concepts of IE and text classification from the notes above.
PREPARED BY,