NLP
Introduction
• Let’s predict what word is likely to follow:
• Please turn your homework ….
• Most likely in, possibly over, but surely not the
• It would be very helpful to assign a probability to each possible next
word
• Models that assign probabilities to upcoming words, or sequences of
words in general, are called language models or LMs.
• The LLMs that revolutionized modern NLP are trained just by predicting
words
Introduction
• Language models can also assign a probability to an entire sentence
• For example, they can predict that the following sequence has a much
higher probability of appearing in a text:
• all of a sudden I notice three guys standing on the sidewalk
• than does this same set of words in a different order:
• on guys all I of notice sidewalk three a sudden standing the
N-gram
• An n-gram is a sequence of n words:
• A 2-gram or bigram is a two-word sequence of words like “please turn”, “turn
your”, or “your homework”
• A 3-gram or trigram is a three-word sequence of words like “please turn your”
or “turn your homework”
• We use the term n-gram to mean a probabilistic model that can
estimate the probability of a word given the n−1 previous words, and
thereby also assign probabilities to entire sequences.
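• A minimal Python sketch of extracting bigrams and trigrams from a tokenized sentence (the ngrams helper and the example sentence are illustrative, not part of the assignment code):

  def ngrams(tokens, n):
      # Return the list of n-grams (as tuples) in a token sequence.
      return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

  tokens = "please turn your homework".split()
  print(ngrams(tokens, 2))  # [('please', 'turn'), ('turn', 'your'), ('your', 'homework')]
  print(ngrams(tokens, 3))  # [('please', 'turn', 'your'), ('turn', 'your', 'homework')]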
N-Grams
• Compute the probability of a word w given some history h: P(w|h)
• Suppose h is “its water is so transparent that”
• Suppose the word w is “the”
• We want to know P(the | its water is so transparent that)
• How?
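• One concrete answer, sketched here as the relative-frequency estimate that the next slides build on (the counts C(·) are taken over some large corpus):

  P(\text{the} \mid \text{its water is so transparent that})
    \approx \frac{C(\text{its water is so transparent that the})}
                 {C(\text{its water is so transparent that})}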
Bigram model
• Approximates the probability of a word given all the previous words
P(wn|w1:n−1) by using only the conditional probability of the preceding
word P(wn|wn−1)
• Instead of computing the full conditional probability:

  P(w_n \mid w_{1:n-1})

• we approximate it with:

  P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1})
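• A minimal Python sketch of this MLE bigram estimate and of scoring a sentence with it (the tiny corpus, the function names, and the <s>/</s> boundary markers are illustrative choices):

  from collections import Counter

  def train_bigram(sentences):
      # MLE bigram model: P(w_i | w_{i-1}) = c(w_{i-1}, w_i) / c(w_{i-1})
      unigram_counts, bigram_counts = Counter(), Counter()
      for tokens in sentences:
          tokens = ["<s>"] + tokens + ["</s>"]
          unigram_counts.update(tokens[:-1])             # count each word as a context
          bigram_counts.update(zip(tokens, tokens[1:]))  # count adjacent word pairs
      return {bg: c / unigram_counts[bg[0]] for bg, c in bigram_counts.items()}

  def sentence_prob(bigram_probs, tokens):
      # P(sentence) as the product of its bigram probabilities (0 if any bigram is unseen)
      prob = 1.0
      tokens = ["<s>"] + tokens + ["</s>"]
      for bg in zip(tokens, tokens[1:]):
          prob *= bigram_probs.get(bg, 0.0)
      return prob

  corpus = [s.split() for s in ["please turn your homework", "please turn it in"]]
  bigram_probs = train_bigram(corpus)
  print(sentence_prob(bigram_probs, "please turn your homework".split()))  # 0.5 on this tiny corpus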
N-Grams Model
• We can extend to trigrams, 4-grams, 5-grams
• In general, this is an insufficient model of language
• because language has long-distance dependencies:
“The computer which I had just put into the machine room on the fifth floor
crashed.”
Bigram model
• This assumption is called the Markov assumption
• Markov models are the class of probabilistic models that assume we
can predict the probability of some future unit without looking too far
into the past.
• We can generalize from the bigram (which looks one word into the past) to the
trigram (two words into the past) and thus to the n-gram (N−1 words into the past):

  P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1})
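• Following the same counting idea, a sketch of how the bigram trainer above might be generalized to an arbitrary order n (the train_ngram helper is hypothetical):

  from collections import Counter

  def train_ngram(sentences, n):
      # MLE n-gram model: P(w | previous n-1 words) = C(context, w) / C(context)
      context_counts, ngram_counts = Counter(), Counter()
      for tokens in sentences:
          tokens = ["<s>"] * (n - 1) + tokens + ["</s>"]
          for i in range(n - 1, len(tokens)):
              context = tuple(tokens[i - n + 1:i])
              context_counts[context] += 1
              ngram_counts[(context, tokens[i])] += 1
      # probability of each word given its (n-1)-word context
      return {key: c / context_counts[key[0]] for key, c in ngram_counts.items()}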
Example
• Consider a mini corpus of three sentences:
• Calculate:
Example
Berkeley Restaurant Project sentences
• can you tell me about any good cantonese restaurants close by
• mid priced thai food is what i’m looking for
• tell me about chez panisse
• can you give me a listing of the kinds of food that are available
• i’m looking for a good place to eat breakfast
• when is caffe venezia open during the day
• Result:
Perplexity
• The perplexity (sometimes abbreviated as PP or PPL) of a language
model on a test set is the inverse probability of the test set
• one over the probability of the test set, normalized by the number of words.
• For a test set W = w1w2 ...wN:

  \mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
Perplexity
• The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence), i.e., the lowest perplexity:

  \mathrm{PP}(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
               = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}

• Chain rule:

  \mathrm{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}

• For bigrams:

  \mathrm{PP}(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}
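• A sketch of computing this bigram perplexity in log space (it assumes a dictionary of bigram probabilities like the one produced by the earlier bigram sketch, and that no test bigram has zero probability):

  import math

  def perplexity(bigram_probs, test_sentences):
      # PP(W) = P(w_1 ... w_N)^(-1/N), computed in log space to avoid underflow
      log_prob, n_words = 0.0, 0
      for tokens in test_sentences:
          tokens = ["<s>"] + tokens + ["</s>"]
          for bg in zip(tokens, tokens[1:]):
              log_prob += math.log(bigram_probs[bg])  # assumes a smoothed (non-zero) probability
              n_words += 1
      return math.exp(-log_prob / n_words)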
Intuition of Perplexity
• The Shannon Game: how well can we predict the next word?
• Example candidate continuations with their probabilities: mushrooms 0.1,
pepperoni 0.1, anchovies 0.01
Zeros
• Training set:
  … denied the allegations
  … denied the reports
  … denied the claims
  … denied the request
• Test set:
  … denied the offer
  … denied the loan
• P(“offer” | denied the) = 0 under the MLE, because “offer” never follows
“denied the” in the training set
[Figure: counts of words observed after “denied the” in the training set (reports 2,
claims 1, request 1, …, 7 total) and the same distribution after discounting
(claims 0.5, request 0.5, other 2, …, 7 total), illustrating how probability mass is
shifted from seen continuations to unseen ones.]
Add-one estimation
• Also called Laplace smoothing
• Pretend we saw each word one more time than we did
• Just add one to all the counts!
• MLE estimate:

  P_{\mathrm{MLE}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}

• Add-1 estimate:

  P_{\mathrm{Add\text{-}1}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + 1}{c(w_{i-1}) + V}
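• A sketch of the add-1 estimate as code (the vocabulary handling is a simplifying assumption: V here is the word types seen in training plus the </s> marker, and out-of-vocabulary handling is ignored):

  from collections import Counter

  def add_one_bigram(sentences):
      # Add-1 (Laplace) smoothing: P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + V)
      unigram_counts, bigram_counts, vocab = Counter(), Counter(), set()
      for tokens in sentences:
          vocab.update(tokens)
          tokens = ["<s>"] + tokens + ["</s>"]
          unigram_counts.update(tokens[:-1])
          bigram_counts.update(zip(tokens, tokens[1:]))
      V = len(vocab) + 1  # word types seen in training, plus the </s> marker
      def prob(prev, word):
          return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)
      return prob

  corpus = [s.split() for s in ["denied the allegations", "denied the reports"]]
  p = add_one_bigram(corpus)
  print(p("the", "offer"))        # unseen bigram, but now gets a small non-zero probability
  print(p("the", "allegations"))  # seen bigram, slightly discounted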
Laplace-smoothed bigrams
Reconstituted counts
Good-Turing Discounting
• re-estimate the amount of probability mass to assign to N-grams
with zero counts by looking at the number of N-grams that
occurred one time.
• A word or N-gram (or any event) that occurs once is called a
singleton, or a hapax legomenon.
• The Good-Turing intuition is to use the frequency of singletons as
a re-estimate of the frequency of zero-count bigrams.
Good-Turing Discounting
• The Good-Turing algorithm is based on computing Nc, the number of
N-grams that occur c times.
• We refer to the number of N-grams that occur c times as the
frequency of frequency c.
• So applying the idea to smoothing the joint probability of bigrams,
N0 is the number of bigrams with count 0, N1 the number of bigrams
with count 1 (singletons), and so on.
Good-Turing Discounting
• We can think of each of the Nc as a bin which stores the number of different N-
grams that occur in the training set with that frequency c:
• to re-estimate the smoothed count c for N0, we use the following equation for
the probability P*GT for things that had zero count N0, or what we might call the
missing mass:
• P*GT (things with frequency zero in training) = N1/N
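• A sketch of the two Good-Turing quantities discussed above: the missing mass N1/N for zero-count bigrams (split evenly here over the unseen types), and the re-estimated counts c* = (c+1) N_{c+1}/N_c for seen ones. The function names are illustrative, and a real implementation would smooth the N_c values (e.g., Simple Good-Turing) rather than skip counts whose N_{c+1} is zero:

  from collections import Counter

  def gt_missing_mass(bigram_counts, num_unseen_bigrams):
      # P*_GT(all zero-count bigrams) = N1 / N, shared evenly over the unseen types
      N = sum(bigram_counts.values())                        # total bigram tokens
      N1 = sum(1 for c in bigram_counts.values() if c == 1)  # number of singletons
      total = N1 / N
      return total, (total / num_unseen_bigrams if num_unseen_bigrams else 0.0)

  def gt_reestimated_counts(counts):
      # c* = (c + 1) * N_{c+1} / N_c; counts whose N_{c+1} is zero are skipped in this sketch
      Nc = Counter(counts.values())  # frequency of frequencies: Nc[c] = number of types seen c times
      return {item: (c + 1) * Nc[c + 1] / Nc[c]
              for item, c in counts.items() if Nc[c + 1] > 0}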
Interpolation
• If we are trying to compute P(wn|wn−1wn−2), but we have no examples of a
particular trigram wn−2wn−1wn, we can instead estimate its probability by using
the bigram probability P(wn|wn−1).
• Similarly, if we don’t have counts to compute P(wn|wn−1), we can look to the
unigram P(wn)
• There are two ways to use this N-gram “hierarchy”, backoff and interpolation.
• In backoff, if we have non-zero trigram counts, we rely solely on the trigram
counts. We only “back off” to a lower order N-gram if we have zero evidence
for a higher-order N-gram.
• In interpolation, we always mix the probability estimates from all the N-gram
estimators, i.e., we do a weighted interpolation of trigram, bigram, and
unigram counts.
Linear Interpolation
• Simple interpolation mixes the three estimates with weights λ:

  \hat{P}(w_n \mid w_{n-2} w_{n-1}) = \lambda_1 P(w_n) + \lambda_2 P(w_n \mid w_{n-1})
      + \lambda_3 P(w_n \mid w_{n-2} w_{n-1}), \qquad \sum_i \lambda_i = 1
• The λ weights are tuned on held-out data:

  Training Data | Held-Out Data | Test Data
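• A sketch of simple linear interpolation over unigram, bigram, and trigram estimates (the probability dictionaries and the λ values are illustrative; in practice the λs are tuned on the held-out data shown above):

  def interpolated_prob(unigram_p, bigram_p, trigram_p, w, context,
                        lambdas=(0.1, 0.3, 0.6)):
      # Linear interpolation: P_hat(w | w1 w2) = l1*P(w) + l2*P(w | w2) + l3*P(w | w1 w2)
      # The lambdas must sum to 1.
      l1, l2, l3 = lambdas
      w1, w2 = context  # the two preceding words
      return (l1 * unigram_p.get(w, 0.0)
              + l2 * bigram_p.get((w2, w), 0.0)
              + l3 * trigram_p.get((w1, w2, w), 0.0))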
H.W 3
• Write a program to compute unsmoothed unigrams, bigrams, and trigrams.
• Run your N-gram program on two different small corpora (use the links below). Now
compare the statistics of the two corpora. What are the differences in the most common
unigrams between the two? How about interesting differences in bigrams and trigrams?
• http://ar.wikipedia.org/wiki/%D8%A5%D9%86%D8%AA%D8%B1%D9%86%D8%AA
• http://arz.wikipedia.org/wiki/%D8%A7%D9%86%D8%AA%D8%B1%D9%86%D8%AA
(Bonus)
• Add an option to your program to generate random sentences.
• Add an option to your program to do Good-Turing discounting.