14: N-Gram Language Models
Mausam
(Based on slides of Michael Collins, Dan Jurafsky, Dan Klein,
Chris Manning, Luke Zettlemoyer)
Outline
• Motivation
• Task Definition
• N-Gram Probability Estimation
• Evaluation
• Hints on Smoothing for N-Gram Models
• Simple
• Interpolation and Back-off
• Advanced Algorithms
The Language Modeling Problem
Setup: Assume a (finite) vocabulary of words
[Noisy-channel diagrams: a source model P(w) generates words w, a channel model P(a|w) produces the observed acoustics a, and the decoder searches for the best w given a. The same picture holds for MT, with a source model P(e) over English sentences and a channel model P(f|e) for the observed foreign sentence f. The language model supplies P(w), respectively P(e).]
• Simplifying assumption (Andrei Markov): P(the | its water is so transparent that) ≈ P(the | that)
• Or maybe: P(the | its water is so transparent that) ≈ P(the | transparent that)
Markov Assumption
$P(w_1 w_2 \ldots w_n) \approx \prod_i P(w_i \mid w_{i-k} \ldots w_{i-1})$
• Big problem with unigrams: P(the the the the) >> P(I like ice cream)!
Bigram Models
• Conditioned on previous single word
$P(w_i \mid w_1 w_2 \ldots w_{i-1}) \approx P(w_i \mid w_{i-1})$
• Generative process: pick <s>, pick a word conditioned on the previous one, repeat until we pick </s> (see the sketch after the examples below)
• PCFG LM (later):
• [This, quarter, ‘s, surprisingly, independent, attack, paid, off,
the, risk, involving, IRS, leaders, and, transportation, prices, .]
• [It, could, be, announced, sometime, .]
• [Mr., Toseland, believes, the, average, defense, economy, is,
drafted, from, slightly, more, than, 12, stocks, .]
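The generative process above can be made concrete with a small sketch (mine, not from the slides): estimate bigram probabilities by maximum likelihood from a toy corpus and sample a sentence from them. The corpus and function names are illustrative only.

```python
import random
from collections import defaultdict

def train_bigram(sentences):
    """Maximum-likelihood bigram model: P(w | prev) = c(prev, w) / c(prev)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

def generate(probs):
    """Pick <s>, pick a word conditioned on the previous one, repeat until </s>."""
    word, out = "<s>", []
    while True:
        nxt = probs[word]
        word = random.choices(list(nxt), weights=list(nxt.values()))[0]
        if word == "</s>":
            return out
        out.append(word)

toy_corpus = [["I", "like", "ice", "cream"], ["I", "like", "pizza"]]
print(generate(train_bigram(toy_corpus)))
```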
An example
• Result: [bigram count and probability tables omitted]
What kinds of knowledge?
• P(english | want) = .0011      (world knowledge)
• P(chinese | want) = .0065      (world knowledge)
• P(to | want) = .66
• P(eat | to) = .28
• P(food | to) = 0               (grammatical knowledge)
• P(want | spend) = 0            (grammatical knowledge)
• P(i | <s>) = .25
Practical Issues
• We do everything in log space
• Avoid underflow
• (also adding is faster than multiplying)
$\log(p_1 \times p_2 \times p_3 \times p_4) = \log p_1 + \log p_2 + \log p_3 + \log p_4$
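A tiny illustration of the point above (my own, assuming ordinary IEEE floats): multiplying many small probabilities underflows to zero, while summing their logs stays well within range.

```python
import math

probs = [1e-5] * 100                         # hypothetical per-word probabilities
product = math.prod(probs)                   # 1e-500 underflows to 0.0
log_sum = sum(math.log(p) for p in probs)    # about -1151.3, no underflow
print(product, log_sum)
```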
Google N-Gram Release
• serve as the incoming 92
• serve as the incubator 99
• serve as the independent 794
• serve as the index 223
• serve as the indication 72
• serve as the indicator 120
• serve as the indicators 45
• serve as the indispensable 111
• serve as the indispensible 40
• serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Outline
• Motivation
• Task Definition
• N-Gram Probability Estimation
• Evaluation
• Hints on Smoothing for N-Gram Models
• Simple
• Interpolation and Back-off
• Advanced Algorithms
Evaluation: How good is our model?
• Does our language model prefer good sentences to bad ones?
• Assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences
• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t seen.
• A test set is an unseen dataset that is different from our training set,
totally unused.
• An evaluation metric tells us how well our model does on the test set.
Extrinsic evaluation of N-gram models
• Best evaluation for comparing models A and B
• Put each model in a task
• spelling corrector, speech recognizer, MT system
• Run the task, get an accuracy for A and for B
• How many misspelled words corrected properly
• How many words translated correctly
• Compare accuracy for A and B
Difficulty of extrinsic (in-vivo) evaluation of N-gram models
• Extrinsic evaluation
• Time-consuming; requires building applications, new data
• So
• Sometimes use intrinsic evaluation: perplexity
• Bad approximation
• unless the test data looks just like the training data
• So generally only useful in pilot experiments
• But it is helpful to think about.
Intuition of Perplexity
• The Shannon Game: how well can we predict the next word?
  • I always order pizza with cheese and ____
  • The 33rd President of the US was ____
  • I saw a ____
  [Candidate continuations with guessed probabilities: mushrooms 0.1, pepperoni 0.1, anchovies 0.01, …, fried rice 0.0001, …, and 1e-100]
• Unigrams are terrible at this game. (Why?)
• A better model of a text is one which assigns a higher probability to the word that actually occurs
Perplexity
The best language model is one that best predicts an unseen test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by the number of words:

$PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}$

Chain rule: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \ldots w_{i-1})}}$

For bigrams: $PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}}$

• Lower is better!
• Example: a uniform distribution over V words has perplexity V
• Interpretation: effective vocabulary size (accounting for statistical regularities)
• Typical values for newspaper text:
  • Uniform: 20,000; Unigram: 1000s; Bigram: 700-1000; Trigram: 100-200
• Important note:
  • It's easy to get bogus perplexities by having bogus probabilities that sum to more than one over their event spaces. Be careful!
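As a sketch of how the bigram perplexity above is computed in practice (my own; `bigram_prob(prev, w)` and the pre-tokenized test sentences are assumed, hypothetical inputs):

```python
import math

def perplexity(test_sentences, bigram_prob):
    """PP(W) = exp(-(1/N) * sum_i log P(w_i | w_{i-1})), computed in log space.

    Assumes the model is smoothed, i.e. bigram_prob never returns 0.
    """
    log_prob, n_words = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for prev, w in zip(tokens, tokens[1:]):
            log_prob += math.log(bigram_prob(prev, w))
            n_words += 1
    return math.exp(-log_prob / n_words)
```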
Lower perplexity = better model
• [Table of example perplexities on held-out data omitted]
The intuition of smoothing
• Observed counts put all of P(w | denied the) on the seen words:
  3 allegations, 2 reports, 1 claims, 1 request (7 total)
• Steal probability mass to generalize better:
  2.5 allegations, 1.5 reports, 0.5 claims, 0.5 request, 2 other (7 total)
Add-one / Add-k estimation

$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + k}{c(w_{i-1}) + kV}$

Equivalently, with $m = kV$:

$P_{\text{Add-}k}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) + m \cdot \frac{1}{V}}{c(w_{i-1}) + m}$

Add-one (Laplace) smoothing is the special case $k = 1$.
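A minimal sketch of the add-k estimator above (my own; `counts` is assumed to be a nested dict of bigram counts and `V` the vocabulary size):

```python
def p_add_k(w, prev, counts, V, k=1.0):
    """P_add-k(w | prev) = (c(prev, w) + k) / (c(prev) + k * V); k = 1 is add-one."""
    c_bigram = counts.get(prev, {}).get(w, 0)
    c_prev = sum(counts.get(prev, {}).values())
    return (c_bigram + k) / (c_prev + k * V)
```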
What counts do we want?
Count c New count c*
0 .0000270
1 0.446
2 1.26
3 2.24
4 3.24
5 4.22
6 5.19
7 6.21
8 7.24
9 8.25
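The adjusted counts in this table appear to follow the Good-Turing idea, c* = (c + 1) N_{c+1} / N_c, where N_c is the number of n-gram types seen exactly c times. A rough sketch (mine, not from the slides; the c = 0 row additionally needs N_0, the number of unseen n-grams):

```python
from collections import Counter

def good_turing_adjusted(ngram_counts, max_c=9):
    """Return {c: c*} for observed counts 1..max_c."""
    n_c = Counter(ngram_counts.values())   # N_c = number of types seen exactly c times
    return {c: (c + 1) * n_c[c + 1] / n_c[c]
            for c in range(1, max_c + 1) if n_c[c] > 0}
```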
Absolute Discounting
• Just subtract 0.75 (or some d)!
• Discounted bigram:

  $P_{\text{AbsoluteDiscounting}}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i) - d}{c(w_{i-1})}$

  (Maybe keeping a couple of extra values of d for counts 1 and 2)
• Backoff: use the bigram estimate for seen bigrams and the unigram otherwise

  $P_{BO}(w_i \mid w_{i-1}) = \begin{cases} \dfrac{c(w_{i-1}, w_i)}{c(w_{i-1})} & w_i \in A(w_{i-1}) \\ P(w_i) & w_i \in B(w_{i-1}) \end{cases}$

• Problem?
  • Not a probability distribution (the two pieces together sum to more than 1)
Katz Backoff
Maximum-likelihood and discounted estimates:

$P_{ML}(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}, \qquad P^*(w_i \mid w_{i-1}) \le P_{ML}(w_i \mid w_{i-1})$

• Divide the words into seen and unseen:

  $A(v) = \{w : c(v, w) > k\} \qquad B(v) = \{w : c(v, w) \le k\}$

• Backoff:

  $P_{BO}(w_i \mid w_{i-1}) = \begin{cases} P^*(w_i \mid w_{i-1}) & w_i \in A(w_{i-1}) \\ \alpha(w_{i-1}) \, P(w_i) & w_i \in B(w_{i-1}) \end{cases}$

  where $\alpha(w_{i-1})$ renormalizes the leftover (discounted) mass over the unseen words:

  $\alpha(w_{i-1}) = \frac{1 - \sum_{w \in A(w_{i-1})} P^*(w \mid w_{i-1})}{\sum_{w \in B(w_{i-1})} P(w)}$
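A simplified sketch of this backoff scheme for bigrams (my own, taking k = 0 and absolute discounting as the P* estimate; `bigrams` is a hypothetical nested count dict and `p_unigram` a unigram probability function):

```python
def katz_backoff(w, prev, bigrams, p_unigram, d=0.75):
    seen = bigrams.get(prev, {})              # A(prev): words with c(prev, w) > 0
    c_prev = sum(seen.values())
    if c_prev == 0:
        return p_unigram(w)                   # context never seen: fall back entirely
    if w in seen:
        return (seen[w] - d) / c_prev         # discounted P*(w | prev)
    # alpha(prev): mass freed by discounting, renormalized over unseen words B(prev)
    leftover = d * len(seen) / c_prev
    unseen_unigram_mass = 1.0 - sum(p_unigram(v) for v in seen)
    return leftover * p_unigram(w) / unseen_unigram_mass
```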
Linear Interpolation
• Simple interpolation: mix trigram, bigram, and unigram estimates with weights that sum to one (a sketch follows this list)

  $\hat{P}(w_i \mid w_{i-2} w_{i-1}) = \lambda_1 P(w_i \mid w_{i-2} w_{i-1}) + \lambda_2 P(w_i \mid w_{i-1}) + \lambda_3 P(w_i), \quad \sum_j \lambda_j = 1$

• Kneser-Ney intuition (next slide): a frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
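A minimal sketch of simple interpolation (my own; the component estimators and the fixed lambdas are placeholders, and in practice the lambdas are tuned on held-out data):

```python
def p_interp(w, prev2, prev1, p_tri, p_bi, p_uni, lambdas=(0.5, 0.3, 0.2)):
    """Mix trigram, bigram, and unigram estimates; the lambdas must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, prev2, prev1) + l2 * p_bi(w, prev1) + l3 * p_uni(w)
```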
Kneser-Ney Smoothing IV
$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) \, P_{\text{CONTINUATION}}(w_i)$

λ is a normalizing constant: the probability mass we've discounted

$\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})} \, \big|\{w : c(w_{i-1}, w) > 0\}\big|$

Here $\frac{d}{c(w_{i-1})}$ is the normalized discount, and $|\{w : c(w_{i-1}, w) > 0\}|$ is the number of word types that can follow $w_{i-1}$ = # of word types we discounted = # of times we applied the normalized discount
Kneser-Ney Smoothing: Recursive formulation

$P_{KN}(w_i \mid w_{i-n+1}^{i-1}) = \frac{\max\!\big(c_{KN}(w_{i-n+1}^{i}) - d,\, 0\big)}{c_{KN}(w_{i-n+1}^{i-1})} + \lambda(w_{i-n+1}^{i-1}) \, P_{KN}(w_i \mid w_{i-n+2}^{i-1})$

where the bigram base case is the formula above:

$P_{KN}(w_i \mid w_{i-1}) = \frac{\max(c(w_{i-1}, w_i) - d,\, 0)}{c(w_{i-1})} + \lambda(w_{i-1}) \, P_{\text{CONTINUATION}}(w_i)$
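A bigram-only sketch of the interpolated Kneser-Ney formula above (my own rendering; `bigrams` is a hypothetical nested dict of counts, and the continuation probability is computed by brute force):

```python
def kneser_ney_bigram(w, prev, bigrams, d=0.75):
    followers = bigrams.get(prev, {})
    c_prev = sum(followers.values())
    # P_CONTINUATION(w): fraction of distinct bigram types that end in w
    n_continuations = sum(1 for ctx in bigrams if w in bigrams[ctx])
    n_bigram_types = sum(len(f) for f in bigrams.values())
    p_cont = n_continuations / n_bigram_types
    if c_prev == 0:
        return p_cont                         # unseen context: continuation only
    lam = d * len(followers) / c_prev         # lambda(prev): discounted mass
    return max(followers.get(w, 0) - d, 0) / c_prev + lam * p_cont
```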
What Actually Works?
• Trigrams and beyond:
  • Unigrams, bigrams generally useless
  • Trigrams much better (when there's enough data)
  • 4-, 5-grams really useful in MT, but not so much for speech
• Discounting
  • Absolute discounting, Good-Turing, held-out estimation, Witten-Bell, etc.
• Having more data is better…
• … but so is using a better estimator
[Plot: test entropy (y-axis, roughly 7.5 to 10) for Katz and Kneser-Ney smoothing with training sets of 100,000, 1,000,000, and 10,000,000+ words]