Daniel Jurafsky, Stanford University
James H. Martin, University of Colorado at Boulder
CHAPTER 3: N-gram Language Models
“You are uniformly charming!” cried he, with a smile of associating and now
and then I bowed and they perceived a chaise and four to wish for.
Random sentence generated from a Jane Austen trigram model
Predicting is difficult—especially about the future, as the old quip goes. But how
about predicting something that seems much easier, like the next word someone is
going to say? What word, for example, is likely to follow
The water of Walden Pond is so beautifully ...
You might conclude that a likely word is blue, or green, or clear, but probably not
refrigerator or this. In this chapter we formalize this intuition by introducing language models or LMs, models that assign a probability to each possible next word. Language models can also assign a probability to an entire sentence, telling us that the following sequence has a much higher probability of appearing in a text:
all of a sudden I notice three guys standing on the sidewalk
Why would we want to predict upcoming words, or know the probability of a sen-
tence? One reason is for generation: choosing contextually better words. For ex-
ample we can correct grammar or spelling errors like Their are two midterms,
in which There was mistyped as Their, or Everything has improve, in which
improve should have been improved. The phrase There are is more probable
than Their are, and has improved than has improve, so a language model can
help users select the more grammatical variant. Or for a speech system to recognize
that you said I will be back soonish and not I will be bassoon dish, it
helps to know that back soonish is a more probable sequence. Language models
can also help in augmentative and alternative communication (AAC) (Trnka et al. 2007, Kane et al. 2017). People can use AAC systems if they are physically unable to
speak or sign but can instead use eye gaze or other movements to select words from
a menu. Word prediction can be used to suggest likely words for the menu.
Word prediction is also central to NLP for another reason: large language mod-
els are built just by training them to predict words! As we'll see in chapters 7-9,
large language models learn an enormous amount about language solely from being
trained to predict upcoming words from neighboring words.
In this chapter we introduce the simplest kind of language model: the n-gram
language model. An n-gram is a sequence of n words: a 2-gram (which we’ll call
bigram) is a two-word sequence of words like The water, or water of, and a 3-
gram (a trigram) is a three-word sequence of words like The water of, or water
of Walden. But we also (in a bit of terminological ambiguity) use the word ‘n-
gram’ to mean a probabilistic model that can estimate the probability of a word given
the n-1 previous words, and thereby also to assign probabilities to entire sequences.
In later chapters we will introduce the much more powerful neural large lan-
guage models, based on the transformer architecture of Chapter 9. But because
n-grams have a remarkably simple and clear formalization, we use them to intro-
duce some major concepts of large language modeling, including training and test
sets, perplexity, sampling, and interpolation.
3.1 N-Grams
Let’s begin with the task of computing P(w|h), the probability of a word w given
some history h. Suppose the history h is “The water of Walden Pond is so
beautifully ” and we want to know the probability that the next word is blue:
P(blue|The water of Walden Pond is so beautifully) (3.1)
One way to estimate this probability is directly from relative frequency counts: take a
very large corpus, count the number of times we see The water of Walden Pond
is so beautifully, and count the number of times this is followed by blue. This
would be answering the question “Out of the times we saw the history h, how many
times was it followed by the word w”, as follows:
P(\text{blue} \mid \text{The water of Walden Pond is so beautifully}) = \frac{C(\text{The water of Walden Pond is so beautifully blue})}{C(\text{The water of Walden Pond is so beautifully})}   (3.2)
If we had a large enough corpus, we could compute these two counts and estimate
the probability from Eq. 3.2. But even the entire web isn’t big enough to give us
good estimates for counts of entire sentences. This is because language is creative;
new sentences are invented all the time, and we can’t expect to get accurate counts
for such large objects as entire sentences. For this reason, we’ll need more clever
ways to estimate the probability of a word w given a history h, or the probability of
an entire word sequence W .
Let’s start with some notation. First, throughout this chapter we’ll continue to
refer to words, although in practice we usually compute language models over to-
kens like the BPE tokens of page 21. To represent the probability of a particular
random variable Xi taking on the value “the”, or P(Xi = “the”), we will use the
simplification P(the). We’ll represent a sequence of n words either as w1 . . . wn or
w1:n . Thus the expression w1:n−1 means the string w1 , w2 , ..., wn−1 , but we’ll also
be using the equivalent notation w<n , which can be read as “all the elements of w
from w1 up to and including wn−1 ”. For the joint probability of each word in a se-
quence having a particular value P(X1 = w1 , X2 = w2 , X3 = w3 , ..., Xn = wn ) we’ll
use P(w1 , w2 , ..., wn ).
Now, how can we compute probabilities of entire sequences like P(w1 , w2 , ..., wn )?
One thing we can do is decompose this probability using the chain rule of proba-
bility:
P(X_1 \ldots X_n) = P(X_1)\,P(X_2 \mid X_1)\,P(X_3 \mid X_{1:2}) \ldots P(X_n \mid X_{1:n-1}) = \prod_{k=1}^{n} P(X_k \mid X_{1:k-1})   (3.3)
The chain rule shows the link between computing the joint probability of a sequence
and computing the conditional probability of a word given previous words. Equa-
tion 3.4 suggests that we could estimate the joint probability of an entire sequence of
words by multiplying together a number of conditional probabilities. But using the
chain rule doesn’t really seem to help us! We don’t know any way to compute the
exact probability of a word given a long sequence of preceding words, P(wn |w1:n−1 ).
As we said above, we can’t just estimate by counting the number of times every word
occurs following every long string in some corpus, because language is creative and
any particular context might have never occurred before!
The bigram model makes the Markov assumption that the probability of a word depends only on the preceding word, P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-1}) (Eq. 3.7). Given this bigram assumption for the probability of an individual word, we can compute the probability of a complete word sequence by substituting Eq. 3.7 into Eq. 3.4:

P(w_{1:n}) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-1})   (3.9)
To estimate these bigram probabilities we use maximum likelihood estimation (MLE): we get counts from a corpus and normalize them so they lie between 0 and 1. For a bigram, we count the bigram C(w_{n-1} w_n) and normalize by the sum of all the bigrams that share the same first word:

P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{\sum_{w} C(w_{n-1} w)}   (3.10)
We can simplify this equation, since the sum of all bigram counts that start with
a given word wn−1 must be equal to the unigram count for that word wn−1 (the reader
should take a moment to be convinced of this):
P(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}   (3.11)
Let’s work through an example using a mini-corpus of three sentences. We’ll
first need to augment each sentence with a special symbol <s> at the beginning
of the sentence, to give us the bigram context of the first word. We’ll also need a
special end-symbol </s>.1
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
Here are the calculations for some of the bigram probabilities from this corpus:

P(I|<s>) = 2/3 = 0.67      P(Sam|<s>) = 1/3 = 0.33      P(am|I) = 2/3 = 0.67
P(</s>|Sam) = 1/2 = 0.5    P(Sam|am) = 1/2 = 0.5        P(do|I) = 1/3 = 0.33
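To make the counting concrete, here is a minimal Python sketch (not from the text; the function and variable names are our own) that computes these MLE bigram estimates from the three-sentence mini-corpus:

```python
from collections import Counter

# The three-sentence mini-corpus, already augmented with <s> and </s>.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))

def bigram_prob(w, prev):
    """MLE estimate P(w | prev) = C(prev w) / C(prev), as in Eq. 3.11."""
    return bigram_counts[(prev, w)] / unigram_counts[prev]

print(bigram_prob("I", "<s>"))     # 2/3 = 0.67
print(bigram_prob("am", "I"))      # 2/3 = 0.67
print(bigram_prob("</s>", "Sam"))  # 1/2 = 0.5
```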
For the general case of MLE n-gram parameter estimation:
P(w_n \mid w_{n-N+1:n-1}) = \frac{C(w_{n-N+1:n-1}\, w_n)}{C(w_{n-N+1:n-1})}   (3.12)
Equation 3.12 (like Eq. 3.11) estimates the n-gram probability by dividing the
observed frequency of a particular sequence by the observed frequency of a prefix.
This ratio is called a relative frequency. We said above that this use of relative
frequencies as a way to estimate probabilities is an example of maximum likelihood
estimation or MLE. In MLE, the resulting parameter set maximizes the likelihood of
the training set T given the model M (i.e., P(T |M)). For example, suppose the word
Chinese occurs 400 times in a corpus of a million words. What is the probability that a random word selected from some other text of, say, a million words will be the word Chinese? The MLE of its probability is 400/1000000 or 0.0004. Now 0.0004 is not
the best possible estimate of the probability of Chinese occurring in all situations; it
1 We need the end-symbol to make the bigram grammar a true probability distribution. Without an end-
symbol, instead of the sentence probabilities of all sentences summing to one, the sentence probabilities
for all sentences of a given length would sum to one. This model would define an infinite set of probability
distributions, with one distribution per sentence length. See Exercise 3.5.
might turn out that in some other corpus or context Chinese is a very unlikely word.
But it is the probability that makes it most likely that Chinese will occur 400 times
in a million-word corpus. We present ways to modify the MLE estimates slightly to
get better probability estimates in Section 3.6.
Let’s move on to some examples from a real but tiny corpus, drawn from the
now-defunct Berkeley Restaurant Project, a dialogue system from the last century
that answered questions about a database of restaurants in Berkeley, California (Ju-
rafsky et al., 1994). Here are some sample user queries (text-normalized by lowercasing and with punctuation stripped) (a sample of 9332 sentences is on the website):
can you tell me about any good cantonese restaurants close by
tell me about chez panisse
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Figure 3.1 shows the bigram counts from part of a bigram grammar from text-
normalized Berkeley Restaurant Project sentences. Note that the majority of the
values are zero. In fact, we have chosen the sample words to cohere with each other;
a matrix selected from a random set of eight words would be even more sparse.
Figure 3.2 shows the bigram probabilities after normalization (dividing each cell
in Fig. 3.1 by the appropriate unigram for its row, taken from the following set of
unigram counts):
i want to eat chinese food lunch spend
2533 927 2417 746 158 1093 341 278
Here are a few other useful probabilities:
P(i|<s>) = 0.25 P(english|want) = 0.0011
P(food|english) = 0.5 P(</s>|food) = 0.68
Now we can compute the probability of sentences like I want English food or
I want Chinese food by simply multiplying the appropriate bigram probabilities to-
gether, as follows:
P(<s> i want english food </s>)
   = P(i|<s>) P(want|i) P(english|want) P(food|english) P(</s>|food)
   = 0.25 × 0.33 × 0.0011 × 0.5 × 0.68
   = 0.000031
In practice, language model probabilities are represented and computed as log probabilities: multiplying many small probabilities together would cause numerical underflow, so we add their logs instead and exponentiate at the end if needed. Throughout this book, we'll use log to mean natural log (ln) when the base is not specified.
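As a small illustration of scoring a sentence in log space, here is a sketch (ours, not from the text) that uses the bigram probabilities quoted above for the restaurant corpus; any bigram not in this dictionary would of course need its own estimate.

```python
import math

# Bigram probabilities quoted above for the Berkeley Restaurant Project corpus.
bigram_probs = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

def sentence_logprob(tokens):
    """Sum log bigram probabilities; adding logs avoids the numerical
    underflow that multiplying many small probabilities would cause."""
    return sum(math.log(bigram_probs[(prev, w)])
               for prev, w in zip(tokens, tokens[1:]))

lp = sentence_logprob("<s> i want english food </s>".split())
print(lp)            # about -10.4 (natural log)
print(math.exp(lp))  # about 3.1e-05, matching the 0.000031 computed above
```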
Longer context Although for pedagogical purposes we have only described bigram models, when there is sufficient training data we use trigram models, which condition on the previous two words, or 4-gram or 5-gram models. For these larger n-grams, we'll need to assume extra contexts to the left and right of the sentence end. For example, to compute trigram probabilities at the very beginning of the sentence, we use two pseudo-words for the first trigram (i.e., P(I|<s><s>)).
Some large n-gram datasets have been created, like the million most frequent
n-grams drawn from the Corpus of Contemporary American English (COCA), a
curated 1 billion word corpus of American English (Davies, 2020), Google’s Web
5-gram corpus from 1 trillion words of English web text (Franz and Brants, 2006),
or the Google Books Ngrams corpora (800 billion tokens from Chinese, English,
French, German, Hebrew, Italian, Russian, and Spanish) (Lin et al., 2012a).
It’s even possible to use extremely long-range n-gram context. The infini-gram
(∞-gram) project (Liu et al., 2024) allows n-grams of any length. Their idea is to
avoid the expensive (in space and time) pre-computation of huge n-gram count ta-
bles. Instead, n-gram probabilities with arbitrary n are computed quickly at inference
time by using an efficient representation called suffix arrays. This allows computing
of n-grams of every length for enormous corpora of 5 trillion tokens.
Efficiency considerations are important when building large n-gram language
models. It is standard to quantize the probabilities using only 4-8 bits (instead of
8-byte floats), store the word strings on disk and represent them in memory only as
a 64-bit hash, and represent n-grams in special data structures like ‘reverse tries’.
It is also common to prune n-gram language models, for example by only keeping
n-grams with counts greater than some threshold or using entropy to prune less-
important n-grams (Stolcke, 1998). Efficient language model toolkits like KenLM
(Heafield 2011, Heafield et al. 2013) use sorted arrays and use merge sorts to effi-
ciently build the probability tables in a minimal number of passes through a large
corpus.
The test set should reflect how we intend to use the model: if we are building a model for speech recognition of chemistry lectures, the test set should be text of chemistry
lectures. If we’re going to use it as part of a system for translating hotel booking re-
quests from Chinese to English, the test set should be text of hotel booking requests.
If we want our language model to be general purpose, then the test set should be
drawn from a wide variety of texts. In such cases we might collect a lot of texts
from different sources, and then divide it up into a training set and a test set. It’s
important to do the dividing carefully; if we’re building a general purpose model,
we don’t want the test set to consist of only text from one document, or one author,
since that wouldn’t be a good measure of general performance.
Thus if we are given a corpus of text and want to compare the performance of
two different n-gram models, we divide the data into training and test sets, and train
the parameters of both models on the training set. We can then compare how well
the two trained models fit the test set.
But what does it mean to “fit the test set”? The standard answer is simple:
whichever language model assigns a higher probability to the test set—which
means it more accurately predicts the test set—is a better model. Given two proba-
bilistic models, the better model is the one that better predicts the details of the test
data, and hence will assign a higher probability to the test data.
Since our evaluation metric is based on test set probability, it’s important not to
let the test sentences into the training set. Suppose we are trying to compute the
probability of a particular “test” sentence. If our test sentence is part of the training
corpus, we will mistakenly assign it an artificially high probability when it occurs
in the test set. We call this situation training on the test set. Training on the test
set introduces a bias that makes the probabilities all look too high, and causes huge
inaccuracies in perplexity, the probability-based metric we introduce below.
Even if we don’t train on the test set, if we test our language model on the
test set many times after making different changes, we might implicitly tune to its
characteristics, by noticing which changes seem to make the model better. For this
reason, we only want to run our model on the test set once, or at most a handful of times, once we are sure our model is ready.
For this reason we normally instead have a third dataset called a development test set or devset. We do all our testing on this dataset until the very end, and then we test on the test set once to see how good our model is.
How do we divide our data into training, development, and test sets? We want
our test set to be as large as possible, since a small test set may be accidentally un-
representative, but we also want as much training data as possible. At the minimum,
we would want to pick the smallest test set that gives us enough statistical power
to measure a statistically significant difference between two potential models. It’s
important that the devset be drawn from the same kind of text as the test set, since
its goal is to measure how we would do on the test set.
Note that because of the inverse in Eq. 3.15, the higher the probability of the word
sequence, the lower the perplexity. Thus the lower the perplexity of a model on
the data, the better the model. Minimizing perplexity is equivalent to maximizing
the test set probability according to the language model. Why does perplexity use
the inverse probability? It turns out the inverse arises from the original definition
of perplexity from cross-entropy rate in information theory; for those interested, the
explanation is in the advanced Section 3.7. Meanwhile, we just have to remember
that perplexity has an inverse relationship with probability.
The details of computing the perplexity of a test set W depend on which lan-
guage model we use. Here’s the perplexity of W with a unigram language model
(just the geometric mean of the inverse of the unigram probabilities):
perplexity(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i)}}   (3.16)
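Here is a minimal sketch of this computation done in log space, assuming we already have a function that returns the log probability of each token in its context; the unigram probabilities at the end are made up purely for illustration.

```python
import math

def perplexity(tokens, logprob):
    """Perplexity of a token sequence: the exponential of the average negative
    log probability, equivalent to the N-th root of the inverse probability."""
    n = len(tokens)
    total = sum(logprob(tokens, i) for i in range(n))
    return math.exp(-total / n)

# A unigram example with made-up probabilities: context is ignored.
unigram = {"the": 0.07, "water": 0.001, "of": 0.03}
unigram_logprob = lambda tokens, i: math.log(unigram[tokens[i]])
print(perplexity(["the", "water", "of"], unigram_logprob))  # about 78
```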
What we generally use for word sequence in Eq. 3.15 or Eq. 3.17 is the entire
sequence of words in some test set. Since this sequence will cross many sentence
boundaries, if our vocabulary includes a between-sentence token <EOS> or separate
begin- and end-sentence markers <s> and </s> then we can include them in the
probability computation. If we do, then we also include one token per sentence in
the total count of word tokens N.2
We mentioned above that perplexity is a function of both the text and the lan-
guage model: given a text W , different language models will have different perplex-
ities. Because of this, perplexity can be used to compare different language models.
For example, here we trained unigram, bigram, and trigram grammars on 38 million
words from the Wall Street Journal newspaper. We then computed the perplexity of
each of these models on a WSJ test set using Eq. 3.16 for unigrams, Eq. 3.17 for
bigrams, and the corresponding equation for trigrams. The table below shows the
perplexity of the 1.5 million word test set according to each of the language models.
             Unigram   Bigram   Trigram
Perplexity     962       170       109
As we see above, the more information the n-gram gives us about the word
sequence, the higher the probability the n-gram will assign to the string. A trigram
model is less surprised than a unigram model because it has a better idea of what
words might come next, and so it assigns them a higher probability. And the higher
the probability, the lower the perplexity (since as Eq. 3.15 showed, perplexity is
related inversely to the probability of the test sequence according to the model). So
a lower perplexity tells us that a language model is a better predictor of the test set.
Note that in computing perplexities, the language model must be constructed
without any knowledge of the test set, or else the perplexity will be artificially low.
And the perplexity of two language models is only comparable if they use identical
vocabularies.
An (intrinsic) improvement in perplexity does not guarantee an (extrinsic) im-
provement in the performance of a language processing task like speech recognition
or machine translation. Nonetheless, because perplexity usually correlates with task
improvements, it is commonly used as a convenient evaluation metric. Still, when
possible a model’s improvement in perplexity should be confirmed by an end-to-end
evaluation on a real task.
Suppose language model A assigns each of the three colors the same probability 1/3. Then the perplexity of A on the test set T = red red red red blue is:

perplexity_A(T) = P_A(\text{red red red red blue})^{-1/5} = \left(\left(\tfrac{1}{3}\right)^{5}\right)^{-1/5} = \left(\tfrac{1}{3}\right)^{-1} = 3   (3.19)
But now suppose red was very likely in the training set of a different LM B, and so B has the following probabilities:
P(red) = 0.8 P(green) = 0.1 P(blue) = 0.1 (3.20)
We should expect the perplexity of the same test set red red red red blue for
language model B to be lower since most of the time the next color will be red, which
is very predictable, i.e. has a high probability. So the probability of the test set will
be higher, and since perplexity is inversely related to probability, the perplexity will
be lower. Thus, although the branching factor is still 3, the perplexity or weighted
branching factor is smaller:
perplexity_B(T) = P_B(\text{red red red red blue})^{-1/5} = 0.04096^{-1/5} = 0.527^{-1} = 1.89   (3.21)
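The following short sketch (ours, not from the text) reproduces both numbers for the red/blue test set:

```python
def unigram_perplexity(word_probs, test):
    """Perplexity as in Eq. 3.15: the inverse probability of the test
    sequence, normalized (by the root) for its length."""
    p = 1.0
    for w in test:
        p *= word_probs[w]
    return p ** (-1 / len(test))

test = ["red", "red", "red", "red", "blue"]
model_a = {"red": 1/3, "green": 1/3, "blue": 1/3}   # uniform LM A
model_b = {"red": 0.8, "green": 0.1, "blue": 0.1}   # skewed LM B (Eq. 3.20)
print(unigram_perplexity(model_a, test))  # 3.0, as in Eq. 3.19
print(unigram_perplexity(model_b, test))  # about 1.89, as in Eq. 3.21
```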
Figure 3.3 A visualization of the sampling distribution for sampling sentences by repeatedly sampling unigrams. The blue bar represents the relative frequency of each word (we've ordered them from most frequent to least frequent, but the choice of order is arbitrary). The number line shows the cumulative probabilities. If we choose a random number between 0 and 1, it will fall in an interval corresponding to some word. The expectation for the random number to fall in the larger intervals of one of the frequent words (the, of, a) is much higher than in the smaller interval of one of the rare words (polyphonic).
One important way to visualize what kind of knowledge a language model embodies is to sample from it. Sampling from a distribution means to choose random points according to their likelihood. Thus sampling from a language model—which
represents a distribution over sentences—means to generate some sentences, choos-
ing each sentence according to its likelihood as defined by the model. Thus we are
more likely to generate sentences that the model thinks have a high probability and
less likely to generate sentences that the model thinks have a low probability.
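A minimal sketch of the sampling step visualized in Fig. 3.3: draw a random number in [0, 1) and find which word's interval of the cumulative distribution it falls into. The probabilities here are made up for illustration.

```python
import random

def sample_word(unigram_probs):
    """Draw r in [0, 1) and return the word whose interval of the cumulative
    distribution contains r (the procedure visualized in Fig. 3.3)."""
    r = random.random()
    cumulative = 0.0
    for word, p in unigram_probs.items():
        cumulative += p
        if r < cumulative:
            return word
    return word  # guard against floating-point rounding at the very end

# Made-up unigram probabilities, ordered from frequent to rare.
probs = {"the": 0.6, "of": 0.2, "a": 0.15, "polyphonic": 0.05}
print([sample_word(probs) for _ in range(10)])
```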
1-gram:
–To him swallowed confess hear both. Which. Of save on trail for are ay device and rote life have
–Hill he late speaks; or! a more to leg less first you enter

2-gram:
–Why dost stand forth thy canopy, forsooth; he is this palpable hit the King Henry. Live king. Follow.
–What means, sir. I confess she? then all sorts, he is trim, captain.

3-gram:
–Fly, and will rid me these news of price. Therefore the sadness of parting, as they say, 'tis done.
–This shall forbid it should be branded, if renown made it empty.

4-gram:
–King Henry. What! I will go seek the traitor Gloucester. Exeunt some of the watch. A great banquet serv'd in;
–It cannot be but so.

Figure 3.4 Eight sentences randomly generated from four n-grams computed from Shakespeare's works. All characters were mapped to lower-case and punctuation marks were treated as words. Output is hand-corrected for capitalization to improve readability.
The longer the context, the more coherent the sentences. The unigram sen-
tences show no coherent relation between words nor any sentence-final punctua-
tion. The bigram sentences have some local word-to-word coherence (especially
considering punctuation as words). The trigram sentences are beginning to look a
lot like Shakespeare. Indeed, the 4-gram sentences look a little too much like Shake-
speare. The words It cannot be but so are directly from King John. This is because,
not to put the knock on Shakespeare, his oeuvre is not very large as corpora go
(N = 884,647, V = 29,066), and our n-gram probability matrices are ridiculously sparse. There are V^2 = 844,000,000 possible bigrams alone, and the number of possible 4-grams is V^4 = 7 × 10^17. Thus, once the generator has chosen the first 3-gram (It cannot be), there are only seven possible next words for the 4th element (but, I, that, thus, this, and the period).
To get an idea of the dependence on the training set, let’s look at LMs trained on a
completely different corpus: the Wall Street Journal (WSJ) newspaper. Shakespeare
and the WSJ are both English, so we might have expected some overlap between our
n-grams for the two genres. Fig. 3.5 shows sentences generated by unigram, bigram,
and trigram grammars trained on 40 million words from WSJ.
1-gram: Months the my and issue of year foreign new exchange's september were recession exchange new endorsed a acquire to six executives

2-gram: Last December through the way to preserve the Hudson corporation N. B. E. C. Taylor would seem to complete the major central planners one point five percent of U. S. E. has already old M. X. corporation of living on information such as more frequently fishing to keep her

3-gram: They also point to ninety nine point six billion dollars from two hundred four oh six three percent of the rates of interest stores as Mexico and Brazil on market conditions

Figure 3.5 Three sentences randomly generated from three n-gram models computed from 40 million words of the Wall Street Journal, lower-casing all characters and treating punctuation as words. Output was then hand-corrected for capitalization to improve readability.
Compare these examples to the pseudo-Shakespeare in Fig. 3.4. While they both
model “English-like sentences”, there is no overlap in the generated sentences, and
little overlap even in small phrases. Statistical models are pretty useless as predictors
if the training sets and the test sets are as different as Shakespeare and the WSJ.
How should we deal with this problem when we build n-gram models? One step
is to be sure to use a training corpus that has a similar genre to whatever task we are
trying to accomplish. To build a language model for translating legal documents,
we need a training corpus of legal documents. To build a language model for a
question-answering system, we need a training corpus of questions.
It is equally important to get training data in the appropriate dialect or variety,
especially when processing social media posts or spoken transcripts. For example, some tweets will use features of African American English (AAE)—the name for the many variations of language used in African American communities (King, 2020). Such features can include words like finna—an auxiliary verb that marks immediate future tense—that don't occur in other varieties, or spellings like den for then, in tweets like this one (Blodgett and O'Connor, 2017):
(3.22) Bored af den my phone finna die!!!
while tweets from English-based languages like Nigerian Pidgin have markedly dif-
ferent vocabulary and n-gram patterns from American English (Jurgens et al., 2017):
(3.23) @username R u a wizard or wat gan sef: in d mornin - u tweet, afternoon - u
tweet, nyt gan u dey tweet. beta get ur IT placement wiv twitter
Is it possible for the test set nonetheless to have a word we have never seen be-
fore? What happens if the word Jurafsky never occurs in our training set, but pops
up in the test set? The answer is that although words might be unseen, we actu-
ally run our NLP algorithms not on words but on subword tokens. With subword
tokenization (like the BPE algorithm of Chapter 2) any word can be modeled as a
sequence of known smaller subwords, if necessary by a sequence of individual let-
ters. So although for convenience we’ve been referring to words in this chapter, the
language model vocabulary is actually the set of tokens rather than words, and the
test set can never contain unseen tokens.
The simplest way to do smoothing is Laplace (add-one) smoothing: add one to each count before normalizing. For unigrams, if word w_i has count c_i in a corpus of N tokens with a vocabulary of V word types, the smoothed probability is

P_{\text{Laplace}}(w_i) = \frac{c_i + 1}{N + V}   (3.24)
It is convenient to describe how a smoothing algorithm affects the counts by defining an adjusted count c^*_i, which is easier to compare directly with the MLE counts:

c^*_i = (c_i + 1)\,\frac{N}{N + V}   (3.25)

We can now turn c^*_i into a probability P^*_i by normalizing by N.
A related way to view smoothing is as discounting (lowering) some non-zero counts in order to get the probability mass that will be assigned to the zero counts. Thus, instead of referring to the discounted counts c^*, we might describe a smoothing algorithm in terms of a relative discount d_i, the ratio of the discounted counts to the original counts:

d_i = \frac{c^*_i}{c_i}
Now that we have the intuition for the unigram case, let’s smooth our Berkeley
Restaurant Project bigrams. Figure 3.6 shows the add-one smoothed counts for the
bigrams in Fig. 3.1.
Figure 3.7 shows the add-one smoothed probabilities for the bigrams in Fig. 3.2.
Recall that normal bigram probabilities are computed by normalizing each row of
counts by the unigram count:
P_{\text{MLE}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n)}{C(w_{n-1})}   (3.26)
For add-one smoothed bigram counts, we need to augment the unigram count by the
number of total word types in the vocabulary V :
P_{\text{Laplace}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{\sum_{w} \left(C(w_{n-1} w) + 1\right)} = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}   (3.27)
Thus, each of the unigram counts given in the previous section will need to be aug-
mented by V = 1446. The result is the smoothed bigram probabilities in Fig. 3.7.
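A small sketch of Eq. 3.27, together with the adjusted count of Eq. 3.28 reconstructed just below, using only the counts quoted in the text for want and want to; the helper names and the toy dictionaries are our own.

```python
def laplace_bigram_prob(prev, w, bigram_counts, unigram_counts, V):
    """Add-one smoothed bigram probability, Eq. 3.27:
    (C(prev w) + 1) / (C(prev) + V)."""
    return (bigram_counts.get((prev, w), 0) + 1) / (unigram_counts[prev] + V)

def adjusted_count(prev, w, bigram_counts, unigram_counts, V):
    """Reconstructed (adjusted) count, Eq. 3.28."""
    return ((bigram_counts.get((prev, w), 0) + 1) * unigram_counts[prev]
            / (unigram_counts[prev] + V))

# Counts quoted in the text: C(want) = 927, C(want to) = 608, V = 1446.
bigrams = {("want", "to"): 608}
unigrams = {"want": 927}
print(laplace_bigram_prob("want", "to", bigrams, unigrams, V=1446))  # about 0.26
print(adjusted_count("want", "to", bigrams, unigrams, V=1446))       # about 238
# A bigram unseen in this toy dictionary still gets a small non-zero probability:
print(laplace_bigram_prob("want", "spend", bigrams, unigrams, V=1446))
```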
It is often convenient to reconstruct the count matrix so we can see how much a smoothing algorithm has changed the original counts. These adjusted counts can be computed as follows:

c^*(w_{n-1} w_n) = \frac{[C(w_{n-1} w_n) + 1] \times C(w_{n-1})}{C(w_{n-1}) + V}   (3.28)
Note that add-one smoothing has made a very big change to the counts. Com-
paring Fig. 3.8 to the original counts in Fig. 3.1, we can see that C(want to) changed
from 608 to 238! We can see this in probability space as well: P(to|want) decreases
from 0.66 in the unsmoothed case to 0.26 in the smoothed case. Looking at the dis-
count d (the ratio between new and old counts) shows us how strikingly the counts
for each prefix word have been reduced; the discount for the bigram want to is 0.39,
while the discount for Chinese food is 0.10, a factor of 10! The sharp change occurs
because too much probability mass is moved to all the zeros.
One alternative to add-one smoothing is add-k smoothing, which moves a bit less of the probability mass to the zeros by adding a fractional count k (e.g., 0.5 or 0.01) instead of 1:

P^*_{\text{Add-k}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + k}{C(w_{n-1}) + kV}   (3.29)
Add-k smoothing requires that we have a method for choosing k; this can be
done, for example, by optimizing on a devset. Although add-k is useful for some
tasks (including text classification), it turns out that it still doesn’t work well for
language modeling, generating counts with poor variances and often inappropriate
discounts (Gale and Church, 1994).
How are these λ values set? Both the simple interpolation and conditional interpolation λs are learned from a held-out corpus. A held-out corpus is an additional training corpus, so-called because we hold it out from the training data, that we use to set these λ values.4 We do so by choosing the λ values that maximize the likeli-
hood of the held-out corpus. That is, we fix the n-gram probabilities and then search
for the λ values that—when plugged into Eq. 3.30—give us the highest probability
of the held-out set. There are various ways to find this optimal set of λ s. One way
is to use the EM algorithm, an iterative learning algorithm that converges on locally
optimal λ s (Jelinek and Mercer, 1980).
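As a sketch of what the simple linear interpolation referenced above (Eq. 3.30) looks like in code: the λ values and the stand-in probability functions below are arbitrary placeholders, not learned values.

```python
def interpolated_prob(w, prev2, prev1, p_uni, p_bi, p_tri, lambdas):
    """Linear interpolation of unigram, bigram, and trigram estimates
    (the form of Eq. 3.30); the lambdas must sum to 1."""
    l1, l2, l3 = lambdas
    return l1 * p_uni(w) + l2 * p_bi(w, prev1) + l3 * p_tri(w, prev2, prev1)

# Stand-in component distributions, just to show the call shape.
p_uni = lambda w: 0.01
p_bi = lambda w, prev1: 0.10
p_tri = lambda w, prev2, prev1: 0.40
print(interpolated_prob("food", "want", "chinese", p_uni, p_bi, p_tri,
                        lambdas=(0.1, 0.3, 0.6)))  # 0.271
```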
The backoff terminates in the unigram, which has score S(w) = count(w)/N. Brants et al. (2007) find that a value of 0.4 worked well for λ.
The log can, in principle, be computed in any base. If we use log base 2, the
resulting value of entropy will be measured in bits.
One intuitive way to think about entropy is as a lower bound on the number of
bits it would take to encode a certain decision or piece of information in the optimal
coding scheme. Consider an example from the standard information theory textbook
Cover and Thomas (1991). Imagine that we want to place a bet on a horse race but
it is too far to go all the way to Yonkers Racetrack, so we’d like to send a short
message to the bookie to tell him which of the eight horses to bet on. One way to
encode this message is just to use the binary representation of the horse’s number
as the code; thus, horse 1 would be 001, horse 2 010, horse 3 011, and so on, with
horse 8 coded as 000. If we spend the whole day betting and each horse is coded
with 3 bits, on average we would be sending 3 bits per race.
Can we do better? Suppose that the spread is the actual distribution of the bets
placed and that we represent it as the prior probability of each horse as follows:
Horse 1   1/2        Horse 5   1/64
Horse 2   1/4        Horse 6   1/64
Horse 3   1/8        Horse 7   1/64
Horse 4   1/16       Horse 8   1/64
The entropy of the random variable X that ranges over horses gives us a lower
bound on the number of bits and is
H(X) = -\sum_{i=1}^{8} p(i) \log_2 p(i)
     = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{8}\log_2\tfrac{1}{8} - \tfrac{1}{16}\log_2\tfrac{1}{16} - 4\left(\tfrac{1}{64}\log_2\tfrac{1}{64}\right)
     = 2 \text{ bits}   (3.34)
A code that averages 2 bits per race can be built with short encodings for more
probable horses, and longer encodings for less probable horses. For example, we
could encode the most likely horse with the code 0, and the remaining horses as 10,
then 110, 1110, 111100, 111101, 111110, and 111111.
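A quick sketch (ours) verifying both numbers: the entropy of the betting distribution and the average length of the variable-length code just described are each 2 bits.

```python
import math

probs = [1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64]
entropy = -sum(p * math.log2(p) for p in probs)
print(entropy)  # 2.0 bits, matching Eq. 3.34

# The variable-length code described above:
# 0, 10, 110, 1110, 111100, 111101, 111110, 111111
code_lengths = [1, 2, 3, 4, 6, 6, 6, 6]
print(sum(p * l for p, l in zip(probs, code_lengths)))  # also 2.0 bits per race
```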
What if the horses are equally likely? We saw above that if we used an equal-
length binary code for the horse numbers, each horse took 3 bits to code, so the
average was 3. Is the entropy the same? In this case each horse would have a probability of 1/8. The entropy of the choice of horses is then

H(X) = -\sum_{i=1}^{8} \tfrac{1}{8}\log_2\tfrac{1}{8} = -\log_2\tfrac{1}{8} = 3 \text{ bits}   (3.35)
Until now we have been computing the entropy of a single variable. But most of
what we will use entropy for involves sequences. For a grammar, for example, we
will be computing the entropy of some sequence of words W = {w1 , w2 , . . . , wn }.
One way to do this is to have a variable that ranges over sequences of words. For
example we can compute the entropy of a random variable that ranges over all se-
quences of words of length n in some language L as follows:
H(w_1, w_2, \ldots, w_n) = -\sum_{w_{1:n} \in L} p(w_{1:n}) \log p(w_{1:n})   (3.36)
We could define the entropy rate (we could also think of this as the per-word entropy) as the entropy of this sequence divided by the number of words:

\frac{1}{n} H(w_{1:n}) = -\frac{1}{n} \sum_{w_{1:n} \in L} p(w_{1:n}) \log p(w_{1:n})   (3.37)
To measure the true entropy of a language L, we need to consider sequences of infinite length and take the limit of this entropy rate as n grows:

H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_{1:n}) = -\lim_{n \to \infty} \frac{1}{n} \sum_{w_{1:n} \in L} p(w_{1:n}) \log p(w_{1:n})   (3.38)
The Shannon-McMillan-Breiman theorem (Algoet and Cover 1988, Cover and Thomas
1991) states that if the language is regular in certain ways (to be exact, if it is both
stationary and ergodic),
H(L) = \lim_{n \to \infty} -\frac{1}{n} \log p(w_{1:n})   (3.39)
That is, we can take a single sequence that is long enough instead of summing over
all possible sequences. The intuition of the Shannon-McMillan-Breiman theorem
is that a long-enough sequence of words will contain in it many other shorter se-
quences and that each of these shorter sequences will reoccur in the longer sequence
according to their probabilities.
A stochastic process is said to be stationary if the probabilities it assigns to a
sequence are invariant with respect to shifts in the time index. In other words, the
probability distribution for words at time t is the same as the probability distribution
at time t + 1. Markov models, and hence n-grams, are stationary. For example, in
a bigram, Pi is dependent only on Pi−1 . So if we shift our time index by x, Pi+x is
still dependent on Pi+x−1 . But natural language is not stationary, since as we show
in Appendix D, the probability of upcoming words can be dependent on events that
were arbitrarily distant and time dependent. Thus, our statistical models only give
an approximation to the correct distributions and entropies of natural language.
To summarize, by making some incorrect but convenient simplifying assump-
tions, we can compute the entropy of some stochastic process by taking a very long
sample of the output and computing its average log probability.
Now we are ready to introduce cross-entropy. The cross-entropy is useful when
we don’t know the actual probability distribution p that generated some data. It
allows us to use some m, which is a model of p (i.e., an approximation to p). The
cross-entropy of m on p is defined by
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{W \in L} p(w_1, \ldots, w_n) \log m(w_1, \ldots, w_n)   (3.40)
That is, we draw sequences according to the probability distribution p, but sum the
log of their probabilities according to m.
Again, following the Shannon-McMillan-Breiman theorem, for a stationary er-
godic process:
H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1 w_2 \ldots w_n)   (3.41)
This means that, as for entropy, we can estimate the cross-entropy of a model m
on some distribution p by taking a single sequence that is long enough instead of
summing over all possible sequences.
What makes the cross-entropy useful is that the cross-entropy H(p, m) is an upper bound on the entropy H(p). For any model m:

H(p) \leq H(p, m)   (3.42)
This means that we can use some simplified model m to help estimate the true en-
tropy of a sequence of symbols drawn according to probability p. The more accurate
m is, the closer the cross-entropy H(p, m) will be to the true entropy H(p). Thus,
the difference between H(p, m) and H(p) is a measure of how accurate a model is.
Between two models m1 and m2 , the more accurate model will be the one with the
lower cross-entropy. (The cross-entropy can never be lower than the true entropy, so
a model cannot err by underestimating the true entropy.)
We are finally ready to see the relation between perplexity and cross-entropy
as we saw it in Eq. 3.41. Cross-entropy is defined in the limit as the length of the
observed word sequence goes to infinity. We approximate this cross-entropy by
relying on a (sufficiently long) sequence of fixed length. This approximation to the
cross-entropy of a model M = P(wi |wi−N+1 : i−1 ) on a sequence of words W is
H(W) = -\frac{1}{N} \log P(w_1 w_2 \ldots w_N)   (3.43)
The perplexity of a model P on a sequence of words W is now formally defined as 2 raised to the power of this cross-entropy:

Perplexity(W) = 2^{H(W)} = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \ldots w_N)}}
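A small numerical check of this identity (the sequence probability below is made up): computing perplexity as 2 raised to the cross-entropy estimate of Eq. 3.43, with log base 2, gives the same value as taking the N-th root of the inverse probability.

```python
import math

def perplexity_via_cross_entropy(seq_prob, n):
    """2 raised to the cross-entropy estimate H(W) of Eq. 3.43 (log base 2)."""
    return 2 ** (-math.log2(seq_prob) / n)

def perplexity_direct(seq_prob, n):
    """The N-th root of the inverse sequence probability."""
    return seq_prob ** (-1 / n)

p, n = 3.1e-5, 6  # a made-up sequence probability and length
print(perplexity_via_cross_entropy(p, n))  # the two computations agree
print(perplexity_direct(p, n))
```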
3.8 Summary
This chapter introduced language modeling via the n-gram model, a classic model
that allows us to introduce many of the basic concepts in language modeling.
• Language models offer a way to assign a probability to a sentence or other
sequence of words or tokens, and to predict a word or token from preceding
words or tokens.
• N-grams are perhaps the simplest kind of language model. They are Markov
models that estimate words from a fixed window of previous words. N-gram
models can be trained by counting in a training corpus and normalizing the
counts (the maximum likelihood estimate).
• N-gram language models can be evaluated on a test set using perplexity.
• The perplexity of a test set according to a language model is a function of
the probability of the test set: the inverse test set probability according to the
model, normalized by the length.
• Sampling from a language model means to generate some sentences, choos-
ing each sentence according to its likelihood as defined by the model.
• Smoothing algorithms provide a way to estimate probabilities for events that
were unseen in training. Commonly used smoothing algorithms for n-grams
include add-1 smoothing, or rely on lower-order n-gram counts through inter-
polation.
Tag      Description                                                        Example
Open Class
ADJ      Adjective: noun modifiers describing properties                    red, young, awesome
ADV      Adverb: verb modifiers of time, place, manner                      very, slowly, home, yesterday
NOUN     Words for persons, places, things, etc.                            algorithm, cat, mango, beauty
VERB     Words for actions and processes                                    draw, provide, go
PROPN    Proper noun: name of a person, organization, place, etc.           Regina, IBM, Colorado
INTJ     Interjection: exclamation, greeting, yes/no response, etc.         oh, um, yes, hello
Closed Class Words
ADP      Adposition (preposition/postposition): marks a noun's spatial,     in, on, by, under
         temporal, or other relation
AUX      Auxiliary: helping verb marking tense, aspect, mood, etc.          can, may, should, are
CCONJ    Coordinating conjunction: joins two phrases/clauses                and, or, but
DET      Determiner: marks noun phrase properties                           a, an, the, this
NUM      Numeral                                                            one, two, 2026, 11:00, hundred
PART     Particle: a function word that must be associated with             's, not, (infinitive) to
         another word
PRON     Pronoun: a shorthand for referring to an entity or event           she, who, I, others
SCONJ    Subordinating conjunction: joins a main clause with a              whether, because
         subordinate clause such as a sentential complement
Other
PUNCT    Punctuation                                                        . , ( )
Parts of speech fall into two broad categories: closed class and open class. Closed classes are those with relatively fixed membership, such as prepositions—new prepositions are rarely coined. By contrast, nouns and verbs are open classes—new nouns and verbs like iPhone or to fax are continually being created or borrowed. Closed class words are generally function words like of, it, and, or you, which tend to be very short, occur frequently, and often have structuring uses in grammar.
Four major open classes occur in the languages of the world: nouns (including
proper nouns), verbs, adjectives, and adverbs, as well as the smaller open class of
interjections. English has all five, although not every language does.
Nouns are words for people, places, or things, but include others as well. Common nouns include concrete terms like cat and mango, abstractions like algorithm and beauty, and verb-like terms like pacing as in His pacing to and fro became quite annoying. Nouns in English can occur with determiners (a goat, this bandwidth), take possessives (IBM's annual revenue), and may occur in the plural (goats, abaci). Many languages, including English, divide common nouns into count nouns and mass nouns. Count nouns can occur in the singular and plural (goat/goats, relationship/relationships) and can be counted (one goat, two goats). Mass nouns are used when something is conceptualized as a homogeneous group. So snow, salt, and communism are not counted (i.e., *two snows or *two communisms). Proper nouns, like Regina, Colorado, and IBM, are names of specific persons or entities.
Verbs refer to actions and processes, including main verbs like draw, provide, and go. English verbs have inflections (non-third-person-singular (eat), third-person-singular (eats), progressive (eating), past participle (eaten)). While many scholars believe that all human languages have the categories of noun and verb, others have argued that some languages, such as Riau Indonesian and Tongan, don't even make this distinction (Broschart 1997; Evans 2000; Gil 2000).
Adjectives often describe properties or qualities of nouns, like color (white, black), age (old, young), and value (good, bad), but there are languages without adjectives. In Korean, for example, the words corresponding to English adjectives act as a subclass of verbs, so what is in English an adjective "beautiful" acts in Korean like a verb meaning "to be beautiful".
Adverbs are a hodge-podge. All the italicized words in this example are adverbs:

Actually, I ran home extremely quickly yesterday

Adverbs generally modify something (often verbs, hence the name "adverb", but also other adverbs and entire verb phrases). Directional adverbs or locative adverbs (home, here, downhill) specify the direction or location of some action; degree adverbs (extremely, very, somewhat) specify the extent of some action, process, or property; manner adverbs (slowly, slinkily, delicately) describe the manner of some action or process; and temporal adverbs describe the time that some action or event took place (yesterday, Monday).
Interjections (oh, hey, alas, uh, um) are a smaller open class that also includes greetings (hello, goodbye) and question responses (yes, no, uh-huh).
English adpositions occur before nouns, hence are called prepositions. They can indicate spatial or temporal relations, whether literal (on it, before then, by the house) or metaphorical (on time, with gusto, beside herself), and relations like marking the agent in Hamlet was written by Shakespeare.
A particle resembles a preposition or an adverb and is used in combination with a verb. Particles often have extended meanings that aren't quite the same as the prepositions they resemble, as in the particle over in she turned the paper over. A verb and a particle acting as a single unit is called a phrasal verb. The meaning of phrasal verbs is often non-compositional—not predictable from the individual meanings of the verb and the particle. Thus, turn down means 'reject', rule out 'eliminate', and go on 'continue'.
Determiners like this and that (this chapter, that page) can mark the start of an English noun phrase. Articles like a, an, and the, are a type of determiner that mark discourse properties of the noun and are quite frequent; the is the most common word in written English, with a and an right behind.
Conjunctions join two phrases, clauses, or sentences. Coordinating conjunctions like and, or, and but join two elements of equal status. Subordinating conjunctions are used when one of the elements has some embedded status. For example, the subordinating conjunction that in "I thought that you might like some milk" links the main clause I thought with the subordinate clause you might like some milk. This clause is called subordinate because this entire clause is the "content" of the main verb thought. Subordinating conjunctions like that which link a verb to its argument in this way are also called complementizers.
Pronouns act as a shorthand for referring to an entity or event. Personal pronouns refer to persons or entities (you, she, I, it, me, etc.). Possessive pronouns are forms of personal pronouns that indicate either actual possession or more often just an abstract relation between the person and some object (my, your, his, her, its, one's, our, their). Wh-pronouns (what, who, whom, whoever) are used in certain question forms.
Below we show some examples with each word tagged according to both the UD
(in blue) and Penn (in red) tagsets. Notice that the Penn tagset distinguishes tense
and participles on verbs, and has a special tag for the existential there construction in
English. Note that since London Journal of Medicine is a proper noun, both tagsets
mark its component nouns as PROPN/NNP, including journal and medicine, which
might otherwise be labeled as common nouns (NOUN/NN).
(17.1) There/PRON/EX are/VERB/VBP 70/NUM/CD children/NOUN/NNS
there/ADV/RB ./PUNCT/.
(17.2) Preliminary/ADJ/JJ findings/NOUN/NNS were/AUX/VBD
reported/VERB/VBN in/ADP/IN today/NOUN/NN ’s/PART/POS
London/PROPN/NNP Journal/PROPN/NNP of/ADP/IN Medicine/PROPN/NNP
Figure 17.3 The task of part-of-speech tagging: mapping from input words x_1, x_2, ..., x_n to output POS tags y_1, y_2, ..., y_n.
thought that your flight was earlier). The goal of POS-tagging is to resolve these ambiguities, choosing the proper tag for the context.
The accuracy of part-of-speech tagging algorithms (the percentage of test set
tags that match human gold labels) is extremely high. One study found accuracies
over 97% across 15 languages from the Universal Dependency (UD) treebank (Wu
and Dredze, 2019). Accuracies on various English treebanks are also 97% (no matter
the algorithm; HMMs, CRFs, BERT perform similarly). This 97% number is also
about the human performance on this task, at least for English (Manning, 2011).
We’ll introduce algorithms for the task in the next few sections, but first let’s
explore the task. Exactly how hard is it? Fig. 17.4 shows that most word types
(85-86%) are unambiguous (Janet is always NNP, hesitantly is always RB). But the
ambiguous words, though accounting for only 14-15% of the vocabulary, are very
common, and 55-67% of word tokens in running text are ambiguous. Particularly
ambiguous common words include that, back, down, put and set; here are some
examples of the 6 different parts of speech for the word back:
earnings growth took a back/JJ seat
a small building in the back/NN
a clear majority of senators back/VBP the bill
Dave began to back/VB toward the door
enable the country to buy back/RP debt
I was twenty-one back/RB then
Nonetheless, many words are easy to disambiguate, because their different tags
aren’t equally likely. For example, a can be a determiner or the letter a, but the
determiner sense is much more likely.
This idea suggests a useful baseline: given an ambiguous word, choose the tag
which is most frequent in the training corpus. This is a key concept:
Most Frequent Class Baseline: Always compare a classifier against a baseline at
least as good as the most frequent class baseline (assigning each token to the class
it occurred in most often in the training set).
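Here is a minimal sketch of that baseline (the toy training data and the function names are our own, not from a real treebank):

```python
from collections import Counter, defaultdict

def train_most_frequent_tag(tagged_sentences):
    """Remember, for each word, the tag it received most often in training."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def tag_most_frequent(words, word_to_tag, default="NOUN"):
    """Assign each token its most frequent training tag; unseen words get a default."""
    return [(w, word_to_tag.get(w, default)) for w in words]

# Tiny illustrative training data.
train = [[("a", "DET"), ("back", "NOUN")],
         [("the", "DET"), ("back", "NOUN")],
         [("senators", "NOUN"), ("back", "VERB"), ("the", "DET"), ("bill", "NOUN")]]
model = train_most_frequent_tag(train)
# "back" is tagged NOUN here, its most frequent tag in this toy training set.
print(tag_most_frequent(["senators", "back", "a", "bill"], model))
```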
Named entity tagging is a useful first step in lots of natural language processing
tasks. In sentiment analysis we might want to know a consumer’s sentiment toward a
particular entity. Entities are a useful first stage in question answering, or for linking
text to information in structured knowledge sources like Wikipedia. And named
entity tagging is also central to tasks involving building semantic representations,
like extracting events and the relationship between participants.
1 In English, on the WSJ corpus, tested on sections 22-24.
[PER Washington] was born into slavery on the farm of James Burroughs.
[ORG Washington] went up 2 games to 1 in the four-game series.
Blair arrived in [LOC Washington] for what may well be his last state visit.
In June, [GPE Washington] passed a primary seatbelt law.
Figure 17.6 Examples of type ambiguities in the use of the name Washington.
We’ve also shown two variant tagging schemes: IO tagging, which loses some
information by eliminating the B tag, and BIOES tagging, which adds an end tag
E for the end of a span, and a span tag S for a span consisting of only one word.
A sequence labeler (HMM, CRF, RNN, Transformer, etc.) is trained to label each
token in a text with tags that indicate the presence (or absence) of particular kinds
of named entities.
Figure 17.8 A Markov chain for weather (a) and one for words (b), showing states and transitions. A start distribution π is required; setting π = [0.1, 0.7, 0.2] for (a) would mean a probability 0.7 of starting in state 2 (cold), probability 0.1 of starting in state 1 (hot), etc.
The A matrix contains the tag transition probabilities P(t_i \mid t_{i-1}), the probability of a tag given the previous tag, which we estimate by counting, in a labeled training corpus, how often the first tag is followed by the second and normalizing:

P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})}   (17.8)
In the WSJ corpus, for example, MD occurs 13124 times, of which it is followed by VB 10471 times, for an MLE estimate of

P(\text{VB} \mid \text{MD}) = \frac{C(\text{MD}, \text{VB})}{C(\text{MD})} = \frac{10471}{13124} = .80   (17.9)
The B emission probabilities, P(wi |ti ), represent the probability, given a tag (say
MD), that it will be associated with a given word (say will). The MLE of the emis-
sion probability is
P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}   (17.10)
Of the 13124 occurrences of MD in the WSJ corpus, it is associated with will 4046 times:

P(\text{will} \mid \text{MD}) = \frac{C(\text{MD}, \text{will})}{C(\text{MD})} = \frac{4046}{13124} = .31   (17.11)
We saw this kind of Bayesian modeling in Chapter 4; recall that this likelihood
term is not asking “which is the most likely tag for the word will?” That would be
the posterior P(MD|will). Instead, P(will|MD) answers the slightly counterintuitive
question “If we were going to generate a MD, how likely is it that this modal would
be will?”
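A minimal sketch of estimating the A and B probabilities by counting and normalizing, as in Eq. 17.8 and Eq. 17.10; the tiny corpus at the end is illustrative only, not the WSJ data, and the function names are our own.

```python
from collections import Counter

def estimate_hmm(tagged_sentences):
    """MLE transition P(t_i|t_{i-1}) (Eq. 17.8) and emission P(w_i|t_i)
    (Eq. 17.10) estimates, obtained by counting and normalizing."""
    trans, emit, tag_counts = Counter(), Counter(), Counter()
    for sentence in tagged_sentences:
        prev = "<s>"
        tag_counts["<s>"] += 1
        for word, tag in sentence:
            trans[(prev, tag)] += 1
            emit[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag
    A = {bigram: count / tag_counts[bigram[0]] for bigram, count in trans.items()}
    B = {pair: count / tag_counts[pair[0]] for pair, count in emit.items()}
    return A, B

# Tiny illustrative tagged corpus.
corpus = [[("Janet", "NNP"), ("will", "MD"), ("back", "VB"),
           ("the", "DT"), ("bill", "NN")]]
A, B = estimate_hmm(corpus)
print(A[("<s>", "NNP")], A[("NNP", "MD")], B[("MD", "will")])  # 1.0 1.0 1.0
```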
Figure 17.9 An illustration of the two parts of an HMM representation: the A transition probabilities used to compute the prior probability, and the B observation likelihoods that are associated with each state, one likelihood for each possible observation word.
For part-of-speech tagging, the goal of HMM decoding is to choose the tag
sequence t1 . . .tn that is most probable given the observation sequence of n words
w1 . . . wn :
\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n)   (17.12)
The way we’ll do this in the HMM is to use Bayes’ rule to instead compute:
HMM taggers make two further simplifying assumptions. The first (output in-
dependence, from Eq. 17.7) is that the probability of a word appearing depends only
on its own tag and is independent of neighboring words and tags:
P(w_1 \ldots w_n \mid t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)   (17.15)
The second assumption (the Markov assumption, Eq. 17.6) is that the probability of a tag is dependent only on the previous tag, rather than the entire tag sequence:

P(t_1 \ldots t_n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1})   (17.16)
Plugging the simplifying assumptions from Eq. 17.15 and Eq. 17.16 into Eq. 17.14
results in the following equation for the most probable tag sequence from a bigram
tagger:
\hat{t}_{1:n} = \operatorname*{argmax}_{t_1 \ldots t_n} P(t_1 \ldots t_n \mid w_1 \ldots w_n) \approx \operatorname*{argmax}_{t_1 \ldots t_n} \prod_{i=1}^{n} \overbrace{P(w_i \mid t_i)}^{\text{emission}}\; \overbrace{P(t_i \mid t_{i-1})}^{\text{transition}}   (17.17)
The two parts of Eq. 17.17 correspond neatly to the B emission probability and A
transition probability that we just defined above!
Figure 17.10 Viterbi algorithm for finding the optimal sequence of tags. Given an observation sequence and
an HMM λ = (A, B), the algorithm returns the state path through the HMM that assigns maximum likelihood
to the observation sequence.
We represent the most probable path by taking the maximum over all possible previous state sequences \max_{q_1, \ldots, q_{t-1}}. Like other dynamic programming algorithms,
Viterbi fills each cell recursively. Given that we had already computed the probabil-
ity of being in every state at time t − 1, we compute the Viterbi probability by taking
the most probable of the extensions of the paths that lead to the current cell. For a
given state q j at time t, the value vt ( j) is computed as
v_t(j) = \max_{i=1}^{N} v_{t-1}(i)\, a_{ij}\, b_j(o_t)   (17.19)
The three factors that are multiplied in Eq. 17.19 for extending the previous paths to
compute the Viterbi probability at time t are
v_{t-1}(i)   the previous Viterbi path probability from the previous time step
a_{ij}       the transition probability from previous state q_i to current state q_j
b_j(o_t)     the state observation likelihood of the observation symbol o_t given the current state j
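Here is a compact sketch of the Viterbi recurrence of Eq. 17.19 with backpointers, following the algorithm of Fig. 17.10; the toy HMM at the bottom uses made-up probabilities, not the WSJ values of Fig. 17.12 and Fig. 17.13.

```python
def viterbi(observations, states, A, B, pi):
    """Viterbi decoding (Fig. 17.10): v[t][s] holds the probability of the
    best path that ends in state s after seeing observations[0..t]."""
    v = [{s: pi[s] * B.get((s, observations[0]), 0.0) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        v.append({})
        back.append({})
        for s in states:
            best_prev, best_score = max(
                ((p, v[t - 1][p] * A.get((p, s), 0.0)) for p in states),
                key=lambda pair: pair[1])
            v[t][s] = best_score * B.get((s, observations[t]), 0.0)
            back[t][s] = best_prev
    # Backtrace from the most probable final state.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

# Toy HMM with invented probabilities.
states = ["NNP", "MD", "VB"]
pi = {"NNP": 0.6, "MD": 0.2, "VB": 0.2}
A = {("NNP", "MD"): 0.5, ("NNP", "VB"): 0.1, ("MD", "VB"): 0.8}
B = {("NNP", "Janet"): 0.9, ("MD", "will"): 0.3, ("VB", "will"): 0.01,
     ("VB", "back"): 0.4}
print(viterbi(["Janet", "will", "back"], states, A, B, pi))  # ['NNP', 'MD', 'VB']
```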
        NNP     MD      VB      JJ      NN      RB      DT
<s>     0.2767  0.0006  0.0031  0.0453  0.0449  0.0510  0.2026
NNP     0.3777  0.0110  0.0009  0.0084  0.0584  0.0090  0.0025
MD      0.0008  0.0002  0.7968  0.0005  0.0008  0.1698  0.0041
VB      0.0322  0.0005  0.0050  0.0837  0.0615  0.0514  0.2231
JJ      0.0366  0.0004  0.0001  0.0733  0.4509  0.0036  0.0036
NN      0.0096  0.0176  0.0014  0.0086  0.1216  0.0177  0.0068
RB      0.0068  0.0102  0.1011  0.1012  0.0120  0.0728  0.0479
DT      0.1147  0.0021  0.0002  0.2157  0.4744  0.0102  0.0017
Figure 17.12 The A transition probabilities P(t_i|t_{i-1}) computed from the WSJ corpus without smoothing. Rows are labeled with the conditioning event; thus P(VB|MD) is 0.7968. <s> is the start token.
Let the HMM be defined by the two tables in Fig. 17.12 and Fig. 17.13. Fig-
ure 17.12 lists the ai j probabilities for transitioning between the hidden states (part-
of-speech tags). Figure 17.13 expresses the bi (ot ) probabilities, the observation
likelihoods of words given tags. This table is (slightly simplified) from counts in the
WSJ corpus. So the word Janet only appears as an NNP, back has 4 possible parts
of speech, and the word the can appear as a determiner or as an NNP (in titles like
“Somewhere Over the Rainbow” all words are tagged as NNP).
Figure 17.14 The first few entries in the individual state columns for the Viterbi algorithm. Each cell keeps the probability of the best path so far and a pointer to the previous cell along that path. We have only filled out columns 1 and 2; to avoid clutter most cells with value 0 are left empty. The rest is left as an exercise for the reader. After the cells are filled in, backtracing from the end state, we should be able to reconstruct the correct state sequence NNP MD VB DT NN.
Figure 17.14 shows a fleshed-out version of the sketch we saw in Fig. 17.11,
the Viterbi lattice for computing the best hidden state sequence for the observation
sequence Janet will back the bill.
There are T = 5 columns, one for each observation (word), each containing N = 7 possible states. We begin in column 1 (for the word Janet) by
setting the Viterbi value in each cell to the product of the π transition probability (the
start probability for that state i, which we get from the <s> entry of Fig. 17.12), and
the observation likelihood of the word Janet given the tag for that cell. Most of the
cells in the column are zero since the word Janet cannot be any of those tags. The
reader should find this in Fig. 17.14.
Next, each cell in the will column gets updated. For each state, we compute the
value viterbi[s,t] by taking the maximum over the extensions of all the paths from
the previous column that lead to the current cell according to Eq. 17.19. We have
shown the values for the MD, VB, and NN cells. Each cell gets the max of the 7 val-
ues from the previous column, multiplied by the appropriate transition probability;
as it happens in this case, most of them are zero from the previous column. The re-
maining value is multiplied by the relevant observation probability, and the (trivial)
max is taken. In this case the final value, 2.772e-8, comes from the NNP state at the
previous column. The reader should fill in the rest of the lattice in Fig. 17.14 and
backtrace to see whether or not the Viterbi algorithm returns the gold state sequence
NNP MD VB DT NN.
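The recursion of Eq. 17.19 plus the backtrace can be written in a few lines. The following is a minimal sketch (not the book's pseudocode of Fig. 17.10), reusing the nested-dict A and B tables from the sketch above and the <s> row of A for the start probabilities; with unsmoothed zeros, ties at probability 0 are broken arbitrarily.

```python
def viterbi(words, tags, A, B, start="<s>"):
    """Return the most probable tag sequence for words under a bigram HMM (Eq. 17.19)."""
    V = [{t: A.get(start, {}).get(t, 0.0) * B.get(t, {}).get(words[0], 0.0) for t in tags}]
    back = [{}]
    for n in range(1, len(words)):
        V.append({})
        back.append({})
        for j in tags:
            # extend the best previous path into state j, then multiply in the observation likelihood
            best_i = max(tags, key=lambda i: V[n - 1][i] * A.get(i, {}).get(j, 0.0))
            V[n][j] = V[n - 1][best_i] * A.get(best_i, {}).get(j, 0.0) * B.get(j, {}).get(words[n], 0.0)
            back[n][j] = best_i
    last = max(tags, key=lambda t: V[-1][t])       # best final state
    path = [last]
    for n in range(len(words) - 1, 0, -1):         # follow the backpointers
        path.append(back[n][path[-1]])
    return list(reversed(path))
```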
In a CRF, by contrast, we compute the posterior p(Y|X) directly, training the CRF to discriminate among the possible tag sequences.
However, the CRF does not compute a probability for each tag at each time step. In-
stead, at each time step the CRF computes log-linear functions over a set of relevant
features, and these local features are aggregated and normalized to produce a global
probability for the whole sequence.
Let’s introduce the CRF more formally, again using X and Y as the input and
output sequences. A CRF is a log-linear model that assigns a probability to an
entire output (tag) sequence Y , out of all possible sequences Y, given the entire input
(word) sequence X. We can think of a CRF as like a giant sequential version of
the multinomial logistic regression algorithm we saw for text categorization. Recall
that we introduced the feature function f in regular multinomial logistic regression
for text categorization as a function of a tuple: the input text x and a single class y
(page 86). In a CRF, we’re dealing with a sequence, so the function F maps an entire
input sequence X and an entire output sequence Y to a feature vector. Let’s assume
we have K features, with a weight wk for each feature Fk :
p(Y|X) = exp( Σ_{k=1}^{K} w_k F_k(X,Y) ) / Σ_{Y′∈𝒴} exp( Σ_{k=1}^{K} w_k F_k(X,Y′) )    (17.23)
It’s common to also describe the same equation by pulling out the denominator into
a function Z(X):
p(Y|X) = (1/Z(X)) exp( Σ_{k=1}^{K} w_k F_k(X,Y) )    (17.24)

Z(X) = Σ_{Y′∈𝒴} exp( Σ_{k=1}^{K} w_k F_k(X,Y′) )    (17.25)
We’ll call these K functions Fk (X,Y ) global features, since each one is a property
of the entire input sequence X and output sequence Y . We compute them by decom-
posing into a sum of local features for each position i in Y :
F_k(X,Y) = Σ_{i=1}^{n} f_k(y_{i−1}, y_i, X, i)    (17.26)
Each of these local features f_k in a linear-chain CRF is allowed to make use of the current output token y_i, the previous output token y_{i−1}, the entire input string X (or any subpart of it), and the current position i. This constraint of depending only on the current and previous output tokens y_i and y_{i−1} is what characterizes a linear-chain CRF. As we will see, this limitation makes it possible to use versions of the efficient Viterbi and Forward-Backward algorithms from the HMM. A general CRF, by contrast, allows a feature to make use of any output token, and is thus necessary for tasks in which the decision depends on distant output tokens, like y_{i−4}. General CRFs require more complex inference, and are less commonly used for language processing.
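To show how the global features of Eq. 17.26 and the score inside the exponential of Eq. 17.24 fit together, here is a small sketch of ours with illustrative conventions: local features are functions f_k(y_prev, y, X, i) returning 0 or 1, and Y is assumed to begin with a dummy <s> tag at position 0.

```python
import math

def global_features(local_feats, X, Y):
    """Eq. 17.26: each global feature F_k is the sum of the local feature f_k over positions i."""
    F = [0.0] * len(local_feats)
    for i in range(1, len(Y)):                     # Y[0] is the dummy start tag
        for k, f_k in enumerate(local_feats):
            F[k] += f_k(Y[i - 1], Y[i], X, i)
    return F

def crf_unnormalized(w, local_feats, X, Y):
    """exp(sum_k w_k F_k(X, Y)); dividing by Z(X) over all tag sequences gives Eq. 17.24."""
    F = global_features(local_feats, X, Y)
    return math.exp(sum(w_k * F_k for w_k, F_k in zip(w, F)))
```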
Feature templates such as ⟨y_i, x_i⟩, ⟨y_i, y_{i−1}⟩, and ⟨y_i, x_{i−1}, x_{i+2}⟩ automatically populate the set of features from every instance in the training and test set. Thus for our example Janet/NNP will/MD back/VB the/DT bill/NN, when x_i is the word back, the following features would be generated and have the value 1 (we’ve assigned them arbitrary feature numbers):
f3743 : yi = VB and xi = back
f156 : yi = VB and yi−1 = MD
f99732 : yi = VB and xi−1 = will and xi+2 = bill
It’s also important to have features that help with unknown words. One of the
word shape most important is word shape features, which represent the abstract letter pattern
of the word by mapping lower-case letters to ‘x’, upper-case to ‘X’, numbers to
’d’, and retaining punctuation. Thus for example I.M.F. would map to X.X.X. and
DC10-30 would map to XXdd-dd. A second class of shorter word shape features is
also used. In these features consecutive character types are removed, so words in all
caps map to X, words with initial-caps map to Xx, DC10-30 would be mapped to
Xd-d but I.M.F would still map to X.X.X. Prefix and suffix features are also useful.
In summary, sample feature templates that help with unknown words include the prefixes and suffixes of x_i together with its word shape and short word shape. For example, the word well-dressed might generate the following non-zero valued feature values:
2 Because in HMMs all computation is based on the two probabilities P(tag|tag) and P(word|tag), if
we want to include some source of knowledge into the tagging process, we must find a way to encode
the knowledge into one of these two probabilities. Each time we add a feature we have to do a lot of
complicated conditioning which gets harder and harder as we have more and more such features.
prefix(xi ) = w
prefix(xi ) = we
suffix(xi ) = ed
suffix(xi ) = d
word-shape(xi ) = xxxx-xxxxxxx
short-word-shape(xi ) = x-x
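Word shape and short word shape can be computed with a few regular-expression substitutions; this is a minimal sketch of one possible implementation, not code from the text.

```python
import re

def word_shape(w):
    """Lower-case letters -> x, upper-case -> X, digits -> d; punctuation is kept.
    For example DC10-30 -> XXdd-dd and I.M.F. -> X.X.X."""
    w = re.sub(r"[0-9]", "d", w)
    w = re.sub(r"[A-Z]", "X", w)
    return re.sub(r"[a-z]", "x", w)

def short_word_shape(w):
    """Collapse runs of identical shape characters, so DC10-30 -> Xd-d and well-dressed -> x-x."""
    return re.sub(r"(.)\1+", r"\1", word_shape(w))

print(word_shape("well-dressed"), short_word_shape("well-dressed"))   # xxxx-xxxxxxx x-x
```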
The known-word templates are computed for every word seen in the training
set; the unknown word features can also be computed for all words in training, or
only on training words whose frequency is below some threshold. The result of the
known-word templates and word-signature features is a very large set of features.
Generally a feature cutoff is used in which features are thrown out if they have count
< 5 in the training set.
Remember that in a CRF we don’t learn weights for each of these local features
fk . Instead, we first sum the values of each local feature (for example feature f3743 )
over the entire sentence, to create each global feature (for example F3743 ). It is those
global features that will then be multiplied by weight w3743 . Thus for training and
inference there is always a fixed set of K features with K weights, even though the
length of each sentence is different.
gazetteer One feature that is especially useful for locations is a gazetteer, a list of place
names, often providing millions of entries for locations with detailed geographical
and political information.3 This can be implemented as a binary feature indicating a
phrase appears in the list. Other related resources like name-lists, for example from
the United States Census Bureau4 , can be used, as can other entity dictionaries like
lists of corporations or products, although they may not be as helpful as a gazetteer
(Mikheev et al., 1999).
The sample named entity token L’Occitane would generate the following non-
zero valued feature values (assuming that L’Occitane is neither in the gazetteer nor
the census).
3 www.geonames.org
4 www.census.gov
We can ignore the exp function and the denominator Z(X), as we do above, because
exp doesn’t change the argmax, and the denominator Z(X) is constant for a given
observation sequence X.
How should we decode to find this optimal tag sequence ŷ? Just as with HMMs,
we’ll turn to the Viterbi algorithm, which works because, like the HMM, the linear-
chain CRF depends at each timestep on only one previous output token yi−1 .
Concretely, this involves filling an N ×T array with the appropriate values, main-
taining backpointers as we proceed. As with HMM Viterbi, when the table is filled,
we simply follow pointers back from the maximum value in the final column to
retrieve the desired set of labels.
The requisite changes from HMM Viterbi have to do only with how we fill each
cell. Recall from Eq. 17.19 that the recursive step of the Viterbi equation computes
the Viterbi value of time t for state j as
v_t(j) = max_{i=1}^{N} v_{t−1}(i) a_{ij} b_j(o_t);   1 ≤ j ≤ N, 1 < t ≤ T    (17.31)
The CRF requires only a slight change to this latter formula, replacing the a and b
prior and likelihood probabilities with the CRF features:
v_t(j) = max_{i=1}^{N} [ v_{t−1}(i) + Σ_{k=1}^{K} w_k f_k(y_{t−1}, y_t, X, t) ]   1 ≤ j ≤ N, 1 < t ≤ T    (17.33)
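Decoding a linear-chain CRF with Viterbi thus only changes how each arc is scored: instead of multiplying the a and b probabilities, we add the weighted local features of Eq. 17.33, since the scores are log-linear. A minimal sketch of ours, assuming the same local-feature convention as above and a dummy <s> previous tag at position 0:

```python
def crf_viterbi(X, tags, w, local_feats):
    """Most probable tag sequence under a linear-chain CRF (Eq. 17.33)."""
    def arc(prev, cur, t):
        return sum(w_k * f(prev, cur, X, t) for w_k, f in zip(w, local_feats))
    V = [{y: arc("<s>", y, 0) for y in tags}]
    back = [{}]
    for t in range(1, len(X)):
        V.append({})
        back.append({})
        for y in tags:
            scores = {prev: V[t - 1][prev] + arc(prev, y, t) for prev in tags}
            best = max(scores, key=scores.get)
            V[t][y], back[t][y] = scores[best], best
    y = max(V[-1], key=V[-1].get)
    path = [y]
    for t in range(len(X) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))
```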
Since the algorithms we have presented are supervised, having labeled data is essential for training and testing. A
wide variety of datasets exist for part-of-speech tagging and/or NER. The Universal
Dependencies (UD) dataset (de Marneffe et al., 2021) has POS tagged corpora in
over a hundred languages, as do the Penn Treebanks in English, Chinese, and Arabic.
OntoNotes has corpora labeled for named entities in English, Chinese, and Arabic
(Hovy et al., 2006). Named entity tagged corpora are also available in particular
domains, such as for biomedical (Bada et al., 2012) and literary text (Bamman et al.,
2019).
Morphologically rich languages also need to label words with case and gender information. Tagsets for morpho-
logically rich languages are therefore sequences of morphological tags rather than a
single primitive tag. Here’s a Turkish example, in which the word izin has three pos-
sible morphological/part-of-speech tags and meanings (Hakkani-Tür et al., 2002):
1. Yerdeki izin temizlenmesi gerek. iz + Noun+A3sg+Pnon+Gen
The trace on the floor should be cleaned.
17.8 Summary
This chapter introduced parts of speech and named entities, and the tasks of part-
of-speech tagging and named entity recognition:
• Languages generally have a small set of closed class words that are highly
frequent, ambiguous, and act as function words, and open-class words like
nouns, verbs, adjectives. Various part-of-speech tagsets exist, of between 40
and 200 tags.
• Part-of-speech tagging is the process of assigning a part-of-speech label to
each of a sequence of words.
• Named entities are words for proper nouns referring mainly to people, places,
and organizations, but extended to many other types that aren’t strictly entities
or even proper nouns.
• Two common approaches to sequence modeling are a generative approach,
HMM tagging, and a discriminative approach, CRF tagging. We will see a
neural approach in following chapters.
• The probabilities in HMM taggers are estimated by maximum likelihood es-
timation on tag-labeled training corpora. The Viterbi algorithm is used for
decoding, finding the most likely tag sequence.
• Conditional Random Fields or CRF taggers train a log-linear model that can
choose the best tag sequence given an observation sequence, based on features
that condition on the output tag, the prior output tag, the entire input sequence,
and the current timestep. They use the Viterbi algorithm for inference, to
choose the best sequence of tags, and a version of the Forward-Backward
algorithm (see Appendix A) for training.
CHAPTER
18 Context-Free Grammars and Constituency Parsing
The study of grammar has an ancient pedigree. The grammar of Sanskrit was
described by the Indian grammarian Pāṇini sometime between the 7th and 4th centuries BCE, in his famous treatise the Aṣṭādhyāyī (‘8 books’). And our word syntax
comes from the Greek sýntaxis, meaning “setting out together or arrangement”, and
refers to the way words are arranged together. We have seen syntactic notions in pre-
vious chapters like the use of part-of-speech categories (Chapter 17). In this chapter
and the next one we introduce formal models for capturing more sophisticated no-
tions of grammatical structure and algorithms for parsing these structures.
Our focus in this chapter is context-free grammars and the CKY algorithm
for parsing them. Context-free grammars are the backbone of many formal mod-
els of the syntax of natural language (and, for that matter, of computer languages).
Syntactic parsing is the task of assigning a syntactic structure to a sentence. Parse
trees (whether for context-free grammars or for the dependency or CCG formalisms
we introduce in following chapters) can be used in applications such as grammar
checking: a sentence that cannot be parsed may have grammatical errors (or at least be hard to read). Parse trees can be an intermediate stage of representation for formal semantic analysis. And parsers and the grammatical structure they assign to a sentence are a useful text analysis tool for text data science applications that require
modeling the relationship of elements in sentences.
In this chapter we introduce context-free grammars, give a small sample gram-
mar of English, introduce more formal definitions of context-free grammars and
grammar normal form, and talk about treebanks: corpora that have been anno-
tated with syntactic structure. We then discuss parse ambiguity and the problems
it presents, and turn to parsing itself, giving the famous Cocke-Kasami-Younger
(CKY) algorithm (Kasami 1965, Younger 1967), the standard dynamic program-
ming approach to syntactic parsing. The CKY algorithm returns an efficient repre-
sentation of the set of parse trees for a sentence, but doesn’t tell us which parse tree
is the right one. For that, we need to augment CKY with scores for each possible
constituent. We’ll see how to do this with neural span-based parsers. Finally, we’ll
introduce the standard set of metrics for evaluating parser accuracy.
18.1 Constituency
Syntactic constituency is the idea that groups of words can behave as single units,
or constituents. Part of developing a grammar involves building an inventory of the
constituents in the language. How do words group together in English? Consider
noun phrase the noun phrase, a sequence of words surrounding at least one noun. Here are some
examples of noun phrases (thanks to Damon Runyon):
What evidence do we have that these words group together (or “form constituents”)?
One piece of evidence is that they can all appear in similar syntactic environments,
for example, before a verb.
But while the whole noun phrase can occur before a verb, this is not true of each
of the individual words that make up a noun phrase. The following are not grammat-
ical sentences of English (recall that we use an asterisk (*) to mark fragments that
are not grammatical English sentences):
Thus, to correctly describe facts about the ordering of these words in English, we
must be able to say things like “Noun Phrases can occur before verbs”. Let’s now
see how to do this in a more formal way!
18.2 Context-Free Grammars
For example, the following rules express that an NP (noun phrase) can be composed of either a determiner (Det) followed by a Nominal or a ProperNoun; a Nominal in turn can consist of one or more Nouns.1
NP → Det Nominal
NP → ProperNoun
Nominal → Noun | Nominal Noun
Context-free rules can be hierarchically embedded, so we can combine the previous
rules with others, like the following, that express facts about the lexicon:
Det → a
Det → the
Noun → flight
The symbols that are used in a CFG are divided into two classes. The symbols
terminal that correspond to words in the language (“the”, “nightclub”) are called terminal
symbols; the lexicon is the set of rules that introduce these terminal symbols. The
non-terminal symbols that express abstractions over these terminals are called non-terminals. In
each context-free rule, the item to the right of the arrow (→) is an ordered list of one
or more terminals and non-terminals; to the left of the arrow is a single non-terminal
symbol expressing some cluster or generalization. The non-terminal associated with
each word in the lexicon is its lexical category, or part of speech.
A CFG can be thought of in two ways: as a device for generating sentences
and as a device for assigning a structure to a given sentence. Viewing a CFG as a
generator, we can read the → arrow as “rewrite the symbol on the left with the string
of symbols on the right”.
So starting from the symbol: NP
we can use our first rule to rewrite NP as: Det Nominal
and then rewrite Nominal as: Noun
and finally rewrite these parts-of-speech as: a flight
We say the string a flight can be derived from the non-terminal NP. Thus, a CFG
can be used to generate a set of strings. This sequence of rule expansions is called a
derivation derivation of the string of words. It is common to represent a derivation by a parse
parse tree tree (commonly shown inverted with the root at the top). Figure 18.1 shows the tree
representation of this derivation.
[Figure 18.1: the parse tree for this derivation — in bracketed form, [NP [Det a] [Nom [Noun flight]]].]
dominates In the parse tree shown in Fig. 18.1, we can say that the node NP dominates
all the nodes in the tree (Det, Nom, Noun, a, flight). We can say further that it
immediately dominates the nodes Det and Nom.
The formal language defined by a CFG is the set of strings that are derivable
start symbol from the designated start symbol. Each grammar must have one designated start
1 When talking about these rules we can pronounce the rightarrow → as “goes to”, and so we might
read the first rule above as “NP goes to Det Nominal”.
symbol, which is often called S. Since context-free grammars are often used to define
sentences, S is usually interpreted as the “sentence” node, and the set of strings that
are derivable from S is the set of sentences in some simplified version of English.
Let’s add a few additional rules to our inventory. The following rule expresses the fact that a sentence can consist of a noun phrase followed by a verb phrase:
S → NP VP
Or the verb phrase may have a verb followed by a prepositional phrase alone:
VP → Verb PP
The NP inside a PP need not be a location; PPs are often used with times and
dates, and with other nouns as well; they can be arbitrarily complex. Here are ten
examples from the ATIS corpus:
to Seattle
in Minneapolis
on Wednesday
in the evening
on the ninth of July
on these flights
about the ground transportation in Chicago
of the round trip flight on United Airlines
of the AP fifty seven flight
with a stopover in Nashville
Figure 18.2 gives a sample lexicon, and Fig. 18.3 summarizes the grammar rules
we’ve seen so far, which we’ll call L0 . Note that we can use the or-symbol | to
indicate that a non-terminal has alternate possible expansions.
NP → Pronoun I
| Proper-Noun Los Angeles
| Det Nominal a + flight
Nominal → Nominal Noun morning + flight
| Noun flights
VP → Verb do
| Verb NP want + a flight
| Verb NP PP leave + Boston + in the morning
| Verb PP leaving + on Thursday
Figure 18.4 The parse tree for “I prefer a morning flight” according to grammar L0 .
We can use the L0 grammar as a generator: starting from S, we choose a random expansion of S (say, to NP VP), then a random expansion of NP (say, to Pronoun, and then to I), and a random expansion of VP (let’s say, to Verb NP), and so on until we generate the string I prefer a morning flight. Figure 18.4 shows a parse tree that represents a complete derivation of I prefer a morning flight.
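Viewing the CFG as a generator is easy to simulate. The sketch below is ours, built from rules and words already introduced for L0: it stores each non-terminal's possible expansions in a dict and rewrites symbols recursively, treating any symbol without an entry as a terminal word.

```python
import random

# A handful of L0 rules: each non-terminal maps to its possible right-hand sides.
grammar = {
    "S":       [["NP", "VP"]],
    "NP":      [["Pronoun"], ["Det", "Nominal"]],
    "VP":      [["Verb", "NP"]],
    "Nominal": [["Noun"], ["Nominal", "Noun"]],
    "Pronoun": [["I"]],
    "Verb":    [["prefer"]],
    "Det":     [["a"]],
    "Noun":    [["morning"], ["flight"]],
}

def generate(symbol):
    """Rewrite the symbol on the left with a randomly chosen string of symbols on the right."""
    if symbol not in grammar:                       # a terminal: an actual word
        return [symbol]
    return [word for s in random.choice(grammar[symbol]) for word in generate(s)]

print(" ".join(generate("S")))                      # e.g. "I prefer a morning flight"
```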
We can also represent a parse tree in a more compact format called bracketed notation; here is the bracketed representation of the parse tree of Fig. 18.4:
(18.1) [S [NP [Pro I]] [VP [V prefer] [NP [Det a] [Nom [N morning] [Nom [N flight]]]]]]
A CFG like that of L0 defines a formal language. Sentences (strings of words)
that can be derived by a grammar are in the formal language defined by that gram-
grammatical mar, and are called grammatical sentences. Sentences that cannot be derived by
a given formal grammar are not in the language defined by that grammar and are
ungrammatical referred to as ungrammatical. This hard line between “in” and “out” characterizes
all formal languages but is only a very simplified model of how natural languages
really work. This is because determining whether a given sentence is part of a given
natural language (say, English) often depends on the context. In linguistics, the use
generative
grammar of formal languages to model natural languages is called generative grammar since
the language is defined by the set of possible sentences “generated” by the grammar.
(Note that this is a different sense of the word ‘generate’ than when we talk about
For the remainder of the book we adhere to the following conventions when dis-
cussing the formal properties of context-free grammars (as opposed to explaining
particular facts about English or other languages).
Capital letters like A, B, and S Non-terminals
S The start symbol
Lower-case Greek letters like α, β , and γ Strings drawn from (Σ ∪ N)∗
Lower-case Roman letters like u, v, and w Strings of terminals
A language is defined through the concept of derivation. One string derives an-
other one if it can be rewritten as the second one by some series of rule applications.
More formally, following Hopcroft and Ullman (1979),
if A → β is a production of R and α and γ are any strings in the set
directly derives (Σ ∪ N)∗ , then we say that αAγ directly derives αβ γ, or αAγ ⇒ αβ γ.
Derivation is then a generalization of direct derivation:
Let α1, α2, ..., αm be strings in (Σ ∪ N)*, m ≥ 1, such that
α1 ⇒ α2, α2 ⇒ α3, ..., αm−1 ⇒ αm
We say that α1 derives αm, or α1 ⇒* αm.
We can then formally define the language LG generated by a grammar G as the set of strings composed of terminal symbols that can be derived from the designated start symbol S.
LG = {w | w is in Σ* and S ⇒* w}
The problem of mapping from a string of words to its parse tree is called syn-
syntactic
parsing tactic parsing, as we’ll see in Section 18.6.
18.3 Treebanks
treebank A corpus in which every sentence is annotated with a parse tree is called a treebank.
((S
  (NP-SBJ (DT That)
          (JJ cold) (, ,)
          (JJ empty) (NN sky) )
  (VP (VBD was)
      (ADJP-PRD (JJ full)
                (PP (IN of)
                    (NP (NN fire)
                        (CC and)
                        (NN light) ))))
  (. .) ))
(a)

((S
  (NP-SBJ The/DT flight/NN )
  (VP should/MD
      (VP arrive/VB
          (PP-TMP at/IN
                  (NP eleven/CD a.m/RB ))
          (NP-TMP tomorrow/NN )))))
(b)
Figure 18.5 Parses from the LDC Treebank3 for (a) Brown and (b) ATIS sentences.
Grammar
S → NP VP .
S → NP VP
NP → DT NN
NP → NN CC NN
NP → DT JJ , JJ NN
NP → NN
VP → MD VP
VP → VBD ADJP
VP → VB PP NP
ADJP → JJ PP
PP → IN NP
PP → IN NP RB

Lexicon
DT → the | that
JJ → cold | empty | full
NN → sky | fire | light | flight | tomorrow
CC → and
IN → of | at
CD → eleven
RB → a.m.
VB → arrive
VBD → was | said
MD → should | would
Figure 18.7 CFG grammar rules and lexicon from the treebank sentences in Fig. 18.5.
Among the approximately 4,500 different rules for expanding VPs in the Penn Treebank are separate rules
for PP sequences of any length and every possible arrangement of verb arguments:
VP → VBD PP
VP → VBD PP PP
VP → VBD PP PP PP
VP → VBD PP PP PP PP
VP → VB ADVP PP
VP → VB PP ADVP
VP → ADVP VB PP
A → B C D
can be converted into the following two CNF rules (Exercise 18.1 asks the reader to formulate the complete algorithm):
Grammar
S → NP VP
S → Aux NP VP
S → VP
NP → Pronoun
NP → Proper-Noun
NP → Det Nominal
Nominal → Noun
Nominal → Nominal Noun
Nominal → Nominal PP
VP → Verb
VP → Verb NP
VP → Verb NP PP
VP → Verb PP
VP → VP PP
PP → Preposition NP

Lexicon
Det → that | this | the | a
Noun → book | flight | meal | money
Verb → book | include | prefer
Pronoun → I | she | me
Proper-Noun → Houston | NWA
Aux → does
Preposition → from | to | on | near | through
Figure 18.8 The L1 miniature English grammar and lexicon.
A → B X
X → C D
Sometimes using binary branching can actually produce smaller grammars. For
example, the sentences that might be characterized as
VP -> VBD NP PP*
are represented in the Penn Treebank by this series of rules:
VP → VBD NP PP
VP → VBD NP PP PP
VP → VBD NP PP PP PP
VP → VBD NP PP PP PP PP
...
but could also be generated by the following two-rule grammar:
VP → VBD NP PP
VP → VP PP
The generation of a symbol A with a potentially infinite sequence of symbols B with
Chomsky-
adjunction a rule of the form A → A B is known as Chomsky-adjunction.
18.5 Ambiguity
Ambiguity is the most serious problem faced by syntactic parsers. Chapter 17 intro-
duced the notions of part-of-speech ambiguity and part-of-speech disambigua-
structural
ambiguity tion. Here, we introduce a new kind of ambiguity, called structural ambiguity,
illustrated with a new toy grammar L1 , shown in Figure 18.8, which adds a few
rules to the L0 grammar.
Structural ambiguity occurs when the grammar can assign more than one parse to a sentence. Groucho Marx’s well-known line as Captain Spaulding in Animal Crackers is ambiguous because the phrase in my pajamas can be part of the NP headed by elephant or part of the verb phrase headed by shot; Figure 18.9 shows the two parses.
Figure 18.9 Two parse trees for an ambiguous sentence. The parse on the left corresponds to the humorous
reading in which the elephant is in the pajamas, the parse on the right corresponds to the reading in which
Captain Spaulding did the shooting in his pajamas.
will deliver tomorrow night to the American people could be an adjunct modifying
the verb pushed. A PP like over nationwide television and radio could be attached
to any of the higher VPs or NPs (e.g., it could modify people or night).
The fact that there are many grammatically correct but semantically unreason-
able parses for naturally occurring sentences is an irksome problem that affects all
parsers. Fortunately, the CKY algorithm below is designed to efficiently handle
structural ambiguities. And as we’ll see in the following section, we can augment
CKY with neural methods to choose a single correct parse by syntactic disambigua-
syntactic
disambiguation tion.
If A ⇒* B by a chain of one or more unit productions and B → γ is a non-unit production in our grammar, then we add A → γ for each such rule in
the grammar and discard all the intervening unit productions. As we demonstrate
with our toy grammar, this can lead to a substantial flattening of the grammar and a
consequent promotion of terminals to fairly high levels in the resulting trees.
Rules with right-hand sides longer than 2 are normalized through the introduc-
tion of new non-terminals that spread the longer sequences over several new rules.
Formally, if we have a rule like
A → BCγ
we replace the leftmost pair of non-terminals with a new non-terminal and introduce
a new production, resulting in the following new rules:
A → X1 γ
X1 → B C
In the case of longer right-hand sides, we simply iterate this process until the of-
fending rule has been replaced by rules of length 2. The choice of replacing the
leftmost pair of non-terminals is purely arbitrary; any systematic scheme that results
in binary rules would suffice.
In our current grammar, the rule S → Aux NP VP would be replaced by the two
rules S → X1 VP and X1 → Aux NP.
The entire conversion process can be summarized as follows:
1. Copy all conforming rules to the new grammar unchanged.
2. Convert terminals within rules to dummy non-terminals.
3. Convert unit productions.
4. Make all rules binary and add them to new grammar.
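Step 4, binarization, is mechanical enough to sketch in code. The function below is our illustration, not the book's: it repeatedly replaces the leftmost pair of symbols in any right-hand side longer than two with a fresh non-terminal, just as S → Aux NP VP becomes S → X1 VP and X1 → Aux NP.

```python
def binarize(rules):
    """Convert rules with right-hand sides longer than 2 into binary rules by
    introducing new non-terminals X1, X2, ... for the leftmost pair."""
    new_rules, counter = [], 0
    for lhs, rhs in rules:
        rhs = list(rhs)
        while len(rhs) > 2:
            counter += 1
            new_nt = "X%d" % counter
            new_rules.append((new_nt, rhs[:2]))      # X_i -> leftmost pair
            rhs = [new_nt] + rhs[2:]                 # A -> X_i gamma
        new_rules.append((lhs, rhs))
    return new_rules

print(binarize([("S", ["Aux", "NP", "VP"])]))
# [('X1', ['Aux', 'NP']), ('S', ['X1', 'VP'])]
```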
Figure 18.10 shows the results of applying this entire conversion procedure to
the L1 grammar introduced earlier on page 395. Note that this figure doesn’t show
the original lexical rules; since these original lexical rules are already in CNF, they
all carry over unchanged to the new grammar. Figure 18.10 does, however, show
the various places where the process of eliminating unit productions has, in effect,
created new lexical rules. For example, all the original verbs have been promoted to
both VPs and to Ss in the converted grammar.
L1 Grammar L1 in CNF
S → NP VP S → NP VP
S → Aux NP VP S → X1 VP
X1 → Aux NP
S → VP S → book | include | prefer
S → Verb NP
S → X2 PP
S → Verb PP
S → VP PP
NP → Pronoun NP → I | she | me
NP → Proper-Noun NP → TWA | Houston
NP → Det Nominal NP → Det Nominal
Nominal → Noun Nominal → book | flight | meal | money
Nominal → Nominal Noun Nominal → Nominal Noun
Nominal → Nominal PP Nominal → Nominal PP
VP → Verb VP → book | include | prefer
VP → Verb NP VP → Verb NP
VP → Verb NP PP VP → X2 PP
X2 → Verb NP
VP → Verb PP VP → Verb PP
VP → VP PP VP → VP PP
PP → Preposition NP PP → Preposition NP
Figure 18.10 L1 Grammar and its conversion to CNF. Note that although they aren’t shown
here, all the original lexical entries from L1 carry over unchanged as well.
For a constituent spanning [i, j] that is split at a position k, the first constituent [i, k] must lie to the left of entry [i, j] somewhere
along row i, and the second entry [k, j] must lie beneath it, along column j.
To make this more concrete, consider the following example with its completed
parse matrix, shown in Fig. 18.11.
(18.4) Book the flight through Houston.
The superdiagonal row in the matrix contains the parts of speech for each word in
the input. The subsequent diagonals above that superdiagonal contain constituents
that cover all the spans of increasing length in the input.
Given this setup, CKY recognition consists of filling the parse table in the right
way. To do this, we’ll proceed in a bottom-up fashion so that at the point where we
are filling any cell [i, j], the cells containing the parts that could contribute to this
entry (i.e., the cells to the left and the cells below) have already been filled. The
algorithm given in Fig. 18.12 fills the upper-triangular matrix a column at a time
working from left to right, with each column filled from bottom to top, as the right
side of Fig. 18.11 illustrates. This scheme guarantees that at each point in time we
have all the information we need (to the left, since all the columns to the left have
already been filled, and below since we’re filling bottom to top). It also mirrors on-
line processing, since filling the columns from left to right corresponds to processing
each word one at a time.
The outermost loop of the algorithm given in Fig. 18.12 iterates over the columns,
and the second loop iterates over the rows, from the bottom up. The purpose of the
innermost loop is to range over all the places where a substring spanning i to j in
the input might be split in two. As k ranges over the places where the string can be
split, the pairs of cells we consider move, in lockstep, to the right along row i and
down along column j. Figure 18.13 illustrates the general case of filling cell [i, j].
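The fill order just described translates directly into nested loops. Here is a minimal CKY recognizer (a sketch of ours, not the pseudocode of Fig. 18.12), assuming a CNF grammar given as a dict from words to their possible non-terminals and a dict from pairs (B, C) to the non-terminals A with a rule A → B C.

```python
from collections import defaultdict

def cky_recognize(words, lexical, binary, start="S"):
    """Return True if words can be derived from the start symbol under the CNF grammar."""
    n = len(words)
    table = defaultdict(set)                         # table[(i, j)] = non-terminals spanning words i..j
    for j in range(1, n + 1):                        # columns, filled left to right
        table[(j - 1, j)] |= lexical.get(words[j - 1], set())
        for i in range(j - 2, -1, -1):               # rows, filled bottom to top
            for k in range(i + 1, j):                # all ways to split the span [i, j]
                for B in table[(i, k)]:
                    for C in table[(k, j)]:
                        table[(i, j)] |= binary.get((B, C), set())
    return start in table[(0, n)]
```

For instance, with the L1-in-CNF rules of Fig. 18.10 loaded into these two dicts, the lower-cased words of (18.4) would be recognized as an S.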
CHAPTER
19 Dependency Parsing
Tout mot qui fait partie d’une phrase... Entre lui et ses voisins, l’esprit aperçoit
des connexions, dont l’ensemble forme la charpente de la phrase.
[Between each word in a sentence and its neighbors, the mind perceives con-
nections. These connections together form the scaffolding of the sentence.]
Lucien Tesnière. 1959. Éléments de syntaxe structurale, A.1.§4
The focus of the last chapter was on context-free grammars and constituent-
based representations. Here we present another important family of grammar for-
dependency
grammars malisms called dependency grammars. In dependency formalisms, phrasal con-
stituents and phrase-structure rules do not play a direct role. Instead, the syntactic
structure of a sentence is described solely in terms of directed binary grammatical
relations between the words, as in the following dependency parse:
(19.1) I prefer the morning flight through Denver
(with arcs labeled root, nsubj, obj, det, nmod, compound, and case)
Relations among the words are illustrated above the sentence with directed, labeled
typed
dependency arcs from heads to dependents. We call this a typed dependency structure because
the labels are drawn from a fixed inventory of grammatical relations. A root node
explicitly marks the root of the tree, the head of the entire structure.
Figure 19.1 on the next page shows the dependency analysis from (19.1) but vi-
sualized as a tree, alongside its corresponding phrase-structure analysis of the kind
given in the prior chapter. Note the absence of nodes corresponding to phrasal con-
stituents or lexical categories in the dependency parse; the internal structure of the
dependency parse consists solely of directed relations between words. These head-
dependent relationships directly encode important information that is often buried in
the more complex phrase-structure parses. For example, the arguments to the verb
prefer are directly linked to it in the dependency structure, while their connection
to the main verb is more distant in the phrase-structure tree. Similarly, morning
and Denver, modifiers of flight, are linked to it directly in the dependency structure.
This fact that the head-dependent relations are a good proxy for the semantic rela-
tionship between predicates and their arguments is an important reason why depen-
dency grammars are currently more common than constituency grammars in natural
language processing.
Another major advantage of dependency grammars is their ability to deal with
free word order languages that have a relatively free word order. For example, word order in Czech
can be much more flexible than in English; a grammatical object might occur before
or after a location adverbial. A phrase-structure grammar would need a separate rule
Figure 19.1 Dependency and constituent analyses for I prefer the morning flight through Denver.
for each possible place in the parse tree where such an adverbial phrase could occur.
A dependency-based approach can have just one link type representing this particu-
lar adverbial relation; dependency grammar approaches can thus abstract away a bit
more from word order information.
In the following sections, we’ll give an inventory of relations used in dependency
parsing, discuss two families of parsing algorithms (transition-based, and graph-
based), and discuss evaluation.
Here the clausal relations NSUBJ and OBJ identify the subject and direct object of
the predicate cancel, while the NMOD, DET, and CASE relations denote modifiers of
the nouns flights and Houston.
19.1.2 Projectivity
The notion of projectivity imposes an additional constraint that is derived from the
order of the words in the input. An arc from a head to a dependent is said to be
projective projective if there is a path from the head to every word that lies between the head
and the dependent in the sentence. A dependency tree is then said to be projective if
all the arcs that make it up are projective. All the dependency trees we’ve seen thus
far have been projective. There are, however, many valid constructions which lead
to non-projective trees, particularly in languages with relatively flexible word order.
Consider the following example.
(19.3) JetBlue canceled our flight this morning which was already late
(with arcs labeled root, nsubj, obj, det, obl, acl:relcl, cop, and adv)
In this example, the arc from flight to its modifier late is non-projective since there
is no path from flight to the intervening words this and morning. As we can see from
this diagram, projectivity (and non-projectivity) can be detected in the way we’ve
been drawing our trees. A dependency tree is projective if it can be drawn with
no crossing edges. Here there is no way to link flight to its dependent late without
crossing the arc that links morning to its head.
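Checking projectivity is straightforward given the head index of every word: an arc is projective exactly when its head dominates every word lying between the head and the dependent. A small sketch of ours, assuming heads[i] gives the head of word i, with index 0 standing for the root and the indices forming a well-formed tree:

```python
def is_projective(heads):
    """heads[d] is the index of word d's head; index 0 is the root. Words are 1..n."""
    def dominates(h, d):
        while d != 0:
            d = heads[d]
            if d == h:
                return True
        return False
    for d in range(1, len(heads)):
        h = heads[d]
        lo, hi = min(h, d), max(h, d)
        if any(not dominates(h, k) for k in range(lo + 1, hi)):
            return False                 # some word between h and d is not reachable from h
    return True
```

Applied to the head indices of (19.3), this returns False, because of the arc from flight to late.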
Our concern with projectivity arises from two related issues. First, the most
widely used English dependency treebanks were automatically derived from phrase-
structure treebanks through the use of head-finding rules. The trees generated in such
a fashion will always be projective, and hence will be incorrect when non-projective
examples like this one are encountered.
Second, there are computational limitations to the most widely used families of
parsing algorithms. The transition-based approaches discussed in Section 19.2 can
only produce projective trees, hence any sentences with non-projective structures
will necessarily contain some errors. This limitation is one of the motivations for
the more flexible graph-based parsing approach described in Section 19.3.
transition-based Our first approach to dependency parsing is called transition-based parsing. This
architecture draws on shift-reduce parsing, a paradigm originally developed for
analyzing programming languages (Aho and Ullman, 1972). In transition-based
parsing we’ll have a stack on which we build the parse, a buffer of tokens to be
parsed, and a parser which takes actions on the parse via a predictor called an oracle,
as illustrated in Fig. 19.4.
Figure 19.4 Basic transition-based parser. The parser examines the top two elements of the
stack and selects an action by consulting an oracle that examines the current configuration.
The parser walks through the sentence left-to-right, successively shifting items
from the buffer onto the stack. At each time point we examine the top two elements
on the stack, and the oracle makes a decision about what transition to apply to build
the parse. The possible transitions correspond to the intuitive actions one might take
in creating a dependency tree by examining the words in a single pass over the input
from left to right (Covington, 2001):
• Assign the current word as the head of some previously seen word,
• Assign some previously seen word as the head of the current word,
• Postpone dealing with the current word, storing it for later processing.
We’ll formalize this intuition with the following three transition operators that
will operate on the top two elements of the stack:
• LEFTARC: Assert a head-dependent relation between the word at the top of the stack and the second word; remove the second word from the stack.
• RIGHTARC: Assert a head-dependent relation between the second word on the stack and the word at the top; remove the top word from the stack.
• SHIFT: Remove the word from the front of the input buffer and push it onto the stack.
We’ll sometimes call operations like LEFTARC and RIGHTARC reduce operations, based on a metaphor from shift-reduce parsing, in which reducing means combining elements on the stack. There are some preconditions for using operators. The LEFTARC operator cannot be applied when ROOT is the second element of the stack (since by definition the ROOT node cannot have any incoming arcs). And both the LEFTARC and RIGHTARC operators require two elements to be on the stack to be applied.
arc standard This particular set of operators implements what is known as the arc standard
approach to transition-based parsing (Covington 2001, Nivre 2003). In arc standard
parsing the transition operators only assert relations between elements at the top of
the stack, and once an element has been assigned its head it is removed from the
stack and is not available for further processing. As we’ll see, there are alterna-
tive transition systems which demonstrate different parsing behaviors, but the arc
standard approach is quite effective and is simple to implement.
The specification of a transition-based parser is quite simple, based on repre-
configuration senting the current state of the parse as a configuration: the stack, an input buffer
of words or tokens, and a set of relations representing a dependency tree. Parsing
means making a sequence of transitions through the space of possible configura-
tions. We start with an initial configuration in which the stack contains the ROOT
node, the buffer has the tokens in the sentence, and an empty set of relations repre-
sents the parse. In the final goal state, the stack and the word list should be empty,
and the set of relations will represent the final parse. Fig. 19.5 gives the algorithm.
At each step, the parser consults an oracle (we’ll come back to this shortly) that
provides the correct transition operator to use given the current configuration. It then
applies that operator to the current configuration, producing a new configuration.
The process ends when all the words in the sentence have been consumed and the
ROOT node is the only element remaining on the stack.
The efficiency of transition-based parsers should be apparent from the algorithm.
The complexity is linear in the length of the sentence since it is based on a single
left to right pass through the words in the sentence. (Each word must first be shifted
onto the stack and then later reduced.)
Note that unlike the dynamic programming and search-based approaches dis-
cussed in Chapter 18, this approach is a straightforward greedy algorithm—the or-
acle provides a single choice at each step and the parser proceeds with that choice,
no other options are explored, no backtracking is employed, and a single parse is
returned in the end.
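The parsing loop itself is only a few lines once the oracle is given; the transitions do all the work. Below is a minimal sketch of the arc-standard loop (ours, unlabeled, and assuming the oracle only proposes transitions whose preconditions are satisfied).

```python
def arc_standard_parse(words, oracle):
    """Greedy arc-standard parsing. The oracle maps a configuration to one of
    'SHIFT', 'LEFTARC', or 'RIGHTARC'. Returns a set of (head, dependent) arcs."""
    stack, buffer, arcs = ["ROOT"], list(words), set()
    while not (len(stack) == 1 and not buffer):
        action = oracle(stack, buffer, arcs)
        if action == "SHIFT":
            stack.append(buffer.pop(0))
        elif action == "LEFTARC":                    # top of stack is head of the second element
            arcs.add((stack[-1], stack[-2]))
            del stack[-2]
        elif action == "RIGHTARC":                   # second element is head of the top
            arcs.add((stack[-2], stack[-1]))
            stack.pop()
    return arcs
```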
Figure 19.6 illustrates the operation of the parser with the sequence of transitions for the following example.
(19.7) Book me the morning flight
(with arcs labeled root, iobj, obj, det, and compound)
Let’s consider the state of the configuration at Step 2, after the word me has been
pushed onto the stack.
The correct operator to apply here is RIGHTARC, which assigns book as the head of me and pops me from the stack, resulting in the following configuration.
Here, all the remaining words have been passed onto the stack and all that is left
to do is to apply the appropriate reduce operators. In the current configuration, we
employ the LEFTARC operator, resulting in the following state.
At this point, the parse for this sentence consists of the following structure.
(19.8) Book me the morning flight
(with the arcs iobj and compound asserted so far)
There are several important things to note when examining sequences such as
the one in Figure 19.6. First, the sequence given is not the only one that might lead
to a reasonable parse. In general, there may be more than one path that leads to the
same result, and due to ambiguity, there may be other transition sequences that lead
to different equally valid parses.
Second, we are assuming that the oracle always provides the correct operator
at each point in the parse—an assumption that is unlikely to be true in practice.
As a result, given the greedy nature of this algorithm, incorrect choices will lead to
incorrect parses since the parser has no opportunity to go back and pursue alternative
choices. Section 19.2.4 will introduce several techniques that allow transition-based
approaches to explore the search space more fully.
Finally, for simplicity, we have illustrated this example without the labels on the dependency relations. To produce labeled trees, we can parameterize the LEFTARC and RIGHTARC operators with dependency labels, as in LEFTARC(NSUBJ) or RIGHTARC(OBJ). This is equivalent to expanding the set of transition operators from our original set of three to a set that includes LEFTARC and RIGHTARC operators for each relation in the set of dependency relations being used, plus an additional one for the SHIFT operator. This, of course, makes the job of the oracle more difficult since it now has a much larger set of operators from which to choose.
Let’s walk through the processing of the following example as shown in Fig. 19.7.
(19.9) Book the flight through Houston
(with arcs labeled root, obj, det, nmod, and case)
At the start of the parse, SHIFT is the only possible action. The same conditions hold in the next two steps. In step 3, LEFTARC is selected to link the to its head.
Now consider the situation in Step 4.
Here, we might be tempted to add a dependency relation between book and flight,
which is present in the reference parse. But doing so now would prevent the later
attachment of Houston since flight would have been removed from the stack. For-
tunately, the precondition on choosing RIGHTARC prevents this choice and we're
again left with SHIFT as the only viable option. The remaining choices complete the
set of operators needed for this example.
To recap, we derive appropriate training instances consisting of configuration-
transition pairs from a treebank by simulating the operation of a parser in the con-
text of a reference dependency tree. We can deterministically record correct parser
actions at each step as we progress through each training example, thereby creating
the training set we require.
⟨s_1.w, op⟩, ⟨s_2.w, op⟩, ⟨s_1.t, op⟩, ⟨s_2.t, op⟩, ⟨b_1.w, op⟩, ⟨b_1.t, op⟩, ⟨s_1.wt, op⟩    (19.10)
The correct transition here is SHIFT (you should convince yourself of this before
proceeding). The application of our set of feature templates to this configuration
would result in the following set of instantiated features.
Given that the left and right arc transitions operate on the top two elements of the
stack, features that combine properties from these positions are even more useful.
For example, a feature like s1 .t ◦ s2 .t concatenates the part of speech tag of the word
at the top of the stack with the tag of the word beneath it.
Given the training data and features, any classifier, like multinomial logistic re-
gression or support vector machines, can be used.
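Concretely, with feature dicts like the instantiated templates above, an off-the-shelf classifier can serve as the oracle. The sketch below uses scikit-learn's DictVectorizer and LogisticRegression purely as an illustration; the two toy configurations and their labels are made up for the example.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical configuration-transition training pairs (features as dicts, label = oracle action).
train_feats = [
    {"s1.w": "book", "s1.t": "VB", "s2.w": "ROOT", "b1.w": "me", "b1.t": "PRP"},
    {"s1.w": "me", "s1.t": "PRP", "s2.w": "book", "b1.w": "the", "b1.t": "DT"},
]
train_labels = ["SHIFT", "RIGHTARC"]

oracle = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
oracle.fit(train_feats, train_labels)
print(oracle.predict([{"s1.w": "me", "s1.t": "PRP", "s2.w": "book", "b1.w": "the", "b1.t": "DT"}]))
```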
Figure 19.8 Neural classifier for the oracle for the transition-based parser. The parser takes
the top 2 words on the stack and the first word of the buffer, represents them by their encodings
(from running the whole sentence through the encoder), concatenates the embeddings and
passes through a softmax to choose a parser action (transition).
underlying parsing algorithm. This flexibility has led to the development of a di-
verse set of transition systems that address different aspects of syntax and semantics
including: assigning part of speech tags (Choi and Palmer, 2011a), allowing the
generation of non-projective dependency structures (Nivre, 2009), assigning seman-
tic roles (Choi and Palmer, 2011b), and parsing texts containing multiple languages
(Bhat et al., 2017).
Beam Search
The computational efficiency of the transition-based approach discussed earlier de-
rives from the fact that it makes a single pass through the sentence, greedily making
decisions without considering alternatives. Of course, this is also a weakness – once
a decision has been made it can not be undone, even in the face of overwhelming
beam search evidence arriving later in a sentence. We can use beam search to explore alterna-
tive decision sequences. Recall from Chapter 9 that beam search uses a breadth-first
search strategy with a heuristic filter that prunes the search frontier to stay within a
beam width fixed-size beam width.
In applying beam search to transition-based parsing, we’ll elaborate on the al-
gorithm given in Fig. 19.5. Instead of choosing the single best transition operator
at each iteration, we’ll apply all applicable operators to each state on an agenda and
then score the resulting configurations. We then add each of these new configura-
tions to the frontier, subject to the constraint that there has to be room within the
beam. As long as the size of the agenda is within the specified beam width, we can
add new configurations to the agenda. Once the agenda reaches the limit, we only
add new configurations that are better than the worst configuration on the agenda
(removing the worst element so that we stay within the limit). Finally, to ensure that
we retrieve the best possible state on the agenda, the while loop continues as long as
there are non-final states on the agenda.
The beam search approach requires a more elaborate notion of scoring than we
used with the greedy algorithm. There, we assumed that the oracle would be a
supervised classifier that chose the best transition operator based on features of the
current configuration. This choice can be viewed as assigning a score to all the
possible transitions and picking the best one.
With beam search we are now searching through the space of decision sequences,
so it makes sense to base the score for a configuration on its entire history. So we
can define the score for a new configuration as the score of its predecessor plus the score of the transition that produced it.
19.3 Graph-Based Dependency Parsing
edge-factored We’ll make the simplifying assumption that this score can be edge-factored,
meaning that the overall score for a tree is the sum of the scores of each of the edges that comprise the tree.
Score(t, S) = Σ_{e∈t} Score(e)
Figure 19.11 Initial rooted, directed graph for Book that flight.
Before describing the algorithm it’s useful to consider two intuitions about di-
rected graphs and their spanning trees. The first intuition begins with the fact that
every vertex in a spanning tree has exactly one incoming edge. It follows from this
that every connected component of a spanning tree (i.e., every set of vertices that
are linked to each other by paths over edges) will also have one incoming edge.
The second intuition is that the absolute values of the edge scores are not critical
to determining its maximum spanning tree. Instead, it is the relative weights of the
edges entering each vertex that matters. If we were to subtract a constant amount
from each edge entering a given vertex it would have no impact on the choice of
the maximum spanning tree since every possible spanning tree would decrease by
exactly the same amount.
The first step of the algorithm itself is quite straightforward. For each vertex
in the graph, an incoming edge (representing a possible head assignment) with the
highest score is chosen. If the resulting set of edges produces a spanning tree then
we’re done. More formally, given the original fully-connected graph G = (V, E), a
subgraph T = (V, F) is a spanning tree if it has no cycles and each vertex (other than
the root) has exactly one edge entering it. If the greedy selection process produces
such a tree then it is the best possible one.
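The greedy first step and the spanning-tree check are easy to sketch (our illustration, with edge scores stored in a dict keyed by (head, dependent) pairs).

```python
def greedy_heads(score):
    """Pick, for every vertex v, the incoming edge (u, v) with the highest score."""
    best = {}
    for (u, v), s in score.items():
        if v not in best or s > score[(best[v], v)]:
            best[v] = u
    return best                                      # dependent -> chosen head

def has_cycle(best):
    """The chosen edges form a spanning tree iff following heads never revisits a vertex."""
    for v in best:
        seen, cur = {v}, v
        while cur in best:
            cur = best[cur]
            if cur in seen:
                return True
            seen.add(cur)
    return False
```

For the graph of Fig. 19.11, greedy selection would choose the cycle between that and flight, so has_cycle would return True and the cleanup phase described next would be needed.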
Unfortunately, this approach doesn’t always lead to a tree since the set of edges
selected may contain cycles. Fortunately, in yet another case of multiple discovery,
there is a straightforward way to eliminate cycles generated during the greedy se-
lection phase. Chu and Liu (1965) and Edmonds (1967) independently developed
an approach that begins with greedy selection and follows with an elegant recursive
cleanup phase that eliminates cycles.
The cleanup phase begins by adjusting all the weights in the graph by subtracting
the score of the maximum edge entering each vertex from the score of all the edges
entering that vertex. This is where the intuitions mentioned earlier come into play.
We have scaled the values of the edges so that the weights of the edges in the cycle
have no bearing on the weight of any of the possible spanning trees. Subtracting the
value of the edge with maximum weight from each edge entering a vertex results
in a weight of zero for all of the edges selected during the greedy selection phase,
including all of the edges involved in the cycle.
Having adjusted the weights, the algorithm creates a new graph by selecting a
cycle and collapsing it into a single new node. Edges that enter or leave the cycle
are altered so that they now enter or leave the newly collapsed node. Edges that do
not touch the cycle are included and edges within the cycle are dropped.
Now, if we knew the maximum spanning tree of this new graph, we would have
what we need to eliminate the cycle. The edge of the maximum spanning tree di-
rected towards the vertex representing the collapsed cycle tells us which edge to
delete in order to eliminate the cycle. How do we find the maximum spanning tree
of this new graph? We recursively apply the algorithm to the new graph. This will
either result in a spanning tree or a graph with a cycle. The recursions can continue
as long as cycles are encountered. When each recursion completes we expand the
collapsed vertex, restoring all the vertices and edges from the cycle with the excep-
tion of the single edge to be deleted.
Putting all this together, the maximum spanning tree algorithm consists of greedy
edge selection, re-scoring of edge costs and a recursive cleanup phase when needed.
The full algorithm is shown in Fig. 19.12.
Fig. 19.13 steps through the algorithm with our Book that flight example. The
first row of the figure illustrates greedy edge selection with the edges chosen shown
in blue (corresponding to the set F in the algorithm). This results in a cycle between
that and flight. The scaled weights using the maximum value entering each node are
shown in the graph to the right.
Collapsing the cycle between that and flight to a single node (labelled tf) and
recursing with the newly scaled costs is shown in the second row. The greedy selec-
tion step in this recursion yields a spanning tree that links root to book, as well as an
edge that links book to the contracted node. Expanding the contracted node, we can
see that this edge corresponds to the edge from book to flight in the original graph.
This in turn tells us which edge to drop to eliminate the cycle.
function MAXSPANNINGTREE(G = (V, E), root, score) returns a spanning tree
  F ← []
  T′ ← []
  score′ ← []
  for each v ∈ V do
    bestInEdge ← argmax_{e=(u,v)∈E} score[e]
    F ← F ∪ bestInEdge
    for each e = (u,v) ∈ E do
      score′[e] ← score[e] − score[bestInEdge]
  if T = (V, F) is a spanning tree then return T
  else
    C ← a cycle in F
    G′ ← CONTRACT(G, C)
    T′ ← MAXSPANNINGTREE(G′, root, score′)
    return EXPAND(T′, C)
Figure 19.12 The Chu-Liu Edmonds algorithm for finding a maximum spanning tree in a
weighted directed graph.
On arbitrary directed graphs, this version of the CLE algorithm runs in O(mn)
time, where m is the number of edges and n is the number of nodes. Since this par-
ticular application of the algorithm begins by constructing a fully connected graph, m = n², yielding a running time of O(n³). Gabow et al. (1986) present a more efficient implementation with a running time of O(m + n log n).
Or more succinctly.
score(S, e) = w · f
Given this formulation, we need to identify relevant features and train the weights.
The features (and feature combinations) used to train edge-factored models mir-
ror those used in training transition-based parsers, such as
• Wordforms, lemmas, and parts of speech of the headword and its dependent.
• Corresponding features from the contexts before, after and between the words.
• Word embeddings.
• The dependency relation itself.
• The direction of the relation (to the right or left).
• The distance from the head to the dependent.
Given a set of features, our next problem is to learn a set of weights correspond-
ing to each. Unlike many of the learning problems discussed in earlier chapters,
here we are not training a model to associate training items with class labels, or
parser actions. Instead, we seek to train a model that assigns higher scores to cor-
rect trees than to incorrect ones. An effective framework for problems like this is to
inference-based
learning use inference-based learning combined with the perceptron learning rule. In this
framework, we parse a sentence (i.e, perform inference) from the training set using
some initial set of random weights. If the resulting parse matches the cor-
responding tree in the training data, we do nothing to the weights. Otherwise, we
find those features in the incorrect parse that are not present in the reference parse
and we lower their weights by a small amount based on the learning rate. We do this
incrementally for each sentence in our training data until the weights converge.
Here we’ll sketch the biaffine algorithm of Dozat and Manning (2017) and Dozat
et al. (2017) shown in Fig. 19.14, drawing on the work of Grünewald et al. (2021)
who tested many versions of the algorithm via their STEPS system. The algorithm
first runs the sentence X = x1 , ..., xn through an encoder to produce a contextual
embedding representation for each token R = r1 , ..., rn . The embedding for each
token is now passed through two separate feedforward networks, one to produce a
representation of this token as a head, and one to produce a representation of this
token as a dependent:
h_i^{head} = FFN_{head}(r_i)    (19.13)

h_i^{dep} = FFN_{dep}(r_i)    (19.14)
Now to assign a score to the directed edge i → j, (wi is the head and w j is the depen-
dent), we feed the head representation of i, hhead
i , and the dependent representation
of j, hdep
j , into a biaffine scoring function:
Score(i → j) = Biaff(hhead
i , hdep
j ) (19.15)
|
Biaff(x, y) = x Uy + W(x ⊕ y) + b (19.16)
where U, W, and b are weights learned by the model. The idea of using a biaffine
function is to allow the system to learn multiplicative interactions between the vec-
tors x and y.
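A small numpy sketch may make the shapes concrete (the variable names and the choice of a
single unlabeled score per edge are our own assumptions: x and y have length d, U is d × d,
W has length 2d, and b is a scalar):

import numpy as np

def biaffine(x, y, U, W, b):
    """Biaff(x, y) = x^T U y + W (x ⊕ y) + b, as in Eq. (19.16)."""
    return x @ U @ y + W @ np.concatenate([x, y]) + b

def score_all_edges(H_head, H_dep, U, W, b):
    """Score every directed edge i → j at once; H_head and H_dep have shape (n, d)."""
    d = H_head.shape[1]
    bilinear = H_head @ U @ H_dep.T                                  # (n, n): x_i^T U y_j
    linear = (H_head @ W[:d])[:, None] + (H_dep @ W[d:])[None, :]    # W (x ⊕ y), split in two
    return bilinear + linear + b                                     # scores[i, j] = Score(i → j)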
If we pass Score(i → j) through a softmax, we end up with a probability distri-
bution, for each token j, over potential heads i (all other tokens in the sentence):
p(i → j) = softmax([Score(k → j); ∀ k ≠ j, 1 ≤ k ≤ n])            (19.17)
This probability can then be passed to the maximum spanning tree algorithm of
Section 19.3.1 to find the best tree.
This p(i → j) classifier is trained by optimizing the cross-entropy loss.
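Continuing the sketch above (same assumed scores array), the softmax over candidate heads
can be computed column-wise; a full implementation would also mask out the diagonal, since
a token cannot be its own head:

import numpy as np

def head_probabilities(scores):
    """scores[i, j] = Score(i → j); returns probs with probs[i, j] = p(i → j)."""
    exp = np.exp(scores - scores.max(axis=0, keepdims=True))     # numerically stable softmax
    return exp / exp.sum(axis=0, keepdims=True)                  # normalize over heads i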
Note that the algorithm as we’ve described it is unlabeled. To make this into
a labeled algorithm, the Dozat and Manning (2017) algorithm actually trains two
classifiers. The first classifier, the edge-scorer, the one we described above, assigns
a probability p(i → j) to each pair of words wi and wj. Then the Maximum Spanning Tree
algorithm is run to get a single best dependency parse tree for the sentence. We then
apply a second classifier, the label-scorer, whose job is to find the maximum prob-
ability label for each edge in this parse. This second classifier has the same form
as (19.15-19.17), but instead of being trained to predict with binary softmax the
probability of an edge existing between two words, it is trained with a softmax over
dependency labels to predict the dependency label between the words.
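The label-scorer can be sketched with the same ingredients (the shapes are our assumption:
for L dependency labels, U_lab is L × d × d, W_lab is L × 2d, and b_lab has length L; the
result is one score per label for the chosen edge, to be passed through a softmax over labels):

import numpy as np

def label_scores(x_head, y_dep, U_lab, W_lab, b_lab):
    """Return a vector of L label scores for the edge with the given head/dependent vectors."""
    xy = np.concatenate([x_head, y_dep])
    bilinear = np.einsum("i,lij,j->l", x_head, U_lab, y_dep)   # x^T U_l y for every label l
    return bilinear + W_lab @ xy + b_lab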
19.4 Evaluation
As with phrase structure-based parsing, the evaluation of dependency parsers pro-
ceeds by measuring how well they work on a test set. An obvious metric would be
exact match (EM)—how many sentences are parsed correctly. This metric is quite
pessimistic, with most sentences being marked wrong. Such measures are not fine-
grained enough to guide the development process. Our metrics need to be sensitive
enough to tell if actual improvements are being made.
For these reasons, the most common metrics for evaluating dependency parsers
are labeled and unlabeled attachment accuracy. Labeled attachment refers to the
proper assignment of a word to its head along with the correct dependency relation.
Unlabeled attachment simply looks at the correctness of the assigned head, ignor-
ing the dependency relation. Given a system output and a corresponding reference
parse, accuracy is simply the percentage of words in an input that are assigned the
correct head with the correct relation. These metrics are usually referred to as the
labeled attachment score (LAS) and unlabeled attachment score (UAS). Finally, we
can make use of a label accuracy score (LS), the percentage of tokens with correct
labels, ignoring where the relations are coming from.
As an example, consider the reference parse and system parse for the following
example shown in Fig. 19.15.
(19.18) Book me the flight through Houston.
The system correctly finds 4 of the 6 dependency relations present in the reference
parse and receives an LAS of 2/3. However, one of the 2 incorrect relations found
by the system holds between book and flight, which are in a head-dependent relation
in the reference parse; the system therefore achieves a UAS of 5/6.
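Computing these scores is straightforward once the system output is token-aligned with the
reference; in the sketch below (our own representation), each analysis is a list of
(head index, relation) pairs, one per word:

def attachment_scores(gold, system):
    """Return (UAS, LAS, LS) for two token-aligned analyses of the same sentence."""
    assert len(gold) == len(system)
    n = len(gold)
    uas = sum(g[0] == s[0] for g, s in zip(gold, system)) / n    # correct head
    las = sum(g == s for g, s in zip(gold, system)) / n          # correct head and relation
    ls = sum(g[1] == s[1] for g, s in zip(gold, system)) / n     # correct relation label only
    return uas, las, ls

On the six-word example above, such token-aligned analyses would yield a UAS of 5/6 and an
LAS of 4/6 = 2/3.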
Beyond attachment scores, we may also be interested in how well a system is
performing on a particular kind of dependency relation, for example NSUBJ, across
(a) Reference                                          (b) System
Figure 19.15 Reference and system parses for Book me the flight through Houston, resulting in an LAS of
2/3 and an UAS of 5/6.
a development corpus. Here we can make use of the notions of precision and recall
introduced in Chapter 17, measuring the percentage of relations labeled NSUBJ by
the system that were correct (precision), and the percentage of the NSUBJ relations
present in the development set that were in fact discovered by the system (recall).
We can employ a confusion matrix to keep track of how often each dependency type
was confused for another.
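Per-relation precision and recall can be computed in the same token-aligned representation
used in the earlier sketch (the relation name below, nsubj, is just an example):

def relation_pr(gold, system, relation="nsubj"):
    """Precision and recall for one dependency relation over token-aligned analyses."""
    tp = sum(g == s and s[1] == relation for g, s in zip(gold, system))
    sys_count = sum(s[1] == relation for s in system)
    gold_count = sum(g[1] == relation for g in gold)
    precision = tp / sys_count if sys_count else 0.0
    recall = tp / gold_count if gold_count else 0.0
    return precision, recall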
19.5 Summary
This chapter has introduced the concept of dependency grammars and dependency
parsing. Here’s a summary of the main points that we covered: