Online Large-Margin Training of Dependency Parsers
Ryan McDonald, Koby Crammer, Fernando Pereira
Proceedings of the 43rd Annual Meeting of the ACL, pages 91–98,
Ann Arbor, June 2005.
© 2005 Association for Computational Linguistics
and Czech treebank data.
[Figure: dependency tree over the words "root John hit the ball with the bat"]

2 System Description
[Figure 2 schematic: chart items with head indices h1, h2 and span indices s, r, r+1, t being combined into larger items]
Figure 2: O(n^3) algorithm of Eisner (1996), needs to keep 3 indices at any given stage.
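As a rough illustration of the cubic-time chart algorithm that Figure 2 depicts, the following Python sketch finds the highest-scoring projective tree when the tree score is a sum of edge scores. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `eisner`, the dense `scores[h][m]` matrix, the use of token 0 as an artificial root, and the fact that several words may attach to the root are all choices made here for illustration.

```python
def eisner(scores):
    """Return the best projective tree under an edge-factored score.

    scores[h][m] is the (assumed) score of an edge from head h to
    modifier m; token 0 is an artificial root. The result maps every
    token to its head, with heads[0] == -1.
    """
    n = len(scores)
    neg_inf = float("-inf")
    # Direction 0: head at the right end of the span; 1: head at the left end.
    complete = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
    incomplete = [[[neg_inf, neg_inf] for _ in range(n)] for _ in range(n)]
    complete_bp = [[[-1, -1] for _ in range(n)] for _ in range(n)]
    incomplete_bp = [[-1] * n for _ in range(n)]

    for span in range(1, n):
        for s in range(n - span):
            t = s + span
            # Incomplete items: join two facing complete items, then add an edge.
            r = max(range(s, t),
                    key=lambda q: complete[s][q][1] + complete[q + 1][t][0])
            joined = complete[s][r][1] + complete[r + 1][t][0]
            incomplete[s][t][0] = joined + scores[t][s]  # edge t -> s
            incomplete[s][t][1] = joined + scores[s][t]  # edge s -> t
            incomplete_bp[s][t] = r
            # Complete item headed at t: complete left part + incomplete item.
            r = max(range(s, t),
                    key=lambda q: complete[s][q][0] + incomplete[q][t][0])
            complete[s][t][0] = complete[s][r][0] + incomplete[r][t][0]
            complete_bp[s][t][0] = r
            # Complete item headed at s: incomplete item + complete right part.
            r = max(range(s + 1, t + 1),
                    key=lambda q: incomplete[s][q][1] + complete[q][t][1])
            complete[s][t][1] = incomplete[s][r][1] + complete[r][t][1]
            complete_bp[s][t][1] = r

    # Follow back-pointers from the item that spans the sentence, headed at root.
    heads = [-1] * n
    stack = [(0, n - 1, 1, True)]
    while stack:
        s, t, d, is_complete = stack.pop()
        if s == t:
            continue
        if is_complete:
            r = complete_bp[s][t][d]
            if d == 0:
                stack += [(s, r, 0, True), (r, t, 0, False)]
            else:
                stack += [(s, r, 1, False), (r, t, 1, True)]
        else:
            r = incomplete_bp[s][t]
            if d == 0:
                heads[s] = t
            else:
                heads[t] = s
            stack += [(s, r, 1, True), (r + 1, t, 0, True)]
    return heads


# Toy example: tokens are root(0), John(1), hit(2), ball(3).
toy_scores = [[0.0] * 4 for _ in range(4)]
toy_scores[0][2] = 10.0  # root -> hit
toy_scores[2][1] = 10.0  # hit -> John
toy_scores[2][3] = 10.0  # hit -> ball
print(eisner(toy_scores))
```

Each item is indexed by its two end points plus a direction and a completeness flag, and each is filled by a single loop over the split point, which is where the three indices of the caption come from.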
trees that maximize the score function, s(x, y). The primary difficulty is that for a given sentence of length n there are exponentially many possible dependency trees. Using a slightly modified version of a lexicalized CKY chart parsing algorithm, it is possible to generate and represent these sentences in a forest that is O(n^5) in size and takes O(n^5) time to create.

Eisner (1996) made the observation that if the head of each chart item is on the left or right periphery, then it is possible to parse in O(n^3). The idea is to parse the left and right dependents of a word independently and combine them at a later stage. This removes the need for the additional head indices of the O(n^5) algorithm and requires only two additional binary variables that specify the direction of the item (either gathering left dependents or gathering right dependents) and whether an item is complete (available to gather more dependents). Figure 2 shows the algorithm schematically. As with normal CKY parsing, larger elements are created bottom-up from pairs of smaller elements.

Eisner showed that his algorithm is sufficient for both searching the space of dependency parses and, with slight modification, finding the highest scoring tree y for a given sentence x under the edge factorization assumption. Eisner and Satta (1999) give a cubic algorithm for lexicalized phrase structures. However, it only works for a limited class of languages in which tree spines are regular. Furthermore, there is a large grammar constant, which is typically in the thousands for treebank parsers.

2.3 Online Learning

Figure 3 gives pseudo-code for the generic online learning setting. A single training instance is considered on each iteration, and parameters updated by applying an algorithm-specific update rule to the instance under consideration. The algorithm in Figure 3 returns an averaged weight vector: an auxiliary weight vector v is maintained that accumulates the values of w after each iteration, and the returned weight vector is the average of all the weight vectors throughout training. Averaging has been shown to help reduce overfitting (Collins, 2002).

    Training data: T = {(x_t, y_t)}, t = 1..T
    1. w^(0) = 0; v = 0; i = 0
    2. for n : 1..N
    3.   for t : 1..T
    4.     w^(i+1) = update w^(i) according to instance (x_t, y_t)
    5.     v = v + w^(i+1)
    6.     i = i + 1
    7. w = v / (N * T)

Figure 3: Generic online learning algorithm.

2.3.1 MIRA

Crammer and Singer (2001) developed a natural method for large-margin multi-class classification, which was later extended by Taskar et al. (2003) to structured classification:

\[
\begin{aligned}
&\min \|w\| \\
&\text{s.t. } s(x, y) - s(x, y') \ge L(y, y') \quad \forall (x, y) \in \mathcal{T},\ y' \in dt(x)
\end{aligned}
\]

where L(y, y') is a real-valued loss for the tree y' relative to the correct tree y. We define the loss of a dependency tree as the number of words that have the incorrect parent. Thus, the largest loss a dependency tree can have is the length of the sentence.

Informally, this update looks to create a margin between the correct dependency tree and each incorrect dependency tree at least as large as the loss of the incorrect tree. The more errors a tree has, the farther away its score will be from the score of the correct tree. In order to avoid a blow-up in the norm of the weight vector we minimize it subject to constraints that enforce the desired margin between the correct and incorrect trees.¹

¹ The constraints may be unsatisfiable, in which case we can relax them with slack variables as in SVM training.
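Two pieces introduced so far lend themselves to a short sketch: the averaged training loop of Figure 3 and the tree loss defined above (the number of words with an incorrect parent). The Python below is a minimal illustration under stated assumptions. The names `hamming_loss` and `train_averaged`, the dense list representation of the weight vector, and the `update` callback signature are hypothetical and stand in for whatever algorithm-specific rule is plugged into line 4.

```python
def hamming_loss(gold_heads, pred_heads):
    """Count words whose predicted parent differs from the gold parent.

    Both arguments are head lists in which position 0 is an artificial
    root (an assumption of this sketch), so position 0 is skipped.
    """
    return sum(1 for g, p in zip(gold_heads[1:], pred_heads[1:]) if g != p)


def train_averaged(data, update, dim, epochs):
    """Generic online loop of Figure 3 with weight averaging.

    data   : list of (x, y) training instances
    update : callable (w, x, y) -> new weight vector (line 4 of Figure 3)
    dim    : number of features
    epochs : N, the number of passes over the data
    """
    w = [0.0] * dim
    v = [0.0] * dim          # running sum of every intermediate weight vector
    for _ in range(epochs):
        for x, y in data:
            w = update(w, x, y)
            v = [vi + wi for vi, wi in zip(v, w)]
    return [vi / (epochs * len(data)) for vi in v]


# Toy check with a dummy update that adds 1 to the only weight each step:
# the intermediate vectors are 1, 2, 3, 4, whose average is 2.5.
averaged = train_averaged([(None, None)] * 2,
                          lambda w, x, y: [w[0] + 1.0],
                          dim=1, epochs=2)
print(averaged, hamming_loss([-1, 2, 0, 2], [-1, 2, 0, 1]))
```

Keeping the update rule as a callback mirrors the way Figure 3 leaves line 4 open, so the same loop can host a perceptron step or a margin-based step.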
The Margin Infused Relaxed Algorithm (MIRA) (Crammer and Singer, 2003; Crammer et al., 2003) employs this optimization directly within the online framework. On each update, MIRA attempts to keep the norm of the change to the parameter vector as small as possible, subject to correctly classifying the instance under consideration with a margin at least as large as the loss of the incorrect classifications. This can be formalized by substituting the following update into line 4 of the generic online algorithm,

\[
\begin{aligned}
&\min \left\| w^{(i+1)} - w^{(i)} \right\| \\
&\text{s.t. } s(x_t, y_t) - s(x_t, y') \ge L(y_t, y') \quad \forall y' \in dt(x_t)
\end{aligned}
\tag{1}
\]

This is a standard quadratic programming problem that can be easily solved using Hildreth's algorithm (Censor and Zenios, 1997). Crammer and Singer (2003) and Crammer et al. (2003) provide an analysis of both the online generalization error and convergence properties of MIRA. In equation (1), s(x, y) is calculated with respect to the weight vector after optimization, w^(i+1).

To apply MIRA to dependency parsing, we can simply see parsing as a multi-class classification problem in which each dependency tree is one of many possible classes for a sentence. However, that interpretation fails computationally because a general sentence has exponentially many possible dependency trees and thus exponentially many margin constraints.

To circumvent this problem we make the assumption that the constraints that matter for large margin optimization are those involving the incorrect trees y' with the highest scores s(x, y'). The resulting optimization made by MIRA (see Figure 3, line 4) would then be:

\[
\begin{aligned}
&\min \left\| w^{(i+1)} - w^{(i)} \right\| \\
&\text{s.t. } s(x_t, y_t) - s(x_t, y') \ge L(y_t, y') \quad \forall y' \in \mathrm{best}_k(x_t; w^{(i)})
\end{aligned}
\]

reducing the number of constraints to the constant k. We tested various values of k on a development data set and found that small values of k are sufficient to achieve close to best performance, justifying our assumption. In fact, as k grew we began to observe a slight degradation of performance, indicating some overfitting to the training data. All the experiments presented here use k = 5. The Eisner (1996) algorithm can be modified to find the k-best trees while only adding an additional O(k log k) factor to the runtime (Huang and Chiang, 2005).

A more common approach is to factor the structure of the output space to yield a polynomial set of local constraints (Taskar et al., 2003; Taskar et al., 2004). One such factorization for dependency trees is

\[
\begin{aligned}
&\min \left\| w^{(i+1)} - w^{(i)} \right\| \\
&\text{s.t. } s(l, j) - s(k, j) \ge 1 \quad \forall (l, j) \in y_t,\ (k, j) \notin y_t
\end{aligned}
\]

It is trivial to show that if these O(n^2) constraints are satisfied, then so are those in (1). We implemented this model, but found that the required training time was much larger than the k-best formulation and typically did not improve performance. Furthermore, the k-best formulation is more flexible with respect to the loss function since it does not assume the loss function can be factored into a sum of terms for each dependency.

2.4 Feature Set

Finally, we need a suitable feature representation f(i, j) for each dependency. The basic features in our model are outlined in Table 1a and b. All features are conjoined with the direction of attachment as well as the distance between the two words being attached. These features represent a system of back-off from very specific features over words and part-of-speech tags to less sparse features over just part-of-speech tags. These features are added for both the entire words as well as the 5-gram prefix if the word is longer than 5 characters.

Using just features over the parent-child node pairs in the tree was not enough for high accuracy, because all attachment decisions were made outside of the context in which the words occurred. To solve this problem, we added two other types of features, which can be seen in Table 1c. Features of the first type look at words that occur between a child and its parent. These features take the form of a POS trigram: the POS of the parent, of the child, and of a word in between, for all words linearly between the parent and the child. This feature was particularly helpful for nouns identifying their parent, since
a) Basic Uni-gram Features
    p-word, p-pos
    p-word
    p-pos
    c-word, c-pos
    c-word
    c-pos

b) Basic Bi-gram Features
    p-word, p-pos, c-word, c-pos
    p-pos, c-word, c-pos
    p-word, c-word, c-pos
    p-word, p-pos, c-pos
    p-word, p-pos, c-word
    p-word, c-word
    p-pos, c-pos

c) In Between POS Features
    p-pos, b-pos, c-pos
   Surrounding Word POS Features
    p-pos, p-pos+1, c-pos-1, c-pos
    p-pos-1, p-pos, c-pos-1, c-pos
    p-pos, p-pos+1, c-pos, c-pos+1
    p-pos-1, p-pos, c-pos, c-pos+1

Table 1: Features used by system. p-word: word of parent node in dependency tree. c-word: word of child node. p-pos: POS of parent node. c-pos: POS of child node. p-pos+1: POS to the right of parent in sentence. p-pos-1: POS to the left of parent. c-pos+1: POS to the right of child. c-pos-1: POS to the left of child. b-pos: POS of a word in between parent and child nodes.
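The templates of Table 1 can be sketched as a small feature extractor. This is only an illustration under stated assumptions: the function name `edge_features`, the string encoding of features, the `<none>` padding tag at sentence boundaries, and the use of the raw absolute distance (rather than some bucketing) are all hypothetical choices, and the 5-gram prefix variants are omitted for brevity.

```python
def edge_features(words, pos, p, c):
    """Return string features for a candidate edge from parent p to child c.

    words and pos are parallel lists for one sentence; p and c are indices.
    Every feature is conjoined with attachment direction and distance.
    """
    def tag(i):
        # Assumed padding symbol for positions outside the sentence.
        return pos[i] if 0 <= i < len(pos) else "<none>"

    pw, pp, cw, cp = words[p], pos[p], words[c], pos[c]
    feats = [
        # a) basic uni-gram features
        ("p-word,p-pos", (pw, pp)), ("p-word", (pw,)), ("p-pos", (pp,)),
        ("c-word,c-pos", (cw, cp)), ("c-word", (cw,)), ("c-pos", (cp,)),
        # b) basic bi-gram features
        ("p-word,p-pos,c-word,c-pos", (pw, pp, cw, cp)),
        ("p-pos,c-word,c-pos", (pp, cw, cp)),
        ("p-word,c-word,c-pos", (pw, cw, cp)),
        ("p-word,p-pos,c-pos", (pw, pp, cp)),
        ("p-word,p-pos,c-word", (pw, pp, cw)),
        ("p-word,c-word", (pw, cw)),
        ("p-pos,c-pos", (pp, cp)),
    ]
    # c) one in-between POS trigram per word strictly between parent and child
    for b in range(min(p, c) + 1, max(p, c)):
        feats.append(("p-pos,b-pos,c-pos", (pp, pos[b], cp)))
    # c) surrounding-word POS 4-grams
    feats += [
        ("p-pos,p-pos+1,c-pos-1,c-pos", (pp, tag(p + 1), tag(c - 1), cp)),
        ("p-pos-1,p-pos,c-pos-1,c-pos", (tag(p - 1), pp, tag(c - 1), cp)),
        ("p-pos,p-pos+1,c-pos,c-pos+1", (pp, tag(p + 1), cp, tag(c + 1))),
        ("p-pos-1,p-pos,c-pos,c-pos+1", (tag(p - 1), pp, cp, tag(c + 1))),
    ]
    direction = "R" if p < c else "L"
    distance = abs(p - c)
    return ["%s=%s&dir=%s&dist=%d" % (name, "|".join(vals), direction, distance)
            for name, vals in feats]


words = ["<root>", "John", "hit", "the", "ball"]
tags = ["ROOT", "NNP", "VBD", "DT", "NN"]
features = edge_features(words, tags, 2, 4)   # candidate edge hit -> ball
print(len(features))
```

For the edge from "hit" to "ball" there is exactly one word in between ("the"), so the extractor emits 6 uni-gram, 7 bi-gram, 1 in-between and 4 surrounding features.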
it would typically rule out situations when a noun attached to another noun with a verb in between, which is a very uncommon phenomenon.

The second type of feature provides the local context of the attachment, that is, the words before and after the parent-child pair. This feature took the form of a POS 4-gram: the POS of the parent, child, word before/after parent and word before/after child. The system also used back-off features to various trigrams where one of the local context POS tags was removed. Adding these two features resulted in a large improvement in performance and brought the system to state-of-the-art accuracy.

2.5 System Summary

Besides performance (see Section 3), the approach to dependency parsing we described has several other advantages. The system is very general and contains no language specific enhancements. In fact, the results we report for English and Czech use identical features, though they are obviously trained on different data. The online learning algorithms themselves are intuitive and easy to implement.

The efficient O(n^3) parsing algorithm of Eisner allows the system to search the entire space of dependency trees while parsing thousands of sentences in a few minutes, which is crucial for discriminative training. We compare the speed of our model to a standard lexicalized phrase structure parser in Section 3.1 and show a significant improvement in parsing times on the testing data.

The major limiting factor of the system is its restriction to features over single dependency attachments. Often, when determining the next dependent for a word, it would be useful to know previous attachment decisions and incorporate these into the features. It is fairly straightforward to modify the parsing algorithm to store previous attachments. However, any modification would result in an asymptotic increase in parsing complexity.

3 Experiments

We tested our methods experimentally on the English Penn Treebank (Marcus et al., 1993) and on the Czech Prague Dependency Treebank (Hajič, 1998). All experiments were run on a dual 64-bit AMD Opteron 2.4GHz processor.

To create dependency structures from the Penn Treebank, we used the extraction rules of Yamada and Matsumoto (2003), which are an approximation to the lexicalization rules of Collins (1999). We split the data into three parts: sections 02-21 for training, section 22 for development and section 23 for evaluation. Currently the system has 6,998,447 features. Each instance only uses a tiny fraction of these features, making sparse vector calculations possible. Our system assumes POS tags as input and uses the tagger of Ratnaparkhi (1996) to provide tags for the development and evaluation sets.

Table 2 shows the performance of the systems that were compared. Y&M2003 is the SVM-shift-reduce parsing model of Yamada and Matsumoto (2003), N&S2004 is the memory-based learner of Nivre and Scholz (2004) and MIRA is the system we have described. We also implemented an averaged perceptron system (Collins, 2002) (another online learning algorithm) for comparison. This table compares only pure dependency parsers that do
                          English                        Czech
                   Accuracy  Root  Complete    Accuracy  Root  Complete
    Y&M2003          90.3    91.6    38.4         -       -       -
    N&S2004          87.3    84.3    30.4         -       -       -
    Avg. Perceptron  90.6    94.0    36.5        82.9    88.0    30.3
    MIRA             90.9    94.2    37.5        83.3    88.6    31.3

Table 2: Dependency parsing results for English and Czech. Accuracy is the number of words that correctly identified their parent in the tree. Root is the number of trees in which the root word was correctly identified. For Czech this is f-measure since a sentence may have multiple roots. Complete is the number of sentences for which the entire dependency tree was correct.
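The three measures in the caption of Table 2 can be computed along the following lines. This is a hedged sketch rather than the evaluation script behind the reported numbers: the function name `evaluate`, the head-list encoding (one head per word, 0 meaning the root) and the decision to report percentages are assumptions, the Root measure is treated as an exact match of the set of root words (the English setting, not the Czech f-measure), and any special handling of punctuation is ignored.

```python
def evaluate(gold, pred):
    """Return (accuracy, root, complete) as percentages.

    gold and pred are lists of sentences; each sentence is a list with one
    head index per word, where 0 denotes attachment to the root (assumed
    encoding for this sketch).
    """
    words = correct = root_ok = complete = 0
    for g, p in zip(gold, pred):
        words += len(g)
        correct += sum(1 for gh, ph in zip(g, p) if gh == ph)
        gold_roots = {i for i, h in enumerate(g) if h == 0}
        pred_roots = {i for i, h in enumerate(p) if h == 0}
        root_ok += gold_roots == pred_roots
        complete += g == p
    n = len(gold)
    return 100.0 * correct / words, 100.0 * root_ok / n, 100.0 * complete / n


gold_trees = [[2, 0, 2], [0, 1]]
pred_trees = [[2, 0, 2], [2, 0]]
print(evaluate(gold_trees, pred_trees))
```

In the toy input the first sentence is fully correct and the second gets both heads wrong, so 3 of 5 words are correct and one of two sentences has the right root and the right tree.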
not exploit phrase structure. We ensured that the gold standard dependencies of all systems compared were identical.

Table 2 shows that the model described here performs as well as or better than previous comparable systems, including that of Yamada and Matsumoto (2003). Their method has the potential advantage that SVM batch training takes into account all of the constraints from all training instances in the optimization, whereas online training only considers constraints from one instance at a time. However, they are fundamentally limited by their approximate search algorithm. In contrast, our system searches the entire space of dependency trees and most likely benefits greatly from this. This difference is amplified when looking at the percentage of trees that correctly identify the root word. The models that search the entire space will not suffer from bad approximations made early in the search and thus are more likely to identify the correct root, whereas the approximate algorithms are prone to error propagation, which culminates with attachment decisions at the top of the tree. When comparing the two online learning models, it can be seen that MIRA outperforms the averaged perceptron method. This difference is statistically significant, p < 0.005 (McNemar test on head selection accuracy).

In our Czech experiments, we used the dependency trees annotated in the Prague Treebank, and the predefined training, development and evaluation sections of this data. The number of sentences in this data set is nearly twice that of the English treebank, leading to a very large number of features: 13,450,672. But again, each instance uses just a handful of these features. For POS tags we used the automatically generated tags in the data set. Though we made no language specific model changes, we did need to make some data specific changes. In particular, we used the method of Collins et al. (1999) to simplify part-of-speech tags since the rich tags used by Czech would have led to a large but rarely seen set of POS features.

The model based on MIRA also performs well on Czech, again slightly outperforming averaged perceptron. Unfortunately, we do not know of any other parsing systems tested on the same data set. The Czech parser of Collins et al. (1999) was run on a different data set and most other dependency parsers are evaluated using English. Learning a model from the Czech training data is somewhat problematic since it contains some crossing dependencies which cannot be parsed by the Eisner algorithm. One trick is to rearrange the words in the training set so that all trees are nested. This at least allows the training algorithm to obtain reasonably low error on the training set. We found that this did improve performance slightly to 83.6% accuracy.

3.1 Lexicalized Phrase Structure Parsers

It is well known that dependency trees extracted from lexicalized phrase structure parsers (Collins, 1999; Charniak, 2000) typically are more accurate than those produced by pure dependency parsers (Yamada and Matsumoto, 2003). We compared our system to the Bikel re-implementation of the Collins parser (Bikel, 2004; Collins, 1999) trained with the same head rules as our system. There are two ways to extract dependencies from lexicalized phrase structure. The first is to use the automatically generated dependencies that are explicit in the lexicalization of the trees; we call this system Collins-auto. The second is to take just the phrase structure output of the parser and run the automatic head rules over it to extract the dependencies; we call this system Collins-rules. Table 3 shows the results comparing our system, MIRA-Normal, to the Collins parser for English. All systems are implemented in Java and run on the same machine.

                                  English
                    Accuracy  Root  Complete  Complexity  Time
    Collins-auto      88.2    92.3    36.1      O(n^5)     98m 21s
    Collins-rules     91.4    95.1    42.6      O(n^5)     98m 21s
    MIRA-Normal       90.9    94.2    37.5      O(n^3)      5m 52s
    MIRA-Collins      92.2    95.8    42.9      O(n^5)    105m 08s

Table 3: Results comparing our system to those based on the Collins parser. Complexity represents the computational complexity of each parser and Time the CPU time to parse sec. 23 of the Penn Treebank.

                  k=1     k=2     k=5     k=10    k=20
    Accuracy     90.73   90.82   90.88   90.92   90.91
    Train Time   183m    235m    627m    1372m   2491m

Table 4: Evaluation of k-best MIRA approximation.
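The k-best MIRA update whose behaviour Table 4 evaluates can be sketched as a small quadratic program solved by coordinate ascent on the dual, in the spirit of Hildreth's algorithm. The Python below is a minimal sketch under stated assumptions, not the Java implementation used in the experiments: the names `hildreth` and `mira_update`, the dense list feature vectors, the iteration cap and the tolerance are hypothetical, and the k-best trees with their feature vectors and losses are assumed to be supplied by the parser.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hildreth(diffs, margins, iterations=1000, tol=1e-8):
    """Find multipliers alpha >= 0 for: min ||delta||^2
    subject to dot(diffs[j], delta) >= margins[j] for every j,
    where delta = sum_j alpha[j] * diffs[j]."""
    k = len(diffs)
    gram = [[dot(diffs[i], diffs[j]) for j in range(k)] for i in range(k)]
    alpha = [0.0] * k
    for _ in range(iterations):
        biggest_change = 0.0
        for j in range(k):
            if gram[j][j] <= 0.0:
                continue  # identical feature vectors: nothing to separate
            # Remaining violation of constraint j under the current alphas.
            grad = margins[j] - sum(alpha[i] * gram[i][j] for i in range(k))
            new_alpha = max(0.0, alpha[j] + grad / gram[j][j])
            biggest_change = max(biggest_change, abs(new_alpha - alpha[j]))
            alpha[j] = new_alpha
        if biggest_change < tol:
            break
    return alpha


def mira_update(w, gold_feats, kbest):
    """One MIRA step. kbest is a list of (feature_vector, loss) pairs for
    the k highest-scoring trees under the current weights w."""
    diffs = [[g - p for g, p in zip(gold_feats, feats)] for feats, _ in kbest]
    # Required margin minus the margin the current weights already achieve.
    margins = [loss - dot(w, d) for d, (_, loss) in zip(diffs, kbest)]
    alpha = hildreth(diffs, margins)
    new_w = list(w)
    for a, d in zip(alpha, diffs):
        for i, di in enumerate(d):
            new_w[i] += a * di
    return new_w


# Toy example with three features and k = 2 competing trees.
gold = [1.0, 0.0, 0.0]
candidates = [([0.0, 1.0, 0.0], 2.0), ([0.0, 0.0, 1.0], 1.0)]
new_weights = mira_update([0.0, 0.0, 0.0], gold, candidates)
print(new_weights)
```

After the update, the score of the gold tree exceeds that of each candidate by at least the candidate's loss, which is the constraint the k-best formulation imposes.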
Interestingly, the dependencies that are automatically produced by the Collins parser are worse than those extracted statically using the head rules. Arguably, this displays the artificialness of English dependency parsing using dependencies automatically extracted from treebank phrase-structure trees. Our system falls in-between, better than the automatically generated dependency trees and worse than the head-rule extracted trees.

Since the dependencies returned from our system are better than those actually learnt by the Collins parser, one could argue that our model is actually learning to parse dependencies more accurately. However, phrase structure parsers are built to maximize the accuracy of the phrase structure and use lexicalization as just an additional source of information. Thus it is not too surprising that the dependencies output by the Collins parser are not as accurate as our system, which is trained and built to maximize accuracy on dependency trees. In complexity and run-time, our system is a huge improvement over the Collins parser.

The final system in Table 3 takes the output of Collins-rules and adds a feature to MIRA-Normal that indicates, for a given edge, whether the Collins parser believed this dependency actually exists; we call this system MIRA-Collins. This is a well known discriminative training trick: using the suggestions of a generative system to influence decisions. This system can essentially be considered a corrector of the Collins parser and represents a significant improvement over it. However, there is an added complexity with such a model as it requires the output of the O(n^5) Collins parser.

3.2 k-best MIRA Approximation

One question that can be asked is how justifiable is the k-best MIRA approximation. Table 4 indicates the accuracy on testing and the time it took to train models with k = 1, 2, 5, 10, 20 for the English data set. Even though the parsing algorithm is proportional to O(k log k), empirically, the training times scale linearly with k. Peak performance is achieved very early with a slight degradation around k=20. The most likely reason for this phenomenon is that the model is overfitting by ensuring that even unlikely trees are separated from the correct tree proportional to their loss.

4 Summary

We described a successful new method for training dependency parsers. We use simple linear parsing models trained with margin-sensitive online training algorithms, achieving state-of-the-art performance with relatively modest training times and no need for pruning heuristics. We evaluated the system on both English and Czech data to display state-of-the-art performance without any language specific enhancements. Furthermore, the model can be augmented to include features over lexicalized phrase structure parsing decisions to increase dependency accuracy over those parsers.

We plan on extending our parser in two ways. First, we would add labels to dependencies to represent grammatical roles. Those labels are very important for using parser output in tasks like information extraction or machine translation. Second,
we are looking at model extensions to allow non-projective dependencies, which occur in languages such as Czech, German and Dutch.

Acknowledgments: We thank Jan Hajič for answering queries on the Prague treebank, and Joakim Nivre for providing the Yamada and Matsumoto (2003) head rules for English that allowed for a direct comparison with our systems. This work was supported by NSF ITR grants 0205456, 0205448, and 0428193.

References

D.M. Bikel. 2004. Intricacies of Collins parsing model. Computational Linguistics.

Y. Censor and S.A. Zenios. 1997. Parallel Optimization: Theory, Algorithms, and Applications. Oxford University Press.

E. Charniak. 2000. A maximum-entropy-inspired parser. In Proc. NAACL.

S. Clark and J.R. Curran. 2004. Parsing the WSJ using CCG and log-linear models. In Proc. ACL.

M. Collins and B. Roark. 2004. Incremental parsing with the perceptron algorithm. In Proc. ACL.

M. Collins, J. Hajič, L. Ramshaw, and C. Tillmann. 1999. A statistical parser for Czech. In Proc. ACL.

M. Collins. 1999. Head-Driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

M. Collins. 2002. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. In Proc. EMNLP.

K. Crammer and Y. Singer. 2001. On the algorithmic implementation of multiclass kernel based vector machines. JMLR.

K. Crammer and Y. Singer. 2003. Ultraconservative online algorithms for multiclass problems. JMLR.

K. Crammer, O. Dekel, S. Shalev-Shwartz, and Y. Singer. 2003. Online passive aggressive algorithms. In Proc. NIPS.

A. Culotta and J. Sorensen. 2004. Dependency tree kernels for relation extraction. In Proc. ACL.

Y. Ding and M. Palmer. 2005. Machine translation using probabilistic synchronous dependency insertion grammars. In Proc. ACL.

J. Eisner and G. Satta. 1999. Efficient parsing for bilexical context-free grammars and head-automaton grammars. In Proc. ACL.

J. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In Proc. COLING.

J. Hajič. 1998. Building a syntactically annotated corpus: The Prague dependency treebank. Issues of Valency and Meaning.

L. Huang and D. Chiang. 2005. Better k-best parsing. Technical Report MS-CIS-05-08, University of Pennsylvania.

R. Hudson. 1984. Word Grammar. Blackwell.

T. Joachims. 2002. Learning to Classify Text using Support Vector Machines. Kluwer.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. ICML.

M. Marcus, B. Santorini, and M. Marcinkiewicz. 1993. Building a large annotated corpus of English: the Penn treebank. Computational Linguistics.

J. Nivre and M. Scholz. 2004. Deterministic dependency parsing of English text. In Proc. COLING.

A. Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proc. EMNLP.

A. Ratnaparkhi. 1999. Learning to parse natural language with maximum entropy models. Machine Learning.

S. Riezler, T. King, R. Kaplan, R. Crouch, J. Maxwell, and M. Johnson. 2002. Parsing the Wall Street Journal using a lexical-functional grammar and discriminative estimation techniques. In Proc. ACL.

F. Sha and F. Pereira. 2003. Shallow parsing with conditional random fields. In Proc. HLT-NAACL.

Y. Shinyama, S. Sekine, K. Sudo, and R. Grishman. 2002. Automatic paraphrase acquisition from news articles. In Proc. HLT.

B. Taskar, C. Guestrin, and D. Koller. 2003. Max-margin Markov networks. In Proc. NIPS.

B. Taskar, D. Klein, M. Collins, D. Koller, and C. Manning. 2004. Max-margin parsing. In Proc. EMNLP.

H. Yamada and Y. Matsumoto. 2003. Statistical dependency analysis with support vector machines. In Proc. IWPT.