Learning OT Constraint Rankings Using A Maximum Entropy Model
Sharon Goldwater and Mark Johnson
Department of Cognitive and Linguistic Sciences, Brown University
{sharon goldwater, mark johnson}@brown.edu
Abstract. A weakness of standard Optimality Theory is its inability to account for grammars
with free variation. We describe here the Maximum Entropy model, a general statistical model,
and show how it can be applied in a constraint-based linguistic framework to model and learn
grammars with free variation, as well as categorical grammars. We report the results of using
the MaxEnt model for learning two different grammars: one with variation, and one without.
Our results are as good as those of a previous probabilistic version of OT, the Gradual Learning
Algorithm (Boersma, 1997), and we argue that our model is more general and mathematically
well-motivated.
1. Introduction
Maximum Entropy or log-linear models are a very general class of statistical models that
have been applied to problems in a wide range of fields, including computational linguistics.
Logistic regression models, exponential models, Boltzmann networks, Harmonic grammars,
probabilistic context free grammars, and Hidden Markov Models are all types of Maximum
Entropy models. Maximum Entropy models are motivated by information theory: they are
designed to include as much information as is known from the data while making no additional
assumptions (i.e. they are models that have as high an entropy as possible under the constraint
that they match the training data). Suppose we have some conditioning context x and a set of
possible outcomes Y(x) that depend on the context. Then a Maximum Entropy model defines
the conditional probability of any particular outcome y ∈ Y(x) given the context x as:
\Pr(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i=1}^{m} w_i f_i(y, x) \right), \quad \text{where} \qquad (1)

Z(x) = \sum_{y \in Y(x)} \exp\left( \sum_{i=1}^{m} w_i f_i(y, x) \right)
In these equations, f1 (y, x) . . . fm (y, x) are the values of m different features of the pair (y, x),
the wi are parameters (weights) associated with those features, and Z(x) is a normalizing
constant obtained by summing over all possible values that y could take on in the sample space
Y(x). In other words, the log probability of y given x is proportional to a linear combination
of feature values, \sum_{i=1}^{m} w_i f_i(y, x).
In the MaxEnt models considered here, x is an input phonological form, Y(x) is the set of
candidate output forms (i.e., Y is the Gen function) and y ∈ Y(x) is some particular candidate
output form. For an Optimality Theoretic analysis with m constraints C1 · · · Cm , we use a
Maximum Entropy model with m features, and let the features correspond to the constraints.
Thus the feature value fi (y, x) is the number of violations of constraint Ci incurred by the
input/output pair (y, x). We can think of the parameter weights wi as the ranking values of
the constraints.
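To make this mapping concrete, the following sketch (in Python; not the authors' implementation) computes the distribution in (1) over the candidates of a hypothetical two-constraint tableau. The violation counts and weights are invented for illustration; note that under equation (1) as written, a constraint that penalizes violations corresponds to a negative weight, since additional violations must lower the probability.

import math

def maxent_probs(weights, violations):
    """Pr(y|x) for each candidate y, per equation (1):
    Pr(y|x) = exp(sum_i w_i * f_i(y, x)) / Z(x),
    where f_i(y, x) is candidate y's violation count for constraint C_i."""
    scores = [math.exp(sum(w * f for w, f in zip(weights, cand)))
              for cand in violations]
    z = sum(scores)                      # normalizing constant Z(x)
    return [s / z for s in scores]

# Hypothetical tableau: rows are candidate output forms, columns are
# violation counts for two constraints (C1, C2).
violations = [
    [0, 1],   # candidate A violates C2 once
    [1, 0],   # candidate B violates C1 once
]

# Invented weights; C1 is weighted far more heavily than C2.
weights = [-10.0, -2.0]

print(maxent_probs(weights, violations))
# -> roughly [0.9997, 0.0003]: candidate A, which only violates the
#    weakly weighted constraint, receives nearly all the probability mass.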
Note that this Maximum Entropy model of phonology differs from standard Optimality
Theory in that constraint weights are additive in log probability. As a result, many violations
of lower-ranked constraints may outweigh fewer violations of higher-ranked constraints. This
is a property shared by the recent Linear Optimality Theory (Keller, 2000), as well as the
earlier theory of Harmonic Grammar (Legendre et al., 1990), on which OT is based.1 The
property of additivity makes the MaxEnt model more powerful and less restrictive than
standard OT. When there is sufficient distance between the constraint weights and a finite
bound on the number of constraint violations, the MaxEnt model simulates standard OT (see
Johnson (2002) for an explicit formula for the weights). The model can therefore account
for categorical grammars where a single violation of a highly ranked constraint outweighs
any number of violations of lower ranked constraints. However, by assigning closely spaced
constraint weights, the MaxEnt model can also produce grammars with variable outputs,
or gradient grammaticality effects caused by cumulative constraint violations (Keller, 2000;
Keller and Asudeh, 2002). The Gradual Learning Algorithm (GLA; Boersma, 1997) is able to model
grammars with free variation, but, like standard OT, it cannot account for these cumulative
constraint violation effects.
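The effect of weight spacing can be seen in a small numerical example (violation counts and weights invented, probabilities computed exactly as in the previous sketch):

import math

def maxent_probs(weights, violations):
    # Pr(y|x) from equation (1), as in the previous sketch.
    scores = [math.exp(sum(w * f for w, f in zip(weights, cand)))
              for cand in violations]
    z = sum(scores)
    return [s / z for s in scores]

# Candidate 1 violates C1 once; candidate 2 violates C2 twice.
violations = [[1, 0], [0, 2]]

# Widely spaced weights: the single C1 violation is decisive, and the grammar
# behaves categorically, as in standard OT with C1 ranked above C2.
print(maxent_probs([-20.0, -1.0], violations))   # ~[0.00, 1.00]

# Closely spaced weights: the two C2 violations together now outweigh the
# single C1 violation, so neither candidate wins outright and the output
# varies (a cumulativity effect that strict domination cannot produce).
print(maxent_probs([-1.5, -1.0], violations))    # ~[0.62, 0.38]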
Given the generic Maximum Entropy model, we still need to find the correct constraint
weights for a given set of training data. We can do this using maximum likelihood estimation
on the conditional likelihood (or pseudo-likelihood) of the observed output forms given their inputs:
\mathrm{PL}_{\bar{w}}(\bar{y} \mid \bar{x}) = \prod_{j=1}^{n} \Pr_{\bar{w}}(Y = y_j \mid x(Y) = x_j) \qquad (2)
Here, ȳ = y1 . . . yn are the winning output forms for each of the n training examples
in the corpus, and the xj are the corresponding input forms. So the pseudo-likelihood of
the training corpus is simply the product of the conditional probabilities of each output form
given its input form. As with ordinary maximum likelihood estimation, we can maximize
the pseudo-likelihood function by taking its log and finding the maximum using any standard
optimization algorithm. In the experiments below, we used the Conjugate Gradient algorithm
(Press et al., 1992).
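The following sketch illustrates this estimation step. It is not the authors' code, and it uses SciPy's Conjugate Gradient routine rather than the Numerical Recipes implementation they cite; the toy dataset is invented, with a single input form whose two candidates surface in a 70/30 split.

import numpy as np
from scipy.optimize import minimize
from scipy.special import logsumexp

# Violation counts for the two candidates of the single (invented) input:
# candidate 0 violates C1 once; candidate 1 violates C2 twice.
candidates = np.array([[1.0, 0.0],
                       [0.0, 2.0]])

# Ten training examples: the observed winner is candidate 0 seven times
# and candidate 1 three times.
training_data = [(candidates, 0)] * 7 + [(candidates, 1)] * 3

def neg_log_pseudolikelihood(w):
    """Negative log pseudo-likelihood, -log PL(w) from equation (2)."""
    total = 0.0
    for viols, winner in training_data:
        scores = viols @ w                            # sum_i w_i f_i(y, x) per candidate
        total += scores[winner] - logsumexp(scores)   # log Pr(winner | x)
    return -total

# Maximize log PL by minimizing its negative with Conjugate Gradient.
result = minimize(neg_log_pseudolikelihood, x0=np.zeros(2), method="CG")
w = result.x

# The fitted distribution over candidates reproduces the 70/30 training split.
print(np.exp(candidates @ w - logsumexp(candidates @ w)))   # ~[0.7, 0.3]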
To prevent overfitting the training data, we introduce a regularizing bias term, or prior,
as described in Johnson et al. (1999). The prior for each weight wi is a Gaussian distribution
with mean µi and standard deviation σi that is multiplied by the pseudo-likelihood in (2). In
terms of the log likelihood, the prior term is a quadratic, so our learning algorithm finds the
wi that maximize the following objective function:
\log \mathrm{PL}_{\bar{w}}(\bar{y} \mid \bar{x}) \;-\; \sum_{i=1}^{m} \frac{(w_i - \mu_i)^2}{2\sigma_i^2} \qquad (3)
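A minimal sketch of the prior term in (3), and of how it combines with the negative log pseudo-likelihood from the previous sketch; the value of σ used here is purely illustrative.

import numpy as np

def prior_penalty(w, mu=0.0, sigma=10.0):
    """Gaussian prior term from equation (3): sum_i (w_i - mu_i)^2 / (2 sigma_i^2).
    mu and sigma are shared across constraints, as in the experiments reported
    below; sigma = 10.0 is an arbitrary illustrative value."""
    w = np.asarray(w, dtype=float)
    return np.sum((w - mu) ** 2) / (2.0 * sigma ** 2)

# Maximizing (3) is the same as minimizing -log PL(w) + prior_penalty(w), so
# the objective handed to the optimizer in the previous sketch becomes:
def regularized_objective(w, neg_log_pl, sigma=10.0):
    return neg_log_pl(w) + prior_penalty(w, sigma=sigma)

print(prior_penalty([3.0, -4.0]))   # (9 + 16) / (2 * 10**2) = 0.125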
For simplicity, the experiments reported here were conducted using the same prior for
each constraint weight, with µi = 0 and σi = σ. (For possible theoretical implications of
this choice, see Section 4.1.) Informally, this prior specifies that zero is the default weight of
any constraint (which means the constraint has no effect on the output), so we can vary how
closely the model fits the data by varying the standard deviation, σ. Lower values of σ give
a more peaked prior distribution and require more data to force the constraint weights away
from zero, while higher values give a better fit with less data, but may result in overfitting
the data. In particular, multiplying the number of training examples by a factor of r (while
keeping the empirical distribution fixed) and simultaneously reducing σ by a factor of √r will
yield the same learned weights. In other words, if we vary n and σ but hold nσ² constant, the
parameter weights learned by the MaxEnt model will be the same.

1 In fact, the Harmony function from Harmonic Grammar is simply log Pr(y|x) in (2) (Smolensky and Legendre, 2002).

Constraint            Weight
*RTR HI                33.89
PARSE[RTR]             17.00
GESTURE[CONTOUR]       10.00
PARSE[ATR]              3.53
*ATR LO                 0.41
3. Experimental Results
We ran experiments on two different sets of data, one categorical and one stochastic. Both
datasets are available as part of the Praat program (Boersma and Weenink, 2000). In this
section, we describe our experimental results and compare them to the results of the GLA on
the same datasets, as reported in Boersma (1999) and Boersma and Hayes (2001) (henceforth B&H).
Table 2. Constraint violation patterns of four of B&H's classes, with example words

For each class, we report the percentage of output forms of that class in the training data belonging to the majority
majority output. For example, in class 2, 100% of the output forms belong to the majority
output (in this case, /-iden/), whereas in class 4, the outputs are split 70/30 (the more common
ending in this case happens to be /-jen/). The “GLA” and “MaxEnt” columns show the
percentage of forms produced by these algorithms that match the majority output forms in
the training data. The MaxEnt results are for nσ² = 569,800. The GLA results are those
reported in B&H, and reflect an average taken over 100 separate runs of the algorithm. During
each run, the algorithm was presented with 388,000 training examples; input forms were drawn
according to their empirical frequencies in the corpus, as were the output forms for each input.
The training examples were presented in five
groups. The initial plasticity was set to 2.0, but was reduced after each group of examples, to
a final value of 0.002. The noise value began at 10.0 for the first group of training examples,
and was set to 2.0 for the remaining examples. In their paper, Boersma and Hayes argue
that reducing the plasticity corresponds to the child’s decreasing ability to learn with age, but
give no justification for the change in noise level. In any case, it is not clear how they chose
the particular training schedule they report, or whether other training schedules would yield
significantly different results. We discuss these points further in Section 4.4.
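To illustrate the role of the plasticity and noise parameters, here is a schematic sketch of the GLA's error-driven update as we understand it from Boersma and Hayes (2001). It is not their implementation; the data structures are invented, and plasticity and noise are held constant rather than following B&H's schedule.

import random

def gla_learn(tableaux, data, n_constraints, plasticity=1.0, noise=2.0,
              initial_ranking=100.0, seed=0):
    """Schematic Gradual Learning Algorithm loop. `tableaux[x]` is the list of
    candidate violation vectors for input x; `data` is a sequence of
    (input, index_of_observed_winner) pairs. Returns the ranking values."""
    rng = random.Random(seed)
    ranking = [initial_ranking] * n_constraints
    for x, observed in data:
        cands = tableaux[x]
        # Stochastic evaluation: perturb each ranking value with Gaussian
        # noise of the given standard deviation.
        noisy = [r + rng.gauss(0.0, noise) for r in ranking]
        order = sorted(range(n_constraints), key=lambda i: -noisy[i])
        # Standard OT evaluation under the noisy ranking: compare candidates'
        # violation counts lexicographically, highest-ranked constraint first.
        learner = min(range(len(cands)),
                      key=lambda c: tuple(cands[c][i] for i in order))
        if learner != observed:
            for i in range(n_constraints):
                if cands[observed][i] > cands[learner][i]:
                    ranking[i] -= plasticity   # constraint favors the wrong form: demote
                elif cands[observed][i] < cands[learner][i]:
                    ranking[i] += plasticity   # constraint favors the observed form: promote
    return ranking

# Invented usage: one input, two candidates, observed with a 70/30 split.
tableaux = {"x": [[1, 0], [0, 1]]}
data = [("x", 0)] * 7 + [("x", 1)] * 3
print(gla_learn(tableaux, data, n_constraints=2))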
4. Discussion
In this section, we discuss some of the theoretical implications of our work and the question
of generalization. We then compare the results presented for the GLA and MaxEnt model and
argue in favor of the MaxEnt model on formal and practical grounds.
5. Conclusions
In this paper we have presented a new way of modeling constraint-based phonology using the
statistical framework of the Maximum Entropy model. We have shown that this model, in
conjunction with standard optimization algorithms, can learn both categorical and stochastic
grammars from a training corpus of input/output pairs. Its performance on these tasks is
similar to that of the GLA. We have not yet added any assumptions about the initial state or
learning path taken by the MaxEnt model, but we have described how this could easily be
done by changing the priors of the model or the optimization algorithm used.
In addition to these empirical facts about the MaxEnt model, we wish to emphasize its
strong theoretical foundations. Unlike the GLA, which is a somewhat ad hoc model designed
specifically for learning OT constraint rankings, the MaxEnt model is a very general statistical
model with an information theoretic justification that has been used successfully for many
different types of learning problems. The MaxEnt model also has fewer parameters than the
GLA and does not require complicated training schedules. Given our positive results so far
and the success of Maximum Entropy models for other types of machine learning tasks, we
believe that this model is worth pursuing as a framework for probabilistic constraint-based
phonology.
References
Arto Anttila. 1997a. Deriving variation from grammar: a study of Finnish genitives. In F. Hinskens, R. van
Hout, and L. Wetzels, editors, Variation, change and phonological theory, pages 35–68. John Benjamins,
Amsterdam. Rutgers Optimality Archive ROA-63.
Arto Anttila. 1997b. Variation in Finnish phonology and morphology. Ph.D. thesis, Stanford Univ.
Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. 1996. A maximum entropy approach to
natural language processing. Computational Linguistics, 22(1):39–71.
Paul Boersma and Bruce Hayes. 2001. Empirical tests of the gradual learning algorithm. Linguistic Inquiry,
32(1):45–86.
Paul Boersma and Clara Levelt. 1999. Gradual constraint-ranking learning algorithm predicts acquisition order.
In Proceedings of the 30th Child Language Research Forum.
Paul Boersma and David Weenink. 2000. Praat, a system for doing phonetics by computer.
http://www.praat.org.
Paul Boersma. 1997. How we learn variation, optionality, and probability. In Proceedings of the Institute of
Phonetic Sciences of the Univ. of Amsterdam, volume 21, pages 43–58.
Paul Boersma. 1999. Optimality-theoretic learning in the Praat program. In Proceedings of the Institute of
Phonetic Sciences of the Univ. of Amsterdam, volume 23, pages 17–35.
Jason Eisner. 2000. Review of Kager: “Optimality Theory”. Computational Linguistics, 26(2):286–290.
Bruce Hayes. 2000. Gradient well-formedness in optimality theory. In J. Dekkers, F. van der Leeuw, and
J. van de Weijer, editors, Optimality Theory: Phonology, Syntax, and Acquisition. Oxford University Press,
Oxford.
Frederick Jelinek. 1997. Statistical Methods for Speech Recognition. MIT Press, Cambridge, MA.
Mark Johnson, Stuart Geman, Stephen Canon, Zhiyi Chi, and Stefan Riezler. 1999. Estimators for stochastic
‘unification-based’ grammars. In Proceedings of the 37th Annual Meeting of the Association for
Computational Linguistics.
Mark Johnson. 2002. Optimality-theoretic Lexical Functional Grammar. In Paula Merlo and Susan Stevenson,
editors, The Lexical Basis of Sentence Processing: Formal, Computational and Experimental Issues, pages
59–74. John Benjamins, Amsterdam, The Netherlands.
Frank Keller and Ash Asudeh. 2002. Probabilistic learning algorithms and optimality theory. Linguistic Inquiry,
33(2):225–244.
Frank Keller. 2000. Gradience in grammar: Experimental and computational aspects of degrees of
grammaticality. Ph.D. thesis, Univ. of Edinburgh.
Géraldine Legendre, Yoshiro Miyata, and Paul Smolensky. 1990. Harmonic grammar: A formal multi-level
connectionist theory of linguistic well-formedness: Theoretical foundations. Technical Report 90-5,
Institute of Cognitive Science, Univ. of Colorado.
David Marr. 1982. Vision. W.H. Freeman and Company, New York.
Naomi Nagy and Bill Reynolds. 1997. Optimality theory and variable word-final deletion in Faetar. Language
Variation and Change, 9:37–55.
William Press, Saul Teukolsky, William Vetterling, and Brian Flannery. 1992. Numerical Recipes in C: The Art
of Scientific Computing. Cambridge University Press, Cambridge, England, 2nd edition.
Alan Prince and Paul Smolensky. 1993. Optimality theory: Constraint interaction in generative grammar.
Technical Report TR-2, Rutgers Center for Cognitive Science, Rutgers Univ.
Alan Prince and Bruce Tesar. 1999. Learning phonotactic distributions. Technical Report TR-54, Rutgers
Center for Cognitive Science, Rutgers Univ. Rutgers Optimality Archive ROA-353.
Douglas Pulleyblank and William J. Turkel. 1996. Optimality theory and learning algorithms: The
representation of recurrent featural asymmetries. In J. Durand and B. Laks, editors, Current trends in
phonology: Models and methods, pages 653–684. Univ. of Salford.
Paul Smolensky and Géraldine Legendre. 2002. The harmonic mind: From neural computation to optimality-
theoretic grammar. Book draft.
Bruce Tesar and Paul Smolensky. 1993. The learnability of optimality theory: An algorithm and some basic
complexity results. Ms., Department of Computer Science and Institute of Cognitive Science, Univ. of
Colorado, Boulder. Rutgers Optimality Archive ROA-2.