Chapter 5

A Probabilistic Model of Adjective-Noun Ambiguity
In this chapter we further evaluate the probabilistic model introduced in the previous chapter by
looking at polysemous adjective-noun combinations. More specifically, we concentrate on polysemous
adjectives whose meaning varies depending on the noun they modify (e.g., difficult ).
In contrast to the previous chapter, where we used a linguistic classification (i.e., Levin 1993) as
an inventory of the meanings of verbs in relation to their arguments, we derive the meanings of
adjective-noun combinations directly from the corpus. The acquired meanings and their ranking
are evaluated against human intuitions. We conduct an experiment which provides evidence
that our model produces a preference ordering on the meanings of adjective-noun combinations
which correlates reliably with human judgments.
5.1. Introduction
The semantic properties of adjectives have been extensively studied in the theoretical linguistics
literature. Adjectives exhibit a widely polymorphic behavior, aspects of which several semantic
classifications have attempted to capture. A well-known classification of adjectives is based
on their logical behavior and divides adjectives into three classes: extensional, intensional,
and scalar (Chierchia and McConnell-Ginet 1990). Extensional adjectives denote properties;
when they modify a noun the meaning of the adjective-noun phrase is the intersection of the
semantics contributed by the noun and the adjective. Red is a typical example of an extensional
adjective: a red dress is interpreted as the intersection of the set of red things and the set
of dresses. Intensional adjectives are property-modifying (i.e., they denote a function from
properties to properties). For example, the adjective-noun phrase former president does not
denote the individual that is a president and former; instead, it denotes the individual that
was president in a preceding term. Scalar adjectives denote properties relative to a norm or a
standard of comparison. For example, a small elephant is small for an elephant or small as
elephants go (i.e., it is not a small animal).
Adjectives are also classified in terms of their gradability (Lyons 1977; Quirk et al.
1985). Gradable adjectives denote a property that can vary by degrees (e.g., deep, fast, big ).
Gradability can be indicated by the use of degree modifiers (e.g., very, much, highly, extremely
) and by comparison markers (i.e., the addition of the suffixes -er and -est or by modification
with more and most ). The modifier locates the adjective on a scale of comparison, at
a position higher or lower than the one indicated by the adjective alone. Adjectives denoting
the highest position on a scale are non-gradable (e.g., dead, principal, pregnant ). Note that an
adjective can be extensional and gradable (e.g., an extremely red dress ) or scalar and gradable
(e.g., a very small elephant ).
An important semantic relation holding between pairs of adjectives is antonymy,
i.e., semantic opposition. Antonymy is the basis for the semantic organization of adjectives in
WordNet (Miller et al. 1990). Related to antonymy are the semantic orientation and markedness
of adjectives (Lyons 1977). The orientation usually indicates whether the adjective receives a
positive or negative interpretation. For example, the adjectives intelligent and simple have a
positive orientation, whereas the adjectives stupid and simplistic have a negative orientation.
Given two contrasting adjectives the unmarked adjective denotes a generic property without
explicitly making reference to a norm or a standard (e.g., the adjective tall in the question How
tall is Peter? ); the marked adjective denotes a property that deviates from the norm (e.g., the
adjective short in the question How short is Peter? ). Note that the orientation of an adjective
is highly dependent on contextual or pragmatic factors. For example, simple can have a positive
orientation when contrasted to complicated and a negative orientation when contrasted to
elegant.
Semantically, adjectives, more than other categories, are able to take on different meanings
depending on the noun they modify (Lahav 1989; Pustejovsky 1995; Sapir 1944; Vendler
1968). Consider the adjective red for example. A red ball is a ball which is colored red, a red
party is a left-wing party, and a red pen is a pen which writes red. Adjectives which denote physical
properties behave similarly: a hot car has a hot engine (or perhaps a hot interior), a hot dish
is spicy, and hot water is warm. Adjectives like difficult, easy, fast, or good pattern with hot
and red in that they receive different meanings when modifying different nouns. Furthermore,
they display different meanings even when they modify a single noun: these adjectives are ambiguous
across and within the nouns they modify. Consider the examples in (5.1). The meaning
of the adjective fast varies depending on whether it modifies the nouns programmer, scientist,
or plane. A fast programmer is typically “a programmer who programs quickly”, a fast plane
is typically “a plane that flies quickly”, a fast scientist can be “a scientist who publishes papers
quickly”, “who performs experiments quickly”, “who observes something quickly”, “who reasons,
thinks, or runs quickly”. Similarly, a fast plane is not only “a plane that flies quickly”,
but also “a plane that lands, takes off, turns, or travels quickly”. Even the more restrictive fast
programmer allows more than one interpretation. As shown in (5.3), taken from Lascarides and
Copestake (1998: 394), the discourse context triggers the interpretation of “a programmer who
runs fast”.
(5.1) a. fast programmer
b. fast plane
c. fast scientist
(5.2) a. easy problem
b. difficult language
c. good cook
d. good soup
(5.3) a. All the office personnel took part in the company sports day last week.
b. One of the programmers was a good athlete, but the other was struggling to finish
the courses.
c. The fast programmer came first in the 100m.
Adjectives like fast have been extensively studied in the lexical semantics literature
(Bouillon 1997; Lahav 1989; Pustejovsky 1995) and their properties have been known at least
since Vendler (1968). The meaning of adjective-noun combinations like those in (5.1) and (5.2)
are usually paraphrased with a verb modified by the adjective in question or its corresponding
adverb. For example, an easy problem is “a problem that is easy to solve” or “a problem that
one can solve easily”. As Vendler (1968: 92) points out, in most cases not one verb but a family
of verbs is needed to account for the meaning of adjective-noun combinations like those in (5.1)
and (5.2). Vendler further observes that the noun figuring in an adjective-noun combination is
usually the subject or object of the paraphrasing verb. Although the adjective fast usually triggers
a verb-subject interpretation (see the examples in (5.1)), the adjectives easy and difficult
trigger a verb-object interpretation (see the examples in (5.2a,b)). An easy problem is usually
“a problem that is easy to solve”, whereas a difficult language is “a language that is difficult
to learn, speak, write, or understand”. Adjectives like good allow either verb-subject or verb-object
interpretations: a good cook is “a cook who cooks well”, whereas good soup is “soup
that tastes good” or “soup that is good to eat”.
The polysemy of adjectives like fast or easy has led Pustejovsky (1995) to argue
against lexicons which describe lexical meaning simply by sense enumeration. Deriving the
meaning of adjective-noun constructions like (5.1) and (5.2) within a sense enumerative framework
means that distinct senses have to be provided for each noun or, more generally, for each
noun class the adjective modifies. Pustejovsky’s account of adjectival polysemy relies on the
fact that the meaning of adjectives like easy is determined largely by the semantics of the noun
they modify. Pustejovsky assumes that nouns have a qualia structure as part of their lexical
entries, which among other things, specifies possible events associated with the entity. For example,
the telic (purpose) role of the qualia structure for problem has a value equivalent to
solve. Adjectives can be seen as modifying only one or a subset of the qualia for a noun. The
adjective easy is an event predicate, i.e., it selectively modifies the events associated with the
nouns it is in construction with. When easy is combined with problem, it predicates over the
telic role of problem and consequently the adjective-noun combination receives the interpretation
“a problem that is easy to solve”. Pustejovsky calls the semantic process of selecting and
operating on a specific substructure of a lexical entry Selective Binding. Note that in Pustejovsky’s
framework the polysemy of adjectives like easy is accounted for by lexical processes
operating on lexical entries, thus avoiding the proliferation of senses via enumeration.
Pustejovsky (1995) does not give an exhaustive list of the telic roles a given noun may
have. In contrast to Vendler (1968), who acknowledges the fact that adjective-noun combinations
like the ones in (5.1) and (5.2) trigger more than one interpretation (in other words,
there may be more than one possible event associated with the noun modified by the adjective
in question), Pustejovsky implicitly assumes that nouns or noun classes have one—perhaps
default—telic role. Although the number of possible interpretations for adjective-noun combinations
like fast scientist are virtually unlimited, some interpretations are more likely than
others. Out of context, fast scientist is more likely to be interpreted as “a scientist who performs
experiments quickly” or “who publishes quickly” rather than as “a scientist who draws
or drinks quickly”.
In this chapter we focus on polysemous adjective-noun combinations (see (5.1) and
(5.2)) and attempt to address the following questions: (a) Can the meanings of these adjective-noun
combinations be acquired automatically from corpora? (b) Can we constrain the number
of interpretations by providing a ranking on the set of possible meanings? (c) Can we determine
if an adjective has a preference for a verb-subject or verb-object interpretation? We provide a
probabilistic model (based on the model introduced in Chapter 4) which combines distributional
information about how likely it is for any verb to be modified by the adjective in the
adjective-noun combination or its corresponding adverb with information about how likely it
is for any verb to take the modified noun as its object or subject. As in Chapter 4, we obtain
quantitative information about verb-adjective modification and verb-argument relations from
the BNC via partial parsing. Our results not only show that we can predict meaning differences
when the same adjective modifies different nouns, but we can also derive—taking into account
Vendler’s (1968) observation—a cluster of meanings for a single adjective-noun combination.
We evaluate our results by comparing the model’s predictions against human judgments
and show that the model’s ranking of meanings correlates reliably with human intuitions:
meanings that are found highly probable by the model are also rated as plausible by the
subjects. Furthermore, we demonstrate that the model’s predictions can be used to arrive at a tripartite
distinction of adjectives depending on the type of paraphrase they prefer: subject-biased
adjectives tend to modify nouns which act as subjects of the paraphrasing verb, object-biased
adjectives tend to modify nouns which act as objects of the paraphrasing verb, whereas equi-biased
adjectives display no preference for either argument role. We show that the argument
preferences predicted by the model correspond to preferences displayed by humans.
In the following sections we present our probabilistic model of adjective-noun ambiguity
and describe the model parameters (see Section 5.2). In Experiment 7 we demonstrate the
properties of the model using examples from the literature (see Section 5.3). In Experiment 8
we use the model to derive the meaning paraphrases for adjective-noun combinations randomly
selected from the BNC (see Section 5.4) and formally evaluate the results against human intuitions
(see Section 5.4.2). Finally, in Experiment 9 we demonstrate that when compared against
human judgments our model outperforms a naive baseline in deriving a preference ordering for
the meanings of polysemous adjective-noun combinations (see Section 5.5).
5.2. The Model
5.2.1. Formalization of Adjective-Noun Polysemy
Consider again the adjective-noun combinations in (5.1) and (5.2). In order to come up with
the meaning of “plane that flies quickly” for fast plane we would like to find in the corpus
a sentence whose subject is the noun plane or planes and whose main verb is fly, which in
turn is modified by the adverb fast or quickly. In the general case we would like to find in the
corpus sentences indicating what planes do fast. Similarly, for the adjective-noun combination
fast scientist we would like to find in the corpus information indicating what the activities
that scientists perform fast are, whereas for easy problem we need information about what one
can do with problems easily (e.g., one can solve problems easily) or about what problems are
(e.g., easy to solve or set).
In sum, in order to come up with a paraphrase of the meaning of an adjective-noun
combination we need to know which verbs take the head noun as their subject or object and are
modified by an adverb corresponding to the modifying adjective. This can be expressed as the
joint probability P(a, n, rel, v), where v is the verbal predicate modified by the adverb a (directly
derived from the adjective present in the adjective-noun combination) bearing the argument
relation rel (i.e., subject or object) to the head noun n. By choosing the ordering ⟨v, n, a, rel⟩ for
the variables a, n, rel, and v, we can rewrite P(a, n, rel, v) (using the chain rule) as follows:

P(a, n, rel, v) = P(v) P(n | v) P(a | v, n) P(rel | v, n, a) (5.4)

Although the parameters P(v) and P(n | v) can be straightforwardly estimated from the BNC,
this is not the case for the remaining terms; the maximum likelihood estimate of P(rel | v, n, a),
for example, is:

P(rel | v, n, a) = f(v, n, a, rel) / f(v, n, a) (5.5)
One way to estimate f(v, n, a, rel) would be to fully parse the corpus so as to identify
the verbs which take the head noun n as their subject or object and are modified by the adverb
a. For the adjective-noun combination fast plane there are only six sentences in the entire BNC
that could be used for the estimation of f(v, n, a, rel) (see the examples in (5.6)). According to
these sentences the most likely interpretation for fast plane is “a plane that goes fast” (see examples
(5.6a)–(5.6c)). The interpretations “plane that swoops in fast”, “plane that drops down
fast”, and “plane that flies fast” are all equally likely, since each is attested in the corpus only
once (see examples (5.6d)–(5.6f)). This is rather unintuitive, since fast planes are more likely to
fly than to swoop in fast. For the adjective-noun combination fast programmer there is only one
sentence relevant for the estimation of f(v, n, a, rel), in which the modifying adverbial is not
fast but the semantically related quickly (see example (5.7)). The sparse data problem carries
over to the estimation of the frequency f(v, n, a).
We assume that the likelihood of seeing an adverbial modifying a verb bearing an argument
relation to a noun is independent of that specific noun (see equation (5.8)); in other words,
we only estimate the likelihood of a verb being modified by a particular adverb or adjective.
Accordingly, we assume that the likelihood of the argument relation rel given a verb v that
takes a noun n as its subject or object and is modified by an adverb a is independent of the
adverb a (see equation (5.9)):

P(a | v, n) ≈ P(a | v) (5.8)

P(rel | v, n, a) ≈ P(rel | v, n) (5.9)

By substituting (5.8) and (5.9) into (5.4), P(a, n, rel, v) can be written as:

P(a, n, rel, v) ≈ P(v) P(n | v) P(a | v) P(rel | v, n) (5.10)
The parameters in (5.10) are estimated as follows:

P(v) = f(v) / Σᵢ f(vᵢ) (5.11)

P(n | v) = f(n, v) / f(v) (5.12)

P(a | v) = f(a, v) / f(v) (5.13)

P(rel | v, n) = f(rel, v, n) / f(v, n) (5.14)

By substituting equations (5.11)–(5.14) into (5.10) and simplifying the relevant terms, (5.10)
is rewritten as follows:

P(a, n, rel, v) ≈ f(rel, v, n) · f(a, v) / (f(v) · Σᵢ f(vᵢ)) (5.15)
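The simplification leading to (5.15) can be verified by writing out the substitution of the estimates (5.11)–(5.14) into (5.10); the noun-verb frequencies cancel, as does one factor of f(v):

```latex
P(a,n,rel,v) \;\approx\;
  \frac{f(v)}{\sum_i f(v_i)}\cdot
  \frac{f(n,v)}{f(v)}\cdot
  \frac{f(a,v)}{f(v)}\cdot
  \frac{f(rel,v,n)}{f(v,n)}
  \;=\;
  \frac{f(rel,v,n)\,f(a,v)}{f(v)\,\sum_i f(v_i)}
```

since f(n, v) = f(v, n).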
Assume we want to discover a meaning paraphrase for the adjective-noun combination fast
plane. We need to find the verb or verbs v and the relation rel (i.e., subject or object) that maximize
the term P(fast, plane, rel, v). Table 5.1 gives a list of the most frequent verbs modified by
the adverb fast in the BNC (see the term f(a, v) in equation (5.15)), whereas Table 5.2 lists the
verbs for which the noun plane is the most likely object or subject (see the term f(rel, v, n) in
equation (5.15)). We describe how the frequencies f(rel, v, n), f(a, v), and f(v) were estimated
in Section 5.2.2.
Table 5.1: Most frequent verbs modified by the adverb fast

v        f(fast, v)   v          f(fast, v)
go       29           work       6
grow     28           grow in    6
beat     27           learn      5
run      16           happen     5
rise     14           walk       4
travel   13           think      4
move     12           keep up    4
come     11           fly        4
drive    8            fall       4
get      7            disappear  4

Table 5.2: Most frequent verbs taking as an argument the noun plane

v        f(SUBJ, v, plane)   v        f(OBJ, v, plane)
fly      20                  catch    24
come     17                  board    15
go       15                  take     14
take     14                  fly      13
land     9                   get      12
touch    8                   have     11
make     6                   buy      10
arrive   6                   use      8
leave    5                   shoot    8
begin    5                   see      7
5.2.2. Parameter Estimation
We estimated the parameters of the model outlined in the previous section from a part-of-speech
tagged and lemmatized version of the BNC. The estimation of the terms f(v) and Σᵢ f(vᵢ)
(see (5.15)) reduces to the number of times a given verb is attested in the corpus. In order to
estimate the terms f(rel, v, n) and f(a, v), the corpus was automatically parsed by Cass (Abney
1996), a robust chunk parser designed for the shallow analysis of unrestricted text (for a detailed
description of the parser see Section 2.3.2 in Chapter 2). Section 5.2.2.1 details how the
frequencies f(rel, v, n) and f(v, n) were acquired.
From the parser’s output we extracted tuples representing verb-subjects
and verb-objects (see the examples in (5.16)). The tuples obtained from the parser’s
output are an imperfect source of information about argument relations. Bracketing errors as
well as errors in identifying chunk categories accurately result in extracting tuples whose lexical
items do not stand in a verb-argument relationship. For example, the verb is missing from
tuples (5.17a,b) (people and whose are identified as subjects of isolated and behalf, respectively),
the noun is missing from tuples (5.17c,d) (the adverb there is the subject of drink,
the adjective good is the subject of smile), and both the verb and the noun are missing from
tuple (5.17e) (where the relative pronoun who is the subject of the adjective ill).
(5.16) a. change situation SUBJ
b. analyse participant SUBJ
c. come off heroin OBJ
d. appear on screen OBJ
e. deal with situation OBJ
(5.17) a. isolated people SUBJ
b. behalf whose SUBJ
c. drink there SUBJ
d. smile good SUBJ
e. ill who SUBJ
(5.18) a. alten aus SUBJ
b. rolex symbol SUBJ
In order to compile a comprehensive count of verb-argument relations we tried to
eliminate from the parser’s output tuples containing erroneous verbs and nouns like those in
(5.17). We did this by matching the verbs contained in the tuples against a list of all words
tagged as verbs, and the nouns in the tuples against a list of all nouns in the BNC. Tuples
containing words not included in the list were discarded. Furthermore, tuples containing verbs
or nouns attested in a verb-argument relationship only once were also discarded, since they
were mostly tagging or parsing mistakes. See the examples in (5.18) where alten and rolex
are tagged as verbs (instead of nouns) and aus is mistakenly tagged as a noun. Finally, non-auxiliary
instances of the verb be (e.g., be embassy OBJ, be prawn SUBJ) were eliminated, since
they contribute no semantic information with respect to the events or states that are possibly
associated with the noun with which the adjective is combined.
Particle verbs (see (5.16c)) were included in verb-subject and verb-object tuples only
if the particle was adjacent to the verb. Verbs followed by the preposition by and a head noun
were extracted and counted as instances of verb-subject relations. The verb-object tuples also
included prepositional objects (see (5.16d,e)). It was assumed that PPs adjacent to the verb
headed by either of the prepositions in, to, for, with, on, at, from, of, into, through, upon were
prepositional objects. This resulted in 737,390 types of verb-subject pairs and 1,078,053 types
of verb-object pairs (see Table 5.3 which contains information about the tuples extracted from
the corpus before and after the filtering).
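The filtering steps described above can be sketched as follows (a minimal sketch with hypothetical names and toy data; the thesis applies this procedure to the full set of parser tuples):

```python
from collections import Counter

def filter_tuples(tuples, bnc_verbs, bnc_nouns):
    """tuples: (verb, noun, rel) triples from the parser's output.

    Keep only tuples whose verb and noun occur in the BNC word lists,
    then drop tuples whose verb or noun is attested in a verb-argument
    relationship only once (mostly tagging or parsing mistakes)."""
    kept = [(v, n, r) for v, n, r in tuples
            if v in bnc_verbs and n in bnc_nouns]
    verb_freq = Counter(v for v, n, r in kept)
    noun_freq = Counter(n for v, n, r in kept)
    return [(v, n, r) for v, n, r in kept
            if verb_freq[v] > 1 and noun_freq[n] > 1]

# Toy data: "alten aus" (cf. (5.18a)) is removed because alten is not in
# the verb list; the singleton change/situation tuple is dropped as a hapax.
tuples = [("solve", "problem", "OBJ"), ("solve", "problem", "OBJ"),
          ("change", "situation", "SUBJ"), ("alten", "aus", "SUBJ")]
print(filter_tuples(tuples, {"solve", "change"}, {"problem", "situation"}))
```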
Generally speaking, the frequency f(a, v) represents not only a verb modified by an
adverb derived from the adjective in question (see example (5.19a)) but also constructions like
Table 5.3: Tuples extracted from the BNC

                   Tokens                  Types
Relation   Parser      Filtering   Tuples      Verbs    Nouns
SUBJECT    4,759,950   4,587,762   737,390     14,178   25,900
OBJECT     3,723,998   3,660,897   1,078,053   12,026   35,867
the ones shown in (5.19b,c), where the adjective takes an infinitival VP complement whose
logical subject can be realized as a for-PP (see example (5.19c)). In cases of verb-adverb modification
we assume access to morphological information which specifies what counts as a valid
adverb for a given adjective. In most cases adverbs are formed by adding the suffix -ly to the
base of the adjective (e.g., slow-ly, easi-ly). Some adjectives have adverbs identical in form to
the adjective (e.g., fast, right). Others have idiosyncratic adverbs (e.g., the adverb of good is
well). It is relatively straightforward to develop an automatic process which maps an adjective
to its corresponding adverb, modulo exceptions and idiosyncrasies; however, in the experiments
described in the following sections this mapping was manually specified.
(5.19) a. comfortable chair → a chair on which one sits comfortably
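The adjective-to-adverb mapping just described could be sketched as a rule-based procedure (a hypothetical helper; as noted above, the experiments used a manually specified mapping):

```python
# Identical-form and idiosyncratic adverbs are listed explicitly; the
# suffix rules below cover regular -ly formation with common spelling
# changes. Exceptions beyond this list would still need manual handling.
IRREGULAR = {"good": "well", "fast": "fast", "right": "right", "hard": "hard"}

def adjective_to_adverb(adj):
    if adj in IRREGULAR:
        return IRREGULAR[adj]
    if adj.endswith("y"):              # easy -> easily
        return adj[:-1] + "ily"
    if adj.endswith("le"):             # comfortable -> comfortably
        return adj[:-1] + "y"
    if adj.endswith("ic"):             # basic -> basically
        return adj + "ally"
    return adj + "ly"                  # slow -> slowly

print(adjective_to_adverb("easy"), adjective_to_adverb("comfortable"))
```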
Note that in cases where the adverb does not immediately succeed the verb (see sentence
(5.7)) the parser is not guaranteed to produce a correct analysis, due to its simple strategy
of leaving ambiguities unattached. In order to estimate the frequency f(a, v) we only looked
at instances where the verb and the adverbial phrase modifying it (AdvP) were adjacent. More
specifically, in cases where the parser identified an AdvP following a VP, we extracted the verb
and the head of the AdvP (see examples (5.20b), (5.21b), (5.22c)). In cases where the AdvP
was not explicitly identified, we extracted the verb and the adverb immediately following or
preceding it (see examples (5.20a), (5.21a), (5.22a), and (5.22b)), assuming that the verb and
the adverb stand in a modification relation. The examples below illustrate the parser’s output
and the information that was extracted for the estimation of the quantity f(a, v).
(5.20) a. [NP Some art historians] [VP write] well [PP about the present.] → write well
b. [NP Oriental art] [VP came] [AdvP more slowly.] → come slowly
(5.21) a. [NP The accidents] [VP could have been easily avoided.] → avoid easily
b. [NP The issues] [VP will not be resolved] [AdvP easily.] → resolve easily
(5.22) a. [NP A system of molecules] [VP is easily shown] [VP to stay constant.] → show easily
b. [NP Their economy] [VP was so well run.] → run well
c. [NP Arsenal] [VP had been pushed] [AdvP too hard.] → push hard
Adjectives with infinitival complements (see (5.19b,c)) were acquired as follows: we
concentrated only on adjectives immediately followed by infinitival complements with an optionally
intervening for-PP (see (5.19c)). The adjective and the main verb of the infinitival
complement were counted as instances of the quantity f(a, v). The examples in (5.23) illustrate
the process.
(5.23) a. [NP These early experiments] [VP were easy] [VP to interpret.] → easy interpret
b. [NP It] [VP is easy] [PP for an artist] [VP to show work independently.] → easy show
c. [NP It] [VP is easy] [VP to show] [VP how the components interact.] → easy show
Finally, the frequency f(a, v) collapsed the counts from cases where the adjective
was followed by an infinitival complement (see the examples in (5.23)) and cases where the
verb was modified by the adverb corresponding to the related adjective (see the examples
in (5.20)–(5.22)). For example, assume that we are interested in the frequency f(easy, show).
In this case, we will take into account not only sentences (5.23b,c) but also sentence (5.22a).
Assuming this was the only evidence in the corpus, the frequency f(easy, show) would be
three.
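The collapsing of the two counts can be illustrated with toy counts taken from the examples above:

```python
# f(a, v) sums adverb-modification counts ((5.20)-(5.22)) and
# adjective-plus-infinitival counts ((5.23)).
adverb_counts = {("easy", "show"): 1}       # (5.22a): "is easily shown"
infinitival_counts = {("easy", "show"): 2}  # (5.23b,c): "easy to show"

f_av = {}
for source in (adverb_counts, infinitival_counts):
    for pair, count in source.items():
        f_av[pair] = f_av.get(pair, 0) + count

print(f_av[("easy", "show")])  # 3, as in the text
```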
Once we have obtained the frequencies f(a, v) and f(rel, v, n) we can determine what
the most likely interpretations for a given adjective-noun combination are. Depending on the
data (noisy or not) and the task at hand we may choose to estimate the probability P(a, n, rel, v)
from reliable corpus frequencies only (e.g., f(a, v) > 1 and f(rel, v, n) > 1). If we know the
interpretation preference of a given adjective (i.e., subject or object), we may vary only the
term v in P(a, n, rel, v), keeping the terms n, a, and rel constant. Alternatively, we could acquire
the interpretation preferences automatically by varying both the terms rel and v. In Experiment
8 (see Section 5.4) we acquire both meanings and argument preferences for polysemous
adjective-noun combinations.
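Putting the pieces together, ranking candidate paraphrases via equation (5.15) can be sketched as follows. The frequency tables here are toy values loosely based on Tables 5.1 and 5.2; the f(v) counts are hypothetical, invented for illustration:

```python
def rank_paraphrases(adj, noun, f_av, f_rel_vn, f_v):
    """Score (rel, v) pairs by equation (5.15):
    P(a, n, rel, v) ~ f(rel, v, n) * f(a, v) / (f(v) * sum_i f(v_i))."""
    total = sum(f_v.values())  # stands in for sum_i f(v_i)
    scored = []
    for (rel, v, n), freq in f_rel_vn.items():
        if n != noun:
            continue
        score = freq * f_av.get((adj, v), 0) / (f_v[v] * total)
        if score > 0:
            scored.append(((rel, v), score))
    return sorted(scored, key=lambda x: x[1], reverse=True)

# Toy counts: f(fast, v) after Table 5.1, f(rel, v, plane) after Table 5.2,
# f(v) hypothetical.
f_av = {("fast", "go"): 29, ("fast", "travel"): 13, ("fast", "fly"): 4}
f_rel_vn = {("SUBJ", "fly", "plane"): 20, ("SUBJ", "go", "plane"): 15,
            ("SUBJ", "come", "plane"): 17, ("OBJ", "catch", "plane"): 24}
f_v = {"go": 90000, "travel": 6000, "fly": 4000, "come": 70000, "catch": 8000}

for (rel, v), score in rank_paraphrases("fast", "plane", f_av, f_rel_vn, f_v):
    print(rel, v, score)
```

With these toy counts, "a plane that flies fast" outranks "a plane that goes fast": although go co-occurs with fast far more often, the low overall frequency f(fly) boosts the score of fly relative to the very frequent go.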
In what follows we explain the properties of the model by applying it to a small number
of adjective-noun combinations taken from the lexical semantics literature (i.e., Pustejovsky
1995 and Vendler 1968). We show that our model predicts variation in meaning when the same
adjective modifies different nouns, and furthermore that it provides a fairly intuitive ranking of
meanings for a given adjective-noun combination. We apply the model to examples randomly
selected from the BNC in Experiment 8 and evaluate its performance against human judgments
(see Section 5.4).
5.3. Experiment 7: Comparison against the Literature
5.3.1. Method
We selected 15 adjective-noun combinations discussed in the lexical semantics literature
(Pustejovsky 1995; Vendler 1968). The adjective-noun combinations and their respective interpretations
are given in Table 5.4. Note that although in some cases more than one meaning
is provided (e.g., a difficult language is “a language that is difficult to speak, learn, write, or
Table 5.4: Paraphrases for adjective-noun combinations taken from the literature

good knife → a knife that cuts well (Pustejovsky 1995: 43)
good meal → a tasty meal (a meal that tastes good) (Pustejovsky 1995: 43)
good poet → a poet who writes poems well (Vendler 1968: 101)
good shoe → a shoe that is good for wearing, for walking (Vendler 1968: 99)
fast boat → a boat driven quickly/a boat that is inherently fast (Pustejovsky 1995: 44)
fast game → the motions involved in the game are rapid and swift (Pustejovsky 1995: 44)
fast decision → a decision which takes a short amount of time (Pustejovsky 1995: 44)
difficult language → a language that is difficult to speak, learn, write, understand (Vendler 1968: 99)
careful scientist → a scientist who observes, performs, runs experiments carefully (Vendler 1968: 92)
comfortable chair → a chair on which one sits comfortably (Vendler 1968: 98)
understand”) in most cases a single interpretation is given (e.g., a good knife is “a knife that
cuts well”, an easy text is a “text that reads easily”, etc.). We derived paraphrases for each
adjective-noun combination in Table 5.4 using the probabilistic model outlined in Section 5.2.
The model’s parameters were estimated as explained in Section 5.2.2. No thresholds were employed
for the frequencies f(a, v) and f(rel, v, n). Recall that the frequency f(a, v) collapses
the counts of adjectives co-occurring with infinitival complements and verbs modified by adverbs.
We compiled counts corresponding to verb-adverb modification by mapping the adjective
good to the adverbs good and well, the adjective fast to the adverb fast, easy to easily,
and comfortable to comfortably. The adverbial function of the adjective difficult is expressed
only periphrastically (i.e., in a difficult manner, with difficulty). As a result we obtained the frequency
f(difficult, v) only on the basis of infinitival constructions (see the examples in (5.23)).
Table 5.5 gives the five most likely interpretations for each adjective-noun combination (where
v1 is the most likely interpretation, v2 the second most likely, and so on).
5.3.2. Results
Let us now consider in more detail the interpretations the model comes up with. Pustejovsky
(1995) suggests the interpretation “a knife that cuts well” for the adjective-noun combination
good knife. This is the second most likely interpretation according to our probabilistic model
(see Table 5.5). The model acquires additional less plausible meanings such as “a knife that
goes well”, “a knife that comes well”, “a knife that takes something well”, “a knife that buys
something well”. Although Pustejovsky focuses on a subject-related interpretation for good
knife, the model also derives object-related interpretations: a good knife is “a knife that is
good to use, to hold, to know, to draw, to take”. The interpretations are fairly plausible with the
exception perhaps of the paraphrase “a knife that is good to draw”.
Table 5.5: Model-derived paraphrases for adjective-noun combinations, ranked in order of likelihood

P(a, n, rel, v)   v1   v2   v3   v4   v5
Consider now the pair good meal whose intuitive interpretation is “a meal that tastes
good” (see Table 5.4). Although the model does not derive this particular interpretation, it
derives complementary meanings such as “a meal that goes well”, “a meal that cooks well”,
“a meal that serves well” (see Table 5.5). Similarly, although the model does not discover the
suggested interpretation for the pair good umbrella (i.e., “an umbrella that functions well”) it
comes up with a plausible meaning (i.e., “an umbrella that covers well”). In fact, the meaning
the model suggests can be considered as a subtype of the meaning suggested by Pustejovsky
(1995): an umbrella functions well if it opens well, closes well, covers well, etc. Note also that
the model derives object-related interpretations for good umbrella : “an umbrella that is good to
keep, good for waving, good to hold, good to run for, good to leave”. The meaning paraphrases
are fairly plausible with the exception perhaps of the latter one.
The model and Vendler (1968) agree in their interpretation of the pairs good poet and
good shoe. A good poet is “a poet who writes well”, whereas a good shoe is “a shoe that is good
to wear” (see Table 5.5). The model further acquires the fairly plausible meanings “a poet who
expresses himself well” for good poet and “a shoe that is good to keep, to buy, and to get” for
good shoe. Our model also comes up with plausible interpretations for the combinations fast
boat, fast game, fast decision, and fast horse. In fact, the interpretations ranked as most likely
by the model are similar to the ones proposed in the lexical semantics literature. A fast boat
is “a boat that travels fast” according to the model; Pustejovsky’s (1995) interpretation is semantically
close (“a boat that is inherently fast”). Also notice the object-related interpretations
derived by the model for fast boat (see Table 5.5). A fast game is “a game that goes or runs fast”
according to our model; Pustejovsky’s proposal is semantically related: “the motions involved
in the game are rapid and swift”. The model correctly interprets a fast decision as “a decision
that is fast to make” (see Pustejovsky’s interpretation: “a decision which takes a short amount
of time”) and a fast horse as “a horse that runs fast” (see Vendler’s identical interpretation in
Table 5.4). Note further that according to the model a fast horse is not only “a horse that runs
fast” but also a “horse that learns, goes, comes, and rises fast”.
Similarly, an easy problem is not only “a problem that is easy to solve” (see Vendler’s
1968 identical interpretation in Table 5.4) but also “a problem that is easy to deal with, identify,
tackle, and handle” (see Table 5.5). The meaning of easy problem is different from the
meaning of easy text which in turn is “easy to read, handle, interpret, and understand”. The
interpretations the model arrives at for the adjective-noun combination difficult language are a
superset of the interpretations suggested by Vendler (see Table 5.4). The model comes up with
the additional meanings “language that is difficult to interpret” and “language that is difficult to
use”. Although the meanings acquired by the model for careful scientist do not overlap with the
ones suggested by Vendler (see Table 5.4) they seem intuitively plausible: a careful scientist is
a “scientist who calculates, proceeds, investigates, studies, and analyses carefully”. These are
all possible events associated with scientists. Finally, note that the meanings derived for comfortable
chair are also fairly plausible (the second most likely meaning is the one suggested by
Vendler, see Table 5.4).
5.3.3. Discussion
Experiment 7 presented an example of how the probabilistic model outlined in Section 5.2
can be used to discover the meanings of adjective-noun combinations taken from the lexical
semantics literature. Our probabilistic model combines distributional information about how
likely it is for a verb to be modified by the adjective or adverb derived from the adjective
present in an adjective-noun combination with information about how likely it is for any verb
to take the related noun as its object or subject. We obtained quantitative information about
verb-adjective and verb-adverb modification as well as verb-argument relations from the BNC
via partial parsing. The results of Experiment 7 indicate that the probabilistic model can be
used not only to predict different meanings when the same adjective modifies different nouns
but also to derive a cluster of meanings for a single adjective-noun combination.
Although the model can be used to provide several interpretations for a given adjective-noun
combination, not all of these interpretations are useful or plausible. Experiment 7 showed
that meanings with top-ranked probabilities are intuitively plausible, although in some cases
implausible meanings are also assigned top-ranked probabilities (for example, “a knife that
buys something well” is ranked as the fifth most likely meaning for good knife, see Table 5.5).
Furthermore, we did not explore the status of meanings with low probabilities. For an ideal
model one would expect that top-ranked probabilities correspond to plausible meanings and
bottom-ranked probabilities correspond to implausible meanings. We test this prediction in
Experiment 8. Another objection to the examples given in Tables 5.4 and 5.5 is that they may
not be entirely representative of the types of polysemous adjective-noun combinations occurring
in unrestricted text, since they are taken from linguistic texts where emphasis is placed on
explaining the phenomenon at hand and the selected examples are typically straightforward
illustrations of polysemous adjective-noun combinations. In other words, the adjective-noun
combinations discussed in the previous section may be too easy for the model to handle. In
Experiment 8 (see Section 5.4) we test our model on polysemous adjective-noun combinations
randomly sampled from the BNC and formally evaluate our results against human judgments.
Finally, note that the meanings acquired by our model are a simplified version of the
ones provided in the lexical semantics literature. In particular, note that an adjective-noun combination
may be paraphrased with another adjective-noun combination (see Table 5.4, where
good meal is paraphrased as “a tasty meal”) or with an NP instead of an adverb (see the
paraphrase of fast decision in Table 5.4). We are making the simplifying assumption that a
polysemous adjective-noun combination can be paraphrased by a sentence consisting of a verb
whose argument is the noun the adjective is in construction with.
The probabilistic model discussed in the previous sections acquires meanings for polysemous
adjective-noun combinations out of context. The derived meanings can be thought of
as default semantic information associated with a particular adjective-noun combination. This
means that our model is unable to predict the meaning of fast programmer when embedded in
a context like the one given in (5.3).
5.4. Experiment 8: Comparison against Human Judgments
5.4.1. Method
The ideal test of the proposed model of adjective-noun polysemy is one with randomly chosen
materials. We evaluate the acquired meanings by comparing the model’s rankings against
judgments of meaning paraphrases elicited experimentally from human subjects. By comparing
the model-derived meaning paraphrases against human intuitions we are able to explore:
(a) whether plausible meanings are ranked higher than implausible ones; (b) whether the model
can be used to derive the argument preferences for a given adjective, i.e., whether the adjective
is biased towards a subject or object interpretation or whether it is equi-biased; (c) whether
there is a linear relationship between the model-derived likelihood of a given meaning and its
perceived plausibility, using correlation analysis.
In the following sections we describe our method for assembling the set of experimental
materials and eliciting judgments for model-derived adjective-noun interpretations.
Section 5.4.2 reports the results of comparing human judgments to model-derived meanings,
whereas Section 5.4.3 offers some discussion and concluding remarks.
5.4.1.1. Subjects
Sixty-five native speakers of English participated in the experiment. The subjects were recruited
over the Internet by postings to relevant newsgroups and mailing lists. Participation was voluntary
and unpaid. Subjects had to be linguistically naive, i.e., neither linguists nor students of
linguistics were allowed to participate.
The data of one subject were eliminated after inspection of his response times
showed that he had not completed the experiment in a realistic time frame (average response
time < 1000 ms). The data of four subjects were excluded because they were non-native speakers
of English.
This left 60 subjects for analysis. Of these, 54 subjects were right-handed, six left-handed;
22 subjects were female, 38 male. The age of the subjects ranged from 18 to 54 years,
the mean was 27.4 years.
5.4.1.2. Materials and Design
We chose nine adjectives according to a set of minimal criteria and paired each adjective with
10 nouns randomly selected from the BNC. We chose the adjectives as follows: we first compiled
a list of all the polysemous adjectives mentioned in the lexical semantics literature (Pustejovsky
1995; Vendler 1968). From these we randomly sampled nine adjectives (difficult, easy,
fast, good, hard, right, safe, slow, and wrong). These adjectives had to be unambiguous with
respect to their part of speech: each adjective was unambiguously tagged as “adjective” 98.6%
of the time, measured as the number of different part-of-speech tags assigned to the word in
the BNC. The nine selected adjectives ranged in BNC frequency from 80 to 1,245 per million.
We identified adjective-noun pairs using Gsearch (see Section 2.3.1 in Chapter 2 for
details). Gsearch was run on a lemmatized version of the BNC so as to compile a comprehensive
corpus count of all nouns occurring in a modifier-head relationship with each of the nine
adjectives. From the syntactic analysis provided by the parser we extracted a table containing
the adjective and the head of the noun phrase following it. In the case of compound nouns,
we only included sequences of two nouns, and considered the rightmost occurring noun as the
head. From the retrieved adjective-noun pairs, we removed all pairs with BNC frequency of
one, as we wanted to reduce the risk of paraphrase ratings being influenced by adjective-noun
combinations unfamiliar to the subjects. Furthermore, we excluded pairs with deverbal nouns
(i.e., nouns derived from a verb) such as fast programmer since an interpretation can be easily
arrived at for these pairs by mapping the deverbal noun to its corresponding verb. A list
of deverbal nouns was obtained from two dictionaries, CELEX (Burnage 1990) and NOMLEX
(Macleod et al. 1998, see Section 7.2.1.2 for details).
We used the model outlined in Section 5.2 to derive meanings for the 90 adjective-noun
combinations. We employed no threshold on the frequencies f(v, a) and f(rel, v, n). As before,
we mapped each adjective to its corresponding adverb. For the adjectives difficult, easy, fast, and good the
mapping was the same as in Experiment 7. Furthermore, the adjective hard was mapped to
the adverb hard, the adjective right to rightly and right, safe to safely and safe, slow to slowly
and slow, and wrong to wrongly and wrong. We estimated the probability P(a, n, rel, v) for each
adjective-noun pair by varying both the terms v and rel. In other words, for each adjective-noun
combination we derived both subject-related and object-related paraphrases.
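To make the ranking concrete, the following sketch scores candidate paraphrases for fast horse from toy co-occurrence counts. The counts, the factorization of P(a, n, rel, v) into the two frequency terms, and the normalizing constant are illustrative assumptions only, not the estimation equations of Section 5.2.

```python
from collections import Counter

# Toy counts (NOT real BNC figures):
# f_va[(v, a)]: verb v modified by the adverb derived from adjective a
# f_rvn[(rel, v, n)]: verb v taking noun n as subject/object
f_va = Counter({("run", "fast"): 97, ("learn", "fast"): 20, ("need", "fast"): 2})
f_rvn = Counter({("subj", "run", "horse"): 56, ("subj", "learn", "horse"): 13,
                 ("subj", "need", "horse"): 80})
f_v = Counter({"run": 4000, "learn": 1500, "need": 9000})
N = 100_000  # total verb tokens in the corpus (illustrative)

def score(a, n, rel, v):
    """Illustrative estimate of P(a, n, rel, v): how often v is modified by
    the adverb of a, times how often v takes n as rel, corrected for the
    overall frequency of v."""
    return (f_va[(v, a)] * f_rvn[(rel, v, n)]) / (f_v[v] * N)

def paraphrases(a, n, verbs=("run", "learn", "need"), rels=("subj", "obj")):
    """Rank candidate (verb, relation) paraphrases by descending score."""
    scored = [(score(a, n, rel, v), rel, v) for v in verbs for rel in rels]
    return [(v, rel) for s, rel, v in sorted(scored, reverse=True) if s > 0]

print(paraphrases("fast", "horse"))
# → [('run', 'subj'), ('learn', 'subj'), ('need', 'subj')]
```

The point is only the shape of the computation: adjective-verb affinity and verb-argument affinity jointly determine the ranking, so the frequent but adjective-unrelated verb need is demoted below run.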
In order to generate stimuli covering a wide range of model-derived paraphrases corresponding
to different degrees of likelihood, for each adjective-noun combination we divided the
set of the derived meanings into three “probability bands” (High, Medium, and Low) of equal
size and randomly chose one interpretation from each band. The division ensured that the experimental
stimuli represented the model’s behavior for likely and unlikely paraphrases and
enabled us to test the hypothesis that likely paraphrases correspond to high ratings and unlikely
paraphrases correspond to low ratings. We performed separate divisions for object-related and
subject-related paraphrases resulting in a total of six interpretations for each adjective-noun
combination, as we wanted to determine whether there are differences in the model’s predictions
with respect to the argument function (i.e., object or subject) and also because we
wanted to compare experimentally-derived adjective biases against model-derived biases. Example
stimuli (with object-related interpretations only) are shown in Table 5.6 for each of the
nine adjectives.
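The band division and sampling just described can be sketched as follows; the helper function, the fixed seed, and the ranked list of paraphrases are illustrative.

```python
import random

def sample_bands(ranked, seed=0):
    """Split a likelihood-ranked list of paraphrases into three equal
    probability bands (High, Medium, Low) and draw one stimulus per band."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    third = len(ranked) // 3
    bands = {"High": ranked[:third],
             "Medium": ranked[third:2 * third],
             "Low": ranked[2 * third:]}
    return {band: rng.choice(items) for band, items in bands.items()}

ranked = ["runs fast", "goes fast", "learns fast",
          "comes fast", "rises fast", "needs something fast"]
print(sample_bands(ranked))
```

Running the division separately over the subject-related and object-related rankings, as in the experiment, yields six stimuli per adjective-noun combination.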
Our experimental design consisted of the factors adjective-noun pair (Pair), grammatical
function (Func) and probability band (Band). The factor Pair included 90 adjective-noun
combinations. The factor Func had two levels, subject and object, whereas the factor Band
had three levels, High, Medium, and Low. This yielded a total of Pair × Func × Band =
90 × 2 × 3 = 540 stimuli. The number of stimuli was too large for subjects to judge in
one experimental session. We limited the size of the design by selecting a total of 270 stimuli
according to the following criteria: our initial design created two sets of stimuli, 270 subject-related
stimuli and 270 object-related stimuli. For each set of stimuli (i.e., object- and subject-related)
we randomly selected five nouns for each of the nine adjectives together with their corresponding
interpretations in the three probability bands (High, Medium, Low). This yielded a
total of Pair × Func × Band = 45 × 2 × 3 = 270 stimuli. This way, stimuli were created for
two subject groups.
Items were presented in random order, with a new randomization being generated
for each subject.
The training set contained six horizontal lines. The range of the smallest to largest item
was 1:10. The items were distributed evenly over this range, with the largest item covering the
maximal window width of the web browser. A modulus item in the middle of the range was
provided.
Practice Phase. This phase allowed subjects to practice magnitude estimation of adjective-noun
paraphrases. Presentation and response procedure were the same as in the training phase,
with linguistic stimuli being displayed instead of lines. Each subject judged the whole set of
practice items, again in random order.
The practice set consisted of eight paraphrase sentences that were representative of the
test materials (see Section 5.4.1.2 for details about the construction of experimental stimuli).
The paraphrases were based on the three probability bands and illustrated subject- and object-related
interpretations. A modulus item in the middle of the range was provided.
Experimental Phase. Presentation and response procedure in the experimental phase were the
same as in the practice phase. Each subject group saw 135 experimental stimuli (i.e., adjective-noun
pairs and their paraphrases). As in the practice phase, the paraphrases were representative
of the three probability bands (i.e., High, Medium, Low) and the two grammatical functions
(i.e., object, subject). A modulus item in the middle of the range was provided (see Appendix
C). The modulus was the same for all subjects and remained on the screen all the time.
Subjects were assigned to subject groups at random, and a random stimulus order was generated
for each subject (for the complete list of experimental stimuli see Appendix C).
5.4.2. Results
The data were first normalized by dividing each numerical judgment by the modulus value that
the subject had assigned to the reference sentence. This operation creates a common scale for all
subjects. Then the data were transformed by taking the decadic logarithm. This transformation
ensures that the judgments are normally distributed and is standard practice for magnitude
estimation data (Bard et al. 1996; Lodge 1981). All analyses were conducted on the normalized,
log-transformed judgments.
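The two-step normalization can be sketched in a few lines; the ratings below are invented.

```python
import math

def normalize(judgments, modulus):
    """Divide each magnitude-estimation judgment by the subject's modulus
    rating (common scale across subjects), then take the decadic (base-10)
    logarithm so the data are approximately normally distributed."""
    return [math.log10(j / modulus) for j in judgments]

# A subject who rated the modulus sentence 50 and three paraphrases 50, 100, 25:
print(normalize([50, 100, 25], modulus=50))  # → [0.0, 0.301..., -0.301...]
```

A rating equal to the modulus maps to zero, and ratings twice or half as large map to symmetric positive and negative values.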
Table 5.8: Descriptive statistics for Experiment 8, by subjects

Rank   Mean     Std Dev   StdEr   Min    Max
High   −.0005   .2974     .0384   −.68   .49

The effect of Probability Band was significant both by subjects (p < .01) and by items
(F2(2, 88) = 29.07, p < .01). The geometric mean of the ratings in the High band was
−.0005, compared to Medium items at −.1754 and Low items at −.2298 (see Table 5.8). Post-hoc
Tukey tests indicated that the differences between all pairs of conditions were significant at
α = .01 in the by-subjects analysis. The difference between High and Medium items as well as
High and Low items was significant at α = .01 in the by-items analysis, whereas the difference
between Medium and Low items did not reach significance. These results show that meaning
paraphrases derived by the model correspond to human intuitions: paraphrases assigned
high probabilities by the model are perceived as better than paraphrases that are assigned low
probabilities.
We further explored the linear relationship between the subjects’ rankings and the
corpus-based model, using correlation analysis. The elicited judgments were compared with
the interpretation probabilities which were obtained from the model described in Section 5.2
to examine the extent to which the proposed interpretations correlate with human intuitions. A
comparison between our model and the human judgments yielded a Pearson correlation coefficient
of .40 (p < .01, N = 270). Figure 5.1 plots the relationship between judgments and model
probabilities. This verifies the Probability Band effect discovered by the ANOVA, in an analysis
which compares the individual interpretation likelihood for each item with elicited interpretation
preferences, instead of collapsing all the items in three equivalence classes (i.e., High,
Medium, Low).
In order to evaluate whether the grammatical function has any effect on the relationship
between the model-derived meaning paraphrases and the human judgments, we split the items
into those that received a subject interpretation and those that received an object interpretation.
A comparison between our model and the human judgments yielded a correlation of r = .53
(p < .01, N = 135) for object-related items and r = .20 (p < .05, N = 135) for subject-related
items. Note that a weaker correlation is obtained for subject-related interpretations.

Figure 5.1: Correlation of elicited judgments and model-derived probabilities

One explanation for that could be the parser’s performance, i.e., the parser is better
at extracting verb-object tuples than verb-subject tuples. Another hypothesis (which we test
below) is that most adjectives included in the experimental stimuli have an object-bias, and
therefore subject-related interpretations are generally less preferred than object-related ones.
An important question is how well humans agree in their paraphrase judgments for
adjective-noun combinations. Inter-subject agreement gives an upper bound for the task and
allows us to interpret how well the model is doing in relation to humans. For each subject
group we performed correlations on the elicited judgments using a method similar to leave-one-out
cross-validation (Weiss and Kulikowski 1991). We divided the set of the subjects’
responses of size m into a set of size m − 1 (i.e., the response data of all but one subject)
and a set of size one (i.e., the response data of a single subject). We then correlated the mean
ratings of the former set with the ratings of the latter. This was repeated m times. Since each
group had 30 subjects we performed 30 correlation analyses and report their mean. For the first
group, the average inter-subject agreement was .67 (Min = .03, Max = .82, StdDev = .14), and
for the second group .65 (Min = .05, Max = .82, StdDev = .14). This means that our model
performs satisfactorily given that humans do not perfectly agree in their judgments (recall
that the comparison between model probabilities and human judgments yielded a correlation
coefficient of .40).
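The leave-one-out agreement computation can be sketched as follows, with a hand-rolled Pearson correlation and three toy subjects rating four items:

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def intersubject_agreement(ratings):
    """Correlate each subject's ratings with the mean ratings of the
    remaining subjects; report the average of the m correlations."""
    rs = []
    for i, held_out in enumerate(ratings):
        rest = [r for j, r in enumerate(ratings) if j != i]
        rest_means = [mean(col) for col in zip(*rest)]
        rs.append(pearson(rest_means, held_out))
    return mean(rs)

ratings = [[1, 2, 3, 4],   # subject 1
           [2, 2, 4, 4],   # subject 2
           [1, 3, 3, 5]]   # subject 3
print(round(intersubject_agreement(ratings), 2))  # → 0.89
```

The resulting mean correlation serves as the upper bound against which the model’s .40 correlation is interpreted.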
In sum, the correlation analysis supports the claim that adjective-noun paraphrases
with high probability are judged more plausible than pairs with low probability. It also suggests
that the meaning preference ordering produced by the model is intuitively correct since
subjects’ perception of likely and unlikely meanings correlates with the probabilities assigned
by the model.
The elicited judgments can be further used to derive the grammatical function preferences
(i.e., subject or object) for a given adjective.

Table 5.9: Log-transformed model-derived argument preferences for polysemous adjectives

Adjective   Preference   Mean     StdDev   StdEr
difficult   OBJ          −21.62   1.36     .04

In particular, we can determine which is
the preferred interpretation for individual adjectives and compare these preferences against the
ones produced by our model. Argument preferences can be easily derived from the model’s output
by comparing subject-related and object-related paraphrases. For each adjective we gathered
all the subject- and object-related interpretations derived by the model and performed an
ANOVA in order to determine the significance of the Grammatical Function effect. We interpret
a significant effect as bias towards a particular grammatical function. We classify an adjective
as object-biased if the mean of the model-derived probabilities for the object interpretation of
this particular adjective is larger than the mean for the subject interpretation; subject-biased
adjectives are classified accordingly, whereas adjectives for which no effect of Grammatical
Function is found are classified as equi-biased.
The effect of Grammatical Function was significant for the adjectives difficult
(F(1, 1806) = 8.06, p < .01), easy (F(1, 1511) = 41.16, p < .01), hard (F(1, 1310) = 57.67,
p < .01), safe (F(1, 382) = 5.42, p < .05), right (F(1, 2114) = 9.85, p < .01), and fast
(F(1, 92) = 4.38, p < .05). The effect of Grammatical Function was not significant for the
adjectives good (F(1, 741) = 3.95, p = .10), slow (F(1, 759) = 5.30, p = .13), and wrong
(F(1, 593) = 1.66, p = .19). The biases for these adjectives are shown in Table 5.9. The presence
of the symbol next to a preference in Table 5.9 indicates significance of the Grammatical
Function effect.
Ideally, we would like to elicit argument preferences from human subjects in a similar
fashion. However, since it is unpractical to experimentally elicit judgments for all paraphrases
derived by the model, we will obtain argument preferences from the judgments based on the restricted
set of experimental stimuli, under the assumption that they correspond to a wide range
of model paraphrases (i.e., they correspond to a wide range of probabilities) and therefore
they are representative of the entire set of model-derived paraphrases. This assumption is justified
by the fact that items were randomly chosen from the three probability bands (i.e., High,
Medium, Low). Again we consider an adjective biased if there is a significant effect of Grammatical
Function. Comparison of the mean of subject-related judgments against object-related
judgments determines the direction of the bias. The ANOVA indicated that the Grammatical
Function effect was significant for the adjective difficult in both by-subjects and by-items analyses
(F1(1, 58) = 17.98, p < .01; F2(1, 4) = 53.72, p < .01), and for the adjective easy in both
by-subjects and by-items analyses (F1(1, 58) = 10.00, p < .01; F2(1, 4) = 8.48, p = .04). The adjectives
difficult and easy are both object-biased (see Table 5.10, which shows the biases for the
nine adjectives as derived from the human judgments). The adjective good is equi-biased (see
Table 5.10), since no effect of Grammatical Function was found (F1(1, 58) = 2.55, p = .12;
F2(1, 4) = 1.01, p = .37).
The effect of Grammatical Function was significant for the adjective hard in the by-subjects
analysis only (F1(1, 58) = 11.44, p < .01; F2(1, 4) = 2.84, p = .17), whereas for the adjective slow
it was likewise significant by subjects only (p < .05; F2(1, 4) = 6.94, p = .058). The adjective hard is object-biased, whereas the adjective
slow is subject-biased (see Table 5.10). For the adjective safe the main effect was significant
by subjects and by items (F1(1, 58) = 14.40, p < .0005; F2(1, 4) = 17.76, p < .05), and for
the adjective right the main effect was significant in both by-subjects and by-items analyses
(F1(1, 58) = 6.51, p < .05; F2(1, 4) = 15.22, p = .018). This translates into an object bias for
both right and safe (see Table 5.10). The effect of Grammatical Function was significant for
the adjective wrong by subjects only (F1(1, 58) = 5.99, p < .05; F2(1, 4) = 4.54, p = .10) and
for the adjective fast by subjects only (F1(1, 58) = 4.23, p < .05; F2(1, 4) = 4.43, p = .10).
The adjective wrong has an object bias, whereas fast has a subject bias.
The adjective wrong has an object bias, whereas fast has a subject bias.
We expect a correct model to assign higher probabilities to object-related interpretations
and lower probabilities to subject-related interpretations for an object-biased adjective;
accordingly, we expect the model to assign on average higher probabilities to subject-related
interpretations for subject-biased adjectives. Comparison of the biases derived from the model
with ones derived from the elicited judgments shows that the model and the humans are in
agreement for all adjectives but slow, wrong, and safe. On the basis of human judgments slow
has a subject bias, whereas wrong has an object bias (see Table 5.10). Although the model
could not reproduce this result there is a tendency in the right direction (see Table 5.9).
Note that in our correlation analysis reported above the elicited judgments were compared
against model-derived paraphrases without taking argument preferences into account.
We would expect a correct model to produce intuitive meanings at least for the interpretation a
given adjective favors. We further examined the model’s behavior by performing separate correlation
analyses for preferred and dispreferred biases as determined previously by the ANOVAs
conducted for each adjective (see Table 5.10). Since the adjective good was equi-biased we included
both biases (i.e., object-related and subject-related) in both correlation analyses. The
comparison between our model and the human judgments yielded a Pearson correlation coefficient
of .52 (p < .01, N = 150) for the preferred interpretations and a correlation of .23
(p < .01, N = 150) for the dispreferred interpretations. The correlation for the preferred interpretations
is graphed in Figure 5.2. The result indicates that our model is particularly good
at deriving meanings corresponding to the argument-bias for a given adjective. However, the
dispreferred interpretations also correlate significantly with human judgments, which suggests
that the model derives plausible interpretations even in cases where the argument bias is overridden.
5.4.3. Discussion
We have demonstrated that the meanings acquired by our probabilistic model correlate reliably
with human intuitions. These meanings go beyond the examples found in the theoretical linguistics
literature. The adjective-noun combinations we interpret were randomly sampled from
a large balanced corpus providing a rich inventory for their meanings.

Figure 5.2: Correlation between model and human judgments for preferred argument interpretations

Our model not only acquires clusters of meanings (following Vendler’s 1968 insight) but can also be used
to obtain a tripartite distinction of adjectives depending on the type of paraphrase they prefer:
subject-biased adjectives tend to modify nouns which act as subjects of the paraphrasing verb,
object-biased adjectives tend to modify nouns which act as objects of the paraphrasing verb,
whereas equi-biased adjectives display no preference for either argument role.
The interpretation biases generated by our model seem to correspond to human intuitions
about the interpretation of polysemous adjectives. This is an important result given the
simplifying assumptions underlying our probabilistic model. We have shown that the model has
three defining features: (a) it is able to derive intuitive meanings for adjective-noun combinations,
(b) it models the context dependency of polysemous adjectives (e.g., different meanings
are predicted for good when it modifies cook and soup), and (c) it accurately models the argument
bias of a given adjective. To address issue (c) the experimental design included both
subject- and object-related interpretations for all nine adjectives. A comparison between the
argument preferences produced by the model and human intuitions revealed that most adjectives
(six out of nine) display a preference for an object interpretation (see Table 5.10), two
adjectives are subject-biased (i.e., fast, slow) and one adjective is equi-biased (i.e., good ).
Note finally that the evaluation procedure to which we subject our model is rather
strict. The derived adjective-noun combinations were evaluated by subjects naive to linguistic
theory. Although adjective-noun polysemy is a well researched phenomenon in the theoretical
linguistics literature, the experimental approach advocated here is new to our knowledge. Despite
the fact that human data is noisy, as evidenced by the fairly low inter-subject agreement
(.67 for the first group and .65 for the second group), we obtain reliable correlations between
model probabilities and human judgments.
relation rel and noun n:

P(v | rel, n) = f(v, rel, n) / f(rel, n)    (5.24)
The model in (5.24) assumes that the meaning of an adjective-noun combination is
independent of the adjective in question. Consider for example the adjective-noun pair fast
plane. We need to find the verbs v and the argument relation rel that maximize the probability
P(v | rel, plane). Intuitively speaking, the model in (5.24) takes into account only the verbs
that are associated with the noun modified by the adjective. In the case of fast plane the verb
that is most frequently associated with planes is fly (see Table 5.2 in Section 5.2.1). Note that
this model will come up with the same probabilities for fast plane and wrong plane since it
does not take the identity of the modifying adjective into account. We estimated the frequencies
f(v, rel, n) and f(rel, n) from verb-object and verb-subject tuples extracted from the BNC
using Cass (Abney 1996) (see Section 5.2.2 for details on the extraction and filtering of the
argument tuples).
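A sketch of the baseline in (5.24) with invented tuple counts; note that the adjective never enters the computation, so fast plane and wrong plane receive identical rankings.

```python
from collections import Counter

# Invented verb-argument tuple counts f(v, rel, n) (not real BNC figures)
f_vreln = Counter({("fly", "subj", "plane"): 80,
                   ("land", "subj", "plane"): 30,
                   ("board", "obj", "plane"): 45})

def p_naive(v, rel, n):
    """Naive baseline (5.24): P(v | rel, n) = f(v, rel, n) / f(rel, n)."""
    f_reln = sum(c for (_, r, n2), c in f_vreln.items() if r == rel and n2 == n)
    return f_vreln[(v, rel, n)] / f_reln if f_reln else 0.0

print(p_naive("fly", "subj", "plane"))  # → 0.727... (= 80 / 110)
```

Because the estimate conditions only on (rel, n), any adjective paired with plane yields the same probability distribution over verbs.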
5.5.2. Method
Using the naive model we calculated the meaning probability for each of the 270 stimuli included
in Experiment 8. Through correlation analysis we explored the linear relationship between
the elicited judgments and the naive baseline model. We further directly compared the
two models, our initial, linguistically more informed model, and the naive baseline. We report
our results in the following section.
5.5.3. Results
Using correlation analysis we explored which model performs better at deriving meaning paraphrases
for adjective-noun combinations. A comparison between the naive model’s probabilities
and the human judgments yielded a Pearson correlation coefficient of .25 (p < .01, N = 270).
Recall that a correlation coefficient of .40 was obtained when comparing
our original model to the human judgments. Not surprisingly, the two models are intercorrelated
(r = .38, p < .01, N = 270). These correlations are shown in Table 5.11, where ‘Model’
refers to our initial model and ‘Baseline’ refers to the naive baseline. An important question is
whether the difference between the two correlation coefficients (r = .40 and r = .25) is due to
chance. Comparison of the two correlation coefficients revealed that their difference was significant
(t(267) = 2.42, p < .01). This means that our original model (see Section 5.2) performs
reliably better than a naive baseline at deriving interpretations for polysemous adjective-noun
combinations.
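The text does not name the test used to compare the two dependent correlations; Hotelling's t test for correlated correlation coefficients (both correlations share the human judgments as one variable) reproduces the reported t(267) = 2.42 from the values above, so the sketch below uses it.

```python
import math

def hotelling_t(r12, r13, r23, n):
    """Hotelling's t for the difference between two dependent correlations
    r12 and r13 that share variable 1 (the human judgments); r23 is the
    correlation between the two models. Returns (t, degrees of freedom)."""
    det = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    t = (r12 - r13) * math.sqrt((n - 3) * (1 + r23) / (2 * det))
    return t, n - 3

t, df = hotelling_t(0.40, 0.25, 0.38, 270)  # model r, baseline r, inter-model r, N
print(round(t, 2), df)  # → 2.42 267
```

The test requires the inter-model correlation (r = .38) precisely because the two coefficients are not independent: both are computed against the same elicited judgments.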
We further compared the naive baseline model and the human judgments separately
for subject-related and object-related items. The comparison yielded a correlation of r = .29
(p < .01, N = 135) for object interpretations. Recall that our original model yielded a correlation
coefficient of .53 (see Table 5.12). The two correlation coefficients were significantly
different (t(132) = 3.03, p < .01). No correlation was found for the naive model when compared
against elicited subject interpretations (r = .09, p = .28, N = 135, see Table 5.12).

Table 5.12: Correlation matrices for human judgments and the two corpus-based models

            OBJECT                      SUBJECT
            Judgments   Model           Judgments   Model
Model       .53**                       .20*
Baseline    .29**       .42**           .09         .37**

*p < .05 (2-tailed)   **p < .01 (2-tailed)
5.5.4. Discussion
We have demonstrated that a naive baseline model which interprets adjective-noun combinations
by focusing solely on the events associated with the noun is outperformed by a more
detailed model which not only considers verb-argument relations but also adjective-verb and
adverb-verb dependencies. Although the events associated with the different nouns are crucially
important for the meaning of polysemous adjective-noun combinations, it seems that
more detailed linguistic knowledge is needed in order to produce intuitively plausible interpretations.
This is by no means surprising. To give a simple example, consider the adjective-noun pair fast horse. There is a variety of events associated with the noun horse, yet only a subset of those are likely to occur fast. The three most likely interpretations for fast horse according to the naive model are “a horse that needs something fast”, “a horse that gets something fast”, and “a horse that does something fast”. A model which uses information about verb-adjective
or verb-adverb dependencies (see Section 5.2) provides a more plausible ranking: a fast horse
is “a horse that runs, learns, or goes fast”. A similar situation arises when one considers the pair
careful scientist. According to the naive model a careful scientist is more likely to “believe, say,
or make something carefully”. However, none of these events are particularly associated with
the adjective careful.
5.6. General Discussion
In this chapter we focused on polysemous adjective-noun combinations. We showed how adjectival
meanings can be acquired from a large corpus and provided a probabilistic model which
derives a preference ordering on the set of possible interpretations. In contrast to the study
presented in Chapter 4 (where the meanings of verbs were provided by Levin’s 1993 linguistic
classification) the meanings for polysemous adjectives were derived solely from the corpus
using a surface cueing approach. The probabilistic model reflects linguistic observations about
the nature of polysemous adjectives: it predicts the context dependency effect (i.e., the meaning
of the adjective varies with respect to the noun it modifies) and is faithful to Vendler’s (1968)
claim that polysemous adjectives are usually interpreted by a cluster of meanings instead of a
single meaning.
Furthermore, the proposed model can be viewed as complementary to linguistic theory:
it automatically derives a ranking of meanings, thus distinguishing likely from unlikely
interpretations. Even if linguistic theory were able to enumerate all possible interpretations for
a given adjective (note that in the case of polysemous adjectives we would have to take into account
all nouns or noun classes the adjective could possibly modify) it has no means to indicate
which ones are likely and which ones are not. Our model fares well on both tasks. It recasts the
problem of adjective-noun polysemy in a probabilistic framework and approximates meaning
by taking into account the relation of a verb to its argument (i.e., the noun the adjective is in
construction with) together with the relation of the same verb and the adjective (or its corresponding
adverb), thus deriving a large number of interpretations not readily available from linguistic introspection. The information acquired from the corpus can also be used to quantify
the argument preferences of a given adjective. These are only implicit in the lexical semantics
literature where certain adjectives are exclusively given a verb-subject or verb-object interpretation
(see the adjectives fast and difficult in Table 5.4). We have demonstrated that we can
empirically derive argument biases for a given adjective that correspond to human intuitions.
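The combination of the two dependency types can be illustrated with a toy ranker. This is not the probabilistic model of Section 5.2, only a sketch of the underlying idea: a candidate verb must co-occur both with the adjective's adverbial form and with the noun as its argument. All counts below are hypothetical:

```python
# Hypothetical co-occurrence counts extracted from a parsed corpus:
# (verb, adverb) pairs and (verb, relation, noun) tuples.
verb_adv = {("run", "fast"): 120, ("go", "fast"): 95, ("need", "fast"): 2}
verb_arg = {("run", "subj", "horse"): 80, ("go", "subj", "horse"): 60,
            ("need", "subj", "horse"): 150}

def rank_paraphrases(adverb, noun, rel="subj"):
    """Rank candidate verbs for an adjective-noun pair by combining
    verb-adverb and verb-argument co-occurrence counts."""
    scores = {}
    for (verb, adv), c1 in verb_adv.items():
        if adv != adverb:
            continue
        c2 = verb_arg.get((verb, rel, noun), 0)
        if c1 * c2 > 0:  # both dependencies must be attested
            scores[verb] = c1 * c2
    return sorted(scores.items(), key=lambda kv: -kv[1])

print(rank_paraphrases("fast", "horse"))
# "run" and "go" outrank "need": a fast horse runs or goes fast
```

Multiplying the two counts ensures that a frequent verb-argument pair (needs, horse) cannot dominate unless the verb is also attested with the adverb; the actual model normalizes such counts into probabilities.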
Our model is ignorant of the potentially different meanings of the noun in the
adjective-noun pair. For example, the combination fast plane may be a fast aircraft, or a fast
tool, or a fast geometrical plane. Our model derives meanings related to all three senses of the
noun plane. For example, a fast plane is not only “a plane (i.e., an aircraft) which flies, lands,
or travels quickly”, but also “a plane (i.e., a surface) which transposes or rotates quickly” and
“a plane (i.e., a tool) which smoothes something quickly”. However, more paraphrases are derived
for the aircraft sense of plane; these paraphrases also receive a higher ranking. This is not
surprising, since the verbs related to the aircraft sense of plane are more frequent than those related to the other two senses. Note also that fast is more likely to be related
with motion verbs than verbs related to the events denoted by plane in the sense of “surface” or
“tool”. There are also cases where a model-derived paraphrase does not provide disambiguation
clues with respect to the meaning of the noun. Consider the adjective-noun combination
fast game from Section 5.3. The model comes up with the paraphrases “game that runs fast”
or “game that goes fast”. Both paraphrases may well refer to either the “contest”, “activity”, or
“prey” sense of game.
Our results provide further support for the surface cueing approach. In Chapter 3 we
used surface cues as indicators about meaning. Similarly in this chapter we derive the meanings
of adjective-noun combinations by taking into account co-occurrence frequencies acquired
through shallow syntactic processing of the corpus. Recall that in Chapter 3 we used the acquired
frequencies to quantify and augment Levin’s (1993) generalizations about alternating
verbs. Here we examined the empirical validity of Pustejovsky’s (1995) and Vendler’s (1968)
claims about context-sensitive adjectives. In Chapter 4 we showed how a probabilistic model
which uses Levin’s taxonomy as an inventory of verb meanings can make use of co-occurrence
frequencies to derive a ranking of interpretations for polysemous verbs. Although a similar approach
is taken in this chapter (our probabilistic model provides a preference ordering on the
set of acquired interpretations for polysemous adjective-noun combinations) adjective meanings
are derived directly from the corpus without recourse to a predefined taxonomy.
The acquired adjective-noun meanings could be potentially useful for a variety of NLP
tasks. One obvious application is Natural Language Generation. The acquisition task can be
cast in terms of finding the corresponding paraphrase for a given adjective-noun combination.
For example, a generator that has knowledge of the fact that fast plane corresponds to “a plane
that flies fast” can exploit this information either to render the text shorter (in cases where
the input representation is a sentence) or longer (in cases where the input representation is an
adjective-noun pair). Information retrieval is another application that easily comes to mind.
Consider a search engine faced with the query fast plane. Presumably one would not like to
obtain information about planes in general or about planes that go down or burn fast but rather
about planes that fly or travel fast. So knowledge about the most likely interpretations of fast
plane could help rank relevant documents before non-relevant ones or restrict the number of
retrieved documents.
5.7. Related Work
Previous corpus-based work relating to adjectives has focused on two directions: the automatic
classification of adjectives in terms of their semantic features (e.g., gradable, marked, positive,
negative, see Section 5.1) and the disambiguation of their senses. Work in classification
has concentrated on exploring the contribution of several linguistic indicators (using machine
learning) for determining the semantic behavior of a given adjective. The approach aims at determining the most likely class for an adjective overall, rather than its specific class in a given context. The word sense disambiguation approach aims at determining the
meaning of the adjective within its surrounding context. Our work is not a classification task: we aim at discovering meanings for polysemous adjective-noun combinations, while nevertheless relying on linguistic indicators. Our model provides a ranking on the set of possible meanings, ignoring
the context surrounding a given adjective-noun pair. Our adjective-noun paraphrases can
be thought of as input to a sense disambiguation process which would have to choose among
them. In what follows we review work on classification and word sense disambiguation and
compare it to our own work.
Hatzivassiloglou and McKeown (1995b) present an empirical method which discovers
semantically related adjectives. Their approach makes use of simple co-occurrence frequencies
(adjective-noun and adjective-adjective pairs) to measure the similarity between adjectives
without recourse to linguistic information other than the one present in the corpus. The derived
groups of semantically related adjectives can be further filtered so as to distinguish gradable
from non-gradable adjectives. Hatzivassiloglou and Wiebe (2000) propose a log-linear statistical
model that classifies adjectives in terms of their gradability (i.e., gradable or non-gradable).
The model achieves a high precision (87.97%) by simply taking into account the number of
times an adjective has been observed in the context of a degree modifier (e.g., very).
Hatzivassiloglou and McKeown (1995a) develop a method for selecting the semantically
unmarked term out of a pair of antonymous adjectives (see Section 5.1 for details on
markedness). The approach exploits several linguistic diagnostics for markedness such as text
frequency (unmarked terms are more frequent than marked ones) and morphological complexity
(unmarked terms are morphologically simpler). Hatzivassiloglou and McKeown’s results
show that the best predictor for markedness is frequency, achieving an accuracy of 80.64%. A
similar approach is put forward in Hatzivassiloglou and McKeown (1997) in order to automatically
identify the semantic orientation of adjectives (see Section 5.1). The approach correlates
linguistic indicators with semantic orientation. For example, in most cases coordinated adjectives
are of the same orientation (e.g., fair and legitimate); when the connective is but, the adjectives are usually of different orientation. Hatzivassiloglou and McKeown present a log-linear
regression model which relies solely on information present in conjunctions of adjectives
(extracted from a corpus) and achieves a precision of 82% at determining if two conjoined adjectives
are of the same or different orientation. A clustering algorithm is used to separate
the adjectives into two subsets of different orientation (i.e., positive or negative). The approach achieves 92% accuracy on the classification task.
The approach put forward by Justeson and Katz (1995b) uses nouns as indicators for
discriminating among the senses of adjectives that modify them. Justeson and Katz observe that
some nouns in adjective-noun combinations are strongly associated with specific adjectival
senses. For example, when the adjective old modifies the noun man, it is typically used in
the sense “aged”, whereas when the same adjective modifies the noun house it is used in the
sense “not new”. Justeson and Katz discover which nouns are reliable indicators of a particular
adjective sense by looking at antonyms modifying the same noun in corpus sentences. For
example, in a sentence where old and young modify the noun man it is safe to assume that old
is interpretable as “not young”, whereas in a sentence where old and new co-occur as modifiers
of the noun house, old is interpretable as “not new”. Justeson and Katz’s study focuses on
five ambiguous adjectives (hard, light, old, right, and short) with two antonym-related senses
(e.g., hard relates to “not easy” and “not soft”) and shows that adjectives can be disambiguated
with very high precision (97%) on the basis of their nouns.
Chao and Dyer (2000) propose a method for the disambiguation of polysemous adjectives
which exploits WordNet’s taxonomic information. More specifically, Chao and Dyer
introduce a probabilistic model which estimates the likelihood of each adjective sense given the
semantic features of the noun it modifies. WordNet’s inventory of senses is used both for adjectives
and nouns, while the model’s parameters are estimated by submitting queries (e.g., great
hurricane) to the Altavista search engine and extrapolating from the number of returned documents
the frequency of the adjective-noun pair (see Mihalcea and Moldovan 1998 for details of
this technique). Manually disambiguated adjective-noun combinations are represented in terms
of Bayesian belief networks (encoding the distribution of an adjective sense over the semantic
features of the noun) which are in turn used to disambiguate unseen adjective-noun combinations.
The method achieves a precision of 81.4%. Chao and Dyer’s approach is conceptually
similar to Justeson and Katz’s (1995b) work. The main assumption underlying both proposals
is that the noun modified by the adjective in question plays a key role in its disambiguation.
Our approach focuses on a novel task, the interpretation of systematically polysemous
adjectives (i.e., adjectives whose meanings are not fixed but vary with respect to the noun they
modify). In contrast to Chao and Dyer (2000) and Justeson and Katz (1995b) we do not employ
a static predefined inventory of adjective senses (e.g., WordNet or antonymic relations). Instead,
we derive the meanings of polysemous adjective-noun combinations dynamically from a large
balanced corpus. Similarly to Hatzivassiloglou and McKeown (1995a, 1997), Hatzivassiloglou
and Wiebe (2000), and Chao and Dyer (2000), our approach is probabilistic: the acquired meanings
are ranked in terms of their likelihood in the corpus. We estimate the parameters of our
model straightforwardly by approximating the meaning of an adjective-noun pair to a verb
which is modified by the adjective (or its corresponding adverb) and whose subject or object is
the noun the adjective is in construction with. In contrast to Chao and Dyer, our model makes
minimal assumptions about how the meaning of a word is represented and combined with the
meaning of other words.
Although our approach patterns with Hatzivassiloglou and McKeown (1995a, 1997),
and Hatzivassiloglou and Wiebe (2000) in that it exploits linguistic diagnostics (e.g., verb-adverb dependencies, verb-argument relations) for deriving the interpretations of polysemous
adjectives, it goes beyond finding correspondences between linguistic features and corpus data.
Experiments 7 and 8 examine the empirical basis of theoretical generalizations about the behavior
of context-sensitive adjectives (Pustejovsky 1995; Vendler 1968). We showed that the
meanings derived by our model not only correspond to the interpretations discussed in the lexical
semantics literature (see Section 5.3) but also to human intuitions (see Section 5.4). We
were further able to quantify implicit assumptions in the lexical semantics literature, such as the
interpretation bias (i.e., subject, object, or none) of a given adjective.
5.8. Summary
In this chapter we investigated polysemous adjectives whose meaning varies depending on the
nouns they modify. We acquired the meanings of these adjectives from a large corpus and proposed
a probabilistic model which provides a ranking on the set of possible interpretations. We
identified lexical semantic information automatically by exploiting the consistent correspondences
between surface syntactic cues and lexical meaning.
We evaluated our results against paraphrase judgments elicited experimentally from
subjects naive to linguistic theory and showed that the model’s ranking of meanings correlates
reliably with human intuitions: meanings that were found highly probable by the model
were also rated as plausible by the subjects. More specifically, comparison between our model
and human judgments yields a reliable correlation of .40, when the upper bound for the task (i.e., inter-subject agreement) is approximately .65. Furthermore, our model performs reliably better than a naive baseline model, which only achieves a correlation of .25.
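The correlations reported throughout this chapter are Pearson product-moment coefficients between model scores and mean human ratings; as a minimal self-contained sketch (the item scores below are hypothetical, not taken from the experiments):

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical model log-probabilities and mean plausibility ratings
model = [-4.2, -5.1, -6.3, -7.0, -8.4]
ratings = [4.6, 4.1, 3.0, 2.2, 1.5]
print(round(pearson_r(model, ratings), 2))  # roughly 0.99
```

A high coefficient here means the model's preference ordering tracks the elicited plausibility ratings, which is exactly the sense in which the .40 correlation above is compared against the .65 inter-subject upper bound.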