Inducing Neural Models of Script Knowledge

Inducing Neural Models of Script Knowledge
Ashutosh Modi Ivan Titov

MMCI, ILLC,
Saarland University, Germany University of Amsterdam, Netherlands
amodi@mmci.uni-saarland.de titov@uva.nl
Abstract example chains (Chambers and Jurafsky, 2008)

or more general directed acyclic graphs (Regneri
Induction of common sense knowledge
et al., 2010). These graphs are scenario-specific,
about prototypical sequence of events has
nodes in them correspond to events (and associ-
recently received much attention (e.g.,
ated with sets of potential event mentions) and arcs
Chambers and Jurafsky (2008); Regneri
encode the temporal precedence relation. These
et al. (2010)). Instead of inducing this
graphs can then be used to inform NLP applica-
knowledge in the form of graphs, as in
tions (e.g., question answering) by providing in-
much of the previous work, in our method,
formation whether one event is likely to precede
distributed representations of event real-
or succeed another. Note that these graphs en-
izations are computed based on distributed
code common-sense knowledge about prototypi-
representations of predicates and their ar-
cal ordering of events rather than temporal order
guments, and then these representations
of events as described in a given text.
are used to predict prototypical event or-
derings. The parameters of the composi- Though representing the script knowledge as
tional process for computing the event rep- graphs is attractive from the human interpretability
resentations and the ranking component perspective, it may not be optimal from the appli-
of the model are jointly estimated. We cation point of view. More specifically, these rep-
show that this approach results in a sub- resentations (1) require a model designer to choose
stantial boost in performance on the event an appropriate granularity of event mentions (e.g.,
ordering task with respect to the previous whether nodes in the graph should be associated
approaches, both on natural and crowd- with verbs, or also their arguments); (2) do not
sourced texts. provide a mechanism for deciding which scenario
applies in a given discourse context and (3) often
1 Introduction do not associate confidence levels with informa-
It is generally believed that natural language un- tion encoded in the graph (e.g., the precedence re-
derstanding systems would benefit from incorpo- lation in Regneri et al. (2010)).
rating common-sense knowledge about prototyp- Instead of constructing a graph and using it to
ical sequences of events and their participants. provide information (e.g., prototypical event or-
Early work focused on structured representations dering) to NLP applications, in this work we ad-
of this knowledge (called scripts (Schank and vocate for constructing a statistical model which is
Abelson, 1977)) and manual construction of script capable to “answer” at least some of the questions
knowledge bases. However, these approaches do these graphs can be used to answer, but doing this
not scale to complex domains (Mueller, 1998; without explicitly representing the knowledge as a
Gordon, 2001). More recently, automatic induc- graph. In our method, the distributed representa-
tion of script knowledge from text have started tions (i.e. vectors of real numbers) of event real-
to attract attention: these methods exploit ei- izations are computed based on distributed repre-
ther natural texts (Chambers and Jurafsky, 2008, sentations of predicates and their arguments, and
2009) or crowdsourced data (Regneri et al., 2010), then the event representations are used in a ranker
and, consequently, do not require expensive ex- to predict the prototypical ordering of events. Both
pert annotation. Given a text corpus, they ex- the parameters of the compositional process for
tract structured representations (i.e. graphs), for computing the event representation and the rank-
49
Proceedings of the Eighteenth Conference on Computational Language Learning, pages 49–57,
c
Baltimore, Maryland USA, June 26-27 2014. 2014 Association for Computational Linguistics
event embedding f (e)
ing component of the model are estimated from
texts (either relying on unambiguous discourse
Ah
clues or natural ordering in text). In this way we hidden layer h
build on recent research on compositional distri-
butional semantics (Baroni and Zamparelli, 2011;
T a1 Rp T a2
Socher et al., 2012), though our approach specif-
ically focuses on embedding predicate-argument
structures rather than arbitrary phrases, and learn- arg embedding
a1 = C(bus)
predicate embedding arg embedding
p = C(disembark) a2 = C(passenger)
ing these representation to be especially informa-
tive for prototypical event ordering.
bus disembarked passengers
In order to get an intuition why the embedding
approach may be attractive, consider a situation Figure 1: Computation of an event representation
where a prototypical ordering of events the bus for a predicate with two arguments (the bus disem-
disembarked passengers and the bus drove away barked passengers), an arbitrary number of argu-
needs to be predicted. An approach based on fre- ments is supported by our approach.
quency of predicate pairs (Chambers and Jurafsky,
2008) (henceforth CJ08), is unlikely to make a
right prediction as driving usually precedes disem- ertheless, we believe that the framework (a proba-
barking. Similarly, an approach which treats the bilistic model using event embeddings as its com-
whole predicate-argument structure as an atomic ponent) can be extended to represent other aspects
unit (Regneri et al., 2010) will probably fail as of script knowledge by modifying the learning ob-
well, as such a sparse model is unlikely to be ef- jective, but we leave this for future work. In this
fectively learnable even from large amounts of un- paper, we show how our model can be used to pre-
labeled data. However, our embedding method dict if two event mentions are likely paraphrases
would be expected to capture relevant features of of the same event.
the verb frames, namely, the transitive use for the The approach is evaluated in two set-ups. First,
predicate disembark and the effect of the particle we consider the crowdsourced dataset of Regneri
away, and these features will then be used by the et al. (2010) and demonstrate that using our model
ranking component to make the correct prediction. results in the 13.5% absolute improvement in F 1
In previous work on learning inference on event ordering with respect to their graph in-
rules (Berant et al., 2011), it has been shown duction method (84.1% vs. 70.6%). Secondly,
that enforcing transitivity constraints on the we derive an event ordering dataset from the Gi-
inference rules results in significantly improved gaword corpus, where we also show that the em-
performance. The same is likely to be true for bedding method beats the frequency-based base-
the event ordering task, as scripts have largely line (i.e. reimplementation of the scoring compo-
linear structure, and observing that a ≺ b and nent of CJ08) by 22.8% in accuracy (83.5% vs.
b ≺ c is likely to imply a ≺ c. Interestingly, in 60.7%).
our approach we learn the model which satisfies
transitivity constraints, without the need for any
2 Model
explicit global optimization on a graph. This In this section we describe the model we use for
results in a significant boost of performance when computing event representations as well as the
using embeddings of just predicates (i.e. ignoring ranking component of our model.
arguments) with respect to using frequencies of
ordered verb pairs, as in CJ08 (76% vs. 61% on 2.1 Event Representation
the natural data). Learning and exploiting distributed word repre-
Our model is solely focusing on the ordering sentations (i.e. vectors of real values, also known
task, and admittedly does not represent all the in- as embeddings) have been shown to be benefi-
formation encoded by a script graph structure. For cial in many NLP applications (Bengio et al.,
example, it cannot be directly used to predict a 2001; Turian et al., 2010; Collobert et al., 2011).
missing event given a set of events (the narrative These representations encode semantic and syn-
cloze task (Chambers and Jurafsky, 2009)). Nev- tactic properties of a word, and are normally
50
learned in the language modeling setting (i.e. 2.2 Learning to Order
learned to be predictive of local word context),
The task of learning stereotyped order of events
though they can also be specialized by learning
naturally corresponds to the standard ranking set-
in the context of other NLP applications such as
ting. We assume that we are provided with se-
PoS tagging or semantic role labeling (Collobert
quences of events, and our goal is to capture this
et al., 2011). More recently, the area of dis-
order. We discuss how we obtain this learning ma-
tributional compositional semantics have started
terial in the next section. We learn a linear ranker
to emerge (Baroni and Zamparelli, 2011; Socher
(characterized by a vector w) which takes an event
et al., 2012), they focus on inducing represen-
representation and returns a ranking score. Events
tations of phrases by learning a compositional
are then ordered according to the score to yield
model. Such a model would compute a represen-
the model prediction. Note that during the learn-
tation of a phrase by starting with embeddings of
ing stage we estimate not only w but also the
individual words in the phrase, often this composi-
event representation parameters, i.e. matrices T ,
tion process is recursive and guided by some form
R and A, and the word embedding C. Note that
of syntactic structure.
by casting the event ordering task as a global rank-
In our work, we use a simple compositional
ing problem we ensure that the model implicitly
model for representing semantics of a verb frame
exploits transitivity of the relation, the property
e (i.e. the predicate and its arguments). We will
which is crucial for successful learning from finite
refer to such verb frames as events. The model is
amount of data, as we argued in the introduction
shown in Figure 1. Each word ci in the vocabu-
and will confirm in our experiments.
lary is mapped to a real vector based on the cor-
responding lemma (the embedding function C). At training time, we assume that each training
(k) (k)
The hidden layer is computed by summing linearly example k is a list of events e1 , . . . , en(k) pro-
(k) (k)
transformed predicate and argument1 embeddings vided in the stereotypical order (i.e. ei ≺ ej if
and passing it through the logistic sigmoid func- i < j), n(k) is the length of the list k. We mini-
tion. We use different transformation matrices for mize the L2 -regularized ranking hinge loss:
arguments and predicates, T and R, respectively.
The event representation f (e) is then obtained by � � (k) (k)
max(0, 1−wTf (ei ; Θ)+wTf (ej ; Θ))
applying another linear transform (matrix A) fol-
k i<j≤n(k)
lowed by another application of the sigmoid func-
tion. Another point to note in here is that, as in + α(�w�2 + �Θ�2 ),
previous work on script induction, we use lemmas
for predicates and specifically filter out any tense where f (e; Θ) is the embedding computed
markers as our goal is to induce common-sense for event e, Θ are all embedding parame-
knowledge about an event rather than properties ters corresponding to elements of the matrices
predictive of temporal order in a specific discourse {R, C, T, A}. We use stochastic gradient descent,
context. gradients w.r.t. Θ are computed using back propa-
We leave exploration of more complex and gation.
linguistically-motivated models for future work.2
These event representations are learned in the con- 3 Experiments
text of event ranking: the transformation parame-
ters as well as representations of words are forced We evaluate our approach in two different set-ups.
to be predictive of the temporal order of events. First, we induce the model from the crowdsourced
In our experiments, we also consider initialization data specifically collected for script induction by
of predicate and arguments with the SENNA word Regneri et al. (2010), secondly, we consider an
embeddings (Collobert et al., 2011). arguably more challenging set-up of learning the
1
Only syntactic heads of arguments are used in this work. model from news data (Gigaword (Parker et al.,
If an argument is a coffee maker, we will use only the word 2011)), in the latter case we use a learning sce-
maker. nario inspired by Chambers and Jurafsky (2008).3
2
In this study, we apply our model in two very differ-
ent settings, learning from crowdsourced and natural texts.
3
Crowdsourced collections are relatively small and require not Details about downloading the data and models are at:
over-expressive models. http://www.coli.uni-saarland.de/projects/smile/docs/nmReadme.txt
51
Precision (%) Recall (%) F1 (%)
BL EEverb MSA BS EE BL EEverb MSA BS EE BL EEverb MSA BS EE
Bus 70.1 81.9 80.0 76.0 85.1 71.3 75.8 80.0 76.0 91.9 70.7 78.8 80.0 76.0 88.4
Coffee 70.1 73.7 70.0 68.0 69.5 72.6 75.1 78.0 57.0 71.0 71.3 74.4 74.0 62.0 70.2
Fastfood 69.9 81.0 53.0 97.0 90.0 65.1 79.1 81.0 65.0 87.9 67.4 80.0 64.0 78.0 88.9
Return 74.0 94.1 48.0 87.0 92.4 68.6 91.4 75.0 72.0 89.7 71.0 92.8 58.0 79.0 91.0
Iron 73.4 80.1 78.0 87.0 86.9 67.3 69.8 72.0 69.0 80.2 70.2 69.8 75.0 77.0 83.4
Microw. 72.6 79.2 47.0 91.0 82.9 63.4 62.8 83.0 74.0 90.3 67.7 70.0 60.0 82.0 86.4
Eggs 72.7 71.4 67.0 77.0 80.7 68.0 67.7 64.0 59.0 76.9 70.3 69.5 66.0 67.0 78.7
Shower 62.2 76.2 48.0 85.0 80.0 62.5 80.0 82.0 84.0 84.3 62.3 78.1 61.0 85.0 82.1
Phone 67.6 87.8 83.0 92.0 87.5 62.8 87.9 86.0 87.0 89.0 65.1 87.8 84.0 89.0 88.2
Vending 66.4 87.3 84.0 90.0 84.2 60.6 87.6 85.0 74.0 81.9 63.3 84.9 84.0 81.0 88.2
Average 69.9 81.3 65.8 85.0 83.9 66.2 77.2 78.6 71.7 84.3 68.0 79.1 70.6 77.6 84.1
Table 1: Results on the crowdsourced data for the verb-frequency baseline (BL), the verb-only embed-
ding model (EEverb ), Regneri et al. (2010) (MSA), Frermann et al. (2014)(BS) and the full model (EE).
3.1 Learning from Crowdsourced Data quence Alignment (MSA) algorithm. As in their
3.1.1 Data and task work, each description was preprocessed to extract
a predicate and heads of argument noun phrases to
Regneri et al. (2010) collected descriptions (called be used in the model.
event sequence descriptions, ESDs) of various The methods are evaluated on human anno-
types of human activities (e.g., going to a restau- tated scenario-specific tests: the goal is to classify
rant, ironing clothes) using crowdsourcing (Ama- event pairs as appearing in a stereotypical order or
zon Mechanical Turk), this dataset was also com- not (Regneri et al., 2010).4
plemented by descriptions provided in the OMICS The model was estimated as explained in Sec-
corpus (Gupta and Kochenderfer, 2004). The tion 2.2 with the order of events in ESDs treated
datasets are fairly small, containing 30 ESDs per as gold standard. We used 4 held-out scenarios
activity type in average (we will refer to different to choose model parameters, no scenario-specific
activities as scenarios), but in principle the col- tuning was performed, and the 10 test scripts were
lection can easily be extended given the low cost not used to perform model selection. The selected
of crowdsourcing. The ESDs list events forming model used the dimensionality of 10 for event and
the scenario and are written in a bullet-point style. word embeddings. The initial learning rate and the
The annotators were asked to follow the prototyp- regularization parameter were set to 0.005 and 1.0,
ical event order in writing. As an example, con- respectively and both parameters were reduced by
sider a ESD for the scenario prepare coffee : the factor of 1.2 every epoch the error function
{go to coffee maker} → {fill water in coffee went up. We used 2000 epochs of stochastic gradi-
maker} → {place the filter in holder} → {place ent descent. Dropout (Hinton et al., 2012) with the
coffee in filter} → {place holder in coffee maker} rate of 20% was used for the hidden layers in all
→ {turn on coffee maker} our experiments. When testing, we predicted that
the event pair (e1 ,e2 ) is in the stereotypical order
Regneri et al. also automatically extracted pred-
(e1 ≺ e2 ) if the ranking score for e1 exceeded the
icates and heads of arguments for each event, as
ranking score for e2 .
needed for their MSA system and our composi-
tional model. 3.1.2 Results and discussion
Though individual ESDs may seem simple, the We evaluated our event embedding model (EE)
learning task is challenging because of the limited against baseline systems (BL , MSA and BS). MSA
amount of training data, variability in the used vo- is the system of Regneri et al. (2010). BS is a
cabulary, optionality of events (e.g., going to the hierarchical Bayesian model by Frermann et al.
coffee machine may not be mentioned in a ESD), (2014). BL chooses the order of events based on
different granularity of events and variability in the preferred order of the corresponding verbs in
the ordering (e.g., coffee may be put in the filter the training set: (e1 , e2 ) is predicted to be in the
before placing it in the coffee maker). Unlike our 4
The event pairs are not coming from the same ESDs
work, Regneri et al. (2010) relies on WordNet to making the task harder as the events may not be in any tem-
provide extra signal when using the Multiple Se- poral relation.
52
stereotypical order if the number of times the cor- too high for small crowd-sourced datasets, as it
responding verbs v1 and v2 appear in this order forces us to use larger matrices T and R. More-
in the training ESDs exceeds the number of times over, the SENNA embeddings are estimated from
they appear in the opposite order (not necessary at Wikipedia, and the activities in our crowdsourced
adjacent positions); a coin is tossed to break ties domain are perhaps underrepresented there.
(or if v1 and v2 are the same verb). This frequency
3.1.3 Paraphrasing
counting method was previously used in CJ08.5
We also compare to the version of our model Regneri et al. (2010) additionally measure para-
which uses only verbs (EEverbs ). Note that phrasing performance of the MSA system by com-
EEverbs is conceptually very similar to BL, as it es- paring it to human annotation they obtained: a sys-
sentially induces an ordering over verbs. However, tem needs to predict if a pair of event mentions are
this ordering can benefit from the implicit transi- paraphrases or not. The dataset contains 527 event
tivity assumption used in EEverbs (and EE), as we pairs for the 10 test scenarios. Each pair consists
discussed in the introduction. The results are pre- of events from the same scenario. The dataset is
sented in Table 1. fairly balanced containing from 47 to 60 examples
The first observation is that the full model im- per scenario.
proves substantially over the baseline and the pre- This task does not directly map to any statisti-
vious method (MSA) in F1 (13.5% improvement cal inference problem with our model. Instead we
over MSA and 6.5% improvement over BS). Note use an approach inspired by the interval algebra of
also that this improvement is consistent across sce- Allen (1983).
narios: EE outperforms MSA and BS on 9 scenar- Our ranking model maps event mentions to po-
ios out of 10 and 8 out of 10 scenarios in case of sitions on the time line (see Figure 2). However,
BS. Unlike MSA and BS, no external knowledge it would be more natural to assume that events are
(i.e. WordNet) was exploited in our method. intervals rather than points. In principle, these in-
We also observe a substantial improvement in tervals can be overlapping to encode a rich set of
all metrics from using transitivity, as seen by com- temporal relations (see (Allen, 1983)). However,
paring the results of BL and EEverb (11% improve- we make a simplifying assumption that the inter-
ment in F1). This simple approach already sub- vals do not overlap and every real number belongs
stantially outperforms the pipelined MSA system. to an interval. In other words, our goal is to induce
These results seem to support our hypothesis in a segmentation of the line: event mentions corre-
the introduction that inducing graph representa- sponding to the same interval are then regarded as
tions from scripts may not be an optimal strategy paraphrases.
from the practical perspective. One natural constraint on this segmentation is
We performed additional experiments using the the following: if two event mentions are from the
SENNA embeddings (Collobert et al., 2011). In- same training ESD, they cannot be assigned to the
stead of randomly initializing arguments and pred- same interval (as events in ESD are not supposed
icate embeddings (vectors), we initialized them to be paraphrases). In Figure 2 arcs link event
with pre-trained SENNA embeddings. We have mentions from the same ESD. We look for a seg-
not observed any significant boost in performance mentation which produces the minimal number of
from using the initialization (average F1 of 84.0% segments and satisfy the above constraint for event
for EE). We attribute the lack of significant im- mentions appearing in training data.
provement to the following three factors. First Though inducing intervals given a set of tem-
of all, the SENNA embeddings tend to place poral constraints is known to be NP-hard in gen-
antonyms / opposites near each other (e.g., come eral (see, e.g., (Golumbic and Shamir, 1993)), for
and go, or end and start). However, ‘opposite’ our constraints a simple greedy algorithm finds an
predicates appear in very different positions in optimal solution. We trace the line from the left
scripts. Additionally, the SENNA embeddings maintaining a set of event mentions in the current
have dimensionality of 50 which appears to be unfinished interval and create a boundary when the
constraint is violated; we repeat the process un-
5
They scored permutations of several events by summing til we processed all mentions. In Figure 2, we
the logarithmed differences of the frequencies of ordered verb
pairs. However, when applied to event pairs, their approach would create the first boundary between arrive
would yield exactly the same prediction rule as BL. in a restaurant and order beverages: order bev-
53
Scenario F1 (%)
APBL MSA BS EE
Take bus 53.7 74.0 47.0 63.5
Make coffee 42.1 65.0 52.0 63.5
a s u
ns ... Order fastfood 37.0 59.0 80.0 62.6
ra n
t er ge en
te ura ive ran
in t
rd era m ptio Return food back 64.8 71.0 67.0 81.1
n a r u O ev a o
E st Ar ta b se ew nu Iron clothes 43.3 67.0 60.0 56.7
re re
s ow evi me
B r
R a Microwave cooking 43.2 75.0 82.0 57.8
in Scrambled eggs 57.6 69.0 76.0 53.0
Take shower 42.1 78.0 67.0 55.7
Figure 2: Events on the time line, dotted arcs link Answer telephone 71.0 89.0 81.0 79.4
Vending machine 56.1 69.0 77.0 69.3
events from the same ESD. Average 51.1 71.6 68.9 64.5
Table 2: Paraphrasing results on the crowdsourced

erages and enter a restaurant are from the same data for Regneri et al. (2010) (MSA), Frermann
ESD and continuing the interval would violate et al. (2014)(BS) and the all-paraphrase baseline
the constraint. It is not hard to see that this re- (APBL) and using intervals induced from our
sults in an optimal segmentation. First, the seg- model (EE).
mentation satisfies the constraint by construction.
Secondly, the number of segments is minimal as
the arcs which caused boundary creation are non- investigation for future work.
overlapping, each of these arcs needs to be cut and
our algorithm cuts each arc exactly once. 3.2 Learning from Natural Text
This algorithm prefers to introduce a bound- In the second set of experiments we consider a
ary as late as possible. For example, it would more challenging problem, inducing knowledge
introduce a boundary between browse a menu about the stereotyped ordering of events from nat-
and review options in a menu even though the ural texts. In this work, we are largely inspired
corresponding points are very close on the line. by the scenario of CJ08. The overall strategy is
We modify the algorithm by moving the bound- the following: we process the Gigaword corpus
aries left as long as this move does not result with a high precision rule-based temporal classi-
in new constraint violations and increases mar- fier relying on explicit clues (e.g., “then”, “after”)
gin at boundaries. In our example, the boundary to get ordered pairs of events and then we train
would be moved to be between order beverages our model on these pairs (note that clues used by
and browse a menu, as desired. the classifier are removed from the examples, so
The resulting performance is reported in Ta- the model has to rely on verbs and their argu-
ble 2. We report results of our method, as well as ments). Conceptually, the difference between our
results for MSA, BS and a simple all-paraphrase approach and CJ08 is in using a different tempo-
baseline which predict that all mention pairs in a ral classifier, not enforcing that event pairs have
test set are paraphrases (APBL).6 We can see that the same protagonist, and learning an event em-
interval induction technique results in a lower F1 bedding model instead of scoring event sequences
than that of MSA or BS. This might be partially based on verb-pair frequencies.
due to not using external knowledge (WordNet) in We also evaluate our system on examples ex-
our method. tracted using the same temporal classifier (but val-
We performed extra analyses on the develop- idated manually) which allows us to use much
ment scenario doorbell. The analyses revealed that larger tests set, and, consequently, provide more
the interval induction approach is not very robust detailed and reliable error analysis.
to noise: removing a single noisy ESD results in a 3.2.1 Data and task
dramatic change in the interval structure induced
The Gigaword corpus consists of news data from
and in a significant increase of F1. Consequently,
different news agencies and newspapers. For test-
soft versions of the constraint would be beneficial.
ing and development we took the AFP (Agence
Alternatively, event embeddings (i.e. continuous
France-Presse) section, as it appeared most differ-
vectors) can be clustered directly. We leave this
ent from the rest when comparing sets of extracted
6
The results for the random baseline are lower: F1 of event pairs (other sections correspond mostly to
40.6% in average. US agencies). The AFP section was not used for
54
Accuracy (%) consisting of the verb pair ordering counts pro-
BL 60.7 vided by Chambers and Jurafsky (2008).8 We re-
CJ08 60.1 fer this baseline as CJ08. Note also that BL can be
EEverb 75.9 regarded as a reimplementation of CJ08 but with
EE 83.5 a different temporal classifier. We report results in
Table 3.
Table 3: Results on the Gigaword data for the The observations are largerly the same as be-
verb-frequency baseline (BL), the verb-only em- fore: (1) the full model substantially outperforms
bedding model (EEverb ), the full model (EE) and all other approaches (p-level < 0.001 with the per-
CJ08 rules. mutation test); (2) enforcing transitivity is very
helpful (75.9 % for EEverb vs. 60.1% for BL).
Surprisingly CJ08 rules produce as good results
training. This selection strategy was chosen to cre- as BL, suggesting that maybe our learning set-ups
ate a negative bias for our model which is more are not that different.
expressive than the baseline methods and, conse- However, an interesting question is in which sit-
quently, better at memorizing examples. uations using a more expressive model, EE, is ben-
As a rule-based temporal classifier, we used eficial. If these accuracy gains have to do with
high precision “happens-before” rules from the memorizing the data, it may not generalize well
VerbOcean system (Chklovski and Pantel, 2004). to other domains or datasets. In order to test this
Consider “to �verb-x� and then �verb-y�” as one hypothesis we divided the test examples in three
example of such rule. We used predicted collapsed frequency bands according to the frequency of the
Stanford dependencies (de Marneffe et al., 2006) corresponding verb pairs in the training set (to-
to extract arguments of the verbs, and used only tal, in both orders). There are 513, 249 and 50
a subset of dependents of a verb.7 This prepro- event pairs in the bands corresponding to unseen
cessing ensured that (1) clues which form part of pairs of verbs, frequency ≤ 10 and frequency >
a pattern are not observable by our model both at 10, respectively. These counts emphasize that cor-
train and test time; (2) there is no systematic dif- rect predictions on unseen pairs are crucial and
ference between both events (e.g., for collapsed these are exactly where BL would be equivalent
dependencies, the noun subject is attached to both to a random guess. Also, this suggest, even before
verbs even if the verbs are conjoined); (3) no in- looking into the results, that memorization is irrel-
formation about the order of events in text is avail- evant. The results for BL, CJ08, EEverb and EE
able to the models. Applying these rules resulted are shown in Figure 3.
in 22,446 event pairs for training, and we split One observation is that most gains for EE and
additional 1,015 pairs from the AFP section into EEverb are due to an improvement on unseen pairs.
812 for final testing and 203 for development. We This is fairly natural, as both transitivity and in-
manually validated random 50 examples and all 50 formation about arguments are the only sources
of them followed the correct temporal order, so we of information available. In this context it is im-
chose not to hand correct the test set. portant to note that some of the verbs are light,
We largely followed the same training and eval- in the sense that they have little semantic content
uation regime as for the crowdsourced data. We of their own (e.g., take, get) and the event seman-
set the regularization parameter and the learning tics can only be derived from analyzing their argu-
rate to 0.01 and 5.e − 4 respectively. The model ments (e.g., take an exam vs. take a detour). On
was trained for 600 epochs. The embedding sizes the high frequency verb pairs all systems perform
were 30 and 50 dimensions for words and events, equally well, except for CJ08 as it was estimated
respectively. from somewhat different data.
3.2.2 Results and discussion In order to understand how transitivity works,
we considered a few unseen predicate pairs where
In our experiments, as before, we use BL as a the EEverb model was correctly predicting their
baseline, and EEverb as a verb-only simplified order. For many of these pairs there were no infer-
version of our approach. We used another baseline
8
These verb pair frequency counts are available at
7
The list of dependencies not considered: aux, auxpass, www.usna.edu/Users/cs/nchamber/data/schemas/acl09/verb-
attr, appos, cc, conj, complm, cop, dep, det, punct, mwe. pair-orders.gz
55
96.0 96.0 ity to the methods proposed in this work but they
 94.1
CJ08 are mostly limited to binary relations and deal with
BL

81.8
83.1
EEverb predicting missing relations rather than with tem-
82.4 81.2
 77.8
EE
poral reasoning of any kind.
71.0 Identification of temporal relations within a text

57.2 62.7 is a challenging problem and an active area of re-
 50.0
search (see, e.g., the TempEval task (UzZaman
 et al., 2013)). Many rule-based and supervised ap-
proaches have been proposed in the past. How-

   ever, integration of common sense knowledge in-
duced from large-scale unannotated resources still
Figure 3: Results for different frequency bands: remains a challenge. We believe that our approach
unseen, medium frequency (between 1 and 10) will provide a powerful signal complementary to
and high frequency (> 10) verb pairs. information exploited by most existing methods.
5 Conclusions
ence chains of length 2 (e.g., chain of length 2 was
found for the pair accept ≺ carry: accept ≺ get We have developed a statistical model for rep-
and get ≺ carry but not many other pairs). This resenting common sense knowledge about proto-
observation suggest that our model captures some typical event orderings. Our model induces dis-
non-trivial transitivity rules. tributed representations of events by composing
predicate and argument representations. These
4 Related Work representations capture properties relevant to pre-
Additionally to the work on script induction dicting stereotyped orderings of events. We learn
discussed above (Chambers and Jurafsky, 2008, these representations and the ordering component
2009; Regneri et al., 2010), other methods for from unannotated data. We evaluated our model
unsupervised learning of event semantics have in two different settings: from crowdsourced data
been proposed. These methods include unsu- and natural news texts. In both set-ups our method
pervised frame induction techniques (O’Connor, outperformed baselines and previously proposed
2012; Modi et al., 2012). Frames encode situa- systems by a large margin. This boost in perfor-
tions (or objects) along with their participants and mance is primarily caused by exploiting transitiv-
properties (Fillmore, 1976). Events in these un- ity of temporal relations and capturing information
supervised approaches are represented with cate- encoded by predicate arguments.
gorical latent variables, and they are induced rely- The primary area of future work is to exploit
ing primarily on the selectional preferences’ sig- our method in applications such as question an-
nal. The very recent work of Cheung et al. (2013) swering. Another obvious applications is discov-
can be regarded as their extension but Cheung et ery of temporal relations within documents (Uz-
al. also model transitions between events with Zaman et al., 2013) where common sense knowl-
Markov models. However, neither of these ap- edge implicit in script information, induced from
proaches considers (or directly optimizes) the dis- large unannotated corpora, should be highly ben-
criminative objective of learning to order events, eficial. Our current model uses a fairly naive se-
and neither of them uses distributed representa- mantic composition component, we plan to extend
tions to encode semantic properties of events. it with more powerful recursive embedding meth-
ods which should be especially beneficial when
As we pointed out before, our embedding ap-
considering very large text collections.
proach is similar (or, in fact, a simplification of)
the phrase embedding methods studied in the re-
6 Acknowledgements
cent work on distributional compositional seman-
tics (Baroni and Zamparelli, 2011; Socher et al., Thanks to Lea Frermann, Michaela Regneri and
2012). However, they have not specifically looked Manfred Pinkal for suggestions and help with the
into representing script information. Approaches data. This work is partially supported by the
which study embeddings of relations in knowledge MMCI Cluster of Excellence at the Saarland Uni-
bases (e.g., Riedel et al. (2013)) bear some similar- versity.
56
References Andrew Gordon. 2001. Browsing image collec-
tions with representations of common-sense ac-
James F Allen. 1983. Maintaining knowledge
tivities. JAIST, 52(11).
about temporal intervals. Communications of
the ACM, 26(11):832–843. Rakesh Gupta and Mykel J. Kochenderfer. 2004.
Common sense data acquisition for indoor mo-
Marco Baroni and Robert Zamparelli. 2011. bile robots. In Proceedings of AAAI.
Nouns are vectors, adjectives are matrices: Rep-
Geoffrey E. Hinton, Nitish Srivastava, Alex
resenting adjective-noun constructions in se-
Krizhevsky, Ilya Sutskever, and Ruslan
mantic space. In Proceedings of EMNLP.
Salakhutdinov. 2012. Improving neural net-
Yoshua Bengio, Réjean Ducharme, and Pascal works by preventing co-adaptation of feature
Vincent. 2001. A neural probabilistic language detectors. arXiv: CoRR, abs/1207.0580.
model. In Proceedings of NIPS. Ashutosh Modi, Ivan Titov, and Alexandre Kle-
Jonathan Berant, Ido Dagan, and Jacob Gold- mentiev. 2012. Unsupervised induction of
berger. 2011. Global learning of typed entail- frame-semantic representations. In Proceedings
ment rules. In Proceedings of ACL. of the NAACL-HLT Workshop on Inducing Lin-
Nathanael Chambers and Dan Jurafsky. 2009. Un- guistic Structure. Montreal, Canada.
supervised learning of narrative schemas and Erik T. Mueller. 1998. Natural Language Process-
their participants. In Proceedings of ACL. ing with Thought Treasure. Signiform.
Nathanael Chambers and Daniel Jurafsky. 2008. Brendan O’Connor. 2012. Learning frames from
Unsupervised learning of narrative event chains. text with an unsupervised latent variable model.
In Proceedings of ACL. CMU Technical Report.
Jackie Chi Kit Cheung, Hoifung Poon, and Lucy Robert Parker, David Graff, Junbo Kong,
Vanderwende. 2013. Probabilistic frame induc- Ke Chen, and Kazuaki Maeda. 2011. En-
tion. In Proceedings of NAACL. glish gigaword fifth edition. Linguistic Data
Consortium.
Timothy Chklovski and Patrick Pantel. 2004. Ver-
Michaela Regneri, Alexander Koller, and Manfred
bocean: Mining the web for fine-grained se-
Pinkal. 2010. Learning script knowledge with
mantic verb relations. In Proceedings of
web experiments. In Proceedings of ACL.
EMNLP.
Sebastian Riedel, Limin Yao, Andrew McCal-
R. Collobert, J. Weston, L. Bottou, M. Karlen, lum, and Benjamin Marlin. 2013. Relation ex-
K. Kavukcuoglu, and P. Kuksa. 2011. Natural traction with matrix factorization and universal
language processing (almost) from scratch. schemas. TACL.
Journal of Machine Learning Research,
R. C Schank and R. P Abelson. 1977. Scripts,
12:2493–2537.
Plans, Goals, and Understanding. Lawrence
Marie-Catherine de Marneffe, Bill MacCartney, Erlbaum Associates, Potomac, Maryland.
and Christopher D. Manning. 2006. Generating
Richard Socher, Brody Huval, Christopher D.
typed dependency parses from phrase structure
Manning, and Andrew Y. Ng. 2012. Seman-
parses. In Proceedings of LREC.
tic compositionality through recursive matrix-
Charles Fillmore. 1976. Frame semantics and the vector spaces. In Proceedings of EMNLP.
nature of language. Annals of the New York Joseph Turian, Lev Ratinov, and Yoshua Bengio.
Academy of Sciences, 280(1):20–32. 2010. Word representations: A simple and gen-
Lea Frermann, Ivan Titov, and Manfred Pinkal. eral method for semi-supervised learning. In
2014. A hierarchical bayesian model for un- Proceedings of ACL.
supervised induction of script knowledge. In Naushad UzZaman, Hector Llorens, Leon Der-
EACL, Gothenberg, Sweden. czynski, James Allen, Marc Verhagen, and
Martin Charles Golumbic and Ron Shamir. 1993. James Pustejovsky. 2013. Semeval-2013 task
Complexity and algorithms for reasoning about 1: Tempeval-3: Evaluating time expressions,
time: A graph-theoretic approach. Journal of events, and temporal relations. In Proceedings
ACM, 40(5):1108–1133. of SemEval.
57

Inducing Neural Models of Script Knowledge

Uploaded by

Copyright:

Available Formats

Inducing Neural Models of Script Knowledge

Uploaded by

Document Information

Original Description:

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Inducing Neural Models of Script Knowledge

Uploaded by

Copyright:

Available Formats

Inducing Neural Models of Script Knowledge

Ashutosh Modi Ivan Titov

Abstract example chains (Chambers and Jurafsky, 2008)

Table 2: Paraphrasing results on the crowdsourced

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.