Reinforced Mnemonic Reader For Machine Comprehension: Minghao Hu Yuxing Peng Xipeng Qiu
Figure 1: The high-level overview of Reinforced Mnemonic Reader. In the feature-rich encoder, $\{\tilde{x}^q_i\}_{i=1}^n$ and $\{\tilde{x}^c_j\}_{j=1}^m$ are embedding matrices of query and context respectively. $\{q_i\}_{i=1}^n$ and $\{c_j\}_{j=1}^m$ are concatenated hidden states of the encoding BiLSTM. In the iterative aligner, $\{\bar{c}^t_j\}_{j=1}^m$ and $\{\hat{c}^t_j\}_{j=1}^m$ are the query-aware and self-aware context representations in the $t$-th hop respectively, while $\{\check{c}^t_j\}_{j=1}^m$ is the fully-aware context representation. In the memory-based answer pointer, $z^l_s$ and $z^l_e$ are memory vectors used for predicting probability distributions of the answer span ($p^l_s$ and $p^l_e$) in the $l$-th hop respectively. SFU refers to the semantic fusion unit.
to directly optimize both the EM metric and the F1 score, we introduce a new objective function which combines the maximum-likelihood cross-entropy loss with rewards from reinforcement learning. Experiments show that Reinforced Mnemonic Reader obtains state-of-the-art results on the TriviaQA and SQuAD datasets.

Model Overview

In the machine comprehension (MC) task, a query $Q$ and a context $C$ are given, and the goal is to predict an answer $A$, which is constrained to be a segment of text in $C$ (Rajpurkar et al. 2016; Joshi et al. 2017). We design the Reinforced Mnemonic Reader to model the probability distribution $p_\theta(A|C,Q)$, where $\theta$ is the set of all trainable parameters. Our model consists of three basic modules: a feature-rich encoder, an iterative aligner and a memory-based answer pointer, which are depicted in Figure 1. Below we discuss them in more detail.

Feature-rich Encoder

In the MC task, the context and the query are both word sequences. The feature-rich encoder is responsible for mapping these word sequences to their corresponding word embeddings, and encoding these embeddings for further processing. We utilize both word-level and character-level embeddings as well as additional features to enhance the capacity of the encoder, as described below.

Hybrid Embedding. Given a query $Q = \{w^q_i\}_{i=1}^n$ and a context $C = \{w^c_j\}_{j=1}^m$, the feature-rich encoder first converts each word $w$ to its respective word embedding $x_w$, where $n$ and $m$ denote the lengths of the query and the context respectively. At a lower level of granularity, the encoder also embeds each word $w$ by encoding its character sequence with a bidirectional long short-term memory network (BiLSTM) (Hochreiter and Schmidhuber 1997). The last hidden states are taken as the character-level embedding $x_c$. Each word embedding $x$ is then represented as the concatenation of the character-level embedding and the word-level embedding, denoted as $x = [x_w; x_c] \in \mathbb{R}^d$, where $d$ is the total dimension of the word-level and character-level embeddings.

Additional Features. To better identify key concepts and key entities in both the query and the context, we go beyond word embeddings and utilize several additional linguistic features. We first use a simple yet effective binary feature of exact matching (EM), which has been successfully used in MC tasks (Chen et al. 2017). This binary feature indicates whether a word in the context can be exactly matched to one query word, and vice versa for a word in the query. Moreover, we create additional look-up based embedding matrices for part-of-speech tags and named-entity tags. For each word in the query and the context, we simply look up its tag embeddings and concatenate them with the word embedding and the EM feature.

The explicit query category (i.e., who, when, where, and so on) is a high-level abstraction of the expression of queries, providing additional clues for searching for the answer. For example, a “when” query pays more attention to
temporal information while a “where” query seeks spatial information. Following Zhang et al. (2017), we count the keyword frequency of all queries and obtain the top-9 query categories: what, how, who, when, which, where, why, be, other. Each query category is then represented by a trainable embedding. For each query, we look up its query-category embedding and use a feedforward neural network for projection. The result is then added to the corresponding query embedding matrix. After incorporating all of these additional features with the original embeddings, we obtain $\{\tilde{x}^q_i\}_{i=1}^n$ for the query and $\{\tilde{x}^c_j\}_{j=1}^m$ for the context.

Encoding. In order to model the word sequence under its contextual information, we use another BiLSTM to encode both the context and the query as follows:

$$q_i = \mathrm{BiLSTM}(q_{i-1}, \tilde{x}^q_i), \quad \forall i \in [1, \ldots, n]$$
$$c_j = \mathrm{BiLSTM}(c_{j-1}, \tilde{x}^c_j), \quad \forall j \in [1, \ldots, m] \tag{1}$$

where $q_i$ and $c_j \in \mathbb{R}^{2h}$ are concatenated hidden states of the BiLSTM for the $i$-th query word and the $j$-th context word, and $h$ is the dimension of the hidden state. Finally we obtain two encoded matrices: $Q' = \{q_i\}_{i=1}^n \in \mathbb{R}^{2h \times n}$ for the query and $C' = \{c_j\}_{j=1}^m \in \mathbb{R}^{2h \times m}$ for the context.

Iterative Aligner

The iterative aligner consists of multiple horizontal hops. In each hop, the module updates the representation of each context word by attending to both the query and the context itself. A novel semantic fusion unit is used to perform the update. Finally, an aggregating operation is applied at the end of each hop to further encourage information to flow across the context.

Interactive Aligning. The iterative aligner reads the context and the query over a total of $T$ hops. In the $t$-th hop, the aligner first attends to both the query and the context simultaneously to capture the interaction between them, and generates the query-aware context representation. More specifically, we first compute a coattention matrix $B^t \in \mathbb{R}^{n \times m}$ between the query and the context: $B^t_{ij} = q_i^{\top} \cdot \check{c}^{t-1}_j$, where $\check{c}^{t-1}_j$ denotes the representation of the $j$-th context word in the previous hop and $\check{c}^0_j = c_j$. The matrix element $B^t_{ij}$ indicates the similarity between the $i$-th query word and the $j$-th context word.

For each context word, we intend to find the most relevant query word by computing an attended query vector, and fuse it back into the context word. Let $b^t_j \in \mathbb{R}^n$ denote the normalized attention distribution over the query for the $j$-th context word, and let $\tilde{q}^t_j \in \mathbb{R}^{2h}$ denote the corresponding attended query vector. The computation is defined in Eq. 2:

$$b^t_j = \mathrm{softmax}(B^t_{:j})$$
$$\tilde{q}^t_j = Q' \cdot b^t_j, \quad \forall j \in [1, \ldots, m] \tag{2}$$

Hence $\{\tilde{q}^t_j\}_{j=1}^m$ represents the attended query vectors for all context words.

To compute the query-aware representation for the $j$-th context word, we combine the context representation $\check{c}^{t-1}_j$ with the attended query vector $\tilde{q}^t_j$ by an effective heuristic, and feed them into a semantic fusion unit as follows:

$$\bar{c}^t_j = \mathrm{SFU}(\check{c}^{t-1}_j,\; \tilde{q}^t_j,\; \check{c}^{t-1}_j \circ \tilde{q}^t_j,\; \check{c}^{t-1}_j - \tilde{q}^t_j) \tag{3}$$

where $\bar{C}^t = \{\bar{c}^t_j\}_{j=1}^m$ denotes the query-aware representation of the context, $\circ$ stands for element-wise multiplication and $-$ means element-wise subtraction.

Semantic Fusion Unit. To efficiently fuse attended information into the original word, we design a simple semantic fusion unit (SFU). An SFU accepts an input vector $r$ and a set of fusion vectors $\{f_i\}_{i=1}^k$, and generates an output vector $o$. The output vector and all fusion vectors have the same dimension as the input vector. The output vector is expected not only to retrieve correlative information from the fusion vectors, but also to remain partly unchanged from the input vector.

The SFU consists of two components: composition and gate. The composition component produces a hidden state $\tilde{r}$ which is a linear interpolation of the input vector and fusion vectors. The gate component generates an update gate $g$ to control the degree to which the hidden state is exposed. The output vector is calculated as follows:

$$\tilde{r} = \tanh(W_r([r; f_1; \ldots; f_k]) + b_r)$$
$$g = \sigma(W_g([r; f_1; \ldots; f_k]) + b_g)$$
$$o = g \circ \tilde{r} + (1 - g) \circ r \tag{4}$$

where $W_r$ and $W_g$ are trainable weight matrices, $b_r$ and $b_g$ are trainable biases, and $\sigma$ is the sigmoid activation function.

Self Aligning. After the interactive aligning, the iterative aligner further aligns the query-aware context representation $\bar{C}^t$ with itself to synthesize contextual information among context words. The insight is that the limited capability of recurrent neural networks makes it difficult to model long-term dependencies of contexts (Bengio, Simard, and Frasconi 1994). Each word is only aware of its surrounding neighbours and has no cues about the entire context. Aligning information between context words allows crucial clues to be fused into the context representation.

Similar to the interactive aligning, we first compute a self-coattention matrix $\tilde{B}^t \in \mathbb{R}^{m \times m}$ for the context:

$$\tilde{B}^t_{ij} = \mathbb{1}(i \neq j)\, \bar{c}^{t\top}_i \cdot \bar{c}^t_j \tag{5}$$

where $\tilde{B}^t_{ij}$ indicates the similarity between the $i$-th context word and the $j$-th context word, and the diagonal of the self-coattention matrix is set to zero to prevent a word from being aligned with itself. Next, for each context word, we compute an attended context vector $\tilde{c}^t_j \in \mathbb{R}^{2h}$ in a similar way to Eq. 2. Hence, $\{\tilde{c}^t_j\}_{j=1}^m$ contains the attended context vectors for all query-aware context words.

Finally, we use the same heuristic as in Eq. 3 to combine the query-aware context representation $\bar{c}^t_j$ with the attended context vector $\tilde{c}^t_j$ for each context word, and fuse them through another SFU to produce the self-aware context representation $\hat{c}^t_j$. Collecting these representations for all context words yields $\hat{C}^t = \{\hat{c}^t_j\}_{j=1}^m$.
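
As a rough illustration of Eqs. 2 to 5, the following NumPy sketch implements the semantic fusion unit and the two alignment steps for a single hop. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the random initialization, the toy sizes (d = 2h = 8, n = 3, m = 5) and the choice of separate SFU parameters for each step are all illustrative, and the BiLSTM encoding and aggregation are omitted. Following Eq. 5 literally, the diagonal of the self-coattention matrix is set to zero before the softmax.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=0):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sfu(r, fusions, W_r, b_r, W_g, b_g):
    """Semantic fusion unit (Eq. 4), applied to every column at once.
    r: (d, m) input vectors; fusions: list of k arrays of shape (d, m)."""
    x = np.concatenate([r] + list(fusions), axis=0)   # ((k+1)*d, m)
    r_tilde = np.tanh(W_r @ x + b_r[:, None])         # composition
    g = sigmoid(W_g @ x + b_g[:, None])               # update gate
    return g * r_tilde + (1.0 - g) * r

def fuse_with_attended(C, A, params):
    """Heuristic of Eq. 3: fuse C with A, C * A and C - A."""
    return sfu(C, [A, C * A, C - A], *params)

def interactive_align(Q, C_prev, params):
    """Q: (d, n) encoded query; C_prev: (d, m) context from the previous hop."""
    B = Q.T @ C_prev              # (n, m) coattention, B[i, j] = q_i . c_j
    b = softmax(B, axis=0)        # Eq. 2: distribution over query words
    Q_att = Q @ b                 # (d, m) attended query vectors
    return fuse_with_attended(C_prev, Q_att, params)

def self_align(C_bar, params):
    m = C_bar.shape[1]
    B = (C_bar.T @ C_bar) * (1.0 - np.eye(m))   # Eq. 5: zero the diagonal
    b = softmax(B, axis=0)
    C_att = C_bar @ b             # (d, m) attended context vectors
    return fuse_with_attended(C_bar, C_att, params)

def init_params(d, k, rng):
    """Hypothetical random initialization of (W_r, b_r, W_g, b_g)."""
    return (0.1 * rng.standard_normal((d, (k + 1) * d)), np.zeros(d),
            0.1 * rng.standard_normal((d, (k + 1) * d)), np.zeros(d))

rng = np.random.default_rng(0)
d, n, m = 8, 3, 5                 # toy sizes; d plays the role of 2h
Q = rng.standard_normal((d, n))
C0 = rng.standard_normal((d, m))
C_bar = interactive_align(Q, C0, init_params(d, 3, rng))
C_hat = self_align(C_bar, init_params(d, 3, rng))
print(C_bar.shape, C_hat.shape)   # (8, 5) (8, 5)
```

Because the gate interpolates between the composed state and the input, driving the gate bias strongly negative makes the SFU return its input almost unchanged, which is one way to sanity-check an implementation.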
Aggregating. At the end of each hop, we utilize another BiLSTM as in Eq. 1 to model the self-aware context representation $\hat{C}^t$ with its contextual interaction. Notice that this step differs from the encoding step in that it captures the short-term information among context words conditioned on the entire query and context. The output is the fully-aware context representation $\check{C}^t = \{\check{c}^t_j\}_{j=1}^m$, which is used for the next hop of alignment or for the answer prediction.

Memory-based Answer Pointer

The MC task requires the model to find a sub-phrase of the context to answer the query, which is derived by predicting the start position $i$ and the end position $j$ of the answer phrase $A$, conditioned on both the context $C$ and the query $Q$:

$$p_\theta(A|C,Q) = p_s(i|C,Q) \cdot p_e(j|i,C,Q) \tag{6}$$

Here we abbreviate $p_s(i|C,Q)$ as $p_s(i)$ and $p_e(j|i,C,Q)$ as $p_e(j)$.

We propose a memory-based answer pointing module which maintains a memory vector to record necessary reading knowledge for continuously refining the predicted answer span. The memory vector is initialized as a query summary $\vec{q} \in \mathbb{R}^{2h}$ and is allowed to be updated with relevant evidence during the prediction. In our experiments we found that simply using the last hidden states of the encoded query representations as $\vec{q}$ results in good performance.

The answer pointing module consists of $L$ hops in total. In the $l$-th hop, the module first attends over the fully-aware context representation with the memory vector $z^l_s$ to produce the probability distribution of the start position $p^l_s$, by using a pointer network (Vinyals, Fortunato, and Jaitly 2015):

$$s^l_i = \mathrm{FN}(\check{c}^T_i,\; z^l_s,\; \check{c}^T_i \circ z^l_s)$$
$$p^l_s(i) = \mathrm{softmax}(w^l_s s^l_i) \tag{7}$$

where FN denotes a feedforward neural network that provides a non-linear mapping of its input, $w^l_s \in \mathbb{R}^h$ is a trainable weight vector, and the superscript $T$ on $\check{c}^T_i$ refers to the final hop of the aligner rather than a transpose.

The normalized probability $p^l_s \in \mathbb{R}^m$ points out the potential start position of the answer, and can be viewed as an attention distribution for aggregating information of the current prediction, yielding an evidence vector $u^l_s \in \mathbb{R}^{2h}$: $u^l_s = \check{C}^T \cdot p^l_s$. The memory vector can retrieve relevant cues from the evidence vector to refine itself. Here we use the SFU for the refinement, which takes the memory vector $z^l_s$ and the evidence vector $u^l_s$ as inputs and outputs the new memory vector $z^l_e$: $z^l_e = \mathrm{SFU}(z^l_s, u^l_s)$.

The probability distribution of the end position $p^l_e$ is computed similarly to Eq. 7, by using $z^l_e$ instead of $z^l_s$:

$$e^l_j = \mathrm{FN}(\check{c}^T_j,\; z^l_e,\; \check{c}^T_j \circ z^l_e)$$
$$p^l_e(j) = \mathrm{softmax}(w^l_e e^l_j) \tag{8}$$

If the $l$-th hop is not the last hop, then the normalized probability $p^l_e \in \mathbb{R}^m$ is also used as an attention distribution to generate an evidence vector $u^l_e \in \mathbb{R}^{2h}$, which is further fed with the memory vector $z^l_e$ through another SFU to yield the memory vector $z^{l+1}_s$ for the next-hop prediction.

Training Procedure

In this section, we propose two different ways of training our model. Specifically, we use standard maximum-likelihood estimation (MLE) to maximize the sum of log probabilities of the ground-truth answer span, and we propose a reinforcement learning approach which aims at directly optimizing the evaluation metric of the MC task.

Supervised Learning with Boundary Detecting. The most widely used training method for the MC task is the boundary detecting method (Wang and Jiang 2017), which minimizes the sum of negative log probabilities of the true start and end positions based on the predicted distributions:

$$J_{MLE}(\theta) = -\sum_{i=1}^{N} \left[ \log p^L_s(y^s_i) + \log p^L_e(y^e_i) \right] \tag{9}$$

where $y^s_i$ and $y^e_i$ are the ground-truth start and end positions of the $i$-th example, and $N$ is the number of examples in the dataset.

Minimizing $J_{MLE}(\theta)$ can be viewed as optimizing the Exact Match (EM) metric, since the log-likelihood objective is equivalent to a KL divergence between a delta distribution $\delta(A|A^*)$ and the model distribution $p_\theta(A|C,Q)$, where $\delta(A|A^*) = 1$ at $A = A^*$ and $0$ at $A \neq A^*$ (Norouzi et al. 2016), and $A^*$ denotes the ground-truth answer. However, directly determining the exact boundary may be difficult in situations where the answer boundary is fuzzy or the answer is too long. For example, it is quite hard to define the answer boundary of a “why” query, where the answer usually is not a distinguishable entity.

Reinforcement Learning for Machine Comprehension. One way to tackle this problem is to directly optimize the F1 score with reinforcement learning. The F1 score measures the overlap between the predicted answer and the ground-truth answer, serving as a “soft” metric compared to the “hard” EM. Taking the F1 score as reward, we use the REINFORCE algorithm (Williams 1992) to maximize the model's expected reward. For each sampled answer $\hat{A}$, we define the loss as:

$$J_{RL}(\theta) = -\mathbb{E}_{\hat{A} \sim p_\theta(A|C,Q)} \left[ R(\hat{A}, A^*) \right] \tag{10}$$

where $p_\theta$ is the policy to be learned, and $R(\hat{A}, A^*)$ is the reward function for a sampled answer, computed as the F1 score with the ground-truth answer $A^*$. $\hat{A}$ is obtained by sampling from the predicted probability distribution $p_\theta(A|C,Q)$.

To further stabilize training and prevent the model from overwriting its earlier training, we integrate maximum-likelihood estimation with reinforcement learning by using a linear interpolation:

$$J(\theta) = \lambda J_{MLE}(\theta) + (1 - \lambda) J_{RL}(\theta) \tag{11}$$

where $\lambda$ is a scaling hyperparameter tuned in experiments. Minimizing this loss is equivalent to optimizing both the EM metric and the F1 score.
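
To make the mixed objective of Eqs. 9 to 11 concrete, the sketch below computes a token-level F1 reward and a single-sample REINFORCE surrogate for one example. It is an illustrative sketch under assumptions rather than the paper's code: the function names are hypothetical, the start and end positions are sampled independently from fixed distributions (the model conditions the end on the start through its memory vector), no reward baseline is used, and the F1 computation skips the text normalization that evaluation scripts typically apply. The surrogate is the quantity whose gradient with respect to the log probabilities gives a one-sample estimate of the gradient of Eq. 10.

```python
from collections import Counter
import numpy as np

def f1_score(pred_tokens, gold_tokens):
    """Token-overlap F1 between a predicted and a gold answer."""
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2.0 * precision * recall / (precision + recall)

def mle_loss(p_start, p_end, gold_span):
    """Eq. 9 for a single example: negative log-likelihood of the gold span."""
    ys, ye = gold_span
    return -(np.log(p_start[ys]) + np.log(p_end[ye]))

def rl_surrogate(p_start, p_end, gold_span, tokens, rng):
    """Single-sample REINFORCE surrogate for Eq. 10 (no baseline)."""
    ys, ye = gold_span
    s = int(rng.choice(len(p_start), p=p_start))   # sampled start
    e = int(rng.choice(len(p_end), p=p_end))       # sampled end (assumed independent)
    # An end before the start yields an empty span and therefore zero reward.
    reward = f1_score(tokens[s:e + 1], tokens[ys:ye + 1])
    return -reward * (np.log(p_start[s]) + np.log(p_end[e]))

def mixed_loss(p_start, p_end, gold_span, tokens, lam, rng):
    """Eq. 11: linear interpolation of the MLE and RL terms."""
    return (lam * mle_loss(p_start, p_end, gold_span)
            + (1.0 - lam) * rl_surrogate(p_start, p_end, gold_span, tokens, rng))

tokens = "the denver broncos defeated the carolina panthers".split()
p_start = np.array([0.05, 0.70, 0.05, 0.05, 0.05, 0.05, 0.05])
p_end = np.array([0.05, 0.05, 0.70, 0.05, 0.05, 0.05, 0.05])
gold_span = (1, 2)                                 # "denver broncos"
loss = mixed_loss(p_start, p_end, gold_span, tokens, lam=0.01,
                  rng=np.random.default_rng(0))
print(float(loss))
```

A prediction such as "the denver broncos" against the gold answer "denver broncos" scores 0 under EM but 0.8 under this F1, which is the kind of partial credit the RL term rewards.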
Evaluation

Datasets

We mainly focus on two large-scale machine comprehension datasets to train and evaluate our model. One is SQuAD and the other is the recently released TriviaQA. SQuAD (Rajpurkar et al. 2016) is a machine comprehension dataset containing more than 100K queries in total, manually annotated by crowdsourcing workers on a set of Wikipedia articles.

TriviaQA (Joshi et al. 2017) is a newly available machine comprehension dataset consisting of over 650K context-query-answer triples. The contexts are automatically generated from either Wikipedia or Web search results. Table 1 contains the statistics of both datasets. Notice that a query-answer tuple in the Wikipedia domain may refer to multiple evidence contexts. Therefore we perform the prediction for each context separately and choose the candidate answer with the highest confidence score.

The contexts in TriviaQA (average 2895 words) are much longer than those in SQuAD (average 122 words). To efficiently train our model on TriviaQA, we first approximate the answer span by finding the first match of the answer string in contexts. Then we truncate around the answer sentence in a window size of 8 for the train set and development set, and we truncate the context to the first 800 words for the test set. After the preprocessing, the average length of contexts is reduced to under 500 words.

Dataset              Unit       Train    Dev     Test    Avg.L
SQuAD                Contexts   18.8K    2.0K    -       122
SQuAD                Queries    87K      10K     -       11
TriviaQA Wikipedia   Contexts   110K     14K     13K     495
TriviaQA Wikipedia   Queries    61.8K    7.9K    7.7K    15
TriviaQA Web         Contexts   528K     68K     65K     458
TriviaQA Web         Queries    76.5K    9.9K    9.5K    13

Table 1: Data statistics for SQuAD and TriviaQA. Both SQuAD and the Wikipedia domain are evaluated over queries while the Web domain of TriviaQA is evaluated over contexts. Avg.L refers to average length.

Experimental Configuration

We evaluate the Mnemonic Reader (M-Reader) and the Reinforced Mnemonic Reader (M-Reader+RL) by running the following experiments on both datasets. We first run maximum-likelihood estimation (MLE) to train M-Reader until convergence by optimizing Eq. 9. We then finetune this model with reinforcement learning (RL) by optimizing Eq. 11, until the F1 score on the development set no longer improves. We use λ = 0.01 for M-Reader+RL during RL training.

We use the Adam optimizer (Kingma and Ba 2014) with an initial learning rate of 0.0008 for MLE training, which is halved whenever meeting a bad iteration. We use the SGD optimizer with a learning rate of 0.001 for RL training. The batch size is set to 45. A dropout rate (Srivastava et al. 2014) of 0.2 is used to prevent overfitting. Word embeddings are initialized with 100-dimensional GloVe vectors (Pennington, Socher, and Manning 2014) and remain fixed during training. Out-of-vocabulary words are randomly sampled from Gaussian distributions. The sizes of the character embedding and the character hidden state are 50. The dimension of the hidden state is 100. The number of hops is set to 2 for both the aligner and the answer pointer. We set the maximum context length to 300 for SQuAD and 500 for TriviaQA.

Overall Results

Domain   Model            Full EM   Full F1   Verified EM   Verified F1
Wiki     Classifier [1]   22.5      26.5      27.2          31.4
Wiki     BiDAF [2]        40.3      45.9      44.9          50.7
Wiki     MEMEN [3]        43.2      46.9      49.3          55.8
Wiki     M-Reader         46.9      52.9      54.5          59.5
Web      Classifier [1]   24.0      28.4      30.2          34.7
Web      BiDAF [2]        40.7      47.1      49.5          55.8
Web      MEMEN [3]        44.3      48.3      53.3          57.6
Web      M-Reader         46.7      52.9      57.0          61.5

Table 2: The performance of Mnemonic Reader and other competing approaches on the TriviaQA test set: Joshi et al. (2017) [1], Seo et al. (2017) [2], and Pan et al. (2017) [3].

Model                                       EM     F1
Logistic Regression Baseline [1]            40.4   51.0
Match-LSTM with Ans-Ptr (Boundary)∗ [2]     67.9   77.0
FastQAExt [3]                               70.9   78.9
Document Reader [4]                         70.7   79.4
Ruminating Reader [5]                       70.6   79.5
M-Reader+RL                                 73.2   81.8
Dynamic Coattention Networks∗ [6]           71.6   80.4
Multi-Perspective Matching∗ [7]             73.8   81.3
jNet∗ [8]                                   73.0   81.5
BiDAF∗ [9]                                  73.7   81.5
SEDT∗ [10]                                  74.1   81.7
ReasoNet∗ [11]                              75.0   82.6
MEMEN∗ [12]                                 75.4   82.7
R-Net∗ [13]                                 76.9   84.0
M-Reader+RL∗                                77.7   84.9

Table 3: The performance of Mnemonic Reader and other competing approaches on the SQuAD test set: Rajpurkar et al. (2016) [1], Wang & Jiang (2017) [2], Weissenborn et al. (2017) [3], Chen et al. (2017) [4], Gong & Bowman (2017) [5], Xiong et al. (2017) [6], Wang et al. (2016) [7], Zhang et al. (2017) [8], Seo et al. (2017) [9], Liu et al. (2017) [10], Shen et al. (2016) [11], Pan et al. (2017) [12] and Wang et al. (2017) [13]. ∗ indicates ensemble models.

Two metrics, Exact Match (EM) and F1 score, are used to evaluate the performance of the model on both SQuAD and TriviaQA. Table 2 shows the performance comparison on the test set of TriviaQA. M-Reader outperforms competitive baselines such as BiDAF and MEMEN on both the
Wikipedia domain and the Web domain.

Table 3 shows the performance comparison of our models and other competing models on the test set of SQuAD. M-Reader+RL achieves an EM score of 73.2% and an F1 score of 81.8%. Since SQuAD is a very competitive machine comprehension benchmark, we also build an ensemble model which consists of 16 single models with the same architecture but initialized with different parameters. The answer with the highest score is chosen among all models for each query. Our ensemble model improves the EM score to 77.7%, and the F1 score to 84.9%.

Ablation Results

To evaluate the individual contribution of each component of our model, we run an ablation study on the SQuAD development set, which is shown in Table 4. M-Reader+RL obtains the highest performance on both metrics, demonstrating the usefulness of RL training. Ablating the additional features of the encoder results in a performance drop of nearly 1.2%. The interactive aligning is most critical to performance, as ablating it results in a drop of nearly 7% on both metrics. The self aligning accounts for about 2.3% of performance degradation on F1 score, which clearly shows the effectiveness of aligning context words against themselves. Finally, we replace the memory-based answer pointer with the standard pointer network. The result shows that the memory-based answer pointer outperforms the pointer network by nearly 1.5% on both metrics.

Model                            EM     F1
M-Reader+RL                      72.1   81.6
M-Reader                         71.8   81.2
- feature-rich encoder           70.5   80.1
- interactive aligning           65.2   74.3
- self aligning                  69.7   78.9
- memory-based answer pointer    70.1   79.8

Table 4: Ablation results on the SQuAD dev set.

Table 5 shows the performance under different numbers of hops. We set the default number of hops to 2 in both the iterative aligner and the memory-based answer pointer, and change the number separately to compare the performance. As we can see, both metrics increase sharply as the number of hops grows to the default value. The model with 2 hops achieves the best performance. A larger number of hops potentially results in overfitting on the training set, therefore harming the performance.

Hops   Aligner EM   Aligner F1   Answer pointer EM   Answer pointer F1
0      62.4         71.6         70.1                79.8
1      70.7         80.3         71.4                80.6
2      71.8         81.2         71.8                81.2
3      71.5         81.1         71.3                80.9
4      71.7         80.9         71.2                80.6

Table 5: Performance of M-Reader across different numbers of hops on the SQuAD dev set. Note that for the 0-hop case, the iterative aligner is discarded while the memory-based answer pointer is replaced by a Pointer Network (Vinyals, Fortunato, and Jaitly 2015).

Visualization

We provide a qualitative inspection of our model on the SQuAD development set. We visualize the attention matrices as well as the estimated probability distributions in Figure 2, which are all extracted from the last hop of the aligner and the answer pointer.

The left subfigure shows the coattention matrix, in which several phrases in the query have been successfully aligned with the corresponding context phrases. The middle subfigure presents the self-coattention matrix. We can see that this matrix is symmetric since the context is aligned with itself, and the diagonal weight is low because a word is not allowed to attend to itself. Besides, several semantically similar phrases have been successfully aligned with each other. One of the most substantial aligned pairs is “Denver Broncos” and “Carolina Panthers”, both of which are highly related candidate answers. The right subfigure shows the estimated distributions of answer spans. In the first hop, the model is not confident in choosing “Denver Broncos” since the score of “Carolina Panthers” is also high. But in the subsequent hop, the model adjusts the answer span by lowering the probability of “Carolina Panthers”. This demonstrates that the model is capable of gradually locating the correct span by incorporating knowledge of candidate answers with the query memory.

Related Work

Reading comprehension. The significant advance in reading comprehension has largely benefited from the availability of large-scale datasets. Large cloze-style datasets such as CNN/DailyMail (Hermann et al. 2015) and Children's Book Test (Hill et al. 2016) were released first, making it possible to solve MC tasks with deep neural architectures. SQuAD (Rajpurkar et al. 2016) and TriviaQA (Joshi et al. 2017) are more recently released datasets, which take a segment of text instead of a single entity as the answer, and contain substantial syntactic and lexical variability between the query and the context.

Attention mechanism. The coattention mechanism (Xiong, Zhong, and Socher 2017; Seo et al. 2017) has been widely used in end-to-end neural networks for machine comprehension (MC). Unlike the original attention mechanism (Bahdanau, Cho, and Bengio 2014) that uses a summary vector of the query to attend to the context (Hermann et al. 2015), the coattention is computed as an alignment matrix corresponding to all pairs of context words and query words, which can model complex interaction between the query and the context (Cui et al. 2016; Seo et al. 2017; Xiong, Zhong, and Socher 2017).

Self-attention is an attention mechanism aiming at aligning the sequence with itself, which has been successfully used in a variety of tasks including textual entailment
(Cheng, Dong, and Lapata 2016), neural machine translation (Vaswani et al. 2017) and sentence embedding (Lin et al. 2017). In MC, R-Net (Wang et al. 2017) utilizes self-attention to refine the representation of the context, while Reinforced Mnemonic Reader uses iterative self-coattention to model the long-term dependencies of the context. Besides, Jia and Liang (2017) show that our model outperforms previous models by about 6 F1 points under adversarial attacks, demonstrating the effectiveness of the self-coattention mechanism.

[Figure 2 image omitted. Axis labels recovered from the extraction: the query “Which NFL team represented the AFC at Super Bowl 50?” and the context “The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title.” The right panel has two columns, Hop-1 and Hop-2.]

Figure 2: A visualized example of the attention mechanism and the memory-based answer pointing. Left: The coattention matrix (each row is a context word, each column is a query word). Middle: The self-coattention matrix (both rows and columns are context words). Right: The estimated probability distributions of answer spans in two hops. Different colors represent different aligned text pairs. The ground-truth answer is “Denver Broncos”.

Reasoning mechanism. Inspired by the phenomenon that humans increase their understanding by rereading the context and the query, multi-hop reasoning models have been proposed for MC tasks (Shen et al. 2016; Xiong, Zhong, and Socher 2017). These models typically maintain a memory state which incorporates the current information of reasoning with the previous information in the memory, by following the framework of Memory Networks (Sukhbaatar et al. 2015). ReasoNet (Shen et al. 2016) utilizes reinforcement learning to dynamically determine when to stop reading. In contrast to their model, Reinforced Mnemonic Reader contains a memory-based answer pointer which is able to continuously refine the answer span.

Reinforcement learning. Reinforcement learning has been successfully used to solve a wide variety of problems in the field of NLP, including abstractive summarization (Paulus, Xiong, and Socher 2017), question generation (Yuan et al. 2017) and dialogue systems (Li et al. 2016). The basic idea is to treat evaluation metrics that are not differentiable (like BLEU or ROUGE) as the reward, and apply the REINFORCE algorithm (Williams 1992) to maximize the expected reward. Previous models for MC only utilize maximum-likelihood estimation to optimize the EM metric, and they may fail when the answer span is too long or fuzzy. In contrast, Reinforced Mnemonic Reader uses reinforcement learning to directly optimize the F1 score.

Conclusion

In this paper, we propose the Reinforced Mnemonic Reader, an enhanced attention reader with mnemonic information such as syntactic and lexical features, information of interactive alignment and self alignment, and evidence-augmented query memory. We further combine maximum-likelihood estimation with reinforcement learning for directly optimizing the evaluation metrics. Experiments on TriviaQA and SQuAD showed that this mnemonic information and the new training strategy lead to significant performance improvements. For future work, we will introduce more useful mnemonic information, such as knowledge or common sense, to further augment the attention reader.

Acknowledgments

We thank Pranav Rajpurkar for help in SQuAD submissions and Mandar Joshi for help in TriviaQA submissions.
References

Bahdanau, D.; Cho, K.; and Bengio, Y. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bengio, Y.; Simard, P.; and Frasconi, P. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks 5(2):157–166.
Chen, D.; Fisch, A.; Weston, J.; and Bordes, A. 2017. Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051.
Cheng, J.; Dong, L.; and Lapata, M. 2016. Long short-term memory-networks for machine reading. arXiv preprint arXiv:1601.06733.
Chung, J.; Gulcehre, C.; Cho, K.; and Bengio, Y. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of NIPS.
Cui, Y.; Chen, Z.; Wei, S.; Wang, S.; Liu, T.; and Hu, G. 2016. Attention-over-attention neural networks for reading comprehension. arXiv preprint arXiv:1607.04423.
Gong, Y., and Bowman, S. R. 2017. Ruminating reader: Reasoning with gated multi-hop attention. arXiv preprint arXiv:1704.07415.
Goodfellow, I.; Bengio, Y.; and Courville, A. 2016. Deep learning. MIT Press.
Hermann, K. M.; Kocisky, T.; Grefenstette, E.; Espeholt, L.; Kay, W.; Suleyman, M.; and Blunsom, P. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS.
Hill, F.; Bordes, A.; Chopra, S.; and Weston, J. 2016. The goldilocks principle: Reading childrens books with explicit memory representations. In Proceedings of ICLR.
Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural computation 9(8):1735–1780.
Jia, R., and Liang, P. 2017. Adversarial examples for evaluating reading comprehension systems. In Proceedings of EMNLP.
Joshi, M.; Choi, E.; Weld, D. S.; and Zettlemoyer, L. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of ACL.
Kingma, D. P., and Ba, L. J. 2014. Adam: A method for stochastic optimization. In CoRR, abs/1412.6980.
Li, J.; Monroe, W.; Ritter, A.; Galley, M.; Gao, J.; and Jurafsky, D. 2016. Deep reinforcement learning for dialogue generation. In Proceedings of EMNLP.
Lin, Z.; Feng, M.; dos Santos, C. N.; Yu, M.; Xiang, B.; Zhou, B.; and Bengio, Y. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
Liu, R.; Hu, J.; Wei, W.; and Nyberg, E. 2017. Structural embedding of syntactic trees for machine comprehension. arXiv preprint arXiv:1703.00572.
Norouzi, M.; Bengio, S.; Chen, Z.; and Schuurmans, D. 2016. Reward augmented maximum likelihood for neural structured prediction. In Proceedings of NIPS.
Pan, B.; Li, H.; Zhao, Z.; Cao, B.; Cai, D.; and He, X. 2017. Memen: Multi-layer embedding with memory networks for machine comprehension. arXiv preprint arXiv:1707.09098.
Paulus, R.; Xiong, C.; and Socher, R. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
Pennington, J.; Socher, R.; and Manning, C. D. 2014. Glove: Global vectors for word representation. In Proceedings of EMNLP.
Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of EMNLP.
Seo, M.; Kembhavi, A.; Farhadi, A.; and Hajishirzi, H. 2017. Bidirectional attention flow for machine comprehension. In Proceedings of ICLR.
Shen, Y.; Huang, P.-S.; Gao, J.; and Chen, W. 2016. Reasonet: Learning to stop reading in machine comprehension. arXiv preprint arXiv:1609.05284.
Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 1929–1958.
Sukhbaatar, S.; Szlam, A.; Weston, J.; and Fergus, R. 2015. End-to-end memory networks. In Proceedings of NIPS.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; and Jones, L. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
Vinyals, O.; Fortunato, M.; and Jaitly, N. 2015. Pointer networks. In Proceedings of NIPS.
Wang, S., and Jiang, J. 2017. Machine comprehension using match-lstm and answer pointer. In Proceedings of ICLR.
Wang, Z.; Mi, H.; Hamza, W.; and Florian, R. 2016. Multi-perspective context matching for machine comprehension. arXiv preprint arXiv:1612.04211.
Wang, W.; Yang, N.; Wei, F.; Chang, B.; and Zhou, M. 2017. Gated self-matching networks for reading comprehension and question answering. In Proceedings of ACL.
Weissenborn, D.; Wiese, G.; and Seiffe, L. 2017. Fastqa: A simple and efficient neural architecture for question answering. arXiv preprint arXiv:1703.04816.
Williams, R. J. 1992. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning 8(3-4):229–256.
Xiong, C.; Zhong, V.; and Socher, R. 2017. Dynamic coattention networks for question answering. In Proceedings of ICLR.
Yuan, X.; Wang, T.; Gulcehre, C.; Sordoni, A.; and Trischler, A. 2017. Machine comprehension by text-to-text neural question generation. arXiv preprint arXiv:1705.02012.
Zhang, J.; Zhu, X.; Chen, Q.; Dai, L.; Wei, S.; and Jiang, H. 2017. Exploring question understanding and adaptation in neural-network-based question answering. arXiv preprint arXiv:1703.04617.