10 Encdec Attention Notes
Herman Kamper
Machine translation
Greedy decoding
Beam search
Basic attention
Attention variants
Common misconceptions
Machine translation
In this note, we will use the running example of neural machine translation (NMT) as a way to look
at encoder-decoder models (also called sequence-to-sequence models)
and attention.
An encoder-decoder RNN for MT
NMT is often framed as a sequence-to-sequence learning problem.
The particular architecture often used within this learning framework
is called an encoder-decoder architecture:
[Figure: an encoder-decoder RNN translating the source sentence "hy het my gegooi </s>" (inputs x1 to x5) into "he threw me </s>": the decoder receives "<s> he threw me" (y0 to y3) and an argmax is taken over its output distribution at each step.]
The (N)MT problem
We want to map input sentence X = x1:N in the source language to
the output sentence Y = y1:T in the target language.
We model the output probability one word at a time:
Pθ(Y |X) = ∏_{t=1}^{T} Pθ(yt | y1:t−1, X)
Loss function
In NMT, we calculate Pθ (Y |X) by outputting the conditional proba-
bility at every time step t:
J(θ) = −(1/T) Σ_{t=1}^{T} log Pθ(yt | y1:t−1, X)
     = −(1/T) Σ_{t=1}^{T} log ŷt,yt
     = (1/T) Σ_{t=1}^{T} Jt(θ)
where ŷt,yt is the probability the model assigns to the correct word yt at step t, and Jt(θ) = −log ŷt,yt is the loss at that single decoder step.
For the example below with T = 4:
J(θ) = (1/T) Σ_{t=1}^{T} Jt(θ) = (1/4) (J1(θ) + J2(θ) + J3(θ) + J4(θ))
[Figure: the same encoder-decoder RNN, with the per-step losses J1(θ) to J4(θ) marked at the four decoder outputs.]
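As a concrete illustration, here is a minimal NumPy sketch of this per-step cross-entropy loss; the shapes and values are made up for the example:

```python
import numpy as np

def sequence_loss(yhat, y):
    """yhat: (T, |V|) softmax outputs from the decoder; y: (T,) target word indices."""
    T = y.shape[0]
    # J_t(theta) = -log of the probability assigned to the correct word y_t at step t
    step_losses = -np.log(yhat[np.arange(T), y])
    return step_losses.mean()  # J(theta) = (1/T) * sum_t J_t(theta)

# Toy example: T = 3 decoder steps over a vocabulary of size |V| = 5
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
yhat = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # row-wise softmax
y = np.array([2, 0, 4])
print(sequence_loss(yhat, y))
```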
Encoder-decoder modelling choices
[Figure: the same encoder-decoder RNN, repeated as a reference for the modelling choices discussed below.]
Conditioning the decoder on the encoder output
• Above we used the last hidden vector from the encoder to
initialise the decoder RNN.
[Figures: the encoder-decoder RNN from above, and a second encoder-decoder example with the source sentence "Yao Ming reaches the finals </s>", illustrating different ways of conditioning the decoder on the encoder output.]
Encoder-decoder: More general
Encoder
Using f to denote the transformation in the encoder’s RNN, we can
write the recurrence as:
hn = f (xn , hn−1 )
In the general case, the encoder transforms all its hidden states into a
single fixed-dimensional representation:
c = q(h1:N )
Decoder
Decoder hidden state st depends on the previous model output, the
previous decoder hidden state, and the encoder output:
st = g(yt−1 , c, st−1 )
The decoder state is then used to compute the output distribution Pθ(yt | y1:t−1, c), e.g.
ŷt = softmax(Who st + bo) ∈ [0, 1]^|V|
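To make these equations concrete, here is a minimal NumPy sketch of the recurrences above; the weight names (W_xh, W_hh, ...) and the dimensions are illustrative assumptions, not notation from the note:

```python
import numpy as np

rng = np.random.default_rng(0)
D, E, V = 8, 4, 10          # hidden size, embedding size, vocabulary size (all made up)
W_xh = rng.normal(scale=0.1, size=(D, E))
W_hh = rng.normal(scale=0.1, size=(D, D))
W_ys = rng.normal(scale=0.1, size=(D, E))
W_cs = rng.normal(scale=0.1, size=(D, D))
W_ss = rng.normal(scale=0.1, size=(D, D))
W_ho = rng.normal(scale=0.1, size=(V, D))
b_o = np.zeros(V)

def f(x_n, h_prev):                      # encoder recurrence h_n = f(x_n, h_{n-1})
    return np.tanh(W_xh @ x_n + W_hh @ h_prev)

def q(h_all):                            # c = q(h_{1:N}); here simply the last hidden state
    return h_all[-1]

def g(y_prev, c, s_prev):                # decoder recurrence s_t = g(y_{t-1}, c, s_{t-1})
    return np.tanh(W_ys @ y_prev + W_cs @ c + W_ss @ s_prev)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

# Encode a source sentence of N = 5 (random) embedded words
X = rng.normal(size=(5, E))
h = np.zeros(D)
h_all = []
for x_n in X:
    h = f(x_n, h)
    h_all.append(h)
c = q(np.stack(h_all))

# Decode T = 3 steps, feeding in (random) embeddings of the previous outputs
s = c                                    # initialise the decoder from the encoder output
for y_prev in rng.normal(size=(3, E)):
    s = g(y_prev, c, s)
    yhat_t = softmax(W_ho @ s + b_o)     # P_theta(y_t | y_{1:t-1}, c)
print(yhat_t.shape, yhat_t.sum())        # (10,) 1.0
```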
Evaluating MT: BLEU
We want to compare a predicted translation Ŷ to a reference Y .
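BLEU combines clipped (modified) n-gram precisions with a brevity penalty that punishes candidates shorter than the reference. As a rough illustration only (a sentence-level sketch, not a replacement for an established implementation such as sacrebleu), this could look as follows:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """candidate, reference: lists of tokens for a single sentence pair."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # Brevity penalty: punish candidates shorter than the reference
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("he hit me with a pie".split(), "he hit me with a pie".split()))  # 1.0
print(bleu("he hit a pie".split(), "he hit me with a pie".split()))          # much lower
```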
Example: BLEU
Greedy decoding
At test time, we can translate some input X by taking the arg max
at every step of the decoder:
Input:
Decoding:
• he ...
• he hit ...
• he hit a ...
At the third decoding step, a is the most probable next word, given the
previously generated outputs. But it might have been better to take
a less-probable word at this third step, maybe getting higher overall
probabilities at some later decoding step. But now we’ve selected a
and there is no way to recover – we are stuck with it.
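A sketch of this procedure, assuming a hypothetical `step` function that returns the model’s next-word distribution and updated decoder state:

```python
import numpy as np

def greedy_decode(step, c, s0, bos_id, eos_id, max_len=20):
    """Pick the argmax word at every decoder step until </s> or max_len is reached."""
    y, s, output = bos_id, s0, []
    for _ in range(max_len):
        probs, s = step(y, c, s)       # P_theta(y_t | y_{1:t-1}, X) and new decoder state
        y = int(np.argmax(probs))      # greedy: take the single most probable word
        if y == eos_id:
            break
        output.append(y)
    return output

# Toy demo with a fake model: it prefers word 3 for two steps, then </s> (id 1)
def fake_step(y, c, s, V=5):
    probs = np.full(V, 0.1)
    probs[3 if s < 2 else 1] = 0.6
    return probs, s + 1

print(greedy_decode(fake_step, c=None, s0=0, bos_id=0, eos_id=1))  # [3, 3]
```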
Beam search
Beam search: On each decoding step, keep track of the K most
probable partial translations.
Example: Beam search
[Figure: the first two beam search steps with K = 2, with candidate words such as "hit", "it", "got" and "was", each annotated with log P(y1|X), log P(y2|y1, X) and the cumulative log P(y1:2|X).]
The score at each node is the log probability of that partial translation
according to the model θ:
s(y1:t) = Σ_{i=1}^{t} log Pθ(yi | y1:i−1, X)
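A minimal sketch of the procedure, assuming the same kind of hypothetical `step` function as before, but now returning log probabilities over the vocabulary as a NumPy array:

```python
import numpy as np

def beam_search(step, c, s0, bos_id, eos_id, K=2, max_len=20):
    """Keep the K most probable partial translations (hypotheses) at every step."""
    beams = [(0.0, [bos_id], s0)]   # (summed log probability, output ids, decoder state)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, ids, s in beams:
            log_probs, s_new = step(ids[-1], c, s)   # log P_theta(y_t | y_{1:t-1}, X)
            for y in np.argsort(log_probs)[-K:]:     # only the K best extensions can survive
                candidates.append((score + log_probs[y], ids + [int(y)], s_new))
        candidates.sort(key=lambda h: h[0], reverse=True)
        beams = []
        for score, ids, s in candidates[:K]:         # keep the K most probable hypotheses
            if ids[-1] == eos_id:
                finished.append((score, ids))
            else:
                beams.append((score, ids, s))
        if not beams:
            break
    finished.extend((score, ids) for score, ids, _ in beams)
    # Return the highest-scoring completed hypothesis (length normalisation comes next)
    return max(finished, key=lambda h: h[0])[1]
```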
If we only show the top two nodes at every decoding step:
[Figure: the beam search tree with K = 2, where only the top two nodes are kept at every step; hypotheses such as "he hit me with a pie </s>" are grown, with every node labelled with its cumulative log probability.]
What would have happened with greedy search?
[Figure: the same beam search tree, repeated to trace the path that greedy decoding would have followed.]
Just looking at the first three decoding steps, we see that we would
have continued with he hit a . . ., as in the greedy decoding example above.
Penalising shorter hypotheses
Completed hypotheses can have different lengths, as in the example
above if you consider the second most likely path. This can cause
issues, since naive beam search will tend to prefer a shorter Ŷ : a
shorter hypothesis corresponds to adding together fewer (negative)
log probability terms.
One fix is to normalise the score of a completed hypothesis by its length:
s(y1:t) = (1/t) Σ_{i=1}^{t} log Pθ(yi | y1:i−1, X)
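A tiny numerical illustration with made-up log probabilities, showing why the normalisation matters:

```python
import numpy as np

# Made-up per-word log probabilities for a longer and a shorter completed hypothesis
long_hyp = [-0.7, -1.0, -0.8, -1.1]   # sum = -3.6, mean = -0.90
short_hyp = [-0.7, -1.2]              # sum = -1.9, mean = -0.95

print(sum(long_hyp), sum(short_hyp))          # unnormalised: the short hypothesis wins
print(np.mean(long_hyp), np.mean(short_hyp))  # normalised by length: the long one wins
```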
NMT with encoder-decoder summary
We have covered:
• Training
• Decoding: greedy decoding and beam search
• Evaluation: BLEU
Is there anything left? Does this just always work? Any issues you
can think of with the model? (Not just for NMT, but maybe for other
problems as well?)
[Figure: the encoder-decoder RNN from before, shown again as a prompt for thinking about its limitations.]
Basic attention
First skip this and look at the MT example on the next few pages.
Then come back and map what you saw there to the equations here.
• Attention weights: at decoder step t, compare the decoder state st to every encoder hidden state hn and pass the resulting scores through a softmax to get weights αt,n that sum to one.
• Context vector:
ct = Σ_{n=1}^{N} αt,n hn ∈ RD
• Concatenate: [ct ; st] ∈ R2D
and continue as in the non-attention decoding case, e.g.
ŷt = softmax(Who [ct ; st] + bo)
[Figures: the step-by-step MT attention example, translating "hy het my gegooi </s>" (x1 to x5) into "he threw me </s>": at each decoder step the encoder hidden states are attended to, and a softmax gives the output ŷt.]
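A minimal NumPy sketch of basic attention at a single decoder step, assuming (purely for illustration) a dot product between st and each hn as the score; the scoring options are discussed in the attention-variants section below:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

def basic_attention(s_t, H):
    """s_t: (D,) decoder state; H: (N, D) encoder hidden states used as keys and values."""
    scores = H @ s_t                   # a score for each encoder state h_n
    alpha = softmax(scores)            # attention weights alpha_{t,n}, sum to 1
    c_t = alpha @ H                    # context vector: sum_n alpha_{t,n} * h_n
    return np.concatenate([c_t, s_t])  # [c_t; s_t] in R^{2D}, fed to the output layer

rng = np.random.default_rng(0)
print(basic_attention(rng.normal(size=4), rng.normal(size=(5, 4))).shape)  # (8,)
```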
Attention: More general
Figures in this section are from (Zhang et al., 2021).
Attention with queries, keys and values
In general we have a query q, keys k1:N and values v1:N .
Basic attention
In the basic version of attention above, the values and the keys are
the same: vn = kn = hn . I.e. the keys and values are the encoder
hidden states (Bahdanau et al., 2015).
All variants of attention have the following components
• Output of attention: Context vector
c = Σ_{n=1}^{N} α(q, kn) vn ∈ RD
• Attention weight:
α(q, kn) = exp(a(q, kn)) / Σ_{n′=1}^{N} exp(a(q, kn′))
• Attention score:
a(q, kn) ∈ R
[Figure: the general attention block: each key kn is scored against the query q to give a(q, kn); a softmax over the scores gives the weights α, which weight the values vn; the weighted sum is the attention output c.]
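A minimal NumPy sketch of these components, with the scoring function passed in as an argument (the specific scoring choices are listed in the next section):

```python
import numpy as np

def attention(q, K, V, score):
    """q: (Dq,) query; K: (N, Dk) keys; V: (N, Dv) values; score: a(q, k) -> scalar."""
    a = np.array([score(q, k) for k in K])   # attention scores, one per key
    alpha = np.exp(a - a.max())
    alpha = alpha / alpha.sum()              # attention weights (softmax over the scores)
    c = alpha @ V                            # context vector: sum_n alpha_n * v_n
    return c, alpha

# With dot-product scoring and keys = values, this reduces to the basic attention above
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))                  # encoder hidden states as keys and values
q = rng.normal(size=4)                       # e.g. the decoder state s_t
c, alpha = attention(q, H, H, score=lambda q, k: q @ k)
print(c.shape, alpha.sum())                  # (4,) 1.0
```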
Attention variants
Different scoring options
Dimensionalities: q ∈ RD1 and k ∈ RD2
• Dot-product attention:
a(q, k) = q⊤ k
Requires dimensionalities D1 = D2
• Scaled dot-product attention:
a(q, k) = q⊤ k / √D1
Requires dimensionalities D1 = D2
• Multiplicative attention:
a(q, k) = q⊤ Wk
• Additive attention:
a(q, k) = MLP(q, k)
= w⊤ tanh (Wq q + Wk k)
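Sketches of these scoring options (the parameter shapes below are assumptions for illustration); any of them can be passed as `score` to the `attention` function sketched earlier:

```python
import numpy as np

def dot_score(q, k):                      # requires D1 = D2
    return q @ k

def scaled_dot_score(q, k):               # requires D1 = D2
    return q @ k / np.sqrt(q.shape[0])

def multiplicative_score(q, k, W):        # W: (D1, D2), trainable in a real model
    return q @ W @ k

def additive_score(q, k, W_q, W_k, w):    # W_q: (D, D1), W_k: (D, D2), w: (D,)
    return w @ np.tanh(W_q @ q + W_k @ k)

rng = np.random.default_rng(0)
q, k = rng.normal(size=3), rng.normal(size=4)
W = rng.normal(size=(3, 4))
print(multiplicative_score(q, k, W))      # a single scalar score a(q, k)
```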
Common misconceptions
Attention weights are not like model weights
Students sometimes think of attention weights as trainable model
parameters. But the attention scores a and weights α are not learned
and then fixed during training. They are based on a comparison
between vectors, and these vectors will be different for different
inputs. The vectors that are compared will themselves depend on the
parameters (for example, in multiplicative attention a(q, k) = q⊤Wk
the matrix W is trainable), but the operation that calculates a and α
from them is deterministic.
Videos covered in this note
• A basic encoder-decoder model for machine translation (13 min)
• Training and loss for encoder-decoder models (10 min)
• Encoder-decoder models in general (18 min)
• Greedy decoding (5 min)
• Beam search (18 min)
• Basic attention (22 min)
• Attention - More general (13 min)
• Evaluating machine translation with BLEU (23 min)
Acknowledgements
This note relied very very heavily on content from:
References
D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
jointly learning to align and translate,” in ICLR, 2015.