
Encoder-decoder models and attention

Herman Kamper

2024-02, CC BY-SA 4.0

Machine translation

An encoder-decoder RNN for MT

Encoder-decoder modelling choices

Evaluating MT: BLEU

Greedy decoding

Beam search

Basic attention

Attention: More general

Attention variants

Common misconceptions

Machine translation

Old-school machine translation (MT) was done with big, complex models made up of several subsystems, in a paradigm called statistical machine translation (SMT).

The most recent paradigm of MT models, starting in around 2014, is referred to as neural machine translation (NMT).

In this note, we will use the running example of NMT as a way to look at encoder-decoder models (also called sequence-to-sequence models) and attention.

An encoder-decoder RNN for MT

NMT is often framed as a sequence-to-sequence learning problem. The particular architecture often used within this learning framework is called an encoder-decoder architecture:

[Figure: an encoder-decoder RNN. The encoder reads the source sentence "hy het my gegooi </s>" (x1 to x5); the decoder is fed "<s> he threw me" (y0 to y3) and predicts ŷ1 to ŷ4 = "he threw me </s>".]

Dashed line: What happens at test time. The arg max of the decoder output is used as the input to the next step.

The (N)MT problem

We want to map input sentence X = x1:N in the source language to the output sentence Y = y1:T in the target language.

Convention: y0 = <s> and xN = yT = </s>

The goal is to find

    arg max_Y Pθ(Y|X)

We can decompose the conditional probability using the product rule:

    Pθ(Y|X) = Pθ(y1:T|X)
            = Pθ(yT|y1:T−1, X) Pθ(yT−1|y1:T−2, X) · · · Pθ(y1|X)
            = ∏_{t=1}^{T} Pθ(yt|y1:t−1, X)

Loss function

In NMT, we calculate Pθ(Y|X) by outputting the conditional probability at every time step t:

    ŷt,k = Pθ(yt = k|y1:t−1, X)

We train the encoder-decoder model by optimizing the per-word negative log likelihood:

    J(θ) = −(1/T) ∑_{t=1}^{T} log Pθ(yt|y1:t−1, X)
         = −(1/T) ∑_{t=1}^{T} log ŷt,yt
         = (1/T) ∑_{t=1}^{T} Jt(θ)

with the per-step loss Jt(θ) = − log ŷt,yt.

[Figure: for the running example, J(θ) = (1/T)[J1(θ) + J2(θ) + J3(θ) + J4(θ)], with per-step losses − log ŷ1,he, − log ŷ2,threw, − log ŷ3,me and − log ŷ4,</s>, computed with the decoder conditioned on the ground truth words <s> he threw me.]

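Here is a minimal sketch of how this per-word negative log likelihood is typically computed with teacher forcing. This is not the note's own code; the tensor names and toy sizes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

# Toy sizes (assumptions for this sketch).
vocab_size, T, hidden_dim = 1000, 4, 32

# Suppose the decoder has already produced one hidden state per step,
# conditioned on the ground truth previous words (teacher forcing).
decoder_states = torch.randn(T, hidden_dim)      # s_1 ... s_T
W_ho = torch.randn(vocab_size, hidden_dim)       # output projection
b_o = torch.zeros(vocab_size)

logits = decoder_states @ W_ho.T + b_o           # (T, |V|)
targets = torch.tensor([11, 42, 7, 2])           # y_1 ... y_T as word ids

# cross_entropy applies log-softmax, picks out -log(yhat_{t, y_t}) for each
# step, and averages over the T steps: exactly J(theta) above.
loss = F.cross_entropy(logits, targets)
print(loss.item())
```
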
Encoder-decoder modelling choices

[Figure: the same encoder-decoder RNN as before, repeated for reference.]

Output conditioning at training and test time

• Training time:

  – We condition decoder step t on the ground truth word yt−1 from the previous time step. This is called teacher forcing.

  – There are also training variants where you would sometimes use ŷt−1 during training to better match what happens at test time (a sketch follows below).

• Test time: We don't have the ground truth yt−1, so we condition decoder step t on the predicted word ŷt−1 from the previous time step. We can take the arg max as in the figures so far, or do something more fancy (later).

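Here is a minimal sketch of that training variant (often called scheduled sampling): with some probability the decoder is conditioned on its own previous prediction instead of the ground truth word. The module choices, toy sizes and the 0.25 probability are assumptions for illustration, not the note's code.

```python
import random
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim = 1000, 16, 32   # toy sizes (assumptions)
embed = nn.Embedding(vocab_size, emb_dim)
cell = nn.GRUCell(emb_dim, hidden_dim)
out_proj = nn.Linear(hidden_dim, vocab_size)

targets = torch.tensor([11, 42, 7, 2])           # y_1 ... y_T (made-up ids)
s = torch.zeros(1, hidden_dim)                   # decoder state, e.g. from the encoder
prev = torch.tensor([0])                         # y_0 = <s>
use_model_prob = 0.25                            # how often to feed back the prediction

losses = []
for t in range(len(targets)):
    s = cell(embed(prev), s)
    logits = out_proj(s)                         # (1, |V|)
    losses.append(nn.functional.cross_entropy(logits, targets[t:t+1]))
    if random.random() < use_model_prob:
        prev = logits.argmax(dim=-1)             # condition on the model's own prediction
    else:
        prev = targets[t:t+1]                    # teacher forcing: ground truth word
loss = torch.stack(losses).mean()
```
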
Conditioning the decoder on the encoder output

• Above we used the last hidden vector from the encoder to initialise the decoder RNN.

• We could feed the final hidden representation from the encoder into some fully connected layers before conditioning the decoder.

• We could condition every decoder step on the encoder output:

[Figure: the encoder-decoder RNN, with the encoder's final representation fed into every decoder step rather than only the first.]

Output units: Words, characters, subwords

In the above we used words as output units. But we could also use characters (maybe a good choice if the target language uses a non-Latin script). Or we could use subword units like BPE. This could address sparsity problems where some words are not in the vocabulary.

[Figure: example of subword output units ŷ1 to ŷ8 for the sentence "Yao Ming reaches the finals </s>".]

Encoder-decoder: More general

Encoder

Using f to denote the transformation in the encoder's RNN, we can write the recurrence as:

    hn = f(xn, hn−1)

In the general case, the encoder transforms all its hidden states into a single fixed-dimensional representation:

    c = q(h1:N)

In the examples above: c = hN

Decoder

The decoder hidden state st depends on the previous model output, the previous decoder hidden state, and the encoder output:

    st = g(yt−1, c, st−1)

Normally the decoder hidden state st is passed on to some output operation in order to get to

    Pθ(yt|y1:t−1, c)

e.g.

    ŷt = softmax(Who st + bo) ∈ [0, 1]^{|V|}

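To make the general recipe concrete, here is a minimal PyTorch sketch under the choices above (c = hN, with every decoder step additionally conditioned on c). The class name, toy sizes and ids are assumptions, not the note's code.

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Minimal sketch of the general recipe above; sizes are toy assumptions."""

    def __init__(self, src_vocab=1000, tgt_vocab=1000, emb=16, hid=32):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb)
        self.f = nn.GRUCell(emb, hid)        # encoder recurrence h_n = f(x_n, h_{n-1})
        self.g = nn.GRUCell(emb + hid, hid)  # decoder s_t = g(y_{t-1}, c, s_{t-1})
        self.out = nn.Linear(hid, tgt_vocab) # W_ho s_t + b_o

    def forward(self, x_ids, y_ids):
        h = torch.zeros(1, self.f.hidden_size)
        for n in range(x_ids.shape[0]):               # encoder
            h = self.f(self.src_embed(x_ids[n:n+1]), h)
        c = h                                         # c = q(h_{1:N}) = h_N here
        s = c                                         # initialise decoder with c
        logits = []
        for t in range(y_ids.shape[0]):               # decoder (teacher forced)
            inp = torch.cat([self.tgt_embed(y_ids[t:t+1]), c], dim=-1)
            s = self.g(inp, s)
            logits.append(self.out(s))
        return torch.stack(logits)                    # softmax over these gives yhat_t

model = TinyEncoderDecoder()
x = torch.tensor([5, 6, 7, 8, 2])   # hy het my gegooi </s> (made-up ids)
y = torch.tensor([1, 10, 11, 12])   # <s> he threw me (made-up ids)
print(model(x, y).shape)            # (4, 1, tgt_vocab)
```
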
Evaluating MT: BLEU

We want to compare a predicted translation Ŷ to a reference Y.

Let pn denote the precision of n-grams of order n:

• Out of all the n-grams in Ŷ, how many of them occur in Y?

• Counts are capped by the number of occurrences in the reference. E.g. if Y = b b and Ŷ = b b b, then p1 = 2/3, since Ŷ can only get credit for unigram b up to its count in Y (which is two).

Example from Zhang et al. (2021):

    Y = the cat sat on the mat
    Ŷ = the cat cat sat on

Then p1 = 4/5, p2 = 3/4, p3 = 1/3 and p4 = 0/2.

Let |Y| denote the sequence length, e.g. if Y = y1:T then |Y| = T.

The BLEU score is defined as (Papineni et al., 2002):

    BLEU = exp{min(0, 1 − |Y|/|Ŷ|)} ∏_{n=1}^{N} pn^{1/2^n}

where N is the longest n-gram used for matching.

• If the reference and prediction match, the BLEU is 1.

• Longer n-grams are more difficult, so assign these a larger weight: for a fixed pn, the term pn^{1/2^n} increases for larger n.

• Very short sequences tend to get high pn, which is unwanted: penalise these with the exponential term. E.g. when N = 2 with Y = a b c d e f and Ŷ = a b, although p1 = p2 = 1, the penalty factor exp{1 − 6/2} ≈ 0.14 lowers the BLEU.

In practice there is often more than one reference.

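A minimal sketch of this computation in plain Python (single reference, no smoothing; the function name and signature are just for this example):

```python
import math
from collections import Counter

def bleu(pred, ref, max_n=4):
    """Minimal single-reference BLEU, following the formula above."""
    pred, ref = pred.split(), ref.split()
    score = math.exp(min(0.0, 1.0 - len(ref) / len(pred)))  # brevity penalty
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(tuple(pred[i:i+n]) for i in range(len(pred) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        # Clipped counts: credit capped by occurrences in the reference.
        matched = sum(min(c, ref_ngrams[g]) for g, c in pred_ngrams.items())
        p_n = matched / max(sum(pred_ngrams.values()), 1)
        score *= p_n ** (1.0 / 2 ** n)                       # weight 1/2^n
    return score

# The example above: p1 = 4/5 and p2 = 3/4.
print(bleu("the cat cat sat on", "the cat sat on the mat", max_n=2))
```
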
Example: BLEU

[Figure: worked BLEU example.]

Greedy decoding

At test time, we can translate some input X by taking the arg max at every step of the decoder:

    ŷt = arg max_{w∈V} Pθ(yt = w|y1:t−1, X)

But this might be short-sighted!

Input:

    hy het my met 'n tert geslaan
    (he hit me with a pie)

Decoding:

• he ...

• he hit ...

• he hit a ...

At the third decoding step, a is the most probable next word, given the previously generated outputs. But it might have been better to take a less-probable word at this third step, maybe getting higher overall probabilities at some later decoding step. But now we've selected a and there is no way to recover – we are stuck with it.

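A minimal sketch of greedy decoding; step_logits is a hypothetical callable standing in for the trained decoder (given the prefix generated so far, and implicitly the encoded input, it returns next-word logits).

```python
import torch

def greedy_decode(step_logits, bos_id=1, eos_id=2, max_len=20):
    """Minimal greedy decoder: take the arg max at every step."""
    prefix = [bos_id]
    for _ in range(max_len):
        next_id = int(torch.argmax(step_logits(prefix)))
        prefix.append(next_id)
        if next_id == eos_id:
            break
    return prefix[1:]  # drop <s>

# Toy stand-in for a trained model: random logits over a 1000-word vocabulary.
torch.manual_seed(0)
fake = lambda prefix: torch.randn(1000)
print(greedy_decode(fake))
```
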
Beam search

Beam search: On each decoding step, keep track of the K most probable partial translations.

• K is called the beam size (typically 5 to 10 for MT)

• A partial translation is called a hypothesis

Beam search is not guaranteed to find the overall optimal solution:

• But it is way more efficient than brute-forcing all paths (exhaustive search)

• It is likely to find a better solution than the greedy approach (if there is one)

With K = 1, beam search is equivalent to greedy decoding.

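A minimal sketch of beam search over such a step-wise model; step_log_probs is again a hypothetical stand-in for the trained decoder, and hypotheses are scored by their summed log probabilities (length normalisation, discussed below, is omitted here).

```python
import torch

def beam_search(step_log_probs, K=2, bos_id=1, eos_id=2, max_len=10):
    """Minimal beam search: keep the K highest-scoring partial translations.

    step_log_probs(prefix) is assumed to return a (|V|,) tensor of
    log P(y_t = w | prefix, X) for the next word.
    """
    beams = [([bos_id], 0.0)]                       # (hypothesis, summed log prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = step_log_probs(prefix)
            top = torch.topk(log_p, K)              # expanding K per beam is enough
            for lp, w in zip(top.values, top.indices):
                candidates.append((prefix + [int(w)], score + float(lp)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:K]:
            (completed if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    return completed or beams

torch.manual_seed(0)
fake = lambda prefix: torch.log_softmax(torch.randn(50), dim=0)  # toy model
print(beam_search(fake, K=3))
```
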
Example: Beam search

[Figure: the first two beam search steps with K = 2. From <s>, the hypotheses "he" (−0.7) and "it" (−0.6) are kept. Expanding them gives e.g. "he hit" (−1.7), "he struck" (−2.9), "it was" (−1.6) and "it got" (−1.8); the two best partial translations are kept. Columns: log P(y1|X), log P(y2|y1, X), log P(y1:2|X).]

The score at each node is the log probability of that partial translation according to the model θ:

    s(y1:t) = log Pθ(y1:t|X)
            = ∑_{i=1}^{t} log Pθ(yi|y1:i−1, X)

If we only show the top two nodes at every decoding step:

[Figure: beam search tree for the input "hy het my met 'n tert geslaan", showing only the top two hypotheses at each decoding step. Hypotheses such as "he hit me with a pie" and "he hit me with a tart" are explored, and completed hypotheses end in </s>.]

What would have happened with greedy search?

[Figure: the same beam search tree, highlighting the path greedy decoding would have taken.]

Just looking at the first three decoding steps, we see that greedy search would have continued with

    it was hit ... (−2.9)

but would have missed

    he hit me ... (−2.5)

because this better option only appeared later.

Penalising shorter hypotheses

Completed hypotheses can have different lengths, as in the example above if you consider the second most likely path. This can cause issues, since naive beam search will tend to prefer shorter sequences for Ŷ: a shorter hypothesis corresponds to adding together fewer (negative) log probability terms.

A length normalisation approach is therefore normally incorporated. One simple approach is to normalise by the number of words in the hypothesis:

    s(y1:t) = (1/t) ∑_{i=1}^{t} log Pθ(yi|y1:i−1, X)

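As a small illustration (reusing the hypothetical (prefix, score) pairs from the beam search sketch above), completed hypotheses can be rescored by their average log probability per word before picking the final translation:

```python
def best_hypothesis(completed):
    """Pick the completed (prefix, summed_log_prob) pair with the best
    length-normalised score, i.e. average log probability per word."""
    return max(completed, key=lambda c: c[1] / (len(c[0]) - 1))  # exclude <s>

completed = [([1, 7, 9, 2], -3.0),            # short hypothesis
             ([1, 7, 9, 13, 4, 2], -3.6)]     # longer, but better per word
print(best_hypothesis(completed))
```
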
NMT with encoder-decoder summary

We have covered:

• Training

• Test-time decoding: Beam search

• Evaluation: BLEU

Is there anything left? Does this just always work? Any issues you can think of with the model? (Not just for NMT, but maybe for other problems as well?)

[Figure: the encoder-decoder RNN from before, for reference.]

Basic attention

First skip this and look at the MT example on the next few pages. Then come back and map what you saw there to the equations here.

We are at time step t of the decoder.

• Encoder hidden states: h1, h2, . . . , hN ∈ R^D

• Decoder hidden state at time step t: st ∈ R^D

• Attention score for encoder hidden state hn:

      a(st, hn) = st⊤ hn ∈ R

• Attention weight for encoder hidden state hn:

      α(st, hn) = softmax_n(a(st, hn)) = exp{a(st, hn)} / ∑_{j=1}^{N} exp{a(st, hj)} ∈ [0, 1]

• Context vector at decoder time step t:

      ct = ∑_{n=1}^{N} α(st, hn) hn ∈ R^D

• Concatenate [ct; st] ∈ R^{2D} and continue as in the non-attention decoding case, e.g.

      ŷt = softmax(Who [ct; st] + bo)

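A minimal sketch of one decoder step with this basic dot-product attention; the tensors are random stand-ins for real encoder and decoder states, and the sizes are assumptions.

```python
import torch

D, N, vocab_size = 32, 5, 1000                 # toy sizes (assumptions)
h = torch.randn(N, D)                          # encoder hidden states h_1 ... h_N
s_t = torch.randn(D)                           # decoder hidden state at step t
W_ho = torch.randn(vocab_size, 2 * D)
b_o = torch.zeros(vocab_size)

a = h @ s_t                                    # attention scores a(s_t, h_n), shape (N,)
alpha = torch.softmax(a, dim=0)                # attention weights, sum to 1
c_t = alpha @ h                                # context vector, shape (D,)
y_hat = torch.softmax(W_ho @ torch.cat([c_t, s_t]) + b_o, dim=0)
print(alpha, y_hat.shape)
```
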
[Figures (one per decoding step): attention-based translation of "hy het my gegooi </s>". At each decoder step, the decoder state is compared to every encoder hidden state, the scores are passed through a softmax, and the weighted sum of encoder states is combined with the decoder state to predict "he", then "threw", then "me".]

Attention: More general

[Figures from Zhang et al. (2021): nonvolitional cues (keys) versus volitional cues (queries) as an analogy for attention.]

Attention with queries, keys and values

In general we have:

• Query q ∈ R^{D1}: Volitional cues

• Keys k1, k2, . . . , kN ∈ R^{D2}: Nonvolitional cues

• Values v1, v2, . . . , vN ∈ R^D: What is attended to

The values vn ∈ R^D and the output context vector c ∈ R^D have the same dimensionality. But the dimensionalities of the query q ∈ R^{D1} and keys kn ∈ R^{D2} need not match, as long as there is a way to get the attention score a.

Basic attention

In the basic version of attention above, the values and the keys are the same: vn = kn = hn. I.e. the keys and values are the encoder hidden states (Bahdanau et al., 2014).

All variants of attention have the following components

• Output of attention, the context vector:

      c = ∑_{n=1}^{N} α(q, kn) vn ∈ R^D

• Attention weight:

      α(q, kn) = softmax_n(a(q, kn)) = exp{a(q, kn)} / ∑_{j=1}^{N} exp{a(q, kj)} ∈ [0, 1]

• Attention score:

      a(q, kn) ∈ R

[Figure: the general attention block. The query q is scored against each key k1 . . . kN, the scores are passed through a softmax to give the weights α, and the weighted sum of the values v1 . . . vN is the attention output c.]

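These components translate almost directly into code. A minimal sketch with a pluggable scoring function (the dot-product score used here is just one choice; see the variants below):

```python
import torch

def attention(q, keys, values, score_fn):
    """General attention: weights from score_fn(q, keys), output = weighted values."""
    a = score_fn(q, keys)                      # attention scores, shape (N,)
    alpha = torch.softmax(a, dim=0)            # attention weights, shape (N,)
    return alpha @ values, alpha               # context vector (D,), weights

dot_score = lambda q, keys: keys @ q           # requires D1 = D2

q = torch.randn(32)
keys = torch.randn(5, 32)
values = torch.randn(5, 32)
c, alpha = attention(q, keys, values, dot_score)
print(c.shape, alpha)
```
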
Attention variants

Different scoring options

Dimensionalities: q ∈ R^{D1} and k ∈ R^{D2}

• Dot product attention:

      a(q, k) = q⊤ k

  Requires dimensionalities D1 = D2.

• Scaled dot product attention:

      a(q, k) = q⊤ k / √D1

  Requires dimensionalities D1 = D2.

• Multiplicative attention:

      a(q, k) = q⊤ W k,   where W ∈ R^{D1×D2}

• Additive attention:

      a(q, k) = MLP(q, k) = w⊤ tanh(Wq q + Wk k)

  with w ∈ R^{D3}, Wq ∈ R^{D3×D1} and Wk ∈ R^{D3×D2}
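
A sketch of these four scoring options (the dimensionalities and tensors are toy assumptions; each function returns one score per key):

```python
import math
import torch

D1, D2, D3, N = 32, 24, 16, 5            # toy dimensionalities (assumptions)
q = torch.randn(D1)
keys = torch.randn(N, D2)
keys_same = torch.randn(N, D1)           # for the (scaled) dot product, D1 = D2

dot = lambda q, k: k @ q                                   # q^T k
scaled_dot = lambda q, k: (k @ q) / math.sqrt(D1)          # q^T k / sqrt(D1)

W = torch.randn(D1, D2)
multiplicative = lambda q, k: k @ W.T @ q                  # q^T W k

W_q, W_k, w = torch.randn(D3, D1), torch.randn(D3, D2), torch.randn(D3)
additive = lambda q, k: torch.tanh(q @ W_q.T + k @ W_k.T) @ w

print(dot(q, keys_same).shape, multiplicative(q, keys).shape, additive(q, keys).shape)
```
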

Different ways to use the context vector ct in decoding


There are also different approaches to how and whether ct gets passed
on to the next hidden representation st+1 in the decoder. See (Voita,
2022) for details on two possible schemes.

Common misconceptions

Attention weights are not like model weights

Students sometimes think of attention weights as trainable model parameters. But the attention scores a and weights α are not learned and then fixed during training. They are based on a comparison between vectors, which will be different for different inputs. The vectors that are compared will themselves depend on the parameters, but the operation that calculates a and α from them is deterministic.

E.g. in multiplicative attention we have:

    a(q, k) = q⊤ W k

The W is a trainable parameter matrix, and q and k are vectors that will depend on other model parameters, but the resulting attention score a is not a parameter that is stored during training.

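A small illustration of this point: W is a fixed (trained) parameter, but the scores and weights are recomputed for every new input rather than stored.

```python
import torch

D1, D2, N = 8, 8, 4
W = torch.randn(D1, D2)                    # trainable parameter (kept fixed here)

def attention_weights(q, keys):
    scores = keys @ W.T @ q                # a(q, k_n) = q^T W k_n: computed, not stored
    return torch.softmax(scores, dim=0)

keys = torch.randn(N, D2)
print(attention_weights(torch.randn(D1), keys))   # one input ...
print(attention_weights(torch.randn(D1), keys))   # ... different input, different weights
```
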
Videos covered in this note
• A basic encoder-decoder model for machine translation (13 min)
• Training and loss for encoder-decoder models (10 min)
• Encoder-decoder models in general (18 min)
• Greedy decoding (5 min)
• Beam search (18 min)
• Basic attention (22 min)
• Attention - More general (13 min)
• Evaluating machine translation with BLEU (23 min)

Acknowledgements

This note relied very very heavily on content from:

• Chris Manning's CS224N course at Stanford University

• The D2L textbook from Zhang et al. (2021)

References

D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," in ICLR, 2015.

C. Manning, "CS224N: Machine translation, sequence-to-sequence and attention," Stanford University, 2022.

K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, "BLEU: A method for automatic evaluation of machine translation," in ACL, 2002.

A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into Deep Learning, 2021.
