CS224N/Ling284
Lecture 8:
Machine Translation,
Sequence-to-sequence and Attention
Abigail See, Matthew Lamm
Announcements
• We are taking attendance today
• Sign in with the TAs outside the auditorium
• No need to get up now – there will be plenty of time to sign in after
the lecture ends
• For attendance policy special cases, see Piazza post for
clarification
Overview
Today we will:
1. Introduce a new task: Machine Translation
2. Introduce a new neural architecture: sequence-to-sequence (Machine Translation is its major use case)
3. Introduce a new neural technique: attention (which improves sequence-to-sequence)
Section 1: Pre-Neural Machine Translation
Machine Translation
Machine Translation (MT) is the task of translating a
sentence x from one language (the source language) to a
sentence y in another language (the target language).
x: L'homme est né libre, et partout il est dans les fers
- Rousseau
y: Man is born free, but everywhere he is in chains
1950s: Early Machine Translation
Machine Translation research
began in the early 1950s.
• Russian → English
(motivated by the Cold War!)
1 minute video showing 1954 MT:
https://youtu.be/K-HfpsHPmvw
1990s-2010s: Statistical Machine Translation
• Core idea: Learn a probabilistic model from data
• Suppose we’re translating French → English.
• We want to find the best English sentence y, given the French sentence x (see the formulation sketched below)
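In equations (a reconstruction of the standard SMT formulation, not copied verbatim from the slide):

\arg\max_y P(y \mid x) \;=\; \arg\max_y P(x \mid y)\, P(y)

where P(x \mid y) is the translation model (how words and phrases translate, learned from parallel data) and P(y) is the language model (how to write fluent English, learned from monolingual data).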
Learning alignment for SMT
• Question: How to learn the translation model from the parallel corpus?
What is alignment?
Alignment is the correspondence between particular words in the translated sentence pair.
Examples from: "The Mathematics of Statistical Machine Translation: Parameter Estimation", Brown et al., 1993. http://www.aclweb.org/anthology/J93-2003
Alignment is complex
Alignment can be many-to-one
Alignment is complex
Alignment can be one-to-many
Alignment is complex
Some words are very fertile!
he hit me with a pie
il a m' entarté
[Figure: word-alignment grid; the single French word "entarté" aligns to several English words ("hit ... with a pie"), so it has no single-word English equivalent]
Alignment is complex
Alignment can be many-to-many (phrase-level)
Learning alignment for SMT
• We learn P(x, a | y) as a combination of many factors, including:
• Probability of particular words aligning (which also depends on position in the sentence)
• Probability of particular words having particular fertility
(number of corresponding words)
• etc.
• Alignments a are latent variables: they aren't explicitly specified in the data!
• This requires special learning algorithms (like Expectation-Maximization) for learning the parameters of distributions with latent variables (covered in CS 228); see the sketch below
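As a sketch of the standard formulation (my notation): the translation model is obtained by marginalizing over the latent alignments,

P(x \mid y) \;=\; \sum_{a} P(x, a \mid y)

and the parameters of P(x, a \mid y) are fit with Expectation-Maximization, alternating between computing expected alignments under the current parameters (E-step) and re-estimating the parameters from those expected alignments (M-step).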
Decoding for SMT
• Question: How to compute this argmax over y of P(x | y) (the Translation Model) times P(y) (the Language Model)?
Viterbi: Decoding with Dynamic Programming
• Impose strong independence assumptions in the model, then find the highest-scoring sequence with dynamic programming (the Viterbi algorithm).
Source: "Speech and Language Processing", Chapter A, Jurafsky and Martin, 2019.
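A minimal generic Viterbi sketch in Python, for an HMM-style model with hypothetical log-probability tables start_p, trans_p and emit_p (this illustrates the dynamic-programming idea only; real SMT decoders are far more involved):

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the highest-log-probability state sequence for the observations."""
    # V[t][s] = best log-prob of any state path that ends in state s at time t
    V = [{s: start_p[s] + emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            # Best predecessor state for s at time t
            prev = max(states, key=lambda p: V[t - 1][p] + trans_p[p][s])
            V[t][s] = V[t - 1][prev] + trans_p[prev][s] + emit_p[s][obs[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: V[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))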
1990s-2010s: Statistical Machine Translation
• SMT was a huge research field
• The best systems were extremely complex
• Hundreds of important details we haven’t mentioned
here
• Systems had many separately-designed subcomponents
• Lots of feature engineering
• Need to design features to capture particular language
phenomena
• Require compiling and maintaining extra resources
• Like tables of equivalent phrases
• Lots of human effort to maintain
• Repeated effort for each language pair!
Section 2: Neural Machine Translation
What is Neural Machine Translation?
• Neural Machine Translation (NMT) is a way to do Machine
Translation with a single neural network
Neural Machine Translation (NMT)
The sequence-to-sequence model
[Figure: the sequence-to-sequence model. An Encoder RNN reads the source sentence "il a m' entarté"; its final encoding of the source sentence provides the initial hidden state for the Decoder RNN, which generates the target sentence (output) "he hit me with a pie <END>" by taking the argmax at each step]
Neural Machine Translation (NMT)
• The sequence-to-sequence model is an example of a
Conditional Language Model.
• Language Model because the decoder is predicting the
next word of the target sentence y
• Conditional because its predictions are also conditioned on the
source sentence x
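In equations, seq2seq directly models (the standard conditional language-model factorization):

P(y \mid x) \;=\; P(y_1 \mid x)\, P(y_2 \mid y_1, x)\, \cdots\, P(y_T \mid y_1, \dots, y_{T-1}, x)

i.e. the probability of the next target word, given the target words so far and the source sentence x.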
Greedy decoding
[Figure: the decoder generates the target sentence by taking the argmax at each step and feeding that word back in as the next input: <START> → he → hit → me → with → a → pie]
• This is greedy decoding (take most probable word on each
step)
• Problems with this method?
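A minimal sketch of greedy decoding in Python, assuming a hypothetical decoder_step(hidden, prev_token) that returns a list of log-probabilities over the vocabulary plus the next hidden state (these names are illustrative, not from the assignment code):

def greedy_decode(encoder_final_hidden, start_id, end_id, max_len=50):
    hidden, prev = encoder_final_hidden, start_id
    output = []
    for _ in range(max_len):
        log_probs, hidden = decoder_step(hidden, prev)
        prev = max(range(len(log_probs)), key=lambda w: log_probs[w])  # argmax
        if prev == end_id:
            break
        output.append(prev)
    return output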
Problems with greedy decoding
• Greedy decoding has no way to undo decisions!
• Input: il a m’entarté (he hit me with a pie)
• → he ____
• → he hit ____
• → he hit a ____ (whoops! no going back now…)
Exhaustive search decoding
• Ideally we want to find a (length T) translation y that maximizes
P(y | x) = \prod_{t=1}^{T} P(y_t \mid y_1, \dots, y_{t-1}, x)
• We could try enumerating every possible sequence y, but that means tracking O(V^T) partial translations (where V is the vocabulary size), which is far too expensive!
Beam search decoding
• Core idea: On each step of decoder, keep track of the k most
probable partial translations (which we call hypotheses)
• k is the beam size (in practice around 5 to 10)
[Figure: starting from <START>, calculate the probability distribution of the next word]
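A hypothesis y_1, …, y_t is scored by its cumulative log-probability under the model (the standard scoring):

\mathrm{score}(y_1, \dots, y_t) \;=\; \log P(y_1, \dots, y_t \mid x) \;=\; \sum_{i=1}^{t} \log P(y_i \mid y_1, \dots, y_{i-1}, x)

Scores are all negative, and higher is better. Beam search is not guaranteed to find the optimal solution, but it is far more efficient than exhaustive search.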
Beam search decoding: example
Beam size = k = 2. Blue numbers = the score of each hypothesis (its cumulative log-probability).
[Figure: a search tree expanded step by step from <START>, through partial hypotheses such as he (-0.7), I (-0.9), he hit (-1.7), I was (-1.6), hit me (-2.5), me with (-3.3), with a (-3.7), and so on. On each step, for each of the k hypotheses we find the top k next words and calculate their scores; of these k² hypotheses, we keep only the k with the highest scores. The highest-scoring completed hypothesis is "he hit me with a pie" (score -4.3), and tracing back through the tree recovers the full translation]
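A minimal beam-search sketch in Python, reusing the hypothetical decoder_step interface from the greedy-decoding sketch above (illustrative only, not the assignment's implementation):

def beam_search(encoder_final_hidden, start_id, end_id, k=5, max_len=50):
    # Each hypothesis is (score, token list, decoder hidden state)
    beams = [(0.0, [start_id], encoder_final_hidden)]
    completed = []
    for _ in range(max_len):
        candidates = []
        for score, tokens, hidden in beams:
            log_probs, new_hidden = decoder_step(hidden, tokens[-1])
            # Expand this hypothesis with its top-k next words
            top = sorted(range(len(log_probs)), key=lambda w: log_probs[w], reverse=True)[:k]
            for w in top:
                candidates.append((score + log_probs[w], tokens + [w], new_hidden))
        # Of these (up to) k^2 candidates, keep only the k highest-scoring
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = []
        for cand in candidates:
            if cand[1][-1] == end_id:
                completed.append(cand)        # hypothesis produced <END>: it is complete
            elif len(beams) < k:
                beams.append(cand)
        if not beams:
            break
    pool = completed if completed else beams
    # Pick the best hypothesis, normalizing scores by length (see below)
    best = max(pool, key=lambda c: c[0] / (len(c[1]) - 1))
    return best[1]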
Beam search decoding: finishing up
• We have our list of completed hypotheses.
• How to select top one with highest score?
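One issue: longer hypotheses accumulate more negative log-probability terms, so comparing raw scores unfairly favors short translations. The usual fix (a standard recipe) is to normalize each hypothesis's score by its length:

\frac{1}{t} \sum_{i=1}^{t} \log P(y_i \mid y_1, \dots, y_{i-1}, x)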
Advantages of NMT
Compared to SMT, NMT has many advantages:
• Better performance
• More fluent
• Better use of context
• Better use of phrase similarities
Disadvantages of NMT?
Compared to SMT:
• NMT is less interpretable: it is hard to debug
• NMT is difficult to control: we can't easily specify rules or guidelines for translation
How do we evaluate Machine Translation?
BLEU (Bilingual Evaluation Understudy). You'll see BLEU in detail in Assignment 4!
• BLEU compares the machine-written translation to one or
several human-written translation(s), and computes a
similarity score based on:
• n-gram precision (usually for 1, 2, 3 and 4-grams)
• Plus a penalty for too-short system translations
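As a sketch of the standard BLEU definition from Papineni et al. (see the source below for full details): with modified n-gram precisions p_n, candidate length c and reference length r,

\mathrm{BLEU} \;=\; \mathrm{BP} \cdot \exp\!\Big(\sum_{n=1}^{4} \tfrac{1}{4} \log p_n\Big), \qquad \mathrm{BP} \;=\; \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}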
Source: "BLEU: a Method for Automatic Evaluation of Machine Translation", Papineni et al., 2002. http://aclweb.org/anthology/P02-1040
MT progress over time
[Bar chart: Edinburgh En-De WMT newstest2013 cased BLEU by year, 2013–2016 (y-axis ticks at 0, 6.8, 13.5 and 20.3 BLEU); NMT 2015 system from U. Montréal]
Source: http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf
NMT: the biggest success story of NLP Deep Learning
• This is amazing!
• SMT systems, built by hundreds of engineers over many
years, outperformed by NMT systems trained by a
handful of engineers in a few months
So is Machine Translation solved?
• Nope!
• Many difficulties remain:
• Out-of-vocabulary words
• Domain mismatch between train and test data
• Maintaining context over longer text
• Low-resource language pairs
So is Machine Translation solved?
• Nope!
• NMT picks up biases in training data
Source: https://hackernoon.com/bias-sexist-or-this-is-the-way-it-should-be-ce1f7c8c683c
So is Machine Translation solved?
• Nope!
• Uninterpretable systems do strange things
Section 3: Attention
Sequence-to-sequence: the bottleneck problem
[Figure: the Encoder RNN reads the source sentence "il a m' entarté", and its final hidden state, the encoding of the source sentence, is all that is passed to the Decoder RNN, which must generate the entire target sentence (output) "he hit me with a pie <END>" from it]
• This single encoding needs to capture all information about the source sentence. Information bottleneck!
Attention
• Attention provides a solution to the bottleneck problem
• Core idea: on each step of the decoder, use a direct connection to the encoder to focus on a particular part of the source sequence
Sequence-to-sequence with attention
[Figures: the model step by step. On each decoder step, dot products between the Decoder RNN hidden state and each Encoder RNN hidden state (over the source "il a m' entarté") give attention scores, which are used together with the decoder state to generate the translation word by word]
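In equations (the standard formulation of dot-product attention for seq2seq; the notation is the usual one rather than a verbatim copy of the figures):
• Encoder hidden states h_1, …, h_N \in \mathbb{R}^h
• On decoder timestep t, decoder hidden state s_t \in \mathbb{R}^h
• Attention scores: e^t = [s_t^\top h_1, \dots, s_t^\top h_N] \in \mathbb{R}^N
• Attention distribution: \alpha^t = \mathrm{softmax}(e^t) \in \mathbb{R}^N
• Attention output (a weighted sum of the encoder states): a_t = \sum_{i=1}^{N} \alpha_i^t\, h_i \in \mathbb{R}^h
• Finally, concatenate [a_t; s_t] and proceed as in the non-attention seq2seq model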
Attention is great
• Attention significantly improves NMT performance
• It’s very useful to allow decoder to focus on certain parts of the
source
• Attention solves the bottleneck problem
• Attention allows decoder to look directly at source; bypass bottleneck
• Attention helps with vanishing gradient problem
• Provides shortcut to faraway states
• Attention provides some interpretability
• By inspecting the attention distribution, we can see what the decoder was focusing on
[Figure: attention weights aligning the source words il / a / m' / entarté with the target words he hit me with a pie]
• This is cool because we never explicitly trained an alignment system
• The network just learned alignment by itself
Attention is a general Deep Learning technique
• We’ve seen that attention is a great way to improve the
sequence-to-sequence model for Machine Translation.
• However: You can use attention in many architectures
(not just seq2seq) and many tasks (not just MT)
• More generally: given a set of vector values and a vector query, attention is a technique to compute a weighted sum of the values, dependent on the query
Intuition:
• The weighted sum is a selective summary of the
information contained in the values, where the query
determines which values to focus on.
• Attention is a way to obtain a fixed-size representation
of an arbitrary set of representations (the values),
dependent on some other representation (the query).
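A minimal numpy sketch of this general values/query view, using basic dot-product scoring (an illustration, not code from the course):

import numpy as np

def attention(query, values):
    # query: shape (d,); values: shape (N, d), one row per value vector
    scores = values @ query                   # (N,) attention scores
    weights = np.exp(scores - scores.max())   # softmax -> attention distribution
    weights /= weights.sum()
    return weights @ values                   # (d,) selective summary of the values

In seq2seq with attention, the values would be the encoder hidden states and the query the current decoder hidden state.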
There are several attention variants
• We have some values h_1, …, h_N \in \mathbb{R}^{d_1} and a query s \in \mathbb{R}^{d_2}
• Multiplicative attention: e_i = s^\top W h_i, where W \in \mathbb{R}^{d_2 \times d_1} is a weight matrix
• Additive attention: e_i = v^\top \tanh(W_1 h_i + W_2 s), where W_1 \in \mathbb{R}^{d_3 \times d_1} and W_2 \in \mathbb{R}^{d_3 \times d_2} are weight matrices and v \in \mathbb{R}^{d_3} is a weight vector
• d_3 (the attention dimensionality) is a hyperparameter
More information:
"Deep Learning for NLP Best Practices", Ruder, 2017. http://ruder.io/deep-learning-nlp-best-practices/index.html#attention
"Massive Exploration of Neural Machine Translation Architectures", Britz et al., 2017. https://arxiv.org/pdf/1703.03906.pdf
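The two variants above, as a small numpy sketch (shapes follow the definitions on this slide; the function names are mine):

import numpy as np

def multiplicative_scores(s, H, W):
    # s: (d2,) query; H: (N, d1) values; W: (d2, d1) weight matrix
    return (H @ W.T) @ s                      # (N,) scores e_i = s^T W h_i

def additive_scores(s, H, W1, W2, v):
    # W1: (d3, d1), W2: (d3, d2), v: (d3,)
    return np.tanh(H @ W1.T + s @ W2.T) @ v   # (N,) scores e_i = v^T tanh(W1 h_i + W2 s)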
Summary of today’s lecture
• We learned some history of Machine Translation (MT)
• Sequence-to-sequence is the architecture for NMT (uses 2 RNNs)
• Attention is a way to focus on particular parts of the input, and it improves sequence-to-sequence a lot